string concatenation vs join

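While reading an article about bytearray, I came across the following passage and decided to check it for myself.
It says that building a string with join performs better than building it by concatenation.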

http://dabeaz.blogspot.kr/2010/01/few-useful-bytearray-tricks.html

The only problem with this code is that concatenation (+=) has horrible performance. Therefore, a common performance optimization in Python 2 is to collect all of the chunks in a list and perform a join when you’re done. Like this:

# remaining = number of bytes being received (determined already)
msgparts = []
while remaining > 0:
    chunk = s.recv(remaining)    # Get available data
    msgparts.append(chunk)       # Add it to list of chunks
    remaining -= len(chunk)  
msg = b"".join(msgparts)          # Make the final message
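
A rough way to check the quoted claim is to time both strategies. This is a minimal sketch and not a benchmark from the quoted article: the helper names build_with_concat and build_with_join, the chunk size and the chunk count are all assumptions chosen for illustration. It uses bytes, as in the socket example above. Timings depend on the machine and the interpreter version, so no numbers are shown here.

```python
import timeit

CHUNK = b"x" * 64      # hypothetical chunk, stands in for data returned by s.recv()
COUNT = 2_000          # hypothetical number of chunks


def build_with_concat(chunk=CHUNK, count=COUNT):
    """Build the message by repeated concatenation (+=)."""
    msg = b""
    for _ in range(count):
        msg += chunk          # in CPython this builds a new bytes object each time
    return msg


def build_with_join(chunk=CHUNK, count=COUNT):
    """Collect the chunks in a list and join once at the end."""
    parts = []
    for _ in range(count):
        parts.append(chunk)
    return b"".join(parts)


if __name__ == "__main__":
    # Both strategies must produce the same bytes.
    assert build_with_concat() == build_with_join()
    for fn in (build_with_concat, build_with_join):
        seconds = timeit.timeit(fn, number=20)
        print(f"{fn.__name__}: {seconds:.4f} s for 20 runs")
```

Raising COUNT should make the gap between the two more visible, since the work done by the concatenation loop grows faster than the size of the final message.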

https://github.com/python/cpython/blob/master/Objects/unicodeobject.c

Concat

For every element, a new object is created each time and the contents of both operands are copied into it.

PyObject *
PyUnicode_Concat(PyObject *left, PyObject *right)
{
    ...

    left_len = PyUnicode_GET_LENGTH(left);
    right_len = PyUnicode_GET_LENGTH(right);

    new_len = left_len + right_len;

    maxchar = PyUnicode_MAX_CHAR_VALUE(left);
    maxchar2 = PyUnicode_MAX_CHAR_VALUE(right);
    maxchar = Py_MAX(maxchar, maxchar2);

    /* Concat the two Unicode strings */
    result = PyUnicode_New(new_len, maxchar);

    _PyUnicode_FastCopyCharacters(result, 0, left, 0, left_len);
    _PyUnicode_FastCopyCharacters(result, left_len, right, 0, right_len);
    assert(_PyUnicode_CheckConsistency(result, 1));
    return result;
}
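
The effect of calling this function in a loop can be modeled in Python. The following is only an illustrative model, not CPython's code: concat_model and copied_by_repeated_concat are hypothetical names, and a bytearray stands in for the freshly allocated result object. It counts how many bytes get copied when chunks are appended one at a time.

```python
def concat_model(left, right):
    """Mimic the shape of PyUnicode_Concat: allocate a new buffer, copy both operands.

    Returns the new object and the number of bytes copied into it.
    """
    result = bytearray(len(left) + len(right))   # like PyUnicode_New(new_len, ...)
    result[:len(left)] = left                    # copy the left operand
    result[len(left):] = right                   # copy the right operand
    return bytes(result), len(left) + len(right)


def copied_by_repeated_concat(chunks):
    """Concatenate chunks one by one and total the bytes copied along the way."""
    msg, total = b"", 0
    for chunk in chunks:
        msg, copied = concat_model(msg, chunk)
        total += copied
    return msg, total


msg, total = copied_by_repeated_concat([b"ab"] * 4)
print(msg, total)   # b'abababab' 20
```

Four 2-byte chunks give an 8-byte message, yet the model copies 2 + 4 + 6 + 8 = 20 bytes, because the part accumulated so far is copied again at every step.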

Join

A single result object is created, and every element is copied into it with memcpy.

PyObject *
_PyUnicode_JoinArray(PyObject *separator, PyObject **items, Py_ssize_t seqlen)
{
    ...

    res = PyUnicode_New(sz, maxchar);
    if (res == NULL)
        goto onError;

    /* Catenate everything. */
    if (use_memcpy) {
        for (i = 0; i < seqlen; ++i) {
            Py_ssize_t itemlen;
            item = items[i];

            /* Copy item, and maybe the separator. */
            if (i && seplen != 0) {
                memcpy(res_data,
                       sep_data,
                       kind * seplen);
                res_data += kind * seplen;
            }

            itemlen = PyUnicode_GET_LENGTH(item);
            if (itemlen != 0) {
                memcpy(res_data,
                       PyUnicode_DATA(item),
                       kind * itemlen);
                res_data += kind * itemlen;
            }
        }
        assert(res_data == PyUnicode_1BYTE_DATA(res)
                           + kind * PyUnicode_GET_LENGTH(res));
    }
    ...
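
The same single-allocation idea can be sketched in Python. This is a simplified model rather than a port of _PyUnicode_JoinArray: join_model is a hypothetical name, it works on bytes instead of Unicode objects, and slice assignment into a pre-sized bytearray stands in for memcpy.

```python
def join_model(separator, items):
    """Two-pass join: compute the final size, allocate once, then copy each item once."""
    if not items:
        return b""

    # Pass 1: total size of all items plus the separators between them.
    size = sum(len(item) for item in items) + len(separator) * (len(items) - 1)
    result = bytearray(size)          # single allocation, like PyUnicode_New(sz, ...)

    # Pass 2: copy the separator (except before the first item), then the item.
    pos = 0
    for i, item in enumerate(items):
        if i and separator:
            result[pos:pos + len(separator)] = separator
            pos += len(separator)
        if item:
            result[pos:pos + len(item)] = item
            pos += len(item)

    assert pos == size                # mirrors the assert in the C code above
    return bytes(result)


print(join_model(b", ", [b"a", b"bc", b"d"]))   # b'a, bc, d'
```

Because the final size is known before anything is copied, each input byte is written into the result buffer exactly once.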

Conclusion

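With concatenation, a new object is created for every element of the list and both operands are copied into it each time, so the data accumulated so far is copied again at every step and the whole operation is slow.
join, by contrast, creates only one new object and copies every element into it with memcpy, so it is comparatively fast.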


