Wednesday, December 05, 2012

That grouping by zip/iter trick explained

In Python 2.7 you can get the following neat chunking of items in a list through application of zip and iter:


Python 2.7.3 (default, Apr 10 2012, 23:24:47) [MSC v.1500 64 bit (AMD64)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> items, chunk = [1,2,3,4,5,6,7,8,9], 3
>>> zip(*[iter(items)]*chunk)
[(1, 2, 3), (4, 5, 6), (7, 8, 9)]
>>> 


It is not as neat in Python 3.X as you have to wrap the expression in list(..) to see the final list.

I thought I had to think hard about what was going on here so here is my annottated Python 3.x session that explains what is going on:


Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> from pprint import pprint as pp
>>> items, chunk = [1,2,3,4,5,6,7,8,9], 3
>>> items
[1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> chunk
3
>>> # A quick recap of what an iterator does 
>>> i = iter(items)
>>> i.__next__()
1
>>> i.__next__()
2
>>> i.__next__()
3
>>> i.__next__()
4
>>> i.__next__()
5
>>> i.__next__()
6
>>> i.__next__()
7
>>> i.__next__()
8
>>> i.__next__()
9
>>> i.__next__()
Traceback (most recent call last):
  File "", line 1, in 
    i.__next__()
StopIteration
>>>
>>> # Three copies of the same iterator
>>> threei = [iter(items)] * chunk
>>> pp(threei)
[object at 0x0000000003259D30>,
 object at 0x0000000003259D30>,
 object at 0x0000000003259D30>]
>>> # (notice that the addresses above are the same)
>>>
>>> # Lets break-out the three copies
>>> i, j, k = threei
>>> i
object at 0x0000000003259D30>
>>> j
object at 0x0000000003259D30>
>>> k
object at 0x0000000003259D30>
>>> # i,j and k are the same iterator
>>> i.__next__()
1
>>> j.__next__()
2
>>> k.__next__()
3
>>> # ...
>>>
>>> # Lets try that again with zip:
>>> i, j, k = threei = [iter(items)] * chunk
>>> list(zip(i, j, k))
[(1, 2, 3), (4, 5, 6), (7, 8, 9)]
>>> # zip pulls from i, j, k once to form its first tuple of the list
>>> # then pulls from i, j, k again to form its second tuple of the list
>>> # and so on ...
>>> # Because i,j, and k are names for the same iterator we get the
>>> # given grouping action as output.
>>>
>>> # We could use *threei instead of i,j,k as the arguments to zip:
>>> i, j, k = threei = [iter(items)] * chunk
>>> list(zip(*threei))
[(1, 2, 3), (4, 5, 6), (7, 8, 9)]
>>>
>>> # Or we could use the expression generating threei:
>>> i, j, k = threei = [iter(items)] * chunk
>>> list(zip(*[iter(items)] * chunk))
[(1, 2, 3), (4, 5, 6), (7, 8, 9)]
>>> # But note that we are relying on the unary '*' operator
>>> # binding less strongly than binary '*'
>>>


For me it remains a curiosity as I doubt I will have a use for its functionality. (List-size must be a multiple of chunk-size; its functionality is obscure).




12 comments:

  1. You could do the same thing with the numpy.reshape function:

    items = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    chunk = 3
    numpy.reshape(items, (len(items)/chunk, chunk))

    ReplyDelete
    Replies
    1. This isn't quite the same thing though, reshape makes a copy, what this is really doing is more like:

      itemIter = iter(items)
      while itemIter:
      ..x = itemIter.next()
      ..y = itemIter.next()
      ..z = itemIter.next()


      in the case of [iter(items)]*3 we are just copying the reference to the iterator so this has a much better memory footprint. I think that this is quite elegant actually.

      Note that the list doesn't really have to be a multiple of 3 as zip will halt when it can't grab the next element from all zip'd iterators. I.e. if the list is of size 5 only the first group of three is returned.

      Delete
  2. As leonardo implies by his chunk size of 3 there are many datasets where this is quite useful. Take 3D coordinates for example:

    coords = [1,2,3,4,5,6,7,8,9]
    for x,y,z in zip(*[iter(coords)]*3):
    print x,y,z

    1 2 3
    4 5 6
    7 8 9

    ReplyDelete
  3. You can also make make this a generator (consuming an iterator instead of a list) by using the itertools module:

    from itertools import izip_longest

    def grouper(n, iterable, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

    This also handles the case where the iterable isn't a multiple of n. There are a number of recipes in the itertools module that will really stretch your mind. :)

    ReplyDelete
  4. Hi Brian, Naveen, Leonardo; I still think that this is no more than a trick, or at best it may become a design pattern but if it is of use then it should be replaced by an itertools function that has the functionality without having to worry about the awful code within it.It is mech worse than, say, when we had to use dict keys as sets; or the explicit decorate-sort-undecorate pattern.

    ReplyDelete
  5. I enjoyed reading about this way of using iterators with zip. Thanks!

    I thought it would be worth mentioning that for practical use, there are other worthy solutions to the chunking problem. There are two downsides to the iter/zip approach. First, it's hard to understand, and second, if the number of items is not a multiple of the chunk size, not all items appear in the output. (The latter will be fine in some situations, and undesirable in others.)

    This solution is significantly easier to understand, and outputs every input item. When the items are not multiple of the chunk size, the output groups are as similar in size as possible:

    >>> seq = [1,2,3,4,5,6,7,8,9,10]
    >>> [seq[i::num] for i in range(num)]
    [[1, 5, 9], [2, 6, 10], [3, 7], [4, 8]]

    There's a major discussion of this on my blog at http://www.garyrobinson.net/2008/04/splitting-a-pyt.html. I'm going to update that post to also mention the zip/iter solution.

    ReplyDelete
    Replies
    1. I forget to set one of the variables in my example:

      >>> num = 3

      Delete
    2. Thanks Gary. I guess I should have added a more maintainable alternative myself, but you've done the job.

      Delete
  6. Lol that's an absolutely retarded way of doing it.

    >>> items, chunk = [1,2,3,4,5,6,7,8,9], 3
    >>> items, chunk = [1,2,3,4,5,6,7,8,9,10], 3
    >>> zip(*[iter(items)]*chunk)
    [(1, 2, 3), (4, 5, 6), (7, 8, 9)]

    You should be using this instead;

    >>> split_list = lambda l, n: [l[x: x+n] for x in xrange(0, len(l), n)]
    >>> split_list(items, chunk)
    [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10]]
    >>>

    ReplyDelete
    Replies
    1. No. Your LOL's come easy it seems. As does your use of retarded.

      The object of the post is to explain why it works. If it is obvious then there is no need for an explanation.

      Delete
    2. Of course, kudos for the in-depth analysis.. but maybe a more appropriate title would be "That grouping by zip/iter trick.. and why you shouldn't use it".. I only just saw the part at the bottom of your post saying "this is useless due to equal sizes".

      Delete
    3. Hi Cal, I did think about the title and wanted to encourage more to read it so made sure the word 'trick' was in there as I thought it would attract more readers. I left the "don't use me" to the end for a similar reason, I thought it might be a barrier to potential readers.

      Your suggestion _is_ more informative, but is it a better _title_ for a blog entry for a (little) vain person who does want some readership?

      :-)

      Delete