it.", by Phillip Smith, in which the author wishes Perl

solutions were credited as much as solutions in other languages such as

Python and Ruby.

I agreed that Perl wasn't mentioned as much as it could be, and a day or so later I came across an example of why...

I monitor what's happening in Python, and came across a new recipe added to ActiveState Code called "Recipe 576961: Technique for cyclical iteration", by Raymond Hettinger. On first, second, and third readings, I just couldn't understand the need for the deferred_output function; the whole recipe remained opaque.

Raymond's code:

```python
from itertools import tee, chain, islice
from heapq import merge

def hamming_numbers():
    # Generate "5-smooth" numbers, also called "Hamming numbers"
    # or "Regular numbers".  See: http://en.wikipedia.org/wiki/Regular_number
    # Finds solutions to  2**i * 3**j * 5**k  for some integers i, j, and k.

    def no_repeats(it):
        last = None
        for i in it:
            if i is not last:
                yield i
                last = i

    def deferred_output():
        for i in output:
            yield i

    result, p2, p3, p5 = tee(deferred_output(), 4)
    m2 = (2*x for x in p2)          # multiples of 2
    m3 = (3*x for x in p3)          # multiples of 3
    m5 = (5*x for x in p5)          # multiples of 5
    merged = merge(m2, m3, m5)
    combined = chain([1], merged)   # prepend a starting point
    output = no_repeats(combined)   # eliminate duplicates

    return result

if __name__ == '__main__':
    print(list(islice(hamming_numbers(), 1000)))
```

I was hooked by the series, however, and so went looking for more information.

### Perl

I came across a mention of the following page: "Infinite lists in Perl" by Mark Dominus, which is copyright 1997 by The Perl Journal. Normally I wouldn't mention the article, as it spends more time creating an implementation of generators in Perl than it does on addressing the computation of Hamming numbers. It also uses the term "stream" instead of "generator", but maybe that is down to the age of the article, i.e. 1997?

### Haskell

Looking a bit further, I found this Dr. Dobb's article, and this commented Haskell solution:

```haskell
hamming = 1 : map (2*) hamming `merge` map (3*) hamming `merge` map (5*) hamming
  where merge (x:xs) (y:ys)
          | x < y     = x : xs `merge` (y:ys)
          | x > y     = y : (x:xs) `merge` ys
          | otherwise = x : xs `merge` ys
```

The Haskell solution led me to think that maybe Python has an issue with defining recursive generators, and some experimentation led me to believe that yes, maybe that was why Raymond had the deferred_output function.
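As a quick check (my own sketch, not from the recipe), a direct transcription of the self-referential Haskell definition fails in Python, because the outermost iterable of a generator expression is evaluated immediately, before the name being defined exists:

```python
from itertools import chain
from heapq import merge

# A direct, self-referential transcription of the Haskell definition.
# The outermost "for x in hamming" of each generator expression is
# evaluated immediately, while the name "hamming" is still unbound.
try:
    hamming = chain([1], merge((2 * x for x in hamming),
                               (3 * x for x in hamming),
                               (5 * x for x in hamming)))
except NameError as e:
    print(e)  # name 'hamming' is not defined
```

So some indirection is needed before a generated stream can feed back into itself.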

### My Python

Since I had trouble understanding Raymond's code, I went away and wrote my own solution. It works, and it can generate very large values of the sequence quite quickly, but it does need memory to store multiples:

```python
from itertools import islice
from pprint import pprint as pp

def regularnumbergen(start=1, multipliers=(2, 3, 5)):
    '''\
    Generates the Regular Number sequence defined here:
        http://en.wikipedia.org/wiki/Regular_number
    The sequence starts with 1 and consists of all multiples by 2, 3, or 5
    of any number in the sequence in non-repeating, increasing order.
    '''
    multiples = [[] for i in multipliers]
    this = start
    while True:
        yield this
        # get multiples
        for multiple, multiplier in zip(multiples, multipliers):
            multiple.append(this * multiplier)
        # get next
        this = min(m[0] for m in multiples)
        # remove this from multiples
        for multiple in multiples:
            if multiple[0] == this:
                del multiple[0]

if __name__ == '__main__':
    print(list(islice(regularnumbergen(), 99999, 100001)))
```

Generating the following large values of the sequence took only seconds:

```python
>>> print(list(islice(regularnumbergen(), 99999, 100001)))
[290142196707511001929482240000000000000, 290237644800000000000000000000000000000]
```

### Wrapup

Well, I've learnt a bit more about generators, the limitations in their Python implementation w.r.t. Haskell, and I've put more meat on why I tend to not quote much Perl. In this case, I couldn't see the wood for the trees! (But I do quote Perl.)

Squirrelled away in the Python test suite file test_generators.py is a much better example of Hamming number generators than I could achieve, one that uses "lazy lists":

```python
'''
Another famous problem: generate all integers of the form
    2**i * 3**j * 5**k
in increasing order, where i,j,k >= 0.  Trickier than it may look at first!
Try writing it without generators, and correctly, and without generating
3 internal results for each result output.

>>> def times(n, g):
...     for i in g:
...         yield n * i
>>> firstn(times(10, intsfrom(1)), 10)
[10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

>>> def merge(g, h):
...     ng = g.next()
...     nh = h.next()
...     while 1:
...         if ng < nh:
...             yield ng
...             ng = g.next()
...         elif ng > nh:
...             yield nh
...             nh = h.next()
...         else:
...             yield ng
...             ng = g.next()
...             nh = h.next()

The following works, but is doing a whale of a lot of redundant work --
it's not clear how to get the internal uses of m235 to share a single
generator.  Note that me_times2 (etc) each need to see every element in the
result sequence.  So this is an example where lazy lists are more natural
(you can look at the head of a lazy list any number of times).

>>> def m235():
...     yield 1
...     me_times2 = times(2, m235())
...     me_times3 = times(3, m235())
...     me_times5 = times(5, m235())
...     for i in merge(merge(me_times2,
...                          me_times3),
...                    me_times5):
...         yield i

Don't print "too many" of these -- the implementation above is extremely
inefficient: each call of m235() leads to 3 recursive calls, and in
turn each of those 3 more, and so on, and so on, until we've descended
enough levels to satisfy the print stmts.  Very odd:  when I printed 5
lines of results below, this managed to screw up Win98's malloc in "the
usual" way, i.e. the heap grew over 4Mb so Win98 started fragmenting
address space, and it *looked* like a very slow leak.

>>> result = m235()
>>> for i in range(3):
...     print firstn(result, 15)
[1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 16, 18, 20, 24]
[25, 27, 30, 32, 36, 40, 45, 48, 50, 54, 60, 64, 72, 75, 80]
[81, 90, 96, 100, 108, 120, 125, 128, 135, 144, 150, 160, 162, 180, 192]

Heh.  Here's one way to get a shared list, complete with an excruciating
namespace renaming trick.  The *pretty* part is that the times() and merge()
functions can be reused as-is, because they only assume their stream
arguments are iterable -- a LazyList is the same as a generator to times().

>>> class LazyList:
...     def __init__(self, g):
...         self.sofar = []
...         self.fetch = g.next
...
...     def __getitem__(self, i):
...         sofar, fetch = self.sofar, self.fetch
...         while i >= len(sofar):
...             sofar.append(fetch())
...         return sofar[i]

>>> def m235():
...     yield 1
...     # Gack:  m235 below actually refers to a LazyList.
...     me_times2 = times(2, m235)
...     me_times3 = times(3, m235)
...     me_times5 = times(5, m235)
...     for i in merge(merge(me_times2,
...                          me_times3),
...                    me_times5):
...         yield i

Print as many of these as you like -- *this* implementation is memory-
efficient.

>>> m235 = LazyList(m235())
>>> for i in range(5):
...     print [m235[j] for j in range(15*i, 15*(i+1))]
[1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 16, 18, 20, 24]
[25, 27, 30, 32, 36, 40, 45, 48, 50, 54, 60, 64, 72, 75, 80]
[81, 90, 96, 100, 108, 120, 125, 128, 135, 144, 150, 160, 162, 180, 192]
[200, 216, 225, 240, 243, 250, 256, 270, 288, 300, 320, 324, 360, 375, 384]
[400, 405, 432, 450, 480, 486, 500, 512, 540, 576, 600, 625, 640, 648, 675]
'''
```

Would like to hear your suggestions on how to clarify the recipe's discussion of the deferred_output() technique. The whole point of the recipe was to demonstrate a re-usable skill, the hamming problem itself was not of interest.

The same technique could be applied to generating the Fibonacci series or any other series that feeds back into itself. The Hamming problem is more interesting than the Fibonacci series because the stream needs to be buffered (holding many previously generated values) before being used as inputs to generate a new value.
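[Editorial aside: Raymond's Fibonacci remark can be sketched like this (my own example, not code from the comment); only one tap of the tee needs to lag, so the buffering is minimal:]

```python
from itertools import tee, chain, islice

def fibonacci():
    def deferred_output():
        # "output" is a free variable, bound below before consumption
        for x in output:
            yield x

    result, a, b = tee(deferred_output(), 3)
    # islice lazily skips the first element of b, pairing f(n) with f(n+1)
    sums = (x + y for x, y in zip(a, islice(b, 1, None)))
    output = chain([0, 1], sums)   # seed the stream, then feed it back
    return result

print(list(islice(fibonacci(), 10)))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```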

FWIW, here are a couple of quick comments on the code in regularnumbergen(). First, the "del multiple[0]" step has O(n) running time. Second, the lists of multiples can contain duplicates, taking up unnecessary memory (i.e. 120 is in all three lists: once as the two-multiple of 60, once as the three-multiple of 40, and once as the five-multiple of 24). The original recipe had a single stream with no duplicates, so no memory was wasted.

-- Raymond Hettinger
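[Editorial aside: both of Raymond's points could be addressed with a single min-heap plus a seen-set (a sketch of my own, not from the discussion): heappop is O(log n) rather than the O(n) of del list[0], and the membership test keeps duplicates like 120 out of the buffer entirely:]

```python
import heapq
from itertools import islice

def regularnumbergen_heap(multipliers=(2, 3, 5)):
    heap = [1]    # one min-heap instead of a sorted list per multiplier
    seen = {1}    # values already queued, so 120 is only buffered once
    while True:
        this = heapq.heappop(heap)   # O(log n), vs O(n) for del list[0]
        yield this
        for m in multipliers:
            v = this * m
            if v not in seen:
                seen.add(v)
                heapq.heappush(heap, v)

print(list(islice(regularnumbergen_heap(), 10)))  # [1, 2, 3, 4, 5, 6, 8, 9, 10, 12]
```

(The seen-set itself grows without bound, so this trades the duplicate lists for a different kind of memory cost.)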

First, thanks Raymond for the inspiration (it's turning into good mental perspiration too).

I came across the term lazy-lists used in conjunction with solutions to Hamming in other languages and I could understand that a little better, but you don't seem to be using lazy lists.

I can see what you are doing, but not *why*!

I might be able to accept it as a recipe: generate a stream of results from Python generators connected in a ring by:

1. Deferring what should be the output, before...
2. ...teeing off the ring's output to get a generator of final results.

- Paddy.

P.S. Why defer_output() rather than defer(output) ?

Thanks.

> Why defer_output() rather than defer(output)?

Because the "output" doesn't exist at the time of the call. The whole point of a deferred is to not look for "output" until the generator itself is being consumed (at which point "output" has already been defined). This insight is the key to understanding the recipe and the general purpose technique.

-- Raymond

Looking at your Lazy list code, I think you're close to reproducing my recipe as given.

The "Gack: m235 below actually refers to a LazyList" comment highlights a trick that is *exactly* the same as the deferred_output trick. Both are a way to reference output that doesn't exist yet.

Also, look at what LazyList does. It is almost exactly the same as itertools.tee() except that it uses indices instead of iterators and that it wastes space by keeping the full list in memory (even after the consumers no longer need to access earlier indices). In contrast, tee() is smart enough to forget the earlier entries once they are no longer needed.

The code for "times(2, m235)" is equivalent to the generator expression: (2 * x for x in m235).

Also note that none of your alternatives gets rid of duplicate entries in the output (which is a requirement in the problem statement -- the output needs to be monotonically ascending).

-- Raymond

The redefinition of m235 in your lazy list code is doing exactly the same thing as the deferred_output() in my original code. Both are ways of referring to output that hasn't yet been created.

Also, the lazy list class is very much like itertools.tee() except that the lazy list uses indices instead of iterators and it builds the entire list in memory instead of forgetting entries when they are no longer needed.

IOW, the lazy list code is essentially equivalent to the original.

-- Raymond

I took a look at a fanciful algorithm from the Dr. Dobb's article and made it concrete. I then minimised the memory used (timings at the end):

```python
def hamming():
    '''\
    This version is based on a snippet from:
        http://dobbscodetalk.com/index.php?option=com_content&task=view&id=913&Itemid=85

        When expressed in some imaginary pseudo-C with automatic
        unlimited storage allocation and BIGNUM arithmetics, it can be
        expressed as:
            hamming = h where
              array h;
              n=0; h[0]=1; i=0; j=0; k=0;
              x2=2*h[i]; x3=3*h[j]; x5=5*h[k];
              repeat:
                h[++n] = min(x2,x3,x5);
                if (x2==h[n]) { x2=2*h[++i]; }
                if (x3==h[n]) { x3=3*h[++j]; }
                if (x5==h[n]) { x5=5*h[++k]; }
    '''
    h = 1
    _h = [h]  # memoized
    multipliers = (2, 3, 5)
    multindeces = [0 for i in multipliers]  # index into _h for multipliers
    multvalues = [x * _h[i] for x, i in zip(multipliers, multindeces)]
    yield h
    while True:
        h = min(multvalues)
        _h.append(h)
        for (n, (v, x, i)) in enumerate(zip(multvalues, multipliers, multindeces)):
            if v == h:
                i += 1
                multindeces[n] = i
                multvalues[n] = x * _h[i]
        yield h, len(_h), multindeces

def hamming2():
    '''\
    As hamming(), but reduced memory usage for memoization
    '''
    h = 1
    _h = [h]  # memoized
    multipliers = (2, 3, 5)
    multindeces = [0 for i in multipliers]  # index into _h for multipliers
    multvalues = [x * _h[i] for x, i in zip(multipliers, multindeces)]
    yield h
    while True:
        h = min(multvalues)
        _h.append(h)
        for (n, (v, x, i)) in enumerate(zip(multvalues, multipliers, multindeces)):
            if v == h:
                i += 1
                multindeces[n] = i
                multvalues[n] = x * _h[i]
        # cap the memoization
        mini = min(multindeces)
        if mini >= 1000:
            del _h[:mini]
            multindeces = [i - mini for i in multindeces]
        #
        yield h, len(_h), multindeces
```

Raymond's code is the fastest when computing the millionth term:

```
Time for regularnumbergen  117.180878296
Time for hamming            14.8070693342
Time for hamming2           16.7893559098
Time for raymonds_hamming    5.5855925823
```

- Paddy.

And now there is this recipe from John Crichton that turns Raymond's method into a useful Python decorator.
