Saturday, November 28, 2009

Hamming Numbers (and Perl: Why I just ignored it)

I had read the post "Perl: Love it, or hate it, but don’t ignore it" by Phillip Smith, in which the author wishes Perl solutions were credited as much as solutions in other languages such as Python and Ruby.



I agreed that Perl wasn't being mentioned as much as those other languages, and a day or so later I came across an example of why....



I monitor what's happening in Python, and came across a new recipe on ActiveState Code called "Recipe 576961: Technique for cyclical iteration", by Raymond Hettinger. On first, second, and third readings, I just couldn't understand the need for the deferred_output function; the whole recipe remained opaque.



Raymond's code:



from itertools import tee, chain, islice
from heapq import merge

def hamming_numbers():
    # Generate "5-smooth" numbers, also called "Hamming numbers"
    # or "Regular numbers". See: http://en.wikipedia.org/wiki/Regular_number
    # Finds solutions to 2**i * 3**j * 5**k for some integers i, j, and k.

    def no_repeats(it):
        # The merged stream is sorted, so duplicates arrive adjacent
        last = None
        for i in it:
            if i != last:
                yield i
                last = i

    def deferred_output():
        # Works like a forward reference to the "output" variable below
        for i in output:
            yield i

    result, p2, p3, p5 = tee(deferred_output(), 4)
    m2 = (2*x for x in p2)              # multiples of 2
    m3 = (3*x for x in p3)              # multiples of 3
    m5 = (5*x for x in p5)              # multiples of 5
    merged = merge(m2, m3, m5)
    combined = chain([1], merged)       # prepend a starting point
    output = no_repeats(combined)       # eliminate duplicates

    return result

if __name__ == '__main__':
    print(list(islice(hamming_numbers(), 1000)))    # Show the first 1000 Hamming numbers




I was hooked by the series, however, and so went looking for more information.



Perl


I came across a mention of the page "Infinite lists in Perl" by Mark Dominus, copyright 1997 by The Perl Journal. Normally I wouldn't mention the article, as it spends more time creating an implementation of generators in Perl than it does addressing the computation of Hamming numbers. It also uses the term "stream" instead of "generator", but maybe that is down to the age of the article, i.e. 1997?



Haskell


Looking a bit further, I found this Dr. Dobb's article, and this commented Haskell solution:


hamming = 1 : map (2*) hamming `merge` map (3*) hamming `merge` map (5*) hamming
  where merge (x:xs) (y:ys)
          | x < y     = x : xs `merge` (y:ys)
          | x > y     = y : (x:xs) `merge` ys
          | otherwise = x : xs `merge` ys


The Haskell solution led me to think that maybe Python has an issue with defining recursive generators, and some experimentation led me to believe that yes, maybe that was why Raymond needed the deferred_output function.
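
To test that idea, here is a minimal sketch (my own toy example, not from the recipe) of the same feedback wiring on the simplest self-feeding stream, the powers of two. Calling tee(output, 2) directly would raise NameError because output is not bound yet; routing the reference through a generator defers the name lookup until the first value is requested:

from itertools import chain, islice, tee

def powers_of_two():
    def deferred_output():
        # `output` is looked up lazily, on the first next() call,
        # by which time the assignment below has already run
        for x in output:
            yield x

    result, feedback = tee(deferred_output(), 2)
    output = chain([1], (2 * x for x in feedback))  # s = 1, then 2*x for every x in s
    return result

print(list(islice(powers_of_two(), 8)))  # [1, 2, 4, 8, 16, 32, 64, 128]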



My Python


Since I had trouble understanding Raymond's code, I went away and wrote my own solution. It works, and it can generate very large values of the sequence quite quickly, but it does need memory to store the pending multiples:

from itertools import islice


def regularnumbergen(start=1, multipliers=(2, 3, 5)):
    '''\
    Generates the Regular Number sequence defined here:
        http://en.wikipedia.org/wiki/Regular_number
    The sequence starts with 1 and consists of all multiples by 2, 3, or 5
    of any number in the sequence, in non-repeating, increasing order.
    '''
    multiples = [[] for i in multipliers]
    this = start
    while True:
        yield this
        # Queue this term's multiples
        for multiple, multiplier in zip(multiples, multipliers):
            multiple.append(this * multiplier)
        # The next term is the smallest queued multiple
        this = min(m[0] for m in multiples)
        # Remove it from every queue that holds it (eliminates duplicates)
        for multiple in multiples:
            if multiple[0] == this:
                del multiple[0]

if __name__ == '__main__':
    print(list(islice(regularnumbergen(), 10)))

Generating the following large values of the sequence took only seconds:


>>> print(list(islice(regularnumbergen(), 99999, 100001)))
[290142196707511001929482240000000000000, 290237644800000000000000000000000000000]
>>>




Wrapup


Well, I've learnt a bit more about generators, the limitations of their Python implementation w.r.t. Haskell, and I've put more meat on why I tend not to quote much Perl: in this case, I couldn't see the wood for the trees! (But I do quote Perl.)

9 comments:

  1. Squirrelled away in the Python test-suite file test_generators.py is a much better example of Hamming number generators than I could achieve, one that uses "lazy lists":

    '''
    Another famous problem: generate all integers of the form
        2**i * 3**j * 5**k
    in increasing order, where i,j,k >= 0. Trickier than it may look at first!
    Try writing it without generators, and correctly, and without generating
    3 internal results for each result output.

    >>> def times(n, g):
    ...     for i in g:
    ...         yield n * i
    >>> firstn(times(10, intsfrom(1)), 10)
    [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

    >>> def merge(g, h):
    ...     ng = g.next()
    ...     nh = h.next()
    ...     while 1:
    ...         if ng < nh:
    ...             yield ng
    ...             ng = g.next()
    ...         elif ng > nh:
    ...             yield nh
    ...             nh = h.next()
    ...         else:
    ...             yield ng
    ...             ng = g.next()
    ...             nh = h.next()

    The following works, but is doing a whale of a lot of redundant work --
    it's not clear how to get the internal uses of m235 to share a single
    generator. Note that me_times2 (etc) each need to see every element in the
    result sequence. So this is an example where lazy lists are more natural
    (you can look at the head of a lazy list any number of times).

    >>> def m235():
    ...     yield 1
    ...     me_times2 = times(2, m235())
    ...     me_times3 = times(3, m235())
    ...     me_times5 = times(5, m235())
    ...     for i in merge(merge(me_times2,
    ...                          me_times3),
    ...                    me_times5):
    ...         yield i

    Don't print "too many" of these -- the implementation above is extremely
    inefficient: each call of m235() leads to 3 recursive calls, and in
    turn each of those 3 more, and so on, and so on, until we've descended
    enough levels to satisfy the print stmts. Very odd: when I printed 5
    lines of results below, this managed to screw up Win98's malloc in "the
    usual" way, i.e. the heap grew over 4Mb so Win98 started fragmenting
    address space, and it *looked* like a very slow leak.

    >>> result = m235()
    >>> for i in range(3):
    ...     print firstn(result, 15)
    [1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 16, 18, 20, 24]
    [25, 27, 30, 32, 36, 40, 45, 48, 50, 54, 60, 64, 72, 75, 80]
    [81, 90, 96, 100, 108, 120, 125, 128, 135, 144, 150, 160, 162, 180, 192]

    Heh. Here's one way to get a shared list, complete with an excruciating
    namespace renaming trick. The *pretty* part is that the times() and merge()
    functions can be reused as-is, because they only assume their stream
    arguments are iterable -- a LazyList is the same as a generator to times().

    >>> class LazyList:
    ...     def __init__(self, g):
    ...         self.sofar = []
    ...         self.fetch = g.next
    ...
    ...     def __getitem__(self, i):
    ...         sofar, fetch = self.sofar, self.fetch
    ...         while i >= len(sofar):
    ...             sofar.append(fetch())
    ...         return sofar[i]

    >>> def m235():
    ...     yield 1
    ...     # Gack: m235 below actually refers to a LazyList.
    ...     me_times2 = times(2, m235)
    ...     me_times3 = times(3, m235)
    ...     me_times5 = times(5, m235)
    ...     for i in merge(merge(me_times2,
    ...                          me_times3),
    ...                    me_times5):
    ...         yield i

    Print as many of these as you like -- *this* implementation is memory-
    efficient.

    >>> m235 = LazyList(m235())
    >>> for i in range(5):
    ...     print [m235[j] for j in range(15*i, 15*(i+1))]
    [1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 16, 18, 20, 24]
    [25, 27, 30, 32, 36, 40, 45, 48, 50, 54, 60, 64, 72, 75, 80]
    [81, 90, 96, 100, 108, 120, 125, 128, 135, 144, 150, 160, 162, 180, 192]
    [200, 216, 225, 240, 243, 250, 256, 270, 288, 300, 320, 324, 360, 375, 384]
    [400, 405, 432, 450, 480, 486, 500, 512, 540, 576, 600, 625, 640, 648, 675]
    '''
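
    (For anyone trying the LazyList version on Python 3, here is my rough, self-contained transcription -- a sketch only. firstn and intsfrom are helpers defined elsewhere in test_generators.py, so I drop them and index the LazyList directly instead:)

    def times(n, g):
        for i in g:
            yield n * i

    def merge(g, h):
        ng, nh = next(g), next(h)
        while True:
            if ng < nh:
                yield ng
                ng = next(g)
            elif ng > nh:
                yield nh
                nh = next(h)
            else:
                yield ng
                ng, nh = next(g), next(h)

    class LazyList:
        def __init__(self, g):
            self.sofar = []
            self.fetch = g.__next__  # g.next() became g.__next__ in Python 3

        def __getitem__(self, i):
            while i >= len(self.sofar):
                self.sofar.append(self.fetch())
            return self.sofar[i]

    def m235():
        yield 1
        # As above, m235 here actually refers to the LazyList bound below
        for i in merge(merge(times(2, m235), times(3, m235)), times(5, m235)):
            yield i

    m235 = LazyList(m235())
    print([m235[j] for j in range(15)])
    # [1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 16, 18, 20, 24]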

  2. Would like to hear your suggestions on how to clarify the recipe's discussion of the deferred_output() technique. The whole point of the recipe was to demonstrate a re-usable skill; the hamming problem itself was not of interest.

    The same technique could be applied to generating the Fibonacci series or any other series that feeds back into itself. The hamming problem is more interesting than the Fibonacci series because the stream needs to be buffered (holding many previously generated values) before being used as inputs to generate a new value.

    FWIW, here are a couple of quick comments on the code in regularnumbergen(). First, the "del multiple[0]" step has O(n) running time. Second, the lists of multiples can contain duplicates, taking up unnecessary memory (e.g. 120 is in all three lists: once as 2*60, once as 3*40, and once as 5*24). The original recipe had a single stream with no duplicates, so no memory was wasted.

    -- Raymond Hettinger

  3. First, thanks Raymond for the inspiration (it's turning into good mental perspiration, too).

    I came across the term lazy-lists used in conjunction with solutions to Hamming in other languages, and I could understand that a little better, but you don't seem to be using lazy lists.

    I can see what you are doing, but not *why*!


    I might be able to accept it as a recipe:

    Generate a stream of results from Python generators connected in a ring by:
    1. defer() what should be the output, before...
    2. ...teeing off the ring's output to get a generator of final results.
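
    For instance, here is my own rough stab at the Fibonacci variant Raymond mentions, wired the same way (a sketch, not Raymond's code):

    from itertools import chain, islice, tee

    def fib():
        def deferred_output():
            # Forward reference: `output` is bound below before the first next()
            for x in output:
                yield x

        result, f1, f2 = tee(deferred_output(), 3)
        # Pair each term with its successor and sum: F(n) = F(n-2) + F(n-1)
        sums = (a + b for a, b in zip(f1, islice(f2, 1, None)))
        output = chain([0, 1], sums)  # seed the ring with the first two terms
        return result

    print(list(islice(fib(), 10)))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]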

    - Paddy.

  4. P.S. Why deferred_output() rather than deferred(output)?

    Thanks.

  5. > Why deferred_output() rather than deferred(output)?

    Because the "output" doesn't exist at the time of the call. The whole point of a deferred is to not look for "output" until the generator itself is being consumed (at which point "output" has already been defined). This insight is the key to understanding the recipe and the general-purpose technique.

    -- Raymond
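
    To spell out that late binding with a minimal snippet (my illustration, not from the recipe): a generator's body does not run until it is consumed, so the name it references can be bound after the generator object is created.

    def deferred():
        for i in output:  # `output` is looked up only when this body runs
            yield i

    g = deferred()      # no error: the body hasn't executed yet
    output = [1, 2, 3]  # now bind the name
    print(list(g))      # [1, 2, 3]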

  6. Looking at your Lazy list code, I think you're close to reproducing my recipe as given.

    The "Gack: m235 below actually refers to a LazyList" comment highlights a trick that is *exactly* the same as the deferred_output trick. Both are a way to reference output that doesn't exist yet.

    Also, look at what LazyList does. It is almost exactly the same as itertools.tee() except that it uses indices instead of iterators and that it wastes space by keeping the full list in memory (even after the consumers no longer need to access earlier indices). In contrast, tee() is smart enough to forget the earlier entries once they are no longer needed.

    The code for "times(2, m235)" is equivalent to the generator expression: (2 * x for x in m235).

    Also note that none of your alternatives gets rid of duplicate entries in the output (which is a requirement in the problem statement -- the output needs to be monotonically ascending).

    -- Raymond

  7. The redefinition of m235 in your lazy list code is doing exactly the same thing as the deferred_output() in my original code. Both are ways of referring to output that hasn't yet been created.

    Also, the lazy list class is very much like itertools.tee() except that the lazy list uses indices instead of iterators and it builds the entire list in memory instead of forgetting entries when they are no longer needed.

    IOW, the lazy list code is essentially equivalent to the original.

    -- Raymond

  8. I took a look at a fanciful algorithm from the Dr Dobbs article and made it concrete. I then minimised the memory used (timings at the end):

    def hamming():
        '''\
        This version is based on a snippet from:
            http://dobbscodetalk.com/index.php?option=com_content&task=view&id=913&Itemid=85

            When expressed in some imaginary pseudo-C with automatic
            unlimited storage allocation and BIGNUM arithmetics, it can be
            expressed as:
                hamming = h where
                  array h;
                  n=0; h[0]=1; i=0; j=0; k=0;
                  x2=2*h[i]; x3=3*h[j]; x5=5*h[k];
                  repeat:
                    h[++n] = min(x2,x3,x5);
                    if (x2==h[n]) { x2=2*h[++i]; }
                    if (x3==h[n]) { x3=3*h[++j]; }
                    if (x5==h[n]) { x5=5*h[++k]; }
        '''
        h = 1
        _h = [h]  # memoized
        multipliers = (2, 3, 5)
        multindeces = [0 for i in multipliers]  # indices into _h for the multipliers
        multvalues = [x * _h[i] for x, i in zip(multipliers, multindeces)]
        yield h
        while True:
            h = min(multvalues)
            _h.append(h)
            for (n, (v, x, i)) in enumerate(zip(multvalues, multipliers, multindeces)):
                if v == h:
                    i += 1
                    multindeces[n] = i
                    multvalues[n] = x * _h[i]
            yield h, len(_h), multindeces

    def hamming2():
        '''\
        As hamming(), but with reduced memory usage for memoization.
        '''
        h = 1
        _h = [h]  # memoized
        multipliers = (2, 3, 5)
        multindeces = [0 for i in multipliers]  # indices into _h for the multipliers
        multvalues = [x * _h[i] for x, i in zip(multipliers, multindeces)]
        yield h
        while True:
            h = min(multvalues)
            _h.append(h)
            for (n, (v, x, i)) in enumerate(zip(multvalues, multipliers, multindeces)):
                if v == h:
                    i += 1
                    multindeces[n] = i
                    multvalues[n] = x * _h[i]
            # cap the memoization: drop entries below the smallest index still needed
            mini = min(multindeces)
            if mini >= 1000:
                del _h[:mini]
                multindeces = [i - mini for i in multindeces]
            yield h, len(_h), multindeces

    Raymond's code is the fastest when computing the millionth term:

    Time for regularnumbergen 117.180878296
    Time for hamming 14.8070693342
    Time for hamming2 16.7893559098
    Time for raymonds_hamming 5.5855925823

    - Paddy.

  9. And now there is this recipe from John Crichton that turns Raymond's method into a useful Python decorator.
