Wednesday, December 20, 2017

Not so flash, flashtext!

The announcement of the Python flashtext library has been generally well recieved. It seems to lead on its speed advantage over regexps for replacing hundreds of words in hundreds of thousands of words of text.

I had a small discussion about the speed claims at the time, but flashtext was mentioned in another, more recent post that pointed to its speed test. I decided to have a look.

The Original Speed test

His speed test seemed straight forward:
  1. His words are random, and of random length.
  2. His string to search is 5000 of those random words separated by a space.
  3. he creates a table of response times where the words to be replaced increments
  4. He gives a table of sample output at the end.

Expanded Speed test

I wanted to time flashtext against what I would think is a more pythonic solution, which is to use the dict.get method which has the useful property that if you call dict.get(word, word) then it will return the value from the dict if word is in the dict; but will otherwise return the second argument i'e. the original word.

I decided to turn off any garbage collection during the timing and also increase the size of the string to search from 5000 to 50,000 random words as some timings were very short.

My code is on github, and reproduced below:

"""
Original: https://gist.github.com/vi3k6i5/dc3335ee46ab9f650b19885e8ade6c7a
Additional dict.get() word replacer added
Created on Wed Dec 20 14:03:51 2017
@author: Paddy3118
"""
 
#!/bin/python
from flashtext.keyword import KeywordProcessor
import random
import string
import re
import time
import gc
 
 
def get_word_of_length(str_length):
    # generate a random word of given length
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(str_length))
 
# generate a list of 100K words of randomly chosen size
all_words = [get_word_of_length(random.choice([3, 4, 5, 6, 7, 8])) for i in range(100000)]
 
print('Count  | FlashText | Regex    | dict.get() | Comments')
print('------------------------------------------------------')
for keywords_length in range(1, 20002, 2500):
    gc.collect()
    # chose 5000*10 terms and create a string to search in.
    all_words_chosen = random.sample(all_words, 5000*10)
    story = ' '.join(all_words_chosen)
 
    # get unique keywords from the list of words generated.
    unique_keywords_sublist = list(set(random.sample(all_words, keywords_length)))
 
    # compile regex
    # source: https://stackoverflow.com/questions/6116978/python-replace-multiple-strings
    rep = dict([(key, '_keyword_') for key in unique_keywords_sublist])
    compiled_re = re.compile("|".join(rep.keys()))
 
    # add keywords to flashtext
    keyword_processor = KeywordProcessor()
    for keyword in unique_keywords_sublist:
        keyword_processor.add_keyword(keyword, '_keyword_')
 
    gc.disable()
    # time the modules
    start = time.time()
    # flashtext (but ommiting its keyword setup)
    _1 = keyword_processor.replace_keywords(story)
    mid = time.time()
    # re (ommiting its regexp compilation)
    _2 = compiled_re.sub(lambda m: rep[re.escape(m.group(0))], story)
    end = time.time()
    # dict.get(word, word) returns the original word if it is not in the dict
    _3 = ' '.join(rep.get(word, word) for word in story.split())
    end3 = time.time()
 
    gc.enable()
    # print output
    print(str(keywords_length).ljust(6), '|',
          "{0:.5f}".format(mid - start).ljust(9), '|',
          "{0:.5f}".format(end - mid).ljust(9), '|',
          "{0:.5f}".format(end3 - end).ljust(9), '|',
          end=' ')
    comment = []
    if _1 != _2:
        comment.append('#1 != #2')
    else:
        comment.append('#1 == #2')
    if _1 != _3:
        comment.append('#1 != #3')
    else:
        comment.append('#1 == #3')
    if _2 != _3:
        comment.append('#2 != #3')
    else:
        comment.append('#2 == #3')
print(' and '.join(comment))


The main addition is the line beginning _3 = ' '.join(.... and extensions to the tabular output that includes a dict.get column and a comments section where I print the comparisons of the returned strings of each algorithm.

Sample timings 

This is for just one run but timings are similar for repeated runs, apart from a potential bug that I will talk about later.

Count  | FlashText | Regex     | dict.get()| Comments
------------------------------------------------------
1      | 0.09375   | 0.00000   | 0.00000   | #1 != #2 and #1 == #3 and #2 != #3
2501   | 0.10938   | 3.67579   | 0.00000   | #1 != #2 and #1 == #3 and #2 != #3
5001   | 0.10939   | 6.98075   | 0.01563   | #1 != #2 and #1 == #3 and #2 != #3
7501   | 0.09745   | 9.90468   | 0.01563   | #1 != #2 and #1 != #3 and #2 != #3
10001  | 0.10937   | 12.43827  | 0.01563   | #1 != #2 and #1 == #3 and #2 != #3
12501  | 0.12498   | 14.58402  | 0.01563   | #1 != #2 and #1 == #3 and #2 != #3
15001  | 0.12500   | 16.56654  | 0.01562   | #1 != #2 and #1 == #3 and #2 != #3
17501  | 0.10937   | 18.32079  | 0.01563   | #1 != #2 and #1 == #3 and #2 != #3
20001  | 0.12500   | 20.04271  | 0.01563   | #1 != #2 and #1 == #3 and #2 != #3


Ignoring the comment column, for now, the dict.get() is around 10x faster than flashtext. (It could be more, It might be useful to rerun with an even longer sample text.

Fast algorithms written in Python are still written in interpreted Python. The Python dict is written in C and honed over decades. 

The Python `one-liner` solution is easy to reason about its correctness, so when there are issues, it is more likely to be the more opaque library that is harder to reason about when determining where the error lies.

Problems

In the tables comments, The regxp is wrong (word endings not catered for, words that occur in longer replacements not catered for).
flashtext mostly agrees with dict.get, but occaisionally shows a difference - such as in the count=7501  comment. The flashtext author might want to adjust this code to save the arguments whenever flashtext != dict.get to debug the problem ...

END.

Thursday, December 07, 2017

Elixirs' chunk_every in Python


Another way to do it...

I had created a parallel process Python Program to solve a stack overflow users problem and decided to share it on reddit
Along came user aseigo with a solution written in the Elixir programming language.
My solution looked like this:
def calc_with_concurrency(s):
    cores = max(os.cpu_count() -2, 1)
    n = math.ceil(len(s) / max(1, cores - 1))
    sections = [s[i * n: (i + 1) * n + 3] for i in range(cores)]
    sections = [ss for ss in sections if len(ss) >= 3]
    with Pool(cores) as pool:
        counts = pool.map(triples, sections)
    double_counted = Counter([s[-3:] for s in sections[:-1]])
    tot = sum(counts, Counter()) - double_counted
    return tot
aseigo's solution was this:
input
|> Stream.chunk_every(3, 1, :discard)
|> Flow.from_enumerable()
|> Flow.partition()
|> Flow.reduce(fn -> %{} end,  fn trip, acc -> Map.update(acc, trip, 1, &(&1 + 1)) end)
|> Flow.reject(fn {_, 1} -> true
                       _ -> false end)

Comparison of solutions

In my Python solution I spent a lot of time working out how to:
  1. Split the string of characters, s into parts for each sub-proces.
  2. Have each sub-process chunk its fragment into threes and count the triples.
  3. Subtract the doubly-counted triples that appear at the split points as I use an overlap of three characters at each split.
The Elixir solution works differently:
  1. It calls chunk_every to stream the whole of its input as triples.
  2. It then splits the triples between its sub-processes.
  3. These sub-prcesses only have to count their recieved triples.
I was thinking that my method of splitting the input with an overlap, then removing the double-counting on the overlap might become a personal design pattern and I had made a mental note of this. I did wonder about the Elixir chunk_every: Wasits definition missing? was it a built-in? I decided to look it up.

Elixir's chunk_every

Elixir's chunk_every is indeed a language built-in function.
The docs state:
chunk_every(enumerable, count, step, leftover \\ [])
Returns list of lists containing count items each, where each new chunk starts step elements into the enumerable.
step is optional and, if not passed, defaults to count, i.e. chunks do not overlap.
If the last chunk does not have count elements to fill the chunk, elements are taken from leftover to fill in the chunk. If leftover does not have enough elements to fill the chunk, then a partial chunk is returned with less than count elements.
If :discard is given in leftover, the last chunk is discarded unless it has exactly count elements.

Examples

iex> Enum.chunk_every([1, 2, 3, 4, 5, 6], 2)
[[1, 2], [3, 4], [5, 6]]

iex> Enum.chunk_every([1, 2, 3, 4, 5, 6], 3, 2, :discard)
[[1, 2, 3], [3, 4, 5]]

iex> Enum.chunk_every([1, 2, 3, 4, 5, 6], 3, 2, [7])
[[1, 2, 3], [3, 4, 5], [5, 6, 7]]

iex> Enum.chunk_every([1, 2, 3, 4], 3, 3, [])
[[1, 2, 3], [4]]

iex> Enum.chunk_every([1, 2, 3, 4], 10)
[[1, 2, 3, 4]]

Does Python need its own?

I thought that by creating a Python version of chunk_every, then the function might proove useful in other cases and I could add it to my toolbox.
I conducted a web search and couldn't find anything like it in Python and so set about implementing the function.

The Python implementation

Function type

I want to stream input from any iterable and generate chunks of output as needed. This means I have to write a generator function with yield statements.

Parameters

  • The Python itertools functions use the name iterable and not enumerable, I'll switch.
  • The Elixir leftover parameter has two distinct types of values:
    1. A (possibly empty), container of fill values.
    2. Or the value :discard
I split and replace leftover by two distinct Python parameters of fillvalue and discard_partial.

What I want the Python to do

I had many questions about the Elixir chunk_every function that were not answered by the docs. In the end I installed Elixir and tried a few things; asked around; and finally decided to do what came naturally in the coding too.
  • chunk_every should always start from the first element, and produce chunks made from consecutive elements of length count in size (assuming its not complicated by end considerations).
  • When traversing the input iterable it should start a new chunk from every step'th item.
  • Chunks should be yielded when they reach length count
  • At the end of the iterable, fillvalue's items, (if any) should be appended to any remeaining chunks then, if they are of count in length, they too should be yielded, in order of their starting .
  • If there remains chunks that are too small, then they should only be yielded if discard_partial == False, otherwise they are discarded.
  • Yielded chunks should be tuples of items that can become dict keys.
Elixir will only yield one partial chunk at the end of the iterable, even though if count >= 2 * step there can be more than one incomplete chunk.

chunk_every in Python

In [52]:
from itertools import chain

def chunk_every(iterable, count, step=None, fillvalue=None, discard_partial=True):
    # Check arguments
    if step is None:
        step = count
    if fillvalue is None:
        fillvalue = []
    else: 
        fillvalue = list(fillvalue)
    if type(count) != int or type(step) != int or count <1 or step < 1:
        raise NotImplementedError
    chunks = []  # list of possibly multiple chunk accumulators
    for n, x in enumerate(iterable):
        if n >= count and (n - count) % step == 0:
            yield tuple(chunks.pop(0))  # time to yield the first chunk
        if n % step == 0:
            chunks.append([])  # Start accumulating another chunk
        for c in chunks:
            c.append(x)  # Extend current chunks with iterables latest element
    # End processing
    for c in chunks:
        c += fillvalue
        if len(c) < count and discard_partial:
            continue # skip yielding end chunk
        # else:
        yield tuple(c[:count])  # yield end chunks (possibly) extended by the fillvalue.

Elixir examples

lets try out the Python with the original Elixir docs chunk_every examples arguments. I turn the Python generators output into a list to show the output.
In [53]:
list(chunk_every([1, 2, 3, 4, 5, 6], 2))
Out[53]:
[(1, 2), (3, 4), (5, 6)]
In [54]:
list(chunk_every([1, 2, 3, 4, 5, 6], 3, 2))  # discard_partial=True is the default value
Out[54]:
[(1, 2, 3), (3, 4, 5)]
In [55]:
# In contrast to the above, this shows the short chunk accumulated at the end, but discarded above
list(chunk_every([1, 2, 3, 4, 5, 6], 3, 2, discard_partial=False))
Out[55]:
[(1, 2, 3), (3, 4, 5), (5, 6)]
In [56]:
list(chunk_every([1, 2, 3, 4, 5, 6], 3, 2, [7])) # That last previously short chunk from above is filled
Out[56]:
[(1, 2, 3), (3, 4, 5), (5, 6, 7)]
In [57]:
list(chunk_every([1, 2, 3, 4], 3, 3, [])) # Totally different answer to the Elixir doc!
Out[57]:
[(1, 2, 3)]
Now the example above comes about because of a sublte difference between the Python and Elixir API's.
The Elixir result makes sense if their discard feature defaults to False unless :default is explicitely stated as the value for their leftover parameter.
In Python that similar feature is controlled by a separate parameter, discard_partial which defaults to True.
(In previous examples, the default value for the feature would not change the output).
In [58]:
list(chunk_every([1, 2, 3, 4], 3, 3, [], discard_partial=False)) # Now same answer as the Elixir doc!
Out[58]:
[(1, 2, 3), (4,)]
In [59]:
list(chunk_every([1, 2, 3, 4], 10, discard_partial=False)) # Last Elixir example
Out[59]:
[(1, 2, 3, 4)]

Python examples

In [60]:
def ppchunk_every(iterable, count, step=None, fillvalue=None, discard_partial=True):
    'Prettyprinting wrapper for chunk_every generator'
    iterable = list(iterable)   # So it can be both printed and passed.
    fillvalue = None if fillvalue==None else list(fillvalue) # ditto.
    print(f'chunk_every({iterable}, count={count}, step={step}, fillvalue={fillvalue}, '
          f'discard_partial={discard_partial})\n  -> ', end='')
    print(list(chunk_every(iterable, count, step, fillvalue, discard_partial)))
In [61]:
ppchunk_every([1, 2, 3, 4, 5, 6], 2)
chunk_every([1, 2, 3, 4, 5, 6], count=2, step=None, fillvalue=None, discard_partial=True)
  -> [(1, 2), (3, 4), (5, 6)]

With increasing iterable, chunk into 3 on every third item

In [62]:
for size in range(2, 10):
    ppchunk_every(range(size), 3, 3, None, True)
chunk_every([0, 1], count=3, step=3, fillvalue=None, discard_partial=True)
  -> []
chunk_every([0, 1, 2], count=3, step=3, fillvalue=None, discard_partial=True)
  -> [(0, 1, 2)]
chunk_every([0, 1, 2, 3], count=3, step=3, fillvalue=None, discard_partial=True)
  -> [(0, 1, 2)]
chunk_every([0, 1, 2, 3, 4], count=3, step=3, fillvalue=None, discard_partial=True)
  -> [(0, 1, 2)]
chunk_every([0, 1, 2, 3, 4, 5], count=3, step=3, fillvalue=None, discard_partial=True)
  -> [(0, 1, 2), (3, 4, 5)]
chunk_every([0, 1, 2, 3, 4, 5, 6], count=3, step=3, fillvalue=None, discard_partial=True)
  -> [(0, 1, 2), (3, 4, 5)]
chunk_every([0, 1, 2, 3, 4, 5, 6, 7], count=3, step=3, fillvalue=None, discard_partial=True)
  -> [(0, 1, 2), (3, 4, 5)]
chunk_every([0, 1, 2, 3, 4, 5, 6, 7, 8], count=3, step=3, fillvalue=None, discard_partial=True)
  -> [(0, 1, 2), (3, 4, 5), (6, 7, 8)]

Overlap chunks by using step < count

In [63]:
for size in range(2, 10):
    ppchunk_every(range(size), 3, 1, None, True)
chunk_every([0, 1], count=3, step=1, fillvalue=None, discard_partial=True)
  -> []
chunk_every([0, 1, 2], count=3, step=1, fillvalue=None, discard_partial=True)
  -> [(0, 1, 2)]
chunk_every([0, 1, 2, 3], count=3, step=1, fillvalue=None, discard_partial=True)
  -> [(0, 1, 2), (1, 2, 3)]
chunk_every([0, 1, 2, 3, 4], count=3, step=1, fillvalue=None, discard_partial=True)
  -> [(0, 1, 2), (1, 2, 3), (2, 3, 4)]
chunk_every([0, 1, 2, 3, 4, 5], count=3, step=1, fillvalue=None, discard_partial=True)
  -> [(0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5)]
chunk_every([0, 1, 2, 3, 4, 5, 6], count=3, step=1, fillvalue=None, discard_partial=True)
  -> [(0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, 6)]
chunk_every([0, 1, 2, 3, 4, 5, 6, 7], count=3, step=1, fillvalue=None, discard_partial=True)
  -> [(0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, 6), (5, 6, 7)]
chunk_every([0, 1, 2, 3, 4, 5, 6, 7, 8], count=3, step=1, fillvalue=None, discard_partial=True)
  -> [(0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5), (4, 5, 6), (5, 6, 7), (6, 7, 8)]

Leave gaps between chunks by using step > count

In [64]:
for size in range(3, 12):
    ppchunk_every(range(size), 3, 4, None, True)
chunk_every([0, 1, 2], count=3, step=4, fillvalue=None, discard_partial=True)
  -> [(0, 1, 2)]
chunk_every([0, 1, 2, 3], count=3, step=4, fillvalue=None, discard_partial=True)
  -> [(0, 1, 2)]
chunk_every([0, 1, 2, 3, 4], count=3, step=4, fillvalue=None, discard_partial=True)
  -> [(0, 1, 2)]
chunk_every([0, 1, 2, 3, 4, 5], count=3, step=4, fillvalue=None, discard_partial=True)
  -> [(0, 1, 2)]
chunk_every([0, 1, 2, 3, 4, 5, 6], count=3, step=4, fillvalue=None, discard_partial=True)
  -> [(0, 1, 2), (4, 5, 6)]
chunk_every([0, 1, 2, 3, 4, 5, 6, 7], count=3, step=4, fillvalue=None, discard_partial=True)
  -> [(0, 1, 2), (4, 5, 6)]
chunk_every([0, 1, 2, 3, 4, 5, 6, 7, 8], count=3, step=4, fillvalue=None, discard_partial=True)
  -> [(0, 1, 2), (4, 5, 6)]
chunk_every([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], count=3, step=4, fillvalue=None, discard_partial=True)
  -> [(0, 1, 2), (4, 5, 6)]
chunk_every([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], count=3, step=4, fillvalue=None, discard_partial=True)
  -> [(0, 1, 2), (4, 5, 6), (8, 9, 10)]

(Partially) complete end chunks from fillvalues

In [65]:
for size in range(0, 5):
    ppchunk_every(range(size), 4, 1, [], True)
    ppchunk_every(range(size), 4, 1, [], False)
    ppchunk_every(range(size), 4, 1, 'F0 F1'.split(), True)
    ppchunk_every(range(size), 4, 1, 'F0 F1'.split(), False)
    print('===')
chunk_every([], count=4, step=1, fillvalue=[], discard_partial=True)
  -> []
chunk_every([], count=4, step=1, fillvalue=[], discard_partial=False)
  -> []
chunk_every([], count=4, step=1, fillvalue=['F0', 'F1'], discard_partial=True)
  -> []
chunk_every([], count=4, step=1, fillvalue=['F0', 'F1'], discard_partial=False)
  -> []
===
chunk_every([0], count=4, step=1, fillvalue=[], discard_partial=True)
  -> []
chunk_every([0], count=4, step=1, fillvalue=[], discard_partial=False)
  -> [(0,)]
chunk_every([0], count=4, step=1, fillvalue=['F0', 'F1'], discard_partial=True)
  -> []
chunk_every([0], count=4, step=1, fillvalue=['F0', 'F1'], discard_partial=False)
  -> [(0, 'F0', 'F1')]
===
chunk_every([0, 1], count=4, step=1, fillvalue=[], discard_partial=True)
  -> []
chunk_every([0, 1], count=4, step=1, fillvalue=[], discard_partial=False)
  -> [(0, 1), (1,)]
chunk_every([0, 1], count=4, step=1, fillvalue=['F0', 'F1'], discard_partial=True)
  -> [(0, 1, 'F0', 'F1')]
chunk_every([0, 1], count=4, step=1, fillvalue=['F0', 'F1'], discard_partial=False)
  -> [(0, 1, 'F0', 'F1'), (1, 'F0', 'F1')]
===
chunk_every([0, 1, 2], count=4, step=1, fillvalue=[], discard_partial=True)
  -> []
chunk_every([0, 1, 2], count=4, step=1, fillvalue=[], discard_partial=False)
  -> [(0, 1, 2), (1, 2), (2,)]
chunk_every([0, 1, 2], count=4, step=1, fillvalue=['F0', 'F1'], discard_partial=True)
  -> [(0, 1, 2, 'F0'), (1, 2, 'F0', 'F1')]
chunk_every([0, 1, 2], count=4, step=1, fillvalue=['F0', 'F1'], discard_partial=False)
  -> [(0, 1, 2, 'F0'), (1, 2, 'F0', 'F1'), (2, 'F0', 'F1')]
===
chunk_every([0, 1, 2, 3], count=4, step=1, fillvalue=[], discard_partial=True)
  -> [(0, 1, 2, 3)]
chunk_every([0, 1, 2, 3], count=4, step=1, fillvalue=[], discard_partial=False)
  -> [(0, 1, 2, 3), (1, 2, 3), (2, 3), (3,)]
chunk_every([0, 1, 2, 3], count=4, step=1, fillvalue=['F0', 'F1'], discard_partial=True)
  -> [(0, 1, 2, 3), (1, 2, 3, 'F0'), (2, 3, 'F0', 'F1')]
chunk_every([0, 1, 2, 3], count=4, step=1, fillvalue=['F0', 'F1'], discard_partial=False)
  -> [(0, 1, 2, 3), (1, 2, 3, 'F0'), (2, 3, 'F0', 'F1'), (3, 'F0', 'F1')]
===
In the last section above, at the end of the iterable, all chunks are filled from copies of the fillvalue; chunks of length count are yielded, then the discard_partial argument determines if any remaining short chunks are yielded.