Go deh!: Not so flash, flashtext!

The announcement of the Python flashtext library has been generally well recieved. It seems to lead on its speed advantage over regexps for replacing hundreds of words in hundreds of thousands of words of text.

I had a small discussion about the speed claims at the time, but flashtext was mentioned in another, more recent post that pointed to its speed test. I decided to have a look.

The Original Speed test

His speed test seemed straight forward:

His words are random, and of random length.
His string to search is 5000 of those random words separated by a space.
he creates a table of response times where the words to be replaced increments
He gives a table of sample output at the end.

Expanded Speed test

I wanted to time flashtext against what I would think is a more pythonic solution, which is to use the dict.get method which has the useful property that if you call dict.get(word, word) then it will return the value from the dict if word is in the dict; but will otherwise return the second argument i'e. the original word.

I decided to turn off any garbage collection during the timing and also increase the size of the string to search from 5000 to 50,000 random words as some timings were very short.

My code is on github, and reproduced below:

"""
Original: https://gist.github.com/vi3k6i5/dc3335ee46ab9f650b19885e8ade6c7a
Additional dict.get() word replacer added
Created on Wed Dec 20 14:03:51 2017
@author: Paddy3118
"""
 
#!/bin/python
from flashtext.keyword import KeywordProcessor
import random
import string
import re
import time
import gc
 
 
def get_word_of_length(str_length):
    # generate a random word of given length
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(str_length))
 
# generate a list of 100K words of randomly chosen size
all_words = [get_word_of_length(random.choice([3, 4, 5, 6, 7, 8])) for i in range(100000)]
 
print('Count  | FlashText | Regex    | dict.get() | Comments')
print('------------------------------------------------------')
for keywords_length in range(1, 20002, 2500):
    gc.collect()
    # chose 5000*10 terms and create a string to search in.
    all_words_chosen = random.sample(all_words, 5000*10)
    story = ' '.join(all_words_chosen)
 
    # get unique keywords from the list of words generated.
    unique_keywords_sublist = list(set(random.sample(all_words, keywords_length)))
 
    # compile regex
    # source: https://stackoverflow.com/questions/6116978/python-replace-multiple-strings
    rep = dict([(key, '_keyword_') for key in unique_keywords_sublist])
    compiled_re = re.compile("|".join(rep.keys()))
 
    # add keywords to flashtext
    keyword_processor = KeywordProcessor()
    for keyword in unique_keywords_sublist:
        keyword_processor.add_keyword(keyword, '_keyword_')
 
    gc.disable()
    # time the modules
    start = time.time()
    # flashtext (but ommiting its keyword setup)
    _1 = keyword_processor.replace_keywords(story)
    mid = time.time()
    # re (ommiting its regexp compilation)
    _2 = compiled_re.sub(lambda m: rep[re.escape(m.group(0))], story)
    end = time.time()
    # dict.get(word, word) returns the original word if it is not in the dict
    _3 = ' '.join(rep.get(word, word) for word in story.split())
    end3 = time.time()
 
    gc.enable()
    # print output
    print(str(keywords_length).ljust(6), '|',
          "{0:.5f}".format(mid - start).ljust(9), '|',
          "{0:.5f}".format(end - mid).ljust(9), '|',
          "{0:.5f}".format(end3 - end).ljust(9), '|',
          end=' ')
    comment = []
    if _1 != _2:
        comment.append('#1 != #2')
    else:
        comment.append('#1 == #2')
    if _1 != _3:
        comment.append('#1 != #3')
    else:
        comment.append('#1 == #3')
    if _2 != _3:
        comment.append('#2 != #3')
    else:
        comment.append('#2 == #3')
print(' and '.join(comment))

The main addition is the line beginning _3 = ' '.join(.... and extensions to the tabular output that includes a dict.get column and a comments section where I print the comparisons of the returned strings of each algorithm.

Sample timings

This is for just one run but timings are similar for repeated runs, apart from a potential bug that I will talk about later.

Count | FlashText | Regex    | dict.get()| Comments
------------------------------------------------------
1      | 0.09375   | 0.00000   | 0.00000   | #1 != #2 and #1 == #3 and #2 != #3
2501   | 0.10938   | 3.67579   | 0.00000   | #1 != #2 and #1 == #3 and #2 != #3
5001   | 0.10939   | 6.98075   | 0.01563   | #1 != #2 and #1 == #3 and #2 != #3
7501   | 0.09745   | 9.90468   | 0.01563   | #1 != #2 and #1 != #3 and #2 != #3
10001 | 0.10937   | 12.43827 | 0.01563   | #1 != #2 and #1 == #3 and #2 != #3
12501 | 0.12498   | 14.58402 | 0.01563   | #1 != #2 and #1 == #3 and #2 != #3
15001 | 0.12500   | 16.56654 | 0.01562   | #1 != #2 and #1 == #3 and #2 != #3
17501 | 0.10937   | 18.32079 | 0.01563   | #1 != #2 and #1 == #3 and #2 != #3
20001 | 0.12500   | 20.04271 | 0.01563   | #1 != #2 and #1 == #3 and #2 != #3

Ignoring the comment column, for now, the dict.get() is around 10x faster than flashtext. (It could be more, It might be useful to rerun with an even longer sample text.

Fast algorithms written in Python are still written in interpreted Python. The Python dict is written in C and honed over decades.

The Python `one-liner` solution is easy to reason about its correctness, so when there are issues, it is more likely to be the more opaque library that is harder to reason about when determining where the error lies.

Problems

In the tables comments, The regxp is wrong (word endings not catered for, words that occur in longer replacements not catered for).
flashtext mostly agrees with dict.get, but occaisionally shows a difference - such as in the count=7501 comment. The flashtext author might want to adjust this code to save the arguments whenever flashtext != dict.get to debug the problem ...

END.

Go deh!

Wednesday, December 20, 2017

Not so flash, flashtext!

The Original Speed test

Expanded Speed test

Sample timings

Problems

No comments:

Post a Comment

About Me

Followers

Subscribe Now: google

Go deh too!

whos.amung.us

Blog Archive