I had a small discussion about the speed claims at the time, but flashtext was mentioned in another, more recent post that pointed to its speed test. I decided to have a look.
The original speed test
His speed test seemed straightforward (a rough sketch of the setup follows this list):
- His words are random, and of random length.
- His string to search is 5000 of those random words separated by a space.
- He creates a table of response times as the number of words to be replaced increases.
- He gives a table of sample output at the end.
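Something like the following captures the shape of that setup. This is only my sketch of it; the full script, including the timing code, is reproduced in the next section.

import random
import string

def random_word(length):
    # a random lowercase word of the given length
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(length))

# a pool of random words, a long space-separated string to search,
# and an increasing number of keywords to replace within it
all_words = [random_word(random.choice([3, 4, 5, 6, 7, 8])) for _ in range(100000)]
story = ' '.join(random.sample(all_words, 5000))
for keyword_count in range(1, 20002, 2500):
    keywords = random.sample(all_words, keyword_count)
    # ... time each replacement method against `story` here ...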
Expanded speed test
I wanted to time flashtext against what I would think is a more pythonic solution, which is to use the dict.get method. It has the useful property that dict.get(word, word) returns the value from the dict if word is a key, but otherwise returns the second argument, i.e. the original word. I decided to turn off garbage collection during the timing, and also to increase the size of the string to search from 5000 to 50,000 random words, as some timings were very short.
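As a minimal illustration of that idiom (the keywords and text here are invented for the example, not taken from the benchmark):

# dict.get(word, word) returns the mapped replacement when word is a key,
# and the word itself otherwise, so unknown words pass through unchanged.
replacements = {'cat': '_keyword_', 'dog': '_keyword_'}
text = 'the cat sat by the dog on the mat'
print(' '.join(replacements.get(word, word) for word in text.split()))
# -> the _keyword_ sat by the _keyword_ on the mat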
My code is on GitHub, and is reproduced below:
""" Original: https://gist.github.com/vi3k6i5/dc3335ee46ab9f650b19885e8ade6c7a Additional dict.get() word replacer added Created on Wed Dec 20 14:03:51 2017 @author: Paddy3118 """ #!/bin/python from flashtext.keyword import KeywordProcessor import random import string import re import time import gc def get_word_of_length(str_length): # generate a random word of given length return ''.join(random.choice(string.ascii_lowercase) for _ in range(str_length)) # generate a list of 100K words of randomly chosen size all_words = [get_word_of_length(random.choice([3, 4, 5, 6, 7, 8])) for i in range(100000)] print('Count | FlashText | Regex | dict.get() | Comments') print('------------------------------------------------------') for keywords_length in range(1, 20002, 2500): gc.collect() # chose 5000*10 terms and create a string to search in. all_words_chosen = random.sample(all_words, 5000*10) story = ' '.join(all_words_chosen) # get unique keywords from the list of words generated. unique_keywords_sublist = list(set(random.sample(all_words, keywords_length))) # compile regex # source: https://stackoverflow.com/questions/6116978/python-replace-multiple-strings rep = dict([(key, '_keyword_') for key in unique_keywords_sublist]) compiled_re = re.compile("|".join(rep.keys())) # add keywords to flashtext keyword_processor = KeywordProcessor() for keyword in unique_keywords_sublist: keyword_processor.add_keyword(keyword, '_keyword_') gc.disable() # time the modules start = time.time() # flashtext (but ommiting its keyword setup) _1 = keyword_processor.replace_keywords(story) mid = time.time() # re (ommiting its regexp compilation) _2 = compiled_re.sub(lambda m: rep[re.escape(m.group(0))], story) end = time.time() # dict.get(word, word) returns the original word if it is not in the dict _3 = ' '.join(rep.get(word, word) for word in story.split()) end3 = time.time() gc.enable() # print output print(str(keywords_length).ljust(6), '|', "{0:.5f}".format(mid - start).ljust(9), '|', "{0:.5f}".format(end - mid).ljust(9), '|', "{0:.5f}".format(end3 - end).ljust(9), '|', end=' ') comment = [] if _1 != _2: comment.append('#1 != #2') else: comment.append('#1 == #2') if _1 != _3: comment.append('#1 != #3') else: comment.append('#1 == #3') if _2 != _3: comment.append('#2 != #3') else: comment.append('#2 == #3') print(' and '.join(comment))
The main additions are the line beginning _3 = ' '.join(..., and extensions to the tabular output: a dict.get() column, and a comments column where I print comparisons of the strings returned by each algorithm.
Sample timings
This is for just one run but timings are similar for repeated runs, apart from a potential bug that I will talk about later.

Count  | FlashText | Regex     | dict.get()| Comments
------------------------------------------------------
1      | 0.09375   | 0.00000   | 0.00000   | #1 != #2 and #1 == #3 and #2 != #3
2501   | 0.10938   | 3.67579   | 0.00000   | #1 != #2 and #1 == #3 and #2 != #3
5001   | 0.10939   | 6.98075   | 0.01563   | #1 != #2 and #1 == #3 and #2 != #3
7501   | 0.09745   | 9.90468   | 0.01563   | #1 != #2 and #1 != #3 and #2 != #3
10001  | 0.10937   | 12.43827  | 0.01563   | #1 != #2 and #1 == #3 and #2 != #3
12501  | 0.12498   | 14.58402  | 0.01563   | #1 != #2 and #1 == #3 and #2 != #3
15001  | 0.12500   | 16.56654  | 0.01562   | #1 != #2 and #1 == #3 and #2 != #3
17501  | 0.10937   | 18.32079  | 0.01563   | #1 != #2 and #1 == #3 and #2 != #3
20001  | 0.12500   | 20.04271  | 0.01563   | #1 != #2 and #1 == #3 and #2 != #3
Ignoring the comment column for now, dict.get() is around 10x faster than flashtext. (It could be more; it might be useful to rerun with an even longer sample text.)
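One way to do that re-timing, as my own sketch rather than part of the benchmark, would be to scale the story up and time only the replacement calls, reusing the random, all_words, rep and keyword_processor names from the script above:

import timeit

# Re-time only the replacement calls on progressively longer stories.
# Assumes random, all_words, rep and keyword_processor from the script above.
for n_words in (50000, 500000):
    story = ' '.join(random.choices(all_words, k=n_words))
    t_dict = timeit.timeit(
        lambda: ' '.join(rep.get(word, word) for word in story.split()),
        number=3) / 3
    t_flash = timeit.timeit(
        lambda: keyword_processor.replace_keywords(story),
        number=3) / 3
    print('%7i words: dict.get() %.3fs, flashtext %.3fs'
          % (n_words, t_dict, t_flash))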
Fast algorithms written in Python are still written in interpreted Python. The Python dict is written in C and honed over decades.
It is easy to reason about the correctness of the Python `one-liner` solution, so when outputs disagree, the error is more likely to lie in the more opaque library, which is harder to reason about.
Problems
In the table's comments, the regexp is wrong (word endings are not catered for, and words that occur inside longer replacements are not catered for). flashtext mostly agrees with dict.get, but occasionally shows a difference, such as in the count=7501 comment. The flashtext author might want to adjust this code to save the arguments whenever flashtext != dict.get, to debug the problem ...
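To make the regexp problem concrete, here is a small invented example (not from the benchmark) showing a bare alternation matching inside longer words, and one common fix using \b word boundaries with longest-first ordering:

import re

rep = {'cat': '_keyword_', 'category': '_keyword_'}
text = 'the cat filed it under category in the catalogue'

# The benchmark's approach: a bare alternation, no word boundaries.
bare = re.compile('|'.join(rep))
print(bare.sub('_keyword_', text))
# -> the _keyword_ filed it under _keyword_egory in the _keyword_alogue

# One common correction: escape, sort longest-first, and add \b boundaries.
fixed = re.compile(r'\b(?:%s)\b' % '|'.join(
    sorted(map(re.escape, rep), key=len, reverse=True)))
print(fixed.sub('_keyword_', text))
# -> the _keyword_ filed it under _keyword_ in the catalogue

And for the debugging suggestion, a hypothetical addition inside the benchmark's timing loop that dumps a failing case to a file might look like this:

# Save the inputs and both outputs whenever flashtext and dict.get disagree.
if _1 != _3:
    with open('mismatch_%d.txt' % keywords_length, 'w') as f:
        f.write('\n'.join([' '.join(unique_keywords_sublist), story, _1, _3]))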
END.