(Best browsed on a larger than phone screen)
Someone updated the RC task, and I couldn't remember attempting it so took a closer look at the Python solution, (at least the more idiomatic of the two).
Task
The task is: Given two character strings, find the longest string of characters present anywhere in both strings.
I should point out that there is a different, well known, task which is to find the longest common subsequence between two strings.
There is no such skipping allowed in a sub**string**
The "Rote search" method
I took a look at the Python on Rosetta Code, it is by user Halilmal:
s1 = "thisisatest"
s2 = "testing123testing"
len1, len2 = len(s1), len(s2)
ir, jr = 0, -1
for i1 in range(len1):
i2 = s2.find(s1[i1])
while i2 >= 0:
j1, j2 = i1, i2
while j1 < len1 and j2 < len2 and s2[j2] == s1[j1]:
if j1-i1 >= jr-ir:
ir, jr = i1, j1
j1 += 1; j2 += 1
i2 = s2.find(s1[i1], i2+1)
print (s1[ir:jr+1])
test
The code works. To understand it I decided to take the liberty of renaming and extracting a function from the original. I also named the input strings from zero rather than 1.
s0 = "thisisatest"
s1 = "testing123testing"
def lcs_rs(s0, s1):
"Commented/longer names, RosettaCode entry by user Halilmali"
len0, len1 = len(s0), len(s1)
max_sub_lh, max_sub_rh = 0, -1 # Start LH and RH indices of max substring
for i0 in range(len0): # Start LH index of s0 char
i1 = s1.find(s0[i0]) # Start LH index of (first) s1 char match
while i1 >= 0:
sub_end_s0, sub_end_s1 = i0, i1 # End indices of this substring match
while (sub_end_s0 < len0 and
sub_end_s1 < len1 and
s1[sub_end_s1] == s0[sub_end_s0]):
if sub_end_s0 - i0 >= max_sub_rh-max_sub_lh:
max_sub_lh, max_sub_rh = i0, sub_end_s0 # New Max
sub_end_s0 += 1
sub_end_s1 += 1
i1 = s1.find(s0[i0], i1+1) # Start LH index of next s1 char match
return s0[max_sub_lh:max_sub_rh+1]
print (lcs_rs(s0, s1))
test
In this algorithm we have the two strings:
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6
s0[i0]: t h i s i s a t e s t
s1[i1]: t e s t i n g 1 2 3 t e s t i n g
For every character in s0 starting from the left, it looks up that character in s1 then tries to extend the matching string to the right as much as possible; moving to second and subsequent matches in s1; all the while keeping track of the largest substring match so far.
It seems to do a lot of traversal of the strings!
Dynamic Programming.
I searched and found a site that had a dynamic programming solution.
I liked the algorithm. In this you start with an X-Y array of zeroes, what I call the dp
array:
#s1[j] t e s t i n g 1 2 3 t e s t i n g; # s0[i]
dp = [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], # t
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], # h
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], # i
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], # s
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], # i
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], # s
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], # a
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], # t
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], # e
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], # s
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]] # t
You compare every character of s0 with every character of s1 in order of increasing i and j and fill in the table like this:
if s0[i] == s1[j] then dp[i][j] = dp[i-1][j-1] + 1
You are comparing two characters from each string. If they are equal, then accumulate how many characters have been equal by adding one to the equality count for the dp coordinate corresponding to accumulated run of equivalences for the "previous" character.
There is an issue with what to do at the "edges". When i
or j
are zero then set dp[i][j] = 1
After filling the dp array, to get the longest substring search dp for a maximum value, mx, then mx is the number of characters in the longest substring and the coordinates of mx give the end index iof the string in s0 and s1 respectively.
An example:
s0 = 'ABBA'
s1 = 'AABB'
#s1[j] A A B B; # s0[i]
dp = [[1, 1, 0, 0], # A
[0, 0, 2, 1], # B
[0, 0, 1, 3], # B
[1, 1, 0, 0]] # A
The above fills in dp
after using the equivalent to for i in range(len(s0)): for j in range(len(s1)): ...
to traverse the array. The maximum value is clearly 3 at i,j == (2, 3)
. Using the i coordinate corresponding to the end of the longest substring being the first B in s0 knowing the length, the longest substring must be 'ABB'.
You can cross check that the same 3 character substring appears in s1 ending at s1[3]
Code
from itertools import product
def dp_array_print(s0, s1, dp, ans):
print(f"\ns0 = {repr(s0)}\ns1 = {repr(s1)}\n")
print(f"#s1[j] {' '.join(c for c in s1)}; # s0[i]\n")
for i, (dp_row, ch0) in enumerate(zip(dp, s0)):
txt = f" {dp_row}, # {ch0}"
if i == 0:
txt = "dp = [" + txt[6:]
if i + 1 == len(s0):
txt = txt.replace(', #', '] #')
print(txt)
print('\nAnswer = ', repr(ans),'\n=====================')
def lcs_dp(s0, s1):
len0, len1 = len(s0), len(s1)
# Bottom up Dynamic Programming table gen
dp = [[0] * len1 for _ in range(len0)]
for i, j in product(range(len0), range(len1)):
dp[i][j] = (1 + (dp[i-1][j-1] if i and j else 0) # 0 if off "edge"
if s0[i] == s1[j] # Only increment if same chars
else 0)
# Find max
mx, ij_mx = 0, (-1, -1)
for i, j in product(range(len0), range(len1)):
if dp[i][j] > mx:
mx, ij_mx = dp[i][j], (i, j)
return s0[1 + ij_mx[0] - mx: 1 + ij_mx[0]], dp
s0, s1 = 'ABBA', 'AABB'
ans01, dp01 = lcs_dp(s0, s1)
dp_array_print(s0, s1, dp01, ans01)
s0 = 'ABBA' s1 = 'AABB' #s1[j] A A B B; # s0[i] dp = [[1, 1, 0, 0], # A [0, 0, 2, 1], # B [0, 0, 1, 3], # B [1, 1, 0, 0]] # A Answer = 'ABB' =====================
The code isn't too bad, but there is that nested if-expression needed to calculate the dp[i][j] value under all cases.
Dynamic Programming for more than two strings?
For three strings I thought of filling in a dp cube in a similar way to the dp array, should work.
After trying a few two-string examples I noticed that there could be a
fair amount of zeroes in the dp array. I decided to therefore make the
dp-cube a sparse array held in a dict with a tuple of the string indices
of any point and only keeping non-zero counts in the dp sparse dict.
I chose not to use a defaultdict(int)
for the dp dict as it would accumulate zero values for accesses to any tuple coordinate having a -1 in it.
It prooved to be easy enough to move from first doing a three-string function to actually going ahead and doing a two or more string version straight away.
def lcs_dpx(*strings):
"Longest Common Substring: Dynamic Programming solution for many strings"
index_ranges = [range(len(s)) for s in strings]
# Bottom up Dynamic Programming table gen
dp = {}
for indices in product(*index_ranges):
string_chars = (s[i] for i, s in zip(indices, strings))
first_char = next(string_chars)
if all(next_char == first_char for next_char in string_chars):
prev = tuple(i-1 for i in indices) # Index previous chars
dp[indices] = 1 + (dp[prev] if prev in dp else 0)
# Find max
mx, mx_ind = 0, tuple(-1 for _ in strings)
for indices, val in dp.items():
if val > mx:
mx, mx_ind = val, indices
return strings[0][1 + mx_ind[0] - mx: 1 + mx_ind[0]], dict(dp)
s0 = "thisisatest_stinger"
s1 = "testing123testingthing"
s2 = "thisis"
ans, dp = lcs_dpx(s0, s1)
print(f"\n{repr(s0)}, {repr(s1)} ->> {repr(ans)}; len(dp) = {len(dp)};")
ans, dp = lcs_dpx(s0, s2)
print(f"\n{repr(s0)}, {repr(s2)} ->> {repr(ans)}; len(dp) = {len(dp)};")
ans, dp = lcs_dpx(s1, s2)
print(f"\n{repr(s1)}, {repr(s2)} ->> {repr(ans)}; len(dp) = {len(dp)};")
ans, dp = lcs_dpx(s0, s1, s2)
print(f"\n{repr(s0)}, {repr(s1)}, {repr(s2)} ->> {repr(ans)}; len(dp) = {len(dp)};")
'thisisatest_stinger', 'testing123testingthing' ->> 'sting'; len(dp) = 48; 'thisisatest_stinger', 'thisis' ->> 'thisis'; len(dp) = 19; 'testing123testingthing', 'thisis' ->> 'thi'; len(dp) = 16; 'thisisatest_stinger', 'testing123testingthing', 'thisis' ->> 'thi'; len(dp) = 55;
This works too!
The code uses product to perform arbitrarily nested for loops. The logic
for incrementing dp is simplified as dp starts without, and cannot
contain an indices
tuple with a -1 in it as that is generated in prev
and dp[prev]
is never assigned to.
Strings containing long runs of the samwe single character will have the largest number of matches and so the largest dp
size for this algorithm.
The longest of the common substrings of all substrings of each string
This algorithm, I saw mentioned somewhere. From just the menton of sets, i worked out what would be needed in Python, without bothering to understand the Perl original.
This method
- Works out the set of all the possible sub-strings for each individual string.
- Finds the set of substrings common to all the strings
- Returns the common substring that is the longest.
The description of the algorithm "does exactly what it says on the tin". I decided to create a version that worked with two and more strings as it seemed the most straight-forward of extensions.
Code
function _set_of_substrings
returns
the set of all possible substrings of its argument. Rather than
continue to just generate all the substrings of all strings, function _set_of_common_substrings
is used for subsequent strings to accumulate and return the set of only the common substrings. This latter function also makes use of the walrus operator to assign to this
.
def _set_of_substrings(s:str) -> set:
"_set_of_substrings('ABBA') == {'A', 'AB', 'ABB', 'ABBA', 'B', 'BA', 'BB', 'BBA'}"
len_s = len(s)
return {s[m: n] for m in range(len_s) for n in range(m+1, len_s +1)}
def _set_of_common_substrings(s:str, common: set) -> set:
"Substrings of s that are also in common"
len_s = len(s)
return {this for m in range(len_s) for n in range(m+1, len_s +1)
if (this := s[m:n]) in common}
def lcs_ss(*strings):
"longest of the common substrings of all substrings of each string"
strings_iter = (s for s in strings)
common = _set_of_substrings(next(strings_iter)) # First string substrings
for s in strings_iter:
if not common:
break
common = _set_of_common_substrings(s, common) # Accumulate the common
return max(common, key= lambda x: len(x)) if common else ''
s0 = "thisisatest_stinger"
s1 = "testing123testingthing"
s2 = "thisis"
ans = lcs_ss(s0, s1)
print(f"\n{repr(s0)}, {repr(s1)} ->> {repr(ans)}")
ans = lcs_ss(s0, s2)
print(f"\n{repr(s0)}, {repr(s2)} ->> {repr(ans)}")
ans = lcs_ss(s1, s2)
print(f"\n{repr(s1)}, {repr(s2)} ->> {repr(ans)}")
ans = lcs_ss(s0, s1, s2)
print(f"\n{repr(s0)}, {repr(s1)}, {repr(s2)} ->> {repr(ans)}")
'thisisatest_stinger', 'testing123testingthing' ->> 'sting' 'thisisatest_stinger', 'thisis' ->> 'thisis' 'testing123testingthing', 'thisis' ->> 'thi' 'thisisatest_stinger', 'testing123testingthing', 'thisis' ->> 'thi'
Comparative Timings
Simplistic data so have a pinch of salt to hand:
s0 = "thisisatest_stinger"
s1 = "testing123testingthing"
s2 = "thisis"
%timeit _ = lcs_rs(s0, s1)
122 µs ± 2.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit _ = lcs_dp(s0, s1)
315 µs ± 13.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit _ = lcs_dpx(s0, s1)
1.4 ms ± 14.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit _ = lcs_ss(s0, s1)
265 µs ± 4.33 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Now for timing 3-string runs:
%timeit _ = lcs_dpx(s0, s1, s2)
8.4 ms ± 411 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit _ = lcs_ss(s0, s1, s2)
284 µs ± 13.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Stop Press! Incremental DP
It seems to me that in the multi-string, dynamic programming case, you should be able to create the dp
sparse array for the first two strings then add in subsequent strings to successive dp
sparse arrays incrementing dimensionality. When adding a dimension/string to dp
, the search for string equality only needs to be done where there were matches for all previous strings/lower dimentionality dp
.
def lcs_idp(*strings):
"Longest Common Substring: Incremental Dynamic Programming for many strings"
index_ranges = [range(len(s)) for s in strings]
## Incremental Dynamic Programming table gen
# First two strings
dp = {}
for indices in product(*index_ranges[:2]):
string_chars = (s[i] for i, s in zip(indices, strings[:2]))
first_char = next(string_chars)
if all(next_char == first_char for next_char in string_chars):
prev = tuple(i-1 for i in indices) # Index previous chars
dp[indices] = 1 + (dp[prev] if prev in dp else 0)
# Next strings
for index_range, string in zip(index_ranges[2:], strings[2:]):
dp_last, dp = dp, {}
for last_match_index in dp_last: # These are non-zero from before
last_match_char = strings[0][last_match_index[0]]
for index, this_char in enumerate(string):
if last_match_char == this_char:
indices = tuple(list(last_match_index) + [index])
prev = tuple([i-1 for i in last_match_index] + [index - 1])
dp[indices] = 1 + (dp[prev] if prev in dp else 0)
# Find max
mx, mx_ind = 0, tuple(-1 for _ in strings)
for indices, val in dp.items():
if val > mx:
mx, mx_ind = val, indices
return strings[0][1 + mx_ind[0] - mx: 1 + mx_ind[0]], dict(dp)
s0 = "thisisatest_stinger"
s1 = "testing123testingthing"
s2 = "thisis"
ans, dp = lcs_idp(s0, s1)
print(f"\n{repr(s0)}, {repr(s1)} ->> {repr(ans)}; len(dp) = {len(dp)};")
ans, dp = lcs_idp(s0, s2)
print(f"\n{repr(s0)}, {repr(s2)} ->> {repr(ans)}; len(dp) = {len(dp)};")
ans, dp = lcs_idp(s1, s2)
print(f"\n{repr(s1)}, {repr(s2)} ->> {repr(ans)}; len(dp) = {len(dp)};")
ans, dp = lcs_idp(s0, s1, s2)
print(f"\n{repr(s0)}, {repr(s1)}, {repr(s2)} ->> {repr(ans)}; len(dp) = {len(dp)};")
'thisisatest_stinger', 'testing123testingthing' ->> 'sting'; len(dp) = 48; 'thisisatest_stinger', 'thisis' ->> 'thisis'; len(dp) = 19; 'testing123testingthing', 'thisis' ->> 'thi'; len(dp) = 16; 'thisisatest_stinger', 'testing123testingthing', 'thisis' ->> 'thi'; len(dp) = 55;
Timings
%timeit _ = lcs_idp(s0, s1)
1.5 ms ± 43.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit _ = lcs_idp(s0, s1, s2)
1.73 ms ± 59.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This new Incremental Dynamic Programming approach seems to be best the straight DP algorithm for three and more strings as it prunes the (hyper) cube of comparisons.
Conclusion.
The timings are based on a laughably small example. You really need to try the routines on your own data. lcs_ss
and lcs_idp
should be checked with your own sample data to see what is right for
you. They both try and cut down on memoryand computations, but your type of data might make one outclas the other.
Note: There could be another algorithm that makes use of suffix trees...
END.
No comments:
Post a Comment