Mainly Tech projects on Python and Electronic Design Automation.

Friday, July 10, 2020

Merging large streams

Merging two or more huge sources of ordered data into one.
I was thinking about order in data and, more specifically, how easy it is for me to destroy order when processing input data, only to then need to apply the same ordering to the result. For example, it is so simple to just do something like:
merged_list = sorted(ordered_list1 + ordered_list2)
to merge ordered lists, especially when you know that Python's Timsort algorithm will search for, and optimise the sorting of, "runs" of pre-ordered values.
I was looking for an algorithm for data too big to process at once, or whose size is unknown, such as data feeds from pipes or ports, where merged values need to be generated as soon as possible, possibly to an output pipe/port.

Modelling

I model input streams as iterators generating increasing randomised ints. The merger is a generator function that consumes just enough of its input iterator streams to generate successive output values; its first value can be generated from reading the first value of each of its input iterators, for example.

Two input streams merged


Two separate state variables for the left and right input streams control how they are to be processed in the function's loop: is the stream empty? Do we have a value from the stream? Do we need to get the next value from the stream?
Only one value is read from any stream at any one time to yield an output merged value (after the initial read of one value from each input stream). The merge is independent of the size of the input streams.
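The notebook cell itself isn't shown here, so the following is only a minimal sketch of such a two-stream merge generator; the name merge2 and the sentinel technique are my own assumptions, not necessarily the post's code:

```python
def merge2(left, right):
    """Merge two ordered streams lazily, holding at most one value from each."""
    left, right = iter(left), iter(right)
    sentinel = object()  # marks an exhausted stream
    # Initial read of one value from each input stream.
    l = next(left, sentinel)
    r = next(right, sentinel)
    # While both streams are live, yield the smaller head and refill it.
    while l is not sentinel and r is not sentinel:
        if l <= r:
            yield l
            l = next(left, sentinel)
        else:
            yield r
            r = next(right, sentinel)
    # One stream is exhausted; drain whatever remains of the other.
    if l is not sentinel:
        yield l
        yield from left
    if r is not sentinel:
        yield r
        yield from right
```

At any point the generator holds only the two current head values, so the merge is independent of the input sizes.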

Multiple input streams merged


This merges any number of input streams in a similar fashion. A list of input streams is given as an argument, and lists of state-tracking values are used instead of individual variables. Streams are removed from the lists when they are emptied and, as before, if we get down to only one active input stream then we just yield from it.
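The post names this function merge_s later on; since the cell isn't shown, the following is a sketch of the approach as described (parallel lists of iterators and their current head values, with emptied streams removed) rather than the original code:

```python
def merge_s(streams):
    """Merge any number of ordered streams, holding one value per live stream."""
    streams = [iter(s) for s in streams]
    sentinel = object()  # marks an exhausted stream
    # Prime one head value per stream, dropping streams empty from the start.
    heads = [next(s, sentinel) for s in streams]
    streams = [s for s, h in zip(streams, heads) if h is not sentinel]
    heads = [h for h in heads if h is not sentinel]
    while len(streams) > 1:
        # Yield the smallest current head, then refill from that stream.
        i = min(range(len(heads)), key=heads.__getitem__)
        yield heads[i]
        nxt = next(streams[i], sentinel)
        if nxt is sentinel:
            # Stream emptied: remove it from both tracking lists.
            del streams[i], heads[i]
        else:
            heads[i] = nxt
    if streams:
        # Only one active stream left: yield its head, then the rest directly.
        yield heads[0]
        yield from streams[0]
```

Note that priming the heads first makes the degenerate inputs (no streams, or only empty streams) fall through cleanly.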

Testing

Given two ordered lists, we can generate a target answer, merged_lists, which is just the two lists concatenated and sorted. We can create input streams by calling iter on each list for the merge generator functions to work from, then check the output by turning it into a list and comparing it to merged_lists.

Given left and right streams of [1, 3, 5, 7] and [0, 2, 4, 6, 7, 11] as iterators:
Streaming merges of two streams correct
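The shakedown test could be reproduced along these lines; as the notebook cells aren't shown, I use the stdlib heapq.merge as a stand-in for the merge generator under test:

```python
import heapq

left, right = [1, 3, 5, 7], [0, 2, 4, 6, 7, 11]
# Target answer: the two lists concatenated and sorted.
merged_lists = sorted(left + right)
# Feed the merger iterators, then collect and compare its output.
result = list(heapq.merge(iter(left), iter(right)))
assert result == merged_lists
print("Streaming merges of two streams correct")
```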

Testing three streams

Three streams of different lengths are tested in all orderings of the streams.

Given 3 streams of iterators: [0, 1, 3, 6, 9], [4, 7, 10, 11, 12, 16], [2, 5, 8, 10, 13, 14, 15, 20, 21]:
Streaming merges of 3 streams correct
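Testing all orderings of the three streams could look like the sketch below; again heapq.merge stands in for the merge generator, since the original cell isn't shown:

```python
import heapq
from itertools import permutations

streams = ([0, 1, 3, 6, 9],
           [4, 7, 10, 11, 12, 16],
           [2, 5, 8, 10, 13, 14, 15, 20, 21])
# The target is independent of the order the streams are supplied in.
merged_lists = sorted(sum(streams, []))
# Check every ordering of the input streams against the same target.
for ordering in permutations(streams):
    result = list(heapq.merge(*(iter(s) for s in ordering)))
    assert result == merged_lists
print("Streaming merges of 3 streams correct")
```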

Constrained Random testing

Generate N tests with a randomised number of input streams, randomised input stream lengths, and randomised ordered input stream values.
As before, data is generated — this time randomly — as lists; the lists are summed and sorted to form merge_target, then converted to streams and fed to merge_s.
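A sketch of such a constrained-random soak test follows; the seed, ranges, and the use of heapq.merge as a stand-in for merge_s are my assumptions, not the post's exact parameters:

```python
import heapq
import random

random.seed(42)  # assumed seed, for reproducibility
for _ in range(10_000):
    # Randomised number of streams, of randomised lengths, of ordered values.
    n_streams = random.randint(0, 4)
    lists = [sorted(random.randint(0, 99) for _ in range(random.randint(0, 9)))
             for _ in range(n_streams)]
    # Sum and sort the lists to form the target, then feed streams to the merger.
    merge_target = sorted(sum(lists, []))
    result = list(heapq.merge(*(iter(lst) for lst in lists)))
    assert result == merge_target
print("Soak test of 10_000 random sets of random streams correct")
```

Zero-length stream counts and zero-length streams are deliberately included in the random ranges, since those are the cases most likely to expose bugs.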

## RANDOM SOAK TEST ##
Soak test of 10_000 random sets of random streams correct
The above passes now. During development it did fail on the equivalent of merge_s([iter([])]). I created the following set of extra 'directed' tests of this and similar inputs.

Directed tests


## DIRECTED TESTS ##
Directed tests pass.
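The directed edge cases could be sketched like this; the specific case list is my assumption (built around the merge_s([iter([])]) failure mentioned above), with heapq.merge again standing in for merge_s:

```python
import heapq

directed = [
    [],             # no streams at all
    [[]],           # one empty stream -- the case that failed in development
    [[], []],       # several empty streams
    [[1]],          # a single one-value stream
    [[], [1, 2]],   # an empty stream alongside a populated one
]
for lists in directed:
    target = sorted(sum(lists, []))
    assert list(heapq.merge(*(iter(lst) for lst in lists))) == target
print("Directed tests pass.")
```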

Review

I created a generator function to merge multiple ordered input streams that needs little memory, yielding values while holding only one value from each of its input streams.
I ran simple "shakedown" tests, then built a Constrained Random framework of tests to disclose hidden bugs.

End.


