Sunday, June 21, 2020

Successive Approximation, Hashes in high dimensions; Geohashing.


Successive Approximation

Given a named variable and the range in which its value must exist, generates successive binary digits which can be used to converge on the value successive binary digits approximating the value to ever greater precision.

Binary search for a value.

Starting with the original lower and upper range boundaries for the named variable, the value is compared to the range midpoint: if its lower than the midpoint then the range maximum is updated to become the midpoint as the value is in this half; otherwise the range minimum is updated with the midpoint to narrow therange containing the value by a half on each round. If the lower half of the range is used then generate a zero otherwise generate a one for this "round" in the binary search.

[1]

There is a TUTOR mode to turn on and before giving the encoder a whirl.

[2]
Successively encoding 0.6666666666666666 between [-2, 4]: is <= 1.0 is > -0.5 is > 0.25 is > 0.625 is <= 0.8125 is <= 0.71875 is <= 0.671875 is > 0.6484375 is > 0.66015625 Resulting bit stream: [0, 1, 1, 1, 0, 0, 0, 1, 1].
([0, 1, 1, 1, 0, 0, 0, 1, 1], 0.66015625, 0.671875)

You should be able to see how the encoder successively approaches the value.

Lets try some corner cases:

[3]
Successively encoding 0.5 between [0, 1]: is <= 0.5 is > 0.25 is > 0.375 is > 0.4375 is > 0.46875 is > 0.484375 is > 0.4921875 is > 0.49609375 Resulting bit stream: [0, 1, 1, 1, 1, 1, 1, 1].
([0, 1, 1, 1, 1, 1, 1, 1], 0.49609375, 0.5)
[4]
Successively encoding 0 between [0, 1]: is <= 0.5 is <= 0.25 is <= 0.125 is <= 0.0625 is <= 0.03125 is <= 0.015625 is <= 0.0078125 is <= 0.00390625 Resulting bit stream: [0, 0, 0, 0, 0, 0, 0, 0].
([0, 0, 0, 0, 0, 0, 0, 0], 0, 0.00390625)
[5]
Successively encoding 1 between [0, 1]: is > 0.5 is > 0.75 is > 0.875 is > 0.9375 is > 0.96875 is > 0.984375 is > 0.9921875 is > 0.99609375 Resulting bit stream: [1, 1, 1, 1, 1, 1, 1, 1].
([1, 1, 1, 1, 1, 1, 1, 1], 0.99609375, 1)

Bitstream decoding

With the encoders bitstream and theinitial range, the initial value can be recovered to ever greater accuracy as more bits are processed. The decoder takes the same decisions as the encoder when successively bisecting the initial range.

[6]

Lets encode a value then decode the whole of its bitstream

[7]
Successively encoding 0.6666666666666666 between [-2, 4]: is <= 1.0 is > -0.5 is > 0.25 is > 0.625 is <= 0.8125 is <= 0.71875 is <= 0.671875 is > 0.6484375 is > 0.66015625 is > 0.666015625 is <= 0.6689453125 is <= 0.66748046875 is <= 0.666748046875 is > 0.6663818359375 is > 0.66656494140625 Resulting bit stream: [0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1]. Successively decoding [0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1] between [-2, 4]: in range [-2, 1.0] in range [-0.5, 1.0] in range [0.25, 1.0] in range [0.625, 1.0] in range [0.625, 0.8125] in range [0.625, 0.71875] in range [0.625, 0.671875] in range [0.6484375, 0.671875] in range [0.66015625, 0.671875] in range [0.666015625, 0.671875] in range [0.666015625, 0.6689453125] in range [0.666015625, 0.66748046875] in range [0.666015625, 0.666748046875] in range [0.6663818359375, 0.666748046875] in range [0.66656494140625, 0.666748046875] Resulting value: 0.666656494140625 ±9.1552734375e-05.

Truncating binary digits

The full 15 bits of the encoding above give a result that is correct when rounded to 5 decimal digits. If a smaller number of bits (from the left), are decoded then the value should be reproduced to a lesser level of precision. Lets try:

[8]
# Decode the first 10 bits. Successively decoding [0, 1, 1, 1, 0, 0, 0, 1, 1, 1] between [-2, 4]: in range [-2, 1.0] in range [-0.5, 1.0] in range [0.25, 1.0] in range [0.625, 1.0] in range [0.625, 0.8125] in range [0.625, 0.71875] in range [0.625, 0.671875] in range [0.6484375, 0.671875] in range [0.66015625, 0.671875] in range [0.666015625, 0.671875] Resulting value: 0.6689453125 ±0.0029296875. # Decode only the first 5 bits. Successively decoding [0, 1, 1, 1, 0] between [-2, 4]: in range [-2, 1.0] in range [-0.5, 1.0] in range [0.25, 1.0] in range [0.625, 1.0] in range [0.625, 0.8125] Resulting value: 0.71875 ±0.09375.

Higher dimensions

What if we have more than one named variables values to encode in a similar fashion - preserving the ability to truncate on the right, (or generate more bits), to affect precision. Well lets give each variable, which we can think of as a "dimension" a name, a range, and all the dimensions a constant ordering.

If we assume the same precision for each dimension and interleave the bitstream for the dimensions in encoding, then the combined interleaved bitstream could be truncated as before ; de-interleaved, then individually decoded back to generate values for each dimension.

[9]
[10]
Encode then decode (-1.5, 2.7) -1.5001220703125 ±0.00012207 2.700042724609375 ±0.000152588

Again, we can truncate and take the leftmost 14 bits of the encoding to decode to less precision in both values

[11]
-1.53125 ±0.03125) 2.6953125 ±0.0390625)

A closer look at interleaving.

Create separate bitstreams for two dimensions and compare to that of the combined.

[12]
Dimensions: (Dim(name='s', rangemin=-4, rangemax=4), Dim(name='t', rangemin=-5, rangemax=5)) Values: (-1.5, 2.7)
[13]
Successively encoding -1.5 between [-4, 4]: is <= 0.0 is > -2.0 is <= -1.0 is <= -1.5 is > -1.75 is > -1.625 is > -1.5625 Resulting bit stream: [0, 1, 0, 0, 1, 1, 1].
[14]
Successively encoding 2.7 between [-5, 5]: is > 0.0 is > 2.5 is <= 3.75 is <= 3.125 is <= 2.8125 is > 2.65625 is <= 2.734375 Resulting bit stream: [1, 1, 0, 0, 0, 1, 0].
[15]
[16]
[0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0]
[17]
[0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0]
[18]
Successively encoding -1.5 between [-4, 4]: is <= 0.0 is > -2.0 is <= -1.0 is <= -1.5 is > -1.75 is > -1.625 is > -1.5625 Resulting bit stream: [0, 1, 0, 0, 1, 1, 1]. Successively encoding 2.7 between [-5, 5]: is > 0.0 is > 2.5 is <= 3.75 is <= 3.125 is <= 2.8125 is > 2.65625 is <= 2.734375 Resulting bit stream: [1, 1, 0, 0, 0, 1, 0]. OK

On precision.

The code works to the precision of the Python datatypes passed into it. The ranges so far have used Pythons 64 bit floating point datatype for range and value. If decimals or fractions are used then much more precision could be achieved with the same code due to duck typing.

Human readability

[19]
[0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0]

These strings of binary bits are hard for us to accurately remember and transcribe. Traditionally we have used octal, base-8; and hexadecimal, base 16 number bases to get shorter, more memorable, representations for binary numbers.

I am going to use abase 32 numbering scheme to encode the bits. Every five successive bits will become one of 32 different characters. The characters used are the digits zero to nine then 22 lowercase ASCII letters missing out certain letters, (a, i, l, and o), that could be mistaken for numbers.

Users will more naturally express precision in terms of the number of characters in the representation and so the underlying bit precision will be in multiples of 5-bits.

[20]
[21]
[22]
Dimensions: (Dim(name='s', rangemin=-4, rangemax=4), Dim(name='t', rangemin=-5, rangemax=5)) Values: (-1.5, 2.7)
[23]
Values encode to 'f2x'
[24]
Then decode to: -1.515625 ±0.015625 2.6953125 ±0.0390625

Geohashes

Geohashes are a convenient way of encoding geographical coordinates as a character string.

I have used the same textual mapping so that the following dimensional data will reproduce Geohash en/decoding.

[25]

Unfortunately Geohashes, whilst specifying coordinates in the order latitude then longitude; alternate bits from longitude and then latidude - i.e. in reverse, so adjustments are needed.

[26]
[27]
'gcpue5'
[28]
Geohash 'gcpue5' decodes to: latitude : 51.43524169921875 ±0.00274658 longitude : -0.2142333984375 ±0.00549316

External sites for testing Geohashes

You can enter the coordinates or the geohash above into this site and find that it points to the "Wimpledon All England Lawn Tennis and Croquet Club".

End Notes.

This blog post was written because the WIkipedia entry and a web search did nt turn up a good description of Geohashes for me to finish the Python Rosetta Code entry initially. After more reading I finally completed the Python entry.

I thought about generalising the method to more than two dimensions and also wanted to highlight the swithing of order in geohashing: lat/long normal coord order vs long/lat used when alternating the bits from the binary search.

Another generalisation might be to use n-ary search for a value producing a base-n "bit"; but I could not think of a need, and this post is already long.

END.

Not saved yet

python3 | launching