Mainly Tech projects on Python and Electronic Design Automation.

Tuesday, July 28, 2009

The case of the disappearing over-bar

Rosetta Code has a task which asks you to reverse a Unicode string
correctly. I had glanced at the Python solution and thought nothing of
it until someone did something similar in the R language and stated
that it may be incorrect for the given pattern, which includes an
over-bar (my description) over the f in "as⃝df̅" (see the
original article for the true Unicode string).

I cut-and-pasted the Python solution and the test string into the
Python 3.1 IDLE shell and decided to test it:

Python 3.1 (r31:73574, Jun 26 2009, 20:21:35) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> # I cut and paste the following string with an over-bar on the f
>>> x = input()
>>> # Just showing x gives the over-bar on the f
>>> x
>>> # Print it and it is fine.
>>> print(x)
>>> # Reverse x though, and the over-bar moves over the quote!
>>> x[::-1]
>>> # print the reversed x and it disappears altogether!
>>> print(x[::-1])

Now cut and paste the Unicode characters from Firefox for the input
statement, and make sure that:

>>> ['%x' % ord(char) for char in x]
['61', '73', '20dd', '64', '66', '305']

I ask you to check this because the Unicode might otherwise have been
lost in my editing while trying to present it to you above.
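If pasting from the browser proves unreliable, the same string can be built directly from the code points listed above. The following is a minimal sketch (the names codepoints, naive and naive_hex are mine, chosen only for illustration) which also shows why plain slicing goes wrong: the combining characters end up in front of the wrong base characters.

```python
# Build the test string from its code points instead of pasting it.
codepoints = ['61', '73', '20dd', '64', '66', '305']
x = ''.join(chr(int(n, 16)) for n in codepoints)

# Naive reversal treats every code point as an independent character.
naive = x[::-1]
naive_hex = ['%x' % ord(char) for char in naive]
print(naive_hex)  # ['305', '66', '64', '20dd', '73', '61']
```

In the reversed list U+0305 comes first, so in the shell it attaches to the opening quote of the repr, and U+20DD now follows the d rather than the s.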

A bit of scary reading introduced me to Unicode character classes. It
seems that Unicode characters are sometimes composed, and that if I
know which character gets composed with another (always the last
non-composed character to its left), then I should be able to form
composed groups of characters and then reverse the groups.

The WP article gave me the vocabulary, so a quick search in Python's
library documentation gave me the unicodedata module, which has the
name function. I am not sure if this will work in every case, but from
the WP article and this experimentation:

>>> import unicodedata
>>> [(c, '%s' % unicodedata.name(c, 0xfffff)) for c in x]
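The experiment with unicodedata.name can be reproduced without pasting anything into the shell. This is a small sketch, assuming the six code points given earlier; the variable names are mine, and an empty string is used as the default for unnamed characters.

```python
import unicodedata

# Rebuild the test string from the code points given earlier.
x = ''.join(chr(int(n, 16)) for n in ['61', '73', '20dd', '64', '66', '305'])

# Look up the Unicode name of every code point ('' if it has none).
names = [unicodedata.name(c, '') for c in x]
for c, name in zip(x, names):
    print('%04x' % ord(c), name)
# 0061 LATIN SMALL LETTER A
# 0073 LATIN SMALL LETTER S
# 20dd COMBINING ENCLOSING CIRCLE
# 0064 LATIN SMALL LETTER D
# 0066 LATIN SMALL LETTER F
# 0305 COMBINING OVERLINE
```

Only the two characters that attach to their left-hand neighbour carry the word COMBINING in their names.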

I think I can group by the presence of the word COMBINING in the name
of a character and so produced the following reversal function:

'''
Reverse a Unicode string with proper handling of combining characters
'''

import unicodedata

def ureverse(ustring):
    '''
    Reverse a string including unicode combining characters

    >>> ucode = ''.join( chr(int(n, 16))
    ...                  for n in ['61', '73', '20dd', '64', '66', '305'] )
    >>> ucoderev = ureverse(ucode)
    >>> ['%x' % ord(char) for char in ucoderev]
    ['66', '305', '64', '73', '20dd', '61']
    '''
    groupedchars = []
    uchar = list(ustring)
    while uchar:
        # A combining character joins the group of the character to its left.
        # The groupedchars check avoids an IndexError if the string starts
        # with a combining character.
        if groupedchars and 'COMBINING' in unicodedata.name(uchar[0], ''):
            groupedchars[-1] += uchar.pop(0)
        else:
            # Any other character starts a new group.
            groupedchars.append(uchar.pop(0))
    # Grouped reversal
    groupedchars = groupedchars[::-1]

    return ''.join(groupedchars)

if __name__ == '__main__':
    ucode = ''.join( chr(int(n, 16))
                     for n in ['61', '73', '20dd', '64', '66', '305'] )
    ucoderev = ureverse(ucode)
    print (ucode)
    print (ucoderev)


It works for the given example text (Try running it to see the output -
I've given up trying to work out what characters you might see).
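Since I was unsure whether testing for COMBINING in the name covers every case, here is a hedged alternative sketch that tests the general category instead: unicodedata.category returns a code starting with 'M' for marks (Mn, Mc, Me). The function name ureverse_by_category is hypothetical, and this is still only an approximation; it does not implement the full grapheme cluster rules of the Unicode standard.

```python
import unicodedata

def ureverse_by_category(ustring):
    'Reverse a string, keeping mark characters (category M*) with their base'
    groups = []
    for ch in ustring:
        # Marks join the group of the character to their left, if there is one.
        if groups and unicodedata.category(ch).startswith('M'):
            groups[-1] += ch
        else:
            groups.append(ch)
    return ''.join(reversed(groups))

ucode = ''.join(chr(int(n, 16))
                for n in ['61', '73', '20dd', '64', '66', '305'])
result = ['%x' % ord(char) for char in ureverse_by_category(ucode)]
print(result)  # ['66', '305', '64', '73', '20dd', '61']

# A mark whose name lacks the word COMBINING: DEVANAGARI VOWEL SIGN AA.
deva = ureverse_by_category('a\u0915\u093e')
print(deva == '\u0915\u093ea')  # True
```

The second example is one where a name-based test would split the vowel sign from its consonant, while the category-based test keeps them together.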

Gosh, an attractive Unicode issue! Whatever next.

