Tuesday, July 28, 2009

The case of the disappearing over-bar

Rosetta-code has a task
which asks you to reverse a Unicode string
correctly. I had glanced at the Python
solution
and thought nothing of
it until someone did something similar in the R language and stated
that it may be incorrect for the given pattern which includes an
over-bar (my description) over the f in "as⃝df̅" (See the
orginal article
for the true Unicode string).




I cut-n-pasted the Python solution and the test string into Python 3.1
idle and decided to test it:




Python 3.1 (r31:73574, Jun 26 2009, 20:21:35) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> # I cut and paste the following string with an over-bar on the f
>>> x = input()
as⃝df̅
>>> # Just showing x gives the over-bar on the f
>>> x
'as⃝df̅'
>>> # Print it and it is fine.
>>> print(x)
as⃝df̅
>>> # Reverse x though, and the over-bar movesover the quote!
>>> x[::-1]
'̅fd⃝sa'
>>> # print the reversed x and it disappears altogether!
>>> print(x[::-1])
̅fd⃝sa
>>>



|NowCut and paste the unicode characters from firefox for the input
statement, and make sure that:


>>> ['%x' % ord(char) for char in x]
['61', '73', '20dd', '64', '66', '305']


As the unicode might otherwise be lost in my editing to try and present
it to you above.



A bit of scary reading introduced me to
Unicode character classes. It seems that Unicode characters are
sometimes composed, and that if I know what character gets composed
with another, (always the last non-composed character to the left),
then you should be able to form composed groups of characters and then
reverse them.




The WP article gave me the vocabulary, so a quick search in Pythons
library gave me the unicodedata
module which has the name
function. I am not sure if this will work in every case, but from the
WP article and this experimentation:


>>> [(c,'%s' % unicodedata.name(c,0xfffff)) for c in x]
[('a', 'LATIN SMALL LETTER A'), ('s', 'LATIN SMALL LETTER S'), ('⃝', 'COMBINING ENCLOSING CIRCLE'), ('d', 'LATIN SMALL LETTER D'), ('f', 'LATIN SMALL LETTER F'), ('̅', 'COMBINING OVERLINE')]


I think I can group by the presence of the word COMBINING in the name
of a character and so produced the following reversal function:


'''
Reverse a Unicode string with proper handling of combining characters
'''

import unicodedata

def ureverse(ustring):
'''
Reverse a string including unicode combining characters

Example:
>>> ucode = ''.join( chr(int(n, 16))
for n in ['61', '73', '20dd', '64', '66', '305'] )
>>> ucoderev = ureverse(ucode)
>>> ['%x' % ord(char) for char in ucoderev]
['66', '305', '64', '73', '20dd', '61']
>>>
'''
groupedchars = []
uchar = list(ustring)
while uchar:
if 'COMBINING' in unicodedata.name(uchar[0], ''):
groupedchars[-1] += uchar.pop(0)
else:
groupedchars.append(uchar.pop(0))
# Grouped reversal
groupedchars = groupedchars[::-1]

return ''.join(groupedchars)

if __name__ == '__main__':
ucode = ''.join( chr(int(n, 16))
for n in ['61', '73', '20dd', '64', '66', '305'] )
ucoderev = ureverse(ucode)
print (ucode)
print (ucoderev)


 

It works for the given example text (Try running it to see the output -
I've given up trying to work out what characters you might see).

Gosh, an attractive Unicode issue! Whatever next.