Mainly Tech projects on Python and Electronic Design Automation.

Tuesday, July 28, 2009

The case of the disappearing over-bar

Rosetta Code has a task
which asks you to reverse a Unicode string
correctly. I had glanced at the Python
solution
and thought nothing of
it until someone did something similar in the R language and stated
that it may be incorrect for the given pattern which includes an
over-bar (my description) over the f in "as⃝df̅" (See the
original article
for the true Unicode string).




I cut and pasted the Python solution and the test string into Python 3.1's
IDLE to test it:




Python 3.1 (r31:73574, Jun 26 2009, 20:21:35) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> # I cut and paste the following string with an over-bar on the f
>>> x = input()
as⃝df̅
>>> # Just showing x gives the over-bar on the f
>>> x
'as⃝df̅'
>>> # Print it and it is fine.
>>> print(x)
as⃝df̅
>>> # Reverse x though, and the over-bar moves over the quote!
>>> x[::-1]
'̅fd⃝sa'
>>> # print the reversed x and it disappears altogether!
>>> print(x[::-1])
̅fd⃝sa
>>>



If you want to try this yourself, cut and paste the Unicode characters
from Firefox for the input statement, and make sure that:


>>> ['%x' % ord(char) for char in x]
['61', '73', '20dd', '64', '66', '305']


I show the code points because the Unicode might otherwise be lost in my
editing to try and present it to you above.
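If pasting does mangle the characters, a small sketch like the one below sidesteps the problem by building the same string from escapes, and also shows what plain slicing does to the code points. The helper name `codepoints` is mine, just for illustration.

```python
# Build the test string from explicit escapes so nothing is lost in copy-and-paste.
x = 'as\u20dddf\u0305'

def codepoints(s):
    """Return the code points of s as lowercase hex strings."""
    return ['%x' % ord(ch) for ch in s]

print(codepoints(x))        # ['61', '73', '20dd', '64', '66', '305']
print(codepoints(x[::-1]))  # ['305', '66', '64', '20dd', '73', '61']
```

In the sliced version 305 (the overline) now comes first, so it has no f before it to sit on and attaches to whatever precedes it instead, which is the opening quote in the interactive session.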



A bit of scary reading introduced me to
Unicode character classes. It seems that Unicode characters are
sometimes composed: a combining character attaches to another character
(always the last non-combining character to its left). So if I know which
characters combine, I should be able to form groups of a base character
plus its combining characters and then reverse the groups.
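To make the idea of base and combining characters concrete, here is a rough sketch that asks the standard `unicodedata` module about each character in the test string. The helper name `describe` is my own; the general category and canonical combining class are the module's `category` and `combining` functions.

```python
import unicodedata

x = 'as\u20dddf\u0305'

def describe(s):
    """Return (hex code point, general category, combining class) per character."""
    return [('%x' % ord(ch), unicodedata.category(ch), unicodedata.combining(ch))
            for ch in s]

for row in describe(x):
    print(row)
```

Categories beginning with M (Mn, Mc, Me) are the mark categories. Note that the enclosing circle, U+20DD, is a mark with combining class 0, so the combining class alone does not identify every character that attaches to its neighbour.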




The WP article gave me the vocabulary, so a quick search in Python's
library gave me the unicodedata
module, which has the name
function. I am not sure if this will work in every case, but from the
WP article and this experimentation:


>>> [(c,'%s' % unicodedata.name(c,0xfffff)) for c in x]
[('a', 'LATIN SMALL LETTER A'), ('s', 'LATIN SMALL LETTER S'), ('⃝', 'COMBINING ENCLOSING CIRCLE'), ('d', 'LATIN SMALL LETTER D'), ('f', 'LATIN SMALL LETTER F'), ('̅', 'COMBINING OVERLINE')]


I think I can group by the presence of the word COMBINING in the name
of a character and so produced the following reversal function:


'''
Reverse a Unicode string with proper handling of combining characters
'''

import unicodedata

def ureverse(ustring):
    '''
    Reverse a string including unicode combining characters

    Example:
        >>> ucode = ''.join( chr(int(n, 16))
        ...                  for n in ['61', '73', '20dd', '64', '66', '305'] )
        >>> ucoderev = ureverse(ucode)
        >>> ['%x' % ord(char) for char in ucoderev]
        ['66', '305', '64', '73', '20dd', '61']
    '''
    groupedchars = []
    uchar = list(ustring)
    while uchar:
        # Attach a combining character to the group before it; a leading
        # combining character has no group to join, so it starts its own.
        if groupedchars and 'COMBINING' in unicodedata.name(uchar[0], ''):
            groupedchars[-1] += uchar.pop(0)
        else:
            groupedchars.append(uchar.pop(0))
    # Grouped reversal
    groupedchars = groupedchars[::-1]

    return ''.join(groupedchars)

if __name__ == '__main__':
    ucode = ''.join( chr(int(n, 16))
                     for n in ['61', '73', '20dd', '64', '66', '305'] )
    ucoderev = ureverse(ucode)
    print(ucode)
    print(ucoderev)


 

It works for the given example text (Try running it to see the output -
I've given up trying to work out what characters you might see).
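One variation, offered as a sketch rather than a tested replacement: group on the Unicode general category instead of the character name. The assumption here is that any character whose category starts with M (Mn, Mc or Me) attaches to the character before it. That is still much simpler than the full grapheme cluster rules in the Unicode standard, so it will not cope with everything (joined emoji sequences, for example). The function name `ureverse_by_category` is my own.

```python
import unicodedata

def ureverse_by_category(ustring):
    """Reverse ustring, keeping mark characters attached to their base character."""
    groups = []
    for ch in ustring:
        # Mn, Mc and Me are the Unicode "mark" general categories.
        if groups and unicodedata.category(ch).startswith('M'):
            groups[-1] += ch
        else:
            groups.append(ch)
    return ''.join(reversed(groups))

ucode = 'as\u20dddf\u0305'
print(['%x' % ord(ch) for ch in ureverse_by_category(ucode)])
# ['66', '305', '64', '73', '20dd', '61']
```

Iterating once and appending avoids the repeated `pop(0)` on a list, and the `groups and ...` test means a string that starts with a mark does not raise an IndexError.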

Gosh, an attractive Unicode issue! Whatever next.

4 comments:

  1. Nice example!

    I think it might be better to check

    unicodedata.combining(uchar[0]) != 0

    rather than looking for 'COMBINING' in the character’s name, as there are non-Western European combining characters that don’t satisfy your condition.

    The first example I found by iteration is U+0591 (HEBREW ACCENT ETNAHTA), but there are others, all in scripts I know next to nothing about, like Hebrew, Arabic, Balinese musical symbols, and more! :)

  2. thanks for doing all the hard work Paddy! can the function be used in an open source project (gpl v3?)

    dent

  3. Hi Anon. The license conditions on Rosetta Code should suffice: http://rosettacode.org/wiki/Reverse_a_string#Unicode_reversal

    - Paddy.

