which asks you to reverse a Unicode string
correctly. I had glanced at the Python
solution and thought nothing of
it until someone did something similar in the R language and stated
that it may be incorrect for the given pattern, which includes an
over-bar (my description) over the f in "as⃝df̅" (see the
original article for the true Unicode string).
I cut and pasted the Python solution and the test string into Python 3.1's
IDLE and decided to test it:
Python 3.1 (r31:73574, Jun 26 2009, 20:21:35) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> # I cut and paste the following string with an over-bar on the f
>>> x = input()
as⃝df̅
>>> # Just showing x gives the over-bar on the f
>>> x
'as⃝df̅'
>>> # Print it and it is fine.
>>> print(x)
as⃝df̅
>>> # Reverse x though, and the over-bar moves over the quote!
>>> x[::-1]
'̅fd⃝sa'
>>> # print the reversed x and it disappears altogether!
>>> print(x[::-1])
̅fd⃝sa
>>>
Now cut and paste the Unicode characters from Firefox for the input
statement, and make sure that:
>>> ['%x' % ord(char) for char in x]
['61', '73', '20dd', '64', '66', '305']
I ask because the Unicode might otherwise have been lost in my editing
while trying to present it to you above.
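If cut and paste proves unreliable, one way around it is to build the test string from the code points listed above instead. This is only a minimal sketch; the variable names `codepoints` and `naive` are mine. It also shows what plain slice reversal does to the order of the code points.

```python
# Build the test string from its code points so nothing is lost in copy/paste.
codepoints = ['61', '73', '20dd', '64', '66', '305']
x = ''.join(chr(int(n, 16)) for n in codepoints)

# Round-trip check: the string holds exactly the code points we started with.
assert ['%x' % ord(char) for char in x] == codepoints

# Plain slice reversal reverses code points one by one, so each combining
# mark ends up in front of the character it used to follow.
naive = ['%x' % ord(char) for char in x[::-1]]
print(naive)  # ['305', '66', '64', '20dd', '73', '61']
```

The overline (305) now comes first, with nothing before it to sit on, which matches the odd display in the session above.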
A bit of scary reading introduced me to
Unicode character classes. It seems that Unicode characters are
sometimes composed, and that if I know which characters get composed
with another (always the last non-combining character to the left),
then I should be able to form composed groups of characters and then
reverse the order of the groups.
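To make the idea concrete before involving any library, here is a rough sketch that simply assumes we already know which characters combine. The set `KNOWN_COMBINING` and the function name `group_then_reverse` are made up for illustration, and the set only covers the two marks in the test string.

```python
# Assumption for illustration only: we already know which characters combine.
KNOWN_COMBINING = {'\u20dd', '\u0305'}  # enclosing circle, overline


def group_then_reverse(text):
    """Attach each combining character to the character on its left,
    then reverse the order of the resulting groups."""
    groups = []
    for char in text:
        if char in KNOWN_COMBINING and groups:
            groups[-1] += char   # stays with its base character
        else:
            groups.append(char)  # a new group (also covers a leading mark)
    return ''.join(reversed(groups))


result = group_then_reverse('as\u20dddf\u0305')
print(['%x' % ord(c) for c in result])
# ['66', '305', '64', '73', '20dd', '61']
```

Each mark still follows its own base character after the reversal; what remains is finding a general way to decide which characters combine.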
The WP article gave me the vocabulary, so a quick search in Python's
library gave me the unicodedata module, which has the name
function. I am not sure if this will work in every case, but from the
WP article and this experimentation:
>>> [(c,'%s' % unicodedata.name(c,0xfffff)) for c in x]
[('a', 'LATIN SMALL LETTER A'), ('s', 'LATIN SMALL LETTER S'), ('⃝', 'COMBINING ENCLOSING CIRCLE'), ('d', 'LATIN SMALL LETTER D'), ('f', 'LATIN SMALL LETTER F'), ('̅', 'COMBINING OVERLINE')]
I think I can group by the presence of the word COMBINING in the name
of a character and so produced the following reversal function:
'''
Reverse a Unicode string with proper handling of combining characters
'''

import unicodedata


def ureverse(ustring):
    '''
    Reverse a string including unicode combining characters

    Example:
        >>> ucode = ''.join( chr(int(n, 16))
        ...                  for n in ['61', '73', '20dd', '64', '66', '305'] )
        >>> ucoderev = ureverse(ucode)
        >>> ['%x' % ord(char) for char in ucoderev]
        ['66', '305', '64', '73', '20dd', '61']
        >>>
    '''
    groupedchars = []
    uchar = list(ustring)
    while uchar:
        # Attach a combining character to the group on its left. A leading
        # combining character has no base, so it starts a group of its own.
        if groupedchars and 'COMBINING' in unicodedata.name(uchar[0], ''):
            groupedchars[-1] += uchar.pop(0)
        else:
            groupedchars.append(uchar.pop(0))
    # Grouped reversal
    groupedchars = groupedchars[::-1]

    return ''.join(groupedchars)


if __name__ == '__main__':
    ucode = ''.join( chr(int(n, 16))
                     for n in ['61', '73', '20dd', '64', '66', '305'] )
    ucoderev = ureverse(ucode)
    print(ucode)
    print(ucoderev)
It works for the given example text (try running it to see the output;
I've given up trying to work out what characters you might see).
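Since the name test may not hold in every case, here is a hedged sketch of another signal the unicodedata module offers: the general category, where Mn, Mc and Me are the "Mark" categories. The function name `ureverse_by_category` is hypothetical and this is a variant to experiment with, not a replacement for the function above. Printing code points and properties rather than the characters themselves also avoids relying on how a terminal or browser renders them.

```python
import unicodedata

x = 'as\u20dddf\u0305'

# Show three per-character signals side by side:
# general category, canonical combining class, and name.
for char in x:
    print('%04x' % ord(char),
          unicodedata.category(char),
          unicodedata.combining(char),
          unicodedata.name(char, '<no name>'))


def ureverse_by_category(ustring):
    """Hypothetical variant of ureverse that groups on the Mark categories."""
    groups = []
    for char in ustring:
        # Mn, Mc and Me are the three "Mark" general categories.
        if groups and unicodedata.category(char).startswith('M'):
            groups[-1] += char
        else:
            groups.append(char)
    return ''.join(reversed(groups))


rev = ureverse_by_category(x)
print(['%x' % ord(char) for char in rev])
# ['66', '305', '64', '73', '20dd', '61']
```

Note that `unicodedata.combining` reports 0 for the enclosing circle (U+20DD), so testing the combining class alone would miss it, whereas its category is Me. Neither the name test nor the category test amounts to full grapheme cluster handling, so treat both as approximations.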
Gosh, an attractive Unicode issue! Whatever next.