Go deh!: July 2009

Rosetta Code has a task
which asks you to reverse a Unicode string
correctly. I had glanced at the Python
solution and thought nothing of
it until someone did something similar in the R language and stated
that it may be incorrect for the given pattern, which includes an
over-bar (my description) over the f in "as⃝df̅". (See the
original article for the true Unicode string.)

I cut and pasted the Python solution and the test string into the Python 3.1
IDLE shell and decided to test it:

Python 3.1 (r31:73574, Jun 26 2009, 20:21:35) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> # I cut and paste the following string with an over-bar on the f
>>> x = input()
as⃝df̅
>>> # Just showing x gives the over-bar on the f
>>> x
'as⃝df̅'
>>> # Print it and it is fine.
>>> print(x)
as⃝df̅
>>> # Reverse x though, and the over-bar moves over the quote!
>>> x[::-1]
'̅fd⃝sa'
>>> # print the reversed x and it disappears altogether!
>>> print(x[::-1])
̅fd⃝sa
>>>

Now cut and paste the Unicode characters from Firefox for the input
statement, and make sure that:

>>> ['%x' % ord(char) for char in x]
['61', '73', '20dd', '64', '66', '305']

This check matters because the Unicode might otherwise have been lost in my
editing to try and present it to you above.
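To sidestep the copy-and-paste problem altogether, the test string can be built straight from those code points. This is only a small sketch; the names `codepoints` and `naive` are mine, chosen for illustration:

```python
# Build the test string from its code points rather than pasting it.
codepoints = ['61', '73', '20dd', '64', '66', '305']
x = ''.join(chr(int(cp, 16)) for cp in codepoints)

# Slice reversal works on individual code points, so each combining
# character ends up in front of a different character than before.
naive = x[::-1]
print(['%x' % ord(char) for char in naive])
# prints ['305', '66', '64', '20dd', '73', '61']
```

Comparing hex code points like this avoids relying on how any particular terminal or browser renders the combining characters.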

A bit of scary reading introduced me to
Unicode character classes. It seems that Unicode characters are
sometimes composed, and that if I know which character gets composed
with another (always the last non-composed character to its left),
then I should be able to form composed groups of characters and then
reverse the groups.
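As a rough illustration of those character classes, Python's standard `unicodedata` module can report the general category of each character; categories beginning with `M` (Mn, Mc, Me) are marks that attach to a preceding character. The helper name `is_mark` below is hypothetical, and this is a sketch rather than a complete treatment of Unicode text segmentation:

```python
import unicodedata

# a, s, COMBINING ENCLOSING CIRCLE, d, f, COMBINING OVERLINE
x = '\u0061\u0073\u20dd\u0064\u0066\u0305'

# General category of every character in the test string.
categories = [unicodedata.category(c) for c in x]
print(categories)
# prints ['Ll', 'Ll', 'Me', 'Ll', 'Ll', 'Mn']

def is_mark(char):
    '''True if char is in one of the mark categories (hypothetical helper).'''
    return unicodedata.category(char).startswith('M')

print([is_mark(c) for c in x])
# prints [False, False, True, False, False, True]
```

Note that `unicodedata.combining()` is not enough for this job on its own: it returns the canonical combining class, which is 0 for the enclosing circle U+20DD even though that character is a mark.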

The WP article gave me the vocabulary, so a quick search in Python's
library gave me the unicodedata
module, which has the name
function. I am not sure if this will work in every case, but from the
WP article and this experimentation:

>>> [(c,'%s' % unicodedata.name(c,0xfffff)) for c in x]
[('a', 'LATIN SMALL LETTER A'), ('s', 'LATIN SMALL LETTER S'), ('⃝', 'COMBINING ENCLOSING CIRCLE'), ('d', 'LATIN SMALL LETTER D'), ('f', 'LATIN SMALL LETTER F'), ('̅', 'COMBINING OVERLINE')]

I think I can group by the presence of the word COMBINING in the name
of a character and so produced the following reversal function:

'''
  Reverse a Unicode string with proper handling of combining characters
'''

import unicodedata

def ureverse(ustring):
    '''
    Reverse a string including unicode combining characters

    Example:
        >>> ucode = ''.join( chr(int(n, 16))
        ...                  for n in ['61', '73', '20dd', '64', '66', '305'] )
        >>> ucoderev = ureverse(ucode)
        >>> ['%x' % ord(char) for char in ucoderev]
        ['66', '305', '64', '73', '20dd', '61']
    '''
    groupedchars = []
    uchar = list(ustring)
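    while uchar:
        # A combining character joins the group of the character to its left.
        # If the string starts with one there is no group yet, so it starts
        # its own group instead of raising an IndexError.
        if groupedchars and 'COMBINING' in unicodedata.name(uchar[0], ''):
            groupedchars[-1] += uchar.pop(0)
        else:
            groupedchars.append(uchar.pop(0))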
    # Grouped reversal
    groupedchars = groupedchars[::-1]

    return ''.join(groupedchars)

if __name__ == '__main__':
    ucode = ''.join( chr(int(n, 16))
                     for n in ['61', '73', '20dd', '64', '66', '305'] )
    ucoderev = ureverse(ucode)
    print (ucode)
    print (ucoderev)

It works for the given example text. (Try running it to see the output;
I've given up trying to work out what characters you might see.)
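Since I am not sure that matching the word COMBINING in the name covers every case, here is a hedged alternative sketch that groups on the Unicode general category instead (any category starting with `M`). The function name `ureverse_by_category` is made up for this example, and it still does not handle every kind of grapheme cluster, only a base character followed by marks:

```python
import unicodedata

def ureverse_by_category(ustring):
    '''Reverse ustring, keeping each mark attached to the character before it.'''
    groups = []
    for char in ustring:
        # Categories Mn, Mc and Me are marks that modify the preceding
        # character. A mark at the very start begins its own group.
        if groups and unicodedata.category(char).startswith('M'):
            groups[-1] += char
        else:
            groups.append(char)
    return ''.join(reversed(groups))

ucode = ''.join(chr(int(n, 16))
                for n in ['61', '73', '20dd', '64', '66', '305'])
print(['%x' % ord(char) for char in ureverse_by_category(ucode)])
# prints ['66', '305', '64', '73', '20dd', '61']
```

Iterating over the string once also avoids the repeated `pop(0)` calls, which matter only for long inputs.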

Gosh, an attractive Unicode issue! Whatever next.

Tuesday, July 28, 2009

The case of the disappearing over-bar