Mainly Tech projects on Python and Electronic Design Automation.

Thursday, September 24, 2009

Natural Sorting

Natural sorting orders items in some way that is more than just the
order dictated by computer character codes. It is called natural
because the new ordering is supposed to mean more to a human consumer
than an ordering on base character codes .



I guess, what is becoming the canonical example is the ordering of
numbered files in a directory listing. A normal sorted directory
listing might show:

style="font-family: monospace;">bash$ for (( i=0;
i<111;i+=5)); do touch f$i.txt; done style="font-family: monospace;">
bash$ ls -1 style="font-family: monospace;">
f0.txt style="font-family: monospace;">
f10.txt style="font-family: monospace;">
f100.txt style="font-family: monospace;">
f105.txt style="font-family: monospace;">
f110.txt style="font-family: monospace;">
f15.txt style="font-family: monospace;">
f20.txt style="font-family: monospace;">
f25.txt style="font-family: monospace;">
f30.txt style="font-family: monospace;">
f35.txt style="font-family: monospace;">
f40.txt style="font-family: monospace;">
f45.txt style="font-family: monospace;">
f5.txt style="font-family: monospace;">
f50.txt style="font-family: monospace;">
f55.txt style="font-family: monospace;">
f60.txt style="font-family: monospace;">
f65.txt style="font-family: monospace;">
f70.txt style="font-family: monospace;">
f75.txt style="font-family: monospace;">
f80.txt style="font-family: monospace;">
f85.txt style="font-family: monospace;">
f90.txt style="font-family: monospace;">
f95.txt

It might be more friendly if the files were ordered so that the numeric
section increased steadily.



In other situations, such as in Movie listings, you sometimes see
common words such as "the" ignored for sorting purposes (or even
transferred to the end of the item so "The Silence of the Lambs" would
be listed as "Silence of the Lambs, The")



Ignoring case, and treating white-space characters as the same - and
multiple strings of whitespace characters as a single whitespace, might
also be useful.



Since my initial tentative steps into working with non-ASCII characters
I tried to consider sorting accented characters as if the accent was
not present; ligatures as separate characters, and made a stab at
tackling  what I call 'replacements' - where certain
characters, such as the German href="http://en.wikipedia.org/wiki/Scharfes_S">scharfes S
might be sorted as SS,.



I had a quick stab at creating a natural sort routine  and
added several natural sorting abilities. They might not all be useful
at once, but it should be straight forward to pick and choose what you
want n your natural sort..



As always, when working with non-ASCII character sets, I find it
difficult to translate my working multi-byte Python program to HTML,
but there should be enough surrounding information to ensure that any
multi-byte character can be worked-out.




  1 
color="#804040"> 2 # -*- coding: utf-8 -*-
color="#804040"> 3 # Not Python 3.x (Can't compare str and int)
color="#804040"> 4
color="#804040"> 5 '''
color="#804040"> 6 Natural sorting: Sorting of text that does more than rely on the
color="#804040"> 7 order of individual characters codes to make the finding of
color="#804040"> 8 individual strings easier for a human reader.
color="#804040"> 9
color="#804040"> 10 Some 'natural' orderings might include:
color="#804040"> 11 1) Ignore leading, trailing and multiple adjacent spaces
color="#804040"> 12 2) All space types equivalent.
color="#804040"> 13 3) Sorting without regard to case.
color="#804040"> 14 4) Sorting numeric portions of strings in numeric order:
color="#804040"> 15 foo9.txt before foo10.txt
color="#804040"> 16 As well as ... x9y99 before x9y100, before x10y0
color="#804040"> 17 ... (for any number of groups of integers in a string).
color="#804040"> 18 5) Title sorts: without regard to a leading, very common, word such
color="#804040"> 19 as 'The' in "The thirty-nine steps".
color="#804040"> 20 6) Sort letters without regard to accents.
color="#804040"> 21 7) Sort ligatures as separate letters.
color="#804040"> 22 8) Replacements:
color="#804040"> 23 Sort german scharfes S (ß) as ss
color="#804040"> 24 Sort Å¿, LATIN SMALL LETTER LONG S as s
color="#804040"> 25 Sort Ê’, LATIN SMALL LETTER EZH as s
color="#804040"> 26 ...
color="#804040"> 27 '''
color="#804040"> 28
color="#804040"> 29 from itertools color="#a020f0">import groupby
color="#804040"> 30 from unicodedata color="#a020f0">import decomposition, name
color="#804040"> 31
color="#804040"> 32 commonleaders = [' color="#ff00ff">the'] # lowercase leading words to ignore
color="#804040"> 33 replacements = {' color="#ff00ff">ß': 'ss', color="#0000ff"># Map single char to replacement string
color="#804040"> 34 ' color="#ff00ff">Å¿': 's',
color="#804040"> 35 ' color="#ff00ff">Ê’': 's',
color="#804040"> 36 }
color="#804040"> 37
color="#804040"> 38 hexdigits = set(' color="#ff00ff">0123456789abcdef')
color="#804040"> 39 decdigits = set(' color="#ff00ff">0123456789') color="#0000ff"># Don't use str.isnumeric
color="#804040"> 40
color="#804040"> 41 def color="#008080">splitchar(c):
color="#804040"> 42 ' color="#ff00ff"> De-ligature. De-accent a char'
color="#804040"> 43 de = decomposition(c)
color="#804040"> 44 if de:
color="#804040"> 45 color="#0000ff"># Just the words that are also hex numbers
color="#804040"> 46 de = [d color="#804040">for d color="#804040">in de.split()
color="#804040"> 47 color="#804040">if all(c.lower()
color="#804040"> 48 color="#804040">in hexdigits color="#804040">for c color="#804040">in d)]
color="#804040"> 49 n = name(c, c).upper()
color="#804040"> 50 color="#0000ff"># (Gosh it's onerous)
color="#804040"> 51 color="#804040">if len(de)> 1 color="#804040">and ' color="#ff00ff">PRECEDE' color="#804040">in n:
color="#804040"> 52 color="#0000ff"># E.g. ʼn LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
color="#804040"> 53 de[1], de[0] = de[0], de[1]
color="#804040"> 54 tmp = [ unichr(int(k, 16)) color="#804040">for k color="#804040">in de]
color="#804040"> 55 base, others = tmp[0], tmp[1:]
color="#804040"> 56 color="#804040">if ' color="#ff00ff">LIGATURE' color="#804040">in n:
color="#804040"> 57 color="#0000ff"># Assume two character ligature
color="#804040"> 58 base += others.pop(0)
color="#804040"> 59 else:
color="#804040"> 60 base = c
color="#804040"> 61 return base
color="#804040"> 62
color="#804040"> 63
color="#804040"> 64 def color="#008080">sortkeygen(s):
color="#804040"> 65 ''' color="#ff00ff">Generate 'natural' sort key for s
color="#804040"> 66
color="#804040"> 67 Doctests:
color="#804040"> 68 >>> sortkeygen(' some extra spaces ')
color="#804040"> 69 [u'some extra spaces']
color="#804040"> 70 >>> sortkeygen('CasE InseNsItIve')
color="#804040"> 71 [u'case insensitive']
color="#804040"> 72 >>> sortkeygen('The Wind in the Willows')
color="#804040"> 73 [u'wind in the willows']
color="#804040"> 74 >>> sortkeygen(u' color="#6a5acd">\462 ligature')
color="#804040"> 75 [u'ij ligature']
color="#804040"> 76 >>> sortkeygen(u' color="#6a5acd">\335\375 upper/lower case Y with acute accent')
color="#804040"> 77 [u'yy upper/lower case y with acute accent']
color="#804040"> 78 >>> sortkeygen('foo9.txt')
color="#804040"> 79 [u'foo', 9, u'.txt']
color="#804040"> 80 >>> sortkeygen('x9y99')
color="#804040"> 81 [u'x', 9, u'y', 99]
color="#804040"> 82 '''
color="#804040"> 83 # Ignore leading and trailing spaces
color="#804040"> 84 s = unicode(s).strip()
color="#804040"> 85 # All space types are equivalent
color="#804040"> 86 s = ' color="#ff00ff"> '.join(s.split())
color="#804040"> 87 # case insentsitive
color="#804040"> 88 s = s.lower()
color="#804040"> 89 # Title
color="#804040"> 90 words = s.split()
color="#804040"> 91 if len(words) > 1 color="#804040">and words[0] color="#804040">in commonleaders:
color="#804040"> 92 s = ' color="#ff00ff"> '.join( words[1:])
color="#804040"> 93 # accent and ligatures
color="#804040"> 94 s = ''.join(splitchar(c) color="#804040">for c color="#804040">in s)
color="#804040"> 95 # Replacements (single char replaced by one or more)
color="#804040"> 96 s = ''.join( replacements.get(ch, ch) color="#804040">for ch color="#804040">in s )
color="#804040"> 97 # Numeric sections as numerics
color="#804040"> 98 s = [ int("".join(g)) color="#804040">if isinteger color="#804040">else "".join(g)
color="#804040"> 99 color="#804040">for isinteger,g color="#804040">in groupby(s, color="#804040">lambda x: x color="#804040">in decdigits)]
color="#804040">100
color="#804040">101 return s
color="#804040">102
color="#804040">103 def color="#008080">naturalsort(items):
color="#804040">104 ''' color="#ff00ff"> Naturally sort a series of strings
color="#804040">105
color="#804040">106 Doctests:
color="#804040">107 >>> naturalsort(['The Wind in the Willows','The 40th step more',
color="#804040">108 'The 39 steps', 'Wanda'])
color="#804040">109 ['The 39 steps', 'The 40th step more', 'Wanda', 'The Wind in the Willows']
color="#804040">110
color="#804040">111 '''
color="#804040">112 return sorted(items, key=sortkeygen)
color="#804040">113
color="#804040">114
color="#804040">115



color="#804040">116 # Extras...
color="#804040">117
color="#804040">118 '''
color="#804040">119 Help on built-in function decomposition in module unicodedata:
color="#804040">120
color="#804040">121 decomposition(...)
color="#804040">122 decomposition(unichr)
color="#804040">123
color="#804040">124 Returns the character decomposition mapping assigned to the Unicode
color="#804040">125 character unichr as string. An empty string is returned in case no
color="#804040">126 such mapping is defined.
color="#804040">127
color="#804040">128 Help on built-in function name in module unicodedata:
color="#804040">129
color="#804040">130 name(...)
color="#804040">131 name(unichr[, default])
color="#804040">132 Returns the name assigned to the Unicode character unichr as a
color="#804040">133 string. If no name is defined, default is returned, or, if not
color="#804040">134 given, ValueError is raised.
color="#804040">135 '''
color="#804040">136
color="#804040">137
color="#804040">138 def color="#008080">_temp():
color="#804040">139 # for c in u'\370\157\117\463\462\511\733\337\337':
color="#804040">140 for c color="#804040">in u' class="Constant">øoOijIJʼnÇßÝý':
color="#804040">141 de = decomposition(c)
color="#804040">142 color="#804040">if de:
color="#804040">143 de = [d color="#804040">for d color="#804040">in de.split()
color="#804040">144 color="#804040">if all(c.upper()
color="#804040">145 color="#804040">in ' color="#ff00ff">0123456789ABCDEF' color="#804040">for c color="#804040">in d)]
color="#804040">146 n = name(c, c).upper()
color="#804040">147 color="#804040">if len(de)> 1 color="#804040">and ' color="#ff00ff">PRECEDE' color="#804040">in n:
color="#804040">148 color="#0000ff"># E.g. ʼn LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
color="#804040">149 de[1], de[0] = de[0], de[1]
color="#804040">150 tmp = [ unichr(int(k, 16)) color="#804040">for k color="#804040">in de]
color="#804040">151 base, others = tmp[0], tmp[1:]
color="#804040">152 color="#804040">else:
color="#804040">153 base, others = c, []
color="#804040">154 color="#804040">print(" color="#ff00ff">%s %o %s\n color="#ff00ff"> %s %s" %
color="#804040">155 (c,ord(c), name(c,' color="#ff00ff">-'),
color="#804040">156 name(base, base),
color="#804040">157 ' color="#ff00ff"> '.join(name(o,o) color="#804040">for o color="#804040">in others)))
color="#804040">158
color="#804040">159
color="#804040">160



8 comments:

  1. If you keep at this, you will eventually end up reinventing the Unicode Collation Algorithm! Something related you may find of interest is Sean Burke's Unidecode which takes Unicode text and turns it into ascii. His article describing it is at http://interglacial.com/tpj/22/

    Although the code is in Perl (find it at CPAN), it is ultimately just data that is easy to convert into Python data.

    ReplyDelete
  2. Your blog has serious UTF-8 trouble.

    ReplyDelete
  3. Marius, Yep. I find it difficult in going from a correct windows .py file translated to HTML with vim then composed in Nvu and lastly the HTML cut-n-pasted into Blogger.

    I was inclined to dump trying to work with Unicode if people cannot follow the articles, but I'll persevere as it's polite to reach out to my Unicode-using readership. One day, I'll get so p*!ssed off with Blogger that I will move!


    Roger, I liked the article on Unidecode. Thanks.

    ReplyDelete
  4. (I have trouble commenting with Iceweasel 3.0. Now trying webkit-based..)

    Hi. This is a topic I find very interesting. I dove into stripping accents from text just recently:

    http://stackoverflow.com/questions/1410308/how-to-implement-unicode-string-matching-by-folding-in-python

    However, that was for purposes of matching queries to catalog items.

    For sorting, my application just uses locale-aware sorting without natural sort. Natural sort is certainly worthwhile, perhaps not just to my application (most of the time it displays objects ranked by relevance to the query. Yeah, it's a simplistic launcher app, kupfer).

    But, since I'm in the know I wanted to say that there are some important things to understand about locale-aware sorting.

    Don't strip accents and then sort! Rhythmbox does this and it's totally wrong. The special cases will sort themselves out in their native locales. And the same letter might sort in different places in different languages.

    Example: ß will sort in the right place in the German locale
    Example: Ö sorts after X,Y,Z,Å,Ä in Swedish
    Example: But Ö sorts as Oe in German!

    So at some point you will have to involve Python's locale module.

    My quick idea to throw into the bowl is to split each string into a tuple of its string and number parts

    Split: file10.txt into ("file", 10, ".txt")
    Split: testfile.pdf into ("testfile.pdf", )

    Now replace each string with its locale.strxfrm counterpart (works with UTF-8 encoded text too), and sort the tuples.

    ReplyDelete
  5. The implementation of my natural sort algorithm that I made this morning (while posting here didn't work), looks like this:

    http://kaizer.se/wiki/code/natsort.py

    I guess it's a very inefficient first try.

    ReplyDelete
  6. Another thing is that when I try, even the en_US.UTF-8 locale sorts ß in the right spot (as ss). What I've seen on stackoverflow, locale is a much unknown and underused python module. Might be that it's linux/unix-specific? (Also that it's impossible to use in a server if you want to work with more than one locale.)

    ReplyDelete
  7. Your post has been eaten by HTML encoders.

    ReplyDelete

Followers

Subscribe Now: google

Add to Google Reader or Homepage

Go deh too!

whos.amung.us