Go deh!: Natural Sorting

Natural sorting orders items in some way that is more than just the
order dictated by computer character codes. It is called natural
because the new ordering is supposed to mean more to a human consumer
than an ordering on base character codes .

I guess, what is becoming the canonical example is the ordering of
numbered files in a directory listing. A normal sorted directory
listing might show:

style="font-family: monospace;">bash$ for (( i=0;
i<111;i+=5)); do touch f$i.txt; done style="font-family: monospace;">
bash$ ls -1 style="font-family: monospace;">
f0.txt style="font-family: monospace;">
f10.txt style="font-family: monospace;">
f100.txt style="font-family: monospace;">
f105.txt style="font-family: monospace;">
f110.txt style="font-family: monospace;">
f15.txt style="font-family: monospace;">
f20.txt style="font-family: monospace;">
f25.txt style="font-family: monospace;">
f30.txt style="font-family: monospace;">
f35.txt style="font-family: monospace;">
f40.txt style="font-family: monospace;">
f45.txt style="font-family: monospace;">
f5.txt style="font-family: monospace;">
f50.txt style="font-family: monospace;">
f55.txt style="font-family: monospace;">
f60.txt style="font-family: monospace;">
f65.txt style="font-family: monospace;">
f70.txt style="font-family: monospace;">
f75.txt style="font-family: monospace;">
f80.txt style="font-family: monospace;">
f85.txt style="font-family: monospace;">
f90.txt style="font-family: monospace;">
f95.txt

It might be more friendly if the files were ordered so that the numeric
section increased steadily.

In other situations, such as in Movie listings, you sometimes see
common words such as "the" ignored for sorting purposes (or even
transferred to the end of the item so "The Silence of the Lambs" would
be listed as "Silence of the Lambs, The")

Ignoring case, and treating white-space characters as the same - and
multiple strings of whitespace characters as a single whitespace, might
also be useful.

Since my initial tentative steps into working with non-ASCII characters
I tried to consider sorting accented characters as if the accent was
not present; ligatures as separate characters, and made a stab at
tackling what I call 'replacements' - where certain
characters, such as the German href="http://en.wikipedia.org/wiki/Scharfes_S">scharfes S
might be sorted as SS,.

I had a quick stab at creating a natural sort routine and
added several natural sorting abilities. They might not all be useful
at once, but it should be straight forward to pick and choose what you
want n your natural sort..

As always, when working with non-ASCII character sets, I find it
difficult to translate my working multi-byte Python program to HTML,
but there should be enough surrounding information to ensure that any
multi-byte character can be worked-out.

  1 
 color="#804040">  2 # -*- coding: utf-8 -*-
 color="#804040">  3 # Not Python 3.x (Can't compare str and int)
 color="#804040">  4 
 color="#804040">  5 '''
 color="#804040">  6     Natural sorting: Sorting of text that does more than rely on the
 color="#804040">  7     order of individual characters codes to make the finding of
 color="#804040">  8     individual strings easier for a human reader.
 color="#804040">  9     
 color="#804040"> 10     Some 'natural' orderings might include:
 color="#804040"> 11         1) Ignore leading, trailing and multiple adjacent spaces
 color="#804040"> 12         2) All space types equivalent.
 color="#804040"> 13         3) Sorting without regard to case.
 color="#804040"> 14         4) Sorting numeric portions of strings in numeric order:
 color="#804040"> 15             foo9.txt before foo10.txt
 color="#804040"> 16             As well as ... x9y99 before x9y100, before x10y0
 color="#804040"> 17             ... (for any number of groups of integers in a string).
 color="#804040"> 18         5) Title sorts: without regard to a leading, very common, word such
 color="#804040"> 19            as 'The' in "The thirty-nine steps".
 color="#804040"> 20         6) Sort letters without regard to accents.
 color="#804040"> 21         7) Sort ligatures as separate letters.
 color="#804040"> 22         8) Replacements:
 color="#804040"> 23             Sort german scharfes S (ÃŸ) as ss
 color="#804040"> 24             Sort Å¿,  LATIN SMALL LETTER LONG S as s
 color="#804040"> 25             Sort Ê’,  LATIN SMALL LETTER EZH as s
 color="#804040"> 26             ...
 color="#804040"> 27 '''
 color="#804040"> 28 
 color="#804040"> 29 from itertools  color="#a020f0">import groupby
 color="#804040"> 30 from unicodedata  color="#a020f0">import decomposition, name
 color="#804040"> 31 
 color="#804040"> 32 commonleaders = [' color="#ff00ff">the'] # lowercase leading words to ignore
 color="#804040"> 33 replacements = {' color="#ff00ff">ÃŸ': 'ss',   color="#0000ff"># Map single char to replacement string
 color="#804040"> 34                 ' color="#ff00ff">Å¿': 's',
 color="#804040"> 35                 ' color="#ff00ff">Ê’': 's',
 color="#804040"> 36                 }
 color="#804040"> 37 
 color="#804040"> 38 hexdigits = set(' color="#ff00ff">0123456789abcdef')
 color="#804040"> 39 decdigits = set(' color="#ff00ff">0123456789')    color="#0000ff"># Don't use str.isnumeric
 color="#804040"> 40 
 color="#804040"> 41 def  color="#008080">splitchar(c):
 color="#804040"> 42     ' color="#ff00ff"> De-ligature. De-accent a char'
 color="#804040"> 43     de = decomposition(c)
 color="#804040"> 44     if de:
 color="#804040"> 45          color="#0000ff"># Just the words that are also hex numbers
 color="#804040"> 46         de = [d  color="#804040">for d  color="#804040">in de.split()
 color="#804040"> 47                    color="#804040">if all(c.lower()
 color="#804040"> 48                           color="#804040">in hexdigits  color="#804040">for c  color="#804040">in d)]
 color="#804040"> 49         n = name(c, c).upper()
 color="#804040"> 50          color="#0000ff"># (Gosh it's onerous)
 color="#804040"> 51          color="#804040">if len(de)> 1  color="#804040">and ' color="#ff00ff">PRECEDE'  color="#804040">in n:
 color="#804040"> 52              color="#0000ff"># E.g. Å‰  LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
 color="#804040"> 53             de[1], de[0] = de[0], de[1]
 color="#804040"> 54         tmp = [ unichr(int(k, 16))  color="#804040">for k  color="#804040">in de]
 color="#804040"> 55         base, others = tmp[0], tmp[1:]
 color="#804040"> 56          color="#804040">if ' color="#ff00ff">LIGATURE'  color="#804040">in n:
 color="#804040"> 57              color="#0000ff"># Assume two character ligature
 color="#804040"> 58             base += others.pop(0)
 color="#804040"> 59     else:
 color="#804040"> 60         base = c
 color="#804040"> 61     return base
 color="#804040"> 62 
 color="#804040"> 63 
 color="#804040"> 64 def  color="#008080">sortkeygen(s):
 color="#804040"> 65     ''' color="#ff00ff">Generate 'natural' sort key for s
 color="#804040"> 66 
 color="#804040"> 67     Doctests:
 color="#804040"> 68         >>> sortkeygen('  some extra    spaces  ')
 color="#804040"> 69         [u'some extra spaces']
 color="#804040"> 70         >>> sortkeygen('CasE InseNsItIve')
 color="#804040"> 71         [u'case insensitive']
 color="#804040"> 72         >>> sortkeygen('The Wind in the Willows')
 color="#804040"> 73         [u'wind in the willows']
 color="#804040"> 74         >>> sortkeygen(u' color="#6a5acd">\462 ligature')
 color="#804040"> 75         [u'ij ligature']
 color="#804040"> 76         >>> sortkeygen(u' color="#6a5acd">\335\375 upper/lower case Y with acute accent')
 color="#804040"> 77         [u'yy upper/lower case y with acute accent']
 color="#804040"> 78         >>> sortkeygen('foo9.txt')
 color="#804040"> 79         [u'foo', 9, u'.txt']
 color="#804040"> 80         >>> sortkeygen('x9y99')
 color="#804040"> 81         [u'x', 9, u'y', 99]
 color="#804040"> 82     '''
 color="#804040"> 83     # Ignore leading and trailing spaces
 color="#804040"> 84     s = unicode(s).strip()
 color="#804040"> 85     # All space types are equivalent
 color="#804040"> 86     s = ' color="#ff00ff"> '.join(s.split())
 color="#804040"> 87     # case insentsitive
 color="#804040"> 88     s = s.lower()
 color="#804040"> 89     # Title
 color="#804040"> 90     words = s.split()
 color="#804040"> 91     if len(words) > 1  color="#804040">and words[0]  color="#804040">in commonleaders:
 color="#804040"> 92         s = ' color="#ff00ff"> '.join( words[1:])
 color="#804040"> 93     # accent and ligatures
 color="#804040"> 94     s = ''.join(splitchar(c)  color="#804040">for c  color="#804040">in s)
 color="#804040"> 95     # Replacements (single char replaced by one or more)
 color="#804040"> 96     s = ''.join( replacements.get(ch, ch)  color="#804040">for ch  color="#804040">in s )
 color="#804040"> 97     # Numeric sections as numerics
 color="#804040"> 98     s = [ int("".join(g))  color="#804040">if isinteger  color="#804040">else "".join(g)
 color="#804040"> 99            color="#804040">for isinteger,g  color="#804040">in groupby(s,  color="#804040">lambda x: x  color="#804040">in decdigits)]
 color="#804040">100 
 color="#804040">101     return s
 color="#804040">102 
 color="#804040">103 def  color="#008080">naturalsort(items):
 color="#804040">104     ''' color="#ff00ff"> Naturally sort a series of strings
 color="#804040">105 
 color="#804040">106     Doctests:
 color="#804040">107         >>> naturalsort(['The Wind in the Willows','The 40th step more',
 color="#804040">108                          'The 39 steps', 'Wanda'])
 color="#804040">109         ['The 39 steps', 'The 40th step more', 'Wanda', 'The Wind in the Willows']
 color="#804040">110 
 color="#804040">111     '''
 color="#804040">112     return sorted(items, key=sortkeygen)
 color="#804040">113 
 color="#804040">114 
 color="#804040">115 



 color="#804040">116 # Extras...
 color="#804040">117 
 color="#804040">118 '''
 color="#804040">119 Help on built-in function decomposition in module unicodedata:
 color="#804040">120 
 color="#804040">121 decomposition(...)
 color="#804040">122     decomposition(unichr)
 color="#804040">123     
 color="#804040">124     Returns the character decomposition mapping assigned to the Unicode
 color="#804040">125     character unichr as string. An empty string is returned in case no
 color="#804040">126     such mapping is defined.
 color="#804040">127 
 color="#804040">128 Help on built-in function name in module unicodedata:
 color="#804040">129 
 color="#804040">130 name(...)
 color="#804040">131     name(unichr[, default])
 color="#804040">132     Returns the name assigned to the Unicode character unichr as a
 color="#804040">133     string. If no name is defined, default is returned, or, if not
 color="#804040">134     given, ValueError is raised.
 color="#804040">135 '''
 color="#804040">136 
 color="#804040">137 
 color="#804040">138 def  color="#008080">_temp():
 color="#804040">139     # for c in u'\370\157\117\463\462\511\733\337\337':
 color="#804040">140     for c  color="#804040">in u' class="Constant">Ã¸oOÄ³Ä²Å‰ÇÃŸÃÃ½':
 color="#804040">141         de = decomposition(c)
 color="#804040">142          color="#804040">if de:
 color="#804040">143             de = [d  color="#804040">for d  color="#804040">in de.split()
 color="#804040">144                        color="#804040">if all(c.upper()
 color="#804040">145                           color="#804040">in ' color="#ff00ff">0123456789ABCDEF'  color="#804040">for c  color="#804040">in d)]
 color="#804040">146             n = name(c, c).upper()
 color="#804040">147              color="#804040">if len(de)> 1  color="#804040">and ' color="#ff00ff">PRECEDE'  color="#804040">in n:
 color="#804040">148                  color="#0000ff"># E.g. Å‰  LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
 color="#804040">149                 de[1], de[0] = de[0], de[1]
 color="#804040">150             tmp = [ unichr(int(k, 16))  color="#804040">for k  color="#804040">in de]
 color="#804040">151             base, others = tmp[0], tmp[1:]
 color="#804040">152          color="#804040">else:
 color="#804040">153             base, others = c, []
 color="#804040">154          color="#804040">print(" color="#ff00ff">%s %o %s\n color="#ff00ff">  %s  %s" %
 color="#804040">155                   (c,ord(c), name(c,' color="#ff00ff">-'),
 color="#804040">156                    name(base, base),
 color="#804040">157                    ' color="#ff00ff"> '.join(name(o,o)  color="#804040">for o  color="#804040">in others)))
 color="#804040">158 
 color="#804040">159 
 color="#804040">160

8 comments:

UnknownThu Sept 24, 11:01:00 am
If you keep at this, you will eventually end up reinventing the Unicode Collation Algorithm! Something related you may find of interest is Sean Burke's Unidecode which takes Unicode text and turns it into ascii. His article describing it is at http://interglacial.com/tpj/22/

Although the code is in Perl (find it at CPAN), it is ultimately just data that is easy to convert into Python data.
Marius GedminasThu Sept 24, 12:27:00 pm
Your blog has serious UTF-8 trouble.
Paddy3118Thu Sept 24, 05:49:00 pm
Marius, Yep. I find it difficult in going from a correct windows .py file translated to HTML with vim then composed in Nvu and lastly the HTML cut-n-pasted into Blogger.

I was inclined to dump trying to work with Unicode if people cannot follow the articles, but I'll persevere as it's polite to reach out to my Unicode-using readership. One day, I'll get so p*!ssed off with Blogger that I will move!

Roger, I liked the article on Unidecode. Thanks.
ulrikThu Sept 24, 08:37:00 pm
(I have trouble commenting with Iceweasel 3.0. Now trying webkit-based..)

Hi. This is a topic I find very interesting. I dove into stripping accents from text just recently:

http://stackoverflow.com/questions/1410308/how-to-implement-unicode-string-matching-by-folding-in-python

However, that was for purposes of matching queries to catalog items.

For sorting, my application just uses locale-aware sorting without natural sort. Natural sort is certainly worthwhile, perhaps not just to my application (most of the time it displays objects ranked by relevance to the query. Yeah, it's a simplistic launcher app, kupfer).

But, since I'm in the know I wanted to say that there are some important things to understand about locale-aware sorting.

Don't strip accents and then sort! Rhythmbox does this and it's totally wrong. The special cases will sort themselves out in their native locales. And the same letter might sort in different places in different languages.

Example: ß will sort in the right place in the German locale
Example: Ö sorts after X,Y,Z,Å,Ä in Swedish
Example: But Ö sorts as Oe in German!

So at some point you will have to involve Python's locale module.

My quick idea to throw into the bowl is to split each string into a tuple of its string and number parts

Split: file10.txt into ("file", 10, ".txt")
Split: testfile.pdf into ("testfile.pdf", )

Now replace each string with its locale.strxfrm counterpart (works with UTF-8 encoded text too), and sort the tuples.
ulrikThu Sept 24, 08:44:00 pm
The implementation of my natural sort algorithm that I made this morning (while posting here didn't work), looks like this:

http://kaizer.se/wiki/code/natsort.py

I guess it's a very inefficient first try.
ulrikThu Sept 24, 09:05:00 pm
Another thing is that when I try, even the en_US.UTF-8 locale sorts ß in the right spot (as ss). What I've seen on stackoverflow, locale is a much unknown and underused python module. Might be that it's linux/unix-specific? (Also that it's impossible to use in a server if you want to work with more than one locale.)
Thomas MayfieldWed Sept 22, 03:25:00 pm
Your post has been eaten by HTML encoders.
Paddy3118Wed Sept 22, 06:39:00 pm
Hi Thomas. Yes it did that!

Go deh!

Thursday, September 24, 2009

Natural Sorting

8 comments:

About Me

Followers

Subscribe Now: google

Go deh too!

whos.amung.us

Blog Archive