order dictated by computer character codes. It is called natural
because the new ordering is supposed to mean more to a human consumer
than an ordering on base character codes .
I guess, what is becoming the canonical example is the ordering of
numbered files in a directory listing. A normal sorted directory
listing might show:
style="font-family: monospace;">bash$ for (( i=0;
i<111;i+=5)); do touch f$i.txt; done
style="font-family: monospace;">
bash$ ls -1
style="font-family: monospace;">
f0.txt
style="font-family: monospace;">
f10.txt
style="font-family: monospace;">
f100.txt
style="font-family: monospace;">
f105.txt
style="font-family: monospace;">
f110.txt
style="font-family: monospace;">
f15.txt
style="font-family: monospace;">
f20.txt
style="font-family: monospace;">
f25.txt
style="font-family: monospace;">
f30.txt
style="font-family: monospace;">
f35.txt
style="font-family: monospace;">
f40.txt
style="font-family: monospace;">
f45.txt
style="font-family: monospace;">
f5.txt
style="font-family: monospace;">
f50.txt
style="font-family: monospace;">
f55.txt
style="font-family: monospace;">
f60.txt
style="font-family: monospace;">
f65.txt
style="font-family: monospace;">
f70.txt
style="font-family: monospace;">
f75.txt
style="font-family: monospace;">
f80.txt
style="font-family: monospace;">
f85.txt
style="font-family: monospace;">
f90.txt
style="font-family: monospace;">
f95.txt
i<111;i+=5)); do touch f$i.txt; done
style="font-family: monospace;">
bash$ ls -1
style="font-family: monospace;">
f0.txt
style="font-family: monospace;">
f10.txt
style="font-family: monospace;">
f100.txt
style="font-family: monospace;">
f105.txt
style="font-family: monospace;">
f110.txt
style="font-family: monospace;">
f15.txt
style="font-family: monospace;">
f20.txt
style="font-family: monospace;">
f25.txt
style="font-family: monospace;">
f30.txt
style="font-family: monospace;">
f35.txt
style="font-family: monospace;">
f40.txt
style="font-family: monospace;">
f45.txt
style="font-family: monospace;">
f5.txt
style="font-family: monospace;">
f50.txt
style="font-family: monospace;">
f55.txt
style="font-family: monospace;">
f60.txt
style="font-family: monospace;">
f65.txt
style="font-family: monospace;">
f70.txt
style="font-family: monospace;">
f75.txt
style="font-family: monospace;">
f80.txt
style="font-family: monospace;">
f85.txt
style="font-family: monospace;">
f90.txt
style="font-family: monospace;">
f95.txt
It might be more friendly if the files were ordered so that the numeric
section increased steadily.
In other situations, such as in Movie listings, you sometimes see
common words such as "the" ignored for sorting purposes (or even
transferred to the end of the item so "The Silence of the Lambs" would
be listed as "Silence of the Lambs, The")
Ignoring case, and treating white-space characters as the same - and
multiple strings of whitespace characters as a single whitespace, might
also be useful.
Since my initial tentative steps into working with non-ASCII characters
I tried to consider sorting accented characters as if the accent was
not present; ligatures as separate characters, and made a stab at
tackling what I call 'replacements' - where certain
characters, such as the German href="http://en.wikipedia.org/wiki/Scharfes_S">scharfes S
might be sorted as SS,.
I had a quick stab at creating a natural sort routine and
added several natural sorting abilities. They might not all be useful
at once, but it should be straight forward to pick and choose what you
want n your natural sort..
As always, when working with non-ASCII character sets, I find it
difficult to translate my working multi-byte Python program to HTML,
but there should be enough surrounding information to ensure that any
multi-byte character can be worked-out.
1
color="#804040"> 2 # -*- coding: utf-8 -*-
color="#804040"> 3 # Not Python 3.x (Can't compare str and int)
color="#804040"> 4
color="#804040"> 5 '''
color="#804040"> 6 Natural sorting: Sorting of text that does more than rely on the
color="#804040"> 7 order of individual characters codes to make the finding of
color="#804040"> 8 individual strings easier for a human reader.
color="#804040"> 9
color="#804040"> 10 Some 'natural' orderings might include:
color="#804040"> 11 1) Ignore leading, trailing and multiple adjacent spaces
color="#804040"> 12 2) All space types equivalent.
color="#804040"> 13 3) Sorting without regard to case.
color="#804040"> 14 4) Sorting numeric portions of strings in numeric order:
color="#804040"> 15 foo9.txt before foo10.txt
color="#804040"> 16 As well as ... x9y99 before x9y100, before x10y0
color="#804040"> 17 ... (for any number of groups of integers in a string).
color="#804040"> 18 5) Title sorts: without regard to a leading, very common, word such
color="#804040"> 19 as 'The' in "The thirty-nine steps".
color="#804040"> 20 6) Sort letters without regard to accents.
color="#804040"> 21 7) Sort ligatures as separate letters.
color="#804040"> 22 8) Replacements:
color="#804040"> 23 Sort german scharfes S (ß) as ss
color="#804040"> 24 Sort Å¿, LATIN SMALL LETTER LONG S as s
color="#804040"> 25 Sort Ê’, LATIN SMALL LETTER EZH as s
color="#804040"> 26 ...
color="#804040"> 27 '''
color="#804040"> 28
color="#804040"> 29 from itertools color="#a020f0">import groupby
color="#804040"> 30 from unicodedata color="#a020f0">import decomposition, name
color="#804040"> 31
color="#804040"> 32 commonleaders = [' color="#ff00ff">the'] # lowercase leading words to ignore
color="#804040"> 33 replacements = {' color="#ff00ff">ß': 'ss', color="#0000ff"># Map single char to replacement string
color="#804040"> 34 ' color="#ff00ff">Å¿': 's',
color="#804040"> 35 ' color="#ff00ff">Ê’': 's',
color="#804040"> 36 }
color="#804040"> 37
color="#804040"> 38 hexdigits = set(' color="#ff00ff">0123456789abcdef')
color="#804040"> 39 decdigits = set(' color="#ff00ff">0123456789') color="#0000ff"># Don't use str.isnumeric
color="#804040"> 40
color="#804040"> 41 def color="#008080">splitchar(c):
color="#804040"> 42 ' color="#ff00ff"> De-ligature. De-accent a char'
color="#804040"> 43 de = decomposition(c)
color="#804040"> 44 if de:
color="#804040"> 45 color="#0000ff"># Just the words that are also hex numbers
color="#804040"> 46 de = [d color="#804040">for d color="#804040">in de.split()
color="#804040"> 47 color="#804040">if all(c.lower()
color="#804040"> 48 color="#804040">in hexdigits color="#804040">for c color="#804040">in d)]
color="#804040"> 49 n = name(c, c).upper()
color="#804040"> 50 color="#0000ff"># (Gosh it's onerous)
color="#804040"> 51 color="#804040">if len(de)> 1 color="#804040">and ' color="#ff00ff">PRECEDE' color="#804040">in n:
color="#804040"> 52 color="#0000ff"># E.g. ʼn LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
color="#804040"> 53 de[1], de[0] = de[0], de[1]
color="#804040"> 54 tmp = [ unichr(int(k, 16)) color="#804040">for k color="#804040">in de]
color="#804040"> 55 base, others = tmp[0], tmp[1:]
color="#804040"> 56 color="#804040">if ' color="#ff00ff">LIGATURE' color="#804040">in n:
color="#804040"> 57 color="#0000ff"># Assume two character ligature
color="#804040"> 58 base += others.pop(0)
color="#804040"> 59 else:
color="#804040"> 60 base = c
color="#804040"> 61 return base
color="#804040"> 62
color="#804040"> 63
color="#804040"> 64 def color="#008080">sortkeygen(s):
color="#804040"> 65 ''' color="#ff00ff">Generate 'natural' sort key for s
color="#804040"> 66
color="#804040"> 67 Doctests:
color="#804040"> 68 >>> sortkeygen(' some extra spaces ')
color="#804040"> 69 [u'some extra spaces']
color="#804040"> 70 >>> sortkeygen('CasE InseNsItIve')
color="#804040"> 71 [u'case insensitive']
color="#804040"> 72 >>> sortkeygen('The Wind in the Willows')
color="#804040"> 73 [u'wind in the willows']
color="#804040"> 74 >>> sortkeygen(u' color="#6a5acd">\462 ligature')
color="#804040"> 75 [u'ij ligature']
color="#804040"> 76 >>> sortkeygen(u' color="#6a5acd">\335\375 upper/lower case Y with acute accent')
color="#804040"> 77 [u'yy upper/lower case y with acute accent']
color="#804040"> 78 >>> sortkeygen('foo9.txt')
color="#804040"> 79 [u'foo', 9, u'.txt']
color="#804040"> 80 >>> sortkeygen('x9y99')
color="#804040"> 81 [u'x', 9, u'y', 99]
color="#804040"> 82 '''
color="#804040"> 83 # Ignore leading and trailing spaces
color="#804040"> 84 s = unicode(s).strip()
color="#804040"> 85 # All space types are equivalent
color="#804040"> 86 s = ' color="#ff00ff"> '.join(s.split())
color="#804040"> 87 # case insentsitive
color="#804040"> 88 s = s.lower()
color="#804040"> 89 # Title
color="#804040"> 90 words = s.split()
color="#804040"> 91 if len(words) > 1 color="#804040">and words[0] color="#804040">in commonleaders:
color="#804040"> 92 s = ' color="#ff00ff"> '.join( words[1:])
color="#804040"> 93 # accent and ligatures
color="#804040"> 94 s = ''.join(splitchar(c) color="#804040">for c color="#804040">in s)
color="#804040"> 95 # Replacements (single char replaced by one or more)
color="#804040"> 96 s = ''.join( replacements.get(ch, ch) color="#804040">for ch color="#804040">in s )
color="#804040"> 97 # Numeric sections as numerics
color="#804040"> 98 s = [ int("".join(g)) color="#804040">if isinteger color="#804040">else "".join(g)
color="#804040"> 99 color="#804040">for isinteger,g color="#804040">in groupby(s, color="#804040">lambda x: x color="#804040">in decdigits)]
color="#804040">100
color="#804040">101 return s
color="#804040">102
color="#804040">103 def color="#008080">naturalsort(items):
color="#804040">104 ''' color="#ff00ff"> Naturally sort a series of strings
color="#804040">105
color="#804040">106 Doctests:
color="#804040">107 >>> naturalsort(['The Wind in the Willows','The 40th step more',
color="#804040">108 'The 39 steps', 'Wanda'])
color="#804040">109 ['The 39 steps', 'The 40th step more', 'Wanda', 'The Wind in the Willows']
color="#804040">110
color="#804040">111 '''
color="#804040">112 return sorted(items, key=sortkeygen)
color="#804040">113
color="#804040">114
color="#804040">115
color="#804040">116 # Extras...
color="#804040">117
color="#804040">118 '''
color="#804040">119 Help on built-in function decomposition in module unicodedata:
color="#804040">120
color="#804040">121 decomposition(...)
color="#804040">122 decomposition(unichr)
color="#804040">123
color="#804040">124 Returns the character decomposition mapping assigned to the Unicode
color="#804040">125 character unichr as string. An empty string is returned in case no
color="#804040">126 such mapping is defined.
color="#804040">127
color="#804040">128 Help on built-in function name in module unicodedata:
color="#804040">129
color="#804040">130 name(...)
color="#804040">131 name(unichr[, default])
color="#804040">132 Returns the name assigned to the Unicode character unichr as a
color="#804040">133 string. If no name is defined, default is returned, or, if not
color="#804040">134 given, ValueError is raised.
color="#804040">135 '''
color="#804040">136
color="#804040">137
color="#804040">138 def color="#008080">_temp():
color="#804040">139 # for c in u'\370\157\117\463\462\511\733\337\337':
color="#804040">140 for c color="#804040">in u' class="Constant">øoOijIJʼnÇßÃý ':
color="#804040">141 de = decomposition(c)
color="#804040">142 color="#804040">if de:
color="#804040">143 de = [d color="#804040">for d color="#804040">in de.split()
color="#804040">144 color="#804040">if all(c.upper()
color="#804040">145 color="#804040">in ' color="#ff00ff">0123456789ABCDEF' color="#804040">for c color="#804040">in d)]
color="#804040">146 n = name(c, c).upper()
color="#804040">147 color="#804040">if len(de)> 1 color="#804040">and ' color="#ff00ff">PRECEDE' color="#804040">in n:
color="#804040">148 color="#0000ff"># E.g. ʼn LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
color="#804040">149 de[1], de[0] = de[0], de[1]
color="#804040">150 tmp = [ unichr(int(k, 16)) color="#804040">for k color="#804040">in de]
color="#804040">151 base, others = tmp[0], tmp[1:]
color="#804040">152 color="#804040">else:
color="#804040">153 base, others = c, []
color="#804040">154 color="#804040">print(" color="#ff00ff">%s %o %s\n color="#ff00ff"> %s %s" %
color="#804040">155 (c,ord(c), name(c,' color="#ff00ff">-'),
color="#804040">156 name(base, base),
color="#804040">157 ' color="#ff00ff"> '.join(name(o,o) color="#804040">for o color="#804040">in others)))
color="#804040">158
color="#804040">159
color="#804040">160