Thursday, September 24, 2009

Natural Sorting

Natural sorting orders items in a way that goes beyond the order
dictated by raw computer character codes. It is called natural
because the resulting ordering is supposed to mean more to a human
reader than an ordering based on character codes alone.



I guess what is becoming the canonical example is the ordering of
numbered files in a directory listing. A normally sorted directory
listing might show:

style="font-family: monospace;">bash$ for (( i=0;
i<111;i+=5)); do touch f$i.txt; done style="font-family: monospace;">
bash$ ls -1 style="font-family: monospace;">
f0.txt style="font-family: monospace;">
f10.txt style="font-family: monospace;">
f100.txt style="font-family: monospace;">
f105.txt style="font-family: monospace;">
f110.txt style="font-family: monospace;">
f15.txt style="font-family: monospace;">
f20.txt style="font-family: monospace;">
f25.txt style="font-family: monospace;">
f30.txt style="font-family: monospace;">
f35.txt style="font-family: monospace;">
f40.txt style="font-family: monospace;">
f45.txt style="font-family: monospace;">
f5.txt style="font-family: monospace;">
f50.txt style="font-family: monospace;">
f55.txt style="font-family: monospace;">
f60.txt style="font-family: monospace;">
f65.txt style="font-family: monospace;">
f70.txt style="font-family: monospace;">
f75.txt style="font-family: monospace;">
f80.txt style="font-family: monospace;">
f85.txt style="font-family: monospace;">
f90.txt style="font-family: monospace;">
f95.txt

It might be more friendly if the files were ordered so that the numeric
section increased steadily.
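Something like the following minimal sort key would give that friendlier ordering. It is only an illustration, kept much simpler than the fuller routine later in this post, and the name numkey is just for the example:

import re

def numkey(s):
    # Split the name into text and digit runs; compare digit runs as integers.
    # (Assumes plain ASCII digits; the full routine below goes much further.)
    return [int(part) if part.isdigit() else part
            for part in re.split(r'(\d+)', s)]

names = ['f0.txt', 'f10.txt', 'f100.txt', 'f5.txt', 'f95.txt']
print(sorted(names, key=numkey))
# ['f0.txt', 'f5.txt', 'f10.txt', 'f95.txt', 'f100.txt']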



In other situations, such as in movie listings, you sometimes see
common words such as "the" ignored for sorting purposes (or even
moved to the end of the item, so "The Silence of the Lambs" would
be listed as "Silence of the Lambs, The").



Ignoring case, treating all whitespace characters as equivalent, and
collapsing a run of whitespace characters into a single space might
also be useful.
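Both of those fold into a key function in a couple of lines, for example:

def spacecasekey(s):
    # Collapse any run of whitespace to a single space and ignore case.
    return ' '.join(s.split()).lower()

print(spacecasekey('  Some\tEXTRA   spaces '))
# some extra spaces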



Since my initial tentative steps into working with non-ASCII
characters, I have tried to consider sorting accented characters as
if the accent were not present, and ligatures as their separate
characters, and I made a stab at tackling what I call 'replacements',
where certain characters, such as the German scharfes S
(http://en.wikipedia.org/wiki/Scharfes_S), might be sorted as "ss".
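A rough sketch of that idea, using NFKD normalization from the standard unicodedata module rather than the per-character decomposition() calls in the full routine below (the name asciify and the sample text are only for the example):

# -*- coding: utf-8 -*-
import unicodedata

def asciify(s):
    # Apply explicit replacements, decompose accents and compatibility
    # ligatures, then drop the combining marks.
    replacements = {u'\N{LATIN SMALL LETTER SHARP S}': u'ss'}
    s = u''.join(replacements.get(c, c) for c in s)
    decomposed = unicodedata.normalize('NFKD', s)
    return u''.join(c for c in decomposed
                    if not unicodedata.combining(c))

print(asciify(u'déjà vu ß ﬁn'))
# deja vu ss fin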



I had a quick stab at creating a natural sort routine and added
several natural sorting abilities. They might not all be useful at
once, but it should be straightforward to pick and choose what you
want in your natural sort.



As always when working with non-ASCII character sets, I find it
difficult to translate my working multi-byte Python program to HTML,
but there should be enough surrounding information to ensure that any
multi-byte character can be worked out.




# -*- coding: utf-8 -*-
# Not Python 3.x (Can't compare str and int)

'''
Natural sorting: Sorting of text that does more than rely on the
order of individual character codes to make the finding of
individual strings easier for a human reader.

Some 'natural' orderings might include:
1) Ignore leading, trailing and multiple adjacent spaces.
2) All space types equivalent.
3) Sorting without regard to case.
4) Sorting numeric portions of strings in numeric order:
   foo9.txt before foo10.txt
   As well as ... x9y99 before x9y100, before x10y0
   ... (for any number of groups of integers in a string).
5) Title sorts: without regard to a leading, very common, word such
   as 'The' in "The thirty-nine steps".
6) Sort letters without regard to accents.
7) Sort ligatures as separate letters.
8) Replacements:
   Sort German scharfes S (ß) as ss
   Sort ſ, LATIN SMALL LETTER LONG S, as s
   Sort ʒ, LATIN SMALL LETTER EZH, as s
   ...
'''

from itertools import groupby
from unicodedata import decomposition, name

commonleaders = ['the']  # lowercase leading words to ignore
replacements = {u'ß': u'ss',  # Map single char to replacement string
                u'ſ': u's',
                u'ʒ': u's',
                }

hexdigits = set('0123456789abcdef')
decdigits = set('0123456789')  # Don't use str.isnumeric

def splitchar(c):
    ' De-ligature. De-accent a char'
    de = decomposition(c)
    if de:
        # Just the words that are also hex numbers
        de = [d for d in de.split()
              if all(c.lower() in hexdigits for c in d)]
        n = name(c, c).upper()
        # (Gosh it's onerous)
        if len(de) > 1 and 'PRECEDE' in n:
            # E.g. ʼn  LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
            de[1], de[0] = de[0], de[1]
        tmp = [unichr(int(k, 16)) for k in de]
        base, others = tmp[0], tmp[1:]
        if 'LIGATURE' in n:
            # Assume two character ligature
            base += others.pop(0)
    else:
        base = c
    return base


# Raw docstring so the octal escapes reach doctest intact
def sortkeygen(s):
    r''' Generate 'natural' sort key for s

    Doctests:
        >>> sortkeygen('  some   extra    spaces  ')
        [u'some extra spaces']
        >>> sortkeygen('CasE InseNsItIve')
        [u'case insensitive']
        >>> sortkeygen('The Wind in the Willows')
        [u'wind in the willows']
        >>> sortkeygen(u'\462 ligature')
        [u'ij ligature']
        >>> sortkeygen(u'\335\375 upper/lower case Y with acute accent')
        [u'yy upper/lower case y with acute accent']
        >>> sortkeygen('foo9.txt')
        [u'foo', 9, u'.txt']
        >>> sortkeygen('x9y99')
        [u'x', 9, u'y', 99]
    '''
    # Ignore leading and trailing spaces
    s = unicode(s).strip()
    # All space types are equivalent
    s = ' '.join(s.split())
    # Case insensitive
    s = s.lower()
    # Title
    words = s.split()
    if len(words) > 1 and words[0] in commonleaders:
        s = ' '.join(words[1:])
    # Accents and ligatures
    s = ''.join(splitchar(c) for c in s)
    # Replacements (single char replaced by one or more)
    s = ''.join(replacements.get(ch, ch) for ch in s)
    # Numeric sections as numerics
    s = [int("".join(g)) if isinteger else "".join(g)
         for isinteger, g in groupby(s, lambda x: x in decdigits)]

    return s

def naturalsort(items):
    ''' Naturally sort a series of strings

    Doctests:
        >>> naturalsort(['The Wind in the Willows', 'The 40th step more',
        ...              'The 39 steps', 'Wanda'])
        ['The 39 steps', 'The 40th step more', 'Wanda', 'The Wind in the Willows']

    '''
    return sorted(items, key=sortkeygen)



color="#804040">116 # Extras...
color="#804040">117
color="#804040">118 '''
color="#804040">119 Help on built-in function decomposition in module unicodedata:
color="#804040">120
color="#804040">121 decomposition(...)
color="#804040">122 decomposition(unichr)
color="#804040">123
color="#804040">124 Returns the character decomposition mapping assigned to the Unicode
color="#804040">125 character unichr as string. An empty string is returned in case no
color="#804040">126 such mapping is defined.
color="#804040">127
color="#804040">128 Help on built-in function name in module unicodedata:
color="#804040">129
color="#804040">130 name(...)
color="#804040">131 name(unichr[, default])
color="#804040">132 Returns the name assigned to the Unicode character unichr as a
color="#804040">133 string. If no name is defined, default is returned, or, if not
color="#804040">134 given, ValueError is raised.
color="#804040">135 '''
color="#804040">136
color="#804040">137
color="#804040">138 def color="#008080">_temp():
color="#804040">139 # for c in u'\370\157\117\463\462\511\733\337\337':
color="#804040">140 for c color="#804040">in u' class="Constant">øoOijIJʼnÇßÝý':
color="#804040">141 de = decomposition(c)
color="#804040">142 color="#804040">if de:
color="#804040">143 de = [d color="#804040">for d color="#804040">in de.split()
color="#804040">144 color="#804040">if all(c.upper()
color="#804040">145 color="#804040">in ' color="#ff00ff">0123456789ABCDEF' color="#804040">for c color="#804040">in d)]
color="#804040">146 n = name(c, c).upper()
color="#804040">147 color="#804040">if len(de)> 1 color="#804040">and ' color="#ff00ff">PRECEDE' color="#804040">in n:
color="#804040">148 color="#0000ff"># E.g. ʼn LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
color="#804040">149 de[1], de[0] = de[0], de[1]
color="#804040">150 tmp = [ unichr(int(k, 16)) color="#804040">for k color="#804040">in de]
color="#804040">151 base, others = tmp[0], tmp[1:]
color="#804040">152 color="#804040">else:
color="#804040">153 base, others = c, []
color="#804040">154 color="#804040">print(" color="#ff00ff">%s %o %s\n color="#ff00ff"> %s %s" %
color="#804040">155 (c,ord(c), name(c,' color="#ff00ff">-'),
color="#804040">156 name(base, base),
color="#804040">157 ' color="#ff00ff"> '.join(name(o,o) color="#804040">for o color="#804040">in others)))
color="#804040">158
color="#804040">159
color="#804040">160



Monday, September 14, 2009

Easy Command-line Parallelism

I am trying out a dual-chip Xeon server with 8 cores that shows up as
16 CPUs to Red Hat Linux. I have my process runner script
(http://paddy3118.blogspot.com/search?q=process), which I use with a
list of 100000 simulations to run overnight, and I have experimented
with running different numbers of simulations in parallel to get good
throughput.



The process runner is good, and I still use it, but I also wanted
something for more general command-line use. For example, I have
around 750 directories, each containing files relating to one test. I
organize the directories to have regular names, and am used to using
shell for loops to process the dataset.



For example, to gzip all the */transcript files I would do:


foreach f (*/transcript)
    gzip $f
end


(Yep, I know it's cshell, but that is the 'standard' at work.)



This isn't making much use of the processing power at hand, so I went
googling for a lightweight way to run such tasks in parallel.



I found two methods. One, using make and its -j option to run several
jobs in parallel, would need the creation of a makefile in each case,
which is onerous. The other, which was new to me rather than being a
forgotten feature, is that xargs has a -P option that, in combination
with other options, will run up to a given number of jobs in parallel:


/bin/ls */transcript \
| xargs -n1 -iX -P16 gzip X


The above lists all the files and pipes them to xargs, which takes
them one at a time (-n1) to form a job by substituting every
occurrence of the character X (-iX) in the arguments that follow its
options. A maximum of 16 (-P16) jobs are then run at any one time.



When I want to simulate all 750 tests and create a coverage listing of
each test I use xargs to call the bash shell and write something like:


/bin/ls -d * \
| xargs -n1 -iX -P16 bash \
-c 'cd X && vsim tb -c -coverage -do "coverage save -onexit -code t coverX.ucdb;do ../run.do" 2>&1 >/dev/null && vcover report -file coverX.rpt coverX.ucdb 2>&1 >/dev/null'


Yep, I do create such long monstrosities at times. It's because I am
exploring new avenues at the moment and things are likely to change,
plus I need to know, and change, the details quite frequently.
Although I do make a note of these long one-liners (and keep a lot of
history in my shells), I tend to only wrap things in a (bash) script
once they become stable.



Note:

Following the links from the Wikipedia page, it seems that the -P
option to xargs is not mandated by The Open Group
(http://www.opengroup.org/onlinepubs/9699919799/utilities/xargs.html),
and does not appear in Sun's version of xargs
(http://docs.sun.com/app/docs/doc/816-5165/xargs-1?a=view). I work in
a mixed Solaris/Linux environment and have been using Unix for many
years, which makes it hard to keep up with the GNU extensions to what
were once familiar commands. But such is progress. Bring it on!

Wednesday, September 02, 2009

J for a Py Guy

I just started reading a little bit more about the J programming language, after a lot of J activity on RC. I read the foreword to "J for C Programmers", where I picked out an acknowledgement to Ken Iverson, whose name rang a bell.



It seems that the J language is from the APL family tree, but uses the ASCII character set and adds later ideas from function-level programming. It is said to be good for MIMD machines, so I wonder how well it works on today's cheaper four-to-sixteen-cores-in-a-box machines, or whether it needs massively MIMD machines to really shine?



It has tickled my fancy. I'll read some more of J for C.