Go deh!: September 2009

Natural sorting orders items in some way that is more than just the
order dictated by computer character codes. It is called natural
because the new ordering is supposed to mean more to a human consumer
than an ordering on base character codes .

I guess, what is becoming the canonical example is the ordering of
numbered files in a directory listing. A normal sorted directory
listing might show:

style="font-family: monospace;">bash$ for (( i=0;
i<111;i+=5)); do touch f$i.txt; done style="font-family: monospace;">
bash$ ls -1 style="font-family: monospace;">
f0.txt style="font-family: monospace;">
f10.txt style="font-family: monospace;">
f100.txt style="font-family: monospace;">
f105.txt style="font-family: monospace;">
f110.txt style="font-family: monospace;">
f15.txt style="font-family: monospace;">
f20.txt style="font-family: monospace;">
f25.txt style="font-family: monospace;">
f30.txt style="font-family: monospace;">
f35.txt style="font-family: monospace;">
f40.txt style="font-family: monospace;">
f45.txt style="font-family: monospace;">
f5.txt style="font-family: monospace;">
f50.txt style="font-family: monospace;">
f55.txt style="font-family: monospace;">
f60.txt style="font-family: monospace;">
f65.txt style="font-family: monospace;">
f70.txt style="font-family: monospace;">
f75.txt style="font-family: monospace;">
f80.txt style="font-family: monospace;">
f85.txt style="font-family: monospace;">
f90.txt style="font-family: monospace;">
f95.txt

It might be more friendly if the files were ordered so that the numeric
section increased steadily.

In other situations, such as in Movie listings, you sometimes see
common words such as "the" ignored for sorting purposes (or even
transferred to the end of the item so "The Silence of the Lambs" would
be listed as "Silence of the Lambs, The")

Ignoring case, and treating white-space characters as the same - and
multiple strings of whitespace characters as a single whitespace, might
also be useful.

Since my initial tentative steps into working with non-ASCII characters
I tried to consider sorting accented characters as if the accent was
not present; ligatures as separate characters, and made a stab at
tackling what I call 'replacements' - where certain
characters, such as the German href="http://en.wikipedia.org/wiki/Scharfes_S">scharfes S
might be sorted as SS,.

I had a quick stab at creating a natural sort routine and
added several natural sorting abilities. They might not all be useful
at once, but it should be straight forward to pick and choose what you
want n your natural sort..

As always, when working with non-ASCII character sets, I find it
difficult to translate my working multi-byte Python program to HTML,
but there should be enough surrounding information to ensure that any
multi-byte character can be worked-out.

  1 
 color="#804040">  2 # -*- coding: utf-8 -*-
 color="#804040">  3 # Not Python 3.x (Can't compare str and int)
 color="#804040">  4 
 color="#804040">  5 '''
 color="#804040">  6     Natural sorting: Sorting of text that does more than rely on the
 color="#804040">  7     order of individual characters codes to make the finding of
 color="#804040">  8     individual strings easier for a human reader.
 color="#804040">  9     
 color="#804040"> 10     Some 'natural' orderings might include:
 color="#804040"> 11         1) Ignore leading, trailing and multiple adjacent spaces
 color="#804040"> 12         2) All space types equivalent.
 color="#804040"> 13         3) Sorting without regard to case.
 color="#804040"> 14         4) Sorting numeric portions of strings in numeric order:
 color="#804040"> 15             foo9.txt before foo10.txt
 color="#804040"> 16             As well as ... x9y99 before x9y100, before x10y0
 color="#804040"> 17             ... (for any number of groups of integers in a string).
 color="#804040"> 18         5) Title sorts: without regard to a leading, very common, word such
 color="#804040"> 19            as 'The' in "The thirty-nine steps".
 color="#804040"> 20         6) Sort letters without regard to accents.
 color="#804040"> 21         7) Sort ligatures as separate letters.
 color="#804040"> 22         8) Replacements:
 color="#804040"> 23             Sort german scharfes S (ÃŸ) as ss
 color="#804040"> 24             Sort Å¿,  LATIN SMALL LETTER LONG S as s
 color="#804040"> 25             Sort Ê’,  LATIN SMALL LETTER EZH as s
 color="#804040"> 26             ...
 color="#804040"> 27 '''
 color="#804040"> 28 
 color="#804040"> 29 from itertools  color="#a020f0">import groupby
 color="#804040"> 30 from unicodedata  color="#a020f0">import decomposition, name
 color="#804040"> 31 
 color="#804040"> 32 commonleaders = [' color="#ff00ff">the'] # lowercase leading words to ignore
 color="#804040"> 33 replacements = {' color="#ff00ff">ÃŸ': 'ss',   color="#0000ff"># Map single char to replacement string
 color="#804040"> 34                 ' color="#ff00ff">Å¿': 's',
 color="#804040"> 35                 ' color="#ff00ff">Ê’': 's',
 color="#804040"> 36                 }
 color="#804040"> 37 
 color="#804040"> 38 hexdigits = set(' color="#ff00ff">0123456789abcdef')
 color="#804040"> 39 decdigits = set(' color="#ff00ff">0123456789')    color="#0000ff"># Don't use str.isnumeric
 color="#804040"> 40 
 color="#804040"> 41 def  color="#008080">splitchar(c):
 color="#804040"> 42     ' color="#ff00ff"> De-ligature. De-accent a char'
 color="#804040"> 43     de = decomposition(c)
 color="#804040"> 44     if de:
 color="#804040"> 45          color="#0000ff"># Just the words that are also hex numbers
 color="#804040"> 46         de = [d  color="#804040">for d  color="#804040">in de.split()
 color="#804040"> 47                    color="#804040">if all(c.lower()
 color="#804040"> 48                           color="#804040">in hexdigits  color="#804040">for c  color="#804040">in d)]
 color="#804040"> 49         n = name(c, c).upper()
 color="#804040"> 50          color="#0000ff"># (Gosh it's onerous)
 color="#804040"> 51          color="#804040">if len(de)> 1  color="#804040">and ' color="#ff00ff">PRECEDE'  color="#804040">in n:
 color="#804040"> 52              color="#0000ff"># E.g. Å‰  LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
 color="#804040"> 53             de[1], de[0] = de[0], de[1]
 color="#804040"> 54         tmp = [ unichr(int(k, 16))  color="#804040">for k  color="#804040">in de]
 color="#804040"> 55         base, others = tmp[0], tmp[1:]
 color="#804040"> 56          color="#804040">if ' color="#ff00ff">LIGATURE'  color="#804040">in n:
 color="#804040"> 57              color="#0000ff"># Assume two character ligature
 color="#804040"> 58             base += others.pop(0)
 color="#804040"> 59     else:
 color="#804040"> 60         base = c
 color="#804040"> 61     return base
 color="#804040"> 62 
 color="#804040"> 63 
 color="#804040"> 64 def  color="#008080">sortkeygen(s):
 color="#804040"> 65     ''' color="#ff00ff">Generate 'natural' sort key for s
 color="#804040"> 66 
 color="#804040"> 67     Doctests:
 color="#804040"> 68         >>> sortkeygen('  some extra    spaces  ')
 color="#804040"> 69         [u'some extra spaces']
 color="#804040"> 70         >>> sortkeygen('CasE InseNsItIve')
 color="#804040"> 71         [u'case insensitive']
 color="#804040"> 72         >>> sortkeygen('The Wind in the Willows')
 color="#804040"> 73         [u'wind in the willows']
 color="#804040"> 74         >>> sortkeygen(u' color="#6a5acd">\462 ligature')
 color="#804040"> 75         [u'ij ligature']
 color="#804040"> 76         >>> sortkeygen(u' color="#6a5acd">\335\375 upper/lower case Y with acute accent')
 color="#804040"> 77         [u'yy upper/lower case y with acute accent']
 color="#804040"> 78         >>> sortkeygen('foo9.txt')
 color="#804040"> 79         [u'foo', 9, u'.txt']
 color="#804040"> 80         >>> sortkeygen('x9y99')
 color="#804040"> 81         [u'x', 9, u'y', 99]
 color="#804040"> 82     '''
 color="#804040"> 83     # Ignore leading and trailing spaces
 color="#804040"> 84     s = unicode(s).strip()
 color="#804040"> 85     # All space types are equivalent
 color="#804040"> 86     s = ' color="#ff00ff"> '.join(s.split())
 color="#804040"> 87     # case insentsitive
 color="#804040"> 88     s = s.lower()
 color="#804040"> 89     # Title
 color="#804040"> 90     words = s.split()
 color="#804040"> 91     if len(words) > 1  color="#804040">and words[0]  color="#804040">in commonleaders:
 color="#804040"> 92         s = ' color="#ff00ff"> '.join( words[1:])
 color="#804040"> 93     # accent and ligatures
 color="#804040"> 94     s = ''.join(splitchar(c)  color="#804040">for c  color="#804040">in s)
 color="#804040"> 95     # Replacements (single char replaced by one or more)
 color="#804040"> 96     s = ''.join( replacements.get(ch, ch)  color="#804040">for ch  color="#804040">in s )
 color="#804040"> 97     # Numeric sections as numerics
 color="#804040"> 98     s = [ int("".join(g))  color="#804040">if isinteger  color="#804040">else "".join(g)
 color="#804040"> 99            color="#804040">for isinteger,g  color="#804040">in groupby(s,  color="#804040">lambda x: x  color="#804040">in decdigits)]
 color="#804040">100 
 color="#804040">101     return s
 color="#804040">102 
 color="#804040">103 def  color="#008080">naturalsort(items):
 color="#804040">104     ''' color="#ff00ff"> Naturally sort a series of strings
 color="#804040">105 
 color="#804040">106     Doctests:
 color="#804040">107         >>> naturalsort(['The Wind in the Willows','The 40th step more',
 color="#804040">108                          'The 39 steps', 'Wanda'])
 color="#804040">109         ['The 39 steps', 'The 40th step more', 'Wanda', 'The Wind in the Willows']
 color="#804040">110 
 color="#804040">111     '''
 color="#804040">112     return sorted(items, key=sortkeygen)
 color="#804040">113 
 color="#804040">114 
 color="#804040">115 



 color="#804040">116 # Extras...
 color="#804040">117 
 color="#804040">118 '''
 color="#804040">119 Help on built-in function decomposition in module unicodedata:
 color="#804040">120 
 color="#804040">121 decomposition(...)
 color="#804040">122     decomposition(unichr)
 color="#804040">123     
 color="#804040">124     Returns the character decomposition mapping assigned to the Unicode
 color="#804040">125     character unichr as string. An empty string is returned in case no
 color="#804040">126     such mapping is defined.
 color="#804040">127 
 color="#804040">128 Help on built-in function name in module unicodedata:
 color="#804040">129 
 color="#804040">130 name(...)
 color="#804040">131     name(unichr[, default])
 color="#804040">132     Returns the name assigned to the Unicode character unichr as a
 color="#804040">133     string. If no name is defined, default is returned, or, if not
 color="#804040">134     given, ValueError is raised.
 color="#804040">135 '''
 color="#804040">136 
 color="#804040">137 
 color="#804040">138 def  color="#008080">_temp():
 color="#804040">139     # for c in u'\370\157\117\463\462\511\733\337\337':
 color="#804040">140     for c  color="#804040">in u' class="Constant">Ã¸oOÄ³Ä²Å‰ÇÃŸÃÃ½':
 color="#804040">141         de = decomposition(c)
 color="#804040">142          color="#804040">if de:
 color="#804040">143             de = [d  color="#804040">for d  color="#804040">in de.split()
 color="#804040">144                        color="#804040">if all(c.upper()
 color="#804040">145                           color="#804040">in ' color="#ff00ff">0123456789ABCDEF'  color="#804040">for c  color="#804040">in d)]
 color="#804040">146             n = name(c, c).upper()
 color="#804040">147              color="#804040">if len(de)> 1  color="#804040">and ' color="#ff00ff">PRECEDE'  color="#804040">in n:
 color="#804040">148                  color="#0000ff"># E.g. Å‰  LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
 color="#804040">149                 de[1], de[0] = de[0], de[1]
 color="#804040">150             tmp = [ unichr(int(k, 16))  color="#804040">for k  color="#804040">in de]
 color="#804040">151             base, others = tmp[0], tmp[1:]
 color="#804040">152          color="#804040">else:
 color="#804040">153             base, others = c, []
 color="#804040">154          color="#804040">print(" color="#ff00ff">%s %o %s\n color="#ff00ff">  %s  %s" %
 color="#804040">155                   (c,ord(c), name(c,' color="#ff00ff">-'),
 color="#804040">156                    name(base, base),
 color="#804040">157                    ' color="#ff00ff"> '.join(name(o,o)  color="#804040">for o  color="#804040">in others)))
 color="#804040">158 
 color="#804040">159 
 color="#804040">160

I am trying out a dual chip Xeon server with 8 cores that shows up as
16 CPU's to Redhat Linux. I have my href="http://paddy3118.blogspot.com/search?q=process">process
runner script, that I use with a list of 100000 simulations
to do overnight and have experimented with running different numbers of
simulations in parallel to get good throughput.

The process runner is good, I still use it, but I also wanted something
for more general command-line use. for example, I have around 750
directories, each containing files relating to one test. I organize the
directories to have regular names, and am used to using shell for loops
to process the dataset.

For example, to gzip all the */transcript files I would do:

foreach f (*/transcript)
    gzip $f
end

(Yep, I know it's cshell, but that is the 'standard' at work.)

This isn't making much use of the processing power at hand, so I went
googling for a lightweight way to run such tasks in parallel.

I found two methods. One, to use make and its -P option to run several
jobs in parallel would need the creation of a make file in each case,
which is onerous. The other, which was new to me, rather than a
forgotten feature, is that xargs has a -P option that does a similar
thing - in combination with other options it will run up to a
given number of jobs in parallel:

/bin/ls */transcript \
  | xargs -n1 -iX -P16 gzip X

The above lists all the files, pipes them to xargs which takes them one
at a time (-n1) to form a job by substituting all occurrences of
character X (-iX) in its arguments after its options. a maximum of 16
(-P16) jobs are then run at any one time.

When I want to simulate all 750 tests and create a coverage listing of
each test I use xargs to call the bash shell and write something like:

/bin/ls -d * \
 | xargs -n1 -iX -P16 bash \
 -c 'cd X && vsim tb -c -coverage -do "coverage save -onexit -code t coverX.ucdb;do ../run.do" 2>&1 >/dev/null && vcover report -file coverX.rpt coverX.ucdb 2>&1 >/dev/null'

Yep, I do create such long monstrosities at times. It's because I am
exploring new avenues at the moment, and things are likely to change,
plus I need to know, and change, the details quite frequently. Although
i do make a note of these long one-liners, (and keep a lot of history
in my shells), I tend to only wrap things in a (bash), script when it
becomes stable.

Note:

Following the links from the Wikipedia page, it seems that the -P
option to xargs is not mandated by href="http://www.opengroup.org/onlinepubs/9699919799/utilities/xargs.html">The
Open Group, and does not appear in href="http://docs.sun.com/app/docs/doc/816-5165/xargs-1?a=view">Suns
version of xargs. I work in a mixed Solaris/Linux environment
and have been using Unix for many years which makes it hard to keep up
with the GNU extensions to what were familiar commands. But such is
progress. Bring it on!

Go deh!

Thursday, September 24, 2009

Natural Sorting

Monday, September 14, 2009

Easy Command-line Parallelism

Wednesday, September 02, 2009

J for a Py Guy

About Me

Followers

Subscribe Now: google

Go deh too!

whos.amung.us

Blog Archive