- Does a language have first-class functions if it can use its functions to solve particular problems in a certain manner, or
- Does a language have first-class functions if functions are just as flexible as other types in the language itself?
Thursday, December 03, 2009
What is a first-class function?
There is a debate about it on Wikipedia, and I am curious:
Saturday, November 28, 2009
Hamming Numbers (and Perl: Why I just ignored it)
I had read the following post: "Perl: Love it, or hate it, but don’t ignore
it.", by Phillip Smith, in which the author wishes Perl
solutions were credited as much as solutions in other languages such as
Python and Ruby.
I agreed that Perl wasn't mentioned as much as it was, and a day or so
later I came across an example of why....
I monitor whats happening in Python, and came across a new recipe added
to Activestate code called: "Recipe 576961: Technique for cyclical
iteration", by Raymond Hettinger. On first, second, and third
readings, i just couldn't understand the need for the deferred_output
function; the whole recipe remained opaque:
Raymond's
code:
I was hooked by the series however and so went looking for more
information.
I came across a mention to the following page: "Infinite
lists in Perl" by Mark Dominus which is copyright 1997 by the
Perl Journal. Normally I wouldn't mention the article, as it spends
more time creating an implementation of generators in Perl as it does
on addressing the computation of Hamming numbers. It also uses the term
"stream" instead of generators. but maybe this is down to the age of
the article, i.e. 1997?
Looking a bit further, I found this Dr Dobbs Article, and this
commented Haskell
solution:
The Haskell solution lead me to think that maybe Python has an issue
with defining recursive generators, and some experimentation lead me to
believe that yep, maybe that was why Raymond has the deferred_output
function.
Since I have trouble understanding Raymond s I went away and wrote my
own solution. It works. It can generate very large values of the
sequence quite quickly. But it does need memory to store
multiples:
Generating the following large values of the sequence took only seconds:
Well, I've learnt a bit more about generators, the limitatins in their
Python implementation w.r.t. Haskel, and I've put more meat on why I
tend to not quote much Perl In this case, i couldn't see the wood for
the trees! (But I do quote Perl).
it.", by Phillip Smith, in which the author wishes Perl
solutions were credited as much as solutions in other languages such as
Python and Ruby.
I agreed that Perl wasn't mentioned as much as it was, and a day or so
later I came across an example of why....
I monitor whats happening in Python, and came across a new recipe added
to Activestate code called: "Recipe 576961: Technique for cyclical
iteration", by Raymond Hettinger. On first, second, and third
readings, i just couldn't understand the need for the deferred_output
function; the whole recipe remained opaque:
Raymond's
code:
from itertools import tee, chain, islice
from heapq import merge
def hamming_numbers():
# Generate "5-smooth" numbers, also called "Hamming numbers"
# or "Regular numbers". See: http://en.wikipedia.org/wiki/Regular_number
# Finds solutions to 2**i * 3**j * 5**k for some integers i, j, and k.
def no_repeats(it):
last = None
for i in it:
if i is not last:
yield i
last = i
def deferred_output():
for i in output:
yield i
result, p2, p3, p5 = tee(deferred_output(), 4)
m2 = (2*x for x in p2) # multiples of 2
m3 = (3*x for x in p3) # multiples of 3
m5 = (5*x for x in p5) # multiples of 5
merged = merge(m2, m3, m5)
combined = chain([1], merged) # prepend a starting point
output = no_repeats(combined) # eliminate duplicates
return result
if __name__ == '__main__':
print(list(islice(hamming_numbers(), 1000))) # Show first 1000 hamming numbers
I was hooked by the series however and so went looking for more
information.
Perl
I came across a mention to the following page: "Infinite
lists in Perl" by Mark Dominus which is copyright 1997 by the
Perl Journal. Normally I wouldn't mention the article, as it spends
more time creating an implementation of generators in Perl as it does
on addressing the computation of Hamming numbers. It also uses the term
"stream" instead of generators. but maybe this is down to the age of
the article, i.e. 1997?
Haskell
Looking a bit further, I found this Dr Dobbs Article, and this
commented Haskell
solution:
hamming = 1 : map (2*) hamming `merge` map (3*) hamming `merge` map (5*) hamming
where merge (x:xs) (y:ys)
| x < y = x : xs `merge` (y:ys)
| x > y = y : (x:xs) `merge` ys
| otherwise = x : xs `merge` ys
The Haskell solution lead me to think that maybe Python has an issue
with defining recursive generators, and some experimentation lead me to
believe that yep, maybe that was why Raymond has the deferred_output
function.
My Python
Since I have trouble understanding Raymond s I went away and wrote my
own solution. It works. It can generate very large values of the
sequence quite quickly. But it does need memory to store
multiples:
from itertools import islice
from pprint import pprint as pp
def regularnumbergen(start = 1, multipliers=(2, 3, 5)):
'''\
Generates the Regular Number sequence defined here:
http://en.wikipedia.org/wiki/Regular_number
The sequence starts with 1 and consists of all multiples by 2, 3, or 5
of any number in the sequence in non-repeating, increasing order.
'''
multiples = [[] for i in multipliers]
this = start
while True:
yield this
# get multiples
for multiple, multiplier in zip(multiples, multipliers):
multiple.append(this * multiplier)
# get next
this = min(m[0] for m in multiples)
# remove this from multiples
for multiple in multiples:
if multiple[0] == this:
del multiple[0]
if __name__ == '__main__':
print(list(islice(regularnumbergen(), 10)))
Generating the following large values of the sequence took only seconds:
>>> print(list(islice(regularnumbergen(), 99999, 100001)))
[290142196707511001929482240000000000000, 290237644800000000000000000000000000000]
>>>
Wrapup
Well, I've learnt a bit more about generators, the limitatins in their
Python implementation w.r.t. Haskel, and I've put more meat on why I
tend to not quote much Perl In this case, i couldn't see the wood for
the trees! (But I do quote Perl).
Thursday, November 12, 2009
Curious about Go.
Verilog is to SytemVerilog as C is to C++ .
Verilog is to __________ as C is to Go ?
My main criticism of SytemVerilog was the size of the language compared to Verilog. Maybe we'll get SystemGo in five years time. (Or maybe Go itself will warp into some monstrosity; my crystal ball needs a new backlight).
Verilog is to __________ as C is to Go ?
My main criticism of SytemVerilog was the size of the language compared to Verilog. Maybe we'll get SystemGo in five years time. (Or maybe Go itself will warp into some monstrosity; my crystal ball needs a new backlight).
Wednesday, November 04, 2009
More Words From Hex Letters (with extra zero's)
STOP PRESS! Someone commented that zeros could be used as ones so an update is at the bottom.
It was pointed out to me that I needn't restrict myself to four character words, and could use the number '1' as letter 'l', and the number '5' as the letter 'S'.
So here is the program and output of words of four or more characters that can be made from hexadecimal characters.
The list of words found have some gems, such as babble, for when you know your generating data that no one will read; baffle, baseball, blab, bleed, bless, (and blessed if it were in the dictionary), cabal, cascade, cease, debacle, decade, decease, (and should have deceased), false, flee, label, ...
(Someone at work suggested we use 1abe1 or label to prefix a section of writes to a peripheral in an assembler prog which got me going).
The words:
Additional Words with an 'o'
Without Michaels comment I would have skipped all words that use zero for an o, so I made the following two changes:
And created the following extra words:
It was pointed out to me that I needn't restrict myself to four character words, and could use the number '1' as letter 'l', and the number '5' as the letter 'S'.
So here is the program and output of words of four or more characters that can be made from hexadecimal characters.
import urllib
words = urllib.urlopen('http://www.puzzlers.org/pub/wordlists/unixdict.txt'
).read().split()
#len(words) == 25104
def findwords(hexchars):
hexchars = set(hexchars)
hexwords = [w for w in words if 4 <= len(w) and all(ch in hexchars for ch in w)]
return hexwords
def printwords(hexwords):
for i in range(0, len(hexwords), 5):
print (' ' + ' '.join("%-8s" % w for w in hexwords[i:i+5]))
hexchars = 'abcdef'
#hexwords = findwords(hexchars)
#print "\nHEXWORDS"
#printwords(hexwords)
# Now to use one: 1 as ell: l; and five: 5 as ess: s
hexwords = findwords(hexchars + 'ls')
hexwords = [word.replace('s','5').replace('l','1') for word in hexwords]
print "\nHEXWORDS USING one: 1 AS ell: l; AND five: 5 AS ess: s"
printwords(hexwords)
The list of words found have some gems, such as babble, for when you know your generating data that no one will read; baffle, baseball, blab, bleed, bless, (and blessed if it were in the dictionary), cabal, cascade, cease, debacle, decade, decease, (and should have deceased), false, flee, label, ...
(Someone at work suggested we use 1abe1 or label to prefix a section of writes to a peripheral in an assembler prog which got me going).
The words:
HEXWORDS USING one: 1 AS ell: l; AND five: 5 AS ess: s
aaa5 ababa aba5e abba5 abbe
abed abe1 ab1e ab5ce55 accede
acce55 added add1e ade1e affab1e
a1ba a1ec a1fa1fa a11e1e a55e55
babb1e babe babe1 bade baff1e
ba1d ba1e ba11 ba11ad ba11ed
ba15a ba5a1 ba5e ba5eba11 ba5e1
ba55 bead bead1e beebe beef
befa11 befe11 be1a be11 be11a
be11e be55 be55e1 b1ab b1ade
b1ed b1eed b1e55 caba1 cab1e
cafe ca1eb ca1f ca11 ca11a
ca5cade ca5e cea5e cede ce1ebe5
ce11 c1ad c1a55 dabb1e dacca
dada dade da1e da11a5 dead
deaf dea1 debac1e deba5e decade
deca1 decca decea5e deed deface
de11 de11a ea5e ea5e1 ecc1e5
efface effaceab1e e1ba e11a e15e
fab1e facade face fade fa11
fa15e feeb1e feed fee1 fe11
f1ea f1ed f1ee f1eece 1abe1
1ace 1ad1e 1a5e 1a55 1ead
1eaf 1ea5e 1eed5 1e55 1e55ee
5ab1e 5accade 5add1e 5afe 5a1ad
5a1e 5a11e 5cab 5ca1a 5ca1d
5ca1e 5ea1 5ecede 5eeab1e 5eed
5eedbed 5e1f 5e11 51ab 51ed
Additional Words with an 'o'
Without Michaels comment I would have skipped all words that use zero for an o, so I made the following two changes:
hexwords = findwords(hexchars + 'lso')
hexwords = [word.replace('s','5').replace('l','1').replace('o', '0') for word in hexwords]
And created the following extra words:
ab0de acc01ade ad0be a1c0a a110cab1e
a10e a100f a150 ba1b0a ba550
b10b b10c b100d b0bb1e b0ca
b0de b01d b01dface b01e b010
b05e b055 caca0 c10d c105e
c0a1 c0a1e5ce c0bb c0bb1e c0b01
c0ca c0c0 c0c0a c0da c0dd1e
c0de c0ed c0ffee c01a c01d
c01e c01055a1 c001 c05ec dec0de
d0bb5 d0dd d0d0 d0ff d01ce
d01e d011 d00d1e d05e ec01e
ee0c fa110ff f10c f10e f100d
f0a1 f0ca1 f01d f00d f001
f055 1a05 1a550 10ad 10af
10be 10b0 10ca1 10ca1e 10eb
10e55 101a 1011 1005e 1005e1eaf
105ab1e 105e 1055 0a5e5 0be5e
0b0e 0b5e55 0de55a 0ffa1 0ff10ad
0ff5add1e 01af 00d1e5 0510 5caff01d
5c0ff 5c01d 5eaf00d 510b 510e
50da 50fa 501ace 501d 501e
5010
Sunday, November 01, 2009
The 24 game
I've just created a new task over on Rosetta Code for the 24 game, and a separate task to create a solver for the game.
It's a brute -force solver, but I find the game to be playable, so will get my sprogs to test their arithmetic skills with it later.
I've just realised that the solver could be easily modified to solve "The numbers game" from the UK Channel 4 program Countdown. (Although their could be an issue with division as, when given the digits 3 3 7 7, the solver finds the solution: ( 3 + 3 / 7 ) * 7 which may not be allowed by the countdown rules which state that "only whole numbers may be used").
It's a brute -force solver, but I find the game to be playable, so will get my sprogs to test their arithmetic skills with it later.
I've just realised that the solver could be easily modified to solve "The numbers game" from the UK Channel 4 program Countdown. (Although their could be an issue with division as, when given the digits 3 3 7 7, the solver finds the solution: ( 3 + 3 / 7 ) * 7 which may not be allowed by the countdown rules which state that "only whole numbers may be used").
Sunday, October 18, 2009
Template for forking a Python program
I am doing a large regression run of some 30,000 simulations producing aroung 30 Gigs of logs to extract test parameter and simulation result from each.
I have a program that goes through each log in turn extracting results and this may take tens of minutes to complete.
A future optimisation will be to extract pass/fail data for each regression run in the same job that ran the simulation, but at present, there may be changes to the pass/fail criteria, so it is run as a separate task after the regression simulations are all done.
I normally run the log extraction process on an 8core, 16 thread machine with a fast connection to the file server, so was thinking of ways to parallelise the task.
Enter this article, part of Doug Hellmann's PyMOTW series, about forking a process.
After running it, I decided to expand on it to be more like my situation where you have a list of similar tasks to perform and want to split it between multiple processes via the Posix fork() call. The following is just a framework - I hope to get time to acctually use it and see what the problems with it are.
Running the above on cygwin produced:
Notice how each child gets to do a fair poition of the jobs, (assuming all jobs need the same resources).
I have a program that goes through each log in turn extracting results and this may take tens of minutes to complete.
A future optimisation will be to extract pass/fail data for each regression run in the same job that ran the simulation, but at present, there may be changes to the pass/fail criteria, so it is run as a separate task after the regression simulations are all done.
I normally run the log extraction process on an 8core, 16 thread machine with a fast connection to the file server, so was thinking of ways to parallelise the task.
Enter this article, part of Doug Hellmann's PyMOTW series, about forking a process.
After running it, I decided to expand on it to be more like my situation where you have a list of similar tasks to perform and want to split it between multiple processes via the Posix fork() call. The following is just a framework - I hope to get time to acctually use it and see what the problems with it are.
1 import os
2 import sys
3 import time
4
5 maxprocs = 4 # procs - 1 children and the parent to do the work
6 jobarray = range(25) # Work to be split between proccesses
7
8 def do1job(job):
9 'Process one item from the jobarray'
10 return job
11
12 def picklesave(results):
13 'save partialresults to file'
14 time.sleep(3)
15
16 def unpickleread(proc):
17 'read partial results previousely saved by process number procc'
18 return [] # dummy
19
20 def dowork(proc, maxprocs=maxprocs, jobarray=jobarray):
21 ' Split jobarray tasks between maxprocs by proci number'
22 time.sleep(proc) # convenience processing time
23 results = []
24 for count, job in enumerate(jobarray):
25 if count % maxprocs == proc:
26 results.append(do1job(job))
27 picklesave(results)
28 print ("Process: %i completed jobs: %s" % (proc, results))
29 return results
30
31
32 def createworkerchildren(maxprocs=maxprocs):
33 for proc in range(1, maxprocs):
34 print 'PARENT: Forking %s' % proc
35 worker_pid = os.fork()
36 if not worker_pid:
37 'This is a child process'
38 dowork(proc)
39 # Children exit here!
40 sys.exit(proc)
41
42 # Start and don't wait for child proccesses to do their work
43 createworkerchildren()
44
45 # Do parent processes share of the work
46 results = []
47 results += dowork(0) # don't have to pickle/unpickle Parent processes share
48
49 # wait for children
50 for proc in range(1, maxprocs):
51 print 'PARENT: Waiting for any child'
52 done = os.wait()
53 print 'PARENT: got child', done
54
55 # read and integrate child results
56 for proc in range(1, maxprocs):
57 results += unpickleread(proc)
58
59 # Calculate and print data summary
60 pass
61
Running the above on cygwin produced:
bash$ python testfork.py
PARENT: Forking 1
PARENT: Forking 2
PARENT: Forking 3
Process: 0 completed jobs: [0, 4, 8, 12, 16, 20, 24]
PARENT: Waiting for any child
Process: 1 completed jobs: [1, 5, 9, 13, 17, 21]
PARENT: got child (5536, 256)
PARENT: Waiting for any child
Process: 2 completed jobs: [2, 6, 10, 14, 18, 22]
PARENT: got child (3724, 512)
PARENT: Waiting for any child
Process: 3 completed jobs: [3, 7, 11, 15, 19, 23]
PARENT: got child (4236, 768)
bash$
PARENT: Forking 1
PARENT: Forking 2
PARENT: Forking 3
Process: 0 completed jobs: [0, 4, 8, 12, 16, 20, 24]
PARENT: Waiting for any child
Process: 1 completed jobs: [1, 5, 9, 13, 17, 21]
PARENT: got child (5536, 256)
PARENT: Waiting for any child
Process: 2 completed jobs: [2, 6, 10, 14, 18, 22]
PARENT: got child (3724, 512)
PARENT: Waiting for any child
Process: 3 completed jobs: [3, 7, 11, 15, 19, 23]
PARENT: got child (4236, 768)
bash$
Notice how each child gets to do a fair poition of the jobs, (assuming all jobs need the same resources).
Sunday, October 11, 2009
Extend functools.partial to nested function calls?
I am reading the 1977 ACM Turing Award Lecture "Can Programming Be Liberated from the von Neumann Style? A Functional Style and Its Algebra of Programs" by John Backus.
In section 5.1 he is contrasting a von Neumann program for inner product, which he gives as:
Well, the Python works, but it still has this name x that I would like to eliminate.
I think I can go part of the way by using functools.partial. For example, for the single function call of map(mul, *x) I can do:
All help would be appreciated, Thanks.
In section 5.1 he is contrasting a von Neumann program for inner product, which he gives as:
c := 0The above converts quit nicely to the python equivalent:
for i := 1 step 1 until n do
c := c + c[i] x b[i]
>>> v1, v2 = [1,2,3], [6,5,4]Backus then goes on to contrast the von Neumann style with his functional program:
>>> def ip_von_neumann(a, b):
c = 0
for i in range(3):
c = c + a[i] * b[i]
return c
>>> ip_von_neumann(v1, v2)
28
Def Innerproduct ≡ (Insert +)o(ApplyToAll x)oTransposeWhich I converted to the following functional style Python:
>>> from operator import add, mul
>>> from functools import reduce as insert
>>>
>>> def ip_functional(*x):
return insert(add, map(mul, *x))
>>> ip_functional(v1, v2)
28
Well, the Python works, but it still has this name x that I would like to eliminate.
I think I can go part of the way by using functools.partial. For example, for the single function call of map(mul, *x) I can do:
>>> from functools import partialBut I don't know how to go that extra step and remove x from the definition fully, ie end up with just some assignment like:
>>> p1 = partial(map, mul)
>>> assert p1(v1, v2) == map(mul, *(v1, v2))
>>> map(mul, *(v1, v2))
[6, 10, 12]
>>>
ip_purefunctional = {A pure function of functions without any argument placeholders}
All help would be appreciated, Thanks.
Friday, October 02, 2009
The Otter Algorithm
I was reading about a Richard Dawkins program that through random mutation and selection of best fit, would gradually converge a random string into a target string. It is called the Weasel Program,and a rough outline of it is:
I decided to change things.
It seemed to me that if you are further away from the ideal, then you should change more letters; and, after seeing the score for how good the current best string is sometimes decrease, I made a modification to choose the best of the copies and the best string that they are mutations of.
With that modification, convergence was swift.
Oh, I call it the Otter program, as it is more than the Weasel.
The prog
Function score at line 19 uses the fact that booleans True and False are also numbers and calculates how many characters are in the right position as its result.
Line 23 modifies the amount of randomness injected, based on how bad the last highscore was: Change more if you are further away from the target - less as you approach.
Name thischance is used in line 28. random returns a float between 0.0 and 1.0. if it is less than thischance then a particular character is not changed.
The function for thischance in line 23 is equal to chance when no characters match, and 1.0 if all characters were to match. This forces less change as you approach the target string.
Here is a little snippet to show how thischance varies with the score (sc):
P.S. The post is only tangentially linked to Darwinism. I am secure in the knowledge that my children's school teaches it, and I am not really interested in it being mentioned in any comments. Treat it just as a way to go from a random choice of letters and home in on a target string by repeated random alterations, and selection of the "best".
- Start with a random string of 28 characters
- Copy this string 100 times, with a 5% chance per character of that character being replaced with a random character
- Compare each new string with the target "METHINKS IT IS LIKE A WEASEL", and give each a score
- If any of the new strings has a perfect score, halt
- Otherwise, take the highest scoring string, and go to step 2
I decided to change things.
It seemed to me that if you are further away from the ideal, then you should change more letters; and, after seeing the score for how good the current best string is sometimes decrease, I made a modification to choose the best of the copies and the best string that they are mutations of.
With that modification, convergence was swift.
Oh, I call it the Otter program, as it is more than the Weasel.
The prog
1 2 ''' 3 From: http://en.wikipedia.org/wiki/Weasel_program 4 With modifications 5 ''' 6 7 from string import ascii_uppercase 8 from random import choice, random 9 10 target = list("METHINKS IT IS LIKE A WEASEL") 11 chance = .05 12 times = 100 13 14 charset = ascii_uppercase + ' ' 15 iterations = 0 16 highscore = [choice(charset) for _ in range(len(target))] 17 perfectscore = len(target) 18 19 def score(trial): 20 return sum(t==h for t,h in zip(trial, target)) 21 22 while highscore != target: 23 thischance = 1-((perfectscore - score(highscore)) / perfectscore * (1 - chance)) 24 #thischance = chance 25 iterations += 1 26 if iterations % 50 == 0: 27 print (iterations, score(highscore), ''.join(highscore)) 28 copies = [ [(ch if random() <= thischance else choice(charset)) for ch in highscore] 29 for _ in range(times) ] + [highscore] 30 highscore = max(copies, key=score) 31 print (iterations, ''.join(highscore))
Function score at line 19 uses the fact that booleans True and False are also numbers and calculates how many characters are in the right position as its result.
Line 23 modifies the amount of randomness injected, based on how bad the last highscore was: Change more if you are further away from the target - less as you approach.
Name thischance is used in line 28. random returns a float between 0.0 and 1.0. if it is less than thischance then a particular character is not changed.
The function for thischance in line 23 is equal to chance when no characters match, and 1.0 if all characters were to match. This forces less change as you approach the target string.
Here is a little snippet to show how thischance varies with the score (sc):
>>> for sc in range(0,29,7): print( "score=%2i, thischance=%5.2f" % (sc, 1-((perfectscore - sc) / perfectscore * (1 - chance))) ) score= 0, thischance= 0.05 score= 7, thischance= 0.29 score=14, thischance= 0.53 score=21, thischance= 0.76 score=28, thischance= 1.00
In line 29 the addition of highscore into copies means that the max taken in line 30 can never decrease.
P.S. The post is only tangentially linked to Darwinism. I am secure in the knowledge that my children's school teaches it, and I am not really interested in it being mentioned in any comments. Treat it just as a way to go from a random choice of letters and home in on a target string by repeated random alterations, and selection of the "best".
Thursday, September 24, 2009
Natural Sorting
Natural sorting orders items in some way that is more than just the
order dictated by computer character codes. It is called natural
because the new ordering is supposed to mean more to a human consumer
than an ordering on base character codes .
I guess, what is becoming the canonical example is the ordering of
numbered files in a directory listing. A normal sorted directory
listing might show:
It might be more friendly if the files were ordered so that the numeric
section increased steadily.
In other situations, such as in Movie listings, you sometimes see
common words such as "the" ignored for sorting purposes (or even
transferred to the end of the item so "The Silence of the Lambs" would
be listed as "Silence of the Lambs, The")
Ignoring case, and treating white-space characters as the same - and
multiple strings of whitespace characters as a single whitespace, might
also be useful.
Since my initial tentative steps into working with non-ASCII characters
I tried to consider sorting accented characters as if the accent was
not present; ligatures as separate characters, and made a stab at
tackling what I call 'replacements' - where certain
characters, such as the German href="http://en.wikipedia.org/wiki/Scharfes_S">scharfes S
might be sorted as SS,.
I had a quick stab at creating a natural sort routine and
added several natural sorting abilities. They might not all be useful
at once, but it should be straight forward to pick and choose what you
want n your natural sort..
As always, when working with non-ASCII character sets, I find it
difficult to translate my working multi-byte Python program to HTML,
but there should be enough surrounding information to ensure that any
multi-byte character can be worked-out.
order dictated by computer character codes. It is called natural
because the new ordering is supposed to mean more to a human consumer
than an ordering on base character codes .
I guess, what is becoming the canonical example is the ordering of
numbered files in a directory listing. A normal sorted directory
listing might show:
style="font-family: monospace;">bash$ for (( i=0;
i<111;i+=5)); do touch f$i.txt; done
style="font-family: monospace;">
bash$ ls -1
style="font-family: monospace;">
f0.txt
style="font-family: monospace;">
f10.txt
style="font-family: monospace;">
f100.txt
style="font-family: monospace;">
f105.txt
style="font-family: monospace;">
f110.txt
style="font-family: monospace;">
f15.txt
style="font-family: monospace;">
f20.txt
style="font-family: monospace;">
f25.txt
style="font-family: monospace;">
f30.txt
style="font-family: monospace;">
f35.txt
style="font-family: monospace;">
f40.txt
style="font-family: monospace;">
f45.txt
style="font-family: monospace;">
f5.txt
style="font-family: monospace;">
f50.txt
style="font-family: monospace;">
f55.txt
style="font-family: monospace;">
f60.txt
style="font-family: monospace;">
f65.txt
style="font-family: monospace;">
f70.txt
style="font-family: monospace;">
f75.txt
style="font-family: monospace;">
f80.txt
style="font-family: monospace;">
f85.txt
style="font-family: monospace;">
f90.txt
style="font-family: monospace;">
f95.txt
i<111;i+=5)); do touch f$i.txt; done
style="font-family: monospace;">
bash$ ls -1
style="font-family: monospace;">
f0.txt
style="font-family: monospace;">
f10.txt
style="font-family: monospace;">
f100.txt
style="font-family: monospace;">
f105.txt
style="font-family: monospace;">
f110.txt
style="font-family: monospace;">
f15.txt
style="font-family: monospace;">
f20.txt
style="font-family: monospace;">
f25.txt
style="font-family: monospace;">
f30.txt
style="font-family: monospace;">
f35.txt
style="font-family: monospace;">
f40.txt
style="font-family: monospace;">
f45.txt
style="font-family: monospace;">
f5.txt
style="font-family: monospace;">
f50.txt
style="font-family: monospace;">
f55.txt
style="font-family: monospace;">
f60.txt
style="font-family: monospace;">
f65.txt
style="font-family: monospace;">
f70.txt
style="font-family: monospace;">
f75.txt
style="font-family: monospace;">
f80.txt
style="font-family: monospace;">
f85.txt
style="font-family: monospace;">
f90.txt
style="font-family: monospace;">
f95.txt
It might be more friendly if the files were ordered so that the numeric
section increased steadily.
In other situations, such as in Movie listings, you sometimes see
common words such as "the" ignored for sorting purposes (or even
transferred to the end of the item so "The Silence of the Lambs" would
be listed as "Silence of the Lambs, The")
Ignoring case, and treating white-space characters as the same - and
multiple strings of whitespace characters as a single whitespace, might
also be useful.
Since my initial tentative steps into working with non-ASCII characters
I tried to consider sorting accented characters as if the accent was
not present; ligatures as separate characters, and made a stab at
tackling what I call 'replacements' - where certain
characters, such as the German href="http://en.wikipedia.org/wiki/Scharfes_S">scharfes S
might be sorted as SS,.
I had a quick stab at creating a natural sort routine and
added several natural sorting abilities. They might not all be useful
at once, but it should be straight forward to pick and choose what you
want n your natural sort..
As always, when working with non-ASCII character sets, I find it
difficult to translate my working multi-byte Python program to HTML,
but there should be enough surrounding information to ensure that any
multi-byte character can be worked-out.
1
color="#804040"> 2 # -*- coding: utf-8 -*-
color="#804040"> 3 # Not Python 3.x (Can't compare str and int)
color="#804040"> 4
color="#804040"> 5 '''
color="#804040"> 6 Natural sorting: Sorting of text that does more than rely on the
color="#804040"> 7 order of individual characters codes to make the finding of
color="#804040"> 8 individual strings easier for a human reader.
color="#804040"> 9
color="#804040"> 10 Some 'natural' orderings might include:
color="#804040"> 11 1) Ignore leading, trailing and multiple adjacent spaces
color="#804040"> 12 2) All space types equivalent.
color="#804040"> 13 3) Sorting without regard to case.
color="#804040"> 14 4) Sorting numeric portions of strings in numeric order:
color="#804040"> 15 foo9.txt before foo10.txt
color="#804040"> 16 As well as ... x9y99 before x9y100, before x10y0
color="#804040"> 17 ... (for any number of groups of integers in a string).
color="#804040"> 18 5) Title sorts: without regard to a leading, very common, word such
color="#804040"> 19 as 'The' in "The thirty-nine steps".
color="#804040"> 20 6) Sort letters without regard to accents.
color="#804040"> 21 7) Sort ligatures as separate letters.
color="#804040"> 22 8) Replacements:
color="#804040"> 23 Sort german scharfes S (ß) as ss
color="#804040"> 24 Sort Å¿, LATIN SMALL LETTER LONG S as s
color="#804040"> 25 Sort Ê’, LATIN SMALL LETTER EZH as s
color="#804040"> 26 ...
color="#804040"> 27 '''
color="#804040"> 28
color="#804040"> 29 from itertools color="#a020f0">import groupby
color="#804040"> 30 from unicodedata color="#a020f0">import decomposition, name
color="#804040"> 31
color="#804040"> 32 commonleaders = [' color="#ff00ff">the'] # lowercase leading words to ignore
color="#804040"> 33 replacements = {' color="#ff00ff">ß': 'ss', color="#0000ff"># Map single char to replacement string
color="#804040"> 34 ' color="#ff00ff">Å¿': 's',
color="#804040"> 35 ' color="#ff00ff">Ê’': 's',
color="#804040"> 36 }
color="#804040"> 37
color="#804040"> 38 hexdigits = set(' color="#ff00ff">0123456789abcdef')
color="#804040"> 39 decdigits = set(' color="#ff00ff">0123456789') color="#0000ff"># Don't use str.isnumeric
color="#804040"> 40
color="#804040"> 41 def color="#008080">splitchar(c):
color="#804040"> 42 ' color="#ff00ff"> De-ligature. De-accent a char'
color="#804040"> 43 de = decomposition(c)
color="#804040"> 44 if de:
color="#804040"> 45 color="#0000ff"># Just the words that are also hex numbers
color="#804040"> 46 de = [d color="#804040">for d color="#804040">in de.split()
color="#804040"> 47 color="#804040">if all(c.lower()
color="#804040"> 48 color="#804040">in hexdigits color="#804040">for c color="#804040">in d)]
color="#804040"> 49 n = name(c, c).upper()
color="#804040"> 50 color="#0000ff"># (Gosh it's onerous)
color="#804040"> 51 color="#804040">if len(de)> 1 color="#804040">and ' color="#ff00ff">PRECEDE' color="#804040">in n:
color="#804040"> 52 color="#0000ff"># E.g. ʼn LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
color="#804040"> 53 de[1], de[0] = de[0], de[1]
color="#804040"> 54 tmp = [ unichr(int(k, 16)) color="#804040">for k color="#804040">in de]
color="#804040"> 55 base, others = tmp[0], tmp[1:]
color="#804040"> 56 color="#804040">if ' color="#ff00ff">LIGATURE' color="#804040">in n:
color="#804040"> 57 color="#0000ff"># Assume two character ligature
color="#804040"> 58 base += others.pop(0)
color="#804040"> 59 else:
color="#804040"> 60 base = c
color="#804040"> 61 return base
color="#804040"> 62
color="#804040"> 63
color="#804040"> 64 def color="#008080">sortkeygen(s):
color="#804040"> 65 ''' color="#ff00ff">Generate 'natural' sort key for s
color="#804040"> 66
color="#804040"> 67 Doctests:
color="#804040"> 68 >>> sortkeygen(' some extra spaces ')
color="#804040"> 69 [u'some extra spaces']
color="#804040"> 70 >>> sortkeygen('CasE InseNsItIve')
color="#804040"> 71 [u'case insensitive']
color="#804040"> 72 >>> sortkeygen('The Wind in the Willows')
color="#804040"> 73 [u'wind in the willows']
color="#804040"> 74 >>> sortkeygen(u' color="#6a5acd">\462 ligature')
color="#804040"> 75 [u'ij ligature']
color="#804040"> 76 >>> sortkeygen(u' color="#6a5acd">\335\375 upper/lower case Y with acute accent')
color="#804040"> 77 [u'yy upper/lower case y with acute accent']
color="#804040"> 78 >>> sortkeygen('foo9.txt')
color="#804040"> 79 [u'foo', 9, u'.txt']
color="#804040"> 80 >>> sortkeygen('x9y99')
color="#804040"> 81 [u'x', 9, u'y', 99]
color="#804040"> 82 '''
color="#804040"> 83 # Ignore leading and trailing spaces
color="#804040"> 84 s = unicode(s).strip()
color="#804040"> 85 # All space types are equivalent
color="#804040"> 86 s = ' color="#ff00ff"> '.join(s.split())
color="#804040"> 87 # case insentsitive
color="#804040"> 88 s = s.lower()
color="#804040"> 89 # Title
color="#804040"> 90 words = s.split()
color="#804040"> 91 if len(words) > 1 color="#804040">and words[0] color="#804040">in commonleaders:
color="#804040"> 92 s = ' color="#ff00ff"> '.join( words[1:])
color="#804040"> 93 # accent and ligatures
color="#804040"> 94 s = ''.join(splitchar(c) color="#804040">for c color="#804040">in s)
color="#804040"> 95 # Replacements (single char replaced by one or more)
color="#804040"> 96 s = ''.join( replacements.get(ch, ch) color="#804040">for ch color="#804040">in s )
color="#804040"> 97 # Numeric sections as numerics
color="#804040"> 98 s = [ int("".join(g)) color="#804040">if isinteger color="#804040">else "".join(g)
color="#804040"> 99 color="#804040">for isinteger,g color="#804040">in groupby(s, color="#804040">lambda x: x color="#804040">in decdigits)]
color="#804040">100
color="#804040">101 return s
color="#804040">102
color="#804040">103 def color="#008080">naturalsort(items):
color="#804040">104 ''' color="#ff00ff"> Naturally sort a series of strings
color="#804040">105
color="#804040">106 Doctests:
color="#804040">107 >>> naturalsort(['The Wind in the Willows','The 40th step more',
color="#804040">108 'The 39 steps', 'Wanda'])
color="#804040">109 ['The 39 steps', 'The 40th step more', 'Wanda', 'The Wind in the Willows']
color="#804040">110
color="#804040">111 '''
color="#804040">112 return sorted(items, key=sortkeygen)
color="#804040">113
color="#804040">114
color="#804040">115
color="#804040">116 # Extras...
color="#804040">117
color="#804040">118 '''
color="#804040">119 Help on built-in function decomposition in module unicodedata:
color="#804040">120
color="#804040">121 decomposition(...)
color="#804040">122 decomposition(unichr)
color="#804040">123
color="#804040">124 Returns the character decomposition mapping assigned to the Unicode
color="#804040">125 character unichr as string. An empty string is returned in case no
color="#804040">126 such mapping is defined.
color="#804040">127
color="#804040">128 Help on built-in function name in module unicodedata:
color="#804040">129
color="#804040">130 name(...)
color="#804040">131 name(unichr[, default])
color="#804040">132 Returns the name assigned to the Unicode character unichr as a
color="#804040">133 string. If no name is defined, default is returned, or, if not
color="#804040">134 given, ValueError is raised.
color="#804040">135 '''
color="#804040">136
color="#804040">137
color="#804040">138 def color="#008080">_temp():
color="#804040">139 # for c in u'\370\157\117\463\462\511\733\337\337':
color="#804040">140 for c color="#804040">in u' class="Constant">øoOijIJʼnÇßÃý ':
color="#804040">141 de = decomposition(c)
color="#804040">142 color="#804040">if de:
color="#804040">143 de = [d color="#804040">for d color="#804040">in de.split()
color="#804040">144 color="#804040">if all(c.upper()
color="#804040">145 color="#804040">in ' color="#ff00ff">0123456789ABCDEF' color="#804040">for c color="#804040">in d)]
color="#804040">146 n = name(c, c).upper()
color="#804040">147 color="#804040">if len(de)> 1 color="#804040">and ' color="#ff00ff">PRECEDE' color="#804040">in n:
color="#804040">148 color="#0000ff"># E.g. ʼn LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
color="#804040">149 de[1], de[0] = de[0], de[1]
color="#804040">150 tmp = [ unichr(int(k, 16)) color="#804040">for k color="#804040">in de]
color="#804040">151 base, others = tmp[0], tmp[1:]
color="#804040">152 color="#804040">else:
color="#804040">153 base, others = c, []
color="#804040">154 color="#804040">print(" color="#ff00ff">%s %o %s\n color="#ff00ff"> %s %s" %
color="#804040">155 (c,ord(c), name(c,' color="#ff00ff">-'),
color="#804040">156 name(base, base),
color="#804040">157 ' color="#ff00ff"> '.join(name(o,o) color="#804040">for o color="#804040">