Go deh!

Mainly Tech projects on Python and Electronic Design Automation.

Wednesday, August 26, 2009

The Story of the Regexp and the Primes

If you are all sitting comfortably, then I shall begin...

Once upon a time, in a land an internet away, A python programmer was
browsing the web when he StumbledUpon a post with a curious regular
expression purporting to test numbers for primality. On further
investigation, the regexp was attributed in more than one place to
"Abigail" and took the form:

style="font-weight: bold; font-family: monospace;">^1?\$|^(11+?)\1+\$

Wanting to find out how it worked, the programmer was stumped, as all
the posts mentioning the regexp asked you to visit a page with a
supposedly excellent explanation of its inner workings, but the domain
name was no longer present.

The programmer decided to work it out for himself and came up with the
following...

A prime number is a
positive integer that is only divisible by itself and one. Zero and one
are not prime.

Let's say a number N, greater than one,  is not prime. Then
there exists two numbers, S and T, that when multiplied together equal
N. We have:

S x T = N

Lets us further assume
that T is greater than, or equal to S. (They can be swapped if
necessary to make this so). Then a small bit of manipulation gives:

(S x (T-1)) + S = N

(You take one less S in
the multiplication, then you need to add the S back).

Lets swap the terms and write this as equation (a):

style="font-weight: bold;">S + (S x (T-1)) = N  style="font-weight: bold;"> when S,T,N are integers greater
than one, and N is style="font-style: italic; font-weight: bold;">not style="font-weight: bold;"> prime

Abigail's regexp relies on representing integers as strings of that
number of ones:

Zero would be
''

One would be     '1'

two becomes     '11'

and so on...

The regexp gives a match if the string of ones does not represent a
prime.

To match zero or one occurrences of a 1 we use the first half of the
regexp:

style="font-family: monospace;">^1?\$

Where ^anchors the following to a match from the beginning of the
string, ? will match zero or one occurrences of the preceding 1, and
the  trailing \$ matches the end of the string, i.e. it matches
a string that contains only zero or one 1 characters.

Going back to equation (a), greater than one, is the same as two or
more.

So S will be represented by two or more 1's - we don't know how many,
but we do know that it is two or more. S, in a regexp, becomes
something like:

style="font-family: monospace;">11+

1 followed by one or more 1's. S is less than or equal to T,
which translates to using a less greedy form of the zero-or-one
matcher, which is +? giving a representation of S as:

style="font-family: monospace;">11+?

We want to find S followed by T-1 identical copies of S, so lets form a
group out of what matches S:

style="font-family: monospace;">(11+?)

Then by referring to this group, it is the first group in the regexp so
can be referred to as \1, we can search for extra copies of the group
by:

style="font-family: monospace;">\1+

S and its copies must cover the whole of the number so must start at ^
and finish at \$:

style="font-weight: bold; font-family: monospace;">^(11+?)\1+\$

Together with the earlier regexp fraction that covers the numbers zero
and one, they can be or'd together to cover all non primes as
the following:

style="font-family: monospace;">^1?\$|^(11+?)\1+\$

...Not knowing when to quit, the programmer opted to display some of
his Python code that defines a primality tester using the regexp:

`>>> import re`

`>>> def isprime(n):`

`    return not re.match(r' style="font-weight: bold;">^1?\$|^(11+?)\1+\$', '1' * n)>>> [i for i in range(40) if isprime(i)]`

`[2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37]`

And another function that shows S and T  from the regexp match:

`>>> def is style="font-weight: bold;">notprime(n):`

`    notprime = re.match(r'^(1?)\$|^(11+?)(\2+)\$', '1' * n)`

`    if notprime:`

`        if n <=1:`

`            return (len( notprime.groups()), )`

`        else:`

`            S, SxT1 = notprime.groups()[1:]`

`            s = len(S)`

`            t = 1 + len(SxT1)//s`

`            return (s,t)`

`    else:`

`        return ()`

`>>> [(i, isnotprime(i)) for i in range(15)]`

`[(0, (0,)), (1, (1,)), (2, ()), (3, ()), (4, (2, 2)), (5, ()), (6, (2, 3)), (7, ()), (8, (2, 4)), (9, (3, 3)), (10, (2, 5)), (11, ()), (12, (2, 6)), (13, ()), (14, (2, 7))]`

... And so to bed.

THE END.

1. Thanks for this post, it's really cool. A nice way to start my day! :)

2. Nice post.

3. I just say: WOW:)

4. Fwiw, Abigail is a (relatively) famous perl developer. A google search for 'abigail perl' should lead you to him (I think it's a 'him', anyway.)

Awesome analysis, though.

5. I found Abigail, the contributor to CPAN and sent them a thank you email!

6. I have since found this explanation, and a pointer to this one in the comments. You might try an alternative if you are having problems understanding this one, or maybe critique all three? (Be gentle with me...)

P.S. Don't take it too seriously with suggestions for improvements, or benchmarks - it is just an exercise in explaining a clever regexp; not a practical suggestion for testing primality.

7. This comment has been removed by a blog administrator.

8. Hi
Just a mathematical note: even though this can be described in a regexp, it is not formally a 'regular language'. This is because back-references are not regular.

If they were, then this regexp would prove it was possible to decide on a number's primality by going over its digits only once.

9. Hi lorg,
Yes, there is a strict mathematical meaning of what constitutes a regular expression, whereas I'm using the term in the more general sense of "Those terse hieroglyphics that form a sub-language for matching and extracting stuff from strings".

:-)

10. I presented this post today (with proper credit) in a lightning talk during the brazilian Pycon (Pythonbrasil). Thanks again for this post, people loved the presentation!

Best,
--Rob

11. Hi Abagale,
General nice comments are encouraged! Nice is good. I try to do nice too.
:-)

12. Aaarghhh....
I can stand it no more!

There is a problem with the style of Abigail's original regexp that glares at me every time that I go back to it.

I prefer the ^ and \$ matchers for the beginning and end of a string, to appear once, at the beginning and/or end of the regexp. It just looks better that way to me.

So I would re-factor Abigail's regexp, (and assuming that the -x or verbose flag is in operation to allow spaces to be ignored in the regexp), as:

^ (?: 1? | (11+?) \1+ )\$

(?: ... ) is the non-capturing parenthesis that groups without using up a group number.

Sorry Abigail, I could resist no longer.

13. Using (?:) and one set of ^\$ is two characters longer than using two sets of ^\$. Yes, length matters. And while using a capturing set of parens uses the same amount of characters, one would have to use \2, losing on style.