Go deh!: The Story of the Regexp and the Primes

Wednesday, August 26, 2009

The Story of the Regexp and the Primes

If you are all sitting comfortably, then I shall begin...

Once upon a time, in a land an internet away, A python programmer was
browsing the web when he StumbledUpon a post with a curious regular
expression purporting to test numbers for primality. On further
investigation, the regexp was attributed in more than one place to
"Abigail" and took the form:

style="font-weight: bold; font-family: monospace;">^1?$|^(11+?)\1+$

Wanting to find out how it worked, the programmer was stumped, as all
the posts mentioning the regexp asked you to visit a page with a
supposedly excellent explanation of its inner workings, but the domain
name was no longer present.

The programmer decided to work it out for himself and came up with the
following...

A prime number is a
positive integer that is only divisible by itself and one. Zero and one
are not prime.

Let's say a number N, greater than one, is not prime. Then
there exists two numbers, S and T, that when multiplied together equal
N. We have:

S x T = N

Lets us further assume
that T is greater than, or equal to S. (They can be swapped if
necessary to make this so). Then a small bit of manipulation gives:

(S x (T-1)) + S = N

(You take one less S in
the multiplication, then you need to add the S back).

Lets swap the terms and write this as equation (a):

style="font-weight: bold;">S + (S x (T-1)) = N style="font-weight: bold;"> when S,T,N are integers greater
than one, and N is style="font-style: italic; font-weight: bold;">not style="font-weight: bold;"> prime

Abigail's regexp relies on representing integers as strings of that
number of ones:

Zero would be
    ''

One would be     '1'

two becomes     '11'

and so on...

The regexp gives a match if the string of ones does not represent a
prime.

To match zero or one occurrences of a 1 we use the first half of the
regexp:

style="font-family: monospace;">^1?$

Where ^anchors the following to a match from the beginning of the
string, ? will match zero or one occurrences of the preceding 1, and
the trailing $ matches the end of the string, i.e. it matches
a string that contains only zero or one 1 characters.

Going back to equation (a), greater than one, is the same as two or
more.

So S will be represented by two or more 1's - we don't know how many,
but we do know that it is two or more. S, in a regexp, becomes
something like:

style="font-family: monospace;">11+

1 followed by one or more 1's. S is less than or equal to T,
which translates to using a less greedy form of the zero-or-one
matcher, which is +? giving a representation of S as:

style="font-family: monospace;">11+?

We want to find S followed by T-1 identical copies of S, so lets form a
group out of what matches S:

style="font-family: monospace;">(11+?)

Then by referring to this group, it is the first group in the regexp so
can be referred to as \1, we can search for extra copies of the group
by:

style="font-family: monospace;">\1+

S and its copies must cover the whole of the number so must start at ^
and finish at $:

style="font-weight: bold; font-family: monospace;">^(11+?)\1+$

Together with the earlier regexp fraction that covers the numbers zero
and one, they can be or'd together to cover all non primes as
the following:

style="font-family: monospace;">^1?$|^(11+?)\1+$

...Not knowing when to quit, the programmer opted to display some of
his Python code that defines a primality tester using the regexp:

>>> import re

>>> def isprime(n):

    return not re.match(r' style="font-weight: bold;">^1?$|^(11+?)\1+$', '1' * n)

>>> [i for i in range(40) if isprime(i)]

[2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37]

And another function that shows S and T from the regexp match:

>>> def is style="font-weight: bold;">notprime(n):

    notprime = re.match(r'^(1?)$|^(11+?)(\2+)$', '1' * n)

    if notprime:

        if n <=1:

            return (len( notprime.groups()[0]), )

        else:

            S, SxT1 = notprime.groups()[1:]

            s = len(S)

            t = 1 + len(SxT1)//s

            return (s,t)

    else:

        return ()

>>> [(i, isnotprime(i)) for i in range(15)]

[(0, (0,)), (1, (1,)), (2, ()), (3, ()), (4, (2, 2)), (5, ()), (6, (2, 3)), (7, ()), (8, (2, 4)), (9, (3, 3)), (10, (2, 5)), (11, ()), (12, (2, 6)), (13, ()), (14, (2, 7))]

... And so to bed.

THE END.

13 comments:

RobertoWed Aug 26, 12:42:00 pm
Thanks for this post, it's really cool. A nice way to start my day! :)
ReplyDelete
Replies
André RobergeWed Aug 26, 01:26:00 pm
Nice post.
ReplyDelete
Replies
AnonymousWed Aug 26, 02:33:00 pm
I just say: WOW:)
ReplyDelete
Replies
AnonymousWed Aug 26, 03:57:00 pm
Fwiw, Abigail is a (relatively) famous perl developer. A google search for 'abigail perl' should lead you to him (I think it's a 'him', anyway.)

Awesome analysis, though.
ReplyDelete
Replies
Paddy3118Wed Aug 26, 07:08:00 pm
Thanks for the google tip.
I found Abigail, the contributor to CPAN and sent them a thank you email!

- Paddy.
ReplyDelete
Replies
Paddy3118Wed Aug 26, 08:43:00 pm
I have since found this explanation, and a pointer to this one in the comments. You might try an alternative if you are having problems understanding this one, or maybe critique all three? (Be gentle with me...)

- Paddy.

P.S. Don't take it too seriously with suggestions for improvements, or benchmarks - it is just an exercise in explaining a clever regexp; not a practical suggestion for testing primality.
ReplyDelete
Replies
AnonymousMon Sept 07, 07:26:00 am
This comment has been removed by a blog administrator.
ReplyDelete
Replies
lorgWed Sept 09, 12:03:00 pm
Hi
Just a mathematical note: even though this can be described in a regexp, it is not formally a 'regular language'. This is because back-references are not regular.

If they were, then this regexp would prove it was possible to decide on a number's primality by going over its digits only once.
ReplyDelete
Replies
Paddy3118Wed Sept 09, 06:57:00 pm
Hi lorg,
Yes, there is a strict mathematical meaning of what constitutes a regular expression, whereas I'm using the term in the more general sense of "Those terse hieroglyphics that form a sub-language for matching and extracting stuff from strings".

:-)
ReplyDelete
Replies
RobertoSat Sept 12, 04:11:00 am
Hi, Paddy!

I presented this post today (with proper credit) in a lightning talk during the brazilian Pycon (Pythonbrasil). Thanks again for this post, people loved the presentation!

Best,
--Rob
ReplyDelete
Replies
Paddy3118Sat Sept 12, 06:54:00 am
Hi Abagale,
General nice comments are encouraged! Nice is good. I try to do nice too.
:-)
ReplyDelete
Replies
Paddy3118Sat Sept 12, 07:18:00 am
Aaarghhh....
I can stand it no more!

There is a problem with the style of Abigail's original regexp that glares at me every time that I go back to it.

I prefer the ^ and $ matchers for the beginning and end of a string, to appear once, at the beginning and/or end of the regexp. It just looks better that way to me.

So I would re-factor Abigail's regexp, (and assuming that the -x or verbose flag is in operation to allow spaces to be ignored in the regexp), as:

^ (?: 1? | (11+?) \1+ )$

(?: ... ) is the non-capturing parenthesis that groups without using up a group number.

Sorry Abigail, I could resist no longer.
ReplyDelete
Replies
AbigailFri Jul 09, 05:53:00 pm
Using (?:) and one set of ^$ is two characters longer than using two sets of ^$. Yes, length matters. And while using a capturing set of parens uses the same amount of characters, one would have to use \2, losing on style.
ReplyDelete
Replies

Add comment

Go deh!

Wednesday, August 26, 2009

The Story of the Regexp and the Primes

13 comments:

About Me

Followers

Subscribe Now: google

Go deh too!

whos.amung.us

Blog Archive