Go deh!: Critique of pseudocode explanations of the Closest Pair Algorithm

I watched this
task on Rosetta Code from its inception on the 6^th of
May, but had problems understanding the pseudo-code given, and some
of the stuff I googled too. after a couple of days though I was more
familiar with attempts at O(nlog(n)) divide-n-conquer
algorithms having seen a few and happened to find this
explanation (.ppt file) where it seemed to click for me.

After problems with the other references I thought I might go
through and identify what I found problematic with them.

The Rosetta Code Pseudo-code:

closestPair of P(1), P(2), ... P(N)
if N ≤ 3 then
  return closest points of P using brute-force algorithm
else
  xP ← P ordered by the x coordinate, in ascending order
  P_L ← points of xP from 1 to ⌈N/2⌉
  P_R ← points of xP from ⌈N/2⌉+1 to N
  (d_L, pair_L) ← closestPair of P_L

  (d_R, pair_R) ← closestPair of P_R
  (d_min, pair_min) ← (d_R, pair_R)
  if d_L < d_R then

    (d_min, pair_min) ← (d_L, pair_L)
  endif
  x_M ← xP(⌈N/2⌉)_x

  S ← { p ∈ xP : |x_M - p_x| < d_min }
  yP ← S ordered by the y coordinate, in ascending order
  nP ← number of points in yP
  (closest, closestPair) ← (d_min, pair_min)
  for i from 1 to nP - 1
    k ← i + 1
    while k ≤ nP and yP(k)_y - yP(k)_y < d_min

      if |yP(k) - yP(i)| < closest then
        (closest, closestPair) ← (|yP(k) - yP(i)|, {yP(k), yP(i)})
      endif
      k ← k + 1
    endwhile
  return closest, closestPair

endif

The author had noted that the above could have problems so I was
warned, but still started by trying to follow the above. I was
blindly following the pseudo-code until I got the the statement: x_M
← xP(⌈N/2⌉)_x which I couldn't understand.

This prompted my wider search for a better explanation. After
finding it (linked in the first paragraph), I modified my Python
program to follow the new explanation.

The RC references

There were several references quoted.

The Wikipedia article
Did not give pseudo-code
Closest
Pair (McGill)
Section 3.4 on improving their previous
O(nlog²n) into an O(nlogn)
algorithm I found to be very vague. Just how do you “returning
the points in each set in sorted order by y-coordinate “
without doing an nlogn sort leading again to an O(nlog²n)
solution?
Closest
Pair (UCBS)
This seems to concur with my preferred reference
but is given at a higher level than pseudo-code, and so would be
harder to implement from the description given.
Closest
pair (WUStL) by Dr. Jeremy Buhler
Their seems to be a
problem with having N, the number of points as an argument of the
function. What if N is odd? The pseudo-code seems to rely on other
material probably presented in a class and is at a higher level than
my preferred solution as it seems to be set as an exercise for
students.

My Python Solution

The only tests done is to generate random points and compare the
brute force algorithms solution with that of the divide and conquer.

'''

  Compute nearest pair of points using two algorithms
  
  First algorithm is 'brute force' comparison of every possible pair.
  Second, 'divide and conquer', is based on:
    www.cs.iupui.edu/~xkzou/teaching/CS580/Divide-and-conquer-closestPair.ppt 
'''

from random import randint

infinity = float('inf')


# Note the use of complex numbers to represent 2D points making distance == abs(P1-P2)

def bruteForceClosestPair(point):
    numPoints = len(point)
    if numPoints < 2:
        return infinity, (None, None)
    return min( ((abs(point[i] - point[j]), (point[i], point[j]))
                 for i in range(numPoints-1)
                 for j in range(i+1,numPoints)),
                key=lambda x: x[0])


def closestPair(point):
    xP = sorted(point, key= lambda p: p.real)
    yP = sorted(point, key= lambda p: p.imag)
    return _closestPair(xP, yP)

def _closestPair(xP, yP):
    numPoints = len(xP)
    if numPoints <= 3:
        return bruteForceClosestPair(xP)
    Pl = xP[:numPoints/2]
    Pr = xP[numPoints/2:]
    Yl, Yr = [], []
    xDivider = Pl[-1].real
    for p in yP:
        if p.real <= xDivider:
            Yl.append(p)
        else:
            Yr.append(p)
    dl, pairl = _closestPair(Pl, Yl)
    dr, pairr = _closestPair(Pr, Yr)
    dm, pairm = (dl, pairl) if dl < dr else (dr, pairr)
    # Points within dm of xDivider sorted by Y coord

    closeY = [p for p in yP  if abs(p.real - xDivider) < dm]
    numCloseY = len(closeY)
    if numCloseY > 1:
        # There is a proof that you only need compare a max of 7 other points

        closestY = min( ((abs(closeY[i] - closeY[j]), (closeY[i], closeY[j]))
                         for i in range(numCloseY-1)
                         for j in range(i+1,min(i+8, numCloseY))),
                        key=lambda x: x[0])
        return (dm, pairm) if dm <= closestY[0] else closestY
    else:
        return dm, pairm


def times():
    ''' Time the different functions
    '''
    import timeit

    functions = [bruteForceClosestPair, closestPair]
    for f in functions:
        print 'Time for', f.__name__, timeit.Timer(
            '%s(pointList)' % f.__name__,
            'from closestpair import %s, pointList' % f.__name__).timeit(number=1)



pointList = [randint(0,1000)+1j*randint(0,1000) for i in range(2000)]


if __name__ == '__main__':
    pointList = [(5+9j), (9+3j), (2+0j), (8+4j), (7+4j), (9+10j), (1+9j), (8+2j), 10j, (9+6j)]
    print pointList
    print '  bruteForceClosestPair:', bruteForceClosestPair(pointList)
    print '            closestPair:', closestPair(pointList)
    for i in range(10):
        pointList = [randint(0,10)+1j*randint(0,10) for i in range(10)]
        print '\n', pointList
        print ' bruteForceClosestPair:', bruteForceClosestPair(pointList)
        print '           closestPair:', closestPair(pointList)
    print '\n'
    times()
    times()
    times()

Sample output:

[(5+9j), (9+3j), (2+0j), (8+4j), (7+4j), (9+10j), (1+9j), (8+2j), 10j, (9+6j)]
  bruteForceClosestPair: (1.0, ((8+4j), (7+4j)))
            closestPair: (1.0, ((8+4j), (7+4j)))

[(6+8j), (4+1j), 7j, (4+10j), (10+6j), (1+9j), (10+10j), (5+0j), (3+1j), (6+4j)]
 bruteForceClosestPair: (1.0, ((4+1j), (3+1j)))
           closestPair: (1.0, ((3+1j), (4+1j)))

[(5+5j), (10+9j), 3j, (6+1j), (9+5j), 1j, (8+0j), 8j, (3+2j), (10+4j)]
 bruteForceClosestPair: (1.4142135623730951, ((9+5j), (10+4j)))
           closestPair: (1.4142135623730951, ((9+5j), (10+4j)))

[(9+2j), (1+2j), 5j, (10+0j), (6+6j), (6+4j), (5+1j), (2+5j), (8+6j), (3+2j)]
 bruteForceClosestPair: (2.0, ((1+2j), (3+2j)))
           closestPair: (2.0, ((6+6j), (6+4j)))

[(2+3j), (9+0j), (9+7j), 8j, (4+5j), (6+7j), (6+2j), (3+3j), (3+10j), (7+5j)]
 bruteForceClosestPair: (1.0, ((2+3j), (3+3j)))
           closestPair: (1.0, ((2+3j), (3+3j)))

[(7+2j), (2+10j), (3+7j), (8+6j), (2+1j), 3j, (8+5j), (6+9j), (1+0j), 0j]
 bruteForceClosestPair: (1.0, ((8+6j), (8+5j)))
           closestPair: (1.0, ((8+6j), (8+5j)))

[(10+7j), (6+3j), (2+7j), (3+6j), (9+3j), (8+5j), (5+7j), (3+10j), 5j, (3+7j)]
 bruteForceClosestPair: (1.0, ((2+7j), (3+7j)))
           closestPair: (1.0, ((3+6j), (3+7j)))

[(4+6j), (5+4j), 2j, (7+0j), 4j, (6+5j), (7+5j), (8+9j), (10+5j), (8+2j)]
 bruteForceClosestPair: (1.0, ((6+5j), (7+5j)))
           closestPair: (1.0, ((6+5j), (7+5j)))

[(4+0j), (7+9j), 7j, 8j, (7+1j), (2+5j), (7+3j), (1+3j), (3+9j), (10+7j)]
 bruteForceClosestPair: (1.0, (7j, 8j))
           closestPair: (1.0, (7j, 8j))

[(9+1j), (5+0j), (8+3j), (6+9j), (3+7j), (4+6j), (7+6j), (8+6j), (10+8j), (2+5j)]
 bruteForceClosestPair: (1.0, ((7+6j), (8+6j)))
           closestPair: (1.0, ((7+6j), (8+6j)))

[(2+2j), (10+8j), (3+0j), (8+0j), (1+1j), (6+5j), 10j, (6+8j), (10+4j), (4+10j)]
 bruteForceClosestPair: (1.4142135623730951, ((2+2j), (1+1j)))
           closestPair: (1.4142135623730951, ((1+1j), (2+2j)))


Time for bruteForceClosestPair 4.59288093878
Time for closestPair 0.120782948671
Time for bruteForceClosestPair 5.00096353028
Time for closestPair 0.120047939054
Time for bruteForceClosestPair 4.86709875138
Time for closestPair 0.123363723602

Note how much slower the brute force algorithm is for only 2000
points.

(Oh, I guess I should add that the above is all my own work –
students might want to say the same).

2 comments:

shintakezouTue May 12, 10:43:00 am
about «There is a proof that you only need compare a max of 7 other points» (I've found 6 indeed, but it is not essential). The "a max" (at most?) is very important, and not exploited in your code: you test all the 7 points you need to test (as I can understand the python code, maybe I'm wrong); indeed, most of the time, you *don't* need to test them all: when you reach the first beyond "dmin", you can stop! and this is the "why" of the second condition in the while loop in the pseudocode of WUStL; that code, most of the time (depending on the distribution of the points), tests less than 7 points; exactly it tests only the points that need to be tested: no one more. So it is slightly more efficient. (then there's a proof that the number of points that can live "inside" the dmin limit is 6or7, no more... but the important fact for the implementation is to test when the limit is reached)
MartinTue May 12, 10:43:00 am
"xM ← xP(⌈N/2⌉)x"

I would translate this into the following python code:
xM = xP[math.ceil(N/2.0)].x

Tuesday, May 12, 2009

Critique of pseudocode explanations of the Closest Pair Algorithm

The Rosetta Code Pseudo-code:

The RC references

My Python Solution

Sample output:

2 comments: