Sunday 13 May 2012

DH7: When it's your job to fix your critics' cases for them


Sometime after I came across Paul Graham's excellent disagreement hierarchy, I came across a little-known addition by a blogger known as Black Belt Bayesian:

DH7: To win, you must fight not only the creature you encounter; you must fight the most horrible thing that can be constructed from its corpse.

Paul's article revolved around civility in online forums, so I can see why he stopped at DH6. Trying to construct a proper argument out of the jumbled mess that 'someone on the internet' makes is a hobby few may be interested in. Perhaps the purists, perhaps the philosophers.

Now, look again through the eyes of an entrepreneur. For many on Hacker News this requires no imagination whatsoever. You pitch your idea to many people every week. They may come back with a blurb about a potential problem, one that sounds like those you dismiss easily with a ready-made response. The other side will not push their case. They will not try to state it clearly, nor will they counter-argue. It is a social situation; they are only making conversation, and are only too happy to move on to the next topic, tell you about their startup, or network with someone else. But for you, DH7 is a matter of life and death:

To win, you must fight not only the creature you encounter; you must fight the most horrible thing that can be constructed from its corpse.

Only this time, winning is not about some argument on a forum. The stakes are much higher. If you fail to recognise a valid argument in the mumblings of an experienced but unmotivated interlocutor, you may hear it again, loud and clear, in the epitaph of your business venture. So you must inquire, open up, defeat the ugh field, and push through until you have found what was lurking behind the bushes, or you find out it was only the wind after all.

So open your ears, fellow entrepreneurs, and don't let that tiny note of discord get lost in the noise. Steamrolling objections with your well-practised arguments (or non-arguments) is good fun, until you miss that one valuable insight.

Sunday 29 January 2012

Solving Causes' Levenshtein Distance challenge in Python, the Sequel


The 3 faithful readers of this blog have probably seen my previous attempt at cracking Causes' Levenshtein distance challenge. It all went well until Adam Derewecki of Causes commented with the following:
...Pretty good solution though, about 15s on our benchmark machine. Record is 11.3s if you're up to the challenge :)
At first I was like "Yeah right, mate, you're not roping -me- in with that one, I have a startup to run." But the predictable engineer's mind just couldn't let it go. How could someone have done about 30% better in Python? What was I missing? So I started hacking at the code again. Turns out (surprise!) I was missing quite a bit. Let's start by putting the original code up for you to see:

import string
w = set(open("00wordlist.txt").read().splitlines())
f, nf = set(), set(["causes"])

#from b, yield all unused words where levdist==1
def nextgen(b):
    for i in range(len(b)): #for each index in b
        for c in string.ascii_lowercase: #for letters [a..z]
            if c != b[i]:
                #substitute b[i] with c
                if b[:i] + c + b[i+1:] in w:
                    yield b[:i] + c + b[i+1:]
                #inject c before b[i]
                if b[:i] + c + b[i:] in w:
                    yield b[:i] + c + b[i:]
        #remove b[i]
        if b[:i] + b[i+1:] in w: yield b[:i] + b[i+1:]
    
    for c in string.ascii_lowercase: #for letters [a..z]
        if b + c in w: yield b + c #append c after b

while len(nf):
    cf = nf
    nf = set([j for i in cf for j in nextgen(i) if j not in f])
    w -= nf
    f |= nf

print len(f)

First, Adam's suggestion was very good by itself. Why write this:

nf = set([j for i in cf
        for j in nextgen(i) 
            if j not in f])

when you can omit the intermediate list and just write this:

nf = set(j for i in cf 
        for j in nextgen(i) 
            if j not in f)

But it gets better. Since I subtract nf from w, which is where nextgen sources its candidates, why even check if j not in f? No reason: anything already in f has also been removed from w, so nextgen can never yield it again. So we end up with the much more palatable:

nf = set(j for i in cf for j in nextgen(i))
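
If you are curious how much the list comprehension and the generator expression actually differ, a throwaway micro-benchmark along these lines is enough to see for yourself. The data below is made up purely for illustration and has nothing to do with the challenge:

import timeit

setup = "data = range(1000); f = set(range(0, 1000, 3))"
listcomp = "set([x for x in data if x not in f])"
genexpr = "set(x for x in data if x not in f)"

#time the intermediate-list version and the generator version
print(timeit.timeit(listcomp, setup=setup, number=10000))
print(timeit.timeit(genexpr, setup=setup, number=10000))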

After improving that line, I noticed that I had a line above that was doing absolutely nothing whatsoever:

cf = nf

This line simply betrays my uncertainty about how Python's comprehensions work. Since set() consumes the whole generator, which iterates over the old value of nf, before the assignment rebinds the name, the next line can simply be written as follows, with no need to ever declare cf at all.

nf = set(j for i in nf for j in nextgen(i))
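
In case reusing the name on both sides still looks suspicious, here is a toy example (nothing to do with the challenge) showing that set() finishes reading the old set before the new one is bound:

nums = set([1, 2, 3])
nums = set(n * 10 for n in nums) #old set is read, then the name rebinds
print(nums) #prints set([10, 20, 30]) under Python 2 (order may vary)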

Next up, let's look at the little optimisation I had in line 9:

if c != b[i]:

Here I used a whole line to check that I wasn't going to be doing any useless lookups. Even though I was aiming for small code. Even though Python has O(1) membership testing. When I looked again at the code and doubted my own premature optimisation, the results were damning: the test cost more time than it saved. Removing that line yields a speed improvement.
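
If you want to check a claim like that on your own machine, a crude timing of just the inner loop, with and without the guard, is enough. This sketch assumes the same 00wordlist.txt and measures nothing but the substitution lookups:

import string, timeit

w = set(open("00wordlist.txt").read().splitlines())
b = "causes"

def with_guard():
    for i in range(len(b)):
        for c in string.ascii_lowercase:
            if c != b[i]: #the guard in question
                b[:i] + c + b[i+1:] in w

def without_guard():
    for i in range(len(b)):
        for c in string.ascii_lowercase:
            b[:i] + c + b[i+1:] in w #always pay for the lookup

print(timeit.timeit(with_guard, number=100000))
print(timeit.timeit(without_guard, number=100000))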

All these improvements were small. They saved 1-2 seconds over the total of 25 seconds it takes on my laptop. The big improvement came when I tried the technique seen in this Stack Overflow answer. Interrupting the running program a few times pointed straight at the culprit: the constant use of the slicing operation was not doing me any favours. For every letter and every position in a string I did operations like this:

if b[:i] + c + b[i+1:] in w:
    yield b[:i] + c + b[i+1:]

That's 4 slice operations for the substitution alone, and the injection case right below it does the same, for a total of 8 per letter. So I decided to do the slicing only once per position, assign the results to variables, and use those for each letter. That sped things up enormously, bringing the runtime from slightly under 23 to well under 17 seconds.
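
In outline, and using the variable names of the final listing below, the substitution and injection branches end up looking roughly like this:

wd_upto_i, wd_from_i, wd_after_i = wd[:i], wd[i:], wd[i+1:]
for char in string.ascii_lowercase:
    #slices computed once per position, not once per letter
    if wd_upto_i + char + wd_after_i in words: #substitution
        yield wd_upto_i + char + wd_after_i
    if wd_upto_i + char + wd_from_i in words: #injection
        yield wd_upto_i + char + wd_from_i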

UPDATE: After some impromptu after-work tinkering with my co-founder Pagan, we realised Python iterates over lists faster than over strings, which means that adding the line

letters = list(string.ascii_lowercase)

to the setup part of the code speeds things up by a cool 4%.
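
This one is also easy to sanity-check with timeit. The sketch below only times the bare iteration, so its numbers will look far more dramatic than the 4% we saw on the whole program:

import timeit

over_string = timeit.timeit("for c in s: pass",
    setup="import string; s = string.ascii_lowercase")
over_list = timeit.timeit("for c in l: pass",
    setup="import string; l = list(string.ascii_lowercase)")

print(over_string)
print(over_list)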

All these improvements add up to cutting about a third off the total running time. Since Adam said my programme ran for 15 seconds on the benchmark machine, while the best Python entry they had ran at 11.3, I suspect this may be enough to beat the frontrunner. Now I just have to get Adam to test this one again.

Another change I made was to improve the horrible variable naming I had last time around, and to add a few more comments. I was also very strict about keeping lines under 65 characters in length. So here is the resulting program:

import string
words = set(open("00wordlist.txt").read().splitlines())
frnds, newfrnds = set(), set(["causes"])
letters = list(string.ascii_lowercase)

#from word wd, yield all unused words where levdist==1
def freefrnds(wd):
    for i in range(len(wd)): #for each index in wd
        wd_upto_i,wd_from_i,wd_after_i = wd[:i],wd[i:],wd[i+1:]
        for char in letters: #for letters [a..z]
            #substitute wd[i] with char
            if wd_upto_i + char + wd_after_i in words:
                yield wd_upto_i + char + wd_after_i
            #inject char before wd[i]
            if wd_upto_i + char + wd_from_i in words:
                yield wd_upto_i + char + wd_from_i
        #remove wd[i] from word
        if wd_upto_i + wd_after_i in words:
            yield wd_upto_i + wd_after_i

    for char in letters: #for letters [a..z]
        #append char after word
        if wd + char in words: yield wd + char

while len(newfrnds):
    newfrnds = set(j for i in newfrnds for j in freefrnds(i))
    frnds |= newfrnds #add newfrnds to the frnds set
    words -= newfrnds #remove list of newfrnds from words

print len(frnds)