Mental hashing for paper address books (with Python)

phreeza · on Jan 9, 2011

One problem I see with the evaluation is that he uses a generic English corpus. The real application is names, though, which probably have different statistical properties. It shouldn't be too hard to find a list of real names somewhere on the web.

edit: here for example http://names.mongabay.com/most_common_surnames.htm

limmeau · on Jan 9, 2011

Thanks for the link. Hash function "First and fourth letter" wins for English surnames:

https://gist.github.com/bd9fabf91a5501b215c5

(I copied the names table from your link and applied the original program to it)
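
For readers who want to try this kind of comparison themselves, here is a minimal sketch. It is not the original program from the post. The scoring metric (average size of the bucket a randomly chosen name falls into) and the ten-name sample are assumptions made for illustration, so a real run should load the full surname list instead.

```python
from collections import Counter

def letter_at(name, i):
    """Return the lowercase letter at index i, or '_' if the name is too short."""
    return name[i].lower() if -len(name) <= i < len(name) else "_"

def make_hash(i, j):
    """Build a two-letter hash function from two letter positions."""
    return lambda name: letter_at(name, i) + letter_at(name, j)

# Candidate mental hash functions (the names are illustrative labels).
HASHES = {
    "first and second": make_hash(0, 1),
    "first and fourth": make_hash(0, 3),
    "first and last": make_hash(0, -1),
}

def expected_bucket_size(names, hash_fn):
    """Average size of the bucket that a uniformly chosen name lands in.

    Assumed metric: lower means fewer entries to scan per lookup.
    """
    buckets = Counter(hash_fn(n) for n in names)
    return sum(c * c for c in buckets.values()) / len(names)

# Tiny hand-picked sample, NOT the list from the link above.
SAMPLE = ["Smith", "Johnson", "Williams", "Brown", "Jones",
          "Miller", "Davis", "Garcia", "Martinez", "Wilson"]

if __name__ == "__main__":
    for label, fn in HASHES.items():
        print(f"{label}: {expected_bucket_size(SAMPLE, fn):.2f}")
```

Ten names are far too few to rank the hash functions; the sample only shows the shape of the calculation.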

lars512 · on Jan 9, 2011

There's a tradeoff between speed of lookup and ability to find things when you only partially recall the name. For those situations (the "tip-of-the-tongue" phenomenon), I'd pick either the first and final letters as the easiest to recall, or the first and second letters. The fourth letter won't be easy to retrieve unless you recall the name exactly.
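
A rough sketch of what that tradeoff looks like, assuming the address book is modeled as a dictionary keyed by two letters. The function names and the sample names are made up for illustration. A letter you cannot recall is passed as `None`, and every bucket that matches the remaining letter has to be scanned.

```python
from collections import defaultdict

def letter_at(name, i):
    """Return the lowercase letter at index i, or '_' if the name is too short."""
    return name[i].lower() if -len(name) <= i < len(name) else "_"

def build_index(names, i, j):
    """Group names into buckets keyed by the letters at positions i and j."""
    index = defaultdict(list)
    for name in names:
        index[(letter_at(name, i), letter_at(name, j))].append(name)
    return index

def lookup(index, a=None, b=None):
    """Return all names in buckets consistent with the recalled letters.

    None means that letter was not recalled, so it matches any bucket.
    """
    return sorted(
        name
        for (ka, kb), bucket in index.items()
        if (a is None or ka == a) and (b is None or kb == b)
        for name in bucket
    )

# Illustrative sample only.
names = ["Martinez", "Miller", "Moore", "Smith"]
first_and_last = build_index(names, 0, -1)

exact = lookup(first_and_last, a="m", b="r")   # both letters recalled
partial = lookup(first_and_last, a="m")        # only the first letter recalled
```

With both letters recalled, a single bucket is searched; with only the first, every bucket under that letter is. The same holds for a first-and-fourth index, except that the fourth letter is the one more likely to be missing.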

kragen · on Jan 9, 2011

This is awesome! Thanks!

The reason I used English corpus words was precisely that I did not have this surname frequency list handy.

I don't think this is a list of English names, though. #18 is Garcia, and #19 is Martinez, and so on. It's a list of US names. A similar list for Argentina would probably be most useful for me at the moment.

Amnon · on Jan 9, 2011

Did anyone notice the mailing list? It's one dedicated to the author. Interesting alternative for a blog.

silentbicycle · on Jan 9, 2011

There are some real gems on the list, too:

"Smalltalk Performance and Moore's Law" http://lists.canonical.org/pipermail/kragen-tol/2007-March/0...

"OCaml vs. SBCL, and various other interpreters" http://lists.canonical.org/pipermail/kragen-tol/2007-March/0...

"what affects programming language adoption?" http://lists.canonical.org/pipermail/kragen-tol/2006-Novembe...

(kragen is also kragen here)

kragen · on Jan 9, 2011

Thank you! I'm glad you enjoyed them.

gwern · on Jan 9, 2011

I will say one thing: a blog would have had trouble surviving since 1998. http://lists.canonical.org/pipermail/kragen-hacks/

kragen · on Jan 9, 2011

The mailing list is having some trouble surviving, too. Apparently at some point Google decided our domain was spammy, and all the Gmail subscribers started getting their mail automatically spam-filtered. I contacted a bunch of people directly (via Google Talk or Facebook) to get them to fish the latest mail out of the spambox.

Beyond that, I don't know what to do. I guess I could post more often.

I certainly need a better blog interface for it. What I have right now is http://www.bentwookie.org/blog.

gwern · on Jan 10, 2011

Reading through http://lists.canonical.org/pipermail/kragen-tol/2010-March/0...

Managing a site through a DVCS is, IMO, a good idea. (I do it for my own site, http://www.gwern.net/ ). But I think your worries are somewhat groundless. If you are interested in preventing patent problems years down the line, there's no need for fancy cryptographic commitment schemes; you could probably just appeal to archive sites like the Internet Archive or WebCitation. When you access their archived pages, the pages come with timestamps in the frame or as part of the URL.

To some extent, you are already doing this: http://web.archive.org/web/*/http://lists.canonical.org/pipe...

(I know that the Internet Archive has been used by some courts for one purpose or another, though I don't know that it has been employed for demonstrating prior art.)

That said, if you investigated existing cryptographic time-stamping services (http://en.wikipedia.org/wiki/Trusted_timestamping#External_l...) and figured out how to integrate them into a DVCS (a shell script called from cron?), I would certainly find that an interesting thing to read on Hacker News.
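
As a starting point, the cron-driven half of that idea might look something like the sketch below. It assumes a git repository and only prepares the digest you would hand to a timestamping service. The submission step is deliberately left out, since no particular service or API is specified here, and the function names are hypothetical. The locally recorded time proves nothing by itself; only the external service's signed response would.

```python
import hashlib
import subprocess
from datetime import datetime, timezone

def head_commit(repo_path="."):
    """Ask git for the id of the current HEAD commit in repo_path."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_path, check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()

def digest_record(commit_id, now=None):
    """Build one log line: UTC time, commit id, SHA-256 of the commit id.

    The commit id already commits to the repository contents and history,
    so the digest of the id is what would be sent for timestamping.
    """
    now = now or datetime.now(timezone.utc)
    digest = hashlib.sha256(commit_id.encode("ascii")).hexdigest()
    return f"{now.isoformat()} {commit_id} {digest}"

# Example use from a cron job (assumes the current directory is a git repo):
#   print(digest_record(head_commit()))
```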

abecedarius · on Jan 9, 2011

Rationale here: http://lists.canonical.org/pipermail/kragen-tol/2010-March/0...