Bingo, string comparisons will eat you alive. Even if you have a very small percentage of collisions in your hash (even close to 0%), you still have to check whether you have a collision to decide if you need to iterate to the next slot, no matter what. If you are hashing strings, you have to compare the entire string, and each character comparison costs at least one integer sub operation for 1-byte-wide character sets and two for Unicode.
But basically, tries compute almost as fast as just reading the string, while hashes have to read the string a few times AND compute the function AND deal with collisions (AND resize the hash which is stupid expensive).
So to either insert or lookup in a hash you have to:
1) Compute the hash from the string, which naively involves reading the entire string into memory and computing the hash code. So for each character you have to move at least one or two bytes into memory, do something with it (likely some kind of integer op), store the result, then move the next byte or two in, do an operation with that and so on. Let's not forget that I also have to check for end of string, which uses at least one sub operation for each character as well for C style strings.
Otherwise you keep a strlen counter in a register somewhere and do sub operations against that; either way you pay per character.
1b) If you are smart you simply restrict the length you hash over to some smaller length that will still do a pretty good job of assuring a reasonable level of uniqueness in the hash, but you can't ever guarantee it (uniqueness is the principal problem of hashing functions).
2) But still, let's assume I'm dealing with a nice small 16 character string (n=16). I have to execute many more operations than just reading the string: at least 1 mov, 1 sub (to check for EOS), then a few operations for your hash and probably some copies back into memory, so let's be generous and say 8 ops for each character. So if n=16, that's at least 128 ops just to compute the hash.
3) Then we have to do some pointer arithmetic or some table lookups or whatever to get to the memory location and store the string. So in all we're talking about 500-1000 ops to insert a 16 character string without even checking for collisions.
4) Factor that in, assume no collision, we might be on the order of 1200 executed ops in the end. If we have a collision, a couple thousand or more.
So for n=16, ops=~1200 minimum.
5) Even when there isn't a collision, I still have to do a linear string compare against the string stored in the slot to assure myself that there wasn't one. So we have to start moving integers into registers and doing a bunch of subs again (and checking for EOS on both strings now). This could easily introduce another 50-100 ops just to rule out a collision.
Lookups on the hash are also of a similar scale. Heaven help you if you have to expand the hash and move all that crap from one place to another.
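To make that ops counting concrete, here's a minimal C sketch of the naive flow described above. The table size, the djb2-style hash, and the linear-probing scheme are just illustrative choices on my part, not anyone's real implementation, and it skips resizing entirely (which, as noted, is the really expensive part):

    #include <stdlib.h>
    #include <string.h>

    #define TABLE_SIZE 4096            /* illustrative size, not tuned */

    static char *table[TABLE_SIZE];    /* open addressing, one string pointer per slot */

    /* Naive string hash: one load, a couple of integer ops, and an
     * end-of-string check per character -- roughly where the
     * "8 ops per character" estimate comes from. */
    static unsigned long hash_string(const char *s)
    {
        unsigned long h = 5381;
        while (*s)                     /* EOS test every iteration */
            h = h * 33 + (unsigned char)*s++;
        return h;
    }

    /* Insert with linear probing.  Note the full strcmp against every
     * occupied slot: that second read of the string is the hidden cost
     * the big-O figures usually ignore.  No resize handling here. */
    static void insert(const char *s)
    {
        unsigned long i = hash_string(s) % TABLE_SIZE;
        while (table[i] != NULL) {
            if (strcmp(table[i], s) == 0)   /* already present */
                return;
            i = (i + 1) % TABLE_SIZE;       /* probe the next slot */
        }
        table[i] = strdup(s);
    }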
For a trie, there is no such thing as a collision. In the naive version, you simply allocate an array the length of the character set + 1, which takes a couple of ops since you know that length ahead of time, and use the character (I'm assuming 1-byte characters, but the principle is the same for 2-byte) as an operand in some simple pointer arithmetic. Each array index stores a pointer to the next array (also character set + 1), and you follow those until you finish off the string. If you are inserting, you point to one more array and set that extra +1 slot to something meaningful. If you are doing a lookup, you check that location to see if it is set to something meaningful (or you can just set it on the last character instead of one more level down the tree, whatever).
The point is, for a 16 character string you count say two ops for memory allocation, an op to find the slot in the array, and another op for allocation or value setting for each character.
So in all, an insert or lookup happens almost as fast as a direct string compare, maybe on the order of n=16, ops<100. Lookups can be even faster since you get to short-circuit the lookup (read: stop reading the string) as soon as a single character fails to match.
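For comparison, here's the same kind of sketch for the naive trie, again in C and again just my reading of the description above (one child pointer per possible byte, plus a terminal flag):

    #include <stdlib.h>

    #define ALPHABET 256               /* 1-byte characters */

    /* Naive trie node: one child pointer per possible byte value,
     * plus a flag marking "a string ends here". */
    struct trie {
        struct trie *child[ALPHABET];
        int is_end;
    };

    static struct trie *new_node(void)
    {
        return calloc(1, sizeof(struct trie));   /* zeroed: no children yet */
    }

    /* Insert: one array index and at most one allocation per character,
     * no hash to compute and no collisions to check. */
    static void trie_insert(struct trie *root, const char *s)
    {
        for (; *s; s++) {
            unsigned char c = (unsigned char)*s;
            if (root->child[c] == NULL)
                root->child[c] = new_node();
            root = root->child[c];
        }
        root->is_end = 1;
    }

    /* Lookup: short-circuits the moment a character has no child,
     * so a miss can cost less than one full read of the string. */
    static int trie_lookup(const struct trie *root, const char *s)
    {
        for (; *s; s++) {
            root = root->child[(unsigned char)*s];
            if (root == NULL)
                return 0;
        }
        return root->is_end;
    }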
I'm wagging a bit, I've never built either structure in asm, but if you analyze the number of operations each takes, the trie is insanely faster. Even with memory access lags.
You can do some obvious speedups, like loading several characters into a register at a time for the integer arithmetic, or loading the hash slots' value chains ahead of time, or something...
One of the confusing points here is that the big-O estimates I've seen never bother to account for the string reads and comparisons. With a hash you basically have to do them at least twice; with a trie, only once (and sometimes not even that). Usually you see hashes quoted as O(1) best and O(n) worst, but really they are O(k) and O(k*n), where k is the string length, while tries are really O(k).
All that being said, as soon as you try to work with either structure on disk, hashes easily overtake tries in terms of speed since hard drive random access times are much worse and tries require lots and lots and lots of random access (for each character).
Fascinating - thanks for the detailed response, and that does make a lot of sense.
It seems that if string comparisons are the major cost, you could speed up collision handling on the hash table quite a bit by storing additional metadata. Storing the size of the string, for example, would allow you to skip a lot of comparisons - you can do a simple size comparison prior to actually comparing characters. This would also allow you to do slightly more funky string comparisons, such as comparing characters from the end first, or even starting at the middle, to skip common prefixes.
Alternatively, instead of storing the size you could store the computed hash value for each string, which would allow you to compare hash values rather than full strings. The likelihood of a collision on a 32 bit hash value (say) is pretty low, so you'd mostly be avoiding extraneous string comparisons. This would also speed up growing/shrinking the table quite a bit. Have you tried these sorts of changes, and what (if any) difference do they make?
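For what it's worth, I imagine a slot entry carrying that extra metadata would look something like this; the layout and names are just made up for illustration, the point being where the cheap comparisons happen before the character-by-character one:

    #include <string.h>

    /* Hypothetical chained-bucket entry that caches the full hash and
     * the string length alongside the key. */
    struct entry {
        unsigned int  hash;    /* full hash value, compared first */
        unsigned int  len;     /* strlen(key), compared second */
        char         *key;
        struct entry *next;    /* collision chain */
    };

    /* Only fall back to the byte-by-byte compare when both the cached
     * hash and the length already match. */
    static int entry_matches(const struct entry *e, const char *s,
                             unsigned int hash, unsigned int len)
    {
        return e->hash == hash
            && e->len  == len
            && memcmp(e->key, s, len) == 0;
    }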
I can see that these optimisations merely reduce the cost of collisions rather than the baseline non-collision cost, and it definitely makes sense that tries require fewer operations. What I wonder is whether, as you scale up the trie to very large datasets, the superiority of the trie tails off due to the memory latency? Or are tries basically always superior in memory, due to their depth never getting all that large?
(To be clear, I don't mean to doubt your work on this subject, since you're clearly rather more knowledgeable than I am - I'm just interested :-) )
I think this is an incredibly interesting, and often overlooked, subject. I've even seen hashes that resolve collisions by sticking tries in the slots instead of linked lists for chaining. That way the "am I looking at a collision?" part of the hash operation becomes "give me the memory location I want, 100% of the time". It essentially eliminates hash collisions in the classic sense, and the tries in the hash slots usually don't get very big, so you can delay resizing the hash for quite a while (just use a different trigger, like when the first trie holds more than n strings).
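A rough sketch of the shape that hybrid takes, reusing the hash_string, struct trie, new_node, and trie_insert helpers from the sketches above (this is just my simplified reading of the idea, not a reference implementation):

    /* Hybrid: each hash slot holds a small trie instead of a chain, so
     * "resolving a collision" is just a few more trie steps rather than
     * a string compare against whatever is already in the slot. */
    #define SLOTS 1024

    static struct trie *slots[SLOTS];

    static void hybrid_insert(const char *s)
    {
        unsigned long i = hash_string(s) % SLOTS;
        if (slots[i] == NULL)
            slots[i] = new_node();
        trie_insert(slots[i], s);      /* the trie absorbs all "collisions" */
    }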
Some of the optimizations you mention make a lot of sense; I was generally describing the more naive variations of the two concepts. One thing to keep in mind is that tries grow in memory amazingly fast, so fast that the rest of your system will have to start swapping just to keep the trie in memory. A list of a quarter million tokens with an average length of 6 will probably eat about 300-400MB before optimizations. If you restrict the tree depth to 4 or so and stick hashes on the ends, you can keep it around ~100MB. But a quarter million tokens is not really all that much; real life applications will usually involve millions of tokens, and a naive trie will rapidly outgrow the memory on 32-bit systems. It's the classic tradeoff, memory for speed. However, hashes are not really all that bad speed-wise, so the tradeoff isn't quite as good as one might hope. If you have to go from a hash to a trie to eke out an extra few percentage points in performance, you likely have bigger problems.
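To give a rough sense of where those memory numbers come from, here's the back-of-the-envelope arithmetic on the naive node layout from the trie sketch above; the real figure depends on pointer width and how much prefix sharing your token set gets, so treat it strictly as a ballpark:

    #include <stdio.h>

    /* Why naive tries balloon: every node carries a full pointer array
     * for the whole character set, whether it has 1 child or 200. */
    int main(void)
    {
        size_t node_bytes = 256 * sizeof(void *) + sizeof(int);
        printf("bytes per naive node: %zu\n", node_bytes);   /* ~1KB on 32-bit, ~2KB on 64-bit */
        printf("100,000 nodes: about %zu MB\n",
               node_bytes * 100000 / (1024 * 1024));         /* already in the hundreds of MB */
        return 0;
    }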
In practice I've actually only used a trie over a hash a couple of times. Mostly because in the applications I'm working on, I never have to grow the hash, which is the most expensive hash operation. But also because there are already so many very good hash libraries in most languages, particularly persistent hashes, which can be very memory efficient. The practical difference in speed between the two is really not that noticeable for most applications unless you are doing billions of lookups a day (in which case you probably want a different setup anyway, involving some kind of database).
That is (again) really interesting/informative. Thanks - this is one of those conversations that really restores my faith in communicating over the web :-).
http://www.statemaster.com/encyclopedia/Trie