Hacker News

No, not at all.

I think this is an incredibly interesting, and often overlooked, subject. I've even seen hash tables that resolve collisions by sticking tries in the slots instead of linked lists for chaining. That way the "am I looking at a collision?" check becomes "give me the memory location I want, 100% of the time". It essentially eliminates hash collisions in the classic sense, and the tries in the slots usually don't get very big, so you can delay resizing the table for quite a while (just use a different trigger, e.g. resize when the first trie holds more than n strings).
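To make the idea concrete, here's a minimal sketch of that scheme (all names are my own, hypothetical ones): a fixed-size table whose buckets are small character tries rather than linked lists, so a lookup never loops over colliding keys comparing them one by one.

```python
class TrieNode:
    """One node of a per-bucket character trie."""
    def __init__(self):
        self.children = {}    # char -> TrieNode
        self.value = None     # payload, set when a key terminates here
        self.present = False  # distinguishes "stored None" from "absent"

class TrieBucketTable:
    """Hash table that chains collisions into tries instead of lists."""
    def __init__(self, nslots=64):
        self.slots = [TrieNode() for _ in range(nslots)]

    def _slot(self, key):
        return self.slots[hash(key) % len(self.slots)]

    def put(self, key, value):
        node = self._slot(key)
        for ch in key:                       # walk/extend the bucket's trie
            node = node.children.setdefault(ch, TrieNode())
        node.present, node.value = True, value

    def get(self, key):
        node = self._slot(key)
        for ch in key:                       # no key comparisons, just a walk
            node = node.children.get(ch)
            if node is None:
                return None
        return node.value if node.present else None
```

Even with many keys hashing to the same slot, a probe costs one trie walk of length len(key), never a scan over the other colliding keys.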

Some of the optimizations you mention make a lot of sense; I was generally responding to the more naive variants of the two concepts. One thing to keep in mind is that tries grow in memory amazingly fast, so fast that the rest of your system may have to start swapping just to keep the trie resident. A list of a quarter million tokens with an average length of 6 will probably eat about 300-400MB before optimizations. If you restrict the tree depth to 4 or so and stick hashes on the ends, you can keep it around ~100MB. But a quarter million tokens is not really all that much; real-life applications will usually involve millions of tokens, and a naive trie will rapidly outgrow the memory on 32-bit systems. It's the classic tradeoff: memory for speed. However, hashes are not really all that bad speed-wise, so the tradeoff isn't as good as one might hope. If you have to go from a hash to a trie to eke out an extra few percentage points of performance, you likely have bigger problems.
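The depth-capped variant described above can be sketched like this (a hypothetical illustration, not anyone's actual implementation): walk the trie for at most a few characters, then fall back to an ordinary hash keyed by the remaining suffix, which bounds the node explosion.

```python
class CappedTrie:
    """Trie limited to max_depth levels; leaves hold a dict of key suffixes."""
    def __init__(self, max_depth=4):
        self.max_depth = max_depth
        self.root = {}

    def put(self, key, value):
        node = self.root
        for ch in key[:self.max_depth]:          # trie part: first few chars
            node = node.setdefault(ch, {})
        # hash part: remaining suffix goes into a plain dict at the leaf
        node.setdefault("__leaf__", {})[key[self.max_depth:]] = value

    def get(self, key):
        node = self.root
        for ch in key[:self.max_depth]:
            node = node.get(ch)
            if node is None:
                return None
        return node.get("__leaf__", {}).get(key[self.max_depth:])
```

The trie still gives prefix-sharing for the hot first characters, while long tails no longer spawn a node per character.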

In practice I've actually only used a trie over a hash a couple of times, mostly because in my applications I never have to grow the hash, which is the most expensive hash operation, but also because most languages already have very good hash libraries, particularly persistent hashes, which can be very memory efficient. The practical difference in speed between the two is really not noticeable for most applications unless you are doing billions of lookups a day (in which case you probably want a different setup anyway, involving some kind of database).



That is (again) really interesting/informative. Thanks - this is one of those conversations that really restores my faith in communicating over the web :-).


Actually, your idea of sticking the strlen on the hash nodes as a heuristic to skip unnecessary collision checks is pretty good.
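A quick sketch of that length heuristic (names hypothetical): store each key's length next to it in the bucket, and compare lengths before doing the full string comparison. In Python this is only illustrative since `==` on strings already checks length first, but in C it would gate a `strcmp` behind a single integer compare.

```python
class LenHintTable:
    """Chained hash table whose entries carry a precomputed key length."""
    def __init__(self, nslots=64):
        # each slot is a list of (key_length, key, value) tuples
        self.slots = [[] for _ in range(nslots)]

    def put(self, key, value):
        bucket = self.slots[hash(key) % len(self.slots)]
        for i, (klen, k, _) in enumerate(bucket):
            if klen == len(key) and k == key:   # length check gates compare
                bucket[i] = (klen, k, value)    # overwrite existing entry
                return
        bucket.append((len(key), key, value))

    def get(self, key):
        bucket = self.slots[hash(key) % len(self.slots)]
        n = len(key)
        for klen, k, v in bucket:
            if klen == n and k == key:          # cheap int check first
                return v
        return None
```

On a bucket holding collisions of mixed lengths, most probes reduce to one integer comparison instead of a character-by-character key compare.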

I too am glad to engage in this topic!




