Bingo, string comparisons will eat you alive. Even if you have a very small percentage of collisions in your hash (even close to 0%), you still have to check whether you have a collision to decide if you need to iterate to the next slot, no matter what. If you are hashing strings, you have to compare the entire string, and each character comparison costs at least one integer sub operation for 1-byte-wide character sets and two for Unicode.
But basically, tries compute almost as fast as just reading the string, while hashes have to read the string a few times AND compute the function AND deal with collisions (AND resize the hash which is stupid expensive).
So to either insert or lookup in a hash you have to:
1) Compute the hash from the string, which naively involves reading the entire string into memory and computing the hash code. So for each character you have to move at least one or two bytes into memory, do something with it (likely some kind of integer op), store the result, then move the next byte or two in, do an operation with that and so on. Let's not forget that I also have to check for end of string, which uses at least one sub operation for each character as well for C style strings.
Otherwise you keep a strlen counter in a register somewhere and do sub operations against that; either way you pay per character.
1b) If you are smart you simply restrict the length you hash over to some smaller length that will still do a pretty good job of assuring a reasonable level of uniqueness in the hash, but you can't ever guarantee it (uniqueness is the principal problem of hashing functions).
2) But still, let's assume I'm dealing with a nice small 16 character string (n=16). I have to execute many more operations than just reading the string: at least 1 mov, 1 sub (to check for EOS), then a few operations for your hash and probably some copies back into memory, so let's be generous and say 8 ops for each character. So if n=16, that's at least 128 ops just to compute the hash.
3) Then we have to do some pointer arithmetic or some table lookups or whatever to get to the memory location and store the string. So in all we're talking about 500-1000 ops to insert a 16 character string without even checking for collisions.
4) Factor that in, assume no collision, we might be on the order of 1200 executed ops in the end. If we have a collision, a couple thousand or more.
So for n=16, ops=~1200 minimum.
5) Even when there isn't a collision, I still have to do a linear string compare against the string stored in the slot to assure myself that there wasn't one. So we have to start moving integers into registers and doing a bunch of subs again (and checking for EOS on both strings now). This could easily introduce another 50-100 ops just to rule out a collision.
Lookups on the hash are also of a similar scale. Heaven help you if you have to expand the hash and move all that crap from one place to another.
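To make that ops counting concrete, here's a minimal C sketch of the naive flow described above. The table size, the djb2-style hash, and the linear-probing scheme are just illustrative choices on my part, not anyone's real implementation, and it skips resizing entirely (which, as noted, is the really expensive part):

    #include <stdlib.h>
    #include <string.h>

    #define TABLE_SIZE 4096            /* illustrative size, not tuned */

    static char *table[TABLE_SIZE];    /* open addressing, one string pointer per slot */

    /* Naive string hash: one load, a couple of integer ops, and an
     * end-of-string check per character -- roughly where the
     * "8 ops per character" estimate comes from. */
    static unsigned long hash_string(const char *s)
    {
        unsigned long h = 5381;
        while (*s)                     /* EOS test every iteration */
            h = h * 33 + (unsigned char)*s++;
        return h;
    }

    /* Insert with linear probing.  Note the full strcmp against every
     * occupied slot: that second read of the string is the hidden cost
     * the big-O figures usually ignore.  No resize handling here. */
    static void insert(const char *s)
    {
        unsigned long i = hash_string(s) % TABLE_SIZE;
        while (table[i] != NULL) {
            if (strcmp(table[i], s) == 0)   /* already present */
                return;
            i = (i + 1) % TABLE_SIZE;       /* probe the next slot */
        }
        table[i] = strdup(s);
    }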
For a trie, there is no such thing as a collision. In the naive version, you simply allocate an array the length of the character set + 1, which takes a couple of ops since you know that length ahead of time, and use the character (I'm assuming 1-byte characters, but the principle is the same for 2-byte) as an operand in some simple pointer arithmetic. Each array index stores a pointer to the next array (also character set + 1), and you follow those until you finish off the string. If you are inserting, you point to one more array and set that extra +1 slot to something meaningful. If you are doing a lookup, you check that location to see if it is set to something meaningful (or you can just set it on the last character instead of one more level down the tree, whatever).
The point is, for a 16 character string you count say two ops for memory allocation, an op to find the slot in the array, and another op for allocation or value setting for each character.
So in all, an insert or lookup happens almost as fast as a direct string compare, maybe on the order of n=16, ops<100. Lookups can be even faster since you get to short-circuit the lookup (read: stop reading the string) as soon as a single character fails to match.
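For comparison, here's the same kind of sketch for the naive trie, again in C and again just my reading of the description above (one child pointer per possible byte, plus a terminal flag):

    #include <stdlib.h>

    #define ALPHABET 256               /* 1-byte characters */

    /* Naive trie node: one child pointer per possible byte value,
     * plus a flag marking "a string ends here". */
    struct trie {
        struct trie *child[ALPHABET];
        int is_end;
    };

    static struct trie *new_node(void)
    {
        return calloc(1, sizeof(struct trie));   /* zeroed: no children yet */
    }

    /* Insert: one array index and at most one allocation per character,
     * no hash to compute and no collisions to check. */
    static void trie_insert(struct trie *root, const char *s)
    {
        for (; *s; s++) {
            unsigned char c = (unsigned char)*s;
            if (root->child[c] == NULL)
                root->child[c] = new_node();
            root = root->child[c];
        }
        root->is_end = 1;
    }

    /* Lookup: short-circuits the moment a character has no child,
     * so a miss can cost less than one full read of the string. */
    static int trie_lookup(const struct trie *root, const char *s)
    {
        for (; *s; s++) {
            root = root->child[(unsigned char)*s];
            if (root == NULL)
                return 0;
        }
        return root->is_end;
    }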
I'm wagging a bit, I've never built either structure in asm, but if you analyze the number of operations each takes, the trie is insanely faster. Even with memory access lags.
You can do some obvious speedups, like loading several characters into a register at a time for the integer arithmetic, or loading the hash slots' value chains ahead of time, or something...
One of the confusing points here is that the big-O estimates I've seen never bother to account for the string reads and comparisons. With a hash you basically have to do them at least twice; with a trie, only once (and sometimes not even that). Usually you see hashes quoted as O(1) best and O(n) worst, but really they are O(k) and O(k*n), where k is the string length, while tries are really O(k).
All that being said, as soon as you try to work with either structure on disk, hashes easily overtake tries in terms of speed since hard drive random access times are much worse and tries require lots and lots and lots of random access (for each character).
Fascinating - thanks for the detailed response, and that does make a lot of sense.
It seems that if string comparisons are the major cost, you could speed up collision handling on the hash table quite a bit by storing additional metadata. Storing the size of the string, for example, would allow you to skip a lot of comparisons - you can do a simple size comparison prior to actually comparing characters. This would also allow you to do slightly more funky string comparisons, such as comparing characters from the end first, or even starting at the middle, to skip common prefixes.
Alternatively, instead of storing the size you could store the computed hash value for each string, which would allow you to compare hash values rather than full strings. The likelihood of a collision on a 32 bit hash value (say) is pretty low, so you'd mostly be avoiding extraneous string comparisons. This would also speed up growing/shrinking the table quite a bit. Have you tried these sorts of changes, and what (if any) difference do they make?
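For what it's worth, I imagine a slot entry carrying that extra metadata would look something like this; the layout and names are just made up for illustration, the point being where the cheap comparisons happen before the character-by-character one:

    #include <string.h>

    /* Hypothetical chained-bucket entry that caches the full hash and
     * the string length alongside the key. */
    struct entry {
        unsigned int  hash;    /* full hash value, compared first */
        unsigned int  len;     /* strlen(key), compared second */
        char         *key;
        struct entry *next;    /* collision chain */
    };

    /* Only fall back to the byte-by-byte compare when both the cached
     * hash and the length already match. */
    static int entry_matches(const struct entry *e, const char *s,
                             unsigned int hash, unsigned int len)
    {
        return e->hash == hash
            && e->len  == len
            && memcmp(e->key, s, len) == 0;
    }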
I can see that these optimisations merely reduce the cost of collisions rather than the baseline non-collision cost, and it definitely makes sense that tries require fewer operations. What I wonder is whether, as you scale up the trie to very large datasets, the superiority of the trie tails off due to the memory latency? Or are tries basically always superior in memory, due to their depth never getting all that large?
(To be clear, I don't mean to doubt your work on this subject, since you're clearly rather more knowledgeable than I am - I'm just interested :-) )
I think this is an incredibly interesting, and often overlooked, subject. I've even seen hashes that resolve collisions by sticking tries in the slots instead of linked lists for chaining. That way the "am I looking at a collision?" part of the hash operation becomes "give me the memory location I want, 100% of the time". It essentially eliminates hash collisions in the classic sense, and the tries in the hash slots usually don't get very big, so you can delay resizing the hash for quite a while (just use a different trigger, like when the first trie holds more than n strings).
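A rough sketch of the shape that hybrid takes, reusing the hash_string, struct trie, new_node, and trie_insert helpers from the sketches above (this is just my simplified reading of the idea, not a reference implementation):

    /* Hybrid: each hash slot holds a small trie instead of a chain, so
     * "resolving a collision" is just a few more trie steps rather than
     * a string compare against whatever is already in the slot. */
    #define SLOTS 1024

    static struct trie *slots[SLOTS];

    static void hybrid_insert(const char *s)
    {
        unsigned long i = hash_string(s) % SLOTS;
        if (slots[i] == NULL)
            slots[i] = new_node();
        trie_insert(slots[i], s);      /* the trie absorbs all "collisions" */
    }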
Some of the optimizations you mention make a lot of sense; I was generally describing the more naive variations of the two concepts. One thing to keep in mind is that tries grow in memory amazingly fast, so fast that the rest of your system will have to start swapping just to keep the trie in memory. A list of a quarter million tokens with an average length of 6 will probably eat about 300-400MB before optimizations. If you restrict the tree depth to 4 or so and stick hashes on the ends, you can keep it around ~100MB. But a quarter million tokens is not really all that much; real life applications will usually involve millions of tokens, and a naive trie will rapidly outgrow the memory on 32-bit systems. It's the classic tradeoff, memory for speed. However, hashes are not really all that bad speed-wise, so the tradeoff isn't quite as good as one might hope. If you have to go from a hash to a trie to eke out an extra few percentage points in performance, you likely have bigger problems.
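To give a rough sense of where those memory numbers come from, here's the back-of-the-envelope arithmetic on the naive node layout from the trie sketch above; the real figure depends on pointer width and how much prefix sharing your token set gets, so treat it strictly as a ballpark:

    #include <stdio.h>

    /* Why naive tries balloon: every node carries a full pointer array
     * for the whole character set, whether it has 1 child or 200. */
    int main(void)
    {
        size_t node_bytes = 256 * sizeof(void *) + sizeof(int);
        printf("bytes per naive node: %zu\n", node_bytes);   /* ~1KB on 32-bit, ~2KB on 64-bit */
        printf("100,000 nodes: about %zu MB\n",
               node_bytes * 100000 / (1024 * 1024));         /* already in the hundreds of MB */
        return 0;
    }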
In practice I've actually only used a trie over a hash a couple of times. Mostly because in the applications I'm working on, I never have to grow the hash, which is the most expensive hash operation. But also because there are already so many very good hash libraries in most languages, particularly persistent hashes, which can be very memory efficient. The practical difference in speed between the two is really not that noticeable for most applications unless you are doing billions of lookups a day (in which case you probably want a different setup anyway, involving some kind of database).
That is (again) really interesting/informative. Thanks - this is one of those conversations that really restores my faith in communicating over the web :-).
http://www.statemaster.com/encyclopedia/Trie