Inside the Redis distribution you can find "dict.c", a very easy to understand hash table implementation that uses incremental rehashing: when it needs to switch to a bigger or smaller table, this happens incrementally, in three ways:
1) At every operation you perform there is a rehashing step.
2) You have a function that you can call to perform a rehashing step.
3) There is a higher-level function where you can say "rehash for N milliseconds" in case you have some idle time to spend for a good cause inside your program.
Another uncommon feature of dict.c is that it supports a "pick a random element" operation with guaranteed O(1) behaviour.
EDIT: Now that I look at the source, while the code is easy to follow, a design section should clearly be present in the dict.c top comment. I'll make sure to add one, but long story short:
* We use both chaining and incremental rehashing. This means there is no need to rehash as soon as possible when the hash table reaches the maximum percentage of elements allowed. At the same time, thanks to incremental rehashing, there is no need to block.
* We support safe and unsafe iterators: safe iterators block rehashing while they are active, so you have the feeling you are accessing the hash table in a read-only way. Unsafe iterators are ok for looping even while an incremental rehashing is in progress, but it is not ok to touch the hash table while iterating with an unsafe iterator (otherwise the iterator no longer guarantees to return all the elements, or to avoid duplicated elements).
* The hash table is rehashed in both directions: if it's too full, or if it's too empty. This way there is the guarantee that the table's fill percentage always stays between a minimum and a maximum, which in turn guarantees that our random element function is O(1) in the average case.
* When we rehash, a second table is created, and entries are moved incrementally from one to the other. While we have two tables, all lookup/delete operations are run against both tables, as the element can be in either the old or the new table. Insertions, instead, only happen in the new table.
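The two-table routing in the last point can be sketched as follows. This is a hypothetical, stripped-down model, not code from dict.c: the names only loosely follow the original, and chaining is omitted (each slot holds at most one key), so it only illustrates where lookups and insertions go, not a complete hash table:

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    const char **table[2];  /* table[0] = old table, table[1] = new */
    size_t size[2];         /* sizes are powers of two */
    int rehashing;          /* nonzero while a rehash is in progress */
} toydict;

static size_t toy_hash(const char *key, size_t size) {
    size_t h = 5381;        /* djb2-style string hash */
    while (*key) h = h * 33 + (unsigned char)*key++;
    return h & (size - 1);
}

toydict *toy_new(size_t oldsize, size_t newsize) {
    toydict *d = calloc(1, sizeof(*d));
    d->table[0] = calloc(oldsize, sizeof(const char *));
    d->table[1] = calloc(newsize, sizeof(const char *));
    d->size[0] = oldsize;
    d->size[1] = newsize;
    return d;
}

/* Lookups must scan both tables while rehashing: the key may not
 * have been migrated to the new table yet. */
const char *toy_find(toydict *d, const char *key) {
    int tables = d->rehashing ? 2 : 1;
    for (int t = 0; t < tables; t++) {
        size_t i = toy_hash(key, d->size[t]);
        if (d->table[t][i] && strcmp(d->table[t][i], key) == 0)
            return d->table[t][i];
    }
    return NULL;
}

/* Insertions go only into the new table during a rehash, so the old
 * table can only shrink as its buckets are migrated over. */
void toy_add(toydict *d, const char *key) {
    int t = d->rehashing ? 1 : 0;
    d->table[t][toy_hash(key, d->size[t])] = key;
}
```

The design choice here is that insertions never feed the old table, which guarantees the migration terminates: the old table monotonically empties while lookups stay correct throughout.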
Just a bit of clarification, when you say "pick a random element", do you mean choosing an arbitrary element, or do you actually mean choosing one at random such that each element has a 1/N probability of being chosen?
I'd yield to someone who understands the code, but a quick glance at dictGetRandomKey() shows it is randomized, and I believe it is very close to 1/N. If there are a large number of elements in a single bucket this may not be true. At an extreme, if there is one element in one bucket and N-1 in another, there is a 50% chance the first will get picked, and a 50/(N-1)% chance for each of the others.
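The arithmetic behind that claim can be made explicit. The function below is a model of the two-step sampling being discussed (a uniformly random non-empty bucket, then a uniformly random element of its chain), not code taken from dict.c:

```c
/* Exact probability that two-step sampling returns one particular
 * element of the given bucket: 1/nonempty to hit the bucket, then
 * 1/chain_len inside it. (Model of the idea, not dict.c code.) */
double pick_probability(const int *chain_len, int nbuckets, int bucket) {
    int nonempty = 0;
    for (int i = 0; i < nbuckets; i++)
        if (chain_len[i] > 0) nonempty++;
    if (nonempty == 0 || chain_len[bucket] == 0) return 0.0;
    return 1.0 / nonempty / (double)chain_len[bucket];
}
```

With chain lengths {1, 9} (so N=10), the lone element comes out at probability 0.5 instead of the fair 0.1, while each of the other nine gets 0.5/9 ≈ 0.056: exactly the 50% vs. 50/(N-1)% skew described above.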
Exactly as you said: it's a decent approximation, and with large tables it works well enough, but if there are clusters (long chains) in a bucket, those elements have a smaller chance of being picked.
However there is a trick to improve this that I don't use: instead of searching for a single non-empty bucket and selecting a random element from its chain, it is possible to do the following, choosing a small M (like M=3):
* Find M non-empty buckets.
* Pick one bucket among the M buckets, with a chance proportional to the chain length of each bucket.
* Finally pick a random element from the bucket.
The bigger M, the more accurate the algorithm becomes.
For instance, in the pathological case you showed, where one bucket has one element and another has all the rest, this would find the two buckets and then pick the one-element bucket with a much smaller probability, adjusting for the difference in chain length.
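The M-bucket trick might look something like the sketch below. This is illustrative only (it is not in dict.c), and for simplicity the buckets are sampled with replacement:

```c
#include <stdlib.h>

/* Sketch of the proposed trick: sample M non-empty buckets, then pick
 * one with probability proportional to its chain length. The weighted
 * second step makes a long chain proportionally more likely to be
 * chosen, compensating for the bias toward short chains.
 * Returns the chosen bucket index, or -1 if the table is empty. */
int pick_bucket_weighted(const int *chain_len, int nbuckets, int m) {
    int any = 0;
    for (int i = 0; i < nbuckets; i++)
        if (chain_len[i] > 0) any = 1;
    if (!any || m > 16) return -1;

    int chosen[16];                      /* the M sampled buckets */
    int total = 0, found = 0;
    while (found < m) {
        int b = rand() % nbuckets;       /* random probe */
        if (chain_len[b] == 0) continue; /* skip empty buckets */
        chosen[found++] = b;
        total += chain_len[b];
    }
    /* Weighted pick among the M candidates. */
    int r = rand() % total;
    for (int i = 0; i < m; i++) {
        r -= chain_len[chosen[i]];
        if (r < 0) return chosen[i];
    }
    return chosen[m - 1];
}
```

The final element would then be drawn uniformly from the chosen bucket's chain; as M grows, the combined distribution approaches uniform over all elements.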
Well, these are the main ideas, I think.