41C. Hash Tables


Hash tables

A hash table stores entries in an array. If the physical size of the array is n, and h is a hash function, then an entry with key k is stored at index h(k) mod n. Looking up key k just involves computing h(k) mod n again and looking at that index in the array. It is very fast, taking full advantage of the random-access nature of memory.

Unfortunately, there is a serious catch. It is possible for h(k1) mod n and h(k2) mod n to be the same for two different keys k1 and k2. That is called a collision. There are several ways of dealing with collisions. Here, we look at one of those ways, called chaining.
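
As a minimal sketch of the index computation and of a collision, consider the following. The function simpleHash is a hypothetical hash function chosen only for illustration, and tableIndex is a name introduced here; neither is part of these notes.

```cpp
// Hypothetical hash function, for illustration only: it combines
// the characters of s, multiplying by 31 at each step.
unsigned int simpleHash(const char* s)
{
  unsigned int h = 0;
  for(int i = 0; s[i] != '\0'; i++)
  {
    h = 31*h + (unsigned char) s[i];
  }
  return h;
}

// tableIndex(k, n) is the index where key k belongs in an
// array of physical size n, namely h(k) mod n.  (Assumes n > 0.)
int tableIndex(const char* k, int n)
{
  return (int) (simpleHash(k) % (unsigned int) n);
}
```

With this particular function, simpleHash("a") is 97 and simpleHash("f") is 102. Both leave remainder 2 when divided by 5, so keys "a" and "f" collide in an array of physical size 5.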


Handling collisions by chaining

One way to handle collisions is to store a linked list of (key, item) pairs at each index in the array. For example, suppose that the physical size of the array is 5. Imagine that we want to store (key, item) pairs for keys "cat", "dog", "frog", "goat" and "horse", and that the hash function yields the following values. (Our values are unrealistically small, but this is just for illustration.)

h("cat") = 63
h("dog") = 35
h("frog") = 79
h("goat") = 72
h("horse") = 117

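Each key k is stored at index h(k) mod 5. For example, 63 mod 5 = 3, so "cat" goes at index 3. Keys "goat" and "horse" collide, since 72 mod 5 = 117 mod 5 = 2, so they share the linked list at index 2. The hash table stores the keys as follows. (Only the keys are shown. Imagine that the item is stored underneath the key, like two playing cards on top of one another; the item is there but you do not see it.)

index 0: "dog"
index 1: (empty list)
index 2: "goat", "horse"
index 3: "cat"
index 4: "frog"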

To look up a key, just compute the index where the key belongs and do a linear search of the linked list found at that index. If the key is there, its item is stored with it.
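
The example above can be sketched in code as follows. This is only an illustration under stated assumptions: toyHash simply returns the made-up hash values listed above (and 0 for any other key), ChainedTable and its member names are hypothetical, and the standard library's std::forward_list stands in for a hand-built linked list.

```cpp
#include <cstddef>
#include <forward_list>
#include <string>
#include <utility>
#include <vector>

// Stand-in hash function: returns the illustrative values from
// the text.  Any other key hashes to 0.  Not a realistic hash.
unsigned int toyHash(const std::string& k)
{
  if(k == "cat")   return 63;
  if(k == "dog")   return 35;
  if(k == "frog")  return 79;
  if(k == "goat")  return 72;
  if(k == "horse") return 117;
  return 0;
}

typedef std::pair<std::string, std::string> Entry;   // (key, item)

struct ChainedTable
{
  std::vector<std::forward_list<Entry> > buckets;    // one list per index

  ChainedTable(int n) : buckets(n) {}

  // The index where key k belongs: h(k) mod n.
  int indexOf(const std::string& k) const
  {
    return (int) (toyHash(k) % buckets.size());
  }

  // Add (k, v) to the front of the list at k's index.
  // For brevity, this does not check for an existing entry for k.
  void add(const std::string& k, const std::string& v)
  {
    buckets[indexOf(k)].push_front(Entry(k, v));
  }

  // Linear search of the only list where k can be.  Returns a
  // pointer to the item, or NULL if k does not occur in the table.
  const std::string* find(const std::string& k) const
  {
    for(const Entry& e : buckets[indexOf(k)])
    {
      if(e.first == k)
      {
        return &e.second;
      }
    }
    return NULL;
  }
};
```

After creating ChainedTable t(5) and adding the five keys, t.indexOf("goat") and t.indexOf("horse") are both 2, so t.find("horse") searches a list of length 2, while t.find("cat") searches a list of length 1.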


Analysis and managing the physical array size

For hash tables to be efficient, the linked lists need to be short, on the average. If the physical array size is too small, the lists will be long. If the physical size is too large, there are a lot of empty lists, and that wastes memory.

A good compromise is to choose the physical size of the array to be approximately the same as the number of entries in the table, since then the average list length is about 1. (That is a little optimistic because some of the linked lists will be empty, and those lists are less commonly searched than the nonempty lists. The average length of the nonempty lists is a little more than 1, but it is still a small constant.)
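
To put a rough number on "a little more than 1", here is a back-of-the-envelope estimate. It assumes (this assumption is not stated in these notes) that the hash function spreads m keys uniformly and independently over the n indices, and it writes α = m/n for the average list length.

```latex
% Fraction of indices whose list is empty, for large n:
\Pr[\text{list is empty}] = \left(1 - \frac{1}{n}\right)^{m} \approx e^{-\alpha},
\qquad \alpha = \frac{m}{n}.

% Average length over the nonempty lists only (approximately):
\frac{m}{n\,(1 - e^{-\alpha})} = \frac{\alpha}{1 - e^{-\alpha}}.

% With as many entries as indices, \alpha = 1:
\frac{1}{1 - e^{-1}} \approx 1.58.
```

Under that assumption, roughly 37% of the lists are empty when α = 1, and the nonempty lists have average length about 1.58, which is still a small constant.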

So as long as the physical size of the array is kept very roughly the same as the number of entries, insertion, deletion and lookup in a hash table take constant time, on the average. The average cost does not depend on the number of entries! That is extraordinarily fast. As you can imagine, hash tables are very popular, especially for large databases.

We have seen how to reallocate an array. If the number of entries is within a factor of 3 of the physical size, then hash tables will work well. As the hash table grows, you move it into a larger array. When it shrinks a lot, you move it into a smaller array. Because the index of a key depends on the physical size, each entry's index must be recomputed when it is moved.
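
One possible way to apply the factor-of-3 rule is sketched below. The names chooseSize, rehash and simpleHash are hypothetical, the exact policy is just one reasonable choice, and each chain is a std::vector of keys (items omitted) in place of a linked list, to keep the sketch short.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical hash function, for illustration only.
unsigned int simpleHash(const std::string& s)
{
  unsigned int h = 0;
  for(std::size_t i = 0; i < s.size(); i++)
  {
    h = 31*h + (unsigned char) s[i];
  }
  return h;
}

// One chain of keys per index.
typedef std::vector<std::vector<std::string> > Buckets;

// chooseSize(load, size) returns a physical size to use when the
// table holds 'load' entries and currently has physical size 'size'.
// The size is left alone while load is within a factor of 3 of it.
int chooseSize(int load, int size)
{
  if(load > 3*size || 3*load < size)
  {
    return load > 0 ? load : 1;    // never allow an empty array
  }
  return size;
}

// rehash(old, newSize) moves every key into a new array of physical
// size newSize.  The index of each key is recomputed, since it
// depends on the physical size.  (Assumes newSize > 0.)
Buckets rehash(const Buckets& old, int newSize)
{
  Buckets fresh(newSize);
  for(std::size_t i = 0; i < old.size(); i++)
  {
    for(std::size_t j = 0; j < old[i].size(); j++)
    {
      const std::string& k = old[i][j];
      fresh[simpleHash(k) % (unsigned int) newSize].push_back(k);
    }
  }
  return fresh;
}
```

For example, chooseSize(10, 2) is 10, since 10 entries are more than 3 times the physical size 2, while chooseSize(4, 5) leaves the physical size at 5.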


Exercises

  1. How much time does each operation on a hash table take in the worst case, if the hash table uses chaining?

  2. Can you think of a way of modifying how collisions are handled that will make each operation on a hash table take time O(log(n)) in the worst case, where n is the number of entries? (Hint. Consider binary search trees.)

  3. Suppose the following types are used for a hash table whose keys and values are both null-terminated strings.

      //==========================================
      //               ListCell
      //==========================================
    
      struct ListCell
      {
        const char* key;
        const char* item;
        ListCell*   tail;
    
        ListCell(const char* k, const char* v, ListCell* nx)
        {
          key   = k;
          item  = v;
          tail  = nx;
        }
      };
    
      //==========================================
      //               HashTable
      //==========================================
    
      struct HashTable
      {
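        ListCell** A;      // array of linked lists, one per index
        int        load;   // number of entries in the table
        int        size;   // physical size of array A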
    
        HashTable(int n)
        {
          load = 0;
          size = n;
          A    = new ListCell*[n];
          for(int i = 0; i < n; i++)
          {
            A[i] = NULL;
          }      
        }
      };
    

    Write a definition of lookup(x, T), which returns the item associated with key x in hash table T, or returns NULL if x does not occur in T. Assume that the hash function is strhash.


  4. Using the same types as in the previous exercise, write a definition of insert(x, v, T), which inserts key x with associated item v into hash table T. If there is already an entry for key x, then the item associated with x should be replaced by v.

    If the load becomes more than twice the physical size, then the array should be reallocated so that its new physical size equals the load. Be careful with this step. The index where a given key goes depends on the physical size of the array. If you change the physical size, you need to remember that keys might need to move to different indices.
