41B. Hash Functions

Our goal is to find an implementation of tables that takes less time than height-balanced binary search trees, in the average case. The starting point is the idea of a hash function, which takes a key and yields a nonnegative integer. For example, if a key is a string and h is a hash function then you might find that h("frog") = 3144099.

Suppose that you choose a number from 0 to n−1 at random, each value equally likely. Then you do the same thing again, independently of the first choice, yielding a second number. The probability of choosing the same value twice is 1/n.
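To see where 1/n comes from, you can count outcomes directly: of the n × n equally likely pairs of choices, exactly n have both choices equal. The following is a small sketch of that count. The function name count_matching_pairs is made up for this illustration.

```c
#include <stddef.h>

/* Count the ordered pairs (a, b), with 0 <= a, b < n, for which a == b.
   Hypothetical helper, used only to illustrate the 1/n probability. */
size_t count_matching_pairs(size_t n)
{
  size_t matches = 0;

  for(size_t a = 0; a < n; a++)
  {
    for(size_t b = 0; b < n; b++)
    {
      if(a == b)
      {
        matches++;
      }
    }
  }
  return matches;
}
```

For n = 10 there are 100 pairs, of which 10 match, giving 10/100 = 1/10.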

Ideally, hash functions should behave similarly. Since a hash function can yield any nonnegative integer, we take h(k) mod n to be sure there are only n possible values. For two different keys k1 and k2, the probability that h(k1) mod n = h(k2) mod n should be close to 1/n, as if the two values had been chosen at random.
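As a minimal sketch, the reduction to n possible values might look like the following. The name reduce_hash and its parameter types are assumptions made for this illustration, and n is assumed to be positive.

```c
#include <stddef.h>

/* Reduce a hash value to one of the n values 0, ..., n-1.
   Hypothetical helper; n must be positive. */
size_t reduce_hash(unsigned long hash_value, size_t n)
{
  return (size_t)(hash_value % n);
}
```

With the earlier example value 3144099 and n = 1000, reduce_hash yields 99.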

But a hash function is not really random at all. If you find that h("frog") = 3144099, then the next time you compute h("frog"), you will get the same answer, 3144099. No choice of hash function can always behave in a perfectly random way. But some hash functions do very well in practice at approximating what a truly random function would do.


Hash functions for strings

Strings are among the most common kinds of keys, so let's think about a hash function for strings. One idea is to get the integer values of the characters in the string and to add them up. For example, 'c' = 99, 'a' = 97 and 't' = 116, so this hash function would yield 99 + 97 + 116 = 312 for "cat". That is a simple hash function, but it is not very good. For example, it yields the same value for "act" and "sad" as for "cat".
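The add-them-up idea can be sketched as follows, assuming the ASCII character codes used in the example above. The name sum_hash is made up for this illustration.

```c
/* Add up the character codes of a null-terminated string.
   Hypothetical helper; it shows why this simple hash function
   cannot tell rearranged strings apart. */
int sum_hash(const char* str)
{
  int total = 0;

  for(const char* p = str; *p != '\0'; p++)
  {
    /* Go through unsigned char so every code is nonnegative. */
    total += (unsigned char)(*p);
  }
  return total;
}
```

Because addition ignores order, sum_hash gives 312 for "cat" and for "act", and "sad" happens to add up to 312 as well.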

A more sophisticated hash function for strings should try to keep more information about the characters in the string. It should certainly depend on the order of the characters. If only letters are allowed in a key string then you can think of the key as representing a number in base 26, each character being a digit.
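Here is one way the base 26 idea might be sketched. It assumes that keys contain only lowercase letters, that 'a' stands for digit 0 and 'z' for digit 25 (one possible choice of digits), that the letters have consecutive codes as in ASCII, and that the key is short enough for the result to fit in 64 bits. The name base26_value is made up for this illustration.

```c
#include <stdint.h>

/* Treat a string of lowercase letters as a number in base 26,
   with 'a' = 0, ..., 'z' = 25.  Hypothetical helper; it does not
   check for other characters or for overflow. */
uint64_t base26_value(const char* str)
{
  uint64_t value = 0;

  for(const char* p = str; *p != '\0'; p++)
  {
    value = value * 26 + (uint64_t)(*p - 'a');
  }
  return value;
}
```

Now order matters: "cat" is 2·26² + 0·26 + 19 = 1371, but "act" is 0·26² + 2·26 + 19 = 71. With this choice of digits a leading 'a' acts like a leading zero, so "at" and "t" still get the same value.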

But then the hash values can get very large, and that is impractical. Taking into account upper and lower case letters, digits and special symbols, there might be about 80 possible characters instead of 26, making a key represent a number in base 80, which makes the hash values even larger.
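To get a feel for how quickly these numbers grow, the sketch below counts how many digits of a given base are guaranteed to fit in an unsigned 64-bit integer. The name max_digits is made up for this illustration, and base is assumed to be at least 2.

```c
#include <stdint.h>

/* Return the largest L for which base^L <= UINT64_MAX.  Every number
   with L digits in that base is less than base^L, so it fits in 64 bits.
   Hypothetical helper; base must be at least 2. */
int max_digits(uint64_t base)
{
  uint64_t power  = 1;
  int      digits = 0;

  /* power <= UINT64_MAX / base exactly when power * base does not
     overflow. */
  while(power <= UINT64_MAX / base)
  {
    power *= base;
    digits++;
  }
  return digits;
}
```

By this count, a 64-bit integer is guaranteed to hold keys of up to 13 characters in base 26, and only up to 10 characters in base 80.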

One way to reduce the size of the number is to look only at the first or last 4 characters. But then the hash function does not depend on all of the characters in the string, and it is not a good hash function.
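The weakness is easy to see in a sketch. The function below looks only at the first 4 characters, packing them as digits in base 256. That base and the name prefix4_hash are assumptions made for this illustration.

```c
#include <stdint.h>

/* Hash only the first 4 characters (or fewer, if the string is
   shorter), treating each as a digit in base 256.
   Hypothetical helper. */
uint32_t prefix4_hash(const char* str)
{
  uint32_t h = 0;

  for(int i = 0; i < 4 && str[i] != '\0'; i++)
  {
    h = (h << 8) | (unsigned char)str[i];
  }
  return h;
}
```

Any two keys that begin with the same 4 characters, such as "hashing" and "hashtable", get the same hash value.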


A practical hash function for strings

The following hash function is a practical one based partly on the idea of treating a string like a number. Its result is stored in a 32-bit integer, but the top 4 bits are always cleared, so there are 2^28 (about 268 million) possible answers. As it sees each new character, it shifts the result to the left by 4 bits and adds in the new character, so it appears to treat each character as a digit in base 16. But this function does not look only at the low order 4 bits of a character: it adds in the entire character, so the 'digits' of this base 16 number can range from 0 to 255, and adjacent digits overlap.

But there is an important modification of the basic idea. Any bits that reach the top 4 positions, and so would soon be shifted off the left end, are brought back toward the right end (shifted right by 24 bits) and combined with the result by exclusive-or, so that the result depends on all of the characters. This hash function is used in ELF executable and object files.

  //=============================================
  //              strhash
  //=============================================
  // strhash is a hash function for strings.
  // Parameter str is a null-terminated string.
  //=============================================

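  int strhash(const char* str)
  {
    const char*  p;
    unsigned int g;
    unsigned int h = 0;

    for(p = str; *p != '\0'; p++)
    {
      // Convert through unsigned char so that characters above 127
      // do not sign-extend. Unsigned arithmetic also keeps the
      // shifts well defined.
      h = (h << 4) + (unsigned char)(*p);

      // g holds any bits that have reached the top 4 positions.
      g = h & 0xF0000000u;
      if(g != 0)
      {
        h = h ^ (g >> 24);
      }
      h = h & ~g;
    }

    // The top 4 bits of h are clear, so h fits in a nonnegative int.
    return (int) h;
  }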
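
To watch the wrap-around happen, it can help to isolate one trip through the loop. The sketch below does that using unsigned 32-bit arithmetic. The name strhash_step is made up for this illustration and is not part of any ELF interface.

```c
#include <stdint.h>

/* Perform one iteration of the loop in strhash: combine the hash
   value so far (h) with the next character (c).
   Hypothetical helper, written with unsigned 32-bit arithmetic. */
uint32_t strhash_step(uint32_t h, unsigned char c)
{
  uint32_t g;

  h = (h << 4) + c;        /* shift left 4 bits, add the character */
  g = h & 0xF0000000u;     /* bits that reached the top 4 positions */
  if(g != 0)
  {
    h = h ^ (g >> 24);     /* fold them back into bits 4 to 7 */
  }
  h = h & ~g;              /* clear the top 4 bits */
  return h;
}
```

Starting from 0 and feeding in 'c', 'a' and 't' gives 99, then 1681, then 27012; no bits reach the top, so nothing is folded back. Starting instead from 0x0FFFFFFF with a character code of 0, the shift produces 0xFFFFFFF0, the top 4 bits are folded into bits 4 to 7, and the step returns 0x0FFFFF00.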