How does the implementation of hash tables work?

Question

How does the implementation of hash tables work?

The concept of hashes (or tables hash) is already embedded in various programming languages such as Javascript, Python, Ruby, etc., and is known to be much faster than, for example, chained lists. It associates key-value pairs like this:

meuHash = []
meuHash["minha chave"] = "meu valor"

My question is: how are they implemented? That is, what happens behind when we include or fetch a value in a hash?

Just Know Your implementation is different of chained structures like stacks, queues, lists, binary trees, etc.

11

hashmap característica-linguagem

Author: utluiz, 2014-08-05

Source

2 answers

A table hash is a structure that allows you to associate a key with a value and subsequently have access to the value from its associated key. The key is transformed by a function into a position in the table; always using this transformation, the location of keys in the table is performed quickly.

In Java, this data structure is implemented by objects of the Hashtable class. To manipulate elements in the hash table, the put () methods are used, which stores a pair of objects specified in the table, get (), which returns the value object associated with the specified key object, and remove (), which removes the pair of objects with the specified key. You can also query whether a certain key exists in the table, with the containsKey () method, or whether a certain value is present in the table associated with any key, with the contains () method. The number of pairs of elements in the table can be obtained with the size () method.

0

Author: JOHNYE, 2014-08-05 13:13:43

score 16 · Accepted Answer

Every thing is one thing

First of all, it is important to understand that each type of data structure has a more specific application. So to say that hashes are faster than chained lists is to compare oranges with bananas.

A" table " is quite efficient at retrieving an element contained in it from a key. This means it's quick to find a needle in the haystack if you have a shape ( hash ) of identify the needle.

However, traversing an integer table is generally less efficient than traversing an integer list.

Also, only a vector-based list allows you to access random elements efficiently.

Hashes

Are numerical values calculated by a function from an original value of arbitrary size. There are several hash algorithms, the best of which are the ones that emit the lowest possible number of repeated hashes .

For example, the Java class String has the following method to generate the hash code :

public int More ...hashCode() {
    int h = hash;
    if (h == 0) {
        int off = offset;
        char val[] = value;
        int len = count;

        for (int i = 0; i < len; i++) {
            h = 31*h + val[off++];
        }
        hash = h;
    }
    return h;
}

The method documentation states that the hash is calculated as follows:

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

Where:

s[i]: i-th character of the String.
m: string size, in characters

In short, generating a hash is like generating a numeric ID for the object.

In Java, any object can have its own implementation of hashCode. This implementation should consider all attributes that identify the object. It is important that different objects have different hashes as much as possible.

Tables of hashes

The number generated by a function hash can be used to easily locate an object in a kind of "table".

Internally, in Java for example, this table is a vector. Each position of the vector is assigned a range of numbers. It would be like slots of a shelf.

Storing values

Let's assume that we create a Hashtable and the inner Vector has 10 positions ( slots). Suppose further that we add objects and report keys values of hash range from 1 to 100. Thus, hashes that return values from 1 to 10 will be placed in the first slot, those that return values from 11 to 20 in the second slot and so on.

If we add an object (1) whose hash of the key is 20 it will be placed in the second slot. If we add another object (2) with key of hash 45 it will be placed fifth slot. If we add another object (3) with a key of hash 15, it will also be placed in the second slot. We would have the following data structure:

Slot 1: 
Slot 2: Objeto 1, Objeto 3
Slot 3: 
Slot 4: 
Slot 5: Objeto 2 
Slot 6: 
Slot 7: 
Slot 8: 
Slot 9: 
Slot 10:

Note that each table item is also a vector type. In Java, in class Hashtable, this is implemented through a linked list of objects of type Entry (input).

Retrieving values

Then, to retrieve an object, just pass the key to the table, it will compute the hash and go directly to the slot to retrieve the item.

Continuing the example above, if I wanted to retrieve Object 3, I would have to pass the key to it. The table would calculate the hash code of the key, which would return the value 15. hashes always return the same value for the same key. So with a simple Account I know that the item will be found in the second slot . However, since we still have a list of objects, the table will have to go through the list of entries in the slot 2 and check, one by one, What is actually Object 3.

Implications

The above explanations have several implications:

if there is a lot of collision of hashes, many objects will end up falling into a few slots. if there is a bad distribution, each time it is necessary to retrieve an item the efficiency will be bad, since it will be necessary to go through the list of elements of the slot and compare one by one.
the internal size of the table needs to be well calculated, so that there are not many items in each slot, nor many empty slots. In Java implementation, when the slots start to get full the Hashtable increases the number of slots and makes a redistribution.
in Java, the method that generates the hash can be called multiple times, so it's important to take care of the performance and cache of the value when possible.
depending on the implementation of the hash table , including too many items repeatedly can be quite inefficient, as there will be repeated pauses to add new ones slots and redistribute the elements.

Different types of implementations

A large part of what has been described above is just one of the possible types of implementation of using hashes to retrieve values.

For example, in Java, there is the TreeMap , a "table" whose hashes of entries are not stored in a vector, but in a binary tree.

There are several implications to this. Depending on the balancing the tree can be more or less efficient to retrieve elements. Comparing a balanced tree with a vector implementation with filled slots the tree would win.

However, I already made some implementations that needed to load very large trees in memory and TreeMap was much more efficient.