Hash Tables

I have a closed hash table (open addressing) with linear probing that stores the word dictionary of a text.

The input operation works like this:

1) The program asks for the first line of the text (limited to 100 characters); at this point I split the line and add its words to the hash table. It then asks for the 2nd line, and keeps asking for lines until it finds a line containing $STOP.

2) After $STOP, the program asks for words (strings) to be searched in the hash table; the search result is the number(s) of the line(s) where that word occurs in the text.

// EXAMPLE:
$START
o mundo era bonito demais
o mundo jah foi muito belo 
$STOP
mundo
bonito
// returns:
mundo: 1 2
bonito: 1
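For reference, the line-reading phase described above could be sketched like this in C (a minimal sketch; `record_word` is a hypothetical stand-in for the actual hash-table insertion, and the `$START`/`$STOP` handling follows the example):

```c
#include <stdio.h>
#include <string.h>

/* Placeholder for the real hash-table insert: records that `word`
 * appeared on line `line`. Here it just prints the pair. */
static void record_word(const char *word, int line) {
    printf("%s -> line %d\n", word, line);
}

/* Reads lines until "$STOP", splits each into words and records them.
 * Returns the total number of words recorded. */
static int read_text(FILE *in) {
    char line[101];                 /* 100 characters + terminator */
    int lineno = 0, total = 0;
    while (fgets(line, sizeof line, in)) {
        line[strcspn(line, "\n")] = '\0';   /* strip the newline */
        if (strcmp(line, "$STOP") == 0) break;
        if (strcmp(line, "$START") == 0) continue;
        lineno++;
        for (char *w = strtok(line, " \t"); w; w = strtok(NULL, " \t")) {
            record_word(w, lineno);
            total++;
        }
    }
    return total;
}
```

On the example input, the first text line after `$START` is line 1, so `mundo` would be recorded for lines 1 and 2, matching the expected output.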

Considering that the number of keys can vary from 1 to N (10000, for example) distinct words, how should I manage the size of my table? Should I declare a table vector[10001]?

Won't that waste too much memory?

How should I deal with this sizing problem in a closed hash table with open addressing and linear probing?

Author: Maniero, 2014-10-27

1 answer

One very important thing to keep in mind when using hash tables, especially with open addressing, is the load factor: the ratio between the number of distinct keys stored and the maximum capacity of the table. If the load factor is too low, you are wasting memory on an overly sparse table. On the other hand, if the load factor is too high, you become very prone to collisions of the hash function and to clusters in your table (long runs of occupied slots, which mean equally long linear probe sequences...). For example, imagine what happens with a table of capacity 10001 holding 10000 keys, like the one you proposed in the question: you will have 10000 consecutive occupied slots and only one free slot left. When trying to insert the 10001st element, a collision of the hash function is almost certain, and on average you will have to probe about 5000 slots until you reach the empty bucket!

Because of all this, ideally your load factor should be neither too large nor too small. Wikipedia says that if you are using linear probing it is good to avoid load factors above 70% or 80% (but I did not find the source for that specific number). Of course, all of this depends on the hash function you are using and on the text you use as input. I suggest you run tests with different capacities and look at the results.

Other than that, if you do not know the required capacity of the table a priori, one possibility is to resize the table dynamically: every time the load factor goes past a preset limit, you allocate a new table with double the capacity¹ and transfer all keys from the old table to the new one.


¹ What matters here is to multiply the capacity by a constant greater than 1 (growing the capacity exponentially) rather than adding a fixed number of elements (growing the capacity linearly). This makes the amortized asymptotic cost of the resizes constant. The higher the growth factor, the less time you spend on resizes, but in return your tables will waste more memory.
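A quick way to see the difference is to count the total number of key moves (re-insertions) each growth policy performs while the table grows to hold n keys. This is a toy calculation, assuming an initial capacity of 8 and a +100 step for the linear policy:

```c
/* Total key moves to grow from capacity 8 to at least n keys,
 * doubling the capacity on every resize. */
static long moves_doubling(long n) {
    long cap = 8, moves = 0;
    while (cap < n) { moves += cap; cap *= 2; }  /* copy all keys, then double */
    return moves;
}

/* Same, but adding a fixed 100 slots per resize. */
static long moves_fixed(long n) {
    long cap = 8, moves = 0;
    while (cap < n) { moves += cap; cap += 100; }
    return moves;
}
```

For n = 10000, doubling performs fewer than 2n total moves (constant amortized cost per insertion), while the fixed +100 step performs on the order of n²/100 moves, hundreds of times more.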

Author: hugomg, 2014-10-27 16:50:37