Hashing is a fundamental technique in computer science, enabling fast data retrieval, efficient storage, and robust security mechanisms. This tutorial explores hashing from an algorithmic perspective, covering core concepts, popular methods, implementation examples, and best practices for software engineers.
1. Introduction to Hashing
A hash function transforms input data of arbitrary size into a fixed-size string of bytes, typically called a hash code or digest. In algorithm design, hashing is used primarily for hash tables—data structures that provide average‑case O(1) time complexity for insert, search, and delete operations.
1.1 Key Properties of a Good Hash Function
- Deterministic: Same input always yields the same hash.
- Uniform distribution: Keys are spread evenly across the hash space to minimize collisions.
- Fast computation: The function should run in constant time.
- Low collision probability: Different inputs should rarely produce the same hash.
2. Core Hashing Techniques
2.1 Division (Modulo) Method
The division method computes the hash as h(k) = k mod m, where m is the size of the hash table. Choosing a prime number for m reduces clustering.
def division_hash(key, size):
return key % size
int divisionHash(int key, int size) {
return key % size;
}
2.2 Multiplication Method
The multiplication method uses a constant A (0 < A < 1) and computes h(k) = ⌊m * ((k * A) mod 1)⌋. It is less sensitive to the choice of m than the division method.
def multiplication_hash(key, size, A=0.6180339887):
return int(size * ((key * A) % 1))
int multiplicationHash(int key, int size) {
double A = 0.6180339887; // (sqrt(5)-1)/2
double fractional = (key * A) % 1;
return (int)(size * fractional);
}
2.3 Universal Hashing
Universal hashing selects a hash function at random from a family of functions, guaranteeing a low collision probability for any set of keys. A common construction is h(k) = ((a * k + b) mod p) mod m, where p is a prime larger than the maximum key, and a, b are random integers.
import random
def universal_hash(key, size, prime=109345121):
a = random.randint(1, prime-1)
b = random.randint(0, prime-1)
return ((a * key + b) % prime) % size
2.4 Double Hashing (Open Addressing)
Double hashing resolves collisions by probing with a second hash function: h_i(k) = (h1(k) + i * h2(k)) mod m. The second hash h2(k) must never evaluate to zero.
def double_hash(key, i, size, prime=101):
h1 = key % size
h2 = 1 + (key % (prime - 1))
return (h1 + i * h2) % size
3. Collision Resolution Strategies
3.1 Separate Chaining
Each table slot holds a linked list (or dynamic array) of entries that hash to the same index. This method tolerates a load factor > 1 without degrading performance dramatically.
3.2 Open Addressing (Linear Probing, Quadratic Probing, Double Hashing)
All entries reside within the table itself; when a collision occurs, the algorithm probes alternative slots according to a probing sequence.
- Linear probing:
h_i(k) = (h(k) + i) mod m - Quadratic probing:
h_i(k) = (h(k) + c1*i + c2*i^2) mod m - Double hashing: see section 2.4
4. Performance Analysis
The expected cost of a successful search in a hash table with load factor α = n/m is:
| Load Factor (α) | Separate Chaining (Avg. Probes) | Open Addressing – Linear Probing (Avg. Probes) |
|---|---|---|
| 0.5 | 1 + α/2 ≈ 1.25 | ½ (1 + 1/(1-α)) ≈ 1.33 |
| 0.7 | 1 + α/2 ≈ 1.35 | ½ (1 + 1/(1-α)) ≈ 1.71 |
| 0.9 | 1 + α/2 ≈ 1.45 | ½ (1 + 1/(1-α)) ≈ 2.95 |
5. Practical Implementations
5.1 Python – Using Built‑in dict
Python’s dictionary type already implements a highly optimized hash table with open addressing and perturbation. Custom hash functions can be supplied via the __hash__ method of user‑defined classes.
class Person:
def __init__(self, ssn, name):
self.ssn = ssn
self.name = name
def __hash__(self):
# Simple universal hash using built‑in hash
return hash(self.ssn)
def __eq__(self, other):
return isinstance(other, Person) and self.ssn == other.ssn
people = {}
people[Person('123-45-6789', 'Alice')] = 'Engineer'
print(people[Person('123-45-6789', 'Alice')]) # → Engineer
5.2 Java – HashMap Custom Hashing
Java’s HashMap uses the hashCode() method of keys. Overriding this method lets you implement any of the techniques discussed.
public class Employee {
private final long id;
private final String name;
public Employee(long id, String name) {
this.id = id;
this.name = name;
}
@Override
public int hashCode() {
// Example: multiplicative hashing with a prime constant
long A = 0x9e3779b97f4a7c15L; // 2^64 / φ
return (int) ((id * A) >>> 32);
}
@Override
public boolean equals(Object o) {
if (this == o) return true;
if (!(o instanceof Employee)) return false;
Employee e = (Employee) o;
return this.id == e.id;
}
}
Map<Employee, String> payroll = new HashMap<>();
payroll.put(new Employee(1001L, "Bob"), "Analyst");
System.out.println(payroll.get(new Employee(1001L, "Bob")));
6. Security‑Oriented Hashing vs. Data‑Structure Hashing
Cryptographic hash functions (e.g., SHA‑256, BLAKE2) are designed for pre‑image resistance, collision resistance, and avalanche effect. They are far slower than the simple, non‑cryptographic hashes used in hash tables. Never use a cryptographic hash for a hash table unless security is a primary requirement.
7. Best Practices Checklist
- Choose a table size that is a prime number or a power of two with a good secondary hash.
- Keep the load factor ≤ 0.75; trigger rehashing when it exceeds this threshold.
- Prefer universal or multiplicative hashing to mitigate clustering.
- When security matters, switch to a keyed hash like SipHash.
- Profile both insertion and lookup performance; tune the probing strategy accordingly.
Q: What is the difference between a hash function and a hash code?
A: A hash function is the algorithm that processes an input to produce a hash code (or digest). The hash code is the numeric result that the data structure or security protocol actually uses.
Q: When should I use separate chaining versus open addressing?
A: Separate chaining is preferable when the load factor may exceed 1 or when deletions must be cheap. Open addressing saves memory and often offers better cache performance for low to moderate load factors.
Q: Can I reuse a cryptographic hash like SHA‑256 for a hash table?
A: Technically you can, but SHA‑256 is orders of magnitude slower than needed for typical in‑memory hash tables, and it offers no advantage for uniform distribution in that context.
Q. Which of the following ensures that the second hash function in double hashing never returns zero?
- h2(k) = k % m
- h2(k) = 1 + (k % (m‑1))
- h2(k) = (k * A) mod 1
- h2(k) = (a*k + b) mod p
Answer: h2(k) = 1 + (k % (m‑1))
Adding 1 guarantees the result is at least 1, preventing zero steps that would cause infinite loops.
Q. If a hash table has size 101 and uses the division method, what is the hash of key 12345?
- 12345
- 57
- 34
- 0
Answer: 57
12345 mod 101 = 57.