Hashing Techniques in Algorithms

Hashing is a fundamental technique in computer science, enabling fast data retrieval, efficient storage, and robust security mechanisms. This tutorial explores hashing from an algorithmic perspective, covering core concepts, popular methods, implementation examples, and best practices for software engineers.

1. Introduction to Hashing

A hash function transforms input data of arbitrary size into a fixed-size string of bytes, typically called a hash code or digest. In algorithm design, hashing is used primarily for hash tables—data structures that provide average‑case O(1) time complexity for insert, search, and delete operations.

1.1 Key Properties of a Good Hash Function

  • Deterministic: Same input always yields the same hash.
  • Uniform distribution: Keys are spread evenly across the hash space to minimize collisions.
  • Fast computation: The function should run in constant time.
  • Low collision probability: Different inputs should rarely produce the same hash.

2. Core Hashing Techniques

2.1 Division (Modulo) Method

The division method computes the hash as h(k) = k mod m, where m is the size of the hash table. Choosing a prime number for m reduces clustering.

def division_hash(key, size):
    return key % size
int divisionHash(int key, int size) {
    return key % size;
}

2.2 Multiplication Method

The multiplication method uses a constant A (0 < A < 1) and computes h(k) = ⌊m * ((k * A) mod 1)⌋. It is less sensitive to the choice of m than the division method.

def multiplication_hash(key, size, A=0.6180339887):
    return int(size * ((key * A) % 1))
int multiplicationHash(int key, int size) {
    double A = 0.6180339887; // (sqrt(5)-1)/2
    double fractional = (key * A) % 1;
    return (int)(size * fractional);
}

2.3 Universal Hashing

Universal hashing selects a hash function at random from a family of functions, guaranteeing a low collision probability for any set of keys. A common construction is h(k) = ((a * k + b) mod p) mod m, where p is a prime larger than the maximum key, and a, b are random integers.

import random

def universal_hash(key, size, prime=109345121):
    a = random.randint(1, prime-1)
    b = random.randint(0, prime-1)
    return ((a * key + b) % prime) % size

2.4 Double Hashing (Open Addressing)

Double hashing resolves collisions by probing with a second hash function: h_i(k) = (h1(k) + i * h2(k)) mod m. The second hash h2(k) must never evaluate to zero.

def double_hash(key, i, size, prime=101):
    h1 = key % size
    h2 = 1 + (key % (prime - 1))
    return (h1 + i * h2) % size

3. Collision Resolution Strategies

3.1 Separate Chaining

Each table slot holds a linked list (or dynamic array) of entries that hash to the same index. This method tolerates a load factor > 1 without degrading performance dramatically.

3.2 Open Addressing (Linear Probing, Quadratic Probing, Double Hashing)

All entries reside within the table itself; when a collision occurs, the algorithm probes alternative slots according to a probing sequence.

  • Linear probing: h_i(k) = (h(k) + i) mod m
  • Quadratic probing: h_i(k) = (h(k) + c1*i + c2*i^2) mod m
  • Double hashing: see section 2.4
⚠ Warning: Open addressing suffers from primary clustering (linear probing) and secondary clustering (quadratic probing). Choose the probing method carefully based on expected load factor.

4. Performance Analysis

The expected cost of a successful search in a hash table with load factor α = n/m is:

Load Factor (α)Separate Chaining (Avg. Probes)Open Addressing – Linear Probing (Avg. Probes)
0.51 + α/2 ≈ 1.25½ (1 + 1/(1-α)) ≈ 1.33
0.71 + α/2 ≈ 1.35½ (1 + 1/(1-α)) ≈ 1.71
0.91 + α/2 ≈ 1.45½ (1 + 1/(1-α)) ≈ 2.95
📝 Note: When the load factor exceeds 0.7, rehashing (expanding the table and recomputing hashes) is recommended to maintain constant‑time operations.

5. Practical Implementations

5.1 Python – Using Built‑in dict

Python’s dictionary type already implements a highly optimized hash table with open addressing and perturbation. Custom hash functions can be supplied via the __hash__ method of user‑defined classes.

class Person:
    def __init__(self, ssn, name):
        self.ssn = ssn
        self.name = name
    def __hash__(self):
        # Simple universal hash using built‑in hash
        return hash(self.ssn)
    def __eq__(self, other):
        return isinstance(other, Person) and self.ssn == other.ssn

people = {}
people[Person('123-45-6789', 'Alice')] = 'Engineer'
print(people[Person('123-45-6789', 'Alice')])  # → Engineer

5.2 Java – HashMap Custom Hashing

Java’s HashMap uses the hashCode() method of keys. Overriding this method lets you implement any of the techniques discussed.

public class Employee {
    private final long id;
    private final String name;

    public Employee(long id, String name) {
        this.id = id;
        this.name = name;
    }

    @Override
    public int hashCode() {
        // Example: multiplicative hashing with a prime constant
        long A = 0x9e3779b97f4a7c15L; // 2^64 / φ
        return (int) ((id * A) >>> 32);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Employee)) return false;
        Employee e = (Employee) o;
        return this.id == e.id;
    }
}

Map<Employee, String> payroll = new HashMap<>();
payroll.put(new Employee(1001L, "Bob"), "Analyst");
System.out.println(payroll.get(new Employee(1001L, "Bob")));

6. Security‑Oriented Hashing vs. Data‑Structure Hashing

Cryptographic hash functions (e.g., SHA‑256, BLAKE2) are designed for pre‑image resistance, collision resistance, and avalanche effect. They are far slower than the simple, non‑cryptographic hashes used in hash tables. Never use a cryptographic hash for a hash table unless security is a primary requirement.

💡 Tip: If you need both fast lookup and resistance to adversarial inputs (e.g., in a web server), consider using a SipHash implementation for the table’s hash function.

7. Best Practices Checklist

  1. Choose a table size that is a prime number or a power of two with a good secondary hash.
  2. Keep the load factor ≤ 0.75; trigger rehashing when it exceeds this threshold.
  3. Prefer universal or multiplicative hashing to mitigate clustering.
  4. When security matters, switch to a keyed hash like SipHash.
  5. Profile both insertion and lookup performance; tune the probing strategy accordingly.
📘 Summary: Hashing provides O(1) average‑case performance for many core operations in software engineering. By selecting an appropriate hash function, handling collisions wisely, and maintaining the load factor, developers can build fast, reliable data structures while avoiding common pitfalls such as clustering and security vulnerabilities.

Q: What is the difference between a hash function and a hash code?
A: A hash function is the algorithm that processes an input to produce a hash code (or digest). The hash code is the numeric result that the data structure or security protocol actually uses.


Q: When should I use separate chaining versus open addressing?
A: Separate chaining is preferable when the load factor may exceed 1 or when deletions must be cheap. Open addressing saves memory and often offers better cache performance for low to moderate load factors.


Q: Can I reuse a cryptographic hash like SHA‑256 for a hash table?
A: Technically you can, but SHA‑256 is orders of magnitude slower than needed for typical in‑memory hash tables, and it offers no advantage for uniform distribution in that context.


Q. Which of the following ensures that the second hash function in double hashing never returns zero?
  • h2(k) = k % m
  • h2(k) = 1 + (k % (m‑1))
  • h2(k) = (k * A) mod 1
  • h2(k) = (a*k + b) mod p

Answer: h2(k) = 1 + (k % (m‑1))
Adding 1 guarantees the result is at least 1, preventing zero steps that would cause infinite loops.

Q. If a hash table has size 101 and uses the division method, what is the hash of key 12345?
  • 12345
  • 57
  • 34
  • 0

Answer: 57
12345 mod 101 = 57.

Further reading on universal hashing

References