String Algorithms

String algorithms form the backbone of many software engineering challenges, ranging from simple search functions to sophisticated text analysis, bioinformatics, and natural language processing. This tutorial provides a deep dive into the most essential string algorithms, their theoretical foundations, practical implementations, and performance considerations.

Fundamentals of String Algorithms

Before exploring specific algorithms, it is crucial to understand how strings are represented in memory, common operations on strings, and the constraints imposed by character encodings.

String Representation

In most programming languages, a string is stored as a contiguous array of character (or byte) values. The choice of encoding (ASCII, UTF‑8, UTF‑16, etc.) directly affects algorithm design, especially when multi‑byte characters are involved.

📝 Note: When working with Unicode strings, always use language‑provided libraries that correctly handle grapheme clusters; otherwise, index‑based algorithms may produce incorrect results.
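
As a quick illustration of why encoding matters, the following snippet shows how byte length, code‑point length, and perceived character count can differ for the same text:

# Python: code points vs. bytes vs. grapheme clusters

s = "café"
print(len(s))                   # 4 code points
print(len(s.encode("utf-8")))   # 5 bytes: 'é' takes two bytes in UTF-8

t = "e\u0301"                   # 'e' + combining acute accent
print(len(t))                   # 2 code points, rendered as a single grapheme 'é'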

Common Operations

  • Length retrieval (O(1) for most languages)
  • Character access by index (O(1) for array‑based strings, O(n) for linked‑list representations)
  • Substring extraction (often O(k) where k is the length of the substring)
  • Concatenation (O(m+n) unless the language uses rope data structures; see the sketch after this list)
  • Comparison (lexicographic, O(min(m,n)))
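
As a small illustration of the concatenation cost noted above, this sketch contrasts repeated += concatenation, where each step may copy the entire intermediate result, with str.join, which sizes the output once:

# Python: repeated concatenation vs. join

def concat_quadratic(parts):
    result = ""
    for p in parts:
        result += p        # may copy the whole intermediate string each time
    return result

def concat_linear(parts):
    return "".join(parts)  # total work proportional to the final length

parts = ["ab"] * 10_000
assert concat_quadratic(parts) == concat_linear(parts)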

Pattern Matching Algorithms

Pattern matching is the process of finding occurrences of a "pattern" string P within a larger "text" string T. The naïve approach checks each possible alignment, leading to O(|T|·|P|) time. Advanced algorithms improve this bound dramatically.

Naïve Search

# Python implementation of naïve string search

def naive_search(text, pattern):
    n, m = len(text), len(pattern)
    occurrences = []
    for i in range(n - m + 1):
        match = True
        for j in range(m):
            if text[i + j] != pattern[j]:
                match = False
                break
        if match:
            occurrences.append(i)
    return occurrences

// C++ naïve search (O(n*m))
#include <vector>
#include <string>

std::vector<int> naiveSearch(const std::string &text, const std::string &pattern) {
    std::vector<int> occ;
    int n = text.size(), m = pattern.size();
    for (int i = 0; i <= n - m; ++i) {
        int j = 0;
        for (; j < m; ++j) {
            if (text[i + j] != pattern[j]) break;
        }
        if (j == m) occ.push_back(i);
    }
    return occ;
}

⚠ Warning: The naïve algorithm has a worst‑case time complexity of O(|T|·|P|), which becomes prohibitive for large inputs or repetitive patterns.
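
To see the quadratic behaviour concretely, consider inputs that agree on almost every character: each of the roughly |T| alignments performs close to |P| comparisons before failing. Using naive_search from above:

# Pathological input: every alignment matches m-1 characters, then fails

text = "a" * 10_000
pattern = "a" * 99 + "b"
print(naive_search(text, pattern))  # [] (roughly 10_000 * 100 comparisons)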

Knuth‑Morris‑Pratt (KMP) Algorithm

KMP achieves linear time O(|T| + |P|) by pre‑computing a longest‑prefix‑suffix (LPS) array that allows the search to skip unnecessary character comparisons.

# Python KMP implementation

def compute_lps(pattern):
    m = len(pattern)
    lps = [0] * m
    length = 0  # length of the previous longest prefix suffix
    i = 1
    while i < m:
        if pattern[i] == pattern[length]:
            length += 1
            lps[i] = length
            i += 1
        else:
            if length != 0:
                length = lps[length - 1]
            else:
                lps[i] = 0
                i += 1
    return lps

def kmp_search(text, pattern):
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    lps = compute_lps(pattern)
    i = j = 0
    occ = []
    while i < n:
        if pattern[j] == text[i]:
            i += 1
            j += 1
        if j == m:
            occ.append(i - j)
            j = lps[j - 1]
        elif i < n and pattern[j] != text[i]:
            if j != 0:
                j = lps[j - 1]
            else:
                i += 1
    return occ

// Java KMP implementation
import java.util.ArrayList;
import java.util.List;

public class KMP {
    private static int[] computeLPS(String pat) {
        int m = pat.length();
        int[] lps = new int[m];
        int len = 0, i = 1;
        while (i < m) {
            if (pat.charAt(i) == pat.charAt(len)) {
                len++;
                lps[i] = len;
                i++;
            } else {
                if (len != 0) {
                    len = lps[len - 1];
                } else {
                    lps[i] = 0;
                    i++;
                }
            }
        }
        return lps;
    }

    public static List<Integer> search(String txt, String pat) {
        int n = txt.length(), m = pat.length();
        List<Integer> occ = new ArrayList<>();
        if (m == 0 || m > n) return occ;
        int[] lps = computeLPS(pat);
        int i = 0, j = 0;
        while (i < n) {
            if (pat.charAt(j) == txt.charAt(i)) {
                i++; j++;
            }
            if (j == m) {
                occ.add(i - j);
                j = lps[j - 1];
            } else if (i < n && pat.charAt(j) != txt.charAt(i)) {
                if (j != 0) j = lps[j - 1];
                else i++;
            }
        }
        return occ;
    }
}

💡 Tip: Re‑using the LPS array for multiple searches of the same pattern on different texts saves preprocessing time.
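
One way to realize this reuse, sketched below, is to wrap the pattern and its precomputed LPS array in a small matcher object; the KMPMatcher name is illustrative, and the search loop simply mirrors kmp_search above:

# Python sketch: precompute the LPS array once, reuse it across texts

class KMPMatcher:
    def __init__(self, pattern):
        self.pattern = pattern
        self.lps = compute_lps(pattern)  # paid once per pattern

    def search(self, text):
        n, m = len(text), len(self.pattern)
        i = j = 0
        occ = []
        while i < n:
            if self.pattern[j] == text[i]:
                i += 1
                j += 1
            if j == m:
                occ.append(i - j)
                j = self.lps[j - 1]
            elif i < n and self.pattern[j] != text[i]:
                if j != 0:
                    j = self.lps[j - 1]
                else:
                    i += 1
        return occ

matcher = KMPMatcher("abab")
print(matcher.search("ababab"))    # [0, 2]
print(matcher.search("abacabab"))  # [4]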

Rabin‑Karp Algorithm

Rabin‑Karp uses a rolling hash to compare the pattern's hash with the hash of each length‑|P| window of the text. It runs in O(|T| + |P|) on average, but the worst case degrades to O(|T|·|P|) when hash collisions force repeated character‑by‑character verification.

# Python Rabin‑Karp (modular arithmetic)

def rabin_karp(text, pattern, base=256, mod=101):
    n, m = len(text), len(pattern)
    if m > n:
        return []
    h = pow(base, m - 1, mod)  # highest power of base
    pat_hash = txt_hash = 0
    for i in range(m):
        pat_hash = (base * pat_hash + ord(pattern[i])) % mod
        txt_hash = (base * txt_hash + ord(text[i])) % mod
    occ = []
    for i in range(n - m + 1):
        if pat_hash == txt_hash:
            if text[i:i+m] == pattern:
                occ.append(i)
        if i < n - m:
            txt_hash = (txt_hash - ord(text[i]) * h) % mod
            txt_hash = (txt_hash * base + ord(text[i+m])) % mod
            txt_hash = (txt_hash + mod) % mod  # ensure positive
    return occ

⚠ Warning: A small modulus (e.g., 101) keeps intermediate values small but makes hash collisions frequent; for production use, choose a large prime modulus or a double‑hashing scheme.
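
A minimal sketch of the double‑hashing idea follows; the moduli below are illustrative large primes, not prescribed constants. A window is considered a candidate only when both hashes agree, which makes spurious collisions far less likely, and a direct string comparison is kept as a final safety net:

# Python sketch: Rabin‑Karp with two independent hash moduli

MOD1, MOD2 = 1_000_000_007, 998_244_353  # illustrative large primes
BASE = 256

def rabin_karp_double(text, pattern):
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h1, h2 = pow(BASE, m - 1, MOD1), pow(BASE, m - 1, MOD2)
    p1 = p2 = t1 = t2 = 0
    for i in range(m):
        p1 = (BASE * p1 + ord(pattern[i])) % MOD1
        p2 = (BASE * p2 + ord(pattern[i])) % MOD2
        t1 = (BASE * t1 + ord(text[i])) % MOD1
        t2 = (BASE * t2 + ord(text[i])) % MOD2
    occ = []
    for i in range(n - m + 1):
        if (p1, p2) == (t1, t2) and text[i:i+m] == pattern:
            occ.append(i)
        if i < n - m:  # roll both hashes forward by one character
            t1 = ((t1 - ord(text[i]) * h1) * BASE + ord(text[i + m])) % MOD1
            t2 = ((t2 - ord(text[i]) * h2) * BASE + ord(text[i + m])) % MOD2
    return occ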

Z‑Algorithm

The Z‑algorithm computes an array Z[i] that stores the length of the longest substring starting at i which is also a prefix of the string. For example, for s = "aabxaab" the array is Z = [0, 1, 0, 0, 3, 1, 0]. It runs in linear time and is useful for pattern matching, substring counting, and finding repetitions.

// C++ Z‑algorithm implementation
#include <vector>
#include <string>

std::vector<int> ZAlgorithm(const std::string &s) {
    int n = s.size();
    std::vector<int> Z(n, 0);
    int L = 0, R = 0;
    for (int i = 1; i < n; ++i) {
        if (i <= R)
            Z[i] = std::min(R - i + 1, Z[i - L]);
        while (i + Z[i] < n && s[Z[i]] == s[i + Z[i]])
            ++Z[i];
        if (i + Z[i] - 1 > R) {
            L = i;
            R = i + Z[i] - 1;
        }
    }
    return Z;
}

// Pattern search using Z‑algorithm
std::vector<int> zSearch(const std::string &text, const std::string &pat) {
    std::string concat = pat + "$" + text; // $ is a sentinel not appearing in either string
    std::vector<int> Z = ZAlgorithm(concat);
    std::vector<int> occ;
    int m = pat.size();
    for (size_t i = m + 1; i < Z.size(); ++i) {
        if (Z[i] == m) occ.push_back(i - m - 1);
    }
    return occ;
}

Advanced Data Structures for String Processing

Beyond single‑pattern searches, many applications require preprocessing the entire text to answer multiple queries efficiently. Tries, suffix arrays, and suffix trees are the primary data structures for such tasks.

Trie (Prefix Tree)

A trie stores a set of strings in a tree where each edge represents a character. It enables O(k) lookup and insertion for a string of length k, independent of the total number of strings.

# Python Trie implementation using dictionaries
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_end = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True

    def search(self, word):
        node = self.root
        for ch in word:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return node.is_end

    def starts_with(self, prefix):
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return True
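
A short usage example for the Trie above:

# Using the Trie defined above

trie = Trie()
for word in ["car", "card", "care"]:
    trie.insert(word)

print(trie.search("car"))       # True: stored as a complete word
print(trie.search("ca"))        # False: only a prefix, not a stored word
print(trie.starts_with("ca"))   # True: prefix of "car", "card", "care"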

Suffix Array

A suffix array is a sorted array of all suffixes of a text. Construction can be achieved in O(n log n) using the prefix‑doubling method or in O(n) with specialized algorithms (e.g., SA‑IS); a prefix‑doubling sketch follows the comparison table below.

Algorithm          Time Complexity   Space Complexity   Typical Use‑Case
Prefix‑doubling    O(n log n)        O(n)               General‑purpose, easy to implement
SA‑IS (linear)     O(n)              O(n)               Large‑scale genome indexing
Induced sorting    O(n)              O(n)               Performance‑critical systems
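
The sketch below is a compact prefix‑doubling construction. Because it re‑sorts with Python's comparison sort in each round it runs in O(n log^2 n) rather than the O(n log n) achievable with radix sort, but it illustrates the doubling idea:

# Python sketch: prefix‑doubling suffix array construction

def suffix_array(s):
    n = len(s)
    sa = sorted(range(n), key=lambda i: s[i])
    rank = [ord(c) for c in s]   # rank each suffix by its first character
    tmp = [0] * n
    k = 1
    while k < n:
        # Compare suffixes by (rank of first k chars, rank of next k chars).
        key = lambda i: (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)
        tmp[sa[0]] = 0
        for j in range(1, n):
            tmp[sa[j]] = tmp[sa[j - 1]] + (key(sa[j]) != key(sa[j - 1]))
        rank = tmp[:]
        if rank[sa[-1]] == n - 1:  # all ranks distinct: order is final
            break
        k *= 2
    return sa

print(suffix_array("banana"))  # [5, 3, 1, 0, 4, 2]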

Suffix Tree

A suffix tree is a compressed trie of all suffixes. After O(n) construction it answers substring queries in O(m) time and supports many other queries (longest repeated substring, longest common substring, etc.) in linear time, at the cost of higher memory overhead than a suffix array.

Ukkonen's algorithm is the standard way to construct a suffix tree online in linear time.
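
For illustration only, the sketch below builds an uncompressed suffix trie by inserting every suffix; this costs O(n^2) time and space in the worst case, whereas a real suffix tree compresses edge labels and is built in O(n) by Ukkonen's algorithm. It nevertheless demonstrates the key property that a substring query costs only O(m):

# Python sketch: naive suffix trie (O(n^2) build, O(m) substring queries)

class SuffixTrieNode:
    def __init__(self):
        self.children = {}

def build_suffix_trie(text):
    root = SuffixTrieNode()
    for i in range(len(text)):    # insert every suffix text[i:]
        node = root
        for ch in text[i:]:
            node = node.children.setdefault(ch, SuffixTrieNode())
    return root

def contains_substring(root, pattern):
    node = root
    for ch in pattern:            # walk one edge per character: O(m)
        if ch not in node.children:
            return False
        node = node.children[ch]
    return True

root = build_suffix_trie("banana")
print(contains_substring(root, "nan"))  # True
print(contains_substring(root, "nab"))  # False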

📘 Summary: String algorithms range from simple linear scans to sophisticated data structures that enable sub‑linear query times. Choosing the right algorithm depends on the problem size, pattern frequency, and memory constraints. Mastery of these techniques equips software engineers to build performant text‑heavy applications, from search engines to DNA sequence analysers.

Q: When should I prefer KMP over Rabin‑Karp?
A: Use KMP when you need guaranteed linear worst‑case performance for a single, static pattern. Rabin‑Karp shines when you search for many patterns of the same length (each pattern is hashed once and compared against a single rolling window hash) or when fast average‑case performance with simple code is sufficient.


Q: Is a Trie always better than a HashMap for prefix searches?
A: A Trie provides ordered traversal and guarantees O(k) lookup, independent of the number of stored strings, which a HashMap cannot. However, for sparse datasets a HashMap may use less memory.


Q: Do suffix arrays handle Unicode strings?
A: Yes, as long as you treat each code point (or grapheme cluster) as a distinct character. Be mindful of variable‑length encodings like UTF‑8 when building the array.


Q. What is the time complexity of building a suffix array using the prefix‑doubling method?
  • O(n)
  • O(n log n)
  • O(n^2)
  • O(log n)

Answer: O(n log n)
Each iteration doubles the compared prefix length, giving O(log n) rounds; with radix sort each round costs O(n), for O(n log n) overall.

Q. Which of the following statements about the Z‑algorithm is FALSE?
  • It can be used to compute the longest common prefix of the string with any of its suffixes.
  • It runs in linear time.
  • It requires extra space proportional to the length of the string.
  • It is equivalent to the KMP prefix function.

Answer: It is equivalent to the KMP prefix function.
Although both encode prefix information and each array can be derived from the other in linear time, the Z‑array (longest match between the whole string and the suffix starting at i) and the KMP LPS array (longest proper prefix that is also a suffix of each prefix) store different quantities and are not interchangeable as‑is.
