Strings

Strings are one of the most fundamental data structures. They represent sequences of characters and are used to store and manipulate text. String algorithms are essential for solving a wide range of problems, from searching and pattern matching to data compression and encryption.

In this section, you'll find explanations and solutions for classic and advanced string problems, including: - Pattern searching (e.g., KMP, Rabin-Karp) - String manipulation and parsing - Trie and suffix data structures

String Matching

String matching, also known as pattern searching, is the process of finding one or more occurrences of a substring (pattern) P within a larger string (text) T.

Naive Approach

The simplest way to search for a pattern is to check for a match at every possible position in the text. This approach is slow for larger inputs. If length of P is m and length of T is n then time complexity is O(m * n).

Rabin-Karp

The Rabin-Karp algorithm uses hashing to efficiently search for a pattern in a text. It computes a hash value for the pattern and for each substring of the text with the same length. If the hash values match, it checks the actual substring to confirm a match. This approach is efficient for multiple pattern searches and has an average and best-case time complexity of O(n + m), but the worst case can degrade to O(mn) due to hash collisions. We need to use an efficient hash function which takes on average O(1) for finding hash of m characters. Otherwise, this will have the same complexity as a naive approach.

Approach: 1. Calculate the hash of the pattern and the first window of text. 2. Slide the window over the text, updating the hash efficiently. 3. If the hash matches, compare the actual strings to confirm.

In the case of strings containing only digits, hash function can be its decimal value.

fun rabinKarp(text: String, pattern: String) {
    val n = text.length
    val m = pattern.length

}

Finite Automata

The finite automata approach constructs a state machine that processes the text in a single pass. Each state represents how much of the pattern has been matched so far. The automaton is precomputed for the pattern, and then the text is processed character by character, transitioning between states. This method guarantees O(n) time complexity for searching, but preprocessing the automaton takes O(m * |Σ|), where |Σ| is the alphabet size.

Approach: 1. Build the transition table for the pattern. 2. Process the text using the automaton, tracking the current state. 3. When the final state is reached, a match is found.

KMP

The Knuth-Morris-Pratt (KMP) algorithm improves search efficiency by preprocessing the pattern to create a longest prefix-suffix (LPS) array. This array allows the algorithm to skip unnecessary comparisons by remembering where to resume the search after a mismatch. KMP guarantees O(n + m) time complexity.

The KMP algorithm is based on the following intuition:

When a mismatch happens during pattern matching, KMP uses previously matched information to avoid re-checking characters that we already know will match.
We can preprocess the pattern to create a lookup table that tells us the longest proper prefix which is also a suffix for any sub-pattern ending at position i.

Approach:

Preprocess the pattern to build the LPS array.
- For a given position i in the pattern: LPS[i] = length of the longest prefix of the pattern that is also a suffix for the substring pattern[0..i].
Start matching the pattern with the text.
On match, move both pointers (in text and pattern).
On mismatch:
- If j != 0 (pattern index), move pattern index to lps[j - 1].
- Else, move to next character in text.

fun prefixTable(pattern: String): List<Int> {
    val lps = MutableList(patter.length)
    var len = 0 // length of the previous longest prefix suffix
    var i = 1

    while(i < pattern.length) { 
        if(patter[i] == pattern[len]) { 
            len++
            lsp[i] = len
            i++
        } else if(len != 0){
            len = lps[len - 1]
        } else {
            lps[i] = 0
            i++
        }
    }
}

fun kmpSearch(text: String, pattern: String): List<Int> {
    val lps = prefixTable(pattern)
    val res = mutableListOf<Int>()

    var i = 0
    var j = 0

    while(i < text.length) {
        if(pattern[i] == pattern[j]) {
            i++
            j++
        }

        if(j == pattern.length) {
            res.add(i - j)
            j = lps[j - 1]
        } else if(i < text.length && patern[j] != text[i]) {
            if(j != 0) j = lps[j-1]
            else i++
        }
    }
   return res
}

Boyer-Moore Algorithm

The Boyer-Moore algorithm is an efficient string-searching (or substring matching) algorithm that skips sections of the text to reduce the number of comparisons. It’s especially efficient for large texts and small patterns.

Approach: Boyer-Moore uses heuristics to skip ahead:

Bad Character Heuristic: If a mismatch occurs at character c in the text
Check where c last appeared in the pattern
Shift the pattern to align its last occurrence of c with the mismatch position, or skip completely if c isn't in the pattern.
Good Suffix Heuristic: If a mismatch occurs after some part of the pattern matched
Shift the pattern to align the next occurrence of that matching suffix with the text.
If that suffix doesn't appear elsewhere, shift past it entirely.

fun boyerMooreSearch(text: String, pattern: String): List<Int> {
    val n = text.length
    val m = pattern.length

    // Build bad character table
    val baChar = IntArray(256) { -1 }
    for(i in patter.indices) {
        badChar[patterm[i].toInt()] = i
    }

    // Build good suffix table
    val goodSuffix = buildGoodSuffixTable(pattern)

    val res = mutableListOf<Int>()
    var s = 0 // Shift of the pattern w.r.t text

    while(s <= n-m) {
        var j = m - 1
        while(j >=0 && patterm[j] == text[s + j]) {
            j--
        }

        if(j < 0) {
            res.add(s)
            s += goodSuffix[0] // Full match, use full match shift
        } else {
            val badCharShift = j - badCahr[text[s + j].toInt()].coerceAtLeast(-1)
            val goodSuffixShift = goodSuffix[j + 1]
            s += maxOf(1, maxOf(badCharShift, goodSuffixShift))
        }
    }

   return res
}

fun buildGoodSuffixTable(pattern: String): IntArray {
   val m = pattern.length
   val goodSuffix = IntArray(m + 1)
   val suffix = IntArray(m + 1)

   // Step 1: Compute suffixes
   var f = 0
   var g = m
   suffix[m] = m

   for (i in m - 1 downTo 0) {
      if (i > g && suffix[i + m - f] < i - g) {
         suffix[i] = suffix[i + m - f]
      } else {
         g = i
         f = i
         while (g >= 0 && pattern[g] == pattern[g + m - f]) {
            g--
         }
         suffix[i] = f - g
      }
   }

   // Step 2: Compute good suffix table
   for (i in 0..m) {
      goodSuffix[i] = m
   }

   var j = 0
   for (i in m - 1 downTo 0) {
      if (suffix[i] == i + 1) {
         while (j < m - 1 - i) {
            if (goodSuffix[j] == m) {
               goodSuffix[j] = m - 1 - i
            }
            j++
         }
      }
   }

   for (i in 0 until m - 1) {
      goodSuffix[m - 1 - suffix[i]] = m - 1 - i
   }

   return goodSuffix
}

DS for Strings

If we have a set of strings and a word, we want to search in that set, in order to perform efficient search we need an efficient way of storing strings.

Trie

Trie is one of the efficient data structures for strings that provides O(L) time for both insert and search.

A Trie is a tree and each node in it contains number of pointers equal to the characters in the alphabets.

Structure

data class TrieNode(
    val isEnd: Boolean,
    val child: MutableMap<Char, TrieNode>
)

The main disadvantage with trie is that they need a lot of memory for storing strings.

Ternary Search Tree

A Ternary Search Tree (TST) is a type of trie that is used to store strings efficiently while maintaining a compact structure and enabling fast lookup, insertion, and deletion operations. It combines features of both binary search trees and tries.

TST uses three pointers:

left pointer points to TST containing all strings which are alphabetically less than data
right pointer points to TST containing all strings which are alphabetically greater than data
middle pointer points to TST containing all strings which are alphabetically equal to data

Structure

data class TSTNode(
    val data: Character,
    val isEnd: Boolean,
    val left: TSTNode?,
    val right: TSTNode?,
    val middle: TSTNode?
)

Suffix Tree

A suffix tree is a trie like data structure that stores all the suffixes of a text T. A suffix tree for an n character text T is a rooted tree with the following properties:

A suffix tree contains n leaves.
Each internal node should have at least 2 children.
Each edge is labelled by a nonempty substring of T.
No two edges of a node begin with the same character.
The path from root to leaves represent all the suffixes of T.

Construction of Suffix Trees

Let S be set of all suffixes of T. Append $ to each of the suffixes.
Sort suffixes in S based on their first characters and group them based on the first character.
For each group $S_c$
- If $S_c$ has only one element, then create a leaf node with edge equal to the substring in group.
- Otherwise, find the longest common prefix in the group, create an internal node with edge equal to lcp and recursively continue with step 2, S being set of remaining suffixes from $S_c$, after removing the lcp.

data class SuffixNode(
    val edge: String,
    val children: MutableList<SuffixNode> = mutableListOf(),
    val isLeaf: Boolean = false
)


class SuffixTree(private val text: String) {
    private val root = SuffixTreeNode()

    init{
        buildSuffixTree()
    }

    fun lcp(suffixes: List<String>): String {
        if(suffixes.isEmpty()) return ""
        var prefix = suffixes[0]
        for(s in suffixes.drop(1)){
            var i = 0
            while(i < prefix.length && i < s.length && preifx[i] == s[i]) i++
            prefix = prefix.substring(0, i)
            if(prefix.isEmpty()) break
        }
        return prefix
    }

    private fun buildSuffixTreeUtil(suffixes: List<String>, val node: SuffixTreeNode): SuffixTreeNode {
        if(suffixes.length == 1) {
            node.childeren.add(SufixTreeNode(suffixes[0], isLeaf=True))
            return
        }

        val groups = suffixes.groupBy { it.firstOrNull() ?: "$" }

        for((_, group) in groups) {
            val lcp = lcp(group)
            val childNode = SuffixTreeNode(lcp)
            node.childeren.add(childNode)
            val reduced = group.map { it.removePrefix(lcp) }
            buildSuffixTreeUtil(reduced, childNode)
        }
    }

    private fun buildSuffixTree() {
        val suffixes = (0 .. text.length).map { text.subString(it) + '$'}
        buildSuffixTreeUtil(suffixes, root)
    }

    private fun search(pattern: String) {
        var node = root
        var i = 0
        while (i < pattern.length) {
            // Find child whose edge starts with pattern[i]
            val child = node.children.find { it.edge.isNotEmpty() && it.edge[0] == pattern[i] }
                ?: return false
            val edge = child.edge
            var j = 0
            // Match as much as possible along the edge
            while (j < edge.length && i < pattern.length && edge[j] == pattern[i]) {
                j++
                i++
            }
            if (j < edge.length) return false // Pattern mismatch on edge
            node = child
        }
        return true
    }
}

Time Complexity $O(n^2)$

This is not optimal for large strings. For large text, use Ukkonen’s algorithm, which builds the tree in O(n).

Suffix trees are powerful but often replaced by suffix arrays + LCP arrays in practice because they are simpler and more space-efficient.

Suffix arrays + LCP arrays

A suffix array for a string s in an array on integers representing starting indexes of all suffixes of s sorted in lexicographical order.

LCP array contains the lengths of the longest common prefixes between each pair of adjacent suffixes in the suffix array.

We can suffix + LCP arrays for pattern matching in O(m + log n).

fun suffixArray(s: String): IntAray {
    val n = s.length
    val suffixes = Array(n) { i -> s.substring(i) to i}
    suffixes.sortBy { it.first }
    return suffixes.values
}

fun lcpArray(s: String, sa: IntArray): IntArray {
    val n = s.length
    val rank = IntArray(n) 
    val lcp = IntArray(n - 1)

    for(i in 0 until n) {
        rank[sa[i]] = i
    }

    var k = 0
    for(i in 0 until n) {
        if(rank[i] == n - 1){
            k = 0
            continue
        }
        val j = sa[rank[i] + 1]
        while(i + k < n && j + k < n && s[i + k] == s[j + k]) k++
        lcp[rank[i]] = k
        if(k > 0) k--
    }

    return lcp
}

Suffix Automaton

A Suffix Automaton (SA) is a type of finite automaton that compactly represents all substrings of a given string. It’s a minimal DFA that accepts all suffixes of the string and all substrings in general.

Suffix Automaton is powerful because they can be built in O(n).
A Suffix Automaton is a directed graph where:

Each state represents a set of equivalent substrings.
Transitions represent extending substrings by a character.
Each state store:
- len: Length of the longest string in its class
- link: The suffix link pointing to the largest suffix
- next: Transitions for each character

Approach: 1. Initialization: Start with initial empty substring. 2. Adding a character: - For each new character, extend paths from previous active states. - If a transition already exists but is too long, split the state to keep automaton minimal. 3. Suffix Links: They link each state to its longest suffix, which is in the automaton.

data class State(
    var len: Int = 0, // Length of the longest substring in this state
    var link: Int = -1, // Suffix link to another state
    val next: MutableMap<Char, Int> = mutableMapOf() // Transitions
)

class SuffixAutomaton(s: String) {
    private val states = mutableListOf<State>()
    private var last = 0 // Index of last state

    init{
        states.add(State()) // Initial state
        for(c in s) {
            extemd(c)
        }
    }

    private fun extend(c: Char) {
        val curr = states.size
        states.add(State(len=states[last].len + 1))

        /**
         * Update transitions:
         * Walk back through suffix links until we find a state that already has a transition for c
         * For each state that does not have a transition for c, add one to curr.
         */
        var p = last
        while(p != -1 && c !in states[p].next) {
            state[p].next[c] = curr
            p = states[p].link
        }

        /**
         * Set Suffix Link:
         * If p == -1: 
         *      set the new state suffix link to initial state
         * else: found a state p with a transition c
         *      link to q or its clone
         */

        if(p == -1) {
            states[curr].link = 0
        } else {
            val q = states[p].next[c]!!
            if(states[p].len + 1 == states[q].len) {
                states[curr].link = q
            } else {
                val clone = states.size
                states.size(
                    State(
                        len = state[p].len + 1,
                        link = states[q].link,
                        next = states[q].next
                    )
                )
                while(p != -1 && states[p].next[c] == q) {
                    states[p].next[c] = clone
                    p = states[p].link
                }
                states[q].link = clone
                states[curr].link = clone
            }
        }
        last = curr
    }

    fun countDistinctSubstrings(): Long {
        var res = 0L
        for(state in states) {
            if(state.link != -1) {
                res += (state.len - states[state.link].len)
            }
        }
        return res
    }

    fun contains(pattern: String): Boolean {
        var curr = 0
        for(c in pattern) {
            if(c !in states[current].next) {
                return false
            }
            current = states[current].next[c]!!
        }
        return true
    }
}

Applications:

Number of different substrings
Longest common substring
Number of occurrences of a substring
Lexicographical queries of a substrings