May the source be with you, but remember the KISS principle ;-)
Contents Bulletin Scripting in shell and Perl Network troubleshooting History Humor

Pattern Matching and Quick String Search


See Also Recommended Books Recommended Links Lecture Notes and E-books Donald Knuth TAOCP Volume3 Longest Common Substring Problem Animations
Knuth-Morris-Pratt algorithm Boyer-Moore string search algorithm Sunday string search algorithm Hume and Sunday -- Tuned Boyer-Moore  and Least Cost  Harrison Karp-Rabin Regular expressions  Humor Etc

String searching algorithms are an important class of string algorithms that try to find a place where one or several strings (also called patterns) are found within a larger string or text. Here is bibliography that can help to start in this area (See also STRING SEARCHING ALGORITHMS by Graham A Stephen)

Exact Text Searching

Aho A.V., Algorithms for finding patterns in strings, Chapter 5 (pp. 255-300) of Leeuwen J. van (ed.) Handbook of Theoretical Computer Science, Elsevier Science Publishers, Amsterdam.

Abrahamson K., Generalized string matching, SIAM Journal on Computing, 16(6), 1039-1051, 1987.

Amir A., Landau G.M., and Vishkin U., Efficient pattern matching with scaling, Journal of Algorithms, 13(1), 2-32, 1992.

Apostolico A., and Giancarlo R., The Boyer-Moore-Galil string searching strategies revisited, SIAM Journal on Computing, 15(1), 98-105, 1986.

Baeza-Yates R., and Gonnet G.H., A new approach to text searching, Communications of the ACM, 35(10), 74-82, 1992.

Baeza-Yates R., and Regnier M., Average running time of the Boyer-Moore-Horspool algorithm, Theoretical Computer Science, 92(1), 19-31, 1992.

Baeza-Yates R., Choffrut C., and Gonnet G., On Boyer Moore Automata, Algorithmica, 12(4-5), 268-292, 1994.

Baeza-Yates R., and Fuentes L., A framework to animate string algorithms, Information Processing Letters, 59(5), 241-244, 1996.

Baeza-Yates R., Improved string matching, Software-Practice and Experience, 19(3), 257-271, 1989.

Baeza-Yates R.A., Algorithms for String Searching: A Survey, ACM SIGIR Forum, 23(3-4), 34-58, 1989.

Baeza-Yates R., Text Retrieval: Theory and Practice, in Proc. of the 12th IFIP World Computer Congress, 465-476, (Madrid, Spain), North-Holland, 1992.

Baeza-Yates R., A unified view to string matching algorithms, Thoery and Pratice of Informatics, 23rd Seminar, 1-15, 1996.

Blumer A., Blumer J., Haussler D., Ehrenfeucht A., Chen M.T., and Seiferas J., The smallest automaton recognizing the subwords of a text, Theoretical Computer Science, 40(1), 31-55, 1985.

Boyer R.S., and Moore J.S., A fast string searching algorithm, Communications of the ACM, 20(10), 762-772, 1977.

Breslauer D., Colussi L., and Toniolo L., Tight comparison bounds for the string prefix matching problem, Information Processing Letters, 47(1), 51-57, 1993.

Breslauer D., Saving comparisons in the Crochemore-Perrin string matching, Theoretical Computer Science, 158(1-2), 177-192, 1996.

Bruyere V., Baeza-Yates R., Dergrange O., and Scheihing R., About the size of Boyer-Moore automata, in Proc. of the 3rd South American Workshop on String Processing, 31-46, Carleton, University Press, 1996.

Charras C., and Lecroq T., Exact string matching algorithms, Technical Report, 1997.

Cole R., Hariharan R., Zwick U., and Paterson M.S., Tighter lower bounds on the exact complexity of string matching, SIAM Journal on Computing, 24(1), 30-45, 1995.

Cole R., Tight bounds on the complexity of the Boyer-Moore pattern matching, SIAM Journal on Computing, 23(5), 1075-1091, 1994.

Colussi L., Correctness and Efficiency of pattern matching algorithm, Information and Computation, 95(2), 225-251, 1991.

Colussi L., Fatest pattern matching in strings, Journal of Algorithms, 16( 2), 163-189, 1994.

Crochemore M., and Rytter W., Text Algorithms, Oxford University Press, 1994.

Crochemore M., Czymaj A., Gasieniec L., Jarominek S., Lecroq T., Plandowski W., and Rytter W., Speeding Up Two String Matching Algorithms, Algorithmica, 12(4-5), 247-267, 1994.

Crochemore M., and Lecroq T., Tight bounds on the complexity of the Apostolico-Giancarlo algorithm, Information Processing Letters, 63(4), 195-203, 1997.

Crochemore M., and Lecroq T., Pattern matching and text compression algorithms, in The Computer Science and Engineering Handbook, Tucker A.B., ed., CRC Press, Boca-Raton, Chapter 8, 162-202, 1997.

Crochemore M., and Hancart C., Automata for Matching Patterns, Handbook of Formal Languages, Rosenberg C., Salamaa A. eds, 2, 399-462, Springer-Verlag, 1997.

Crochemore M., and Hancart C., Pattern matching in strings, in Algorithms and theory of computation Handbook, Mikhail J., Atallah, ed. CRC Press, Boca Raton, 1998.

Crochemore M., String matching on ordered alphabets, Theoretical Computer Science, 92(1), 33-47, 1992.

Crochemore M., Off-line serial exact string searching, in Pattern Matching Algorithms, Apostolico A., Z. Galil, ed., Oxford University Press, New York, Chapter 1, 1-53, 1997.

Davies G., and Bowsher S., Algorithms for pattern matching, Software-Practice and Experience, 16(6), 575-601, 1996.

Ferragina P., and Grossi R., Optimal On-line search and sublinear time update in strings, SIAM Journal on Computing, 27(3), 713-736, 1998.

Galil Z., and Giancarlo R., On the exact complexity of string matching: Lower Bounds, SIAM Journal on Computing, 20(6), 1008-1020, 1991.

Galil Z., and Giancarlo R., On the exact complexity of string matching: Upper Bound, SIAM Journal on Computing, 21(3), 407-437, 1992.

Galil Z., On improving the worst case running time of the Boyer-Moore algorithm, Communications of the ACM, 22(9), 505-508, 1979.

Gasieniec L., Plandowski W., and Rytter E., The zooming method: a resursive approach to time-space algorithm, Theoretical Computer Science, 147(1-20), 19-30, 1995.

Gasieniec L., Plandowski W., and Rytter W., Sequential sampling: a new approach to constant space pattern matching, in Proc. of the 6th Annual Symposium on Combinatorial Pattern Matching, LNCS 937, 78-89, Springer-Verlag, Berlin, 1995.

Gonnet G.H., and Baeza-Yates R.A., An analysis of the Karp-Rabin string matching algorithm, Information Processing Letters, 34(5), 271-274, 1990.

Gonnet G.H., and Baeza-Yates R., Handbook of Algorithms and Data Structures in Pascal and C, 2nd edition, Addison-Wesley, Workingham, 251-288, 1991.

Gusfield D., Algorithms on strings, trees and sequences, Gambridge University Press, 1997.

Hancert C., On Simon?s string searching algorithm, Information Processing Letters, 47(2), 95-99, 1993.

Horspool R.N., Practical fast searching in strings, Software-Practice and Experience, 10(6), 501-506, 1980.

Hume A., and Sunday D., Fast string searching, Software-Practice and Experience, 21(11), 1221-1248, 1991.

Knuth D.E., Morris J.H., and Pratt V.R., Fast pattern matching in strings, SIAM Journal on Computing, 6(2), 323-350, 1977.

Lecroq T., A variation on the Boyer-Moore algorithm, Theoretical Computer Science, 92(1), 119-144, 1992.

Lecroq T., Experimental results on string matching algorithms, Software-Practice and Experience, 25(7), 727-765, 1995.

Lecroq T., Experiments on string matching in memory structures, Software-Practice and Experience, 28(5), 561-568, 1998.

Liu Z., Du X., and Ishii N., An improved adaptive string searching algorithm, Software-Practice and Experience, 28(2), 191-198, 1998.

Manolopoulos Y., and Faloutsos C., Experimenting with pattern matching algorithms, Information Sciences, 90(1-4), 75-89, 1996.

Perleberg C., Single character searching methods and the shift-or pattern matching algorithm, Information Processing Letters, 50(5), 269-275, 1994.

Raita T., Tunning the Boyer-Moore-Horspool string searching algorithm, Software-Practice and Experience, 22(10), 879-884, 1992.

Reingold E., Urban K.J., and Gries D., K-M-P string matching revisited, Information Processing Letters, 64(5), 217-223, 1997.

Sedgewick R., Algorithms, 2nd edition, Addison Wesley, Reading Mass, 277-303, 1988.

Smit G. De V., A Comparison of Three String Matching Algorithms, Software-Practice and Experience, 12(1), 57-66, 1982.

Smith P., Experiments with a very fast substring search algorithm, Software-Practice and Experience, 21(10), 1065-1074, 1991.

Smith P., On tuning the Boyer-Moore-Horspool string searching algorithm, Software-Practice and Experience, 24(4), 425-436, 1994.

Stephen A.G., String Search, Techinal Report, School of Electronic Engineering Science, 1992.

Sunday D., A very fast substring search algorithm, Communications of the ACM, 33(8), 132-142, 1990.

Tarhio J., and Peltora H., String matching in the DNA alphabet, Software-Practice and Experience, 27(7), 851-861, 1997.

Watson B., The performance of single and multiple keyword pattern matching algorithms, in Proc. of the 3rd South American Workshop on String Processing, 280-294, Carleton, University Press, 1996.

Watson B., SPARE Parts: A C++ toolkit for string Pattern Recognition, in Proc. of the Second Prague stringologic Club Workshop, 47-60, 1997.

Wu S., and Manber U., Fast text searching allowing errors, Communications of the ACM, 35(10), 83-91, 1992.

Approximate Text Searching

Aho A., Corasick M., Efficient string matching: an aid to bibliographic search, Communications of the ACM, 18(6), 333-340, 1975.

Arlazarov V.L., Dinic E.A., Kronrod M.A., Faradzev I.A., On economic construction of the transitive closure of a directed graph, Dokl. Akad. Nauk SSSR, 194, 487-488, 1970 (in Russian). English translation in Soviet Math. Dokl., 11,1209-1210, 1975.

Aho A.V., Algorithms for finding patterns in strings, Chapter 5 (pp. 2 55-300) of Leeuwen J. van (ed.) Handbook of Theoretical Computer Science, Elsevier Science Publishers, Amsterdam.

Aoe J., Computer Algorithms - String Pattern Matching Strategies, IEEE Computer Society Press, Los Alamitos, California, 1994.

Blumer A., Blumer J., Haussler D., Ehrenfeucht A., Chen M.T., Seiferas J., The smallest automaton recognizing the subwords of a text, Theoretical Computer Science, 40, 31-55, 1985.

Baeza-Yates R.A., Text Retrieval: Theory and practice, in Proc. 12th IFIP World Computer Congress, (Madrid, Spain), 1992.

Baeza-Yates R.A., A unified view of string matching algorithms, In SOFSEM 96: Theory and Practice of Informatics, 1-15, Springer-Verlag, 1996.

Baeza-Yates R.A., Gonnet G.H., A new approach to text searching, Communications of the ACM, 35(10), 74-82, 1992.

Baeza Yates R.A., Gonnet G., Fast string matching with mismatches, Information and Computation, 108(2), 187-199, 1994.

Baeza-Yates R.A., Navarro G., A fast heuristic for approximate string matching, in Proc WSP 96, 47-63, Carleton University Press, 1996.

Baeza-Yates R.A., Navarro G., A faster algorithm for approximate string matching, In Proc Combinatorial Pattern Matching 96, 1-23, Springer-Verlag, 1996.

Baeza-Yates R.A., Perleberg C.H., Fast and practical approximate pattern matching, in Proc Combinatorial Pattern Matching, CPM 92, 185-192, 1992.

Baeza-Yates R.A., Perleberg C.H., Fast and practical approximate string matching, Information Processing Letters, 59, 21-27, 1996.

Chang W.I, Lampe J., Theoretical and Emprirical Comparisons of Approximate String Matching Algorithms, In Proc. 3rd Combinatorial Pattern Matching, 175-184, University of Arizona, USA, 1992.

Chang W., Lawler E., Sublinear approximate string matching and biological applications, Algorithmica, 12(4/5), 327-344, 1994.

Chang W, Marr T., Approximate string matching and local similarity, In Proc Combinatorial Pattern Matching 94, 259-273, Springer-Verlag, 1994.

Crochemore M., String matching with constraints, in Proc. MFCS’ 88, 324, 44-58, 1988.

Commentz-Walter B., A string matching algorithm fast on the average, in Proc. ICALP 79, 118-132, Springer-Verlag, 1979.

Dermouche A., A fast algorithm for string matching with mismatches, Information Processing Letters, 55, 105-110, 1995.

Galil Z., Giancarlo R., Improved string matching with k mismatches, Sigact News, 17, 52-54, 1986.

Galil Z., Giancarlo R., Data Structures and Algorithms for Approximate String Matching, Journal of Complexit, 4, 33-72, 1988.

Giegerich R., Kurtz S., Hischke F., Ohlebusch E., A general technique to improve filter algorithms for approximate string matching, In Proc WSP 97, 38-52, Carleton University Press, 1997.

Grossi R., Luccio F., Simple and efficient string matching with k mismatches, Information Processing Letters, 33, 113-120, 1989.

Galil Z., Park K., An improved Algorithm for Approximate String Matching, SIAM Journal of Computing, 19(6), 989-999, 1990.

Hall P., Dowling G., Approximate string matching, ACM Computing Surveys, 12(4), 381-402, 1980.

Harel D., Tarhjan R.E., Fast algorithms for finding nearest common ancestors, SIAM Journal on Computing, 13(2), 338-355, 1984.

Jokinen P., Tarhio J., Ukkonen E., A Comparison of Approximate String Matching Algorithms, Software-Practice and Experience,  26(12), 1439-1458, 1996.

Kurtz S., Approximate string searching under weighted edit distance, in Proc. WSP 96, 156-170, Carleton University Press, 1996.

Landau G.M., Vishkin U., Efficient String Matching with k Mismatches, Theoretical Computer Science, 43, 239-249, 1986.

Landau G.M., Vishkin U., Fast String Matching with k Differences, Journal of Computer and System Sciences, 37, 63-78, 1988.

Landau G.M., Vishkin U., Fast Parallel and Serial Approximate String Matching, Journal of Algorithms, 10, 157-169, 1989.

McCreight E., A space-economical suffix tree construction algorithm, Journal of the ACM, 23(2), 262-272, 1976.

Michailidis P., Margaritis K., String Matching Algorithms, Technical Report, Dept. of Applied Informatics, University of Macedonia, 1999 (in Greek).

Masek W.J., Paterson M.S., A Faster Algorithm Computing String Edit Distances, Journal of Computer and System Sciences, 20, 18-31, 1980.

Myers G., A fast bit-vector algorithm for approximate pattern matching based on dynamic programming. In Proc Combinatorial Pattern Matching 98, 1-13, Springer-Verlag, 1998.

Navarro G., Mutiple approximate string matching by counting, in Proc WSP 97, 125-139, Carleton University Press, 1997.

Navarro G., A partial deterministic automaton for approximate string matching, in Proc WSP 97, 112-124, Carleton University Press, 1997.

Navarro G., Approximate Text Searching, Ph.D. Thesis, University of Chile, Dept. of Computer Science, 1998.

Navarro G., Raffinot M., A bit-parallel approach to suffix automata: Fast extended string matching, in Proc Combinatorial Patern Matching 98, 14-33, Springer-Verlag, 1998.

Sellers P.H., The Theory and Computation of Evolutionary Distance: Pattern Recognition, Journal of Algorithms, 1, 359-373, 1980.

Sutinen E., Tarhio J., On using q-gram locations in approximate string matching, In Proc ESA 95, 327-340, Springer-Verlag, 1995.

Stephen G. A., String Searching Algorithms, World Scientific Press, 1994.

Takaoka T., Approximate pattern matching with samples, In Proc. ISAAC 94, 234-242, Springer-Verlag, 1994.

Tarhio J., Ukkonen E., Approximate Boyer-Moore String Matching, SIAM Journal on Computing, 22(2), 243-260, 1993.

Ukkonen E., Algorithms for Approximate String Matching, Information and Control, 64, 100-118, 1985.

Ukkonen E., Finding Approximate Patterns in Strings, Journal of Algorithms, 6, 132-137, 1985.

Ukkonen E., Approximate string matching with q-grams and maximal matches, Theoretical Computer Science, 1, 191-211, 1992.

Ukkonen E., Wood D., Approximate String Matching with Suffix Automata, Algorithmica, 10, 353-364, 1993.

Weiner P., Linear pattern matching algorithm, in Proc. 14th IEEE Symposium on Swithching and Automata Theory, 1-11, 1973.

Wagner R.A., Fischer M.J., The String to String Correction Problem, Journal of the Association for Computing Machinery,  21(1), 168-173, 1974.

Wu S., Manber U., Fast text searching allowing errors, Communications of the ACM, 35(10), 83-91, 1992.

Wu S., Manber U., Myers G., A Subquadratic algorithm for approximate limited expression matching, Algorithmica, 15, 50-67, 1996.

Wright A., Approximate string matching using within-word parallelism, Software-Practice and Experience, 24( 4), 337-362, 1994.

Top Visited
Past week
Past month


Old News ;-)

[Mar 09, 2011] Unmasking Anonymous Email Senders

Mar 08, 2011 | Slashdot

alphadogg writes "Just because you send an email anonymously doesn't mean people can't figure out who you are anymore. A new technique developed by researchers at Concordia University in Quebec could be used to unmask would-be anonymous emailers by sniffing out patterns in their writing style from use of all lowercase letters to common typos. Their research, published in the journal Digital Investigation, describes techniques that could be used to serve up evidence in court, giving law enforcement more detailed information than a simple IP address can produce."

[Nov 20, 2004] AGREP, an approximate GREP

[Nov 20, 2004] Implementation of the Boyer-Moore-Horspool algorithm

Software Matches Police Sketches To Mugshots


Zothecula writes "We've seen it in numerous TV shows and movies – the witness to a crime looks through a book of mug shots, then works with a police sketch artist to come up with a likeness of the nasty person they saw. After looking through hundreds of mug shots, however, it's possible that the tired-brained witness could look right at a photo of the guilty party and not recognize them. It's also possible that there is a mug shot of the criminal on a database somewhere out there, but that this particular witness will never see it. A computer system being pioneered at Michigan State University, however, could be the solution to such problems – it automatically matches faces in police sketches to mug shots."

Fast String Searching With Suffix Trees by Mark Nelson

Dr. Dobb's Journal August, 1996

I think that I shall never see
A poem lovely as a tree.
Poems are made by fools like me,
But only God can make a tree.

Joyce Kilmer

A tree's a tree. How many more do you need to look at?

Ronald Reagan

Matching string sequences is a problem that computer programmers face on a regular basis. Some programming tasks, such as data compression or DNA sequencing, can benefit enormously from improvements in string matching algorithms. This article discusses a relatively unknown data structure, the suffix tree, and shows how its characteristics can be used to attack difficult string matching problems.

The problem

Imagine that you've just been hired as a programmer working on a DNA sequencing project. Researchers are busy slicing and dicing viral genetic material, producing fragmented sequences of nucleotides. They send these sequences to your server, which is then expected to locate the sequences in a database of genomes. The genome for a given virus can have hundreds of thousands of nucleotide bases, and you have hundreds of viruses in your database. You are expected to implement this as a client/server project that gives real-time feedback to the impatient Ph.D.s. What's the best way to go about it?

It is obvious at this point that a brute force string search is going to be terribly inefficient. This type of search would require you to perform a string comparison at every single nucleotide in every genome in your database. Testing a long fragment that has a high hit rate of partial matches would make your client/server system look like an antique batch processing machine. Your challenge is to come up with an efficient string matching solution.

The intuitive solution

Since the database that you are testing against is invariant, preprocessing it to simplify the search seems like a good idea. One preprocessing approach is to build a search trie. For searching through input text, a straightforward approach to a search trie yields a thing called a suffix trie. (The suffix trie is just one step away from my final destination, the suffix tree.) A trie is a type of tree that has N possible branches from each node, where N is the number of characters in the alphabet. The word 'suffix' is used in this case to refer to the fact that the trie contains all of the suffixes of a given block of text (perhaps a viral genome.)

Karp-Rabin string searching

#define B 131

char *search( pat, text )
char *pat, *text;

{ int hpat, htext, Bm, j, m;

if( pat[0]==EOS ) return( text );
Bm = 1;
hpat = htext = 0;

for( m=0; text[m] != EOS && pat[m] != EOS; m++ ) {
Bm *= B;
hpat = hpat*B + pat[m];
htext = htext*B + text[m];

if( text[m]==EOS && pat[m]!=EOS ) return( NULL );

for( j=m; TRUE; j++ ) {
if( hpat==htext && strncmp(text+j-m,pat,m)==0 )
return( text+j-m );
if( text[j]==EOS ) return( NULL );
htext = htext*B - text[j-m]*Bm + text[j];

String Searching by Block Sorting

The simplest algorithm for searching for one string (the "needle") in another larger string (the "haystack") looks like this:

    for i = 0 to len(haystack)-len(needle):
        matched = YES
        for j = 0 to len(needle)-1:
            if haystack[i+j] != needle[j]:
                matched = NO
            end if
        end for
        if matched = YES:
            return i
    end for
    return "not found"

This has time complexity O(H*N) (where H is the length of the haystack, and N is the length of the needle). Of course, you don't usually need to test all the characters of the needle for each value of i, but it can happen in the worst case (consider searching for the needle "aaaaaaaaaaab" in a haystack made of one million repetitions of "a").

A well-known optimisation is to start by converting the needle into a finite state machine. This reduces the cost of the search to O(H), independent of N, at the expense of a setup phase that does depend on N. For a single fixed needle and a huge amount of haystack text, this is a worthwhile tradeoff.

On the other hand, what happens if you have a single fixed haystack and a large number of different needles you want to find?

Algorithm Description

Construct the list of all terminal substrings of the haystack. (A terminal substring is one which appears at the end of the haystack. In other words, a substring you can get by taking characters off the beginning.) For example, with the haystack string "foobar", the terminal substrings would be


Now sort this list into order, so we get:


This concludes the setup phase. Now, when given a needle, we need only perform a binary search through this list!

The storage cost of this algorithm is high. The whole haystack must be kept in memory (or somewhere reasonably accessible) during the setup phase, and in addition to the haystack text itself we also need an array of offsets into the text, which is likely to take up 4 or 8 times as much space as the haystack itself. (We don't need to store all the actual terminal substrings; we need only store the position within the haystack where each one starts.)

The setup phase consists of a sort. Sorting is well known to be possible in O(n log n) comparisons, but when the comparisons are between strings, the cost of each comparison is non-trivial as well. In the absolute worst case (every character in the haystack is identical), a comparison would take O(H) time, so the setup phase could take O(H^2 log H) in the worst case. On the other hand, any reasonably normal haystack is likely to take not much more than O(H log H) to sort.

The searching procedure is similar. A binary search takes O(log n) comparisons, so in the worst case the search can take O(N log H). Assuming the needle is reasonably small, this seems entirely acceptable for the massive gain of not having an O(H) term!


No clear applications immediately spring to mind. Search engines and indexing is a related area, and it seems not impossible there might be an application there.

(back to algorithms index)

New and Faster Filters for Multiple Approximate String Matching (ResearchIndex)

[PDF] Mutable Strings in Java: Design, Implementation and Lightweight ...

File Format: PDF/Adobe Acrobat - View as HTML
... ment a few relaxed versions of Daniel Sunday's QuickSearch [9], a variant of the
Boyer–Moore algorithm. ... applied them to the text-search problem ...

[PDF] Fast String Searching

File Format: PDF/Adobe Acrobat - View as HTML
... Hill, NJ 07974, USA AND DANIEL SUNDAY Johns Hopkins ... SUMMARY Since the Boyer-Moore
algorithm was described ... benchmark for the practical string search literature. ...

[PDF] Algorithms for String Searching On A Beowulf Cluster

File Format: PDF/Adobe Acrobat
... Daniel Vaz and Hugues Treiber need a very special mention for ... Baker's Boyer
Moore-type Algorithm [37] • Sunday's Substring Search Algorithm [38] 2.5 ...

Recommended Links

Softpanorama hot topic of the month

Softpanorama Recommended

Top articles


**** Pattern Matching Pointers -- very good

NIST String matching

Dick Grune's Annotated Literature Lists

string matching algorithms

The Prague Stringology Club

Algebraic Combinatorics on Words


[PDF] COSC 348 String matching

Introduction to PL-I, Algorithms, and Structured Programming, 3rd Edition

Teaching Binary Tree Algorithms by Programming Proof and Animation

Quick string search

Quick Search algorithm

Horspool algorithm

Citations- Improved string searching - Baeza-Yates (ResearchIndex)

Lecture Notes

String Searching

Knuth-Morris-Pratt algorithm

Knuth-Morris-Pratt Algorithm

Knuth-Morris-Pratt algorithm - Encyclopedia Fablis articles about ...

Knuth-Morris-Pratt algorithm - Wikipedia, the free encyclopedia

NIST: Knuth-Morris-Pratt algorithm

The Knuth-Morris-Pratt Algorithm

Knuth-Morris-Pratt algorithm - encyclopedia article about Knuth ...

Knuth-Morris-Pratt Algorithm

Boyer-Moore string search algorithm

Boyer-Moore algorithm is the fastest known string searching algorithm, at least in its best case. For a text of length n and a fixed pattern of length m the best performance of the algorithm is n/m and the worst case is n*m. The algorithm is very widely used in software engineering. The notable feature of the algorithm, apparent from its running time formula, is that the longer the pattern we are looking for, the faster the algorithm will be usually able to find it.

String searching algorithm - Wikipedia, the free encyclopedia

Boyer-Moore String Matching over Ziv-Lempel Compressed Text ...

The Boyer-Moore Fast String Searching Algorithm

This algorithm, which Bob Boyer and I invented in about 1975, is the basis of the fastest known ways to find one string of characters in another. How might you look for the following pattern in the text below?
pattern: EXAMPLE
We can show you the Knuth-Pratt-Morris algorithm and we can show you the Boyer-Moore algorithm.

Our algorithm has the peculiar property that, roughly speaking, the longer the pattern is, the faster the algorithm goes. Furthermore, the algorithm is ``sublinear'' in the sense that it generally looks at fewer characters than it passes. The algorithm is described in

Frodo's Workshop - Boyer-Moore Algorithm

NIST: Boyer-Moore

The Boyer-Moore Algorithm

Data Structures and Algorithms II Lecture 3- String Search ...

Horspool algorithm

Boyer-Moore-Horspool String Searching

Sunday String Search algorithm


By Leonard Morgenstern

Forth Scientific Library Algorithm #41

Adapted from Daniel Sunday. Communications of the ACM 33:132 1990.

QUICKSEARCH ( c-addrp up c-addrt ut -- c-addr)
Locate the first instance of a search-pattern with coordinates c-addrp up within a
text-string c-addrt ut. Return the address of the first match in the text if found,
or NIL if not, where NIL is a user-defined impossible address. An ambiguous
condition exists if the search-pattern (up) is longer than the text-string (ut).

The search algorithm

Before beginning the search, the algorithm constructs an array called a
shift-table, having the same number of cells as the number of possible characters,
normally 256. The contained quantity is up +1 (where up is the length of the
search-pattern) for characters not present in the search-pattern, and the number
of characters from the end of the string for characters that are present, the
final character in the string being counted as 1. If duplicates are present, the
lowest number is used. For example, if the pattern string is "THAT", then T=1,
A=2, H=3, and all others are 5. If two characters are equivalent, they are
assigned the same shift-values; in the example, T and t would both have the value
1 if case is to be ignored. The shift-table is the only data structure needed by
the algorithm, except, of course, the text and pattern strings themselves.

At the start, the first character of the pattern is positioned on the first
character of the text and up characters are compared. (The choice of a comparison
algorithm will be discussed below.) If there is a mismatch, the text character
just beyond the pattern is examined, and the corresponding shift obtained from the
shift-table. The pattern is then advanced that number of characters. The process
is repeated until the pattern is matched or the text-string is exhausted.

An important characteristic of Sunday's algorithm is that the distance advanced
depends only on the text character just beyond the search-pattern; it does not
depend on the character that failed to match, its position, or the order in which
the pattern and text are compared. The algorithm may search front-to-back,
back-to-front, or a scrambled order based on letter-frequencies. Naturally, the
choice affects performance, as will be discussed in the next section.

The comparison algorithm

The algorIthm for comparing the search-pattern and text (the "comparator") is a
deferred word with the stack diagram:

COMPARATOR ( c-addrp up c-addrt ut -- n)
where n=0 for a match and non-zero for a mismatch. The stack diagram is the same
as that of the word COMPARE in the ANSI Standard String Word Set.

COMPARATOR should be carefully selected to optimize performance. The standard word
COMPARE is a good choice in most cases. Situations where another comparator should
be used include: 1) when certain sets of characters are regarded as equivalent,
for example, when case is to be ignored; 2) where wild-cards are allowed; and 3)
where character frequencies dictate a more efficient search order. In general, the
search is fastest when rare characters are matched first.


The efficiency of the search can be measured by the number of comparisons. The most
favorable case is seen when the first character in the pattern is never present in
the text and the text character just beyond the pattern never happens to be
present in the pattern; the number of comparisons is then ut/(up+1). At the other
extreme, there are degenerate cases in which the number of comparisons approaches
ut*up. This would occur when using a front-to-back comparator if the text and
search patterns contain long runs of the final character in the search pattern,
for example, searching for AAAABA in AAA...AAAAABA. This is not merely a
theoretical problem, as it might occur in searching for a string embedded in
spaces, or in searching graphics files. In the example, performance would be
improved by matching the odd character first, but the number of comparisons could
not be reduced much below nt. If there is a significant chance that patterns like
these might occur, Sunday's algorithm probably should not be used.

This is an ANS Forth program requiring:
1. A standard system with Core and Core extension sets.
2. String extension set if COMPARE is used to compare strings.
3. Uses the words V: and DEFINES from the FSL utilities to
manage vectored words
4. The compilation of the test code is controlled by the
VALUE TEST-CODE? and the conditional compilation words in
the Programming-Tools wordset
5. The test code uses the String wordset word /STRING
6. The test code uses the (non ANS) words:
TAB to emit a tab character and,
DUP>R which is equivalent to:
: DUP>R DUP >R ;

This code is released to the public domain. Leonard Morgenstern, March 1995



0 [IF]

\ PRELUDE to convert F-PC to ANSI
: CELLS 2* ;
: CELL+ 2+ ;
: COMPARE ( c-addr1 u1 c-addr2 u2 -- n)
rot min COMP ;






\ Impossible address; returned if no match found

\ Number of possible symbols


\ Initializing the shift table.
: INIT-SHIFT-TABLE ( c-addr1 u -- )

DUP 1+
>R 2DUP - R> !

\ In this version, certain quantities are stored in VALUEs
\ For many applications, local variables would be better

: QUICKSEARCH ( c-addr[p] u[p] c-addr[t] u[t] -- c-addr[m])
\ Enter with addr & len of pattern & text
\ Return address of match or NIL
TO QS-TL TO QS-TA \ Store text coordinates
2DUP INIT-SHIFT-TABLE \ Initialize shift-table
TO QS-PL TO QS-PA \ Store pattern coordinates
NIL \ NIL on stack
QS-TA QS-TL QS-PL - 1+ \ Maximum number of comparisons
OVER + SWAP \ Substitute BOUNDS if defined
COMPARATOR 0= \ Make a comparison
\ Leave if found
I QS-PL + C@ \ Extract char just past pattern
\ Extract shift


\ =======================================================================

TEST-CODE? [IF] \ test code =============================================



: Show-result ( a -- ) \ After performing search show result
CR 1 tab
DUP nil =
>R \ \ stash match addr on r.s.
QS-TA QS-TL \ ta tl \ get coordinates of text
OVER R> OVER - \ ta tl ta len1 \ get coordx of 1st part
TYPE [CHAR] ^ EMIT \ ta tl
R> /STRING \ a' l' \ index to match
TYPE [CHAR] ^ EMIT \ a' l' \ type matching part
QS-PL /STRING TYPE \ \ type tail
: QTEST ( -- a ) \ Ask for strings, compare, & return result
CR ." Look for: "
pattern$ 40 EXPECT pattern$ SPAN @
CR ." in: "
text$ 40 EXPECT text$ SPAN @


Dictionary of Algorithms and Data Structures ... [Sund98] Daniel M. Sunday, A Very Fast Substring Search Algorithm, Communications of the ACM, 33(8):132-142, August 1998. [Vitt01 ... - 82k - Nov 13, 2004

stbm.c ... 10, October 1977 pp. 762-772 */ /* * A Very Fast Substring Search Algorithm" */ /*
Daniel M. Sunday */ /* Communications of the ACM */ /* Vol. 33 No. ... - 33k

[PDF] COSC 348 String matching File Format: PDF/Adobe Acrobat - View as HTML ... MR 89g:68021 [Sun90] Daniel Sunday, A very fast substring search algorithm, Communications of the Association for Computing Machinery 33 (1990), no. ... Volume 33 , Issue 8 (August 1990) table of contents
Pages: 132 - 142 Year of Publication: 1990 ISSN:0001-0782 Daniel M. Sunday East Stroudsburg Univ. of Pennsylvania, East Stroudsburg, PA Publisher: ACM Press New York, NY, USA

This article describes a substring search algorithm that is faster than the Boyer-Moore algorithm. This algorithm does not depend on scanning the pattern string in any particular order. Three variations of the algorithm are given that use three different pattern scan orders. These include: (1) a "Quick Search" algorithm; (2) a "Maximal Shift" and (3) an "Optimal Mismatch" algorithm.

Random Findings

A Fast String Scanning Algorithm with Small Startup Overhead Thomas Wang, October 1999 [last update November 1999]


A fast substring search algorithm is presented here. This algorithm is faster than a simple left to right scanning algorithm. Startup overhead is taken into consideration to avoid the use of a large lookup table. This algorithm is easily implementable in Java.


The Java class 'String' provides substring searching capabilities. The method String.indexOf(pattern_string) will search for the location of the first occurrence of pattern string in the source string. For example, "Who is Tomas".indexOf("Tomas") would return 7.

Typical implementation algorithm would search for the character 'T' from left to right. Once 'T' is found, then 'o', 'm', 'a', and 's' would be matched. If a mismatch happens, then we have to resume searching for 'T' until we hit the end of the search string.

This is pseudo code implementing the simple algorithm.

for indx = 0 to length(source) - length(pattern)
  // match first pattern character
  if pattern[0] == source[indx] then
    for indx2 = 1 to length(pattern) - 1
      // match subsequent pattern characters
      if pattern[indx2] != source[indx + indx2] then
        continue outer_loop;
    return indx; // everything matched
return did_not_match;

We can do much better than the above algorithm.

Looking For Impossible Characters

The key to a faster scanning algorithm is to look at the last character of the pattern string first. If the last pattern character matches, then we continue to search for pattern characters from left to right, until the second last pattern character. (remember the last pattern character already matches)

"Who is Tomas".indexOf("Tomas")

Who i........         <- source
Tomas                 <- pattern

In the above example, we found a character 'i' where a character 's' is required to make a match. Because 'i' does not occur anywhere in the string 'Tomas', we can advance the search position 5 characters! While we are matching the last character of the pattern string, if we found an "impossible character" in the search string, we are safe to advance the search pointer by the length of the pattern string.

How do we program this test? We can use a 64 bit long integer for a quick test.

// initialization
long cache = 0L;
for i = 0 to length(pattern) - 1 {
  cache |= 1L << (pattern[i] & 63);


// test for last pattern character; pattern string is "Tomas"
if (source[i] != 's') { // last character did not match
  if ((cache & (1L << (source[i] & 63))) == 0L) { // got an impossible char
    i += 5; // skip 5 characters

Being able to skip 5 characters instead of 1 character at a time gives us a speed boost.

If the implementation language does not have good support for 64 bit integer arithmetic, a 32 bit integer cache can be used in a pinch.

Search Order Discussion

After the last character of the pattern string matches with the source string character under the scan pointer, we scan the rest of the pattern string from left to right. The question is 'why'? Some other choices include right to left, and random order. The answer has to do with worst case performance estimate.

If we find an "impossible character" in the subsequent pattern string matches, we can still skip more than one characters. For example, if we find the fifth character is "impossible", then we can increase search pointer by 5.

stre stress     <- source
stress          <- pattern

234561          <- match order

stre stress     <- source
stress          <- pattern

By scanning from left to right, we ensure the amount we skip is roughly proportional to the number of characters we scanned. If we scan from right to left, then it is possible for a long scan to yield a small skip amount.

MD2 Optimization

Optimization of md2 offset is used when the last character of the pattern string matches, but some other pattern characters do not match.

Take the example of "James Tomas".indexOf("Tomas")

In this example, the 5th character of the search string is 's', so it matches the last character of the pattern string. We then match 'J' against 'T', which does not match. At this point, we can again skip 5 characters! Why?

This is because in the string "Tomas", 's' only occurs as the last character and no where else. We know the 5th character is 's' but the front part did not match. If we advance the search position one character, then the 's' character would be eventually matched against the 4th character of the pattern string ('a'). But we already said 's' does not occur in the pattern string from the first position up to the 4th position.

James....       <- source
 Tomas          <- pattern

So before we do anything we can already predict any shift less than 5 would produce an ultimate mismatch.

The safe shift distance for subsequent mismatches is called 'md2'. One can calculate md2 by looking for 's' starting from the 2nd last pattern character to the first pattern character

For the pattern string "Tomss", md2 would be 1.
For the pattern string "Tosas", md2 would be 2.
For the pattern string "Tsmas", md2 would be 3.
For the pattern string "somas", md2 would be 4.

Pseudo Code for Full Implementation

// initialize the cache
long cache = 0L;
for i = 0 to length(pattern) - 1
  cache |= 1L << (pattern[i] & 63);

// calculate md2
last_pattern = pattern[ length(pattern) - 1 ];
md2 = length(pattern)
for i = 0 to length(pattern) - 1
  if last_pattern == pattern[i] then
    md2 = length(pattern) - i;

for i = length(pattern) - 1 to length(source) - 1
  if last_pattern == source[i] then
    // last character matched, match the rest

    for i2 = 0 to length(pattern) - 2
      if pattern[i2] != source[i - length(pattern) + 1] then
        // see if character under cursor is "impossible".
        if ((cache & (1L << (source[i] & 63))) == 0L)
          altskip = i2 + 1;
          altskip = 1;
        // skip the maximum of md2 and impossible calculation
        i += max(md2, altskip);
        continue scan_loop;
    return i - length(pattern) + 1; // everything matched
  if ((cache & (1L << (source[i] & 63))) == 0L)
    i += length(pattern); // got an impossible character
    i += 1;
return not_found;

Performance Estimate

The new string searching mechanism can potentially speed up string searching by 'p' times, where p is the length of the pattern string. The down side would be that there is pre-processing cost of 2p instructions, calculating the impossible bitmap, and md2. When we discovered two instances of impossible characters in the search string, then we should be even in processing time.

This performance profile compares favorably with traditional fast substring scanning algorithms, such as Boyer Moore. Since Java's character is 16 bit wide, a naive implementation of Boyer Moore would require 65536 machine instructions to initialize its lookup table. This is obviously not acceptable when we are scanning any string shorter than 70000 characters long.


Timo Raita, "On Guards and Symbol Dependencies in Substring Search", Software-Practice and Experience 29(11), pg 931-941 (1999)

Daniel M. Sunday, "A Very Fast Substring Search Algorithm", CACM 33(8), pg 133-142 (1990)

R. Boyer and S. Moore, "A Fast String Searching Algorithm", CACM

Rabin-Karp string search algorithm Rabin-Karp algorithm is a string searching algorithm that seeks a pattern, i.e. a substring, within a text by using hashing. It is not widely used for single pattern matching, but is of considerable theoretical importance and is very effective for multiple pattern matching. Historically, it predates the very popular Knuth-Morris-Pratt algorithm. For text of length n and pattern of length m, its average and best case running time is O(n) (meaning proportional to n), but the worst case scenario is O(n*m), which is one of the reasons why it is not widely used.

Let Σ be an alphabet (finite set). Formally, both the pattern and searched text are concatenation of elements of Σ. The Σ may be usual human alphabet (A-Z). Other applications may use binary alphabet (Σ = ) or DNA alphabet (Σ = ) in bioinformatics.
..... Click the link for more information. that works in O(n) time on average. It is based on the use of hashing to compare strings. See Rabin-Karp string search algorithm Rabin-Karp algorithm is a string searching algorithm that seeks a pattern, i.e. a substring, within a text by using hashing. It is not widely used for single pattern matching, but is of considerable theoretical importance and is very effective for multiple pattern matching. Historically, it predates the very popular Knuth-Morris-Pratt algorithm. For text of length n and pattern of length m, its average and best case running time is O(n) (meaning proportional to n), but the worst case scenario is O(n*m), which is one of the reasons why it is not widely used.


FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available in our efforts to advance understanding of environmental, political, human rights, economic, democracy, scientific, and social justice issues, etc. We believe this constitutes a 'fair use' of any such copyrighted material as provided for in section 107 of the US Copyright Law. In accordance with Title 17 U.S.C. Section 107, the material on this site is distributed without profit exclusivly for research and educational purposes.   If you wish to use copyrighted material from this site for purposes of your own that go beyond 'fair use', you must obtain permission from the copyright owner. 

ABUSE: IPs or network segments from which we detect a stream of probes might be blocked for no less then 90 days. Multiple types of probes increase this period.  


Groupthink : Two Party System as Polyarchy : Corruption of Regulators : Bureaucracies : Understanding Micromanagers and Control Freaks : Toxic Managers :   Harvard Mafia : Diplomatic Communication : Surviving a Bad Performance Review : Insufficient Retirement Funds as Immanent Problem of Neoliberal Regime : PseudoScience : Who Rules America : Neoliberalism  : The Iron Law of Oligarchy : Libertarian Philosophy


War and Peace : Skeptical Finance : John Kenneth Galbraith :Talleyrand : Oscar Wilde : Otto Von Bismarck : Keynes : George Carlin : Skeptics : Propaganda  : SE quotes : Language Design and Programming Quotes : Random IT-related quotesSomerset Maugham : Marcus Aurelius : Kurt Vonnegut : Eric Hoffer : Winston Churchill : Napoleon Bonaparte : Ambrose BierceBernard Shaw : Mark Twain Quotes


Vol 25, No.12 (December, 2013) Rational Fools vs. Efficient Crooks The efficient markets hypothesis : Political Skeptic Bulletin, 2013 : Unemployment Bulletin, 2010 :  Vol 23, No.10 (October, 2011) An observation about corporate security departments : Slightly Skeptical Euromaydan Chronicles, June 2014 : Greenspan legacy bulletin, 2008 : Vol 25, No.10 (October, 2013) Cryptolocker Trojan (Win32/Crilock.A) : Vol 25, No.08 (August, 2013) Cloud providers as intelligence collection hubs : Financial Humor Bulletin, 2010 : Inequality Bulletin, 2009 : Financial Humor Bulletin, 2008 : Copyleft Problems Bulletin, 2004 : Financial Humor Bulletin, 2011 : Energy Bulletin, 2010 : Malware Protection Bulletin, 2010 : Vol 26, No.1 (January, 2013) Object-Oriented Cult : Political Skeptic Bulletin, 2011 : Vol 23, No.11 (November, 2011) Softpanorama classification of sysadmin horror stories : Vol 25, No.05 (May, 2013) Corporate bullshit as a communication method  : Vol 25, No.06 (June, 2013) A Note on the Relationship of Brooks Law and Conway Law


Fifty glorious years (1950-2000): the triumph of the US computer engineering : Donald Knuth : TAoCP and its Influence of Computer Science : Richard Stallman : Linus Torvalds  : Larry Wall  : John K. Ousterhout : CTSS : Multix OS Unix History : Unix shell history : VI editor : History of pipes concept : Solaris : MS DOSProgramming Languages History : PL/1 : Simula 67 : C : History of GCC developmentScripting Languages : Perl history   : OS History : Mail : DNS : SSH : CPU Instruction Sets : SPARC systems 1987-2006 : Norton Commander : Norton Utilities : Norton Ghost : Frontpage history : Malware Defense History : GNU Screen : OSS early history

Classic books:

The Peter Principle : Parkinson Law : 1984 : The Mythical Man-MonthHow to Solve It by George Polya : The Art of Computer Programming : The Elements of Programming Style : The Unix Hater’s Handbook : The Jargon file : The True Believer : Programming Pearls : The Good Soldier Svejk : The Power Elite

Most popular humor pages:

Manifest of the Softpanorama IT Slacker Society : Ten Commandments of the IT Slackers Society : Computer Humor Collection : BSD Logo Story : The Cuckoo's Egg : IT Slang : C++ Humor : ARE YOU A BBS ADDICT? : The Perl Purity Test : Object oriented programmers of all nations : Financial Humor : Financial Humor Bulletin, 2008 : Financial Humor Bulletin, 2010 : The Most Comprehensive Collection of Editor-related Humor : Programming Language Humor : Goldman Sachs related humor : Greenspan humor : C Humor : Scripting Humor : Real Programmers Humor : Web Humor : GPL-related Humor : OFM Humor : Politically Incorrect Humor : IDS Humor : "Linux Sucks" Humor : Russian Musical Humor : Best Russian Programmer Humor : Microsoft plans to buy Catholic Church : Richard Stallman Related Humor : Admin Humor : Perl-related Humor : Linus Torvalds Related humor : PseudoScience Related Humor : Networking Humor : Shell Humor : Financial Humor Bulletin, 2011 : Financial Humor Bulletin, 2012 : Financial Humor Bulletin, 2013 : Java Humor : Software Engineering Humor : Sun Solaris Related Humor : Education Humor : IBM Humor : Assembler-related Humor : VIM Humor : Computer Viruses Humor : Bright tomorrow is rescheduled to a day after tomorrow : Classic Computer Humor

The Last but not Least

Copyright © 1996-2016 by Dr. Nikolai Bezroukov. was created as a service to the UN Sustainable Development Networking Programme (SDNP) in the author free time. This document is an industrial compilation designed and created exclusively for educational use and is distributed under the Softpanorama Content License.

The site uses AdSense so you need to be aware of Google privacy policy. You you do not want to be tracked by Google please disable Javascript for this site. This site is perfectly usable without Javascript.

Original materials copyright belong to respective owners. Quotes are made for educational purposes only in compliance with the fair use doctrine.

FAIR USE NOTICE This site contains copyrighted material the use of which has not always been specifically authorized by the copyright owner. We are making such material available to advance understanding of computer science, IT technology, economic, scientific, and social issues. We believe this constitutes a 'fair use' of any such copyrighted material as provided by section 107 of the US Copyright Law according to which such material can be distributed without profit exclusively for research and educational purposes.

This is a Spartan WHYFF (We Help You For Free) site written by people for whom English is not a native language. Grammar and spelling errors should be expected. The site contain some broken links as it develops like a living tree...

You can use PayPal to make a contribution, supporting development of this site and speed up access. In case is down you can use the at


The statements, views and opinions presented on this web page are those of the author (or referenced source) and are not endorsed by, nor do they necessarily reflect, the opinions of the author present and former employers, SDNP or any other organization the author may be associated with. We do not warrant the correctness of the information provided or its fitness for any purpose.

Last modified: September 12, 2017