2019-05-15
Probability Theory
Problem: find the correction c, out of all possible candidate corrections, that maximizes the probability that c is the intended correction, given the original word w:

argmax_{c ∈ candidates} P(c|w)

By Bayes' theorem, and because P(w) is the same for every candidate c, this is equivalent to:

argmax_{c ∈ candidates} P(c) P(w|c) / P(w)


The four parts of this expression are:
  • Selection Mechanism: argmax
    We choose the candidate with the highest combined probability.
  • Candidate Model: c ∈ candidates
    This tells us which candidate corrections, c, to consider.
  • Language Model: P(c)
    The probability that c appears as a word of English text. For example, occurrences of "the" make up about 7% of English text, so we should have P(the) = 0.07.
  • Error Model: P(w|c)
    The probability that w would be typed in a text when the author meant c. For example, P(teh|the) is relatively high, but P(theeexyz|the) would be very low.
One obvious question is: why take a simple expression like P(c|w) and replace it with a more complex expression involving two models rather than one? The answer is that P(c|w) is already conflating two factors, and it is easier to separate the two out and deal with them explicitly. Consider the misspelled word w="thew" and the two candidate corrections c="the" and c="thaw". Which has a higher P(c|w)? Well, "thaw" seems good because the only change is "a" to "e", which is a small change. On the other hand, "the" seems good because "the" is a very common word, and while adding a "w" seems like a larger, less probable change, perhaps the typist's finger slipped off the "e". The point is that to estimate P(c|w) we have to consider both the probability of c and the probability of the change from c to w anyway, so it is cleaner to formally separate the two factors.
How It Works in Python
The four parts of the program are:
1. Selection Mechanism: In Python, max with a key argument does 'argmax'.
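A minimal illustration of that idiom (the words and scores below are made up for demonstration): max with a key function returns the element that maximizes the key, which is exactly argmax.

```python
# max with key= computes argmax: it returns the element of the iterable
# for which the key function is largest, not the largest key value itself.
# Illustrative stand-in scores, not real probabilities:
scores = {'thaw': 0.00002, 'the': 0.07}
best = max(['thaw', 'the'], key=scores.get)
print(best)  # → 'the', the candidate with the higher score
```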


2. Candidate Model: First a new concept: a simple edit to a word is a deletion (remove one letter), a transposition (swap two adjacent letters), a replacement (change one letter to another) or an insertion (add a letter). The function edits1 returns a set of all the edited strings (whether words or not) that can be made with one simple edit:
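One way to implement edits1 as described: split the word at every position, then apply each of the four kinds of edit to each split. This is a sketch following the description above.

```python
def edits1(word):
    "All strings that are one simple edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)
```

For example, edits1('somthing') contains 'something' (an insertion) among 442 strings in total.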


This can be a big set. For a word of length n, there will be n deletions, n-1 transpositions, 26n alterations, and 26(n+1) insertions, for a total of 54n+25 (of which a few are typically duplicates). For example, for a word of length 8 that comes to 54×8 + 25 = 457 strings before duplicates are removed.
However, if we restrict ourselves to words that are known—that is, in the dictionary— then the set is much smaller:
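A sketch of a known helper that keeps only dictionary words. In the full program WORDS is the word-count dictionary built from big.txt in the language-model step; a tiny stand-in Counter is used here so the snippet runs on its own.

```python
from collections import Counter

# Stand-in dictionary; the full program builds WORDS from big.txt.
WORDS = Counter({'something': 100, 'the': 300})

def known(words):
    "The subset of `words` that appear in the dictionary WORDS."
    return set(w for w in words if w in WORDS)

print(known(['the', 'thw', 'something']))  # contains 'the' and 'something', not 'thw'
```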


We'll also consider corrections that require two simple edits. This generates a much bigger set of possibilities, but usually only a few of them are known words.
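edits2 can be written as a generator over all single edits of all single edits; edits1 is repeated here so the sketch is self-contained.

```python
def edits1(word):
    "All strings one simple edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    "All strings that are two simple edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))
```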


We say that the results of edits2(w) have an edit distance of 2 from w.
3. Language Model: We can estimate the probability of a word, P(word), by counting the number of times each word appears in a text file of about a million words, big.txt. It is a concatenation of public domain book excerpts from Project Gutenberg and lists of most frequent words from Wiktionary and the British National Corpus. The function words breaks text into words, then the variable WORDS holds a Counter of how often each word appears, and P estimates the probability of each word, based on this Counter.
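A sketch of that language model. Reading big.txt is replaced here by a small inline sample so the snippet is self-contained; in the full program the Counter is built from the million-word file instead.

```python
import re
from collections import Counter

def words(text):
    "Split text into lowercase words."
    return re.findall(r'\w+', text.lower())

# Full program: WORDS = Counter(words(open('big.txt').read()))
sample = 'The quick brown fox jumped over the lazy dog as the thaw began'
WORDS = Counter(words(sample))

def P(word, N=sum(WORDS.values())):
    "Estimated probability of `word`: its count divided by the corpus size."
    return WORDS[word] / N
```

With this 13-word sample, "the" appears 3 times, so P('the') is 3/13; unseen words get probability 0.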


4. Error Model: When I started to write this program, sitting on a plane in 2007, I had no data on spelling errors, and no internet connection (I know that may be hard to imagine today). Without data I couldn't build a good spelling error model, so I took a shortcut: I defined a trivial, flawed error model that says all known words of edit distance 1 are infinitely more probable than known words of edit distance 2, and infinitely less probable than a known word of edit distance 0. So we can make candidates(word) produce the first non-empty list of candidates in order of priority:
  • The original word, if it is known; otherwise
  • The list of known words at edit distance one away, if there are any; otherwise
  • The list of known words at edit distance two away, if there are any; otherwise
  • The original word, even though it is not known.
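Under that priority scheme, candidates can use Python's or-chaining, which returns the first non-empty (truthy) result. A self-contained sketch, with a tiny stand-in dictionary in place of the big.txt counts:

```python
from collections import Counter

WORDS = Counter({'the': 300, 'thaw': 10})  # stand-in for the big.txt counts

def known(words):
    "The subset of `words` that appear in WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All strings one simple edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    "All strings two simple edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

def candidates(word):
    "First non-empty set of candidates, in priority order."
    return known([word]) or known(edits1(word)) or known(edits2(word)) or [word]
```

Here candidates('the') is {'the'} (a known word short-circuits the chain), while candidates('thw') is {'the', 'thaw'}, both one edit away.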


Then we don't need to multiply by a P(w|c) factor, because every candidate at the chosen priority tier has the same probability under this flawed model; the selection mechanism simply returns the candidate with the highest P(c).
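Putting the pieces together, the top-level function is just max over the candidates, keyed by P. As before, a tiny inline corpus stands in for big.txt so the sketch runs on its own.

```python
import re
from collections import Counter

def words(text):
    "Split text into lowercase words."
    return re.findall(r'\w+', text.lower())

# Stand-in corpus; the full program reads big.txt instead.
WORDS = Counter(words('the quick brown fox and the lazy dog saw the thaw'))

def P(word, N=sum(WORDS.values())):
    "Estimated probability of `word`."
    return WORDS[word] / N

def known(words):
    "The subset of `words` that appear in WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All strings one simple edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    "All strings two simple edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

def candidates(word):
    "First non-empty set of candidates, in priority order."
    return known([word]) or known(edits1(word)) or known(edits2(word)) or [word]

def correction(word):
    "Most probable spelling correction for `word`."
    return max(candidates(word), key=P)
```

In this toy corpus, correction('thew') returns 'the': both 'the' and 'thaw' are one edit away, but 'the' occurs more often, so it has the higher P(c).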



All replies
2019-5-15 11:32:51
Thumbs up for this!
2019-5-15 11:36:38
Liked.
2019-5-15 14:08:19
Keep it up!
2019-5-15 23:27:52
Selection Mechanism: argmax
Candidate Model: c ∈ candidates
Language Model: P(c)
Error Model: P(w|c)
These four models are classic. Thumbs up!
