Chomsky derided researchers in machine learning who use purely statistical methods to produce behavior that mimics something in the world, but who don't try to understand the meaning of that behavior. The transcript is now available, so let's quote Chomsky himself:
It's true there's been a lot of work on trying to apply statistical models to various linguistic problems. I think there have been some successes, but a lot of failures. There is a notion of success ... which I think is novel in the history of science. It interprets success as approximating unanalyzed data.
This essay discusses what Chomsky said, speculates on what he might have meant, and tries to determine the truth and importance of his claims.
Chomsky's remarks were in response to Steven Pinker's question about the success of probabilistic models trained with statistical methods. Note that "probabilistic" and "trained" are distinct properties: a probabilistic model specifies a probability distribution over possible outcomes, while a trained model has its parameter values determined from data, and a model can have either property without the other.
For example, a decade before Chomsky, Claude Shannon proposed probabilistic models of communication based on Markov chains of words. If you have a vocabulary of 100,000 words and a second-order Markov model in which the probability of a word depends on the previous two words, then you need a quadrillion (10^15) probability values to specify the model. The only feasible way to learn these 10^15 values is to gather statistics from data and introduce some smoothing method for the many cases where there is no data. Therefore, most (but not all) probabilistic models are trained. Also, many (but not all) trained models are probabilistic.
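To make the training step concrete, here is a minimal sketch of such a model: a second-order (trigram) estimator with add-one smoothing. The toy corpus, default vocabulary size, and choice of Laplace smoothing are illustrative assumptions, not Shannon's actual construction.

```python
from collections import Counter

def train_trigram_model(words, vocab_size=100_000):
    """Estimate P(w3 | w1, w2) from counts, with add-one smoothing.

    Illustrative sketch: the smoothing scheme is one of many possible choices.
    """
    trigram_counts = Counter(zip(words, words[1:], words[2:]))
    bigram_counts = Counter(zip(words, words[1:]))

    def prob(w1, w2, w3):
        # Smoothing gives unseen trigrams a small nonzero probability;
        # without it, most of the 10^15 possible values would be zero.
        return (trigram_counts[(w1, w2, w3)] + 1) / (bigram_counts[(w1, w2)] + vocab_size)

    return prob

corpus = "the quick brown fox jumps over the lazy dog".split()
p = train_trigram_model(corpus)
print(p("the", "quick", "brown"))  # seen once: twice the smoothed floor
print(p("the", "quick", "dog"))    # never seen: small but nonzero
```

With a trillion-word corpus the observed counts dominate the smoothing term, but the number of parameters to estimate remains astronomical.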
As another example, consider the Newtonian model of gravitational attraction, which says that the force between two objects of mass m1 and m2 a distance r apart is given by
F = G m1 m2 / r^2

where G is the universal gravitational constant. This is a trained model because the gravitational constant G is determined by statistical inference over the results of a series of experiments that contain stochastic experimental error. It is also a deterministic (non-probabilistic) model because it states an exact functional relationship. I believe that Chomsky has no objection to this kind of statistical model. Rather, he seems to reserve his criticism for statistical models like Shannon's that have quadrillions of parameters, not just one or two.
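To see in what sense G is "trained," here is a small sketch that simulates noisy force measurements and recovers G by least squares. The masses, separations, and 1% noise level are invented for this illustration.

```python
import numpy as np

G_TRUE = 6.674e-11                      # m^3 kg^-1 s^-2
rng = np.random.default_rng(0)

m1, m2 = 1000.0, 2000.0                 # kg, hypothetical test masses
r = rng.uniform(0.5, 2.0, size=100)     # m, varied separations
x = m1 * m2 / r**2                      # the model says F = G * x
F = G_TRUE * x * (1 + 0.01 * rng.standard_normal(100))  # 1% measurement error

# One-parameter least-squares fit of F = G * x:
G_hat = (x @ F) / (x @ x)
print(f"estimated G = {G_hat:.4e}")     # close to 6.674e-11
```

The whole "training" process reduces to fitting a single number, which is why this kind of statistical model raises no alarm.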
(This example brings up another distinction: the gravitational model is continuous and quantitative whereas the linguistic tradition has favored models that are discrete, categorical, and qualitative: a word is or is not a verb, there is no question of its degree of verbiness. For more on these distinctions, see Chris Manning's article on Probabilistic Syntax.)
A relevant probabilistic statistical model is the ideal gas law, which describes the pressure P of a gas in terms of the number of molecules N, the temperature T, the volume V, and Boltzmann's constant k:
P = N k T / V.
The equation can be derived from first principles using the tools of statistical mechanics. It is an uncertain, incorrect model; the true model would have to describe the motions of individual gas molecules. This model ignores that complexity and summarizes our uncertainty about the location of individual molecules. Thus, even though it is statistical and probabilistic, even though it does not completely model reality, it does provide both good predictions and insight—insight that is not available from trying to understand the true movements of individual molecules.
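As a quick check of the law's predictive power, the sketch below plugs in standard values (one mole of molecules at 20 degrees C in roughly the corresponding molar volume; these numbers are mine, not the essay's) and recovers approximately atmospheric pressure.

```python
# Evaluate P = N k T / V with standard physical constants.
k = 1.380649e-23       # Boltzmann's constant, J/K
N = 6.02214076e23      # number of molecules in one mole
T = 293.15             # 20 degrees C, in kelvin
V = 0.0244             # volume in m^3 (about the molar volume here)

P = N * k * T / V
print(f"P = {P:.0f} Pa")  # ~1e5 Pa, i.e. about one atmosphere
```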
Now let's consider the non-statistical model of spelling expressed by the rule "I before E except after C." Compare that to the probabilistic, trained statistical model:
P(IE)  = 0.0177    P(CIE) = 0.0014    P(*IE) = 0.0163
P(EI)  = 0.0046    P(CEI) = 0.0005    P(*EI) = 0.0041

This model comes from statistics on a corpus of a trillion words of English text. The notation P(IE) is the probability that a word sampled from this corpus contains the consecutive letters "IE." P(CIE) is the probability that a word contains the consecutive letters "CIE," and P(*IE) is the probability of any letter other than C followed by IE. The statistical data confirms that IE is in fact more common than EI, and that the dominance of IE lessens when following a C, but contrary to the rule, CIE is still more common than CEI. Examples of "CIE" words include "science," "society," "ancient," and "species." The disadvantage of the "I before E except after C" model is that it is not very accurate. Consider:

Accuracy("I before E") = 0.0177 / (0.0177 + 0.0046) = 0.793
Accuracy("I before E except after C") = (0.0005 + 0.0163) / (0.0005 + 0.0163 + 0.0014 + 0.0041) = 0.753

A more complex statistical model (say, one that gave the probability of all 4-letter sequences, and/or of all known words) could be ten times more accurate at the task of spelling, but offers little insight into what is going on. (Insight would require a model that knows about phonemes, syllabification, and language of origin. Such a model could be trained (or not) and probabilistic (or not).)
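The accuracy arithmetic is easy to reproduce; the sketch below recomputes both figures from the corpus probabilities quoted above (the dictionary is just a convenient container for them).

```python
P = {"IE": 0.0177, "EI": 0.0046,
     "CIE": 0.0014, "CEI": 0.0005,
     "*IE": 0.0163, "*EI": 0.0041}

# Rule 1, "I before E": always predict IE, so it is right whenever IE occurs.
acc_plain = P["IE"] / (P["IE"] + P["EI"])

# Rule 2, "...except after C": predict EI after C and IE elsewhere.
acc_except = (P["CEI"] + P["*IE"]) / (P["CEI"] + P["*IE"] + P["CIE"] + P["*EI"])

print(f"I before E:                {acc_plain:.2f}")   # about 0.79
print(f"I before E except after C: {acc_except:.2f}")  # about 0.75
```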
As a final example (not of statistical models, but of insight), consider the Theory of Supreme Court Justice Hand-Shaking: when the Supreme Court convenes, all attending justices shake hands with every other justice. The number of attendees, n, must be an integer in the range 0 to 9; what is the total number of handshakes, h, for a given n? Here are three possible explanations:
n: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
h: | 0 | 0 | 1 | 3 | 6 | 10 | 15 | 21 | 28 | 36 |
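Before turning to those explanations, note that the table itself is easy to verify mechanically; a minimal sketch, assuming only that each handshake pairs two distinct justices:

```python
from math import comb

# Each handshake is an unordered pair of justices, so h = C(n, 2) = n(n-1)/2.
for n in range(10):
    print(n, comb(n, 2))   # reproduces the h row: 0 0 1 3 6 10 15 21 28 36
```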