2012-11-09
This is Peter Norvig, the head of Google Research, responding to and rebutting MIT linguistics professor Noam Chomsky's earlier criticism of using statistical learning methods for NLP and language acquisition. The essay explains many basic concepts and much of the background in plain, accessible terms, and is well worth reading. Only a few sections are excerpted here; the full text is at http://norvig.com/chomsky.html

On Chomsky and the Two Cultures of Statistical Learning

At the Brains, Minds, and Machines symposium held during MIT's 150th birthday party, Technology Review reports that Prof. Noam Chomsky

derided researchers in machine learning who use purely statistical methods to produce behavior that mimics something in the world, but who don't try to understand the meaning of that behavior.
The transcript is now available, so let's quote Chomsky himself:
It's true there's been a lot of work on trying to apply statistical models to various linguistic problems. I think there have been some successes, but a lot of failures. There is a notion of success ... which I think is novel in the history of science. It interprets success as approximating unanalyzed data.

This essay discusses what Chomsky said, speculates on what he might have meant, and tries to determine the truth and importance of his claims.


[Photo: Noam Chomsky]

Chomsky's remarks were in response to Steven Pinker's question about the success of probabilistic models trained with statistical methods.

  • What did Chomsky mean, and is he right?
  • What is a statistical model?
  • How successful are statistical language models?
  • Is there anything like their notion of success in the history of science?
  • What doesn't Chomsky like about statistical models?
What did Chomsky mean, and is he right?

I take Chomsky's points to be the following:
  • Statistical language models have had engineering success, but that is irrelevant to science.
  • Accurately modeling linguistic facts is just butterfly collecting; what matters in science (and specifically linguistics) is the underlying principles.
  • Statistical models are incomprehensible; they provide no insight.
  • Statistical models may provide an accurate simulation of some phenomena, but the simulation is done completely the wrong way; people don't decide what the third word of a sentence should be by consulting a probability table keyed on the previous two words, rather they map from an internal semantic form to a syntactic tree-structure, which is then linearized into words. This is done without any probability or statistics.
  • Statistical models have been proven incapable of learning language; therefore language must be innate, so why are these statistical modelers wasting their time on the wrong enterprise?
Is he right? That's a long-standing debate. These are my answers:
  • I agree that engineering success is not the goal or the measure of science. But I observe that science and engineering develop together, and that engineering success shows that something is working right, and so is evidence (but not proof) of a scientifically successful model.
  • Science is a combination of gathering facts and making theories; neither can progress on its own. I think Chomsky is wrong to push the needle so far towards theory over facts; in the history of science, the laborious accumulation of facts is the dominant mode, not a novelty. The science of understanding language is no different than other sciences in this respect.
  • I agree that it can be difficult to make sense of a model containing billions of parameters. Certainly a human can't understand such a model by inspecting the values of each parameter individually. But one can gain insight by examining the properties of the model—where it succeeds and fails, how well it learns as a function of data, etc.
  • I agree that a Markov model of word probabilities cannot model all of language. It is equally true that a concise tree-structure model without probabilities cannot model all of language. What is needed is a probabilistic model that covers words, trees, semantics, context, discourse, etc. Chomsky dismisses all probabilistic models because of shortcomings of particular 50-year-old models. I understand how Chomsky arrives at the conclusion that probabilistic models are unnecessary, from his study of the generation of language. But the vast majority of people who study interpretation tasks, such as speech recognition, quickly see that interpretation is an inherently probabilistic problem: given a stream of noisy input to my ears, what did the speaker most likely mean? (A minimal sketch of this noisy-channel reasoning follows this list.) Einstein said to make everything as simple as possible, but no simpler. Many phenomena in science are stochastic, and the simplest model of them is a probabilistic model; I believe language is such a phenomenon and therefore that probabilistic models are our best tool for representing facts about language, for algorithmically processing language, and for understanding how humans process language.
  • In 1967, Gold's Theorem showed some theoretical limitations of logical deduction on formal mathematical languages. But this result has nothing to do with the task faced by learners of natural language. In any event, by 1969 we knew that probabilistic inference (over probabilistic context-free grammars) is not subject to those limitations (Horning showed that learning of PCFGs is possible). I agree with Chomsky that it is undeniable that humans have some innate capability to learn natural language, but we don't know enough about that capability to rule out probabilistic language representations, nor statistical learning. I think it is much more likely that human language learning involves something like probabilistic and statistical inference, but we just don't know yet.
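As a minimal illustration of that noisy-channel view of interpretation, the Python sketch below picks the word w maximizing P(sound | w) × P(w). The vocabulary, priors, and acoustic likelihoods are invented for illustration; they do not come from any real recognizer.

```python
# Noisy-channel interpretation: choose the word w maximizing
# P(w | sound) ∝ P(sound | w) * P(w).  All numbers are toy values.

prior = {"two": 0.006, "too": 0.002, "to": 0.025}   # toy language model P(w)
likelihood = {"two": 0.9, "too": 0.9, "to": 0.8}    # toy acoustic model P(sound | w)

def most_likely_word(candidates):
    """Return the candidate maximizing P(sound | w) * P(w)."""
    return max(candidates, key=lambda w: likelihood[w] * prior[w])

print(most_likely_word(["two", "too", "to"]))  # -> 'to'
```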
Now let me back up my answers with a more detailed look at the remaining questions.

All replies
2012-11-9 10:54:28
What is a statistical model?

A statistical model is a mathematical model which is modified or trained by the input of data points. Statistical models are often but not always probabilistic. Where the distinction is important we will be careful not to just say "statistical" but to use the following component terms:
  • A mathematical model specifies a relation among variables, either in functional form that maps inputs to outputs (e.g. y = m x + b) or in relation form (e.g. the following (x, y) pairs are part of the relation).
  • A probabilistic model specifies a probability distribution over possible values of random variables, e.g., P(x, y), rather than a strict deterministic relationship, e.g., y = f(x).
  • A trained model uses some training/learning algorithm to take as input a collection of possible models and a collection of data points (e.g. (x, y) pairs) and select the best model. Often this is in the form of choosing the values of parameters (such as m and b above) through a process of statistical inference.
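To make those three terms concrete, here is a minimal Python sketch (the data points are invented for illustration): the functional form y = m x + b is the mathematical model, and selecting m and b from data by least squares makes it a trained model. It remains deterministic rather than probabilistic, since it assigns no distribution over outputs.

```python
# A mathematical model in functional form: y = m*x + b.
# Training selects the parameters m and b from (x, y) data points
# by closed-form ordinary least squares.

def fit_line(points):
    """Return least-squares estimates of slope m and intercept b."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    m = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - m * sx) / n
    return m, b

data = [(0, 1.1), (1, 2.9), (2, 5.2), (3, 6.8)]  # toy data points
m, b = fit_line(data)
print(f"y = {m:.2f} x + {b:.2f}")  # y = 1.94 x + 1.09 on these points
```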

[Photo: Claude Shannon]

For example, a decade before Chomsky, Claude Shannon proposed probabilistic models of communication based on Markov chains of words. If you have a vocabulary of 100,000 words and a second-order Markov model in which the probability of a word depends on the previous two words, then you need a quadrillion (10^15) probability values to specify the model. The only feasible way to learn these 10^15 values is to gather statistics from data and introduce some smoothing method for the many cases where there is no data. Therefore, most (but not all) probabilistic models are trained. Also, many (but not all) trained models are probabilistic.
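Here is a minimal sketch of the kind of model Shannon had in mind, with add-one (Laplace) smoothing standing in for "some smoothing method"; the training text is a toy stand-in, not a real corpus.

```python
from collections import Counter

# Second-order (trigram) Markov model: P(w3 | w1, w2), trained by
# counting and smoothed with add-one (Laplace) smoothing.

text = "the dog ran . the cat ran . the dog sat .".split()
vocab = set(text)

trigrams = Counter(zip(text, text[1:], text[2:]))
bigrams = Counter(zip(text, text[1:]))

def prob(w1, w2, w3):
    """Smoothed estimate of P(w3 | w1, w2)."""
    return (trigrams[(w1, w2, w3)] + 1) / (bigrams[(w1, w2)] + len(vocab))

print(prob("the", "dog", "ran"))  # seen trigram: 0.25
print(prob("the", "cat", "sat"))  # unseen trigram: nonzero thanks to smoothing
```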

As another example, consider the Newtonian model of gravitational attraction, which says that the force between two objects of mass m1 and m2 a distance r apart is given by

F = G m1 m2 / r^2
where G is the universal gravitational constant. This is a trained model because the gravitational constant G is determined by statistical inference over the results of a series of experiments that contain stochastic experimental error. It is also a deterministic (non-probabilistic) model because it states an exact functional relationship. I believe that Chomsky has no objection to this kind of statistical model. Rather, he seems to reserve his criticism for statistical models like Shannon's that have quadrillions of parameters, not just one or two.
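One can mimic that inference in a few lines of Python: given noisy measurements of F for known masses and separations (the numbers below are fabricated for illustration), a least-squares estimate of G is the slope of F regressed through the origin on m1·m2/r^2.

```python
# Estimate the gravitational constant G from noisy measurements.
# Model: F = G * (m1 * m2 / r**2), so G is the least-squares slope
# (through the origin) of F on x = m1*m2/r**2.

# Fabricated (m1, m2, r, measured F) values with experimental noise.
measurements = [
    (1000.0, 1000.0, 1.0, 6.71e-5),
    (2000.0, 1500.0, 2.0, 4.99e-5),
    (500.0,  800.0,  0.5, 1.08e-4),
]

num = den = 0.0
for m1, m2, r, F in measurements:
    x = m1 * m2 / r**2
    num += x * F   # accumulate sum of x_i * F_i
    den += x * x   # accumulate sum of x_i^2

G = num / den      # least-squares slope through the origin
print(f"G ≈ {G:.3e}")  # ≈ 6.7e-11 on these toy numbers
```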

(This example brings up another distinction: the gravitational model is continuous and quantitative whereas the linguistic tradition has favored models that are discrete, categorical, and qualitative: a word is or is not a verb, there is no question of its degree of verbiness. For more on these distinctions, see Chris Manning's article on Probabilistic Syntax.)

A relevant probabilistic statistical model is the ideal gas law, which describes the pressure P of a gas in terms of the number of molecules N, the volume V, the temperature T, and Boltzmann's constant k:

P = N k T / V.

The equation can be derived from first principles using the tools of statistical mechanics. It is an uncertain, incorrect model; the true model would have to describe the motions of individual gas molecules. This model ignores that complexity and summarizes our uncertainty about the location of individual molecules. Thus, even though it is statistical and probabilistic, even though it does not completely model reality, it does provide both good predictions and insight—insight that is not available from trying to understand the true movements of individual molecules.

Now let's consider the non-statistical model of spelling expressed by the rule "I before E except after C." Compare that to the probabilistic, trained statistical model:

P(IE) = 0.0177    P(CIE) = 0.0014    P(*IE) = 0.0163
P(EI) = 0.0046    P(CEI) = 0.0005    P(*EI) = 0.0041
This model comes from statistics on a corpus of a trillion words of English text. The notation P(IE) is the probability that a word sampled from this corpus contains the consecutive letters "IE." P(CIE) is the probability that a word contains the consecutive letters "CIE", and P(*IE) is the probability of any letter other than C followed by IE. The statistical data confirms that IE is in fact more common than EI, and that the dominance of IE lessens when following a C, but contrary to the rule, CIE is still more common than CEI. Examples of "CIE" words include "science," "society," "ancient" and "species." The disadvantage of the "I before E except after C" model is that it is not very accurate. Consider:

Accuracy("I before E") = 0.0177 / (0.0177 + 0.0046) = 0.793
Accuracy("I before E except after C") = (0.0005 + 0.0163) / (0.0005 + 0.0163 + 0.0014 + 0.0041) = 0.753

A more complex statistical model (say, one that gave the probability of all 4-letter sequences, and/or of all known words) could be ten times more accurate at the task of spelling, but offers little insight into what is going on. (Insight would require a model that knows about phonemes, syllabification, and language of origin. Such a model could be trained (or not) and probabilistic (or not).)
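These statistics can be recomputed from any word list. The minimal sketch below counts words rather than corpus-frequency-weighted tokens (as Norvig's trillion-word counts are), so its numbers will differ from those above; the tiny word list is only a placeholder.

```python
def spelling_stats(words):
    """Fraction of words containing IE, EI, CIE, CEI, *IE (IE not after C),
    and *EI (EI not after C).  A word containing both a CIE and a non-C IE
    is counted once, under CIE -- a simplification for this sketch."""
    n = len(words)
    counts = dict.fromkeys(["IE", "EI", "CIE", "CEI", "*IE", "*EI"], 0)
    for w in words:
        w = w.upper()
        if "IE" in w:
            counts["IE"] += 1
            counts["CIE" if "CIE" in w else "*IE"] += 1
        if "EI" in w:
            counts["EI"] += 1
            counts["CEI" if "CEI" in w else "*EI"] += 1
    return {k: v / n for k, v in counts.items()}

words = ["science", "believe", "receive", "ancient", "weird", "field"]
print(spelling_stats(words))
```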

As a final example (not of statistical models, but of insight), consider the Theory of Supreme Court Justice Hand-Shaking: when the Supreme Court convenes, all attending justices shake hands with every other justice. The number of attendees, n, must be an integer in the range 0 to 9; what is the total number of handshakes, h, for a given n? Here are three possible explanations:

  • Each of n justices shakes hands with the other n - 1 justices, but that counts Alito/Breyer and Breyer/Alito as two separate shakes, so we should cut the total in half, and we end up with h = n × (n - 1) / 2.
  • To avoid double-counting, we will order the justices by seniority and only count a more-senior/more-junior handshake, not a more-junior/more-senior one. So we count, for each justice, the shakes with the more junior justices, and sum them up, giving h = Σi = 1 .. n (i - 1).
  • Just look at this table:
    n: 0  1  2  3  4  5  6  7  8  9
    h: 0  0  1  3  6  10  15  21  28  36

Some people might prefer A, some might prefer B, and if you are slow at doing multiplication or addition you might prefer C. Why? All three explanations describe exactly the same theory — the same function from n to h, over the entire domain of possible values of n. Thus we could prefer A (or B) over C only for reasons other than the theory itself. We might find that A or B gave us a better understanding of the problem. A and B are certainly more useful than C for figuring out what happens if Congress exercises its power to add an additional associate justice. Theory A might be most helpful in developing a theory of handshakes at the end of a hockey game (when each player shakes hands with players on the opposing team) or in proving that the number of people who shook an odd number of hands at the MIT Symposium is even.
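That all three explanations describe the same function is easy to verify in a minimal Python sketch:

```python
# Three "theories" of the handshake count h(n) -- a formula, a
# summation, and a lookup table -- identical as functions on 0..9.

def h_formula(n):   # Theory A: n choose 2
    return n * (n - 1) // 2

def h_sum(n):       # Theory B: each justice shakes with the more junior ones
    return sum(i - 1 for i in range(1, n + 1))

TABLE = [0, 0, 1, 3, 6, 10, 15, 21, 28, 36]

def h_table(n):     # Theory C: just look it up
    return TABLE[n]

assert all(h_formula(n) == h_sum(n) == h_table(n) for n in range(10))
# Only A and B extend beyond the table, e.g. to a ten-justice court:
print(h_formula(10))  # 45
```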
