Travess Smalley The Unreasonable Effectiveness of Data

The Unreasonable Effectiveness of Data
By Alon Halevy, Peter Norvig, and Fernando Pereira

The Unreasonable Effectiveness of Data
By Alon Halevy, Peter Norvig, and Fernando Pereira, 2009

     Eugene Wigner’s article “The Unreasonable Effectiveness of Mathematics in the Natural Sciences” examines why so much of physics can be neatly explained with simple mathematical formulas such as f = ma or e = mc2. Meanwhile, sciences that involve human beings rather than elementary particles have proven more resistant to elegant mathematics. Economists suffer from physics envy over their inability to neatly model human behavior. An informal, incomplete grammar of the English language runs over 1,700 pages. Perhaps when it comes to natural language processing and related fields, we’re doomed to complex theories that will never have the elegance of physics equations. But if that’s so, we should stop acting as if our goal is to author extremely elegant theories, and instead embrace complexity and make use of the best ally we have: the unreasonable effectiveness of data.

     One of us, as an undergraduate at Brown University, remembers the excitement of having access to the Brown Corpus, containing one million English words. Since then, our field has seen several notable corpora that are about 100 times larger, and in 2006, Google released a trillion-word corpus with frequency counts for all sequences up to five words long. In some ways this corpus is a step backwards from the Brown Corpus: it’s taken from unfiltered Web pages and thus contains incomplete sentences, spelling errors, grammatical errors, and all sorts of other errors. It’s not annotated with carefully hand-corrected part-of-speech tags. But the fact that it’s a million times larger than the Brown Corpus outweighs these drawbacks. A trillion-word corpus—along with other Web-derived corpora of millions, billions, or trillions of links, videos, images, tables, and user interactions—captures even very rare aspects of human behavior. So, this corpus could serve as the basis of a complete model for certain tasks—if only we knew how to extract the model from the data.

Learning from Text at Web Scale

     The biggest successes in natural-language-related machine learning have been statistical speech recognition and statistical machine translation. The reason for these successes is not that these tasks are easier than other tasks; they are in fact much harder than tasks such as document classification that extract just a few bits of information from each document. The reason is that translation is a natural task routinely done every day for a real human need (think of the operations of the European Union or of news agencies). The same is true of speech transcription (think of closed-caption broadcasts). In other words, a large training set of the input-output behavior that we seek to automate is available to us in the wild. In contrast, traditional natural language processing problems such as document classification, part-of-speech tagging, named-entity recognition, or parsing are not routine tasks, so they have no large corpus available in the wild. Instead, a corpus for these tasks requires skilled human annotation. Such annotation is not only slow and expensive to acquire but also difficult for experts to agree on, being bedeviled by many of the difficulties we discuss later in relation to the Semantic Web. The first lesson of Web-scale learning is to use available large-scale data rather than hoping for annotated data that isn’t available. For instance, we find that useful semantic relationships can be automatically learned from the statistics of search queries and the corresponding results or from the accumulated evidence of Web-based text patterns and formatted tables, in both cases without needing any manually annotated data.

     Another important lesson from statistical methods in speech recognition and machine translation is that memorization is a good policy if you have a lot of training data. The statistical language models that are used in both tasks consist primarily of a huge database of probabilities of short sequences of consecutive words (n-grams). These models are built by counting the number of occurrences of each n-gram sequence from a corpus of billions or trillions of words. Researchers have done a lot of work in estimating the probabilities of new n-grams from the frequencies of observed n-grams (using, for example, Good-Turing or Kneser-Ney smoothing), leading to elaborate probabilistic models. But invariably, simple models and a lot of data trump more elaborate models based on less data. Similarly, early work on machine translation relied on elaborate rules for the relationships between syntactic and semantic patterns in the source and target languages. Currently, statistical translation models consist mostly of large memorized phrase tables that give candidate mappings between specific source- and target-language phrases.

     Instead of assuming that general patterns are more effective than memorizing specific phrases, today’s translation models introduce general rules only when they improve translation over just memorizing particular phrases (for instance, in rules for dates and numbers). Similar observations have been made in every other application of machine learning to Web data: simple n-gram models or linear classifiers based on millions of specific features perform better than elaborate models that try to discover general rules. In many cases there appears to be a threshold of sufficient data. For example, James Hays and Alexei A. Efros addressed the task of scene completion: removing an unwanted, unsightly automobile or exspouse from a photograph and filling in the background with pixels taken from a large corpus of other photos.

     With a corpus of thousands of photos, the results were poor. But once they accumulated millions of photos, the same algorithm performed quite well. We know that the number of grammatical English sentences is theoretically infinite and the number of possible 2-Mbyte photos is 2562,000,000. However, in practice we humans care to make only a finite number of distinctions. For many tasks, once we have a billion or so examples, we essentially have a closed set that represents (or at least approximates) what we need, without generative rules.

     For those who were hoping that a small number of general rules could explain language, it is worth noting that language is inherently complex, with hundreds of thousands of vocabulary words and a vast variety of grammatical constructions. Every day, new words are coined and old usages are modified. This suggests that we can’t reduce what we want to say to the free combination of a few abstract primitives.

Read the full PDF