Frontiers in Massive Data Analysis
THE CHALLENGE
Although humans have gathered data since the beginning of recorded
history—indeed, data gathered by ancestral humans provides much of the
raw material for the reconstruction of human history—the rate of acquisition
of data has surged in recent years, with no end in sight. Expectations
have surged as well, with hopes for new scientific discoveries pinned on
emerging massive collections of biological, physical, and social data, and
with major areas of the economy focused on the commercial implications
of massive data.
Although it is difficult to characterize all of the diverse reasons for the
rapid growth in data, a few factors are worth noting. First, many areas
of science are in possession of mature theories that explain a wide range
of phenomena, such that further testing and elaboration of these theories
requires probing extreme phenomena. These probes often generate very
large data sets. An example is the world of particle physics, where massive
data (e.g., petabytes per year for the Large Hadron Collider; 1 petabyte is
10^15 bytes) arises from the new accelerators designed to test aspects of the
Standard Model of particle physics. Second, many areas of science and engineering
have become increasingly exploratory, with large data sets being
gathered outside the context of any particular theory in the hope that new
phenomena will emerge. Examples include the massive data arising from
genome sequencing projects (which can accumulate terabytes of data for each
project; 1 terabyte is 10^12 bytes) as well as the data expected to arise from
the Large Synoptic Survey Telescope, whose volume will be measured in petabytes.
Rapid advances in cost-effective sensing mean that engineers can readily
collect massive amounts of data about complex systems, such as those for
communication networks, the electric grid, and transportation and financial
systems, and use that data for management and control. Third, much
human activity now takes place on the Internet, and this activity generates
data that has substantial commercial and scientific value. In particular,
many commercial enterprises are aiming to provide personalized services
that adapt to individual behaviors and preferences as revealed by data associated
with the individual. Fourth, connecting these other trends is the
significant growth in the deployment of sensor networks that record biological,
physical, and social phenomena at ever-increasing scale, and these
sensor networks are increasingly interconnected.
In general, the hope is that if massive data could be exploited effectively,
science would extend its reach, and technology would become more
adaptive, personalized, and robust. It is appealing to imagine, for example,
a health-care system in which increasingly detailed data are maintained for
each individual—including genomic, cellular, and environmental data—and
in which such data can be combined with data from other individuals and
with results from fundamental biological and medical research, so that
optimized treatments can be designed for each individual. One can also
envision numerous microeconomic consequences of massive data analysis
where preferences and needs at the level of single individuals are combined
with fine-grained descriptions of goods, skills, and services to create new
markets. In general, what is particularly notable about the recent rise in
the prevalence of “big data” is not merely the size of modern data sets, but
rather that their fine-grained nature permits inferences and decisions at the
level of single individuals.
It is natural to be optimistic about the prospects. Several decades of
research and development in databases and search engines have yielded a
wealth of relevant experience in the design of scalable data-centric technology.
In particular, these fields have fueled the advent of cloud computing
and other parallel and distributed platforms that seem well suited to massive
data analysis. Moreover, innovations in the fields of machine learning, data
mining, statistics, and the theory of algorithms have yielded data-analysis
methods that can be applied to ever-larger data sets. Given these advances, and given
arguments that simple algorithms can work better than more sophisticated
algorithms on large-scale data (see, e.g., Halevy et al., 2009), it is natural
to be bullish on big data.
While not entirely unwarranted, such optimism overlooks a number
of major difficulties that arise in attempting to achieve the goals that
are envisioned in discussions of massive data. In part these difficulties
are those familiar from implementations of large-scale databases—involving
finding and mitigating bottlenecks, achieving simplicity and generality
of the programming interface, propagating metadata, designing a system
that is robust to hardware failure, and exploiting parallel and distributed
hardware—all at an unprecedented scale. But the goals for massive data go
beyond the storage, indexing, and querying that have been the province of
classical database systems (and classical search engines), instead focusing
on the ambitious goal of inference. Inference is the problem of turning data
into knowledge, where knowledge often is expressed in terms of variables
(e.g., a patient’s general state of health, or a shopper’s tendency to buy) that
are not present in the data per se, but are present in models that one uses to
interpret the data. Statistical principles are needed to justify the inferential
leap from data to knowledge, and many difficulties arise in attempting to
bring these principles to bear on massive data. Operating without
these principles may yield results that are useless at best and harmful
at worst. In any discussion of massive data and inference, it is essential to
be aware that it is quite possible to turn data into something that resembles
knowledge but actually is not. Moreover, it can be quite difficult to
know that this has happened.
Consider a database where the rows correspond to people and the
columns correspond to “features” that are used to describe people. If the
database contains data on only a thousand people, it may suffice to measure
only a few dozen features (e.g., age, gender, years of education, city of
residence) to make the kinds of distinctions that may be needed to support
assertions of “knowledge.” If the database contains data on several billion
people, however, we are likely to have heightened expectations for the data,
and we will want to measure many more features (e.g., latest magazine
read, culinary preferences, genomic markers, travel patterns) to support the
wider range of inferences that we wish to make on the basis of the data. We
might roughly imagine the number of features scaling linearly in the number
of individuals. Now, the knowledge we wish to obtain from such data is
often expressed in terms of combinations of the features. For example, if
a person lives in Memphis, is male, enjoys reading about gardening, and often
travels to Japan, what is the probability that the person will click on an ad
about life insurance? The problem is that the number of such combinations
grows exponentially with the number of features and, in any given data set,
a vast number of these combinations will appear to be highly predictive of
any given outcome by chance alone.
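The effect described above can be illustrated with a small simulation. The following is a minimal sketch, not a method taken from this report: it assumes binary features and a binary "click" outcome, both generated as independent fair coin flips, so by construction no feature combination carries any real signal. The function names (count_combinations, best_spurious_rule) and all parameter values are hypothetical choices made for illustration only.

```python
import itertools
import math
import random


def count_combinations(num_features, order):
    """Number of ways to choose `order` features out of `num_features`."""
    return math.comb(num_features, order)


def best_spurious_rule(num_people=200, num_features=20, order=3,
                       min_support=5, seed=0):
    """Search every conjunction of `order` features on pure-noise data.

    Returns (combo, rate, support, base_rate), where `rate` is the click
    rate among the people who have every feature in `combo`.
    """
    rng = random.Random(seed)
    # Assumption: features and outcome are independent coin flips,
    # so any apparent predictive power is due to chance alone.
    features = [[rng.random() < 0.5 for _ in range(num_features)]
                for _ in range(num_people)]
    clicked = [rng.random() < 0.5 for _ in range(num_people)]
    base_rate = sum(clicked) / num_people

    best_combo, best_rate, best_support = None, -1.0, 0
    for combo in itertools.combinations(range(num_features), order):
        # Outcomes of the people who match all features in this combination.
        matches = [clicked[i] for i in range(num_people)
                   if all(features[i][j] for j in combo)]
        if len(matches) < min_support:
            continue  # skip subgroups too small to report
        rate = sum(matches) / len(matches)
        if rate > best_rate:
            best_combo, best_rate, best_support = combo, rate, len(matches)
    return best_combo, best_rate, best_support, base_rate


if __name__ == "__main__":
    print("3-feature combinations among 20 features:", count_combinations(20, 3))
    print("all subsets of 20 features:", 2 ** 20)
    combo, rate, support, base = best_spurious_rule()
    print(f"overall click rate: {base:.2f}")
    print(f"best combination {combo}: click rate {rate:.2f} over {support} people")
```

Because the search ranges over many combinations, the best one will typically show a click rate well above the overall rate even though the data contain no signal. Raising num_features or order enlarges the search space and tends to make such chance findings more extreme, which is why a statistical correction for the number of combinations examined is needed before treating any of them as knowledge.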