We’re dealing with much more data. Although advances in storage capacity and CPU speed have allowed databases to keep pace, we’re in a new era where size itself is an important part of the problem, and any significant database needs to be distributed.
We require sub-second responses to queries. In the ’80s, most database queries could run overnight as batch jobs. That’s no longer acceptable. While some analytic functions can still run as overnight batch jobs, we’ve seen the web evolve from static files to complex database-backed sites, and that requires sub-second response times for most queries.
We want applications to be up 24/7. Setting up redundant servers for static HTML files is easy, but replicating the database behind a complex database-backed application is another matter entirely.
We’re seeing many applications in which the database has to soak up data as fast as (or even much faster than) it processes queries: in a logging application or a distributed sensor application, writes can be much more frequent than reads. Batch-oriented ETL (extract, transform, and load) hasn’t disappeared, and won’t, but capturing high-speed data flows is increasingly important.
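To make that ingestion pattern concrete, here is a minimal Python sketch of a write path that buffers incoming records and persists them in small batches rather than as a nightly load; the `store_batch` callable, the batch size, and the flush delay are hypothetical stand-ins for whatever datastore and tuning a real application would use.

```python
import time
from collections import deque

class BufferedWriter:
    """Accumulate incoming records and flush them in small batches, so the
    datastore sees a steady stream of bulk writes instead of one nightly load."""

    def __init__(self, store_batch, batch_size=500, max_delay_s=1.0):
        self._store_batch = store_batch   # callable that persists a list of records
        self._batch_size = batch_size
        self._max_delay_s = max_delay_s
        self._buffer = deque()
        self._last_flush = time.monotonic()

    def write(self, record):
        self._buffer.append(record)
        # Flush when the buffer is full or the oldest record has waited too long.
        if (len(self._buffer) >= self._batch_size or
                time.monotonic() - self._last_flush >= self._max_delay_s):
            self.flush()

    def flush(self):
        if self._buffer:
            batch = list(self._buffer)
            self._buffer.clear()
            self._store_batch(batch)
        self._last_flush = time.monotonic()


# Usage: store_batch here just prints; in practice it would be a bulk insert
# into whatever write-optimized store the application uses.
writer = BufferedWriter(store_batch=lambda batch: print(f"persisted {len(batch)} records"))
for i in range(1200):
    writer.write({"sensor_id": i % 10, "value": i * 0.1})
writer.flush()
```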
We’re frequently dealing with changing data or with unstructured data. The data we collect, and how we use it, grows over time in unpredictable ways. Unstructured data isn’t a particularly new feature of the data landscape, since it has always existed, but we’re increasingly unwilling to force a structure on data a priori.
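As a small illustration of not forcing a structure up front, the sketch below stores records with different shapes in the same collection of a document store. It assumes a local MongoDB instance and the `pymongo` driver, and the database and collection names are made up for illustration.

```python
from pymongo import MongoClient

# Assumes MongoDB is running locally; "events" and "raw" are hypothetical names.
client = MongoClient("mongodb://localhost:27017")
events = client["events"]["raw"]

# Records with different shapes go into the same collection: no table
# definition up front, no ALTER TABLE when a new field shows up next month.
events.insert_one({"type": "pageview", "url": "/home", "ts": "2012-01-15T10:00:00Z"})
events.insert_one({"type": "purchase", "sku": "A-1001", "amount": 29.95,
                   "coupon": "SPRING", "ts": "2012-01-15T10:02:11Z"})
events.insert_one({"type": "sensor", "device": 42, "readings": [3.1, 3.3, 2.9]})
```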
We’re willing to sacrifice our sacred cows. We know that consistency, isolation, and the other transactional guarantees are valuable, of course. But so are other things, like low latency, high availability, and not losing data even if our primary server goes down. The challenges of modern applications make us realize that we sometimes need to weaken one of these constraints in order to achieve another.
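One concrete form this tradeoff takes is tunable consistency. The sketch below assumes a Cassandra cluster reachable through the DataStax Python driver, with hypothetical hosts, keyspace, and table; it issues the same write at two consistency levels. QUORUM waits for a majority of replicas, while ONE returns as soon as any replica acknowledges, trading the risk of briefly stale reads for lower latency and better availability when nodes are down.

```python
from datetime import datetime, timezone

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Hypothetical cluster and keyspace for illustration.
cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
session = cluster.connect("metrics")

# QUORUM: stronger consistency, but the write fails if a majority of replicas
# is unreachable.
strict = SimpleStatement(
    "INSERT INTO readings (device_id, ts, value) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.QUORUM)

# ONE: the write succeeds as long as a single replica acknowledges it, which
# means lower latency and higher availability, at the cost of other replicas
# possibly serving stale data for a while.
relaxed = SimpleStatement(
    "INSERT INTO readings (device_id, ts, value) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.ONE)

now = datetime.now(timezone.utc)
session.execute(strict, (42, now, 3.3))   # blocks until a majority of replicas ack
session.execute(relaxed, (42, now, 3.3))  # returns after any single replica acks
```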