TITLE: StrangeLoop 2: The Future of Databases is In Memory
AUTHOR: Eugene Wallingford
DATE: September 25, 2012 9:35 PM
DESC:
-----
BODY:
The conference opened with a keynote address by Michael
Stonebraker, who built Ingres, Postgres, and several other
influential database systems. Given all the hubbub about
NoSQL the last few years, including at StrangeLoop 2010,
this talk brought some historical perspective to a
conversation that has been dominated in recent years by
youngsters. Stonebraker told the audience that the future
is indeed here, but from the outside it will look a lot
like the past.
The problem, of course, is "big data". It's big because of
volume, velocity, and variety. Stonebraker framed his opening
comments in terms of volume. In the traditional database
setting back in the 1980s, we all bought airplane tickets
through a travel agent who acted, for all meaningful purposes,
in the role of professional terminal operator. We were doing
business "at the speed of the intermediary". The ACID properties
were inviolable: "Don't you dare lose my data."
Then came change. The internet disintermediated access to
databases, cutting intermediaries out of the equation. Volume
shot through the roof. PDAs further disintermediated access,
removing limitations on the locations from which we accessed
our data. Volume shot up even further. Suddenly, databases
came to be part of the solution to a much broader class of
problems: massively multiplayer online games, ad placement, new forms of
commerce. We all know what that meant for volume.
Stonebraker then offered two reality checks to frame the
solution to our big data problems. The first involved the
cost of computer memory. One terabyte is a really big database
for transaction processing, yet 1TB of memory now costs
$25-50K. Furthermore, the price is dropping faster than
transaction volume is rising. So: the big data problem is
really now a problem for main memory.
The second reality check involved database performance. Well
under 10% of the time spent by a typical database is spent
doing useful work. Over 90% is overhead: managing a buffer
pool, latching, locking, and recovery. We can't make faster
databases by creating better DB data structures or algorithms;
a better B-tree can affect only 4% of application runtime. If
we could eliminate the buffer pool, we could gain up to 25% in
performance. We must focus on overhead.
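
To make the arithmetic behind this reality check concrete, here is a
rough sketch using Amdahl's law. The 4% and 90% figures come from the
talk as reported above; the function name and the choice of Amdahl's
law as the model are my own assumptions, not something Stonebraker
presented.

```python
def overall_speedup(fraction, component_speedup):
    """Amdahl's law: whole-program speedup when a component that takes
    `fraction` of the runtime is made `component_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / component_speedup)

# A B-tree that is twice as fast, touching 4% of runtime.
btree_2x = overall_speedup(0.04, 2.0)

# Upper bound: the B-tree work becomes free.
btree_limit = overall_speedup(0.04, float("inf"))

# Upper bound: all of the overhead (taken here as 90% of runtime) goes away.
overhead_limit = overall_speedup(0.90, float("inf"))

print(f"2x faster B-tree:    {btree_2x:.3f}x overall")
print(f"free B-tree:         {btree_limit:.3f}x overall")
print(f"no overhead at all:  {overhead_limit:.1f}x overall")
```

Under these assumptions, even a free B-tree buys only about 4%, while
removing the overhead is worth roughly an order of magnitude.
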
Where to start? We can forget about all the traditional
database vendors. They have code lines that are thirty years
old and older. They have to manage backward compatibility for
a huge installed customer base. They are, from the perspective
of the future, bloatware. They can't improve.
How about the trend toward NoSQL? We can just use raw data
storage and write our own low-level code, optimized to the task.
Well, the first thing to realize is that the compiler already
translates SQL into lower-level operations. In the world of
databases as almost everywhere else, it is really hard to beat
the compiler at its game. High-level languages are good, and
our compilers do an awesome job generating near-optimal code.
Moving down an abstraction layer is, Stonebraker says, a
fundamental mistake: "Always move code to the data, never data
to the code."
Second, we must realize that the ACID properties really are a
good thing. More important, they are nearly impossible to
retrofit into a system that doesn't already provide them.
"Eventually consistent" doesn't really mean eventually
consistent if it's possible to sell your last piece of inventory.
In any situation where there exists a pair of non-commutative
transactions, "eventually consistent" is a recipe for corruption.
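
A toy illustration of the non-commutativity point, not taken from the
talk: two replicas receive the same pair of updates in different
orders. The account balance, the operations, and all names below are
hypothetical.

```python
from functools import reduce

def replay(initial, operations):
    """Apply operations left to right and return the final value."""
    return reduce(lambda value, op: op(value), operations, initial)

def deposit_100(balance):
    return balance + 100

def deposit_50(balance):
    return balance + 50

def add_10_percent(balance):
    # Integer arithmetic keeps the example exact.
    return balance * 110 // 100

start = 100

# Two deposits commute: both replicas end up in the same state.
commutative_a = replay(start, [deposit_100, deposit_50])
commutative_b = replay(start, [deposit_50, deposit_100])

# A deposit and an interest payment do not commute: the replicas diverge.
diverged_a = replay(start, [deposit_100, add_10_percent])
diverged_b = replay(start, [add_10_percent, deposit_100])

print(commutative_a, commutative_b)
print(diverged_a, diverged_b)
```

With the commuting pair, delivery order does not matter and the
replicas converge. With the second pair, each replica settles on a
different balance, and no amount of waiting reconciles them.
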
So, SQL and ACID are good. Let's keep them. Stonebraker says
that instead of NoSQL databases, we should build "NewSQL"
databases that improve performance through innovative
architectures. Putting the database in main memory is one way
to start. He addressed several common objections to this idea
("But what if the power fails??") by focusing on speed and
replication. Recovery may be slow, but performance is wildly
better. We should optimize for the most common case and treat
exceptional cases for what they are: rare.
He mentioned briefly several other components of a new database
architecture, such as horizontal scaling across a cluster of
nodes, automatic sharding, and optimization via stored procedures
targeted at the most common activities. The result is not a
general-purpose solution, but then why does it need to be?
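
For readers who have not met sharding before, a minimal sketch of the
idea: hash each key to pick the node that owns it. Stonebraker did not
describe a particular scheme, so the hash choice, the key format, and
the function names here are all assumptions made for illustration.

```python
import hashlib

def shard_for(key, shard_count):
    """Map a key to a shard number in [0, shard_count) using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count

def group_by_shard(keys, shard_count):
    """Return a dict from shard number to the keys that shard would hold."""
    shards = {number: [] for number in range(shard_count)}
    for key in keys:
        shards[shard_for(key, shard_count)].append(key)
    return shards

# Hypothetical customer keys spread over a four-node cluster.
keys = [f"customer:{n}" for n in range(1, 9)]
placement = group_by_shard(keys, 4)

for number, members in placement.items():
    print(number, members)
```

Simple modulo placement like this reshuffles most keys whenever the
number of nodes changes, which is one reason real systems use more
elaborate schemes.
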
"I have a lot of gray hair," Stonebraker said, but that means he
has seen these wars before. It's better to stick with what we
know to be valuable and seek better performance where our
technology has taken us.
-----