TITLE: StrangeLoop 2: The Future of Databases is In Memory
AUTHOR: Eugene Wallingford
DATE: September 25, 2012 9:35 PM
DESC:
-----
BODY:
The conference opened with a keynote address by Michael Stonebraker, who built Ingres, Postgres, and several other influential database systems. Given all the hubbub about NoSQL the last few years, including at StrangeLoop 2010, this talk brought some historical perspective to a conversation that has been dominated in recent years by youngsters. Stonebraker told the audience that the future is indeed here, but from the outside it will look a lot like the past.

The problem, of course, is "big data". It's big because of volume, velocity, and variety. Stonebraker framed his opening comments in terms of volume. In the traditional database setting back in the 1980s, we all bought airplane tickets through a travel agent who acted, for all meaningful purposes, in the role of a professional terminal operator. We were doing business "at the speed of the intermediary". The ACID properties were inviolable: "Don't you dare lose my data."

Then came change. The internet disintermediated access to databases, cutting intermediaries out of the equation. Volume shot through the roof. PDAs further disintermediated access, removing limitations on the locations from which we accessed our data. Volume shot up even further. Suddenly, databases came to be part of the solution to a much broader class of problems: massively multiplayer online games, ad placement, new forms of commerce. We all know what that meant for volume.

Stonebraker then offered two reality checks to frame the solution to our big data problems. The first involved the cost of computer memory. One terabyte is a really big database for transaction processing, yet 1TB of memory now costs $25-50K. Furthermore, the price of memory is dropping faster than transaction volume is rising. So: the big data problem is really now a problem for main memory.

The second reality check involved database performance. Well under 10% of the time spent by a typical database is spent doing useful work. Over 90% is overhead: managing a buffer pool, latching, locking, and recovery. We can't make databases much faster by creating better data structures or algorithms; a better B-tree can affect only about 4% of application runtime. But if we could eliminate the buffer pool, we could gain up to 25% in performance. We must focus on overhead.

Where to start? We can forget about all the traditional database vendors. They have code lines that are thirty years old and older, and they have to maintain backward compatibility for a huge installed customer base. They are, from the perspective of the future, bloatware. They can't improve.

How about the trend toward NoSQL? We can just use raw data storage and write our own low-level code, optimized to the task. Well, the first thing to realize is that the compiler already translates SQL into lower-level operations. In the world of databases, as almost everywhere else, it is really hard to beat the compiler at its own game. High-level languages are good, and our compilers do an awesome job generating near-optimal code. Moving down an abstraction layer is, Stonebraker says, a fundamental mistake: "Always move code to the data, never data to the code."

Second, we must realize that the ACID properties really are a good thing. More important, they are nearly impossible to retrofit into a system that doesn't already provide them. "Eventually consistent" doesn't really mean eventually consistent if it's possible to sell your last piece of inventory twice.
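
To make the inventory example concrete, here is a minimal sketch of my own, not from the talk, of how two replicas that accept writes independently and reconcile "eventually" can oversell an item; the Replica class and buy() method are invented purely for illustration.

    # Toy model: one item in stock, replicated to two nodes that accept
    # writes independently and only reconcile later ("eventual consistency").

    class Replica:
        def __init__(self, stock):
            self.stock = stock   # this replica's local view of inventory
            self.sold = 0        # purchases this replica has accepted

        def buy(self):
            # The check and the decrement see only local state.
            if self.stock > 0:
                self.stock -= 1
                self.sold += 1
                return True
            return False

    east = Replica(stock=1)
    west = Replica(stock=1)

    # Two customers buy the last item at the same moment, each request
    # routed to a different replica. Both purchases are accepted.
    print(east.buy(), west.buy())          # True True

    # When the replicas later exchange state, no merge rule can undo the
    # fact that two units of a one-unit inventory were sold.
    print("sold:", east.sold + west.sold)  # sold: 2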
In any situation where there exists a pair of non-commutative transactions, "eventually consistent" is a recipe for corruption. So, SQL and ACID are good. Let's keep them.

Stonebraker says that instead of NoSQL databases, we should build "NewSQL" databases that improve performance through innovative architectures. Putting the database in main memory is one way to start. He addressed several common objections to this idea ("But what if the power fails??") by focusing on speed and replication. Recovery may be slow, but performance is wildly better. We should optimize for the most common case and treat exceptional cases for what they are: rare. He briefly mentioned several other components of a new database architecture, such as horizontal scaling across a cluster of nodes, automatic sharding, and optimization via stored procedures targeted at the most common activities. The result is not a general-purpose solution, but then why does it need to be?

Stonebraker said he has a lot of gray hair, but that means he has seen these wars before. It's better to stick with what we know to be valuable and to seek better performance where our technology has taken us.

-----