TITLE: Strange Loop 2010, Day 1 Morning
AUTHOR: Eugene Wallingford
DATE: October 14, 2010 10:02 PM
DESC:
-----
BODY:
I arrived in the University City district this morning
ready for a different kind of conference. The program
is an interesting mix of hot topics in technology,
programming languages, and computing. My heuristic
for selecting talks to attend was this: Of the talks
in each time slot, which discusses the biggest idea I
need to know more about? The only exceptions will be
talks that fill a specific personal need, in teaching,
research, or personal connections.
A common theme among the talks I attended today was
big data. Here are some numbers I jotted down. They
may be out of date by now, or I may have made an error
in some of the details, but the orders of magnitude
are impressive:
- Hilary Mason reported that bit.ly processes tens
  of millions of URLs, hundreds of millions of
  events, and billions of data points each day.
- Google processes 24 petabytes a day, which works
  out to nearly 9 exabytes a year.
- Digg has a 3 terabyte Cassandra database. This is
dwarfed by Facebook's 150-node Cassandra cluster
that comprises 150 terabytes.
- Twitter users generate 12 terabytes of data each
day -- the equivalent of 17,000 CDs -- and more
than 4 petabytes each year. Also astounding:
the data rate is doubling multiple times a year.
All of these numbers speak to the need for more CS
programs to introduce students to
data-intensive computing.
Hilary Mason on Machine Learning
Mason is a
data scientist
at bit.ly, a company that faces "wicked hard problems"
at scale. These problems arise from the combination
of algorithms, on-demand computing, and "data, data,
data". Mason gave an introduction to machine learning
as applied to some of these problems. She had a few
quotable lines:
- AI was "founded on a remarkable conceit". This
reminded me of a
recent entry.
- "Academics like to create new terms, because if
they catch on the academics get prizes."
- "Only a Bayesian can tell you why if there's a
50-50 chance that something will happen, then
90% of the time it will."
She also sprinkled her talk with programming tips.
One of the coolest was that you can append a "+" to
the end of any bit.ly URL to get analytic metrics
for it.
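The trick is pure string manipulation; a minimal sketch (the short link below is hypothetical):

```python
def bitly_info_url(short_url: str) -> str:
    """Return the public analytics page for a bit.ly short link.

    bit.ly serves click metrics at the short URL plus a trailing "+",
    so we just strip any trailing slash and append the plus sign.
    """
    return short_url.rstrip("/") + "+"

print(bitly_info_url("http://bit.ly/abc123"))  # → http://bit.ly/abc123+
```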
Gradle
I had planned to go to a talk on Flickr's architecture,
but I talked too long during the break, and the room
was full before I got there. So I stopped in on a talk
by Ken
Sipe on Gradle, a scriptable Java build system built
on top of Groovy. He had one very nice quote as well:
"I can't think of a real use of this idea. I just like
showing it off." The teacher and hacker in me smiled.
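What makes Gradle "scriptable" is that the build file is itself a Groovy program, so declarations and ordinary code mix freely. A minimal build.gradle sketch (plugin and dependency versions are illustrative, not from the talk):

```groovy
// Minimal Gradle build script: declarative plugin and dependency
// setup, plus an ad hoc task defined in plain Groovy code.
apply plugin: 'java'

repositories {
    mavenCentral()
}

dependencies {
    testCompile 'junit:junit:4.8.2'
}

// Because the build file is a Groovy script, defining a custom
// task is just writing code:
task hello {
    doLast {
        println 'Hello from a scriptable build!'
    }
}
```

Running `gradle hello` would execute the custom task alongside the standard compile and test tasks the java plugin provides.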
Eben Hewitt on Adopting Cassandra
Even if you are not Google, you may have to process a
lot of data. Hewitt talked about some of the efforts
made to scale relational databases to handle big data
(query tuning, indexes, vertical scaling, shards, and
denormalization). This sequence of fixes made me think
of epicycles, fixes applied to orbits of heavenly
bodies in an effort to match observed data. At some
point, you start looking for a theory that fits better
up front.
That's what happened in the computing world, too.
Soon there were a number of data systems defined
mostly by what they are not: "NoSQL". That idea is
not new. Lotus Notes used a document-oriented storage
system; Smalltalk images store universes of objects.
Now, as the problem of big data becomes more prevalent,
new and different implementations have been proposed.
Hewitt talked about Cassandra and what distinguishes
it from other approaches. He called Cassandra the
love child of the
Google BigTable paper
(2006) and the
Amazon Dynamo paper
(2007). He also pointed out some of the limitations
that circumscribe its applicability. In the NoSQL
approaches, there is "no free lunch": you give up
many relational DB advantages, such as transactions,
joins, and ad hoc queries.
He did advocate one idea I'll need to read more
about: that we should shift our attention from the
CAP theorem
of distributed data systems, which is useful but
misses some important dynamic distinctions, to
Abadi's PACELC
model: If the data store is Partitioned,
then you trade off between Availability and
Consistency; Else you trade off
between Latency and Consistency.
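The rule has a pleasingly mechanical shape. A toy encoding of the PACELC decision, just to make the two branches concrete (names are mine, not Abadi's):

```python
def tradeoff(partitioned: bool) -> tuple:
    """Return the pair of properties a data store must trade off,
    per Abadi's PACELC model.

    If the system is Partitioned, the tension is between
    Availability and Consistency; Else (normal operation) it is
    between Latency and Consistency.
    """
    if partitioned:
        return ("availability", "consistency")
    return ("latency", "consistency")

print(tradeoff(True))   # → ('availability', 'consistency')
print(tradeoff(False))  # → ('latency', 'consistency')
```

The point of the Else branch is what CAP misses: even when the network is healthy, a replicated store still pays in latency for every bit of consistency it promises.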
-----