TITLE: Strange Loop 2010, Day 1 Morning
AUTHOR: Eugene Wallingford
DATE: October 14, 2010 10:02 PM
DESC:
-----
BODY:
I arrived in the University City district this morning
ready for a different kind of conference. The program
is an interesting mix of hot topics in technology,
programming languages, and computing. My heuristic
for selecting talks to attend was this: Of the talks
in each time slot, which discusses the biggest idea I
need to know more about? The only exceptions will be
talks that fill a specific personal need, in teaching,
research, or personal connections.
A common theme among the talks I attended today was
big data. Here are some numbers I jotted down. They
may be out of date by now, or I may have made an error
in some of the details, but the orders of magnitude
are impressive:
- Hilary Mason reported that bit.ly processes tens
  of millions of URLs, hundreds of millions of
  events, and billions of data points each day.
- Google processes 24 petabytes a day, which works
  out to nearly 9 exabytes a year.
- Digg has a 3 terabyte Cassandra database. This is
dwarfed by Facebook's 150-node Cassandra cluster
that comprises 150 terabytes.
- Twitter users generate 12 terabytes of data each
day -- the equivalent of 17,000 CDs -- and more
than 4 petabytes each year. Also astounding:
the data rate is doubling multiple times a year.
All of these numbers speak to the need for more CS
programs to introduce students to
data-intensive computing.
Hilary Mason on Machine Learning
Mason is a
data scientist
at bit.ly, a company that faces "wicked hard problems"
at scale. These problems arise from the combination
of algorithms, on-demand computing, and "data, data,
data". Mason gave an introduction to machine learning
as applied to some of these problems. She had a few
quotable lines:
- AI was "founded on a remarkable conceit". This
reminded me of a
recent entry.
- "Academics like to create new terms, because if
they catch on the academics get prizes."
- "Only a Bayesian can tell you why if there's a
50-50 chance that something will happen, then
90% of the time it will."
She also sprinkled her talk with programming tips.
One of the coolest was that you can append a "+" to
the end of any bit.ly URL to get analytic metrics
for it.
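The trick is pure string manipulation; a minimal sketch (the short link below is hypothetical):

```python
def bitly_info_url(short_url: str) -> str:
    """Return the public analytics page for a bit.ly short link.

    bit.ly serves click metrics at the short URL plus a trailing "+",
    so we just strip any trailing slash and append the plus sign.
    """
    return short_url.rstrip("/") + "+"

print(bitly_info_url("http://bit.ly/abc123"))  # → http://bit.ly/abc123+
```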
Gradle
I had planned to go to a talk on Flickr's architecture,
but I talked too long during the break, and the room
was full before I got there. So I stopped in on a talk
by Ken
Sipe on Gradle, a scriptable Java build system built
on top of Groovy. He had one very nice quote as well:
"I can't think of a real use of this idea. I just like
showing it off." The teacher and hacker in me smiled.
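What makes Gradle "scriptable" is that the build file is itself a Groovy program, so declarations and ordinary code mix freely. A minimal build.gradle sketch (plugin and dependency versions are illustrative, not from the talk):

```groovy
// Minimal Gradle build script: declarative plugin and dependency
// setup, plus an ad hoc task defined in plain Groovy code.
apply plugin: 'java'

repositories {
    mavenCentral()
}

dependencies {
    testCompile 'junit:junit:4.8.2'
}

// Because the build file is a Groovy script, defining a custom
// task is just writing code:
task hello {
    doLast {
        println 'Hello from a scriptable build!'
    }
}
```

Running `gradle hello` would execute the custom task alongside the standard compile and test tasks the java plugin provides.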
Eben Hewitt on Adopting Cassandra
Even if you are not Google, you may have to process a
lot of data. Hewitt talked about some of the efforts
made to scale relational databases to handle big data
(query tuning, indexes, vertical scaling, shards, and
denormalization). This sequence of fixes made me think
of epicycles, fixes applied to orbits of heavenly
bodies in an effort to match observed data. At some
point, you start looking for a theory that fits better
up front.
That's what happened in the computing world, too.
Soon there were a number of data systems defined
mostly by what they are not: "NoSQL". That idea is
not new. Lotus Notes used a document-oriented storage
system; Smalltalk images store universes of objects.
Now, as the problem of big data becomes more prevalent,
new and different implementations have been proposed.
Hewitt talked about Cassandra and what distinguishes
it from other approaches. He called Cassandra the
love child of the
Google BigTable paper
(2006) and the
Amazon Dynamo paper
(2007). He also pointed out some of the limitations
that circumscribe its applicability. In the NoSQL
approaches, there is "no free lunch": you give up
many relational DB advantages, such as transactions,
joins, and ad hoc queries.
He did advocate one idea I'll need to read more
about: that we should shift our attention from the
CAP theorem
of distributed data systems, which is useful but
misses some important dynamic distinctions, to
Abadi's PACELC
model: If the data store is Partitioned,
then you trade off between Availability and
Consistency; Else you trade off
between Latency and Consistency.
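The rule has a pleasingly mechanical shape. A toy encoding of the PACELC decision, just to make the two branches concrete (names are mine, not Abadi's):

```python
def tradeoff(partitioned: bool) -> tuple:
    """Return the pair of properties a data store must trade off,
    per Abadi's PACELC model.

    If the system is Partitioned, the tension is between
    Availability and Consistency; Else (normal operation) it is
    between Latency and Consistency.
    """
    if partitioned:
        return ("availability", "consistency")
    return ("latency", "consistency")

print(tradeoff(True))   # → ('availability', 'consistency')
print(tradeoff(False))  # → ('latency', 'consistency')
```

The point of the Else branch is what CAP misses: even when the network is healthy, a replicated store still pays in latency for every bit of consistency it promises.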
-----