TITLE: Workshop 3: The Next Generation
AUTHOR: Eugene Wallingford
DATE: November 19, 2007 4:41 PM
DESC:
-----
BODY:
[A transcript of the
SECANT 2007 workshop:
Table of Contents]
The highlight for me of the final morning of the
SECANT workshop
was a session on the "next generation of scientists in the
workforce". It consisted of presentations on what scientists
are doing out in the world and how computer scientists are
helping them.
Chris Hoffman
gave a talk on applications of geometric computing in
industry. He gave examples from two domains, the
parametric design of mechanical systems and the study of
the structure and behaviors of proteins. I didn't follow
the virus example very well, but the issue seems to lie in
understanding the icosahedral symmetry that characterizes
many viruses. The common theme in the two applications
is constraint resolution, a standard computational
technique. In the design case, the problem is represented
as a graph, and graph decomposition is used to create
local plans. Arriving at a satisfactory design requires
solving a system of constraint equations, locally and
globally. A system of constraints is also used to model
the features of a virus capsid and then solved to learn
how the virus behaves.
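To make the constraint-solving idea concrete, here is a minimal
sketch in Python of the kind of local subproblem such a solver
handles: position a few planar points subject to distance
constraints. The points and distances are my own toy example, not
Hoffman's, and I lean on scipy's general-purpose root finder rather
than a real geometric solver:

    # A toy constraint problem: place points A, B, C in the plane so
    # that each pair is distance 2 apart. A is pinned at the origin
    # and B to the x-axis to remove the free rigid-body motions.
    import numpy as np
    from scipy.optimize import fsolve

    def constraints(xs):
        bx, by, cx, cy = xs
        return [
            bx**2 + by**2 - 4.0,                # |AB| = 2
            (cx - bx)**2 + (cy - by)**2 - 4.0,  # |BC| = 2
            cx**2 + cy**2 - 4.0,                # |CA| = 2
            by,                                 # B on the x-axis
        ]

    solution = fsolve(constraints, x0=[1.0, 0.5, 0.5, 1.0])
    print("B =", solution[:2], "C =", solution[2:])  # an equilateral triangle

A real design system decomposes a large constraint graph into many
small subsystems like this one, solves them locally, and then
reconciles the local solutions globally.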
Essential to both domains is the notion of a constraint
solver. This won't be a surprise to folks who work on
design systems, even CAD systems, but the idea that a
biologist needs to work with a constraint solver might
surprise many. Hoffman's take-home point was that we
cannot do our work locked within our disciplinary boundaries,
because we don't usually know where the most important
connections lie.
The next two talks were by computer scientists working in
industry. Both gave wonderful glimpses of how scientists
work today and how computer scientists help them -- and
are helping them redefine their disciplines.
First was Amar Kumar of Eli Lilly, who drew on his work in
bioinformatics with biologists and chemists in drug discovery.
He views his primary responsibility as helping scientists
interpret their data.
The business model traditionally used by Lilly and other big
pharma companies is unsustainable. If Lilly creates 10,000
new compounds, roughly 1,000 will show promise, 100 will be
worth testing, and 1 will make it to market.
The failure rate, the time required by the process, the cost
of development -- all result in an unsustainable model.
Kumar argued that Lilly and its scientists must transform
how they think about finding candidates and discovering
effects. He gave two
examples. Biologists must move from "How does this particular
gene respond to this particular drug?" to "How do all
human genes respond to this panel of 35,000 drugs?" There are
roughly 30,000 human genes, which means that the new question
produces 1 billion data points. Similarly, drug researchers
must move from "What does Alzheimer's do to the levels of
amyloid protein in the brain?" to "When I compare a healthy
patient with an Alzheimer's patient, what is the difference in the
level of every brain-specific protein over time?" Again, the
new question produces a massive number of data points.
Drug companies must ask new kinds of questions -- and design
new ways to find answers. The new paradigm shifts the power
from pure lab biologists to bioinformaticians and statisticians.
This is a major shift in culture at a place like Lilly research
labs. In terms of a table of gene/drug interactions, the first
adjustment is from cell thinking (1 gene/1 drug) to column
thinking (1 drug/all genes). Ultimately, Kumar believes, the
next step -- to grid thinking (m drugs/n genes) and finding
patterns throughout -- is necessary, too.
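The shift is easy to picture as operations on an array. Here is a
hedged sketch, with a small random matrix standing in for the real
table, whose 30,000 x 35,000 entries would be the billion data
points mentioned above:

    # Cell, column, and grid thinking over a genes x drugs table.
    # The values are random stand-ins for real assay data.
    import numpy as np

    rng = np.random.default_rng(0)
    n_genes, n_drugs = 30_000, 350        # scaled down from 35,000 drugs
    interactions = rng.normal(size=(n_genes, n_drugs))

    # Cell thinking: how does this gene respond to this drug?
    one_result = interactions[42, 7]

    # Column thinking: how do all genes respond to this drug?
    drug_profile = interactions[:, 7]
    responsive = np.argsort(drug_profile)[-10:]  # ten most responsive genes

    # Grid thinking: patterns across the whole table, e.g., which
    # drugs produce correlated response profiles across genes.
    correlations = np.corrcoef(interactions[:1000].T)
    print(one_result, responsive, correlations.shape)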
What the bioinformatician can do is to help convert
information into knowledge. Kumar said that a friend
used to ask him what he would do with infinite computational
power. He thinks the challenge these days is not to create
more computational power. We already have more data in our
possession than we know what to do with. More than more raw
power, we need new ways to understand the data that we gather.
For example, we need to use clustering techniques more
effectively to find patterns in the data, to help scientists
see the ideas. Scientists do this "in the small", by hand,
but programs can do so much better. Kumar showed an example,
a huge spreadsheet with a table of genes crossed with metabolites.
Rather than look at the data in the small, he converted the
numbers to a
heat map
so that the scientist could focus on critical areas of relationship.
That is a more fruitful way to identify possible experiments than
to work through the rows of the table by hand.
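Here is a minimal sketch of that conversion with matplotlib. The
gene and metabolite values are synthetic placeholders, with one
artificially injected hot spot so the effect is visible:

    # Turn a genes x metabolites table into a heat map, so hot spots
    # of association stand out instead of hiding among raw numbers.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    table = rng.normal(size=(60, 40))     # placeholder measurements
    table[20:30, 10:18] += 3.0            # an artificial critical area

    fig, ax = plt.subplots()
    image = ax.imshow(table, cmap="RdBu_r", aspect="auto")
    ax.set_xlabel("metabolites")
    ax.set_ylabel("genes")
    fig.colorbar(image, label="association strength")
    plt.show()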
Kumar suggests that future scientists require some essential
computational skills:
- data integration (across data sets)
- data visualization
- clustering
- translation of problems from one space to another
- databases
- software development lifecycle
Do CS students learn about clustering as undergrads? Biologists
need to. On the last two items, other scientists usually know
remarkably little. Knowing a bit about the software lifecycle
will help them work better with computer scientists. Knowing a
bit about databases will help them understand the technology
decisions the CS folks make. If all you know is a flat text file
or maybe a spreadsheet, then you may not understand why it is
better to put the data in a database -- and how much better that
will support your work.
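Even a small demonstration makes the case. A sketch using Python's
built-in sqlite3 module, with an illustrative schema and made-up
values rather than anything from the talk:

    # One SQL query answers a question that would otherwise require
    # a hand-written loop over a flat text file.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE assay (gene TEXT, drug TEXT, response REAL)")
    con.executemany(
        "INSERT INTO assay VALUES (?, ?, ?)",
        [("BRCA1", "drug_a", 0.9), ("BRCA1", "drug_b", 0.1),
         ("TP53",  "drug_a", 0.4), ("TP53",  "drug_b", 0.8)],
    )

    # Which drug produced the strongest average response?
    query = ("SELECT drug, AVG(response) FROM assay "
             "GROUP BY drug ORDER BY AVG(response) DESC")
    for row in con.execute(query):
        print(row)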
The second speaker was Bob Zigon from Beckman Coulter, a company
that works in the area of
flow cytometry.
Most of us in the room didn't know that flow cytometry studies
the properties of cells as they flow through a liquid. Zigon is
a software tech lead for the development of flow cytometry tools.
He emphasized that to do his job, he has to act like the lab
scientists. He has to learn their vocabulary, how to run their
equipment, how to build their instruments, and how to perform
experiments. The software folks at Beckman Coulter spend a lot
of time observing scientists.
... and students chuckle at me when I tell them psychology,
anthropology, and sociology make great minors or double majors
for CS students! My experience came in the world of knowledge-based
systems, which require a deep understanding of the practice and
implicit knowledge of domain experts. Back in the early 1990s, I
remember AI researcher John McDermott, of
R1/XCON
fame, describing how his expert systems team had evolved toward
cultural anthropology as the natural next step in their work.
I think that all software folks must be able to develop a deep
cultural understanding of the domains they work in, if they want
to do their jobs well. As software development becomes more and
more interdisciplinary, that understanding matters even more. Whether they
learn these skills in the trenches or with some formal training
is up to them.
Enjoying this sort of work helps a software developer,
too. Zigon clearly does. He and his team implement computations
and build interfaces to support the scientists who use flow
cytometry to study blood cancer and other health conditions.
He gave a great two-minute description of one of the basic
processes that I can't do justice here. First they put blood
into a tube that narrows down to the thickness of a hair. The
cells line up, one by one. Then the scientists run the blood
across a laser beam, which causes the cells to fluoresce.
Hardware measures the fluorescent energy, and software digitizes
it for analysis. The equipment processes 10k cells/second,
resulting in 18 data points for each of anywhere between 1 and
20 million cells.
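Those figures imply serious data volumes. A quick back-of-the-envelope
check, assuming 4 bytes per measurement since the talk did not
specify a sample size:

    cells_per_second = 10_000
    measurements_per_cell = 18
    bytes_per_measurement = 4          # assumption: 32-bit values

    cells = 20_000_000                 # the high end of a run
    values = cells * measurements_per_cell
    gigabytes = values * bytes_per_measurement / 1e9
    minutes = cells / cells_per_second / 60

    print(f"{values:,} values, about {gigabytes:.1f} GB")  # ~1.4 GB
    print(f"acquired in about {minutes:.0f} minutes")      # ~33 minutes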
What do scientists working in this area need? Data
management across the full continuum: acquisition,
organization, querying, and visualization. Eight years of
research data amount to about 15 gigabytes. Eight years of
pharmaceutical data reaches 185 GB. And eight years of
clinical data is 3 terabytes. Data is king.
Zigon's team moves all the data into relational databases,
converting the data into fifth normal form to eliminate as much
redundancy as possible. Their software makes the data available
to the scientists for
online transactional processing
and
online analytical processing.
Even with large data sets and expensive computations, the
scientists need query times in the range of 7-10 seconds.
With so much data, the need for ways to visualize data sets
and patterns is paramount. In real time, they process 750 MB
data sets at 20 frames per second. The biologists would still
use histograms and scatter plots as their only graphical
representations if the software guys couldn't do better. Zigon
and his team build tools for n-dimensional manipulation
and review of the data. They also work on data reduction, so
that the scientists can focus on subsets when appropriate.
Finally, to help find patterns, they create and implement
clustering algorithms. Many of the scientists tend to fall back
on
k-means clustering,
but in highly multidimensional spaces that technique imposes a
false structure on the data. They need something better, but
the alternatives are O(n²) -- which is, of course,
intractable on such large sets. So Zigon needs better algorithms
and, unlike Kumar, more computational power! At the
top of his wish list are algorithms whose complexity scales to
studying 15 million cells at a time and ways to parallelize these
algorithms in cost- and resource-effective ways. Cluster
computing is attractive -- but expensive, loud, hot, .... They
need something better.
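For reference, here is a bare-bones k-means in numpy. Each
iteration costs O(n·k·d), which is why it scales where the
O(n²) alternatives do not, even though it imposes the false
structure Zigon described on high-dimensional data. The synthetic
18-dimensional cells below echo the flow cytometry measurements;
none of this is Beckman Coulter's code:

    # Bare-bones k-means: assign each point to its nearest centroid,
    # recompute the centroids, repeat until they stop moving.
    import numpy as np

    def kmeans(points, k, iterations=100, seed=0):
        rng = np.random.default_rng(seed)
        centroids = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(iterations):
            # Distances from every point to every centroid: (n, k).
            d = np.linalg.norm(points[:, None, :] - centroids[None, :, :],
                               axis=2)
            labels = d.argmin(axis=1)
            new = np.array([points[labels == j].mean(axis=0)
                            if np.any(labels == j) else centroids[j]
                            for j in range(k)])
            if np.allclose(new, centroids):
                break
            centroids = new
        return labels, centroids

    # Synthetic 18-dimensional cell measurements, as in flow cytometry.
    cells = np.random.default_rng(1).normal(size=(100_000, 18))
    labels, centroids = kmeans(cells, k=5)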
What else do scientists need? The requirements are steep. The
ability to integrate cellular, proteomic, and genomic data.
Usable HCI. On a more pedestrian tack, they need to replace
paper lab notebooks with electronic notebooks. That sounds
easy but laws on data privacy and process accountability make
that a challenging problem, too. Zigon's team draws on work in
the areas of electronic signatures, data security on a network,
and the like.
From these two talks, it seems clear that domain scientists and
computer scientists of the future will each need to know more about
the other's discipline than they did in the past.
Computing is redefining the questions that domain scientists
must ask and redefining the tasks performed by the CS folks.
The domain scientists need to know enough about computer science,
especially databases and visualization, to know what is possible.
Computer scientists need to study algorithms, parallelism, and
HCI. They also need to take more seriously the soft skills of
communication and teamwork that we have been encouraging for many
years now.
The Q-n-A session that followed pointed out an interesting
twist on the need for communication. It seems that clustering
algorithms are being reinvented across many disciplines. As each
discipline encounters the need, the scientists and mathematicians
-- and even computer scientists -- working in that area sometimes
solve their problems from scratch without reference to the
well-developed results on clustering and pattern recognition from
CS and math. This seems like a valuable opportunity to
initiate dialogue across the sciences at institutions looking to
increase their interdisciplinary focus.
-----