TITLE: Workshop 2: Computational Thinking in the Health Sciences
AUTHOR: Eugene Wallingford
DATE: October 30, 2008 8:39 PM
DESC:
-----
BODY:
[A transcript of the SECANT 2008 workshop: Table of Contents]
The next session of the workshop was a panel of university
faculty working in the health sciences, talking about how
they use computation in their disciplines and what the
key issues are. Panel chair Raj Acharya, from Penn State's
Computer Science and Engineering department, opened with
the bon mot "all science is computer science", a reference
to a
2001 New York Times piece
that I have been using for the last few years when speaking
to prospective students, their parents, and other faculty.
By itself, this statement sounds flip, but it is true in
many ways. The telescope astronomers use today is as much
a computational instrument as a mechanical one. Many of
the most interesting advances in biology these days are
really bioinformatics.
The dawn of big data is changing
what we do in CS,
but it's having an even bigger effect in some other
sciences by creating a new way to do science. Modeling
is a nascent research method based in computation:
propose a model, test it against the data, and iterate.
Data mining is an essential step in this new process:
all of the data goes into a box, and the box has to make
sense of the data. This swaps two steps in the
traditional scientific method... Instead of forming a
hypothesis and then testing it by collecting data, a
scientist can mine a large collection of data to find
candidate hypotheses, and then confirm them with more
traditional bench science and by checking models against
other and larger data sets.
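As a toy illustration of what "the box" might do,
consider screening every pair of variables in a data set
for strong correlation and reporting the pairs worth
taking back to the bench. This sketch is my own, not
something presented at the workshop; the gene names and
numbers are invented:

    import itertools
    import math

    def correlation(xs, ys):
        # Pearson correlation of two equal-length samples
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
        sy = math.sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)

    def candidate_hypotheses(data, threshold=0.9):
        # Flag variable pairs whose correlation exceeds the
        # threshold -- candidates for confirmation at the bench.
        return [(a, b, round(correlation(data[a], data[b]), 3))
                for a, b in itertools.combinations(sorted(data), 2)
                if abs(correlation(data[a], data[b])) > threshold]

    # invented toy data: expression levels of three genes
    data = {"geneA": [1.0, 2.1, 2.9, 4.2],
            "geneB": [2.0, 4.1, 6.2, 8.1],   # tracks geneA closely
            "geneC": [5.0, 1.2, 4.4, 2.3]}
    print(candidate_hypotheses(data))

Real data mining uses far more sophisticated machinery,
but the workflow is the same: the data suggests the
hypothesis, and the scientist decides which suggestions
merit a real experiment.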
Tony Hazbun, who works in the School of Pharmacy at
Purdue, talked about work in systems biology. He
identified four key ideas that biologists need to learn
from computer science, which echoed a talk from
last year's workshop:
- data visualization
- database management (relational, not flat)
- data classification (cluster analysis; see the sketch after this list)
- modeling
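Cluster analysis, the third item, shows how little
machinery some of these ideas require at their core.
Here is a bare-bones k-means sketch, again my own toy
illustration on invented 2-D data rather than anything
shown at the workshop:

    import random

    def kmeans(points, k, iterations=20):
        # Bare-bones k-means: assign each point to its nearest
        # center, recompute each center as the mean of its
        # cluster, and repeat.
        centers = random.sample(points, k)
        for _ in range(iterations):
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k),
                              key=lambda i: dist2(p, centers[i]))
                clusters[nearest].append(p)
            centers = [mean(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        return clusters

    def dist2(p, q):
        # squared Euclidean distance between two 2-D points
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

    def mean(pts):
        n = len(pts)
        return (sum(p[0] for p in pts) / n,
                sum(p[1] for p in pts) / n)

    # invented toy data: two visibly separate groups
    points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1),
              (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]
    for cluster in kmeans(points, 2):
        print(cluster)

In practice a biologist would reach for a library
implementation and for better algorithms, but even this
toy conveys the idea: let the data group itself, then
ask why.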
Hazbun made one provocative claim that I think gets to
the heart of why this sort of science is important.
We mine data sets to see patterns that we probably
would not have seen otherwise. This approach is
more objective than traditional science, in
which the hypotheses we test are the ones we create
out of our own experience -- a much more personal,
and thus more subjective, approach. Data
mining helps us to step outside our own experience.
Next up was Daisuke Kihara, a Purdue bioinformatician
who was educated in Japan. He talked about the
difficulties he has had building a research group of
graduate students. The main problem is that biology
students have little skill in mathematics and
programming, and CS students know little or no biology.
In the US, he said, education is often too
discipline-specific, with not enough breadth, which
limits the kind of cross-fertilization needed by
researchers in bioinformatics. My university created
an undergraduate major in Bioinformatics three years
ago in an effort to bridge this gap, in part because
biotechnology is an industry targeted for economic
development in our state.
(My mind wandered a bit as I thought about Kihara's
claim about US education. If he is right, then perhaps
the US grew strong technically and academically during
a time when the major advances came within specific
disciplines. Now that the most important advances are
coming in multidisciplinary areas, we may well need to
change our approach, or lose our lead. I've been
concerned about this for a year or so, because I have
seen the problem of specializing too soon creeping down
into our high schools. But then I wondered, is Kihara's
claim true? Computer science has a history grounded in
applications that motivate our advances; I think it's
a relatively recent phenomenon that we spend most of
our time looking inward.)
In addition to technical skills and domain
knowledge, scientists of the future need the elusive
"problem-solving skills" we all talk about and hope
to develop in our courses. Haixu Tang, from the
Informatics program at Indiana, contrasted the mentalities
of what he called information technology and scientific
computing:
- technique-driven versus problem-driven
- general models versus specific, even novel, models
- robust, scalable, and modular software versus accurate, efficient programs
These distinctions reflect a cultural divide that makes
integrating CS into science disciplines tough. In
Tang's experience, domain knowledge is not the primary
hurdle, but he has found it easier to teach computer
scientists biology than to teach biologists computer
science.
Tang also described the shift in scientific method that
computing enables. In traditional biology, scientists
work from hypothesis to data to knowledge, with a cycle
from data back to hypothesis. In genome science,
researchers can proceed from data to hypothesis to
knowledge, with a cycle from hypothesis back to data.
The shift is from hypothesis-driven science to
data-driven science. Simulation has joined theory
and statistics in the methodological toolbox.
In the Q-n-A session that followed the panel, someone
expressed concern about data-driven research. Too many
people don't go back to do the experiments needed to
confirm hypotheses found via data mining or to verify
their data by independent means. The result is bad
science. Olga Vitek, a statistical bioinformatician,
replied that the key is developing skill in
experimental design. Some researchers in this new
world are learning the hard way.
The last speaker was Peter Waddell, a comparative
biologist who is working to reconstruct the tree of
life based on genome sequences. One example he
offered was that the genome record shows primates'
closest relatives to be... flying lemurs and tree shrews!
This process is going slowly but gaining speed. He
told a great story about shotgun sequencing, BLAST,
and the challenges in aligning and matching sequences.
I couldn't follow it, because I am a computer scientist
who needs to learn more biology.
When Waddell began to talk about some of the computing
challenges he and his colleagues face, I could follow
the details much better. They are working with a
sparse matrix that will have between 10^2
and 10^3 rows and between 10^2
and 10^9 (!!) columns. The row and column
sums will differ, but he needs to generate random
matrices having the same row and column sums as the
original matrix.
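To make the computational problem concrete, here is a
minimal sketch of one standard way to do that kind of
shuffle: a random walk of "checkerboard" swaps, each of
which moves one unit around a 2x2 pattern of cells and
so leaves every row and column sum unchanged. This is my
own illustration of the idea, not the method Waddell's
group uses, and it ignores the sparsity and sheer scale
that make his problem hard:

    import random

    def shuffle_preserving_margins(matrix, steps=10000):
        # Randomize a nonnegative integer matrix while keeping
        # every row sum and column sum fixed, by repeatedly moving
        # one unit around a random 2x2 "checkerboard" of cells.
        m = [row[:] for row in matrix]            # work on a copy
        rows, cols = len(m), len(m[0])
        for _ in range(steps):
            r1, r2 = random.sample(range(rows), 2)
            c1, c2 = random.sample(range(cols), 2)
            d = random.choice([-1, 1])
            # adding d to two diagonal cells and subtracting it
            # from the other two leaves all of the margins intact
            if min(m[r1][c1] + d, m[r2][c2] + d,
                   m[r1][c2] - d, m[r2][c1] - d) < 0:
                continue                  # a cell would go negative
            m[r1][c1] += d
            m[r2][c2] += d
            m[r1][c2] -= d
            m[r2][c1] -= d
        return m

    original = [[3, 0, 2],
                [1, 4, 0],
                [0, 1, 5]]
    shuffled = shuffle_preserving_margins(original)
    print([sum(row) for row in shuffled])        # row sums match
    print([sum(col) for col in zip(*shuffled)])  # column sums match

At 10^9 columns, of course, even holding the matrix
requires sparse data structures, which is where the real
computing challenge lies.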
In his estimation, students almost need to have a
triple major in CS, math, and stats,
with lots of biology and maybe a little chemistry
thrown in, in order to contribute to this kind of
research. The next best thing is cross-fertilization.
His favorite places to work have been where all of
the faculty lunch together, where they are able to
share ideas and learn to speak each other's languages.
This remark led to another question, because it
"raised the hobgoblin of multidisciplinary research":
an undergraduate needs seven years of study in order
to prepare for a research career -- and that is only
for the best students. Average undergrads will need
more, and even that might not be enough. What can we
do? One idea: redesign the whole curriculum to be
interdisciplinary, with problems, mathematics,
computational thinking, and research methods taught
and reinforced everywhere. Graduating students will
not be as well-versed in any one area, but perhaps
they will be better at solving problems across the
boundaries of any single discipline.
This isn't just a problem for multidisciplinary science
preparation. We face the same problem in computer
science itself, where the software development side of
our discipline requires a variety of skills that are
often best learned in context. The integrated curriculum
suggestion made here makes me think of the integrated
apprenticeship-style curriculum that ChiliPLoP
produced this year.
-----