TITLE: Workshop 2: Computational Thinking in the Health Sciences
AUTHOR: Eugene Wallingford
DATE: October 30, 2008 8:39 PM
DESC:
-----
BODY:
[A transcript of the SECANT 2008 workshop: Table of Contents]
The next session of the workshop was a panel of university
faculty working in the health sciences, talking about how
they use computation in their disciplines and what the
key issues are. Panel chair Raj Acharya, from Penn State's
Computer Science and Engineering department, opened with
the bon mot "all science is computer science", a reference
to a
2001 New York Times piece
that I have been using for the last few years when speaking
to prospective students, their parents, and other faculty.
By itself, this statement sounds flip, but it is true in
many ways. The telescope astronomers use today is as much
a computational instrument as a mechanical one. Many of
the most interesting advances in biology these days are
really bioinformatics.
The dawn of big data is changing
what we do in CS,
but it's having an even bigger effect in some other
sciences by creating a new way to do science. Modeling
is a nascent research method based in computation:
propose a model, test it against the data, and iterate.
Data mining is an essential step in this new process:
all of the data goes into a box, and the box has to make
sense of the data. This swaps two steps in the
traditional scientific method... Instead of forming a
hypothesis and then testing it by collecting data, a
scientist can mine a large collection of data to find
candidate hypotheses, and then confirm them with more
traditional bench science and by checking models against
other and larger data sets.
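As a toy illustration of what "the box" might do,
consider screening every pair of variables in a data set
for strong correlation and reporting the pairs worth
taking back to the bench. This sketch is my own, not
something presented at the workshop; the gene names and
numbers are invented:

    import itertools
    import math

    def correlation(xs, ys):
        # Pearson correlation of two equal-length samples
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
        sy = math.sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)

    def candidate_hypotheses(data, threshold=0.9):
        # Flag variable pairs whose correlation exceeds the
        # threshold -- candidates for confirmation at the bench.
        return [(a, b, round(correlation(data[a], data[b]), 3))
                for a, b in itertools.combinations(sorted(data), 2)
                if abs(correlation(data[a], data[b])) > threshold]

    # invented toy data: expression levels of three genes
    data = {"geneA": [1.0, 2.1, 2.9, 4.2],
            "geneB": [2.0, 4.1, 6.2, 8.1],   # tracks geneA closely
            "geneC": [5.0, 1.2, 4.4, 2.3]}
    print(candidate_hypotheses(data))

Real data mining uses far more sophisticated machinery,
but the workflow is the same: the data suggests the
hypothesis, and the scientist decides which suggestions
merit a real experiment.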
Tony Hazbun, who works in the School of Pharmacy at
Purdue, talked about work in systems biology. He
identified four key ideas that biologists need to learn
from computer science, which echoed a talk from
last year's workshop:
- data visualization
- database management (relational, not flat)
- data classification (cluster analysis; see the sketch after this list)
- modeling
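Cluster analysis, the third item, shows how little
machinery some of these ideas require at their core.
Here is a bare-bones k-means sketch, again my own toy
illustration on invented 2-D data rather than anything
shown at the workshop:

    import random

    def kmeans(points, k, iterations=20):
        # Bare-bones k-means: assign each point to its nearest
        # center, recompute each center as the mean of its
        # cluster, and repeat.
        centers = random.sample(points, k)
        for _ in range(iterations):
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k),
                              key=lambda i: dist2(p, centers[i]))
                clusters[nearest].append(p)
            centers = [mean(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        return clusters

    def dist2(p, q):
        # squared Euclidean distance between two 2-D points
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

    def mean(pts):
        n = len(pts)
        return (sum(p[0] for p in pts) / n,
                sum(p[1] for p in pts) / n)

    # invented toy data: two visibly separate groups
    points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1),
              (5.0, 5.0), (5.1, 4.8), (4.9, 5.2)]
    for cluster in kmeans(points, 2):
        print(cluster)

In practice a biologist would reach for a library
implementation and for better algorithms, but even this
toy conveys the idea: let the data group itself, then
ask why.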
Hazbun made one provocative claim that I think gets to
the heart of why this sort of science is important.
We mine data sets to see patterns that we probably
would not have seen otherwise. This approach is
more objective than traditional science, in
which the hypotheses we test are the ones we create
out of our own experience -- a much more personal,
and thus more subjective, approach. Data
mining helps us to step outside our own experience.
Next up was Daisuke Kihara, a Purdue bioinformatician
who was educated in Japan. He talked about the
difficulties he has had building a research group of
graduate students. The main problem is that biology
students have little skill in mathematics and
programming, and CS students know little or no biology.
In the US, he said, education is often too
discipline-specific, with not enough breadth, which
limits the kind of cross-fertilization needed by
researchers in bioinformatics. My university created
an undergraduate major in Bioinformatics three years
ago in an effort to bridge this gap, in part because
biotechnology is an industry targeted for economic
development in our state.
(My mind wandered a bit as I thought about Kihara's
claim about US education. If he is right, then perhaps
the US grew strong technically and academically during
a time when the major advances came within specific
disciplines. Now that the most important advances are
coming in multidisciplinary areas, we may well need to
change our approach, or lose our lead. I've been
concerned about this for a year or so, because I have
seen the problem of specializing too soon creeping down
into our high schools. But then I wondered, is Kihara's
claim true? Computer science has a history grounded in
applications that motivate our advances; I think it's
a relatively recent phenomenon that we spend most of
our time looking inward.)
In addition to technical skills and domain
knowledge, scientists of the future need the elusive
"problem-solving skills" we all talk about and hope
to develop in our courses. Haixu Tang, from the
Informatics program at Indiana, contrasted the mentalities
of what he called information technology and scientific
computing:
- technique-driven versus problem-driven
- general models versus specific, even novel, models
- robust, scalable, and modular software versus accurate, efficient programs
These distinctions reflect a cultural divide that makes
integrating CS into science disciplines tough. In
Tang's experience, domain knowledge is not the primary
hurdle, but he has found it easier to teach computer
scientists biology than to teach biologists computer
science.
Tang also described the shift in scientific method that
computing enables. In traditional biology, scientists
work from hypothesis to data to knowledge, with a cycle
from data back to hypothesis. In genome science,
researchers can proceed from data to hypothesis to
knowledge, with a cycle from hypothesis back to data.
The shift is from hypothesis-driven science to
data-driven science. Simulation has joined theory
and statistics in the methodological toolbox.
In the Q-n-A session that followed the panel, someone
expressed concern about data-driven research. Too many
people don't go back to do the experiments needed to
confirm hypotheses found via data mining or to verify
their data by independent means. The result is bad
science. Olga Vitek, a statistical bioinformatician,
replied that the key is developing skill in
experimental design. Some researchers in this new
world are learning the hard way.
The last speaker was Peter Waddell, a comparative
biologist who is working to reconstruct the tree of
life based on genome sequences. One example he
offered was that the genome record shows primates'
closest relatives to be... flying lemurs and tree shrews!
This process is going slowly but gaining speed. He
told a great story about shotgun sequencing, BLAST,
and the challenges in aligning and matching sequences.
I couldn't follow it, because I am a computer scientist
who needs to learn more biology.
When Waddell began to talk about some of the computing
challenges he and his colleagues face, I could follow
the details much better. They are working with a
sparse matrix that will have between 10^2
and 10^3 rows and between 10^2
and 10^9 (!!) columns. The row and column
sums will differ, but he needs to generate random
matrices having the same row and column sums as the
original matrix.
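To make the computational problem concrete, here is a
minimal sketch of one standard way to do that kind of
shuffle: a random walk of "checkerboard" swaps, each of
which moves one unit around a 2x2 pattern of cells and
so leaves every row and column sum unchanged. This is my
own illustration of the idea, not the method Waddell's
group uses, and it ignores the sparsity and sheer scale
that make his problem hard:

    import random

    def shuffle_preserving_margins(matrix, steps=10000):
        # Randomize a nonnegative integer matrix while keeping
        # every row sum and column sum fixed, by repeatedly moving
        # one unit around a random 2x2 "checkerboard" of cells.
        m = [row[:] for row in matrix]            # work on a copy
        rows, cols = len(m), len(m[0])
        for _ in range(steps):
            r1, r2 = random.sample(range(rows), 2)
            c1, c2 = random.sample(range(cols), 2)
            d = random.choice([-1, 1])
            # adding d to two diagonal cells and subtracting it
            # from the other two leaves all of the margins intact
            if min(m[r1][c1] + d, m[r2][c2] + d,
                   m[r1][c2] - d, m[r2][c1] - d) < 0:
                continue                  # a cell would go negative
            m[r1][c1] += d
            m[r2][c2] += d
            m[r1][c2] -= d
            m[r2][c1] -= d
        return m

    original = [[3, 0, 2],
                [1, 4, 0],
                [0, 1, 5]]
    shuffled = shuffle_preserving_margins(original)
    print([sum(row) for row in shuffled])        # row sums match
    print([sum(col) for col in zip(*shuffled)])  # column sums match

At 10^9 columns, of course, even holding the matrix
requires sparse data structures, which is where the real
computing challenge lies.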
In his estimation, students almost need to have a
triple major in CS, math, and stats,
with lots of biology and maybe a little chemistry
thrown in, in order to contribute to this kind of
research. The next best thing is cross-fertilization.
His favorite places to work have been where all of
the faculty lunch together, where they are able to
share ideas and learn to speak each other's languages.
This remark led to another question, because it
"raised the hobgoblin of multidisciplinary research":
an undergraduate needs seven years of study in order
to prepare for a research career -- and that is only
for the best students. Average undergrads will need
more, and even that might not be enough. What can we
do? One idea: redesign the whole curriculum to be
interdisciplinary, with problems, mathematics,
computational thinking, and research methods taught
and reinforced everywhere. Graduating students will
not be as well-versed in any one area, but perhaps
they will be better at solving problems across the
boundaries of any single discipline.
This isn't just a problem for multidisciplinary science
preparation. We face the same problem in computer
science itself, where the software development side of
our discipline requires a variety of skills that are
often best learned in context. The integrated curriculum
suggestion made here makes me think of the integrated
apprenticeship-style curriculum that ChiliPLoP
produced this year.
-----