TITLE: Workshop 3: The Next Generation
AUTHOR: Eugene Wallingford
DATE: November 19, 2007 4:41 PM
DESC:
-----
BODY:
[A transcript of the
SECANT 2007 workshop:
Table of Contents]
The highlight for me of the final morning of the
SECANT workshop
was a session on the "next generation of scientists in the
workforce". It consisted of presentations on what scientists
are doing out in the world and how computer scientists are
helping them.
Chris Hoffman
gave a talk on applications of geometric computing in
industry. He gave examples from two domains, the
parametric design of mechanical systems and the study of
the structure and behaviors of proteins. I didn't follow
the virus example very well, but the issue seems to lie in
understanding the icosahedral symmetry that characterizes
many viruses. The common theme in the two applications
is constraint resolution, a standard computational
technique. In the design case, the problem is represented
as a graph, and graph decomposition is used to create
local plans. Arriving at a satisfactory design requires
solving a system of constraint equations, locally and
globally. A system of constraints is also used to model
the features of a virus capsid and then solved to learn
how the virus behaves.
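To make the constraint-solving idea concrete, here is a minimal
sketch in Python of the kind of local subproblem such a solver
handles: position a few planar points subject to distance
constraints. The points and distances are my own toy example, not
Hoffman's, and I lean on scipy's general-purpose root finder rather
than a real geometric solver:

    # A toy constraint problem: place points A, B, C in the plane so
    # that each pair is distance 2 apart. A is pinned at the origin
    # and B to the x-axis to remove the free rigid-body motions.
    import numpy as np
    from scipy.optimize import fsolve

    def constraints(xs):
        bx, by, cx, cy = xs
        return [
            bx**2 + by**2 - 4.0,                # |AB| = 2
            (cx - bx)**2 + (cy - by)**2 - 4.0,  # |BC| = 2
            cx**2 + cy**2 - 4.0,                # |CA| = 2
            by,                                 # B on the x-axis
        ]

    solution = fsolve(constraints, x0=[1.0, 0.5, 0.5, 1.0])
    print("B =", solution[:2], "C =", solution[2:])  # an equilateral triangle

A real design system decomposes a large constraint graph into many
small subsystems like this one, solves them locally, and then
reconciles the local solutions globally.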
Essential to both domains is the notion of a constraint
solver. This won't be a surprise to folks who work on
design systems, even CAD systems, but the idea that a
biologist needs to work with a constraint solver might
surprise many. Hoffman's take-home point was that we
cannot do our work locked within our disciplinary boundaries,
because we don't usually know where the most important
connections lie.
The next two talks were by computer scientists working in
industry. Both gave wonderful glimpses of how scientists
work today and how computer scientists help them -- and
are helping them redefine their disciplines.
First was Amar Kumar of Eli Lilly, who drew on his work in
bioinformatics with biologists and chemists in drug discovery.
He views his primary responsibility as helping scientists
interpret their data.
The business model traditionally used by Lilly and other big
pharma companies is unsustainable. If Lilly creates 10,000
new compounds, roughly 1,000 will show promise, 100 will be
worth testing, and 1 will make it to market.
The failure rate, the time required by the process, the cost
of development -- all result in an unsustainable model.
Kumar argued that Lilly and its scientists must transform
how they think about finding candidates and discovering
effects. He gave two
examples. Biologists must move from "How does this particular
gene respond to this particular drug?" to "How do all
human genes respond to this panel of 35,000 drugs?" There are
roughly 30,000 human genes, which means that the new question
produces 1 billion data points. Similarly, drug researchers
must move from "What does Alzheimer's do to the levels of
amyloid protein in the brain?" to "When I compare a healthy
patient with an Alzheimer's patient, what is the difference in the
level of every brain-specific protein over time?" Again, the
new question produces a massive number of data points.
Drug companies must ask new kinds of questions -- and design
new ways to find answers. The new paradigm shifts the power
from pure lab biologists to bioinformaticians and statisticians.
This is a major shift in culture at a place like Lilly research
labs. In terms of a table of gene/drug interactions, the first
adjustment is from cell thinking (1 gene/1 drug) to column
thinking (1 drug/all genes). Ultimately, Kumar believes, the
next step -- to grid thinking (m drugs/n genes) and finding
patterns throughout -- is necessary, too.
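The shift is easy to picture as operations on an array. Here is a
hedged sketch, with a small random matrix standing in for the real
table, whose 30,000 x 35,000 entries would be the billion data
points mentioned above:

    # Cell, column, and grid thinking over a genes x drugs table.
    # The values are random stand-ins for real assay data.
    import numpy as np

    rng = np.random.default_rng(0)
    n_genes, n_drugs = 30_000, 350        # scaled down from 35,000 drugs
    interactions = rng.normal(size=(n_genes, n_drugs))

    # Cell thinking: how does this gene respond to this drug?
    one_result = interactions[42, 7]

    # Column thinking: how do all genes respond to this drug?
    drug_profile = interactions[:, 7]
    responsive = np.argsort(drug_profile)[-10:]  # ten most responsive genes

    # Grid thinking: patterns across the whole table, e.g., which
    # drugs produce correlated response profiles across genes.
    correlations = np.corrcoef(interactions[:1000].T)
    print(one_result, responsive, correlations.shape)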
What the bioinformatician can do is to help convert
information into knowledge. Kumar said that a friend
used to ask him what he would do with infinite computational
power. He thinks the challenge these days is not to create
more computational power. We already have more data in our
possession than we know what to do with. More than more raw
power, we need new ways to understand the data that we gather.
For example, we need to use clustering techniques more
effectively to find patterns in the data, to help scientists
see the ideas. Scientists do this "in the small", by hand,
but programs can do so much better. Kumar showed an example,
a huge spreadsheet with a table of genes crossed with metabolites.
Rather than look at the data in the small, he converted the
numbers to a
heat map
so that the scientist could focus on critical areas of relationship.
That is a more fruitful way to identify possible experiments than
to work through the rows of the table by hand.
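Here is a minimal sketch of that conversion with matplotlib. The
gene and metabolite values are synthetic placeholders, with one
artificially injected hot spot so the effect is visible:

    # Turn a genes x metabolites table into a heat map, so hot spots
    # of association stand out instead of hiding among raw numbers.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    table = rng.normal(size=(60, 40))     # placeholder measurements
    table[20:30, 10:18] += 3.0            # an artificial critical area

    fig, ax = plt.subplots()
    image = ax.imshow(table, cmap="RdBu_r", aspect="auto")
    ax.set_xlabel("metabolites")
    ax.set_ylabel("genes")
    fig.colorbar(image, label="association strength")
    plt.show()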
Kumar suggests that future scientists require some essential
computational skills:
- data integration (across data sets)
- data visualization
- clustering
- translation of problems from one space to another
- databases
- software development lifecycle
Do CS students learn about clustering as undergrads? Biologists
need to. On the last two items, other scientists usually know
remarkably little. Knowing a bit about the software lifecycle
will help them work better with computer scientists. Knowing a
bit about databases will help them understand the technology
decisions the CS folks make. If all you know is a flat text file
or maybe a spreadsheet, then you may not understand why it is
better to put the data in a database -- and how much better that
will support your work.
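Even a small demonstration makes the case. A sketch using Python's
built-in sqlite3 module, with an illustrative schema and made-up
values rather than anything from the talk:

    # One SQL query answers a question that would otherwise require
    # a hand-written loop over a flat text file.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE assay (gene TEXT, drug TEXT, response REAL)")
    con.executemany(
        "INSERT INTO assay VALUES (?, ?, ?)",
        [("BRCA1", "drug_a", 0.9), ("BRCA1", "drug_b", 0.1),
         ("TP53",  "drug_a", 0.4), ("TP53",  "drug_b", 0.8)],
    )

    # Which drug produced the strongest average response?
    query = ("SELECT drug, AVG(response) FROM assay "
             "GROUP BY drug ORDER BY AVG(response) DESC")
    for row in con.execute(query):
        print(row)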
The second speaker was Bob Zigon from Beckman Coulter, a company
that works in the area of
flow cytometry.
Most of us in the room didn't know that flow cytometry studies
the properties of cells as they flow through a liquid. Zigon is
a software tech lead for the development of flow cytometry tools.
He emphasized that to do his job, he has to act like the lab
scientists. He has to learn their vocabulary, how to run their
equipment, how to build their instruments, and how to perform
experiments. The software folks at Beckman Coulter spend a lot
of time observing scientists.
... and students chuckle at me when I tell them psychology,
anthropology, and sociology make great minors or double majors
for CS students! My experience came in the world of knowledge-based
systems, which require a deep understanding of the practice and
implicit knowledge of domain experts. Back in the early 1990s, I
remember AI researcher John McDermott, of
R1/XCON
fame, describing how his expert systems team had evolved toward
cultural anthropology as the natural next step in their work.
I think that all software folks must be able to develop a deep
cultural understanding of the domains they work in, if they want
to do their jobs well. As software development becomes more and
more interdisciplinary, that understanding matters even more. Whether they
learn these skills in the trenches or with some formal training
is up to them.
Enjoying this sort of work helps a software developer,
too. Zigon clearly does. He and his team implement computations
and build interfaces to support the scientists who use flow
cytometry to study blood cancer and other health conditions.
He gave a great two-minute description of one of the basic
processes that I can't do justice here. First they put blood
into a tube that narrows down to the thickness of a hair. The
cells line up, one by one. Then the scientists run the blood
across a laser beam, which causes the cells to fluoresce.
Hardware measures the fluorescent energy, and software digitizes
it for analysis. The equipment processes 10k cells/second,
resulting in 18 data points for each of anywhere between 1 and
20 million cells.
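Those figures imply serious data volumes. A quick back-of-the-envelope
check, assuming 4 bytes per measurement since the talk did not
specify a sample size:

    cells_per_second = 10_000
    measurements_per_cell = 18
    bytes_per_measurement = 4          # assumption: 32-bit values

    cells = 20_000_000                 # the high end of a run
    values = cells * measurements_per_cell
    gigabytes = values * bytes_per_measurement / 1e9
    minutes = cells / cells_per_second / 60

    print(f"{values:,} values, about {gigabytes:.1f} GB")  # ~1.4 GB
    print(f"acquired in about {minutes:.0f} minutes")      # ~33 minutes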
What do scientists working in this area need? Data
management across the full continuum: acquisition,
organization, querying, and visualization. Eight years of
research data amount to about 15 gigabytes. Eight years of
pharmaceutical data reaches 185 GB. And eight years of
clinical data is 3 terabytes. Data is king.
Zigon's team moves all the data into relational databases,
converting the data into fifth normal form to eliminate as much
redundancy as possible. Their software makes the data available
to the scientists for
online transactional processing
and
online analytical processing.
Even with large data sets and expensive computations, the
scientists need query times in the range of 7-10 seconds.
With so much data, the need for ways to visualize data sets
and patterns is paramount. In real time, they process 750 MB
data sets at 20 frames per second. The biologists would still
use histograms and scatter plots as their only graphical
representations if the software guys couldn't do better. Zigon
and his team build tools for n-dimensional manipulation
and review of the data. They also work on data reduction, so
that the scientists can focus on subsets when appropriate.
Finally, to help find patterns, they create and implement
clustering algorithms. Many of the scientists tend to fall back
on
k-means clustering,
but in highly multidimensional spaces that technique imposes a
false structure on the data. They need something better, but
the alternatives are O(n²) -- which is, of course,
intractable on such large sets. So Zigon needs better algorithms
and, unlike Kumar, more computational power! At the
top of his wish list are algorithms whose complexity scales to
studying 15 million cells at a time and ways to parallelize these
algorithms in cost- and resource-effective ways. Cluster
computing is attractive -- but expensive, loud, hot, .... They
need something better.
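For reference, here is a bare-bones k-means in numpy. Each
iteration costs O(n·k·d), which is why it scales where the
O(n²) alternatives do not, even though it imposes the false
structure Zigon described on high-dimensional data. The synthetic
18-dimensional cells below echo the flow cytometry measurements;
none of this is Beckman Coulter's code:

    # Bare-bones k-means: assign each point to its nearest centroid,
    # recompute the centroids, repeat until they stop moving.
    import numpy as np

    def kmeans(points, k, iterations=100, seed=0):
        rng = np.random.default_rng(seed)
        centroids = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(iterations):
            # Distances from every point to every centroid: (n, k).
            d = np.linalg.norm(points[:, None, :] - centroids[None, :, :],
                               axis=2)
            labels = d.argmin(axis=1)
            new = np.array([points[labels == j].mean(axis=0)
                            if np.any(labels == j) else centroids[j]
                            for j in range(k)])
            if np.allclose(new, centroids):
                break
            centroids = new
        return labels, centroids

    # Synthetic 18-dimensional cell measurements, as in flow cytometry.
    cells = np.random.default_rng(1).normal(size=(100_000, 18))
    labels, centroids = kmeans(cells, k=5)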
What else do scientists need? The requirements are steep. The
ability to integrate cellular, proteomic, and genomic data.
Usable HCI. On a more pedestrian tack, they need to replace
paper lab notebooks with electronic notebooks. That sounds
easy but laws on data privacy and process accountability make
that a challenging problem, too. Zigon's team draws on work in
the areas of electronic signatures, data security on a network,
and the like.
From these two talks, it seems clear that domain scientists and
computer scientists of the future will each need to know more about
the other's discipline than they did in the past.
Computing is redefining the questions that domain scientists
must ask and redefining the tasks performed by the CS folks.
The domain scientists need to know enough about computer science,
especially databases and visualization, to know what is possible.
Computer scientists need to study algorithms, parallelism, and
HCI. They also need to take more seriously the soft skills of
communication and teamwork that we have been encouraging for many
years now.
The Q-n-A session that followed pointed out an interesting
twist on the need for communication. It seems that clustering
algorithms are being reinvented across many disciplines. As each
discipline encounters the need, the scientists and mathematicians
-- and even computer scientists -- working in that area sometimes
solve their problems from scratch without reference to the
well-developed results on clustering and pattern recognition from
CS and math. This seems like a valuable opportunity to
initiate dialogue across the sciences at institutions looking to
increase their interdisciplinary focus.
-----