Indexing Genomic Sequence Libraries

 

Kevin C. O'Kane

Department of Computer Science

The University of Northern Iowa

Cedar Falls, Iowa 50613

USA

Tel. (319) 273-7322

Fax (319) 273-7123

okane@cs.uni.edu

 

Matthew J. Lockner

Department of Computer Science and Engineering

The Pennsylvania State University

University Park, PA 16802

USA

lockner@cse.psu.edu

 

_________________________________________________________________________________

Kevin C. O'Kane, Computer Science Department, University of Northern Iowa, Cedar Falls, IA 50613, USA

 

 

Abstract

 

This paper describes an extensible, open-source (GPL) data repository and retrieval system that supports fast, efficient, keyword-based retrieval of genomic sequences from multiple libraries, with retrieved sequences post-processed by FASTA, Smith-Waterman and other analysis software.  The application is implemented for Linux and is written in Mumps, C, and C++ with supporting components that include the Berkeley Data Base, the Perl Compatible Regular Expression Library, GLADE, and tools such as FASTA, Smith-Waterman, and modules from EMBOSS.  The package described here can quickly index data sets of up to 256 terabytes using a B-tree based multidimensional data model.  An example is presented that indexes the text of the full NCBI GenBank library.

 

Keywords: bioinformatics, sequence retrieval, genomics, information retrieval, Mumps.

 

 

Introduction

 

During the past decade, the massive growth in genetic and protein databases has created a pressing need for tools to manage, retrieve and analyze the information contained in these libraries.  Traditional tools to organize, classify and extract information have often proved inadequate when confronted with the overwhelming size and density of information, which includes not only sequence and structural data but also text that describes the data's origin, location, species, tissue sample, journal articles, and so forth.  As of this writing, the NCBI (National Center for Biotechnology Information, part of the National Institutes of Health) GenBank library alone consists of nearly 84 billion bytes of data, and it is only one of several data banks storing similar information.  The scope and size of these databases continue to grow rapidly and will do so for many years to come, as will the demand for access.

 

Currently, retrieval of genomic data is mainly based on well-established programs such as FASTA (Pearson, 2000) and BLAST (Altschul, 1997) that match candidate nucleotide sequences against massive libraries of sequence acquisitions.  There have been few efforts to provide access to genomic data keyed to the extensive text annotations commonly found in these data sets.  Among the few systems that deal with keyword-based searching are the proprietary SRS system (Thure and Argos, 1993a, 1993b) and PIR (Protein Information Resource) (Wu, 2003).  These are limited, controlled vocabulary systems whose keys are taken from manually prepared annotations.  To date, no systems have been reported that generate indices directly from the genomic data sets themselves.  The reasons for this are several: the very large size of the underlying data sets, the size of intermediate indexing files, the complexity of the data, and the time required to perform the indexing.

 

This system, MARBL (Mumps Analysis and Retrieval from Bioinformatics Libraries), is an implementation and toolkit (1) to integrate multiple, very large genomic databases into a unified data repository through open-source components; and (2) to provide fast, web-based, keyword-based access to the contents.  This system differs from existing genomic text retrieval systems in that: (1) it derives the index terms from the text rather than relying on manual assignment; (2) it is fully built upon open-source components; (3) it is coded in C, C++ and an extended implementation language that supports multi-dimensional databases; and (4) it can index the largest genomic data sets in a matter of hours.  In the example discussed here, results are given for the full GenBank library.

 

Additionally, sequences retrieved by MARBL can be post-processed by FASTA (Pearson, 2000), Smith-Waterman (Smith and Waterman, 1981) and elements of EMBOSS (the European Molecular Biology Open Software Suite).  While FASTA and, especially, Smith-Waterman are more sensitive (Shpaer et al., 1996) than BLAST, they are also more time consuming.  However, by first extracting a subset of candidate accessions from the larger database, the number of sequences to be aligned by these algorithms can be reduced significantly, with a corresponding reduction in the overall processing time.

 

Implementation

 

Most genomic databases include, in addition to nucleotide and protein sequences, a wealth of text information in the form of descriptions, keywords, annotations, hyper-links to text articles, journals and so forth.  In many cases, the text attachments to the data are greater in size than the actual sequence data.  Identifying the important keyword terms from this data and assigning a relative weight to these terms is one of the problems addressed in this system.

 

While indexing can be approached from the perspective of assignment to pre-existing categories and hierarchies such as the National Library of Medicine MeSH (Medical Subject Headings) (U.S. National Library of Medicine, 2003), derivative indexing is better able to adapt to changes in a rapidly evolving discipline, as the terms are dynamically extracted directly from the source material rather than waiting for manual analysis.  Existing keyword-based genomic retrieval systems are primarily based on assignment indexing, whereas the approach taken here is based on derivative indexing, where both queries and documents are encoded into a common intermediate representation and metrics are developed to calculate the coefficients of similarity between queries and documents.  Documents are ranked according to their computed similarity to the query and presented to the user in rank order.  Several systems employing this and related models have been implemented, such as Smart (Salton, 1968, 1971, 1983, 1988), Instruct (Wade, 1988), Cansearch (Pollitt, 1987) and Plexus (Vickery, 1987a, 1987b).  More recently, these approaches have been used to index Internet web pages and to provide collaboratively filtered recommendations of similar texts to book buyers at Amazon.com (Linden, 2003).

 

In this system, genomic accessions are represented by vectors that reflect accession content through descriptors derived from the source text by analysis of word usage (Salton, 1968, 1971, 1983, 1988; Willett, 1985; Crouch, 1988).  This approach can be further enhanced by identifying clusters of similar documents (El-Hamdouchi et al., 1988, 1989).  Similarly, term-term co-occurrence matrices can be constructed to identify similar or related terms, and these can be automatically included in queries to enhance recall or to identify term clusters.  Other techniques based on terms and queries have also been explored (Salton, 1988; Williams, 1963).

 

The vector model is rooted in the construction of document vectors consisting of the weights of each term in each document.  Taken collectively, the document vectors constitute a document-term matrix whose rows are document vectors.  A document-term matrix can have millions of rows, more than 22 million in the GenBank case, and hundreds of thousands of columns (terms), more than 500,000 in GenBank.  This yields a matrix with potentially trillions of elements, which must be quickly addressable not by numeric indices but by text keys.  Additionally, to enhance retrieval speed, an inverted matrix of the same size is needed, which doubles the overall storage requirements.  Fortunately, however, both matrices are very sparse.
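
The relationship between the document-term matrix and its inverted counterpart can be illustrated with a minimal sketch.  Python is used here purely for illustration (the system itself is written in Mumps), and the accession codes, terms and weights are made-up toy values, not GenBank data.  Only the elements that are actually stored consume space, which is what makes the sparse representation workable.

```python
from collections import defaultdict

# Toy, hypothetical data: both matrices are keyed by text strings.
doc_term = defaultdict(dict)   # rows are accessions: accession -> {term: weight}
term_doc = defaultdict(dict)   # inverted matrix:     term -> {accession: weight}

def store(accession, term, weight):
    """Record one non-zero element in both the matrix and its inverse."""
    doc_term[accession][term] = weight
    term_doc[term][accession] = weight

store("ACC0001", "kinase", 2.5)
store("ACC0001", "zyxin", 4.1)
store("ACC0002", "zyxin", 3.7)

# Elements actually stored versus the size of the full (dense) matrix.
stored = sum(len(row) for row in doc_term.values())
possible = len(doc_term) * len(term_doc)
```

Looking up term_doc["zyxin"] yields every accession containing the term without scanning the rows of doc_term, which is the purpose of keeping the inverted matrix.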

 

Given the nature of the problem, namely, manipulating massive, sparse matrices indexed by character strings, we implemented the system in Mumps (also referred to as M), a general purpose programming language with a built-in multi-dimensional database.  The language originated in the mid-1960s at the Massachusetts General Hospital (Bowie and Barnett, 1976; Barnett and Greenes, 1970) and became widely used in clinical as well as commercial settings.  Mumps is a simple, easy to learn, string handling language with a built-in, string indexed, multi-dimensional database facility.


The Mumps implementation used here is a portable, open-source version (O'Kane, 1980, 1983, 1999) that translates Mumps programs to C for subsequent compilation and execution.  Other implementations are mainly slower proprietary interpreters or threaded code systems.  This freely distributed compiler, licensed under the General Public License (GPL) and Lesser General Public License (LGPL), supports an enhanced subset of the 1995 Mumps ANSI standard (American National Standards Institute, Inc., 1995) including indirection.  Mumps programs may contain embedded in-line C statements and functions, and Mumps functions can call or be called by programs written in C, C++, or other languages.  A C++ class library is also available to access Mumps facilities from C++ programs.  The compiler supports access to the PostgreSQL relational database server, the Perl Compatible Regular Expression Library, the Berkeley Data Base, and the Glade GUI builder as well as server-side development of interactive web pages.  Mumps programs can directly interface with BLAST and FASTA programs.

The principal Mumps feature of interest to this project is the database, known as global arrays, that permits direct, efficient manipulation of arrays of effectively unlimited size. Globals are persistent, sparse, undeclared, multi-dimensional, string indexed arrays stored in a disk resident B-tree. A global array may appear anywhere an ordinary array reference is permitted and data may be stored at leaf nodes as well as intermediate nodes in the database array.  The number of subscripts in an array reference is limited only by the total length of the array reference with all subscripts expanded to their string values, nominally set at 2048 bytes.  Mumps supports several functions to traverse the database and manipulate the arrays.

Using the global arrays, an accession-term matrix appears in the Mumps language as an array of the form ^D(Accession,Term), where both Accession and Term are text strings.  The matrix is indexed row-wise by accession codes and column-wise by text-derived terms.  This approach vastly simplifies implementation of the basic retrieval model.  For example, the main Mumps indexing program used in the basic protocol described below is about 76 lines of code (excluding in-line C functions).  The highly concise nature of the Mumps language permits rapid deployment and avoids many of the maintenance problems associated with more complex coding systems.
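
For readers unfamiliar with global arrays, the following sketch emulates the two properties that matter here: elements are addressed by string subscripts, and the keys are held in sorted order so that a row can be traversed in key order.  It is an in-memory Python analogy only; the class and method names are hypothetical, and a real global array is persistent and stored in a disk-resident B-tree.

```python
import bisect

class SortedSparseArray:
    """Toy stand-in for a string-indexed, key-ordered sparse array."""

    def __init__(self):
        self._keys = []   # sorted list of (accession, term) tuples
        self._data = {}   # (accession, term) -> weight

    def set(self, accession, term, weight):
        key = (accession, term)
        if key not in self._data:
            bisect.insort(self._keys, key)   # keep keys in ascending order
        self._data[key] = weight

    def get(self, accession, term, default=None):
        return self._data.get((accession, term), default)

    def next_term(self, accession, term=""):
        """Return the term following `term` in this row, or None at the end."""
        i = bisect.bisect_right(self._keys, (accession, term))
        if i < len(self._keys) and self._keys[i][0] == accession:
            return self._keys[i][1]
        return None

    def row(self, accession):
        """Yield (term, weight) pairs of one accession in ascending term order."""
        term = self.next_term(accession)
        while term is not None:
            yield term, self._data[(accession, term)]
            term = self.next_term(accession, term)

D = SortedSparseArray()
D.set("ACC0002", "zyxin", 3.7)
D.set("ACC0001", "zyxin", 4.1)
D.set("ACC0001", "kinase", 2.5)
row = list(D.row("ACC0001"))
```

Stepping from one subscript to the next with next_term is loosely analogous to the traversal functions that Mumps provides for global arrays.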

 

This implementation of Mumps primarily uses the Berkeley Data Base (BDB) to store, manipulate and retrieve global arrays.  The BDB supports file sizes up to 256 terabytes along with multi-threading, concurrency, transaction processing, and many other features, and is implemented on most operating systems and hardware platforms.  With the BDB, it is possible to construct complex Mumps databases of effectively unlimited size and to access any element of the database with only a few disk I/O operations.  The BDB is available as an open-source product (http://www.sleepycat.com).


FASTA (Pearson, 2000) and Smith-Waterman (Smith and Waterman, 1981) are sequence alignment procedures that match candidate protein or nucleic acid sequences to entries in a database.  Sequences retrieved as a result of text searches with the system described here can be post-processed by FASTA and Smith-Waterman.  Of these, the Smith-Waterman algorithm is especially sensitive and accurate but also relatively time consuming.  Using this system to isolate candidate sequences by text keywords, and subsequently processing only the resulting subset of the larger database, results in considerable time savings.  In our experiments we used the Smith-Waterman program available as part of the FASTA package developed by W. R. Pearson (Pearson, 2000).  Additionally, the output of this system is compatible with the many genomic analysis programs found in the open-source EMBOSS collection.

 

System Overview

The system software is compatible with several genomic database input formats, subject to preprocessing by filters. In the example presented here, the NCBI GenBank collection was used.  GenBank consists of accessions which contain sequence and related data collected by researchers throughout the world. 

Two protocols were developed to index the GenBank data sets.  Initially, we tried a direct, single step, vector space model protocol that constructed the accession-term matrix directly from the GenBank files.  However, experiments revealed that this approach was unacceptably slow when used with large data sets.  This resulted in the development of a multi-step protocol that performed the same basic functions but as a series of steps designed to improve overall processing speed.  The discussion below centers on the multi-step protocol although timing results are given for both models.  The work was performed on a Linux based, dual processor hyper-threaded Pentium Xeon 2.20 GHz system with 1 GB of main memory and dual 120 GB EIDE 7,200 rpm disk drives.

 

The entire GenBank data collection consisted of approximately 83.5 GB of data at the time of these experiments.  When working with data sets of smaller size, relatively straightforward approaches to text processing can be used with confidence.  However, when working with data sets of very large dimensions, it soon became apparent that special strategies would be needed in order to reduce the complexity of the processing problem. During indexing, as the B-tree grew, delays due to I/O became significant when the size of the data set exceeded the amount of physical memory available.  At that point, the memory I/O cache became inadequate to service I/O requests without significant actual movement of data to and from external media.  When this happened, CPU utilization was observed to drop to very low values while input/output activity grew to system maximum.  Once page thrashing began, overall progress to index the data set slowed dramatically.

 

In order to avoid this problem, the multi-step protocol was devised in which the indexing tasks were divided into multiple steps and sort routines were employed in order to prepare intermediate data files so that at each stage where the database was being loaded into the B-tree, the keys were presented to the system in ascending key order thus inducing an effectively sequential data set build and eliminating page thrashing.  While this process produced a significant number of large intermediate files, it was substantially faster than unordered key insertion.
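
The effect of key order on cache behavior can be illustrated with a deliberately simplified model.  The sketch below is not a B-tree and does not reproduce the measurements reported here; it assumes that each key lives on a fixed page and that the cache holds a small number of most recently used pages.  The page size, cache size and key count are arbitrary values chosen for illustration.

```python
import random
from collections import OrderedDict

def count_page_misses(keys, keys_per_page=100, cache_pages=8):
    """Count cache misses when keys are inserted into fixed key-range pages.

    Simplifying assumption: key k lives on page k // keys_per_page, and the
    cache holds the `cache_pages` most recently used pages (LRU eviction).
    """
    cache = OrderedDict()
    misses = 0
    for k in keys:
        page = k // keys_per_page
        if page in cache:
            cache.move_to_end(page)          # mark as most recently used
        else:
            misses += 1                      # page must be fetched
            cache[page] = True
            if len(cache) > cache_pages:
                cache.popitem(last=False)    # evict least recently used page
    return misses

random.seed(0)
keys = list(range(100_000))
shuffled = keys[:]
random.shuffle(shuffled)

sorted_misses = count_page_misses(keys)      # ascending key order
random_misses = count_page_misses(shuffled)  # unordered insertion
```

With ascending keys each page is fetched exactly once, whereas unordered keys touch pages at random and nearly every insertion misses the cache.  This is the behavior that the sort steps of the multi-step protocol are intended to avoid.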

 

Data Sets

 

The main data sets used were from the NCBI GenBank collection (ftp://ftp.ncbi.nlm.nih.gov):

 

1.    The GenBank short directory, gbsdr.txt consisting of locus codes which, at this writing,  is approximately 1.45 billion bytes in length and has approximately 18.2 million entries.

2.    The nucleotide data, the accessions, are stored in over 300 gzip compressed files.  Each file is about 220 megabytes long and consists of nucleotide accessions.  We pre-process each file with a filter program that extracts text and other information.  Pre-processing results in a format similar to the EMBL (European Molecular Biology Laboratory) format, which makes for faster processing in subsequent steps and greatly reduces disk storage requirements.  For example, the file gbbct1.seq, currently 250,009,587 bytes in length, was reduced to 8,368,295 bytes after pre-processing.

3.    Optionally, a list of NCBI manually derived multi-word keys from the file gbkey.idx (502,549,211 bytes).  Processing of these keys is similar to that of derived keys but only a default, minimum weight is produced. 

4.    In addition to text found in the accessions, GenBank, as well as many other data resources, contains links to on-line libraries of journal articles, books, abstracts and so forth.  While not exploited in this example, these links can provide additional sources of text keys related to the accessions in the originating database.  They can also be used to provide cross-references between multiple databases through citation matching.  Additionally, these articles are usually manually indexed into keyword hierarchies such as MeSH, and this information can be used to place the accessions into the framework of alternative indexing schemes.

 

Multiple Step Protocol

 

The multiple step protocol, shown in Figure 1, separated the work into several steps and was based on the observation that using system sort facilities to preprocess data files resulted in much faster database creation, since the keys can be loaded into the B-tree database in ascending key order.  This observation was founded on an early experiment in which an accession-term matrix was constructed by loading the keys from a file of 5 million accessions sorted by ascending accession key.  The load procedure itself used a total of 1,032 seconds (17.2 minutes).  On the other hand, loading the keys directly from a file not sorted by accession was 7.1 times slower, requiring 7,333 seconds (122.2 minutes).

 

The main text analysis procedure (readerb.cgi) read the filtered version of the accession files.  Lines of text were scanned, punctuation and extraneous characters were removed, words matching entries in the stop list were discarded, and, finally, words were processed to remove suffixes and consolidated into groups based on word stems.  Each term was written to an output file (words.out) along with its accession code and a code indicating the source of the data.  A second file (xwords.out) was also produced that associated each processed stem with the original form of the term.  The output files were sorted concurrently.  The xwords.out file was sorted by term with duplicate entries discarded, while the words.out file was sorted to two output files: words.sorted, ordered by term then accession code, and words.sorted.2, ordered by accession code then term.
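
The text analysis step can be sketched as follows.  This is an illustrative Python approximation, not the code of readerb.cgi: the stop list, the suffix rules, the accession code and the source code letter "T" are all assumptions made for the example, and the stemmer is far cruder than a production one.

```python
import re

STOP_WORDS = {"the", "of", "and", "from", "in", "a"}   # toy stop list (assumed)
SUFFIXES = ("ing", "es", "s")                           # naive suffix rules (assumed)

def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_line(accession, source_code, text):
    """Return (stem, accession, source) records and (stem, original) pairs."""
    records, originals = [], []
    # Lower-case the text and keep only runs of letters and digits.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        if word in STOP_WORDS:
            continue
        s = stem(word)
        records.append((s, accession, source_code))   # analogous to words.out
        originals.append((s, word))                   # analogous to xwords.out
    return records, originals

records, originals = index_line(
    "ACC0001", "T", "Cloning of the zyxin genes from chicken."
)
```

Sorting the first list by stem and then accession, and separately by accession and then stem, would give the two orderings described above for words.sorted and words.sorted.2.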

 

The file words.sorted was processed to count word usage (readerd.cgi).  As the file was ordered by term then accession code, multiple occurrences of a word in a document appeared on successive lines.  The program deleted words whose overall frequency of occurrence was too low or too high.  Files df.lst and dict.lst were produced which contain, respectively, for each term, the number of accessions in which it appears and its total number of occurrences.
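
Because the input is sorted by term, the counts can be produced in a single sequential pass.  The sketch below shows the idea in Python; the record layout and the frequency thresholds are placeholders and are not the values or file formats used by readerd.cgi.

```python
from itertools import groupby

def term_statistics(sorted_records, min_count=2, max_count=1000):
    """Compute per-term document frequency and total occurrences.

    `sorted_records` must be (term, accession) pairs ordered by term and then
    accession, so that all lines for a term are adjacent.  The thresholds are
    placeholders for illustration.
    """
    df, totals = {}, {}
    for term, group in groupby(sorted_records, key=lambda r: r[0]):
        accessions = [acc for _, acc in group]
        total = len(accessions)
        if total < min_count or total > max_count:
            continue                       # too rare or too common: discard
        totals[term] = total               # analogous to dict.lst
        df[term] = len(set(accessions))    # analogous to df.lst
    return df, totals

records = sorted([
    ("zyxin", "ACC0001"), ("zyxin", "ACC0001"), ("zyxin", "ACC0002"),
    ("kinase", "ACC0002"),
])
df, totals = term_statistics(records)
```

In this toy input, zyxin occurs three times in two accessions and is kept, while kinase occurs only once and falls below the lower threshold.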

 

The file words.sorted.2 (sorted by accession code and term) was processed by readerg.cgi to produce words.counted.2, which gave, for each accession, the number of times each term occurred in the accession and a string of code letters giving the original sources of the term (from the original input line codes).  This file was ordered by accession and term.

 

The files xwords.sorted, df.lst, dict.lst and words.counted.2 were processed by readerc.cgi to produce internal data vectors and an output file named weighted.words which contained the accession code, term, the calculated inverse document frequency weight of the term, and source code(s) for the term.  If the calculated weight of a term in an accession was below a threshold, it was discarded.  Since the input file words.counted.2 was ordered by accession then by term, the output file weighted.words was also ordered by accession then term.
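
The weighting step can be sketched as below.  The exact weighting formula is not given here, so the sketch assumes a common textbook form, the within-accession term count multiplied by log2(N / document frequency); the function names, the threshold and the toy data are likewise assumptions.

```python
import math

def idf_weight(term_count, doc_freq, n_docs):
    """Assumed weighting: term count times log2(N / document frequency)."""
    return term_count * math.log2(n_docs / doc_freq)

def weight_rows(counted, df, n_docs, threshold=1.0):
    """Yield (accession, term, weight) rows whose weight meets the threshold.

    `counted` holds (accession, term, count) rows ordered by accession then
    term; output order follows input order.  Terms missing from `df` were
    pruned in an earlier step and are skipped.
    """
    for accession, term, count in counted:
        if term not in df:
            continue
        w = idf_weight(count, df[term], n_docs)
        if w >= threshold:
            yield accession, term, round(w, 3)

counted = [
    ("ACC0001", "kinase", 1),
    ("ACC0001", "zyxin", 2),
    ("ACC0002", "kinase", 3),
]
df = {"kinase": 16, "zyxin": 4}
weighted = list(weight_rows(counted, df, n_docs=16))
```

Under this assumed formula a term that appears in every accession receives a weight of zero and is dropped, while rarer terms are retained with higher weights.  Because rows are emitted in input order, the output keeps the accession-then-term ordering.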

 

Finally, Nrml3a.cgi constructed the term-accession matrix (^I) from the term sorted file weighted.words and Nrml3.cgi built the accession-term matrix (^D) from wgted.words.sorted which was ordered by accession and term.  In this final step, the database assumed its full size, and it was this step that was most critical in terms of time.  As each of the matrices was loaded in order of its first, then second, index, the B-tree was built in ascending key order.

 

Retrieval

 

Retrieval is via a web page interface or an interactive keyboard based program.  Queries are expressed as logical expressions of terms, possibly including wildcards.  Queries may be restricted to particular data sources (title, locus, etc.) or specific divisions (ROD, PRI, etc.).  When a query is submitted to the system, it is first expanded to include related terms for any wildcards.  The expression is converted into a Mumps expression, and candidate accession codes of those accessions containing terms from the query are identified.  The Mumps expression is applied to each identified candidate accession.  A similarity coefficient between the accession and the query is calculated based on the weight of the terms in the accessions using a simple similarity formula.
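
A much reduced sketch of this retrieval flow is given below.  It handles only a conjunction (AND) of terms rather than general logical expressions, uses shell-style wildcards, and scores each accession by the sum of the matching term weights as a placeholder for the similarity formula, which is not specified here.  The index contents and all names are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical toy inverted index: term -> {accession: weight}.
INDEX = {
    "anthracis":     {"ACC0001": 3.0, "ACC0003": 2.0},
    "bacteriophage": {"ACC0001": 4.0, "ACC0002": 1.5},
    "bacterium":     {"ACC0002": 1.0},
}

def expand(pattern):
    """Expand a wildcard pattern to the matching index terms."""
    return sorted(t for t in INDEX if fnmatchcase(t, pattern))

def search_all(patterns):
    """AND query: each pattern must match at least one term of the accession.

    The score is the sum of the matching term weights (an assumed stand-in
    for the similarity coefficient).  Results are ranked by descending score.
    """
    scores = None
    for pattern in patterns:
        hits = {}
        for term in expand(pattern):
            for acc, w in INDEX[term].items():
                hits[acc] = hits.get(acc, 0.0) + w
        if scores is None:
            scores = hits
        else:
            # Keep only accessions that satisfied every pattern so far.
            scores = {a: scores[a] + hits[a] for a in scores if a in hits}
    return sorted((scores or {}).items(), key=lambda kv: (-kv[1], kv[0]))

results = search_all(["anthracis", "bacterio*"])
```

Only the accession containing both terms survives the conjunction; the candidates are drawn from the inverted index first, so accessions containing none of the query terms are never examined.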

 

From the accessions retrieved,  the user can view the original NCBI accession page, save the accession list for further reference or convert the accessions to FASTA format and match the accessions against a candidate sequence with the FASTA or the Smith-Waterman algorithm.  By means of the GI code from the VERSION field in the original GenBank accession, a user can access full data concerning the retrieved accession directly from NCBI.  Also stored is the Medline access code which provides direct entry into the Medline database for the accession.
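
Conversion of a retrieved accession to FASTA format amounts to writing a header line beginning with ">" followed by the sequence broken into fixed-width lines.  The following Python sketch shows the idea; the function name, the 60-character line width and the example record are assumptions for illustration and do not describe the converter used in the system.

```python
def to_fasta(accession, description, sequence, width=60):
    """Format one sequence as a FASTA record with fixed-width sequence lines."""
    lines = [f">{accession} {description}".rstrip()]
    lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"

# Hypothetical 80-base toy sequence.
record = to_fasta("ACC0001", "toy sequence", "ACGT" * 20)
```

Records produced this way for each retrieved accession can be concatenated into a single file that serves as the reduced library for a subsequent alignment run.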

 

Retrieval times are proportional to the amount of material retrieved, the complexity of the query, and the number of accessions in which each query term appears.  For specific queries that retrieve only a few accessions, processing times of less than 1 second are typical.

 

Results and Discussion

 

Some overall processing statistics for the two protocols are given in Table 1.  As can be seen, the multi-step protocol performed significantly better than the basic protocol. 

 

Table 1 - Processing Time Statistics (in minutes)

Accessions Processed     1,000,000     5,000,000    22,318,882
Multi-Step Protocol           63.9         350.8       2,016.1
Basic Protocol               246.9      1,735.61       6,994.7

 

The dimensions of the final matrices are potentially of vast size: 22.3 million by 501,614 in this case.  Potentially, this implies a matrix of approximately 11.2 trillion elements (22,318,882 x 501,614).  However, the matrix is very sparse and the file system stores only those elements which actually exist.  After processing the entire GenBank, the actual database was only 23 GB although at its largest, before compaction of unused space, it reached 44 GB.

 

Evaluation of retrieval effectiveness from a data set of this size is clearly difficult, as there are few benchmarks against which to compare the results.  However, NCBI distributes a file of keyword phrases with GenBank, gbkey.idx (502,549,211 bytes).  This file contains keyword phrases assigned by submission authors and the associated accession identifiers.  Of the 48,023 unique keys in gbkey.idx (after removal of special characters and words less than three characters in length), 26,814 keys were the same as the keys selected by MARBL.  The 21,209 keys that differed were, for the most part, words of very high or low frequency that the system rejected due to preset thresholds.  For its part, the MARBL system identified and retained 501,614 highly specific terms, many of which were specific codes used to identify genes.

 

When comparing the accessions linked to keywords in gbkey.idx with MARBL derived accessions, it was clear that MARBL discovered vastly more linkages than the NCBI file identified.  For example, the keyword zyxin (the last entry in gbkey.idx) was linked to 4 accessions by gbkey.idx, but MARBL detected 336 accessions.  In twelve other queries based on terms randomly selected from gbkey.idx, MARBL found more accessions than were listed in gbkey.idx in nine cases and the same number in three cases.  On average, each MARBL derived keyword points to 130.34 accessions whereas gbkey.idx keys, on average, point to 6.80 accessions.

 

We compared MARBL with BLAST by entering the nucleotide sequence of a Bacillus anthracis bacteriophage that was of interest to a local researcher.  BLAST retrieved 24 accessions, with one scoring 1,356 versus the next highest with a score of 50.  The highest scoring accession was the correct answer, while the remainder were noise.  When we entered the phrase anthracis & bacteriophage into the MARBL retrieval package, only one accession was retrieved, the same one that received the highest score from BLAST.  BLAST took 29 seconds; MARBL retrieval took 10 seconds.  It should be noted, however, that BLAST searches are not based on keywords but on genomic sequences.

 

Mumps is an excellent text indexing implementation language (O'Kane, 1992).  Mumps programs are concise and are easily maintained and modified.  The string indexed global arrays, underpinned by the effectively unlimited file sizes supported by the BDB, make it possible to design very large, efficient systems with minimal effort.  In all, there were 10 main indexing routines with a total of 930 lines of Mumps code (including comments), for an average of 93 lines of code per module.  On the other hand, the C programs generated by the Mumps compiler amounted to 21,146 lines of code, not counting many thousands of lines in run-time support and database routines.  The size of the C routines is comparable to reported code sizes for other information retrieval projects such as Instruct, which Wade (1988) reports is approximately 6,000 lines of Pascal code, and Plexus (Vickery and Brooks, 1987a), reported as approximately 10,000 lines, although, due to differences in features, these figures should not be used for direct comparisons.

 

We expect in the near future to begin work involving clustering of accessions and the development of hierarchical dictionaries of terms based on the vector space model as well as development of additional database filters thereby allowing integration of other genomic databases into the overall system.  Copies of all software in source code form are available from: http://www.cs.uni.edu/~okane.

 

References

 

Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997). "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs",  Nucleic Acids Res., 25, 3389-3402.

 

American National Standards Institute, Inc. (1995). ANSI/MDC X11.4 1995 Information Systems - Programming Languages - M, American National Standards Institute, 11 West 42nd Street, New York, New York 10036.

Barker, W.C., et al. (1999). The PIR-International Sequence Database, Nucleic Acids Research, 27(1), 39-43.

Barnett, G.O. & Greenes, R.A. (1970). High level programming languages, Computers and Biomedical Research, 3, 488-497.

Bowie, J. & Barnett, G.O. (1976). MUMPS - an economical and efficient time sharing language for information management, Computer Programs in Biomedicine, 6, 11-21.

Crouch, C.J. (1988). An analysis of approximate versus exact discrimination values, Information Processing and Management, 24(1), 5-16.

El-Hamdouchi, A. & Willett, P. (1988). An improved algorithm for the calculation of exact term discrimination values, Information Processing and Management, 24(1), 17-22.

El-Hamdouchi, A. & Willett, P. (1989). Comparison of hierarchic and agglomerative clustering methods for document retrieval, The Computer Journal, 32(3), 220-227.

 

Hazel, P. (1997), Perl Compatible Regular Expression Library, Cambridge: University of Cambridge Computing Service.


Linden, G. & Smith, B. (2003). Amazon.com Recommendations: Item-to-Item Collaborative Filtering, IEEE Distributed Systems Online, http://dsonline.computer.org/0301/d/w1lind.htm

O'Kane, K.C. (1980). An RT-11 single user standard MUMPS interpreter, MUMPS Users' Group Quarterly, 10, 5-6.

O'Kane, K.C. (1983). A portable hybrid MUMPS development system host, Proc IEEE Computer Society 7th International Computer Software Applications Conference, 60-65.

O'Kane, K.C. (1992). A language for implementing information retrieval software, Online Review, 16(3), 127-137.

O'Kane, K.C. (1999). An M Compiler for Internet server applications, M Computing, 7(1), 11-17.

Pearson, W.R. (2000). Flexible sequence similarity searching with the FASTA3 program package, Methods Mol. Biol., 132, 185-219.

 

Pollitt, S. (1987). Cansearch: an expert systems approach to document retrieval, Information Processing and Management, 23(2), 119-138.

Salton, G. (1968). Automatic Information Organization and Retrieval, New York: McGraw Hill Book Company.

Salton, G. & McGill, M.J. (1983). Introduction to Modern Information Retrieval, New York: McGraw Hill Book Company.

Salton, G. (1988). Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer,  Reading Massachusetts: Addison Wesley.

Salton, G. (editor) (1971). The SMART Retrieval System: Experiments in Automatic Document Processing, Englewood Cliffs, New Jersey: Prentice Hall.

 

Shpaer, E. G., Robinson, M, et al. (1996). Sensitivity and Selectivity in Protein Similarity Searches: Comparison of Smith-Waterman in Hardware, Genomics, 38, 179-191.


Sleepycat Software, Inc. (2001). Berkeley DB, Indianapolis, IN: New Riders, ISBN 0-7357-1064-3

Smith, T.F. & Waterman, M.S. (1981). Identification of common molecular subsequences, J. Mol. Biol., 147(1), 195-197.

Thure, E. & Argos, P. (1993a). SRS - an indexing and retrieval tool for flat file data libraries, Comput. Appl. Biosci., 9, 49-57.

Thure, E. & Argos, P. (1993b). Transforming a set of biological flat file libraries to a fast access network, Comput. Appl. Biosci., 9, 59-64.

U.S. National Library of Medicine (2003). 2003 MeSH, Annotated Alphabetic List, 8600 Rockville Pike, Bethesda, MD 20894, NTIS Order number: PB2003-96480. (http://www.nlm.nih.gov/mesh/meshhome.html)

Vickery, A. & Brooks, H.M. (1987a). PLEXUS: an expert system for referral, Information Processing and Management, 23, 99-117.

Vickery, A. & Brooks, H.M. (1987b). A reference referral system using expert system techniques, Journal of Documentation, 43, 1-23.

Wade, S.J. & Willett, P. (1988). INSTRUCT: a teaching package for experimental methods in information retrieval. Part III. Browsing, clustering and query expansion, Program, 22, 44-61.

Willett, P. (1985). An algorithm for the classification of exact term discrimination values, Information Processing and Management, 21(3), 225-232.

Williams, J.H. (1963). A discriminant method for automatically classifying documents, Proc 1963 Fall Joint Computer Conference, 161-166.

Wu, C.H., et al. (2003). The Protein Information Resource, Nucleic Acids Res., 31(1), 345-347.

 

Figure 1 - Multi-Step Protocol Flow Chart