MARBL (Mumps Analysis and Retrieval from Bioinformatics Libraries)

(See also: O'Kane, K.C.; & Lockner, M.J, Indexing genomic sequence libraries, Information Processing & Management, Volume 41, Issue 2 , March 2005, Pages 265-274)

Kevin C. O'Kane
University of Northern Iowa
Cedar Falls, IA 50614-0507
okane@cs.uni.edu

Overview and Background

This system, MARBL (Mumps Analysis and Retrieval from Bioinformatics Libraries), is an implementation and toolkit to integrate multiple, very large, genomic, data bases into a unified data repository through open-source components and provide fast, web-based access to the results. This system differs from existing genomic text retrieval systems in that:

  1. it derives the index terms from the text rather than manual assignment;

  2. it is fully built upon open-source components;

  3. it is coded in an extended implementation language that supports multi-dimensional data bases; and,

  4. it can index in a matter of hours the largest genomic data sets.

While natural language text indexing can be approached from the perspective of assignment to pre-existing categories and hierarchies such as the National Library of Medicine MeSH (Medical Subject Headings) (Hazel, 1997), derivative indexing is better able to adapt to changes in a rapidly evolving discipline as the terms are dynamically extracted directly from the source material rather than waiting for manual analysis. Existing keyword based genomic retrieval systems are primarily based on assignment indexing whereas the approach taken here is based on derivative indexing, where both queries and documents are encoded into a common intermediate representation and metrics are developed to calculate the coefficients of similarity between queries and documents. Documents are ranked according to their computed similarity to the query and presented to the user in rank order. Several systems employing this and related models have been implemented such as Smart (Salton, 1968, 1971, 1983, 1988), Instruct (Wade, 1988), Cansearch (Pollitt, 1987) and Plexus (Vickery, 1987a, a987b). More recently, these approaches have been used to index Internet web pages and provide collaborative filtered recommendations regarding similar texts to book buyers at Amazon.com (Linden, 2003).

In this system, genomic accessions are represented by vectors that reflect accession content through descriptors derived from the source text by analysis of word usage (Salton,1968, 1971, 1983, 1988; Willett, 1985; Crouch, 1988). This approach can be further enhanced identifying clusters of similar documents (El-Hamdouchi et al.,1988, 1989). Likewise, term-term co-occurrence matrices can be developed that can be used to identify similar or related terms and these can be automatically included into queries to enhance recall or to identify term clusters. Other techniques based on terms and queries have also been explored (Salton, 1988; Williams, 1983).

The vector model is rooted in the construction of document vectors that consisting of the weights of each term in each document. Taken collectively, the document vectors constitute a document-term matrix whose rows are document vectors. A document-term matrix can have millions of rows, more than 22 million in GenBank case, and thousands of columns (terms), more than 500,000 in GenBank. This yields a matrix with potentially trillions of possible elements which must be quickly addressable not by numeric indices but also by text keys. Additionally, to enhance retrieval speed, an inverted matrix of the same size is needed which doubles the storage requirements. Fortunately, however, both matrices are very sparse.

Implementation Indexing Phase

Filter programs read the input files (for example, NCBI GenBank files). Lines of text are scanned, punctuation and extraneous characters are removed, and words matching entries in the stop list are discarded. Each term is written to an output file (words.out) along with its accession code and a code indicating the source of the data. Optionally, for each accession containing a Medline identification code, Medline abstracts are downloaded and the texts of the abstracts processed to extract additional words. These words are also added to words.out along with the original accession number associated with the abstract. The output files are sorted concurrently. The file words.out file is sorted to two output files: words.sorted, ordered by term then accession code, and words.sorted.2 ordered by accession code then term.

The file words.sorted is processed to count word usage. As the file is ordered by term then accession code, multiple occurrences of a word in a document appear on successive lines. The program delets words whose overall frequency of occurrence is too low or high. Files df.lst and dict.lst are produced which contain, respectively, for each term, the number of accessions in which it appears in, and the total number of occurrences.

The file words.sorted.2 (sorted by accession code and term) is processed to produce words.counted.2 which generats for each accession, the number of times each term occurrs in an accession and a string of code letters giving the original sources of the term (from the original input line codes). This file is ordered by accession and term.

The files df.lst, dict.lst and words.counted.2 are processed by to produce internal data vectors and an output file named weighted.words which contains the accession code, term, the calculated inverse document frequency weight of the term, and source code(s) for the term. If the calculated weight of a term in an accession is below a threshold, it is discarded. Since the input file words.counted.2 is ordered by accession then by term, the output file weighted.words is also ordered by accession then term.

Finally, the Nrml3a.cgi constructs the term-accession matrix (^I) from the term sorted file weighted.words and Nrml3.cgi builds the accession-term matrix (^D) from wgted.words.sorted which was ordered by accession and term. In this final step, the data base assumes its full size and it is this step that is most critical in terms of time and space. As each of the matrices are ordered according to their first, then second indices, the B-tree is built in ascending key order, thus improving overall performance substantially.

Implementation - Retrieval Phase

After indexing has been completed, there are several programs to handle retrieval both from keyboard interaction or, more typically, web based interaction. In the web based model, retrieved accessions are presented to the user along with links to the original NCBI GenBank pages. Retrieved accessions may also have their sequences extracted to a FASTA format file for downloading or further processing or, alternatively, the Genbank pages on each accession can be extracted to a file for download.

If the sequence data form accession is extracted to a FASTA file, the resulting FASTA file may be further processed. Amonth the options are: a fasta34 search of the file, a Perl search, or a clustalw analysis of the sequences. Since clustalw can be very time consuming, there are limits placed on the number of sequences that can be compared at one time. A Smith-Waterman analysis will soon be available.


References

Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schäffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997). "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res., 25, 3389-3402.

American National Standards Institute, Inc. (1995). ANSI/MDC X11.4.1995 Information Systems . Programming Languages - M, American National Standards Institute, 11 West 42nd Street, New York, New York 10036.

Barker, W.C., et. al. (1999). The PIR-International Sequence Database, Nucleic Acids Research, 27(1) 39-43.

Barnett, G.O. & Greenes, R.A. (1970). High level programming languages, Computers and Biomedical Research, 3, 488.497.

Bowie, J & and Barnett, G. O. (1976). MUMPS . an economical and efficient time.sharing language for information management, Computer Programs in Biomedicine, 6, 11.21.

Crouch, C.J. (1988). An analysis of approximate versus exact discrimination values, Information processing and Management, 24(1), 5.16.

El.Hamdouchi, A.; & Willet, P. (1988). An improved algorithm for the calculation of exact term discrimination values, Information Processing and Management, 24(1) 17.22.

El.Hamdouchi, A.; and Willet, P. (1989). Comparison of hierarchic and agglomerative clustering methods for document retrieval, The Computer Journal, 32(3) 220.227.

Hazel, P. (1997), Perl Compatible Regular Expression Library, Cambridge: University of Cambridge Computing Service.

Linden, G., & Smith, B. (2003). Amazon.com Recommendations: Item-to-Item Collaborative Filtering, IEEE Distributed Systems Online, http://dsonline.computer.org/0301/d/w1lind.htm

O'Kane, K.C. (1980). An RT.11 single user standard MUMPS interpreter, MUMPS Users' Group Quarterly, 10, 5.6.

O'Kane, K.C. (1983). A portable hybrid MUMPS development system host, Proc IEEE Computer Society 7th International Computer Software Applications Conference, 60.65.

O'Kane, K.C. (1992). A language for implementing information retrieval software, Online Review, 16(3) 127-137.  

O'Kane, K.C. (1999). An M Compiler for Internet server applications, M Computing, pp 11-17, 7(1) 11-17.

Pearson, W. R. (2000). Flexible sequence similarity searching with the FASTA3 program package Methods Mol.Biol., 132, 185-219

Pollitt, S. (1987). Cansearch: an expert systems approach to document retrieval, Information Processing and Management, 23(2), 119.138.

Salton, G. (1968). Automatic Information Organization and Retrieval, New York: McGraw.Hill Book Company.

Salton, Gerard; and McGill, Michael J.; Introduction to Modern Information Retrieval, McGraw.Hill Book Company (New York, 1983).

Salton, G. (1988). Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, Reading Massachusetts: Addison.Wesley.

Salton, G. (editor) (1971). The SMART Retrieval System: Experiments in Automatic Document Processing, Englewood Cliffs, New Jersey: Prentice.Hall.

Shpaer, E. G., Robinson, M, et al. (1996). Sensitivity and Selectivity in Protein Similarity Searches: Comparison of Smith-Waterman in Hardware, Genomics, 38, 179-191.

Sleepycat Software, Inc. (2001). Berkeley DB, Indianapolis, IN: New Riders, ISBN 0-7357-1064-3

Smith, T. F., & Waterman, M.S. (1981) Identification of common molecular subsequence, J. Mol. Biol., 147(1), 195-197.

Thure, E. & Argos, P. (1993a). SRS an indexing and retrieval tool for flat file data libraries. Comput. Appl. Biosci., 9, 49-57.

Thure, E. & Argos, P. (1993b). Transforming a set of biological flat file libraries to a fast access network. Appl. Biosci., 9, 59-64.

U.S. National Library of Medicine (2003), 2003 MeSH, Annotated Alphabetic List, 8600 Rockville Pike, Bethesda, MD 20894, NTIS Order number: PB2003-96480. (http://www.nlm.nih.gov/mesh/meshhome.html)

Vickery, A.; and Brooks, H.M. (1987a). PLEXUS: an expert system for referral, Information Processing and Management, 23, 99.117.

Vickery, A.; & Brooks, H.M. (1987b). A reference referral system using expert system techniques, Journal of Documentation, 43, 1.23.

Wade, S.J.; and Willett, P. (1988). INSTRUCT: a teaching package for experimental methods in information retrieval. Part III. Browsing, clustering and query expansion, Program, 22, 44.61.

Willett, P. (1985). An algorithm for the classification of exact term discrimination values, Information Processing and Management, 21(3), 225.232.

Williams, J.H. (1963). A discriminant method for automatically classifying documents, Proc 1963 Fall Joint Computer Conference, 161.166.

Wu, C.H., et. al. (2003). The Protein Information Resource, Nucleic Acids Res., 31(1), 345-347.

November 30, 2004