Mumps Bioinformatics Software Library
Copyright (C) 2004, 2005 by Kevin C. O'Kane  

Kevin C. O'Kane
anamfianna@earthlink.net
okane@cs.uni.edu

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA

Genomic Data Base Toolkit
Indexing Genbank
Sept 18, 2005

Note: you must download and install the Mumps Compiler to use this software (see below). The Mumps Compiler is free and distributed under the GPL.

See also this example transcript of an install

The major script files are:

  1. 00BUILD - sets parameters for all compilations and builds. Run this file to build the distribution. It invokes autoconf, configure and make, and performs other housekeeping functions.
  2. configure.ac - configures the software.
  3. Index - reads GenBank accessions, fetches Medline abstracts and indexes the text. It is built by configure. Run this file to index your data base.
  4. mumpsc - part of the Mumps Compiler. This script is used to compile both Mumps programs and C++ programs that use the Mumps libraries. When compiling C++ programs, this script gives access to the header and object files needed for the C++ Mumps interface. The distribution contains binary executable Linux programs. These should work on most recent Linux systems if the appropriate libraries (see below) have been installed.


To build, install and test, do the following:

  1. Download and unzip the Mumps Compiler from:

    http://www.cs.uni.edu/~okane

  2. After unzipping/untarring the distribution, enter the directory "mumpsc" created by the distribution.

  3. If you are "root", enter the command:

    ./configure prefix=/usr

    If you are not "root", enter the command:

    ./configure

    The script "configure" has many options and these are detailed in mumpsc/doc/compiler.html. You should check these. You may need to install additional open-source software on your machine such as the Perl Compatible Regular Expression Library. See the Mumps documentation for details.

  4. Enter the commands:

    make
    make install

  5. If you are not "root", the Mumps distribution will be placed in "~/mumps_compiler". You must add "~/mumps_compiler/bin" to your PATH variable in order to gain access to the files. If you used the "root" procedure, the executables have been placed in "/usr/bin" and the libraries in "/usr/include/mumpsc" and "/usr/lib". See the documentation in "mumpsc" and in "mumpsc/doc/compiler.html" for full details. It is recommended that you do the install as "root".

    You may also need to install additional software such as the Berkeley Data Base and the Perl Compatible Regular Expression Library (PCRE). These are freely available and there are instructions in mumpsc/doc/compiler.html.

    Please note that there are two data base subsystems available: the "native" and the Berkeley Data Base (BDB). The BDB is more stable. The native data base is provided with the Mumps Compiler. If you work only with the native data base, you do not need the BDB.

    Note: many Linux distributions now automatically install the PCRE.
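
For a non-root install, the PATH change described in step 5 might look like the following sketch. It assumes a Bourne-style shell such as sh or bash; the variable name MUMPS_BIN is only illustrative and is not part of the distribution.

```shell
# Prepend the non-root install location from step 5 to PATH.
# Assumes a Bourne-style shell (sh, bash). MUMPS_BIN is an illustrative name.
MUMPS_BIN="$HOME/mumps_compiler/bin"
case ":$PATH:" in
    *":$MUMPS_BIN:"*) ;;                        # already present, do nothing
    *) PATH="$MUMPS_BIN:$PATH"; export PATH ;;
esac

# Check whether the shell can now find the compiler script.
command -v mumpsc || echo "mumpsc not found on PATH"
```

To make the change permanent, the same lines can be placed in a shell startup file such as ~/.profile.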

  6. Download and decompress some GenBank sequence files into a directory of your choosing. Download only data files such as gbbct*, gbinv*, gbpat*, gbphg*, gbpln*, gbpri*, gbrod*, gbvrl* and gbvrt*. Do not download the ancillary files such as gbsdr*. The gbest* files can be indexed but they contain relatively little text data at this time.
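
The file selection in step 6 can be sketched as a pair of shell functions. This is only an illustration: the function names are hypothetical, and it assumes the downloaded files are gzip-compressed (names ending in ".gz") and sit together in one directory.

```shell
# Succeed if the file name starts with one of the data-file prefixes from step 6.
is_genbank_data_file() {
    case "$(basename "$1")" in
        gbbct*|gbinv*|gbpat*|gbphg*|gbpln*|gbpri*|gbrod*|gbvrl*|gbvrt*) return 0 ;;
        *) return 1 ;;
    esac
}

# Decompress every matching .gz file in directory $1 and leave other files alone.
# Assumption: the downloads are gzip-compressed.
decompress_genbank_dir() {
    for f in "$1"/*.gz; do
        [ -e "$f" ] || continue          # no .gz files present
        if is_genbank_data_file "$f"; then
            gunzip "$f"
        fi
    done
}
```

For example, "decompress_genbank_dir ." would decompress the data files in the current directory.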

  7. Edit 00BUILD and answer the questions. These include where the data files are, and so on.
  8. Run 00BUILD. Note: you must have a recent autoconf available on your system as well as recent versions of the gcc and g++ compilers. Some Linux distributions provide by default an old version of autoconf which will not work. Find a version with a version number of 2.59. 00BUILD will configure and compile the software. Watch for error messages, because the process does not always halt on error.
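
Before running 00BUILD, the installed autoconf version can be checked with a sketch like the one below. It assumes that versions newer than 2.59 are also acceptable, which the text above does not confirm, and it relies on GNU "sort -V". The function name version_at_least is illustrative.

```shell
# Succeed if version $1 is at least version $2 (relies on GNU sort -V).
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# The first line of "autoconf --version" ends with the version number.
if command -v autoconf >/dev/null 2>&1; then
    have=$(autoconf --version | head -n 1 | awk '{print $NF}')
    if version_at_least "$have" 2.59; then
        echo "autoconf $have looks recent enough"
    else
        echo "autoconf $have is older than 2.59"
    fi
else
    echo "autoconf is not installed"
fi
```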

  9. To extract the indices and build the data base:

    nohup ./Index &

    Note: this job can take a very long time to complete if you are indexing a large number of accessions. It is best done on a fast machine with lots of memory. You should have at least 1 GB of memory and the machine should be otherwise unoccupied. It is best if you have high speed SCSI RAID 0 devices for your file system. Run a small data set (e.g., gbvrl) first.

  • Try a text query. First, if you are using the Berkeley Data Base, check to see if the file "DB_CONFIG" is present. This file sets the size of the Berkeley Data Base cache (the "native" cache is preset in "mumpsc/include/mumpsc/btree.h"). The second numeric field in this one-line file is the cache size (the first number is the cache size in gigabytes, the second is the size in bytes). A very large cache size can cause the interactive query programs to run slowly. You should probably remove this file (the build routines re-create it) or set the cache size to something small (a few megabytes).
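
As a sketch, a small cache could be written to DB_CONFIG as follows. This assumes the file uses the standard Berkeley DB "set_cachesize" directive (gigabytes, bytes, number of cache regions); compare it with the DB_CONFIG that the build created before replacing anything. The function name write_db_config and its parameters are illustrative.

```shell
# Write a one-line DB_CONFIG with a small cache.
# Assumption: the file uses Berkeley DB's standard directive
#   set_cachesize <gigabytes> <bytes> <number of cache regions>
#   $1 = cache size in megabytes
#   $2 = file to write (defaults to DB_CONFIG in the current directory)
write_db_config() {
    cache_mb=$1
    target=${2:-DB_CONFIG}
    printf 'set_cachesize 0 %d 1\n' "$((cache_mb * 1024 * 1024))" > "$target"
}
```

For example, "write_db_config 4" would request a 4 megabyte cache.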

    There are several query processors. The simplest is "kquery.cgi" which is keyboard based:

    ./kquery.cgi

    You will be prompted for a query string. Query strings may consist of a single word, or multiple words separated by blanks and joined by logical operators and parentheses.

    For example:

    query: alanyl
    query: kinas*
    query: alanyl & kinase
    query: ( alanyl & kinase ) | (alanyl & albumin)

    To terminate processing, enter an empty line. If one or more of your terms exceeds a preset threshold with regard to the number of accessions it is linked to, the query will be rejected. The default threshold is 1000. Thus, a term that is linked to 1000 accessions will be rejected. This can be overridden by responding to the "query:" prompt with:

    /searchmax=2000

    If accessions are found, their accession ids and relevance weights will be displayed. The display will be limited by a threshold (see below). After the accession ids are displayed, you will be asked whether you want the sequence data retrieved and copied to a file named "fasta.out". This file will contain the sequence data from the found accessions in FASTA format, suitable for subsequent FASTA analysis. The command to perform the FASTA analysis is of the form:

    ./fasta34.cgi -q -O results.file query.sequence fasta.out

    where the results of the analysis will be placed in "results.file". The query sequence should be placed in the file "query.sequence". To do this, you need to install the fasta package and you should be familiar with its options, especially with regard to substitution matrices.
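
Before running the analysis, it can be useful to check how many sequences were written to "fasta.out". The sketch below relies only on the FASTA convention that each record begins with a header line starting with ">"; the function name is illustrative.

```shell
# Count the sequences in a FASTA file: each record starts with a ">" header line.
count_fasta_records() {
    grep -c '^>' "$1"
}
```

For example, "count_fasta_records fasta.out" prints the number of sequences retrieved by the query.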

    Other options recognized by kquery are:

    /q or /quit - exit. An empty string also causes exit.

    /maxprint=# - maximum number of accession ids to show.

    /division=CODE - search will be restricted to accessions with this division. The valid values of CODE are: All PRI ROD MAM VRT INV PLN BCT VRL PHG SYN UNA EST PAT STS GSS HTG HTC

    /? - shows current settings

    /columns=# - number of columns to display.

  • For web server access, make a symbolic link from your web server's cgi-bin directory to the directory containing the software. Put all the files in the same group as the web server. Point your browser at something like:

    http://sidhe.cs.uni.edu/cgi-bin/marbl/lookup1.cgi

    Replace this URL with your own.
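
The symbolic link for web server access can be sketched as a small function. The function name and its arguments are illustrative; the actual cgi-bin location and the web server's group name depend on your system.

```shell
# Link the software directory into the web server's cgi-bin directory.
#   $1 = directory containing the software
#   $2 = the web server's cgi-bin directory
#   $3 = name of the link (for example "marbl", as in the URL above)
link_into_cgi_bin() {
    ln -s "$1" "$2/$3"
}
```

A call might look like "link_into_cgi_bin /home/user/genbank /var/www/cgi-bin marbl", followed by "chgrp -R apache /home/user/genbank" to set the group. The paths and the group name "apache" here are hypothetical examples.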