Mumps Bioinformatics Software Library
Copyright (C) 2004, 2005 by Kevin C. O'Kane

Kevin C. O'Kane
anamfianna@earthlink.net
okane@cs.uni.edu

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
Genomic Data Base Toolkit
Indexing Genbank
Sept 18, 2005
Note: you must install the Mumps Compiler to use this software. The Mumps Compiler is free and distributed under the GPL.
See also this example transcript of an install
You must download and install the Mumps Compiler to use this code. See below.
The major script files are:
http://www.cs.uni.edu/~okane
./configure prefix=/usr
If not "root" enter the command:
./configure
The script "configure" has many options and these are detailed in mumpsc/doc/compiler.html. You should check these. You may need to install additional open-source software on your machine such as the Perl Compatible Regular Expression Library. See the Mumps documentation for details.
make
make install
You may also need to install additional software such as the Berkeley Data Base and the Perl Compatible Regular Expression Library (PCRE). These are freely available and there are instructions in mumpsc/doc/compiler.html. Please note that there are two data base subsystems available: the "native" and the Berkeley Data Base (BDB). The BDB is more stable. The native data base is provided with the Mumps Compiler. If you work only with the native data base, you do not need the BDB.
Note: many Linux distributions now automatically install the PCRE.
To extract the indices and build the data base:
nohup ./Index &
Note: this job can take a very long time to complete if you are indexing a large number of accessions. It is best done on a fast machine with lots of memory. You should have at least 1 GB of memory and the machine should be otherwise unoccupied. It is best if you have high-speed SCSI RAID 0 devices for your file system. Run a small data set (e.g., gbvrl) first.
There are several query processors. The simplest is "kquery.cgi" which is keyboard based:
./kquery.cgi
You will be prompted for a query string. Query strings may consist of a single word, or multiple words separated by blanks and joined by logical operators and parentheses.
For example:
query: alanyl
query: kinas*
query: alanyl & kinase
query: ( alanyl & kinase ) | (alanyl & albumin)
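The query forms above can be illustrated with a small sketch. This is not the kquery implementation; it is a minimal Python model under stated assumptions: the index is a made-up table mapping terms to accession ids, "&" is assumed to bind more tightly than "|", a trailing "*" is assumed to match any indexed term with that prefix, and words are assumed to be joined by explicit operators. The names INDEX, lookup and evaluate are hypothetical.

```python
import re

# Toy inverted index: term -> set of accession ids (made-up data for illustration).
INDEX = {
    "alanyl": {"A1", "A2", "A3"},
    "kinase": {"A2", "A4"},
    "kinases": {"A5"},
    "albumin": {"A3", "A6"},
}

def tokenize(query):
    # Split into parentheses, operators, and words (a word may end in '*').
    return re.findall(r"[()&|]|[^\s()&|]+", query)

def lookup(term):
    # Assumption: a trailing '*' matches every indexed term with that prefix.
    if term.endswith("*"):
        prefix = term[:-1]
        hits = set()
        for word, ids in INDEX.items():
            if word.startswith(prefix):
                hits |= ids
        return hits
    return set(INDEX.get(term, set()))

def evaluate(query):
    tokens = tokenize(query.lower())
    pos = 0

    def parse_or():
        nonlocal pos
        result = parse_and()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            result = result | parse_and()   # union of both sides
        return result

    def parse_and():
        nonlocal pos
        result = parse_atom()
        while pos < len(tokens) and tokens[pos] == "&":
            pos += 1
            result = result & parse_atom()  # intersection of both sides
        return result

    def parse_atom():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            result = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing ')'")
            pos += 1
            return result
        if tok in (")", "&", "|"):
            raise ValueError("unexpected token: " + tok)
        return lookup(tok)

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected token: " + tokens[pos])
    return result

if __name__ == "__main__":
    for q in ["alanyl", "kinas*", "alanyl & kinase",
              "( alanyl & kinase ) | (alanyl & albumin)"]:
        print(q, "->", sorted(evaluate(q)))
```

With this toy index, the last query returns the accessions that contain "alanyl" together with either "kinase" or "albumin".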
To terminate processing, enter an empty line. If one or more of your terms exceeds a preset threshold on the number of accessions it is linked to, the query will be rejected. The default threshold is 1000. Thus, a term that is linked to more than 1000 accessions will be rejected. This can be overridden by responding to the "query:" prompt with:
/searchmax=2000
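The threshold rule can be sketched as follows. This is an illustration only, not code from kquery: the function names and the per-term counts are hypothetical, and "exceed" is assumed to mean strictly greater than the threshold (the exact boundary behavior of the real program is not confirmed here).

```python
DEFAULT_SEARCHMAX = 1000  # default threshold stated in the text

def parse_searchmax(line, current):
    # Return the new threshold if the line is "/searchmax=<digits>",
    # otherwise keep the current value.
    prefix = "/searchmax="
    if line.startswith(prefix):
        value = line[len(prefix):].strip()
        if value.isdigit():
            return int(value)
    return current

def terms_over_limit(term_counts, searchmax):
    # term_counts maps each query term to the number of accessions it is
    # linked to. Assumption: a term is rejected when its count is strictly
    # greater than the threshold.
    return sorted(t for t, n in term_counts.items() if n > searchmax)

if __name__ == "__main__":
    counts = {"kinase": 1500, "alanyl": 40}   # made-up counts
    limit = DEFAULT_SEARCHMAX
    print(terms_over_limit(counts, limit))    # terms that would block the query
    limit = parse_searchmax("/searchmax=2000", limit)
    print(terms_over_limit(counts, limit))    # nothing blocked after the override
```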
If accessions are found, their accession ids and relevance weights will be displayed. The display will be limited by a threshold (see below). After the accession ids are displayed, you will be asked whether you want the sequence data retrieved and copied to a file named "fasta.out". This file will contain the sequence data from the found accessions in fasta format suitable for subsequent FASTA analysis. The command to perform the fasta analysis is of the form:
./fasta34.cgi -q -O results.file query.sequence fasta.out
where the results of the analysis will be placed in "results.file". The query sequence should be placed in the file "query.sequence". To do this, you need to install the fasta package and you should be familiar with its options, especially with regard to substitution matrices.
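Before running the analysis it can be useful to check what was written to "fasta.out". The following is a minimal sketch, not part of the toolkit: it assumes only the standard fasta layout, in which each record starts with a ">" header line followed by one or more sequence lines. The function name read_fasta is hypothetical.

```python
def read_fasta(lines):
    # Collect (header, sequence) pairs from an iterable of text lines.
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []  # start a new record
        elif header is not None:
            chunks.append(line)            # sequence line of the current record
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

if __name__ == "__main__":
    # Reads the file named in the text; adjust the path if yours differs.
    with open("fasta.out") as handle:
        for name, seq in read_fasta(handle):
            print(name, len(seq))
```

Printing each header with its sequence length gives a quick check that the number of records matches the number of accessions reported by the query.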
Other options recognized by kquery are:
/q or /quit - exit. An empty string also causes exit.
/maxprint=# - maximum number of accession ids to show.
/division=CODE - search will be restricted to accessions with this division. The valid values of CODE are: All PRI ROD MAM VRT INV PLN BCT VRL PHG SYN UNA EST PAT STS GSS HTG HTC
/? - shows current settings
/columns=# - number of columns to display.
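The option syntax listed above can be modeled with a short sketch. This is an illustration under assumptions, not the kquery source: the function name apply_option, the settings dictionary and the returned status strings are hypothetical, division codes are assumed to be matched exactly as listed, and numeric options are assumed to take non-negative integers.

```python
DIVISIONS = ("All PRI ROD MAM VRT INV PLN BCT VRL PHG "
             "SYN UNA EST PAT STS GSS HTG HTC").split()
NUMERIC_OPTIONS = ("searchmax", "maxprint", "columns")

def apply_option(settings, line):
    # Classify one line typed at the "query:" prompt and update settings.
    # Returns "quit", "show", "set", "error", or "query".
    line = line.strip()
    if line == "" or line in ("/q", "/quit"):
        return "quit"                    # empty line, /q and /quit all exit
    if line == "/?":
        return "show"                    # caller would print current settings
    if not line.startswith("/"):
        return "query"                   # ordinary query string
    name, sep, value = line[1:].partition("=")
    if sep and name in NUMERIC_OPTIONS and value.isdigit():
        settings[name] = int(value)
        return "set"
    if sep and name == "division" and value in DIVISIONS:
        settings[name] = value
        return "set"
    return "error"                       # unknown option or bad value

if __name__ == "__main__":
    settings = {"searchmax": 1000}
    for typed in ["/maxprint=25", "/division=VRL", "/division=XYZ", "/?"]:
        print(typed, "->", apply_option(settings, typed), settings)
```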
http://sidhe.cs.uni.edu/cgi-bin/marbl/lookup1.cgi
replace the URL with your URL.