Mumps/MDH Toolkit
Inverse Document Frequency Weighted
Genomic Sequence Retrieval
Version 1.0
Kevin C. O'Kane, Ph.D.
Computer Science Department
University of Northern Iowa
Cedar Falls, IA 50614
okane@cs.uni.edu
http://www.cs.uni.edu/~okane
July 28, 2006

Copyright (c) 2004, 2005, 2006 Kevin C. O'Kane, Ph.D.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation; with the Invariant Sections being: Page 1, with the Front-Cover Texts being: Page 1, and with the Back-Cover Texts being: no Back-Cover Texts.


Contents

  1. Citation
  2. Abstract
  3. Online Demo Version
  4. General Comments
  5. Distributions
    • Full Source Code Distribution
  6. Linux/Cygwin Installation
  7. Sequence Retrieval
  8. Using the Web Interface
  9. Web Examples
  10. Clustering
  11. Cluster Examples
  12. Methodology

  1. Citation

    O'Kane, K.C., The Effect of Inverse Document Frequency Weights on Indexed Sequence Retrieval, Online Journal of Bioinformatics, Volume 6 (2) 162-173, 2005.

  2. Abstract

    This software presents a method to identify weighted n-gram sequence fragments in large genomic databases whose indexing characteristics permit the construction of fast, indexed sequence retrieval programs in which query processing time is determined mainly by the size of the query and the number of sequences retrieved rather than, as is the case in sequential scan based systems such as BLAST, FASTA, and Smith-Waterman, by the size of the database. The weighting scheme is based on the inverse document frequency (IDF) method, a weighting formula that calculates the relative importance of indexing terms based on term distribution. In experiments (see citation above), the relative IDF weights of all segmented, overlapping, fixed-length n-grams of length eleven in the NCBI "nt" and other databases were calculated, and the resulting n-grams were ranked and used to create an inverted index into the sequence file. The system was evaluated on test cases constructed from randomly selected known sequences which were randomly fragmented and mutated, and the results were compared with BLAST and MegaBlast for accuracy and speed. Due to the speed of query processing, the system is also capable of database sequence clustering, examples of which are given below.

  3. Online Demo Version

    An online server version of the IDF Searcher can be accessed through the links below. This version maintains recent (see dates) versions of the NCBI data bases and may be used for searching and testing. The server is a dual Xeon system with 8 GB of main memory and 1.8 TB of RAID 0 disk running Suse Linux. Other activity, including experiments, other users, and student use, will affect timing. From time to time the data bases are updated; during updates, parts of the system will not be available. See Using the Web Interface below for details on how to use the system.

    Bioinformatics Demos on Genbank Data Bases
    1. Viral DB
            Viral Clusters DB
    2. IDF Primate DB
            Primate Clusters
    3. Invertebrate DB
            Invertebrate Clusters
    4. Phage DB
            Phage Clusters
    5. Patent DB
            Patent Clusters
    6. IDF GenBank EST DB
    7. IDF Blast NT DB
    8. IDF Blast NR DB

  4. General Comments

    This software is mainly aimed at Linux users.

    Your Linux system must have autoconf and make available. You must have a recent version of autoconf (we use 2.59). The gcc and g++ compilers must also be available. The system will compile on both 32 and 64 bit systems; however, regardless of processor, your file system must be capable of 64 bit file addresses (most Linux distros are now 64 bit file system based).

    This code has been developed under Mandriva and Suse Linux distros.

  5. Distributions

    The full IDF distribution is to be found in:

      http://www.cs.uni.edu/~okane/source/IDF

    and is named similarly to:

      idf.mar-30-2006.tgz

    (use the latest revision). This package consists mainly of source code.

    Other required software, including the Mumps Compiler, the MDH Toolkit, and documentation, is located in:

    http://www.cs.uni.edu/~okane/source/

    You must install the Mumps Compiler in order to compile the IDF package. The instructions for Mumps installation are in the Mumps distribution package. Before building the Mumps Compiler, reset the value of STR_MAX by passing --with-strmax=16000 to the configure command.
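
    For example, a typical autoconf-style build of the Mumps package might look like the following (the exact steps are given in the Mumps distribution; the install step normally requires root privileges):

      ./configure --with-strmax=16000
      make
      make install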


  6. Linux/Cygwin Installation

    You must install the Mumps Compiler and MDH Toolkit, which are free and available from:

    http://www.cs.uni.edu/~okane

    You must have recent versions of autoconf, make, gcc and g++ available.

    Directories

    The IDF distribution has been reorganized and simplified so that you may build separate versions for separate data bases. Thus, if you are indexing, for example, the Genbank PRIMATES data base (gbpri*.seq), create a directory named:

      idf.gbpri

    or some other name that indicates the directory will contain PRIMATES data (the choice of directory name is yours).

    Next, descend into that directory.
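
    For example:

      mkdir idf.gbpri
      cd idf.gbpri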

    Untar/unzip the files (use the most recently dated distribution):

      tar xvzf idf.mar-30-2006.tgz

    This will create a directory named "idf" in which will be the code and the subdirectory for the data base. Do not rename this directory. You should place this directory on a disk drive that has ample unused space. A full GenBank index of "nt" currently takes 16 GB of disk space (this will increase as the data bases grow).

    1. The latest version of the system relies on make, configure, and script files to build the Linux distribution.

    2. First, modify the parameters in 00BUILD then run 00BUILD (see below). It will create the needed Makefile and it will run configure. At present, 00BUILD is divided into sections for each data base. Uncomment the section for the data base you are indexing before running 00BUILD and adjust parameters and directory addresses as appropriate.

    3. Download the appropriate nucleotide data bases. These will include all or some of the following (at your option - you only need to download the ones with which you want to work):
      ftp://ftp.ncbi.nih.gov/genbank/gbbct*.seq.gz
      ftp://ftp.ncbi.nih.gov/genbank/gbvrl*.seq.gz
      ftp://ftp.ncbi.nih.gov/genbank/gbinv*.seq.gz
      ftp://ftp.ncbi.nih.gov/genbank/gbpat*.seq.gz
      ftp://ftp.ncbi.nih.gov/genbank/gbpri*.seq.gz
      ftp://ftp.ncbi.nih.gov/genbank/gbphg*.seq.gz
      ftp://ftp.ncbi.nih.gov/genbank/gbrod*.seq.gz
      ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nt.gz
      ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz
      Place these files in a directory, preferably on a disk different from the one containing the IDF distribution. Placement of the files on a different disk from the distribution will reduce disk contention during indexing.

      You must decompress these files after downloading them with a Linux command of the form:

      gzip -d nt.gz

      In Linux, gzip is part of the operating system distribution.

      You will need to place the path description to these files in the 00BUILD script file used to configure the indexing operations (see below).
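
      For example, to fetch and decompress the primate files into a data directory (the path is illustrative; note that wget accepts wildcards in FTP URLs when the URL is quoted):

        cd /r0/genbank
        wget 'ftp://ftp.ncbi.nih.gov/genbank/gbpri*.seq.gz'
        gzip -d gbpri*.seq.gz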

    4. The system distribution directory contains a subdirectory named DATABASE where the indexing will be performed.

    5. After having configured and run 00BUILD successfully, to index the selected data base, descend into DATABASE and run the command:
      nohup nice BUILD_INDEX.script &

      Depending on the size of the data base and the speed of your machine, this may take several hours. During index builds, you should minimize all other system activity and close unneeded windows. The indexer attempts to maximize physical memory usage; this reduces the number of intermediary files and decreases the time to index.

  7. Sequence Retrieval

    The main search routine is find10h4.cgi and it can be run interactively by descending to the DATABASE directory and typing:

      ../find10h4.cgi --nohtml < test

    where "test" is the name of a file containing a test query sequence in FASTA format. Each source file group has a test file that will be copied to the DATABASE directory.

    Note: find10h4.cpp uses parameters set by 00BUILD during compilation. See 00BUILD for details.

    When you run find10h4 from a command prompt, the following parameters apply:

    --swscore Use Smith-Waterman scoring of final results. IDF scoring will be used to identify candidates but the SW procedure will score and rank them. Using SW considerably increases the amount of time required. Default: disabled.
    --showsw Use Smith-Waterman scoring and show the SW alignments. This increases the amount of time required considerably and can result in very large amounts of output depending upon sequence sizes and alternative alignments. Default: disabled.
    --help Displays parameter options.
    --unweighted Use unweighted indexing only. Usually results in serious performance degradation. Default: disabled.
    --showmax nbr Number of sequences to display. Default is 500.
    --low nbr Low weight cutoff. System will not use words with weights lower than this to score sequences. Default: 65
    --high nbr High weight cutoff. System will not use words with weights higher than this to score sequences. Default: 120
    --gap nbr SW gap penalty. Default: -1.
    --match nbr SW match reward. Default: 2.
    --mismatch nbr SW mis-match penalty. Default: -1.
    --suffix string String to append to "fasta.sequences." as name for fasta sequences output file. Default: none. Enables sequence fetching (--genseq).
    --genseq For each sequence up to --showmax, fetch the sequence and store it in an output file named "fasta.sequences" or "fasta.sequences.name" where "name" is provided by --suffix.
    --nohtml Output will be in plain text (no HTML links embedded). The default is to embed HTML links into the output.
    --cluster filename Perform cluster analysis on sequences in current directory and store similarities in "filename". "filename" is required. (SEQ_FILE).
    --clustermax nbr Maximum number of sequences to cluster [1000]
    --bitmin nbr Cluster analysis bit score minimum cutoff [0]
    --simmin nbr Cluster analysis similarity score cutoff [0.0]
    --skip nbr Number of leading sequences to skip.
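
    For example, to retrieve the top 100 sequences with Smith-Waterman scoring of the final results (run from the DATABASE directory, with "test" a FASTA query file as above):

      ../find10h4.cgi --nohtml --swscore --showmax 100 < test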
    Other Notes

    00BUILD

    #!/bin/bash
    
    #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    #+   
    #+     Mumps Bioinformatics Software Library
    #+     Copyright (C) 2005, 2006 by Kevin C. O'Kane  
    #+
    #+     Kevin C. O'Kane
    #+     anamfianna@earthlink.net
    #+     okane@cs.uni.edu
    #+
    #+
    #+ This program is free software; you can redistribute it and/or modify
    #+ it under the terms of the GNU General Public License as published by
    #+ the Free Software Foundation; either version 2 of the License, or
    #+ (at your option) any later version.
    #+ 
    #+ This program is distributed in the hope that it will be useful,
    #+ but WITHOUT ANY WARRANTY; without even the implied warranty of
    #+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    #+ GNU General Public License for more details.
    #+ 
    #+ You should have received a copy of the GNU General Public License
    #+ along with this program; if not, write to the Free Software
    #+ Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307
    #+
    #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    
    # set the following parameters:
    
    touch *.c *.cpp *.mps
    
    autoconf
    
    if [ $? != 0 ] ; then
    	echo "errors in config.ac - terminating"
          exit
    	fi
    
    user="okane"
    group="www"
    amino_size=-1
    nucl_size=-1
    max_seq_length=25000
    p_range=25000
    p_min=100
    high_limit=120
    tmp_dir="/tmp"
    
    1. Modify "user" to be the user id of the owner of these files.
    2. Modify "group" to be the group id of these files.
    3. Modify "tmp_dir" to be a directory for temporary files. Your systm's /tmp may not be large enough.
    4. The parameters max_seq_length, p_range, p_min, and high_limit, are defaults. Specific values are set in each data base setup.
    # PRIMATES database setup
    #db_files="gbpri"
    #scale_factor=5
    #nucl_size=11
    #nbr_rand_smpls=25
    #cp test.primates DATABASE/test
    #cp pval.dat.invertebrates DATABASE/pval.dat
    #nucleotides="#define NUCLEOTIDES"
    #extract_files=1
    #database="/r0/genbank"
    #random="randomN"
    # stat_cal - build new pval.dat data base if 1, don't if 0
    #stat_calc=0
    #low_limit=65
    #high_limit=100
    #max_seq_length=25000
    #p_range=25000
    
    # RODENTS database setup
    #db_files="gbrod"
    #scale_factor=5
    #nucl_size=11
    #nbr_rand_smpls=50
    #cp test.rodents DATABASE/test
    #cp pval.dat.rodents DATABASE/pval.dat
    #nucleotides="#define NUCLEOTIDES"
    #extract_files=1
    #max_seq_length=25000
    #p_range=25000
    #database="/c1/genbank"
    #random="randomN"
    # stat_cal - build new pval.dat data base if 1, don't if 0
    #stat_calc=0
    #low_limit=65
    
    # VERTEBRATES database setup
    #db_files="gbvrt"
    #scale_factor=5
    #nucl_size=11
    #nbr_rand_smpls=25
    #cp test.vertebrates DATABASE/test
    #cp pval.dat.vertebrates DATABASE/pval.dat
    #nucleotides="#define NUCLEOTIDES"
    #extract_files=1
    #database="/c1/genbank"
    #random="randomN"
    # stat_cal - build new pval.dat data base if 1, don't if 0
    #stat_calc=0
    #low_limit=65
    #high_limit=100
    #max_seq_length=25000
    #p_range=25000
    
    # BACTERIA database setup
    #db_files="gbbct"
    #scale_factor=8
    #nucl_size=11
    #nbr_rand_smpls=25
    #cp test.bacteria DATABASE/test
    #cp pval.dat.bacteria DATABASE/pval.dat
    #nucleotides="#define NUCLEOTIDES"
    #extract_files=1
    #database="/c1/genbank"
    #random="randomN"
    # stat_cal - build new pval.dat data base if 1, don't if 0
    #stat_calc=0
    #low_limit=65
    #high_limit=100
    #max_seq_length=25000
    #p_range=25000
    
    # PLANT database setup
    #db_files="gbpln"
    #scale_factor=5
    #nucl_size=11
    #nbr_rand_smpls=25
    #cp test.plants DATABASE/test
    #cp pval.dat.plants DATABASE/pval.dat
    #nucleotides="#define NUCLEOTIDES"
    #extract_files=1
    #database="/c1/genbank"
    #random="randomN"
    # stat_cal - build new pval.dat data base if 1, don't if 0
    #stat_calc=0
    #low_limit=65
    #max_seq_length=25000
    #p_range=25000
    
    # INVERTEBRATES database setup
    #db_files="gbinv"
    #scale_factor=1
    #nucl_size=11
    #nbr_rand_smpls=25
    #cp test.invertebrates DATABASE/test
    #cp pval.dat.invertebrates DATABASE/pval.dat
    #nucleotides="#define NUCLEOTIDES"
    #extract_files=1
    #database="/r0/genbank"
    #random="randomN"
    # stat_cal - build new pval.dat data base if 1, don't if 0
    #stat_calc=0
    #low_limit=65
    #high_limit=100
    #max_seq_length=25000
    #p_range=25000
    
    # PHAGE database setup
    #db_files="gbphg"
    #scale_factor=1
    #nucl_size=11
    #nbr_rand_smpls=25
    #cp test.phage DATABASE/test
    #cp pval.dat.phage DATABASE/pval.dat
    #nucleotides="#define NUCLEOTIDES"
    #extract_files=1
    #database="/r0/genbank"
    #random="randomN"
    # stat_cal - build new pval.dat data base if 1, don't if 0
    #stat_calc=0
    #low_limit=65
    #max_seq_length=25000
    #p_range=25000
    
    # VIRAL database setup
    #db_files="gbvrl"
    #scale_factor=1
    #nucl_size=11
    #nbr_rand_smpls=25
    #cp test.viral DATABASE/test
    #cp pval.dat.phage DATABASE/pval.dat
    #nucleotides="#define NUCLEOTIDES"
    #extract_files=1
    #database="/r0/genbank"
    #random="randomN"
    # stat_cal - build new pval.dat data base if 1, don't if 0
    #stat_calc=0
    #low_limit=65
    #max_seq_length=25000
    #p_range=25000
    #high_limit=120
    
    # EST database setup
    #db_files="gbest"
    #scale_factor=5
    #nucl_size=11
    #nbr_rand_smpls=25
    #cp test.est DATABASE/test
    #cp pval.dat.est DATABASE/pval.dat
    #nucleotides="#define NUCLEOTIDES"
    #extract_files=1
    #database="/c1/genbank"
    #random="randomN"
    # stat_cal - build new pval.dat data base if 1, don't if 0
    #stat_calc=0
    #low_limit=95
    #high_limit=115
    #max_seq_length=7500
    #p_range=7500
    
    # NT database setup
    #db_files="nt"
    #scale_factor=8
    #nucl_size=11
    #nbr_rand_smpls=25
    #p_range=20000
    #cp test.nt DATABASE/test
    #cp pval.dat.nt DATABASE/pval.dat
    #nucleotides="#define NUCLEOTIDES"
    #extract_files=0
    #database="/r0/blast/db/FASTA/nt"
    #max_seq_length=20000
    #random="randomN"
    #low_limit=65
    # stat_cal - build new pval.dat data base if 1, don't if 0
    #stat_calc=0
    
    # NR database setup
    #db_files="nr"
    #scale_factor=7
    #nbr_rand_smpls=25
    #cp pval.dat.nr DATABASE/pval.dat
    #p_range=12000
    #p_min=50
    #max_seq_length=12000
    #cp test.nr DATABASE/test
    #database="/r0/blast/db/FASTA/nr"
    #amino_size=3
    #extract_files=0
    #random="randomA"
    #low_limit=30
    #high_limit=65
    # stat_cal - build new pval.dat data base if 1, don't if 0
    #stat_calc=0
    
    1. Uncomment the section appropriate for your data base (remove # sign). Do not uncomment the line:
      # stat_cal - build new pval.dat data base if 1, don't if 0
      
    2. Modify the line:
      database="/r0/genbank"
      to point to the directory containing your data files. Note: for the NR and NT databases, this line points to the actual data base file:
      database="/r0/blast/db/FASTA/nt"

      whereas for the other data bases it points merely to the directory containing the data base files.
    #set the memory limit to a value less than your physical memory.
    
    memlimit=1800000001
    
    export database shred_home tmp_dir user group memlimit \
           max_seq_length\
           nr_database nt_database nucleotides low_limit \
           db_files scale_factor\
           nbr_rand_smpls extract_files amino_size random \
           p_range high_limit stat_calc\
           nucl_size p_min
    
    if [ $nucl_size = -1 ] && [ $amino_size = -1 ] ; then
          echo "No data base section was specified (uncommented)"
          echo "halting - configuration aborted"
          exit
          fi
    
    ./configure
    
    if [ $? = 0 ] ; then
    	echo "begin make"
    	make idf
          cp blank.cgi DATABASE/statistics
          cp lm.cgi DATABASE/statistics
    	chown $user *
    	chgrp $group *
          chmod u+x stat.script
          chmod u+x DATABASE/statistics/setup.script
          chmod u+x DATABASE/clusterbuild.script
          chmod a+x DATABASE/index.cgi
          chmod a+x DATABASE/weights.cgi
    else
    
          chown $user *
          chgrp $group *
    	echo "make supressed due to errors from configure"
    	fi
    
    1. Modify "memlimit" to be a value equal to about 80% to 90% of your total physical memory. If you have a very large system memory, 90% is appropriate. The goal here is to give the indexing programs the maximum amount of memory possible without inducing virtual memory page swapping.


  8. Using the Web Interface

    The basic web page should appear as follows (screen image not reproduced in this text version; the Windows XP version is shown - the Linux/cygwin version has a few more options which will be discussed below):

    Options:

    1. Weight distribution - the last line in the top blue area is a link to a graph of the word weight distribution for this data base (the graph image is not reproduced here). This graph indicates the relative weights of word fragments used in the indexing process and can be useful in determining parameters for searches.

    2. Sequence text box - the large white text box in the center is where you place your sequence. Your sequence should be in FASTA format, that is, the first line begins with a ">" symbol followed by a description of the query sequence. Subsequent lines contain the nucleotide sequence to be searched for. At present, sequences are limited to a total of about 1,000 characters, inclusive of the first line. The display you see has a sample sequence already filled in. You should replace it with your sequence, or you may use it as a test case. If you use this test sequence, it will probably not match any sequence in your data base well unless you are using the "NT" data base from which it was extracted.

    3. Selection of scoring technique - lower left hand box. Options:

      1. IDF scoring (default). Sequences will be scored using the IDF technique only and the results will be ordered by IDF scores.

      2. Smith-Waterman Scoring. Sequences will be selected by the IDF method but scored using the Smith-Waterman algorithm. This option will increase the amount of time for a search but can produce better scores.

      3. Smith-Waterman Scoring - show aligns. Same as above but optimal alignments will be shown. Note: for a given query sequence, data base match, and Smith-Waterman score, there is often more than one possible alignment that produces the same score. Only one alignment is shown. Other alignments, although different, are usually quite similar.

    4. IDF Weight Factors. These parameters determine which portion of the IDF weight distribution will be used for scoring. Generally speaking, the weight distribution is a bell-like curve. Words to the left of the main peak (or peaks) represent higher frequency, more broadly distributed words. Words to the right of the central peak(s) are less widely distributed and normally of lower frequency. By selecting a band of words on the downslope of the main peaks, the retrieval is faster and more sensitive. The graphs of word distribution can be seen by clicking on the line "Click Here for weight distribution" in the top left box. The default values initially shown are tailored for each data base. The box labeled "No Weighting:" disables IDF weight indexing and should not be used except to compare the effectiveness of the IDF method versus raw indexing.

    5. Retrieve 25 sequences. The number of top scoring sequences to retrieve. Larger numbers of sequences will increase timing slightly for IDF scoring and moderately for Smith-Waterman scoring.

    6. SW Settings. These are the basic reward/penalty values used in the Smith-Waterman algorithm, if selected. The value for "Gap" is the penalty for opening a gap in a match; the "Mismatch" is the penalty for a sequence mismatch; and the "Match" value is the reward for matching letters.

    7. Search button. Click the search button to initiate the search.

  9. Web Examples

    1. Search with default selections shown in figure above:

      The display has been scrolled down to show results so some text material at the top is not seen. Each result consists of the IDF score, the p-value, the sequence length and a portion of the first line of the sequence (the descriptor). Each line is a link to the full display of the sequence at NCBI.

    2. Same search but with Smith-Waterman scoring turned on:

    3. Same search but with Smith-Waterman alignments enabled:

      Only a portion of the first match is shown. This option greatly increases the amount of output.

  10. Clustering

    Because of the speed of this method, it becomes possible to cluster the sequences in a collection using a single link agglomeration method. The script for clustering is "Cluster.script" and looks like the following:

    #!/bin/bash
    
    #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    #+
    #+     Mumps Bioinformatics Software Library
    #+     Copyright (C) 2004, 2005 by Kevin C. O'Kane
    #+
    #+     Kevin C. O'Kane
    #+     anamfianna@earthlink.net
    #+     okane@cs.uni.edu
    #+
    #+
    #+ This program is free software; you can redistribute it and/or modify
    #+ it under the terms of the GNU General Public License as published by
    #+ the Free Software Foundation; either version 2 of the License, or
    #+ (at your option) any later version.
    #+
    #+ This program is distributed in the hope that it will be useful,
    #+ but WITHOUT ANY WARRANTY; without even the implied warranty of
    #+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    #+ GNU General Public License for more details.
    #+
    #+ You should have received a copy of the GNU General Public License
    #+ along with this program; if not, write to the Free Software
    #+ Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
    #+
    #++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
    
    cat /dev/null > nohup.out
    
    CPUS=4
    GROUP=`cat DocCount`
    let GROUP=(GROUP+4)/CPUS
    let SEQ_LIMIT=GROUP*CPUS
    echo "CPUS $CPUS SEQ_LIMIT $SEQ_LIMIT GROUP $GROUP"
    
    rm cluster_tmp*
    rm task*
    rm key.dat data.dat
    
    ../starttime.cgi
    
    for ((i=0; i<SEQ_LIMIT; i=i+GROUP)); do
          let skip=i
          ../find10h4.cgi --cluster cluster_tmp$i --skip $skip --simmin .000001 --clustermax $GROUP > task$i &
          done
    
    wait
    
    cat cluster_tmp* > CTMP
    sort -n -r < CTMP > CTMP.S
    ../cluster1.cgi < CTMP.S
    ../endtime.cgi

    Note: you should run these procedures using "nohup" as the results will be displayed to stdout.

    nohup clusterbuild.script &

    The resulting clusters will be in "nohup.out". Note: clusterbuild.script is built by 00BUILD from clusterbuild.script.in

    Definition of terms in the clustering scripts that you may need to set (in file clusterbuild.script.in):

    1. CPUS - the number of CPUs available for processing the data. For example, on our dual hyperthreaded Xeon systems, each processor behaves (supposedly) like two processors, so we use a CPUS value of 4. The value of CPUS will be the number of independent, parallel tasks created (see the worked example after this list).

    2. --simmin - This value determines the minimum similarity (p-value) between sequences; pairs with similarities below this value are not recorded.

    3. --high nbr --low nbr - You may set these to select IDF score bands to use during retrieval. The score bands need to be set lower for protein data bases than for nucleotide data bases.
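
    To illustrate the arithmetic in the script above (the DocCount value is illustrative): with CPUS=4 and a DocCount of 100000, GROUP = (100000+4)/4 = 25001 and SEQ_LIMIT = 4 * 25001 = 100004, so four parallel find10h4.cgi tasks are started, each clustering a block of about 25000 sequences.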

    The programs will produce some number of files named "cluster_tmp*". These contain the similarities between all the sequences where the similarity was equal to or greater than "--simmin" or "--bitmin". These are merged into a large file named "CTMP" which is then sorted by similarity coefficient into the file "CTMP.S". You may edit CTMP.S to remove lower (or higher) ranking similarities and recluster (use cluster1.cgi as seen in the script). The files "cluster_tmp*" may be erased.
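
    For example, to recluster using only the 10000 strongest similarity pairs (the cutoff is arbitrary), one could run:

      head -n 10000 CTMP.S > CTMP.TOP
      ../cluster1.cgi < CTMP.TOP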

    Clustering is very time consuming, more so on protein data bases than nucleotide data bases. You need a fast, preferably multi-processor machine with a large amount of memory. If you do not have enough memory to effectively load all the retrieval tables and your system must repeatedly load data from disk, clustering will be impractical except on the smallest of data sets.

  11. Cluster Examples

    1. Viral data base (as of March 18, 2006) - Genbank gbvrl*
    2. IDF Primates Clusters DB

  12. Methodology

    Sequences from the NCBI data bases were used. The overall frequencies of occurrence of all possible 11 character tuples (for nucleotides) or 3 character tuples (for proteins) in each sequence in the data base were determined, along with the number of sequences in which each unique word was found. For nucleotides, this can result in a theoretical maximum of 4,194,304 words. The tuple sizes of 11 and 3 were initially selected as these are the default tuple sizes used in BLAST for nucleotide and protein searches, respectively. The programs, however, will accommodate other tuple lengths.
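
    For example, the 13 letter sequence ACGTACGTACGTA contains three overlapping 11 letter words: ACGTACGTACG, CGTACGTACGT, and GTACGTACGTA. In general, a sequence of length L contains L - 10 such words, and there are 4^11 = 4,194,304 possible distinct 11 letter nucleotide words.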

    The data base was processed according to the following diagram (not reproduced in this text version):

    Not shown is a preprocessing step. If the data base is one of the gb*.seq data bases, the nucleotide sequences must be extracted to FASTA format prior to other processing. A routine named extractDNA builds the FASTA format input files. In the case of the nt and nr data bases, which are already in FASTA format, this step is not executed.

    Each FASTA format sequence in the data base was read and decomposed into all possible words of length 11 in the sequence string by taking all 11 letter words beginning at starting position one, two, and so on up to and including starting position eleven. Procedurally, given the vast number of words produced, this was accomplished by producing multiple intermediate files (*.words) of about 440 million bytes each (depending on memory availability), ordered alphabetically by word and listing, for each word, the relative reference number of the original sequence containing the word. A relative sequence number was used as it could be expressed in four bytes rather than an offset, which would have required eight bytes due to the size of the input data base (12 GB). A master table named offset.table was also produced that translates each relative sequence reference into an eight byte offset into the original data base. The multiple intermediate files were subsequently merged and three files were produced:

    1. a large word-sequence table, out.table, giving, for each word, a list of the sequence references of those sequences in which the word occurs;

    2. a file (freq.bin) containing the IDF weights for each word; and

    3. a file named index giving for each word the eight byte offset of the word's entry in out.table.

    4. Finally, index and freq.bin are merged into ITABLE which contains for each word its weight, offset, and a pointer to a list of aliases.

    The IDF weight (stored in freq.bin) W_i for each word i was calculated by taking the base 10 logarithm of the total number of sequences (N) divided by the number of sequences in which word i occurred (DocFreq_i), multiplying by 10, and truncating to an integer:

      W_i = (int) ( 10 * log10( N / DocFreq_i ) )

    This weight yields higher values for words whose distribution is more concentrated and lower values for words whose use is more widespread. Thus, words of broad context are weighted lower than words of narrow context.
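
    For example, in a data base of 1,000,000 sequences, a word occurring in 100 of them receives weight (int)(10 * log10(1000000/100)) = 40, while a word occurring in 500,000 sequences receives (int)(10 * log10(2)) = 3.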

    For retrieval, each query sequence is read and decomposed into 11 or 3 character words. These words are reduced to a numeric equivalent which is used to index ITABLE. Entries in a master scoring vector corresponding to data base sequences are incremented by the weight of a word if the word occurs in the sequence and if the weight of the word lies within a specified range. Ultimately, when all words have been processed, entries in the master sequence vector are normalized according to the length of the underlying sequences and to the length of the query. Finally, the master sequence vector is sorted and the top scoring entries are either printed or submitted to a Smith-Waterman procedure for more detailed scoring and then re-sorted and printed. Optionally, the Smith-Waterman alignment details can be printed and the selected sequences can be extracted from the data base and stored in a separate output file for additional post-processing.