Inverse Document Frequency Genome Searching

Inverse Document Frequency Weighted
Genomic Sequence Retrieval


Kevin C. O'Kane, Ph.D.
Professor Emeritus
Computer Science Department
University of Northern Iowa
Cedar Falls, IA 50614
kc.okane@gmail.com
http://www.cs.uni.edu/~okane
July 28, 2006
Copyright (c) 2004, 2005, 2006, 2015 Kevin C. O'Kane, Ph.D.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation; with the Invariant Sections being: Page 1, with the Front-Cover Texts being: Page 1, and with the Back-Cover Texts being: no Back-Cover Texts.


  1. Citation

    O'Kane, K.C., The Effect of Inverse Document Frequency Weights on Indexed Sequence Retrieval, Online Journal of Bioinformatics, Volume 6 (2) 162-173, 2005.

    Full Text

  2. Abstract

    This software presents a method to identify weighted n-gram sequence fragments in large genomic databases whose indexing characteristics permits the construction of fast, indexed, sequence retrieval programs where query processing time is determined mainly by the size of the query and number of sequences retrieved rather than, as is the case in sequential scan based systems such as BLAST, FASTA, and Smith-Waterman, the size of the database. The weighting scheme is based on the inverse document frequency (IDF) method, a weighting formula that calculates the relative importance of indexing terms based on term distribution. In experiments (see citation above), the relative IDF weights of all segmented, overlapping, fixed n-grams of length eleven in the NCBI .nt. and other databases were calculated and the resulting n-grams ranked and used to create an inverted index into the sequence file. The system was evaluated on test cases constructed from randomly selected known sequences which were randomly fragmented and mutated and the results compared with BLAST and MegaBlast for accuracy and speed. Due to the speed of query processing, the system is also capable of database sequence clustering, examples of which are given below.

  3. General Comments

    This software is mainly aimed at Linux users.

    This code has been developed under the Mint/Mate Linux distro.

  4. Distributions

    The full IDF distribution is to be found in::

    http://www.cs.uni.edu/~okane/source/IDF/IDF-1.00.src.tar.gz

  5. Linux Installation

    See the file named ReadMeFirst.pdf in the distribution.

  6. Example

    The first example shows a search for a known sequence (exact match). The p-score indicates that the result is not random (0.00003).

    The second example is the result of searching for a randomly generated query against the first 50 files of the NCBI EST database (about 7.8 million sequences). The Smith-Waterman alignment of the highest scoring result is shown. The p-score (1.00) indicates that the match was no better (as expected) than random.

    
    IDF Sequence Search 
    Kevin C. O'Kane 
    kc.okane@gmail.com 
    http://threadsafebooks.com/ 
    
    Query:
    >gi | AI501777.1  GI:4399628 | AI501777 | UI-R-A0-bb-e-06-0-UI.s1 UI-R-A0 Rattus norvegicus cDNA clone
    CGGCCGCTCCCGGGGAGCCCGGCGGGTCGCCGGCGCGGGGTTTTCCTCCGGCCTCGTCCT
    CCCCCTTCCCCCTCCGCGGGGTCGGGGGTTCCCGGGGTTCGGGGTTCTCCTCCGCGTCGG
    CGGTTCCCCCGCCGGGTGCGCCCCCCGGGCCGCGGTTTCCCGCGCGGCGCCTCGCCTCGG
    CCGGCGCCTATCAGCCGACTTAGAACTGGTGCGGACCACGGGAATCCGACTGTTTAATTA
    AAACAAAGCATCGCG
    Query string has 255 letters
    
    Searching ...
    
      IDF    SW      P      IDF    Seq
     Score Score   Score    Rank   Size Accession
     18590   510 3.00e-05    1      255 >gi | AI501777.1  GI:4399628 | AI501777 | UI-R-A0-bb-e-06-0-UI.s1 UI-R-A0 Rattus norvegicu
     17450   506 3.00e-05   18      360 >gi | AI555952.1  GI:4488315 | AI555952 | UI-R-C2p-rc-d-08-0-UI.s1 UI-R-C2p Rattus norvegi
     17450   506 3.00e-05   17      331 >gi | AI706390.1  GI:4994290 | AI706390 | UI-R-AE1-zg-f-04-0-UI.s1 UI-R-AE1 Rattus norvegi
     17450   506 3.00e-05   16      369 >gi | AA900277.1  GI:4232768 | AA900277 | UI-R-E0-de-h-11-0-UI.s2 UI-R-E0 Rattus norvegicu
     17450   506 3.00e-05   15      461 >gi | AA900281.1  GI:4232772 | AA900281 | UI-R-E0-dd-a-03-0-UI.s1 UI-R-E0 Rattus norvegicu
     17450   506 3.00e-05   14      476 >gi | AA900444.1  GI:4232936 | AA900444 | UI-R-E0-bw-a-03-0-UI.s2 UI-R-E0 Rattus norvegicu
     17450   506 3.00e-05   13      425 >gi | AA924277.1  GI:4234417 | AA924277 | UI-R-A1-ds-h-05-0-UI.s1 UI-R-A1 Rattus norvegicu
     17479   506 3.00e-05   12      471 >gi | AI511114.1  GI:4416813 | AI511114 | UI-R-BT0-po-c-10-0-UI.s1 UI-R-BT0 Rattus norvegi
     17479   506 3.00e-05   11      448 >gi | AI577844.1  GI:4562220 | AI577844 | UI-R-AB0-vw-b-08-0-UI.s1 UI-R-AB0 Rattus norvegi
     17479   506 3.00e-05   10      404 >gi | AI601935.1  GI:4611096 | AI601935 | UI-R-AB0-vt-b-12-0-UI.s2 UI-R-AB0 Rattus norvegi
     17479   506 3.00e-05    9      396 >gi | AI602876.1  GI:4612037 | AI602876 | UI-R-AE0-xm-g-07-0-UI.s1 UI-R-AE0 Rattus norvegi
     17479   506 3.00e-05    8      350 >gi | AI705323.1  GI:4993223 | AI705323 | UI-R-AG0-xb-c-09-0-UI.s1 UI-R-AG0 Rattus norvegi
     17479   506 3.00e-05    7      422 >gi | AI706663.1  GI:4994563 | AI706663 | UI-R-AD1-zl-c-02-0-UI.s1 UI-R-AD1 Rattus norvegi
     17479   506 3.00e-05    6      368 >gi | AI706756.1  GI:4994656 | AI706756 | UI-R-AA1-aab-d-08-0-UI.s1 UI-R-AA1 Rattus norveg
     17479   506 3.00e-05    5      457 >gi | AI706764.1  GI:4994664 | AI706764 | UI-R-AA1-aab-e-07-0-UI.s1 UI-R-AA1 Rattus norveg
     17479   506 3.00e-05    4      353 >gi | AI710114.1  GI:4999890 | AI710114 | UI-R-AA1-zz-h-03-0-UI.s1 UI-R-AA1 Rattus norvegi
     17479   506 3.00e-05    3      260 >gi | AW521588.1  GI:7163973 | AW521588 | UI-R-BO0-ago-d-10-0-UI.s1 UI-R-BO0 Rattus norveg
     17697   506 3.00e-05    2      341 >gi | AI502126.1  GI:4399977 | AI502126 | UI-R-C2p-rx-d-08-0-UI.s2 UI-R-C2p Rattus norvegi
     17397   504 3.00e-05   20      368 >gi | AI578544.1  GI:4562920 | AI578544 | UI-R-AA0-wl-d-03-0-UI.s1 UI-R-AA0 Rattus norvegi
     17400   504 3.00e-05   19      372 >gi | AI602738.1  GI:4611899 | AI602738 | UI-R-AE0-xk-e-09-0-UI.s1 UI-R-AE0 Rattus norvegi
    
    7780681 >gi | AI501777.1  GI:4399628 | AI501777 | UI-R-A0-bb-e-06-0-UI.s1 UI-R-A0 Rattus norvegicus cDNA clone
    
        1 CGGCCGCTCCCGGGGAGCCCGGCGGGTCGCCGGCGCGGGGTTTTCCTCCGGCCTCGTCCTCCCCCTTCCCCCTCCGCGGGGTCGGGGGTT
          ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
        1 CGGCCGCTCCCGGGGAGCCCGGCGGGTCGCCGGCGCGGGGTTTTCCTCCGGCCTCGTCCTCCCCCTTCCCCCTCCGCGGGGTCGGGGGTT
    
       91 CCCGGGGTTCGGGGTTCTCCTCCGCGTCGGCGGTTCCCCCGCCGGGTGCGCCCCCCGGGCCGCGGTTTCCCGCGCGGCGCCTCGCCTCGG
          ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
       91 CCCGGGGTTCGGGGTTCTCCTCCGCGTCGGCGGTTCCCCCGCCGGGTGCGCCCCCCGGGCCGCGGTTTCCCGCGCGGCGCCTCGCCTCGG
    
      181 CCGGCGCCTATCAGCCGACTTAGAACTGGTGCGGACCACGGGAATCCGACTGTTTAATTAAAACAAAGCATCGCG
          :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
      181 CCGGCGCCTATCAGCCGACTTAGAACTGGTGCGGACCACGGGAATCCGACTGTTTAATTAAAACAAAGCATCGCG
    
    SW score=510
    
    
    
    IDF Sequence Search 
    Kevin C. O'Kane 
    kc.okane@gmail.com 
    http://threadsafebooks.com/ 
    
    Query: 
    > random sequence 1 of length 300 
    TTTACGTCGATAAGTGGCAAAAGATCATTGCGGAGTTCACCTCCGAAAGAAGAGGTTGGG 
    CACTATGTCTAGGCAAGACAACGCACATTTGATTTTGGTTCTGTCGATGCTGGCTGGAGG 
    TAGGTGCCAACCATACCAATCTCAACGTCACAATGTCTCCAGGACTCGTCCCAGCATTTA 
    AAAAAGACGCTGTCGAATGAATCACGACCACCAGGAAGGGTCAGGGGTCATCTACAGGCT 
    GTCGCTTCCCAAGAGATATAAGGTGTTACCATACGCACGCGGCATTAGTAGTGAGCACCC 
    
    Query string has 300 letters 
    
    Searching ... 
    
      IDF    SW      P      IDF    Seq 
     Score Score   Score    Rank   Size Accession 
       648   116 1.00e+00   12      769 >gi | AM349338.1  GI:112348641 | AM349338 | AM349338 Oryzias
       603   114 1.00e+00   28      590 >gi | BE637582.1  GI:9920693 | BE637582 | WHE0860_B08_C16ZS Wheat
       752   102 1.00e+00    4      752 >gi | AI663456.1  GI:4767039 | AI663456 | uk33a10.y1 Sugano mouse
       752   102 1.00e+00    3      524 >gi | AW259173.1  GI:6632154 | AW259173 | um89d05.y1 Sugano mouse
       752   102 1.00e+00    2      997 >gi | BB609451.1  GI:15390254 | BB609451 | BB609451 RIKEN full-
       752   102 1.00e+00    1      637 >gi | BB661515.1  GI:16495294 | BB661515 | BB661515 RIKEN full-
       594   100 1.00e+00   31      354 >gi | BB871941.1  GI:17118151 | BB871941 | BB871941 RIKEN full-
       617   100 1.00e+00   19      321 >gi | AV205850.1  GI:6146703 | AV205850 | AV205850 RIKEN full-
       634    98 1.00e+00   15      451 >gi | AU289438.1  GI:24249558 | AU289438 | AU289438 zinnia cultured
       597    88 1.00e+00   30      528 >gi | AA056840.1  GI:1549467 | AA056840 | SWMFCA614SK Brugia malayi
    
    1486919 >gi | AM349338.1  GI:112348641 | AM349338 | AM349338 Oryzias latipes whole embryo Cab stages
    
      209 TTCT-C-TTGAT-GGTGTG--AAA-CTGC-TGTCCTGGA-TTCTCGCTGACCACATCAATGCGAAG-GTGCTCAAGGTGAAC-GTGCTTC 
          :: : : : :::  ::: :  :::  : : : : : ::: ::: : ::  ::  :  ::   :::: : : :   :: : ::  ::  :: 
        1 TT-TACGTCGATAAGTG-GCAAAAGAT-CAT-TGC-GGAGTTCAC-CT--CC-GA--AA---GAAGAG-G-T--TGG-GCACTATG--TC 
    
      289 TCGG-AGGACAGCGC-GATGAGAAGTTCGTCGCT-TAGCAAG-TGGTAATTGAGGGCCAGCTCCCTCAGGCCATGGCACTGGTCGGCCAC 
          : :: : :::: :::  ::  ::  :: ::  :: : : : : :::   : :: ::  :: :  : ::  ::::  :: :  ::  :  : 
       70 TAGGCAAGACAACGCACATTTGATTTTGGT-TCTGTCG-ATGCTGG--CTGGA-GG-TAGGT-GC-CA-ACCATACCAAT-CTCAACGTC 
    
      375 ACACAGGATCCCTGCAGGTGACACGTGAGGGCAGC-TGCTCATCTTCAG-CAGCT-TC-AGTGTGTCACTGTTGTTGGCCACCAGGACTT 
          ::: : : :  :: :: : ::: :::     :::: :  : :     :: : ::: :: : ::  ::::       : ::::::::: 
      150 ACA-ATG-T--CTCCA-G-GACTCGT---CCCAGCAT--TTA-AAAAAGAC-GCTGTCGAATGAATCAC-------GACCACCAGGA--- 
    
      461 TGAGAGAGGGGTCATCTACAGGCGTGTCATCGATCCTCAAAGACGACAGGGATTTGGAGTTGATGAAAAC-CACCGTCAGCGC-TGAGAT 
           : :  ::::::::::::::::: ::::  :  ::: : :::: :  :   ::  :: ::: :   : :: :: :: : : :: : :: : 
      217 AGGGTCAGGGGTCATCTACAGGC-TGTC-GC-TTCC-C-AAGA-G--A--TATAAGGTGTT-A-CCATACGCA-CG-C-G-GCATTAG-T 
    
      549 GAAGTGAG-ACC 
            :::::: ::: 
      290 --AGTGAGCACC 
    
    SW score=116 
    
    

  7. Methodology

    Sequences from the NCBI data bases were used. The overall frequencies of occurrence of all possible 11 character tuples in each sequence (for nucleotides) or 3 character tuples (for proteins) in the data base were determined along with the number of sequences in which each unique word was found. For nucleotides, this can result in a theoretical maximum of 4,194,304 words. The tuple sizes of 11 and 3 were initially selected as these are the default tuple size used in BLAST for nucleotide searches. The programs however, will accommodate other tuple lengths.

    The data base was processed according to the following diagram:

    Not shown is a preprocessing step. If the data base is on of the gb*.seq data bases, the nucleotide sequences must be extracted to FASTA format prior to other processing. A routine named extractDNA builds the FASTA format input files. In the cases of the nt and nr data bases, which are already in FASAT format, this step is not executed.

    Each FASTA format sequence in the data base was read and decomposed into all possible words in the sequence string of length 11 by taking all 11 letter words beginning at starting position one, two and so on up to and including starting position eleven. Procedurally, given the vast amount of words produced, this was accomplished by producing multiple intermediate files (*.words) of about 440 million bytes each (depending on memory availability), ordered alphabetically by word and listing, for each word, the relative reference number of the original sequence containing the word. A relative sequence number was used as it could be expressed in four bytes rather than an offset which would have required eight bytes due to the size of the input data base (12 GB). A master table named offset.table was also produced that translates each relative sequence reference into an eight byte offset into the original data base. The multiple intermediate files were subsequently merged and and three files produced:

    1. a large (word-sequence table, out.table, giving, for each word, a list of the sequence references of those sequences in which the word occurs;

    2. a file (freq.bin) containing the IDF weights for each word; and

    3. a file named index giving for each word the eight byte offset of the word's entry in out.table.

    4. Finally, index and freq.bin are merged into ITABLE which contains for each word its weight, offset, and a pointer to a list of aliases.

    The IDF weights (freq.bin) Wi for each word i were calculated by taking the base 10 logarithm, multiplied by 10 and truncated to the nearest integer, of the total number of sequences (N) divided by the number of sequences in which each word occurred (DocFreqi):

    Wi= (int) 10 * Log10 ( N / DocFreqi )

    This weight yields higher values for words whose distribution is more concentrated and lower values for words whose use is more widespread. Thus, words of broad context are weighted lower than words of narrow context.

    For retrieval, each query sequence is read and decomposed into 11 or 3 character words. These words are reduced to a numeric equivalent and this is used to index ITABLE. Entries in a master scoring vector corresponding to data base sequences are incremented by the weight of the word if the word occurs in the sequence and if the weight of the word lies within a specified range. Ultimately, when all words are processed, entries in the master sequence vector are normalized according to the length of the underlying sequences and to the length of the query. Finally, the master sequence vector is sorted and the top scoring entries are either printed or submitted to a Smith-Waterman procedure for more detailed scoring and then re-sorted and printed. Optionally, the Smith-Waterman alignments details can be printed and the selected sequences can be extracted from the data base and stored in a separate output file for additional post-processing.