Inverse Document Frequency Genome Searching

Inverse Document Frequency Weighted
Genomic Sequence Retrieval


Kevin C. O'Kane, Ph.D.
Professor Emeritus
Computer Science Department
University of Northern Iowa
Cedar Falls, IA 50614
kc.okane@gmail.com
https://www.cs.uni.edu/~okane
Sept 27, 2018
Copyright (c) 2004, 2005, 2006, 2015 2018 Kevin C. O'Kane, Ph.D.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation; with the Invariant Sections being: Page 1, with the Front-Cover Texts being: Page 1, and with the Back-Cover Texts being: no Back-Cover Texts.


  1. Citation

    O'Kane, K.C., The Effect of Inverse Document Frequency Weights on Indexed Sequence Retrieval, Online Journal of Bioinformatics, Volume 6 (2) 162-173, 2005.

    Full Text

  2. Abstract

    This software presents a method to identify weighted n-gram sequence fragments in large genomic databases whose indexing characteristics permits the construction of fast, indexed, sequence retrieval programs where query processing time is determined mainly by the size of the query and number of sequences retrieved rather than, as is the case in sequential scan based systems such as BLAST, FASTA, and Smith-Waterman, the size of the database. The weighting scheme is based on the inverse document frequency (IDF) method, a weighting formula that calculates the relative importance of indexing terms based on term distribution. In experiments (see citation above), the relative IDF weights of all segmented, overlapping, fixed n-grams of length eleven in the NCBI .nt. and other databases were calculated and the resulting n-grams ranked and used to create an inverted index into the sequence file. The system was evaluated on test cases constructed from randomly selected known sequences which were randomly fragmented and mutated and the results compared with BLAST and MegaBlast for accuracy and speed. Due to the speed of query processing, the system is also capable of database sequence clustering, examples of which are given below.

  3. General Comments

    This software is mainly aimed at Linux users.

    This code has been developed under the Mint/Mate Linux distro.

  4. Distributions

    The full IDF distribution is to be found in::

    https://www.cs.uni.edu/~okane/source/IDF/IDF-3.02-src.tgz

  5. Linux Installation

    See the file named ReadMeFirst.pdf in the distribution.

  6. Example

    The example shows a search for a randomly modified known sequence (exact match) in the gbpri* database. The IDF Score column gives the IDF metric of similarity. The Fasta results (W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448) are the result of running the same query sequence against the original gbpri input database.

    Sequence Searcher Tue Sep 25 13:50:31 2018
    Database files:
    /media/okane/ssd120/IDF-New/inputSequences/nt,old
    /media/okane/ssd120/work/words.merged,old
    
    Query: 
    >gi | AC192344.3 | AC192344 | Pan troglodytes BAC clone CH251-157M8 fr 1531378284
    GAATTCTCATACGCTGCTTGTGGGAATGTAAAATGGTGCCACCACTTTGGACAAACAGTCTGACAGTTTCTCAAATGGTTA
    AACATAGAGTTAACCTATGATTCAGGAATTCTACTCCAATGTATATACACACAGGAAATGGAAACATATGTCAACGCAAAA
    ACATATTCATGAGCAACATTATTTATAATAGCCAGAAGGGTGGAAACTACTTCATTTTTTTCAACCAAACACAGGATTAAA
    AAATTGTGATATATCCTTAGACTGGAATATGATTTAGCCATAAAAGAGAATGAAGTACTGGTATATGCTGCAATATGAATA
    AACCTTGAAAATATTATGCTAAGTGAAACAGACCAGTCACTAAAGATGTATGATTCCATCTTATATGAAACATCACGGATA
    GACAAATCTGCTTAGAGCTGACTGGGATGGGGAGATTGGGCAATAGCTAAAGAGTACAGGGTTTTGGTTCTTTTTTTGAGG
    TGACAAAAATGCTCAAAATTGAATGTAGTGATGGTTGCCCATATCTCTGAATATACAAAAAGCCATTGAACTGTGCACTTT
    ACGTGCATGAATTGCATGGTATGTGAACTACACCTTAATGAAGCTGGTATAAAATTTTTAATTTAGTGCCCGTTCTACAGA
    TTATAACATGCATAATCAACATATTACTGTTTAGTTTAAATGAATGCTTTTATCATTTCTTGAGCAACCTAATAACTTCAT
    AACCTTTTTTTGAGACAGGGTCTCACTCTGTCGCCCAGGCTGTATTGCAGTGGTGTGATCACAGCTCACTGCACTTTTGAC
    CTCCTGGGCTTAAGTGGTCCTCCCACCTCAGCCTCCTGAGTAGTTAGGACTACAAGTGTACATCACCATGCCTGGCTAATT
    TTTGTAGATACAGAGTCTCGCTATATTGCCCAGGCTGGGTCTCAAACTCCTGGGACTCAAGCAATCCATCTGCCTGCAGCC
    TCTCAAAGTGCTGGGATTACAAGTGTGAGCCATTGCACCCTGGTTCTTTATAACTCAACTCCATTTGTTCTCCTTCTACAT
    TTTTGTGTTACTGACATCAATATTTACTTCTTCATATATTTAAAATTTTATGAGGATATATTAAAGTTATCATTTCAAACC
    CTAAGGAAAGTATACATTTAGATTTAGTCACATATTCTTCCCAGTGCTTTTAATTTCTTCTTGAATTATTCTATCT
    
    IDF Results:
    
    IDF:  3795 >gi | AL133329.11 | AL133329 | Human DNA sequence from clone RP11-524P
    IDF:  3795 >gi | AC195289.3 | AC195289 | Pan troglodytes BAC clone CH251-716G17 f
    IDF:  3795 >gi | AC192344.3 | AC192344 | Pan troglodytes BAC clone CH251-157M8 fr
    IDF:  3795 >gi | AC169978.2 | AC169978 | Pan troglodytes chromosome X clone PTB-9
    IDF:   659 >gi | LT160002.1 | LT160002 | Macaca fascicularis complete genome, chr
    IDF:   649 >gi | AP001711.1 | AP001711 AL163256 | Homo sapiens genomic DNA, chrom
    IDF:   632 >gi | LT160001.1 | LT160001 | Macaca fascicularis complete genome, chr
    IDF:   575 >gi | AC098675.2 | AC098675 AC034282 | Homo sapiens BAC clone RP11-370
    IDF:   565 >gi | AC007314.3 | AC007314 | Homo sapiens BAC clone RP11-74G24 from 2
    IDF:   562 >gi | BS000112.1 | BS000112 BA000046 | Pan troglodytes chromosome 22 c
    IDF:   504 >gi | AC068205.25 | AC068205 | Homo sapiens chromosome 11, clone RP11-
    IDF:   501 >gi | AC206440.2 | AC206440 | Pan troglodytes chromosome X clone CH251
    IDF:   499 >gi | AL354861.11 | AL354861 | Human DNA sequence from clone RP11-180I
    IDF:   499 >gi | AC157755.2 | AC157755 | Pan troglodytes BAC clone CH251-416M24 f
    IDF:   498 >gi | AC087600.21 | AC087600 | Homo sapiens 12q BAC RP11-495C23 (Roswe
    IDF:   496 >gi | Z84484.1 | Z84484 | Human DNA sequence from clone RP1-50J22 on c
    IDF:   493 >gi | AC147285.3 | AC147285 | Pan troglodytes BAC clone RP43-81J3 from
    IDF:   492 >gi | AF099810.1 | AF099810 | Homo sapiens neurexin III-alpha gene, pa
    IDF:   490 >gi | AC093124.3 | AC093124 | Papio anubis clone RP41-216E17, complete
    IDF:   485 >gi | BA000025.2 | BA000025 AP000502-AP000521 | Homo sapiens genomic D
    
    IDF search time: 1
    
    Fasta results: 
    Fasta search time 98
    
    The best scores are:                                      opt bits E(969128)
    gi | AC192344.3 | AC192344 | Pan troglodytes B (154271) [f] 5840 355.8 7.7e-95
    gi | AC169978.2 | AC169978 | Pan troglodytes c (215778) [f] 5840 355.4 1.1e-94
    gi | AC195289.3 | AC195289 | Pan troglodytes B (189809) [f] 5840 355.1 1.3e-94
    gi | AL133329.11 | AL133329 | Human DNA sequen (109508) [f] 5684 346.7 4.3e-92
    gi | AC145336.19 | AC145336 | Pan troglodytes  (155318) [f] 1294 95.0 2.5e-16
    gi | AC145339.34 | AC145339 | Pan troglodytes  (150845) [f] 1294 95.0 2.5e-16
    gi | AC159032.2 | AC159032 | Pan troglodytes B (183880) [r] 1294 94.2 4.5e-16
    gi | AC200436.3 | AC200436 | Pongo abelii BAC  (202873) [f] 1182 88.6 2.2e-14
    gi | AC198041.4 | AC198041 | Pongo abelii BAC  (200198) [f] 1182 88.6 2.2e-14
    gi | AC239452.3 | AC239452 | Chlorocebus aethi (174798) [f] 1105 83.2 9.2e-13