Kevin C. O'Kane
Computer Science Department
University of Northern Iowa
Cedar Falls, IA 50613
July 22, 2022
The ISR code is now packaged with the Mumps Language code. The Mumps distro (see below) contains the ISR code and a small
subset of the OHSUMED database. A link to the full database is given below.
- ISR Book: Implementing Information Retrieval Algorithms Using Mumps
- Mumps & ISR source code distribution
- Compressed OHSU Medline Data Base
The OHSU Medline database was obtained from:
The following notes apply (see the web site above for full discussion):
"... The OHSUMED test collection is a set of 348,566 references from
MEDLINE, the on-line medical information database, consisting of
titles and/or abstracts from 270 medical journals over a five-year
period (1987-1991). The available fields are title, abstract, MeSH
indexing terms, author, source, and publication type. The National
Library of Medicine has agreed to make the MEDLINE references in the
test database available for experimentation, restricted to the
following conditions:
1. The data will not be used in any non-experimental clinical,
library, or other setting.
2. Any human users of the data will explicitly be told that the data
is incomplete and out-of-date.
The OHSUMED document collection was obtained by William Hersh
(hersh@OHSU.EDU) and colleagues for the experiments described in the
papers below:
Hersh WR, Buckley C, Leone TJ, Hickam DH, OHSUMED: An interactive
retrieval evaluation and new large test collection for research,
Proceedings of the 17th Annual ACM SIGIR Conference, 1994, 192-201.
Hersh WR, Hickam DH, Use of a multi-application computer workstation
in a clinical setting, Bulletin of the Medical Library Association,
1994, 82: 382-389. ..."
- Example Output
- Slides for Providence College Talk
The purpose of this document is to
introduce a collection of programs to be found in the Vector Space
The workbench presently consists of about fifty modular programs
written in Mumps and/or bash
script. These programs implement the basic Vector Space Model for
document classification and retrieval as originally developed by G.
Salton [Salton, 1968, 1983, 1988, 1992] and others. Also included is a
collection of approximately 294,000 medical abstracts for testing and
The purpose of this package is to
facilitate teaching, exploration, and experimentation with the vector
space model and the development of new algorithms and techniques. The
modular design of the code, together with the Mumps multidimensional
database model, enables the user to experiment with, augment, and
measure various indexing strategies.
Currently, the package contains programs that perform:
- word frequency analysis,
- stop list generation,
- word stemming,
- term weighting,
- synonym detection,
- phrase identification,
- term clustering,
- document clustering,
- document hyper-clustering, and
- several retrieval methods.
The programs build and calculate:
- document-term matrix,
- term-document matrix,
- term-term matrix,
- document-document matrix,
- dictionary vectors giving:
- word frequency,
- document frequency,
- Zipf's Law coefficients,
- inverse document frequency weights [Salton 1968] and
- discrimination coefficients [Willet 1985].
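As an illustration of the matrices and dictionary vectors listed above, the following is a minimal Python sketch, not the package's Mumps code. The toy corpus, the variable names, and the choice of log2(N/df) as the inverse document frequency form are all assumptions made for the example; the package may weight terms differently.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus: document id -> already tokenized, stemmed words.
docs = {
    1: ["heart", "attack", "aspirin"],
    2: ["aspirin", "dose", "aspirin"],
    3: ["heart", "surgery"],
}

# Document-term matrix: doc_term[d][w] = occurrences of word w in document d.
doc_term = {d: Counter(words) for d, words in docs.items()}

# Term-document matrix: the transpose, term_doc[w][d] = the same count.
term_doc = defaultdict(dict)
for d, counts in doc_term.items():
    for w, n in counts.items():
        term_doc[w][d] = n

# Dictionary vectors: total word frequency and document frequency per term.
word_freq = {w: sum(row.values()) for w, row in term_doc.items()}
doc_freq = {w: len(row) for w, row in term_doc.items()}

# One common inverse document frequency form (an assumption): log2(N / df).
n_docs = len(docs)
idf = {w: math.log2(n_docs / df) for w, df in doc_freq.items()}
```

Both matrices are stored sparsely: only the (document, term) pairs that actually occur take up space, which matters for a collection with hundreds of thousands of abstracts.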
The package includes routines to retrieve documents based on:
- term phrases,
- term cohesion,
- proximity weighted term similarities,
- term clusters,
- document clusters and
- clusters of document clusters.
There are also indexing routines to organize the documents by:
- controlled vocabularies such as MeSH,
- KWIC/KWOC indices,
- n-grams [Manning 1999] and
- Soundex codes [US National Archives, 2007].
Retrieval is carried out through:
- simple sequential searches,
- inverted file searches and
- weighted inverted file searches
using document similarity metrics such as Cosine [Salton 1983].
The experimental corpus provided
(details given below) is the OHSU Medline collection used at the
National Institute of Standards and Technology (NIST) Text Retrieval
Conference 9 (TREC-9) [NIST 2000]. Other user-provided collections may
also be used if their source text is formatted according to the input model.
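The Cosine metric and the inverted file search mentioned in this section can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the package's implementation: the document vectors, their weights, and the function names `cosine` and `search` are hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical weighted document vectors: doc id -> {term: weight}.
doc_vectors = {
    1: {"heart": 1.0, "attack": 2.0},
    2: {"aspirin": 1.5, "heart": 0.5},
    3: {"surgery": 2.0},
}

def cosine(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Inverted file: term -> set of documents that contain the term.
inverted = defaultdict(set)
for d, vec in doc_vectors.items():
    for t in vec:
        inverted[t].add(d)

def search(query):
    """Score only documents sharing at least one term with the query."""
    candidates = set()
    for t in query:
        candidates |= inverted.get(t, set())
    scored = [(cosine(query, doc_vectors[d]), d) for d in candidates]
    return sorted(scored, reverse=True)

results = search({"heart": 1.0})
```

A simple sequential search would instead call `cosine` on every document in the collection. The inverted file avoids that work by touching only documents that share a term with the query, which is why it scales better on large collections.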
Most of the code in these modules is
written in Mumps, a language developed in medicine in the late 1960s
[Barnett 1970, Bowie 1976, O'Kane 2008] which supports string
handling and a multidimensional database model that is ideally
suited for vector space model implementations. The Mumps modules are
invoked by bash
scripts which control the flow of data and multitasking.
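To give a feel for why a multidimensional, sparse array model fits vector space work, here is a loose Python analog. It is a sketch only: the array name `doc`, its two-level (document, word) layout, and the helper `next_subscript` are hypothetical, and a nested dictionary only approximates a Mumps global array, which is persistent and keeps its subscripts in sorted order.

```python
from collections import defaultdict

# Loose analog of a hypothetical two-level array doc(d, w) = count.
# Only cells that are assigned exist, so the matrix is stored sparsely.
doc = defaultdict(dict)
doc[1]["heart"] = 2
doc[1]["aspirin"] = 1
doc[7]["surgery"] = 4

def next_subscript(level, current=None):
    """Return the next key after `current` in sorted order, or None.

    Passing None asks for the first key. This imitates stepping
    through the subscripts of one array level, one at a time.
    """
    for key in sorted(level):
        if current is None or key > current:
            return key
    return None

# Walk every stored (document, word) cell in sorted subscript order.
cells = []
d = next_subscript(doc)
while d is not None:
    w = next_subscript(doc[d])
    while w is not None:
        cells.append((d, w, doc[d][w]))
        w = next_subscript(doc[d], w)
    d = next_subscript(doc, d)
```

The nested walk visits only the cells that were stored, skipping the empty document numbers between 1 and 7, which is the property that makes a sparse document-term matrix practical.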
The Mumps interpreter software used in
these experiments is available for free download (GPL License) at:
[Salton 1968] Salton, G., Automatic Information Organization and Retrieval, McGraw Hill (New York, 1968).
[Salton 1971] Salton, G., ed., The SMART Retrieval System, Experiments in Automatic Document Processing, Prentice-Hall (Englewood Cliffs, NJ, 1971).
[Salton 1983] Salton, G. and McGill, M.J., Introduction to Modern Information Retrieval, McGraw Hill (New York, 1983).
[Salton 1988] Salton, G., Automatic Text Processing, Addison-Wesley (Reading, MA, 1988).
[Salton 1992] Salton, G., The state of retrieval system evaluation, Information Processing & Management, Vol 28 No 4, pp. 441-449 (1992).
[NIST 2000] National Institute of Standards and Technology, Text Retrieval Conference 9, https://trec.nist.gov/pubs/trec9/t9_proceedings.html
[Willet 1985] Willett, P., An algorithm for calculation of exact term discrimination values, Information Processing & Management, Vol 21 No 3, pp. 225-232 (1985).
Information Storage and Retrieval Videos
Vector Space Model
Vector Space Model Matrices
KWIC/KWOC indices, stop lists and stemming
Reducing the collection to word stems
Word pruning based on frequency
Document Term Matrix
Global Array Overview
The Big Picture
Term Normalization and Weights
Term-Term Matrix Overview
Term-Term Matrix Calculation
Term-Term Matrix for Full Collection
Pruning the Document-Term Matrix
Inverse Document Frequencies
Weighting Terms in Documents
Building a MeSH Tree
MeSH Tree Print Programs
MeSH Index Program
MeSH Titles Program
Find MeSH Terms and Sub-Terms