Kevin C. O'Kane
March 7, 2017 (The contents of this page are under development - check back for updates)
The purpose of this document is to introduce a collection of programs to be found in the Vector Space ISR Workbench.
The workbench presently consists of about fifty modular programs written in Mumps, bash script, or both. These programs implement the basic Vector Space Model for document classification and retrieval as originally developed by G. Salton [Salton 1968, 1983, 1988, 1992] and others. Also included is a collection of approximately 294,000 medical abstracts for testing and experiments.
The purpose of this package is to facilitate teaching, exploring, and experimenting with the vector space model, and to support the development of new algorithms and techniques. The modular design of the code, together with the Mumps multidimensional database model, enables the user to experiment with, augment, and measure various indexing strategies.
Currently, the package contains programs that perform:
Most of the code in these modules is written in Mumps, a language developed in a medical setting in the late 1960s [Barnett 1970, Bowie 1976, O'Kane 2008]. Mumps supports string handling and a multidimensional database model that is ideally suited to vector space model implementations. The Mumps modules are invoked by bash scripts, which control the flow of data and multitasking.
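To illustrate why a multidimensional, sparse array model fits the vector space model, the following is a minimal sketch in Python rather than Mumps. It is not code from the workbench. A nested dictionary indexed by document and then by term stands in for a two-subscript array, and documents are ranked against a query by cosine similarity. The toy corpus, the names `docs`, `doc_term`, `cosine`, and `rank`, and the use of raw term counts as weights are all assumptions made for illustration.

```python
import math
from collections import defaultdict

# Hypothetical toy corpus (the workbench itself ships medical abstracts).
docs = {
    "d1": "heart attack risk factors",
    "d2": "heart surgery outcomes",
    "d3": "protein folding simulation",
}

# doc_term[doc_id][term] = raw count. Only terms that occur are stored,
# much like a sparse array with two subscripts (document, term).
doc_term = defaultdict(dict)
for doc_id, text in docs.items():
    for term in text.split():
        doc_term[doc_id][term] = doc_term[doc_id].get(term, 0) + 1


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query):
    """Return (score, doc_id) pairs sorted from most to least similar."""
    q = {}
    for term in query.split():
        q[term] = q.get(term, 0) + 1
    scores = [(cosine(q, vec), doc_id) for doc_id, vec in doc_term.items()]
    return sorted(scores, reverse=True)


if __name__ == "__main__":
    for score, doc_id in rank("heart attack"):
        print(f"{doc_id}\t{score:.3f}")
```

In this sketch, a query such as "heart attack" ranks `d1` first because it shares both terms with the query, while `d3` shares none and scores zero. A fuller implementation would typically replace the raw counts with a term weighting scheme.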
The Mumps interpreter software used in these experiments is available for free download (GPL License) at:
[Salton 1968] Salton, G., Automatic Information Organization and Retrieval, McGraw Hill (New York, 1968).
[Salton 1971] Salton, G., ed., The SMART Retrieval System, Experiments in Automatic Document Processing, Prentice-Hall (Englewood Cliffs, NJ, 1971).
[Salton 1983] Salton, G., and McGill, M.J., Introduction to Modern Information Retrieval, McGraw Hill (New York, 1983).
[Salton 1988] Salton, G., Automatic Text Processing, Addison-Wesley (Reading, 1988).
[Salton 1992] Salton, G., The state of retrieval system evaluation, Information Processing & Management, Vol. 28, No. 4, pp. 441-449 (1992).
[NIST 2000] National Institute of Standards and Technology, Text Retrieval Conference 9 http://trec.nist.gov/pubs/trec9/t9_proceedings.html
[Willet 1985] Willett, P., An algorithm for calculation of exact term discrimination values, Information Processing & Management, Vol. 21, No. 3, pp. 225-232 (1985).