810:115g Spring 2008
810:115g Information Storage and Retrieval (3 hours)

Last updated: February 13, 2008
Objective: To understand computer-based automatic indexing and retrieval of text- and web-based information.
Books: Experiments in Information Storage and Retrieval Using Mumps
Course Materials: Programming examples will mainly use the Mumps language, although other languages, including C++, Perl, PHP, and Java, may be used as desired.

Data bases, examples and text
Mumps Pocket Guide JPG images

Requirements: The requirements will consist of a set of assignments to be performed individually and a project, which may be done either individually or in small groups (approx. 3 persons max). As this course depends heavily on lecture content, attendance is required. Excessive absence will result in a reduced grade for the project. The term "excessive absence" means: don't push your luck.

Term Project (30%)
Assignments (40%) Late assignments charged 5% per class day to a max of 25%. Chronically late assignments will be charged at a higher rate.
Tests (2 at 10% each)
Final (10%).

Classes: Classes are lecture format. Cell phones, pagers, laptops and PDAs may not be used.

Test 1: Fri Feb 29.
Test 2: Fri April 4.
Final: Click Here
Makeup Tests: Makeup tests will be given only in cases of demonstrated need for causes such as serious illness, family emergency or University sanctioned schedule conflict. In all cases, written documentation will be required.
Penalty for Hacking & Cheating: A grade of "F" for the course; termination of computer access. If your work duplicates in whole or part the work of another, both works will receive a grade of F.
Project

Projects will be presented in class the last week of classes.

All projects will be presented in class with a brief online demonstration. A vote will be taken for the best project and the winner will be exempt from the final (a grade of A will be entered for the final exam grade). You may work in teams of 1 to 3 (hint: have one person do the indexing, one the retrieval code, and the other the web interface). All group members will receive the same grade.

Project: Using the OSU Medline file, design a web-based system to access the data. Your project will include code to index and access the information. Use the first 20,000 abstracts.

You may design your own access interface method: keywords, hierarchies, etc. Input can be either by typed queries or point-and-click. Be original!

Final Grades: Final grades will not be available via email. If you want your grade mailed to you, bring a stamped, self-addressed envelope to the final.
Contact Click Here
Computer: I am assuming you will want to use your own computer for the assignments and project. If that is not the case, you may have an account on one of my Linux servers.
Getting Started
  1. Get and install Cygwin for Windows
  2. Get and install Mumps under Cygwin: In Cygwin, type:
    1. wget http://cns2.uni.edu/~okane/source/MUMPS-MDH/mumpscompiler-10.0.src.tar.gz
      (Check http://cns2.uni.edu/~okane/source/MUMPS-MDH/ for the latest version number).
    2. tar xvzf mumpscompiler-10.0.src.tar.gz
    3. cd mumpsc
    4. ./configure prefix=/usr
    5. make
    6. make install
  3. Learn an adult editor (nano/pico are for children): vi Tutorial
  4. Learn Mumps.
HTML BareBones HTML Guide
Assignments: Do not push assignments under the office door - leave them in the CS Office. All pages must be stapled.

  1. Read the MESH Codes (mtrees2003.gz) and build a global array indexed at the first level by text words of the code (the part prior to the semi-colon) and at the second level by the hierarchy code (the part after the semi-colon) associated with the word. Convert the words to lower case. Example:

    Laryngeal Cartilages;A02.165.507

    becomes:

    set ^word("laryngeal","A02.165.507")=""
    set ^word("cartilages","A02.165.507")=""

    As many text descriptions have more than one word in the description, a given word may have multiple codes. In a separate program, read in a keyword and display all hierarchy codes whose descriptions contain that keyword. Due: Fri Feb 8.

    Turn in both programs and the results of running the lookup program with the keywords cavity, lobe, and muscle. Use the $order() function to extract the second level indices of a found word. The MESH file mtrees2003.gz can be decompressed with the command:

    gzip -d mtrees2003.gz
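
The indexing and lookup steps of this assignment can be sketched as follows. This is a rough illustration in Python rather than Mumps, not a model solution: a dictionary of sets stands in for the two-level global array ^word(word,code), and sorted() only approximates walking the second-level indices with $order(). The names build_word_index and lookup are hypothetical, stripping commas from words is an assumption, and the two "X" sample lines are invented and are not real MESH entries.

```python
from collections import defaultdict

def build_word_index(lines):
    """Map each lower-cased description word to the set of codes it appears under."""
    index = defaultdict(set)
    for line in lines:
        line = line.strip()
        if ";" not in line:
            continue  # skip blank or malformed lines
        # Text before the semi-colon is the description, text after it is the code.
        description, code = line.rsplit(";", 1)
        for raw in description.lower().split():
            word = raw.strip(",")  # assumption: commas are not part of a word
            if word:
                index[word].add(code)
    return index

def lookup(index, keyword):
    """Return the codes stored under a keyword, in sorted order."""
    return sorted(index.get(keyword.lower(), ()))

# First line is the example from the assignment; the "X" lines are invented.
sample = [
    "Laryngeal Cartilages;A02.165.507",
    "Example Cavity;X01.100",
    "Cavity, Sample;X02.200",
]

index = build_word_index(sample)
for keyword in ("laryngeal", "cavity"):
    print(keyword, lookup(index, keyword))
```

In the Mumps version the index persists in the global array between the two programs, so the lookup program does not need to rebuild it.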

  2. Using the MESH data base created above, write the HTML and code necessary to permit a lookup of codes based on a word. That is, create browser pages. The first will permit the user to enter a word. The program executed will find all MESH codes that use that word and return the result to the browser, nicely formatted, along with a text entry box to let them try again.

    Due: Fri Feb 15.
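
One possible shape for the lookup page, sketched in Python rather than Mumps. It assumes the word arrives in a query string parameter named word and that the index from assignment 1 is available; WORD_INDEX, FORM, render_page and the form action "lookup" are all hypothetical names chosen for illustration.

```python
import html
from urllib.parse import parse_qs

# Hypothetical stand-in for the index built in assignment 1.
WORD_INDEX = {
    "laryngeal": ["A02.165.507"],
    "cartilages": ["A02.165.507"],
}

# Entry box shown on every page so the user can try again.
FORM = (
    '<form method="get" action="lookup">'
    'Word: <input type="text" name="word"> '
    '<input type="submit" value="Search">'
    "</form>"
)

def render_page(query_string):
    """Build the HTML page for a query string such as 'word=laryngeal'."""
    params = parse_qs(query_string)
    word = params.get("word", [""])[0].strip().lower()
    parts = ["<html><body><h1>MESH code lookup</h1>"]
    if word:
        codes = WORD_INDEX.get(word, [])
        # Escape user input before echoing it back into the page.
        parts.append("<h2>Results for %s</h2>" % html.escape(word))
        if codes:
            parts.append("<ul>")
            parts.extend("<li>%s</li>" % html.escape(code) for code in codes)
            parts.append("</ul>")
        else:
            parts.append("<p>No codes found.</p>")
    parts.append(FORM)
    parts.append("</body></html>")
    return "\n".join(parts)

print(render_page("word=Laryngeal"))
```

Run as a CGI program, the same function would be called with the server's QUERY_STRING value and its output printed after a Content-type: text/html header and a blank line.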

  3. For the compressed text file of the Aeneid (in English) at: http://www.cs.uni.edu/~okane/source/ISR/aeneid.txt.gz, build a dictionary of words sorted by frequency (highest first), then calculate a Zipf constant for each word. Turn in the code, the commands you used (script is handy here), the first page of the dictionary and the first page of the list of Zipf constants. Use $zzScan to read and $zn() to process the words. Be sure to delete any previous contents of your dictionary before you re-run your programs (or the frequencies will be cumulative).

    Due: Fri Feb 22.
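
The frequency dictionary and Zipf calculation can be sketched as follows, in Python rather than Mumps. The sketch assumes the Zipf constant for a word is its rank multiplied by its frequency (some formulations also divide by the total number of words) and that a word is any run of the letters a-z; zipf_table is a hypothetical name and the sample sentence is invented.

```python
import re
from collections import Counter

def zipf_table(text):
    """Return (rank, word, frequency, rank * frequency) rows, highest frequency first."""
    words = re.findall(r"[a-z]+", text.lower())
    # A fresh Counter on every call, so counts never accumulate across runs.
    counts = Counter(words)
    # Ties are broken alphabetically so the ranking is reproducible.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        (rank, word, freq, rank * freq)
        for rank, (word, freq) in enumerate(ranked, start=1)
    ]

for row in zipf_table("the cat and the dog and the bird"):
    print("%4d %-10s %6d %8d" % row)
```

On a long text the last column should stay roughly constant over the middle ranks if the text follows Zipf's law; a sample this small is only useful for checking the arithmetic.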

  4. Build weighted document-term and term-document matrices for the first 5,000 abstracts in the OSU Medline data base. You may use the http://www.cs.uni.edu/~okane/source/ISR/medline.translated.txt.gz file as input. Note the format: token, offset, doc number followed by words on one line. Use $zzscan to read these. Do not use a READ command as the line lengths are too long. These words have already been stemmed and lower cased, with punctuation and short/long words removed. Store the offset with each document - it points back to the start of that document in the full text file.

    Due: Fri March 7.
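
A minimal sketch of the two matrices, in Python rather than Mumps. The assignment does not say which weighting to use, so this assumes term frequency multiplied by log(N / document frequency); it also assumes the fields on each line are separated by whitespace. The function name build_matrices, the "xxx" marker token and the two sample lines are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def build_matrices(lines, max_docs=5000):
    """Return (doc_term, term_doc, offsets) for the first max_docs documents."""
    raw_counts = {}       # doc number -> Counter of term frequencies
    offsets = {}          # doc number -> offset into the full text file
    doc_freq = Counter()  # term -> number of documents containing it
    for line in lines:
        if len(raw_counts) >= max_docs:
            break
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        # Assumed layout: marker token, offset, doc number, then the words.
        offset, doc = int(fields[1]), fields[2]
        counts = Counter(fields[3:])
        raw_counts[doc] = counts
        offsets[doc] = offset
        doc_freq.update(counts.keys())

    n_docs = len(raw_counts)
    doc_term = {}
    term_doc = defaultdict(dict)
    for doc, counts in raw_counts.items():
        doc_term[doc] = {}
        for term, tf in counts.items():
            # Assumed weighting: tf * log(N / df).
            weight = tf * math.log(n_docs / doc_freq[term])
            doc_term[doc][term] = weight
            term_doc[term][doc] = weight
    return doc_term, dict(term_doc), offsets

# Invented sample lines in the assumed format.
sample = [
    "xxx 0 1 heart muscle heart",
    "xxx 512 2 muscle lobe",
]

doc_term, term_doc, offsets = build_matrices(sample)
print(doc_term)
print(term_doc)
print(offsets)
```

Under this weighting a term that occurs in every document gets weight 0, which is why muscle carries no weight in the two-line sample.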

Other Resources Mesh

Documentation and Availability
Documentation Descriptor Data Elements
Descriptor Records
Documentation Qualifier Data Elements
Qualifier Records
Supplementary Records

Resources:

Salton, G., Automatic Information Organization and Retrieval, McGraw-Hill (1968)
Salton, G., Automatic Text Processing, Addison-Wesley (1989)
Salton, G., ed., The SMART Retrieval System, Prentice-Hall (1971)
Salton, G., and McGill, M., Introduction to Modern Information Retrieval, McGraw-Hill (1983)
Borko, H., Automated Language Processing, (1968)
The Smart System from Cornell: ftp://cs.cornell.edu/pub/smart


Data Sets for Machine Learning
WordNet
Apache
Web Archive (you think you have disk capacity problems!)
PostgreSQL Tutorial
Rod Library Electronic Resources
ACM Digital Library
Lemur Toolkit for Language Modeling
Cornell Smart System
SIGIR List and Archives
UVA Electronic Text Center
NLM Gateway
Digital Library Research Laboratory
Automatic Text Browsing Using the Vector Space Model
Lawrence Berkeley Lab Science Articles Archive
The Internet Archive
Search Engine Features
Anatomy of a Search Engine (Google)
Medline (National Library of Medicine)
Information Retrieval by C. J. van Rijsbergen
Modern Information Retrieval Chapter 10
Cystic Fibrosis Reference Collection
Marti Hearst Site
Online Papers
Ed Fox Links
IIT IR Publications
Web IR
WWW 10
WWW 9
WWW 8
WWW 7
WWW 5
IRIS Project
Lots of Links
Top Ten Issues
Searching Genomic Databases
Amazon.com's recommender algorithm
Huffman Trees
Knuth Optimal Binary Trees
Hu-Tucker Trees
AVL (Balanced) Trees
B trees and AVL Trees
B Trees
IBM Clever Project
flex documentation
Homology Searching
Project Gutenberg

The following notice is required by the University:

"The Americans with Disabilities Act of 1990 (ADA) provides protection from illegal discrimination for qualified individuals with disabilities. Students requesting instructional accommodations due to disabilities must arrange for such accommodation through the Office of Disability Services. The ODS is located at: 213 Student Services Center, and the phone number is: 273-2676."

Because the Office of Disability Services has procedures in place to determine the validity of disability claims as well as the need for instructional accommodations, faculty are reminded that they are to direct all students with accommodation requests to the above listed office.

UNDER NO CIRCUMSTANCE SHOULD A FACULTY MEMBER MAKE AN ACCOMMODATION INDEPENDENT OF THE OFFICE OF DISABILITY SERVICES.

Questions may be directed to: Jane Slykhuis, Disability Services Coordinator, at 273-2676 or to this office at 273-2846.