Mumps MDH Toolkit
Experiments in Information Storage and Retrieval Using Mumps
5th Edition
Kevin C. O'Kane, Ph.D.
Computer Science Department
University of Northern Iowa
Cedar Falls, IA 50614
okane@cs.uni.edu
http://www.cs.uni.edu/~okane
February 7, 2010
Copyright (c) 2007, 2008, 2009, 2010 Kevin C. O'Kane, Ph.D.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation; with the Invariant Sections being: Page 1, with the Front-Cover Texts being: Page 1, and with the Back-Cover Texts being: no Back-Cover Texts.
The purpose of this text is to illustrate several basic information storage and retrieval techniques through real-world data experiments. Information retrieval is the art of identifying similarities between queries and objects in a database. In nearly all cases, the objects found as a result of the query will not be identical to the query but will resemble it in some fashion.
For example, if your query is "give me articles about aviation," the results might include articles about early pioneers in the field, technical reports on aircraft design, airline flight schedules, information on airports, and so on. The term "aviation," when typed into Google, returns about 111,000,000 hits, all of which have something to do with aviation.
Information retrieval isn't restricted to text retrieval. If you have an excerpt of a musical piece (say, from the Beethoven 9th Symphony) and you want to find other music similar to it (such as the Beethoven Choral Fantasy), you need a retrieval engine that can detect the similarities without also returning dissimilar works such as von Weber's Der Freischütz.
Similar examples exist in many other areas. In bioinformatics, researchers often identify DNA or protein sequences and search massive databases for similar (and sometimes only distantly related) sequences. For example, consider the DNA sequence:
>gi|2695846|emb|Y13255.1|ABY13255 Acipenser baeri mRNA for immunoglobulin heavy chain, clone ScH 3.3
TGGTTACAACACTTTCTTCTTTCAATAACCACAATACTGCAGTACAATGGGGATTTTAACAGCTCTCTGTATAATAATGA
CAGCTCTATCAAGTGTCCGGTCTGATGTAGTGTTGACTGAGTCCGGACCAGCAGTTATAAAGCCTGGAGAGTCCCATAAA
CTGTCCTGTAAAGCCTCTGGATTCACATTCAGCAGCGCCTACATGAGCTGGGTTCGACAAGCTCCTGGAAAGGGTCTGGA
ATGGGTGGCTTATATTTACTCAGGTGGTAGTAGTACATACTATGCCCAGTCTGTCCAGGGAAGATTCGCCATCTCCAGAG
ACGATTCCAACAGCATGCTGTATTTACAAATGAACAGCCTGAAGACTGAAGACACTGCCGTGTATTACTGTGCTCGGGGC
GGGCTGGGGTGGTCCCTTGACTACTGGGGGAAAGGCACAATGATCACCGTAACTTCTGCTACGCCATCACCACCGACAGT
GTTTCCGCTTATGGAGTCATGTTGTTTGAGCGATATCTCGGGTCCTGTTGCTACGGGCTGCTTAGCAACCGGATTCTGCC
TACCCCCGCGACCTTCTCGTGGACTGATCAATCTGGAAAAGCTTTT
Here the first line identifies the name and library accession numbers of the sequence, and the subsequent lines are the DNA nucleotide codes (the letters A, C, G, and T represent adenine, cytosine, guanine, and thymine, respectively). A program known as BLAST (Basic Local Alignment Search Tool) can be used to find similar sequences in the online databases of known sequences. If you submit the above to NCBI BLAST (NCBI is the National Center for Biotechnology Information), it will search the nr database of 6,284,619 nucleotide sequences, presently more than 22,427,755,047 bytes in length. The result is a list of database sequences ranked by their similarity to the query sequence. Sequences whose similarity score exceeds a threshold are displayed. One of these is:
>gb|U17058.1|LOU17058 Lepisosteus osseus Ig heavy chain V region mRNA, partial cds
Length=159
Score = 151 bits (76), Expect = 4e-33
Identities = 133/152 (87%), Gaps = 0/152 (0%)
Strand=Plus/Plus
Query  242  TGGGTGGCTTATATTTACTCAGGTGGTAGTAGTACATACTATGCCCAGTCTGTCCAGGGA  301
            |||||||| ||||||||| | | ||| || | |||||||||| |||||||||||||||||
Sbjct  4    TGGGTGGCGTATATTTACACCGATGGGAGCAATACATACTATTCCCAGTCTGTCCAGGGA  63
Query  302  AGATTCGCCATCTCCAGAGACGATTCCAACAGCATGCTGTATTTACAAATGAACAGCCTG  361
            |||||| |||||||||||||| ||||||| |    |||||| ||||| |||| |||||||
Sbjct  64   AGATTCACCATCTCCAGAGACAATTCCAAGAATCAGCTGTACTTACAGATGAGCAGCCTG  123
Query  362  AAGACTGAAGACACTGCCGTGTATTACTGTGC  393
            ||||||||||||||||| ||||||||||||||
Sbjct  124  AAGACTGAAGACACTGCTGTGTATTACTGTGC  155
In the display from BLAST seen above, the sections of the query that match the sequence in the database are shown. The numbers at the beginning and end of each line are the starting and ending positions of the subsequence (counting from one, the first position of each sequence). Where there is a vertical line between the query and the subject, there is an exact match. Where there is a blank, there is a mismatch.
It should be clear that, even though the subject differs from the query in many places, the two have a high degree of similarity.
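The identity figure in the BLAST report can be checked directly from the aligned rows. The following is a minimal sketch, written in Python purely for illustration (the function name count_identities is hypothetical and is not part of BLAST), that counts the positions at which the query and subject rows shown above agree. It assumes a gap-free alignment, as in this particular hit.

```python
def count_identities(query, subject):
    """Return (matches, length) for two aligned, equal-length sequences."""
    if len(query) != len(subject):
        raise ValueError("aligned rows must have the same length")
    matches = sum(1 for q, s in zip(query, subject) if q == s)
    return matches, len(query)


# The three aligned rows from the BLAST display, joined end to end.
QUERY = (
    "TGGGTGGCTTATATTTACTCAGGTGGTAGTAGTACATACTATGCCCAGTCTGTCCAGGGA"
    "AGATTCGCCATCTCCAGAGACGATTCCAACAGCATGCTGTATTTACAAATGAACAGCCTG"
    "AAGACTGAAGACACTGCCGTGTATTACTGTGC"
)
SUBJECT = (
    "TGGGTGGCGTATATTTACACCGATGGGAGCAATACATACTATTCCCAGTCTGTCCAGGGA"
    "AGATTCACCATCTCCAGAGACAATTCCAAGAATCAGCTGTACTTACAGATGAGCAGCCTG"
    "AAGACTGAAGACACTGCTGTGTATTACTGTGC"
)

matches, length = count_identities(QUERY, SUBJECT)
print(f"Identities = {matches}/{length} ({100 * matches / length:.1f}%)")
```

For the rows above, this should reproduce the 133/152 figure in the report. BLAST itself does far more than this (it locates the aligned region in the first place and scores it statistically); the sketch only recounts matches in an alignment that has already been found.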
Also, consider the search for similar images. Again, this involves searching for similarities, not identity. For example, a human observer would clearly see the two following pictures as dealing with the same subject, despite the differences:
An obvious question would be, how can you write a computer program to see the similarity?
The following are several example programs illustrating the use of Mumps to create indexing vocabularies and to process and index text files.
The following is a sample of the MeSH (Medical Subject Headings) tree hierarchy:
Body Regions;A01
Abdomen;A01.047
Abdominal Cavity;A01.047.025
Peritoneum;A01.047.025.600
Douglas' Pouch;A01.047.025.600.225
Mesentery;A01.047.025.600.451
Mesocolon;A01.047.025.600.451.535
Omentum;A01.047.025.600.573
Peritoneal Cavity;A01.047.025.600.678
Retroperitoneal Space;A01.047.025.750
Abdominal Wall;A01.047.050
Groin;A01.047.365
Inguinal Canal;A01.047.412
Umbilicus;A01.047.849
Back;A01.176
Lumbosacral Region;A01.176.519
Sacrococcygeal Region;A01.176.780
Breast;A01.236
Nipples;A01.236.500
Extremities;A01.378
Amputation Stumps;A01.378.100
The format is: text description, semicolon, code hierarchy. Thus, "Body Regions" is code A01, "Abdomen" is A01.047, "Peritoneum" is A01.047.025.600, and so forth. The goal is to build a global array tree in which each successive index is a successive code in the MeSH hierarchy and the text of each entry is stored in the tree at the appropriate level. Thus, we want something like:
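The mapping from input lines to tree nodes described above can be sketched outside Mumps as well. The following Python fragment is an illustration only, not the author's program: it uses an ordinary dictionary keyed by tuples of code parts as a hypothetical stand-in for the ^mesh global array, and the names build_mesh and SAMPLE are invented for the example.

```python
def build_mesh(lines):
    """Map each code path, e.g. ("A01", "047"), to its text description."""
    mesh = {}
    for line in lines:
        text, sep, code = line.strip().partition(";")
        if not sep or not text or not code:
            continue  # skip anything that is not of the form "text;code"
        mesh[tuple(code.split("."))] = text
    return mesh


# A few lines in the "text;code" format of the sample above.
SAMPLE = [
    "Body Regions;A01",
    "Abdomen;A01.047",
    "Abdominal Cavity;A01.047.025",
    "Peritoneum;A01.047.025.600",
]

mesh = build_mesh(SAMPLE)
for path in sorted(mesh):
    print(path, "=", mesh[path])
```

Each tuple plays the role of a list of Mumps subscripts, so ("A01", "047") corresponds to the node ^mesh("A01","047"). Unlike a Mumps global, this dictionary lives only in memory and is not stored in sorted order; sorting happens when the keys are printed.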
|
Graphically:

This can be done with a program such as:
|
Notes:
. if key=""!(code="") break
uses the OR operator (!). Note also the parentheses, which are needed because Mumps evaluates expressions strictly from left to right, without operator precedence.
. for j=1:1:i-1 set z=z_""""_x(j)_""","
uses the concatenation operator (_) as well as a local array x(j). Local arrays should be used as little as possible, since access to them through the Mumps run-time symbol table can be slow when the symbol table holds many elements.
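The second note describes assembling a global array reference piece by piece: a quote, a code, a quote, and a comma are appended for each level. As a rough illustration of that string-building step, here is a sketch in Python (not Mumps, and not the author's code; the function name global_ref is hypothetical). It assumes the subscripts contain no double-quote characters, which holds for MeSH codes.

```python
def global_ref(name, subscripts):
    """Build a reference string such as ^mesh("A01","047")."""
    if not subscripts:
        raise ValueError("at least one subscript is required")
    ref = "^" + name + "("
    for s in subscripts[:-1]:
        ref = ref + '"' + s + '",'  # quote, subscript, quote, comma
    # The last subscript is followed by the closing parenthesis instead.
    return ref + '"' + subscripts[-1] + '")'


print(global_ref("mesh", "A01.047.025".split(".")))
```

In Mumps, a string built this way can then be evaluated through indirection, which is how a reference of arbitrary depth can be constructed at run time.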
#!/usr/bin/mumps
# mtreeprint.mps January 13, 2008
      for lev1=$order(^mesh(lev1)) do
      . write lev1," ",^mesh(lev1),!
      . for lev2=$order(^mesh(lev1,lev2)) do
      .. write ?5,lev2," ",^mesh(lev1,lev2),!
      .. for lev3=$order(^mesh(lev1,lev2,lev3)) do
      ... write ?10,lev3," ",^mesh(lev1,lev2,lev3),!
      ... for lev4=$order(^mesh(lev1,lev2,lev3,lev4)) do
      .... write ?15,lev4," ",^mesh(lev1,lev2,lev3,lev4),!
yields:
A01 Body Regions
     047 Abdomen
          025 Abdominal Cavity
               600 Peritoneum
               750 Retroperitoneal Space
          050 Abdominal Wall
          365 Groin
          412 Inguinal Canal
          849 Umbilicus
     176 Back
          519 Lumbosacral Region
          780 Sacrococcygeal Region
     236 Breast
          500 Nipples
     378 Extremities
          100 Amputation Stumps
          610 Lower Extremity
               100 Buttocks
               250 Foot
               400 Hip
               450 Knee
               500 Leg
               750 Thigh
          800 Upper Extremity
               075 Arm
               090 Axilla
               420 Elbow
               585 Forearm
               667 Hand
               750 Shoulder
     456 Head
          313 Ear
          505 Face
               173 Cheek
               259 Chin
               420 Eye
               580 Forehead
               631 Mouth
               733 Nose
               750 Parotid Region
          810 Scalp
          830 Skull Base
               150 Cranial Fossa, Anterior
               165 Cranial Fossa, Middle
               200 Cranial Fossa, Posterior
     598 Neck
     673 Pelvis
          600 Pelvic Floor
     719 Perineum
     911 Thorax
          800 Thoracic Cavity
               500 Mediastinum
               650 Pleural Cavity
          850 Thoracic Wall
     960 Viscera
A02 Musculoskeletal System
     165 Cartilage
          165 Cartilage, Articular
          207 Ear Cartilages
          410 Intervertebral Disk
          507 Laryngeal Cartilages
               083 Arytenoid Cartilage
               211 Cricoid Cartilage
               411 Epiglottis
               870 Thyroid Cartilage
          590 Menisci, Tibial
          639 Nasal Septum
     340 Fascia
          424 Fascia Lata
     513 Ligaments
          170 Broad Ligament
          514 Ligaments, Articular
               100 Anterior Cruciate Ligament
               162 Collateral Ligaments
               287 Ligamentum Flavum
               350 Longitudinal Ligaments
               475 Patellar Ligament
               600 Posterior Cruciate Ligament
.
.
.
Alternatively, using some of the newer Mumps functions, the table can be printed as:
#!/usr/bin/mumps
# mtreeprintnew.mps January 28, 2010
      set x="^mesh(0)"
      for  do
      . set x=$query(x)
      . if x="" break
      . set i=$qlength(x)
      . write ?i*2," ",$qsubscript(x,i)," ",@x,?50,x,!
which produces the output:
A01 Body Regions ^mesh("A01")
047 Abdomen ^mesh("A01","047")
025 Abdominal Cavity ^mesh("A01","047","025")
600 Peritoneum ^mesh("A01","047","025","600")
225 Douglas' Pouch ^mesh("A01","047","025","600","225")
451 Mesentery ^mesh("A01","047","025","600","451")
535 Mesocolon ^mesh("A01","047","025","600","451","535")
573 Omentum ^mesh("A01","047","025","600","573")
678 Peritoneal Cavity ^mesh("A01","047","025","600","678")
750 Retroperitoneal Space ^mesh("A01","047","025","750")
050 Abdominal Wall ^mesh("A01","047","050")
365 Groin ^mesh("A01","047","365")
412 Inguinal Canal ^mesh("A01","047","412")
849 Umbilicus ^mesh("A01","047","849")
176 Back ^mesh("A01","176")
519 Lumbosacral Region ^mesh("A01","176","519")
780 Sacrococcygeal Region ^mesh("A01","176","780")
236 Breast ^mesh("A01","236")
500 Nipples ^mesh("A01","236","500")
378 Extremities ^mesh("A01","378")
100 Amputation Stumps ^mesh("A01","378","100")
610 Lower Extremity ^mesh("A01","378","610")
100 Buttocks ^mesh("A01","378","610","100")
250 Foot ^mesh("A01","378","610","250")
149 Ankle ^mesh("A01","378","610","250","149")
300 Forefoot, Human ^mesh("A01","378","610","250","300")
480 Metatarsus ^mesh("A01","378","610","250","300","480")
792 Toes ^mesh("A01","378","610","250","300","792")
380 Hallux ^mesh("A01","378","610","250","300","792","380")
510 Heel ^mesh("A01","378","610","250","510")
400 Hip ^mesh("A01","378","610","400")
450 Knee ^mesh("A01","378","610","450")
500 Leg ^mesh("A01","378","610","500")
750 Thigh ^mesh("A01","378","610","750")
800 Upper Extremity ^mesh("A01","378","800")
075 Arm ^mesh("A01","378","800","075")
090 Axilla ^mesh("A01","378","800","090")
420 Elbow ^mesh("A01","378","800","420")
585 Forearm ^mesh("A01","378","800","585")
667 Hand ^mesh("A01","378","800","667")
430 Fingers ^mesh("A01","378","800","667","430")
705 Thumb ^mesh("A01","378","800","667","430","705")
715 Wrist ^mesh("A01","378","800","667","715")
750 Shoulder ^mesh("A01","378","800","750")
456 Head ^mesh("A01","456")
313 Ear ^mesh("A01","456","313")
505 Face ^mesh("A01","456","505")
173 Cheek ^mesh("A01","456","505","173")
259 Chin ^mesh("A01","456","505","259")
420 Eye ^mesh("A01","456","505","420")
338 Eyebrows ^mesh("A01","456","505","420","338")
504 Eyelids ^mesh("A01","456","505","420","504")
421 Eyelashes ^mesh("A01","456","505","420","504","421")
580 Forehead ^mesh("A01","456","505","580")
631 Mouth ^mesh("A01","456","505","631")
515 Lip ^mesh("A01","456","505","631","515")
The listing above shows the tree in key order, with the text of each MeSH entry stored at its node.
This is also the order in which the nodes are stored in the file system. The Mumps function $query() can be used to dump the file system in sequential key order. You pass to $query() a string containing a global array reference (with embedded quotes around string indices), and it returns the next array reference in the file system. Eventually you will run out of "^mesh" references and receive an empty string, so you must test for the empty string after each call.
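The "give me the next reference, stop on empty" pattern described above can be mimicked in a few lines of Python. This is only a sketch of the idea, not of how Mumps implements $query(): the dictionary MESH and the functions next_ref and dump are hypothetical, tuples of code parts stand in for global array references, and None stands in for the empty string that signals the end.

```python
import bisect

# Hypothetical stand-in for a small part of the ^mesh global array.
MESH = {
    ("A01",): "Body Regions",
    ("A01", "047"): "Abdomen",
    ("A01", "047", "025"): "Abdominal Cavity",
    ("A01", "176"): "Back",
}


def next_ref(mesh, current):
    """Return the key that follows `current` in sorted order, or None at the end."""
    keys = sorted(mesh)
    i = bisect.bisect_right(keys, current)
    return keys[i] if i < len(keys) else None


def dump(mesh):
    """Visit every node in key order, stopping when next_ref() reports the end."""
    visited = []
    ref = ()  # the empty tuple sorts before every real key
    while True:
        ref = next_ref(mesh, ref)
        if ref is None:
            break
        visited.append((ref, mesh[ref]))
    return visited


for ref, text in dump(MESH):
    print("  " * len(ref), ref[-1], text)
```

Because a tuple sorts immediately before any longer tuple that extends it, walking the keys in sorted order visits each parent before its children, which gives the same depth-first listing as the output shown earlier. Re-sorting the keys on every call is wasteful and is done here only to keep the sketch short.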
|
Both produce output that looks like:
|
Notes:
. write x,?50,@x,!
displays the reference (x) and then prints the contents of the node x (@x).
|
Notes:
. if $find(@x,key) do // is key stored at this ref?
This line uses indirection to get the text stored at the node (@x evaluates to the contents of the global array reference held in x). The $find() function searches that text for an occurrence of the input key; it returns a nonzero position if the key occurs anywhere in the text and zero otherwise.
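To make the behaviour of this test concrete, here is a small Python sketch of a keyword search over the tree. It is an illustration under stated assumptions rather than the author's program: MESH is a hypothetical in-memory stand-in for the ^mesh global, and mumps_find is an invented helper that imitates the return convention of $find() (the position just past the first match, counting from one, or zero when there is no match).

```python
def mumps_find(text, key):
    """Imitate $find(): 1-based position just past the first match, or 0."""
    i = text.find(key)
    return 0 if i < 0 else i + len(key) + 1


# Hypothetical stand-in for a small part of the ^mesh global array.
MESH = {
    ("A01",): "Body Regions",
    ("A01", "047"): "Abdomen",
    ("A01", "047", "025"): "Abdominal Cavity",
    ("A01", "047", "050"): "Abdominal Wall",
    ("A01", "176"): "Back",
}


def search(mesh, key):
    """Return (reference, text) for every node whose text contains key."""
    return [
        (ref, text)
        for ref, text in sorted(mesh.items())
        if mumps_find(text, key)  # nonzero means the key was found
    ]


for ref, text in search(MESH, "Abdom"):
    print(ref, text)
```

Like $find(), this comparison is case-sensitive, so a search for "abdom" would find nothing here. A practical search would probably normalize case on both sides first.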
HTML file query3.html:
|
Next, move the following file to /var/www/cgi-bin and make it world-readable and executable.
|