Mumps MDH Toolkit
Experiments in Information Storage and Retrieval Using Mumps
5th Edition
Kevin C. O'Kane, Ph.D.
Computer Science Department
University of Northern Iowa
Cedar Falls, IA 50614
kc.okane@gmail.com
http://www.cs.uni.edu/~okane
http://www.omahadave.com
December 27, 2010
Copyright (c) 2007, 2008, 2009, 2010 Kevin C. O'Kane, Ph.D.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.1 or any later version published by the Free Software Foundation; with the Invariant Sections being: Page 1, with the Front-Cover Texts being: Page 1, and with the Back-Cover Texts being: no Back-Cover Texts.
The purpose of this text is to illustrate several basic information storage and retrieval techniques through real-world data experiments. Information retrieval is the art of identifying similarities between queries and objects in a database. In nearly all cases, the objects found as a result of a query will not be identical to the query but will resemble it in some fashion.
For example, if your query is "give me articles about aviation," the results might include articles about early pioneers in the field, technical reports on aircraft design, airline flight schedules, information on airports and so on. The term "aviation," when typed into Google, returns about 111,000,000 hits, all of which have something to do with aviation.
Information retrieval isn't restricted to text retrieval. If you have a clip from a musical piece (say, from the Beethoven 9th Symphony) and you want to find other music similar to it (such as a passage from the Beethoven Choral Fantasy), you need a retrieval engine that can detect the similarity between the two while rejecting unrelated music such as von Weber's Der Freischütz.
Similar examples exist in many other areas. In Bioinformatics, researchers often identify DNA or protein sequences and search massive databases for similar (and sometimes only distantly related) sequences. For example, the DNA sequence:
    >gi|2695846|emb|Y13255.1|ABY13255 Acipenser baeri mRNA for immunoglobulin heavy chain, clone ScH 3.3
    TGGTTACAACACTTTCTTCTTTCAATAACCACAATACTGCAGTACAATGGGGATTTTAACAGCTCTCTGTATAATAATGA
    CAGCTCTATCAAGTGTCCGGTCTGATGTAGTGTTGACTGAGTCCGGACCAGCAGTTATAAAGCCTGGAGAGTCCCATAAA
    CTGTCCTGTAAAGCCTCTGGATTCACATTCAGCAGCGCCTACATGAGCTGGGTTCGACAAGCTCCTGGAAAGGGTCTGGA
    ATGGGTGGCTTATATTTACTCAGGTGGTAGTAGTACATACTATGCCCAGTCTGTCCAGGGAAGATTCGCCATCTCCAGAG
    ACGATTCCAACAGCATGCTGTATTTACAAATGAACAGCCTGAAGACTGAAGACACTGCCGTGTATTACTGTGCTCGGGGC
    GGGCTGGGGTGGTCCCTTGACTACTGGGGGAAAGGCACAATGATCACCGTAACTTCTGCTACGCCATCACCACCGACAGT
    GTTTCCGCTTATGGAGTCATGTTGTTTGAGCGATATCTCGGGTCCTGTTGCTACGGGCTGCTTAGCAACCGGATTCTGCC
    TACCCCCGCGACCTTCTCGTGGACTGATCAATCTGGAAAAGCTTTT
The first line identifies the name and library accession numbers of the sequence, and the subsequent lines are the DNA nucleotide codes (the letters A, C, G, and T represent adenine, cytosine, guanine, and thymine, respectively). A program known as BLAST (Basic Local Alignment Search Tool) can be used to find similar sequences in the online databases of known sequences. If you submit the above to BLAST at NCBI (the National Center for Biotechnology Information), it will search the nr database of 6,284,619 nucleotide sequences, presently more than 22,427,755,047 bytes in length. The result is a list of hits in the database, ranked by their similarity to the query sequence. Sequences whose similarity score exceeds a threshold are displayed. One of these is:
    >gb|U17058.1|LOU17058 Lepisosteus osseus Ig heavy chain V region mRNA, partial cds
    Length=159

     Score = 151 bits (76), Expect = 4e-33
     Identities = 133/152 (87%), Gaps = 0/152 (0%)
     Strand=Plus/Plus

    Query  242  TGGGTGGCTTATATTTACTCAGGTGGTAGTAGTACATACTATGCCCAGTCTGTCCAGGGA  301
                |||||||| ||||||||| | | ||| || | |||||||||| |||||||||||||||||
    Sbjct  4    TGGGTGGCGTATATTTACACCGATGGGAGCAATACATACTATTCCCAGTCTGTCCAGGGA  63

    Query  302  AGATTCGCCATCTCCAGAGACGATTCCAACAGCATGCTGTATTTACAAATGAACAGCCTG  361
                |||||| |||||||||||||| ||||||| |    |||||| ||||| |||| |||||||
    Sbjct  64   AGATTCACCATCTCCAGAGACAATTCCAAGAATCAGCTGTACTTACAGATGAGCAGCCTG  123

    Query  362  AAGACTGAAGACACTGCCGTGTATTACTGTGC  393
                ||||||||||||||||| ||||||||||||||
    Sbjct  124  AAGACTGAAGACACTGCTGTGTATTACTGTGC  155
The BLAST display above shows the sections of the query that match the sequence in the database. The numbers at the beginning and end of each line are the starting and ending positions of the subsequence (positions are numbered from one). A vertical line between the query and the subject marks an exact match; a blank marks a mismatch.
It should be clear that, even though the subject differs from the query in many places, the two have a high degree of similarity.
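The identity count reported by BLAST can be checked by comparing aligned positions directly. The following is a minimal sketch, not part of the original toolkit: the program name identity.mps and the variables q, s, and n are illustrative, and the two strings are the first aligned row of the display above (query positions 242 to 301).

```mumps
#!/usr/bin/mumps
# identity.mps - illustrative sketch, not part of the original toolkit
      set q="TGGGTGGCTTATATTTACTCAGGTGGTAGTAGTACATACTATGCCCAGTCTGTCCAGGGA"
      set s="TGGGTGGCGTATATTTACACCGATGGGAGCAATACATACTATTCCCAGTCTGTCCAGGGA"
      set n=0                                       // count of identical positions
      for i=1:1:$length(q) do
      . if $extract(q,i)=$extract(s,i) set n=n+1    // same nucleotide at position i
      write n,"/",$length(q)," identical positions",!
```

For this row the loop should count 52 matches in 60 positions, one for each vertical bar in the first row of the display. The three rows together account for the 133/152 identities that BLAST reports.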
Also, consider the search for similar images. Again, this involves searching for similarities, not identity. For example, a human observer would clearly see the two following pictures as dealing with the same subject, despite the differences:
The obvious question is: how can you write a computer program that sees the similarity?
The following are several example programs illustrating the use of Mumps to create indexing vocabularies and to process and index text files.
The following is a sample of the MeSH (Medical Subject Headings) tree hierarchy:
    Body Regions;A01
    Abdomen;A01.047
    Abdominal Cavity;A01.047.025
    Peritoneum;A01.047.025.600
    Douglas' Pouch;A01.047.025.600.225
    Mesentery;A01.047.025.600.451
    Mesocolon;A01.047.025.600.451.535
    Omentum;A01.047.025.600.573
    Peritoneal Cavity;A01.047.025.600.678
    Retroperitoneal Space;A01.047.025.750
    Abdominal Wall;A01.047.050
    Groin;A01.047.365
    Inguinal Canal;A01.047.412
    Umbilicus;A01.047.849
    Back;A01.176
    Lumbosacral Region;A01.176.519
    Sacrococcygeal Region;A01.176.780
    Breast;A01.236
    Nipples;A01.236.500
    Extremities;A01.378
    Amputation Stumps;A01.378.100
The format is: text description, semicolon, code hierarchy. Thus, "Body Regions" is code A01, "Abdomen" is A01.047, "Peritoneum" is A01.047.025.600 and so forth. The goal is to build a global array tree in which each successive index is a successive code in the MeSH hierarchy and the text of each entry is stored in the tree at the appropriate level. Thus, we want something like:
    set ^mesh("A01")="Body Regions"
    set ^mesh("A01","047")="Abdomen"
    set ^mesh("A01","047","025")="Abdominal Cavity"
    set ^mesh("A01","047","025","600")="Peritoneum"
    . . .
    set ^mesh("A01","047","365")="Groin"
    . . .
Graphically:
This can be done with a program such as:
    #!/usr/bin/mumps
    # mtree.mps January 13, 2008
    # Copyright 2007 K. C. O'Kane - GPL License applies

          open 1:"mtrees2003.txt,old"
          for  do
          . use 1
          . read a
          . if '$test break
          . set key=$piece(a,";",1)   // text description
          . set code=$piece(a,";",2)  // everything else
          . if key=""!(code="") break
          . for i=1:1 do
          .. set x(i)=$piece(code,".",i)  // extract code numbers
          .. if x(i)="" break
          . set i=i-1
          . use 5
          . set z="^mesh("            // begin building a global reference

    #-----------------------------------------------------------------------
    # build a reference like ^mesh("A01","047","025","600")
    # by concatenating quotes, codes, quotes, and commas onto z
    #-----------------------------------------------------------------------

          . for j=1:1:i-1 set z=z_""""_x(j)_""","
          . set z="set "_z_""""_x(i)_""")="""_key_""""

    #-----------------------------------------------------------------------
    # z now looks like set ^mesh("A01","047")="Abdomen"
    # now execute the text
    #-----------------------------------------------------------------------

          . write z,!
          . xecute z

          close 1
          use 5
          write "done",!
          halt
Notes:
. if key=""!(code="") break
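uses the OR operator (!). The parentheses are required because Mumps evaluates expressions strictly left to right, with no operator precedence. Without them, key=""!code="" would be evaluated as ((key="")!code)="", which is not the intended test.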
. for j=1:1:i-1 set z=z_""""_x(j)_""","
uses the concatenation operator (_) as well as a local array x(j). Local arrays should be used as little as possible since access to them through the Mumps run-time symbol table can be slow if there are a lot of elements in the symbol table.
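The effect of strict left-to-right evaluation, mentioned in the first note, can be seen in a short sketch. The file name and the sample values are assumptions made for illustration; they are not part of the original toolkit.

```mumps
#!/usr/bin/mumps
# leftright.mps - illustrative sketch, not part of the original toolkit
      write 2+3*4,!         // evaluated as (2+3)*4, so this writes 20
      write 2+(3*4),!       // parentheses force the product first: 14
      set key="Abdomen"
      set code=""
      if key=""!(code="") write "missing field",!   // true: code is empty
      if key=""!code="" write "not reached",!       // ((key="")!code)="" is false
```

The last line never fires: the OR yields 0 or 1, and neither value is equal to the empty string. This is why mtree.mps parenthesizes the second comparison.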
Run on the full file, mtree.mps writes each generated command before executing it:
    set ^mesh("A01")="Body Regions"
    set ^mesh("A01","047")="Abdomen"
    set ^mesh("A01","047","025")="Abdominal Cavity"
    set ^mesh("A01","047","025","600")="Peritoneum"
    set ^mesh("A01","047","025","600","225")="Douglas' Pouch"
    set ^mesh("A01","047","025","600","451")="Mesentery"
    set ^mesh("A01","047","025","600","451","535")="Mesocolon"
    set ^mesh("A01","047","025","600","573")="Omentum"
    set ^mesh("A01","047","025","600","678")="Peritoneal Cavity"
    set ^mesh("A01","047","025","750")="Retroperitoneal Space"
    set ^mesh("A01","047","050")="Abdominal Wall"
    set ^mesh("A01","047","365")="Groin"
    set ^mesh("A01","047","412")="Inguinal Canal"
    set ^mesh("A01","047","849")="Umbilicus"
    set ^mesh("A01","176")="Back"
    set ^mesh("A01","176","519")="Lumbosacral Region"
    set ^mesh("A01","176","780")="Sacrococcygeal Region"
    set ^mesh("A01","236")="Breast"
    set ^mesh("A01","236","500")="Nipples"
    set ^mesh("A01","378")="Extremities"
    set ^mesh("A01","378","100")="Amputation Stumps"
    set ^mesh("A01","378","610")="Lower Extremity"
    set ^mesh("A01","378","610","100")="Buttocks"
    set ^mesh("A01","378","610","250")="Foot"
    set ^mesh("A01","378","610","250","149")="Ankle"
    set ^mesh("A01","378","610","250","300")="Forefoot, Human"
    set ^mesh("A01","378","610","250","300","480")="Metatarsus"
    . . .
The tree can be printed to a depth of four levels with nested loops:
    #!/usr/bin/mumps
    # mtreeprint.mps January 13, 2008

          for lev1=$order(^mesh(lev1)) do
          . write lev1," ",^mesh(lev1),!
          . for lev2=$order(^mesh(lev1,lev2)) do
          .. write ?5,lev2," ",^mesh(lev1,lev2),!
          .. for lev3=$order(^mesh(lev1,lev2,lev3)) do
          ... write ?10,lev3," ",^mesh(lev1,lev2,lev3),!
          ... for lev4=$order(^mesh(lev1,lev2,lev3,lev4)) do
          .... write ?15,lev4," ",^mesh(lev1,lev2,lev3,lev4),!

yields:
    A01 Body Regions
         047 Abdomen
              025 Abdominal Cavity
                   600 Peritoneum
                   750 Retroperitoneal Space
              050 Abdominal Wall
              365 Groin
              412 Inguinal Canal
              849 Umbilicus
         176 Back
              519 Lumbosacral Region
              780 Sacrococcygeal Region
         236 Breast
              500 Nipples
         378 Extremities
              100 Amputation Stumps
              610 Lower Extremity
                   100 Buttocks
                   250 Foot
                   400 Hip
                   450 Knee
                   500 Leg
                   750 Thigh
              800 Upper Extremity
                   075 Arm
                   090 Axilla
                   420 Elbow
                   585 Forearm
                   667 Hand
                   750 Shoulder
         456 Head
              313 Ear
              505 Face
                   173 Cheek
                   259 Chin
                   420 Eye
                   580 Forehead
                   631 Mouth
                   733 Nose
                   750 Parotid Region
              810 Scalp
              830 Skull Base
                   150 Cranial Fossa, Anterior
                   165 Cranial Fossa, Middle
                   200 Cranial Fossa, Posterior
         598 Neck
         673 Pelvis
              600 Pelvic Floor
         719 Perineum
         911 Thorax
              800 Thoracic Cavity
                   500 Mediastinum
                   650 Pleural Cavity
              850 Thoracic Wall
         960 Viscera
    A02 Musculoskeletal System
         165 Cartilage
              165 Cartilage, Articular
              207 Ear Cartilages
              410 Intervertebral Disk
              507 Laryngeal Cartilages
                   083 Arytenoid Cartilage
                   211 Cricoid Cartilage
                   411 Epiglottis
                   870 Thyroid Cartilage
              590 Menisci, Tibial
              639 Nasal Septum
         340 Fascia
              424 Fascia Lata
         513 Ligaments
              170 Broad Ligament
              514 Ligaments, Articular
                   100 Anterior Cruciate Ligament
                   162 Collateral Ligaments
                   287 Ligamentum Flavum
                   350 Longitudinal Ligaments
                   475 Patellar Ligament
                   600 Posterior Cruciate Ligament
    . . .
Alternatively, using some of the newer Mumps functions, the table can be printed as:
    #!/usr/bin/mumps
    # mtreeprintnew.mps January 28, 2010

          set x="^mesh(0)"
          for  do
          . set x=$query(x)
          . if x="" break
          . set i=$qlength(x)
          . write ?i*2," ",$qsubscript(x,i)," ",@x,?50,x,!
which produces the output:
       A01 Body Regions                               ^mesh("A01")
         047 Abdomen                                  ^mesh("A01","047")
           025 Abdominal Cavity                       ^mesh("A01","047","025")
             600 Peritoneum                           ^mesh("A01","047","025","600")
               225 Douglas' Pouch                     ^mesh("A01","047","025","600","225")
               451 Mesentery                          ^mesh("A01","047","025","600","451")
                 535 Mesocolon                        ^mesh("A01","047","025","600","451","535")
               573 Omentum                            ^mesh("A01","047","025","600","573")
               678 Peritoneal Cavity                  ^mesh("A01","047","025","600","678")
             750 Retroperitoneal Space                ^mesh("A01","047","025","750")
           050 Abdominal Wall                         ^mesh("A01","047","050")
           365 Groin                                  ^mesh("A01","047","365")
           412 Inguinal Canal                         ^mesh("A01","047","412")
           849 Umbilicus                              ^mesh("A01","047","849")
         176 Back                                     ^mesh("A01","176")
           519 Lumbosacral Region                     ^mesh("A01","176","519")
           780 Sacrococcygeal Region                  ^mesh("A01","176","780")
         236 Breast                                   ^mesh("A01","236")
           500 Nipples                                ^mesh("A01","236","500")
         378 Extremities                              ^mesh("A01","378")
           100 Amputation Stumps                      ^mesh("A01","378","100")
           610 Lower Extremity                        ^mesh("A01","378","610")
             100 Buttocks                             ^mesh("A01","378","610","100")
             250 Foot                                 ^mesh("A01","378","610","250")
               149 Ankle                              ^mesh("A01","378","610","250","149")
               300 Forefoot, Human                    ^mesh("A01","378","610","250","300")
                 480 Metatarsus                       ^mesh("A01","378","610","250","300","480")
                 792 Toes                             ^mesh("A01","378","610","250","300","792")
                   380 Hallux                         ^mesh("A01","378","610","250","300","792","380")
               510 Heel                               ^mesh("A01","378","610","250","510")
             400 Hip                                  ^mesh("A01","378","610","400")
             450 Knee                                 ^mesh("A01","378","610","450")
             500 Leg                                  ^mesh("A01","378","610","500")
             750 Thigh                                ^mesh("A01","378","610","750")
           800 Upper Extremity                        ^mesh("A01","378","800")
             075 Arm                                  ^mesh("A01","378","800","075")
             090 Axilla                               ^mesh("A01","378","800","090")
             420 Elbow                                ^mesh("A01","378","800","420")
             585 Forearm                              ^mesh("A01","378","800","585")
             667 Hand                                 ^mesh("A01","378","800","667")
               430 Fingers                            ^mesh("A01","378","800","667","430")
                 705 Thumb                            ^mesh("A01","378","800","667","430","705")
               715 Wrist                              ^mesh("A01","378","800","667","715")
             750 Shoulder                             ^mesh("A01","378","800","750")
         456 Head                                     ^mesh("A01","456")
           313 Ear                                    ^mesh("A01","456","313")
           505 Face                                   ^mesh("A01","456","505")
             173 Cheek                                ^mesh("A01","456","505","173")
             259 Chin                                 ^mesh("A01","456","505","259")
             420 Eye                                  ^mesh("A01","456","505","420")
               338 Eyebrows                           ^mesh("A01","456","505","420","338")
               504 Eyelids                            ^mesh("A01","456","505","420","504")
                 421 Eyelashes                        ^mesh("A01","456","505","420","504","421")
             580 Forehead                             ^mesh("A01","456","505","580")
             631 Mouth                                ^mesh("A01","456","505","631")
               515 Lip                                ^mesh("A01","456","505","631","515")
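The two functions that drive the indentation in mtreeprintnew.mps can be tried on a single reference. This is a minimal sketch with a hand-built reference string; the file name qparts.mps is hypothetical, and in the program above the string comes from $query() rather than being typed in.

```mumps
#!/usr/bin/mumps
# qparts.mps - illustrative sketch, not part of the original toolkit
      set x="^mesh(""A01"",""047"",""025"")"   // same form that $query() returns
      write $qlength(x),!                      // number of subscripts in the reference
      write $qsubscript(x,1),!                 // first subscript
      write $qsubscript(x,$qlength(x)),!       // last subscript, as printed in the table
```

For this reference the three values should be 3, A01, and 025, matching the depth and the leading code shown for "Abdominal Cavity" in the table.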
In key order, the global array references are:
    ^mesh("A01")
    ^mesh("A01","047")
    ^mesh("A01","047","025")
    ^mesh("A01","047","025","600")
    ^mesh("A01","047","025","600","225")
    ^mesh("A01","047","025","600","451")
    ^mesh("A01","047","025","600","451","535")
    ^mesh("A01","047","025","600","573")
    ^mesh("A01","047","025","600","678")
    ^mesh("A01","047","025","750")
    ^mesh("A01","047","050")
    ^mesh("A01","047","365")
    ^mesh("A01","047","412")
    ^mesh("A01","047","849")
    ^mesh("A01","176")
with the text of the MeSH code stored at each node.
This is also the order in which they are stored in the file system. The Mumps function $query() can be used to dump the global array in sequential key order. You pass to $query() a string containing a global array reference (with embedded quotes around string indices), and it returns the next array reference in the file system. Eventually, you will run out of "^mesh" references and receive an empty string, so you must test for the empty string on each iteration.
    #!/usr/bin/mumps
    # meshheadings.mps January 28, 2010

          set x="^mesh"           // build the first index
          for  do
          . set x=$query(x)       // get next array reference
          . if x="" break
          . write x,?50,@x,!
The output looks like:
    ^mesh("A01")                                      Body Regions
    ^mesh("A01","047")                                Abdomen
    ^mesh("A01","047","025")                          Abdominal Cavity
    ^mesh("A01","047","025","600")                    Peritoneum
    ^mesh("A01","047","025","600","225")              Douglas' Pouch
    ^mesh("A01","047","025","600","451")              Mesentery
    ^mesh("A01","047","025","600","451","535")        Mesocolon
    ^mesh("A01","047","025","600","573")              Omentum
    ^mesh("A01","047","025","600","678")              Peritoneal Cavity
    ^mesh("A01","047","025","750")                    Retroperitoneal Space
    ^mesh("A01","047","050")                          Abdominal Wall
    ^mesh("A01","047","365")                          Groin
    ^mesh("A01","047","412")                          Inguinal Canal
    ^mesh("A01","047","849")                          Umbilicus
    ^mesh("A01","176")                                Back
    ^mesh("A01","176","519")                          Lumbosacral Region
    ^mesh("A01","176","780")                          Sacrococcygeal Region
    ^mesh("A01","236")                                Breast
    ^mesh("A01","236","500")                          Nipples
    ^mesh("A01","378")                                Extremities
    ^mesh("A01","378","100")                          Amputation Stumps
    ^mesh("A01","378","610")                          Lower Extremity
    ^mesh("A01","378","610","100")                    Buttocks
    ^mesh("A01","378","610","250")                    Foot
    ^mesh("A01","378","610","250","149")              Ankle
    ^mesh("A01","378","610","250","300")              Forefoot, Human
    ^mesh("A01","378","610","250","300","480")        Metatarsus
    ^mesh("A01","378","610","250","300","792")        Toes
    ^mesh("A01","378","610","250","300","792","380")  Hallux
    ^mesh("A01","378","610","250","510")              Heel
    ^mesh("A01","378","610","400")                    Hip
    ^mesh("A01","378","610","450")                    Knee
    ^mesh("A01","378","610","500")                    Leg
    ^mesh("A01","378","610","750")                    Thigh
    ^mesh("A01","378","800")                          Upper Extremity
    ^mesh("A01","378","800","075")                    Arm
    ^mesh("A01","378","800","090")                    Axilla
    ^mesh("A01","378","800","420")                    Elbow
    ^mesh("A01","378","800","585")                    Forearm
    ^mesh("A01","378","800","667")                    Hand
    ^mesh("A01","378","800","667","430")              Fingers
    ^mesh("A01","378","800","667","430","705")        Thumb
    ^mesh("A01","378","800","667","715")              Wrist
    ^mesh("A01","378","800","750")                    Shoulder
Notes:
. write x,?50,@x,!
displays the reference (x) and then prints the contents of the node x (@x).
The following program reads a keyword, scans the tree for nodes whose text contains it, and prints each matching node together with the nodes beneath it:
    #!/usr/bin/mumps
    # findmesh.mps January 28, 2010

          read "enter keyword: ",key
          write !
          set x="^mesh"           // build a global array ref
          set x=$query(x)
          if x="" halt
          for  do
          . if '$find(@x,key) set x=$query(x)   // is key stored at this ref?
          . else  do
          .. set i=$qlength(x)    // number of subscripts
          .. write x," ",@x,!
          .. for  do
          ... set x=$query(x)
          ... if x="" halt
          ... if $qlength(x)'>i break
          ... write ?5,x," ",@x,!
          . if x="" halt
Notes:
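. if '$find(@x,key) set x=$query(x) // is key stored at this ref?
This line uses indirection to get the stored text (@x evaluates to the contents of the global array reference held in x). The $find() function searches that text for the input key as a substring; it returns zero if the key is not found and otherwise the position of the character following the match. The not operator (') therefore makes the test true when the key is absent, and the program advances to the next reference.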
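The value returned by $find() can be examined on a literal string. The following is a minimal sketch; the file name findpos.mps and the sample text are chosen for illustration and are not part of the original toolkit.

```mumps
#!/usr/bin/mumps
# findpos.mps - illustrative sketch, not part of the original toolkit
      set text="Abdominal Cavity"
      write $find(text,"Cavity"),!   // 17: one past the end of the match
      write $find(text,"Wall"),!     // 0: no match
      write $find(text,"cavity"),!   // 0: the comparison is case sensitive
      if $find(text,"Cavity") write "found",!   // any nonzero result is true
```

Because the match is case sensitive, a keyword typed as "cavity" would not find "Abdominal Cavity" in findmesh.mps.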
HTML file query3.html:
Next, move the following file to ~/public_html/cgi-bin and make it world-readable and executable.