I have a file of one billion characters. (Actually, I have two.) The sequence consists of the characters A, C, T, and G, the four nucleotide bases that make up a strand of DNA. When you hear people talking about "sequencing the genome", they are talking about working with this kind of nucleotide sequence. One of the basic tasks of such work is finding the occurrences of one nucleotide sequence within another. For example, this can help us see that two sequences overlap and thus can be combined to create a longer sequence.
Here is your challenge:
How long will it take a Python program to find all of the occurrences of atcatcg in my one-gigabyte file?
If you just have to know the answer now, here it is.
Here is a quick summary of the process, which we saw in detail last week:
Practice this process until it flows naturally. It is basic knowledge for reading and writing code.
We have used the find() method for strings:
>>> 'abcdefg'.find('d')
3
>>> 'abcdefg'.find('de')
3
>>> 'abcdefg'.find('bcd')
1
>>> 'abcdefg'.find('dc')
-1
find() allows more than one argument:
>>> 'eugene is euphonius and euphemistic and euphoric'.find('eu')
0
>>> 'eugene is euphonius and euphemistic and euphoric'.find('eu', 1)
10
The second argument tells find() in which position to begin its search.
We can even give a third argument:
>>> 'eugene is euphonius and euphemistic and euphoric'.find('eu', 11, 21)
-1
>>> 'eugene is euphonius and euphemistic and euphoric'.find('eu', 11, 32)
24
The third argument tells find() in which position to end its search. As with other range-related operations in Python, the start value is included in the range, and the end value is not. If we use the string's length as the third argument, then find() will keep looking to the end.
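The range rule can be checked with a small sketch on the same sentence. The variable name s is used only for illustration. Note that the whole match has to fit inside the range, not just its first character.

```python
s = 'eugene is euphonius and euphemistic and euphoric'

# The start value is included: a match beginning at position 10 is found
# when the search starts at 10, but skipped when it starts at 11.
print(s.find('eu', 10))          # 10
print(s.find('eu', 11))          # 24

# The end value is excluded. The match at 24 occupies positions 24 and 25,
# so the end value must be at least 26 for find() to report it.
print(s.find('eu', 11, 25))      # -1
print(s.find('eu', 11, 26))      # 24

# Using the string's length as the end value searches to the end.
print(s.find('eu', 11, len(s)))  # 24
```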
find() always finds the first match. We can use find() to step through the string by changing the start position:
>>> 'eugene is euphonius and euphemistic and euphoric'.find('eu', 0, 48)
0
>>> 'eugene is euphonius and euphemistic and euphoric'.find('eu', 1, 48)
10
>>> 'eugene is euphonius and euphemistic and euphoric'.find('eu', 11, 48)
24
>>> 'eugene is euphonius and euphemistic and euphoric'.find('eu', 25, 48)
40
>>> 'eugene is euphonius and euphemistic and euphoric'.find('eu', 41, 48)
-1
This idea can help us build another useful string utility...
In bioinformatics, we often need to find the locations of all occurrences of a substring, not just the first. So:
Write a function multi_find(source, target, start, end) that finds the positions of all occurrences of target in source that occur in the range start ≤ position < end.
Instead of one integer index, return a string containing zero or more indices, separated by commas. For example:
File to search: find_in_string.txt
String to find: in
Positions: 0,25,63,73,81,90,152,156,159,179,184,220,239
Hint: use find() to step through the string, updating the next starting position after each successful find.
... loop through string. ... start at 0. ... add results to a string. -- running total pattern ... next pass: success + 1. -- need a while loop ... can stop if find ever fails.
Look at a solution. The code is getting long, but we can see the running total pattern in its shape. That can help us navigate the code. The return is interesting; if we ever found the target, then we need to strip a trailing comma.
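For reference, here is a minimal sketch of one way the function could be written, following the hints above. It is not necessarily the same as the solution from class; the local names positions, next_start, and found are my own choices for illustration.

```python
def multi_find(source, target, start, end):
    """Return the positions of target in source, for start <= position < end,
    as a string of indices separated by commas."""
    positions = ''                      # running total, built up as a string
    next_start = start
    while True:
        found = source.find(target, next_start, end)
        if found == -1:                 # find() failed, so we can stop
            break
        positions += str(found) + ','
        next_start = found + 1          # next pass: success + 1
    return positions.rstrip(',')        # strip the trailing comma, if any


s = 'eugene is euphonius and euphemistic and euphoric'
print(multi_find(s, 'eu', 0, len(s)))   # 0,10,24,40
print(multi_find(s, 'eu', 11, 32))      # 24
print(multi_find(s, 'xyz', 0, len(s)))  # prints an empty line
```

Because the next search starts at success + 1, overlapping matches are reported too: multi_find('aaaa', 'aa', 0, 4) gives 0,1,2.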
... a useful utility. Add to str_utils module from the lab.
... our own modules. ... Python looks for imported modules in current directory, or in its path. Paths are a common idea in programming tools and operating systems...
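To see the path Python actually uses, a short sketch like this one prints it. The list lives in the standard sys module; its first entry is usually the directory of the program being run, which is why a module file sitting next to your program can be imported.

```python
import sys

# sys.path is the list of directories Python searches, in order,
# when it sees an import statement.
for directory in sys.path:
    print(directory)
```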
Look at second version of demo program, with multi_find() imported from the str_utils module.
Feel free to make modules for yourself!
First note. This was a challenging lab. You handled it differently than the first challenging labs, seven or eight weeks ago. Bravo. You have come a long way.
... walk through the tasks and solutions ...
... timing a program using the time module. See version three of my demo program, using the imported time module and multi_find() imported from the str_utils module.
... common idiom for simple timing of operations.
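A minimal sketch of that idiom, assuming it is the usual pattern of calling time.time() before and after the operation and subtracting. The string text is a made-up stand-in for the contents of a file, chosen so that the target never occurs and find() has to examine the whole string.

```python
import time

text = 'acgt' * 250_000            # hypothetical stand-in for file contents

start_time = time.time()           # seconds since the epoch, as a float
position = text.find('tttt')       # the operation being timed
elapsed = time.time() - start_time

print('Position:', position)
print('Time:', elapsed)
```

The time printed will vary from run to run and from machine to machine.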
This takes a while to run... Jump ahead to see the result from my office machine.
Here is how Python's built-in find() did finding the first occurrence:
File to search: string-gig01.txt
String to find: atcatcg
Position: 47019
Time: 1.5061960220336914

File to search: string-gig02.txt
String to find: atcatcg
Position: 1164
Time: 1.4972848892211914

File to search: string-gig01.txt
String to find: tttttttttttttttt
Position: -1
Time: 2.0287790298461914
In the worst case of having to examine the entire string, it takes only ~ 2 seconds.
And here is how my version of multi_find() did finding all occurrences in a 10-megabyte file:
File to search: string-10mb.txt
String to find: actactg
Positions : 0,7070,7188,26727,35126,68850,69664,92603,140770, ...
Total time: 0.038643836975097656

File to search: string-10mb.txt
String to find: tttttttttttttttt
Positions :
Total time: 0.020676136016845703
Wow. The results are in the range of 2-4 hundredths of a second. 10 MB is one percent of 1 GB, so we might expect that the results on a gigabyte file would be in the 2-4 second range.
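The estimate is simple scaling arithmetic, sketched here. It assumes the files hold exactly ten million and one billion characters, and that running time grows in direct proportion to file size; both are assumptions for illustration, not measurements.

```python
small_size = 10_000_000             # characters in the 10-megabyte file (assumed)
large_size = 1_000_000_000          # characters in the one-gigabyte file (assumed)
scale = large_size // small_size    # 100

# Measured times, in seconds, from the two 10-megabyte runs.
measured_times = [0.038643836975097656, 0.020676136016845703]

# Estimated times on the big file, if time scales with size.
estimates = [round(t * scale, 2) for t in measured_times]
print(estimates)                    # [3.86, 2.07]
```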
A gigabyte is a lot of data, and Python has to manage all that space in addition to doing the computation. My Python set-up is not configured to work with such big files!