## Session 20

### Opening Exercise: Timing Challenge

I have a file of one billion characters. (Actually, I have two.) The sequence consists of the characters A, C, T, and G, the four nucleotide bases that make up a strand of DNA. When you hear people talking about "sequencing the genome", they are talking about working with this kind of nucleotide sequence. One of the basic tasks of such work is finding the occurrence of one nucleotide sequence within another. For example, this can help us see that two sequences overlap and thus can be combined to create a longer sequence.

Here is your challenge:

How long will it take a Python program to find all of the occurrences of atcatcg in my one-gigabyte file?

If you just have to know the answer now, jump ahead to "The Answer, For the Impatient" below.

### What Happens When a Function Is Called

Here is a quick summary of the process, which we saw in detail last week:

1. Python evaluates the arguments.
2. The values are given to the function...
3. ... and assigned to the function's parameters.
4. The function uses its parameters to perform its operation.
5. Python inserts the value of the function in place of the function call.

Practice this process until it flows naturally. It is basic knowledge for reading and writing code.
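
To practice, we can trace the five steps on a tiny example. The function double() below is made up for illustration; it is not from last week's code.

```python
def double(n):
    # Step 4: the function uses its parameter to do its work.
    return 2 * n

# Step 1: Python evaluates the argument 3 + 4 to get 7.
# Steps 2-3: the value 7 is passed in and assigned to the parameter n.
# Step 5: the call double(3 + 4) is replaced by its value, 14,
#         so the expression becomes 14 + 1.
result = double(3 + 4) + 1
print(result)   # 15
```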

### Interlude: The find() for Strings

We have used the find() method for strings:

```
>>> 'abcdefg'.find('d')
3
>>> 'abcdefg'.find('de')
3
>>> 'abcdefg'.find('bcd')
1
>>> 'abcdefg'.find('dc')
-1
```

find() allows more than one argument:

```
>>> 'eugene is euphonius and euphemistic and euphoric'.find('eu')
0
>>> 'eugene is euphonius and euphemistic and euphoric'.find('eu', 1)
10
```

The second argument tells find() in which position to begin its search.

We can even give a third argument:

```
>>> 'eugene is euphonius and euphemistic and euphoric'.find('eu', 11, 21)
-1
>>> 'eugene is euphonius and euphemistic and euphoric'.find('eu', 11, 32)
24
```

The third argument tells find() in which position to end its search. As with other range-related operations in Python, the start value is included in the range, and the end value is not. If we use the string's length as the third argument, then find() will keep looking to the end.
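
A small sketch can make the boundary rule concrete. The variable name text is only for illustration. The match at position 24 occupies positions 24 and 25, so it is found only when the end value is at least 26.

```python
text = 'eugene is euphonius and euphemistic and euphoric'

# The end position is excluded, so the whole match must fit before it.
print(text.find('eu', 11, 25))         # -1: position 25 is outside the range
print(text.find('eu', 11, 26))         # 24: positions 24 and 25 are both inside

# Using the string's length as the end searches all the way to the end.
print(text.find('eu', 11, len(text)))  # 24
```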

find() always finds the first match. We can use find() to step through the string by changing the start position:

```
>>> 'eugene is euphonius and euphemistic and euphoric'.find('eu', 0, 48)
0
>>> 'eugene is euphonius and euphemistic and euphoric'.find('eu', 1, 48)
10
>>> 'eugene is euphonius and euphemistic and euphoric'.find('eu', 11, 48)
24
>>> 'eugene is euphonius and euphemistic and euphoric'.find('eu', 25, 48)
40
>>> 'eugene is euphonius and euphemistic and euphoric'.find('eu', 41, 48)
-1
```

This idea can help us build another useful string utility...

### Exercise: multi_find()

In bioinformatics, we often need to find the locations of all occurrences of a substring, not just the first. So:

Write a function multi_find(source, target, start, end) that finds the positions of all occurrences of target in source that occur in the range start ≤ position < end.

Instead of one integer index, return a string containing zero or more indices, separated by commas. For example:

```
File to search: find_in_string.txt
String to find: in
Positions: 0,25,63,73,81,90,152,156,159,179,184,220,239
```

Hint: use find() to step through the string, updating the next starting position after each successful find.

### Designing multi_find()

Thoughts...

```
... loop through string.
... start at 0.
... add results to a string.     -- running total pattern
... next pass: success + 1.      -- need a while loop
... can stop if find ever fails.
```

Look at a solution. The code is getting long, but we can see the running total pattern in its shape. That can help us navigate the code. The return is interesting; if we ever found the target, then we need to strip a trailing comma.
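
For reference, here is one possible sketch that follows the design thoughts above. It is not necessarily the same as the solution from class, and the local names positions, next_start, and found are my own choices.

```python
def multi_find(source, target, start, end):
    """Return a comma-separated string of the positions of target
    in source, for start <= position < end."""
    positions = ''          # running total, built up as a string
    next_start = start
    while True:
        found = source.find(target, next_start, end)
        if found == -1:     # find() failed, so there are no more matches
            break
        positions += str(found) + ','
        next_start = found + 1   # next pass: success + 1
    # If we ever found the target, there is a trailing comma to strip.
    return positions.rstrip(',')

text = 'eugene is euphonius and euphemistic and euphoric'
print(multi_find(text, 'eu', 0, len(text)))   # 0,10,24,40
```

Stepping forward by one position, rather than by the length of the target, means that overlapping matches are reported, too.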

... a useful utility. Add to str_utils module from the lab.

... our own modules. ... Python looks for imported modules in the current directory, or in its path. Paths are a common idea in programming tools and operating systems...
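
To see where Python will look, we can print its search path. This is a minimal sketch; the import in the comment assumes a str_utils.py file in the current directory, which this snippet does not create.

```python
import sys

# sys.path is a list of directories that Python searches, in order.
# The first entry is usually the directory of the running script
# (or an empty string in the interactive interpreter).
for directory in sys.path:
    print(directory)

# With str_utils.py in one of those directories, we could write:
#     from str_utils import multi_find
```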

Look at second version of demo program, with multi_find() imported from the str_utils module.

Feel free to make modules for yourself!

### Lessons from the Lab

First note. This was a challenging lab. You handled it differently than the first challenging labs, seven or eight weeks ago. Bravo. You have come a long way.

... walk through the tasks and solutions ...

### An Answer to the Opening Challenge

... timing a program using the time module. See version three of my multi_find() demo program, using the imported time module and multi_find() imported from the str_utils module.

... common idiom for simple timing of operations.
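
A minimal sketch of the idiom, assuming time.time() as the clock (the demo program may do it differently). The string text is a small made-up stand-in for the real nucleotide files.

```python
import time

# Stand-in data: one million characters with no match, then the target.
text = 'acgt' * 250000 + 'atcatcg'

start = time.time()                  # read the clock before the operation
position = text.find('atcatcg')
elapsed = time.time() - start        # read it again and subtract

print('Position:', position)
print('Time:', elapsed)
```

The elapsed time will vary from run to run and from machine to machine.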

This takes a while to run... Jump ahead to see the result from my office machine.

### The Answer, For the Impatient

Here is how Python's built-in find() did at finding the first occurrence:

```
File to search: string-gig01.txt
String to find: atcatcg
Position: 47019
Time: 1.5061960220336914

File to search: string-gig02.txt
String to find: atcatcg
Position: 1164
Time: 1.4972848892211914

File to search: string-gig01.txt
String to find: tttttttttttttttt
Position: -1
Time: 2.0287790298461914
```

In the worst case, when find() has to examine the entire string, it takes only about 2 seconds.

And here is how my version of multi_find() did at finding all occurrences in a 10-megabyte file:

```
File to search: string-10mb.txt
String to find: actactg
Positions : 0,7070,7188,26727,35126,68850,69664,92603,140770, ...
Total time: 0.038643836975097656

File to search: string-10mb.txt
String to find: tttttttttttttttt
Positions :
Total time: 0.020676136016845703
```

Wow. The results are in the range of 2-4 hundredths of a second. 10 MB is one percent of 1 GB, so, if search time grows roughly in proportion to file size, we might expect the results on a gigabyte file to be in the 2-4 second range.
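
The estimate can be checked with a little arithmetic. This sketch assumes that search time grows linearly with file size, which is only an assumption; the variable names are mine.

```python
# Measured times, in seconds, for the two 10-megabyte searches above.
time_with_matches = 0.038643836975097656
time_no_matches = 0.020676136016845703

# A gigabyte is 100 times the size of a 10-megabyte file.
scale = 1000000000 / 10000000

print(scale)                       # 100.0
print(time_no_matches * scale)     # a bit more than 2 seconds
print(time_with_matches * scale)   # a bit less than 4 seconds
```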

A gigabyte is a lot of data, and Python has to manage all that space in addition to doing the computation. My Python set-up is not configured to work with such big files!

### Wrap Up

• Code -- today's code file -- bigger than usual, due to the nucleotide files

• Reading -- Read Sections 7.1-7.3, pages 283-294.

• Homework -- Homework 8 is available and due Monday.

Eugene Wallingford ..... wallingf@cs.uni.edu ..... October 30, 2014