Programming Assignment 07
An Introduction to Bioinformatics.
Objective:
String manipulation.
Due:
Wednesday, April 15th, 10:00 AM
Some of you may be aware that we have a Bioinformatics major in this
department. This major is for people who are good with math, computers,
and science (in particular, chemistry and biology). This assignment is designed
to give you a VERY simple introduction to some of the uses of computers in the
field of bioinformatics.
DISCLAIMER : I am a CS person, not a biologist. I suspect that in trying
to explain a little of the science behind this I have made oversimplifications
or possibly even out-and-out falsehoods. Forgive me if this is the
case.
Overall Background
Assumptions
- To get started, you may assume that
the String provided is a valid String consisting only of the 4
nucleotides A, C, G, and T. For example, "CGTAGGCAT" is a legal
DNA sequence because it contains nothing but A, C, G, and T. "CGTAFLCAT" is not a legal
DNA sequence because F and L are not part of the base set of nucleotides.
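The assumption above can be checked mechanically, which is handy when you build test data. This is a minimal sketch, not part of the required assignment: the names isValidDNA and VALID_NUCLEOTIDES are hypothetical helpers invented for this illustration.

```python
# Hypothetical helper, not one of the required methods.
VALID_NUCLEOTIDES = "ACGT"  # the four bases named in the assignment


def isValidDNA(sequence):
    """Return True if every character of sequence is A, C, G, or T."""
    for nucleotide in sequence:
        if nucleotide not in VALID_NUCLEOTIDES:
            return False
    return True


print(isValidDNA("CGTAGGCAT"))  # the legal example from the text
print(isValidDNA("CGTAFLCAT"))  # contains F and L, so not legal
```

Note that this sketch treats an empty String as valid, because it contains no illegal characters. Whether that is the right choice is a judgment call.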
In order to complete this assignment you should create a file called pa07.py
which contains the following methods (you should complete these in the
order listed).
- Write the numberOfTimes(sequence,nucleotide) method.
- BACKGROUND :
- There are times when a scientist may want to know how many times a
particular nucleotide occurs in a particular DNA sequence. This method
should provide access to this information.
- ACTION :
- Add to pa07.py the method called numberOfTimes().
- This method takes
one String (sequence) and one character (nucleotide) as parameters.
- This method searches the entire DNA sequence
one nucleotide (character) at a time testing to see if the current
nucleotide is equal to the parameter. It maintains a running count of
how many are found.
- This method returns this final count to the client
code (the code calling the numberOfTimes() method).
- LIMITATIONS :
- You may recall a count() method from Lab12. You MAY NOT use
count() to solve this problem. You need to write the looping and
comparison code yourself. In essence, you are writing the count() method
that already exists.
- TESTING :
- Now that we have spent some time talking about testing, I would like you
to really think about how to test your code as you write it. For
example, while I said you may assume that sequence is a valid sequence, I
did NOT say you should assume that nucleotide is a valid nucleotide.
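The loop-and-count pattern described in the ACTION steps can be sketched as follows. This is one possible shape under the stated limitations (no count()), not the only acceptable solution. The variable names total and current are my own choices.

```python
def numberOfTimes(sequence, nucleotide):
    """Count how often nucleotide appears in sequence without using count()."""
    total = 0  # running count of matches found so far
    for current in sequence:  # visit one nucleotide (character) at a time
        if current == nucleotide:
            total += 1
    return total


print(numberOfTimes("CGTAGGCAT", "G"))  # 3
print(numberOfTimes("CGTAGGCAT", "F"))  # 0, because F never appears
```

The second call shows the testing point above: a nucleotide that is not valid simply produces a count of 0 in this sketch.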
- Write the generateSecondStrand(sequence) method.
- BACKGROUND :
- DNA is often represented as a single sequence of nucleotides. In
reality, DNA is a double stranded entity. However, the second strand
of the DNA is a simple translation of the first strand - Ts pair with As and
Cs pair with Gs.
- As an example, below is a simple DNA sequence and its "Second Strand"
- DNA Sequence : CACATG
- Second Strand : GTGTAC
- ACTION :
- Add a method called generateSecondStrand(sequence).
- This method takes one String (sequence) as a parameter and returns the String which would be
the second strand of that DNA sequence.
- In order to do this, you will need to consider the DNA sequence one
nucleotide (character) at a time and generate an output String consisting of
the original strand's partner at each corresponding position.
- TESTING :
- Test with a variety of valid sequences.
- Do you agree with the results?
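The pairing rule from the BACKGROUND (Ts pair with As, Cs pair with Gs) can be sketched with a lookup table. Using a dictionary named partners is my own assumption. A chain of if/elif tests would work just as well.

```python
def generateSecondStrand(sequence):
    """Build the second strand by pairing each nucleotide with its partner."""
    partners = {"A": "T", "T": "A", "C": "G", "G": "C"}
    secondStrand = ""
    for nucleotide in sequence:  # one nucleotide (character) at a time
        secondStrand += partners[nucleotide]
    return secondStrand


print(generateSecondStrand("CACATG"))  # GTGTAC, matching the example above
```

Because the sequence is assumed to be valid, this sketch does no error handling. An invalid character would raise a KeyError.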
- Write the generatemRNASequence(sequence) method.
- BACKGROUND :
- While this isn't completely accurate, you can consider that the mRNA of a DNA sequence is the second strand of the DNA sequence with
each occurrence of the nucleotide T replaced with the nucleotide U.
Using the previous example, this DNA's mRNA sequence would be:
- DNA Sequence : CACATG
- Second Strand : GTGTAC
- mRNA Sequence : GUGUAC
- ACTION :
- Complete the method generatemRNASequence(sequence).
- This method takes one String (sequence) as a parameter and returns the String which would be
the mRNA sequence of that DNA sequence.
- In order to do this, you will need to first convert the DNA to its
second strand (take advantage of existing code). Then consider this
second strand one nucleotide (character) at a time and generate an output
String where you replace each occurrence of T with U and leave all other
nucleotides alone.
- LIMITATIONS :
- Note that I ask you to make this conversion one nucleotide at
a time. Thus, you should be writing the loop yourself and not relying
on something like replace().
- TESTING :
- Test with a variety of valid sequences.
- Do you agree with the results?
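The two-stage conversion described above (second strand first, then T to U one character at a time) can be sketched like this. The block repeats a compact generateSecondStrand so it runs on its own. In your file you would reuse the one you already wrote.

```python
def generateSecondStrand(sequence):
    """Compact stand-in for the Step 2 method, repeated so this runs alone."""
    partners = {"A": "T", "T": "A", "C": "G", "G": "C"}
    secondStrand = ""
    for nucleotide in sequence:
        secondStrand += partners[nucleotide]
    return secondStrand


def generatemRNASequence(sequence):
    """Convert DNA to its second strand, then replace each T with U."""
    secondStrand = generateSecondStrand(sequence)
    mrna = ""
    for nucleotide in secondStrand:  # explicit loop instead of replace()
        if nucleotide == "T":
            mrna += "U"
        else:
            mrna += nucleotide  # A, C, and G are left alone
    return mrna


print(generatemRNASequence("CACATG"))  # GUGUAC, matching the example above
```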
- Write the generateAnAminoAcidSequence(sequence) method.
- BACKGROUND :
- mRNA can be translated to a sequence of Amino Acids by
"breaking" the mRNA sequence into groups of three nucleotides.
- Each
nucleotide triple (use your discrete mathematics skills to consider why
there are 64 of these) translates to one of 20 Amino Acids using the
following conversion table (where column one is nucleotide one, columns 2-5
contain nucleotide two, and column 6 contains nucleotide three).
- Reading this table we see that nucleotide triple UCG is Ser (Serine) while
nucleotide triple CUG is Leu (Leucine).

Thus, "Val His Stop" would be translated to "VH*"
- ACTION :
- Add the method named generateAnAminoAcidSequence(sequence).
- This method takes
a single String (sequence) as a parameter.
- This method converts the DNA sequence passed in into mRNA
(again, use the code you wrote in Step 3).
- It then takes this
mRNA sequence and divides it into substrings of length 3.
- It
then translates each nucleotide triple into its corresponding Amino Acid.
- Your job is to
figure out how to divide a long mRNA String into nucleotide triples
(substrings of length 3),
use aminoAcidTranslation to translate these to Amino Acids and return the String
which is the complete sequence of Amino Acids for the original DNA.
- You should not assume that your original String is divisible by 3.
Thus, you will need to be careful at the end of the sequence. You may ignore
any "extra" characters at the end of the DNA sequence. For example,
the DNA sequence "CACGT" would translate to mRNA of "GUGCA" and this would
translate to the single Amino Acid of "V".
- TESTING :
- There are several places this method can fail. Test it carefully.
- Do you agree with the results?
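The step of dividing the mRNA into triples, and ignoring leftover characters, can be sketched as below. Two things here are assumptions of mine: the helper name translateTriples, and the dictionary SAMPLE_TABLE, which stands in for aminoAcidTranslation and holds only the few triples that appear in this handout's examples. The sketch starts from an mRNA String so that it stays self-contained.

```python
# Hypothetical stand-in for aminoAcidTranslation: only the triples used in
# this handout's examples, NOT the full 64-entry table.
SAMPLE_TABLE = {"GUG": "V", "CAC": "H", "UAG": "*", "UCG": "S", "CUG": "L"}


def translateTriples(mrna, table):
    """Translate complete triples of mrna, ignoring any extra characters."""
    aminoAcids = ""
    lastStart = len(mrna) - 3  # last index where a full triple can begin
    for start in range(0, lastStart + 1, 3):
        triple = mrna[start:start + 3]
        aminoAcids += table[triple]
    return aminoAcids


print(translateTriples("GUGCACUAG", SAMPLE_TABLE))  # VH*
print(translateTriples("GUGCA", SAMPLE_TABLE))      # V, the trailing CA is ignored
```

Stopping the loop at lastStart is what keeps the slice from ever producing a substring shorter than 3, including for Strings of length 0, 1, or 2.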
- CHALLENGE 1: Write the generateAllAminoAcidSequences() method.
- BACKGROUND :
- When we look at DNA and mRNA all that we REALLY know is that the
sequence is part of a larger sequence. Thus, when we look at the DNA
sequence "CACGTGATC" which translates to the mRNA sequence "GUGCACUAG" all
we really know is that this is probably part of a much larger sequence.
In Step 4, we made the assumption that this mRNA sequence starts with
the letter G, and thus the first triple to convert is GUG which converts to
V.
- However, if this is just some random subsequence, it may be that the G
is the last character of the previous triple. Thus, the FIRST triple
we can process is UGC which would translate to Cys or C.
- OR, it might be that both the G and the U are part of the previous
triple, and the first triple we can process is GCA which would translate to
Ala or A.
- Therefore, when scientists convert the mRNA sequence to Amino Acid
sequences, they normally convert to all the possible sequences. Thus,
- DNA Sequence : CACGTGATCC
- mRNA Sequence : GUGCACUAGG
- Amino Acid Sequences :
- VH* (starting with the first letter, considering the first three triples,
and ignoring the last letter)
- CTR (ignoring the first letter and considering the next three triples)
- AL (ignoring the first two letters, considering the next two triples,
and ignoring the last two letters)
- ACTION :
- Add the method named generateAllAminoAcidSequences().
- This method takes
a single sequence as a parameter.
- This method converts the DNA sequence passed in into mRNA
(again, use the code you wrote in Step 3).
- It then takes this
mRNA sequence and considers the three possible ways of dividing it into
triples (start with the first character, start with the second character,
start with the third character)
- For each of these ways, it
translates each nucleotide triple into its corresponding Amino Acid.
- Since you can't return three Strings, you should simply print each of
these Amino Acid sequences.
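The three ways of dividing the mRNA can be sketched by running the same triple-by-triple loop from three different starting offsets. As before, the helper names translateFrom and printAllFrames and the partial dictionary SAMPLE_TABLE are hypothetical stand-ins of mine, and the sketch starts from the mRNA String rather than the DNA.

```python
# Hypothetical stand-in for aminoAcidTranslation: only the triples needed
# for the GUGCACUAGG example, NOT the full 64-entry table.
SAMPLE_TABLE = {
    "GUG": "V", "CAC": "H", "UAG": "*",
    "UGC": "C", "ACU": "T", "AGG": "R",
    "GCA": "A", "CUA": "L",
}


def translateFrom(mrna, offset, table):
    """Translate full triples of mrna, starting at index offset."""
    aminoAcids = ""
    # len(mrna) - 2 is one past the last index where a full triple fits.
    for start in range(offset, len(mrna) - 2, 3):
        aminoAcids += table[mrna[start:start + 3]]
    return aminoAcids


def printAllFrames(mrna, table):
    """Print the translation for each of the three starting positions."""
    for offset in range(3):
        print(translateFrom(mrna, offset, table))


printAllFrames("GUGCACUAGG", SAMPLE_TABLE)  # prints VH*, CTR, AL on three lines
```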