INF280 Autumn 2005: Compulsory Assignment 1
The problems in this compulsory assignment
are to be solved on a computer.
Assistance for solving the problems will be available
in the assignment class on Friday, September 23.
Your solution (including source code files)
should be sent by email to Harald Barsnes
no later than October 3.
A satisfactory solution is required in order to take the exam.
Problem 1: Practical database search with BLAST
In this exercise you will try out practical database search
with BLAST via web servers. There are many such servers.
One is located in the European Bioinformatics Institute EBI
and is found at
There are several versions of BLAST for different types of sequences:
blastn searches a given DNA sequence in a DNA database,
blastp searches a with given protein sequence in a protein database,
and blastx searches a given DNA sequence in a protein database.
All programs have a number of parameters that can be set.
In the web interface you can select the program, database, and the
parameters from menus.
The following DNA sequence is a fragment from a virus and contains a
part of a gene.
Use blastx to search for the protein for which this gene codes
(cut and paste!) in the database Swiss-Prot.
Swiss-Prot is a manually curated database containing experimentally
verified protein sequences only. BLAST displays the matches first as
a short list, and you have to click on "Show Alignments" in order to
see the local alignments found for each match.
Look at the alignment for the best match. From which virus does this
sequence originate? Where does the protein coding part start in the DNA
sequence? Where does it end?
catgacatca gcttatgagt cataattaat cgtgcgttac aagtagaatt ctactcgtaa
agcgagttga aggatcatat ttagttgcgt ttatgagata agattgaaag cacgtgtaaa
atgtttcccg cgcgttggca caactattta caatgcggcc aagttataaa agattctaat
ctgatatgtt ttaaaacacc tttgcggccc gagttgtttg cgtacgtgac tagcgaagaa
Note that the query sequence is translated to an amino acid sequence.
In the help pages it is stated that blastx searches for no less than six
different amino acid sequences. Explain this.
A link from the BLAST results leads to the corresponding entry in
Swiss-Prot. At the bottom you will find the amino acid sequence
If this were an unknown protein, we could use database searching to
look for similar (homologous) proteins with known function.
Use blastp to search for homologous sequences in Swiss-Prot.
Use the score matrix Blosum62, and the preselected gap penalty.
MFPARWHNYL QCGQVIKDSN LICFKTPLRP ELFAYVTSEE DVWTAEQIVK QNPSIGAIID
LTNTSKYYDG VHFLRAGLLY KKIQVPGQTL PPESIVQEFI DTVKEFTEKC PGMLVGVHCT
HGINRTGYMV CRYLMHTLGI APQEAIDRFE KARGHKIERQ NYVQDLLI
The first match is (not surprisingly!) the sequence itself.
Many of the following matches are from viruses. The 13th is from mouse.
Explain how the matches are sorted. What is score and E-value?
Is the 13th match significant?
Examine the alignment. How many gaps are there? It is also possible
to run an "old-fashioned" BLAST without gaps (selected from a menu).
Can we still find this protein?
Repeat the search in b with Fasta
and compare the result. Remember to use the same score matrix!
For more information about Fasta, click on Fasta Help in the menu
to the right on the above web page.
What is ktup, and how does it affect the speed and sensitivity in the
search? Is there a similar parameter in BLAST?
If you want to, you may also experiment with other score matrices
and gap costs.
Problem 2: Alignment by dynamic programming
In this problem we will implement and test sequence alignment by
Consider a global alignment scoring scheme with score
m for a match, score s for a mismatch (substitution),
and a linear gap penalty of g for each space.
Write down the recurrence equation for computing the dynamic
programming table Hi,j used for finding
the optimal alignment score. Include the base cases
(row and column 0). Where is the optimal score found?
Write a computer program (use your favourite programming language)
that computes the dynamic programming table H corresponding
to the above recurrence.
The input to the program is the two sequences, and the parameters
m, s, and g.
As output, the program prints the whole table H,
where the rows and columns are labeled with the two sequences
(see Figure 1.3 in Eidhammer et al. for an example).
Use your program to compute the alignment score for the sequences
REINSDYR and REINDEER, with score
m=1 for a match, s=-1 for a mismatch, and
gap penalty g=1 for each space.
Define the edit distance between two sequences. How can the parameters
be chosen so that your program computes the edit distance?
Use the program to compute the edit distance between the sequences
REINSDYR and REINDEER.
By manual inspection, state the edit operations that occurred.
A subsequence of a sequence S is an order-preserving,
but not necessarily contiguous
sequence of symbols from S. For example, RIS
is a subsequence of REINSDYR.
The longest common subsequence problem for a set of sequences
is to find the longest sequence that is a subsequence of all members
of the set. Find a new choice of parameters that can be used to
compute the longest common subsequence of two sequences by dynamic
programming, and use your program to find the length of the longest
common subsequence of REINSDYR and REINDEER.
Which subsequence is it?
A procedure called backtracking is used to find the optimal alignment
from the matrix H. Two different strategies can be used.
The first involves storing extra information ("arrows")
along with the values Hi,j, and the second involves
recomputing some of the values. Explain briefly each method with advantages and
Choose one method and try to implement backtracking and output of the optimal
alignment. When several alignments are optimal, it is sufficient to
output one. Use your program to find the optimal alignments
from b, c, and d. (If you were not able to implement this, find the
alignments manually, marking the path through the matrix
H generated by the program.)
Find choices of parameters that produce optimal alignments different
from the previous ones.
Which changes to the algorithm are needed to compute an optimal
local alignment instead of a global alignment?
Implement this and test it on the sequences
REINSDYRSTEK and ERTEDYRKING.
Compare with the global alignment.
Use the parameters from b. Why would the edit distance no longer be
a good parameter choice? What about the parameters
m=1, s=0, g=1 (try!)?
Last update: Sep 21, 2005