INF380 Spring 2006: Compulsory Assignment
Send your solution by mail to
coward@ii.uib.no
by March 9. A satisfactory solution is required in order
to take the exam.
Problem 1: Multiple alignment: Using CLUSTALW
This is a simple exercise to demonstrate the use of CLUSTALW.
It corresponds to Exercise 11 in Chapter 4 of Eidhammer, Jonassen and
Taylor, with some
additional explanation. All answers should be brief!
A web-interface to a CLUSTALW server can be found at
http://www.ebi.ac.uk/clustalw/.
Two small datasets to be aligned are found here:
- 5 cyclin sequences
- 9 small unrelated sequences.
(Note slightly different sequence formats: PIR format and FASTA
format, respectively; both are accepted as input).
Do the following for each sequence set in turn.
-
First, generate a guide tree and a multiple alignment.
This is done by selecting "none" as tree type under Phylogenetic tree.
How is the guide tree computed, and what is it used for?
-
Then generate a phylogenetic tree from the alignment. This is done by
using the alignment file from the first step as input (not the
raw sequences), and selecting tree type "nj" (for example).
Compare the tree with the guide tree from a. Is there any difference?
Which tree is most likely to show the true evolutionary relationship
between the sequences? Why?
-
Generate a new tree,
selecting the option that ignores gaps.
Explain exactly what this option does,
and compare the alignment with the one in b.
Which of the two datasets gives the most different trees? Why?
-
Do a new multiple alignment after changing the gap open penalty to 1.
Comment on the effect on the alignment.
Problem 2: Multiple alignment: Doing It Yourself
In Eidhammer, Jonassen and Taylor, Chapter 4, do Exercise 9 a, b, and
c (p. 98). This problem will give some insight in the principles of
multiple alignment by computing a small pair-guided progressive
alignment by hand. How does the procedure differ from the
CLUSTALW approach?
Problem 3: Hidden Markov Models
In this problem we are going to use a simple Hidden Markov Model (HMM)
for predicting GC-rich regions, like the one presented in the lectures.
The model has two states for GC-rich and GC-poor regions, and the
symbols emitted are the DNA nucleotides A, C, G, T.
-
Given this HMM structure, it remains to estimate the model parameters.
What are they? How many independent parameters are there?
-
We are given some examples of
Use these data to estimate the emission probabilities.
-
We are told that in our data, we should expect that 10% of the
sequence is in GC-rich regions. The average length of a GC-rich region
is 20 bp. Use this to estimate the transition probabilities.
-
From the transition probability matrix, compute the stationary
distribution. Does it agree with the data in the previous question?
Explain why is it reasonable to set the initial state probabilities
equal to the stationary distribution.
-
Optional:
Write a computer program that takes a DNA sequence as input and
uses the Viterbi algorithm to find the most probable state sequence
in this model. Use it to predict the GC-regions in
this sequence,
which was randomly generated by the HMM in this problem.
(You could also check your prediction by writing a program to generate
such sequences yourself,
and compare your predictions with the known state sequence.)
-
CpG islands are special GC-rich regions where the dinucleotide CG
occurs particularly frequently. Explain why our HMM model is not very
well designed to detect this feature. Describe a revised HMM with
eight states that could be used for this purpose. What are now the
emission probabilities for each state? How many independent parameteres
are there in the HMM?
-
Optional:
How can the data in b be used to estimate some of the parameters in
the model? How could they be used to decide whether the revised model
is better than the simple two-state model? Finally, how would you
estimate the state transitions that we could not obtain from the data
in b?
Eivind Coward
Last update: Feb 24, 2006