Your solution (including source code files)
should be sent by email to Harald Barsnes
(haraldb@ii.uib.no)
no later than November 7.
A satisfactory solution is required in order to take the exam.
The machinery that executes this (the spliceosome)
recognises certain motifs in the RNA sequence in order
to bind to the correct places, the splice sites.
The start of an intron is called a donor site
and the end of an intron is called an acceptor site.
The intron almost always starts with the dinucleotide GT
and ends with an AG. However, the vast majority of GT and AG
occurrences are not donor or acceptor sites, so more information
is needed in order to identify such sites. The sequence around
those dinucleotides is also important, but the process is very
complicated, and no simple rule is known.

We will try to use the ID3 algorithm to learn the boolean valued target
function "is donor site".
The data (instances) in our case is short subsequences around
the putative donor site. Each instance is a sequence of length 10,
always with GT in position 56.
An important part of the design is to select the attributes
to test. With short sequences as instances, it is quite natural
to base the attributes on each sequence position, but there are still several
possibilities. Three of them are:
 Attribute A_{i} = "nucleotide in position i".
Values: A, C, G, T.
 Attribute A_{i,c} = "nucleotide in position
i = c", where c=A,C,G,T
Values: true, false.
 Attribute A_{i,S} = "nucleotide in position
i is in set S", where S is any nonempty, proper
subset of {A,C,G,T}.
Values: true, false.
How many different attributes are there in each case?
(We will not use attributes for sequence positions i=5,6,
since the nucleotides
are always the same.) What is the size of a fully expanded decision
tree of depth d? Discuss advantages and disadvantages of each
of the three alternatives.
(Tip for alternative 3: think twice!)

For this exercise we will choose alternative 2 above.
(As a matter of fact, in a similar published study, the more general
alternative 3 was chosen, but most of the attributes actually selected
by the algorithm were on the form in alternative 2).
Since our data comes from complex, biological systems, we can not expect
a perfect classification to be possible. The most important reason for
this is that our short sequences do not capture all relevant
information.
How will this affect the tree size and the accuracy on training
and validation examples if we run the ID3 algorithm
to completion, as stated in Mitchell? What is this phenomenon called?
Discuss sensible ways to stop the tree growth earlier,
and choose a strategy for your program.

Implement the ID3 algorithm for this learning task.
For each interior node, the following
information should be output:
 the number of positive and negative examples before splitting
 the chosen attribute
 the information gain for splitting on this attribute
 the attribute value for each branch
For a leaf node, the number of positive and negative examples
should be given. (Since the tree is not fully grown, there
will typically be both positive and negative examples on a leaf node.)
Apply your program to the training examples and show
the program output. Manually draw your tree, or part of it,
if this is not clear from the program output itself.
Does the decision tree appear to do a good
job of classifying donor sites? Which attributes are most
important?

A simple way to look for patterns in donor sites is
to make a weight matrix or profile
of our positive examples. For each sequence position
we record the nucleotide distribution. This can be visualised
in the form of a sequence logo.
Here is a sequence logo of our positive training examples:
The total height of each column shows the information content,
which is 2 minus the entropy (so it is 0 if entropy is maximal, and
maximal if entropy is 0). The relative heights of the letters reflect
the proportion of the letter in that position.
Can you find some of the information in this profile in your
decision tree?
What kind of patterns can be expressed in a decision tree, but not
in a profile?

If we apply the tree to a new instance, we will
classify it according to the distribution of positive
and negative examples on the leaf node. Let P
be the proportion of positive examples. The most natural
choice is perhaps to classify an instance as positive
if P > 0.5. Let us first adopt this criterion.
Implement an application/testing mode that applies the learned tree to
a set of new instances. For each instance it should output
the proportion P and the classification yes/no.
It should also summarise the number of
positive and negative classifications for all examples.
By testing the program on known positive and negative
examples separately, we will then be able to measure the accuracy.
Run your testing mode first on the positive and negative training data
again. Find the accuracy (percent correct among all examples),
the sensitivity, and the specificity (use Spec_2 from Eidhammer et
al.) Then do the same thing on the independent validation data set.
Comment on the result.
Discuss whether it is always appropriate to use the criterion
P > 0.5 for positive classification. What if the
total numbers of positive and negative examples are very different?
Modify your program (the testing mode) such that it instead
uses the criterion P > T, where the threshold
T is a parameter. Try out different values of
T and see how it
affects the accuracy, the sensitivity, and the specificity.

Postpruning of the tree can be done by removing nodes that
do not contribute positively to the classification accuracy
on an independent validation set (see Mitchell, Sec. 3.7.1.1).
You are not asked to implement this, but can you see from your
output if the tree can be improved by pruning? Which node would
you prune first? (It might help to include in the test output
some information of which node in the tree is used
for each instance).

Sometimes we are given sequences with unknown nucleotides,
denoted by N. Discuss how we could adapt the method to allow such sequences
i) when classifying using a learned tree, and ii) during training of
the tree.
You do not need to implement this.