INF280 Autumn 2006: Compulsory Assignment 2

This compulsory assignment involves writing a computer program, just like the previous one. You can use the programming language of your choice. Assistance will be available in the assignment class on November 1.

Your solution (including source code files) should be sent by email to Harald Barsnes (haraldb@ii.uib.no) no later than November 10. Please send your solution in pdf or ps format (or even on paper), and source code as plain text files. A satisfactory solution is required in order to take the exam.

Problem: Decision tree learning applied to splice site prediction

In this problem you will implement decision tree learning and apply it to a quite realistic problem in bioinformatics, using real data. In computer science terms, it is a pattern discovery problem, and your program can be applied to such problems in general.

Biological background

In all living cells, genes are expressed in the following way: A stretch of DNA (the gene) is transcribed into an mRNA copy, which is then (for protein-coding genes) translated into an amino acid sequence by the ribosomes. In eukaryotes (most organisms other than bacteria), a further step happens between transcription and translation. The original gene copy, called pre-mRNA, is subject to a process called splicing. Pieces of the sequence called introns are removed from the RNA, and the remaining pieces, the exons, are glued together to produce the final mature mRNA.

The machinery that executes this (the spliceosome) recognises certain motifs in the RNA sequence in order to bind to the correct places, the splice sites. The start of an intron is called a donor site and the end of an intron is called an acceptor site. The intron almost always starts with the dinucleotide GT and ends with an AG. However, the vast majority of GT and AG occurrences are not donor or acceptor sites, so more information is needed in order to identify such sites. The sequence around those dinucleotides is also important, but the process is very complicated, and no simple rule is known.

The problem

We will try to find rules for identifying donor sites. We are given a set of confirmed real donor sites, as well as a set of sequences which are not donor sites. Hence, this is a typical machine learning problem (concept learning). The sequences are extracted from data published by T.A. Thanaraj (1999).

The data is found here.

  1. We will try to use the ID3 algorithm to learn the boolean valued target function "is donor site". The data (instances) in our case are short subsequences around the putative donor site. Each instance is a sequence of length 10, always with GT in positions 5-6.

    An important part of the design is to select the attributes to test. With short sequences as instances, it is quite natural to base the attributes on each sequence position, but there are still several possibilities. Three of them are:

    1. Attribute Ai = "nucleotide in position i".
      Values: A, C, G, T.
    2. Attribute Ai,c = "nucleotide in position i equals c", where c = A, C, G, or T.
      Values: true, false.
    3. Attribute Ai,S = "nucleotide in position i is in set S", where S is any nonempty, proper subset of {A,C,G,T}.
      Values: true, false.
    How many different attributes are there in each case? (We will not use attributes for sequence positions i=5,6, since the nucleotides are always the same.) What is the size of a fully expanded decision tree of depth d? Discuss advantages and disadvantages of each of the three alternatives. (Tip for alternative 3: think twice!)
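
    As a quick arithmetic sketch (my own illustration, not part of the assignment text), the raw counts of candidate attributes under the three alternatives can be tallied as follows; whether all the tests in alternative 3 are really distinct is exactly what the tip above asks you to think about.

```python
# Counting candidate attributes for length-10 instances where positions
# 5-6 are fixed to GT, leaving 8 informative positions.
positions = 8    # positions 1-4 and 7-10
alphabet = 4     # A, C, G, T

alt1 = positions                 # alternative 1: one attribute A_i per position
alt2 = positions * alphabet      # alternative 2: one boolean attribute per (i, c)
subsets = 2 ** alphabet - 2      # nonempty, proper subsets of {A, C, G, T}
alt3 = positions * subsets       # alternative 3 (raw count; see the tip!)

print(alt1, alt2, alt3)
```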

  2. For this exercise we will choose alternative 2 above. (As a matter of fact, in a similar published study, the more general alternative 3 was chosen, but most of the attributes actually selected by the algorithm were of the form in alternative 2.)

    Since our data comes from complex, biological systems, we cannot expect a perfect classification to be possible. The most important reason for this is that our short sequences do not capture all relevant information. How will this affect the tree size and the accuracy on training and validation examples if we run the ID3 algorithm to completion, as stated in Mitchell? What is this phenomenon called?

    Discuss sensible ways to stop the tree growth earlier, and choose a strategy for your program.
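
    One possible early-stopping strategy is sketched below (a minimal sketch; the particular thresholds min_gain, max_depth and min_examples are illustrative choices of mine, not prescribed by the assignment):

```python
import math

def entropy(pos, neg):
    """Entropy in bits of a node holding pos positive and neg negative examples."""
    if pos == 0 or neg == 0:
        return 0.0
    p = pos / (pos + neg)
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def should_stop(pos, neg, best_gain, depth,
                min_gain=0.01, max_depth=6, min_examples=10):
    """Stop growing when the node is pure, too small, too deep,
    or the best remaining attribute gains too little information."""
    if pos == 0 or neg == 0:
        return True
    if pos + neg < min_examples:
        return True
    if depth >= max_depth:
        return True
    return best_gain < min_gain
```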

  3. Implement the ID3 algorithm for this learning task. For each interior node, the attribute tested should be output; for a leaf node, the number of positive and negative examples should be given. (Since the tree is not fully grown, there will typically be both positive and negative examples on a leaf node.)

    Apply your program to the training examples and show the program output. Manually draw your tree, or part of it, if this is not clear from the program output itself. Does the decision tree appear to do a good job of classifying donor sites? Which attributes are most important?
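
    A minimal ID3 sketch for this task might look like the following (assuming 0-based Python string indexing, so the fixed GT sits at indices 4-5; the simple stopping rule and the tuple-based tree representation are illustrative choices, not the only reasonable ones):

```python
import math

def entropy(pos, neg):
    """Entropy in bits of a node with pos positive and neg negative examples."""
    if pos == 0 or neg == 0:
        return 0.0
    p = pos / (pos + neg)
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def info_gain(examples, attr):
    """Information gain of boolean attribute attr = (i, c):
    'nucleotide in position i equals c'."""
    i, c = attr
    true_set = [(s, y) for s, y in examples if s[i] == c]
    false_set = [(s, y) for s, y in examples if s[i] != c]
    def ent(subset):
        pos = sum(1 for _, y in subset if y)
        return entropy(pos, len(subset) - pos)
    pos = sum(1 for _, y in examples if y)
    n = len(examples)
    after = sum(len(sub) / n * ent(sub) for sub in (true_set, false_set))
    return entropy(pos, n - pos) - after

def id3(examples, attrs, depth=0, max_depth=5, min_gain=0.01):
    """Leaves store ('leaf', pos, neg); interior nodes store the attribute
    tested, the example counts, and the true/false subtrees."""
    pos = sum(1 for _, y in examples if y)
    neg = len(examples) - pos
    if not attrs or pos == 0 or neg == 0 or depth >= max_depth:
        return ('leaf', pos, neg)
    best = max(attrs, key=lambda a: info_gain(examples, a))
    if info_gain(examples, best) < min_gain:
        return ('leaf', pos, neg)
    i, c = best
    t = [(s, y) for s, y in examples if s[i] == c]
    f = [(s, y) for s, y in examples if s[i] != c]
    rest = [a for a in attrs if a != best]
    return ('node', best, pos, neg,
            id3(t, rest, depth + 1, max_depth, min_gain),
            id3(f, rest, depth + 1, max_depth, min_gain))

# Attributes (i, c) for the 8 informative positions (indices 4-5 are the fixed GT).
ATTRS = [(i, c) for i in list(range(4)) + list(range(6, 10)) for c in 'ACGT']

# Toy sanity check: here the positives are exactly those with A at index 6,
# so the root should test that attribute.
toy = [("AAAAGTAAGT", True), ("CCCCGTACGT", True),
       ("AAAAGTCCCT", False), ("CCCCGTGGGT", False)]
tree = id3(toy, ATTRS)
```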

  4. A simple way to look for patterns in donor sites is to make a weight matrix of our positive examples. For each sequence position we record the nucleotide distribution. This can be visualised in the form of a sequence logo. Here is a sequence logo of our positive training examples:

    The total height of each column shows the information content, which is 2 minus the entropy (so it is 0 if entropy is maximal, and maximal if entropy is 0). The relative heights of the letters reflect the proportion of the letter in that position.

    Can you find some of the information in this profile in your decision tree? What kind of patterns can be expressed in a decision tree, but not in a profile?
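
    The weight-matrix and information-content computation can be sketched as follows (for DNA the maximal entropy is log2 4 = 2 bits, which is where the "2 minus the entropy" comes from):

```python
import math
from collections import Counter

def weight_matrix(sequences):
    """Nucleotide counts per position over a set of equal-length sequences."""
    return [Counter(s[i] for s in sequences) for i in range(len(sequences[0]))]

def column_info(sequences, i):
    """Information content in bits of position i: 2 minus the column entropy."""
    counts = Counter(s[i] for s in sequences)
    n = len(sequences)
    ent = -sum(c / n * math.log2(c / n) for c in counts.values())
    return 2.0 - ent
```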

  5. If we apply the tree to a new instance, we will classify it according to the distribution of positive and negative examples on the leaf node. Let P be the proportion of positive examples. The most natural choice is perhaps to classify an instance as positive if P > 0.5. Let us first adopt this criterion.

    Implement an application/testing mode that applies the learned tree to a set of new instances. For each instance it should output the proportion P and the classification yes/no. It should also summarise the number of positive and negative classifications for all examples. By testing the program on known positive and negative examples separately, we will then be able to measure the accuracy.

    Run your testing mode first on the positive and negative training data again. Find the accuracy (percent correct among all examples), the sensitivity, and the specificity (use Spec_2 from Eidhammer et al.). Then do the same thing on the independent validation data set. Comment on the result.

    Discuss whether it is always appropriate to use the criterion P > 0.5 for positive classification. What if the total numbers of positive and negative examples are very different? Modify your program (the testing mode) such that it instead uses the criterion P > T, where the threshold T is a parameter. Try out different values of T and see how it affects the accuracy, the sensitivity, and the specificity.
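
    The bookkeeping for the thresholded testing mode might look like this sketch (here specificity is computed as TN/(TN+FP); whether that matches the Spec_2 definition in Eidhammer et al. should be checked against the book):

```python
def evaluate(scores, labels, T=0.5):
    """Accuracy, sensitivity and specificity when an instance is classified
    positive iff its leaf proportion P exceeds the threshold T.
    scores: list of leaf proportions P; labels: true classes (True/False)."""
    tp = sum(1 for p, y in zip(scores, labels) if p > T and y)
    fp = sum(1 for p, y in zip(scores, labels) if p > T and not y)
    tn = sum(1 for p, y in zip(scores, labels) if p <= T and not y)
    fn = sum(1 for p, y in zip(scores, labels) if p <= T and y)
    acc = (tp + tn) / len(labels)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0  # TN/(TN+FP); verify against Spec_2
    return acc, sens, spec
```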

  6. Post-pruning of the tree can be done by removing nodes that do not contribute positively to the classification accuracy on an independent validation set (see Mitchell, Sec. 3.7.1.1). You are not asked to implement this, but can you see from your output if the tree can be improved by pruning? Which node would you prune first? (It might help to include in the test output some information about which node in the tree is used for each instance.)

  7. Sometimes we are given sequences with unknown nucleotides, denoted by N. Discuss how we could adapt the method to allow such sequences i) when classifying using a learned tree, and ii) during training of the tree. You do not need to implement this.

Reference

T. A. Thanaraj (1999): A clean data set of EST-confirmed splice sites from Homo sapiens and standards for clean-up procedures. Nucl. Acids Res. 27 (13), 2627-2637.
Eivind Coward
Last update: Oct 23, 2006