Your solution (including source code files) should be sent by email to Harald Barsnes (haraldb@ii.uib.no) no later than November 10. Please send your solution in pdf or ps format (or even on paper), and source code as plain text files. A satisfactory solution is required in order to take the exam.
The machinery that executes this (the spliceosome) recognises certain motifs in the RNA sequence in order to bind to the correct places, the splice sites. The start of an intron is called a donor site and the end of an intron is called an acceptor site. The intron almost always starts with the dinucleotide GT and ends with an AG. However, the vast majority of GT and AG occurrences are not donor or acceptor sites, so more information is needed in order to identify such sites. The sequence around those dinucleotides is also important, but the process is very complicated, and no simple rule is known.
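To make the candidate-site problem concrete, here is a minimal sketch (the function name and example sequence are hypothetical, not part of the assignment) that enumerates every GT and AG dinucleotide in a sequence. As noted above, most of these positions are not real splice sites, which is why a classifier using the surrounding context is needed.

```python
# Hypothetical sketch: enumerate candidate donor (GT) and acceptor (AG)
# dinucleotide positions. Most of these are NOT real splice sites; a
# classifier must use the surrounding context to separate true sites.
def candidate_sites(seq, motif):
    """Return 0-based start positions of every occurrence of `motif`."""
    return [i for i in range(len(seq) - len(motif) + 1)
            if seq[i:i + len(motif)] == motif]

seq = "CAGGTAAGTTTCTAGCC"
donors = candidate_sites(seq, "GT")      # candidate donor sites
acceptors = candidate_sites(seq, "AG")   # candidate acceptor sites
```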
An important part of the design is to select the attributes to test. With short sequences as instances, it is quite natural to base the attributes on each sequence position, but there are still several possibilities. Three of them are:
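Whichever attribute scheme is chosen, a simple per-position encoding can serve as a starting point. The sketch below (an assumption, not the assignment's prescribed encoding) represents each fixed-length window as one categorical attribute per sequence position.

```python
# Sketch (an assumed encoding): one categorical attribute per sequence
# position, with values A, C, G, T.
def to_attributes(window):
    """Map a sequence window to a dict {position name: nucleotide}."""
    return {f"pos{i}": nt for i, nt in enumerate(window)}

to_attributes("AAGGTAAGT")
# e.g. {'pos0': 'A', 'pos1': 'A', 'pos2': 'G', ...}
```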
Since our data comes from complex, biological systems, we cannot expect a perfect classification to be possible. The most important reason for this is that our short sequences do not capture all relevant information. How will this affect the tree size and the accuracy on training and validation examples if we run the ID3 algorithm to completion, as stated in Mitchell? What is this phenomenon called?
Discuss sensible ways to stop the tree growth earlier, and choose a strategy for your program.
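One possible pre-pruning strategy (a sketch only; the thresholds `min_examples` and `min_gain` are hypothetical parameters you would tune, and other strategies are equally valid) is to stop splitting when the best attribute's information gain is tiny, when the node holds too few examples, or when the node is already pure:

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

# Hypothetical pre-pruning rule (one possible strategy, not the only one):
# stop splitting when the best information gain is tiny, the node is
# small, or the node is already pure.
def should_stop(labels, best_gain, min_examples=5, min_gain=0.01):
    return (len(labels) < min_examples
            or best_gain < min_gain
            or entropy(labels) == 0.0)
```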
Apply your program to the training examples and show the program output. Manually draw your tree, or part of it, if this is not clear from the program output itself. Does the decision tree appear to do a good job of classifying donor sites? Which attributes are most important?
Can you find some of the information in this profile in your decision tree? What kind of patterns can be expressed in a decision tree, but not in a profile?
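For comparison, a position frequency profile can be built as sketched below (an illustration, not the assignment's required code). Note the key limitation this question points at: a profile treats each position independently, so unlike a decision tree it cannot express dependencies between positions (e.g. "G at position 2 matters only when A is at position 4").

```python
from collections import Counter

# Sketch: a position frequency profile from equal-length aligned windows.
# Each position is summarised independently; dependencies BETWEEN
# positions cannot be represented in this form.
def profile(windows):
    length = len(windows[0])
    return [
        {nt: count / len(windows)
         for nt, count in Counter(w[i] for w in windows).items()}
        for i in range(length)
    ]

profile(["GTA", "GTG", "GCA"])
```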
Implement an application/testing mode that applies the learned tree to a set of new instances. For each instance it should output the proportion P and the classification yes/no. It should also summarise the number of positive and negative classifications for all examples. By testing the program on known positive and negative examples separately, we will then be able to measure the accuracy.
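The testing mode described above might be structured roughly as follows (a sketch under assumptions: `classify` is a hypothetical function that walks the learned tree to a leaf, and each leaf is assumed to store the proportion P of positive training examples that reached it):

```python
# Hypothetical testing-mode skeleton. `classify(tree, inst)` is assumed
# to return the proportion P stored at the leaf the instance reaches.
def apply_tree(tree, instances, classify):
    n_pos = n_neg = 0
    for inst in instances:
        p = classify(tree, inst)              # proportion P at the leaf
        label = "yes" if p > 0.5 else "no"
        if label == "yes":
            n_pos += 1
        else:
            n_neg += 1
        print(f"P = {p:.3f}  classified: {label}")
    print(f"positives: {n_pos}, negatives: {n_neg}")
    return n_pos, n_neg
```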
Run your testing mode first on the positive and negative training data again. Find the accuracy (percent correct among all examples), the sensitivity, and the specificity (use Spec_2 from Eidhammer et al.). Then do the same thing on the independent validation data set. Comment on the result.
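The three measures can be computed from the confusion-matrix counts as sketched below. Sensitivity is TP/(TP+FN). For the specificity, TP/(TP+FP) is used here as an ASSUMPTION about what Eidhammer et al. call Spec_2 (this precision-style definition is common in gene finding, where true negatives dominate); verify it against the book before relying on it.

```python
# Sketch of the evaluation measures, given confusion-matrix counts.
# spec_2 = TP/(TP+FP) is an ASSUMED reading of Spec_2 from Eidhammer
# et al. -- check the book's definition.
def evaluate(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    spec_2 = tp / (tp + fp) if tp + fp else 0.0
    return accuracy, sensitivity, spec_2
```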
Discuss whether it is always appropriate to use the criterion P > 0.5 for positive classification. What if the total numbers of positive and negative examples are very different? Modify your program (the testing mode) such that it instead uses the criterion P > T, where the threshold T is a parameter. Try out different values of T and see how it affects the accuracy, the sensitivity, and the specificity.
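A threshold sweep might be organised as in this sketch. The input format (a list of (P, true label) pairs) is an assumption about how your testing mode could expose its scores, and the specificity again uses the assumed TP/(TP+FP) form of Spec_2, to be checked against Eidhammer et al.

```python
# Sketch: sweep the classification threshold T over scored examples.
# `scored` is an ASSUMED format: a list of (P, true_label) pairs with
# true_label in {"yes", "no"}. Specificity uses TP/(TP+FP), an assumed
# reading of Spec_2.
def sweep(scored, thresholds):
    results = {}
    for t in thresholds:
        tp = sum(1 for p, y in scored if p > t and y == "yes")
        fp = sum(1 for p, y in scored if p > t and y == "no")
        fn = sum(1 for p, y in scored if p <= t and y == "yes")
        tn = sum(1 for p, y in scored if p <= t and y == "no")
        acc = (tp + tn) / len(scored)
        sens = tp / (tp + fn) if tp + fn else 0.0
        spec = tp / (tp + fp) if tp + fp else 0.0
        results[t] = (acc, sens, spec)
    return results
```

Lowering T trades specificity for sensitivity and vice versa, which is exactly the effect the question asks you to observe, and is especially relevant when the class sizes are very different.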