INF280 Autumn 2005: Compulsory Assignment 1

The problems in this compulsory assignment are to be solved on a computer. Assistance for solving the problems will be available in the assignment class on Friday, September 23.

Your solution (including source code files) should be sent by email to Harald Barsnes (haraldb@ii.uib.no) no later than October 3. A satisfactory solution is required in order to take the exam.

Problem 1: Practical database search with BLAST

In this exercise you will try out practical database search with BLAST via web servers. There are many such servers. One is located in the European Bioinformatics Institute EBI and is found at http://www.ebi.ac.uk/blastall/.

There are several versions of BLAST for different types of sequences: blastn searches a given DNA sequence in a DNA database, blastp searches a with given protein sequence in a protein database, and blastx searches a given DNA sequence in a protein database. All programs have a number of parameters that can be set. In the web interface you can select the program, database, and the parameters from menus.

  1. The following DNA sequence is a fragment from a virus and contains a part of a gene.
    catgacatca gcttatgagt cataattaat cgtgcgttac aagtagaatt ctactcgtaa
    agcgagttga aggatcatat ttagttgcgt ttatgagata agattgaaag cacgtgtaaa
    atgtttcccg cgcgttggca caactattta caatgcggcc aagttataaa agattctaat
    ctgatatgtt ttaaaacacc tttgcggccc gagttgtttg cgtacgtgac tagcgaagaa
    
    Use blastx to search for the protein for which this gene codes (cut and paste!) in the database Swiss-Prot. (Error message? Click here!) Swiss-Prot is a manually curated database containing experimentally verified protein sequences only. BLAST displays the matches first as a short list, and you have to click on "Show Alignments" in order to see the local alignments found for each match. Look at the alignment for the best match. From which virus does this sequence originate? Where does the protein coding part start in the DNA sequence? Where does it end?

    Note that the query sequence is translated to an amino acid sequence. In the help pages it is stated that blastx searches for no less than six different amino acid sequences. Explain this.

  2. A link from the BLAST results leads to the corresponding entry in Swiss-Prot. At the bottom you will find the amino acid sequence (verify this)
    MFPARWHNYL QCGQVIKDSN LICFKTPLRP ELFAYVTSEE DVWTAEQIVK QNPSIGAIID
    LTNTSKYYDG VHFLRAGLLY KKIQVPGQTL PPESIVQEFI DTVKEFTEKC PGMLVGVHCT
    HGINRTGYMV CRYLMHTLGI APQEAIDRFE KARGHKIERQ NYVQDLLI
    
    If this were an unknown protein, we could use database searching to look for similar (homologous) proteins with known function. Use blastp to search for homologous sequences in Swiss-Prot. Use the score matrix Blosum62, and the preselected gap penalty.

    The first match is (not surprisingly!) the sequence itself. Many of the following matches are from viruses. The 13th is from mouse. Explain how the matches are sorted. What is score and E-value? Is the 13th match significant?

    Examine the alignment. How many gaps are there? It is also possible to run an "old-fashioned" BLAST without gaps (selected from a menu). Can we still find this protein?

  3. Repeat the search in b with Fasta (http://www.ebi.ac.uk/fasta33/) and compare the result. Remember to use the same score matrix! For more information about Fasta, click on Fasta Help in the menu to the right on the above web page. What is ktup, and how does it affect the speed and sensitivity in the search? Is there a similar parameter in BLAST? If you want to, you may also experiment with other score matrices and gap costs.

Problem 2: Alignment by dynamic programming

In this problem we will implement and test sequence alignment by dynamic programming.
  1. Consider a global alignment scoring scheme with score m for a match, score s for a mismatch (substitution), and a linear gap penalty of g for each space. Write down the recurrence equation for computing the dynamic programming table Hi,j used for finding the optimal alignment score. Include the base cases (row and column 0). Where is the optimal score found?

  2. Write a computer program (use your favourite programming language) that computes the dynamic programming table H corresponding to the above recurrence.

    The input to the program is the two sequences, and the parameters m, s, and g.

    As output, the program prints the whole table H, where the rows and columns are labeled with the two sequences (see Figure 1.3 in Eidhammer et al. for an example).

    Use your program to compute the alignment score for the sequences REINSDYR and REINDEER, with score m=1 for a match, s=-1 for a mismatch, and gap penalty g=1 for each space.

  3. Define the edit distance between two sequences. How can the parameters be chosen so that your program computes the edit distance? Use the program to compute the edit distance between the sequences REINSDYR and REINDEER. By manual inspection, state the edit operations that occurred.

  4. A subsequence of a sequence S is an order-preserving, but not necessarily contiguous sequence of symbols from S. For example, RIS is a subsequence of REINSDYR. The longest common subsequence problem for a set of sequences is to find the longest sequence that is a subsequence of all members of the set. Find a new choice of parameters that can be used to compute the longest common subsequence of two sequences by dynamic programming, and use your program to find the length of the longest common subsequence of REINSDYR and REINDEER. Which subsequence is it?

  5. A procedure called backtracking is used to find the optimal alignment from the matrix H. Two different strategies can be used. The first involves storing extra information ("arrows") along with the values Hi,j, and the second involves recomputing some of the values. Explain briefly each method with advantages and disadvantages.

  6. Choose one method and try to implement backtracking and output of the optimal alignment. When several alignments are optimal, it is sufficient to output one. Use your program to find the optimal alignments from b, c, and d. (If you were not able to implement this, find the alignments manually, marking the path through the matrix H generated by the program.)

  7. Find choices of parameters that produce optimal alignments different from the previous ones.

  8. Which changes to the algorithm are needed to compute an optimal local alignment instead of a global alignment? Implement this and test it on the sequences REINSDYRSTEK and ERTEDYRKING. Compare with the global alignment. Use the parameters from b. Why would the edit distance no longer be a good parameter choice? What about the parameters m=1, s=0, g=1 (try!)?

Eivind Coward
Last update: Sep 21, 2005