INF280 Autumn 2006: Compulsory Assignment 1

The problems in this compulsory assignment are to be solved on a computer. Assistance for solving the problems will be available in the assignment class on Tuesday, 26 September.

Your solution (including source code files) should be sent by email to Harald Barsnes (haraldb@ii.uib.no) no later than October 6, 12:00. A satisfactory solution is required in order to take the exam.

Problem 1: Practical database search with BLAST

In this exercise you will try out practical database search with BLAST via web servers. There are many such servers. One is located in the European Bioinformatics Institute EBI and is found at http://www.ebi.ac.uk/blastall/.

There are several versions of BLAST for different types of sequences: blastn searches a given DNA sequence in a DNA database, blastp searches a with given protein sequence in a protein database, and blastx searches a given DNA sequence in a protein database. All programs have a number of parameters that can be set. In the web interface you can select the program, database, and the parameters from menus.

  1. The following DNA sequence is a fragment from a virus and contains a part of a gene.
    catgacatca gcttatgagt cataattaat cgtgcgttac aagtagaatt ctactcgtaa
    agcgagttga aggatcatat ttagttgcgt ttatgagata agattgaaag cacgtgtaaa
    atgtttcccg cgcgttggca caactattta caatgcggcc aagttataaa agattctaat
    ctgatatgtt ttaaaacacc tttgcggccc gagttgtttg cgtacgtgac tagcgaagaa
    
    Use blastx to search for the protein for which this gene codes (cut and paste!) in the database Swiss-Prot. Swiss-Prot is a manually curated database containing experimentally verified protein sequences only. BLAST displays the matches first as a short list. Clicking on the ID of each sequence shows each sequence entry in Swiss-Prot. To display the BLAST output with local alignments, press the button "Blast Result". Look at the best match. From which virus does this sequence originate? Where does the protein coding part start in the DNA sequence (look at the alignment)? Where does it end?

    Note that the query sequence is translated to an amino acid sequence. In the help pages it is stated that blastx searches for no less than six different amino acid sequences. Explain this.

  2. A link from the BLAST results leads to the corresponding entry in Swiss-Prot. At the bottom you will find the amino acid sequence (verify this)
    MFPARWHNYL QCGQVIKDSN LICFKTPLRP ELFAYVTSEE DVWTAEQIVK QNPSIGAIID
    LTNTSKYYDG VHFLRAGLLY KKIQVPGQTL PPESIVQEFI DTVKEFTEKC PGMLVGVHCT
    HGINRTGYMV CRYLMHTLGI APQEAIDRFE KARGHKIERQ NYVQDLLI
    
    If this were an unknown protein, we could use database searching to look for similar (homologous) proteins with known function. Use blastp to search for homologous sequences in Uniprot. Use the score matrix Blosum62, and the preselected gap penalty.

    The first match is (not surprisingly!) the sequence itself. Many of the following matches are from viruses. The 14th is from rat. Explain how the matches are sorted. What is score and E-value? Is the 14th match significant?

    Examine the alignment (tip: when selecting "Blast Result", there is a link to each alignment from the initial list of matches). How many gaps are there? It is also possible to run an "old-fashioned" BLAST without gaps (selected from a menu). Can we still find this protein?

  3. Repeat the search in b with Fasta (http://www.ebi.ac.uk/fasta33/) and compare the result. Remember to use the same score matrix! For more information about Fasta, click on Fasta Help in the menu to the right on the above web page. What is ktup, and how does it affect the speed and sensitivity in the search? Is there a similar parameter in BLAST? If you want to, you may also experiment with other score matrices and gap costs.

Problem 2: Alignment by dynamic programming

In this problem we will implement and test sequence alignment by dynamic programming.
  1. Given a score matrix R, and affine gap penalty parameters go (gap open) and ge (gap extension). We want to compute the optimal global alignment score for two sequences q and d (of length m and n) using similarity scoring with the score matrix R and an affine gap penalty g(k) = go + ge k for a gap of length k. Write down the recurrence equations for computing the dynamic programming matrix, including the base cases.
  2. Write a computer program (use your favourite programming language) that computes the dynamic programming table H corresponding to the above recurrence.

    The program takes the following input: i) a file containing the score matrix, ii) the two gap parameters go and ge and iii) two protein sequences q and d. An example of such a score matrix is BLOSUM62:

    ARNDCQEGHILKMFPSTWYVBZX
     4
    -1  5
    -2  0  6
    -2 -2  1  6
     0 -3 -3 -3  9
    -1  1  0  0 -3  5
    -1  0  0  2 -4  2  5
     0 -2  0 -1 -3 -2 -2  6
    -2  0  1 -1 -3  0  0 -2  8
    -1 -3 -3 -3 -1 -3 -3 -4 -3  4
    -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
    -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
    -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
    -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
    -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
     1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
     0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
    -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
    -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
     0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
    -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4
    -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4
     0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1
    
    The alphabet is given first, then a lower triangular matrix. This is because the matrix is symmetric (why?)

    The program should print out the whole table Hi,j (where Hm,n is the optimal alignment score). Label the rows and columns with the two sequences (see Figure 1.3 in Eidhammer et al. for an example).

  3. Test your program on a small part of the two protein sequeces from Problem 1:
    q = FTEKCPGMLVGVHCTHGI og d = FLKENKDNDKLIGVHCTHGL.
    Use the scoring matrix BLOSUM62 (above) and the gap parameters go = 11, ge = 1.
  4. To find an optimal alignment given the dynamic programming matrix, two different strategies are possible. One involves storing "arrows" in each cell in the matrix, and the other involves recomputing values. Explain the two strategies briefly. Choose one strategy (you could modify your program to print out the arrows), mark the traceback path in your dynamic programming table and find the optimal algnment by hand. Compare your result with the BLAST output.
  5. Optional! Implement automatic traceback and output of the optimal alignment. Test your program on the complete protein sequences from Problem 1. Compare with the output from Blast and Fasta. Also try different gap costs, including linear (go=0).
  6. Indicate how your program could be modified to compute local instead of global alignments. Implement and test it if you like (optional).

Eivind Coward
Last update: Sep 26, 2006