Your solution (including source code files) should be sent by email to Harald Barsnes (haraldb@ii.uib.no) no later than October 6, 12:00. A satisfactory solution is required in order to take the exam.
There are several versions of BLAST for different types of sequences: blastn searches a DNA database with a given DNA sequence, blastp searches a protein database with a given protein sequence, and blastx searches a protein database with a given DNA sequence. All programs have a number of parameters that can be set. In the web interface you can select the program, the database, and the parameters from menus.
Use blastx to search for the protein for which this gene codes (cut and paste!) in the database Swiss-Prot. Swiss-Prot is a manually curated database containing experimentally verified protein sequences only. BLAST displays the matches first as a short list. Clicking on the ID of each sequence shows each sequence entry in Swiss-Prot. To display the BLAST output with local alignments, press the button "Blast Result". Look at the best match. From which virus does this sequence originate? Where does the protein coding part start in the DNA sequence (look at the alignment)? Where does it end?
catgacatca gcttatgagt cataattaat cgtgcgttac aagtagaatt ctactcgtaa agcgagttga aggatcatat ttagttgcgt ttatgagata agattgaaag cacgtgtaaa atgtttcccg cgcgttggca caactattta caatgcggcc aagttataaa agattctaat ctgatatgtt ttaaaacacc tttgcggccc gagttgtttg cgtacgtgac tagcgaagaa
Note that the query sequence is translated to an amino acid sequence. In the help pages it is stated that blastx searches for no less than six different amino acid sequences. Explain this.
If this were an unknown protein, we could use database searching to look for similar (homologous) proteins with known function. Use blastp to search for homologous sequences in UniProt. Use the score matrix BLOSUM62 and the preselected gap penalty.
MFPARWHNYL QCGQVIKDSN LICFKTPLRP ELFAYVTSEE DVWTAEQIVK QNPSIGAIID LTNTSKYYDG VHFLRAGLLY KKIQVPGQTL PPESIVQEFI DTVKEFTEKC PGMLVGVHCT HGINRTGYMV CRYLMHTLGI APQEAIDRFE KARGHKIERQ NYVQDLLI
The first match is (not surprisingly!) the sequence itself. Many of the following matches are from viruses. The 14th is from rat. Explain how the matches are sorted. What are the score and the E-value? Is the 14th match significant?
Examine the alignment (tip: when selecting "Blast Result", there is a link to each alignment from the initial list of matches). How many gaps are there? It is also possible to run an "old-fashioned" BLAST without gaps (selected from a menu). Can we still find this protein?
The program takes the following input: i) a file containing the score matrix, ii) the two gap parameters g_o and g_e, and iii) two protein sequences q and d. An example of such a score matrix is BLOSUM62:
The alphabet is given first, then a lower triangular matrix. This is because the matrix is symmetric (why?)
ARNDCQEGHILKMFPSTWYVBZX
4
-1 5
-2 0 6
-2 -2 1 6
0 -3 -3 -3 9
-1 1 0 0 -3 5
-1 0 0 2 -4 2 5
0 -2 0 -1 -3 -2 -2 6
-2 0 1 -1 -3 0 0 -2 8
-1 -3 -3 -3 -1 -3 -3 -4 -3 4
-1 -2 -3 -4 -1 -2 -3 -4 -3 2 4
-1 2 0 -1 -3 1 1 -2 -1 -3 -2 5
-1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5
-2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6
-1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7
1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4
0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5
-3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11
-2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7
0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4
-2 -1 3 4 -3 0 1 -1 0 -3 -4 0 -3 -3 -2 0 -1 -4 -3 -3 4
-1 0 0 1 -3 3 4 -2 0 -3 -3 1 -1 -3 -1 0 -1 -3 -2 -2 1 4
0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1 -1 -1
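As a possible starting point for reading a matrix in the format described above, here is a minimal Python sketch. It assumes the alphabet is the first whitespace-separated token and that the remaining tokens are the integer scores of the lower triangle, row by row. The function name parse_score_matrix and the dictionary representation are illustrative choices, not requirements of the assignment.

```python
def parse_score_matrix(text):
    """Parse an alphabet followed by a lower triangular score matrix.

    Returns (alphabet, scores), where scores[(a, b)] is the score for
    aligning residues a and b. Both (a, b) and (b, a) are stored, so
    lookups do not depend on the order of the two residues.
    """
    tokens = text.split()
    if not tokens:
        raise ValueError("empty score matrix")

    alphabet = tokens[0]               # assumed: one token, no spaces
    values = [int(t) for t in tokens[1:]]

    n = len(alphabet)
    expected = n * (n + 1) // 2        # entries in a lower triangle
    if len(values) != expected:
        raise ValueError(f"expected {expected} scores, found {len(values)}")

    scores = {}
    k = 0
    for i in range(n):                 # row i has i + 1 entries
        for j in range(i + 1):
            a, b = alphabet[i], alphabet[j]
            scores[(a, b)] = values[k]
            scores[(b, a)] = values[k]
            k += 1
    return alphabet, scores


# Small illustration using the first three rows of the matrix above.
alphabet, scores = parse_score_matrix("ARN\n4\n-1 5\n-2 0 6\n")
print(scores[("R", "A")], scores[("A", "R")])
```

Because the text is split on any whitespace, the sketch does not depend on how the triangle is broken into lines; it only checks that the number of scores matches the alphabet length.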
The program should print out the whole table H_{i,j} (where H_{m,n} is the optimal alignment score). Label the rows and columns with the two sequences (see Figure 1.3 in Eidhammer et al. for an example).
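One way to produce such a labeled table is sketched below in Python. The sketch makes several assumptions that the assignment does not fix: H is a list of rows with len(q) + 1 rows and len(d) + 1 columns, row 0 and column 0 correspond to the empty prefix and are labeled "-", and q labels the rows while d labels the columns. The function name format_table is hypothetical, and the numbers in the example are placeholders, not computed alignment scores.

```python
def format_table(H, q, d):
    """Return H as text, with rows labeled by q and columns labeled by d.

    Assumes H has len(q) + 1 rows and len(d) + 1 columns; the extra
    first row and column stand for the empty prefix and are labeled '-'.
    """
    if len(H) != len(q) + 1 or any(len(row) != len(d) + 1 for row in H):
        raise ValueError("table dimensions do not match the sequences")

    row_labels = "-" + q
    col_labels = "-" + d

    # Column width: widest entry plus one space of separation.
    width = max(len(str(v)) for row in H for v in row) + 1

    lines = [" " + "".join(c.rjust(width) for c in col_labels)]
    for label, row in zip(row_labels, H):
        lines.append(label + "".join(str(v).rjust(width) for v in row))
    return "\n".join(lines)


# Placeholder values only, to show the layout.
example = [[0, 0, 0],
           [0, 4, -1],
           [0, -1, 5]]
print(format_table(example, "AR", "AR"))
```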