Your solution (including source code files) should be sent by email to Harald Barsnes (haraldb@ii.uib.no) no later than October 3. A satisfactory solution is required in order to take the exam.
There are several versions of BLAST for different types of sequences: blastn searches a DNA database with a given DNA sequence, blastp searches a protein database with a given protein sequence, and blastx searches a protein database with a given DNA sequence. All programs have a number of parameters that can be set. In the web interface you can select the program, the database, and the parameters from menus.
Use blastx to search the Swiss-Prot database for the protein encoded by the gene below (cut and paste the sequence). Swiss-Prot is a manually curated database containing experimentally verified protein sequences only. BLAST first displays the matches as a short list; click on "Show Alignments" to see the local alignments found for each match. Look at the alignment for the best match. From which virus does this sequence originate? Where does the protein-coding part start in the DNA sequence? Where does it end?

catgacatca gcttatgagt cataattaat cgtgcgttac aagtagaatt ctactcgtaa agcgagttga aggatcatat ttagttgcgt ttatgagata agattgaaag cacgtgtaaa atgtttcccg cgcgttggca caactattta caatgcggcc aagttataaa agattctaat ctgatatgtt ttaaaacacc tttgcggccc gagttgtttg cgtacgtgac tagcgaagaa
Note that the query sequence is translated to an amino acid sequence. The help pages state that blastx searches with no fewer than six different amino acid sequences. Explain this.
If this were an unknown protein, we could use database searching to look for similar (homologous) proteins with known function. Use blastp to search Swiss-Prot for sequences homologous to the protein below. Use the scoring matrix BLOSUM62 and the preselected gap penalty.

MFPARWHNYL QCGQVIKDSN LICFKTPLRP ELFAYVTSEE DVWTAEQIVK QNPSIGAIID LTNTSKYYDG VHFLRAGLLY KKIQVPGQTL PPESIVQEFI DTVKEFTEKC PGMLVGVHCT HGINRTGYMV CRYLMHTLGI APQEAIDRFE KARGHKIERQ NYVQDLLI
The first match is (not surprisingly!) the sequence itself. Many of the following matches are from viruses. The 13th is from mouse. Explain how the matches are sorted. What are the score and the E-value? Is the 13th match significant?
Examine the alignment. How many gaps are there? It is also possible to run an "old-fashioned" BLAST without gaps (selected from a menu). Can we still find this protein?
The input to the program consists of the two sequences and the parameters m, s, and g.
As output, the program prints the whole table H, where the rows and columns are labeled with the two sequences (see Figure 1.3 in Eidhammer et al. for an example).
Use your program to compute the alignment score for the sequences REINSDYR and REINDEER, with score m=1 for a match, s=-1 for a mismatch, and gap penalty g=1 for each space.
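As a starting point, the table described above could be filled in roughly as follows. This is a minimal sketch under stated assumptions, not a reference solution: it assumes global alignment with a linear gap penalty (g is subtracted for every space), and the function names alignment_table and format_table are made up for illustration. Check the definitions in Eidhammer et al. before relying on it.

```python
def alignment_table(q, d, m=1, s=-1, g=1):
    """Fill the table H for a global alignment of q and d.

    Assumptions (not taken from the assignment text): global alignment,
    match score m, mismatch score s, and a linear gap penalty g that is
    subtracted once for every space.
    """
    rows, cols = len(q) + 1, len(d) + 1
    H = [[0] * cols for _ in range(rows)]

    # First column and first row: a prefix aligned against spaces only.
    for i in range(1, rows):
        H[i][0] = -g * i
    for j in range(1, cols):
        H[0][j] = -g * j

    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = H[i - 1][j - 1] + (m if q[i - 1] == d[j - 1] else s)
            up = H[i - 1][j] - g      # q[i-1] aligned with a space
            left = H[i][j - 1] - g    # d[j-1] aligned with a space
            H[i][j] = max(diagonal, up, left)
    return H


def format_table(H, q, d):
    """Return H as text, with rows labeled by q and columns by d."""
    lines = ["  " + "".join(f"{c:>4}" for c in "-" + d)]
    for label, row in zip("-" + q, H):
        lines.append(f"{label} " + "".join(f"{v:>4}" for v in row))
    return "\n".join(lines)


if __name__ == "__main__":
    q, d = "REINSDYR", "REINDEER"
    H = alignment_table(q, d, m=1, s=-1, g=1)
    print(format_table(H, q, d))
    print("alignment score:", H[-1][-1])
```

Under these assumptions the alignment score is the value in the bottom-right cell of H. If the intended algorithm is local rather than global alignment, the initialization and the maximum would have to be changed accordingly.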