Search and Annotation System

Background

One important problem when an EST library has been created for a new organism is to identify the genes the ESTs originate from. Lacking the genome, protein data, or even full-length mRNA for the organism, the best we can do is often to compare the ESTs to data from other organisms.

In particular, it is common to simply BLAST the sequence against whatever data is available, and annotate it with the best BLAST hit.

As far as I know, no existing method goes beyond the direct hits, and very few take multiple hits into account. This project has the potential to be to the existing tools what Google was to other search engines :-)

General

This project is probably suited for somebody with some experience designing medium-sized programs, and some understanding of Haskell and database interfacing. Some bioinformatics experience is also a plus.

The general idea is to start out with:

a set of data bases
precalculated, scored links between objects in the data bases (initially generated from e.g. all-against-all BLAST searches, but possibly also from other kinds of relationships)

Then, we generate an initial set of hits against the data bases, and iterate over it, letting the score flow out along the links, and accumulate in the central "objects".
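To make the idea of score flowing along links concrete, here is a minimal Haskell sketch. It is not the prototype's model: the names (ObjId, Links, flowStep, propagate) and the damping factor are assumptions made for illustration. In each round, every object keeps its initial hit score and receives a damped share of the current score of each object that links to it.

```haskell
import qualified Data.Map.Strict as M

-- All names below are hypothetical; they are not taken from the prototype.
type ObjId  = String                          -- identifier of a data base object
type Scores = M.Map ObjId Double              -- current score per object
type Links  = M.Map ObjId [(ObjId, Double)]   -- outgoing links with weights

-- One round: each object gets its initial hit score plus the damped,
-- weighted score flowing in from every object that links to it.
flowStep :: Double -> Links -> Scores -> Scores -> Scores
flowStep damping links initial current = M.unionWith (+) initial inflow
  where
    inflow = M.fromListWith (+)
      [ (target, damping * weight * score)
      | (source, score)  <- M.toList current
      , (target, weight) <- M.findWithDefault [] source links
      ]

-- Run a fixed number of rounds, starting from the initial hits.
propagate :: Int -> Double -> Links -> Scores -> Scores
propagate rounds damping links initial =
  iterate (flowStep damping links initial) initial !! rounds
```

With three direct hits that all link to the same object, that object can end up with a higher score than any single direct hit, which is the kind of "central" object the iteration is meant to surface.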

A prototype for this seems to work nicely, but it lacks testing on real data.

Tasks

Interface to data bases and link tables

I think it will be a good idea to provide an SQL-based interface to storage. Data bases typically come as flat files (FASTA format for sequences). Later on, different types of data will come into play as well. It should be relatively easy to write standalone programs to import the necessary data into the DB.
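As an illustration of the import step, the following sketch parses FASTA text into records that an import program could then write to the DB. The SeqRecord type and its field names are assumptions, not an existing schema, and the SQL side is deliberately left out since no database library has been chosen. The parser assumes Unix line endings and treats the first word of the header as the identifier.

```haskell
-- Hypothetical record type; the field names are assumptions.
data SeqRecord = SeqRecord
  { seqId          :: String   -- first word of the header line
  , seqDescription :: String   -- rest of the header line, possibly empty
  , seqData        :: String   -- sequence with line breaks removed
  } deriving (Eq, Show)

-- Parse FASTA text: each record is a '>' header line followed by
-- one or more sequence lines. Blank lines are ignored.
parseFasta :: String -> [SeqRecord]
parseFasta = go . filter (not . null) . lines
  where
    go [] = []
    go (('>' : header) : rest) =
      let (body, rest')  = break isHeader rest
          (ident, descr) = break (== ' ') header
      in  SeqRecord ident (dropWhile (== ' ') descr) (concat body) : go rest'
    go (_ : rest) = go rest   -- skip stray lines before the first header

    isHeader line = take 1 line == ">"
```

Keeping the parser pure makes it easy to test separately from whatever database binding is eventually used.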

Iteration algorithm

The current model is a bit ad hoc, but seems to work. It could probably be improved, and it would be nice to have some proof of convergence, etc.
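Until there is a proof, convergence can at least be checked empirically. The sketch below is one possible way to do that, under the assumption that one round of the iteration can be expressed as a function from a score table to a score table; the names maxChange and converge are hypothetical. It repeats the step until the largest per-object change drops below a tolerance, or gives up after a maximum number of rounds.

```haskell
import qualified Data.Map.Strict as M

type Scores = M.Map String Double

-- Largest absolute change of any object's score between two rounds.
-- Objects missing from one table count as having score 0 there.
maxChange :: Scores -> Scores -> Double
maxChange old new =
  maximum (0 : [ abs (get k new - get k old) | k <- M.keys (M.union old new) ])
  where get = M.findWithDefault 0

-- Apply the step function until scores change by less than eps.
-- Returns the number of rounds used and the final scores, or Nothing
-- if the tolerance was not reached within maxRounds.
converge :: Double -> Int -> (Scores -> Scores) -> Scores -> Maybe (Int, Scores)
converge eps maxRounds step = go 0
  where
    go n current
      | n >= maxRounds             = Nothing
      | maxChange current next < eps = Just (n + 1, next)
      | otherwise                  = go (n + 1) next
      where next = step current
```

Recording the number of rounds for each test case would also give a simple performance figure for the evaluation framework.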

We need good test cases and a framework for evaluating the performance. (This will be particularly necessary for later publication.)

User interface

The user should be presented with either a list of good hits or a graphical presentation of the results. The user must also be able to limit the view to certain categories (e.g. proteins), and to assign "sink" status to certain objects (an over-abundant protein domain, or a genomic repeat, for instance).
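The two user controls could be sketched as follows. This assumes one particular reading of "sink" status, namely that a sink still receives score but passes none on, so its outgoing links are dropped before iterating. The Category and Hit types and the function names are hypothetical.

```haskell
import           Data.List       (sortBy)
import qualified Data.Map.Strict as M
import           Data.Ord        (Down (..), comparing)
import qualified Data.Set        as S

-- Hypothetical categories and hit record, for illustration only.
data Category = Nucleotide | Protein | Domain | Repeat
  deriving (Eq, Ord, Show)

data Hit = Hit
  { hitId       :: String
  , hitCategory :: Category
  , hitScore    :: Double
  } deriving (Eq, Show)

type Links = M.Map String [(String, Double)]

-- Assumed meaning of "sink": the object may still receive score, but
-- none flows onward, so all links leaving a sink are removed.
dropSinkLinks :: S.Set String -> Links -> Links
dropSinkLinks sinks = M.filterWithKey (\source _ -> source `S.notMember` sinks)

-- Limit the view to the selected categories, best-scoring hits first.
viewHits :: S.Set Category -> [Hit] -> [Hit]
viewHits wanted =
  sortBy (comparing (Down . hitScore)) . filter ((`S.member` wanted) . hitCategory)
```

For example, viewHits (S.fromList [Protein]) applied to the full hit list gives the protein-only view described above.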

The user should also have the option to insert manual annotations in the database.

Ideally, the interface should be available as a web page.

Incorporating more data

Later on, we would like to incorporate other types of data besides nucleotide and peptide (protein) sequences, and other kinds of links besides sequence similarity:

GO terms
protein and RNA (3d) structure
transcription factor proximity
genomic repeats and vector sequence (mainly for down-scoring)

Ketil Malde

Last modified: Fri May 5 09:19:43 CEST 2006