RBR development

Note that I have now implemented most of the described functionality.

Background

RBR is a tool for masking EST data. ESTs are basically strings over the alphabet {ACGT}, and they represent fragments of genes. One problem is that genes contain repeats, sections that are very similar. As ESTs are commonly clustered by similarity, repeats can cause ESTs from different genes to be clustered together. Masking the repeats helps avoiding this.

RBR masks ESTs by collecting all words and their frequencies (number of occurences) from the entire data set. Each EST is then masked by calculate a baseline frequency distribution (representing the number of ESTs covering the same region of the gene), and parts of the EST where the word frequencies are significantly higher than this, are masked.

I have submitted a paper that describes this in more detail, mail me if you are interested.

General

There are plenty of improvements that would increase the usefulness and usability of RBR. Most of these shouldn't be very difficult, and won't require very deep understanding of Haskell or Bioinformatics. The current codebase is about 1KLOC.

All the tasks listed below may be too much for one student, and I think we should make a selection based on your skills and interests - or, if there are more than one student, split it in two.

Tasks

Organization and janitorial tasks

These should be very straightforward, relatively quick to implement, and require a bit of Unix and a bit of Haskell knowledge.

  1. cabalize - cabal is the packaging system for Haskell. It would be nice to have RBR support it.
  2. QuickCheck test suite - QuickCheck is a framework for automatic randomized testing. RBR is short on tests at the moment, and this is the way to go.
  3. Generate .deb and .rpm packages.

Optimization tasks

Will require a deeper knowledge of Haskell, including the FFI. Some C background is also nice.

  1. RBR uses Parsec to parse its input as a [Char]. We definitely should go to some packed string format, probably Don S's FPS. Ideally, we retain the Parsec code for the FASTA headers.
  2. More efficient data structures - currently RBR spends most of its time in a Map (words to frequencies). I think the FMIndex would be ideal, but other alternatives (LZIndex, Judy trees) could also be considered. Currently, space reduction is more important than speed improvements.

Repeat masking improvements

Relatively straightforward stuff, but will probably require more extensive refactorings and modifications of the codebase.

  1. Library-based masking - read a library of known repeats, and optionally mask against this, in addition to the word frequency based masking.
  2. Three pass masking - RBR now has two passes, one to build the word index, the second to mask the sequences. An alternative could be to have three passes, where the middle pass builds a library of to-be-masked sequences, and all sequences are masked against that. (The current approach means that some word may be masked in some sequences, and not in others).

Extended functionality

Larger tasks, that will require more code, and more knowledge and study of algorithms.

  1. Clustering - if we can augment the data structure with the positions of words (as e.g. the FMIndex lets us), it should be possible to do fast clustering of sequences also.
  2. Assembly - and when we do clustering, we're half way to assembly.

Ketil Malde
Last modified: Fri May 5 09:18:44 CEST 2006