Note that I have now implemented most of the described functionality.
RBR is a tool for masking EST data. ESTs are basically strings over the alphabet {ACGT}, and they represent fragments of genes. One problem is that genes contain repeats, sections that are very similar to each other. As ESTs are commonly clustered by similarity, repeats can cause ESTs from different genes to be clustered together. Masking the repeats helps avoid this.
RBR masks ESTs by first collecting all words and their frequencies (number of occurrences) from the entire data set. For each EST, it then calculates a baseline frequency distribution (representing the number of ESTs covering the same region of the gene), and masks the parts of the EST where the word frequencies are significantly higher than this baseline.
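As a rough illustration of the idea (not RBR's actual code), the two steps can be sketched in a few lines of Haskell. Everything here is an assumption made for the example: the names `wordsOf`, `wordCounts`, `baseline` and `maskEst` are hypothetical, the baseline is simplified to the median word frequency within the EST, "significantly higher" is reduced to a fixed multiple of that median, and masked positions are replaced by 'N'.

```haskell
import qualified Data.Map.Strict as M
import Data.List (sort, tails)

-- Word frequencies for the whole data set (hypothetical type name).
type Counts = M.Map String Int

-- All overlapping words of length k in a sequence.
wordsOf :: Int -> String -> [String]
wordsOf k s = [ w | t <- tails s, let w = take k t, length w == k ]

-- Count every word across all ESTs in the data set.
wordCounts :: Int -> [String] -> Counts
wordCounts k ests = M.fromListWith (+) [ (w, 1) | e <- ests, w <- wordsOf k e ]

-- Simplifying assumption: use the median word frequency of an EST
-- as its baseline, instead of a full frequency distribution.
baseline :: [Int] -> Double
baseline [] = 0
baseline fs = fromIntegral (sort fs !! (length fs `div` 2))

-- Replace with 'N' every position covered by a word whose data set
-- frequency exceeds factor * baseline. Quadratic in the EST length,
-- which is acceptable for a sketch but not for real data.
maskEst :: Int -> Double -> Counts -> String -> String
maskEst k factor counts est = zipWith pick est [0 ..]
  where
    freqs     = [ M.findWithDefault 0 w counts | w <- wordsOf k est ]
    cutoff    = factor * baseline freqs
    hot       = [ i | (i, f) <- zip [0 ..] freqs, fromIntegral f > cutoff ]
    covered p = any (\i -> i <= p && p < i + k) hot
    pick c p  = if covered p then 'N' else c
```

With a word length of 3 and a factor of 2, an EST such as "GACTTTTGCA" in a data set that also contains several copies of "TTTT" would have its run of Ts masked, while ESTs made up only of rare words are left unchanged.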
I have submitted a paper that describes this in more detail; mail me if you are interested.
There are plenty of improvements that would increase the usefulness and usability of RBR. Most of these shouldn't be very difficult, and won't require a very deep understanding of Haskell or bioinformatics. The current codebase is about 1 KLOC.
All the tasks listed below may be too much for one student, and I think we should make a selection based on your skills and interests - or, if there is more than one student, split the work in two.
These should be very straightforward, relatively quick to implement, and require a bit of Unix and a bit of Haskell knowledge.
These will require a deeper knowledge of Haskell, including the FFI. Some C background would also help.
Relatively straightforward stuff, but will probably require more extensive refactorings and modifications of the codebase.
Larger tasks that will require more code, and more knowledge and study of algorithms.