See also the downloads page

Software tools for EST analysis

Note that this page remains for historical reasons, and that the software now has a new home at malde.org.

Deriving repeat libraries

From a given EST clustering, we can derive a repeat library by extracting sequence fragments occurring in multiple sequences. A rather crude implementation is 'reps', currently available as a Linux binary.

Usage: reps k ugfile, where ugfile is a file in Unigene cluster format (i.e. Fasta separated by #-comments), or reps k clusters sequences, where clusters is in list format (one line per cluster) and sequences is a Fasta file.
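
To illustrate the idea (this is not the reps implementation), here is a minimal Python sketch. It assumes that k is a word length and that a fragment counts as a repeat when the same k-letter word turns up in at least two different clusters. Both are guesses for the sake of the example, and the function name shared_kmers and its min_clusters parameter are made up here.

```python
from collections import defaultdict

def shared_kmers(clusters, k, min_clusters=2):
    """Return the k-mers that occur in at least min_clusters different clusters.

    clusters is a list of clusters, each a list of sequence strings.
    """
    seen_in = defaultdict(set)  # k-mer -> indices of the clusters containing it
    for idx, cluster in enumerate(clusters):
        for seq in cluster:
            seq = seq.upper()
            for i in range(len(seq) - k + 1):
                seen_in[seq[i:i + k]].add(idx)
    return sorted(kmer for kmer, ids in seen_in.items() if len(ids) >= min_clusters)

# Toy data: the first two clusters share a short fragment, the third does not.
clusters = [["ACGTACGT", "CGTACGTT"], ["TTTACGTA"], ["GGGGCCCC"]]
print(shared_kmers(clusters, 5))  # ['ACGTA', 'TACGT']
```

A real repeat library would presumably merge overlapping words into longer fragments; the sketch stops at the shared words themselves.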

Random selection of sequences

For testing purposes, I wanted to pick random sequences from a Fasta file. I couldn't find a tool for this anywhere, so I wrote a small one called rselect. It is available as a static Linux binary, and the source is in this darcs repo.
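
For readers who just want the effect, random selection from a Fasta file can be sketched in a few lines of Python. This is not how rselect works internally (that is not described here); it uses reservoir sampling so the file is read only once, and the names read_fasta and random_select are invented for the example.

```python
import random

def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of Fasta lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)

def random_select(records, n, seed=None):
    """Pick n records uniformly at random in a single pass (reservoir sampling)."""
    rng = random.Random(seed)
    sample = []
    for i, rec in enumerate(records):
        if i < n:
            sample.append(rec)        # fill the reservoir first
        else:
            j = rng.randint(0, i)     # inclusive on both ends
            if j < n:
                sample[j] = rec       # replace with probability n / (i + 1)
    return sample

fasta = ">a\nACGT\n>b\nGG\nCC\n>c\nTTTT\n".splitlines()
picked = random_select(read_fasta(fasta), 2, seed=1)
print(picked)
```

With a real file, pass the open file object to read_fasta instead of the list of lines.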

RBR

RBR masks repeats in ESTs without using a repeat library. Currently, a statically linked Linux executable, a tarball and the darcs repository are available. Building it now requires the bio library as well.

cluster_tools

Various small tools for manipulating cluster output and Fasta files, including 'clusc', a tool to compare EST clusters using entropy and more traditional metrics. Nothing fancy, but perhaps useful to somebody? Get it from the darcs repo, or as Linux binaries.

xsact

This is a tool for EST clustering. It uses a suffix array (well, almost) to quickly find all exact matches in the data set, and constructs pairwise matching scores from them.
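
The general principle can be sketched in Python as follows. This is a deliberately naive illustration and not xsact's algorithm or scoring: it sorts all suffixes directly (a plain suffix array, quadratic in the worst case), treats every shared word of length k as an exact match, and scores a pair of sequences by the number of distinct words they share. The function name pairwise_scores and the scoring rule are assumptions made for the example.

```python
from itertools import combinations, groupby

def pairwise_scores(seqs, k):
    """Score each pair of sequences by the number of distinct k-mers they share."""
    # Every suffix of length >= k, identified by (sequence index, offset).
    suffixes = [(i, off) for i, s in enumerate(seqs)
                for off in range(len(s) - k + 1)]
    # Naive suffix array: sort the positions by the suffix text they point to.
    suffixes.sort(key=lambda p: seqs[p[0]][p[1]:])
    scores = {}
    # After sorting, suffixes that share their first k characters are adjacent.
    for _, run in groupby(suffixes, key=lambda p: seqs[p[0]][p[1]:p[1] + k]):
        members = sorted({i for i, _ in run})
        for a, b in combinations(members, 2):
            scores[(a, b)] = scores.get((a, b), 0) + 1
    return scores

seqs = ["ACGTACGT", "TTACGTAA", "GGGGGGGG"]
print(pairwise_scores(seqs, 4))  # {(0, 1): 3}
```

The resulting scores could then feed any standard clustering step, for instance joining all pairs whose score exceeds a threshold.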

It can output clusters as lists of labels (accession numbers), as hierarchies (newick-formatted trees), or as Unigene-style clusters (separated by a line starting with #, and followed by the complete sequence data for the cluster).
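
As a reading aid for the Unigene-style output, here is a small Python sketch of a parser. It assumes only what is stated above, namely that each line starting with # opens a new cluster and that the lines following it are ordinary Fasta records; any further details of the format are not covered, and the name parse_clusters is made up.

```python
def parse_clusters(lines):
    """Return a list of clusters, each a list of (header, sequence) tuples."""
    clusters = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            clusters.append([])              # a '#' line opens a new cluster
        elif line.startswith(">"):
            if not clusters:
                clusters.append([])          # tolerate input without a leading '#'
            clusters[-1].append([line[1:], ""])
        elif clusters and clusters[-1]:
            clusters[-1][-1][1] += line      # continuation of the current sequence
    return [[tuple(rec) for rec in cluster] for cluster in clusters]

text = "# cluster 1\n>a\nACGT\nAC\n>b\nGG\n# cluster 2\n>c\nTT\n"
print(parse_clusters(text.splitlines()))
```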

It can also generate the list of matching pairs, highlighting the blocks that match, which gives you a quick-and-dirty pseudo-alignment.

The latest version is 1.5, which has been updated to work with recent GHC versions and also includes xtract in beta status. Also available as i386 RPM and i386 RPM.

If you have darcs, you will probably prefer to do

darcs get http://www.ii.uib.no/~ketil/bioinformatics/repos/xsact

xtract

This tool builds splice graphs from EST clusters, and can either visualize them (using the graphviz package), or construct consensus sequences. This hasn't seen any development lately, and I can't really recommend it over e.g. CAP3 for industrial use at this point. The code is here if you still want to play with it.

Trivialities

(I wrote these before I discovered agrep :-)

Miscellanea

A quick and dirty Python script, clusqual.py, that compares clusterings using the Jaccard index.
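
For reference, one common way to apply the Jaccard index to clusterings is to compare the sets of sequence pairs that each clustering puts in the same cluster. The sketch below shows that pair-counting variant; whether clusqual.py computes it in exactly this way is an assumption, and the function names are invented for the example.

```python
from itertools import combinations

def co_clustered_pairs(clustering):
    """Return the set of label pairs that share a cluster."""
    pairs = set()
    for cluster in clustering:
        # Sorting gives every pair a canonical order, so (a, b) == (a, b) across clusterings.
        pairs.update(combinations(sorted(set(cluster)), 2))
    return pairs

def jaccard(clustering_a, clustering_b):
    """Pairs joined in both clusterings, divided by pairs joined in either."""
    a = co_clustered_pairs(clustering_a)
    b = co_clustered_pairs(clustering_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

A = [["s1", "s2", "s3"], ["s4"]]
B = [["s1", "s2"], ["s3", "s4"]]
print(jaccard(A, B))  # 0.25
```

A value of 1.0 means the two clusterings join exactly the same pairs, and 0.0 means they agree on none.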

An EST dataset, courtesy of the good people at SANBI, necessary if you want to run the included tests.


Ketil Malde
Last modified: Wed Apr 19 13:36:21 CEST 2006