A comprehensive pipeline for EST analysis (?)

IDEA: based on a word map,
  - identify repeats (RB-style)
  - mask RB-repeats, and mask against libraries
  - cluster
  - assemble

Stages:
  1. fasta parser (Parsec)
  2. word map builder (using arrays)
  3. a) cluster all-against-all
     b) detect repeats (word frequency aberrances)
     c) position sequence against a reference
  4. linear assembly (xtract replacement(?))
     - and feature output

(thought on RB: alt splicing may cause different expression levels
 along the EST -- but each exon should give a region of constant
 expression level, possibly tapering off.
   * identify exons and relative expression level?
   * use region-based "background expression level" to find repeats
)

(thought: determine sequence quality based on low match number?)

(thoughts on visualization: align sequences (and curves)?)

(thought: use the length of the sequence to get an idea about quality?
 mail Stephen Rudd, ask what happens if he turns sequences around in
 the quality experiment)

How to handle unkeyable words (i.e. with Ns in them)?
  - zero (current tack)
  - average of prec/succ
  - average of possible interpretations

Call mc something else - mc is Midnight Commander (or something)

Current status (check darcs!):
  f3 - does a bit of debugging output
  rb - does masking based on contiguous words, and can output freqs
       along a sequence
  mc - compares maskings
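
Sketch of the word map idea and the "zero" tack for unkeyable words, as
described in the notes above. This is only an illustration, not the code in
darcs: the names (key, wordsOf, buildWordMap, freqsAlong) and the WordSize
parameter are made up here, and a Data.Map stands in for the arrays that the
stage list mentions, just to keep the example short.

```haskell
import qualified Data.Map.Strict as M
import Data.Char (toUpper)
import Data.List (tails)

-- Hypothetical word size parameter; the notes do not fix a value.
type WordSize = Int

-- Encode a word as an Int key, two bits per base.
-- Words containing anything but A/C/G/T (e.g. N) are "unkeyable"
-- and give Nothing.
key :: String -> Maybe Int
key = foldl step (Just 0)
  where
    step acc c = do
      a <- acc
      b <- base (toUpper c)
      return (4 * a + b)
    base 'A' = Just 0
    base 'C' = Just 1
    base 'G' = Just 2
    base 'T' = Just 3
    base _   = Nothing

-- All overlapping words of length k in a sequence, left to right.
-- Sequences shorter than k give no words.
wordsOf :: WordSize -> String -> [String]
wordsOf k s = [ take k t | t <- take (length s - k + 1) (tails s) ]

-- Word map: key of each keyable word -> number of occurrences
-- over all input sequences. Unkeyable words are skipped.
buildWordMap :: WordSize -> [String] -> M.Map Int Int
buildWordMap k seqs =
  M.fromListWith (+)
    [ (w, 1) | s <- seqs, Just w <- map key (wordsOf k s) ]

-- Word frequencies along one sequence. Unkeyable words get zero
-- (the "current tack" in the notes); so do words absent from the map.
freqsAlong :: WordSize -> M.Map Int Int -> String -> [Int]
freqsAlong k wm s =
  [ maybe 0 (\w -> M.findWithDefault 0 w wm) (key ws) | ws <- wordsOf k s ]
```

Loaded in GHCi, with wm = buildWordMap 2 ["ACGT", "ACNT"], the call
freqsAlong 2 wm "ACNT" gives [2,0,0]: the word AC occurs in both sequences,
and the two words touching the N count as zero. The other two options in the
notes (average of prec/succ, average of possible interpretations) would only
change the Nothing branch in freqsAlong.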