DONE: IntMap is *much* faster
      - rewrite to switch between Map Integer Int, and IntMap Int

DONE: Sequences are parsed and passed around.  With lazy bytestrings, this
      is (memory-)inefficient, and we should instead stream over the file
      multiple times.

      Also, we could build partial indices (for different word-prefixes)
      and prune less interesting bits (rare parts, e.g.)

TODO: Check out options for freqtable data structure.
      Things to try out: fmindex/afi, hashtable, accumArray, HsJudy
      And perhaps combining key and count into an Int(|eger|64)?

----------------------------------------

DONE: support arbitrary length keys (fall back Int32, Int64, Integer)
DONE: optimize, entering only one (minimum) of w (revcompl w)
DONE: trap exceptions from parsing, and fall back to "trivial"
      - by eliminating complicated parsing
DONE: auto-limit heap to 80% physical (but no go on CentOS :-( )
TODO: support shaped keys
DONE: support sparse keys (every nth)
      try to fit new sequences to old keys
      (add to score number of unregistered positions?)
DONE: gap closing
TODO: Repeat ID: output report (.tbl, .out)
      * calculate 1..k'th order entropy
      * other?
TODO: Mask against library
TODO: three-pass: build FT, build library, mask against it
TODO: calculate distrib and mask over windows (w=200? 400?)
      avoids different treatment of different length sequences

- Clustering
- Clustering with (SG/Lee) assembly
  -> statistics to use when clustering:
     1. mode of word counts distribution (= coverage)
     2. estimated p value (1-var/mu) (= avg. overlap)
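
NOTE: rough sketch of the two clustering statistics above, computed from an
      IntMap Int frequency table that stores only min(w, revcompl w).
      Not the actual code: every name here (code, encode, revComplKey,
      canonical, wordsOf, freqTable, modeOfCounts, estimateP) is made up
      for illustration, and keys are assumed to fit in a 64-bit Int
      (2 bits per base). Reading "1-var/mu" as the binomial identity
      (mean n*p, variance n*p*(1-p), so var/mu = 1-p) is an assumption too.

```haskell
import qualified Data.IntMap.Strict as IM
import Data.Bits (shiftL, shiftR, (.|.), (.&.))
import Data.List (foldl', maximumBy, tails)
import Data.Ord (comparing)

-- | 2-bit code for a nucleotide; Nothing for anything else (N, gaps, ...).
code :: Char -> Maybe Int
code 'A' = Just 0
code 'C' = Just 1
code 'G' = Just 2
code 'T' = Just 3
code _   = Nothing

-- | Pack a word into an Int, 2 bits per base (assumes 2 * length < 64).
encode :: String -> Maybe Int
encode = foldl' step (Just 0)
  where
    step acc c = do
      a <- acc
      x <- code c
      return ((a `shiftL` 2) .|. x)

-- | Reverse complement of an encoded word of length k.
--   With A=0, C=1, G=2, T=3 the complement of a base x is 3 - x.
revComplKey :: Int -> Int -> Int
revComplKey k w = go k w 0
  where
    go :: Int -> Int -> Int -> Int
    go 0 _ acc = acc
    go n x acc = go (n - 1) (x `shiftR` 2) ((acc `shiftL` 2) .|. (3 - (x .&. 3)))

-- | Enter only one (the minimum) of w and its reverse complement.
canonical :: Int -> Int -> Int
canonical k w = min w (revComplKey k w)

-- | All overlapping words of length k in a sequence.
wordsOf :: Int -> String -> [String]
wordsOf k s = [ w | t <- tails s, let w = take k t, length w == k ]

-- | Frequency table over canonical words; words with non-ACGT are skipped.
freqTable :: Int -> [String] -> IM.IntMap Int
freqTable k seqs = IM.fromListWith (+)
  [ (canonical k key, 1) | s <- seqs, Just key <- map encode (wordsOf k s) ]

-- | 1. Mode of the word-count distribution: the count shared by most words.
modeOfCounts :: IM.IntMap Int -> Maybe Int
modeOfCounts ft
  | IM.null ft = Nothing
  | otherwise  = Just . fst . maximumBy (comparing snd) . IM.toList $ hist
  where
    hist = IM.fromListWith (+) [ (c, 1 :: Int) | c <- IM.elems ft ]

-- | 2. Estimated p = 1 - var/mu over the word counts (population variance).
estimateP :: IM.IntMap Int -> Maybe Double
estimateP ft
  | n == 0 || mu == 0 = Nothing
  | otherwise         = Just (1 - var / mu)
  where
    cs  = map fromIntegral (IM.elems ft) :: [Double]
    n   = fromIntegral (length cs) :: Double
    mu  = sum cs / n
    var = sum [ (c - mu) ^ (2 :: Int) | c <- cs ] / n
```

      E.g. modeOfCounts (freqTable 12 seqs) and estimateP (freqTable 12 seqs)
      in GHCi, for some list of sequences seqs. A real run would probably
      want to drop the count-1 words (sequencing errors) before taking the
      mode; that is not done here.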