# measuring cluster quality import string import sys # helper function, counts matching entries def matchlen(p,ls): c = 0 for a in ls: if a==p: c=c+1 return c # building dictionaries dicts = [] infile = sys.argv[1] # build a dict in order to ignore sequences not in the input d = open(infile) indict = {} while(1): l = d.readline() if l=='': break ss = string.split(l) for a in ss: indict[a] = 1 # build dictionaries for the files to compare against for file in sys.argv[2:]: print "Building dictionary for ",file d = open(file) dict = {} count = {} i = 0 while(1): l = d.readline() if l=='': break ss = string.split(l) count[i] = 0 for a in ss: if indict.has_key(a): count[i] = count[i]+1 dict[a] = i i = i+1 dicts = dicts + [(file,count,dict)] print "Processing input from ",infile # subs = {} contained = {} supers = {} exact = {} jac_a = {} jac_b = {} jac_c = {} for (dname,c,d) in dicts: # subs[dname] = 0 supers[dname] = 0 contained[dname] = 0 exact[dname] = 0 jac_a[dname] = 0.0 jac_b[dname] = 0.0 jac_c[dname] = 0.0 input = open(infile) i = 0 ns = 0 while(1): l = input.readline() if l=='': break ss = string.split(l) if len(ss) >= 2: # continue # don't bother with singleton clusters ns = ns+1 # compare with dictionaries output = '' for (dname,cs,d) in dicts: tmp = [] cc = [] na = 0 for a in ss: if d.has_key(a): cc = cc + [d[a]] if d[a] not in tmp: tmp = tmp + [d[a]] else: na = na+1 # Jacard scores (verify these!) for c in tmp: ct = matchlen(c, cc) # error detection (remove): if not cs.has_key(c): print "aaargh" if len(cc) < ct: print "aargh",ss,c,cc,len(cc),ct if cs[c] < ct: print "argh",ss,c,cc,cs[c],ct jac_a[dname] = jac_a[dname]+ ct*(ct-1) jac_b[dname] = jac_b[dname]+ (len(cc) - ct)*ct jac_c[dname] = jac_c[dname]+ (cs[c]- ct)*ct # Count and print s = 0 for a in tmp: s = s+cs[a] if s==len(ss)-na: if len(tmp)==1: exact[dname] = exact[dname]+1 # output = output + '\tmatches cluster from'+str(dname)+':'+str(tmp)+'\n' elif len(tmp) > 1: supers[dname] = supers[dname]+1 contained[dname] = contained[dname]+len(tmp) output =output+'\tis a superset of clusters from '\ +str(dname)+': ' for a in tmp: output = output+str(a)+'('+str(cs[a])+') ' output = output+'\n' # else no sequences from this cluster was found - ignore # elif len(tmp) == 1: # subs[dname] = subs[dname]+1 # output = output+'\tis a subset of cluster from',dname,':'+str(tmp)+'\n' # uncomment this block for partial match output! #else: #output = output+'\tpartially matches clusters from '+str(dname)+': ' #for a in tmp: # output = output+str(a)+'('+str(cs[a])+') ' #output = output+'\n' if output != '': print 'Cluster ',i,', contains',len(ss),' sequence(s): ',l, print output i = i+1 print '\n\n--------------------' print 'Matching summary for',infile,':' print 'Total clusters: ', i print 'Singleton clusters:', i-ns print 'Non-singletons: ', ns for (dname,cs,x) in dicts: print print 'Compared to',dname, ': ' print '\t', exact[dname],'(',exact[dname]*100/i,'% ) matching exactly' print '\t', supers[dname],'(',supers[dname]*100/i, '% ) supersets, containing a total of',contained[dname],'clusters.' cplx = i-exact[dname]-supers[dname] print '\t', cplx, '(',cplx*100/i,'% ) clusters are subsets, or in complex arrangements' # print '\t',subs[dname],'(', (subs[dname])*100/(i+1),'% ) subsets.' print print '\tJaccard score:', if jac_a[dname] == 0: print 0 else: print jac_a[dname]/(jac_a[dname]+jac_b[dname]+jac_c[dname]) print '\t\t(a =',jac_a[dname],', b =',jac_b[dname],', c =',jac_c[dname],')'