MDL test case: Chromo shadow family


For more information about this family, see Rein Aasland's Chromo Domain WWW page and the paper:
The chromo shadow domain, a second chromo domain in heterochromatin-binding protein 1, HP1.
Rein Aasland and A. Francis Stewart.
Nucleic Acids Research 23(16): 3168-3173 (1995)

Sequences included in the analysis:

  1. Sequence segments included in Aasland and Stewart's NJ tree.
  2. Bigger set.

NJ tree generated using Clustal W (calculation) and Phylip (drawing).

The alignment that was used to generate the tree, made by Clustal W.

Aasland and Stewart's NJ tree, showing an estimate of the evolutionary relationships among the sequences:

Results from running Pratt with the MDL set cover algorithm

First, Pratt was run on the complete set of sequences, using different K-values (K is the minimum number of sequences that must match a pattern). The pattern having the maximum C-value was chosen, and the sequences matching this pattern were removed from the set of sequences. The remaining set of sequences was analysed in the same way. This was repeated until the number of sequences left was less than 4. The resulting set of patterns covers the set of sequences in a close-to-optimal way according to the Minimum Description Length (MDL) principle. Several tests were done using different parameters for the MDL scoring:
  1. Z=0.0
  2. Z=2.0
  3. Z=4.0
  4. Z=5.0, and Z=5.0 complete (Pratt run for all K-values)
  5. Z=5.5
  6. Z=6.0
  7. Z=20.0 (complete)
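
The greedy procedure described above can be sketched as follows. This is only an illustration of the loop structure, not the code used for these tests: the function names (greedy_cover, all_k_values, toy_best_pattern) are hypothetical, and the toy scorer simply counts how many sequences share a fixed-width substring. It stands in for Pratt's pattern search and for the MDL-based C-value, neither of which is reproduced here.

```python
from typing import Callable, List, Optional, Set, Tuple

# (pattern, C-value, indices of the matching sequences)
Candidate = Tuple[str, float, Set[int]]


def all_k_values(n: int) -> List[int]:
    """'Complete' stepping: try every K from n down to 4."""
    return list(range(n, 3, -1))


def toy_best_pattern(seqs: List[str], k: int, width: int = 3) -> Optional[Candidate]:
    """Toy stand-in for a Pratt run (assumption, not Pratt itself).

    Finds the substring of length `width` shared by the most sequences,
    provided at least k sequences contain it. The score is just the
    number of matching sequences, not a real MDL C-value.
    """
    hits = {}
    for i, s in enumerate(seqs):
        for j in range(len(s) - width + 1):
            hits.setdefault(s[j:j + width], set()).add(i)
    eligible = [(len(m), p) for p, m in hits.items() if len(m) >= k]
    if not eligible:
        return None
    count, pattern = max(eligible)
    return pattern, float(count), hits[pattern]


def greedy_cover(
    sequences: List[str],
    k_values_for: Callable[[int], List[int]],
    best_pattern: Callable[[List[str], int], Optional[Candidate]],
    min_left: int = 4,
) -> Tuple[List[Tuple[str, List[str]]], List[str]]:
    """Pick the top-scoring pattern, remove its matches, and repeat."""
    remaining = list(sequences)
    cover = []
    # Stop once fewer than min_left sequences are left.
    while len(remaining) >= min_left:
        candidates = []
        for k in k_values_for(len(remaining)):
            cand = best_pattern(remaining, k)
            if cand is not None:
                candidates.append(cand)
        if not candidates:
            break
        # Choose the candidate with the maximum score over all K-values.
        pattern, _, matched = max(candidates, key=lambda c: c[1])
        cover.append((pattern, [remaining[i] for i in sorted(matched)]))
        remaining = [s for i, s in enumerate(remaining) if i not in matched]
    return cover, remaining


# Made-up demo data, not sequences from the chromo shadow family.
DEMO = ["AAAX", "AAAY", "AAAZ", "AAAW", "AAAQ",
        "BBBX", "BBBY", "BBBZ", "BBBW", "CCCX"]
cover, left_over = greedy_cover(DEMO, all_k_values, toy_best_pattern)
for pattern, group in cover:
    print(pattern, len(group), group)
print("left over:", left_over)
```

On the made-up demo data the loop reports two groups (5 and 4 sequences) and leaves one sequence uncovered, which mirrors the stopping rule of fewer than 4 sequences left. A real run would replace toy_best_pattern with a call to Pratt and use the Z-dependent scoring.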

Alternative scheme with two constants, Z and Z':

  1. Z=4.0, Z'=3.0
  2. Z=4.0, Z'=10.0
  3. Z=4.0, Z'=50.0
  4. Z=4.0, Z'=100.0
  5. Z=5.0, Z'=50.0

Alternative scheme with three constants, Z1, Z2, and Z3:

  1. (3,2,50)
  2. (10,2,50)
  3. (10,2,100)
  4. (10,-2,100)
  5. (10,-10,100)
  6. (12,2,100)
  7. (12,-2,100)
  8. (12,-4,100)
  9. (15,2,50)

We see that for Z=5.0 we get almost exactly the groups circled in Aasland and Stewart's NJ tree. For lower Z-values we get subsets of these sets, and for higher Z-values (Z=6.0) we get bigger groups. This means that we can adjust the Z-value to give us bigger families of more distant sequences (super-families) by using a high Z-value, or smaller, denser families (subfamilies) by using a lower Z-value.

For the parameter value Z=5.0, we also did a test to see if the stepping of K parameter values for Pratt makes a difference. In this specific case, the two strongest patterns reported (defining sets corresponding to the circled sets in the tree above) are the same when all K-values are used as when the standard algorithm is used, and the two remaining patterns are clearly correlated. One should, however, note that in this particular case all reported sets are relatively small, and the standard algorithm is nearly exhaustive for small K-values (see examples below). The effect of the stepping used will probably depend heavily on the specific case.

For the Z=5.0 example, we get sets of sizes: 7, 7, 12, and 8 (standard algorithm) and 7, 7, 11, 8, and 1 (complete). Examples of sequences of K-values used by the standard algorithm when analysing sets of size N sequences:
Sequence of K-values used for N=34: 34, 29, 24, 20, 16, 13, 11, 9, 8, 7, 6, 5, 4.
Sequence of K-values used for N=27: 27, 24, 20, 16, 13, 11, 9, 8, 7, 6, 5, 4.
Sequence of K-values used for N=18: 18, 16, 13, 11, 9, 8, 7, 6, 5, 4.
Sequence of K-values used for N=13: 13, 11, 9, 8, 7, 6, 5, 4.
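
The four example sequences are consistent with a simple reading: start at K=N, then step down through one fixed grid of values below N. The sketch below encodes that reading. The grid is only read off the examples listed here, and the names OBSERVED_GRID and k_schedule are hypothetical; this is not Pratt's documented stepping rule, and it has only been checked against N=34, 27, 18 and 13.

```python
from typing import List

# Grid of K-values read off the four examples in the text
# (an observation from those examples, not Pratt's documented rule).
OBSERVED_GRID = [29, 24, 20, 16, 13, 11, 9, 8, 7, 6, 5, 4]


def k_schedule(n: int, grid: List[int] = OBSERVED_GRID) -> List[int]:
    """Start at K = n, then step down through the grid values below n."""
    return [n] + [k for k in grid if k < n]


for n in (34, 27, 18, 13):
    print(f"N={n}:", ", ".join(str(k) for k in k_schedule(n)))

# Compare the amount of work with a complete run over K = N, ..., 4.
n = 34
print(len(k_schedule(n)), "K-values instead of", len(range(4, n + 1)))
```

For N=34 this gives 13 K-values where a complete run over K=34,...,4 would use 31. Every K from 4 to 9 is included, which is why the standard stepping is nearly exhaustive when the reported sets are small.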

We see that most of the chromo shadow domains are very seldom put in the same group as any of the other sequences, and less often so than some of the classical chromo domains linked to chromo shadow domains. The chromo shadow domains also seem to constitute a more separate subtree than the classical chromo domains linked to chromo shadow domains.

Technical details about the algorithms used.

Page compiled by: Inge Jonassen.
