Input Format

Sequence Information

Sequence information is supplied as a Darwin database file. Each sequence entry in this file should contain the following fields, in this format:

  <E>
  <ID>sequence_identifier</ID>
  <SEQ>peptide_sequence</SEQ>
  <ORG>Genus_species</ORG>
  <GB>gi_number</GB>
  <PAR>completeness</PAR>
  </E>
Where, in the examples, sequence_identifier is the GenBank GI number of the peptide, and gi_number is the GenBank GI number of the sequence containing the corresponding gene.
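As a rough illustration of the entry layout described above, the following Python sketch extracts the five fields from each <E>...</E> entry. It is not part of the tool: the function name parse_entries and the sample values (including the "complete" value for <PAR>) are hypothetical, and it assumes entries are not nested and that tags appear exactly as shown.

```python
import re

# One <E>...</E> block per sequence entry (assumed not to be nested).
ENTRY_RE = re.compile(r"<E>(.*?)</E>", re.DOTALL)
# The five fields named in the format description above.
FIELD_RE = re.compile(r"<(ID|SEQ|ORG|GB|PAR)>(.*?)</\1>", re.DOTALL)


def parse_entries(text):
    """Return one dict per entry, mapping tag name to its stripped value."""
    entries = []
    for match in ENTRY_RE.finditer(text):
        fields = {tag: value.strip() for tag, value in FIELD_RE.findall(match.group(1))}
        entries.append(fields)
    return entries


# Hypothetical sample entry; all values are placeholders.
example = """
<E>
<ID>12345</ID>
<SEQ>MKTAYIAKQR</SEQ>
<ORG>Escherichia_coli</ORG>
<GB>67890</GB>
<PAR>complete</PAR>
</E>
"""

if __name__ == "__main__":
    for entry in parse_entries(example):
        print(entry["ID"], entry["ORG"], len(entry["SEQ"]))
```

A check like this can be useful for confirming that every entry carries all five fields before running an analysis.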

Other formats can be supported by including a new <sequence_data> specification in the property file (see manual).

Gene Trees

Gene trees are supplied in the Newick format, with bootstrap / branch support values represented as internal node labels. Other variations on the Newick format can be supported by including a new <tree_parser> specification in the property file (see manual). The tree files should be placed in the directory specified by the family group and named family{identifier}.tree, where {identifier} is an integer that uniquely identifies each tree.
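The naming and labelling conventions above can be checked with a short Python sketch such as the one below. It is only an illustration under stated assumptions: the helper names (family_identifier, find_family_trees, support_values) and the sample tree are hypothetical, and the support-value pattern assumes purely numeric internal labels with no quoted names.

```python
import re
from pathlib import Path

# File names of the form family{identifier}.tree, with an integer identifier.
TREE_NAME_RE = re.compile(r"family(\d+)\.tree")
# A numeric label directly after a closing parenthesis is read as a support value.
SUPPORT_RE = re.compile(r"\)(\d+(?:\.\d+)?)")


def family_identifier(filename):
    """Return the integer identifier from a tree file name, or None if it does not match."""
    match = TREE_NAME_RE.fullmatch(filename)
    return int(match.group(1)) if match else None


def find_family_trees(directory):
    """Map each identifier to the path of its tree file in the given directory."""
    found = {}
    for path in Path(directory).iterdir():
        ident = family_identifier(path.name)
        if ident is not None:
            found[ident] = path
    return found


def support_values(newick):
    """Collect the internal node labels of a Newick string as floats."""
    return [float(value) for value in SUPPORT_RE.findall(newick)]


# Hypothetical four-taxon tree with two supported internal nodes.
example_tree = "((A:0.1,B:0.2)95:0.05,(C:0.3,D:0.4)80:0.1);"

if __name__ == "__main__":
    print(family_identifier("family12.tree"))
    print(support_values(example_tree))
```

A regular expression is enough for a quick sanity check; a full Newick parser would be needed to handle quoted labels or comments.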

Species Trees

Species tree information is supplied as the GenBank files names.dmp and nodes.dmp, which are available from NCBI Taxonomy (or via FTP).

Alternatively, you can use your own species tree information, as long as it is in the same format as these files and all the species in your gene trees are present.
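To check that every species in a gene tree is present in the species tree information, something like the following Python sketch may help. It relies on the NCBI dump layout, in which fields are separated by tab, vertical bar, tab and each line ends with tab, vertical bar. The function names are hypothetical, and matching Genus_species labels to NCBI names by replacing underscores with spaces is an assumption, not something the tool is known to do.

```python
def split_dmp_line(line):
    """Split one names.dmp / nodes.dmp line into its fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the line terminator
    return line.split("\t|\t")


def read_scientific_names(lines):
    """Map scientific name -> taxon id, using names.dmp lines."""
    names = {}
    for line in lines:
        fields = split_dmp_line(line)
        # names.dmp columns: tax_id, name, unique name, name class
        if len(fields) >= 4 and fields[3] == "scientific name":
            names[fields[1]] = int(fields[0])
    return names


def read_parents(lines):
    """Map taxon id -> parent taxon id, using nodes.dmp lines."""
    parents = {}
    for line in lines:
        fields = split_dmp_line(line)
        if len(fields) >= 2:
            parents[int(fields[0])] = int(fields[1])
    return parents


def missing_species(gene_tree_species, names):
    """Return Genus_species labels with no scientific name entry (assumed mapping)."""
    return sorted(s for s in gene_tree_species if s.replace("_", " ") not in names)


# Small hand-written samples in the dump layout.
names_lines = [
    "562\t|\tEscherichia coli\t|\t\t|\tscientific name\t|\n",
    "562\t|\tE. coli\t|\t\t|\tcommon name\t|\n",
]
nodes_lines = ["562\t|\t561\t|\tspecies\t|\n"]

if __name__ == "__main__":
    names = read_scientific_names(names_lines)
    print(missing_species(["Escherichia_coli", "Bacillus_subtilis"], names))
    print(read_parents(nodes_lines))
```

With real data, the line lists would come from open("names.dmp") and open("nodes.dmp") instead of the inline samples.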

Property File

The property file describes where the input information is, where the output should go, and what format each is in. Default formats are available and are described in the sections that follow. The example property file that accompanies this guide, prop.xml, illustrates the following: