Softparsmap - Quick Start Guide
Sequence information is supplied as a Darwin database file. Each sequence in this file should contain the following information in the following format:
<E> <ID>sequence_identifier</ID> <SEQ>peptide_sequence</SEQ> <ORG>Genus_species</ORG> <GB>gi_number</GB> <PAR>completeness</PAR> </E>
sequence_identifier = an integer that uniquely identifies the sequence in the database file.
Genus_species = the organism that was the source of the sequence.
gi_number = GI number of the GenBank entry that contains the coding sequence.
completeness = either complete, where the peptide sequence is complete, or partial, where it is only a fragment.
sequence_identifier = the GenBank GI number of the peptide and gi_number = the GenBank GI number of the sequence containing the corresponding gene.
Other formats can be supported by including a new <sequence_data> specification in the property file (see manual).
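As a rough illustration of the entry format above, the following Python sketch reads <E> entries from a string and checks the constraints stated in this guide (integer identifier, completeness of either complete or partial). It is not part of Softparsmap; the function name parse_entries and the sample entry are made up for this example, and the sketch assumes each entry carries all five tags.

```python
import re

# Field tags taken from the entry format described in this guide.
FIELDS = ("ID", "SEQ", "ORG", "GB", "PAR")


def parse_entries(text):
    """Return a list with one dict per <E>...</E> entry found in text."""
    entries = []
    for match in re.finditer(r"<E>(.*?)</E>", text, flags=re.DOTALL):
        body = match.group(1)
        entry = {}
        for tag in FIELDS:
            field = re.search(rf"<{tag}>(.*?)</{tag}>", body, flags=re.DOTALL)
            if field is None:
                raise ValueError(f"entry is missing <{tag}>")
            entry[tag] = field.group(1).strip()
        # The guide states that the identifier is an integer.
        if not entry["ID"].isdigit():
            raise ValueError("sequence_identifier must be an integer")
        # The guide allows exactly two completeness values.
        if entry["PAR"] not in ("complete", "partial"):
            raise ValueError("completeness must be 'complete' or 'partial'")
        entries.append(entry)
    return entries


# Made-up sample entry, for illustration only.
sample = (
    "<E> <ID>101</ID> <SEQ>MKVLA</SEQ> <ORG>Homo_sapiens</ORG> "
    "<GB>202</GB> <PAR>complete</PAR> </E>"
)
entries = parse_entries(sample)
```

A check like this can be run over a database file before starting Softparsmap to catch malformed entries early.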
Gene trees are supplied in the Newick format, with bootstrap / branch support values represented as internal node labels. Other variations on the Newick format can be supported by including a new <tree_parser> specification in the property file (see manual). These trees should be in the directory specified by the family group, and should be called family{identifier}.tree, where {identifier} is an integer that uniquely identifies each tree.
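The naming rule for tree files can be checked with a short script before a run. The Python sketch below is only an illustration of that rule, not Softparsmap code; the helper names family_identifier and list_family_trees are hypothetical.

```python
import re
from pathlib import Path

# Matches names of the form family{identifier}.tree with an integer identifier.
TREE_NAME = re.compile(r"^family(\d+)\.tree$")


def family_identifier(file_name):
    """Return the integer identifier from a name like family12.tree, else None."""
    match = TREE_NAME.match(file_name)
    return int(match.group(1)) if match else None


def list_family_trees(directory):
    """Map each family identifier to its tree file inside directory."""
    trees = {}
    for path in Path(directory).iterdir():
        identifier = family_identifier(path.name)
        if identifier is not None and path.is_file():
            trees[identifier] = path
    return trees
```

For example, list_family_trees("trees") would return a dictionary keyed by family number for a directory named trees, ignoring any file that does not follow the naming rule.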
Species tree information is supplied in the form of the GenBank files names.dmp and nodes.dmp, which are available at NCBI Taxonomy (or FTP).
Alternatively, you can use your own species tree information, as long as it is in the same format as these files and all the species in your gene trees are present.
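To check that every species in the gene trees is present in the taxonomy files, something along the following lines may help. This Python sketch relies on the NCBI dump layout, in which fields are separated by a tab, a vertical bar and a tab, and each line ends with a tab and a vertical bar. The function names are hypothetical, and the mapping from Genus_species (underscore) to the NCBI scientific name (space) is an assumption made for this example.

```python
def split_dmp_line(line):
    """Split one line of an NCBI taxonomy dump file into its fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the trailing field terminator
    return [field.strip() for field in line.split("\t|\t")]


def read_scientific_names(lines):
    """Map scientific name -> taxonomy id from names.dmp lines."""
    names = {}
    for line in lines:
        fields = split_dmp_line(line)
        # names.dmp columns: tax_id, name, unique name, name class
        if fields[3] == "scientific name":
            names[fields[1]] = int(fields[0])
    return names


def read_parents(lines):
    """Map taxonomy id -> parent taxonomy id from nodes.dmp lines."""
    parents = {}
    for line in lines:
        fields = split_dmp_line(line)
        # nodes.dmp columns start with: tax_id, parent tax_id, rank
        parents[int(fields[0])] = int(fields[1])
    return parents


def missing_species(gene_tree_species, names):
    """Return the Genus_species labels that have no scientific name entry."""
    # Assumption: Genus_species maps to the NCBI name by replacing "_" with " ".
    return sorted(s for s in gene_tree_species if s.replace("_", " ") not in names)


# Sample lines in the dump layout, shortened for illustration.
names_lines = [
    "9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n",
    "9606\t|\thuman\t|\t\t|\tgenbank common name\t|\n",
]
nodes_lines = ["9606\t|\t9605\t|\tspecies\t|\n"]

names = read_scientific_names(names_lines)
parents = read_parents(nodes_lines)
absent = missing_species(["Homo_sapiens", "Mus_musculus"], names)
```

Any label reported by missing_species would need to be added to a custom species tree, or corrected in the sequence database, before the run.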
The property file describes where the input information is, where the output should go, and what format it is all in. Default formats are available and are described below. The example property file that accompanies this guide, prop.xml, illustrates the following:
Common definitions can be imported from a separate file. In the example, definitions included with the package (see def.xml) are imported:
<import source="softparsmap/def.xml" source_context="classpath"/>
Tasks are then defined:
<task did="root_example" eid="root"
      inparalogous="new"
      template_target="target/family{number}_rooted.tree"
      template_target_non_binary="target/family{number}_non_binary.info"
      tree_parser_out="newick_bootstrap" />
<task did="map_example" eid="map"
      template_target="target/family{number}_map.info">
  <tree_parser eid="schreiber_gene_label"/>
  <tree_parser eid="schreiber_duplication"/>
</task>
The first task, root_example, inherits from the task root (defined in the imported common definitions file def.xml) and computes rooted trees, which are put in the directory target. The second task, map_example, finds duplications in the rooted trees and writes the results into files located in the target directory.
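To see which output files to expect for a given family, the template attributes of the two tasks can be expanded by hand. The Python sketch below assumes that {number} is simply replaced by the family's integer identifier; that substitution rule is inferred from the file naming described in this guide, and the helper expand_template is hypothetical.

```python
# Template values copied from the example task definitions.
TEMPLATES = {
    "rooted tree": "target/family{number}_rooted.tree",
    "non-binary info": "target/family{number}_non_binary.info",
    "duplication map": "target/family{number}_map.info",
}


def expand_template(template, number):
    """Replace the {number} marker with the family identifier (assumed rule)."""
    return template.replace("{number}", str(number))


# Expected output paths for family 7, under the assumption above.
outputs = {name: expand_template(t, 7) for name, t in TEMPLATES.items()}
```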
Sequences collected together into a free tree are referred to as a family. Families can be grouped together for processing in the same run. From the example:
<family_group did="unrooted" eid="trees_in_files"
              data_source="example"
              template_tree_file_name="family{number}.tree"
              tree_parser_in="newick_bootstrap">
  <include_directory eid="super" tree_files_directory="trees"/>
</family_group>
The family group unrooted, which inherits from the trees_in_files family group (specified in the imported common definitions def.xml), reads sequence data specified by the example data_source (described below), and reads trees from the directory trees/.
The next section defines a family group that contains the rooted
trees:
<family_group did="rooted" eid="trees_in_files"
              data_source="example"
              template_tree_file_name="family{number}_rooted.tree"
              tree_parser_in="newick">
  <include_directory eid="super" tree_files_directory="target"/>
</family_group>
java softparsmap.Compute prop.xml compare_gene_trees unrooted rooted
<sequence_data did="example" eid="xml"
               gi_number_tag_name="GB"
               gi_number_marker="{gi_number}"
               gi_number_template="{gi_number}" />
<data_source did="example" eid="xml_ncbi_taxonomy"
             abstract_sequence_data="example"
             ncbi_taxonomy_names_file="names.dmp"
             ncbi_taxonomy_nodes_file="nodes.dmp"
             xml_database_file="db.drw"
             index_file="java_index" />
The following lines define weak edges as those with support values less than 0.7:
<edge_type did="taed" eid="unknown" short_name="UN" value_limit="0.7" />
The next section describes how to parse input Newick trees whose leaves are labelled with sequence identifiers and whose internal nodes are labelled with support values.
<tree_parser did="newick_bootstrap" eid="newick" edge_type="taed" template_node_data="{value}" template_leaf="{label}" />
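The combination of a support threshold and a Newick tree with support values as internal node labels can be illustrated with a small stand-alone parser. The Python sketch below is not how Softparsmap parses trees internally; it is a minimal recursive reader written for this example, it assumes unquoted labels, and the sample tree is made up. It reports the internal nodes whose support is below the 0.7 limit used in the example.

```python
def parse_newick(text):
    """Parse a Newick string into nested dicts with label, length, children."""
    text = "".join(text.split())  # unquoted Newick labels contain no whitespace
    pos = 0

    def peek():
        return text[pos] if pos < len(text) else ""

    def read_name():
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in "(),:;":
            pos += 1
        return text[start:pos]

    def read_node():
        nonlocal pos
        children = []
        if peek() == "(":
            pos += 1  # skip "("
            children.append(read_node())
            while peek() == ",":
                pos += 1  # skip ","
                children.append(read_node())
            if peek() != ")":
                raise ValueError("unbalanced parentheses in Newick string")
            pos += 1  # skip ")"
        label = read_name()
        length = None
        if peek() == ":":
            pos += 1  # skip ":"
            length = float(read_name())
        return {"label": label, "length": length, "children": children}

    return read_node()


def leaf_labels(node):
    """Collect leaf labels (sequence identifiers) from left to right."""
    if not node["children"]:
        return [node["label"]]
    labels = []
    for child in node["children"]:
        labels.extend(leaf_labels(child))
    return labels


def weak_supports(node, value_limit=0.7):
    """Return support values of internal nodes that fall below value_limit."""
    found = []
    for child in node["children"]:
        found.extend(weak_supports(child, value_limit))
    # Internal nodes carry the support value as their label; the root has none.
    if node["children"] and node["label"]:
        support = float(node["label"])
        if support < value_limit:
            found.append(support)
    return found


# Made-up tree: leaves are sequence identifiers, internal labels are supports.
tree = parse_newick("((101:0.1,102:0.2)0.95:0.05,(103:0.3,104:0.1)0.42:0.2);")
weak = weak_supports(tree)
leaves = leaf_labels(tree)
```

In this sample only the node with support 0.42 falls below the limit, so only the edge above that node would count as weak under the definition given earlier.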
The final section details what should be done with in-paralogous sequences. remove_while_minimizing_mutation="yes" means that in-paralogous sequences will be removed from the tree while the number of gene duplications and losses is minimized. remove_before_saving="yes" means that they are also removed just before the rooted gene tree is saved.
<inparalogous did="new" eid="standard" remove_before_saving="yes" remove_while_minimizing_mutation="yes" />