This document provides code listings used for the analyses as described in the supplement chapter "Phylogenetic Placement". See there for a higher level explanation of the analysis pipeline. We use terms and abbreviations from the supplement chapter in this document. Thus, we assume the reader to be familiar with the supplement. The outline of this document roughly follows the sections of the supplement.
The code provided here is intended for Linux systems using bash (e.g., Ubuntu). It needs to be adapted for other environments. Many steps were run on cluster nodes in parallel. As cluster environments differ, the code here is a simplified serial version. We however provide an outline of how to parallelize it.
Author: Lucas Czech (lucas.czech@h-its.org)
Date: 2016-03-04
In case of trouble or bugs, please email to lucas.czech@h-its.org or alexandros.stamatakis@h-its.org
The code in this document is provided by the authors and contributors "as is" and any express or implied warranties, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose are disclaimed. In no event shall the authors or contributors be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way out of the use of the code, even if advised of the possibility of such damage.
The data input to the analysis pipeline consists of the following files (renamed for simplicity).
Reference data:
- Euks.fasta: reference alignment of 512 taxa from all over the Eukaryotic tree, with 4,686 sites (E).
- Apis.fasta: reference alignment of 190 taxa from the Apicomplexan clade, with 2,627 sites (A).
- Euks.tax, Apis.tax: taxonomic assignments for both references (used for C).

Sequence data:

- Amplicons.fasta: 10,567,804 amplicon sequences, with 194 to 554 sites per sequence (M).
- OTUs.fasta: 29,092 OTU sequences, resulting from a SWARM analysis of Amplicons.fasta, with 195 to 554 sites per sequence (O).
- Amplicons.table, OTUs.table: abundance count tables that, for each sequence and each of the 154 sampling locations, contain the frequency with which the sequence was found in that sample.

The pipeline for the data analysis uses the following programs: RAxML, Sativa, PaPaRa, and our own toolkit genesis (introduced below).

To reproduce the figures from the resulting data, FigTree and Inkscape are needed, as well as Python for small helper scripts.
For some of the downstream analyses and data handling of the placements, we used our own toolkit genesis, for which we are currently preparing a proper release. At the time of this writing, there is no comprehensive manual yet on how to use the toolkit. Thus, we give a short introduction here that should suffice to get it to work.
The source code of the version used here, genesis v0.2.0, is available on GitHub. It is written in C++11, but also offers a Python interface. For simplicity, we only use the C++ interface here. For building, a fairly modern compiler is necessary (g++ >= 4.9, clang++ >= 3.6). Furthermore, make and cmake >= 2.6 are needed as build tools. Then, calling

make

in the main directory builds the library.
In order to use custom code, particularly the programs provided later in this document, the source files have to be placed as *.cpp files in the apps directory. They are automatically compiled and turned into executables in the bin directory when the command

make

is issued in the main directory of genesis. Those executables can then be called with the necessary command line parameters to run the programs.
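As an illustration, the typical workflow for adding and running such a program looks as follows (my_program.cpp is a hypothetical placeholder for one of the programs listed later):

#!/bin/bash

cd path/to/genesis
cp path/to/my_program.cpp apps/   # any *.cpp file in apps/ is turned into an executable
make                              # builds the library and compiles all apps
./bin/my_program arg1 arg2        # run the resulting executable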
Although the reference alignments Euks.fasta and Apis.fasta are already aligned, some preprocessing is necessary.
The following script takes an alignment file in FASTA format as input and writes a new alignment file where the sequence names are cleaned, i.e., stripped of any part that comes after the first whitespace. The new file is named like the input file, plus a suffix of ".clean".

Some FASTA files contain additional information on their sequence name lines, which confuses RAxML, for example a line like (the name is a made-up example)

>Seq_123 Uncultured eukaryote clone

This script gets rid of everything after the first whitespace and turns the line into

>Seq_123

so that the file can be used with RAxML.
#!/bin/bash

ALN="path/to/alignment.fasta" # Euks.fasta or Apis.fasta

cat ${ALN} | sed "s/>\([^ ]*\) \(.*\)/>\1/g" > ${ALN}.clean
This step needs to be done for both reference alignments (Euks.fasta, Apis.fasta; E and A).
This script takes an alignment file in FASTA format as input and runs the RAxML check algorithm -f c on it. This produces a reduced alignment file, where sites without signal are removed.
A call to ./clean_alignment.sh might be necessary first in order to get a file that can be read by RAxML.
#!/bin/bash

RAXML="path/to/raxml"
ALN="path/to/alignment.fasta.clean" # Euks.fasta or Apis.fasta (cleaned)

${RAXML} -f c -m GTRGAMMA -s ${ALN} -n check_file
This step needs to be done for both reference alignments (Euks.fasta.clean, Apis.fasta.clean; E and A). The resulting files are then fed into the tree search.
The amount of sequences (at least in the case of Amplicons.fasta) is too large to process the whole file at once. Thus, we split the data into smaller chunks. This is done by using the count tables Amplicons.table and OTUs.table to create sequence files for each of the 154 sampling locations.
#!/usr/bin/python
from sets import Set
import os

# Input file names
seq_file = "path/to/alignment.fasta" # Amplicons.fasta or OTUs.fasta
tab_file = "path/to/alignment.table" # Amplicons.table or OTUs.table

# Output directory for the sequence files per sample
out_dir = "out_dir"
if not os.path.exists(out_dir):
    os.makedirs(out_dir)

# Prepare a list of sequences per sample. We assume 154 samples here, as given in the table files.
sample = []
for i in range(154):
    sample.append([])

# Process the table file
print "Reading table file "+tab_file
with open(tab_file) as tab_in:
    for line in tab_in:
        cols = line.split()
        if len(cols) != 156:
            print "Warning: Line does not contain 156 columns."
            exit()

        # First column is the sequence name, last column is the total abundance.
        name = cols[0]
        tot  = cols[-1]
        cols = cols[1:-1]
        if len(cols) != 154:
            print "Warning: Line does not contain 154 columns."
            exit()

        # Add the sequence name to every sample in which it has a count > 0,
        # and check that the per-sample counts sum up to the total.
        s = 0
        for i in range(154):
            s += int(cols[i])
            if int(cols[i]) > 0:
                sample[i].append(name)
        if s != int(tot):
            print "Warning: Sum "+str(s)+" != "+tot+"."

# Now we convert the list of sequences per sample into a set for faster lookup.
# (In our original script, we used the lists for some further checks. As this is not needed here,
# we could instead directly fill the set instead of first a list and then convert...)
sample_set = []
print "There are "+str(len(sample))+" samples:"
for i in range(154):
    print "    Sample "+str(i)+" has "+str(len(sample[i]))+" sequences."
for s in sample:
    sample_set.append(Set(s))
print "Extracted read names."

# Process the sequence file.
# (Note: this assumes two-line FASTA records, i.e., single-line sequences.)
print "Reading sequence file "+seq_file
with open(seq_file) as seq_in:
    while True:
        line1    = seq_in.readline()
        sequence = seq_in.readline()
        if not sequence:
            break
        if line1[0] != ">":
            print "Sequence does not start with >, aborting."
            exit()
        name = line1[1:].strip()
        for i in range(154):
            #~ print "name "+name+", i "+str(i)
            if name in sample_set[i]:
                #~ print "found "+name+" at "+str(i)
                out = open(out_dir+"/sample_"+str(i)+".fasta", "a")
                out.write(line1)
                out.write(sequence)
                out.close()
print "Done."
This script (and small variations of it to check different properties of the input data, like the correct number of sequences etc.) was run for the Amplicons.fasta and OTUs.fasta files (M and O). First on all sequences, and later, for the second part of the pipeline (the Apicomplexans, A), again on the Apicomplexan subset of the sequences. This results in a set of FASTA files (one FASTA file per sampling location) for each combination of either Eukaryotes (all sequences) or Apicomplexans (only this subset of the sequences), and either amplicons or OTUs. In total, four sets of FASTA files: E-M, E-O, A-M, A-O.
Remark: As there are amplicons and OTUs that appear in more than one sample, some of the sequences are duplicated in different chunks. This means that there is an overhead for calculating those duplicated placements multiple times. For the 10,567,804 amplicons, there are 1,618,894 duplications, which means that we needed to do 15.3% more computations. For the 29,092 OTUs, there are 52,511 duplications, so the number of computations increased by a factor of 2.8. As the total number of OTUs is however small enough (compared to the number of amplicons), runtimes are still short. A deduplication step is done in the sequence extraction step.
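The overhead numbers above can be verified with a quick calculation (shown here with bc, for illustration):

#!/bin/bash

echo "scale=3; 1618894 / 10567804" | bc      # .153  -> 15.3% extra amplicon computations
echo "scale=3; (29092 + 52511) / 29092" | bc # 2.804 -> ~2.8x the OTU computations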
There are other ways of splitting the data into smaller portions, for example by just separating the data into chunks of 100,000 sequences. This would circumvent the duplications. However, for some downstream analyses (not part of this paper), we were interested in a splitting according to sampling location. For other use cases, it might be helpful to split the data in a different manner. See also section Data Postprocessing for some more information related to this.
Given the reference alignments, we inferred reference trees. To improve the result, we ran 40 thorough searches (option -f o) with RAxML and selected the best-scoring (maximum likelihood) tree.
The following script runs independent instances of RAxML with random seeds. It stores the outputs in separate directories called sXYZ, according to the seed. The instances can of course also be executed in parallel (not shown here; the call to RAxML has to be changed to an appropriate cluster job call).
#!/bin/bash

# Settings and paths
JOBS=40
ALN="path/to/alignment.fasta" # Euks.fasta or Apis.fasta (cleaned, reduced)
RAXML="path/to/raxml"

# Run independent instances of RAxML.
# (The loop body is a serial reconstruction; on a cluster, submit each instance as a job.)
for (( i=0; i < ${JOBS}; i++ )); do
    SEED=${RANDOM}
    mkdir -p s${SEED}
    ${RAXML} -f o -m GTRGAMMA -p ${SEED} -s ${ALN} -n s${SEED}
    mv RAxML_* s${SEED}/
done
This step is necessary for both the Euks and Apis reference alignments (Euks.fasta and Apis.fasta; E and A, respectively their cleaned and reduced versions).
Now we need to select the best tree. For this, the following script scans the previously created directories containing the RAxML results and parses their output files.
#!/bin/bash

# Output file names
LH_FILE="LH_bests"
SORTED_LH_FILE="LH_bests_sorted"
rm -f ${LH_FILE} ${SORTED_LH_FILE}

# Scan the directories.
# (The loop is a reconstruction; adjust the grep pattern to the RAxML version used.)
for DIR in s*; do
    TREE=`ls ${DIR}/RAxML_bestTree.*`
    LH=`grep "Final GAMMA-based Score of best tree" ${DIR}/RAxML_info.* | sed "s/.* //g"`
    echo "${LH} ${TREE}" >> ${LH_FILE}
done

# Sort by likelihood, output the best tree
sort -n -r $LH_FILE > ${SORTED_LH_FILE}
BEST_RESULT=`head -n 1 ${SORTED_LH_FILE} | tr -s " " "\n" | sed -n '2p'`
echo "Best tree: $BEST_RESULT"
cp ${BEST_RESULT} best_tree.newick
The best tree is now stored in best_tree.newick, one for the Euks and one for the Apis (EU and AU).
The tree inference with taxonomic constraint was carried out using Sativa. In this step, Sativa infers a constrained tree. This is again the best tree from 40 maximum likelihood runs. Internally, Sativa uses RAxML for tree inference.
This step is again necessary for both the Euks and Apis reference alignments (E and A), and needs a taxonomic constraint file as additional input. Sativa then yields a so-called refjson file, which is meant to be used for Sativa's original purpose of mislabel detection in taxonomies. However, we "misuse" Sativa and its output here, because our goal is not to check the consistency and correctness of the taxonomy. Instead, we only extract the constrained tree from the result file.
It is also possible to do this with RAxML directly. However, when designing the pipeline, we wanted to have easy access to the capabilities of Sativa. It was not needed in the end, but as the additional runtime overhead is rather small, we decided to keep it this way.
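For reference, a direct RAxML run would look roughly like the following sketch. It assumes that a (multifurcating) constraint tree in Newick format has already been derived from the taxonomy file, which is not shown here. The Sativa-based script that we actually used follows after this sketch.

#!/bin/bash

# Hypothetical direct alternative to Sativa: RAxML with a constraint tree (-g).
# Deriving constraint.newick from the .tax file is not shown.
${RAXML} -g constraint.newick -m GTRGAMMA -p 12345 -s ${ALI} -n constr_tree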
#!/bin/bash

ALI="path/to/alignment.fasta" # Euks.fasta or Apis.fasta (cleaned, reduced)
TAX="path/to/taxonomy.tax"    # Euks.tax or Apis.tax
SATIVA="path/to/sativa"

# Run sativa
${SATIVA}/epa_trainer.py -s ${ALI} -t ${TAX} -n Euks_Constr -x ZOO -no-hmmer -N 40

# Extract the best tree from the result file.
grep "\"raxmltree\":" Euks_Constr.refjson \
    | sed "s/[ ]\+\"raxmltree\"\:\ \"//g" \
    | sed "s/(r_/(/g" \
    | sed "s/,r_/,/g" \
    | sed "s/;\",[ ]*$/;/g" \
    > best_tree.newick
The best tree is written to best_tree.newick, again one for the Euks and one for the Apis (EC and AC).
At this stage, we have created the following files: the cleaned and reduced reference alignments (E, A), the per-sample sequence files for amplicons and OTUs (M, O), and the best reference trees for the Unconstrained and Constrained case (EU, EC, AU, AC).
The next step is now to align those sequence files to their respective references. We use PaPaRa for this, which takes the sequences, the reference alignment, and also the reference tree as input. By taking the tree into account, PaPaRa is able to use the phylogenetic signal of the sequences to get a better alignment.
As this alignment step takes all input data into account, it has to be run for all 8 analyses. Furthermore, for each of them, it is run for all of the 154 sequence files. The call to PaPaRa within the loop can be replaced by an according call to a cluster submission in order to parallelize the process. This is highly recommended, as this step takes a while.
#!/bin/bash

ALI="path/to/alignment.fasta"   # Euks.fasta or Apis.fasta (cleaned, reduced)
TREE="path/to/best_tree.newick" # Euks or Apis tree
SAMPLES="path/to/sample_dir"    # (Euks or Apis sequences) and (Amplicons or OTUs)
PAPARA="path/to/papara"

# Align each sample against the reference.
# (The loop is a serial reconstruction; check the query/reference options of your PaPaRa version.)
for (( i=0; i < 154; i++ )); do
    ${PAPARA} -t ${TREE} -s ${ALI} -q ${SAMPLES}/sample_${i}.fasta -n sample_${i}
done
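To parallelize this step on a cluster, the call to PaPaRa inside the loop can be replaced by a job submission. A minimal sketch, assuming a SLURM scheduler (the wrapper script papara_job.sh is hypothetical and would contain the PaPaRa call for one sample):

#!/bin/bash

for (( i=0; i < 154; i++ )); do
    sbatch --job-name="papara_sample_${i}" papara_job.sh ${i}
done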
This results in 154 alignment files (in PHYLIP format) for each of the 8 analyses. The output files are named papara_alignment.sample_i, where i is the sample number (0-153).
The alignment files resulting from the previous step were then fed into RAxML-EPA in order to get a phylogenetic placement for all of the sequences.
For our pipeline, we further split the data into chunks of 20,000 query sequences in order to increase parallel execution speed. This was done by creating FASTA files that all contain the reference of either 512 or 190 sequences (E or A) plus up to 20,000 sequences from the previously split 154 files. After running EPA, these chunks were combined again. This is possible because the query sequences are independent in EPA (there is no order dependency, they do not change the tree, etc.). It however turned out that this was not necessary, as the runtime of EPA was short enough for our cluster (<48h) for each sample anyway. Thus, this splitting step is omitted here for simplicity.
In the following script, we set the option --epa-accumulated-threshold to 99.9%. This option controls how many of the possible placement positions are output by the program - in this case, as many as needed to get a sum of the likelihood weights that is close to 1.0 (the weights are sorted, so the most probable ones are output first). The default for this option is 95%. Most downstream analyses work with the most probable placement position only, hence it would normally suffice to use this default. We however need the higher threshold for the clade label annotation.
#!/bin/bash

RAXML=path/to/raxml
TREE=path/to/best_tree.newick # Euks or Apis

# Run EPA for each sample alignment.
# (The loop is a serial reconstruction; adjust file names to your setup.)
for (( i=0; i < 154; i++ )); do
    ${RAXML} -f v -m GTRGAMMA --epa-accumulated-threshold=0.999 \
        -t ${TREE} -s papara_alignment.sample_${i} -n sample_${i}
done
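To illustrate the effect of this threshold, consider a query whose sorted likelihood weights are 0.6, 0.3, 0.08, 0.015, and 0.005: the default 95% threshold would keep only the first three positions (accumulated weight 0.98), while the 99.9% threshold keeps all five. A small illustration of the accumulation logic:

#!/bin/bash

echo "0.6 0.3 0.08 0.015 0.005" | awk -v thresh=0.999 '{
    acc = 0
    for (i = 1; i <= NF; i++) {
        acc += $i
        printf "kept placement %d, accumulated weight %.3f\n", i, acc
        if (acc >= thresh) break
    }
}'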
This step is run for the Euks and Apis reference (with respective sequences; EU, AU) and for the amplicons and OTUs, for a total of 4 analyses (EU-M, EU-O, AU-M, AU-O). The output files (in jplace format) are named sample_i.jplace, where i is the sample number (0-153).
For the constrained case, we again used Sativa. It is also possible to use RAxML directly, as shown in the previous section. However, for some downstream analyses (not part of this paper), we wanted to have the full Sativa result, too. For completeness, we also show this script here. The result is the same, because internally, Sativa runs RAxML in the same way as shown before.
#!/bin/bash

SATIVA=path/to/sativa
REF=path/to/ref.refjson # Euks or Apis

# Run the Sativa classifier (which internally calls RAxML-EPA) for each sample.
# (The loop is a serial reconstruction; check the options of your Sativa version.)
for (( i=0; i < 154; i++ )); do
    ${SATIVA}/epa_classifier.py -r ${REF} -q sample_${i}.fasta -n sample_${i}
done
This step is run for the Euks and Apis reference (with respective sequences) and for the amplicons and OTUs, for a total of 4 analyses (EC-M, EC-O, AC-M, AC-O). The output files (in jplace format) are named sample_i.jplace, where i is the sample number (0-153).
As mentioned in the section on Data Preparation, some of the sequences occur multiple times in different samples. This also means that the placements for those sequences are calculated multiple times (once for each sample they appear in). As the EPA is deterministic, this is in principle not an issue: given identical input, all the results are the same.
However, for estimating the mutation rates of the sequences (which are important for evaluating the phylogenetic likelihood function), the EPA uses the empirical base frequencies of the nucleotides of the whole data, i.e., the reference AND the query sequences. This means that the rates might slightly differ between samples, because different samples contain different query sequences, which might have a different distribution of nucleotides. In the current implementation, this also affects the branch lengths of the output tree, but only after the 7th meaningful digit, so we decided to ignore this for the trees themselves. However, this whole issue can lead to placing the same (duplicated) sequence on different branches in different samples.
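To see where this comes from, one can compare the nucleotide composition of two sample files (a quick illustration; the file names are placeholders):

#!/bin/bash

# Per-nucleotide counts of two query sample files: different samples have slightly
# different base compositions, and hence yield different empirical base frequencies
# when combined with the reference.
for S in sample_0.fasta sample_1.fasta; do
    echo ${S}
    grep -v "^>" ${S} | fold -w1 | sort | uniq -c
done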
In our data, this issue occurred in the Unconstrained Apicomplexan datasets (AU), for both the amplicons and OTUs (AU-M and AU-O). In the former case, this affected 592 sequences (0.008% of 7,592,831 Apicomplexan amplicons); in the latter case, 7 sequences (0.05% of 13,953 Apicomplexan OTUs). It did not happen in the Constrained case (AC), and not for the Euks (EU and EC). The affected sequences "jumped" only between nearby branches, which further confirms that this is an issue related to the phylogenetic likelihood evaluation (as opposed to random or other systematic errors).
As the impact of this issue is rather small, both in terms of affected sequences and difference in placement position, we decided to ignore this and simply use the first of the resulting positions.
As mentioned in the section on Unconstrained Phylogenetic Placement, EPA outputs multiple possible placement positions with different likelihood weights (probabilities), which sum up to 1.0 over all branches of the tree. For the purposes of abundance counting and visualization, we were only interested in the most probable position. Thus, we ran the following tool to get rid of all but this placement.
It is implemented in C++ using our genesis toolkit. See the introduction of this document for instructions on how to get this to work.
// file: max_weight_placement.cpp

#include <string>

#include "placement/functions.hpp"
#include "placement/io/jplace_processor.hpp"
#include "placement/placement_map.hpp"
#include "utils/core/logging.hpp"

using namespace genesis;

int main (int argc, char** argv)
{
    // Activate Logging.
    Logging::log_to_stdout();

    // Get the dir containing the jplace files.
    if (argc != 2) {
        LOG_WARN << "Need to provide base dir containing jplace files.";
        return 1;
    }
    auto base_dir = text::trim_right( std::string( argv[1] ), "/") + "/";
    LOG_INFO << "Jplace dir: " << base_dir;

    // Process all files.
    for (size_t i = 0; i < 154; i++) {
        LOG_INFO << "=====================================================";
        LOG_INFO << "Sample " << i;

        // Read placement file.
        PlacementMap map;
        std::string jfile = base_dir + "sample_" + std::to_string(i) + ".jplace";
        if (! JplaceProcessor().from_file(jfile, map) ) {
            LOG_WARN << "Could not load jplace file " << jfile;
            return 1;
        }

        // Output information before.
        LOG_INFO << "Before reduction to max weight placement:";
        LOG_INFO << "Pquery count    " << map.pquery_size();
        LOG_INFO << "Placement count " << map.placement_count();

        // Delete all but the most probable placement and save.
        map.restrain_to_max_weight_placements();
        JplaceProcessor().to_file(map, base_dir + "sample_" + std::to_string(i) + "_max.jplace");

        // Output information after.
        LOG_INFO << "After reduction to max weight placement:";
        LOG_INFO << "Pquery count    " << map.pquery_size();
        LOG_INFO << "Placement count " << map.placement_count();
    }
}
The program takes the .jplace files (resulting from the EPA run) and outputs new .jplace files, which only contain the most probable placement position, as given by the like_weight_ratio. As a side effect, this also reduces the amount of data for downstream analyses.

To run the program, call bin/max_weight_placement base_dir/ in the genesis directory. Here, base_dir is the directory in which the jplace files from the EPA run are stored. This results in files called sample_i_max.jplace, where i is the sample number (0-153).
After running the pipeline with the Eukaryotic alignment and data, we ended up with .jplace files per sample (0-153) for the Unconstrained and Constrained tree and for amplicons and OTUs, for a total of 4 analyses (EU-M, EU-O, EC-M, EC-O).
In this section, we discuss the steps to extract those sequences (amplicons and OTUs) that were placed into the Apicomplexan clade on the Eukaryotes tree. Those sequences were then fed back into the pipeline for the remaining 4 analyses (AU-M, AU-O, AC-M, AC-O).
The first step is to get a representation of the clades that can be read by our toolkit.
For this, we used the input file Euks.tax, which was also used to create the constrained tree (EC). This file contains a taxonomic annotation for all 512 eukaryotic reference sequences. To get the clade annotation, we simply use the second level of that taxonomy.
By replacing all semicolons (;) in the tax file by spaces, the file can be loaded into a spreadsheet application (Microsoft Excel or OpenOffice Calc) as a CSV file. Then, by selecting the first column (taxa names) and the third column (which corresponds to the second taxonomy level) and copying those columns into a text file, we obtain the needed representation.
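Equivalently, this text file can be produced directly on the command line, avoiding the manual spreadsheet step (a sketch; it assumes that each line of the tax file starts with the taxon name, followed by the semicolon-separated taxonomy):

#!/bin/bash

# Replace semicolons by spaces, then print the taxon name (column 1) and the
# second taxonomy level (column 3), yielding the clade_list file used below.
sed "s/;/ /g" Euks.tax | awk '{ print $1, $3 }' > clade_list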
The following script then creates single files for each clade (in the directory clades), each of which contains the taxa of that clade. The script needs to be provided with the text file from above (here called clade_list) as input.
#!/bin/bash

rm -rf clades
mkdir clades

# For each line of the input, append the taxon name to the file of its clade.
# (The loop body is a reconstruction.)
while read TAXON CLADE; do
    echo "${TAXON}" >> clades/${CLADE}
done < clade_list
The resulting clade files are the input for subsequent steps. They are also needed for visualizing the clade annotated trees later. Thus, this script has to be run for both taxonomic constraints (Euks.tax and Apis.tax).
In this step, we extract those sequences from the dataset (amplicons and OTUs) which were placed into the Apicomplexan clade in the previous placement step. As mentioned earlier, the EPA yields different possible placement positions for each sequence. This implies that a sequence can be placed in different clades. This can happen when either the reference is sparse or the sequence does not fit the reference well (new species, sequencing errors, chimeras, etc.). Thus, we have to filter.
We apply a 95% threshold for the clade annotation. That means, we only extract a sequence and assign a clade label to it if at least 95% of its placement probability (measured via the like_weight_ratio) falls into the same clade. All other sequences are discarded.
For the amplicons, 574,833 out of 10,567,804 sequences were discarded this way, which is 5.4%. For the OTUs, 2,211 out of 29,092 sequences were discarded, which is 7.6%. In total, this means that by using Evolutionary Placement, we were able to keep >92% of all sequences. This is in contrast to other methods, which need the sequences to be closer to a known reference database and would thus discard much more of our data (we estimated about 75%, so they would only keep 25%).
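Again, the percentages can be checked quickly:

#!/bin/bash

echo "scale=3; 574833 / 10567804" | bc # .054 -> 5.4% of the amplicons discarded
echo "scale=3; 2211 / 29092" | bc      # .076 -> 7.6% of the OTUs discarded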
Extracting the sequence names using the clades is done by the following program. As we need the information of the whole placement weights, we use the original .jplace files here (instead of the ones with only the max weight). It is implemented in C++ using our genesis toolkit. See the introduction of this document for instructions on how to get this to work.
// file extract_sequence_names.cpp

#include <algorithm>
#include <assert.h>
#include <cmath>
#include <numeric>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

#include "placement/functions.hpp"
#include "placement/io/jplace_processor.hpp"
#include "placement/placement_map.hpp"
#include "tree/bipartition/bipartition_set.hpp"
#include "tree/default/functions.hpp"
#include "tree/tree.hpp"
#include "utils/core/fs.hpp"
#include "utils/core/logging.hpp"
#include "utils/text/string.hpp"

using namespace genesis;

int main (int argc, char** argv)
{
    // Activate logging.
    Logging::log_to_stdout();

    // Get the dir containing the jplace files.
    if (argc != 2) {
        LOG_WARN << "Need to provide base dir containing jplace files.";
        return 1;
    }
    auto base_dir = text::trim_right( std::string( argv[1] ), "/") + "/";
    LOG_INFO << "Jplace dir: " << base_dir;

    // Threshold.
    const double precision = 0.95;

    // Sativa prepends a "r_" to all taxa names. We remove this.
    const std::string taxon_prefix = "r_";

    // Create output directories.
    utils::dir_create(base_dir + "clades_extr");
    utils::dir_create(base_dir + "names");

    // Get a list of the files containing the clades.
    std::vector<std::string> clade_files;
    utils::dir_list_files(base_dir + "clades", clade_files);
    std::sort(clade_files.begin(), clade_files.end());

    // Create a list of all clades and fill each clade with its taxa.
    std::vector<std::pair<std::string, std::vector<std::string>>> clades;
    for (auto cf : clade_files) {
        auto taxa = text::split( utils::file_read(base_dir + "clades/" + cf), "\n" );
        std::sort(taxa.begin(), taxa.end());
        clades.push_back(std::make_pair(cf, taxa));
    }

    // Process all samples.
    for (size_t i = 0; i < 154; i++) {
        LOG_INFO << "============================================================";
        LOG_INFO << "Sample " << i;

        // Read placement file.
        PlacementMap map;
        std::string jfile = base_dir + "samples/sample_" + std::to_string(i) + ".jplace";
        if( !JplaceProcessor().from_file(jfile, map) ) {
            LOG_WARN << "Could not read jplace file " << jfile;
            return 1;
        }

        // Output information.
        LOG_INFO << "Pquery count    " << map.pquery_size();
        LOG_INFO << "Placement count " << map.placement_count();

        // For each clade, make a list of all edges. We use a vector to preserve order.
        std::vector<std::pair<std::string, std::unordered_set<PlacementTree::EdgeType*>>> clade_edges;

        // Make a set of all edges that do not belong to any clade (basal branches).
        // We first fill it with all edges, then remove the clade-edges later.
        std::unordered_set<PlacementTree::EdgeType*> basal_branches;
        for (auto it = map.tree().begin_edges(); it != map.tree().end_edges(); ++it) {
            basal_branches.insert(it->get());
        }

        // Extract clade subtrees.
        LOG_INFO << "Extract clade subtrees...";
        for (auto& clade : clades) {
            std::vector<PlacementTree::NodeType*> node_list;

            // Find the nodes that belong to the taxa of this clade.
            for (auto taxon : clade.second) {
                PlacementTree::NodeType* node = find_node( map.tree(), taxon_prefix + taxon);
                if (node == nullptr) {
                    node = find_node( map.tree(), taxon);
                }
                if (node == nullptr) {
                    LOG_WARN << "Cannot find taxon " << taxon;
                    continue;
                }
                node_list.push_back(node);
            }

            // Find the edges that are part of the subtree of this clade.
            auto bps = BipartitionSet<PlacementTree>(map.tree());
            auto smallest = bps.find_smallest_subtree (node_list);
            auto subedges = bps.get_subtree_edges(smallest->link());

            // Add them to the clade edges list.
            clade_edges.push_back(std::make_pair(clade.first, subedges));

            // Remove edges from the non-clade edges list.
            std::vector<std::string> clades_extr;
            for (auto& e : subedges) {
                if (e->primary_node()->is_leaf()) {
                    clades_extr.push_back(e->primary_node()->data.name);
                }
                if (e->secondary_node()->is_leaf()) {
                    clades_extr.push_back(e->secondary_node()->data.name);
                }
                basal_branches.erase(e);
            }

            // Only write out inferred clades in first iteration.
            // (as the reference tree is the same for all 154 samples, the other iterations
            // will yield the same result, so we can skip this)
            if (i == 0) {
                std::sort(clades_extr.begin(), clades_extr.end());
                for (auto ce : clades_extr) {
                    utils::file_append(base_dir + "clades_extr/" + clade.first,
                        text::replace_all(ce, " ", "_") + "\n");
                }
            }
        }
        clade_edges.push_back(std::make_pair("basal_branches", basal_branches));

        // Normalize.
        map.normalize_weight_ratios();

        // Collect the accumulated positions within the clades for each pquery.
        for (auto& pqry : map.pqueries()) {
            std::vector<double> edge_clade_vec (clade_edges.size(), 0.0);

            // For each placement, find its edge and accumulate the edge's clade counter by
            // the placements like weight ratio.
            for (auto& place : pqry->placements) {
                bool found_edge = false;
                for (size_t j = 0; j < clade_edges.size(); ++j) {
                    if (clade_edges[j].second.count(place->edge) > 0) {
                        edge_clade_vec[j] += place->like_weight_ratio;

                        // Make sure that we do not count this placement twice
                        // (can only happen if clade_edges is wrong).
                        assert(found_edge == false);
                        if (found_edge){
                            LOG_WARN << "Already found this edge!";
                        }
                        found_edge = true;
                    }
                }

                // If the placement was not found within the clades, clade_edges is wrong.
                if (!found_edge) {
                    LOG_WARN << "Edge not found!";
                }
                assert(found_edge);
            }

            // Check total like weight ratio sum. If too different from 1.0, there is something
            // wrong and we need to manually inspect this pquery. Could just be a weird result
            // of EPA, so nothing too serious, but better make sure we check it.
            double sum = std::accumulate (edge_clade_vec.begin(), edge_clade_vec.end(), 0.0);
            if (std::abs(sum - 1.0) > 0.01) {
                LOG_WARN << "Placement with sum " << sum;
            }

            // If there is a clade that has more than 95% of the placements weight ratio,
            // this is the one we assign the pquery to.
            assert(edge_clade_vec.size() == clade_edges.size());
            bool found_max = false;
            std::string all_line = pqry->names[0]->name;
            for (size_t j = 0; j < edge_clade_vec.size(); ++j) {
                if (edge_clade_vec[j] >= precision) {
                    assert(!found_max);
                    found_max = true;
                    utils::file_append(base_dir + "names/" + clade_edges[j].first,
                        pqry->names[0]->name + "\n");
                }
                if (edge_clade_vec[j] > 0.0) {
                    all_line += " " + clade_edges[j].first
                             +  "(" + std::to_string(edge_clade_vec[j]) + ")";
                }
            }

            // If there is no sure assignment (<95%), we put the pquery in an extra list of
            // uncertain sequences.
            if (!found_max) {
                utils::file_append(base_dir + "names/uncertain", pqry->names[0]->name + "\n");
                std::string line = pqry->names[0]->name;
                for (size_t j = 0; j < edge_clade_vec.size(); ++j) {
                    if (edge_clade_vec[j] > 0.0) {
                        line += " " + clade_edges[j].first
                             +  "(" + std::to_string(edge_clade_vec[j]) + ")";
                    }
                }
            }
        }
    }

    LOG_INFO << "Finished.";
    return 0;
}
This program expects a base_dir directory path as input, which contains the following subdirectories and files:

- base_dir/clades/: The directory with the clade files from the previous step.
- base_dir/samples/: The directory with the placement data in sample_i.jplace files, where i is the sample number (0-153). It is recommended to provide a symlink to the directory from the EPA step in order to avoid copying the data.

To run it, call

bin/extract_sequence_names base_dir/

from the genesis main directory.

The program then creates two output directories with the following contents:

- base_dir/clades_extr/: Contains files with the names of the clades, each containing a list of the taxa of this clade. This is how the clades were extracted in the program, based on the clade input and the tree. Thus, the produced files should exactly match the ones in base_dir/clades/. If not, something went wrong. This directory hence serves as an error checking mechanism.
- base_dir/names/: Contains files with the names of the clades, each containing a list of the sequence names that fell into this clade. This is the actual information we are interested in.

As mentioned in the query sequence preparation, there are duplications in the sequences resulting from our splitting step. Those sequence names now occur multiple times in the extracted names, which we thus need to clean. This is done with the following script, which needs to be called in the base_dir.
#!/bin/bash

rm -rf names_uniq
mkdir names_uniq

# Deduplicate the name list of each clade. (The loop body is a reconstruction.)
cd names
for CLADE in *; do
    sort -u ${CLADE} > ../names_uniq/${CLADE}
done
cd ..
It creates a new directory names_uniq, which contains the unique sequence names per clade.

In a last step, we want to create new FASTA files given the sequence names. For this, we use the following script extract_sequences.py:
#!/usr/bin/python
from sets import Set
import os, sys

# This script looks for files with query sequence names in names/
# and then extracts all sequences from the according fasta file
# into single fasta files for each list in names/.

# Input file names.
seq_file = "path/to/sequences.fasta" # Amplicons.fasta or OTUs.fasta

# Output directory.
out_dir = "seqs"
if not os.path.exists(out_dir):
    os.makedirs(out_dir)

# Get clade name from command line.
if len(sys.argv) != 2:
    print "Expecting clade name as argument."
    sys.exit()
clade_name = sys.argv[1]
print "Processing clade", clade_name

# Set file names.
list_file = "names/"+clade_name

# Prepare a set of read names from the list file.
# (As this is a set, duplicated names are stored only once.)
count = 0
read_set = Set()
print "Reading list file", list_file
with open(list_file) as list_in:
    for line in list_in:
        rn = line.strip()
        if " " in rn:
            rn = rn.split(" ")[0]
        read_set.add(rn)
        count += 1
print "There are", count, "sequences in", list_file

# Process the sequence file.
# (Note: this assumes two-line FASTA records, i.e., single-line sequences.)
print "Reading sequence file "+seq_file
out = open(out_dir+"/"+clade_name+".fasta", "w")
with open(seq_file) as seq_in:
    while True:
        line1    = seq_in.readline()
        sequence = seq_in.readline()
        if not sequence:
            break
        if line1[0] != ">":
            print "Sequence does not start with >, aborting."
            exit()
        name = line1[1:].strip()
        if name in read_set:
            out.write(line1)
            out.write(sequence)
out.close()
print "Done."
It needs to be run for every clade name in names (or names_uniq; it does not matter which, because the script itself deduplicates the names again). To do so, call the following script from the base_dir.
#!/bin/bash

mkdir seqs

# Run the extraction for every clade. (The loop is a reconstruction.)
for CLADE in names/*; do
    ./extract_sequences.py `basename ${CLADE}`
done
The result is stored in a new directory called seqs. It contains FASTA files for each of the clades provided in the names directory. Each FASTA file contains those sequences from the original sequence files (Amplicons.fasta or OTUs.fasta) that were placed in the respective clade.
This step needs to be run for the Euks amplicons and OTUs (E-M, E-O).
The resulting FASTA files are then the input for the second round of analyses, i.e., the 4 Apicomplexan analysis runs (AU-M, AU-O, AC-M, AC-O). The files are first used again in the Alignment step in those analyses.
In a first step, we visualized the clades on the tree. This is mostly an error checking and preparation step for later visualizations. The result was also used for determining the basal branches, which are shaded gray in the clade annotated trees.
It is implemented in C++ using our genesis toolkit. See the introduction of this document for instructions on how to get this to work.
// file visualize_clades.cpp

#include <algorithm>
#include <sstream>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

#include "tree/bipartition/bipartition_set.hpp"
#include "tree/default/functions.hpp"
#include "tree/default/newick_processor.hpp"
#include "tree/default/tree.hpp"
#include "tree/io/newick/color_mixin.hpp"
#include "tree/io/newick/processor.hpp"
#include "tree/tree.hpp"
#include "utils/core/logging.hpp"
#include "utils/io/nexus/document.hpp"
#include "utils/io/nexus/taxa.hpp"
#include "utils/io/nexus/trees.hpp"
#include "utils/io/nexus/writer.hpp"
#include "utils/tools/color.hpp"
#include "utils/tools/color/names.hpp"

using namespace genesis;

// write_color_tree_nexus
void write_color_tree_nexus(
    DefaultTree const&        tree,
    std::vector<color::Color> color_vec,
    std::string               filename
) {
    typedef NewickColorMixin<DefaultTreeNewickProcessor> ColorProcessor;

    auto proc = ColorProcessor();
    proc.edge_colors(color_vec);
    std::string tree_out = proc.to_string(tree);

    auto doc = nexus::Document();

    auto taxa = make_unique<nexus::Taxa>();
    taxa->add_taxa(node_names(tree));
    doc.set_block( std::move(taxa) );

    auto trees = make_unique<nexus::Trees>();
    trees->add_tree( "tree1", tree_out );
    doc.set_block( std::move(trees) );

    std::ostringstream buffer;
    auto writer = nexus::Writer();
    writer.to_stream( doc, buffer );
    auto nexus_out = buffer.str();

    utils::file_write(filename, nexus_out);
}

// clade_color_tree
void clade_color_tree( std::string base_dir )
{
    // List of clade files.
    std::vector<std::string> clade_files;
    utils::dir_list_files(base_dir + "clades", clade_files);

    // Sativa prepends a "r_" to all taxa names. We remove this.
    std::string taxon_prefix = "r_";

    // Create a list of all clades and fill each clade with its taxa.
    std::vector<std::pair<std::string, std::vector<std::string>>> clades;
    for (auto cf : clade_files) {
        auto taxa = text::split( utils::file_read(base_dir + "clades/" + cf), "\n" );
        std::sort(taxa.begin(), taxa.end());
        clades.push_back(std::make_pair(cf, taxa));
    }

    // Read tree file.
    std::string tfile = base_dir + "best_tree.newick";
    DefaultTree tree;
    if( utils::file_exists( tfile ) ) {
        DefaultTreeNewickProcessor().from_file( tfile, tree );
    } else {
        LOG_WARN << "Tree file " << tfile << " does not exist.";
        return;
    }

    // Remove taxon prefix from taxon names. This usually is "r_" from SATIVA runs.
    for( auto nit = tree.begin_nodes(); nit != tree.end_nodes(); ++nit ) {
        auto& n = **nit;
        if( n.data.name.substr(0, taxon_prefix.size()) == taxon_prefix ) {
            n.data.name = n.data.name.substr(taxon_prefix.size());
        }
    }

    // Initialize color vector with pink to mark unprocessed edges
    // (there should be none left after the next steps).
    auto color_vec = std::vector<color::Color>( tree.edge_count(), color::Color(255,0,255) );

    // Make a set of all edges that do not belong to any clade.
    // We first fill it with all edges, then remove the clade-edges later.
    std::unordered_set<DefaultTree::EdgeType*> non_clade_edges;
    for (auto it = tree.begin_edges(); it != tree.end_edges(); ++it) {
        non_clade_edges.insert(it->get());
    }

    // Define a nice color scheme (based on web colors).
    std::vector<std::string> scheme = {
        "Crimson", "DarkCyan", "DarkGoldenRod", "DarkGreen", "DarkOrchid", "DeepPink",
        "DodgerBlue", "DimGray", "GreenYellow", "Indigo", "MediumVioletRed", "MidnightBlue",
        "Olive", "Orange", "OrangeRed", "Peru", "Purple", "SeaGreen", "DeepSkyBlue",
        "RoyalBlue", "SlateBlue", "Tomato", "YellowGreen"
    };

    LOG_INFO << "Examining clades...";
    size_t clade_num = 0;
    for (auto& clade : clades) {
        std::vector<DefaultTree::NodeType*> node_list;

        // Find the nodes that belong to the taxa of this clade.
        for (auto taxon : clade.second) {
            DefaultTree::NodeType* node = find_node( tree, taxon_prefix + taxon );
            if (node == nullptr) {
                node = find_node( tree, taxon);
            }
            if (node == nullptr) {
                LOG_WARN << "Couldn't find taxon " << taxon;
                continue;
            }
            node_list.push_back(node);
        }

        // Find the edges that are part of the subtree of this clade.
        auto bps = BipartitionSet<DefaultTree>(tree);
        auto smallest = bps.find_smallest_subtree (node_list);
        auto subedges = bps.get_subtree_edges(smallest->link());

        // Color all edges that fall into this clade with one of the color scheme colors.
        for (auto& e : subedges) {
            // Error check.
            if( non_clade_edges.count(e) == 0 ) {
                LOG_WARN << "Edge at " << e->primary_node()->data.name
                         << e->secondary_node()->data.name << " already done...";
            }

            // Remove this edge from the non-clade edges list. Apply color.
            non_clade_edges.erase(e);
            color_vec[e->index()] = color::get_named_color( scheme[clade_num] );
        }
        ++clade_num;
    }

    // Debug info.
    LOG_INFO << "Out of clade edges: " << non_clade_edges.size();

    // Color all basal branches, then write the tree file to nexus format.
    for (auto& e : non_clade_edges) {
        color_vec[e->index()] = color::Color(192,192,192);
    }
    write_color_tree_nexus(tree, color_vec, base_dir + "clade_colors.nexus");
}

// main
int main( int argc, char** argv )
{
    // Activate logging.
    Logging::log_to_stdout();

    if (argc != 2) {
        LOG_WARN << "Need to provide base dir.";
        return 1;
    }
    auto base_dir = text::trim_right( std::string( argv[1] ), "/") + "/";
    LOG_INFO << "base dir: " << base_dir;

    clade_color_tree( base_dir );

    LOG_INFO << "Finished.";
    return 0;
}
This program expects a base_dir directory path as input, which contains the following subdirectories and files:

- base_dir/clades/: The directory with the clade files from the Preparing the Clade Annotation step.
- base_dir/best_tree.newick: The tree used for the analysis (Euks or Apis, Constrained or Unconstrained).

To run it, call

bin/visualize_clades base_dir/

from the genesis main directory.
The program then creates a file clade_colors.nexus, which is a tree in nexus format where all branches that belong to a certain clade are colored the same (and different clades in different colors). This file can be viewed using e.g. FigTree. This visualization serves as a check whether the clades are correct. It is also used to determine the basal branches (which are colored gray). Those are the branches which are shaded in the clade annotated tree in the main text.
This step is run for the Euks and Apis tree, and for the Constrained and Unconstrained case (EU, EC, AU, AC).
This is the main visualization, which results in the trees shown in the main text and supplement with branches colored in a light blue, purple and black gradient.
In the first step, we use the EPA result (.jplace files) to create a tree with branches colored according to the placement count per branch.
// file visualize_placements.cpp

#include <algorithm>
#include <assert.h>
#include <cmath>
#include <map>
#include <numeric>
#include <sstream>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

#include "placement/functions.hpp"
#include "placement/io/edge_color.hpp"
#include "placement/io/jplace_processor.hpp"
#include "placement/io/newick_processor.hpp"
#include "placement/io/serializer.hpp"
#include "placement/placement_map.hpp"
#include "tree/bipartition/bipartition_set.hpp"
#include "tree/default/functions.hpp"
#include "tree/default/newick_processor.hpp"
#include "tree/io/newick/color_mixin.hpp"
#include "tree/io/newick/processor.hpp"
#include "tree/tree.hpp"
#include "utils/core/logging.hpp"
#include "utils/io/nexus/document.hpp"
#include "utils/io/nexus/taxa.hpp"
#include "utils/io/nexus/trees.hpp"
#include "utils/io/nexus/writer.hpp"
#include "utils/tools/color.hpp"
#include "utils/tools/color/gradient.hpp"
#include "utils/tools/color/names.hpp"
#include "utils/tools/color/operators.hpp"

using namespace genesis;

// write_color_tree_nexus
void write_color_tree_nexus(
    PlacementTree const&      tree,
    std::vector<color::Color> color_vec,
    std::string               filename
) {
    typedef NewickColorMixin<PlacementTreeNewickProcessor> ColorProcessor;

    auto proc = ColorProcessor();
    proc.enable_edge_nums(false);
    proc.edge_colors(color_vec);
    std::string tree_out = proc.to_string(tree);

    auto doc = nexus::Document();

    auto taxa = make_unique<nexus::Taxa>();
    taxa->add_taxa(node_names(tree));
    doc.set_block( std::move(taxa) );

    auto trees = make_unique<nexus::Trees>();
    trees->add_tree( "tree1", tree_out );
    doc.set_block( std::move(trees) );

    std::ostringstream buffer;
    auto writer = nexus::Writer();
    writer.to_stream( doc, buffer );
    auto nexus_out = buffer.str();

    utils::file_write(filename, nexus_out);
}

// placement_count_color_tree
void placement_count_color_tree( std::string base_dir )
{
    // ----------------------------------------------------
    //     Clade Init
    // ----------------------------------------------------

    // List of clade files.
    std::vector<std::string> clade_files;
    utils::dir_list_files(base_dir + "clades", clade_files);

    // Create a list of all clades and fill each clade with its taxa.
    std::vector<std::pair<std::string, std::vector<std::string>>> clades;
    for (auto cf : clade_files) {
        auto taxa = text::split( utils::file_read(base_dir + "clades/" + cf), "\n" );
        std::sort(taxa.begin(), taxa.end());
        clades.push_back(std::make_pair(cf, taxa));
    }

    std::unordered_map<size_t, std::string> clade_num_map;
    std::unordered_map<size_t, std::string> edge_index_to_clade_map;
    std::map<std::string, size_t> clade_count;
    std::map<std::string, double> clade_mass;

    // ----------------------------------------------------
    //     Branch Init
    // ----------------------------------------------------

    std::vector<int>    index_to_edgenum;
    std::vector<size_t> placement_count;
    std::vector<double> placement_mass;

    std::unordered_map<std::string, int> taxa_done;
    size_t taxa_inconsistent = 0;

    // Sativa prepends a "r_" to all taxa names. We remove this.
    std::string taxon_prefix = "r_";

    PlacementTree tree0;

    // ----------------------------------------------------
    //     Iterate all Jplace files
    // ----------------------------------------------------

    // Iterate all samples and collect placement counts.
    size_t total_placement_count = 0;
    for (size_t i = 0; i < 154; i++) {

        // --------------------------------
        //     Read files
        // --------------------------------

        // Read placement file.
        PlacementMap map;
        std::string jfile = base_dir + "samples/sample_" + std::to_string(i) + "_max.jplace";
        if( !JplaceProcessor().from_file(jfile, map) ) {
            LOG_ERR << "Couldn't read jplace file " << jfile;
            return;
        }
        total_placement_count += map.placement_count();
        auto& tree = map.tree();

        // Remove taxon prefix from taxon names. This usually is "r_" from SATIVA runs.
        for( auto nit = tree.begin_nodes(); nit != tree.end_nodes(); ++nit ) {
            auto& n = **nit;
            if( n.data.name.substr(0, taxon_prefix.size()) == taxon_prefix ) {
                n.data.name = n.data.name.substr(taxon_prefix.size());
            }
        }

        // --------------------------------
        //     Check Properties
        // --------------------------------

        // Init vectors in first iteration...
        if( i == 0 ) {
            index_to_edgenum = std::vector<int>(tree.edge_count(), 0);
            placement_count  = std::vector<size_t>(tree.edge_count(), 0);
            placement_mass   = std::vector<double>(tree.edge_count(), 0.0);

            for( auto eit = tree.begin_edges(); eit != tree.end_edges(); ++eit ) {
                auto& e = **eit;
                index_to_edgenum[e.index()] = e.data.edge_num;
            }
            tree0 = tree;

        // ... and check for correctness in later iterations.
        } else {
            for( auto eit = tree.begin_edges(); eit != tree.end_edges(); ++eit ) {
                auto& e = **eit;
                if( index_to_edgenum[e.index()] != e.data.edge_num ) {
                    LOG_ERR << "index_to_edgenum[e.index()] != e.data.edge_num : "
                            << index_to_edgenum[e.index()] << " != " << e.data.edge_num;
                    return;
                }
            }
        }

        // --------------------------------
        //     Clade Extraction
        // --------------------------------

        // Make a set of all edges that do not belong to any clade.
        // We first fill it with all edges, then remove the clade-edges later.
        std::unordered_set<PlacementTree::EdgeType*> non_clade_edges;
        for (auto it = tree.begin_edges(); it != tree.end_edges(); ++it) {
            non_clade_edges.insert(it->get());
        }

        // Examining clades...
        size_t clade_num = 0;
        for (auto& clade : clades) {
            std::vector<PlacementTree::NodeType*> node_list;

            // Find the nodes that belong to the taxa of this clade.
            for (auto taxon : clade.second) {
                PlacementTree::NodeType* node = find_node( tree, taxon_prefix + taxon);
                if (node == nullptr) {
                    node = find_node( tree, taxon);
                }
                if (node == nullptr) {
                    LOG_DBG2 << "couldn't find taxon " << taxon;
                    continue;
                }
                node_list.push_back(node);

                // Check clade num consistency
                if( clade_num_map.count(clade_num) == 0 ) {
                    if( i != 0 ) {
                        LOG_WARN << "clade " << clade.first << " not found in sample 0! (num)";
                        return;
                    }
                    clade_num_map[clade_num] = clade.first;
                } else if( clade_num_map[clade_num] != clade.first ) {
                    LOG_WARN << "clade num " << clade_num << " does not match " << clade.first;
                    return;
                }
            }

            // Find the edges that are part of the subtree of this clade.
            auto bps = BipartitionSet<PlacementTree>(tree);
            auto smallest = bps.find_smallest_subtree (node_list);
            auto subedges = bps.get_subtree_edges(smallest->link());

            // Process all edges of this clade.
            for (auto& e : subedges) {
                // Remove this edge from the non-clade edges list
                if( non_clade_edges.count(e) == 0 ) {
                    LOG_WARN << "edge at " << e->primary_node()->data.name
                             << e->secondary_node()->data.name << " already done...";
                }
                non_clade_edges.erase(e);

                // Check edge index consistency
                if( edge_index_to_clade_map.count(e->index()) == 0 ) {
                    if( i != 0 ) {
                        LOG_WARN << "clade " << clade.first << " not found in sample 0! (edge)";
                        return;
                    }
                    edge_index_to_clade_map[e->index()] = clade.first;
                } else if( edge_index_to_clade_map[e->index()] != clade.first ) {
                    LOG_WARN << "edge with index " << e->index() << " does not match " << clade.first;
                    return;
                }
            }
            ++clade_num;
        }

        // Add remaining edges to "basal_branches" clade
        for( auto& e : non_clade_edges ) {
            if( edge_index_to_clade_map.count(e->index()) == 0 ) {
                if( i != 0 ) {
                    LOG_WARN << "clade basal_branches not found in sample 0!";
                    return;
                }
                edge_index_to_clade_map[e->index()] = "basal_branches";
            } else if( edge_index_to_clade_map[e->index()] != "basal_branches" ) {
                LOG_WARN << "edge with index " << e->index() << " does not match basal_branches";
                return;
            }
        }

        // --------------------------------
        //     Count collection
        // --------------------------------

        // Collect the placement counts and masses.
        for( auto eit = tree.begin_edges(); eit != tree.end_edges(); ++eit ) {
            auto& e = **eit;

            // Add all new placement counts and masses to the counters.
            for( auto& p : e.data.placements ) {
                if( p->pquery->name_size() != 1 ) {
                    LOG_WARN << "name size == " << p->pquery->name_size();
                    return;
                }
                auto name = p->pquery->name_at(0).name;

                // If the placement is new, add it. If not, check whether it is consistent.
                if( taxa_done.count(name) == 0 ) {
                    placement_count[e.index()] += 1;
                    placement_mass[e.index()]  += p->like_weight_ratio;
                    taxa_done[name] = p->edge_num;

                    // Count clade placements.
                    if( edge_index_to_clade_map.count(e.index()) == 0 ) {
                        LOG_WARN << "no clade for edge " << e.index();
                        return;
                    }
                    std::string clade_name = edge_index_to_clade_map[e.index()];
                    clade_count[clade_name] += 1;
                    clade_mass[clade_name]  += p->like_weight_ratio;
                } else {
                    if( taxa_done[name] != p->edge_num ) {
                        ++taxa_inconsistent;
                        LOG_WARN << "placement not consistent between samples: " << name;
                    }
                }
            }
        }
    }

    // ----------------------------------------------------
    //     Summarize Information
    // ----------------------------------------------------

    LOG_INFO << "uniq taxa count:   " << taxa_done.size();
    LOG_INFO << "inconsistent taxa: " << taxa_inconsistent;
    taxa_done.clear();

    // ----------------------------------------------------
    //     Branch counts
    // ----------------------------------------------------

    LOG_INFO << "total_placement_count " << total_placement_count;

    // Write counts.
    std::string placement_count_list;
    for( auto& pv : placement_count ) {
        placement_count_list += std::to_string(pv) + "\n";
    }
    utils::file_write(base_dir + "placement_count_list", placement_count_list);

    // Write masses.
    std::string placement_mass_list;
    for( auto& pv : placement_mass ) {
        placement_mass_list += std::to_string(pv) + "\n";
    }
    utils::file_write(base_dir + "placement_mass_list", placement_mass_list);

    if( placement_count.size() != placement_mass.size() ) {
        LOG_ERR << "placement_count.size() != placement_mass.size() : "
                << placement_count.size() << " != " << placement_mass.size();
        return;
    }

    // Sum up everything
    auto count_sum = std::accumulate(placement_count.begin(), placement_count.end(), 0);
    auto count_max = *std::max_element (placement_count.begin(), placement_count.end());
    auto mass_sum  = std::accumulate(placement_mass.begin(), placement_mass.end(), 0.0);
    auto mass_max  = *std::max_element (placement_mass.begin(), placement_mass.end());

    LOG_INFO << "sum count " << count_sum;
    LOG_INFO << "max count " << count_max;
    LOG_INFO << "sum mass  " << mass_sum;
    LOG_INFO << "max mass  " << mass_max;

    // ----------------------------------------------------
    //     Clade counts
    // ----------------------------------------------------

    LOG_INFO;
    LOG_INFO << "Clade counts:";
    for( auto& cp : clade_count ) {
        LOG_INFO << cp.first << "\t" << cp.second << "\t" << ( (double)cp.second / count_sum );
    }

    LOG_INFO;
    LOG_INFO << "Clade masses:";
    for( auto& cp : clade_mass ) {
        LOG_INFO << cp.first << "\t" << cp.second << "\t" << ( cp.second / mass_sum );
    }

    // ----------------------------------------------------
    //     Colour branches
    // ----------------------------------------------------

    // Create color gradient in "blue pink black".
    auto gradient = std::map<double, color::Color>();
    gradient[ 0.0 ] = color::color_from_hex("#81bfff");
    gradient[ 0.5 ] = color::color_from_hex("#c040be");
    gradient[ 1.0 ] = color::color_from_hex("#000000");
    auto base_color = color::color_from_hex("#81bfff");

    // Make count color tree.
    auto count_color_vec_lin = std::vector<color::Color>( placement_count.size(), base_color );
    auto count_color_vec_log = std::vector<color::Color>( placement_count.size(), base_color );
    for( size_t i = 0; i < placement_count.size(); ++i ) {
        if( placement_count[i] > 0 ) {
            double val;
            val = static_cast<double>(placement_count[i]) / count_max;
            count_color_vec_lin[i] = color::gradient(gradient, val);
            val = log(static_cast<double>(placement_count[i])) / log(count_max);
            count_color_vec_log[i] = color::gradient(gradient, val);
        }
    }
    write_color_tree_nexus(tree0, count_color_vec_lin, base_dir + "tree_count_lin.nexus");
    write_color_tree_nexus(tree0, count_color_vec_log, base_dir + "tree_count_log.nexus");

    // Make mass color tree.
    auto mass_color_vec_lin = std::vector<color::Color>( placement_mass.size(), base_color );
    auto mass_color_vec_log = std::vector<color::Color>( placement_mass.size(), base_color );
    for( size_t i = 0; i < placement_mass.size(); ++i ) {
        if( placement_mass[i] > 0 ) {
            double val;
            val = static_cast<double>(placement_mass[i]) / mass_max;
            mass_color_vec_lin[i] = color::gradient(gradient, val);
            val = log(static_cast<double>(placement_mass[i])) / log(mass_max);
            mass_color_vec_log[i] = color::gradient(gradient, val);
        }
    }
    write_color_tree_nexus(tree0, mass_color_vec_lin, base_dir + "tree_mass_lin.nexus");
    write_color_tree_nexus(tree0, mass_color_vec_log, base_dir + "tree_mass_log.nexus");

    clade_count.clear();
    clade_mass.clear();
}

// main
int main( int argc, char** argv )
{
    // Activate logging.
    Logging::log_to_stdout();

    // Get base dir.
    if (argc != 2) {
        LOG_WARN << "Need to provide base dir.";
        return 1;
    }
    auto base_dir = text::trim_right( std::string( argv[1] ), "/") + "/";
    LOG_INFO << "base dir: " << base_dir;

    // Run.
    placement_count_color_tree( base_dir );

    LOG_INFO << "Finished.";
    return 0;
}
This program expects a base_dir directory path as input, which contains the following subdirectories and files:

- base_dir/clades/: The directory with the clade files from the Preparing the Clade Annotation step.
- base_dir/samples/: The directory with the placement data in sample_i_max.jplace files, where i is the sample number (0-153). This is the data resulting from the Restriction to max weight Placements step. It is recommended to provide a symlink to the directory from that step in order to avoid copying the data.

To run it, call

bin/visualize_placements base_dir/

from the genesis main directory.
The program output in the terminal is used for subsequent steps, so it is stored in a log file log.txt. See the next sections for more information on how this output is used.
The program also creates the following files:

- placement_count_list and placement_mass_list, which contain the counts and masses (measured in like_weight_ratio) of the placements per branch (as indexed within genesis). This is mostly an error checking step.
- tree_count_lin.nexus, tree_count_log.nexus, tree_mass_lin.nexus and tree_mass_log.nexus, which are the important result of this step. They each contain a tree with colored branches that can be read in FigTree.

The two count trees visualize the counts of the placements (their number per branch), while the mass trees show the masses of these placements (measured in like_weight_ratio). As we did not set those masses, they default to 1.0. This effectively results in identical trees for both the counts and masses. However, this might be useful for future approaches: abundance data or some other data might be interesting to visualize here as well.

Furthermore, the two lin trees use linear scaling, while the two log trees use logarithmic scaling for determining the color per branch. The latter is more useful, as the number of placements per branch is highly unevenly distributed: there are many branches with only a few placements on them, while just a few branches accumulate most of the placements. With linear scaling, this would result in most branches being light blue (the color for 0.0 in our gradient), while only very few highly populated branches would turn black (the other end of the gradient). Using logarithmic scaling prevents this and thus also makes it possible to see the in-between counts and colors.
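As a concrete example of why logarithmic scaling works better here, compare the relative color values of a branch with 10 placements when the maximum per-branch count is 2487 (the max count of one of our trees, see below):

#!/bin/bash

awk 'BEGIN { print 10 / 2487 }'          # linear:      ~0.004, nearly indistinguishable from 0
awk 'BEGIN { print log(10)/log(2487) }'  # logarithmic: ~0.294, clearly visible on the gradient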
For the main text, we used the tree_count_log.nexus files. This program can be run for all of the 8 analyses, to give tree visualizations for each of them.
The previous step yields tree files in nexus format that can be read by FigTree. Furthermore, the program output itself contains valuable information needed for properly visualizing the data.
One line of output is a count of "inconsistent taxa", which are the numbers shown in section Inconsistent Placements.
Furthermore, summaries of the counts and masses of the placements on the tree are output: sum count, max count, sum mass and max mass show the sum and the maximum value of the placements for their counts and masses. The max count is the value used for setting the axis values for the color scale shown next to the trees.
The workflow to create publication quality figures from the nexus files of the previous step is as follows: open the tree_count_log.nexus file in FigTree, adjust the tree display, and export the tree as a vector graphic; then import it into Inkscape, add the color scale (see below), and finalize the figure.
The color scale was created in Inkscape, using the same gradient used for the tree visualization. It is a linear gradient with the following stops:

#81bfff (at position 0.0)
#c040be (at position 0.5)
#000000 (at position 1.0)
The axis markers were added by hand and had to be adjusted according to the maximum value of placement mass on the particular tree. In order to get the correct position for each axis marker, we used the following Python script:
#!/usr/bin/python
import math

# Maximum placement count on the tree (from the "max count" output of the
# previous step) and the rounded value displayed at the top of the scale.
max_val = 2487
rounded = 2500

# Relative positions on the logarithmic scale. (The loop bounds and the exact
# scaling are a reconstruction: positions are ratios of logarithms, relative
# to the maximum value.)
print "max val", max_val
val = math.log(rounded) / math.log(max_val)
print "%.0f" % rounded, "at", "%.3f" % val
for i in range(3, -1, -1):
    val = math.log(10 ** i) / math.log(max_val)
    print (" " * (6 - i)) + "%.0f" % (10 ** i), "at", "%.3f" % val
For each tree, the value max_val has to be set to the value that the previous step output as max count. The value of rounded is then set to a value greater than that, which is used as the maximum value displayed on the scale. The output of this script gives the relative positions of the markers for the scale, which then have to be set by hand in Inkscape. To achieve this, we recommend creating a gradient with a height of 100 units, so that the relative positions of the markers can be translated into Inkscape units in a straightforward manner. The resulting scale image can then be inserted into the tree figure (see the steps above).
Furthermore, for the clade annotated trees (the ones with clades instead of taxa names, e.g., the one in the main text), we shaded the inner (basal) branches. Those are the ones that do not belong to any clade; they were marked in gray in the clade tree (see section Clade Visualization).
As the final step of the analysis, we compared how the placements differ when using taxonomically constrained and unconstrained reference trees. Thus, the step can be done for both the Euks and Apis trees, and for amplicons and OTUs, respectively (E-M, E-O, A-M, A-O).
As the trees for the constrained and unconstrained case differ, it is not possible to do a straightforward comparison of the counts of placements per branch. Instead, we counted how many placements were placed into each clade of those trees. This yielded the table in the supplement.
To get this information, we used the following program, which is implemented in C++ using our genesis toolkit. See the introduction of this document for instructions on how to get this to work.
// file compare_constr_unconstr.cpp

#include <algorithm>
#include <assert.h>
#include <cmath>
#include <numeric>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

#include "placement/functions.hpp"
#include "placement/io/jplace_processor.hpp"
#include "placement/io/newick_processor.hpp"
#include "placement/io/serializer.hpp"
#include "placement/placement_map.hpp"
#include "tree/bipartition/bipartition_set.hpp"
#include "tree/default/functions.hpp"
#include "tree/tree.hpp"
#include "utils/core/fs.hpp"
#include "utils/core/logging.hpp"
#include "utils/math/matrix.hpp"

using namespace genesis;

// compare_constr_unconstr
void compare_constr_unconstr( std::string base_dir )
{
    auto samples_a = base_dir + "samples_Unconstr/";
    auto samples_b = base_dir + "samples_Constr/";
    LOG_INFO << "samples_a : " << samples_a;
    LOG_INFO << "samples_b : " << samples_b;

    // --------------------------------
    //     Clade Init
    // --------------------------------

    // List of clade files.
    std::vector<std::string> clade_files;
    utils::dir_list_files(base_dir + "clades", clade_files);
    std::sort(clade_files.begin(), clade_files.end());

    // Create a list of all clades and fill each clade with its taxa.
    std::vector<std::pair<std::string, std::vector<std::string>>> clades;
    for (auto cf : clade_files) {
        auto taxa = text::split( utils::file_read(base_dir + "clades/" + cf), "\n" );
        std::sort(taxa.begin(), taxa.end());
        clades.push_back(std::make_pair(cf, taxa));
    }

    // --------------------------------
    //     Placement Init
    // --------------------------------

    std::vector<int>                        index_to_edgenum;
    std::unordered_map<size_t, size_t>      edge_index_to_clade_num;
    std::unordered_map<std::string, int>    taxa_done;
    std::unordered_map<std::string, size_t> read_to_clade_num_map;

    std::string taxon_prefix = "r_";
    std::string query_prefix = "q_";

    size_t total_placement_count = 0;
    size_t taxa_inconsistent     = 0;

    // --------------------------------
    //     Result Matrix Init
    // --------------------------------

    // One extra row and column for the "basal_branches" pseudo-clade.
    auto result_matrix = Matrix<size_t>( clades.size() + 1, clades.size() + 1, 0 );

    // --------------------------------------------------------
    //     Iterate all Jplace files in base dir A
    // --------------------------------------------------------

    LOG_INFO << "Reading 154 samples from " << samples_a;
    for (size_t i = 0; i < 154; i++) {

        // --------------------------------
        //     Read files
        // --------------------------------

        // Read placement file.
        PlacementMap map;
        std::string jfile = samples_a + "sample_" + std::to_string(i) + "_max.jplace";
        if( !JplaceProcessor().from_file(jfile, map) ) {
            LOG_ERR << "Couldn't read jplace file " << jfile;
            return;
        }
        total_placement_count += map.placement_count();
        auto& tree = map.tree();

        // Remove taxon prefix from taxon names. This usually is "r_" from SATIVA runs.
        for( auto nit = tree.begin_nodes(); nit != tree.end_nodes(); ++nit ) {
            auto& n = **nit;
            if( n.data.name.substr(0, taxon_prefix.size()) == taxon_prefix ) {
                n.data.name = n.data.name.substr(taxon_prefix.size());
            }
        }

        // --------------------------------
        //     Check Tree Consistency
        // --------------------------------

        // Init vectors in first iteration...
        if( i == 0 ) {
            index_to_edgenum = std::vector<int>(tree.edge_count(), 0);
            for( auto eit = tree.begin_edges(); eit != tree.end_edges(); ++eit ) {
                auto& e = **eit;
                index_to_edgenum[e.index()] = e.data.edge_num;
            }

        // ... and check for correctness in later iterations.
        } else {
            for( auto eit = tree.begin_edges(); eit != tree.end_edges(); ++eit ) {
                auto& e = **eit;
                if( index_to_edgenum[e.index()] != e.data.edge_num ) {
                    LOG_ERR << "index_to_edgenum[e.index()] != e.data.edge_num : "
                            << index_to_edgenum[e.index()] << " != " << e.data.edge_num;
                    return;
                }
            }
        }

        // --------------------------------
        //     Clade Extraction
        // --------------------------------

        // Make a set of all edges that do not belong to any clade.
        // We first fill it with all edges, then remove the clade-edges later.
        std::unordered_set<PlacementTree::EdgeType*> non_clade_edges;
        for (auto it = tree.begin_edges(); it != tree.end_edges(); ++it) {
            non_clade_edges.insert(it->get());
        }

        // Examining clades...
        for( size_t ci = 0; ci < clades.size(); ++ci ) {
            auto& clade = clades[ci];
            std::vector<PlacementTree::NodeType*> node_list;

            // Find the nodes that belong to the taxa of this clade.
            for (auto taxon : clade.second) {
                PlacementTree::NodeType* node = find_node( tree, taxon_prefix + taxon );
                if (node == nullptr) {
                    node = find_node( tree, taxon );
                }
                if (node == nullptr) {
                    LOG_WARN << "couldn't find taxon " << taxon;
                    continue;
                }
                node_list.push_back(node);
            }

            // Find the edges that are part of the subtree of this clade.
            auto bps      = BipartitionSet<PlacementTree>(tree);
            auto smallest = bps.find_smallest_subtree( node_list );
            auto subedges = bps.get_subtree_edges( smallest->link() );

            // Assign those edges to the current clade.
            for (auto& e : subedges) {

                // Remove this edge from the non-clade edges list.
                if( non_clade_edges.count(e) == 0 ) {
                    LOG_WARN << "edge at " << e->primary_node()->data.name
                             << e->secondary_node()->data.name << " already done...";
                }
                non_clade_edges.erase(e);

                // Check edge index consistency.
                if( edge_index_to_clade_num.count(e->index()) == 0 ) {
                    if( i != 0 ) {
                        LOG_WARN << "clade " << clade.first << " not found in sample 0! (edge)";
                        return;
                    }
                    edge_index_to_clade_num[e->index()] = ci;
                } else if( edge_index_to_clade_num[e->index()] != ci ) {
                    LOG_WARN << "edge with index " << e->index() << " does not match " << clade.first;
                    return;
                }
            }
        }

        // Add remaining edges to "basal_branches" clade.
        for( auto& e : non_clade_edges ) {
            if( edge_index_to_clade_num.count(e->index()) == 0 ) {
                if( i != 0 ) {
                    LOG_WARN << "clade basal_branches not found in sample 0!";
                    return;
                }
                edge_index_to_clade_num[e->index()] = clades.size();
            } else if( edge_index_to_clade_num[e->index()] != clades.size() ) {
                LOG_WARN << "edge with index " << e->index() << " does not match basal_branches";
                return;
            }
        }

        // --------------------------------
        //     Iterate all Placements
        // --------------------------------

        // Record for each read the clade it was placed in.
        for( auto eit = tree.begin_edges(); eit != tree.end_edges(); ++eit ) {
            auto& e = **eit;

            // Process each placement on this edge.
            for( auto& p : e.data.placements ) {
                if( p->pquery->name_size() != 1 ) {
                    LOG_WARN << "name size == " << p->pquery->name_size();
                    return;
                }
                auto name = p->pquery->name_at(0).name;
                if( name.substr(0, query_prefix.size()) == query_prefix ) {
                    name = name.substr(query_prefix.size());
                }

                // If the placement is new, add it. If not, check whether it is consistent.
                if( taxa_done.count(name) == 0 ) {
                    taxa_done[name] = p->edge_num;

                    // Find the clade num for this read and store it.
                    if( edge_index_to_clade_num.count(e.index()) == 0 ) {
                        LOG_WARN << "no clade for edge " << e.index();
                        return;
                    }
                    if( read_to_clade_num_map.count(name) != 0 ) {
                        LOG_WARN << "read " << name << " was somehow already processed...";
                    }
                    size_t clade_num = edge_index_to_clade_num[e.index()];
                    read_to_clade_num_map[name] = clade_num;
                } else {
                    if( taxa_done[name] != p->edge_num ) {
                        ++taxa_inconsistent;
                        LOG_WARN << "placement not consistent between samples: " << name;
                    }
                }
            }
        }
    }

    LOG_INFO << "total_placement_count " << total_placement_count;
    LOG_INFO << "uniq taxa count: " << taxa_done.size();
    LOG_INFO << "inconsistent taxa: " << taxa_inconsistent;
    LOG_INFO;

    // ---------------------------------------------------------
    //     Iterate all Jplace files in base dir B
    // ---------------------------------------------------------

    edge_index_to_clade_num.clear();
    index_to_edgenum.clear();
    taxa_done.clear();
    total_placement_count = 0;
    taxa_inconsistent     = 0;

    LOG_INFO << "Reading 154 samples from " << samples_b;
    for (size_t i = 0; i < 154; i++) {

        // --------------------------------
        //     Read files
        // --------------------------------

        // Read placement file. Use the serialized bplace file if it exists;
        // otherwise read the jplace file and store a bplace file for later runs.
        PlacementMap map;
        std::string jfile = samples_b + "sample_" + std::to_string(i) + "_max.jplace";
        std::string bfile = samples_b + "sample_" + std::to_string(i) + "_max.bplace";
        if( utils::file_exists( bfile ) ) {
            PlacementMapSerializer::load(bfile, map);
        } else {
            if( !JplaceProcessor().from_file(jfile, map) ) {
                LOG_ERR << "Couldn't read jplace file " << jfile;
                return;
            }
            PlacementMapSerializer::save(map, bfile);
        }
        total_placement_count += map.placement_count();
        auto& tree = map.tree();

        // Remove taxon prefix from taxon names. This usually is "r_" from SATIVA runs.
        for( auto nit = tree.begin_nodes(); nit != tree.end_nodes(); ++nit ) {
            auto& n = **nit;
            if( n.data.name.substr(0, taxon_prefix.size()) == taxon_prefix ) {
                n.data.name = n.data.name.substr(taxon_prefix.size());
            }
        }

        // --------------------------------
        //     Check Tree Consistency
        // --------------------------------

        // Init vectors in first iteration...
        if( i == 0 ) {
            index_to_edgenum = std::vector<int>(tree.edge_count(), 0);
            for( auto eit = tree.begin_edges(); eit != tree.end_edges(); ++eit ) {
                auto& e = **eit;
                index_to_edgenum[e.index()] = e.data.edge_num;
            }

        // ... and check for correctness in later iterations.
        } else {
            for( auto eit = tree.begin_edges(); eit != tree.end_edges(); ++eit ) {
                auto& e = **eit;
                if( index_to_edgenum[e.index()] != e.data.edge_num ) {
                    LOG_ERR << "index_to_edgenum[e.index()] != e.data.edge_num : "
                            << index_to_edgenum[e.index()] << " != " << e.data.edge_num;
                    return;
                }
            }
        }

        // --------------------------------
        //     Clade Extraction
        // --------------------------------

        // Make a set of all edges that do not belong to any clade.
        // We first fill it with all edges, then remove the clade-edges later.
        std::unordered_set<PlacementTree::EdgeType*> non_clade_edges;
        for (auto it = tree.begin_edges(); it != tree.end_edges(); ++it) {
            non_clade_edges.insert(it->get());
        }

        // Examining clades...
        for( size_t ci = 0; ci < clades.size(); ++ci ) {
            auto& clade = clades[ci];
            std::vector<PlacementTree::NodeType*> node_list;

            // Find the nodes that belong to the taxa of this clade.
            for (auto taxon : clade.second) {
                PlacementTree::NodeType* node = find_node( tree, taxon_prefix + taxon );
                if (node == nullptr) {
                    node = find_node( tree, taxon );
                }
                if (node == nullptr) {
                    LOG_WARN << "couldn't find taxon " << taxon;
                    continue;
                }
                node_list.push_back(node);
            }

            // Find the edges that are part of the subtree of this clade.
            auto bps      = BipartitionSet<PlacementTree>(tree);
            auto smallest = bps.find_smallest_subtree( node_list );
            auto subedges = bps.get_subtree_edges( smallest->link() );

            // Assign those edges to the current clade.
            for (auto& e : subedges) {

                // Remove this edge from the non-clade edges list.
                if( non_clade_edges.count(e) == 0 ) {
                    LOG_WARN << "edge at " << e->primary_node()->data.name
                             << e->secondary_node()->data.name << " already done...";
                }
                non_clade_edges.erase(e);

                // Check edge index consistency.
                if( edge_index_to_clade_num.count(e->index()) == 0 ) {
                    if( i != 0 ) {
                        LOG_WARN << "clade " << clade.first << " not found in sample 0! (edge)";
                        return;
                    }
                    edge_index_to_clade_num[e->index()] = ci;
                } else if( edge_index_to_clade_num[e->index()] != ci ) {
                    LOG_WARN << "edge with index " << e->index() << " does not match " << clade.first;
                    return;
                }
            }
        }

        // Add remaining edges to "basal_branches" clade.
        for( auto& e : non_clade_edges ) {
            if( edge_index_to_clade_num.count(e->index()) == 0 ) {
                if( i != 0 ) {
                    LOG_WARN << "clade basal_branches not found in sample 0!";
                    return;
                }
                edge_index_to_clade_num[e->index()] = clades.size();
            } else if( edge_index_to_clade_num[e->index()] != clades.size() ) {
                LOG_WARN << "edge with index " << e->index() << " does not match basal_branches";
                return;
            }
        }

        // --------------------------------
        //     Iterate all Placements
        // --------------------------------

        // Record for each read the pair of clades it was placed in (run A vs run B)
        // and count this pair in the result matrix.
        for( auto eit = tree.begin_edges(); eit != tree.end_edges(); ++eit ) {
            auto& e = **eit;

            // Process each placement on this edge.
            for( auto& p : e.data.placements ) {
                if( p->pquery->name_size() != 1 ) {
                    LOG_WARN << "name size == " << p->pquery->name_size();
                    return;
                }
                auto name = p->pquery->name_at(0).name;
                if( name.substr(0, query_prefix.size()) == query_prefix ) {
                    name = name.substr(query_prefix.size());
                }

                // If the placement is new, add it. If not, check whether it is consistent.
                if( taxa_done.count(name) == 0 ) {
                    taxa_done[name] = p->edge_num;

                    // Find the clade num for this read and count the pair of clades.
                    if( edge_index_to_clade_num.count(e.index()) == 0 ) {
                        LOG_WARN << "no clade for edge " << e.index();
                        return;
                    }
                    if( read_to_clade_num_map.count(name) == 0 ) {
                        LOG_WARN << "read " << name << " is in B but not in A!!!";
                    }
                    size_t clade_num_a = read_to_clade_num_map[name];
                    size_t clade_num_b = edge_index_to_clade_num[e.index()];
                    ++result_matrix.at(clade_num_a, clade_num_b);
                } else {
                    if( taxa_done[name] != p->edge_num ) {
                        ++taxa_inconsistent;
                        LOG_WARN << "placement not consistent between samples: " << name;
                    }
                }
            }
        }
    }

    LOG_INFO << "total_placement_count " << total_placement_count;
    LOG_INFO << "uniq taxa count: " << taxa_done.size();
    LOG_INFO << "inconsistent taxa: " << taxa_inconsistent;
    LOG_INFO;

    // --------------------------------
    //     Write Result Matrix
    // --------------------------------

    // Write the result matrix to a CSV file, with clade names in the
    // header row and the first column.
    std::string csv;
    for( size_t i = 0; i < clades.size(); ++i ) {
        csv += ", " + clades[i].first;
    }
    csv += ", basal_branches\n";
    for( size_t i = 0; i < result_matrix.rows(); ++i ) {
        if( i < clades.size() ) {
            csv += clades[i].first;
        } else {
            csv += "basal_branches";
        }
        for( size_t j = 0; j < result_matrix.cols(); ++j ) {
            csv += ", " + std::to_string(result_matrix(i, j));
        }
        csv += "\n";
    }
    utils::file_write(base_dir + "compare_constr_unconstr.csv", csv);

    LOG_INFO << "finished";
}

// main
int main( int argc, char** argv )
{
    // Activate Logging.
    Logging::log_to_stdout();

    // Get base dir.
    if (argc != 2) {
        LOG_WARN << "Need to provide base dir.";
        return 1;
    }
    auto base_dir = text::trim_right( std::string( argv[1] ), "/") + "/";
    LOG_INFO << "base dir : " << base_dir;

    // Run.
    compare_constr_unconstr( base_dir );

    LOG_INFO << "Finished.";
    return 0;
}
This program expects a base_dir directory path as input, which contains the following subdirectories and files:
base_dir/clades/
: The directory with the clade files from the Preparing the Clade Annotation step.
base_dir/samples_Unconstr/
: The directory with the placement data in sample_i_max.jplace files for the taxonomically unconstrained reference, where i is the sample number (0-153). This is the data resulting from the Restriction to max weight Placements step. It is recommended to provide a symlink to the directory from that step in order to avoid copying the data.
base_dir/samples_Constr/
: Same as above, just with the constrained reference instead.
To run the program, call
./bin/compare_constr_unconstr path/to/base_dir
from the genesis main directory.
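For convenience, the complete build-and-run sequence can look as follows. This is a minimal sketch, assuming genesis v0.2.0 was set up as described in the introduction of this document; the paths and the make invocation are placeholders that need to be adapted to your setup:

# Minimal sketch, assuming genesis v0.2.0 is set up as described in the
# introduction. Paths are placeholders and need to be adapted.
cd path/to/genesis

# Place the program in the apps directory so that it gets compiled
# into an executable in bin (see the introduction).
cp path/to/compare_constr_unconstr.cpp apps/
make

# Run the comparison; the resulting CSV file is written to base_dir.
./bin/compare_constr_unconstr path/to/base_dir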
The program then creates a file compare_constr_unconstr.csv in the base_dir, which is the raw resulting table for the comparison as shown in the supplement. This file can be opened with spreadsheet applications like Microsoft Excel or OpenOffice Calc.
The table answers the question: How many sequences in total were placed in clade A in the unconstrained analysis and in clade B in the constrained analysis?
The table shows the unconstrained clades at the left and the constrained ones at the top.
Thus, each cell shows the number of sequences that were placed in the unconstrained clade corresponding to its row and the constrained clade corresponding to its column.
Example: The value x in the first column and third row means: there are x sequences that were placed in the clade of the third row in the unconstrained analysis, but in the clade of the first column in the constrained analysis.
In order to turn those absolute numbers into relative ones, all values have to be divided by the total number of sequences (which is simply the sum of all cells). Furthermore, for better visualization, the cells can be colored according to their value by using the conditional formatting mechanism of the spreadsheet application.
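This conversion can also be scripted instead of being done in a spreadsheet. The following is a minimal sketch using awk; it assumes the CSV layout written by the program above (fields separated by ", ", clade names in the first row and first column), and the output file name is a hypothetical choice:

# Minimal sketch: turn the absolute counts of the comparison table into
# relative values. The file is read twice: the first pass sums all counts,
# the second pass prints each cell divided by that total.
awk -F', ' '
    NR == FNR {
        if( FNR > 1 ) for( i = 2; i <= NF; ++i ) total += $i
        next
    }
    FNR == 1 { print; next }
    {
        printf "%s", $1
        for( i = 2; i <= NF; ++i ) printf ", %.6f", $i / total
        printf "\n"
    }
' base_dir/compare_constr_unconstr.csv base_dir/compare_constr_unconstr.csv \
> base_dir/compare_constr_unconstr_rel.csv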
The tables can be obtained for all analyses (E-M, E-O, A-M, A-O).
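As the program is self-contained per base directory, the four tables can be produced in one go. A minimal sketch, assuming one prepared base directory per analysis variant; the directory layout used here is hypothetical:

# Minimal sketch: run the comparison once per analysis variant.
# Assumes one prepared base directory per variant (hypothetical layout).
for analysis in E-M E-O A-M A-O; do
    ./bin/compare_constr_unconstr "path/to/${analysis}/"
done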