Using PyPythia
This library can be used in two ways: either directly as command line tool, or the prediction can be called from other python code.
Command Line Interface
If you only want to predict the difficulty for a single MSA, you can query the predictor using the command line interface, for example like this:
pythia --msa examples/example.phy --raxmlng /path/to/raxml-ng
Note that when you installed PyPythia using conda, you will have to download the example.phy
and adjust the path
accordingly.
The output will be something like The predicted difficulty for MSA examples/example.phy is: 0.02.
, telling us that
example.phy is an easy dataset. In fact, this dataset exhibits a single likelihood peak. Depending on the predictor
version or operating system you are using, the actual value might slightly differ.
This is expected and nothing to worry about 🙂
Note that Pythia can also handle FASTA input files, see section Input Data below.
The following options are available:
PyPythia version 2.0.0 released by The Exelixis Lab
Developed by: Julia Haag
Latest version: https://github.com/tschuelia/PyPythia
Questions/problems/suggestions? Please open an issue on GitHub.
usage: pythia [-h] -m MSA -r RAXMLNG [-t THREADS] [-s SEED] [-p PREFIX]
[--predictor PREDICTOR] [--shap] [--forceDuplicates] [--forceFullGaps]
[--nofiles] [-V]
Parser for Pythia command line options.
options:
-h, --help show this help message and exit
-m MSA, --msa MSA Multiple Sequence Alignment to predict the difficulty for. Must be
in either phylip or fasta format.
-r RAXMLNG, --raxmlng RAXMLNG
Path to the binary of RAxML-NG. For install instructions see
https://github.com/amkozlov/raxml-ng.(default: 'raxml-ng' if in
$PATH, otherwise this option is mandatory).
-t THREADS, --threads THREADS
Number of threads to use for parallel parsimony tree inference
(default: RAxML-NG autoconfig).
-s SEED, --seed SEED Seed for the RAxML-NG parsimony tree inference (default: 0).
-p PREFIX, --prefix PREFIX
Prefix of the PyPythia log and result file (default: MSA file name).
--predictor PREDICTOR
Filepath of the alternative predictor to use (default: latest
Pythia).
--shap If set, computes the shapley values of the prediction as waterfall
plot in '{prefix}.shap.pdf'. When using this option, make sure you
understand what shapley values are and how to interpret this
plot.For details on shapley values refer to the documentation:
https://tschuelia.github.io/PyPythia/latest/usage/#shap-waterfall-
plot (default: False).
--forceDuplicates Per default, Pythia refuses to predict the difficulty for MSAs
containing duplicate sequences,and removes duplicate sequences prior
to predicting the difficulty. Only set this option if you are
absolutely sure that you want to predict the difficulty for this MSA
(default: False).
--forceFullGaps Per default, Pythia refuses to predict the difficulty for MSAs
containing sequences with only gaps,and removes full-gap sequences
prior to predicting the difficulty. Only set this option if you are
absolutely sure that you want to predict the difficulty for this MSA
(default: False).
--nofiles Prevent Pythia from writing any files and only print logs/results to
the terminal (default: False). WARNING: in this case and if your MSA
contains duplicate/full-gap sequences the reduced MSA will not be
stored.
-V, --version Print the version number and exit.
Result files
Pythia will write the following files:
- A logfile containing the same information as printed to the terminal: {result_prefix}.pythia.log
- The reduced MSA file in case the input MSA contained duplicate/full-gap sequences (and the reduction was not disabled): {result_prefix}.reduced.phy
- The inferred parsimony trees in Newick format: {result_prefix}.pythia.trees
- The shapley values as waterfall plot (if --shap is set): {result_prefix}.shap.pdf
- The features and predicted difficulty as CSV file: {result_prefix}.pythia.csv
The result_prefix can be set using the --prefix
command line option. If not set, Pythia uses the MSA file as prefix. You can prevent Pythia from writing any files via the flag --nofiles
.
From Code
You can also use the library as a regular python library by installing it in your current environment. The following code snippet shows how to predict the difficulty for an MSA using PyPythia:
from pypythia.prediction import predict_difficulty
import pathlib
msa = pathlib.Path("examples/example.phy")
difficulty = predict_difficulty(msa)
print(f"The predicted difficulty for MSA {msa} is: {round(difficulty, 2)}.")
And the output will be the same as for the CLI: The predicted difficulty for MSA examples/example.phy is: 0.02.
.
If you want to get all features, or do more specific analyses of your MSA, see the API Reference for further details on all available classes and methods.
Input data
Supported file types
The input for Pythia is an MSA file in either Phylip or FASTA format.
Supported data types
Pythia supports DNA, AA, and morphological data.
Please note that by morphological data we refer to biological data. According to our analyses, the attributes of biological morphological data are similar to the attributes of DNA and AA data. However, when analyzing language data (cognate, sound-class, and morphosyntactic data) we observed substantially distinct attributes and concluded that morphological language data is not comparable to DNA, AA, or biological morphological data. Thus, at the moment Pythia is not able to reliably predict the difficulty for language alignments.
Taxon names
Make sure that the MSA only contains RAxML-NG compatible taxon names. In particular, taxon labels with spaces, tabs, newlines, commas, colons, semicolons and parenthesis are invalid.
MSAs with duplicate sequences
Pythia refuses to predict the difficulty for MSAs containing duplicate sequences or MSAs containing sequences containing
only gaps.
As of version 2.0.0, Pythia removes duplicates and full-gap sequences per default and predicts the difficulty for this
reduced MSA.
If you absolutely want to predict the difficulty for the original MSA, set the command line flags --forceDuplicates
and --forceFullGaps
.
As of version 1.0.0 Pythia refuses to predict the difficulty for MSAs containing multiple exactly identical sequences ( duplicate sequences). The reason for this is that duplicate sequences can have a substantial impact on the resulting topologies during the maximum parsimony tree inference and therefore on the topological distance measures.
If you set the command line option --removeDuplicates
, Pythia will create a reduced alignment with all duplicates
removed and predict the difficulty for this reduced alignment.
For duplicate sequences, the first occurrence of the sequence is kept.
WARNING: The resulting predicted difficulty is only applicable to the reduced MSA! We recommend to only use the created
reduced alignment for your subsequent analyses.
Predictors
To continuously and automatically improve the prediction accuracy of Pythia, we regularly extend the training data set
and subsequently retrain the predictor.
We extend the training data using the anonymized MSAs that we continuously obtain during our RAxML Grove database
updates.
Note that these MSAs are only available internally in RAxML Grove and are not publicly available.
As per default, PyPythia uses the latest predictor predictors/latest.txt
. If you want to use an older version of Pythia,
please install the respective PyPythia version.
You can also pass a custom predictor file using the --predictor
option. However, this will only work if the passed
file contains a LightGBM Booster model.
Note that the predictions for the same MSA can be different when using different versions of Pythia.
SHAP Waterfall Plot
As of version 1.1.0, Pythia includes an option to plot Shapley values for a prediction. The interpretation of Shapley values is not straight-forward, and we emphasize the importance of learning about these values before drawing conclusions based on the resulting plot! We provide the Shapley values as waterfall plot. In the following, we briefly describe what Shapley values are, what a waterfall plot is, and how you can interpret this plot. It is important to note that Shapley values are not the same as feature importances. Predicting the difficulty of two distinct MSAs will lead to two distinct waterfall plots.
Shapley Values
Based on the training data, our difficulty predictor Pythia has learned a base line difficulty. This base line difficulty is the expected value for every new prediction. Starting off this base line, Pythia adjusts its prediction using the features of the MSA. To determine how much each feature contributes to this change, ultimately leading to the final prediction is estimated by Shapley values. Since Pythia is a tree-based regressor, computing the Shapley values requires some advanced mathematics that I won't go into detail about here. If you are interested in this check out the links in the More Details section below. Due to the calculation of Shapley values, the value for one feature is NOT the difference in prediction when removing this feature. The Shapley value for one feature can only be interpreted considering all feature values together for a specific set of feature values.
Waterfall plot
The following figure shows an exemplary waterfall plot output for the MSA example/example.py
and Pythia version 1.1.0.
The x-axis depicts the difficulty and the y-axis the features alongside the respective feature value. The features are
sorted by their Shapley value with the most important feature on top. You can read the plot as follows. The baseline
difficulty that Pythia v1.1.0 learned is 0.35, as indicated by the E[f(x)] = 0.35
on the x-axis. The
proportion_invariant
feature contributed to the overall prediction with a shift towards 1.0
(more difficult) of
0.01
, so in combination with the other features, a proportion_invariant
of 0.341
indicates that the MSA is
slightly more difficult than the average difficulty in the training set. We emphasize that the combination with the
other features part, since the same value for proportion_invariant
with a different MSA and different feature values
for the remaining features might lead to a substantially different contribution to the overall prediction.
The feature with the highest impact for this example is the patterns-over-taxa ratio (num_patterns/num_taxa
). The
overall contribution is 0.23 towards 0.0
, meaning it shifts the overall prediction towards easy
.
More Details
For further information please refer
to this great book on interpretable ML,
the documentation of the shap
package,
especially their notes on the interpretability of Shapley values.