Running Py Protein Inference
Running the Main Py Protein Inference Method
Running Via Command Line
Once the package is installed, the command line tool should be available from any location on the system. The command line tool can be called as follows:
protein_inference_cli.py --help
This will return the help prompt for the tool.
If this does not work, download protein_inference_cli.py from our repository and call it with python using the full path to the script:
python /path/to/directory/pyproteininference/scripts/protein_inference_cli.py --help
Command line options are as follows:
cli$ python protein_inference_cli.py --help
usage: protein_inference_cli.py [-h] [-t FILE [FILE ...]] [-d FILE [FILE ...]]
[-f FILE [FILE ...]] [-o DIR] [-l FILE]
[-a DIR] [-b DIR] [-c DIR] [-db FILE]
[-y FILE] [-p] [-i]
Protein Inference
optional arguments:
-h, --help show this help message and exit
-t FILE [FILE ...], --target FILE [FILE ...]
Input target psm output from percolator. Can either
input one file or a list of files.
-d FILE [FILE ...], --decoy FILE [FILE ...]
Input decoy psm output from percolator. Can either
input one file or a list of files.
-f FILE [FILE ...], --combined_files FILE [FILE ...]
Input combined psm search results in idXML, mzIdentML, pepXML, or
tab delimited format. This should contain Target and Decoy PSMS. "
Can either input one file or a list of files.
-o DIR, --output DIR Result Directory to write to - the name of file will
be determined by parameters selected and tag
parameter. If this option is not set, will write
results to current working directory.
-l FILE, --output_filename FILE
Filename to write results to. Can be left blank. If
this flag is left blank the filename will be
automatically generated. If set this flag will
override -o.
-a DIR, --target_directory DIR
Directory that contains either .txt or .tsv input
target psm data. Make sure the directory ONLY contains
result files.
-b DIR, --decoy_directory DIR
Directory that contains either .txt or .tsv input
decoy psm data. Make sure the directory ONLY contains
result files.
-c DIR, --combined_directory DIR
Directory that contains either .txt or .tsv input data
with targets/decoys combined. Make sure the directory
ONLY contains result files.
-db FILE, --database FILE
Path to the fasta formatted database used in the MS
search. This is optional. If not set, will use the
proteins only in the input files.
-y FILE, --yaml_params FILE
Path to a Protein Inference Yaml Parameter File. If
this is not set, default parameters will be used.
-p, --skip_append_alt
Advanced usage only. If this flag is set, will skip
adding alternative proteins to each PSM from the
database digest. If this flag is not set, the
peptide/protein mapping will be taken from database
digest and appended to the mapping present in the
input files.
-i, --id_splitting Advanced usage only. If set this flag will split
protein identifiers.If not set, this flag will not
split protein identifiers.Sometimes the fasta database
protein IDs are formatted as: 'sp|ARAF_HUMAN|P10398'.
While protein IDs in the input files are formatted as
'ARAF_HUMAN|P10398'. Setting This flag will split off
the front 'sp|' or 'tr|' from the database protein
identifiers.
The following combinations of input are allowed, and at least one combination is required:
-t -d
Path to input target (-t) and decoy (-d) files. This can be one target and one decoy file, or multiple files separated by spaces (" "). See here for information on target/decoy input files.
-a -b
Path to input target (-a) and decoy (-b) directories that contain target and decoy files. Supply one directory for each; all .txt and .tsv files in them will be read in as input.
-f
Path to input combined target/decoy (-f) files. This can be one file or multiple files separated by spaces (" "). Use this option if your input is mzIdentML, idXML, or pepXML.
-c
Path to input combined target/decoy (-c) directory that contains combined target/decoy files. Supply a single directory; all .txt and .tsv files in it will be read in as input.
Any other combination will raise an error.
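The allowed combinations above can be sketched as a small check. This is only an illustration of the rule as described here, not the tool's own validation code; the function name validate_input_combination and its parameter names are hypothetical.

```python
from typing import Optional, Sequence

# Allowed flag combinations, in the order they are documented above.
ALLOWED_COMBINATIONS = [("-t", "-d"), ("-a", "-b"), ("-f",), ("-c",)]


def validate_input_combination(
    target: Optional[Sequence[str]] = None,
    decoy: Optional[Sequence[str]] = None,
    combined_files: Optional[Sequence[str]] = None,
    target_directory: Optional[str] = None,
    decoy_directory: Optional[str] = None,
    combined_directory: Optional[str] = None,
) -> str:
    """Return the flag combination in use, or raise ValueError if it is not allowed.

    Hypothetical helper for illustration only.
    """
    values = {
        "-t": target,
        "-d": decoy,
        "-f": combined_files,
        "-a": target_directory,
        "-b": decoy_directory,
        "-c": combined_directory,
    }
    # Collect the flags that were actually given a non-empty value.
    supplied = {flag for flag, value in values.items() if value}
    for combination in ALLOWED_COMBINATIONS:
        if supplied == set(combination):
            return " ".join(combination)
    raise ValueError(
        "Unsupported input combination: " + (" ".join(sorted(supplied)) or "none")
    )


print(validate_input_combination(target=["target.txt"], decoy=["decoy.txt"]))
```

Supplying only -t, or mixing -t with -f, makes this sketch raise a ValueError, which mirrors the behaviour described above.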
Optional flags
-db
Path to the Fasta database file.
-y
Path to the Protein Inference Yaml parameter file. If this is not supplied, default parameters will be used.
-o
Path to the output directory. If this is left blank, files will be written to the current working directory.
-l
Path to the output filename. If this is left blank, a filename will be automatically generated and written to the directory set in -o. Overrides the -o flag if set.
Advanced usage flags
-p
If set, skips appending alternative proteins from the Fasta database digestion. If this flag is left blank, alternative proteins will be appended (recommended).
-i
If set, splits the IDs in the Fasta database file. If this flag is left blank, IDs in the Fasta database file will not be split (recommended).
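To make the effect of identifier splitting concrete, here is a minimal sketch of the transformation described in the -i help text, where the leading 'sp|' or 'tr|' is removed from a database identifier. The function strip_database_prefix is a hypothetical name used for illustration and is not taken from the package.

```python
from typing import Sequence


def strip_database_prefix(
    identifier: str, prefixes: Sequence[str] = ("sp|", "tr|")
) -> str:
    """Remove a leading 'sp|' or 'tr|' from a protein identifier, if present.

    Hypothetical helper; it only illustrates the described behaviour.
    """
    for prefix in prefixes:
        if identifier.startswith(prefix):
            return identifier[len(prefix):]
    # Identifiers without a known prefix are returned unchanged.
    return identifier


print(strip_database_prefix("sp|ARAF_HUMAN|P10398"))
```

After this step the database identifier has the same form as the 'ARAF_HUMAN|P10398' identifiers found in the input files.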
You can run the tool as follows with separate target and decoy files:
protein_inference_cli.py \
-t /path/to/target/file.txt \
-d /path/to/decoy/file.txt \
-db /path/to/database/file.fasta \
-y /path/to/parameter/file.yaml \
-o /path/to/output/directory/
Or from combined files like an mzIdentML file:
protein_inference_cli.py \
-f /path/to/target/file.mzid \
-db /path/to/database/file.fasta \
-y /path/to/parameter/file.yaml \
-o /path/to/output/directory/
Running with multiple input target/decoy files:
protein_inference_cli.py \
-t /path/to/target/file1.txt /path/to/target/file2.txt \
-d /path/to/decoy/file1.txt /path/to/decoy/file2.txt \
-db /path/to/database/file.fasta \
-y /path/to/parameter/file.yaml \
-o /path/to/output/directory/
Or from multiple mzIdentML / idXML / pepXML files:
protein_inference_cli.py \
-f /path/to/target/file1.mzid /path/to/target/file2.mzid \
-db /path/to/database/file.fasta \
-y /path/to/parameter/file.yaml \
-o /path/to/output/directory/
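When the file lists are assembled programmatically, the same command can be built from Python. The sketch below only uses flags shown in the help output above; the helper name build_cli_command is hypothetical, and it assumes the target/decoy input style.

```python
import shlex
from typing import List, Optional, Sequence


def build_cli_command(
    target_files: Sequence[str],
    decoy_files: Sequence[str],
    database: Optional[str] = None,
    yaml_params: Optional[str] = None,
    output_directory: Optional[str] = None,
) -> List[str]:
    """Build an argument list for protein_inference_cli.py (hypothetical helper)."""
    command = ["protein_inference_cli.py", "-t", *target_files, "-d", *decoy_files]
    # Optional flags are only added when a value is provided.
    if database:
        command += ["-db", database]
    if yaml_params:
        command += ["-y", yaml_params]
    if output_directory:
        command += ["-o", output_directory]
    return command


command = build_cli_command(
    target_files=["/path/to/target/file1.txt", "/path/to/target/file2.txt"],
    decoy_files=["/path/to/decoy/file1.txt", "/path/to/decoy/file2.txt"],
    database="/path/to/database/file.fasta",
)
# Print the command as a single shell-quoted string for inspection.
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) once the paths point at real files.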
Running Within Python
To run within a Python console, see the following example:
from pyproteininference.pipeline import ProteinInferencePipeline
yaml_params = "/path/to/yaml/params.yaml"
database = "/path/to/database/file.fasta"
### target_files can either be a list of files or one file
target_files = ["/path/to/target1.txt","/path/to/target2.txt"]
### decoy_files can either be a list of files or one file
decoy_files = ["/path/to/decoy1.txt","/path/to/decoy2.txt"]
output_directory_name = "/path/to/output/directory/"
pipeline = ProteinInferencePipeline(parameter_file=yaml_params,
                                    database_file=database,
                                    target_files=target_files,
                                    decoy_files=decoy_files,
                                    combined_files=None,
                                    output_directory=output_directory_name)
# Calling .execute() will initiate the pipeline with the given data
pipeline.execute()
Or running mzIdentML files within Python:
from pyproteininference.pipeline import ProteinInferencePipeline
yaml_params = "/path/to/yaml/params.yaml"
database = "/path/to/database/file.fasta"
### combined_files can either be a list of files or one file
mzid_files = ["/path/to/file1.mzid","/path/to/file2.mzid"]
output_directory_name = "/path/to/output/directory/"
pipeline = ProteinInferencePipeline(parameter_file=yaml_params,
                                    database_file=database,
                                    target_files=None,
                                    decoy_files=None,
                                    combined_files=mzid_files,
                                    output_directory=output_directory_name)
# Calling .execute() will initiate the pipeline with the given data
pipeline.execute()
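If the input files live in a directory, the file lists passed to the pipeline can be gathered first. This is a minimal sketch that mimics the .txt/.tsv selection described for the directory flags; collect_psm_files is a hypothetical helper and is not part of the package.

```python
from pathlib import Path
from typing import List, Sequence


def collect_psm_files(
    directory: str, suffixes: Sequence[str] = (".txt", ".tsv")
) -> List[str]:
    """Return a sorted list of .txt/.tsv file paths found directly in a directory.

    Hypothetical helper for illustration only.
    """
    return sorted(
        str(path)
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() in suffixes
    )
```

The result could then be used in the examples above, for instance target_files=collect_psm_files("/path/to/target/directory").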
Running the Heuristic Method
NOTE: The Heuristic Method is experimental and has not yet been extensively tested on multiple datasets. Check back for updates on this tool.
Py Protein Inference also has a built-in Heuristic that runs through four inference methods (Inclusion, Exclusion, Parsimony, and Peptide Centric) and selects a recommended method for your given dataset.
By default, all four result files will be written, and the optimal method will be highlighted to the user.
The Heuristic method also outputs a density plot that showcases all the inference methods compared to one another to gain further insight. For more information on the Heuristic Method see the Heuristic algorithm section.
Running the Heuristic Method via the Command Line
python protein_inference_heuristic_cli.py --help
This will return the help prompt for the tool.
If this does not work, download protein_inference_heuristic_cli.py from the repository and call it with python using the full path to the script:
python /path/to/directory/pyproteininference/scripts/protein_inference_heuristic_cli.py --help
Command line options are as follows:
cli$ python protein_inference_heuristic_cli.py --help
usage: protein_inference_heuristic_cli.py [-h] [-t FILE [FILE ...]]
[-d FILE [FILE ...]]
[-f FILE [FILE ...]] [-o DIR]
[-l FILE] [-a DIR] [-b DIR] [-c DIR]
[-db FILE] [-y FILE] [-p] [-i]
[-r FILE] [-m FLOAT] [-u STR]
Protein Inference Heuristic
optional arguments:
-h, --help show this help message and exit
-t FILE [FILE ...], --target FILE [FILE ...]
Input target psm output from percolator. Can either
input one file or a list of files.
-d FILE [FILE ...], --decoy FILE [FILE ...]
Input decoy psm output from percolator. Can either
input one file or a list of files.
-f FILE [FILE ...], --combined_files FILE [FILE ...]
Input combined psm output from percolator. This should
contain Target and Decoy PSMS. Can either input one
file or a list of files.
-o DIR, --output DIR Result Directory to write to - the name of file will
be determined by parameters selected and tag
parameter. If this option is not set, will write
results to current working directory.
-l FILE, --output_filename FILE
Filename to write results to. Can be left blank. If
this flag is left blank the filename will be
automatically generated. If set this flag will
override -o.
-a DIR, --target_directory DIR
Directory that contains either .txt or .tsv input
target psm data. Make sure the directory ONLY contains
result files.
-b DIR, --decoy_directory DIR
Directory that contains either .txt or .tsv input
decoy psm data. Make sure the directory ONLY contains.
result files.
-c DIR, --combined_directory DIR
Directory that contains either .txt or .tsv input data
with targets/decoys combined. Make sure the directory
ONLY contains result files.
-db FILE, --database FILE
Path to the fasta formatted database used in the MS
search. This is optional. If not set, will use the
proteins only in the input files.
-y FILE, --yaml_params FILE
Path to a Protein Inference Yaml Parameter File. If
this is not set, default parameters will be used.
-p, --skip_append_alt
Advanced usage only. If this flag is set, will skip
adding alternative proteins to each PSM from the
database digest. If this flag is not set, the
peptide/protein mapping will be taken from database
digest and appended to the mapping present in the
input files.
-i, --id_splitting Advanced usage only. If set this flag will split
protein identifiers.If not set, this flag will not
split protein identifiers.Sometimes the fasta database
protein IDs are formatted as: 'sp|ARAF_HUMAN|P10398'.
While protein IDs in the input files are formatted as
'ARAF_HUMAN|P10398'. Setting This flag will split off
the front 'sp|' or 'tr|' from the database protein
identifiers.
-r FILE, --pdf_filename FILE
PDF Filepath to write the Heuristic plot to after
Heuristic Scoring. If not set, writes the file with
filename heuristic_plot.pdf to directory set in -o. If -o is
not set, will write the file to current working
directory.
-m FLOAT, --fdr_threshold FLOAT
The FDR threshold to use in the Heuristic Method.
Defaults to 0.05 if not set.
-u STR, --output_type STR
The type of output to be written. Can either be 'all'
or 'optimal'. If set to 'all' will output all
inference results. If set to 'optimal' will output
only the result selected by the heuristic method. If
left blank this will default to 'all'.
Input options are the same as for the standard protein_inference_cli.py, with the addition of three optional inputs:
1. -r
The filepath that the density plot will be written to after the heuristic method has been run. If this is left blank, the plot will be written to the standard output directory with the name heuristic_plot.pdf.
2. -m
The FDR threshold to use in the Heuristic Method. The method will use values from 0 to the FDR threshold. If this value is left blank, it will be set to 0.05.
3. -u
The type of output to be written after the heuristic method is complete, either 'all' or 'optimal'. If 'all' is selected, all results are written and the optimal result will have the string "optimal_method" spliced into its filename. If 'optimal' is selected, only the result selected by the heuristic method is written.
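The default location of the plot follows from the -r, -o, and working directory rules described above. The sketch below expresses that fallback order; resolve_plot_path is a hypothetical helper written for illustration and is not the package's own logic.

```python
import os
from typing import Optional


def resolve_plot_path(
    pdf_filename: Optional[str] = None, output_directory: Optional[str] = None
) -> str:
    """Return where the heuristic plot would be written under the documented defaults.

    Hypothetical helper for illustration only.
    """
    # An explicit -r value is used as given.
    if pdf_filename:
        return pdf_filename
    # Otherwise fall back to the -o directory, then the current working directory.
    base_directory = output_directory if output_directory else os.getcwd()
    return os.path.join(base_directory, "heuristic_plot.pdf")


print(resolve_plot_path(output_directory="/path/to/output/directory/"))
```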
You can run the tool as follows:
protein_inference_heuristic_cli.py \
-t /path/to/target/file.txt \
-d /path/to/decoy/file.txt \
-db /path/to/database/file.fasta \
-y /path/to/parameter/file.yaml \
-o /path/to/output/directory/ \
-r /path/to/pdf/file.pdf \
-m 0.05
Running with multiple input target/decoy files:
protein_inference_heuristic_cli.py \
-t /path/to/target/file1.txt /path/to/target/file2.txt \
-d /path/to/decoy/file1.txt /path/to/decoy/file2.txt \
-db /path/to/database/file.fasta \
-y /path/to/parameter/file.yaml \
-o /path/to/output/directory/ \
-r /path/to/pdf/file.pdf \
-m 0.05
Running the Heuristic Method via Python
To run within a Python console, see the following example:
from pyproteininference.heuristic import HeuristicPipeline
yaml_params = "/path/to/yaml/params.yaml"
database = "/path/to/database/file.fasta"
### target_files can either be a list of files or one file
target_files = ["/path/to/target1.txt","/path/to/target2.txt"]
### decoy_files can either be a list of files or one file
decoy_files = ["/path/to/decoy1.txt","/path/to/decoy2.txt"]
output_directory_name = "/path/to/output/directory/"
pdf_filename = "/path/to/output/directory/heuristic_plot.pdf"
hp = HeuristicPipeline(parameter_file=yaml_params,
                       database_file=database,
                       target_files=target_files,
                       decoy_files=decoy_files,
                       combined_files=None,
                       output_directory=output_directory_name,
                       pdf_filename=pdf_filename,
                       output_type="all")
# Calling .execute() will initiate the heuristic pipeline with the given data
# The suggested method will be output in the console and the suggested method results will be written into the output_directory
hp.execute(fdr_threshold=0.05)
# The optimal inference method and density plot can be generated separately as well with the following to specify thresholds directly:
hp.determine_optimal_inference_method(false_discovery_rate_threshold=0.05,
                                      upper_empirical_threshold=1,
                                      lower_empirical_threshold=0.5,
                                      pdf_filename=None)
Heuristic Output Example
Console Output
Console Output is as follows and indicates the recommended method at the end:
2022-05-12 17:28:38,413 - pyproteininference.heuristic - INFO - Heuristic Scores
2022-05-12 17:28:38,413 - pyproteininference.heuristic - INFO - {'inclusion': 1.2145313335009247, 'exclusion': 1.053616485888155, 'parsimony': 0.5416878942666304, 'peptide_centric': 0.24465822235367252}
2022-05-12 17:28:38,413 - pyproteininference.heuristic - INFO - Either parsimony 0.5416878942666304 or peptide centric 0.24465822235367252 pass empirical threshold 0.5. Selecting the best method of the two.
2022-05-12 17:28:38,413 - pyproteininference.heuristic - INFO - Method peptide_centric selected with the heuristic algorithm
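If the console output is captured to a file, the score dictionary can be recovered from the log line for later inspection. This sketch assumes the log format shown above, with the scores printed as a Python dict literal; parse_heuristic_scores is a hypothetical helper.

```python
import ast
from typing import Dict

# Log line copied from the console output above.
log_line = (
    "2022-05-12 17:28:38,413 - pyproteininference.heuristic - INFO - "
    "{'inclusion': 1.2145313335009247, 'exclusion': 1.053616485888155, "
    "'parsimony': 0.5416878942666304, 'peptide_centric': 0.24465822235367252}"
)


def parse_heuristic_scores(line: str) -> Dict[str, float]:
    """Extract the score dictionary from a heuristic log line (hypothetical helper)."""
    start = line.index("{")
    end = line.rindex("}") + 1
    # literal_eval safely parses the dict literal without executing code.
    return ast.literal_eval(line[start:end])


scores = parse_heuristic_scores(log_line)
for method, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{method}: {score:.4f}")
```

The sort is only for readability; the selected method is the one reported in the final log line, not necessarily the one with the lowest score.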
Heuristic Density Plot Output
Below is an example of a Heuristic Density plot. The plot shows, for each inference method, the distribution of the number of standard deviations from the mean (of identified proteins at a specified FDR) over a range of FDRs from 0 to the false discovery rate threshold (100 FDRs are incrementally selected in the range [0, FDR threshold]). In general, the closer the peak of a distribution is to 0, the more likely the associated method is to be selected as the recommended method. For more information on the specifics of the Heuristic Algorithm, see Heuristic Algorithm Description.
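The quantity behind the plot can be sketched for a single FDR value. The example below is a rough illustration under stated assumptions, not the package's implementation: the protein counts are made-up toy numbers, the grid is assumed to include both endpoints, and the sample standard deviation is assumed. The names fdr_grid and deviations_from_mean are hypothetical.

```python
import statistics
from typing import Dict, List


def fdr_grid(fdr_threshold: float, n: int = 100) -> List[float]:
    """Return n evenly spaced FDR values in [0, fdr_threshold] (endpoints assumed inclusive)."""
    return [fdr_threshold * i / (n - 1) for i in range(n)]


def deviations_from_mean(protein_counts: Dict[str, int]) -> Dict[str, float]:
    """Number of standard deviations each method's protein count lies from the mean."""
    values = list(protein_counts.values())
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)  # sample standard deviation (an assumption)
    if stdev == 0:
        return {method: 0.0 for method in protein_counts}
    return {method: (count - mean) / stdev for method, count in protein_counts.items()}


# Toy counts of identified proteins at one FDR value (illustrative only).
toy_counts = {"inclusion": 120, "exclusion": 80, "parsimony": 100, "peptide_centric": 100}
grid = fdr_grid(0.05)
deviations = deviations_from_mean(toy_counts)
for method, value in deviations.items():
    print(f"{method}: {value:+.3f}")
```

Repeating this calculation at every FDR in the grid gives one distribution of values per method, which is what a density plot of this kind would display.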