YAML Parameter File Outline
The YAML parameter file is the central location for all configuration of a given Protein Inference run. Its parameters are summarized below:
Note: These parameters are all optional. Please see the section Default Parameters for more information on defaults.
General
| Parameter | Description | Type |
|---|---|---|
| export | Export type. Can be one of: peptides, psms, psm_ids, long, q_value, q_value_all, q_value_comma_sep, leads, all, comma_sep. Suggested types are peptides, psms, and psm_ids, as these produce square output. If there are multiple proteins per group, these three types report the leads only. The other types report on the peptide level, with slightly different formats, and differ in whether they include leads only or all proteins. See here for an in-depth explanation of Export Types. | String |
| fdr | False Discovery Rate threshold used to mark results as significant. Ex. 0.01 for 1% FDR. | Numeric |
| picker | True/False on whether to run the Protein Picker algorithm. For more info click here. | Bool |
| tag | A String tag that will be written into the result files. Ex. example_tag. | String |

Data Restriction
| Parameter | Description | Type |
|---|---|---|
| pep_restriction | Posterior Error Probability threshold to filter on (e.g. 0.9). In this case PSMs with PEP values greater than 0.9 would be removed from the input. If PEP values are not in the input, please use None. | Numeric |
| peptide_length_restriction | Peptide length to filter on (e.g. 7). If no filter is wanted, please use None. | Int |
| q_value_restriction | Q Value threshold to filter on (e.g. 0.2). In this case PSMs with Q Values greater than 0.2 would be removed from the input. If Q Values are not in the input, please use None. | Numeric |
| custom_restriction | Custom value to filter on (e.g. 5). In this case PSMs with a custom value greater than or less than 5 (depending on psm_score_type) would be removed from the input. If not using a custom score, please use None. NOTE: If a higher score is "better" for your score, please set psm_score_type to additive. If a lower score is "better", please set psm_score_type to multiplicative. | Numeric |

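
The restrictions above can be pictured as a simple pre-filter over PSMs. The following is a minimal sketch for illustration only, not the package's implementation: the `Psm` record and the `restrict_psms` function are hypothetical names, and the sketch assumes that peptides shorter than `peptide_length_restriction` are the ones removed and that `None` disables a filter.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Psm:
    """Hypothetical minimal PSM record, used only for this illustration."""
    peptide: str
    pep: float      # posterior error probability
    q_value: float


def restrict_psms(
    psms: List[Psm],
    pep_restriction: Optional[float] = 0.9,
    peptide_length_restriction: Optional[int] = 7,
    q_value_restriction: Optional[float] = 0.005,
) -> List[Psm]:
    """Keep only the PSMs that pass every restriction that is not None."""
    kept = []
    for psm in psms:
        # PEP above the threshold: removed from the input.
        if pep_restriction is not None and psm.pep > pep_restriction:
            continue
        # Q Value above the threshold: removed from the input.
        if q_value_restriction is not None and psm.q_value > q_value_restriction:
            continue
        # Assumption: peptides shorter than the limit are removed.
        if (peptide_length_restriction is not None
                and len(psm.peptide) < peptide_length_restriction):
            continue
        kept.append(psm)
    return kept
```

Passing `None` for all three restrictions returns the input unchanged, which mirrors the "please use None" guidance in the table.
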
Score
| Parameter | Description | Type |
|---|---|---|
| protein_score | One of the following: multiplicative_log, best_peptide_per_protein, top_two_combined, additive, iterative_downweighted_log, downweighted_multiplicative_log, geometric_mean. Recommended: multiplicative_log. | String |
| psm_score | PSM score to use for Protein Scoring. If using Percolator output as input, this would be either posterior_error_prob or q-value. The string typed here should match the column in your input files EXACTLY. If using a custom score, it will be filtered according to the value in custom_restriction. | String |
| psm_score_type | The type of score that the psm_score parameter is. Either multiplicative or additive. If a larger PSM score is "better", then input additive (e.g. Mascot Ion Score, Xcorr, Percolator Score). If a smaller PSM score is "better", then input multiplicative (e.g. Q Value, Posterior Error Probability). See below for more information. | String |

- The protein_score, psm_score, and psm_score_type parameters must be compatible.
- If using a PSM score (psm_score parameter) where the lower the score the better (e.g. posterior_error_prob or q-value), then any protein_score can be used except additive. psm_score_type must also be set to multiplicative.
- If using a PSM score (psm_score parameter) where the higher the score the better (e.g. Percolator Score, Mascot Ion Score, Xcorr), then protein_score and psm_score_type must both be additive. Note that Percolator Score is the column named psm_score in the tab-delimited Percolator output.
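
The compatibility rules above can be expressed as a small check. This is a sketch under the rules as stated in this section, not code taken from the package; `check_score_compatibility` is a hypothetical helper name.

```python
# Protein score names as listed in the Score table above.
PROTEIN_SCORES = {
    "multiplicative_log",
    "best_peptide_per_protein",
    "top_two_combined",
    "additive",
    "iterative_downweighted_log",
    "downweighted_multiplicative_log",
    "geometric_mean",
}


def check_score_compatibility(protein_score: str, psm_score_type: str) -> None:
    """Raise ValueError if the two settings violate the rules above."""
    if protein_score not in PROTEIN_SCORES:
        raise ValueError(f"unknown protein_score: {protein_score!r}")
    if psm_score_type == "multiplicative":
        # Lower PSM score is better: every protein_score except additive.
        if protein_score == "additive":
            raise ValueError(
                "protein_score 'additive' requires psm_score_type 'additive'"
            )
    elif psm_score_type == "additive":
        # Higher PSM score is better: protein_score must also be additive.
        if protein_score != "additive":
            raise ValueError(
                "psm_score_type 'additive' requires protein_score 'additive'"
            )
    else:
        raise ValueError(f"unknown psm_score_type: {psm_score_type!r}")
```

For example, `check_score_compatibility("multiplicative_log", "multiplicative")` passes silently, while `check_score_compatibility("additive", "multiplicative")` raises a `ValueError`.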
Identifiers
| Parameter | Description | Type |
|---|---|---|
| decoy_symbol | Symbol within decoy identifiers that distinguishes them from targets (e.g. "##" or "__decoy___"). This is important for Protein Picker and FDR calculation. | String |
| isoform_symbol | Symbol that is present in isoform proteins only (e.g. "-"). See below for more information. | String |
| reviewed_identifier_symbol | Identifier used to distinguish a reviewed from an unreviewed identifier (e.g. "sp\|"). See below for more information. | String |

- For the decoy_symbol, given an example target protein ex|protein, its decoy counterpart could be any of the following: ##ex|##protein, ##ex|protein, decoy_ex|protein. The decoy symbol just needs to be present within the string for the identifier to be considered a decoy.
- The isoform_symbol and reviewed_identifier_symbol parameters are used to assign priority in certain algorithms such as parsimony. For example, if an analysis contains canonical proteins, isoform proteins, and reviewed/unreviewed proteins, the priority would be established as follows: Reviewed Canonical, Reviewed Isoform, Unreviewed. This means that if two proteins map to the same peptides, the algorithm has to decide which one to pick, and it uses this priority to pick the protein lead to report.
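
The decoy test and the priority ordering described above can be sketched as follows. This is an illustration, not the package's code: the function names are hypothetical, the identifiers in the usage note are made up, and the sketch assumes each symbol is matched as a plain substring of the identifier.

```python
from typing import Iterable


def is_decoy(identifier: str, decoy_symbol: str = "##") -> bool:
    """A protein is a decoy if the decoy symbol appears anywhere in its identifier."""
    return decoy_symbol in identifier


def priority(
    identifier: str,
    isoform_symbol: str = "-",
    reviewed_identifier_symbol: str = "sp|",
) -> int:
    """Return 0 for reviewed canonical, 1 for reviewed isoform, 2 for unreviewed."""
    # Assumption: substring matching is enough to classify an identifier.
    if reviewed_identifier_symbol not in identifier:
        return 2
    return 1 if isoform_symbol in identifier else 0


def pick_lead(identifiers: Iterable[str]) -> str:
    """Among proteins mapping to the same peptides, pick the highest priority one."""
    return min(identifiers, key=priority)
```

With the made-up identifiers `tr|A0A000|EX_HUMAN`, `sp|P00001-2|EX_HUMAN`, and `sp|P00001|EX_HUMAN`, `pick_lead` returns `sp|P00001|EX_HUMAN`, the reviewed canonical entry.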
Inference
| Parameter | Description | Type |
|---|---|---|
| inference_type | The inference procedure to apply to the analysis. This can be parsimony, inclusion, exclusion, peptide_centric, or first_protein. Please see here for more information on the inference types. | String |
| grouping_type | How to group proteins for a given inference_type. This can be subset_peptides, shared_peptides, or None. Typically subset_peptides is used. This parameter only affects grouped proteins and has no impact on protein leads. | String |

Digest
| Parameter | Description | Type |
|---|---|---|
| digest_type | The enzyme used for digestion in the MS searches (e.g. trypsin). For reference, the database digestion is handled with pyteomics. Can be any expasy rule as defined here. Common examples include: trypsin, chymotrypsin high specificity, chymotrypsin low specificity, lysc. | String |
| missed_cleavages | The number of missed cleavages allowed in the MS searches (e.g. 2). | Int |

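
To illustrate what missed_cleavages controls, here is a minimal standard-library sketch of an in silico digest. It is not how the package digests the database (that is handled with pyteomics and its expasy rules); the `digest` function is hypothetical and uses a deliberately simplified trypsin rule (cut after K or R unless the next residue is P).

```python
import re
from typing import Set


def digest(sequence: str, missed_cleavages: int = 0) -> Set[str]:
    """Return peptides from a simplified tryptic digest of `sequence`."""
    # Cleavage positions: just after each K or R that is not followed by P.
    sites = [m.end() for m in re.finditer(r"[KR](?!P)", sequence)]
    # Fragment boundaries, including both ends of the protein.
    bounds = [0] + [s for s in sites if s < len(sequence)] + [len(sequence)]
    peptides = set()
    for i in range(len(bounds) - 1):
        # Joining up to `missed_cleavages` extra fragments skips that many sites.
        last = min(i + missed_cleavages + 2, len(bounds))
        for j in range(i + 1, last):
            peptides.add(sequence[bounds[i]:bounds[j]])
    return peptides
```

With the toy sequence `AAKGGRCC`, `missed_cleavages=0` yields `AAK`, `GGR`, and `CC`; raising it to 1 additionally yields `AAKGGR` and `GGRCC`.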
Parsimony
These parameters are only used if parsimony is selected as inference_type.
| Parameter | Description | Type |
|---|---|---|
| lp_solver | This can be one of: pulp or None. This determines which linear program solver is used. Please see here for more information on lp solvers. Input None if not running parsimony. If running parsimony, this needs to be set to pulp. | String |
| shared_peptides | How to assign shared peptides for parsimony. Can be one of: all or best. all assigns shared peptides to all possible proteins in the output. best assigns shared peptides to the best scoring protein, which is a "winner take all" approach. This is specific to the parsimony inference type. | String |

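
The difference between all and best can be sketched as below. This is an illustration rather than the package's implementation: `assign_shared_peptides` is a hypothetical function, and it assumes a higher protein score is better and that ties go to the protein seen first.

```python
from typing import Dict, Set


def assign_shared_peptides(
    protein_to_peptides: Dict[str, Set[str]],
    protein_scores: Dict[str, float],
    mode: str = "all",
) -> Dict[str, Set[str]]:
    """Assign peptides to proteins using the 'all' or 'best' strategy."""
    if mode == "all":
        # Every protein keeps every peptide it maps to, shared or not.
        return {prot: set(peps) for prot, peps in protein_to_peptides.items()}
    if mode == "best":
        # Winner take all: each peptide goes to its best scoring protein.
        owner: Dict[str, str] = {}
        for prot, peps in protein_to_peptides.items():
            for pep in peps:
                current = owner.get(pep)
                # Assumption: higher protein score is better.
                if current is None or protein_scores[prot] > protein_scores[current]:
                    owner[pep] = prot
        return {
            prot: {pep for pep in peps if owner[pep] == prot}
            for prot, peps in protein_to_peptides.items()
        }
    raise ValueError(f"shared_peptides must be 'all' or 'best', got {mode!r}")
```

If protein A (score 10) maps to peptides p1 and p2 and protein B (score 5) maps to p2 and p3, then best leaves A with p1 and p2 and B with only p3, while all leaves both mappings unchanged.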
Peptide Centric
These parameters are only used if peptide_centric is selected as inference_type.
| Parameter | Description | Type |
|---|---|---|
| max_identifiers | The maximum number of proteins a peptide is allowed to map to (e.g. 5). This serves to limit the number of protein groups that can be created due to highly homologous peptides. | Int |

Default Parameters
    parameters:
      general:
        export: peptides
        fdr: 0.01
        picker: True
        tag: py_protein_inference
      data_restriction:
        pep_restriction: 0.9
        peptide_length_restriction: 7
        q_value_restriction: 0.005
        custom_restriction: None
      score:
        protein_score: multiplicative_log
        psm_score: posterior_error_prob
        psm_score_type: multiplicative
      identifiers:
        decoy_symbol: "##"
        isoform_symbol: "-"
        reviewed_identifier_symbol: "sp|"
      inference:
        inference_type: peptide_centric
        grouping_type: shared_peptides
      digest:
        digest_type: trypsin
        missed_cleavages: 3
      parsimony:
        lp_solver: pulp
        shared_peptides: all
      peptide_centric:
        max_identifiers: 5
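
Because every parameter is optional, a partial parameter file only needs to list the values that differ from the defaults. The sketch below shows one way such a merge could work; it is an assumption about the behavior, not the package's loader. `DEFAULTS` mirrors the values listed above as a plain Python dictionary, and `merge_with_defaults` is a hypothetical helper.

```python
import copy
from typing import Any, Dict

Params = Dict[str, Dict[str, Any]]

# Defaults copied from the Default Parameters listing above.
DEFAULTS: Params = {
    "general": {"export": "peptides", "fdr": 0.01, "picker": True,
                "tag": "py_protein_inference"},
    "data_restriction": {"pep_restriction": 0.9,
                         "peptide_length_restriction": 7,
                         "q_value_restriction": 0.005,
                         "custom_restriction": None},
    "score": {"protein_score": "multiplicative_log",
              "psm_score": "posterior_error_prob",
              "psm_score_type": "multiplicative"},
    "identifiers": {"decoy_symbol": "##", "isoform_symbol": "-",
                    "reviewed_identifier_symbol": "sp|"},
    "inference": {"inference_type": "peptide_centric",
                  "grouping_type": "shared_peptides"},
    "digest": {"digest_type": "trypsin", "missed_cleavages": 3},
    "parsimony": {"lp_solver": "pulp", "shared_peptides": "all"},
    "peptide_centric": {"max_identifiers": 5},
}


def merge_with_defaults(user_params: Params) -> Params:
    """Overlay user-supplied values on a copy of the defaults."""
    merged = copy.deepcopy(DEFAULTS)
    for section, values in user_params.items():
        if section not in merged:
            raise KeyError(f"unknown section: {section!r}")
        for key, value in values.items():
            if key not in merged[section]:
                raise KeyError(f"unknown parameter: {section}.{key}")
            merged[section][key] = value
    return merged
```

For instance, `merge_with_defaults({"inference": {"inference_type": "parsimony"}})` changes only the inference type and leaves every other value at its default.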