Yaml Parameter File Outline
The Yaml Parameter File is the central location for all configuration of a given Protein Inference run. Its parameters are summarized below.
Note: These parameters are all optional. Please see the section Default Parameters for more information on defaults.
General
| Parameter | Description | Type |
| --- | --- | --- |
| export | Export type. One of: peptides, psms, psm_ids, long, q_value, q_value_all, q_value_comma_sep, leads, all, comma_sep. The suggested types are peptides, psms, and psm_ids, as these produce square output. If a group contains multiple proteins, these three types report the lead protein only, with two exceptions: if inference_type is peptide_centric, or if inference_type is parsimony and grouping_type is parsimonious_grouping, the output is a ;-separated list of the proteins in the group. The other types report on the peptide level in slightly different formats and differ in whether they include leads only or all proteins. See here for an in-depth explanation of Export Types. | String |
| fdr | False Discovery Rate threshold used to mark results as significant. E.g. 0.01 for 1% FDR. | Numeric |
| picker | True/False: whether to run the Protein Picker algorithm. For more info click here. | Bool |
| tag | A string tag that will be written into the result files. E.g. example_tag. | String |
| xml_parser | The library used to read idXML, mzID, or pepXML files. Can be either openms or pyteomics. Default: openms. | String |
Data Restriction
| Parameter | Description | Type |
| --- | --- | --- |
| pep_restriction | Posterior Error Probability (PEP) threshold to filter on (e.g. 0.9). In this case PSMs with PEP values greater than 0.9 would be removed from the input. If PEP values are not in the input, please use None. | Numeric |
| peptide_length_restriction | Peptide length to filter on (e.g. 7). If no filter is wanted, please use None. | Int |
| q_value_restriction | Q value threshold to filter on (e.g. 0.2). In this case PSMs with Q values greater than 0.2 would be removed from the input. If Q values are not in the input, please use None. | Numeric |
| custom_restriction | Custom score threshold to filter on (e.g. 5). In this case PSMs with a custom score greater than 5 (when a lower score is "better") or less than 5 (when a higher score is "better") would be removed from the input. If not using a custom score, please use None. NOTE: If a higher score is "better" for your score, please set psm_score_type to additive. If a lower score is "better", please set psm_score_type to multiplicative. | Numeric |
| max_allowed_alternative_proteins | The maximum number of proteins a peptide is allowed to map to. Default: 50. | Int |
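As a rough illustration of how the restriction thresholds above behave, here is a minimal sketch in plain Python. The Psm class and the restrict_psms function are hypothetical names invented for this example and are not the library's own API. It also assumes that peptide_length_restriction acts as a minimum length, which the table does not state explicitly.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Psm:
    """Hypothetical PSM record used only for this illustration."""
    peptide: str
    pep: float        # posterior error probability
    q_value: float


def restrict_psms(psms: List[Psm],
                  pep_restriction: Optional[float] = None,
                  q_value_restriction: Optional[float] = None,
                  peptide_length_restriction: Optional[int] = None) -> List[Psm]:
    """Drop PSMs that fail any active threshold; None disables a filter."""
    kept = []
    for psm in psms:
        # PSMs with PEP greater than the threshold are removed.
        if pep_restriction is not None and psm.pep > pep_restriction:
            continue
        # PSMs with Q value greater than the threshold are removed.
        if q_value_restriction is not None and psm.q_value > q_value_restriction:
            continue
        # Assumption: the length restriction is a minimum peptide length.
        if (peptide_length_restriction is not None
                and len(psm.peptide) < peptide_length_restriction):
            continue
        kept.append(psm)
    return kept


example_psms = [
    Psm("PEPTIDEK", pep=0.01, q_value=0.001),
    Psm("PEPTIDER", pep=0.95, q_value=0.5),   # fails the PEP threshold
    Psm("SHORT", pep=0.01, q_value=0.001),    # fails the length threshold
]
kept = restrict_psms(example_psms, pep_restriction=0.9,
                     q_value_restriction=0.9, peptide_length_restriction=7)
```

Passing None for every threshold leaves the input unchanged, which mirrors the "please use None" guidance in the table.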
Score
| Parameter | Description | Type |
| --- | --- | --- |
| protein_score | One of: multiplicative_log, best_peptide_per_protein, top_two_combined, additive, iterative_downweighted_log, downweighted_multiplicative_log, geometric_mean. Recommended: multiplicative_log. | String |
| psm_score | PSM score to use for protein scoring. If using Percolator output as input, this would be either posterior_error_prob or q-value. The string typed here should match the column/attribute in your input files EXACTLY. For more info on selecting PSM scores from your input files, please see input file examples. | String |
| psm_score_type | The type of score that the psm_score parameter is. Either multiplicative or additive. If a larger PSM score is "better", then input additive (e.g. Mascot Ion Score, Xcorr, Percolator Score). If a smaller PSM score is "better", then input multiplicative (e.g. Q Value, Posterior Error Probability). See below for more information. | String |
- The protein_score, psm_score, and psm_score_type parameters must be compatible.
- If using a PSM score (psm_score parameter) where a lower score is better (e.g. posterior_error_prob or q-value), then any protein_score can be used except additive. psm_score_type must also be set to multiplicative.
- If using a PSM score (psm_score parameter) where a higher score is better (e.g. Percolator Score, Mascot Ion Score, Xcorr), then protein_score and psm_score_type must both be additive. Note that the Percolator Score column is named psm_score in the tab delimited Percolator output.
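The compatibility rules above can be expressed as a small check. The following is a sketch only: check_score_parameters is a hypothetical helper written for this example, not a function provided by the library, and it encodes just the rules stated in this section.

```python
# protein_score values listed in the Score table above.
PROTEIN_SCORES = {
    "multiplicative_log", "best_peptide_per_protein", "top_two_combined",
    "additive", "iterative_downweighted_log",
    "downweighted_multiplicative_log", "geometric_mean",
}


def check_score_parameters(protein_score, psm_score_type):
    """Return a list of problems; an empty list means the pair is compatible."""
    problems = []
    if protein_score not in PROTEIN_SCORES:
        problems.append(f"unknown protein_score: {protein_score!r}")
    if psm_score_type == "multiplicative":
        # Lower PSM score is better: anything except additive is allowed.
        if protein_score == "additive":
            problems.append("protein_score additive requires psm_score_type additive")
    elif psm_score_type == "additive":
        # Higher PSM score is better: protein_score must also be additive.
        if protein_score != "additive":
            problems.append("psm_score_type additive requires protein_score additive")
    else:
        problems.append(f"unknown psm_score_type: {psm_score_type!r}")
    return problems
```

For instance, check_score_parameters("multiplicative_log", "multiplicative") returns an empty list, while pairing best_peptide_per_protein with additive reports a problem.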
Identifiers
| Parameter | Description | Type |
| --- | --- | --- |
| decoy_symbol | Symbol within decoy identifiers that distinguishes them from targets (e.g. "##", "decoy_", "rev_", "DECOY_"). This is important for Protein Picker and FDR calculation. | String |
| isoform_symbol | Symbol that is present in isoform proteins only (e.g. "-"). See below for more information. | String |
| reviewed_identifier_symbol | Identifier used to distinguish reviewed from unreviewed identifiers (e.g. "sp\|"). See below for more information. | String |
- For decoy_symbol: given a target protein ex|protein, its decoy counterpart could be any of the following: ##ex|##protein, ##ex|protein, decoy_ex|protein. The decoy symbol just needs to be present somewhere within the identifier for it to be treated as a decoy.
- isoform_symbol and reviewed_identifier_symbol are used to assign priority in certain algorithms such as parsimony. For example, if a given analysis contains canonical proteins, isoform proteins, and reviewed/unreviewed proteins, the priority is established as follows: Reviewed Canonical, Reviewed Isoform, Unreviewed. If two proteins map to the same peptides, the algorithm has to decide which one to pick, and it uses this priority to choose the protein lead to report.
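The two notes above can be illustrated with a short sketch. The functions is_decoy, priority, and pick_lead are hypothetical helpers written for this example and are not the library's API. The sketch assumes simple substring matching for all three symbols, and the protein identifiers are made up.

```python
def is_decoy(identifier, decoy_symbol="##"):
    """An identifier counts as a decoy if the symbol appears anywhere in it."""
    return decoy_symbol in identifier


def priority(identifier, isoform_symbol="-", reviewed_identifier_symbol="sp|"):
    """Lower is better: 0 reviewed canonical, 1 reviewed isoform, 2 unreviewed."""
    # Assumption: substring matching decides reviewed and isoform status.
    if reviewed_identifier_symbol not in identifier:
        return 2
    if isoform_symbol in identifier:
        return 1
    return 0


def pick_lead(identifiers):
    """Among proteins mapping to the same peptides, pick the highest priority."""
    return min(identifiers, key=priority)


# Made-up identifiers for illustration only.
candidates = [
    "tr|A0A024|A0A024_HUMAN",   # unreviewed
    "sp|P12345-2|PROT_HUMAN",   # reviewed isoform
    "sp|P12345|PROT_HUMAN",     # reviewed canonical
]
lead = pick_lead(candidates)
```

With these inputs the reviewed canonical entry is chosen as the lead, matching the priority order described above.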
Inference
| Parameter | Description | Type |
| --- | --- | --- |
| inference_type | The inference procedure to apply to the analysis. This can be parsimony, inclusion, exclusion, peptide_centric, or first_protein. Please see here for more information on the inference types. | String |
| grouping_type | How to group proteins for a given inference_type. This can be subset_peptides, shared_peptides, parsimonious_grouping, or None. Typically subset_peptides or parsimonious_grouping is used. This parameter only affects grouped proteins and has no impact on protein leads. parsimonious_grouping is suggested if you want to see parsimony groups in the output when running parsimony. | String |
Digest
| Parameter | Description | Type |
| --- | --- | --- |
| digest_type | The enzyme used for digestion in the MS searches (e.g. trypsin). For reference, the database digestion is handled with pyteomics. Can be any expasy rule as defined here. Common examples include: trypsin, chymotrypsin high specificity, chymotrypsin low specificity, lysc. | String |
| missed_cleavages | The number of missed cleavages allowed in the MS searches (e.g. 2). | Int |
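To make the effect of missed_cleavages concrete, here is a minimal, hand-written sketch of an in silico tryptic digest. It does not use pyteomics and is not how the library performs digestion. It assumes a simplified trypsin rule (cleave after K or R unless the next residue is P), and the function name digest is hypothetical.

```python
import re


def digest(sequence, missed_cleavages=2):
    """Return the set of peptides allowing up to `missed_cleavages` skipped sites."""
    # Simplified trypsin rule: split after K or R when the next residue is not P.
    pieces = [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]
    peptides = set()
    for start in range(len(pieces)):
        # Join 1..(missed_cleavages + 1) consecutive fully cleaved pieces.
        for extra in range(missed_cleavages + 1):
            end = start + extra + 1
            if end > len(pieces):
                break
            peptides.add("".join(pieces[start:end]))
    return peptides


fully_cleaved = digest("AAKGGRCC", missed_cleavages=0)
one_missed = digest("AAKGGRCC", missed_cleavages=1)
```

Raising missed_cleavages only adds longer peptides that span uncut sites, so the peptide set grows with the setting.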
Parsimony
These parameters are only used if parsimony is selected as inference_type.
| Parameter | Description | Type |
| --- | --- | --- |
| lp_solver | Determines which linear program solver is used. Can be one of: pulp or None. If running parsimony, this needs to be set to pulp. Input None if not running parsimony. | String |
| shared_peptides | How to assign shared peptides for parsimony. Can be one of: all or best. all assigns shared peptides to all possible proteins in the output. best assigns shared peptides to the best scoring protein, which is a "winner take all" approach. This is specific to the parsimony inference type. | String |
Peptide Centric
These parameters are only used if peptide_centric is selected as inference_type.
| Parameter | Description | Type |
| --- | --- | --- |
| max_identifiers | The maximum number of proteins a peptide is allowed to map to (e.g. 5). This serves to limit the number of protein groups that can be created due to highly homologous peptides. | Int |
Default Parameters
parameters:
  general:
    export: peptides
    fdr: 0.01
    picker: True
    tag: example_tag
    xml_parser: openms
  data_restriction:
    pep_restriction: 0.9
    peptide_length_restriction: 7
    q_value_restriction: .9
    custom_restriction: None
    max_allowed_alternative_proteins: 50
  score:
    protein_score: best_peptide_per_protein
    psm_score: posterior_error_prob
    psm_score_type: multiplicative
  identifiers:
    decoy_symbol: "##"
    isoform_symbol: "-"
    reviewed_identifier_symbol: "sp|"
  inference:
    inference_type: parsimony
    grouping_type: parsimonious_grouping
  digest:
    digest_type: trypsin
    missed_cleavages: 3
  parsimony:
    lp_solver: pulp
    shared_peptides: all
  peptide_centric:
    max_identifiers: 5
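Because every parameter is optional, a parameter file only needs to list the values that differ from the defaults. The sketch below illustrates that idea with plain dictionaries. It is an assumption about the general behaviour, not the library's actual loading code: with_defaults and DEFAULTS are hypothetical names, and only a few sections of the defaults above are reproduced for brevity.

```python
import copy

# A subset of the defaults listed above, mirrored as a nested dict.
DEFAULTS = {
    "general": {"export": "peptides", "fdr": 0.01, "picker": True,
                "tag": "example_tag", "xml_parser": "openms"},
    "score": {"protein_score": "best_peptide_per_protein",
              "psm_score": "posterior_error_prob",
              "psm_score_type": "multiplicative"},
    "inference": {"inference_type": "parsimony",
                  "grouping_type": "parsimonious_grouping"},
}


def with_defaults(user_params, defaults=DEFAULTS):
    """Overlay user-supplied values on a copy of the defaults, section by section."""
    merged = copy.deepcopy(defaults)  # never mutate the shared defaults
    for section, values in user_params.items():
        merged.setdefault(section, {}).update(values)
    return merged


# Only fdr and inference_type are overridden; everything else stays default.
merged = with_defaults({"general": {"fdr": 0.05},
                        "inference": {"inference_type": "inclusion"}})
```

Here merged keeps export as peptides and grouping_type as parsimonious_grouping, while fdr and inference_type take the user-supplied values.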