Yaml Parameter File Outline

The Yaml Parameter File is the central location for all configuration of a given Protein Inference run. Its parameters are summarized below. Note: All of these parameters are optional. Please see the section Default Parameters for more information on defaults.

General

Parameter Description Type
export Export type. Can be one of: peptides, psms, psm_ids, long, q_value, q_value_all, q_value_comma_sep, leads, all, comma_sep. The suggested types are peptides, psms, and psm_ids, as these produce square output. If there are multiple proteins per group, these three types report the leads only, with two exceptions: if inference_type is peptide_centric, or if inference_type is parsimony and grouping_type is parsimonious_grouping, the output contains a ;-separated list of the proteins in the group. The other types report on the peptide level, with slightly different formats, and differ in whether they include leads only or all proteins. See here for an in-depth explanation of Export Types. String
fdr False Discovery Rate threshold at which results are marked as significant. Ex. 0.01 for 1% FDR. Numeric
picker Whether to run the Protein Picker algorithm (True or False). For more info click here. Bool
tag A String tag that will be written into the result files. Ex. example_tag. String
xml_parser The library used to read idXML, mzID, or pepXML files. Can be either openms or pyteomics. Default: openms. String

Data Restriction

Parameter Description Type
pep_restriction Posterior Error Probability (PEP) threshold to filter on (e.g. 0.9). In this case, PSMs with PEP values greater than 0.9 are removed from the input. If PEP values are not in the input, use None. Numeric
peptide_length_restriction Peptide length to filter on (e.g. 7). If no filter is wanted, use None. Int
q_value_restriction Q value threshold to filter on (e.g. 0.2). In this case, PSMs with Q values greater than 0.2 are removed from the input. If Q values are not in the input, use None. Numeric
custom_restriction Custom score threshold to filter on (e.g. 5). In this case, PSMs with a custom score greater than or less than 5 are removed from the input, depending on psm_score_type. If not using a custom score, use None. NOTE: If a higher score is "better" for your score, set psm_score_type to additive. If a lower score is "better", set psm_score_type to multiplicative. Numeric
max_allowed_alternative_proteins The maximum number of proteins a peptide is allowed to map to. Default: 50. Int
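The PEP and Q value restrictions above can be read as simple threshold filters. The following is a minimal sketch of that reading, assuming each PSM is a plain dictionary with the hypothetical keys pep and q_value. The function name restrict_psms and the example data are illustrative only and are not part of the package.

```python
def restrict_psms(psms, pep_restriction=None, q_value_restriction=None):
    """Drop PSMs whose PEP or Q value exceeds the given threshold.

    A threshold of None means the corresponding filter is skipped,
    mirroring the "use None" convention in the table above.
    """
    kept = []
    for psm in psms:
        if pep_restriction is not None and psm["pep"] > pep_restriction:
            continue
        if q_value_restriction is not None and psm["q_value"] > q_value_restriction:
            continue
        kept.append(psm)
    return kept


# Hypothetical PSMs, for illustration only.
example_psms = [
    {"peptide": "PEPTIDEK", "pep": 0.01, "q_value": 0.001},
    {"peptide": "LOWCONFR", "pep": 0.95, "q_value": 0.10},
    {"peptide": "MIDCONFK", "pep": 0.50, "q_value": 0.30},
]

filtered = restrict_psms(example_psms, pep_restriction=0.9, q_value_restriction=0.2)
print([psm["peptide"] for psm in filtered])
```

With pep_restriction set to 0.9 and q_value_restriction set to 0.2, only the first PSM passes both filters. Passing None for both leaves the input unchanged.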

Score

Parameter Description Type
protein_score One of the following: multiplicative_log, best_peptide_per_protein, top_two_combined, additive, iterative_downweighted_log, downweighted_multiplicative_log, geometric_mean. Recommended: multiplicative_log. String
psm_score PSM score to use for protein scoring. If using Percolator output as input, this would be either posterior_error_prob or q-value. The string typed here must match the column/attribute in your input files EXACTLY. For more info on selecting PSM scores from your input files, please see input file examples. String
psm_score_type The type of score that the psm_score parameter is. Either multiplicative or additive. If a larger PSM score is "better", then input additive (e.g. Mascot Ion Score, Xcorr, Percolator Score). If a smaller PSM score is "better", then input multiplicative (e.g. Q Value, Posterior Error Probability). See below for more information. String

Extra Score information

  1. The protein_score, psm_score, and psm_score_type parameters must be compatible.
  2. If using a PSM score (psm_score parameter) where the lower the score the better (e.g. posterior_error_prob or q-value), then any protein_score can be used except additive. psm_score_type must also be set to multiplicative.
  3. If using a PSM score (psm_score parameter) where the higher the score the better (e.g. Percolator Score, Mascot Ion Score, Xcorr), then protein_score and psm_score_type must both be additive. Note that Percolator Score is the column named psm_score in the tab delimited Percolator output.
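The compatibility rules above can be expressed as a small check. This is a sketch of the rules as stated in this section, not the package's own validation code. The function name scores_are_compatible is hypothetical.

```python
# The protein_score values listed in the Score table.
PROTEIN_SCORES = {
    "multiplicative_log",
    "best_peptide_per_protein",
    "top_two_combined",
    "additive",
    "iterative_downweighted_log",
    "downweighted_multiplicative_log",
    "geometric_mean",
}


def scores_are_compatible(protein_score, psm_score_type):
    """Return True if protein_score may be combined with psm_score_type."""
    if protein_score not in PROTEIN_SCORES:
        return False
    if psm_score_type == "multiplicative":
        # Lower PSM score is better: anything except additive is allowed.
        return protein_score != "additive"
    if psm_score_type == "additive":
        # Higher PSM score is better: protein_score must also be additive.
        return protein_score == "additive"
    return False


print(scores_are_compatible("multiplicative_log", "multiplicative"))
print(scores_are_compatible("additive", "multiplicative"))
print(scores_are_compatible("additive", "additive"))
```

The three calls cover an allowed lower-is-better combination, the disallowed pairing of additive with multiplicative, and the required pairing for higher-is-better scores.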

Identifiers

Parameter Description Type
decoy_symbol Symbol within decoy identifiers that distinguishes them from targets (e.g. "##", "decoy_", "rev_", "DECOY_"). This is important for Protein Picker and FDR calculation. String
isoform_symbol Symbol that is present in isoform proteins only (e.g. "-"). See below for more information. String
reviewed_identifier_symbol Symbol used to distinguish a reviewed identifier from an unreviewed one (e.g. "sp|"). See below for more information. String

Extra Identifier information

  1. For the decoy_symbol, given a target protein ex|protein, its decoy counterpart could be any of the following: ##ex|##protein, ##ex|protein, decoy_ex|protein. The decoy symbol only needs to be present somewhere within the string for the identifier to be treated as a decoy.
  2. The isoform_symbol and reviewed_identifier_symbol are used to assign priority in certain algorithms such as parsimony. For example, if an analysis contains canonical proteins, isoform proteins, and reviewed/unreviewed proteins, the priority is established as follows: Reviewed Canonical, Reviewed Isoform, Unreviewed. When two proteins map to the same peptides, the algorithm has to decide which one to pick, and it uses this priority to choose the protein lead to report.
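The two points above can be sketched in a few lines. This is an illustration only: it assumes that all three symbols are matched as plain substrings of the identifier, and the function names and example identifiers are hypothetical rather than taken from the package.

```python
def is_decoy(identifier, decoy_symbol="##"):
    """A protein is a decoy if the decoy symbol occurs anywhere in its identifier."""
    return decoy_symbol in identifier


def priority(identifier, isoform_symbol="-", reviewed_identifier_symbol="sp|"):
    """Lower value means higher priority.

    0 = Reviewed Canonical, 1 = Reviewed Isoform, 2 = Unreviewed.
    Substring matching is an assumption made for this sketch.
    """
    reviewed = reviewed_identifier_symbol in identifier
    isoform = isoform_symbol in identifier
    if reviewed and not isoform:
        return 0
    if reviewed:
        return 1
    return 2


print(is_decoy("##ex|##protein"), is_decoy("ex|protein"))

# Hypothetical identifiers that all map to the same peptides.
candidates = ["tr|PROT2|NAME", "sp|PROT1-2|NAME", "sp|PROT1|NAME"]
lead = min(candidates, key=priority)
print(lead)
```

Here the reviewed canonical identifier is chosen as the lead over the reviewed isoform and the unreviewed entry.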

Inference

Parameter Description Type
inference_type The Inference procedure to apply to the analysis. This can be parsimony, inclusion, exclusion, peptide_centric, or first_protein. Please see here for more information on the inference types. String
grouping_type How to group proteins for a given inference_type. This can be subset_peptides, shared_peptides, parsimonious_grouping, or None. Typically subset_peptides or parsimonious_grouping is used. This parameter only affects grouped proteins and has no impact on protein leads. When running parsimony, parsimonious_grouping is suggested if the parsimony groups should appear in the output. String

Digest

Parameter Description Type
digest_type The enzyme used for digestion in the MS searches (e.g. trypsin). For reference, the database digestion is handled with pyteomics. Can be any expasy rule as defined here. Common examples include: trypsin, chymotrypsin high specificity, chymotrypsin low specificity, lysc. String
missed_cleavages The number of missed cleavages allowed in the MS searches (e.g. 2). Int
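To illustrate how digest_type and missed_cleavages interact, here is a minimal standard-library sketch of an in-silico tryptic digest. It uses a simplified trypsin rule (cleave after K or R unless the next residue is P), which is an assumption made for brevity. The package itself delegates digestion to pyteomics and its expasy rules, which are more detailed than this. The function name tryptic_peptides is hypothetical.

```python
import re


def tryptic_peptides(sequence, missed_cleavages=0):
    """Return the set of peptides from a simplified tryptic digest."""
    # Cleavage positions: start of the sequence, then after each K/R not followed by P.
    sites = [0] + [m.end() for m in re.finditer(r"[KR](?!P)", sequence)]
    if sites[-1] != len(sequence):
        sites.append(len(sequence))

    peptides = set()
    for i in range(len(sites) - 1):
        # Join up to missed_cleavages + 1 consecutive fragments.
        last = min(i + missed_cleavages + 2, len(sites))
        for j in range(i + 1, last):
            peptides.add(sequence[sites[i]:sites[j]])
    return peptides


print(sorted(tryptic_peptides("AAKGGRCC", missed_cleavages=0)))
print(sorted(tryptic_peptides("AAKGGRCC", missed_cleavages=1)))
```

Raising missed_cleavages adds longer peptides that span one or more skipped cleavage sites, so the set of candidate peptides grows with this setting.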

Parsimony

These parameters are only used if parsimony is selected as inference_type.

Parameter Description Type
lp_solver The linear program solver to use. Can be one of: pulp or None. Set this to pulp when running parsimony and to None otherwise. String
shared_peptides How to assign shared peptides for parsimony. Can be one of: all or best. all assigns shared peptides to all possible proteins in the output. best assigns shared peptides to the best scoring protein which is a "winner take all" approach. This is specific to the Parsimony Inference type. String
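The difference between all and best can be sketched as follows. This is an illustration under stated assumptions, not the package's parsimony code: proteins and peptides are plain strings, each protein already has a score, and a higher protein score is treated as better. All names and data are hypothetical.

```python
def assign_shared_peptides(protein_peptides, protein_scores, shared_peptides="all"):
    """Assign peptides to proteins under the "all" or "best" strategy."""
    if shared_peptides == "all":
        # Every protein keeps every peptide that maps to it.
        return {prot: set(peps) for prot, peps in protein_peptides.items()}
    if shared_peptides != "best":
        raise ValueError("shared_peptides must be 'all' or 'best'")

    # Invert the mapping so each peptide lists the proteins it maps to.
    peptide_to_proteins = {}
    for prot, peps in protein_peptides.items():
        for pep in peps:
            peptide_to_proteins.setdefault(pep, []).append(prot)

    # Winner take all: each peptide goes to its best scoring protein.
    # Assumption for this sketch: a higher protein score is better.
    assigned = {prot: set() for prot in protein_peptides}
    for pep, prots in peptide_to_proteins.items():
        winner = max(prots, key=lambda p: protein_scores[p])
        assigned[winner].add(pep)
    return assigned


# Hypothetical data: PEP2 is shared between the two proteins.
mapping = {"protein_a": {"PEP1", "PEP2"}, "protein_b": {"PEP2", "PEP3"}}
scores = {"protein_a": 10.0, "protein_b": 5.0}

print(assign_shared_peptides(mapping, scores, "all"))
print(assign_shared_peptides(mapping, scores, "best"))
```

Under all, the shared peptide PEP2 stays with both proteins. Under best, it is assigned only to protein_a, the higher scoring protein in this example.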

Peptide Centric

These parameters are only used if peptide_centric is selected as inference_type.

Parameter Description Type
max_identifiers The maximum number of proteins a peptide is allowed to map to (e.g. 5). This serves to limit the number of protein groups that can be created due to highly homologous peptides. Int

Default Parameters

parameters:
  general:
    export: peptides
    fdr: 0.01
    picker: True
    tag: example_tag
    xml_parser: openms
  data_restriction:
    pep_restriction: 0.9
    peptide_length_restriction: 7
    q_value_restriction: .9
    custom_restriction: None
    max_allowed_alternative_proteins: 50
  score:
    protein_score: best_peptide_per_protein
    psm_score: posterior_error_prob
    psm_score_type: multiplicative
  identifiers:
    decoy_symbol: "##"
    isoform_symbol: "-"
    reviewed_identifier_symbol: "sp|"
  inference:
    inference_type: parsimony
    grouping_type: parsimonious_grouping
  digest:
    digest_type: trypsin
    missed_cleavages: 3
  parsimony:
    lp_solver: pulp
    shared_peptides: all
  peptide_centric:
    max_identifiers: 5
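Because every parameter is optional, a parameter file only needs to list the values that differ from the defaults. The sketch below illustrates that idea by overlaying a partial set of user parameters on a dictionary that mirrors the defaults above. It is a hypothetical helper written for this example and does not show how the package itself loads the file.

```python
import copy

# Mirrors the default parameters listed above.
# custom_restriction is written as None in the file and is represented as Python None here.
DEFAULTS = {
    "general": {"export": "peptides", "fdr": 0.01, "picker": True,
                "tag": "example_tag", "xml_parser": "openms"},
    "data_restriction": {"pep_restriction": 0.9, "peptide_length_restriction": 7,
                         "q_value_restriction": 0.9, "custom_restriction": None,
                         "max_allowed_alternative_proteins": 50},
    "score": {"protein_score": "best_peptide_per_protein",
              "psm_score": "posterior_error_prob",
              "psm_score_type": "multiplicative"},
    "identifiers": {"decoy_symbol": "##", "isoform_symbol": "-",
                    "reviewed_identifier_symbol": "sp|"},
    "inference": {"inference_type": "parsimony",
                  "grouping_type": "parsimonious_grouping"},
    "digest": {"digest_type": "trypsin", "missed_cleavages": 3},
    "parsimony": {"lp_solver": "pulp", "shared_peptides": "all"},
    "peptide_centric": {"max_identifiers": 5},
}


def with_defaults(user_params, defaults=DEFAULTS):
    """Return the defaults with user-supplied values laid on top, section by section."""
    merged = copy.deepcopy(defaults)  # never mutate the shared defaults
    for section, values in user_params.items():
        merged.setdefault(section, {}).update(values)
    return merged


# A user file that only overrides two values.
params = with_defaults({"general": {"fdr": 0.05},
                        "inference": {"inference_type": "inclusion"}})
print(params["general"]["fdr"], params["general"]["export"])
print(params["inference"]["inference_type"])
```

Values that the user does not set, such as export, keep their defaults, and the DEFAULTS dictionary itself is left unchanged.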