YAML Parameter File Outline

The YAML parameter file is the central location for all configuration for a given Protein Inference run. Its parameters are summarized below.

Note: All of these parameters are optional. Please see the section Default Parameters for more information on defaults.

General

| Parameter | Description | Type |
| --- | --- | --- |
| export | Export type. One of: peptides, psms, psm_ids, long, q_value, q_value_all, q_value_comma_sep, leads, all, comma_sep. The suggested types are peptides, psms, and psm_ids, as these produce square output. If there are multiple proteins per group, these three types report the leads only. The other types report at the peptide level in slightly different formats and differ in whether they include leads only or all proteins. See here for an in-depth explanation of Export Types. | String |
| fdr | False Discovery Rate threshold used to mark results as significant. Ex. 0.01 for 1% FDR. | Numeric |
| picker | True/False on whether to run the Protein Picker algorithm. For more info click here. | Bool |
| tag | A string tag that will be written into the result files. Ex. example_tag. | String |

Data Restriction

| Parameter | Description | Type |
| --- | --- | --- |
| pep_restriction | Posterior Error Probability (PEP) threshold to filter on (e.g. 0.9). In this case, PSMs with PEP values greater than 0.9 are removed from the input. If PEP values are not in the input, use None. | Numeric |
| peptide_length_restriction | Peptide length to filter on (e.g. 7). If no filter is wanted, use None. | Int |
| q_value_restriction | Q value threshold to filter on (e.g. 0.2). In this case, PSMs with Q values greater than 0.2 are removed from the input. If Q values are not in the input, use None. | Numeric |
| custom_restriction | Custom score threshold to filter on (e.g. 5). In this case, PSMs with a custom score greater than or less than 5 are removed from the input, depending on psm_score_type. If not using a custom score, use None. NOTE: If a higher score is "better" for your score, set psm_score_type to additive. If a lower score is "better", set psm_score_type to multiplicative. | Numeric |
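
The restrictions above can be pictured as a simple pre-filter over the input PSMs. The following is a minimal sketch only, not the library's implementation: the function name `filter_psms` and the dictionary keys `pep`, `q_value`, and `custom` are hypothetical, and Python's `None` stands in for the `None` value described in the table. It also assumes that the direction of the custom filter follows psm_score_type as described in the NOTE.

```python
def filter_psms(psms, pep_restriction=None, q_value_restriction=None,
                custom_restriction=None, psm_score_type="multiplicative"):
    """Return the PSMs that survive the configured restrictions.

    Each PSM is a dict with the hypothetical keys "pep", "q_value", "custom".
    A restriction of None disables that filter.
    """
    kept = []
    for psm in psms:
        # Remove PSMs whose PEP exceeds the PEP threshold.
        if pep_restriction is not None and psm["pep"] > pep_restriction:
            continue
        # Remove PSMs whose q-value exceeds the q-value threshold.
        if q_value_restriction is not None and psm["q_value"] > q_value_restriction:
            continue
        if custom_restriction is not None:
            # Assumption: higher is better for additive scores, so low values are removed.
            if psm_score_type == "additive" and psm["custom"] < custom_restriction:
                continue
            # Assumption: lower is better for multiplicative scores, so high values are removed.
            if psm_score_type == "multiplicative" and psm["custom"] > custom_restriction:
                continue
        kept.append(psm)
    return kept


psms = [
    {"id": "a", "pep": 0.01, "q_value": 0.001, "custom": 12.0},
    {"id": "b", "pep": 0.95, "q_value": 0.001, "custom": 12.0},  # fails PEP
    {"id": "c", "pep": 0.01, "q_value": 0.300, "custom": 12.0},  # fails q-value
    {"id": "d", "pep": 0.01, "q_value": 0.001, "custom": 3.0},   # fails custom
]
kept = filter_psms(psms, pep_restriction=0.9, q_value_restriction=0.2,
                   custom_restriction=5, psm_score_type="additive")
print([p["id"] for p in kept])  # ['a']
```

Calling `filter_psms(psms)` with every restriction left at `None` returns all four PSMs unchanged.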

Score

| Parameter | Description | Type |
| --- | --- | --- |
| protein_score | One of: multiplicative_log, best_peptide_per_protein, top_two_combined, additive, iterative_downweighted_log, downweighted_multiplicative_log, geometric_mean. Recommended: multiplicative_log. | String |
| psm_score | PSM score to use for protein scoring. If using Percolator output as input, this would be either posterior_error_prob or q-value. The string entered here must match the column name in your input files EXACTLY. If using a custom score, it is filtered according to the value in custom_restriction. | String |
| psm_score_type | The type of score that the psm_score parameter is: either multiplicative or additive. If a larger PSM score is "better", use additive (e.g. Mascot Ion Score, Xcorr, Percolator Score). If a smaller PSM score is "better", use multiplicative (e.g. Q Value, Posterior Error Probability). See below for more information. | String |

Extra Score information

  1. The protein_score, psm_score, and psm_score_type methods must be compatible.
  2. If using a PSM score (psm_score parameter) where a lower score is better (e.g. posterior_error_prob or q-value), then any protein_score can be used except additive, and psm_score_type must be set to multiplicative.
  3. If using a PSM score (psm_score parameter) where a higher score is better (e.g. Percolator Score, Mascot Ion Score, Xcorr), then protein_score and psm_score_type must both be additive. Note that Percolator Score is the column named psm_score in the tab-delimited Percolator output.
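
The compatibility rules above can be expressed as a small check. This is an illustrative sketch under the rules as stated here; the function name `check_score_settings` is hypothetical and is not part of the library.

```python
# protein_score values listed in the Score table.
PROTEIN_SCORES = {
    "multiplicative_log", "best_peptide_per_protein", "top_two_combined",
    "additive", "iterative_downweighted_log",
    "downweighted_multiplicative_log", "geometric_mean",
}


def check_score_settings(protein_score, psm_score_type):
    """Return True if the combination is allowed, otherwise raise ValueError."""
    if protein_score not in PROTEIN_SCORES:
        raise ValueError(f"unknown protein_score: {protein_score!r}")
    if psm_score_type == "multiplicative":
        # Lower-is-better PSM scores: anything except the additive protein score.
        if protein_score == "additive":
            raise ValueError("additive protein_score needs psm_score_type additive")
    elif psm_score_type == "additive":
        # Higher-is-better PSM scores: only the additive protein score.
        if protein_score != "additive":
            raise ValueError("psm_score_type additive needs protein_score additive")
    else:
        raise ValueError(f"unknown psm_score_type: {psm_score_type!r}")
    return True


print(check_score_settings("multiplicative_log", "multiplicative"))  # True
print(check_score_settings("additive", "additive"))                  # True
```

A combination such as `check_score_settings("geometric_mean", "additive")` raises `ValueError` in this sketch.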

Identifiers

| Parameter | Description | Type |
| --- | --- | --- |
| decoy_symbol | Symbol within decoy identifiers that distinguishes them from targets (e.g. "##" or "__decoy___"). This is important for Protein Picker and FDR calculation. | String |
| isoform_symbol | Symbol that is present in isoform proteins only (e.g. "-"). See below for more information. | String |
| reviewed_identifier_symbol | Identifier used to distinguish a reviewed from an unreviewed identifier (e.g. "sp\|"). See below for more information. | String |

Extra Identifier information

  1. For the decoy_symbol, given a target protein ex|protein, its decoy counterpart could be any of the following: ##ex|##protein, ##ex|protein, decoy_ex|protein. The decoy symbol only needs to be present somewhere within the identifier for it to be treated as a decoy.
  2. The isoform_symbol and reviewed_identifier_symbol are used to assign priority in certain algorithms such as parsimony. For example, if an analysis contains canonical proteins, isoform proteins, and reviewed/unreviewed proteins, the priority is established as follows: Reviewed Canonical, Reviewed Isoform, Unreviewed. When two proteins map to the same peptides, the algorithm must decide which one to pick, and it uses this priority to choose the protein lead to report.
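
The two points above can be sketched in a few lines. This is a hedged illustration, not the library's code: the helper names `is_decoy` and `priority` are hypothetical, the identifiers are made up, and the sketch assumes that both the isoform and reviewed symbols are matched by simple substring presence.

```python
def is_decoy(identifier, decoy_symbol="##"):
    """A protein counts as a decoy if the symbol appears anywhere in its identifier."""
    return decoy_symbol in identifier


def priority(identifier, isoform_symbol="-", reviewed_identifier_symbol="sp|"):
    """Lower rank wins: 0 reviewed canonical, 1 reviewed isoform, 2 unreviewed."""
    reviewed = reviewed_identifier_symbol in identifier
    isoform = isoform_symbol in identifier
    if reviewed and not isoform:
        return 0
    if reviewed:
        return 1
    return 2


print(is_decoy("##ex|##protein"))                          # True
print(is_decoy("decoy_ex|protein", decoy_symbol="decoy_"))  # True
print(is_decoy("ex|protein"))                              # False

# Three made-up proteins that map to the same peptides.
candidates = [
    "tr|Q00001|Q00001_HUMAN",     # unreviewed
    "sp|P00001-2|EXAMPLE_HUMAN",  # reviewed isoform
    "sp|P00001|EXAMPLE_HUMAN",    # reviewed canonical
]
lead = min(candidates, key=priority)
print(lead)  # sp|P00001|EXAMPLE_HUMAN
```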

Inference

| Parameter | Description | Type |
| --- | --- | --- |
| inference_type | The inference procedure to apply to the analysis. This can be parsimony, inclusion, exclusion, peptide_centric, or first_protein. Please see here for more information on the inference types. | String |
| grouping_type | How to group proteins for a given inference_type. This can be subset_peptides, shared_peptides, or None. Typically subset_peptides is used. This parameter only affects grouped proteins and has no impact on protein leads. | String |

Digest

| Parameter | Description | Type |
| --- | --- | --- |
| digest_type | The enzyme used for digestion in the MS searches (e.g. trypsin). For reference, the database digestion is handled with pyteomics. This can be any expasy rule as defined here. Common examples include: trypsin, chymotrypsin high specificity, chymotrypsin low specificity, lysc. | String |
| missed_cleavages | The number of missed cleavages allowed in the MS searches (e.g. 2). | Int |
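
To make the effect of missed_cleavages concrete, here is a standard-library-only sketch of an in silico digest. It does not use pyteomics and is not how the tool performs digestion: the function name `simple_tryptic_digest` is hypothetical, and the rule is a simplified trypsin rule (cleave after K or R unless followed by P) rather than the full expasy rule.

```python
import re


def simple_tryptic_digest(sequence, missed_cleavages=0):
    """Return the set of peptides allowing up to `missed_cleavages` skipped sites."""
    # Positions where the sequence is cut: after K or R, unless the next residue is P.
    cuts = [0] + [m.end() for m in re.finditer(r"[KR](?!P)", sequence)]
    if cuts[-1] != len(sequence):
        cuts.append(len(sequence))
    peptides = set()
    for i in range(len(cuts) - 1):
        # Join each fragment with up to `missed_cleavages` following fragments.
        for j in range(i + 1, min(i + 2 + missed_cleavages, len(cuts))):
            peptides.add(sequence[cuts[i]:cuts[j]])
    return peptides


print(sorted(simple_tryptic_digest("AAKGGRCC", missed_cleavages=0)))
# ['AAK', 'CC', 'GGR']
print(sorted(simple_tryptic_digest("AAKGGRCC", missed_cleavages=1)))
# ['AAK', 'AAKGGR', 'CC', 'GGR', 'GGRCC']
```

Raising missed_cleavages adds longer peptides that span uncut sites, which is why the value should match what was allowed in the MS searches.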

Parsimony

These parameters are only used if parsimony is selected as inference_type.

| Parameter | Description | Type |
| --- | --- | --- |
| lp_solver | One of: pulp or None. This determines which linear program solver is used. Please see here for more information on LP solvers. Use None if not running parsimony. If running parsimony, this must be set to pulp. | String |
| shared_peptides | How to assign shared peptides for parsimony. One of: all or best. all assigns shared peptides to all possible proteins in the output. best assigns shared peptides to the best-scoring protein, which is a "winner take all" approach. This is specific to the parsimony inference type. | String |
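
The difference between all and best for shared_peptides can be illustrated with a toy example. This is a sketch under stated assumptions rather than the library's implementation: the function name `assign_peptides`, the protein and peptide names, and the scores are hypothetical, and a higher protein score is assumed to be better.

```python
def assign_peptides(protein_to_peptides, protein_scores, shared_peptides="all"):
    """Return a protein -> set of peptides mapping for the chosen mode."""
    if shared_peptides == "all":
        # Every protein keeps every peptide that maps to it.
        return {prot: set(peps) for prot, peps in protein_to_peptides.items()}
    if shared_peptides != "best":
        raise ValueError("shared_peptides must be 'all' or 'best'")
    # "Winner take all": each peptide goes to its best-scoring protein
    # (assumption: higher score is better).
    owner = {}
    for prot, peps in protein_to_peptides.items():
        for pep in peps:
            if pep not in owner or protein_scores[prot] > protein_scores[owner[pep]]:
                owner[pep] = prot
    assigned = {prot: set() for prot in protein_to_peptides}
    for pep, prot in owner.items():
        assigned[prot].add(pep)
    return assigned


mapping = {"P1": {"PEPA", "PEPB"}, "P2": {"PEPB", "PEPC"}}
scores = {"P1": 10.0, "P2": 5.0}

all_mode = assign_peptides(mapping, scores, "all")
best_mode = assign_peptides(mapping, scores, "best")
print(sorted(all_mode["P2"]))   # ['PEPB', 'PEPC']
print(sorted(best_mode["P2"]))  # ['PEPC']
```

In this toy case the shared peptide PEPB is reported for both proteins under all, but only for the higher-scoring P1 under best.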

Peptide Centric

These parameters are only used if peptide_centric is selected as inference_type.

| Parameter | Description | Type |
| --- | --- | --- |
| max_identifiers | The maximum number of proteins a peptide is allowed to map to (e.g. 5). This limits the number of protein groups that can be created due to highly homologous peptides. | Int |

Default Parameters

parameters:
  general:
    export: peptides
    fdr: 0.01
    picker: True
    tag: py_protein_inference
  data_restriction:
    pep_restriction: 0.9
    peptide_length_restriction: 7
    q_value_restriction: 0.005
    custom_restriction: None
  score:
    protein_score: multiplicative_log
    psm_score: posterior_error_prob
    psm_score_type: multiplicative
  identifiers:
    decoy_symbol: "##"
    isoform_symbol: "-"
    reviewed_identifier_symbol: "sp|"
  inference:
    inference_type: peptide_centric
    grouping_type: shared_peptides
  digest:
    digest_type: trypsin
    missed_cleavages: 3
  parsimony:
    lp_solver: pulp
    shared_peptides: all
  peptide_centric:
    max_identifiers: 5
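
Because every parameter is optional, a parameter file only needs to list the values that differ from the defaults above. One way to picture this is a recursive merge of the user's values over the defaults. The sketch below is an assumption about the general idea, not a description of how the library fills in defaults: `merge_with_defaults` is a hypothetical helper, and `DEFAULTS` holds only a subset of the default values shown above as a plain dictionary.

```python
import copy

# Subset of the defaults listed above, as a nested dictionary.
DEFAULTS = {
    "parameters": {
        "general": {"export": "peptides", "fdr": 0.01, "picker": True,
                    "tag": "py_protein_inference"},
        "inference": {"inference_type": "peptide_centric",
                      "grouping_type": "shared_peptides"},
    }
}


def merge_with_defaults(user, defaults=DEFAULTS):
    """Return a new dict where user values override defaults, section by section."""
    merged = copy.deepcopy(defaults)

    def _merge(target, source):
        for key, value in source.items():
            if isinstance(value, dict) and isinstance(target.get(key), dict):
                _merge(target[key], value)  # descend into nested sections
            else:
                target[key] = value         # user value replaces the default

    _merge(merged, user)
    return merged


# A user file that only changes two values.
user = {"parameters": {"general": {"fdr": 0.05},
                       "inference": {"inference_type": "parsimony"}}}
params = merge_with_defaults(user)
print(params["parameters"]["general"]["fdr"])               # 0.05
print(params["parameters"]["general"]["export"])            # peptides
print(params["parameters"]["inference"]["inference_type"])  # parsimony
```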