Py Protein Inference Module
datastore
DataStore
This class serves as the data storage object for a protein inference (PI) analysis. It is the central point that is accessed at virtually every PI processing step.
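
As a rough illustration of that central role, the sketch below calls a few DataStore methods in one plausible order. The method names and keyword arguments are taken from the class source on this page. The helper `prepare_for_scoring`, its `exclude_subsets` flag, and the `RecordingStore` stand-in are hypothetical and exist only so the example runs without the library or any input files. The ordering is an assumption inferred from the data dependencies: restriction fills `restricted_peptides`, which the dictionary methods read, and `create_scoring_input` fills `scoring_input`, which the exclusion step reads.

```python
def prepare_for_scoring(data, exclude_subsets=False):
    """Run pre-scoring steps on a DataStore-like object (hypothetical helper)."""
    data.restrict_psm_data(remove1pep=True)  # sets main_data_restricted, restricted_peptides
    data.protein_to_peptide_dictionary()     # sets protein_peptide_dictionary
    data.peptide_to_protein_dictionary()     # sets peptide_protein_dictionary
    data.create_scoring_input()              # sets scoring_input
    if exclude_subsets:
        data.exclude_non_distinguishing_peptides(protein_subset_type="hard")
    return data


class RecordingStore:
    """Stand-in that only records which methods were called, in order."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Only reached for attributes that do not exist, i.e. the method names.
        def record(**kwargs):
            self.calls.append(name)

        return record


store = prepare_for_scoring(RecordingStore(), exclude_subsets=True)
print(store.calls)
```

With a real DataStore the same helper would be called as `prepare_for_scoring(data)`, where `data` was built from a loaded Reader and Digest.
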
Attributes:
Name | Type | Description |
---|---|---|
main_data_form | list | List of unrestricted Psm objects. |
parameter_file_object | ProteinInferenceParameter | Protein inference parameter object. |
restricted_peptides | list | List of non-flanking peptide strings present in the current analysis. |
main_data_restricted | list | List of restricted Psm objects. Restriction is based on the parameter_file_object, and the list is created by the function restrict_psm_data. |
scored_proteins | list | List of scored Protein objects. Output from the scoring methods in scoring. |
grouped_scored_proteins | list | List of scored Protein objects that have been grouped and sorted. Output from the run_inference method. |
scoring_input | list | List of non-scored Protein objects. Output from create_scoring_input. |
picked_proteins_scored | list | List of Protein objects that pass the protein picker algorithm (protein_picker). |
picked_proteins_removed | list | List of Protein objects that do not pass the protein picker algorithm (protein_picker). |
protein_peptide_dictionary | collections.defaultdict | Dictionary of protein strings (keys) that map to sets of peptide strings based on the peptides and proteins found in the search. Protein -> set(Peptides). |
peptide_protein_dictionary | collections.defaultdict | Dictionary of peptide strings (keys) that map to sets of protein strings based on the peptides and proteins found in the search. Peptide -> set(Proteins). |
high_low_better | str | Variable that indicates whether a higher or a lower protein score is better. This is necessary to sort Protein objects by score properly. Can be either "higher" or "lower". |
psm_score | str | Variable that indicates the Psm score being used in the analysis to generate Protein scores. |
protein_score | str | String that indicates the protein score method used. |
short_protein_score | str | Short string that indicates the protein score method used. |
protein_group_objects | list | List of scored ProteinGroup objects that have been grouped and sorted. Output from the run_inference method. |
decoy_symbol | str | String that is used to differentiate between decoy proteins and target proteins. Ex: "##". |
digest | Digest | Digest object. |
SCORE_MAPPER | dict | Dictionary that maps potential scores in input files to internal score names. |
CUSTOM_SCORE_KEY | str | String that indicates a custom score is being used. |
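
To make the two mapping attributes concrete, here is a minimal sketch of how protein -> peptides and peptide -> proteins maps of this shape can be built with `collections.defaultdict(set)`. The `ToyPsm` tuple, the `build_maps` helper, and the sample identifiers are hypothetical stand-ins, not the library's `Psm` class or its methods. Only the attribute names `non_flanking_peptide` and `possible_proteins` mirror the source.

```python
import collections

# Hypothetical stand-in for a Psm object; only the two attributes used below are modeled.
ToyPsm = collections.namedtuple("ToyPsm", ["non_flanking_peptide", "possible_proteins"])


def build_maps(psms, restricted_peptides):
    """Build both halves of the protein/peptide bipartite graph."""
    allowed = set(restricted_peptides)
    protein_to_peptides = collections.defaultdict(set)
    peptide_to_proteins = collections.defaultdict(set)
    for psm in psms:
        if psm.non_flanking_peptide not in allowed:
            continue  # peptide did not survive restriction
        for protein in psm.possible_proteins:
            protein_to_peptides[protein].add(psm.non_flanking_peptide)
            peptide_to_proteins[psm.non_flanking_peptide].add(protein)
    return protein_to_peptides, peptide_to_proteins


psms = [
    ToyPsm("PEPTIDEK", ["PROT_A", "PROT_B"]),
    ToyPsm("SHAREDR", ["PROT_A"]),
    ToyPsm("LOWSCOREK", ["PROT_C"]),
]
protein_map, peptide_map = build_maps(psms, restricted_peptides=["PEPTIDEK", "SHAREDR"])
print(dict(protein_map))
print(dict(peptide_map))
```

Because `LOWSCOREK` is not in the restricted peptide list, `PROT_C` never appears as a key, which is the same filtering effect the `restricted_peptides` attribute has on the real dictionaries.
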
Source code in pyproteininference/datastore.py
class DataStore(object):
"""
The following class serves as the data storage object for a protein inference analysis.
The class serves as a central point that is accessed at virtually every PI processing step.
Attributes:
main_data_form (list): List of unrestricted Psm objects.
parameter_file_object (ProteinInferenceParameter): protein inference parameter
[object][pyproteininference.parameters.ProteinInferenceParameter].
restricted_peptides (list): List of non-flanking peptide strings present in the current analysis.
main_data_restricted (list): List of restricted [Psm][pyproteininference.physical.Psm] objects.
Restriction is based on the parameter_file_object and the object is created by function
[restrict_psm_data][pyproteininference.datastore.DataStore.restrict_psm_data].
scored_proteins (list): List of scored [Protein][pyproteininference.physical.Protein] objects.
Output from scoring methods from [scoring][pyproteininference.scoring].
grouped_scored_proteins (list): List of scored [Protein][pyproteininference.physical.Protein]
objects that have been grouped and sorted. Output from
[run_inference][pyproteininference.inference.Inference.run_inference] method.
scoring_input (list): List of non-scored [Protein][pyproteininference.physical.Protein] objects.
Output from [create_scoring_input][pyproteininference.datastore.DataStore.create_scoring_input].
picked_proteins_scored (list): List of [Protein][pyproteininference.physical.Protein] objects that pass
the protein picker algorithm ([protein_picker][pyproteininference.datastore.DataStore.protein_picker]).
picked_proteins_removed (list): List of [Protein][pyproteininference.physical.Protein] objects that do not
pass the protein picker algorithm ([protein_picker][pyproteininference.datastore.DataStore.protein_picker]).
protein_peptide_dictionary (collections.defaultdict): Dictionary of protein strings (keys) that map to sets
of peptide strings based on the peptides and proteins found in the search. Protein -> set(Peptides).
peptide_protein_dictionary (collections.defaultdict): Dictionary of peptide strings (keys) that map to sets
of protein strings based on the peptides and proteins found in the search. Peptide -> set(Proteins).
high_low_better (str): Variable that indicates whether a higher or a lower protein score is better.
This is necessary to sort Protein objects by score properly. Can either be "higher" or "lower".
psm_score (str): Variable that indicates the [Psm][pyproteininference.physical.Psm]
score being used in the analysis to generate [Protein][pyproteininference.physical.Protein] scores.
protein_score (str): String to indicate the protein score method used.
short_protein_score (str): Short String to indicate the protein score method used.
protein_group_objects (list): List of scored [ProteinGroup][pyproteininference.physical.ProteinGroup]
objects that have been grouped and sorted. Output from
[run_inference][pyproteininference.inference.Inference.run_inference] method.
decoy_symbol (str): String that is used to differentiate between decoy proteins and target proteins. Ex: "##".
digest (Digest): [Digest object][pyproteininference.in_silico_digest.Digest].
SCORE_MAPPER (dict): Dictionary that maps potential scores in input files to internal score names.
CUSTOM_SCORE_KEY (str): String that indicates a custom score is being used.
"""
SCORE_MAPPER = {
"q_value": "qvalue",
"pep_value": "pepvalue",
"perc_score": "percscore",
"score": "percscore",
"q-value": "qvalue",
"posterior_error_prob": "pepvalue",
"posterior_error_probability": "pepvalue",
}
CUSTOM_SCORE_KEY = "custom_score"
HIGHER_PSM_SCORE = "higher"
LOWER_PSM_SCORE = "lower"
def __init__(self, reader, digest, validate=True):
"""
Args:
reader (Reader): Reader object [Reader][pyproteininference.reader.Reader].
digest (Digest): Digest object
[Digest][pyproteininference.in_silico_digest.Digest].
validate (bool): True/False to indicate if the input data should be validated.
Example:
>>> pyproteininference.datastore.DataStore(reader = reader, digest=digest)
"""
# If the reader class is from a percolator.psms then define main_data_form as reader.psms
# main_data_form is the starting point for all other analyses
self._init_validate(reader=reader)
self.parameter_file_object = reader.parameter_file_object # Parameter object
self.main_data_restricted = None # PSM data post restriction
self.scored_proteins = [] # List of scored Protein objects
self.grouped_scored_proteins = [] # List of sorted scored Protein objects
self.scoring_input = None # List of non scored Protein objects
self.picked_proteins_scored = None # List of Protein objects after picker algorithm
self.picked_proteins_removed = None # Protein objects removed via picker
self.protein_peptide_dictionary = None
self.peptide_protein_dictionary = None
self.high_low_better = None # Variable that indicates whether a higher or lower protein score is better
self.psm_score = None # PSM Score used
self.protein_score = None
self.short_protein_score = None
self.protein_group_objects = [] # List of sorted protein group objects
self.decoy_symbol = self.parameter_file_object.decoy_symbol # Decoy symbol from parameter file
self.digest = digest # Digest object
# Run Checks and Validations
if validate:
self.validate_psm_data()
self.validate_digest()
self.check_data_consistency()
# Run method to fix our parameter object if necessary
self.parameter_file_object.fix_parameters_from_datastore(data=self)
def get_sorted_identifiers(self, scored=True):
"""
Retrieves a sorted list of protein strings present in the analysis.
Args:
scored (bool): True/False to indicate if we should return scored or non-scored identifiers.
Returns:
list: List of sorted protein identifier strings.
Example:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> sorted_proteins = data.get_sorted_identifiers(scored=True)
"""
if scored:
self._validate_scored_proteins()
if self.picked_proteins_scored:
proteins = set([x.identifier for x in self.picked_proteins_scored])
else:
proteins = set([x.identifier for x in self.scored_proteins])
else:
self._validate_scoring_input()
proteins = [x.identifier for x in self.scoring_input]
all_sp_proteins = set(self.digest.swiss_prot_protein_set)
our_target_sp_proteins = sorted([x for x in proteins if x in all_sp_proteins and self.decoy_symbol not in x])
our_decoy_sp_proteins = sorted([x for x in proteins if x in all_sp_proteins and self.decoy_symbol in x])
our_target_tr_proteins = sorted(
[x for x in proteins if x not in all_sp_proteins and self.decoy_symbol not in x]
)
our_decoy_tr_proteins = sorted([x for x in proteins if x not in all_sp_proteins and self.decoy_symbol in x])
our_proteins_sorted = (
our_target_sp_proteins + our_decoy_sp_proteins + our_target_tr_proteins + our_decoy_tr_proteins
)
return our_proteins_sorted
@classmethod
def sort_protein_group_objects(cls, protein_group_objects, higher_or_lower):
"""
Class Method to sort a list of [ProteinGroup][pyproteininference.physical.ProteinGroup] objects by
score and number of peptides.
Args:
protein_group_objects (list): list of [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.
higher_or_lower (str): String to indicate if a "higher" or "lower" protein score is "better".
Returns:
list: list of sorted [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.
Example:
>>> list_of_group_objects = pyproteininference.datastore.DataStore.sort_protein_group_objects(
...     protein_group_objects=list_of_group_objects, higher_or_lower="higher"
... )
"""
if higher_or_lower == cls.LOWER_PSM_SCORE:
protein_group_objects = sorted(
protein_group_objects,
key=lambda k: (
k.proteins[0].score,
-k.proteins[0].num_peptides,
),
reverse=False,
)
elif higher_or_lower == cls.HIGHER_PSM_SCORE:
protein_group_objects = sorted(
protein_group_objects,
key=lambda k: (
k.proteins[0].score,
k.proteins[0].num_peptides,
),
reverse=True,
)
return protein_group_objects
@classmethod
def sort_protein_objects(cls, grouped_protein_objects, higher_or_lower):
"""
Class Method to sort a list of [Protein][pyproteininference.physical.Protein] objects by score and number of
peptides.
Args:
grouped_protein_objects (list): list of [Protein][pyproteininference.physical.Protein] objects.
higher_or_lower (str): String to indicate if a "higher" or "lower" protein score is "better".
Returns:
list: list of sorted [Protein][pyproteininference.physical.Protein] objects.
Example:
>>> scores_grouped = pyproteininference.datastore.DataStore.sort_protein_objects(
...     grouped_protein_objects=scores_grouped, higher_or_lower="higher"
... )
"""
if higher_or_lower == cls.LOWER_PSM_SCORE:
grouped_protein_objects = sorted(
grouped_protein_objects,
key=lambda k: (k[0].score, -k[0].num_peptides),
reverse=False,
)
if higher_or_lower == cls.HIGHER_PSM_SCORE:
grouped_protein_objects = sorted(
grouped_protein_objects,
key=lambda k: (k[0].score, k[0].num_peptides),
reverse=True,
)
return grouped_protein_objects
@classmethod
def sort_protein_sub_groups(cls, protein_list, higher_or_lower):
"""
Method to sort protein sub lists.
Args:
protein_list (list): List of [Protein][pyproteininference.physical.Protein] objects to be sorted.
higher_or_lower (str): String to indicate if a "higher" or "lower" protein score is "better".
Returns:
list: List of [Protein][pyproteininference.physical.Protein] objects in which every entry after the lead
protein is sorted by score and number of peptides.
"""
# Sort the groups based on higher or lower indication, secondarily sort the groups based on number of unique
# peptides
# We use the index [1:] as we do not wish to sort the lead protein...
if higher_or_lower == cls.LOWER_PSM_SCORE:
protein_list[1:] = sorted(
protein_list[1:],
key=lambda k: (float(k.score), -float(k.num_peptides)),
reverse=False,
)
if higher_or_lower == cls.HIGHER_PSM_SCORE:
protein_list[1:] = sorted(
protein_list[1:],
key=lambda k: (float(k.score), float(k.num_peptides)),
reverse=True,
)
return protein_list
def get_psm_data(self):
"""
Method to retrieve a list of [Psm][pyproteininference.physical.Psm] objects.
Retrieves restricted data if the data has been restricted or all of the data if the data has
not been restricted.
Returns:
list: list of [Psm][pyproteininference.physical.Psm] objects.
Example:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> psm_data = data.get_psm_data()
"""
if not self.main_data_restricted and not self.main_data_form:
raise ValueError(
"Both main_data_restricted and main_data_form variables are empty. Please re-load the DataStore "
"object with a properly loaded Reader object."
)
if self.main_data_restricted:
psm_data = self.main_data_restricted
else:
psm_data = self.main_data_form
return psm_data
def get_protein_data(self):
"""
Method to retrieve a list of [Protein][pyproteininference.physical.Protein] objects.
Retrieves picked and scored data if the data has been picked and scored or just the scored data if the data has
not been picked.
Returns:
list: list of [Protein][pyproteininference.physical.Protein] objects.
Example:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> # Data must be run through a pyproteininference.scoring.Score method
>>> protein_data = data.get_protein_data()
"""
if self.picked_proteins_scored:
scored_proteins = self.picked_proteins_scored
else:
scored_proteins = self.scored_proteins
return scored_proteins
def get_protein_identifiers_from_psm_data(self):
"""
Method to retrieve a list of lists of all possible protein identifiers from the psm data.
Returns:
list: list of lists of protein strings.
Example:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> protein_strings = data.get_protein_identifiers_from_psm_data()
"""
psm_data = self.get_psm_data()
proteins = [x.possible_proteins for x in psm_data]
return proteins
def get_q_values(self):
"""
Method to retrieve a list of all q values for all PSMs.
Returns:
list: list of floats (q values).
Example:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> q = data.get_q_values()
"""
psm_data = self.get_psm_data()
q_values = [x.qvalue for x in psm_data]
return q_values
def get_pep_values(self):
"""
Method to retrieve a list of all posterior error probabilities for all PSMs.
Returns:
list: list of floats (pep values).
Example:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> pep = data.get_pep_values()
"""
psm_data = self.get_psm_data()
pep_values = [x.pepvalue for x in psm_data]
return pep_values
def get_protein_information_dictionary(self):
"""
Method to retrieve a dictionary that maps each protein to the scores of its PSMs.
Returns:
dict: dictionary of scores for each protein.
Example:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> protein_dict = data.get_protein_information_dictionary()
"""
psm_data = self.get_psm_data()
protein_psm_score_dictionary = collections.defaultdict(list)
# Loop through all Psms
for psms in psm_data:
# Loop through all proteins
for prots in psms.possible_proteins:
protein_psm_score_dictionary[prots].append(
{
"peptide": psms.identifier,
"Qvalue": psms.qvalue,
"PosteriorErrorProbability": psms.pepvalue,
"Percscore": psms.percscore,
}
)
return protein_psm_score_dictionary
def restrict_psm_data(self, remove1pep=True):
"""
Method to restrict the input of [Psm][pyproteininference.physical.Psm] objects.
This method is central to the pyproteininference module and is able to restrict the Psm data by:
Q value, Pep Value, Percolator Score, Peptide Length, and Custom Score Input.
Restriction values are pulled from
the [ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter]
object.
This method sets the `main_data_restricted` and `restricted_peptides` Attributes for the DataStore object.
Args:
remove1pep (bool): True/False on whether or not to remove PEP values that equal 1 even if other restrictions
are set to not restrict.
Returns:
None:
Example:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> data.restrict_psm_data(remove1pep=True)
"""
# Validate that we have the main data variable
self._validate_main_data_form()
logger.info("Restricting PSM data")
peptide_length = self.parameter_file_object.restrict_peptide_length
posterior_error_prob_threshold = self.parameter_file_object.restrict_pep
q_value_threshold = self.parameter_file_object.restrict_q
custom_threshold = self.parameter_file_object.restrict_custom
main_psm_data = self.main_data_form
logger.info("Length of main data: {}".format(len(self.main_data_form)))
# If remove1pep is True and a PEP threshold is configured, discard every PSM that has a PEP of 1
if remove1pep and posterior_error_prob_threshold:
main_psm_data = [x for x in main_psm_data if x.pepvalue != 1]
# Restrict peptide length and posterior error probability
if peptide_length and posterior_error_prob_threshold and not q_value_threshold:
restricted_data = []
for psms in main_psm_data:
if len(psms.stripped_peptide) >= peptide_length and psms.pepvalue < float(
posterior_error_prob_threshold
):
restricted_data.append(psms)
# Restrict peptide length only
if peptide_length and not posterior_error_prob_threshold and not q_value_threshold:
restricted_data = []
for psms in main_psm_data:
if len(psms.stripped_peptide) >= peptide_length:
restricted_data.append(psms)
# Restrict peptide length, posterior error probability, and qvalue
if peptide_length and posterior_error_prob_threshold and q_value_threshold:
restricted_data = []
for psms in main_psm_data:
if (
len(psms.stripped_peptide) >= peptide_length
and psms.pepvalue < float(posterior_error_prob_threshold)
and psms.qvalue < float(q_value_threshold)
):
restricted_data.append(psms)
# Restrict peptide length and qvalue
if peptide_length and not posterior_error_prob_threshold and q_value_threshold:
restricted_data = []
for psms in main_psm_data:
if len(psms.stripped_peptide) >= peptide_length and psms.qvalue < float(q_value_threshold):
restricted_data.append(psms)
# Restrict posterior error probability and q value
if not peptide_length and posterior_error_prob_threshold and q_value_threshold:
restricted_data = []
for psms in main_psm_data:
if psms.pepvalue < float(posterior_error_prob_threshold) and psms.qvalue < float(q_value_threshold):
restricted_data.append(psms)
# Restrict qvalue only
if not peptide_length and not posterior_error_prob_threshold and q_value_threshold:
restricted_data = []
for psms in main_psm_data:
if psms.qvalue < float(q_value_threshold):
restricted_data.append(psms)
# Restrict posterior error probability only
if not peptide_length and posterior_error_prob_threshold and not q_value_threshold:
restricted_data = []
for psms in main_psm_data:
if psms.pepvalue < float(posterior_error_prob_threshold):
restricted_data.append(psms)
# Restrict nothing: no peptide length, PEP, or q value threshold is configured
if not peptide_length and not posterior_error_prob_threshold and not q_value_threshold:
restricted_data = main_psm_data
if custom_threshold:
custom_restricted = []
if self.parameter_file_object.psm_score_type == Score.MULTIPLICATIVE_SCORE_TYPE:
for psms in restricted_data:
if psms.custom_score <= custom_threshold:
custom_restricted.append(psms)
if self.parameter_file_object.psm_score_type == Score.ADDITIVE_SCORE_TYPE:
for psms in restricted_data:
if psms.custom_score >= custom_threshold:
custom_restricted.append(psms)
restricted_data = custom_restricted
self.main_data_restricted = restricted_data
logger.info("Length of restricted data: {}".format(len(restricted_data)))
self.restricted_peptides = [x.non_flanking_peptide for x in restricted_data]
def create_scoring_input(self):
"""
Method to create the scoring input.
This method initializes a list of [Protein][pyproteininference.physical.Protein] objects to get them ready
to be scored by [Score][pyproteininference.scoring.Score] methods.
This method also takes into account the inference type and aggregates peptides -> proteins accordingly.
This method sets the `scoring_input` and `psm_score` Attributes for the DataStore object.
The score selected comes from the protein inference parameter object.
Returns:
None:
Example:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> data.create_scoring_input()
"""
logger.info("Creating Scoring Input")
psm_data = self.get_psm_data()
protein_psm_dict = collections.defaultdict(list)
try:
score_key = self.SCORE_MAPPER[self.parameter_file_object.psm_score]
except KeyError:
score_key = self.CUSTOM_SCORE_KEY
if self.parameter_file_object.inference_type != Inference.PEPTIDE_CENTRIC:
# Loop through all Psms
for psms in psm_data:
psms.assign_main_score(score=score_key)
# Loop through all proteins
for prots in psms.possible_proteins:
protein_psm_dict[prots].append(psms)
else:
self.peptide_to_protein_dictionary()
sp_proteins = self.digest.swiss_prot_protein_set
for psms in psm_data:
# Assign main score
psms.assign_main_score(score=score_key)
protein_set = self.peptide_protein_dictionary[psms.non_flanking_peptide]
# Sort protein_set by sp-alpha, decoy-sp-alpha, tr-alpha, decoy-tr-alpha
sorted_protein_list = self.sort_protein_strings(
protein_string_list=protein_set,
sp_proteins=sp_proteins,
decoy_symbol=self.parameter_file_object.decoy_symbol,
)
# Restrict the number of identifiers by the value in param file max_identifiers_peptide_centric
sorted_protein_list = sorted_protein_list[: self.parameter_file_object.max_identifiers_peptide_centric]
protein_name = ";".join(sorted_protein_list)
protein_psm_dict[protein_name].append(psms)
protein_list = []
for pkey in sorted(protein_psm_dict.keys()):
protein_object = Protein(identifier=pkey)
protein_object.psms = protein_psm_dict[pkey]
protein_object.raw_peptides = set([x.identifier for x in protein_psm_dict[pkey]])
protein_list.append(protein_object)
self.psm_score = self.parameter_file_object.psm_score
self.scoring_input = protein_list
def protein_to_peptide_dictionary(self):
"""
Method that returns a map of protein strings to sets of peptide strings and is essentially half
of a BiPartite graph.
This method sets the `protein_peptide_dictionary` Attribute for the DataStore object.
Returns:
collections.defaultdict: Dictionary of protein strings (keys) that map to sets of peptide strings based
on the peptides and proteins found in the search. Protein -> set(Peptides).
Example:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> protein_peptide_dict = data.protein_to_peptide_dictionary()
"""
psm_data = self.get_psm_data()
res_pep_set = set(self.restricted_peptides)
default_dict_proteins = collections.defaultdict(set)
for peptide_objects in psm_data:
for prots in peptide_objects.possible_proteins:
cur_peptide = peptide_objects.non_flanking_peptide
if cur_peptide in res_pep_set:
default_dict_proteins[prots].add(cur_peptide)
self.protein_peptide_dictionary = default_dict_proteins
return default_dict_proteins
def peptide_to_protein_dictionary(self):
"""
Method that returns a map of peptide strings to sets of protein strings and is essentially half of a
BiPartite graph.
This method sets the `peptide_protein_dictionary` Attribute for the DataStore object.
Returns:
collections.defaultdict: Dictionary of peptide strings (keys) that map to sets of protein strings based
on the peptides and proteins found in the search. Peptide -> set(Proteins).
Example:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> peptide_protein_dict = data.peptide_to_protein_dictionary()
"""
psm_data = self.get_psm_data()
res_pep_set = set(self.restricted_peptides)
default_dict_peptides = collections.defaultdict(set)
for peptide_objects in psm_data:
for prots in peptide_objects.possible_proteins:
cur_peptide = peptide_objects.non_flanking_peptide
if cur_peptide in res_pep_set:
default_dict_peptides[cur_peptide].add(prots)
else:
pass
self.peptide_protein_dictionary = default_dict_peptides
return default_dict_peptides
def unique_to_leads_peptides(self):
"""
Method to retrieve peptides that are unique based on the data from the searches
(Not based on the database digestion).
Returns:
set: a Set of peptide strings
Example:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> unique_peps = data.unique_to_leads_peptides()
"""
if self.grouped_scored_proteins:
lead_peptides = [list(x[0].peptides) for x in self.grouped_scored_proteins]
flat_peptides = [item for sublist in lead_peptides for item in sublist]
counted_peps = collections.Counter(flat_peptides)
unique_to_leads_peptides = set([x for x in counted_peps if counted_peps[x] == 1])
else:
unique_to_leads_peptides = set()
return unique_to_leads_peptides
def higher_or_lower(self):
"""
Method to determine if a higher or lower score is better for a given combination of score input and score type.
This method sets the `high_low_better` Attribute for the DataStore object.
This method depends on the output from the Score class to be sorted properly from best to worst score.
Returns:
str: String indicating "higher" or "lower" depending on if a higher or lower score is a
better protein score.
Example:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> high_low = data.higher_or_lower()
"""
if not self.high_low_better:
logger.info("Determining If a higher or lower score is better based on scored proteins")
worst_score = self.scored_proteins[-1].score
best_score = self.scored_proteins[0].score
if float(best_score) > float(worst_score):
higher_or_lower = self.HIGHER_PSM_SCORE
if float(best_score) < float(worst_score):
higher_or_lower = self.LOWER_PSM_SCORE
logger.info("best score = {}".format(best_score))
logger.info("worst score = {}".format(worst_score))
if best_score == worst_score:
raise ValueError(
"Best and Worst scores were identical, equal to {}. Score type {} produced the error, "
"please change psm_score type.".format(best_score, self.psm_score)
)
self.high_low_better = higher_or_lower
else:
higher_or_lower = self.high_low_better
return higher_or_lower
def get_protein_identifiers(self, data_form):
"""
Method to retrieve the protein string identifiers.
Args:
data_form (str): Can be one of the following: "main", "restricted", "picked", "picked_removed".
Returns:
list: list of protein identifier strings.
Example:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> protein_strings = data.get_protein_identifiers(data_form="main")
"""
if data_form == "main":
# All the data (unrestricted)
data_to_select = self.main_data_form
prots = [[x.possible_proteins] for x in data_to_select]
proteins = prots
if data_form == "restricted":
# Proteins that pass certain restriction criteria (peptide length, pep, qvalue)
data_to_select = self.main_data_restricted
prots = [[x.possible_proteins] for x in data_to_select]
proteins = prots
if data_form == "picked":
# Here we look at proteins that are 'picked' (aka the proteins that beat out their matching target/decoy)
data_to_select = self.picked_proteins_scored
prots = [x.identifier for x in data_to_select]
proteins = prots
if data_form == "picked_removed":
# Here we look at the proteins that were removed due to picking (aka the proteins that
# have a worse score than their target/decoy counterpart)
data_to_select = self.picked_proteins_removed
prots = [x.identifier for x in data_to_select]
proteins = prots
return proteins
def get_protein_information(self, protein_string):
"""
Method to retrieve attributes for a specific scored protein.
Args:
protein_string (str): Protein Identifier String.
Returns:
list: list of protein attributes.
Example:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> protein_attr = data.get_protein_information(protein_string="RAF1_HUMAN|P04049")
"""
all_scored_protein_data = self.scored_proteins
identifiers = [x.identifier for x in all_scored_protein_data]
protein_scores = [x.score for x in all_scored_protein_data]
groups = [x.group_identification for x in all_scored_protein_data]
reviewed = [x.reviewed for x in all_scored_protein_data]
peptides = [x.peptides for x in all_scored_protein_data]
# Peptide scores currently broken...
peptide_scores = [x.peptide_scores for x in all_scored_protein_data]
picked = [x.picked for x in all_scored_protein_data]
num_peptides = [x.num_peptides for x in all_scored_protein_data]
main_index = identifiers.index(protein_string)
list_structure = [
[
"identifier",
"protein_score",
"groups",
"reviewed",
"peptides",
"peptide_scores",
"picked",
"num_peptides",
]
]
list_structure.append([protein_string])
list_structure[-1].append(protein_scores[main_index])
list_structure[-1].append(groups[main_index])
list_structure[-1].append(reviewed[main_index])
list_structure[-1].append(peptides[main_index])
list_structure[-1].append(peptide_scores[main_index])
list_structure[-1].append(picked[main_index])
list_structure[-1].append(num_peptides[main_index])
return list_structure
def exclude_non_distinguishing_peptides(self, protein_subset_type="hard"):
"""
Method to Exclude peptides that are not distinguishing on either the search or database level.
The method sets the `scoring_input` and `restricted_peptides` variables for the DataStore object.
Args:
protein_subset_type (str): Either "hard" or "soft". Hard will select distinguishing peptides based on
the database digestion. "soft" will only use peptides identified in the search.
Returns:
None:
Example:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> data.exclude_non_distinguishing_peptides(protein_subset_type="hard")
"""
logger.info("Applying Exclusion Model")
our_proteins_sorted = self.get_sorted_identifiers(scored=False)
if protein_subset_type == "hard":
# Hard protein subsetting defines protein subsets on the digest level (Entire protein is used)
# This is how Percolator PI does subsetting
peptides = [self.digest.protein_to_peptide_dictionary[x] for x in our_proteins_sorted]
elif protein_subset_type == "soft":
# Soft protein subsetting defines protein subsets on the Peptides identified from the search
peptides = [set(x.raw_peptides) for x in self.scoring_input]
else:
# If neither is defined we do "hard" exclusion
peptides = [self.digest.protein_to_peptide_dictionary[x] for x in our_proteins_sorted]
# Get frozen set of peptides....
# We will also have a corresponding list of proteins...
# They will have the same index...
peptide_sets = [frozenset(e) for e in peptides]
# Find a way to sort this list of sets...
# We can sort the sets if we sort proteins from above...
logger.info("{} number of peptide sets".format(len(peptide_sets)))
non_subset_peptide_sets = set()
i = 0
# Get all peptide sets that are not a subset...
while peptide_sets:
i = i + 1
peptide_set = peptide_sets.pop()
if any(peptide_set.issubset(s) for s in peptide_sets) or any(
peptide_set.issubset(s) for s in non_subset_peptide_sets
):
continue
else:
non_subset_peptide_sets.add(peptide_set)
if i % 10000 == 0:
logger.info("Parsed {} Peptide Sets".format(i))
logger.info("Parsed {} Peptide Sets".format(i))
# Get their index from peptides which is the initial list of sets...
list_of_indeces = []
for pep_sets in non_subset_peptide_sets:
ind = peptides.index(pep_sets)
list_of_indeces.append(ind)
non_subset_proteins = set([our_proteins_sorted[x] for x in list_of_indeces])
logger.info("Removing direct subset Proteins from the data")
# Remove all proteins from scoring input that are a subset of another protein...
self.scoring_input = [x for x in self.scoring_input if x.identifier in non_subset_proteins]
logger.info("{} proteins in scoring input after removing subset proteins".format(len(self.scoring_input)))
# For all the proteins that are not a complete subset of another protein...
# Get the raw peptides...
raw_peps = [x.raw_peptides for x in self.scoring_input if x.identifier in non_subset_proteins]
# Make the raw peptides a flat list
flat_peptides = [Psm.split_peptide(peptide_string=item) for sublist in raw_peps for item in sublist]
# Count the number of peptides in this list...
# This is the number of proteins this peptide maps to....
counted_peptides = collections.Counter(flat_peptides)
# Keep only peptides that map to exactly one remaining protein; shared peptides are excluded from scoring input
raw_peps_good = set([x for x in counted_peptides.keys() if counted_peptides[x] <= 1])
# Alter self.scoring_input by removing psms and peptides that are not in raw_peps_good
current_score_input = list(self.scoring_input)
for j in range(len(current_score_input)):
k = j + 1
psm_list = []
new_raw_peptides = []
current_psms = current_score_input[j].psms
current_raw_peptides = current_score_input[j].raw_peptides
for psm_scores in current_psms:
if psm_scores.non_flanking_peptide in raw_peps_good:
psm_list.append(psm_scores)
for rp in current_raw_peptides:
if Psm.split_peptide(peptide_string=rp) in raw_peps_good:
new_raw_peptides.append(rp)
current_score_input[j].psms = psm_list
current_score_input[j].raw_peptides = new_raw_peptides
if k % 10000 == 0:
logger.info("Redefined {} Peptide Sets".format(k))
logger.info("Redefined {} Peptide Sets".format(len(current_score_input)))
filtered_score_input = [x for x in current_score_input if x.psms]
self.scoring_input = filtered_score_input
# Recompute the flat peptides
raw_peps = [x.raw_peptides for x in self.scoring_input if x.identifier in non_subset_proteins]
# Make the raw peptides a flat list
new_flat_peptides = set([Psm.split_peptide(peptide_string=item) for sublist in raw_peps for item in sublist])
self.scoring_input = [x for x in self.scoring_input if x.psms]
self.restricted_peptides = [x for x in self.restricted_peptides if x in new_flat_peptides]
def protein_picker(self):
"""
Method to run the protein picker algorithm.
Proteins must be scored first with [score_psms][pyproteininference.scoring.Score.score_psms].
The algorithm will match target and decoy proteins identified from the PSMs from the search.
If a target and its matching decoy are both found then target/decoy competition is performed.
In the Target/Decoy pair the protein with the better score is kept and the one with the worse score is
discarded from the analysis.
The method sets the `picked_proteins_scored` and `picked_proteins_removed` variables for
the DataStore object.
Returns:
None:
Example:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> data.protein_picker()
"""
self._validate_scored_proteins()
logger.info("Running Protein Picker")
# Use the higher_or_lower method to determine if a higher protein score or lower protein score is better
# based on the scoring method used
higher_or_lower = self.higher_or_lower()
# Here we determine if a lower or higher score is better
# Since all input is ordered from best to worst we can do the following
index_to_remove = []
# data.scored_proteins is simply a list of Protein objects...
# Create list of all decoy proteins
decoy_proteins = [x.identifier for x in self.scored_proteins if self.decoy_symbol in x.identifier]
# Create a list of all potential matching targets (some of these may not exist in the search)
matching_targets = [x.replace(self.decoy_symbol, "") for x in decoy_proteins]
# Create a list of all the proteins from the scored data
all_proteins = [x.identifier for x in self.scored_proteins]
logger.info("{} proteins scored".format(len(all_proteins)))
total_targets = []
total_decoys = []
decoys_removed = []
targets_removed = []
# Loop over all decoys identified in the search
logger.info("Picking Proteins...")
for i in range(len(decoy_proteins)):
cur_decoy_index = all_proteins.index(decoy_proteins[i])
cur_decoy_protein_object = self.scored_proteins[cur_decoy_index]
total_decoys.append(cur_decoy_protein_object.identifier)
# Try, Except here because the matching target to the decoy may not be a result from the search
try:
cur_target_index = all_proteins.index(matching_targets[i])
cur_target_protein_object = self.scored_proteins[cur_target_index]
total_targets.append(cur_target_protein_object.identifier)
if higher_or_lower == self.HIGHER_PSM_SCORE:
if cur_target_protein_object.score > cur_decoy_protein_object.score:
index_to_remove.append(cur_decoy_index)
decoys_removed.append(cur_decoy_index)
cur_target_protein_object.picked = True
cur_decoy_protein_object.picked = False
else:
index_to_remove.append(cur_target_index)
targets_removed.append(cur_target_index)
cur_decoy_protein_object.picked = True
cur_target_protein_object.picked = False
if higher_or_lower == self.LOWER_PSM_SCORE:
if cur_target_protein_object.score < cur_decoy_protein_object.score:
index_to_remove.append(cur_decoy_index)
decoys_removed.append(cur_decoy_index)
cur_target_protein_object.picked = True
cur_decoy_protein_object.picked = False
else:
index_to_remove.append(cur_target_index)
targets_removed.append(cur_target_index)
cur_decoy_protein_object.picked = True
cur_target_protein_object.picked = False
except ValueError:
pass
logger.info("{} total decoy proteins".format(len(total_decoys)))
logger.info("{} matching target proteins also found in search".format(len(total_targets)))
logger.info("{} decoy proteins to be removed".format(len(decoys_removed)))
logger.info("{} target proteins to be removed".format(len(targets_removed)))
logger.info("Removing Lower Scoring Proteins...")
picked_list = []
removed_proteins = []
for protein_objects in self.scored_proteins:
if protein_objects.picked:
picked_list.append(protein_objects)
else:
removed_proteins.append(protein_objects)
self.picked_proteins_scored = picked_list
self.picked_proteins_removed = removed_proteins
logger.info("Finished Removing Proteins")
def calculate_q_values(self, regular=True):
"""
Method calculates q-values (based on FDR) using the lead protein of each group in the `protein_group_objects`
instance variable.
FDR is calculated as (2*decoys)/total if regular is set to True and as
(decoys)/total if regular is set to False.
This method updates the `protein_group_objects` for the DataStore object by updating
the q_value variable of the [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.
Returns:
None:
Example:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> # Proteins must be scored and inference must be run first
>>> data.calculate_q_values()
"""
self._validate_protein_group_objects()
logger.info("Calculating Q values from the protein group objects")
# pick out the lead scoring protein for each group... lead score is at 0 position
lead_score = [x.proteins[0] for x in self.protein_group_objects]
# Now pick out only the lead protein identifiers
lead_proteins = [x.identifier for x in lead_score]
lead_proteins.reverse()
logger.info("Calculating FDRs")
fdr_list = []
for i in range(len(lead_proteins)):
binary_decoy_target_list = [1 if self.decoy_symbol in elem else 0 for elem in lead_proteins]
total = len(lead_proteins)
decoys = sum(binary_decoy_target_list)
# Calculate FDR at every step starting with the entire list...
# Delete first entry (worst score) every time we go through a cycle
if regular:
fdr = (2 * decoys) / (float(total))
else:
fdr = (decoys) / (float(total))
fdr_list.append(fdr)
del lead_proteins[0]
qvalue_list = []
new_fdr_list = []
logger.info("Calculating Q Values")
for fdrs in fdr_list:
new_fdr_list.append(fdrs)
qvalue = min(new_fdr_list)
# qvalue = fdrs
qvalue_list.append(qvalue)
qvalue_list.reverse()
logger.info("Assigning Q Values")
for k in range(len(self.protein_group_objects)):
self.protein_group_objects[k].q_value = qvalue_list[k]
fdr_restricted = [x for x in self.protein_group_objects if x.q_value <= self.parameter_file_object.fdr]
fdr_restricted_set = [self.grouped_scored_proteins[x] for x in range(len(fdr_restricted))]
onehitwonders = []
for groups in fdr_restricted_set:
if int(groups[0].num_peptides) == 1:
onehitwonders.append(groups[0])
logger.info(
"Protein Group leads that pass with more than 1 PSM with a {} FDR = {}".format(
self.parameter_file_object.fdr,
str(len(fdr_restricted_set) - len(onehitwonders)),
)
)
logger.info(
"Protein Group lead One hit Wonders that pass {} FDR = {}".format(
self.parameter_file_object.fdr, len(onehitwonders)
)
)
logger.info(
"Number of Protein groups that pass a {} percent FDR: {}".format(
str(self.parameter_file_object.fdr * 100), len(fdr_restricted_set)
)
)
logger.info("Finished Q value Calculation")
def validate_psm_data(self):
"""
Method that validates the PSM data.
"""
self._validate_decoys_from_data()
self._validate_isoform_from_data()
def validate_digest(self):
"""
Method that validates the [Digest object][pyproteininference.in_silico_digest.Digest].
"""
self._validate_reviewed_v_unreviewed()
self._check_target_decoy_split()
def check_data_consistency(self):
"""
Method that checks for data consistency.
"""
self._check_data_digest_overlap_psms()
self._check_data_digest_overlap_proteins()
def _check_data_digest_overlap_psms(self):
"""
Method that logs the overlap between the digested fasta file and the input files on the PSM level.
"""
peptides = [x.stripped_peptide for x in self.main_data_form]
peptides_in_digest = set(self.digest.peptide_to_protein_dictionary.keys())
peptides_from_search_in_digest = [x for x in peptides if x in peptides_in_digest]
percentage = float(len(set(peptides))) / float(len(set(peptides_from_search_in_digest)))
logger.info("{} PSMs identified from input files".format(len(peptides)))
logger.info(
"{} PSMs identified from input files that are also present in database digestion".format(
len(peptides_from_search_in_digest)
)
)
logger.info(
"{}; ratio of PSMs identified from input files to those that are present in the search"
" and in the database digestion".format(percentage)
)
def _check_data_digest_overlap_proteins(self):
"""
Method that logs the overlap between the digested fasta file and the input files on the Protein level.
"""
proteins = [x.possible_proteins for x in self.main_data_form]
flat_proteins = set([item for sublist in proteins for item in sublist])
proteins_in_digest = set(self.digest.protein_to_peptide_dictionary.keys())
proteins_from_search_in_digest = [x for x in flat_proteins if x in proteins_in_digest]
percentage = float(len(flat_proteins)) / float(len(proteins_from_search_in_digest))
logger.info("{} proteins identified from input files".format(len(flat_proteins)))
logger.info(
"{} proteins identified from input files that are also present in database digestion".format(
len(proteins_from_search_in_digest)
)
)
logger.info(
"{}; ratio of proteins identified from input files that are also present in database digestion".format(
percentage
)
)
def _check_target_decoy_split(self):
"""
Method that logs the number of target and decoy proteins from the digest.
"""
# Check the number of targets vs the number of decoys from the digest
targets = [
x
for x in self.digest.protein_to_peptide_dictionary.keys()
if self.parameter_file_object.decoy_symbol not in x
]
decoys = [
x for x in self.digest.protein_to_peptide_dictionary.keys() if self.parameter_file_object.decoy_symbol in x
]
ratio = float(len(targets)) / float(len(decoys))
logger.info("Number of Target Proteins in Digest: {}".format(len(targets)))
logger.info("Number of Decoy Proteins in Digest: {}".format(len(decoys)))
logger.info("Ratio of Target Proteins to Decoy Proteins: {}".format(ratio))
def _validate_decoys_from_data(self):
"""
Method that checks to make sure that target and decoy proteins exist in the data files.
"""
# Check to see if we find decoys from our input files
proteins = [x.possible_proteins for x in self.main_data_form]
flat_proteins = set([item for sublist in proteins for item in sublist])
targets = [x for x in flat_proteins if self.parameter_file_object.decoy_symbol not in x]
decoys = [x for x in flat_proteins if self.parameter_file_object.decoy_symbol in x]
logger.info("Number of Target Proteins in Data Files: {}".format(len(targets)))
logger.info("Number of Decoy Proteins in Data Files: {}".format(len(decoys)))
def _validate_isoform_from_data(self):
"""
Method that validates whether or not isoforms are able to be identified in the data files.
"""
# Check to see if we find any proteins with isoform info in name in our input files
proteins = [x.possible_proteins for x in self.main_data_form]
flat_proteins = set([item for sublist in proteins for item in sublist])
if self.parameter_file_object.isoform_symbol:
non_iso = [x for x in flat_proteins if self.parameter_file_object.isoform_symbol not in x]
else:
non_iso = [x for x in flat_proteins]
if self.parameter_file_object.isoform_symbol:
iso = [x for x in flat_proteins if self.parameter_file_object.isoform_symbol in x]
else:
iso = []
logger.info("Number of Non Isoform Labeled Proteins in Data Files: {}".format(len(non_iso)))
logger.info("Number of Isoform Labeled Proteins in Data Files: {}".format(len(iso)))
def _validate_reviewed_v_unreviewed(self):
"""
Method that logs whether or not we can distinguish between reviewed and unreviewed protein identifiers
in the digest.
"""
# Check to see if we get reviewed prots in digest...
reviewed_proteins = len(self.digest.swiss_prot_protein_set)
proteins_in_digest = len(set(self.digest.protein_to_peptide_dictionary.keys()))
unreviewed_proteins = proteins_in_digest - reviewed_proteins
logger.info("Number of Total Proteins from Digest: {}".format(proteins_in_digest))
logger.info("Number of Reviewed Proteins from Digest: {}".format(reviewed_proteins))
logger.info("Number of Unreviewed Proteins from Digest: {}".format(unreviewed_proteins))
@classmethod
def sort_protein_strings(cls, protein_string_list, sp_proteins, decoy_symbol):
"""
Method that sorts protein strings in the following order: Target Reviewed, Decoy Reviewed, Target Unreviewed,
Decoy Unreviewed.
Args:
protein_string_list (list): List of Protein Strings.
sp_proteins (set): Set of Reviewed Protein Strings.
decoy_symbol (str): Symbol that denotes a decoy protein identifier, e.g. "##".
Returns:
list: List of sorted protein strings.
Example:
>>> sorted_protein_strings = datastore.DataStore.sort_protein_strings(
...     protein_string_list=protein_string_list, sp_proteins=sp_proteins, decoy_symbol="##"
... )
"""
our_target_sp_proteins = sorted([x for x in protein_string_list if x in sp_proteins and decoy_symbol not in x])
our_decoy_sp_proteins = sorted([x for x in protein_string_list if x in sp_proteins and decoy_symbol in x])
our_target_tr_proteins = sorted(
[x for x in protein_string_list if x not in sp_proteins and decoy_symbol not in x]
)
our_decoy_tr_proteins = sorted([x for x in protein_string_list if x not in sp_proteins and decoy_symbol in x])
identifiers_sorted = (
our_target_sp_proteins + our_decoy_sp_proteins + our_target_tr_proteins + our_decoy_tr_proteins
)
return identifiers_sorted
def input_has_q(self):
"""
Method that checks to see if the input data has q values.
"""
len_q = len([x.qvalue for x in self.main_data_form if x.qvalue])
len_all = len(self.main_data_form)
if len_q == len_all:
status = True
logger.info("Input has Q value; Can restrict by Q value")
else:
status = False
logger.warning("Input does not have Q value; Cannot restrict by Q value")
return status
def input_has_pep(self):
"""
Method that checks to see if the input data has pep values.
"""
len_pep = len([x.pepvalue for x in self.main_data_form if x.pepvalue])
len_all = len(self.main_data_form)
if len_pep == len_all:
status = True
logger.info("Input has Pep value; Can restrict by Pep value")
else:
status = False
logger.warning("Input does not have Pep value; Cannot restrict by Pep value")
return status
def input_has_custom(self):
"""
Method that checks to see if the input data has custom score values.
"""
len_c = len([x.custom_score for x in self.main_data_form if x.custom_score])
len_all = len(self.main_data_form)
if len_c == len_all:
status = True
logger.info("Input has Custom value; Can restrict by Custom value")
else:
status = False
logger.warning("Input does not have Custom value; Cannot restrict by Custom value")
return status
def get_protein_objects(self, false_discovery_rate=None, fdr_restricted=False):
"""
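Method retrieves protein objects. Either retrieves an FDR-restricted list of protein objects,
or retrieves all objects.
Args:
false_discovery_rate (float): FDR threshold used when `fdr_restricted` is True. Defaults to the `fdr` value
of the parameter file object when not provided.
fdr_restricted (bool): True/False on whether to restrict the list of objects based on FDR.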
Returns:
list: List of scored [ProteinGroup][pyproteininference.physical.ProteinGroup] objects
that have been grouped and sorted.
"""
if not false_discovery_rate:
false_discovery_rate = self.parameter_file_object.fdr
if fdr_restricted:
protein_objects = [x.proteins for x in self.protein_group_objects if x.q_value <= false_discovery_rate]
else:
protein_objects = self.grouped_scored_proteins
return protein_objects
def _init_validate(self, reader):
"""
Internal Method that checks to make sure the reader object is properly loaded and validated.
"""
if reader.psms:
self.main_data_form = reader.psms # Unrestricted PSM data
self.restricted_peptides = [x.non_flanking_peptide for x in self.main_data_form]
else:
raise ValueError(
"Psms variable from Reader object is either empty or does not exist. "
"Make sure your files contain proper data and that you run the 'read_psms' "
"method on your Reader object."
)
def _validate_main_data_form(self):
"""
Internal Method that checks to make sure the Main data has been defined to run DataStore methods.
"""
if self.main_data_form:
pass
else:
raise ValueError(
"Main Data is not defined, thus method cannot be run. Please make sure PSM data is properly"
" loaded from the Reader object"
)
def _validate_main_data_restricted(self):
"""
Internal Method that checks to make sure the Main data Restricted has been defined to run DataStore methods.
"""
if self.main_data_restricted:
pass
else:
raise ValueError(
"Main Data Restricted is not defined, thus method cannot be run. Please make sure PSM data is properly"
" loaded from the Reader object and make sure to run DataStore method 'restrict_psm_data'."
)
def _validate_scored_proteins(self):
"""
Internal Method that checks to make sure that proteins have been scored to run certain subsequent methods.
"""
if self.picked_proteins_scored or self.scored_proteins:
pass
else:
raise ValueError(
"Proteins have not been scored, Please initialize a Score object and run a score method with"
" 'score_psms' instance method."
)
def _validate_scoring_input(self):
"""
Internal Method that checks to make sure that Scoring Input has been created to be able to run scoring methods.
"""
if self.scoring_input:
pass
else:
raise ValueError(
"Scoring input has not been created, Please run 'create_scoring_input' method from the DataStore "
"object to continue."
)
def _validate_protein_group_objects(self):
"""
Internal Method that checks to make sure inference has been run before proceeding.
"""
if self.protein_group_objects and self.grouped_scored_proteins:
pass
else:
raise ValueError(
"Either 'protein_group_objects' or 'grouped_scored_proteins' or both DataStore variables are undefined."
" Please make sure you run an inference method from the Inference class before proceeding."
)
def generate_fdr_vs_target_hits(self, fdr_max=0.2):
"""
Method for calculating FDR vs number of Target Proteins.
Args:
fdr_max (float): The maximum false discovery rate to calculate target hits for.
Will stop once fdr_max is reached.
Returns:
list: List of lists of: (FDR, Number of Target Hits). Ordered by increasing number of Target Hits.
"""
fdr_vs_count = []
count_list = []
for pg in self.protein_group_objects:
if self.decoy_symbol not in pg.proteins[0].identifier:
count_list.append(pg)
fdr_vs_count.append([pg.q_value, len(count_list)])
fdr_vs_count = [x for x in fdr_vs_count if x[0] < fdr_max]
return fdr_vs_count
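def recover_mapping(self):
"""
Method that adds proteins found in the input files but missing from the database digest
to the digest mapping dictionaries.
"""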
logger.info("Recovering Proteins that exist in the input files but not in the database digest.")
all_psms = self.get_psm_data()
proteins = [x.possible_proteins for x in all_psms]
flat_proteins = [item for sublist in proteins for item in sublist]
missing_prots = []
for prot in flat_proteins:
try:
self.digest.protein_to_peptide_dictionary[prot]
except KeyError:
missing_prots.append(prot)
psm_data = self.get_psm_data()
peptides = [x.stripped_peptide for x in psm_data if prot in x.possible_proteins]
for pep in peptides:
self.digest.peptide_to_protein_dictionary.setdefault(pep, set()).add(prot)
self.digest.protein_to_peptide_dictionary.setdefault(prot, set()).add(pep)
if missing_prots:
logger.info(
"{} proteins not found in mapping objects, please double check that your database"
" provided is accurate for the given input data.".format(len(missing_prots))
)
else:
logger.info("No missing proteins in the mapping objects.")
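The target/decoy competition performed by `protein_picker` in the class source above can be illustrated in isolation. The following is a minimal sketch, not the library's implementation: `ScoredProtein` and `pick_proteins` are hypothetical names, the `picked` flag is assumed to default to True, and the decoy symbol `"##"` is only an example value. As in the source, a tie removes the target.

```python
from dataclasses import dataclass


@dataclass
class ScoredProtein:
    """Hypothetical stand-in for a scored Protein object."""

    identifier: str
    score: float
    picked: bool = True  # assumption: proteins count as picked until they lose a competition


def pick_proteins(proteins, decoy_symbol="##", higher_is_better=True):
    """Return (kept, removed) after target/decoy competition."""
    by_identifier = {p.identifier: p for p in proteins}
    for decoy in proteins:
        if decoy_symbol not in decoy.identifier:
            continue
        # The matching target may not have been identified in the search
        target = by_identifier.get(decoy.identifier.replace(decoy_symbol, ""))
        if target is None:
            continue
        if higher_is_better:
            target_wins = target.score > decoy.score
        else:
            target_wins = target.score < decoy.score
        target.picked = target_wins
        decoy.picked = not target_wins
    kept = [p for p in proteins if p.picked]
    removed = [p for p in proteins if not p.picked]
    return kept, removed
```

Using a dictionary lookup instead of repeated `list.index` calls keeps the pairing step linear in the number of proteins; unpaired decoys and unpaired targets are left untouched.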
__init__(self, reader, digest, validate=True)
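Parameters:
Name | Type | Description | Default |
---|---|---|---|
reader | Reader | Reader object. | required |
digest | Digest | Digest object. | required |
validate | bool | True/False to indicate if the input data should be validated. | True |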
Examples:
>>> pyproteininference.datastore.DataStore(reader = reader, digest=digest)
Source code in pyproteininference/datastore.py
def __init__(self, reader, digest, validate=True):
"""
Args:
reader (Reader): Reader object [Reader][pyproteininference.reader.Reader].
digest (Digest): Digest object
[Digest][pyproteininference.in_silico_digest.Digest].
validate (bool): True/False to indicate if the input data should be validated.
Example:
>>> pyproteininference.datastore.DataStore(reader = reader, digest=digest)
"""
# If the reader class is from a percolator.psms then define main_data_form as reader.psms
# main_data_form is the starting point for all other analyses
self._init_validate(reader=reader)
self.parameter_file_object = reader.parameter_file_object # Parameter object
self.main_data_restricted = None # PSM data post restriction
self.scored_proteins = [] # List of scored Protein objects
self.grouped_scored_proteins = [] # List of sorted scored Protein objects
self.scoring_input = None # List of non scored Protein objects
self.picked_proteins_scored = None # List of Protein objects after picker algorithm
self.picked_proteins_removed = None # Protein objects removed via picker
self.protein_peptide_dictionary = None
self.peptide_protein_dictionary = None
self.high_low_better = None # Variable that indicates whether a higher or lower protein score is better
self.psm_score = None # PSM Score used
self.protein_score = None
self.short_protein_score = None
self.protein_group_objects = [] # List of sorted protein group objects
self.decoy_symbol = self.parameter_file_object.decoy_symbol # Decoy symbol from parameter file
self.digest = digest # Digest object
# Run Checks and Validations
if validate:
self.validate_psm_data()
self.validate_digest()
self.check_data_consistency()
# Run method to fix our parameter object if necessary
self.parameter_file_object.fix_parameters_from_datastore(data=self)
calculate_q_values(self, regular=True)
Method calculates q-values (based on FDR) using the lead protein of each group in the protein_group_objects instance variable.
FDR is calculated as (2*decoys)/total if regular is set to True and as (decoys)/total if regular is set to False.
This method updates the protein_group_objects for the DataStore object by updating the q_value variable of the ProteinGroup objects.
Returns:
Type | Description |
---|---|
None | |
Examples:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> # Proteins must be scored and inference must be run first
>>> data.calculate_q_values()
Source code in pyproteininference/datastore.py
def calculate_q_values(self, regular=True):
"""
Method calculates q-values (based on FDR) using the lead protein of each group in the `protein_group_objects`
instance variable.
FDR is calculated as (2*decoys)/total if regular is set to True and as
(decoys)/total if regular is set to False.
This method updates the `protein_group_objects` for the DataStore object by updating
the q_value variable of the [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.
Returns:
None:
Example:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> # Proteins must be scored and inference must be run first
>>> data.calculate_q_values()
"""
self._validate_protein_group_objects()
logger.info("Calculating Q values from the protein group objects")
# pick out the lead scoring protein for each group... lead score is at 0 position
lead_score = [x.proteins[0] for x in self.protein_group_objects]
# Now pick out only the lead protein identifiers
lead_proteins = [x.identifier for x in lead_score]
lead_proteins.reverse()
logger.info("Calculating FDRs")
fdr_list = []
for i in range(len(lead_proteins)):
binary_decoy_target_list = [1 if self.decoy_symbol in elem else 0 for elem in lead_proteins]
total = len(lead_proteins)
decoys = sum(binary_decoy_target_list)
# Calculate FDR at every step starting with the entire list...
# Delete first entry (worst score) every time we go through a cycle
if regular:
fdr = (2 * decoys) / (float(total))
else:
fdr = (decoys) / (float(total))
fdr_list.append(fdr)
del lead_proteins[0]
qvalue_list = []
new_fdr_list = []
logger.info("Calculating Q Values")
for fdrs in fdr_list:
new_fdr_list.append(fdrs)
qvalue = min(new_fdr_list)
# qvalue = fdrs
qvalue_list.append(qvalue)
qvalue_list.reverse()
logger.info("Assigning Q Values")
for k in range(len(self.protein_group_objects)):
self.protein_group_objects[k].q_value = qvalue_list[k]
fdr_restricted = [x for x in self.protein_group_objects if x.q_value <= self.parameter_file_object.fdr]
fdr_restricted_set = [self.grouped_scored_proteins[x] for x in range(len(fdr_restricted))]
onehitwonders = []
for groups in fdr_restricted_set:
if int(groups[0].num_peptides) == 1:
onehitwonders.append(groups[0])
logger.info(
"Protein Group leads that pass with more than 1 PSM with a {} FDR = {}".format(
self.parameter_file_object.fdr,
str(len(fdr_restricted_set) - len(onehitwonders)),
)
)
logger.info(
"Protein Group lead One hit Wonders that pass {} FDR = {}".format(
self.parameter_file_object.fdr, len(onehitwonders)
)
)
logger.info(
"Number of Protein groups that pass a {} percent FDR: {}".format(
str(self.parameter_file_object.fdr * 100), len(fdr_restricted_set)
)
)
logger.info("Finished Q value Calculation")
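The q-value calculation above can be sketched as a standalone function. This is a simplified, linear-time illustration of the same idea rather than the library's code: the function name `q_values`, its arguments, and the `"##"` decoy symbol are assumptions made for the example. The input is a list of lead protein identifiers ordered from best to worst score.

```python
def q_values(lead_identifiers, decoy_symbol="##", regular=True):
    """Compute q-values for lead identifiers ordered from best to worst score."""
    factor = 2 if regular else 1
    fdrs = []
    decoys = 0
    # FDR at each rank: (factor * decoys so far) / (entries so far)
    for rank, identifier in enumerate(lead_identifiers, start=1):
        if decoy_symbol in identifier:
            decoys += 1
        fdrs.append(factor * decoys / rank)
    # The q-value of an entry is the smallest FDR at its rank or any worse rank
    q_list = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q_list.append(running_min)
    q_list.reverse()
    return q_list
```

Taking the running minimum from the worst rank upward makes the q-values non-decreasing down the ranked list, which is what allows a single threshold such as the `fdr` parameter to be applied afterwards.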
check_data_consistency(self)
Method that checks for data consistency.
Source code in pyproteininference/datastore.py
def check_data_consistency(self):
"""
Method that checks for data consistency.
"""
self._check_data_digest_overlap_psms()
self._check_data_digest_overlap_proteins()
create_scoring_input(self)
Method to create the scoring input. This method initializes a list of Protein objects to get them ready to be scored by Score methods. This method also takes into account the inference type and aggregates peptides -> proteins accordingly.
This method sets the scoring_input and psm_score attributes for the DataStore object.
The score selected comes from the protein inference parameter object.
Returns:
Type | Description |
---|---|
None | |
Examples:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> data.create_scoring_input()
Source code in pyproteininference/datastore.py
def create_scoring_input(self):
"""
Method to create the scoring input.
This method initializes a list of [Protein][pyproteininference.physical.Protein] objects to get them ready
to be scored by [Score][pyproteininference.scoring.Score] methods.
This method also takes into account the inference type and aggregates peptides -> proteins accordingly.
This method sets the `scoring_input` and `psm_score` attributes for the DataStore object.
The score selected comes from the protein inference parameter object.
Returns:
None:
Example:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> data.create_scoring_input()
"""
logger.info("Creating Scoring Input")
psm_data = self.get_psm_data()
protein_psm_dict = collections.defaultdict(list)
try:
score_key = self.SCORE_MAPPER[self.parameter_file_object.psm_score]
except KeyError:
score_key = self.CUSTOM_SCORE_KEY
if self.parameter_file_object.inference_type != Inference.PEPTIDE_CENTRIC:
# Loop through all Psms
for psms in psm_data:
psms.assign_main_score(score=score_key)
# Loop through all proteins
for prots in psms.possible_proteins:
protein_psm_dict[prots].append(psms)
else:
self.peptide_to_protein_dictionary()
sp_proteins = self.digest.swiss_prot_protein_set
for psms in psm_data:
# Assign main score
psms.assign_main_score(score=score_key)
protein_set = self.peptide_protein_dictionary[psms.non_flanking_peptide]
# Sort protein_set by sp-alpha, decoy-sp-alpha, tr-alpha, decoy-tr-alpha
sorted_protein_list = self.sort_protein_strings(
protein_string_list=protein_set,
sp_proteins=sp_proteins,
decoy_symbol=self.parameter_file_object.decoy_symbol,
)
# Restrict the number of identifiers by the value in param file max_identifiers_peptide_centric
sorted_protein_list = sorted_protein_list[: self.parameter_file_object.max_identifiers_peptide_centric]
protein_name = ";".join(sorted_protein_list)
protein_psm_dict[protein_name].append(psms)
protein_list = []
for pkey in sorted(protein_psm_dict.keys()):
protein_object = Protein(identifier=pkey)
protein_object.psms = protein_psm_dict[pkey]
protein_object.raw_peptides = set([x.identifier for x in protein_psm_dict[pkey]])
protein_list.append(protein_object)
self.psm_score = self.parameter_file_object.psm_score
self.scoring_input = protein_list
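The peptide-centric branch of `create_scoring_input` above groups PSMs under a combined protein name. A minimal sketch of that grouping follows, assuming plain tuples and dictionaries in place of Psm and Digest objects. The function `group_psms_peptide_centric` and its arguments are hypothetical, and `sort_protein_strings` here is a compact re-expression of the ordering described for the class method (target reviewed, decoy reviewed, target unreviewed, decoy unreviewed), not the library's code.

```python
import collections


def sort_protein_strings(protein_strings, sp_proteins, decoy_symbol):
    """Order: target reviewed, decoy reviewed, target unreviewed, decoy unreviewed."""

    def rank(protein):
        unreviewed = protein not in sp_proteins
        decoy = decoy_symbol in protein
        # False sorts before True; ties are broken alphabetically
        return (unreviewed, decoy, protein)

    return sorted(protein_strings, key=rank)


def group_psms_peptide_centric(
    psm_peptides, peptide_to_proteins, sp_proteins, decoy_symbol="##", max_identifiers=5
):
    """Map each PSM to a ';'-joined protein name built from its peptide's proteins."""
    groups = collections.defaultdict(list)
    for psm_id, peptide in psm_peptides:
        ordered = sort_protein_strings(peptide_to_proteins[peptide], sp_proteins, decoy_symbol)
        # Limit how many identifiers make up the combined name
        name = ";".join(ordered[:max_identifiers])
        groups[name].append(psm_id)
    return dict(groups)
```

Because the name is built from a deterministic ordering, two PSMs that share a peptide always land under the same key, regardless of the iteration order of the underlying set.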
exclude_non_distinguishing_peptides(self, protein_subset_type='hard')
Method to exclude peptides that are not distinguishing on either the search or database level.
The method sets the scoring_input and restricted_peptides variables for the DataStore object.
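Parameters:
Name | Type | Description | Default |
---|---|---|---|
protein_subset_type | str | Either "hard" or "soft". "hard" selects distinguishing peptides based on the database digestion; "soft" only uses peptides identified in the search. | 'hard' |
Returns:
Type | Description |
---|---|
None | |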
Examples:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> data.exclude_non_distinguishing_peptides(protein_subset_type="hard")
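The exclusion model can be illustrated on a plain protein-to-peptide mapping. The sketch below is a simplification under stated assumptions: it uses one mapping for both steps (closer to the "soft" mode), orders proteins alphabetically rather than with the DataStore's own identifier sorting, and the function names are hypothetical. Proteins whose peptide set is contained in another protein's set are dropped first, then peptides shared between the remaining proteins are removed.

```python
import collections


def non_subset_proteins(protein_to_peptides):
    """Keep proteins whose peptide set is not contained in another protein's set."""
    ordered = sorted(protein_to_peptides)
    sets = [frozenset(protein_to_peptides[p]) for p in ordered]
    kept = []
    for i, current in enumerate(sets):
        proper_subset = any(current < other for other in sets)
        # For identical peptide sets, only the first protein in sort order is kept
        duplicate_of_earlier = any(current == earlier for earlier in sets[:i])
        if not proper_subset and not duplicate_of_earlier:
            kept.append(ordered[i])
    return kept


def distinguishing_peptides(protein_to_peptides):
    """Return remaining proteins mapped to the peptides unique to each of them."""
    kept = non_subset_proteins(protein_to_peptides)
    counts = collections.Counter(
        peptide for protein in kept for peptide in set(protein_to_peptides[protein])
    )
    result = {}
    for protein in kept:
        unique = {pep for pep in protein_to_peptides[protein] if counts[pep] == 1}
        if unique:  # proteins left without any peptide are dropped
            result[protein] = unique
    return result
```

For example, with `{"A": {"p1", "p2", "p3"}, "B": {"p1", "p2"}, "C": {"p3", "p4"}}`, protein B is removed as a subset of A, and the shared peptide p3 is excluded from both A and C.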
Source code in pyproteininference/datastore.py
def exclude_non_distinguishing_peptides(self, protein_subset_type="hard"):
"""
Method to exclude peptides that are not distinguishing on either the search or database level.
The method sets the `scoring_input` and `restricted_peptides` variables for the DataStore object.
Args:
protein_subset_type (str): Either "hard" or "soft". "hard" selects distinguishing peptides based on
the database digestion. "soft" only uses peptides identified in the search.
Returns:
None:
Example:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> data.exclude_non_distinguishing_peptides(protein_subset_type="hard")
"""
logger.info("Applying Exclusion Model")
our_proteins_sorted = self.get_sorted_identifiers(scored=False)
if protein_subset_type == "hard":
# Hard protein subsetting defines protein subsets on the digest level (Entire protein is used)
# This is how Percolator PI does subsetting
peptides = [self.digest.protein_to_peptide_dictionary[x] for x in our_proteins_sorted]
elif protein_subset_type == "soft":
# Soft protein subsetting defines protein subsets on the Peptides identified from the search
peptides = [set(x.raw_peptides) for x in self.scoring_input]
else:
# If neither is defined we default to "hard" exclusion
peptides = [self.digest.protein_to_peptide_dictionary[x] for x in our_proteins_sorted]
# Get frozen set of peptides....
# We will also have a corresponding list of proteins...
# They will have the same index...
peptide_sets = [frozenset(e) for e in peptides]
# Find a way to sort this list of sets...
# We can sort the sets if we sort proteins from above...
logger.info("{} number of peptide sets".format(len(peptide_sets)))
non_subset_peptide_sets = set()
i = 0
# Get all peptide sets that are not a subset...
while peptide_sets:
i = i + 1
peptide_set = peptide_sets.pop()
if any(peptide_set.issubset(s) for s in peptide_sets) or any(
peptide_set.issubset(s) for s in non_subset_peptide_sets
):
continue
else:
non_subset_peptide_sets.add(peptide_set)
if i % 10000 == 0:
logger.info("Parsed {} Peptide Sets".format(i))
logger.info("Parsed {} Peptide Sets".format(i))
# Get their index from peptides which is the initial list of sets...
list_of_indeces = []
for pep_sets in non_subset_peptide_sets:
ind = peptides.index(pep_sets)
list_of_indeces.append(ind)
non_subset_proteins = set([our_proteins_sorted[x] for x in list_of_indeces])
logger.info("Removing direct subset Proteins from the data")
# Remove all proteins from scoring input that are a subset of another protein...
self.scoring_input = [x for x in self.scoring_input if x.identifier in non_subset_proteins]
logger.info("{} proteins in scoring input after removing subset proteins".format(len(self.scoring_input)))
# For all the proteins that are not a complete subset of another protein...
# Get the raw peptides...
raw_peps = [x.raw_peptides for x in self.scoring_input if x.identifier in non_subset_proteins]
# Make the raw peptides a flat list
flat_peptides = [Psm.split_peptide(peptide_string=item) for sublist in raw_peps for item in sublist]
# Count the number of peptides in this list...
# This is the number of proteins this peptide maps to....
counted_peptides = collections.Counter(flat_peptides)
# Keep only peptides that map to exactly one remaining protein; shared peptides are excluded from scoring input
raw_peps_good = set([x for x in counted_peptides.keys() if counted_peptides[x] <= 1])
# Alter self.scoring_input by removing psms and peptides that are not in raw_peps_good
current_score_input = list(self.scoring_input)
for j in range(len(current_score_input)):
k = j + 1
psm_list = []
new_raw_peptides = []
current_psms = current_score_input[j].psms
current_raw_peptides = current_score_input[j].raw_peptides
for psm_scores in current_psms:
if psm_scores.non_flanking_peptide in raw_peps_good:
psm_list.append(psm_scores)
for rp in current_raw_peptides:
if Psm.split_peptide(peptide_string=rp) in raw_peps_good:
new_raw_peptides.append(rp)
current_score_input[j].psms = psm_list
current_score_input[j].raw_peptides = new_raw_peptides
if k % 10000 == 0:
logger.info("Redefined {} Peptide Sets".format(k))
logger.info("Redefined {} Peptide Sets".format(len(current_score_input)))
filtered_score_input = [x for x in current_score_input if x.psms]
self.scoring_input = filtered_score_input
# Recompute the flat peptides
raw_peps = [x.raw_peptides for x in self.scoring_input if x.identifier in non_subset_proteins]
# Make the raw peptides a flat list
new_flat_peptides = set([Psm.split_peptide(peptide_string=item) for sublist in raw_peps for item in sublist])
self.scoring_input = [x for x in self.scoring_input if x.psms]
self.restricted_peptides = [x for x in self.restricted_peptides if x in new_flat_peptides]
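The two filtering stages above (dropping subset proteins, then dropping shared peptides) can be illustrated on toy data. This is a minimal sketch and not the library implementation: the helper `exclude_non_distinguishing`, the protein names, and the peptide strings are all hypothetical, and a plain dictionary stands in for the Protein objects.

```python
import collections


def exclude_non_distinguishing(protein_to_peptides):
    """Toy exclusion model on a {protein: set(peptides)} mapping (hypothetical helper)."""
    # Stage 1: drop proteins whose peptide set is contained in another protein's set.
    # Identical sets are collapsed so only the first protein in sorted order survives.
    kept = {}
    for protein in sorted(protein_to_peptides):
        peptides = frozenset(protein_to_peptides[protein])
        others = [frozenset(v) for k, v in protein_to_peptides.items() if k != protein]
        if any(peptides < other for other in others):
            continue  # strict subset of another protein
        if any(peptides == seen for seen in kept.values()):
            continue  # duplicate of a protein that was already kept
        kept[protein] = peptides

    # Stage 2: count how many surviving proteins each peptide maps to.
    counts = collections.Counter(p for peps in kept.values() for p in peps)
    unique = {p for p, n in counts.items() if n == 1}

    # Keep only peptides unique to one protein, then drop proteins left with none.
    trimmed = {prot: peps & unique for prot, peps in kept.items()}
    return {prot: peps for prot, peps in trimmed.items() if peps}


toy_input = {
    "A": {"p1", "p2", "p3"},
    "B": {"p1", "p2"},  # subset of A, removed in stage 1
    "C": {"p3", "p4"},  # shares p3 with A, so p3 is removed in stage 2
    "D": {"p5"},
    "E": {"p5"},  # duplicate of D, removed in stage 1
}
toy_result = exclude_non_distinguishing(toy_input)
```

In this toy run, protein A keeps p1 and p2, C keeps p4, and D keeps p5.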
generate_fdr_vs_target_hits(self, fdr_max=0.2)
Method for calculating FDR vs number of Target Proteins.
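Parameters:
Name | Type | Description | Default |
---|---|---|---|
fdr_max | float | The maximum false discovery rate to calculate target hits for. Will stop once fdr_max is reached. | 0.2 |
Returns:
Type | Description |
---|---|
list | List of lists of: (FDR, Number of Target Hits). Ordered by increasing number of Target Hits. |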
Source code in pyproteininference/datastore.py
def generate_fdr_vs_target_hits(self, fdr_max=0.2):
"""
Method for calculating FDR vs number of Target Proteins.
Args:
fdr_max (float): The maximum false discovery rate to calculate target hits for.
Will stop once fdr_max is reached.
Returns:
list: List of lists of: (FDR, Number of Target Hits). Ordered by increasing number of Target Hits.
"""
fdr_vs_count = []
count_list = []
for pg in self.protein_group_objects:
if self.decoy_symbol not in pg.proteins[0].identifier:
count_list.append(pg)
fdr_vs_count.append([pg.q_value, len(count_list)])
fdr_vs_count = [x for x in fdr_vs_count if x[0] < fdr_max]
return fdr_vs_count
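The counting logic can be tried out without a full DataStore. The following is a minimal sketch under stated assumptions: `ToyGroup` is a hypothetical stand-in for a ProteinGroup that exposes only its lead identifier and q-value, the decoy symbol `"##"` is assumed for illustration, and the input is assumed to be sorted from best to worst.

```python
from collections import namedtuple

# Hypothetical stand-in for a ProteinGroup: lead protein identifier plus group q-value
ToyGroup = namedtuple("ToyGroup", ["lead_identifier", "q_value"])


def fdr_vs_target_hits(groups, decoy_symbol="##", fdr_max=0.2):
    """Return [q_value, cumulative target count] pairs with q_value below fdr_max."""
    pairs = []
    target_count = 0
    for group in groups:
        # Only target groups increase the running count
        if decoy_symbol not in group.lead_identifier:
            target_count += 1
            pairs.append([group.q_value, target_count])
    # Keep only the points below the requested FDR ceiling
    return [pair for pair in pairs if pair[0] < fdr_max]


toy_groups = [
    ToyGroup("PROT_A", 0.0),
    ToyGroup("PROT_B", 0.0),
    ToyGroup("##PROT_C", 0.1),  # decoy, not counted
    ToyGroup("PROT_D", 0.1),
    ToyGroup("PROT_E", 0.25),  # above fdr_max, filtered out
]
curve = fdr_vs_target_hits(toy_groups, fdr_max=0.2)
```

Here `curve` holds three points: two targets at q-value 0.0 and a third at 0.1.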
get_pep_values(self)
Method to retrieve a list of all posterior error probabilities for all PSMs.
Returns:
Type | Description |
---|---|
list | list of floats (pep values). |
Examples:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> pep = data.get_pep_values()
Source code in pyproteininference/datastore.py
def get_pep_values(self):
"""
Method to retrieve a list of all posterior error probabilities for all PSMs.
Returns:
list: list of floats (pep values).
Example:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> pep = data.get_pep_values()
"""
psm_data = self.get_psm_data()
pep_values = [x.pepvalue for x in psm_data]
return pep_values
get_protein_data(self)
Method to retrieve a list of Protein objects. Retrieves picked and scored data if the data has been picked and scored or just the scored data if the data has not been picked.
Returns:
Type | Description |
---|---|
list | list of Protein objects. |
Examples:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> # Data must be run through a pyproteininference.scoring.Score method
>>> protein_data = data.get_protein_data()
Source code in pyproteininference/datastore.py
def get_protein_data(self):
"""
Method to retrieve a list of [Protein][pyproteininference.physical.Protein] objects.
Retrieves picked and scored data if the data has been picked and scored or just the scored data if the data has
not been picked.
Returns:
list: list of [Protein][pyproteininference.physical.Protein] objects.
Example:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> # Data must be run through a pyproteininference.scoring.Score method
>>> protein_data = data.get_protein_data()
"""
if self.picked_proteins_scored:
scored_proteins = self.picked_proteins_scored
else:
scored_proteins = self.scored_proteins
return scored_proteins
get_protein_identifiers(self, data_form)
Method to retrieve the protein string identifiers.
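Parameters:
Name | Type | Description | Default |
---|---|---|---|
data_form | str | Can be one of the following: "main", "restricted", "picked", "picked_removed". | required |
Returns:
Type | Description |
---|---|
list | list of protein identifier strings. |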
Examples:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> protein_strings = data.get_protein_identifiers(data_form="main")
Source code in pyproteininference/datastore.py
def get_protein_identifiers(self, data_form):
"""
Method to retrieve the protein string identifiers.
Args:
data_form (str): Can be one of the following: "main", "restricted", "picked", "picked_removed".
Returns:
list: list of protein identifier strings.
Example:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> protein_strings = data.get_protein_identifiers(data_form="main")
"""
if data_form == "main":
# All the data (unrestricted)
data_to_select = self.main_data_form
prots = [[x.possible_proteins] for x in data_to_select]
proteins = prots
if data_form == "restricted":
# Proteins that pass certain restriction criteria (peptide length, pep, qvalue)
data_to_select = self.main_data_restricted
prots = [[x.possible_proteins] for x in data_to_select]
proteins = prots
if data_form == "picked":
# Here we look at proteins that are 'picked' (aka the proteins that beat out their matching target/decoy)
data_to_select = self.picked_proteins_scored
prots = [x.identifier for x in data_to_select]
proteins = prots
if data_form == "picked_removed":
# Here we look at the proteins that were removed due to picking (aka the proteins that
# have a worse score than their target/decoy counterpart)
data_to_select = self.picked_proteins_removed
prots = [x.identifier for x in data_to_select]
proteins = prots
return proteins
get_protein_identifiers_from_psm_data(self)
Method to retrieve a list of lists of all possible protein identifiers from the psm data.
Returns:
Type | Description |
---|---|
list | list of lists of protein strings. |
Examples:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> protein_strings = data.get_protein_identifiers_from_psm_data()
Source code in pyproteininference/datastore.py
def get_protein_identifiers_from_psm_data(self):
"""
Method to retrieve a list of lists of all possible protein identifiers from the psm data.
Returns:
list: list of lists of protein strings.
Example:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> protein_strings = data.get_protein_identifiers_from_psm_data()
"""
psm_data = self.get_psm_data()
proteins = [x.possible_proteins for x in psm_data]
return proteins
get_protein_information(self, protein_string)
Method to retrieve attributes for a specific scored protein.
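Parameters:
Name | Type | Description | Default |
---|---|---|---|
protein_string | str | Protein Identifier String. | required |
Returns:
Type | Description |
---|---|
list | list of protein attributes. |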
Examples:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> protein_attr = data.get_protein_information(protein_string="RAF1_HUMAN|P04049")
Source code in pyproteininference/datastore.py
def get_protein_information(self, protein_string):
"""
Method to retrieve attributes for a specific scored protein.
Args:
protein_string (str): Protein Identifier String.
Returns:
list: list of protein attributes.
Example:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> protein_attr = data.get_protein_information(protein_string="RAF1_HUMAN|P04049")
"""
all_scored_protein_data = self.scored_proteins
identifiers = [x.identifier for x in all_scored_protein_data]
protein_scores = [x.score for x in all_scored_protein_data]
groups = [x.group_identification for x in all_scored_protein_data]
reviewed = [x.reviewed for x in all_scored_protein_data]
peptides = [x.peptides for x in all_scored_protein_data]
# Peptide scores currently broken...
peptide_scores = [x.peptide_scores for x in all_scored_protein_data]
picked = [x.picked for x in all_scored_protein_data]
num_peptides = [x.num_peptides for x in all_scored_protein_data]
main_index = identifiers.index(protein_string)
list_structure = [
[
"identifier",
"protein_score",
"groups",
"reviewed",
"peptides",
"peptide_scores",
"picked",
"num_peptides",
]
]
list_structure.append([protein_string])
list_structure[-1].append(protein_scores[main_index])
list_structure[-1].append(groups[main_index])
list_structure[-1].append(reviewed[main_index])
list_structure[-1].append(peptides[main_index])
list_structure[-1].append(peptide_scores[main_index])
list_structure[-1].append(picked[main_index])
list_structure[-1].append(num_peptides[main_index])
return list_structure
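The returned value is a two-row list: a header row followed by one row of values. A minimal sketch of turning that into a dictionary is shown below. The values in the second row are made up for illustration and do not come from a real analysis.

```python
# Hypothetical return value shaped like the output of get_protein_information:
# a header row followed by one row of values (all values here are invented).
list_structure = [
    [
        "identifier",
        "protein_score",
        "groups",
        "reviewed",
        "peptides",
        "peptide_scores",
        "picked",
        "num_peptides",
    ],
    ["RAF1_HUMAN|P04049", 42.0, {1}, True, {"PEPTIDEK", "ANOTHERPEPK"}, [0.001, 0.002], True, 2],
]

# Pair each header name with the value in the same position
header, row = list_structure
record = dict(zip(header, row))
```

After this, fields can be read by name, for example `record["protein_score"]`.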
get_protein_information_dictionary(self)
Method to retrieve a dictionary of PSM scores for each protein.
Returns:
Type | Description |
---|---|
dict | dictionary of scores for each protein. |
Examples:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> protein_dict = data.get_protein_information_dictionary()
Source code in pyproteininference/datastore.py
def get_protein_information_dictionary(self):
"""
Method to retrieve a dictionary of PSM scores for each protein.
Returns:
dict: dictionary of scores for each protein.
Example:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> protein_dict = data.get_protein_information_dictionary()
"""
psm_data = self.get_psm_data()
protein_psm_score_dictionary = collections.defaultdict(list)
# Loop through all Psms
for psms in psm_data:
# Loop through all proteins
for prots in psms.possible_proteins:
protein_psm_score_dictionary[prots].append(
{
"peptide": psms.identifier,
"Qvalue": psms.qvalue,
"PosteriorErrorProbability": psms.pepvalue,
"Percscore": psms.percscore,
}
)
return protein_psm_score_dictionary
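Each protein key maps to a list of per-PSM dictionaries, so simple summaries can be computed with ordinary comprehensions. The sketch below builds a small dictionary of the same shape by hand. The protein `TOY_PROTEIN|X00000`, the peptide strings, and all score values are hypothetical.

```python
import collections

# Hand-built dictionary with the same shape as the method's return value
protein_dict = collections.defaultdict(list)
protein_dict["RAF1_HUMAN|P04049"].append(
    {"peptide": "K.PEPTIDEK.A", "Qvalue": 0.01, "PosteriorErrorProbability": 0.02, "Percscore": 1.2}
)
protein_dict["RAF1_HUMAN|P04049"].append(
    {"peptide": "R.ANOTHERPEPK.S", "Qvalue": 0.001, "PosteriorErrorProbability": 0.003, "Percscore": 2.5}
)
protein_dict["TOY_PROTEIN|X00000"].append(
    {"peptide": "K.TOYPEPTIDER.L", "Qvalue": 0.02, "PosteriorErrorProbability": 0.05, "Percscore": 0.7}
)

# Best (lowest) q-value and number of PSM entries per protein
best_q = {protein: min(entry["Qvalue"] for entry in entries) for protein, entries in protein_dict.items()}
psm_counts = {protein: len(entries) for protein, entries in protein_dict.items()}
```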
get_protein_objects(self, false_discovery_rate=None, fdr_restricted=False)
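Method retrieves protein objects. Either retrieves FDR restricted list of protein objects, or retrieves all objects.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
false_discovery_rate | float | FDR threshold applied when fdr_restricted is True. Falls back to the fdr value of the parameter_file_object when not provided. | None |
fdr_restricted | bool | True/False on whether to restrict the list of objects based on FDR. | False |
Returns:
Type | Description |
---|---|
list | List of scored ProteinGroup objects that have been grouped and sorted. |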
Source code in pyproteininference/datastore.py
def get_protein_objects(self, false_discovery_rate=None, fdr_restricted=False):
"""
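Method retrieves protein objects. Either retrieves FDR restricted list of protein objects,
or retrieves all objects.
Args:
false_discovery_rate (float): FDR threshold applied when fdr_restricted is True. Falls back to the
fdr value of the parameter_file_object when not provided.
fdr_restricted (bool): True/False on whether to restrict the list of objects based on FDR.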
Returns:
list: List of scored [ProteinGroup][pyproteininference.physical.ProteinGroup] objects
that have been grouped and sorted.
"""
if not false_discovery_rate:
false_discovery_rate = self.parameter_file_object.fdr
if fdr_restricted:
protein_objects = [x.proteins for x in self.protein_group_objects if x.q_value <= false_discovery_rate]
else:
protein_objects = self.grouped_scored_proteins
return protein_objects
get_psm_data(self)
Method to retrieve a list of Psm objects. Retrieves restricted data if the data has been restricted or all of the data if the data has not been restricted.
Returns:
Type | Description |
---|---|
list | list of Psm objects. |
Examples:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> psm_data = data.get_psm_data()
Source code in pyproteininference/datastore.py
def get_psm_data(self):
"""
Method to retrieve a list of [Psm][pyproteininference.physical.Psm] objects.
Retrieves restricted data if the data has been restricted or all of the data if the data has
not been restricted.
Returns:
list: list of [Psm][pyproteininference.physical.Psm] objects.
Example:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> psm_data = data.get_psm_data()
"""
if not self.main_data_restricted and not self.main_data_form:
raise ValueError(
"Both main_data_restricted and main_data_form variables are empty. Please re-load the DataStore "
"object with a properly loaded Reader object."
)
if self.main_data_restricted:
psm_data = self.main_data_restricted
else:
psm_data = self.main_data_form
return psm_data
get_q_values(self)
Method to retrieve a list of all q values for all PSMs.
Returns:
Type | Description |
---|---|
list | list of floats (q values). |
Examples:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> q = data.get_q_values()
Source code in pyproteininference/datastore.py
def get_q_values(self):
"""
Method to retrieve a list of all q values for all PSMs.
Returns:
list: list of floats (q values).
Example:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> q = data.get_q_values()
"""
psm_data = self.get_psm_data()
q_values = [x.qvalue for x in psm_data]
return q_values
get_sorted_identifiers(self, scored=True)
Retrieves a sorted list of protein strings present in the analysis.
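Parameters:
Name | Type | Description | Default |
---|---|---|---|
scored | bool | True/False to indicate if we should return scored or non-scored identifiers. | True |
Returns:
Type | Description |
---|---|
list | List of sorted protein identifier strings. |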
Examples:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> sorted_proteins = data.get_sorted_identifiers(scored=True)
Source code in pyproteininference/datastore.py
def get_sorted_identifiers(self, scored=True):
"""
Retrieves a sorted list of protein strings present in the analysis.
Args:
scored (bool): True/False to indicate if we should return scored or non-scored identifiers.
Returns:
list: List of sorted protein identifier strings.
Example:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> sorted_proteins = data.get_sorted_identifiers(scored=True)
"""
if scored:
self._validate_scored_proteins()
if self.picked_proteins_scored:
proteins = set([x.identifier for x in self.picked_proteins_scored])
else:
proteins = set([x.identifier for x in self.scored_proteins])
else:
self._validate_scoring_input()
proteins = [x.identifier for x in self.scoring_input]
all_sp_proteins = set(self.digest.swiss_prot_protein_set)
our_target_sp_proteins = sorted([x for x in proteins if x in all_sp_proteins and self.decoy_symbol not in x])
our_decoy_sp_proteins = sorted([x for x in proteins if x in all_sp_proteins and self.decoy_symbol in x])
our_target_tr_proteins = sorted(
[x for x in proteins if x not in all_sp_proteins and self.decoy_symbol not in x]
)
our_decoy_tr_proteins = sorted([x for x in proteins if x not in all_sp_proteins and self.decoy_symbol in x])
our_proteins_sorted = (
our_target_sp_proteins + our_decoy_sp_proteins + our_target_tr_proteins + our_decoy_tr_proteins
)
return our_proteins_sorted
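The resulting order is: reviewed targets, reviewed decoys, unreviewed targets, unreviewed decoys, each block sorted alphabetically. The same order can be produced with a single tuple sort key, as in the sketch below. The function `sort_identifiers`, the identifiers, and the decoy symbol `"##"` are assumptions made for illustration.

```python
def sort_identifiers(proteins, reviewed, decoy_symbol="##"):
    """Sort identifiers: reviewed before unreviewed, targets before decoys, then by name."""

    def rank(identifier):
        is_unreviewed = identifier not in reviewed
        is_decoy = decoy_symbol in identifier
        # False sorts before True, so reviewed targets come first
        return (is_unreviewed, is_decoy, identifier)

    return sorted(set(proteins), key=rank)


toy_proteins = ["##B_TR", "A_SP", "C_TR", "##A_SP", "B_SP"]
toy_reviewed = {"A_SP", "B_SP", "##A_SP"}  # stands in for the reviewed (Swiss-Prot) protein set
ordered = sort_identifiers(toy_proteins, toy_reviewed)
```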
higher_or_lower(self)
Method to determine if a higher or lower score is better for a given combination of score input and score type.
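This method sets the high_low_better attribute for the DataStore object.
This method depends on the output from the Score class to be sorted properly from best to worst score.
Returns:
Type | Description |
---|---|
str | String indicating "higher" or "lower" depending on if a higher or lower score is a better protein score. |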
Examples:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> high_low = data.higher_or_lower()
Source code in pyproteininference/datastore.py
def higher_or_lower(self):
"""
Method to determine if a higher or lower score is better for a given combination of score input and score type.
This method sets the `high_low_better` Attribute for the DataStore object.
This method depends on the output from the Score class to be sorted properly from best to worst score.
Returns:
str: String indicating "higher" or "lower" depending on if a higher or lower score is a
better protein score.
Example:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> high_low = data.higher_or_lower()
"""
if not self.high_low_better:
logger.info("Determining If a higher or lower score is better based on scored proteins")
worst_score = self.scored_proteins[-1].score
best_score = self.scored_proteins[0].score
if float(best_score) > float(worst_score):
higher_or_lower = self.HIGHER_PSM_SCORE
if float(best_score) < float(worst_score):
higher_or_lower = self.LOWER_PSM_SCORE
logger.info("best score = {}".format(best_score))
logger.info("worst score = {}".format(worst_score))
if best_score == worst_score:
raise ValueError(
"Best and Worst scores were identical, equal to {}. Score type {} produced the error, "
"please change psm_score type.".format(best_score, self.psm_score)
)
self.high_low_better = higher_or_lower
else:
higher_or_lower = self.high_low_better
return higher_or_lower
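The decision rests on comparing the first and last entries of a list that is already sorted from best to worst. A minimal standalone sketch of that comparison follows. The function name `score_direction` is hypothetical, and the plain strings "higher" and "lower" are used in place of the class constants.

```python
def score_direction(sorted_scores):
    """Infer whether higher or lower scores are better from a best-to-worst sorted list."""
    best = float(sorted_scores[0])
    worst = float(sorted_scores[-1])
    if best == worst:
        # No direction can be inferred when every score is the same
        raise ValueError("Best and worst scores are identical: {}".format(best))
    return "higher" if best > worst else "lower"


additive_direction = score_direction([50.2, 10.0, 1.5])
multiplicative_direction = score_direction([1e-8, 1e-3, 0.5])
```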
input_has_custom(self)
Method that checks to see if the input data has custom score values.
Source code in pyproteininference/datastore.py
def input_has_custom(self):
"""
Method that checks to see if the input data has custom score values.
"""
len_c = len([x.custom_score for x in self.main_data_form if x.custom_score])
len_all = len(self.main_data_form)
if len_c == len_all:
status = True
logger.info("Input has Custom value; Can restrict by Custom value")
else:
status = False
logger.warning("Input does not have Custom value; Cannot restrict by Custom value")
return status
input_has_pep(self)
Method that checks to see if the input data has pep values.
Source code in pyproteininference/datastore.py
def input_has_pep(self):
"""
Method that checks to see if the input data has pep values.
"""
len_pep = len([x.pepvalue for x in self.main_data_form if x.pepvalue])
len_all = len(self.main_data_form)
if len_pep == len_all:
status = True
logger.info("Input has Pep value; Can restrict by Pep value")
else:
status = False
logger.warning("Input does not have Pep value; Cannot restrict by Pep value")
return status
input_has_q(self)
Method that checks to see if the input data has q values.
Source code in pyproteininference/datastore.py
def input_has_q(self):
"""
Method that checks to see if the input data has q values.
"""
len_q = len([x.qvalue for x in self.main_data_form if x.qvalue])
len_all = len(self.main_data_form)
if len_q == len_all:
status = True
logger.info("Input has Q value; Can restrict by Q value")
else:
status = False
logger.warning("Input does not have Q value; Cannot restrict by Q value")
return status
peptide_to_protein_dictionary(self)
Method that returns a map of peptide strings to sets of protein strings and is essentially half of a
BiPartite graph.
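This method sets the peptide_protein_dictionary attribute for the DataStore object.
Returns:
Type | Description |
---|---|
collections.defaultdict | Dictionary of peptide strings (keys) that map to sets of protein strings based on the peptides and proteins found in the search. Peptide -> set(Proteins). |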
Examples:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> peptide_protein_dict = data.peptide_to_protein_dictionary()
Source code in pyproteininference/datastore.py
def peptide_to_protein_dictionary(self):
"""
Method that returns a map of peptide strings to sets of protein strings and is essentially half of a
BiPartite graph.
This method sets the `peptide_protein_dictionary` Attribute for the DataStore object.
Returns:
collections.defaultdict: Dictionary of peptide strings (keys) that map to sets of protein strings based
on the peptides and proteins found in the search. Peptide -> set(Proteins).
Example:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> peptide_protein_dict = data.peptide_to_protein_dictionary()
"""
psm_data = self.get_psm_data()
res_pep_set = set(self.restricted_peptides)
default_dict_peptides = collections.defaultdict(set)
for peptide_objects in psm_data:
for prots in peptide_objects.possible_proteins:
cur_peptide = peptide_objects.non_flanking_peptide
if cur_peptide in res_pep_set:
default_dict_peptides[cur_peptide].add(prots)
else:
pass
self.peptide_protein_dictionary = default_dict_peptides
return default_dict_peptides
protein_picker(self)
Method to run the protein picker algorithm.
Proteins must be scored first with score_psms.
The algorithm will match target and decoy proteins identified from the PSMs from the search. If a target and matching decoy is found then target/decoy competition is performed. In the Target/Decoy pair the protein with the better score is kept and the one with the worse score is discarded from the analysis.
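The method sets the picked_proteins_scored and picked_proteins_removed variables for the DataStore object.
Returns:
Type | Description |
---|---|
None | Nothing is returned; the method updates the DataStore object in place. |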
Examples:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> data.protein_picker()
Source code in pyproteininference/datastore.py
def protein_picker(self):
"""
Method to run the protein picker algorithm.
Proteins must be scored first with [score_psms][pyproteininference.scoring.Score.score_psms].
The algorithm will match target and decoy proteins identified from the PSMs from the search.
If a target and matching decoy is found then target/decoy competition is performed.
In the Target/Decoy pair the protein with the better score is kept and the one with the worse score is
discarded from the analysis.
The method sets the `picked_proteins_scored` and `picked_proteins_removed` variables for
the DataStore object.
Returns:
None:
Example:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> data.protein_picker()
"""
self._validate_scored_proteins()
logger.info("Running Protein Picker")
# Use higher or lower class to determine if a higher protein score or lower protein score is better
# based on the scoring method used
higher_or_lower = self.higher_or_lower()
# Here we determine if a lower or higher score is better
# Since all input is ordered from best to worst we can do the following
index_to_remove = []
# data.scored_proteins is simply a list of Protein objects...
# Create list of all decoy proteins
decoy_proteins = [x.identifier for x in self.scored_proteins if self.decoy_symbol in x.identifier]
# Create a list of all potential matching targets (some of these may not exist in the search)
matching_targets = [x.replace(self.decoy_symbol, "") for x in decoy_proteins]
# Create a list of all the proteins from the scored data
all_proteins = [x.identifier for x in self.scored_proteins]
logger.info("{} proteins scored".format(len(all_proteins)))
total_targets = []
total_decoys = []
decoys_removed = []
targets_removed = []
# Loop over all decoys identified in the search
logger.info("Picking Proteins...")
for i in range(len(decoy_proteins)):
cur_decoy_index = all_proteins.index(decoy_proteins[i])
cur_decoy_protein_object = self.scored_proteins[cur_decoy_index]
total_decoys.append(cur_decoy_protein_object.identifier)
# Try, Except here because the matching target to the decoy may not be a result from the search
try:
cur_target_index = all_proteins.index(matching_targets[i])
cur_target_protein_object = self.scored_proteins[cur_target_index]
total_targets.append(cur_target_protein_object.identifier)
if higher_or_lower == self.HIGHER_PSM_SCORE:
if cur_target_protein_object.score > cur_decoy_protein_object.score:
index_to_remove.append(cur_decoy_index)
decoys_removed.append(cur_decoy_index)
cur_target_protein_object.picked = True
cur_decoy_protein_object.picked = False
else:
index_to_remove.append(cur_target_index)
targets_removed.append(cur_target_index)
cur_decoy_protein_object.picked = True
cur_target_protein_object.picked = False
if higher_or_lower == self.LOWER_PSM_SCORE:
if cur_target_protein_object.score < cur_decoy_protein_object.score:
index_to_remove.append(cur_decoy_index)
decoys_removed.append(cur_decoy_index)
cur_target_protein_object.picked = True
cur_decoy_protein_object.picked = False
else:
index_to_remove.append(cur_target_index)
targets_removed.append(cur_target_index)
cur_decoy_protein_object.picked = True
cur_target_protein_object.picked = False
except ValueError:
pass
logger.info("{} total decoy proteins".format(len(total_decoys)))
logger.info("{} matching target proteins also found in search".format(len(total_targets)))
logger.info("{} decoy proteins to be removed".format(len(decoys_removed)))
logger.info("{} target proteins to be removed".format(len(targets_removed)))
logger.info("Removing Lower Scoring Proteins...")
picked_list = []
removed_proteins = []
for protein_objects in self.scored_proteins:
if protein_objects.picked:
picked_list.append(protein_objects)
else:
removed_proteins.append(protein_objects)
self.picked_proteins_scored = picked_list
self.picked_proteins_removed = removed_proteins
logger.info("Finished Removing Proteins")
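Target/decoy competition can be shown on a plain score dictionary. The following is a minimal sketch, not the library code: `pick_proteins` is a hypothetical helper, the decoy symbol `"##"` and all scores are assumed, and ties go to the decoy as in the `else` branches above.

```python
def pick_proteins(scores, decoy_symbol="##", higher_is_better=True):
    """Toy protein picker on a {identifier: score} mapping. Returns (picked, removed) sets."""
    removed = set()
    for identifier in scores:
        if decoy_symbol not in identifier:
            continue
        # The matching target is the decoy identifier with the decoy symbol stripped
        target = identifier.replace(decoy_symbol, "")
        if target not in scores:
            continue  # the matching target was not identified, so no competition happens
        if higher_is_better:
            target_wins = scores[target] > scores[identifier]
        else:
            target_wins = scores[target] < scores[identifier]
        # The loser of the pair is removed
        removed.add(identifier if target_wins else target)
    picked = set(scores) - removed
    return picked, removed


toy_scores = {"A": 10, "##A": 4, "B": 2, "##B": 7, "C": 5, "##D": 1}
picked, removed = pick_proteins(toy_scores, higher_is_better=True)
```

Proteins without a counterpart (C and ##D here) take part in no competition and stay in the picked set.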
protein_to_peptide_dictionary(self)
Method that returns a map of protein strings to sets of peptide strings and is essentially half
of a BiPartite graph.
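This method sets the protein_peptide_dictionary attribute for the DataStore object.
Returns:
Type | Description |
---|---|
collections.defaultdict | Dictionary of protein strings (keys) that map to sets of peptide strings based on the peptides and proteins found in the search. Protein -> set(Peptides). |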
Examples:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> protein_peptide_dict = data.protein_to_peptide_dictionary()
Source code in pyproteininference/datastore.py
def protein_to_peptide_dictionary(self):
"""
Method that returns a map of protein strings to sets of peptide strings and is essentially half
of a BiPartite graph.
This method sets the `protein_peptide_dictionary` Attribute for the DataStore object.
Returns:
collections.defaultdict: Dictionary of protein strings (keys) that map to sets of peptide strings based
on the peptides and proteins found in the search. Protein -> set(Peptides).
Example:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> protein_peptide_dict = data.protein_to_peptide_dictionary()
"""
psm_data = self.get_psm_data()
res_pep_set = set(self.restricted_peptides)
default_dict_proteins = collections.defaultdict(set)
for peptide_objects in psm_data:
for prots in peptide_objects.possible_proteins:
cur_peptide = peptide_objects.non_flanking_peptide
if cur_peptide in res_pep_set:
default_dict_proteins[prots].add(cur_peptide)
self.protein_peptide_dictionary = default_dict_proteins
return default_dict_proteins
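This method and peptide_to_protein_dictionary build the two halves of the same bipartite graph, so both maps can be filled in one pass over the PSMs. The sketch below uses toy data. The helper `build_maps` is hypothetical, and each PSM is reduced to a `(peptide, possible_proteins)` tuple.

```python
import collections


def build_maps(psms, restricted_peptides):
    """Build Protein -> set(Peptides) and Peptide -> set(Proteins) from (peptide, proteins) pairs."""
    allowed = set(restricted_peptides)
    protein_to_peptides = collections.defaultdict(set)
    peptide_to_proteins = collections.defaultdict(set)
    for peptide, possible_proteins in psms:
        if peptide not in allowed:
            continue  # peptide did not pass restriction
        for protein in possible_proteins:
            protein_to_peptides[protein].add(peptide)
            peptide_to_proteins[peptide].add(protein)
    return protein_to_peptides, peptide_to_proteins


toy_psms = [("PEPK", ["P1", "P2"]), ("TIDER", ["P1"]), ("LOWQ", ["P3"])]
prot_map, pep_map = build_maps(toy_psms, restricted_peptides=["PEPK", "TIDER"])
```

Because "LOWQ" is not among the restricted peptides, protein P3 does not appear in either map.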
restrict_psm_data(self, remove1pep=True)
Method to restrict the input of Psm objects. This method is central to the pyproteininference module and is able to restrict the Psm data by: Q value, Pep Value, Percolator Score, Peptide Length, and Custom Score Input. Restriction values are pulled from the ProteinInferenceParameter object.
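This method sets the main_data_restricted and restricted_peptides attributes for the DataStore object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
remove1pep | bool | True/False on whether or not to remove PEP values that equal 1 even if other restrictions are set to not restrict. | True |
Returns:
Type | Description |
---|---|
None | Nothing is returned; the method updates the DataStore object in place. |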
Examples:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> data.restrict_psm_data(remove1pep=True)
Source code in pyproteininference/datastore.py
def restrict_psm_data(self, remove1pep=True):
"""
Method to restrict the input of [Psm][pyproteininference.physical.Psm] objects.
This method is central to the pyproteininference module and is able to restrict the Psm data by:
Q value, Pep Value, Percolator Score, Peptide Length, and Custom Score Input.
Restriction values are pulled from
the [ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter]
object.
This method sets the `main_data_restricted` and `restricted_peptides` Attributes for the DataStore object.
Args:
remove1pep (bool): True/False on whether or not to remove PEP values that equal 1 even if other restrictions
are set to not restrict.
Returns:
None:
Example:
>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> data.restrict_psm_data(remove1pep=True)
"""
# Validate that we have the main data variable
self._validate_main_data_form()
logger.info("Restricting PSM data")
peptide_length = self.parameter_file_object.restrict_peptide_length
posterior_error_prob_threshold = self.parameter_file_object.restrict_pep
q_value_threshold = self.parameter_file_object.restrict_q
custom_threshold = self.parameter_file_object.restrict_custom
main_psm_data = self.main_data_form
logger.info("Length of main data: {}".format(len(self.main_data_form)))
# When remove1pep is True and a PEP threshold is set, discard every PSM that has a PEP of exactly 1
if remove1pep and posterior_error_prob_threshold:
main_psm_data = [x for x in main_psm_data if x.pepvalue != 1]
# Restrict peptide length and posterior error probability
if peptide_length and posterior_error_prob_threshold and not q_value_threshold:
restricted_data = []
for psms in main_psm_data:
if len(psms.stripped_peptide) >= peptide_length and psms.pepvalue < float(
posterior_error_prob_threshold
):
restricted_data.append(psms)
# Restrict peptide length only
if peptide_length and not posterior_error_prob_threshold and not q_value_threshold:
restricted_data = []
for psms in main_psm_data:
if len(psms.stripped_peptide) >= peptide_length:
restricted_data.append(psms)
# Restrict peptide length, posterior error probability, and qvalue
if peptide_length and posterior_error_prob_threshold and q_value_threshold:
restricted_data = []
for psms in main_psm_data:
if (
len(psms.stripped_peptide) >= peptide_length
and psms.pepvalue < float(posterior_error_prob_threshold)
and psms.qvalue < float(q_value_threshold)
):
restricted_data.append(psms)
# Restrict peptide length and qvalue
if peptide_length and not posterior_error_prob_threshold and q_value_threshold:
restricted_data = []
for psms in main_psm_data:
if len(psms.stripped_peptide) >= peptide_length and psms.qvalue < float(q_value_threshold):
restricted_data.append(psms)
# Restrict posterior error probability and q value
if not peptide_length and posterior_error_prob_threshold and q_value_threshold:
restricted_data = []
for psms in main_psm_data:
if psms.pepvalue < float(posterior_error_prob_threshold) and psms.qvalue < float(q_value_threshold):
restricted_data.append(psms)
# Restrict qvalue only
if not peptide_length and not posterior_error_prob_threshold and q_value_threshold:
restricted_data = []
for psms in main_psm_data:
if psms.qvalue < float(q_value_threshold):
restricted_data.append(psms)
# Restrict posterior error probability only
if not peptide_length and posterior_error_prob_threshold and not q_value_threshold:
restricted_data = []
for psms in main_psm_data:
if psms.pepvalue < float(posterior_error_prob_threshold):
restricted_data.append(psms)
# Restrict nothing: no peptide length, PEP, or q value threshold is set, so all PSMs are kept
if not peptide_length and not posterior_error_prob_threshold and not q_value_threshold:
restricted_data = main_psm_data
if custom_threshold:
custom_restricted = []
if self.parameter_file_object.psm_score_type == Score.MULTIPLICATIVE_SCORE_TYPE:
for psms in restricted_data:
if psms.custom_score <= custom_threshold:
custom_restricted.append(psms)
if self.parameter_file_object.psm_score_type == Score.ADDITIVE_SCORE_TYPE:
for psms in restricted_data:
if psms.custom_score >= custom_threshold:
custom_restricted.append(psms)
restricted_data = custom_restricted
self.main_data_restricted = restricted_data
logger.info("Length of restricted data: {}".format(len(restricted_data)))
self.restricted_peptides = [x.non_flanking_peptide for x in restricted_data]
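The eight threshold combinations above all apply the same rule: a PSM is kept when it passes every threshold that is set. A compact way to express that rule is a single predicate, sketched below under stated assumptions. `ToyPsm` and `restrict` are hypothetical names, the custom score step is left out, and this is an illustration rather than a replacement for the method.

```python
from collections import namedtuple

# Hypothetical stand-in for a Psm with only the fields used for restriction
ToyPsm = namedtuple("ToyPsm", ["stripped_peptide", "pepvalue", "qvalue"])


def restrict(psms, peptide_length=None, pep_threshold=None, q_threshold=None, remove1pep=True):
    """Keep PSMs that pass every threshold that is set; unset thresholds are ignored."""
    if remove1pep and pep_threshold:
        psms = [p for p in psms if p.pepvalue != 1]

    def passes(psm):
        if peptide_length and len(psm.stripped_peptide) < peptide_length:
            return False
        if pep_threshold and not psm.pepvalue < float(pep_threshold):
            return False
        if q_threshold and not psm.qvalue < float(q_threshold):
            return False
        return True

    return [p for p in psms if passes(p)]


toy_psms = [
    ToyPsm("PEPTIDEK", 0.01, 0.001),
    ToyPsm("SHORT", 0.01, 0.001),  # fails a length threshold of 7
    ToyPsm("LONGPEPTIDER", 0.5, 0.001),  # fails a PEP threshold of 0.05
    ToyPsm("ANOTHERPEPK", 0.01, 0.2),  # fails a q value threshold of 0.01
]
strict = restrict(toy_psms, peptide_length=7, pep_threshold=0.05, q_threshold=0.01)
length_only = restrict(toy_psms, peptide_length=7)
unrestricted = restrict(toy_psms)
```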
sort_protein_group_objects(protein_group_objects, higher_or_lower)
classmethod
Class Method to sort a list of ProteinGroup objects by score and number of peptides.
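Parameters:
Name | Type | Description | Default |
---|---|---|---|
protein_group_objects | list | list of ProteinGroup objects. | required |
higher_or_lower | str | String to indicate if a "higher" or "lower" protein score is "better". | required |
Returns:
Type | Description |
---|---|
list | list of sorted ProteinGroup objects. |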
Examples:
>>> list_of_group_objects = pyproteininference.datastore.DataStore.sort_protein_group_objects(
...     protein_group_objects=list_of_group_objects, higher_or_lower="higher"
... )
Source code in pyproteininference/datastore.py
@classmethod
def sort_protein_group_objects(cls, protein_group_objects, higher_or_lower):
    """
    Class Method to sort a list of [ProteinGroup][pyproteininference.physical.ProteinGroup] objects by
    score and number of peptides.

    Args:
        protein_group_objects (list): list of [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.
        higher_or_lower (str): String to indicate if a "higher" or "lower" protein score is "better".

    Returns:
        list: list of sorted [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.

    Example:
        >>> list_of_group_objects = pyproteininference.datastore.DataStore.sort_protein_group_objects(
        ...     protein_group_objects=list_of_group_objects, higher_or_lower="higher"
        ... )
    """
    if higher_or_lower == cls.LOWER_PSM_SCORE:
        # Ascending score; num_peptides is negated so more peptides wins a tie
        protein_group_objects = sorted(
            protein_group_objects,
            key=lambda k: (
                k.proteins[0].score,
                -k.proteins[0].num_peptides,
            ),
            reverse=False,
        )
    elif higher_or_lower == cls.HIGHER_PSM_SCORE:
        # Descending score, then descending num_peptides
        protein_group_objects = sorted(
            protein_group_objects,
            key=lambda k: (
                k.proteins[0].score,
                k.proteins[0].num_peptides,
            ),
            reverse=True,
        )
    return protein_group_objects
sort_protein_objects(grouped_protein_objects, higher_or_lower)
classmethod
Class Method to sort a list of Protein objects by score and number of peptides.
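Parameters:
Name | Type | Description |
---|---|---|
grouped_protein_objects | list | List of Protein objects. |
higher_or_lower | str | String to indicate if a "higher" or "lower" protein score is "better". |
Returns:
Type | Description |
---|---|
list | List of sorted Protein objects. |
Examples:
>>> scores_grouped = pyproteininference.datastore.DataStore.sort_protein_objects(
...     grouped_protein_objects=scores_grouped, higher_or_lower="higher"
... )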
Source code in pyproteininference/datastore.py
@classmethod
def sort_protein_objects(cls, grouped_protein_objects, higher_or_lower):
    """
    Class Method to sort a list of [Protein][pyproteininference.physical.Protein] objects by score and number of
    peptides.

    Args:
        grouped_protein_objects (list): list of [Protein][pyproteininference.physical.Protein] objects.
        higher_or_lower (str): String to indicate if a "higher" or "lower" protein score is "better".

    Returns:
        list: list of sorted [Protein][pyproteininference.physical.Protein] objects.

    Example:
        >>> scores_grouped = pyproteininference.datastore.DataStore.sort_protein_objects(
        ...     grouped_protein_objects=scores_grouped, higher_or_lower="higher"
        ... )
    """
    # Each element is indexed with [0], so the sort key comes from the first (lead) protein of each element
    if higher_or_lower == cls.LOWER_PSM_SCORE:
        grouped_protein_objects = sorted(
            grouped_protein_objects,
            key=lambda k: (k[0].score, -k[0].num_peptides),
            reverse=False,
        )
    if higher_or_lower == cls.HIGHER_PSM_SCORE:
        grouped_protein_objects = sorted(
            grouped_protein_objects,
            key=lambda k: (k[0].score, k[0].num_peptides),
            reverse=True,
        )
    return grouped_protein_objects
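The two branches differ only in how ties on score are broken: with a "lower is better" score the peptide count is negated so that the lead with more peptides still sorts first. A small sketch of that behaviour, assuming a hypothetical `Protein` namedtuple and the literal strings "higher" and "lower" in place of `cls.HIGHER_PSM_SCORE` and `cls.LOWER_PSM_SCORE`:

```python
from collections import namedtuple

# Hypothetical stand-in for the library's Protein object.
Protein = namedtuple("Protein", ["identifier", "score", "num_peptides"])


def sort_groups_by_lead(groups, higher_or_lower):
    """Sort lists of proteins by the score and peptide count of their first (lead) entry."""
    if higher_or_lower == "lower":
        # Ascending score; negating num_peptides puts the larger peptide count first on ties.
        return sorted(groups, key=lambda g: (g[0].score, -g[0].num_peptides))
    if higher_or_lower == "higher":
        # Descending on both fields, so the larger peptide count also wins ties.
        return sorted(groups, key=lambda g: (g[0].score, g[0].num_peptides), reverse=True)
    return groups


groups = [
    [Protein("P1", 10.0, 2)],
    [Protein("P2", 10.0, 5)],
    [Protein("P3", 20.0, 1)],
]
higher_order = [g[0].identifier for g in sort_groups_by_lead(groups, "higher")]
lower_order = [g[0].identifier for g in sort_groups_by_lead(groups, "lower")]
```

In both directions `P2` is placed ahead of `P1`, since the two tie on score and `P2` has more peptides.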
sort_protein_strings(protein_string_list, sp_proteins, decoy_symbol)
classmethod
Method that sorts protein strings in the following order: Target Reviewed, Decoy Reviewed, Target Unreviewed, Decoy Unreviewed.
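Parameters:
Name | Type | Description |
---|---|---|
protein_string_list | list | List of protein strings. |
sp_proteins | set | Set of reviewed protein strings. |
decoy_symbol | str | Symbol to denote a decoy protein identifier, e.g. "##". |
Returns:
Type | Description |
---|---|
list | List of sorted protein strings. |
Examples:
>>> sorted_protein_strings = pyproteininference.datastore.DataStore.sort_protein_strings(
...     protein_string_list=protein_string_list, sp_proteins=sp_proteins, decoy_symbol="##"
... )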
Source code in pyproteininference/datastore.py
@classmethod
def sort_protein_strings(cls, protein_string_list, sp_proteins, decoy_symbol):
    """
    Method that sorts protein strings in the following order: Target Reviewed, Decoy Reviewed, Target Unreviewed,
    Decoy Unreviewed.

    Args:
        protein_string_list (list): List of Protein Strings.
        sp_proteins (set): Set of Reviewed Protein Strings.
        decoy_symbol (str): Symbol to denote a decoy protein identifier, e.g. "##".

    Returns:
        list: List of sorted protein strings.

    Example:
        >>> sorted_protein_strings = pyproteininference.datastore.DataStore.sort_protein_strings(
        ...     protein_string_list=protein_string_list, sp_proteins=sp_proteins, decoy_symbol="##"
        ... )
    """
    # Each bucket is sorted alphabetically, then the buckets are concatenated in priority order
    our_target_sp_proteins = sorted([x for x in protein_string_list if x in sp_proteins and decoy_symbol not in x])
    our_decoy_sp_proteins = sorted([x for x in protein_string_list if x in sp_proteins and decoy_symbol in x])
    our_target_tr_proteins = sorted(
        [x for x in protein_string_list if x not in sp_proteins and decoy_symbol not in x]
    )
    our_decoy_tr_proteins = sorted([x for x in protein_string_list if x not in sp_proteins and decoy_symbol in x])
    identifiers_sorted = (
        our_target_sp_proteins + our_decoy_sp_proteins + our_target_tr_proteins + our_decoy_tr_proteins
    )
    return identifiers_sorted
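The four-bucket ordering can also be expressed as a single sort with a tuple key. The sketch below is an equivalent formulation for illustration, not the library's implementation; `sort_by_review_and_decoy` and the shortened identifiers are hypothetical:

```python
def sort_by_review_and_decoy(protein_string_list, sp_proteins, decoy_symbol):
    """Order: target reviewed, decoy reviewed, target unreviewed, decoy unreviewed."""

    def rank(identifier):
        reviewed = identifier in sp_proteins
        decoy = decoy_symbol in identifier
        # False sorts before True, so reviewed comes first, then targets before decoys,
        # and the identifier itself gives alphabetical order within a bucket.
        return (not reviewed, decoy, identifier)

    return sorted(protein_string_list, key=rank)


# Made-up identifiers; "##" marks a decoy.
proteins = ["##sp|B", "tr|C", "sp|A", "##tr|D", "sp|B"]
reviewed = {"sp|A", "sp|B", "##sp|B"}
ordered = sort_by_review_and_decoy(proteins, reviewed, "##")
```

As in the method above, a decoy counts as reviewed only if its full string, decoy symbol included, is present in `sp_proteins`.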
sort_protein_sub_groups(protein_list, higher_or_lower)
classmethod
Method to sort protein sub lists.
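Parameters:
Name | Type | Description |
---|---|---|
protein_list | list | List of Protein objects to be sorted. |
higher_or_lower | str | String to indicate if a "higher" or "lower" protein score is "better". |
Returns:
Type | Description |
---|---|
list | List of Protein objects sorted by score and number of peptides. |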
Source code in pyproteininference/datastore.py
@classmethod
def sort_protein_sub_groups(cls, protein_list, higher_or_lower):
    """
    Method to sort protein sub lists.

    Args:
        protein_list (list): List of [Protein][pyproteininference.physical.Protein] objects to be sorted.
        higher_or_lower (str): String to indicate if a "higher" or "lower" protein score is "better".

    Returns:
        list: List of [Protein][pyproteininference.physical.Protein] objects sorted by score and number of
            peptides.
    """
    # Sort the groups based on higher or lower indication, secondarily sort the groups based on number of unique
    # peptides
    # We use the index [1:] as we do not wish to sort the lead protein...
    if higher_or_lower == cls.LOWER_PSM_SCORE:
        protein_list[1:] = sorted(
            protein_list[1:],
            key=lambda k: (float(k.score), -float(k.num_peptides)),
            reverse=False,
        )
    if higher_or_lower == cls.HIGHER_PSM_SCORE:
        protein_list[1:] = sorted(
            protein_list[1:],
            key=lambda k: (float(k.score), float(k.num_peptides)),
            reverse=True,
        )
    return protein_list
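Assigning to the slice `protein_list[1:]` sorts everything after the lead in place while leaving the lead at index 0, even when another protein in the group has a better score. A short sketch under the same assumptions as before (a hypothetical `Protein` namedtuple, and a boolean flag standing in for the `higher_or_lower` string):

```python
from collections import namedtuple

# Hypothetical stand-in for the library's Protein object.
Protein = namedtuple("Protein", ["identifier", "score", "num_peptides"])


def sort_keeping_lead(protein_list, higher_is_better=True):
    """Sort all but the first protein; the list is modified in place and returned."""
    if higher_is_better:
        protein_list[1:] = sorted(
            protein_list[1:],
            key=lambda p: (float(p.score), float(p.num_peptides)),
            reverse=True,
        )
    else:
        protein_list[1:] = sorted(
            protein_list[1:],
            key=lambda p: (float(p.score), -float(p.num_peptides)),
        )
    return protein_list


group = [
    Protein("LEAD", 5.0, 3),
    Protein("X", 9.0, 1),
    Protein("Y", 7.0, 4),
    Protein("Z", 9.0, 2),
]
higher_ids = [p.identifier for p in sort_keeping_lead(list(group), higher_is_better=True)]
lower_ids = [p.identifier for p in sort_keeping_lead(list(group), higher_is_better=False)]
```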
unique_to_leads_peptides(self)
Method to retrieve peptides that are unique based on the data from the searches (Not based on the database digestion).
Returns:
Type | Description |
---|---|
set | A set of peptide strings. |
Examples:
>>> data = pyproteininference.datastore.DataStore(reader=reader, digest=digest)
>>> unique_peps = data.unique_to_leads_peptides()
Source code in pyproteininference/datastore.py
def unique_to_leads_peptides(self):
    """
    Method to retrieve peptides that are unique based on the data from the searches
    (Not based on the database digestion).

    Returns:
        set: a Set of peptide strings

    Example:
        >>> data = pyproteininference.datastore.DataStore(reader=reader, digest=digest)
        >>> unique_peps = data.unique_to_leads_peptides()
    """
    if self.grouped_scored_proteins:
        # Collect the peptides of each lead protein, then keep those seen in exactly one lead
        lead_peptides = [list(x[0].peptides) for x in self.grouped_scored_proteins]
        flat_peptides = [item for sublist in lead_peptides for item in sublist]
        counted_peps = collections.Counter(flat_peptides)
        unique_to_leads_peptides = set([x for x in counted_peps if counted_peps[x] == 1])
    else:
        unique_to_leads_peptides = set()
    return unique_to_leads_peptides
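The counting step can be tried in isolation. In this sketch each lead protein is represented only by its set of peptide strings, and the peptide names are made up:

```python
import collections


def unique_to_leads(lead_peptide_sets):
    """Return peptides that occur in exactly one lead protein's peptide set."""
    flat_peptides = [pep for peptides in lead_peptide_sets for pep in peptides]
    counted = collections.Counter(flat_peptides)
    return {pep for pep, count in counted.items() if count == 1}


leads = [
    {"PEPTIDEA", "PEPTIDEB"},
    {"PEPTIDEB", "PEPTIDEC"},
    {"PEPTIDED"},
]
unique = unique_to_leads(leads)
```

`PEPTIDEB` is dropped because two leads share it; the other three peptides are each seen once.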
validate_digest(self)
Method that validates the Digest object.
Source code in pyproteininference/datastore.py
def validate_digest(self):
    """
    Method that validates the [Digest object][pyproteininference.in_silico_digest.Digest].
    """
    self._validate_reviewed_v_unreviewed()
    self._check_target_decoy_split()
validate_psm_data(self)
Method that validates the PSM data.
Source code in pyproteininference/datastore.py
def validate_psm_data(self):
    """
    Method that validates the PSM data.
    """
    self._validate_decoys_from_data()
    self._validate_isoform_from_data()
export
Export
Class that handles exporting protein inference results to filesystem as csv files.
Attributes:
Name | Type | Description |
---|---|---|
data | DataStore | DataStore object. |
filepath | str | Path to file to be written. |
Source code in pyproteininference/export.py
class Export(object):
"""
Class that handles exporting protein inference results to filesystem as csv files.
Attributes:
data (DataStore): [DataStore object][pyproteininference.datastore.DataStore].
filepath (str): Path to file to be written.
"""
EXPORT_LEADS = "leads"
EXPORT_ALL = "all"
EXPORT_COMMA_SEP = "comma_sep"
EXPORT_Q_VALUE_COMMA_SEP = "q_value_comma_sep"
EXPORT_Q_VALUE = "q_value"
EXPORT_Q_VALUE_ALL = "q_value_all"
EXPORT_PEPTIDES = "peptides"
EXPORT_PSMS = "psms"
EXPORT_PSM_IDS = "psm_ids"
EXPORT_LONG = "long"
EXPORT_TYPES = [
EXPORT_LEADS,
EXPORT_ALL,
EXPORT_COMMA_SEP,
EXPORT_Q_VALUE_COMMA_SEP,
EXPORT_Q_VALUE,
EXPORT_Q_VALUE_ALL,
EXPORT_PEPTIDES,
EXPORT_PSMS,
EXPORT_PSM_IDS,
EXPORT_LONG,
]
def __init__(self, data):
"""
Initialization method for the Export class.
Args:
data (DataStore): [DataStore object][pyproteininference.datastore.DataStore].
Example:
>>> export = pyproteininference.export.Export(data=data)
"""
self.data = data
self.filepath = None
def export_to_csv(self, output_filename=None, directory=None, export_type="q_value"):
"""
Method that dispatches to one of the many export methods given an export_type input.
filepath is determined based on directory arg and information from
[DataStore object][pyproteininference.datastore.DataStore].
This method sets the `filepath` variable.
Args:
output_filename (str): Filepath to write to. If None, a filename is generated automatically and
the file is written to `directory`.
directory (str): Directory to write the result file to. If None, will write to current working directory.
export_type (str): Must be a value in `EXPORT_TYPES` and determines the output format.
Example:
>>> export = pyproteininference.export.Export(data=data)
>>> export.export_to_csv(output_filename=None, directory="/path/to/output/dir/", export_type="psms")
"""
if not directory:
directory = os.getcwd()
data = self.data
tag = data.parameter_file_object.tag
if self.EXPORT_LEADS == export_type:
filename = "{}_leads_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
complete_filepath = os.path.join(directory, filename)
if output_filename:
complete_filepath = output_filename
logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
self.csv_export_leads_restricted(filename_out=complete_filepath)
elif self.EXPORT_ALL == export_type:
filename = "{}_all_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
complete_filepath = os.path.join(directory, filename)
if output_filename:
complete_filepath = output_filename
logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
self.csv_export_all_restricted(complete_filepath)
elif self.EXPORT_COMMA_SEP == export_type:
filename = "{}_comma_sep_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
complete_filepath = os.path.join(directory, filename)
if output_filename:
complete_filepath = output_filename
logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
self.csv_export_comma_sep_restricted(complete_filepath)
elif self.EXPORT_Q_VALUE_COMMA_SEP == export_type:
filename = "{}_q_value_comma_sep_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
complete_filepath = os.path.join(directory, filename)
if output_filename:
complete_filepath = output_filename
logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
self.csv_export_q_value_comma_sep(complete_filepath)
elif self.EXPORT_Q_VALUE == export_type:
filename = "{}_q_value_leads_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
complete_filepath = os.path.join(directory, filename)
if output_filename:
complete_filepath = output_filename
logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
self.csv_export_q_value_leads(complete_filepath)
elif self.EXPORT_Q_VALUE_ALL == export_type:
filename = "{}_q_value_all_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
complete_filepath = os.path.join(directory, filename)
if output_filename:
complete_filepath = output_filename
logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
self.csv_export_q_value_all(complete_filepath)
elif self.EXPORT_PEPTIDES == export_type:
filename = "{}_q_value_leads_peptides_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
complete_filepath = os.path.join(directory, filename)
if output_filename:
complete_filepath = output_filename
logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
self.csv_export_q_value_leads_peptides(complete_filepath)
elif self.EXPORT_PSMS == export_type:
filename = "{}_q_value_leads_psms_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
complete_filepath = os.path.join(directory, filename)
if output_filename:
complete_filepath = output_filename
logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
self.csv_export_q_value_leads_psms(complete_filepath)
elif self.EXPORT_PSM_IDS == export_type:
filename = "{}_q_value_leads_psm_ids_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
complete_filepath = os.path.join(directory, filename)
if output_filename:
complete_filepath = output_filename
logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
self.csv_export_q_value_leads_psm_ids(complete_filepath)
elif self.EXPORT_LONG == export_type:
filename = "{}_q_value_long_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
complete_filepath = os.path.join(directory, filename)
if output_filename:
complete_filepath = output_filename
logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
self.csv_export_q_value_leads_long(complete_filepath)
else:
complete_filepath = "protein_inference_results.csv"
self.filepath = complete_filepath
def csv_export_all_restricted(self, filename_out):
"""
Method that outputs a subset of the passing proteins based on FDR.
This method returns a non-square CSV file.
Args:
filename_out (str): Filename for the data to be written to
"""
protein_objects = self.data.get_protein_objects(fdr_restricted=True)
protein_export_list = [
[
"Protein",
"Score",
"Number_of_Peptides",
"Identifier_Type",
"GroupID",
"Peptides",
]
]
for groups in protein_objects:
for prots in groups:
protein_export_list.append([prots.identifier])
protein_export_list[-1].append(prots.score)
protein_export_list[-1].append(prots.num_peptides)
if prots.reviewed:
protein_export_list[-1].append("Reviewed")
else:
protein_export_list[-1].append("Unreviewed")
protein_export_list[-1].append(prots.group_identification)
for peps in prots.peptides:
protein_export_list[-1].append(peps)
with open(filename_out, "w", newline="") as f:
writer = csv.writer(f)
writer.writerows(protein_export_list)
def csv_export_leads_restricted(self, filename_out):
"""
Method that outputs a subset of the passing proteins based on FDR.
Only Proteins that pass FDR will be output and only Lead proteins will be output.
This method returns a non-square CSV file.
Args:
filename_out (str): Filename for the data to be written to.
"""
protein_objects = self.data.get_protein_objects(fdr_restricted=True)
protein_export_list = [
[
"Protein",
"Score",
"Number_of_Peptides",
"Identifier_Type",
"GroupID",
"Peptides",
]
]
for groups in protein_objects:
protein_export_list.append([groups[0].identifier])
protein_export_list[-1].append(groups[0].score)
protein_export_list[-1].append(groups[0].num_peptides)
if groups[0].reviewed:
protein_export_list[-1].append("Reviewed")
else:
protein_export_list[-1].append("Unreviewed")
protein_export_list[-1].append(groups[0].group_identification)
for peps in sorted(groups[0].peptides):
protein_export_list[-1].append(peps)
with open(filename_out, "w", newline="") as f:
writer = csv.writer(f)
writer.writerows(protein_export_list)
def csv_export_comma_sep_restricted(self, filename_out):
"""
Method that outputs a subset of the passing proteins based on FDR.
Only Proteins that pass FDR will be output and only Lead proteins will be output.
Proteins in the groups of lead proteins will also be output in the same row as the lead.
This method returns a non-square CSV file.
Args:
filename_out (str): Filename for the data to be written to.
"""
protein_objects = self.data.get_protein_objects(fdr_restricted=True)
protein_export_list = [
[
"Protein",
"Score",
"Number_of_Peptides",
"Identifier_Type",
"GroupID",
"Other_Potential_Identifiers",
]
]
for groups in protein_objects:
for prots in groups:
if prots == groups[0]:
protein_export_list.append([prots.identifier])
protein_export_list[-1].append(prots.score)
protein_export_list[-1].append(prots.num_peptides)
if prots.reviewed:
protein_export_list[-1].append("Reviewed")
else:
protein_export_list[-1].append("Unreviewed")
protein_export_list[-1].append(prots.group_identification)
else:
protein_export_list[-1].append(prots.identifier)
with open(filename_out, "w", newline="") as f:
writer = csv.writer(f)
writer.writerows(protein_export_list)
def csv_export_q_value_leads(self, filename_out):
"""
Method that outputs all lead proteins with Q values.
This method returns a non-square CSV file.
Args:
filename_out (str): Filename for the data to be written to.
"""
protein_export_list = [
[
"Protein",
"Score",
"Q_Value",
"Number_of_Peptides",
"Identifier_Type",
"GroupID",
"Peptides",
]
]
for groups in self.data.protein_group_objects:
lead_protein = groups.proteins[0]
protein_export_list.append([lead_protein.identifier])
protein_export_list[-1].append(lead_protein.score)
protein_export_list[-1].append(groups.q_value)
protein_export_list[-1].append(lead_protein.num_peptides)
if lead_protein.reviewed:
protein_export_list[-1].append("Reviewed")
else:
protein_export_list[-1].append("Unreviewed")
protein_export_list[-1].append(groups.number_id)
peptides = lead_protein.peptides
for peps in sorted(peptides):
protein_export_list[-1].append(peps)
with open(filename_out, "w", newline="") as f:
writer = csv.writer(f)
writer.writerows(protein_export_list)
def csv_export_q_value_comma_sep(self, filename_out):
"""
Method that outputs all lead proteins with Q values.
Proteins in the groups of lead proteins will also be output in the same row as the lead.
This method returns a non-square CSV file.
Args:
filename_out (str): Filename for the data to be written to.
"""
protein_export_list = [
[
"Protein",
"Score",
"Q_Value",
"Number_of_Peptides",
"Identifier_Type",
"GroupID",
"Other_Potential_Identifiers",
]
]
for groups in self.data.protein_group_objects:
lead_protein = groups.proteins[0]
protein_export_list.append([lead_protein.identifier])
protein_export_list[-1].append(lead_protein.score)
protein_export_list[-1].append(groups.q_value)
protein_export_list[-1].append(lead_protein.num_peptides)
if lead_protein.reviewed:
protein_export_list[-1].append("Reviewed")
else:
protein_export_list[-1].append("Unreviewed")
protein_export_list[-1].append(groups.number_id)
for other_prots in groups.proteins[1:]:
protein_export_list[-1].append(other_prots.identifier)
with open(filename_out, "w", newline="") as f:
writer = csv.writer(f)
writer.writerows(protein_export_list)
def csv_export_q_value_all(self, filename_out):
"""
Method that outputs all proteins with Q values.
Non Lead proteins are also output - entire group gets output.
Each protein in a group is written on its own row with the Q value and GroupID of its group.
This method returns a non-square CSV file.
Args:
filename_out (str): Filename for the data to be written to.
"""
protein_export_list = [
[
"Protein",
"Score",
"Q_Value",
"Number_of_Peptides",
"Identifier_Type",
"GroupID",
"Peptides",
]
]
for groups in self.data.protein_group_objects:
for proteins in groups.proteins:
protein_export_list.append([proteins.identifier])
protein_export_list[-1].append(proteins.score)
protein_export_list[-1].append(groups.q_value)
protein_export_list[-1].append(proteins.num_peptides)
if proteins.reviewed:
protein_export_list[-1].append("Reviewed")
else:
protein_export_list[-1].append("Unreviewed")
protein_export_list[-1].append(groups.number_id)
for peps in sorted(proteins.peptides):
protein_export_list[-1].append(peps)
with open(filename_out, "w", newline="") as f:
writer = csv.writer(f)
writer.writerows(protein_export_list)
def csv_export_q_value_all_proteologic(self, filename_out):
protein_export_list = [
[
"Protein",
"Score",
"Q_Value",
"Number_of_Peptides",
"Identifier_Type",
"GroupID",
"Peptides",
]
]
for groups in self.data.protein_group_objects:
for proteins in groups.proteins:
protein_export_list.append([proteins.identifier])
protein_export_list[-1].append(proteins.score)
protein_export_list[-1].append(groups.q_value)
protein_export_list[-1].append(proteins.num_peptides)
if proteins.reviewed:
protein_export_list[-1].append("Reviewed")
else:
protein_export_list[-1].append("Unreviewed")
protein_export_list[-1].append(groups.number_id)
for peps in sorted(proteins.peptides):
protein_export_list[-1].append(peps)
with open(filename_out, "w", newline="") as f:
writer = csv.writer(f)
writer.writerows(protein_export_list)
def csv_export_q_value_leads_long(self, filename_out):
"""
Method that outputs all lead proteins with Q values.
This method returns a long formatted result file with one peptide on each row.
Args:
filename_out (str): Filename for the data to be written to.
"""
protein_export_list = [
[
"Protein",
"Score",
"Q_Value",
"Number_of_Peptides",
"Identifier_Type",
"GroupID",
"Peptides",
]
]
for groups in self.data.protein_group_objects:
lead_protein = groups.proteins[0]
for peps in sorted(lead_protein.peptides):
protein_export_list.append([lead_protein.identifier])
protein_export_list[-1].append(lead_protein.score)
protein_export_list[-1].append(groups.q_value)
protein_export_list[-1].append(lead_protein.num_peptides)
if lead_protein.reviewed:
protein_export_list[-1].append("Reviewed")
else:
protein_export_list[-1].append("Unreviewed")
protein_export_list[-1].append(groups.number_id)
protein_export_list[-1].append(peps)
with open(filename_out, "w", newline="") as f:
writer = csv.writer(f)
writer.writerows(protein_export_list)
def csv_export_q_value_leads_peptides(self, filename_out, peptide_delimiter=" "):
"""
Method that outputs all lead proteins with Q values in rectangular format.
This method outputs unique peptides per protein.
This method returns a rectangular CSV file.
Args:
filename_out (str): Filename for the data to be written to.
peptide_delimiter (str): String to separate peptides by in the "Peptides" column of the csv file
"""
protein_export_list = [
[
"Protein",
"Score",
"Q_Value",
"Number_of_Peptides",
"Identifier_Type",
"GroupID",
"Peptides",
]
]
for groups in self.data.protein_group_objects:
lead_protein = groups.proteins[0]
protein_export_list.append([lead_protein.identifier])
protein_export_list[-1].append(lead_protein.score)
protein_export_list[-1].append(groups.q_value)
protein_export_list[-1].append(lead_protein.num_peptides)
if lead_protein.reviewed:
protein_export_list[-1].append("Reviewed")
else:
protein_export_list[-1].append("Unreviewed")
protein_export_list[-1].append(groups.number_id)
peptides = peptide_delimiter.join(list(sorted(lead_protein.peptides)))
protein_export_list[-1].append(peptides)
with open(filename_out, "w", newline="") as f:
writer = csv.writer(f)
writer.writerows(protein_export_list)
def csv_export_q_value_leads_psms(self, filename_out, peptide_delimiter=" "):
"""
Method that outputs all lead proteins with Q values in rectangular format.
This method outputs all PSMs for the protein not just unique peptide identifiers.
This method returns a rectangular CSV file.
Args:
filename_out (str): Filename for the data to be written to.
peptide_delimiter (str): String to separate peptides by in the "Peptides" column of the csv file.
"""
protein_export_list = [
[
"Protein",
"Score",
"Q_Value",
"Number_of_Peptides",
"Identifier_Type",
"GroupID",
"Peptides",
]
]
for groups in self.data.protein_group_objects:
lead_protein = groups.proteins[0]
protein_export_list.append([lead_protein.identifier])
protein_export_list[-1].append(lead_protein.score)
protein_export_list[-1].append(groups.q_value)
protein_export_list[-1].append(lead_protein.num_peptides)
if lead_protein.reviewed:
protein_export_list[-1].append("Reviewed")
else:
protein_export_list[-1].append("Unreviewed")
protein_export_list[-1].append(groups.number_id)
psms = peptide_delimiter.join(sorted([x.non_flanking_peptide for x in lead_protein.psms]))
protein_export_list[-1].append(psms)
with open(filename_out, "w", newline="") as f:
writer = csv.writer(f)
writer.writerows(protein_export_list)
def csv_export_q_value_leads_psm_ids(self, filename_out, peptide_delimiter=" "):
"""
Method that outputs all lead proteins with Q values in rectangular format.
Psms are output as the psm_id value. So sequence information is not output.
This method returns a rectangular CSV file.
Args:
filename_out (str): Filename for the data to be written to.
peptide_delimiter (str): String to separate psm_ids by in the "Peptides" column of the csv file.
"""
protein_export_list = [
[
"Protein",
"Score",
"Q_Value",
"Number_of_Peptides",
"Identifier_Type",
"GroupID",
"Peptides",
]
]
for groups in self.data.protein_group_objects:
lead_protein = groups.proteins[0]
protein_export_list.append([lead_protein.identifier])
protein_export_list[-1].append(lead_protein.score)
protein_export_list[-1].append(groups.q_value)
protein_export_list[-1].append(lead_protein.num_peptides)
if lead_protein.reviewed:
protein_export_list[-1].append("Reviewed")
else:
protein_export_list[-1].append("Unreviewed")
protein_export_list[-1].append(groups.number_id)
psms = peptide_delimiter.join(sorted(lead_protein.get_psm_ids()))
protein_export_list[-1].append(psms)
with open(filename_out, "w", newline="") as f:
writer = csv.writer(f)
writer.writerows(protein_export_list)
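When `output_filename` is not given, `export_to_csv` builds the output path from the parameter tag, a label for the export type, the short protein score name and the PSM score. The sketch below restates that naming rule as a standalone helper; `build_export_path` and the example values ("run1", "mult", "pep") are hypothetical and not part of the library:

```python
import os

# Filename label used for each export_type, as read from export_to_csv.
FILENAME_LABELS = {
    "leads": "leads",
    "all": "all",
    "comma_sep": "comma_sep",
    "q_value_comma_sep": "q_value_comma_sep",
    "q_value": "q_value_leads",
    "q_value_all": "q_value_all",
    "peptides": "q_value_leads_peptides",
    "psms": "q_value_leads_psms",
    "psm_ids": "q_value_leads_psm_ids",
    "long": "q_value_long",
}


def build_export_path(tag, export_type, short_protein_score, psm_score,
                      directory=None, output_filename=None):
    """Hypothetical helper mirroring how export_to_csv chooses its output path."""
    if output_filename:
        # An explicit filename always wins over the generated one.
        return output_filename
    if not directory:
        directory = os.getcwd()
    filename = "{}_{}_{}_{}.csv".format(
        tag, FILENAME_LABELS[export_type], short_protein_score, psm_score
    )
    return os.path.join(directory, filename)


generated = build_export_path("run1", "q_value", "mult", "pep", directory="out")
explicit = build_export_path("run1", "psms", "mult", "pep", output_filename="custom.csv")
```

Note that the "q_value" export type produces a filename containing `q_value_leads`, and "peptides", "psms" and "psm_ids" are likewise prefixed with `q_value_leads_`.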
__init__(self, data)
special
Initialization method for the Export class.
Parameters:
Name | Type | Description |
---|---|---|
data | DataStore | DataStore object. |
Examples:
>>> export = pyproteininference.export.Export(data=data)
Source code in pyproteininference/export.py
def __init__(self, data):
"""
Initialization method for the Export class.
Args:
data (DataStore): [DataStore object][pyproteininference.datastore.DataStore].
Example:
>>> export = pyproteininference.export.Export(data=data)
"""
self.data = data
self.filepath = None
csv_export_all_restricted(self, filename_out)
Method that outputs a subset of the passing proteins based on FDR.
This method returns a non-square CSV file.
Parameters:
Name | Type | Description |
---|---|---|
filename_out | str | Filename for the data to be written to. |
Source code in pyproteininference/export.py
def csv_export_all_restricted(self, filename_out):
"""
Method that outputs a subset of the passing proteins based on FDR.
This method returns a non-square CSV file.
Args:
filename_out (str): Filename for the data to be written to
"""
protein_objects = self.data.get_protein_objects(fdr_restricted=True)
protein_export_list = [
[
"Protein",
"Score",
"Number_of_Peptides",
"Identifier_Type",
"GroupID",
"Peptides",
]
]
for groups in protein_objects:
for prots in groups:
protein_export_list.append([prots.identifier])
protein_export_list[-1].append(prots.score)
protein_export_list[-1].append(prots.num_peptides)
if prots.reviewed:
protein_export_list[-1].append("Reviewed")
else:
protein_export_list[-1].append("Unreviewed")
protein_export_list[-1].append(prots.group_identification)
for peps in prots.peptides:
protein_export_list[-1].append(peps)
with open(filename_out, "w", newline="") as f:
writer = csv.writer(f)
writer.writerows(protein_export_list)
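Because each row ends with a variable number of peptide columns, the file is not rectangular, and a parser that expects a fixed column count may reject it. A sketch of writing and reading such rows with the standard `csv` module, using made-up rows and an in-memory buffer in place of a file on disk:

```python
import csv
import io

header = ["Protein", "Score", "Number_of_Peptides", "Identifier_Type", "GroupID", "Peptides"]
rows = [
    ["PROT_A", 12.5, 3, "Reviewed", 1, "PEPA", "PEPB", "PEPC"],
    ["PROT_B", 8.0, 1, "Unreviewed", 2, "PEPD"],
]

# Write the header followed by rows of differing lengths.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerows([header] + rows)

# Read back: the first five columns are fixed, everything after them is a peptide.
buffer.seek(0)
reader = csv.reader(buffer)
next(reader)  # skip the header row
parsed = []
for row in reader:
    fixed_columns, peptides = row[:5], row[5:]
    parsed.append((fixed_columns[0], peptides))
```

Slicing at the number of fixed columns recovers the peptides per protein; every value comes back as a string.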
csv_export_comma_sep_restricted(self, filename_out)
Method that outputs a subset of the passing proteins based on FDR. Only groups that pass FDR are output, one row per lead protein. The other proteins in each lead's group are appended to the same row as the lead.
This method returns a non-square CSV file.
Parameters:
Name | Type | Description |
---|---|---|
filename_out | str | Filename for the data to be written to. |
Source code in pyproteininference/export.py
def csv_export_comma_sep_restricted(self, filename_out):
"""
Method that outputs a subset of the passing proteins based on FDR.
Only Proteins that pass FDR will be output and only Lead proteins will be output.
Proteins in the groups of lead proteins will also be output in the same row as the lead.
This method returns a non-square CSV file.
Args:
filename_out (str): Filename for the data to be written to.
"""
protein_objects = self.data.get_protein_objects(fdr_restricted=True)
protein_export_list = [
[
"Protein",
"Score",
"Number_of_Peptides",
"Identifier_Type",
"GroupID",
"Other_Potential_Identifiers",
]
]
for groups in protein_objects:
for prots in groups:
if prots == groups[0]:
protein_export_list.append([prots.identifier])
protein_export_list[-1].append(prots.score)
protein_export_list[-1].append(prots.num_peptides)
if prots.reviewed:
protein_export_list[-1].append("Reviewed")
else:
protein_export_list[-1].append("Unreviewed")
protein_export_list[-1].append(prots.group_identification)
else:
protein_export_list[-1].append(prots.identifier)
with open(filename_out, "w", newline="") as f:
writer = csv.writer(f)
writer.writerows(protein_export_list)
csv_export_leads_restricted(self, filename_out)
Method that outputs a subset of the passing proteins based on FDR. Only Proteins that pass FDR will be output and only Lead proteins will be output.
This method returns a non-square CSV file.
Parameters:
Name | Type | Description |
---|---|---|
filename_out | str | Filename for the data to be written to. |
Source code in pyproteininference/export.py
def csv_export_leads_restricted(self, filename_out):
"""
Method that outputs a subset of the passing proteins based on FDR.
Only Proteins that pass FDR will be output and only Lead proteins will be output.
This method returns a non-square CSV file.
Args:
filename_out (str): Filename for the data to be written to.
"""
protein_objects = self.data.get_protein_objects(fdr_restricted=True)
protein_export_list = [
[
"Protein",
"Score",
"Number_of_Peptides",
"Identifier_Type",
"GroupID",
"Peptides",
]
]
for groups in protein_objects:
protein_export_list.append([groups[0].identifier])
protein_export_list[-1].append(groups[0].score)
protein_export_list[-1].append(groups[0].num_peptides)
if groups[0].reviewed:
protein_export_list[-1].append("Reviewed")
else:
protein_export_list[-1].append("Unreviewed")
protein_export_list[-1].append(groups[0].group_identification)
for peps in sorted(groups[0].peptides):
protein_export_list[-1].append(peps)
with open(filename_out, "w", newline="") as f:
writer = csv.writer(f)
writer.writerows(protein_export_list)
csv_export_q_value_all(self, filename_out)
Method that outputs all proteins with Q values. Non-lead proteins are also output: every protein in a group is written on its own row with the Q value and GroupID of its group.
This method returns a non-square CSV file.
Parameters:
Name | Type | Description |
---|---|---|
filename_out | str | Filename for the data to be written to. |
Source code in pyproteininference/export.py
def csv_export_q_value_all(self, filename_out):
"""
Method that outputs all proteins with Q values.
Non Lead proteins are also output - entire group gets output.
Each protein in a group is written on its own row with the Q value and GroupID of its group.
This method returns a non-square CSV file.
Args:
filename_out (str): Filename for the data to be written to.
"""
protein_export_list = [
[
"Protein",
"Score",
"Q_Value",
"Number_of_Peptides",
"Identifier_Type",
"GroupID",
"Peptides",
]
]
for groups in self.data.protein_group_objects:
for proteins in groups.proteins:
protein_export_list.append([proteins.identifier])
protein_export_list[-1].append(proteins.score)
protein_export_list[-1].append(groups.q_value)
protein_export_list[-1].append(proteins.num_peptides)
if proteins.reviewed:
protein_export_list[-1].append("Reviewed")
else:
protein_export_list[-1].append("Unreviewed")
protein_export_list[-1].append(groups.number_id)
for peps in sorted(proteins.peptides):
protein_export_list[-1].append(peps)
with open(filename_out, "w", newline="") as f:
writer = csv.writer(f)
writer.writerows(protein_export_list)
csv_export_q_value_comma_sep(self, filename_out)
Method that outputs all lead proteins with Q values. Proteins in the groups of lead proteins will also be output in the same row as the lead.
This method returns a non-square CSV file.
Parameters:
Name | Type | Description |
---|---|---|
filename_out | str | Filename for the data to be written to. |
Source code in pyproteininference/export.py
def csv_export_q_value_comma_sep(self, filename_out):
"""
Method that outputs all lead proteins with Q values.
Proteins in the groups of lead proteins will also be output in the same row as the lead.
This method returns a non-square CSV file.
Args:
filename_out (str): Filename for the data to be written to.
"""
protein_export_list = [
[
"Protein",
"Score",
"Q_Value",
"Number_of_Peptides",
"Identifier_Type",
"GroupID",
"Other_Potential_Identifiers",
]
]
for groups in self.data.protein_group_objects:
lead_protein = groups.proteins[0]
protein_export_list.append([lead_protein.identifier])
protein_export_list[-1].append(lead_protein.score)
protein_export_list[-1].append(groups.q_value)
protein_export_list[-1].append(lead_protein.num_peptides)
if lead_protein.reviewed:
protein_export_list[-1].append("Reviewed")
else:
protein_export_list[-1].append("Unreviewed")
protein_export_list[-1].append(groups.number_id)
for other_prots in groups.proteins[1:]:
protein_export_list[-1].append(other_prots.identifier)
with open(filename_out, "w", newline="") as f:
writer = csv.writer(f)
writer.writerows(protein_export_list)
csv_export_q_value_leads(self, filename_out)
Method that outputs all lead proteins with Q values.
This method returns a non-square CSV file.
Parameters:
Name | Type | Description |
---|---|---|
filename_out |
str |
Filename for the data to be written to. |
Source code in pyproteininference/export.py
def csv_export_q_value_leads(self, filename_out):
"""
Method that outputs all lead proteins with Q values.
This method returns a non-square CSV file.
Args:
filename_out (str): Filename for the data to be written to.
"""
protein_export_list = [
[
"Protein",
"Score",
"Q_Value",
"Number_of_Peptides",
"Identifier_Type",
"GroupID",
"Peptides",
]
]
for groups in self.data.protein_group_objects:
lead_protein = groups.proteins[0]
protein_export_list.append([lead_protein.identifier])
protein_export_list[-1].append(lead_protein.score)
protein_export_list[-1].append(groups.q_value)
protein_export_list[-1].append(lead_protein.num_peptides)
if lead_protein.reviewed:
protein_export_list[-1].append("Reviewed")
else:
protein_export_list[-1].append("Unreviewed")
protein_export_list[-1].append(groups.number_id)
peptides = lead_protein.peptides
for peps in sorted(peptides):
protein_export_list[-1].append(peps)
with open(filename_out, "w", newline="") as f:
writer = csv.writer(f)
writer.writerows(protein_export_list)
csv_export_q_value_leads_long(self, filename_out)
Method that outputs all lead proteins with Q values.
This method returns a long formatted result file with one peptide on each row.
Parameters:
Name | Type | Description |
---|---|---|
filename_out |
str |
Filename for the data to be written to. |
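In the long format the protein-level columns repeat on every row and only the peptide changes. As a rough sketch, assuming made-up sample rows and a hypothetical helper name `peptides_by_protein`, the rows can be folded back into one peptide list per protein:

```python
import collections
import csv
import io

# Hypothetical long-format sample: one peptide per row, protein columns repeated.
SAMPLE_LONG = (
    "Protein,Score,Q_Value,Number_of_Peptides,Identifier_Type,GroupID,Peptides\n"
    "PROT_A,12.5,0.0,2,Reviewed,1,AAAK\n"
    "PROT_A,12.5,0.0,2,Reviewed,1,CCCR\n"
    "PROT_D,8.1,0.01,1,Unreviewed,2,DDDK\n"
)


def peptides_by_protein(handle):
    """Group the single-peptide rows into a protein -> list of peptides mapping."""
    grouped = collections.defaultdict(list)
    for row in csv.DictReader(handle):
        grouped[row["Protein"]].append(row["Peptides"])
    return dict(grouped)


grouped = peptides_by_protein(io.StringIO(SAMPLE_LONG))
```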
Source code in pyproteininference/export.py
def csv_export_q_value_leads_long(self, filename_out):
"""
Method that outputs all lead proteins with Q values.
This method returns a long formatted result file with one peptide on each row.
Args:
filename_out (str): Filename for the data to be written to.
"""
protein_export_list = [
[
"Protein",
"Score",
"Q_Value",
"Number_of_Peptides",
"Identifier_Type",
"GroupID",
"Peptides",
]
]
for groups in self.data.protein_group_objects:
lead_protein = groups.proteins[0]
for peps in sorted(lead_protein.peptides):
protein_export_list.append([lead_protein.identifier])
protein_export_list[-1].append(lead_protein.score)
protein_export_list[-1].append(groups.q_value)
protein_export_list[-1].append(lead_protein.num_peptides)
if lead_protein.reviewed:
protein_export_list[-1].append("Reviewed")
else:
protein_export_list[-1].append("Unreviewed")
protein_export_list[-1].append(groups.number_id)
protein_export_list[-1].append(peps)
with open(filename_out, "w", newline="") as f:
writer = csv.writer(f)
writer.writerows(protein_export_list)
csv_export_q_value_leads_peptides(self, filename_out, peptide_delimiter=' ')
Method that outputs all lead proteins with Q values in rectangular format. This method outputs unique peptides per protein.
This method returns a rectangular CSV file.
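Parameters:
Name | Type | Description |
---|---|---|
filename_out |
str |
Filename for the data to be written to. |
peptide_delimiter |
str |
String to separate peptides by in the "Peptides" column of the csv file. Defaults to a single space. |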
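The output stays rectangular because all peptides of a protein are joined into a single cell. The sketch below illustrates the idea with a reduced two-column layout. The helper `write_rectangular` and the sample data are hypothetical and are not part of the library.

```python
import csv
import io


def write_rectangular(rows, handle, peptide_delimiter=" "):
    """Write one row per protein with all peptides joined into a single cell."""
    writer = csv.writer(handle)
    writer.writerow(["Protein", "Peptides"])
    for protein, peptides in rows:
        # Sorting gives a stable order, mirroring the sorted() call in the listing.
        writer.writerow([protein, peptide_delimiter.join(sorted(peptides))])


buffer = io.StringIO()
write_rectangular([("PROT_A", {"CCCR", "AAAK"}), ("PROT_D", {"DDDK"})], buffer)

# Reading the data back: split the cell on the same delimiter.
buffer.seek(0)
parsed = {row["Protein"]: row["Peptides"].split(" ") for row in csv.DictReader(buffer)}
```

If the delimiter is a comma, `csv.writer` quotes the cell, so the file remains rectangular, but consumers then need a CSV-aware parser rather than a plain split on commas.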
Source code in pyproteininference/export.py
def csv_export_q_value_leads_peptides(self, filename_out, peptide_delimiter=" "):
"""
Method that outputs all lead proteins with Q values in rectangular format.
This method outputs unique peptides per protein.
This method returns a rectangular CSV file.
Args:
filename_out (str): Filename for the data to be written to.
peptide_delimiter (str): String to separate peptides by in the "Peptides" column of the csv file
"""
protein_export_list = [
[
"Protein",
"Score",
"Q_Value",
"Number_of_Peptides",
"Identifier_Type",
"GroupID",
"Peptides",
]
]
for groups in self.data.protein_group_objects:
lead_protein = groups.proteins[0]
protein_export_list.append([lead_protein.identifier])
protein_export_list[-1].append(lead_protein.score)
protein_export_list[-1].append(groups.q_value)
protein_export_list[-1].append(lead_protein.num_peptides)
if lead_protein.reviewed:
protein_export_list[-1].append("Reviewed")
else:
protein_export_list[-1].append("Unreviewed")
protein_export_list[-1].append(groups.number_id)
peptides = peptide_delimiter.join(list(sorted(lead_protein.peptides)))
protein_export_list[-1].append(peptides)
with open(filename_out, "w", newline="") as f:
writer = csv.writer(f)
writer.writerows(protein_export_list)
csv_export_q_value_leads_psm_ids(self, filename_out, peptide_delimiter=' ')
Method that outputs all lead proteins with Q values in rectangular format. PSMs are output as their psm_id values, so sequence information is not included.
This method returns a rectangular CSV file.
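Parameters:
Name | Type | Description |
---|---|---|
filename_out |
str |
Filename for the data to be written to. |
peptide_delimiter |
str |
String to separate psm_ids by in the "Peptides" column of the csv file. Defaults to a single space. |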
Source code in pyproteininference/export.py
def csv_export_q_value_leads_psm_ids(self, filename_out, peptide_delimiter=" "):
"""
Method that outputs all lead proteins with Q values in rectangular format.
PSMs are output as their psm_id values, so sequence information is not included.
This method returns a rectangular CSV file.
Args:
filename_out (str): Filename for the data to be written to.
peptide_delimiter (str): String to separate psm_ids by in the "Peptides" column of the csv file.
"""
protein_export_list = [
[
"Protein",
"Score",
"Q_Value",
"Number_of_Peptides",
"Identifier_Type",
"GroupID",
"Peptides",
]
]
for groups in self.data.protein_group_objects:
lead_protein = groups.proteins[0]
protein_export_list.append([lead_protein.identifier])
protein_export_list[-1].append(lead_protein.score)
protein_export_list[-1].append(groups.q_value)
protein_export_list[-1].append(lead_protein.num_peptides)
if lead_protein.reviewed:
protein_export_list[-1].append("Reviewed")
else:
protein_export_list[-1].append("Unreviewed")
protein_export_list[-1].append(groups.number_id)
psms = peptide_delimiter.join(sorted(lead_protein.get_psm_ids()))
protein_export_list[-1].append(psms)
with open(filename_out, "w", newline="") as f:
writer = csv.writer(f)
writer.writerows(protein_export_list)
csv_export_q_value_leads_psms(self, filename_out, peptide_delimiter=' ')
Method that outputs all lead proteins with Q values in rectangular format. This method outputs all PSMs for the protein, not just unique peptide identifiers.
This method returns a rectangular CSV file.
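Parameters:
Name | Type | Description |
---|---|---|
filename_out |
str |
Filename for the data to be written to. |
peptide_delimiter |
str |
String to separate peptides by in the "Peptides" column of the csv file. Defaults to a single space. |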
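Since this export lists every PSM rather than each unique peptide once, a peptide sequence can repeat within the "Peptides" cell. A small sketch of tallying PSMs per peptide follows, assuming the default single-space delimiter and a made-up cell value. The helper `psm_counts` is hypothetical.

```python
import collections


def psm_counts(psm_cell, peptide_delimiter=" "):
    """Count how often each peptide sequence occurs in one 'Peptides' cell."""
    if not psm_cell:
        return collections.Counter()
    return collections.Counter(psm_cell.split(peptide_delimiter))


# Hypothetical cell content: two PSMs for AAAK and one for CCCR.
counts = psm_counts("AAAK AAAK CCCR")
```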
Source code in pyproteininference/export.py
def csv_export_q_value_leads_psms(self, filename_out, peptide_delimiter=" "):
"""
Method that outputs all lead proteins with Q values in rectangular format.
This method outputs all PSMs for the protein, not just unique peptide identifiers.
This method returns a rectangular CSV file.
Args:
filename_out (str): Filename for the data to be written to.
peptide_delimiter (str): String to separate peptides by in the "Peptides" column of the csv file.
"""
protein_export_list = [
[
"Protein",
"Score",
"Q_Value",
"Number_of_Peptides",
"Identifier_Type",
"GroupID",
"Peptides",
]
]
for groups in self.data.protein_group_objects:
lead_protein = groups.proteins[0]
protein_export_list.append([lead_protein.identifier])
protein_export_list[-1].append(lead_protein.score)
protein_export_list[-1].append(groups.q_value)
protein_export_list[-1].append(lead_protein.num_peptides)
if lead_protein.reviewed:
protein_export_list[-1].append("Reviewed")
else:
protein_export_list[-1].append("Unreviewed")
protein_export_list[-1].append(groups.number_id)
psms = peptide_delimiter.join(sorted([x.non_flanking_peptide for x in lead_protein.psms]))
protein_export_list[-1].append(psms)
with open(filename_out, "w", newline="") as f:
writer = csv.writer(f)
writer.writerows(protein_export_list)
export_to_csv(self, output_filename=None, directory=None, export_type='q_value')
Method that dispatches to one of the many export methods given an export_type input.
The file path is determined from the directory argument and information from the DataStore object.
This method sets the filepath variable.
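Parameters:
Name | Type | Description |
---|---|---|
output_filename |
str |
Filepath to write to. If None, a filename is generated automatically and written to directory. |
directory |
str |
Directory to write the result file to. If None, the current working directory is used. |
export_type |
str |
Must be a value in EXPORT_TYPES and determines the output format. |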
Examples:
>>> export = pyproteininference.export.Export(data=data)
>>> export.export_to_csv(output_filename=None, directory="/path/to/output/dir/", export_type="psms")
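Every branch of the dispatch in the source listing builds its default file name the same way: the parameter tag, a label for the export type, the short protein score name, and the PSM score name. The sketch below condenses that rule into one function. The name `default_export_path` and the example argument values are hypothetical; the format string and the precedence of `output_filename` follow the listing.

```python
import os


def default_export_path(tag, label, short_protein_score, psm_score,
                        directory=None, output_filename=None):
    """Return the path an export would be written to under the rule above."""
    # An explicit output filename always wins over the generated one.
    if output_filename:
        return output_filename
    # Without a directory, fall back to the current working directory.
    if not directory:
        directory = os.getcwd()
    filename = "{}_{}_{}_{}.csv".format(tag, label, short_protein_score, psm_score)
    return os.path.join(directory, filename)


# Hypothetical values for illustration only.
path = default_export_path("example", "q_value_leads_psms", "ml", "pep", directory="out")
```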
Source code in pyproteininference/export.py
def export_to_csv(self, output_filename=None, directory=None, export_type="q_value"):
"""
Method that dispatches to one of the many export methods given an export_type input.
filepath is determined based on directory arg and information from
[DataStore object][pyproteininference.datastore.DataStore].
This method sets the `filepath` variable.
Args:
output_filename (str): Filepath to write to. If set as None will auto generate filename and
will write to directory variable.
directory (str): Directory to write the result file to. If None, will write to current working directory.
export_type (str): Must be a value in `EXPORT_TYPES` and determines the output format.
Example:
>>> export = pyproteininference.export.Export(data=data)
>>> export.export_to_csv(output_filename=None, directory="/path/to/output/dir/", export_type="psms")
"""
if not directory:
directory = os.getcwd()
data = self.data
tag = data.parameter_file_object.tag
if self.EXPORT_LEADS == export_type:
filename = "{}_leads_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
complete_filepath = os.path.join(directory, filename)
if output_filename:
complete_filepath = output_filename
logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
self.csv_export_leads_restricted(filename_out=complete_filepath)
elif self.EXPORT_ALL == export_type:
filename = "{}_all_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
complete_filepath = os.path.join(directory, filename)
if output_filename:
complete_filepath = output_filename
logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
self.csv_export_all_restricted(complete_filepath)
elif self.EXPORT_COMMA_SEP == export_type:
filename = "{}_comma_sep_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
complete_filepath = os.path.join(directory, filename)
if output_filename:
complete_filepath = output_filename
logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
self.csv_export_comma_sep_restricted(complete_filepath)
elif self.EXPORT_Q_VALUE_COMMA_SEP == export_type:
filename = "{}_q_value_comma_sep_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
complete_filepath = os.path.join(directory, filename)
if output_filename:
complete_filepath = output_filename
logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
self.csv_export_q_value_comma_sep(complete_filepath)
elif self.EXPORT_Q_VALUE == export_type:
filename = "{}_q_value_leads_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
complete_filepath = os.path.join(directory, filename)
if output_filename:
complete_filepath = output_filename
logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
self.csv_export_q_value_leads(complete_filepath)
elif self.EXPORT_Q_VALUE_ALL == export_type:
filename = "{}_q_value_all_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
complete_filepath = os.path.join(directory, filename)
if output_filename:
complete_filepath = output_filename
logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
self.csv_export_q_value_all(complete_filepath)
elif self.EXPORT_PEPTIDES == export_type:
filename = "{}_q_value_leads_peptides_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
complete_filepath = os.path.join(directory, filename)
if output_filename:
complete_filepath = output_filename
logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
self.csv_export_q_value_leads_peptides(complete_filepath)
elif self.EXPORT_PSMS == export_type:
filename = "{}_q_value_leads_psms_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
complete_filepath = os.path.join(directory, filename)
if output_filename:
complete_filepath = output_filename
logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
self.csv_export_q_value_leads_psms(complete_filepath)
elif self.EXPORT_PSM_IDS == export_type:
filename = "{}_q_value_leads_psm_ids_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
complete_filepath = os.path.join(directory, filename)
if output_filename:
complete_filepath = output_filename
logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
self.csv_export_q_value_leads_psm_ids(complete_filepath)
elif self.EXPORT_LONG == export_type:
filename = "{}_q_value_long_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
complete_filepath = os.path.join(directory, filename)
if output_filename:
complete_filepath = output_filename
logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
self.csv_export_q_value_leads_long(complete_filepath)
else:
complete_filepath = "protein_inference_results.csv"
self.filepath = complete_filepath
heuristic
HeuristicPipeline (ProteinInferencePipeline)
This is the Protein Inference Heuristic class which houses the logic to run the Protein Inference Heuristic method to determine the best inference method for the given data. Logic is executed in the execute method.
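The selection step of the heuristic, as it appears in `determine_optimal_inference_method` in the source listing, compares one heuristic score per method against two empirical thresholds, where a lower score is better. The following is a simplified sketch of that rule, not the library code. The function name `select_methods` and the plain string keys are assumptions standing in for the `Inference` constants; the default thresholds of 0.5 and 1 come from the method signature.

```python
def select_methods(scores, lower=0.5, upper=1.0):
    """Pick inference methods from a dict of heuristic scores (lower is better)."""
    # Parsimony and peptide centric are tested first against the lower threshold,
    # then inclusion and exclusion against the upper threshold.
    checks = (
        (("parsimony", "peptide_centric"), lower),
        (("inclusion", "exclusion"), upper),
    )
    for pair, threshold in checks:
        passing = [method for method in pair if scores[method] <= threshold]
        if len(passing) == 2:
            # Both methods of the pair pass, so both are returned.
            return list(pair)
        if passing:
            # Only one passes, so the better (smaller) score of the pair wins.
            return [min(pair, key=scores.get)]
    # Nothing passes either threshold: fall back to the best score overall.
    return [min(scores, key=scores.get)]


# Hypothetical scores for illustration.
example = select_methods(
    {"parsimony": 0.2, "peptide_centric": 0.9, "inclusion": 0.1, "exclusion": 3.0}
)
```

Note that the first pair takes priority: in the example, inclusion has the smallest score overall, yet parsimony is selected because it passes the lower threshold.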
Attributes:
Name | Type | Description |
---|---|---|
parameter_file |
str |
Path to Protein Inference Yaml Parameter File. |
database_file |
str |
Path to Fasta database used in proteomics search. |
target_files |
str/list |
Path to Target Psm File (Or a list of files). |
decoy_files |
str/list |
Path to Decoy Psm File (Or a list of files). |
combined_files |
str/list |
Path to Combined Psm File (Or a list of files). |
target_directory |
str |
Path to Directory containing Target Psm Files. |
decoy_directory |
str |
Path to Directory containing Decoy Psm Files. |
combined_directory |
str |
Path to Directory containing Combined Psm Files. |
output_directory |
str |
Path to Directory where output will be written. |
output_filename |
str |
Path to Filename where output will be written. Will override output_directory. |
id_splitting |
bool |
True/False on whether to split protein IDs in the digest. Advanced usage only. |
append_alt_from_db |
bool |
True/False on whether to append alternative proteins from the DB digestion in Reader class. |
pdf_filename |
str |
Filepath to be written to by Heuristic Plotting method. This is optional and a default filename will be created in output_directory if this is left as None. |
inference_method_list |
list |
List of inference methods used in heuristic determination. |
datastore_dict |
dict |
Dictionary of DataStore objects generated in heuristic determination with the inference method as the key of each entry. |
selected_methods |
list |
List of string representations of the inference methods selected by the heuristic. |
selected_datastores |
dict |
Dictionary of DataStore objects selected by the heuristic. |
output_type |
str |
How to output results. Can either be "all" or "optimal". Will either output all results or will only output the optimal results. |
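When pdf_filename is left as None, the constructor in the source listing derives a default location for the heuristic plot from the other output settings. A minimal sketch of that fallback order follows. The function name `resolve_pdf_filename` is hypothetical; the file name `heuristic_plot.pdf` and the order of the checks are taken from the listing.

```python
import os


def resolve_pdf_filename(pdf_filename=None, output_directory=None, output_filename=None):
    """Return the path the heuristic plot would be written to."""
    # An explicit pdf_filename is used unchanged.
    if pdf_filename:
        return pdf_filename
    if output_directory and not output_filename:
        base = output_directory
    elif output_filename:
        # Use the directory part of the result file path.
        base = os.path.split(output_filename)[0]
    else:
        base = os.getcwd()
    return os.path.join(base, "heuristic_plot.pdf")


# Hypothetical directory name for illustration.
plot_path = resolve_pdf_filename(output_directory="results")
```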
Source code in pyproteininference/heuristic.py
class HeuristicPipeline(ProteinInferencePipeline):
"""
This is the Protein Inference Heuristic class which houses the logic to run the Protein Inference Heuristic method
to determine the best inference method for the given data.
Logic is executed in the [execute][pyproteininference.heuristic.HeuristicPipeline.execute] method.
Attributes:
parameter_file (str): Path to Protein Inference Yaml Parameter File.
database_file (str): Path to Fasta database used in proteomics search.
target_files (str/list): Path to Target Psm File (Or a list of files).
decoy_files (str/list): Path to Decoy Psm File (Or a list of files).
combined_files (str/list): Path to Combined Psm File (Or a list of files).
target_directory (str): Path to Directory containing Target Psm Files.
decoy_directory (str): Path to Directory containing Decoy Psm Files.
combined_directory (str): Path to Directory containing Combined Psm Files.
output_directory (str): Path to Directory where output will be written.
output_filename (str): Path to Filename where output will be written. Will override output_directory.
id_splitting (bool): True/False on whether to split protein IDs in the digest.
Advanced usage only.
append_alt_from_db (bool): True/False on whether to append
alternative proteins from the DB digestion in Reader class.
pdf_filename (str): Filepath to be written to by Heuristic Plotting method.
This is optional and a default filename will be created in output_directory if this is left as None.
inference_method_list (list): List of inference methods used in heuristic determination.
datastore_dict (dict): Dictionary of [DataStore][pyproteininference.datastore.DataStore]
objects generated in heuristic determination with the inference method as the key of each entry.
selected_methods (list): a list of String representations of the selected inference methods based on the
heuristic.
selected_datastores (dict):
a Dictionary of [DataStore][pyproteininference.datastore.DataStore] objects as selected by the
heuristic.
output_type (str): How to output results. Can either be "all" or "optimal". Will either output all results
or will only output the optimal results.
"""
RATIO_CONSTANT = 2
OUTPUT_TYPES = ["all", "optimal"]
def __init__(
self,
parameter_file=None,
database_file=None,
target_files=None,
decoy_files=None,
combined_files=None,
target_directory=None,
decoy_directory=None,
combined_directory=None,
output_directory=None,
output_filename=None,
id_splitting=False,
append_alt_from_db=True,
pdf_filename=None,
output_type="all",
):
"""
Args:
parameter_file (str): Path to Protein Inference Yaml Parameter File.
database_file (str): Path to Fasta database used in proteomics search.
target_files (str/list): Path to Target Psm File (Or a list of files).
decoy_files (str/list): Path to Decoy Psm File (Or a list of files).
combined_files (str/list): Path to Combined Psm File (Or a list of files).
target_directory (str): Path to Directory containing Target Psm Files.
decoy_directory (str): Path to Directory containing Decoy Psm Files.
combined_directory (str): Path to Directory containing Combined Psm Files.
output_directory (str): Path to Directory where output will be written.
output_filename (str): Path to Filename where output will be written.
Will override output_directory.
id_splitting (bool): True/False on whether to split protein IDs in the digest.
Advanced usage only.
append_alt_from_db (bool): True/False on whether to append alternative proteins
from the DB digestion in Reader class.
pdf_filename (str): Filepath to be written to by Heuristic Plotting method.
This is optional and a default filename will be created in output_directory if this is left as None
output_type (str): How to output results. Can either be "all" or "optimal". Will either output all results
or will only output the optimal results.
Returns:
HeuristicPipeline: [HeuristicPipeline][pyproteininference.heuristic.HeuristicPipeline] object
Example:
>>> heuristic = pyproteininference.heuristic.HeuristicPipeline(
>>> parameter_file=yaml_params,
>>> database_file=database,
>>> target_files=target,
>>> decoy_files=decoy,
>>> combined_files=combined_files,
>>> target_directory=target_directory,
>>> decoy_directory=decoy_directory,
>>> combined_directory=combined_directory,
>>> output_directory=dir_name,
>>> output_filename=output_filename,
>>> append_alt_from_db=append_alt,
>>> pdf_filename=pdf_filename,
>>> output_type="all"
>>> )
"""
self.parameter_file = parameter_file
self.database_file = database_file
self.target_files = target_files
self.decoy_files = decoy_files
self.combined_files = combined_files
self.target_directory = target_directory
self.decoy_directory = decoy_directory
self.combined_directory = combined_directory
self.output_directory = output_directory
self.output_filename = output_filename
self.id_splitting = id_splitting
self.append_alt_from_db = append_alt_from_db
self.output_type = output_type
if self.output_type not in self.OUTPUT_TYPES:
raise ValueError("The variable output_type must be set to either 'all' or 'optimal'")
if not pdf_filename:
if self.output_directory and not self.output_filename:
self.pdf_filename = os.path.join(self.output_directory, "heuristic_plot.pdf")
elif self.output_filename:
self.pdf_filename = os.path.join(os.path.split(self.output_filename)[0], "heuristic_plot.pdf")
else:
self.pdf_filename = os.path.join(os.getcwd(), "heuristic_plot.pdf")
else:
self.pdf_filename = pdf_filename
self.inference_method_list = [
Inference.INCLUSION,
Inference.EXCLUSION,
Inference.PARSIMONY,
Inference.PEPTIDE_CENTRIC,
]
self.datastore_dict = {}
self.selected_methods = None
self.selected_datastores = {}
self._validate_input()
self._set_output_directory()
self._log_append_alt_from_db()
def execute(self, fdr_threshold=0.05):
"""
This method is the main driver of the heuristic method.
This method calls other classes and methods that make up the heuristic pipeline.
This includes but is not limited to:
1. Loops over the main inference methods: Inclusion, Exclusion, Parsimony, and Peptide Centric.
2. Determines the optimal inference method based on the input data as well as the database file.
3. Outputs the results and indicates the optimal results.
Args:
fdr_threshold (float): The Qvalue/FDR threshold the heuristic method uses to base calculations from.
Returns:
None:
Example:
>>> heuristic = pyproteininference.heuristic.HeuristicPipeline(
>>> parameter_file=yaml_params,
>>> database_file=database,
>>> target_files=target,
>>> decoy_files=decoy,
>>> combined_files=combined_files,
>>> target_directory=target_directory,
>>> decoy_directory=decoy_directory,
>>> combined_directory=combined_directory,
>>> output_directory=dir_name,
>>> output_filename=output_filename,
>>> append_alt_from_db=append_alt,
>>> pdf_filename=pdf_filename,
>>> output_type="all"
>>> )
>>> heuristic.execute(fdr_threshold=0.05)
"""
pyproteininference_parameters = pyproteininference.parameters.ProteinInferenceParameter(
yaml_param_filepath=self.parameter_file
)
digest = pyproteininference.in_silico_digest.PyteomicsDigest(
database_path=self.database_file,
digest_type=pyproteininference_parameters.digest_type,
missed_cleavages=pyproteininference_parameters.missed_cleavages,
reviewed_identifier_symbol=pyproteininference_parameters.reviewed_identifier_symbol,
max_peptide_length=pyproteininference_parameters.restrict_peptide_length,
id_splitting=self.id_splitting,
)
if self.database_file:
logger.info("Running In Silico Database Digest on file {}".format(self.database_file))
digest.digest_fasta_database()
else:
logger.warning(
"No Database File provided, Skipping database digest and only taking protein-peptide mapping from the "
"input files."
)
for inference_method in self.inference_method_list:
method_specific_parameters = copy.deepcopy(pyproteininference_parameters)
logger.info("Overriding inference type {}".format(method_specific_parameters.inference_type))
method_specific_parameters.inference_type = inference_method
logger.info("New inference type {}".format(method_specific_parameters.inference_type))
logger.info("FDR Threshold Set to {}".format(method_specific_parameters.fdr))
reader = pyproteininference.reader.GenericReader(
target_file=self.target_files,
decoy_file=self.decoy_files,
combined_files=self.combined_files,
parameter_file_object=method_specific_parameters,
digest=digest,
append_alt_from_db=self.append_alt_from_db,
)
reader.read_psms()
data = pyproteininference.datastore.DataStore(reader=reader, digest=digest)
data.restrict_psm_data()
data.recover_mapping()
data.create_scoring_input()
if method_specific_parameters.inference_type == Inference.EXCLUSION:
data.exclude_non_distinguishing_peptides()
score = pyproteininference.scoring.Score(data=data)
score.score_psms(score_method=method_specific_parameters.protein_score)
if method_specific_parameters.picker:
data.protein_picker()
else:
pass
pyproteininference.inference.Inference.run_inference(data=data, digest=digest)
data.calculate_q_values()
self.datastore_dict[inference_method] = data
self.selected_methods = self.determine_optimal_inference_method(
false_discovery_rate_threshold=fdr_threshold, pdf_filename=self.pdf_filename
)
self.selected_datastores = {x: self.datastore_dict[x] for x in self.selected_methods}
if self.output_type == "all":
self._write_all_results(parameters=method_specific_parameters)
elif self.output_type == "optimal":
self._write_optimal_results(parameters=method_specific_parameters)
else:
self._write_optimal_results(parameters=method_specific_parameters)
def generate_roc_plot(self, fdr_max=0.2, pdf_filename=None):
"""
This method produces a PDF ROC plot overlaying the 4 inference methods that are part of the heuristic algorithm.
Args:
fdr_max (float): Max FDR to display on the plot.
pdf_filename (str): Filename to write roc plot to.
Returns:
None:
"""
f = plt.figure()
for inference_method in self.datastore_dict.keys():
fdr_vs_target_hits = self.datastore_dict[inference_method].generate_fdr_vs_target_hits(fdr_max=fdr_max)
fdrs = [x[0] for x in fdr_vs_target_hits]
target_hits = [x[1] for x in fdr_vs_target_hits]
plt.plot(fdrs, target_hits, '-', label=inference_method.replace("_", " "))
target_fdr = self.datastore_dict[inference_method].parameter_file_object.fdr
if inference_method in self.selected_methods:
best_value = min(fdrs, key=lambda x: abs(x - target_fdr))
best_index = fdrs.index(best_value)
best_target_hit_value = target_hits[best_index] # noqa F841
plt.axvline(target_fdr, color="black", linestyle='--', alpha=0.75, label="Target FDR")
plt.legend()
plt.xlabel('Decoy FDR')
plt.ylabel('Target Protein Hits')
plt.xlim([-0.01, fdr_max])
plt.legend(loc='lower right')
plt.title("FDR vs Target Protein Hits per Inference Method")
if pdf_filename:
logger.info("Writing ROC plot to: {}".format(pdf_filename))
f.savefig(pdf_filename)
plt.close()
def _write_all_results(self, parameters):
"""
Internal method that loops over all results and writes them out.
"""
for method in list(self.datastore_dict.keys()):
datastore = self.datastore_dict[method]
if method in self.selected_methods:
inference_method_string = "{}_{}".format(method, "optimal_method")
else:
inference_method_string = method
if not self.output_filename and self.output_directory:
# If a filename is not provided then construct one using output_directory
# Note: output_directory will always get set even if its set as None - gets set to cwd
inference_filename = os.path.join(
self.output_directory,
"{}_{}_{}_{}_{}".format(
inference_method_string,
parameters.tag,
datastore.short_protein_score,
datastore.psm_score,
"protein_inference_results.csv",
),
)
if self.output_filename:
# If the user specified an output filename then split it apart and insert the inference method
# Then reconstruct the file
split = os.path.split(self.output_filename)
path = split[0]
filename = split[1]
inference_filename = os.path.join(path, "{}_{}".format(inference_method_string, filename))
export = pyproteininference.export.Export(data=self.datastore_dict[method])
export.export_to_csv(
output_filename=inference_filename,
directory=self.output_directory,
export_type=parameters.export,
)
def _write_optimal_results(self, parameters):
"""
Internal method that writes out the optimized results.
"""
for method in self.selected_methods:
datastore = self.datastore_dict[method]
inference_method_string = "{}_{}".format(method, "optimal_method")
if not self.output_filename and self.output_directory:
# If a filename is not provided then construct one using output_directory
# Note: output_directory will always get set even if its set as None - gets set to cwd
inference_filename = os.path.join(
self.output_directory,
"{}_{}_{}_{}_{}".format(
inference_method_string,
parameters.tag,
datastore.short_protein_score,
datastore.psm_score,
"protein_inference_results.csv",
),
)
if self.output_filename:
# If the user specified an output filename then split it apart and insert the inference method
# Then reconstruct the file
split = os.path.split(self.output_filename)
path = split[0]
filename = split[1]
inference_filename = os.path.join(path, "{}_{}".format(inference_method_string, filename))
export = pyproteininference.export.Export(data=self.selected_datastores[method])
export.export_to_csv(
output_filename=inference_filename,
directory=self.output_directory,
export_type=parameters.export,
)
def determine_optimal_inference_method(
self,
false_discovery_rate_threshold=0.05,
upper_empirical_threshold=1,
lower_empirical_threshold=0.5,
pdf_filename=None,
):
"""
This method determines the optimal inference method from Inclusion, Exclusion, Parsimony, Peptide-Centric.
Args:
false_discovery_rate_threshold (float): The fdr threshold to use in heuristic algorithm -
This parameter determines the maximum fdr used when creating a range of finite FDR values.
upper_empirical_threshold (float): Upper Threshold used for inclusion/exclusion cutoff for
the heuristic algorithm.
lower_empirical_threshold (float): Lower Threshold used for parsimony/peptide centric cutoff for
the heuristic algorithm.
pdf_filename (str): Filename to write heuristic density plot to.
Returns:
list: List of string representations of the recommended inference methods.
"""
# Get the number of passing proteins
number_stdev_from_mean_dict = {}
fdrs = [false_discovery_rate_threshold * 0.01 * x for x in range(100)]
for fdr in fdrs:
stdev_from_mean = self.determine_number_stdev_from_mean(false_discovery_rate=fdr)
number_stdev_from_mean_dict[fdr] = stdev_from_mean
stdev_collection = collections.defaultdict(list)
for fdr in fdrs:
for key in number_stdev_from_mean_dict[fdr]:
stdev_collection[key].append(number_stdev_from_mean_dict[fdr][key])
heuristic_scores = self.generate_density_plot(
number_stdevs_from_mean=stdev_collection, pdf_filename=pdf_filename
)
# Apply conditional statement with lower and upper thresholds
if (
heuristic_scores[Inference.PARSIMONY] <= lower_empirical_threshold
or heuristic_scores[Inference.PEPTIDE_CENTRIC] <= lower_empirical_threshold
):
# If parsimony or peptide centric are less than the lower empirical threshold
# Then select the best method of the two
logger.info(
"Either parsimony {} or peptide centric {} pass empirical threshold {}. "
"Selecting the best method of the two.".format(
heuristic_scores[Inference.PARSIMONY],
heuristic_scores[Inference.PEPTIDE_CENTRIC],
lower_empirical_threshold,
)
)
sub_dict = {
Inference.PARSIMONY: heuristic_scores[Inference.PARSIMONY],
Inference.PEPTIDE_CENTRIC: heuristic_scores[Inference.PEPTIDE_CENTRIC],
}
if (
heuristic_scores[Inference.PARSIMONY] <= lower_empirical_threshold
and heuristic_scores[Inference.PEPTIDE_CENTRIC] <= lower_empirical_threshold
):
# If both are under the threshold return both
selected_methods = [Inference.PARSIMONY, Inference.PEPTIDE_CENTRIC]
else:
selected_methods = [min(sub_dict, key=sub_dict.get)]
# If the above condition does not apply
elif (
heuristic_scores[Inference.EXCLUSION] <= upper_empirical_threshold
or heuristic_scores[Inference.INCLUSION] <= upper_empirical_threshold
):
# If exclusion or inclusion are less than the upper empirical threshold
# Then select the best method of the two
logger.info(
"Either inclusion {} or exclusion {} pass empirical threshold {}. "
"Selecting the best method of the two.".format(
heuristic_scores[Inference.INCLUSION],
heuristic_scores[Inference.EXCLUSION],
upper_empirical_threshold,
)
)
sub_dict = {
Inference.EXCLUSION: heuristic_scores[Inference.EXCLUSION],
Inference.INCLUSION: heuristic_scores[Inference.INCLUSION],
}
if (
heuristic_scores[Inference.EXCLUSION] <= upper_empirical_threshold
and heuristic_scores[Inference.INCLUSION] <= upper_empirical_threshold
):
# If both are under the threshold return both
selected_methods = [Inference.INCLUSION, Inference.EXCLUSION]
else:
selected_methods = [min(sub_dict, key=sub_dict.get)]
else:
# If we have no conditional scenarios...
# Select the best method
logger.info("No methods pass empirical thresholds, selecting the best method")
selected_methods = [min(heuristic_scores, key=heuristic_scores.get)]
logger.info("Method(s) {} selected with the heuristic algorithm".format(", ".join(selected_methods)))
return selected_methods
def generate_density_plot(self, number_stdevs_from_mean, pdf_filename=None):
"""
This method produces a PDF density plot overlaying the 4 inference methods that are part of the heuristic algorithm.
Args:
number_stdevs_from_mean (dict): a dictionary of the number of standard deviations from the mean per
inference method for a range of FDRs.
pdf_filename (str): Filename to write heuristic density plot to.
Returns:
dict: a dictionary of heuristic scores per inference method which correlates to the
maximum point of the density plot per inference method.
"""
f = plt.figure()
heuristic_scores = {}
for method in number_stdevs_from_mean:
readible_method_name = Inference.INFERENCE_NAME_MAP[method]
kwargs = dict(histtype='stepfilled', alpha=0.3, density=True, bins=40, ec="k", label=readible_method_name)
x, y, _ = plt.hist(number_stdevs_from_mean[method], **kwargs)
center = y[list(x).index(max(x))]
heuristic_scores[method] = abs(center)
plt.axvline(0, color="black", linestyle='--', alpha=0.75)
plt.title("Density Plot of the Number of Standard Deviations from the Mean")
plt.xlabel('Number of Standard Deviations from the Mean')
plt.ylabel('Number of Observations')
plt.legend(loc='upper right')
if pdf_filename:
logger.info("Writing Heuristic Density plot to: {}".format(pdf_filename))
f.savefig(pdf_filename)
else:
plt.show()
plt.close()
logger.info("Heuristic Scores")
logger.info(heuristic_scores)
return heuristic_scores
def determine_number_stdev_from_mean(self, false_discovery_rate):
"""
This method calculates the mean number of proteins identified at a specific FDR across all
4 methods and then, for each method, calculates the number of standard deviations
from that mean.
Args:
false_discovery_rate (float): The false discovery rate used as a cutoff for calculations.
Returns:
dict: a dictionary of the number of standard deviations away from the mean per inference method.
"""
filtered_protein_objects = {
x: self.datastore_dict[x].get_protein_objects(
fdr_restricted=True, false_discovery_rate=false_discovery_rate
)
for x in self.datastore_dict.keys()
}
number_passing_proteins = {x: len(filtered_protein_objects[x]) for x in filtered_protein_objects.keys()}
# Calculate how similar the number of passing proteins is for each method
all_values = [x for x in number_passing_proteins.values()]
mean = numpy.mean(all_values)
standard_deviation = statistics.stdev(all_values)
number_stdev_from_mean_dict = {}
for key in number_passing_proteins.keys():
cur_value = number_passing_proteins[key]
number_stdev_from_mean_dict[key] = (cur_value - mean) / standard_deviation
return number_stdev_from_mean_dict
__init__(self, parameter_file=None, database_file=None, target_files=None, decoy_files=None, combined_files=None, target_directory=None, decoy_directory=None, combined_directory=None, output_directory=None, output_filename=None, id_splitting=False, append_alt_from_db=True, pdf_filename=None, output_type='all')
special
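Parameters:
Name | Type | Description |
---|---|---|
parameter_file | str | Path to Protein Inference Yaml Parameter File. |
database_file | str | Path to Fasta database used in proteomics search. |
target_files | str/list | Path to Target Psm File (or a list of files). |
decoy_files | str/list | Path to Decoy Psm File (or a list of files). |
combined_files | str/list | Path to Combined Psm File (or a list of files). |
target_directory | str | Path to Directory containing Target Psm Files. |
decoy_directory | str | Path to Directory containing Decoy Psm Files. |
combined_directory | str | Path to Directory containing Combined Psm Files. |
output_directory | str | Path to Directory where output will be written. |
output_filename | str | Path to Filename where output will be written. Will override output_directory. |
id_splitting | bool | True/False on whether to split protein IDs in the digest. Advanced usage only. |
append_alt_from_db | bool | True/False on whether to append alternative proteins from the DB digestion in Reader class. |
pdf_filename | str | Filepath to be written to by Heuristic Plotting method. Optional; a default filename is created in output_directory if this is left as None. |
output_type | str | How to output results. Can either be "all" or "optimal". |
Returns:
Type | Description |
---|---|
HeuristicPipeline | HeuristicPipeline object |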
Examples:
>>> heuristic = pyproteininference.heuristic.HeuristicPipeline(
>>> parameter_file=yaml_params,
>>> database_file=database,
>>> target_files=target,
>>> decoy_files=decoy,
>>> combined_files=combined_files,
>>> target_directory=target_directory,
>>> decoy_directory=decoy_directory,
>>> combined_directory=combined_directory,
>>> output_directory=dir_name,
>>> output_filename=output_filename,
>>> append_alt_from_db=append_alt,
>>> pdf_filename=pdf_filename,
>>> output_type="all"
>>> )
Source code in pyproteininference/heuristic.py
def __init__(
self,
parameter_file=None,
database_file=None,
target_files=None,
decoy_files=None,
combined_files=None,
target_directory=None,
decoy_directory=None,
combined_directory=None,
output_directory=None,
output_filename=None,
id_splitting=False,
append_alt_from_db=True,
pdf_filename=None,
output_type="all",
):
"""
Args:
parameter_file (str): Path to Protein Inference Yaml Parameter File.
database_file (str): Path to Fasta database used in proteomics search.
target_files (str/list): Path to Target Psm File (Or a list of files).
decoy_files (str/list): Path to Decoy Psm File (Or a list of files).
combined_files (str/list): Path to Combined Psm File (Or a list of files).
target_directory (str): Path to Directory containing Target Psm Files.
decoy_directory (str): Path to Directory containing Decoy Psm Files.
combined_directory (str): Path to Directory containing Combined Psm Files.
output_directory (str): Path to Directory where output will be written.
output_filename (str): Path to Filename where output will be written.
Will override output_directory.
id_splitting (bool): True/False on whether to split protein IDs in the digest.
Advanced usage only.
append_alt_from_db (bool): True/False on whether to append alternative proteins
from the DB digestion in Reader class.
pdf_filename (str): Filepath to be written to by Heuristic Plotting method.
This is optional and a default filename will be created in output_directory if this is left as None
output_type (str): How to output results. Can either be "all" or "optimal". Will either output all results
or will only output the optimal results.
Returns:
HeuristicPipeline: [HeuristicPipeline][pyproteininference.heuristic.HeuristicPipeline] object
Example:
>>> heuristic = pyproteininference.heuristic.HeuristicPipeline(
>>> parameter_file=yaml_params,
>>> database_file=database,
>>> target_files=target,
>>> decoy_files=decoy,
>>> combined_files=combined_files,
>>> target_directory=target_directory,
>>> decoy_directory=decoy_directory,
>>> combined_directory=combined_directory,
>>> output_directory=dir_name,
>>> output_filename=output_filename,
>>> append_alt_from_db=append_alt,
>>> pdf_filename=pdf_filename,
>>> output_type="all"
>>> )
"""
self.parameter_file = parameter_file
self.database_file = database_file
self.target_files = target_files
self.decoy_files = decoy_files
self.combined_files = combined_files
self.target_directory = target_directory
self.decoy_directory = decoy_directory
self.combined_directory = combined_directory
self.output_directory = output_directory
self.output_filename = output_filename
self.id_splitting = id_splitting
self.append_alt_from_db = append_alt_from_db
self.output_type = output_type
if self.output_type not in self.OUTPUT_TYPES:
raise ValueError("The variable output_type must be set to either 'all' or 'optimal'")
if not pdf_filename:
if self.output_directory and not self.output_filename:
self.pdf_filename = os.path.join(self.output_directory, "heuristic_plot.pdf")
elif self.output_filename:
self.pdf_filename = os.path.join(os.path.split(self.output_filename)[0], "heuristic_plot.pdf")
else:
self.pdf_filename = os.path.join(os.getcwd(), "heuristic_plot.pdf")
else:
self.pdf_filename = pdf_filename
self.inference_method_list = [
Inference.INCLUSION,
Inference.EXCLUSION,
Inference.PARSIMONY,
Inference.PEPTIDE_CENTRIC,
]
self.datastore_dict = {}
self.selected_methods = None
self.selected_datastores = {}
self._validate_input()
self._set_output_directory()
self._log_append_alt_from_db()
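The fallback used for `pdf_filename` in the constructor above can be sketched as a standalone function. This is an illustrative reimplementation for reading purposes only; the name `resolve_pdf_filename` is hypothetical and is not part of the library.

```python
import os


def resolve_pdf_filename(pdf_filename=None, output_directory=None, output_filename=None):
    """Hypothetical helper mirroring the pdf_filename fallback in __init__."""
    if pdf_filename:
        # An explicitly supplied path always wins.
        return pdf_filename
    if output_directory and not output_filename:
        # Only a directory was given: write the plot into it.
        return os.path.join(output_directory, "heuristic_plot.pdf")
    if output_filename:
        # An output file was given: write the plot next to it.
        return os.path.join(os.path.split(output_filename)[0], "heuristic_plot.pdf")
    # Nothing was given: fall back to the current working directory.
    return os.path.join(os.getcwd(), "heuristic_plot.pdf")
```

Note that `output_filename` takes precedence over `output_directory` when both are set, matching the statement in the docstring that `output_filename` overrides `output_directory`.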
determine_number_stdev_from_mean(self, false_discovery_rate)
This method calculates the mean number of proteins identified at a specific FDR across all 4 methods and then, for each method, calculates the number of standard deviations from that mean.
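Parameters:
Name | Type | Description |
---|---|---|
false_discovery_rate | float | The false discovery rate used as a cutoff for calculations. |
Returns:
Type | Description |
---|---|
dict | A dictionary of the number of standard deviations away from the mean per inference method. |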
Source code in pyproteininference/heuristic.py
def determine_number_stdev_from_mean(self, false_discovery_rate):
"""
This method calculates the mean number of proteins identified at a specific FDR across all
4 methods and then, for each method, calculates the number of standard deviations
from that mean.
Args:
false_discovery_rate (float): The false discovery rate used as a cutoff for calculations.
Returns:
dict: a dictionary of the number of standard deviations away from the mean per inference method.
"""
filtered_protein_objects = {
x: self.datastore_dict[x].get_protein_objects(
fdr_restricted=True, false_discovery_rate=false_discovery_rate
)
for x in self.datastore_dict.keys()
}
number_passing_proteins = {x: len(filtered_protein_objects[x]) for x in filtered_protein_objects.keys()}
# Calculate how similar the number of passing proteins is for each method
all_values = [x for x in number_passing_proteins.values()]
mean = numpy.mean(all_values)
standard_deviation = statistics.stdev(all_values)
number_stdev_from_mean_dict = {}
for key in number_passing_proteins.keys():
cur_value = number_passing_proteins[key]
number_stdev_from_mean_dict[key] = (cur_value - mean) / standard_deviation
return number_stdev_from_mean_dict
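The calculation above is a z-score of each method's passing-protein count relative to the four counts. A minimal standalone sketch follows, using made-up counts and placeholder method names. The zero-deviation guard is an assumption of this sketch; the source shown above divides by the standard deviation without such a check.

```python
import statistics


def stdevs_from_mean(counts):
    """counts: dict of inference method name -> number of passing proteins."""
    values = list(counts.values())
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)  # sample standard deviation, needs >= 2 values
    if stdev == 0:
        # Assumption of this sketch: identical counts mean zero deviation everywhere.
        return {method: 0.0 for method in counts}
    return {method: (count - mean) / stdev for method, count in counts.items()}


# Made-up counts purely for illustration.
counts = {"inclusion": 120, "exclusion": 80, "parsimony": 100, "peptide_centric": 100}
scores = stdevs_from_mean(counts)
```

Methods whose counts sit at the mean of the four get a value of zero; the sign shows whether a method reports more or fewer proteins than the average.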
determine_optimal_inference_method(self, false_discovery_rate_threshold=0.05, upper_empirical_threshold=1, lower_empirical_threshold=0.5, pdf_filename=None)
This method determines the optimal inference method from Inclusion, Exclusion, Parsimony, Peptide-Centric.
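Parameters:
Name | Type | Description |
---|---|---|
false_discovery_rate_threshold | float | The FDR threshold to use in the heuristic algorithm. This parameter determines the maximum FDR used when creating a range of finite FDR values. |
upper_empirical_threshold | float | Upper threshold, compared against the inclusion and exclusion heuristic scores. |
lower_empirical_threshold | float | Lower threshold, compared against the parsimony and peptide centric heuristic scores. |
pdf_filename | str | Filename to write heuristic density plot to. |
Returns:
Type | Description |
---|---|
list | List of string representations of the recommended inference methods. |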
Source code in pyproteininference/heuristic.py
def determine_optimal_inference_method(
self,
false_discovery_rate_threshold=0.05,
upper_empirical_threshold=1,
lower_empirical_threshold=0.5,
pdf_filename=None,
):
"""
This method determines the optimal inference method from Inclusion, Exclusion, Parsimony, Peptide-Centric.
Args:
false_discovery_rate_threshold (float): The fdr threshold to use in heuristic algorithm -
This parameter determines the maximum fdr used when creating a range of finite FDR values.
upper_empirical_threshold (float): Upper threshold used as the inclusion/exclusion cutoff for
the heuristic algorithm.
lower_empirical_threshold (float): Lower threshold used as the parsimony/peptide centric cutoff for
the heuristic algorithm.
pdf_filename (str): Filename to write heuristic density plot to.
Returns:
list: List of string representations of the recommended inference methods.
"""
# Get the number of passing proteins
number_stdev_from_mean_dict = {}
fdrs = [false_discovery_rate_threshold * 0.01 * x for x in range(100)]
for fdr in fdrs:
stdev_from_mean = self.determine_number_stdev_from_mean(false_discovery_rate=fdr)
number_stdev_from_mean_dict[fdr] = stdev_from_mean
stdev_collection = collections.defaultdict(list)
for fdr in fdrs:
for key in number_stdev_from_mean_dict[fdr]:
stdev_collection[key].append(number_stdev_from_mean_dict[fdr][key])
heuristic_scores = self.generate_density_plot(
number_stdevs_from_mean=stdev_collection, pdf_filename=pdf_filename
)
# Apply conditional statement with lower and upper thresholds
if (
heuristic_scores[Inference.PARSIMONY] <= lower_empirical_threshold
or heuristic_scores[Inference.PEPTIDE_CENTRIC] <= lower_empirical_threshold
):
# If parsimony or peptide centric are less than the lower empirical threshold
# Then select the best method of the two
logger.info(
"Either parsimony {} or peptide centric {} pass empirical threshold {}. "
"Selecting the best method of the two.".format(
heuristic_scores[Inference.PARSIMONY],
heuristic_scores[Inference.PEPTIDE_CENTRIC],
lower_empirical_threshold,
)
)
sub_dict = {
Inference.PARSIMONY: heuristic_scores[Inference.PARSIMONY],
Inference.PEPTIDE_CENTRIC: heuristic_scores[Inference.PEPTIDE_CENTRIC],
}
if (
heuristic_scores[Inference.PARSIMONY] <= lower_empirical_threshold
and heuristic_scores[Inference.PEPTIDE_CENTRIC] <= lower_empirical_threshold
):
# If both are under the threshold return both
selected_methods = [Inference.PARSIMONY, Inference.PEPTIDE_CENTRIC]
else:
selected_methods = [min(sub_dict, key=sub_dict.get)]
# If the above condition does not apply
elif (
heuristic_scores[Inference.EXCLUSION] <= upper_empirical_threshold
or heuristic_scores[Inference.INCLUSION] <= upper_empirical_threshold
):
# If exclusion or inclusion are less than the upper empirical threshold
# Then select the best method of the two
logger.info(
"Either inclusion {} or exclusion {} pass empirical threshold {}. "
"Selecting the best method of the two.".format(
heuristic_scores[Inference.INCLUSION],
heuristic_scores[Inference.EXCLUSION],
upper_empirical_threshold,
)
)
sub_dict = {
Inference.EXCLUSION: heuristic_scores[Inference.EXCLUSION],
Inference.INCLUSION: heuristic_scores[Inference.INCLUSION],
}
if (
heuristic_scores[Inference.EXCLUSION] <= upper_empirical_threshold
and heuristic_scores[Inference.INCLUSION] <= upper_empirical_threshold
):
# If both are under the threshold return both
selected_methods = [Inference.INCLUSION, Inference.EXCLUSION]
else:
selected_methods = [min(sub_dict, key=sub_dict.get)]
else:
# If we have no conditional scenarios...
# Select the best method
logger.info("No methods pass empirical thresholds, selecting the best method")
selected_methods = [min(heuristic_scores, key=heuristic_scores.get)]
logger.info("Method(s) {} selected with the heuristic algorithm".format(", ".join(selected_methods)))
return selected_methods
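The branching above reduces to a two-tier rule: parsimony and peptide centric are checked against the lower threshold first, then inclusion and exclusion against the upper threshold, and finally the overall minimum is used. The sketch below restates that rule on a plain dictionary. The string keys are placeholders for the `Inference` constants, and `select_methods` is a hypothetical name, not a library function.

```python
def select_methods(scores, lower=0.5, upper=1.0):
    """scores: dict with keys 'parsimony', 'peptide_centric', 'inclusion', 'exclusion'."""

    def pick(pair, threshold):
        passing = [m for m in pair if scores[m] <= threshold]
        if len(passing) == 2:
            return list(pair)  # both pass: recommend both
        if len(passing) == 1:
            return [min(pair, key=scores.get)]  # the better of the two
        return None  # neither passes: fall through to the next tier

    return (
        pick(("parsimony", "peptide_centric"), lower)
        or pick(("inclusion", "exclusion"), upper)
        or [min(scores, key=scores.get)]  # nothing passes: take the overall best
    )
```

Because the parsimony and peptide centric tier is evaluated first, a very low inclusion score does not matter once either of those two methods passes the lower threshold.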
execute(self, fdr_threshold=0.05)
This method is the main driver of the heuristic method. This method calls other classes and methods that make up the heuristic pipeline. This includes but is not limited to:
- Loops over the main inference methods: Inclusion, Exclusion, Parsimony, and Peptide Centric.
- Determines the optimal inference method based on the input data as well as the database file.
- Outputs the results and indicates the optimal results.
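Parameters:
Name | Type | Description |
---|---|---|
fdr_threshold | float | The Qvalue/FDR threshold the heuristic method uses to base calculations from. |
Returns:
Type | Description |
---|---|
None | |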
Examples:
>>> heuristic = pyproteininference.heuristic.HeuristicPipeline(
>>> parameter_file=yaml_params,
>>> database_file=database,
>>> target_files=target,
>>> decoy_files=decoy,
>>> combined_files=combined_files,
>>> target_directory=target_directory,
>>> decoy_directory=decoy_directory,
>>> combined_directory=combined_directory,
>>> output_directory=dir_name,
>>> output_filename=output_filename,
>>> append_alt_from_db=append_alt,
>>> pdf_filename=pdf_filename,
>>> output_type="all"
>>> )
>>> heuristic.execute(fdr_threshold=0.05)
Source code in pyproteininference/heuristic.py
def execute(self, fdr_threshold=0.05):
"""
This method is the main driver of the heuristic method.
This method calls other classes and methods that make up the heuristic pipeline.
This includes but is not limited to:
1. Loops over the main inference methods: Inclusion, Exclusion, Parsimony, and Peptide Centric.
2. Determines the optimal inference method based on the input data as well as the database file.
3. Outputs the results and indicates the optimal results.
Args:
fdr_threshold (float): The Qvalue/FDR threshold the heuristic method uses to base calculations from.
Returns:
None:
Example:
>>> heuristic = pyproteininference.heuristic.HeuristicPipeline(
>>> parameter_file=yaml_params,
>>> database_file=database,
>>> target_files=target,
>>> decoy_files=decoy,
>>> combined_files=combined_files,
>>> target_directory=target_directory,
>>> decoy_directory=decoy_directory,
>>> combined_directory=combined_directory,
>>> output_directory=dir_name,
>>> output_filename=output_filename,
>>> append_alt_from_db=append_alt,
>>> pdf_filename=pdf_filename,
>>> output_type="all"
>>> )
>>> heuristic.execute(fdr_threshold=0.05)
"""
pyproteininference_parameters = pyproteininference.parameters.ProteinInferenceParameter(
yaml_param_filepath=self.parameter_file
)
digest = pyproteininference.in_silico_digest.PyteomicsDigest(
database_path=self.database_file,
digest_type=pyproteininference_parameters.digest_type,
missed_cleavages=pyproteininference_parameters.missed_cleavages,
reviewed_identifier_symbol=pyproteininference_parameters.reviewed_identifier_symbol,
max_peptide_length=pyproteininference_parameters.restrict_peptide_length,
id_splitting=self.id_splitting,
)
if self.database_file:
logger.info("Running In Silico Database Digest on file {}".format(self.database_file))
digest.digest_fasta_database()
else:
logger.warning(
"No Database File provided, Skipping database digest and only taking protein-peptide mapping from the "
"input files."
)
for inference_method in self.inference_method_list:
method_specific_parameters = copy.deepcopy(pyproteininference_parameters)
logger.info("Overriding inference type {}".format(method_specific_parameters.inference_type))
method_specific_parameters.inference_type = inference_method
logger.info("New inference type {}".format(method_specific_parameters.inference_type))
logger.info("FDR Threshold Set to {}".format(method_specific_parameters.fdr))
reader = pyproteininference.reader.GenericReader(
target_file=self.target_files,
decoy_file=self.decoy_files,
combined_files=self.combined_files,
parameter_file_object=method_specific_parameters,
digest=digest,
append_alt_from_db=self.append_alt_from_db,
)
reader.read_psms()
data = pyproteininference.datastore.DataStore(reader=reader, digest=digest)
data.restrict_psm_data()
data.recover_mapping()
data.create_scoring_input()
if method_specific_parameters.inference_type == Inference.EXCLUSION:
data.exclude_non_distinguishing_peptides()
score = pyproteininference.scoring.Score(data=data)
score.score_psms(score_method=method_specific_parameters.protein_score)
if method_specific_parameters.picker:
data.protein_picker()
else:
pass
pyproteininference.inference.Inference.run_inference(data=data, digest=digest)
data.calculate_q_values()
self.datastore_dict[inference_method] = data
self.selected_methods = self.determine_optimal_inference_method(
false_discovery_rate_threshold=fdr_threshold, pdf_filename=self.pdf_filename
)
self.selected_datastores = {x: self.datastore_dict[x] for x in self.selected_methods}
if self.output_type == "all":
self._write_all_results(parameters=method_specific_parameters)
elif self.output_type == "optimal":
self._write_optimal_results(parameters=method_specific_parameters)
else:
self._write_optimal_results(parameters=method_specific_parameters)
generate_density_plot(self, number_stdevs_from_mean, pdf_filename=None)
This method produces a PDF density plot overlaying the 4 inference methods that are part of the heuristic algorithm.
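Parameters:
Name | Type | Description |
---|---|---|
number_stdevs_from_mean | dict | A dictionary of the number of standard deviations from the mean per inference method for a range of FDRs. |
pdf_filename | str | Filename to write heuristic density plot to. |
Returns:
Type | Description |
---|---|
dict | A dictionary of heuristic scores per inference method, which correlates to the maximum point of the density plot per inference method. |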
Source code in pyproteininference/heuristic.py
def generate_density_plot(self, number_stdevs_from_mean, pdf_filename=None):
"""
This method produces a PDF density plot overlaying the 4 inference methods that are part of the heuristic algorithm.
Args:
number_stdevs_from_mean (dict): a dictionary of the number of standard deviations from the mean per
inference method for a range of FDRs.
pdf_filename (str): Filename to write heuristic density plot to.
Returns:
dict: a dictionary of heuristic scores per inference method which correlates to the
maximum point of the density plot per inference method.
"""
f = plt.figure()
heuristic_scores = {}
for method in number_stdevs_from_mean:
readible_method_name = Inference.INFERENCE_NAME_MAP[method]
kwargs = dict(histtype='stepfilled', alpha=0.3, density=True, bins=40, ec="k", label=readible_method_name)
x, y, _ = plt.hist(number_stdevs_from_mean[method], **kwargs)
center = y[list(x).index(max(x))]
heuristic_scores[method] = abs(center)
plt.axvline(0, color="black", linestyle='--', alpha=0.75)
plt.title("Density Plot of the Number of Standard Deviations from the Mean")
plt.xlabel('Number of Standard Deviations from the Mean')
plt.ylabel('Number of Observations')
plt.legend(loc='upper right')
if pdf_filename:
logger.info("Writing Heuristic Density plot to: {}".format(pdf_filename))
f.savefig(pdf_filename)
else:
plt.show()
plt.close()
logger.info("Heuristic Scores")
logger.info(heuristic_scores)
return heuristic_scores
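In the listing above, `plt.hist` returns the bin counts, the bin edges, and the drawn patches, so `x` holds counts and `y` holds edges. The heuristic score is therefore the absolute value of the left edge of the fullest bin. The sketch below reproduces that idea without matplotlib, assuming equal-width bins between the minimum and maximum value; the function names are hypothetical. Density normalization is omitted because it rescales all equal-width bins by the same factor and does not change which bin is fullest.

```python
def peak_bin_left_edge(values, bins=40):
    """Return the left edge of the fullest of `bins` equal-width bins."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # Degenerate case: widen the range so bins have non-zero width
        # (an assumption of this sketch).
        lo, hi = lo - 0.5, hi + 0.5
    width = (hi - lo) / bins
    counts = [0] * bins
    for value in values:
        # The last bin is closed on the right, so the maximum falls into it.
        index = min(int((value - lo) / width), bins - 1)
        counts[index] += 1
    best = counts.index(max(counts))  # first fullest bin, as list.index does in the source
    return lo + best * width


def heuristic_score(values, bins=40):
    """Absolute position of the histogram peak; smaller means closer to the mean."""
    return abs(peak_bin_left_edge(values, bins))
```

A method whose standard-deviation values cluster near zero across the FDR range gets a score near zero, which is why the selection step prefers the minimum score.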
generate_roc_plot(self, fdr_max=0.2, pdf_filename=None)
This method produces a PDF ROC plot overlaying the 4 inference methods that are part of the heuristic algorithm.
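Parameters:
Name | Type | Description |
---|---|---|
fdr_max | float | Max FDR to display on the plot. |
pdf_filename | str | Filename to write ROC plot to. |
Returns:
Type | Description |
---|---|
None | |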
Source code in pyproteininference/heuristic.py
def generate_roc_plot(self, fdr_max=0.2, pdf_filename=None):
"""
This method produces a PDF ROC plot overlaying the 4 inference methods that are part of the heuristic algorithm.
Args:
fdr_max (float): Max FDR to display on the plot.
pdf_filename (str): Filename to write roc plot to.
Returns:
None:
"""
f = plt.figure()
for inference_method in self.datastore_dict.keys():
fdr_vs_target_hits = self.datastore_dict[inference_method].generate_fdr_vs_target_hits(fdr_max=fdr_max)
fdrs = [x[0] for x in fdr_vs_target_hits]
target_hits = [x[1] for x in fdr_vs_target_hits]
plt.plot(fdrs, target_hits, '-', label=inference_method.replace("_", " "))
target_fdr = self.datastore_dict[inference_method].parameter_file_object.fdr
if inference_method in self.selected_methods:
best_value = min(fdrs, key=lambda x: abs(x - target_fdr))
best_index = fdrs.index(best_value)
best_target_hit_value = target_hits[best_index] # noqa F841
plt.axvline(target_fdr, color="black", linestyle='--', alpha=0.75, label="Target FDR")
plt.legend()
plt.xlabel('Decoy FDR')
plt.ylabel('Target Protein Hits')
plt.xlim([-0.01, fdr_max])
plt.legend(loc='lower right')
plt.title("FDR vs Target Protein Hits per Inference Method")
if pdf_filename:
logger.info("Writing ROC plot to: {}".format(pdf_filename))
f.savefig(pdf_filename)
plt.close()
in_silico_digest
Digest
The following class handles data storage of in silico digest data from a fasta formatted sequence database.
Attributes:
Name | Type | Description |
---|---|---|
peptide_to_protein_dictionary |
dict |
Dictionary of peptides (keys) to protein sets (values). |
protein_to_peptide_dictionary |
dict |
Dictionary of proteins (keys) to peptide sets (values). |
swiss_prot_protein_set |
set |
Set of reviewed proteins if they are able to be distinguished from unreviewed proteins. |
database_path |
str |
Path to fasta database file to digest. |
missed_cleavages |
int |
The number of missed cleavages to allow. |
id_splitting |
bool |
True/False on whether or not to split a given regex off identifiers. This is used to split off "sp|" and "tr|" from the database protein strings, as sometimes the database will contain those strings while the input data will have the strings split already. Advanced usage only. |
reviewed_identifier_symbol |
str/None |
Identifier that distinguishes reviewed from unreviewed proteins. Typically this is "sp|". Can also be None type. |
digest_type |
str |
Can be any value in LIST_OF_DIGEST_TYPES. |
max_peptide_length |
int |
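Peptide length cutoff for the analysis. Despite the name, digest_fasta_database passes this value to parser.cleave as min_length, so it acts as a minimum peptide length. |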
Source code in pyproteininference/in_silico_digest.py
class Digest(object):
"""
The following class handles data storage of in silico digest data from a fasta formatted sequence database.
Attributes:
peptide_to_protein_dictionary (dict): Dictionary of peptides (keys) to protein sets (values).
protein_to_peptide_dictionary (dict): Dictionary of proteins (keys) to peptide sets (values).
swiss_prot_protein_set (set): Set of reviewed proteins if they are able to be distinguished from unreviewed
proteins.
database_path (str): Path to fasta database file to digest.
missed_cleavages (int): The number of missed cleavages to allow.
id_splitting (bool): True/False on whether or not to split a given regex off identifiers.
This is used to split off "sp|" and "tr|"
from the database protein strings as sometimes the database will contain those strings while
the input data will have the strings split already.
Advanced usage only.
reviewed_identifier_symbol (str/None): Identifier that distinguishes reviewed from unreviewed proteins.
Typically this is "sp|". Can also be None type.
digest_type (str): can be any value in `LIST_OF_DIGEST_TYPES`.
max_peptide_length (int): Peptide length cutoff for the analysis. Despite the name,
`digest_fasta_database` passes this value to `parser.cleave` as `min_length`.
"""
TRYPSIN = "trypsin"
LYSC = "lysc"
LIST_OF_DIGEST_TYPES = set(parser.expasy_rules.keys())
AA_LIST = [
"A",
"R",
"N",
"D",
"C",
"E",
"Q",
"G",
"H",
"I",
"L",
"K",
"M",
"F",
"P",
"S",
"T",
"W",
"Y",
"V",
]
UNIPROT_STRS = r"sp\||tr\|"  # raw string so the regex escapes reach re.compile unchanged
UNIPROT_STR_REGEX = re.compile(UNIPROT_STRS)
SP_STRING = "sp|"
METHIONINE = "M"
ANY_AMINO_ACID = "X"
def __init__(self):
pass
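To illustrate what `UNIPROT_STR_REGEX` and `SP_STRING` are used for, the sketch below applies the same pattern to a FASTA header. The header string is only an illustrative input, and the variable names outside the regex are assumptions of this example.

```python
import re

# Same pattern as Digest.UNIPROT_STRS, written as a raw string.
UNIPROT_STR_REGEX = re.compile(r"sp\||tr\|")
SP_STRING = "sp|"

# Illustrative UniProt-style FASTA header (identifier, then a description).
header = "sp|P12345|AATM_RABIT Aspartate aminotransferase"

identifier = header.split(" ")[0]                 # first space-delimited token
is_reviewed = identifier.startswith(SP_STRING)    # reviewed entries carry "sp|"
stripped = UNIPROT_STR_REGEX.sub("", identifier)  # drop the "sp|" or "tr|" marker
```

With `id_splitting=False` the identifier would be kept as `sp|P12345|AATM_RABIT`; with `id_splitting=True` the stripped form is used as the dictionary key instead.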
PyteomicsDigest (Digest)
This class represents a pyteomics implementation of an in silico digest.
Source code in pyproteininference/in_silico_digest.py
class PyteomicsDigest(Digest):
"""
This class represents a pyteomics implementation of an in silico digest.
"""
def __init__(
self,
database_path,
digest_type,
missed_cleavages,
reviewed_identifier_symbol,
max_peptide_length,
id_splitting=True,
):
"""
The following class creates protein to peptide, peptide to protein, and reviewed protein mappings.
The input is a fasta database, a protein inference parameter object, and whether or not to split IDs.
This class sets important attributes for the Digest object such as: `peptide_to_protein_dictionary`,
`protein_to_peptide_dictionary`, and `swiss_prot_protein_set`.
Args:
database_path (str): Path to fasta database file to digest.
digest_type (str): Must be a value in `LIST_OF_DIGEST_TYPES`.
missed_cleavages (int): Integer that indicates the maximum number of allowable missed cleavages from
the ms search.
reviewed_identifier_symbol (str/None): Symbol that indicates a reviewed identifier.
If using Uniprot this is typically 'sp|'.
max_peptide_length (int): Peptide length cutoff for the analysis. Despite the name,
`digest_fasta_database` passes this value to `parser.cleave` as `min_length`.
id_splitting (bool): True/False on whether or not to split a given regex off identifiers.
This is used to split off "sp|" and "tr|"
from the database protein strings as sometimes the database will contain those
strings while the input data will have the strings split already.
Advanced usage only.
Example:
>>> digest = pyproteininference.in_silico_digest.PyteomicsDigest(
>>> database_path=database_file,
>>> digest_type='trypsin',
>>> missed_cleavages=2,
>>> reviewed_identifier_symbol='sp|',
>>> max_peptide_length=7,
>>> id_splitting=False,
>>> )
"""
self.peptide_to_protein_dictionary = {}
self.protein_to_peptide_dictionary = {}
self.swiss_prot_protein_set = set()
self.database_path = database_path
self.missed_cleavages = missed_cleavages
self.id_splitting = id_splitting
self.reviewed_identifier_symbol = reviewed_identifier_symbol
if digest_type in self.LIST_OF_DIGEST_TYPES:
self.digest_type = digest_type
else:
raise ValueError(
"digest_type must be equal to one of the following {}".format(str(self.LIST_OF_DIGEST_TYPES))
)
self.max_peptide_length = max_peptide_length
def digest_fasta_database(self):
"""
This method reads in and prepares the fasta database for database digestion and assigns
several attributes of the Digest object: `peptide_to_protein_dictionary`,
`protein_to_peptide_dictionary`, and `swiss_prot_protein_set`.
Returns:
None:
Example:
>>> digest = pyproteininference.in_silico_digest.PyteomicsDigest(
>>> database_path=database_file,
>>> digest_type='trypsin',
>>> missed_cleavages=2,
>>> reviewed_identifier_symbol='sp|',
>>> max_peptide_length=7,
>>> id_splitting=False,
>>> )
>>> digest.digest_fasta_database()
"""
logger.info("Starting Pyteomics Digest...")
pep_dict = {}
prot_dict = {}
sp_set = set()
for description, sequence in fasta.read(self.database_path):
new_peptides = parser.cleave(
sequence,
parser.expasy_rules[self.digest_type],
self.missed_cleavages,
min_length=self.max_peptide_length,
)
# Take the first space-delimited token of the FASTA header as the protein identifier.
# Headers that contain no space are used whole.
identifier = description.split(" ")[0]
# Handle ID Splitting...
if self.id_splitting:
identifier_stripped = self.UNIPROT_STR_REGEX.sub("", identifier)
else:
identifier_stripped = identifier
# If reviewed add to sp_set
if self.reviewed_identifier_symbol:
if identifier.startswith(self.reviewed_identifier_symbol):
sp_set.add(identifier_stripped)
prot_dict[identifier_stripped] = new_peptides
met_cleaved_peps = set()
for peptide in new_peptides:
pep_dict.setdefault(peptide, set()).add(identifier_stripped)
# Need to account for potential N-term Methionine Cleavage
if sequence.startswith(peptide) and peptide.startswith(self.METHIONINE):
# If our sequence starts with the current peptide... and our current peptide starts with methionine
# Then we remove the methionine from the peptide and add it to our dicts...
methionine_cleaved_peptide = peptide[1:]
met_cleaved_peps.add(methionine_cleaved_peptide)
for met_peps in met_cleaved_peps:
pep_dict.setdefault(met_peps, set()).add(identifier_stripped)
prot_dict[identifier_stripped].add(met_peps)
self.swiss_prot_protein_set = sp_set
self.peptide_to_protein_dictionary = pep_dict
self.protein_to_peptide_dictionary = prot_dict
logger.info("Pyteomics Digest Finished...")
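The mapping logic in `digest_fasta_database`, including the N-terminal methionine handling, can be sketched without pyteomics. The cleavage rule below (cut after K or R unless the next residue is P) is a simplified stand-in for the pyteomics `expasy_rules` trypsin rule, and both function names are hypothetical. The sketch assumes Python 3.7 or later, where `re.split` splits on zero-width matches.

```python
import re


def simple_trypsin_cleave(sequence, missed_cleavages=0, min_length=1):
    """Simplified tryptic digest: cut after K or R unless the next residue is P."""
    pieces = [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]
    peptides = set()
    for start in range(len(pieces)):
        for extra in range(missed_cleavages + 1):
            if start + extra < len(pieces):
                peptide = "".join(pieces[start:start + extra + 1])
                if len(peptide) >= min_length:
                    peptides.add(peptide)
    return peptides


def digest(proteins, missed_cleavages=0, min_length=1):
    """proteins: dict of identifier -> sequence. Returns (peptide->proteins, protein->peptides)."""
    pep_to_prot, prot_to_pep = {}, {}
    for identifier, sequence in proteins.items():
        peptides = simple_trypsin_cleave(sequence, missed_cleavages, min_length)
        # N-terminal peptides starting with M are also stored without the methionine.
        met_cleaved = {
            p[1:] for p in peptides if sequence.startswith(p) and p.startswith("M")
        }
        all_peptides = peptides | met_cleaved
        prot_to_pep[identifier] = all_peptides
        for peptide in all_peptides:
            pep_to_prot.setdefault(peptide, set()).add(identifier)
    return pep_to_prot, prot_to_pep


# Made-up sequences purely for illustration.
pep_to_prot, prot_to_pep = digest({"P1": "MAKGLR", "P2": "AKGLR"})
```

In this toy input the methionine-cleaved peptide `AK` from `P1` coincides with a regular peptide of `P2`, so it maps to both proteins, which is the situation the methionine handling is meant to cover.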
__init__(self, database_path, digest_type, missed_cleavages, reviewed_identifier_symbol, max_peptide_length, id_splitting=True)
special
The following class creates protein to peptide, peptide to protein, and reviewed protein mappings.
The input is a fasta database, a protein inference parameter object, and whether or not to split IDs.
This class sets important attributes for the Digest object such as: `peptide_to_protein_dictionary`, `protein_to_peptide_dictionary`, and `swiss_prot_protein_set`.
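Parameters:
Name | Type | Description |
---|---|---|
database_path | str | Path to fasta database file to digest. |
digest_type | str | Must be a value in LIST_OF_DIGEST_TYPES. |
missed_cleavages | int | Integer that indicates the maximum number of allowable missed cleavages from the ms search. |
reviewed_identifier_symbol | str/None | Symbol that indicates a reviewed identifier. If using Uniprot this is typically 'sp\|'. |
max_peptide_length | int | Peptide length cutoff for the analysis. Passed to parser.cleave as min_length by digest_fasta_database. |
id_splitting | bool | True/False on whether or not to split a given regex off identifiers. Used to split off "sp\|" and "tr\|" from the database protein strings. Advanced usage only. |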
Examples:
>>> digest = pyproteininference.in_silico_digest.PyteomicsDigest(
>>> database_path=database_file,
>>> digest_type='trypsin',
>>> missed_cleavages=2,
>>> reviewed_identifier_symbol='sp|',
>>> max_peptide_length=7,
>>> id_splitting=False,
>>> )
Source code in pyproteininference/in_silico_digest.py
def __init__(
self,
database_path,
digest_type,
missed_cleavages,
reviewed_identifier_symbol,
max_peptide_length,
id_splitting=True,
):
"""
The following class creates protein to peptide, peptide to protein, and reviewed protein mappings.
The input is a fasta database, a protein inference parameter object, and whether or not to split IDs.
This class sets important attributes for the Digest object such as: `peptide_to_protein_dictionary`,
`protein_to_peptide_dictionary`, and `swiss_prot_protein_set`.
Args:
database_path (str): Path to fasta database file to digest.
digest_type (str): Must be a value in `LIST_OF_DIGEST_TYPES`.
missed_cleavages (int): Integer that indicates the maximum number of allowable missed cleavages from
the ms search.
reviewed_identifier_symbol (str/None): Symbol that indicates a reviewed identifier.
If using Uniprot this is typically 'sp|'.
max_peptide_length (int): Peptide length cutoff for the analysis. Despite the name,
`digest_fasta_database` passes this value to `parser.cleave` as `min_length`.
id_splitting (bool): True/False on whether or not to split a given regex off identifiers.
This is used to split off "sp|" and "tr|"
from the database protein strings as sometimes the database will contain those
strings while the input data will have the strings split already.
Advanced usage only.
Example:
>>> digest = pyproteininference.in_silico_digest.PyteomicsDigest(
>>> database_path=database_file,
>>> digest_type='trypsin',
>>> missed_cleavages=2,
>>> reviewed_identifier_symbol='sp|',
>>> max_peptide_length=7,
>>> id_splitting=False,
>>> )
"""
self.peptide_to_protein_dictionary = {}
self.protein_to_peptide_dictionary = {}
self.swiss_prot_protein_set = set()
self.database_path = database_path
self.missed_cleavages = missed_cleavages
self.id_splitting = id_splitting
self.reviewed_identifier_symbol = reviewed_identifier_symbol
if digest_type in self.LIST_OF_DIGEST_TYPES:
self.digest_type = digest_type
else:
raise ValueError(
"digest_type must be equal to one of the following {}".format(str(self.LIST_OF_DIGEST_TYPES))
)
self.max_peptide_length = max_peptide_length
digest_fasta_database(self)
This method reads in and prepares the fasta database for database digestion and assigns several attributes of the Digest object: `peptide_to_protein_dictionary`, `protein_to_peptide_dictionary`, and `swiss_prot_protein_set`.
Returns:
Type | Description |
---|---|
None | |
Examples:
>>> digest = pyproteininference.in_silico_digest.PyteomicsDigest(
...     database_path=database_file,
...     digest_type='trypsin',
...     missed_cleavages=2,
...     reviewed_identifier_symbol='sp|',
...     max_peptide_length=7,
...     id_splitting=False,
... )
>>> digest.digest_fasta_database()
Source code in pyproteininference/in_silico_digest.py
def digest_fasta_database(self):
    """
    This method reads in and prepares the fasta database for database digestion and assigns
    several attributes of the Digest object: `peptide_to_protein_dictionary`,
    `protein_to_peptide_dictionary`, and `swiss_prot_protein_set`.
    Returns:
        None:
    Example:
        >>> digest = pyproteininference.in_silico_digest.PyteomicsDigest(
        ...     database_path=database_file,
        ...     digest_type='trypsin',
        ...     missed_cleavages=2,
        ...     reviewed_identifier_symbol='sp|',
        ...     max_peptide_length=7,
        ...     id_splitting=False,
        ... )
        >>> digest.digest_fasta_database()
    """
    logger.info("Starting Pyteomics Digest...")
    pep_dict = {}
    prot_dict = {}
    sp_set = set()
    for description, sequence in fasta.read(self.database_path):
        new_peptides = parser.cleave(
            sequence,
            parser.expasy_rules[self.digest_type],
            self.missed_cleavages,
            # Note: max_peptide_length is passed here as the min_length argument of parser.cleave
            min_length=self.max_peptide_length,
        )
        # Use the first space-delimited token of the description as the identifier
        # (this may not be robust for every fasta header format)
        identifier = description.split(" ")[0]
        # Handle ID Splitting...
        if self.id_splitting:
            identifier_stripped = self.UNIPROT_STR_REGEX.sub("", identifier)
        else:
            identifier_stripped = identifier
        # If reviewed add to sp_set
        if self.reviewed_identifier_symbol:
            if identifier.startswith(self.reviewed_identifier_symbol):
                sp_set.add(identifier_stripped)
        prot_dict[identifier_stripped] = new_peptides
        met_cleaved_peps = set()
        for peptide in new_peptides:
            pep_dict.setdefault(peptide, set()).add(identifier_stripped)
            # Need to account for potential N-term Methionine Cleavage
            if sequence.startswith(peptide) and peptide.startswith(self.METHIONINE):
                # If our sequence starts with the current peptide... and our current peptide starts with methionine
                # Then we remove the methionine from the peptide and add it to our dicts...
                methionine_cleaved_peptide = peptide[1:]
                met_cleaved_peps.add(methionine_cleaved_peptide)
        for met_peps in met_cleaved_peps:
            pep_dict.setdefault(met_peps, set()).add(identifier_stripped)
            prot_dict[identifier_stripped].add(met_peps)
    self.swiss_prot_protein_set = sp_set
    self.peptide_to_protein_dictionary = pep_dict
    self.protein_to_peptide_dictionary = prot_dict
    logger.info("Pyteomics Digest Finished...")
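The digestion logic above can be illustrated without pyteomics. The following is a minimal sketch, not the library's implementation: the `cleave` and `digest_proteins` helpers and the simplified trypsin rule (cleave after K or R unless followed by P) are assumptions made for illustration. It builds the same two mappings (peptide to proteins, protein to peptides) and adds the N-terminal methionine-cleaved form of the first peptide.

```python
import re

# Simplified trypsin rule (an assumption for this sketch):
# cleave after K or R unless the next residue is P.
TRYPSIN_SITE = re.compile(r"(?<=[KR])(?!P)")


def cleave(sequence, missed_cleavages=0, min_length=1):
    """Return the set of peptides from a simplified tryptic digest."""
    # Splitting on zero-width matches requires Python 3.7+.
    pieces = [p for p in TRYPSIN_SITE.split(sequence) if p]
    peptides = set()
    for start in range(len(pieces)):
        # Join 1 .. missed_cleavages + 1 consecutive pieces.
        last = min(start + missed_cleavages + 1, len(pieces))
        for end in range(start + 1, last + 1):
            peptide = "".join(pieces[start:end])
            if len(peptide) >= min_length:
                peptides.add(peptide)
    return peptides


def digest_proteins(proteins, missed_cleavages=0, min_length=1):
    """Build peptide -> proteins and protein -> peptides mappings.

    `proteins` is a hypothetical dict of identifier -> sequence.
    """
    pep_to_prot = {}
    prot_to_pep = {}
    for identifier, sequence in proteins.items():
        peptides = cleave(sequence, missed_cleavages, min_length)
        # Account for N-terminal methionine cleavage: also keep the
        # Met-less form of any peptide that starts the sequence with M.
        for peptide in list(peptides):
            if sequence.startswith(peptide) and peptide.startswith("M") and len(peptide) > 1:
                peptides.add(peptide[1:])
        prot_to_pep[identifier] = peptides
        for peptide in peptides:
            pep_to_prot.setdefault(peptide, set()).add(identifier)
    return pep_to_prot, prot_to_pep


pep_to_prot, prot_to_pep = digest_proteins({"P1": "MKAAR", "P2": "AARGGK"})
```

In this toy input the peptide `AAR` maps to both proteins, which is exactly the kind of shared peptide the inference methods below have to resolve.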
inference
Exclusion (Inference)
Exclusion Inference class. This class contains methods that support the initialization of an Exclusion inference method.
Attributes:
Name | Type | Description |
---|---|---|
data | DataStore | DataStore object. |
digest | Digest | Digest object. |
scored_data | list | List of scored Protein objects. |
Source code in pyproteininference/inference.py
class Exclusion(Inference):
"""
Exclusion Inference class. This class contains methods that support the initialization of an
Exclusion inference method.
Attributes:
data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
scored_data (list): a List of scored [Protein][pyproteininference.physical.Protein] objects.
"""
def __init__(self, data, digest):
"""
Initialization method of the Exclusion Class.
Args:
data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
"""
self.data = data
self.digest = digest
self.data._validate_scored_proteins()
self.scored_data = self.data.get_protein_data()
self.list_of_prots_not_in_db = None
self.list_of_peps_not_in_db = None
def infer_proteins(self):
"""
This method performs the Exclusion inference/grouping method.
For the exclusion inference method groups cannot be created because all shared peptides are removed.
This method assigns the variables: `grouped_scored_proteins` and `protein_group_objects`.
These are both variables of the [DataStore Object][pyproteininference.datastore.DataStore] and are
lists of [Protein][pyproteininference.physical.Protein] objects
and [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.
"""
grouped_proteins = self._create_protein_groups(scored_proteins=self.scored_data)
hl = self.data.higher_or_lower()
logger.info("Applying Group ID's for the Exclusion Method")
regrouped_proteins = self._apply_protein_group_ids(
grouped_protein_objects=grouped_proteins,
)
grouped_protein_objects = regrouped_proteins["grouped_protein_objects"]
protein_group_objects = regrouped_proteins["group_objects"]
logger.info("Sorting Results based on lead Protein Score")
grouped_protein_objects = datastore.DataStore.sort_protein_objects(
grouped_protein_objects=grouped_protein_objects, higher_or_lower=hl
)
protein_group_objects = datastore.DataStore.sort_protein_group_objects(
protein_group_objects=protein_group_objects, higher_or_lower=hl
)
self.data.grouped_scored_proteins = grouped_protein_objects
self.data.protein_group_objects = protein_group_objects
__init__(self, data, digest)
special
Initialization method of the Exclusion Class.
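Parameters:
Name | Type | Description |
---|---|---|
data | DataStore | DataStore object. |
digest | Digest | Digest object. |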
Source code in pyproteininference/inference.py
def __init__(self, data, digest):
    """
    Initialization method of the Exclusion Class.
    Args:
        data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
        digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
    """
    self.data = data
    self.digest = digest
    self.data._validate_scored_proteins()
    self.scored_data = self.data.get_protein_data()
    self.list_of_prots_not_in_db = None
    self.list_of_peps_not_in_db = None
infer_proteins(self)
This method performs the Exclusion inference/grouping method.
For the exclusion inference method, groups cannot be created because all shared peptides are removed.
This method assigns the variables grouped_scored_proteins and protein_group_objects. Both are variables of the DataStore object and are lists of Protein objects and ProteinGroup objects, respectively.
Source code in pyproteininference/inference.py
def infer_proteins(self):
    """
    This method performs the Exclusion inference/grouping method.
    For the exclusion inference method groups cannot be created because all shared peptides are removed.
    This method assigns the variables: `grouped_scored_proteins` and `protein_group_objects`.
    These are both variables of the [DataStore Object][pyproteininference.datastore.DataStore] and are
    lists of [Protein][pyproteininference.physical.Protein] objects
    and [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.
    """
    grouped_proteins = self._create_protein_groups(scored_proteins=self.scored_data)
    hl = self.data.higher_or_lower()
    logger.info("Applying Group ID's for the Exclusion Method")
    regrouped_proteins = self._apply_protein_group_ids(
        grouped_protein_objects=grouped_proteins,
    )
    grouped_protein_objects = regrouped_proteins["grouped_protein_objects"]
    protein_group_objects = regrouped_proteins["group_objects"]
    logger.info("Sorting Results based on lead Protein Score")
    grouped_protein_objects = datastore.DataStore.sort_protein_objects(
        grouped_protein_objects=grouped_protein_objects, higher_or_lower=hl
    )
    protein_group_objects = datastore.DataStore.sort_protein_group_objects(
        protein_group_objects=protein_group_objects, higher_or_lower=hl
    )
    self.data.grouped_scored_proteins = grouped_protein_objects
    self.data.protein_group_objects = protein_group_objects
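The idea behind exclusion, that shared peptides are removed and every remaining protein forms its own group, can be sketched with plain dictionaries. This is an illustration only: `exclude_shared_peptides` and `singleton_groups` are hypothetical helpers, and dropping proteins that are left with no peptides is an assumption of this sketch rather than documented library behaviour.

```python
def exclude_shared_peptides(protein_to_peptides):
    """Keep only peptides that map to exactly one protein."""
    # Count how many proteins each peptide maps to.
    counts = {}
    for peptides in protein_to_peptides.values():
        for peptide in peptides:
            counts[peptide] = counts.get(peptide, 0) + 1
    # Retain unique peptides per protein.
    unique = {
        protein: {p for p in peptides if counts[p] == 1}
        for protein, peptides in protein_to_peptides.items()
    }
    # Assumption of this sketch: proteins left with no peptides are dropped.
    return {protein: peptides for protein, peptides in unique.items() if peptides}


def singleton_groups(proteins):
    """With no shared peptides left, every protein is its own group."""
    return [[protein] for protein in proteins]


mapping = {"A": {"p1", "p2"}, "B": {"p2", "p3"}, "C": {"p2"}}
remaining = exclude_shared_peptides(mapping)
groups = singleton_groups(sorted(remaining))
```

Here `p2` is shared by all three proteins, so it is discarded; protein `C` has nothing left and does not appear in the result.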
FirstProtein (Inference)
FirstProtein Inference class. This class contains methods that support the initialization of a FirstProtein inference method.
Attributes:
Name | Type | Description |
---|---|---|
data | DataStore | DataStore object. |
digest | Digest | Digest object. |
Source code in pyproteininference/inference.py
class FirstProtein(Inference):
"""
FirstProtein Inference class. This class contains methods that support the initialization of a
FirstProtein inference method.
Attributes:
data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
"""
def __init__(self, data, digest):
"""
FirstProtein Inference initialization method.
Args:
data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
Returns:
object:
"""
self.data = data
self.digest = digest
self.data._validate_scored_proteins()
self.scored_data = self.data.get_protein_data()
self.data = data
def infer_proteins(self):
"""
This method performs the First Protein inference method.
This method assigns the variables: `grouped_scored_proteins` and `protein_group_objects`.
These are both variables of the [DataStore object][pyproteininference.datastore.DataStore] and are
lists of [Protein][pyproteininference.physical.Protein] objects
and [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.
"""
grouped_proteins = self._create_protein_groups(scored_proteins=self.scored_data)
# Get the higher or lower variable
hl = self.data.higher_or_lower()
logger.info("Applying Group ID's for the First Protein Method")
regrouped_proteins = self._apply_protein_group_ids(
grouped_protein_objects=grouped_proteins,
)
grouped_protein_objects = regrouped_proteins["grouped_protein_objects"]
protein_group_objects = regrouped_proteins["group_objects"]
logger.info("Sorting Results based on lead Protein Score")
grouped_protein_objects = datastore.DataStore.sort_protein_objects(
grouped_protein_objects=grouped_protein_objects, higher_or_lower=hl
)
protein_group_objects = datastore.DataStore.sort_protein_group_objects(
protein_group_objects=protein_group_objects, higher_or_lower=hl
)
self.data.grouped_scored_proteins = grouped_protein_objects
self.data.protein_group_objects = protein_group_objects
__init__(self, data, digest)
special
FirstProtein Inference initialization method.
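Parameters:
Name | Type | Description |
---|---|---|
data | DataStore | DataStore object. |
digest | Digest | Digest object. |
Returns:
Type | Description |
---|---|
object | |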
Source code in pyproteininference/inference.py
def __init__(self, data, digest):
    """
    FirstProtein Inference initialization method.
    Args:
        data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
        digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
    Returns:
        object:
    """
    self.data = data
    self.digest = digest
    self.data._validate_scored_proteins()
    self.scored_data = self.data.get_protein_data()
    self.data = data
infer_proteins(self)
This method performs the First Protein inference method.
This method assigns the variables grouped_scored_proteins and protein_group_objects. Both are variables of the DataStore object and are lists of Protein objects and ProteinGroup objects, respectively.
Source code in pyproteininference/inference.py
def infer_proteins(self):
    """
    This method performs the First Protein inference method.
    This method assigns the variables: `grouped_scored_proteins` and `protein_group_objects`.
    These are both variables of the [DataStore object][pyproteininference.datastore.DataStore] and are
    lists of [Protein][pyproteininference.physical.Protein] objects
    and [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.
    """
    grouped_proteins = self._create_protein_groups(scored_proteins=self.scored_data)
    # Get the higher or lower variable
    hl = self.data.higher_or_lower()
    logger.info("Applying Group ID's for the First Protein Method")
    regrouped_proteins = self._apply_protein_group_ids(
        grouped_protein_objects=grouped_proteins,
    )
    grouped_protein_objects = regrouped_proteins["grouped_protein_objects"]
    protein_group_objects = regrouped_proteins["group_objects"]
    logger.info("Sorting Results based on lead Protein Score")
    grouped_protein_objects = datastore.DataStore.sort_protein_objects(
        grouped_protein_objects=grouped_protein_objects, higher_or_lower=hl
    )
    protein_group_objects = datastore.DataStore.sort_protein_group_objects(
        protein_group_objects=protein_group_objects, higher_or_lower=hl
    )
    self.data.grouped_scored_proteins = grouped_protein_objects
    self.data.protein_group_objects = protein_group_objects
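Each `infer_proteins` method finishes by sorting groups on the lead protein score, in a direction given by the "higher" or "lower" setting. A minimal sketch of that step is shown here; `sort_groups_by_lead_score` and the `(identifier, score)` tuples are assumptions for illustration and do not reproduce `DataStore.sort_protein_group_objects`.

```python
def sort_groups_by_lead_score(groups, higher_or_lower="higher"):
    """Sort groups by the score of their lead (first) protein.

    `groups` is a list of lists of (identifier, score) tuples.
    """
    if higher_or_lower not in ("higher", "lower"):
        raise ValueError("higher_or_lower must be 'higher' or 'lower'")
    # When a higher score is better, the best lead comes first (descending order).
    descending = higher_or_lower == "higher"
    return sorted(groups, key=lambda group: group[0][1], reverse=descending)


example_groups = [
    [("A", 0.2)],
    [("B", 0.9), ("C", 0.1)],
    [("D", 0.5)],
]
best_high = sort_groups_by_lead_score(example_groups, "higher")
best_low = sort_groups_by_lead_score(example_groups, "lower")
```

Only the lead protein decides the position of a group; the score of `C` plays no role in where the `B` group lands.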
Inclusion (Inference)
Inclusion Inference class. This class contains methods that support the initialization of an Inclusion inference method.
Attributes:
Name | Type | Description |
---|---|---|
data | DataStore | DataStore object. |
digest | Digest | Digest object. |
scored_data | list | List of scored Protein objects. |
Source code in pyproteininference/inference.py
class Inclusion(Inference):
"""
Inclusion Inference class. This class contains methods that support the initialization of an
Inclusion inference method.
Attributes:
data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
scored_data (list): a List of scored [Protein][pyproteininference.physical.Protein] objects.
"""
def __init__(self, data, digest):
"""
Initialization method of the Inclusion Inference method.
Args:
data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
"""
self.data = data
self.digest = digest
self.data._validate_scored_proteins()
self.scored_data = self.data.get_protein_data()
def infer_proteins(self):
"""
This method performs the grouping for Inclusion.
Inclusion actually does not do grouping as all peptides get assigned to all possible proteins
and groups are not created.
This method assigns the variables: `grouped_scored_proteins` and `protein_group_objects`.
These are both variables of the [DataStore Object][pyproteininference.datastore.DataStore] and are
lists of [Protein][pyproteininference.physical.Protein] objects
and [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.
"""
grouped_proteins = self._create_protein_groups(scored_proteins=self.scored_data)
hl = self.data.higher_or_lower()
logger.info("Applying Group ID's for the Inclusion Method")
regrouped_proteins = self._apply_protein_group_ids(
grouped_protein_objects=grouped_proteins,
)
grouped_protein_objects = regrouped_proteins["grouped_protein_objects"]
protein_group_objects = regrouped_proteins["group_objects"]
logger.info("Sorting Results based on lead Protein Score")
grouped_protein_objects = datastore.DataStore.sort_protein_objects(
grouped_protein_objects=grouped_protein_objects, higher_or_lower=hl
)
protein_group_objects = datastore.DataStore.sort_protein_group_objects(
protein_group_objects=protein_group_objects, higher_or_lower=hl
)
self.data.grouped_scored_proteins = grouped_protein_objects
self.data.protein_group_objects = protein_group_objects
def _apply_protein_group_ids(self, grouped_protein_objects):
"""
This method creates the ProteinGroup objects for the inclusion inference type using protein groups from
[_create_protein_groups][pyproteininference.inference.Inference._create_protein_groups].
Args:
grouped_protein_objects (list): list of grouped [Protein][pyproteininference.physical.Protein] objects.
Returns:
dict: a Dictionary that contains a list of [ProteinGroup][pyproteininference.physical.ProteinGroup]
objects (key:"group_objects") and a list of
grouped [Protein][pyproteininference.physical.Protein] objects (key:"grouped_protein_objects").
"""
sp_protein_set = set(self.digest.swiss_prot_protein_set)
prot_pep_dict = self.data.protein_to_peptide_dictionary()
# Here we create group ID's
group_id = 0
protein_group_objects = []
for protein_group in grouped_protein_objects:
protein_list = []
group_id = group_id + 1
pg = ProteinGroup(group_id)
logger.debug("Created Protein Group with ID: {}".format(str(group_id)))
for prot in protein_group:
cur_protein = prot
# The following loop assigns group_id's, reviewed/unreviewed status, and number of unique peptides...
if group_id not in cur_protein.group_identification:
cur_protein.group_identification.add(group_id)
if cur_protein.identifier in sp_protein_set:
cur_protein.reviewed = True
else:
cur_protein.unreviewed = True
cur_identifier = cur_protein.identifier
cur_protein.num_peptides = len(prot_pep_dict[cur_identifier])
# Here append the number of unique peptides... so we can use this as secondary sorting...
protein_list.append(cur_protein)
# Sorted protein_groups then becomes a list of lists... of protein objects
pg.proteins = protein_list
protein_group_objects.append(pg)
return_dict = {
"grouped_protein_objects": grouped_protein_objects,
"group_objects": protein_group_objects,
}
return return_dict
__init__(self, data, digest)
special
Initialization method of the Inclusion Inference method.
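Parameters:
Name | Type | Description |
---|---|---|
data | DataStore | DataStore object. |
digest | Digest | Digest object. |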
Source code in pyproteininference/inference.py
def __init__(self, data, digest):
    """
    Initialization method of the Inclusion Inference method.
    Args:
        data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
        digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
    """
    self.data = data
    self.digest = digest
    self.data._validate_scored_proteins()
    self.scored_data = self.data.get_protein_data()
infer_proteins(self)
This method performs the grouping for Inclusion.
Inclusion does not actually perform grouping: all peptides are assigned to all possible proteins, so groups are not created.
This method assigns the variables grouped_scored_proteins and protein_group_objects. Both are variables of the DataStore object and are lists of Protein objects and ProteinGroup objects, respectively.
Source code in pyproteininference/inference.py
def infer_proteins(self):
    """
    This method performs the grouping for Inclusion.
    Inclusion actually does not do grouping as all peptides get assigned to all possible proteins
    and groups are not created.
    This method assigns the variables: `grouped_scored_proteins` and `protein_group_objects`.
    These are both variables of the [DataStore Object][pyproteininference.datastore.DataStore] and are
    lists of [Protein][pyproteininference.physical.Protein] objects
    and [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.
    """
    grouped_proteins = self._create_protein_groups(scored_proteins=self.scored_data)
    hl = self.data.higher_or_lower()
    logger.info("Applying Group ID's for the Inclusion Method")
    regrouped_proteins = self._apply_protein_group_ids(
        grouped_protein_objects=grouped_proteins,
    )
    grouped_protein_objects = regrouped_proteins["grouped_protein_objects"]
    protein_group_objects = regrouped_proteins["group_objects"]
    logger.info("Sorting Results based on lead Protein Score")
    grouped_protein_objects = datastore.DataStore.sort_protein_objects(
        grouped_protein_objects=grouped_protein_objects, higher_or_lower=hl
    )
    protein_group_objects = datastore.DataStore.sort_protein_group_objects(
        protein_group_objects=protein_group_objects, higher_or_lower=hl
    )
    self.data.grouped_scored_proteins = grouped_protein_objects
    self.data.protein_group_objects = protein_group_objects
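The group ID assignment performed by `_apply_protein_group_ids` in the Inclusion class source can be sketched in isolation. The `Protein` and `ProteinGroup` dataclasses here are simplified, hypothetical stand-ins for the library's classes, and `apply_group_ids` is an illustrative helper rather than the library's method: it numbers groups from 1, records the ID on each member, and flags members as reviewed or unreviewed.

```python
from dataclasses import dataclass, field


@dataclass
class Protein:
    """Hypothetical stand-in for the library's Protein object."""
    identifier: str
    group_identification: set = field(default_factory=set)
    reviewed: bool = False
    unreviewed: bool = False


@dataclass
class ProteinGroup:
    """Hypothetical stand-in for the library's ProteinGroup object."""
    number_id: int
    proteins: list = field(default_factory=list)


def apply_group_ids(grouped_proteins, reviewed_identifiers):
    """Number the groups from 1 and annotate every member protein."""
    group_objects = []
    for group_id, members in enumerate(grouped_proteins, start=1):
        group = ProteinGroup(group_id)
        for protein in members:
            protein.group_identification.add(group_id)
            # Reviewed status comes from membership in the reviewed identifier set.
            if protein.identifier in reviewed_identifiers:
                protein.reviewed = True
            else:
                protein.unreviewed = True
            group.proteins.append(protein)
        group_objects.append(group)
    return group_objects


group_objects = apply_group_ids(
    [[Protein("sp|A")], [Protein("tr|B")]],
    reviewed_identifiers={"sp|A"},
)
```

For inclusion, each inner list holds a single protein, so the number of groups equals the number of scored proteins.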
Inference
Parent Inference class for all inference/grouper subclasses. The base Inference class contains several methods that are shared across the Inference subclasses.
Attributes:
Name | Type | Description |
---|---|---|
data | DataStore | DataStore object. |
digest | Digest | Digest object. |
Source code in pyproteininference/inference.py
class Inference(object):
"""
Parent Inference class for all inference/grouper subset classes.
The base Inference class contains several methods that are shared across the Inference sub-classes.
Attributes:
data (DataStore): [DataStore object][pyproteininference.datastore.DataStore].
digest (Digest): [Digest object][pyproteininference.in_silico_digest.Digest].
"""
PARSIMONY = "parsimony"
INCLUSION = "inclusion"
EXCLUSION = "exclusion"
FIRST_PROTEIN = "first_protein"
PEPTIDE_CENTRIC = "peptide_centric"
INFERENCE_TYPES = [
PARSIMONY,
INCLUSION,
EXCLUSION,
FIRST_PROTEIN,
PEPTIDE_CENTRIC,
]
INFERENCE_NAME_MAP = {
PARSIMONY: "Parsimony",
INCLUSION: "Inclusion",
EXCLUSION: "Exclusion",
FIRST_PROTEIN: "First Protein",
PEPTIDE_CENTRIC: "Peptide Centric",
}
SUBSET_PEPTIDES = "subset_peptides"
SHARED_PEPTIDES = "shared_peptides"
NONE_GROUPING = None
GROUPING_TYPES = [SUBSET_PEPTIDES, SHARED_PEPTIDES, NONE_GROUPING]
PULP = "pulp"
LP_SOLVERS = [PULP]
ALL_SHARED_PEPTIDES = "all"
BEST_SHARED_PEPTIDES = "best"
NONE_SHARED_PEPTIDES = None
SHARED_PEPTIDE_TYPES = [
ALL_SHARED_PEPTIDES,
BEST_SHARED_PEPTIDES,
NONE_SHARED_PEPTIDES,
]
def __init__(self, data, digest):
"""
Initialization method of Inference object.
Args:
data (DataStore): [DataStore object][pyproteininference.datastore.DataStore].
digest (Digest): [Digest object][pyproteininference.in_silico_digest.Digest].
"""
self.data = data
self.digest = digest
self.data._validate_scored_proteins()
@classmethod
def run_inference(cls, data, digest):
"""
This class method dispatches to one of the five different inference classes/models
based on input from the [ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter]
object.
The methods are "parsimony", "inclusion", "exclusion", "peptide_centric", and "first_protein".
Args:
data (DataStore): [DataStore object][pyproteininference.datastore.DataStore].
digest (Digest): [Digest object][pyproteininference.in_silico_digest.Digest].
Example:
>>> pyproteininference.inference.Inference.run_inference(data=data,digest=digest)
"""
inference_type = data.parameter_file_object.inference_type
logger.info("Running Inference with Inference Type: {}".format(inference_type))
if inference_type == Inference.PARSIMONY:
group = Parsimony(data=data, digest=digest)
group.infer_proteins()
if inference_type == Inference.INCLUSION:
group = Inclusion(data=data, digest=digest)
group.infer_proteins()
if inference_type == Inference.EXCLUSION:
group = Exclusion(data=data, digest=digest)
group.infer_proteins()
if inference_type == Inference.FIRST_PROTEIN:
group = FirstProtein(data=data, digest=digest)
group.infer_proteins()
if inference_type == Inference.PEPTIDE_CENTRIC:
group = PeptideCentric(data=data, digest=digest)
group.infer_proteins()
def _create_protein_groups(self, scored_proteins):
"""
This method sets up protein groups for inference methods that do not need grouping.
Args:
scored_proteins (list): List of scored [Protein][pyproteininference.physical.Protein] objects.
Returns:
list: List of lists of scored [Protein][pyproteininference.physical.Protein] objects.
"""
scored_proteins = sorted(
scored_proteins,
key=lambda k: (k.score, len(k.raw_peptides), k.identifier),
reverse=True,
)
prot_pep_dict = self.data.protein_to_peptide_dictionary()
restricted_peptides_set = set(self.data.restricted_peptides)
grouped_proteins = []
for protein_objects in scored_proteins:
cur_protein_identifier = protein_objects.identifier
# Set peptide variable if the peptide is in the restricted peptide set
# Sort the peptides alphabetically
protein_objects.peptides = set(
sorted([x for x in prot_pep_dict[cur_protein_identifier] if x in restricted_peptides_set])
)
protein_list_group = [protein_objects]
grouped_proteins.append(protein_list_group)
return grouped_proteins
def _apply_protein_group_ids(self, grouped_protein_objects):
"""
This method creates the ProteinGroup objects from the output of
[_create_protein_groups][pyproteininference.inference.Inference._create_protein_groups].
Args:
grouped_protein_objects (list): list of grouped [Protein][pyproteininference.physical.Protein] objects.
Returns:
dict: a Dictionary that contains a list of [ProteinGroup][pyproteininference.physical.ProteinGroup]
objects (key:"group_objects") and a list of grouped [Protein][pyproteininference.physical.Protein]
objects (key:"grouped_protein_objects").
"""
sp_protein_set = set(self.digest.swiss_prot_protein_set)
prot_pep_dict = self.data.protein_to_peptide_dictionary()
# Here we create group ID's
group_id = 0
protein_group_objects = []
for protein_group in grouped_protein_objects:
protein_list = []
group_id = group_id + 1
pg = ProteinGroup(group_id)
logger.debug("Created Protein Group with ID: {}".format(str(group_id)))
for protein in protein_group:
cur_protein = protein
# The following loop assigns group_id's, reviewed/unreviewed status, and number of unique peptides...
if group_id not in cur_protein.group_identification:
cur_protein.group_identification.add(group_id)
if protein.identifier in sp_protein_set:
cur_protein.reviewed = True
else:
cur_protein.unreviewed = True
cur_identifier = protein.identifier
cur_protein.num_peptides = len(prot_pep_dict[cur_identifier])
# Here append the number of unique peptides... so we can use this as secondary sorting...
protein_list.append(cur_protein)
# Sorted protein_groups then becomes a list of lists... of protein objects
pg.proteins = protein_list
protein_group_objects.append(pg)
return_dict = {
"grouped_protein_objects": grouped_protein_objects,
"group_objects": protein_group_objects,
}
return return_dict
__init__(self, data, digest)
special
Initialization method of Inference object.
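Parameters:
Name | Type | Description |
---|---|---|
data | DataStore | DataStore object. |
digest | Digest | Digest object. |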
Source code in pyproteininference/inference.py
def __init__(self, data, digest):
    """
    Initialization method of Inference object.
    Args:
        data (DataStore): [DataStore object][pyproteininference.datastore.DataStore].
        digest (Digest): [Digest object][pyproteininference.in_silico_digest.Digest].
    """
    self.data = data
    self.digest = digest
    self.data._validate_scored_proteins()
run_inference(data, digest)
classmethod
This class method dispatches to one of the five different inference classes/models based on input from the ProteinInferenceParameter object. The methods are "parsimony", "inclusion", "exclusion", "peptide_centric", and "first_protein".
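Parameters:
Name | Type | Description |
---|---|---|
data | DataStore | DataStore object. |
digest | Digest | Digest object. |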
Examples:
>>> pyproteininference.inference.Inference.run_inference(data=data,digest=digest)
Source code in pyproteininference/inference.py
@classmethod
def run_inference(cls, data, digest):
    """
    This class method dispatches to one of the five different inference classes/models
    based on input from the [ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter]
    object.
    The methods are "parsimony", "inclusion", "exclusion", "peptide_centric", and "first_protein".
    Args:
        data (DataStore): [DataStore object][pyproteininference.datastore.DataStore].
        digest (Digest): [Digest object][pyproteininference.in_silico_digest.Digest].
    Example:
        >>> pyproteininference.inference.Inference.run_inference(data=data,digest=digest)
    """
    inference_type = data.parameter_file_object.inference_type
    logger.info("Running Inference with Inference Type: {}".format(inference_type))
    if inference_type == Inference.PARSIMONY:
        group = Parsimony(data=data, digest=digest)
        group.infer_proteins()
    if inference_type == Inference.INCLUSION:
        group = Inclusion(data=data, digest=digest)
        group.infer_proteins()
    if inference_type == Inference.EXCLUSION:
        group = Exclusion(data=data, digest=digest)
        group.infer_proteins()
    if inference_type == Inference.FIRST_PROTEIN:
        group = FirstProtein(data=data, digest=digest)
        group.infer_proteins()
    if inference_type == Inference.PEPTIDE_CENTRIC:
        group = PeptideCentric(data=data, digest=digest)
        group.infer_proteins()
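The dispatch in `run_inference` maps an inference type string to a class and calls its `infer_proteins` method. The same pattern can be written as a lookup table, sketched here with stub classes. Everything in this block is hypothetical: the stubs only return their own class name, and raising `ValueError` for an unknown type is a choice made for this sketch (the `if` chain above simply does nothing in that case).

```python
class _StubInference:
    """Hypothetical stand-in for an inference class."""

    def __init__(self, data, digest):
        self.data = data
        self.digest = digest

    def infer_proteins(self):
        # The real methods populate the DataStore; the stub reports its class name.
        return type(self).__name__


class Parsimony(_StubInference):
    pass


class Inclusion(_StubInference):
    pass


class Exclusion(_StubInference):
    pass


class FirstProtein(_StubInference):
    pass


class PeptideCentric(_StubInference):
    pass


# One entry per inference type string listed in the docstring above.
INFERENCE_CLASSES = {
    "parsimony": Parsimony,
    "inclusion": Inclusion,
    "exclusion": Exclusion,
    "first_protein": FirstProtein,
    "peptide_centric": PeptideCentric,
}


def run_inference(inference_type, data=None, digest=None):
    """Look up the class for `inference_type` and run it."""
    try:
        inference_class = INFERENCE_CLASSES[inference_type]
    except KeyError:
        raise ValueError("Unknown inference type: {!r}".format(inference_type)) from None
    return inference_class(data=data, digest=digest).infer_proteins()


result = run_inference("inclusion")
```

A table keeps the type strings and classes in one place, at the cost of diverging from the source listing, which remains the reference for actual behaviour.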
Parsimony (Inference)
Parsimony Inference class. This class contains methods that support the initialization of a Parsimony inference method.
Attributes:
Name | Type | Description |
---|---|---|
data | DataStore | DataStore object. |
digest | Digest | Digest object. |
scored_data | list | List of scored Protein objects. |
lead_protein_set | set | Set of protein strings that are classified as leads from the LP solver. |
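Once lead proteins are known, the Parsimony class source builds a group around each lead according to a grouping type ("shared_peptides", "subset_peptides", or None). The difference between these can be sketched as follows; `group_with_lead` and the toy mapping are assumptions for illustration and omit the library's tracking of picked and missing proteins.

```python
def group_with_lead(lead, protein_to_peptides, grouping_type="shared_peptides"):
    """Return [lead, *members] for one lead protein.

    shared_peptides: add every protein sharing at least one peptide with the lead.
    subset_peptides: add only proteins whose peptides are a subset of the lead's.
    None: no grouping, the lead stays alone.
    """
    lead_peptides = protein_to_peptides[lead]
    members = []
    if grouping_type is not None:
        for candidate, peptides in sorted(protein_to_peptides.items()):
            # Skip the lead itself and proteins with no peptide in common.
            if candidate == lead or not (peptides & lead_peptides):
                continue
            if grouping_type == "shared_peptides":
                members.append(candidate)
            elif grouping_type == "subset_peptides" and peptides <= lead_peptides:
                members.append(candidate)
    return [lead] + members


toy = {
    "L": {"a", "b", "c"},
    "X": {"a", "b"},
    "Y": {"c", "d"},
    "Z": {"e"},
}
shared = group_with_lead("L", toy, "shared_peptides")
subset = group_with_lead("L", toy, "subset_peptides")
alone = group_with_lead("L", toy, None)
```

Protein `Y` shares peptide `c` with the lead but also carries `d`, so it joins the group only under shared-peptide grouping.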
Source code in pyproteininference/inference.py
class Parsimony(Inference):
"""
Parsimony Inference class. This class contains methods that support the initialization of a
Parsimony inference method.
Attributes:
data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
scored_data (list): a List of scored [Protein][pyproteininference.physical.Protein] objects.
lead_protein_set (set): Set of protein strings that are classified as leads from the LP solver.
"""
def __init__(self, data, digest):
"""
Initialization method of the Parsimony object.
Args:
data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
"""
self.data = data
self.digest = digest
self.data._validate_scored_proteins()
self.scored_data = self.data.get_protein_data()
self.lead_protein_set = None
self.parameter_file_object = data.parameter_file_object
def _create_protein_groups(
self,
all_scored_proteins,
lead_protein_objects,
grouping_type="shared_peptides",
):
"""
Internal method that creates a list of lists of [Protein][pyproteininference.physical.Protein]
objects for the Parsimony inference object.
This list of lists represents the "groups", and the proteins are grouped according to the grouping_type variable.
Args:
all_scored_proteins (list): list of [Protein][pyproteininference.physical.Protein] objects.
lead_protein_objects (list): list of [Protein][pyproteininference.physical.Protein] objects
Only needed if inference_type=parsimony.
grouping_type (str): One of `GROUPING_TYPES`.
Returns:
list: list of lists of [Protein][pyproteininference.physical.Protein] objects.
"""
logger.info("Grouping Peptides with Grouping Type: {}".format(grouping_type))
logger.info("Grouping Peptides with Inference Type: {}".format(self.PARSIMONY))
all_scored_proteins = sorted(
all_scored_proteins,
key=lambda k: (len(k.raw_peptides), k.identifier),
reverse=True,
)
lead_scored_proteins = lead_protein_objects
lead_scored_proteins = sorted(
lead_scored_proteins,
key=lambda k: (len(k.raw_peptides), k.identifier),
reverse=True,
)
protein_finder = [x.identifier for x in all_scored_proteins]
prot_pep_dict = self.data.protein_to_peptide_dictionary()
protein_tracker = set()
restricted_peptides_set = set(self.data.restricted_peptides)
try:
picked_removed = set([x.identifier for x in self.data.picked_proteins_removed])
except TypeError:
picked_removed = set()
missing_proteins = set()
in_silico_peptides_to_proteins = self.digest.peptide_to_protein_dictionary
grouped_proteins = []
for protein_objects in lead_scored_proteins:
if protein_objects not in protein_tracker:
protein_tracker.add(protein_objects)
cur_protein_identifier = protein_objects.identifier
# Set peptide variable if the peptide is in the restricted peptide set
# Sort the peptides alphabetically
protein_objects.peptides = set(
sorted([x for x in prot_pep_dict[cur_protein_identifier] if x in restricted_peptides_set])
)
protein_list_group = [protein_objects]
current_peptides = prot_pep_dict[cur_protein_identifier]
current_grouped_proteins = set()
# Only consider peptides that remain after restriction by datastore.RestrictMainData
for peptide in current_peptides:
if peptide in restricted_peptides_set:
# Get the proteins that map to the current peptide using in_silico_peptides_to_proteins
# First make sure our peptide is formatted properly...
if not peptide.isupper() or not peptide.isalpha():
# If the peptide is not all upper case or if its not all alphabetical...
peptide = Psm.remove_peptide_mods(peptide)
potential_protein_list = in_silico_peptides_to_proteins[peptide]
if not potential_protein_list:
logger.warning(
"Protein {} and Peptide {} is not in database...".format(
protein_objects.identifier, peptide
)
)
# Assign proteins to groups based on shared peptide... unless the protein is equivalent
# to the current identifier
if grouping_type != self.NONE_GROUPING:
for protein in potential_protein_list:
# If statement below to avoid grouping the same protein twice and to not group the lead
if (
protein not in current_grouped_proteins
and protein != cur_protein_identifier
and protein not in picked_removed
and protein not in missing_proteins
):
try:
# Try to find its object using protein_finder (list of identifiers) and
# all_scored_proteins (list of Protein objects)
cur_index = protein_finder.index(protein)
current_protein_object = all_scored_proteins[cur_index]
if not current_protein_object.peptides:
current_protein_object.peptides = set(
sorted(
[
x
for x in prot_pep_dict[current_protein_object.identifier]
if x in restricted_peptides_set
]
)
)
if grouping_type == self.SHARED_PEPTIDES:
current_grouped_proteins.add(current_protein_object)
elif grouping_type == self.SUBSET_PEPTIDES:
if current_protein_object.peptides.issubset(protein_objects.peptides):
current_grouped_proteins.add(current_protein_object)
protein_tracker.add(current_protein_object)
else:
pass
else:
pass
except ValueError:
logger.warning(
"Protein from DB {} not found with protein finder for peptide {}".format(
protein, peptide
)
)
missing_proteins.add(protein)
else:
pass
# Add the proteins to the lead if they share peptides...
protein_list_group = protein_list_group + list(current_grouped_proteins)
# protein_list_group starts out as just the lead protein object...
# We then try to apply grouping by looking at all peptides from the lead...
# For each of these peptides, look at all other non-lead proteins and try to assign them to the group...
# We assign the entire protein object as well... in the above try/except
# Then append this sub group to the main list
# The variable grouped_proteins is now a list of lists, with each element being a Protein object and
# each list of Protein objects corresponding to a group
grouped_proteins.append(protein_list_group)
return grouped_proteins
def _swissprot_and_isoform_override(
self,
scored_data,
grouped_proteins,
override_type="soft",
isoform_override=True,
):
"""
This internal method creates and reorders protein groups based on criteria such as Reviewed/Unreviewed
Identifiers as well as Canonical/Isoform Identifiers.
This method is only used with parsimony inference type.
Args:
scored_data (list): list of scored [Protein][pyproteininference.physical.Protein] objects.
grouped_proteins: list of grouped [Protein][pyproteininference.physical.Protein] objects.
override_type (str): "soft" or "hard" to indicate Reviewed/Unreviewed override. "soft" is preferred and
default.
isoform_override (bool): True/False on whether to favor canonical forms vs isoforms as group leads.
Returns:
dict: a Dictionary that contains a list of [ProteinGroup][pyproteininference.physical.ProteinGroup] objects
(key:"group_objects") and a list of grouped [Protein][pyproteininference.physical.Protein]
objects (key:"grouped_protein_objects").
"""
sp_protein_set = set(self.digest.swiss_prot_protein_set)
scored_proteins = list(scored_data)
protein_finder = [x.identifier for x in scored_proteins]
prot_pep_dict = self.data.protein_to_peptide_dictionary()
# Get the higher or lower variable
higher_or_lower = self.data.higher_or_lower()
logger.info("Applying Group IDs... and Executing {} Swissprot Override...".format(override_type))
# Here we create group ID's for all groups and do some sorting
grouped_protein_objects = []
group_id = 0
leads = set()
protein_group_objects = []
for protein_group in grouped_proteins:
protein_list = []
group_id = group_id + 1
# Make a protein group
pg = ProteinGroup(group_id)
logger.debug("Created Protein Group with ID: {}".format(str(group_id)))
for prots in protein_group:
# Loop over all proteins in the original group
try:
# The following loop assigns group_id's, reviewed/unreviewed status, and number of unique peptides
pindex = protein_finder.index(prots.identifier)
# Attempt to find the protein object by identifier
cur_protein = scored_proteins[pindex]
if group_id not in cur_protein.group_identification:
cur_protein.group_identification.add(group_id)
if prots.identifier in sp_protein_set:
cur_protein.reviewed = True
else:
cur_protein.unreviewed = True
cur_identifier = prots.identifier
cur_protein.num_peptides = len(prot_pep_dict[cur_identifier])
# Here append the number of unique peptides... so we can use this as secondary sorting...
protein_list.append(cur_protein)
# Sorted groups then becomes a list of lists... of protein objects
except ValueError:
# Here we pass if the protein does not have a score...
# Potentially it got 'picked' (removed) by protein picker...
pass
# Sort protein sub group
protein_list = datastore.DataStore.sort_protein_sub_groups(
protein_list=protein_list, higher_or_lower=higher_or_lower
)
# grouped_protein_objects is the MAIN list of lists with grouped protein objects
grouped_protein_objects.append(protein_list)
# If the lead is reviewed append it to leads and do nothing else...
# If the lead is unreviewed then try to replace it with the best reviewed hit
# Run swissprot override
if self.data.parameter_file_object.reviewed_identifier_symbol:
sp_override = self._swissprot_override(
protein_list=protein_list,
leads=leads,
grouped_protein_objects=grouped_protein_objects,
override_type=override_type,
)
grouped_protein_objects = sp_override["grouped_protein_objects"]
leads = sp_override["leads"]
protein_list = sp_override["protein_list"]
# Run isoform override if isoform_override is requested and the isoform symbol exists...
if isoform_override and self.data.parameter_file_object.isoform_symbol:
iso_override = self._isoform_override(
protein_list=protein_list,
leads=leads,
grouped_protein_objects=grouped_protein_objects,
)
grouped_protein_objects = iso_override["grouped_protein_objects"]
leads = iso_override["leads"]
protein_list = iso_override["protein_list"]
pg.proteins = protein_list
protein_group_objects.append(pg)
return_dict = {
"grouped_protein_objects": grouped_protein_objects,
"group_objects": protein_group_objects,
}
return return_dict
def _swissprot_override(self, protein_list, leads, grouped_protein_objects, override_type):
"""
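This method re-assigns protein group leads if the lead is an unreviewed protein and if the protein group
contains a reviewed protein that is not already a lead. With "soft" override the reviewed protein's peptides
must also cover every peptide of the unreviewed lead; with "hard" override no peptide check is made.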
This method is here to provide consistency to the output.
Args:
protein_list (list): List of grouped [Protein][pyproteininference.physical.Protein] objects.
leads (set): Set of string protein identifiers that have been identified as a lead.
grouped_protein_objects (list): List of protein_list lists.
override_type (str): "soft" or "hard" on how to override non reviewed identifiers. "soft" is preferred.
Returns:
dict: leads (set): Set of string protein identifiers that have been identified as a lead.
Updated to reflect lead changes.
grouped_protein_objects (list): List of protein_list lists. Updated to reflect lead changes.
protein_list (list): List of grouped [Protein][pyproteininference.physical.Protein] objects.
Updated to reflect lead changes.
"""
if not protein_list[0].reviewed:
# If the lead is unreviewed attempt to replace it...
# Start to loop through protein_list which is the current group...
for protein in protein_list[1:]:
# Find the first reviewed hit... if it's not a lead protein already then do the swap and break...
if protein.reviewed:
best_swiss_prot_prot = protein
if override_type == "soft":
# If the lead protein's peptides are a subset of the best swissprot.... then swap the proteins.
# (meaning equal peptides or the swissprot completely covers the TrEMBL reference)
if best_swiss_prot_prot.identifier not in leads and set(protein_list[0].peptides).issubset(
set(best_swiss_prot_prot.peptides)
):
# We use -1 as the index of grouped_protein_objects because the current 'protein_list' is
# the last entry appended to grouped_protein_objects
# Essentially grouped_protein_objects[-1]==protein_list
# We need this syntax so we can switch the location of the unreviewed lead identifier with
# the best reviewed identifier in grouped_protein_objects
swiss_prot_override_index = grouped_protein_objects[-1].index(best_swiss_prot_prot)
cur_tr_lead = grouped_protein_objects[-1][0]
(
grouped_protein_objects[-1][0],
grouped_protein_objects[-1][swiss_prot_override_index],
) = (
grouped_protein_objects[-1][swiss_prot_override_index],
grouped_protein_objects[-1][0],
)
new_sp_lead = grouped_protein_objects[-1][0]
logger.info(
"Overriding Unreviewed {} with Reviewed {}".format(
cur_tr_lead.identifier, new_sp_lead.identifier
)
)
# Append new_sp_lead protein to leads, to make sure we dont repeat leads
leads.add(new_sp_lead.identifier)
break
else:
# If no reviewed and none not in leads then pass...
pass
if override_type == "hard":
if best_swiss_prot_prot.identifier not in leads:
# We use -1 as the index of grouped_protein_objects because the current 'protein_list'
# is the last entry appended to grouped_protein_objects
# Essentially grouped_protein_objects[-1]==protein_list
# We need this syntax so we can switch the location of the unreviewed lead identifier
# with the best reviewed identifier in grouped_protein_objects
swiss_prot_override_index = grouped_protein_objects[-1].index(best_swiss_prot_prot)
cur_tr_lead = grouped_protein_objects[-1][0]
# Re-assigning the value within the index will also reassign the value in protein_list...
# This is because grouped_protein_objects[-1] equals protein_list
# So we do not have to reassign values in protein_list
(
grouped_protein_objects[-1][0],
grouped_protein_objects[-1][swiss_prot_override_index],
) = (
grouped_protein_objects[-1][swiss_prot_override_index],
grouped_protein_objects[-1][0],
)
new_sp_lead = grouped_protein_objects[-1][0]
logger.info(
"Overriding Unreviewed {} with Reviewed {}".format(
cur_tr_lead.identifier, new_sp_lead.identifier
)
)
# Append new_sp_lead protein to leads, to make sure we dont repeat leads
leads.add(new_sp_lead.identifier)
break
else:
# If no reviewed and none not in leads then pass...
pass
else:
pass
else:
leads.add(protein_list[0].identifier)
return_dict = {
"leads": leads,
"grouped_protein_objects": grouped_protein_objects,
"protein_list": protein_list,
}
return return_dict
def _isoform_override(self, protein_list, grouped_protein_objects, leads):
"""
This method re-assigns protein group leads if the lead is an isoform protein and if the protein group contains
a canonical protein whose peptide set covers every peptide of the isoform lead (equal sets or a superset).
This method is here to provide consistency to the output.
Args:
protein_list (list): List of grouped [Protein][pyproteininference.physical.Protein] objects.
leads (set): Set of string protein identifiers that have been identified as a lead.
grouped_protein_objects (list): List of protein_list lists.
Returns:
dict: leads (set): Set of string protein identifiers that have been identified as a lead. Updated to
reflect lead changes.
grouped_protein_objects (list): List of protein_list lists. Updated to reflect lead changes.
protein_list (list): List of grouped [Protein][pyproteininference.physical.Protein] objects.
Updated to reflect lead changes.
"""
if self.data.parameter_file_object.isoform_symbol in protein_list[0].identifier:
pure_id = protein_list[0].identifier.split(self.data.parameter_file_object.isoform_symbol)[0]
# Start to loop through protein_list which is the current group...
for potential_replacement in protein_list[1:]:
isoform_override = potential_replacement
if (
isoform_override.identifier == pure_id
and isoform_override.identifier not in leads
and set(protein_list[0].peptides).issubset(set(isoform_override.peptides))
):
isoform_override_index = grouped_protein_objects[-1].index(isoform_override)
cur_iso_lead = grouped_protein_objects[-1][0]
# Re-assigning the value within the index will also reassign the value in protein_list...
# This is because grouped_protein_objects[-1] equals protein_list
# So we do not have to reassign values in protein_list
(grouped_protein_objects[-1][0], grouped_protein_objects[-1][isoform_override_index],) = (
grouped_protein_objects[-1][isoform_override_index],
grouped_protein_objects[-1][0],
)
new_iso_lead = grouped_protein_objects[-1][0]
logger.info(
"Overriding Isoform {} with {}".format(cur_iso_lead.identifier, new_iso_lead.identifier)
)
leads.add(protein_list[0].identifier)
return_dict = {
"leads": leads,
"grouped_protein_objects": grouped_protein_objects,
"protein_list": protein_list,
}
return return_dict
def _reassign_protein_group_leads(self, protein_group_objects):
"""
This internal method corrects leads that are improperly assigned in the parsimony inference method.
This method acts on the protein group objects.
Args:
protein_group_objects (list): List of [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.
Returns:
protein_group_objects (list): List of [ProteinGroup][pyproteininference.physical.ProteinGroup] objects
where leads have been reassigned properly.
"""
# Get the higher or lower variable
if not self.data.high_low_better:
higher_or_lower = self.data.higher_or_lower()
else:
higher_or_lower = self.data.high_low_better
# Sometimes we have cases where:
# protein a maps to peptides 1,2,3
# protein b maps to peptides 1,2
# protein c maps to a bunch of peptides and peptide 3
# Therefore, in the model proteins a and b are equivalent in that they map to 2 peptides together - 1 and 2.
# peptide 3 maps to a but also to c...
# Sometimes the model (pulp) will spit out protein b as the lead... we wish to swap protein b as the lead with
# protein a because it will likely have a better score...
logger.info("Potentially Reassigning Protein Group leads...")
lead_protein_set = set([x.proteins[0].identifier for x in protein_group_objects])
for i in range(len(protein_group_objects)):
for j in range(1, len(protein_group_objects[i].proteins)): # Loop over all sub proteins in the group...
# if the lead proteins peptides are a subset of one of its proteins in the group, and the secondary
# protein is not a lead protein and its score is better than the leads... and it has more peptides...
new_lead = protein_group_objects[i].proteins[j]
old_lead = protein_group_objects[i].proteins[0]
if higher_or_lower == datastore.DataStore.HIGHER_PSM_SCORE:
if (
set(old_lead.peptides).issubset(set(new_lead.peptides))
and new_lead.identifier not in lead_protein_set
and old_lead.score <= new_lead.score
and len(old_lead.peptides) < len(new_lead.peptides)
):
logger.info(
"protein {} will replace protein {} as lead, with index {}, New Num Peptides: {}, "
"Old Num Peptides: {}".format(
str(new_lead.identifier),
str(old_lead.identifier),
str(j),
str(len(new_lead.peptides)),
str(len(old_lead.peptides)),
)
)
lead_protein_set.add(new_lead.identifier)
lead_protein_set.remove(old_lead.identifier)
# Swap their positions in the list
(
protein_group_objects[i].proteins[0],
protein_group_objects[i].proteins[j],
) = (new_lead, old_lead)
break
if higher_or_lower == datastore.DataStore.LOWER_PSM_SCORE:
if (
set(old_lead.peptides).issubset(set(new_lead.peptides))
and new_lead.identifier not in lead_protein_set
and old_lead.score >= new_lead.score
and len(old_lead.peptides) < len(new_lead.peptides)
):
logger.info(
"protein {} will replace protein {} as lead, with index {}, New Num Peptides: {}, "
"Old Num Peptides: {}".format(
str(new_lead.identifier),
str(old_lead.identifier),
str(j),
str(len(new_lead.peptides)),
str(len(old_lead.peptides)),
)
)
lead_protein_set.add(new_lead.identifier)
lead_protein_set.remove(old_lead.identifier)
# Swap their positions in the list
(
protein_group_objects[i].proteins[0],
protein_group_objects[i].proteins[j],
) = (new_lead, old_lead)
break
return protein_group_objects
def _reassign_protein_list_leads(self, grouped_protein_objects):
"""
This internal method corrects leads that are improperly assigned in the parsimony inference method.
This method acts on the grouped protein objects.
Args:
grouped_protein_objects (list): List of [Protein][pyproteininference.physical.Protein] objects.
Returns:
list: List of [Protein][pyproteininference.physical.Protein] objects where leads have been
reassigned properly.
"""
# Get the higher or lower variable
if not self.data.high_low_better:
higher_or_lower = self.data.higher_or_lower()
else:
higher_or_lower = self.data.high_low_better
# Sometimes we have cases where:
# protein a maps to peptides 1,2,3
# protein b maps to peptides 1,2
# protein c maps to a bunch of peptides and peptide 3
# Therefore, in the model proteins a and b are equivalent in that they map to 2 peptides together - 1 and 2.
# peptide 3 maps to a but also to c...
# Sometimes the model (pulp) will spit out protein b as the lead... we wish to swap protein b as the lead with
# protein a because it will likely have a better score...
logger.info("Potentially Reassigning Protein List leads...")
lead_protein_set = set([x[0].identifier for x in grouped_protein_objects])
for i in range(len(grouped_protein_objects)):
for j in range(1, len(grouped_protein_objects[i])): # Loop over all sub proteins in the group...
# if the lead proteins peptides are a subset of one of its proteins in the group, and the secondary
# protein is not a lead protein and its score is better than the leads... and it has more peptides...
new_lead = grouped_protein_objects[i][j]
old_lead = grouped_protein_objects[i][0]
if higher_or_lower == datastore.DataStore.HIGHER_PSM_SCORE:
if (
set(old_lead.peptides).issubset(set(new_lead.peptides))
and new_lead.identifier not in lead_protein_set
and old_lead.score <= new_lead.score
and len(old_lead.peptides) < len(new_lead.peptides)
):
logger.info(
"protein {} will replace protein {} as lead, with index {}, New Num Peptides: {}, "
"Old Num Peptides: {}".format(
str(new_lead.identifier),
str(old_lead.identifier),
str(j),
str(len(new_lead.peptides)),
str(len(old_lead.peptides)),
)
)
lead_protein_set.add(new_lead.identifier)
lead_protein_set.remove(old_lead.identifier)
# Swap their positions in the list
(
grouped_protein_objects[i][0],
grouped_protein_objects[i][j],
) = (new_lead, old_lead)
break
if higher_or_lower == datastore.DataStore.LOWER_PSM_SCORE:
if (
set(old_lead.peptides).issubset(set(new_lead.peptides))
and new_lead.identifier not in lead_protein_set
and old_lead.score >= new_lead.score
and len(old_lead.peptides) < len(new_lead.peptides)
):
logger.info(
"protein {} will replace protein {} as lead, with index {}, New Num Peptides: {}, "
"Old Num Peptides: {}".format(
str(new_lead.identifier),
str(old_lead.identifier),
str(j),
str(len(new_lead.peptides)),
str(len(old_lead.peptides)),
)
)
lead_protein_set.add(new_lead.identifier)
lead_protein_set.remove(old_lead.identifier)
# Swap their positions in the list
(
grouped_protein_objects[i][0],
grouped_protein_objects[i][j],
) = (new_lead, old_lead)
break
return grouped_protein_objects
def _pulp_grouper(self):
"""
This internal function uses pulp to solve the lp problem for parsimony then performs protein grouping with the
various internal grouping functions.
This method assigns the variables: `grouped_scored_proteins` and `protein_group_objects`.
These are both variables of the [DataStore object][pyproteininference.datastore.DataStore] and are
lists of [Protein][pyproteininference.physical.Protein] objects
and [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.
"""
# Here we get the peptide to protein dictionary
pep_prot_dict = self.data.peptide_to_protein_dictionary()
self.data.protein_to_peptide_dictionary()
identifiers_sorted = self.data.get_sorted_identifiers(scored=True)
# Get all the proteins that we scored and the ones picked if picker was run...
data_proteins = sorted([x for x in self.data.protein_peptide_dictionary.keys() if x in identifiers_sorted])
# Get the set of peptides for each protein...
data_peptides = [set(self.data.protein_peptide_dictionary[x]) for x in data_proteins]
flat_peptides_in_data = set([item for sublist in data_peptides for item in sublist])
peptide_sets = []
# Loop over the list of peptides...
for k in range(len(data_peptides)):
raw_peptides = data_peptides[k]
peptide_set = set()
# Loop over each individual peptide per protein...
for peps in raw_peptides:
peptide = peps
# Remove mods...
new_peptide = Psm.remove_peptide_mods(peptide)
# Add it to a temporary set...
peptide_set.add(new_peptide)
# Append this set to a new list...
peptide_sets.append(peptide_set)
# Set that proteins peptides to be the unmodified ones...
data_peptides[k] = peptide_set
# Get them all...
all_peptides = [x for x in data_peptides]
# Remove redundant sets...
non_redundant_peptide_sets = [set(i) for i in OrderedDict.fromkeys(frozenset(item) for item in peptide_sets)]
# Loop over the restricted list of peptides...
ind_list = []
for pep_sets in non_redundant_peptide_sets:
# Get its index in terms of the overall list...
ind_list.append(all_peptides.index(pep_sets))
# Get the protein based on the index
restricted_proteins = [data_proteins[x] for x in range(len(data_peptides)) if x in ind_list]
# Here we get the list of all proteins
plist = []
for peps in pep_prot_dict.keys():
for prots in list(pep_prot_dict[peps]):
if prots in restricted_proteins and peps in flat_peptides_in_data:
plist.append(prots)
# Here we get the unique proteins
unique_prots = list(set(plist).union())
unique_protein_set = set(unique_prots)
unique_prots_sorted = [x for x in identifiers_sorted if x in unique_prots]
# Define the protein variables with a lower bound of 0 and category Integer
prots = pulp.LpVariable.dicts("prot", indices=unique_prots_sorted, lowBound=0, cat="Integer")
# Define our Lp Problem which is to Minimize our objective function
prob = pulp.LpProblem("Parsimony_Problem", pulp.LpMinimize)
# Define our objective function, which is to take the sum of all of our proteins and find the minimum set.
prob += pulp.lpSum([prots[i] for i in prots])
# Set up our constraints. The constraints are as follows:
# Loop over each peptide and determine the proteins it maps to...
# Each peptide is a constraint with the proteins it maps to having to be greater than or equal to 1
# In the case below we see that protein 3 has a unique peptide, protein 2 is redundant
logger.info("Sorting peptides before looping")
for peptides in sorted(list(pep_prot_dict.keys())):
try:
prob += (
pulp.lpSum([prots[i] for i in sorted(list(pep_prot_dict[peptides])) if i in unique_protein_set])
>= 1
)
except KeyError:
logger.info("Not including protein {} in pulp model".format(pep_prot_dict[peptides]))
prob.solve()
scored_data = self.data.get_protein_data()
scored_proteins = list(scored_data)
protein_finder = [x.identifier for x in scored_proteins]
lead_protein_objects = []
lead_protein_identifiers = []
for proteins in unique_prots_sorted:
parsimony_value = pulp.value(prots[proteins])
if proteins in protein_finder and parsimony_value == 1:
p_ind = protein_finder.index(proteins)
protein_object = scored_proteins[p_ind]
lead_protein_objects.append(protein_object)
lead_protein_identifiers.append(protein_object.identifier)
else:
if parsimony_value == 1:
# Why are some proteins not being found when we run exclusion???
logger.warning("Protein {} not found with protein finder...".format(proteins))
else:
pass
self.lead_protein_objects = lead_protein_objects
grouped_proteins = self._create_protein_groups(
all_scored_proteins=scored_data,
lead_protein_objects=self.lead_protein_objects,
grouping_type=self.data.parameter_file_object.grouping_type,
)
regrouped_proteins = self._swissprot_and_isoform_override(
scored_data=scored_data,
grouped_proteins=grouped_proteins,
override_type="soft",
isoform_override=True,
)
grouped_protein_objects = regrouped_proteins["grouped_protein_objects"]
protein_group_objects = regrouped_proteins["group_objects"]
# Get the higher or lower variable
hl = self.data.higher_or_lower()
logger.info("Sorting Results based on lead Protein Score")
grouped_protein_objects = datastore.DataStore.sort_protein_objects(
grouped_protein_objects=grouped_protein_objects, higher_or_lower=hl
)
protein_group_objects = datastore.DataStore.sort_protein_group_objects(
protein_group_objects=protein_group_objects, higher_or_lower=hl
)
# Run lead reassignment for the group objects and protein objects
protein_group_objects = self._reassign_protein_group_leads(
protein_group_objects=protein_group_objects,
)
grouped_protein_objects = self._reassign_protein_list_leads(grouped_protein_objects=grouped_protein_objects)
logger.info("Re Sorting Results based on lead Protein Score")
grouped_protein_objects = datastore.DataStore.sort_protein_objects(
grouped_protein_objects=grouped_protein_objects, higher_or_lower=hl
)
protein_group_objects = datastore.DataStore.sort_protein_group_objects(
protein_group_objects=protein_group_objects, higher_or_lower=hl
)
self.data.grouped_scored_proteins = grouped_protein_objects
self.data.protein_group_objects = protein_group_objects
def infer_proteins(self):
"""
This method performs the Parsimony inference method and uses pulp for the LP solver.
This method assigns the variables: `grouped_scored_proteins` and `protein_group_objects`.
These are both variables of the [DataStore object][pyproteininference.datastore.DataStore] and are
lists of [Protein][pyproteininference.physical.Protein] objects
and [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.
"""
if self.parameter_file_object.lp_solver == self.PULP:
self._pulp_grouper()
else:
raise ValueError(
"Parsimony cannot run if lp_solver parameter value is not one of the following: {}".format(
", ".join(Inference.LP_SOLVERS)
)
)
# Call assign shared peptides
self._assign_shared_peptides(shared_pep_type=self.parameter_file_object.shared_peptides)
def _assign_shared_peptides(self, shared_pep_type="all"):
if not self.data.grouped_scored_proteins or not self.data.protein_group_objects:
raise ValueError(
"Grouped Protein objects could not be found. Please run 'infer_proteins' method of the Parsimony class"
)
if shared_pep_type == self.ALL_SHARED_PEPTIDES:
pass
elif shared_pep_type == self.BEST_SHARED_PEPTIDES:
logger.info("Assigning Shared Peptides from Parsimony to the Best Scoring Protein")
raw_peptide_tracker = set()
peptide_tracker = set()
for prots in self.data.grouped_scored_proteins:
new_psms = []
new_raw_peptides = set()
new_peptides = set()
lead_prot = prots[0]
for psm in lead_prot.psms:
raw_pep = psm.identifier
pep = psm.non_flanking_peptide
if raw_pep not in raw_peptide_tracker:
new_raw_peptides.add(raw_pep)
raw_peptide_tracker.add(raw_pep)
if pep not in peptide_tracker:
new_peptides.add(pep)
new_psms.append(psm)
peptide_tracker.add(pep)
lead_prot.psms = new_psms
lead_prot.raw_peptides = new_raw_peptides
lead_prot.peptides = new_peptides
raw_peptide_tracker = set()
peptide_tracker = set()
for group in self.data.protein_group_objects:
lead_prot = group.proteins[0]
new_psms = []
new_raw_peptides = set()
new_peptides = set()
for psm in lead_prot.psms:
raw_pep = psm.identifier
pep = psm.non_flanking_peptide
if raw_pep not in raw_peptide_tracker:
new_raw_peptides.add(raw_pep)
raw_peptide_tracker.add(raw_pep)
if pep not in peptide_tracker:
new_peptides.add(pep)
new_psms.append(psm)
peptide_tracker.add(pep)
lead_prot.psms = new_psms
lead_prot.raw_peptides = new_raw_peptides
lead_prot.peptides = new_peptides
else:
pass
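The parsimony step in `_pulp_grouper` above asks for the smallest set of proteins such that every peptide maps to at least one selected protein. The sketch below illustrates that idea with a brute-force search over the standard library rather than the pulp LP solver the class actually uses; the function name `minimal_protein_cover` and the toy peptide-to-protein mapping are assumptions made for illustration only.

```python
from itertools import combinations


def minimal_protein_cover(pep_prot):
    """Return a smallest protein set that explains every peptide (brute force)."""
    proteins = sorted({prot for prots in pep_prot.values() for prot in prots})
    for size in range(1, len(proteins) + 1):
        for candidate in combinations(proteins, size):
            chosen = set(candidate)
            # Each peptide needs at least one chosen protein,
            # which mirrors the ">= 1" constraint in the LP model.
            if all(chosen & prots for prots in pep_prot.values()):
                return chosen
    return set()


# Hypothetical toy data: protein A maps to pep1-3, B to pep1-2, C to pep3-4.
pep_prot = {
    "pep1": {"A", "B"},
    "pep2": {"A", "B"},
    "pep3": {"A", "C"},
    "pep4": {"C"},
}
cover = minimal_protein_cover(pep_prot)
print(sorted(cover))
```

In this toy case both `{A, C}` and `{B, C}` are minimal covers, so a solver is free to return either one. That ambiguity is the situation the lead reassignment methods in the class source are meant to correct.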
__init__(self, data, digest)
special
Initialization method of the Parsimony object.
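Parameters:
Name | Type | Description |
---|---|---|
data |
DataStore |
DataStore Object. |
digest |
Digest |
Digest Object. |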
Source code in pyproteininference/inference.py
def __init__(self, data, digest):
    """
    Initialization method of the Parsimony object.

    Args:
        data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
        digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
    """
    self.data = data
    self.digest = digest
    self.data._validate_scored_proteins()
    self.scored_data = self.data.get_protein_data()
    self.lead_protein_set = None
    self.parameter_file_object = data.parameter_file_object
infer_proteins(self)
This method performs the Parsimony inference method and uses pulp for the LP solver.
This method assigns the variables: grouped_scored_proteins and protein_group_objects.
These are both variables of the DataStore object and are lists of Protein objects and ProteinGroup objects.
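Before these variables are assigned, the parsimony pipeline runs a lead reassignment pass over each group. The following is a minimal sketch of that rule as it reads in the class source: a group member replaces the lead when the lead's peptides are a subset of the member's, the member is not already a lead elsewhere, its score is at least as good, and it has strictly more peptides. `ToyProtein` and `reassign_leads` are hypothetical stand-ins, not part of the library.

```python
from dataclasses import dataclass, field


@dataclass
class ToyProtein:
    """Hypothetical stand-in for the library's Protein object."""
    identifier: str
    score: float
    peptides: set = field(default_factory=set)


def reassign_leads(groups, higher_is_better=True):
    """Swap a group's lead for a member that covers more peptides and scores at least as well."""
    lead_ids = {group[0].identifier for group in groups}
    for group in groups:
        for j in range(1, len(group)):
            old_lead, candidate = group[0], group[j]
            if higher_is_better:
                scores_ok = old_lead.score <= candidate.score
            else:
                scores_ok = old_lead.score >= candidate.score
            if (
                old_lead.peptides <= candidate.peptides
                and candidate.identifier not in lead_ids
                and scores_ok
                and len(old_lead.peptides) < len(candidate.peptides)
            ):
                lead_ids.add(candidate.identifier)
                lead_ids.discard(old_lead.identifier)
                # Swap positions so the better-covering protein leads the group.
                group[0], group[j] = candidate, old_lead
                break
    return groups


# Toy groups: B was picked as lead although A covers a superset of its peptides.
groups = [
    [ToyProtein("B", 10.0, {"pep1", "pep2"}), ToyProtein("A", 12.0, {"pep1", "pep2", "pep3"})],
    [ToyProtein("C", 20.0, {"pep3", "pep4"})],
]
reassign_leads(groups)
new_leads = [group[0].identifier for group in groups]
print(new_leads)
```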
Source code in pyproteininference/inference.py
def infer_proteins(self):
    """
    This method performs the Parsimony inference method and uses pulp for the LP solver.

    This method assigns the variables: `grouped_scored_proteins` and `protein_group_objects`.
    These are both variables of the [DataStore object][pyproteininference.datastore.DataStore] and are
    lists of [Protein][pyproteininference.physical.Protein] objects
    and [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.
    """
    if self.parameter_file_object.lp_solver == self.PULP:
        self._pulp_grouper()
    else:
        raise ValueError(
            "Parsimony cannot run if lp_solver parameter value is not one of the following: {}".format(
                ", ".join(Inference.LP_SOLVERS)
            )
        )
    # Call assign shared peptides
    self._assign_shared_peptides(shared_pep_type=self.parameter_file_object.shared_peptides)
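The final call to `_assign_shared_peptides` decides which lead keeps a peptide that several leads share. When the shared peptide type is the "best" option, the class source walks the groups in sorted order and lets each lead keep only peptides no earlier lead has claimed. A simplified sketch of that bookkeeping, working on plain peptide strings instead of Psm objects (the function name and input shape are assumptions):

```python
def assign_shared_to_best(lead_peptides):
    """Give each peptide to the first lead that claims it.

    lead_peptides: list of (identifier, peptides) pairs, assumed to be
    ordered with the best-scoring lead first.
    """
    seen = set()
    assigned = {}
    for identifier, peptides in lead_peptides:
        kept = []
        for peptide in peptides:
            if peptide not in seen:
                kept.append(peptide)
                seen.add(peptide)
        assigned[identifier] = kept
    return assigned


# Hypothetical leads: "TIDER" is shared, so only the better-scoring lead A keeps it.
leads = [
    ("A", ["PEPK", "TIDER"]),
    ("C", ["TIDER", "SEQR"]),
]
assigned = assign_shared_to_best(leads)
print(assigned)
```

With the "all" option the source leaves every lead's peptides untouched, so no equivalent step is needed.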
PeptideCentric (Inference)
PeptideCentric Inference class. This class contains methods that support the initialization of a PeptideCentric inference method.
Attributes:
Name | Type | Description |
---|---|---|
data |
DataStore |
DataStore Object. |
digest |
Digest |
Digest Object. |
Source code in pyproteininference/inference.py
class PeptideCentric(Inference):
"""
PeptideCentric Inference class. This class contains methods that support the initialization of a
PeptideCentric inference method.
Attributes:
data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
"""
def __init__(self, data, digest):
"""
PeptideCentric Inference initialization method.
Args:
data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
Returns:
object:
"""
self.data = data
self.digest = digest
self.data._validate_scored_proteins()
self.scored_data = self.data.get_protein_data()
def infer_proteins(self):
"""
This method performs the Peptide Centric inference method.
This method assigns the variables: `grouped_scored_proteins` and `protein_group_objects`.
These are both variables of the [DataStore object][pyproteininference.datastore.DataStore] and are
lists of [Protein][pyproteininference.physical.Protein] objects
and [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.
Returns:
None:
"""
# Get the higher or lower variable
hl = self.data.higher_or_lower()
logger.info("Applying Group ID's for the Peptide Centric Method")
regrouped_proteins = self._apply_protein_group_ids()
grouped_protein_objects = regrouped_proteins["grouped_protein_objects"]
protein_group_objects = regrouped_proteins["group_objects"]
logger.info("Sorting Results based on lead Protein Score")
grouped_protein_objects = datastore.DataStore.sort_protein_objects(
grouped_protein_objects=grouped_protein_objects, higher_or_lower=hl
)
protein_group_objects = datastore.DataStore.sort_protein_group_objects(
protein_group_objects=protein_group_objects, higher_or_lower=hl
)
self.data.grouped_scored_proteins = grouped_protein_objects
self.data.protein_group_objects = protein_group_objects
def _apply_protein_group_ids(self):
"""
This method creates the ProteinGroup objects for the peptide_centric inference based on protein groups
from [._create_protein_groups][pyproteininference.inference.Inference._create_protein_groups].
Returns:
dict: a Dictionary that contains a list of [ProteinGroup][pyproteininference.physical.ProteinGroup]
objects (key:"group_objects") and a list of grouped [Protein][pyproteininference.physical.Protein]
objects (key:"grouped_protein_objects").
"""
grouped_protein_objects = self.data.get_protein_data()
# Here we create group ID's
group_id = 0
list_of_proteins_grouped = []
protein_group_objects = []
for protein_group in grouped_protein_objects:
protein_group.peptides = set(
[Psm.split_peptide(peptide_string=x) for x in list(protein_group.raw_peptides)]
)
protein_list = []
group_id = group_id + 1
pg = ProteinGroup(group_id)
logger.debug("Created Protein Group with ID: {}".format(str(group_id)))
# The following loop assigns group_id's, reviewed/unreviewed status, and number of unique peptides...
if group_id not in protein_group.group_identification:
protein_group.group_identification.add(group_id)
protein_group.num_peptides = len(protein_group.peptides)
# Here append the number of unique peptides... so we can use this as secondary sorting...
protein_list.append(protein_group)
# Sorted protein_groups then becomes a list of lists... of protein objects
pg.proteins = protein_list
protein_group_objects.append(pg)
list_of_proteins_grouped.append([protein_group])
return_dict = {
"grouped_protein_objects": list_of_proteins_grouped,
"group_objects": protein_group_objects,
}
return return_dict
__init__(self, data, digest)
special
PeptideCentric Inference initialization method.
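Parameters:
Name | Type | Description |
---|---|---|
data | DataStore | DataStore Object. |
digest | Digest | Digest Object. |
Returns:
Type | Description |
---|---|
object | |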
Source code in pyproteininference/inference.py
def __init__(self, data, digest):
"""
PeptideCentric Inference initialization method.
Args:
data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
Returns:
object:
"""
self.data = data
self.digest = digest
self.data._validate_scored_proteins()
self.scored_data = self.data.get_protein_data()
infer_proteins(self)
This method performs the Peptide Centric inference method.
This method assigns the variables grouped_scored_proteins and protein_group_objects.
These are both variables of the DataStore object and are lists of Protein objects and ProteinGroup objects.
Returns:
Type | Description |
---|---|
None | |
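The group-ID step used by the peptide-centric method can be illustrated with a minimal, self-contained sketch. `ToyProtein`, `ToyGroup`, `apply_group_ids`, and `sort_by_lead_score` are hypothetical stand-ins written for this example, not part of pyproteininference. The sketch skips the `Psm.split_peptide` call and assumes that sorting uses the score of the first (lead) protein in each group, as the log message in the source suggests.

```python
from dataclasses import dataclass, field


@dataclass
class ToyProtein:
    """Hypothetical stand-in for a scored Protein object."""
    identifier: str
    score: float
    raw_peptides: set
    group_identification: set = field(default_factory=set)
    num_peptides: int = 0


@dataclass
class ToyGroup:
    """Hypothetical stand-in for a ProteinGroup object."""
    number_id: int
    proteins: list = field(default_factory=list)


def apply_group_ids(proteins):
    """Give every scored protein its own group with a sequential ID starting at 1."""
    groups, grouped = [], []
    for group_id, protein in enumerate(proteins, start=1):
        protein.group_identification.add(group_id)
        # Simplification: count raw peptides directly instead of splitting flanking residues.
        protein.num_peptides = len(protein.raw_peptides)
        groups.append(ToyGroup(group_id, [protein]))
        grouped.append([protein])
    return {"grouped_protein_objects": grouped, "group_objects": groups}


def sort_by_lead_score(grouped, higher_or_lower):
    """Assumed behavior: order groups by the score of their lead (first) protein."""
    return sorted(grouped, key=lambda g: g[0].score, reverse=(higher_or_lower == "higher"))


toy = [ToyProtein("A", 2.0, {"PEPTIDEK", "SAMPLER"}), ToyProtein("B", 5.0, {"ANOTHERK"})]
result = apply_group_ids(toy)
ranked = sort_by_lead_score(result["grouped_protein_objects"], "higher")
```

Each entry of `grouped_protein_objects` is a one-element list, which mirrors the "list of lists" shape that the source comment describes.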
Source code in pyproteininference/inference.py
def infer_proteins(self):
"""
This method performs the Peptide Centric inference method.
This method assigns the variables: `grouped_scored_proteins` and `protein_group_objects`.
These are both variables of the [DataStore object][pyproteininference.datastore.DataStore] and are
lists of [Protein][pyproteininference.physical.Protein] objects
and [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.
Returns:
None:
"""
# Get the higher or lower variable
hl = self.data.higher_or_lower()
logger.info("Applying Group ID's for the Peptide Centric Method")
regrouped_proteins = self._apply_protein_group_ids()
grouped_protein_objects = regrouped_proteins["grouped_protein_objects"]
protein_group_objects = regrouped_proteins["group_objects"]
logger.info("Sorting Results based on lead Protein Score")
grouped_protein_objects = datastore.DataStore.sort_protein_objects(
grouped_protein_objects=grouped_protein_objects, higher_or_lower=hl
)
protein_group_objects = datastore.DataStore.sort_protein_group_objects(
protein_group_objects=protein_group_objects, higher_or_lower=hl
)
self.data.grouped_scored_proteins = grouped_protein_objects
self.data.protein_group_objects = protein_group_objects
parameters
ProteinInferenceParameter
Class that handles data retrieval, storage, and validation of Protein Inference Parameters.
Attributes:
Name | Type | Description |
---|---|---|
yaml_param_filepath | str | Path to a properly formatted parameter file specific to Protein Inference. |
digest_type | str | String that determines the type of in silico digestion for the Digest object. Typically "trypsin". |
export | str | String to indicate the export type for the Export object. Typically this is "psms", "peptides", or "psm_ids". |
fdr | float | Float to indicate FDR filtering. |
missed_cleavages | int | Integer to determine the number of missed cleavages in the database digestion Digest object. |
picker | bool | True/False on whether or not to run the protein picker algorithm. |
restrict_pep | float/None | Float to restrict the posterior error probability values by in the PSM input. Used in restrict_psm_data. |
restrict_peptide_length | int/None | Integer to restrict the peptide length values by in the PSM input. Used in restrict_psm_data. |
restrict_q | float/None | Float to restrict the q values by in the PSM input. Used in restrict_psm_data. |
restrict_custom | float/None | Float to restrict the custom values by in the PSM input. Used in restrict_psm_data. Filtering depends on the score_type variable. If score_type is multiplicative, values that are less than restrict_custom are kept. If score_type is additive, values that are more than restrict_custom are kept. |
protein_score | str | String that determines how Proteins are scored. Can be any of the SCORE_METHODS in the Score object. |
psm_score_type | str | String that determines the type of the PSM scores (Additive or Multiplicative). Can be any of the SCORE_TYPES in the Score object. |
decoy_symbol | str | String to denote decoy proteins from target proteins, e.g. "##". |
isoform_symbol | str | String to denote isoforms from regular proteins, e.g. "-". Can also be None. |
reviewed_identifier_symbol | str | String to denote a "Reviewed" Protein. Typically this is "sp\|" if using a Uniprot Fasta database. |
inference_type | str | String to determine the inference procedure. Can be any value of INFERENCE_TYPES of the Inference object. |
tag | str | String to be added to output files. |
psm_score | str | String that indicates the PSM input score. The value should match the string in the input data of the score you want to use for PSM score. This score will be used in the scoring methods of the Score object. |
grouping_type | str/None | String to determine the grouping procedure. Can be any value of GROUPING_TYPES of the Inference object. |
max_identifiers_peptide_centric | int | Maximum number of identifiers to assign to a group when running peptide_centric inference. Typically this is 10 or 5. |
lp_solver | str/None | The LP solver to use if inference_type="Parsimony". Can be any value in LP_SOLVERS in the Inference object. |
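As a rough guide to how these attributes map onto a parameter file, the sketch below writes out the nested layout implied by the class's key constants, filled with the class's `DEFAULT_*` values, as a plain Python dictionary (the YAML file would carry the same nesting). The `lookup` helper is hypothetical and only imitates the try/except-KeyError fallback that `convert_to_object` applies per parameter; it is not a function of the library.

```python
# Nested layout implied by the *_PARAMETER_KEY constants; values are the DEFAULT_* constants.
DEFAULT_LAYOUT = {
    "parameters": {
        "general": {
            "export": "peptides",
            "fdr": 0.01,
            "picker": True,
            "tag": "py_protein_inference",
        },
        "data_restriction": {
            "pep_restriction": 0.9,
            "peptide_length_restriction": 7,
            "q_value_restriction": 0.005,
            "custom_restriction": "None",
        },
        "score": {
            "protein_score": "multiplicative_log",
            "psm_score": "posterior_error_prob",
            "psm_score_type": "multiplicative",
        },
        "identifiers": {
            "decoy_symbol": "##",
            "isoform_symbol": "-",
            "reviewed_identifier_symbol": "sp|",
        },
        "inference": {
            "inference_type": "peptide_centric",
            "grouping_type": "shared_peptides",
        },
        "digest": {"digest_type": "trypsin", "missed_cleavages": 3},
        "parsimony": {"lp_solver": "pulp", "shared_peptides": "all"},
        "peptide_centric": {"max_identifiers": 5},
    }
}


def lookup(params, section, key, default):
    """Hypothetical helper: read params['parameters'][section][key], else fall back."""
    try:
        return params["parameters"][section][key]
    except KeyError:
        # A missing section or key keeps the default, as in convert_to_object.
        return default


# A partial file that only overrides the FDR threshold.
partial = {"parameters": {"general": {"fdr": 0.05}}}
fdr = lookup(partial, "general", "fdr", 0.01)
export = lookup(partial, "general", "export", "peptides")
```

Note that the fallback applies per parameter, so a partial file overrides only the values it names.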
Source code in pyproteininference/parameters.py
class ProteinInferenceParameter(object):
"""
Class that handles data retrieval, storage, and validation of Protein Inference Parameters.
Attributes:
yaml_param_filepath (str): path to properly formatted parameter file specific to Protein Inference.
digest_type (str): String that determines the type of in silico digestion for
[Digest object][pyproteininference.in_silico_digest.Digest]. Typically "trypsin".
export (str): String to indicate the export type for [Export object][pyproteininference.export.Export].
Typically this is "psms", "peptides", or "psm_ids".
fdr (float): Float to indicate FDR filtering.
missed_cleavages (int): Integer to determine the number of missed cleavages in the database digestion
[Digest object][pyproteininference.in_silico_digest.Digest].
picker (bool): True/False on whether or not to run
the [protein picker][pyproteininference.datastore.DataStore.protein_picker] algorithm.
restrict_pep (float/None): Float to restrict the posterior error probability values by in the PSM input.
Used in [restrict_psm_data][pyproteininference.datastore.DataStore.restrict_psm_data].
restrict_peptide_length (int/None): Integer to restrict the peptide length values by in the PSM input.
Used in [restrict_psm_data][pyproteininference.datastore.DataStore.restrict_psm_data].
restrict_q (float/None): Float to restrict the q values by in the PSM input.
Used in [restrict_psm_data][pyproteininference.datastore.DataStore.restrict_psm_data].
restrict_custom (float/None): Float to restrict the custom values by in the PSM input.
Used in [restrict_psm_data][pyproteininference.datastore.DataStore.restrict_psm_data].
Filtering depends on score_type variable. If score_type is multiplicative then values that are less than
restrict_custom are kept. If score_type is additive then values that are more than restrict_custom are kept.
protein_score (str): String to determine the way in which Proteins are scored can be any of the SCORE_METHODS
in [Score object][pyproteininference.scoring.Score].
psm_score_type (str): String to determine the type of score that the PSM scores are
(Additive or Multiplicative) can be any of the SCORE_TYPES
in [Score object][pyproteininference.scoring.Score].
decoy_symbol (str): String to denote decoy proteins from target proteins. IE "##".
isoform_symbol (str): String to denote isoforms from regular proteins. IE "-". Can also be None.
reviewed_identifier_symbol (str): String to denote a "Reviewed" Protein. Typically this is: "sp|"
if using Uniprot Fasta database.
inference_type (str): String to determine the inference procedure. Can be any value of INFERENCE_TYPES
of [Inference object][pyproteininference.inference.Inference].
tag (str): String to be added to output files.
psm_score (str): String that indicates the PSM input score. The value should match the string in the
input data of the score you want to use for PSM score. This score will be used in scoring methods
here: [Score object][pyproteininference.scoring.Score].
grouping_type (str/None): String to determine the grouping procedure. Can be any value of
GROUPING_TYPES of [Inference object][pyproteininference.inference.Inference].
max_identifiers_peptide_centric (int): Maximum number of identifiers to assign to a group when
running peptide_centric inference. Typically this is 10 or 5.
lp_solver (str/None): The LP solver to use if inference_type="Parsimony".
Can be any value in LP_SOLVERS in the [Inference object][pyproteininference.inference.Inference].
"""
PARENT_PARAMETER_KEY = "parameters"
GENERAL_PARAMETER_KEY = "general"
DATA_RESTRICTION_PARAMETER_KEY = "data_restriction"
SCORE_PARAMETER_KEY = "score"
IDENTIFIERS_PARAMETER_KEY = "identifiers"
INFERENCE_PARAMETER_KEY = "inference"
DIGEST_PARAMETER_KEY = "digest"
PARSIMONY_PARAMETER_KEY = "parsimony"
PEPTIDE_CENTRIC_PARAMETER_KEY = "peptide_centric"
PARAMETER_MAIN_KEYS = {
GENERAL_PARAMETER_KEY,
DATA_RESTRICTION_PARAMETER_KEY,
SCORE_PARAMETER_KEY,
IDENTIFIERS_PARAMETER_KEY,
INFERENCE_PARAMETER_KEY,
DIGEST_PARAMETER_KEY,
PARSIMONY_PARAMETER_KEY,
PEPTIDE_CENTRIC_PARAMETER_KEY,
}
EXPORT_PARAMETER = "export"
FDR_PARAMETER = "fdr"
PICKER_PARAMETER = "picker"
TAG_PARAMETER = "tag"
GENERAL_PARAMETER_SUB_KEYS = {
EXPORT_PARAMETER,
FDR_PARAMETER,
PICKER_PARAMETER,
TAG_PARAMETER,
}
PEP_RESTRICT_PARAMETER = "pep_restriction"
PEPTIDE_LENGTH_RESTRICT_PARAMETER = "peptide_length_restriction"
Q_VALUE_RESTRICT_PARAMETER = "q_value_restriction"
CUSTOM_RESTRICT_PARAMETER = "custom_restriction"
DATA_RESTRICTION_PARAMETER_SUB_KEYS = {
PEP_RESTRICT_PARAMETER,
PEPTIDE_LENGTH_RESTRICT_PARAMETER,
Q_VALUE_RESTRICT_PARAMETER,
CUSTOM_RESTRICT_PARAMETER,
}
PROTEIN_SCORE_PARAMETER = "protein_score"
PSM_SCORE_PARAMETER = "psm_score"
PSM_SCORE_TYPE_PARAMETER = "psm_score_type"
SCORE_PARAMETER_SUB_KEYS = {
PROTEIN_SCORE_PARAMETER,
PSM_SCORE_PARAMETER,
PSM_SCORE_TYPE_PARAMETER,
}
DECOY_SYMBOL_PARAMETER = "decoy_symbol"
ISOFORM_SYMBOL_PARAMETER = "isoform_symbol"
REVIEWED_IDENTIFIER_PARAMETER = "reviewed_identifier_symbol"
IDENTIFIER_SUB_KEYS = {
DECOY_SYMBOL_PARAMETER,
ISOFORM_SYMBOL_PARAMETER,
REVIEWED_IDENTIFIER_PARAMETER,
}
INFERENCE_TYPE_PARAMETER = "inference_type"
GROUPING_TYPE_PARAMETER = "grouping_type"
INFERENCE_SUB_KEYS = {INFERENCE_TYPE_PARAMETER, GROUPING_TYPE_PARAMETER}
DIGEST_TYPE_PARAMETER = "digest_type"
MISSED_CLEAV_PARAMETER = "missed_cleavages"
DIGEST_SUB_KEYS = {DIGEST_TYPE_PARAMETER, MISSED_CLEAV_PARAMETER}
LP_SOLVER_PARAMETER = "lp_solver"
SHARED_PEPTIDES_PARAMETER = "shared_peptides"
PARSIMONY_SUB_KEYS = {
LP_SOLVER_PARAMETER,
SHARED_PEPTIDES_PARAMETER,
}
MAX_IDENTIFIERS_PARAMETER = "max_identifiers"
PEPTIDE_CENTRIC_SUB_KEYS = {MAX_IDENTIFIERS_PARAMETER}
DEFAULT_DIGEST_TYPE = "trypsin"
DEFAULT_EXPORT = "peptides"
DEFAULT_FDR = 0.01
DEFAULT_MISSED_CLEAVAGES = 3
DEFAULT_PICKER = True
DEFAULT_RESTRICT_PEP = 0.9
DEFAULT_RESTRICT_PEPTIDE_LENGTH = 7
DEFAULT_RESTRICT_Q = 0.005
DEFAULT_RESTRICT_CUSTOM = "None"
DEFAULT_PROTEIN_SCORE = "multiplicative_log"
DEFAULT_PSM_SCORE = "posterior_error_prob"
DEFAULT_DECOY_SYMBOL = "##"
DEFAULT_ISOFORM_SYMBOL = "-"
DEFAULT_REVIEWED_IDENTIFIER_SYMBOL = "sp|"
DEFAULT_INFERENCE_TYPE = "peptide_centric"
DEFAULT_TAG = "py_protein_inference"
DEFAULT_PSM_SCORE_TYPE = "multiplicative"
DEFAULT_GROUPING_TYPE = "shared_peptides"
DEFAULT_MAX_IDENTIFIERS_PEPTIDE_CENTRIC = 5
DEFAULT_LP_SOLVER = "pulp"
DEFAULT_SHARED_PEPTIDES = "all"
def __init__(self, yaml_param_filepath, validate=True):
"""Class to store Protein Inference parameter information as an object.
Args:
yaml_param_filepath (str): path to properly formatted parameter file specific to Protein Inference.
validate (bool): True/False on whether to validate the parameter file of interest.
Returns:
None:
Example:
>>> pyproteininference.parameters.ProteinInferenceParameter(
...     yaml_param_filepath="/path/to/pyproteininference_params.yaml", validate=True
... )
"""
self.yaml_param_filepath = yaml_param_filepath
self.digest_type = self.DEFAULT_DIGEST_TYPE
self.export = self.DEFAULT_EXPORT
self.fdr = self.DEFAULT_FDR
self.missed_cleavages = self.DEFAULT_MISSED_CLEAVAGES
self.picker = self.DEFAULT_PICKER
self.restrict_pep = self.DEFAULT_RESTRICT_PEP
self.restrict_peptide_length = self.DEFAULT_RESTRICT_PEPTIDE_LENGTH
self.restrict_q = self.DEFAULT_RESTRICT_Q
self.restrict_custom = self.DEFAULT_RESTRICT_CUSTOM
self.protein_score = self.DEFAULT_PROTEIN_SCORE
self.psm_score_type = self.DEFAULT_PSM_SCORE_TYPE
self.decoy_symbol = self.DEFAULT_DECOY_SYMBOL
self.isoform_symbol = self.DEFAULT_ISOFORM_SYMBOL
self.reviewed_identifier_symbol = self.DEFAULT_REVIEWED_IDENTIFIER_SYMBOL
self.inference_type = self.DEFAULT_INFERENCE_TYPE
self.tag = self.DEFAULT_TAG
self.psm_score = self.DEFAULT_PSM_SCORE
self.grouping_type = self.DEFAULT_GROUPING_TYPE
self.max_identifiers_peptide_centric = self.DEFAULT_MAX_IDENTIFIERS_PEPTIDE_CENTRIC
self.lp_solver = self.DEFAULT_LP_SOLVER
self.shared_peptides = self.DEFAULT_SHARED_PEPTIDES
self.validate = validate
self.convert_to_object()
if validate:
self.validate_parameters()
self._fix_none_parameters()
def convert_to_object(self):
"""
Function that takes a Protein Inference parameter file and converts it into a ProteinInferenceParameter object
by assigning all Attributes of the ProteinInferenceParameter object.
If no parameter filepath is supplied the parameter object will be loaded with default params.
This function is run during the initialization of the ProteinInferenceParameter object.
Returns:
None:
"""
if self.yaml_param_filepath:
with open(self.yaml_param_filepath, "r") as stream:
yaml_params = yaml.load(stream, Loader=yaml.Loader)
try:
self.digest_type = yaml_params[self.PARENT_PARAMETER_KEY][self.DIGEST_PARAMETER_KEY][
self.DIGEST_TYPE_PARAMETER
]
except KeyError:
logger.warning("digest_type set to default of {}".format(self.DEFAULT_DIGEST_TYPE))
try:
self.export = yaml_params[self.PARENT_PARAMETER_KEY][self.GENERAL_PARAMETER_KEY][self.EXPORT_PARAMETER]
except KeyError:
logger.warning("export set to default of {}".format(self.DEFAULT_EXPORT))
try:
self.fdr = yaml_params[self.PARENT_PARAMETER_KEY][self.GENERAL_PARAMETER_KEY][self.FDR_PARAMETER]
except KeyError:
logger.warning("fdr set to default of {}".format(self.DEFAULT_FDR))
try:
self.missed_cleavages = yaml_params[self.PARENT_PARAMETER_KEY][self.DIGEST_PARAMETER_KEY][
self.MISSED_CLEAV_PARAMETER
]
except KeyError:
logger.warning("missed_cleavages set to default of {}".format(self.DEFAULT_MISSED_CLEAVAGES))
try:
self.picker = yaml_params[self.PARENT_PARAMETER_KEY][self.GENERAL_PARAMETER_KEY][self.PICKER_PARAMETER]
except KeyError:
logger.warning("picker set to default of {}".format(self.DEFAULT_PICKER))
try:
self.restrict_pep = yaml_params[self.PARENT_PARAMETER_KEY][self.DATA_RESTRICTION_PARAMETER_KEY][
self.PEP_RESTRICT_PARAMETER
]
except KeyError:
logger.warning("restrict_pep set to default of {}".format(self.DEFAULT_RESTRICT_PEP))
try:
self.restrict_peptide_length = yaml_params[self.PARENT_PARAMETER_KEY][
self.DATA_RESTRICTION_PARAMETER_KEY
][self.PEPTIDE_LENGTH_RESTRICT_PARAMETER]
except KeyError:
logger.warning(
"restrict_peptide_length set to default of {}".format(self.DEFAULT_RESTRICT_PEPTIDE_LENGTH)
)
try:
self.restrict_q = yaml_params[self.PARENT_PARAMETER_KEY][self.DATA_RESTRICTION_PARAMETER_KEY][
self.Q_VALUE_RESTRICT_PARAMETER
]
except KeyError:
logger.warning("restrict_q set to default of {}".format(self.DEFAULT_RESTRICT_Q))
try:
self.restrict_custom = yaml_params[self.PARENT_PARAMETER_KEY][self.DATA_RESTRICTION_PARAMETER_KEY][
self.CUSTOM_RESTRICT_PARAMETER
]
except KeyError:
logger.warning("restrict_custom set to default of {}".format(self.DEFAULT_RESTRICT_CUSTOM))
try:
self.protein_score = yaml_params[self.PARENT_PARAMETER_KEY][self.SCORE_PARAMETER_KEY][
self.PROTEIN_SCORE_PARAMETER
]
except KeyError:
logger.warning("protein_score set to default of {}".format(self.DEFAULT_PROTEIN_SCORE))
try:
self.psm_score_type = yaml_params[self.PARENT_PARAMETER_KEY][self.SCORE_PARAMETER_KEY][
self.PSM_SCORE_TYPE_PARAMETER
]
except KeyError:
logger.warning("psm_score_type set to default of {}".format(self.DEFAULT_PSM_SCORE_TYPE))
try:
self.decoy_symbol = yaml_params[self.PARENT_PARAMETER_KEY][self.IDENTIFIERS_PARAMETER_KEY][
self.DECOY_SYMBOL_PARAMETER
]
except KeyError:
logger.warning("decoy_symbol set to default of {}".format(self.DEFAULT_DECOY_SYMBOL))
try:
self.isoform_symbol = yaml_params[self.PARENT_PARAMETER_KEY][self.IDENTIFIERS_PARAMETER_KEY][
self.ISOFORM_SYMBOL_PARAMETER
]
except KeyError:
logger.warning("isoform_symbol set to default of {}".format(self.DEFAULT_ISOFORM_SYMBOL))
try:
self.reviewed_identifier_symbol = yaml_params[self.PARENT_PARAMETER_KEY][
self.IDENTIFIERS_PARAMETER_KEY
][self.REVIEWED_IDENTIFIER_PARAMETER]
except KeyError:
logger.warning(
"reviewed_identifier_symbol set to default of {}".format(self.DEFAULT_REVIEWED_IDENTIFIER_SYMBOL)
)
try:
self.inference_type = yaml_params[self.PARENT_PARAMETER_KEY][self.INFERENCE_PARAMETER_KEY][
self.INFERENCE_TYPE_PARAMETER
]
except KeyError:
logger.warning("inference_type set to default of {}".format(self.DEFAULT_INFERENCE_TYPE))
try:
self.tag = yaml_params[self.PARENT_PARAMETER_KEY][self.GENERAL_PARAMETER_KEY][self.TAG_PARAMETER]
except KeyError:
logger.warning("tag set to default of {}".format(self.DEFAULT_TAG))
try:
self.psm_score = yaml_params[self.PARENT_PARAMETER_KEY][self.SCORE_PARAMETER_KEY][
self.PSM_SCORE_PARAMETER
]
except KeyError:
logger.warning("psm_score set to default of {}".format(self.DEFAULT_PSM_SCORE))
try:
self.grouping_type = yaml_params[self.PARENT_PARAMETER_KEY][self.INFERENCE_PARAMETER_KEY][
self.GROUPING_TYPE_PARAMETER
]
except KeyError:
logger.warning("grouping_type set to default of {}".format(self.DEFAULT_GROUPING_TYPE))
try:
self.max_identifiers_peptide_centric = yaml_params[self.PARENT_PARAMETER_KEY][
self.PEPTIDE_CENTRIC_PARAMETER_KEY
][self.MAX_IDENTIFIERS_PARAMETER]
except KeyError:
logger.warning(
"max_identifiers_peptide_centric set to default of {}".format(
self.DEFAULT_MAX_IDENTIFIERS_PEPTIDE_CENTRIC
)
)
try:
self.lp_solver = yaml_params[self.PARENT_PARAMETER_KEY][self.PARSIMONY_PARAMETER_KEY][
self.LP_SOLVER_PARAMETER
]
except KeyError:
logger.warning("lp_solver set to default of {}".format(self.DEFAULT_LP_SOLVER))
try:
# Do try except here to make old param files backwards compatible
self.shared_peptides = yaml_params[self.PARENT_PARAMETER_KEY][self.PARSIMONY_PARAMETER_KEY][
self.SHARED_PEPTIDES_PARAMETER
]
except KeyError:
logger.warning("shared_peptides set to default of {}".format(self.DEFAULT_SHARED_PEPTIDES))
else:
logger.warning("Yaml parameter file not found, all parameters set to default")
def validate_parameters(self):
"""
Method to validate all parameters.
Returns:
None:
"""
# Run all of the parameter validations
self._validate_digest_type()
self._validate_export_type()
self._validate_floats()
self._validate_bools()
self._validate_score_type()
self._validate_score_method()
self._validate_score_combination()
self._validate_inference_type()
self._validate_grouping_type()
self._validate_max_id()
self._validate_lp_solver()
self._validate_identifiers()
self._validate_parsimony_shared_peptides()
def _validate_digest_type(self):
"""
Internal ProteinInferenceParameter method to validate the digest type.
"""
# Make sure we have a valid digest type
if self.digest_type in PyteomicsDigest.LIST_OF_DIGEST_TYPES:
logger.info("Using digest type '{}'".format(self.digest_type))
else:
raise ValueError(
"Digest Type '{}' not supported, please use one of the following enzyme digestions: '{}'".format(
self.digest_type, ", ".join(PyteomicsDigest.LIST_OF_DIGEST_TYPES)
)
)
def _validate_export_type(self):
"""
Internal ProteinInferenceParameter method to validate the export type.
"""
# Make sure we have a valid export type
if self.export in Export.EXPORT_TYPES:
logger.info("Using Export type '{}'".format(self.export))
else:
raise ValueError(
"Export Type '{}' not supported, please use one of the following export types: '{}'".format(
self.export, ", ".join(Export.EXPORT_TYPES)
)
)
pass
def _validate_floats(self):
"""
Internal ProteinInferenceParameter method to validate floats.
"""
# Validate that FDR, cleavages, and restrict values are all floats and or ints if they need to be
try:
if 0 <= float(self.fdr) <= 1:
logger.info("FDR Input {}".format(self.fdr))
except ValueError:
raise ValueError("FDR must be a decimal between 0 and 1, FDR provided: {}".format(self.fdr))
try:
if 0 <= float(self.restrict_pep) <= 1:
logger.info("PEP restriction {}".format(self.restrict_pep))
except ValueError:
if not self.restrict_pep or self.restrict_pep.lower() == "none":
self.restrict_pep = None
logger.info("Not restricting by PEP Value")
else:
raise ValueError(
"PEP restriction must be a decimal between 0 and 1, PEP restriction provided: {}".format(
self.restrict_pep
)
)
try:
if 0 <= float(self.restrict_q) <= 1:
logger.info("Q Value restriction {}".format(self.restrict_q))
except ValueError:
if not self.restrict_q or self.restrict_q.lower() == "none":
self.restrict_q = None
logger.info("Not restricting by Q Value")
else:
raise ValueError(
"Q Value restriction must be a decimal between 0 and 1, Q Value restriction provided: {}".format(
self.restrict_q
)
)
try:
int(self.missed_cleavages)
logger.info("Missed Cleavages selected: {}".format(self.missed_cleavages))
except ValueError:
raise ValueError(
"Missed Cleavages must be an integer, Provided Missed Cleavages value: {}".format(self.missed_cleavages)
)
try:
int(self.restrict_peptide_length)
logger.info("Peptide Length Restriction: Len {}".format(self.restrict_peptide_length))
except ValueError:
if not self.restrict_peptide_length or self.restrict_peptide_length.lower() == "none":
self.restrict_peptide_length = None
logger.info("Not Restricting by Peptide Length")
else:
raise ValueError(
"Peptide Length Restriction must be an integer, "
"Provided Peptide Length Restriction value: {}".format(self.restrict_peptide_length)
)
try:
float(self.restrict_custom)
logger.info("Custom restriction {}".format(self.restrict_custom))
# Catch both exception types; "ValueError or TypeError" evaluates to ValueError only
except (ValueError, TypeError):
if not self.restrict_custom or self.restrict_custom.lower() == "none":
self.restrict_custom = None
logger.info("Not Restricting by Custom Value")
else:
raise ValueError(
"Custom restriction must be a number, Custom restriction provided: {}".format(self.restrict_custom)
)
def _validate_bools(self):
"""
Internal ProteinInferenceParameter method to validate the bools.
"""
# Make sure picker is a bool
if type(self.picker) == bool:
if self.picker:
logger.info("Parameters loaded to run Picker")
else:
logger.info("Parameters loaded to NOT run Picker")
else:
raise ValueError(
"Picker Variable must be set to True or False, Picker Variable provided: {}".format(self.picker)
)
def _validate_score_method(self):
"""
Internal ProteinInferenceParameter method to validate the score method.
"""
# Make sure we have the score method defined in code to use...
if self.protein_score in Score.SCORE_METHODS:
logger.info("Using Score Method '{}'".format(self.protein_score))
else:
raise ValueError(
"Score Method '{}' not supported, "
"please use one of the following Score Methods: '{}'".format(
self.protein_score, ", ".join(Score.SCORE_METHODS)
)
)
def _validate_score_type(self):
"""
Internal ProteinInferenceParameter method to validate the score type.
"""
# Make sure score type is multiplicative or additive
if self.psm_score_type in Score.SCORE_TYPES:
logger.info("Using Score Type '{}'".format(self.psm_score_type))
else:
raise ValueError(
"Score Type '{}' not supported, "
"please use one of the following Score Types: '{}'".format(
self.psm_score_type, ", ".join(Score.SCORE_TYPES)
)
)
def _validate_score_combination(self):
"""
Internal ProteinInferenceParameter method to validate combination of score method and score type.
"""
# Check to see if combination of score (column), method(multiplicative log, additive),
# and score type (multiplicative/additive) is possible...
# This will be super custom
if self.psm_score_type == Score.ADDITIVE_SCORE_TYPE and self.protein_score != Score.ADDITIVE:
raise ValueError(
"If Score type is 'additive' (Higher PSM score is better) then you must use the 'additive' score method"
)
elif self.psm_score_type == Score.MULTIPLICATIVE_SCORE_TYPE and self.protein_score == Score.ADDITIVE:
raise ValueError(
"If Score type is 'multiplicative' (Lower PSM score is better) "
"then you must NOT use the 'additive' score method please "
"select one of the following score methods: {}".format(
", ".join([x for x in Score.SCORE_METHODS if x != "additive"])
)
)
else:
logger.info(
"Combination of Score Type: '{}' and Score Method: '{}' is Ok".format(
self.psm_score_type, self.protein_score
)
)
def _validate_inference_type(self):
"""
Internal ProteinInferenceParameter method to validate the inference type.
"""
# Check if its parsimony, exclusion, inclusion, none
if self.inference_type in Inference.INFERENCE_TYPES:
logger.info("Using inference type '{}'".format(self.inference_type))
else:
raise ValueError(
"Inference Type '{}' not supported, please use one of the following Inference Types: '{}'".format(
self.inference_type, ", ".join(Inference.INFERENCE_TYPES)
)
)
def _validate_grouping_type(self):
"""
Internal ProteinInferenceParameter method to validate the grouping type.
"""
# Check if the grouping type is one of the supported grouping types, or None
if self.grouping_type in Inference.GROUPING_TYPES:
logger.info("Using Grouping type '{}'".format(self.grouping_type))
else:
# Test for an empty/None value first so .lower() is never called on None
if not self.grouping_type or self.grouping_type.lower() == "none":
self.grouping_type = None
logger.info("Using Grouping type: None")
else:
raise ValueError(
"Grouping Type '{}' not supported, please use one of the following Grouping Types: '{}'".format(
self.grouping_type, Inference.GROUPING_TYPES
)
)
def _validate_max_id(self):
"""
Internal ProteinInferenceParameter method to validate the max peptide centric id.
"""
# Check if max_identifiers_peptide_centric param is an INT
if type(self.max_identifiers_peptide_centric) == int:
logger.info(
"Max Number of Identifiers for Peptide Centric Inference: '{}'".format(
self.max_identifiers_peptide_centric
)
)
else:
raise ValueError(
"Max Number of Identifiers for Peptide Centric Inference must be an integer, "
"provided value: {}".format(self.max_identifiers_peptide_centric)
)
def _validate_lp_solver(self):
"""
Internal ProteinInferenceParameter method to validate the lp solver.
"""
# Check if its pulp or None
if self.lp_solver in Inference.LP_SOLVERS:
logger.info("Using LP Solver '{}'".format(self.lp_solver))
else:
# Test for an empty/None value first so .lower() is never called on None
if not self.lp_solver or self.lp_solver.lower() == "none":
self.lp_solver = None
logger.info("Setting LP Solver to None")
else:
raise ValueError(
"LP Solver '{}' not supported, please use one of the following LP Solvers: '{}'".format(
self.lp_solver, ", ".join(Inference.LP_SOLVERS)
)
)
def _validate_parsimony_shared_peptides(self):
"""
Internal ProteinInferenceParameter method to validate the shared peptides parameter.
"""
# Check if its all, best, or none
if self.shared_peptides in Inference.SHARED_PEPTIDE_TYPES:
logger.info("Using Shared Peptide types '{}'".format(self.shared_peptides))
else:
# Test for an empty/None value first so .lower() is never called on None
if not self.shared_peptides or self.shared_peptides.lower() == "none":
self.shared_peptides = None
logger.info("Setting Shared Peptide type to None")
else:
raise ValueError(
"Shared Peptide types '{}' not supported, please use one of the following "
"Shared Peptide types: '{}'".format(self.shared_peptides, Inference.SHARED_PEPTIDE_TYPES)
)
def _validate_identifiers(self):
"""
Internal ProteinInferenceParameter method to validate the decoy symbol, isoform symbol,
and reviewed identifier symbol.
"""
if type(self.decoy_symbol) == str:
logger.info("Decoy Symbol set to: '{}'".format(self.decoy_symbol))
else:
raise ValueError("Decoy Symbol must be a string, provided value: {}".format(self.decoy_symbol))
if type(self.isoform_symbol) == str:
logger.info("Isoform Symbol set to: '{}'".format(self.isoform_symbol))
if self.isoform_symbol.lower() == "none" or not self.isoform_symbol:
self.isoform_symbol = None
logger.info("Isoform Symbol set to None")
else:
if self.isoform_symbol:
self.isoform_symbol = None
logger.info("Isoform Symbol set to None")
raise ValueError("Isoform Symbol must be a string, provided value: {}".format(self.isoform_symbol))
if type(self.reviewed_identifier_symbol) == str:
logger.info("Reviewed Identifier Symbol set to: '{}'".format(self.reviewed_identifier_symbol))
if self.reviewed_identifier_symbol.lower() == "none" or not self.reviewed_identifier_symbol:
self.reviewed_identifier_symbol = None
logger.info("Reviewed Identifier Symbol set to None")
else:
if not self.reviewed_identifier_symbol:
self.reviewed_identifier_symbol = None
logger.info("Reviewed Identifier Symbol set to None")
raise ValueError(
"Reviewed Identifier Symbol must be a string, provided value: {}".format(
self.reviewed_identifier_symbol
)
)
def _validate_parameter_shape(self, yaml_params):
"""
Internal ProteinInferenceParameter method to validate shape of the parameter file by checking to make sure
that all necessary main parameter fields are defined.
"""
if self.PARENT_PARAMETER_KEY in yaml_params.keys():
logger.info("Main Parameter Key is Present")
else:
raise ValueError(
"Key {} needs to be defined as the outermost parameter group".format(self.PARENT_PARAMETER_KEY)
)
if self.PARAMETER_MAIN_KEYS.issubset(yaml_params[self.PARENT_PARAMETER_KEY]):
logger.info("All Sub Parameter Keys Present")
else:
raise ValueError(
"All of the following values: {}. Need to be Sub Parameters in the Yaml Parameter file".format(
", ".join(self.PARAMETER_MAIN_KEYS),
)
)
try:
general_params = yaml_params[self.PARENT_PARAMETER_KEY][self.GENERAL_PARAMETER_KEY]
for gkey in self.GENERAL_PARAMETER_SUB_KEYS:
if gkey in general_params.keys():
pass
else:
raise ValueError(
"General Sub Parameter '{}' is not found in the parameter file. "
"Please add it as a sub parameter of the general parameter field".format(gkey)
)
except KeyError:
raise ValueError("'general' sub Parameter not defined in the parameter file")
try:
data_res_params = yaml_params[self.PARENT_PARAMETER_KEY][self.DATA_RESTRICTION_PARAMETER_KEY]
for drkey in self.DATA_RESTRICTION_PARAMETER_SUB_KEYS:
if drkey in data_res_params.keys():
pass
else:
raise ValueError(
"Data Restriction Sub Parameter '{}' is not found in the parameter file. "
"Please add it as a sub parameter of the data_restriction parameter field".format(drkey)
)
except KeyError:
raise ValueError("'data_restriction' sub Parameter not defined in the parameter file")
try:
score_params = yaml_params[self.PARENT_PARAMETER_KEY][self.SCORE_PARAMETER_KEY]
for skey in self.SCORE_PARAMETER_SUB_KEYS:
if skey in score_params.keys():
pass
else:
raise ValueError(
"Score Sub Parameter '{}' is not found in the parameter file. "
"Please add it as a sub parameter of the score parameter field".format(skey)
)
except KeyError:
raise ValueError("'score' sub Parameter not defined in the parameter file")
try:
id_params = yaml_params[self.PARENT_PARAMETER_KEY][self.IDENTIFIERS_PARAMETER_KEY]
for ikey in self.IDENTIFIER_SUB_KEYS:
if ikey in id_params.keys():
pass
else:
raise ValueError(
"Identifiers Sub Parameter '{}' is not found in the parameter file. "
"Please add it as a sub parameter of the identifiers parameter field".format(ikey)
)
except KeyError:
raise ValueError("'identifiers' sub Parameter not defined in the parameter file")
try:
inf_params = yaml_params[self.PARENT_PARAMETER_KEY][self.INFERENCE_PARAMETER_KEY]
for infkey in self.INFERENCE_SUB_KEYS:
if infkey in inf_params.keys():
pass
else:
raise ValueError(
"Inference Sub Parameter '{}' is not found in the parameter file. "
"Please add it as a sub parameter of the inference parameter field".format(infkey)
)
except KeyError:
raise ValueError("'inference' sub Parameter not defined in the parameter file")
try:
digest_params = yaml_params[self.PARENT_PARAMETER_KEY][self.DIGEST_PARAMETER_KEY]
for dkey in self.DIGEST_SUB_KEYS:
if dkey in digest_params.keys():
pass
else:
raise ValueError(
"Digest Sub Parameter '{}' is not found in the parameter file. "
"Please add it as a sub parameter of the digest parameter field".format(dkey)
)
except KeyError:
raise ValueError("'digest' sub Parameter not defined in the parameter file")
try:
parsimony_params = yaml_params[self.PARENT_PARAMETER_KEY][self.PARSIMONY_PARAMETER_KEY]
for pkey in self.PARSIMONY_SUB_KEYS:
if pkey in parsimony_params.keys():
pass
else:
raise ValueError(
"Parsimony Sub Parameter '{}' is not found in the parameter file. "
"Please add it as a sub parameter of the parsimony parameter field".format(pkey)
)
except KeyError:
raise ValueError("'parsimony' sub Parameter not defined in the parameter file")
try:
pep_cen_params = yaml_params[self.PARENT_PARAMETER_KEY][self.PEPTIDE_CENTRIC_PARAMETER_KEY]
for pckey in self.PEPTIDE_CENTRIC_SUB_KEYS:
if pckey in pep_cen_params.keys():
pass
else:
raise ValueError(
"Peptide Centric Sub Parameter '{}' is not found in the parameter file. "
"Please add it as a sub parameter of the peptide_centric parameter field".format(pckey)
)
except KeyError:
raise ValueError("'peptide_centric' sub Parameter not defined in the parameter file")
def override_q_restrict(self, data):
"""
ProteinInferenceParameter method to override restrict_q if the input data does not contain q values.
Args:
data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
"""
data_has_q = data.input_has_q()
if data_has_q:
pass
else:
if self.restrict_q:
logger.warning("No Q values found in the input data, overriding parameters to not filter on Q value")
self.restrict_q = None
def override_pep_restrict(self, data):
"""
ProteinInferenceParameter method to override restrict_pep if the input data does not contain pep values.
Args:
data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
"""
data_has_pep = data.input_has_pep()
if data_has_pep:
pass
else:
if self.restrict_pep:
logger.warning(
"No Pep values found in the input data, overriding parameters to not filter on Pep value"
)
self.restrict_pep = None
def override_custom_restrict(self, data):
"""
ProteinInferenceParameter method to override restrict_custom if
the input data does not contain custom score values.
Args:
data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
"""
data_has_custom = data.input_has_custom()
if data_has_custom:
pass
else:
if self.restrict_custom:
logger.warning(
"No Custom values found in the input data, overriding parameters to not filter on Custom value"
)
self.restrict_custom = None
def fix_parameters_from_datastore(self, data):
"""
ProteinInferenceParameter method to override restriction values in the
parameter file if those scores do not exist in the input files.
Args:
data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
"""
self.override_q_restrict(data=data)
self.override_pep_restrict(data=data)
self.override_custom_restrict(data=data)
def _fix_none_parameters(self):
"""
Internal ProteinInferenceParameter method to fix parameters that have been defined as None.
These get read in as strings with YAML reader and need to be converted to None type.
"""
self._fix_grouping_type()
self._fix_lp_solver()
self._fix_shared_peptides()
def _fix_grouping_type(self):
"""
Internal ProteinInferenceParameter method to override grouping type for None value.
"""
if self.grouping_type in ["None", "none", None]:
self.grouping_type = None
def _fix_lp_solver(self):
"""
Internal ProteinInferenceParameter method to override lp_solver for None value.
"""
if self.lp_solver in ["None", "none", None]:
self.lp_solver = None
def _fix_shared_peptides(self):
"""
Internal ProteinInferenceParameter method to override shared_peptides for None value.
"""
if self.shared_peptides in ["None", "none", None]:
self.shared_peptides = None
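The three `_fix_*` helpers above share one rule: a value read from the parameter file as the string "None" or "none" is converted to a real `None`. A minimal standalone sketch of that rule follows. The `normalize_none` helper and the example values are hypothetical and only illustrate the pattern; they are not part of the library.

```python
def normalize_none(value):
    """Return None for "None", "none" or None; return any other value unchanged."""
    if value in ("None", "none", None):
        return None
    return value


# Illustrative values only; a YAML reader returns the text None as a string.
raw_params = {"grouping_type": "None", "lp_solver": "pulp", "shared_peptides": "none"}
cleaned = {name: normalize_none(value) for name, value in raw_params.items()}
print(cleaned)
```

Normalizing once after loading lets the rest of the code test `is None` instead of comparing against several string spellings.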
__init__(self, yaml_param_filepath, validate=True)
special
Class to store Protein Inference parameter information as an object.
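Parameters:
Name | Type | Description |
---|---|---|
yaml_param_filepath |
str |
path to properly formatted parameter file specific to Protein Inference. |
validate |
bool |
True/False on whether to validate the parameter file of interest. |
Returns: None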
Examples:
>>> pyproteininference.parameters.ProteinInferenceParameter(
...     yaml_param_filepath = "/path/to/pyproteininference_params.yaml", validate=True
... )
Source code in pyproteininference/parameters.py
def __init__(self, yaml_param_filepath, validate=True):
"""Class to store Protein Inference parameter information as an object.
Args:
yaml_param_filepath (str): path to properly formatted parameter file specific to Protein Inference.
validate (bool): True/False on whether to validate the parameter file of interest.
Returns:
None:
Example:
>>> pyproteininference.parameters.ProteinInferenceParameter(
...     yaml_param_filepath = "/path/to/pyproteininference_params.yaml", validate=True
... )
"""
self.yaml_param_filepath = yaml_param_filepath
self.digest_type = self.DEFAULT_DIGEST_TYPE
self.export = self.DEFAULT_EXPORT
self.fdr = self.DEFAULT_FDR
self.missed_cleavages = self.DEFAULT_MISSED_CLEAVAGES
self.picker = self.DEFAULT_PICKER
self.restrict_pep = self.DEFAULT_RESTRICT_PEP
self.restrict_peptide_length = self.DEFAULT_RESTRICT_PEPTIDE_LENGTH
self.restrict_q = self.DEFAULT_RESTRICT_Q
self.restrict_custom = self.DEFAULT_RESTRICT_CUSTOM
self.protein_score = self.DEFAULT_PROTEIN_SCORE
self.psm_score_type = self.DEFAULT_PSM_SCORE_TYPE
self.decoy_symbol = self.DEFAULT_DECOY_SYMBOL
self.isoform_symbol = self.DEFAULT_ISOFORM_SYMBOL
self.reviewed_identifier_symbol = self.DEFAULT_REVIEWED_IDENTIFIER_SYMBOL
self.inference_type = self.DEFAULT_INFERENCE_TYPE
self.tag = self.DEFAULT_TAG
self.psm_score = self.DEFAULT_PSM_SCORE
self.grouping_type = self.DEFAULT_GROUPING_TYPE
self.max_identifiers_peptide_centric = self.DEFAULT_MAX_IDENTIFIERS_PEPTIDE_CENTRIC
self.lp_solver = self.DEFAULT_LP_SOLVER
self.shared_peptides = self.DEFAULT_SHARED_PEPTIDES
self.validate = validate
self.convert_to_object()
if validate:
self.validate_parameters()
self._fix_none_parameters()
convert_to_object(self)
Function that takes a Protein Inference parameter file and converts it into a ProteinInferenceParameter object by assigning all Attributes of the ProteinInferenceParameter object.
If no parameter filepath is supplied the parameter object will be loaded with default params.
This function is run during initialization of the ProteinInferenceParameter object.
Returns: None
Source code in pyproteininference/parameters.py
def convert_to_object(self):
"""
Function that takes a Protein Inference parameter file and converts it into a ProteinInferenceParameter object
by assigning all Attributes of the ProteinInferenceParameter object.
If no parameter filepath is supplied the parameter object will be loaded with default params.
This function is run during initialization of the ProteinInferenceParameter object.
Returns:
None:
"""
if self.yaml_param_filepath:
with open(self.yaml_param_filepath, "r") as stream:
yaml_params = yaml.load(stream, Loader=yaml.Loader)
try:
self.digest_type = yaml_params[self.PARENT_PARAMETER_KEY][self.DIGEST_PARAMETER_KEY][
self.DIGEST_TYPE_PARAMETER
]
except KeyError:
logger.warning("digest_type set to default of {}".format(self.DEFAULT_DIGEST_TYPE))
try:
self.export = yaml_params[self.PARENT_PARAMETER_KEY][self.GENERAL_PARAMETER_KEY][self.EXPORT_PARAMETER]
except KeyError:
logger.warning("export set to default of {}".format(self.DEFAULT_EXPORT))
try:
self.fdr = yaml_params[self.PARENT_PARAMETER_KEY][self.GENERAL_PARAMETER_KEY][self.FDR_PARAMETER]
except KeyError:
logger.warning("fdr set to default of {}".format(self.DEFAULT_FDR))
try:
self.missed_cleavages = yaml_params[self.PARENT_PARAMETER_KEY][self.DIGEST_PARAMETER_KEY][
self.MISSED_CLEAV_PARAMETER
]
except KeyError:
logger.warning("missed_cleavages set to default of {}".format(self.DEFAULT_MISSED_CLEAVAGES))
try:
self.picker = yaml_params[self.PARENT_PARAMETER_KEY][self.GENERAL_PARAMETER_KEY][self.PICKER_PARAMETER]
except KeyError:
logger.warning("picker set to default of {}".format(self.DEFAULT_PICKER))
try:
self.restrict_pep = yaml_params[self.PARENT_PARAMETER_KEY][self.DATA_RESTRICTION_PARAMETER_KEY][
self.PEP_RESTRICT_PARAMETER
]
except KeyError:
logger.warning("restrict_pep set to default of {}".format(self.DEFAULT_RESTRICT_PEP))
try:
self.restrict_peptide_length = yaml_params[self.PARENT_PARAMETER_KEY][
self.DATA_RESTRICTION_PARAMETER_KEY
][self.PEPTIDE_LENGTH_RESTRICT_PARAMETER]
except KeyError:
logger.warning(
"restrict_peptide_length set to default of {}".format(self.DEFAULT_RESTRICT_PEPTIDE_LENGTH)
)
try:
self.restrict_q = yaml_params[self.PARENT_PARAMETER_KEY][self.DATA_RESTRICTION_PARAMETER_KEY][
self.Q_VALUE_RESTRICT_PARAMETER
]
except KeyError:
logger.warning("restrict_q set to default of {}".format(self.DEFAULT_RESTRICT_Q))
try:
self.restrict_custom = yaml_params[self.PARENT_PARAMETER_KEY][self.DATA_RESTRICTION_PARAMETER_KEY][
self.CUSTOM_RESTRICT_PARAMETER
]
except KeyError:
logger.warning("restrict_custom set to default of {}".format(self.DEFAULT_RESTRICT_CUSTOM))
try:
self.protein_score = yaml_params[self.PARENT_PARAMETER_KEY][self.SCORE_PARAMETER_KEY][
self.PROTEIN_SCORE_PARAMETER
]
except KeyError:
logger.warning("protein_score set to default of {}".format(self.DEFAULT_PROTEIN_SCORE))
try:
self.psm_score_type = yaml_params[self.PARENT_PARAMETER_KEY][self.SCORE_PARAMETER_KEY][
self.PSM_SCORE_TYPE_PARAMETER
]
except KeyError:
logger.warning("psm_score_type set to default of {}".format(self.DEFAULT_PSM_SCORE_TYPE))
try:
self.decoy_symbol = yaml_params[self.PARENT_PARAMETER_KEY][self.IDENTIFIERS_PARAMETER_KEY][
self.DECOY_SYMBOL_PARAMETER
]
except KeyError:
logger.warning("decoy_symbol set to default of {}".format(self.DEFAULT_DECOY_SYMBOL))
try:
self.isoform_symbol = yaml_params[self.PARENT_PARAMETER_KEY][self.IDENTIFIERS_PARAMETER_KEY][
self.ISOFORM_SYMBOL_PARAMETER
]
except KeyError:
logger.warning("isoform_symbol set to default of {}".format(self.DEFAULT_ISOFORM_SYMBOL))
try:
self.reviewed_identifier_symbol = yaml_params[self.PARENT_PARAMETER_KEY][
self.IDENTIFIERS_PARAMETER_KEY
][self.REVIEWED_IDENTIFIER_PARAMETER]
except KeyError:
logger.warning(
"reviewed_identifier_symbol set to default of {}".format(self.DEFAULT_REVIEWED_IDENTIFIER_SYMBOL)
)
try:
self.inference_type = yaml_params[self.PARENT_PARAMETER_KEY][self.INFERENCE_PARAMETER_KEY][
self.INFERENCE_TYPE_PARAMETER
]
except KeyError:
logger.warning("inference_type set to default of {}".format(self.DEFAULT_INFERENCE_TYPE))
try:
self.tag = yaml_params[self.PARENT_PARAMETER_KEY][self.GENERAL_PARAMETER_KEY][self.TAG_PARAMETER]
except KeyError:
logger.warning("tag set to default of {}".format(self.DEFAULT_TAG))
try:
self.psm_score = yaml_params[self.PARENT_PARAMETER_KEY][self.SCORE_PARAMETER_KEY][
self.PSM_SCORE_PARAMETER
]
except KeyError:
logger.warning("psm_score set to default of {}".format(self.DEFAULT_PSM_SCORE))
try:
self.grouping_type = yaml_params[self.PARENT_PARAMETER_KEY][self.INFERENCE_PARAMETER_KEY][
self.GROUPING_TYPE_PARAMETER
]
except KeyError:
logger.warning("grouping_type set to default of {}".format(self.DEFAULT_GROUPING_TYPE))
try:
self.max_identifiers_peptide_centric = yaml_params[self.PARENT_PARAMETER_KEY][
self.PEPTIDE_CENTRIC_PARAMETER_KEY
][self.MAX_IDENTIFIERS_PARAMETER]
except KeyError:
logger.warning(
"max_identifiers_peptide_centric set to default of {}".format(
self.DEFAULT_MAX_IDENTIFIERS_PEPTIDE_CENTRIC
)
)
try:
self.lp_solver = yaml_params[self.PARENT_PARAMETER_KEY][self.PARSIMONY_PARAMETER_KEY][
self.LP_SOLVER_PARAMETER
]
except KeyError:
logger.warning("lp_solver set to default of {}".format(self.DEFAULT_LP_SOLVER))
try:
# Do try except here to make old param files backwards compatible
self.shared_peptides = yaml_params[self.PARENT_PARAMETER_KEY][self.PARSIMONY_PARAMETER_KEY][
self.SHARED_PEPTIDES_PARAMETER
]
except KeyError:
logger.warning("shared_peptides set to default of {}".format(self.DEFAULT_SHARED_PEPTIDES))
else:
logger.warning("Yaml parameter file not found, all parameters set to default")
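Every parameter above is read the same way: index into the nested YAML mapping, and if a key is missing keep the default and log a warning. That pattern can be sketched as one helper. The `get_nested` function and the key names ("parameters", "general", "fdr", "digest", "digest_type") are assumptions made for illustration; the values of the library's own key constants are not shown in this section.

```python
import logging

logger = logging.getLogger(__name__)


def get_nested(params, keys, default, name):
    """Follow `keys` through nested dicts; return `default` if any key is missing."""
    current = params
    try:
        for key in keys:
            current = current[key]
    except KeyError:
        logger.warning("%s set to default of %s", name, default)
        return default
    return current


# Stand-in for the dictionary a YAML reader would return (hypothetical keys).
yaml_params = {"parameters": {"general": {"fdr": 0.05}, "digest": {}}}

fdr = get_nested(yaml_params, ["parameters", "general", "fdr"], default=0.01, name="fdr")
digest_type = get_nested(
    yaml_params, ["parameters", "digest", "digest_type"], default="trypsin", name="digest_type"
)
print(fdr, digest_type)
```

As in the source, only `KeyError` is handled, so a missing section or a missing sub key falls back to the default, while a malformed file (for example a section that is not a mapping) still raises.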
fix_parameters_from_datastore(self, data)
ProteinInferenceParameter method to override restriction values in the parameter file if those scores do not exist in the input files.
Parameters:
Name | Type | Description |
---|---|---|
data |
DataStore |
DataStore Object. |
Source code in pyproteininference/parameters.py
def fix_parameters_from_datastore(self, data):
"""
ProteinInferenceParameter method to override restriction values in the
parameter file if those scores do not exist in the input files.
Args:
data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
"""
self.override_q_restrict(data=data)
self.override_pep_restrict(data=data)
self.override_custom_restrict(data=data)
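The effect of the three overrides can be seen with small stand-in objects. Both classes below are hypothetical stubs written for this sketch: `StubDataStore` models only the `input_has_q`, `input_has_pep` and `input_has_custom` checks, and `StubParameters` models only the three `restrict_*` fields. The threshold values are illustrative.

```python
class StubDataStore:
    """Hypothetical stand-in for DataStore; only the availability checks are modeled."""

    def __init__(self, has_q, has_pep, has_custom):
        self._has_q = has_q
        self._has_pep = has_pep
        self._has_custom = has_custom

    def input_has_q(self):
        return self._has_q

    def input_has_pep(self):
        return self._has_pep

    def input_has_custom(self):
        return self._has_custom


class StubParameters:
    """Hypothetical stand-in for ProteinInferenceParameter with only restrict_* fields."""

    def __init__(self, restrict_q, restrict_pep, restrict_custom):
        self.restrict_q = restrict_q
        self.restrict_pep = restrict_pep
        self.restrict_custom = restrict_custom

    def fix_parameters_from_datastore(self, data):
        # Drop each threshold whose score is absent from the input data.
        if not data.input_has_q() and self.restrict_q:
            self.restrict_q = None
        if not data.input_has_pep() and self.restrict_pep:
            self.restrict_pep = None
        if not data.input_has_custom() and self.restrict_custom:
            self.restrict_custom = None


params = StubParameters(restrict_q=0.05, restrict_pep=0.9, restrict_custom=5.0)
params.fix_parameters_from_datastore(StubDataStore(has_q=True, has_pep=False, has_custom=False))
print(params.restrict_q, params.restrict_pep, params.restrict_custom)  # 0.05 None None
```

Only the thresholds whose scores are missing are cleared; a threshold backed by real data is left untouched.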
override_custom_restrict(self, data)
ProteinInferenceParameter method to override restrict_custom if the input data does not contain custom score values.
Parameters:
Name | Type | Description |
---|---|---|
data |
DataStore |
DataStore Object. |
Source code in pyproteininference/parameters.py
def override_custom_restrict(self, data):
"""
ProteinInferenceParameter method to override restrict_custom if
the input data does not contain custom score values.
Args:
data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
"""
data_has_custom = data.input_has_custom()
if data_has_custom:
pass
else:
if self.restrict_custom:
logger.warning(
"No Custom values found in the input data, overriding parameters to not filter on Custom value"
)
self.restrict_custom = None
override_pep_restrict(self, data)
ProteinInferenceParameter method to override restrict_pep if the input data does not contain pep values.
Parameters:
Name | Type | Description |
---|---|---|
data |
DataStore |
DataStore Object. |
Source code in pyproteininference/parameters.py
def override_pep_restrict(self, data):
"""
ProteinInferenceParameter method to override restrict_pep if the input data does not contain pep values.
Args:
data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
"""
data_has_pep = data.input_has_pep()
if data_has_pep:
pass
else:
if self.restrict_pep:
logger.warning(
"No Pep values found in the input data, overriding parameters to not filter on Pep value"
)
self.restrict_pep = None
override_q_restrict(self, data)
ProteinInferenceParameter method to override restrict_q if the input data does not contain q values.
Parameters:
Name | Type | Description |
---|---|---|
data |
DataStore |
DataStore Object. |
Source code in pyproteininference/parameters.py
def override_q_restrict(self, data):
"""
ProteinInferenceParameter method to override restrict_q if the input data does not contain q values.
Args:
data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
"""
data_has_q = data.input_has_q()
if data_has_q:
pass
else:
if self.restrict_q:
logger.warning("No Q values found in the input data, overriding parameters to not filter on Q value")
self.restrict_q = None
validate_parameters(self)
Method that runs every parameter validation on the ProteinInferenceParameter object.
Returns: None
Source code in pyproteininference/parameters.py
def validate_parameters(self):
"""
Method that runs every parameter validation on the ProteinInferenceParameter object.
Returns:
None:
"""
# Run all of the parameter validations
self._validate_digest_type()
self._validate_export_type()
self._validate_floats()
self._validate_bools()
self._validate_score_type()
self._validate_score_method()
self._validate_score_combination()
self._validate_inference_type()
self._validate_grouping_type()
self._validate_max_id()
self._validate_lp_solver()
self._validate_identifiers()
self._validate_parsimony_shared_peptides()
physical
Protein
The following class is a representation of a Protein that stores characteristics/attributes of a protein for the entire analysis. We use __slots__ to predefine the attributes the Protein object can have. This is done to speed up runtime of the PI algorithm.
Attributes:
Name | Type | Description |
---|---|---|
identifier |
str |
String identifier for the Protein object. |
score |
float |
Float that represents the protein score as output from Score object methods. |
psms |
list |
List of Psm objects. |
group_identification |
set |
Set of group Identifiers that the protein belongs to (int). |
reviewed |
bool |
True/False on if the identifier is reviewed. |
unreviewed |
bool |
True/False on if the identifier is unreviewed. |
peptides |
list |
List of non flanking peptide sequences. |
peptide_scores |
list |
List of Psm scores associated with the protein. |
picked |
bool |
True/False if the protein passes the picker algo. True if passes. False if does not pass. |
num_peptides |
int |
Number of peptides that map to the given Protein. |
unique_peptides |
list |
List of peptide strings that are unique to this protein across the analysis. |
num_unique_peptides |
int |
Number of unique peptides. |
raw_peptides |
set |
Set of raw peptides. Includes flanking AA and Mods. |
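To see what declaring `__slots__` does in practice, the sketch below uses a reduced, hypothetical class (`SlottedProtein`, with only two of the attributes listed above). Instances of a slotted class have no per-instance `__dict__`, and assigning an attribute that was not declared raises `AttributeError`.

```python
class SlottedProtein:
    """Hypothetical two-attribute class; the real Protein declares more slots."""

    __slots__ = ("identifier", "score")

    def __init__(self, identifier):
        self.identifier = identifier
        self.score = None


protein = SlottedProtein("PRKDC_HUMAN|P78527")

# Assigning an attribute that is not listed in __slots__ is rejected.
try:
    protein.description = "not declared in __slots__"
    rejected = False
except AttributeError:
    rejected = True

# Slotted instances carry no per-instance attribute dictionary.
has_dict = hasattr(protein, "__dict__")
print(rejected, has_dict)  # True False
```

Dropping the per-instance dictionary is what makes slotted objects cheaper when an analysis creates a very large number of them.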
Source code in pyproteininference/physical.py
class Protein(object):
"""
The following class is a representation of a Protein that stores characteristics/attributes of a protein for the
entire analysis.
We use __slots__ to predefine the attributes the Protein Object can have.
This is done to speed up runtime of the PI algorithm.
Attributes:
identifier (str): String identifier for the Protein object.
score (float): Float that represents the protein score as output from
[Score object][pyproteininference.scoring.Score] methods.
psms (list): List of [Psm][pyproteininference.physical.Psm] objects.
group_identification (set): Set of group Identifiers that the protein belongs to (int).
reviewed (bool): True/False on if the identifier is reviewed.
unreviewed (bool): True/False on if the identifier is unreviewed.
peptides (list): List of non flanking peptide sequences.
peptide_scores (list): List of Psm scores associated with the protein.
picked (bool): True/False if the protein passes the picker algo. True if passes. False if does not pass.
num_peptides (int): Number of peptides that map to the given Protein.
unique_peptides (list): List of peptide strings that are unique to this protein across the analysis.
num_unique_peptides (int): Number of unique peptides.
raw_peptides (set): Set of raw peptides. Includes flanking AA and Mods.
"""
__slots__ = (
"identifier",
"score",
"psms",
"group_identification",
"reviewed",
"unreviewed",
"peptides",
"peptide_scores",
"picked",
"num_peptides",
"unique_peptides",
"num_unique_peptides",
"raw_peptides",
)
def __init__(self, identifier):
"""
Initialization method for Protein object.
Args:
identifier (str): String identifier for the Protein object.
Example:
>>> protein = pyproteininference.physical.Protein(identifier = "PRKDC_HUMAN|P78527")
"""
self.identifier = identifier
self.score = None
self.psms = [] # List of psm objects
self.group_identification = set()
self.reviewed = False
self.unreviewed = False
self.peptides = None # Sequence info without flanking
self.peptide_scores = None # remove
self.picked = True
self.num_peptides = None # remove
self.unique_peptides = None # remove
self.num_unique_peptides = None # remove
self.raw_peptides = set() # Includes Flanking Seq Info
def get_psm_scores(self):
"""
Retrieves psm scores for a given protein.
Returns:
list: List of psm scores for the given protein.
"""
score_list = [x.main_score for x in self.psms]
return score_list
def get_psm_identifiers(self):
"""
Retrieves a list of Psm identifiers.
Returns:
list: List of Psm identifiers.
"""
psms = [x.identifier for x in self.psms]
return psms
def get_stripped_psm_identifiers(self):
"""
Retrieves a list of Psm identifiers that have had mods removed and flanking AAs removed.
Returns:
list: List of Psm identifiers that have no mods or flanking AAs.
"""
psms = [x.stripped_peptide for x in self.psms]
return psms
def get_unique_peptide_identifiers(self):
"""
Retrieves the unique set of peptides for a protein.
Returns:
set: Set of peptide strings.
"""
unique_peptides = set(self.get_psm_identifiers())
return unique_peptides
def get_unique_stripped_peptide_identifiers(self):
"""
Retrieves the unique set of peptides for a protein that are stripped.
Returns:
set: Set of peptide strings that are stripped of mods and flanking AAs.
"""
stripped_peptide_identifiers = set(self.get_stripped_psm_identifiers())
return stripped_peptide_identifiers
def get_num_psms(self):
"""
Retrieves the number of Psms.
Returns:
int: Number of Psms.
"""
num_psms = len(self.get_psm_identifiers())
return num_psms
def get_num_peptides(self):
"""
Retrieves the number of peptides.
Returns:
int: Number of peptides.
"""
num_peptides = len(self.get_unique_peptide_identifiers())
return num_peptides
def get_psm_ids(self):
"""
Retrieves the Psm Ids.
Returns:
list: List of Psm Ids.
"""
psm_ids = [x.psm_id for x in self.psms]
return psm_ids
__init__(self, identifier)
special
Initialization method for Protein object.
Parameters:
Name | Type | Description |
---|---|---|
identifier |
str |
String identifier for the Protein object. |
Examples:
>>> protein = pyproteininference.physical.Protein(identifier = "PRKDC_HUMAN|P78527")
Source code in pyproteininference/physical.py
def __init__(self, identifier):
"""
Initialization method for Protein object.
Args:
identifier (str): String identifier for the Protein object.
Example:
>>> protein = pyproteininference.physical.Protein(identifier = "PRKDC_HUMAN|P78527")
"""
self.identifier = identifier
self.score = None
self.psms = [] # List of psm objects
self.group_identification = set()
self.reviewed = False
self.unreviewed = False
self.peptides = None # Sequence info without flanking
self.peptide_scores = None # remove
self.picked = True
self.num_peptides = None # remove
self.unique_peptides = None # remove
self.num_unique_peptides = None # remove
self.raw_peptides = set() # Includes Flanking Seq Info
get_num_peptides(self)
Retrieves the number of peptides.
Returns: int: Number of peptides.
Source code in pyproteininference/physical.py
def get_num_peptides(self):
"""
Retrieves the number of peptides.
Returns:
int: Number of peptides.
"""
num_peptides = len(self.get_unique_peptide_identifiers())
return num_peptides
get_num_psms(self)
Retrieves the number of Psms.
Returns: int: Number of Psms.
Source code in pyproteininference/physical.py
def get_num_psms(self):
"""
Retrieves the number of Psms.
Returns:
int: Number of Psms.
"""
num_psms = len(self.get_psm_identifiers())
return num_psms
get_psm_identifiers(self)
Retrieves a list of Psm identifiers.
Returns: list: List of Psm identifiers.
Source code in pyproteininference/physical.py
def get_psm_identifiers(self):
"""
Retrieves a list of Psm identifiers.
Returns:
list: List of Psm identifiers.
"""
psms = [x.identifier for x in self.psms]
return psms
get_psm_ids(self)
Retrieves the Psm Ids.
Returns: list: List of Psm Ids.
Source code in pyproteininference/physical.py
def get_psm_ids(self):
"""
Retrieves the Psm Ids.
Returns:
list: List of Psm Ids.
"""
psm_ids = [x.psm_id for x in self.psms]
return psm_ids
get_psm_scores(self)
Retrieves psm scores for a given protein.
Returns: list: List of psm scores for the given protein.
Source code in pyproteininference/physical.py
def get_psm_scores(self):
"""
Retrieves psm scores for a given protein.
Returns:
list: List of psm scores for the given protein.
"""
score_list = [x.main_score for x in self.psms]
return score_list
get_stripped_psm_identifiers(self)
Retrieves a list of Psm identifiers that have had mods removed and flanking AAs removed.
Returns: list: List of Psm identifiers that have no mods or flanking AAs.
Source code in pyproteininference/physical.py
def get_stripped_psm_identifiers(self):
"""
Retrieves a list of Psm identifiers that have had mods removed and flanking AAs removed.
Returns:
list: List of Psm identifiers that have no mods or flanking AAs.
"""
psms = [x.stripped_peptide for x in self.psms]
return psms
get_unique_peptide_identifiers(self)
Retrieves the unique set of peptides for a protein.
Returns: set: Set of peptide strings.
Source code in pyproteininference/physical.py
def get_unique_peptide_identifiers(self):
"""
Retrieves the unique set of peptides for a protein.
Returns:
set: Set of peptide strings.
"""
unique_peptides = set(self.get_psm_identifiers())
return unique_peptides
get_unique_stripped_peptide_identifiers(self)
Retrieves the unique set of peptides for a protein that are stripped.
Returns: set: Set of peptide strings that are stripped of mods and flanking AAs.
Source code in pyproteininference/physical.py
def get_unique_stripped_peptide_identifiers(self):
"""
Retrieves the unique set of peptides for a protein that are stripped.
Returns:
set: Set of peptide strings that are stripped of mods and flanking AAs.
"""
stripped_peptide_identifiers = set(self.get_stripped_psm_identifiers())
return stripped_peptide_identifiers
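The counting methods above differ in what they deduplicate: `get_num_psms` counts every Psm, `get_num_peptides` counts distinct full identifiers (flanking AAs and mods included), and the stripped variant collapses modified and unmodified forms of the same sequence. The sketch below reproduces those three expressions on stand-in objects. The `SimpleNamespace` records and the peptide strings are hypothetical; they only mimic the `identifier` and `stripped_peptide` attributes of a Psm.

```python
from types import SimpleNamespace

# Three hypothetical PSMs for one protein: one modified form, one unmodified form seen twice.
psms = [
    SimpleNamespace(identifier="K.PEPT#IDEK.A", stripped_peptide="PEPTIDEK"),
    SimpleNamespace(identifier="K.PEPTIDEK.A", stripped_peptide="PEPTIDEK"),
    SimpleNamespace(identifier="K.PEPTIDEK.A", stripped_peptide="PEPTIDEK"),
]

num_psms = len([p.identifier for p in psms])                  # every PSM counts
num_peptides = len({p.identifier for p in psms})              # distinct full identifiers
num_stripped_peptides = len({p.stripped_peptide for p in psms})  # distinct bare sequences

print(num_psms, num_peptides, num_stripped_peptides)  # 3 2 1
```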
ProteinGroup
The following class is a physical Protein Group class that stores characteristics of a Protein Group for the entire analysis. We use __slots__ to predefine the attributes the ProteinGroup object can have. This is done to speed up runtime of the PI algorithm.
Attributes:
Name | Type | Description |
---|---|---|
number_id |
int |
unique Integer to represent a group. |
proteins |
list |
List of Protein objects. |
q_value |
float |
Q value for the protein group that is calculated with method calculate_q_values. |
Source code in pyproteininference/physical.py
class ProteinGroup(object):
"""
The following class is a physical Protein Group class that stores characteristics of a Protein Group for the entire
analysis.
We use __slots__ to predefine the attributes the ProteinGroup object can have.
This is done to speed up runtime of the PI algorithm.
Attributes:
number_id (int): unique Integer to represent a group.
proteins (list): List of [Protein][pyproteininference.physical.Protein] objects.
q_value (float): Q value for the protein group that is calculated with method
[calculate_q_values][pyproteininference.datastore.DataStore.calculate_q_values].
"""
__slots__ = ("proteins", "number_id", "q_value")
def __init__(self, number_id):
"""
Initialization method for ProteinGroup object.
Args:
number_id (int): unique Integer to represent a group.
Example:
>>> pg = pyproteininference.physical.ProteinGroup(number_id = 1)
"""
self.proteins = []
self.number_id = number_id
self.q_value = None
__init__(self, number_id)
special
Initialization method for ProteinGroup object.
Parameters:
Name | Type | Description |
---|---|---|
number_id |
int |
unique Integer to represent a group. |
Examples:
>>> pg = pyproteininference.physical.ProteinGroup(number_id = 1)
Source code in pyproteininference/physical.py
def __init__(self, number_id):
"""
Initialization method for ProteinGroup object.
Args:
number_id (int): unique Integer to represent a group.
Example:
>>> pg = pyproteininference.physical.ProteinGroup(number_id = 1)
"""
self.proteins = []
self.number_id = number_id
self.q_value = None
Psm
The following class is a physical Psm class that stores characteristics of a psm for the entire analysis. We use __slots__ to predefine the attributes the Psm object can have. This is done to speed up runtime of the PI algorithm.
Attributes:
Name | Type | Description |
---|---|---|
identifier |
str |
Peptide Identifier: IE "K.DLIDEGH#AATQLVNQLHDVVVENNLSDK.Q". |
percscore |
float |
Percolator Score from input file if it exists. |
qvalue |
float |
Q value from input file if it exists. |
pepvalue |
float |
Pep value from input file if it exists. |
possible_proteins |
list |
List of protein strings that the Psm maps to based on the digest. |
psm_id |
str |
String that represents a global identifier for the Psm. Should come from input files. |
custom_score |
float |
Score that comes from a custom column in the input files. |
main_score |
float |
The Psm score to be used as the scoring variable for protein scoring. Can be percscore, qvalue, pepvalue, or custom_score. |
stripped_peptide |
str |
This is the identifier attribute that has had mods removed and flanking AAs removed IE: DLIDEGHAATQLVNQLHDVVVENNLSDK. |
non_flanking_peptide |
str |
This is the identifier attribute that has had flanking AAs removed IE: DLIDEGH#AATQLVNQLHDVVVENNLSDK. Note: mods are still present here. |
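The relationship between `identifier`, `non_flanking_peptide` and `stripped_peptide` can be sketched with two small functions. This is a simplified standalone version, not the class itself: `remove_flanking` and `strip_mods` are hypothetical names, the regular expressions are copied from the class source shown in this section, and the bracketed-mass peptide is an illustrative input chosen because its modification contains the "." delimiter.

```python
import re

# Patterns copied from the Psm class source; written as raw strings here.
MOD_REGEX = re.compile(r"\([^()]*\)|\[.*?\]|[^A-Z]")
FRONT_FLANKING_REGEX = re.compile(r"^[A-Z|-][.]")
BACK_FLANKING_REGEX = re.compile(r"[.][A-Z|-]$")


def remove_flanking(peptide_string):
    """Drop the leading and trailing flanking residues, e.g. 'K.' and '.Q'."""
    parts = peptide_string.split(".")
    if len(parts) == 3:
        return parts[1]
    if len(parts) == 1:
        return parts[0]
    # A modification contains the delimiter, so trim the ends with regexes instead.
    trimmed = FRONT_FLANKING_REGEX.sub("", peptide_string)
    return BACK_FLANKING_REGEX.sub("", trimmed)


def strip_mods(peptide_string):
    """Remove parenthesized or bracketed mods and any non A-Z character."""
    return MOD_REGEX.sub("", peptide_string.upper())


non_flanking = remove_flanking("K.DLIDEGH#AATQLVNQLHDVVVENNLSDK.Q")
stripped = strip_mods(non_flanking)
print(non_flanking)  # DLIDEGH#AATQLVNQLHDVVVENNLSDK
print(stripped)      # DLIDEGHAATQLVNQLHDVVVENNLSDK

# Illustrative input whose mod mass contains the "." delimiter.
print(strip_mods(remove_flanking("K.AC[+57.021]DEFK.L")))  # ACDEFK
```

Splitting on the delimiter is enough for the common case; the regex fallback is only needed when the split yields something other than one or three chunks.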
Source code in pyproteininference/physical.py
class Psm(object):
"""
The following class is a physical Psm class that stores characteristics of a psm for the entire analysis.
We use __slots__ to predefine the attributes the Psm Object can have.
This is done to speed up runtime of the PI algorithm.
Attributes:
identifier (str): Peptide Identifier: IE "K.DLIDEGH#AATQLVNQLHDVVVENNLSDK.Q".
percscore (float): Percolator Score from input file if it exists.
qvalue (float): Q value from input file if it exists.
pepvalue (float): Pep value from input file if it exists.
possible_proteins (list): List of protein strings that the Psm maps to based on the digest.
psm_id (str): String that represents a global identifier for the Psm. Should come from input files.
custom_score (float): Score that comes from a custom column in the input files.
main_score (float): The Psm score to be used as the scoring variable for protein scoring. Can be
percscore, qvalue, pepvalue, or custom_score.
stripped_peptide (str): This is the identifier attribute that has had mods removed and flanking AAs
removed IE: DLIDEGHAATQLVNQLHDVVVENNLSDK.
non_flanking_peptide (str): This is the identifier attribute that has had flanking AAs
removed IE: DLIDEGH#AATQLVNQLHDVVVENNLSDK. Note: mods are still present here.
"""
__slots__ = (
"identifier",
"percscore",
"qvalue",
"pepvalue",
"possible_proteins",
"psm_id",
"custom_score",
"main_score",
"stripped_peptide",
"non_flanking_peptide",
)
# The first alternative removes anything between parentheses, including the parentheses - \([^()]*\)
# The second alternative removes anything between brackets, including the brackets - \[.*?\]
# The third alternative removes anything that is not an A-Z character - [^A-Z]
MOD_REGEX = re.compile(r"\([^()]*\)|\[.*?\]|[^A-Z]")
FRONT_FLANKING_REGEX = re.compile("^[A-Z|-][.]")
BACK_FLANKING_REGEX = re.compile("[.][A-Z|-]$")
SCORE_ATTRIBUTE_NAMES = set(["pepvalue", "qvalue", "percscore", "custom_score"])
def __init__(self, identifier):
"""
Initialization method for the Psm object.
This method also initializes the `stripped_peptide` and `non_flanking_peptide` attributes.
Args:
identifier (str): Peptide Identifier: IE "K.DLIDEGH#AATQLVNQLHDVVVENNLSDK.Q".
Example:
>>> psm = pyproteininference.physical.Psm(identifier = "K.DLIDEGHAATQLVNQLHDVVVENNLSDK.Q")
"""
self.identifier = identifier
self.percscore = None
self.qvalue = None
self.pepvalue = None
self.possible_proteins = None
self.psm_id = None
self.custom_score = None
self.main_score = None
self.stripped_peptide = None
self.non_flanking_peptide = None
# Add logic to split the peptide and strip it of mods
current_peptide = Psm.split_peptide(peptide_string=self.identifier)
self.non_flanking_peptide = current_peptide
if not current_peptide.isupper() or not current_peptide.isalpha():
# If we have mods remove them...
peptide_string = current_peptide.upper()
stripped_peptide = Psm.remove_peptide_mods(peptide_string)
current_peptide = stripped_peptide
# Set stripped_peptide variable
self.stripped_peptide = current_peptide
@classmethod
def remove_peptide_mods(cls, peptide_string):
"""
This class method takes a string and uses a `MOD_REGEX` to remove mods from peptide strings.
Args:
peptide_string (str): Peptide string to have mods removed from.
Returns:
str: a peptide string with mods removed.
"""
stripped_peptide = cls.MOD_REGEX.sub("", peptide_string)
return stripped_peptide
@classmethod
def split_peptide(cls, peptide_string, delimiter="."):
"""
This class method takes a peptide string with flanking AAs and removes them from the peptide string.
This method uses string splitting and if the method produces a faulty peptide the method
[split_peptide_pro][pyproteininference.physical.Psm.split_peptide_pro] will be called.
Args:
peptide_string (str): Peptide string to have flanking AAs removed from.
delimiter (str): a string to indicate what separates a leading/trailing (flanking) AA from the
peptide sequence.
Returns:
str: a peptide string with flanking AAs removed.
"""
peptide_split = peptide_string.split(delimiter)
if len(peptide_split) == 3:
# If we get 3 chunks it will usually be ['A', 'ADGSDFGSS', 'F']
# So take index 1
peptide = peptide_split[1]
elif len(peptide_split) == 1:
# If we get 1 chunk it should just be ['ADGSDFGSS']
# So take index 0
peptide = peptide_split[0]
else:
# If we split the peptide and it is not length 1 or 3 then try to split with pro
peptide = cls.split_peptide_pro(peptide_string=peptide_string, delimiter=delimiter)
return peptide
@classmethod
def split_peptide_pro(cls, peptide_string, delimiter="."):
    """
    This class method takes a peptide string with flanking AAs and removes them from the peptide string.
    This is a specialized method of [split_peptide][pyproteininference.physical.Psm.split_peptide] that uses
    regex identifiers to replace flanking AAs as opposed to string splitting.
    Args:
        peptide_string (str): Peptide string to have flanking AAs removed from.
        delimiter (str): a string to indicate what separates a leading/trailing (flanking) AA from the peptide
            sequence.
    Returns:
        str: a peptide string with flanking AAs removed.
    """
    front_regex = cls.FRONT_FLANKING_REGEX
    back_regex = cls.BACK_FLANKING_REGEX
    if delimiter != ".":
        # Compile delimiter-specific patterns locally so the class-level defaults are left untouched
        front_regex = re.compile("^[A-Z|-][{}]".format(re.escape(delimiter)))
        back_regex = re.compile("[{}][A-Z|-]$".format(re.escape(delimiter)))
    # Replace the front flanking with nothing
    peptide_string = front_regex.sub("", peptide_string)
    # Replace the back flanking with nothing
    peptide_string = back_regex.sub("", peptide_string)
    return peptide_string
def assign_main_score(self, score):
"""
This method takes in a score type and assigns the variable main_score for a given Psm based on the score type.
Args:
score (str): This is a string representation of the Psm attribute that will get assigned to the main_score
variable.
"""
# Assign a main score based on user input
if score not in self.SCORE_ATTRIBUTE_NAMES:
raise ValueError("Scores must either be one of: '{}'".format(", ".join(self.SCORE_ATTRIBUTE_NAMES)))
else:
self.main_score = getattr(self, score)
__init__(self, identifier)
special
Initialization method for the Psm object.
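This method also initializes the stripped_peptide and non_flanking_peptide attributes.
Parameters:
Name | Type | Description |
---|---|---|
identifier | str | Peptide identifier, e.g. "K.DLIDEGH#AATQLVNQLHDVVVENNLSDK.Q". |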
Examples:
>>> psm = pyproteininference.physical.Psm(identifier = "K.DLIDEGHAATQLVNQLHDVVVENNLSDK.Q")
Source code in pyproteininference/physical.py
def __init__(self, identifier):
"""
Initialization method for the Psm object.
This method also initializes the `stripped_peptide` and `non_flanking_peptide` attributes.
Args:
identifier (str): Peptide identifier, e.g. "K.DLIDEGH#AATQLVNQLHDVVVENNLSDK.Q".
Example:
>>> psm = pyproteininference.physical.Psm(identifier = "K.DLIDEGHAATQLVNQLHDVVVENNLSDK.Q")
"""
self.identifier = identifier
self.percscore = None
self.qvalue = None
self.pepvalue = None
self.possible_proteins = None
self.psm_id = None
self.custom_score = None
self.main_score = None
self.stripped_peptide = None
self.non_flanking_peptide = None
# Add logic to split the peptide and strip it of mods
current_peptide = Psm.split_peptide(peptide_string=self.identifier)
self.non_flanking_peptide = current_peptide
if not current_peptide.isupper() or not current_peptide.isalpha():
# If we have mods remove them...
peptide_string = current_peptide.upper()
stripped_peptide = Psm.remove_peptide_mods(peptide_string)
current_peptide = stripped_peptide
# Set stripped_peptide variable
self.stripped_peptide = current_peptide
assign_main_score(self, score)
This method takes in a score type and assigns the variable main_score for a given Psm based on the score type.
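Parameters:
Name | Type | Description |
---|---|---|
score | str | String representation of the Psm attribute that will get assigned to the main_score variable. |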
Source code in pyproteininference/physical.py
def assign_main_score(self, score):
"""
This method takes in a score type and assigns the variable main_score for a given Psm based on the score type.
Args:
score (str): This is a string representation of the Psm attribute that will get assigned to the main_score
variable.
"""
# Assign a main score based on user input
if score not in self.SCORE_ATTRIBUTE_NAMES:
raise ValueError("Scores must either be one of: '{}'".format(", ".join(self.SCORE_ATTRIBUTE_NAMES)))
else:
self.main_score = getattr(self, score)
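The lookup in assign_main_score is a `getattr` call guarded by an allow-list of attribute names. A minimal standalone sketch of that pattern follows. `ScoredItem` and the contents of its `SCORE_ATTRIBUTE_NAMES` are hypothetical stand-ins, since the real list defined on `Psm` is not shown in this section.

```python
class ScoredItem:
    """Hypothetical stand-in for Psm, reduced to its score attributes."""

    # Assumed allow-list; the real Psm.SCORE_ATTRIBUTE_NAMES is not shown here.
    SCORE_ATTRIBUTE_NAMES = ("percscore", "qvalue", "pepvalue", "custom_score")

    def __init__(self, percscore=None, qvalue=None, pepvalue=None, custom_score=None):
        self.percscore = percscore
        self.qvalue = qvalue
        self.pepvalue = pepvalue
        self.custom_score = custom_score
        self.main_score = None

    def assign_main_score(self, score):
        # Reject anything outside the allow-list before touching getattr,
        # so arbitrary attributes can never become the main score.
        if score not in self.SCORE_ATTRIBUTE_NAMES:
            raise ValueError(
                "Score must be one of: '{}'".format(", ".join(self.SCORE_ATTRIBUTE_NAMES))
            )
        self.main_score = getattr(self, score)


item = ScoredItem(percscore=3.44016, qvalue=0.000479928)
item.assign_main_score("qvalue")
print(item.main_score)
```

Checking membership first keeps the error message informative and avoids an `AttributeError` surfacing from `getattr` for misspelled score names.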
remove_peptide_mods(peptide_string)
classmethod
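This class method takes a string and uses a MOD_REGEX to remove mods from peptide strings.
Parameters:
Name | Type | Description |
---|---|---|
peptide_string | str | Peptide string to have mods removed from. |
Returns:
Type | Description |
---|---|
str | A peptide string with mods removed. |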
Source code in pyproteininference/physical.py
@classmethod
def remove_peptide_mods(cls, peptide_string):
"""
This class method takes a string and uses a `MOD_REGEX` to remove mods from peptide strings.
Args:
peptide_string (str): Peptide string to have mods removed from.
Returns:
str: a peptide string with mods removed.
"""
stripped_peptide = cls.MOD_REGEX.sub("", peptide_string)
return stripped_peptide
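How modification stripping combines with the upper-casing step in `__init__` can be sketched in isolation. The pattern below is a hypothetical stand-in, because the actual value of `Psm.MOD_REGEX` is not shown in this section; it removes bracketed or parenthesised annotations and any remaining character that is not an uppercase letter.

```python
import re

# Hypothetical pattern (an assumption, not the library's MOD_REGEX):
# bracketed annotations, parenthesised annotations, then any non A-Z character.
MOD_PATTERN = re.compile(r"\[[^\]]*\]|\([^)]*\)|[^A-Z]")


def strip_mods(peptide):
    """Return the bare sequence for a peptide that has no flanking residues."""
    # A string that is already all uppercase letters carries no modifications.
    if peptide.isupper() and peptide.isalpha():
        return peptide
    # Mirror the source: upper-case first, then remove modification markup.
    return MOD_PATTERN.sub("", peptide.upper())


for raw in ("PEPTIDEK", "DLIDEGH#AATQK", "AC[+57.021]DEK", "M(ox)PEPTIDEK"):
    print(raw, "->", strip_mods(raw))
```

The early return matches the guard in `__init__`, which only calls the regex when the peptide is not already a plain uppercase alphabetic string.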
split_peptide(peptide_string, delimiter='.')
classmethod
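This class method takes a peptide string with flanking AAs and removes them from the peptide string. This method uses string splitting; if the split does not yield exactly one or three chunks, split_peptide_pro is called instead.
Parameters:
Name | Type | Description |
---|---|---|
peptide_string | str | Peptide string to have flanking AAs removed from. |
delimiter | str | A string to indicate what separates a leading/trailing (flanking) AA from the peptide sequence. |
Returns:
Type | Description |
---|---|
str | A peptide string with flanking AAs removed. |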
Source code in pyproteininference/physical.py
@classmethod
def split_peptide(cls, peptide_string, delimiter="."):
"""
This class method takes a peptide string with flanking AAs and removes them from the peptide string.
This method uses string splitting and if the method produces a faulty peptide the method
[split_peptide_pro][pyproteininference.physical.Psm.split_peptide_pro] will be called.
Args:
peptide_string (str): Peptide string to have flanking AAs removed from.
delimiter (str): a string to indicate what separates a leading/trailing (flanking) AA from the
peptide sequence.
Returns:
str: a peptide string with flanking AAs removed.
"""
peptide_split = peptide_string.split(delimiter)
if len(peptide_split) == 3:
# If we get 3 chunks it will usually be ['A', 'ADGSDFGSS', 'F']
# So take index 1
peptide = peptide_split[1]
elif len(peptide_split) == 1:
# If we get 1 chunk it should just be ['ADGSDFGSS']
# So take index 0
peptide = peptide_split[0]
else:
# If we split the peptide and it is not length 1 or 3 then try to split with pro
peptide = cls.split_peptide_pro(peptide_string=peptide_string, delimiter=delimiter)
return peptide
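The split-then-fallback behaviour of split_peptide can be tried outside the class with a small sketch. `strip_flanking` is a hypothetical helper written for illustration, and its fallback patterns use the character class `[A-Z-]`, a simplification of the patterns in the source.

```python
import re


def strip_flanking(peptide_string, delimiter="."):
    """Remove one leading and one trailing flanking residue, if present."""
    chunks = peptide_string.split(delimiter)
    if len(chunks) == 3:
        # e.g. ['K', 'PEPTIDE', 'R']: the middle chunk is the peptide
        return chunks[1]
    if len(chunks) == 1:
        # No delimiter at all: the string is already the peptide
        return chunks[0]
    # Any other chunk count (for example a '.' inside a modification mass)
    # falls back to anchored regexes, built locally for this delimiter.
    front = re.compile("^[A-Z-]" + re.escape(delimiter))
    back = re.compile(re.escape(delimiter) + "[A-Z-]$")
    return back.sub("", front.sub("", peptide_string))


print(strip_flanking("K.DLIDEGHAATQLVNQLHDVVVENNLSDK.Q"))  # DLIDEGHAATQLVNQLHDVVVENNLSDK
print(strip_flanking("ADGSDFGSS"))                         # ADGSDFGSS
print(strip_flanking("K.AC[+57.021]DEK.R"))                # AC[+57.021]DEK
```

The third call shows why the fallback exists: the decimal point in the modification mass produces four chunks, so plain splitting cannot tell flanking residues from sequence.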
split_peptide_pro(peptide_string, delimiter='.')
classmethod
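This class method takes a peptide string with flanking AAs and removes them from the peptide string. This is a specialized method of split_peptide that uses regex identifiers to replace flanking AAs as opposed to string splitting.
Parameters:
Name | Type | Description |
---|---|---|
peptide_string | str | Peptide string to have flanking AAs removed from. |
delimiter | str | A string to indicate what separates a leading/trailing (flanking) AA from the peptide sequence. |
Returns:
Type | Description |
---|---|
str | A peptide string with flanking AAs removed. |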
Source code in pyproteininference/physical.py
@classmethod
def split_peptide_pro(cls, peptide_string, delimiter="."):
    """
    This class method takes a peptide string with flanking AAs and removes them from the peptide string.
    This is a specialized method of [split_peptide][pyproteininference.physical.Psm.split_peptide] that uses
    regex identifiers to replace flanking AAs as opposed to string splitting.
    Args:
        peptide_string (str): Peptide string to have flanking AAs removed from.
        delimiter (str): a string to indicate what separates a leading/trailing (flanking) AA from the peptide
            sequence.
    Returns:
        str: a peptide string with flanking AAs removed.
    """
    front_regex = cls.FRONT_FLANKING_REGEX
    back_regex = cls.BACK_FLANKING_REGEX
    if delimiter != ".":
        # Compile delimiter-specific patterns locally so the class-level defaults are left untouched
        front_regex = re.compile("^[A-Z|-][{}]".format(re.escape(delimiter)))
        back_regex = re.compile("[{}][A-Z|-]$".format(re.escape(delimiter)))
    # Replace the front flanking with nothing
    peptide_string = front_regex.sub("", peptide_string)
    # Replace the back flanking with nothing
    peptide_string = back_regex.sub("", peptide_string)
    return peptide_string
pipeline
ProteinInferencePipeline
This is the main Protein Inference class which houses the logic of the entire data analysis pipeline. Logic is executed in the execute method.
Attributes:
Name | Type | Description |
---|---|---|
parameter_file | str | Path to Protein Inference Yaml Parameter File. |
database_file | str | Path to Fasta database used in proteomics search. |
target_files | str/list | Path to Target Psm File (Or a list of files). |
decoy_files | str/list | Path to Decoy Psm File (Or a list of files). |
combined_files | str/list | Path to Combined Psm File (Or a list of files). |
target_directory | str | Path to Directory containing Target Psm Files. |
decoy_directory | str | Path to Directory containing Decoy Psm Files. |
combined_directory | str | Path to Directory containing Combined Psm Files. |
output_directory | str | Path to Directory where output will be written. |
output_filename | str | Path to Filename where output will be written. Will override output_directory. |
id_splitting | bool | True/False on whether to split protein IDs in the digest. Advanced usage only. |
append_alt_from_db | bool | True/False on whether to append alternative proteins from the DB digestion in Reader class. |
data | DataStore | DataStore object. |
digest | Digest | Digest object. |
Source code in pyproteininference/pipeline.py
class ProteinInferencePipeline(object):
"""
This is the main Protein Inference class which houses the logic of the entire data analysis pipeline.
Logic is executed in the [execute][pyproteininference.pipeline.ProteinInferencePipeline.execute] method.
Attributes:
parameter_file (str): Path to Protein Inference Yaml Parameter File.
database_file (str): Path to Fasta database used in proteomics search.
target_files (str/list): Path to Target Psm File (Or a list of files).
decoy_files (str/list): Path to Decoy Psm File (Or a list of files).
combined_files (str/list): Path to Combined Psm File (Or a list of files).
target_directory (str): Path to Directory containing Target Psm Files.
decoy_directory (str): Path to Directory containing Decoy Psm Files.
combined_directory (str): Path to Directory containing Combined Psm Files.
output_directory (str): Path to Directory where output will be written.
output_filename (str): Path to Filename where output will be written. Will override output_directory.
id_splitting (bool): True/False on whether to split protein IDs in the digest. Advanced usage only.
append_alt_from_db (bool): True/False on whether to append alternative proteins from the DB digestion in
Reader class.
data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
"""
def __init__(
self,
parameter_file,
database_file=None,
target_files=None,
decoy_files=None,
combined_files=None,
target_directory=None,
decoy_directory=None,
combined_directory=None,
output_directory=None,
output_filename=None,
id_splitting=False,
append_alt_from_db=True,
):
"""
Args:
parameter_file (str): Path to Protein Inference Yaml Parameter File.
database_file (str): Path to Fasta database used in proteomics search.
target_files (str/list): Path to Target Psm File (Or a list of files).
decoy_files (str/list): Path to Decoy Psm File (Or a list of files).
combined_files (str/list): Path to Combined Psm File (Or a list of files).
target_directory (str): Path to Directory containing Target Psm Files.
decoy_directory (str): Path to Directory containing Decoy Psm Files.
combined_directory (str): Path to Directory containing Combined Psm Files.
output_filename (str): Path to Filename where output will be written. Will override output_directory.
output_directory (str): Path to Directory where output will be written.
id_splitting (bool): True/False on whether to split protein IDs in the digest. Advanced usage only.
append_alt_from_db (bool): True/False on whether to append alternative proteins from the DB digestion in
Reader class.
Returns:
object:
Example:
>>> pipeline = pyproteininference.pipeline.ProteinInferencePipeline(
>>> parameter_file=yaml_params,
>>> database_file=database,
>>> target_files=target,
>>> decoy_files=decoy,
>>> combined_files=combined_files,
>>> target_directory=target_directory,
>>> decoy_directory=decoy_directory,
>>> combined_directory=combined_directory,
>>> output_directory=dir_name,
>>> output_filename=output_filename,
>>> append_alt_from_db=append_alt,
>>> )
"""
self.parameter_file = parameter_file
self.database_file = database_file
self.target_files = target_files
self.decoy_files = decoy_files
self.combined_files = combined_files
self.target_directory = target_directory
self.decoy_directory = decoy_directory
self.combined_directory = combined_directory
self.output_directory = output_directory
self.output_filename = output_filename
self.id_splitting = id_splitting
self.append_alt_from_db = append_alt_from_db
self.data = None
self.digest = None
self._validate_input()
self._set_output_directory()
self._log_append_alt_from_db()
self._log_id_splitting()
def execute(self):
"""
This method is the main driver of the data analysis for the protein inference package.
This method calls other classes and methods that make up the protein inference pipeline.
This method sets the data [DataStore Object][pyproteininference.datastore.DataStore] and digest
[Digest Object][pyproteininference.in_silico_digest.Digest].
The pipeline steps include but are not limited to:
1. Parameter file management.
2. Digesting Fasta Database (Optional).
3. Reading in input Psm Files.
4. Initializing the [DataStore Object][pyproteininference.datastore.DataStore].
5. Restricting Psms.
6. Creating Protein objects/scoring input.
7. Scoring Proteins.
8. Running Protein Picker.
9. Running Inference Methods/Grouping.
10. Calculating Q Values.
11. Exporting Proteins to filesystem.
Example:
>>> pipeline = pyproteininference.pipeline.ProteinInferencePipeline(
>>> parameter_file=yaml_params,
>>> database_file=database,
>>> target_files=target,
>>> decoy_files=decoy,
>>> combined_files=combined_files,
>>> target_directory=target_directory,
>>> decoy_directory=decoy_directory,
>>> combined_directory=combined_directory,
>>> output_directory=dir_name,
>>> output_filename=output_filename,
>>> append_alt_from_db=append_alt,
>>> )
>>> pipeline.execute()
"""
# STEP 1: Load parameter file #
# STEP 1: Load parameter file #
# STEP 1: Load parameter file #
pyproteininference_parameters = pyproteininference.parameters.ProteinInferenceParameter(
yaml_param_filepath=self.parameter_file
)
# STEP 2: Start with running an In Silico Digestion #
# STEP 2: Start with running an In Silico Digestion #
# STEP 2: Start with running an In Silico Digestion #
digest = pyproteininference.in_silico_digest.PyteomicsDigest(
database_path=self.database_file,
digest_type=pyproteininference_parameters.digest_type,
missed_cleavages=pyproteininference_parameters.missed_cleavages,
reviewed_identifier_symbol=pyproteininference_parameters.reviewed_identifier_symbol,
max_peptide_length=pyproteininference_parameters.restrict_peptide_length,
id_splitting=self.id_splitting,
)
if self.database_file:
logger.info("Running In Silico Database Digest on file {}".format(self.database_file))
digest.digest_fasta_database()
else:
logger.warning(
"No Database File provided, Skipping database digest and only taking protein-peptide mapping from the "
"input files."
)
# STEP 3: Read PSM Data #
# STEP 3: Read PSM Data #
# STEP 3: Read PSM Data #
reader = pyproteininference.reader.GenericReader(
target_file=self.target_files,
decoy_file=self.decoy_files,
combined_files=self.combined_files,
parameter_file_object=pyproteininference_parameters,
digest=digest,
append_alt_from_db=self.append_alt_from_db,
)
reader.read_psms()
# STEP 4: Initiate the datastore object #
# STEP 4: Initiate the datastore object #
# STEP 4: Initiate the datastore object #
data = pyproteininference.datastore.DataStore(reader=reader, digest=digest)
# Step 5: Restrict the PSM data
# Step 5: Restrict the PSM data
# Step 5: Restrict the PSM data
data.restrict_psm_data()
data.recover_mapping()
# Step 6: Generate protein scoring input
# Step 6: Generate protein scoring input
# Step 6: Generate protein scoring input
data.create_scoring_input()
# Step 7: Remove non unique peptides if running exclusion
# Step 7: Remove non unique peptides if running exclusion
# Step 7: Remove non unique peptides if running exclusion
if pyproteininference_parameters.inference_type == Inference.EXCLUSION:
# This only runs when the inference type is exclusion
data.exclude_non_distinguishing_peptides()
# STEP 8: Score our PSMs given a score method
# STEP 8: Score our PSMs given a score method
# STEP 8: Score our PSMs given a score method
score = pyproteininference.scoring.Score(data=data)
score.score_psms(score_method=pyproteininference_parameters.protein_score)
# STEP 9: Run protein picker on the data
# STEP 9: Run protein picker on the data
# STEP 9: Run protein picker on the data
if pyproteininference_parameters.picker:
data.protein_picker()
else:
pass
# STEP 10: Apply Inference
# STEP 10: Apply Inference
# STEP 10: Apply Inference
pyproteininference.inference.Inference.run_inference(data=data, digest=digest)
# STEP 11: Q value Calculations
# STEP 11: Q value Calculations
# STEP 11: Q value Calculations
data.calculate_q_values()
# STEP 12: Export to CSV
# STEP 12: Export to CSV
# STEP 12: Export to CSV
export = pyproteininference.export.Export(data=data)
export.export_to_csv(
output_filename=self.output_filename,
directory=self.output_directory,
export_type=pyproteininference_parameters.export,
)
self.data = data
self.digest = digest
logger.info("Protein Inference Finished")
def _validate_input(self):
"""
Internal method that validates whether the proper input files have been defined.
One of the following combinations must be selected as input. No more and no less:
1. either one or multiple target_files and decoy_files.
2. either one or multiple combined_files that include target and decoy data.
3. a directory that contains target files (target_directory) as well as a directory that contains decoy files
(decoy_directory).
4. a directory that contains combined target/decoy files (combined_directory).
Raises:
ValueError: Raised if an improper combination of inputs is supplied.
"""
if (
self.target_files
and self.decoy_files
and not self.combined_files
and not self.target_directory
and not self.decoy_directory
and not self.combined_directory
):
logger.info("Validating input as target_files and decoy_files")
elif (
self.combined_files
and not self.target_files
and not self.decoy_files
and not self.decoy_directory
and not self.target_directory
and not self.combined_directory
):
logger.info("Validating input as combined_files")
elif (
self.target_directory
and self.decoy_directory
and not self.target_files
and not self.decoy_files
and not self.combined_directory
and not self.combined_files
):
logger.info("Validating input as target_directory and decoy_directory")
self._transform_directory_to_files()
elif (
self.combined_directory
and not self.combined_files
and not self.decoy_files
and not self.decoy_directory
and not self.target_files
and not self.target_directory
):
logger.info("Validating input as combined_directory")
self._transform_directory_to_files()
else:
raise ValueError(
"To run protein inference, please supply exactly one of: "
"(1) one or more target_files and decoy_files, "
"(2) one or more combined_files that include target and decoy data, "
"(3) a directory that contains target files (target_directory) as well as a directory that "
"contains decoy files (decoy_directory), "
"(4) a directory that contains combined target/decoy files (combined_directory)"
)
def _transform_directory_to_files(self):
"""
This internal method takes files that are in the target_directory, decoy_directory, or combined_directory and
reassigns these files to the target_files, decoy_files, and combined_files to be used in
[Reader][pyproteininference.reader.Reader] object.
"""
if self.target_directory and self.decoy_directory:
logger.info("Transforming target_directory and decoy_directory into files")
target_files = os.listdir(self.target_directory)
target_files_full = [
os.path.join(self.target_directory, x) for x in target_files if x.endswith(".txt") or x.endswith(".tsv")
]
decoy_files = os.listdir(self.decoy_directory)
decoy_files_full = [
os.path.join(self.decoy_directory, x) for x in decoy_files if x.endswith(".txt") or x.endswith(".tsv")
]
self.target_files = target_files_full
self.decoy_files = decoy_files_full
elif self.combined_directory:
logger.info("Transforming combined_directory into files")
combined_files = os.listdir(self.combined_directory)
combined_files_full = [
os.path.join(self.combined_directory, x)
for x in combined_files
if x.endswith(".txt") or x.endswith(".tsv")
]
self.combined_files = combined_files_full
def _set_output_directory(self):
"""
Internal method for setting the output directory.
If the output_directory argument is not supplied the output directory is set as the cwd.
"""
if not self.output_directory:
self.output_directory = os.getcwd()
else:
pass
def _log_append_alt_from_db(self):
"""
Internal method for logging whether the user sets alternative protein append to True or False.
"""
if self.append_alt_from_db:
logger.info("Append Alternative Proteins from Database set to True")
else:
logger.info("Append Alternative Proteins from Database set to False")
def _log_id_splitting(self):
"""
Internal method for logging whether the user sets ID splitting to True or False.
"""
if self.id_splitting:
logger.info("ID Splitting for Database Digestion set to True")
else:
logger.info("ID Splitting for Database Digestion set to False")
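The directory handling in `_transform_directory_to_files` amounts to listing a directory and keeping only `.txt` and `.tsv` entries. A minimal sketch of that filter is shown here; `list_psm_files` is a hypothetical helper, and sorting the result is an addition for reproducibility, since `os.listdir` returns entries in arbitrary order.

```python
import os


def list_psm_files(directory, extensions=(".txt", ".tsv")):
    """Return full paths of files in `directory` whose names end with one of `extensions`."""
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        # str.endswith accepts a tuple, so both suffixes are checked in one call
        if name.endswith(extensions)
    )
```

A call such as `list_psm_files("path/to/target_dir")` (the path is illustrative) would produce the kind of list that the pipeline assigns to `target_files`, `decoy_files`, or `combined_files`.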
__init__(self, parameter_file, database_file=None, target_files=None, decoy_files=None, combined_files=None, target_directory=None, decoy_directory=None, combined_directory=None, output_directory=None, output_filename=None, id_splitting=False, append_alt_from_db=True)
special
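Parameters:
Name | Type | Description |
---|---|---|
parameter_file | str | Path to Protein Inference Yaml Parameter File. |
database_file | str | Path to Fasta database used in proteomics search. |
target_files | str/list | Path to Target Psm File (Or a list of files). |
decoy_files | str/list | Path to Decoy Psm File (Or a list of files). |
combined_files | str/list | Path to Combined Psm File (Or a list of files). |
target_directory | str | Path to Directory containing Target Psm Files. |
decoy_directory | str | Path to Directory containing Decoy Psm Files. |
combined_directory | str | Path to Directory containing Combined Psm Files. |
output_filename | str | Path to Filename where output will be written. Will override output_directory. |
output_directory | str | Path to Directory where output will be written. |
id_splitting | bool | True/False on whether to split protein IDs in the digest. Advanced usage only. |
append_alt_from_db | bool | True/False on whether to append alternative proteins from the DB digestion in Reader class. |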
Examples:
>>> pipeline = pyproteininference.pipeline.ProteinInferencePipeline(
>>> parameter_file=yaml_params,
>>> database_file=database,
>>> target_files=target,
>>> decoy_files=decoy,
>>> combined_files=combined_files,
>>> target_directory=target_directory,
>>> decoy_directory=decoy_directory,
>>> combined_directory=combined_directory,
>>> output_directory=dir_name,
>>> output_filename=output_filename,
>>> append_alt_from_db=append_alt,
>>> )
Source code in pyproteininference/pipeline.py
def __init__(
self,
parameter_file,
database_file=None,
target_files=None,
decoy_files=None,
combined_files=None,
target_directory=None,
decoy_directory=None,
combined_directory=None,
output_directory=None,
output_filename=None,
id_splitting=False,
append_alt_from_db=True,
):
"""
Args:
parameter_file (str): Path to Protein Inference Yaml Parameter File.
database_file (str): Path to Fasta database used in proteomics search.
target_files (str/list): Path to Target Psm File (Or a list of files).
decoy_files (str/list): Path to Decoy Psm File (Or a list of files).
combined_files (str/list): Path to Combined Psm File (Or a list of files).
target_directory (str): Path to Directory containing Target Psm Files.
decoy_directory (str): Path to Directory containing Decoy Psm Files.
combined_directory (str): Path to Directory containing Combined Psm Files.
output_filename (str): Path to Filename where output will be written. Will override output_directory.
output_directory (str): Path to Directory where output will be written.
id_splitting (bool): True/False on whether to split protein IDs in the digest. Advanced usage only.
append_alt_from_db (bool): True/False on whether to append alternative proteins from the DB digestion in
Reader class.
Returns:
object:
Example:
>>> pipeline = pyproteininference.pipeline.ProteinInferencePipeline(
>>> parameter_file=yaml_params,
>>> database_file=database,
>>> target_files=target,
>>> decoy_files=decoy,
>>> combined_files=combined_files,
>>> target_directory=target_directory,
>>> decoy_directory=decoy_directory,
>>> combined_directory=combined_directory,
>>> output_directory=dir_name,
>>> output_filename=output_filename,
>>> append_alt_from_db=append_alt,
>>> )
"""
self.parameter_file = parameter_file
self.database_file = database_file
self.target_files = target_files
self.decoy_files = decoy_files
self.combined_files = combined_files
self.target_directory = target_directory
self.decoy_directory = decoy_directory
self.combined_directory = combined_directory
self.output_directory = output_directory
self.output_filename = output_filename
self.id_splitting = id_splitting
self.append_alt_from_db = append_alt_from_db
self.data = None
self.digest = None
self._validate_input()
self._set_output_directory()
self._log_append_alt_from_db()
self._log_id_splitting()
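Because `__init__` calls `_validate_input`, only four input combinations are accepted, and each must be supplied on its own. By those checks, passing truthy values for every input argument at once, as the example reads unless the unused variables are None or empty, would raise a ValueError. The rule can be sketched as a lookup on the set of supplied argument names; `classify_inputs` is a hypothetical helper for illustration, not part of the package.

```python
def classify_inputs(
    target_files=None,
    decoy_files=None,
    combined_files=None,
    target_directory=None,
    decoy_directory=None,
    combined_directory=None,
):
    """Return a label for the input mode, or raise ValueError for a mixed or empty selection."""
    candidates = {
        "target_files": target_files,
        "decoy_files": decoy_files,
        "combined_files": combined_files,
        "target_directory": target_directory,
        "decoy_directory": decoy_directory,
        "combined_directory": combined_directory,
    }
    # Truthiness mirrors the source: None, "" and [] all count as "not supplied".
    supplied = frozenset(name for name, value in candidates.items() if value)
    allowed = {
        frozenset({"target_files", "decoy_files"}): "target_files and decoy_files",
        frozenset({"combined_files"}): "combined_files",
        frozenset({"target_directory", "decoy_directory"}): "target_directory and decoy_directory",
        frozenset({"combined_directory"}): "combined_directory",
    }
    if supplied not in allowed:
        raise ValueError("Unsupported input combination: {}".format(sorted(supplied)))
    return allowed[supplied]


print(classify_inputs(target_files="target.txt", decoy_files="decoy.txt"))
print(classify_inputs(combined_directory="psms/"))
```

Expressing the rule as a set lookup makes the "no more and no less" requirement explicit: any extra or missing argument changes the set and falls through to the error.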
execute(self)
This method is the main driver of the data analysis for the protein inference package. It calls the other classes and methods that make up the protein inference pipeline and sets the data DataStore Object and digest Digest Object. The pipeline steps include but are not limited to:
- Parameter file management.
- Digesting Fasta Database (Optional).
- Reading in input Psm Files.
- Initializing the DataStore Object.
- Restricting Psms.
- Creating Protein objects/scoring input.
- Scoring Proteins.
- Running Protein Picker.
- Running Inference Methods/Grouping.
- Calculating Q Values.
- Exporting Proteins to filesystem.
Examples:
>>> pipeline = pyproteininference.pipeline.ProteinInferencePipeline(
>>> parameter_file=yaml_params,
>>> database_file=database,
>>> target_files=target,
>>> decoy_files=decoy,
>>> combined_files=combined_files,
>>> target_directory=target_directory,
>>> decoy_directory=decoy_directory,
>>> combined_directory=combined_directory,
>>> output_directory=dir_name,
>>> output_filename=output_filename,
>>> append_alt_from_db=append_alt,
>>> )
>>> pipeline.execute()
Source code in pyproteininference/pipeline.py
def execute(self):
"""
This method is the main driver of the data analysis for the protein inference package.
This method calls other classes and methods that make up the protein inference pipeline.
This method sets the data [DataStore Object][pyproteininference.datastore.DataStore] and digest
[Digest Object][pyproteininference.in_silico_digest.Digest].
The pipeline steps include but are not limited to:
1. Parameter file management.
2. Digesting Fasta Database (Optional).
3. Reading in input Psm Files.
4. Initializing the [DataStore Object][pyproteininference.datastore.DataStore].
5. Restricting Psms.
6. Creating Protein objects/scoring input.
7. Scoring Proteins.
8. Running Protein Picker.
9. Running Inference Methods/Grouping.
10. Calculating Q Values.
11. Exporting Proteins to filesystem.
Example:
>>> pipeline = pyproteininference.pipeline.ProteinInferencePipeline(
>>> parameter_file=yaml_params,
>>> database_file=database,
>>> target_files=target,
>>> decoy_files=decoy,
>>> combined_files=combined_files,
>>> target_directory=target_directory,
>>> decoy_directory=decoy_directory,
>>> combined_directory=combined_directory,
>>> output_directory=dir_name,
>>> output_filename=output_filename,
>>> append_alt_from_db=append_alt,
>>> )
>>> pipeline.execute()
"""
# STEP 1: Load parameter file #
# STEP 1: Load parameter file #
# STEP 1: Load parameter file #
pyproteininference_parameters = pyproteininference.parameters.ProteinInferenceParameter(
yaml_param_filepath=self.parameter_file
)
# STEP 2: Start with running an In Silico Digestion #
# STEP 2: Start with running an In Silico Digestion #
# STEP 2: Start with running an In Silico Digestion #
digest = pyproteininference.in_silico_digest.PyteomicsDigest(
database_path=self.database_file,
digest_type=pyproteininference_parameters.digest_type,
missed_cleavages=pyproteininference_parameters.missed_cleavages,
reviewed_identifier_symbol=pyproteininference_parameters.reviewed_identifier_symbol,
max_peptide_length=pyproteininference_parameters.restrict_peptide_length,
id_splitting=self.id_splitting,
)
if self.database_file:
logger.info("Running In Silico Database Digest on file {}".format(self.database_file))
digest.digest_fasta_database()
else:
logger.warning(
"No Database File provided, Skipping database digest and only taking protein-peptide mapping from the "
"input files."
)
# STEP 3: Read PSM Data #
# STEP 3: Read PSM Data #
# STEP 3: Read PSM Data #
reader = pyproteininference.reader.GenericReader(
target_file=self.target_files,
decoy_file=self.decoy_files,
combined_files=self.combined_files,
parameter_file_object=pyproteininference_parameters,
digest=digest,
append_alt_from_db=self.append_alt_from_db,
)
reader.read_psms()
# STEP 4: Initiate the datastore object #
# STEP 4: Initiate the datastore object #
# STEP 4: Initiate the datastore object #
data = pyproteininference.datastore.DataStore(reader=reader, digest=digest)
# Step 5: Restrict the PSM data
# Step 5: Restrict the PSM data
# Step 5: Restrict the PSM data
data.restrict_psm_data()
data.recover_mapping()
# Step 6: Generate protein scoring input
# Step 6: Generate protein scoring input
# Step 6: Generate protein scoring input
data.create_scoring_input()
# Step 7: Remove non unique peptides if running exclusion
# Step 7: Remove non unique peptides if running exclusion
# Step 7: Remove non unique peptides if running exclusion
if pyproteininference_parameters.inference_type == Inference.EXCLUSION:
# This only runs when the inference type is exclusion
data.exclude_non_distinguishing_peptides()
# STEP 8: Score our PSMs given a score method
# STEP 8: Score our PSMs given a score method
# STEP 8: Score our PSMs given a score method
score = pyproteininference.scoring.Score(data=data)
score.score_psms(score_method=pyproteininference_parameters.protein_score)
# STEP 9: Run protein picker on the data
# STEP 9: Run protein picker on the data
# STEP 9: Run protein picker on the data
if pyproteininference_parameters.picker:
data.protein_picker()
else:
pass
# STEP 10: Apply Inference
# STEP 10: Apply Inference
# STEP 10: Apply Inference
pyproteininference.inference.Inference.run_inference(data=data, digest=digest)
# STEP 11: Q value Calculations
# STEP 11: Q value Calculations
# STEP 11: Q value Calculations
data.calculate_q_values()
# STEP 12: Export to CSV
# STEP 12: Export to CSV
# STEP 12: Export to CSV
export = pyproteininference.export.Export(data=data)
export.export_to_csv(
output_filename=self.output_filename,
directory=self.output_directory,
export_type=pyproteininference_parameters.export,
)
self.data = data
self.digest = digest
logger.info("Protein Inference Finished")
reader
GenericReader (Reader)
The following class takes a percolator-like target file and a percolator-like decoy file and creates standard Psm objects.
Percolator-like output is formatted as follows, with each entry tab delimited:
PSMId | score | q-value | posterior_error_prob | peptide | proteinIds | | | |
---|---|---|---|---|---|---|---|---|
116108.15139.15139.6.dta | 3.44016 | 0.000479928 | 7.60258e-10 | K.MVVSMTLGLHPWIANIDDTQYLAAK.R | CNDP1_HUMAN\|Q96KN2 | B4E180_HUMAN\|B4E180 | A8K1K1_HUMAN\|A8K1K1 | J3KRP0_HUMAN\|J3KRP0 |
Custom columns can be added and used as scoring input. Please see package documentation for more information.
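The percolator-like layout described above has six named columns, with additional protein IDs spilling into unnamed trailing columns. A minimal sketch of reading such a file with the standard `csv` module follows. `parse_percolator_like` is a hypothetical helper and is not how GenericReader is implemented; it only illustrates the column layout.

```python
import csv
import io

# One header line and one PSM line, tab delimited, taken from the layout above.
SAMPLE = (
    "PSMId\tscore\tq-value\tposterior_error_prob\tpeptide\tproteinIds\n"
    "116108.15139.15139.6.dta\t3.44016\t0.000479928\t7.60258e-10\t"
    "K.MVVSMTLGLHPWIANIDDTQYLAAK.R\tCNDP1_HUMAN|Q96KN2\tB4E180_HUMAN|B4E180\t"
    "A8K1K1_HUMAN|A8K1K1\tJ3KRP0_HUMAN|J3KRP0\n"
)


def parse_percolator_like(handle):
    """Yield one dict per PSM row; proteinIds collects every trailing column."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    protein_idx = header.index("proteinIds")
    rows = []
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        record = dict(zip(header[:protein_idx], fields[:protein_idx]))
        # Everything from the proteinIds column onward is a protein identifier.
        record["proteinIds"] = [p for p in fields[protein_idx:] if p]
        rows.append(record)
    return rows


rows = parse_percolator_like(io.StringIO(SAMPLE))
print(rows[0]["peptide"], len(rows[0]["proteinIds"]))
```

Slicing from the `proteinIds` index onward is what lets a row carry a variable number of alternative proteins without a matching header for each one.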
Attributes:
Name | Type | Description |
---|---|---|
target_file | str/list | Path to Target PSM result files. |
decoy_file | str/list | Path to Decoy PSM result files. |
combined_files | str/list | Path to Combined PSM result files. |
directory | str | Path to directory containing combined PSM result files. |
psms | list | List of Psm objects. |
load_custom_score | bool | True/False on whether or not to load a custom score. Depends on scoring_variable. |
scoring_variable | str | String to indicate which column in the input file is to be used as the scoring input. |
digest | Digest | Digest object. |
parameter_file_object | ProteinInferenceParameter | ProteinInferenceParameter object. |
append_alt_from_db | bool | Whether or not to append alternative proteins found in the database that are not in the input files. |
Source code in pyproteininference/reader.py
class GenericReader(Reader):
"""
The following class takes a percolator like target file and a percolator like decoy file
and creates standard [Psm][pyproteininference.physical.Psm] objects.
Percolator Like Output is formatted as follows:
with each entry being tab delimited.
| PSMId | score | q-value | posterior_error_prob | peptide | proteinIds | | | | # noqa E501 W605
|-------------------------------|----------|-------------|-----------------------|--------------------------------|---------------------|----------------------|----------------------|-------------------------| # noqa E501 W605
| 116108.15139.15139.6.dta | 3.44016 | 0.000479928 | 7.60258e-10 | K.MVVSMTLGLHPWIANIDDTQYLAAK.R | CNDP1_HUMAN\|Q96KN2 | B4E180_HUMAN\|B4E180 | A8K1K1_HUMAN\|A8K1K1 | J3KRP0_HUMAN\|J3KRP0 | # noqa E501 W605
Custom columns can be added and used as scoring input. Please see package documentation for more information.
Attributes:
target_file (str/list): Path to Target PSM result files.
decoy_file (str/list): Path to Decoy PSM result files.
combined_files (str/list): Path to Combined PSM result files.
directory (str): Path to directory containing combined PSM result files.
psms (list): List of [Psm][pyproteininference.physical.Psm] objects.
load_custom_score (bool): True/False on whether or not to load a custom score. Depends on scoring_variable.
scoring_variable (str): String to indicate which column in the input file is to be used as the scoring input.
digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
parameter_file_object (ProteinInferenceParameter):
[ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter] object
append_alt_from_db (bool): Whether or not to append alternative proteins found in the database that
are not in the input files.
"""
PSMID = "PSMId"
SCORE = "score"
Q_VALUE = "q-value"
POSTERIOR_ERROR_PROB = "posterior_error_prob"
PEPTIDE = "peptide"
PROTEIN_IDS = "proteinIds"
ALTERNATIVE_PROTEINS = "alternative_proteins"
def __init__(
self,
digest,
parameter_file_object,
append_alt_from_db=True,
target_file=None,
decoy_file=None,
combined_files=None,
directory=None,
):
"""
Args:
digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
parameter_file_object (ProteinInferenceParameter):
[ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter] object.
append_alt_from_db (bool): Whether or not to append alternative proteins found in the database that
are not in the input files.
target_file (str/list): Path to Target PSM result files.
decoy_file (str/list): Path to Decoy PSM result files.
combined_files (str/list): Path to Combined PSM result files.
directory (str): Path to directory containing combined PSM result files.
Returns:
Reader: [Reader][pyproteininference.reader.Reader] object.
Example:
>>> pyproteininference.reader.GenericReader(target_file = "example_target.txt",
>>> decoy_file = "example_decoy.txt",
>>> digest=digest, parameter_file_object=pi_params)
"""
self.target_file = target_file
self.decoy_file = decoy_file
self.combined_files = combined_files
self.directory = directory
self.psms = None
self.search_id = None
self.digest = digest
self.initial_protein_peptide_map = copy.copy(self.digest.protein_to_peptide_dictionary)
self.load_custom_score = False
self.append_alt_from_db = append_alt_from_db
self.parameter_file_object = parameter_file_object
self.scoring_variable = parameter_file_object.psm_score
self._validate_input()
if self.scoring_variable != self.Q_VALUE and self.scoring_variable != self.POSTERIOR_ERROR_PROB:
self.load_custom_score = True
logger.info(
"Pulling custom column based on parameter file input for score, Column: {}".format(
self.scoring_variable
)
)
else:
logger.info(
"Pulling no custom columns based on parameter file input for score, using standard Column: {}".format(
self.scoring_variable
)
)
# If we select to not run inference at all
if self.parameter_file_object.inference_type == Inference.FIRST_PROTEIN:
# Only allow 1 Protein per PSM
self.MAX_ALLOWED_ALTERNATIVE_PROTEINS = 1
def read_psms(self):
"""
Method to read psms from the input files and to transform them into a list of
[Psm][pyproteininference.physical.Psm] objects.
This method sets the `psms` variable, which is a list of Psm objects.
This method must be run before initializing [DataStore object][pyproteininference.datastore.DataStore].
Example:
>>> reader = pyproteininference.reader.GenericReader(target_file = "example_target.txt",
>>> decoy_file = "example_decoy.txt",
>>> digest=digest, parameter_file_object=pi_params)
>>> reader.read_psms()
"""
logger.info("Reading in Input Files using Generic Reader...")
# Read in and split by line
# If target_file is a list... read them all in and concatenate...
if self.target_file and self.decoy_file:
if isinstance(self.target_file, (list,)):
all_target = []
for t_files in self.target_file:
ptarg = []
with open(t_files, "r") as psm_target_file:
logger.info(t_files)
spamreader = csv.DictReader(psm_target_file, delimiter="\t")
for row in spamreader:
row = self.get_alternative_proteins_from_input(row)
ptarg.append(row)
all_target = all_target + ptarg
else:
# If not just read the file...
ptarg = []
with open(self.target_file, "r") as psm_target_file:
logger.info(self.target_file)
spamreader = csv.DictReader(psm_target_file, delimiter="\t")
for row in spamreader:
row = self.get_alternative_proteins_from_input(row)
ptarg.append(row)
all_target = ptarg
# Repeat for decoy file
if isinstance(self.decoy_file, (list,)):
all_decoy = []
for d_files in self.decoy_file:
pdec = []
with open(d_files, "r") as psm_decoy_file:
logger.info(d_files)
spamreader = csv.DictReader(psm_decoy_file, delimiter="\t")
for row in spamreader:
row = self.get_alternative_proteins_from_input(row)
pdec.append(row)
all_decoy = all_decoy + pdec
else:
pdec = []
with open(self.decoy_file, "r") as psm_decoy_file:
logger.info(self.decoy_file)
spamreader = csv.DictReader(psm_decoy_file, delimiter="\t")
for row in spamreader:
row = self.get_alternative_proteins_from_input(row)
pdec.append(row)
all_decoy = pdec
# Combine the lists
all_psms = all_target + all_decoy
elif self.combined_files:
if isinstance(self.combined_files, (list,)):
all = []
for c_files in self.combined_files:
c_all = []
with open(c_files, "r") as psm_file:
logger.info(c_files)
spamreader = csv.DictReader(psm_file, delimiter="\t")
for row in spamreader:
row = self.get_alternative_proteins_from_input(row)
c_all.append(row)
all = all + c_all
else:
c_all = []
with open(self.combined_files, "r") as psm_file:
logger.info(self.combined_files)
spamreader = csv.DictReader(psm_file, delimiter="\t")
for row in spamreader:
row = self.get_alternative_proteins_from_input(row)
c_all.append(row)
all = c_all
all_psms = all
elif self.directory:
all_files = os.listdir(self.directory)
all = []
for files in all_files:
psm_per_file = []
with open(os.path.join(self.directory, files), "r") as psm_file:
logger.info(files)
spamreader = csv.DictReader(psm_file, delimiter="\t")
for row in spamreader:
row = self.get_alternative_proteins_from_input(row)
psm_per_file.append(row)
all = all + psm_per_file
all_psms = all
psms_all_filtered = []
for psms in all_psms:
if self.POSTERIOR_ERROR_PROB in psms.keys():
try:
float(psms[self.POSTERIOR_ERROR_PROB])
psms_all_filtered.append(psms)
except ValueError as e: # noqa F841
pass
else:
try:
float(psms[self.scoring_variable])
psms_all_filtered.append(psms)
except ValueError as e: # noqa F841
pass
# Sort by pep (posterior error probability), lowest first
try:
logger.info("Sorting by {}".format(self.POSTERIOR_ERROR_PROB))
all_psms = sorted(
psms_all_filtered,
key=lambda x: float(x[self.POSTERIOR_ERROR_PROB]),
reverse=False,
)
except KeyError:
logger.info("Cannot Sort by {} the values do not exist".format(self.POSTERIOR_ERROR_PROB))
logger.info("Sorting by {}".format(self.scoring_variable))
if self.parameter_file_object.psm_score_type == Score.ADDITIVE_SCORE_TYPE:
all_psms = sorted(
psms_all_filtered,
key=lambda x: float(x[self.scoring_variable]),
reverse=True,
)
if self.parameter_file_object.psm_score_type == Score.MULTIPLICATIVE_SCORE_TYPE:
all_psms = sorted(
psms_all_filtered,
key=lambda x: float(x[self.scoring_variable]),
reverse=False,
)
list_of_psm_objects = []
peptide_tracker = set()
all_sp_proteins = set(self.digest.swiss_prot_protein_set)
# We only want to get unique peptides... using all messes up scoring...
# Create Psm objects with the identifier, percscore, qvalue, pepvalue, and possible proteins...
peptide_to_protein_dictionary = self.digest.peptide_to_protein_dictionary
initial_poss_prots = []
logger.info("Number of PSMs in the input data: {}".format(len(all_psms)))
psms_with_alternative_proteins = self._find_psms_with_alternative_proteins(raw_psms=all_psms)
logger.info(
"Number of PSMs that have alternative proteins in the input data {}".format(
len(psms_with_alternative_proteins)
)
)
if len(psms_with_alternative_proteins) == 0:
logger.warning(
"No PSMs in the input have alternative proteins. "
"Make sure your input is properly formatted. "
"Alternative Proteins will be retrieved from the fasta database"
)
for psm_info in all_psms:
current_peptide = psm_info[self.PEPTIDE]
# Define the Psm...
if current_peptide not in peptide_tracker:
psm = Psm(identifier=current_peptide)
# Attempt to add variables from PSM info...
# If they do not exist in the psm info then we skip...
try:
psm.percscore = float(psm_info[self.SCORE])
except KeyError:
pass
try:
psm.qvalue = float(psm_info[self.Q_VALUE])
except KeyError:
pass
try:
psm.pepvalue = float(psm_info[self.POSTERIOR_ERROR_PROB])
except KeyError:
pass
# If user has a custom score IE not q-value or pep_value...
if self.load_custom_score:
# Then we look for it...
psm.custom_score = float(psm_info[self.scoring_variable])
psm.possible_proteins = []
psm.possible_proteins.append(psm_info[self.PROTEIN_IDS])
psm.possible_proteins = psm.possible_proteins + [x for x in psm_info[self.ALTERNATIVE_PROTEINS] if x]
# Remove potential Repeats
if self.parameter_file_object.inference_type != Inference.FIRST_PROTEIN:
psm.possible_proteins = sorted(list(set(psm.possible_proteins)))
input_poss_prots = copy.copy(psm.possible_proteins)
# Get PSM ID
psm.psm_id = psm_info[self.PSMID]
# Split peptide if flanking
current_peptide = Psm.split_peptide(peptide_string=current_peptide)
if not current_peptide.isupper() or not current_peptide.isalpha():
# If we have mods remove them...
peptide_string = current_peptide.upper()
stripped_peptide = Psm.remove_peptide_mods(peptide_string)
current_peptide = stripped_peptide
# Add the other possible_proteins from insilicodigest here...
try:
current_alt_proteins = sorted(list(peptide_to_protein_dictionary[current_peptide]))
except KeyError:
current_alt_proteins = []
logger.debug(
"Peptide {} was not found in the supplied DB for Proteins {}".format(
current_peptide, ";".join(psm.possible_proteins)
)
)
for poss_prot in psm.possible_proteins:
self.digest.peptide_to_protein_dictionary.setdefault(current_peptide, set()).add(poss_prot)
self.digest.protein_to_peptide_dictionary.setdefault(poss_prot, set()).add(current_peptide)
logger.debug(
"Adding Peptide {} and Protein {} to Digest dictionaries".format(current_peptide, poss_prot)
)
# Sort Alt Proteins by Swissprot then Trembl...
identifiers_sorted = DataStore.sort_protein_strings(
protein_string_list=current_alt_proteins,
sp_proteins=all_sp_proteins,
decoy_symbol=self.parameter_file_object.decoy_symbol,
)
# Restrict to at most MAX_ALLOWED_ALTERNATIVE_PROTEINS possible proteins
psm = self._fix_alternative_proteins(
append_alt_from_db=self.append_alt_from_db,
identifiers_sorted=identifiers_sorted,
max_proteins=self.MAX_ALLOWED_ALTERNATIVE_PROTEINS,
psm=psm,
parameter_file_object=self.parameter_file_object,
)
list_of_psm_objects.append(psm)
peptide_tracker.add(current_peptide)
initial_poss_prots.append(input_poss_prots)
self.psms = list_of_psm_objects
self._check_initial_database_overlap(
initial_possible_proteins=initial_poss_prots, initial_protein_peptide_map=self.initial_protein_peptide_map
)
logger.info("Length of PSM Data: {}".format(len(self.psms)))
logger.info("Finished GenericReader.read_psms...")
def _find_psms_with_alternative_proteins(self, raw_psms):
psms_with_alternative_proteins = [x for x in raw_psms if x["alternative_proteins"]]
return psms_with_alternative_proteins
__init__(self, digest, parameter_file_object, append_alt_from_db=True, target_file=None, decoy_file=None, combined_files=None, directory=None)
special
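Parameters:
Name | Type | Description | Default |
---|---|---|---|
digest | Digest | Digest Object. | required |
parameter_file_object | ProteinInferenceParameter | ProteinInferenceParameter object. | required |
append_alt_from_db | bool | Whether or not to append alternative proteins found in the database that are not in the input files. | True |
target_file | str/list | Path to Target PSM result files. | None |
decoy_file | str/list | Path to Decoy PSM result files. | None |
combined_files | str/list | Path to Combined PSM result files. | None |
directory | str | Path to directory containing combined PSM result files. | None |
Returns:
Type | Description |
---|---|
Reader | Reader object. |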
Examples:
>>> pyproteininference.reader.GenericReader(
...     target_file="example_target.txt",
...     decoy_file="example_decoy.txt",
...     digest=digest,
...     parameter_file_object=pi_params,
... )
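The constructor accepts each file argument as either a single path or a list of paths, or a directory of combined files. A minimal sketch of how these input modes could be flattened into one list of paths is shown below. The helper `resolve_input_files` is hypothetical (the package reads the files inline rather than through such a function), and raising `ValueError` when nothing is supplied is an assumption of this example. The precedence order follows the `read_psms` source on this page.

```python
import os


def resolve_input_files(target_file=None, decoy_file=None, combined_files=None, directory=None):
    """Return a flat list of PSM file paths for whichever input mode is set.

    Hypothetical helper for illustration only.
    """

    def as_list(value):
        # Accept either a single path or a list of paths.
        return list(value) if isinstance(value, (list, tuple)) else [value]

    if target_file and decoy_file:
        return as_list(target_file) + as_list(decoy_file)
    if combined_files:
        return as_list(combined_files)
    if directory:
        # os.listdir returns bare names, so join them with the directory.
        return [os.path.join(directory, name) for name in sorted(os.listdir(directory))]
    raise ValueError("Provide target_file and decoy_file, combined_files, or directory")


print(resolve_input_files(target_file="example_target.txt", decoy_file="example_decoy.txt"))
print(resolve_input_files(combined_files=["a.txt", "b.txt"]))
```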
Source code in pyproteininference/reader.py
def __init__(
self,
digest,
parameter_file_object,
append_alt_from_db=True,
target_file=None,
decoy_file=None,
combined_files=None,
directory=None,
):
"""
Args:
digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
parameter_file_object (ProteinInferenceParameter):
[ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter] object.
append_alt_from_db (bool): Whether or not to append alternative proteins found in the database that
are not in the input files.
target_file (str/list): Path to Target PSM result files.
decoy_file (str/list): Path to Decoy PSM result files.
combined_files (str/list): Path to Combined PSM result files.
directory (str): Path to directory containing combined PSM result files.
Returns:
Reader: [Reader][pyproteininference.reader.Reader] object.
Example:
>>> pyproteininference.reader.GenericReader(target_file = "example_target.txt",
>>> decoy_file = "example_decoy.txt",
>>> digest=digest, parameter_file_object=pi_params)
"""
self.target_file = target_file
self.decoy_file = decoy_file
self.combined_files = combined_files
self.directory = directory
self.psms = None
self.search_id = None
self.digest = digest
self.initial_protein_peptide_map = copy.copy(self.digest.protein_to_peptide_dictionary)
self.load_custom_score = False
self.append_alt_from_db = append_alt_from_db
self.parameter_file_object = parameter_file_object
self.scoring_variable = parameter_file_object.psm_score
self._validate_input()
if self.scoring_variable != self.Q_VALUE and self.scoring_variable != self.POSTERIOR_ERROR_PROB:
self.load_custom_score = True
logger.info(
"Pulling custom column based on parameter file input for score, Column: {}".format(
self.scoring_variable
)
)
else:
logger.info(
"Pulling no custom columns based on parameter file input for score, using standard Column: {}".format(
self.scoring_variable
)
)
# If we select to not run inference at all
if self.parameter_file_object.inference_type == Inference.FIRST_PROTEIN:
# Only allow 1 Protein per PSM
self.MAX_ALLOWED_ALTERNATIVE_PROTEINS = 1
read_psms(self)
Method to read psms from the input files and to transform them into a list of Psm objects.
This method sets the psms variable, which is a list of Psm objects.
This method must be run before initializing the DataStore object.
Examples:
>>> reader = pyproteininference.reader.GenericReader(
...     target_file="example_target.txt",
...     decoy_file="example_decoy.txt",
...     digest=digest,
...     parameter_file_object=pi_params,
... )
>>> reader.read_psms()
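Before building Psm objects, `read_psms` drops rows whose score is not numeric, sorts the remaining rows, and keeps one entry per peptide. The sketch below illustrates that idea on plain dictionaries. It is a simplified stand-in rather than the package's implementation: the helper `best_row_per_peptide` is hypothetical, and it compares peptide strings exactly as given, without stripping flanking residues or modifications.

```python
def best_row_per_peptide(rows, score_key="posterior_error_prob", peptide_key="peptide"):
    """Keep one row per peptide, preferring the lowest score value.

    Hypothetical helper for illustration only.
    """
    numeric_rows = []
    for row in rows:
        try:
            float(row[score_key])
        except (KeyError, ValueError):
            # Skip rows whose score is missing or not a number.
            continue
        numeric_rows.append(row)

    # Lower posterior error probability is better, so sort ascending.
    numeric_rows.sort(key=lambda r: float(r[score_key]))

    seen = set()
    unique_rows = []
    for row in numeric_rows:
        if row[peptide_key] not in seen:
            seen.add(row[peptide_key])
            unique_rows.append(row)
    return unique_rows


rows = [
    {"peptide": "K.AAAK.R", "posterior_error_prob": "0.01"},
    {"peptide": "K.AAAK.R", "posterior_error_prob": "0.001"},
    {"peptide": "R.CCCK.A", "posterior_error_prob": "not_a_number"},
    {"peptide": "R.DDDR.G", "posterior_error_prob": "0.005"},
]
print([(r["peptide"], r["posterior_error_prob"]) for r in best_row_per_peptide(rows)])
```

Sorting first means the first occurrence of each peptide is also its best-scoring one, so a single pass with a set is enough to deduplicate.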
Source code in pyproteininference/reader.py
def read_psms(self):
"""
Method to read psms from the input files and to transform them into a list of
[Psm][pyproteininference.physical.Psm] objects.
This method sets the `psms` variable, which is a list of Psm objects.
This method must be run before initializing [DataStore object][pyproteininference.datastore.DataStore].
Example:
>>> reader = pyproteininference.reader.GenericReader(target_file = "example_target.txt",
>>> decoy_file = "example_decoy.txt",
>>> digest=digest, parameter_file_object=pi_params)
>>> reader.read_psms()
"""
logger.info("Reading in Input Files using Generic Reader...")
# Read in and split by line
# If target_file is a list... read them all in and concatenate...
if self.target_file and self.decoy_file:
if isinstance(self.target_file, (list,)):
all_target = []
for t_files in self.target_file:
ptarg = []
with open(t_files, "r") as psm_target_file:
logger.info(t_files)
spamreader = csv.DictReader(psm_target_file, delimiter="\t")
for row in spamreader:
row = self.get_alternative_proteins_from_input(row)
ptarg.append(row)
all_target = all_target + ptarg
else:
# If not just read the file...
ptarg = []
with open(self.target_file, "r") as psm_target_file:
logger.info(self.target_file)
spamreader = csv.DictReader(psm_target_file, delimiter="\t")
for row in spamreader:
row = self.get_alternative_proteins_from_input(row)
ptarg.append(row)
all_target = ptarg
# Repeat for decoy file
if isinstance(self.decoy_file, (list,)):
all_decoy = []
for d_files in self.decoy_file:
pdec = []
with open(d_files, "r") as psm_decoy_file:
logger.info(d_files)
spamreader = csv.DictReader(psm_decoy_file, delimiter="\t")
for row in spamreader:
row = self.get_alternative_proteins_from_input(row)
pdec.append(row)
all_decoy = all_decoy + pdec
else:
pdec = []
with open(self.decoy_file, "r") as psm_decoy_file:
logger.info(self.decoy_file)
spamreader = csv.DictReader(psm_decoy_file, delimiter="\t")
for row in spamreader:
row = self.get_alternative_proteins_from_input(row)
pdec.append(row)
all_decoy = pdec
# Combine the lists
all_psms = all_target + all_decoy
elif self.combined_files:
if isinstance(self.combined_files, (list,)):
all = []
for c_files in self.combined_files:
c_all = []
with open(c_files, "r") as psm_file:
logger.info(c_files)
spamreader = csv.DictReader(psm_file, delimiter="\t")
for row in spamreader:
row = self.get_alternative_proteins_from_input(row)
c_all.append(row)
all = all + c_all
else:
c_all = []
with open(self.combined_files, "r") as psm_file:
logger.info(self.combined_files)
spamreader = csv.DictReader(psm_file, delimiter="\t")
for row in spamreader:
row = self.get_alternative_proteins_from_input(row)
c_all.append(row)
all = c_all
all_psms = all
elif self.directory:
all_files = os.listdir(self.directory)
all = []
for files in all_files:
psm_per_file = []
with open(os.path.join(self.directory, files), "r") as psm_file:
logger.info(files)
spamreader = csv.DictReader(psm_file, delimiter="\t")
for row in spamreader:
row = self.get_alternative_proteins_from_input(row)
psm_per_file.append(row)
all = all + psm_per_file
all_psms = all
psms_all_filtered = []
for psms in all_psms:
if self.POSTERIOR_ERROR_PROB in psms.keys():
try:
float(psms[self.POSTERIOR_ERROR_PROB])
psms_all_filtered.append(psms)
except ValueError as e: # noqa F841
pass
else:
try:
float(psms[self.scoring_variable])
psms_all_filtered.append(psms)
except ValueError as e: # noqa F841
pass
# Sort by pep (posterior error probability), lowest first
try:
logger.info("Sorting by {}".format(self.POSTERIOR_ERROR_PROB))
all_psms = sorted(
psms_all_filtered,
key=lambda x: float(x[self.POSTERIOR_ERROR_PROB]),
reverse=False,
)
except KeyError:
logger.info("Cannot Sort by {} the values do not exist".format(self.POSTERIOR_ERROR_PROB))
logger.info("Sorting by {}".format(self.scoring_variable))
if self.parameter_file_object.psm_score_type == Score.ADDITIVE_SCORE_TYPE:
all_psms = sorted(
psms_all_filtered,
key=lambda x: float(x[self.scoring_variable]),
reverse=True,
)
if self.parameter_file_object.psm_score_type == Score.MULTIPLICATIVE_SCORE_TYPE:
all_psms = sorted(
psms_all_filtered,
key=lambda x: float(x[self.scoring_variable]),
reverse=False,
)
list_of_psm_objects = []
peptide_tracker = set()
all_sp_proteins = set(self.digest.swiss_prot_protein_set)
# We only want to get unique peptides... using all messes up scoring...
# Create Psm objects with the identifier, percscore, qvalue, pepvalue, and possible proteins...
peptide_to_protein_dictionary = self.digest.peptide_to_protein_dictionary
initial_poss_prots = []
logger.info("Number of PSMs in the input data: {}".format(len(all_psms)))
psms_with_alternative_proteins = self._find_psms_with_alternative_proteins(raw_psms=all_psms)
logger.info(
"Number of PSMs that have alternative proteins in the input data {}".format(
len(psms_with_alternative_proteins)
)
)
if len(psms_with_alternative_proteins) == 0:
logger.warning(
"No PSMs in the input have alternative proteins. "
"Make sure your input is properly formatted. "
"Alternative Proteins will be retrieved from the fasta database"
)
for psm_info in all_psms:
current_peptide = psm_info[self.PEPTIDE]
# Define the Psm...
if current_peptide not in peptide_tracker:
psm = Psm(identifier=current_peptide)
# Attempt to add variables from PSM info...
# If they do not exist in the psm info then we skip...
try:
psm.percscore = float(psm_info[self.SCORE])
except KeyError:
pass
try:
psm.qvalue = float(psm_info[self.Q_VALUE])
except KeyError:
pass
try:
psm.pepvalue = float(psm_info[self.POSTERIOR_ERROR_PROB])
except KeyError:
pass
# If user has a custom score IE not q-value or pep_value...
if self.load_custom_score:
# Then we look for it...
psm.custom_score = float(psm_info[self.scoring_variable])
psm.possible_proteins = []
psm.possible_proteins.append(psm_info[self.PROTEIN_IDS])
psm.possible_proteins = psm.possible_proteins + [x for x in psm_info[self.ALTERNATIVE_PROTEINS] if x]
# Remove potential Repeats
if self.parameter_file_object.inference_type != Inference.FIRST_PROTEIN:
psm.possible_proteins = sorted(list(set(psm.possible_proteins)))
input_poss_prots = copy.copy(psm.possible_proteins)
# Get PSM ID
psm.psm_id = psm_info[self.PSMID]
# Split peptide if flanking
current_peptide = Psm.split_peptide(peptide_string=current_peptide)
if not current_peptide.isupper() or not current_peptide.isalpha():
# If we have mods remove them...
peptide_string = current_peptide.upper()
stripped_peptide = Psm.remove_peptide_mods(peptide_string)
current_peptide = stripped_peptide
# Add the other possible_proteins from insilicodigest here...
try:
current_alt_proteins = sorted(list(peptide_to_protein_dictionary[current_peptide]))
except KeyError:
current_alt_proteins = []
logger.debug(
"Peptide {} was not found in the supplied DB for Proteins {}".format(
current_peptide, ";".join(psm.possible_proteins)
)
)
for poss_prot in psm.possible_proteins:
self.digest.peptide_to_protein_dictionary.setdefault(current_peptide, set()).add(poss_prot)
self.digest.protein_to_peptide_dictionary.setdefault(poss_prot, set()).add(current_peptide)
logger.debug(
"Adding Peptide {} and Protein {} to Digest dictionaries".format(current_peptide, poss_prot)
)
# Sort Alt Proteins by Swissprot then Trembl...
identifiers_sorted = DataStore.sort_protein_strings(
protein_string_list=current_alt_proteins,
sp_proteins=all_sp_proteins,
decoy_symbol=self.parameter_file_object.decoy_symbol,
)
# Restrict to at most MAX_ALLOWED_ALTERNATIVE_PROTEINS possible proteins
psm = self._fix_alternative_proteins(
append_alt_from_db=self.append_alt_from_db,
identifiers_sorted=identifiers_sorted,
max_proteins=self.MAX_ALLOWED_ALTERNATIVE_PROTEINS,
psm=psm,
parameter_file_object=self.parameter_file_object,
)
list_of_psm_objects.append(psm)
peptide_tracker.add(current_peptide)
initial_poss_prots.append(input_poss_prots)
self.psms = list_of_psm_objects
self._check_initial_database_overlap(
initial_possible_proteins=initial_poss_prots, initial_protein_peptide_map=self.initial_protein_peptide_map
)
logger.info("Length of PSM Data: {}".format(len(self.psms)))
logger.info("Finished GenericReader.read_psms...")
PercolatorReader (Reader)
The following class takes a percolator target file and a percolator decoy file or combined files/directory and creates standard Psm objects. This reader class is used as input for DataStore object.
Percolator output is formatted as follows, with each entry being tab delimited:
| PSMId | score | q-value | posterior_error_prob | peptide | proteinIds | | | |
|---|---|---|---|---|---|---|---|---|
| 116108.15139.15139.6.dta | 3.44016 | 0.000479928 | 7.60258e-10 | K.MVVSMTLGLHPWIANIDDTQYLAAK.R | CNDP1_HUMAN\|Q96KN2 | B4E180_HUMAN\|B4E180 | A8K1K1_HUMAN\|A8K1K1 | J3KRP0_HUMAN\|J3KRP0 |
Attributes:
Name | Type | Description |
---|---|---|
target_file | str/list | Path to Target PSM result files. |
decoy_file | str/list | Path to Decoy PSM result files. |
combined_files | str/list | Path to Combined PSM result files. |
directory | str | Path to directory containing combined PSM result files. |
PSMID_INDEX | int | Index of the PSMId from the input files. |
PERC_SCORE_INDEX | int | Index of the Percolator score from the input files. |
Q_VALUE_INDEX | int | Index of the q-value from the input files. |
POSTERIOR_ERROR_PROB_INDEX | int | Index of the posterior error probability from the input files. |
PEPTIDE_INDEX | int | Index of the peptides from the input files. |
PROTEINIDS_INDEX | int | Index of the proteins from the input files. |
psms | list | List of Psm objects. |
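The index attributes above describe a positional layout: columns 0 through 4 hold fixed fields, and every column from `PROTEINIDS_INDEX` onward holds a protein identifier. A minimal sketch of reading one row this way is shown below. The helper `parse_percolator_row` is hypothetical, it returns a plain dictionary instead of a Psm object, and the default cap of 50 proteins is an assumption made for this example.

```python
import csv
import io

PSMID_INDEX = 0
PERC_SCORE_INDEX = 1
Q_VALUE_INDEX = 2
POSTERIOR_ERROR_PROB_INDEX = 3
PEPTIDE_INDEX = 4
PROTEINIDS_INDEX = 5


def parse_percolator_row(row, max_proteins=50):
    """Turn one split Percolator row into a plain dict.

    Hypothetical helper; max_proteins=50 is an assumed cap for illustration.
    """
    # Every column from PROTEINIDS_INDEX onward holds a protein identifier.
    proteins = sorted({p for p in row[PROTEINIDS_INDEX:] if p})
    return {
        "psm_id": row[PSMID_INDEX],
        "percscore": float(row[PERC_SCORE_INDEX]),
        "qvalue": float(row[Q_VALUE_INDEX]),
        "pepvalue": float(row[POSTERIOR_ERROR_PROB_INDEX]),
        "peptide": row[PEPTIDE_INDEX],
        "possible_proteins": proteins[:max_proteins],
    }


text = (
    "PSMId\tscore\tq-value\tposterior_error_prob\tpeptide\tproteinIds\n"
    "116108.15139.15139.6.dta\t3.44016\t0.000479928\t7.60258e-10\t"
    "K.MVVSMTLGLHPWIANIDDTQYLAAK.R\tCNDP1_HUMAN|Q96KN2\tB4E180_HUMAN|B4E180\n"
)
lines = list(csv.reader(io.StringIO(text), delimiter="\t"))
records = [parse_percolator_row(r) for r in lines[1:]]  # skip the header row
print(records[0]["possible_proteins"])
```

Because the layout is positional, a file whose columns are reordered or renamed would be misread by this approach; a reader keyed on column names avoids that constraint.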
Source code in pyproteininference/reader.py
class PercolatorReader(Reader):
"""
The following class takes a percolator target file and a percolator decoy file
or combined files/directory and creates standard [Psm][pyproteininference.physical.Psm] objects.
This reader class is used as input for [DataStore object][pyproteininference.datastore.DataStore].
Percolator Output is formatted as follows:
with each entry being tab delimited.
| PSMId | score | q-value | posterior_error_prob | peptide | proteinIds | | | | # noqa E501 W605
|-------------------------------|----------|-------------|-----------------------|--------------------------------|---------------------|----------------------|----------------------|-------------------------| # noqa E501 W605
| 116108.15139.15139.6.dta | 3.44016 | 0.000479928 | 7.60258e-10 | K.MVVSMTLGLHPWIANIDDTQYLAAK.R | CNDP1_HUMAN\|Q96KN2 | B4E180_HUMAN\|B4E180 | A8K1K1_HUMAN\|A8K1K1 | J3KRP0_HUMAN\|J3KRP0 | # noqa E501 W605
Attributes:
target_file (str/list): Path to Target PSM result files.
decoy_file (str/list): Path to Decoy PSM result files.
combined_files (str/list): Path to Combined PSM result files.
directory (str): Path to directory containing combined PSM result files.
PSMID_INDEX (int): Index of the PSMId from the input files.
PERC_SCORE_INDEX (int): Index of the Percolator score from the input files.
Q_VALUE_INDEX (int): Index of the q-value from the input files.
POSTERIOR_ERROR_PROB_INDEX (int): Index of the posterior error probability from the input files.
PEPTIDE_INDEX (int): Index of the peptides from the input files.
PROTEINIDS_INDEX (int): Index of the proteins from the input files.
psms (list): List of [Psm][pyproteininference.physical.Psm] objects.
"""
PSMID_INDEX = 0
PERC_SCORE_INDEX = 1
Q_VALUE_INDEX = 2
POSTERIOR_ERROR_PROB_INDEX = 3
PEPTIDE_INDEX = 4
PROTEINIDS_INDEX = 5
def __init__(
self,
digest,
parameter_file_object,
append_alt_from_db=True,
target_file=None,
decoy_file=None,
combined_files=None,
directory=None,
):
"""
Args:
digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
parameter_file_object (ProteinInferenceParameter):
[ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter].
append_alt_from_db (bool): Whether or not to append alternative proteins found in the database that
are not in the input files.
target_file (str/list): Path to Target PSM result files.
decoy_file (str/list): Path to Decoy PSM result files.
combined_files (str/list): Path to Combined PSM result files.
directory (str): Path to directory containing combined PSM result files.
Returns:
Reader: [Reader][pyproteininference.reader.Reader] object.
Example:
>>> pyproteininference.reader.PercolatorReader(target_file = "example_target.txt",
>>> decoy_file = "example_decoy.txt", digest=digest,parameter_file_object=pi_params)
"""
self.target_file = target_file
self.decoy_file = decoy_file
self.combined_files = combined_files
self.directory = directory
# Define indices based on input
self.psms = None
self.search_id = None
self.digest = digest
self.initial_protein_peptide_map = copy.copy(self.digest.protein_to_peptide_dictionary)
self.append_alt_from_db = append_alt_from_db
self.parameter_file_object = parameter_file_object
self._validate_input()
def read_psms(self):
"""
Method to read psms from the input files and to transform them into a list of
[Psm][pyproteininference.physical.Psm] objects.
This method sets the `psms` variable, which is a list of Psm objects.
This method must be run before initializing [DataStore object][pyproteininference.datastore.DataStore].
Example:
>>> reader = pyproteininference.reader.PercolatorReader(target_file = "example_target.txt",
>>> decoy_file = "example_decoy.txt",
>>> digest=digest, parameter_file_object=pi_params)
>>> reader.read_psms()
"""
# Read in and split by line
if self.target_file and self.decoy_file:
# If target_file is a list... read them all in and concatenate...
if isinstance(self.target_file, (list,)):
all_target = []
for t_files in self.target_file:
logger.info(t_files)
ptarg = []
with open(t_files, "r") as perc_target_file:
spamreader = csv.reader(perc_target_file, delimiter="\t")
for row in spamreader:
ptarg.append(row)
del ptarg[0]
all_target = all_target + ptarg
elif self.target_file:
# If not just read the file...
ptarg = []
with open(self.target_file, "r") as perc_target_file:
spamreader = csv.reader(perc_target_file, delimiter="\t")
for row in spamreader:
ptarg.append(row)
del ptarg[0]
all_target = ptarg
# Repeat for decoy file
if isinstance(self.decoy_file, (list,)):
all_decoy = []
for d_files in self.decoy_file:
logger.info(d_files)
pdec = []
with open(d_files, "r") as perc_decoy_file:
spamreader = csv.reader(perc_decoy_file, delimiter="\t")
for row in spamreader:
pdec.append(row)
del pdec[0]
all_decoy = all_decoy + pdec
elif self.decoy_file:
pdec = []
with open(self.decoy_file, "r") as perc_decoy_file:
spamreader = csv.reader(perc_decoy_file, delimiter="\t")
for row in spamreader:
pdec.append(row)
del pdec[0]
all_decoy = pdec
# Combine the lists
perc_all = all_target + all_decoy
elif self.combined_files:
if isinstance(self.combined_files, (list,)):
all = []
for f in self.combined_files:
logger.info(f)
combined_psm_result_rows = []
with open(f, "r") as perc_files:
spamreader = csv.reader(perc_files, delimiter="\t")
for row in spamreader:
combined_psm_result_rows.append(row)
del combined_psm_result_rows[0]
all = all + combined_psm_result_rows
elif self.combined_files:
# If not just read the file...
combined_psm_result_rows = []
with open(self.combined_files, "r") as perc_files:
spamreader = csv.reader(perc_files, delimiter="\t")
for row in spamreader:
combined_psm_result_rows.append(row)
del combined_psm_result_rows[0]
all = combined_psm_result_rows
perc_all = all
elif self.directory:
all_files = os.listdir(self.directory)
all = []
for files in all_files:
logger.info(files)
combined_psm_result_rows = []
with open(os.path.join(self.directory, files), "r") as perc_file:
spamreader = csv.reader(perc_file, delimiter="\t")
for row in spamreader:
combined_psm_result_rows.append(row)
del combined_psm_result_rows[0]
all = all + combined_psm_result_rows
perc_all = all
peptide_to_protein_dictionary = self.digest.peptide_to_protein_dictionary
perc_all_filtered = []
for psms in perc_all:
try:
float(psms[self.POSTERIOR_ERROR_PROB_INDEX])
perc_all_filtered.append(psms)
except ValueError as e: # noqa F841
pass
# Sort by pep (posterior error probability), lowest first
perc_all = sorted(
perc_all_filtered,
key=lambda x: float(x[self.POSTERIOR_ERROR_PROB_INDEX]),
reverse=False,
)
list_of_psm_objects = []
peptide_tracker = set()
all_sp_proteins = set(self.digest.swiss_prot_protein_set)
# We only want to get unique peptides... using all messes up scoring...
# Create Psm objects with the identifier, percscore, qvalue, pepvalue, and possible proteins...
initial_poss_prots = []
logger.info("Length of PSM Data: {}".format(len(perc_all)))
for psm_info in perc_all:
current_peptide = psm_info[self.PEPTIDE_INDEX]
# Define the Psm...
if current_peptide not in peptide_tracker:
combined_psm_result_rows = Psm(identifier=current_peptide)
# Add all the attributes
combined_psm_result_rows.percscore = float(psm_info[self.PERC_SCORE_INDEX])
combined_psm_result_rows.qvalue = float(psm_info[self.Q_VALUE_INDEX])
combined_psm_result_rows.pepvalue = float(psm_info[self.POSTERIOR_ERROR_PROB_INDEX])
if self.parameter_file_object.inference_type == Inference.FIRST_PROTEIN:
poss_proteins = [psm_info[self.PROTEINIDS_INDEX]]
else:
poss_proteins = sorted(list(set(psm_info[self.PROTEINIDS_INDEX :]))) # noqa E203
poss_proteins = poss_proteins[: self.MAX_ALLOWED_ALTERNATIVE_PROTEINS]
combined_psm_result_rows.possible_proteins = poss_proteins # Restrict to 50 total possible proteins...
combined_psm_result_rows.psm_id = psm_info[self.PSMID_INDEX]
input_poss_prots = copy.copy(poss_proteins)
# Split peptide if flanking
current_peptide = Psm.split_peptide(peptide_string=current_peptide)
if not current_peptide.isupper() or not current_peptide.isalpha():
# If we have mods remove them...
peptide_string = current_peptide.upper()
stripped_peptide = Psm.remove_peptide_mods(peptide_string)
current_peptide = stripped_peptide
# Add the other possible_proteins from insilicodigest here...
try:
current_alt_proteins = sorted(
list(peptide_to_protein_dictionary[current_peptide])
) # This peptide needs to be scrubbed of Mods...
except KeyError:
current_alt_proteins = []
logger.debug(
"Peptide {} was not found in the supplied DB with the following proteins {}".format(
current_peptide,
";".join(combined_psm_result_rows.possible_proteins),
)
)
for poss_prot in combined_psm_result_rows.possible_proteins:
self.digest.peptide_to_protein_dictionary.setdefault(current_peptide, set()).add(poss_prot)
self.digest.protein_to_peptide_dictionary.setdefault(poss_prot, set()).add(current_peptide)
logger.debug(
"Adding Peptide {} and Protein {} to Digest dictionaries".format(current_peptide, poss_prot)
)
# Sort Alt Proteins by Swissprot then Trembl...
identifiers_sorted = DataStore.sort_protein_strings(
protein_string_list=current_alt_proteins,
sp_proteins=all_sp_proteins,
decoy_symbol=self.parameter_file_object.decoy_symbol,
)
# Restrict to 50 possible proteins
combined_psm_result_rows = self._fix_alternative_proteins(
append_alt_from_db=self.append_alt_from_db,
identifiers_sorted=identifiers_sorted,
max_proteins=self.MAX_ALLOWED_ALTERNATIVE_PROTEINS,
psm=combined_psm_result_rows,
parameter_file_object=self.parameter_file_object,
)
# Remove blank alt proteins
combined_psm_result_rows.possible_proteins = [
x for x in combined_psm_result_rows.possible_proteins if x != ""
]
list_of_psm_objects.append(combined_psm_result_rows)
peptide_tracker.add(current_peptide)
initial_poss_prots.append(input_poss_prots)
self.psms = list_of_psm_objects
self._check_initial_database_overlap(
initial_possible_proteins=initial_poss_prots, initial_protein_peptide_map=self.initial_protein_peptide_map
)
logger.info("Length of PSM Data: {}".format(len(self.psms)))
__init__(self, digest, parameter_file_object, append_alt_from_db=True, target_file=None, decoy_file=None, combined_files=None, directory=None)
special
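Parameters:
Name | Type | Description |
---|---|---|
digest | Digest | Digest object. |
parameter_file_object | ProteinInferenceParameter | ProteinInferenceParameter object. |
append_alt_from_db | bool | Whether or not to append alternative proteins found in the database that are not in the input files. |
target_file | str/list | Path to Target PSM result files. |
decoy_file | str/list | Path to Decoy PSM result files. |
combined_files | str/list | Path to Combined PSM result files. |
directory | str | Path to directory containing combined PSM result files. |
Returns:
Type | Description |
---|---|
Reader | Reader object. |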
Examples:
>>> pyproteininference.reader.PercolatorReader(target_file="example_target.txt",
...     decoy_file="example_decoy.txt", digest=digest, parameter_file_object=pi_params)
Source code in pyproteininference/reader.py
def __init__(
    self,
    digest,
    parameter_file_object,
    append_alt_from_db=True,
    target_file=None,
    decoy_file=None,
    combined_files=None,
    directory=None,
):
    """
    Args:
        digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
        parameter_file_object (ProteinInferenceParameter):
            [ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter].
        append_alt_from_db (bool): Whether or not to append alternative proteins found in the database that
            are not in the input files.
        target_file (str/list): Path to Target PSM result files.
        decoy_file (str/list): Path to Decoy PSM result files.
        combined_files (str/list): Path to Combined PSM result files.
        directory (str): Path to directory containing combined PSM result files.

    Returns:
        Reader: [Reader][pyproteininference.reader.Reader] object.

    Example:
        >>> pyproteininference.reader.PercolatorReader(target_file="example_target.txt",
        ...     decoy_file="example_decoy.txt", digest=digest, parameter_file_object=pi_params)
    """
    self.target_file = target_file
    self.decoy_file = decoy_file
    self.combined_files = combined_files
    self.directory = directory
    # Define indices based on input
    self.psms = None
    self.search_id = None
    self.digest = digest
    self.initial_protein_peptide_map = copy.copy(self.digest.protein_to_peptide_dictionary)
    self.append_alt_from_db = append_alt_from_db
    self.parameter_file_object = parameter_file_object
    self._validate_input()
read_psms(self)
Reads PSMs from the input files and transforms them into a list of Psm objects.
This method sets the psms variable, which is a list of Psm objects.
This method must be run before initializing a DataStore object.
Examples:
>>> reader = pyproteininference.reader.PercolatorReader(target_file="example_target.txt",
...     decoy_file="example_decoy.txt",
...     digest=digest, parameter_file_object=pi_params)
>>> reader.read_psms()
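As a rough illustration of the file-reading step, the sketch below parses a small tab-delimited table and drops its header row, which is what read_psms does for each input file. The column names and the helper `read_tsv_rows` are assumptions made for this example and are not part of pyproteininference.

```python
import csv
import io


def read_tsv_rows(handle):
    """Return the data rows of a tab-delimited file, skipping the header row."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # discard the header row if there is one
    return [row for row in reader]


# Hypothetical PSM table; the header names are illustrative only.
example = io.StringIO(
    "PSMId\tscore\tq-value\tposterior_error_prob\tpeptide\tproteinIds\n"
    "psm_1\t2.5\t0.001\t0.0005\tK.PEPTIDEK.A\tPROT_A\tPROT_B\n"
    "psm_2\t1.1\t0.010\t0.0200\tR.SAMPLER.G\tPROT_C\n"
)
rows = read_tsv_rows(example)
print(rows[0][5:])  # every trailing column of a row holds a protein identifier
```

Rows may be ragged: protein identifiers beyond the first spill into trailing columns, which is why the reader slices from PROTEINIDS_INDEX to the end of each row.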
Source code in pyproteininference/reader.py
def read_psms(self):
"""
Method to read PSMs from the input files and transform them into a list of
[Psm][pyproteininference.physical.Psm] objects.
This method sets the `psms` variable, which is a list of Psm objects.
This method must be run before initializing a [DataStore object][pyproteininference.datastore.DataStore].
Example:
>>> reader = pyproteininference.reader.PercolatorReader(target_file="example_target.txt",
...     decoy_file="example_decoy.txt",
...     digest=digest, parameter_file_object=pi_params)
>>> reader.read_psms()
"""
# Read in and split by line
if self.target_file and self.decoy_file:
# If target_file is a list... read them all in and concatenate...
if isinstance(self.target_file, (list,)):
all_target = []
for t_files in self.target_file:
logger.info(t_files)
ptarg = []
with open(t_files, "r") as perc_target_file:
spamreader = csv.reader(perc_target_file, delimiter="\t")
for row in spamreader:
ptarg.append(row)
del ptarg[0]
all_target = all_target + ptarg
elif self.target_file:
# If not just read the file...
ptarg = []
with open(self.target_file, "r") as perc_target_file:
spamreader = csv.reader(perc_target_file, delimiter="\t")
for row in spamreader:
ptarg.append(row)
del ptarg[0]
all_target = ptarg
# Repeat for decoy file
if isinstance(self.decoy_file, (list,)):
all_decoy = []
for d_files in self.decoy_file:
logger.info(d_files)
pdec = []
with open(d_files, "r") as perc_decoy_file:
spamreader = csv.reader(perc_decoy_file, delimiter="\t")
for row in spamreader:
pdec.append(row)
del pdec[0]
all_decoy = all_decoy + pdec
elif self.decoy_file:
pdec = []
with open(self.decoy_file, "r") as perc_decoy_file:
spamreader = csv.reader(perc_decoy_file, delimiter="\t")
for row in spamreader:
pdec.append(row)
del pdec[0]
all_decoy = pdec
# Combine the lists
perc_all = all_target + all_decoy
elif self.combined_files:
if isinstance(self.combined_files, (list,)):
all = []
for f in self.combined_files:
logger.info(f)
combined_psm_result_rows = []
with open(f, "r") as perc_files:
spamreader = csv.reader(perc_files, delimiter="\t")
for row in spamreader:
combined_psm_result_rows.append(row)
del combined_psm_result_rows[0]
all = all + combined_psm_result_rows
elif self.combined_files:
# If not just read the file...
combined_psm_result_rows = []
with open(self.combined_files, "r") as perc_files:
spamreader = csv.reader(perc_files, delimiter="\t")
for row in spamreader:
combined_psm_result_rows.append(row)
del combined_psm_result_rows[0]
all = combined_psm_result_rows
perc_all = all
elif self.directory:
all_files = os.listdir(self.directory)
all = []
for files in all_files:
logger.info(files)
combined_psm_result_rows = []
with open(os.path.join(self.directory, files), "r") as perc_file:
spamreader = csv.reader(perc_file, delimiter="\t")
for row in spamreader:
combined_psm_result_rows.append(row)
del combined_psm_result_rows[0]
all = all + combined_psm_result_rows
perc_all = all
peptide_to_protein_dictionary = self.digest.peptide_to_protein_dictionary
perc_all_filtered = []
for psms in perc_all:
try:
float(psms[self.POSTERIOR_ERROR_PROB_INDEX])
perc_all_filtered.append(psms)
except ValueError as e: # noqa F841
pass
# Filter by pep
perc_all = sorted(
perc_all_filtered,
key=lambda x: float(x[self.POSTERIOR_ERROR_PROB_INDEX]),
reverse=False,
)
list_of_psm_objects = []
peptide_tracker = set()
all_sp_proteins = set(self.digest.swiss_prot_protein_set)
# We only want to get unique peptides... using all messes up scoring...
# Create Psm objects with the identifier, percscore, qvalue, pepvalue, and possible proteins...
initial_poss_prots = []
logger.info("Length of PSM Data: {}".format(len(perc_all)))
for psm_info in perc_all:
current_peptide = psm_info[self.PEPTIDE_INDEX]
# Define the Psm...
if current_peptide not in peptide_tracker:
combined_psm_result_rows = Psm(identifier=current_peptide)
# Add all the attributes
combined_psm_result_rows.percscore = float(psm_info[self.PERC_SCORE_INDEX])
combined_psm_result_rows.qvalue = float(psm_info[self.Q_VALUE_INDEX])
combined_psm_result_rows.pepvalue = float(psm_info[self.POSTERIOR_ERROR_PROB_INDEX])
if self.parameter_file_object.inference_type == Inference.FIRST_PROTEIN:
poss_proteins = [psm_info[self.PROTEINIDS_INDEX]]
else:
poss_proteins = sorted(list(set(psm_info[self.PROTEINIDS_INDEX :]))) # noqa E203
poss_proteins = poss_proteins[: self.MAX_ALLOWED_ALTERNATIVE_PROTEINS]
combined_psm_result_rows.possible_proteins = poss_proteins # Restrict to 50 total possible proteins...
combined_psm_result_rows.psm_id = psm_info[self.PSMID_INDEX]
input_poss_prots = copy.copy(poss_proteins)
# Split peptide if flanking
current_peptide = Psm.split_peptide(peptide_string=current_peptide)
if not current_peptide.isupper() or not current_peptide.isalpha():
# If we have mods remove them...
peptide_string = current_peptide.upper()
stripped_peptide = Psm.remove_peptide_mods(peptide_string)
current_peptide = stripped_peptide
# Add the other possible_proteins from insilicodigest here...
try:
current_alt_proteins = sorted(
list(peptide_to_protein_dictionary[current_peptide])
) # This peptide needs to be scrubbed of Mods...
except KeyError:
current_alt_proteins = []
logger.debug(
"Peptide {} was not found in the supplied DB with the following proteins {}".format(
current_peptide,
";".join(combined_psm_result_rows.possible_proteins),
)
)
for poss_prot in combined_psm_result_rows.possible_proteins:
self.digest.peptide_to_protein_dictionary.setdefault(current_peptide, set()).add(poss_prot)
self.digest.protein_to_peptide_dictionary.setdefault(poss_prot, set()).add(current_peptide)
logger.debug(
"Adding Peptide {} and Protein {} to Digest dictionaries".format(current_peptide, poss_prot)
)
# Sort Alt Proteins by Swissprot then Trembl...
identifiers_sorted = DataStore.sort_protein_strings(
protein_string_list=current_alt_proteins,
sp_proteins=all_sp_proteins,
decoy_symbol=self.parameter_file_object.decoy_symbol,
)
# Restrict to 50 possible proteins
combined_psm_result_rows = self._fix_alternative_proteins(
append_alt_from_db=self.append_alt_from_db,
identifiers_sorted=identifiers_sorted,
max_proteins=self.MAX_ALLOWED_ALTERNATIVE_PROTEINS,
psm=combined_psm_result_rows,
parameter_file_object=self.parameter_file_object,
)
# Remove blank alt proteins
combined_psm_result_rows.possible_proteins = [
x for x in combined_psm_result_rows.possible_proteins if x != ""
]
list_of_psm_objects.append(combined_psm_result_rows)
peptide_tracker.add(current_peptide)
initial_poss_prots.append(input_poss_prots)
self.psms = list_of_psm_objects
self._check_initial_database_overlap(
initial_possible_proteins=initial_poss_prots, initial_protein_peptide_map=self.initial_protein_peptide_map
)
logger.info("Length of PSM Data: {}".format(len(self.psms)))
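The filtering and de-duplication performed by read_psms can be sketched as follows: rows whose posterior error probability (PEP) is not numeric are dropped, the remainder is sorted by ascending PEP, and only the first (best) row per peptide is kept. The function name and the column positions are hypothetical, and the sketch tracks the raw peptide string only, which is a simplification of the listing above.

```python
def best_row_per_peptide(rows, pep_index, peptide_index):
    """Keep the lowest-PEP row for each peptide string."""
    numeric = []
    for row in rows:
        try:
            float(row[pep_index])
        except ValueError:
            continue  # skip rows with a non-numeric PEP value
        numeric.append(row)

    # A lower posterior error probability is better, so sort ascending.
    numeric.sort(key=lambda r: float(r[pep_index]))

    seen = set()
    kept = []
    for row in numeric:
        peptide = row[peptide_index]
        if peptide not in seen:
            seen.add(peptide)
            kept.append(row)
    return kept


example_rows = [
    ["psm_a", "0.5", "PEPTIDEK"],
    ["psm_b", "0.1", "PEPTIDEK"],
    ["psm_c", "n/a", "SAMPLER"],
    ["psm_d", "0.3", "ANOTHERK"],
]
kept_rows = best_row_per_peptide(example_rows, pep_index=1, peptide_index=2)
print([row[0] for row in kept_rows])
```

Sorting before de-duplicating means the first occurrence of a peptide is always its best-scoring PSM, so a single pass with a set is enough.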
ProteologicPostSearchReader (Reader)
This class reads PSM data from Proteologic post-search (post-processing) logical objects.
Attributes:
Name | Type | Description |
---|---|---|
proteologic_object | list | List of proteologic post search objects. |
search_id | int | Search ID or Search IDs associated with the data. |
postsearch_id | int | PostSearch ID or PostSearch IDs associated with the data. |
digest | Digest | Digest object. |
parameter_file_object | ProteinInferenceParameter | ProteinInferenceParameter object. |
append_alt_from_db | bool | Whether or not to append alternative proteins found in the database that are not in the input files. |
Source code in pyproteininference/reader.py
class ProteologicPostSearchReader(Reader):
"""
This class reads PSM data from Proteologic post-search (post-processing) logical objects.
Attributes:
proteologic_object (list): List of proteologic post search objects.
search_id (int): Search ID or Search IDs associated with the data.
postsearch_id (int): PostSearch ID or PostSearch IDs associated with the data.
digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
parameter_file_object (ProteinInferenceParameter):
[ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter] object.
append_alt_from_db (bool): Whether or not to append alternative proteins found in the database
that are not in the input files.
"""
def __init__(
self,
proteologic_object,
search_id,
postsearch_id,
digest,
parameter_file_object,
append_alt_from_db=True,
):
"""
Args:
proteologic_object (list): List of proteologic post search objects.
search_id (int): Search ID or Search IDs associated with the data.
postsearch_id: PostSearch ID or PostSearch IDs associated with the data.
digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
parameter_file_object (ProteinInferenceParameter):
[ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter] object.
append_alt_from_db (bool): Whether or not to append alternative proteins found in the database
that are not in the input files.
Returns:
object:
"""
self.proteologic_object = proteologic_object
self.search_id = search_id
self.postsearch_id = postsearch_id
self.psms = None
self.digest = digest
self.initial_protein_peptide_map = copy.copy(self.digest.protein_to_peptide_dictionary)
self.append_alt_from_db = append_alt_from_db
self.parameter_file_object = parameter_file_object
def read_psms(self):
"""
Method to read PSMs from the input files and transform them into a list of
[Psm][pyproteininference.physical.Psm] objects.
This method sets the `psms` variable, which is a list of Psm objects.
This method must be run before initializing a [DataStore object][pyproteininference.datastore.DataStore].
"""
logger.info("Reading in data from Proteologic...")
if isinstance(self.proteologic_object, (list,)):
list_of_psms = []
for p_objs in self.proteologic_object:
for psms in p_objs.physical_object.psm_sets:
list_of_psms.append(psms)
else:
list_of_psms = self.proteologic_object.physical_object.psm_sets
# Sort this by posterior error prob...
list_of_psms = sorted(list_of_psms, key=lambda x: float(x.psm_filter.pepvalue))
peptide_to_protein_dictionary = self.digest.peptide_to_protein_dictionary
list_of_psm_objects = []
peptide_tracker = set()
all_sp_proteins = set(self.digest.swiss_prot_protein_set)
# Peptide tracker is used because we only want UNIQUE peptides...
# The data is sorted by percolator score... or at least it should be...
# Or sorted by posterior error probability
initial_poss_prots = []
for peps in list_of_psms:
current_peptide = peps.peptide.sequence
# Define the Psm...
if current_peptide not in peptide_tracker:
p = Psm(identifier=current_peptide)
# Add all the attributes
p.percscore = float(0) # Will be stored in table in future I think...
p.qvalue = float(peps.psm_filter.q_value)
p.pepvalue = float(peps.psm_filter.pepvalue)
if peps.peptide.protein not in peps.alternative_proteins:
p.possible_proteins = [peps.peptide.protein] + peps.alternative_proteins
else:
p.possible_proteins = peps.alternative_proteins
p.possible_proteins = list(filter(None, p.possible_proteins))
input_poss_prots = copy.copy(p.possible_proteins)
p.psm_id = peps.spectrum.spectrum_identifier
# Split peptide if flanking
current_peptide = Psm.split_peptide(peptide_string=current_peptide)
if not current_peptide.isupper() or not current_peptide.isalpha():
# If we have mods remove them...
peptide_string = current_peptide.upper()
stripped_peptide = Psm.remove_peptide_mods(peptide_string)
current_peptide = stripped_peptide
# Add the other possible_proteins from insilicodigest here...
try:
current_alt_proteins = sorted(
list(peptide_to_protein_dictionary[current_peptide])
) # This peptide needs to be scrubbed of Mods...
except KeyError:
current_alt_proteins = []
logger.debug(
"Peptide {} was not found in the supplied DB with the following proteins {}".format(
current_peptide, ";".join(p.possible_proteins)
)
)
for poss_prot in p.possible_proteins:
self.digest.peptide_to_protein_dictionary.setdefault(current_peptide, set()).add(poss_prot)
self.digest.protein_to_peptide_dictionary.setdefault(poss_prot, set()).add(current_peptide)
logger.debug(
"Adding Peptide {} and Protein {} to Digest dictionaries".format(current_peptide, poss_prot)
)
# Sort Alt Proteins by Swissprot then Trembl...
identifiers_sorted = DataStore.sort_protein_strings(
protein_string_list=current_alt_proteins,
sp_proteins=all_sp_proteins,
decoy_symbol=self.parameter_file_object.decoy_symbol,
)
# Restrict to 50 possible proteins... and append alt proteins from db
p = self._fix_alternative_proteins(
append_alt_from_db=self.append_alt_from_db,
identifiers_sorted=identifiers_sorted,
max_proteins=self.MAX_ALLOWED_ALTERNATIVE_PROTEINS,
psm=p,
parameter_file_object=self.parameter_file_object,
)
list_of_psm_objects.append(p)
peptide_tracker.add(current_peptide)
initial_poss_prots.append(input_poss_prots)
self.psms = list_of_psm_objects
self._check_initial_database_overlap(
initial_possible_proteins=initial_poss_prots, initial_protein_peptide_map=self.initial_protein_peptide_map
)
logger.info("Finished reading in data from Proteologic...")
__init__(self, proteologic_object, search_id, postsearch_id, digest, parameter_file_object, append_alt_from_db=True)
special
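Parameters:
Name | Type | Description |
---|---|---|
proteologic_object | list | List of proteologic post search objects. |
search_id | int | Search ID or Search IDs associated with the data. |
postsearch_id | int | PostSearch ID or PostSearch IDs associated with the data. |
digest | Digest | Digest object. |
parameter_file_object | ProteinInferenceParameter | ProteinInferenceParameter object. |
append_alt_from_db | bool | Whether or not to append alternative proteins found in the database that are not in the input files. |
Returns:
Type | Description |
---|---|
object | |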
Source code in pyproteininference/reader.py
def __init__(
    self,
    proteologic_object,
    search_id,
    postsearch_id,
    digest,
    parameter_file_object,
    append_alt_from_db=True,
):
    """
    Args:
        proteologic_object (list): List of proteologic post search objects.
        search_id (int): Search ID or Search IDs associated with the data.
        postsearch_id: PostSearch ID or PostSearch IDs associated with the data.
        digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
        parameter_file_object (ProteinInferenceParameter):
            [ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter] object.
        append_alt_from_db (bool): Whether or not to append alternative proteins found in the database
            that are not in the input files.

    Returns:
        object:
    """
    self.proteologic_object = proteologic_object
    self.search_id = search_id
    self.postsearch_id = postsearch_id
    self.psms = None
    self.digest = digest
    self.initial_protein_peptide_map = copy.copy(self.digest.protein_to_peptide_dictionary)
    self.append_alt_from_db = append_alt_from_db
    self.parameter_file_object = parameter_file_object
read_psms(self)
Reads PSMs from the input objects and transforms them into a list of Psm objects.
This method sets the psms variable, which is a list of Psm objects.
This method must be run before initializing a DataStore object.
Source code in pyproteininference/reader.py
def read_psms(self):
"""
Method to read PSMs from the input files and transform them into a list of
[Psm][pyproteininference.physical.Psm] objects.
This method sets the `psms` variable, which is a list of Psm objects.
This method must be run before initializing a [DataStore object][pyproteininference.datastore.DataStore].
"""
logger.info("Reading in data from Proteologic...")
if isinstance(self.proteologic_object, (list,)):
list_of_psms = []
for p_objs in self.proteologic_object:
for psms in p_objs.physical_object.psm_sets:
list_of_psms.append(psms)
else:
list_of_psms = self.proteologic_object.physical_object.psm_sets
# Sort this by posterior error prob...
list_of_psms = sorted(list_of_psms, key=lambda x: float(x.psm_filter.pepvalue))
peptide_to_protein_dictionary = self.digest.peptide_to_protein_dictionary
list_of_psm_objects = []
peptide_tracker = set()
all_sp_proteins = set(self.digest.swiss_prot_protein_set)
# Peptide tracker is used because we only want UNIQUE peptides...
# The data is sorted by percolator score... or at least it should be...
# Or sorted by posterior error probability
initial_poss_prots = []
for peps in list_of_psms:
current_peptide = peps.peptide.sequence
# Define the Psm...
if current_peptide not in peptide_tracker:
p = Psm(identifier=current_peptide)
# Add all the attributes
p.percscore = float(0) # Will be stored in table in future I think...
p.qvalue = float(peps.psm_filter.q_value)
p.pepvalue = float(peps.psm_filter.pepvalue)
if peps.peptide.protein not in peps.alternative_proteins:
p.possible_proteins = [peps.peptide.protein] + peps.alternative_proteins
else:
p.possible_proteins = peps.alternative_proteins
p.possible_proteins = list(filter(None, p.possible_proteins))
input_poss_prots = copy.copy(p.possible_proteins)
p.psm_id = peps.spectrum.spectrum_identifier
# Split peptide if flanking
current_peptide = Psm.split_peptide(peptide_string=current_peptide)
if not current_peptide.isupper() or not current_peptide.isalpha():
# If we have mods remove them...
peptide_string = current_peptide.upper()
stripped_peptide = Psm.remove_peptide_mods(peptide_string)
current_peptide = stripped_peptide
# Add the other possible_proteins from insilicodigest here...
try:
current_alt_proteins = sorted(
list(peptide_to_protein_dictionary[current_peptide])
) # This peptide needs to be scrubbed of Mods...
except KeyError:
current_alt_proteins = []
logger.debug(
"Peptide {} was not found in the supplied DB with the following proteins {}".format(
current_peptide, ";".join(p.possible_proteins)
)
)
for poss_prot in p.possible_proteins:
self.digest.peptide_to_protein_dictionary.setdefault(current_peptide, set()).add(poss_prot)
self.digest.protein_to_peptide_dictionary.setdefault(poss_prot, set()).add(current_peptide)
logger.debug(
"Adding Peptide {} and Protein {} to Digest dictionaries".format(current_peptide, poss_prot)
)
# Sort Alt Proteins by Swissprot then Trembl...
identifiers_sorted = DataStore.sort_protein_strings(
protein_string_list=current_alt_proteins,
sp_proteins=all_sp_proteins,
decoy_symbol=self.parameter_file_object.decoy_symbol,
)
# Restrict to 50 possible proteins... and append alt proteins from db
p = self._fix_alternative_proteins(
append_alt_from_db=self.append_alt_from_db,
identifiers_sorted=identifiers_sorted,
max_proteins=self.MAX_ALLOWED_ALTERNATIVE_PROTEINS,
psm=p,
parameter_file_object=self.parameter_file_object,
)
list_of_psm_objects.append(p)
peptide_tracker.add(current_peptide)
initial_poss_prots.append(input_poss_prots)
self.psms = list_of_psm_objects
self._check_initial_database_overlap(
initial_possible_proteins=initial_poss_prots, initial_protein_peptide_map=self.initial_protein_peptide_map
)
logger.info("Finished reading in data from Proteologic...")
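Both readers normalize each peptide before looking it up in the digest: flanking residues are split off, then modifications are removed so that only the bare amino-acid sequence remains. The sketch below shows one way such a normalization could work. The regular expressions and helper names are assumptions for illustration and do not reproduce `Psm.split_peptide` or `Psm.remove_peptide_mods`.

```python
import re

# Hypothetical pattern for flanked peptides such as "K.PEPTIDE.R" or "-.PEPTIDE.K".
FLANKING = re.compile(r"^[A-Z\-]\.(.+)\.[A-Z\-]$")
# Hypothetical pattern for bracketed or parenthesized modifications.
MODIFICATION = re.compile(r"\[[^\]]*\]|\([^)]*\)")


def strip_flanking(peptide):
    """Return the core sequence if the peptide carries flanking residues."""
    match = FLANKING.match(peptide)
    return match.group(1) if match else peptide


def strip_mods(peptide):
    """Remove modification annotations and any remaining non-letter characters."""
    without_mods = MODIFICATION.sub("", peptide)
    return re.sub(r"[^A-Z]", "", without_mods.upper())


def normalize_peptide(peptide):
    core = strip_flanking(peptide)
    # Mirror the reader's check: only clean up when the string is not plain uppercase letters.
    if not core.isupper() or not core.isalpha():
        core = strip_mods(core)
    return core


print(normalize_peptide("K.PEPT[+15.995]IDE.R"))
```

Normalizing before the dictionary lookup matters because the in silico digest stores unmodified sequences, so a modified or flanked string would otherwise miss its entry.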
Reader
Main Reader Class which is parent to all reader subclasses.
Attributes:
Name | Type | Description |
---|---|---|
target_file | str/list | Path to Target PSM result files. |
decoy_file | str/list | Path to Decoy PSM result files. |
combined_files | str/list | Path to Combined PSM result files. |
directory | str | Path to directory containing combined PSM result files. |
Source code in pyproteininference/reader.py
class Reader(object):
"""
Main Reader Class which is parent to all reader subclasses.
Attributes:
target_file (str/list): Path to Target PSM result files.
decoy_file (str/list): Path to Decoy PSM result files.
combined_files (str/list): Path to Combined PSM result files.
directory (str): Path to directory containing combined PSM result files.
"""
MAX_ALLOWED_ALTERNATIVE_PROTEINS = 50
def __init__(self, target_file=None, decoy_file=None, combined_files=None, directory=None):
"""
Args:
target_file (str/list): Path to Target PSM result files.
decoy_file (str/list): Path to Decoy PSM result files.
combined_files (str/list): Path to Combined PSM result files.
directory (str): Path to directory containing combined PSM result files.
"""
self.target_file = target_file
self.decoy_file = decoy_file
self.combined_files = combined_files
self.directory = directory
def get_alternative_proteins_from_input(self, row):
"""
Method to get the alternative proteins from the input files.
"""
if None in row.keys():
try:
row["alternative_proteins"] = row.pop(None)
# Sort the alternative proteins - when they are read in they become unsorted
row["alternative_proteins"] = sorted(row["alternative_proteins"])
except KeyError:
row["alternative_proteins"] = []
else:
row["alternative_proteins"] = []
return row
def _validate_input(self):
"""
Internal method to validate the input to Reader.
"""
if self.target_file and self.decoy_file and not self.combined_files and not self.directory:
logger.info("Validating input as target_file and decoy_file")
elif self.combined_files and not self.target_file and not self.decoy_file and not self.directory:
logger.info("Validating input as combined_files")
elif self.directory and not self.combined_files and not self.decoy_file and not self.target_file:
logger.info("Validating input as combined_directory")
else:
raise ValueError(
"To run Protein inference please supply one of: "
"(1) one or multiple target_files and decoy_files, "
"(2) one or multiple combined_files that include target and decoy data, or "
"(3) a combined_directory that contains combined target/decoy files (combined_directory)"
)
@classmethod
def _fix_alternative_proteins(
cls,
append_alt_from_db,
identifiers_sorted,
max_proteins,
psm,
parameter_file_object,
):
"""
Internal method to fix the alternative proteins variable for a given
[Psm][pyproteininference.physical.Psm] object.
Args:
append_alt_from_db (bool): Whether or not to append alternative proteins found in the database that are
not in the input files.
identifiers_sorted (list): List of sorted Protein Strings for the given Psm.
max_proteins (int): Maximum number of proteins that a [Psm][pyproteininference.physical.Psm]
is allowed to map to.
psm: (Psm): [Psm][pyproteininference.physical.Psm] object of interest.
parameter_file_object: (ProteinInferenceParameter):
[ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter].
Returns:
pyproteininference.physical.Psm: [Psm][pyproteininference.physical.Psm] with alternative proteins fixed.
"""
# If we are appending alternative proteins from the db
if append_alt_from_db:
# Loop over the Identifiers from the DB These are identifiers that contain the current peptide
for alt_proteins in identifiers_sorted[:max_proteins]:
# If the identifier is not already in possible proteins
# and if then len of poss prot is less than the max...
# Then append
if alt_proteins not in psm.possible_proteins and len(psm.possible_proteins) < max_proteins:
psm.possible_proteins.append(alt_proteins)
# Next if the len of possible proteins is greater than max then restrict the list length...
if len(psm.possible_proteins) > max_proteins:
psm.possible_proteins = [psm.possible_proteins[x] for x in range(max_proteins)]
else:
pass
# If no inference only select first poss protein
if parameter_file_object.inference_type == Inference.FIRST_PROTEIN:
psm.possible_proteins = [psm.possible_proteins[0]]
return psm
def _check_initial_database_overlap(self, initial_possible_proteins, initial_protein_peptide_map):
"""
Internal method that checks to make sure there is at least some overlap between proteins in the input files
And the proteins in the database digestion.
"""
if len(initial_protein_peptide_map.keys()) > 0:
input_protein_ids_flat = set([protein for group in initial_possible_proteins for protein in group])
digest_proteins = set(initial_protein_peptide_map.keys())
intersection = input_protein_ids_flat.intersection(digest_proteins)
if len(intersection) < 1:
raise ValueError(
"The Intersection of Protein Identifiers between the database digest "
"and the input files is zero. Please consider setting id_splitting to True. "
"Or make sure that the identifiers in the input files and database file match. "
"Example Protein Identifier from input file '{}'. "
"Example Protein Identifier from database file '{}'".format(
list(input_protein_ids_flat)[0], list(digest_proteins)[0]
)
)
else:
logger.info("Number of matching proteins from database and input files: {}".format(len(intersection)))
logger.info("Number of proteins from database file: {}".format(len(digest_proteins)))
logger.info("Number of proteins from input files: {}".format(len(input_protein_ids_flat)))
else:
pass
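The capping logic in `_fix_alternative_proteins` can be illustrated with plain lists. The sketch below appends database-derived identifiers that are not already present until a maximum is reached, then truncates. The function name `merge_with_cap` is hypothetical, and unlike the method above it returns a new list instead of modifying a Psm object.

```python
def merge_with_cap(possible_proteins, identifiers_from_db, max_proteins=50):
    """Append unseen database identifiers without exceeding max_proteins."""
    merged = list(possible_proteins)  # copy so the input list is left untouched
    for protein in identifiers_from_db[:max_proteins]:
        if protein not in merged and len(merged) < max_proteins:
            merged.append(protein)
    # Truncate in case the input list was already longer than the cap.
    return merged[:max_proteins]


capped = merge_with_cap(["P1", "P2"], ["P2", "P3", "P4"], max_proteins=3)
print(capped)
```

Identifiers already reported by the search engine keep their position at the front of the list, so under first-protein inference the originally reported protein is still the one selected.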
__init__(self, target_file=None, decoy_file=None, combined_files=None, directory=None)
special
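Parameters:
Name | Type | Description |
---|---|---|
target_file | str/list | Path to Target PSM result files. |
decoy_file | str/list | Path to Decoy PSM result files. |
combined_files | str/list | Path to Combined PSM result files. |
directory | str | Path to directory containing combined PSM result files. |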
Source code in pyproteininference/reader.py
def __init__(self, target_file=None, decoy_file=None, combined_files=None, directory=None):
    """
    Args:
        target_file (str/list): Path to Target PSM result files.
        decoy_file (str/list): Path to Decoy PSM result files.
        combined_files (str/list): Path to Combined PSM result files.
        directory (str): Path to directory containing combined PSM result files.
    """
    self.target_file = target_file
    self.decoy_file = decoy_file
    self.combined_files = combined_files
    self.directory = directory
get_alternative_proteins_from_input(self, row)
Method to get the alternative proteins from the input files.
Source code in pyproteininference/reader.py
def get_alternative_proteins_from_input(self, row):
    """
    Method to get the alternative proteins from the input files.
    """
    if None in row.keys():
        try:
            row["alternative_proteins"] = row.pop(None)
            # Sort the alternative proteins - when they are read in they become unsorted
            row["alternative_proteins"] = sorted(row["alternative_proteins"])
        except KeyError:
            row["alternative_proteins"] = []
    else:
        row["alternative_proteins"] = []
    return row
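The `None` key handled by get_alternative_proteins_from_input is what Python's `csv.DictReader` produces by default when a row has more fields than the header: the surplus values are collected in a list under the `restkey`, which defaults to `None`. The sketch below demonstrates that behavior with a standalone function. The column names and the function `collect_alternative_proteins` are assumptions for the example.

```python
import csv
import io


def collect_alternative_proteins(row):
    """Move surplus columns (stored under the None key) into a sorted list."""
    extra = row.pop(None, [])  # absent when the row has no surplus fields
    row["alternative_proteins"] = sorted(extra)
    return row


# Hypothetical input: the first data row has two more fields than the header.
data = io.StringIO(
    "peptide\tprotein\n"
    "PEPTIDEK\tPROT_B\tPROT_C\tPROT_A\n"
    "SAMPLER\tPROT_D\n"
)
parsed = [collect_alternative_proteins(r) for r in csv.DictReader(data, delimiter="\t")]
print(parsed[0]["alternative_proteins"])
```

Sorting the surplus identifiers gives each row a deterministic protein order regardless of how the search engine listed them.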
scoring
Score
Score class that provides a variety of scoring methods for the Psm objects contained in Protein objects.
Each method loops over the Protein objects and creates a protein "score" variable from the Psm object scores.
Methods score all proteins from scoring_input of the DataStore object.
The PSM score that is used is determined by create_scoring_input.
Each scoring method sets the following attributes on the DataStore object:
1. score_method: the full name of the score method.
2. short_score_method: the short name of the score method.
3. scored_proteins: a list of Protein objects that have been scored.
Attributes:
Name | Type | Description |
---|---|---|
pre_score_data | list | List of Protein objects that contain Psm objects. |
data | DataStore | DataStore object. |
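The general pattern of choosing a scoring method by name and combining PSM scores into one protein score can be sketched as below. This is not the Score class: the dictionary, the function `score_proteins`, and both formulas are illustrative stand-ins that assume PSM scores are probabilities where lower is better.

```python
import math

# Hypothetical stand-ins for two scoring methods, keyed by method name.
ILLUSTRATIVE_SCORE_METHODS = {
    # Negative log10 of the product of the PSM scores, computed as a sum of logs.
    "multiplicative_log": lambda scores: -sum(math.log10(s) for s in scores),
    # The single best (lowest) PSM score for the protein.
    "best_peptide_per_protein": min,
}


def score_proteins(psm_scores_by_protein, score_method="multiplicative_log"):
    """Return one combined score per protein using the named method."""
    if score_method not in ILLUSTRATIVE_SCORE_METHODS:
        raise ValueError(
            "score_method '{}' is not one of {}".format(
                score_method, ", ".join(sorted(ILLUSTRATIVE_SCORE_METHODS))
            )
        )
    scorer = ILLUSTRATIVE_SCORE_METHODS[score_method]
    return {protein: scorer(scores) for protein, scores in psm_scores_by_protein.items()}


example_scores = {"PROT_A": [0.1, 0.01], "PROT_B": [0.5]}
log_scores = score_proteins(example_scores)
best_scores = score_proteins(example_scores, score_method="best_peptide_per_protein")
print(log_scores, best_scores)
```

Validating the method name up front, as the real class does against `SCORE_METHODS`, turns a typo in a parameter file into an immediate error instead of a silent fallback.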
Source code in pyproteininference/scoring.py
class Score(object):
"""
Score class that contains methods to do a variety of scoring methods on the
[Psm][pyproteininference.physical.Psm] objects
contained inside of [Protein][pyproteininference.physical.Protein] objects.
Methods in the class loop over each Protein object and creates a protein "score" variable using the Psm object
scores.
Methods score all proteins from `scoring_input` from [DataStore object][pyproteininference.datastore.DataStore].
The PSM score that is used is determined from
[create_scoring_input][pyproteininference.datastore.DataStore.create_scoring_input].
Each scoring method will set the following attributes for
the [DataStore object][pyproteininference.datastore.DataStore].
1. `score_method`; This is the full name of the score method.
2. `short_score_method`; This is the short name of the score method.
3. `scored_proteins`; This is a list of [Protein][pyproteininference.physical.Protein] objects
that have been scored.
Attributes:
pre_score_data (list): This is a list of [Protein][pyproteininference.physical.Protein] objects
that contain [Psm][pyproteininference.physical.Psm] objects.
data (DataStore): [DataStore][pyproteininference.datastore.DataStore] object.
"""
BEST_PEPTIDE_PER_PROTEIN = "best_peptide_per_protein"
ITERATIVE_DOWNWEIGHTED_LOG = "iterative_downweighted_log"
MULTIPLICATIVE_LOG = "multiplicative_log"
DOWNWEIGHTED_MULTIPLICATIVE_LOG = "downweighted_multiplicative_log"
DOWNWEIGHTED_VERSION2 = "downweighted_version2"
TOP_TWO_COMBINED = "top_two_combined"
GEOMETRIC_MEAN = "geometric_mean"
ADDITIVE = "additive"
SCORE_METHODS = [
BEST_PEPTIDE_PER_PROTEIN,
ITERATIVE_DOWNWEIGHTED_LOG,
MULTIPLICATIVE_LOG,
DOWNWEIGHTED_MULTIPLICATIVE_LOG,
DOWNWEIGHTED_VERSION2,
TOP_TWO_COMBINED,
GEOMETRIC_MEAN,
ADDITIVE,
]
SHORT_BEST_PEPTIDE_PER_PROTEIN = "bppp"
SHORT_ITERATIVE_DOWNWEIGHTED_LOG = "idwl"
SHORT_MULTIPLICATIVE_LOG = "ml"
SHORT_DOWNWEIGHTED_MULTIPLICATIVE_LOG = "dwml"
SHORT_DOWNWEIGHTED_VERSION2 = "dw2"
SHORT_TOP_TWO_COMBINED = "ttc"
SHORT_GEOMETRIC_MEAN = "gm"
SHORT_ADDITIVE = "add"
SHORT_SCORE_METHODS = [
SHORT_BEST_PEPTIDE_PER_PROTEIN,
SHORT_ITERATIVE_DOWNWEIGHTED_LOG,
SHORT_MULTIPLICATIVE_LOG,
SHORT_DOWNWEIGHTED_MULTIPLICATIVE_LOG,
SHORT_DOWNWEIGHTED_VERSION2,
SHORT_TOP_TWO_COMBINED,
SHORT_GEOMETRIC_MEAN,
SHORT_ADDITIVE,
]
MULTIPLICATIVE_SCORE_TYPE = "multiplicative"
ADDITIVE_SCORE_TYPE = "additive"
SCORE_TYPES = [MULTIPLICATIVE_SCORE_TYPE, ADDITIVE_SCORE_TYPE]
def __init__(self, data):
"""
Initialization method for the Score class.
Args:
data (DataStore): [DataStore][pyproteininference.datastore.DataStore] object.
Raises:
ValueError: If the variable `scoring_input` for the [DataStore][pyproteininference.datastore.DataStore]
object is Empty "[]" or does not exist "None".
Examples:
>>> score = pyproteininference.scoring.Score(data=data)
"""
if data.scoring_input:
self.pre_score_data = data.scoring_input
else:
raise ValueError(
"scoring input not found in data object - Please run 'create_scoring_input' method from "
"DataStore to run any scoring type"
)
self.data = data
def score_psms(self, score_method="multiplicative_log"):
"""
This method dispatches to the actual scoring method given a string input that is defined in the
[ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter] object.
Args:
score_method (str): This is a string that represents which scoring method to call.
Raises:
ValueError: Will Error out if the score_method is not present in the constant `SCORE_METHODS`.
Examples:
>>> score = pyproteininference.scoring.Score(data=data)
>>> score.score_psms(score_method="best_peptide_per_protein")
"""
if score_method not in self.SCORE_METHODS:
raise ValueError(
"score method '{}' is not a proper method. Score method must be one of the following: '{}'".format(
score_method, ", ".join(self.SCORE_METHODS)
)
)
else:
if score_method == self.BEST_PEPTIDE_PER_PROTEIN:
self.best_peptide_per_protein()
if score_method == self.ITERATIVE_DOWNWEIGHTED_LOG:
self.iterative_down_weighted_log()
if score_method == self.MULTIPLICATIVE_LOG:
self.multiplicative_log()
if score_method == self.DOWNWEIGHTED_MULTIPLICATIVE_LOG:
self.down_weighted_multiplicative_log()
if score_method == self.DOWNWEIGHTED_VERSION2:
self.down_weighted_v2()
if score_method == self.TOP_TWO_COMBINED:
self.top_two_combied()
if score_method == self.GEOMETRIC_MEAN:
self.geometric_mean_log()
if score_method == self.ADDITIVE:
self.additive()
def best_peptide_per_protein(self):
"""
This method uses a best peptide per protein scoring scheme.
The top scoring Psm for each protein is selected as the overall Protein object score.
Examples:
>>> score = pyproteininference.scoring.Score(data=data)
>>> score.best_peptide_per_protein()
"""
all_scores = []
logger.info("Scoring Proteins with BPPP")
for protein in self.pre_score_data:
val_list = protein.get_psm_scores()
score = min([float(x) for x in val_list])
protein.score = score
all_scores.append(protein)
# Here do ascending sorting because a lower pep or q value is better
all_scores = sorted(all_scores, key=lambda k: k.score, reverse=False)
self.data.protein_score = self.BEST_PEPTIDE_PER_PROTEIN
self.data.short_protein_score = self.SHORT_BEST_PEPTIDE_PER_PROTEIN
self.data.scored_proteins = all_scores
def fishers_method(self):
"""
This method uses a Fisher's method scoring scheme.
Examples:
>>> score = pyproteininference.scoring.Score(data=data)
>>> score.fishers_method()
"""
all_scores = []
logger.info("Scoring Proteins with fishers method")
for protein in self.pre_score_data:
val_list = protein.get_psm_scores()
score = -2 * sum([math.log(x) for x in val_list])
protein.score = score
all_scores.append(protein)
# Here reverse the sorting to descending because a higher score is better
all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)
self.data.protein_score = "fishers_method"
self.data.short_protein_score = "fm"
self.data.scored_proteins = all_scores
def multiplicative_log(self):
"""
This method uses a Multiplicative Log scoring scheme.
The selected Psm score from all the peptides per protein are multiplied together and we take -Log(X)
of the multiplied Peptide scores.
Examples:
>>> score = pyproteininference.scoring.Score(data=data)
>>> score.multiplicative_log()
"""
# Instead of making all_scores a list... make it a generator??
all_scores = []
logger.info("Scoring Proteins with Multiplicative Log Method")
for protein in self.pre_score_data:
# We create a generator of val_list...
val_list = protein.get_psm_scores()
combine = reduce(lambda x, y: x * y, val_list)
if combine == 0:
combine = sys.float_info.min
score = -math.log(combine)
protein.score = score
all_scores.append(protein)
# Higher score is better as a smaller q or pep in a -log will give a larger value
all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)
self.data.protein_score = self.MULTIPLICATIVE_LOG
self.data.short_protein_score = self.SHORT_MULTIPLICATIVE_LOG
self.data.scored_proteins = all_scores
def down_weighted_multiplicative_log(self):
"""
This method uses a Downweighted Multiplicative Log scoring scheme.
The selected PSM scores from all the peptides per protein are multiplied together and
then this number is divided by the mean of all PSM scores raised to the number of peptides for that protein,
then we take -Log(X) of the resulting value.
Examples:
>>> score = pyproteininference.scoring.Score(data=data)
>>> score.down_weighted_multiplicative_log()
"""
score_list = []
for proteins in self.pre_score_data:
cur_scores = proteins.get_psm_scores()
for scores in cur_scores:
score_list.append(scores)
score_mean = numpy.mean(score_list)
all_scores = []
logger.info("Scoring Proteins with DWML method")
for protein in self.pre_score_data:
val_list = protein.get_psm_scores()
# Divide by the score mean raised to the length of the number of unique peptides for the protein
# This is an attempt to normalize for number of peptides per protein
combine = reduce(lambda x, y: x * y, val_list)
if combine == 0:
combine = sys.float_info.min
score = -math.log(combine / (score_mean ** len(val_list)))
protein.score = score
all_scores.append(protein)
# Higher score is better as a smaller q or pep in a -log will give a larger value
all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)
self.data.protein_score = self.DOWNWEIGHTED_MULTIPLICATIVE_LOG
self.data.short_protein_score = self.SHORT_DOWNWEIGHTED_MULTIPLICATIVE_LOG
self.data.scored_proteins = all_scores
def top_two_combied(self):
"""
This method uses a Top Two scoring scheme.
The top two scores for each protein are multiplied together and we take -Log(X) of the multiplied value.
If a protein only has 1 score/peptide, then we only do -Log(X) of the 1 peptide score.
Examples:
>>> score = pyproteininference.scoring.Score(data=data)
>>> score.top_two_combied()
"""
all_scores = []
logger.info("Scoring Proteins with Top Two Method")
for protein in self.pre_score_data:
val_list = protein.get_psm_scores()
try:
# Try to combine the top two scores
# Divide by 2 to attempt to normalize the value
score = -math.log((val_list[0] * val_list[1]) / 2)
except IndexError:
# If there is only 1 score/1 peptide then just use the 1 peptide provided
score = -math.log(val_list[0])
protein.score = score
all_scores.append(protein)
# Higher score is better as a smaller q or pep in a -log will give a larger value
all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)
self.data.protein_score = self.TOP_TWO_COMBINED
self.data.short_protein_score = self.SHORT_TOP_TWO_COMBINED
self.data.scored_proteins = all_scores
def down_weighted_v2(self):
"""
This method uses a Downweighted Multiplicative Log scoring scheme.
Each peptide is iteratively downweighted by raising the peptide QValue or PepValue to the
following power (1/(1+index_number)).
Where index_number is the peptide number per protein.
Each score for a protein provides less and less weight iteratively.
We also take -Log(X) of the final score here.
Examples:
>>> score = pyproteininference.scoring.Score(data=data)
>>> score.down_weighted_v2()
"""
all_scores = []
logger.info("Scoring Proteins with down weighted v2 method")
for protein in self.pre_score_data:
val_list = protein.get_psm_scores()
# Here take each score and raise it to the power of (1/(1+index_number)).
# This downweights each successive score by reducing its weight in a decreasing fashion
# Basically, each score for a protein will provide less and less weight iteratively
val_list = [val_list[x] ** (1 / float(1 + x)) for x in range(len(val_list))]
# val_list = [val_list[x]**(1/float(1+(float(x)/10))) for x in range(len(val_list))]
score = -math.log(reduce(lambda x, y: x * y, val_list))
protein.score = score
all_scores.append(protein)
# Higher score is better as a smaller q or pep in a -log will give a larger value
all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)
self.data.protein_score = self.DOWNWEIGHTED_VERSION2
self.data.short_protein_score = self.SHORT_DOWNWEIGHTED_VERSION2
self.data.scored_proteins = all_scores
def iterative_down_weighted_log(self):
"""
This method uses a Downweighted Multiplicative Log scoring scheme.
Each peptide is iteratively downweighted by multiplying the peptide QValue or PepValue to
the following (1+index_number).
Where index_number is the peptide number per protein.
Each score for a protein provides less and less weight iteratively.
We also take -Log(X) of the final score here.
Examples:
>>> score = pyproteininference.scoring.Score(data=data)
>>> score.iterative_down_weighted_log()
"""
all_scores = []
logger.info("Scoring Proteins with IDWL method")
for protein in self.pre_score_data:
val_list = protein.get_psm_scores()
# Here take each score and multiply it by (1 + its index number).
# This downweights each successive score by reducing its weight in a decreasing fashion
# Basically, each score for a protein will provide less and less weight iteratively
val_list = [val_list[x] * (float(1 + x)) for x in range(len(val_list))]
# val_list = [val_list[x]**(1/float(1+(float(x)/10))) for x in range(len(val_list))]
combine = reduce(lambda x, y: x * y, val_list)
if combine == 0:
combine = sys.float_info.min
score = -math.log(combine)
protein.score = score
all_scores.append(protein)
# Higher score is better as a smaller q or pep in a -log will give a larger value
all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)
self.data.protein_score = self.ITERATIVE_DOWNWEIGHTED_LOG
self.data.short_protein_score = self.SHORT_ITERATIVE_DOWNWEIGHTED_LOG
self.data.scored_proteins = all_scores
def geometric_mean_log(self):
"""
This method uses a Geometric Mean scoring scheme.
We also take -Log(X) of the final score here.
Examples:
>>> score = pyproteininference.scoring.Score(data=data)
>>> score.geometric_mean_log()
"""
all_scores = []
logger.info("Scoring Proteins with GML method")
for protein in self.pre_score_data:
psm_scores = protein.get_psm_scores()
val_list = []
for vals in psm_scores:
val_list.append(float(vals))
combine = reduce(lambda x, y: x * y, val_list)
if combine == 0:
combine = sys.float_info.min
pre_log_score = combine ** (1 / float(len(val_list)))
score = -math.log(pre_log_score)
protein.score = score
all_scores.append(protein)
# Higher score is better as a smaller q or pep in a -log will give a larger value
all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)
self.data.protein_score = self.GEOMETRIC_MEAN
self.data.short_protein_score = self.SHORT_GEOMETRIC_MEAN
self.data.scored_proteins = all_scores
def iterative_down_weighted_v2(self):
"""
The following method is an experimental method essentially used for future development of potential scoring
schemes.
"""
all_scores = []
logger.info("Scoring Proteins with iterative down weighted v2 method")
for protein in self.pre_score_data:
val_list = protein.get_psm_scores()
# Here take each score and raise it to the power of (1/(1+index_number)).
# This downweights each successive score by reducing its weight in a decreasing fashion
# Basically, each score for a protein will provide less and less weight iteratively
val_list = [val_list[x] ** (1 / float(1 + x)) for x in range(len(val_list))]
# val_list = [val_list[x]**(1/float(1+(float(x)/10))) for x in range(len(val_list))]
score = -math.log(reduce(lambda x, y: x * y, val_list))
protein.score = score
all_scores.append(protein)
# Higher score is better as a smaller q or pep in a -log will give a larger value
all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)
self.data.protein_score = "iterative_downweighting2"
self.data.short_protein_score = "idw2"
self.data.scored_proteins = all_scores
def additive(self):
"""
This method uses an additive scoring scheme.
The method can only be used if a larger PSM score is a better PSM score such as the Percolator score.
Examples:
>>> score = pyproteininference.scoring.Score(data=data)
>>> score.additive()
"""
all_scores = []
logger.info("Scoring Proteins with additive method")
for protein in self.pre_score_data:
val_list = protein.get_psm_scores()
# Take the sum of our scores
score = sum(val_list)
protein.score = score
all_scores.append(protein)
# Higher score is better because larger (better) PSM scores are summed
all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)
self.data.protein_score = self.ADDITIVE
self.data.short_protein_score = self.SHORT_ADDITIVE
self.data.scored_proteins = all_scores
__init__(self, data)
special
Initialization method for the Score class.
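Parameters:
Name | Type | Description |
---|---|---|
data | DataStore | DataStore object. |
Exceptions:
Type | Description |
---|---|
ValueError | If the variable scoring_input for the DataStore object is empty ("[]") or does not exist ("None"). |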
Examples:
>>> score = pyproteininference.scoring.Score(data=data)
Source code in pyproteininference/scoring.py
def __init__(self, data):
"""
Initialization method for the Score class.
Args:
data (DataStore): [DataStore][pyproteininference.datastore.DataStore] object.
Raises:
ValueError: If the variable `scoring_input` for the [DataStore][pyproteininference.datastore.DataStore]
object is Empty "[]" or does not exist "None".
Examples:
>>> score = pyproteininference.scoring.Score(data=data)
"""
if data.scoring_input:
self.pre_score_data = data.scoring_input
else:
raise ValueError(
"scoring input not found in data object - Please run 'create_scoring_input' method from "
"DataStore to run any scoring type"
)
self.data = data
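The guard in `__init__` relies on truthiness, so both an empty list and `None` trigger the `ValueError`. A minimal sketch of that check follows, using a `SimpleNamespace` stand-in rather than a real DataStore; the `require_scoring_input` helper is hypothetical and not part of pyproteininference.

```python
from types import SimpleNamespace


def require_scoring_input(data):
    """Return data.scoring_input, or raise if it is empty or None (hypothetical helper)."""
    if data.scoring_input:
        return data.scoring_input
    raise ValueError("scoring input not found in data object")


# Stand-ins for a DataStore; only the scoring_input attribute is modelled.
ready = SimpleNamespace(scoring_input=["protein_a", "protein_b"])
empty = SimpleNamespace(scoring_input=[])
unset = SimpleNamespace(scoring_input=None)

for candidate in (ready, empty, unset):
    try:
        print(len(require_scoring_input(candidate)), "proteins ready for scoring")
    except ValueError as err:
        print("rejected:", err)
```

In a real analysis the fix for the rejected cases is to run `create_scoring_input` on the DataStore before constructing `Score`.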
additive(self)
This method uses an additive scoring scheme: the protein score is the sum of the protein's PSM scores. Use it only when a larger PSM score is a better PSM score, as with the Percolator score.
Examples:
>>> score = pyproteininference.scoring.Score(data=data)
>>> score.additive()
Source code in pyproteininference/scoring.py
def additive(self):
"""
This method uses an additive scoring scheme.
The method can only be used if a larger PSM score is a better PSM score such as the Percolator score.
Examples:
>>> score = pyproteininference.scoring.Score(data=data)
>>> score.additive()
"""
all_scores = []
logger.info("Scoring Proteins with additive method")
for protein in self.pre_score_data:
val_list = protein.get_psm_scores()
# Take the sum of our scores
score = sum(val_list)
protein.score = score
all_scores.append(protein)
# Higher score is better because larger (better) PSM scores are summed
all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)
self.data.protein_score = self.ADDITIVE
self.data.short_protein_score = self.SHORT_ADDITIVE
self.data.scored_proteins = all_scores
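The additive computation can be illustrated on plain lists. This is a sketch only: `additive_score` and the example scores are hypothetical, and plain dictionaries stand in for Protein objects.

```python
def additive_score(psm_scores):
    """Sum PSM scores where a larger value is better (hypothetical helper)."""
    return sum(float(s) for s in psm_scores)


# Made-up Percolator-style scores for two proteins.
proteins = {"P1": [2.5, 1.0, 0.5], "P2": [3.0]}
scored = {name: additive_score(vals) for name, vals in proteins.items()}

# Rank in descending order because a higher summed score is better.
ranked = sorted(scored, key=scored.get, reverse=True)
print(scored, ranked)
```

Because the scores are summed, a protein with many moderate PSMs can outrank a protein with a single strong PSM, as P1 does here.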
best_peptide_per_protein(self)
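This method uses a best peptide per protein scoring scheme. The best Psm score for each protein, taken as the minimum value because a lower PEP or q-value is better, is selected as the overall Protein object score. Proteins are then sorted in ascending order of score.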
Examples:
>>> score = pyproteininference.scoring.Score(data=data)
>>> score.best_peptide_per_protein()
Source code in pyproteininference/scoring.py
def best_peptide_per_protein(self):
"""
This method uses a best peptide per protein scoring scheme.
The top scoring Psm for each protein is selected as the overall Protein object score.
Examples:
>>> score = pyproteininference.scoring.Score(data=data)
>>> score.best_peptide_per_protein()
"""
all_scores = []
logger.info("Scoring Proteins with BPPP")
for protein in self.pre_score_data:
val_list = protein.get_psm_scores()
score = min([float(x) for x in val_list])
protein.score = score
all_scores.append(protein)
# Here do ascending sorting because a lower pep or q value is better
all_scores = sorted(all_scores, key=lambda k: k.score, reverse=False)
self.data.protein_score = self.BEST_PEPTIDE_PER_PROTEIN
self.data.short_protein_score = self.SHORT_BEST_PEPTIDE_PER_PROTEIN
self.data.scored_proteins = all_scores
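As a rough illustration of the scheme above, the following sketch takes the minimum PSM score per protein and ranks in ascending order. The `best_peptide_score` helper and the example values are hypothetical, and dictionaries stand in for Protein objects.

```python
def best_peptide_score(psm_scores):
    """Pick the lowest PSM score, assuming lower (PEP or q-value) is better."""
    return min(float(s) for s in psm_scores)


# Made-up q-values for two proteins.
proteins = {"P1": [0.01, 0.2], "P2": [0.001, 0.5, 0.9]}
scored = {name: best_peptide_score(vals) for name, vals in proteins.items()}

# Ascending sort: the protein with the smallest best score ranks first.
ranked = sorted(scored, key=scored.get)
print(scored, ranked)
```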
down_weighted_multiplicative_log(self)
This method uses a Downweighted Multiplicative Log scoring scheme. The selected PSM scores of all peptides for a protein are multiplied together, the product is divided by the mean of all PSM scores in the scoring input raised to the number of PSM scores for that protein, and the protein score is -Log(X) of the resulting value.
Examples:
>>> score = pyproteininference.scoring.Score(data=data)
>>> score.down_weighted_multiplicative_log()
Source code in pyproteininference/scoring.py
def down_weighted_multiplicative_log(self):
"""
This method uses a Downweighted Multiplicative Log scoring scheme.
The selected PSM scores from all the peptides per protein are multiplied together and
then this number is divided by the mean of all PSM scores raised to the number of peptides for that protein,
then we take -Log(X) of the resulting value.
Examples:
>>> score = pyproteininference.scoring.Score(data=data)
>>> score.down_weighted_multiplicative_log()
"""
score_list = []
for proteins in self.pre_score_data:
cur_scores = proteins.get_psm_scores()
for scores in cur_scores:
score_list.append(scores)
score_mean = numpy.mean(score_list)
all_scores = []
logger.info("Scoring Proteins with DWML method")
for protein in self.pre_score_data:
val_list = protein.get_psm_scores()
# Divide by the score mean raised to the length of the number of unique peptides for the protein
# This is an attempt to normalize for number of peptides per protein
combine = reduce(lambda x, y: x * y, val_list)
if combine == 0:
combine = sys.float_info.min
score = -math.log(combine / (score_mean ** len(val_list)))
protein.score = score
all_scores.append(protein)
# Higher score is better as a smaller q or pep in a -log will give a larger value
all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)
self.data.protein_score = self.DOWNWEIGHTED_MULTIPLICATIVE_LOG
self.data.short_protein_score = self.SHORT_DOWNWEIGHTED_MULTIPLICATIVE_LOG
self.data.scored_proteins = all_scores
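The normalization can be sketched on plain lists as follows. This is an illustration under stated assumptions, not the library's code path: `dwml_scores` is a hypothetical helper, and a dictionary of score lists stands in for the Protein objects.

```python
import math
import sys
from functools import reduce


def dwml_scores(proteins):
    """Score each protein as -ln(product / mean ** n) (hypothetical helper)."""
    # The mean is taken over every PSM score of every protein.
    all_scores = [s for vals in proteins.values() for s in vals]
    mean = sum(all_scores) / len(all_scores)
    result = {}
    for name, vals in proteins.items():
        product = reduce(lambda x, y: x * y, vals)
        if product == 0:
            # Avoid log(0) when the product underflows or a score is exactly zero.
            product = sys.float_info.min
        result[name] = -math.log(product / mean ** len(vals))
    return result


proteins = {"A": [0.1, 0.1], "B": [0.4]}
scores = dwml_scores(proteins)
print(scores)
```

Dividing by the mean raised to the number of scores is the same as summing `-ln(score / mean)` over the scores, so a PSM score above the dataset mean lowers the protein score instead of raising it.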
down_weighted_v2(self)
This method uses a Downweighted Multiplicative Log scoring scheme. Each PSM score (a q-value or PEP value) is downweighted by raising it to the power 1/(1+index_number), where index_number is the zero-based position of the score in the protein's score list, so each successive score contributes less weight.
The protein score is -Log(X) of the product of the downweighted scores.
Examples:
>>> score = pyproteininference.scoring.Score(data=data)
>>> score.down_weighted_v2()
Source code in pyproteininference/scoring.py
def down_weighted_v2(self):
"""
This method uses a Downweighted Multiplicative Log scoring scheme.
Each peptide is iteratively downweighted by raising the peptide QValue or PepValue to the
following power (1/(1+index_number)).
Where index_number is the peptide number per protein.
Each score for a protein provides less and less weight iteratively.
We also take -Log(X) of the final score here.
Examples:
>>> score = pyproteininference.scoring.Score(data=data)
>>> score.down_weighted_v2()
"""
all_scores = []
logger.info("Scoring Proteins with down weighted v2 method")
for protein in self.pre_score_data:
val_list = protein.get_psm_scores()
# Here take each score and raise it to the power of (1/(1+index_number)).
# This downweights each successive score by reducing its weight in a decreasing fashion
# Basically, each score for a protein will provide less and less weight iteratively
val_list = [val_list[x] ** (1 / float(1 + x)) for x in range(len(val_list))]
# val_list = [val_list[x]**(1/float(1+(float(x)/10))) for x in range(len(val_list))]
score = -math.log(reduce(lambda x, y: x * y, val_list))
protein.score = score
all_scores.append(protein)
# Higher score is better as a smaller q or pep in a -log will give a larger value
all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)
self.data.protein_score = self.DOWNWEIGHTED_VERSION2
self.data.short_protein_score = self.SHORT_DOWNWEIGHTED_VERSION2
self.data.scored_proteins = all_scores
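A small sketch of the exponent-based downweighting, assuming the score list is already ordered best first. `down_weighted_v2_score` is a hypothetical stand-alone helper, not a library function.

```python
import math
from functools import reduce


def down_weighted_v2_score(psm_scores):
    """Raise the i-th score to 1 / (1 + i), multiply, then take -ln (hypothetical helper)."""
    weighted = [s ** (1.0 / (1 + i)) for i, s in enumerate(psm_scores)]
    return -math.log(reduce(lambda x, y: x * y, weighted))


score = down_weighted_v2_score([0.01, 0.01])

# Equivalent view: each -ln(score) term is divided by its 1-based rank.
by_rank = sum(-math.log(s) / (1 + i) for i, s in enumerate([0.01, 0.01]))
print(score, by_rank)
```

Unlike `multiplicative_log`, the listed source for this method has no zero guard, so a PSM score of exactly 0 makes `math.log` raise a `ValueError`.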
fishers_method(self)
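This method uses a Fisher's method scoring scheme: the protein score is -2 times the sum of Log(X) over the protein's PSM scores. The method is not listed in SCORE_METHODS, so score_psms does not dispatch to it and it must be called directly.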
Examples:
>>> score = pyproteininference.scoring.Score(data=data)
>>> score.fishers_method()
Source code in pyproteininference/scoring.py
def fishers_method(self):
"""
This method uses a Fisher's method scoring scheme.
Examples:
>>> score = pyproteininference.scoring.Score(data=data)
>>> score.fishers_method()
"""
all_scores = []
logger.info("Scoring Proteins with fishers method")
for protein in self.pre_score_data:
val_list = protein.get_psm_scores()
score = -2 * sum([math.log(x) for x in val_list])
protein.score = score
all_scores.append(protein)
# Here reverse the sorting to descending because a higher score is better
all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)
self.data.protein_score = "fishers_method"
self.data.short_protein_score = "fm"
self.data.scored_proteins = all_scores
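The statistic computed above can be sketched in a few lines. `fishers_statistic` is a hypothetical helper, and the inputs are assumed to be probability-like PSM scores strictly greater than 0, since `math.log` rejects 0.

```python
import math


def fishers_statistic(p_values):
    """Combine p-value-like scores as -2 * sum(ln(p)) (hypothetical helper)."""
    return -2 * sum(math.log(p) for p in p_values)


strong = fishers_statistic([0.001, 0.01])
weak = fishers_statistic([0.5, 0.4])
print(strong, weak)
```

Smaller inputs give a larger statistic, which is why the listed source sorts proteins in descending order of score.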
geometric_mean_log(self)
This method uses a Geometric Mean scoring scheme. The protein score is -Log(X) of the geometric mean of the protein's PSM scores.
Examples:
>>> score = pyproteininference.scoring.Score(data=data)
>>> score.geometric_mean_log()
Source code in pyproteininference/scoring.py
def geometric_mean_log(self):
"""
This method uses a Geometric Mean scoring scheme.
We also take -Log(X) of the final score here.
Examples:
>>> score = pyproteininference.scoring.Score(data=data)
>>> score.geometric_mean_log()
"""
all_scores = []
logger.info("Scoring Proteins with GML method")
for protein in self.pre_score_data:
psm_scores = protein.get_psm_scores()
val_list = []
for vals in psm_scores:
val_list.append(float(vals))
combine = reduce(lambda x, y: x * y, val_list)
if combine == 0:
combine = sys.float_info.min
pre_log_score = combine ** (1 / float(len(val_list)))
score = -math.log(pre_log_score)
protein.score = score
all_scores.append(protein)
# Higher score is better as a smaller q or pep in a -log will give a larger value
all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)
self.data.protein_score = self.GEOMETRIC_MEAN
self.data.short_protein_score = self.SHORT_GEOMETRIC_MEAN
self.data.scored_proteins = all_scores
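A brief sketch of the geometric mean computation, with a hypothetical `geometric_mean_log_score` helper operating on a plain list of PSM scores:

```python
import math
import sys
from functools import reduce


def geometric_mean_log_score(psm_scores):
    """Return -ln of the geometric mean of the scores (hypothetical helper)."""
    vals = [float(s) for s in psm_scores]
    product = reduce(lambda x, y: x * y, vals)
    if product == 0:
        # Same zero guard as the listed source.
        product = sys.float_info.min
    return -math.log(product ** (1 / float(len(vals))))


score = geometric_mean_log_score([0.01, 0.0001])

# Equivalent view: the arithmetic mean of the individual -ln(score) values.
mean_of_logs = sum(-math.log(s) for s in [0.01, 0.0001]) / 2
print(score, mean_of_logs)
```

Taking the n-th root normalizes for the number of PSMs, so a protein does not gain score simply by having more PSM scores.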
iterative_down_weighted_log(self)
This method uses a Downweighted Multiplicative Log scoring scheme. Each PSM score (a q-value or PEP value) is downweighted by multiplying it by (1+index_number), where index_number is the zero-based position of the score in the protein's score list, so each successive score contributes less weight.
The protein score is -Log(X) of the product of the downweighted scores.
Examples:
>>> score = pyproteininference.scoring.Score(data=data)
>>> score.iterative_down_weighted_log()
Source code in pyproteininference/scoring.py
def iterative_down_weighted_log(self):
"""
This method uses a Downweighted Multiplicative Log scoring scheme.
Each peptide is iteratively downweighted by multiplying the peptide QValue or PepValue to
the following (1+index_number).
Where index_number is the peptide number per protein.
Each score for a protein provides less and less weight iteratively.
We also take -Log(X) of the final score here.
Examples:
>>> score = pyproteininference.scoring.Score(data=data)
>>> score.iterative_down_weighted_log()
"""
all_scores = []
logger.info("Scoring Proteins with IDWL method")
for protein in self.pre_score_data:
val_list = protein.get_psm_scores()
# Here take each score and multiply it by (1 + its index number).
# This downweights each successive score by reducing its weight in a decreasing fashion
# Basically, each score for a protein will provide less and less weight iteratively
val_list = [val_list[x] * (float(1 + x)) for x in range(len(val_list))]
# val_list = [val_list[x]**(1/float(1+(float(x)/10))) for x in range(len(val_list))]
combine = reduce(lambda x, y: x * y, val_list)
if combine == 0:
combine = sys.float_info.min
score = -math.log(combine)
protein.score = score
all_scores.append(protein)
# Higher score is better as a smaller q or pep in a -log will give a larger value
all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)
self.data.protein_score = self.ITERATIVE_DOWNWEIGHTED_LOG
self.data.short_protein_score = self.SHORT_ITERATIVE_DOWNWEIGHTED_LOG
self.data.scored_proteins = all_scores
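The multiplier-based downweighting can be sketched as below. `idwl_score` is a hypothetical stand-alone helper that mirrors the arithmetic in the listing on a plain list of PSM scores.

```python
import math
import sys
from functools import reduce


def idwl_score(psm_scores):
    """Multiply the i-th score by (1 + i), take the product, then -ln (hypothetical helper)."""
    weighted = [s * float(1 + i) for i, s in enumerate(psm_scores)]
    product = reduce(lambda x, y: x * y, weighted)
    if product == 0:
        # Same zero guard as the listed source.
        product = sys.float_info.min
    return -math.log(product)


scores = [0.01, 0.01]
score = idwl_score(scores)

# Equivalent view: the plain multiplicative log score minus ln(n!).
plain = -sum(math.log(s) for s in scores)
penalty = math.log(math.factorial(len(scores)))
print(score, plain - penalty)
```

Since the multipliers 1, 2, ..., n multiply to n!, the penalty depends only on how many PSM scores a protein has, not on their values.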
iterative_down_weighted_v2(self)
This experimental method is a placeholder for developing future scoring schemes. Its computation currently matches down_weighted_v2, and score_psms does not dispatch to it.
Source code in pyproteininference/scoring.py
def iterative_down_weighted_v2(self):
"""
The following method is an experimental method essentially used for future development of potential scoring
schemes.
"""
all_scores = []
logger.info("Scoring Proteins with iterative down weighted v2 method")
for protein in self.pre_score_data:
val_list = protein.get_psm_scores()
# Here take each score and raise it to the power of (1/(1+index_number)).
# This downweights each successive score by reducing its weight in a decreasing fashion
# Basically, each score for a protein will provide less and less weight iteratively
val_list = [val_list[x] ** (1 / float(1 + x)) for x in range(len(val_list))]
# val_list = [val_list[x]**(1/float(1+(float(x)/10))) for x in range(len(val_list))]
score = -math.log(reduce(lambda x, y: x * y, val_list))
protein.score = score
all_scores.append(protein)
# Higher score is better as a smaller q or pep in a -log will give a larger value
all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)
self.data.protein_score = "iterative_downweighting2"
self.data.short_protein_score = "idw2"
self.data.scored_proteins = all_scores
multiplicative_log(self)
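This method uses a Multiplicative Log scoring scheme. The selected Psm scores of all peptides for a protein are multiplied together, and the protein score is -Log(X) of that product. If the product is exactly 0, it is replaced with the smallest positive normalized float before the log is taken.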
Examples:
>>> score = pyproteininference.scoring.Score(data=data)
>>> score.multiplicative_log()
Source code in pyproteininference/scoring.py
def multiplicative_log(self):
"""
This method uses a Multiplicative Log scoring scheme.
The selected Psm score from all the peptides per protein are multiplied together and we take -Log(X)
of the multiplied Peptide scores.
Examples:
>>> score = pyproteininference.scoring.Score(data=data)
>>> score.multiplicative_log()
"""
# Instead of making all_scores a list... make it a generator??
all_scores = []
logger.info("Scoring Proteins with Multiplicative Log Method")
for protein in self.pre_score_data:
# We create a generator of val_list...
val_list = protein.get_psm_scores()
combine = reduce(lambda x, y: x * y, val_list)
if combine == 0:
combine = sys.float_info.min
score = -math.log(combine)
protein.score = score
all_scores.append(protein)
# Higher score is better as a smaller q or pep in a -log will give a larger value
all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)
self.data.protein_score = self.MULTIPLICATIVE_LOG
self.data.short_protein_score = self.SHORT_MULTIPLICATIVE_LOG
self.data.scored_proteins = all_scores
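The zero guard matters because a product of many very small scores can underflow to 0.0. A minimal sketch, with a hypothetical `multiplicative_log_score` helper working on a plain list:

```python
import math
import sys
from functools import reduce


def multiplicative_log_score(psm_scores):
    """Return -ln of the product of the scores, guarding against zero (hypothetical helper)."""
    product = reduce(lambda x, y: x * y, psm_scores)
    if product == 0:
        # Replace an underflowed or zero product with the smallest positive normalized float.
        product = sys.float_info.min
    return -math.log(product)


typical = multiplicative_log_score([0.1, 0.01])
underflowed = multiplicative_log_score([1e-200, 1e-200])
print(typical, underflowed)
```

The guard caps the score at `-math.log(sys.float_info.min)`, so every protein whose product underflows receives the same score and their relative order is no longer informative.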
score_psms(self, score_method='multiplicative_log')
This method dispatches to the actual scoring method given a string input that is defined in the ProteinInferenceParameter object.
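Parameters:
Name | Type | Description |
---|---|---|
score_method | str | String that represents which scoring method to call. |
Exceptions:
Type | Description |
---|---|
ValueError | Raised if score_method is not present in the constant SCORE_METHODS. |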
Examples:
>>> score = pyproteininference.scoring.Score(data=data)
>>> score.score_psms(score_method="best_peptide_per_protein")
Source code in pyproteininference/scoring.py
def score_psms(self, score_method="multiplicative_log"):
"""
This method dispatches to the actual scoring method given a string input that is defined in the
[ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter] object.
Args:
score_method (str): This is a string that represents which scoring method to call.
Raises:
ValueError: Will Error out if the score_method is not present in the constant `SCORE_METHODS`.
Examples:
>>> score = pyproteininference.scoring.Score(data=data)
>>> score.score_psms(score_method="best_peptide_per_protein")
"""
if score_method not in self.SCORE_METHODS:
raise ValueError(
"score method '{}' is not a proper method. Score method must be one of the following: '{}'".format(
score_method, ", ".join(self.SCORE_METHODS)
)
)
else:
if score_method == self.BEST_PEPTIDE_PER_PROTEIN:
self.best_peptide_per_protein()
if score_method == self.ITERATIVE_DOWNWEIGHTED_LOG:
self.iterative_down_weighted_log()
if score_method == self.MULTIPLICATIVE_LOG:
self.multiplicative_log()
if score_method == self.DOWNWEIGHTED_MULTIPLICATIVE_LOG:
self.down_weighted_multiplicative_log()
if score_method == self.DOWNWEIGHTED_VERSION2:
self.down_weighted_v2()
if score_method == self.TOP_TWO_COMBINED:
self.top_two_combied()
if score_method == self.GEOMETRIC_MEAN:
self.geometric_mean_log()
if score_method == self.ADDITIVE:
self.additive()
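The chain of `if` statements above maps a method-name string to a bound method. One alternative way to express the same dispatch is a dictionary lookup, sketched here on a hypothetical `MiniScorer` stand-in with two placeholder methods. This is not how the library implements it, and the placeholder methods return strings only to make the dispatch visible.

```python
class MiniScorer:
    """Stand-in for Score with two placeholder scoring methods (hypothetical)."""

    BEST_PEPTIDE_PER_PROTEIN = "best_peptide_per_protein"
    ADDITIVE = "additive"

    def best_peptide_per_protein(self):
        return "ran best_peptide_per_protein"

    def additive(self):
        return "ran additive"

    def score_psms(self, score_method="additive"):
        # Map each method-name string to the bound method that implements it.
        dispatch = {
            self.BEST_PEPTIDE_PER_PROTEIN: self.best_peptide_per_protein,
            self.ADDITIVE: self.additive,
        }
        if score_method not in dispatch:
            raise ValueError(
                "score method '{}' is not a proper method. "
                "Score method must be one of the following: '{}'".format(
                    score_method, ", ".join(dispatch)
                )
            )
        return dispatch[score_method]()


scorer = MiniScorer()
print(scorer.score_psms("additive"))
```

With a table like this, adding a scoring method means adding one dictionary entry, and the list of valid names in the error message stays in sync automatically.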
top_two_combied(self)
This method uses a Top Two scoring scheme. The first two scores for each protein are multiplied together and divided by 2, and the protein score is -Log(X) of that value. If a protein has only one score/peptide, the protein score is -Log(X) of that single score.
Examples:
>>> score = pyproteininference.scoring.Score(data=data)
>>> score.top_two_combied()
Source code in pyproteininference/scoring.py
def top_two_combied(self):
"""
This method uses a Top Two scoring scheme.
The top two scores for each protein are multiplied together and we take -Log(X) of the multiplied value.
If a protein only has 1 score/peptide, then we only do -Log(X) of the 1 peptide score.
Examples:
>>> score = pyproteininference.scoring.Score(data=data)
>>> score.top_two_combied()
"""
all_scores = []
logger.info("Scoring Proteins with Top Two Method")
for protein in self.pre_score_data:
val_list = protein.get_psm_scores()
try:
# Try to combine the top two scores
# Divide by 2 to attempt to normalize the value
score = -math.log((val_list[0] * val_list[1]) / 2)
except IndexError:
# If there is only 1 score/1 peptide then just use the 1 peptide provided
score = -math.log(val_list[0])
protein.score = score
all_scores.append(protein)
# Higher score is better as a smaller q or pep in a -log will give a larger value
all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)
self.data.protein_score = self.TOP_TWO_COMBINED
self.data.short_protein_score = self.SHORT_TOP_TWO_COMBINED
self.data.scored_proteins = all_scores
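A short sketch of the top two computation follows. It assumes the score list is already ordered best first, which the listing relies on but does not enforce; `top_two_score` is a hypothetical helper.

```python
import math


def top_two_score(psm_scores):
    """Combine the first two scores as -ln(s0 * s1 / 2), or -ln(s0) if only one exists."""
    if len(psm_scores) >= 2:
        return -math.log((psm_scores[0] * psm_scores[1]) / 2)
    # Only one score/peptide is available, so use it on its own.
    return -math.log(psm_scores[0])


two_or_more = top_two_score([0.01, 0.02, 0.5])
single = top_two_score([0.01])
print(two_or_more, single)
```

The sketch uses an explicit length check in place of the `try`/`except IndexError` in the listing; scores after the second are ignored in both versions.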