Py Protein Inference Module

datastore

DataStore

This class serves as the data storage object for a protein inference analysis. It is the central object that is accessed at virtually every protein inference (PI) processing step.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `main_data_form` | `list` | List of unrestricted Psm objects. |
| `parameter_file_object` | `ProteinInferenceParameter` | Protein inference parameter object. |
| `restricted_peptides` | `list` | List of non-flanking peptide strings present in the current analysis. |
| `main_data_restricted` | `list` | List of restricted Psm objects. Restriction is based on the parameter_file_object and the object is created by function restrict_psm_data. |
| `scored_proteins` | `list` | List of scored Protein objects. Output from scoring methods from scoring. |
| `grouped_scored_proteins` | `list` | List of scored Protein objects that have been grouped and sorted. Output from run_inference method. |
| `scoring_input` | `list` | List of non-scored Protein objects. Output from create_scoring_input. |
| `picked_proteins_scored` | `list` | List of Protein objects that pass the protein picker algorithm (protein_picker). |
| `picked_proteins_removed` | `list` | List of Protein objects that do not pass the protein picker algorithm (protein_picker). |
| `protein_peptide_dictionary` | `collections.defaultdict` | Dictionary of protein strings (keys) that map to sets of peptide strings based on the peptides and proteins found in the search. Protein -> set(Peptides). |
| `peptide_protein_dictionary` | `collections.defaultdict` | Dictionary of peptide strings (keys) that map to sets of protein strings based on the peptides and proteins found in the search. Peptide -> set(Proteins). |
| `high_low_better` | `str` | Variable that indicates whether a higher or a lower protein score is better. This is necessary to sort Protein objects by score properly. Can either be "higher" or "lower". |
| `psm_score` | `str` | Variable that indicates the Psm score being used in the analysis to generate Protein scores. |
| `protein_score` | `str` | String to indicate the protein score method used. |
| `short_protein_score` | `str` | Short string to indicate the protein score method used. |
| `protein_group_objects` | `list` | List of scored ProteinGroup objects that have been grouped and sorted. Output from run_inference method. |
| `decoy_symbol` | `str` | String that is used to differentiate between decoy proteins and target proteins. Ex: "##". |
| `digest` | `Digest` | Digest object. |
| `SCORE_MAPPER` | `dict` | Dictionary that maps potential scores in input files to internal score names. |
| `CUSTOM_SCORE_KEY` | `str` | String that indicates a custom score is being used. |
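
The two dictionary attributes described above are the two directions of the same peptide/protein relationship. The following is a minimal sketch of that idea only, not the library's implementation: it uses hypothetical `(peptide, proteins)` tuples in place of real Psm objects, and the names `toy_psms`, `protein_to_peptides`, and `peptide_to_proteins` are made up for illustration.

```python
import collections

# Hypothetical toy input: (peptide, possible proteins) pairs standing in for Psm objects.
toy_psms = [
    ("PEPTIDEK", ["PROT_A", "PROT_B"]),
    ("SAMPLER", ["PROT_A"]),
    ("PEPTIDEK", ["PROT_A", "PROT_B"]),  # a second PSM for the same peptide
    ("ANOTHERK", ["##PROT_C"]),  # "##" used as the decoy symbol, as in the example above
]

# Protein -> set(Peptides) and Peptide -> set(Proteins)
protein_to_peptides = collections.defaultdict(set)
peptide_to_proteins = collections.defaultdict(set)

for peptide, proteins in toy_psms:
    for protein in proteins:
        # Sets absorb the repeated PSM, so each edge is stored once per direction
        protein_to_peptides[protein].add(peptide)
        peptide_to_proteins[peptide].add(protein)

print(sorted(protein_to_peptides["PROT_A"]))  # ['PEPTIDEK', 'SAMPLER']
print(sorted(peptide_to_proteins["PEPTIDEK"]))  # ['PROT_A', 'PROT_B']
```

Storing sets rather than lists means repeated PSMs for one peptide do not inflate either mapping, which matches the `set(...)` notation used in the attribute descriptions.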

Source code in pyproteininference/datastore.py
# Imports and module-level logger that the class below relies on
import collections
import logging

from pyproteininference.inference import Inference
from pyproteininference.physical import Protein
from pyproteininference.scoring import Score

logger = logging.getLogger(__name__)


class DataStore(object):
    """
    The DataStore class serves as the data storage object for a protein inference analysis.
    It is the central object that is accessed at virtually every protein inference processing step.


    Attributes:
        main_data_form (list): List of unrestricted Psm objects.
        parameter_file_object (ProteinInferenceParameter): protein inference parameter
            [object][pyproteininference.parameters.ProteinInferenceParameter].
        restricted_peptides (list): List of non-flanking peptide strings present in the current analysis.
        main_data_restricted (list): List of restricted [Psm][pyproteininference.physical.Psm] objects.
            Restriction is based on the parameter_file_object and the object is created by function
                [restrict_psm_data][pyproteininference.datastore.DataStore.restrict_psm_data].
        scored_proteins (list): List of scored [Protein][pyproteininference.physical.Protein] objects.
            Output from scoring methods from [scoring][pyproteininference.scoring].
        grouped_scored_proteins (list): List of scored [Protein][pyproteininference.physical.Protein]
            objects that have been grouped and sorted. Output from
                [run_inference][pyproteininference.inference.Inference.run_inference] method.
        scoring_input (list): List of non-scored [Protein][pyproteininference.physical.Protein] objects.
            Output from [create_scoring_input][pyproteininference.datastore.DataStore.create_scoring_input].
        picked_proteins_scored (list): List of [Protein][pyproteininference.physical.Protein] objects that pass
            the protein picker algorithm ([protein_picker][pyproteininference.datastore.DataStore.protein_picker]).
        picked_proteins_removed (list): List of [Protein][pyproteininference.physical.Protein] objects that do not
            pass the protein picker algorithm ([protein_picker][pyproteininference.datastore.DataStore.protein_picker]).
        protein_peptide_dictionary (collections.defaultdict): Dictionary of protein strings (keys) that map to sets
            of peptide strings based on the peptides and proteins found in the search. Protein -> set(Peptides).
        peptide_protein_dictionary (collections.defaultdict): Dictionary of peptide strings (keys) that map to sets
            of protein strings based on the peptides and proteins found in the search. Peptide -> set(Proteins).
        high_low_better (str): Variable that indicates whether a higher or a lower protein score is better.
            This is necessary to sort Protein objects by score properly. Can either be "higher" or "lower".
        psm_score (str): Variable that indicates the [Psm][pyproteininference.physical.Psm]
            score being used in the analysis to generate [Protein][pyproteininference.physical.Protein] scores.
        protein_score (str): String to indicate the protein score method used.
        short_protein_score (str): Short String to indicate the protein score method used.
        protein_group_objects (list): List of scored [ProteinGroup][pyproteininference.physical.ProteinGroup]
            objects that have been grouped and sorted. Output from
             [run_inference][pyproteininference.inference.Inference.run_inference] method.
        decoy_symbol (str): String that is used to differentiate between decoy proteins and target proteins. Ex: "##".
        digest (Digest): [Digest object][pyproteininference.in_silico_digest.Digest].
        SCORE_MAPPER (dict): Dictionary that maps potential scores in input files to internal score names.
        CUSTOM_SCORE_KEY (str): String that indicates a custom score is being used.

    """

    SCORE_MAPPER = {
        "q_value": "qvalue",
        "pep_value": "pepvalue",
        "perc_score": "percscore",
        "score": "percscore",
        "q-value": "qvalue",
        "posterior_error_prob": "pepvalue",
        "posterior_error_probability": "pepvalue",
    }

    CUSTOM_SCORE_KEY = "custom_score"

    HIGHER_PSM_SCORE = "higher"
    LOWER_PSM_SCORE = "lower"

    def __init__(self, reader, digest, validate=True):
        """
        Initializes the DataStore from a loaded Reader object and a Digest object.

        Args:
            reader (Reader): Reader object [Reader][pyproteininference.reader.Reader].
            digest (Digest): Digest object
                [Digest][pyproteininference.in_silico_digest.Digest].
            validate (bool): True/False to indicate if the input data should be validated.

        Example:
            >>> pyproteininference.datastore.DataStore(reader = reader, digest=digest)


        """
        # If the reader class is from a percolator.psms then define main_data_form as reader.psms
        # main_data_form is the starting point for all other analyses
        self._init_validate(reader=reader)

        self.parameter_file_object = reader.parameter_file_object  # Parameter object
        self.main_data_restricted = None  # PSM data post restriction
        self.scored_proteins = []  # List of scored Protein objects
        self.grouped_scored_proteins = []  # List of sorted scored Protein objects
        self.scoring_input = None  # List of non scored Protein objects
        self.picked_proteins_scored = None  # List of Protein objects after picker algorithm
        self.picked_proteins_removed = None  # Protein objects removed via picker
        self.protein_peptide_dictionary = None
        self.peptide_protein_dictionary = None
        self.high_low_better = None  # Variable that indicates whether a higher or lower protein score is better
        self.psm_score = None  # PSM Score used
        self.protein_score = None
        self.short_protein_score = None
        self.protein_group_objects = []  # List of sorted protein group objects
        self.decoy_symbol = self.parameter_file_object.decoy_symbol  # Decoy symbol from parameter file
        self.digest = digest  # Digest object

        # Run Checks and Validations
        if validate:
            self.validate_psm_data()
            self.validate_digest()
            self.check_data_consistency()

        # Run method to fix our parameter object if necessary
        self.parameter_file_object.fix_parameters_from_datastore(data=self)

    def get_sorted_identifiers(self, scored=True):
        """
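        Retrieves a sorted list of protein strings present in the analysis. Identifiers are ordered as Swiss-Prot
        targets, Swiss-Prot decoys, remaining targets, then remaining decoys, each group sorted alphabetically.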

        Args:
            scored (bool): True/False to indicate if we should return scored or non-scored identifiers.

        Returns:
            list: List of sorted protein identifier strings.

        Example:
            >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
            >>> sorted_proteins = data.get_sorted_identifiers(scored=True)
        """

        if scored:
            self._validate_scored_proteins()
            if self.picked_proteins_scored:
                proteins = set([x.identifier for x in self.picked_proteins_scored])
            else:
                proteins = set([x.identifier for x in self.scored_proteins])
        else:
            self._validate_scoring_input()
            proteins = [x.identifier for x in self.scoring_input]

        all_sp_proteins = set(self.digest.swiss_prot_protein_set)

        our_target_sp_proteins = sorted([x for x in proteins if x in all_sp_proteins and self.decoy_symbol not in x])
        our_decoy_sp_proteins = sorted([x for x in proteins if x in all_sp_proteins and self.decoy_symbol in x])

        our_target_tr_proteins = sorted(
            [x for x in proteins if x not in all_sp_proteins and self.decoy_symbol not in x]
        )
        our_decoy_tr_proteins = sorted([x for x in proteins if x not in all_sp_proteins and self.decoy_symbol in x])

        our_proteins_sorted = (
            our_target_sp_proteins + our_decoy_sp_proteins + our_target_tr_proteins + our_decoy_tr_proteins
        )

        return our_proteins_sorted

    @classmethod
    def sort_protein_group_objects(cls, protein_group_objects, higher_or_lower):
        """
        Class Method to sort a list of [ProteinGroup][pyproteininference.physical.ProteinGroup] objects by
        score and number of peptides.

        Args:
            protein_group_objects (list): list of [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.
            higher_or_lower (str): String to indicate if a "higher" or "lower" protein score is "better".

        Returns:
            list: list of sorted [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.

        Example:
            >>> list_of_group_objects = pyproteininference.datastore.DataStore.sort_protein_group_objects(
            ...     protein_group_objects=list_of_group_objects, higher_or_lower="higher"
            ... )
        """
        if higher_or_lower == cls.LOWER_PSM_SCORE:

            protein_group_objects = sorted(
                protein_group_objects,
                key=lambda k: (
                    k.proteins[0].score,
                    -k.proteins[0].num_peptides,
                ),
                reverse=False,
            )
        elif higher_or_lower == cls.HIGHER_PSM_SCORE:

            protein_group_objects = sorted(
                protein_group_objects,
                key=lambda k: (
                    k.proteins[0].score,
                    k.proteins[0].num_peptides,
                ),
                reverse=True,
            )

        return protein_group_objects

    @classmethod
    def sort_protein_objects(cls, grouped_protein_objects, higher_or_lower):
        """
        Class Method to sort a list of [Protein][pyproteininference.physical.Protein] objects by score and number of
        peptides.

        Args:
            grouped_protein_objects (list): list of [Protein][pyproteininference.physical.Protein] objects.
            higher_or_lower (str): String to indicate if a "higher" or "lower" protein score is "better".

        Returns:
            list: list of sorted [Protein][pyproteininference.physical.Protein] objects.

        Example:
            >>> scores_grouped = pyproteininference.datastore.DataStore.sort_protein_objects(
            ...     grouped_protein_objects=scores_grouped, higher_or_lower="higher"
            ... )
        """
        if higher_or_lower == cls.LOWER_PSM_SCORE:
            grouped_protein_objects = sorted(
                grouped_protein_objects,
                key=lambda k: (k[0].score, -k[0].num_peptides),
                reverse=False,
            )
        if higher_or_lower == cls.HIGHER_PSM_SCORE:
            grouped_protein_objects = sorted(
                grouped_protein_objects,
                key=lambda k: (k[0].score, k[0].num_peptides),
                reverse=True,
            )
        return grouped_protein_objects

    @classmethod
    def sort_protein_sub_groups(cls, protein_list, higher_or_lower):
        """
        Method to sort protein sub lists.

        Args:
            protein_list (list): List of [Protein][pyproteininference.physical.Protein] objects to be sorted.
            higher_or_lower (str): String to indicate if a "higher" or "lower" protein score is "better".

        Returns:
            list: List of [Protein][pyproteininference.physical.Protein] objects with all non-lead entries sorted by
            score and number of peptides.

        """

        # Sort the groups based on higher or lower indication, secondarily sort the groups based on number of unique
        # peptides
        # We use the index [1:] as we do not wish to sort the lead protein...
        if higher_or_lower == cls.LOWER_PSM_SCORE:
            protein_list[1:] = sorted(
                protein_list[1:],
                key=lambda k: (float(k.score), -float(k.num_peptides)),
                reverse=False,
            )
        if higher_or_lower == cls.HIGHER_PSM_SCORE:
            protein_list[1:] = sorted(
                protein_list[1:],
                key=lambda k: (float(k.score), float(k.num_peptides)),
                reverse=True,
            )

        return protein_list

    def get_psm_data(self):
        """
        Method to retrieve a list of [Psm][pyproteininference.physical.Psm] objects.
        Retrieves restricted data if the data has been restricted or all of the data if the data has
        not been restricted.

        Returns:
            list: list of [Psm][pyproteininference.physical.Psm] objects.

        Example:
            >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
            >>> psm_data = data.get_psm_data()
        """
        if not self.main_data_restricted and not self.main_data_form:
            raise ValueError(
                "Both main_data_restricted and main_data_form variables are empty. Please re-load the DataStore "
                "object with a properly loaded Reader object."
            )

        if self.main_data_restricted:
            psm_data = self.main_data_restricted
        else:
            psm_data = self.main_data_form

        return psm_data

    def get_protein_data(self):
        """
        Method to retrieve a list of [Protein][pyproteininference.physical.Protein] objects.
        Retrieves picked and scored data if the data has been picked and scored or just the scored data if the data has
        not been picked.

        Returns:
            list: list of [Protein][pyproteininference.physical.Protein] objects.

        Example:
            >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
            >>> # Data must be run through a pyproteininference.scoring.Score method
            >>> protein_data = data.get_protein_data()
        """

        if self.picked_proteins_scored:
            scored_proteins = self.picked_proteins_scored
        else:
            scored_proteins = self.scored_proteins

        return scored_proteins

    def get_protein_identifiers_from_psm_data(self):
        """
        Method to retrieve a list of lists of all possible protein identifiers from the psm data.

        Returns:
            list: list of lists of protein strings.

        Example:
            >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
            >>> protein_strings = data.get_protein_identifiers_from_psm_data()
        """
        psm_data = self.get_psm_data()

        proteins = [x.possible_proteins for x in psm_data]

        return proteins

    def get_q_values(self):
        """
        Method to retrieve a list of all q values for all PSMs.

        Returns:
            list: list of floats (q values).

        Example:
            >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
            >>> q = data.get_q_values()
        """
        psm_data = self.get_psm_data()

        q_values = [x.qvalue for x in psm_data]

        return q_values

    def get_pep_values(self):
        """
        Method to retrieve a list of all posterior error probabilities for all PSMs.

        Returns:
            list: list of floats (pep values).

        Example:
            >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
            >>> pep = data.get_pep_values()
        """
        psm_data = self.get_psm_data()

        pep_values = [x.pepvalue for x in psm_data]

        return pep_values

    def get_protein_information_dictionary(self):
        """
        Method to retrieve a dictionary that maps each protein to the scores of its Psms.

        Returns:
            dict: dictionary of scores for each protein.

        Example:
            >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
            >>> protein_dict = data.get_protein_information_dictionary()
        """
        psm_data = self.get_psm_data()

        protein_psm_score_dictionary = collections.defaultdict(list)

        # Loop through all Psms
        for psms in psm_data:
            # Loop through all proteins
            for prots in psms.possible_proteins:
                protein_psm_score_dictionary[prots].append(
                    {
                        "peptide": psms.identifier,
                        "Qvalue": psms.qvalue,
                        "PosteriorErrorProbability": psms.pepvalue,
                        "Percscore": psms.percscore,
                    }
                )

        return protein_psm_score_dictionary

    def restrict_psm_data(self, remove1pep=True):
        """
        Method to restrict the input of [Psm][pyproteininference.physical.Psm] objects.
        This method is central to the pyproteininference module and is able to restrict the Psm data by:
        Q value, Pep Value, Percolator Score, Peptide Length, and Custom Score Input.
        Restriction values are pulled from
        the [ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter]
        object.

        This method sets the `main_data_restricted` and `restricted_peptides` Attributes for the DataStore object.

        Args:
            remove1pep (bool): True/False on whether or not to remove Psms with a PEP value equal to 1. Note that this
                is only applied when a PEP restriction (`restrict_pep`) is set.

        Returns:
            None:

        Example:
            >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
            >>> data.restrict_psm_data(remove1pep=True)
        """

        # Validate that we have the main data variable
        self._validate_main_data_form()

        logger.info("Restricting PSM data")

        peptide_length = self.parameter_file_object.restrict_peptide_length
        posterior_error_prob_threshold = self.parameter_file_object.restrict_pep
        q_value_threshold = self.parameter_file_object.restrict_q
        custom_threshold = self.parameter_file_object.restrict_custom

        main_psm_data = self.main_data_form
        logger.info("Length of main data: {}".format(len(self.main_data_form)))
        # If remove1pep is True and a PEP threshold is set, discard every Psm that has a PEP of 1
        if remove1pep and posterior_error_prob_threshold:
            main_psm_data = [x for x in main_psm_data if x.pepvalue != 1]

        # Restrict peptide length and posterior error probability
        if peptide_length and posterior_error_prob_threshold and not q_value_threshold:
            restricted_data = []
            for psms in main_psm_data:
                if len(psms.stripped_peptide) >= peptide_length and psms.pepvalue < float(
                    posterior_error_prob_threshold
                ):
                    restricted_data.append(psms)

        # Restrict peptide length only
        if peptide_length and not posterior_error_prob_threshold and not q_value_threshold:
            restricted_data = []
            for psms in main_psm_data:
                if len(psms.stripped_peptide) >= peptide_length:
                    restricted_data.append(psms)

        # Restrict peptide length, posterior error probability, and qvalue
        if peptide_length and posterior_error_prob_threshold and q_value_threshold:
            restricted_data = []
            for psms in main_psm_data:
                if (
                    len(psms.stripped_peptide) >= peptide_length
                    and psms.pepvalue < float(posterior_error_prob_threshold)
                    and psms.qvalue < float(q_value_threshold)
                ):
                    restricted_data.append(psms)

        # Restrict peptide length and qvalue
        if peptide_length and not posterior_error_prob_threshold and q_value_threshold:
            restricted_data = []
            for psms in main_psm_data:
                if len(psms.stripped_peptide) >= peptide_length and psms.qvalue < float(q_value_threshold):
                    restricted_data.append(psms)

        # Restrict posterior error probability and q value
        if not peptide_length and posterior_error_prob_threshold and q_value_threshold:
            restricted_data = []
            for psms in main_psm_data:
                if psms.pepvalue < float(posterior_error_prob_threshold) and psms.qvalue < float(q_value_threshold):
                    restricted_data.append(psms)

        # Restrict qvalue only
        if not peptide_length and not posterior_error_prob_threshold and q_value_threshold:
            restricted_data = []
            for psms in main_psm_data:
                if psms.qvalue < float(q_value_threshold):
                    restricted_data.append(psms)

        # Restrict posterior error probability only
        if not peptide_length and posterior_error_prob_threshold and not q_value_threshold:
            restricted_data = []
            for psms in main_psm_data:
                if psms.pepvalue < float(posterior_error_prob_threshold):
                    restricted_data.append(psms)

        # Restrict nothing (no thresholds are set, so all Psms are kept)
        if not peptide_length and not posterior_error_prob_threshold and not q_value_threshold:
            restricted_data = main_psm_data

        if custom_threshold:
            custom_restricted = []
            if self.parameter_file_object.psm_score_type == Score.MULTIPLICATIVE_SCORE_TYPE:
                for psms in restricted_data:
                    if psms.custom_score <= custom_threshold:
                        custom_restricted.append(psms)

            if self.parameter_file_object.psm_score_type == Score.ADDITIVE_SCORE_TYPE:
                for psms in restricted_data:
                    if psms.custom_score >= custom_threshold:
                        custom_restricted.append(psms)

            restricted_data = custom_restricted

        self.main_data_restricted = restricted_data

        logger.info("Length of restricted data: {}".format(len(restricted_data)))

        self.restricted_peptides = [x.non_flanking_peptide for x in restricted_data]

    def create_scoring_input(self):
        """
        Method to create the scoring input.
        This method initializes a list of [Protein][pyproteininference.physical.Protein] objects to get them ready
        to be scored by [Score][pyproteininference.scoring.Score] methods.
        This method also takes into account the inference type and aggregates peptides -> proteins accordingly.

        This method sets the `scoring_input` and `psm_score` Attributes for the DataStore object.

        The score selected comes from the protein inference parameter object.

        Returns:
            None:

        Example:
            >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
            >>> data.create_scoring_input()
        """

        logger.info("Creating Scoring Input")

        psm_data = self.get_psm_data()

        protein_psm_dict = collections.defaultdict(list)

        try:
            score_key = self.SCORE_MAPPER[self.parameter_file_object.psm_score]
        except KeyError:
            score_key = self.CUSTOM_SCORE_KEY

        if self.parameter_file_object.inference_type != Inference.PEPTIDE_CENTRIC:
            # Loop through all Psms
            for psms in psm_data:
                psms.assign_main_score(score=score_key)
                # Loop through all proteins
                for prots in psms.possible_proteins:
                    protein_psm_dict[prots].append(psms)

        else:
            self.peptide_to_protein_dictionary()
            sp_proteins = self.digest.swiss_prot_protein_set
            for psms in psm_data:

                # Assign main score
                psms.assign_main_score(score=score_key)
                protein_set = self.peptide_protein_dictionary[psms.non_flanking_peptide]
                # Sort protein_set by sp-alpha, decoy-sp-alpha, tr-alpha, decoy-tr-alpha
                sorted_protein_list = self.sort_protein_strings(
                    protein_string_list=protein_set,
                    sp_proteins=sp_proteins,
                    decoy_symbol=self.parameter_file_object.decoy_symbol,
                )
                # Restrict the number of identifiers by the value in param file max_identifiers_peptide_centric
                sorted_protein_list = sorted_protein_list[: self.parameter_file_object.max_identifiers_peptide_centric]
                protein_name = ";".join(sorted_protein_list)
                protein_psm_dict[protein_name].append(psms)

        protein_list = []
        for pkey in sorted(protein_psm_dict.keys()):
            protein_object = Protein(identifier=pkey)
            protein_object.psms = protein_psm_dict[pkey]
            protein_object.raw_peptides = set([x.identifier for x in protein_psm_dict[pkey]])
            protein_list.append(protein_object)

        self.psm_score = self.parameter_file_object.psm_score
        self.scoring_input = protein_list

    def protein_to_peptide_dictionary(self):
        """
        Method that returns a map of protein strings to sets of peptide strings and is essentially half
        of a bipartite graph.
        This method sets the `protein_peptide_dictionary` Attribute for the DataStore object.

        Returns:
            collections.defaultdict: Dictionary of protein strings (keys) that map to sets of peptide strings based
            on the peptides and proteins found in the search. Protein -> set(Peptides).

        Example:
            >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
            >>> protein_peptide_dict = data.protein_to_peptide_dictionary()
        """
        psm_data = self.get_psm_data()

        res_pep_set = set(self.restricted_peptides)
        default_dict_proteins = collections.defaultdict(set)
        for peptide_objects in psm_data:
            for prots in peptide_objects.possible_proteins:
                cur_peptide = peptide_objects.non_flanking_peptide
                if cur_peptide in res_pep_set:
                    default_dict_proteins[prots].add(cur_peptide)

        self.protein_peptide_dictionary = default_dict_proteins

        return default_dict_proteins

    def peptide_to_protein_dictionary(self):
        """
        Method that returns a map of peptide strings to sets of protein strings and is essentially half of a
        bipartite graph.
        This method sets the `peptide_protein_dictionary` Attribute for the DataStore object.

        Returns:
            collections.defaultdict: Dictionary of peptide strings (keys) that map to sets of protein strings based
                on the peptides and proteins found in the search. Peptide -> set(Proteins).

        Example:
            >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
            >>> peptide_protein_dict = data.peptide_to_protein_dictionary()
        """
        psm_data = self.get_psm_data()

        res_pep_set = set(self.restricted_peptides)
        default_dict_peptides = collections.defaultdict(set)
        for peptide_objects in psm_data:
            for prots in peptide_objects.possible_proteins:
                cur_peptide = peptide_objects.non_flanking_peptide
                if cur_peptide in res_pep_set:
                    default_dict_peptides[cur_peptide].add(prots)

        self.peptide_protein_dictionary = default_dict_peptides

        return default_dict_peptides

    def unique_to_leads_peptides(self):
        """
        Method to retrieve peptides that are unique based on the data from the searches
        (Not based on the database digestion).

        Returns:
            set: a Set of peptide strings

        Example:
            >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
            >>> unique_peps = data.unique_to_leads_peptides()
        """
        if self.grouped_scored_proteins:
            lead_peptides = [list(x[0].peptides) for x in self.grouped_scored_proteins]
            flat_peptides = [item for sublist in lead_peptides for item in sublist]
            counted_peps = collections.Counter(flat_peptides)
            unique_to_leads_peptides = set([x for x in counted_peps if counted_peps[x] == 1])
        else:
            unique_to_leads_peptides = set()

        return unique_to_leads_peptides

    def higher_or_lower(self):
        """
        Method to determine if a higher or lower score is better for a given combination of score input and score type.

        This method sets the `high_low_better` Attribute for the DataStore object.

        This method depends on the output from the Score class to be sorted properly from best to worst score.

        Returns:
            str: String indicating "higher" or "lower" depending on if a higher or lower score is a
                better protein score.

        Example:
            >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
            >>> high_low = data.higher_or_lower()
        """

        if not self.high_low_better:
            logger.info("Determining If a higher or lower score is better based on scored proteins")
            worst_score = self.scored_proteins[-1].score
            best_score = self.scored_proteins[0].score

            logger.info("best score = {}".format(best_score))
            logger.info("worst score = {}".format(worst_score))

            if float(best_score) > float(worst_score):
                higher_or_lower = self.HIGHER_PSM_SCORE
            elif float(best_score) < float(worst_score):
                higher_or_lower = self.LOWER_PSM_SCORE
            else:
                # Identical best and worst scores give no ordering, so the direction cannot be determined
                raise ValueError(
                    "Best and Worst scores were identical, equal to {}. Score type {} produced the error, "
                    "please change psm_score type.".format(best_score, self.psm_score)
                )

            self.high_low_better = higher_or_lower

        else:
            higher_or_lower = self.high_low_better

        return higher_or_lower

    def get_protein_identifiers(self, data_form):
        """
        Method to retrieve the protein string identifiers.

        Args:
            data_form (str): Can be one of the following: "main", "restricted", "picked", "picked_removed".

        Returns:
            list: list of protein identifier strings.

        Example:
            >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
            >>> protein_strings = data.get_protein_identifiers(data_form="main")
        """
        if data_form == "main":
            # All the data (unrestricted)
            # Note: each entry is a one-element list wrapping the Psm's possible_proteins
            data_to_select = self.main_data_form
            proteins = [[x.possible_proteins] for x in data_to_select]

        elif data_form == "restricted":
            # Proteins that pass certain restriction criteria (peptide length, pep, qvalue)
            data_to_select = self.main_data_restricted
            proteins = [[x.possible_proteins] for x in data_to_select]

        elif data_form == "picked":
            # Here we look at proteins that are 'picked' (aka the proteins that beat out their matching target/decoy)
            data_to_select = self.picked_proteins_scored
            proteins = [x.identifier for x in data_to_select]

        elif data_form == "picked_removed":
            # Here we look at the proteins that were removed due to picking (aka the proteins that
            # have a worse score than their target/decoy counterpart)
            data_to_select = self.picked_proteins_removed
            proteins = [x.identifier for x in data_to_select]

        else:
            # Fail with a clear message instead of an UnboundLocalError on the return below
            raise ValueError(
                "data_form must be one of 'main', 'restricted', 'picked', 'picked_removed'. "
                "Got '{}'".format(data_form)
            )

        return proteins

    def get_protein_information(self, protein_string):
        """
        Method to retrieve attributes for a specific scored protein.

        Args:
            protein_string (str): Protein Identifier String.

        Returns:
            list: list of protein attributes.

        Example:
            >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
            >>> protein_attr = data.get_protein_information(protein_string="RAF1_HUMAN|P04049")
        """
        all_scored_protein_data = self.scored_proteins
        identifiers = [x.identifier for x in all_scored_protein_data]
        protein_scores = [x.score for x in all_scored_protein_data]
        groups = [x.group_identification for x in all_scored_protein_data]
        reviewed = [x.reviewed for x in all_scored_protein_data]
        peptides = [x.peptides for x in all_scored_protein_data]
        # Peptide scores currently broken...
        peptide_scores = [x.peptide_scores for x in all_scored_protein_data]
        picked = [x.picked for x in all_scored_protein_data]
        num_peptides = [x.num_peptides for x in all_scored_protein_data]

        main_index = identifiers.index(protein_string)

        list_structure = [
            [
                "identifier",
                "protein_score",
                "groups",
                "reviewed",
                "peptides",
                "peptide_scores",
                "picked",
                "num_peptides",
            ]
        ]
        list_structure.append([protein_string])
        list_structure[-1].append(protein_scores[main_index])
        list_structure[-1].append(groups[main_index])
        list_structure[-1].append(reviewed[main_index])
        list_structure[-1].append(peptides[main_index])
        list_structure[-1].append(peptide_scores[main_index])
        list_structure[-1].append(picked[main_index])
        list_structure[-1].append(num_peptides[main_index])

        return list_structure

    def exclude_non_distinguishing_peptides(self, protein_subset_type="hard"):
        """
        Method to Exclude peptides that are not distinguishing on either the search or database level.

        The method sets the `scoring_input` and `restricted_peptides` variables for the DataStore object.

        Args:
            protein_subset_type (str): Either "hard" or "soft". Hard will select distinguishing peptides based on
                the database digestion. "soft" will only use peptides identified in the search.

        Returns:
            None:

        Example:
            >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
            >>> data.exclude_non_distinguishing_peptides(protein_subset_type="hard")
        """

        logger.info("Applying Exclusion Model")

        our_proteins_sorted = self.get_sorted_identifiers(scored=False)

        if protein_subset_type == "hard":
            # Hard protein subsetting defines protein subsets on the digest level (Entire protein is used)
            # This is how Percolator PI does subsetting
            peptides = [self.digest.protein_to_peptide_dictionary[x] for x in our_proteins_sorted]
        elif protein_subset_type == "soft":
            # Soft protein subsetting defines protein subsets on the Peptides identified from the search
            peptides = [set(x.raw_peptides) for x in self.scoring_input]
        else:
            # If neither is defined we do "hard" exclusion
            peptides = [self.digest.protein_to_peptide_dictionary[x] for x in our_proteins_sorted]

        # Get frozen set of peptides....
        # We will also have a corresponding list of proteins...
        # They will have the same index...
        peptide_sets = [frozenset(e) for e in peptides]
        # Find a way to sort this list of sets...
        # We can sort the sets if we sort proteins from above...
        logger.info("{} number of peptide sets".format(len(peptide_sets)))
        non_subset_peptide_sets = set()
        i = 0
        # Get all peptide sets that are not a subset...
        while peptide_sets:
            i = i + 1
            peptide_set = peptide_sets.pop()
            if any(peptide_set.issubset(s) for s in peptide_sets) or any(
                peptide_set.issubset(s) for s in non_subset_peptide_sets
            ):
                continue
            else:
                non_subset_peptide_sets.add(peptide_set)
            if i % 10000 == 0:
                logger.info("Parsed {} Peptide Sets".format(i))

        logger.info("Parsed {} Peptide Sets".format(i))

        # Get their index from peptides which is the initial list of sets...
        list_of_indeces = []
        for pep_sets in non_subset_peptide_sets:
            ind = peptides.index(pep_sets)
            list_of_indeces.append(ind)

        non_subset_proteins = set([our_proteins_sorted[x] for x in list_of_indeces])

        logger.info("Removing direct subset Proteins from the data")
        # Remove all proteins from scoring input that are a subset of another protein...
        self.scoring_input = [x for x in self.scoring_input if x.identifier in non_subset_proteins]

        logger.info("{} proteins in scoring input after removing subset proteins".format(len(self.scoring_input)))

        # For all the proteins that are not a complete subset of another protein...
        # Get the raw peptides...
        raw_peps = [x.raw_peptides for x in self.scoring_input if x.identifier in non_subset_proteins]

        # Make the raw peptides a flat list
        flat_peptides = [Psm.split_peptide(peptide_string=item) for sublist in raw_peps for item in sublist]

        # Count the number of peptides in this list...
        # This is the number of proteins this peptide maps to....
        counted_peptides = collections.Counter(flat_peptides)

        # Keep only peptides counted once; peptides shared between proteins are excluded from scoring input
        raw_peps_good = set([x for x in counted_peptides.keys() if counted_peptides[x] <= 1])

        # Alter self.scoring_input by removing psms and peptides that are not in raw_peps_good
        current_score_input = list(self.scoring_input)
        for j in range(len(current_score_input)):
            k = j + 1
            psm_list = []
            new_raw_peptides = []
            current_psms = current_score_input[j].psms
            current_raw_peptides = current_score_input[j].raw_peptides

            for psm_scores in current_psms:
                if psm_scores.non_flanking_peptide in raw_peps_good:
                    psm_list.append(psm_scores)

            for rp in current_raw_peptides:
                if Psm.split_peptide(peptide_string=rp) in raw_peps_good:
                    new_raw_peptides.append(rp)

            current_score_input[j].psms = psm_list
            current_score_input[j].raw_peptides = new_raw_peptides

            if k % 10000 == 0:
                logger.info("Redefined {} Peptide Sets".format(k))

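        # Log the total count; the loop variable j is the last index and is undefined for empty input
        logger.info("Redefined {} Peptide Sets".format(len(current_score_input)))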

        filtered_score_input = [x for x in current_score_input if x.psms]

        self.scoring_input = filtered_score_input

        # Recompute the flat peptides
        raw_peps = [x.raw_peptides for x in self.scoring_input if x.identifier in non_subset_proteins]

        # Make the raw peptides a flat list
        new_flat_peptides = set([Psm.split_peptide(peptide_string=item) for sublist in raw_peps for item in sublist])

        self.scoring_input = [x for x in self.scoring_input if x.psms]

        self.restricted_peptides = [x for x in self.restricted_peptides if x in new_flat_peptides]

    def protein_picker(self):
        """
        Method to run the protein picker algorithm.

        Proteins must be scored first with [score_psms][pyproteininference.scoring.Score.score_psms].

        The algorithm will match target and decoy proteins identified from the PSMs from the search.
        If a target and matching decoy is found then target/decoy competition is performed.
        In the Target/Decoy pair the protein with the better score is kept and the one with the worse score is
        discarded from the analysis.

        The method sets the `picked_proteins_scored` and `picked_proteins_removed` variables for
        the DataStore object.

        Returns:
            None:

        Example:
            >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
            >>> data.protein_picker()
        """

        self._validate_scored_proteins()

        logger.info("Running Protein Picker")

        # Use the higher_or_lower method to determine if a higher protein score or lower protein score is better
        # based on the scoring method used
        higher_or_lower = self.higher_or_lower()
        # Here we determine if a lower or higher score is better
        # Since all input is ordered from best to worst we can do the following

        index_to_remove = []
        # data.scored_proteins is simply a list of Protein objects...
        # Create list of all decoy proteins
        decoy_proteins = [x.identifier for x in self.scored_proteins if self.decoy_symbol in x.identifier]
        # Create a list of all potential matching targets (some of these may not exist in the search)
        matching_targets = [x.replace(self.decoy_symbol, "") for x in decoy_proteins]

        # Create a list of all the proteins from the scored data
        all_proteins = [x.identifier for x in self.scored_proteins]
        logger.info("{} proteins scored".format(len(all_proteins)))

        total_targets = []
        total_decoys = []
        decoys_removed = []
        targets_removed = []
        # Loop over all decoys identified in the search
        logger.info("Picking Proteins...")
        for i in range(len(decoy_proteins)):
            cur_decoy_index = all_proteins.index(decoy_proteins[i])
            cur_decoy_protein_object = self.scored_proteins[cur_decoy_index]
            total_decoys.append(cur_decoy_protein_object.identifier)

            # Try, Except here because the matching target to the decoy may not be a result from the search
            try:
                cur_target_index = all_proteins.index(matching_targets[i])
                cur_target_protein_object = self.scored_proteins[cur_target_index]
                total_targets.append(cur_target_protein_object.identifier)

                if higher_or_lower == self.HIGHER_PSM_SCORE:
                    if cur_target_protein_object.score > cur_decoy_protein_object.score:
                        index_to_remove.append(cur_decoy_index)
                        decoys_removed.append(cur_decoy_index)
                        cur_target_protein_object.picked = True
                        cur_decoy_protein_object.picked = False
                    else:
                        index_to_remove.append(cur_target_index)
                        targets_removed.append(cur_target_index)
                        cur_decoy_protein_object.picked = True
                        cur_target_protein_object.picked = False

                if higher_or_lower == self.LOWER_PSM_SCORE:
                    if cur_target_protein_object.score < cur_decoy_protein_object.score:
                        index_to_remove.append(cur_decoy_index)
                        decoys_removed.append(cur_decoy_index)
                        cur_target_protein_object.picked = True
                        cur_decoy_protein_object.picked = False
                    else:
                        index_to_remove.append(cur_target_index)
                        targets_removed.append(cur_target_index)
                        cur_decoy_protein_object.picked = True
                        cur_target_protein_object.picked = False
            except ValueError:
                pass

        logger.info("{} total decoy proteins".format(len(total_decoys)))
        logger.info("{} matching target proteins also found in search".format(len(total_targets)))
        logger.info("{} decoy proteins to be removed".format(len(decoys_removed)))
        logger.info("{} target proteins to be removed".format(len(targets_removed)))

        logger.info("Removing Lower Scoring Proteins...")
        picked_list = []
        removed_proteins = []
        for protein_objects in self.scored_proteins:
            if protein_objects.picked:
                picked_list.append(protein_objects)
            else:
                removed_proteins.append(protein_objects)
        self.picked_proteins_scored = picked_list
        self.picked_proteins_removed = removed_proteins
        logger.info("Finished Removing Proteins")

    def calculate_q_values(self, regular=True):
        """
        Method calculates Q values (FDR) using the lead protein of each group in the
        `protein_group_objects` instance variable.
        FDR is calculated as (2*decoys)/total if regular is set to True and as
        (decoys)/total if regular is set to False.

        This method updates the `protein_group_objects` for the DataStore object by updating
        the q_value variable of the [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.

        Args:
            regular (bool): If True, FDR is calculated as (2*decoys)/total. If False, FDR is
                calculated as (decoys)/total.

        Returns:
            None:

        Example:
            >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
            >>> # Data must be scored first
            >>> data.calculate_q_values()
        """

        self._validate_protein_group_objects()

        logger.info("Calculating Q values from the protein group objects")

        # pick out the lead scoring protein for each group... lead score is at 0 position
        lead_score = [x.proteins[0] for x in self.protein_group_objects]
        # Now pick out only the lead protein identifiers
        lead_proteins = [x.identifier for x in lead_score]

        lead_proteins.reverse()

        logger.info("Calculating FDRs")
        fdr_list = []
        for i in range(len(lead_proteins)):
            binary_decoy_target_list = [1 if self.decoy_symbol in elem else 0 for elem in lead_proteins]
            total = len(lead_proteins)
            decoys = sum(binary_decoy_target_list)
            # Calculate FDR at every step starting with the entire list...
            # Delete first entry (worst score) every time we go through a cycle
            if regular:
                fdr = (2 * decoys) / (float(total))
            else:
                fdr = (decoys) / (float(total))
            fdr_list.append(fdr)
            del lead_proteins[0]

        qvalue_list = []
        new_fdr_list = []
        logger.info("Calculating Q Values")
        for fdrs in fdr_list:
            new_fdr_list.append(fdrs)
            qvalue = min(new_fdr_list)
            # qvalue = fdrs
            qvalue_list.append(qvalue)

        qvalue_list.reverse()

        logger.info("Assigning Q Values")
        for k in range(len(self.protein_group_objects)):
            self.protein_group_objects[k].q_value = qvalue_list[k]

        fdr_restricted = [x for x in self.protein_group_objects if x.q_value <= self.parameter_file_object.fdr]

        fdr_restricted_set = [self.grouped_scored_proteins[x] for x in range(len(fdr_restricted))]

        onehitwonders = []
        for groups in fdr_restricted_set:
            if int(groups[0].num_peptides) == 1:
                onehitwonders.append(groups[0])

        logger.info(
            "Protein Group leads that pass with more than 1 PSM with a {} FDR = {}".format(
                self.parameter_file_object.fdr,
                str(len(fdr_restricted_set) - len(onehitwonders)),
            )
        )
        logger.info(
            "Protein Group lead One hit Wonders that pass {} FDR = {}".format(
                self.parameter_file_object.fdr, len(onehitwonders)
            )
        )

        logger.info(
            "Number of Protein groups that pass a {} percent FDR: {}".format(
                str(self.parameter_file_object.fdr * 100), len(fdr_restricted_set)
            )
        )

        logger.info("Finished Q value Calculation")

    def validate_psm_data(self):
        """
        Method that validates the PSM data.
        """
        self._validate_decoys_from_data()
        self._validate_isoform_from_data()

    def validate_digest(self):
        """
        Method that validates the [Digest object][pyproteininference.in_silico_digest.Digest].
        """
        self._validate_reviewed_v_unreviewed()
        self._check_target_decoy_split()

    def check_data_consistency(self):
        """
        Method that checks for data consistency.
        """
        self._check_data_digest_overlap_psms()
        self._check_data_digest_overlap_proteins()

    def _check_data_digest_overlap_psms(self):
        """
        Method that logs the overlap between the digested fasta file and the input files on the PSM level.
        """
        peptides = [x.stripped_peptide for x in self.main_data_form]
        peptides_in_digest = set(self.digest.peptide_to_protein_dictionary.keys())
        peptides_from_search_in_digest = [x for x in peptides if x in peptides_in_digest]
        percentage = float(len(set(peptides))) / float(len(set(peptides_from_search_in_digest)))
        logger.info("{} PSMs identified from input files".format(len(peptides)))
        logger.info(
            "{} PSMs identified from input files that are also present in database digestion".format(
                len(peptides_from_search_in_digest)
            )
        )
        logger.info(
            "{}; ratio of PSMs identified from input files to those that are present in the search"
            " and in the database digestion".format(percentage)
        )

    def _check_data_digest_overlap_proteins(self):
        """
        Method that logs the overlap between the digested fasta file and the input files on the Protein level.
        """
        proteins = [x.possible_proteins for x in self.main_data_form]
        flat_proteins = set([item for sublist in proteins for item in sublist])
        proteins_in_digest = set(self.digest.protein_to_peptide_dictionary.keys())
        proteins_from_search_in_digest = [x for x in flat_proteins if x in proteins_in_digest]
        percentage = float(len(flat_proteins)) / float(len(proteins_from_search_in_digest))
        logger.info("{} proteins identified from input files".format(len(flat_proteins)))
        logger.info(
            "{} proteins identified from input files that are also present in database digestion".format(
                len(proteins_from_search_in_digest)
            )
        )
        logger.info(
            "{}; ratio of proteins identified from input files that are also present in database digestion".format(
                percentage
            )
        )

    def _check_target_decoy_split(self):
        """
        Method that logs the number of target and decoy proteins from the digest.
        """
        # Check the number of targets vs the number of decoys from the digest
        targets = [
            x
            for x in self.digest.protein_to_peptide_dictionary.keys()
            if self.parameter_file_object.decoy_symbol not in x
        ]
        decoys = [
            x for x in self.digest.protein_to_peptide_dictionary.keys() if self.parameter_file_object.decoy_symbol in x
        ]
        ratio = float(len(targets)) / float(len(decoys))
        logger.info("Number of Target Proteins in Digest: {}".format(len(targets)))
        logger.info("Number of Decoy Proteins in Digest: {}".format(len(decoys)))
        logger.info("Ratio of Target Proteins to Decoy Proteins: {}".format(ratio))

    def _validate_decoys_from_data(self):
        """
        Method that checks to make sure that target and decoy proteins exist in the data files.
        """
        # Check to see if we find decoys from our input files
        proteins = [x.possible_proteins for x in self.main_data_form]
        flat_proteins = set([item for sublist in proteins for item in sublist])
        targets = [x for x in flat_proteins if self.parameter_file_object.decoy_symbol not in x]
        decoys = [x for x in flat_proteins if self.parameter_file_object.decoy_symbol in x]
        logger.info("Number of Target Proteins in Data Files: {}".format(len(targets)))
        logger.info("Number of Decoy Proteins in Data Files: {}".format(len(decoys)))

    def _validate_isoform_from_data(self):
        """
        Method that validates whether or not isoforms are able to be identified in the data files.
        """
        # Check to see if we find any proteins with isoform info in name in our input files
        proteins = [x.possible_proteins for x in self.main_data_form]
        flat_proteins = set([item for sublist in proteins for item in sublist])
        if self.parameter_file_object.isoform_symbol:
            non_iso = [x for x in flat_proteins if self.parameter_file_object.isoform_symbol not in x]

        else:
            non_iso = [x for x in flat_proteins]

        if self.parameter_file_object.isoform_symbol:
            iso = [x for x in flat_proteins if self.parameter_file_object.isoform_symbol in x]

        else:
            iso = []
        logger.info("Number of Non Isoform Labeled Proteins in Data Files: {}".format(len(non_iso)))
        logger.info("Number of Isoform Labeled Proteins in Data Files: {}".format(len(iso)))

    def _validate_reviewed_v_unreviewed(self):
        """
        Method that logs whether or not we can distinguish between reviewed and unreviewed protein identifiers
        in the digest.
        """
        # Check to see if we get reviewed prots in digest...
        reviewed_proteins = len(self.digest.swiss_prot_protein_set)
        proteins_in_digest = len(set(self.digest.protein_to_peptide_dictionary.keys()))
        unreviewed_proteins = proteins_in_digest - reviewed_proteins
        logger.info("Number of Total Proteins from Digest: {}".format(proteins_in_digest))
        logger.info("Number of Reviewed Proteins from Digest: {}".format(reviewed_proteins))
        logger.info("Number of Unreviewed Proteins from Digest: {}".format(unreviewed_proteins))

    @classmethod
    def sort_protein_strings(cls, protein_string_list, sp_proteins, decoy_symbol):
        """
        Method that sorts protein strings in the following order: Target Reviewed, Decoy Reviewed, Target Unreviewed,
         Decoy Unreviewed.

        Args:
            protein_string_list (list): List of Protein Strings.
            sp_proteins (set): Set of Reviewed Protein Strings.
            decoy_symbol (str): Symbol to denote a decoy protein identifier, e.g. "##".

        Returns:
            list: List of sorted protein strings.

        Example:
            >>> list_of_group_objects = datastore.DataStore.sort_protein_strings(
            >>>     protein_string_list=protein_string_list, sp_proteins=sp_proteins, decoy_symbol="##"
            >>> )
        """

        our_target_sp_proteins = sorted([x for x in protein_string_list if x in sp_proteins and decoy_symbol not in x])
        our_decoy_sp_proteins = sorted([x for x in protein_string_list if x in sp_proteins and decoy_symbol in x])

        our_target_tr_proteins = sorted(
            [x for x in protein_string_list if x not in sp_proteins and decoy_symbol not in x]
        )
        our_decoy_tr_proteins = sorted([x for x in protein_string_list if x not in sp_proteins and decoy_symbol in x])

        identifiers_sorted = (
            our_target_sp_proteins + our_decoy_sp_proteins + our_target_tr_proteins + our_decoy_tr_proteins
        )

        return identifiers_sorted

    def input_has_q(self):
        """
        Method that checks to see if the input data has q values.
        """
        len_q = len([x.qvalue for x in self.main_data_form if x.qvalue])
        len_all = len(self.main_data_form)
        if len_q == len_all:
            status = True
            logger.info("Input has Q value; Can restrict by Q value")
        else:
            status = False
            logger.warning("Input does not have Q value; Cannot restrict by Q value")

        return status

    def input_has_pep(self):
        """
        Method that checks to see if the input data has pep values.
        """
        len_pep = len([x.pepvalue for x in self.main_data_form if x.pepvalue])
        len_all = len(self.main_data_form)
        if len_pep == len_all:
            status = True
            logger.info("Input has Pep value; Can restrict by Pep value")
        else:
            status = False
            logger.warning("Input does not have Pep value; Cannot restrict by Pep value")

        return status

    def input_has_custom(self):
        """
        Method that checks to see if the input data has custom score values.
        """
        len_c = len([x.custom_score for x in self.main_data_form if x.custom_score])
        len_all = len(self.main_data_form)
        if len_c == len_all:
            status = True
            logger.info("Input has Custom value; Can restrict by Custom value")

        else:
            status = False
            logger.warning("Input does not have Custom value; Cannot restrict by Custom value")

        return status

    def get_protein_objects(self, false_discovery_rate=None, fdr_restricted=False):
        """
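        Method retrieves protein objects. Either retrieves an FDR restricted list of protein objects,
        or retrieves all objects.

        Args:
            false_discovery_rate (float): FDR cutoff used when `fdr_restricted` is True.
                Defaults to the `fdr` value of the parameter file object.
            fdr_restricted (bool): True/False on whether to restrict the list of objects based on FDR.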

        Returns:
            list: List of scored [ProteinGroup][pyproteininference.physical.ProteinGroup] objects
                that have been grouped and sorted.

        """
        if not false_discovery_rate:
            false_discovery_rate = self.parameter_file_object.fdr
        if fdr_restricted:
            protein_objects = [x.proteins for x in self.protein_group_objects if x.q_value <= false_discovery_rate]
        else:
            protein_objects = self.grouped_scored_proteins

        return protein_objects

    def _init_validate(self, reader):
        """
        Internal Method that checks to make sure the reader object is properly loaded and validated.
        """
        if reader.psms:
            self.main_data_form = reader.psms  # Unrestricted PSM data
            self.restricted_peptides = [x.non_flanking_peptide for x in self.main_data_form]
        else:
            raise ValueError(
                "Psms variable from Reader object is either empty or does not exist. "
                "Make sure your files contain proper data and that you run the 'read_psms' "
                "method on your Reader object."
            )

    def _validate_main_data_form(self):
        """
        Internal Method that checks to make sure the Main data has been defined to run DataStore methods.
        """
        if self.main_data_form:
            pass
        else:
            raise ValueError(
                "Main Data is not defined, thus method cannot be run. Please make sure PSM data is properly"
                " loaded from the Reader object"
            )

    def _validate_main_data_restricted(self):
        """
        Internal Method that checks to make sure the Main data Restricted has been defined to run DataStore methods.
        """
        if self.main_data_restricted:
            pass
        else:
            raise ValueError(
                "Main Data Restricted is not defined, thus method cannot be run. Please make sure PSM data is properly"
                " loaded from the Reader object and make sure to run DataStore method 'restrict_psm_data'."
            )

    def _validate_scored_proteins(self):
        """
        Internal Method that checks to make sure that proteins have been scored to run certain subsequent methods.
        """
        if self.picked_proteins_scored or self.scored_proteins:
            pass
        else:
            raise ValueError(
                "Proteins have not been scored, Please initialize a Score object and run a score method with"
                " 'score_psms' instance method."
            )

    def _validate_scoring_input(self):
        """
        Internal Method that checks to make sure that Scoring Input has been created to be able to run scoring methods.
        """
        if self.scoring_input:
            pass
        else:
            raise ValueError(
                "Scoring input has not been created, Please run 'create_scoring_input' method from the DataStore "
                "object to continue."
            )

    def _validate_protein_group_objects(self):
        """
        Internal Method that checks to make sure inference has been run before proceeding.
        """
        if self.protein_group_objects and self.grouped_scored_proteins:
            pass
        else:
            raise ValueError(
                "Either 'protein_group_objects' or 'grouped_scored_proteins' or both DataStore variables are undefined."
                " Please make sure you run an inference method from the Inference class before proceeding."
            )

    def generate_fdr_vs_target_hits(self, fdr_max=0.2):
        """
        Method for calculating FDR vs number of Target Proteins.

        Args:
            fdr_max (float): The maximum false discovery rate to calculate target hits for.
                Will stop once fdr_max is reached.

        Returns:
            list: List of lists of: (FDR, Number of Target Hits). Ordered by increasing number of Target Hits.

        """
        fdr_vs_count = []
        count_list = []
        for pg in self.protein_group_objects:
            if self.decoy_symbol not in pg.proteins[0].identifier:
                count_list.append(pg)
            fdr_vs_count.append([pg.q_value, len(count_list)])

        fdr_vs_count = [x for x in fdr_vs_count if x[0] < fdr_max]

        return fdr_vs_count

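    def recover_mapping(self):
        """
        Method that finds proteins present in the input files but missing from the database digest and adds them,
        along with the peptides that map to them, to the digest mapping dictionaries.
        """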
        logger.info("Recovering Proteins that exist in the input files but not in the database digest.")
        all_psms = self.get_psm_data()
        proteins = [x.possible_proteins for x in all_psms]
        flat_proteins = [item for sublist in proteins for item in sublist]

        missing_prots = []
        for prot in flat_proteins:
            try:
                self.digest.protein_to_peptide_dictionary[prot]
            except KeyError:
                missing_prots.append(prot)

                psm_data = self.get_psm_data()
                peptides = [x.stripped_peptide for x in psm_data if prot in x.possible_proteins]
                for pep in peptides:
                    self.digest.peptide_to_protein_dictionary.setdefault(pep, set()).add(prot)
                    self.digest.protein_to_peptide_dictionary.setdefault(prot, set()).add(pep)
        if missing_prots:
            logger.info(
                "{} proteins not found in mapping objects, please double check that your database"
                " provided is accurate for the given input data.".format(len(missing_prots))
            )
        else:
            logger.info("No missing proteins in the mapping objects.")
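
The recovery step in recover_mapping can be sketched outside the class as follows. This is a minimal illustration, not the library's implementation: the function name `recover_missing_proteins` and the use of plain `(peptide, proteins)` tuples in place of Psm objects are assumptions made for the example.

```python
def recover_missing_proteins(protein_to_peptides, peptide_to_proteins, psms):
    """Add proteins seen in the PSMs but absent from the digest maps.

    `psms` is a hypothetical list of (stripped_peptide, possible_proteins) tuples
    standing in for Psm objects. Both dictionaries are updated in place.
    """
    missing = []
    for _, proteins in psms:
        for protein in proteins:
            if protein in protein_to_peptides:
                continue
            missing.append(protein)
            # Register every peptide that lists this protein, in both directions
            for peptide, candidates in psms:
                if protein in candidates:
                    peptide_to_proteins.setdefault(peptide, set()).add(protein)
                    protein_to_peptides.setdefault(protein, set()).add(peptide)
    return missing


protein_map = {"P1": {"AAK"}}
peptide_map = {"AAK": {"P1"}}
observed = [("AAK", ["P1", "P2"]), ("CCK", ["P2"])]
missing_proteins = recover_missing_proteins(protein_map, peptide_map, observed)
```

After the first time a missing protein is registered it is present in the mapping, so later PSMs that mention it do not count it again.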

__init__(self, reader, digest, validate=True) special

Parameters:
  • reader (Reader) – Reader object.

  • digest (Digest) – Digest object.

  • validate (bool) – True/False to indicate if the input data should be validated.

Examples:

>>> pyproteininference.datastore.DataStore(reader = reader, digest=digest)
Source code in pyproteininference/datastore.py
def __init__(self, reader, digest, validate=True):
    """

    Args:
        reader (Reader): Reader object [Reader][pyproteininference.reader.Reader].
        digest (Digest): Digest object
            [Digest][pyproteininference.in_silico_digest.Digest].
        validate (bool): True/False to indicate if the input data should be validated.

    Example:
        >>> pyproteininference.datastore.DataStore(reader = reader, digest=digest)


    """
    # If the reader class is from a percolator.psms then define main_data_form as reader.psms
    # main_data_form is the starting point for all other analyses
    self._init_validate(reader=reader)

    self.parameter_file_object = reader.parameter_file_object  # Parameter object
    self.main_data_restricted = None  # PSM data post restriction
    self.scored_proteins = []  # List of scored Protein objects
    self.grouped_scored_proteins = []  # List of sorted scored Protein objects
    self.scoring_input = None  # List of non scored Protein objects
    self.picked_proteins_scored = None  # List of Protein objects after picker algorithm
    self.picked_proteins_removed = None  # Protein objects removed via picker
    self.protein_peptide_dictionary = None
    self.peptide_protein_dictionary = None
    self.high_low_better = None  # Variable that indicates whether a higher or lower protein score is better
    self.psm_score = None  # PSM Score used
    self.protein_score = None
    self.short_protein_score = None
    self.protein_group_objects = []  # List of sorted protein group objects
    self.decoy_symbol = self.parameter_file_object.decoy_symbol  # Decoy symbol from parameter file
    self.digest = digest  # Digest object

    # Run Checks and Validations
    if validate:
        self.validate_psm_data()
        self.validate_digest()
        self.check_data_consistency()

    # Run method to fix our parameter object if necessary
    self.parameter_file_object.fix_parameters_from_datastore(data=self)

calculate_q_values(self, regular=True)

Method calculates q-values (FDR-based) for the lead protein of each group in the protein_group_objects instance variable. FDR is calculated as (2*decoys)/total if regular is set to True and as (decoys)/total if regular is set to False.

This method updates the protein_group_objects for the DataStore object by updating the q_value variable of the ProteinGroup objects.

Returns:
  • None

Examples:

>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> # Data must be scored first
>>> data.calculate_q_values()
Source code in pyproteininference/datastore.py
def calculate_q_values(self, regular=True):
    """
    Method calculates q-values (FDR-based) for the lead protein of each group in the
    `protein_group_objects` instance variable.
    FDR is calculated as (2*decoys)/total if regular is set to True and as
    (decoys)/total if regular is set to False.

    This method updates the `protein_group_objects` for the DataStore object by updating
    the q_value variable of the [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.

    Returns:
        None:

    Example:
        >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
        >>> # Data must be scored first
        >>> data.calculate_q_values()
    """

    self._validate_protein_group_objects()

    logger.info("Calculating Q values from the protein group objects")

    # pick out the lead scoring protein for each group... lead score is at 0 position
    lead_score = [x.proteins[0] for x in self.protein_group_objects]
    # Now pick out only the lead protein identifiers
    lead_proteins = [x.identifier for x in lead_score]

    lead_proteins.reverse()

    logger.info("Calculating FDRs")
    fdr_list = []
    for i in range(len(lead_proteins)):
        binary_decoy_target_list = [1 if self.decoy_symbol in elem else 0 for elem in lead_proteins]
        total = len(lead_proteins)
        decoys = sum(binary_decoy_target_list)
        # Calculate FDR at every step starting with the entire list...
        # Delete first entry (worst score) every time we go through a cycle
        if regular:
            fdr = (2 * decoys) / (float(total))
        else:
            fdr = (decoys) / (float(total))
        fdr_list.append(fdr)
        del lead_proteins[0]

    qvalue_list = []
    new_fdr_list = []
    logger.info("Calculating Q Values")
    for fdrs in fdr_list:
        new_fdr_list.append(fdrs)
        qvalue = min(new_fdr_list)
        # qvalue = fdrs
        qvalue_list.append(qvalue)

    qvalue_list.reverse()

    logger.info("Assigning Q Values")
    for k in range(len(self.protein_group_objects)):
        self.protein_group_objects[k].q_value = qvalue_list[k]

    fdr_restricted = [x for x in self.protein_group_objects if x.q_value <= self.parameter_file_object.fdr]

    fdr_restricted_set = [self.grouped_scored_proteins[x] for x in range(len(fdr_restricted))]

    onehitwonders = []
    for groups in fdr_restricted_set:
        if int(groups[0].num_peptides) == 1:
            onehitwonders.append(groups[0])

    logger.info(
        "Protein Group leads that pass with more than 1 PSM with a {} FDR = {}".format(
            self.parameter_file_object.fdr,
            str(len(fdr_restricted_set) - len(onehitwonders)),
        )
    )
    logger.info(
        "Protein Group lead One hit Wonders that pass {} FDR = {}".format(
            self.parameter_file_object.fdr, len(onehitwonders)
        )
    )

    logger.info(
        "Number of Protein groups that pass a {} percent FDR: {}".format(
            str(self.parameter_file_object.fdr * 100), len(fdr_restricted_set)
        )
    )

    logger.info("Finished Q value Calculation")
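
The FDR and q-value arithmetic in calculate_q_values can be illustrated with a small standalone sketch. It is an assumption-labeled example rather than the library code: the function name `q_values_from_leads` is hypothetical, and it takes a plain list of lead protein identifiers sorted from best to worst score. It computes the same values as the loop above in a single pass, by counting decoys cumulatively and then taking the minimum FDR over each rank and all worse ranks.

```python
def q_values_from_leads(lead_identifiers, decoy_symbol, regular=True):
    """Return one q-value per lead identifier (hypothetical helper).

    `lead_identifiers` must be ordered from best to worst score.
    """
    factor = 2.0 if regular else 1.0

    # FDR at each rank: (factor * decoys so far) / (entries so far)
    fdrs = []
    decoys = 0
    for rank, identifier in enumerate(lead_identifiers, start=1):
        if decoy_symbol in identifier:
            decoys += 1
        fdrs.append(factor * decoys / rank)

    # q-value: the smallest FDR at this rank or at any worse rank
    q_values = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q_values.append(running_min)
    q_values.reverse()
    return q_values


leads = ["A", "B", "decoy_C", "D"]
regular_q = q_values_from_leads(leads, decoy_symbol="decoy_")
plain_q = q_values_from_leads(leads, decoy_symbol="decoy_", regular=False)
```

With these four leads, the decoy at rank 3 gives an FDR of 2/3 there, but the q-value assigned to it is 0.5 because the FDR at rank 4 is lower.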

check_data_consistency(self)

Method that checks for data consistency.

Source code in pyproteininference/datastore.py
def check_data_consistency(self):
    """
    Method that checks for data consistency.
    """
    self._check_data_digest_overlap_psms()
    self._check_data_digest_overlap_proteins()

create_scoring_input(self)

Method to create the scoring input. This method initializes a list of Protein objects to get them ready to be scored by Score methods. This method also takes into account the inference type and aggregates peptides -> proteins accordingly.

This method sets the scoring_input and psm_score attributes for the DataStore object.

The score selected comes from the protein inference parameter object.

Returns:
  • None

Examples:

>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> data.create_scoring_input()
Source code in pyproteininference/datastore.py
def create_scoring_input(self):
    """
    Method to create the scoring input.
    This method initializes a list of [Protein][pyproteininference.physical.Protein] objects to get them ready
    to be scored by [Score][pyproteininference.scoring.Score] methods.
    This method also takes into account the inference type and aggregates peptides -> proteins accordingly.

    This method sets the `scoring_input` and `psm_score` attributes for the DataStore object.

    The score selected comes from the protein inference parameter object.

    Returns:
        None:

    Example:
        >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
        >>> data.create_scoring_input()
    """

    logger.info("Creating Scoring Input")

    psm_data = self.get_psm_data()

    protein_psm_dict = collections.defaultdict(list)

    try:
        score_key = self.SCORE_MAPPER[self.parameter_file_object.psm_score]
    except KeyError:
        score_key = self.CUSTOM_SCORE_KEY

    if self.parameter_file_object.inference_type != Inference.PEPTIDE_CENTRIC:
        # Loop through all Psms
        for psms in psm_data:
            psms.assign_main_score(score=score_key)
            # Loop through all proteins
            for prots in psms.possible_proteins:
                protein_psm_dict[prots].append(psms)

    else:
        self.peptide_to_protein_dictionary()
        sp_proteins = self.digest.swiss_prot_protein_set
        for psms in psm_data:

            # Assign main score
            psms.assign_main_score(score=score_key)
            protein_set = self.peptide_protein_dictionary[psms.non_flanking_peptide]
            # Sort protein_set by sp-alpha, decoy-sp-alpha, tr-alpha, decoy-tr-alpha
            sorted_protein_list = self.sort_protein_strings(
                protein_string_list=protein_set,
                sp_proteins=sp_proteins,
                decoy_symbol=self.parameter_file_object.decoy_symbol,
            )
            # Restrict the number of identifiers by the value in param file max_identifiers_peptide_centric
            sorted_protein_list = sorted_protein_list[: self.parameter_file_object.max_identifiers_peptide_centric]
            protein_name = ";".join(sorted_protein_list)
            protein_psm_dict[protein_name].append(psms)

    protein_list = []
    for pkey in sorted(protein_psm_dict.keys()):
        protein_object = Protein(identifier=pkey)
        protein_object.psms = protein_psm_dict[pkey]
        protein_object.raw_peptides = set([x.identifier for x in protein_psm_dict[pkey]])
        protein_list.append(protein_object)

    self.psm_score = self.parameter_file_object.psm_score
    self.scoring_input = protein_list
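
For inference types other than peptide-centric, the aggregation in create_scoring_input amounts to grouping PSMs under every protein they could map to. A minimal sketch of that grouping follows; `PsmStub` and `group_psms_by_protein` are hypothetical names used only for illustration and are not part of pyproteininference.

```python
import collections

# Hypothetical stand-in for a Psm object, carrying only the fields used here
PsmStub = collections.namedtuple("PsmStub", ["identifier", "possible_proteins"])


def group_psms_by_protein(psms):
    """Map each protein identifier to the list of PSMs that may belong to it."""
    grouped = collections.defaultdict(list)
    for psm in psms:
        for protein in psm.possible_proteins:
            grouped[protein].append(psm)
    # Return a plain dict with keys in sorted order, as the method iterates sorted keys
    return {protein: grouped[protein] for protein in sorted(grouped)}


example_psms = [
    PsmStub("K.AAK.C", ["P2", "P1"]),
    PsmStub("R.CCK.D", ["P1"]),
]
psms_by_protein = group_psms_by_protein(example_psms)
```

A PSM with several possible proteins appears under each of them, which is why shared peptides contribute to more than one entry in the scoring input.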

exclude_non_distinguishing_peptides(self, protein_subset_type='hard')

Method to Exclude peptides that are not distinguishing on either the search or database level.

The method sets the scoring_input and restricted_peptides variables for the DataStore object.

Parameters:
  • protein_subset_type (str) – Either "hard" or "soft". "hard" selects distinguishing peptides based on the database digestion. "soft" only uses peptides identified in the search.

Returns:
  • None

Examples:

>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> data.exclude_non_distinguishing_peptides(protein_subset_type="hard")
Source code in pyproteininference/datastore.py
def exclude_non_distinguishing_peptides(self, protein_subset_type="hard"):
    """
    Method to Exclude peptides that are not distinguishing on either the search or database level.

    The method sets the `scoring_input` and `restricted_peptides` variables for the DataStore object.

    Args:
        protein_subset_type (str): Either "hard" or "soft". "hard" selects distinguishing peptides based on
            the database digestion. "soft" only uses peptides identified in the search.

    Returns:
        None:

    Example:
        >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
        >>> data.exclude_non_distinguishing_peptides(protein_subset_type="hard")
    """

    logger.info("Applying Exclusion Model")

    our_proteins_sorted = self.get_sorted_identifiers(scored=False)

    if protein_subset_type == "hard":
        # Hard protein subsetting defines protein subsets on the digest level (Entire protein is used)
        # This is how Percolator PI does subsetting
        peptides = [self.digest.protein_to_peptide_dictionary[x] for x in our_proteins_sorted]
    elif protein_subset_type == "soft":
        # Soft protein subsetting defines protein subsets on the Peptides identified from the search
        peptides = [set(x.raw_peptides) for x in self.scoring_input]
    else:
        # If neither is defined we do "hard" exclusion
        peptides = [self.digest.protein_to_peptide_dictionary[x] for x in our_proteins_sorted]

    # Get frozen set of peptides....
    # We will also have a corresponding list of proteins...
    # They will have the same index...
    peptide_sets = [frozenset(e) for e in peptides]
    # Find a way to sort this list of sets...
    # We can sort the sets if we sort proteins from above...
    logger.info("{} number of peptide sets".format(len(peptide_sets)))
    non_subset_peptide_sets = set()
    i = 0
    # Get all peptide sets that are not a subset...
    while peptide_sets:
        i = i + 1
        peptide_set = peptide_sets.pop()
        if any(peptide_set.issubset(s) for s in peptide_sets) or any(
            peptide_set.issubset(s) for s in non_subset_peptide_sets
        ):
            continue
        else:
            non_subset_peptide_sets.add(peptide_set)
        if i % 10000 == 0:
            logger.info("Parsed {} Peptide Sets".format(i))

    logger.info("Parsed {} Peptide Sets".format(i))

    # Get their index from peptides which is the initial list of sets...
    list_of_indeces = []
    for pep_sets in non_subset_peptide_sets:
        ind = peptides.index(pep_sets)
        list_of_indeces.append(ind)

    non_subset_proteins = set([our_proteins_sorted[x] for x in list_of_indeces])

    logger.info("Removing direct subset Proteins from the data")
    # Remove all proteins from scoring input that are a subset of another protein...
    self.scoring_input = [x for x in self.scoring_input if x.identifier in non_subset_proteins]

    logger.info("{} proteins in scoring input after removing subset proteins".format(len(self.scoring_input)))

    # For all the proteins that are not a complete subset of another protein...
    # Get the raw peptides...
    raw_peps = [x.raw_peptides for x in self.scoring_input if x.identifier in non_subset_proteins]

    # Make the raw peptides a flat list
    flat_peptides = [Psm.split_peptide(peptide_string=item) for sublist in raw_peps for item in sublist]

    # Count the number of peptides in this list...
    # This is the number of proteins this peptide maps to....
    counted_peptides = collections.Counter(flat_peptides)

    # If the count is greater than 1, the peptide is shared between proteins, so exclude it from the scoring input
    raw_peps_good = set([x for x in counted_peptides.keys() if counted_peptides[x] <= 1])

    # Alter self.scoring_input by removing psms and peptides that are not in raw_peps_good
    current_score_input = list(self.scoring_input)
    for j in range(len(current_score_input)):
        k = j + 1
        psm_list = []
        new_raw_peptides = []
        current_psms = current_score_input[j].psms
        current_raw_peptides = current_score_input[j].raw_peptides

        for psm_scores in current_psms:
            if psm_scores.non_flanking_peptide in raw_peps_good:
                psm_list.append(psm_scores)

        for rp in current_raw_peptides:
            if Psm.split_peptide(peptide_string=rp) in raw_peps_good:
                new_raw_peptides.append(rp)

        current_score_input[j].psms = psm_list
        current_score_input[j].raw_peptides = new_raw_peptides

        if k % 10000 == 0:
            logger.info("Redefined {} Peptide Sets".format(k))

    logger.info("Redefined {} Peptide Sets".format(len(current_score_input)))

    filtered_score_input = [x for x in current_score_input if x.psms]

    self.scoring_input = filtered_score_input

    # Recompute the flat peptides
    raw_peps = [x.raw_peptides for x in self.scoring_input if x.identifier in non_subset_proteins]

    # Make the raw peptides a flat list
    new_flat_peptides = set([Psm.split_peptide(peptide_string=item) for sublist in raw_peps for item in sublist])

    self.scoring_input = [x for x in self.scoring_input if x.psms]

    self.restricted_peptides = [x for x in self.restricted_peptides if x in new_flat_peptides]
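
The subset-removal loop at the heart of the exclusion model can be shown in isolation. The sketch below assumes plain Python sets of peptide strings as input, and the function name `non_subset_sets` is hypothetical; it mirrors the pop-and-compare logic above rather than reproducing the full method.

```python
def non_subset_sets(peptide_sets):
    """Return the peptide sets that are not a subset of any other set.

    When two sets are identical, one copy is kept.
    """
    remaining = [frozenset(s) for s in peptide_sets]
    kept = set()
    while remaining:
        candidate = remaining.pop()
        # Skip the candidate if a set still waiting to be checked contains it
        if any(candidate.issubset(other) for other in remaining):
            continue
        # Skip the candidate if a set that was already kept contains it
        if any(candidate.issubset(other) for other in kept):
            continue
        kept.add(candidate)
    return kept


example_sets = [{"AAK", "CCK"}, {"AAK"}, {"DDK"}, {"DDK"}]
distinct_sets = non_subset_sets(example_sets)
```

Each candidate is compared against every remaining and kept set, so the cost grows roughly quadratically with the number of proteins, which is consistent with the progress logging every 10000 sets in the method.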

generate_fdr_vs_target_hits(self, fdr_max=0.2)

Method for calculating FDR vs number of Target Proteins.

Parameters:
  • fdr_max (float) – The maximum false discovery rate to calculate target hits for. Will stop once fdr_max is reached.

Returns:
  • list – List of lists of: (FDR, Number of Target Hits). Ordered by increasing number of Target Hits.

Source code in pyproteininference/datastore.py
def generate_fdr_vs_target_hits(self, fdr_max=0.2):
    """
    Method for calculating FDR vs number of Target Proteins.

    Args:
        fdr_max (float): The maximum false discovery rate to calculate target hits for.
            Will stop once fdr_max is reached.

    Returns:
        list: List of lists of: (FDR, Number of Target Hits). Ordered by increasing number of Target Hits.

    """
    fdr_vs_count = []
    count_list = []
    for pg in self.protein_group_objects:
        if self.decoy_symbol not in pg.proteins[0].identifier:
            count_list.append(pg)
        fdr_vs_count.append([pg.q_value, len(count_list)])

    fdr_vs_count = [x for x in fdr_vs_count if x[0] < fdr_max]

    return fdr_vs_count
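
As a rough illustration of what generate_fdr_vs_target_hits returns, the sketch below runs the same counting on plain tuples. The function name `fdr_vs_target_hits` and the `(identifier, q_value)` input format are assumptions for the example; the real method reads these values from ProteinGroup objects.

```python
def fdr_vs_target_hits(leads, decoy_symbol, fdr_max=0.2):
    """Pair each q-value with the running count of target (non-decoy) leads.

    `leads` is a hypothetical list of (identifier, q_value) tuples ordered best to worst.
    """
    rows = []
    targets = 0
    for identifier, q_value in leads:
        if decoy_symbol not in identifier:
            targets += 1
        rows.append([q_value, targets])
    # Keep only the rows strictly below the FDR ceiling
    return [row for row in rows if row[0] < fdr_max]


scored_leads = [("A", 0.0), ("B", 0.0), ("decoy_C", 0.5), ("D", 0.5)]
curve = fdr_vs_target_hits(scored_leads, decoy_symbol="decoy_")
```

Note that the comparison is strict, so rows whose q-value equals fdr_max are dropped.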

get_pep_values(self)

Method to retrieve a list of all posterior error probabilities for all PSMs.

Returns:
  • list – list of floats (pep values).

Examples:

>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> pep = data.get_pep_values()
Source code in pyproteininference/datastore.py
def get_pep_values(self):
    """
    Method to retrieve a list of all posterior error probabilities for all PSMs.

    Returns:
        list: list of floats (pep values).

    Example:
        >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
        >>> pep = data.get_pep_values()
    """
    psm_data = self.get_psm_data()

    pep_values = [x.pepvalue for x in psm_data]

    return pep_values

get_protein_data(self)

Method to retrieve a list of Protein objects. Retrieves picked and scored data if the data has been picked and scored or just the scored data if the data has not been picked.

Returns:
  • list – list of Protein objects.

Examples:

>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> # Data must be run through a pyproteininference.scoring.Score method
>>> protein_data = data.get_protein_data()
Source code in pyproteininference/datastore.py
def get_protein_data(self):
    """
    Method to retrieve a list of [Protein][pyproteininference.physical.Protein] objects.
    Retrieves picked and scored data if the data has been picked and scored or just the scored data if the data has
     not been picked.

    Returns:
        list: list of [Protein][pyproteininference.physical.Protein] objects.

    Example:
        >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
        >>> # Data must be run through a pyproteininference.scoring.Score method
        >>> protein_data = data.get_protein_data()
    """

    if self.picked_proteins_scored:
        scored_proteins = self.picked_proteins_scored
    else:
        scored_proteins = self.scored_proteins

    return scored_proteins

get_protein_identifiers(self, data_form)

Method to retrieve the protein string identifiers.

Parameters:
  • data_form (str) – Can be one of the following: "main", "restricted", "picked", "picked_removed".

Returns:
  • list – list of protein identifier strings.

Examples:

>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> protein_strings = data.get_protein_identifiers(data_form="main")
Source code in pyproteininference/datastore.py
def get_protein_identifiers(self, data_form):
    """
    Method to retrieve the protein string identifiers.

    Args:
        data_form (str): Can be one of the following: "main", "restricted", "picked", "picked_removed".

    Returns:
        list: list of protein identifier strings.

    Example:
        >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
        >>> protein_strings = data.get_protein_identifiers(data_form="main")
    """
    if data_form == "main":
        # All the data (unrestricted)
        data_to_select = self.main_data_form
        prots = [[x.possible_proteins] for x in data_to_select]
        proteins = prots

    if data_form == "restricted":
        # Proteins that pass certain restriction criteria (peptide length, pep, qvalue)
        data_to_select = self.main_data_restricted
        prots = [[x.possible_proteins] for x in data_to_select]
        proteins = prots

    if data_form == "picked":
        # Here we look at proteins that are 'picked' (aka the proteins that beat out their matching target/decoy)
        data_to_select = self.picked_proteins_scored
        prots = [x.identifier for x in data_to_select]
        proteins = prots

    if data_form == "picked_removed":
        # Here we look at the proteins that were removed due to picking (aka the proteins that
        # have a worse score than their target/decoy counterpart)
        data_to_select = self.picked_proteins_removed
        prots = [x.identifier for x in data_to_select]
        proteins = prots

    return proteins

get_protein_identifiers_from_psm_data(self)

Method to retrieve a list of lists of all possible protein identifiers from the psm data.

Returns:
  • list – list of lists of protein strings.

Examples:

>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> protein_strings = data.get_protein_identifiers_from_psm_data()
Source code in pyproteininference/datastore.py
def get_protein_identifiers_from_psm_data(self):
    """
    Method to retrieve a list of lists of all possible protein identifiers from the psm data.

    Returns:
        list: list of lists of protein strings.

    Example:
        >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
        >>> protein_strings = data.get_protein_identifiers_from_psm_data()
    """
    psm_data = self.get_psm_data()

    proteins = [x.possible_proteins for x in psm_data]

    return proteins

get_protein_information(self, protein_string)

Method to retrieve attributes for a specific scored protein.

Parameters:
  • protein_string (str) – Protein Identifier String.

Returns:
  • list – list of protein attributes.

Examples:

>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> protein_attr = data.get_protein_information(protein_string="RAF1_HUMAN|P04049")
Source code in pyproteininference/datastore.py
def get_protein_information(self, protein_string):
    """
    Method to retrieve attributes for a specific scored protein.

    Args:
        protein_string (str): Protein Identifier String.

    Returns:
        list: list of protein attributes.

    Example:
        >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
        >>> protein_attr = data.get_protein_information(protein_string="RAF1_HUMAN|P04049")
    """
    all_scored_protein_data = self.scored_proteins
    identifiers = [x.identifier for x in all_scored_protein_data]
    protein_scores = [x.score for x in all_scored_protein_data]
    groups = [x.group_identification for x in all_scored_protein_data]
    reviewed = [x.reviewed for x in all_scored_protein_data]
    peptides = [x.peptides for x in all_scored_protein_data]
    # Peptide scores currently broken...
    peptide_scores = [x.peptide_scores for x in all_scored_protein_data]
    picked = [x.picked for x in all_scored_protein_data]
    num_peptides = [x.num_peptides for x in all_scored_protein_data]

    main_index = identifiers.index(protein_string)

    list_structure = [
        [
            "identifier",
            "protein_score",
            "groups",
            "reviewed",
            "peptides",
            "peptide_scores",
            "picked",
            "num_peptides",
        ]
    ]
    list_structure.append([protein_string])
    list_structure[-1].append(protein_scores[main_index])
    list_structure[-1].append(groups[main_index])
    list_structure[-1].append(reviewed[main_index])
    list_structure[-1].append(peptides[main_index])
    list_structure[-1].append(peptide_scores[main_index])
    list_structure[-1].append(picked[main_index])
    list_structure[-1].append(num_peptides[main_index])

    return list_structure

get_protein_information_dictionary(self)

Method to retrieve a dictionary that maps each protein to the scores of its PSMs.

Returns:
  • dict – dictionary of scores for each protein.

Examples:

>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> protein_dict = data.get_protein_information_dictionary()
Source code in pyproteininference/datastore.py
def get_protein_information_dictionary(self):
    """
    Method to retrieve a dictionary that maps each protein to the scores of its PSMs.

    Returns:
        dict: dictionary of scores for each protein.

    Example:
        >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
        >>> protein_dict = data.get_protein_information_dictionary()
    """
    psm_data = self.get_psm_data()

    protein_psm_score_dictionary = collections.defaultdict(list)

    # Loop through all Psms
    for psms in psm_data:
        # Loop through all proteins
        for prots in psms.possible_proteins:
            protein_psm_score_dictionary[prots].append(
                {
                    "peptide": psms.identifier,
                    "Qvalue": psms.qvalue,
                    "PosteriorErrorProbability": psms.pepvalue,
                    "Percscore": psms.percscore,
                }
            )

    return protein_psm_score_dictionary

get_protein_objects(self, false_discovery_rate=None, fdr_restricted=False)

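Method retrieves protein objects. Either retrieves an FDR-restricted list of protein objects, or retrieves all objects.

Parameters:
  • false_discovery_rate (float) – FDR cutoff applied when fdr_restricted is True. Defaults to the fdr value of the protein inference parameter object.

  • fdr_restricted (bool) – True/False on whether to restrict the list of objects based on FDR.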

Returns:
  • list – List of scored ProteinGroup objects that have been grouped and sorted.

Source code in pyproteininference/datastore.py
def get_protein_objects(self, false_discovery_rate=None, fdr_restricted=False):
    """
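    Method retrieves protein objects. Either retrieves an FDR-restricted list of protein objects,
    or retrieves all objects.

    Args:
        false_discovery_rate (float): FDR cutoff applied when fdr_restricted is True. Defaults to the
            fdr value of the protein inference parameter object.
        fdr_restricted (bool): True/False on whether to restrict the list of objects based on FDR.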

    Returns:
        list: List of scored [ProteinGroup][pyproteininference.physical.ProteinGroup] objects
            that have been grouped and sorted.

    """
    if not false_discovery_rate:
        false_discovery_rate = self.parameter_file_object.fdr
    if fdr_restricted:
        protein_objects = [x.proteins for x in self.protein_group_objects if x.q_value <= false_discovery_rate]
    else:
        protein_objects = self.grouped_scored_proteins

    return protein_objects

get_psm_data(self)

Method to retrieve a list of Psm objects. Retrieves restricted data if the data has been restricted or all of the data if the data has not been restricted.

Returns:
  • list – list of Psm objects.

Examples:

>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> psm_data = data.get_psm_data()
Source code in pyproteininference/datastore.py
def get_psm_data(self):
    """
    Method to retrieve a list of [Psm][pyproteininference.physical.Psm] objects.
    Retrieves restricted data if the data has been restricted or all of the data if the data has
    not been restricted.

    Returns:
        list: list of [Psm][pyproteininference.physical.Psm] objects.

    Example:
        >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
        >>> psm_data = data.get_psm_data()
    """
    if not self.main_data_restricted and not self.main_data_form:
        raise ValueError(
            "Both main_data_restricted and main_data_form variables are empty. Please re-load the DataStore "
            "object with a properly loaded Reader object."
        )

    if self.main_data_restricted:
        psm_data = self.main_data_restricted
    else:
        psm_data = self.main_data_form

    return psm_data

get_q_values(self)

Method to retrieve a list of all q values for all PSMs.

Returns:
  • list – list of floats (q values).

Examples:

>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> q = data.get_q_values()
Source code in pyproteininference/datastore.py
def get_q_values(self):
    """
    Method to retrieve a list of all q values for all PSMs.

    Returns:
        list: list of floats (q values).

    Example:
        >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
        >>> q = data.get_q_values()
    """
    psm_data = self.get_psm_data()

    q_values = [x.qvalue for x in psm_data]

    return q_values

get_sorted_identifiers(self, scored=True)

Retrieves a sorted list of protein strings present in the analysis.

Parameters:
  • scored (bool) – True/False to indicate if we should return scored or non-scored identifiers.

Returns:
  • list – List of sorted protein identifier strings.

Examples:

>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> sorted_proteins = data.get_sorted_identifiers(scored=True)
Source code in pyproteininference/datastore.py
def get_sorted_identifiers(self, scored=True):
    """
    Retrieves a sorted list of protein strings present in the analysis.

    Args:
        scored (bool): True/False to indicate if we should return scored or non-scored identifiers.

    Returns:
        list: List of sorted protein identifier strings.

    Example:
        >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
        >>> sorted_proteins = data.get_sorted_identifiers(scored=True)
    """

    if scored:
        self._validate_scored_proteins()
        if self.picked_proteins_scored:
            proteins = set([x.identifier for x in self.picked_proteins_scored])
        else:
            proteins = set([x.identifier for x in self.scored_proteins])
    else:
        self._validate_scoring_input()
        proteins = [x.identifier for x in self.scoring_input]

    all_sp_proteins = set(self.digest.swiss_prot_protein_set)

    our_target_sp_proteins = sorted([x for x in proteins if x in all_sp_proteins and self.decoy_symbol not in x])
    our_decoy_sp_proteins = sorted([x for x in proteins if x in all_sp_proteins and self.decoy_symbol in x])

    our_target_tr_proteins = sorted(
        [x for x in proteins if x not in all_sp_proteins and self.decoy_symbol not in x]
    )
    our_decoy_tr_proteins = sorted([x for x in proteins if x not in all_sp_proteins and self.decoy_symbol in x])

    our_proteins_sorted = (
        our_target_sp_proteins + our_decoy_sp_proteins + our_target_tr_proteins + our_decoy_tr_proteins
    )

    return our_proteins_sorted

higher_or_lower(self)

Method to determine if a higher or lower score is better for a given combination of score input and score type.

This method sets the high_low_better Attribute for the DataStore object.

This method depends on the output from the Score class to be sorted properly from best to worst score.

Returns:
  • str – String indicating "higher" or "lower" depending on if a higher or lower score is a better protein score.

Examples:

>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> high_low = data.higher_or_lower()
Source code in pyproteininference/datastore.py
def higher_or_lower(self):
    """
    Method to determine if a higher or lower score is better for a given combination of score input and score type.

    This method sets the `high_low_better` Attribute for the DataStore object.

    This method depends on the output from the Score class to be sorted properly from best to worst score.

    Returns:
        str: String indicating "higher" or "lower" depending on if a higher or lower score is a
            better protein score.

    Example:
        >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
        >>> high_low = data.higher_or_lower()
    """

    if not self.high_low_better:
        logger.info("Determining If a higher or lower score is better based on scored proteins")
        worst_score = self.scored_proteins[-1].score
        best_score = self.scored_proteins[0].score

        if float(best_score) > float(worst_score):
            higher_or_lower = self.HIGHER_PSM_SCORE

        if float(best_score) < float(worst_score):
            higher_or_lower = self.LOWER_PSM_SCORE

        logger.info("best score = {}".format(best_score))
        logger.info("worst score = {}".format(worst_score))

        if best_score == worst_score:
            raise ValueError(
                "Best and Worst scores were identical, equal to {}. Score type {} produced the error, "
                "please change psm_score type.".format(best_score, self.psm_score)
            )

        self.high_low_better = higher_or_lower

    else:
        higher_or_lower = self.high_low_better

    return higher_or_lower
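
The direction check above can be illustrated on plain numbers. This is a minimal sketch, assuming the scores are already ordered best to worst; `infer_score_direction`, `HIGHER`, and `LOWER` are hypothetical names, and the string values are assumed to correspond to the "higher"/"lower" values documented for `high_low_better`.

```python
# Hypothetical stand-ins for DataStore.HIGHER_PSM_SCORE / DataStore.LOWER_PSM_SCORE.
HIGHER = "higher"
LOWER = "lower"


def infer_score_direction(scores_best_to_worst):
    """Return "higher" or "lower" for scores that are ordered best to worst."""
    best = float(scores_best_to_worst[0])
    worst = float(scores_best_to_worst[-1])
    if best == worst:
        # With identical endpoints the direction cannot be inferred.
        raise ValueError("Best and worst scores are identical ({})".format(best))
    return HIGHER if best > worst else LOWER


print(infer_score_direction([98.2, 40.1, 3.5]))   # additive-style scores
print(infer_score_direction([1e-12, 1e-5, 0.3]))  # multiplicative-style scores
```

Because only the first and last elements are compared, the result is meaningful only if the scoring step really did sort its output from best to worst.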

input_has_custom(self)

Method that checks to see if the input data has custom score values.

Source code in pyproteininference/datastore.py
def input_has_custom(self):
    """
    Method that checks to see if the input data has custom score values.
    """
    len_c = len([x.custom_score for x in self.main_data_form if x.custom_score])
    len_all = len(self.main_data_form)
    if len_c == len_all:
        status = True
        logger.info("Input has Custom value; Can restrict by Custom value")

    else:
        status = False
        logger.warning("Input does not have Custom value; Cannot restrict by Custom value")

    return status

input_has_pep(self)

Method that checks to see if the input data has pep values.

Source code in pyproteininference/datastore.py
def input_has_pep(self):
    """
    Method that checks to see if the input data has pep values.
    """
    len_pep = len([x.pepvalue for x in self.main_data_form if x.pepvalue])
    len_all = len(self.main_data_form)
    if len_pep == len_all:
        status = True
        logger.info("Input has Pep value; Can restrict by Pep value")
    else:
        status = False
        logger.warning("Input does not have Pep value; Cannot restrict by Pep value")

    return status

input_has_q(self)

Method that checks to see if the input data has q values.

Source code in pyproteininference/datastore.py
def input_has_q(self):
    """
    Method that checks to see if the input data has q values.
    """
    len_q = len([x.qvalue for x in self.main_data_form if x.qvalue])
    len_all = len(self.main_data_form)
    if len_q == len_all:
        status = True
        logger.info("Input has Q value; Can restrict by Q value")
    else:
        status = False
        logger.warning("Input does not have Q value; Cannot restrict by Q value")

    return status
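
The three `input_has_*` checks count values that are truthy, so a value of exactly `0.0` is treated the same as a missing value. A minimal sketch of that behavior, assuming a stand-in record type (`FakePsm`, `has_q_truthy`, and `has_q_not_none` are hypothetical names, not part of pyproteininference):

```python
from collections import namedtuple

# Hypothetical stand-in for a Psm object; only the field used here is modeled.
FakePsm = namedtuple("FakePsm", ["qvalue"])


def has_q_truthy(psms):
    # Mirrors the check above: only truthy q values are counted.
    return len([p.qvalue for p in psms if p.qvalue]) == len(psms)


def has_q_not_none(psms):
    # Alternative check: any value that is present counts, including 0.0.
    return all(p.qvalue is not None for p in psms)


psms = [FakePsm(0.0), FakePsm(0.01), FakePsm(0.2)]
print(has_q_truthy(psms))
print(has_q_not_none(psms))
```

Whether a q value of exactly zero can occur depends on the upstream search tool, so this is a point to be aware of rather than a confirmed problem.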

peptide_to_protein_dictionary(self)

Method that returns a map of peptide strings to sets of protein strings; this map is one half of a bipartite graph. This method sets the peptide_protein_dictionary Attribute for the DataStore object.

Returns:
  • collections.defaultdict – Dictionary of peptide strings (keys) that map to sets of protein strings based on the peptides and proteins found in the search. Peptide -> set(Proteins).

Examples:

>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> peptide_protein_dict = data.peptide_to_protein_dictionary()
Source code in pyproteininference/datastore.py
def peptide_to_protein_dictionary(self):
    """
    Method that returns a map of peptide strings to sets of protein strings; this map is one half of a
    bipartite graph.
    This method sets the `peptide_protein_dictionary` Attribute for the DataStore object.

    Returns:
        collections.defaultdict: Dictionary of peptide strings (keys) that map to sets of protein strings based
            on the peptides and proteins found in the search. Peptide -> set(Proteins).

    Example:
        >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
        >>> peptide_protein_dict = data.peptide_to_protein_dictionary()
    """
    psm_data = self.get_psm_data()

    res_pep_set = set(self.restricted_peptides)
    default_dict_peptides = collections.defaultdict(set)
    for peptide_objects in psm_data:
        for prots in peptide_objects.possible_proteins:
            cur_peptide = peptide_objects.non_flanking_peptide
            if cur_peptide in res_pep_set:
                default_dict_peptides[cur_peptide].add(prots)

    self.peptide_protein_dictionary = default_dict_peptides

    return default_dict_peptides
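
Both halves of the bipartite graph can be built in one pass over the PSMs. The following is a minimal sketch, assuming stand-in PSM records that carry only the two fields used above; `FakePsm` and `build_maps` are hypothetical names, not part of pyproteininference.

```python
import collections

# Hypothetical stand-in for a Psm object with only the fields used here.
FakePsm = collections.namedtuple("FakePsm", ["non_flanking_peptide", "possible_proteins"])


def build_maps(psms, restricted_peptides):
    """Build Peptide -> set(Proteins) and Protein -> set(Peptides) maps."""
    allowed = set(restricted_peptides)
    peptide_to_proteins = collections.defaultdict(set)
    protein_to_peptides = collections.defaultdict(set)
    for psm in psms:
        # Peptides that did not survive restriction are skipped.
        if psm.non_flanking_peptide not in allowed:
            continue
        for protein in psm.possible_proteins:
            peptide_to_proteins[psm.non_flanking_peptide].add(protein)
            protein_to_peptides[protein].add(psm.non_flanking_peptide)
    return peptide_to_proteins, protein_to_peptides


psms = [
    FakePsm("PEPTIDEK", ["P1", "P2"]),
    FakePsm("ANOTHERR", ["P2"]),
    FakePsm("FILTEREDK", ["P3"]),
]
pep_map, prot_map = build_maps(psms, ["PEPTIDEK", "ANOTHERR"])
print(dict(pep_map))
print(dict(prot_map))
```

Using sets as values means repeated PSMs for the same peptide do not produce duplicate edges.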

protein_picker(self)

Method to run the protein picker algorithm.

Proteins must be scored first with score_psms.

The algorithm matches target and decoy proteins identified from the PSMs in the search. If a target and its matching decoy are both found, target/decoy competition is performed: within the pair, the protein with the better score is kept and the one with the worse score is discarded from the analysis.

The method sets the picked_proteins_scored and picked_proteins_removed variables for the DataStore object.

Returns:
  • None

Examples:

>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> data.protein_picker()
Source code in pyproteininference/datastore.py
def protein_picker(self):
    """
    Method to run the protein picker algorithm.

    Proteins must be scored first with [score_psms][pyproteininference.scoring.Score.score_psms].

    The algorithm matches target and decoy proteins identified from the PSMs in the search.
    If a target and its matching decoy are both found, target/decoy competition is performed:
    within the pair, the protein with the better score is kept and the one with the worse score is
    discarded from the analysis.

    The method sets the `picked_proteins_scored` and `picked_proteins_removed` variables for
    the DataStore object.

    Returns:
        None:

    Example:
        >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
        >>> data.protein_picker()
    """

    self._validate_scored_proteins()

    logger.info("Running Protein Picker")

    # Use the higher_or_lower method to determine whether a higher or a lower protein score is better
    # for the scoring method used
    higher_or_lower = self.higher_or_lower()
    # higher_or_lower() relies on scored_proteins being ordered from best to worst

    index_to_remove = []
    # data.scored_proteins is simply a list of Protein objects...
    # Create list of all decoy proteins
    decoy_proteins = [x.identifier for x in self.scored_proteins if self.decoy_symbol in x.identifier]
    # Create a list of all potential matching targets (some of these may not exist in the search)
    matching_targets = [x.replace(self.decoy_symbol, "") for x in decoy_proteins]

    # Create a list of all the proteins from the scored data
    all_proteins = [x.identifier for x in self.scored_proteins]
    logger.info("{} proteins scored".format(len(all_proteins)))

    total_targets = []
    total_decoys = []
    decoys_removed = []
    targets_removed = []
    # Loop over all decoys identified in the search
    logger.info("Picking Proteins...")
    for i in range(len(decoy_proteins)):
        cur_decoy_index = all_proteins.index(decoy_proteins[i])
        cur_decoy_protein_object = self.scored_proteins[cur_decoy_index]
        total_decoys.append(cur_decoy_protein_object.identifier)

        # Try, Except here because the matching target to the decoy may not be a result from the search
        try:
            cur_target_index = all_proteins.index(matching_targets[i])
            cur_target_protein_object = self.scored_proteins[cur_target_index]
            total_targets.append(cur_target_protein_object.identifier)

            if higher_or_lower == self.HIGHER_PSM_SCORE:
                if cur_target_protein_object.score > cur_decoy_protein_object.score:
                    index_to_remove.append(cur_decoy_index)
                    decoys_removed.append(cur_decoy_index)
                    cur_target_protein_object.picked = True
                    cur_decoy_protein_object.picked = False
                else:
                    index_to_remove.append(cur_target_index)
                    targets_removed.append(cur_target_index)
                    cur_decoy_protein_object.picked = True
                    cur_target_protein_object.picked = False

            if higher_or_lower == self.LOWER_PSM_SCORE:
                if cur_target_protein_object.score < cur_decoy_protein_object.score:
                    index_to_remove.append(cur_decoy_index)
                    decoys_removed.append(cur_decoy_index)
                    cur_target_protein_object.picked = True
                    cur_decoy_protein_object.picked = False
                else:
                    index_to_remove.append(cur_target_index)
                    targets_removed.append(cur_target_index)
                    cur_decoy_protein_object.picked = True
                    cur_target_protein_object.picked = False
        except ValueError:
            pass

    logger.info("{} total decoy proteins".format(len(total_decoys)))
    logger.info("{} matching target proteins also found in search".format(len(total_targets)))
    logger.info("{} decoy proteins to be removed".format(len(decoys_removed)))
    logger.info("{} target proteins to be removed".format(len(targets_removed)))

    logger.info("Removing Lower Scoring Proteins...")
    picked_list = []
    removed_proteins = []
    for protein_objects in self.scored_proteins:
        if protein_objects.picked:
            picked_list.append(protein_objects)
        else:
            removed_proteins.append(protein_objects)
    self.picked_proteins_scored = picked_list
    self.picked_proteins_removed = removed_proteins
    logger.info("Finished Removing Proteins")
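
The target/decoy competition can be sketched with a dictionary lookup in place of repeated `list.index` calls. This is an illustrative sketch only, not the library's implementation: `pick_proteins` is a hypothetical helper, scores are passed as a plain dict, and proteins without a partner are assumed to be kept (which matches the listing above only if `picked` defaults to True on Protein objects). On a tie the decoy is kept, as in the `else` branches above.

```python
def pick_proteins(scores, decoy_symbol="##", higher_is_better=True):
    """Run target/decoy competition on a dict of identifier -> score.

    Returns a (kept, removed) pair of identifier sets.
    """
    removed = set()
    for identifier, decoy_score in scores.items():
        if decoy_symbol not in identifier:
            continue  # only decoys start a competition
        target = identifier.replace(decoy_symbol, "")
        if target not in scores:
            continue  # the matching target was not found in the search
        target_score = scores[target]
        if higher_is_better:
            target_wins = target_score > decoy_score
        else:
            target_wins = target_score < decoy_score
        removed.add(identifier if target_wins else target)
    kept = set(scores) - removed
    return kept, removed


scores = {"P1": 10.0, "##P1": 4.0, "P2": 2.0, "##P2": 7.0, "P3": 5.0, "##P4": 1.0}
kept, removed = pick_proteins(scores)
print(sorted(kept))
print(sorted(removed))
```

A dict lookup is constant time per decoy, whereas `list.index` scans the whole list each time, which may matter for large searches.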

protein_to_peptide_dictionary(self)

Method that returns a map of protein strings to sets of peptide strings; this map is one half of a bipartite graph. This method sets the protein_peptide_dictionary Attribute for the DataStore object.

Returns:
  • collections.defaultdict – Dictionary of protein strings (keys) that map to sets of peptide strings based on the peptides and proteins found in the search. Protein -> set(Peptides).

Examples:

>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> protein_peptide_dict = data.protein_to_peptide_dictionary()
Source code in pyproteininference/datastore.py
def protein_to_peptide_dictionary(self):
    """
    Method that returns a map of protein strings to sets of peptide strings; this map is one half
    of a bipartite graph.
    This method sets the `protein_peptide_dictionary` Attribute for the DataStore object.

    Returns:
        collections.defaultdict: Dictionary of protein strings (keys) that map to sets of peptide strings based
        on the peptides and proteins found in the search. Protein -> set(Peptides).

    Example:
        >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
        >>> protein_peptide_dict = data.protein_to_peptide_dictionary()
    """
    psm_data = self.get_psm_data()

    res_pep_set = set(self.restricted_peptides)
    default_dict_proteins = collections.defaultdict(set)
    for peptide_objects in psm_data:
        for prots in peptide_objects.possible_proteins:
            cur_peptide = peptide_objects.non_flanking_peptide
            if cur_peptide in res_pep_set:
                default_dict_proteins[prots].add(cur_peptide)

    self.protein_peptide_dictionary = default_dict_proteins

    return default_dict_proteins

restrict_psm_data(self, remove1pep=True)

Method to restrict the input of Psm objects. This method is central to the pyproteininference module and is able to restrict the Psm data by: Q value, Pep Value, Percolator Score, Peptide Length, and Custom Score Input. Restriction values are pulled from the ProteinInferenceParameter object.

This method sets the main_data_restricted and restricted_peptides Attributes for the DataStore object.

Parameters:
  • remove1pep (bool) – Whether to remove PSMs with a PEP value of 1 before the other restrictions are applied. In the source shown here, this filter runs only when a PEP threshold (restrict_pep) is set.

Returns:
  • None

Examples:

>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> data.restrict_psm_data(remove1pep=True)
Source code in pyproteininference/datastore.py
def restrict_psm_data(self, remove1pep=True):
    """
    Method to restrict the input of [Psm][pyproteininference.physical.Psm]  objects.
    This method is central to the pyproteininference module and is able to restrict the Psm data by:
    Q value, Pep Value, Percolator Score, Peptide Length, and Custom Score Input.
    Restriction values are pulled from
    the [ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter]
    object.

    This method sets the `main_data_restricted` and `restricted_peptides` Attributes for the DataStore object.

    Args:
        remove1pep (bool): Whether to remove PSMs with a PEP value of 1 before the other restrictions are
            applied. This filter runs only when a PEP threshold (restrict_pep) is set.

    Returns:
        None:

    Example:
        >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
        >>> data.restrict_psm_data(remove1pep=True)
    """

    # Validate that we have the main data variable
    self._validate_main_data_form()

    logger.info("Restricting PSM data")

    peptide_length = self.parameter_file_object.restrict_peptide_length
    posterior_error_prob_threshold = self.parameter_file_object.restrict_pep
    q_value_threshold = self.parameter_file_object.restrict_q
    custom_threshold = self.parameter_file_object.restrict_custom

    main_psm_data = self.main_data_form
    logger.info("Length of main data: {}".format(len(self.main_data_form)))
    # If remove1pep is True and a PEP threshold is set, discard every PSM that has a PEP of 1
    if remove1pep and posterior_error_prob_threshold:
        main_psm_data = [x for x in main_psm_data if x.pepvalue != 1]

    # Restrict peptide length and posterior error probability
    if peptide_length and posterior_error_prob_threshold and not q_value_threshold:
        restricted_data = []
        for psms in main_psm_data:
            if len(psms.stripped_peptide) >= peptide_length and psms.pepvalue < float(
                posterior_error_prob_threshold
            ):
                restricted_data.append(psms)

    # Restrict peptide length only
    if peptide_length and not posterior_error_prob_threshold and not q_value_threshold:
        restricted_data = []
        for psms in main_psm_data:
            if len(psms.stripped_peptide) >= peptide_length:
                restricted_data.append(psms)

    # Restrict peptide length, posterior error probability, and qvalue
    if peptide_length and posterior_error_prob_threshold and q_value_threshold:
        restricted_data = []
        for psms in main_psm_data:
            if (
                len(psms.stripped_peptide) >= peptide_length
                and psms.pepvalue < float(posterior_error_prob_threshold)
                and psms.qvalue < float(q_value_threshold)
            ):
                restricted_data.append(psms)

    # Restrict peptide length and qvalue
    if peptide_length and not posterior_error_prob_threshold and q_value_threshold:
        restricted_data = []
        for psms in main_psm_data:
            if len(psms.stripped_peptide) >= peptide_length and psms.qvalue < float(q_value_threshold):
                restricted_data.append(psms)

    # Restrict posterior error probability and q value
    if not peptide_length and posterior_error_prob_threshold and q_value_threshold:
        restricted_data = []
        for psms in main_psm_data:
            if psms.pepvalue < float(posterior_error_prob_threshold) and psms.qvalue < float(q_value_threshold):
                restricted_data.append(psms)

    # Restrict qvalue only
    if not peptide_length and not posterior_error_prob_threshold and q_value_threshold:
        restricted_data = []
        for psms in main_psm_data:
            if psms.qvalue < float(q_value_threshold):
                restricted_data.append(psms)

    # Restrict posterior error probability only
    if not peptide_length and posterior_error_prob_threshold and not q_value_threshold:
        restricted_data = []
        for psms in main_psm_data:
            if psms.pepvalue < float(posterior_error_prob_threshold):
                restricted_data.append(psms)

    # Restrict nothing (the PEP == 1 filter above is also skipped because no PEP threshold is set)
    if not peptide_length and not posterior_error_prob_threshold and not q_value_threshold:
        restricted_data = main_psm_data

    if custom_threshold:
        custom_restricted = []
        if self.parameter_file_object.psm_score_type == Score.MULTIPLICATIVE_SCORE_TYPE:
            for psms in restricted_data:
                if psms.custom_score <= custom_threshold:
                    custom_restricted.append(psms)

        if self.parameter_file_object.psm_score_type == Score.ADDITIVE_SCORE_TYPE:
            for psms in restricted_data:
                if psms.custom_score >= custom_threshold:
                    custom_restricted.append(psms)

        restricted_data = custom_restricted

    self.main_data_restricted = restricted_data

    logger.info("Length of restricted data: {}".format(len(restricted_data)))

    self.restricted_peptides = [x.non_flanking_peptide for x in restricted_data]
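
The eight length/PEP/q-value branches above apply each restriction only when its parameter is set. The same filtering can be sketched as a single pass. This is a minimal sketch, assuming stand-in PSM records and leaving out the custom-score step; `FakePsm` and `restrict` are hypothetical names, not part of pyproteininference.

```python
from collections import namedtuple

# Hypothetical stand-in for a Psm object with only the fields used here.
FakePsm = namedtuple("FakePsm", ["stripped_peptide", "pepvalue", "qvalue"])


def restrict(psms, peptide_length=None, pep_threshold=None, q_threshold=None, remove1pep=True):
    """Keep PSMs that pass every restriction whose parameter is set."""
    kept = []
    for psm in psms:
        # As in the listing above, PEP == 1 is dropped only when a PEP threshold is set.
        if remove1pep and pep_threshold and psm.pepvalue == 1:
            continue
        if peptide_length and len(psm.stripped_peptide) < peptide_length:
            continue
        if pep_threshold and not psm.pepvalue < float(pep_threshold):
            continue
        if q_threshold and not psm.qvalue < float(q_threshold):
            continue
        kept.append(psm)
    return kept


psms = [
    FakePsm("PEPTIDEK", 0.01, 0.001),
    FakePsm("SHORT", 0.01, 0.001),
    FakePsm("LONGPEPTIDER", 0.5, 0.2),
    FakePsm("ANOTHERPEPK", 1, 0.001),
]
strict = restrict(psms, peptide_length=7, pep_threshold=0.9, q_threshold=0.01)
print([p.stripped_peptide for p in strict])
print(len(restrict(psms)))
```

Each restriction is an independent early exit, so adding another criterion means adding one condition rather than doubling the number of branches.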

sort_protein_group_objects(protein_group_objects, higher_or_lower) classmethod

Class Method to sort a list of ProteinGroup objects by score and number of peptides.

Parameters:
  • protein_group_objects (list) – list of ProteinGroup objects.

  • higher_or_lower (str) – String to indicate if a "higher" or "lower" protein score is "better".

Returns:
  • list – list of sorted ProteinGroup objects.

Examples:

>>> list_of_group_objects = pyproteininference.datastore.DataStore.sort_protein_group_objects(
...     protein_group_objects=list_of_group_objects, higher_or_lower="higher"
... )
Source code in pyproteininference/datastore.py
@classmethod
def sort_protein_group_objects(cls, protein_group_objects, higher_or_lower):
    """
    Class Method to sort a list of [ProteinGroup][pyproteininference.physical.ProteinGroup] objects by
    score and number of peptides.

    Args:
        protein_group_objects (list): list of [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.
        higher_or_lower (str): String to indicate if a "higher" or "lower" protein score is "better".

    Returns:
        list: list of sorted [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.

    Example:
        >>> list_of_group_objects = pyproteininference.datastore.DataStore.sort_protein_group_objects(
        ...     protein_group_objects=list_of_group_objects, higher_or_lower="higher"
        ... )
    """
    if higher_or_lower == cls.LOWER_PSM_SCORE:

        protein_group_objects = sorted(
            protein_group_objects,
            key=lambda k: (
                k.proteins[0].score,
                -k.proteins[0].num_peptides,
            ),
            reverse=False,
        )
    elif higher_or_lower == cls.HIGHER_PSM_SCORE:

        protein_group_objects = sorted(
            protein_group_objects,
            key=lambda k: (
                k.proteins[0].score,
                k.proteins[0].num_peptides,
            ),
            reverse=True,
        )

    return protein_group_objects

sort_protein_objects(grouped_protein_objects, higher_or_lower) classmethod

Class Method to sort a list of Protein objects by score and number of peptides.

Parameters:
  • grouped_protein_objects (list) – list of Protein objects.

  • higher_or_lower (str) – String to indicate if a "higher" or "lower" protein score is "better".

Returns:
  • list – list of sorted Protein objects.

Examples:

>>> scores_grouped = pyproteininference.datastore.DataStore.sort_protein_objects(
...     grouped_protein_objects=scores_grouped, higher_or_lower="higher"
... )
Source code in pyproteininference/datastore.py
@classmethod
def sort_protein_objects(cls, grouped_protein_objects, higher_or_lower):
    """
    Class Method to sort a list of [Protein][pyproteininference.physical.Protein] objects by score and number of
    peptides.

    Args:
        grouped_protein_objects (list): list of [Protein][pyproteininference.physical.Protein] objects.
        higher_or_lower (str): String to indicate if a "higher" or "lower" protein score is "better".

    Returns:
        list: list of sorted [Protein][pyproteininference.physical.Protein] objects.

    Example:
        >>> scores_grouped = pyproteininference.datastore.DataStore.sort_protein_objects(
        ...     grouped_protein_objects=scores_grouped, higher_or_lower="higher"
        ... )
    """
    if higher_or_lower == cls.LOWER_PSM_SCORE:
        grouped_protein_objects = sorted(
            grouped_protein_objects,
            key=lambda k: (k[0].score, -k[0].num_peptides),
            reverse=False,
        )
    if higher_or_lower == cls.HIGHER_PSM_SCORE:
        grouped_protein_objects = sorted(
            grouped_protein_objects,
            key=lambda k: (k[0].score, k[0].num_peptides),
            reverse=True,
        )
    return grouped_protein_objects
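
In both directions, groups whose lead proteins have equal scores are ordered so that the lead with more peptides comes first. A minimal sketch of the two sort keys, assuming stand-in protein records (`FakeProtein` and `sort_groups` are hypothetical names) and the string values "higher"/"lower" for the direction:

```python
from collections import namedtuple

# Hypothetical stand-in for a Protein object with only the fields used here.
FakeProtein = namedtuple("FakeProtein", ["identifier", "score", "num_peptides"])


def sort_groups(groups, higher_or_lower):
    """Sort lists of proteins by the score and peptide count of each lead protein."""
    if higher_or_lower == "lower":
        # Ascending score; negating the count puts more peptides first on ties.
        return sorted(groups, key=lambda g: (g[0].score, -g[0].num_peptides))
    if higher_or_lower == "higher":
        # Descending score and descending peptide count.
        return sorted(groups, key=lambda g: (g[0].score, g[0].num_peptides), reverse=True)
    return groups


groups = [
    [FakeProtein("A", 5.0, 2)],
    [FakeProtein("B", 9.0, 1)],
    [FakeProtein("C", 5.0, 4)],
]
print([g[0].identifier for g in sort_groups(groups, "higher")])
print([g[0].identifier for g in sort_groups(groups, "lower")])
```

Negating the peptide count in the ascending case is what keeps the tie-break direction the same as in the descending case.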

sort_protein_strings(protein_string_list, sp_proteins, decoy_symbol) classmethod

Method that sorts protein strings in the following order: Target Reviewed, Decoy Reviewed, Target Unreviewed, Decoy Unreviewed.

Parameters:
  • protein_string_list (list) – List of Protein Strings.

  • sp_proteins (set) – Set of Reviewed Protein Strings.

  • decoy_symbol (str) – Symbol to denote a decoy protein identifier, e.g. "##".

Returns:
  • list – List of sorted protein strings.

Examples:

>>> sorted_protein_strings = pyproteininference.datastore.DataStore.sort_protein_strings(
...     protein_string_list=protein_string_list, sp_proteins=sp_proteins, decoy_symbol="##"
... )
Source code in pyproteininference/datastore.py
@classmethod
def sort_protein_strings(cls, protein_string_list, sp_proteins, decoy_symbol):
    """
    Method that sorts protein strings in the following order: Target Reviewed, Decoy Reviewed, Target Unreviewed,
    Decoy Unreviewed.

    Args:
        protein_string_list (list): List of Protein Strings.
        sp_proteins (set): Set of Reviewed Protein Strings.
        decoy_symbol (str): Symbol to denote a decoy protein identifier, e.g. "##".

    Returns:
        list: List of sorted protein strings.

    Example:
        >>> sorted_protein_strings = pyproteininference.datastore.DataStore.sort_protein_strings(
        ...     protein_string_list=protein_string_list, sp_proteins=sp_proteins, decoy_symbol="##"
        ... )
    """

    our_target_sp_proteins = sorted([x for x in protein_string_list if x in sp_proteins and decoy_symbol not in x])
    our_decoy_sp_proteins = sorted([x for x in protein_string_list if x in sp_proteins and decoy_symbol in x])

    our_target_tr_proteins = sorted(
        [x for x in protein_string_list if x not in sp_proteins and decoy_symbol not in x]
    )
    our_decoy_tr_proteins = sorted([x for x in protein_string_list if x not in sp_proteins and decoy_symbol in x])

    identifiers_sorted = (
        our_target_sp_proteins + our_decoy_sp_proteins + our_target_tr_proteins + our_decoy_tr_proteins
    )

    return identifiers_sorted
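
The four-bucket ordering (Target Reviewed, Decoy Reviewed, Target Unreviewed, Decoy Unreviewed, each sorted alphabetically) can also be expressed as a single sort with a tuple key. This is an alternative sketch, not the code above; `sort_identifiers` is a hypothetical name.

```python
def sort_identifiers(protein_strings, reviewed, decoy_symbol):
    """Order identifiers: reviewed before unreviewed, targets before decoys, then A-Z."""

    def rank(identifier):
        unreviewed = identifier not in reviewed
        decoy = decoy_symbol in identifier
        # False sorts before True, so reviewed targets come first.
        return (unreviewed, decoy, identifier)

    return sorted(protein_strings, key=rank)


proteins = ["TR_B", "##SP_A", "SP_C", "##TR_B", "SP_A"]
reviewed = {"SP_A", "SP_C", "##SP_A"}
print(sort_identifiers(proteins, reviewed, "##"))
```

The tuple key scans the input once, while the bucket approach filters the list four times; the resulting order is the same.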

sort_protein_sub_groups(protein_list, higher_or_lower) classmethod

Method to sort protein sub lists.

Parameters:
  • protein_list (list) – List of Protein objects to be sorted.

  • higher_or_lower (str) – String to indicate if a "higher" or "lower" protein score is "better".

Returns:
  • list – List of Protein objects sorted by score and number of peptides; the lead (first) protein keeps its position.

Source code in pyproteininference/datastore.py
@classmethod
def sort_protein_sub_groups(cls, protein_list, higher_or_lower):
    """
    Method to sort protein sub lists.

    Args:
        protein_list (list): List of [Protein][pyproteininference.physical.Protein] objects to be sorted.
        higher_or_lower (str): String to indicate if a "higher" or "lower" protein score is "better".

    Returns:
        list: List of [Protein][pyproteininference.physical.Protein] objects sorted by score and number of
        peptides; the lead (first) protein keeps its position.

    """

    # Sort the groups based on higher or lower indication, secondarily sort the groups based on number of unique
    # peptides
    # We use the index [1:] as we do not wish to sort the lead protein...
    if higher_or_lower == cls.LOWER_PSM_SCORE:
        protein_list[1:] = sorted(
            protein_list[1:],
            key=lambda k: (float(k.score), -float(k.num_peptides)),
            reverse=False,
        )
    if higher_or_lower == cls.HIGHER_PSM_SCORE:
        protein_list[1:] = sorted(
            protein_list[1:],
            key=lambda k: (float(k.score), float(k.num_peptides)),
            reverse=True,
        )

    return protein_list
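
Only the proteins after the lead are reordered. A minimal sketch of that behavior, assuming stand-in protein records and the string values "higher"/"lower"; `FakeProtein` and `sort_sub_group` are hypothetical names, and unlike the listing above this version returns a new list instead of modifying its argument.

```python
from collections import namedtuple

# Hypothetical stand-in for a Protein object with only the fields used here.
FakeProtein = namedtuple("FakeProtein", ["identifier", "score", "num_peptides"])


def sort_sub_group(protein_list, higher_or_lower):
    """Sort every protein except the lead by score, then by peptide count."""
    lead, rest = protein_list[:1], protein_list[1:]
    if higher_or_lower == "lower":
        rest = sorted(rest, key=lambda p: (float(p.score), -float(p.num_peptides)))
    elif higher_or_lower == "higher":
        rest = sorted(rest, key=lambda p: (float(p.score), float(p.num_peptides)), reverse=True)
    return lead + rest


group = [
    FakeProtein("LEAD", 3.0, 1),
    FakeProtein("X", 8.0, 2),
    FakeProtein("Y", 9.0, 1),
    FakeProtein("Z", 8.0, 5),
]
print([p.identifier for p in sort_sub_group(group, "higher")])
print([p.identifier for p in sort_sub_group(group, "lower")])
```

The lead stays first even when another member has a better score, which is the behavior the `[1:]` slice in the listing is there to guarantee.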

unique_to_leads_peptides(self)

Method to retrieve peptides that are unique based on the data from the searches (Not based on the database digestion).

Returns:
  • set – a Set of peptide strings

Examples:

>>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
>>> unique_peps = data.unique_to_leads_peptides()
Source code in pyproteininference/datastore.py
def unique_to_leads_peptides(self):
    """
    Method to retrieve peptides that are unique based on the data from the searches
    (Not based on the database digestion).

    Returns:
        set: a Set of peptide strings

    Example:
        >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
        >>> unique_peps = data.unique_to_leads_peptides()
    """
    if self.grouped_scored_proteins:
        lead_peptides = [list(x[0].peptides) for x in self.grouped_scored_proteins]
        flat_peptides = [item for sublist in lead_peptides for item in sublist]
        counted_peps = collections.Counter(flat_peptides)
        unique_to_leads_peptides = set([x for x in counted_peps if counted_peps[x] == 1])
    else:
        unique_to_leads_peptides = set()

    return unique_to_leads_peptides
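
The counting step can be shown on plain sets of peptide strings. A minimal sketch, assuming each lead protein is represented only by its peptide set; `unique_to_leads` is a hypothetical name.

```python
import collections


def unique_to_leads(lead_peptide_sets):
    """Return peptides that occur in exactly one lead protein's peptide set."""
    counts = collections.Counter(pep for peps in lead_peptide_sets for pep in peps)
    return {pep for pep, n in counts.items() if n == 1}


leads = [{"AAK", "BBK"}, {"BBK", "CCK"}, {"DDK"}]
print(sorted(unique_to_leads(leads)))
```

Here "BBK" is shared by two leads and is excluded, so uniqueness is relative to the lead proteins in the results rather than to the full database digest.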

validate_digest(self)

Method that validates the Digest object.

Source code in pyproteininference/datastore.py
def validate_digest(self):
    """
    Method that validates the [Digest object][pyproteininference.in_silico_digest.Digest].
    """
    self._validate_reviewed_v_unreviewed()
    self._check_target_decoy_split()

validate_psm_data(self)

Method that validates the PSM data.

Source code in pyproteininference/datastore.py
def validate_psm_data(self):
    """
    Method that validates the PSM data.
    """
    self._validate_decoys_from_data()
    self._validate_isoform_from_data()

export

Export

Class that handles exporting protein inference results to the filesystem as CSV files.

Attributes:

Name Type Description
data DataStore

DataStore object.

filepath str

Path to file to be written.

Source code in pyproteininference/export.py
class Export(object):
    """
    Class that handles exporting protein inference results to the filesystem as CSV files.

    Attributes:
        data (DataStore): [DataStore object][pyproteininference.datastore.DataStore].
        filepath (str): Path to file to be written.

    """

    EXPORT_LEADS = "leads"
    EXPORT_ALL = "all"
    EXPORT_COMMA_SEP = "comma_sep"
    EXPORT_Q_VALUE_COMMA_SEP = "q_value_comma_sep"
    EXPORT_Q_VALUE = "q_value"
    EXPORT_Q_VALUE_ALL = "q_value_all"
    EXPORT_PEPTIDES = "peptides"
    EXPORT_PSMS = "psms"
    EXPORT_PSM_IDS = "psm_ids"
    EXPORT_LONG = "long"

    EXPORT_TYPES = [
        EXPORT_LEADS,
        EXPORT_ALL,
        EXPORT_COMMA_SEP,
        EXPORT_Q_VALUE_COMMA_SEP,
        EXPORT_Q_VALUE,
        EXPORT_Q_VALUE_ALL,
        EXPORT_PEPTIDES,
        EXPORT_PSMS,
        EXPORT_PSM_IDS,
        EXPORT_LONG,
    ]

    def __init__(self, data):
        """
        Initialization method for the Export class.

        Args:
            data (DataStore): [DataStore object][pyproteininference.datastore.DataStore].

        Example:
            >>> export = pyproteininference.export.Export(data=data)

        """
        self.data = data
        self.filepath = None

    def export_to_csv(self, output_filename=None, directory=None, export_type="q_value"):
        """
        Method that dispatches to one of the many export methods given an export_type input.

        filepath is determined based on directory arg and information from
        [DataStore object][pyproteininference.datastore.DataStore].

        This method sets the `filepath` variable.

        Args:
            output_filename (str): Filepath to write to. If set as None will auto generate filename and
                will write to directory variable.
            directory (str): Directory to write the result file to. If None, will write to current working directory.
            export_type (str): Must be a value in `EXPORT_TYPES` and determines the output format.

        Example:
            >>> export = pyproteininference.export.Export(data=data)
            >>> export.export_to_csv(output_filename=None, directory="/path/to/output/dir/", export_type="psms")

        """

        if not directory:
            directory = os.getcwd()

        data = self.data
        tag = data.parameter_file_object.tag

        if self.EXPORT_LEADS == export_type:
            filename = "{}_leads_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
            complete_filepath = os.path.join(directory, filename)
            if output_filename:
                complete_filepath = output_filename
            logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
            self.csv_export_leads_restricted(filename_out=complete_filepath)

        elif self.EXPORT_ALL == export_type:
            filename = "{}_all_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
            complete_filepath = os.path.join(directory, filename)
            if output_filename:
                complete_filepath = output_filename
            logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
            self.csv_export_all_restricted(complete_filepath)

        elif self.EXPORT_COMMA_SEP == export_type:
            filename = "{}_comma_sep_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
            complete_filepath = os.path.join(directory, filename)
            if output_filename:
                complete_filepath = output_filename
            logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
            self.csv_export_comma_sep_restricted(complete_filepath)

        elif self.EXPORT_Q_VALUE_COMMA_SEP == export_type:
            filename = "{}_q_value_comma_sep_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
            complete_filepath = os.path.join(directory, filename)
            if output_filename:
                complete_filepath = output_filename
            logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
            self.csv_export_q_value_comma_sep(complete_filepath)

        elif self.EXPORT_Q_VALUE == export_type:
            filename = "{}_q_value_leads_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
            complete_filepath = os.path.join(directory, filename)
            if output_filename:
                complete_filepath = output_filename
            logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
            self.csv_export_q_value_leads(complete_filepath)

        elif self.EXPORT_Q_VALUE_ALL == export_type:
            filename = "{}_q_value_all_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
            complete_filepath = os.path.join(directory, filename)
            if output_filename:
                complete_filepath = output_filename
            logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
            self.csv_export_q_value_all(complete_filepath)

        elif self.EXPORT_PEPTIDES == export_type:
            filename = "{}_q_value_leads_peptides_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
            complete_filepath = os.path.join(directory, filename)
            if output_filename:
                complete_filepath = output_filename
            logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
            self.csv_export_q_value_leads_peptides(complete_filepath)

        elif self.EXPORT_PSMS == export_type:
            filename = "{}_q_value_leads_psms_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
            complete_filepath = os.path.join(directory, filename)
            if output_filename:
                complete_filepath = output_filename
            logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
            self.csv_export_q_value_leads_psms(complete_filepath)

        elif self.EXPORT_PSM_IDS == export_type:
            filename = "{}_q_value_leads_psm_ids_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
            complete_filepath = os.path.join(directory, filename)
            if output_filename:
                complete_filepath = output_filename
            logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
            self.csv_export_q_value_leads_psm_ids(complete_filepath)

        elif self.EXPORT_LONG == export_type:
            filename = "{}_q_value_long_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
            complete_filepath = os.path.join(directory, filename)
            if output_filename:
                complete_filepath = output_filename
            logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
            self.csv_export_q_value_leads_long(complete_filepath)

        else:
            # Unrecognized export_type: no file is written; only this default path is recorded.
            complete_filepath = "protein_inference_results.csv"

        self.filepath = complete_filepath

    def csv_export_all_restricted(self, filename_out):
        """
        Method that outputs a subset of the passing proteins based on FDR.

        This method returns a non-square CSV file.

        Args:
            filename_out (str): Filename for the data to be written to.

        """
        protein_objects = self.data.get_protein_objects(fdr_restricted=True)
        protein_export_list = [
            [
                "Protein",
                "Score",
                "Number_of_Peptides",
                "Identifier_Type",
                "GroupID",
                "Peptides",
            ]
        ]
        for groups in protein_objects:
            for prots in groups:
                protein_export_list.append([prots.identifier])
                protein_export_list[-1].append(prots.score)
                protein_export_list[-1].append(prots.num_peptides)
                if prots.reviewed:
                    protein_export_list[-1].append("Reviewed")
                else:
                    protein_export_list[-1].append("Unreviewed")
                protein_export_list[-1].append(prots.group_identification)
                for peps in prots.peptides:
                    protein_export_list[-1].append(peps)

        with open(filename_out, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerows(protein_export_list)

    def csv_export_leads_restricted(self, filename_out):
        """
        Method that outputs a subset of the passing proteins based on FDR.
        Only Proteins that pass FDR will be output and only Lead proteins will be output.

        This method returns a non-square CSV file.

        Args:
            filename_out (str): Filename for the data to be written to.

        """
        protein_objects = self.data.get_protein_objects(fdr_restricted=True)
        protein_export_list = [
            [
                "Protein",
                "Score",
                "Number_of_Peptides",
                "Identifier_Type",
                "GroupID",
                "Peptides",
            ]
        ]
        for groups in protein_objects:
            protein_export_list.append([groups[0].identifier])
            protein_export_list[-1].append(groups[0].score)
            protein_export_list[-1].append(groups[0].num_peptides)
            if groups[0].reviewed:
                protein_export_list[-1].append("Reviewed")
            else:
                protein_export_list[-1].append("Unreviewed")
            protein_export_list[-1].append(groups[0].group_identification)
            for peps in sorted(groups[0].peptides):
                protein_export_list[-1].append(peps)

        with open(filename_out, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerows(protein_export_list)

    def csv_export_comma_sep_restricted(self, filename_out):
        """
        Method that outputs a subset of the passing proteins based on FDR.
        Only Proteins that pass FDR will be output and only Lead proteins will be output.
        Proteins in the groups of lead proteins will also be output in the same row as the lead.

        This method returns a non-square CSV file.

        Args:
            filename_out (str): Filename for the data to be written to.

        """
        protein_objects = self.data.get_protein_objects(fdr_restricted=True)
        protein_export_list = [
            [
                "Protein",
                "Score",
                "Number_of_Peptides",
                "Identifier_Type",
                "GroupID",
                "Other_Potential_Identifiers",
            ]
        ]
        for groups in protein_objects:
            for prots in groups:
                if prots == groups[0]:
                    protein_export_list.append([prots.identifier])
                    protein_export_list[-1].append(prots.score)
                    protein_export_list[-1].append(prots.num_peptides)
                    if prots.reviewed:
                        protein_export_list[-1].append("Reviewed")
                    else:
                        protein_export_list[-1].append("Unreviewed")
                    protein_export_list[-1].append(prots.group_identification)
                else:
                    protein_export_list[-1].append(prots.identifier)
        with open(filename_out, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerows(protein_export_list)

    def csv_export_q_value_leads(self, filename_out):
        """
        Method that outputs all lead proteins with Q values.

        This method returns a non-square CSV file.

        Args:
            filename_out (str): Filename for the data to be written to.

        """
        protein_export_list = [
            [
                "Protein",
                "Score",
                "Q_Value",
                "Number_of_Peptides",
                "Identifier_Type",
                "GroupID",
                "Peptides",
            ]
        ]
        for groups in self.data.protein_group_objects:
            lead_protein = groups.proteins[0]
            protein_export_list.append([lead_protein.identifier])
            protein_export_list[-1].append(lead_protein.score)
            protein_export_list[-1].append(groups.q_value)
            protein_export_list[-1].append(lead_protein.num_peptides)
            if lead_protein.reviewed:
                protein_export_list[-1].append("Reviewed")
            else:
                protein_export_list[-1].append("Unreviewed")
            protein_export_list[-1].append(groups.number_id)
            peptides = lead_protein.peptides
            for peps in sorted(peptides):
                protein_export_list[-1].append(peps)

        with open(filename_out, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerows(protein_export_list)

    def csv_export_q_value_comma_sep(self, filename_out):
        """
        Method that outputs all lead proteins with Q values.
        Proteins in the groups of lead proteins will also be output in the same row as the lead.

        This method returns a non-square CSV file.

        Args:
            filename_out (str): Filename for the data to be written to.

        """
        protein_export_list = [
            [
                "Protein",
                "Score",
                "Q_Value",
                "Number_of_Peptides",
                "Identifier_Type",
                "GroupID",
                "Other_Potential_Identifiers",
            ]
        ]
        for groups in self.data.protein_group_objects:
            lead_protein = groups.proteins[0]
            protein_export_list.append([lead_protein.identifier])
            protein_export_list[-1].append(lead_protein.score)
            protein_export_list[-1].append(groups.q_value)
            protein_export_list[-1].append(lead_protein.num_peptides)
            if lead_protein.reviewed:
                protein_export_list[-1].append("Reviewed")
            else:
                protein_export_list[-1].append("Unreviewed")
            protein_export_list[-1].append(groups.number_id)
            for other_prots in groups.proteins[1:]:
                protein_export_list[-1].append(other_prots.identifier)

        with open(filename_out, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerows(protein_export_list)

    def csv_export_q_value_all(self, filename_out):
        """
        Method that outputs all proteins with Q values.
        Non Lead proteins are also output - entire group gets output.
        Each protein in a group is written on its own row; group members share the group's Q value and GroupID.

        This method returns a non-square CSV file.

        Args:
            filename_out (str): Filename for the data to be written to.

        """
        protein_export_list = [
            [
                "Protein",
                "Score",
                "Q_Value",
                "Number_of_Peptides",
                "Identifier_Type",
                "GroupID",
                "Peptides",
            ]
        ]
        for groups in self.data.protein_group_objects:
            for proteins in groups.proteins:
                protein_export_list.append([proteins.identifier])
                protein_export_list[-1].append(proteins.score)
                protein_export_list[-1].append(groups.q_value)
                protein_export_list[-1].append(proteins.num_peptides)
                if proteins.reviewed:
                    protein_export_list[-1].append("Reviewed")
                else:
                    protein_export_list[-1].append("Unreviewed")
                protein_export_list[-1].append(groups.number_id)
                for peps in sorted(proteins.peptides):
                    protein_export_list[-1].append(peps)

        with open(filename_out, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerows(protein_export_list)

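    def csv_export_q_value_all_proteologic(self, filename_out):
        """
        Method that outputs all proteins with Q values, one protein per row.
        As written, the output is identical to `csv_export_q_value_all`.

        This method returns a non-square CSV file.

        Args:
            filename_out (str): Filename for the data to be written to.

        """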
        protein_export_list = [
            [
                "Protein",
                "Score",
                "Q_Value",
                "Number_of_Peptides",
                "Identifier_Type",
                "GroupID",
                "Peptides",
            ]
        ]
        for groups in self.data.protein_group_objects:
            for proteins in groups.proteins:
                protein_export_list.append([proteins.identifier])
                protein_export_list[-1].append(proteins.score)
                protein_export_list[-1].append(groups.q_value)
                protein_export_list[-1].append(proteins.num_peptides)
                if proteins.reviewed:
                    protein_export_list[-1].append("Reviewed")
                else:
                    protein_export_list[-1].append("Unreviewed")
                protein_export_list[-1].append(groups.number_id)
                for peps in sorted(proteins.peptides):
                    protein_export_list[-1].append(peps)

        with open(filename_out, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerows(protein_export_list)

    def csv_export_q_value_leads_long(self, filename_out):
        """
        Method that outputs all lead proteins with Q values.

        This method returns a long formatted result file with one peptide on each row.

        Args:
            filename_out (str): Filename for the data to be written to.

        """
        protein_export_list = [
            [
                "Protein",
                "Score",
                "Q_Value",
                "Number_of_Peptides",
                "Identifier_Type",
                "GroupID",
                "Peptides",
            ]
        ]
        for groups in self.data.protein_group_objects:
            lead_protein = groups.proteins[0]
            for peps in sorted(lead_protein.peptides):
                protein_export_list.append([lead_protein.identifier])
                protein_export_list[-1].append(lead_protein.score)
                protein_export_list[-1].append(groups.q_value)
                protein_export_list[-1].append(lead_protein.num_peptides)
                if lead_protein.reviewed:
                    protein_export_list[-1].append("Reviewed")
                else:
                    protein_export_list[-1].append("Unreviewed")
                protein_export_list[-1].append(groups.number_id)
                protein_export_list[-1].append(peps)

        with open(filename_out, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerows(protein_export_list)

    def csv_export_q_value_leads_peptides(self, filename_out, peptide_delimiter=" "):
        """
        Method that outputs all lead proteins with Q values in rectangular format.
        This method outputs unique peptides per protein.

        This method returns a rectangular CSV file.

        Args:
            filename_out (str): Filename for the data to be written to.
            peptide_delimiter (str): String to separate peptides by in the "Peptides" column of the csv file.
        """
        protein_export_list = [
            [
                "Protein",
                "Score",
                "Q_Value",
                "Number_of_Peptides",
                "Identifier_Type",
                "GroupID",
                "Peptides",
            ]
        ]
        for groups in self.data.protein_group_objects:
            lead_protein = groups.proteins[0]
            protein_export_list.append([lead_protein.identifier])
            protein_export_list[-1].append(lead_protein.score)
            protein_export_list[-1].append(groups.q_value)
            protein_export_list[-1].append(lead_protein.num_peptides)
            if lead_protein.reviewed:
                protein_export_list[-1].append("Reviewed")
            else:
                protein_export_list[-1].append("Unreviewed")
            protein_export_list[-1].append(groups.number_id)
            peptides = peptide_delimiter.join(sorted(lead_protein.peptides))
            protein_export_list[-1].append(peptides)

        with open(filename_out, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerows(protein_export_list)

    def csv_export_q_value_leads_psms(self, filename_out, peptide_delimiter=" "):
        """
        Method that outputs all lead proteins with Q values in rectangular format.
        This method outputs all PSMs for the protein, not just unique peptide identifiers.

        This method returns a rectangular CSV file.

        Args:
            filename_out (str): Filename for the data to be written to.
            peptide_delimiter (str): String to separate peptides by in the "Peptides" column of the csv file.
        """
        protein_export_list = [
            [
                "Protein",
                "Score",
                "Q_Value",
                "Number_of_Peptides",
                "Identifier_Type",
                "GroupID",
                "Peptides",
            ]
        ]
        for groups in self.data.protein_group_objects:
            lead_protein = groups.proteins[0]
            protein_export_list.append([lead_protein.identifier])
            protein_export_list[-1].append(lead_protein.score)
            protein_export_list[-1].append(groups.q_value)
            protein_export_list[-1].append(lead_protein.num_peptides)
            if lead_protein.reviewed:
                protein_export_list[-1].append("Reviewed")
            else:
                protein_export_list[-1].append("Unreviewed")
            protein_export_list[-1].append(groups.number_id)
            psms = peptide_delimiter.join(sorted([x.non_flanking_peptide for x in lead_protein.psms]))
            protein_export_list[-1].append(psms)

        with open(filename_out, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerows(protein_export_list)

    def csv_export_q_value_leads_psm_ids(self, filename_out, peptide_delimiter=" "):
        """
        Method that outputs all lead proteins with Q values in rectangular format.
        PSMs are output as their psm_id values, so sequence information is not output.

        This method returns a rectangular CSV file.

        Args:
            filename_out (str): Filename for the data to be written to.
            peptide_delimiter (str): String to separate psm_ids by in the "Peptides" column of the csv file.
        """
        protein_export_list = [
            [
                "Protein",
                "Score",
                "Q_Value",
                "Number_of_Peptides",
                "Identifier_Type",
                "GroupID",
                "Peptides",
            ]
        ]
        for groups in self.data.protein_group_objects:
            lead_protein = groups.proteins[0]
            protein_export_list.append([lead_protein.identifier])
            protein_export_list[-1].append(lead_protein.score)
            protein_export_list[-1].append(groups.q_value)
            protein_export_list[-1].append(lead_protein.num_peptides)
            if lead_protein.reviewed:
                protein_export_list[-1].append("Reviewed")
            else:
                protein_export_list[-1].append("Unreviewed")
            protein_export_list[-1].append(groups.number_id)
            psms = peptide_delimiter.join(sorted(lead_protein.get_psm_ids()))
            protein_export_list[-1].append(psms)

        with open(filename_out, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerows(protein_export_list)

__init__(self, data) special

Initialization method for the Export class.

Parameters:
  • data (DataStore) – DataStore object.

Examples:

>>> export = pyproteininference.export.Export(data=data)

Source code in pyproteininference/export.py
def __init__(self, data):
    """
    Initialization method for the Export class.

    Args:
        data (DataStore): [DataStore object][pyproteininference.datastore.DataStore].

    Example:
        >>> export = pyproteininference.export.Export(data=data)

    """
    self.data = data
    self.filepath = None

csv_export_all_restricted(self, filename_out)

Method that outputs a subset of the passing proteins based on FDR.

This method returns a non-square CSV file.

Parameters:
  • filename_out (str) – Filename for the data to be written to.

Source code in pyproteininference/export.py
def csv_export_all_restricted(self, filename_out):
    """
    Method that outputs a subset of the passing proteins based on FDR.

    This method returns a non-square CSV file.

    Args:
        filename_out (str): Filename for the data to be written to.

    """
    protein_objects = self.data.get_protein_objects(fdr_restricted=True)
    protein_export_list = [
        [
            "Protein",
            "Score",
            "Number_of_Peptides",
            "Identifier_Type",
            "GroupID",
            "Peptides",
        ]
    ]
    for groups in protein_objects:
        for prots in groups:
            protein_export_list.append([prots.identifier])
            protein_export_list[-1].append(prots.score)
            protein_export_list[-1].append(prots.num_peptides)
            if prots.reviewed:
                protein_export_list[-1].append("Reviewed")
            else:
                protein_export_list[-1].append("Unreviewed")
            protein_export_list[-1].append(prots.group_identification)
            for peps in prots.peptides:
                protein_export_list[-1].append(peps)

    with open(filename_out, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(protein_export_list)
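
"Non-square" here means that rows can differ in length: each peptide is appended as its own trailing cell, so the header names only the first peptide column. A minimal sketch of reading such a file back with the standard library is shown below. The helper `read_non_square` and the sample rows are hypothetical and are not part of pyproteininference.

```python
import csv
import io

# Number of fixed leading columns in the header shown above:
# Protein, Score, Number_of_Peptides, Identifier_Type, GroupID
N_FIXED = 5


def read_non_square(handle, n_fixed=N_FIXED):
    """Yield (fixed_fields, peptides) for each data row of a ragged CSV."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    for row in reader:
        # Everything after the fixed columns is a peptide cell.
        yield row[:n_fixed], row[n_fixed:]


# Made-up sample content in the same layout (not real results).
sample = io.StringIO(
    "Protein,Score,Number_of_Peptides,Identifier_Type,GroupID,Peptides\n"
    "PROT_A,12.5,2,Reviewed,1,PEPTIDEK,ANOTHERR\n"
    "PROT_B,3.1,1,Unreviewed,2,SINGLEK\n"
)
rows = list(read_non_square(sample))
```

Tools that expect a rectangular table may reject or misalign such a file, which is one reason to parse it row by row as above.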

csv_export_comma_sep_restricted(self, filename_out)

Method that outputs a subset of the passing proteins based on FDR. Only Proteins that pass FDR will be output and only Lead proteins will be output. Proteins in the groups of lead proteins will also be output in the same row as the lead.

This method returns a non-square CSV file.

Parameters:
  • filename_out (str) – Filename for the data to be written to.

Source code in pyproteininference/export.py
def csv_export_comma_sep_restricted(self, filename_out):
    """
    Method that outputs a subset of the passing proteins based on FDR.
    Only Proteins that pass FDR will be output and only Lead proteins will be output.
    Proteins in the groups of lead proteins will also be output in the same row as the lead.

    This method returns a non-square CSV file.

    Args:
        filename_out (str): Filename for the data to be written to.

    """
    protein_objects = self.data.get_protein_objects(fdr_restricted=True)
    protein_export_list = [
        [
            "Protein",
            "Score",
            "Number_of_Peptides",
            "Identifier_Type",
            "GroupID",
            "Other_Potential_Identifiers",
        ]
    ]
    for groups in protein_objects:
        for prots in groups:
            if prots == groups[0]:
                protein_export_list.append([prots.identifier])
                protein_export_list[-1].append(prots.score)
                protein_export_list[-1].append(prots.num_peptides)
                if prots.reviewed:
                    protein_export_list[-1].append("Reviewed")
                else:
                    protein_export_list[-1].append("Unreviewed")
                protein_export_list[-1].append(prots.group_identification)
            else:
                protein_export_list[-1].append(prots.identifier)
    with open(filename_out, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(protein_export_list)
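
The row layout produced above (lead protein fields first, then the identifiers of the remaining group members) can be sketched in isolation as follows. The `Protein` namedtuple is a hypothetical stand-in for the library's Protein objects, and the sketch uses slicing instead of the equality check in the source, which should give the same rows as long as no other group member compares equal to the lead.

```python
from collections import namedtuple

# Hypothetical stand-in carrying only the attributes this export reads.
Protein = namedtuple(
    "Protein", "identifier score num_peptides reviewed group_identification"
)


def comma_sep_rows(groups):
    """Build one row per group: lead fields, then other member identifiers."""
    rows = []
    for group in groups:
        lead, others = group[0], group[1:]
        row = [
            lead.identifier,
            lead.score,
            lead.num_peptides,
            "Reviewed" if lead.reviewed else "Unreviewed",
            lead.group_identification,
        ]
        # Non-lead proteins contribute only their identifier.
        row.extend(protein.identifier for protein in others)
        rows.append(row)
    return rows


# Made-up groups for illustration.
groups = [
    [Protein("A", 10.0, 3, True, 1), Protein("B", 10.0, 3, False, 1)],
    [Protein("C", 2.0, 1, False, 2)],
]
rows = comma_sep_rows(groups)
```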

csv_export_leads_restricted(self, filename_out)

Method that outputs a subset of the passing proteins based on FDR. Only Proteins that pass FDR will be output and only Lead proteins will be output.

This method returns a non-square CSV file.

Parameters:
  • filename_out (str) – Filename for the data to be written to.

Source code in pyproteininference/export.py
def csv_export_leads_restricted(self, filename_out):
    """
    Method that outputs a subset of the passing proteins based on FDR.
    Only Proteins that pass FDR will be output and only Lead proteins will be output.

    This method returns a non-square CSV file.

    Args:
        filename_out (str): Filename for the data to be written to.

    """
    protein_objects = self.data.get_protein_objects(fdr_restricted=True)
    protein_export_list = [
        [
            "Protein",
            "Score",
            "Number_of_Peptides",
            "Identifier_Type",
            "GroupID",
            "Peptides",
        ]
    ]
    for groups in protein_objects:
        protein_export_list.append([groups[0].identifier])
        protein_export_list[-1].append(groups[0].score)
        protein_export_list[-1].append(groups[0].num_peptides)
        if groups[0].reviewed:
            protein_export_list[-1].append("Reviewed")
        else:
            protein_export_list[-1].append("Unreviewed")
        protein_export_list[-1].append(groups[0].group_identification)
        for peps in sorted(groups[0].peptides):
            protein_export_list[-1].append(peps)

    with open(filename_out, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(protein_export_list)

csv_export_q_value_all(self, filename_out)

Method that outputs all proteins with Q values. Non Lead proteins are also output - entire group gets output. Each protein in a group is written on its own row; group members share the group's Q value and GroupID.

This method returns a non-square CSV file.

Parameters:
  • filename_out (str) – Filename for the data to be written to.

Source code in pyproteininference/export.py
def csv_export_q_value_all(self, filename_out):
    """
    Method that outputs all proteins with Q values.
    Non Lead proteins are also output - entire group gets output.
    Each protein in a group is written on its own row; group members share the group's Q value and GroupID.

    This method returns a non-square CSV file.

    Args:
        filename_out (str): Filename for the data to be written to.

    """
    protein_export_list = [
        [
            "Protein",
            "Score",
            "Q_Value",
            "Number_of_Peptides",
            "Identifier_Type",
            "GroupID",
            "Peptides",
        ]
    ]
    for groups in self.data.protein_group_objects:
        for proteins in groups.proteins:
            protein_export_list.append([proteins.identifier])
            protein_export_list[-1].append(proteins.score)
            protein_export_list[-1].append(groups.q_value)
            protein_export_list[-1].append(proteins.num_peptides)
            if proteins.reviewed:
                protein_export_list[-1].append("Reviewed")
            else:
                protein_export_list[-1].append("Unreviewed")
            protein_export_list[-1].append(groups.number_id)
            for peps in sorted(proteins.peptides):
                protein_export_list[-1].append(peps)

    with open(filename_out, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(protein_export_list)
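
As a rough illustration of the loop above, the sketch below builds the same kind of rows from stand-in objects and writes them to an in-memory buffer. `Protein` and `ProteinGroup` are hypothetical namedtuples that carry only the attributes this method reads; they are not the library's classes.

```python
import csv
import io
from collections import namedtuple

# Hypothetical stand-ins for the library's Protein and ProteinGroup objects.
Protein = namedtuple("Protein", "identifier score num_peptides reviewed peptides")
ProteinGroup = namedtuple("ProteinGroup", "proteins q_value number_id")

HEADER = [
    "Protein", "Score", "Q_Value", "Number_of_Peptides",
    "Identifier_Type", "GroupID", "Peptides",
]


def q_value_all_rows(groups):
    """One row per protein; Q value and GroupID come from the group."""
    rows = [HEADER]
    for group in groups:
        for protein in group.proteins:
            rows.append([
                protein.identifier,
                protein.score,
                group.q_value,
                protein.num_peptides,
                "Reviewed" if protein.reviewed else "Unreviewed",
                group.number_id,
                *sorted(protein.peptides),  # one trailing cell per peptide
            ])
    return rows


# Made-up group with a lead protein and one other member.
groups = [
    ProteinGroup(
        proteins=[
            Protein("A", 9.0, 2, True, {"PEPK", "AAAR"}),
            Protein("B", 9.0, 1, False, {"PEPK"}),
        ],
        q_value=0.01,
        number_id=1,
    )
]
rows = q_value_all_rows(groups)
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
```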

csv_export_q_value_comma_sep(self, filename_out)

Method that outputs all lead proteins with Q values. Proteins in the groups of lead proteins will also be output in the same row as the lead.

This method returns a non-square CSV file.

Parameters:
  • filename_out (str) – Filename for the data to be written to.

Source code in pyproteininference/export.py
def csv_export_q_value_comma_sep(self, filename_out):
    """
    Method that outputs all lead proteins with Q values.
    Proteins in the groups of lead proteins will also be output in the same row as the lead.

    This method returns a non-square CSV file.

    Args:
        filename_out (str): Filename for the data to be written to.

    """
    protein_export_list = [
        [
            "Protein",
            "Score",
            "Q_Value",
            "Number_of_Peptides",
            "Identifier_Type",
            "GroupID",
            "Other_Potential_Identifiers",
        ]
    ]
    for groups in self.data.protein_group_objects:
        lead_protein = groups.proteins[0]
        protein_export_list.append([lead_protein.identifier])
        protein_export_list[-1].append(lead_protein.score)
        protein_export_list[-1].append(groups.q_value)
        protein_export_list[-1].append(lead_protein.num_peptides)
        if lead_protein.reviewed:
            protein_export_list[-1].append("Reviewed")
        else:
            protein_export_list[-1].append("Unreviewed")
        protein_export_list[-1].append(groups.number_id)
        for other_prots in groups.proteins[1:]:
            protein_export_list[-1].append(other_prots.identifier)

    with open(filename_out, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(protein_export_list)

csv_export_q_value_leads(self, filename_out)

Method that outputs all lead proteins with Q values.

This method returns a non-square CSV file.

Parameters:
  • filename_out (str) – Filename for the data to be written to.

Source code in pyproteininference/export.py
def csv_export_q_value_leads(self, filename_out):
    """
    Method that outputs all lead proteins with Q values.

    This method returns a non-square CSV file.

    Args:
        filename_out (str): Filename for the data to be written to.

    """
    protein_export_list = [
        [
            "Protein",
            "Score",
            "Q_Value",
            "Number_of_Peptides",
            "Identifier_Type",
            "GroupID",
            "Peptides",
        ]
    ]
    for groups in self.data.protein_group_objects:
        lead_protein = groups.proteins[0]
        protein_export_list.append([lead_protein.identifier])
        protein_export_list[-1].append(lead_protein.score)
        protein_export_list[-1].append(groups.q_value)
        protein_export_list[-1].append(lead_protein.num_peptides)
        if lead_protein.reviewed:
            protein_export_list[-1].append("Reviewed")
        else:
            protein_export_list[-1].append("Unreviewed")
        protein_export_list[-1].append(groups.number_id)
        peptides = lead_protein.peptides
        for peps in sorted(peptides):
            protein_export_list[-1].append(peps)

    with open(filename_out, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(protein_export_list)
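
Because this export writes every lead protein together with its Q value, any cutoff has to be applied downstream. The sketch below shows one way to do that with the standard library. The function name `leads_passing`, the 0.01 threshold, and the sample rows are illustrative assumptions only.

```python
import csv
import io

Q_VALUE_COLUMN = 2  # position of "Q_Value" in the header shown above


def leads_passing(handle, threshold=0.01):
    """Return identifiers of lead proteins with Q value at or below threshold."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    return [row[0] for row in reader if float(row[Q_VALUE_COLUMN]) <= threshold]


# Made-up sample content in the same layout (not real results).
sample = io.StringIO(
    "Protein,Score,Q_Value,Number_of_Peptides,Identifier_Type,GroupID,Peptides\n"
    "PROT_A,20.1,0.0,2,Reviewed,1,AAAR,PEPK\n"
    "PROT_B,4.2,0.05,1,Unreviewed,2,CCCK\n"
)
passing = leads_passing(sample, threshold=0.01)
```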

csv_export_q_value_leads_long(self, filename_out)

Method that outputs all lead proteins with Q values.

This method returns a long formatted result file with one peptide on each row.

Parameters:
  • filename_out (str) – Filename for the data to be written to.

Source code in pyproteininference/export.py
def csv_export_q_value_leads_long(self, filename_out):
    """
    Method that outputs all lead proteins with Q values.

    This method returns a long formatted result file with one peptide on each row.

    Args:
        filename_out (str): Filename for the data to be written to.

    """
    protein_export_list = [
        [
            "Protein",
            "Score",
            "Q_Value",
            "Number_of_Peptides",
            "Identifier_Type",
            "GroupID",
            "Peptides",
        ]
    ]
    for groups in self.data.protein_group_objects:
        lead_protein = groups.proteins[0]
        for peps in sorted(lead_protein.peptides):
            protein_export_list.append([lead_protein.identifier])
            protein_export_list[-1].append(lead_protein.score)
            protein_export_list[-1].append(groups.q_value)
            protein_export_list[-1].append(lead_protein.num_peptides)
            if lead_protein.reviewed:
                protein_export_list[-1].append("Reviewed")
            else:
                protein_export_list[-1].append("Unreviewed")
            protein_export_list[-1].append(groups.number_id)
            protein_export_list[-1].append(peps)

    with open(filename_out, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(protein_export_list)
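
In the long format every row repeats the protein-level columns and carries a single peptide. A minimal sketch of regrouping such rows by protein is shown here. The sample data and the helper name peptides_by_protein are illustrative assumptions.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in the long layout: one peptide per row.
SAMPLE_LONG = (
    "Protein,Score,Q_Value,Number_of_Peptides,Identifier_Type,GroupID,Peptides\n"
    "PROT_A,12.5,0.0,2,Reviewed,1,AAK\n"
    "PROT_A,12.5,0.0,2,Reviewed,1,CCR\n"
    "PROT_B,3.1,0.01,1,Unreviewed,2,DDK\n"
)


def peptides_by_protein(handle):
    """Collect the peptide of every row under its protein identifier."""
    grouped = defaultdict(list)
    for row in csv.DictReader(handle):
        grouped[row["Protein"]].append(row["Peptides"])
    return dict(grouped)


grouped = peptides_by_protein(io.StringIO(SAMPLE_LONG))
print(grouped)
```

Note that the export loop iterates over the peptides of each lead protein, so a lead protein with no peptides would produce no rows in this format.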

csv_export_q_value_leads_peptides(self, filename_out, peptide_delimiter=' ')

Method that outputs all lead proteins with Q values in rectangular format. This method outputs unique peptides per protein.

This method returns a rectangular CSV file.

Parameters:
  • filename_out (str) – Filename for the data to be written to.

  • peptide_delimiter (str) – String to separate peptides by in the "Peptides" column of the csv file

Source code in pyproteininference/export.py
def csv_export_q_value_leads_peptides(self, filename_out, peptide_delimiter=" "):
    """
    Method that outputs all lead proteins with Q values in rectangular format.
    This method outputs unique peptides per protein.

    This method returns a rectangular CSV file.

    Args:
        filename_out (str): Filename for the data to be written to.
        peptide_delimiter (str): String to separate peptides by in the "Peptides" column of the csv file
    """
    protein_export_list = [
        [
            "Protein",
            "Score",
            "Q_Value",
            "Number_of_Peptides",
            "Identifier_Type",
            "GroupID",
            "Peptides",
        ]
    ]
    for groups in self.data.protein_group_objects:
        lead_protein = groups.proteins[0]
        protein_export_list.append([lead_protein.identifier])
        protein_export_list[-1].append(lead_protein.score)
        protein_export_list[-1].append(groups.q_value)
        protein_export_list[-1].append(lead_protein.num_peptides)
        if lead_protein.reviewed:
            protein_export_list[-1].append("Reviewed")
        else:
            protein_export_list[-1].append("Unreviewed")
        protein_export_list[-1].append(groups.number_id)
        peptides = peptide_delimiter.join(list(sorted(lead_protein.peptides)))
        protein_export_list[-1].append(peptides)

    with open(filename_out, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(protein_export_list)

csv_export_q_value_leads_psm_ids(self, filename_out, peptide_delimiter=' ')

Method that outputs all lead proteins with Q values in rectangular format. PSMs are output as their psm_id values, so sequence information is not included.

This method returns a rectangular CSV file.

Parameters:
  • filename_out (str) – Filename for the data to be written to.

  • peptide_delimiter (str) – String to separate psm_ids by in the "Peptides" column of the csv file.

Source code in pyproteininference/export.py
def csv_export_q_value_leads_psm_ids(self, filename_out, peptide_delimiter=" "):
    """
    Method that outputs all lead proteins with Q values in rectangular format.
    PSMs are output as their psm_id values, so sequence information is not included.

    This method returns a rectangular CSV file.

    Args:
        filename_out (str): Filename for the data to be written to.
        peptide_delimiter (str): String to separate psm_ids by in the "Peptides" column of the csv file.
    """
    protein_export_list = [
        [
            "Protein",
            "Score",
            "Q_Value",
            "Number_of_Peptides",
            "Identifier_Type",
            "GroupID",
            "Peptides",
        ]
    ]
    for groups in self.data.protein_group_objects:
        lead_protein = groups.proteins[0]
        protein_export_list.append([lead_protein.identifier])
        protein_export_list[-1].append(lead_protein.score)
        protein_export_list[-1].append(groups.q_value)
        protein_export_list[-1].append(lead_protein.num_peptides)
        if lead_protein.reviewed:
            protein_export_list[-1].append("Reviewed")
        else:
            protein_export_list[-1].append("Unreviewed")
        protein_export_list[-1].append(groups.number_id)
        psms = peptide_delimiter.join(sorted(lead_protein.get_psm_ids()))
        protein_export_list[-1].append(psms)

    with open(filename_out, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(protein_export_list)

csv_export_q_value_leads_psms(self, filename_out, peptide_delimiter=' ')

Method that outputs all lead proteins with Q values in rectangular format. This method outputs all PSMs for the protein, not just unique peptide identifiers.

This method returns a rectangular CSV file.

Parameters:
  • filename_out (str) – Filename for the data to be written to.

  • peptide_delimiter (str) – String to separate peptides by in the "Peptides" column of the csv file.

Source code in pyproteininference/export.py
def csv_export_q_value_leads_psms(self, filename_out, peptide_delimiter=" "):
    """
    Method that outputs all lead proteins with Q values in rectangular format.
    This method outputs all PSMs for the protein, not just unique peptide identifiers.

    This method returns a rectangular CSV file.

    Args:
        filename_out (str): Filename for the data to be written to.
        peptide_delimiter (str): String to separate peptides by in the "Peptides" column of the csv file.
    """
    protein_export_list = [
        [
            "Protein",
            "Score",
            "Q_Value",
            "Number_of_Peptides",
            "Identifier_Type",
            "GroupID",
            "Peptides",
        ]
    ]
    for groups in self.data.protein_group_objects:
        lead_protein = groups.proteins[0]
        protein_export_list.append([lead_protein.identifier])
        protein_export_list[-1].append(lead_protein.score)
        protein_export_list[-1].append(groups.q_value)
        protein_export_list[-1].append(lead_protein.num_peptides)
        if lead_protein.reviewed:
            protein_export_list[-1].append("Reviewed")
        else:
            protein_export_list[-1].append("Unreviewed")
        protein_export_list[-1].append(groups.number_id)
        psms = peptide_delimiter.join(sorted([x.non_flanking_peptide for x in lead_protein.psms]))
        protein_export_list[-1].append(psms)

    with open(filename_out, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(protein_export_list)
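
The peptides and psms exports differ only in what is joined into the "Peptides" cell: unique peptide strings in one case, one entry per PSM in the other. The sketch below illustrates that difference with stand-in data. The Psm namedtuple and the two helper functions are hypothetical, and it is an assumption that the protein's peptides attribute holds the unique non-flanking peptide strings.

```python
from collections import namedtuple

# Hypothetical stand-in for a Psm object; only the attribute used by the export.
Psm = namedtuple("Psm", ["non_flanking_peptide"])

psms = [Psm("CCR"), Psm("AAK"), Psm("AAK")]
# Assumption: the protein's peptides are the unique non-flanking peptide strings.
peptides = {psm.non_flanking_peptide for psm in psms}


def peptides_cell(peptides, delimiter=" "):
    """Join unique peptides, as csv_export_q_value_leads_peptides does."""
    return delimiter.join(sorted(peptides))


def psms_cell(psms, delimiter=" "):
    """Join one entry per PSM, as csv_export_q_value_leads_psms does."""
    return delimiter.join(sorted(psm.non_flanking_peptide for psm in psms))


print(peptides_cell(peptides))  # each peptide once
print(psms_cell(psms))          # repeated peptides are kept
```

The default delimiter is a single space, so the cell can be split back with str.split as long as the joined strings contain no spaces themselves.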

export_to_csv(self, output_filename=None, directory=None, export_type='q_value')

Method that dispatches to one of the many export methods given an export_type input.

The file path is determined from the directory argument and information from the DataStore object.

This method sets the filepath attribute.

Parameters:
  • output_filename (str) – File path to write to. If None, a filename is generated automatically and the file is written to directory.

  • directory (str) – Directory to write the result file to. If None, will write to current working directory.

  • export_type (str) – Must be a value in EXPORT_TYPES and determines the output format.

Examples:

>>> export = pyproteininference.export.Export(data=data)
>>> export.export_to_csv(output_filename=None, directory="/path/to/output/dir/", export_type="psms")

Source code in pyproteininference/export.py
def export_to_csv(self, output_filename=None, directory=None, export_type="q_value"):
    """
    Method that dispatches to one of the many export methods given an export_type input.

    The file path is determined from the directory argument and information from the
    [DataStore object][pyproteininference.datastore.DataStore].

    This method sets the `filepath` attribute.

    Args:
        output_filename (str): File path to write to. If None, a filename is generated automatically and
            the file is written to directory.
        directory (str): Directory to write the result file to. If None, will write to current working directory.
        export_type (str): Must be a value in `EXPORT_TYPES` and determines the output format.

    Example:
        >>> export = pyproteininference.export.Export(data=data)
        >>> export.export_to_csv(output_filename=None, directory="/path/to/output/dir/", export_type="psms")

    """

    if not directory:
        directory = os.getcwd()

    data = self.data
    tag = data.parameter_file_object.tag

    if self.EXPORT_LEADS == export_type:
        filename = "{}_leads_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
        complete_filepath = os.path.join(directory, filename)
        if output_filename:
            complete_filepath = output_filename
        logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
        self.csv_export_leads_restricted(filename_out=complete_filepath)

    elif self.EXPORT_ALL == export_type:
        filename = "{}_all_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
        complete_filepath = os.path.join(directory, filename)
        if output_filename:
            complete_filepath = output_filename
        logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
        self.csv_export_all_restricted(complete_filepath)

    elif self.EXPORT_COMMA_SEP == export_type:
        filename = "{}_comma_sep_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
        complete_filepath = os.path.join(directory, filename)
        if output_filename:
            complete_filepath = output_filename
        logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
        self.csv_export_comma_sep_restricted(complete_filepath)

    elif self.EXPORT_Q_VALUE_COMMA_SEP == export_type:
        filename = "{}_q_value_comma_sep_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
        complete_filepath = os.path.join(directory, filename)
        if output_filename:
            complete_filepath = output_filename
        logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
        self.csv_export_q_value_comma_sep(complete_filepath)

    elif self.EXPORT_Q_VALUE == export_type:
        filename = "{}_q_value_leads_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
        complete_filepath = os.path.join(directory, filename)
        if output_filename:
            complete_filepath = output_filename
        logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
        self.csv_export_q_value_leads(complete_filepath)

    elif self.EXPORT_Q_VALUE_ALL == export_type:
        filename = "{}_q_value_all_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
        complete_filepath = os.path.join(directory, filename)
        if output_filename:
            complete_filepath = output_filename
        logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
        self.csv_export_q_value_all(complete_filepath)

    elif self.EXPORT_PEPTIDES == export_type:
        filename = "{}_q_value_leads_peptides_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
        complete_filepath = os.path.join(directory, filename)
        if output_filename:
            complete_filepath = output_filename
        logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
        self.csv_export_q_value_leads_peptides(complete_filepath)

    elif self.EXPORT_PSMS == export_type:
        filename = "{}_q_value_leads_psms_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
        complete_filepath = os.path.join(directory, filename)
        if output_filename:
            complete_filepath = output_filename
        logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
        self.csv_export_q_value_leads_psms(complete_filepath)

    elif self.EXPORT_PSM_IDS == export_type:
        filename = "{}_q_value_leads_psm_ids_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
        complete_filepath = os.path.join(directory, filename)
        if output_filename:
            complete_filepath = output_filename
        logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
        self.csv_export_q_value_leads_psm_ids(complete_filepath)

    elif self.EXPORT_LONG == export_type:
        filename = "{}_q_value_long_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
        complete_filepath = os.path.join(directory, filename)
        if output_filename:
            complete_filepath = output_filename
        logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
        self.csv_export_q_value_leads_long(complete_filepath)

    else:
        # Unrecognized export_type: no file is written, only a default path is recorded.
        complete_filepath = "protein_inference_results.csv"

    self.filepath = complete_filepath
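
Every branch above resolves the output path the same way: build a default filename from the tag and score names, join it to the directory, and let output_filename override the result. A minimal sketch of that resolution as a lookup table follows. The function name resolve_export_path and the dictionary FILENAME_INFIX are hypothetical, and the dictionary keys are assumed values for export_type ("q_value" and "psms" appear in this section, "long" is a guess). The filename infixes are copied from the branches above.

```python
import os

# Assumed export_type keys mapped to the filename infix used by the matching branch.
FILENAME_INFIX = {
    "q_value": "q_value_leads",
    "psms": "q_value_leads_psms",
    "long": "q_value_long",  # assumed key for the EXPORT_LONG branch
}


def resolve_export_path(
    export_type,
    tag,
    short_protein_score,
    psm_score,
    directory=None,
    output_filename=None,
):
    """Return the path export_to_csv would record for a given export type."""
    if export_type not in FILENAME_INFIX:
        # Mirrors the final else branch: a fixed default path.
        return "protein_inference_results.csv"
    if output_filename:
        # An explicit filename overrides the generated one.
        return output_filename
    if not directory:
        directory = os.getcwd()
    filename = "{}_{}_{}_{}.csv".format(
        tag, FILENAME_INFIX[export_type], short_protein_score, psm_score
    )
    return os.path.join(directory, filename)


print(resolve_export_path("q_value", "run1", "mult", "pep", directory="out"))
```

A table like this keeps the naming rule in one place, at the cost of hiding which writer method each export type calls. It is offered only as a way to read the dispatch logic, not as the library's implementation.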

heuristic

HeuristicPipeline (ProteinInferencePipeline)

This is the Protein Inference Heuristic class which houses the logic to run the Protein Inference Heuristic method to determine the best inference method for the given data. Logic is executed in the execute method.

Attributes:

Name Type Description
parameter_file str

Path to Protein Inference Yaml Parameter File.

database_file str

Path to Fasta database used in proteomics search.

target_files str/list

Path to Target Psm File (Or a list of files).

decoy_files str/list

Path to Decoy Psm File (Or a list of files).

combined_files str/list

Path to Combined Psm File (Or a list of files).

target_directory str

Path to Directory containing Target Psm Files.

decoy_directory str

Path to Directory containing Decoy Psm Files.

combined_directory str

Path to Directory containing Combined Psm Files.

output_directory str

Path to Directory where output will be written.

output_filename str

Path to Filename where output will be written. Will override output_directory.

id_splitting bool

True/False on whether to split protein IDs in the digest. Advanced usage only.

append_alt_from_db bool

True/False on whether to append alternative proteins from the DB digestion in Reader class.

pdf_filename str

Filepath to be written to by Heuristic Plotting method. This is optional and a default filename will be created in output_directory if this is left as None.

inference_method_list list

List of inference methods used in heuristic determination.

datastore_dict dict

Dictionary of DataStore objects generated in heuristic determination with the inference method as the key of each entry.

selected_methods list

A list of string representations of the inference methods selected by the heuristic.

selected_datastores dict

A dictionary of DataStore objects selected by the heuristic.

output_type str

How to output results. Can either be "all" or "optimal". Will either output all results or will only output the optimal results.
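
The output_type, output_filename, and output_directory attributes together decide how result files are named. The sketch below restates the naming rule that _write_all_results applies in the source listing: methods chosen by the heuristic get an "optimal_method" suffix, and a user-supplied filename is prefixed with the method label. The function name result_filename and the method strings in the usage lines are illustrative assumptions.

```python
import os


def result_filename(
    method,
    selected_methods,
    tag,
    short_protein_score,
    psm_score,
    output_directory=None,
    output_filename=None,
):
    """Build the per-method result path following the rule in _write_all_results."""
    if method in selected_methods:
        label = "{}_{}".format(method, "optimal_method")
    else:
        label = method
    if output_filename:
        # Keep the user's directory, prefix the filename with the method label.
        path, filename = os.path.split(output_filename)
        return os.path.join(path, "{}_{}".format(label, filename))
    # Assumption: a missing output directory falls back to the working directory.
    directory = output_directory or os.getcwd()
    return os.path.join(
        directory,
        "{}_{}_{}_{}_{}".format(
            label, tag, short_protein_score, psm_score, "protein_inference_results.csv"
        ),
    )


# Hypothetical method names; the real values come from the Inference class.
print(result_filename("parsimony", ["parsimony"], "run1", "mult", "pep", output_filename="results/out.csv"))
print(result_filename("inclusion", ["parsimony"], "run1", "mult", "pep", output_directory="results"))
```

With output_type set to "optimal", only the methods in selected_methods are written, so every file written in that mode carries the "optimal_method" suffix.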

Source code in pyproteininference/heuristic.py
class HeuristicPipeline(ProteinInferencePipeline):
    """
    This is the Protein Inference Heuristic class which houses the logic to run the Protein Inference Heuristic method
    to determine the best inference method for the given data.
    Logic is executed in the [execute][pyproteininference.heuristic.HeuristicPipeline.execute] method.

    Attributes:
        parameter_file (str): Path to Protein Inference Yaml Parameter File.
        database_file (str): Path to Fasta database used in proteomics search.
        target_files (str/list): Path to Target Psm File (Or a list of files).
        decoy_files (str/list): Path to Decoy Psm File (Or a list of files).
        combined_files (str/list): Path to Combined Psm File (Or a list of files).
        target_directory (str): Path to Directory containing Target Psm Files.
        decoy_directory (str): Path to Directory containing Decoy Psm Files.
        combined_directory (str): Path to Directory containing Combined Psm Files.
        output_directory (str): Path to Directory where output will be written.
        output_filename (str): Path to Filename where output will be written. Will override output_directory.
        id_splitting (bool): True/False on whether to split protein IDs in the digest.
            Advanced usage only.
        append_alt_from_db (bool): True/False on whether to append
            alternative proteins from the DB digestion in Reader class.
        pdf_filename (str): Filepath to be written to by Heuristic Plotting method.
            This is optional and a default filename will be created in output_directory if this is left as None.
        inference_method_list (list): List of inference methods used in heuristic determination.
        datastore_dict (dict): Dictionary of [DataStore][pyproteininference.datastore.DataStore]
            objects generated in heuristic determination with the inference method as the key of each entry.
        selected_methods (list): A list of string representations of the inference methods selected by the
            heuristic.
        selected_datastores (dict):
            A dictionary of [DataStore][pyproteininference.datastore.DataStore] objects selected by the
            heuristic.
        output_type (str): How to output results. Can either be "all" or "optimal". Will either output all results
            or will only output the optimal results.

    """

    RATIO_CONSTANT = 2
    OUTPUT_TYPES = ["all", "optimal"]

    def __init__(
        self,
        parameter_file=None,
        database_file=None,
        target_files=None,
        decoy_files=None,
        combined_files=None,
        target_directory=None,
        decoy_directory=None,
        combined_directory=None,
        output_directory=None,
        output_filename=None,
        id_splitting=False,
        append_alt_from_db=True,
        pdf_filename=None,
        output_type="all",
    ):
        """

        Args:
            parameter_file (str): Path to Protein Inference Yaml Parameter File.
            database_file (str): Path to Fasta database used in proteomics search.
            target_files (str/list): Path to Target Psm File (Or a list of files).
            decoy_files (str/list): Path to Decoy Psm File (Or a list of files).
            combined_files (str/list): Path to Combined Psm File (Or a list of files).
            target_directory (str): Path to Directory containing Target Psm Files.
            decoy_directory (str): Path to Directory containing Decoy Psm Files.
            combined_directory (str): Path to Directory containing Combined Psm Files.
            output_directory (str): Path to Directory where output will be written.
            output_filename (str): Path to Filename where output will be written.
                Will override output_directory.
            id_splitting (bool): True/False on whether to split protein IDs in the digest.
                Advanced usage only.
            append_alt_from_db (bool): True/False on whether to append alternative proteins
                from the DB digestion in Reader class.
            pdf_filename (str): Filepath to be written to by Heuristic Plotting method.
                This is optional and a default filename will be created in output_directory if this is left as None.
            output_type (str): How to output results. Can either be "all" or "optimal". Will either output all results
                or will only output the optimal results.

        Returns:
            HeuristicPipeline: [HeuristicPipeline][pyproteininference.heuristic.HeuristicPipeline] object

        Example:
            >>> heuristic = pyproteininference.heuristic.HeuristicPipeline(
            >>>     parameter_file=yaml_params,
            >>>     database_file=database,
            >>>     target_files=target,
            >>>     decoy_files=decoy,
            >>>     combined_files=combined_files,
            >>>     target_directory=target_directory,
            >>>     decoy_directory=decoy_directory,
            >>>     combined_directory=combined_directory,
            >>>     output_directory=dir_name,
            >>>     output_filename=output_filename,
            >>>     append_alt_from_db=append_alt,
            >>>     pdf_filename=pdf_filename,
            >>>     output_type="all"
            >>> )
        """

        self.parameter_file = parameter_file
        self.database_file = database_file
        self.target_files = target_files
        self.decoy_files = decoy_files
        self.combined_files = combined_files
        self.target_directory = target_directory
        self.decoy_directory = decoy_directory
        self.combined_directory = combined_directory
        self.output_directory = output_directory
        self.output_filename = output_filename
        self.id_splitting = id_splitting
        self.append_alt_from_db = append_alt_from_db
        self.output_type = output_type
        if self.output_type not in self.OUTPUT_TYPES:
            raise ValueError("The variable output_type must be set to either 'all' or 'optimal'")
        if not pdf_filename:
            if self.output_directory and not self.output_filename:
                self.pdf_filename = os.path.join(self.output_directory, "heuristic_plot.pdf")
            elif self.output_filename:
                self.pdf_filename = os.path.join(os.path.split(self.output_filename)[0], "heuristic_plot.pdf")
            else:
                self.pdf_filename = os.path.join(os.getcwd(), "heuristic_plot.pdf")

        else:
            self.pdf_filename = pdf_filename

        self.inference_method_list = [
            Inference.INCLUSION,
            Inference.EXCLUSION,
            Inference.PARSIMONY,
            Inference.PEPTIDE_CENTRIC,
        ]
        self.datastore_dict = {}
        self.selected_methods = None
        self.selected_datastores = {}

        self._validate_input()

        self._set_output_directory()

        self._log_append_alt_from_db()

    def execute(self, fdr_threshold=0.05):
        """
        This method is the main driver of the heuristic method.
        This method calls other classes and methods that make up the heuristic pipeline.
        This includes but is not limited to:

        1. Looping over the main inference methods: Inclusion, Exclusion, Parsimony, and Peptide Centric.
        2. Determining the optimal inference method based on the input data as well as the database file.
        3. Outputting the results and indicating the optimal results.

        Args:
            fdr_threshold (float): The q-value/FDR threshold on which the heuristic method bases its calculations.

        Returns:
            None:

        Example:
            >>> heuristic = pyproteininference.heuristic.HeuristicPipeline(
            >>>     parameter_file=yaml_params,
            >>>     database_file=database,
            >>>     target_files=target,
            >>>     decoy_files=decoy,
            >>>     combined_files=combined_files,
            >>>     target_directory=target_directory,
            >>>     decoy_directory=decoy_directory,
            >>>     combined_directory=combined_directory,
            >>>     output_directory=dir_name,
            >>>     output_filename=output_filename,
            >>>     append_alt_from_db=append_alt,
            >>>     pdf_filename=pdf_filename,
            >>>     output_type="all"
            >>> )
            >>> heuristic.execute(fdr_threshold=0.05)

        """

        pyproteininference_parameters = pyproteininference.parameters.ProteinInferenceParameter(
            yaml_param_filepath=self.parameter_file
        )

        digest = pyproteininference.in_silico_digest.PyteomicsDigest(
            database_path=self.database_file,
            digest_type=pyproteininference_parameters.digest_type,
            missed_cleavages=pyproteininference_parameters.missed_cleavages,
            reviewed_identifier_symbol=pyproteininference_parameters.reviewed_identifier_symbol,
            max_peptide_length=pyproteininference_parameters.restrict_peptide_length,
            id_splitting=self.id_splitting,
        )
        if self.database_file:
            logger.info("Running In Silico Database Digest on file {}".format(self.database_file))
            digest.digest_fasta_database()
        else:
            logger.warning(
                "No Database File provided, Skipping database digest and only taking protein-peptide mapping from the "
                "input files."
            )

        for inference_method in self.inference_method_list:

            method_specific_parameters = copy.deepcopy(pyproteininference_parameters)

            logger.info("Overriding inference type {}".format(method_specific_parameters.inference_type))

            method_specific_parameters.inference_type = inference_method

            logger.info("New inference type {}".format(method_specific_parameters.inference_type))
            logger.info("FDR Threshold Set to {}".format(method_specific_parameters.fdr))

            reader = pyproteininference.reader.GenericReader(
                target_file=self.target_files,
                decoy_file=self.decoy_files,
                combined_files=self.combined_files,
                parameter_file_object=method_specific_parameters,
                digest=digest,
                append_alt_from_db=self.append_alt_from_db,
            )
            reader.read_psms()

            data = pyproteininference.datastore.DataStore(reader=reader, digest=digest)

            data.restrict_psm_data()

            data.recover_mapping()

            data.create_scoring_input()

            if method_specific_parameters.inference_type == Inference.EXCLUSION:
                data.exclude_non_distinguishing_peptides()

            score = pyproteininference.scoring.Score(data=data)
            score.score_psms(score_method=method_specific_parameters.protein_score)

            if method_specific_parameters.picker:
                data.protein_picker()

            pyproteininference.inference.Inference.run_inference(data=data, digest=digest)

            data.calculate_q_values()

            self.datastore_dict[inference_method] = data

        self.selected_methods = self.determine_optimal_inference_method(
            false_discovery_rate_threshold=fdr_threshold, pdf_filename=self.pdf_filename
        )
        self.selected_datastores = {x: self.datastore_dict[x] for x in self.selected_methods}

        if self.output_type == "all":
            self._write_all_results(parameters=method_specific_parameters)
        else:
            # "optimal" is the only other value allowed by OUTPUT_TYPES
            self._write_optimal_results(parameters=method_specific_parameters)

    def generate_roc_plot(self, fdr_max=0.2, pdf_filename=None):
        """
        This method produces a PDF ROC plot overlaying the 4 inference methods that are part of the heuristic algorithm.

        Args:
            fdr_max (float): Max FDR to display on the plot.
            pdf_filename (str): Filename to write roc plot to.

        Returns:
            None:

        """
        f = plt.figure()
        for inference_method in self.datastore_dict.keys():
            fdr_vs_target_hits = self.datastore_dict[inference_method].generate_fdr_vs_target_hits(fdr_max=fdr_max)
            fdrs = [x[0] for x in fdr_vs_target_hits]
            target_hits = [x[1] for x in fdr_vs_target_hits]
            plt.plot(fdrs, target_hits, '-', label=inference_method.replace("_", " "))
            target_fdr = self.datastore_dict[inference_method].parameter_file_object.fdr
            if inference_method in self.selected_methods:
                best_value = min(fdrs, key=lambda x: abs(x - target_fdr))
                best_index = fdrs.index(best_value)
                best_target_hit_value = target_hits[best_index]  # noqa F841

        plt.axvline(target_fdr, color="black", linestyle='--', alpha=0.75, label="Target FDR")
        plt.legend()
        plt.xlabel('Decoy FDR')
        plt.ylabel('Target Protein Hits')
        plt.xlim([-0.01, fdr_max])
        plt.legend(loc='lower right')
        plt.title("FDR vs Target Protein Hits per Inference Method")
        if pdf_filename:
            logger.info("Writing ROC plot to: {}".format(pdf_filename))
            f.savefig(pdf_filename)
        plt.close()

    def _write_all_results(self, parameters):
        """
        Internal method that loops over all results and writes them out.
        """
        for method in list(self.datastore_dict.keys()):
            datastore = self.datastore_dict[method]
            if method in self.selected_methods:
                inference_method_string = "{}_{}".format(method, "optimal_method")
            else:
                inference_method_string = method
            if not self.output_filename and self.output_directory:
                # If a filename is not provided then construct one using output_directory
                # Note: output_directory is always set; if passed as None it defaults to the cwd
                inference_filename = os.path.join(
                    self.output_directory,
                    "{}_{}_{}_{}_{}".format(
                        inference_method_string,
                        parameters.tag,
                        datastore.short_protein_score,
                        datastore.psm_score,
                        "protein_inference_results.csv",
                    ),
                )
            if self.output_filename:
                # If the user specified an output filename then split it apart and insert the inference method
                # Then reconstruct the file
                split = os.path.split(self.output_filename)
                path = split[0]
                filename = split[1]
                inference_filename = os.path.join(path, "{}_{}".format(inference_method_string, filename))
            export = pyproteininference.export.Export(data=self.datastore_dict[method])
            export.export_to_csv(
                output_filename=inference_filename,
                directory=self.output_directory,
                export_type=parameters.export,
            )

    def _write_optimal_results(self, parameters):
        """
        Internal method that writes out the optimized results.
        """

        for method in self.selected_methods:
            datastore = self.datastore_dict[method]
            inference_method_string = "{}_{}".format(method, "optimal_method")
            if not self.output_filename and self.output_directory:
                # If a filename is not provided then construct one using output_directory
                # Note: output_directory is always set; if passed as None it defaults to the cwd
                inference_filename = os.path.join(
                    self.output_directory,
                    "{}_{}_{}_{}_{}".format(
                        inference_method_string,
                        parameters.tag,
                        datastore.short_protein_score,
                        datastore.psm_score,
                        "protein_inference_results.csv",
                    ),
                )
            if self.output_filename:
                # If the user specified an output filename then split it apart and insert the inference method
                # Then reconstruct the file
                split = os.path.split(self.output_filename)
                path = split[0]
                filename = split[1]
                inference_filename = os.path.join(path, "{}_{}".format(inference_method_string, filename))
            export = pyproteininference.export.Export(data=self.selected_datastores[method])
            export.export_to_csv(
                output_filename=inference_filename,
                directory=self.output_directory,
                export_type=parameters.export,
            )

    def determine_optimal_inference_method(
        self,
        false_discovery_rate_threshold=0.05,
        upper_empirical_threshold=1,
        lower_empirical_threshold=0.5,
        pdf_filename=None,
    ):
        """
        This method determines the optimal inference method from Inclusion, Exclusion, Parsimony, Peptide-Centric.

        Args:
            false_discovery_rate_threshold (float): The FDR threshold to use in the heuristic algorithm.
                This parameter determines the maximum FDR used when creating a range of finite FDR values.
            upper_empirical_threshold (float): Upper threshold used as the inclusion/exclusion cutoff for
                the heuristic algorithm.
            lower_empirical_threshold (float): Lower threshold used as the parsimony/peptide centric cutoff for
                the heuristic algorithm.
            pdf_filename (str): Filename to write heuristic density plot to.


        Returns:
            list: List of string representations of the recommended inference methods.

        """

        # Get the number of passing proteins
        number_stdev_from_mean_dict = {}
        fdrs = [false_discovery_rate_threshold * 0.01 * x for x in range(100)]
        for fdr in fdrs:
            stdev_from_mean = self.determine_number_stdev_from_mean(false_discovery_rate=fdr)
            number_stdev_from_mean_dict[fdr] = stdev_from_mean

        stdev_collection = collections.defaultdict(list)
        for fdr in fdrs:
            for key in number_stdev_from_mean_dict[fdr]:
                stdev_collection[key].append(number_stdev_from_mean_dict[fdr][key])

        heuristic_scores = self.generate_density_plot(
            number_stdevs_from_mean=stdev_collection, pdf_filename=pdf_filename
        )

        # Apply conditional statement with lower and upper thresholds
        if (
            heuristic_scores[Inference.PARSIMONY] <= lower_empirical_threshold
            or heuristic_scores[Inference.PEPTIDE_CENTRIC] <= lower_empirical_threshold
        ):
            # If parsimony or peptide centric are less than the lower empirical threshold
            # Then select the best method of the two
            logger.info(
                "Either parsimony {} or peptide centric {} pass empirical threshold {}. "
                "Selecting the best method of the two.".format(
                    heuristic_scores[Inference.PARSIMONY],
                    heuristic_scores[Inference.PEPTIDE_CENTRIC],
                    lower_empirical_threshold,
                )
            )
            sub_dict = {
                Inference.PARSIMONY: heuristic_scores[Inference.PARSIMONY],
                Inference.PEPTIDE_CENTRIC: heuristic_scores[Inference.PEPTIDE_CENTRIC],
            }

            if (
                heuristic_scores[Inference.PARSIMONY] <= lower_empirical_threshold
                and heuristic_scores[Inference.PEPTIDE_CENTRIC] <= lower_empirical_threshold
            ):
                # If both are under the threshold return both
                selected_methods = [Inference.PARSIMONY, Inference.PEPTIDE_CENTRIC]

            else:
                selected_methods = [min(sub_dict, key=sub_dict.get)]

        # If the above condition does not apply
        elif (
            heuristic_scores[Inference.EXCLUSION] <= upper_empirical_threshold
            or heuristic_scores[Inference.INCLUSION] <= upper_empirical_threshold
        ):
            # If exclusion or inclusion are less than the upper empirical threshold
            # Then select the best method of the two
            logger.info(
                "Either inclusion {} or exclusion {} pass empirical threshold {}. "
                "Selecting the best method of the two.".format(
                    heuristic_scores[Inference.INCLUSION],
                    heuristic_scores[Inference.EXCLUSION],
                    upper_empirical_threshold,
                )
            )
            sub_dict = {
                Inference.EXCLUSION: heuristic_scores[Inference.EXCLUSION],
                Inference.INCLUSION: heuristic_scores[Inference.INCLUSION],
            }

            if (
                heuristic_scores[Inference.EXCLUSION] <= upper_empirical_threshold
                and heuristic_scores[Inference.INCLUSION] <= upper_empirical_threshold
            ):
                # If both are under the threshold return both
                selected_methods = [Inference.INCLUSION, Inference.EXCLUSION]

            else:
                selected_methods = [min(sub_dict, key=sub_dict.get)]

        else:
            # If we have no conditional scenarios...
            # Select the best method
            logger.info("No methods pass empirical thresholds, selecting the best method")
            selected_methods = [min(heuristic_scores, key=heuristic_scores.get)]

        logger.info("Method(s) {} selected with the heuristic algorithm".format(", ".join(selected_methods)))
        return selected_methods

    def generate_density_plot(self, number_stdevs_from_mean, pdf_filename=None):
        """
        This method produces a PDF density plot overlaying the 4 inference methods that are part of the heuristic algorithm.

        Args:
            number_stdevs_from_mean (dict): a dictionary of the number of standard deviations from the mean per
                inference method for a range of FDRs.
            pdf_filename (str): Filename to write heuristic density plot to.

        Returns:
            dict: a dictionary of heuristic scores per inference method. Each score is the absolute
                position of the peak of that method's density plot.

        """
        f = plt.figure()

        heuristic_scores = {}
        for method in number_stdevs_from_mean:
            readible_method_name = Inference.INFERENCE_NAME_MAP[method]
            kwargs = dict(histtype='stepfilled', alpha=0.3, density=True, bins=40, ec="k", label=readible_method_name)
            x, y, _ = plt.hist(number_stdevs_from_mean[method], **kwargs)
            center = y[list(x).index(max(x))]
            heuristic_scores[method] = abs(center)

        plt.axvline(0, color="black", linestyle='--', alpha=0.75)
        plt.title("Density Plot of the Number of Standard Deviations from the Mean")
        plt.xlabel('Number of Standard Deviations from the Mean')
        plt.ylabel('Number of Observations')
        plt.legend(loc='upper right')
        if pdf_filename:
            logger.info("Writing Heuristic Density plot to: {}".format(pdf_filename))
            f.savefig(pdf_filename)
        else:
            plt.show()
        plt.close()

        logger.info("Heuristic Scores")
        logger.info(heuristic_scores)

        return heuristic_scores

    def determine_number_stdev_from_mean(self, false_discovery_rate):
        """
        This method calculates the mean number of proteins identified at a specific FDR across all
        4 methods and then, for each method, calculates the number of standard deviations
        from that mean.

        Args:
            false_discovery_rate (float): The false discovery rate used as a cutoff for calculations.

        Returns:
            dict: a dictionary of the number of standard deviations away from the mean per inference method.

        """

        filtered_protein_objects = {
            x: self.datastore_dict[x].get_protein_objects(
                fdr_restricted=True, false_discovery_rate=false_discovery_rate
            )
            for x in self.datastore_dict.keys()
        }
        number_passing_proteins = {x: len(filtered_protein_objects[x]) for x in filtered_protein_objects.keys()}

        # Calculate how similar the number of passing proteins is for each method
        all_values = [x for x in number_passing_proteins.values()]
        mean = numpy.mean(all_values)
        standard_deviation = statistics.stdev(all_values)
        number_stdev_from_mean_dict = {}
        for key in number_passing_proteins.keys():
            cur_value = number_passing_proteins[key]
            number_stdev_from_mean_dict[key] = (cur_value - mean) / standard_deviation

        return number_stdev_from_mean_dict

__init__(self, parameter_file=None, database_file=None, target_files=None, decoy_files=None, combined_files=None, target_directory=None, decoy_directory=None, combined_directory=None, output_directory=None, output_filename=None, id_splitting=False, append_alt_from_db=True, pdf_filename=None, output_type='all')

Parameters:
  • parameter_file (str) – Path to Protein Inference Yaml Parameter File.

  • database_file (str) – Path to Fasta database used in proteomics search.

  • target_files (str/list) – Path to Target Psm File (Or a list of files).

  • decoy_files (str/list) – Path to Decoy Psm File (Or a list of files).

  • combined_files (str/list) – Path to Combined Psm File (Or a list of files).

  • target_directory (str) – Path to Directory containing Target Psm Files.

  • decoy_directory (str) – Path to Directory containing Decoy Psm Files.

  • combined_directory (str) – Path to Directory containing Combined Psm Files.

  • output_directory (str) – Path to Directory where output will be written.

  • output_filename (str) – Path to Filename where output will be written. Will override output_directory.

  • id_splitting (bool) – True/False on whether to split protein IDs in the digest. Advanced usage only.

  • append_alt_from_db (bool) – True/False on whether to append alternative proteins from the DB digestion in Reader class.

  • pdf_filename (str) – Filepath to be written to by the heuristic plotting method. This is optional; if left as None, a default filename is created in output_directory.

  • output_type (str) – How to output results. Can either be "all" or "optimal". Will either output all results or will only output the optimal results.

Returns:
  • HeuristicPipeline – HeuristicPipeline object

Examples:

>>> heuristic = pyproteininference.heuristic.HeuristicPipeline(
>>>     parameter_file=yaml_params,
>>>     database_file=database,
>>>     target_files=target,
>>>     decoy_files=decoy,
>>>     combined_files=combined_files,
>>>     target_directory=target_directory,
>>>     decoy_directory=decoy_directory,
>>>     combined_directory=combined_directory,
>>>     output_directory=dir_name,
>>>     output_filename=output_filename,
>>>     append_alt_from_db=append_alt,
>>>     pdf_filename=pdf_filename,
>>>     output_type="all"
>>> )
Source code in pyproteininference/heuristic.py
def __init__(
    self,
    parameter_file=None,
    database_file=None,
    target_files=None,
    decoy_files=None,
    combined_files=None,
    target_directory=None,
    decoy_directory=None,
    combined_directory=None,
    output_directory=None,
    output_filename=None,
    id_splitting=False,
    append_alt_from_db=True,
    pdf_filename=None,
    output_type="all",
):
    """

    Args:
        parameter_file (str): Path to Protein Inference Yaml Parameter File.
        database_file (str): Path to Fasta database used in proteomics search.
        target_files (str/list): Path to Target Psm File (Or a list of files).
        decoy_files (str/list): Path to Decoy Psm File (Or a list of files).
        combined_files (str/list): Path to Combined Psm File (Or a list of files).
        target_directory (str): Path to Directory containing Target Psm Files.
        decoy_directory (str): Path to Directory containing Decoy Psm Files.
        combined_directory (str): Path to Directory containing Combined Psm Files.
        output_directory (str): Path to Directory where output will be written.
        output_filename (str): Path to Filename where output will be written.
            Will override output_directory.
        id_splitting (bool): True/False on whether to split protein IDs in the digest.
            Advanced usage only.
        append_alt_from_db (bool): True/False on whether to append alternative proteins
            from the DB digestion in Reader class.
        pdf_filename (str): Filepath to be written to by the heuristic plotting method.
            This is optional; if left as None, a default filename is created in output_directory.
        output_type (str): How to output results. Can either be "all" or "optimal". Will either output all results
                    or will only output the optimal results.

    Returns:
        HeuristicPipeline: [HeuristicPipeline][pyproteininference.heuristic.HeuristicPipeline] object

    Example:
        >>> heuristic = pyproteininference.heuristic.HeuristicPipeline(
        >>>     parameter_file=yaml_params,
        >>>     database_file=database,
        >>>     target_files=target,
        >>>     decoy_files=decoy,
        >>>     combined_files=combined_files,
        >>>     target_directory=target_directory,
        >>>     decoy_directory=decoy_directory,
        >>>     combined_directory=combined_directory,
        >>>     output_directory=dir_name,
        >>>     output_filename=output_filename,
        >>>     append_alt_from_db=append_alt,
        >>>     pdf_filename=pdf_filename,
        >>>     output_type="all"
        >>> )
    """

    self.parameter_file = parameter_file
    self.database_file = database_file
    self.target_files = target_files
    self.decoy_files = decoy_files
    self.combined_files = combined_files
    self.target_directory = target_directory
    self.decoy_directory = decoy_directory
    self.combined_directory = combined_directory
    self.output_directory = output_directory
    self.output_filename = output_filename
    self.id_splitting = id_splitting
    self.append_alt_from_db = append_alt_from_db
    self.output_type = output_type
    if self.output_type not in self.OUTPUT_TYPES:
        raise ValueError("The variable output_type must be set to either 'all' or 'optimal'")
    if not pdf_filename:
        if self.output_directory and not self.output_filename:
            self.pdf_filename = os.path.join(self.output_directory, "heuristic_plot.pdf")
        elif self.output_filename:
            self.pdf_filename = os.path.join(os.path.split(self.output_filename)[0], "heuristic_plot.pdf")
        else:
            self.pdf_filename = os.path.join(os.getcwd(), "heuristic_plot.pdf")

    else:
        self.pdf_filename = pdf_filename

    self.inference_method_list = [
        Inference.INCLUSION,
        Inference.EXCLUSION,
        Inference.PARSIMONY,
        Inference.PEPTIDE_CENTRIC,
    ]
    self.datastore_dict = {}
    self.selected_methods = None
    self.selected_datastores = {}

    self._validate_input()

    self._set_output_directory()

    self._log_append_alt_from_db()
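
The way the constructor above chooses a default location for the heuristic plot can be restated as a small standalone function. This is only an illustrative sketch of the branch logic in `__init__`; the function name `resolve_pdf_filename` is hypothetical and is not part of pyproteininference.

```python
import os


def resolve_pdf_filename(pdf_filename=None, output_directory=None, output_filename=None):
    """Hypothetical helper mirroring how __init__ picks the heuristic plot path."""
    if pdf_filename:
        # An explicitly supplied path always wins
        return pdf_filename
    if output_directory and not output_filename:
        # Only a directory was given: write the plot inside it
        return os.path.join(output_directory, "heuristic_plot.pdf")
    if output_filename:
        # Place the plot next to the user-specified output file
        return os.path.join(os.path.split(output_filename)[0], "heuristic_plot.pdf")
    # Nothing was given: fall back to the current working directory
    return os.path.join(os.getcwd(), "heuristic_plot.pdf")


print(resolve_pdf_filename(output_directory="results"))
print(resolve_pdf_filename(output_filename=os.path.join("out", "proteins.csv")))
```

Note that when both `output_directory` and `output_filename` are supplied, the directory of `output_filename` is used, which matches the documented rule that `output_filename` overrides `output_directory`.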

determine_number_stdev_from_mean(self, false_discovery_rate)

This method calculates the mean number of proteins identified at a specific FDR across all 4 methods and then, for each method, calculates the number of standard deviations from that mean.

Parameters:
  • false_discovery_rate (float) – The false discovery rate used as a cutoff for calculations.

Returns:
  • dict – a dictionary of the number of standard deviations away from the mean per inference method.

Source code in pyproteininference/heuristic.py
def determine_number_stdev_from_mean(self, false_discovery_rate):
    """
    This method calculates the mean number of proteins identified at a specific FDR across all
    4 methods and then, for each method, calculates the number of standard deviations
    from that mean.

    Args:
        false_discovery_rate (float): The false discovery rate used as a cutoff for calculations.

    Returns:
        dict: a dictionary of the number of standard deviations away from the mean per inference method.

    """

    filtered_protein_objects = {
        x: self.datastore_dict[x].get_protein_objects(
            fdr_restricted=True, false_discovery_rate=false_discovery_rate
        )
        for x in self.datastore_dict.keys()
    }
    number_passing_proteins = {x: len(filtered_protein_objects[x]) for x in filtered_protein_objects.keys()}

    # Calculate how similar the number of passing proteins is for each method
    all_values = [x for x in number_passing_proteins.values()]
    mean = numpy.mean(all_values)
    standard_deviation = statistics.stdev(all_values)
    number_stdev_from_mean_dict = {}
    for key in number_passing_proteins.keys():
        cur_value = number_passing_proteins[key]
        number_stdev_from_mean_dict[key] = (cur_value - mean) / standard_deviation

    return number_stdev_from_mean_dict
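
As a worked illustration of the calculation above, the sketch below applies the same formula, `(count - mean) / standard_deviation`, to made-up protein counts. The method names are placeholder strings and the counts are hypothetical; `statistics.mean` is used here in place of `numpy.mean` so the example needs only the standard library.

```python
import statistics

# Hypothetical counts of proteins passing a given FDR for each inference method
number_passing_proteins = {
    "inclusion": 1200,
    "exclusion": 900,
    "parsimony": 1000,
    "peptide_centric": 1100,
}

values = list(number_passing_proteins.values())
mean = statistics.mean(values)
# statistics.stdev is the sample standard deviation (n - 1 in the denominator)
standard_deviation = statistics.stdev(values)

# Signed distance of each method's count from the cross-method mean,
# expressed in units of the standard deviation
number_stdev_from_mean = {
    method: (count - mean) / standard_deviation
    for method, count in number_passing_proteins.items()
}

for method, value in number_stdev_from_mean.items():
    print("{}: {:+.3f}".format(method, value))
```

Methods that report more proteins than the average get a positive value and methods that report fewer get a negative one. If every method reported the same count, the standard deviation would be zero and the ratio would be undefined.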

determine_optimal_inference_method(self, false_discovery_rate_threshold=0.05, upper_empirical_threshold=1, lower_empirical_threshold=0.5, pdf_filename=None)

This method determines the optimal inference method from Inclusion, Exclusion, Parsimony, Peptide-Centric.

Parameters:
  • false_discovery_rate_threshold (float) – The FDR threshold to use in the heuristic algorithm. This parameter determines the maximum FDR used when creating a range of finite FDR values.

  • upper_empirical_threshold (float) – Upper threshold used as the inclusion/exclusion cutoff for the heuristic algorithm.

  • lower_empirical_threshold (float) – Lower threshold used as the parsimony/peptide centric cutoff for the heuristic algorithm.

  • pdf_filename (str) – Filename to write heuristic density plot to.

Returns:
  • list – List of string representations of the recommended inference methods.

Source code in pyproteininference/heuristic.py
def determine_optimal_inference_method(
    self,
    false_discovery_rate_threshold=0.05,
    upper_empirical_threshold=1,
    lower_empirical_threshold=0.5,
    pdf_filename=None,
):
    """
    This method determines the optimal inference method from Inclusion, Exclusion, Parsimony, Peptide-Centric.

    Args:
        false_discovery_rate_threshold (float): The FDR threshold to use in the heuristic algorithm.
            This parameter determines the maximum FDR used when creating a range of finite FDR values.
        upper_empirical_threshold (float): Upper threshold used as the inclusion/exclusion cutoff for
            the heuristic algorithm.
        lower_empirical_threshold (float): Lower threshold used as the parsimony/peptide centric cutoff for
            the heuristic algorithm.
        pdf_filename (str): Filename to write heuristic density plot to.


    Returns:
        list: List of string representations of the recommended inference methods.

    """

    # Get the number of passing proteins
    number_stdev_from_mean_dict = {}
    fdrs = [false_discovery_rate_threshold * 0.01 * x for x in range(100)]
    for fdr in fdrs:
        stdev_from_mean = self.determine_number_stdev_from_mean(false_discovery_rate=fdr)
        number_stdev_from_mean_dict[fdr] = stdev_from_mean

    stdev_collection = collections.defaultdict(list)
    for fdr in fdrs:
        for key in number_stdev_from_mean_dict[fdr]:
            stdev_collection[key].append(number_stdev_from_mean_dict[fdr][key])

    heuristic_scores = self.generate_density_plot(
        number_stdevs_from_mean=stdev_collection, pdf_filename=pdf_filename
    )

    # Apply conditional statement with lower and upper thresholds
    if (
        heuristic_scores[Inference.PARSIMONY] <= lower_empirical_threshold
        or heuristic_scores[Inference.PEPTIDE_CENTRIC] <= lower_empirical_threshold
    ):
        # If parsimony or peptide centric are less than the lower empirical threshold
        # Then select the best method of the two
        logger.info(
            "Either parsimony {} or peptide centric {} pass empirical threshold {}. "
            "Selecting the best method of the two.".format(
                heuristic_scores[Inference.PARSIMONY],
                heuristic_scores[Inference.PEPTIDE_CENTRIC],
                lower_empirical_threshold,
            )
        )
        sub_dict = {
            Inference.PARSIMONY: heuristic_scores[Inference.PARSIMONY],
            Inference.PEPTIDE_CENTRIC: heuristic_scores[Inference.PEPTIDE_CENTRIC],
        }

        if (
            heuristic_scores[Inference.PARSIMONY] <= lower_empirical_threshold
            and heuristic_scores[Inference.PEPTIDE_CENTRIC] <= lower_empirical_threshold
        ):
            # If both are under the threshold return both
            selected_methods = [Inference.PARSIMONY, Inference.PEPTIDE_CENTRIC]

        else:
            selected_methods = [min(sub_dict, key=sub_dict.get)]

    # If the above condition does not apply
    elif (
        heuristic_scores[Inference.EXCLUSION] <= upper_empirical_threshold
        or heuristic_scores[Inference.INCLUSION] <= upper_empirical_threshold
    ):
        # If exclusion or inclusion are less than the upper empirical threshold
        # Then select the best method of the two
        logger.info(
            "Either inclusion {} or exclusion {} pass empirical threshold {}. "
            "Selecting the best method of the two.".format(
                heuristic_scores[Inference.INCLUSION],
                heuristic_scores[Inference.EXCLUSION],
                upper_empirical_threshold,
            )
        )
        sub_dict = {
            Inference.EXCLUSION: heuristic_scores[Inference.EXCLUSION],
            Inference.INCLUSION: heuristic_scores[Inference.INCLUSION],
        }

        if (
            heuristic_scores[Inference.EXCLUSION] <= upper_empirical_threshold
            and heuristic_scores[Inference.INCLUSION] <= upper_empirical_threshold
        ):
            # If both are under the threshold return both
            selected_methods = [Inference.INCLUSION, Inference.EXCLUSION]

        else:
            selected_methods = [min(sub_dict, key=sub_dict.get)]

    else:
        # If we have no conditional scenarios...
        # Select the best method
        logger.info("No methods pass empirical thresholds, selecting the best method")
        selected_methods = [min(heuristic_scores, key=heuristic_scores.get)]

    logger.info("Method(s) {} selected with the heuristic algorithm".format(", ".join(selected_methods)))
    return selected_methods
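
The threshold checks in the source above can be condensed into a short standalone function, which may make the order of precedence easier to see. This is a sketch that should select the same methods as the original branches, not the package's implementation; the function name `select_methods` and the string constants are placeholders standing in for the `Inference` constants.

```python
# Placeholder names standing in for the Inference constants
PARSIMONY, PEPTIDE_CENTRIC = "parsimony", "peptide_centric"
INCLUSION, EXCLUSION = "inclusion", "exclusion"


def select_methods(heuristic_scores, lower_empirical_threshold=0.5, upper_empirical_threshold=1):
    """Hypothetical condensed version of the selection logic."""
    # Parsimony / peptide centric are checked first, against the lower threshold
    strict = [m for m in (PARSIMONY, PEPTIDE_CENTRIC) if heuristic_scores[m] <= lower_empirical_threshold]
    if strict:
        return strict
    # Inclusion / exclusion are checked next, against the upper threshold
    loose = [m for m in (INCLUSION, EXCLUSION) if heuristic_scores[m] <= upper_empirical_threshold]
    if loose:
        return loose
    # Nothing passes either threshold: fall back to the single lowest score
    return [min(heuristic_scores, key=heuristic_scores.get)]


# The FDR grid used by the method: 100 evenly spaced values starting at 0
# that stop just short of the threshold itself
false_discovery_rate_threshold = 0.05
fdrs = [false_discovery_rate_threshold * 0.01 * x for x in range(100)]

# Made-up heuristic scores for illustration
scores = {PARSIMONY: 0.9, PEPTIDE_CENTRIC: 0.7, INCLUSION: 0.6, EXCLUSION: 1.2}
print(select_methods(scores))
```

When only one method of a pair passes its threshold, that method necessarily has the lower score of the two, so returning the passing methods is equivalent to the `min(sub_dict, key=sub_dict.get)` call in the original.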

execute(self, fdr_threshold=0.05)

This method is the main driver of the heuristic method. This method calls other classes and methods that make up the heuristic pipeline. This includes but is not limited to:

  1. Loops over the main inference methods: Inclusion, Exclusion, Parsimony, and Peptide Centric.
  2. Determines the optimal inference method based on the input data as well as the database file.
  3. Outputs the results and indicates the optimal results.

Parameters:
  • fdr_threshold (float) – The q-value/FDR threshold on which the heuristic method bases its calculations.

Returns:
  • None

Examples:

>>> heuristic = pyproteininference.heuristic.HeuristicPipeline(
>>>     parameter_file=yaml_params,
>>>     database_file=database,
>>>     target_files=target,
>>>     decoy_files=decoy,
>>>     combined_files=combined_files,
>>>     target_directory=target_directory,
>>>     decoy_directory=decoy_directory,
>>>     combined_directory=combined_directory,
>>>     output_directory=dir_name,
>>>     output_filename=output_filename,
>>>     append_alt_from_db=append_alt,
>>>     pdf_filename=pdf_filename,
>>>     output_type="all"
>>> )
>>> heuristic.execute(fdr_threshold=0.05)
Source code in pyproteininference/heuristic.py
def execute(self, fdr_threshold=0.05):
    """
    This method is the main driver of the heuristic method.
    This method calls other classes and methods that make up the heuristic pipeline.
    This includes but is not limited to:

    1. Loops over the main inference methods: Inclusion, Exclusion, Parsimony, and Peptide Centric.
    2. Determines the optimal inference method based on the input data as well as the database file.
    3. Outputs the results and indicates the optimal results.

    Args:
        fdr_threshold (float): The q-value/FDR threshold on which the heuristic method bases its calculations.

    Returns:
        None:

    Example:
        >>> heuristic = pyproteininference.heuristic.HeuristicPipeline(
        >>>     parameter_file=yaml_params,
        >>>     database_file=database,
        >>>     target_files=target,
        >>>     decoy_files=decoy,
        >>>     combined_files=combined_files,
        >>>     target_directory=target_directory,
        >>>     decoy_directory=decoy_directory,
        >>>     combined_directory=combined_directory,
        >>>     output_directory=dir_name,
        >>>     output_filename=output_filename,
        >>>     append_alt_from_db=append_alt,
        >>>     pdf_filename=pdf_filename,
        >>>     output_type="all"
        >>> )
        >>> heuristic.execute(fdr_threshold=0.05)

    """

    pyproteininference_parameters = pyproteininference.parameters.ProteinInferenceParameter(
        yaml_param_filepath=self.parameter_file
    )

    digest = pyproteininference.in_silico_digest.PyteomicsDigest(
        database_path=self.database_file,
        digest_type=pyproteininference_parameters.digest_type,
        missed_cleavages=pyproteininference_parameters.missed_cleavages,
        reviewed_identifier_symbol=pyproteininference_parameters.reviewed_identifier_symbol,
        max_peptide_length=pyproteininference_parameters.restrict_peptide_length,
        id_splitting=self.id_splitting,
    )
    if self.database_file:
        logger.info("Running In Silico Database Digest on file {}".format(self.database_file))
        digest.digest_fasta_database()
    else:
        logger.warning(
            "No Database File provided, Skipping database digest and only taking protein-peptide mapping from the "
            "input files."
        )

    for inference_method in self.inference_method_list:

        method_specific_parameters = copy.deepcopy(pyproteininference_parameters)

        logger.info("Overriding inference type {}".format(method_specific_parameters.inference_type))

        method_specific_parameters.inference_type = inference_method

        logger.info("New inference type {}".format(method_specific_parameters.inference_type))
        logger.info("FDR Threshold Set to {}".format(method_specific_parameters.fdr))

        reader = pyproteininference.reader.GenericReader(
            target_file=self.target_files,
            decoy_file=self.decoy_files,
            combined_files=self.combined_files,
            parameter_file_object=method_specific_parameters,
            digest=digest,
            append_alt_from_db=self.append_alt_from_db,
        )
        reader.read_psms()

        data = pyproteininference.datastore.DataStore(reader=reader, digest=digest)

        data.restrict_psm_data()

        data.recover_mapping()

        data.create_scoring_input()

        if method_specific_parameters.inference_type == Inference.EXCLUSION:
            data.exclude_non_distinguishing_peptides()

        score = pyproteininference.scoring.Score(data=data)
        score.score_psms(score_method=method_specific_parameters.protein_score)

        if method_specific_parameters.picker:
            data.protein_picker()

        pyproteininference.inference.Inference.run_inference(data=data, digest=digest)

        data.calculate_q_values()

        self.datastore_dict[inference_method] = data

    self.selected_methods = self.determine_optimal_inference_method(
        false_discovery_rate_threshold=fdr_threshold, pdf_filename=self.pdf_filename
    )
    self.selected_datastores = {x: self.datastore_dict[x] for x in self.selected_methods}

    if self.output_type == "all":
        self._write_all_results(parameters=method_specific_parameters)
    else:
        # Any other output_type, including "optimal", writes only the optimal results
        self._write_optimal_results(parameters=method_specific_parameters)

generate_density_plot(self, number_stdevs_from_mean, pdf_filename=None)

This method produces a PDF density plot overlaying the 4 inference methods that are part of the heuristic algorithm.

Parameters:
  • number_stdevs_from_mean (dict) – a dictionary of the number of standard deviations from the mean per inference method for a range of FDRs.

  • pdf_filename (str) – Filename to write heuristic density plot to.

Returns:
  • dict – a dictionary of heuristic scores per inference method. Each score is the absolute position of the peak of that method's density plot.

Source code in pyproteininference/heuristic.py
def generate_density_plot(self, number_stdevs_from_mean, pdf_filename=None):
    """
    This method produces a PDF density plot overlaying the 4 inference methods that are part of the heuristic algorithm.

    Args:
        number_stdevs_from_mean (dict): a dictionary of the number of standard deviations from the mean per
            inference method for a range of FDRs.
        pdf_filename (str): Filename to write heuristic density plot to.

    Returns:
        dict: a dictionary of heuristic scores per inference method. Each score is the absolute
            position of the peak of that method's density plot.

    """
    f = plt.figure()

    heuristic_scores = {}
    for method in number_stdevs_from_mean:
        readible_method_name = Inference.INFERENCE_NAME_MAP[method]
        kwargs = dict(histtype='stepfilled', alpha=0.3, density=True, bins=40, ec="k", label=readible_method_name)
        x, y, _ = plt.hist(number_stdevs_from_mean[method], **kwargs)
        center = y[list(x).index(max(x))]
        heuristic_scores[method] = abs(center)

    plt.axvline(0, color="black", linestyle='--', alpha=0.75)
    plt.title("Density Plot of the Number of Standard Deviations from the Mean")
    plt.xlabel('Number of Standard Deviations from the Mean')
    plt.ylabel('Number of Observations')
    plt.legend(loc='upper right')
    if pdf_filename:
        logger.info("Writing Heuristic Density plot to: {}".format(pdf_filename))
        f.savefig(pdf_filename)
    else:
        plt.show()
    plt.close()

    logger.info("Heuristic Scores")
    logger.info(heuristic_scores)

    return heuristic_scores
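
In the source above, `plt.hist` returns the bin heights and the bin edges, and the heuristic score is the absolute value of the left edge of the tallest bin. The sketch below reproduces that idea without matplotlib, using plain equal-width binning. The helper name `histogram_peak_left_edge` is hypothetical, the input values are made up, and floating-point edge placement may differ slightly from matplotlib's.

```python
def histogram_peak_left_edge(values, bins=40):
    """Hypothetical helper: left edge of the fullest of `bins` equal-width bins."""
    low, high = min(values), max(values)
    if low == high:
        raise ValueError("this sketch assumes at least two distinct values")
    width = (high - low) / bins
    counts = [0] * bins
    for value in values:
        # The maximum value is placed in the last bin rather than past it
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    # list.index returns the first bin that holds the maximum count;
    # density normalization would not change which bin is tallest
    peak_bin = counts.index(max(counts))
    return low + peak_bin * width


# Made-up "number of standard deviations from the mean" values for one method
stdevs = [-0.2, -0.1, -0.1, 0.0, 0.05, 0.9, 1.0]
heuristic_score = abs(histogram_peak_left_edge(stdevs, bins=4))
print(heuristic_score)
```

Under this reading, a method whose protein counts stay close to the cross-method mean over the FDR range has its peak near zero and therefore a small heuristic score.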

generate_roc_plot(self, fdr_max=0.2, pdf_filename=None)

This method produces a PDF ROC plot overlaying the 4 inference methods that are part of the heuristic algorithm.

Parameters:
  • fdr_max (float) – Max FDR to display on the plot.

  • pdf_filename (str) – Filename to write roc plot to.

Returns:
  • None

Source code in pyproteininference/heuristic.py
def generate_roc_plot(self, fdr_max=0.2, pdf_filename=None):
    """
    This method produces a PDF ROC plot overlaying the 4 inference methods that are part of the heuristic algorithm.

    Args:
        fdr_max (float): Max FDR to display on the plot.
        pdf_filename (str): Filename to write roc plot to.

    Returns:
        None:

    """
    f = plt.figure()
    for inference_method in self.datastore_dict.keys():
        fdr_vs_target_hits = self.datastore_dict[inference_method].generate_fdr_vs_target_hits(fdr_max=fdr_max)
        fdrs = [x[0] for x in fdr_vs_target_hits]
        target_hits = [x[1] for x in fdr_vs_target_hits]
        plt.plot(fdrs, target_hits, '-', label=inference_method.replace("_", " "))
        target_fdr = self.datastore_dict[inference_method].parameter_file_object.fdr
        if inference_method in self.selected_methods:
            best_value = min(fdrs, key=lambda x: abs(x - target_fdr))
            best_index = fdrs.index(best_value)
            best_target_hit_value = target_hits[best_index]  # noqa F841

    plt.axvline(target_fdr, color="black", linestyle='--', alpha=0.75, label="Target FDR")
    plt.legend()
    plt.xlabel('Decoy FDR')
    plt.ylabel('Target Protein Hits')
    plt.xlim([-0.01, fdr_max])
    plt.legend(loc='lower right')
    plt.title("FDR vs Target Protein Hits per Inference Method")
    if pdf_filename:
        logger.info("Writing ROC plot to: {}".format(pdf_filename))
        f.savefig(pdf_filename)
    plt.close()
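
For selected methods, the loop above looks up the point on the curve whose FDR is closest to the target FDR. A minimal sketch of that lookup, assuming the curve is a list of `(fdr, target_hits)` pairs; the function name `closest_fdr_point` and the numbers are hypothetical:

```python
def closest_fdr_point(fdr_vs_target_hits, target_fdr):
    """Return the (fdr, target_hits) pair whose FDR is closest to target_fdr."""
    fdrs = [point[0] for point in fdr_vs_target_hits]
    target_hits = [point[1] for point in fdr_vs_target_hits]
    # Pick the FDR value with the smallest absolute distance to the target.
    best_value = min(fdrs, key=lambda fdr: abs(fdr - target_fdr))
    best_index = fdrs.index(best_value)
    return best_value, target_hits[best_index]


# Hypothetical curve: (decoy FDR, number of target protein hits).
curve = [(0.0, 120), (0.005, 180), (0.012, 240), (0.03, 300)]
best_fdr, best_hits = closest_fdr_point(curve, target_fdr=0.01)
```

If several points are equally close, `min` returns the first one, which matches the behaviour of the `min`/`index` pair in the source.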

in_silico_digest

Digest

The following class handles data storage of in silico digest data from a fasta formatted sequence database.

Attributes:

Name Type Description
peptide_to_protein_dictionary dict

Dictionary of peptides (keys) to protein sets (values).

protein_to_peptide_dictionary dict

Dictionary of proteins (keys) to peptide sets (values).

swiss_prot_protein_set set

Set of reviewed proteins if they are able to be distinguished from unreviewed proteins.

database_path str

Path to fasta database file to digest.

missed_cleavages int

The number of missed cleavages to allow.

id_splitting bool

True/False on whether or not to strip a given regex from identifiers. This is used to strip "sp|" and "tr|" from the database protein strings, as the database sometimes contains those strings while the input data has them already removed. Advanced usage only.

reviewed_identifier_symbol str/None

Identifier that distinguishes reviewed from unreviewed proteins. Typically this is "sp|". Can also be None type.

digest_type str

Can be any value in LIST_OF_DIGEST_TYPES.

max_peptide_length int

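Max peptide length to keep for analysis. Note that digest_fasta_database passes this value to parser.cleave as min_length, so in the source shown here it acts as a lower bound on peptide length.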

Source code in pyproteininference/in_silico_digest.py
class Digest(object):
    """
    The following class handles data storage of in silico digest data from a fasta formatted sequence database.

    Attributes:
        peptide_to_protein_dictionary (dict): Dictionary of peptides (keys) to protein sets (values).
        protein_to_peptide_dictionary (dict): Dictionary of proteins (keys) to peptide sets (values).
        swiss_prot_protein_set (set): Set of reviewed proteins if they are able to be distinguished from unreviewed
            proteins.
        database_path (str): Path to fasta database file to digest.
        missed_cleavages (int): The number of missed cleavages to allow.
        id_splitting (bool): True/False on whether or not to strip a given regex from identifiers.
            This is used to strip "sp|" and "tr|"
            from the database protein strings, as the database sometimes contains those strings while
            the input data has them already removed.
            Advanced usage only.
        reviewed_identifier_symbol (str/None): Identifier that distinguishes reviewed from unreviewed proteins.
            Typically this is "sp|". Can also be None type.
        digest_type (str): Can be any value in `LIST_OF_DIGEST_TYPES`.
        max_peptide_length (int): Max peptide length to keep for analysis.

    """

    TRYPSIN = "trypsin"
    LYSC = "lysc"
    LIST_OF_DIGEST_TYPES = set(parser.expasy_rules.keys())

    AA_LIST = [
        "A",
        "R",
        "N",
        "D",
        "C",
        "E",
        "Q",
        "G",
        "H",
        "I",
        "L",
        "K",
        "M",
        "F",
        "P",
        "S",
        "T",
        "W",
        "Y",
        "V",
    ]
    UNIPROT_STRS = r"sp\||tr\|"  # raw string avoids invalid escape sequences
    UNIPROT_STR_REGEX = re.compile(UNIPROT_STRS)
    SP_STRING = "sp|"
    METHIONINE = "M"
    ANY_AMINO_ACID = "X"

    def __init__(self):
        pass
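
To illustrate what `id_splitting` and `UNIPROT_STR_REGEX` do, here is a small sketch that removes the "sp|" and "tr|" markers from an identifier. The helper `strip_identifier` and the FASTA header are hypothetical; only the regular expression mirrors the class constant above.

```python
import re

# Same pattern as the class constant, written as a raw string.
UNIPROT_STR_REGEX = re.compile(r"sp\||tr\|")


def strip_identifier(description, id_splitting=True):
    """Take the first whitespace-delimited token and optionally strip sp|/tr|."""
    identifier = description.split(" ")[0]
    if id_splitting:
        return UNIPROT_STR_REGEX.sub("", identifier)
    return identifier


# Hypothetical FASTA header line (without the leading ">").
header = "sp|Q00000|EXAMPLE_HUMAN Example protein OS=Homo sapiens"

stripped = strip_identifier(header)
unstripped = strip_identifier(header, id_splitting=False)
```

With splitting enabled the identifier becomes `Q00000|EXAMPLE_HUMAN`; with it disabled the `sp|` prefix is kept, which is what you want when the search results already carry the prefix.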

PyteomicsDigest (Digest)

This class represents a pyteomics implementation of an in silico digest.

Source code in pyproteininference/in_silico_digest.py
class PyteomicsDigest(Digest):
    """
    This class represents a pyteomics implementation of an in silico digest.
    """

    def __init__(
        self,
        database_path,
        digest_type,
        missed_cleavages,
        reviewed_identifier_symbol,
        max_peptide_length,
        id_splitting=True,
    ):
        """
        The following class creates protein to peptide, peptide to protein, and reviewed protein mappings.

        The input is a fasta database path, the digest settings, and whether or not to split IDs.

        This class sets important attributes for the Digest object such as: `peptide_to_protein_dictionary`,
        `protein_to_peptide_dictionary`, and `swiss_prot_protein_set`.

        Args:
            database_path (str): Path to fasta database file to digest.
            digest_type (str): Must be a value in `LIST_OF_DIGEST_TYPES`.
            missed_cleavages (int): Integer that indicates the maximum number of allowable missed cleavages from
                the ms search.
            reviewed_identifier_symbol (str/None): Symbol that indicates a reviewed identifier.
                If using Uniprot this is typically 'sp|'.
            max_peptide_length (int): The maximum length of peptides to keep for the analysis.
            id_splitting (bool): True/False on whether or not to strip a given regex from identifiers.
                This is used to strip "sp|" and "tr|"
                from the database protein strings, as the database sometimes contains those
                strings while the input data has them already removed.
                Advanced usage only.

        Example:
            >>> digest = pyproteininference.in_silico_digest.PyteomicsDigest(
            ...     database_path=database_file,
            ...     digest_type='trypsin',
            ...     missed_cleavages=2,
            ...     reviewed_identifier_symbol='sp|',
            ...     max_peptide_length=7,
            ...     id_splitting=False,
            ... )
        """
        self.peptide_to_protein_dictionary = {}
        self.protein_to_peptide_dictionary = {}
        self.swiss_prot_protein_set = set()
        self.database_path = database_path
        self.missed_cleavages = missed_cleavages
        self.id_splitting = id_splitting
        self.reviewed_identifier_symbol = reviewed_identifier_symbol
        if digest_type in self.LIST_OF_DIGEST_TYPES:
            self.digest_type = digest_type
        else:
            raise ValueError(
                "digest_type must be equal to one of the following {}".format(str(self.LIST_OF_DIGEST_TYPES))
            )
        self.max_peptide_length = max_peptide_length

    def digest_fasta_database(self):
        """
        This method reads in and prepares the fasta database for database digestion and assigns
        several attributes of the Digest object: `peptide_to_protein_dictionary`,
        `protein_to_peptide_dictionary`, and `swiss_prot_protein_set`.

        Returns:
            None:

        Example:
            >>> digest = pyproteininference.in_silico_digest.PyteomicsDigest(
            ...     database_path=database_file,
            ...     digest_type='trypsin',
            ...     missed_cleavages=2,
            ...     reviewed_identifier_symbol='sp|',
            ...     max_peptide_length=7,
            ...     id_splitting=False,
            ... )
            >>> digest.digest_fasta_database()

        """
        logger.info("Starting Pyteomics Digest...")
        pep_dict = {}
        prot_dict = {}
        sp_set = set()

        for description, sequence in fasta.read(self.database_path):
            new_peptides = parser.cleave(
                sequence,
                parser.expasy_rules[self.digest_type],
                self.missed_cleavages,
                min_length=self.max_peptide_length,
            )

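            # Take the identifier as everything before the first space in the fasta header.
            # This assumes the header separates the identifier from the description with a space.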
            identifier = description.split(" ")[0]

            # Handle ID Splitting...
            if self.id_splitting:
                identifier_stripped = self.UNIPROT_STR_REGEX.sub("", identifier)
            else:
                identifier_stripped = identifier

            # If reviewed add to sp_set
            if self.reviewed_identifier_symbol:
                if identifier.startswith(self.reviewed_identifier_symbol):
                    sp_set.add(identifier_stripped)

            prot_dict[identifier_stripped] = new_peptides
            met_cleaved_peps = set()
            for peptide in new_peptides:
                pep_dict.setdefault(peptide, set()).add(identifier_stripped)
                # Need to account for potential N-term Methionine Cleavage
                if sequence.startswith(peptide) and peptide.startswith(self.METHIONINE):
                    # If our sequence starts with the current peptide... and our current peptide starts with methionine
                    # Then we remove the methionine from the peptide and add it to our dicts...
                    methionine_cleaved_peptide = peptide[1:]
                    met_cleaved_peps.add(methionine_cleaved_peptide)
            for met_peps in met_cleaved_peps:
                pep_dict.setdefault(met_peps, set()).add(identifier_stripped)
                prot_dict[identifier_stripped].add(met_peps)

        self.swiss_prot_protein_set = sp_set
        self.peptide_to_protein_dictionary = pep_dict
        self.protein_to_peptide_dictionary = prot_dict

        logger.info("Pyteomics Digest Finished...")

__init__(self, database_path, digest_type, missed_cleavages, reviewed_identifier_symbol, max_peptide_length, id_splitting=True) special

The following class creates protein to peptide, peptide to protein, and reviewed protein mappings.

The input is a fasta database path, the digest settings, and whether or not to split IDs.

This class sets important attributes for the Digest object such as: peptide_to_protein_dictionary, protein_to_peptide_dictionary, and swiss_prot_protein_set.

Parameters:
  • database_path (str) – Path to fasta database file to digest.

  • digest_type (str) – Must be a value in LIST_OF_DIGEST_TYPES.

  • missed_cleavages (int) – Integer that indicates the maximum number of allowable missed cleavages from the ms search.

  • reviewed_identifier_symbol (str/None) – Symbol that indicates a reviewed identifier. If using Uniprot this is typically 'sp|'.

  • max_peptide_length (int) – The maximum length of peptides to keep for the analysis.

  • id_splitting (bool) – True/False on whether or not to strip a given regex from identifiers. This is used to strip "sp|" and "tr|" from the database protein strings, as the database sometimes contains those strings while the input data has them already removed. Advanced usage only.

Examples:

>>> digest = pyproteininference.in_silico_digest.PyteomicsDigest(
...     database_path=database_file,
...     digest_type='trypsin',
...     missed_cleavages=2,
...     reviewed_identifier_symbol='sp|',
...     max_peptide_length=7,
...     id_splitting=False,
... )
Source code in pyproteininference/in_silico_digest.py
def __init__(
    self,
    database_path,
    digest_type,
    missed_cleavages,
    reviewed_identifier_symbol,
    max_peptide_length,
    id_splitting=True,
):
    """
    The following class creates protein to peptide, peptide to protein, and reviewed protein mappings.

    The input is a fasta database path, the digest settings, and whether or not to split IDs.

    This class sets important attributes for the Digest object such as: `peptide_to_protein_dictionary`,
    `protein_to_peptide_dictionary`, and `swiss_prot_protein_set`.

    Args:
        database_path (str): Path to fasta database file to digest.
        digest_type (str): Must be a value in `LIST_OF_DIGEST_TYPES`.
        missed_cleavages (int): Integer that indicates the maximum number of allowable missed cleavages from
            the ms search.
        reviewed_identifier_symbol (str/None): Symbol that indicates a reviewed identifier.
            If using Uniprot this is typically 'sp|'.
        max_peptide_length (int): The maximum length of peptides to keep for the analysis.
        id_splitting (bool): True/False on whether or not to strip a given regex from identifiers.
            This is used to strip "sp|" and "tr|"
            from the database protein strings, as the database sometimes contains those
            strings while the input data has them already removed.
            Advanced usage only.

    Example:
        >>> digest = pyproteininference.in_silico_digest.PyteomicsDigest(
        ...     database_path=database_file,
        ...     digest_type='trypsin',
        ...     missed_cleavages=2,
        ...     reviewed_identifier_symbol='sp|',
        ...     max_peptide_length=7,
        ...     id_splitting=False,
        ... )
    """
    self.peptide_to_protein_dictionary = {}
    self.protein_to_peptide_dictionary = {}
    self.swiss_prot_protein_set = set()
    self.database_path = database_path
    self.missed_cleavages = missed_cleavages
    self.id_splitting = id_splitting
    self.reviewed_identifier_symbol = reviewed_identifier_symbol
    if digest_type in self.LIST_OF_DIGEST_TYPES:
        self.digest_type = digest_type
    else:
        raise ValueError(
            "digest_type must be equal to one of the following {}".format(str(self.LIST_OF_DIGEST_TYPES))
        )
    self.max_peptide_length = max_peptide_length

digest_fasta_database(self)

This method reads in and prepares the fasta database for database digestion and assigns several attributes of the Digest object: peptide_to_protein_dictionary, protein_to_peptide_dictionary, and swiss_prot_protein_set.

Returns:
  • None

Examples:

>>> digest = pyproteininference.in_silico_digest.PyteomicsDigest(
...     database_path=database_file,
...     digest_type='trypsin',
...     missed_cleavages=2,
...     reviewed_identifier_symbol='sp|',
...     max_peptide_length=7,
...     id_splitting=False,
... )
>>> digest.digest_fasta_database()
Source code in pyproteininference/in_silico_digest.py
def digest_fasta_database(self):
    """
    This method reads in and prepares the fasta database for database digestion and assigns
    several attributes of the Digest object: `peptide_to_protein_dictionary`,
    `protein_to_peptide_dictionary`, and `swiss_prot_protein_set`.

    Returns:
        None:

    Example:
        >>> digest = pyproteininference.in_silico_digest.PyteomicsDigest(
        ...     database_path=database_file,
        ...     digest_type='trypsin',
        ...     missed_cleavages=2,
        ...     reviewed_identifier_symbol='sp|',
        ...     max_peptide_length=7,
        ...     id_splitting=False,
        ... )
        >>> digest.digest_fasta_database()

    """
    logger.info("Starting Pyteomics Digest...")
    pep_dict = {}
    prot_dict = {}
    sp_set = set()

    for description, sequence in fasta.read(self.database_path):
        new_peptides = parser.cleave(
            sequence,
            parser.expasy_rules[self.digest_type],
            self.missed_cleavages,
            min_length=self.max_peptide_length,
        )

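        # Take the identifier as everything before the first space in the fasta header.
        # This assumes the header separates the identifier from the description with a space.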
        identifier = description.split(" ")[0]

        # Handle ID Splitting...
        if self.id_splitting:
            identifier_stripped = self.UNIPROT_STR_REGEX.sub("", identifier)
        else:
            identifier_stripped = identifier

        # If reviewed add to sp_set
        if self.reviewed_identifier_symbol:
            if identifier.startswith(self.reviewed_identifier_symbol):
                sp_set.add(identifier_stripped)

        prot_dict[identifier_stripped] = new_peptides
        met_cleaved_peps = set()
        for peptide in new_peptides:
            pep_dict.setdefault(peptide, set()).add(identifier_stripped)
            # Need to account for potential N-term Methionine Cleavage
            if sequence.startswith(peptide) and peptide.startswith(self.METHIONINE):
                # If our sequence starts with the current peptide... and our current peptide starts with methionine
                # Then we remove the methionine from the peptide and add it to our dicts...
                methionine_cleaved_peptide = peptide[1:]
                met_cleaved_peps.add(methionine_cleaved_peptide)
        for met_peps in met_cleaved_peps:
            pep_dict.setdefault(met_peps, set()).add(identifier_stripped)
            prot_dict[identifier_stripped].add(met_peps)

    self.swiss_prot_protein_set = sp_set
    self.peptide_to_protein_dictionary = pep_dict
    self.protein_to_peptide_dictionary = prot_dict

    logger.info("Pyteomics Digest Finished...")
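
The digest loop above can be pictured with a small, dependency-free sketch. It replaces pyteomics with a simplified trypsin rule (cut after K or R unless followed by P), so it is an approximation rather than the library's behaviour. The functions `cleave_tryptic` and `digest` and the two sequences are hypothetical.

```python
import re


def cleave_tryptic(sequence, missed_cleavages=0, min_length=1):
    """Simplified trypsin rule: cut after K or R unless the next residue is P."""
    pieces = [piece for piece in re.split(r"(?<=[KR])(?!P)", sequence) if piece]
    peptides = set()
    for start in range(len(pieces)):
        last = min(start + missed_cleavages + 1, len(pieces))
        for end in range(start + 1, last + 1):
            peptide = "".join(pieces[start:end])
            if len(peptide) >= min_length:
                peptides.add(peptide)
    return peptides


def digest(records, missed_cleavages=0, min_length=1):
    """Build peptide -> proteins and protein -> peptides mappings."""
    pep_dict, prot_dict = {}, {}
    for identifier, sequence in records:
        peptides = cleave_tryptic(sequence, missed_cleavages, min_length)
        # N-terminal peptides that start with M are also kept without the M,
        # to account for possible N-terminal methionine cleavage.
        met_cleaved = {
            peptide[1:]
            for peptide in peptides
            if sequence.startswith(peptide) and peptide.startswith("M")
        }
        prot_dict[identifier] = peptides | met_cleaved
        for peptide in prot_dict[identifier]:
            pep_dict.setdefault(peptide, set()).add(identifier)
    return pep_dict, prot_dict


# Hypothetical (identifier, sequence) records.
records = [("PROT_A", "MKAAR"), ("PROT_B", "GGKAAR")]
pep_dict, prot_dict = digest(records)
```

Here the peptide `AAR` maps to both proteins, and `PROT_A` gains the extra peptide `K` from removing the leading methionine of `MK`.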

inference

Exclusion (Inference)

Exclusion Inference class. This class contains methods that support the initialization of an Exclusion inference method.

Attributes:

Name Type Description
data DataStore

DataStore Object.

digest Digest

Digest Object.

scored_data list

A list of scored Protein objects.

Source code in pyproteininference/inference.py
class Exclusion(Inference):
    """
    Exclusion Inference class. This class contains methods that support the initialization of an
    Exclusion inference method.

    Attributes:
        data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
        digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
        scored_data (list): A list of scored [Protein][pyproteininference.physical.Protein] objects.

    """

    def __init__(self, data, digest):
        """
        Initialization method of the Exclusion Class.

        Args:
            data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
            digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].

        """
        self.data = data
        self.digest = digest
        self.data._validate_scored_proteins()
        self.scored_data = self.data.get_protein_data()
        self.list_of_prots_not_in_db = None
        self.list_of_peps_not_in_db = None

    def infer_proteins(self):
        """
        This method performs the Exclusion inference/grouping method.

        For the exclusion inference method, groups cannot be created because all shared peptides are removed.

        This method assigns the variables: `grouped_scored_proteins` and `protein_group_objects`.
        These are both variables of the [DataStore Object][pyproteininference.datastore.DataStore] and are
        lists of [Protein][pyproteininference.physical.Protein] objects
        and [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.

        """

        grouped_proteins = self._create_protein_groups(scored_proteins=self.scored_data)

        hl = self.data.higher_or_lower()

        logger.info("Applying Group ID's for the Exclusion Method")
        regrouped_proteins = self._apply_protein_group_ids(
            grouped_protein_objects=grouped_proteins,
        )

        grouped_protein_objects = regrouped_proteins["grouped_protein_objects"]
        protein_group_objects = regrouped_proteins["group_objects"]

        logger.info("Sorting Results based on lead Protein Score")
        grouped_protein_objects = datastore.DataStore.sort_protein_objects(
            grouped_protein_objects=grouped_protein_objects, higher_or_lower=hl
        )
        protein_group_objects = datastore.DataStore.sort_protein_group_objects(
            protein_group_objects=protein_group_objects, higher_or_lower=hl
        )

        self.data.grouped_scored_proteins = grouped_protein_objects
        self.data.protein_group_objects = protein_group_objects

__init__(self, data, digest) special

Initialization method of the Exclusion Class.

Parameters:
  • data (DataStore) – DataStore Object.

  • digest (Digest) – Digest Object.

Source code in pyproteininference/inference.py
def __init__(self, data, digest):
    """
    Initialization method of the Exclusion Class.

    Args:
        data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
        digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].

    """
    self.data = data
    self.digest = digest
    self.data._validate_scored_proteins()
    self.scored_data = self.data.get_protein_data()
    self.list_of_prots_not_in_db = None
    self.list_of_peps_not_in_db = None

infer_proteins(self)

This method performs the Exclusion inference/grouping method.

For the exclusion inference method, groups cannot be created because all shared peptides are removed.

This method assigns the variables: grouped_scored_proteins and protein_group_objects. These are both variables of the DataStore Object and are lists of Protein objects and ProteinGroup objects.

Source code in pyproteininference/inference.py
def infer_proteins(self):
    """
    This method performs the Exclusion inference/grouping method.

    For the exclusion inference method, groups cannot be created because all shared peptides are removed.

    This method assigns the variables: `grouped_scored_proteins` and `protein_group_objects`.
    These are both variables of the [DataStore Object][pyproteininference.datastore.DataStore] and are
    lists of [Protein][pyproteininference.physical.Protein] objects
    and [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.

    """

    grouped_proteins = self._create_protein_groups(scored_proteins=self.scored_data)

    hl = self.data.higher_or_lower()

    logger.info("Applying Group ID's for the Exclusion Method")
    regrouped_proteins = self._apply_protein_group_ids(
        grouped_protein_objects=grouped_proteins,
    )

    grouped_protein_objects = regrouped_proteins["grouped_protein_objects"]
    protein_group_objects = regrouped_proteins["group_objects"]

    logger.info("Sorting Results based on lead Protein Score")
    grouped_protein_objects = datastore.DataStore.sort_protein_objects(
        grouped_protein_objects=grouped_protein_objects, higher_or_lower=hl
    )
    protein_group_objects = datastore.DataStore.sort_protein_group_objects(
        protein_group_objects=protein_group_objects, higher_or_lower=hl
    )

    self.data.grouped_scored_proteins = grouped_protein_objects
    self.data.protein_group_objects = protein_group_objects
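
To see why exclusion yields no multi-protein groups, consider this conceptual sketch. It is not the library's implementation: the function `exclude_shared_peptides` is hypothetical, and dropping proteins that are left without peptides is an assumption made for the illustration.

```python
from collections import defaultdict


def exclude_shared_peptides(protein_to_peptides):
    """Keep only peptides that map to exactly one protein."""
    peptide_to_proteins = defaultdict(set)
    for protein, peptides in protein_to_peptides.items():
        for peptide in peptides:
            peptide_to_proteins[peptide].add(protein)

    unique_only = {}
    for protein, peptides in protein_to_peptides.items():
        unique = {p for p in peptides if len(peptide_to_proteins[p]) == 1}
        # Assumption for this sketch: proteins with no unique peptides are dropped.
        if unique:
            unique_only[protein] = unique
    return unique_only


# Hypothetical protein -> peptide mapping.
mapping = {
    "PROT_A": {"PEPTIDEA", "SHAREDK"},
    "PROT_B": {"SHAREDK"},
    "PROT_C": {"PEPTIDEC"},
}
result = exclude_shared_peptides(mapping)

# With no shared peptides left, every protein sits alone in its group.
groups = [[protein] for protein in sorted(result)]
```

Because no peptide links two proteins after filtering, each group contains a single protein.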

FirstProtein (Inference)

FirstProtein Inference class. This class contains methods that support the initialization of a FirstProtein inference method.

Attributes:

Name Type Description
data DataStore

DataStore Object.

digest Digest

Digest Object.

Source code in pyproteininference/inference.py
class FirstProtein(Inference):
    """
    FirstProtein Inference class. This class contains methods that support the initialization of a
    FirstProtein inference method.

    Attributes:
        data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
        digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].

    """

    def __init__(self, data, digest):
        """
        FirstProtein Inference initialization method.

        Args:
            data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
            digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].

        Returns:
            object:
        """
        self.data = data
        self.digest = digest
        self.data._validate_scored_proteins()
        self.scored_data = self.data.get_protein_data()

    def infer_proteins(self):
        """
        This method performs the First Protein inference method.

        This method assigns the variables: `grouped_scored_proteins` and `protein_group_objects`.
        These are both variables of the [DataStore object][pyproteininference.datastore.DataStore] and are
        lists of [Protein][pyproteininference.physical.Protein] objects
        and [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.

        """

        grouped_proteins = self._create_protein_groups(scored_proteins=self.scored_data)

        # Get the higher or lower variable
        hl = self.data.higher_or_lower()

        logger.info("Applying Group ID's for the First Protein Method")
        regrouped_proteins = self._apply_protein_group_ids(
            grouped_protein_objects=grouped_proteins,
        )

        grouped_protein_objects = regrouped_proteins["grouped_protein_objects"]
        protein_group_objects = regrouped_proteins["group_objects"]

        logger.info("Sorting Results based on lead Protein Score")
        grouped_protein_objects = datastore.DataStore.sort_protein_objects(
            grouped_protein_objects=grouped_protein_objects, higher_or_lower=hl
        )
        protein_group_objects = datastore.DataStore.sort_protein_group_objects(
            protein_group_objects=protein_group_objects, higher_or_lower=hl
        )

        self.data.grouped_scored_proteins = grouped_protein_objects
        self.data.protein_group_objects = protein_group_objects

__init__(self, data, digest) special

FirstProtein Inference initialization method.

Parameters:
  • data (DataStore) – DataStore Object.

  • digest (Digest) – Digest Object.

Returns:
  • object

Source code in pyproteininference/inference.py
def __init__(self, data, digest):
    """
    FirstProtein Inference initialization method.

    Args:
        data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
        digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].

    Returns:
        object:
    """
    self.data = data
    self.digest = digest
    self.data._validate_scored_proteins()
    self.scored_data = self.data.get_protein_data()

infer_proteins(self)

This method performs the First Protein inference method.

This method assigns the variables: grouped_scored_proteins and protein_group_objects. These are both variables of the DataStore object and are lists of Protein objects and ProteinGroup objects.

Source code in pyproteininference/inference.py
def infer_proteins(self):
    """
    This method performs the First Protein inference method.

    This method assigns the variables: `grouped_scored_proteins` and `protein_group_objects`.
    These are both variables of the [DataStore object][pyproteininference.datastore.DataStore] and are
    lists of [Protein][pyproteininference.physical.Protein] objects
    and [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.

    """

    grouped_proteins = self._create_protein_groups(scored_proteins=self.scored_data)

    # Get the higher or lower variable
    hl = self.data.higher_or_lower()

    logger.info("Applying Group ID's for the First Protein Method")
    regrouped_proteins = self._apply_protein_group_ids(
        grouped_protein_objects=grouped_proteins,
    )

    grouped_protein_objects = regrouped_proteins["grouped_protein_objects"]
    protein_group_objects = regrouped_proteins["group_objects"]

    logger.info("Sorting Results based on lead Protein Score")
    grouped_protein_objects = datastore.DataStore.sort_protein_objects(
        grouped_protein_objects=grouped_protein_objects, higher_or_lower=hl
    )
    protein_group_objects = datastore.DataStore.sort_protein_group_objects(
        protein_group_objects=protein_group_objects, higher_or_lower=hl
    )

    self.data.grouped_scored_proteins = grouped_protein_objects
    self.data.protein_group_objects = protein_group_objects
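
The sorting step depends on whether a higher or a lower score is better. A minimal sketch of sorting groups by their lead protein score follows. It assumes each group is a list of `(identifier, score)` tuples with the lead protein first; this is a simplification of the Protein and ProteinGroup objects, and the function name is hypothetical.

```python
def sort_groups_by_lead_score(groups, higher_or_lower):
    """Sort groups so that the best lead protein score comes first."""
    if higher_or_lower not in ("higher", "lower"):
        raise ValueError("higher_or_lower must be 'higher' or 'lower'")
    # Descending order when a higher score is better, ascending otherwise.
    reverse = higher_or_lower == "higher"
    return sorted(groups, key=lambda group: group[0][1], reverse=reverse)


# Hypothetical groups: the first tuple in each list is the lead protein.
groups = [
    [("PROT_A", 0.01)],
    [("PROT_B", 0.001), ("PROT_C", 0.2)],
    [("PROT_D", 0.05)],
]

lower_first = sort_groups_by_lead_score(groups, "lower")
higher_first = sort_groups_by_lead_score(groups, "higher")
```

For score types such as q-values or posterior error probabilities, "lower" would be the natural choice, so `PROT_B` leads the sorted output in that case.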

Inclusion (Inference)

Inclusion Inference class. This class contains methods that support the initialization of an Inclusion inference method.

Attributes:

Name Type Description
data DataStore

DataStore Object.

digest Digest

Digest Object.

scored_data list

A list of scored Protein objects.

Source code in pyproteininference/inference.py
class Inclusion(Inference):
    """
    Inclusion Inference class. This class contains methods that support the initialization of an
    Inclusion inference method.

    Attributes:
        data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
        digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
        scored_data (list): A list of scored [Protein][pyproteininference.physical.Protein] objects.

    """

    def __init__(self, data, digest):
        """
        Initialization method of the Inclusion Inference method.

        Args:
            data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
            digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
        """

        self.data = data
        self.digest = digest
        self.data._validate_scored_proteins()
        self.scored_data = self.data.get_protein_data()

    def infer_proteins(self):
        """
        This method performs the grouping for Inclusion.

        Inclusion actually does not do grouping as all peptides get assigned to all possible proteins
        and groups are not created.

        This method assigns the variables: `grouped_scored_proteins` and `protein_group_objects`.
        These are both variables of the [DataStore Object][pyproteininference.datastore.DataStore] and are
        lists of [Protein][pyproteininference.physical.Protein] objects
        and [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.
        """

        grouped_proteins = self._create_protein_groups(scored_proteins=self.scored_data)

        hl = self.data.higher_or_lower()

        logger.info("Applying Group ID's for the Inclusion Method")

        regrouped_proteins = self._apply_protein_group_ids(
            grouped_protein_objects=grouped_proteins,
        )

        grouped_protein_objects = regrouped_proteins["grouped_protein_objects"]
        protein_group_objects = regrouped_proteins["group_objects"]

        logger.info("Sorting Results based on lead Protein Score")
        grouped_protein_objects = datastore.DataStore.sort_protein_objects(
            grouped_protein_objects=grouped_protein_objects, higher_or_lower=hl
        )
        protein_group_objects = datastore.DataStore.sort_protein_group_objects(
            protein_group_objects=protein_group_objects, higher_or_lower=hl
        )

        self.data.grouped_scored_proteins = grouped_protein_objects
        self.data.protein_group_objects = protein_group_objects

    def _apply_protein_group_ids(self, grouped_protein_objects):
        """
        This method creates the ProteinGroup objects for the inclusion inference type using protein groups from
        [_create_protein_groups][pyproteininference.inference.Inference._create_protein_groups].

        Args:
            grouped_protein_objects (list): list of grouped [Protein][pyproteininference.physical.Protein] objects.

        Returns:
            dict: a Dictionary that contains a list of [ProteinGroup][pyproteininference.physical.ProteinGroup]
                objects (key:"group_objects") and a list of
                grouped [Protein][pyproteininference.physical.Protein] objects (key:"grouped_protein_objects").

        """

        sp_protein_set = set(self.digest.swiss_prot_protein_set)

        prot_pep_dict = self.data.protein_to_peptide_dictionary()

        # Here we create group ID's
        group_id = 0
        protein_group_objects = []
        for protein_group in grouped_protein_objects:
            protein_list = []
            group_id = group_id + 1
            pg = ProteinGroup(group_id)
            logger.debug("Created Protein Group with ID: {}".format(str(group_id)))
            for prot in protein_group:
                cur_protein = prot
                # The following loop assigns group_id's, reviewed/unreviewed status, and number of unique peptides...
                if group_id not in cur_protein.group_identification:
                    cur_protein.group_identification.add(group_id)
                if cur_protein.identifier in sp_protein_set:
                    cur_protein.reviewed = True
                else:
                    cur_protein.unreviewed = True
                cur_identifier = cur_protein.identifier
                cur_protein.num_peptides = len(prot_pep_dict[cur_identifier])
                # Here append the number of unique peptides... so we can use this as secondary sorting...
                protein_list.append(cur_protein)
                # Sorted protein_groups then becomes a list of lists... of protein objects

            pg.proteins = protein_list
            protein_group_objects.append(pg)

        return_dict = {
            "grouped_protein_objects": grouped_protein_objects,
            "group_objects": protein_group_objects,
        }

        return return_dict

__init__(self, data, digest) special

Initialization method of the Inclusion Inference method.

Parameters:

Name Type Description
data DataStore

DataStore Object.

digest Digest

Digest Object.

Source code in pyproteininference/inference.py
def __init__(self, data, digest):
    """
    Initialization method of the Inclusion Inference method.

    Args:
        data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
        digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
    """

    self.data = data
    self.digest = digest
    self.data._validate_scored_proteins()
    self.scored_data = self.data.get_protein_data()

infer_proteins(self)

This method performs the grouping for Inclusion.

Inclusion actually does not do grouping as all peptides get assigned to all possible proteins and groups are not created.

This method assigns the variables: grouped_scored_proteins and protein_group_objects. These are both variables of the DataStore Object and are lists of Protein objects and ProteinGroup objects.
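The behaviour described above can be pictured with a minimal sketch. It is not the library's implementation: `ToyProtein`, `inclusion_groups`, and the `higher_is_better` flag are hypothetical names used only for illustration. The sketch assumes that each scored protein ends up in its own single-member list and that the lists are ordered by score.

```python
from dataclasses import dataclass, field


@dataclass
class ToyProtein:
    """Hypothetical stand-in for a scored Protein object."""

    identifier: str
    score: float
    peptides: set = field(default_factory=set)


def inclusion_groups(scored_proteins, higher_is_better=True):
    """Wrap every protein in its own list, ordered by score (assumed tie-breakers)."""
    ordered = sorted(
        scored_proteins,
        key=lambda p: (p.score, len(p.peptides), p.identifier),
        reverse=higher_is_better,
    )
    # No protein is merged with another: shared peptides stay on every protein
    return [[protein] for protein in ordered]


proteins = [
    ToyProtein("P1", 10.0, {"PEPA", "PEPB"}),
    ToyProtein("P2", 25.0, {"PEPB"}),
    ToyProtein("P3", 5.0, {"PEPB", "PEPC"}),
]
groups = inclusion_groups(proteins)
group_ids = [[p.identifier for p in g] for g in groups]
```

Here the shared peptide `PEPB` remains assigned to all three toy proteins, which is the sense in which Inclusion "does not do grouping".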

Source code in pyproteininference/inference.py
def infer_proteins(self):
    """
    This method performs the grouping for Inclusion.

    Inclusion actually does not do grouping as all peptides get assigned to all possible proteins
    and groups are not created.

    This method assigns the variables: `grouped_scored_proteins` and `protein_group_objects`.
    These are both variables of the [DataStore Object][pyproteininference.datastore.DataStore] and are
    lists of [Protein][pyproteininference.physical.Protein] objects
    and [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.
    """

    grouped_proteins = self._create_protein_groups(scored_proteins=self.scored_data)

    hl = self.data.higher_or_lower()

    logger.info("Applying Group ID's for the Inclusion Method")

    regrouped_proteins = self._apply_protein_group_ids(
        grouped_protein_objects=grouped_proteins,
    )

    grouped_protein_objects = regrouped_proteins["grouped_protein_objects"]
    protein_group_objects = regrouped_proteins["group_objects"]

    logger.info("Sorting Results based on lead Protein Score")
    grouped_protein_objects = datastore.DataStore.sort_protein_objects(
        grouped_protein_objects=grouped_protein_objects, higher_or_lower=hl
    )
    protein_group_objects = datastore.DataStore.sort_protein_group_objects(
        protein_group_objects=protein_group_objects, higher_or_lower=hl
    )

    self.data.grouped_scored_proteins = grouped_protein_objects
    self.data.protein_group_objects = protein_group_objects

Inference

Parent Inference class for all inference/grouper subset classes. The base Inference class contains several methods that are shared across the Inference sub-classes.

Attributes:

Name Type Description
data DataStore

DataStore object.

digest Digest

Digest object.
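One of the shared steps, assigning group IDs and reviewed/unreviewed status, can be sketched roughly as follows. This is an illustrative approximation under stated assumptions: `ToyProtein`, `apply_group_ids`, and the example identifiers are hypothetical, and the real method builds ProteinGroup objects rather than tuples.

```python
class ToyProtein:
    """Hypothetical stand-in for a Protein object."""

    def __init__(self, identifier):
        self.identifier = identifier
        self.group_identification = set()
        self.reviewed = False
        self.unreviewed = False


def apply_group_ids(grouped_proteins, reviewed_identifiers):
    """Number the groups from 1 and flag each member as reviewed or unreviewed."""
    group_summaries = []
    for group_id, group in enumerate(grouped_proteins, start=1):
        for protein in group:
            protein.group_identification.add(group_id)
            if protein.identifier in reviewed_identifiers:
                protein.reviewed = True
            else:
                protein.unreviewed = True
        # A (group_id, identifiers) tuple stands in for a ProteinGroup object
        group_summaries.append((group_id, [p.identifier for p in group]))
    return group_summaries


a, b, c = ToyProtein("sp|A"), ToyProtein("tr|B"), ToyProtein("sp|C")
summaries = apply_group_ids([[a, b], [c]], reviewed_identifiers={"sp|A", "sp|C"})
```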

Source code in pyproteininference/inference.py
class Inference(object):
    """
    Parent Inference class for all inference/grouper subset classes.
    The base Inference class contains several methods that are shared across the Inference sub-classes.

    Attributes:
        data (DataStore): [DataStore object][pyproteininference.datastore.DataStore].
        digest (Digest): [Digest object][pyproteininference.in_silico_digest.Digest].
    """

    PARSIMONY = "parsimony"
    INCLUSION = "inclusion"
    EXCLUSION = "exclusion"
    FIRST_PROTEIN = "first_protein"
    PEPTIDE_CENTRIC = "peptide_centric"

    INFERENCE_TYPES = [
        PARSIMONY,
        INCLUSION,
        EXCLUSION,
        FIRST_PROTEIN,
        PEPTIDE_CENTRIC,
    ]

    INFERENCE_NAME_MAP = {
        PARSIMONY: "Parsimony",
        INCLUSION: "Inclusion",
        EXCLUSION: "Exclusion",
        FIRST_PROTEIN: "First Protein",
        PEPTIDE_CENTRIC: "Peptide Centric",
    }

    SUBSET_PEPTIDES = "subset_peptides"
    SHARED_PEPTIDES = "shared_peptides"
    NONE_GROUPING = None

    GROUPING_TYPES = [SUBSET_PEPTIDES, SHARED_PEPTIDES, NONE_GROUPING]

    PULP = "pulp"
    LP_SOLVERS = [PULP]

    ALL_SHARED_PEPTIDES = "all"
    BEST_SHARED_PEPTIDES = "best"
    NONE_SHARED_PEPTIDES = None
    SHARED_PEPTIDE_TYPES = [
        ALL_SHARED_PEPTIDES,
        BEST_SHARED_PEPTIDES,
        NONE_SHARED_PEPTIDES,
    ]

    def __init__(self, data, digest):
        """
        Initialization method of Inference object.

        Args:
            data (DataStore): [DataStore object][pyproteininference.datastore.DataStore].
            digest (Digest): [Digest object][pyproteininference.in_silico_digest.Digest].

        """
        self.data = data
        self.digest = digest
        self.data._validate_scored_proteins()

    @classmethod
    def run_inference(cls, data, digest):
        """
        This class method dispatches to one of the five different inference classes/models
        based on input from the [ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter]
        object.
        The methods are "parsimony", "inclusion", "exclusion", "peptide_centric", and "first_protein".

        Args:
            data (DataStore): [DataStore object][pyproteininference.datastore.DataStore].
            digest (Digest): [Digest object][pyproteininference.in_silico_digest.Digest].

        Example:
            >>> pyproteininference.inference.Inference.run_inference(data=data,digest=digest)

        """

        inference_type = data.parameter_file_object.inference_type

        logger.info("Running Inference with Inference Type: {}".format(inference_type))

        if inference_type == Inference.PARSIMONY:
            group = Parsimony(data=data, digest=digest)
            group.infer_proteins()

        if inference_type == Inference.INCLUSION:
            group = Inclusion(data=data, digest=digest)
            group.infer_proteins()

        if inference_type == Inference.EXCLUSION:
            group = Exclusion(data=data, digest=digest)
            group.infer_proteins()

        if inference_type == Inference.FIRST_PROTEIN:
            group = FirstProtein(data=data, digest=digest)
            group.infer_proteins()

        if inference_type == Inference.PEPTIDE_CENTRIC:
            group = PeptideCentric(data=data, digest=digest)
            group.infer_proteins()

    def _create_protein_groups(self, scored_proteins):
        """
        This method sets up protein groups for inference methods that do not need grouping.

        Args:
            scored_proteins (list): List of scored [Protein][pyproteininference.physical.Protein] objects.

        Returns:
            list: List of lists of scored [Protein][pyproteininference.physical.Protein] objects.

        """
        scored_proteins = sorted(
            scored_proteins,
            key=lambda k: (k.score, len(k.raw_peptides), k.identifier),
            reverse=True,
        )

        prot_pep_dict = self.data.protein_to_peptide_dictionary()

        restricted_peptides_set = set(self.data.restricted_peptides)

        grouped_proteins = []
        for protein_objects in scored_proteins:
            cur_protein_identifier = protein_objects.identifier

            # Set peptide variable if the peptide is in the restricted peptide set
            # Sort the peptides alphabetically
            protein_objects.peptides = set(
                sorted([x for x in prot_pep_dict[cur_protein_identifier] if x in restricted_peptides_set])
            )
            protein_list_group = [protein_objects]
            grouped_proteins.append(protein_list_group)
        return grouped_proteins

    def _apply_protein_group_ids(self, grouped_protein_objects):
        """
        This method creates the ProteinGroup objects from the output of
            [_create_protein_groups][pyproteininference.inference.Inference._create_protein_groups].

        Args:
            grouped_protein_objects (list): list of grouped [Protein][pyproteininference.physical.Protein] objects.

        Returns:
            dict: a Dictionary that contains a list of [ProteinGroup][pyproteininference.physical.ProteinGroup]
                objects (key:"group_objects") and a list of grouped [Protein][pyproteininference.physical.Protein]
                objects (key:"grouped_protein_objects").


        """

        sp_protein_set = set(self.digest.swiss_prot_protein_set)

        prot_pep_dict = self.data.protein_to_peptide_dictionary()

        # Here we create group ID's
        group_id = 0
        protein_group_objects = []
        for protein_group in grouped_protein_objects:
            protein_list = []
            group_id = group_id + 1
            pg = ProteinGroup(group_id)
            logger.debug("Created Protein Group with ID: {}".format(str(group_id)))
            for protein in protein_group:
                cur_protein = protein
                # The following loop assigns group_id's, reviewed/unreviewed status, and number of unique peptides...
                if group_id not in cur_protein.group_identification:
                    cur_protein.group_identification.add(group_id)
                if protein.identifier in sp_protein_set:
                    cur_protein.reviewed = True
                else:
                    cur_protein.unreviewed = True
                cur_identifier = protein.identifier
                cur_protein.num_peptides = len(prot_pep_dict[cur_identifier])
                # Here append the number of unique peptides... so we can use this as secondary sorting...
                protein_list.append(cur_protein)
                # Sorted protein_groups then becomes a list of lists... of protein objects

            pg.proteins = protein_list
            protein_group_objects.append(pg)

        return_dict = {
            "grouped_protein_objects": grouped_protein_objects,
            "group_objects": protein_group_objects,
        }

        return return_dict

__init__(self, data, digest) special

Initialization method of Inference object.

Parameters:

Name Type Description
data DataStore

DataStore object.

digest Digest

Digest object.

Source code in pyproteininference/inference.py
def __init__(self, data, digest):
    """
    Initialization method of Inference object.

    Args:
        data (DataStore): [DataStore object][pyproteininference.datastore.DataStore].
        digest (Digest): [Digest object][pyproteininference.in_silico_digest.Digest].

    """
    self.data = data
    self.digest = digest
    self.data._validate_scored_proteins()

run_inference(data, digest) classmethod

This class method dispatches to one of the five different inference classes/models based on input from the ProteinInferenceParameter object. The methods are "parsimony", "inclusion", "exclusion", "peptide_centric", and "first_protein".

Parameters:

Name Type Description
data DataStore

DataStore object.

digest Digest

Digest object.

Examples:

>>> pyproteininference.inference.Inference.run_inference(data=data,digest=digest)
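
The dispatch idea can also be expressed with a lookup table. The sketch below is a hypothetical alternative, not the library's code: the `Toy*` classes, `DISPATCH`, and the plain dict used for `data` are assumptions made for illustration. Unlike the listed source, which uses a series of `if` checks and does nothing for an unrecognized type, this variant raises an error.

```python
class ToyInference:
    """Hypothetical base class; the real classes take data and digest objects."""

    def __init__(self, data):
        self.data = data

    def infer_proteins(self):
        raise NotImplementedError


class ToyParsimony(ToyInference):
    def infer_proteins(self):
        self.data["method_run"] = "parsimony"


class ToyInclusion(ToyInference):
    def infer_proteins(self):
        self.data["method_run"] = "inclusion"


# Map each inference type string to the class that handles it
DISPATCH = {"parsimony": ToyParsimony, "inclusion": ToyInclusion}


def run_inference(data):
    """Look up the class for the configured inference type and run it."""
    inference_type = data["inference_type"]
    try:
        inference_cls = DISPATCH[inference_type]
    except KeyError:
        raise ValueError("Unknown inference type: {}".format(inference_type))
    inference_cls(data).infer_proteins()


data = {"inference_type": "inclusion"}
run_inference(data)
```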
Source code in pyproteininference/inference.py
@classmethod
def run_inference(cls, data, digest):
    """
    This class method dispatches to one of the five different inference classes/models
    based on input from the [ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter]
    object.
    The methods are "parsimony", "inclusion", "exclusion", "peptide_centric", and "first_protein".

    Args:
        data (DataStore): [DataStore object][pyproteininference.datastore.DataStore].
        digest (Digest): [Digest object][pyproteininference.in_silico_digest.Digest].

    Example:
        >>> pyproteininference.inference.Inference.run_inference(data=data,digest=digest)

    """

    inference_type = data.parameter_file_object.inference_type

    logger.info("Running Inference with Inference Type: {}".format(inference_type))

    if inference_type == Inference.PARSIMONY:
        group = Parsimony(data=data, digest=digest)
        group.infer_proteins()

    if inference_type == Inference.INCLUSION:
        group = Inclusion(data=data, digest=digest)
        group.infer_proteins()

    if inference_type == Inference.EXCLUSION:
        group = Exclusion(data=data, digest=digest)
        group.infer_proteins()

    if inference_type == Inference.FIRST_PROTEIN:
        group = FirstProtein(data=data, digest=digest)
        group.infer_proteins()

    if inference_type == Inference.PEPTIDE_CENTRIC:
        group = PeptideCentric(data=data, digest=digest)
        group.infer_proteins()

Parsimony (Inference)

Parsimony Inference class. This class contains methods that support the initialization of a Parsimony inference method.

Attributes:

Name Type Description
data DataStore

DataStore Object.

digest Digest

Digest Object.

scored_data list

a List of scored Protein objects.

lead_protein_set set

Set of protein strings that are classified as leads from the LP solver.
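
The difference between the grouping types used with Parsimony can be illustrated with plain sets. This is a simplified sketch under assumptions: `group_with_lead` and the toy identifiers are hypothetical, and the real method works on Protein objects, restricted peptides, and the in silico digest rather than a single dictionary.

```python
def group_with_lead(lead, protein_to_peptides, grouping_type="shared_peptides"):
    """Return the lead followed by the proteins grouped with it."""
    lead_peptides = protein_to_peptides[lead]
    members = []
    for protein, peptides in sorted(protein_to_peptides.items()):
        if protein == lead:
            continue
        if grouping_type == "shared_peptides" and peptides & lead_peptides:
            # Any overlap with the lead's peptides is enough
            members.append(protein)
        elif grouping_type == "subset_peptides" and peptides and peptides <= lead_peptides:
            # Every peptide of the protein must also belong to the lead
            members.append(protein)
    return [lead] + members


protein_to_peptides = {
    "LEAD": {"PEPA", "PEPB", "PEPC"},
    "SUB": {"PEPA", "PEPB"},
    "OVERLAP": {"PEPC", "PEPD"},
    "OTHER": {"PEPE"},
}
shared = group_with_lead("LEAD", protein_to_peptides, "shared_peptides")
subset = group_with_lead("LEAD", protein_to_peptides, "subset_peptides")
ungrouped = group_with_lead("LEAD", protein_to_peptides, None)
```

With a grouping type of `None`, the lead stays alone in its group.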

Source code in pyproteininference/inference.py
class Parsimony(Inference):
    """
    Parsimony Inference class. This class contains methods that support the initialization of a
    Parsimony inference method.

    Attributes:
        data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
        digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
        scored_data (list): a List of scored [Protein][pyproteininference.physical.Protein] objects.
        lead_protein_set (set): Set of protein strings that are classified as leads from the LP solver.

    """

    def __init__(self, data, digest):
        """
        Initialization method of the Parsimony object.

        Args:
            data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
            digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
        """
        self.data = data
        self.digest = digest
        self.data._validate_scored_proteins()
        self.scored_data = self.data.get_protein_data()
        self.lead_protein_set = None
        self.parameter_file_object = data.parameter_file_object

    def _create_protein_groups(
        self,
        all_scored_proteins,
        lead_protein_objects,
        grouping_type="shared_peptides",
    ):
        """
        Internal method that creates a list of lists of [Protein][pyproteininference.physical.Protein]
        objects for the Parsimony inference object.
        These lists of lists are "groups", and the proteins are grouped according to the grouping_type variable.

        Args:
            all_scored_proteins (list): list of [Protein][pyproteininference.physical.Protein] objects.
            lead_protein_objects (list): list of [Protein][pyproteininference.physical.Protein] objects
                Only needed if inference_type=parsimony.
            grouping_type (str): One of `GROUPING_TYPES`.

        Returns:
            list: list of lists of [Protein][pyproteininference.physical.Protein] objects.

        """

        logger.info("Grouping Peptides with Grouping Type: {}".format(grouping_type))
        logger.info("Grouping Peptides with Inference Type: {}".format(self.PARSIMONY))

        all_scored_proteins = sorted(
            all_scored_proteins,
            key=lambda k: (len(k.raw_peptides), k.identifier),
            reverse=True,
        )

        lead_scored_proteins = lead_protein_objects
        lead_scored_proteins = sorted(
            lead_scored_proteins,
            key=lambda k: (len(k.raw_peptides), k.identifier),
            reverse=True,
        )

        protein_finder = [x.identifier for x in all_scored_proteins]

        prot_pep_dict = self.data.protein_to_peptide_dictionary()

        protein_tracker = set()
        restricted_peptides_set = set(self.data.restricted_peptides)
        try:
            picked_removed = set([x.identifier for x in self.data.picked_proteins_removed])
        except TypeError:
            picked_removed = set()

        missing_proteins = set()
        in_silico_peptides_to_proteins = self.digest.peptide_to_protein_dictionary
        grouped_proteins = []
        for protein_objects in lead_scored_proteins:
            if protein_objects not in protein_tracker:
                protein_tracker.add(protein_objects)
                cur_protein_identifier = protein_objects.identifier

                # Set peptide variable if the peptide is in the restricted peptide set
                # Sort the peptides alphabetically
                protein_objects.peptides = set(
                    sorted([x for x in prot_pep_dict[cur_protein_identifier] if x in restricted_peptides_set])
                )
                protein_list_group = [protein_objects]
                current_peptides = prot_pep_dict[cur_protein_identifier]

                current_grouped_proteins = set()
                # Only consider peptides that remain after restriction by datastore.RestrictMainData
                for peptide in current_peptides:
                    if peptide in restricted_peptides_set:
                        # Get the proteins that map to the current peptide using in_silico_peptides_to_proteins
                        # First make sure our peptide is formatted properly...
                        if not peptide.isupper() or not peptide.isalpha():
                            # If the peptide is not all upper case or if it is not all alphabetical...
                            peptide = Psm.remove_peptide_mods(peptide)
                        potential_protein_list = in_silico_peptides_to_proteins[peptide]
                        if not potential_protein_list:
                            logger.warning(
                                "Protein {} and Peptide {} is not in database...".format(
                                    protein_objects.identifier, peptide
                                )
                            )

                        # Assign proteins to groups based on shared peptide... unless the protein is equivalent
                        # to the current identifier
                        if grouping_type != self.NONE_GROUPING:
                            for protein in potential_protein_list:
                                # If statement below to avoid grouping the same protein twice and to not group the lead
                                if (
                                    protein not in current_grouped_proteins
                                    and protein != cur_protein_identifier
                                    and protein not in picked_removed
                                    and protein not in missing_proteins
                                ):
                                    try:
                                        # Try to find its object using protein_finder (list of identifiers) and
                                        # all_scored_proteins (list of Protein Objects)
                                        cur_index = protein_finder.index(protein)
                                        current_protein_object = all_scored_proteins[cur_index]
                                        if not current_protein_object.peptides:
                                            current_protein_object.peptides = set(
                                                sorted(
                                                    [
                                                        x
                                                        for x in prot_pep_dict[current_protein_object.identifier]
                                                        if x in restricted_peptides_set
                                                    ]
                                                )
                                            )
                                        if grouping_type == self.SHARED_PEPTIDES:
                                            current_grouped_proteins.add(current_protein_object)
                                        elif grouping_type == self.SUBSET_PEPTIDES:
                                            if current_protein_object.peptides.issubset(protein_objects.peptides):
                                                current_grouped_proteins.add(current_protein_object)
                                                protein_tracker.add(current_protein_object)
                                            else:
                                                pass
                                        else:
                                            pass
                                    except ValueError:
                                        logger.warning(
                                            "Protein from DB {} not found with protein finder for peptide {}".format(
                                                protein, peptide
                                            )
                                        )
                                        missing_proteins.add(protein)

                                else:
                                    pass
                # Add the proteins to the lead if they share peptides...
                protein_list_group = protein_list_group + list(current_grouped_proteins)
                # protein_list_group at first is just the lead protein object...
                # We then try apply grouping by looking at all peptide from the lead...
                # For all of these peptide look at all other non lead proteins and try to assign them to the group...
                # We assign the entire protein object as well... in the above try/except
                # Then append this sub group to the main list
                # The variable grouped_proteins is now a list of lists which each element being a Protein object and
                # each list of protein objects corresponding to a group
                grouped_proteins.append(protein_list_group)

        return grouped_proteins

    def _swissprot_and_isoform_override(
        self,
        scored_data,
        grouped_proteins,
        override_type="soft",
        isoform_override=True,
    ):
        """
        This internal method creates and reorders protein groups based on criteria such as Reviewed/Unreviewed
        Identifiers as well as Canonical/Isoform Identifiers.
        This method is only used with parsimony inference type.

        Args:
            scored_data (list): list of scored [Protein][pyproteininference.physical.Protein] objects.
            grouped_proteins:  list of grouped [Protein][pyproteininference.physical.Protein] objects.
            override_type (str): "soft" or "hard" to indicate Reviewed/Unreviewed override. "soft" is preferred and
                default.
            isoform_override (bool): True/False on whether to favor canonical forms vs isoforms as group leads.

        Returns:
            dict: a Dictionary that contains a list of [ProteinGroup][pyproteininference.physical.ProteinGroup] objects
            (key:"group_objects") and a list of grouped [Protein][pyproteininference.physical.Protein]
            objects (key:"grouped_protein_objects").


        """

        sp_protein_set = set(self.digest.swiss_prot_protein_set)
        scored_proteins = list(scored_data)
        protein_finder = [x.identifier for x in scored_proteins]

        prot_pep_dict = self.data.protein_to_peptide_dictionary()

        # Get the higher or lower variable
        higher_or_lower = self.data.higher_or_lower()

        logger.info("Applying Group IDs... and Executing {} Swissprot Override...".format(override_type))
        # Here we create group ID's for all groups and do some sorting
        grouped_protein_objects = []
        group_id = 0
        leads = set()
        protein_group_objects = []
        for protein_group in grouped_proteins:
            protein_list = []
            group_id = group_id + 1
            # Make a protein group
            pg = ProteinGroup(group_id)
            logger.debug("Created Protein Group with ID: {}".format(str(group_id)))
            for prots in protein_group:
                # Loop over all proteins in the original group
                try:
                    # The following loop assigns group_id's, reviewed/unreviewed status, and number of unique peptides
                    pindex = protein_finder.index(prots.identifier)
                    # Attempt to find the protein object by identifier
                    cur_protein = scored_proteins[pindex]
                    if group_id not in cur_protein.group_identification:
                        cur_protein.group_identification.add(group_id)
                    if prots.identifier in sp_protein_set:
                        cur_protein.reviewed = True
                    else:
                        cur_protein.unreviewed = True
                    cur_identifier = prots.identifier
                    cur_protein.num_peptides = len(prot_pep_dict[cur_identifier])
                    # Here append the number of unique peptides... so we can use this as secondary sorting...
                    protein_list.append(cur_protein)
                    # Sorted groups then becomes a list of lists... of protein objects

                except ValueError:
                    # Here we pass if the protein does not have a score...
                    # Potentially it got 'picked' (removed) by protein picker...
                    pass

            # Sort protein sub group
            protein_list = datastore.DataStore.sort_protein_sub_groups(
                protein_list=protein_list, higher_or_lower=higher_or_lower
            )

            # grouped_protein_objects is the MAIN list of lists with grouped protein objects
            grouped_protein_objects.append(protein_list)
            # If the lead is reviewed append it to leads and do nothing else...
            # If the lead is unreviewed then try to replace it with the best reviewed hit
            # Run swissprot override
            if self.data.parameter_file_object.reviewed_identifier_symbol:
                sp_override = self._swissprot_override(
                    protein_list=protein_list,
                    leads=leads,
                    grouped_protein_objects=grouped_protein_objects,
                    override_type=override_type,
                )
                grouped_protein_objects = sp_override["grouped_protein_objects"]
                leads = sp_override["leads"]
                protein_list = sp_override["protein_list"]

            # Run isoform override If we want to run isoform_override and if the isoform symbol exists...
            if isoform_override and self.data.parameter_file_object.isoform_symbol:
                iso_override = self._isoform_override(
                    protein_list=protein_list,
                    leads=leads,
                    grouped_protein_objects=grouped_protein_objects,
                )
                grouped_protein_objects = iso_override["grouped_protein_objects"]
                leads = iso_override["leads"]
                protein_list = iso_override["protein_list"]

            pg.proteins = protein_list
            protein_group_objects.append(pg)

        return_dict = {
            "grouped_protein_objects": grouped_protein_objects,
            "group_objects": protein_group_objects,
        }

        return return_dict

    def _swissprot_override(self, protein_list, leads, grouped_protein_objects, override_type):
        """
        This method re-assigns protein group leads if the lead is an unreviewed protein and if the protein group
         contains a reviewed protein that contains the exact same set of peptides as the unreviewed lead.
        This method is here to provide consistency to the output.

        Args:
            protein_list (list): List of grouped [Protein][pyproteininference.physical.Protein] objects.
            leads (set): Set of string protein identifiers that have been identified as a lead.
            grouped_protein_objects (list): List of protein_list lists.
            override_type (str): "soft" or "hard" on how to override non reviewed identifiers. "soft" is preferred.

        Returns:
            dict: a Dictionary with the following keys, each updated to reflect lead changes:
                "leads" (set): Set of string protein identifiers that have been identified as a lead.
                "grouped_protein_objects" (list): List of protein_list lists.
                "protein_list" (list): List of grouped [Protein][pyproteininference.physical.Protein] objects.

        """

        if not protein_list[0].reviewed:
            # If the lead is unreviewed attempt to replace it...
            # Start to loop through protein_list which is the current group...
            for protein in protein_list[1:]:
                # Find the first reviewed hit... if it is not already a lead protein then swap the leads and break...
                if protein.reviewed:
                    best_swiss_prot_prot = protein

                    if override_type == "soft":
                        # If the lead protein's peptides are a subset of the best swissprot... then swap the proteins.
                        # (meaning equal peptides or the swissprot completely covers the TrEMBL reference)
                        if best_swiss_prot_prot.identifier not in leads and set(protein_list[0].peptides).issubset(
                            set(best_swiss_prot_prot.peptides)
                        ):
                            # We use -1 as the index of grouped_protein_objects because the current 'protein_list' is
                            # the last entry appended to grouped_protein_objects
                            # Essentially grouped_protein_objects[-1]==protein_list
                            # We need this syntax so we can switch the location of the unreviewed lead identifier with
                            # the best reviewed identifier in grouped_protein_objects
                            swiss_prot_override_index = grouped_protein_objects[-1].index(best_swiss_prot_prot)
                            cur_tr_lead = grouped_protein_objects[-1][0]
                            (
                                grouped_protein_objects[-1][0],
                                grouped_protein_objects[-1][swiss_prot_override_index],
                            ) = (
                                grouped_protein_objects[-1][swiss_prot_override_index],
                                grouped_protein_objects[-1][0],
                            )
                            new_sp_lead = grouped_protein_objects[-1][0]
                            logger.info(
                                "Overriding Unreviewed {} with Reviewed {}".format(
                                    cur_tr_lead.identifier, new_sp_lead.identifier
                                )
                            )

                            # Append new_sp_lead protein to leads, to make sure we dont repeat leads
                            leads.add(new_sp_lead.identifier)
                            break
                        else:
                            # If no reviewed and none not in leads then pass...
                            pass

                    if override_type == "hard":
                        if best_swiss_prot_prot.identifier not in leads:
                            # We use -1 as the index of grouped_protein_objects because the current 'protein_list'
                            # is the last entry appended to grouped_protein_objects
                            # Essentially grouped_protein_objects[-1]==protein_list
                            # We need this syntax so we can switch the location of the unreviewed lead identifier
                            # with the best reviewed identifier in grouped_protein_objects
                            swiss_prot_override_index = grouped_protein_objects[-1].index(best_swiss_prot_prot)
                            cur_tr_lead = grouped_protein_objects[-1][0]
                            # Re-assigning the value within the index will also reassign the value in protein_list...
                            # This is because grouped_protein_objects[-1] equals protein_list
                            # So we do not have to reassign values in protein_list
                            (
                                grouped_protein_objects[-1][0],
                                grouped_protein_objects[-1][swiss_prot_override_index],
                            ) = (
                                grouped_protein_objects[-1][swiss_prot_override_index],
                                grouped_protein_objects[-1][0],
                            )
                            new_sp_lead = grouped_protein_objects[-1][0]
                            logger.info(
                                "Overriding Unreviewed {} with Reviewed {}".format(
                                    cur_tr_lead.identifier, new_sp_lead.identifier
                                )
                            )

                            # Add the new_sp_lead identifier to leads to make sure we don't repeat leads
                            leads.add(new_sp_lead.identifier)
                            break
                        else:
                            # If no reviewed and none not in leads then pass...
                            pass

                else:
                    pass

        else:
            leads.add(protein_list[0].identifier)

        return_dict = {
            "leads": leads,
            "grouped_protein_objects": grouped_protein_objects,
            "protein_list": protein_list,
        }

        return return_dict

    def _isoform_override(self, protein_list, grouped_protein_objects, leads):
        """
        This method re-assigns protein group leads if the lead is an isoform protein and if the protein group contains
        a canonical protein that contains the exact same set of peptides as the isoform lead.
        This method is here to provide consistency to the output.

        Args:
            protein_list (list): List of grouped [Protein][pyproteininference.physical.Protein] objects.
            leads (set): Set of string protein identifiers that have been identified as a lead.
            grouped_protein_objects (list): List of protein_list lists.

        Returns:
            dict: leads (set): Set of string protein identifiers that have been identified as a lead. Updated to
                reflect lead changes.
            grouped_protein_objects (list): List of protein_list lists. Updated to reflect lead changes.
            protein_list (list): List of grouped [Protein][pyproteininference.physical.Protein] objects.
                Updated to reflect lead changes.


        """

        if self.data.parameter_file_object.isoform_symbol in protein_list[0].identifier:
            pure_id = protein_list[0].identifier.split(self.data.parameter_file_object.isoform_symbol)[0]
            # Start to loop through protein_list which is the current group...
            for potential_replacement in protein_list[1:]:
                isoform_override = potential_replacement
                if (
                    isoform_override.identifier == pure_id
                    and isoform_override.identifier not in leads
                    and set(protein_list[0].peptides).issubset(set(isoform_override.peptides))
                ):
                    isoform_override_index = grouped_protein_objects[-1].index(isoform_override)
                    cur_iso_lead = grouped_protein_objects[-1][0]
                    # Re-assigning the value within the index will also reassign the value in protein_list...
                    # This is because grouped_protein_objects[-1] equals protein_list
                    # So we do not have to reassign values in protein_list
                    (grouped_protein_objects[-1][0], grouped_protein_objects[-1][isoform_override_index],) = (
                        grouped_protein_objects[-1][isoform_override_index],
                        grouped_protein_objects[-1][0],
                    )
                    new_iso_lead = grouped_protein_objects[-1][0]
                    logger.info(
                        "Overriding Isoform {} with {}".format(cur_iso_lead.identifier, new_iso_lead.identifier)
                    )
                    leads.add(protein_list[0].identifier)

        return_dict = {
            "leads": leads,
            "grouped_protein_objects": grouped_protein_objects,
            "protein_list": protein_list,
        }

        return return_dict

    def _reassign_protein_group_leads(self, protein_group_objects):
        """
        This internal method corrects leads that are improperly assigned in the parsimony inference method.
        This method acts on the protein group objects.

        Args:
            protein_group_objects (list): List of [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.

        Returns:
            protein_group_objects (list): List of [ProteinGroup][pyproteininference.physical.ProteinGroup] objects
            where leads have been reassigned properly.


        """

        # Get the higher or lower variable
        if not self.data.high_low_better:
            higher_or_lower = self.data.higher_or_lower()
        else:
            higher_or_lower = self.data.high_low_better

        # Sometimes we have cases where:
        # protein a maps to peptides 1,2,3
        # protein b maps to peptides 1,2
        # protein c maps to a bunch of peptides and peptide 3
        # Therefore, in the model proteins a and b are equivalent in that they map to 2 peptides together - 1 and 2.
        # peptide 3 maps to a but also to c...
        # Sometimes the model (pulp) will spit out protein b as the lead... we wish to swap protein b as the lead with
        # protein a because it will likely have a better score...
        logger.info("Potentially Reassigning Protein Group leads...")
        lead_protein_set = set([x.proteins[0].identifier for x in protein_group_objects])
        for i in range(len(protein_group_objects)):
            for j in range(1, len(protein_group_objects[i].proteins)):  # Loop over all sub proteins in the group...
                # if the lead proteins peptides are a subset of one of its proteins in the group, and the secondary
                # protein is not a lead protein and its score is better than the leads... and it has more peptides...
                new_lead = protein_group_objects[i].proteins[j]
                old_lead = protein_group_objects[i].proteins[0]
                if higher_or_lower == datastore.DataStore.HIGHER_PSM_SCORE:
                    if (
                        set(old_lead.peptides).issubset(set(new_lead.peptides))
                        and new_lead.identifier not in lead_protein_set
                        and old_lead.score <= new_lead.score
                        and len(old_lead.peptides) < len(new_lead.peptides)
                    ):
                        logger.info(
                            "protein {} will replace protein {} as lead, with index {}, New Num Peptides: {}, "
                            "Old Num Peptides: {}".format(
                                str(new_lead.identifier),
                                str(old_lead.identifier),
                                str(j),
                                str(len(new_lead.peptides)),
                                str(len(old_lead.peptides)),
                            )
                        )
                        lead_protein_set.add(new_lead.identifier)
                        lead_protein_set.remove(old_lead.identifier)
                        # Swap their positions in the list
                        (
                            protein_group_objects[i].proteins[0],
                            protein_group_objects[i].proteins[j],
                        ) = (new_lead, old_lead)
                        break

                if higher_or_lower == datastore.DataStore.LOWER_PSM_SCORE:
                    if (
                        set(old_lead.peptides).issubset(set(new_lead.peptides))
                        and new_lead.identifier not in lead_protein_set
                        and old_lead.score >= new_lead.score
                        and len(old_lead.peptides) < len(new_lead.peptides)
                    ):
                        logger.info(
                            "protein {} will replace protein {} as lead, with index {}, New Num Peptides: {}, "
                            "Old Num Peptides: {}".format(
                                str(new_lead.identifier),
                                str(old_lead.identifier),
                                str(j),
                                str(len(new_lead.peptides)),
                                str(len(old_lead.peptides)),
                            )
                        )
                        lead_protein_set.add(new_lead.identifier)
                        lead_protein_set.remove(old_lead.identifier)
                        # Swap their positions in the list
                        (
                            protein_group_objects[i].proteins[0],
                            protein_group_objects[i].proteins[j],
                        ) = (new_lead, old_lead)
                        break

        return protein_group_objects

    def _reassign_protein_list_leads(self, grouped_protein_objects):
        """
        This internal method corrects leads that are improperly assigned in the parsimony inference method.
        This method acts on the grouped protein objects.

        Args:
            grouped_protein_objects (list): List of lists of [Protein][pyproteininference.physical.Protein] objects,
                where the first entry of each inner list is the group lead.

        Returns:
            list: List of lists of [Protein][pyproteininference.physical.Protein] objects where leads have been
                reassigned properly.


        """

        # Get the higher or lower variable
        if not self.data.high_low_better:
            higher_or_lower = self.data.higher_or_lower()
        else:
            higher_or_lower = self.data.high_low_better

        # Sometimes we have cases where:
        # protein a maps to peptides 1,2,3
        # protein b maps to peptides 1,2
        # protein c maps to a bunch of peptides and peptide 3
        # Therefore, in the model proteins a and b are equivalent in that they map to 2 peptides together - 1 and 2.
        # peptide 3 maps to a but also to c...
        # Sometimes the model (pulp) will spit out protein b as the lead... we wish to swap protein b as the lead with
        # protein a because it will likely have a better score...
        logger.info("Potentially Reassigning Protein List leads...")
        lead_protein_set = set([x[0].identifier for x in grouped_protein_objects])
        for i in range(len(grouped_protein_objects)):
            for j in range(1, len(grouped_protein_objects[i])):  # Loop over all sub proteins in the group...
                # if the lead proteins peptides are a subset of one of its proteins in the group, and the secondary
                # protein is not a lead protein and its score is better than the leads... and it has more peptides...
                new_lead = grouped_protein_objects[i][j]
                old_lead = grouped_protein_objects[i][0]
                if higher_or_lower == datastore.DataStore.HIGHER_PSM_SCORE:
                    if (
                        set(old_lead.peptides).issubset(set(new_lead.peptides))
                        and new_lead.identifier not in lead_protein_set
                        and old_lead.score <= new_lead.score
                        and len(old_lead.peptides) < len(new_lead.peptides)
                    ):
                        logger.info(
                            "protein {} will replace protein {} as lead, with index {}, New Num Peptides: {}, "
                            "Old Num Peptides: {}".format(
                                str(new_lead.identifier),
                                str(old_lead.identifier),
                                str(j),
                                str(len(new_lead.peptides)),
                                str(len(old_lead.peptides)),
                            )
                        )
                        lead_protein_set.add(new_lead.identifier)
                        lead_protein_set.remove(old_lead.identifier)
                        # Swap their positions in the list
                        (
                            grouped_protein_objects[i][0],
                            grouped_protein_objects[i][j],
                        ) = (new_lead, old_lead)
                        break

                if higher_or_lower == datastore.DataStore.LOWER_PSM_SCORE:
                    if (
                        set(old_lead.peptides).issubset(set(new_lead.peptides))
                        and new_lead.identifier not in lead_protein_set
                        and old_lead.score >= new_lead.score
                        and len(old_lead.peptides) < len(new_lead.peptides)
                    ):
                        logger.info(
                            "protein {} will replace protein {} as lead, with index {}, New Num Peptides: {}, "
                            "Old Num Peptides: {}".format(
                                str(new_lead.identifier),
                                str(old_lead.identifier),
                                str(j),
                                str(len(new_lead.peptides)),
                                str(len(old_lead.peptides)),
                            )
                        )
                        lead_protein_set.add(new_lead.identifier)
                        lead_protein_set.remove(old_lead.identifier)
                        # Swap their positions in the list
                        (
                            grouped_protein_objects[i][0],
                            grouped_protein_objects[i][j],
                        ) = (new_lead, old_lead)
                        break

        return grouped_protein_objects

    def _pulp_grouper(self):
        """
        This internal function uses pulp to solve the LP problem for parsimony, then performs protein grouping with
        the various internal grouping functions.

        This method assigns the variables: `grouped_scored_proteins` and `protein_group_objects`.
        These are both variables of the [DataStore object][pyproteininference.datastore.DataStore] and are
        lists of [Protein][pyproteininference.physical.Protein] objects
        and [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.

        """

        # Here we get the peptide to protein dictionary
        pep_prot_dict = self.data.peptide_to_protein_dictionary()

        self.data.protein_to_peptide_dictionary()

        identifiers_sorted = self.data.get_sorted_identifiers(scored=True)

        # Get all the proteins that we scored and the ones picked if picker was run...
        data_proteins = sorted([x for x in self.data.protein_peptide_dictionary.keys() if x in identifiers_sorted])
        # Get the set of peptides for each protein...
        data_peptides = [set(self.data.protein_peptide_dictionary[x]) for x in data_proteins]
        flat_peptides_in_data = set([item for sublist in data_peptides for item in sublist])

        peptide_sets = []
        # Loop over the list of peptides...
        for k in range(len(data_peptides)):
            raw_peptides = data_peptides[k]
            peptide_set = set()
            # Loop over each individual peptide per protein...
            for peps in raw_peptides:
                peptide = peps

                # Remove mods...
                new_peptide = Psm.remove_peptide_mods(peptide)
                # Add it to a temporary set...
                peptide_set.add(new_peptide)
            # Append this set to a new list...
            peptide_sets.append(peptide_set)
            # Set that proteins peptides to be the unmodified ones...
            data_peptides[k] = peptide_set

        # Get them all...
        all_peptides = [x for x in data_peptides]
        # Remove redundant sets...
        non_redundant_peptide_sets = [set(i) for i in OrderedDict.fromkeys(frozenset(item) for item in peptide_sets)]

        # Loop over  the restricted list of peptides...
        ind_list = []
        for pep_sets in non_redundant_peptide_sets:
            # Get its index in terms of the overall list...
            ind_list.append(all_peptides.index(pep_sets))

        # Get the protein based on the index
        restricted_proteins = [data_proteins[x] for x in range(len(data_peptides)) if x in ind_list]

        # Here we get the list of all proteins
        plist = []
        for peps in pep_prot_dict.keys():
            for prots in list(pep_prot_dict[peps]):
                if prots in restricted_proteins and peps in flat_peptides_in_data:
                    plist.append(prots)

        # Here we get the unique proteins
        unique_prots = list(set(plist).union())
        unique_protein_set = set(unique_prots)

        unique_prots_sorted = [x for x in identifiers_sorted if x in unique_prots]

        # Define the protein variables with a lower bound of 0 and category Integer
        prots = pulp.LpVariable.dicts("prot", indices=unique_prots_sorted, lowBound=0, cat="Integer")

        # Define our Lp Problem which is to Minimize our objective function
        prob = pulp.LpProblem("Parsimony_Problem", pulp.LpMinimize)

        # Define our objective function, which is to take the sum of all of our proteins and find the minimum set.
        prob += pulp.lpSum([prots[i] for i in prots])

        # Set up our constraints. The constraints are as follows:

        # Loop over each peptide and determine the proteins it maps to...
        # Each peptide is a constraint with the proteins it maps to having to be greater than or equal to 1
        # In the case below we see that protein 3 has a unique peptide, protein 2 is redundant

        logger.info("Sorting peptides before looping")
        for peptides in sorted(list(pep_prot_dict.keys())):
            try:
                prob += (
                    pulp.lpSum([prots[i] for i in sorted(list(pep_prot_dict[peptides])) if i in unique_protein_set])
                    >= 1
                )
            except KeyError:
                logger.info("Not including protein {} in pulp model".format(pep_prot_dict[peptides]))

        prob.solve()

        scored_data = self.data.get_protein_data()
        scored_proteins = list(scored_data)
        protein_finder = [x.identifier for x in scored_proteins]

        lead_protein_objects = []
        lead_protein_identifiers = []
        for proteins in unique_prots_sorted:
            parsimony_value = pulp.value(prots[proteins])
            if proteins in protein_finder and parsimony_value == 1:
                p_ind = protein_finder.index(proteins)
                protein_object = scored_proteins[p_ind]
                lead_protein_objects.append(protein_object)
                lead_protein_identifiers.append(protein_object.identifier)
            else:
                if parsimony_value == 1:
                    # Why are some proteins not being found when we run exclusion???
                    logger.warning("Protein {} not found with protein finder...".format(proteins))
                else:
                    pass

        self.lead_protein_objects = lead_protein_objects

        grouped_proteins = self._create_protein_groups(
            all_scored_proteins=scored_data,
            lead_protein_objects=self.lead_protein_objects,
            grouping_type=self.data.parameter_file_object.grouping_type,
        )

        regrouped_proteins = self._swissprot_and_isoform_override(
            scored_data=scored_data,
            grouped_proteins=grouped_proteins,
            override_type="soft",
            isoform_override=True,
        )

        grouped_protein_objects = regrouped_proteins["grouped_protein_objects"]
        protein_group_objects = regrouped_proteins["group_objects"]

        # Get the higher or lower variable
        hl = self.data.higher_or_lower()

        logger.info("Sorting Results based on lead Protein Score")
        grouped_protein_objects = datastore.DataStore.sort_protein_objects(
            grouped_protein_objects=grouped_protein_objects, higher_or_lower=hl
        )
        protein_group_objects = datastore.DataStore.sort_protein_group_objects(
            protein_group_objects=protein_group_objects, higher_or_lower=hl
        )

        # Run lead reassignment for the group objects and protein objects
        protein_group_objects = self._reassign_protein_group_leads(
            protein_group_objects=protein_group_objects,
        )

        grouped_protein_objects = self._reassign_protein_list_leads(grouped_protein_objects=grouped_protein_objects)

        logger.info("Re Sorting Results based on lead Protein Score")
        grouped_protein_objects = datastore.DataStore.sort_protein_objects(
            grouped_protein_objects=grouped_protein_objects, higher_or_lower=hl
        )
        protein_group_objects = datastore.DataStore.sort_protein_group_objects(
            protein_group_objects=protein_group_objects, higher_or_lower=hl
        )

        self.data.grouped_scored_proteins = grouped_protein_objects
        self.data.protein_group_objects = protein_group_objects

    def infer_proteins(self):
        """
        This method performs the Parsimony inference method and uses pulp for the LP solver.

        This method assigns the variables: `grouped_scored_proteins` and `protein_group_objects`.
        These are both variables of the [DataStore object][pyproteininference.datastore.DataStore] and are
        lists of [Protein][pyproteininference.physical.Protein] objects
        and [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.

        """

        if self.parameter_file_object.lp_solver == self.PULP:

            self._pulp_grouper()

        else:
            raise ValueError(
                "Parsimony cannot run if lp_solver parameter value is not one of the following: {}".format(
                    ", ".join(Inference.LP_SOLVERS)
                )
            )

        # Call assign shared peptides
        self._assign_shared_peptides(shared_pep_type=self.parameter_file_object.shared_peptides)

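    def _assign_shared_peptides(self, shared_pep_type="all"):
        """
        This internal method assigns shared peptides after parsimony grouping. If `shared_pep_type` equals
        `ALL_SHARED_PEPTIDES`, nothing is changed. If it equals `BEST_SHARED_PEPTIDES`, each peptide is kept only
        on the first lead protein that contains it, in the order of the sorted results.

        Args:
            shared_pep_type (str): How shared peptides are assigned to lead proteins.

        """

        # Raise if either of the grouped results is missing
        if not (self.data.grouped_scored_proteins and self.data.protein_group_objects):
            raise ValueError(
                "Grouped Protein objects could not be found. Please run 'infer_proteins' method of the Parsimony class"
            )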

        if shared_pep_type == self.ALL_SHARED_PEPTIDES:
            pass

        elif shared_pep_type == self.BEST_SHARED_PEPTIDES:
            logger.info("Assigning Shared Peptides from Parsimony to the Best Scoring Protein")
            raw_peptide_tracker = set()
            peptide_tracker = set()
            for prots in self.data.grouped_scored_proteins:
                new_psms = []
                new_raw_peptides = set()
                new_peptides = set()
                lead_prot = prots[0]
                for psm in lead_prot.psms:
                    raw_pep = psm.identifier
                    pep = psm.non_flanking_peptide
                    if raw_pep not in raw_peptide_tracker:
                        new_raw_peptides.add(raw_pep)
                        raw_peptide_tracker.add(raw_pep)
                    if pep not in peptide_tracker:
                        new_peptides.add(pep)
                        new_psms.append(psm)
                        peptide_tracker.add(pep)
                lead_prot.psms = new_psms
                lead_prot.raw_peptides = new_raw_peptides
                lead_prot.peptides = new_peptides

            raw_peptide_tracker = set()
            peptide_tracker = set()
            for group in self.data.protein_group_objects:
                lead_prot = group.proteins[0]
                new_psms = []
                new_raw_peptides = set()
                new_peptides = set()
                for psm in lead_prot.psms:
                    raw_pep = psm.identifier
                    pep = psm.non_flanking_peptide
                    if raw_pep not in raw_peptide_tracker:
                        new_raw_peptides.add(raw_pep)
                        raw_peptide_tracker.add(raw_pep)
                    if pep not in peptide_tracker:
                        new_peptides.add(pep)
                        new_psms.append(psm)
                        peptide_tracker.add(pep)

                lead_prot.psms = new_psms
                lead_prot.raw_peptides = new_raw_peptides
                lead_prot.peptides = new_peptides

        else:
            pass
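
The parsimony step in `_pulp_grouper` above amounts to a minimum set cover: select the fewest proteins such that every peptide maps to at least one selected protein. The sketch below illustrates that constraint on toy data. It deliberately swaps in brute-force enumeration for the pulp LP solver, so it is only practical for very small inputs, and the function name `minimal_protein_cover` and the toy identifiers are hypothetical.

```python
from itertools import combinations


def minimal_protein_cover(peptide_to_proteins):
    """Return a smallest set of proteins that explains every peptide.

    Brute-force stand-in for the LP solved with pulp; toy inputs only.
    """
    proteins = sorted({p for prots in peptide_to_proteins.values() for p in prots})
    for size in range(1, len(proteins) + 1):
        for candidate in combinations(proteins, size):
            chosen = set(candidate)
            # Same constraint as the LP: each peptide needs >= 1 selected protein.
            if all(chosen & prots for prots in peptide_to_proteins.values()):
                return chosen
    return set()


# Hypothetical toy data: peptide -> set of proteins it maps to.
toy = {
    "PEPTIDEA": {"P1", "P2"},
    "PEPTIDEB": {"P1"},
    "PEPTIDEC": {"P3"},
    "PEPTIDED": {"P2", "P3"},
}
leads = minimal_protein_cover(toy)
print(sorted(leads))  # ['P1', 'P3']
```

Here P2 is redundant because P1 and P3 together already explain all four peptides. When several covers of the same size exist, this sketch returns the first one in sorted order, which need not match what an LP solver would pick.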

__init__(self, data, digest) special

Initialization method of the Parsimony object.

Parameters:

Name Type Description
data DataStore

DataStore Object.

digest Digest

Digest Object.

Source code in pyproteininference/inference.py
def __init__(self, data, digest):
    """
    Initialization method of the Parsimony object.

    Args:
        data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
        digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
    """
    self.data = data
    self.digest = digest
    self.data._validate_scored_proteins()
    self.scored_data = self.data.get_protein_data()
    self.lead_protein_set = None
    self.parameter_file_object = data.parameter_file_object

infer_proteins(self)

This method performs the Parsimony inference method and uses pulp for the LP solver.

This method assigns the variables: grouped_scored_proteins and protein_group_objects. These are both variables of the DataStore object and are lists of Protein objects and ProteinGroup objects.
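
Within each inner list of grouped_scored_proteins, index 0 holds the group lead. During parsimony the lead can be swapped by `_reassign_protein_list_leads`, and the rule it applies can be sketched roughly as follows. `ToyProtein` is a hypothetical stand-in for the library's Protein object, and the `higher_is_better` flag is an assumption that replaces the `high_low_better` lookup.

```python
from collections import namedtuple

# Hypothetical stand-in for the Protein object (only the fields used here).
ToyProtein = namedtuple("ToyProtein", ["identifier", "score", "peptides"])


def reassign_leads(groups, higher_is_better=True):
    """Swap a lead with a group member that covers it, scores at least as well, and has more peptides."""
    lead_ids = {group[0].identifier for group in groups}
    for group in groups:
        old_lead = group[0]
        for j in range(1, len(group)):
            candidate = group[j]
            if higher_is_better:
                score_ok = candidate.score >= old_lead.score
            else:
                score_ok = candidate.score <= old_lead.score
            if (
                set(old_lead.peptides) <= set(candidate.peptides)
                and candidate.identifier not in lead_ids
                and score_ok
                and len(old_lead.peptides) < len(candidate.peptides)
            ):
                lead_ids.add(candidate.identifier)
                lead_ids.discard(old_lead.identifier)
                # Swap positions so the candidate becomes the lead.
                group[0], group[j] = candidate, old_lead
                break
    return groups


groups = [
    [
        ToyProtein("B", 10.0, {"pep1", "pep2"}),
        ToyProtein("A", 12.0, {"pep1", "pep2", "pep3"}),
    ],
    [ToyProtein("C", 20.0, {"pep3", "pep4"})],
]
reassign_leads(groups)
print([group[0].identifier for group in groups])  # ['A', 'C']
```

Protein A replaces B as lead because A contains all of B's peptides plus one more, scores at least as well, and is not already a lead elsewhere.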

Source code in pyproteininference/inference.py
def infer_proteins(self):
    """
    This method performs the Parsimony inference method and uses pulp for the LP solver.

    This method assigns the variables: `grouped_scored_proteins` and `protein_group_objects`.
    These are both variables of the [DataStore object][pyproteininference.datastore.DataStore] and are
    lists of [Protein][pyproteininference.physical.Protein] objects
    and [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.

    """

    if self.parameter_file_object.lp_solver == self.PULP:

        self._pulp_grouper()

    else:
        raise ValueError(
            "Parsimony cannot run if lp_solver parameter value is not one of the following: {}".format(
                ", ".join(Inference.LP_SOLVERS)
            )
        )

    # Call assign shared peptides
    self._assign_shared_peptides(shared_pep_type=self.parameter_file_object.shared_peptides)
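
The final call above hands shared peptides to `_assign_shared_peptides`. When the best-scoring mode is selected, the idea is that each peptide stays only on the first (best ranked) lead protein that contains it. A minimal sketch of that idea, using plain tuples rather than Protein and Psm objects (the function name and toy data are hypothetical):

```python
def assign_shared_peptides_to_best(ranked_leads):
    """Keep each peptide only on the first lead that contains it.

    ranked_leads: list of (protein_id, peptides) pairs, sorted best-first.
    """
    seen = set()
    assigned = []
    for protein_id, peptides in ranked_leads:
        kept = []
        for pep in peptides:
            if pep not in seen:
                # First time this peptide is seen: it belongs to this lead.
                seen.add(pep)
                kept.append(pep)
        assigned.append((protein_id, kept))
    return assigned


ranked = [
    ("P1", ["pepA", "pepB"]),
    ("P2", ["pepB", "pepC"]),
    ("P3", ["pepA"]),
]
result = assign_shared_peptides_to_best(ranked)
print(result)
# [('P1', ['pepA', 'pepB']), ('P2', ['pepC']), ('P3', [])]
```

Because the tracker is filled in ranked order, a lower ranked lead can end up with no peptides at all, as P3 does here.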

PeptideCentric (Inference)

PeptideCentric Inference class. This class contains methods that support the initialization of a PeptideCentric inference method.

Attributes:

Name Type Description
data DataStore

DataStore Object.

digest Digest

Digest Object.

Source code in pyproteininference/inference.py
class PeptideCentric(Inference):
    """
    PeptideCentric Inference class. This class contains methods that support the initialization of a
    PeptideCentric inference method.

    Attributes:
        data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
        digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].

    """

    def __init__(self, data, digest):
        """
        PeptideCentric Inference initialization method.

        Args:
            data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
            digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].

        Returns:
            object:
        """
        self.data = data
        self.digest = digest
        self.data._validate_scored_proteins()
        self.scored_data = self.data.get_protein_data()

    def infer_proteins(self):
        """
        This method performs the Peptide Centric inference method.

        This method assigns the variables: `grouped_scored_proteins` and `protein_group_objects`.
        These are both variables of the [DataStore object][pyproteininference.datastore.DataStore] and are
        lists of [Protein][pyproteininference.physical.Protein] objects
        and [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.

        Returns:
            None:

        """

        # Get the higher or lower variable
        hl = self.data.higher_or_lower()

        logger.info("Applying Group ID's for the Peptide Centric Method")
        regrouped_proteins = self._apply_protein_group_ids()

        grouped_protein_objects = regrouped_proteins["grouped_protein_objects"]
        protein_group_objects = regrouped_proteins["group_objects"]

        logger.info("Sorting Results based on lead Protein Score")
        grouped_protein_objects = datastore.DataStore.sort_protein_objects(
            grouped_protein_objects=grouped_protein_objects, higher_or_lower=hl
        )
        protein_group_objects = datastore.DataStore.sort_protein_group_objects(
            protein_group_objects=protein_group_objects, higher_or_lower=hl
        )

        self.data.grouped_scored_proteins = grouped_protein_objects
        self.data.protein_group_objects = protein_group_objects

    def _apply_protein_group_ids(self):
        """
        This method creates the ProteinGroup objects for the peptide_centric inference based on protein groups
        from [._create_protein_groups][pyproteininference.inference.Inference._create_protein_groups].

        Returns:
            dict: a Dictionary that contains a list of [ProteinGroup][pyproteininference.physical.ProteinGroup]
            objects (key:"group_objects") and a list of grouped [Protein][pyproteininference.physical.Protein]
            objects (key:"grouped_protein_objects").

        """

        grouped_protein_objects = self.data.get_protein_data()

        # Here we create group ID's
        group_id = 0
        list_of_proteins_grouped = []
        protein_group_objects = []
        for protein_group in grouped_protein_objects:
            protein_group.peptides = set(
                [Psm.split_peptide(peptide_string=x) for x in list(protein_group.raw_peptides)]
            )
            protein_list = []
            group_id = group_id + 1
            pg = ProteinGroup(group_id)
            logger.debug("Created Protein Group with ID: {}".format(str(group_id)))
            # The following loop assigns group_id's, reviewed/unreviewed status, and number of unique peptides...
            if group_id not in protein_group.group_identification:
                protein_group.group_identification.add(group_id)
            protein_group.num_peptides = len(protein_group.peptides)
            # Here append the number of unique peptides... so we can use this as secondary sorting...
            protein_list.append(protein_group)
            # Sorted protein_groups then becomes a list of lists... of protein objects

            pg.proteins = protein_list
            protein_group_objects.append(pg)
            list_of_proteins_grouped.append([protein_group])

        return_dict = {
            "grouped_protein_objects": list_of_proteins_grouped,
            "group_objects": protein_group_objects,
        }

        return return_dict
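
The group ID assignment above can be sketched in isolation. This is a minimal illustration, not the library's implementation: `SketchProtein`, `SketchProteinGroup`, and `strip_flanks` are hypothetical stand-ins for `Protein`, `ProteinGroup`, and `Psm.split_peptide`, and the flank-stripping rule (`K.PEPTIDE.R` becomes `PEPTIDE`) is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class SketchProtein:
    """Hypothetical stand-in for pyproteininference.physical.Protein."""
    identifier: str
    raw_peptides: set
    peptides: set = field(default_factory=set)
    group_identification: set = field(default_factory=set)
    num_peptides: int = 0


@dataclass
class SketchProteinGroup:
    """Hypothetical stand-in for pyproteininference.physical.ProteinGroup."""
    number_id: int
    proteins: list = field(default_factory=list)


def strip_flanks(peptide_string):
    """Assumed behaviour: drop flanking residues, e.g. 'K.PEPTIDE.R' -> 'PEPTIDE'."""
    parts = peptide_string.split(".")
    return parts[1] if len(parts) == 3 else peptide_string


def apply_group_ids(proteins):
    """Give every protein its own 1-based group ID, one protein per group."""
    grouped, groups = [], []
    for group_id, protein in enumerate(proteins, start=1):
        protein.peptides = {strip_flanks(p) for p in protein.raw_peptides}
        protein.group_identification.add(group_id)
        protein.num_peptides = len(protein.peptides)
        groups.append(SketchProteinGroup(group_id, [protein]))
        grouped.append([protein])
    return {"grouped_protein_objects": grouped, "group_objects": groups}
```

Each scored entry becomes a group of exactly one protein, so the two returned lists always have the same length.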

__init__(self, data, digest) special

PeptideCentric Inference initialization method.

Parameters:
  • data (DataStore): DataStore Object.
  • digest (Digest): Digest Object.

Returns:
  • object

Source code in pyproteininference/inference.py
def __init__(self, data, digest):
    """
    PeptideCentric Inference initialization method.

    Args:
        data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
        digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].

    Returns:
        object:
    """
    self.data = data
    self.digest = digest
    self.data._validate_scored_proteins()
    self.scored_data = self.data.get_protein_data()

infer_proteins(self)

This method performs the Peptide Centric inference method.

This method assigns the variables: grouped_scored_proteins and protein_group_objects. These are both variables of the DataStore object and are lists of Protein objects and ProteinGroup objects.

Returns:
  • None

Source code in pyproteininference/inference.py
def infer_proteins(self):
    """
    This method performs the Peptide Centric inference method.

    This method assigns the variables: `grouped_scored_proteins` and `protein_group_objects`.
    These are both variables of the [DataStore object][pyproteininference.datastore.DataStore] and are
    lists of [Protein][pyproteininference.physical.Protein] objects
    and [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.

    Returns:
        None:

    """

    # Get the higher or lower variable
    hl = self.data.higher_or_lower()

    logger.info("Applying Group ID's for the Peptide Centric Method")
    regrouped_proteins = self._apply_protein_group_ids()

    grouped_protein_objects = regrouped_proteins["grouped_protein_objects"]
    protein_group_objects = regrouped_proteins["group_objects"]

    logger.info("Sorting Results based on lead Protein Score")
    grouped_protein_objects = datastore.DataStore.sort_protein_objects(
        grouped_protein_objects=grouped_protein_objects, higher_or_lower=hl
    )
    protein_group_objects = datastore.DataStore.sort_protein_group_objects(
        protein_group_objects=protein_group_objects, higher_or_lower=hl
    )

    self.data.grouped_scored_proteins = grouped_protein_objects
    self.data.protein_group_objects = protein_group_objects
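
The sorting step can be illustrated with a short sketch. The bodies of `sort_protein_objects` and `sort_protein_group_objects` are not shown in this section, so the code below only assumes what the log message suggests: groups are ordered by the score of their lead (first) protein, with the direction given by the "higher"/"lower" string. The function name and the dictionary-based proteins are hypothetical.

```python
def sort_by_lead_score(grouped_proteins, higher_or_lower):
    """Sort a list of protein lists by the score of each list's first protein.

    Assumption: "higher" means larger scores rank first and "lower" means
    smaller scores rank first.
    """
    if higher_or_lower not in ("higher", "lower"):
        raise ValueError("higher_or_lower must be 'higher' or 'lower'")
    return sorted(
        grouped_proteins,
        key=lambda group: group[0]["score"],
        reverse=(higher_or_lower == "higher"),
    )


# Illustrative data only: each inner list is one group, lead protein first
groups = [
    [{"identifier": "P1", "score": 3.0}],
    [{"identifier": "P2", "score": 10.0}, {"identifier": "P3", "score": 1.0}],
    [{"identifier": "P4", "score": 7.5}],
]

best_first = sort_by_lead_score(groups, "higher")
lead_ids = [group[0]["identifier"] for group in best_first]
```

Only the lead protein decides a group's position, so the low score of `P3` does not affect where its group lands.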

parameters

ProteinInferenceParameter

Class that handles data retrieval, storage, and validation of Protein Inference Parameters.

Attributes:

Name Type Description
yaml_param_filepath str

path to properly formatted parameter file specific to Protein Inference.

digest_type str

String that determines the type of in silico digestion for the Digest object. Typically "trypsin".

export str

String to indicate the export type for Export object. Typically this is "psms", "peptides", or "psm_ids".

fdr float

Float to indicate FDR filtering.

missed_cleavages int

Integer to determine the number of missed cleavages allowed in the database digestion performed by the Digest object.

picker bool

True/False on whether or not to run the protein picker algorithm.

restrict_pep float/None

Float to restrict the posterior error probability values by in the PSM input. Used in restrict_psm_data.

restrict_peptide_length int/None

Integer to restrict the peptide length values by in the PSM input. Used in restrict_psm_data.

restrict_q float/None

Float to restrict the q values by in the PSM input. Used in restrict_psm_data.

restrict_custom float/None

Float to restrict the custom values by in the PSM input. Used in restrict_psm_data. Filtering depends on the psm_score_type variable. If psm_score_type is multiplicative then values that are less than restrict_custom are kept. If psm_score_type is additive then values that are more than restrict_custom are kept.

protein_score str

String to determine the way in which Proteins are scored. Can be any of the SCORE_METHODS in the Score object.

psm_score_type str

String to determine the type of score that the PSM scores are (Additive or Multiplicative). Can be any of the SCORE_TYPES in the Score object.

decoy_symbol str

String to denote decoy proteins from target proteins, e.g. "##".

isoform_symbol str

String to denote isoforms from regular proteins, e.g. "-". Can also be None.

reviewed_identifier_symbol str

String to denote a "Reviewed" Protein. Typically this is: "sp|" if using Uniprot Fasta database.

inference_type str

String to determine the inference procedure. Can be any value of INFERENCE_TYPES of Inference object.

tag str

String to be added to output files.

psm_score str

String that indicates the PSM input score. The value should match the string in the input data of the score you want to use for PSM score. This score will be used in scoring methods here: Score object.

grouping_type str/None

String to determine the grouping procedure. Can be any value of GROUPING_TYPES of Inference object.

max_identifiers_peptide_centric int

Maximum number of identifiers to assign to a group when running peptide_centric inference. Typically this is 10 or 5.

lp_solver str/None

The LP solver to use if inference_type="Parsimony". Can be any value in LP_SOLVERS in the Inference object.
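
The nested parameter layout can be sketched as a plain dictionary. The section and key names and the default values below are copied from the class constants in the source listing for this class; `EXAMPLE_PARAMS` and `get_param` are hypothetical names used only for illustration, and the lookup helper is a simplified stand-in for the per-parameter `try`/`except KeyError` blocks in `convert_to_object`.

```python
import logging

logger = logging.getLogger(__name__)

# Mirrors the structure a YAML parameter file is expected to have once loaded
EXAMPLE_PARAMS = {
    "parameters": {
        "general": {"export": "peptides", "fdr": 0.01, "picker": True, "tag": "py_protein_inference"},
        "data_restriction": {
            "pep_restriction": 0.9,
            "peptide_length_restriction": 7,
            "q_value_restriction": 0.005,
            "custom_restriction": "None",
        },
        "score": {
            "protein_score": "multiplicative_log",
            "psm_score": "posterior_error_prob",
            "psm_score_type": "multiplicative",
        },
        "identifiers": {"decoy_symbol": "##", "isoform_symbol": "-", "reviewed_identifier_symbol": "sp|"},
        "inference": {"inference_type": "peptide_centric", "grouping_type": "shared_peptides"},
        "digest": {"digest_type": "trypsin", "missed_cleavages": 3},
        "parsimony": {"lp_solver": "pulp", "shared_peptides": "all"},
        "peptide_centric": {"max_identifiers": 5},
    }
}


def get_param(params, section, key, default):
    """Return params["parameters"][section][key], or the default if any key is missing."""
    try:
        return params["parameters"][section][key]
    except KeyError:
        logger.warning("%s set to default of %s", key, default)
        return default


# A partial file: the missing "digest" section falls back to the default
partial = {"parameters": {"general": {"fdr": 0.05}}}
fdr = get_param(partial, "general", "fdr", 0.01)
missed_cleavages = get_param(partial, "digest", "missed_cleavages", 3)
```

Because a missing section and a missing key both raise `KeyError`, older or partial parameter files still load, with a warning for each value that falls back to its default.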

Source code in pyproteininference/parameters.py
class ProteinInferenceParameter(object):
    """
    Class that handles data retrieval, storage, and validation of Protein Inference Parameters.

    Attributes:
        yaml_param_filepath (str): path to properly formatted parameter file specific to Protein Inference.
        digest_type (str): String that determines the type of in silico digestion for the
            [Digest object][pyproteininference.in_silico_digest.Digest]. Typically "trypsin".
        export (str): String to indicate the export type for [Export object][pyproteininference.export.Export].
            Typically this is "psms", "peptides", or "psm_ids".
        fdr (float): Float to indicate FDR filtering.
        missed_cleavages (int): Integer to determine the number of missed cleavages allowed in the database
            digestion performed by the [Digest object][pyproteininference.in_silico_digest.Digest].
        picker (bool): True/False on whether or not to run
            the [protein picker][pyproteininference.datastore.DataStore.protein_picker] algorithm.
        restrict_pep (float/None): Float to restrict the posterior error probability values by in the PSM input.
            Used in [restrict_psm_data][pyproteininference.datastore.DataStore.restrict_psm_data].
        restrict_peptide_length (int/None): Integer to restrict the peptide length values by in the PSM input.
            Used in [restrict_psm_data][pyproteininference.datastore.DataStore.restrict_psm_data].
        restrict_q (float/None): Float to restrict the q values by in the PSM input.
            Used in [restrict_psm_data][pyproteininference.datastore.DataStore.restrict_psm_data].
        restrict_custom (float/None): Float to restrict the custom values by in the PSM input.
            Used in [restrict_psm_data][pyproteininference.datastore.DataStore.restrict_psm_data].
            Filtering depends on the psm_score_type variable. If psm_score_type is multiplicative then values that are
            less than restrict_custom are kept. If psm_score_type is additive then values that are more than
            restrict_custom are kept.
        protein_score (str): String to determine the way in which Proteins are scored. Can be any of the
            SCORE_METHODS in [Score object][pyproteininference.scoring.Score].
        psm_score_type (str): String to determine the type of score that the PSM scores are
            (Additive or Multiplicative). Can be any of the SCORE_TYPES
            in [Score object][pyproteininference.scoring.Score].
        decoy_symbol (str): String to denote decoy proteins from target proteins, e.g. "##".
        isoform_symbol (str): String to denote isoforms from regular proteins, e.g. "-". Can also be None.
        reviewed_identifier_symbol (str): String to denote a "Reviewed" Protein. Typically this is: "sp|"
            if using Uniprot Fasta database.
        inference_type (str): String to determine the inference procedure. Can be any value of INFERENCE_TYPES
            of [Inference object][pyproteininference.inference.Inference].
        tag (str): String to be added to output files.
        psm_score (str): String that indicates the PSM input score. The value should match the string in the
            input data of the score you want to use for PSM score. This score will be used in scoring methods
            here: [Score object][pyproteininference.scoring.Score].
        grouping_type (str/None): String to determine the grouping procedure. Can be any value of
            GROUPING_TYPES of [Inference object][pyproteininference.inference.Inference].
        max_identifiers_peptide_centric (int): Maximum number of identifiers to assign to a group when
            running peptide_centric inference. Typically this is 10 or 5.
        lp_solver (str/None): The LP solver to use if inference_type="Parsimony".
            Can be any value in LP_SOLVERS in the [Inference object][pyproteininference.inference.Inference].

    """

    PARENT_PARAMETER_KEY = "parameters"

    GENERAL_PARAMETER_KEY = "general"
    DATA_RESTRICTION_PARAMETER_KEY = "data_restriction"
    SCORE_PARAMETER_KEY = "score"
    IDENTIFIERS_PARAMETER_KEY = "identifiers"
    INFERENCE_PARAMETER_KEY = "inference"
    DIGEST_PARAMETER_KEY = "digest"
    PARSIMONY_PARAMETER_KEY = "parsimony"
    PEPTIDE_CENTRIC_PARAMETER_KEY = "peptide_centric"

    PARAMETER_MAIN_KEYS = {
        GENERAL_PARAMETER_KEY,
        DATA_RESTRICTION_PARAMETER_KEY,
        SCORE_PARAMETER_KEY,
        IDENTIFIERS_PARAMETER_KEY,
        INFERENCE_PARAMETER_KEY,
        DIGEST_PARAMETER_KEY,
        PARSIMONY_PARAMETER_KEY,
        PEPTIDE_CENTRIC_PARAMETER_KEY,
    }

    EXPORT_PARAMETER = "export"
    FDR_PARAMETER = "fdr"
    PICKER_PARAMETER = "picker"
    TAG_PARAMETER = "tag"

    GENERAL_PARAMETER_SUB_KEYS = {
        EXPORT_PARAMETER,
        FDR_PARAMETER,
        PICKER_PARAMETER,
        TAG_PARAMETER,
    }

    PEP_RESTRICT_PARAMETER = "pep_restriction"
    PEPTIDE_LENGTH_RESTRICT_PARAMETER = "peptide_length_restriction"
    Q_VALUE_RESTRICT_PARAMETER = "q_value_restriction"
    CUSTOM_RESTRICT_PARAMETER = "custom_restriction"

    DATA_RESTRICTION_PARAMETER_SUB_KEYS = {
        PEP_RESTRICT_PARAMETER,
        PEPTIDE_LENGTH_RESTRICT_PARAMETER,
        Q_VALUE_RESTRICT_PARAMETER,
        CUSTOM_RESTRICT_PARAMETER,
    }

    PROTEIN_SCORE_PARAMETER = "protein_score"
    PSM_SCORE_PARAMETER = "psm_score"
    PSM_SCORE_TYPE_PARAMETER = "psm_score_type"

    SCORE_PARAMETER_SUB_KEYS = {
        PROTEIN_SCORE_PARAMETER,
        PSM_SCORE_PARAMETER,
        PSM_SCORE_TYPE_PARAMETER,
    }

    DECOY_SYMBOL_PARAMETER = "decoy_symbol"
    ISOFORM_SYMBOL_PARAMETER = "isoform_symbol"
    REVIEWED_IDENTIFIER_PARAMETER = "reviewed_identifier_symbol"

    IDENTIFIER_SUB_KEYS = {
        DECOY_SYMBOL_PARAMETER,
        ISOFORM_SYMBOL_PARAMETER,
        REVIEWED_IDENTIFIER_PARAMETER,
    }

    INFERENCE_TYPE_PARAMETER = "inference_type"
    GROUPING_TYPE_PARAMETER = "grouping_type"

    INFERENCE_SUB_KEYS = {INFERENCE_TYPE_PARAMETER, GROUPING_TYPE_PARAMETER}

    DIGEST_TYPE_PARAMETER = "digest_type"
    MISSED_CLEAV_PARAMETER = "missed_cleavages"

    DIGEST_SUB_KEYS = {DIGEST_TYPE_PARAMETER, MISSED_CLEAV_PARAMETER}

    LP_SOLVER_PARAMETER = "lp_solver"
    SHARED_PEPTIDES_PARAMETER = "shared_peptides"

    PARSIMONY_SUB_KEYS = {
        LP_SOLVER_PARAMETER,
        SHARED_PEPTIDES_PARAMETER,
    }

    MAX_IDENTIFIERS_PARAMETER = "max_identifiers"

    PEPTIDE_CENTRIC_SUB_KEYS = {MAX_IDENTIFIERS_PARAMETER}

    DEFAULT_DIGEST_TYPE = "trypsin"
    DEFAULT_EXPORT = "peptides"
    DEFAULT_FDR = 0.01
    DEFAULT_MISSED_CLEAVAGES = 3
    DEFAULT_PICKER = True
    DEFAULT_RESTRICT_PEP = 0.9
    DEFAULT_RESTRICT_PEPTIDE_LENGTH = 7
    DEFAULT_RESTRICT_Q = 0.005
    DEFAULT_RESTRICT_CUSTOM = "None"
    DEFAULT_PROTEIN_SCORE = "multiplicative_log"
    DEFAULT_PSM_SCORE = "posterior_error_prob"
    DEFAULT_DECOY_SYMBOL = "##"
    DEFAULT_ISOFORM_SYMBOL = "-"
    DEFAULT_REVIEWED_IDENTIFIER_SYMBOL = "sp|"
    DEFAULT_INFERENCE_TYPE = "peptide_centric"
    DEFAULT_TAG = "py_protein_inference"
    DEFAULT_PSM_SCORE_TYPE = "multiplicative"
    DEFAULT_GROUPING_TYPE = "shared_peptides"
    DEFAULT_MAX_IDENTIFIERS_PEPTIDE_CENTRIC = 5
    DEFAULT_LP_SOLVER = "pulp"
    DEFAULT_SHARED_PEPTIDES = "all"

    def __init__(self, yaml_param_filepath, validate=True):
        """Class to store Protein Inference parameter information as an object.

        Args:
            yaml_param_filepath (str): path to properly formatted parameter file specific to Protein Inference.
            validate (bool): True/False on whether to validate the parameter file of interest.

        Returns:
            None:

        Example:
            >>> pyproteininference.parameters.ProteinInferenceParameter(
            ...     yaml_param_filepath = "/path/to/pyproteininference_params.yaml", validate=True
            ... )


        """
        self.yaml_param_filepath = yaml_param_filepath
        self.digest_type = self.DEFAULT_DIGEST_TYPE
        self.export = self.DEFAULT_EXPORT
        self.fdr = self.DEFAULT_FDR
        self.missed_cleavages = self.DEFAULT_MISSED_CLEAVAGES
        self.picker = self.DEFAULT_PICKER
        self.restrict_pep = self.DEFAULT_RESTRICT_PEP
        self.restrict_peptide_length = self.DEFAULT_RESTRICT_PEPTIDE_LENGTH
        self.restrict_q = self.DEFAULT_RESTRICT_Q
        self.restrict_custom = self.DEFAULT_RESTRICT_CUSTOM
        self.protein_score = self.DEFAULT_PROTEIN_SCORE
        self.psm_score_type = self.DEFAULT_PSM_SCORE_TYPE
        self.decoy_symbol = self.DEFAULT_DECOY_SYMBOL
        self.isoform_symbol = self.DEFAULT_ISOFORM_SYMBOL
        self.reviewed_identifier_symbol = self.DEFAULT_REVIEWED_IDENTIFIER_SYMBOL
        self.inference_type = self.DEFAULT_INFERENCE_TYPE
        self.tag = self.DEFAULT_TAG
        self.psm_score = self.DEFAULT_PSM_SCORE
        self.grouping_type = self.DEFAULT_GROUPING_TYPE
        self.max_identifiers_peptide_centric = self.DEFAULT_MAX_IDENTIFIERS_PEPTIDE_CENTRIC
        self.lp_solver = self.DEFAULT_LP_SOLVER
        self.shared_peptides = self.DEFAULT_SHARED_PEPTIDES
        self.validate = validate

        self.convert_to_object()

        if validate:
            self.validate_parameters()

        self._fix_none_parameters()

    def convert_to_object(self):
        """
        Function that takes a Protein Inference parameter file and converts it into a ProteinInferenceParameter object
        by assigning all Attributes of the ProteinInferenceParameter object.

        If no parameter filepath is supplied the parameter object will be loaded with default params.

        This function is run in the initialization of the ProteinInferenceParameter object.

        Returns:
            None:

        """
        if self.yaml_param_filepath:
            with open(self.yaml_param_filepath, "r") as stream:
                yaml_params = yaml.load(stream, Loader=yaml.Loader)

            try:
                self.digest_type = yaml_params[self.PARENT_PARAMETER_KEY][self.DIGEST_PARAMETER_KEY][
                    self.DIGEST_TYPE_PARAMETER
                ]
            except KeyError:
                logger.warning("digest_type set to default of {}".format(self.DEFAULT_DIGEST_TYPE))

            try:
                self.export = yaml_params[self.PARENT_PARAMETER_KEY][self.GENERAL_PARAMETER_KEY][self.EXPORT_PARAMETER]
            except KeyError:
                logger.warning("export set to default of {}".format(self.DEFAULT_EXPORT))

            try:
                self.fdr = yaml_params[self.PARENT_PARAMETER_KEY][self.GENERAL_PARAMETER_KEY][self.FDR_PARAMETER]
            except KeyError:
                logger.warning("fdr set to default of {}".format(self.DEFAULT_FDR))
            try:
                self.missed_cleavages = yaml_params[self.PARENT_PARAMETER_KEY][self.DIGEST_PARAMETER_KEY][
                    self.MISSED_CLEAV_PARAMETER
                ]
            except KeyError:
                logger.warning("missed_cleavages set to default of {}".format(self.DEFAULT_MISSED_CLEAVAGES))

            try:
                self.picker = yaml_params[self.PARENT_PARAMETER_KEY][self.GENERAL_PARAMETER_KEY][self.PICKER_PARAMETER]
            except KeyError:
                logger.warning("picker set to default of {}".format(self.DEFAULT_PICKER))

            try:
                self.restrict_pep = yaml_params[self.PARENT_PARAMETER_KEY][self.DATA_RESTRICTION_PARAMETER_KEY][
                    self.PEP_RESTRICT_PARAMETER
                ]
            except KeyError:
                logger.warning("restrict_pep set to default of {}".format(self.DEFAULT_RESTRICT_PEP))

            try:
                self.restrict_peptide_length = yaml_params[self.PARENT_PARAMETER_KEY][
                    self.DATA_RESTRICTION_PARAMETER_KEY
                ][self.PEPTIDE_LENGTH_RESTRICT_PARAMETER]
            except KeyError:
                logger.warning(
                    "restrict_peptide_length set to default of {}".format(self.DEFAULT_RESTRICT_PEPTIDE_LENGTH)
                )

            try:
                self.restrict_q = yaml_params[self.PARENT_PARAMETER_KEY][self.DATA_RESTRICTION_PARAMETER_KEY][
                    self.Q_VALUE_RESTRICT_PARAMETER
                ]
            except KeyError:
                logger.warning("restrict_q set to default of {}".format(self.DEFAULT_RESTRICT_Q))

            try:
                self.restrict_custom = yaml_params[self.PARENT_PARAMETER_KEY][self.DATA_RESTRICTION_PARAMETER_KEY][
                    self.CUSTOM_RESTRICT_PARAMETER
                ]
            except KeyError:
                logger.warning("restrict_custom set to default of {}".format(self.DEFAULT_RESTRICT_CUSTOM))

            try:
                self.protein_score = yaml_params[self.PARENT_PARAMETER_KEY][self.SCORE_PARAMETER_KEY][
                    self.PROTEIN_SCORE_PARAMETER
                ]
            except KeyError:
                logger.warning("protein_score set to default of {}".format(self.DEFAULT_PROTEIN_SCORE))

            try:
                self.psm_score_type = yaml_params[self.PARENT_PARAMETER_KEY][self.SCORE_PARAMETER_KEY][
                    self.PSM_SCORE_TYPE_PARAMETER
                ]
            except KeyError:
                logger.warning("psm_score_type set to default of {}".format(self.DEFAULT_PSM_SCORE_TYPE))

            try:
                self.decoy_symbol = yaml_params[self.PARENT_PARAMETER_KEY][self.IDENTIFIERS_PARAMETER_KEY][
                    self.DECOY_SYMBOL_PARAMETER
                ]
            except KeyError:
                logger.warning("decoy_symbol set to default of {}".format(self.DEFAULT_DECOY_SYMBOL))

            try:
                self.isoform_symbol = yaml_params[self.PARENT_PARAMETER_KEY][self.IDENTIFIERS_PARAMETER_KEY][
                    self.ISOFORM_SYMBOL_PARAMETER
                ]
            except KeyError:
                logger.warning("isoform_symbol set to default of {}".format(self.DEFAULT_ISOFORM_SYMBOL))

            try:
                self.reviewed_identifier_symbol = yaml_params[self.PARENT_PARAMETER_KEY][
                    self.IDENTIFIERS_PARAMETER_KEY
                ][self.REVIEWED_IDENTIFIER_PARAMETER]
            except KeyError:
                logger.warning(
                    "reviewed_identifier_symbol set to default of {}".format(self.DEFAULT_REVIEWED_IDENTIFIER_SYMBOL)
                )

            try:
                self.inference_type = yaml_params[self.PARENT_PARAMETER_KEY][self.INFERENCE_PARAMETER_KEY][
                    self.INFERENCE_TYPE_PARAMETER
                ]
            except KeyError:
                logger.warning("inference_type set to default of {}".format(self.DEFAULT_INFERENCE_TYPE))

            try:
                self.tag = yaml_params[self.PARENT_PARAMETER_KEY][self.GENERAL_PARAMETER_KEY][self.TAG_PARAMETER]
            except KeyError:
                logger.warning("tag set to default of {}".format(self.DEFAULT_TAG))

            try:
                self.psm_score = yaml_params[self.PARENT_PARAMETER_KEY][self.SCORE_PARAMETER_KEY][
                    self.PSM_SCORE_PARAMETER
                ]
            except KeyError:
                logger.warning("psm_score set to default of {}".format(self.DEFAULT_PSM_SCORE))

            try:
                self.grouping_type = yaml_params[self.PARENT_PARAMETER_KEY][self.INFERENCE_PARAMETER_KEY][
                    self.GROUPING_TYPE_PARAMETER
                ]
            except KeyError:
                logger.warning("grouping_type set to default of {}".format(self.DEFAULT_GROUPING_TYPE))

            try:
                self.max_identifiers_peptide_centric = yaml_params[self.PARENT_PARAMETER_KEY][
                    self.PEPTIDE_CENTRIC_PARAMETER_KEY
                ][self.MAX_IDENTIFIERS_PARAMETER]
            except KeyError:
                logger.warning(
                    "max_identifiers_peptide_centric set to default of {}".format(
                        self.DEFAULT_MAX_IDENTIFIERS_PEPTIDE_CENTRIC
                    )
                )

            try:
                self.lp_solver = yaml_params[self.PARENT_PARAMETER_KEY][self.PARSIMONY_PARAMETER_KEY][
                    self.LP_SOLVER_PARAMETER
                ]
            except KeyError:
                logger.warning("lp_solver set to default of {}".format(self.DEFAULT_LP_SOLVER))
            try:
                # Do try except here to make old param files backwards compatible
                self.shared_peptides = yaml_params[self.PARENT_PARAMETER_KEY][self.PARSIMONY_PARAMETER_KEY][
                    self.SHARED_PEPTIDES_PARAMETER
                ]
            except KeyError:
                logger.warning("shared_peptides set to default of {}".format(self.DEFAULT_SHARED_PEPTIDES))

        else:
            logger.warning("No YAML parameter file supplied, all parameters set to default")

    def validate_parameters(self):
        """
        Method to validate all parameters.

        Returns:
            None:

        """
        # Run all of the parameter validations
        self._validate_digest_type()
        self._validate_export_type()
        self._validate_floats()
        self._validate_bools()
        self._validate_score_type()
        self._validate_score_method()
        self._validate_score_combination()
        self._validate_inference_type()
        self._validate_grouping_type()
        self._validate_max_id()
        self._validate_lp_solver()
        self._validate_identifiers()
        self._validate_parsimony_shared_peptides()

    def _validate_digest_type(self):
        """
        Internal ProteinInferenceParameter method to validate the digest type.
        """
        # Make sure we have a valid digest type
        if self.digest_type in PyteomicsDigest.LIST_OF_DIGEST_TYPES:
            logger.info("Using digest type '{}'".format(self.digest_type))
        else:
            raise ValueError(
                "Digest Type '{}' not supported, please use one of the following enzyme digestions: '{}'".format(
                    self.digest_type, ", ".join(PyteomicsDigest.LIST_OF_DIGEST_TYPES)
                )
            )

    def _validate_export_type(self):
        """
        Internal ProteinInferenceParameter method to validate the export type.
        """
        # Make sure we have a valid export type
        if self.export in Export.EXPORT_TYPES:
            logger.info("Using Export type '{}'".format(self.export))
        else:
            raise ValueError(
                "Export Type '{}' not supported, please use one of the following export types: '{}'".format(
                    self.export, ", ".join(Export.EXPORT_TYPES)
                )
            )

    def _validate_floats(self):
        """
        Internal ProteinInferenceParameter method to validate floats.
        """
        # Validate that FDR, cleavages, and restrict values are all floats and/or ints where required

        try:
            if 0 <= float(self.fdr) <= 1:
                logger.info("FDR Input {}".format(self.fdr))

        except ValueError:
            raise ValueError("FDR must be a decimal between 0 and 1, FDR provided: {}".format(self.fdr))

        try:
            if 0 <= float(self.restrict_pep) <= 1:
                logger.info("PEP restriction {}".format(self.restrict_pep))

        except ValueError:
            if not self.restrict_pep or self.restrict_pep.lower() == "none":
                self.restrict_pep = None
                logger.info("Not restricting by PEP Value")
            else:
                raise ValueError(
                    "PEP restriction must be a decimal between 0 and 1, PEP restriction provided: {}".format(
                        self.restrict_pep
                    )
                )

        try:
            if 0 <= float(self.restrict_q) <= 1:
                logger.info("Q Value restriction {}".format(self.restrict_q))

        except ValueError:
            if not self.restrict_q or self.restrict_q.lower() == "none":
                self.restrict_q = None
                logger.info("Not restricting by Q Value")
            else:
                raise ValueError(
                    "Q Value restriction must be a decimal between 0 and 1, Q Value restriction provided: {}".format(
                        self.restrict_q
                    )
                )

        try:
            int(self.missed_cleavages)
            logger.info("Missed Cleavages selected: {}".format(self.missed_cleavages))
        except ValueError:
            raise ValueError(
                "Missed Cleavages must be an integer, Provided Missed Cleavages value: {}".format(self.missed_cleavages)
            )

        try:
            int(self.restrict_peptide_length)
            logger.info("Peptide Length Restriction: Len {}".format(self.restrict_peptide_length))
        except ValueError:
            if not self.restrict_peptide_length or self.restrict_peptide_length.lower() == "none":
                self.restrict_peptide_length = None
                logger.info("Not Restricting by Peptide Length")
            else:
                raise ValueError(
                    "Peptide Length Restriction must be an integer, "
                    "Provided Peptide Length Restriction value: {}".format(self.restrict_peptide_length)
                )

        try:
            float(self.restrict_custom)
            logger.info("Custom restriction {}".format(self.restrict_custom))
        except (ValueError, TypeError):
            if not self.restrict_custom or self.restrict_custom.lower() == "none":
                self.restrict_custom = None
                logger.info("Not Restricting by Custom Value")
            else:
                raise ValueError(
                    "Custom restriction must be a number, Custom restriction provided: {}".format(self.restrict_custom)
                )

    def _validate_bools(self):
        """
        Internal ProteinInferenceParameter method to validate the bools.
        """
        # Make sure picker is a bool
        if isinstance(self.picker, bool):
            if self.picker:
                logger.info("Parameters loaded to run Picker")
            else:
                logger.info("Parameters loaded to NOT run Picker")
        else:
            raise ValueError(
                "Picker Variable must be set to True or False, Picker Variable provided: {}".format(self.picker)
            )

    def _validate_score_method(self):
        """
        Internal ProteinInferenceParameter method to validate the score method.
        """
        # Make sure we have the score method defined in code to use...
        if self.protein_score in Score.SCORE_METHODS:
            logger.info("Using Score Method '{}'".format(self.protein_score))
        else:
            raise ValueError(
                "Score Method '{}' not supported, "
                "please use one of the following Score Methods: '{}'".format(
                    self.protein_score, ", ".join(Score.SCORE_METHODS)
                )
            )

    def _validate_score_type(self):
        """
        Internal ProteinInferenceParameter method to validate the score type.
        """
        # Make sure score type is multiplicative or additive
        if self.psm_score_type in Score.SCORE_TYPES:
            logger.info("Using Score Type '{}'".format(self.psm_score_type))
        else:
            raise ValueError(
                "Score Type '{}' not supported, "
                "please use one of the following Score Types: '{}'".format(
                    self.psm_score_type, ", ".join(Score.SCORE_TYPES)
                )
            )

    def _validate_score_combination(self):
        """
        Internal ProteinInferenceParameter method to validate combination of score method and score type.
        """
        # Check to see if combination of score (column), method(multiplicative log, additive),
        # and score type (multiplicative/additive) is possible...
        # This will be super custom

        if self.psm_score_type == Score.ADDITIVE_SCORE_TYPE and self.protein_score != Score.ADDITIVE:
            raise ValueError(
                "If Score type is 'additive' (Higher PSM score is better) then you must use the 'additive' score method"
            )

        elif self.psm_score_type == Score.MULTIPLICATIVE_SCORE_TYPE and self.protein_score == Score.ADDITIVE:
            raise ValueError(
                "If Score type is 'multiplicative' (Lower PSM score is better) "
                "then you must NOT use the 'additive' score method please "
                "select one of the following score methods: {}".format(
                    ", ".join([x for x in Score.SCORE_METHODS if x != "additive"])
                )
            )

        else:
            logger.info(
                "Combination of Score Type: '{}' and Score Method: '{}' is Ok".format(
                    self.psm_score_type, self.protein_score
                )
            )

    def _validate_inference_type(self):
        """
        Internal ProteinInferenceParameter method to validate the inference type.
        """
        # Check if its parsimony, exclusion, inclusion, none
        if self.inference_type in Inference.INFERENCE_TYPES:
            logger.info("Using inference type '{}'".format(self.inference_type))
        else:
            raise ValueError(
                "Inference Type '{}' not supported, please use one of the following Inference Types: '{}'".format(
                    self.inference_type, ", ".join(Inference.INFERENCE_TYPES)
                )
            )

    def _validate_grouping_type(self):
        """
        Internal ProteinInferenceParameter method to validate the grouping type.
        """
        # Check that the grouping type is supported or none
        if self.grouping_type in Inference.GROUPING_TYPES:
            logger.info("Using Grouping type '{}'".format(self.grouping_type))
        else:
            # Test for an empty value first so that None never reaches .lower()
            if not self.grouping_type or self.grouping_type.lower() == "none":
                self.grouping_type = None
                logger.info("Using Grouping type: None")
            else:

                raise ValueError(
                    "Grouping Type '{}' not supported, please use one of the following Grouping Types: '{}'".format(
                        self.grouping_type, Inference.GROUPING_TYPES
                    )
                )

    def _validate_max_id(self):
        """
        Internal ProteinInferenceParameter method to validate the max peptide centric id.
        """
        # Check if max_identifiers_peptide_centric param is an INT
        if type(self.max_identifiers_peptide_centric) == int:
            logger.info(
                "Max Number of Identifiers for Peptide Centric Inference: '{}'".format(
                    self.max_identifiers_peptide_centric
                )
            )
        else:
            raise ValueError(
                "Max Number of Identifiers for Peptide Centric Inference must be an integer, "
                "provided value: {}".format(self.max_identifiers_peptide_centric)
            )

    def _validate_lp_solver(self):
        """
        Internal ProteinInferenceParameter method to validate the lp solver.
        """
        # Check if it is a supported LP solver or None
        if self.lp_solver in Inference.LP_SOLVERS:
            logger.info("Using LP Solver '{}'".format(self.lp_solver))
        else:
            # Test for a falsy value first so that None never reaches .lower()
            if not self.lp_solver or self.lp_solver.lower() == "none":
                self.lp_solver = None
                logger.info("Setting LP Solver to None")
            else:
                raise ValueError(
                    "LP Solver '{}' not supported, please use one of the following LP Solvers: '{}'".format(
                        self.lp_solver, ", ".join(Inference.LP_SOLVERS)
                    )
                )

    def _validate_parsimony_shared_peptides(self):
        """
        Internal ProteinInferenceParameter method to validate the shared peptides parameter.
        """
        # Check if it is all, best, or none
        if self.shared_peptides in Inference.SHARED_PEPTIDE_TYPES:
            logger.info("Using Shared Peptide types '{}'".format(self.shared_peptides))
        else:
            # Test for a falsy value first so that None never reaches .lower()
            if not self.shared_peptides or self.shared_peptides.lower() == "none":
                self.shared_peptides = None
                logger.info("Setting Shared Peptide type to None")
            else:
                raise ValueError(
                    "Shared Peptide types '{}' not supported, please use one of the following "
                    "Shared Peptide types: '{}'".format(self.shared_peptides, Inference.SHARED_PEPTIDE_TYPES)
                )

    def _validate_identifiers(self):
        """
        Internal ProteinInferenceParameter method to validate the decoy symbol, isoform symbol,
        and reviewed identifier symbol.

        """
        if type(self.decoy_symbol) == str:
            logger.info("Decoy Symbol set to: '{}'".format(self.decoy_symbol))
        else:
            raise ValueError("Decoy Symbol must be a string, provided value: {}".format(self.decoy_symbol))

        if type(self.isoform_symbol) == str:
            logger.info("Isoform Symbol set to: '{}'".format(self.isoform_symbol))
            if self.isoform_symbol.lower() == "none" or not self.isoform_symbol:
                self.isoform_symbol = None
                logger.info("Isoform Symbol set to None")
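        else:
            # A falsy non-string value (e.g. None) means no symbol is set; any other non-string is an error
            if not self.isoform_symbol:
                self.isoform_symbol = None
                logger.info("Isoform Symbol set to None")
            else:
                raise ValueError("Isoform Symbol must be a string, provided value: {}".format(self.isoform_symbol))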

        if type(self.reviewed_identifier_symbol) == str:
            logger.info("Reviewed Identifier Symbol set to: '{}'".format(self.reviewed_identifier_symbol))
            if self.reviewed_identifier_symbol.lower() == "none" or not self.reviewed_identifier_symbol:
                self.reviewed_identifier_symbol = None
                logger.info("Reviewed Identifier Symbol set to None")
        else:
            # A falsy non-string value (e.g. None) means no symbol is set; any other non-string is an error
            if not self.reviewed_identifier_symbol:
                self.reviewed_identifier_symbol = None
                logger.info("Reviewed Identifier Symbol set to None")
            else:
                raise ValueError(
                    "Reviewed Identifier Symbol must be a string, provided value: {}".format(
                        self.reviewed_identifier_symbol
                    )
                )

    def _validate_parameter_shape(self, yaml_params):
        """
        Internal ProteinInferenceParameter method to validate shape of the parameter file by checking to make sure
        that all necessary main parameter fields are defined.
        """
        if self.PARENT_PARAMETER_KEY in yaml_params.keys():
            logger.info("Main Parameter Key is Present")
        else:
            raise ValueError(
                "Key {} needs to be defined as the outermost parameter group".format(self.PARENT_PARAMETER_KEY)
            )

        if self.PARAMETER_MAIN_KEYS.issubset(yaml_params[self.PARENT_PARAMETER_KEY]):
            logger.info("All Sub Parameter Keys Present")
        else:
            raise ValueError(
                "All of the following values need to be Sub Parameters in the Yaml Parameter file: {}".format(
                    ", ".join(self.PARAMETER_MAIN_KEYS),
                )
            )

        try:
            general_params = yaml_params[self.PARENT_PARAMETER_KEY][self.GENERAL_PARAMETER_KEY]
            for gkey in self.GENERAL_PARAMETER_SUB_KEYS:
                if gkey in general_params.keys():
                    pass
                else:
                    raise ValueError(
                        "General Sub Parameter '{}' is not found in the parameter file. "
                        "Please add it as a sub parameter of the general parameter field".format(gkey)
                    )

        except KeyError:
            raise ValueError("'general' sub Parameter not defined in the parameter file")

        try:
            data_res_params = yaml_params[self.PARENT_PARAMETER_KEY][self.DATA_RESTRICTION_PARAMETER_KEY]
            for drkey in self.DATA_RESTRICTION_PARAMETER_SUB_KEYS:
                if drkey in data_res_params.keys():
                    pass
                else:
                    raise ValueError(
                        "Data Restriction Sub Parameter '{}' is not found in the parameter file. "
                        "Please add it as a sub parameter of the data_restriction parameter field".format(drkey)
                    )

        except KeyError:
            raise ValueError("'data_restriction' sub Parameter not defined in the parameter file")

        try:
            score_params = yaml_params[self.PARENT_PARAMETER_KEY][self.SCORE_PARAMETER_KEY]
            for skey in self.SCORE_PARAMETER_SUB_KEYS:
                if skey in score_params.keys():
                    pass
                else:
                    raise ValueError(
                        "Score Sub Parameter '{}' is not found in the parameter file. "
                        "Please add it as a sub parameter of the score parameter field".format(skey)
                    )

        except KeyError:
            raise ValueError("'score' sub Parameter not defined in the parameter file")

        try:
            id_params = yaml_params[self.PARENT_PARAMETER_KEY][self.IDENTIFIERS_PARAMETER_KEY]
            for ikey in self.IDENTIFIER_SUB_KEYS:
                if ikey in id_params.keys():
                    pass
                else:
                    raise ValueError(
                        "Identifiers Sub Parameter '{}' is not found in the parameter file. "
                        "Please add it as a sub parameter of the identifiers parameter field".format(ikey)
                    )

        except KeyError:
            raise ValueError("'identifiers' sub Parameter not defined in the parameter file")

        try:
            inf_params = yaml_params[self.PARENT_PARAMETER_KEY][self.INFERENCE_PARAMETER_KEY]
            for infkey in self.INFERENCE_SUB_KEYS:
                if infkey in inf_params.keys():
                    pass
                else:
                    raise ValueError(
                        "Inference Sub Parameter '{}' is not found in the parameter file. "
                        "Please add it as a sub parameter of the inference parameter field".format(infkey)
                    )

        except KeyError:
            raise ValueError("'inference' sub Parameter not defined in the parameter file")

        try:
            digest_params = yaml_params[self.PARENT_PARAMETER_KEY][self.DIGEST_PARAMETER_KEY]
            for dkey in self.DIGEST_SUB_KEYS:
                if dkey in digest_params.keys():
                    pass
                else:
                    raise ValueError(
                        "Digest Sub Parameter '{}' is not found in the parameter file. "
                        "Please add it as a sub parameter of the digest parameter field".format(dkey)
                    )

        except KeyError:
            raise ValueError("'digest' sub Parameter not defined in the parameter file")

        try:
            parsimony_params = yaml_params[self.PARENT_PARAMETER_KEY][self.PARSIMONY_PARAMETER_KEY]
            for pkey in self.PARSIMONY_SUB_KEYS:
                if pkey in parsimony_params.keys():
                    pass
                else:
                    raise ValueError(
                        "Parsimony Sub Parameter '{}' is not found in the parameter file. "
                        "Please add it as a sub parameter of the parsimony parameter field".format(pkey)
                    )

        except KeyError:
            raise ValueError("'parsimony' sub Parameter not defined in the parameter file")

        try:
            pep_cen_params = yaml_params[self.PARENT_PARAMETER_KEY][self.PEPTIDE_CENTRIC_PARAMETER_KEY]
            for pckey in self.PEPTIDE_CENTRIC_SUB_KEYS:
                if pckey in pep_cen_params.keys():
                    pass
                else:
                    raise ValueError(
                        "Peptide Centric Sub Parameter '{}' is not found in the parameter file. "
                        "Please add it as a sub parameter of the peptide_centric parameter field".format(pckey)
                    )

        except KeyError:
            raise ValueError("'peptide_centric' sub Parameter not defined in the parameter file")

    def override_q_restrict(self, data):
        """
        ProteinInferenceParameter method to override restrict_q if the input data does not contain q values.

        Args:
            data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].

        """
        data_has_q = data.input_has_q()
        if data_has_q:
            pass
        else:
            if self.restrict_q:
                logger.warning("No Q values found in the input data, overriding parameters to not filter on Q value")
                self.restrict_q = None

    def override_pep_restrict(self, data):
        """
        ProteinInferenceParameter method to override restrict_pep if the input data does not contain pep values.

        Args:
            data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].

        """
        data_has_pep = data.input_has_pep()
        if data_has_pep:
            pass
        else:
            if self.restrict_pep:
                logger.warning(
                    "No Pep values found in the input data, overriding parameters to not filter on Pep value"
                )
                self.restrict_pep = None

    def override_custom_restrict(self, data):
        """
        ProteinInferenceParameter method to override restrict_custom if
        the input data does not contain custom score values.

        Args:
            data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].

        """
        data_has_custom = data.input_has_custom()
        if data_has_custom:
            pass
        else:
            if self.restrict_custom:
                logger.warning(
                    "No Custom values found in the input data, overriding parameters to not filter on Custom value"
                )
                self.restrict_custom = None

    def fix_parameters_from_datastore(self, data):
        """
        ProteinInferenceParameter method to override restriction values in the
        parameter file if those scores do not exist in the input files.

        Args:
            data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].

        """

        self.override_q_restrict(data=data)
        self.override_pep_restrict(data=data)
        self.override_custom_restrict(data=data)

    def _fix_none_parameters(self):
        """
        Internal ProteinInferenceParameter method to fix parameters that have been defined as None.
        These get read in as strings with YAML reader and need to be converted to None type.
        """

        self._fix_grouping_type()
        self._fix_lp_solver()
        self._fix_shared_peptides()

    def _fix_grouping_type(self):
        """
        Internal ProteinInferenceParameter method to override grouping type for None value.
        """
        if self.grouping_type in ["None", "none", None]:
            self.grouping_type = None

    def _fix_lp_solver(self):
        """
        Internal ProteinInferenceParameter method to override lp_solver for None value.
        """
        if self.lp_solver in ["None", "none", None]:
            self.lp_solver = None

    def _fix_shared_peptides(self):
        """
        Internal ProteinInferenceParameter method to override shared_peptides for None value.
        """
        if self.shared_peptides in ["None", "none", None]:
            self.shared_peptides = None
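
The three `_fix_*` helpers above apply the same rule: a value written as the text `None` or `none` in the YAML file arrives as a string and is mapped to Python's `None`. A minimal sketch of that rule as a standalone function is shown here; `normalize_none` is a hypothetical name used only for illustration and is not part of pyproteininference.

```python
def normalize_none(value):
    """Map the strings "None"/"none" (and None itself) to None; pass anything else through."""
    # Hypothetical helper, mirroring the membership test used by the _fix_* methods
    if value in ("None", "none", None):
        return None
    return value


print(normalize_none("none"))  # None
print(normalize_none("pulp"))  # pulp
```

Keeping the rule in one place would mean that `grouping_type`, `lp_solver`, and `shared_peptides` cannot drift apart in how they treat the `none` spelling.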

__init__(self, yaml_param_filepath, validate=True)

Class to store Protein Inference parameter information as an object.

Parameters:
  • yaml_param_filepath (str) – path to properly formatted parameter file specific to Protein Inference.

  • validate (bool) – True/False on whether to validate the parameter file of interest.

Returns:
  • None

Examples:

>>> pyproteininference.parameters.ProteinInferenceParameter(
...     yaml_param_filepath="/path/to/pyproteininference_params.yaml", validate=True
... )
Source code in pyproteininference/parameters.py
def __init__(self, yaml_param_filepath, validate=True):
    """Class to store Protein Inference parameter information as an object.

    Args:
        yaml_param_filepath (str): path to properly formatted parameter file specific to Protein Inference.
        validate (bool): True/False on whether to validate the parameter file of interest.

    Returns:
        None:

    Example:
        >>> pyproteininference.parameters.ProteinInferenceParameter(
        ...     yaml_param_filepath="/path/to/pyproteininference_params.yaml", validate=True
        ... )


    """
    self.yaml_param_filepath = yaml_param_filepath
    self.digest_type = self.DEFAULT_DIGEST_TYPE
    self.export = self.DEFAULT_EXPORT
    self.fdr = self.DEFAULT_FDR
    self.missed_cleavages = self.DEFAULT_MISSED_CLEAVAGES
    self.picker = self.DEFAULT_PICKER
    self.restrict_pep = self.DEFAULT_RESTRICT_PEP
    self.restrict_peptide_length = self.DEFAULT_RESTRICT_PEPTIDE_LENGTH
    self.restrict_q = self.DEFAULT_RESTRICT_Q
    self.restrict_custom = self.DEFAULT_RESTRICT_CUSTOM
    self.protein_score = self.DEFAULT_PROTEIN_SCORE
    self.psm_score_type = self.DEFAULT_PSM_SCORE_TYPE
    self.decoy_symbol = self.DEFAULT_DECOY_SYMBOL
    self.isoform_symbol = self.DEFAULT_ISOFORM_SYMBOL
    self.reviewed_identifier_symbol = self.DEFAULT_REVIEWED_IDENTIFIER_SYMBOL
    self.inference_type = self.DEFAULT_INFERENCE_TYPE
    self.tag = self.DEFAULT_TAG
    self.psm_score = self.DEFAULT_PSM_SCORE
    self.grouping_type = self.DEFAULT_GROUPING_TYPE
    self.max_identifiers_peptide_centric = self.DEFAULT_MAX_IDENTIFIERS_PEPTIDE_CENTRIC
    self.lp_solver = self.DEFAULT_LP_SOLVER
    self.shared_peptides = self.DEFAULT_SHARED_PEPTIDES
    self.validate = validate

    self.convert_to_object()

    if validate:
        self.validate_parameters()

    self._fix_none_parameters()

convert_to_object(self)

Function that takes a Protein Inference parameter file and converts it into a ProteinInferenceParameter object by assigning all Attributes of the ProteinInferenceParameter object.

If no parameter filepath is supplied the parameter object will be loaded with default params.

This function is run during the initialization of the ProteinInferenceParameter object.

Returns:
  • None

Source code in pyproteininference/parameters.py
def convert_to_object(self):
    """
    Function that takes a Protein Inference parameter file and converts it into a ProteinInferenceParameter object
    by assigning all Attributes of the ProteinInferenceParameter object.

    If no parameter filepath is supplied the parameter object will be loaded with default params.

    This function is run during the initialization of the ProteinInferenceParameter object.

    Returns:
        None:

    """
    if self.yaml_param_filepath:
        with open(self.yaml_param_filepath, "r") as stream:
            # safe_load only builds plain Python types, which is all a parameter file needs
            yaml_params = yaml.safe_load(stream)

        try:
            self.digest_type = yaml_params[self.PARENT_PARAMETER_KEY][self.DIGEST_PARAMETER_KEY][
                self.DIGEST_TYPE_PARAMETER
            ]
        except KeyError:
            logger.warning("digest_type set to default of {}".format(self.DEFAULT_DIGEST_TYPE))

        try:
            self.export = yaml_params[self.PARENT_PARAMETER_KEY][self.GENERAL_PARAMETER_KEY][self.EXPORT_PARAMETER]
        except KeyError:
            logger.warning("export set to default of {}".format(self.DEFAULT_EXPORT))

        try:
            self.fdr = yaml_params[self.PARENT_PARAMETER_KEY][self.GENERAL_PARAMETER_KEY][self.FDR_PARAMETER]
        except KeyError:
            logger.warning("fdr set to default of {}".format(self.DEFAULT_FDR))
        try:
            self.missed_cleavages = yaml_params[self.PARENT_PARAMETER_KEY][self.DIGEST_PARAMETER_KEY][
                self.MISSED_CLEAV_PARAMETER
            ]
        except KeyError:
            logger.warning("missed_cleavages set to default of {}".format(self.DEFAULT_MISSED_CLEAVAGES))

        try:
            self.picker = yaml_params[self.PARENT_PARAMETER_KEY][self.GENERAL_PARAMETER_KEY][self.PICKER_PARAMETER]
        except KeyError:
            logger.warning("picker set to default of {}".format(self.DEFAULT_PICKER))

        try:
            self.restrict_pep = yaml_params[self.PARENT_PARAMETER_KEY][self.DATA_RESTRICTION_PARAMETER_KEY][
                self.PEP_RESTRICT_PARAMETER
            ]
        except KeyError:
            logger.warning("restrict_pep set to default of {}".format(self.DEFAULT_RESTRICT_PEP))

        try:
            self.restrict_peptide_length = yaml_params[self.PARENT_PARAMETER_KEY][
                self.DATA_RESTRICTION_PARAMETER_KEY
            ][self.PEPTIDE_LENGTH_RESTRICT_PARAMETER]
        except KeyError:
            logger.warning(
                "restrict_peptide_length set to default of {}".format(self.DEFAULT_RESTRICT_PEPTIDE_LENGTH)
            )

        try:
            self.restrict_q = yaml_params[self.PARENT_PARAMETER_KEY][self.DATA_RESTRICTION_PARAMETER_KEY][
                self.Q_VALUE_RESTRICT_PARAMETER
            ]
        except KeyError:
            logger.warning("restrict_q set to default of {}".format(self.DEFAULT_RESTRICT_Q))

        try:
            self.restrict_custom = yaml_params[self.PARENT_PARAMETER_KEY][self.DATA_RESTRICTION_PARAMETER_KEY][
                self.CUSTOM_RESTRICT_PARAMETER
            ]
        except KeyError:
            logger.warning("restrict_custom set to default of {}".format(self.DEFAULT_RESTRICT_CUSTOM))

        try:
            self.protein_score = yaml_params[self.PARENT_PARAMETER_KEY][self.SCORE_PARAMETER_KEY][
                self.PROTEIN_SCORE_PARAMETER
            ]
        except KeyError:
            logger.warning("protein_score set to default of {}".format(self.DEFAULT_PROTEIN_SCORE))

        try:
            self.psm_score_type = yaml_params[self.PARENT_PARAMETER_KEY][self.SCORE_PARAMETER_KEY][
                self.PSM_SCORE_TYPE_PARAMETER
            ]
        except KeyError:
            logger.warning("psm_score_type set to default of {}".format(self.DEFAULT_PSM_SCORE_TYPE))

        try:
            self.decoy_symbol = yaml_params[self.PARENT_PARAMETER_KEY][self.IDENTIFIERS_PARAMETER_KEY][
                self.DECOY_SYMBOL_PARAMETER
            ]
        except KeyError:
            logger.warning("decoy_symbol set to default of {}".format(self.DEFAULT_DECOY_SYMBOL))

        try:
            self.isoform_symbol = yaml_params[self.PARENT_PARAMETER_KEY][self.IDENTIFIERS_PARAMETER_KEY][
                self.ISOFORM_SYMBOL_PARAMETER
            ]
        except KeyError:
            logger.warning("isoform_symbol set to default of {}".format(self.DEFAULT_ISOFORM_SYMBOL))

        try:
            self.reviewed_identifier_symbol = yaml_params[self.PARENT_PARAMETER_KEY][
                self.IDENTIFIERS_PARAMETER_KEY
            ][self.REVIEWED_IDENTIFIER_PARAMETER]
        except KeyError:
            logger.warning(
                "reviewed_identifier_symbol set to default of {}".format(self.DEFAULT_REVIEWED_IDENTIFIER_SYMBOL)
            )

        try:
            self.inference_type = yaml_params[self.PARENT_PARAMETER_KEY][self.INFERENCE_PARAMETER_KEY][
                self.INFERENCE_TYPE_PARAMETER
            ]
        except KeyError:
            logger.warning("inference_type set to default of {}".format(self.DEFAULT_INFERENCE_TYPE))

        try:
            self.tag = yaml_params[self.PARENT_PARAMETER_KEY][self.GENERAL_PARAMETER_KEY][self.TAG_PARAMETER]
        except KeyError:
            logger.warning("tag set to default of {}".format(self.DEFAULT_TAG))

        try:
            self.psm_score = yaml_params[self.PARENT_PARAMETER_KEY][self.SCORE_PARAMETER_KEY][
                self.PSM_SCORE_PARAMETER
            ]
        except KeyError:
            logger.warning("psm_score set to default of {}".format(self.DEFAULT_PSM_SCORE))

        try:
            self.grouping_type = yaml_params[self.PARENT_PARAMETER_KEY][self.INFERENCE_PARAMETER_KEY][
                self.GROUPING_TYPE_PARAMETER
            ]
        except KeyError:
            logger.warning("grouping_type set to default of {}".format(self.DEFAULT_GROUPING_TYPE))

        try:
            self.max_identifiers_peptide_centric = yaml_params[self.PARENT_PARAMETER_KEY][
                self.PEPTIDE_CENTRIC_PARAMETER_KEY
            ][self.MAX_IDENTIFIERS_PARAMETER]
        except KeyError:
            logger.warning(
                "max_identifiers_peptide_centric set to default of {}".format(
                    self.DEFAULT_MAX_IDENTIFIERS_PEPTIDE_CENTRIC
                )
            )

        try:
            self.lp_solver = yaml_params[self.PARENT_PARAMETER_KEY][self.PARSIMONY_PARAMETER_KEY][
                self.LP_SOLVER_PARAMETER
            ]
        except KeyError:
            logger.warning("lp_solver set to default of {}".format(self.DEFAULT_LP_SOLVER))
        try:
            # Do try except here to make old param files backwards compatible
            self.shared_peptides = yaml_params[self.PARENT_PARAMETER_KEY][self.PARSIMONY_PARAMETER_KEY][
                self.SHARED_PEPTIDES_PARAMETER
            ]
        except KeyError:
            logger.warning("shared_peptides set to default of {}".format(self.DEFAULT_SHARED_PEPTIDES))

    else:
        logger.warning("No Yaml parameter file supplied, all parameters set to default")
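
Every block in `convert_to_object` follows one pattern: walk a fixed path of nested keys, and if any key is missing, keep the default and log a warning. The pattern can be sketched as a single helper. This is an illustration only: `get_nested` is a hypothetical function, and the key names `parameters`, `fdr`, and `picker` are assumptions chosen for the example rather than values confirmed by the source.

```python
import logging

logger = logging.getLogger(__name__)


def get_nested(params, keys, default, name):
    """Walk nested dicts along `keys`; fall back to `default` if any key is missing."""
    current = params
    try:
        for key in keys:
            current = current[key]
    except KeyError:
        logger.warning("%s set to default of %s", name, default)
        return default
    return current


# Hypothetical parameter structure; the key names are illustrative only
yaml_params = {"parameters": {"general": {"fdr": 0.05}}}

# Present in the file, so the file value wins
fdr = get_nested(yaml_params, ["parameters", "general", "fdr"], 0.01, "fdr")
# Missing from the file, so the default is kept and a warning is logged
picker = get_nested(yaml_params, ["parameters", "general", "picker"], True, "picker")
```

As in the source, only a missing key triggers the fallback; a section that exists but is not a mapping would still raise.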

fix_parameters_from_datastore(self, data)

ProteinInferenceParameter method to override restriction values in the parameter file if those scores do not exist in the input files.

Parameters:
  • data (DataStore) – DataStore Object.

Source code in pyproteininference/parameters.py
def fix_parameters_from_datastore(self, data):
    """
    ProteinInferenceParameter method to override restriction values in the
    parameter file if those scores do not exist in the input files.

    Args:
        data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].

    """

    self.override_q_restrict(data=data)
    self.override_pep_restrict(data=data)
    self.override_custom_restrict(data=data)
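
The effect of the three overrides can be seen with small stand-ins for the two objects involved. The sketch below assumes nothing about the real classes beyond the `input_has_q`, `input_has_pep`, and `input_has_custom` methods and the three `restrict_*` attributes shown above; `StubDataStore` and `StubParameters` are hypothetical names.

```python
import logging

logger = logging.getLogger(__name__)


class StubDataStore:
    """Hypothetical stand-in for DataStore that only reports which scores exist."""

    def __init__(self, has_q, has_pep, has_custom):
        self._has = {"q": has_q, "pep": has_pep, "custom": has_custom}

    def input_has_q(self):
        return self._has["q"]

    def input_has_pep(self):
        return self._has["pep"]

    def input_has_custom(self):
        return self._has["custom"]


class StubParameters:
    """Hypothetical stand-in holding only the three restriction thresholds."""

    def __init__(self, restrict_q, restrict_pep, restrict_custom):
        self.restrict_q = restrict_q
        self.restrict_pep = restrict_pep
        self.restrict_custom = restrict_custom

    def fix_parameters_from_datastore(self, data):
        checks = (
            ("restrict_q", data.input_has_q(), "Q"),
            ("restrict_pep", data.input_has_pep(), "Pep"),
            ("restrict_custom", data.input_has_custom(), "Custom"),
        )
        for attribute, present, label in checks:
            # Only clear a restriction that is set and cannot be applied
            if not present and getattr(self, attribute):
                logger.warning("No %s values found in the input data, not filtering on %s value", label, label)
                setattr(self, attribute, None)


params = StubParameters(restrict_q=0.005, restrict_pep=0.9, restrict_custom=None)
params.fix_parameters_from_datastore(StubDataStore(has_q=False, has_pep=True, has_custom=False))
```

After the call, `restrict_q` has been cleared because the data has no q values, `restrict_pep` is unchanged, and `restrict_custom` stays `None` without a warning because it was never set.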

override_custom_restrict(self, data)

ProteinInferenceParameter method to override restrict_custom if the input data does not contain custom score values.

Parameters:
  • data (DataStore) – DataStore Object.

Source code in pyproteininference/parameters.py
def override_custom_restrict(self, data):
    """
    ProteinInferenceParameter method to override restrict_custom if
    the input data does not contain custom score values.

    Args:
        data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].

    """
    data_has_custom = data.input_has_custom()
    if data_has_custom:
        pass
    else:
        if self.restrict_custom:
            logger.warning(
                "No Custom values found in the input data, overriding parameters to not filter on Custom value"
            )
            self.restrict_custom = None

override_pep_restrict(self, data)

ProteinInferenceParameter method to override restrict_pep if the input data does not contain pep values.

Parameters:
  • data (DataStore) – DataStore Object.

Source code in pyproteininference/parameters.py
def override_pep_restrict(self, data):
    """
    ProteinInferenceParameter method to override restrict_pep if the input data does not contain pep values.

    Args:
        data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].

    """
    data_has_pep = data.input_has_pep()
    if data_has_pep:
        pass
    else:
        if self.restrict_pep:
            logger.warning(
                "No Pep values found in the input data, overriding parameters to not filter on Pep value"
            )
            self.restrict_pep = None

override_q_restrict(self, data)

ProteinInferenceParameter method to override restrict_q if the input data does not contain q values.

Parameters:
  • data (DataStore) – DataStore Object.

Source code in pyproteininference/parameters.py
def override_q_restrict(self, data):
    """
    ProteinInferenceParameter method to override restrict_q if the input data does not contain q values.

    Args:
        data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].

    """
    data_has_q = data.input_has_q()
    if data_has_q:
        pass
    else:
        if self.restrict_q:
            logger.warning("No Q values found in the input data, overriding parameters to not filter on Q value")
            self.restrict_q = None

validate_parameters(self)

Method to validate all parameters.

Returns:
  • None

Source code in pyproteininference/parameters.py
def validate_parameters(self):
    """
    Method to validate all parameters.

    Returns:
        None:

    """
    # Run all of the parameter validations
    self._validate_digest_type()
    self._validate_export_type()
    self._validate_floats()
    self._validate_bools()
    self._validate_score_type()
    self._validate_score_method()
    self._validate_score_combination()
    self._validate_inference_type()
    self._validate_grouping_type()
    self._validate_max_id()
    self._validate_lp_solver()
    self._validate_identifiers()
    self._validate_parsimony_shared_peptides()
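
One of the checks in this sequence, `_validate_score_combination`, enforces a pairing rule: an additive PSM score type requires the additive score method, and a multiplicative PSM score type must not use it. A minimal standalone sketch of that rule follows. The string values and the contents of `SCORE_METHODS` are assumptions for illustration; the library reads them from constants on its `Score` class.

```python
# Assumed values; the library takes these from constants on its Score class
ADDITIVE = "additive"
MULTIPLICATIVE = "multiplicative"
SCORE_METHODS = ["additive", "multiplicative_log", "best_peptide_per_protein"]


def check_score_combination(psm_score_type, protein_score):
    """Raise ValueError for a score type / score method pair that cannot be combined."""
    if psm_score_type == ADDITIVE and protein_score != ADDITIVE:
        raise ValueError("An 'additive' score type requires the 'additive' score method")
    if psm_score_type == MULTIPLICATIVE and protein_score == ADDITIVE:
        allowed = ", ".join(m for m in SCORE_METHODS if m != ADDITIVE)
        raise ValueError("A 'multiplicative' score type cannot use 'additive'; choose one of: " + allowed)


# A valid pair passes silently
check_score_combination("multiplicative", "multiplicative_log")

# An invalid pair is reported through the exception message
try:
    check_score_combination("additive", "best_peptide_per_protein")
except ValueError as error:
    print(error)
```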

physical

Protein

The following class is a representation of a Protein that stores characteristics/attributes of a protein for the entire analysis. We use __slots__ to predefine the attributes the Protein Object can have. This is done to speed up runtime of the PI algorithm.

Attributes:

Name Type Description
identifier str

String identifier for the Protein object.

score float

Float that represents the protein score as output from Score object methods.

psms list

List of Psm objects.

group_identification set

Set of group Identifiers that the protein belongs to (int).

reviewed bool

True/False on if the identifier is reviewed.

unreviewed bool

True/False on if the identifier is unreviewed.

peptides list

List of non flanking peptide sequences.

peptide_scores list

List of Psm scores associated with the protein.

picked bool

True/False if the protein passes the picker algo. True if passes. False if does not pass.

num_peptides int

Number of peptides that map to the given Protein.

unique_peptides list

List of peptide strings that are unique to this protein across the analysis.

num_unique_peptides int

Number of unique peptides.

raw_peptides list

List of raw peptides. Includes flanking AA and Mods.
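
The practical effect of declaring `__slots__` can be shown with a reduced class. The sketch below is not the real `Protein`; `SlottedProtein` is a hypothetical two-attribute class that only demonstrates the mechanism: instances carry no per-instance `__dict__`, and assigning an undeclared attribute fails.

```python
class SlottedProtein:
    """Hypothetical two-attribute class; the real Protein declares more slots."""

    __slots__ = ("identifier", "score")

    def __init__(self, identifier):
        self.identifier = identifier
        self.score = None


protein = SlottedProtein("PRKDC_HUMAN|P78527")
protein.score = 12.5  # declared in __slots__, so assignment works

try:
    protein.description = "kinase"  # not declared in __slots__
except AttributeError:
    print("undeclared attributes are rejected")

print(hasattr(protein, "__dict__"))  # False
```

Dropping the per-instance dictionary mainly reduces memory per object, which matters when an analysis creates many Protein objects.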

Source code in pyproteininference/physical.py
class Protein(object):
    """
    The following class is a representation of a Protein that stores characteristics/attributes of a protein for the
    entire analysis.
    We use __slots__ to predefine the attributes the Protein Object can have.
    This is done to speed up runtime of the PI algorithm.

    Attributes:
        identifier (str): String identifier for the Protein object.
        score (float): Float that represents the protein score as output from
            [Score object][pyproteininference.scoring.Score] methods.
        psms (list): List of [Psm][pyproteininference.physical.Psm] objects.
        group_identification (set): Set of group Identifiers that the protein belongs to (int).
        reviewed (bool): True/False on if the identifier is reviewed.
        unreviewed (bool): True/False on if the identifier is unreviewed.
        peptides (list): List of non flanking peptide sequences.
        peptide_scores (list): List of Psm scores associated with the protein.
        picked (bool): True/False if the protein passes the picker algo. True if passes. False if does not pass.
        num_peptides (int): Number of peptides that map to the given Protein.
        unique_peptides (list): List of peptide strings that are unique to this protein across the analysis.
        num_unique_peptides (int): Number of unique peptides.
        raw_peptides (list): List of raw peptides. Includes flanking AA and Mods.

    """

    __slots__ = (
        "identifier",
        "score",
        "psms",
        "group_identification",
        "reviewed",
        "unreviewed",
        "peptides",
        "peptide_scores",
        "picked",
        "num_peptides",
        "unique_peptides",
        "num_unique_peptides",
        "raw_peptides",
    )

    def __init__(self, identifier):
        """
        Initialization method for Protein object.

        Args:
            identifier (str): String identifier for the Protein object.

        Example:
            >>> protein = pyproteininference.physical.Protein(identifier = "PRKDC_HUMAN|P78527")

        """
        self.identifier = identifier
        self.score = None
        self.psms = []  # List of psm objects
        self.group_identification = set()
        self.reviewed = False
        self.unreviewed = False
        self.peptides = None  # Sequence info without flanking
        self.peptide_scores = None  # remove
        self.picked = True
        self.num_peptides = None  # remove
        self.unique_peptides = None  # remove
        self.num_unique_peptides = None  # remove
        self.raw_peptides = set()  # Includes Flanking Seq Info

    def get_psm_scores(self):
        """
        Retrieves psm scores for a given protein.

        Returns:
            list: List of psm scores for the given protein.

        """
        score_list = [x.main_score for x in self.psms]
        return score_list

    def get_psm_identifiers(self):
        """
        Retrieves a list of Psm identifiers.

        Returns:
            list: List of Psm identifiers.

        """
        psms = [x.identifier for x in self.psms]
        return psms

    def get_stripped_psm_identifiers(self):
        """
        Retrieves a list of Psm identifiers that have had mods removed and flanking AAs removed.

        Returns:
            list: List of Psm identifiers that have no mods or flanking AAs.

        """
        psms = [x.stripped_peptide for x in self.psms]
        return psms

    def get_unique_peptide_identifiers(self):
        """
        Retrieves the unique set of peptides for a protein.

        Returns:
            set: Set of peptide strings.

        """
        unique_peptides = set(self.get_psm_identifiers())
        return unique_peptides

    def get_unique_stripped_peptide_identifiers(self):
        """
        Retrieves the unique set of peptides for a protein that are stripped.

        Returns:
            set: Set of peptide strings that are stripped of mods and flanking AAs.

        """
        stripped_peptide_identifiers = set(self.get_stripped_psm_identifiers())
        return stripped_peptide_identifiers

    def get_num_psms(self):
        """
        Retrieves the number of Psms.

        Returns:
            int: Number of Psms.

        """
        num_psms = len(self.get_psm_identifiers())
        return num_psms

    def get_num_peptides(self):
        """
        Retrieves the number of peptides.

        Returns:
            int: Number of peptides.

        """
        num_peptides = len(self.get_unique_peptide_identifiers())
        return num_peptides

    def get_psm_ids(self):
        """
        Retrieves the Psm Ids.

        Returns:
            list: List of Psm Ids.

        """
        psm_ids = [x.psm_id for x in self.psms]
        return psm_ids

__init__(self, identifier) special

Initialization method for Protein object.

Parameters:
  • identifier (str) – String identifier for the Protein object.

Examples:

>>> protein = pyproteininference.physical.Protein(identifier = "PRKDC_HUMAN|P78527")
Source code in pyproteininference/physical.py
def __init__(self, identifier):
    """
    Initialization method for Protein object.

    Args:
        identifier (str): String identifier for the Protein object.

    Example:
        >>> protein = pyproteininference.physical.Protein(identifier = "PRKDC_HUMAN|P78527")

    """
    self.identifier = identifier
    self.score = None
    self.psms = []  # List of psm objects
    self.group_identification = set()
    self.reviewed = False
    self.unreviewed = False
    self.peptides = None  # Sequence info without flanking
    self.peptide_scores = None  # remove
    self.picked = True
    self.num_peptides = None  # remove
    self.unique_peptides = None  # remove
    self.num_unique_peptides = None  # remove
    self.raw_peptides = set()  # Includes Flanking Seq Info

get_num_peptides(self)

Retrieves the number of peptides.

Returns:
  • int – Number of peptides.

Source code in pyproteininference/physical.py
def get_num_peptides(self):
    """
    Retrieves the number of peptides.

    Returns:
        int: Number of peptides.

    """
    num_peptides = len(self.get_unique_peptide_identifiers())
    return num_peptides

get_num_psms(self)

Retrieves the number of Psms.

Returns:
  • int – Number of Psms.

Source code in pyproteininference/physical.py
def get_num_psms(self):
    """
    Retrieves the number of Psms.

    Returns:
        int: Number of Psms.

    """
    num_psms = len(self.get_psm_identifiers())
    return num_psms

get_psm_identifiers(self)

Retrieves a list of Psm identifiers.

Returns:
  • list – List of Psm identifiers.

Source code in pyproteininference/physical.py
def get_psm_identifiers(self):
    """
    Retrieves a list of Psm identifiers.

    Returns:
        list: List of Psm identifiers.

    """
    psms = [x.identifier for x in self.psms]
    return psms

get_psm_ids(self)

Retrieves the Psm Ids.

Returns:
  • list – List of Psm Ids.

Source code in pyproteininference/physical.py
def get_psm_ids(self):
    """
    Retrieves the Psm Ids.

    Returns:
        list: List of Psm Ids.

    """
    psm_ids = [x.psm_id for x in self.psms]
    return psm_ids

get_psm_scores(self)

Retrieves psm scores for a given protein.

Returns:
  • list – List of psm scores for the given protein.

Source code in pyproteininference/physical.py
def get_psm_scores(self):
    """
    Retrieves psm scores for a given protein.

    Returns:
        list: List of psm scores for the given protein.

    """
    score_list = [x.main_score for x in self.psms]
    return score_list

get_stripped_psm_identifiers(self)

Retrieves a list of Psm identifiers that have had mods removed and flanking AAs removed.

Returns:
  • list – List of Psm identifiers that have no mods or flanking AAs.

Source code in pyproteininference/physical.py
def get_stripped_psm_identifiers(self):
    """
    Retrieves a list of Psm identifiers that have had mods removed and flanking AAs removed.

    Returns:
        list: List of Psm identifiers that have no mods or flanking AAs.

    """
    psms = [x.stripped_peptide for x in self.psms]
    return psms

get_unique_peptide_identifiers(self)

Retrieves the unique set of peptides for a protein.

Returns:
  • set – Set of peptide strings.

Source code in pyproteininference/physical.py
def get_unique_peptide_identifiers(self):
    """
    Retrieves the unique set of peptides for a protein.

    Returns:
        set: Set of peptide strings.

    """
    unique_peptides = set(self.get_psm_identifiers())
    return unique_peptides

get_unique_stripped_peptide_identifiers(self)

Retrieves the unique set of peptides for a protein that are stripped.

Returns:
  • set – Set of peptide strings that are stripped of mods and flanking AAs.

Source code in pyproteininference/physical.py
def get_unique_stripped_peptide_identifiers(self):
    """
    Retrieves the unique set of peptides for a protein that are stripped.

    Returns:
        set: Set of peptide strings that are stripped of mods and flanking AAs.

    """
    stripped_peptide_identifiers = set(self.get_stripped_psm_identifiers())
    return stripped_peptide_identifiers
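
The counting methods above differ in whether duplicates are collapsed: get_num_psms counts every Psm, while get_num_peptides counts distinct identifier strings. The following is a minimal sketch of that distinction. It uses a hypothetical stand-in class (SketchProtein) and types.SimpleNamespace objects in place of real Psm objects, so that it runs without the package installed; it is not the library class itself.

```python
from types import SimpleNamespace


class SketchProtein:
    """Hypothetical stand-in that mirrors a few Protein getters; not the library class."""

    def __init__(self, identifier):
        self.identifier = identifier
        self.psms = []

    def get_psm_identifiers(self):
        # One entry per Psm, duplicates included.
        return [x.identifier for x in self.psms]

    def get_num_psms(self):
        return len(self.get_psm_identifiers())

    def get_num_peptides(self):
        # Distinct identifier strings only.
        return len(set(self.get_psm_identifiers()))


protein = SketchProtein("PRKDC_HUMAN|P78527")
protein.psms = [
    SimpleNamespace(identifier="K.AAAK.L"),
    SimpleNamespace(identifier="K.AAAK.L"),  # same peptide observed a second time
    SimpleNamespace(identifier="R.CCCR.G"),
]

num_psms = protein.get_num_psms()          # counts all three Psms
num_peptides = protein.get_num_peptides()  # counts the two distinct identifiers
```

Because the count is taken over the full identifier strings, the same sequence carrying different mods or flanking AAs would be counted as separate peptides; get_unique_stripped_peptide_identifiers is the getter that collapses those.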

ProteinGroup

The following class is a physical Protein Group class that stores characteristics of a Protein Group for the entire analysis. We use __slots__ to predefine the attributes the ProteinGroup object can have. This is done to speed up runtime of the PI algorithm.

Attributes:

Name Type Description
number_id int

unique Integer to represent a group.

proteins list

List of Protein objects.

q_value float

Q value for the protein group that is calculated with method calculate_q_values.

Source code in pyproteininference/physical.py
class ProteinGroup(object):
    """
    The following class is a physical Protein Group class that stores characteristics of a Protein Group for the entire
        analysis.
    We use __slots__ to predefine the attributes the ProteinGroup object can have.
    This is done to speed up runtime of the PI algorithm.

    Attributes:
        number_id (int): unique Integer to represent a group.
        proteins (list): List of [Protein][pyproteininference.physical.Protein] objects.
        q_value (float): Q value for the protein group that is calculated with method
            [calculate_q_values][pyproteininference.datastore.DataStore.calculate_q_values].

    """

    __slots__ = ("proteins", "number_id", "q_value")

    def __init__(self, number_id):
        """
        Initialization method for ProteinGroup object.

        Args:
            number_id (int): unique Integer to represent a group.

        Example:
            >>> pg = pyproteininference.physical.ProteinGroup(number_id = 1)
        """

        self.proteins = []
        self.number_id = number_id
        self.q_value = None

__init__(self, number_id) special

Initialization method for ProteinGroup object.

Parameters:
  • number_id (int) – unique Integer to represent a group.

Examples:

>>> pg = pyproteininference.physical.ProteinGroup(number_id = 1)
Source code in pyproteininference/physical.py
def __init__(self, number_id):
    """
    Initialization method for ProteinGroup object.

    Args:
        number_id (int): unique Integer to represent a group.

    Example:
        >>> pg = pyproteininference.physical.ProteinGroup(number_id = 1)
    """

    self.proteins = []
    self.number_id = number_id
    self.q_value = None
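
To illustrate what __slots__ does in practice, here is a small sketch with a hypothetical class (SlottedGroup) shaped like ProteinGroup. It shows that only the declared attributes can be set and that instances carry no per-instance __dict__, which is generally where the memory and attribute-access savings come from. It makes no claim about the actual speed-up in this package.

```python
class SlottedGroup:
    """Hypothetical class shaped like ProteinGroup, for illustration only."""

    __slots__ = ("proteins", "number_id", "q_value")

    def __init__(self, number_id):
        self.proteins = []
        self.number_id = number_id
        self.q_value = None


group = SlottedGroup(number_id=1)
group.q_value = 0.01  # declared in __slots__, so assignment works

try:
    group.score = 5.0  # not declared in __slots__
    undeclared_allowed = True
except AttributeError:
    undeclared_allowed = False

# Slotted instances do not get a per-instance attribute dictionary.
has_instance_dict = hasattr(group, "__dict__")
```

A practical consequence is that any new attribute has to be added to the __slots__ tuple before it can be assigned.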

Psm

The following class is a physical Psm class that stores characteristics of a psm for the entire analysis. We use __slots__ to predefine the attributes the Psm Object can have. This is done to speed up runtime of the PI algorithm.

Attributes:

Name Type Description
identifier str

Peptide Identifier: IE "K.DLIDEGH#AATQLVNQLHDVVVENNLSDK.Q".

percscore float

Percolator Score from input file if it exists.

qvalue float

Q value from input file if it exists.

pepvalue float

Pep value from input file if it exists.

possible_proteins list

List of protein strings that the Psm maps to based on the digest.

psm_id str

String that represents a global identifier for the Psm. Should come from input files.

custom_score float

Score that comes from a custom column in the input files.

main_score float

The Psm score to be used as the scoring variable for protein scoring. Can be percscore, qvalue, pepvalue, or custom_score.

stripped_peptide str

This is the identifier attribute that has had mods removed and flanking AAs removed IE: DLIDEGHAATQLVNQLHDVVVENNLSDK.

non_flanking_peptide str

This is the identifier attribute that has had flanking AAs removed IE: DLIDEGH#AATQLVNQLHDVVVENNLSDK. Note that mods are still present here.

Source code in pyproteininference/physical.py
class Psm(object):
    """
    The following class is a physical Psm class that stores characteristics of a psm for the entire analysis.
    We use __slots__ to predefine the attributes the Psm Object can have.
    This is done to speed up runtime of the PI algorithm.

    Attributes:
        identifier (str): Peptide Identifier: IE "K.DLIDEGH#AATQLVNQLHDVVVENNLSDK.Q".
        percscore (float): Percolator Score from input file if it exists.
        qvalue (float): Q value from input file if it exists.
        pepvalue (float): Pep value from input file if it exists.
        possible_proteins (list): List of protein strings that the Psm maps to based on the digest.
        psm_id (str): String that represents a global identifier for the Psm. Should come from input files.
        custom_score (float): Score that comes from a custom column in the input files.
        main_score (float): The Psm score to be used as the scoring variable for protein scoring. Can be
            percscore, qvalue, pepvalue, or custom_score.
        stripped_peptide (str): This is the identifier attribute that has had mods removed and flanking AAs
            removed IE: DLIDEGHAATQLVNQLHDVVVENNLSDK.
        non_flanking_peptide (str): This is the identifier attribute that has had flanking AAs
            removed IE: DLIDEGH#AATQLVNQLHDVVVENNLSDK. Note that mods are still present here.

    """

    __slots__ = (
        "identifier",
        "percscore",
        "qvalue",
        "pepvalue",
        "possible_proteins",
        "psm_id",
        "custom_score",
        "main_score",
        "stripped_peptide",
        "non_flanking_peptide",
    )

    # The first alternative removes anything between parentheses, including the parentheses - \([^()]*\)
    # The second alternative removes anything between brackets, including the brackets - \[.*?\]
    # The third alternative removes any character that is not A-Z - [^A-Z]
    MOD_REGEX = re.compile(r"\([^()]*\)|\[.*?\]|[^A-Z]")

    FRONT_FLANKING_REGEX = re.compile("^[A-Z|-][.]")
    BACK_FLANKING_REGEX = re.compile("[.][A-Z|-]$")

    SCORE_ATTRIBUTE_NAMES = set(["pepvalue", "qvalue", "percscore", "custom_score"])

    def __init__(self, identifier):
        """
        Initialization method for the Psm object.
        This method also initializes the `stripped_peptide` and `non_flanking_peptide` attributes.

        Args:
            identifier (str): Peptide Identifier: IE "K.DLIDEGH#AATQLVNQLHDVVVENNLSDK.Q".

        Example:
            >>> psm = pyproteininference.physical.Psm(identifier = "K.DLIDEGHAATQLVNQLHDVVVENNLSDK.Q")

        """
        self.identifier = identifier
        self.percscore = None
        self.qvalue = None
        self.pepvalue = None
        self.possible_proteins = None
        self.psm_id = None
        self.custom_score = None
        self.main_score = None
        self.stripped_peptide = None
        self.non_flanking_peptide = None

        # Add logic to split the peptide and strip it of mods
        current_peptide = Psm.split_peptide(peptide_string=self.identifier)

        self.non_flanking_peptide = current_peptide

        if not current_peptide.isupper() or not current_peptide.isalpha():
            # If we have mods remove them...
            peptide_string = current_peptide.upper()
            stripped_peptide = Psm.remove_peptide_mods(peptide_string)
            current_peptide = stripped_peptide

        # Set stripped_peptide variable
        self.stripped_peptide = current_peptide

    @classmethod
    def remove_peptide_mods(cls, peptide_string):
        """
        This class method takes a string and uses a `MOD_REGEX` to remove mods from peptide strings.

        Args:
            peptide_string (str): Peptide string to have mods removed from.

        Returns:
            str: a peptide string with mods removed.

        """
        stripped_peptide = cls.MOD_REGEX.sub("", peptide_string)
        return stripped_peptide

    @classmethod
    def split_peptide(cls, peptide_string, delimiter="."):
        """
        This class method takes a peptide string with flanking AAs and removes them from the peptide string.
        This method uses string splitting; if the split does not yield exactly 1 or 3 chunks, the method
            [split_peptide_pro][pyproteininference.physical.Psm.split_peptide_pro] is called instead.

        Args:
            peptide_string (str): Peptide string to have flanking AAs removed from.
            delimiter (str): a string to indicate what separates a leading/trailing (flanking) AA from the
                peptide sequence.

        Returns:
            str: a peptide string with flanking AAs removed.

        """
        peptide_split = peptide_string.split(delimiter)
        if len(peptide_split) == 3:
            # If we get 3 chunks it will usually be ['A', 'ADGSDFGSS', 'F']
            # So take index 1
            peptide = peptide_split[1]
        elif len(peptide_split) == 1:
            # If we get 1 chunk it should just be ['ADGSDFGSS']
            # So take index 0
            peptide = peptide_split[0]
        else:
            # If we split the peptide and it is not length 1 or 3 then try to split with pro
            peptide = cls.split_peptide_pro(peptide_string=peptide_string, delimiter=delimiter)

        return peptide

    @classmethod
    def split_peptide_pro(cls, peptide_string, delimiter="."):
        """
        This class method takes a peptide string with flanking AAs and removes them from the peptide string.
        This is a specialized method of [split_peptide][pyproteininference.physical.Psm.split_peptide] that uses
         regex identifiers to replace flanking AAs as opposed to string splitting.


        Args:
            peptide_string (str): Peptide string to have flanking AAs removed from.
            delimiter (str): a string to indicate what separates a leading/trailing (flanking) AA from the peptide
                sequence.

        Returns:
            str: a peptide string with flanking AAs removed.

        """

        if delimiter != ".":
            front_regex = "^[A-Z|-][{}]".format(delimiter)
            cls.FRONT_FLANKING_REGEX = re.compile(front_regex)
            back_regex = "[{}][A-Z|-]$".format(delimiter)
            cls.BACK_FLANKING_REGEX = re.compile(back_regex)

        # Replace the front flanking with nothing
        peptide_string = cls.FRONT_FLANKING_REGEX.sub("", peptide_string)

        # Replace the back flanking with nothing
        peptide_string = cls.BACK_FLANKING_REGEX.sub("", peptide_string)

        return peptide_string

    def assign_main_score(self, score):
        """
        This method takes in a score type and assigns the variable main_score for a given Psm based on the score type.

        Args:
            score (str): This is a string representation of the Psm attribute that will get assigned to the main_score
                variable.

        """
        # Assign a main score based on user input
        if score not in self.SCORE_ATTRIBUTE_NAMES:
            raise ValueError("Scores must either be one of: '{}'".format(", ".join(self.SCORE_ATTRIBUTE_NAMES)))
        else:
            self.main_score = getattr(self, score)

__init__(self, identifier) special

Initialization method for the Psm object. This method also initializes the stripped_peptide and non_flanking_peptide attributes.

Parameters:
  • identifier (str) – Peptide Identifier: IE "K.DLIDEGH#AATQLVNQLHDVVVENNLSDK.Q".

Examples:

>>> psm = pyproteininference.physical.Psm(identifier = "K.DLIDEGHAATQLVNQLHDVVVENNLSDK.Q")
Source code in pyproteininference/physical.py
def __init__(self, identifier):
    """
    Initialization method for the Psm object.
    This method also initializes the `stripped_peptide` and `non_flanking_peptide` attributes.

    Args:
        identifier (str): Peptide Identifier: IE "K.DLIDEGH#AATQLVNQLHDVVVENNLSDK.Q".

    Example:
        >>> psm = pyproteininference.physical.Psm(identifier = "K.DLIDEGHAATQLVNQLHDVVVENNLSDK.Q")

    """
    self.identifier = identifier
    self.percscore = None
    self.qvalue = None
    self.pepvalue = None
    self.possible_proteins = None
    self.psm_id = None
    self.custom_score = None
    self.main_score = None
    self.stripped_peptide = None
    self.non_flanking_peptide = None

    # Add logic to split the peptide and strip it of mods
    current_peptide = Psm.split_peptide(peptide_string=self.identifier)

    self.non_flanking_peptide = current_peptide

    if not current_peptide.isupper() or not current_peptide.isalpha():
        # If we have mods remove them...
        peptide_string = current_peptide.upper()
        stripped_peptide = Psm.remove_peptide_mods(peptide_string)
        current_peptide = stripped_peptide

    # Set stripped_peptide variable
    self.stripped_peptide = current_peptide

assign_main_score(self, score)

This method takes in a score type and assigns the variable main_score for a given Psm based on the score type.

Parameters:
  • score (str) – This is a string representation of the Psm attribute that will get assigned to the main_score variable.

Source code in pyproteininference/physical.py
def assign_main_score(self, score):
    """
    This method takes in a score type and assigns the variable main_score for a given Psm based on the score type.

    Args:
        score (str): This is a string representation of the Psm attribute that will get assigned to the main_score
            variable.

    """
    # Assign a main score based on user input
    if score not in self.SCORE_ATTRIBUTE_NAMES:
        raise ValueError("Scores must either be one of: '{}'".format(", ".join(self.SCORE_ATTRIBUTE_NAMES)))
    else:
        self.main_score = getattr(self, score)
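
As a sketch of how assign_main_score behaves, the block below repeats its check-then-getattr logic on a hypothetical stand-in class (SketchPsm) that keeps only the score attributes. The values and the rejected name "xcorr" are made up for illustration, and the error message is sorted here for stable output, which the listing above does not do.

```python
class SketchPsm:
    """Hypothetical stand-in for Psm that keeps only the score attributes."""

    SCORE_ATTRIBUTE_NAMES = {"pepvalue", "qvalue", "percscore", "custom_score"}

    def __init__(self):
        self.percscore = None
        self.qvalue = None
        self.pepvalue = None
        self.custom_score = None
        self.main_score = None

    def assign_main_score(self, score):
        # Same check-then-getattr logic as the listing above.
        if score not in self.SCORE_ATTRIBUTE_NAMES:
            names = ", ".join(sorted(self.SCORE_ATTRIBUTE_NAMES))
            raise ValueError("Score must be one of: {}".format(names))
        self.main_score = getattr(self, score)


psm = SketchPsm()
psm.pepvalue = 0.002
psm.assign_main_score("pepvalue")
score_after_assign = psm.main_score

# main_score holds the value read at call time; later changes are not tracked.
psm.pepvalue = 0.5
score_after_change = psm.main_score

try:
    psm.assign_main_score("xcorr")  # not one of the recognised score attributes
    rejected = False
except ValueError:
    rejected = True
```

The point to note is that main_score is a copy taken when the method is called, so the chosen score attribute should be populated before assign_main_score runs.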

remove_peptide_mods(peptide_string) classmethod

This class method takes a string and uses a MOD_REGEX to remove mods from peptide strings.

Parameters:
  • peptide_string (str) – Peptide string to have mods removed from.

Returns:
  • str – a peptide string with mods removed.

Source code in pyproteininference/physical.py
@classmethod
def remove_peptide_mods(cls, peptide_string):
    """
    This class method takes a string and uses a `MOD_REGEX` to remove mods from peptide strings.

    Args:
        peptide_string (str): Peptide string to have mods removed from.

    Returns:
        str: a peptide string with mods removed.

    """
    stripped_peptide = cls.MOD_REGEX.sub("", peptide_string)
    return stripped_peptide
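
The effect of MOD_REGEX is easiest to see on a few inputs. The sketch below copies the pattern from the class listing into a standalone function (strip_mods, a name chosen here for illustration) and applies it to three modification styles. The bracketed mass shift and the "(ox)" annotation are assumed example notations, not formats confirmed by the package documentation.

```python
import re

# Pattern copied from Psm.MOD_REGEX: parenthesised text, bracketed text,
# then any remaining character that is not an upper-case letter.
MOD_REGEX = re.compile(r"\([^()]*\)|\[.*?\]|[^A-Z]")


def strip_mods(peptide_string):
    """Standalone equivalent of remove_peptide_mods, for illustration."""
    return MOD_REGEX.sub("", peptide_string)


symbol_mod = strip_mods("DLIDEGH#AATQLVNQLHDVVVENNLSDK")  # '#' is dropped
bracket_mod = strip_mods("PEPT[+79.966]IDE")              # '[+79.966]' is dropped
paren_mod = strip_mods("M(ox)PEPTIDE")                    # '(ox)' is dropped
```

Note that Psm.__init__ upper-cases the peptide before calling remove_peptide_mods, so lower-case letters outside brackets or parentheses would be kept as residues rather than removed.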

split_peptide(peptide_string, delimiter='.') classmethod

This class method takes a peptide string with flanking AAs and removes them from the peptide string. This method uses string splitting; if the split does not yield exactly 1 or 3 chunks, the method split_peptide_pro is called instead.

Parameters:
  • peptide_string (str) – Peptide string to have flanking AAs removed from.

  • delimiter (str) – a string to indicate what separates a leading/trailing (flanking) AA from the peptide sequence.

Returns:
  • str – a peptide string with flanking AAs removed.

Source code in pyproteininference/physical.py
@classmethod
def split_peptide(cls, peptide_string, delimiter="."):
    """
    This class method takes a peptide string with flanking AAs and removes them from the peptide string.
    This method uses string splitting; if the split does not yield exactly 1 or 3 chunks, the method
        [split_peptide_pro][pyproteininference.physical.Psm.split_peptide_pro] is called instead.

    Args:
        peptide_string (str): Peptide string to have flanking AAs removed from.
        delimiter (str): a string to indicate what separates a leading/trailing (flanking) AA from the
            peptide sequence.

    Returns:
        str: a peptide string with flanking AAs removed.

    """
    peptide_split = peptide_string.split(delimiter)
    if len(peptide_split) == 3:
        # If we get 3 chunks it will usually be ['A', 'ADGSDFGSS', 'F']
        # So take index 1
        peptide = peptide_split[1]
    elif len(peptide_split) == 1:
        # If we get 1 chunk it should just be ['ADGSDFGSS']
        # So take index 0
        peptide = peptide_split[0]
    else:
        # If we split the peptide and it is not length 1 or 3 then try to split with pro
        peptide = cls.split_peptide_pro(peptide_string=peptide_string, delimiter=delimiter)

    return peptide
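
The three branches of split_peptide can be traced with a standalone sketch. The function below reproduces the branching from the listing above for the default "." delimiter only; the fallback branch inlines the two flanking regexes instead of calling split_peptide_pro. The example peptides are made up.

```python
import re

# Patterns copied from the Psm class attributes.
FRONT_FLANKING_REGEX = re.compile(r"^[A-Z|-][.]")
BACK_FLANKING_REGEX = re.compile(r"[.][A-Z|-]$")


def split_peptide_sketch(peptide_string):
    """Illustrative re-implementation for the default '.' delimiter only."""
    chunks = peptide_string.split(".")
    if len(chunks) == 3:
        # Flanking AA, peptide, flanking AA: keep the middle chunk.
        return chunks[1]
    if len(chunks) == 1:
        # No flanking AAs present: keep the string as-is.
        return chunks[0]
    # Any other chunk count: strip the flanking AAs with the regexes instead.
    peptide_string = FRONT_FLANKING_REGEX.sub("", peptide_string)
    return BACK_FLANKING_REGEX.sub("", peptide_string)


three_chunks = split_peptide_sketch("K.DLIDEGH#AATQLVNQLHDVVVENNLSDK.Q")
one_chunk = split_peptide_sketch("ADGSDFGSS")
# The decimal point in the mass shift yields four chunks, so the regex path is used.
fallback = split_peptide_sketch("K.PEPT[+79.966]IDE.R")
```

The last case shows why the fallback exists: a modification that itself contains the delimiter produces extra chunks, and taking index 1 would truncate the peptide.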

split_peptide_pro(peptide_string, delimiter='.') classmethod

This class method takes a peptide string with flanking AAs and removes them from the peptide string. This is a specialized method of split_peptide that uses regex identifiers to replace flanking AAs as opposed to string splitting.

Parameters:
  • peptide_string (str) – Peptide string to have flanking AAs removed from.

  • delimiter (str) – a string to indicate what separates a leading/trailing (flanking) AA from the peptide sequence.

Returns:
  • str – a peptide string with flanking AAs removed.

Source code in pyproteininference/physical.py
@classmethod
def split_peptide_pro(cls, peptide_string, delimiter="."):
    """
    This class method takes a peptide string with flanking AAs and removes them from the peptide string.
    This is a specialized method of [split_peptide][pyproteininference.physical.Psm.split_peptide] that uses
     regex identifiers to replace flanking AAs as opposed to string splitting.


    Args:
        peptide_string (str): Peptide string to have flanking AAs removed from.
        delimiter (str): a string to indicate what separates a leading/trailing (flanking) AA from the peptide
            sequence.

    Returns:
        str: a peptide string with flanking AAs removed.

    """

    if delimiter != ".":
        front_regex = "^[A-Z|-][{}]".format(delimiter)
        cls.FRONT_FLANKING_REGEX = re.compile(front_regex)
        back_regex = "[{}][A-Z|-]$".format(delimiter)
        cls.BACK_FLANKING_REGEX = re.compile(back_regex)

    # Replace the front flanking with nothing
    peptide_string = cls.FRONT_FLANKING_REGEX.sub("", peptide_string)

    # Replace the back flanking with nothing
    peptide_string = cls.BACK_FLANKING_REGEX.sub("", peptide_string)

    return peptide_string
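
One detail of the listing above is worth noting: when a non-default delimiter is passed, the compiled patterns are assigned back onto the class, so a later call with the default delimiter appears to reuse the replaced patterns. The sketch below shows one possible alternative that builds the patterns locally. It is not the library's implementation; the function name strip_flanking is hypothetical. It also escapes the delimiter with re.escape and drops the literal "|" from the character class, on the assumption that the "|" was intended as alternation rather than as a character to match.

```python
import re


def strip_flanking(peptide_string, delimiter="."):
    """Hypothetical variant of split_peptide_pro that keeps no state between calls."""
    sep = re.escape(delimiter)  # treat the delimiter as literal text
    front = re.compile(r"^[A-Z-]" + sep)
    back = re.compile(sep + r"[A-Z-]$")
    # Remove the leading flanking AA, then the trailing one.
    return back.sub("", front.sub("", peptide_string))


default_first = strip_flanking("K.PEPT[+79.966]IDE.R")
custom = strip_flanking("K_PEPTIDE_R", delimiter="_")
# A call with the default delimiter still works after a custom-delimiter call.
default_again = strip_flanking("K.PEPTIDE.R")
```

Compiling on every call costs a little time, although Python's re module caches recently compiled patterns, so the overhead is likely small in practice.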

pipeline

ProteinInferencePipeline

This is the main Protein Inference class which houses the logic of the entire data analysis pipeline. Logic is executed in the execute method.

Attributes:

Name Type Description
parameter_file str

Path to Protein Inference Yaml Parameter File.

database_file str

Path to Fasta database used in proteomics search.

target_files str/list

Path to Target Psm File (Or a list of files).

decoy_files str/list

Path to Decoy Psm File (Or a list of files).

combined_files str/list

Path to Combined Psm File (Or a list of files).

target_directory str

Path to Directory containing Target Psm Files.

decoy_directory str

Path to Directory containing Decoy Psm Files.

combined_directory str

Path to Directory containing Combined Psm Files.

output_directory str

Path to Directory where output will be written.

output_filename str

Path to Filename where output will be written. Will override output_directory.

id_splitting bool

True/False on whether to split protein IDs in the digest. Advanced usage only.

append_alt_from_db bool

True/False on whether to append alternative proteins from the DB digestion in Reader class.

data DataStore

DataStore Object.

digest Digest

Digest Object.

Source code in pyproteininference/pipeline.py
class ProteinInferencePipeline(object):
    """
    This is the main Protein Inference class which houses the logic of the entire data analysis pipeline.
    Logic is executed in the [execute][pyproteininference.pipeline.ProteinInferencePipeline.execute] method.

    Attributes:
        parameter_file (str): Path to Protein Inference Yaml Parameter File.
        database_file (str): Path to Fasta database used in proteomics search.
        target_files (str/list): Path to Target Psm File (Or a list of files).
        decoy_files (str/list): Path to Decoy Psm File (Or a list of files).
        combined_files (str/list): Path to Combined Psm File (Or a list of files).
        target_directory (str): Path to Directory containing Target Psm Files.
        decoy_directory (str): Path to Directory containing Decoy Psm Files.
        combined_directory (str): Path to Directory containing Combined Psm Files.
        output_directory (str): Path to Directory where output will be written.
        output_filename (str): Path to Filename where output will be written. Will override output_directory.
        id_splitting (bool): True/False on whether to split protein IDs in the digest. Advanced usage only.
        append_alt_from_db (bool): True/False on whether to append alternative proteins from the DB digestion in
            Reader class.
        data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
        digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].

    """

    def __init__(
        self,
        parameter_file,
        database_file=None,
        target_files=None,
        decoy_files=None,
        combined_files=None,
        target_directory=None,
        decoy_directory=None,
        combined_directory=None,
        output_directory=None,
        output_filename=None,
        id_splitting=False,
        append_alt_from_db=True,
    ):
        """

        Args:
            parameter_file (str): Path to Protein Inference Yaml Parameter File.
            database_file (str): Path to Fasta database used in proteomics search.
            target_files (str/list): Path to Target Psm File (Or a list of files).
            decoy_files (str/list): Path to Decoy Psm File (Or a list of files).
            combined_files (str/list): Path to Combined Psm File (Or a list of files).
            target_directory (str): Path to Directory containing Target Psm Files.
            decoy_directory (str): Path to Directory containing Decoy Psm Files.
            combined_directory (str): Path to Directory containing Combined Psm Files.
            output_filename (str): Path to Filename where output will be written. Will override output_directory.
            output_directory (str): Path to Directory where output will be written.
            id_splitting (bool): True/False on whether to split protein IDs in the digest. Advanced usage only.
            append_alt_from_db (bool): True/False on whether to append alternative proteins from the DB digestion in
                Reader class.

        Example:
            >>> pipeline = pyproteininference.pipeline.ProteinInferencePipeline(
            ...     parameter_file=yaml_params,
            ...     database_file=database,
            ...     target_files=target,
            ...     decoy_files=decoy,
            ...     combined_files=combined_files,
            ...     target_directory=target_directory,
            ...     decoy_directory=decoy_directory,
            ...     combined_directory=combined_directory,
            ...     output_directory=dir_name,
            ...     output_filename=output_filename,
            ...     append_alt_from_db=append_alt,
            ... )
        """

        self.parameter_file = parameter_file
        self.database_file = database_file
        self.target_files = target_files
        self.decoy_files = decoy_files
        self.combined_files = combined_files
        self.target_directory = target_directory
        self.decoy_directory = decoy_directory
        self.combined_directory = combined_directory
        self.output_directory = output_directory
        self.output_filename = output_filename
        self.id_splitting = id_splitting
        self.append_alt_from_db = append_alt_from_db
        self.data = None
        self.digest = None

        self._validate_input()

        self._set_output_directory()

        self._log_append_alt_from_db()

        self._log_id_splitting()

    def execute(self):
        """
        This method is the main driver of the data analysis for the protein inference package.
        It calls the other classes and methods that make up the protein inference pipeline and sets the
            data [DataStore Object][pyproteininference.datastore.DataStore] and digest
            [Digest Object][pyproteininference.in_silico_digest.Digest] attributes.

        The steps include, but are not limited to:

        1. Parameter file management.
        2. Digesting Fasta Database (Optional).
        3. Reading in input Psm Files.
        4. Initializing the [DataStore Object][pyproteininference.datastore.DataStore].
        5. Restricting Psms.
        6. Creating Protein objects/scoring input.
        7. Scoring Proteins.
        8. Running Protein Picker.
        9. Running Inference Methods/Grouping.
        10. Calculating Q Values.
        11. Exporting Proteins to filesystem.

        Example:
            >>> pipeline = pyproteininference.pipeline.ProteinInferencePipeline(
            ...     parameter_file=yaml_params,
            ...     database_file=database,
            ...     target_files=target,
            ...     decoy_files=decoy,
            ...     combined_files=combined_files,
            ...     target_directory=target_directory,
            ...     decoy_directory=decoy_directory,
            ...     combined_directory=combined_directory,
            ...     output_directory=dir_name,
            ...     output_filename=output_filename,
            ...     append_alt_from_db=append_alt,
            ... )
            >>> pipeline.execute()

        """
        # STEP 1: Load parameter file #
        # STEP 1: Load parameter file #
        # STEP 1: Load parameter file #
        pyproteininference_parameters = pyproteininference.parameters.ProteinInferenceParameter(
            yaml_param_filepath=self.parameter_file
        )

        # STEP 2: Start with running an In Silico Digestion #
        # STEP 2: Start with running an In Silico Digestion #
        # STEP 2: Start with running an In Silico Digestion #
        digest = pyproteininference.in_silico_digest.PyteomicsDigest(
            database_path=self.database_file,
            digest_type=pyproteininference_parameters.digest_type,
            missed_cleavages=pyproteininference_parameters.missed_cleavages,
            reviewed_identifier_symbol=pyproteininference_parameters.reviewed_identifier_symbol,
            max_peptide_length=pyproteininference_parameters.restrict_peptide_length,
            id_splitting=self.id_splitting,
        )
        if self.database_file:
            logger.info("Running In Silico Database Digest on file {}".format(self.database_file))
            digest.digest_fasta_database()
        else:
            logger.warning(
                "No Database File provided, Skipping database digest and only taking protein-peptide mapping from the "
                "input files."
            )

        # STEP 3: Read PSM Data #
        reader = pyproteininference.reader.GenericReader(
            target_file=self.target_files,
            decoy_file=self.decoy_files,
            combined_files=self.combined_files,
            parameter_file_object=pyproteininference_parameters,
            digest=digest,
            append_alt_from_db=self.append_alt_from_db,
        )
        reader.read_psms()

        # STEP 4: Initiate the datastore object #
        data = pyproteininference.datastore.DataStore(reader=reader, digest=digest)

        # Step 5: Restrict the PSM data
        data.restrict_psm_data()

        data.recover_mapping()
        # Step 6: Generate protein scoring input
        data.create_scoring_input()

        # Step 7: Remove non unique peptides if running exclusion
        if pyproteininference_parameters.inference_type == Inference.EXCLUSION:
            # Only runs when the inference type is exclusion.
            data.exclude_non_distinguishing_peptides()

        # STEP 8: Score our PSMs given a score method
        score = pyproteininference.scoring.Score(data=data)
        score.score_psms(score_method=pyproteininference_parameters.protein_score)

        # STEP 9: Run protein picker on the data
        if pyproteininference_parameters.picker:
            data.protein_picker()
        else:
            pass

        # STEP 10: Apply Inference
        pyproteininference.inference.Inference.run_inference(data=data, digest=digest)

        # STEP 11: Q value Calculations
        data.calculate_q_values()

        # STEP 12: Export to CSV
        export = pyproteininference.export.Export(data=data)
        export.export_to_csv(
            output_filename=self.output_filename,
            directory=self.output_directory,
            export_type=pyproteininference_parameters.export,
        )

        self.data = data
        self.digest = digest

        logger.info("Protein Inference Finished")

    def _validate_input(self):
        """
        Internal method that validates whether the proper input files have been defined.

        Exactly one of the following input combinations must be supplied:

        1. one or more target_files and decoy_files.
        2. one or more combined_files that include target and decoy data.
        3. a directory that contains target files (target_directory) as well as a directory that contains decoy files
            (decoy_directory).
        4. a directory that contains combined target/decoy files (combined_directory).

        Raises:
            ValueError: If an improper combination of inputs is supplied.
        """
        if (
            self.target_files
            and self.decoy_files
            and not self.combined_files
            and not self.target_directory
            and not self.decoy_directory
            and not self.combined_directory
        ):
            logger.info("Validating input as target_files and decoy_files")
        elif (
            self.combined_files
            and not self.target_files
            and not self.decoy_files
            and not self.decoy_directory
            and not self.target_directory
            and not self.combined_directory
        ):
            logger.info("Validating input as combined_files")
        elif (
            self.target_directory
            and self.decoy_directory
            and not self.target_files
            and not self.decoy_files
            and not self.combined_directory
            and not self.combined_files
        ):
            logger.info("Validating input as target_directory and decoy_directory")
            self._transform_directory_to_files()
        elif (
            self.combined_directory
            and not self.combined_files
            and not self.decoy_files
            and not self.decoy_directory
            and not self.target_files
            and not self.target_directory
        ):
            logger.info("Validating input as combined_directory")
            self._transform_directory_to_files()
        else:
            raise ValueError(
                "To run protein inference, please supply exactly one of: "
                "(1) one or more target_files and decoy_files, "
                "(2) one or more combined_files that include target and decoy data, "
                "(3) a directory that contains target files (target_directory) as well as a directory that "
                "contains decoy files (decoy_directory), or "
                "(4) a directory that contains combined target/decoy files (combined_directory)."
            )

    def _transform_directory_to_files(self):
        """
        This internal method takes files that are in the target_directory, decoy_directory, or combined_directory and
        reassigns these files to target_files, decoy_files, and combined_files for use by the
        [Reader][pyproteininference.reader.Reader] object. Only files ending in .txt or .tsv are kept.
        """
        if self.target_directory and self.decoy_directory:
            logger.info("Transforming target_directory and decoy_directory into files")
            target_files = os.listdir(self.target_directory)
            target_files_full = [
                os.path.join(self.target_directory, x) for x in target_files if x.endswith(".txt") or x.endswith(".tsv")
            ]

            decoy_files = os.listdir(self.decoy_directory)
            decoy_files_full = [
                os.path.join(self.decoy_directory, x) for x in decoy_files if x.endswith(".txt") or x.endswith(".tsv")
            ]

            self.target_files = target_files_full
            self.decoy_files = decoy_files_full

        elif self.combined_directory:
            logger.info("Transforming combined_directory into files")
            combined_files = os.listdir(self.combined_directory)
            combined_files_full = [
                os.path.join(self.combined_directory, x)
                for x in combined_files
                if x.endswith(".txt") or x.endswith(".tsv")
            ]
            self.combined_files = combined_files_full

    def _set_output_directory(self):
        """
        Internal method for setting the output directory.
        If the output_directory argument is not supplied, the output directory is set to the current working directory.
        """
        if not self.output_directory:
            self.output_directory = os.getcwd()
        else:
            pass

    def _log_append_alt_from_db(self):
        """
        Internal method for logging whether the user sets alternative protein append to True or False.
        """
        if self.append_alt_from_db:
            logger.info("Append Alternative Proteins from Database set to True")
        else:
            logger.info("Append Alternative Proteins from Database set to False")

    def _log_id_splitting(self):
        """
        Internal method for logging whether the user sets ID splitting to True or False.
        """
        if self.id_splitting:
            logger.info("ID Splitting for Database Digestion set to True")
        else:
            logger.info("ID Splitting for Database Digestion set to False")
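
The four input combinations accepted by `_validate_input` are mutually exclusive. The rule can be sketched as a small standalone function. This is only an illustration: `classify_inputs` and the labels it returns are hypothetical names and are not part of pyproteininference.

```python
def classify_inputs(
    target_files=None,
    decoy_files=None,
    combined_files=None,
    target_directory=None,
    decoy_directory=None,
    combined_directory=None,
):
    """Return a label for the supplied input combination or raise ValueError."""
    candidates = {
        "target_files": target_files,
        "decoy_files": decoy_files,
        "combined_files": combined_files,
        "target_directory": target_directory,
        "decoy_directory": decoy_directory,
        "combined_directory": combined_directory,
    }
    # Only arguments with a truthy value count as "supplied".
    supplied = frozenset(name for name, value in candidates.items() if value)

    # Each allowed combination must match exactly, with nothing extra supplied.
    allowed = {
        frozenset({"target_files", "decoy_files"}): "target_files and decoy_files",
        frozenset({"combined_files"}): "combined_files",
        frozenset({"target_directory", "decoy_directory"}): "target_directory and decoy_directory",
        frozenset({"combined_directory"}): "combined_directory",
    }
    if supplied not in allowed:
        raise ValueError("Unsupported input combination: {}".format(sorted(supplied)))
    return allowed[supplied]


print(classify_inputs(target_files=["t.txt"], decoy_files=["d.txt"]))
```

Supplying both `combined_files` and `combined_directory`, for example, raises `ValueError`, which corresponds to the final `else` branch of `_validate_input`.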

__init__(self, parameter_file, database_file=None, target_files=None, decoy_files=None, combined_files=None, target_directory=None, decoy_directory=None, combined_directory=None, output_directory=None, output_filename=None, id_splitting=False, append_alt_from_db=True) special

Parameters:
  • parameter_file (str) – Path to Protein Inference Yaml Parameter File.

  • database_file (str) – Path to Fasta database used in proteomics search.

  • target_files (str/list) – Path to Target Psm File (Or a list of files).

  • decoy_files (str/list) – Path to Decoy Psm File (Or a list of files).

  • combined_files (str/list) – Path to Combined Psm File (Or a list of files).

  • target_directory (str) – Path to Directory containing Target Psm Files.

  • decoy_directory (str) – Path to Directory containing Decoy Psm Files.

  • combined_directory (str) – Path to Directory containing Combined Psm Files.

  • output_filename (str) – Path to Filename where output will be written. Will override output_directory.

  • output_directory (str) – Path to Directory where output will be written.

  • id_splitting (bool) – True/False on whether to split protein IDs in the digest. Advanced usage only.

  • append_alt_from_db (bool) – True/False on whether to append alternative proteins from the DB digestion in Reader class.
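
When a directory argument is used, only files whose names end in `.txt` or `.tsv` are picked up, as the source of `_transform_directory_to_files` shows. The sketch below reproduces that filter outside the class. `list_psm_files` is a hypothetical helper name, and the `sorted` call is an addition for deterministic output (the pipeline itself keeps the order returned by `os.listdir`).

```python
import os
import tempfile


def list_psm_files(directory):
    """Return full paths of the .txt/.tsv files in a directory (hypothetical helper)."""
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.endswith((".txt", ".tsv"))
    )


# Build a throwaway directory with two PSM-like files and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("run1.txt", "run2.tsv", "notes.md"):
        open(os.path.join(tmp, name), "w").close()
    found = [os.path.basename(path) for path in list_psm_files(tmp)]

print(found)  # ['run1.txt', 'run2.tsv']
```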

Returns:
  • object

Examples:

>>> pipeline = pyproteininference.pipeline.ProteinInferencePipeline(
>>>     parameter_file=yaml_params,
>>>     database_file=database,
>>>     target_files=target,
>>>     decoy_files=decoy,
>>>     combined_files=combined_files,
>>>     target_directory=target_directory,
>>>     decoy_directory=decoy_directory,
>>>     combined_directory=combined_directory,
>>>     output_directory=dir_name,
>>>     output_filename=output_filename,
>>>     append_alt_from_db=append_alt,
>>> )
Source code in pyproteininference/pipeline.py
def __init__(
    self,
    parameter_file,
    database_file=None,
    target_files=None,
    decoy_files=None,
    combined_files=None,
    target_directory=None,
    decoy_directory=None,
    combined_directory=None,
    output_directory=None,
    output_filename=None,
    id_splitting=False,
    append_alt_from_db=True,
):
    """

    Args:
        parameter_file (str): Path to Protein Inference Yaml Parameter File.
        database_file (str): Path to Fasta database used in proteomics search.
        target_files (str/list): Path to Target Psm File (Or a list of files).
        decoy_files (str/list): Path to Decoy Psm File (Or a list of files).
        combined_files (str/list): Path to Combined Psm File (Or a list of files).
        target_directory (str): Path to Directory containing Target Psm Files.
        decoy_directory (str): Path to Directory containing Decoy Psm Files.
        combined_directory (str): Path to Directory containing Combined Psm Files.
        output_filename (str): Path to Filename where output will be written. Will override output_directory.
        output_directory (str): Path to Directory where output will be written.
        id_splitting (bool): True/False on whether to split protein IDs in the digest. Advanced usage only.
        append_alt_from_db (bool): True/False on whether to append alternative proteins from the DB digestion in
            Reader class.

    Returns:
        object:

    Example:
        >>> pipeline = pyproteininference.pipeline.ProteinInferencePipeline(
        >>>     parameter_file=yaml_params,
        >>>     database_file=database,
        >>>     target_files=target,
        >>>     decoy_files=decoy,
        >>>     combined_files=combined_files,
        >>>     target_directory=target_directory,
        >>>     decoy_directory=decoy_directory,
        >>>     combined_directory=combined_directory,
        >>>     output_directory=dir_name,
        >>>     output_filename=output_filename,
        >>>     append_alt_from_db=append_alt,
        >>> )
    """

    self.parameter_file = parameter_file
    self.database_file = database_file
    self.target_files = target_files
    self.decoy_files = decoy_files
    self.combined_files = combined_files
    self.target_directory = target_directory
    self.decoy_directory = decoy_directory
    self.combined_directory = combined_directory
    self.output_directory = output_directory
    self.output_filename = output_filename
    self.id_splitting = id_splitting
    self.append_alt_from_db = append_alt_from_db
    self.data = None
    self.digest = None

    self._validate_input()

    self._set_output_directory()

    self._log_append_alt_from_db()

    self._log_id_splitting()

execute(self)

This method is the main driver of the data analysis for the protein inference package. It calls the other classes and methods that make up the protein inference pipeline and sets the data DataStore Object and digest Digest Object.

The steps include, but are not limited to:

  1. Parameter file management.
  2. Digesting Fasta Database (Optional).
  3. Reading in input Psm Files.
  4. Initializing the DataStore Object.
  5. Restricting Psms.
  6. Creating Protein objects/scoring input, and removing non-distinguishing peptides when running exclusion.
  7. Scoring Proteins.
  8. Running Protein Picker.
  9. Running Inference Methods/Grouping.
  10. Calculating Q Values.
  11. Exporting Proteins to filesystem.
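
Because the constructor validates its inputs, a call must supply exactly one input combination. The following is a minimal sketch of a target/decoy run that then reads back the two attributes set by `execute()`. The wrapper name `run_target_decoy_analysis` is hypothetical, and the function assumes pyproteininference is installed and that the supplied paths exist.

```python
def run_target_decoy_analysis(
    parameter_file,
    target_file,
    decoy_file,
    database_file=None,
    output_directory=None,
):
    """Run the pipeline on one target/decoy input pair (hypothetical wrapper)."""
    # Imported lazily so that defining this function does not require the package.
    import pyproteininference.pipeline

    pipeline = pyproteininference.pipeline.ProteinInferencePipeline(
        parameter_file=parameter_file,
        database_file=database_file,
        target_files=target_file,
        decoy_files=decoy_file,
        output_directory=output_directory,
    )
    pipeline.execute()

    # execute() stores the DataStore and Digest objects on the pipeline instance.
    return pipeline.data, pipeline.digest
```

The combined-file and directory arguments are left at their default of `None`, so the input is validated as target_files and decoy_files.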

Examples:

>>> pipeline = pyproteininference.pipeline.ProteinInferencePipeline(
>>>     parameter_file=yaml_params,
>>>     database_file=database,
>>>     target_files=target,
>>>     decoy_files=decoy,
>>>     combined_files=combined_files,
>>>     target_directory=target_directory,
>>>     decoy_directory=decoy_directory,
>>>     combined_directory=combined_directory,
>>>     output_directory=dir_name,
>>>     output_filename=output_filename,
>>>     append_alt_from_db=append_alt,
>>> )
>>> pipeline.execute()
Source code in pyproteininference/pipeline.py
def execute(self):
    """
    This method is the main driver of the data analysis for the protein inference package.
    This method calls other classes and methods that make up the protein inference pipeline
    and sets the data [DataStore Object][pyproteininference.datastore.DataStore] and digest
    [Digest Object][pyproteininference.in_silico_digest.Digest].

    The steps include, but are not limited to:

    1. Parameter file management.
    2. Digesting Fasta Database (Optional).
    3. Reading in input Psm Files.
    4. Initializing the [DataStore Object][pyproteininference.datastore.DataStore].
    5. Restricting Psms.
    6. Creating Protein objects/scoring input, and removing non-distinguishing peptides when running exclusion.
    7. Scoring Proteins.
    8. Running Protein Picker.
    9. Running Inference Methods/Grouping.
    10. Calculating Q Values.
    11. Exporting Proteins to filesystem.

    Example:
        >>> pipeline = pyproteininference.pipeline.ProteinInferencePipeline(
        >>>     parameter_file=yaml_params,
        >>>     database_file=database,
        >>>     target_files=target,
        >>>     decoy_files=decoy,
        >>>     combined_files=combined_files,
        >>>     target_directory=target_directory,
        >>>     decoy_directory=decoy_directory,
        >>>     combined_directory=combined_directory,
        >>>     output_directory=dir_name,
        >>>     output_filename=output_filename,
        >>>     append_alt_from_db=append_alt,
        >>> )
        >>> pipeline.execute()

    """
    # STEP 1: Load parameter file #
    pyproteininference_parameters = pyproteininference.parameters.ProteinInferenceParameter(
        yaml_param_filepath=self.parameter_file
    )

    # STEP 2: Start with running an In Silico Digestion #
    digest = pyproteininference.in_silico_digest.PyteomicsDigest(
        database_path=self.database_file,
        digest_type=pyproteininference_parameters.digest_type,
        missed_cleavages=pyproteininference_parameters.missed_cleavages,
        reviewed_identifier_symbol=pyproteininference_parameters.reviewed_identifier_symbol,
        max_peptide_length=pyproteininference_parameters.restrict_peptide_length,
        id_splitting=self.id_splitting,
    )
    if self.database_file:
        logger.info("Running In Silico Database Digest on file {}".format(self.database_file))
        digest.digest_fasta_database()
    else:
        logger.warning(
            "No Database File provided, Skipping database digest and only taking protein-peptide mapping from the "
            "input files."
        )

    # STEP 3: Read PSM Data #
    reader = pyproteininference.reader.GenericReader(
        target_file=self.target_files,
        decoy_file=self.decoy_files,
        combined_files=self.combined_files,
        parameter_file_object=pyproteininference_parameters,
        digest=digest,
        append_alt_from_db=self.append_alt_from_db,
    )
    reader.read_psms()

    # STEP 4: Initiate the datastore object #
    data = pyproteininference.datastore.DataStore(reader=reader, digest=digest)

    # Step 5: Restrict the PSM data
    data.restrict_psm_data()

    data.recover_mapping()
    # Step 6: Generate protein scoring input
    data.create_scoring_input()

    # Step 7: Remove non unique peptides if running exclusion
    if pyproteininference_parameters.inference_type == Inference.EXCLUSION:
        # Only runs when the inference type is exclusion.
        data.exclude_non_distinguishing_peptides()

    # STEP 8: Score our PSMs given a score method
    score = pyproteininference.scoring.Score(data=data)
    score.score_psms(score_method=pyproteininference_parameters.protein_score)

    # STEP 9: Run protein picker on the data
    if pyproteininference_parameters.picker:
        data.protein_picker()
    else:
        pass

    # STEP 10: Apply Inference
    pyproteininference.inference.Inference.run_inference(data=data, digest=digest)

    # STEP 11: Q value Calculations
    data.calculate_q_values()

    # STEP 12: Export to CSV
    export = pyproteininference.export.Export(data=data)
    export.export_to_csv(
        output_filename=self.output_filename,
        directory=self.output_directory,
        export_type=pyproteininference_parameters.export,
    )

    self.data = data
    self.digest = digest

    logger.info("Protein Inference Finished")

reader

GenericReader (Reader)

This class takes a Percolator-like target file and a Percolator-like decoy file and creates standard Psm objects.

Percolator-like output is tab delimited and formatted as follows:

| PSMId | score | q-value | posterior_error_prob | peptide | proteinIds | | | |
|---|---|---|---|---|---|---|---|---|
| 116108.15139.15139.6.dta | 3.44016 | 0.000479928 | 7.60258e-10 | K.MVVSMTLGLHPWIANIDDTQYLAAK.R | CNDP1_HUMAN\|Q96KN2 | B4E180_HUMAN\|B4E180 | A8K1K1_HUMAN\|A8K1K1 | J3KRP0_HUMAN\|J3KRP0 |

Custom columns can be added and used as scoring input. Please see package documentation for more information.
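
In the layout above, only the first protein column has a header, so any additional protein columns arrive as unnamed trailing fields. The sketch below shows how Python's `csv.DictReader` exposes such fields: values beyond the header are collected in a list under the `restkey`, which is `None` by default. `collect_proteins` is a hypothetical helper for illustration. It is not the reader's own `get_alternative_proteins_from_input` method, whose implementation is not shown here.

```python
import csv
import io

# One header line plus one PSM row with four protein columns, tab delimited.
raw = (
    "PSMId\tscore\tq-value\tposterior_error_prob\tpeptide\tproteinIds\n"
    "116108.15139.15139.6.dta\t3.44016\t0.000479928\t7.60258e-10\t"
    "K.MVVSMTLGLHPWIANIDDTQYLAAK.R\tCNDP1_HUMAN|Q96KN2\tB4E180_HUMAN|B4E180\t"
    "A8K1K1_HUMAN|A8K1K1\tJ3KRP0_HUMAN|J3KRP0\n"
)


def collect_proteins(row):
    """Return every protein ID in a row (hypothetical helper)."""
    proteins = [row["proteinIds"]]
    # Fields beyond the header are stored as a list under the key None.
    proteins.extend(row.get(None, []))
    return proteins


rows = list(csv.DictReader(io.StringIO(raw), delimiter="\t"))
proteins = collect_proteins(rows[0])
print(proteins)
```

A row with a single protein has no `None` key at all, which is why the helper falls back to an empty list.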

Attributes:

Name Type Description
target_file str/list

Path to Target PSM result files.

decoy_file str/list

Path to Decoy PSM result files.

combined_files str/list

Path to Combined PSM result files.

directory str

Path to directory containing combined PSM result files.

psms list

List of Psm objects.

load_custom_score bool

True/False on whether or not to load a custom score. Depends on scoring_variable.

scoring_variable str

String to indicate which column in the input file is to be used as the scoring input.

digest Digest

Digest Object.

parameter_file_object ProteinInferenceParameter

ProteinInferenceParameter object

append_alt_from_db bool

Whether or not to append alternative proteins found in the database that are not in the input files.
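
The `scoring_variable` attribute also determines how rows are filtered and ordered in `read_psms`: rows whose score cannot be parsed as a number are dropped, and the remainder is sorted best-first. The following is a simplified sketch of that idea, not the library's code. `filter_and_sort` and `is_number` are hypothetical names, the `higher_is_better` flag stands in for the additive/multiplicative `psm_score_type` setting, and the choice of sort key is made once per dataset rather than per row.

```python
POSTERIOR_ERROR_PROB = "posterior_error_prob"


def is_number(value):
    """Return True if value can be converted to float."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def filter_and_sort(rows, scoring_variable, higher_is_better):
    """Drop rows with a non-numeric sort key, then sort best-first."""
    if rows and all(POSTERIOR_ERROR_PROB in row for row in rows):
        # Posterior error probabilities: lower is better.
        key, reverse = POSTERIOR_ERROR_PROB, False
    else:
        # Custom score column: direction depends on the score type.
        key, reverse = scoring_variable, higher_is_better
    kept = [row for row in rows if is_number(row.get(key))]
    return sorted(kept, key=lambda row: float(row[key]), reverse=reverse)


pep_rows = [
    {"PSMId": "a", "posterior_error_prob": "0.01"},
    {"PSMId": "b", "posterior_error_prob": "1e-5"},
    {"PSMId": "c", "posterior_error_prob": "NA"},
]
custom_rows = [
    {"PSMId": "x", "xcorr": "2.1"},
    {"PSMId": "y", "xcorr": "3.4"},
]

pep_order = [row["PSMId"] for row in filter_and_sort(pep_rows, "posterior_error_prob", False)]
custom_order = [row["PSMId"] for row in filter_and_sort(custom_rows, "xcorr", True)]
print(pep_order, custom_order)  # ['b', 'a'] ['y', 'x']
```

The `xcorr` column name is only an example of a custom score column.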

Source code in pyproteininference/reader.py
class GenericReader(Reader):
    """
    This class takes a Percolator-like target file and a Percolator-like decoy file
    and creates standard [Psm][pyproteininference.physical.Psm] objects.

    Percolator-like output is tab delimited and formatted as follows:

    | PSMId                         | score    |  q-value    | posterior_error_prob  |  peptide                       | proteinIds          |                      |                      |                         | # noqa E501 W605
    |-------------------------------|----------|-------------|-----------------------|--------------------------------|---------------------|----------------------|----------------------|-------------------------| # noqa E501 W605
    |     116108.15139.15139.6.dta  |  3.44016 | 0.000479928 | 7.60258e-10           | K.MVVSMTLGLHPWIANIDDTQYLAAK.R  | CNDP1_HUMAN\|Q96KN2 | B4E180_HUMAN\|B4E180 | A8K1K1_HUMAN\|A8K1K1 | J3KRP0_HUMAN\|J3KRP0    | # noqa E501 W605

    Custom columns can be added and used as scoring input. Please see package documentation for more information.

    Attributes:
        target_file (str/list): Path to Target PSM result files.
        decoy_file (str/list): Path to Decoy PSM result files.
        combined_files (str/list): Path to Combined PSM result files.
        directory (str): Path to directory containing combined PSM result files.
        psms (list): List of [Psm][pyproteininference.physical.Psm] objects.
        load_custom_score (bool): True/False on whether or not to load a custom score. Depends on scoring_variable.
        scoring_variable (str): String to indicate which column in the input file is to be used as the scoring input.
        digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
        parameter_file_object (ProteinInferenceParameter):
            [ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter] object
        append_alt_from_db (bool): Whether or not to append alternative proteins found in the database that
            are not in the input files.



    """

    PSMID = "PSMId"
    SCORE = "score"
    Q_VALUE = "q-value"
    POSTERIOR_ERROR_PROB = "posterior_error_prob"
    PEPTIDE = "peptide"
    PROTEIN_IDS = "proteinIds"
    ALTERNATIVE_PROTEINS = "alternative_proteins"

    def __init__(
        self,
        digest,
        parameter_file_object,
        append_alt_from_db=True,
        target_file=None,
        decoy_file=None,
        combined_files=None,
        directory=None,
    ):
        """

        Args:
            digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
            parameter_file_object (ProteinInferenceParameter):
                [ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter] object.
            append_alt_from_db (bool): Whether or not to append alternative proteins found in the database that
                are not in the input files.
            target_file (str/list): Path to Target PSM result files.
            decoy_file (str/list): Path to Decoy PSM result files.
            combined_files (str/list): Path to Combined PSM result files.
            directory (str): Path to directory containing combined PSM result files.

        Returns:
            Reader: [Reader][pyproteininference.reader.Reader] object.

        Example:
            >>> pyproteininference.reader.GenericReader(target_file = "example_target.txt",
            >>>     decoy_file = "example_decoy.txt",
            >>>     digest=digest, parameter_file_object=pi_params)
        """
        self.target_file = target_file
        self.decoy_file = decoy_file
        self.combined_files = combined_files
        self.directory = directory

        self.psms = None
        self.search_id = None
        self.digest = digest
        self.initial_protein_peptide_map = copy.copy(self.digest.protein_to_peptide_dictionary)
        self.load_custom_score = False

        self.append_alt_from_db = append_alt_from_db

        self.parameter_file_object = parameter_file_object
        self.scoring_variable = parameter_file_object.psm_score

        self._validate_input()

        if self.scoring_variable != self.Q_VALUE and self.scoring_variable != self.POSTERIOR_ERROR_PROB:
            self.load_custom_score = True
            logger.info(
                "Pulling custom column based on parameter file input for score, Column: {}".format(
                    self.scoring_variable
                )
            )
        else:
            logger.info(
                "Pulling no custom columns based on parameter file input for score, using standard Column: {}".format(
                    self.scoring_variable
                )
            )

        # If we select to not run inference at all
        if self.parameter_file_object.inference_type == Inference.FIRST_PROTEIN:
            # Only allow 1 Protein per PSM
            self.MAX_ALLOWED_ALTERNATIVE_PROTEINS = 1

    def read_psms(self):
        """
        Method to read psms from the input files and to transform them into a list of
        [Psm][pyproteininference.physical.Psm] objects.

        This method sets the `psms` variable, which is a list of Psm objects.

        This method must be run before initializing the [DataStore object][pyproteininference.datastore.DataStore].

        Example:
            >>> reader = pyproteininference.reader.GenericReader(target_file = "example_target.txt",
            >>>     decoy_file = "example_decoy.txt",
            >>>     digest=digest, parameter_file_object=pi_params)
            >>> reader.read_psms()

        """
        logger.info("Reading in Input Files using Generic Reader...")
        # Read in and split by line
        # If target_file is a list... read them all in and concatenate...
        if self.target_file and self.decoy_file:
            if isinstance(self.target_file, (list,)):
                all_target = []
                for t_files in self.target_file:
                    ptarg = []
                    with open(t_files, "r") as psm_target_file:
                        logger.info(t_files)
                        spamreader = csv.DictReader(psm_target_file, delimiter="\t")
                        for row in spamreader:
                            row = self.get_alternative_proteins_from_input(row)
                            ptarg.append(row)
                    all_target = all_target + ptarg
            else:
                # If not just read the file...
                ptarg = []
                with open(self.target_file, "r") as psm_target_file:
                    logger.info(self.target_file)
                    spamreader = csv.DictReader(psm_target_file, delimiter="\t")
                    for row in spamreader:
                        row = self.get_alternative_proteins_from_input(row)
                        ptarg.append(row)
                all_target = ptarg

            # Repeat for decoy file
            if isinstance(self.decoy_file, (list,)):
                all_decoy = []
                for d_files in self.decoy_file:
                    pdec = []
                    with open(d_files, "r") as psm_decoy_file:
                        logger.info(d_files)
                        spamreader = csv.DictReader(psm_decoy_file, delimiter="\t")
                        for row in spamreader:
                            row = self.get_alternative_proteins_from_input(row)
                            pdec.append(row)
                    all_decoy = all_decoy + pdec
            else:
                pdec = []
                with open(self.decoy_file, "r") as psm_decoy_file:
                    logger.info(self.decoy_file)
                    spamreader = csv.DictReader(psm_decoy_file, delimiter="\t")
                    for row in spamreader:
                        row = self.get_alternative_proteins_from_input(row)
                        pdec.append(row)
                all_decoy = pdec

            # Combine the lists
            all_psms = all_target + all_decoy

        elif self.combined_files:
            if isinstance(self.combined_files, (list,)):
                all = []
                for c_files in self.combined_files:
                    c_all = []
                    with open(c_files, "r") as psm_file:
                        logger.info(c_files)
                        spamreader = csv.DictReader(psm_file, delimiter="\t")
                        for row in spamreader:
                            row = self.get_alternative_proteins_from_input(row)
                            c_all.append(row)
                    all = all + c_all
            else:
                c_all = []
                with open(self.combined_files, "r") as psm_file:
                    logger.info(self.combined_files)
                    spamreader = csv.DictReader(psm_file, delimiter="\t")
                    for row in spamreader:
                        row = self.get_alternative_proteins_from_input(row)
                        c_all.append(row)
                all = c_all
            all_psms = all

        elif self.directory:
            all_files = os.listdir(self.directory)
            all = []
            for files in all_files:
                psm_per_file = []
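                # os.listdir returns bare file names, so join them with the directory before opening.
                file_path = os.path.join(self.directory, files)
                with open(file_path, "r") as psm_file:
                    logger.info(file_path)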
                    spamreader = csv.DictReader(psm_file, delimiter="\t")
                    for row in spamreader:
                        row = self.get_alternative_proteins_from_input(row)
                        psm_per_file.append(row)
                all = all + psm_per_file
            all_psms = all

        psms_all_filtered = []
        for psms in all_psms:
            if self.POSTERIOR_ERROR_PROB in psms.keys():
                try:
                    float(psms[self.POSTERIOR_ERROR_PROB])
                    psms_all_filtered.append(psms)
                except ValueError as e:  # noqa F841
                    pass
            else:
                try:
                    float(psms[self.scoring_variable])
                    psms_all_filtered.append(psms)
                except ValueError as e:  # noqa F841
                    pass

        # Sort by posterior error probability (PEP); fall back to the scoring variable if PEP is absent
        try:
            logger.info("Sorting by {}".format(self.POSTERIOR_ERROR_PROB))
            all_psms = sorted(
                psms_all_filtered,
                key=lambda x: float(x[self.POSTERIOR_ERROR_PROB]),
                reverse=False,
            )
        except KeyError:
            logger.info("Cannot Sort by {} the values do not exist".format(self.POSTERIOR_ERROR_PROB))
            logger.info("Sorting by {}".format(self.scoring_variable))
            if self.parameter_file_object.psm_score_type == Score.ADDITIVE_SCORE_TYPE:
                all_psms = sorted(
                    psms_all_filtered,
                    key=lambda x: float(x[self.scoring_variable]),
                    reverse=True,
                )
            if self.parameter_file_object.psm_score_type == Score.MULTIPLICATIVE_SCORE_TYPE:
                all_psms = sorted(
                    psms_all_filtered,
                    key=lambda x: float(x[self.scoring_variable]),
                    reverse=False,
                )

        list_of_psm_objects = []
        peptide_tracker = set()
        all_sp_proteins = set(self.digest.swiss_prot_protein_set)
        # We only want to get unique peptides... using all messes up scoring...
        # Create Psm objects with the identifier, percscore, qvalue, pepvalue, and possible proteins...

        peptide_to_protein_dictionary = self.digest.peptide_to_protein_dictionary

        initial_poss_prots = []
        logger.info("Number of PSMs in the input data: {}".format(len(all_psms)))
        psms_with_alternative_proteins = self._find_psms_with_alternative_proteins(raw_psms=all_psms)
        logger.info(
            "Number of PSMs that have alternative proteins in the input data {}".format(
                len(psms_with_alternative_proteins)
            )
        )
        if len(psms_with_alternative_proteins) == 0:
            logger.warning(
                "No PSMs in the input have alternative proteins. "
                "Make sure your input is properly formatted. "
                "Alternative Proteins will be retrieved from the fasta database"
            )
        for psm_info in all_psms:
            current_peptide = psm_info[self.PEPTIDE]
            # Define the Psm...
            if current_peptide not in peptide_tracker:
                psm = Psm(identifier=current_peptide)
                # Attempt to add variables from PSM info...
                # If they do not exist in the psm info then we skip...
                try:
                    psm.percscore = float(psm_info[self.SCORE])
                except KeyError:
                    pass
                try:
                    psm.qvalue = float(psm_info[self.Q_VALUE])
                except KeyError:
                    pass
                try:
                    psm.pepvalue = float(psm_info[self.POSTERIOR_ERROR_PROB])
                except KeyError:
                    pass
                # If user has a custom score IE not q-value or pep_value...
                if self.load_custom_score:
                    # Then we look for it...
                    psm.custom_score = float(psm_info[self.scoring_variable])
                psm.possible_proteins = []
                psm.possible_proteins.append(psm_info[self.PROTEIN_IDS])
                psm.possible_proteins = psm.possible_proteins + [x for x in psm_info[self.ALTERNATIVE_PROTEINS] if x]
                # Remove potential Repeats
                if self.parameter_file_object.inference_type != Inference.FIRST_PROTEIN:
                    psm.possible_proteins = sorted(list(set(psm.possible_proteins)))

                input_poss_prots = copy.copy(psm.possible_proteins)

                # Get PSM ID
                psm.psm_id = psm_info[self.PSMID]

                # Split peptide if flanking
                current_peptide = Psm.split_peptide(peptide_string=current_peptide)

                if not current_peptide.isupper() or not current_peptide.isalpha():
                    # If we have mods remove them...
                    peptide_string = current_peptide.upper()
                    stripped_peptide = Psm.remove_peptide_mods(peptide_string)
                    current_peptide = stripped_peptide
                # Add the other possible_proteins from insilicodigest here...
                try:
                    current_alt_proteins = sorted(list(peptide_to_protein_dictionary[current_peptide]))
                except KeyError:
                    current_alt_proteins = []
                    logger.debug(
                        "Peptide {} was not found in the supplied DB for Proteins {}".format(
                            current_peptide, ";".join(psm.possible_proteins)
                        )
                    )
                    for poss_prot in psm.possible_proteins:
                        self.digest.peptide_to_protein_dictionary.setdefault(current_peptide, set()).add(poss_prot)
                        self.digest.protein_to_peptide_dictionary.setdefault(poss_prot, set()).add(current_peptide)
                        logger.debug(
                            "Adding Peptide {} and Protein {} to Digest dictionaries".format(current_peptide, poss_prot)
                        )

                # Sort Alt Proteins by Swissprot then Trembl...
                identifiers_sorted = DataStore.sort_protein_strings(
                    protein_string_list=current_alt_proteins,
                    sp_proteins=all_sp_proteins,
                    decoy_symbol=self.parameter_file_object.decoy_symbol,
                )

                # Restrict to 50 possible proteins
                psm = self._fix_alternative_proteins(
                    append_alt_from_db=self.append_alt_from_db,
                    identifiers_sorted=identifiers_sorted,
                    max_proteins=self.MAX_ALLOWED_ALTERNATIVE_PROTEINS,
                    psm=psm,
                    parameter_file_object=self.parameter_file_object,
                )

                list_of_psm_objects.append(psm)
                peptide_tracker.add(current_peptide)

                initial_poss_prots.append(input_poss_prots)

        self.psms = list_of_psm_objects

        self._check_initial_database_overlap(
            initial_possible_proteins=initial_poss_prots, initial_protein_peptide_map=self.initial_protein_peptide_map
        )

        logger.info("Length of PSM Data: {}".format(len(self.psms)))

        logger.info("Finished GenericReader.read_psms...")

    def _find_psms_with_alternative_proteins(self, raw_psms):

        psms_with_alternative_proteins = [x for x in raw_psms if x["alternative_proteins"]]

        return psms_with_alternative_proteins

__init__(self, digest, parameter_file_object, append_alt_from_db=True, target_file=None, decoy_file=None, combined_files=None, directory=None) special

Parameters:
  • digest (Digest) – Digest Object.

  • parameter_file_object (ProteinInferenceParameter) – ProteinInferenceParameter object.

  • append_alt_from_db (bool) – Whether or not to append alternative proteins found in the database that are not in the input files.

  • target_file (str/list) – Path to Target PSM result files.

  • decoy_file (str/list) – Path to Decoy PSM result files.

  • combined_files (str/list) – Path to Combined PSM result files.

  • directory (str) – Path to directory containing combined PSM result files.

Returns:
  • Reader – Reader object.

Examples:

>>> pyproteininference.reader.GenericReader(target_file="example_target.txt",
...     decoy_file="example_decoy.txt",
...     digest=digest, parameter_file_object=pi_params)
Source code in pyproteininference/reader.py
def __init__(
    self,
    digest,
    parameter_file_object,
    append_alt_from_db=True,
    target_file=None,
    decoy_file=None,
    combined_files=None,
    directory=None,
):
    """

    Args:
        digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
        parameter_file_object (ProteinInferenceParameter):
            [ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter] object.
        append_alt_from_db (bool): Whether or not to append alternative proteins found in the database that
            are not in the input files.
        target_file (str/list): Path to Target PSM result files.
        decoy_file (str/list): Path to Decoy PSM result files.
        combined_files (str/list): Path to Combined PSM result files.
        directory (str): Path to directory containing combined PSM result files.

    Returns:
        Reader: [Reader][pyproteininference.reader.Reader] object.

    Example:
        >>> pyproteininference.reader.GenericReader(target_file="example_target.txt",
        ...     decoy_file="example_decoy.txt",
        ...     digest=digest, parameter_file_object=pi_params)
    """
    self.target_file = target_file
    self.decoy_file = decoy_file
    self.combined_files = combined_files
    self.directory = directory

    self.psms = None
    self.search_id = None
    self.digest = digest
    self.initial_protein_peptide_map = copy.copy(self.digest.protein_to_peptide_dictionary)
    self.load_custom_score = False

    self.append_alt_from_db = append_alt_from_db

    self.parameter_file_object = parameter_file_object
    self.scoring_variable = parameter_file_object.psm_score

    self._validate_input()

    if self.scoring_variable != self.Q_VALUE and self.scoring_variable != self.POSTERIOR_ERROR_PROB:
        self.load_custom_score = True
        logger.info(
            "Pulling custom column based on parameter file input for score, Column: {}".format(
                self.scoring_variable
            )
        )
    else:
        logger.info(
            "Pulling no custom columns based on parameter file input for score, using standard Column: {}".format(
                self.scoring_variable
            )
        )

    # If we select to not run inference at all
    if self.parameter_file_object.inference_type == Inference.FIRST_PROTEIN:
        # Only allow 1 Protein per PSM
        self.MAX_ALLOWED_ALTERNATIVE_PROTEINS = 1

read_psms(self)

Method to read psms from the input files and to transform them into a list of Psm objects.

This method sets the psms variable, which is a list of Psm objects.

This method must be run before initializing the DataStore object.

Examples:

>>> reader = pyproteininference.reader.GenericReader(target_file="example_target.txt",
...     decoy_file="example_decoy.txt",
...     digest=digest, parameter_file_object=pi_params)
>>> reader.read_psms()
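
The filter-and-sort step that read_psms performs before building Psm objects can be sketched in isolation. This is a minimal illustration, not the library's implementation: the input rows, the column names, and the helper `filter_and_sort` are assumptions made up for the example.

```python
import csv
import io

# Hypothetical tab-delimited input; the column names are assumptions for illustration.
RAW = (
    "PSMId\tposterior_error_prob\tpeptide\n"
    "psm_1\t0.01\tR.PEPTIDEK.A\n"
    "psm_2\tNA\tK.BADROWK.L\n"
    "psm_3\t0.0005\tK.ANOTHERK.G\n"
)


def filter_and_sort(rows, score_column, lower_is_better=True):
    """Keep rows whose score parses as a float, then sort them best-first."""
    kept = []
    for row in rows:
        try:
            float(row[score_column])
        except ValueError:
            # Non-numeric scores (for example "NA") are dropped.
            continue
        kept.append(row)
    return sorted(
        kept,
        key=lambda r: float(r[score_column]),
        reverse=not lower_is_better,
    )


rows = list(csv.DictReader(io.StringIO(RAW), delimiter="\t"))
ordered = filter_and_sort(rows, "posterior_error_prob")
print([r["PSMId"] for r in ordered])  # ['psm_3', 'psm_1']
```

Sorting best-first matters because only the first occurrence of each peptide is kept when the Psm objects are created, so the best-scoring PSM per peptide is the one that survives.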
Source code in pyproteininference/reader.py
def read_psms(self):
    """
    Method to read psms from the input files and to transform them into a list of
    [Psm][pyproteininference.physical.Psm] objects.

    This method sets the `psms` variable, which is a list of Psm objects.

    This method must be run before initializing the [DataStore object][pyproteininference.datastore.DataStore].

    Example:
        >>> reader = pyproteininference.reader.GenericReader(target_file="example_target.txt",
        ...     decoy_file="example_decoy.txt",
        ...     digest=digest, parameter_file_object=pi_params)
        >>> reader.read_psms()

    """
    logger.info("Reading in Input Files using Generic Reader...")
    # Read in and split by line
    # If target_file is a list... read them all in and concatenate...
    if self.target_file and self.decoy_file:
        if isinstance(self.target_file, (list,)):
            all_target = []
            for t_files in self.target_file:
                ptarg = []
                with open(t_files, "r") as psm_target_file:
                    logger.info(t_files)
                    spamreader = csv.DictReader(psm_target_file, delimiter="\t")
                    for row in spamreader:
                        row = self.get_alternative_proteins_from_input(row)
                        ptarg.append(row)
                all_target = all_target + ptarg
        else:
            # If not just read the file...
            ptarg = []
            with open(self.target_file, "r") as psm_target_file:
                logger.info(self.target_file)
                spamreader = csv.DictReader(psm_target_file, delimiter="\t")
                for row in spamreader:
                    row = self.get_alternative_proteins_from_input(row)
                    ptarg.append(row)
            all_target = ptarg

        # Repeat for decoy file
        if isinstance(self.decoy_file, (list,)):
            all_decoy = []
            for d_files in self.decoy_file:
                pdec = []
                with open(d_files, "r") as psm_decoy_file:
                    logger.info(d_files)
                    spamreader = csv.DictReader(psm_decoy_file, delimiter="\t")
                    for row in spamreader:
                        row = self.get_alternative_proteins_from_input(row)
                        pdec.append(row)
                all_decoy = all_decoy + pdec
        else:
            pdec = []
            with open(self.decoy_file, "r") as psm_decoy_file:
                logger.info(self.decoy_file)
                spamreader = csv.DictReader(psm_decoy_file, delimiter="\t")
                for row in spamreader:
                    row = self.get_alternative_proteins_from_input(row)
                    pdec.append(row)
            all_decoy = pdec

        # Combine the lists
        all_psms = all_target + all_decoy

    elif self.combined_files:
        if isinstance(self.combined_files, (list,)):
            all = []
            for c_files in self.combined_files:
                c_all = []
                with open(c_files, "r") as psm_file:
                    logger.info(c_files)
                    spamreader = csv.DictReader(psm_file, delimiter="\t")
                    for row in spamreader:
                        row = self.get_alternative_proteins_from_input(row)
                        c_all.append(row)
                all = all + c_all
        else:
            c_all = []
            with open(self.combined_files, "r") as psm_file:
                logger.info(self.combined_files)
                spamreader = csv.DictReader(psm_file, delimiter="\t")
                for row in spamreader:
                    row = self.get_alternative_proteins_from_input(row)
                    c_all.append(row)
            all = c_all
        all_psms = all

    elif self.directory:
        all_files = os.listdir(self.directory)
        all = []
        for files in all_files:
            psm_per_file = []
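            # os.listdir returns bare file names, so join them with the directory
            with open(os.path.join(self.directory, files), "r") as psm_file: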
                logger.info(files)
                spamreader = csv.DictReader(psm_file, delimiter="\t")
                for row in spamreader:
                    row = self.get_alternative_proteins_from_input(row)
                    psm_per_file.append(row)
            all = all + psm_per_file
        all_psms = all

    psms_all_filtered = []
    for psms in all_psms:
        if self.POSTERIOR_ERROR_PROB in psms.keys():
            try:
                float(psms[self.POSTERIOR_ERROR_PROB])
                psms_all_filtered.append(psms)
            except ValueError as e:  # noqa F841
                pass
        else:
            try:
                float(psms[self.scoring_variable])
                psms_all_filtered.append(psms)
            except ValueError as e:  # noqa F841
                pass

    # Sort by pep (ascending)
    try:
        logger.info("Sorting by {}".format(self.POSTERIOR_ERROR_PROB))
        all_psms = sorted(
            psms_all_filtered,
            key=lambda x: float(x[self.POSTERIOR_ERROR_PROB]),
            reverse=False,
        )
    except KeyError:
        logger.info("Cannot Sort by {} the values do not exist".format(self.POSTERIOR_ERROR_PROB))
        logger.info("Sorting by {}".format(self.scoring_variable))
        if self.parameter_file_object.psm_score_type == Score.ADDITIVE_SCORE_TYPE:
            all_psms = sorted(
                psms_all_filtered,
                key=lambda x: float(x[self.scoring_variable]),
                reverse=True,
            )
        if self.parameter_file_object.psm_score_type == Score.MULTIPLICATIVE_SCORE_TYPE:
            all_psms = sorted(
                psms_all_filtered,
                key=lambda x: float(x[self.scoring_variable]),
                reverse=False,
            )

    list_of_psm_objects = []
    peptide_tracker = set()
    all_sp_proteins = set(self.digest.swiss_prot_protein_set)
    # We only want to get unique peptides... using all messes up scoring...
    # Create Psm objects with the identifier, percscore, qvalue, pepvalue, and possible proteins...

    peptide_to_protein_dictionary = self.digest.peptide_to_protein_dictionary

    initial_poss_prots = []
    logger.info("Number of PSMs in the input data: {}".format(len(all_psms)))
    psms_with_alternative_proteins = self._find_psms_with_alternative_proteins(raw_psms=all_psms)
    logger.info(
        "Number of PSMs that have alternative proteins in the input data {}".format(
            len(psms_with_alternative_proteins)
        )
    )
    if len(psms_with_alternative_proteins) == 0:
        logger.warning(
            "No PSMs in the input have alternative proteins. "
            "Make sure your input is properly formatted. "
            "Alternative Proteins will be retrieved from the fasta database"
        )
    for psm_info in all_psms:
        current_peptide = psm_info[self.PEPTIDE]
        # Define the Psm...
        if current_peptide not in peptide_tracker:
            psm = Psm(identifier=current_peptide)
            # Attempt to add variables from PSM info...
            # If they do not exist in the psm info then we skip...
            try:
                psm.percscore = float(psm_info[self.SCORE])
            except KeyError:
                pass
            try:
                psm.qvalue = float(psm_info[self.Q_VALUE])
            except KeyError:
                pass
            try:
                psm.pepvalue = float(psm_info[self.POSTERIOR_ERROR_PROB])
            except KeyError:
                pass
            # If user has a custom score IE not q-value or pep_value...
            if self.load_custom_score:
                # Then we look for it...
                psm.custom_score = float(psm_info[self.scoring_variable])
            psm.possible_proteins = []
            psm.possible_proteins.append(psm_info[self.PROTEIN_IDS])
            psm.possible_proteins = psm.possible_proteins + [x for x in psm_info[self.ALTERNATIVE_PROTEINS] if x]
            # Remove potential Repeats
            if self.parameter_file_object.inference_type != Inference.FIRST_PROTEIN:
                psm.possible_proteins = sorted(list(set(psm.possible_proteins)))

            input_poss_prots = copy.copy(psm.possible_proteins)

            # Get PSM ID
            psm.psm_id = psm_info[self.PSMID]

            # Split peptide if flanking
            current_peptide = Psm.split_peptide(peptide_string=current_peptide)

            if not current_peptide.isupper() or not current_peptide.isalpha():
                # If we have mods remove them...
                peptide_string = current_peptide.upper()
                stripped_peptide = Psm.remove_peptide_mods(peptide_string)
                current_peptide = stripped_peptide
            # Add the other possible_proteins from insilicodigest here...
            try:
                current_alt_proteins = sorted(list(peptide_to_protein_dictionary[current_peptide]))
            except KeyError:
                current_alt_proteins = []
                logger.debug(
                    "Peptide {} was not found in the supplied DB for Proteins {}".format(
                        current_peptide, ";".join(psm.possible_proteins)
                    )
                )
                for poss_prot in psm.possible_proteins:
                    self.digest.peptide_to_protein_dictionary.setdefault(current_peptide, set()).add(poss_prot)
                    self.digest.protein_to_peptide_dictionary.setdefault(poss_prot, set()).add(current_peptide)
                    logger.debug(
                        "Adding Peptide {} and Protein {} to Digest dictionaries".format(current_peptide, poss_prot)
                    )

            # Sort Alt Proteins by Swissprot then Trembl...
            identifiers_sorted = DataStore.sort_protein_strings(
                protein_string_list=current_alt_proteins,
                sp_proteins=all_sp_proteins,
                decoy_symbol=self.parameter_file_object.decoy_symbol,
            )

            # Restrict to 50 possible proteins
            psm = self._fix_alternative_proteins(
                append_alt_from_db=self.append_alt_from_db,
                identifiers_sorted=identifiers_sorted,
                max_proteins=self.MAX_ALLOWED_ALTERNATIVE_PROTEINS,
                psm=psm,
                parameter_file_object=self.parameter_file_object,
            )

            list_of_psm_objects.append(psm)
            peptide_tracker.add(current_peptide)

            initial_poss_prots.append(input_poss_prots)

    self.psms = list_of_psm_objects

    self._check_initial_database_overlap(
        initial_possible_proteins=initial_poss_prots, initial_protein_peptide_map=self.initial_protein_peptide_map
    )

    logger.info("Length of PSM Data: {}".format(len(self.psms)))

    logger.info("Finished GenericReader.read_psms...")

PercolatorReader (Reader)

The following class takes a Percolator target file and a Percolator decoy file, or combined files or a directory, and creates standard Psm objects. This reader class is used as input for the DataStore object.

Percolator output is formatted as follows, with each entry being tab-delimited:

| PSMId | score | q-value | posterior_error_prob | peptide | proteinIds | | | |
|---|---|---|---|---|---|---|---|---|
| 116108.15139.15139.6.dta | 3.44016 | 0.000479928 | 7.60258e-10 | K.MVVSMTLGLHPWIANIDDTQYLAAK.R | CNDP1_HUMAN\|Q96KN2 | B4E180_HUMAN\|B4E180 | A8K1K1_HUMAN\|A8K1K1 | J3KRP0_HUMAN\|J3KRP0 |
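
To make the column layout concrete, the sketch below splits the example row by position. The index constants mirror the class attributes documented for PercolatorReader. The function `parse_percolator_row` and the keys of the dictionary it returns are hypothetical names used only for this illustration.

```python
# Column positions mirroring the PercolatorReader class attributes.
PSMID_INDEX = 0
PERC_SCORE_INDEX = 1
Q_VALUE_INDEX = 2
POSTERIOR_ERROR_PROB_INDEX = 3
PEPTIDE_INDEX = 4
PROTEINIDS_INDEX = 5


def parse_percolator_row(line):
    """Split one tab-delimited Percolator line into typed fields (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    return {
        "psm_id": fields[PSMID_INDEX],
        "score": float(fields[PERC_SCORE_INDEX]),
        "qvalue": float(fields[Q_VALUE_INDEX]),
        "pep": float(fields[POSTERIOR_ERROR_PROB_INDEX]),
        "peptide": fields[PEPTIDE_INDEX],
        # Every column from proteinIds onward holds one protein identifier.
        "proteins": sorted(set(fields[PROTEINIDS_INDEX:])),
    }


line = "\t".join(
    [
        "116108.15139.15139.6.dta",
        "3.44016",
        "0.000479928",
        "7.60258e-10",
        "K.MVVSMTLGLHPWIANIDDTQYLAAK.R",
        "CNDP1_HUMAN|Q96KN2",
        "B4E180_HUMAN|B4E180",
        "A8K1K1_HUMAN|A8K1K1",
        "J3KRP0_HUMAN|J3KRP0",
    ]
)
parsed = parse_percolator_row(line)
print(parsed["proteins"])
```

Because the number of protein columns varies from row to row, the rows cannot be read with a fixed header, which is why the reader works with positional indices rather than column names.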

Attributes:

Name Type Description
target_file str/list

Path to Target PSM result files.

decoy_file str/list

Path to Decoy PSM result files.

combined_files str/list

Path to Combined PSM result files.

directory str

Path to directory containing combined PSM result files.

PSMID_INDEX int

Index of the PSMId from the input files.

PERC_SCORE_INDEX int

Index of the Percolator score from the input files.

Q_VALUE_INDEX int

Index of the q-value from the input files.

POSTERIOR_ERROR_PROB_INDEX int

Index of the posterior error probability from the input files.

PEPTIDE_INDEX int

Index of the peptides from the input files.

PROTEINIDS_INDEX int

Index of the proteins from the input files.

psms list

List of Psm objects.

Source code in pyproteininference/reader.py
class PercolatorReader(Reader):
    """
    The following class takes a percolator target file and a percolator decoy file
    or combined files/directory and creates standard [Psm][pyproteininference.physical.Psm] objects.
    This reader class is used as input for [DataStore object][pyproteininference.datastore.DataStore].

    Percolator output is formatted as follows,
    with each entry being tab-delimited.

    | PSMId                         | score    |  q-value    | posterior_error_prob  |  peptide                       | proteinIds          |                      |                      |                         | # noqa E501 W605
    |-------------------------------|----------|-------------|-----------------------|--------------------------------|---------------------|----------------------|----------------------|-------------------------| # noqa E501 W605
    |     116108.15139.15139.6.dta  |  3.44016 | 0.000479928 | 7.60258e-10           | K.MVVSMTLGLHPWIANIDDTQYLAAK.R  | CNDP1_HUMAN\|Q96KN2 | B4E180_HUMAN\|B4E180 | A8K1K1_HUMAN\|A8K1K1 | J3KRP0_HUMAN\|J3KRP0    | # noqa E501 W605

    Attributes:
        target_file (str/list): Path to Target PSM result files.
        decoy_file (str/list): Path to Decoy PSM result files.
        combined_files (str/list): Path to Combined PSM result files.
        directory (str): Path to directory containing combined PSM result files.
        PSMID_INDEX (int): Index of the PSMId from the input files.
        PERC_SCORE_INDEX (int): Index of the Percolator score from the input files.
        Q_VALUE_INDEX (int): Index of the q-value from the input files.
        POSTERIOR_ERROR_PROB_INDEX (int): Index of the posterior error probability from the input files.
        PEPTIDE_INDEX (int): Index of the peptides from the input files.
        PROTEINIDS_INDEX (int): Index of the proteins from the input files.
        psms (list): List of [Psm][pyproteininference.physical.Psm] objects.

    """

    PSMID_INDEX = 0
    PERC_SCORE_INDEX = 1
    Q_VALUE_INDEX = 2
    POSTERIOR_ERROR_PROB_INDEX = 3
    PEPTIDE_INDEX = 4
    PROTEINIDS_INDEX = 5

    def __init__(
        self,
        digest,
        parameter_file_object,
        append_alt_from_db=True,
        target_file=None,
        decoy_file=None,
        combined_files=None,
        directory=None,
    ):
        """

        Args:
            digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
            parameter_file_object (ProteinInferenceParameter):
                [ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter].
            append_alt_from_db (bool): Whether or not to append alternative proteins found in the database that
                are not in the input files.
            target_file (str/list): Path to Target PSM result files.
            decoy_file (str/list): Path to Decoy PSM result files.
            combined_files (str/list): Path to Combined PSM result files.
            directory (str): Path to directory containing combined PSM result files.

        Returns:
            Reader: [Reader][pyproteininference.reader.Reader] object.

        Example:
            >>> pyproteininference.reader.PercolatorReader(target_file="example_target.txt",
            ...     decoy_file="example_decoy.txt", digest=digest, parameter_file_object=pi_params)
        """
        self.target_file = target_file
        self.decoy_file = decoy_file
        self.combined_files = combined_files
        self.directory = directory
        # Define indices based on input

        self.psms = None
        self.search_id = None
        self.digest = digest
        self.initial_protein_peptide_map = copy.copy(self.digest.protein_to_peptide_dictionary)
        self.append_alt_from_db = append_alt_from_db

        self.parameter_file_object = parameter_file_object

        self._validate_input()

    def read_psms(self):
        """
        Method to read psms from the input files and to transform them into a list of
        [Psm][pyproteininference.physical.Psm] objects.

        This method sets the `psms` variable, which is a list of Psm objects.

        This method must be run before initializing the [DataStore object][pyproteininference.datastore.DataStore].

        Example:
            >>> reader = pyproteininference.reader.PercolatorReader(target_file="example_target.txt",
            ...     decoy_file="example_decoy.txt",
            ...     digest=digest, parameter_file_object=pi_params)
            >>> reader.read_psms()

        """
        # Read in and split by line
        if self.target_file and self.decoy_file:
            # If target_file is a list... read them all in and concatenate...
            if isinstance(self.target_file, (list,)):
                all_target = []
                for t_files in self.target_file:
                    logger.info(t_files)
                    ptarg = []
                    with open(t_files, "r") as perc_target_file:
                        spamreader = csv.reader(perc_target_file, delimiter="\t")
                        for row in spamreader:
                            ptarg.append(row)
                    del ptarg[0]
                    all_target = all_target + ptarg
            elif self.target_file:
                # If not just read the file...
                ptarg = []
                with open(self.target_file, "r") as perc_target_file:
                    spamreader = csv.reader(perc_target_file, delimiter="\t")
                    for row in spamreader:
                        ptarg.append(row)
                del ptarg[0]
                all_target = ptarg

            # Repeat for decoy file
            if isinstance(self.decoy_file, (list,)):
                all_decoy = []
                for d_files in self.decoy_file:
                    logger.info(d_files)
                    pdec = []
                    with open(d_files, "r") as perc_decoy_file:
                        spamreader = csv.reader(perc_decoy_file, delimiter="\t")
                        for row in spamreader:
                            pdec.append(row)
                    del pdec[0]
                    all_decoy = all_decoy + pdec
            elif self.decoy_file:
                pdec = []
                with open(self.decoy_file, "r") as perc_decoy_file:
                    spamreader = csv.reader(perc_decoy_file, delimiter="\t")
                    for row in spamreader:
                        pdec.append(row)
                del pdec[0]
                all_decoy = pdec

            # Combine the lists
            perc_all = all_target + all_decoy

        elif self.combined_files:
            if isinstance(self.combined_files, (list,)):
                all = []
                for f in self.combined_files:
                    logger.info(f)
                    combined_psm_result_rows = []
                    with open(f, "r") as perc_files:
                        spamreader = csv.reader(perc_files, delimiter="\t")
                        for row in spamreader:
                            combined_psm_result_rows.append(row)
                    del combined_psm_result_rows[0]
                    all = all + combined_psm_result_rows
            elif self.combined_files:
                # If not just read the file...
                combined_psm_result_rows = []
                with open(self.combined_files, "r") as perc_files:
                    spamreader = csv.reader(perc_files, delimiter="\t")
                    for row in spamreader:
                        combined_psm_result_rows.append(row)
                del combined_psm_result_rows[0]
                all = combined_psm_result_rows
            perc_all = all

        elif self.directory:

            all_files = os.listdir(self.directory)
            all = []
            for files in all_files:
                logger.info(files)
                combined_psm_result_rows = []
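                # os.listdir returns bare file names, so join them with the directory
                with open(os.path.join(self.directory, files), "r") as perc_file: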
                    spamreader = csv.reader(perc_file, delimiter="\t")
                    for row in spamreader:
                        combined_psm_result_rows.append(row)
                del combined_psm_result_rows[0]
                all = all + combined_psm_result_rows
            perc_all = all

        peptide_to_protein_dictionary = self.digest.peptide_to_protein_dictionary

        perc_all_filtered = []
        for psms in perc_all:
            try:
                float(psms[self.POSTERIOR_ERROR_PROB_INDEX])
                perc_all_filtered.append(psms)
            except ValueError as e:  # noqa F841
                pass

        # Sort by pep (ascending)
        perc_all = sorted(
            perc_all_filtered,
            key=lambda x: float(x[self.POSTERIOR_ERROR_PROB_INDEX]),
            reverse=False,
        )

        list_of_psm_objects = []
        peptide_tracker = set()
        all_sp_proteins = set(self.digest.swiss_prot_protein_set)
        # We only want to get unique peptides... using all messes up scoring...
        # Create Psm objects with the identifier, percscore, qvalue, pepvalue, and possible proteins...

        initial_poss_prots = []
        logger.info("Length of PSM Data: {}".format(len(perc_all)))
        for psm_info in perc_all:
            current_peptide = psm_info[self.PEPTIDE_INDEX]
            # Define the Psm...
            if current_peptide not in peptide_tracker:
                combined_psm_result_rows = Psm(identifier=current_peptide)
                # Add all the attributes
                combined_psm_result_rows.percscore = float(psm_info[self.PERC_SCORE_INDEX])
                combined_psm_result_rows.qvalue = float(psm_info[self.Q_VALUE_INDEX])
                combined_psm_result_rows.pepvalue = float(psm_info[self.POSTERIOR_ERROR_PROB_INDEX])
                if self.parameter_file_object.inference_type == Inference.FIRST_PROTEIN:
                    poss_proteins = [psm_info[self.PROTEINIDS_INDEX]]
                else:
                    poss_proteins = sorted(list(set(psm_info[self.PROTEINIDS_INDEX :])))  # noqa E203
                    poss_proteins = poss_proteins[: self.MAX_ALLOWED_ALTERNATIVE_PROTEINS]
                combined_psm_result_rows.possible_proteins = poss_proteins  # Restrict to 50 total possible proteins...
                combined_psm_result_rows.psm_id = psm_info[self.PSMID_INDEX]
                input_poss_prots = copy.copy(poss_proteins)

                # Split peptide if flanking
                current_peptide = Psm.split_peptide(peptide_string=current_peptide)

                if not current_peptide.isupper() or not current_peptide.isalpha():
                    # If we have mods remove them...
                    peptide_string = current_peptide.upper()
                    stripped_peptide = Psm.remove_peptide_mods(peptide_string)
                    current_peptide = stripped_peptide

                # Add the other possible_proteins from insilicodigest here...
                try:
                    current_alt_proteins = sorted(
                        list(peptide_to_protein_dictionary[current_peptide])
                    )  # This peptide needs to be scrubbed of Mods...
                except KeyError:
                    current_alt_proteins = []
                    logger.debug(
                        "Peptide {} was not found in the supplied DB with the following proteins {}".format(
                            current_peptide,
                            ";".join(combined_psm_result_rows.possible_proteins),
                        )
                    )
                    for poss_prot in combined_psm_result_rows.possible_proteins:
                        self.digest.peptide_to_protein_dictionary.setdefault(current_peptide, set()).add(poss_prot)
                        self.digest.protein_to_peptide_dictionary.setdefault(poss_prot, set()).add(current_peptide)
                        logger.debug(
                            "Adding Peptide {} and Protein {} to Digest dictionaries".format(current_peptide, poss_prot)
                        )

                # Sort Alt Proteins by Swissprot then Trembl...
                identifiers_sorted = DataStore.sort_protein_strings(
                    protein_string_list=current_alt_proteins,
                    sp_proteins=all_sp_proteins,
                    decoy_symbol=self.parameter_file_object.decoy_symbol,
                )

                # Restrict to 50 possible proteins
                combined_psm_result_rows = self._fix_alternative_proteins(
                    append_alt_from_db=self.append_alt_from_db,
                    identifiers_sorted=identifiers_sorted,
                    max_proteins=self.MAX_ALLOWED_ALTERNATIVE_PROTEINS,
                    psm=combined_psm_result_rows,
                    parameter_file_object=self.parameter_file_object,
                )

                # Remove blank alt proteins
                combined_psm_result_rows.possible_proteins = [
                    x for x in combined_psm_result_rows.possible_proteins if x != ""
                ]

                list_of_psm_objects.append(combined_psm_result_rows)
                peptide_tracker.add(current_peptide)

                initial_poss_prots.append(input_poss_prots)

        self.psms = list_of_psm_objects

        self._check_initial_database_overlap(
            initial_possible_proteins=initial_poss_prots, initial_protein_peptide_map=self.initial_protein_peptide_map
        )

        logger.info("Length of PSM Data: {}".format(len(self.psms)))

__init__(self, digest, parameter_file_object, append_alt_from_db=True, target_file=None, decoy_file=None, combined_files=None, directory=None) special

Parameters:
  • digest (Digest) – Digest Object.

  • parameter_file_object (ProteinInferenceParameter) – ProteinInferenceParameter.

  • append_alt_from_db (bool) – Whether or not to append alternative proteins found in the database that are not in the input files.

  • target_file (str/list) – Path to Target PSM result files.

  • decoy_file (str/list) – Path to Decoy PSM result files.

  • combined_files (str/list) – Path to Combined PSM result files.

  • directory (str) – Path to directory containing combined PSM result files.

Returns:
  • Reader – Reader object.

Examples:

>>> pyproteininference.reader.PercolatorReader(target_file="example_target.txt",
...     decoy_file="example_decoy.txt", digest=digest, parameter_file_object=pi_params)
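The constructor accepts three mutually exclusive input modes: a target file plus a decoy file, combined files, or a directory. The following is a minimal sketch of that rule, modelled on the `_validate_input` checks in the Reader source. The function name `classify_reader_input` and the returned mode strings are hypothetical and are not part of the library.

```python
def classify_reader_input(target_file=None, decoy_file=None, combined_files=None, directory=None):
    """Return the input mode selected by the arguments, or raise ValueError.

    Hypothetical helper mirroring the three branches of Reader._validate_input.
    """
    if target_file and decoy_file and not combined_files and not directory:
        return "target_decoy"
    if combined_files and not target_file and not decoy_file and not directory:
        return "combined_files"
    if directory and not combined_files and not decoy_file and not target_file:
        return "directory"
    # Any other combination (nothing supplied, or modes mixed) is rejected
    raise ValueError(
        "Supply exactly one of: target_file and decoy_file, combined_files, or directory"
    )


mode = classify_reader_input(target_file="example_target.txt", decoy_file="example_decoy.txt")
print(mode)
```

Supplying a target file without a decoy file, or mixing a directory with explicit files, falls through to the `ValueError` branch.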
Source code in pyproteininference/reader.py
def __init__(
    self,
    digest,
    parameter_file_object,
    append_alt_from_db=True,
    target_file=None,
    decoy_file=None,
    combined_files=None,
    directory=None,
):
    """

    Args:
        digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
        parameter_file_object (ProteinInferenceParameter):
            [ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter].
        append_alt_from_db (bool): Whether or not to append alternative proteins found in the database that
            are not in the input files.
        target_file (str/list): Path to Target PSM result files.
        decoy_file (str/list): Path to Decoy PSM result files.
        combined_files (str/list): Path to Combined PSM result files.
        directory (str): Path to directory containing combined PSM result files.

    Returns:
        Reader: [Reader][pyproteininference.reader.Reader] object.

    Example:
        >>> pyproteininference.reader.PercolatorReader(target_file="example_target.txt",
        ...     decoy_file="example_decoy.txt", digest=digest, parameter_file_object=pi_params)
    """
    self.target_file = target_file
    self.decoy_file = decoy_file
    self.combined_files = combined_files
    self.directory = directory
    # Define indices based on input

    self.psms = None
    self.search_id = None
    self.digest = digest
    self.initial_protein_peptide_map = copy.copy(self.digest.protein_to_peptide_dictionary)
    self.append_alt_from_db = append_alt_from_db

    self.parameter_file_object = parameter_file_object

    self._validate_input()

read_psms(self)

Method to read psms from the input files and to transform them into a list of Psm objects.

This method sets the psms variable, which is a list of Psm objects.

This method must be run before initializing a DataStore object.

Examples:

>>> reader = pyproteininference.reader.PercolatorReader(target_file="example_target.txt",
...     decoy_file="example_decoy.txt",
...     digest=digest, parameter_file_object=pi_params)
>>> reader.read_psms()
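Before building Psm objects, the method reads tab-delimited rows, drops the header, discards rows whose posterior error probability is not numeric, sorts the rest by that value in ascending order, and keeps one row per peptide. The sketch below illustrates that flow on an in-memory file. The column positions, the sample data, and the function name `load_best_psm_rows` are assumptions for illustration; the real reader uses class-level index constants whose values are not shown here.

```python
import csv
import io

# Hypothetical column positions, chosen only for this example
PSM_ID, SCORE, Q_VALUE, PEP, PEPTIDE, PROTEINS = 0, 1, 2, 3, 4, 5


def load_best_psm_rows(handle):
    """Return one row per peptide string, keeping the row with the lowest PEP."""
    rows = list(csv.reader(handle, delimiter="\t"))[1:]  # drop the header row

    # Keep only rows whose PEP column parses as a float
    numeric_rows = []
    for row in rows:
        try:
            float(row[PEP])
        except ValueError:
            continue
        numeric_rows.append(row)

    # Sort ascending so the most confident PSM for each peptide comes first
    numeric_rows.sort(key=lambda r: float(r[PEP]))

    seen_peptides = set()
    best_rows = []
    for row in numeric_rows:
        if row[PEPTIDE] not in seen_peptides:
            seen_peptides.add(row[PEPTIDE])
            best_rows.append(row)
    return best_rows


sample = (
    "PSMId\tscore\tq-value\tposterior_error_prob\tpeptide\tproteinIds\n"
    "scan_1\t1.2\t0.01\t0.20\tK.PEPTIDEK.A\tPROT_A\n"
    "scan_2\t2.5\t0.001\t0.01\tK.PEPTIDEK.A\tPROT_A\tPROT_B\n"
    "scan_3\t0.4\t0.05\tNA\tR.OTHERK.L\tPROT_C\n"
)
best = load_best_psm_rows(io.StringIO(sample))
```

Every column from `PROTEINS` onward holds a protein identifier, which is why the source slices the row from the protein index to the end when collecting possible proteins.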
Source code in pyproteininference/reader.py
def read_psms(self):
    """
    Method to read psms from the input files and to transform them into a list of
    [Psm][pyproteininference.physical.Psm] objects.

    This method sets the `psms` variable, which is a list of Psm objects.

    This method must be run before initializing a [DataStore object][pyproteininference.datastore.DataStore].

    Example:
        >>> reader = pyproteininference.reader.PercolatorReader(target_file="example_target.txt",
        ...     decoy_file="example_decoy.txt",
        ...     digest=digest, parameter_file_object=pi_params)
        >>> reader.read_psms()

    """
    # Read in and split by line
    if self.target_file and self.decoy_file:
        # If target_file is a list... read them all in and concatenate...
        if isinstance(self.target_file, (list,)):
            all_target = []
            for t_files in self.target_file:
                logger.info(t_files)
                ptarg = []
                with open(t_files, "r") as perc_target_file:
                    spamreader = csv.reader(perc_target_file, delimiter="\t")
                    for row in spamreader:
                        ptarg.append(row)
                del ptarg[0]
                all_target = all_target + ptarg
        elif self.target_file:
            # If not just read the file...
            ptarg = []
            with open(self.target_file, "r") as perc_target_file:
                spamreader = csv.reader(perc_target_file, delimiter="\t")
                for row in spamreader:
                    ptarg.append(row)
            del ptarg[0]
            all_target = ptarg

        # Repeat for decoy file
        if isinstance(self.decoy_file, (list,)):
            all_decoy = []
            for d_files in self.decoy_file:
                logger.info(d_files)
                pdec = []
                with open(d_files, "r") as perc_decoy_file:
                    spamreader = csv.reader(perc_decoy_file, delimiter="\t")
                    for row in spamreader:
                        pdec.append(row)
                del pdec[0]
                all_decoy = all_decoy + pdec
        elif self.decoy_file:
            pdec = []
            with open(self.decoy_file, "r") as perc_decoy_file:
                spamreader = csv.reader(perc_decoy_file, delimiter="\t")
                for row in spamreader:
                    pdec.append(row)
            del pdec[0]
            all_decoy = pdec

        # Combine the lists
        perc_all = all_target + all_decoy

    elif self.combined_files:
        if isinstance(self.combined_files, (list,)):
            all = []
            for f in self.combined_files:
                logger.info(f)
                combined_psm_result_rows = []
                with open(f, "r") as perc_files:
                    spamreader = csv.reader(perc_files, delimiter="\t")
                    for row in spamreader:
                        combined_psm_result_rows.append(row)
                del combined_psm_result_rows[0]
                all = all + combined_psm_result_rows
        elif self.combined_files:
            # If not just read the file...
            combined_psm_result_rows = []
            with open(self.combined_files, "r") as perc_files:
                spamreader = csv.reader(perc_files, delimiter="\t")
                for row in spamreader:
                    combined_psm_result_rows.append(row)
            del combined_psm_result_rows[0]
            all = combined_psm_result_rows
        perc_all = all

    elif self.directory:

        all_files = os.listdir(self.directory)
        all = []
        for files in all_files:
            logger.info(files)
            combined_psm_result_rows = []
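            # os.listdir returns bare file names, so join them with the directory
            with open(os.path.join(self.directory, files), "r") as perc_file: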
                spamreader = csv.reader(perc_file, delimiter="\t")
                for row in spamreader:
                    combined_psm_result_rows.append(row)
            del combined_psm_result_rows[0]
            all = all + combined_psm_result_rows
        perc_all = all

    peptide_to_protein_dictionary = self.digest.peptide_to_protein_dictionary

    perc_all_filtered = []
    for psms in perc_all:
        try:
            float(psms[self.POSTERIOR_ERROR_PROB_INDEX])
            perc_all_filtered.append(psms)
        except ValueError as e:  # noqa F841
            pass

    # Sort by posterior error probability (PEP), lowest first
    perc_all = sorted(
        perc_all_filtered,
        key=lambda x: float(x[self.POSTERIOR_ERROR_PROB_INDEX]),
        reverse=False,
    )

    list_of_psm_objects = []
    peptide_tracker = set()
    all_sp_proteins = set(self.digest.swiss_prot_protein_set)
    # We only want to get unique peptides... using all messes up scoring...
    # Create Psm objects with the identifier, percscore, qvalue, pepvalue, and possible proteins...

    initial_poss_prots = []
    logger.info("Length of PSM Data: {}".format(len(perc_all)))
    for psm_info in perc_all:
        current_peptide = psm_info[self.PEPTIDE_INDEX]
        # Define the Psm...
        if current_peptide not in peptide_tracker:
            combined_psm_result_rows = Psm(identifier=current_peptide)
            # Add all the attributes
            combined_psm_result_rows.percscore = float(psm_info[self.PERC_SCORE_INDEX])
            combined_psm_result_rows.qvalue = float(psm_info[self.Q_VALUE_INDEX])
            combined_psm_result_rows.pepvalue = float(psm_info[self.POSTERIOR_ERROR_PROB_INDEX])
            if self.parameter_file_object.inference_type == Inference.FIRST_PROTEIN:
                poss_proteins = [psm_info[self.PROTEINIDS_INDEX]]
            else:
                poss_proteins = sorted(list(set(psm_info[self.PROTEINIDS_INDEX :])))  # noqa E203
                poss_proteins = poss_proteins[: self.MAX_ALLOWED_ALTERNATIVE_PROTEINS]
            combined_psm_result_rows.possible_proteins = poss_proteins  # Restrict to 50 total possible proteins...
            combined_psm_result_rows.psm_id = psm_info[self.PSMID_INDEX]
            input_poss_prots = copy.copy(poss_proteins)

            # Split peptide if flanking
            current_peptide = Psm.split_peptide(peptide_string=current_peptide)

            if not current_peptide.isupper() or not current_peptide.isalpha():
                # If we have mods remove them...
                peptide_string = current_peptide.upper()
                stripped_peptide = Psm.remove_peptide_mods(peptide_string)
                current_peptide = stripped_peptide

            # Add the other possible_proteins from insilicodigest here...
            try:
                current_alt_proteins = sorted(
                    list(peptide_to_protein_dictionary[current_peptide])
                )  # This peptide needs to be scrubbed of Mods...
            except KeyError:
                current_alt_proteins = []
                logger.debug(
                    "Peptide {} was not found in the supplied DB with the following proteins {}".format(
                        current_peptide,
                        ";".join(combined_psm_result_rows.possible_proteins),
                    )
                )
                for poss_prot in combined_psm_result_rows.possible_proteins:
                    self.digest.peptide_to_protein_dictionary.setdefault(current_peptide, set()).add(poss_prot)
                    self.digest.protein_to_peptide_dictionary.setdefault(poss_prot, set()).add(current_peptide)
                    logger.debug(
                        "Adding Peptide {} and Protein {} to Digest dictionaries".format(current_peptide, poss_prot)
                    )

            # Sort Alt Proteins by Swissprot then Trembl...
            identifiers_sorted = DataStore.sort_protein_strings(
                protein_string_list=current_alt_proteins,
                sp_proteins=all_sp_proteins,
                decoy_symbol=self.parameter_file_object.decoy_symbol,
            )

            # Restrict to 50 possible proteins
            combined_psm_result_rows = self._fix_alternative_proteins(
                append_alt_from_db=self.append_alt_from_db,
                identifiers_sorted=identifiers_sorted,
                max_proteins=self.MAX_ALLOWED_ALTERNATIVE_PROTEINS,
                psm=combined_psm_result_rows,
                parameter_file_object=self.parameter_file_object,
            )

            # Remove blank alt proteins
            combined_psm_result_rows.possible_proteins = [
                x for x in combined_psm_result_rows.possible_proteins if x != ""
            ]

            list_of_psm_objects.append(combined_psm_result_rows)
            peptide_tracker.add(current_peptide)

            initial_poss_prots.append(input_poss_prots)

    self.psms = list_of_psm_objects

    self._check_initial_database_overlap(
        initial_possible_proteins=initial_poss_prots, initial_protein_peptide_map=self.initial_protein_peptide_map
    )

    logger.info("Length of PSM Data: {}".format(len(self.psms)))

ProteologicPostSearchReader (Reader)

This class reads PSM data from post-processing Proteologic logical objects.

Attributes:

Name Type Description
proteologic_object list

List of proteologic post search objects.

search_id int

Search ID or Search IDs associated with the data.

postsearch_id int

PostSearch ID or PostSearch IDs associated with the data.

digest Digest

Digest Object.

parameter_file_object ProteinInferenceParameter

ProteinInferenceParameter object.

append_alt_from_db bool

Whether or not to append alternative proteins found in the database that are not in the input files.

Source code in pyproteininference/reader.py
class ProteologicPostSearchReader(Reader):
    """
    This class reads PSM data from post-processing Proteologic logical objects.

    Attributes:
        proteologic_object (list): List of proteologic post search objects.
        search_id (int): Search ID or Search IDs associated with the data.
        postsearch_id (int): PostSearch ID or PostSearch IDs associated with the data.
        digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
        parameter_file_object (ProteinInferenceParameter):
            [ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter] object.
        append_alt_from_db (bool): Whether or not to append alternative proteins found in the database
            that are not in the input files.

    """

    def __init__(
        self,
        proteologic_object,
        search_id,
        postsearch_id,
        digest,
        parameter_file_object,
        append_alt_from_db=True,
    ):
        """

        Args:
            proteologic_object (list): List of proteologic post search objects.
            search_id (int): Search ID or Search IDs associated with the data.
            postsearch_id (int): PostSearch ID or PostSearch IDs associated with the data.
            digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
            parameter_file_object (ProteinInferenceParameter):
                [ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter] object.
            append_alt_from_db (bool): Whether or not to append alternative proteins found in the database
                that are not in the input files.


        Returns:
            object:
        """
        self.proteologic_object = proteologic_object
        self.search_id = search_id
        self.postsearch_id = postsearch_id

        self.psms = None
        self.digest = digest
        self.initial_protein_peptide_map = copy.copy(self.digest.protein_to_peptide_dictionary)
        self.append_alt_from_db = append_alt_from_db

        self.parameter_file_object = parameter_file_object

    def read_psms(self):
        """
        Method to read psms from the input files and to transform them into a list of
        [Psm][pyproteininference.physical.Psm] objects.

        This method sets the `psms` variable, which is a list of Psm objects.

        This method must be run before initializing a [DataStore object][pyproteininference.datastore.DataStore].

        """
        logger.info("Reading in data from Proteologic...")
        if isinstance(self.proteologic_object, (list,)):
            list_of_psms = []
            for p_objs in self.proteologic_object:
                for psms in p_objs.physical_object.psm_sets:
                    list_of_psms.append(psms)
        else:
            list_of_psms = self.proteologic_object.physical_object.psm_sets

        # Sort this by posterior error prob...
        list_of_psms = sorted(list_of_psms, key=lambda x: float(x.psm_filter.pepvalue))

        peptide_to_protein_dictionary = self.digest.peptide_to_protein_dictionary

        list_of_psm_objects = []
        peptide_tracker = set()
        all_sp_proteins = set(self.digest.swiss_prot_protein_set)
        # The peptide tracker is used because we only want unique peptides.
        # The data was sorted by posterior error probability above, so the first
        # PSM encountered for a peptide is the one with the lowest PEP.

        initial_poss_prots = []
        for peps in list_of_psms:
            current_peptide = peps.peptide.sequence
            # Define the Psm...
            if current_peptide not in peptide_tracker:
                p = Psm(identifier=current_peptide)
                # Add all the attributes
                p.percscore = float(0)  # Will be stored in table in future I think...
                p.qvalue = float(peps.psm_filter.q_value)
                p.pepvalue = float(peps.psm_filter.pepvalue)
                if peps.peptide.protein not in peps.alternative_proteins:
                    p.possible_proteins = [peps.peptide.protein] + peps.alternative_proteins
                else:
                    p.possible_proteins = peps.alternative_proteins

                p.possible_proteins = list(filter(None, p.possible_proteins))
                input_poss_prots = copy.copy(p.possible_proteins)
                p.psm_id = peps.spectrum.spectrum_identifier

                # Split peptide if flanking
                current_peptide = Psm.split_peptide(peptide_string=current_peptide)

                if not current_peptide.isupper() or not current_peptide.isalpha():
                    # If we have mods remove them...
                    peptide_string = current_peptide.upper()
                    stripped_peptide = Psm.remove_peptide_mods(peptide_string)
                    current_peptide = stripped_peptide

                # Add the other possible_proteins from insilicodigest here...
                try:
                    current_alt_proteins = sorted(
                        list(peptide_to_protein_dictionary[current_peptide])
                    )  # This peptide needs to be scrubbed of Mods...
                except KeyError:
                    current_alt_proteins = []
                    logger.debug(
                        "Peptide {} was not found in the supplied DB with the following proteins {}".format(
                            current_peptide, ";".join(p.possible_proteins)
                        )
                    )
                    for poss_prot in p.possible_proteins:
                        self.digest.peptide_to_protein_dictionary.setdefault(current_peptide, set()).add(poss_prot)
                        self.digest.protein_to_peptide_dictionary.setdefault(poss_prot, set()).add(current_peptide)
                        logger.debug(
                            "Adding Peptide {} and Protein {} to Digest dictionaries".format(current_peptide, poss_prot)
                        )

                # Sort Alt Proteins by Swissprot then Trembl...
                identifiers_sorted = DataStore.sort_protein_strings(
                    protein_string_list=current_alt_proteins,
                    sp_proteins=all_sp_proteins,
                    decoy_symbol=self.parameter_file_object.decoy_symbol,
                )

                # Restrict to 50 possible proteins... and append alt proteins from db
                p = self._fix_alternative_proteins(
                    append_alt_from_db=self.append_alt_from_db,
                    identifiers_sorted=identifiers_sorted,
                    max_proteins=self.MAX_ALLOWED_ALTERNATIVE_PROTEINS,
                    psm=p,
                    parameter_file_object=self.parameter_file_object,
                )

                list_of_psm_objects.append(p)
                peptide_tracker.add(current_peptide)

                initial_poss_prots.append(input_poss_prots)

        self.psms = list_of_psm_objects

        self._check_initial_database_overlap(
            initial_possible_proteins=initial_poss_prots, initial_protein_peptide_map=self.initial_protein_peptide_map
        )

        logger.info("Finished reading in data from Proteologic...")

__init__(self, proteologic_object, search_id, postsearch_id, digest, parameter_file_object, append_alt_from_db=True) special

Parameters:
  • proteologic_object (list) – List of proteologic post search objects.

  • search_id (int) – Search ID or Search IDs associated with the data.

  • postsearch_id (int) – PostSearch ID or PostSearch IDs associated with the data.

  • digest (Digest) – Digest Object.

  • parameter_file_object (ProteinInferenceParameter) – ProteinInferenceParameter object.

  • append_alt_from_db (bool) – Whether or not to append alternative proteins found in the database that are not in the input files.

Returns:
  • object

Source code in pyproteininference/reader.py
def __init__(
    self,
    proteologic_object,
    search_id,
    postsearch_id,
    digest,
    parameter_file_object,
    append_alt_from_db=True,
):
    """

    Args:
        proteologic_object (list): List of proteologic post search objects.
        search_id (int): Search ID or Search IDs associated with the data.
        postsearch_id (int): PostSearch ID or PostSearch IDs associated with the data.
        digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
        parameter_file_object (ProteinInferenceParameter):
            [ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter] object.
        append_alt_from_db (bool): Whether or not to append alternative proteins found in the database
            that are not in the input files.


    Returns:
        object:
    """
    self.proteologic_object = proteologic_object
    self.search_id = search_id
    self.postsearch_id = postsearch_id

    self.psms = None
    self.digest = digest
    self.initial_protein_peptide_map = copy.copy(self.digest.protein_to_peptide_dictionary)
    self.append_alt_from_db = append_alt_from_db

    self.parameter_file_object = parameter_file_object

read_psms(self)

Method to read psms from the input files and to transform them into a list of Psm objects.

This method sets the psms variable, which is a list of Psm objects.

This method must be run before initializing a DataStore object.
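Two steps in this method are easy to miss when reading the source: the primary protein is prepended to the alternative proteins only when it is not already among them, and a peptide that is missing from the digest is registered in both digest dictionaries. The sketch below isolates those two steps with plain lists and dicts. The function names `build_possible_proteins` and `register_missing_peptide` are hypothetical and do not exist in the library.

```python
def build_possible_proteins(primary_protein, alternative_proteins):
    """Combine the primary protein with its alternatives and drop blank entries."""
    if primary_protein not in alternative_proteins:
        proteins = [primary_protein] + list(alternative_proteins)
    else:
        proteins = list(alternative_proteins)
    # Same effect as list(filter(None, proteins)) in the source
    return [protein for protein in proteins if protein]


def register_missing_peptide(peptide, proteins, peptide_to_protein, protein_to_peptide):
    """Record a peptide that the in silico digest did not contain in both maps."""
    for protein in proteins:
        peptide_to_protein.setdefault(peptide, set()).add(protein)
        protein_to_peptide.setdefault(protein, set()).add(peptide)


peptide_to_protein = {"KNOWNPEPTIDEK": {"PROT_A"}}
protein_to_peptide = {"PROT_A": {"KNOWNPEPTIDEK"}}

proteins = build_possible_proteins("PROT_B", ["PROT_C", ""])
register_missing_peptide("NEWPEPTIDER", proteins, peptide_to_protein, protein_to_peptide)
```

Using `setdefault` keeps the update idempotent: registering the same peptide and protein pair twice leaves the sets unchanged.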

Source code in pyproteininference/reader.py
def read_psms(self):
    """
    Method to read psms from the input files and to transform them into a list of
    [Psm][pyproteininference.physical.Psm] objects.

    This method sets the `psms` variable, which is a list of Psm objects.

    This method must be run before initializing a [DataStore object][pyproteininference.datastore.DataStore].

    """
    logger.info("Reading in data from Proteologic...")
    if isinstance(self.proteologic_object, (list,)):
        list_of_psms = []
        for p_objs in self.proteologic_object:
            for psms in p_objs.physical_object.psm_sets:
                list_of_psms.append(psms)
    else:
        list_of_psms = self.proteologic_object.physical_object.psm_sets

    # Sort this by posterior error prob...
    list_of_psms = sorted(list_of_psms, key=lambda x: float(x.psm_filter.pepvalue))

    peptide_to_protein_dictionary = self.digest.peptide_to_protein_dictionary

    list_of_psm_objects = []
    peptide_tracker = set()
    all_sp_proteins = set(self.digest.swiss_prot_protein_set)
    # The peptide tracker is used because we only want unique peptides.
    # The data was sorted by posterior error probability above, so the first
    # PSM encountered for a peptide is the one with the lowest PEP.

    initial_poss_prots = []
    for peps in list_of_psms:
        current_peptide = peps.peptide.sequence
        # Define the Psm...
        if current_peptide not in peptide_tracker:
            p = Psm(identifier=current_peptide)
            # Add all the attributes
            p.percscore = float(0)  # Will be stored in table in future I think...
            p.qvalue = float(peps.psm_filter.q_value)
            p.pepvalue = float(peps.psm_filter.pepvalue)
            if peps.peptide.protein not in peps.alternative_proteins:
                p.possible_proteins = [peps.peptide.protein] + peps.alternative_proteins
            else:
                p.possible_proteins = peps.alternative_proteins

            p.possible_proteins = list(filter(None, p.possible_proteins))
            input_poss_prots = copy.copy(p.possible_proteins)
            p.psm_id = peps.spectrum.spectrum_identifier

            # Split peptide if flanking
            current_peptide = Psm.split_peptide(peptide_string=current_peptide)

            if not current_peptide.isupper() or not current_peptide.isalpha():
                # If we have mods remove them...
                peptide_string = current_peptide.upper()
                stripped_peptide = Psm.remove_peptide_mods(peptide_string)
                current_peptide = stripped_peptide

            # Add the other possible_proteins from insilicodigest here...
            try:
                current_alt_proteins = sorted(
                    list(peptide_to_protein_dictionary[current_peptide])
                )  # This peptide needs to be scrubbed of Mods...
            except KeyError:
                current_alt_proteins = []
                logger.debug(
                    "Peptide {} was not found in the supplied DB with the following proteins {}".format(
                        current_peptide, ";".join(p.possible_proteins)
                    )
                )
                for poss_prot in p.possible_proteins:
                    self.digest.peptide_to_protein_dictionary.setdefault(current_peptide, set()).add(poss_prot)
                    self.digest.protein_to_peptide_dictionary.setdefault(poss_prot, set()).add(current_peptide)
                    logger.debug(
                        "Adding Peptide {} and Protein {} to Digest dictionaries".format(current_peptide, poss_prot)
                    )

            # Sort Alt Proteins by Swissprot then Trembl...
            identifiers_sorted = DataStore.sort_protein_strings(
                protein_string_list=current_alt_proteins,
                sp_proteins=all_sp_proteins,
                decoy_symbol=self.parameter_file_object.decoy_symbol,
            )

            # Restrict to 50 possible proteins... and append alt proteins from db
            p = self._fix_alternative_proteins(
                append_alt_from_db=self.append_alt_from_db,
                identifiers_sorted=identifiers_sorted,
                max_proteins=self.MAX_ALLOWED_ALTERNATIVE_PROTEINS,
                psm=p,
                parameter_file_object=self.parameter_file_object,
            )

            list_of_psm_objects.append(p)
            peptide_tracker.add(current_peptide)

            initial_poss_prots.append(input_poss_prots)

    self.psms = list_of_psm_objects

    self._check_initial_database_overlap(
        initial_possible_proteins=initial_poss_prots, initial_protein_peptide_map=self.initial_protein_peptide_map
    )

    logger.info("Finished reading in data from Proteologic...")

Reader

Main Reader Class which is parent to all reader subclasses.

Attributes:

Name Type Description
target_file str/list

Path to Target PSM result files.

decoy_file str/list

Path to Decoy PSM result files.

combined_files str/list

Path to Combined PSM result files.

directory str

Path to directory containing combined PSM result files.
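The `get_alternative_proteins_from_input` method in the class source looks for a `None` key in each row. That pattern matches how Python's `csv.DictReader` behaves: when a row has more fields than the header, the extra fields are collected in a list under the `restkey`, which defaults to `None`. The sketch below shows this on a small in-memory file. The sample data and the standalone function name are assumptions for illustration, and whether a given reader subclass actually uses `csv.DictReader` is not shown in this section.

```python
import csv
import io


def get_alternative_proteins(row):
    """Move overflow columns (stored under the None key) into a sorted list."""
    if None in row:
        row["alternative_proteins"] = sorted(row.pop(None))
    else:
        row["alternative_proteins"] = []
    return row


sample = (
    "PSMId\tpeptide\tproteinIds\n"
    "scan_1\tPEPTIDEK\tPROT_B\tPROT_C\tPROT_A\n"
    "scan_2\tOTHERK\tPROT_D\n"
)
reader = csv.DictReader(io.StringIO(sample), delimiter="\t")
rows = [get_alternative_proteins(row) for row in reader]
```

In the first row, `proteinIds` holds only the first protein and the two overflow identifiers end up sorted in `alternative_proteins`; the second row has no overflow, so its list is empty.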

Source code in pyproteininference/reader.py
class Reader(object):
    """
    Main Reader Class which is parent to all reader subclasses.

    Attributes:
        target_file (str/list): Path to Target PSM result files.
        decoy_file (str/list): Path to Decoy PSM result files.
        combined_files (str/list): Path to Combined PSM result files.
        directory (str): Path to directory containing combined PSM result files.

    """

    MAX_ALLOWED_ALTERNATIVE_PROTEINS = 50

    def __init__(self, target_file=None, decoy_file=None, combined_files=None, directory=None):
        """

        Args:
            target_file (str/list): Path to Target PSM result files.
            decoy_file (str/list): Path to Decoy PSM result files.
            combined_files (str/list): Path to Combined PSM result files.
            directory (str): Path to directory containing combined PSM result files.

        """
        self.target_file = target_file
        self.decoy_file = decoy_file
        self.combined_files = combined_files
        self.directory = directory

    def get_alternative_proteins_from_input(self, row):
        """
        Method to get the alternative proteins from the input files.

        """
        if None in row.keys():
            try:
                row["alternative_proteins"] = row.pop(None)
                # Sort the alternative proteins - when they are read in they become unsorted
                row["alternative_proteins"] = sorted(row["alternative_proteins"])
            except KeyError:
                row["alternative_proteins"] = []
        else:
            row["alternative_proteins"] = []
        return row

    def _validate_input(self):
        """
        Internal method to validate the input to Reader.

        """
        if self.target_file and self.decoy_file and not self.combined_files and not self.directory:
            logger.info("Validating input as target_file and decoy_file")
        elif self.combined_files and not self.target_file and not self.decoy_file and not self.directory:
            logger.info("Validating input as combined_files")
        elif self.directory and not self.combined_files and not self.decoy_file and not self.target_file:
            logger.info("Validating input as combined_directory")
        else:
            raise ValueError(
                "To run Protein inference please supply either: "
                "(1) either one or multiple target_files and decoy_files, "
                "(2) either one or multiple combined_files that include target and decoy data, or "
                "(3) a combined_directory that contains combined target/decoy files (combined_directory)"
            )

    @classmethod
    def _fix_alternative_proteins(
        cls,
        append_alt_from_db,
        identifiers_sorted,
        max_proteins,
        psm,
        parameter_file_object,
    ):
        """
        Internal method to fix the alternative proteins variable for a given
         [Psm][pyproteininference.physical.Psm] object.

        Args:
            append_alt_from_db (bool): Whether or not to append alternative proteins found in the database that are
                not in the input files.
            identifiers_sorted (list): List of sorted Protein Strings for the given Psm.
            max_proteins (int): Maximum number of proteins that a [Psm][pyproteininference.physical.Psm]
                is allowed to map to.
            psm: (Psm): [Psm][pyproteininference.physical.Psm] object of interest.
            parameter_file_object: (ProteinInferenceParameter):
                [ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter].

        Returns:
            pyproteininference.physical.Psm: [Psm][pyproteininference.physical.Psm] with alternative proteins fixed.

        """
        # If we are appending alternative proteins from the db
        if append_alt_from_db:
            # Loop over the identifiers from the DB. These are identifiers that contain the current peptide
            for alt_proteins in identifiers_sorted[:max_proteins]:
                # If the identifier is not already in possible proteins
                # and the length of possible proteins is less than the max...
                # then append
                if alt_proteins not in psm.possible_proteins and len(psm.possible_proteins) < max_proteins:
                    psm.possible_proteins.append(alt_proteins)
        # Next, if the length of possible proteins is greater than max, restrict the list length...
        if len(psm.possible_proteins) > max_proteins:
            psm.possible_proteins = psm.possible_proteins[:max_proteins]

        # If no inference only select first poss protein
        if parameter_file_object.inference_type == Inference.FIRST_PROTEIN:
            psm.possible_proteins = [psm.possible_proteins[0]]

        return psm

    def _check_initial_database_overlap(self, initial_possible_proteins, initial_protein_peptide_map):
        """
        Internal method that checks to make sure there is at least some overlap between proteins in the input files
        and the proteins in the database digest.
        """

        if len(initial_protein_peptide_map.keys()) > 0:
            input_protein_ids_flat = set([protein for group in initial_possible_proteins for protein in group])

            digest_proteins = set(initial_protein_peptide_map.keys())

            intersection = input_protein_ids_flat.intersection(digest_proteins)

            if len(intersection) < 1:
                raise ValueError(
                    "The Intersection of Protein Identifiers between the database digest "
                    "and the input files is zero. Please consider setting id_splitting to True. "
                    "Or make sure that the identifiers in the input files and database file match. "
                    "Example Protein Identifier from input file '{}'. "
                    "Example Protein Identifier from database file '{}'".format(
                        list(input_protein_ids_flat)[0], list(digest_proteins)[0]
                    )
                )
            else:
                logger.info("Number of matching proteins from database and input files: {}".format(len(intersection)))
                logger.info("Number of proteins from database file: {}".format(len(digest_proteins)))
                logger.info("Number of proteins from input files: {}".format(len(input_protein_ids_flat)))

        else:
            pass

__init__(self, target_file=None, decoy_file=None, combined_files=None, directory=None) special

Parameters:
  • target_file (str/list) – Path to Target PSM result files.

  • decoy_file (str/list) – Path to Decoy PSM result files.

  • combined_files (str/list) – Path to Combined PSM result files.

  • directory (str) – Path to directory containing combined PSM result files.

Source code in pyproteininference/reader.py
def __init__(self, target_file=None, decoy_file=None, combined_files=None, directory=None):
    """

    Args:
        target_file (str/list): Path to Target PSM result files.
        decoy_file (str/list): Path to Decoy PSM result files.
        combined_files (str/list): Path to Combined PSM result files.
        directory (str): Path to directory containing combined PSM result files.

    """
    self.target_file = target_file
    self.decoy_file = decoy_file
    self.combined_files = combined_files
    self.directory = directory

get_alternative_proteins_from_input(self, row)

Method to get the alternative proteins from the input files.
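The `None` key this method looks for is what Python's `csv.DictReader` uses by default (`restkey=None`) to collect fields beyond the header columns. Assuming the PSM files are read that way, the sketch below shows where the alternative proteins come from; the column names, the sample row, and `collect_alternative_proteins` are illustrative assumptions, not the library's code.

```python
import csv
import io

# Hypothetical tab-delimited PSM row with more fields than header columns
text = "PSMId\tscore\tpeptide\tproteinIds\n1\t0.9\tPEPTIDER\tprotC\tprotB\tprotA\n"


def collect_alternative_proteins(row):
    """Move the overflow fields (stored under the None key) into a sorted list."""
    extra = row.pop(None, [])
    row["alternative_proteins"] = sorted(extra)
    return row


reader = csv.DictReader(io.StringIO(text), delimiter="\t")
row = collect_alternative_proteins(next(reader))
print(row["proteinIds"], row["alternative_proteins"])
```

The first protein stays in the named column, and every additional protein ends up in `alternative_proteins`. A row with no overflow fields gets an empty list.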

Source code in pyproteininference/reader.py
def get_alternative_proteins_from_input(self, row):
    """
    Method to get the alternative proteins from the input files.

    """
    if None in row.keys():
        try:
            row["alternative_proteins"] = row.pop(None)
            # Sort the alternative proteins - when they are read in they become unsorted
            row["alternative_proteins"] = sorted(row["alternative_proteins"])
        except KeyError:
            row["alternative_proteins"] = []
    else:
        row["alternative_proteins"] = []
    return row

scoring

Score

Score class that contains methods to do a variety of scoring methods on the Psm objects contained inside of Protein objects.

Methods in the class loop over each Protein object and create a protein "score" variable using the Psm object scores.

Methods score all proteins from scoring_input from DataStore object. The PSM score that is used is determined from create_scoring_input.

Each scoring method will set the following attributes for the DataStore object.

  1. protein_score; This is the full name of the score method.
  2. short_protein_score; This is the short name of the score method.
  3. scored_proteins; This is a list of Protein objects that have been scored.
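The pattern shared by the scoring methods (score each protein, sort, then record the results on the data object) can be sketched without the library. `ToyProtein`, `toy_multiplicative_log`, and the `SimpleNamespace` standing in for a DataStore are hypothetical; the sketch also omits the zero-product guard that the real `multiplicative_log` applies.

```python
import math
from functools import reduce
from types import SimpleNamespace


class ToyProtein:
    """Hypothetical stand-in for a Protein object; holds only what scoring needs."""

    def __init__(self, identifier, psm_scores):
        self.identifier = identifier
        self._psm_scores = psm_scores
        self.score = None

    def get_psm_scores(self):
        return self._psm_scores


def toy_multiplicative_log(data):
    """Score each protein as -log(product of PSM scores) and store the results on `data`."""
    for protein in data.scoring_input:
        product = reduce(lambda x, y: x * y, protein.get_psm_scores())
        protein.score = -math.log(product)
    data.protein_score = "multiplicative_log"
    data.short_protein_score = "ml"
    # A smaller q-value or PEP gives a larger -log, so sort descending
    data.scored_proteins = sorted(data.scoring_input, key=lambda p: p.score, reverse=True)


data = SimpleNamespace(
    scoring_input=[ToyProtein("A", [0.01, 0.02]), ToyProtein("B", [0.001, 0.001, 0.5])]
)
toy_multiplicative_log(data)
print([p.identifier for p in data.scored_proteins])
```

Protein B has the smaller product of PSM scores, so it receives the larger score and sorts first.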

Attributes:

Name Type Description
pre_score_data list

This is a list of Protein objects that contain Psm objects.

data DataStore

DataStore object.

Source code in pyproteininference/scoring.py
class Score(object):
    """
    Score class that contains methods to do a variety of scoring methods on the
    [Psm][pyproteininference.physical.Psm] objects
    contained inside of [Protein][pyproteininference.physical.Protein] objects.

    Methods in the class loop over each Protein object and create a protein "score" variable using the Psm object
    scores.

    Methods score all proteins from `scoring_input` from [DataStore object][pyproteininference.datastore.DataStore].
    The PSM score that is used is determined from
    [create_scoring_input][pyproteininference.datastore.DataStore.create_scoring_input].

    Each scoring method will set the following attributes for
    the [DataStore object][pyproteininference.datastore.DataStore].

    1. `protein_score`; This is the full name of the score method.
    2. `short_protein_score`; This is the short name of the score method.
    3. `scored_proteins`; This is a list of [Protein][pyproteininference.physical.Protein] objects
    that have been scored.

    Attributes:
        pre_score_data (list): This is a list of [Protein][pyproteininference.physical.Protein] objects
            that contain [Psm][pyproteininference.physical.Psm] objects.
        data (DataStore): [DataStore][pyproteininference.datastore.DataStore] object.

    """

    BEST_PEPTIDE_PER_PROTEIN = "best_peptide_per_protein"
    ITERATIVE_DOWNWEIGHTED_LOG = "iterative_downweighted_log"
    MULTIPLICATIVE_LOG = "multiplicative_log"
    DOWNWEIGHTED_MULTIPLICATIVE_LOG = "downweighted_multiplicative_log"
    DOWNWEIGHTED_VERSION2 = "downweighted_version2"
    TOP_TWO_COMBINED = "top_two_combined"
    GEOMETRIC_MEAN = "geometric_mean"
    ADDITIVE = "additive"

    SCORE_METHODS = [
        BEST_PEPTIDE_PER_PROTEIN,
        ITERATIVE_DOWNWEIGHTED_LOG,
        MULTIPLICATIVE_LOG,
        DOWNWEIGHTED_MULTIPLICATIVE_LOG,
        DOWNWEIGHTED_VERSION2,
        TOP_TWO_COMBINED,
        GEOMETRIC_MEAN,
        ADDITIVE,
    ]

    SHORT_BEST_PEPTIDE_PER_PROTEIN = "bppp"
    SHORT_ITERATIVE_DOWNWEIGHTED_LOG = "idwl"
    SHORT_MULTIPLICATIVE_LOG = "ml"
    SHORT_DOWNWEIGHTED_MULTIPLICATIVE_LOG = "dwml"
    SHORT_DOWNWEIGHTED_VERSION2 = "dw2"
    SHORT_TOP_TWO_COMBINED = "ttc"
    SHORT_GEOMETRIC_MEAN = "gm"
    SHORT_ADDITIVE = "add"

    SHORT_SCORE_METHODS = [
        SHORT_BEST_PEPTIDE_PER_PROTEIN,
        SHORT_ITERATIVE_DOWNWEIGHTED_LOG,
        SHORT_MULTIPLICATIVE_LOG,
        SHORT_DOWNWEIGHTED_MULTIPLICATIVE_LOG,
        SHORT_DOWNWEIGHTED_VERSION2,
        SHORT_TOP_TWO_COMBINED,
        SHORT_GEOMETRIC_MEAN,
        SHORT_ADDITIVE,
    ]

    MULTIPLICATIVE_SCORE_TYPE = "multiplicative"
    ADDITIVE_SCORE_TYPE = "additive"

    SCORE_TYPES = [MULTIPLICATIVE_SCORE_TYPE, ADDITIVE_SCORE_TYPE]

    def __init__(self, data):
        """
        Initialization method for the Score class.

        Args:
            data (DataStore): [DataStore][pyproteininference.datastore.DataStore] object.

        Raises:
            ValueError: If the variable `scoring_input` for the [DataStore][pyproteininference.datastore.DataStore]
                object is Empty "[]" or does not exist "None".

        Examples:
            >>> score = pyproteininference.scoring.Score(data=data)
        """
        if data.scoring_input:
            self.pre_score_data = data.scoring_input
        else:
            raise ValueError(
                "scoring input not found in data object - Please run 'create_scoring_input' method from "
                "DataStore to run any scoring type"
            )
        self.data = data

    def score_psms(self, score_method="multiplicative_log"):
        """
        This method dispatches to the actual scoring method given a string input that is defined in the
        [ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter] object.

        Args:
            score_method (str): This is a string that represents which scoring method to call.

        Raises:
            ValueError: Will Error out if the score_method is not present in the constant `SCORE_METHODS`.

        Examples:
            >>> score = pyproteininference.scoring.Score(data=data)
            >>> score.score_psms(score_method="best_peptide_per_protein")
        """

        if score_method not in self.SCORE_METHODS:
            raise ValueError(
                "score method '{}' is not a proper method. Score method must be one of the following: '{}'".format(
                    score_method, ", ".join(self.SCORE_METHODS)
                )
            )
        else:
            if score_method == self.BEST_PEPTIDE_PER_PROTEIN:
                self.best_peptide_per_protein()
            if score_method == self.ITERATIVE_DOWNWEIGHTED_LOG:
                self.iterative_down_weighted_log()
            if score_method == self.MULTIPLICATIVE_LOG:
                self.multiplicative_log()
            if score_method == self.DOWNWEIGHTED_MULTIPLICATIVE_LOG:
                self.down_weighted_multiplicative_log()
            if score_method == self.DOWNWEIGHTED_VERSION2:
                self.down_weighted_v2()
            if score_method == self.TOP_TWO_COMBINED:
                self.top_two_combied()
            if score_method == self.GEOMETRIC_MEAN:
                self.geometric_mean_log()
            if score_method == self.ADDITIVE:
                self.additive()

    def best_peptide_per_protein(self):
        """
        This method uses a best peptide per protein scoring scheme.
        The top scoring Psm for each protein is selected as the overall Protein object score.

        Examples:
            >>> score = pyproteininference.scoring.Score(data=data)
            >>> score.best_peptide_per_protein()

        """

        all_scores = []

        logger.info("Scoring Proteins with BPPP")
        for protein in self.pre_score_data:
            val_list = protein.get_psm_scores()
            score = min([float(x) for x in val_list])

            protein.score = score

            all_scores.append(protein)
        # Here do ascending sorting because a lower pep or q value is better
        all_scores = sorted(all_scores, key=lambda k: k.score, reverse=False)

        self.data.protein_score = self.BEST_PEPTIDE_PER_PROTEIN
        self.data.short_protein_score = self.SHORT_BEST_PEPTIDE_PER_PROTEIN
        self.data.scored_proteins = all_scores

    def fishers_method(self):
        """
        This method uses a Fisher's method scoring scheme.
        Each Psm score is log-transformed, and the protein score is -2 times the sum of the logs.

        Examples:
            >>> score = pyproteininference.scoring.Score(data=data)
            >>> score.fishers_method()

        """

        all_scores = []
        logger.info("Scoring Proteins with fishers method")
        for protein in self.pre_score_data:
            val_list = protein.get_psm_scores()
            score = -2 * sum([math.log(x) for x in val_list])

            protein.score = score

            all_scores.append(protein)
        # Here reverse the sorting to descending because a higher score is better
        all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)
        self.data.protein_score = "fishers_method"
        self.data.short_protein_score = "fm"
        self.data.scored_proteins = all_scores

    def multiplicative_log(self):
        """
        This method uses a Multiplicative Log scoring scheme.
        The selected Psm scores from all the peptides per protein are multiplied together and we take -Log(X)
        of the multiplied Peptide scores.

        Examples:
            >>> score = pyproteininference.scoring.Score(data=data)
            >>> score.multiplicative_log()
        """

        # Instead of making all_scores a list... make it a generator??

        all_scores = []
        logger.info("Scoring Proteins with Multiplicative Log Method")
        for protein in self.pre_score_data:
            # We create a generator of val_list...
            val_list = protein.get_psm_scores()

            combine = reduce(lambda x, y: x * y, val_list)
            if combine == 0:
                combine = sys.float_info.min
            score = -math.log(combine)
            protein.score = score

            all_scores.append(protein)

        # Higher score is better as a smaller q or pep in a -log will give a larger value
        all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)

        self.data.protein_score = self.MULTIPLICATIVE_LOG
        self.data.short_protein_score = self.SHORT_MULTIPLICATIVE_LOG
        self.data.scored_proteins = all_scores

    def down_weighted_multiplicative_log(self):
        """
        This method uses a Downweighted Multiplicative Log scoring scheme.
        The selected PSM scores from all the peptides per protein are multiplied together, and
        this product is divided by the mean of all PSM scores raised to the number of peptides for that protein.
        We then take -Log(X) of the resulting value.

        Examples:
            >>> score = pyproteininference.scoring.Score(data=data)
            >>> score.down_weighted_multiplicative_log()
        """

        score_list = []
        for proteins in self.pre_score_data:
            cur_scores = proteins.get_psm_scores()
            for scores in cur_scores:
                score_list.append(scores)
        score_mean = numpy.mean(score_list)

        all_scores = []
        logger.info("Scoring Proteins with DWML method")
        for protein in self.pre_score_data:
            val_list = protein.get_psm_scores()
            # Divide by the score mean raised to the length of the number of unique peptides for the protein
            # This is an attempt to normalize for number of peptides per protein
            combine = reduce(lambda x, y: x * y, val_list)
            if combine == 0:
                combine = sys.float_info.min
            score = -math.log(combine / (score_mean ** len(val_list)))
            protein.score = score

            all_scores.append(protein)

        # Higher score is better as a smaller q or pep in a -log will give a larger value
        all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)

        self.data.protein_score = self.DOWNWEIGHTED_MULTIPLICATIVE_LOG
        self.data.short_protein_score = self.SHORT_DOWNWEIGHTED_MULTIPLICATIVE_LOG
        self.data.scored_proteins = all_scores

    def top_two_combied(self):
        """
        This method uses a Top Two scoring scheme.
        The top two scores for each protein are multiplied together and we take -Log(X) of the multiplied value.
        If a protein only has 1 score/peptide, then we only do -Log(X) of the 1 peptide score.

        Examples:
            >>> score = pyproteininference.scoring.Score(data=data)
            >>> score.top_two_combied()
        """

        all_scores = []
        logger.info("Scoring Proteins with Top Two Method")
        for protein in self.pre_score_data:
            val_list = protein.get_psm_scores()

            try:
                # Try to combine the top two scores
                # Divide by 2 to attempt to normalize the value
                score = -math.log((val_list[0] * val_list[1]) / 2)
            except IndexError:
                # If there is only 1 score/1 peptide then just use the 1 peptide provided
                score = -math.log(val_list[0])

            protein.score = score
            all_scores.append(protein)

        # Higher score is better as a smaller q or pep in a -log will give a larger value
        all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)

        self.data.protein_score = self.TOP_TWO_COMBINED
        self.data.short_protein_score = self.SHORT_TOP_TWO_COMBINED
        self.data.scored_proteins = all_scores

    def down_weighted_v2(self):
        """
        This method uses a Downweighted Multiplicative Log scoring scheme.
        Each peptide is iteratively downweighted by raising the peptide QValue or PepValue to the
        following power (1/(1+index_number)),
        where index_number is the peptide number per protein.
        Each score for a protein provides less and less weight iteratively.

        We also take -Log(X) of the final score here.

        Examples:
            >>> score = pyproteininference.scoring.Score(data=data)
            >>> score.down_weighted_v2()
        """

        all_scores = []
        logger.info("Scoring Proteins with down weighted v2 method")
        for protein in self.pre_score_data:
            val_list = protein.get_psm_scores()

            # Here take each score and raise it to the power of (1/(1+index_number)).
            # This downweights each successive score by reducing its weight in a decreasing fashion
            # Basically, each score for a protein will provide less and less weight iteratively
            val_list = [val_list[x] ** (1 / float(1 + x)) for x in range(len(val_list))]
            # val_list = [val_list[x]**(1/float(1+(float(x)/10))) for x in range(len(val_list))]
            score = -math.log(reduce(lambda x, y: x * y, val_list))

            protein.score = score
            all_scores.append(protein)

        # Higher score is better as a smaller q or pep in a -log will give a larger value
        all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)

        self.data.protein_score = self.DOWNWEIGHTED_VERSION2
        self.data.short_protein_score = self.SHORT_DOWNWEIGHTED_VERSION2
        self.data.scored_proteins = all_scores

    def iterative_down_weighted_log(self):
        """
        This method uses a Downweighted Multiplicative Log scoring scheme.
        Each peptide is iteratively downweighted by multiplying the peptide QValue or PepValue by
        the following factor (1+index_number),
        where index_number is the peptide number per protein.
        Each score for a protein provides less and less weight iteratively.

        We also take -Log(X) of the final score here.

        Examples:
            >>> score = pyproteininference.scoring.Score(data=data)
            >>> score.iterative_down_weighted_log()
        """

        all_scores = []
        logger.info("Scoring Proteins with IDWL method")
        for protein in self.pre_score_data:
            val_list = protein.get_psm_scores()

            # Here take each score and multiply it by (1 + index_number).
            # This downweights each successive score by reducing its weight in a decreasing fashion
            # Basically, each score for a protein will provide less and less weight iteratively
            val_list = [val_list[x] * (float(1 + x)) for x in range(len(val_list))]
            # val_list = [val_list[x]**(1/float(1+(float(x)/10))) for x in range(len(val_list))]
            combine = reduce(lambda x, y: x * y, val_list)
            if combine == 0:
                combine = sys.float_info.min
            score = -math.log(combine)

            protein.score = score
            all_scores.append(protein)

        # Higher score is better as a smaller q or pep in a -log will give a larger value
        all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)

        self.data.protein_score = self.ITERATIVE_DOWNWEIGHTED_LOG
        self.data.short_protein_score = self.SHORT_ITERATIVE_DOWNWEIGHTED_LOG
        self.data.scored_proteins = all_scores

    def geometric_mean_log(self):
        """
        This method uses a Geometric Mean scoring scheme.

        We also take -Log(X) of the final score here.

        Examples:
            >>> score = pyproteininference.scoring.Score(data=data)
            >>> score.geometric_mean_log()
        """

        all_scores = []
        logger.info("Scoring Proteins with GML method")
        for protein in self.pre_score_data:
            psm_scores = protein.get_psm_scores()
            val_list = [float(vals) for vals in psm_scores]
            # Take the product once, after all scores for the protein have been collected
            combine = reduce(lambda x, y: x * y, val_list)
            if combine == 0:
                combine = sys.float_info.min
            # Geometric mean: the n-th root of the product of n scores
            pre_log_score = combine ** (1 / float(len(val_list)))
            score = -math.log(pre_log_score)

            protein.score = score
            all_scores.append(protein)

        # Higher score is better as a smaller q or pep in a -log will give a larger value
        all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)

        self.data.protein_score = self.GEOMETRIC_MEAN
        self.data.short_protein_score = self.SHORT_GEOMETRIC_MEAN
        self.data.scored_proteins = all_scores

    def iterative_down_weighted_v2(self):
        """
        The following method is an experimental method essentially used for future development of potential scoring
        schemes.
        """

        all_scores = []
        logger.info("Scoring Proteins with iterative down weighted v2 method")
        for protein in self.pre_score_data:
            val_list = protein.get_psm_scores()

            # Here take each score and raise it to the power of (1/(1+index_number)).
            # This downweights each successive score by reducing its weight in a decreasing fashion
            # Basically, each score for a protein will provide less and less weight iteratively
            val_list = [val_list[x] ** (1 / float(1 + x)) for x in range(len(val_list))]
            # val_list = [val_list[x]**(1/float(1+(float(x)/10))) for x in range(len(val_list))]
            score = -math.log(reduce(lambda x, y: x * y, val_list))

            protein.score = score
            all_scores.append(protein)

        # Higher score is better as a smaller q or pep in a -log will give a larger value
        all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)

        self.data.protein_score = "iterative_downweighting2"
        self.data.short_protein_score = "idw2"
        self.data.scored_proteins = all_scores

    def additive(self):
        """
        This method uses an additive scoring scheme.
        The method can only be used if a larger PSM score is a better PSM score such as the Percolator score.

        Examples:
            >>> score = pyproteininference.scoring.Score(data=data)
            >>> score.additive()
        """

        all_scores = []
        logger.info("Scoring Proteins with additive method")
        for protein in self.pre_score_data:
            val_list = protein.get_psm_scores()

            # Take the sum of our scores
            score = sum(val_list)

            protein.score = score
            all_scores.append(protein)

        # Higher score is better because the additive method assumes a larger PSM score is a better PSM score
        all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)

        self.data.protein_score = self.ADDITIVE
        self.data.short_protein_score = self.SHORT_ADDITIVE
        self.data.scored_proteins = all_scores

__init__(self, data) special

Initialization method for the Score class.

Parameters:
  • data (DataStore) – DataStore object.

Exceptions:
  • ValueError – If the variable scoring_input for the DataStore object is Empty "[]" or does not exist "None".

Examples:

>>> score = pyproteininference.scoring.Score(data=data)
Source code in pyproteininference/scoring.py
def __init__(self, data):
    """
    Initialization method for the Score class.

    Args:
        data (DataStore): [DataStore][pyproteininference.datastore.DataStore] object.

    Raises:
        ValueError: If the variable `scoring_input` for the [DataStore][pyproteininference.datastore.DataStore]
            object is Empty "[]" or does not exist "None".

    Examples:
        >>> score = pyproteininference.scoring.Score(data=data)
    """
    if data.scoring_input:
        self.pre_score_data = data.scoring_input
    else:
        raise ValueError(
            "scoring input not found in data object - Please run 'create_scoring_input' method from "
            "DataStore to run any scoring type"
        )
    self.data = data

additive(self)

This method uses an additive scoring scheme. The method can only be used if a larger PSM score is a better PSM score such as the Percolator score.
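As a minimal sketch of the additive scheme, assuming each protein's PSM scores are already available as a list of floats, the protein score is the plain sum and proteins are ranked in descending order. The protein names and score values below are made up for illustration.

```python
def additive_score(psm_scores):
    """Protein score is the sum of its PSM scores (higher is better)."""
    return sum(psm_scores)


# Hypothetical PSM scores where a larger value is better
proteins = {"protein_a": [2.5, 1.0, 0.5], "protein_b": [3.1]}
ranked = sorted(proteins, key=lambda name: additive_score(proteins[name]), reverse=True)
print(ranked)
```

Because the sum grows with every additional PSM, proteins with many modest PSMs can outrank proteins with a single strong one, as `protein_a` does here.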

Examples:

>>> score = pyproteininference.scoring.Score(data=data)
>>> score.additive()
Source code in pyproteininference/scoring.py
def additive(self):
    """
    This method uses an additive scoring scheme.
    The method can only be used if a larger PSM score is a better PSM score such as the Percolator score.

    Examples:
        >>> score = pyproteininference.scoring.Score(data=data)
        >>> score.additive()
    """

    all_scores = []
    logger.info("Scoring Proteins with additive method")
    for protein in self.pre_score_data:
        val_list = protein.get_psm_scores()

        # Take the sum of our scores
        score = sum(val_list)

        protein.score = score
        all_scores.append(protein)

    # Higher score is better because the additive method assumes a larger PSM score is a better PSM score
    all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)

    self.data.protein_score = self.ADDITIVE
    self.data.short_protein_score = self.SHORT_ADDITIVE
    self.data.scored_proteins = all_scores

best_peptide_per_protein(self)

This method uses a best peptide per protein scoring scheme. The top scoring Psm for each protein is selected as the overall Protein object score.
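The source takes the minimum PSM score per protein and sorts ascending, so "top scoring" here means the lowest value, as with a q-value or posterior error probability. A small standalone sketch under that assumption, with made-up protein names and scores:

```python
def best_peptide_score(psm_scores):
    """Protein score is its single best (lowest) PSM score."""
    return min(float(score) for score in psm_scores)


# Hypothetical q-values; strings are accepted because each value is cast to float
proteins = {"protein_a": ["0.01", "0.2"], "protein_b": [0.005, 0.5, 0.9]}
ranked = sorted(proteins, key=lambda name: best_peptide_score(proteins[name]))
print(ranked)
```

Only the best PSM matters in this scheme, so `protein_b` ranks first even though its other PSMs are weak.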

Examples:

>>> score = pyproteininference.scoring.Score(data=data)
>>> score.best_peptide_per_protein()
Source code in pyproteininference/scoring.py
def best_peptide_per_protein(self):
    """
    This method uses a best peptide per protein scoring scheme.
    The top scoring Psm for each protein is selected as the overall Protein object score.

    Examples:
        >>> score = pyproteininference.scoring.Score(data=data)
        >>> score.best_peptide_per_protein()

    """

    all_scores = []

    logger.info("Scoring Proteins with BPPP")
    for protein in self.pre_score_data:
        val_list = protein.get_psm_scores()
        score = min([float(x) for x in val_list])

        protein.score = score

        all_scores.append(protein)
    # Here do ascending sorting because a lower pep or q value is better
    all_scores = sorted(all_scores, key=lambda k: k.score, reverse=False)

    self.data.protein_score = self.BEST_PEPTIDE_PER_PROTEIN
    self.data.short_protein_score = self.SHORT_BEST_PEPTIDE_PER_PROTEIN
    self.data.scored_proteins = all_scores

down_weighted_multiplicative_log(self)

This method uses a Downweighted Multiplicative Log scoring scheme. The selected PSM scores from all the peptides per protein are multiplied together, and this product is divided by the mean of all PSM scores raised to the number of peptides for that protein. We then take -Log(X) of the resulting value.

Examples:

>>> score = pyproteininference.scoring.Score(data=data)
>>> score.down_weighted_multiplicative_log()
Source code in pyproteininference/scoring.py
def down_weighted_multiplicative_log(self):
    """
    This method uses a Downweighted Multiplicative Log scoring scheme.
    The selected PSM scores from all the peptides per protein are multiplied together, and
    this product is divided by the mean of all PSM scores raised to the number of peptides for that protein.
    We then take -Log(X) of the resulting value.

    Examples:
        >>> score = pyproteininference.scoring.Score(data=data)
        >>> score.down_weighted_multiplicative_log()
    """

    score_list = []
    for proteins in self.pre_score_data:
        cur_scores = proteins.get_psm_scores()
        for scores in cur_scores:
            score_list.append(scores)
    score_mean = numpy.mean(score_list)

    all_scores = []
    logger.info("Scoring Proteins with DWML method")
    for protein in self.pre_score_data:
        val_list = protein.get_psm_scores()
        # Divide by the score mean raised to the length of the number of unique peptides for the protein
        # This is an attempt to normalize for number of peptides per protein
        combine = reduce(lambda x, y: x * y, val_list)
        if combine == 0:
            combine = sys.float_info.min
        score = -math.log(combine / (score_mean ** len(val_list)))
        protein.score = score

        all_scores.append(protein)

    # Higher score is better as a smaller q or pep in a -log will give a larger value
    all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)

    self.data.protein_score = self.DOWNWEIGHTED_MULTIPLICATIVE_LOG
    self.data.short_protein_score = self.SHORT_DOWNWEIGHTED_MULTIPLICATIVE_LOG
    self.data.scored_proteins = all_scores
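The normalization used by this method can be sketched outside the library as follows. This is a minimal illustration, not the package's implementation: the function name `dwml_scores` and the plain dictionary input are assumptions made for the example, and PSM scores are assumed to be PEP or q-values in the range 0 to 1.

```python
import math
import sys
from statistics import mean


def dwml_scores(psm_scores_by_protein):
    """Hypothetical helper: map protein -> -log(prod(scores) / mean_all ** n)."""
    # The mean is taken over every PSM score of every protein
    all_scores = [s for scores in psm_scores_by_protein.values() for s in scores]
    score_mean = mean(all_scores)

    result = {}
    for protein, scores in psm_scores_by_protein.items():
        product = math.prod(scores)
        if product == 0:
            # Avoid log(0) by clamping to the smallest positive normal float
            product = sys.float_info.min
        result[protein] = -math.log(product / score_mean ** len(scores))
    return result


example = dwml_scores({"A": [0.01, 0.02], "B": [0.03]})
```

Because the product is divided by a power of the global mean, a protein whose scores are worse than average (protein "B" here) receives a negative score.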

down_weighted_v2(self)

This method uses a Downweighted Multiplicative Log scoring scheme. Each peptide is iteratively downweighted by raising the peptide QValue or PepValue to the power 1/(1+index_number), where index_number is the zero-based position of the score in the protein's list of Psm scores. Each successive score for a protein contributes less and less weight.

We also take -Log(X) of the final score here.

Examples:

>>> score = pyproteininference.scoring.Score(data=data)
>>> score.down_weighted_v2()
Source code in pyproteininference/scoring.py
def down_weighted_v2(self):
    """
    This method uses a Downweighted Multiplicative Log scoring scheme.
    Each peptide is iteratively downweighted by raising the peptide QValue or PepValue to the
    following power (1/(1+index_number)).
    Where index_number is the peptide number per protein.
    Each score for a protein provides less and less weight iteratively.

    We also take -Log(X) of the final score here.

    Examples:
        >>> score = pyproteininference.scoring.Score(data=data)
        >>> score.down_weighted_v2()
    """

    all_scores = []
    logger.info("Scoring Proteins with down weighted v2 method")
    for protein in self.pre_score_data:
        val_list = protein.get_psm_scores()

        # Here take each score and raise it to the power of (1/(1+index_number)).
        # This downweights each successive score by reducing its weight in a decreasing fashion
        # Basically, each score for a protein will provide less and less weight iteratively
        val_list = [val_list[x] ** (1 / float(1 + x)) for x in range(len(val_list))]
        # val_list = [val_list[x]**(1/float(1+(float(x)/10))) for x in range(len(val_list))]
        score = -math.log(reduce(lambda x, y: x * y, val_list))

        protein.score = score
        all_scores.append(protein)

    # Higher score is better as a smaller q or pep in a -log will give a larger value
    all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)

    self.data.protein_score = self.DOWNWEIGHTED_VERSION2
    self.data.short_protein_score = self.SHORT_DOWNWEIGHTED_VERSION2
    self.data.scored_proteins = all_scores
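Because the log of a product is a sum of logs, the score above can also be written as a weighted sum. The sketch below is illustrative only: `down_weighted_v2_score` is a hypothetical standalone function, and it assumes the list holds no zeros (the method as listed has no zero guard, so `math.log` would raise `ValueError` on a zero product).

```python
import math


def down_weighted_v2_score(psm_scores):
    """Hypothetical helper: -log(prod(s_i ** (1 / (1 + i)))) as a weighted log sum."""
    # log(s ** w) == w * log(s), so each score's log is divided by (1 + index)
    return -sum(math.log(s) / (1 + i) for i, s in enumerate(psm_scores))


weighted_sum_form = down_weighted_v2_score([0.01, 0.04])
product_form = -math.log(0.01 ** 1.0 * 0.04 ** 0.5)
```

Both forms give the same value; the sum form makes the decreasing weights 1, 1/2, 1/3, ... explicit.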

fishers_method(self)

This method uses Fisher's method as the scoring scheme: the protein score is -2 times the sum of the natural logs of the protein's Psm scores.

Examples:

>>> score = pyproteininference.scoring.Score(data=data)
>>> score.fishers_method()
Source code in pyproteininference/scoring.py
    def fishers_method(self):
        """
        This method uses Fisher's method as the scoring scheme.

        Examples:
            >>> score = pyproteininference.scoring.Score(data=data)
            >>> score.fishers_method()

         """

        all_scores = []
        logger.info("Scoring Proteins with fishers method")
        for protein in self.pre_score_data:
            val_list = protein.get_psm_scores()
            score = -2 * sum([math.log(x) for x in val_list])

            protein.score = score

            all_scores.append(protein)
        # Here reverse the sorting to descending because a higher score is better
        all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)
        self.data.protein_score = "fishers_method"
        self.data.short_protein_score = "fm"
        self.data.scored_proteins = all_scores
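For context, Fisher's statistic for k independent p-values follows a chi-squared distribution with 2k degrees of freedom under the null hypothesis, so it can be converted to a combined p-value. The method above stores only the statistic; the conversion below is an added sketch, the function names are hypothetical, and treating PSM scores as independent p-values is an assumption.

```python
import math


def fisher_statistic(p_values):
    """Fisher's statistic: -2 * sum(ln(p))."""
    return -2.0 * sum(math.log(p) for p in p_values)


def fisher_combined_p(p_values):
    """Chi-squared survival function for 2k degrees of freedom (closed form for even df)."""
    k = len(p_values)
    half = fisher_statistic(p_values) / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))


single = fisher_combined_p([0.05])
pair = fisher_combined_p([0.05, 0.05])
```

With a single p-value the combined p-value equals the input, which is a quick sanity check on the closed form.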

geometric_mean_log(self)

This method uses a Geometric Mean scoring scheme.

We also take -Log(X) of the final score here.

Examples:

>>> score = pyproteininference.scoring.Score(data=data)
>>> score.geometric_mean_log()
Source code in pyproteininference/scoring.py
def geometric_mean_log(self):
    """
    This method uses a Geometric Mean scoring scheme.

    We also take -Log(X) of the final score here.

    Examples:
        >>> score = pyproteininference.scoring.Score(data=data)
        >>> score.geometric_mean_log()
    """

    all_scores = []
    logger.info("Scoring Proteins with GML method")
    for protein in self.pre_score_data:
        psm_scores = protein.get_psm_scores()
        val_list = [float(vals) for vals in psm_scores]
        # Combine once, after all scores are collected, rather than on every loop iteration
        combine = reduce(lambda x, y: x * y, val_list)
        if combine == 0:
            combine = sys.float_info.min
        pre_log_score = combine ** (1 / float(len(val_list)))
        score = -math.log(pre_log_score)

        protein.score = score
        all_scores.append(protein)

    # Higher score is better as a smaller q or pep in a -log will give a larger value
    all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)

    self.data.protein_score = self.GEOMETRIC_MEAN
    self.data.short_protein_score = self.SHORT_GEOMETRIC_MEAN
    self.data.scored_proteins = all_scores
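The negative log of a geometric mean equals the arithmetic mean of the negative logs, which avoids forming a product that can underflow. The following is a sketch under that identity, not the library's code: `geometric_mean_log_score` is a hypothetical name, and it assumes every score is strictly positive.

```python
import math


def geometric_mean_log_score(psm_scores):
    """Hypothetical helper: -log(geometric mean), computed as the mean of -log(s)."""
    return -sum(math.log(s) for s in psm_scores) / len(psm_scores)


small = geometric_mean_log_score([0.01, 0.0001])   # geometric mean is 0.001
many = geometric_mean_log_score([1e-3] * 400)      # a direct product would underflow
underflowed_product = math.prod([1e-3] * 400)
```

Note that this differs from the listed method when the product underflows: the method clamps the product to `sys.float_info.min`, whereas the log-sum form never forms the product.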

iterative_down_weighted_log(self)

This method uses a Downweighted Multiplicative Log scoring scheme. Each peptide is iteratively downweighted by multiplying the peptide QValue or PepValue by (1+index_number), where index_number is the zero-based position of the score in the protein's list of Psm scores. Each successive score for a protein contributes less and less weight.

We also take -Log(X) of the final score here.

Examples:

>>> score = pyproteininference.scoring.Score(data=data)
>>> score.iterative_down_weighted_log()
Source code in pyproteininference/scoring.py
def iterative_down_weighted_log(self):
    """
    This method uses a Downweighted Multiplicative Log scoring scheme.
    Each peptide is iteratively downweighted by multiplying the peptide QValue or PepValue by
    (1+index_number), where index_number is the zero-based position of the score in the
    protein's list of Psm scores.
    Each successive score for a protein contributes less and less weight.

    We also take -Log(X) of the final score here.

    Examples:
        >>> score = pyproteininference.scoring.Score(data=data)
        >>> score.iterative_down_weighted_log()
    """

    all_scores = []
    logger.info("Scoring Proteins with IDWL method")
    for protein in self.pre_score_data:
        val_list = protein.get_psm_scores()

        # Here take each score and multiply it by (1 + its index number).
        # This downweights each successive score by reducing its weight in a decreasing fashion
        # Basically, each score for a protein will provide less and less weight iteratively
        val_list = [val_list[x] * (float(1 + x)) for x in range(len(val_list))]
        # val_list = [val_list[x]**(1/float(1+(float(x)/10))) for x in range(len(val_list))]
        combine = reduce(lambda x, y: x * y, val_list)
        if combine == 0:
            combine = sys.float_info.min
        score = -math.log(combine)
        protein.score = score
        all_scores.append(protein)

    # Higher score is better as a smaller q or pep in a -log will give a larger value
    all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)

    self.data.protein_score = self.ITERATIVE_DOWNWEIGHTED_LOG
    self.data.short_protein_score = self.SHORT_ITERATIVE_DOWNWEIGHTED_LOG
    self.data.scored_proteins = all_scores
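Since the multipliers 1, 2, ..., n depend only on position, the weighted product equals the plain product times n!, so the score is the multiplicative log score minus log(n!). The sketch below illustrates this; `iterative_down_weighted_log_score` is a hypothetical standalone function and the input list is assumed to hold PEP or q-values.

```python
import math
import sys


def iterative_down_weighted_log_score(psm_scores):
    """Hypothetical helper: -log(prod(s_i * (1 + i)))."""
    weighted = [s * (1 + i) for i, s in enumerate(psm_scores)]
    product = math.prod(weighted)
    if product == 0:
        # Same zero guard as in the listed method
        product = sys.float_info.min
    return -math.log(product)


scores = [0.01, 0.02, 0.03]
direct = iterative_down_weighted_log_score(scores)
# Equivalent form: plain multiplicative log minus log(n!)
factored = -sum(math.log(s) for s in scores) - math.log(math.factorial(len(scores)))
```

One consequence of this identity is that the penalty depends only on how many PSM scores a protein has, not on their order.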

iterative_down_weighted_v2(self)

This is an experimental method intended for future development of potential scoring schemes. In the source listed here, its computation is the same as that of down_weighted_v2.

Source code in pyproteininference/scoring.py
def iterative_down_weighted_v2(self):
    """
    The following method is an experimental method essentially used for future development of potential scoring
    schemes.
    """

    all_scores = []
    logger.info("Scoring Proteins with iterative down weighted v2 method")
    for protein in self.pre_score_data:
        val_list = protein.get_psm_scores()

        # Here take each score and raise it to the power of (1/(1+index_number)).
        # This downweights each successive score by reducing its weight in a decreasing fashion
        # Basically, each score for a protein will provide less and less weight iteratively
        val_list = [val_list[x] ** (1 / float(1 + x)) for x in range(len(val_list))]
        # val_list = [val_list[x]**(1/float(1+(float(x)/10))) for x in range(len(val_list))]
        score = -math.log(reduce(lambda x, y: x * y, val_list))

        protein.score = score
        all_scores.append(protein)

    # Higher score is better as a smaller q or pep in a -log will give a larger value
    all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)

    self.data.protein_score = "iterative_downweighting2"
    self.data.short_protein_score = "idw2"
    self.data.scored_proteins = all_scores

multiplicative_log(self)

This method uses a Multiplicative Log scoring scheme. The selected Psm scores for all peptides of a protein are multiplied together, and the protein score is -Log(X) of that product.

Examples:

>>> score = pyproteininference.scoring.Score(data=data)
>>> score.multiplicative_log()
Source code in pyproteininference/scoring.py
def multiplicative_log(self):
    """
    This method uses a Multiplicative Log scoring scheme.
    The selected Psm scores for all peptides of a protein are multiplied together, and the protein
    score is -Log(X) of that product.

    Examples:
        >>> score = pyproteininference.scoring.Score(data=data)
        >>> score.multiplicative_log()
    """

    # Instead of making all_scores a list... make it a generator??

    all_scores = []
    logger.info("Scoring Proteins with Multiplicative Log Method")
    for protein in self.pre_score_data:
        # We create a generator of val_list...
        val_list = protein.get_psm_scores()

        combine = reduce(lambda x, y: x * y, val_list)
        if combine == 0:
            combine = sys.float_info.min
        score = -math.log(combine)
        protein.score = score

        all_scores.append(protein)

    # Higher score is better as a smaller q or pep in a -log will give a larger value
    all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)

    self.data.protein_score = self.MULTIPLICATIVE_LOG
    self.data.short_protein_score = self.SHORT_MULTIPLICATIVE_LOG
    self.data.scored_proteins = all_scores
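The zero guard in this method puts a ceiling on the score: when the product is exactly 0, it is replaced by `sys.float_info.min`, whose negative natural log is roughly 708.4 for IEEE 754 doubles. A minimal standalone sketch (the function name `multiplicative_log_score` is an assumption for illustration):

```python
import math
import sys
from functools import reduce


def multiplicative_log_score(psm_scores):
    """Hypothetical helper: -log of the product of the scores, with a zero guard."""
    product = reduce(lambda x, y: x * y, psm_scores)
    if product == 0:
        # log(0) is undefined, so clamp to the smallest positive normal float
        product = sys.float_info.min
    return -math.log(product)


typical = multiplicative_log_score([0.1, 0.01])
clamped = multiplicative_log_score([0.0, 0.5])
```

Any protein with a zero PSM score, or with enough small scores for the product to underflow, therefore receives the same clamped score.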

score_psms(self, score_method='multiplicative_log')

This method dispatches to the actual scoring method given a string input that is defined in the ProteinInferenceParameter object.

Parameters:
  • score_method (str) – Name of the scoring method to call.

Exceptions:
  • ValueError – Raised if the score_method is not present in the constant SCORE_METHODS.

Examples:

>>> score = pyproteininference.scoring.Score(data=data)
>>> score.score_psms(score_method="best_peptide_per_protein")
Source code in pyproteininference/scoring.py
def score_psms(self, score_method="multiplicative_log"):
    """
    This method dispatches to the actual scoring method given a string input that is defined in the
    [ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter] object.

    Args:
        score_method (str): This is a string that represents which scoring method to call.

    Raises:
        ValueError: Will Error out if the score_method is not present in the constant `SCORE_METHODS`.

    Examples:
        >>> score = pyproteininference.scoring.Score(data=data)
        >>> score.score_psms(score_method="best_peptide_per_protein")
    """

    if score_method not in self.SCORE_METHODS:
        raise ValueError(
            "score method '{}' is not a proper method. Score method must be one of the following: '{}'".format(
                score_method, ", ".join(self.SCORE_METHODS)
            )
        )
    else:
        if score_method == self.BEST_PEPTIDE_PER_PROTEIN:
            self.best_peptide_per_protein()
        if score_method == self.ITERATIVE_DOWNWEIGHTED_LOG:
            self.iterative_down_weighted_log()
        if score_method == self.MULTIPLICATIVE_LOG:
            self.multiplicative_log()
        if score_method == self.DOWNWEIGHTED_MULTIPLICATIVE_LOG:
            self.down_weighted_multiplicative_log()
        if score_method == self.DOWNWEIGHTED_VERSION2:
            self.down_weighted_v2()
        if score_method == self.TOP_TWO_COMBINED:
            self.top_two_combied()
        if score_method == self.GEOMETRIC_MEAN:
            self.geometric_mean_log()
        if score_method == self.ADDITIVE:
            self.additive()
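The validate-then-dispatch pattern above can also be expressed with a lookup table. The sketch below uses a hypothetical `TinyScorer` class with two placeholder methods; it is not the pyproteininference `Score` class, and the method-name strings are assumptions chosen for the example.

```python
class TinyScorer:
    """Hypothetical stand-in used only to illustrate string-based dispatch."""

    SCORE_METHODS = ("best_peptide_per_protein", "multiplicative_log")

    def __init__(self):
        self.called = None

    def best_peptide_per_protein(self):
        self.called = "best_peptide_per_protein"

    def multiplicative_log(self):
        self.called = "multiplicative_log"

    def score_psms(self, score_method="multiplicative_log"):
        # Reject unknown names before dispatching
        if score_method not in self.SCORE_METHODS:
            raise ValueError(
                "score method '{}' must be one of: {}".format(
                    score_method, ", ".join(self.SCORE_METHODS)
                )
            )
        # Explicit table, so a name never has to match a method name exactly
        dispatch = {
            "best_peptide_per_protein": self.best_peptide_per_protein,
            "multiplicative_log": self.multiplicative_log,
        }
        dispatch[score_method]()


scorer = TinyScorer()
scorer.score_psms(score_method="best_peptide_per_protein")
```

An explicit table keeps the accepted strings and the callables side by side, which makes it easier to spot a scoring method that is defined but never dispatched.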

top_two_combied(self)

This method uses a Top Two scoring scheme. The top two scores for each protein are multiplied together, the product is divided by 2 as an attempt at normalization, and the score is -Log(X) of that value. If a protein has only 1 score/peptide, the score is -Log(X) of that 1 peptide score.

Examples:

>>> score = pyproteininference.scoring.Score(data=data)
>>> score.top_two_combied()
Source code in pyproteininference/scoring.py
def top_two_combied(self):
    """
    This method uses a Top Two scoring scheme.
    The top two scores for each protein are multiplied together, the product is divided by 2 as an
    attempt at normalization, and we take -Log(X) of that value.
    If a protein has only 1 score/peptide, then we take -Log(X) of that 1 peptide score.

    Examples:
        >>> score = pyproteininference.scoring.Score(data=data)
        >>> score.top_two_combied()
    """

    all_scores = []
    logger.info("Scoring Proteins with Top Two Method")
    for protein in self.pre_score_data:
        val_list = protein.get_psm_scores()

        try:
            # Try to combine the top two scores
            # Divide by 2 to attempt to normalize the value
            score = -math.log((val_list[0] * val_list[1]) / 2)
        except IndexError:
            # If there is only 1 score/1 peptide then just use the 1 peptide provided
            score = -math.log(val_list[0])

        protein.score = score
        all_scores.append(protein)

    # Higher score is better as a smaller q or pep in a -log will give a larger value
    all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)

    self.data.protein_score = self.TOP_TWO_COMBINED
    self.data.short_protein_score = self.SHORT_TOP_TWO_COMBINED
    self.data.scored_proteins = all_scores
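A standalone sketch of the top-two rule, using a length check in place of the `IndexError` handler. The name `top_two_score` is hypothetical, and the sketch assumes the list is non-empty, ordered best (smallest) first, and free of zeros.

```python
import math


def top_two_score(psm_scores):
    """Hypothetical helper: -log(s0 * s1 / 2), or -log(s0) when only one score exists."""
    if len(psm_scores) >= 2:
        # Only the first two scores are used; any further scores are ignored
        return -math.log(psm_scores[0] * psm_scores[1] / 2)
    return -math.log(psm_scores[0])


two_peptides = top_two_score([0.01, 0.02, 0.5])
one_peptide = top_two_score([0.01])
```

The explicit length check behaves the same as the try/except for non-empty lists and makes the single-peptide fallback visible at a glance.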