Py Protein Inference Module

ProteinInferencePipeline

Bases: object

This is the main Protein Inference class which houses the logic of the entire data analysis pipeline. Logic is executed in the execute method.

Attributes:
  • parameter_file (str) –

    Path to Protein Inference Yaml Parameter File.

  • database_file (str) –

    Path to Fasta database used in proteomics search.

  • target_files (str / list) –

    Path to Target Psm File (Or a list of files).

  • decoy_files (str / list) –

    Path to Decoy Psm File (Or a list of files).

  • combined_files (str / list) –

    Path to Combined Psm File (Or a list of files).

  • target_directory (str) –

    Path to Directory containing Target Psm Files.

  • decoy_directory (str) –

    Path to Directory containing Decoy Psm Files.

  • combined_directory (str) –

    Path to Directory containing Combined Psm Files.

  • output_directory (str) –

    Path to Directory where output will be written.

  • output_filename (str) –

    Path to Filename where output will be written. Will override output_directory.

  • id_splitting (bool) –

    True/False on whether to split protein IDs in the digest. Advanced usage only.

  • append_alt_from_db (bool) –

    True/False on whether to append alternative proteins from the DB digestion in Reader class.

  • data (DataStore) –
  • digest (Digest) –
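
When `target_directory`, `decoy_directory`, or `combined_directory` is supplied instead of explicit file lists, the pipeline gathers the `.txt`/`.tsv` files inside the directory before reading. A minimal standalone sketch of that gathering step (illustrative only, not the package's own function):

```python
import os
import tempfile

def gather_psm_files(directory):
    """Collect .txt/.tsv PSM files from a directory, mirroring how the
    pipeline turns a *_directory argument into a list of file paths."""
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.endswith((".txt", ".tsv"))
    )

# Demo against a throwaway directory.
with tempfile.TemporaryDirectory() as d:
    for name in ("target_psms.txt", "notes.md", "more_psms.tsv"):
        open(os.path.join(d, name), "w").close()
    files = gather_psm_files(d)
    # Only the .txt/.tsv files are kept; notes.md is ignored.
    print([os.path.basename(f) for f in files])  # ['more_psms.tsv', 'target_psms.txt']
```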
Source code in pyproteininference/pipeline.py
class ProteinInferencePipeline(object):
    """
    This is the main Protein Inference class which houses the logic of the entire data analysis pipeline.
    Logic is executed in the [execute][pyproteininference.pipeline.ProteinInferencePipeline.execute] method.

    Attributes:
        parameter_file (str): Path to Protein Inference Yaml Parameter File.
        database_file (str): Path to Fasta database used in proteomics search.
        target_files (str/list): Path to Target Psm File (Or a list of files).
        decoy_files (str/list): Path to Decoy Psm File (Or a list of files).
        combined_files (str/list): Path to Combined Psm File (Or a list of files).
        target_directory (str): Path to Directory containing Target Psm Files.
        decoy_directory (str): Path to Directory containing Decoy Psm Files.
        combined_directory (str): Path to Directory containing Combined Psm Files.
        output_directory (str): Path to Directory where output will be written.
        output_filename (str): Path to Filename where output will be written. Will override output_directory.
        id_splitting (bool): True/False on whether to split protein IDs in the digest. Advanced usage only.
        append_alt_from_db (bool): True/False on whether to append alternative proteins from the DB digestion in
            Reader class.
        data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
        digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].

    """

    def __init__(
        self,
        parameter_file,
        database_file=None,
        target_files=None,
        decoy_files=None,
        combined_files=None,
        target_directory=None,
        decoy_directory=None,
        combined_directory=None,
        output_directory=None,
        output_filename=None,
        id_splitting=False,
        append_alt_from_db=True,
    ):
        """

        Args:
            parameter_file (str/Configuration): Path to Protein Inference Yaml Parameter File.
            database_file (str): Path to Fasta database used in proteomics search.
            target_files (str/list): Path to Target Psm File (Or a list of files).
            decoy_files (str/list): Path to Decoy Psm File (Or a list of files).
            combined_files (str/list): Path to Combined Psm File (Or a list of files).
            target_directory (str): Path to Directory containing Target Psm Files.
            decoy_directory (str): Path to Directory containing Decoy Psm Files.
            combined_directory (str): Path to Directory containing Combined Psm Files.
            output_filename (str): Path to Filename where output will be written. Will override output_directory.
            output_directory (str): Path to Directory where output will be written.
            id_splitting (bool): True/False on whether to split protein IDs in the digest. Advanced usage only.
            append_alt_from_db (bool): True/False on whether to append alternative proteins from the DB digestion in
                Reader class.

        Returns:
            object:

        Example:
            >>> pipeline = pyproteininference.pipeline.ProteinInferencePipeline(
            >>>     parameter_file=yaml_params,
            >>>     database_file=database,
            >>>     target_files=target,
            >>>     decoy_files=decoy,
            >>>     combined_files=combined_files,
            >>>     target_directory=target_directory,
            >>>     decoy_directory=decoy_directory,
            >>>     combined_directory=combined_directory,
            >>>     output_directory=dir_name,
            >>>     output_filename=output_filename,
            >>>     append_alt_from_db=append_alt,
            >>> )
        """

        self.parameter_file = parameter_file
        self.database_file = database_file
        self.target_files = target_files
        self.decoy_files = decoy_files
        self.combined_files = combined_files
        self.target_directory = target_directory
        self.decoy_directory = decoy_directory
        self.combined_directory = combined_directory
        self.output_directory = output_directory
        self.output_filename = output_filename
        self.id_splitting = id_splitting
        self.append_alt_from_db = append_alt_from_db
        self.data = None
        self.digest = None
        self.gui_status_queue = None

        self._validate_input()

        self._set_output_directory()

        self._log_append_alt_from_db()

        self._log_id_splitting()

    @classmethod
    def create_from_gui_config(cls, queue: Queue, config: Configuration):
        """Creates the ProteinInferencePipeline from a Config object passed from the graphical user interface."""
        pipeline = cls(
            parameter_file=config,
            database_file=config.fasta_file[0] if isinstance(config.fasta_file, list) else None,
            combined_files=list(config.input_files),
            output_filename=config.output_file,
            id_splitting=config.identifier_splitting,
            append_alt_from_db=config.use_alt_proteins,
        )
        pipeline.gui_status_queue = queue
        return pipeline

    def execute(self):
        """
        This method is the main driver of the data analysis for the protein inference package.
        It calls the other classes and methods that make up the protein inference pipeline and sets the
        data [DataStore Object][pyproteininference.datastore.DataStore] and digest
        [Digest Object][pyproteininference.in_silico_digest.Digest] attributes.
        The pipeline includes, but is not limited to:

        1. Parameter file management.
        2. Digesting Fasta Database (Optional).
        3. Reading in input Psm Files.
        4. Initializing the [DataStore Object][pyproteininference.datastore.DataStore].
        5. Restricting Psms.
        6. Creating Protein objects/scoring input.
        7. Scoring Proteins.
        8. Running Protein Picker.
        9. Running Inference Methods/Grouping.
        10. Calculating Q Values.
        11. Exporting Proteins to filesystem.

        Example:
            >>> pipeline = pyproteininference.pipeline.ProteinInferencePipeline(
            >>>     parameter_file=yaml_params,
            >>>     database_file=database,
            >>>     target_files=target,
            >>>     decoy_files=decoy,
            >>>     combined_files=combined_files,
            >>>     target_directory=target_directory,
            >>>     decoy_directory=decoy_directory,
            >>>     combined_directory=combined_directory,
            >>>     output_directory=dir_name,
            >>>     output_filename=output_filename,
            >>>     append_alt_from_db=append_alt,
            >>> )
            >>> pipeline.execute()

        """
        # STEP 1: Load parameter file #
        # STEP 1: Load parameter file #
        # STEP 1: Load parameter file #
        self._update_status(0, "Configuring analysis")
        if isinstance(self.parameter_file, Configuration):
            pyproteininference_parameters = pyproteininference.parameters.ProteinInferenceParameter(
                None, configuration=self.parameter_file
            )
        else:
            pyproteininference_parameters = pyproteininference.parameters.ProteinInferenceParameter(
                yaml_param_filepath=self.parameter_file
            )

        # STEP 2: Start with running an In Silico Digestion #
        # STEP 2: Start with running an In Silico Digestion #
        # STEP 2: Start with running an In Silico Digestion #
        self._update_status(0.05, "Running In Silico Digestion")
        digest = pyproteininference.in_silico_digest.PyteomicsDigest(
            database_path=self.database_file,
            digest_type=pyproteininference_parameters.digest_type,
            missed_cleavages=pyproteininference_parameters.missed_cleavages,
            reviewed_identifier_symbol=pyproteininference_parameters.reviewed_identifier_symbol,
            max_peptide_length=pyproteininference_parameters.restrict_peptide_length,
            id_splitting=self.id_splitting,
        )
        if self.database_file:
            logger.info("Running In Silico Database Digest on file {}".format(self.database_file))
            digest.digest_fasta_database()
        else:
            logger.warning(
                "No Database File provided, Skipping database digest and only taking protein-peptide mapping from the "
                "input files."
            )

        # STEP 3: Read PSM Data #
        # STEP 3: Read PSM Data #
        # STEP 3: Read PSM Data #
        self._update_status(0.25, "Reading PSM Data")

        def _as_list(x: Union[str, List[str]]) -> List[str]:
            return [x] if isinstance(x, str) else x

        input_files = (
            _as_list(self.target_files)
            if self.target_files
            else (
                _as_list(self.decoy_files)
                if self.decoy_files
                else _as_list(self.combined_files) if self.combined_files else list()
            )
        )
        extensions = set([os.path.splitext(x)[1].lower() for x in input_files])
        if len(extensions) > 1:
            raise ValueError("All input files must be of the same type and have the same file extension.")
        logger.info("File(s) have extensions: {}".format(extensions))
        if (
            ".idxml" in extensions
            or ".mzid" in extensions
            or ".pep.xml" in extensions
            or ".xml" in extensions
            or ".pepxml" in extensions
        ):
            reader = pyproteininference.reader.IdXMLReader(
                target_file=self.target_files,
                decoy_file=self.decoy_files,
                combined_files=self.combined_files,
                parameter_file_object=pyproteininference_parameters,
                digest=digest,
                append_alt_from_db=self.append_alt_from_db,
            )
        else:
            reader = pyproteininference.reader.GenericReader(
                target_file=self.target_files,
                decoy_file=self.decoy_files,
                combined_files=self.combined_files,
                parameter_file_object=pyproteininference_parameters,
                digest=digest,
                append_alt_from_db=self.append_alt_from_db,
            )
        reader.read_psms()

        # STEP 4: Initiate the datastore object #
        # STEP 4: Initiate the datastore object #
        # STEP 4: Initiate the datastore object #
        data = pyproteininference.datastore.DataStore(reader=reader, digest=digest)

        # Step 5: Restrict the PSM data
        # Step 5: Restrict the PSM data
        # Step 5: Restrict the PSM data
        self._update_status(0.50, "Filtering PSM Data")
        data.restrict_psm_data()

        data.recover_mapping()
        # Step 6: Generate protein scoring input
        # Step 6: Generate protein scoring input
        # Step 6: Generate protein scoring input
        self._update_status(0.60, "Calculating Scores")
        data.create_scoring_input()

        # Step 7: Remove non unique peptides if running exclusion
        # Step 7: Remove non unique peptides if running exclusion
        # Step 7: Remove non unique peptides if running exclusion
        if pyproteininference_parameters.inference_type == Inference.EXCLUSION:
            # This only runs when using exclusion inference.
            data.exclude_non_distinguishing_peptides()

        # STEP 8: Score our PSMs given a score method
        # STEP 8: Score our PSMs given a score method
        # STEP 8: Score our PSMs given a score method
        score = pyproteininference.scoring.Score(data=data)
        score.score_psms(score_method=pyproteininference_parameters.protein_score)

        # STEP 9: Run protein picker on the data
        # STEP 9: Run protein picker on the data
        # STEP 9: Run protein picker on the data
        self._update_status(0.65, "Selecting Proteins")
        if pyproteininference_parameters.picker:
            data.protein_picker()
        else:
            pass

        # STEP 10: Apply Inference
        # STEP 10: Apply Inference
        # STEP 10: Apply Inference
        self._update_status(0.75, "Performing Inference")
        pyproteininference.inference.Inference.run_inference(data=data, digest=digest)

        # STEP 11: Q value Calculations
        # STEP 11: Q value Calculations
        # STEP 11: Q value Calculations
        self._update_status(0.90, "Calculating Q Values")
        data.calculate_q_values()

        # STEP 12: Export to CSV
        # STEP 12: Export to CSV
        # STEP 12: Export to CSV
        self._update_status(0.95, "Saving Results")
        export = pyproteininference.export.Export(data=data)
        export.export_to_csv(
            output_filename=self.output_filename,
            directory=self.output_directory,
            export_type=pyproteininference_parameters.export,
        )

        self.data = data
        self.digest = digest

        logger.info("Protein Inference Finished")
        self._update_status(1, "Protein Inference Finished")

    def _update_status(self, percentage: float, message: str):
        """
        Internal method for updating the status of the pipeline in the GUI.

        Args:
            percentage (float): The percentage of the pipeline that has been completed.
            message (str): The message to display to the user.
        """
        if self.gui_status_queue:
            self.gui_status_queue.put_nowait((percentage, message))

    def _validate_input(self):
        """
        Internal method that validates whether the proper input files have been defined.

        One of the following combinations must be selected as input. No more and no less:

        1. either one or multiple target_files and decoy_files.
        2. either one or multiple combined_files that include target and decoy data.
        3. a directory that contains target files (target_directory) as well as a directory that contains decoy files
            (decoy_directory).
        4. a directory that contains combined target/decoy files (combined_directory).

        Raises:
            ValueError: Raised if an improper combination of input files/directories is supplied.
        """
        if (
            self.target_files
            and self.decoy_files
            and not self.combined_files
            and not self.target_directory
            and not self.decoy_directory
            and not self.combined_directory
        ):
            logger.info("Validating input as target_files and decoy_files")
        elif (
            self.combined_files
            and not self.target_files
            and not self.decoy_files
            and not self.decoy_directory
            and not self.target_directory
            and not self.combined_directory
        ):
            logger.info("Validating input as combined_files")
        elif (
            self.target_directory
            and self.decoy_directory
            and not self.target_files
            and not self.decoy_files
            and not self.combined_directory
            and not self.combined_files
        ):
            logger.info("Validating input as target_directory and decoy_directory")
            self._transform_directory_to_files()
        elif (
            self.combined_directory
            and not self.combined_files
            and not self.decoy_files
            and not self.decoy_directory
            and not self.target_files
            and not self.target_directory
        ):
            logger.info("Validating input as combined_directory")
            self._transform_directory_to_files()
        else:
            raise ValueError(
                "To run Protein inference please supply either: "
                "(1) either one or multiple target_files and decoy_files, "
                "(2) either one or multiple combined_files that include target and decoy data"
                "(3) a directory that contains target files (target_directory) as well as a directory that "
                "contains decoy files (decoy_directory)"
                "(4) a directory that contains combined target/decoy files (combined_directory)"
            )

    def _transform_directory_to_files(self):
        """
        This internal method takes files that are in the target_directory, decoy_directory, or combined_directory and
        reassigns these files to the target_files, decoy_files, and combined_files to be used in
         [Reader][pyproteininference.reader.Reader] object.
        """
        if self.target_directory and self.decoy_directory:
            logger.info("Transforming target_directory and decoy_directory into files")
            target_files = os.listdir(self.target_directory)
            target_files_full = [
                os.path.join(self.target_directory, x) for x in target_files if x.endswith(".txt") or x.endswith(".tsv")
            ]

            decoy_files = os.listdir(self.decoy_directory)
            decoy_files_full = [
                os.path.join(self.decoy_directory, x) for x in decoy_files if x.endswith(".txt") or x.endswith(".tsv")
            ]

            self.target_files = target_files_full
            self.decoy_files = decoy_files_full

        elif self.combined_directory:
            logger.info("Transforming combined_directory into files")
            combined_files = os.listdir(self.combined_directory)
            combined_files_full = [
                os.path.join(self.combined_directory, x)
                for x in combined_files
                if x.endswith(".txt") or x.endswith(".tsv")
            ]
            self.combined_files = combined_files_full

    def _set_output_directory(self):
        """
        Internal method for setting the output directory.
        If the output_directory argument is not supplied the output directory is set as the cwd.
        """
        if not self.output_directory:
            self.output_directory = os.getcwd()
        else:
            pass

    def _log_append_alt_from_db(self):
        """
        Internal method for logging whether the user sets alternative protein append to True or False.
        """
        if self.append_alt_from_db:
            logger.info("Append Alternative Proteins from Database set to True")
        else:
            logger.info("Append Alternative Proteins from Database set to False")

    def _log_id_splitting(self):
        """
        Internal method for logging whether the user sets ID splitting to True or False.
        """
        if self.id_splitting:
            logger.info("ID Splitting for Database Digestion set to True")
        else:
            logger.info("ID Splitting for Database Digestion set to False")

__init__(parameter_file, database_file=None, target_files=None, decoy_files=None, combined_files=None, target_directory=None, decoy_directory=None, combined_directory=None, output_directory=None, output_filename=None, id_splitting=False, append_alt_from_db=True)

Parameters:
  • parameter_file (str / Configuration) –

    Path to Protein Inference Yaml Parameter File.

  • database_file (str, default: None ) –

    Path to Fasta database used in proteomics search.

  • target_files (str / list, default: None ) –

    Path to Target Psm File (Or a list of files).

  • decoy_files (str / list, default: None ) –

    Path to Decoy Psm File (Or a list of files).

  • combined_files (str / list, default: None ) –

    Path to Combined Psm File (Or a list of files).

  • target_directory (str, default: None ) –

    Path to Directory containing Target Psm Files.

  • decoy_directory (str, default: None ) –

    Path to Directory containing Decoy Psm Files.

  • combined_directory (str, default: None ) –

    Path to Directory containing Combined Psm Files.

  • output_filename (str, default: None ) –

    Path to Filename where output will be written. Will override output_directory.

  • output_directory (str, default: None ) –

    Path to Directory where output will be written.

  • id_splitting (bool, default: False ) –

    True/False on whether to split protein IDs in the digest. Advanced usage only.

  • append_alt_from_db (bool, default: True ) –

    True/False on whether to append alternative proteins from the DB digestion in Reader class.

Returns:
  • object
Example

pipeline = pyproteininference.pipeline.ProteinInferencePipeline(
    parameter_file=yaml_params,
    database_file=database,
    target_files=target,
    decoy_files=decoy,
    combined_files=combined_files,
    target_directory=target_directory,
    decoy_directory=decoy_directory,
    combined_directory=combined_directory,
    output_directory=dir_name,
    output_filename=output_filename,
    append_alt_from_db=append_alt,
)

Source code in pyproteininference/pipeline.py
def __init__(
    self,
    parameter_file,
    database_file=None,
    target_files=None,
    decoy_files=None,
    combined_files=None,
    target_directory=None,
    decoy_directory=None,
    combined_directory=None,
    output_directory=None,
    output_filename=None,
    id_splitting=False,
    append_alt_from_db=True,
):
    """

    Args:
        parameter_file (str/Configuration): Path to Protein Inference Yaml Parameter File.
        database_file (str): Path to Fasta database used in proteomics search.
        target_files (str/list): Path to Target Psm File (Or a list of files).
        decoy_files (str/list): Path to Decoy Psm File (Or a list of files).
        combined_files (str/list): Path to Combined Psm File (Or a list of files).
        target_directory (str): Path to Directory containing Target Psm Files.
        decoy_directory (str): Path to Directory containing Decoy Psm Files.
        combined_directory (str): Path to Directory containing Combined Psm Files.
        output_filename (str): Path to Filename where output will be written. Will override output_directory.
        output_directory (str): Path to Directory where output will be written.
        id_splitting (bool): True/False on whether to split protein IDs in the digest. Advanced usage only.
        append_alt_from_db (bool): True/False on whether to append alternative proteins from the DB digestion in
            Reader class.

    Returns:
        object:

    Example:
        >>> pipeline = pyproteininference.pipeline.ProteinInferencePipeline(
        >>>     parameter_file=yaml_params,
        >>>     database_file=database,
        >>>     target_files=target,
        >>>     decoy_files=decoy,
        >>>     combined_files=combined_files,
        >>>     target_directory=target_directory,
        >>>     decoy_directory=decoy_directory,
        >>>     combined_directory=combined_directory,
        >>>     output_directory=dir_name,
        >>>     output_filename=output_filename,
        >>>     append_alt_from_db=append_alt,
        >>> )
    """

    self.parameter_file = parameter_file
    self.database_file = database_file
    self.target_files = target_files
    self.decoy_files = decoy_files
    self.combined_files = combined_files
    self.target_directory = target_directory
    self.decoy_directory = decoy_directory
    self.combined_directory = combined_directory
    self.output_directory = output_directory
    self.output_filename = output_filename
    self.id_splitting = id_splitting
    self.append_alt_from_db = append_alt_from_db
    self.data = None
    self.digest = None
    self.gui_status_queue = None

    self._validate_input()

    self._set_output_directory()

    self._log_append_alt_from_db()

    self._log_id_splitting()

create_from_gui_config(queue, config) classmethod

Creates the ProteinInferencePipeline from a Config object passed from the graphical user interface.

Source code in pyproteininference/pipeline.py
113
114
115
116
117
118
119
120
121
122
123
124
125
@classmethod
def create_from_gui_config(cls, queue: Queue, config: Configuration):
    """Creates the ProteinInferencePipeline from a Config object passed from the graphical user interface."""
    pipeline = cls(
        parameter_file=config,
        database_file=config.fasta_file[0] if isinstance(config.fasta_file, list) else None,
        combined_files=list(config.input_files),
        output_filename=config.output_file,
        id_splitting=config.identifier_splitting,
        append_alt_from_db=config.use_alt_proteins,
    )
    pipeline.gui_status_queue = queue
    return pipeline
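
The `gui_status_queue` set here receives `(percentage, message)` tuples pushed by the pipeline's internal `_update_status` calls. A hypothetical GUI-side consumer might drain the queue between repaints; the helper below is illustrative, not part of the package:

```python
import queue

def drain_status(status_queue):
    """Drain all pending (percentage, message) updates without blocking,
    as a GUI poll loop might between repaints."""
    updates = []
    while True:
        try:
            updates.append(status_queue.get_nowait())
        except queue.Empty:
            break
    return updates

# Simulate the pipeline side pushing a few updates.
q = queue.Queue()
q.put_nowait((0.05, "Running In Silico Digestion"))
q.put_nowait((0.25, "Reading PSM Data"))
for pct, msg in drain_status(q):
    print("{:>4.0%}  {}".format(pct, msg))
```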

execute()

This method is the main driver of the data analysis for the protein inference package. It calls the other classes and methods that make up the protein inference pipeline and sets the data DataStore Object and digest Digest Object attributes. The pipeline includes, but is not limited to:

  1. Parameter file management.
  2. Digesting Fasta Database (Optional).
  3. Reading in input Psm Files.
  4. Initializing the DataStore Object.
  5. Restricting Psms.
  6. Creating Protein objects/scoring input.
  7. Scoring Proteins.
  8. Running Protein Picker.
  9. Running Inference Methods/Grouping.
  10. Calculating Q Values.
  11. Exporting Proteins to filesystem.
Example

pipeline = pyproteininference.pipeline.ProteinInferencePipeline(
    parameter_file=yaml_params,
    database_file=database,
    target_files=target,
    decoy_files=decoy,
    combined_files=combined_files,
    target_directory=target_directory,
    decoy_directory=decoy_directory,
    combined_directory=combined_directory,
    output_directory=dir_name,
    output_filename=output_filename,
    append_alt_from_db=append_alt,
)
pipeline.execute()
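
In step 3, `execute` collects the input paths, verifies they share a single file extension, and chooses a reader class from that extension. The decision can be sketched standalone (the reader names are the package's; the helper itself is illustrative):

```python
import os

# Extensions that route to the XML-style identification-file reader.
XML_LIKE = {".idxml", ".mzid", ".pep.xml", ".xml", ".pepxml"}

def choose_reader(paths):
    """Return 'IdXMLReader' for XML-style identification files and
    'GenericReader' otherwise; reject mixed extensions, as execute() does."""
    extensions = {os.path.splitext(p)[1].lower() for p in paths}
    if len(extensions) > 1:
        raise ValueError("All input files must have the same file extension.")
    return "IdXMLReader" if extensions & XML_LIKE else "GenericReader"

print(choose_reader(["run1.idXML", "run2.idxml"]))  # IdXMLReader
print(choose_reader(["target.txt"]))                # GenericReader
```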

Source code in pyproteininference/pipeline.py
def execute(self):
    """
    This method is the main driver of the data analysis for the protein inference package.
    It calls the other classes and methods that make up the protein inference pipeline and sets the
    data [DataStore Object][pyproteininference.datastore.DataStore] and digest
    [Digest Object][pyproteininference.in_silico_digest.Digest] attributes.
    The pipeline includes, but is not limited to:

    1. Parameter file management.
    2. Digesting Fasta Database (Optional).
    3. Reading in input Psm Files.
    4. Initializing the [DataStore Object][pyproteininference.datastore.DataStore].
    5. Restricting Psms.
    6. Creating Protein objects/scoring input.
    7. Scoring Proteins.
    8. Running Protein Picker.
    9. Running Inference Methods/Grouping.
    10. Calculating Q Values.
    11. Exporting Proteins to filesystem.

    Example:
        >>> pipeline = pyproteininference.pipeline.ProteinInferencePipeline(
        >>>     parameter_file=yaml_params,
        >>>     database_file=database,
        >>>     target_files=target,
        >>>     decoy_files=decoy,
        >>>     combined_files=combined_files,
        >>>     target_directory=target_directory,
        >>>     decoy_directory=decoy_directory,
        >>>     combined_directory=combined_directory,
        >>>     output_directory=dir_name,
        >>>     output_filename=output_filename,
        >>>     append_alt_from_db=append_alt,
        >>> )
        >>> pipeline.execute()

    """
    # STEP 1: Load parameter file #
    # STEP 1: Load parameter file #
    # STEP 1: Load parameter file #
    self._update_status(0, "Configuring analysis")
    if isinstance(self.parameter_file, Configuration):
        pyproteininference_parameters = pyproteininference.parameters.ProteinInferenceParameter(
            None, configuration=self.parameter_file
        )
    else:
        pyproteininference_parameters = pyproteininference.parameters.ProteinInferenceParameter(
            yaml_param_filepath=self.parameter_file
        )

    # STEP 2: Start with running an In Silico Digestion #
    self._update_status(0.05, "Running In Silico Digestion")
    digest = pyproteininference.in_silico_digest.PyteomicsDigest(
        database_path=self.database_file,
        digest_type=pyproteininference_parameters.digest_type,
        missed_cleavages=pyproteininference_parameters.missed_cleavages,
        reviewed_identifier_symbol=pyproteininference_parameters.reviewed_identifier_symbol,
        max_peptide_length=pyproteininference_parameters.restrict_peptide_length,
        id_splitting=self.id_splitting,
    )
    if self.database_file:
        logger.info("Running In Silico Database Digest on file {}".format(self.database_file))
        digest.digest_fasta_database()
    else:
        logger.warning(
            "No Database File provided, Skipping database digest and only taking protein-peptide mapping from the "
            "input files."
        )

    # STEP 3: Read PSM Data #
    self._update_status(0.25, "Reading PSM Data")

    def _as_list(x: Union[str, List[str]]) -> List[str]:
        return [x] if isinstance(x, str) else x

    if self.target_files:
        input_files = _as_list(self.target_files)
    elif self.decoy_files:
        input_files = _as_list(self.decoy_files)
    elif self.combined_files:
        input_files = _as_list(self.combined_files)
    else:
        input_files = []
    extensions = set([os.path.splitext(x)[1].lower() for x in input_files])
    if len(extensions) > 1:
        raise ValueError("All input files must be of the same type and have the same file extension.")
    logger.info("File(s) have extensions: {}".format(extensions))
    if (
        ".idxml" in extensions
        or ".mzid" in extensions
        or ".pep.xml" in extensions
        or ".xml" in extensions
        or ".pepxml" in extensions
    ):
        reader = pyproteininference.reader.IdXMLReader(
            target_file=self.target_files,
            decoy_file=self.decoy_files,
            combined_files=self.combined_files,
            parameter_file_object=pyproteininference_parameters,
            digest=digest,
            append_alt_from_db=self.append_alt_from_db,
        )
    else:
        reader = pyproteininference.reader.GenericReader(
            target_file=self.target_files,
            decoy_file=self.decoy_files,
            combined_files=self.combined_files,
            parameter_file_object=pyproteininference_parameters,
            digest=digest,
            append_alt_from_db=self.append_alt_from_db,
        )
    reader.read_psms()

    # STEP 4: Initiate the datastore object #
    data = pyproteininference.datastore.DataStore(reader=reader, digest=digest)

    # Step 5: Restrict the PSM data
    self._update_status(0.50, "Filtering PSM Data")
    data.restrict_psm_data()

    data.recover_mapping()
    # Step 6: Generate protein scoring input
    self._update_status(0.60, "Calculating Scores")
    data.create_scoring_input()

    # Step 7: Remove non unique peptides if running exclusion
    if pyproteininference_parameters.inference_type == Inference.EXCLUSION:
        # This runs only for the exclusion inference type
        data.exclude_non_distinguishing_peptides()

    # STEP 8: Score our PSMs given a score method
    score = pyproteininference.scoring.Score(data=data)
    score.score_psms(score_method=pyproteininference_parameters.protein_score)

    # STEP 9: Run protein picker on the data
    self._update_status(0.65, "Selecting Proteins")
    if pyproteininference_parameters.picker:
        data.protein_picker()

    # STEP 10: Apply Inference
    self._update_status(0.75, "Performing Inference")
    pyproteininference.inference.Inference.run_inference(data=data, digest=digest)

    # STEP 11: Q value Calculations
    self._update_status(0.90, "Calculating Q Values")
    data.calculate_q_values()

    # STEP 12: Export to CSV
    self._update_status(0.95, "Saving Results")
    export = pyproteininference.export.Export(data=data)
    export.export_to_csv(
        output_filename=self.output_filename,
        directory=self.output_directory,
        export_type=pyproteininference_parameters.export,
    )

    self.data = data
    self.digest = digest

    logger.info("Protein Inference Finished")
    self._update_status(1, "Protein Inference Finished")
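
The reader selection in STEP 3 keys off the shared file extension of the inputs. A minimal stdlib-only sketch of that dispatch (`detect_reader_kind` is a hypothetical helper name for illustration, not part of the library API):

```python
import os

def detect_reader_kind(input_files):
    # Collect the lowercased extensions of all input files and require
    # them to be uniform, mirroring the pipeline's check.
    extensions = {os.path.splitext(f)[1].lower() for f in input_files}
    if len(extensions) > 1:
        raise ValueError("All input files must be of the same type and have the same file extension.")
    # XML-style identification formats route to IdXMLReader; anything
    # else (e.g. tab-separated search results) routes to GenericReader.
    xml_like = {".idxml", ".mzid", ".pepxml", ".xml"}
    return "IdXMLReader" if extensions & xml_like else "GenericReader"
```

Note that `os.path.splitext` keeps only the final suffix, so a file named `sample.pep.xml` is seen as `.xml` and still routes to the XML reader.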

ProteinInferenceParameter

Bases: object

Class that handles data retrieval, storage, and validation of Protein Inference Parameters.

Attributes:
  • yaml_param_filepath (str) –

    path to properly formatted parameter file specific to Protein Inference.

  • digest_type (str) –

    String that determines the type of in silico digestion for the Digest object. Typically "trypsin".

  • export (str) –

    String to indicate the export type for Export object. Typically this is "psms", "peptides", or "psm_ids".

  • fdr (float) –

    Float to indicate FDR filtering.

  • missed_cleavages (int) –

    Integer to determine the number of missed cleavages in the database digestion Digest object.

  • picker (bool) –

    True/False on whether or not to run the protein picker algorithm.

  • restrict_pep (float / None) –

    Float to restrict the posterior error probability values by in the PSM input. Used in restrict_psm_data.

  • restrict_peptide_length (int / None) –

    Integer to restrict the peptide length values by in the PSM input. Used in restrict_psm_data.

  • restrict_q (float / None) –

    Float to restrict the q values by in the PSM input. Used in restrict_psm_data.

  • restrict_custom (float / None) –

    Float to restrict the custom values by in the PSM input. Used in restrict_psm_data. Filtering depends on score_type variable. If score_type is multiplicative then values that are less than restrict_custom are kept. If score_type is additive then values that are more than restrict_custom are kept.

  • protein_score (str) –

    String to determine the way in which Proteins are scored. Can be any of the SCORE_METHODS in the Score object.

  • psm_score_type (str) –

    String to determine the type of score that the PSM scores are (Additive or Multiplicative). Can be any of the SCORE_TYPES in the Score object.

  • decoy_symbol (str) –

    String to distinguish decoy proteins from target proteins, e.g. "##".

  • isoform_symbol (str) –

    String to denote isoforms from regular proteins, e.g. "-". Can also be None.

  • reviewed_identifier_symbol (str) –

    String to denote a "Reviewed" protein. Typically "sp|" when using a UniProt Fasta database.

  • inference_type (str) –

    String to determine the inference procedure. Can be any value of INFERENCE_TYPES of Inference object.

  • tag (str) –

    String to be added to output files.

  • psm_score (str) –

    String that indicates the PSM input score. The value should match the name of the score in the input data that you want to use as the PSM score. This score is used in the scoring methods of the Score object.

  • grouping_type (str / None) –

    String to determine the grouping procedure. Can be any value of GROUPING_TYPES of Inference object.

  • max_identifiers_peptide_centric (int) –

    Maximum number of identifiers to assign to a group when running peptide_centric inference. Typically this is 10 or 5.

  • lp_solver (str / None) –

    The LP solver to use if inference_type="Parsimony". Can be any value in LP_SOLVERS in the Inference object.

Source code in pyproteininference/parameters.py
class ProteinInferenceParameter(object):
    """
    Class that handles data retrieval, storage, and validation of Protein Inference Parameters.

    Attributes:
        yaml_param_filepath (str): path to properly formatted parameter file specific to Protein Inference.
        digest_type (str): String that determines the type of in silico digestion for the
            [Digest object][pyproteininference.in_silico_digest.Digest]. Typically "trypsin".
        export (str): String to indicate the export type for [Export object][pyproteininference.export.Export].
            Typically this is "psms", "peptides", or "psm_ids".
        fdr (float): Float to indicate FDR filtering.
        missed_cleavages (int): Integer to determine the number of missed cleavages in the database digestion
            [Digest object][pyproteininference.in_silico_digest.Digest].
        picker (bool): True/False on whether or not to run
            the [protein picker][pyproteininference.datastore.DataStore.protein_picker] algorithm.
        restrict_pep (float/None): Float to restrict the posterior error probability values by in the PSM input.
            Used in [restrict_psm_data][pyproteininference.datastore.DataStore.restrict_psm_data].
        restrict_peptide_length (int/None): Integer to restrict the peptide length values by in the PSM input.
            Used in [restrict_psm_data][pyproteininference.datastore.DataStore.restrict_psm_data].
        restrict_q (float/None): Float to restrict the q values by in the PSM input.
            Used in [restrict_psm_data][pyproteininference.datastore.DataStore.restrict_psm_data].
        restrict_custom (float/None): Float to restrict the custom values by in the PSM input.
            Used in [restrict_psm_data][pyproteininference.datastore.DataStore.restrict_psm_data].
            Filtering depends on score_type variable. If score_type is multiplicative then values that are less than
            restrict_custom are kept. If score_type is additive then values that are more than restrict_custom are kept.
        protein_score (str): String to determine the way in which Proteins are scored. Can be any of the
            SCORE_METHODS in [Score object][pyproteininference.scoring.Score].
        psm_score_type (str): String to determine the type of score that the PSM scores are
            (Additive or Multiplicative). Can be any of the SCORE_TYPES
            in [Score object][pyproteininference.scoring.Score].
        decoy_symbol (str): String to distinguish decoy proteins from target proteins, e.g. "##".
        isoform_symbol (str): String to denote isoforms from regular proteins, e.g. "-". Can also be None.
        reviewed_identifier_symbol (str): String to denote a "Reviewed" protein. Typically "sp|"
            when using a UniProt Fasta database.
        inference_type (str): String to determine the inference procedure. Can be any value of INFERENCE_TYPES
            of [Inference object][pyproteininference.inference.Inference].
        tag (str): String to be added to output files.
        psm_score (str): String that indicates the PSM input score. The value should match the name of the
            score in the input data that you want to use as the PSM score. This score is used in the scoring
            methods of the [Score object][pyproteininference.scoring.Score].
        grouping_type (str/None): String to determine the grouping procedure. Can be any value of
            GROUPING_TYPES of [Inference object][pyproteininference.inference.Inference].
        max_identifiers_peptide_centric (int): Maximum number of identifiers to assign to a group when
            running peptide_centric inference. Typically this is 10 or 5.
        lp_solver (str/None): The LP solver to use if inference_type="Parsimony".
            Can be any value in LP_SOLVERS in the [Inference object][pyproteininference.inference.Inference].

    """

    PARENT_PARAMETER_KEY = "parameters"

    GENERAL_PARAMETER_KEY = "general"
    DATA_RESTRICTION_PARAMETER_KEY = "data_restriction"
    SCORE_PARAMETER_KEY = "score"
    IDENTIFIERS_PARAMETER_KEY = "identifiers"
    INFERENCE_PARAMETER_KEY = "inference"
    DIGEST_PARAMETER_KEY = "digest"
    PARSIMONY_PARAMETER_KEY = "parsimony"
    PEPTIDE_CENTRIC_PARAMETER_KEY = "peptide_centric"

    XML_INPUT_PARSER_PARAMETER_KEY = "xml_parser"

    PARAMETER_MAIN_KEYS = {
        GENERAL_PARAMETER_KEY,
        DATA_RESTRICTION_PARAMETER_KEY,
        SCORE_PARAMETER_KEY,
        IDENTIFIERS_PARAMETER_KEY,
        INFERENCE_PARAMETER_KEY,
        DIGEST_PARAMETER_KEY,
        PARSIMONY_PARAMETER_KEY,
        PEPTIDE_CENTRIC_PARAMETER_KEY,
    }

    EXPORT_PARAMETER = "export"
    FDR_PARAMETER = "fdr"
    PICKER_PARAMETER = "picker"
    TAG_PARAMETER = "tag"

    GENERAL_PARAMETER_SUB_KEYS = {
        EXPORT_PARAMETER,
        FDR_PARAMETER,
        PICKER_PARAMETER,
        TAG_PARAMETER,
    }

    PEP_RESTRICT_PARAMETER = "pep_restriction"
    PEPTIDE_LENGTH_RESTRICT_PARAMETER = "peptide_length_restriction"
    Q_VALUE_RESTRICT_PARAMETER = "q_value_restriction"
    CUSTOM_RESTRICT_PARAMETER = "custom_restriction"
    MAX_ALLOWED_ALTERNATIVE_PROTEINS_PARAMETER = "max_allowed_alternative_proteins"

    DATA_RESTRICTION_PARAMETER_SUB_KEYS = {
        PEP_RESTRICT_PARAMETER,
        PEPTIDE_LENGTH_RESTRICT_PARAMETER,
        Q_VALUE_RESTRICT_PARAMETER,
        CUSTOM_RESTRICT_PARAMETER,
        MAX_ALLOWED_ALTERNATIVE_PROTEINS_PARAMETER,
    }

    PROTEIN_SCORE_PARAMETER = "protein_score"
    PSM_SCORE_PARAMETER = "psm_score"
    PSM_SCORE_TYPE_PARAMETER = "psm_score_type"

    SCORE_PARAMETER_SUB_KEYS = {
        PROTEIN_SCORE_PARAMETER,
        PSM_SCORE_PARAMETER,
        PSM_SCORE_TYPE_PARAMETER,
    }

    DECOY_SYMBOL_PARAMETER = "decoy_symbol"
    ISOFORM_SYMBOL_PARAMETER = "isoform_symbol"
    REVIEWED_IDENTIFIER_PARAMETER = "reviewed_identifier_symbol"

    IDENTIFIER_SUB_KEYS = {
        DECOY_SYMBOL_PARAMETER,
        ISOFORM_SYMBOL_PARAMETER,
        REVIEWED_IDENTIFIER_PARAMETER,
    }

    INFERENCE_TYPE_PARAMETER = "inference_type"
    GROUPING_TYPE_PARAMETER = "grouping_type"

    INFERENCE_SUB_KEYS = {INFERENCE_TYPE_PARAMETER, GROUPING_TYPE_PARAMETER}

    DIGEST_TYPE_PARAMETER = "digest_type"
    MISSED_CLEAV_PARAMETER = "missed_cleavages"

    DIGEST_SUB_KEYS = {DIGEST_TYPE_PARAMETER, MISSED_CLEAV_PARAMETER}

    LP_SOLVER_PARAMETER = "lp_solver"
    SHARED_PEPTIDES_PARAMETER = "shared_peptides"

    PARSIMONY_SUB_KEYS = {
        LP_SOLVER_PARAMETER,
        SHARED_PEPTIDES_PARAMETER,
    }

    MAX_IDENTIFIERS_PARAMETER = "max_identifiers"

    PEPTIDE_CENTRIC_SUB_KEYS = {MAX_IDENTIFIERS_PARAMETER}

    PARSER_OPENMS = "openms"
    PARSER_PYTEOMICS = "pyteomics"

    DEFAULT_DIGEST_TYPE = "trypsin"
    DEFAULT_EXPORT = "peptides"
    DEFAULT_FDR = 0.01
    DEFAULT_MISSED_CLEAVAGES = 3
    DEFAULT_PICKER = True
    DEFAULT_RESTRICT_PEP = 0.9
    DEFAULT_RESTRICT_PEPTIDE_LENGTH = 7
    DEFAULT_RESTRICT_Q = 0.005
    DEFAULT_MAX_ALLOWED_ALTERNATIVE_PROTEINS = 50
    DEFAULT_RESTRICT_CUSTOM = "None"
    DEFAULT_PROTEIN_SCORE = "multiplicative_log"
    DEFAULT_PSM_SCORE = "posterior_error_prob"
    DEFAULT_DECOY_SYMBOL = "##"
    DEFAULT_ISOFORM_SYMBOL = "-"
    DEFAULT_REVIEWED_IDENTIFIER_SYMBOL = "sp|"
    DEFAULT_INFERENCE_TYPE = "peptide_centric"
    DEFAULT_TAG = "py_protein_inference"
    DEFAULT_PSM_SCORE_TYPE = "multiplicative"
    DEFAULT_GROUPING_TYPE = "parsimonious_grouping"
    DEFAULT_MAX_IDENTIFIERS_PEPTIDE_CENTRIC = 5
    DEFAULT_LP_SOLVER = "pulp"
    DEFAULT_SHARED_PEPTIDES = "all"
    DEFAULT_XML_INPUT_PARSER = PARSER_OPENMS

    def __init__(self, yaml_param_filepath, configuration=None, validate=True):
        """Class to store Protein Inference parameter information as an object.

        Args:
            yaml_param_filepath (str): path to properly formatted parameter file specific to Protein Inference.
            configuration (Configuration): optional GUI configuration object; when provided it is used
                instead of the yaml parameter file.
            validate (bool): True/False on whether to validate the parameter file of interest.

        Returns:
            None:

        Example:
            >>> pyproteininference.parameters.ProteinInferenceParameter(
            >>>     yaml_param_filepath = "/path/to/pyproteininference_params.yaml", validate=True
            >>> )


        """
        self.yaml_param_filepath = yaml_param_filepath
        self.digest_type = self.DEFAULT_DIGEST_TYPE
        self.export = self.DEFAULT_EXPORT
        self.fdr = self.DEFAULT_FDR
        self.missed_cleavages = self.DEFAULT_MISSED_CLEAVAGES
        self.picker = self.DEFAULT_PICKER
        self.restrict_pep = self.DEFAULT_RESTRICT_PEP
        self.restrict_peptide_length = self.DEFAULT_RESTRICT_PEPTIDE_LENGTH
        self.restrict_q = self.DEFAULT_RESTRICT_Q
        self.restrict_custom = self.DEFAULT_RESTRICT_CUSTOM
        self.protein_score = self.DEFAULT_PROTEIN_SCORE
        self.psm_score_type = self.DEFAULT_PSM_SCORE_TYPE
        self.decoy_symbol = self.DEFAULT_DECOY_SYMBOL
        self.isoform_symbol = self.DEFAULT_ISOFORM_SYMBOL
        self.reviewed_identifier_symbol = self.DEFAULT_REVIEWED_IDENTIFIER_SYMBOL
        self.inference_type = self.DEFAULT_INFERENCE_TYPE
        self.tag = self.DEFAULT_TAG
        self.psm_score = self.DEFAULT_PSM_SCORE
        self.grouping_type = self.DEFAULT_GROUPING_TYPE
        self.max_identifiers_peptide_centric = self.DEFAULT_MAX_IDENTIFIERS_PEPTIDE_CENTRIC
        self.lp_solver = self.DEFAULT_LP_SOLVER
        self.shared_peptides = self.DEFAULT_SHARED_PEPTIDES
        self.validate = validate
        self.xml_input_parser = self.DEFAULT_XML_INPUT_PARSER
        self.max_allowed_alternative_proteins = self.DEFAULT_MAX_ALLOWED_ALTERNATIVE_PROTEINS

        if configuration is not None:
            self.convert_from_gui_configuration(configuration)
        else:
            self.convert_to_object()

        if validate:
            self.validate_parameters()

        self._fix_none_parameters()

    def convert_to_object(self):
        """
        Function that takes a Protein Inference parameter file and converts it into a ProteinInferenceParameter object
        by assigning all Attributes of the ProteinInferenceParameter object.

        If no parameter filepath is supplied the parameter object will be loaded with default params.

        This function is run during the initialization of the ProteinInferenceParameter object.

        Returns:
            None:

        """
        if self.yaml_param_filepath:
            with open(self.yaml_param_filepath, "r") as stream:
                yaml_params = yaml.load(stream, Loader=yaml.Loader)

            try:
                self.digest_type = yaml_params[self.PARENT_PARAMETER_KEY][self.DIGEST_PARAMETER_KEY][
                    self.DIGEST_TYPE_PARAMETER
                ]
            except KeyError:
                logger.warning("digest_type set to default of {}".format(self.DEFAULT_DIGEST_TYPE))

            try:
                self.export = yaml_params[self.PARENT_PARAMETER_KEY][self.GENERAL_PARAMETER_KEY][self.EXPORT_PARAMETER]
            except KeyError:
                logger.warning("export set to default of {}".format(self.DEFAULT_EXPORT))

            try:
                self.fdr = yaml_params[self.PARENT_PARAMETER_KEY][self.GENERAL_PARAMETER_KEY][self.FDR_PARAMETER]
            except KeyError:
                logger.warning("fdr set to default of {}".format(self.DEFAULT_FDR))
            try:
                self.missed_cleavages = yaml_params[self.PARENT_PARAMETER_KEY][self.DIGEST_PARAMETER_KEY][
                    self.MISSED_CLEAV_PARAMETER
                ]
            except KeyError:
                logger.warning("missed_cleavages set to default of {}".format(self.DEFAULT_MISSED_CLEAVAGES))

            try:
                self.picker = yaml_params[self.PARENT_PARAMETER_KEY][self.GENERAL_PARAMETER_KEY][self.PICKER_PARAMETER]
            except KeyError:
                logger.warning("picker set to default of {}".format(self.DEFAULT_PICKER))

            try:
                self.restrict_pep = yaml_params[self.PARENT_PARAMETER_KEY][self.DATA_RESTRICTION_PARAMETER_KEY][
                    self.PEP_RESTRICT_PARAMETER
                ]
            except KeyError:
                logger.warning("restrict_pep set to default of {}".format(self.DEFAULT_RESTRICT_PEP))

            try:
                self.restrict_peptide_length = yaml_params[self.PARENT_PARAMETER_KEY][
                    self.DATA_RESTRICTION_PARAMETER_KEY
                ][self.PEPTIDE_LENGTH_RESTRICT_PARAMETER]
            except KeyError:
                logger.warning(
                    "restrict_peptide_length set to default of {}".format(self.DEFAULT_RESTRICT_PEPTIDE_LENGTH)
                )

            try:
                self.restrict_q = yaml_params[self.PARENT_PARAMETER_KEY][self.DATA_RESTRICTION_PARAMETER_KEY][
                    self.Q_VALUE_RESTRICT_PARAMETER
                ]
            except KeyError:
                logger.warning("restrict_q set to default of {}".format(self.DEFAULT_RESTRICT_Q))

            try:
                self.restrict_custom = yaml_params[self.PARENT_PARAMETER_KEY][self.DATA_RESTRICTION_PARAMETER_KEY][
                    self.CUSTOM_RESTRICT_PARAMETER
                ]
            except KeyError:
                logger.warning("restrict_custom set to default of {}".format(self.DEFAULT_RESTRICT_CUSTOM))

            try:
                self.max_allowed_alternative_proteins = yaml_params[self.PARENT_PARAMETER_KEY][
                    self.DATA_RESTRICTION_PARAMETER_KEY
                ][self.MAX_ALLOWED_ALTERNATIVE_PROTEINS_PARAMETER]
            except KeyError:
                logger.warning(
                    "max_allowed_alternative_proteins set to default of {}".format(
                        self.DEFAULT_MAX_ALLOWED_ALTERNATIVE_PROTEINS
                    )
                )

            try:
                self.protein_score = yaml_params[self.PARENT_PARAMETER_KEY][self.SCORE_PARAMETER_KEY][
                    self.PROTEIN_SCORE_PARAMETER
                ]
            except KeyError:
                logger.warning("protein_score set to default of {}".format(self.DEFAULT_PROTEIN_SCORE))

            try:
                self.psm_score_type = yaml_params[self.PARENT_PARAMETER_KEY][self.SCORE_PARAMETER_KEY][
                    self.PSM_SCORE_TYPE_PARAMETER
                ]
            except KeyError:
                logger.warning("psm_score_type set to default of {}".format(self.DEFAULT_PSM_SCORE_TYPE))

            try:
                self.decoy_symbol = yaml_params[self.PARENT_PARAMETER_KEY][self.IDENTIFIERS_PARAMETER_KEY][
                    self.DECOY_SYMBOL_PARAMETER
                ]
            except KeyError:
                logger.warning("decoy_symbol set to default of {}".format(self.DEFAULT_DECOY_SYMBOL))

            try:
                self.isoform_symbol = yaml_params[self.PARENT_PARAMETER_KEY][self.IDENTIFIERS_PARAMETER_KEY][
                    self.ISOFORM_SYMBOL_PARAMETER
                ]
            except KeyError:
                logger.warning("isoform_symbol set to default of {}".format(self.DEFAULT_ISOFORM_SYMBOL))

            try:
                self.reviewed_identifier_symbol = yaml_params[self.PARENT_PARAMETER_KEY][
                    self.IDENTIFIERS_PARAMETER_KEY
                ][self.REVIEWED_IDENTIFIER_PARAMETER]
            except KeyError:
                logger.warning(
                    "reviewed_identifier_symbol set to default of {}".format(self.DEFAULT_REVIEWED_IDENTIFIER_SYMBOL)
                )

            try:
                self.inference_type = yaml_params[self.PARENT_PARAMETER_KEY][self.INFERENCE_PARAMETER_KEY][
                    self.INFERENCE_TYPE_PARAMETER
                ]
            except KeyError:
                logger.warning("inference_type set to default of {}".format(self.DEFAULT_INFERENCE_TYPE))

            try:
                self.tag = yaml_params[self.PARENT_PARAMETER_KEY][self.GENERAL_PARAMETER_KEY][self.TAG_PARAMETER]
            except KeyError:
                logger.warning("tag set to default of {}".format(self.DEFAULT_TAG))

            try:
                self.psm_score = yaml_params[self.PARENT_PARAMETER_KEY][self.SCORE_PARAMETER_KEY][
                    self.PSM_SCORE_PARAMETER
                ]
            except KeyError:
                logger.warning("psm_score set to default of {}".format(self.DEFAULT_PSM_SCORE))

            try:
                self.grouping_type = yaml_params[self.PARENT_PARAMETER_KEY][self.INFERENCE_PARAMETER_KEY][
                    self.GROUPING_TYPE_PARAMETER
                ]
            except KeyError:
                logger.warning("grouping_type set to default of {}".format(self.DEFAULT_GROUPING_TYPE))

            try:
                self.xml_input_parser = yaml_params[self.PARENT_PARAMETER_KEY][self.GENERAL_PARAMETER_KEY][
                    self.XML_INPUT_PARSER_PARAMETER_KEY
                ]
            except KeyError:
                logger.warning("xml_input_parser set to default of {}".format(self.DEFAULT_XML_INPUT_PARSER))

            try:
                self.max_identifiers_peptide_centric = yaml_params[self.PARENT_PARAMETER_KEY][
                    self.PEPTIDE_CENTRIC_PARAMETER_KEY
                ][self.MAX_IDENTIFIERS_PARAMETER]
            except KeyError:
                logger.warning(
                    "max_identifiers_peptide_centric set to default of {}".format(
                        self.DEFAULT_MAX_IDENTIFIERS_PEPTIDE_CENTRIC
                    )
                )

            try:
                self.lp_solver = yaml_params[self.PARENT_PARAMETER_KEY][self.PARSIMONY_PARAMETER_KEY][
                    self.LP_SOLVER_PARAMETER
                ]
            except KeyError:
                logger.warning("lp_solver set to default of {}".format(self.DEFAULT_LP_SOLVER))
            try:
                # Do try except here to make old param files backwards compatible
                self.shared_peptides = yaml_params[self.PARENT_PARAMETER_KEY][self.PARSIMONY_PARAMETER_KEY][
                    self.SHARED_PEPTIDES_PARAMETER
                ]
            except KeyError:
                logger.warning("shared_peptides set to default of {}".format(self.DEFAULT_SHARED_PEPTIDES))

        else:
            logger.warning("Yaml parameter file not found, all parameters set to default")

    def convert_from_gui_configuration(self, configuration: Configuration):
        """
        Function that takes a Protein Inference GUI configuration object and converts it into a ProteinInferenceParameter object
        by assigning all Attributes of the ProteinInferenceParameter object.

        If no parameter filepath is supplied the parameter object will be loaded with default params.

        This function is run during initialization of the ProteinInferenceParameter object.

        Returns:
            None:

        """
        self.digest_type = configuration.digest_type
        self.export = self.DEFAULT_EXPORT
        self.fdr = configuration.false_discovery_rate
        self.missed_cleavages = configuration.missed_cleavages
        self.picker = configuration.picker
        self.restrict_pep = configuration.pep_restriction
        self.restrict_peptide_length = configuration.peptide_length_restriction
        self.restrict_q = configuration.q_value_restriction
        self.restrict_custom = self.DEFAULT_RESTRICT_CUSTOM
        self.protein_score = configuration.protein_score
        self.psm_score_type = configuration.psm_score_type
        self.decoy_symbol = configuration.decoy_symbol
        self.isoform_symbol = configuration.isoform_symbol
        self.reviewed_identifier_symbol = configuration.reviewed_identifier_symbol
        self.inference_type = configuration.inference_type
        self.tag = self.DEFAULT_TAG
        if configuration.psm_score == "custom":
            self.psm_score = configuration.psm_score_custom
        else:
            self.psm_score = configuration.psm_score
        self.grouping_type = configuration.grouping_type
        self.max_identifiers_peptide_centric = configuration.max_identifiers
        self.lp_solver = self.DEFAULT_LP_SOLVER
        self.shared_peptides = configuration.shared_peptides
        self.xml_input_parser = configuration.xml_input_parser
        self.max_allowed_alternative_proteins = configuration.max_allowed_alternative_proteins

    def validate_parameters(self):
        """
        Method to validate all parameters.

        Returns:
            None:

        """
        # Run all of the parameter validations
        self._validate_digest_type()
        self._validate_export_type()
        self._validate_floats()
        self._validate_bools()
        self._validate_score_type()
        self._validate_score_method()
        self._validate_score_combination()
        self._validate_inference_type()
        self._validate_grouping_type()
        self._validate_max_id()
        self._validate_lp_solver()
        self._validate_identifiers()
        self._validate_parsimony_shared_peptides()

    def _validate_digest_type(self):
        """
        Internal ProteinInferenceParameter method to validate the digest type.
        """
        # Make sure we have a valid digest type
        if self.digest_type in PyteomicsDigest.LIST_OF_DIGEST_TYPES:
            logger.info("Using digest type '{}'".format(self.digest_type))
        else:
            raise ValueError(
                "Digest Type '{}' not supported, please use one of the following enzyme digestions: '{}'".format(
                    self.digest_type, ", ".join(PyteomicsDigest.LIST_OF_DIGEST_TYPES)
                )
            )

    def _validate_export_type(self):
        """
        Internal ProteinInferenceParameter method to validate the export type.
        """
        # Make sure we have a valid export type
        if self.export in Export.EXPORT_TYPES:
            logger.info("Using Export type '{}'".format(self.export))
        else:
            raise ValueError(
                "Export Type '{}' not supported, please use one of the following export types: '{}'".format(
                    self.export, ", ".join(Export.EXPORT_TYPES)
                )
            )

    def _validate_floats(self):
        """
        Internal ProteinInferenceParameter method to validate floats.
        """
        # Validate that FDR, cleavages, and restrict values are all floats and/or ints if they need to be

        try:
            if 0 <= float(self.fdr) <= 1:
                logger.info("FDR Input {}".format(self.fdr))
            else:
                raise ValueError
        except ValueError:
            raise ValueError("FDR must be a decimal between 0 and 1, FDR provided: {}".format(self.fdr))

        try:
            if 0 <= float(self.restrict_pep) <= 1:
                logger.info("PEP restriction {}".format(self.restrict_pep))

        except ValueError:
            if not self.restrict_pep or self.restrict_pep.lower() == "none":
                self.restrict_pep = None
                logger.info("Not restricting by PEP Value")
            else:
                raise ValueError(
                    "PEP restriction must be a decimal between 0 and 1, PEP restriction provided: {}".format(
                        self.restrict_pep
                    )
                )

        try:
            if 0 <= float(self.restrict_q) <= 1:
                logger.info("Q Value restriction {}".format(self.restrict_q))

        except ValueError:
            if not self.restrict_q or self.restrict_q.lower() == "none":
                self.restrict_q = None
                logger.info("Not restricting by Q Value")
            else:
                raise ValueError(
                    "Q Value restriction must be a decimal between 0 and 1, Q Value restriction provided: {}".format(
                        self.restrict_q
                    )
                )

        try:
            int(self.missed_cleavages)
            logger.info("Missed Cleavages selected: {}".format(self.missed_cleavages))
        except ValueError:
            raise ValueError(
                "Missed Cleavages must be an integer, Provided Missed Cleavages value: {}".format(self.missed_cleavages)
            )

        try:
            int(self.restrict_peptide_length)
            logger.info("Peptide Length Restriction: Len {}".format(self.restrict_peptide_length))
        except ValueError:
            if not self.restrict_peptide_length or self.restrict_peptide_length.lower() == "none":
                self.restrict_peptide_length = None
                logger.info("Not Restricting by Peptide Length")
            else:
                raise ValueError(
                    "Peptide Length Restriction must be an integer, "
                    "Provided Peptide Length Restriction value: {}".format(self.restrict_peptide_length)
                )

        try:
            float(self.restrict_custom)
            logger.info("Custom restriction {}".format(self.restrict_custom))
        except (ValueError, TypeError):
            if not self.restrict_custom or self.restrict_custom.lower() == "none":
                self.restrict_custom = None
                logger.info("Not Restricting by Custom Value")
            else:
                raise ValueError(
                    "Custom restriction must be a number, Custom restriction provided: {}".format(self.restrict_custom)
                )

    def _validate_bools(self):
        """
        Internal ProteinInferenceParameter method to validate the bools.
        """
        # Make sure picker is a bool
        if isinstance(self.picker, bool):
            if self.picker:
                logger.info("Parameters loaded to run Picker")
            else:
                logger.info("Parameters loaded to NOT run Picker")
        else:
            raise ValueError(
                "Picker Variable must be set to True or False, Picker Variable provided: {}".format(self.picker)
            )

    def _validate_score_method(self):
        """
        Internal ProteinInferenceParameter method to validate the score method.
        """
        # Make sure we have the score method defined in code to use...
        if self.protein_score in Score.SCORE_METHODS:
            logger.info("Using Score Method '{}'".format(self.protein_score))
        else:
            raise ValueError(
                "Score Method '{}' not supported, "
                "please use one of the following Score Methods: '{}'".format(
                    self.protein_score, ", ".join(Score.SCORE_METHODS)
                )
            )

    def _validate_score_type(self):
        """
        Internal ProteinInferenceParameter method to validate the score type.
        """
        # Make sure score type is multiplicative or additive
        if self.psm_score_type in Score.SCORE_TYPES:
            logger.info("Using Score Type '{}'".format(self.psm_score_type))
        else:
            raise ValueError(
                "Score Type '{}' not supported, "
                "please use one of the following Score Types: '{}'".format(
                    self.psm_score_type, ", ".join(Score.SCORE_TYPES)
                )
            )

    def _validate_score_combination(self):
        """
        Internal ProteinInferenceParameter method to validate combination of score method and score type.
        """
        # Check to see if combination of score (column), method(multiplicative log, additive),
        # and score type (multiplicative/additive) is possible...
        # This will be super custom

        if self.psm_score_type == Score.ADDITIVE_SCORE_TYPE and self.protein_score != Score.ADDITIVE:
            raise ValueError(
                "If Score type is 'additive' (Higher PSM score is better) then you must use the 'additive' score method"
            )

        elif self.psm_score_type == Score.MULTIPLICATIVE_SCORE_TYPE and self.protein_score == Score.ADDITIVE:
            raise ValueError(
                "If Score type is 'multiplicative' (Lower PSM score is better) "
                "then you must NOT use the 'additive' score method please "
                "select one of the following score methods: {}".format(
                    ", ".join([x for x in Score.SCORE_METHODS if x != "additive"])
                )
            )

        else:
            logger.info(
                "Combination of Score Type: '{}' and Score Method: '{}' is Ok".format(
                    self.psm_score_type, self.protein_score
                )
            )

    def _validate_inference_type(self):
        """
        Internal ProteinInferenceParameter method to validate the inference type.
        """
        # Check if its parsimony, exclusion, inclusion, none
        if self.inference_type in Inference.INFERENCE_TYPES:
            logger.info("Using inference type '{}'".format(self.inference_type))
        else:
            raise ValueError(
                "Inference Type '{}' not supported, please use one of the following Inference Types: '{}'".format(
                    self.inference_type, ", ".join(Inference.INFERENCE_TYPES)
                )
            )

    def _validate_grouping_type(self):
        """
        Internal ProteinInferenceParameter method to validate the grouping type.
        """
        # Check if its parsimony, exclusion, inclusion, none
        if self.grouping_type in Inference.GROUPING_TYPES:
            logger.info("Using Grouping type '{}'".format(self.grouping_type))
        else:
            if not self.grouping_type or self.grouping_type.lower() == "none":
                self.grouping_type = None
                logger.info("Using Grouping type: None")
            else:

                raise ValueError(
                    "Grouping Type '{}' not supported, please use one of the following Grouping Types: '{}'".format(
                        self.grouping_type, Inference.GROUPING_TYPES
                    )
                )

    def _validate_max_id(self):
        """
        Internal ProteinInferenceParameter method to validate the max peptide centric id.
        """
        # Check if max_identifiers_peptide_centric param is an INT
        if isinstance(self.max_identifiers_peptide_centric, int):
            logger.info(
                "Max Number of Identifiers for Peptide Centric Inference: '{}'".format(
                    self.max_identifiers_peptide_centric
                )
            )
        else:
            raise ValueError(
                "Max Number of Identifiers for Peptide Centric Inference must be an integer, "
                "provided value: {}".format(self.max_identifiers_peptide_centric)
            )

    def _validate_lp_solver(self):
        """
        Internal ProteinInferenceParameter method to validate the lp solver.
        """
        # Check if its pulp or None
        if self.lp_solver in Inference.LP_SOLVERS:
            logger.info("Using LP Solver '{}'".format(self.lp_solver))
        else:
            if not self.lp_solver or self.lp_solver.lower() == "none":
                self.lp_solver = None
                logger.info("Setting LP Solver to None")
            else:
                raise ValueError(
                    "LP Solver '{}' not supported, please use one of the following LP Solvers: '{}'".format(
                        self.lp_solver, ", ".join(Inference.LP_SOLVERS)
                    )
                )

    def _validate_parsimony_shared_peptides(self):
        """
        Internal ProteinInferenceParameter method to validate the shared peptides parameter.
        """
        # Check if its all, best, or none
        if self.shared_peptides in Inference.SHARED_PEPTIDE_TYPES:
            logger.info("Using Shared Peptide types '{}'".format(self.shared_peptides))
        else:
            if not self.shared_peptides or self.shared_peptides.lower() == "none":
                self.shared_peptides = None
                logger.info("Setting Shared Peptide type to None")
            else:
                raise ValueError(
                    "Shared Peptide types '{}' not supported, please use one of the following "
                    "Shared Peptide types: '{}'".format(self.shared_peptides, Inference.SHARED_PEPTIDE_TYPES)
                )

    def _validate_identifiers(self):
        """
        Internal ProteinInferenceParameter method to validate the decoy symbol, isoform symbol,
        and reviewed identifier symbol.

        """
        if isinstance(self.decoy_symbol, str):
            logger.info("Decoy Symbol set to: '{}'".format(self.decoy_symbol))
        else:
            raise ValueError("Decoy Symbol must be a string, provided value: {}".format(self.decoy_symbol))

        if isinstance(self.isoform_symbol, str):
            logger.info("Isoform Symbol set to: '{}'".format(self.isoform_symbol))
            if not self.isoform_symbol or self.isoform_symbol.lower() == "none":
                self.isoform_symbol = None
                logger.info("Isoform Symbol set to None")
        elif self.isoform_symbol is None:
            logger.info("Isoform Symbol set to None")
        else:
            raise ValueError("Isoform Symbol must be a string, provided value: {}".format(self.isoform_symbol))

        if isinstance(self.reviewed_identifier_symbol, str):
            logger.info("Reviewed Identifier Symbol set to: '{}'".format(self.reviewed_identifier_symbol))
            if not self.reviewed_identifier_symbol or self.reviewed_identifier_symbol.lower() == "none":
                self.reviewed_identifier_symbol = None
                logger.info("Reviewed Identifier Symbol set to None")
        elif self.reviewed_identifier_symbol is None:
            logger.info("Reviewed Identifier Symbol set to None")
        else:
            raise ValueError(
                "Reviewed Identifier Symbol must be a string, provided value: {}".format(
                    self.reviewed_identifier_symbol
                )
            )

    def _validate_parameter_shape(self, yaml_params):
        """
        Internal ProteinInferenceParameter method to validate shape of the parameter file by checking to make sure
         that all necessary main parameter fields are defined.
        """
        if self.PARENT_PARAMETER_KEY in yaml_params.keys():
            logger.info("Main Parameter Key is Present")
        else:
            raise ValueError(
                "Key {} needs to be defined as the outermost parameter group".format(self.PARENT_PARAMETER_KEY)
            )

        if self.PARAMETER_MAIN_KEYS.issubset(yaml_params[self.PARENT_PARAMETER_KEY]):
            logger.info("All Sub Parameter Keys Present")
        else:
            raise ValueError(
                "All of the following values: {}. Need to be Sub Parameters in the Yaml Parameter file".format(
                    ", ".join(self.PARAMETER_MAIN_KEYS),
                )
            )

        try:
            general_params = yaml_params[self.PARENT_PARAMETER_KEY][self.GENERAL_PARAMETER_KEY]
            for gkey in self.GENERAL_PARAMETER_SUB_KEYS:
                if gkey in general_params.keys():
                    pass
                else:
                    raise ValueError(
                        "General Sub Parameter '{}' is not found in the parameter file. "
                        "Please add it as a sub parameter of the general parameter field".format(gkey)
                    )

        except KeyError:
            raise ValueError("'general' sub Parameter not defined in the parameter file")

        try:
            data_res_params = yaml_params[self.PARENT_PARAMETER_KEY][self.DATA_RESTRICTION_PARAMETER_KEY]
            for drkey in self.DATA_RESTRICTION_PARAMETER_SUB_KEYS:
                if drkey in data_res_params.keys():
                    pass
                else:
                    raise ValueError(
                        "Data Restriction Sub Parameter '{}' is not found in the parameter file. "
                        "Please add it as a sub parameter of the data_restriction parameter field".format(drkey)
                    )

        except KeyError:
            raise ValueError("'data_restriction' sub Parameter not defined in the parameter file")

        try:
            score_params = yaml_params[self.PARENT_PARAMETER_KEY][self.SCORE_PARAMETER_KEY]
            for skey in self.SCORE_PARAMETER_SUB_KEYS:
                if skey in score_params.keys():
                    pass
                else:
                    raise ValueError(
                        "Score Sub Parameter '{}' is not found in the parameter file. "
                        "Please add it as a sub parameter of the score parameter field".format(skey)
                    )

        except KeyError:
            raise ValueError("'score' sub Parameter not defined in the parameter file")

        try:
            id_params = yaml_params[self.PARENT_PARAMETER_KEY][self.IDENTIFIERS_PARAMETER_KEY]
            for ikey in self.IDENTIFIER_SUB_KEYS:
                if ikey in id_params.keys():
                    pass
                else:
                    raise ValueError(
                        "Identifiers Sub Parameter '{}' is not found in the parameter file. "
                        "Please add it as a sub parameter of the identifiers parameter field".format(ikey)
                    )

        except KeyError:
            raise ValueError("'identifiers' sub Parameter not defined in the parameter file")

        try:
            inf_params = yaml_params[self.PARENT_PARAMETER_KEY][self.INFERENCE_PARAMETER_KEY]
            for infkey in self.INFERENCE_SUB_KEYS:
                if infkey in inf_params.keys():
                    pass
                else:
                    raise ValueError(
                        "Inference Sub Parameter '{}' is not found in the parameter file. "
                        "Please add it as a sub parameter of the inference parameter field".format(infkey)
                    )

        except KeyError:
            raise ValueError("'inference' sub Parameter not defined in the parameter file")

        try:
            digest_params = yaml_params[self.PARENT_PARAMETER_KEY][self.DIGEST_PARAMETER_KEY]
            for dkey in self.DIGEST_SUB_KEYS:
                if dkey in digest_params.keys():
                    pass
                else:
                    raise ValueError(
                        "Digest Sub Parameter '{}' is not found in the parameter file. "
                        "Please add it as a sub parameter of the digest parameter field".format(dkey)
                    )

        except KeyError:
            raise ValueError("'digest' sub Parameter not defined in the parameter file")

        try:
            parsimony_params = yaml_params[self.PARENT_PARAMETER_KEY][self.PARSIMONY_PARAMETER_KEY]
            for pkey in self.PARSIMONY_SUB_KEYS:
                if pkey in parsimony_params.keys():
                    pass
                else:
                    raise ValueError(
                        "Parsimony Sub Parameter '{}' is not found in the parameter file. "
                        "Please add it as a sub parameter of the parsimony parameter field".format(pkey)
                    )

        except KeyError:
            raise ValueError("'parsimony' sub Parameter not defined in the parameter file")

        try:
            pep_cen_params = yaml_params[self.PARENT_PARAMETER_KEY][self.PEPTIDE_CENTRIC_PARAMETER_KEY]
            for pckey in self.PEPTIDE_CENTRIC_SUB_KEYS:
                if pckey in pep_cen_params.keys():
                    pass
                else:
                    raise ValueError(
                        "Peptide Centric Sub Parameter '{}' is not found in the parameter file. "
                        "Please add it as a sub parameter of the peptide_centric parameter field".format(pckey)
                    )

        except KeyError:
            raise ValueError("'peptide_centric' sub Parameter not defined in the parameter file")

    def override_q_restrict(self, data):
        """
        ProteinInferenceParameter method to override restrict_q if the input data does not contain q values.

        Args:
            data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].

        """
        data_has_q = data.input_has_q()
        if data_has_q:
            pass
        else:
            if self.restrict_q:
                logger.warning("No Q values found in the input data, overriding parameters to not filter on Q value")
                self.restrict_q = None

    def override_pep_restrict(self, data):
        """
        ProteinInferenceParameter method to override restrict_pep if the input data does not contain pep values.

        Args:
            data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].

        """
        data_has_pep = data.input_has_pep()
        if data_has_pep:
            pass
        else:
            if self.restrict_pep:
                logger.warning(
                    "No Pep values found in the input data, overriding parameters to not filter on Pep value"
                )
                self.restrict_pep = None

    def override_custom_restrict(self, data):
        """
        ProteinInferenceParameter method to override restrict_custom if
        the input data does not contain custom score values.

        Args:
            data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].

        """
        data_has_custom = data.input_has_custom()
        if data_has_custom:
            pass
        else:
            if self.restrict_custom:
                logger.warning(
                    "No Custom values found in the input data, overriding parameters to not filter on Custom value"
                )
                self.restrict_custom = None

    def fix_parameters_from_datastore(self, data):
        """
        ProteinInferenceParameter method to override restriction values in the
        parameter file if those scores do not exist in the input files.

        Args:
            data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].

        """

        self.override_q_restrict(data=data)
        self.override_pep_restrict(data=data)
        self.override_custom_restrict(data=data)

    def _fix_none_parameters(self):
        """
        Internal ProteinInferenceParameter method to fix parameters that have been defined as None.
        These get read in as strings with YAML reader and need to be converted to None type.
        """

        self._fix_grouping_type()
        self._fix_lp_solver()
        self._fix_shared_peptides()

    def _fix_grouping_type(self):
        """
        Internal ProteinInferenceParameter method to override grouping type for None value.
        """
        if self.grouping_type in ["None", "none", None]:
            self.grouping_type = None

    def _fix_lp_solver(self):
        """
        Internal ProteinInferenceParameter method to override lp_solver for None value.
        """
        if self.lp_solver in ["None", "none", None]:
            self.lp_solver = None

    def _fix_shared_peptides(self):
        """
        Internal ProteinInferenceParameter method to override shared_peptides for None value.
        """
        if self.shared_peptides in ["None", "none", None]:
            self.shared_peptides = None
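
The `_fix_*` helpers above exist because PyYAML does not treat the literal word `None` as a null value; it is loaded as the string `"None"`. A quick demonstration (assuming PyYAML is installed):

```python
import yaml

# YAML resolves only null, ~, or an empty value to Python None;
# the literal word "None" is loaded as a plain string, which is why
# _fix_none_parameters converts such strings back to None.
loaded = yaml.safe_load("lp_solver: None\ngrouping_type: null")
print(loaded["lp_solver"] == "None")    # True, a plain string
print(loaded["grouping_type"] is None)  # True, a real null
```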

__init__(yaml_param_filepath, configuration=None, validate=True)

Class to store Protein Inference parameter information as an object.

Parameters:
  • yaml_param_filepath (str) –

    path to properly formatted parameter file specific to Protein Inference.

  • configuration (Configuration, default: None ) –

    optional GUI Configuration object; if supplied, parameters are taken from it instead of the YAML file.

  • validate (bool, default: True ) –

    True/False on whether to validate the parameter file of interest.

Returns:
  • None
Example

pyproteininference.parameters.ProteinInferenceParameter( yaml_param_filepath = "/path/to/pyproteininference_params.yaml", validate=True )
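
When `validate=True`, each parameter is checked at construction and invalid values raise `ValueError`. The float checks follow a coerce-then-range pattern; a minimal standalone sketch of that pattern (the helper name `validate_fraction` is illustrative, not part of the package):

```python
def validate_fraction(value, name):
    # Coerce to float first; a non-numeric value raises ValueError
    # with the offending input included, mirroring _validate_floats.
    try:
        number = float(value)
    except (ValueError, TypeError):
        raise ValueError("{} must be a decimal between 0 and 1, provided: {}".format(name, value))
    # Then range-check: values outside [0, 1] are rejected as well.
    if not 0 <= number <= 1:
        raise ValueError("{} must be a decimal between 0 and 1, provided: {}".format(name, value))
    return number


print(validate_fraction("0.01", "fdr"))  # 0.01
```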

Source code in pyproteininference/parameters.py
def __init__(self, yaml_param_filepath, configuration=None, validate=True):
    """Class to store Protein Inference parameter information as an object.

    Args:
        yaml_param_filepath (str): path to properly formatted parameter file specific to Protein Inference.
        configuration (Configuration, optional): GUI Configuration object; if supplied, parameters
            are taken from it instead of the YAML file.
        validate (bool): True/False on whether to validate the parameter file of interest.

    Returns:
        None:

    Example:
        >>> pyproteininference.parameters.ProteinInferenceParameter(
        >>>     yaml_param_filepath = "/path/to/pyproteininference_params.yaml", validate=True
        >>> )


    """
    self.yaml_param_filepath = yaml_param_filepath
    self.digest_type = self.DEFAULT_DIGEST_TYPE
    self.export = self.DEFAULT_EXPORT
    self.fdr = self.DEFAULT_FDR
    self.missed_cleavages = self.DEFAULT_MISSED_CLEAVAGES
    self.picker = self.DEFAULT_PICKER
    self.restrict_pep = self.DEFAULT_RESTRICT_PEP
    self.restrict_peptide_length = self.DEFAULT_RESTRICT_PEPTIDE_LENGTH
    self.restrict_q = self.DEFAULT_RESTRICT_Q
    self.restrict_custom = self.DEFAULT_RESTRICT_CUSTOM
    self.protein_score = self.DEFAULT_PROTEIN_SCORE
    self.psm_score_type = self.DEFAULT_PSM_SCORE_TYPE
    self.decoy_symbol = self.DEFAULT_DECOY_SYMBOL
    self.isoform_symbol = self.DEFAULT_ISOFORM_SYMBOL
    self.reviewed_identifier_symbol = self.DEFAULT_REVIEWED_IDENTIFIER_SYMBOL
    self.inference_type = self.DEFAULT_INFERENCE_TYPE
    self.tag = self.DEFAULT_TAG
    self.psm_score = self.DEFAULT_PSM_SCORE
    self.grouping_type = self.DEFAULT_GROUPING_TYPE
    self.max_identifiers_peptide_centric = self.DEFAULT_MAX_IDENTIFIERS_PEPTIDE_CENTRIC
    self.lp_solver = self.DEFAULT_LP_SOLVER
    self.shared_peptides = self.DEFAULT_SHARED_PEPTIDES
    self.validate = validate
    self.xml_input_parser = self.DEFAULT_XML_INPUT_PARSER
    self.max_allowed_alternative_proteins = self.DEFAULT_MAX_ALLOWED_ALTERNATIVE_PROTEINS

    if configuration is not None:
        self.convert_from_gui_configuration(configuration)
    else:
        self.convert_to_object()

    if validate:
        self.validate_parameters()

    self._fix_none_parameters()

convert_from_gui_configuration(configuration)

Function that takes a Protein Inference GUI configuration object and converts it into a ProteinInferenceParameter object by assigning all Attributes of the ProteinInferenceParameter object.

If no parameter filepath is supplied the parameter object will be loaded with default params.

This function is run during initialization of the ProteinInferenceParameter object.

Returns:
  • None
Source code in pyproteininference/parameters.py
def convert_from_gui_configuration(self, configuration: Configuration):
    """
    Function that takes a Protein Inference GUI configuration object and converts it into a ProteinInferenceParameter object
    by assigning all Attributes of the ProteinInferenceParameter object.

    If no parameter filepath is supplied the parameter object will be loaded with default params.

    This function is run during initialization of the ProteinInferenceParameter object.

    Returns:
        None:

    """
    self.digest_type = configuration.digest_type
    self.export = self.DEFAULT_EXPORT
    self.fdr = configuration.false_discovery_rate
    self.missed_cleavages = configuration.missed_cleavages
    self.picker = configuration.picker
    self.restrict_pep = configuration.pep_restriction
    self.restrict_peptide_length = configuration.peptide_length_restriction
    self.restrict_q = configuration.q_value_restriction
    self.restrict_custom = self.DEFAULT_RESTRICT_CUSTOM
    self.protein_score = configuration.protein_score
    self.psm_score_type = configuration.psm_score_type
    self.decoy_symbol = configuration.decoy_symbol
    self.isoform_symbol = configuration.isoform_symbol
    self.reviewed_identifier_symbol = configuration.reviewed_identifier_symbol
    self.inference_type = configuration.inference_type
    self.tag = self.DEFAULT_TAG
    if configuration.psm_score == "custom":
        self.psm_score = configuration.psm_score_custom
    else:
        self.psm_score = configuration.psm_score
    self.grouping_type = configuration.grouping_type
    self.max_identifiers_peptide_centric = configuration.max_identifiers
    self.lp_solver = self.DEFAULT_LP_SOLVER
    self.shared_peptides = configuration.shared_peptides
    self.xml_input_parser = configuration.xml_input_parser
    self.max_allowed_alternative_proteins = configuration.max_allowed_alternative_proteins

convert_to_object()

Function that takes a Protein Inference parameter file and converts it into a ProteinInferenceParameter object by assigning all Attributes of the ProteinInferenceParameter object.

If no parameter filepath is supplied the parameter object will be loaded with default params.

This function is run during initialization of the ProteinInferenceParameter object.

Returns:
  • None
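
The nested key layout that `convert_to_object` reads (and `_validate_parameter_shape` enforces) is sketched below. The section names follow the package's documented parameter file format, but the values shown are purely illustrative; consult the default parameter file shipped with your installed version:

```yaml
parameters:
  general:
    export: psms
    fdr: 0.01
    picker: True
  data_restriction:
    pep_restriction: 0.9
    peptide_length_restriction: 7
    q_value_restriction: 0.005
    custom_restriction: None
  score:
    protein_score: multiplicative_log
    psm_score: posterior_error_prob
    psm_score_type: multiplicative
  identifiers:
    decoy_symbol: "##"
    isoform_symbol: "-"
    reviewed_identifier_symbol: "sp|"
  inference:
    inference_type: parsimony
    grouping_type: shared_peptides
  digest:
    digest_type: trypsin
    missed_cleavages: 3
  parsimony:
    lp_solver: pulp
    shared_peptides: all
  peptide_centric:
    max_identifiers: 5
```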
Source code in pyproteininference/parameters.py
def convert_to_object(self):
    """
    Function that takes a Protein Inference parameter file and converts it into a ProteinInferenceParameter object
    by assigning all Attributes of the ProteinInferenceParameter object.

    If no parameter filepath is supplied the parameter object will be loaded with default params.

    This function is run during initialization of the ProteinInferenceParameter object.

    Returns:
        None:

    """
    if self.yaml_param_filepath:
        with open(self.yaml_param_filepath, "r") as stream:
            yaml_params = yaml.load(stream, Loader=yaml.Loader)

        try:
            self.digest_type = yaml_params[self.PARENT_PARAMETER_KEY][self.DIGEST_PARAMETER_KEY][
                self.DIGEST_TYPE_PARAMETER
            ]
        except KeyError:
            logger.warning("digest_type set to default of {}".format(self.DEFAULT_DIGEST_TYPE))

        try:
            self.export = yaml_params[self.PARENT_PARAMETER_KEY][self.GENERAL_PARAMETER_KEY][self.EXPORT_PARAMETER]
        except KeyError:
            logger.warning("export set to default of {}".format(self.DEFAULT_EXPORT))

        try:
            self.fdr = yaml_params[self.PARENT_PARAMETER_KEY][self.GENERAL_PARAMETER_KEY][self.FDR_PARAMETER]
        except KeyError:
            logger.warning("fdr set to default of {}".format(self.DEFAULT_FDR))
        try:
            self.missed_cleavages = yaml_params[self.PARENT_PARAMETER_KEY][self.DIGEST_PARAMETER_KEY][
                self.MISSED_CLEAV_PARAMETER
            ]
        except KeyError:
            logger.warning("missed_cleavages set to default of {}".format(self.DEFAULT_MISSED_CLEAVAGES))

        try:
            self.picker = yaml_params[self.PARENT_PARAMETER_KEY][self.GENERAL_PARAMETER_KEY][self.PICKER_PARAMETER]
        except KeyError:
            logger.warning("picker set to default of {}".format(self.DEFAULT_PICKER))

        try:
            self.restrict_pep = yaml_params[self.PARENT_PARAMETER_KEY][self.DATA_RESTRICTION_PARAMETER_KEY][
                self.PEP_RESTRICT_PARAMETER
            ]
        except KeyError:
            logger.warning("restrict_pep set to default of {}".format(self.DEFAULT_RESTRICT_PEP))

        try:
            self.restrict_peptide_length = yaml_params[self.PARENT_PARAMETER_KEY][
                self.DATA_RESTRICTION_PARAMETER_KEY
            ][self.PEPTIDE_LENGTH_RESTRICT_PARAMETER]
        except KeyError:
            logger.warning(
                "restrict_peptide_length set to default of {}".format(self.DEFAULT_RESTRICT_PEPTIDE_LENGTH)
            )

        try:
            self.restrict_q = yaml_params[self.PARENT_PARAMETER_KEY][self.DATA_RESTRICTION_PARAMETER_KEY][
                self.Q_VALUE_RESTRICT_PARAMETER
            ]
        except KeyError:
            logger.warning("restrict_q set to default of {}".format(self.DEFAULT_RESTRICT_Q))

        try:
            self.restrict_custom = yaml_params[self.PARENT_PARAMETER_KEY][self.DATA_RESTRICTION_PARAMETER_KEY][
                self.CUSTOM_RESTRICT_PARAMETER
            ]
        except KeyError:
            logger.warning("restrict_custom set to default of {}".format(self.DEFAULT_RESTRICT_CUSTOM))

        try:
            self.max_allowed_alternative_proteins = yaml_params[self.PARENT_PARAMETER_KEY][
                self.DATA_RESTRICTION_PARAMETER_KEY
            ][self.MAX_ALLOWED_ALTERNATIVE_PROTEINS_PARAMETER]
        except KeyError:
            logger.warning(
                "max_allowed_alternative_proteins set to default of {}".format(
                    self.DEFAULT_MAX_ALLOWED_ALTERNATIVE_PROTEINS
                )
            )

        try:
            self.protein_score = yaml_params[self.PARENT_PARAMETER_KEY][self.SCORE_PARAMETER_KEY][
                self.PROTEIN_SCORE_PARAMETER
            ]
        except KeyError:
            logger.warning("protein_score set to default of {}".format(self.DEFAULT_PROTEIN_SCORE))

        try:
            self.psm_score_type = yaml_params[self.PARENT_PARAMETER_KEY][self.SCORE_PARAMETER_KEY][
                self.PSM_SCORE_TYPE_PARAMETER
            ]
        except KeyError:
            logger.warning("psm_score_type set to default of {}".format(self.DEFAULT_PSM_SCORE_TYPE))

        try:
            self.decoy_symbol = yaml_params[self.PARENT_PARAMETER_KEY][self.IDENTIFIERS_PARAMETER_KEY][
                self.DECOY_SYMBOL_PARAMETER
            ]
        except KeyError:
            logger.warning("decoy_symbol set to default of {}".format(self.DEFAULT_DECOY_SYMBOL))

        try:
            self.isoform_symbol = yaml_params[self.PARENT_PARAMETER_KEY][self.IDENTIFIERS_PARAMETER_KEY][
                self.ISOFORM_SYMBOL_PARAMETER
            ]
        except KeyError:
            logger.warning("isoform_symbol set to default of {}".format(self.DEFAULT_ISOFORM_SYMBOL))

        try:
            self.reviewed_identifier_symbol = yaml_params[self.PARENT_PARAMETER_KEY][
                self.IDENTIFIERS_PARAMETER_KEY
            ][self.REVIEWED_IDENTIFIER_PARAMETER]
        except KeyError:
            logger.warning(
                "reviewed_identifier_symbol set to default of {}".format(self.DEFAULT_REVIEWED_IDENTIFIER_SYMBOL)
            )

        try:
            self.inference_type = yaml_params[self.PARENT_PARAMETER_KEY][self.INFERENCE_PARAMETER_KEY][
                self.INFERENCE_TYPE_PARAMETER
            ]
        except KeyError:
            logger.warning("inference_type set to default of {}".format(self.DEFAULT_INFERENCE_TYPE))

        try:
            self.tag = yaml_params[self.PARENT_PARAMETER_KEY][self.GENERAL_PARAMETER_KEY][self.TAG_PARAMETER]
        except KeyError:
            logger.warning("tag set to default of {}".format(self.DEFAULT_TAG))

        try:
            self.psm_score = yaml_params[self.PARENT_PARAMETER_KEY][self.SCORE_PARAMETER_KEY][
                self.PSM_SCORE_PARAMETER
            ]
        except KeyError:
            logger.warning("psm_score set to default of {}".format(self.DEFAULT_PSM_SCORE))

        try:
            self.grouping_type = yaml_params[self.PARENT_PARAMETER_KEY][self.INFERENCE_PARAMETER_KEY][
                self.GROUPING_TYPE_PARAMETER
            ]
        except KeyError:
            logger.warning("grouping_type set to default of {}".format(self.DEFAULT_GROUPING_TYPE))

        try:
            self.xml_input_parser = yaml_params[self.PARENT_PARAMETER_KEY][self.GENERAL_PARAMETER_KEY][
                self.XML_INPUT_PARSER_PARAMETER_KEY
            ]
        except KeyError:
            logger.warning("xml_input_parser set to default of {}".format(self.DEFAULT_XML_INPUT_PARSER))

        try:
            self.max_identifiers_peptide_centric = yaml_params[self.PARENT_PARAMETER_KEY][
                self.PEPTIDE_CENTRIC_PARAMETER_KEY
            ][self.MAX_IDENTIFIERS_PARAMETER]
        except KeyError:
            logger.warning(
                "max_identifiers_peptide_centric set to default of {}".format(
                    self.DEFAULT_MAX_IDENTIFIERS_PEPTIDE_CENTRIC
                )
            )

        try:
            self.lp_solver = yaml_params[self.PARENT_PARAMETER_KEY][self.PARSIMONY_PARAMETER_KEY][
                self.LP_SOLVER_PARAMETER
            ]
        except KeyError:
            logger.warning("lp_solver set to default of {}".format(self.DEFAULT_LP_SOLVER))
        try:
            # Do try except here to make old param files backwards compatible
            self.shared_peptides = yaml_params[self.PARENT_PARAMETER_KEY][self.PARSIMONY_PARAMETER_KEY][
                self.SHARED_PEPTIDES_PARAMETER
            ]
        except KeyError:
            logger.warning("shared_peptides set to default of {}".format(self.DEFAULT_SHARED_PEPTIDES))

    else:
        logger.warning("Yaml parameter file not found, all parameters set to default")
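The repeated try/except blocks above all implement one lookup-with-default pattern: walk the nested YAML dictionary, and on a `KeyError` warn and keep the default. A minimal sketch of that pattern, using a plain dict in place of the parsed YAML file (the section and key names here are illustrative, not the exact constants defined on ProteinInferenceParameter):

```python
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger(__name__)

# Stand-in for yaml.load(...) output; key names are hypothetical.
yaml_params = {"parameters": {"digest": {"digest_type": "trypsin"}}}

def lookup_or_default(params, keys, default, name):
    # Mirrors the try/except KeyError pattern in convert_to_object:
    # walk the nested keys; on a miss, warn and fall back to the default.
    value = params
    try:
        for key in keys:
            value = value[key]
        return value
    except KeyError:
        logger.warning("%s set to default of %s", name, default)
        return default

digest_type = lookup_or_default(
    yaml_params, ["parameters", "digest", "digest_type"], "trypsin", "digest_type"
)
# missed_cleavages is absent above, so this logs a warning and returns 3
missed_cleavages = lookup_or_default(
    yaml_params, ["parameters", "digest", "missed_cleavages"], 3, "missed_cleavages"
)
```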

fix_parameters_from_datastore(data)

ProteinInferenceParameter method to override restriction values in the parameter file if those scores do not exist in the input files.

Parameters:
  • data (DataStore) –

    DataStore object.
Source code in pyproteininference/parameters.py
def fix_parameters_from_datastore(self, data):
    """
    ProteinInferenceParameter method to override restriction values in the
    parameter file if those scores do not exist in the input files.

    Args:
        data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].

    """

    self.override_q_restrict(data=data)
    self.override_pep_restrict(data=data)
    self.override_custom_restrict(data=data)

override_custom_restrict(data)

ProteinInferenceParameter method to override restrict_custom if the input data does not contain custom score values.

Parameters:
  • data (DataStore) –

    DataStore object.
Source code in pyproteininference/parameters.py
def override_custom_restrict(self, data):
    """
    ProteinInferenceParameter method to override restrict_custom if
    the input data does not contain custom score values.

    Args:
        data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].

    """
    data_has_custom = data.input_has_custom()
    if data_has_custom:
        pass
    else:
        if self.restrict_custom:
            logger.warning(
                "No Custom values found in the input data, overriding parameters to not filter on Custom value"
            )
            self.restrict_custom = None

override_pep_restrict(data)

ProteinInferenceParameter method to override restrict_pep if the input data does not contain pep values.

Parameters:
  • data (DataStore) –

    DataStore object.
Source code in pyproteininference/parameters.py
def override_pep_restrict(self, data):
    """
    ProteinInferenceParameter method to override restrict_pep if the input data does not contain pep values.

    Args:
        data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].

    """
    data_has_pep = data.input_has_pep()
    if data_has_pep:
        pass
    else:
        if self.restrict_pep:
            logger.warning(
                "No Pep values found in the input data, overriding parameters to not filter on Pep value"
            )
            self.restrict_pep = None

override_q_restrict(data)

ProteinInferenceParameter method to override restrict_q if the input data does not contain q values.

Parameters:
  • data (DataStore) –

    DataStore object.
Source code in pyproteininference/parameters.py
def override_q_restrict(self, data):
    """
    ProteinInferenceParameter method to override restrict_q if the input data does not contain q values.

    Args:
        data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].

    """
    data_has_q = data.input_has_q()
    if data_has_q:
        pass
    else:
        if self.restrict_q:
            logger.warning("No Q values found in the input data, overriding parameters to not filter on Q value")
            self.restrict_q = None

validate_parameters()

Class method to validate all parameters.

Returns:
  • None
Source code in pyproteininference/parameters.py
def validate_parameters(self):
    """
    Class method to validate all parameters.

    Returns:
        None:

    """
    # Run all of the parameter validations
    self._validate_digest_type()
    self._validate_export_type()
    self._validate_floats()
    self._validate_bools()
    self._validate_score_type()
    self._validate_score_method()
    self._validate_score_combination()
    self._validate_inference_type()
    self._validate_grouping_type()
    self._validate_max_id()
    self._validate_lp_solver()
    self._validate_identifiers()
    self._validate_parsimony_shared_peptides()

GenericReader

Bases: Reader

This class takes a Percolator-like target file and a Percolator-like decoy file and creates standard Psm objects.

Percolator-like output is tab-delimited, with one PSM per line:

| PSMId                    | score   | q-value     | posterior_error_prob | peptide                       | proteinIds          |                      |                      |                      |
|--------------------------|---------|-------------|----------------------|-------------------------------|---------------------|----------------------|----------------------|----------------------|
| 116108.15139.15139.6.dta | 3.44016 | 0.000479928 | 7.60258e-10          | K.MVVSMTLGLHPWIANIDDTQYLAAK.R | CNDP1_HUMAN\|Q96KN2 | B4E180_HUMAN\|B4E180 | A8K1K1_HUMAN\|A8K1K1 | J3KRP0_HUMAN\|J3KRP0 |

Custom columns can be added and used as scoring input. Please see package documentation for more information.
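
One way to read such rows, collecting the trailing alternative-protein columns that have no header of their own, is `csv.DictReader` with a `restkey`. This is a sketch of the idea only; the actual reader gathers these columns via its `get_alternative_proteins_from_input` helper:

```python
import csv
import io

# Illustrative Percolator-like TSV: the data row carries two protein
# columns beyond the header, which DictReader collects under restkey.
tsv = (
    "PSMId\tscore\tq-value\tposterior_error_prob\tpeptide\tproteinIds\n"
    "116108.15139.15139.6.dta\t3.44016\t0.000479928\t7.60258e-10\t"
    "K.MVVSMTLGLHPWIANIDDTQYLAAK.R\tCNDP1_HUMAN|Q96KN2\t"
    "B4E180_HUMAN|B4E180\tA8K1K1_HUMAN|A8K1K1\n"
)

reader = csv.DictReader(io.StringIO(tsv), delimiter="\t", restkey="alternative_proteins")
row = next(reader)
# row["proteinIds"] holds the lead protein; the overflow columns land
# in row["alternative_proteins"] as a list.
```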

Attributes:
  • target_file (str / list) –

    Path to Target PSM result files.

  • decoy_file (str / list) –

    Path to Decoy PSM result files.

  • combined_files (str / list) –

    Path to Combined PSM result files.

  • directory (str) –

    Path to directory containing combined PSM result files.

  • psms (list) –

    List of Psm objects.

  • load_custom_score (bool) –

    True/False on whether or not to load a custom score. Depends on scoring_variable.

  • scoring_variable (str) –

    String to indicate which column in the input file is to be used as the scoring input.

  • digest (Digest) –
  • parameter_file_object (ProteinInferenceParameter) –
  • append_alt_from_db (bool) –

    Whether or not to append alternative proteins found in the database that are not in the input files.

Source code in pyproteininference/reader.py
class GenericReader(Reader):
    """
    The following class takes a percolator like target file and a percolator like decoy file
    and creates standard [Psm][pyproteininference.physical.Psm] objects.

    Percolator Like Output is formatted as follows:
    with each entry being tab delimited.

    | PSMId                         | score    |  q-value    | posterior_error_prob  |  peptide                       | proteinIds          |                      |                      |                         | # noqa E501 W605
    |-------------------------------|----------|-------------|-----------------------|--------------------------------|---------------------|----------------------|----------------------|-------------------------| # noqa E501 W605
    |     116108.15139.15139.6.dta  |  3.44016 | 0.000479928 | 7.60258e-10           | K.MVVSMTLGLHPWIANIDDTQYLAAK.R  | CNDP1_HUMAN\|Q96KN2 | B4E180_HUMAN\|B4E180 | A8K1K1_HUMAN\|A8K1K1 | J3KRP0_HUMAN\|J3KRP0    | # noqa E501 W605

    Custom columns can be added and used as scoring input. Please see package documentation for more information.

    Attributes:
        target_file (str/list): Path to Target PSM result files.
        decoy_file (str/list): Path to Decoy PSM result files.
        combined_files (str/list): Path to Combined PSM result files.
        directory (str): Path to directory containing combined PSM result files.
        psms (list): List of [Psm][pyproteininference.physical.Psm] objects.
        load_custom_score (bool): True/False on whether or not to load a custom score. Depends on scoring_variable.
        scoring_variable (str): String to indicate which column in the input file is to be used as the scoring input.
        digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
        parameter_file_object (ProteinInferenceParameter):
            [ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter] object
        append_alt_from_db (bool): Whether or not to append alternative proteins found in the database that
            are not in the input files.



    """

    PSMID = "PSMId"
    SCORE = "score"
    Q_VALUE = "q-value"
    POSTERIOR_ERROR_PROB = "posterior_error_prob"
    PEPTIDE = "peptide"
    PROTEIN_IDS = "proteinIds"
    ALTERNATIVE_PROTEINS = "alternative_proteins"

    def __init__(
        self,
        digest,
        parameter_file_object,
        append_alt_from_db=True,
        target_file=None,
        decoy_file=None,
        combined_files=None,
        directory=None,
        top_hit_per_psm_only=False,
    ):
        """

        Args:
            digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
            parameter_file_object (ProteinInferenceParameter):
                [ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter] object.
            append_alt_from_db (bool): Whether or not to append alternative proteins found in the database that
                are not in the input files.
            target_file (str/list): Path to Target PSM result files.
            decoy_file (str/list): Path to Decoy PSM result files.
            combined_files (str/list): Path to Combined PSM result files.
            directory (str): Path to directory containing combined PSM result files.
            top_hit_per_psm_only (bool): If True, only include top hit for each PSM.

        Returns:
            Reader: [Reader][pyproteininference.reader.Reader] object.

        Example:
            >>> pyproteininference.reader.GenericReader(target_file = "example_target.txt",
            >>>     decoy_file = "example_decoy.txt",
            >>>     digest=digest, parameter_file_object=pi_params)
        """
        super().__init__(target_file, decoy_file, combined_files, directory)

        self.psms = None
        self.search_id = None
        self.digest = digest
        self.initial_protein_peptide_map = copy.copy(self.digest.protein_to_peptide_dictionary)
        self.load_custom_score = False

        self.top_hit_per_psm_only = top_hit_per_psm_only

        self.append_alt_from_db = append_alt_from_db

        self.parameter_file_object = parameter_file_object
        self.scoring_variable = parameter_file_object.psm_score

        self._validate_input()

        if self.scoring_variable != self.Q_VALUE and self.scoring_variable != self.POSTERIOR_ERROR_PROB:
            self.load_custom_score = True
            logger.info(
                "Pulling custom column based on parameter file input for score, Column: {}".format(
                    self.scoring_variable
                )
            )
        else:
            logger.info(
                "Pulling no custom columns based on parameter file input for score, using standard Column: {}".format(
                    self.scoring_variable
                )
            )

        # If we select to not run inference at all
        if self.parameter_file_object.inference_type == Inference.FIRST_PROTEIN:
            # Only allow 1 Protein per PSM
            self.parameter_file_object.max_allowed_alternative_proteins = 1

    def read_psms(self):
        """
        Method to read psms from the input files and to transform them into a list of
        [Psm][pyproteininference.physical.Psm] objects.

        This method sets the `psms` variable, which is a list of Psm objects.

        This method must be run before initializing a [DataStore object][pyproteininference.datastore.DataStore].

        Example:
            >>> reader = pyproteininference.reader.GenericReader(target_file = "example_target.txt",
            >>>     decoy_file = "example_decoy.txt",
            >>>     digest=digest, parameter_file_object=pi_params)
            >>> reader.read_psms()

        """
        logger.info("Reading in Input Files using Generic Reader...")
        all_psms = None
        # Read in and split by line
        # If target_file is a list... read them all in and concatenate...
        if self.target_file and self.decoy_file:
            if isinstance(self.target_file, (list,)):
                all_target = []
                for t_files in self.target_file:
                    ptarg = []
                    with open(t_files, "r") as psm_target_file:
                        logger.info(t_files)
                        spamreader = csv.DictReader(psm_target_file, delimiter="\t")
                        for row in spamreader:
                            row = self.get_alternative_proteins_from_input(row)
                            ptarg.append(row)
                    all_target = all_target + ptarg
            else:
                # If not just read the file...
                ptarg = []
                with open(self.target_file, "r") as psm_target_file:
                    logger.info(self.target_file)
                    spamreader = csv.DictReader(psm_target_file, delimiter="\t")
                    for row in spamreader:
                        row = self.get_alternative_proteins_from_input(row)
                        ptarg.append(row)
                all_target = ptarg

            # Repeat for decoy file
            if isinstance(self.decoy_file, (list,)):
                all_decoy = []
                for d_files in self.decoy_file:
                    pdec = []
                    with open(d_files, "r") as psm_decoy_file:
                        logger.info(d_files)
                        spamreader = csv.DictReader(psm_decoy_file, delimiter="\t")
                        for row in spamreader:
                            row = self.get_alternative_proteins_from_input(row)
                            pdec.append(row)
                    all_decoy = all_decoy + pdec
            else:
                pdec = []
                with open(self.decoy_file, "r") as psm_decoy_file:
                    logger.info(self.decoy_file)
                    spamreader = csv.DictReader(psm_decoy_file, delimiter="\t")
                    for row in spamreader:
                        row = self.get_alternative_proteins_from_input(row)
                        pdec.append(row)
                all_decoy = pdec

            # Combine the lists
            all_psms = all_target + all_decoy

        elif self.combined_files:
            if isinstance(self.combined_files, (list,)):
                all = []
                for c_files in self.combined_files:
                    c_all = []
                    with open(c_files, "r") as psm_file:
                        logger.info(c_files)
                        spamreader = csv.DictReader(psm_file, delimiter="\t")
                        for row in spamreader:
                            row = self.get_alternative_proteins_from_input(row)
                            c_all.append(row)
                    all = all + c_all
            else:
                c_all = []
                with open(self.combined_files, "r") as psm_file:
                    logger.info(self.combined_files)
                    spamreader = csv.DictReader(psm_file, delimiter="\t")
                    for row in spamreader:
                        row = self.get_alternative_proteins_from_input(row)
                        c_all.append(row)
                all = c_all
            all_psms = all

        elif self.directory:
            all_files = os.listdir(self.directory)
            all = []
            for files in all_files:
                psm_per_file = []
                # Join with the directory path so files resolve outside the CWD
                with open(os.path.join(self.directory, files), "r") as psm_file:
                    logger.info(files)
                    spamreader = csv.DictReader(psm_file, delimiter="\t")
                    for row in spamreader:
                        row = self.get_alternative_proteins_from_input(row)
                        psm_per_file.append(row)
                all = all + psm_per_file
            all_psms = all

        psms_all_filtered = []
        for psms in all_psms:
            if self.POSTERIOR_ERROR_PROB in psms.keys():
                try:
                    float(psms[self.POSTERIOR_ERROR_PROB])
                    psms_all_filtered.append(psms)
                except ValueError as e:  # noqa F841
                    pass
            else:
                try:
                    float(psms[self.scoring_variable])
                    psms_all_filtered.append(psms)
                except ValueError as e:  # noqa F841
                    pass

        # Filter by pep
        try:
            logger.info("Sorting by {}".format(self.POSTERIOR_ERROR_PROB))
            all_psms = sorted(
                psms_all_filtered,
                key=lambda x: float(x[self.POSTERIOR_ERROR_PROB]),
                reverse=False,
            )
        except KeyError:
            logger.info("Cannot Sort by {} the values do not exist".format(self.POSTERIOR_ERROR_PROB))
            logger.info("Sorting by {}".format(self.scoring_variable))
            if self.parameter_file_object.psm_score_type == Score.ADDITIVE_SCORE_TYPE:
                all_psms = sorted(
                    psms_all_filtered,
                    key=lambda x: float(x[self.scoring_variable]),
                    reverse=True,
                )
            if self.parameter_file_object.psm_score_type == Score.MULTIPLICATIVE_SCORE_TYPE:
                all_psms = sorted(
                    psms_all_filtered,
                    key=lambda x: float(x[self.scoring_variable]),
                    reverse=False,
                )

        logger.info("Number of PSMs in the input data: {}".format(len(all_psms)))
        if self.top_hit_per_psm_only:
            logger.info("Filtering to only top hit per PSM")
            psm_ids = set()
            all_psms_filtered = []
            for psm in all_psms:
                if psm[self.PSMID] not in psm_ids:
                    psm_ids.add(psm[self.PSMID])
                    all_psms_filtered.append(psm)
            all_psms = all_psms_filtered
            logger.info("Number of PSMs after filtering to top hit per PSM: {}".format(len(all_psms)))

        list_of_psm_objects = []
        peptide_tracker = set()
        all_sp_proteins = set(self.digest.swiss_prot_protein_set)
        # We only want to get unique peptides... using all messes up scoring...
        # Create Psm objects with the identifier, percscore, qvalue, pepvalue, and possible proteins...

        peptide_to_protein_dictionary = self.digest.peptide_to_protein_dictionary

        initial_poss_prots = []
        logger.info("Number of PSMs in the input data: {}".format(len(all_psms)))
        psms_with_alternative_proteins = self._find_psms_with_alternative_proteins(raw_psms=all_psms)
        logger.info(
            "Number of PSMs that have alternative proteins in the input data {}".format(
                len(psms_with_alternative_proteins)
            )
        )
        if len(psms_with_alternative_proteins) == 0:
            logger.warning(
                "No PSMs in the input have alternative proteins. "
                "Make sure your input is properly formatted. "
                "Alternative Proteins will be retrieved from the fasta database"
            )
        for psm_info in all_psms:
            current_peptide = psm_info[self.PEPTIDE]
            # Define the Psm...
            if current_peptide not in peptide_tracker:
                psm = Psm(identifier=current_peptide)
                # Attempt to add variables from PSM info...
                # If they do not exist in the psm info then we skip...
                try:
                    psm.percscore = float(psm_info[self.SCORE])
                except KeyError:
                    pass
                try:
                    psm.qvalue = float(psm_info[self.Q_VALUE])
                except KeyError:
                    pass
                try:
                    psm.pepvalue = float(psm_info[self.POSTERIOR_ERROR_PROB])
                except KeyError:
                    pass
                # If user has a custom score IE not q-value or pep_value...
                if self.load_custom_score:
                    # Then we look for it...
                    psm.custom_score = float(psm_info[self.scoring_variable])
                psm.possible_proteins = []
                psm.possible_proteins.append(psm_info[self.PROTEIN_IDS])
                psm.possible_proteins = psm.possible_proteins + [x for x in psm_info[self.ALTERNATIVE_PROTEINS] if x]
                # Remove potential Repeats
                if self.parameter_file_object.inference_type != Inference.FIRST_PROTEIN:
                    psm.possible_proteins = sorted(list(set(psm.possible_proteins)))

                input_poss_prots = copy.copy(psm.possible_proteins)

                # Get PSM ID
                psm.psm_id = psm_info[self.PSMID]

                # Split peptide if flanking
                current_peptide = Psm.split_peptide(peptide_string=current_peptide)

                if not current_peptide.isupper() or not current_peptide.isalpha():
                    # If we have mods remove them...
                    peptide_string = current_peptide.upper()
                    stripped_peptide = Psm.remove_peptide_mods(peptide_string)
                    current_peptide = stripped_peptide
                # Add the other possible_proteins from insilicodigest here...
                try:
                    current_alt_proteins = sorted(list(peptide_to_protein_dictionary[current_peptide]))
                except KeyError:
                    current_alt_proteins = []
                    logger.debug(
                        "Peptide {} was not found in the supplied DB for Proteins {}".format(
                            current_peptide, ";".join(psm.possible_proteins)
                        )
                    )
                    for poss_prot in psm.possible_proteins:
                        self.digest.peptide_to_protein_dictionary.setdefault(current_peptide, set()).add(poss_prot)
                        self.digest.protein_to_peptide_dictionary.setdefault(poss_prot, set()).add(current_peptide)
                        logger.debug(
                            "Adding Peptide {} and Protein {} to Digest dictionaries".format(current_peptide, poss_prot)
                        )

                # Sort Alt Proteins by Swissprot then Trembl...
                identifiers_sorted = DataStore.sort_protein_strings(
                    protein_string_list=current_alt_proteins,
                    sp_proteins=all_sp_proteins,
                    decoy_symbol=self.parameter_file_object.decoy_symbol,
                )

                # Restrict to 50 possible proteins
                psm = self._fix_alternative_proteins(
                    append_alt_from_db=self.append_alt_from_db,
                    identifiers_sorted=identifiers_sorted,
                    max_proteins=self.parameter_file_object.max_allowed_alternative_proteins,
                    psm=psm,
                    parameter_file_object=self.parameter_file_object,
                )

                list_of_psm_objects.append(psm)
                peptide_tracker.add(current_peptide)

                initial_poss_prots.append(input_poss_prots)

        self.psms = list_of_psm_objects

        self._check_initial_database_overlap(
            initial_possible_proteins=initial_poss_prots, initial_protein_peptide_map=self.initial_protein_peptide_map
        )

        logger.info("Length of PSM Data: {}".format(len(self.psms)))

        logger.info("Finished GenericReader.read_psms...")

    def _find_psms_with_alternative_proteins(self, raw_psms):

        psms_with_alternative_proteins = [x for x in raw_psms if x["alternative_proteins"]]

        return psms_with_alternative_proteins

__init__(digest, parameter_file_object, append_alt_from_db=True, target_file=None, decoy_file=None, combined_files=None, directory=None, top_hit_per_psm_only=False)

Parameters:
  • digest (Digest) –
  • parameter_file_object (ProteinInferenceParameter) –
  • append_alt_from_db (bool, default: True ) –

    Whether or not to append alternative proteins found in the database that are not in the input files.

  • target_file (str / list, default: None ) –

    Path to Target PSM result files.

  • decoy_file (str / list, default: None ) –

    Path to Decoy PSM result files.

  • combined_files (str / list, default: None ) –

    Path to Combined PSM result files.

  • directory (str, default: None ) –

    Path to directory containing combined PSM result files.

  • top_hit_per_psm_only (bool, default: False ) –

    If True, only include top hit for each PSM.

Returns:
  • Reader –

    Reader object.

Example

pyproteininference.reader.GenericReader(target_file="example_target.txt",
    decoy_file="example_decoy.txt",
    digest=digest, parameter_file_object=pi_params)
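As the constructor source below shows, the reader treats any `psm_score` parameter other than the q-value or posterior error probability columns as a custom score column to pull from the input files. A minimal standalone sketch of that decision (the constant values here are illustrative stand-ins, not the class's actual column names):

```python
# Illustrative stand-ins for the reader's Q_VALUE / POSTERIOR_ERROR_PROB
# class constants; the real values are the input file's column names.
Q_VALUE = "q_value"
POSTERIOR_ERROR_PROB = "posterior_error_prob"


def needs_custom_score(psm_score):
    # Standard columns are always parsed; any other value means the reader
    # must additionally load that column as a custom score.
    return psm_score not in (Q_VALUE, POSTERIOR_ERROR_PROB)


print(needs_custom_score("q_value"))   # False - standard column
print(needs_custom_score("lnExpect"))  # True - custom score column
```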

Source code in pyproteininference/reader.py
def __init__(
    self,
    digest,
    parameter_file_object,
    append_alt_from_db=True,
    target_file=None,
    decoy_file=None,
    combined_files=None,
    directory=None,
    top_hit_per_psm_only=False,
):
    """

    Args:
        digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
        parameter_file_object (ProteinInferenceParameter):
            [ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter] object.
        append_alt_from_db (bool): Whether or not to append alternative proteins found in the database that
            are not in the input files.
        target_file (str/list): Path to Target PSM result files.
        decoy_file (str/list): Path to Decoy PSM result files.
        combined_files (str/list): Path to Combined PSM result files.
        directory (str): Path to directory containing combined PSM result files.
        top_hit_per_psm_only (bool): If True, only include top hit for each PSM.

    Returns:
        Reader: [Reader][pyproteininference.reader.Reader] object.

    Example:
        >>> pyproteininference.reader.GenericReader(target_file = "example_target.txt",
        >>>     decoy_file = "example_decoy.txt",
        >>>     digest=digest, parameter_file_object=pi_params)
    """
    super().__init__(target_file, decoy_file, combined_files, directory)

    self.psms = None
    self.search_id = None
    self.digest = digest
    self.initial_protein_peptide_map = copy.copy(self.digest.protein_to_peptide_dictionary)
    self.load_custom_score = False

    self.top_hit_per_psm_only = top_hit_per_psm_only

    self.append_alt_from_db = append_alt_from_db

    self.parameter_file_object = parameter_file_object
    self.scoring_variable = parameter_file_object.psm_score

    self._validate_input()

    if self.scoring_variable != self.Q_VALUE and self.scoring_variable != self.POSTERIOR_ERROR_PROB:
        self.load_custom_score = True
        logger.info(
            "Pulling custom column based on parameter file input for score, Column: {}".format(
                self.scoring_variable
            )
        )
    else:
        logger.info(
            "Pulling no custom columns based on parameter file input for score, using standard Column: {}".format(
                self.scoring_variable
            )
        )

    # If we select to not run inference at all
    if self.parameter_file_object.inference_type == Inference.FIRST_PROTEIN:
        # Only allow 1 Protein per PSM
        self.parameter_file_object.max_allowed_alternative_proteins = 1

read_psms()

Method to read PSMs from the input files and transform them into a list of Psm objects.

This method sets the psms variable, which is a list of Psm objects.

This method must be run before initializing a DataStore object.

Example

reader = pyproteininference.reader.GenericReader(target_file="example_target.txt",
    decoy_file="example_decoy.txt",
    digest=digest, parameter_file_object=pi_params)
reader.read_psms()
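Before looking each peptide up in the digest, read_psms strips flanking residues and modification annotations from the peptide string (see the `split_peptide` / `remove_peptide_mods` calls in the source below). A standalone sketch of that normalization, for illustration only rather than the library's actual implementation:

```python
import re


def normalize_peptide(peptide):
    # Sketch of the normalization read_psms applies before the digest lookup.
    # Strip flanking residues of the form "X.SEQUENCE.Y" if present; the dot
    # inside a mod mass (e.g. "[15.99]") means a naive str.split(".") fails.
    m = re.match(r"^[A-Z-]\.(.+)\.[A-Z-]$", peptide)
    if m:
        peptide = m.group(1)
    # Drop modification annotations by keeping only letters, then uppercase.
    # (Mods written as lowercase letters would need extra handling.)
    return re.sub(r"[^A-Za-z]", "", peptide).upper()


print(normalize_peptide("K.PEPT[15.99]IDEK.R"))  # PEPTIDEK
print(normalize_peptide("PEPTIDEK"))             # PEPTIDEK
```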

Source code in pyproteininference/reader.py
def read_psms(self):
    """
    Method to read psms from the input files and to transform them into a list of
    [Psm][pyproteininference.physical.Psm] objects.

    This method sets the `psms` variable, which is a list of Psm objects.

    This method must be run before initializing a [DataStore object][pyproteininference.datastore.DataStore].

    Example:
        >>> reader = pyproteininference.reader.GenericReader(target_file = "example_target.txt",
        >>>     decoy_file = "example_decoy.txt",
        >>>     digest=digest, parameter_file_object=pi_params)
        >>> reader.read_psms()

    """
    logger.info("Reading in Input Files using Generic Reader...")
    all_psms = None
    # Read in and split by line
    # If target_file is a list... read them all in and concatenate...
    if self.target_file and self.decoy_file:
        if isinstance(self.target_file, (list,)):
            all_target = []
            for t_files in self.target_file:
                ptarg = []
                with open(t_files, "r") as psm_target_file:
                    logger.info(t_files)
                    spamreader = csv.DictReader(psm_target_file, delimiter="\t")
                    for row in spamreader:
                        row = self.get_alternative_proteins_from_input(row)
                        ptarg.append(row)
                all_target = all_target + ptarg
        else:
            # If not just read the file...
            ptarg = []
            with open(self.target_file, "r") as psm_target_file:
                logger.info(self.target_file)
                spamreader = csv.DictReader(psm_target_file, delimiter="\t")
                for row in spamreader:
                    row = self.get_alternative_proteins_from_input(row)
                    ptarg.append(row)
            all_target = ptarg

        # Repeat for decoy file
        if isinstance(self.decoy_file, (list,)):
            all_decoy = []
            for d_files in self.decoy_file:
                pdec = []
                with open(d_files, "r") as psm_decoy_file:
                    logger.info(d_files)
                    spamreader = csv.DictReader(psm_decoy_file, delimiter="\t")
                    for row in spamreader:
                        row = self.get_alternative_proteins_from_input(row)
                        pdec.append(row)
                all_decoy = all_decoy + pdec
        else:
            pdec = []
            with open(self.decoy_file, "r") as psm_decoy_file:
                logger.info(self.decoy_file)
                spamreader = csv.DictReader(psm_decoy_file, delimiter="\t")
                for row in spamreader:
                    row = self.get_alternative_proteins_from_input(row)
                    pdec.append(row)
            all_decoy = pdec

        # Combine the lists
        all_psms = all_target + all_decoy

    elif self.combined_files:
        if isinstance(self.combined_files, (list,)):
            all = []
            for c_files in self.combined_files:
                c_all = []
                with open(c_files, "r") as psm_file:
                    logger.info(c_files)
                    spamreader = csv.DictReader(psm_file, delimiter="\t")
                    for row in spamreader:
                        row = self.get_alternative_proteins_from_input(row)
                        c_all.append(row)
                all = all + c_all
        else:
            c_all = []
            with open(self.combined_files, "r") as psm_file:
                logger.info(self.combined_files)
                spamreader = csv.DictReader(psm_file, delimiter="\t")
                for row in spamreader:
                    row = self.get_alternative_proteins_from_input(row)
                    c_all.append(row)
            all = c_all
        all_psms = all

    elif self.directory:
        all_files = os.listdir(self.directory)
        all = []
        for files in all_files:
            psm_per_file = []
            with open(os.path.join(self.directory, files), "r") as psm_file:
                logger.info(files)
                spamreader = csv.DictReader(psm_file, delimiter="\t")
                for row in spamreader:
                    row = self.get_alternative_proteins_from_input(row)
                    psm_per_file.append(row)
            all = all + psm_per_file
        all_psms = all

    psms_all_filtered = []
    for psms in all_psms:
        if self.POSTERIOR_ERROR_PROB in psms.keys():
            try:
                float(psms[self.POSTERIOR_ERROR_PROB])
                psms_all_filtered.append(psms)
            except ValueError as e:  # noqa F841
                pass
        else:
            try:
                float(psms[self.scoring_variable])
                psms_all_filtered.append(psms)
            except ValueError as e:  # noqa F841
                pass

    # Filter by pep
    try:
        logger.info("Sorting by {}".format(self.POSTERIOR_ERROR_PROB))
        all_psms = sorted(
            psms_all_filtered,
            key=lambda x: float(x[self.POSTERIOR_ERROR_PROB]),
            reverse=False,
        )
    except KeyError:
        logger.info("Cannot Sort by {} the values do not exist".format(self.POSTERIOR_ERROR_PROB))
        logger.info("Sorting by {}".format(self.scoring_variable))
        if self.parameter_file_object.psm_score_type == Score.ADDITIVE_SCORE_TYPE:
            all_psms = sorted(
                psms_all_filtered,
                key=lambda x: float(x[self.scoring_variable]),
                reverse=True,
            )
        if self.parameter_file_object.psm_score_type == Score.MULTIPLICATIVE_SCORE_TYPE:
            all_psms = sorted(
                psms_all_filtered,
                key=lambda x: float(x[self.scoring_variable]),
                reverse=False,
            )

    logger.info("Number of PSMs in the input data: {}".format(len(all_psms)))
    if self.top_hit_per_psm_only:
        logger.info("Filtering to only top hit per PSM")
        psm_ids = set()
        all_psms_filtered = []
        for psm in all_psms:
            if psm[self.PSMID] not in psm_ids:
                psm_ids.add(psm[self.PSMID])
                all_psms_filtered.append(psm)
        all_psms = all_psms_filtered
        logger.info("Number of PSMs after filtering to top hit per PSM: {}".format(len(all_psms)))

    list_of_psm_objects = []
    peptide_tracker = set()
    all_sp_proteins = set(self.digest.swiss_prot_protein_set)
    # We only want to get unique peptides... using all messes up scoring...
    # Create Psm objects with the identifier, percscore, qvalue, pepvalue, and possible proteins...

    peptide_to_protein_dictionary = self.digest.peptide_to_protein_dictionary

    initial_poss_prots = []
    logger.info("Number of PSMs in the input data: {}".format(len(all_psms)))
    psms_with_alternative_proteins = self._find_psms_with_alternative_proteins(raw_psms=all_psms)
    logger.info(
        "Number of PSMs that have alternative proteins in the input data {}".format(
            len(psms_with_alternative_proteins)
        )
    )
    if len(psms_with_alternative_proteins) == 0:
        logger.warning(
            "No PSMs in the input have alternative proteins. "
            "Make sure your input is properly formatted. "
            "Alternative Proteins will be retrieved from the fasta database"
        )
    for psm_info in all_psms:
        current_peptide = psm_info[self.PEPTIDE]
        # Define the Psm...
        if current_peptide not in peptide_tracker:
            psm = Psm(identifier=current_peptide)
            # Attempt to add variables from PSM info...
            # If they do not exist in the psm info then we skip...
            try:
                psm.percscore = float(psm_info[self.SCORE])
            except KeyError:
                pass
            try:
                psm.qvalue = float(psm_info[self.Q_VALUE])
            except KeyError:
                pass
            try:
                psm.pepvalue = float(psm_info[self.POSTERIOR_ERROR_PROB])
            except KeyError:
                pass
            # If user has a custom score IE not q-value or pep_value...
            if self.load_custom_score:
                # Then we look for it...
                psm.custom_score = float(psm_info[self.scoring_variable])
            psm.possible_proteins = []
            psm.possible_proteins.append(psm_info[self.PROTEIN_IDS])
            psm.possible_proteins = psm.possible_proteins + [x for x in psm_info[self.ALTERNATIVE_PROTEINS] if x]
            # Remove potential Repeats
            if self.parameter_file_object.inference_type != Inference.FIRST_PROTEIN:
                psm.possible_proteins = sorted(list(set(psm.possible_proteins)))

            input_poss_prots = copy.copy(psm.possible_proteins)

            # Get PSM ID
            psm.psm_id = psm_info[self.PSMID]

            # Split peptide if flanking
            current_peptide = Psm.split_peptide(peptide_string=current_peptide)

            if not current_peptide.isupper() or not current_peptide.isalpha():
                # If we have mods remove them...
                peptide_string = current_peptide.upper()
                stripped_peptide = Psm.remove_peptide_mods(peptide_string)
                current_peptide = stripped_peptide
            # Add the other possible_proteins from insilicodigest here...
            try:
                current_alt_proteins = sorted(list(peptide_to_protein_dictionary[current_peptide]))
            except KeyError:
                current_alt_proteins = []
                logger.debug(
                    "Peptide {} was not found in the supplied DB for Proteins {}".format(
                        current_peptide, ";".join(psm.possible_proteins)
                    )
                )
                for poss_prot in psm.possible_proteins:
                    self.digest.peptide_to_protein_dictionary.setdefault(current_peptide, set()).add(poss_prot)
                    self.digest.protein_to_peptide_dictionary.setdefault(poss_prot, set()).add(current_peptide)
                    logger.debug(
                        "Adding Peptide {} and Protein {} to Digest dictionaries".format(current_peptide, poss_prot)
                    )

            # Sort Alt Proteins by Swissprot then Trembl...
            identifiers_sorted = DataStore.sort_protein_strings(
                protein_string_list=current_alt_proteins,
                sp_proteins=all_sp_proteins,
                decoy_symbol=self.parameter_file_object.decoy_symbol,
            )

            # Restrict to 50 possible proteins
            psm = self._fix_alternative_proteins(
                append_alt_from_db=self.append_alt_from_db,
                identifiers_sorted=identifiers_sorted,
                max_proteins=self.parameter_file_object.max_allowed_alternative_proteins,
                psm=psm,
                parameter_file_object=self.parameter_file_object,
            )

            list_of_psm_objects.append(psm)
            peptide_tracker.add(current_peptide)

            initial_poss_prots.append(input_poss_prots)

    self.psms = list_of_psm_objects

    self._check_initial_database_overlap(
        initial_possible_proteins=initial_poss_prots, initial_protein_peptide_map=self.initial_protein_peptide_map
    )

    logger.info("Length of PSM Data: {}".format(len(self.psms)))

    logger.info("Finished GenericReader.read_psms...")

IdXMLReader

Bases: Reader

The following class takes an idXML-like file and creates standard Psm objects.

Attributes:
  • target_file (str / list) –

    Path to Target PSM result files.

  • decoy_file (str / list) –

    Path to Decoy PSM result files.

  • combined_files (str / list) –

    Path to Combined PSM result files.

  • directory (str) –

    Path to directory containing combined PSM result files.

  • psms (list) –

    List of Psm objects.

  • load_custom_score (bool) –

    True/False on whether or not to load a custom score. Depends on scoring_variable.

  • scoring_variable (str) –

    String to indicate which column in the input file is to be used as the scoring input.

  • digest (Digest) –
  • parameter_file_object (ProteinInferenceParameter) –
  • append_alt_from_db (bool) –

    Whether or not to append alternative proteins found in the database that are not in the input files.
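For idXML input, the reader maps the parameter file's `psm_score` value onto the PSI-MS CV accessions stored in the file, passing unknown values through unchanged as custom score attributes. A standalone sketch of that lookup, using the CV terms from the class constants in the source below:

```python
# CV-term mapping taken from IdXMLReader's PSM_SCORE_MAPPING class constant.
PSM_SCORE_MAPPING = {
    "posterior_error_prob": "MS:1001493",
    "q-value": "MS:1001491",
    "score": "MS:1001492",
}


def resolve_scoring_variable(psm_score):
    # dict.get with the input itself as the default implements the
    # pass-through: unmapped names are treated as custom attributes.
    return PSM_SCORE_MAPPING.get(psm_score, psm_score)


print(resolve_scoring_variable("q-value"))   # MS:1001491
print(resolve_scoring_variable("lnExpect"))  # lnExpect (custom attribute)
```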

Source code in pyproteininference/reader.py
class IdXMLReader(Reader):
    """
    The following class takes an idXML-like file
    and creates standard [Psm][pyproteininference.physical.Psm] objects.

    Attributes:
        target_file (str/list): Path to Target PSM result files.
        decoy_file (str/list): Path to Decoy PSM result files.
        combined_files (str/list): Path to Combined PSM result files.
        directory (str): Path to directory containing combined PSM result files.
        psms (list): List of [Psm][pyproteininference.physical.Psm] objects.
        load_custom_score (bool): True/False on whether or not to load a custom score. Depends on scoring_variable.
        scoring_variable (str): String to indicate which column in the input file is to be used as the scoring input.
        digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
        parameter_file_object (ProteinInferenceParameter):
            [ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter] object
        append_alt_from_db (bool): Whether or not to append alternative proteins found in the database that
            are not in the input files.



    """

    PSMID = "PSMId"
    SCORE = "MS:1001492"
    Q_VALUE = "MS:1001491"
    POSTERIOR_ERROR_PROB = "MS:1001493"
    PEPTIDE = "peptide"
    PROTEIN_IDS = "proteinIds"
    ALTERNATIVE_PROTEINS = "alternative_proteins"

    PSM_SCORE_MAPPING = {
        "posterior_error_prob": POSTERIOR_ERROR_PROB,
        "q-value": Q_VALUE,
        "score": SCORE,
    }

    def __init__(
        self,
        digest,
        parameter_file_object,
        append_alt_from_db=True,
        target_file=None,
        decoy_file=None,
        combined_files=None,
        directory=None,
        top_hit_per_psm_only=False,
    ):
        """

        Args:
            digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
            parameter_file_object (ProteinInferenceParameter):
                [ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter] object.
            append_alt_from_db (bool): Whether or not to append alternative proteins found in the database that
                are not in the input files.
            target_file (str/list): Path to Target PSM result files.
            decoy_file (str/list): Path to Decoy PSM result files.
            combined_files (str/list): Path to Combined PSM result files.
            directory (str): Path to directory containing combined PSM result files.
            top_hit_per_psm_only (bool): If True, only include top hit for each PSM.

        Returns:
            Reader: [Reader][pyproteininference.reader.Reader] object.

        Example:
            >>> pyproteininference.reader.IdXMLReader(combined_files="example_file.idXML",
            >>>     digest=digest, parameter_file_object=pi_params)
        """
        self.target_file = target_file
        self.decoy_file = decoy_file
        self.combined_files = combined_files
        self.directory = directory

        self.top_hit_per_psm_only = top_hit_per_psm_only

        self.psms = None
        self.search_id = None
        self.digest = digest
        self.initial_protein_peptide_map = copy.copy(self.digest.protein_to_peptide_dictionary)
        self.load_custom_score = False

        self.append_alt_from_db = append_alt_from_db

        self.parameter_file_object = parameter_file_object
        # map the common scoring variables (posterior_error_prob, q-value, score) to the PSI MS CV terms,
        # or use the custom term as-is if not present in the mapping
        self.scoring_variable = self.PSM_SCORE_MAPPING.get(
            parameter_file_object.psm_score, parameter_file_object.psm_score
        )

        self._validate_input()

        if self.scoring_variable != self.Q_VALUE and self.scoring_variable != self.POSTERIOR_ERROR_PROB:
            self.load_custom_score = True
            logger.info(
                "Pulling custom column based on parameter file input for score, Attribute: {}".format(
                    self.scoring_variable
                )
            )
        else:
            logger.info(
                "Pulling no custom columns based on parameter file input for score, using standard Attribute: {}".format(
                    self.scoring_variable
                )
            )

        # If we select to not run inference at all
        if self.parameter_file_object.inference_type == Inference.FIRST_PROTEIN:
            # Only allow 1 Protein per PSM
            self.parameter_file_object.max_allowed_alternative_proteins = 1

    def read_psms(self):
        if self.parameter_file_object.xml_input_parser == "pyteomics":
            self._read_psms_pyteomics()
        else:
            self._read_psms_openms()

    def _read_psms_pyteomics(self):
        """
        Method to read psms from the input files and to transform them into a list of
        [Psm][pyproteininference.physical.Psm] objects.

        This method sets the `psms` variable, which is a list of Psm objects.

        This method must be run before initializing a [DataStore object][pyproteininference.datastore.DataStore].

        Example:
            >>> reader = pyproteininference.reader.IdXMLReader(combined_files="example_file.idXML",
            >>>     digest=digest, parameter_file_object=pi_params)
            >>> reader.read_psms()

        """
        logger.info("Reading in Input Files using IdXML Reader (pyteomics)...")

        input_files = list()
        for input_filenames in (self.combined_files, self.target_file, self.decoy_file):
            if input_filenames is not None:
                if isinstance(input_filenames, str):
                    input_filenames = [input_filenames]
                input_files.extend(input_filenames)
        if len(input_files) == 0:
            raise ValueError("For idXML files, at least one file must be supplied as target, decoy, or combined input.")
        logger.info(f"Reading input from {','.join([str(x) for x in input_files])}")
        reader = idxml.chain.from_iterable(input_files)

        def _sort_protein_hits_by_selected_score(hits, score_key, fallback_score_key):
            sorting_key = score_key
            sorting_direction_reversed = False
            if not all([score_key in hit for hit in hits]):
                if not all([fallback_score_key in hit for hit in hits]):
                    raise ValueError(
                        f"Score key {score_key} not found in all hits: "
                        f"{[x['sequence'] for x in hits if score_key not in x]}"
                    )
                sorting_key = fallback_score_key
                # if additive score type, reverse the sorting (multiplicative score type still is normal sort)
                if self.parameter_file_object.psm_score_type == Score.ADDITIVE_SCORE_TYPE:
                    sorting_direction_reversed = True
            # presort by target / decoy to maintain deterministic parity in how percolator generic results are handled
            # (with a preference for targets over decoys there)
            sorted_hits = sorted(hits, key=lambda x: 1 if x['target_decoy'] == "decoy" else 0, reverse=False)
            return sorted(sorted_hits, key=lambda x: float(x[sorting_key]), reverse=sorting_direction_reversed)

        list_of_psm_objects = []
        peptide_tracker = set()
        all_sp_proteins = set(self.digest.swiss_prot_protein_set)
        # We only want to get unique peptides... using all messes up scoring...
        # Create Psm objects with the identifier, percscore, qvalue, pepvalue, and possible proteins...

        peptide_to_protein_dictionary = self.digest.peptide_to_protein_dictionary

        initial_poss_prots = []

        for psm_info in tqdm.tqdm(reader, desc="Reading PSMs", unit=" PSMs"):
            peptide_hits = psm_info.get("PeptideHit", None)
            if not peptide_hits:
                continue

            sorted_hits = _sort_protein_hits_by_selected_score(
                peptide_hits, self.scoring_variable, self.POSTERIOR_ERROR_PROB
            )
            best_hit = sorted_hits[0]
            current_peptide = (
                f"{best_hit.get('aa_before',['-'])[0]}"
                f".{best_hit.get('sequence','')}."
                f"{best_hit.get('aa_after',['-'])[0]}"
            )
            # Define the Psm...
            if current_peptide not in peptide_tracker:
                psm = Psm(identifier=current_peptide)
                # Attempt to add variables from PSM info...
                # If they do not exist in the psm info then we skip...
                psm.percscore = float(best_hit.get(self.SCORE)) if best_hit.get(self.SCORE) is not None else None
                psm.qvalue = float(best_hit.get(self.Q_VALUE)) if best_hit.get(self.Q_VALUE) is not None else None
                psm.pepvalue = (
                    float(best_hit.get(self.POSTERIOR_ERROR_PROB))
                    if best_hit.get(self.POSTERIOR_ERROR_PROB) is not None
                    else None
                )
                if self.load_custom_score:
                    psm.custom_score = (
                        float(best_hit.get(self.scoring_variable))
                        if best_hit.get(self.scoring_variable) is not None
                        else None
                    )

                psm.possible_proteins = [prot['accession'] for prot in best_hit['protein']]
                # Remove potential Repeats
                if self.parameter_file_object.inference_type != Inference.FIRST_PROTEIN:
                    psm.possible_proteins = sorted(list(set(psm.possible_proteins)))

                input_poss_prots = copy.copy(psm.possible_proteins)

                # Get PSM ID
                psm.psm_id = psm_info.get('spectrum_reference')

                # Split peptide if flanking
                current_peptide = Psm.split_peptide(peptide_string=current_peptide)

                if not current_peptide.isupper() or not current_peptide.isalpha():
                    # If we have mods remove them...
                    peptide_string = current_peptide.upper()
                    stripped_peptide = Psm.remove_peptide_mods(peptide_string)
                    current_peptide = stripped_peptide
                # Add the other possible_proteins from insilicodigest here...
                try:
                    current_alt_proteins = sorted(list(peptide_to_protein_dictionary[current_peptide]))
                except KeyError:
                    current_alt_proteins = []
                    logger.debug(
                        "Peptide {} was not found in the supplied DB for Proteins {}".format(
                            current_peptide, ";".join(psm.possible_proteins)
                        )
                    )
                    for poss_prot in psm.possible_proteins:
                        self.digest.peptide_to_protein_dictionary.setdefault(current_peptide, set()).add(poss_prot)
                        self.digest.protein_to_peptide_dictionary.setdefault(poss_prot, set()).add(current_peptide)
                        logger.debug(
                            "Adding Peptide {} and Protein {} to Digest dictionaries".format(current_peptide, poss_prot)
                        )

                # Sort Alt Proteins by Swissprot then Trembl...
                identifiers_sorted = DataStore.sort_protein_strings(
                    protein_string_list=current_alt_proteins,
                    sp_proteins=all_sp_proteins,
                    decoy_symbol=self.parameter_file_object.decoy_symbol,
                )

                # Restrict to the configured maximum number of alternative proteins
                psm = self._fix_alternative_proteins(
                    append_alt_from_db=self.append_alt_from_db,
                    identifiers_sorted=identifiers_sorted,
                    max_proteins=self.parameter_file_object.max_allowed_alternative_proteins,
                    psm=psm,
                    parameter_file_object=self.parameter_file_object,
                )

                list_of_psm_objects.append(psm)
                peptide_tracker.add(current_peptide)

                initial_poss_prots.append(input_poss_prots)

        self.psms = list_of_psm_objects

        self._check_initial_database_overlap(
            initial_possible_proteins=initial_poss_prots, initial_protein_peptide_map=self.initial_protein_peptide_map
        )

        logger.info("Length of PSM Data: {}".format(len(self.psms)))

        logger.info("Finished IdXMLReader.read_psms...")

    def _read_psms_openms(self):
        """
        Method to read psms from the input files and to transform them into a list of
        [Psm][pyproteininference.physical.Psm] objects.

        This method sets the `psms` variable, which is a list of Psm objects.

        This method must be run before initializing [DataStore object][pyproteininference.datastore.DataStore].

        Example:
            >>> reader = pyproteininference.reader.IdXMLReader(combined_files="example_file.idXML",
            >>>     digest=digest, parameter_file_object=pi_params)
            >>> reader.read_psms()

        """
        logger.info("Reading in Input Files using IdXML Reader (OpenMS)...")

        input_files = list()
        for input_filenames in (self.combined_files, self.target_file, self.decoy_file):
            if input_filenames is not None:
                if isinstance(input_filenames, str):
                    input_filenames = [input_filenames]
                input_files.extend(input_filenames)
        if len(input_files) == 0:
            raise ValueError("For idXML files, at least one file must be supplied as target, decoy, or combined input.")
        logger.info(f"Reading input from {','.join(str(x) for x in input_files)}")

        def _sort_protein_hits_by_selected_score(hits, score_key, fallback_score_key):
            sorting_key = score_key
            sorting_direction_reversed = False
            if not all([hit.metaValueExists(score_key) for hit in hits]):
                if not all([hit.metaValueExists(fallback_score_key) for hit in hits]):
                    raise ValueError(
                        f"Score key {score_key} not found in all hits: "
                        f"{[x.getSequence().toString() for x in hits if not x.metaValueExists(score_key)]}"
                    )
                sorting_key = fallback_score_key
                # if additive score type, reverse the sorting (multiplicative score type still is normal sort)
                if self.parameter_file_object.psm_score_type == Score.ADDITIVE_SCORE_TYPE:
                    sorting_direction_reversed = True
            # presort by target / decoy to maintain deterministic parity in how percolator generic results are handled
            # (with a preference for targets over decoys there)
            # @todo check that target_decoy is created for all instances (e.g. pepxml)
            sorted_hits = sorted(
                hits, key=lambda x: 1 if x.getMetaValue('target_decoy') == "decoy" else 0, reverse=False
            )
            return sorted(
                sorted_hits, key=lambda x: float(x.getMetaValue(sorting_key)), reverse=sorting_direction_reversed
            )

        list_of_psm_objects = []
        peptide_tracker = set()
        all_sp_proteins = set(self.digest.swiss_prot_protein_set)
        # We only want to get unique peptides... using all messes up scoring...
        # Create Psm objects with the identifier, percscore, qvalue, pepvalue, and possible proteins...

        peptide_to_protein_dictionary = self.digest.peptide_to_protein_dictionary

        initial_poss_prots = []

        for input_file in tqdm.tqdm(input_files, desc="Reading Input files", unit=" files"):

            protein_ids = []
            peptide_ids = []

            file_extension = os.path.splitext(input_file)[1].lower()
            if file_extension == ".mzid":
                logger.info("Reading input as mzIdentML")
                pyopenms.MzIdentMLFile().load(input_file, protein_ids, peptide_ids)
            elif file_extension in (".pepxml", ".pep.xml", ".xml"):
                logger.info("Reading input as pepXML")
                pyopenms.PepXMLFile().load(input_file, protein_ids, peptide_ids)
            else:
                logger.info("Reading input as idXML")
                pyopenms.IdXMLFile().load(input_file, protein_ids, peptide_ids)

            for peptide_id in tqdm.tqdm(peptide_ids, desc="Reading PSMs", unit=" PSMs"):
                peptide_hits = peptide_id.getHits()
                if len(peptide_hits) < 1:
                    continue

                sorted_hits = _sort_protein_hits_by_selected_score(
                    peptide_hits, self.scoring_variable, self.POSTERIOR_ERROR_PROB
                )
                best_hit = sorted_hits[0]
                aa_before = best_hit.getPeptideEvidences()[0].getAABefore()
                aa_after = best_hit.getPeptideEvidences()[0].getAAAfter()
                if aa_before is None or aa_before == "UNKNOWN_AA":
                    aa_before = "X"
                elif aa_before == "N_TERMINAL_AA":
                    aa_before = "-"
                if aa_after is None or aa_after == "UNKNOWN_AA":
                    aa_after = "X"
                elif aa_after == "C_TERMINAL_AA":
                    aa_after = "-"

                current_peptide = f"{aa_before}.{best_hit.getSequence().toString()}.{aa_after}"
                # Define the Psm...
                if current_peptide not in peptide_tracker:
                    psm = Psm(identifier=current_peptide)
                    # Attempt to add variables from PSM info...
                    # If they do not exist in the psm info then we skip...
                    psm.percscore = (
                        float(best_hit.getMetaValue(self.SCORE))
                        if best_hit.getMetaValue(self.SCORE) is not None
                        else None
                    )
                    psm.qvalue = (
                        float(best_hit.getMetaValue(self.Q_VALUE))
                        if best_hit.getMetaValue(self.Q_VALUE) is not None
                        else None
                    )
                    psm.pepvalue = (
                        float(best_hit.getMetaValue(self.POSTERIOR_ERROR_PROB))
                        if best_hit.getMetaValue(self.POSTERIOR_ERROR_PROB) is not None
                        else None
                    )
                    if self.load_custom_score:
                        psm.custom_score = (
                            float(best_hit.getMetaValue(self.scoring_variable))
                            if best_hit.getMetaValue(self.scoring_variable) is not None
                            else None
                        )

                    psm.possible_proteins = [x.decode("utf-8") for x in best_hit.extractProteinAccessionsSet()]
                    # Remove potential Repeats
                    if self.parameter_file_object.inference_type != Inference.FIRST_PROTEIN:
                        psm.possible_proteins = sorted(list(set(psm.possible_proteins)))

                    input_poss_prots = copy.copy(psm.possible_proteins)

                    # Get PSM ID
                    psm.psm_id = peptide_id.getMetaValue("spectrum_reference")

                    # Split peptide if flanking
                    current_peptide = Psm.split_peptide(peptide_string=current_peptide)

                    if not current_peptide.isupper() or not current_peptide.isalpha():
                        # If we have mods remove them...
                        peptide_string = current_peptide.upper()
                        stripped_peptide = Psm.remove_peptide_mods(peptide_string)
                        current_peptide = stripped_peptide
                    # Add the other possible_proteins from insilicodigest here...
                    try:
                        current_alt_proteins = sorted(list(peptide_to_protein_dictionary[current_peptide]))
                    except KeyError:
                        current_alt_proteins = []
                        logger.debug(
                            "Peptide {} was not found in the supplied DB for Proteins {}".format(
                                current_peptide, ";".join(psm.possible_proteins)
                            )
                        )
                        for poss_prot in psm.possible_proteins:
                            self.digest.peptide_to_protein_dictionary.setdefault(current_peptide, set()).add(poss_prot)
                            self.digest.protein_to_peptide_dictionary.setdefault(poss_prot, set()).add(current_peptide)
                            logger.debug(
                                "Adding Peptide {} and Protein {} to Digest dictionaries".format(
                                    current_peptide, poss_prot
                                )
                            )

                    # Sort Alt Proteins by Swissprot then Trembl...
                    identifiers_sorted = DataStore.sort_protein_strings(
                        protein_string_list=current_alt_proteins,
                        sp_proteins=all_sp_proteins,
                        decoy_symbol=self.parameter_file_object.decoy_symbol,
                    )

                    # Restrict to the configured maximum number of alternative proteins
                    psm = self._fix_alternative_proteins(
                        append_alt_from_db=self.append_alt_from_db,
                        identifiers_sorted=identifiers_sorted,
                        max_proteins=self.parameter_file_object.max_allowed_alternative_proteins,
                        psm=psm,
                        parameter_file_object=self.parameter_file_object,
                    )

                    list_of_psm_objects.append(psm)
                    peptide_tracker.add(current_peptide)

                    initial_poss_prots.append(input_poss_prots)

        self.psms = list_of_psm_objects

        self._check_initial_database_overlap(
            initial_possible_proteins=initial_poss_prots, initial_protein_peptide_map=self.initial_protein_peptide_map
        )

        logger.info("Length of PSM Data: {}".format(len(self.psms)))

        logger.info("Finished IdXMLReader.read_psms...")

    def _find_psms_with_alternative_proteins(self, raw_psms):

        psms_with_alternative_proteins = [x for x in raw_psms if x["alternative_proteins"]]

        return psms_with_alternative_proteins
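The helper above simply keeps rows whose `alternative_proteins` entry is non-empty. A standalone sketch of the same filter, using made-up dictionaries rather than the library's raw PSM rows:

```python
# Hypothetical raw PSM rows; in the library these come from the reader's
# parsed input, and "alternative_proteins" is assumed to hold a list.
raw_psms = [
    {"peptide": "PEPTIDEA", "alternative_proteins": ["P1", "P2"]},
    {"peptide": "PEPTIDEB", "alternative_proteins": []},
]

# Keep only PSMs that list at least one alternative protein,
# mirroring _find_psms_with_alternative_proteins above.
with_alts = [x for x in raw_psms if x["alternative_proteins"]]
```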

__init__(digest, parameter_file_object, append_alt_from_db=True, target_file=None, decoy_file=None, combined_files=None, directory=None, top_hit_per_psm_only=False)

Parameters:
  • digest (Digest) –
  • parameter_file_object (ProteinInferenceParameter) –
  • append_alt_from_db (bool, default: True ) –

    Whether or not to append alternative proteins found in the database that are not in the input files.

  • target_file (str / list, default: None ) –

    Path to Target PSM result files.

  • decoy_file (str / list, default: None ) –

    Path to Decoy PSM result files.

  • combined_files (str / list, default: None ) –

    Path to Combined PSM result files.

  • directory (str, default: None ) –

    Path to directory containing combined PSM result files.

  • top_hit_per_psm_only (bool, default: False ) –

    If True, only include top hit for each PSM.

Returns:
  • Reader –

    Reader object.

Example:

pyproteininference.reader.IdXMLReader(combined_files="example_file.idXML", digest=digest, parameter_file_object=pi_params)

Source code in pyproteininference/reader.py
def __init__(
    self,
    digest,
    parameter_file_object,
    append_alt_from_db=True,
    target_file=None,
    decoy_file=None,
    combined_files=None,
    directory=None,
    top_hit_per_psm_only=False,
):
    """

    Args:
        digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
        parameter_file_object (ProteinInferenceParameter):
            [ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter] object.
        append_alt_from_db (bool): Whether or not to append alternative proteins found in the database that
            are not in the input files.
        target_file (str/list): Path to Target PSM result files.
        decoy_file (str/list): Path to Decoy PSM result files.
        combined_files (str/list): Path to Combined PSM result files.
        directory (str): Path to directory containing combined PSM result files.
        top_hit_per_psm_only (bool): If True, only include top hit for each PSM.

    Returns:
        Reader: [Reader][pyproteininference.reader.Reader] object.

    Example:
        >>> pyproteininference.reader.IdXMLReader(combined_files="example_file.idXML",
        >>>     digest=digest, parameter_file_object=pi_params)
    """
    self.target_file = target_file
    self.decoy_file = decoy_file
    self.combined_files = combined_files
    self.directory = directory

    self.top_hit_per_psm_only = top_hit_per_psm_only

    self.psms = None
    self.search_id = None
    self.digest = digest
    self.initial_protein_peptide_map = copy.copy(self.digest.protein_to_peptide_dictionary)
    self.load_custom_score = False

    self.append_alt_from_db = append_alt_from_db

    self.parameter_file_object = parameter_file_object
    # map the common scoring variables (posterior_error_prob, q-value, score) to the PSI MS CV terms,
    # or use the custom term as-is if not present in the mapping
    self.scoring_variable = self.PSM_SCORE_MAPPING.get(
        parameter_file_object.psm_score, parameter_file_object.psm_score
    )

    self._validate_input()

    if self.scoring_variable != self.Q_VALUE and self.scoring_variable != self.POSTERIOR_ERROR_PROB:
        self.load_custom_score = True
        logger.info(
            "Pulling custom column based on parameter file input for score, Attribute: {}".format(
                self.scoring_variable
            )
        )
    else:
        logger.info(
            "Pulling no custom columns based on parameter file input for score, using standard Attribute: {}".format(
                self.scoring_variable
            )
        )

    # If we select to not run inference at all
    if self.parameter_file_object.inference_type == Inference.FIRST_PROTEIN:
        # Only allow 1 Protein per PSM
        self.parameter_file_object.max_allowed_alternative_proteins = 1

PercolatorReader

Bases: Reader

The following class takes a percolator target file and a percolator decoy file or combined files/directory and creates standard Psm objects. This reader class is used as input for DataStore object.

Percolator output is formatted as follows, with each entry being tab-delimited:

| PSMId | score | q-value | posterior_error_prob | peptide | proteinIds | | | |
|-------|-------|---------|----------------------|---------|------------|---|---|---|
| 116108.15139.15139.6.dta | 3.44016 | 0.000479928 | 7.60258e-10 | K.MVVSMTLGLHPWIANIDDTQYLAAK.R | CNDP1_HUMAN\|Q96KN2 | B4E180_HUMAN\|B4E180 | A8K1K1_HUMAN\|A8K1K1 | J3KRP0_HUMAN\|J3KRP0 |
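As an illustrative sketch (not library code), a row in this layout can be split with the standard csv module; the index constants mirror the class attributes documented below:

```python
import csv
import io

# Column positions as documented for Percolator tab-delimited output.
PSMID_INDEX = 0
PERC_SCORE_INDEX = 1
Q_VALUE_INDEX = 2
POSTERIOR_ERROR_PROB_INDEX = 3
PEPTIDE_INDEX = 4
PROTEINIDS_INDEX = 5

# One example row (as in the table above, with two protein columns shown).
row_text = (
    "116108.15139.15139.6.dta\t3.44016\t0.000479928\t7.60258e-10\t"
    "K.MVVSMTLGLHPWIANIDDTQYLAAK.R\tCNDP1_HUMAN|Q96KN2\tB4E180_HUMAN|B4E180"
)

row = next(csv.reader(io.StringIO(row_text), delimiter="\t"))
psm_id = row[PSMID_INDEX]
pep_value = float(row[POSTERIOR_ERROR_PROB_INDEX])
# Every column from PROTEINIDS_INDEX onward holds one protein accession.
proteins = row[PROTEINIDS_INDEX:]
```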

Attributes:
  • target_file (str / list) –

    Path to Target PSM result files.

  • decoy_file (str / list) –

    Path to Decoy PSM result files.

  • combined_files (str / list) –

    Path to Combined PSM result files.

  • directory (str) –

    Path to directory containing combined PSM result files.

  • PSMID_INDEX (int) –

    Index of the PSMId from the input files.

  • PERC_SCORE_INDEX (int) –

    Index of the Percolator score from the input files.

  • Q_VALUE_INDEX (int) –

    Index of the q-value from the input files.

  • POSTERIOR_ERROR_PROB_INDEX (int) –

    Index of the posterior error probability from the input files.

  • PEPTIDE_INDEX (int) –

    Index of the peptides from the input files.

  • PROTEINIDS_INDEX (int) –

    Index of the proteins from the input files.

  • psms (list) –

    List of Psm objects.

Source code in pyproteininference/reader.py
class PercolatorReader(Reader):
    """
    The following class takes a percolator target file and a percolator decoy file
    or combined files/directory and creates standard [Psm][pyproteininference.physical.Psm] objects.
    This reader class is used as input for [DataStore object][pyproteininference.datastore.DataStore].

    Percolator Output is formatted as follows:
    with each entry being tab delimited.

    | PSMId                         | score    |  q-value    | posterior_error_prob  |  peptide                       | proteinIds          |                      |                      |                         | # noqa E501 W605
    |-------------------------------|----------|-------------|-----------------------|--------------------------------|---------------------|----------------------|----------------------|-------------------------| # noqa E501 W605
    |     116108.15139.15139.6.dta  |  3.44016 | 0.000479928 | 7.60258e-10           | K.MVVSMTLGLHPWIANIDDTQYLAAK.R  | CNDP1_HUMAN\|Q96KN2 | B4E180_HUMAN\|B4E180 | A8K1K1_HUMAN\|A8K1K1 | J3KRP0_HUMAN\|J3KRP0    | # noqa E501 W605

    Attributes:
        target_file (str/list): Path to Target PSM result files.
        decoy_file (str/list): Path to Decoy PSM result files.
        combined_files (str/list): Path to Combined PSM result files.
        directory (str): Path to directory containing combined PSM result files.
        PSMID_INDEX (int): Index of the PSMId from the input files.
        PERC_SCORE_INDEX (int): Index of the Percolator score from the input files.
        Q_VALUE_INDEX (int): Index of the q-value from the input files.
        POSTERIOR_ERROR_PROB_INDEX (int): Index of the posterior error probability from the input files.
        PEPTIDE_INDEX (int): Index of the peptides from the input files.
        PROTEINIDS_INDEX (int): Index of the proteins from the input files.
        psms (list): List of [Psm][pyproteininference.physical.Psm] objects.

    """

    PSMID_INDEX = 0
    PERC_SCORE_INDEX = 1
    Q_VALUE_INDEX = 2
    POSTERIOR_ERROR_PROB_INDEX = 3
    PEPTIDE_INDEX = 4
    PROTEINIDS_INDEX = 5

    def __init__(
        self,
        digest,
        parameter_file_object,
        append_alt_from_db=True,
        target_file=None,
        decoy_file=None,
        combined_files=None,
        directory=None,
        top_hit_per_psm_only=False,
    ):
        """

        Args:
            digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
            parameter_file_object (ProteinInferenceParameter):
                [ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter].
            append_alt_from_db (bool): Whether or not to append alternative proteins found in the database that
                are not in the input files.
            target_file (str/list): Path to Target PSM result files.
            decoy_file (str/list): Path to Decoy PSM result files.
            combined_files (str/list): Path to Combined PSM result files.
            directory (str): Path to directory containing combined PSM result files.
            top_hit_per_psm_only (bool): If True, only include top hit for each PSM.

        Returns:
            Reader: [Reader][pyproteininference.reader.Reader] object.

        Example:
            >>> pyproteininference.reader.PercolatorReader(target_file = "example_target.txt",
            >>>     decoy_file = "example_decoy.txt", digest=digest,parameter_file_object=pi_params)
        """
        self.target_file = target_file
        self.decoy_file = decoy_file
        self.combined_files = combined_files
        self.directory = directory
        # Define Indices based on input

        self.top_hit_per_psm_only = top_hit_per_psm_only

        self.psms = None
        self.search_id = None
        self.digest = digest
        self.initial_protein_peptide_map = copy.copy(self.digest.protein_to_peptide_dictionary)
        self.append_alt_from_db = append_alt_from_db

        self.parameter_file_object = parameter_file_object

        self._validate_input()

    def read_psms(self):
        """
        Method to read psms from the input files and to transform them into a list of
        [Psm][pyproteininference.physical.Psm] objects.

        This method sets the `psms` variable, which is a list of Psm objects.

        This method must be run before initializing [DataStore object][pyproteininference.datastore.DataStore].

        Example:
            >>> reader = pyproteininference.reader.PercolatorReader(target_file = "example_target.txt",
            >>>     decoy_file = "example_decoy.txt",
            >>>     digest=digest, parameter_file_object=pi_params)
            >>> reader.read_psms()

        """
        # Read in and split by line
        if self.target_file and self.decoy_file:
            # If target_file is a list... read them all in and concatenate...
            if isinstance(self.target_file, (list,)):
                all_target = []
                for t_files in self.target_file:
                    logger.info(t_files)
                    ptarg = []
                    with open(t_files, "r") as perc_target_file:
                        spamreader = csv.reader(perc_target_file, delimiter="\t")
                        for row in spamreader:
                            ptarg.append(row)
                    del ptarg[0]
                    all_target = all_target + ptarg
            elif self.target_file:
                # If not just read the file...
                ptarg = []
                with open(self.target_file, "r") as perc_target_file:
                    spamreader = csv.reader(perc_target_file, delimiter="\t")
                    for row in spamreader:
                        ptarg.append(row)
                del ptarg[0]
                all_target = ptarg

            # Repeat for decoy file
            if isinstance(self.decoy_file, (list,)):
                all_decoy = []
                for d_files in self.decoy_file:
                    logger.info(d_files)
                    pdec = []
                    with open(d_files, "r") as perc_decoy_file:
                        spamreader = csv.reader(perc_decoy_file, delimiter="\t")
                        for row in spamreader:
                            pdec.append(row)
                    del pdec[0]
                    all_decoy = all_decoy + pdec
            elif self.decoy_file:
                pdec = []
                with open(self.decoy_file, "r") as perc_decoy_file:
                    spamreader = csv.reader(perc_decoy_file, delimiter="\t")
                    for row in spamreader:
                        pdec.append(row)
                del pdec[0]
                all_decoy = pdec

            # Combine the lists
            perc_all = all_target + all_decoy

        elif self.combined_files:
            if isinstance(self.combined_files, (list,)):
                all = []
                for f in self.combined_files:
                    logger.info(f)
                    combined_psm_result_rows = []
                    with open(f, "r") as perc_files:
                        spamreader = csv.reader(perc_files, delimiter="\t")
                        for row in spamreader:
                            combined_psm_result_rows.append(row)
                    del combined_psm_result_rows[0]
                    all = all + combined_psm_result_rows
            elif self.combined_files:
                # If not just read the file...
                combined_psm_result_rows = []
                with open(self.combined_files, "r") as perc_files:
                    spamreader = csv.reader(perc_files, delimiter="\t")
                    for row in spamreader:
                        combined_psm_result_rows.append(row)
                del combined_psm_result_rows[0]
                all = combined_psm_result_rows
            perc_all = all

        elif self.directory:

            all_files = os.listdir(self.directory)
            all = []
            for files in all_files:
                logger.info(files)
                combined_psm_result_rows = []
                with open(os.path.join(self.directory, files), "r") as perc_file:
                    spamreader = csv.reader(perc_file, delimiter="\t")
                    for row in spamreader:
                        combined_psm_result_rows.append(row)
                del combined_psm_result_rows[0]
                all = all + combined_psm_result_rows
            perc_all = all

        peptide_to_protein_dictionary = self.digest.peptide_to_protein_dictionary

        perc_all_filtered = []
        for psms in perc_all:
            try:
                float(psms[self.POSTERIOR_ERROR_PROB_INDEX])
                perc_all_filtered.append(psms)
            except ValueError:
                pass

        # Filter by pep
        perc_all = sorted(
            perc_all_filtered,
            key=lambda x: float(x[self.POSTERIOR_ERROR_PROB_INDEX]),
            reverse=False,
        )

        logger.info("Number of PSMs in the input data: {}".format(len(perc_all)))
        if self.top_hit_per_psm_only:
            logger.info("Filtering to only top hit per PSM")
            psm_ids = set()
            all_psms_filtered = []
            for psm in perc_all:
                if psm[self.PSMID_INDEX] not in psm_ids:
                    psm_ids.add(psm[self.PSMID_INDEX])
                    all_psms_filtered.append(psm)
            perc_all = all_psms_filtered
            logger.info("Number of PSMs after filtering to top hit per PSM: {}".format(len(perc_all)))

        list_of_psm_objects = []
        peptide_tracker = set()
        all_sp_proteins = set(self.digest.swiss_prot_protein_set)
        # We only want to get unique peptides... using all messes up scoring...
        # Create Psm objects with the identifier, percscore, qvalue, pepvalue, and possible proteins...

        initial_poss_prots = []
        logger.info("Length of PSM Data: {}".format(len(perc_all)))
        for psm_info in perc_all:
            current_peptide = psm_info[self.PEPTIDE_INDEX]
            # Define the Psm...
            if current_peptide not in peptide_tracker:
                combined_psm_result_rows = Psm(identifier=current_peptide)
                # Add all the attributes
                combined_psm_result_rows.percscore = float(psm_info[self.PERC_SCORE_INDEX])
                combined_psm_result_rows.qvalue = float(psm_info[self.Q_VALUE_INDEX])
                combined_psm_result_rows.pepvalue = float(psm_info[self.POSTERIOR_ERROR_PROB_INDEX])
                if self.parameter_file_object.inference_type == Inference.FIRST_PROTEIN:
                    poss_proteins = [psm_info[self.PROTEINIDS_INDEX]]
                else:
                    poss_proteins = sorted(list(set(psm_info[self.PROTEINIDS_INDEX :])))  # noqa E203
                    poss_proteins = poss_proteins[: self.parameter_file_object.max_allowed_alternative_proteins]
                combined_psm_result_rows.possible_proteins = poss_proteins  # Restricted to the configured maximum above
                combined_psm_result_rows.psm_id = psm_info[self.PSMID_INDEX]
                input_poss_prots = copy.copy(poss_proteins)

                # Split peptide if flanking
                current_peptide = Psm.split_peptide(peptide_string=current_peptide)

                if not current_peptide.isupper() or not current_peptide.isalpha():
                    # If we have mods remove them...
                    peptide_string = current_peptide.upper()
                    stripped_peptide = Psm.remove_peptide_mods(peptide_string)
                    current_peptide = stripped_peptide

                # Add the other possible_proteins from insilicodigest here...
                try:
                    current_alt_proteins = sorted(
                        list(peptide_to_protein_dictionary[current_peptide])
                    )  # This peptide needs to be scrubbed of Mods...
                except KeyError:
                    current_alt_proteins = []
                    logger.debug(
                        "Peptide {} was not found in the supplied DB with the following proteins {}".format(
                            current_peptide,
                            ";".join(combined_psm_result_rows.possible_proteins),
                        )
                    )
                    for poss_prot in combined_psm_result_rows.possible_proteins:
                        self.digest.peptide_to_protein_dictionary.setdefault(current_peptide, set()).add(poss_prot)
                        self.digest.protein_to_peptide_dictionary.setdefault(poss_prot, set()).add(current_peptide)
                        logger.debug(
                            "Adding Peptide {} and Protein {} to Digest dictionaries".format(current_peptide, poss_prot)
                        )

                # Sort Alt Proteins by Swissprot then Trembl...
                identifiers_sorted = DataStore.sort_protein_strings(
                    protein_string_list=current_alt_proteins,
                    sp_proteins=all_sp_proteins,
                    decoy_symbol=self.parameter_file_object.decoy_symbol,
                )

                # Restrict to 50 possible proteins
                combined_psm_result_rows = self._fix_alternative_proteins(
                    append_alt_from_db=self.append_alt_from_db,
                    identifiers_sorted=identifiers_sorted,
                    max_proteins=self.parameter_file_object.max_allowed_alternative_proteins,
                    psm=combined_psm_result_rows,
                    parameter_file_object=self.parameter_file_object,
                )

                # Remove blank alt proteins
                combined_psm_result_rows.possible_proteins = [
                    x for x in combined_psm_result_rows.possible_proteins if x != ""
                ]

                list_of_psm_objects.append(combined_psm_result_rows)
                peptide_tracker.add(current_peptide)

                initial_poss_prots.append(input_poss_prots)

        self.psms = list_of_psm_objects

        self._check_initial_database_overlap(
            initial_possible_proteins=initial_poss_prots, initial_protein_peptide_map=self.initial_protein_peptide_map
        )

        logger.info("Length of PSM Data: {}".format(len(self.psms)))

__init__(digest, parameter_file_object, append_alt_from_db=True, target_file=None, decoy_file=None, combined_files=None, directory=None, top_hit_per_psm_only=False)

Parameters:
  • digest (Digest) –
  • parameter_file_object (ProteinInferenceParameter) –
  • append_alt_from_db (bool, default: True ) –

    Whether or not to append alternative proteins found in the database that are not in the input files.

  • target_file (str / list, default: None ) –

    Path to Target PSM result files.

  • decoy_file (str / list, default: None ) –

    Path to Decoy PSM result files.

  • combined_files (str / list, default: None ) –

    Path to Combined PSM result files.

  • directory (str, default: None ) –

    Path to directory containing combined PSM result files.

  • top_hit_per_psm_only (bool, default: False ) –

    If True, only include top hit for each PSM.

Returns:
  • Reader –

    Reader object.

Example

pyproteininference.reader.PercolatorReader(target_file="example_target.txt",
    decoy_file="example_decoy.txt", digest=digest, parameter_file_object=pi_params)

Source code in pyproteininference/reader.py
def __init__(
    self,
    digest,
    parameter_file_object,
    append_alt_from_db=True,
    target_file=None,
    decoy_file=None,
    combined_files=None,
    directory=None,
    top_hit_per_psm_only=False,
):
    """

    Args:
        digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
        parameter_file_object (ProteinInferenceParameter):
            [ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter].
        append_alt_from_db (bool): Whether or not to append alternative proteins found in the database that
            are not in the input files.
        target_file (str/list): Path to Target PSM result files.
        decoy_file (str/list): Path to Decoy PSM result files.
        combined_files (str/list): Path to Combined PSM result files.
        directory (str): Path to directory containing combined PSM result files.
        top_hit_per_psm_only (bool): If True, only include top hit for each PSM.

    Returns:
        Reader: [Reader][pyproteininference.reader.Reader] object.

    Example:
        >>> pyproteininference.reader.PercolatorReader(target_file = "example_target.txt",
        >>>     decoy_file = "example_decoy.txt", digest=digest,parameter_file_object=pi_params)
    """
    self.target_file = target_file
    self.decoy_file = decoy_file
    self.combined_files = combined_files
    self.directory = directory
    # Define indices based on input

    self.top_hit_per_psm_only = top_hit_per_psm_only

    self.psms = None
    self.search_id = None
    self.digest = digest
    self.initial_protein_peptide_map = copy.copy(self.digest.protein_to_peptide_dictionary)
    self.append_alt_from_db = append_alt_from_db

    self.parameter_file_object = parameter_file_object

    self._validate_input()

read_psms()

Reads PSMs from the input files and transforms them into a list of Psm objects.

This method sets the psms attribute, which is a list of Psm objects.

This method must be run before initializing a DataStore object.

Example

reader = pyproteininference.reader.PercolatorReader(target_file="example_target.txt",
    decoy_file="example_decoy.txt", digest=digest, parameter_file_object=pi_params)
reader.read_psms()
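Internally, read_psms parses each input file as tab-delimited text and drops the header row before building Psm objects. A standalone sketch of that parsing step (the column layout here is illustrative of the Percolator format, not taken from the library):

```python
import csv
import io

# Simulated Percolator-style TSV: a header row followed by one PSM row.
tsv = (
    "PSMId\tscore\tq-value\tPEP\tpeptide\tproteinIds\n"
    "psm1\t1.2\t0.001\t0.0001\tK.PEPTIDER.G\tsp|P12345|PROT_HUMAN\n"
)
rows = list(csv.reader(io.StringIO(tsv), delimiter="\t"))
del rows[0]  # drop the header row, as the source below does with `del ptarg[0]`
```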

Source code in pyproteininference/reader.py
def read_psms(self):
    """
    Method to read psms from the input files and to transform them into a list of
    [Psm][pyproteininference.physical.Psm] objects.

    This method sets the `psms` variable, which is a list of Psm objects.

    This method must be run before initializing a [DataStore object][pyproteininference.datastore.DataStore].

    Example:
        >>> reader = pyproteininference.reader.PercolatorReader(target_file = "example_target.txt",
        >>>     decoy_file = "example_decoy.txt",
        >>>     digest=digest, parameter_file_object=pi_params)
        >>> reader.read_psms()

    """
    # Read in and split by line
    if self.target_file and self.decoy_file:
        # If target_file is a list... read them all in and concatenate...
        if isinstance(self.target_file, (list,)):
            all_target = []
            for t_files in self.target_file:
                logger.info(t_files)
                ptarg = []
                with open(t_files, "r") as perc_target_file:
                    spamreader = csv.reader(perc_target_file, delimiter="\t")
                    for row in spamreader:
                        ptarg.append(row)
                del ptarg[0]
                all_target = all_target + ptarg
        elif self.target_file:
            # If not just read the file...
            ptarg = []
            with open(self.target_file, "r") as perc_target_file:
                spamreader = csv.reader(perc_target_file, delimiter="\t")
                for row in spamreader:
                    ptarg.append(row)
            del ptarg[0]
            all_target = ptarg

        # Repeat for decoy file
        if isinstance(self.decoy_file, (list,)):
            all_decoy = []
            for d_files in self.decoy_file:
                logger.info(d_files)
                pdec = []
                with open(d_files, "r") as perc_decoy_file:
                    spamreader = csv.reader(perc_decoy_file, delimiter="\t")
                    for row in spamreader:
                        pdec.append(row)
                del pdec[0]
                all_decoy = all_decoy + pdec
        elif self.decoy_file:
            pdec = []
            with open(self.decoy_file, "r") as perc_decoy_file:
                spamreader = csv.reader(perc_decoy_file, delimiter="\t")
                for row in spamreader:
                    pdec.append(row)
            del pdec[0]
            all_decoy = pdec

        # Combine the lists
        perc_all = all_target + all_decoy

    elif self.combined_files:
        if isinstance(self.combined_files, (list,)):
            all = []
            for f in self.combined_files:
                logger.info(f)
                combined_psm_result_rows = []
                with open(f, "r") as perc_files:
                    spamreader = csv.reader(perc_files, delimiter="\t")
                    for row in spamreader:
                        combined_psm_result_rows.append(row)
                del combined_psm_result_rows[0]
                all = all + combined_psm_result_rows
        elif self.combined_files:
            # If not just read the file...
            combined_psm_result_rows = []
            with open(self.combined_files, "r") as perc_files:
                spamreader = csv.reader(perc_files, delimiter="\t")
                for row in spamreader:
                    combined_psm_result_rows.append(row)
            del combined_psm_result_rows[0]
            all = combined_psm_result_rows
        perc_all = all

    elif self.directory:

        all_files = os.listdir(self.directory)
        all = []
        for files in all_files:
            logger.info(files)
            combined_psm_result_rows = []
            with open(os.path.join(self.directory, files), "r") as perc_file:
                spamreader = csv.reader(perc_file, delimiter="\t")
                for row in spamreader:
                    combined_psm_result_rows.append(row)
            del combined_psm_result_rows[0]
            all = all + combined_psm_result_rows
        perc_all = all

    peptide_to_protein_dictionary = self.digest.peptide_to_protein_dictionary

    perc_all_filtered = []
    for psms in perc_all:
        try:
            float(psms[self.POSTERIOR_ERROR_PROB_INDEX])
            perc_all_filtered.append(psms)
        except ValueError as e:  # noqa F841
            pass

    # Filter by pep
    perc_all = sorted(
        perc_all_filtered,
        key=lambda x: float(x[self.POSTERIOR_ERROR_PROB_INDEX]),
        reverse=False,
    )

    logger.info("Number of PSMs in the input data: {}".format(len(perc_all)))
    if self.top_hit_per_psm_only:
        logger.info("Filtering to only top hit per PSM")
        psm_ids = set()
        all_psms_filtered = []
        for psm in perc_all:
            if psm[self.PSMID_INDEX] not in psm_ids:
                psm_ids.add(psm[self.PSMID_INDEX])
                all_psms_filtered.append(psm)
        perc_all = all_psms_filtered
        logger.info("Number of PSMs after filtering to top hit per PSM: {}".format(len(perc_all)))

    list_of_psm_objects = []
    peptide_tracker = set()
    all_sp_proteins = set(self.digest.swiss_prot_protein_set)
    # We only want to get unique peptides... using all messes up scoring...
    # Create Psm objects with the identifier, percscore, qvalue, pepvalue, and possible proteins...

    initial_poss_prots = []
    logger.info("Length of PSM Data: {}".format(len(perc_all)))
    for psm_info in perc_all:
        current_peptide = psm_info[self.PEPTIDE_INDEX]
        # Define the Psm...
        if current_peptide not in peptide_tracker:
            combined_psm_result_rows = Psm(identifier=current_peptide)
            # Add all the attributes
            combined_psm_result_rows.percscore = float(psm_info[self.PERC_SCORE_INDEX])
            combined_psm_result_rows.qvalue = float(psm_info[self.Q_VALUE_INDEX])
            combined_psm_result_rows.pepvalue = float(psm_info[self.POSTERIOR_ERROR_PROB_INDEX])
            if self.parameter_file_object.inference_type == Inference.FIRST_PROTEIN:
                poss_proteins = [psm_info[self.PROTEINIDS_INDEX]]
            else:
                poss_proteins = sorted(list(set(psm_info[self.PROTEINIDS_INDEX :])))  # noqa E203
                poss_proteins = poss_proteins[: self.parameter_file_object.max_allowed_alternative_proteins]
            combined_psm_result_rows.possible_proteins = poss_proteins  # Restrict to 50 total possible proteins...
            combined_psm_result_rows.psm_id = psm_info[self.PSMID_INDEX]
            input_poss_prots = copy.copy(poss_proteins)

            # Split peptide if flanking
            current_peptide = Psm.split_peptide(peptide_string=current_peptide)

            if not current_peptide.isupper() or not current_peptide.isalpha():
                # If we have mods remove them...
                peptide_string = current_peptide.upper()
                stripped_peptide = Psm.remove_peptide_mods(peptide_string)
                current_peptide = stripped_peptide

            # Add the other possible_proteins from insilicodigest here...
            try:
                current_alt_proteins = sorted(
                    list(peptide_to_protein_dictionary[current_peptide])
                )  # This peptide needs to be scrubbed of Mods...
            except KeyError:
                current_alt_proteins = []
                logger.debug(
                    "Peptide {} was not found in the supplied DB with the following proteins {}".format(
                        current_peptide,
                        ";".join(combined_psm_result_rows.possible_proteins),
                    )
                )
                for poss_prot in combined_psm_result_rows.possible_proteins:
                    self.digest.peptide_to_protein_dictionary.setdefault(current_peptide, set()).add(poss_prot)
                    self.digest.protein_to_peptide_dictionary.setdefault(poss_prot, set()).add(current_peptide)
                    logger.debug(
                        "Adding Peptide {} and Protein {} to Digest dictionaries".format(current_peptide, poss_prot)
                    )

            # Sort Alt Proteins by Swissprot then Trembl...
            identifiers_sorted = DataStore.sort_protein_strings(
                protein_string_list=current_alt_proteins,
                sp_proteins=all_sp_proteins,
                decoy_symbol=self.parameter_file_object.decoy_symbol,
            )

            # Restrict to 50 possible proteins
            combined_psm_result_rows = self._fix_alternative_proteins(
                append_alt_from_db=self.append_alt_from_db,
                identifiers_sorted=identifiers_sorted,
                max_proteins=self.parameter_file_object.max_allowed_alternative_proteins,
                psm=combined_psm_result_rows,
                parameter_file_object=self.parameter_file_object,
            )

            # Remove blank alt proteins
            combined_psm_result_rows.possible_proteins = [
                x for x in combined_psm_result_rows.possible_proteins if x != ""
            ]

            list_of_psm_objects.append(combined_psm_result_rows)
            peptide_tracker.add(current_peptide)

            initial_poss_prots.append(input_poss_prots)

    self.psms = list_of_psm_objects

    self._check_initial_database_overlap(
        initial_possible_proteins=initial_poss_prots, initial_protein_peptide_map=self.initial_protein_peptide_map
    )

    logger.info("Length of PSM Data: {}".format(len(self.psms)))

ProteologicPostSearchReader

Bases: Reader

This class reads from a post-processing Proteologic logical object.

Attributes:
  • proteologic_object (list) –

    List of proteologic post search objects.

  • search_id (int) –

    Search ID or Search IDs associated with the data.

  • postsearch_id (int) –

    PostSearch ID or PostSearch IDs associated with the data.

  • digest (Digest) –
  • parameter_file_object (ProteinInferenceParameter) –
  • append_alt_from_db (bool) –

    Whether or not to append alternative proteins found in the database that are not in the input files.

Source code in pyproteininference/reader.py
class ProteologicPostSearchReader(Reader):
    """
    This class reads from a post-processing Proteologic logical object.

    Attributes:
        proteologic_object (list): List of proteologic post search objects.
        search_id (int): Search ID or Search IDs associated with the data.
        postsearch_id (int): PostSearch ID or PostSearch IDs associated with the data.
        digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
        parameter_file_object (ProteinInferenceParameter):
            [ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter] object.
        append_alt_from_db (bool): Whether or not to append alternative proteins found in the database
            that are not in the input files.

    """

    def __init__(
        self,
        proteologic_object,
        search_id,
        postsearch_id,
        digest,
        parameter_file_object,
        append_alt_from_db=True,
        top_hit_per_psm_only=False,
    ):
        """

        Args:
            proteologic_object (list): List of proteologic post search objects.
            search_id (int): Search ID or Search IDs associated with the data.
            postsearch_id: PostSearch ID or PostSearch IDs associated with the data.
            digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
            parameter_file_object (ProteinInferenceParameter):
                [ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter] object.
            append_alt_from_db (bool): Whether or not to append alternative proteins found in the database
                that are not in the input files.
            top_hit_per_psm_only (bool): If True, only include top hit for each PSM.


        Returns:
            object:
        """
        self.proteologic_object = proteologic_object
        self.search_id = search_id
        self.postsearch_id = postsearch_id

        self.psms = None
        self.digest = digest
        self.initial_protein_peptide_map = copy.copy(self.digest.protein_to_peptide_dictionary)
        self.append_alt_from_db = append_alt_from_db

        self.top_hit_per_psm_only = top_hit_per_psm_only

        self.parameter_file_object = parameter_file_object

    def read_psms(self):
        """
        Method to read psms from the input files and to transform them into a list of
        [Psm][pyproteininference.physical.Psm] objects.

        This method sets the `psms` variable, which is a list of Psm objects.

        This method must be run before initializing a [DataStore object][pyproteininference.datastore.DataStore].

        """
        logger.info("Reading in data from Proteologic...")
        if isinstance(self.proteologic_object, (list,)):
            list_of_psms = []
            for p_objs in self.proteologic_object:
                for psms in p_objs.physical_object.psm_sets:
                    list_of_psms.append(psms)
        else:
            list_of_psms = self.proteologic_object.physical_object.psm_sets

        # Sort this by posterior error prob...
        list_of_psms = sorted(list_of_psms, key=lambda x: float(x.psm_filter.pepvalue))

        peptide_to_protein_dictionary = self.digest.peptide_to_protein_dictionary

        list_of_psm_objects = []
        peptide_tracker = set()
        all_sp_proteins = set(self.digest.swiss_prot_protein_set)
        # Peptide tracker is used because we only want UNIQUE peptides...
        # The data is sorted by percolator score... or at least it should be...
        # Or sorted by posterior error probability

        initial_poss_prots = []
        for peps in list_of_psms:
            current_peptide = peps.peptide.sequence
            # Define the Psm...
            if current_peptide not in peptide_tracker:
                p = Psm(identifier=current_peptide)
                # Add all the attributes
                p.percscore = float(0)  # Will be stored in table in future I think...
                p.qvalue = float(peps.psm_filter.q_value)
                p.pepvalue = float(peps.psm_filter.pepvalue)
                if peps.peptide.protein not in peps.alternative_proteins:
                    p.possible_proteins = [peps.peptide.protein] + peps.alternative_proteins
                else:
                    p.possible_proteins = peps.alternative_proteins

                p.possible_proteins = list(filter(None, p.possible_proteins))
                input_poss_prots = copy.copy(p.possible_proteins)
                p.psm_id = peps.spectrum.spectrum_identifier

                # Split peptide if flanking
                current_peptide = Psm.split_peptide(peptide_string=current_peptide)

                if not current_peptide.isupper() or not current_peptide.isalpha():
                    # If we have mods remove them...
                    peptide_string = current_peptide.upper()
                    stripped_peptide = Psm.remove_peptide_mods(peptide_string)
                    current_peptide = stripped_peptide

                # Add the other possible_proteins from insilicodigest here...
                try:
                    current_alt_proteins = sorted(
                        list(peptide_to_protein_dictionary[current_peptide])
                    )  # This peptide needs to be scrubbed of Mods...
                except KeyError:
                    current_alt_proteins = []
                    logger.debug(
                        "Peptide {} was not found in the supplied DB with the following proteins {}".format(
                            current_peptide, ";".join(p.possible_proteins)
                        )
                    )
                    for poss_prot in p.possible_proteins:
                        self.digest.peptide_to_protein_dictionary.setdefault(current_peptide, set()).add(poss_prot)
                        self.digest.protein_to_peptide_dictionary.setdefault(poss_prot, set()).add(current_peptide)
                        logger.debug(
                            "Adding Peptide {} and Protein {} to Digest dictionaries".format(current_peptide, poss_prot)
                        )

                # Sort Alt Proteins by Swissprot then Trembl...
                identifiers_sorted = DataStore.sort_protein_strings(
                    protein_string_list=current_alt_proteins,
                    sp_proteins=all_sp_proteins,
                    decoy_symbol=self.parameter_file_object.decoy_symbol,
                )

                # Restrict to 50 possible proteins... and append alt proteins from db
                p = self._fix_alternative_proteins(
                    append_alt_from_db=self.append_alt_from_db,
                    identifiers_sorted=identifiers_sorted,
                    max_proteins=self.parameter_file_object.max_allowed_alternative_proteins,
                    psm=p,
                    parameter_file_object=self.parameter_file_object,
                )

                list_of_psm_objects.append(p)
                peptide_tracker.add(current_peptide)

                initial_poss_prots.append(input_poss_prots)

        self.psms = list_of_psm_objects

        self._check_initial_database_overlap(
            initial_possible_proteins=initial_poss_prots, initial_protein_peptide_map=self.initial_protein_peptide_map
        )

        logger.info("Finished reading in data from Proteologic...")

__init__(proteologic_object, search_id, postsearch_id, digest, parameter_file_object, append_alt_from_db=True, top_hit_per_psm_only=False)

Parameters:
  • proteologic_object (list) –

    List of proteologic post search objects.

  • search_id (int) –

    Search ID or Search IDs associated with the data.

  • postsearch_id

    PostSearch ID or PostSearch IDs associated with the data.

  • digest (Digest) –
  • parameter_file_object (ProteinInferenceParameter) –
  • append_alt_from_db (bool, default: True ) –

    Whether or not to append alternative proteins found in the database that are not in the input files.

  • top_hit_per_psm_only (bool, default: False ) –

    If True, only include top hit for each PSM.

Returns:
  • object
Source code in pyproteininference/reader.py
def __init__(
    self,
    proteologic_object,
    search_id,
    postsearch_id,
    digest,
    parameter_file_object,
    append_alt_from_db=True,
    top_hit_per_psm_only=False,
):
    """

    Args:
        proteologic_object (list): List of proteologic post search objects.
        search_id (int): Search ID or Search IDs associated with the data.
        postsearch_id: PostSearch ID or PostSearch IDs associated with the data.
        digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
        parameter_file_object (ProteinInferenceParameter):
            [ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter] object.
        append_alt_from_db (bool): Whether or not to append alternative proteins found in the database
            that are not in the input files.
        top_hit_per_psm_only (bool): If True, only include top hit for each PSM.


    Returns:
        object:
    """
    self.proteologic_object = proteologic_object
    self.search_id = search_id
    self.postsearch_id = postsearch_id

    self.psms = None
    self.digest = digest
    self.initial_protein_peptide_map = copy.copy(self.digest.protein_to_peptide_dictionary)
    self.append_alt_from_db = append_alt_from_db

    self.top_hit_per_psm_only = top_hit_per_psm_only

    self.parameter_file_object = parameter_file_object

read_psms()

Reads PSMs from the input files and transforms them into a list of Psm objects.

This method sets the psms attribute, which is a list of Psm objects.

This method must be run before initializing a DataStore object.
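When assembling `possible_proteins`, the Proteologic reader prepends the top-ranked protein unless it already appears among the alternatives, then drops empty entries. A standalone sketch of that assembly step (illustrative values, not library data):

```python
# Top-ranked protein plus its reported alternatives (one entry is blank).
protein = "P12345"
alternatives = ["Q67890", "", "P12345"]

# Prepend the top hit only if it is not already listed as an alternative.
if protein not in alternatives:
    possible = [protein] + alternatives
else:
    possible = alternatives

# Drop empty protein identifiers, as `list(filter(None, ...))` does below.
possible = list(filter(None, possible))
```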

Source code in pyproteininference/reader.py
def read_psms(self):
    """
    Method to read psms from the input files and to transform them into a list of
    [Psm][pyproteininference.physical.Psm] objects.

    This method sets the `psms` variable, which is a list of Psm objects.

    This method must be run before initializing a [DataStore object][pyproteininference.datastore.DataStore].

    """
    logger.info("Reading in data from Proteologic...")
    if isinstance(self.proteologic_object, (list,)):
        list_of_psms = []
        for p_objs in self.proteologic_object:
            for psms in p_objs.physical_object.psm_sets:
                list_of_psms.append(psms)
    else:
        list_of_psms = self.proteologic_object.physical_object.psm_sets

    # Sort this by posterior error prob...
    list_of_psms = sorted(list_of_psms, key=lambda x: float(x.psm_filter.pepvalue))

    peptide_to_protein_dictionary = self.digest.peptide_to_protein_dictionary

    list_of_psm_objects = []
    peptide_tracker = set()
    all_sp_proteins = set(self.digest.swiss_prot_protein_set)
    # Peptide tracker is used because we only want UNIQUE peptides...
    # The data is sorted by percolator score... or at least it should be...
    # Or sorted by posterior error probability

    initial_poss_prots = []
    for peps in list_of_psms:
        current_peptide = peps.peptide.sequence
        # Define the Psm...
        if current_peptide not in peptide_tracker:
            p = Psm(identifier=current_peptide)
            # Add all the attributes
            p.percscore = float(0)  # Will be stored in table in future I think...
            p.qvalue = float(peps.psm_filter.q_value)
            p.pepvalue = float(peps.psm_filter.pepvalue)
            if peps.peptide.protein not in peps.alternative_proteins:
                p.possible_proteins = [peps.peptide.protein] + peps.alternative_proteins
            else:
                p.possible_proteins = peps.alternative_proteins

            p.possible_proteins = list(filter(None, p.possible_proteins))
            input_poss_prots = copy.copy(p.possible_proteins)
            p.psm_id = peps.spectrum.spectrum_identifier

            # Split peptide if flanking
            current_peptide = Psm.split_peptide(peptide_string=current_peptide)

            if not current_peptide.isupper() or not current_peptide.isalpha():
                # If we have mods remove them...
                peptide_string = current_peptide.upper()
                stripped_peptide = Psm.remove_peptide_mods(peptide_string)
                current_peptide = stripped_peptide

            # Add the other possible_proteins from insilicodigest here...
            try:
                current_alt_proteins = sorted(
                    list(peptide_to_protein_dictionary[current_peptide])
                )  # This peptide needs to be scrubbed of Mods...
            except KeyError:
                current_alt_proteins = []
                logger.debug(
                    "Peptide {} was not found in the supplied DB with the following proteins {}".format(
                        current_peptide, ";".join(p.possible_proteins)
                    )
                )
                for poss_prot in p.possible_proteins:
                    self.digest.peptide_to_protein_dictionary.setdefault(current_peptide, set()).add(poss_prot)
                    self.digest.protein_to_peptide_dictionary.setdefault(poss_prot, set()).add(current_peptide)
                    logger.debug(
                        "Adding Peptide {} and Protein {} to Digest dictionaries".format(current_peptide, poss_prot)
                    )

            # Sort Alt Proteins by Swissprot then Trembl...
            identifiers_sorted = DataStore.sort_protein_strings(
                protein_string_list=current_alt_proteins,
                sp_proteins=all_sp_proteins,
                decoy_symbol=self.parameter_file_object.decoy_symbol,
            )

            # Restrict to 50 possible proteins... and append alt proteins from db
            p = self._fix_alternative_proteins(
                append_alt_from_db=self.append_alt_from_db,
                identifiers_sorted=identifiers_sorted,
                max_proteins=self.parameter_file_object.max_allowed_alternative_proteins,
                psm=p,
                parameter_file_object=self.parameter_file_object,
            )

            list_of_psm_objects.append(p)
            peptide_tracker.add(current_peptide)

            initial_poss_prots.append(input_poss_prots)

    self.psms = list_of_psm_objects

    self._check_initial_database_overlap(
        initial_possible_proteins=initial_poss_prots, initial_protein_peptide_map=self.initial_protein_peptide_map
    )

    logger.info("Finished reading in data from Proteologic...")

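The alternative-protein handling above relies on `DataStore.sort_protein_strings`, which orders identifiers as SwissProt targets, then SwissProt decoys, TrEMBL targets, and TrEMBL decoys. A standalone sketch of that four-bucket ordering (the function below is a mimic for illustration, not the library implementation):

```python
def sort_protein_strings(protein_string_list, sp_proteins, decoy_symbol):
    # Mimic of DataStore.sort_protein_strings: SwissProt targets first,
    # then SwissProt decoys, TrEMBL targets, and TrEMBL decoys, with each
    # bucket sorted alphabetically.
    sp = set(sp_proteins)
    sp_targets = sorted(p for p in protein_string_list if p in sp and decoy_symbol not in p)
    sp_decoys = sorted(p for p in protein_string_list if p in sp and decoy_symbol in p)
    tr_targets = sorted(p for p in protein_string_list if p not in sp and decoy_symbol not in p)
    tr_decoys = sorted(p for p in protein_string_list if p not in sp and decoy_symbol in p)
    return sp_targets + sp_decoys + tr_targets + tr_decoys

# Hypothetical identifiers; "##" is the decoy symbol from the parameter file.
proteins = ["TR_B", "##SP_A", "SP_B", "SP_A", "##TR_A"]
print(sort_protein_strings(proteins, sp_proteins={"SP_A", "SP_B", "##SP_A"}, decoy_symbol="##"))
# ['SP_A', 'SP_B', '##SP_A', 'TR_B', '##TR_A']
```

This ordering is what lets the pipeline prefer reviewed (SwissProt) identifiers when truncating the alternative-protein list.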
Reader

Bases: object

Main Reader Class which is parent to all reader subclasses.

Attributes:
  • target_file (str / list) –

    Path to Target PSM result files.

  • decoy_file (str / list) –

    Path to Decoy PSM result files.

  • combined_files (str / list) –

    Path to Combined PSM result files.

  • directory (str) –

    Path to directory containing combined PSM result files.

Source code in pyproteininference/reader.py
class Reader(object):
    """
    Main Reader Class which is parent to all reader subclasses.

    Attributes:
        target_file (str/list): Path to Target PSM result files.
        decoy_file (str/list): Path to Decoy PSM result files.
        combined_files (str/list): Path to Combined PSM result files.
        directory (str): Path to directory containing combined PSM result files.

    """

    def __init__(
        self, target_file=None, decoy_file=None, combined_files=None, directory=None, top_hit_per_psm_only=False
    ):
        """

        Args:
            target_file (str/list): Path to Target PSM result files.
            decoy_file (str/list): Path to Decoy PSM result files.
            combined_files (str/list): Path to Combined PSM result files.
            directory (str): Path to directory containing combined PSM result files.
            top_hit_per_psm_only (bool): If True, only include top hit for each PSM.

        """
        self.target_file = target_file
        self.decoy_file = decoy_file
        self.combined_files = combined_files
        self.directory = directory
        self.top_hit_per_psm_only = top_hit_per_psm_only

    def get_alternative_proteins_from_input(self, row):
        """
        Method to get the alternative proteins from the input files.

        """
        if None in row.keys():
            try:
                row["alternative_proteins"] = row.pop(None)
                # Sort the alternative proteins - when they are read in they become unsorted
                row["alternative_proteins"] = sorted(row["alternative_proteins"])
            except KeyError:
                row["alternative_proteins"] = []
        else:
            row["alternative_proteins"] = []
        return row

    def _validate_input(self):
        """
        Internal method to validate the input to Reader.

        """
        if self.target_file and self.decoy_file and not self.combined_files and not self.directory:
            logger.info("Validating input as target_file and decoy_file")
        elif self.combined_files and not self.target_file and not self.decoy_file and not self.directory:
            logger.info("Validating input as combined_files")
        elif self.directory and not self.combined_files and not self.decoy_file and not self.target_file:
            logger.info("Validating input as combined_directory")
        else:
            raise ValueError(
                "To run Protein inference please supply either: "
                "(1) either one or multiple target_files and decoy_files, "
                "(2) either one or multiple combined_files that include target and decoy data"
                "(3) a combined_directory that contains combined target/decoy files (combined_directory)"
            )

    @classmethod
    def _fix_alternative_proteins(
        cls,
        append_alt_from_db,
        identifiers_sorted,
        max_proteins,
        psm,
        parameter_file_object,
    ):
        """
        Internal method to fix the alternative proteins variable for a given
         [Psm][pyproteininference.physical.Psm] object.

        Args:
            append_alt_from_db (bool): Whether or not to append alternative proteins found in the database that are
                not in the input files.
            identifiers_sorted (list): List of sorted Protein Strings for the given Psm.
            max_proteins (int): Maximum number of proteins that a [Psm][pyproteininference.physical.Psm]
                is allowed to map to.
            psm: (Psm): [Psm][pyproteininference.physical.Psm] object of interest.
            parameter_file_object: (ProteinInferenceParameter):
                [ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter].

        Returns:
            pyproteininference.physical.Psm: [Psm][pyproteininference.physical.Psm] with alternative proteins fixed.

        """
        # If we are appending alternative proteins from the db
        if append_alt_from_db:
            # Loop over the Identifiers from the DB These are identifiers that contain the current peptide
            for alt_proteins in identifiers_sorted[:max_proteins]:
                # If the identifier is not already in possible proteins
                # and if then len of poss prot is less than the max...
                # Then append
                if alt_proteins not in psm.possible_proteins and len(psm.possible_proteins) < max_proteins:
                    psm.possible_proteins.append(alt_proteins)
        # Next if the len of possible proteins is greater than max then restrict the list length...
        if len(psm.possible_proteins) > max_proteins:
            psm.possible_proteins = [psm.possible_proteins[x] for x in range(max_proteins)]
        else:
            pass

        # If no inference only select first poss protein
        if parameter_file_object.inference_type == Inference.FIRST_PROTEIN:
            psm.possible_proteins = [psm.possible_proteins[0]]

        return psm

    def _check_initial_database_overlap(self, initial_possible_proteins, initial_protein_peptide_map):
        """
        Internal method that checks to make sure there is at least some overlap between proteins in the input files
        And the proteins in the database digestion.
        """

        if len(initial_protein_peptide_map.keys()) > 0:
            input_protein_ids_flat = set([protein for group in initial_possible_proteins for protein in group])

            digest_proteins = set(initial_protein_peptide_map.keys())

            intersection = input_protein_ids_flat.intersection(digest_proteins)

            if len(intersection) < 1:
                raise ValueError(
                    "The Intersection of Protein Identifiers between the database digest "
                    "and the input files is zero. Please consider setting id_splitting to True. "
                    "Or make sure that the identifiers in the input files and database file match. "
                    "Example Protein Identifier from input file '{}'."
                    "Example Protein Identifier from database file '{}'".format(
                        list(input_protein_ids_flat)[0], list(digest_proteins)[0]
                    )
                )
            else:
                logger.info("Number of matching proteins from database and input files: {}".format(len(intersection)))
                logger.info("Number of proteins from database file: {}".format(len(digest_proteins)))
                logger.info("Number of proteins from input files: {}".format(len(input_protein_ids_flat)))

        else:
            pass

__init__(target_file=None, decoy_file=None, combined_files=None, directory=None, top_hit_per_psm_only=False)

Parameters:
  • target_file (str / list, default: None ) –

    Path to Target PSM result files.

  • decoy_file (str / list, default: None ) –

    Path to Decoy PSM result files.

  • combined_files (str / list, default: None ) –

    Path to Combined PSM result files.

  • directory (str, default: None ) –

    Path to directory containing combined PSM result files.

  • top_hit_per_psm_only (bool, default: False ) –

    If True, only include top hit for each PSM.

Source code in pyproteininference/reader.py
def __init__(
    self, target_file=None, decoy_file=None, combined_files=None, directory=None, top_hit_per_psm_only=False
):
    """

    Args:
        target_file (str/list): Path to Target PSM result files.
        decoy_file (str/list): Path to Decoy PSM result files.
        combined_files (str/list): Path to Combined PSM result files.
        directory (str): Path to directory containing combined PSM result files.
        top_hit_per_psm_only (bool): If True, only include top hit for each PSM.

    """
    self.target_file = target_file
    self.decoy_file = decoy_file
    self.combined_files = combined_files
    self.directory = directory
    self.top_hit_per_psm_only = top_hit_per_psm_only

get_alternative_proteins_from_input(row)

Method to get the alternative proteins from the input files.

Source code in pyproteininference/reader.py
def get_alternative_proteins_from_input(self, row):
    """
    Method to get the alternative proteins from the input files.

    """
    if None in row.keys():
        try:
            row["alternative_proteins"] = row.pop(None)
            # Sort the alternative proteins - when they are read in they become unsorted
            row["alternative_proteins"] = sorted(row["alternative_proteins"])
        except KeyError:
            row["alternative_proteins"] = []
    else:
        row["alternative_proteins"] = []
    return row
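The `None` key handled above comes from `csv.DictReader`: when a row has more fields than there are headers, the overflow values are collected in a list under the `restkey`, which defaults to `None`. A minimal illustration with hypothetical column names:

```python
import csv
import io

# A PSM row with more columns than headers: csv.DictReader collects the
# extra fields in a list under the restkey (None by default), which is how
# alternative proteins arrive from a tab-delimited search result.
data = "psmid\tscore\tprotein\npsm_1\t0.01\tPROT_B\tPROT_A\tPROT_C\n"
reader = csv.DictReader(io.StringIO(data), delimiter="\t")
row = dict(next(reader))

# Mirror of Reader.get_alternative_proteins_from_input: move the overflow
# fields into a sorted "alternative_proteins" entry.
if None in row:
    row["alternative_proteins"] = sorted(row.pop(None))
else:
    row["alternative_proteins"] = []

print(row["alternative_proteins"])  # ['PROT_A', 'PROT_C']
```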

DataStore

Bases: object

The DataStore class serves as the data storage object for a protein inference analysis. It is the central point that is accessed at virtually every protein inference processing step.

Attributes:
  • main_data_form (list) –

    List of unrestricted Psm objects.

  • parameter_file_object (ProteinInferenceParameter) –

    protein inference parameter object.

  • restricted_peptides (list) –

    List of non-flanking peptide strings present in the current analysis.

  • main_data_restricted (list) –

    List of restricted Psm objects. Restriction is based on the parameter_file_object, and the list is created by the restrict_psm_data method.

  • scored_proteins (list) –

    List of scored Protein objects. Output from scoring methods from scoring.

  • grouped_scored_proteins (list) –

    List of scored Protein objects that have been grouped and sorted. Output from run_inference method.

  • scoring_input (list) –

    List of non-scored Protein objects. Output from create_scoring_input.

  • picked_proteins_scored (list) –

    List of Protein objects that pass the protein picker algorithm (protein_picker).

  • picked_proteins_removed (list) –

    List of Protein objects that do not pass the protein picker algorithm (protein_picker).

  • protein_peptide_dictionary (defaultdict) –

    Dictionary of protein strings (keys) that map to sets of peptide strings based on the peptides and proteins found in the search. Protein -> set(Peptides).

  • peptide_protein_dictionary (defaultdict) –

    Dictionary of peptide strings (keys) that map to sets of protein strings based on the peptides and proteins found in the search. Peptide -> set(Proteins).

  • high_low_better (str) –

    Variable that indicates whether a higher or a lower protein score is better. This is necessary to sort Protein objects by score properly. Can either be "higher" or "lower".

  • psm_score (str) –

    Variable that indicates the Psm score being used in the analysis to generate Protein scores.

  • protein_score (str) –

    String to indicate the protein score method used.

  • short_protein_score (str) –

    Short string indicating the protein score method used.

  • protein_group_objects (list) –

    List of scored ProteinGroup objects that have been grouped and sorted. Output from run_inference method.

  • decoy_symbol (str) –

    String that is used to differentiate between decoy proteins and target proteins. Ex: "##".

  • digest (Digest) –
  • SCORE_MAPPER (dict) –

    Dictionary that maps potential scores in input files to internal score names.

  • CUSTOM_SCORE_KEY (str) –

    String that indicates a custom score is being used.

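The `SCORE_MAPPER` lookup amounts to a dictionary `get` with `CUSTOM_SCORE_KEY` as the fallback. The mapping below copies the values from the class definition; `internal_score_name` is a hypothetical helper written for illustration:

```python
# Copied from DataStore.SCORE_MAPPER: input score column names mapped to
# the internal score names used by the analysis.
SCORE_MAPPER = {
    "q_value": "qvalue",
    "pep_value": "pepvalue",
    "perc_score": "percscore",
    "score": "percscore",
    "q-value": "qvalue",
    "posterior_error_prob": "pepvalue",
    "posterior_error_probability": "pepvalue",
    "MS:1001493": "pepvalue",  # PSI-MS accessions for pep/q-value columns
    "MS:1001491": "qvalue",
}

CUSTOM_SCORE_KEY = "custom_score"

def internal_score_name(column):
    # Unrecognized columns fall back to the custom-score key.
    return SCORE_MAPPER.get(column, CUSTOM_SCORE_KEY)

print(internal_score_name("posterior_error_prob"))  # pepvalue
print(internal_score_name("my_engine_score"))       # custom_score
```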
Source code in pyproteininference/datastore.py
class DataStore(object):
    """
    The following Class serves as the data storage object for a protein inference analysis
    The class serves as a central point that is accessed at virtually every PI processing step


    Attributes:
        main_data_form (list): List of unrestricted Psm objects.
        parameter_file_object (ProteinInferenceParameter): protein inference parameter
            [object][pyproteininference.parameters.ProteinInferenceParameter].
        restricted_peptides (list): List of non flaking peptide strings present in the current analysis.
        main_data_restricted (list): List of restricted [Psm][pyproteininference.physical.Psm] objects.
            Restriction is based on the parameter_file_object and the object is created by function
                [restrict_psm_data][pyproteininference.datastore.DataStore.restrict_psm_data].
        scored_proteins (list): List of scored [Protein][pyproteininference.physical.Protein] objects.
            Output from scoring methods from [scoring][pyproteininference.scoring].
        grouped_scored_proteins (list): List of scored [Protein][pyproteininference.physical.Protein]
            objects that have been grouped and sorted. Output from
                [run_inference][pyproteininference.inference.Inference.run_inference] method.
        scoring_input (list): List of non-scored [Protein][pyproteininference.physical.Protein] objects.
            Output from [create_scoring_input][pyproteininference.datastore.DataStore.create_scoring_input].
        picked_proteins_scored (list): List of [Protein][pyproteininference.physical.Protein] objects that pass
            the protein picker algorithm ([protein_picker][pyproteininference.datastore.DataStore.protein_picker]).
        picked_proteins_removed (list): List of [Protein][pyproteininference.physical.Protein] objects that do not
            pass the protein picker algorithm ([protein_picker][pyproteininference.datastore.DataStore.protein_picker]).
        protein_peptide_dictionary (collections.defaultdict): Dictionary of protein strings (keys) that map to sets
            of peptide strings based on the peptides and proteins found in the search. Protein -> set(Peptides).
        peptide_protein_dictionary (collections.defaultdict): Dictionary of peptide strings (keys) that map to sets
            of protein strings based on the peptides and proteins found in the search. Peptide -> set(Proteins).
        high_low_better (str): Variable that indicates whether a higher or a lower protein score is better.
            This is necessary to sort Protein objects by score properly. Can either be "higher" or "lower".
        psm_score (str): Variable that indicates the [Psm][pyproteininference.physical.Psm]
            score being used in the analysis to generate [Protein][pyproteininference.physical.Protein] scores.
        protein_score (str): String to indicate the protein score method used.
        short_protein_score (str): Short String to indicate the protein score method used.
        protein_group_objects (list): List of scored [ProteinGroup][pyproteininference.physical.ProteinGroup]
            objects that have been grouped and sorted. Output from
             [run_inference][pyproteininference.inference.Inference.run_inference] method.
        decoy_symbol (str): String that is used to differentiate between decoy proteins and target proteins. Ex: "##".
        digest (Digest): [Digest object][pyproteininference.in_silico_digest.Digest].
        SCORE_MAPPER (dict): Dictionary that maps potential scores in input files to internal score names.
        CUSTOM_SCORE_KEY (str): String that indicates a custom score is being used.

    """

    SCORE_MAPPER = {
        "q_value": "qvalue",
        "pep_value": "pepvalue",
        "perc_score": "percscore",
        "score": "percscore",
        "q-value": "qvalue",
        "posterior_error_prob": "pepvalue",
        "posterior_error_probability": "pepvalue",
        "MS:1001493": "pepvalue",  # Added to make sure custom input for pep/qval accession gets mapped to pep/qval
        "MS:1001491": "qvalue",
    }

    CUSTOM_SCORE_KEY = "custom_score"

    HIGHER_PSM_SCORE = "higher"
    LOWER_PSM_SCORE = "lower"

    def __init__(self, reader, digest, validate=True):
        """

        Args:
            reader (Reader): Reader object [Reader][pyproteininference.reader.Reader].
            digest (Digest): Digest object
                [Digest][pyproteininference.in_silico_digest.Digest].
            validate (bool): True/False to indicate if the input data should be validated.

        Example:
            >>> pyproteininference.datastore.DataStore(reader = reader, digest=digest)


        """
        # If the reader class is from a percolator.psms then define main_data_form as reader.psms
        # main_data_form is the starting point for all other analyses
        self._init_validate(reader=reader)

        self.parameter_file_object = reader.parameter_file_object  # Parameter object
        self.main_data_restricted = None  # PSM data post restriction
        self.scored_proteins = []  # List of scored Protein objects
        self.grouped_scored_proteins = []  # List of sorted scored Protein objects
        self.scoring_input = None  # List of non scored Protein objects
        self.picked_proteins_scored = None  # List of Protein objects after picker algorithm
        self.picked_proteins_removed = None  # Protein objects removed via picker
        self.protein_peptide_dictionary = None
        self.peptide_protein_dictionary = None
        self.high_low_better = None  # Variable that indicates whether a higher or lower protein score is better
        self.psm_score = None  # PSM Score used
        self.protein_score = None
        self.short_protein_score = None
        self.protein_group_objects = []  # List of sorted protein group objects
        self.decoy_symbol = self.parameter_file_object.decoy_symbol  # Decoy symbol from parameter file
        self.digest = digest  # Digest object

        # Run Checks and Validations
        if validate:
            self.validate_psm_data()
            self.validate_digest()
            self.check_data_consistency()

        # Run method to fix our parameter object if necessary
        self.parameter_file_object.fix_parameters_from_datastore(data=self)

    def get_sorted_identifiers(self, scored=True):
        """
        Retrieves a sorted list of protein strings present in the analysis.

        Args:
            scored (bool): True/False to indicate if we should return scored or non-scored identifiers.

        Returns:
            list: List of sorted protein identifier strings.

        Example:
            >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
            >>> sorted_proteins = data.get_sorted_identifiers(scored=True)
        """

        if scored:
            self._validate_scored_proteins()
            if self.picked_proteins_scored:
                proteins = set([x.identifier for x in self.picked_proteins_scored])
            else:
                proteins = set([x.identifier for x in self.scored_proteins])
        else:
            self._validate_scoring_input()
            proteins = [x.identifier for x in self.scoring_input]

        all_sp_proteins = set(self.digest.swiss_prot_protein_set)

        our_target_sp_proteins = sorted([x for x in proteins if x in all_sp_proteins and self.decoy_symbol not in x])
        our_decoy_sp_proteins = sorted([x for x in proteins if x in all_sp_proteins and self.decoy_symbol in x])

        our_target_tr_proteins = sorted(
            [x for x in proteins if x not in all_sp_proteins and self.decoy_symbol not in x]
        )
        our_decoy_tr_proteins = sorted([x for x in proteins if x not in all_sp_proteins and self.decoy_symbol in x])

        our_proteins_sorted = (
            our_target_sp_proteins + our_decoy_sp_proteins + our_target_tr_proteins + our_decoy_tr_proteins
        )

        return our_proteins_sorted

    @classmethod
    def sort_protein_group_objects(cls, protein_group_objects, higher_or_lower):
        """
        Class Method to sort a list of [ProteinGroup][pyproteininference.physical.ProteinGroup] objects by
        score and number of peptides.

        Args:
            protein_group_objects (list): list of [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.
            higher_or_lower (str): String to indicate if a "higher" or "lower" protein score is "better".

        Returns:
            list: list of sorted [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.

        Example:
            >>> list_of_group_objects = pyproteininference.datastore.DataStore.sort_protein_group_objects(
            >>>     protein_group_objects=list_of_group_objects, higher_or_lower="higher"
            >>> )
        """
        if higher_or_lower == cls.LOWER_PSM_SCORE:

            protein_group_objects = sorted(
                protein_group_objects,
                key=lambda k: (
                    k.proteins[0].score,
                    -k.proteins[0].num_peptides,
                ),
                reverse=False,
            )
        elif higher_or_lower == cls.HIGHER_PSM_SCORE:

            protein_group_objects = sorted(
                protein_group_objects,
                key=lambda k: (
                    k.proteins[0].score,
                    k.proteins[0].num_peptides,
                ),
                reverse=True,
            )

        return protein_group_objects

    @classmethod
    def sort_protein_objects(cls, grouped_protein_objects, higher_or_lower):
        """
        Class Method to sort a list of [Protein][pyproteininference.physical.Protein] objects by score and number of
        peptides.

        Args:
            grouped_protein_objects (list): list of [Protein][pyproteininference.physical.Protein] objects.
            higher_or_lower (str): String to indicate if a "higher" or "lower" protein score is "better".

        Returns:
            list: list of sorted [Protein][pyproteininference.physical.Protein] objects.

        Example:
            >>> scores_grouped = pyproteininference.datastore.DataStore.sort_protein_objects(
            >>>     grouped_protein_objects=scores_grouped, higher_or_lower="higher"
            >>> )
        """
        if higher_or_lower == cls.LOWER_PSM_SCORE:
            grouped_protein_objects = sorted(
                grouped_protein_objects,
                key=lambda k: (k[0].score, -k[0].num_peptides),
                reverse=False,
            )
        if higher_or_lower == cls.HIGHER_PSM_SCORE:
            grouped_protein_objects = sorted(
                grouped_protein_objects,
                key=lambda k: (k[0].score, k[0].num_peptides),
                reverse=True,
            )
        return grouped_protein_objects

    @classmethod
    def sort_protein_sub_groups(cls, protein_list, higher_or_lower):
        """
        Method to sort protein sub lists.

        Args:
            protein_list (list): List of [Protein][pyproteininference.physical.Protein] objects to be sorted.
            higher_or_lower (str): String to indicate if a "higher" or "lower" protein score is "better".

        Returns:
            list: List of [Protein][pyproteininference.physical.Protein] objects to be sorted by score and number of
            peptides.

        """

        # Sort the groups based on higher or lower indication, secondarily sort the groups based on number of unique
        # peptides
        # We use the index [1:] as we do not wish to sort the lead protein...
        if higher_or_lower == cls.LOWER_PSM_SCORE:
            protein_list[1:] = sorted(
                protein_list[1:],
                key=lambda k: (float(k.score), -float(k.num_peptides)),
                reverse=False,
            )
        if higher_or_lower == cls.HIGHER_PSM_SCORE:
            protein_list[1:] = sorted(
                protein_list[1:],
                key=lambda k: (float(k.score), float(k.num_peptides)),
                reverse=True,
            )

        return protein_list

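The `[1:]` slicing above pins the lead protein at index 0 while re-ordering the rest of the group. A minimal standalone sketch of that logic (using `SimpleNamespace` stand-ins for [Protein][pyproteininference.physical.Protein] objects; not the module's API):

```python
from types import SimpleNamespace

def sort_sub_group(protein_list, higher_is_better):
    # Keep the lead protein at index 0; sort the remaining proteins
    # by score, breaking ties by number of peptides.
    if higher_is_better:
        protein_list[1:] = sorted(
            protein_list[1:],
            key=lambda k: (float(k.score), float(k.num_peptides)),
            reverse=True,
        )
    else:
        protein_list[1:] = sorted(
            protein_list[1:],
            key=lambda k: (float(k.score), -float(k.num_peptides)),
            reverse=False,
        )
    return protein_list

def P(name, score, n):
    return SimpleNamespace(identifier=name, score=score, num_peptides=n)

group = [P("LEAD", 5.0, 3), P("A", 1.0, 2), P("B", 9.0, 1), P("C", 9.0, 4)]
ordered = sort_sub_group(group, higher_is_better=True)
```

With a "higher is better" score, `C` (score 9.0, 4 peptides) outranks `B` (score 9.0, 1 peptide), but `LEAD` stays first regardless of its score.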
    def get_psm_data(self):
        """
        Method to retrieve a list of [Psm][pyproteininference.physical.Psm] objects.
        Retrieves restricted data if the data has been restricted or all of the data if the data has
        not been restricted.

        Returns:
            list: list of [Psm][pyproteininference.physical.Psm] objects.

        Example:
            >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
            >>> psm_data = data.get_psm_data()
        """
        if not self.main_data_restricted and not self.main_data_form:
            raise ValueError(
                "Both main_data_restricted and main_data_form variables are empty. Please re-load the DataStore "
                "object with a properly loaded Reader object."
            )

        if self.main_data_restricted:
            psm_data = self.main_data_restricted
        else:
            psm_data = self.main_data_form

        return psm_data

    def get_protein_data(self):
        """
        Method to retrieve a list of [Protein][pyproteininference.physical.Protein] objects.
        Retrieves picked and scored data if the data has been picked and scored, or just the scored data if the
        data has not been picked.

        Returns:
            list: list of [Protein][pyproteininference.physical.Protein] objects.

        Example:
            >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
            >>> # Data must first be run through a pyproteininference.scoring.Score method
            >>> protein_data = data.get_protein_data()
        """

        if self.picked_proteins_scored:
            scored_proteins = self.picked_proteins_scored
        else:
            scored_proteins = self.scored_proteins

        return scored_proteins

    def get_protein_identifiers_from_psm_data(self):
        """
        Method to retrieve a list of lists of all possible protein identifiers from the psm data.

        Returns:
            list: list of lists of protein strings.

        Example:
            >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
            >>> protein_strings = data.get_protein_identifiers_from_psm_data()
        """
        psm_data = self.get_psm_data()

        proteins = [x.possible_proteins for x in psm_data]

        return proteins

    def get_q_values(self):
        """
        Method to retrieve a list of all q values for all PSMs.

        Returns:
            list: list of floats (q values).

        Example:
            >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
            >>> q = data.get_q_values()
        """
        psm_data = self.get_psm_data()

        q_values = [x.qvalue for x in psm_data]

        return q_values

    def get_pep_values(self):
        """
        Method to retrieve a list of all posterior error probabilities for all PSMs.

        Returns:
            list: list of floats (pep values).

        Example:
            >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
            >>> pep = data.get_pep_values()
        """
        psm_data = self.get_psm_data()

        pep_values = [x.pepvalue for x in psm_data]

        return pep_values

    def get_protein_information_dictionary(self):
        """
        Method to retrieve a dictionary mapping each possible protein to the PSM-level scores of its supporting
        peptides.

        Returns:
            dict: dictionary of scores for each protein.

        Example:
            >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
            >>> protein_dict = data.get_protein_information_dictionary()
        """
        psm_data = self.get_psm_data()

        protein_psm_score_dictionary = collections.defaultdict(list)

        # Loop through all Psms
        for psms in psm_data:
            # Loop through all proteins
            for prots in psms.possible_proteins:
                protein_psm_score_dictionary[prots].append(
                    {
                        "peptide": psms.identifier,
                        "Qvalue": psms.qvalue,
                        "PosteriorErrorProbability": psms.pepvalue,
                        "Percscore": psms.percscore,
                    }
                )

        return protein_psm_score_dictionary

    def restrict_psm_data(self, remove1pep=True):
        """
        Method to restrict the input of [Psm][pyproteininference.physical.Psm]  objects.
        This method is central to the pyproteininference module and is able to restrict the Psm data by:
        Q value, Pep Value, Percolator Score, Peptide Length, and Custom Score Input.
        Restriction values are pulled from
        the [ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter]
        object.

        This method sets the `main_data_restricted` and `restricted_peptides` Attributes for the DataStore object.

        Args:
            remove1pep (bool): True/False on whether to remove PSMs with a PEP equal to 1, even when no other
                restrictions are set.

        Returns:
            None:

        Example:
            >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
            >>> data.restrict_psm_data(remove1pep=True)
        """

        # Validate that we have the main data variable
        self._validate_main_data_form()

        logger.info("Restricting PSM data")

        peptide_length = self.parameter_file_object.restrict_peptide_length
        posterior_error_prob_threshold = self.parameter_file_object.restrict_pep
        q_value_threshold = self.parameter_file_object.restrict_q
        custom_threshold = self.parameter_file_object.restrict_custom

        main_psm_data = self.main_data_form
        logger.info("Length of main data: {}".format(len(self.main_data_form)))
        # When remove1pep is set (and a PEP threshold is in use), automatically discard every PSM with a PEP of 1
        if remove1pep and posterior_error_prob_threshold:
            main_psm_data = [x for x in main_psm_data if x.pepvalue != 1]

        # Restrict peptide length and posterior error probability
        if peptide_length and posterior_error_prob_threshold and not q_value_threshold:
            restricted_data = []
            for psms in main_psm_data:
                if len(psms.stripped_peptide) >= peptide_length and psms.pepvalue < float(
                    posterior_error_prob_threshold
                ):
                    restricted_data.append(psms)

        # Restrict peptide length only
        if peptide_length and not posterior_error_prob_threshold and not q_value_threshold:
            restricted_data = []
            for psms in main_psm_data:
                if len(psms.stripped_peptide) >= peptide_length:
                    restricted_data.append(psms)

        # Restrict peptide length, posterior error probability, and qvalue
        if peptide_length and posterior_error_prob_threshold and q_value_threshold:
            restricted_data = []
            for psms in main_psm_data:
                if (
                    len(psms.stripped_peptide) >= peptide_length
                    and psms.pepvalue < float(posterior_error_prob_threshold)
                    and psms.qvalue < float(q_value_threshold)
                ):
                    restricted_data.append(psms)

        # Restrict peptide length and qvalue
        if peptide_length and not posterior_error_prob_threshold and q_value_threshold:
            restricted_data = []
            for psms in main_psm_data:
                if len(psms.stripped_peptide) >= peptide_length and psms.qvalue < float(q_value_threshold):
                    restricted_data.append(psms)

        # Restrict posterior error probability and q value
        if not peptide_length and posterior_error_prob_threshold and q_value_threshold:
            restricted_data = []
            for psms in main_psm_data:
                if psms.pepvalue < float(posterior_error_prob_threshold) and psms.qvalue < float(q_value_threshold):
                    restricted_data.append(psms)

        # Restrict qvalue only
        if not peptide_length and not posterior_error_prob_threshold and q_value_threshold:
            restricted_data = []
            for psms in main_psm_data:
                if psms.qvalue < float(q_value_threshold):
                    restricted_data.append(psms)

        # Restrict posterior error probability only
        if not peptide_length and posterior_error_prob_threshold and not q_value_threshold:
            restricted_data = []
            for psms in main_psm_data:
                if psms.pepvalue < float(posterior_error_prob_threshold):
                    restricted_data.append(psms)

        # No thresholds set... keep everything (aside from the optional removal of PEP == 1 PSMs above)
        if not peptide_length and not posterior_error_prob_threshold and not q_value_threshold:
            restricted_data = main_psm_data

        if custom_threshold:
            custom_restricted = []
            if self.parameter_file_object.psm_score_type == Score.MULTIPLICATIVE_SCORE_TYPE:
                for psms in restricted_data:
                    if psms.custom_score <= custom_threshold:
                        custom_restricted.append(psms)

            if self.parameter_file_object.psm_score_type == Score.ADDITIVE_SCORE_TYPE:
                for psms in restricted_data:
                    if psms.custom_score >= custom_threshold:
                        custom_restricted.append(psms)

            restricted_data = custom_restricted

        self.main_data_restricted = restricted_data

        logger.info("Length of restricted data: {}".format(len(restricted_data)))

        self.restricted_peptides = [x.non_flanking_peptide for x in restricted_data]

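The chained `if` blocks above enumerate every combination of set/unset thresholds. The same keep/drop rules can be expressed as a single predicate; a condensed standalone sketch of the filtering logic (dicts stand in for Psm objects; this is an illustration, not the module's API):

```python
def passes(psm, peptide_length=None, pep_threshold=None, q_threshold=None):
    # A PSM is kept only if it satisfies every threshold that is set;
    # unset (None/falsy) thresholds are skipped, mirroring the branches above.
    if peptide_length and len(psm["stripped_peptide"]) < peptide_length:
        return False
    if pep_threshold and psm["pepvalue"] >= float(pep_threshold):
        return False
    if q_threshold and psm["qvalue"] >= float(q_threshold):
        return False
    return True

psms = [
    {"stripped_peptide": "PEPTIDEK", "pepvalue": 0.01, "qvalue": 0.001},
    {"stripped_peptide": "SHORT", "pepvalue": 0.01, "qvalue": 0.001},   # too short
    {"stripped_peptide": "LONGPEPTIDER", "pepvalue": 0.5, "qvalue": 0.2},  # fails PEP/q
]
kept = [p for p in psms if passes(p, peptide_length=7, pep_threshold=0.05, q_threshold=0.01)]
```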
    def create_scoring_input(self):
        """
        Method to create the scoring input.
        This method initializes a list of [Protein][pyproteininference.physical.Protein] objects to get them ready
        to be scored by [Score][pyproteininference.scoring.Score] methods.
        This method also takes into account the inference type and aggregates peptides -> proteins accordingly.

        This method sets the `scoring_input` and `score` Attributes for the DataStore object.

        The score selected comes from the protein inference parameter object.

        Returns:
            None:

        Example:
            >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
            >>> data.create_scoring_input()
        """

        logger.info("Creating Scoring Input")

        psm_data = self.get_psm_data()

        protein_psm_dict = collections.defaultdict(list)

        try:
            score_key = self.SCORE_MAPPER[self.parameter_file_object.psm_score]
        except KeyError:
            score_key = self.CUSTOM_SCORE_KEY

        if self.parameter_file_object.inference_type != Inference.PEPTIDE_CENTRIC:
            # Loop through all Psms
            for psms in psm_data:
                psms.assign_main_score(score=score_key)
                # Loop through all proteins
                for prots in psms.possible_proteins:
                    protein_psm_dict[prots].append(psms)

        else:
            self.peptide_to_protein_dictionary()
            sp_proteins = self.digest.swiss_prot_protein_set
            for psms in psm_data:

                # Assign main score
                psms.assign_main_score(score=score_key)
                protein_set = self.peptide_protein_dictionary[psms.non_flanking_peptide]
                # Sort protein_set by sp-alpha, decoy-sp-alpha, tr-alpha, decoy-tr-alpha
                sorted_protein_list = self.sort_protein_strings(
                    protein_string_list=protein_set,
                    sp_proteins=sp_proteins,
                    decoy_symbol=self.parameter_file_object.decoy_symbol,
                )
                # Restrict the number of identifiers by the value in param file max_identifiers_peptide_centric
                sorted_protein_list = sorted_protein_list[: self.parameter_file_object.max_identifiers_peptide_centric]
                protein_name = ";".join(sorted_protein_list)
                protein_psm_dict[protein_name].append(psms)

        protein_list = []
        for pkey in sorted(protein_psm_dict.keys()):
            protein_object = Protein(identifier=pkey)
            protein_object.psms = protein_psm_dict[pkey]
            protein_object.raw_peptides = set([x.identifier for x in protein_psm_dict[pkey]])
            protein_list.append(protein_object)

        self.psm_score = self.parameter_file_object.psm_score
        self.scoring_input = protein_list

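In the peptide-centric branch above, each peptide's candidate proteins are sorted (SwissProt targets first, then SwissProt decoys, then TrEMBL targets, then TrEMBL decoys), truncated to `max_identifiers_peptide_centric`, and joined with `";"` into a single group identifier. A rough standalone sketch of that ordering (hypothetical helper assuming `##` as the decoy symbol; the real logic lives in `DataStore.sort_protein_strings`):

```python
def sort_protein_strings(protein_set, sp_proteins, decoy_symbol):
    # Rank buckets: SwissProt targets (0), SwissProt decoys (1),
    # TrEMBL targets (2), TrEMBL decoys (3); alphabetical within each bucket.
    def rank(p):
        is_decoy = decoy_symbol in p
        is_sp = p.replace(decoy_symbol, "") in sp_proteins
        if is_sp and not is_decoy:
            return 0
        if is_sp and is_decoy:
            return 1
        if not is_decoy:
            return 2
        return 3

    return sorted(protein_set, key=lambda p: (rank(p), p))

sp = {"P04049", "P00533"}
candidates = {"Q99999", "##P04049", "P00533", "P04049"}
ordered = sort_protein_strings(candidates, sp, decoy_symbol="##")
group_id = ";".join(ordered[:3])  # truncation, like max_identifiers_peptide_centric
```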
    def protein_to_peptide_dictionary(self):
        """
        Method that returns a map of protein strings to sets of peptide strings and is essentially half
        of a bipartite graph.
        This method sets the `protein_peptide_dictionary` Attribute for the DataStore object.

        Returns:
            collections.defaultdict: Dictionary of protein strings (keys) that map to sets of peptide strings based
            on the peptides and proteins found in the search. Protein -> set(Peptides).

        Example:
            >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
            >>> protein_peptide_dict = data.protein_to_peptide_dictionary()
        """
        psm_data = self.get_psm_data()

        res_pep_set = set(self.restricted_peptides)
        default_dict_proteins = collections.defaultdict(set)
        for peptide_objects in psm_data:
            for prots in peptide_objects.possible_proteins:
                cur_peptide = peptide_objects.non_flanking_peptide
                if cur_peptide in res_pep_set:
                    default_dict_proteins[prots].add(cur_peptide)

        self.protein_peptide_dictionary = default_dict_proteins

        return default_dict_proteins

    def peptide_to_protein_dictionary(self):
        """
        Method that returns a map of peptide strings to sets of protein strings and is essentially half of a
        bipartite graph.
        This method sets the `peptide_protein_dictionary` Attribute for the DataStore object.

        Returns:
            collections.defaultdict: Dictionary of peptide strings (keys) that map to sets of protein strings based
                on the peptides and proteins found in the search. Peptide -> set(Proteins).

        Example:
            >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
            >>> peptide_protein_dict = data.peptide_to_protein_dictionary()
        """
        psm_data = self.get_psm_data()

        res_pep_set = set(self.restricted_peptides)
        default_dict_peptides = collections.defaultdict(set)
        for peptide_objects in psm_data:
            for prots in peptide_objects.possible_proteins:
                cur_peptide = peptide_objects.non_flanking_peptide
                if cur_peptide in res_pep_set:
                    default_dict_peptides[cur_peptide].add(prots)

        self.peptide_protein_dictionary = default_dict_peptides

        return default_dict_peptides

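`protein_to_peptide_dictionary` and `peptide_to_protein_dictionary` are the two halves of the same bipartite peptide/protein graph; both walk the PSMs and accumulate into a `defaultdict(set)`. A self-contained sketch of both directions (dicts stand in for Psm objects):

```python
import collections

psm_data = [
    {"peptide": "ELVISK", "possible_proteins": ["P1", "P2"]},
    {"peptide": "LIVESK", "possible_proteins": ["P1"]},
]

protein_to_peptides = collections.defaultdict(set)
peptide_to_proteins = collections.defaultdict(set)
for psm in psm_data:
    for prot in psm["possible_proteins"]:
        # Each PSM contributes one edge per candidate protein.
        protein_to_peptides[prot].add(psm["peptide"])
        peptide_to_proteins[psm["peptide"]].add(prot)
```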
    def unique_to_leads_peptides(self):
        """
        Method to retrieve peptides that are unique based on the data from the searches
        (Not based on the database digestion).

        Returns:
            set: a Set of peptide strings

        Example:
            >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
            >>> unique_peps = data.unique_to_leads_peptides()
        """
        if self.grouped_scored_proteins:
            lead_peptides = [list(x[0].peptides) for x in self.grouped_scored_proteins]
            flat_peptides = [item for sublist in lead_peptides for item in sublist]
            counted_peps = collections.Counter(flat_peptides)
            unique_to_leads_peptides = set([x for x in counted_peps if counted_peps[x] == 1])
        else:
            unique_to_leads_peptides = set()

        return unique_to_leads_peptides

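The uniqueness test above is a simple multiset count over the lead proteins' peptides: a peptide is "unique to leads" when exactly one lead protein claims it. A tiny standalone sketch:

```python
import collections

# Peptide lists of each group's lead protein (illustrative data)
lead_peptides = [["AK", "BK"], ["BK", "CK"]]
flat = [pep for sub in lead_peptides for pep in sub]
counts = collections.Counter(flat)
# "BK" appears under two leads, so it is not unique
unique = {pep for pep, n in counts.items() if n == 1}
```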
    def higher_or_lower(self):
        """
        Method to determine if a higher or lower score is better for a given combination of score input and score type.

        This method sets the `high_low_better` Attribute for the DataStore object.

        This method depends on the output from the Score class being sorted properly from best to worst score.

        Returns:
            str: String indicating "higher" or "lower" depending on if a higher or lower score is a
                better protein score.

        Example:
            >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
            >>> high_low = data.higher_or_lower()
        """

        if not self.high_low_better:
            logger.info("Determining if a higher or lower score is better based on scored proteins")
            worst_score = self.scored_proteins[-1].score
            best_score = self.scored_proteins[0].score

            if float(best_score) > float(worst_score):
                higher_or_lower = self.HIGHER_PSM_SCORE

            if float(best_score) < float(worst_score):
                higher_or_lower = self.LOWER_PSM_SCORE

            logger.info("best score = {}".format(best_score))
            logger.info("worst score = {}".format(worst_score))

            if best_score == worst_score:
                raise ValueError(
                    "Best and Worst scores were identical, equal to {}. Score type {} produced the error, "
                    "please change psm_score type.".format(best_score, self.psm_score)
                )

            self.high_low_better = higher_or_lower

        else:
            higher_or_lower = self.high_low_better

        return higher_or_lower

    def get_protein_identifiers(self, data_form):
        """
        Method to retrieve the protein string identifiers.

        Args:
            data_form (str): Can be one of the following: "main", "restricted", "picked", "picked_removed".

        Returns:
            list: list of protein identifier strings.

        Example:
            >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
            >>> protein_strings = data.get_protein_identifiers(data_form="main")
        """
        if data_form == "main":
            # All the data (unrestricted)
            data_to_select = self.main_data_form
            prots = [[x.possible_proteins] for x in data_to_select]
            proteins = prots

        if data_form == "restricted":
            # Proteins that pass certain restriction criteria (peptide length, pep, qvalue)
            data_to_select = self.main_data_restricted
            prots = [[x.possible_proteins] for x in data_to_select]
            proteins = prots

        if data_form == "picked":
            # Here we look at proteins that are 'picked' (aka the proteins that beat out their matching target/decoy)
            data_to_select = self.picked_proteins_scored
            prots = [x.identifier for x in data_to_select]
            proteins = prots

        if data_form == "picked_removed":
            # Here we look at the proteins that were removed due to picking (aka the proteins that
            # have a worse score than their target/decoy counterpart)
            data_to_select = self.picked_proteins_removed
            prots = [x.identifier for x in data_to_select]
            proteins = prots

        if data_form not in {"main", "restricted", "picked", "picked_removed"}:
            raise ValueError(
                "data_form must be one of: 'main', 'restricted', 'picked', 'picked_removed'"
            )

        return proteins

    def get_protein_information(self, protein_string):
        """
        Method to retrieve attributes for a specific scored protein.

        Args:
            protein_string (str): Protein Identifier String.

        Returns:
            list: list of protein attributes.

        Example:
            >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
            >>> protein_attr = data.get_protein_information(protein_string="RAF1_HUMAN|P04049")
        """
        all_scored_protein_data = self.scored_proteins
        identifiers = [x.identifier for x in all_scored_protein_data]
        protein_scores = [x.score for x in all_scored_protein_data]
        groups = [x.group_identification for x in all_scored_protein_data]
        reviewed = [x.reviewed for x in all_scored_protein_data]
        peptides = [x.peptides for x in all_scored_protein_data]
        # Peptide scores currently broken...
        peptide_scores = [x.peptide_scores for x in all_scored_protein_data]
        picked = [x.picked for x in all_scored_protein_data]
        num_peptides = [x.num_peptides for x in all_scored_protein_data]

        main_index = identifiers.index(protein_string)

        list_structure = [
            [
                "identifier",
                "protein_score",
                "groups",
                "reviewed",
                "peptides",
                "peptide_scores",
                "picked",
                "num_peptides",
            ]
        ]
        list_structure.append([protein_string])
        list_structure[-1].append(protein_scores[main_index])
        list_structure[-1].append(groups[main_index])
        list_structure[-1].append(reviewed[main_index])
        list_structure[-1].append(peptides[main_index])
        list_structure[-1].append(peptide_scores[main_index])
        list_structure[-1].append(picked[main_index])
        list_structure[-1].append(num_peptides[main_index])

        return list_structure

    def exclude_non_distinguishing_peptides(self, protein_subset_type="hard"):
        """
        Method to Exclude peptides that are not distinguishing on either the search or database level.

        The method sets the `scoring_input` and `restricted_peptides` variables for the DataStore object.

        Args:
            protein_subset_type (str): Either "hard" or "soft". Hard will select distinguishing peptides based on
                the database digestion. "soft" will only use peptides identified in the search.

        Returns:
            None:

        Example:
            >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
            >>> data.exclude_non_distinguishing_peptides(protein_subset_type="hard")
        """

        logger.info("Applying Exclusion Model")

        our_proteins_sorted = self.get_sorted_identifiers(scored=False)

        if protein_subset_type == "hard":
            # Hard protein subsetting defines protein subsets on the digest level (Entire protein is used)
            # This is how Percolator PI does subsetting
            peptides = [self.digest.protein_to_peptide_dictionary[x] for x in our_proteins_sorted]
        elif protein_subset_type == "soft":
            # Soft protein subsetting defines protein subsets on the Peptides identified from the search
            peptides = [set(x.raw_peptides) for x in self.scoring_input]
        else:
            # If neither is defined, default to "hard" exclusion
            peptides = [self.digest.protein_to_peptide_dictionary[x] for x in our_proteins_sorted]

        # Get frozen set of peptides....
        # We will also have a corresponding list of proteins...
        # They will have the same index...
        peptide_sets = [frozenset(e) for e in peptides]
        # Find a way to sort this list of sets...
        # We can sort the sets if we sort proteins from above...
        logger.info("{} number of peptide sets".format(len(peptide_sets)))
        non_subset_peptide_sets = set()
        i = 0
        # Get all peptide sets that are not a subset...
        while peptide_sets:
            i = i + 1
            peptide_set = peptide_sets.pop()
            if any(peptide_set.issubset(s) for s in peptide_sets) or any(
                peptide_set.issubset(s) for s in non_subset_peptide_sets
            ):
                continue
            else:
                non_subset_peptide_sets.add(peptide_set)
            if i % 10000 == 0:
                logger.info("Parsed {} Peptide Sets".format(i))

        logger.info("Parsed {} Peptide Sets".format(i))

        # Get their index from peptides, which is the initial list of sets...
        list_of_indices = []
        for pep_sets in non_subset_peptide_sets:
            ind = peptides.index(pep_sets)
            list_of_indices.append(ind)

        non_subset_proteins = set([our_proteins_sorted[x] for x in list_of_indices])

        logger.info("Removing direct subset Proteins from the data")
        # Remove all proteins from scoring input that are a subset of another protein...
        self.scoring_input = [x for x in self.scoring_input if x.identifier in non_subset_proteins]

        logger.info("{} proteins in scoring input after removing subset proteins".format(len(self.scoring_input)))

        # For all the proteins that are not a complete subset of another protein...
        # Get the raw peptides...
        raw_peps = [x.raw_peptides for x in self.scoring_input if x.identifier in non_subset_proteins]

        # Make the raw peptides a flat list
        flat_peptides = [Psm.split_peptide(peptide_string=item) for sublist in raw_peps for item in sublist]

        # Count the number of peptides in this list...
        # This is the number of proteins this peptide maps to....
        counted_peptides = collections.Counter(flat_peptides)

        # Keep only peptides that map to a single protein; peptides shared by more
        # than one protein are dropped from the scoring input
        raw_peps_good = set([x for x in counted_peptides.keys() if counted_peptides[x] <= 1])

        # Alter self.scoring_input by removing psms and peptides that are not in raw_peps_good
        current_score_input = list(self.scoring_input)
        for j in range(len(current_score_input)):
            k = j + 1
            psm_list = []
            new_raw_peptides = []
            current_psms = current_score_input[j].psms
            current_raw_peptides = current_score_input[j].raw_peptides

            for psm_scores in current_psms:
                if psm_scores.non_flanking_peptide in raw_peps_good:
                    psm_list.append(psm_scores)

            for rp in current_raw_peptides:
                if Psm.split_peptide(peptide_string=rp) in raw_peps_good:
                    new_raw_peptides.append(rp)

            current_score_input[j].psms = psm_list
            current_score_input[j].raw_peptides = new_raw_peptides

            if k % 10000 == 0:
                logger.info("Redefined {} Peptide Sets".format(k))

        logger.info("Redefined {} Peptide Sets".format(len(current_score_input)))

        filtered_score_input = [x for x in current_score_input if x.psms]

        self.scoring_input = filtered_score_input

        # Recompute the flat peptides
        raw_peps = [x.raw_peptides for x in self.scoring_input if x.identifier in non_subset_proteins]

        # Make the raw peptides a flat list
        new_flat_peptides = set([Psm.split_peptide(peptide_string=item) for sublist in raw_peps for item in sublist])

        self.scoring_input = [x for x in self.scoring_input if x.psms]

        self.restricted_peptides = [x for x in self.restricted_peptides if x in new_flat_peptides]

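The core of the "hard"/"soft" exclusion is subset elimination over peptide sets: any protein whose peptide set is wholly contained in another protein's set is dropped. A compact standalone sketch (using a proper-subset test for simplicity; the method above pops sets from a work list, which also handles identical sets):

```python
peptides = {
    "PROT_A": frozenset({"PEP1", "PEP2", "PEP3"}),
    "PROT_B": frozenset({"PEP1", "PEP2"}),  # proper subset of PROT_A -> excluded
    "PROT_C": frozenset({"PEP4"}),
}

non_subset = {
    prot
    for prot, pep_set in peptides.items()
    if not any(
        pep_set < other  # strictly contained in some other protein's peptides
        for other_prot, other in peptides.items()
        if other_prot != prot
    )
}
```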
    def protein_picker(self):
        """
        Method to run the protein picker algorithm.

        Proteins must be scored first with [score_psms][pyproteininference.scoring.Score.score_psms].

        The algorithm will match target and decoy proteins identified from the PSMs from the search.
        If a target and matching decoy is found then target/decoy competition is performed.
        In the Target/Decoy pair the protein with the better score is kept and the one with the worse score is
        discarded from the analysis.

        The method sets the `picked_proteins_scored` and `picked_proteins_removed` variables for
        the DataStore object.

        Returns:
            None:

        Example:
            >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
            >>> data.protein_picker()
        """

        self._validate_scored_proteins()

        logger.info("Running Protein Picker")

        # Use higher or lower class to determine if a higher protein score or lower protein score is better
        # based on the scoring method used
        higher_or_lower = self.higher_or_lower()
        # Here we determine if a lower or higher score is better
        # Since all input is ordered from best to worst we can do the following

        index_to_remove = []
        # data.scored_proteins is simply a list of Protein objects...
        # Create list of all decoy proteins
        decoy_proteins = [x.identifier for x in self.scored_proteins if self.decoy_symbol in x.identifier]
        # Create a list of all potential matching targets (some of these may not exist in the search)
        matching_targets = [x.replace(self.decoy_symbol, "") for x in decoy_proteins]

        # Create a list of all the proteins from the scored data
        all_proteins = [x.identifier for x in self.scored_proteins]
        logger.info("{} proteins scored".format(len(all_proteins)))

        total_targets = []
        total_decoys = []
        decoys_removed = []
        targets_removed = []
        # Loop over all decoys identified in the search
        logger.info("Picking Proteins...")
        for i in range(len(decoy_proteins)):
            cur_decoy_index = all_proteins.index(decoy_proteins[i])
            cur_decoy_protein_object = self.scored_proteins[cur_decoy_index]
            total_decoys.append(cur_decoy_protein_object.identifier)

            # Try, Except here because the matching target to the decoy may not be a result from the search
            try:
                cur_target_index = all_proteins.index(matching_targets[i])
                cur_target_protein_object = self.scored_proteins[cur_target_index]
                total_targets.append(cur_target_protein_object.identifier)

                if higher_or_lower == self.HIGHER_PSM_SCORE:
                    if cur_target_protein_object.score > cur_decoy_protein_object.score:
                        index_to_remove.append(cur_decoy_index)
                        decoys_removed.append(cur_decoy_index)
                        cur_target_protein_object.picked = True
                        cur_decoy_protein_object.picked = False
                    else:
                        index_to_remove.append(cur_target_index)
                        targets_removed.append(cur_target_index)
                        cur_decoy_protein_object.picked = True
                        cur_target_protein_object.picked = False

                if higher_or_lower == self.LOWER_PSM_SCORE:
                    if cur_target_protein_object.score < cur_decoy_protein_object.score:
                        index_to_remove.append(cur_decoy_index)
                        decoys_removed.append(cur_decoy_index)
                        cur_target_protein_object.picked = True
                        cur_decoy_protein_object.picked = False
                    else:
                        index_to_remove.append(cur_target_index)
                        targets_removed.append(cur_target_index)
                        cur_decoy_protein_object.picked = True
                        cur_target_protein_object.picked = False
            except ValueError:
                pass

        logger.info("{} total decoy proteins".format(len(total_decoys)))
        logger.info("{} matching target proteins also found in search".format(len(total_targets)))
        logger.info("{} decoy proteins to be removed".format(len(decoys_removed)))
        logger.info("{} target proteins to be removed".format(len(targets_removed)))

        logger.info("Removing Lower Scoring Proteins...")
        picked_list = []
        removed_proteins = []
        for protein_objects in self.scored_proteins:
            if protein_objects.picked:
                picked_list.append(protein_objects)
            else:
                removed_proteins.append(protein_objects)
        self.picked_proteins_scored = picked_list
        self.picked_proteins_removed = removed_proteins
        logger.info("Finished Removing Proteins")

    def calculate_q_values(self, regular=True):
        """
        Method that calculates q-values (FDR) on the lead protein in each group on the `protein_group_objects`
        instance variable.
        FDR is calculated as (2*decoys)/total if regular is set to True and as
        (decoys)/total if regular is set to False.

        This method updates the `protein_group_objects` for the DataStore object by updating
        the q_value variable of the [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.

        Returns:
            None:

        Example:
            >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
            >>> # Data must be scored first
            >>> data.calculate_q_values()
        """

        self._validate_protein_group_objects()

        logger.info("Calculating Q values from the protein group objects")

        # pick out the lead scoring protein for each group... lead score is at 0 position
        lead_score = [x.proteins[0] for x in self.protein_group_objects]
        # Now pick out only the lead protein identifiers
        lead_proteins = [x.identifier for x in lead_score]

        lead_proteins.reverse()

        logger.info("Calculating FDRs")
        fdr_list = []
        for i in range(len(lead_proteins)):
            binary_decoy_target_list = [1 if self.decoy_symbol in elem else 0 for elem in lead_proteins]
            total = len(lead_proteins)
            decoys = sum(binary_decoy_target_list)
            # Calculate FDR at every step starting with the entire list...
            # Delete first entry (worst score) every time we go through a cycle
            if regular:
                fdr = (2 * decoys) / (float(total))
            else:
                fdr = (decoys) / (float(total))
            fdr_list.append(fdr)
            del lead_proteins[0]

        qvalue_list = []
        new_fdr_list = []
        logger.info("Calculating Q Values")
        for fdrs in fdr_list:
            new_fdr_list.append(fdrs)
            qvalue = min(new_fdr_list)
            # qvalue = fdrs
            qvalue_list.append(qvalue)

        qvalue_list.reverse()

        logger.info("Assigning Q Values")
        for k in range(len(self.protein_group_objects)):
            self.protein_group_objects[k].q_value = qvalue_list[k]

        fdr_restricted = [x for x in self.protein_group_objects if x.q_value <= self.parameter_file_object.fdr]

        fdr_restricted_set = [self.grouped_scored_proteins[x] for x in range(len(fdr_restricted))]

        onehitwonders = []
        for groups in fdr_restricted_set:
            if int(groups[0].num_peptides) == 1:
                onehitwonders.append(groups[0])

        logger.info(
            "Protein Group leads that pass with more than 1 PSM with a {} FDR = {}".format(
                self.parameter_file_object.fdr,
                str(len(fdr_restricted_set) - len(onehitwonders)),
            )
        )
        logger.info(
            "Protein Group lead One hit Wonders that pass {} FDR = {}".format(
                self.parameter_file_object.fdr, len(onehitwonders)
            )
        )

        logger.info(
            "Number of Protein groups that pass a {} percent FDR: {}".format(
                str(self.parameter_file_object.fdr * 100), len(fdr_restricted_set)
            )
        )

        logger.info("Finished Q value Calculation")

    def validate_psm_data(self):
        """
        Method that validates the PSM data.
        """
        self._validate_decoys_from_data()
        self._validate_isoform_from_data()

    def validate_digest(self):
        """
        Method that validates the [Digest object][pyproteininference.in_silico_digest.Digest].
        """
        self._validate_reviewed_v_unreviewed()
        self._check_target_decoy_split()

    def check_data_consistency(self):
        """
        Method that checks for data consistency.
        """
        self._check_data_digest_overlap_psms()
        self._check_data_digest_overlap_proteins()

    def _check_data_digest_overlap_psms(self):
        """
        Method that logs the overlap between the digested fasta file and the input files on the PSM level.
        """
        peptides = [x.stripped_peptide for x in self.main_data_form]
        peptides_in_digest = set(self.digest.peptide_to_protein_dictionary.keys())
        peptides_from_search_in_digest = [x for x in peptides if x in peptides_in_digest]
        percentage = float(len(set(peptides))) / float(len(set(peptides_from_search_in_digest)))
        logger.info("{} PSMs identified from input files".format(len(peptides)))
        logger.info(
            "{} PSMs identified from input files that are also present in database digestion".format(
                len(peptides_from_search_in_digest)
            )
        )
        logger.info(
            "Ratio of PSMs identified from input files to those also present in the"
            " database digestion: {}".format(percentage)
        )

    def _check_data_digest_overlap_proteins(self):
        """
        Method that logs the overlap between the digested fasta file and the input files on the Protein level.
        """
        proteins = [x.possible_proteins for x in self.main_data_form]
        flat_proteins = set([item for sublist in proteins for item in sublist])
        proteins_in_digest = set(self.digest.protein_to_peptide_dictionary.keys())
        proteins_from_search_in_digest = [x for x in flat_proteins if x in proteins_in_digest]
        percentage = float(len(flat_proteins)) / float(len(proteins_from_search_in_digest))
        logger.info("{} proteins identified from input files".format(len(flat_proteins)))
        logger.info(
            "{} proteins identified from input files that are also present in database digestion".format(
                len(proteins_from_search_in_digest)
            )
        )
        logger.info(
            "Ratio of proteins identified from input files to those also present in the"
            " database digestion: {}".format(percentage)
        )

    def _check_target_decoy_split(self):
        """
        Method that logs the number of target and decoy proteins from the digest.
        """
        # Check the number of targets vs the number of decoys from the digest
        targets = [
            x
            for x in self.digest.protein_to_peptide_dictionary.keys()
            if self.parameter_file_object.decoy_symbol not in x
        ]
        decoys = [
            x for x in self.digest.protein_to_peptide_dictionary.keys() if self.parameter_file_object.decoy_symbol in x
        ]
        if len(decoys) == 0:
            raise ValueError(
                "No decoy proteins found in digest file with decoy symbol: {}. Please double check your decoy symbol and make sure decoy proteins are present in your input file(s).".format(
                    self.parameter_file_object.decoy_symbol
                )
            )
        ratio = float(len(targets)) / float(len(decoys))
        logger.info("Number of Target Proteins in Digest: {}".format(len(targets)))
        logger.info("Number of Decoy Proteins in Digest: {}".format(len(decoys)))
        logger.info("Ratio of Target Proteins to Decoy Proteins: {}".format(ratio))

    def _validate_decoys_from_data(self):
        """
        Method that checks to make sure that target and decoy proteins exist in the data files.
        """
        # Check to see if we find decoys from our input files
        proteins = [x.possible_proteins for x in self.main_data_form]
        flat_proteins = set([item for sublist in proteins for item in sublist])
        targets = [x for x in flat_proteins if self.parameter_file_object.decoy_symbol not in x]
        decoys = [x for x in flat_proteins if self.parameter_file_object.decoy_symbol in x]
        logger.info("Number of Target Proteins in Data Files: {}".format(len(targets)))
        logger.info("Number of Decoy Proteins in Data Files: {}".format(len(decoys)))

    def _validate_isoform_from_data(self):
        """
        Method that validates whether or not isoforms are able to be identified in the data files.
        """
        # Check to see if we find any proteins with isoform info in name in our input files
        proteins = [x.possible_proteins for x in self.main_data_form]
        flat_proteins = set([item for sublist in proteins for item in sublist])
        if self.parameter_file_object.isoform_symbol:
            non_iso = [x for x in flat_proteins if self.parameter_file_object.isoform_symbol not in x]

        else:
            non_iso = [x for x in flat_proteins]

        if self.parameter_file_object.isoform_symbol:
            iso = [x for x in flat_proteins if self.parameter_file_object.isoform_symbol in x]

        else:
            iso = []
        logger.info("Number of Non Isoform Labeled Proteins in Data Files: {}".format(len(non_iso)))
        logger.info("Number of Isoform Labeled Proteins in Data Files: {}".format(len(iso)))

    def _validate_reviewed_v_unreviewed(self):
        """
        Method that logs whether we can distinguish between reviewed and unreviewed protein identifiers
        in the digest.
        """
        # Check to see if we get reviewed prots in digest...
        reviewed_proteins = len(self.digest.swiss_prot_protein_set)
        proteins_in_digest = len(set(self.digest.protein_to_peptide_dictionary.keys()))
        unreviewed_proteins = proteins_in_digest - reviewed_proteins
        logger.info("Number of Total Proteins in Digest: {}".format(proteins_in_digest))
        logger.info("Number of Reviewed Proteins in Digest: {}".format(reviewed_proteins))
        logger.info("Number of Unreviewed Proteins in Digest: {}".format(unreviewed_proteins))

    @classmethod
    def sort_protein_strings(cls, protein_string_list, sp_proteins, decoy_symbol):
        """
        Method that sorts protein strings in the following order: Target Reviewed, Decoy Reviewed, Target Unreviewed,
         Decoy Unreviewed.

        Args:
            protein_string_list (list): List of Protein Strings.
            sp_proteins (set): Set of Reviewed Protein Strings.
            decoy_symbol (str): Symbol to denote a decoy protein identifier IE "##".

        Returns:
            list: List of sorted protein strings.

        Example:
            >>> list_of_group_objects = datastore.DataStore.sort_protein_strings(
            >>>     protein_string_list=protein_string_list, sp_proteins=sp_proteins, decoy_symbol="##"
            >>> )
        """

        our_target_sp_proteins = sorted([x for x in protein_string_list if x in sp_proteins and decoy_symbol not in x])
        our_decoy_sp_proteins = sorted([x for x in protein_string_list if x in sp_proteins and decoy_symbol in x])

        our_target_tr_proteins = sorted(
            [x for x in protein_string_list if x not in sp_proteins and decoy_symbol not in x]
        )
        our_decoy_tr_proteins = sorted([x for x in protein_string_list if x not in sp_proteins and decoy_symbol in x])

        identifiers_sorted = (
            our_target_sp_proteins + our_decoy_sp_proteins + our_target_tr_proteins + our_decoy_tr_proteins
        )

        return identifiers_sorted

    def input_has_q(self):
        """
        Method that checks to see if the input data has q values.
        """
        len_q = len([x.qvalue for x in self.main_data_form if x.qvalue])
        len_all = len(self.main_data_form)
        if len_q == len_all:
            status = True
            logger.info("Input has Q value; Can restrict by Q value")
        else:
            status = False
            logger.warning("Input does not have Q value; Cannot restrict by Q value")

        return status

    def input_has_pep(self):
        """
        Method that checks to see if the input data has pep values.
        """
        len_pep = len([x.pepvalue for x in self.main_data_form if x.pepvalue])
        len_all = len(self.main_data_form)
        if len_pep == len_all:
            status = True
            logger.info("Input has Pep value; Can restrict by Pep value")
        else:
            status = False
            logger.warning("Input does not have Pep value; Cannot restrict by Pep value")

        return status

    def input_has_custom(self):
        """
        Method that checks to see if the input data has custom score values.
        """
        len_c = len([x.custom_score for x in self.main_data_form if x.custom_score])
        len_all = len(self.main_data_form)
        if len_c == len_all:
            status = True
            logger.info("Input has Custom value; Can restrict by Custom value")

        else:
            status = False
            logger.warning("Input does not have Custom value; Cannot restrict by Custom value")

        return status

    def get_protein_objects(self, false_discovery_rate=None, fdr_restricted=False):
        """
        Method that retrieves protein objects: either an FDR-restricted list of protein objects,
        or all objects.

        Args:
            false_discovery_rate (float): FDR threshold used when fdr_restricted is True.
                Defaults to the FDR from the parameter file.
            fdr_restricted (bool): True/False on whether to restrict the list of objects based on FDR.

        Returns:
            list: List of scored [ProteinGroup][pyproteininference.physical.ProteinGroup] objects
                that have been grouped and sorted.

        """
        if not false_discovery_rate:
            false_discovery_rate = self.parameter_file_object.fdr
        if fdr_restricted:
            protein_objects = [x.proteins for x in self.protein_group_objects if x.q_value <= false_discovery_rate]
        else:
            protein_objects = self.grouped_scored_proteins

        return protein_objects

    def _init_validate(self, reader):
        """
        Internal Method that checks to make sure the reader object is properly loaded and validated.
        """
        if reader.psms:
            self.main_data_form = reader.psms  # Unrestricted PSM data
            self.restricted_peptides = [x.non_flanking_peptide for x in self.main_data_form]
        else:
            raise ValueError(
                "Psms variable from Reader object is either empty or does not exist. "
                "Make sure your files contain proper data and that you run the 'read_psms' "
                "method on your Reader object."
            )

    def _validate_main_data_form(self):
        """
        Internal Method that checks to make sure the Main data has been defined to run DataStore methods.
        """
        if self.main_data_form:
            pass
        else:
            raise ValueError(
                "Main Data is not defined, so this method cannot be run. Please make sure PSM data is properly"
                " loaded from the Reader object."
            )

    def _validate_main_data_restricted(self):
        """
        Internal Method that checks to make sure the Main data Restricted has been defined to run DataStore methods.
        """
        if self.main_data_restricted:
            pass
        else:
            raise ValueError(
                "Main Data Restricted is not defined, so this method cannot be run. Please make sure PSM data is"
                " properly loaded from the Reader object and run the DataStore method 'restrict_psm_data'."
            )

    def _validate_scored_proteins(self):
        """
        Internal Method that checks to make sure that proteins have been scored to run certain subsequent methods.
        """
        if self.picked_proteins_scored or self.scored_proteins:
            pass
        else:
            raise ValueError(
                "Proteins have not been scored. Please initialize a Score object and run a score method with the"
                " 'score_psms' instance method."
            )

    def _validate_scoring_input(self):
        """
        Internal Method that checks to make sure that Scoring Input has been created to be able to run scoring methods.
        """
        if self.scoring_input:
            pass
        else:
            raise ValueError(
                "Scoring input has not been created. Please run the 'create_scoring_input' method from the DataStore "
                "object to continue."
            )

    def _validate_protein_group_objects(self):
        """
        Internal Method that checks to make sure inference has been run before proceeding.
        """
        if self.protein_group_objects and self.grouped_scored_proteins:
            pass
        else:
            raise ValueError(
                "Either 'protein_group_objects' or 'grouped_scored_proteins' or both DataStore variables are undefined."
                " Please make sure you run an inference method from the Inference class before proceeding."
            )

    def generate_fdr_vs_target_hits(self, fdr_max=0.2):
        """
        Method for calculating FDR vs number of Target Proteins.

        Args:
            fdr_max (float): The maximum false discovery rate to calculate target hits for.
                Will stop once fdr_max is reached.

        Returns:
            list: List of lists of: (FDR, Number of Target Hits). Ordered by increasing number of Target Hits.

        """
        fdr_vs_count = []
        count_list = []
        for pg in self.protein_group_objects:
            if self.decoy_symbol not in pg.proteins[0].identifier:
                count_list.append(pg)
            fdr_vs_count.append([pg.q_value, len(count_list)])

        fdr_vs_count = [x for x in fdr_vs_count if x[0] < fdr_max]

        return fdr_vs_count

    def recover_mapping(self):
        logger.info("Recovering Proteins that exist in the input files but not in the database digest.")
        all_psms = self.get_psm_data()
        proteins = [x.possible_proteins for x in all_psms]
        flat_proteins = [item for sublist in proteins for item in sublist]

        missing_prots = []
        for prot in flat_proteins:
            try:
                self.digest.protein_to_peptide_dictionary[prot]
            except KeyError:
                missing_prots.append(prot)

                psm_data = self.get_psm_data()
                peptides = [x.stripped_peptide for x in psm_data if prot in x.possible_proteins]
                for pep in peptides:
                    self.digest.peptide_to_protein_dictionary.setdefault(pep, set()).add(prot)
                    self.digest.protein_to_peptide_dictionary.setdefault(prot, set()).add(pep)
        if missing_prots:
            logger.info(
                "{} proteins not found in mapping objects; please double check that the database"
                " provided is accurate for the given input data.".format(len(missing_prots))
            )
        else:
            logger.info("No missing proteins in the mapping objects.")

__init__(reader, digest, validate=True)

Parameters:
  • reader (Reader) –

    Reader object Reader.

  • digest (Digest) –

    Digest object Digest.

  • validate (bool, default: True ) –

    True/False to indicate if the input data should be validated.

Example

pyproteininference.datastore.DataStore(reader = reader, digest=digest)

Source code in pyproteininference/datastore.py
def __init__(self, reader, digest, validate=True):
    """

    Args:
        reader (Reader): Reader object [Reader][pyproteininference.reader.Reader].
        digest (Digest): Digest object
            [Digest][pyproteininference.in_silico_digest.Digest].
        validate (bool): True/False to indicate if the input data should be validated.

    Example:
        >>> pyproteininference.datastore.DataStore(reader = reader, digest=digest)


    """
    # If the reader class is from a percolator.psms then define main_data_form as reader.psms
    # main_data_form is the starting point for all other analyses
    self._init_validate(reader=reader)

    self.parameter_file_object = reader.parameter_file_object  # Parameter object
    self.main_data_restricted = None  # PSM data post restriction
    self.scored_proteins = []  # List of scored Protein objects
    self.grouped_scored_proteins = []  # List of sorted scored Protein objects
    self.scoring_input = None  # List of non scored Protein objects
    self.picked_proteins_scored = None  # List of Protein objects after picker algorithm
    self.picked_proteins_removed = None  # Protein objects removed via picker
    self.protein_peptide_dictionary = None
    self.peptide_protein_dictionary = None
    self.high_low_better = None  # Variable that indicates whether a higher or lower protein score is better
    self.psm_score = None  # PSM Score used
    self.protein_score = None
    self.short_protein_score = None
    self.protein_group_objects = []  # List of sorted protein group objects
    self.decoy_symbol = self.parameter_file_object.decoy_symbol  # Decoy symbol from parameter file
    self.digest = digest  # Digest object

    # Run Checks and Validations
    if validate:
        self.validate_psm_data()
        self.validate_digest()
        self.check_data_consistency()

    # Run method to fix our parameter object if necessary
    self.parameter_file_object.fix_parameters_from_datastore(data=self)

calculate_q_values(regular=True)

Method that calculates q-values (FDR) on the lead protein in each group on the protein_group_objects instance variable. FDR is calculated as (2*decoys)/total if regular is set to True and as (decoys)/total if regular is set to False.

This method updates the protein_group_objects for the DataStore object by updating the q_value variable of the ProteinGroup objects.

Returns:
  • None
Example

data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
# Data must be scored first
data.calculate_q_values()
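The target-decoy q-value pass can be sketched independently of the class. This minimal version (hypothetical lead protein identifiers, decoys marked with `##`) mirrors the drop-the-worst FDR loop and the running-minimum step described above:

```python
def calculate_q_values(lead_proteins, decoy_symbol="##", regular=True):
    """Assign q-values to a best-to-worst sorted list of lead protein identifiers."""
    worst_first = list(reversed(lead_proteins))
    factor = 2 if regular else 1
    fdr_list = []
    # FDR over the remaining list, dropping the worst entry each cycle
    while worst_first:
        decoys = sum(1 for p in worst_first if decoy_symbol in p)
        fdr_list.append(factor * decoys / len(worst_first))
        worst_first.pop(0)
    # q-value = running minimum of the FDRs seen so far (worst-to-best order)
    qvalues, running_min = [], float("inf")
    for fdr in fdr_list:
        running_min = min(running_min, fdr)
        qvalues.append(running_min)
    qvalues.reverse()  # back to best-to-worst order
    return qvalues

leads = ["P1", "P2", "##P3", "P4"]  # best to worst
print(calculate_q_values(leads))  # → [0.0, 0.0, 0.5, 0.5]
```

Note how the two best targets receive a q-value of 0.0: the running minimum propagates the FDR of the cleanest sublist down to every better-scoring protein.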

Source code in pyproteininference/datastore.py
def calculate_q_values(self, regular=True):
    """
    Method that calculates q-values (FDR) on the lead protein in each group on the `protein_group_objects`
    instance variable.
    FDR is calculated as (2*decoys)/total if regular is set to True and as
    (decoys)/total if regular is set to False.

    This method updates the `protein_group_objects` for the DataStore object by updating
    the q_value variable of the [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.

    Returns:
        None:

    Example:
        >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
        >>> # Data must be scored first
        >>> data.calculate_q_values()
    """

    self._validate_protein_group_objects()

    logger.info("Calculating Q values from the protein group objects")

    # pick out the lead scoring protein for each group... lead score is at 0 position
    lead_score = [x.proteins[0] for x in self.protein_group_objects]
    # Now pick out only the lead protein identifiers
    lead_proteins = [x.identifier for x in lead_score]

    lead_proteins.reverse()

    logger.info("Calculating FDRs")
    fdr_list = []
    for i in range(len(lead_proteins)):
        binary_decoy_target_list = [1 if self.decoy_symbol in elem else 0 for elem in lead_proteins]
        total = len(lead_proteins)
        decoys = sum(binary_decoy_target_list)
        # Calculate FDR at every step starting with the entire list...
        # Delete first entry (worst score) every time we go through a cycle
        if regular:
            fdr = (2 * decoys) / (float(total))
        else:
            fdr = (decoys) / (float(total))
        fdr_list.append(fdr)
        del lead_proteins[0]

    qvalue_list = []
    new_fdr_list = []
    logger.info("Calculating Q Values")
    for fdrs in fdr_list:
        new_fdr_list.append(fdrs)
        qvalue = min(new_fdr_list)
        # qvalue = fdrs
        qvalue_list.append(qvalue)

    qvalue_list.reverse()

    logger.info("Assigning Q Values")
    for k in range(len(self.protein_group_objects)):
        self.protein_group_objects[k].q_value = qvalue_list[k]

    fdr_restricted = [x for x in self.protein_group_objects if x.q_value <= self.parameter_file_object.fdr]

    fdr_restricted_set = [self.grouped_scored_proteins[x] for x in range(len(fdr_restricted))]

    onehitwonders = []
    for groups in fdr_restricted_set:
        if int(groups[0].num_peptides) == 1:
            onehitwonders.append(groups[0])

    logger.info(
        "Protein Group leads that pass with more than 1 PSM with a {} FDR = {}".format(
            self.parameter_file_object.fdr,
            str(len(fdr_restricted_set) - len(onehitwonders)),
        )
    )
    logger.info(
        "Protein Group lead One hit Wonders that pass {} FDR = {}".format(
            self.parameter_file_object.fdr, len(onehitwonders)
        )
    )

    logger.info(
        "Number of Protein groups that pass a {} percent FDR: {}".format(
            str(self.parameter_file_object.fdr * 100), len(fdr_restricted_set)
        )
    )

    logger.info("Finished Q value Calculation")

check_data_consistency()

Method that checks for data consistency.

Source code in pyproteininference/datastore.py
def check_data_consistency(self):
    """
    Method that checks for data consistency.
    """
    self._check_data_digest_overlap_psms()
    self._check_data_digest_overlap_proteins()

create_scoring_input()

Method to create the scoring input. This method initializes a list of Protein objects to get them ready to be scored by Score methods. This method also takes into account the inference type and aggregates peptides -> proteins accordingly.

This method sets the scoring_input and score Attributes for the DataStore object.

The score selected comes from the protein inference parameter object.

Returns:
  • None
Example

data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
data.create_scoring_input()
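For the peptide-centric path, proteins sharing a peptide are sorted and joined into a single group identifier. A condensed sketch of that ordering, equivalent to concatenating the four sorted sublists in `sort_protein_strings` (hypothetical identifiers, decoys marked with `##`, reviewed accessions in `sp_proteins`):

```python
def sort_protein_strings(protein_string_list, sp_proteins, decoy_symbol):
    """Order identifiers: target reviewed, decoy reviewed, target unreviewed, decoy unreviewed."""
    def bucket(p):
        reviewed = p in sp_proteins
        decoy = decoy_symbol in p
        # False sorts before True: reviewed first, targets before decoys,
        # alphabetical within each bucket
        return (not reviewed, decoy, p)
    return sorted(protein_string_list, key=bucket)

sp_proteins = {"P1", "##P2"}
group = sort_protein_strings(["Q9", "##P2", "P1", "##Q8"], sp_proteins, "##")
print(";".join(group))  # → P1;##P2;Q9;##Q8
```

The joined string then serves as the dictionary key under which all PSMs for that peptide are aggregated.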

Source code in pyproteininference/datastore.py
def create_scoring_input(self):
    """
    Method to create the scoring input.
    This method initializes a list of [Protein][pyproteininference.physical.Protein] objects to get them ready
    to be scored by [Score][pyproteininference.scoring.Score] methods.
    This method also takes into account the inference type and aggregates peptides -> proteins accordingly.

    This method sets the `scoring_input` and `score` Attributes for the DataStore object.

    The score selected comes from the protein inference parameter object.

    Returns:
        None:

    Example:
        >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
        >>> data.create_scoring_input()
    """

    logger.info("Creating Scoring Input")

    psm_data = self.get_psm_data()

    protein_psm_dict = collections.defaultdict(list)

    try:
        score_key = self.SCORE_MAPPER[self.parameter_file_object.psm_score]
    except KeyError:
        score_key = self.CUSTOM_SCORE_KEY

    if self.parameter_file_object.inference_type != Inference.PEPTIDE_CENTRIC:
        # Loop through all Psms
        for psms in psm_data:
            psms.assign_main_score(score=score_key)
            # Loop through all proteins
            for prots in psms.possible_proteins:
                protein_psm_dict[prots].append(psms)

    else:
        self.peptide_to_protein_dictionary()
        sp_proteins = self.digest.swiss_prot_protein_set
        for psms in psm_data:

            # Assign main score
            psms.assign_main_score(score=score_key)
            protein_set = self.peptide_protein_dictionary[psms.non_flanking_peptide]
            # Sort protein_set by sp-alpha, decoy-sp-alpha, tr-alpha, decoy-tr-alpha
            sorted_protein_list = self.sort_protein_strings(
                protein_string_list=protein_set,
                sp_proteins=sp_proteins,
                decoy_symbol=self.parameter_file_object.decoy_symbol,
            )
            # Restrict the number of identifiers by the value in param file max_identifiers_peptide_centric
            sorted_protein_list = sorted_protein_list[: self.parameter_file_object.max_identifiers_peptide_centric]
            protein_name = ";".join(sorted_protein_list)
            protein_psm_dict[protein_name].append(psms)

    protein_list = []
    for pkey in sorted(protein_psm_dict.keys()):
        protein_object = Protein(identifier=pkey)
        protein_object.psms = protein_psm_dict[pkey]
        protein_object.raw_peptides = set([x.identifier for x in protein_psm_dict[pkey]])
        protein_list.append(protein_object)

    self.psm_score = self.parameter_file_object.psm_score
    self.scoring_input = protein_list

exclude_non_distinguishing_peptides(protein_subset_type='hard')

Method to exclude peptides that are not distinguishing on either the search or database level.

The method sets the scoring_input and restricted_peptides variables for the DataStore object.

Parameters:
  • protein_subset_type (str, default: 'hard' ) –

    Either "hard" or "soft". Hard will select distinguishing peptides based on the database digestion. "soft" will only use peptides identified in the search.

Returns:
  • None
Example

data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
data.exclude_non_distinguishing_peptides(protein_subset_type="hard")
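The core of the exclusion model is removing any protein whose peptide set is a subset of another protein's peptide set. A standalone quadratic sketch of that idea (hypothetical peptide sets; the library's implementation instead pops from a working list of frozensets):

```python
def non_subset_indices(peptide_sets):
    """Indices of peptide sets that are not subsets of another set.

    Duplicate sets are collapsed to their first occurrence: a protein whose
    peptides are fully explained by another protein's peptides carries no
    distinguishing evidence and is excluded.
    """
    frozen = [frozenset(s) for s in peptide_sets]
    keep = []
    for i, s in enumerate(frozen):
        dominated = any(
            s < other or (s == other and j < i)  # proper subset, or later duplicate
            for j, other in enumerate(frozen) if j != i
        )
        if not dominated:
            keep.append(i)
    return keep

sets = [{"AB", "CD", "EF"}, {"AB", "CD"}, {"GH"}, {"AB", "CD", "EF"}]
print(non_subset_indices(sets))  # → [0, 2]
```

Index 1 is dominated because its peptides are a proper subset of index 0, and index 3 is a duplicate of index 0; only the non-subset sets survive scoring.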

Source code in pyproteininference/datastore.py
def exclude_non_distinguishing_peptides(self, protein_subset_type="hard"):
    """
    Method to exclude peptides that are not distinguishing on either the search or database level.

    The method sets the `scoring_input` and `restricted_peptides` variables for the DataStore object.

    Args:
        protein_subset_type (str): Either "hard" or "soft". Hard will select distinguishing peptides based on
            the database digestion. "soft" will only use peptides identified in the search.

    Returns:
        None:

    Example:
        >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
        >>> data.exclude_non_distinguishing_peptides(protein_subset_type="hard")
    """

    logger.info("Applying Exclusion Model")

    our_proteins_sorted = self.get_sorted_identifiers(scored=False)

    if protein_subset_type == "hard":
        # Hard protein subsetting defines protein subsets on the digest level (Entire protein is used)
        # This is how Percolator PI does subsetting
        peptides = [self.digest.protein_to_peptide_dictionary[x] for x in our_proteins_sorted]
    elif protein_subset_type == "soft":
        # Soft protein subsetting defines protein subsets on the Peptides identified from the search
        peptides = [set(x.raw_peptides) for x in self.scoring_input]
    else:
        # If neither is defined we default to "hard" exclusion
        peptides = [self.digest.protein_to_peptide_dictionary[x] for x in our_proteins_sorted]

    # Get frozen set of peptides....
    # We will also have a corresponding list of proteins...
    # They will have the same index...
    peptide_sets = [frozenset(e) for e in peptides]
    # Find a way to sort this list of sets...
    # We can sort the sets if we sort proteins from above...
    logger.info("{} number of peptide sets".format(len(peptide_sets)))
    non_subset_peptide_sets = set()
    i = 0
    # Get all peptide sets that are not a subset...
    while peptide_sets:
        i = i + 1
        peptide_set = peptide_sets.pop()
        if any(peptide_set.issubset(s) for s in peptide_sets) or any(
            peptide_set.issubset(s) for s in non_subset_peptide_sets
        ):
            continue
        else:
            non_subset_peptide_sets.add(peptide_set)
        if i % 10000 == 0:
            logger.info("Parsed {} Peptide Sets".format(i))

    logger.info("Parsed {} Peptide Sets".format(i))

    # Get their index from peptides which is the initial list of sets...
    list_of_indices = []
    for pep_set in non_subset_peptide_sets:
        ind = peptides.index(pep_set)
        list_of_indices.append(ind)

    non_subset_proteins = set([our_proteins_sorted[x] for x in list_of_indices])

    logger.info("Removing direct subset Proteins from the data")
    # Remove all proteins from scoring input that are a subset of another protein...
    self.scoring_input = [x for x in self.scoring_input if x.identifier in non_subset_proteins]

    logger.info("{} proteins in scoring input after removing subset proteins".format(len(self.scoring_input)))

    # For all the proteins that are not a complete subset of another protein...
    # Get the raw peptides...
    raw_peps = [x.raw_peptides for x in self.scoring_input if x.identifier in non_subset_proteins]

    # Make the raw peptides a flat list
    flat_peptides = [Psm.split_peptide(peptide_string=item) for sublist in raw_peps for item in sublist]

    # Count the number of peptides in this list...
    # This is the number of proteins this peptide maps to....
    counted_peptides = collections.Counter(flat_peptides)

    # If the count is greater than 1... exclude the protein entirely from scoring input... :)
    raw_peps_good = set([x for x in counted_peptides.keys() if counted_peptides[x] <= 1])

    # Alter self.scoring_input by removing psms and peptides that are not in raw_peps_good
    current_score_input = list(self.scoring_input)
    for j in range(len(current_score_input)):
        k = j + 1
        psm_list = []
        new_raw_peptides = []
        current_psms = current_score_input[j].psms
        current_raw_peptides = current_score_input[j].raw_peptides

        for psm_scores in current_psms:
            if psm_scores.non_flanking_peptide in raw_peps_good:
                psm_list.append(psm_scores)

        for rp in current_raw_peptides:
            if Psm.split_peptide(peptide_string=rp) in raw_peps_good:
                new_raw_peptides.append(rp)

        current_score_input[j].psms = psm_list
        current_score_input[j].raw_peptides = new_raw_peptides

        if k % 10000 == 0:
            logger.info("Redefined {} Peptide Sets".format(k))

    logger.info("Redefined {} Peptide Sets".format(len(current_score_input)))

    filtered_score_input = [x for x in current_score_input if x.psms]

    self.scoring_input = filtered_score_input

    # Recompute the flat peptides
    raw_peps = [x.raw_peptides for x in self.scoring_input if x.identifier in non_subset_proteins]

    # Make the raw peptides a flat list
    new_flat_peptides = set([Psm.split_peptide(peptide_string=item) for sublist in raw_peps for item in sublist])

    self.scoring_input = [x for x in self.scoring_input if x.psms]

    self.restricted_peptides = [x for x in self.restricted_peptides if x in new_flat_peptides]
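
The subset-and-exclusion idea above can be sketched independently of the library: proteins whose peptide set is a proper subset of another protein's peptide set are dropped, and then only peptides that map to exactly one remaining protein are kept. The function name and input shape below are illustrative assumptions, not part of the pyproteininference API.

```python
import collections

def exclude_subset_proteins(protein_to_peptides):
    """protein_to_peptides: dict mapping protein id -> set of peptide strings."""
    peptide_sets = {p: frozenset(peps) for p, peps in protein_to_peptides.items()}
    # Drop any protein whose peptide set is a proper subset of another protein's set
    non_subset = {
        prot: peps
        for prot, peps in peptide_sets.items()
        if not any(peps < other for o, other in peptide_sets.items() if o != prot)
    }
    # Count how many remaining proteins each peptide maps to
    counts = collections.Counter(pep for peps in non_subset.values() for pep in peps)
    # Keep only distinguishing peptides (those mapping to exactly one protein)
    distinguishing = {pep for pep, c in counts.items() if c == 1}
    return {prot: peps & distinguishing for prot, peps in non_subset.items()}
```

Here a protein whose peptides are fully covered by a larger protein contributes nothing distinguishing, which mirrors the "hard" exclusion path described above.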

generate_fdr_vs_target_hits(fdr_max=0.2)

Method for calculating FDR vs number of Target Proteins.

Parameters:
  • fdr_max (float, default: 0.2 ) –

    The maximum false discovery rate to calculate target hits for. Will stop once fdr_max is reached.

Returns:
  • list

    List of lists of: (FDR, Number of Target Hits). Ordered by increasing number of Target Hits.

Source code in pyproteininference/datastore.py
def generate_fdr_vs_target_hits(self, fdr_max=0.2):
    """
    Method for calculating FDR vs number of Target Proteins.

    Args:
        fdr_max (float): The maximum false discovery rate to calculate target hits for.
            Will stop once fdr_max is reached.

    Returns:
        list: List of lists of: (FDR, Number of Target Hits). Ordered by increasing number of Target Hits.

    """
    fdr_vs_count = []
    count_list = []
    for pg in self.protein_group_objects:
        if self.decoy_symbol not in pg.proteins[0].identifier:
            count_list.append(pg)
        fdr_vs_count.append([pg.q_value, len(count_list)])

    fdr_vs_count = [x for x in fdr_vs_count if x[0] < fdr_max]

    return fdr_vs_count
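
The FDR-vs-target-hits walk can be sketched from a best-to-worst sorted list of protein groups, each reduced here to a hypothetical (q_value, is_decoy) pair; the name and input shape are assumptions for illustration, not the library's API.

```python
def fdr_vs_target_hits(groups, fdr_max=0.2):
    """groups: list of (q_value, is_decoy) pairs sorted best-to-worst."""
    curve = []
    target_hits = 0
    for q_value, is_decoy in groups:
        if not is_decoy:
            # Only target (non-decoy) lead proteins increment the hit count
            target_hits += 1
        curve.append([q_value, target_hits])
    # Keep only points below the maximum FDR of interest
    return [pair for pair in curve if pair[0] < fdr_max]
```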

get_pep_values()

Method to retrieve a list of all posterior error probabilities for all PSMs.

Returns:
  • list

    list of floats (pep values).

Example

data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
pep = data.get_pep_values()

Source code in pyproteininference/datastore.py
def get_pep_values(self):
    """
    Method to retrieve a list of all posterior error probabilities for all PSMs.

    Returns:
        list: list of floats (pep values).

    Example:
        >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
        >>> pep = data.get_pep_values()
    """
    psm_data = self.get_psm_data()

    pep_values = [x.pepvalue for x in psm_data]

    return pep_values

get_protein_data()

Method to retrieve a list of Protein objects. Retrieves picked and scored data if the data has been picked and scored or just the scored data if the data has not been picked.

Returns:
  • list

    list of Protein objects.

Example

data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)

Data must be run through a pyproteininference.scoring.Score method

protein_data = data.get_protein_data()

Source code in pyproteininference/datastore.py
def get_protein_data(self):
    """
    Method to retrieve a list of [Protein][pyproteininference.physical.Protein] objects.
    Retrieves picked and scored data if the data has been picked and scored or just the scored data if the data has
     not been picked.

    Returns:
        list: list of [Protein][pyproteininference.physical.Protein] objects.

    Example:
        >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
        >>> # Data must be run through a pyproteininference.scoring.Score method
        >>> protein_data = data.get_protein_data()
    """

    if self.picked_proteins_scored:
        scored_proteins = self.picked_proteins_scored
    else:
        scored_proteins = self.scored_proteins

    return scored_proteins

get_protein_identifiers(data_form)

Method to retrieve the protein string identifiers.

Parameters:
  • data_form (str) –

    Can be one of the following: "main", "restricted", "picked", "picked_removed".

Returns:
  • list

    list of protein identifier strings.

Example

data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
protein_strings = data.get_protein_identifiers(data_form="main")

Source code in pyproteininference/datastore.py
def get_protein_identifiers(self, data_form):
    """
    Method to retrieve the protein string identifiers.

    Args:
        data_form (str): Can be one of the following: "main", "restricted", "picked", "picked_removed".

    Returns:
        list: list of protein identifier strings.

    Example:
        >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
        >>> protein_strings = data.get_protein_identifiers(data_form="main")
    """
    if data_form == "main":
        # All the data (unrestricted)
        data_to_select = self.main_data_form
        prots = [[x.possible_proteins] for x in data_to_select]
        proteins = prots

    if data_form == "restricted":
        # Proteins that pass certain restriction criteria (peptide length, pep, qvalue)
        data_to_select = self.main_data_restricted
        prots = [[x.possible_proteins] for x in data_to_select]
        proteins = prots

    if data_form == "picked":
        # Here we look at proteins that are 'picked' (aka the proteins that beat out their matching target/decoy)
        data_to_select = self.picked_proteins_scored
        prots = [x.identifier for x in data_to_select]
        proteins = prots

    if data_form == "picked_removed":
        # Here we look at the proteins that were removed due to picking (aka the proteins that
        # have a worse score than their target/decoy counterpart)
        data_to_select = self.picked_proteins_removed
        prots = [x.identifier for x in data_to_select]
        proteins = prots

    return proteins

get_protein_identifiers_from_psm_data()

Method to retrieve a list of lists of all possible protein identifiers from the psm data.

Returns:
  • list

    list of lists of protein strings.

Example

data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
protein_strings = data.get_protein_identifiers_from_psm_data()

Source code in pyproteininference/datastore.py
def get_protein_identifiers_from_psm_data(self):
    """
    Method to retrieve a list of lists of all possible protein identifiers from the psm data.

    Returns:
        list: list of lists of protein strings.

    Example:
        >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
        >>> protein_strings = data.get_protein_identifiers_from_psm_data()
    """
    psm_data = self.get_psm_data()

    proteins = [x.possible_proteins for x in psm_data]

    return proteins

get_protein_information(protein_string)

Method to retrieve attributes for a specific scored protein.

Parameters:
  • protein_string (str) –

    Protein Identifier String.

Returns:
  • list

    list of protein attributes.

Example

data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
protein_attr = data.get_protein_information(protein_string="RAF1_HUMAN|P04049")

Source code in pyproteininference/datastore.py
def get_protein_information(self, protein_string):
    """
    Method to retrieve attributes for a specific scored protein.

    Args:
        protein_string (str): Protein Identifier String.

    Returns:
        list: list of protein attributes.

    Example:
        >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
        >>> protein_attr = data.get_protein_information(protein_string="RAF1_HUMAN|P04049")
    """
    all_scored_protein_data = self.scored_proteins
    identifiers = [x.identifier for x in all_scored_protein_data]
    protein_scores = [x.score for x in all_scored_protein_data]
    groups = [x.group_identification for x in all_scored_protein_data]
    reviewed = [x.reviewed for x in all_scored_protein_data]
    peptides = [x.peptides for x in all_scored_protein_data]
    # Peptide scores currently broken...
    peptide_scores = [x.peptide_scores for x in all_scored_protein_data]
    picked = [x.picked for x in all_scored_protein_data]
    num_peptides = [x.num_peptides for x in all_scored_protein_data]

    main_index = identifiers.index(protein_string)

    list_structure = [
        [
            "identifier",
            "protein_score",
            "groups",
            "reviewed",
            "peptides",
            "peptide_scores",
            "picked",
            "num_peptides",
        ]
    ]
    list_structure.append([protein_string])
    list_structure[-1].append(protein_scores[main_index])
    list_structure[-1].append(groups[main_index])
    list_structure[-1].append(reviewed[main_index])
    list_structure[-1].append(peptides[main_index])
    list_structure[-1].append(peptide_scores[main_index])
    list_structure[-1].append(picked[main_index])
    list_structure[-1].append(num_peptides[main_index])

    return list_structure

get_protein_information_dictionary()

Method to retrieve a dictionary of PSM score information for each protein.

Returns:
  • dict

    dictionary of scores for each protein.

Example

data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
protein_dict = data.get_protein_information_dictionary()

Source code in pyproteininference/datastore.py
def get_protein_information_dictionary(self):
    """
    Method to retrieve a dictionary of PSM score information for each protein.

    Returns:
        dict: dictionary of scores for each protein.

    Example:
        >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
        >>> protein_dict = data.get_protein_information_dictionary()
    """
    psm_data = self.get_psm_data()

    protein_psm_score_dictionary = collections.defaultdict(list)

    # Loop through all Psms
    for psms in psm_data:
        # Loop through all proteins
        for prots in psms.possible_proteins:
            protein_psm_score_dictionary[prots].append(
                {
                    "peptide": psms.identifier,
                    "Qvalue": psms.qvalue,
                    "PosteriorErrorProbability": psms.pepvalue,
                    "Percscore": psms.percscore,
                }
            )

    return protein_psm_score_dictionary
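
The mapping built above can be sketched from flat PSM records; the function name and the (peptide, proteins, qvalue, pep, percscore) tuple shape are hypothetical stand-ins for the Psm objects used by the library.

```python
import collections

def protein_psm_scores(psms):
    """psms: iterable of (peptide, possible_proteins, qvalue, pepvalue, percscore) tuples."""
    d = collections.defaultdict(list)
    for peptide, proteins, qvalue, pep, percscore in psms:
        # Each protein a PSM could map to collects that PSM's score record
        for prot in proteins:
            d[prot].append({"peptide": peptide, "Qvalue": qvalue,
                            "PosteriorErrorProbability": pep, "Percscore": percscore})
    return d
```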

get_protein_objects(false_discovery_rate=None, fdr_restricted=False)

Method to retrieve protein objects. Either retrieves an FDR-restricted list of protein objects, or retrieves all objects.

Parameters:
  • false_discovery_rate (float, default: None ) –

    The false discovery rate to restrict by. Defaults to the FDR from the parameter file object if not supplied.

  • fdr_restricted (bool, default: False ) –

    True/False on whether to restrict the list of objects based on FDR.

Returns:
  • list

    List of scored ProteinGroup objects that have been grouped and sorted.

Source code in pyproteininference/datastore.py
def get_protein_objects(self, false_discovery_rate=None, fdr_restricted=False):
    """
    Method to retrieve protein objects. Either retrieves an FDR-restricted list of protein objects,
    or retrieves all objects.

    Args:
        false_discovery_rate (float): FDR threshold to restrict by; defaults to the parameter file FDR
            if not supplied.
        fdr_restricted (bool): True/False on whether to restrict the list of objects based on FDR.

    Returns:
        list: List of scored [ProteinGroup][pyproteininference.physical.ProteinGroup] objects
            that have been grouped and sorted.

    """
    if not false_discovery_rate:
        false_discovery_rate = self.parameter_file_object.fdr
    if fdr_restricted:
        protein_objects = [x.proteins for x in self.protein_group_objects if x.q_value <= false_discovery_rate]
    else:
        protein_objects = self.grouped_scored_proteins

    return protein_objects
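
The FDR-restricted branch above keeps only groups at or below the threshold. A minimal sketch, assuming protein groups are given as hypothetical (q_value, proteins) pairs sorted best-to-worst:

```python
def fdr_restricted_proteins(protein_groups, false_discovery_rate):
    """protein_groups: list of (q_value, proteins) pairs; keep groups within the FDR threshold."""
    return [proteins for q_value, proteins in protein_groups if q_value <= false_discovery_rate]
```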

get_psm_data()

Method to retrieve a list of Psm objects. Retrieves restricted data if the data has been restricted or all of the data if the data has not been restricted.

Returns:
  • list

    list of Psm objects.

Example

data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
psm_data = data.get_psm_data()

Source code in pyproteininference/datastore.py
def get_psm_data(self):
    """
    Method to retrieve a list of [Psm][pyproteininference.physical.Psm] objects.
    Retrieves restricted data if the data has been restricted or all of the data if the data has
    not been restricted.

    Returns:
        list: list of [Psm][pyproteininference.physical.Psm] objects.

    Example:
        >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
        >>> psm_data = data.get_psm_data()
    """
    if not self.main_data_restricted and not self.main_data_form:
        raise ValueError(
            "Both main_data_restricted and main_data_form variables are empty. Please re-load the DataStore "
            "object with a properly loaded Reader object."
        )

    if self.main_data_restricted:
        psm_data = self.main_data_restricted
    else:
        psm_data = self.main_data_form

    return psm_data

get_q_values()

Method to retrieve a list of all q values for all PSMs.

Returns:
  • list

    list of floats (q values).

Example

data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
q = data.get_q_values()

Source code in pyproteininference/datastore.py
def get_q_values(self):
    """
    Method to retrieve a list of all q values for all PSMs.

    Returns:
        list: list of floats (q values).

    Example:
        >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
        >>> q = data.get_q_values()
    """
    psm_data = self.get_psm_data()

    q_values = [x.qvalue for x in psm_data]

    return q_values

get_sorted_identifiers(scored=True)

Retrieves a sorted list of protein strings present in the analysis.

Parameters:
  • scored (bool, default: True ) –

    True/False to indicate if we should return scored or non-scored identifiers.

Returns:
  • list

    List of sorted protein identifier strings.

Example

data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
sorted_proteins = data.get_sorted_identifiers(scored=True)

Source code in pyproteininference/datastore.py
def get_sorted_identifiers(self, scored=True):
    """
    Retrieves a sorted list of protein strings present in the analysis.

    Args:
        scored (bool): True/False to indicate if we should return scored or non-scored identifiers.

    Returns:
        list: List of sorted protein identifier strings.

    Example:
        >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
        >>> sorted_proteins = data.get_sorted_identifiers(scored=True)
    """

    if scored:
        self._validate_scored_proteins()
        if self.picked_proteins_scored:
            proteins = set([x.identifier for x in self.picked_proteins_scored])
        else:
            proteins = set([x.identifier for x in self.scored_proteins])
    else:
        self._validate_scoring_input()
        proteins = [x.identifier for x in self.scoring_input]

    all_sp_proteins = set(self.digest.swiss_prot_protein_set)

    our_target_sp_proteins = sorted([x for x in proteins if x in all_sp_proteins and self.decoy_symbol not in x])
    our_decoy_sp_proteins = sorted([x for x in proteins if x in all_sp_proteins and self.decoy_symbol in x])

    our_target_tr_proteins = sorted(
        [x for x in proteins if x not in all_sp_proteins and self.decoy_symbol not in x]
    )
    our_decoy_tr_proteins = sorted([x for x in proteins if x not in all_sp_proteins and self.decoy_symbol in x])

    our_proteins_sorted = (
        our_target_sp_proteins + our_decoy_sp_proteins + our_target_tr_proteins + our_decoy_tr_proteins
    )

    return our_proteins_sorted
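
The four-bucket ordering above (SwissProt targets, SwissProt decoys, TrEMBL targets, TrEMBL decoys, each alphabetical) can be sketched as below; the identifiers and decoy symbol are illustrative assumptions.

```python
def sorted_identifiers(proteins, swiss_prot_set, decoy_symbol="##"):
    """Order identifiers: SP targets, SP decoys, non-SP targets, non-SP decoys."""
    sp_targets = sorted(p for p in proteins if p in swiss_prot_set and decoy_symbol not in p)
    sp_decoys = sorted(p for p in proteins if p in swiss_prot_set and decoy_symbol in p)
    tr_targets = sorted(p for p in proteins if p not in swiss_prot_set and decoy_symbol not in p)
    tr_decoys = sorted(p for p in proteins if p not in swiss_prot_set and decoy_symbol in p)
    return sp_targets + sp_decoys + tr_targets + tr_decoys
```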

higher_or_lower()

Method to determine if a higher or lower score is better for a given combination of score input and score type.

This method sets the high_low_better Attribute for the DataStore object.

This method depends on the output from the Score class to be sorted properly from best to worst score.

Returns:
  • str

    String indicating "higher" or "lower" depending on if a higher or lower score is a better protein score.

Example

data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
high_low = data.higher_or_lower()

Source code in pyproteininference/datastore.py
def higher_or_lower(self):
    """
    Method to determine if a higher or lower score is better for a given combination of score input and score type.

    This method sets the `high_low_better` Attribute for the DataStore object.

    This method depends on the output from the Score class to be sorted properly from best to worst score.

    Returns:
        str: String indicating "higher" or "lower" depending on if a higher or lower score is a
            better protein score.

    Example:
        >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
        >>> high_low = data.higher_or_lower()
    """

    if not self.high_low_better:
        logger.info("Determining If a higher or lower score is better based on scored proteins")
        worst_score = self.scored_proteins[-1].score
        best_score = self.scored_proteins[0].score

        if float(best_score) > float(worst_score):
            higher_or_lower = self.HIGHER_PSM_SCORE

        if float(best_score) < float(worst_score):
            higher_or_lower = self.LOWER_PSM_SCORE

        logger.info("best score = {}".format(best_score))
        logger.info("worst score = {}".format(worst_score))

        if best_score == worst_score:
            raise ValueError(
                "Best and Worst scores were identical, equal to {}. Score type {} produced the error, "
                "please change psm_score type.".format(best_score, self.psm_score)
            )

        self.high_low_better = higher_or_lower

    else:
        higher_or_lower = self.high_low_better

    return higher_or_lower
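
The decision above reduces to comparing the first (best) and last (worst) scores of an already-sorted list. A minimal sketch, assuming a plain list of scores rather than Protein objects:

```python
def higher_or_lower(sorted_scores):
    """sorted_scores: protein scores ordered best-to-worst."""
    best, worst = float(sorted_scores[0]), float(sorted_scores[-1])
    if best == worst:
        # Mirrors the error case: an uninformative score type cannot be oriented
        raise ValueError("Best and worst scores are identical; change psm_score type.")
    return "higher" if best > worst else "lower"
```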

input_has_custom()

Method that checks to see if the input data has custom score values.

Source code in pyproteininference/datastore.py
def input_has_custom(self):
    """
    Method that checks to see if the input data has custom score values.
    """
    len_c = len([x.custom_score for x in self.main_data_form if x.custom_score])
    len_all = len(self.main_data_form)
    if len_c == len_all:
        status = True
        logger.info("Input has Custom value; Can restrict by Custom value")

    else:
        status = False
        logger.warning("Input does not have Custom value; Cannot restrict by Custom value")

    return status

input_has_pep()

Method that checks to see if the input data has pep values.

Source code in pyproteininference/datastore.py
def input_has_pep(self):
    """
    Method that checks to see if the input data has pep values.
    """
    len_pep = len([x.pepvalue for x in self.main_data_form if x.pepvalue])
    len_all = len(self.main_data_form)
    if len_pep == len_all:
        status = True
        logger.info("Input has Pep value; Can restrict by Pep value")
    else:
        status = False
        logger.warning("Input does not have Pep value; Cannot restrict by Pep value")

    return status

input_has_q()

Method that checks to see if the input data has q values.

Source code in pyproteininference/datastore.py
def input_has_q(self):
    """
    Method that checks to see if the input data has q values.
    """
    len_q = len([x.qvalue for x in self.main_data_form if x.qvalue])
    len_all = len(self.main_data_form)
    if len_q == len_all:
        status = True
        logger.info("Input has Q value; Can restrict by Q value")
    else:
        status = False
        logger.warning("Input does not have Q value; Cannot restrict by Q value")

    return status
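
All three checks (input_has_q, input_has_pep, input_has_custom) share the same pattern: restriction by a field is only allowed when every PSM carries a truthy value for it. A minimal sketch over hypothetical dict-shaped PSM records:

```python
def field_is_complete(psms, field):
    """True only if every PSM record has a truthy value for `field`."""
    return sum(1 for psm in psms if psm.get(field)) == len(psms)
```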

peptide_to_protein_dictionary()

Method that returns a map of peptide strings to sets of protein strings and is essentially half of a BiPartite graph. This method sets the peptide_protein_dictionary Attribute for the DataStore object.

Returns:
  • collections.defaultdict

    Dictionary of peptide strings (keys) that map to sets of protein strings based on the peptides and proteins found in the search. Peptide -> set(Proteins).

Example

data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
peptide_protein_dict = data.peptide_to_protein_dictionary()

Source code in pyproteininference/datastore.py
def peptide_to_protein_dictionary(self):
    """
    Method that returns a map of peptide strings to sets of protein strings and is essentially half of a
    BiPartite graph.
    This method sets the `peptide_protein_dictionary` Attribute for the DataStore object.

    Returns:
        collections.defaultdict: Dictionary of peptide strings (keys) that map to sets of protein strings based
            on the peptides and proteins found in the search. Peptide -> set(Proteins).

    Example:
        >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
        >>> peptide_protein_dict = data.peptide_to_protein_dictionary()
    """
    psm_data = self.get_psm_data()

    res_pep_set = set(self.restricted_peptides)
    default_dict_peptides = collections.defaultdict(set)
    for peptide_objects in psm_data:
        for prots in peptide_objects.possible_proteins:
            cur_peptide = peptide_objects.non_flanking_peptide
            if cur_peptide in res_pep_set:
                default_dict_peptides[cur_peptide].add(prots)
            else:
                pass

    self.peptide_protein_dictionary = default_dict_peptides

    return default_dict_peptides
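
Construction of the peptide-to-protein half of the bipartite graph can be sketched from (peptide, possible_proteins) pairs restricted to an allowed peptide set; the names and input shape are hypothetical.

```python
import collections

def peptide_to_protein_map(psms, restricted_peptides):
    """psms: iterable of (peptide, possible_proteins) pairs."""
    allowed = set(restricted_peptides)
    mapping = collections.defaultdict(set)
    for peptide, possible_proteins in psms:
        # Only peptides surviving restriction contribute edges to the graph
        if peptide in allowed:
            mapping[peptide].update(possible_proteins)
    return mapping
```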

protein_picker()

Method to run the protein picker algorithm.

Proteins must be scored first with score_psms.

The algorithm will match target and decoy proteins identified from the PSMs from the search. If a target and matching decoy is found then target/decoy competition is performed. In the Target/Decoy pair the protein with the better score is kept and the one with the worse score is discarded from the analysis.

The method sets the picked_proteins_scored and picked_proteins_removed variables for the DataStore object.

Returns:
  • None
Example

data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
data.protein_picker()

Source code in pyproteininference/datastore.py
def protein_picker(self):
    """
    Method to run the protein picker algorithm.

    Proteins must be scored first with [score_psms][pyproteininference.scoring.Score.score_psms].

    The algorithm will match target and decoy proteins identified from the PSMs from the search.
    If a target and matching decoy is found then target/decoy competition is performed.
    In the Target/Decoy pair the protein with the better score is kept and the one with the worse score is
    discarded from the analysis.

    The method sets the `picked_proteins_scored` and `picked_proteins_removed` variables for
    the DataStore object.

    Returns:
        None:

    Example:
        >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
        >>> data.protein_picker()
    """

    self._validate_scored_proteins()

    logger.info("Running Protein Picker")

    # Use higher or lower class to determine if a higher protein score or lower protein score is better
    # based on the scoring method used
    higher_or_lower = self.higher_or_lower()
    # Here we determine if a lower or higher score is better
    # Since all input is ordered from best to worst we can do the following

    index_to_remove = []
    # data.scored_proteins is simply a list of Protein objects...
    # Create list of all decoy proteins
    decoy_proteins = [x.identifier for x in self.scored_proteins if self.decoy_symbol in x.identifier]
    # Create a list of all potential matching targets (some of these may not exist in the search)
    matching_targets = [x.replace(self.decoy_symbol, "") for x in decoy_proteins]

    # Create a list of all the proteins from the scored data
    all_proteins = [x.identifier for x in self.scored_proteins]
    logger.info("{} proteins scored".format(len(all_proteins)))

    total_targets = []
    total_decoys = []
    decoys_removed = []
    targets_removed = []
    # Loop over all decoys identified in the search
    logger.info("Picking Proteins...")
    for i in range(len(decoy_proteins)):
        cur_decoy_index = all_proteins.index(decoy_proteins[i])
        cur_decoy_protein_object = self.scored_proteins[cur_decoy_index]
        total_decoys.append(cur_decoy_protein_object.identifier)

        # Try, Except here because the matching target to the decoy may not be a result from the search
        try:
            cur_target_index = all_proteins.index(matching_targets[i])
            cur_target_protein_object = self.scored_proteins[cur_target_index]
            total_targets.append(cur_target_protein_object.identifier)

            if higher_or_lower == self.HIGHER_PSM_SCORE:
                if cur_target_protein_object.score > cur_decoy_protein_object.score:
                    index_to_remove.append(cur_decoy_index)
                    decoys_removed.append(cur_decoy_index)
                    cur_target_protein_object.picked = True
                    cur_decoy_protein_object.picked = False
                else:
                    index_to_remove.append(cur_target_index)
                    targets_removed.append(cur_target_index)
                    cur_decoy_protein_object.picked = True
                    cur_target_protein_object.picked = False

            if higher_or_lower == self.LOWER_PSM_SCORE:
                if cur_target_protein_object.score < cur_decoy_protein_object.score:
                    index_to_remove.append(cur_decoy_index)
                    decoys_removed.append(cur_decoy_index)
                    cur_target_protein_object.picked = True
                    cur_decoy_protein_object.picked = False
                else:
                    index_to_remove.append(cur_target_index)
                    targets_removed.append(cur_target_index)
                    cur_decoy_protein_object.picked = True
                    cur_target_protein_object.picked = False
        except ValueError:
            pass

    logger.info("{} total decoy proteins".format(len(total_decoys)))
    logger.info("{} matching target proteins also found in search".format(len(total_targets)))
    logger.info("{} decoy proteins to be removed".format(len(decoys_removed)))
    logger.info("{} target proteins to be removed".format(len(targets_removed)))

    logger.info("Removing Lower Scoring Proteins...")
    picked_list = []
    removed_proteins = []
    for protein_objects in self.scored_proteins:
        if protein_objects.picked:
            picked_list.append(protein_objects)
        else:
            removed_proteins.append(protein_objects)
    self.picked_proteins_scored = picked_list
    self.picked_proteins_removed = removed_proteins
    logger.info("Finished Removing Proteins")

protein_to_peptide_dictionary()

Method that returns a map of protein strings to sets of peptide strings and is essentially half of a BiPartite graph. This method sets the protein_peptide_dictionary Attribute for the DataStore object.

Returns:
  • collections.defaultdict

    Dictionary of protein strings (keys) that map to sets of peptide strings based on the peptides and proteins found in the search. Protein -> set(Peptides).

Example

data = pyproteininference.datastore.DataStore(reader=reader, digest=digest)
protein_peptide_dict = data.protein_to_peptide_dictionary()
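The mapping logic above can be sketched stand-alone with `collections.defaultdict`; the toy records below stand in for `Psm` objects and are illustrative only, not the library's API:

```python
import collections

# Hypothetical minimal PSM records: (peptide, possible_proteins) pairs standing
# in for Psm objects with non_flanking_peptide and possible_proteins attributes.
psm_records = [
    ("PEPTIDEK", ["PROT_A", "PROT_B"]),
    ("ANOTHERK", ["PROT_A"]),
    ("FILTEREDK", ["PROT_C"]),
]
# Peptides that survived restrict_psm_data; FILTEREDK did not.
restricted_peptides = {"PEPTIDEK", "ANOTHERK"}

# Build half of the bipartite graph: protein -> set(peptides), skipping
# peptides that were filtered out during restriction.
protein_peptide_dictionary = collections.defaultdict(set)
for peptide, possible_proteins in psm_records:
    if peptide in restricted_peptides:
        for protein in possible_proteins:
            protein_peptide_dictionary[protein].add(peptide)
```

Proteins whose only evidence was filtered out simply never appear as keys.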

Source code in pyproteininference/datastore.py
def protein_to_peptide_dictionary(self):
    """
    Method that returns a map of protein strings to sets of peptide strings and is essentially half
     of a BiPartite graph.
    This method sets the `protein_peptide_dictionary` Attribute for the DataStore object.

    Returns:
        collections.defaultdict: Dictionary of protein strings (keys) that map to sets of peptide strings based
        on the peptides and proteins found in the search. Protein -> set(Peptides).

    Example:
        >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
        >>> protein_peptide_dict = data.protein_to_peptide_dictionary()
    """
    psm_data = self.get_psm_data()

    res_pep_set = set(self.restricted_peptides)
    default_dict_proteins = collections.defaultdict(set)
    for peptide_objects in psm_data:
        for prots in peptide_objects.possible_proteins:
            cur_peptide = peptide_objects.non_flanking_peptide
            if cur_peptide in res_pep_set:
                default_dict_proteins[prots].add(cur_peptide)

    self.protein_peptide_dictionary = default_dict_proteins

    return default_dict_proteins

restrict_psm_data(remove1pep=True)

Method to restrict the input of Psm objects. This method is central to the pyproteininference module and is able to restrict the Psm data by: Q value, Pep Value, Percolator Score, Peptide Length, and Custom Score Input. Restriction values are pulled from the ProteinInferenceParameter object.

This method sets the main_data_restricted and restricted_peptides Attributes for the DataStore object.

Parameters:
  • remove1pep (bool, default: True ) –

    True/False on whether or not to remove PEP values that equal 1 even if other restrictions are set to not restrict.

Returns:
  • None
Example

data = pyproteininference.datastore.DataStore(reader=reader, digest=digest)
data.restrict_psm_data(remove1pep=True)
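The branch where peptide length, PEP, and q-value thresholds are all set reduces to a single combined filter; a minimal stand-alone sketch with toy tuples in place of `Psm` objects (field layout here is illustrative):

```python
# Illustrative stand-ins for Psm objects: (stripped_peptide, qvalue, pepvalue).
psms = [
    ("PEPTIDEK", 0.001, 0.01),
    ("SHORTK", 0.001, 0.01),     # too short, fails the length restriction
    ("LONGPEPTIDEK", 0.2, 0.5),  # fails both the q-value and PEP thresholds
]
peptide_length, q_threshold, pep_threshold = 7, 0.05, 0.1

# A PSM survives only if it passes all three restrictions.
restricted = [
    p for p in psms
    if len(p[0]) >= peptide_length and p[1] < q_threshold and p[2] < pep_threshold
]
```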

Source code in pyproteininference/datastore.py
def restrict_psm_data(self, remove1pep=True):
    """
    Method to restrict the input of [Psm][pyproteininference.physical.Psm]  objects.
    This method is central to the pyproteininference module and is able to restrict the Psm data by:
    Q value, Pep Value, Percolator Score, Peptide Length, and Custom Score Input.
    Restriction values are pulled from
    the [ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter]
    object.

    This method sets the `main_data_restricted` and `restricted_peptides` Attributes for the DataStore object.

    Args:
        remove1pep (bool): True/False on whether or not to remove PEP values that equal 1 even if other restrictions
            are set to not restrict.

    Returns:
        None:

    Example:
        >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
        >>> data.restrict_psm_data(remove1pep=True)
    """

    # Validate that we have the main data variable
    self._validate_main_data_form()

    logger.info("Restricting PSM data")

    peptide_length = self.parameter_file_object.restrict_peptide_length
    posterior_error_prob_threshold = self.parameter_file_object.restrict_pep
    q_value_threshold = self.parameter_file_object.restrict_q
    custom_threshold = self.parameter_file_object.restrict_custom

    main_psm_data = self.main_data_form
    logger.info("Length of main data: {}".format(len(self.main_data_form)))
    # If restrict_main_data is called, we automatically discard everything that has a PEP of 1
    if remove1pep and posterior_error_prob_threshold:
        main_psm_data = [x for x in main_psm_data if x.pepvalue != 1]

    # Restrict peptide length and posterior error probability
    if peptide_length and posterior_error_prob_threshold and not q_value_threshold:
        restricted_data = []
        for psms in main_psm_data:
            if len(psms.stripped_peptide) >= peptide_length and psms.pepvalue < float(
                posterior_error_prob_threshold
            ):
                restricted_data.append(psms)

    # Restrict peptide length only
    if peptide_length and not posterior_error_prob_threshold and not q_value_threshold:
        restricted_data = []
        for psms in main_psm_data:
            if len(psms.stripped_peptide) >= peptide_length:
                restricted_data.append(psms)

    # Restrict peptide length, posterior error probability, and qvalue
    if peptide_length and posterior_error_prob_threshold and q_value_threshold:
        restricted_data = []
        for psms in main_psm_data:
            if (
                len(psms.stripped_peptide) >= peptide_length
                and psms.pepvalue < float(posterior_error_prob_threshold)
                and psms.qvalue < float(q_value_threshold)
            ):
                restricted_data.append(psms)

    # Restrict peptide length and qvalue
    if peptide_length and not posterior_error_prob_threshold and q_value_threshold:
        restricted_data = []
        for psms in main_psm_data:
            if len(psms.stripped_peptide) >= peptide_length and psms.qvalue < float(q_value_threshold):
                restricted_data.append(psms)

    # Restrict posterior error probability and q value
    if not peptide_length and posterior_error_prob_threshold and q_value_threshold:
        restricted_data = []
        for psms in main_psm_data:
            if psms.pepvalue < float(posterior_error_prob_threshold) and psms.qvalue < float(q_value_threshold):
                restricted_data.append(psms)

    # Restrict qvalue only
    if not peptide_length and not posterior_error_prob_threshold and q_value_threshold:
        restricted_data = []
        for psms in main_psm_data:
            if psms.qvalue < float(q_value_threshold):
                restricted_data.append(psms)

    # Restrict posterior error probability only
    if not peptide_length and posterior_error_prob_threshold and not q_value_threshold:
        restricted_data = []
        for psms in main_psm_data:
            if psms.pepvalue < float(posterior_error_prob_threshold):
                restricted_data.append(psms)

    # Restrict nothing... (only PEP gets restricted - takes everything less than 1)
    if not peptide_length and not posterior_error_prob_threshold and not q_value_threshold:
        restricted_data = main_psm_data

    if custom_threshold:
        custom_restricted = []
        if self.parameter_file_object.psm_score_type == Score.MULTIPLICATIVE_SCORE_TYPE:
            for psms in restricted_data:
                if psms.custom_score <= custom_threshold:
                    custom_restricted.append(psms)

        if self.parameter_file_object.psm_score_type == Score.ADDITIVE_SCORE_TYPE:
            for psms in restricted_data:
                if psms.custom_score >= custom_threshold:
                    custom_restricted.append(psms)

        restricted_data = custom_restricted

    self.main_data_restricted = restricted_data

    logger.info("Length of restricted data: {}".format(len(restricted_data)))

    self.restricted_peptides = [x.non_flanking_peptide for x in restricted_data]

sort_protein_group_objects(protein_group_objects, higher_or_lower) classmethod

Class Method to sort a list of ProteinGroup objects by score and number of peptides.

Parameters:
  • protein_group_objects (list) –

    list of ProteinGroup objects.

  • higher_or_lower (str) –

    String to indicate if a "higher" or "lower" protein score is "better".

Returns:
  • list

    list of sorted ProteinGroup objects.

Example

list_of_group_objects = pyproteininference.datastore.DataStore.sort_protein_group_objects(
    protein_group_objects=list_of_group_objects, higher_or_lower="higher"
)
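Both sort directions rely on a standard tuple-key trick: when a lower score is better, ties on score should still favor proteins with more peptides, so the peptide count is negated and one ascending sort handles both criteria. A stand-alone sketch with toy groups in place of ProteinGroup objects:

```python
# Toy protein groups: (lead_score, lead_num_peptides, name).
groups = [(10.0, 2, "A"), (10.0, 5, "B"), (3.0, 1, "C")]

# "Lower is better": ascending by score, but score ties broken by MORE
# peptides first, hence the negated peptide count in the key.
lower_better = sorted(groups, key=lambda g: (g[0], -g[1]))

# "Higher is better": one descending sort works because both criteria descend.
higher_better = sorted(groups, key=lambda g: (g[0], g[1]), reverse=True)
```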

Source code in pyproteininference/datastore.py
@classmethod
def sort_protein_group_objects(cls, protein_group_objects, higher_or_lower):
    """
    Class Method to sort a list of [ProteinGroup][pyproteininference.physical.ProteinGroup] objects by
    score and number of peptides.

    Args:
        protein_group_objects (list): list of [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.
        higher_or_lower (str): String to indicate if a "higher" or "lower" protein score is "better".

    Returns:
        list: list of sorted [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.

    Example:
        >>> list_of_group_objects = pyproteininference.datastore.DataStore.sort_protein_group_objects(
        >>>     protein_group_objects=list_of_group_objects, higher_or_lower="higher"
        >>> )
    """
    if higher_or_lower == cls.LOWER_PSM_SCORE:

        protein_group_objects = sorted(
            protein_group_objects,
            key=lambda k: (
                k.proteins[0].score,
                -k.proteins[0].num_peptides,
            ),
            reverse=False,
        )
    elif higher_or_lower == cls.HIGHER_PSM_SCORE:

        protein_group_objects = sorted(
            protein_group_objects,
            key=lambda k: (
                k.proteins[0].score,
                k.proteins[0].num_peptides,
            ),
            reverse=True,
        )

    return protein_group_objects

sort_protein_objects(grouped_protein_objects, higher_or_lower) classmethod

Class Method to sort a list of Protein objects by score and number of peptides.

Parameters:
  • grouped_protein_objects (list) –

    list of Protein objects.

  • higher_or_lower (str) –

    String to indicate if a "higher" or "lower" protein score is "better".

Returns:
  • list

    list of sorted Protein objects.

Example

scores_grouped = pyproteininference.datastore.DataStore.sort_protein_objects(
    grouped_protein_objects=scores_grouped, higher_or_lower="higher"
)

Source code in pyproteininference/datastore.py
@classmethod
def sort_protein_objects(cls, grouped_protein_objects, higher_or_lower):
    """
    Class Method to sort a list of [Protein][pyproteininference.physical.Protein] objects by score and number of
    peptides.

    Args:
        grouped_protein_objects (list): list of [Protein][pyproteininference.physical.Protein] objects.
        higher_or_lower (str): String to indicate if a "higher" or "lower" protein score is "better".

    Returns:
        list: list of sorted [Protein][pyproteininference.physical.Protein] objects.

    Example:
        >>> scores_grouped = pyproteininference.datastore.DataStore.sort_protein_objects(
        >>>     grouped_protein_objects=scores_grouped, higher_or_lower="higher"
        >>> )
    """
    if higher_or_lower == cls.LOWER_PSM_SCORE:
        grouped_protein_objects = sorted(
            grouped_protein_objects,
            key=lambda k: (k[0].score, -k[0].num_peptides),
            reverse=False,
        )
    if higher_or_lower == cls.HIGHER_PSM_SCORE:
        grouped_protein_objects = sorted(
            grouped_protein_objects,
            key=lambda k: (k[0].score, k[0].num_peptides),
            reverse=True,
        )
    return grouped_protein_objects

sort_protein_strings(protein_string_list, sp_proteins, decoy_symbol) classmethod

Class Method that sorts protein strings in the following order: Target Reviewed, Decoy Reviewed, Target Unreviewed, Decoy Unreviewed.

Parameters:
  • protein_string_list (list) –

    List of Protein Strings.

  • sp_proteins (set) –

    Set of Reviewed Protein Strings.

  • decoy_symbol (str) –

    Symbol to denote a decoy protein identifier, e.g. "##".

Returns:
  • list

    List of sorted protein strings.

Example

list_of_group_objects = datastore.DataStore.sort_protein_strings(
    protein_string_list=protein_string_list, sp_proteins=sp_proteins, decoy_symbol="##"
)
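The four-way ordering is four filtered-and-sorted buckets concatenated in priority order. A stand-alone sketch (the identifiers below are made up for illustration):

```python
# Reviewed identifiers; here the set also contains a decoy entry so the
# "decoy reviewed" bucket is non-empty.
sp_proteins = {"P12345", "Q67890", "##P54321"}
decoy_symbol = "##"
protein_string_list = ["TR0001", "##P54321", "Q67890", "##TR0002", "P12345"]

# Bucket order: target reviewed, decoy reviewed, target unreviewed, decoy unreviewed.
buckets = (
    sorted(p for p in protein_string_list if p in sp_proteins and decoy_symbol not in p),
    sorted(p for p in protein_string_list if p in sp_proteins and decoy_symbol in p),
    sorted(p for p in protein_string_list if p not in sp_proteins and decoy_symbol not in p),
    sorted(p for p in protein_string_list if p not in sp_proteins and decoy_symbol in p),
)
identifiers_sorted = [p for bucket in buckets for p in bucket]
```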

Source code in pyproteininference/datastore.py
@classmethod
def sort_protein_strings(cls, protein_string_list, sp_proteins, decoy_symbol):
    """
    Method that sorts protein strings in the following order: Target Reviewed, Decoy Reviewed, Target Unreviewed,
     Decoy Unreviewed.

    Args:
        protein_string_list (list): List of Protein Strings.
        sp_proteins (set): Set of Reviewed Protein Strings.
        decoy_symbol (str): Symbol to denote a decoy protein identifier, e.g. "##".

    Returns:
        list: List of sorted protein strings.

    Example:
        >>> list_of_group_objects = datastore.DataStore.sort_protein_strings(
        >>>     protein_string_list=protein_string_list, sp_proteins=sp_proteins, decoy_symbol="##"
        >>> )
    """

    our_target_sp_proteins = sorted([x for x in protein_string_list if x in sp_proteins and decoy_symbol not in x])
    our_decoy_sp_proteins = sorted([x for x in protein_string_list if x in sp_proteins and decoy_symbol in x])

    our_target_tr_proteins = sorted(
        [x for x in protein_string_list if x not in sp_proteins and decoy_symbol not in x]
    )
    our_decoy_tr_proteins = sorted([x for x in protein_string_list if x not in sp_proteins and decoy_symbol in x])

    identifiers_sorted = (
        our_target_sp_proteins + our_decoy_sp_proteins + our_target_tr_proteins + our_decoy_tr_proteins
    )

    return identifiers_sorted

sort_protein_sub_groups(protein_list, higher_or_lower) classmethod

Method to sort protein sub lists.

Parameters:
  • protein_list (list) –

    List of Protein objects to be sorted.

  • higher_or_lower (str) –

    String to indicate if a "higher" or "lower" protein score is "better".

Returns:
  • list

    List of Protein objects sorted by score and number of peptides.
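The lead-preserving sort uses slice assignment so only the tail of the list is reordered while the lead protein keeps its position. A stand-alone sketch with toy (score, num_peptides, name) tuples in place of Protein objects:

```python
# The first element is the group lead and must stay in place, so only the
# tail protein_list[1:] is sorted.
protein_list = [(5.0, 3, "LEAD"), (9.0, 1, "X"), (9.0, 4, "Y"), (2.0, 2, "Z")]

# "Higher is better": sort the non-lead members descending by score, then by
# number of peptides; slice assignment writes them back in place.
protein_list[1:] = sorted(protein_list[1:], key=lambda p: (p[0], p[1]), reverse=True)
```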

Source code in pyproteininference/datastore.py
@classmethod
def sort_protein_sub_groups(cls, protein_list, higher_or_lower):
    """
    Method to sort protein sub lists.

    Args:
        protein_list (list): List of [Protein][pyproteininference.physical.Protein] objects to be sorted.
        higher_or_lower (str): String to indicate if a "higher" or "lower" protein score is "better".

    Returns:
        list: List of [Protein][pyproteininference.physical.Protein] objects sorted by score and
        number of peptides.

    """

    # Sort the groups based on higher or lower indication, secondarily sort the groups based on number of unique
    # peptides
    # We use the index [1:] as we do not wish to sort the lead protein...
    if higher_or_lower == cls.LOWER_PSM_SCORE:
        protein_list[1:] = sorted(
            protein_list[1:],
            key=lambda k: (float(k.score), -float(k.num_peptides)),
            reverse=False,
        )
    if higher_or_lower == cls.HIGHER_PSM_SCORE:
        protein_list[1:] = sorted(
            protein_list[1:],
            key=lambda k: (float(k.score), float(k.num_peptides)),
            reverse=True,
        )

    return protein_list

unique_to_leads_peptides()

Method to retrieve peptides that are unique based on the data from the searches (Not based on the database digestion).

Returns:
  • set

    a Set of peptide strings

Example

data = pyproteininference.datastore.DataStore(reader=reader, digest=digest)
unique_peps = data.unique_to_leads_peptides()
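The uniqueness test boils down to counting how many lead proteins claim each peptide and keeping those claimed exactly once. A stand-alone sketch with made-up peptide lists standing in for the grouped lead proteins:

```python
import collections

# Peptide lists of each group's lead protein (illustrative stand-ins for
# grouped_scored_proteins); PEPB is shared by two leads.
lead_peptides = [["PEPA", "PEPB"], ["PEPB", "PEPC"], ["PEPD"]]

# Flatten, count, and keep peptides claimed by exactly one lead protein.
flat_peptides = [pep for sublist in lead_peptides for pep in sublist]
counted = collections.Counter(flat_peptides)
unique_to_leads = {pep for pep, count in counted.items() if count == 1}
```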

Source code in pyproteininference/datastore.py
def unique_to_leads_peptides(self):
    """
    Method to retrieve peptides that are unique based on the data from the searches
    (Not based on the database digestion).

    Returns:
        set: a Set of peptide strings

    Example:
        >>> data = pyproteininference.datastore.DataStore(reader = reader, digest=digest)
        >>> unique_peps = data.unique_to_leads_peptides()
    """
    if self.grouped_scored_proteins:
        lead_peptides = [list(x[0].peptides) for x in self.grouped_scored_proteins]
        flat_peptides = [item for sublist in lead_peptides for item in sublist]
        counted_peps = collections.Counter(flat_peptides)
        unique_to_leads_peptides = set([x for x in counted_peps if counted_peps[x] == 1])
    else:
        unique_to_leads_peptides = set()

    return unique_to_leads_peptides

validate_digest()

Method that validates the Digest object.

Source code in pyproteininference/datastore.py
def validate_digest(self):
    """
    Method that validates the [Digest object][pyproteininference.in_silico_digest.Digest].
    """
    self._validate_reviewed_v_unreviewed()
    self._check_target_decoy_split()

validate_psm_data()

Method that validates the PSM data.

Source code in pyproteininference/datastore.py
def validate_psm_data(self):
    """
    Method that validates the PSM data.
    """
    self._validate_decoys_from_data()
    self._validate_isoform_from_data()

Digest

Bases: object

The following class handles data storage of in silico digest data from a fasta formatted sequence database.

Attributes:
  • peptide_to_protein_dictionary (dict) –

    Dictionary of peptides (keys) to protein sets (values).

  • protein_to_peptide_dictionary (dict) –

    Dictionary of proteins (keys) to peptide sets (values).

  • swiss_prot_protein_set (set) –

    Set of reviewed proteins if they are able to be distinguished from unreviewed proteins.

  • database_path (str) –

    Path to fasta database file to digest.

  • missed_cleavages (int) –

    The number of missed cleavages to allow.

  • id_splitting (bool) –

    True/False on whether or not to split a given regex off identifiers. This is used to split off "sp|" and "tr|" from the database protein strings as sometimes the database will contain those strings while the input data will have the strings split already. Advanced usage only.

  • reviewed_identifier_symbol (str / None) –

    Identifier that distinguishes reviewed from unreviewed proteins. Typically this is "sp|". Can also be None type.

  • digest_type (str) –

    can be any value in LIST_OF_DIGEST_TYPES.

  • max_peptide_length (int) –

    Max peptide length to keep for analysis.

Source code in pyproteininference/in_silico_digest.py
class Digest(object):
    """
    The following class handles data storage of in silico digest data from a fasta formatted sequence database.

    Attributes:
        peptide_to_protein_dictionary (dict): Dictionary of peptides (keys) to protein sets (values).
        protein_to_peptide_dictionary (dict): Dictionary of proteins (keys) to peptide sets (values).
        swiss_prot_protein_set (set): Set of reviewed proteins if they are able to be distinguished from unreviewed
            proteins.
        database_path (str): Path to fasta database file to digest.
        missed_cleavages (int): The number of missed cleavages to allow.
        id_splitting (bool): True/False on whether or not to split a given regex off identifiers.
            This is used to split off "sp|" and "tr|"
            from the database protein strings as sometimes the database will contain those strings while
            the input data will have the strings split already.
            Advanced usage only.
        reviewed_identifier_symbol (str/None): Identifier that distinguishes reviewed from unreviewed proteins.
            Typically this is "sp|". Can also be None type.
        digest_type (str): can be any value in `LIST_OF_DIGEST_TYPES`.
        max_peptide_length (int): Max peptide length to keep for analysis.

    """

    TRYPSIN = "trypsin"
    LYSC = "lysc"
    LIST_OF_DIGEST_TYPES = set(parser.expasy_rules.keys())

    AA_LIST = [
        "A",
        "R",
        "N",
        "D",
        "C",
        "E",
        "Q",
        "G",
        "H",
        "I",
        "L",
        "K",
        "M",
        "F",
        "P",
        "S",
        "T",
        "W",
        "Y",
        "V",
    ]
    UNIPROT_STRS = r"sp\||tr\|"  # raw string so the escapes are not invalid-escape warnings
    UNIPROT_STR_REGEX = re.compile(UNIPROT_STRS)
    SP_STRING = "sp|"
    METHIONINE = "M"
    ANY_AMINO_ACID = "X"

    def __init__(self):
        pass

PyteomicsDigest

Bases: Digest

This class represents a pyteomics implementation of an in silico digest.

Source code in pyproteininference/in_silico_digest.py
class PyteomicsDigest(Digest):
    """
    This class represents a pyteomics implementation of an in silico digest.
    """

    def __init__(
        self,
        database_path,
        digest_type,
        missed_cleavages,
        reviewed_identifier_symbol,
        max_peptide_length,
        id_splitting=True,
    ):
        """
        The following class creates protein to peptide, peptide to protein, and reviewed protein mappings.

        The input is a fasta database, a protein inference parameter object, and whether or not to split IDs.

        This class sets important attributes for the Digest object such as: `peptide_to_protein_dictionary`,
        `protein_to_peptide_dictionary`, and `swiss_prot_protein_set`.

        Args:
            database_path (str): Path to fasta database file to digest.
            digest_type (str): Must be a value in `LIST_OF_DIGEST_TYPES`.
            missed_cleavages (int): Integer that indicates the maximum number of allowable missed cleavages from
                the ms search.
            reviewed_identifier_symbol (str/None): Symbol that indicates a reviewed identifier.
                If using Uniprot this is typically 'sp|'.
            max_peptide_length (int): The maximum length of peptides to keep for the analysis.
            id_splitting (bool): True/False on whether or not to split a given regex off identifiers.
                This is used to split off "sp|" and "tr|"
                from the database protein strings as sometimes the database will contain those
                strings while the input data will have the strings split already.
                Advanced usage only.

        Example:
            >>> digest = pyproteininference.in_silico_digest.PyteomicsDigest(
            >>>     database_path=database_file,
            >>>     digest_type='trypsin',
            >>>     missed_cleavages=2,
            >>>     reviewed_identifier_symbol='sp|',
            >>>     max_peptide_length=7,
            >>>     id_splitting=False,
            >>> )
        """
        self.peptide_to_protein_dictionary = {}
        self.protein_to_peptide_dictionary = {}
        self.swiss_prot_protein_set = set()
        self.database_path = database_path
        self.missed_cleavages = missed_cleavages
        self.id_splitting = id_splitting
        self.reviewed_identifier_symbol = reviewed_identifier_symbol
        if digest_type in self.LIST_OF_DIGEST_TYPES:
            self.digest_type = digest_type
        else:
            raise ValueError(
                "digest_type must be equal to one of the following {}".format(str(self.LIST_OF_DIGEST_TYPES))
            )
        self.max_peptide_length = max_peptide_length

    def digest_fasta_database(self):
        """
        This method reads in and prepares the fasta database for database digestion and assigns
        the several attributes for the Digest object: `peptide_to_protein_dictionary`,
        `protein_to_peptide_dictionary`, and `swiss_prot_protein_set`.

        Returns:
            None:

        Example:
            >>> digest = pyproteininference.in_silico_digest.PyteomicsDigest(
            >>>     database_path=database_file,
            >>>     digest_type='trypsin',
            >>>     missed_cleavages=2,
            >>>     reviewed_identifier_symbol='sp|',
            >>>     max_peptide_length=7,
            >>>     id_splitting=False,
            >>> )
            >>> digest.digest_fasta_database()

        """
        logger.info("Starting Pyteomics Digest...")
        pep_dict = {}
        prot_dict = {}
        sp_set = set()

        for description, sequence in tqdm.tqdm(fasta.read(self.database_path), unit=" entries"):
            new_peptides = parser.cleave(
                sequence,
                parser.expasy_rules[self.digest_type],
                self.missed_cleavages,
                min_length=self.max_peptide_length,
            )

            # The protein identifier is the first whitespace-delimited token of the fasta description
            identifier = description.split(" ")[0]

            # Handle ID Splitting...
            if self.id_splitting:
                identifier_stripped = self.UNIPROT_STR_REGEX.sub("", identifier)
            else:
                identifier_stripped = identifier

            # If reviewed add to sp_set
            if self.reviewed_identifier_symbol:
                if identifier.startswith(self.reviewed_identifier_symbol):
                    sp_set.add(identifier_stripped)

            prot_dict[identifier_stripped] = new_peptides
            met_cleaved_peps = set()
            for peptide in new_peptides:
                pep_dict.setdefault(peptide, set()).add(identifier_stripped)
                # Need to account for potential N-term Methionine Cleavage
                if sequence.startswith(peptide) and peptide.startswith(self.METHIONINE):
                    # If our sequence starts with the current peptide... and our current peptide starts with methionine
                    # Then we remove the methionine from the peptide and add it to our dicts...
                    methionine_cleaved_peptide = peptide[1:]
                    met_cleaved_peps.add(methionine_cleaved_peptide)
            for met_peps in met_cleaved_peps:
                pep_dict.setdefault(met_peps, set()).add(identifier_stripped)
                prot_dict[identifier_stripped].add(met_peps)

        self.swiss_prot_protein_set = sp_set
        self.peptide_to_protein_dictionary = pep_dict
        self.protein_to_peptide_dictionary = prot_dict

        logger.info("Pyteomics Digest Finished...")

__init__(database_path, digest_type, missed_cleavages, reviewed_identifier_symbol, max_peptide_length, id_splitting=True)

The following class creates protein to peptide, peptide to protein, and reviewed protein mappings.

The input is a fasta database, a protein inference parameter object, and whether or not to split IDs.

This class sets important attributes for the Digest object such as: peptide_to_protein_dictionary, protein_to_peptide_dictionary, and swiss_prot_protein_set.

Parameters:
  • database_path (str) –

    Path to fasta database file to digest.

  • digest_type (str) –

    Must be a value in LIST_OF_DIGEST_TYPES.

  • missed_cleavages (int) –

    Integer that indicates the maximum number of allowable missed cleavages from the ms search.

  • reviewed_identifier_symbol (str / None) –

    Symbol that indicates a reviewed identifier. If using Uniprot this is typically 'sp|'.

  • max_peptide_length (int) –

    The maximum length of peptides to keep for the analysis.

  • id_splitting (bool, default: True ) –

    True/False on whether or not to split a given regex off identifiers. This is used to split off "sp|" and "tr|" from the database protein strings as sometimes the database will contain those strings while the input data will have the strings split already. Advanced usage only.

Example

digest = pyproteininference.in_silico_digest.PyteomicsDigest(
    database_path=database_file,
    digest_type='trypsin',
    missed_cleavages=2,
    reviewed_identifier_symbol='sp|',
    max_peptide_length=7,
    id_splitting=False,
)

Source code in pyproteininference/in_silico_digest.py
def __init__(
    self,
    database_path,
    digest_type,
    missed_cleavages,
    reviewed_identifier_symbol,
    max_peptide_length,
    id_splitting=True,
):
    """
    The following class creates protein to peptide, peptide to protein, and reviewed protein mappings.

    The input is a fasta database, a protein inference parameter object, and whether or not to split IDs.

    This class sets important attributes for the Digest object such as: `peptide_to_protein_dictionary`,
    `protein_to_peptide_dictionary`, and `swiss_prot_protein_set`.

    Args:
        database_path (str): Path to fasta database file to digest.
        digest_type (str): Must be a value in `LIST_OF_DIGEST_TYPES`.
        missed_cleavages (int): Integer that indicates the maximum number of allowable missed cleavages from
            the ms search.
        reviewed_identifier_symbol (str/None): Symbol that indicates a reviewed identifier.
            If using Uniprot this is typically 'sp|'.
        max_peptide_length (int): The maximum length of peptides to keep for the analysis.
        id_splitting (bool): True/False on whether or not to split a given regex off identifiers.
            This is used to split off "sp|" and "tr|"
            from the database protein strings as sometimes the database will contain those
            strings while the input data will have the strings split already.
            Advanced usage only.

    Example:
        >>> digest = pyproteininference.in_silico_digest.PyteomicsDigest(
        >>>     database_path=database_file,
        >>>     digest_type='trypsin',
        >>>     missed_cleavages=2,
        >>>     reviewed_identifier_symbol='sp|',
        >>>     max_peptide_length=7,
        >>>     id_splitting=False,
        >>> )
    """
    self.peptide_to_protein_dictionary = {}
    self.protein_to_peptide_dictionary = {}
    self.swiss_prot_protein_set = set()
    self.database_path = database_path
    self.missed_cleavages = missed_cleavages
    self.id_splitting = id_splitting
    self.reviewed_identifier_symbol = reviewed_identifier_symbol
    if digest_type in self.LIST_OF_DIGEST_TYPES:
        self.digest_type = digest_type
    else:
        raise ValueError(
            "digest_type must be equal to one of the following {}".format(str(self.LIST_OF_DIGEST_TYPES))
        )
    self.max_peptide_length = max_peptide_length

digest_fasta_database()

This method reads in and prepares the fasta database for digestion and assigns several attributes of the Digest object: peptide_to_protein_dictionary, protein_to_peptide_dictionary, and swiss_prot_protein_set.
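The two mapping dictionaries are inverses of each other. A minimal sketch of how they are built from (identifier, peptide set) pairs; the identifiers and peptides here are illustrative, not taken from a real database:

```python
def build_mappings(protein_peptides):
    """Build peptide->proteins and protein->peptides dicts from
    an iterable of (identifier, peptide set) pairs."""
    pep_to_prot, prot_to_pep = {}, {}
    for identifier, peptides in protein_peptides:
        prot_to_pep[identifier] = set(peptides)
        for pep in peptides:
            # A peptide may map to multiple proteins (shared peptide)
            pep_to_prot.setdefault(pep, set()).add(identifier)
    return pep_to_prot, prot_to_pep
```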

Returns:
  • None
Example

digest = pyproteininference.in_silico_digest.PyteomicsDigest(
    database_path=database_file,
    digest_type='trypsin',
    missed_cleavages=2,
    reviewed_identifier_symbol='sp|',
    max_peptide_length=7,
    id_splitting=False,
)
digest.digest_fasta_database()

Source code in pyproteininference/in_silico_digest.py
def digest_fasta_database(self):
    """
    This method reads in and prepares the fasta database for digestion and assigns
    several attributes of the Digest object: `peptide_to_protein_dictionary`,
    `protein_to_peptide_dictionary`, and `swiss_prot_protein_set`.

    Returns:
        None:

    Example:
        >>> digest = pyproteininference.in_silico_digest.PyteomicsDigest(
        >>>     database_path=database_file,
        >>>     digest_type='trypsin',
        >>>     missed_cleavages=2,
        >>>     reviewed_identifier_symbol='sp|',
        >>>     max_peptide_length=7,
        >>>     id_splitting=False,
        >>> )
        >>> digest.digest_fasta_database()

    """
    logger.info("Starting Pyteomics Digest...")
    pep_dict = {}
    prot_dict = {}
    sp_set = set()

    for description, sequence in tqdm.tqdm(fasta.read(self.database_path), unit=" entries"):
        new_peptides = parser.cleave(
            sequence,
            parser.expasy_rules[self.digest_type],
            self.missed_cleavages,
            min_length=self.max_peptide_length,
        )

        # Take the first whitespace-delimited token of the fasta header as the identifier
        identifier = description.split(" ")[0]

        # Handle ID Splitting...
        if self.id_splitting:
            identifier_stripped = self.UNIPROT_STR_REGEX.sub("", identifier)
        else:
            identifier_stripped = identifier

        # If reviewed add to sp_set
        if self.reviewed_identifier_symbol:
            if identifier.startswith(self.reviewed_identifier_symbol):
                sp_set.add(identifier_stripped)

        prot_dict[identifier_stripped] = new_peptides
        met_cleaved_peps = set()
        for peptide in new_peptides:
            pep_dict.setdefault(peptide, set()).add(identifier_stripped)
            # Need to account for potential N-term Methionine Cleavage
            if sequence.startswith(peptide) and peptide.startswith(self.METHIONINE):
                # If our sequence starts with the current peptide... and our current peptide starts with methionine
                # Then we remove the methionine from the peptide and add it to our dicts...
                methionine_cleaved_peptide = peptide[1:]
                met_cleaved_peps.add(methionine_cleaved_peptide)
        for met_peps in met_cleaved_peps:
            pep_dict.setdefault(met_peps, set()).add(identifier_stripped)
            prot_dict[identifier_stripped].add(met_peps)

    self.swiss_prot_protein_set = sp_set
    self.peptide_to_protein_dictionary = pep_dict
    self.protein_to_peptide_dictionary = prot_dict

    logger.info("Pyteomics Digest Finished...")
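The N-terminal methionine handling in the loop above can be isolated into a small helper. This is a sketch of the logic only, not a function from the package:

```python
def nterm_met_variants(sequence, peptides):
    """Return Met-trimmed variants of peptides that begin the protein
    sequence with a methionine (accounting for N-term Met cleavage)."""
    return {
        pep[1:]
        for pep in peptides
        if sequence.startswith(pep) and pep.startswith("M")
    }
```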

Exclusion

Bases: Inference

Exclusion Inference class. This class contains methods that support the initialization of an Exclusion inference method.

Attributes:
Source code in pyproteininference/inference.py
class Exclusion(Inference):
    """
    Exclusion Inference class. This class contains methods that support the initialization of an
    Exclusion inference method.

    Attributes:
        data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
        digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
        scored_data (list): a List of scored [Protein][pyproteininference.physical.Protein] objects.

    """

    def __init__(self, data, digest):
        """
        Initialization method of the Exclusion Class.

        Args:
            data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
            digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].

        """
        self.data = data
        self.digest = digest
        self.data._validate_scored_proteins()
        self.scored_data = self.data.get_protein_data()
        self.list_of_prots_not_in_db = None
        self.list_of_peps_not_in_db = None

    def infer_proteins(self):
        """
        This method performs the Exclusion inference/grouping method.

        For the exclusion inference method, groups cannot be created because all shared peptides are removed.

        This method assigns the variables: `grouped_scored_proteins` and `protein_group_objects`.
        These are both variables of the [DataStore Object][pyproteininference.datastore.DataStore] and are
        lists of [Protein][pyproteininference.physical.Protein] objects
        and [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.

        """

        grouped_proteins = self._create_protein_groups(scored_proteins=self.scored_data)

        hl = self.data.higher_or_lower()

        logger.info("Applying Group ID's for the Exclusion Method")
        regrouped_proteins = self._apply_protein_group_ids(
            grouped_protein_objects=grouped_proteins,
        )

        grouped_protein_objects = regrouped_proteins["grouped_protein_objects"]
        protein_group_objects = regrouped_proteins["group_objects"]

        logger.info("Sorting Results based on lead Protein Score")
        grouped_protein_objects = datastore.DataStore.sort_protein_objects(
            grouped_protein_objects=grouped_protein_objects, higher_or_lower=hl
        )
        protein_group_objects = datastore.DataStore.sort_protein_group_objects(
            protein_group_objects=protein_group_objects, higher_or_lower=hl
        )

        self.data.grouped_scored_proteins = grouped_protein_objects
        self.data.protein_group_objects = protein_group_objects

__init__(data, digest)

Initialization method of the Exclusion Class.

Parameters:
Source code in pyproteininference/inference.py
def __init__(self, data, digest):
    """
    Initialization method of the Exclusion Class.

    Args:
        data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
        digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].

    """
    self.data = data
    self.digest = digest
    self.data._validate_scored_proteins()
    self.scored_data = self.data.get_protein_data()
    self.list_of_prots_not_in_db = None
    self.list_of_peps_not_in_db = None

infer_proteins()

This method performs the Exclusion inference/grouping method.

For the exclusion inference method, groups cannot be created because all shared peptides are removed.

This method assigns the variables: grouped_scored_proteins and protein_group_objects. These are both variables of the DataStore Object and are lists of Protein objects and ProteinGroup objects.
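The final sorting step orders groups by the lead protein's score. The sketch below assumes `higher_or_lower()` yields the strings `"higher"` or `"lower"`; that detail and the minimal `Protein` stand-in are assumptions, not taken from this section:

```python
class Protein:
    """Minimal stand-in for pyproteininference.physical.Protein."""
    def __init__(self, identifier, score):
        self.identifier = identifier
        self.score = score

def sort_groups_by_lead_score(groups, higher_or_lower):
    """Sort lists of Protein objects by the lead (first) protein's score.

    higher_or_lower == "higher" means larger scores are better (assumption).
    """
    return sorted(
        groups,
        key=lambda g: g[0].score,
        reverse=(higher_or_lower == "higher"),
    )
```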

Source code in pyproteininference/inference.py
def infer_proteins(self):
    """
    This method performs the Exclusion inference/grouping method.

    For the exclusion inference method, groups cannot be created because all shared peptides are removed.

    This method assigns the variables: `grouped_scored_proteins` and `protein_group_objects`.
    These are both variables of the [DataStore Object][pyproteininference.datastore.DataStore] and are
    lists of [Protein][pyproteininference.physical.Protein] objects
    and [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.

    """

    grouped_proteins = self._create_protein_groups(scored_proteins=self.scored_data)

    hl = self.data.higher_or_lower()

    logger.info("Applying Group ID's for the Exclusion Method")
    regrouped_proteins = self._apply_protein_group_ids(
        grouped_protein_objects=grouped_proteins,
    )

    grouped_protein_objects = regrouped_proteins["grouped_protein_objects"]
    protein_group_objects = regrouped_proteins["group_objects"]

    logger.info("Sorting Results based on lead Protein Score")
    grouped_protein_objects = datastore.DataStore.sort_protein_objects(
        grouped_protein_objects=grouped_protein_objects, higher_or_lower=hl
    )
    protein_group_objects = datastore.DataStore.sort_protein_group_objects(
        protein_group_objects=protein_group_objects, higher_or_lower=hl
    )

    self.data.grouped_scored_proteins = grouped_protein_objects
    self.data.protein_group_objects = protein_group_objects

FirstProtein

Bases: Inference

FirstProtein Inference class. This class contains methods that support the initialization of a FirstProtein inference method.

Attributes:
Source code in pyproteininference/inference.py
class FirstProtein(Inference):
    """
    FirstProtein Inference class. This class contains methods that support the initialization of a
    FirstProtein inference method.

    Attributes:
        data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
        digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].

    """

    def __init__(self, data, digest):
        """
        FirstProtein Inference initialization method.

        Args:
            data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
            digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].

        Returns:
            object:
        """
        self.data = data
        self.digest = digest
        self.data._validate_scored_proteins()
        self.scored_data = self.data.get_protein_data()

    def infer_proteins(self):
        """
        This method performs the First Protein inference method.

        This method assigns the variables: `grouped_scored_proteins` and `protein_group_objects`.
        These are both variables of the [DataStore object][pyproteininference.datastore.DataStore] and are
        lists of [Protein][pyproteininference.physical.Protein] objects
        and [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.

        """

        grouped_proteins = self._create_protein_groups(scored_proteins=self.scored_data)

        # Get the higher or lower variable
        hl = self.data.higher_or_lower()

        logger.info("Applying Group ID's for the First Protein Method")
        regrouped_proteins = self._apply_protein_group_ids(
            grouped_protein_objects=grouped_proteins,
        )

        grouped_protein_objects = regrouped_proteins["grouped_protein_objects"]
        protein_group_objects = regrouped_proteins["group_objects"]

        logger.info("Sorting Results based on lead Protein Score")
        grouped_protein_objects = datastore.DataStore.sort_protein_objects(
            grouped_protein_objects=grouped_protein_objects, higher_or_lower=hl
        )
        protein_group_objects = datastore.DataStore.sort_protein_group_objects(
            protein_group_objects=protein_group_objects, higher_or_lower=hl
        )

        self.data.grouped_scored_proteins = grouped_protein_objects
        self.data.protein_group_objects = protein_group_objects

__init__(data, digest)

FirstProtein Inference initialization method.

Parameters:
Returns:
  • object
Source code in pyproteininference/inference.py
def __init__(self, data, digest):
    """
    FirstProtein Inference initialization method.

    Args:
        data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
        digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].

    Returns:
        object:
    """
    self.data = data
    self.digest = digest
    self.data._validate_scored_proteins()
    self.scored_data = self.data.get_protein_data()

infer_proteins()

This method performs the First Protein inference method.

This method assigns the variables: grouped_scored_proteins and protein_group_objects. These are both variables of the DataStore object and are lists of Protein objects and ProteinGroup objects.

Source code in pyproteininference/inference.py
def infer_proteins(self):
    """
    This method performs the First Protein inference method.

    This method assigns the variables: `grouped_scored_proteins` and `protein_group_objects`.
    These are both variables of the [DataStore object][pyproteininference.datastore.DataStore] and are
    lists of [Protein][pyproteininference.physical.Protein] objects
    and [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.

    """

    grouped_proteins = self._create_protein_groups(scored_proteins=self.scored_data)

    # Get the higher or lower variable
    hl = self.data.higher_or_lower()

    logger.info("Applying Group ID's for the First Protein Method")
    regrouped_proteins = self._apply_protein_group_ids(
        grouped_protein_objects=grouped_proteins,
    )

    grouped_protein_objects = regrouped_proteins["grouped_protein_objects"]
    protein_group_objects = regrouped_proteins["group_objects"]

    logger.info("Sorting Results based on lead Protein Score")
    grouped_protein_objects = datastore.DataStore.sort_protein_objects(
        grouped_protein_objects=grouped_protein_objects, higher_or_lower=hl
    )
    protein_group_objects = datastore.DataStore.sort_protein_group_objects(
        protein_group_objects=protein_group_objects, higher_or_lower=hl
    )

    self.data.grouped_scored_proteins = grouped_protein_objects
    self.data.protein_group_objects = protein_group_objects

Inclusion

Bases: Inference

Inclusion Inference class. This class contains methods that support the initialization of an Inclusion inference method.

Attributes:
Source code in pyproteininference/inference.py
class Inclusion(Inference):
    """
    Inclusion Inference class. This class contains methods that support the initialization of an
    Inclusion inference method.

    Attributes:
        data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
        digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
        scored_data (list): a List of scored [Protein][pyproteininference.physical.Protein] objects.

    """

    def __init__(self, data, digest):
        """
        Initialization method of the Inclusion Inference method.

        Args:
            data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
            digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
        """

        self.data = data
        self.digest = digest
        self.data._validate_scored_proteins()
        self.scored_data = self.data.get_protein_data()

    def infer_proteins(self):
        """
        This method performs the grouping for Inclusion.

        Inclusion actually does not do grouping as all peptides get assigned to all possible proteins
        and groups are not created.

        This method assigns the variables: `grouped_scored_proteins` and `protein_group_objects`.
        These are both variables of the [DataStore Object][pyproteininference.datastore.DataStore] and are
        lists of [Protein][pyproteininference.physical.Protein] objects
        and [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.
        """

        grouped_proteins = self._create_protein_groups(scored_proteins=self.scored_data)

        hl = self.data.higher_or_lower()

        logger.info("Applying Group ID's for the Inclusion Method")

        regrouped_proteins = self._apply_protein_group_ids(
            grouped_protein_objects=grouped_proteins,
        )

        grouped_protein_objects = regrouped_proteins["grouped_protein_objects"]
        protein_group_objects = regrouped_proteins["group_objects"]

        logger.info("Sorting Results based on lead Protein Score")
        grouped_protein_objects = datastore.DataStore.sort_protein_objects(
            grouped_protein_objects=grouped_protein_objects, higher_or_lower=hl
        )
        protein_group_objects = datastore.DataStore.sort_protein_group_objects(
            protein_group_objects=protein_group_objects, higher_or_lower=hl
        )

        self.data.grouped_scored_proteins = grouped_protein_objects
        self.data.protein_group_objects = protein_group_objects

    def _apply_protein_group_ids(self, grouped_protein_objects):
        """
        This method creates the ProteinGroup objects for the inclusion inference type using protein groups from
        [_create_protein_groups][pyproteininference.inference.Inference._create_protein_groups].

        Args:
            grouped_protein_objects (list): list of grouped [Protein][pyproteininference.physical.Protein] objects.

        Returns:
            dict: a Dictionary that contains a list of [ProteinGroup][pyproteininference.physical.ProteinGroup]
                objects (key:"group_objects") and a list of
                grouped [Protein][pyproteininference.physical.Protein] objects (key:"grouped_protein_objects").

        """

        sp_protein_set = set(self.digest.swiss_prot_protein_set)

        prot_pep_dict = self.data.protein_to_peptide_dictionary()

        # Here we create group ID's
        group_id = 0
        protein_group_objects = []
        for protein_group in grouped_protein_objects:
            protein_list = []
            group_id = group_id + 1
            pg = ProteinGroup(group_id)
            logger.debug("Created Protein Group with ID: {}".format(str(group_id)))
            for prot in protein_group:
                cur_protein = prot
                # Assign group_id, reviewed/unreviewed status, and number of unique peptides
                if group_id not in cur_protein.group_identification:
                    cur_protein.group_identification.add(group_id)
                if cur_protein.identifier in sp_protein_set:
                    cur_protein.reviewed = True
                else:
                    cur_protein.unreviewed = True
                cur_identifier = cur_protein.identifier
                cur_protein.num_peptides = len(prot_pep_dict[cur_identifier])
                # Append the protein; num_peptides can serve as a secondary sort key
                protein_list.append(cur_protein)

            pg.proteins = protein_list
            protein_group_objects.append(pg)

        return_dict = {
            "grouped_protein_objects": grouped_protein_objects,
            "group_objects": protein_group_objects,
        }

        return return_dict

__init__(data, digest)

Initialization method of the Inclusion Inference method.

Parameters:
Source code in pyproteininference/inference.py
def __init__(self, data, digest):
    """
    Initialization method of the Inclusion Inference method.

    Args:
        data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
        digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
    """

    self.data = data
    self.digest = digest
    self.data._validate_scored_proteins()
    self.scored_data = self.data.get_protein_data()

infer_proteins()

This method performs the grouping for Inclusion.

Inclusion actually does not do grouping as all peptides get assigned to all possible proteins and groups are not created.

This method assigns the variables: grouped_scored_proteins and protein_group_objects. These are both variables of the DataStore Object and are lists of Protein objects and ProteinGroup objects.

Source code in pyproteininference/inference.py
def infer_proteins(self):
    """
    This method performs the grouping for Inclusion.

    Inclusion actually does not do grouping as all peptides get assigned to all possible proteins
    and groups are not created.

    This method assigns the variables: `grouped_scored_proteins` and `protein_group_objects`.
    These are both variables of the [DataStore Object][pyproteininference.datastore.DataStore] and are
    lists of [Protein][pyproteininference.physical.Protein] objects
    and [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.
    """

    grouped_proteins = self._create_protein_groups(scored_proteins=self.scored_data)

    hl = self.data.higher_or_lower()

    logger.info("Applying Group ID's for the Inclusion Method")

    regrouped_proteins = self._apply_protein_group_ids(
        grouped_protein_objects=grouped_proteins,
    )

    grouped_protein_objects = regrouped_proteins["grouped_protein_objects"]
    protein_group_objects = regrouped_proteins["group_objects"]

    logger.info("Sorting Results based on lead Protein Score")
    grouped_protein_objects = datastore.DataStore.sort_protein_objects(
        grouped_protein_objects=grouped_protein_objects, higher_or_lower=hl
    )
    protein_group_objects = datastore.DataStore.sort_protein_group_objects(
        protein_group_objects=protein_group_objects, higher_or_lower=hl
    )

    self.data.grouped_scored_proteins = grouped_protein_objects
    self.data.protein_group_objects = protein_group_objects

Inference

Bases: object

Parent Inference class for all inference/grouper subset classes. The base Inference class contains several methods that are shared across the Inference sub-classes.

Attributes:
Source code in pyproteininference/inference.py
class Inference(object):
    """
    Parent Inference class for all inference/grouper subset classes.
    The base Inference class contains several methods that are shared across the Inference sub-classes.

    Attributes:
        data (DataStore): [DataStore object][pyproteininference.datastore.DataStore].
        digest (Digest): [Digest object][pyproteininference.in_silico_digest.Digest].
    """

    PARSIMONY = "parsimony"
    INCLUSION = "inclusion"
    EXCLUSION = "exclusion"
    FIRST_PROTEIN = "first_protein"
    PEPTIDE_CENTRIC = "peptide_centric"

    INFERENCE_TYPES = [
        PARSIMONY,
        INCLUSION,
        EXCLUSION,
        FIRST_PROTEIN,
        PEPTIDE_CENTRIC,
    ]

    INFERENCE_NAME_MAP = {
        PARSIMONY: "Parsimony",
        INCLUSION: "Inclusion",
        EXCLUSION: "Exclusion",
        FIRST_PROTEIN: "First Protein",
        PEPTIDE_CENTRIC: "Peptide Centric",
    }

    SUBSET_PEPTIDES = "subset_peptides"
    SHARED_PEPTIDES = "shared_peptides"
    PARSIMONIOUS_GROUPING = "parsimonious_grouping"
    NONE_GROUPING = None

    GROUPING_TYPES = [SUBSET_PEPTIDES, SHARED_PEPTIDES, NONE_GROUPING, PARSIMONIOUS_GROUPING]

    PULP = "pulp"
    LP_SOLVERS = [PULP]

    ALL_SHARED_PEPTIDES = "all"
    BEST_SHARED_PEPTIDES = "best"
    NONE_SHARED_PEPTIDES = None
    SHARED_PEPTIDE_TYPES = [
        ALL_SHARED_PEPTIDES,
        BEST_SHARED_PEPTIDES,
        NONE_SHARED_PEPTIDES,
    ]

    def __init__(self, data, digest):
        """
        Initialization method of Inference object.

        Args:
            data (DataStore): [DataStore object][pyproteininference.datastore.DataStore].
            digest (Digest): [Digest object][pyproteininference.in_silico_digest.Digest].

        """
        self.data = data
        self.digest = digest
        self.data._validate_scored_proteins()

    @classmethod
    def run_inference(cls, data, digest):
        """
        This class method dispatches to one of the five different inference classes/models
        based on input from the [ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter]
        object.
        The methods are "parsimony", "inclusion", "exclusion", "peptide_centric", and "first_protein".

        Args:
            data (DataStore): [DataStore object][pyproteininference.datastore.DataStore].
            digest (Digest): [Digest object][pyproteininference.in_silico_digest.Digest].

        Example:
            >>> pyproteininference.inference.Inference.run_inference(data=data,digest=digest)

        """

        inference_type = data.parameter_file_object.inference_type

        logger.info("Running Inference with Inference Type: {}".format(inference_type))

        if inference_type == Inference.PARSIMONY:
            group = Parsimony(data=data, digest=digest)
            group.infer_proteins()

        if inference_type == Inference.INCLUSION:
            group = Inclusion(data=data, digest=digest)
            group.infer_proteins()

        if inference_type == Inference.EXCLUSION:
            group = Exclusion(data=data, digest=digest)
            group.infer_proteins()

        if inference_type == Inference.FIRST_PROTEIN:
            group = FirstProtein(data=data, digest=digest)
            group.infer_proteins()

        if inference_type == Inference.PEPTIDE_CENTRIC:
            group = PeptideCentric(data=data, digest=digest)
            group.infer_proteins()

    def _create_protein_groups(self, scored_proteins):
        """
        This method sets up protein groups for inference methods that do not need grouping.

        Args:
            scored_proteins (list): List of scored [Protein][pyproteininference.physical.Protein] objects.

        Returns:
            list: List of lists of scored [Protein][pyproteininference.physical.Protein] objects.

        """
        scored_proteins = sorted(
            scored_proteins,
            key=lambda k: (k.score, len(k.raw_peptides), k.identifier),
            reverse=True,
        )

        prot_pep_dict = self.data.protein_to_peptide_dictionary()

        restricted_peptides_set = set(self.data.restricted_peptides)

        grouped_proteins = []
        for protein_objects in scored_proteins:
            cur_protein_identifier = protein_objects.identifier

            # Keep only peptides that are in the restricted peptide set
            protein_objects.peptides = set(
                sorted([x for x in prot_pep_dict[cur_protein_identifier] if x in restricted_peptides_set])
            )
            protein_list_group = [protein_objects]
            grouped_proteins.append(protein_list_group)
        return grouped_proteins

    def _apply_protein_group_ids(self, grouped_protein_objects):
        """
        This method creates the ProteinGroup objects from the output of
            [_create_protein_groups][pyproteininference.inference.Inference._create_protein_groups].

        Args:
            grouped_protein_objects (list): list of grouped [Protein][pyproteininference.physical.Protein] objects.

        Returns:
            dict: a Dictionary that contains a list of [ProteinGroup][pyproteininference.physical.ProteinGroup]
                objects (key:"group_objects") and a list of grouped [Protein][pyproteininference.physical.Protein]
                objects (key:"grouped_protein_objects").


        """

        sp_protein_set = set(self.digest.swiss_prot_protein_set)

        prot_pep_dict = self.data.protein_to_peptide_dictionary()

        # Here we create group ID's
        group_id = 0
        protein_group_objects = []
        for protein_group in grouped_protein_objects:
            protein_list = []
            group_id = group_id + 1
            pg = ProteinGroup(group_id)
            logger.debug("Created Protein Group with ID: {}".format(str(group_id)))
            for protein in protein_group:
                cur_protein = protein
                # Assign group_id, reviewed/unreviewed status, and number of unique peptides
                if group_id not in cur_protein.group_identification:
                    cur_protein.group_identification.add(group_id)
                if protein.identifier in sp_protein_set:
                    cur_protein.reviewed = True
                else:
                    cur_protein.unreviewed = True
                cur_identifier = protein.identifier
                cur_protein.num_peptides = len(prot_pep_dict[cur_identifier])
                # Append the protein; num_peptides can serve as a secondary sort key
                protein_list.append(cur_protein)

            pg.proteins = protein_list
            protein_group_objects.append(pg)

        return_dict = {
            "grouped_protein_objects": grouped_protein_objects,
            "group_objects": protein_group_objects,
        }

        return return_dict
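The group-ID assignment loop above can be sketched in miniature. The `Protein` and `ProteinGroup` classes below are simplified stand-ins for the library's physical classes, reduced to just the fields the loop touches; they are illustrative only:

```python
# Simplified stand-ins for pyproteininference.physical classes.
class Protein:
    def __init__(self, identifier):
        self.identifier = identifier
        self.group_identification = set()


class ProteinGroup:
    def __init__(self, number_id):
        self.number_id = number_id
        self.proteins = []


def apply_group_ids(grouped_protein_objects):
    """Assign 1-based group IDs to each sub-list of proteins."""
    group_objects = []
    for group_id, protein_group in enumerate(grouped_protein_objects, start=1):
        pg = ProteinGroup(group_id)
        for protein in protein_group:
            # Each protein records every group it belongs to.
            protein.group_identification.add(group_id)
            pg.proteins.append(protein)
        group_objects.append(pg)
    return {
        "grouped_protein_objects": grouped_protein_objects,
        "group_objects": group_objects,
    }


groups = apply_group_ids([[Protein("P1"), Protein("P2")], [Protein("P3")]])
```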

__init__(data, digest)

Initialization method of Inference object.

Source code in pyproteininference/inference.py
def __init__(self, data, digest):
    """
    Initialization method of Inference object.

    Args:
        data (DataStore): [DataStore object][pyproteininference.datastore.DataStore].
        digest (Digest): [Digest object][pyproteininference.in_silico_digest.Digest].

    """
    self.data = data
    self.digest = digest
    self.data._validate_scored_proteins()

run_inference(data, digest) classmethod

This class method dispatches to one of the five different inference classes/models based on input from the ProteinInferenceParameter object. The methods are "parsimony", "inclusion", "exclusion", "peptide_centric", and "first_protein".

Example

pyproteininference.inference.Inference.run_inference(data=data,digest=digest)

Source code in pyproteininference/inference.py
@classmethod
def run_inference(cls, data, digest):
    """
    This class method dispatches to one of the five different inference classes/models
    based on input from the [ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter]
    object.
    The methods are "parsimony", "inclusion", "exclusion", "peptide_centric", and "first_protein".

    Args:
        data (DataStore): [DataStore object][pyproteininference.datastore.DataStore].
        digest (Digest): [Digest object][pyproteininference.in_silico_digest.Digest].

    Example:
        >>> pyproteininference.inference.Inference.run_inference(data=data,digest=digest)

    """

    inference_type = data.parameter_file_object.inference_type

    logger.info("Running Inference with Inference Type: {}".format(inference_type))

    if inference_type == Inference.PARSIMONY:
        group = Parsimony(data=data, digest=digest)
        group.infer_proteins()

    if inference_type == Inference.INCLUSION:
        group = Inclusion(data=data, digest=digest)
        group.infer_proteins()

    if inference_type == Inference.EXCLUSION:
        group = Exclusion(data=data, digest=digest)
        group.infer_proteins()

    if inference_type == Inference.FIRST_PROTEIN:
        group = FirstProtein(data=data, digest=digest)
        group.infer_proteins()

    if inference_type == Inference.PEPTIDE_CENTRIC:
        group = PeptideCentric(data=data, digest=digest)
        group.infer_proteins()

Parsimony

Bases: Inference

Parsimony Inference class. This class contains methods that support the initialization of a Parsimony inference method.

Attributes:
  • data (DataStore) –
  • digest (Digest) –
  • scored_data (list) –

    a List of scored Protein objects.

  • lead_protein_set (set) –

    Set of protein strings that are classified as leads from the LP solver.

Source code in pyproteininference/inference.py
class Parsimony(Inference):
    """
    Parsimony Inference class. This class contains methods that support the initialization of a
    Parsimony inference method.

    Attributes:
        data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
        digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
        scored_data (list): a List of scored [Protein][pyproteininference.physical.Protein] objects.
        lead_protein_set (set): Set of protein strings that are classified as leads from the LP solver.

    """

    def __init__(self, data, digest):
        """
        Initialization method of the Parsimony object.

        Args:
            data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
            digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
        """
        self.data = data
        self.digest = digest
        self.data._validate_scored_proteins()
        self.scored_data = self.data.get_protein_data()
        self.lead_protein_set = None
        self.parameter_file_object = data.parameter_file_object

    def _create_protein_groups(
        self,
        all_scored_proteins,
        lead_protein_objects,
        grouping_type="shared_peptides",
    ):
        """
        Internal method that creates a list of lists of [Protein][pyproteininference.physical.Protein]
        objects for the Parsimony inference object.
        These lists of lists are "groups", and the proteins are grouped according to the grouping_type variable.

        Args:
            all_scored_proteins (list): list of [Protein][pyproteininference.physical.Protein] objects.
            lead_protein_objects (list): list of [Protein][pyproteininference.physical.Protein] objects.
                Only needed if inference_type=parsimony.
            grouping_type (str): One of `GROUPING_TYPES`.

        Returns:
            list: list of lists of [Protein][pyproteininference.physical.Protein] objects.

        """

        logger.info("Grouping Peptides with Grouping Type: {}".format(grouping_type))
        logger.info("Grouping Peptides with Inference Type: {}".format(self.PARSIMONY))

        all_scored_proteins = sorted(
            all_scored_proteins,
            key=lambda k: (len(k.raw_peptides), k.identifier),
            reverse=True,
        )

        lead_scored_proteins = lead_protein_objects
        lead_scored_proteins = sorted(
            lead_scored_proteins,
            key=lambda k: (len(k.raw_peptides), k.identifier),
            reverse=True,
        )

        protein_finder = [x.identifier for x in all_scored_proteins]

        prot_pep_dict = self.data.protein_to_peptide_dictionary()

        protein_tracker = set()
        restricted_peptides_set = set(self.data.restricted_peptides)
        try:
            picked_removed = set([x.identifier for x in self.data.picked_proteins_removed])
        except TypeError:
            picked_removed = set()

        missing_proteins = set()
        in_silico_peptides_to_proteins = self.digest.peptide_to_protein_dictionary
        grouped_proteins = []
        for protein_objects in lead_scored_proteins:
            if protein_objects not in protein_tracker:
                protein_tracker.add(protein_objects)
                cur_protein_identifier = protein_objects.identifier

                # Set peptide variable if the peptide is in the restricted peptide set
                # Sort the peptides alphabetically
                protein_objects.peptides = set(
                    sorted([x for x in prot_pep_dict[cur_protein_identifier] if x in restricted_peptides_set])
                )
                protein_list_group = [protein_objects]
                current_peptides = prot_pep_dict[cur_protein_identifier]

                current_grouped_proteins = set()
                # Only consider peptides that survived restriction by
                # datastore.RestrictMainData
                for peptide in current_peptides:
                    if peptide in restricted_peptides_set:
                        # Get the proteins that map to the current peptide using in_silico_peptides_to_proteins
                        # First make sure our peptide is formatted properly...
                        if not peptide.isupper() or not peptide.isalpha():
                            # If the peptide is not all upper case or if its not all alphabetical...
                            peptide = Psm.remove_peptide_mods(peptide)
                        potential_protein_list = in_silico_peptides_to_proteins[peptide]
                        if not potential_protein_list:
                            logger.warning(
                                "Protein {} and Peptide {} is not in database...".format(
                                    protein_objects.identifier, peptide
                                )
                            )

                        # Assign proteins to groups based on shared peptide... unless the protein is equivalent
                        # to the current identifier
                        if grouping_type != self.NONE_GROUPING:
                            for protein in potential_protein_list:
                                # If statement below to avoid grouping the same protein twice and to not group the lead
                                if (
                                    protein not in current_grouped_proteins
                                    and protein != cur_protein_identifier
                                    and protein not in picked_removed
                                    and protein not in missing_proteins
                                ):
                                    try:
                                        # Try to find its object using protein_finder (list of identifiers) and
                                        # lead_scored_proteins (list of Protein Objects)
                                        cur_index = protein_finder.index(protein)
                                        current_protein_object = all_scored_proteins[cur_index]
                                        if not current_protein_object.peptides:
                                            current_protein_object.peptides = set(
                                                sorted(
                                                    [
                                                        x
                                                        for x in prot_pep_dict[current_protein_object.identifier]
                                                        if x in restricted_peptides_set
                                                    ]
                                                )
                                            )
                                        if grouping_type == self.SHARED_PEPTIDES:
                                            current_grouped_proteins.add(current_protein_object)
                                        elif grouping_type == self.SUBSET_PEPTIDES:
                                            if current_protein_object.peptides.issubset(protein_objects.peptides):
                                                current_grouped_proteins.add(current_protein_object)
                                                protein_tracker.add(current_protein_object)
                                            else:
                                                pass
                                        elif grouping_type == self.PARSIMONIOUS_GROUPING:
                                            if protein_objects.peptides == current_protein_object.peptides:
                                                current_grouped_proteins.add(current_protein_object)
                                                protein_tracker.add(current_protein_object)
                                        else:
                                            pass
                                    except ValueError:
                                        logger.warning(
                                            "Protein from DB {} not found with protein finder for peptide {}".format(
                                                protein, peptide
                                            )
                                        )
                                        missing_proteins.add(protein)

                                else:
                                    pass
                # Add the proteins to the lead if they share peptides...
                protein_list_group = protein_list_group + list(current_grouped_proteins)
                # protein_list_group at first is just the lead protein object...
                # We then try to apply grouping by looking at all peptides from the lead...
                # For each of these peptides, look at all other non-lead proteins and try to assign them to the group...
                # We assign the entire protein object as well... in the above try/except
                # Then append this sub group to the main list
                # The variable grouped_proteins is now a list of lists, with each inner list of Protein objects
                # corresponding to a group
                grouped_proteins.append(protein_list_group)

        return grouped_proteins
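The three grouping strategies used above reduce to simple set predicates. This is an illustrative sketch over plain peptide sets, not the library's implementation, which additionally resolves Protein objects from identifiers and tracks already-grouped proteins:

```python
def group_candidates(lead_peptides, candidate_peptides, grouping_type):
    """Return True if a candidate protein should join the lead's group.

    Both arguments are plain sets of peptide strings; grouping_type
    mirrors the GROUPING_TYPES values described above.
    """
    if grouping_type == "shared_peptides":
        # Any candidate that shares at least one peptide with the lead joins.
        return bool(lead_peptides & candidate_peptides)
    if grouping_type == "subset_peptides":
        # Candidate joins only if its peptides are a subset of the lead's.
        return candidate_peptides.issubset(lead_peptides)
    if grouping_type == "parsimonious_grouping":
        # Candidate joins only if its peptide set is identical to the lead's.
        return candidate_peptides == lead_peptides
    # "none" grouping: nothing is grouped with the lead.
    return False
```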

    def _swissprot_and_isoform_override(
        self,
        scored_data,
        grouped_proteins,
        override_type="soft",
        isoform_override=True,
    ):
        """
        This internal method creates and reorders protein groups based on criteria such as Reviewed/Unreviewed
        Identifiers as well as Canonical/Isoform Identifiers.
        This method is only used with parsimony inference type.

        Args:
            scored_data (list): list of scored [Protein][pyproteininference.physical.Protein] objects.
            grouped_proteins:  list of grouped [Protein][pyproteininference.physical.Protein] objects.
            override_type (str): "soft" or "hard" to indicate Reviewed/Unreviewed override. "soft" is preferred and
                default.
            isoform_override (bool): True/False on whether to favor canonical forms vs isoforms as group leads.

        Returns:
            dict: a Dictionary that contains a list of [ProteinGroup][pyproteininference.physical.ProteinGroup] objects
            (key:"group_objects") and a list of grouped [Protein][pyproteininference.physical.Protein]
            objects (key:"grouped_protein_objects").


        """

        sp_protein_set = set(self.digest.swiss_prot_protein_set)
        scored_proteins = list(scored_data)
        protein_finder = [x.identifier for x in scored_proteins]

        prot_pep_dict = self.data.protein_to_peptide_dictionary()

        # Get the higher or lower variable
        higher_or_lower = self.data.higher_or_lower()

        logger.info("Applying Group IDs... and Executing {} Swissprot Override...".format(override_type))
        # Here we create group IDs for all groups and do some sorting
        grouped_protein_objects = []
        group_id = 0
        leads = set()
        protein_group_objects = []
        for protein_group in grouped_proteins:
            protein_list = []
            group_id = group_id + 1
            # Make a protein group
            pg = ProteinGroup(group_id)
            logger.debug("Created Protein Group with ID: {}".format(str(group_id)))
            for prots in protein_group:
                # Loop over all proteins in the original group
                try:
                    # The following loop assigns group_id's, reviewed/unreviewed status, and number of unique peptides
                    pindex = protein_finder.index(prots.identifier)
                    # Attempt to find the protein object by identifier
                    cur_protein = scored_proteins[pindex]
                    if group_id not in cur_protein.group_identification:
                        cur_protein.group_identification.add(group_id)
                    if prots.identifier in sp_protein_set:
                        cur_protein.reviewed = True
                    else:
                        cur_protein.unreviewed = True
                    cur_identifier = prots.identifier
                    cur_protein.num_peptides = len(prot_pep_dict[cur_identifier])
                    # Here append the number of unique peptides... so we can use this as secondary sorting...
                    protein_list.append(cur_protein)
                    # Sorted groups then becomes a list of lists... of protein objects

                except ValueError:
                    # Here we pass if the protein does not have a score...
                    # Potentially it got 'picked' (removed) by protein picker...
                    pass

            # Sort protein sub group
            protein_list = datastore.DataStore.sort_protein_sub_groups(
                protein_list=protein_list, higher_or_lower=higher_or_lower
            )

            # grouped_protein_objects is the MAIN list of lists with grouped protein objects
            grouped_protein_objects.append(protein_list)
            # If the lead is reviewed append it to leads and do nothing else...
            # If the lead is unreviewed then try to replace it with the best reviewed hit
            # Run swissprot override
            if self.data.parameter_file_object.reviewed_identifier_symbol:
                sp_override = self._swissprot_override(
                    protein_list=protein_list,
                    leads=leads,
                    grouped_protein_objects=grouped_protein_objects,
                    override_type=override_type,
                )
                grouped_protein_objects = sp_override["grouped_protein_objects"]
                leads = sp_override["leads"]
                protein_list = sp_override["protein_list"]

            # Run isoform override If we want to run isoform_override and if the isoform symbol exists...
            if isoform_override and self.data.parameter_file_object.isoform_symbol:
                iso_override = self._isoform_override(
                    protein_list=protein_list,
                    leads=leads,
                    grouped_protein_objects=grouped_protein_objects,
                )
                grouped_protein_objects = iso_override["grouped_protein_objects"]
                leads = iso_override["leads"]
                protein_list = iso_override["protein_list"]

            pg.proteins = protein_list
            protein_group_objects.append(pg)

        return_dict = {
            "grouped_protein_objects": grouped_protein_objects,
            "group_objects": protein_group_objects,
        }

        return return_dict

    def _swissprot_override(self, protein_list, leads, grouped_protein_objects, override_type):
        """
        This method re-assigns protein group leads if the lead is an unreviewed protein and if the protein group
         contains a reviewed protein that contains the exact same set of peptides as the unreviewed lead.
        This method is here to provide consistency to the output.

        Args:
            protein_list (list): List of grouped [Protein][pyproteininference.physical.Protein] objects.
            leads (set): Set of string protein identifiers that have been identified as a lead.
            grouped_protein_objects (list): List of protein_list lists.
            override_type (str): "soft" or "hard" on how to override non reviewed identifiers. "soft" is preferred.

        Returns:
            dict: leads (set): Set of string protein identifiers that have been identified as a lead.
             Updated to reflect lead changes.
            grouped_protein_objects (list): List of protein_list lists. Updated to reflect lead changes.
            protein_list (list): List of grouped [Protein][pyproteininference.physical.Protein] objects.
                Updated to reflect lead changes.

        """

        if not protein_list[0].reviewed:
            # If the lead is unreviewed attempt to replace it...
            # Start to loop through protein_list which is the current group...
            for protein in protein_list[1:]:
                # Find the first reviewed protein... if it's not already a lead protein, swap it in and break...
                if protein.reviewed:
                    best_swiss_prot_prot = protein

                    if override_type == "soft":
                        # If the lead protein's peptides are a subset of the best swissprot's, swap the proteins
                        # (meaning equal peptides, or the SwissProt entry completely covers the TrEMBL reference)
                        if best_swiss_prot_prot.identifier not in leads and set(protein_list[0].peptides).issubset(
                            set(best_swiss_prot_prot.peptides)
                        ):
                            # We use -1 as the index of grouped_protein_objects because the current 'protein_list' is
                            # the last entry appended to grouped_protein_objects
                            # Essentially grouped_protein_objects[-1]==protein_list
                            # We need this syntax so we can switch the location of the unreviewed lead identifier with
                            # the best reviewed identifier in grouped_protein_objects
                            swiss_prot_override_index = grouped_protein_objects[-1].index(best_swiss_prot_prot)
                            cur_tr_lead = grouped_protein_objects[-1][0]
                            (
                                grouped_protein_objects[-1][0],
                                grouped_protein_objects[-1][swiss_prot_override_index],
                            ) = (
                                grouped_protein_objects[-1][swiss_prot_override_index],
                                grouped_protein_objects[-1][0],
                            )
                            new_sp_lead = grouped_protein_objects[-1][0]
                            logger.info(
                                "Overriding Unreviewed {} with Reviewed {}".format(
                                    cur_tr_lead.identifier, new_sp_lead.identifier
                                )
                            )

                            # Append new_sp_lead protein to leads, to make sure we don't repeat leads
                            leads.add(new_sp_lead.identifier)
                            break
                        else:
                            # The reviewed protein is already a lead or fails the subset check, so do nothing...
                            pass

                    if override_type == "hard":
                        if best_swiss_prot_prot.identifier not in leads:
                            # We use -1 as the index of grouped_protein_objects because the current 'protein_list'
                            # is the last entry appended to grouped_protein_objects
                            # Essentially grouped_protein_objects[-1]==protein_list
                            # We need this syntax so we can switch the location of the unreviewed lead identifier
                            # with the best reviewed identifier in grouped_protein_objects
                            swiss_prot_override_index = grouped_protein_objects[-1].index(best_swiss_prot_prot)
                            cur_tr_lead = grouped_protein_objects[-1][0]
                            # Re-assigning the value within the index will also reassign the value in protein_list...
                            # This is because grouped_protein_objects[-1] equals protein_list
                            # So we do not have to reassign values in protein_list
                            (
                                grouped_protein_objects[-1][0],
                                grouped_protein_objects[-1][swiss_prot_override_index],
                            ) = (
                                grouped_protein_objects[-1][swiss_prot_override_index],
                                grouped_protein_objects[-1][0],
                            )
                            new_sp_lead = grouped_protein_objects[-1][0]
                            logger.info(
                                "Overriding Unreviewed {} with Reviewed {}".format(
                                    cur_tr_lead.identifier, new_sp_lead.identifier
                                )
                            )

                            # Append new_sp_lead protein to leads, to make sure we don't repeat leads
                            leads.add(new_sp_lead.identifier)
                            break
                        else:
                            # The reviewed protein is already a lead, so do nothing...
                            pass

                else:
                    pass

        else:
            leads.add(protein_list[0].identifier)

        return_dict = {
            "leads": leads,
            "grouped_protein_objects": grouped_protein_objects,
            "protein_list": protein_list,
        }

        return return_dict
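The "soft" override logic above reduces to a single swap. In this sketch, proteins are simplified to `(identifier, reviewed, peptides)` tuples rather than Protein objects, and the group is a plain list; the real method additionally mirrors the swap into `grouped_protein_objects[-1]`:

```python
def soft_override(protein_list, leads):
    """If the group lead is unreviewed, promote the first reviewed
    protein (not already a lead) whose peptides cover the lead's.

    protein_list: list of (identifier, reviewed, peptides) tuples,
    lead first. leads: set of identifiers already used as leads.
    """
    identifier, reviewed, peptides = protein_list[0]
    if reviewed:
        # Reviewed lead: record it and leave the group untouched.
        leads.add(identifier)
        return protein_list
    for i, (cand_id, cand_reviewed, cand_peptides) in enumerate(
        protein_list[1:], start=1
    ):
        if (
            cand_reviewed
            and cand_id not in leads
            and peptides.issubset(cand_peptides)
        ):
            # Swap the reviewed candidate into the lead position.
            protein_list[0], protein_list[i] = protein_list[i], protein_list[0]
            leads.add(cand_id)
            break
    return protein_list


group = soft_override(
    [("TR_A", False, {"PEPA"}), ("SP_B", True, {"PEPA", "PEPB"})],
    leads=set(),
)
```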

    def _isoform_override(self, protein_list, grouped_protein_objects, leads):
        """
        This method re-assigns protein group leads if the lead is an isoform protein and if the protein group contains
        a canonical protein that contains the exact same set of peptides as the isoform lead.
        This method is here to provide consistency to the output.

        Args:
            protein_list (list): List of grouped [Protein][pyproteininference.physical.Protein] objects.
            leads (set): Set of string protein identifiers that have been identified as a lead.
            grouped_protein_objects (list): List of protein_list lists.

        Returns:
            dict: leads (set): Set of string protein identifiers that have been identified as a lead. Updated to
                reflect lead changes.
            grouped_protein_objects (list): List of protein_list lists. Updated to reflect lead changes.
            protein_list (list): List of grouped [Protein][pyproteininference.physical.Protein] objects.
                Updated to reflect lead changes.


        """

        if self.data.parameter_file_object.isoform_symbol in protein_list[0].identifier:
            pure_id = protein_list[0].identifier.split(self.data.parameter_file_object.isoform_symbol)[0]
            # Start to loop through protein_list which is the current group...
            for potential_replacement in protein_list[1:]:
                isoform_override = potential_replacement
                if (
                    isoform_override.identifier == pure_id
                    and isoform_override.identifier not in leads
                    and set(protein_list[0].peptides).issubset(set(isoform_override.peptides))
                ):
                    isoform_override_index = grouped_protein_objects[-1].index(isoform_override)
                    cur_iso_lead = grouped_protein_objects[-1][0]
                    # Re-assigning the value within the index will also reassign the value in protein_list...
                    # This is because grouped_protein_objects[-1] equals protein_list
                    # So we do not have to reassign values in protein_list
                    (
                        grouped_protein_objects[-1][0],
                        grouped_protein_objects[-1][isoform_override_index],
                    ) = (
                        grouped_protein_objects[-1][isoform_override_index],
                        grouped_protein_objects[-1][0],
                    )

                    new_iso_lead = grouped_protein_objects[-1][0]
                    logger.info(
                        "Overriding Isoform {} with {}".format(cur_iso_lead.identifier, new_iso_lead.identifier)
                    )
                    leads.add(protein_list[0].identifier)

        return_dict = {
            "leads": leads,
            "grouped_protein_objects": grouped_protein_objects,
            "protein_list": protein_list,
        }

        return return_dict
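The in-place swap above relies on Python list aliasing: because `grouped_protein_objects[-1]` and `protein_list` refer to the same list object, a swap performed through one name is visible through the other, which is why `protein_list` never needs to be reassigned. A minimal sketch of that aliasing behavior, using placeholder strings rather than the actual Protein objects:

```python
# Demonstrates the aliasing the comments above rely on:
# mutating the list through one reference is visible through the other.
grouped = [["P1", "P2", "P3"]]   # list of groups; the last group is the current one
protein_list = grouped[-1]        # an alias, NOT a copy

# Swap the lead (index 0) with the override at index 2.
idx = 2
grouped[-1][0], grouped[-1][idx] = grouped[-1][idx], grouped[-1][0]

print(protein_list)  # ['P3', 'P2', 'P1'] -- the alias reflects the swap
```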

    def _reassign_protein_group_leads(self, protein_group_objects):
        """
        This internal method corrects leads that are improperly assigned in the parsimony inference method.
        This method acts on the protein group objects.

        Args:
            protein_group_objects (list): List of [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.

        Returns:
            protein_group_objects (list): List of [ProteinGroup][pyproteininference.physical.ProteinGroup] objects
            where leads have been reassigned properly.


        """

        # Get the higher or lower variable
        if not self.data.high_low_better:
            higher_or_lower = self.data.higher_or_lower()
        else:
            higher_or_lower = self.data.high_low_better

        # Sometimes we have cases where:
        # protein a maps to peptides 1,2,3
        # protein b maps to peptides 1,2
        # protein c maps to a bunch of peptides and peptide 3
        # Therefore, in the model proteins a and b are equivalent in that they map to 2 peptides together - 1 and 2.
        # peptide 3 maps to a but also to c...
        # Sometimes the model (pulp) will spit out protein b as the lead... we wish to swap protein b as the lead with
        # protein a because it will likely have a better score...
        logger.info("Potentially Reassigning Protein Group leads...")
        lead_protein_set = set([x.proteins[0].identifier for x in protein_group_objects])
        for i in range(len(protein_group_objects)):
            for j in range(1, len(protein_group_objects[i].proteins)):  # Loop over all sub proteins in the group...
                # if the lead proteins peptides are a subset of one of its proteins in the group, and the secondary
                # protein is not a lead protein and its score is better than the leads... and it has more peptides...
                new_lead = protein_group_objects[i].proteins[j]
                old_lead = protein_group_objects[i].proteins[0]
                if higher_or_lower == datastore.DataStore.HIGHER_PSM_SCORE:
                    if (
                        set(old_lead.peptides).issubset(set(new_lead.peptides))
                        and new_lead.identifier not in lead_protein_set
                        and old_lead.score <= new_lead.score
                        and len(old_lead.peptides) < len(new_lead.peptides)
                    ):
                        logger.info(
                            "protein {} will replace protein {} as lead, with index {}, New Num Peptides: {}, "
                            "Old Num Peptides: {}".format(
                                str(new_lead.identifier),
                                str(old_lead.identifier),
                                str(j),
                                str(len(new_lead.peptides)),
                                str(len(old_lead.peptides)),
                            )
                        )
                        lead_protein_set.add(new_lead.identifier)
                        lead_protein_set.remove(old_lead.identifier)
                        # Swap their positions in the list
                        (
                            protein_group_objects[i].proteins[0],
                            protein_group_objects[i].proteins[j],
                        ) = (new_lead, old_lead)
                        break

                if higher_or_lower == datastore.DataStore.LOWER_PSM_SCORE:
                    if (
                        set(old_lead.peptides).issubset(set(new_lead.peptides))
                        and new_lead.identifier not in lead_protein_set
                        and old_lead.score >= new_lead.score
                        and len(old_lead.peptides) < len(new_lead.peptides)
                    ):
                        logger.info(
                            "protein {} will replace protein {} as lead, with index {}, New Num Peptides: {}, "
                            "Old Num Peptides: {}".format(
                                str(new_lead.identifier),
                                str(old_lead.identifier),
                                str(j),
                                str(len(new_lead.peptides)),
                                str(len(old_lead.peptides)),
                            )
                        )
                        lead_protein_set.add(new_lead.identifier)
                        lead_protein_set.remove(old_lead.identifier)
                        # Swap their positions in the list
                        (
                            protein_group_objects[i].proteins[0],
                            protein_group_objects[i].proteins[j],
                        ) = (new_lead, old_lead)
                        break

        return protein_group_objects
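The swap condition above (the lead's peptides are a subset of the candidate's, the candidate is not already a lead elsewhere, the candidate scores at least as well, and it has strictly more peptides) can be sketched on plain tuples. This is an illustration with invented data, not the library's Protein objects, and it assumes the "higher score is better" branch:

```python
# Each protein: (identifier, score, set_of_peptides). Higher score is better here.
group = [
    ("B", 5.0, {"pep1", "pep2"}),            # current lead
    ("A", 9.0, {"pep1", "pep2", "pep3"}),    # candidate with a superset of peptides
]
leads_elsewhere = set()  # identifiers already used as leads in other groups

old_id, old_score, old_peps = group[0]
new_id, new_score, new_peps = group[1]

if (old_peps.issubset(new_peps)
        and new_id not in leads_elsewhere
        and old_score <= new_score
        and len(old_peps) < len(new_peps)):
    group[0], group[1] = group[1], group[0]  # promote the candidate to lead

print(group[0][0])  # A
```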

    def _reassign_protein_list_leads(self, grouped_protein_objects):
        """
        This internal method corrects leads that are improperly assigned in the parsimony inference method.
        This method acts on the grouped protein objects.

        Args:
            grouped_protein_objects (list): List of [Protein][pyproteininference.physical.Protein] objects.

        Returns:
            list: List of [Protein][pyproteininference.physical.Protein] objects where leads have been
                reassigned properly.


        """

        # Get the higher or lower variable
        if not self.data.high_low_better:
            higher_or_lower = self.data.higher_or_lower()
        else:
            higher_or_lower = self.data.high_low_better

        # Sometimes we have cases where:
        # protein a maps to peptides 1,2,3
        # protein b maps to peptides 1,2
        # protein c maps to a bunch of peptides and peptide 3
        # Therefore, in the model proteins a and b are equivalent in that they map to 2 peptides together - 1 and 2.
        # peptide 3 maps to a but also to c...
        # Sometimes the model (pulp) will spit out protein b as the lead... we wish to swap protein b as the lead with
        # protein a because it will likely have a better score...
        logger.info("Potentially Reassigning Protein List leads...")
        lead_protein_set = set([x[0].identifier for x in grouped_protein_objects])
        for i in range(len(grouped_protein_objects)):
            for j in range(1, len(grouped_protein_objects[i])):  # Loop over all sub proteins in the group...
                # if the lead proteins peptides are a subset of one of its proteins in the group, and the secondary
                # protein is not a lead protein and its score is better than the leads... and it has more peptides...
                new_lead = grouped_protein_objects[i][j]
                old_lead = grouped_protein_objects[i][0]
                if higher_or_lower == datastore.DataStore.HIGHER_PSM_SCORE:
                    if (
                        set(old_lead.peptides).issubset(set(new_lead.peptides))
                        and new_lead.identifier not in lead_protein_set
                        and old_lead.score <= new_lead.score
                        and len(old_lead.peptides) < len(new_lead.peptides)
                    ):
                        logger.info(
                            "protein {} will replace protein {} as lead, with index {}, New Num Peptides: {}, "
                            "Old Num Peptides: {}".format(
                                str(new_lead.identifier),
                                str(old_lead.identifier),
                                str(j),
                                str(len(new_lead.peptides)),
                                str(len(old_lead.peptides)),
                            )
                        )
                        lead_protein_set.add(new_lead.identifier)
                        lead_protein_set.remove(old_lead.identifier)
                        # Swap their positions in the list
                        (
                            grouped_protein_objects[i][0],
                            grouped_protein_objects[i][j],
                        ) = (new_lead, old_lead)
                        break

                if higher_or_lower == datastore.DataStore.LOWER_PSM_SCORE:
                    if (
                        set(old_lead.peptides).issubset(set(new_lead.peptides))
                        and new_lead.identifier not in lead_protein_set
                        and old_lead.score >= new_lead.score
                        and len(old_lead.peptides) < len(new_lead.peptides)
                    ):
                        logger.info(
                            "protein {} will replace protein {} as lead, with index {}, New Num Peptides: {}, "
                            "Old Num Peptides: {}".format(
                                str(new_lead.identifier),
                                str(old_lead.identifier),
                                str(j),
                                str(len(new_lead.peptides)),
                                str(len(old_lead.peptides)),
                            )
                        )
                        lead_protein_set.add(new_lead.identifier)
                        lead_protein_set.remove(old_lead.identifier)
                        # Swap their positions in the list
                        (
                            grouped_protein_objects[i][0],
                            grouped_protein_objects[i][j],
                        ) = (new_lead, old_lead)
                        break

        return grouped_protein_objects

    def _pulp_grouper(self):
        """
        This internal function uses pulp to solve the lp problem for parsimony then performs protein grouping with the
         various internal grouping functions.

        This method assigns the variables: `grouped_scored_proteins` and `protein_group_objects`.
        These are both variables of the [DataStore object][pyproteininference.datastore.DataStore] and are
        lists of [Protein][pyproteininference.physical.Protein] objects
        and [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.

        """

        # Here we get the peptide to protein dictionary
        pep_prot_dict = self.data.peptide_to_protein_dictionary()

        self.data.protein_to_peptide_dictionary()

        identifiers_sorted = self.data.get_sorted_identifiers(scored=True)

        # Get all the proteins that we scored and the ones picked if picker was run...
        data_proteins = sorted([x for x in self.data.protein_peptide_dictionary.keys() if x in identifiers_sorted])
        # Get the set of peptides for each protein...
        data_peptides = [set(self.data.protein_peptide_dictionary[x]) for x in data_proteins]
        flat_peptides_in_data = set([item for sublist in data_peptides for item in sublist])

        peptide_sets = []
        # Loop over the list of peptides...
        for k in range(len(data_peptides)):
            raw_peptides = data_peptides[k]
            peptide_set = set()
            # Loop over each individual peptide per protein...
            for peps in raw_peptides:
                peptide = peps

                # Remove mods...
                new_peptide = Psm.remove_peptide_mods(peptide)
                # Add it to a temporary set...
                peptide_set.add(new_peptide)
            # Append this set to a new list...
            peptide_sets.append(peptide_set)
            # Set that proteins peptides to be the unmodified ones...
            data_peptides[k] = peptide_set

        # Get them all...
        all_peptides = [x for x in data_peptides]
        # Remove redundant sets...
        non_redundant_peptide_sets = [set(i) for i in OrderedDict.fromkeys(frozenset(item) for item in peptide_sets)]

        # Loop over the non-redundant peptide sets...
        ind_list = []
        for pep_sets in non_redundant_peptide_sets:
            # Get its index in terms of the overall list...
            ind_list.append(all_peptides.index(pep_sets))

        # Get the protein based on the index
        restricted_proteins = [data_proteins[x] for x in range(len(data_peptides)) if x in ind_list]

        # Here we get the list of all proteins
        plist = []
        for peps in pep_prot_dict.keys():
            for prots in list(pep_prot_dict[peps]):
                if prots in restricted_proteins and peps in flat_peptides_in_data:
                    plist.append(prots)

        # Here we get the unique proteins
        unique_prots = list(set(plist))
        unique_protein_set = set(unique_prots)

        unique_prots_sorted = [x for x in identifiers_sorted if x in unique_prots]

        # Define the protein variables with a lower bound of 0 and category Integer
        prots = pulp.LpVariable.dicts("prot", indices=unique_prots_sorted, lowBound=0, cat="Integer")

        # Define our Lp Problem which is to Minimize our objective function
        prob = pulp.LpProblem("Parsimony_Problem", pulp.LpMinimize)

        # Define our objective function, which is to take the sum of all of our proteins and find the minimum set.
        prob += pulp.lpSum([prots[i] for i in prots])

        # Set up our constraints. The constraints are as follows:

        # Loop over each peptide and determine the proteins it maps to...
        # Each peptide is a constraint with the proteins it maps to having to be greater than or equal to 1
        # In the case below we see that protein 3 has a unique peptide, protein 2 is redundant

        logger.info("Sorting peptides before looping")
        for peptides in sorted(list(pep_prot_dict.keys())):
            try:
                prob += (
                    pulp.lpSum([prots[i] for i in sorted(list(pep_prot_dict[peptides])) if i in unique_protein_set])
                    >= 1
                )
            except KeyError:
                logger.info("Not including proteins {} in pulp model".format(pep_prot_dict[peptides]))

        prob.solve()

        scored_data = self.data.get_protein_data()
        scored_proteins = list(scored_data)
        protein_finder = [x.identifier for x in scored_proteins]

        lead_protein_objects = []
        lead_protein_identifiers = []
        for proteins in unique_prots_sorted:
            parsimony_value = pulp.value(prots[proteins])
            if proteins in protein_finder and parsimony_value == 1:
                p_ind = protein_finder.index(proteins)
                protein_object = scored_proteins[p_ind]
                lead_protein_objects.append(protein_object)
                lead_protein_identifiers.append(protein_object.identifier)
            else:
                if parsimony_value == 1:
                    # Why are some proteins not being found when we run exclusion???
                    logger.warning("Protein {} not found with protein finder...".format(proteins))
                else:
                    pass

        self.lead_protein_objects = lead_protein_objects

        grouped_proteins = self._create_protein_groups(
            all_scored_proteins=scored_data,
            lead_protein_objects=self.lead_protein_objects,
            grouping_type=self.data.parameter_file_object.grouping_type,
        )

        regrouped_proteins = self._swissprot_and_isoform_override(
            scored_data=scored_data,
            grouped_proteins=grouped_proteins,
            override_type="soft",
            isoform_override=True,
        )

        grouped_protein_objects = regrouped_proteins["grouped_protein_objects"]
        protein_group_objects = regrouped_proteins["group_objects"]

        # Get the higher or lower variable
        hl = self.data.higher_or_lower()

        logger.info("Sorting Results based on lead Protein Score")
        grouped_protein_objects = datastore.DataStore.sort_protein_objects(
            grouped_protein_objects=grouped_protein_objects, higher_or_lower=hl
        )
        protein_group_objects = datastore.DataStore.sort_protein_group_objects(
            protein_group_objects=protein_group_objects, higher_or_lower=hl
        )

        # Run lead reassignment for the group objets and protein objects
        protein_group_objects = self._reassign_protein_group_leads(
            protein_group_objects=protein_group_objects,
        )

        grouped_protein_objects = self._reassign_protein_list_leads(grouped_protein_objects=grouped_protein_objects)

        logger.info("Re Sorting Results based on lead Protein Score")
        grouped_protein_objects = datastore.DataStore.sort_protein_objects(
            grouped_protein_objects=grouped_protein_objects, higher_or_lower=hl
        )
        protein_group_objects = datastore.DataStore.sort_protein_group_objects(
            protein_group_objects=protein_group_objects, higher_or_lower=hl
        )

        self.data.grouped_scored_proteins = grouped_protein_objects
        self.data.protein_group_objects = protein_group_objects
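
The pulp model solved above is a minimum set cover: pick the smallest set of proteins such that every peptide maps to at least one picked protein. A brute-force pure-Python sketch of the same objective and constraints (illustrative only, with invented toy data; the library solves this exactly via the LP formulation rather than enumeration):

```python
from itertools import combinations

# peptide -> proteins that could have produced it (toy data)
pep_prot = {
    "pep1": {"protA", "protB"},
    "pep2": {"protA", "protB"},
    "pep3": {"protA", "protC"},
    "pep4": {"protC"},
}
proteins = sorted(set().union(*pep_prot.values()))

def smallest_cover(pep_prot, proteins):
    """Return the smallest protein set covering every peptide (ties broken by order)."""
    for size in range(1, len(proteins) + 1):
        for combo in combinations(proteins, size):
            chosen = set(combo)
            # Constraint: each peptide must map to >= 1 chosen protein.
            if all(pep_prot[p] & chosen for p in pep_prot):
                return chosen
    return set(proteins)

print(smallest_cover(pep_prot, proteins))  # {'protA', 'protC'}
```

Here protB is the redundant protein: its peptides are all explained by protA, so the minimal cover excludes it, mirroring the redundancy the constraint comments describe.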

    def infer_proteins(self):
        """
        This method performs the Parsimony inference method and uses pulp for the LP solver.

        This method assigns the variables: `grouped_scored_proteins` and `protein_group_objects`.
        These are both variables of the [DataStore object][pyproteininference.datastore.DataStore] and are
        lists of [Protein][pyproteininference.physical.Protein] objects
        and [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.

        """

        if self.parameter_file_object.lp_solver == self.PULP:

            self._pulp_grouper()

        else:
            raise ValueError(
                "Parsimony cannot run if lp_solver parameter value is not one of the following: {}".format(
                    ", ".join(Inference.LP_SOLVERS)
                )
            )

        # Call assign shared peptides
        self._assign_shared_peptides(shared_pep_type=self.parameter_file_object.shared_peptides)

    def _assign_shared_peptides(self, shared_pep_type="all"):

        if not (self.data.grouped_scored_proteins and self.data.protein_group_objects):
            raise ValueError(
                "Grouped Protein objects could not be found. Please run 'infer_proteins' method of the Parsimony class"
            )

        if shared_pep_type == self.ALL_SHARED_PEPTIDES:
            pass

        elif shared_pep_type == self.BEST_SHARED_PEPTIDES:
            logger.info("Assigning Shared Peptides from Parsimony to the Best Scoring Protein")
            raw_peptide_tracker = set()
            peptide_tracker = set()
            for prots in self.data.grouped_scored_proteins:
                new_psms = []
                new_raw_peptides = set()
                new_peptides = set()
                lead_prot = prots[0]
                for psm in lead_prot.psms:
                    raw_pep = psm.identifier
                    pep = psm.non_flanking_peptide
                    if raw_pep not in raw_peptide_tracker:
                        new_raw_peptides.add(raw_pep)
                        raw_peptide_tracker.add(raw_pep)
                    if pep not in peptide_tracker:
                        new_peptides.add(pep)
                        new_psms.append(psm)
                        peptide_tracker.add(pep)
                lead_prot.psms = new_psms
                lead_prot.raw_peptides = new_raw_peptides
                lead_prot.peptides = new_peptides

            raw_peptide_tracker = set()
            peptide_tracker = set()
            for group in self.data.protein_group_objects:
                lead_prot = group.proteins[0]
                new_psms = []
                new_raw_peptides = set()
                new_peptides = set()
                for psm in lead_prot.psms:
                    raw_pep = psm.identifier
                    pep = psm.non_flanking_peptide
                    if raw_pep not in raw_peptide_tracker:
                        new_raw_peptides.add(raw_pep)
                        raw_peptide_tracker.add(raw_pep)
                    if pep not in peptide_tracker:
                        new_peptides.add(pep)
                        new_psms.append(psm)
                        peptide_tracker.add(pep)

                lead_prot.psms = new_psms
                lead_prot.raw_peptides = new_raw_peptides
                lead_prot.peptides = new_peptides

        else:
            pass
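In the "best" branch above, because groups are already sorted by lead score, a shared peptide is kept only by the first (best-scoring) lead that claims it; a global tracker set drops it from every later lead. A stripped-down sketch of that bookkeeping with invented data:

```python
# Groups sorted best-first; each lead protein lists its peptides.
groups = [
    ("lead1", ["pepA", "pepB"]),
    ("lead2", ["pepB", "pepC"]),  # pepB is shared with the better-scoring lead1
]

peptide_tracker = set()
assigned = {}
for lead, peptides in groups:
    kept = []
    for pep in peptides:
        if pep not in peptide_tracker:   # first (best) lead to claim it wins
            kept.append(pep)
            peptide_tracker.add(pep)
    assigned[lead] = kept

print(assigned)  # {'lead1': ['pepA', 'pepB'], 'lead2': ['pepC']}
```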

__init__(data, digest)

Initialization method of the Parsimony object.

Parameters:
  • data (DataStore) – DataStore Object.
  • digest (Digest) – Digest Object.
Source code in pyproteininference/inference.py
def __init__(self, data, digest):
    """
    Initialization method of the Parsimony object.

    Args:
        data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
        digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].
    """
    self.data = data
    self.digest = digest
    self.data._validate_scored_proteins()
    self.scored_data = self.data.get_protein_data()
    self.lead_protein_set = None
    self.parameter_file_object = data.parameter_file_object

infer_proteins()

This method performs the Parsimony inference method and uses pulp for the LP solver.

This method assigns the variables: grouped_scored_proteins and protein_group_objects. These are both variables of the DataStore object and are lists of Protein objects and ProteinGroup objects.

Source code in pyproteininference/inference.py
def infer_proteins(self):
    """
    This method performs the Parsimony inference method and uses pulp for the LP solver.

    This method assigns the variables: `grouped_scored_proteins` and `protein_group_objects`.
    These are both variables of the [DataStore object][pyproteininference.datastore.DataStore] and are
    lists of [Protein][pyproteininference.physical.Protein] objects
    and [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.

    """

    if self.parameter_file_object.lp_solver == self.PULP:

        self._pulp_grouper()

    else:
        raise ValueError(
            "Parsimony cannot run if lp_solver parameter value is not one of the following: {}".format(
                ", ".join(Inference.LP_SOLVERS)
            )
        )

    # Call assign shared peptides
    self._assign_shared_peptides(shared_pep_type=self.parameter_file_object.shared_peptides)

PeptideCentric

Bases: Inference

PeptideCentric Inference class. This class contains methods that support the initialization of a PeptideCentric inference method.

Attributes:
  • data (DataStore) – DataStore Object.
  • digest (Digest) – Digest Object.
Source code in pyproteininference/inference.py
class PeptideCentric(Inference):
    """
    PeptideCentric Inference class. This class contains methods that support the initialization of a
    PeptideCentric inference method.

    Attributes:
        data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
        digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].

    """

    def __init__(self, data, digest):
        """
        PeptideCentric Inference initialization method.

        Args:
            data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
            digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].

        Returns:
            object:
        """
        self.data = data
        self.digest = digest
        self.data._validate_scored_proteins()
        self.scored_data = self.data.get_protein_data()

    def infer_proteins(self):
        """
        This method performs the Peptide Centric inference method.

        This method assigns the variables: `grouped_scored_proteins` and `protein_group_objects`.
        These are both variables of the [DataStore object][pyproteininference.datastore.DataStore] and are
        lists of [Protein][pyproteininference.physical.Protein] objects
        and [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.

        Returns:
            None:

        """

        # Get the higher or lower variable
        hl = self.data.higher_or_lower()

        logger.info("Applying Group ID's for the Peptide Centric Method")
        regrouped_proteins = self._apply_protein_group_ids()

        grouped_protein_objects = regrouped_proteins["grouped_protein_objects"]
        protein_group_objects = regrouped_proteins["group_objects"]

        logger.info("Sorting Results based on lead Protein Score")
        grouped_protein_objects = datastore.DataStore.sort_protein_objects(
            grouped_protein_objects=grouped_protein_objects, higher_or_lower=hl
        )
        protein_group_objects = datastore.DataStore.sort_protein_group_objects(
            protein_group_objects=protein_group_objects, higher_or_lower=hl
        )

        self.data.grouped_scored_proteins = grouped_protein_objects
        self.data.protein_group_objects = protein_group_objects

    def _apply_protein_group_ids(self):
        """
        This method creates the ProteinGroup objects for the peptide_centric inference based on protein groups
        from [._create_protein_groups][pyproteininference.inference.Inference._create_protein_groups].

        Returns:
            dict: a Dictionary that contains a list of [ProteinGroup][pyproteininference.physical.ProteinGroup]
            objects (key:"group_objects") and a list of grouped [Protein][pyproteininference.physical.Protein]
            objects (key:"grouped_protein_objects").

        """

        grouped_protein_objects = self.data.get_protein_data()

        # Here we create group ID's
        group_id = 0
        list_of_proteins_grouped = []
        protein_group_objects = []
        for protein_group in grouped_protein_objects:
            protein_group.peptides = set(
                [Psm.split_peptide(peptide_string=x) for x in list(protein_group.raw_peptides)]
            )
            protein_list = []
            group_id = group_id + 1
            pg = ProteinGroup(group_id)
            logger.debug("Created Protein Group with ID: {}".format(str(group_id)))
            # The following loop assigns group_id's, reviewed/unreviewed status, and number of unique peptides...
            if group_id not in protein_group.group_identification:
                protein_group.group_identification.add(group_id)
            protein_group.num_peptides = len(protein_group.peptides)
            # Here append the number of unique peptides... so we can use this as secondary sorting...
            protein_list.append(protein_group)
            # Sorted protein_groups then becomes a list of lists... of protein objects

            pg.proteins = protein_list
            protein_group_objects.append(pg)
            list_of_proteins_grouped.append([protein_group])

        return_dict = {
            "grouped_protein_objects": list_of_proteins_grouped,
            "group_objects": protein_group_objects,
        }

        return return_dict
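In the peptide-centric method, each scored protein becomes its own single-member group with an incrementing group ID, as `_apply_protein_group_ids` does above. A minimal sketch using placeholder strings and `(group_id, [protein])` tuples standing in for ProteinGroup objects:

```python
scored_proteins = ["protA", "protB", "protC"]

group_id = 0
grouped_protein_objects = []   # list of single-protein lists
protein_group_objects = []     # (group_id, [protein]) pairs standing in for ProteinGroup
for protein in scored_proteins:
    group_id += 1
    protein_group_objects.append((group_id, [protein]))
    grouped_protein_objects.append([protein])

print(protein_group_objects)  # [(1, ['protA']), (2, ['protB']), (3, ['protC'])]
```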

__init__(data, digest)

PeptideCentric Inference initialization method.

Parameters:
  • data (DataStore) – DataStore Object.
  • digest (Digest) – Digest Object.
Returns:
  • object
Source code in pyproteininference/inference.py
def __init__(self, data, digest):
    """
    PeptideCentric Inference initialization method.

    Args:
        data (DataStore): [DataStore Object][pyproteininference.datastore.DataStore].
        digest (Digest): [Digest Object][pyproteininference.in_silico_digest.Digest].

    Returns:
        object:
    """
    self.data = data
    self.digest = digest
    self.data._validate_scored_proteins()
    self.scored_data = self.data.get_protein_data()

infer_proteins()

This method performs the Peptide Centric inference method.

This method assigns the variables: grouped_scored_proteins and protein_group_objects. These are both variables of the DataStore object and are lists of Protein objects and ProteinGroup objects.

Returns:
  • None
Source code in pyproteininference/inference.py
def infer_proteins(self):
    """
    This method performs the Peptide Centric inference method.

    This method assigns the variables: `grouped_scored_proteins` and `protein_group_objects`.
    These are both variables of the [DataStore object][pyproteininference.datastore.DataStore] and are
    lists of [Protein][pyproteininference.physical.Protein] objects
    and [ProteinGroup][pyproteininference.physical.ProteinGroup] objects.

    Returns:
        None:

    """

    # Get the higher or lower variable
    hl = self.data.higher_or_lower()

    logger.info("Applying Group ID's for the Peptide Centric Method")
    regrouped_proteins = self._apply_protein_group_ids()

    grouped_protein_objects = regrouped_proteins["grouped_protein_objects"]
    protein_group_objects = regrouped_proteins["group_objects"]

    logger.info("Sorting Results based on lead Protein Score")
    grouped_protein_objects = datastore.DataStore.sort_protein_objects(
        grouped_protein_objects=grouped_protein_objects, higher_or_lower=hl
    )
    protein_group_objects = datastore.DataStore.sort_protein_group_objects(
        protein_group_objects=protein_group_objects, higher_or_lower=hl
    )

    self.data.grouped_scored_proteins = grouped_protein_objects
    self.data.protein_group_objects = protein_group_objects

Score

Bases: object

Score class that contains methods to do a variety of scoring methods on the Psm objects contained inside of Protein objects.

Methods in the class loop over each Protein object and creates a protein "score" variable using the Psm object scores.

Methods score all proteins from scoring_input from DataStore object. The PSM score that is used is determined from create_scoring_input.

Each scoring method will set the following attributes for the DataStore object.

  1. score_method; This is the full name of the score method.
  2. short_score_method; This is the short name of the score method.
  3. scored_proteins; This is a list of Protein objects that have been scored.
Attributes:
  • pre_score_data (list) –

    This is a list of Protein objects that contain Psm objects.

  • data (DataStore) –

    DataStore object.

Source code in pyproteininference/scoring.py
class Score(object):
    """
    Score class that contains a variety of scoring methods that operate on the
    [Psm][pyproteininference.physical.Psm] objects
    contained inside of [Protein][pyproteininference.physical.Protein] objects.

    Methods in the class loop over each Protein object and create a protein "score" variable using the Psm object
    scores.

    Methods score all proteins from `scoring_input` from [DataStore object][pyproteininference.datastore.DataStore].
    The PSM score that is used is determined from
    [create_scoring_input][pyproteininference.datastore.DataStore.create_scoring_input].

    Each scoring method will set the following attributes for
    the [DataStore object][pyproteininference.datastore.DataStore].

    1. `score_method`; This is the full name of the score method.
    2. `short_score_method`; This is the short name of the score method.
    3. `scored_proteins`; This is a list of [Protein][pyproteininference.physical.Protein] objects
    that have been scored.

    Attributes:
        pre_score_data (list): This is a list of [Protein][pyproteininference.physical.Protein] objects
            that contain [Psm][pyproteininference.physical.Psm] objects.
        data (DataStore): [DataStore][pyproteininference.datastore.DataStore] object.

    """

    BEST_PEPTIDE_PER_PROTEIN = "best_peptide_per_protein"
    ITERATIVE_DOWNWEIGHTED_LOG = "iterative_downweighted_log"
    MULTIPLICATIVE_LOG = "multiplicative_log"
    DOWNWEIGHTED_MULTIPLICATIVE_LOG = "downweighted_multiplicative_log"
    DOWNWEIGHTED_VERSION2 = "downweighted_version2"
    TOP_TWO_COMBINED = "top_two_combined"
    GEOMETRIC_MEAN = "geometric_mean"
    ADDITIVE = "additive"

    SCORE_METHODS = [
        BEST_PEPTIDE_PER_PROTEIN,
        ITERATIVE_DOWNWEIGHTED_LOG,
        MULTIPLICATIVE_LOG,
        DOWNWEIGHTED_MULTIPLICATIVE_LOG,
        DOWNWEIGHTED_VERSION2,
        TOP_TWO_COMBINED,
        GEOMETRIC_MEAN,
        ADDITIVE,
    ]

    SHORT_BEST_PEPTIDE_PER_PROTEIN = "bppp"
    SHORT_ITERATIVE_DOWNWEIGHTED_LOG = "idwl"
    SHORT_MULTIPLICATIVE_LOG = "ml"
    SHORT_DOWNWEIGHTED_MULTIPLICATIVE_LOG = "dwml"
    SHORT_DOWNWEIGHTED_VERSION2 = "dw2"
    SHORT_TOP_TWO_COMBINED = "ttc"
    SHORT_GEOMETRIC_MEAN = "gm"
    SHORT_ADDITIVE = "add"

    SHORT_SCORE_METHODS = [
        SHORT_BEST_PEPTIDE_PER_PROTEIN,
        SHORT_ITERATIVE_DOWNWEIGHTED_LOG,
        SHORT_MULTIPLICATIVE_LOG,
        SHORT_DOWNWEIGHTED_MULTIPLICATIVE_LOG,
        SHORT_DOWNWEIGHTED_VERSION2,
        SHORT_TOP_TWO_COMBINED,
        SHORT_GEOMETRIC_MEAN,
        SHORT_ADDITIVE,
    ]

    MULTIPLICATIVE_SCORE_TYPE = "multiplicative"
    ADDITIVE_SCORE_TYPE = "additive"

    SCORE_TYPES = [MULTIPLICATIVE_SCORE_TYPE, ADDITIVE_SCORE_TYPE]

    def __init__(self, data):
        """
        Initialization method for the Score class.

        Args:
            data (DataStore): [DataStore][pyproteininference.datastore.DataStore] object.

        Raises:
            ValueError: If the variable `scoring_input` for the [DataStore][pyproteininference.datastore.DataStore]
                object is Empty "[]" or does not exist "None".

        Examples:
            >>> score = pyproteininference.scoring.Score(data=data)
        """
        if data.scoring_input:
            self.pre_score_data = data.scoring_input
        else:
            raise ValueError(
                "scoring input not found in data object - Please run 'create_scoring_input' method from "
                "DataStore to run any scoring type"
            )
        self.data = data

    def score_psms(self, score_method="multiplicative_log"):
        """
        This method dispatches to the actual scoring method given a string input that is defined in the
        [ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter] object.

        Args:
            score_method (str): This is a string that represents which scoring method to call.

        Raises:
            ValueError: Will error out if the score_method is not present in the constant `SCORE_METHODS`.

        Examples:
            >>> score = pyproteininference.scoring.Score(data=data)
            >>> score.score_psms(score_method="best_peptide_per_protein")
        """

        self._validate_scoring_input()

        if score_method not in self.SCORE_METHODS:
            raise ValueError(
                "score method '{}' is not a proper method. Score method must be one of the following: '{}'".format(
                    score_method, ", ".join(self.SCORE_METHODS)
                )
            )
        else:
            if score_method == self.BEST_PEPTIDE_PER_PROTEIN:
                self.best_peptide_per_protein()
            if score_method == self.ITERATIVE_DOWNWEIGHTED_LOG:
                self.iterative_down_weighted_log()
            if score_method == self.MULTIPLICATIVE_LOG:
                self.multiplicative_log()
            if score_method == self.DOWNWEIGHTED_MULTIPLICATIVE_LOG:
                self.down_weighted_multiplicative_log()
            if score_method == self.DOWNWEIGHTED_VERSION2:
                self.down_weighted_v2()
            if score_method == self.TOP_TWO_COMBINED:
                self.top_two_combied()
            if score_method == self.GEOMETRIC_MEAN:
                self.geometric_mean_log()
            if score_method == self.ADDITIVE:
                self.additive()

    def best_peptide_per_protein(self):
        """
        This method uses a best peptide per protein scoring scheme.
        The top scoring Psm for each protein is selected as the overall Protein object score.

        Examples:
            >>> score = pyproteininference.scoring.Score(data=data)
            >>> score.best_peptide_per_protein()

        """

        all_scores = []

        logger.info("Scoring Proteins with BPPP")
        for protein in self.pre_score_data:
            val_list = protein.get_psm_scores()
            score = min([float(x) for x in val_list])

            protein.score = score

            all_scores.append(protein)
        # Here do ascending sorting because a lower pep or q value is better
        all_scores = sorted(all_scores, key=lambda k: k.score, reverse=False)

        self.data.protein_score = self.BEST_PEPTIDE_PER_PROTEIN
        self.data.short_protein_score = self.SHORT_BEST_PEPTIDE_PER_PROTEIN
        self.data.scored_proteins = all_scores

    def fishers_method(self):
        """
        This method uses a Fisher's method scoring scheme,
        combining the Psm scores per protein as -2 * sum(ln(x)).

        Examples:
            >>> score = pyproteininference.scoring.Score(data=data)
            >>> score.fishers_method()
        """

        all_scores = []
        logger.info("Scoring Proteins with Fisher's Method")
        for protein in self.pre_score_data:
            val_list = protein.get_psm_scores()
            score = -2 * sum([math.log(x) for x in val_list])

            protein.score = score

            all_scores.append(protein)
        # Here reverse the sorting to descending because a higher score is better
        all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)
        self.data.protein_score = "fishers_method"
        self.data.short_protein_score = "fm"
        self.data.scored_proteins = all_scores

    def multiplicative_log(self):
        """
        This method uses a Multiplicative Log scoring scheme.
        The selected Psm scores from all the peptides per protein are multiplied together and we take -Log(X)
        of the product.

        Examples:
            >>> score = pyproteininference.scoring.Score(data=data)
            >>> score.multiplicative_log()
        """

        all_scores = []
        logger.info("Scoring Proteins with Multiplicative Log Method")
        for protein in self.pre_score_data:
            val_list = protein.get_psm_scores()

            combine = reduce(lambda x, y: x * y, val_list)
            if combine == 0:
                combine = sys.float_info.min
            score = -math.log(combine)
            protein.score = score

            all_scores.append(protein)

        # Higher score is better as a smaller q or pep in a -log will give a larger value
        all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)

        self.data.protein_score = self.MULTIPLICATIVE_LOG
        self.data.short_protein_score = self.SHORT_MULTIPLICATIVE_LOG
        self.data.scored_proteins = all_scores

    def down_weighted_multiplicative_log(self):
        """
        This method uses a Downweighted Multiplicative Log scoring scheme.
        The selected PSM scores from all the peptides per protein are multiplied together,
        the product is divided by the mean of all PSM scores raised to the number of peptides for that protein,
        and we take -Log(X) of the resulting value.

        Examples:
            >>> score = pyproteininference.scoring.Score(data=data)
            >>> score.down_weighted_multiplicative_log()
        """

        score_list = []
        for proteins in self.pre_score_data:
            cur_scores = proteins.get_psm_scores()
            for scores in cur_scores:
                score_list.append(scores)
        score_mean = numpy.mean(score_list)

        all_scores = []
        logger.info("Scoring Proteins with DWML method")
        for protein in self.pre_score_data:
            val_list = protein.get_psm_scores()
            # Divide by the score mean raised to the length of the number of unique peptides for the protein
            # This is an attempt to normalize for number of peptides per protein
            combine = reduce(lambda x, y: x * y, val_list)
            if combine == 0:
                combine = sys.float_info.min
            score = -math.log(combine / (score_mean ** len(val_list)))
            protein.score = score

            all_scores.append(protein)

        # Higher score is better as a smaller q or pep in a -log will give a larger value
        all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)

        self.data.protein_score = self.DOWNWEIGHTED_MULTIPLICATIVE_LOG
        self.data.short_protein_score = self.SHORT_DOWNWEIGHTED_MULTIPLICATIVE_LOG
        self.data.scored_proteins = all_scores

    def top_two_combied(self):
        """
        This method uses a Top Two scoring scheme.
        The top two scores for each protein are multiplied together and we take -Log(X) of the multiplied value.
        If a protein only has 1 score/peptide, then we only do -Log(X) of the 1 peptide score.

        Examples:
            >>> score = pyproteininference.scoring.Score(data=data)
            >>> score.top_two_combied()
        """

        all_scores = []
        logger.info("Scoring Proteins with Top Two Method")
        for protein in self.pre_score_data:
            val_list = protein.get_psm_scores()

            try:
                # Try to combine the top two scores
                # Divide by 2 to attempt to normalize the value
                score = -math.log((val_list[0] * val_list[1]) / 2)
            except IndexError:
                # If there is only 1 score/1 peptide then just use the 1 peptide provided
                score = -math.log(val_list[0])

            protein.score = score
            all_scores.append(protein)

        # Higher score is better as a smaller q or pep in a -log will give a larger value
        all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)

        self.data.protein_score = self.TOP_TWO_COMBINED
        self.data.short_protein_score = self.SHORT_TOP_TWO_COMBINED
        self.data.scored_proteins = all_scores

    def down_weighted_v2(self):
        """
        This method uses a Downweighted Multiplicative Log scoring scheme.
        Each peptide is iteratively downweighted by raising the peptide QValue or PepValue to the
        following power (1/(1+index_number)).
        Where index_number is the peptide number per protein.
        Each score for a protein provides less and less weight iteratively.

        We also take -Log(X) of the final score here.

        Examples:
            >>> score = pyproteininference.scoring.Score(data=data)
            >>> score.down_weighted_v2()
        """

        all_scores = []
        logger.info("Scoring Proteins with down weighted v2 method")
        for protein in self.pre_score_data:
            val_list = protein.get_psm_scores()

            # Here take each score and raise it to the power of (1/(1+index_number)).
            # This downweights each successive score by reducing its weight in a decreasing fashion
            # Basically, each score for a protein will provide less and less weight iteratively
            val_list = [val_list[x] ** (1 / float(1 + x)) for x in range(len(val_list))]
            score = -math.log(reduce(lambda x, y: x * y, val_list))

            protein.score = score
            all_scores.append(protein)

        # Higher score is better as a smaller q or pep in a -log will give a larger value
        all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)

        self.data.protein_score = self.DOWNWEIGHTED_VERSION2
        self.data.short_protein_score = self.SHORT_DOWNWEIGHTED_VERSION2
        self.data.scored_proteins = all_scores

    def iterative_down_weighted_log(self):
        """
        This method uses a Downweighted Multiplicative Log scoring scheme.
        Each peptide is iteratively downweighted by multiplying the peptide QValue or PepValue to
        the following (1+index_number).
        Where index_number is the peptide number per protein.
        Each score for a protein provides less and less weight iteratively.

        We also take -Log(X) of the final score here.

        Examples:
            >>> score = pyproteininference.scoring.Score(data=data)
            >>> score.iterative_down_weighted_log()
        """

        all_scores = []
        logger.info("Scoring Proteins with IDWL method")
        for protein in self.pre_score_data:
            val_list = protein.get_psm_scores()

            # Here take each score and multiply it by (1 + its index number).
            # This downweights each successive score by reducing its weight in a decreasing fashion
            # Basically, each score for a protein will provide less and less weight iteratively
            val_list = [val_list[x] * (float(1 + x)) for x in range(len(val_list))]
            combine = reduce(lambda x, y: x * y, val_list)
            if combine == 0:
                combine = sys.float_info.min
            score = -math.log(combine)
            protein.score = score
            all_scores.append(protein)

        # Higher score is better as a smaller q or pep in a -log will give a larger value
        all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)

        self.data.protein_score = self.ITERATIVE_DOWNWEIGHTED_LOG
        self.data.short_protein_score = self.SHORT_ITERATIVE_DOWNWEIGHTED_LOG
        self.data.scored_proteins = all_scores

    def geometric_mean_log(self):
        """
        This method uses a Geometric Mean scoring scheme.

        We also take -Log(X) of the final score here.

        Examples:
            >>> score = pyproteininference.scoring.Score(data=data)
            >>> score.geometric_mean_log()
        """

        all_scores = []
        logger.info("Scoring Proteins with GML method")
        for protein in self.pre_score_data:
            psm_scores = protein.get_psm_scores()
            val_list = [float(vals) for vals in psm_scores]
            combine = reduce(lambda x, y: x * y, val_list)
            if combine == 0:
                combine = sys.float_info.min
            pre_log_score = combine ** (1 / float(len(val_list)))
            score = -math.log(pre_log_score)

            protein.score = score
            all_scores.append(protein)

        # Higher score is better as a smaller q or pep in a -log will give a larger value
        all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)

        self.data.protein_score = self.GEOMETRIC_MEAN
        self.data.short_protein_score = self.SHORT_GEOMETRIC_MEAN
        self.data.scored_proteins = all_scores

    def iterative_down_weighted_v2(self):
        """
        The following method is an experimental method essentially used for future development of potential scoring
        schemes.
        """

        all_scores = []
        logger.info("Scoring Proteins with iterative down weighted v2 method")
        for protein in self.pre_score_data:
            val_list = protein.get_psm_scores()

            # Here take each score and raise it to the power of (1/(1+index_number)).
            # This downweights each successive score by reducing its weight in a decreasing fashion
            # Basically, each score for a protein will provide less and less weight iteratively
            val_list = [val_list[x] ** (1 / float(1 + x)) for x in range(len(val_list))]
            score = -math.log(reduce(lambda x, y: x * y, val_list))

            protein.score = score
            all_scores.append(protein)

        # Higher score is better as a smaller q or pep in a -log will give a larger value
        all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)

        self.data.protein_score = "iterative_downweighting2"
        self.data.short_protein_score = "idw2"
        self.data.scored_proteins = all_scores

    def additive(self):
        """
        This method uses an additive scoring scheme.
        The method can only be used if a larger PSM score is a better PSM score such as the Percolator score.

        Examples:
            >>> score = pyproteininference.scoring.Score(data=data)
            >>> score.additive()
        """

        all_scores = []
        logger.info("Scoring Proteins with additive method")
        for protein in self.pre_score_data:
            val_list = protein.get_psm_scores()

            # Take the sum of our scores
            score = sum(val_list)

            protein.score = score
            all_scores.append(protein)

        # Higher score is better for additive scoring (e.g. Percolator scores)
        all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)

        self.data.protein_score = self.ADDITIVE
        self.data.short_protein_score = self.SHORT_ADDITIVE
        self.data.scored_proteins = all_scores

    def _validate_scoring_input(self):
        validated_psm_scores = all(x.main_score is not None for x in self.data.get_psm_data())
        if validated_psm_scores:
            logger.info(
                "PSM scores validated. Score: {} read from file correctly for all PSMs".format(
                    self.data.parameter_file_object.psm_score
                )
            )
        else:
            raise ValueError(
                "PSM scores not validated. Score: {} not read from file correctly for all PSMs".format(
                    self.data.parameter_file_object.psm_score
                )
            )

__init__(data)

Initialization method for the Score class.

Parameters:
  • data (DataStore) –

    DataStore object.

Raises:
  • ValueError

    If the variable scoring_input for the DataStore object is empty ([]) or does not exist (None).

Examples:

>>> score = pyproteininference.scoring.Score(data=data)
Source code in pyproteininference/scoring.py
def __init__(self, data):
    """
    Initialization method for the Score class.

    Args:
        data (DataStore): [DataStore][pyproteininference.datastore.DataStore] object.

    Raises:
        ValueError: If the variable `scoring_input` for the [DataStore][pyproteininference.datastore.DataStore]
            object is Empty "[]" or does not exist "None".

    Examples:
        >>> score = pyproteininference.scoring.Score(data=data)
    """
    if data.scoring_input:
        self.pre_score_data = data.scoring_input
    else:
        raise ValueError(
            "scoring input not found in data object - Please run 'create_scoring_input' method from "
            "DataStore to run any scoring type"
        )
    self.data = data

additive()

This method uses an additive scoring scheme. The method can only be used if a larger PSM score is a better PSM score such as the Percolator score.

Examples:

>>> score = pyproteininference.scoring.Score(data=data)
>>> score.additive()
Source code in pyproteininference/scoring.py
def additive(self):
    """
    This method uses an additive scoring scheme.
    The method can only be used if a larger PSM score is a better PSM score such as the Percolator score.

    Examples:
        >>> score = pyproteininference.scoring.Score(data=data)
        >>> score.additive()
    """

    all_scores = []
    logger.info("Scoring Proteins with additive method")
    for protein in self.pre_score_data:
        val_list = protein.get_psm_scores()

        # Take the sum of our scores
        score = sum(val_list)

        protein.score = score
        all_scores.append(protein)

    # Higher score is better for additive scoring (e.g. Percolator scores)
    all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)

    self.data.protein_score = self.ADDITIVE
    self.data.short_protein_score = self.SHORT_ADDITIVE
    self.data.scored_proteins = all_scores
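
The additive rule above can be sketched as a standalone function on a hypothetical list of per-PSM scores (plain floats, not the library's Protein objects); it assumes a larger PSM score is better, e.g. a Percolator score.

```python
def additive_score(psm_scores):
    # The protein score is simply the sum of its PSM scores.
    return sum(psm_scores)

print(additive_score([3.2, 1.5, 0.8]))
```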

best_peptide_per_protein()

This method uses a best peptide per protein scoring scheme. The top scoring Psm for each protein is selected as the overall Protein object score.

Examples:

>>> score = pyproteininference.scoring.Score(data=data)
>>> score.best_peptide_per_protein()
Source code in pyproteininference/scoring.py
def best_peptide_per_protein(self):
    """
    This method uses a best peptide per protein scoring scheme.
    The top scoring Psm for each protein is selected as the overall Protein object score.

    Examples:
        >>> score = pyproteininference.scoring.Score(data=data)
        >>> score.best_peptide_per_protein()

    """

    all_scores = []

    logger.info("Scoring Proteins with BPPP")
    for protein in self.pre_score_data:
        val_list = protein.get_psm_scores()
        score = min([float(x) for x in val_list])

        protein.score = score

        all_scores.append(protein)
    # Here do ascending sorting because a lower pep or q value is better
    all_scores = sorted(all_scores, key=lambda k: k.score, reverse=False)

    self.data.protein_score = self.BEST_PEPTIDE_PER_PROTEIN
    self.data.short_protein_score = self.SHORT_BEST_PEPTIDE_PER_PROTEIN
    self.data.scored_proteins = all_scores
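
As a minimal sketch (hypothetical q-value/PEP list, not the library's Protein objects), the best-peptide-per-protein score is just the minimum PSM score, since a lower q-value or PEP is better.

```python
def best_peptide_score(psm_scores):
    # The best (lowest) q-value or PEP among the PSMs becomes the protein score.
    return min(float(x) for x in psm_scores)

print(best_peptide_score([0.01, 0.2, 0.005]))  # prints 0.005
```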

down_weighted_multiplicative_log()

This method uses a Downweighted Multiplicative Log scoring scheme. The selected PSM scores from all the peptides per protein are multiplied together, the product is divided by the mean of all PSM scores raised to the number of peptides for that protein, and we take -Log(X) of the resulting value.

Examples:

>>> score = pyproteininference.scoring.Score(data=data)
>>> score.down_weighted_multiplicative_log()
Source code in pyproteininference/scoring.py
def down_weighted_multiplicative_log(self):
    """
    This method uses a Downweighted Multiplicative Log scoring scheme.
    The selected PSM scores from all the peptides per protein are multiplied together,
    the product is divided by the mean of all PSM scores raised to the number of peptides for that protein,
    and we take -Log(X) of the resulting value.

    Examples:
        >>> score = pyproteininference.scoring.Score(data=data)
        >>> score.down_weighted_multiplicative_log()
    """

    score_list = []
    for proteins in self.pre_score_data:
        cur_scores = proteins.get_psm_scores()
        for scores in cur_scores:
            score_list.append(scores)
    score_mean = numpy.mean(score_list)

    all_scores = []
    logger.info("Scoring Proteins with DWML method")
    for protein in self.pre_score_data:
        val_list = protein.get_psm_scores()
        # Divide by the score mean raised to the length of the number of unique peptides for the protein
        # This is an attempt to normalize for number of peptides per protein
        combine = reduce(lambda x, y: x * y, val_list)
        if combine == 0:
            combine = sys.float_info.min
        score = -math.log(combine / (score_mean ** len(val_list)))
        protein.score = score

        all_scores.append(protein)

    # Higher score is better as a smaller q or pep in a -log will give a larger value
    all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)

    self.data.protein_score = self.DOWNWEIGHTED_MULTIPLICATIVE_LOG
    self.data.short_protein_score = self.SHORT_DOWNWEIGHTED_MULTIPLICATIVE_LOG
    self.data.scored_proteins = all_scores
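
The DWML formula can be sketched on hypothetical inputs; here `global_mean` stands in for the mean of every PSM score in the dataset, which the method above computes before scoring.

```python
import math
import sys
from functools import reduce

def dwml_score(psm_scores, global_mean):
    # Multiply the PSM scores together, guard against underflow to 0,
    # then down-weight by mean**n and take -log of the ratio.
    combined = reduce(lambda x, y: x * y, psm_scores)
    if combined == 0:
        combined = sys.float_info.min
    return -math.log(combined / (global_mean ** len(psm_scores)))

print(dwml_score([0.01, 0.02], 0.05))
```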

down_weighted_v2()

This method uses a Downweighted Multiplicative Log scoring scheme. Each peptide is iteratively downweighted by raising the peptide QValue or PepValue to the following power (1/(1+index_number)). Where index_number is the peptide number per protein. Each score for a protein provides less and less weight iteratively.

We also take -Log(X) of the final score here.

Examples:

>>> score = pyproteininference.scoring.Score(data=data)
>>> score.down_weighted_v2()
Source code in pyproteininference/scoring.py
def down_weighted_v2(self):
    """
    This method uses a Downweighted Multiplicative Log scoring scheme.
    Each peptide is iteratively downweighted by raising the peptide QValue or PepValue to the
    following power (1/(1+index_number)).
    Where index_number is the peptide number per protein.
    Each score for a protein provides less and less weight iteratively.

    We also take -Log(X) of the final score here.

    Examples:
        >>> score = pyproteininference.scoring.Score(data=data)
        >>> score.down_weighted_v2()
    """

    all_scores = []
    logger.info("Scoring Proteins with down weighted v2 method")
    for protein in self.pre_score_data:
        val_list = protein.get_psm_scores()

        # Here take each score and raise it to the power of (1/(1+index_number)).
        # This downweights each successive score by reducing its weight in a decreasing fashion
        # Basically, each score for a protein will provide less and less weight iteratively
        val_list = [val_list[x] ** (1 / float(1 + x)) for x in range(len(val_list))]
        score = -math.log(reduce(lambda x, y: x * y, val_list))

        protein.score = score
        all_scores.append(protein)

    # Higher score is better as a smaller q or pep in a -log will give a larger value
    all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)

    self.data.protein_score = self.DOWNWEIGHTED_VERSION2
    self.data.short_protein_score = self.SHORT_DOWNWEIGHTED_VERSION2
    self.data.scored_proteins = all_scores
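
A standalone sketch of the down-weighting above (hypothetical q-value list): the i-th (0-based) score is raised to 1/(1+i), so each successive peptide contributes less, then the product is -log transformed.

```python
import math
from functools import reduce

def down_weighted_v2_score(psm_scores):
    # Raise the i-th score to 1/(1+i), then -log the product.
    weighted = [s ** (1.0 / (1 + i)) for i, s in enumerate(psm_scores)]
    return -math.log(reduce(lambda x, y: x * y, weighted))

# [0.01, 0.04] -> -log(0.01 * 0.04**0.5) = -log(0.002)
print(down_weighted_v2_score([0.01, 0.04]))
```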

fishers_method()

This method uses a Fisher's method scoring scheme. The Psm scores per protein are combined as -2 * sum(ln(x)).

Examples:

>>> score = pyproteininference.scoring.Score(data=data)
>>> score.fishers_method()

Source code in pyproteininference/scoring.py
def fishers_method(self):
    """
    This method uses a Fisher's method scoring scheme,
    combining the Psm scores per protein as -2 * sum(ln(x)).

    Examples:
        >>> score = pyproteininference.scoring.Score(data=data)
        >>> score.fishers_method()
    """

    all_scores = []
    logger.info("Scoring Proteins with Fisher's Method")
    for protein in self.pre_score_data:
        val_list = protein.get_psm_scores()
        score = -2 * sum([math.log(x) for x in val_list])

        protein.score = score

        all_scores.append(protein)
    # Sort descending because a higher combined score is better
    all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)
    self.data.protein_score = "fishers_method"
    self.data.short_protein_score = "fm"
    self.data.scored_proteins = all_scores
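
As a standalone sketch on a hypothetical list of p-value-like PSM scores, Fisher's combined statistic is:

```python
import math

def fishers_combined(p_values):
    # Fisher's combined probability statistic: -2 * sum of ln(p).
    # Smaller p-values yield a larger (better) statistic.
    return -2 * sum(math.log(p) for p in p_values)

print(fishers_combined([0.05, 0.1]))
```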

geometric_mean_log()

This method uses a Geometric Mean scoring scheme.

We also take -Log(X) of the final score here.

Examples:

>>> score = pyproteininference.scoring.Score(data=data)
>>> score.geometric_mean_log()
Source code in pyproteininference/scoring.py
def geometric_mean_log(self):
    """
    This method uses a Geometric Mean scoring scheme.

    We also take -Log(X) of the final score here.

    Examples:
        >>> score = pyproteininference.scoring.Score(data=data)
        >>> score.geometric_mean_log()
    """

    all_scores = []
    logger.info("Scoring Proteins with GML method")
    for protein in self.pre_score_data:
        psm_scores = protein.get_psm_scores()
        val_list = [float(vals) for vals in psm_scores]
        combine = reduce(lambda x, y: x * y, val_list)
        if combine == 0:
            combine = sys.float_info.min
        pre_log_score = combine ** (1 / float(len(val_list)))
        score = -math.log(pre_log_score)

        protein.score = score
        all_scores.append(protein)

    # Higher score is better as a smaller q or pep in a -log will give a larger value
    all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)

    self.data.protein_score = self.GEOMETRIC_MEAN
    self.data.short_protein_score = self.SHORT_GEOMETRIC_MEAN
    self.data.scored_proteins = all_scores
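
A minimal sketch of the geometric-mean score on a hypothetical score list: note that -log of the geometric mean equals the arithmetic mean of the -log scores.

```python
import math
from functools import reduce

def geometric_mean_log_score(psm_scores):
    # Geometric mean of the PSM scores, then -log transformed.
    combined = reduce(lambda x, y: x * y, (float(s) for s in psm_scores))
    return -math.log(combined ** (1.0 / len(psm_scores)))

# [0.01, 0.04] -> geometric mean 0.02 -> -log(0.02)
print(geometric_mean_log_score([0.01, 0.04]))
```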

iterative_down_weighted_log()

This method uses a Downweighted Multiplicative Log scoring scheme. Each peptide score is iteratively downweighted by multiplying the peptide Q-value or PEP value by (1 + index_number), where index_number is the peptide's index within the protein. Each successive score for a protein therefore carries less and less weight.

We also take -Log(X) of the final score here.

Examples:

>>> score = pyproteininference.scoring.Score(data=data)
>>> score.iterative_down_weighted_log()
Source code in pyproteininference/scoring.py
def iterative_down_weighted_log(self):
    """
    This method uses a Downweighted Multiplicative Log scoring scheme.
    Each peptide score is iteratively downweighted by multiplying the peptide Q-value or
    PEP value by (1 + index_number), where index_number is the peptide's index within the protein.
    Each successive score for a protein therefore carries less and less weight.

    We also take -Log(X) of the final score here.

    Examples:
        >>> score = pyproteininference.scoring.Score(data=data)
        >>> score.iterative_down_weighted_log()
    """

    all_scores = []
    logger.info("Scoring Proteins with IDWL method")
    for protein in self.pre_score_data:
        val_list = protein.get_psm_scores()

        # Here take each score and multiply it by (1 + its index number).
        # This downweights each successive score by reducing its weight in a decreasing fashion
        # Basically, each score for a protein will provide less and less weight iteratively
        val_list = [val_list[x] * (float(1 + x)) for x in range(len(val_list))]
        # val_list = [val_list[x]**(1/float(1+(float(x)/10))) for x in range(len(val_list))]
        combine = reduce(lambda x, y: x * y, val_list)
        if combine == 0:
            combine = sys.float_info.min
        score = -math.log(combine)
        protein.score = score
        all_scores.append(protein)

    # Higher score is better as a smaller q or pep in a -log will give a larger value
    all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)

    self.data.protein_score = self.ITERATIVE_DOWNWEIGHTED_LOG
    self.data.short_protein_score = self.SHORT_ITERATIVE_DOWNWEIGHTED_LOG
    self.data.scored_proteins = all_scores
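With hypothetical scores (best first), the IDWL downweighting above works out to:

```python
import math
from functools import reduce

# Hypothetical PSM scores for one protein, best (smallest) score first.
val_list = [0.01, 0.05, 0.2]

# Downweight the k-th score by multiplying it by (1 + k);
# later scores are inflated before the -log, so they contribute less.
weighted = [val_list[x] * float(1 + x) for x in range(len(val_list))]
score = -math.log(reduce(lambda x, y: x * y, weighted))
print(round(score, 4))  # 7.4186
```

Compare 7.4186 here with 9.2103 for the plain multiplicative log of the same scores: the extra (1 + k) factors shrink the contribution of each successive peptide.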

iterative_down_weighted_v2()

This is an experimental method, kept for future development of potential scoring schemes.

Source code in pyproteininference/scoring.py
def iterative_down_weighted_v2(self):
    """
    This is an experimental method, kept for future development of potential scoring
    schemes.
    """

    all_scores = []
    logger.info("Scoring Proteins with iterative down weighted v2 method")
    for protein in self.pre_score_data:
        val_list = protein.get_psm_scores()

        # Here take each score and raise it to the power of (1/(1+index_number)).
        # This downweights each successive score by reducing its weight in a decreasing fashion
        # Basically, each score for a protein will provide less and less weight iteratively
        val_list = [val_list[x] ** (1 / float(1 + x)) for x in range(len(val_list))]
        # val_list = [val_list[x]**(1/float(1+(float(x)/10))) for x in range(len(val_list))]
        score = -math.log(reduce(lambda x, y: x * y, val_list))

        protein.score = score
        all_scores.append(protein)

    # Higher score is better as a smaller q or pep in a -log will give a larger value
    all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)

    self.data.protein_score = "iterative_downweighting2"
    self.data.short_protein_score = "idw2"
    self.data.scored_proteins = all_scores
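The v2 variant swaps the multiplicative downweighting for an exponent; a standalone sketch with hypothetical scores:

```python
import math
from functools import reduce

# Hypothetical PSM scores for one protein, best (smallest) score first.
val_list = [0.01, 0.05, 0.2]

# Raise the k-th score to the power 1 / (1 + k): later scores are pulled
# toward 1, so they contribute less to the -log of the product.
weighted = [val_list[x] ** (1 / float(1 + x)) for x in range(len(val_list))]
score = -math.log(reduce(lambda x, y: x * y, weighted))
print(round(score, 4))  # 6.6395
```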

multiplicative_log()

This method uses a Multiplicative Log scoring scheme. The selected PSM scores from all peptides of a protein are multiplied together and we take -Log(X) of the product.

Examples:

>>> score = pyproteininference.scoring.Score(data=data)
>>> score.multiplicative_log()
Source code in pyproteininference/scoring.py
def multiplicative_log(self):
    """
    This method uses a Multiplicative Log scoring scheme.
    The selected PSM scores from all peptides of a protein are multiplied together and we take
    -Log(X) of the product.

    Examples:
        >>> score = pyproteininference.scoring.Score(data=data)
        >>> score.multiplicative_log()
    """

    all_scores = []
    logger.info("Scoring Proteins with Multiplicative Log Method")
    for protein in self.pre_score_data:
        val_list = protein.get_psm_scores()

        combine = reduce(lambda x, y: x * y, val_list)
        if combine == 0:
            combine = sys.float_info.min
        score = -math.log(combine)
        protein.score = score

        all_scores.append(protein)

    # Higher score is better as a smaller q or pep in a -log will give a larger value
    all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)

    self.data.protein_score = self.MULTIPLICATIVE_LOG
    self.data.short_protein_score = self.SHORT_MULTIPLICATIVE_LOG
    self.data.scored_proteins = all_scores
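A standalone sketch of the multiplicative log score (hypothetical score values):

```python
import math
import sys
from functools import reduce

# Hypothetical PSM scores (e.g. q-values) for one protein.
val_list = [0.01, 0.05, 0.2]

combine = reduce(lambda x, y: x * y, val_list)
if combine == 0:
    combine = sys.float_info.min  # guard against log(0) underflow
score = -math.log(combine)
print(round(score, 4))  # 9.2103
```

Equivalently, this is the sum of -log of each score, so every additional confident peptide increases the protein score.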

score_psms(score_method='multiplicative_log')

This method dispatches to the actual scoring method given a string input that is defined in the ProteinInferenceParameter object.

Parameters:
  • score_method (str, default: 'multiplicative_log' ) –

    This is a string that represents which scoring method to call.

Raises:
  • ValueError

    Raised if the score_method is not present in the constant SCORE_METHODS.

Examples:

>>> score = pyproteininference.scoring.Score(data=data)
>>> score.score_psms(score_method="best_peptide_per_protein")
Source code in pyproteininference/scoring.py
def score_psms(self, score_method="multiplicative_log"):
    """
    This method dispatches to the actual scoring method given a string input that is defined in the
    [ProteinInferenceParameter][pyproteininference.parameters.ProteinInferenceParameter] object.

    Args:
        score_method (str): This is a string that represents which scoring method to call.

    Raises:
        ValueError: Raised if the score_method is not present in the constant `SCORE_METHODS`.

    Examples:
        >>> score = pyproteininference.scoring.Score(data=data)
        >>> score.score_psms(score_method="best_peptide_per_protein")
    """

    self._validate_scoring_input()

    if score_method not in self.SCORE_METHODS:
        raise ValueError(
            "score method '{}' is not a proper method. Score method must be one of the following: '{}'".format(
                score_method, ", ".join(self.SCORE_METHODS)
            )
        )
    else:
        if score_method == self.BEST_PEPTIDE_PER_PROTEIN:
            self.best_peptide_per_protein()
        elif score_method == self.ITERATIVE_DOWNWEIGHTED_LOG:
            self.iterative_down_weighted_log()
        elif score_method == self.MULTIPLICATIVE_LOG:
            self.multiplicative_log()
        elif score_method == self.DOWNWEIGHTED_MULTIPLICATIVE_LOG:
            self.down_weighted_multiplicative_log()
        elif score_method == self.DOWNWEIGHTED_VERSION2:
            self.down_weighted_v2()
        elif score_method == self.TOP_TWO_COMBINED:
            self.top_two_combied()
        elif score_method == self.GEOMETRIC_MEAN:
            self.geometric_mean_log()
        elif score_method == self.ADDITIVE:
            self.additive()

top_two_combied()

This method uses a Top Two scoring scheme. The top two scores for each protein are multiplied together, divided by two to normalize, and we take -Log(X) of the result. If a protein has only one score/peptide, we take -Log(X) of that single score.

Examples:

>>> score = pyproteininference.scoring.Score(data=data)
>>> score.top_two_combied()
Source code in pyproteininference/scoring.py
def top_two_combied(self):
    """
    This method uses a Top Two scoring scheme.
    The top two scores for each protein are multiplied together, divided by two to normalize,
    and we take -Log(X) of the result.
    If a protein has only one score/peptide, we take -Log(X) of that single score.

    Examples:
        >>> score = pyproteininference.scoring.Score(data=data)
        >>> score.top_two_combied()
    """

    all_scores = []
    logger.info("Scoring Proteins with Top Two Method")
    for protein in self.pre_score_data:
        val_list = protein.get_psm_scores()

        try:
            # Try to combine the top two scores
            # Divide by 2 to attempt to normalize the value
            score = -math.log((val_list[0] * val_list[1]) / 2)
        except IndexError:
            # If there is only 1 score/1 peptide then just use the 1 peptide provided
            score = -math.log(val_list[0])

        protein.score = score
        all_scores.append(protein)

    # Higher score is better as a smaller q or pep in a -log will give a larger value
    all_scores = sorted(all_scores, key=lambda k: k.score, reverse=True)

    self.data.protein_score = self.TOP_TWO_COMBINED
    self.data.short_protein_score = self.SHORT_TOP_TWO_COMBINED
    self.data.scored_proteins = all_scores
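With hypothetical scores sorted best first, the top-two computation above reduces to:

```python
import math

# Hypothetical PSM scores for one protein, best (smallest) scores first.
val_list = [0.01, 0.05, 0.2]

try:
    # Combine the top two scores; divide by 2 to attempt to normalize.
    score = -math.log((val_list[0] * val_list[1]) / 2)
except IndexError:
    # Only one peptide: just take -log of that single score.
    score = -math.log(val_list[0])
print(round(score, 4))  # 8.294
```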

Export

Bases: object

Class that handles exporting protein inference results to filesystem as csv files.

Attributes:
  • data (DataStore) –
  • filepath (str) –

    Path to file to be written.

Source code in pyproteininference/export.py
class Export(object):
    """
    Class that handles exporting protein inference results to filesystem as csv files.

    Attributes:
        data (DataStore): [DataStore object][pyproteininference.datastore.DataStore].
        filepath (str): Path to file to be written.

    """

    EXPORT_LEADS = "leads"
    EXPORT_ALL = "all"
    EXPORT_COMMA_SEP = "comma_sep"
    EXPORT_Q_VALUE_COMMA_SEP = "q_value_comma_sep"
    EXPORT_Q_VALUE = "q_value"
    EXPORT_Q_VALUE_ALL = "q_value_all"
    EXPORT_PEPTIDES = "peptides"
    EXPORT_PSMS = "psms"
    EXPORT_PSM_IDS = "psm_ids"
    EXPORT_LONG = "long"

    EXPORT_TYPES = [
        EXPORT_LEADS,
        EXPORT_ALL,
        EXPORT_COMMA_SEP,
        EXPORT_Q_VALUE_COMMA_SEP,
        EXPORT_Q_VALUE,
        EXPORT_Q_VALUE_ALL,
        EXPORT_PEPTIDES,
        EXPORT_PSMS,
        EXPORT_PSM_IDS,
        EXPORT_LONG,
    ]

    def __init__(self, data):
        """
        Initialization method for the Export class.

        Args:
            data (DataStore): [DataStore object][pyproteininference.datastore.DataStore].

        Example:
            >>> export = pyproteininference.export.Export(data=data)

        """
        self.data = data
        self.filepath = None

    def export_to_csv(self, output_filename=None, directory=None, export_type="q_value"):
        """
        Method that dispatches to one of the many export methods given an export_type input.

        filepath is determined based on directory arg and information from
        [DataStore object][pyproteininference.datastore.DataStore].

        This method sets the `filepath` variable.

        Args:
            output_filename (str): Filepath to write to. If set as None will auto generate filename and
                will write to directory variable.
            directory (str): Directory to write the result file to. If None, will write to current working directory.
            export_type (str): Must be a value in `EXPORT_TYPES` and determines the output format.

        Example:
            >>> export = pyproteininference.export.Export(data=data)
            >>> export.export_to_csv(output_filename=None, directory="/path/to/output/dir/", export_type="psms")

        """

        if not directory:
            directory = os.getcwd()

        data = self.data
        tag = data.parameter_file_object.tag

        if self.EXPORT_LEADS == export_type:
            filename = "{}_leads_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
            complete_filepath = os.path.join(directory, filename)
            if output_filename:
                complete_filepath = output_filename
            logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
            self.csv_export_leads_restricted(filename_out=complete_filepath)

        elif self.EXPORT_ALL == export_type:
            filename = "{}_all_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
            complete_filepath = os.path.join(directory, filename)
            if output_filename:
                complete_filepath = output_filename
            logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
            self.csv_export_all_restricted(complete_filepath)

        elif self.EXPORT_COMMA_SEP == export_type:
            filename = "{}_comma_sep_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
            complete_filepath = os.path.join(directory, filename)
            if output_filename:
                complete_filepath = output_filename
            logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
            self.csv_export_comma_sep_restricted(complete_filepath)

        elif self.EXPORT_Q_VALUE_COMMA_SEP == export_type:
            filename = "{}_q_value_comma_sep_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
            complete_filepath = os.path.join(directory, filename)
            if output_filename:
                complete_filepath = output_filename
            logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
            self.csv_export_q_value_comma_sep(complete_filepath)

        elif self.EXPORT_Q_VALUE == export_type:
            filename = "{}_q_value_leads_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
            complete_filepath = os.path.join(directory, filename)
            if output_filename:
                complete_filepath = output_filename
            logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
            self.csv_export_q_value_leads(complete_filepath)

        elif self.EXPORT_Q_VALUE_ALL == export_type:
            filename = "{}_q_value_all_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
            complete_filepath = os.path.join(directory, filename)
            if output_filename:
                complete_filepath = output_filename
            logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
            self.csv_export_q_value_all(complete_filepath)

        elif self.EXPORT_PEPTIDES == export_type:
            filename = "{}_q_value_leads_peptides_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
            complete_filepath = os.path.join(directory, filename)
            if output_filename:
                complete_filepath = output_filename
            logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
            self.csv_export_q_value_leads_peptides(complete_filepath)

        elif self.EXPORT_PSMS == export_type:
            filename = "{}_q_value_leads_psms_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
            complete_filepath = os.path.join(directory, filename)
            if output_filename:
                complete_filepath = output_filename
            logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
            self.csv_export_q_value_leads_psms(complete_filepath)

        elif self.EXPORT_PSM_IDS == export_type:
            filename = "{}_q_value_leads_psm_ids_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
            complete_filepath = os.path.join(directory, filename)
            if output_filename:
                complete_filepath = output_filename
            logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
            self.csv_export_q_value_leads_psm_ids(complete_filepath)

        elif self.EXPORT_LONG == export_type:
            filename = "{}_q_value_long_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
            complete_filepath = os.path.join(directory, filename)
            if output_filename:
                complete_filepath = output_filename
            logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
            self.csv_export_q_value_leads_long(complete_filepath)

        else:
            # Fallback filepath; note that no export method is called for an
            # unrecognized export_type.
            complete_filepath = "protein_inference_results.csv"

        self.filepath = complete_filepath
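For illustration, the filepath construction for `export_type="q_value"` boils down to the following (the tag, score names, and directory here are hypothetical, normally read from the DataStore and parameter file):

```python
import os

# Hypothetical values normally pulled from the DataStore and parameter file.
tag = "example_run"
short_protein_score = "ml"
psm_score = "posterior_error_prob"
directory = "/tmp/pi_output"

filename = "{}_q_value_leads_{}_{}.csv".format(tag, short_protein_score, psm_score)
complete_filepath = os.path.join(directory, filename)
print(complete_filepath)
```

Passing `output_filename` overrides this generated path entirely.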

    def csv_export_all_restricted(self, filename_out):
        """
        Method that outputs a subset of the passing proteins based on FDR.

        This method returns a non-square CSV file.

        Args:
            filename_out (str): Filename for the data to be written to

        """
        protein_objects = self.data.get_protein_objects(fdr_restricted=True)
        protein_export_list = [
            [
                "Protein",
                "Score",
                "Number_of_Peptides",
                "Identifier_Type",
                "GroupID",
                "Peptides",
            ]
        ]
        for groups in protein_objects:
            for prots in groups:
                protein_export_list.append([prots.identifier])
                protein_export_list[-1].append(prots.score)
                protein_export_list[-1].append(prots.num_peptides)
                if prots.reviewed:
                    protein_export_list[-1].append("Reviewed")
                else:
                    protein_export_list[-1].append("Unreviewed")
                protein_export_list[-1].append(prots.group_identification)
                for peps in prots.peptides:
                    protein_export_list[-1].append(peps)

        with open(filename_out, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerows(protein_export_list)
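The "non-square" output means rows may have differing column counts, since each protein contributes one trailing cell per peptide. A minimal sketch with hypothetical rows shows that `csv.writer` accepts such ragged rows as-is:

```python
import csv
import io

# Hypothetical header plus two protein rows with different peptide counts.
rows = [
    ["Protein", "Score", "Number_of_Peptides", "Identifier_Type", "GroupID", "Peptides"],
    ["P12345", 18.42, 3, "Reviewed", 1, "PEPTIDEA", "PEPTIDEB", "PEPTIDEC"],
    ["Q67890", 9.21, 1, "Unreviewed", 2, "PEPTIDEK"],
]

buf = io.StringIO()
csv.writer(buf).writerows(rows)  # writerows does not pad or truncate rows
lines = buf.getvalue().splitlines()
print(lines[1])  # P12345,18.42,3,Reviewed,1,PEPTIDEA,PEPTIDEB,PEPTIDEC
```

Downstream parsers therefore need to tolerate variable-length rows when reading these files.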

    def csv_export_leads_restricted(self, filename_out):
        """
        Method that outputs a subset of the passing proteins based on FDR.
        Only Proteins that pass FDR will be output and only Lead proteins will be output

        This method returns a non-square CSV file.

        Args:
            filename_out (str): Filename for the data to be written to.

        """
        protein_objects = self.data.get_protein_objects(fdr_restricted=True)
        protein_export_list = [
            [
                "Protein",
                "Score",
                "Number_of_Peptides",
                "Identifier_Type",
                "GroupID",
                "Peptides",
            ]
        ]
        for groups in protein_objects:
            protein_export_list.append([groups[0].identifier])
            protein_export_list[-1].append(groups[0].score)
            protein_export_list[-1].append(groups[0].num_peptides)
            if groups[0].reviewed:
                protein_export_list[-1].append("Reviewed")
            else:
                protein_export_list[-1].append("Unreviewed")
            protein_export_list[-1].append(groups[0].group_identification)
            for peps in sorted(groups[0].peptides):
                protein_export_list[-1].append(peps)

        with open(filename_out, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerows(protein_export_list)

    def csv_export_comma_sep_restricted(self, filename_out):
        """
        Method that outputs a subset of the passing proteins based on FDR.
        Only Proteins that pass FDR will be output and only Lead proteins will be output.
        Proteins in the groups of lead proteins will also be output in the same row as the lead.

        This method returns a non-square CSV file.

        Args:
            filename_out (str): Filename for the data to be written to.

        """
        protein_objects = self.data.get_protein_objects(fdr_restricted=True)
        protein_export_list = [
            [
                "Protein",
                "Score",
                "Number_of_Peptides",
                "Identifier_Type",
                "GroupID",
                "Other_Potential_Identifiers",
            ]
        ]
        for groups in protein_objects:
            for prots in groups:
                if prots == groups[0]:
                    protein_export_list.append([prots.identifier])
                    protein_export_list[-1].append(prots.score)
                    protein_export_list[-1].append(prots.num_peptides)
                    if prots.reviewed:
                        protein_export_list[-1].append("Reviewed")
                    else:
                        protein_export_list[-1].append("Unreviewed")
                    protein_export_list[-1].append(prots.group_identification)
                else:
                    protein_export_list[-1].append(prots.identifier)
        with open(filename_out, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerows(protein_export_list)

    def _format_lead_protein(self, groups):
        lead_protein = copy.copy(groups.proteins[0])
        if (
            self.data.parameter_file_object.inference_type == Inference.PARSIMONY
            and self.data.parameter_file_object.grouping_type == Inference.PARSIMONIOUS_GROUPING
        ):
            # Take all protein identifiers from groups.proteins, sort them first by
            # membership in self.data.digest.swiss_prot_protein_set and then
            # alphabetically, and join them with a semicolon.
            lead_protein.identifier = ";".join(
                [
                    x.identifier
                    for x in sorted(
                        groups.proteins,
                        key=lambda x: (x.identifier not in self.data.digest.swiss_prot_protein_set, x.identifier),
                    )
                ]
            )

        return lead_protein
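A standalone sketch of the identifier merging above, with hypothetical identifiers and a mocked SwissProt set:

```python
# Hypothetical protein identifiers and reviewed (SwissProt) set.
swiss_prot_protein_set = {"P12345", "Q67890"}
identifiers = ["TR123_HUMAN", "Q67890", "P12345"]

# SwissProt members sort first (False < True in the key tuple),
# then ties break alphabetically; the result is semicolon-joined.
merged = ";".join(
    sorted(identifiers, key=lambda x: (x not in swiss_prot_protein_set, x))
)
print(merged)  # P12345;Q67890;TR123_HUMAN
```

This keeps reviewed accessions at the front of the merged lead identifier, which only applies under parsimony inference with parsimonious grouping.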

    def csv_export_q_value_leads(self, filename_out):
        """
        Method that outputs all lead proteins with Q values.

        This method returns a non-square CSV file.

        Args:
            filename_out (str): Filename for the data to be written to.

        """
        protein_export_list = [
            [
                "Protein",
                "Score",
                "Q_Value",
                "Number_of_Peptides",
                "Identifier_Type",
                "GroupID",
                "Peptides",
            ]
        ]
        for groups in self.data.protein_group_objects:
            lead_protein = self._format_lead_protein(groups)
            protein_export_list.append([lead_protein.identifier])
            protein_export_list[-1].append(lead_protein.score)
            protein_export_list[-1].append(groups.q_value)
            protein_export_list[-1].append(lead_protein.num_peptides)
            if lead_protein.reviewed:
                protein_export_list[-1].append("Reviewed")
            else:
                protein_export_list[-1].append("Unreviewed")
            protein_export_list[-1].append(groups.number_id)
            peptides = lead_protein.peptides
            for peps in sorted(peptides):
                protein_export_list[-1].append(peps)

        with open(filename_out, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerows(protein_export_list)

    def csv_export_q_value_comma_sep(self, filename_out):
        """
        Method that outputs all lead proteins with Q values.
        Proteins in the groups of lead proteins will also be output in the same row as the lead.

        This method returns a non-square CSV file.

        Args:
            filename_out (str): Filename for the data to be written to.

        """
        protein_export_list = [
            [
                "Protein",
                "Score",
                "Q_Value",
                "Number_of_Peptides",
                "Identifier_Type",
                "GroupID",
                "Other_Potential_Identifiers",
            ]
        ]
        for groups in self.data.protein_group_objects:
            lead_protein = self._format_lead_protein(groups)
            protein_export_list.append([lead_protein.identifier])
            protein_export_list[-1].append(lead_protein.score)
            protein_export_list[-1].append(groups.q_value)
            protein_export_list[-1].append(lead_protein.num_peptides)
            if lead_protein.reviewed:
                protein_export_list[-1].append("Reviewed")
            else:
                protein_export_list[-1].append("Unreviewed")
            protein_export_list[-1].append(groups.number_id)
            for other_prots in groups.proteins[1:]:
                protein_export_list[-1].append(other_prots.identifier)

        with open(filename_out, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerows(protein_export_list)

    def csv_export_q_value_all(self, filename_out):
        """
        Method that outputs all proteins with Q values.
        Non Lead proteins are also output - entire group gets output.
        Proteins in the groups of lead proteins will also be output in the same row as the lead.

        This method returns a non-square CSV file.

        Args:
            filename_out (str): Filename for the data to be written to.

        """
        protein_export_list = [
            [
                "Protein",
                "Score",
                "Q_Value",
                "Number_of_Peptides",
                "Identifier_Type",
                "GroupID",
                "Peptides",
            ]
        ]
        for groups in self.data.protein_group_objects:
            for proteins in groups.proteins:
                protein_export_list.append([proteins.identifier])
                protein_export_list[-1].append(proteins.score)
                protein_export_list[-1].append(groups.q_value)
                protein_export_list[-1].append(proteins.num_peptides)
                if proteins.reviewed:
                    protein_export_list[-1].append("Reviewed")
                else:
                    protein_export_list[-1].append("Unreviewed")
                protein_export_list[-1].append(groups.number_id)
                for peps in sorted(proteins.peptides):
                    protein_export_list[-1].append(peps)

        with open(filename_out, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerows(protein_export_list)

    def csv_export_q_value_all_proteologic(self, filename_out):
        protein_export_list = [
            [
                "Protein",
                "Score",
                "Q_Value",
                "Number_of_Peptides",
                "Identifier_Type",
                "GroupID",
                "Peptides",
            ]
        ]
        for groups in self.data.protein_group_objects:
            for proteins in groups.proteins:
                protein_export_list.append([proteins.identifier])
                protein_export_list[-1].append(proteins.score)
                protein_export_list[-1].append(groups.q_value)
                protein_export_list[-1].append(proteins.num_peptides)
                if proteins.reviewed:
                    protein_export_list[-1].append("Reviewed")
                else:
                    protein_export_list[-1].append("Unreviewed")
                protein_export_list[-1].append(groups.number_id)
                for peps in sorted(proteins.peptides):
                    protein_export_list[-1].append(peps)

        with open(filename_out, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerows(protein_export_list)

    def csv_export_q_value_leads_long(self, filename_out):
        """
        Method that outputs all lead proteins with Q values.

        This method returns a long formatted result file with one peptide on each row.

        Args:
            filename_out (str): Filename for the data to be written to.

        """
        protein_export_list = [
            [
                "Protein",
                "Score",
                "Q_Value",
                "Number_of_Peptides",
                "Identifier_Type",
                "GroupID",
                "Peptides",
            ]
        ]
        for groups in self.data.protein_group_objects:
            lead_protein = self._format_lead_protein(groups)
            for peps in sorted(lead_protein.peptides):
                protein_export_list.append([lead_protein.identifier])
                protein_export_list[-1].append(lead_protein.score)
                protein_export_list[-1].append(groups.q_value)
                protein_export_list[-1].append(lead_protein.num_peptides)
                if lead_protein.reviewed:
                    protein_export_list[-1].append("Reviewed")
                else:
                    protein_export_list[-1].append("Unreviewed")
                protein_export_list[-1].append(groups.number_id)
                protein_export_list[-1].append(peps)

        with open(filename_out, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerows(protein_export_list)

    def csv_export_q_value_leads_peptides(self, filename_out, peptide_delimiter=" "):
        """
        Method that outputs all lead proteins with Q values in rectangular format.
        This method outputs unique peptides per protein.

        This method returns a rectangular CSV file.

        Args:
            filename_out (str): Filename for the data to be written to.
            peptide_delimiter (str): String to separate peptides by in the "Peptides" column of the csv file
        """
        protein_export_list = [
            [
                "Protein",
                "Score",
                "Q_Value",
                "Number_of_Peptides",
                "Identifier_Type",
                "GroupID",
                "Peptides",
            ]
        ]
        for groups in self.data.protein_group_objects:
            lead_protein = self._format_lead_protein(groups)
            protein_export_list.append([lead_protein.identifier])
            protein_export_list[-1].append(lead_protein.score)
            protein_export_list[-1].append(groups.q_value)
            protein_export_list[-1].append(lead_protein.num_peptides)
            if lead_protein.reviewed:
                protein_export_list[-1].append("Reviewed")
            else:
                protein_export_list[-1].append("Unreviewed")
            protein_export_list[-1].append(groups.number_id)
            peptides = peptide_delimiter.join(list(sorted(lead_protein.peptides)))
            protein_export_list[-1].append(peptides)

        with open(filename_out, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerows(protein_export_list)

    def csv_export_q_value_leads_psms(self, filename_out, peptide_delimiter=" "):
        """
        Method that outputs all lead proteins with Q values in rectangular format.
        This method outputs all PSMs for the protein, not just unique peptide identifiers.

        This method returns a rectangular CSV file.

        Args:
            filename_out (str): Filename for the data to be written to.
            peptide_delimiter (str): String to separate peptides by in the "Peptides" column of the csv file.
        """
        protein_export_list = [
            [
                "Protein",
                "Score",
                "Q_Value",
                "Number_of_Peptides",
                "Identifier_Type",
                "GroupID",
                "Peptides",
            ]
        ]
        for groups in self.data.protein_group_objects:
            lead_protein = self._format_lead_protein(groups)
            protein_export_list.append([lead_protein.identifier])
            protein_export_list[-1].append(lead_protein.score)
            protein_export_list[-1].append(groups.q_value)
            protein_export_list[-1].append(lead_protein.num_peptides)
            if lead_protein.reviewed:
                protein_export_list[-1].append("Reviewed")
            else:
                protein_export_list[-1].append("Unreviewed")
            protein_export_list[-1].append(groups.number_id)
            psms = peptide_delimiter.join(sorted([x.non_flanking_peptide for x in lead_protein.psms]))
            protein_export_list[-1].append(psms)

        with open(filename_out, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerows(protein_export_list)

    def csv_export_q_value_leads_psm_ids(self, filename_out, peptide_delimiter=" "):
        """
        Method that outputs all lead proteins with Q values in rectangular format.
        PSMs are output as their psm_id values, so sequence information is not output.

        This method returns a rectangular CSV file.

        Args:
            filename_out (str): Filename for the data to be written to.
            peptide_delimiter (str): String to separate psm_ids by in the "Peptides" column of the csv file.
        """
        protein_export_list = [
            [
                "Protein",
                "Score",
                "Q_Value",
                "Number_of_Peptides",
                "Identifier_Type",
                "GroupID",
                "Peptides",
            ]
        ]
        for groups in self.data.protein_group_objects:
            lead_protein = self._format_lead_protein(groups)
            protein_export_list.append([lead_protein.identifier])
            protein_export_list[-1].append(lead_protein.score)
            protein_export_list[-1].append(groups.q_value)
            protein_export_list[-1].append(lead_protein.num_peptides)
            if lead_protein.reviewed:
                protein_export_list[-1].append("Reviewed")
            else:
                protein_export_list[-1].append("Unreviewed")
            protein_export_list[-1].append(groups.number_id)
            psms = peptide_delimiter.join(sorted(lead_protein.get_psm_ids()))
            protein_export_list[-1].append(psms)

        with open(filename_out, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerows(protein_export_list)

__init__(data)

Initialization method for the Export class.

Parameters:
  • data (DataStore) –

    DataStore object.
Example

export = pyproteininference.export.Export(data=data)

Source code in pyproteininference/export.py
def __init__(self, data):
    """
    Initialization method for the Export class.

    Args:
        data (DataStore): [DataStore object][pyproteininference.datastore.DataStore].

    Example:
        >>> export = pyproteininference.export.Export(data=data)

    """
    self.data = data
    self.filepath = None

csv_export_all_restricted(filename_out)

Method that outputs a subset of the passing proteins based on FDR.

This method returns a non-square CSV file.

Parameters:
  • filename_out (str) –

    Filename for the data to be written to

Source code in pyproteininference/export.py
def csv_export_all_restricted(self, filename_out):
    """
    Method that outputs a subset of the passing proteins based on FDR.

    This method returns a non-square CSV file.

    Args:
        filename_out (str): Filename for the data to be written to

    """
    protein_objects = self.data.get_protein_objects(fdr_restricted=True)
    protein_export_list = [
        [
            "Protein",
            "Score",
            "Number_of_Peptides",
            "Identifier_Type",
            "GroupID",
            "Peptides",
        ]
    ]
    for groups in protein_objects:
        for prots in groups:
            protein_export_list.append([prots.identifier])
            protein_export_list[-1].append(prots.score)
            protein_export_list[-1].append(prots.num_peptides)
            if prots.reviewed:
                protein_export_list[-1].append("Reviewed")
            else:
                protein_export_list[-1].append("Unreviewed")
            protein_export_list[-1].append(prots.group_identification)
            for peps in prots.peptides:
                protein_export_list[-1].append(peps)

    with open(filename_out, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(protein_export_list)
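The "non-square" layout this method produces can be sketched as follows: a fixed header row, then one row per protein whose trailing cells are that protein's peptides, so row lengths vary. The protein identifiers, scores, and peptide strings below are hypothetical stand-ins, not values from the library.

```python
import csv
import io

# Hypothetical rows mirroring the ragged layout written by
# csv_export_all_restricted: five metadata cells, then a variable
# number of peptide cells per protein.
rows = [
    ["Protein", "Score", "Number_of_Peptides", "Identifier_Type", "GroupID", "Peptides"],
    ["P12345", 9.7, 2, "Reviewed", 1, "PEPTIDEA", "PEPTIDEB"],
    ["Q99999", 4.2, 1, "Unreviewed", 2, "PEPTIDEC"],
]

buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
```

Because csv.writer imposes no fixed column count, downstream readers must tolerate rows of differing widths when parsing these files.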

csv_export_comma_sep_restricted(filename_out)

Method that outputs a subset of the passing proteins based on FDR. Only proteins that pass FDR will be output, and only lead proteins will be output. Proteins in the groups of lead proteins will also be output in the same row as the lead.

This method returns a non-square CSV file.

Parameters:
  • filename_out (str) –

    Filename for the data to be written to.

Source code in pyproteininference/export.py
def csv_export_comma_sep_restricted(self, filename_out):
    """
    Method that outputs a subset of the passing proteins based on FDR.
    Only Proteins that pass FDR will be output and only Lead proteins will be output.
    Proteins in the groups of lead proteins will also be output in the same row as the lead.

    This method returns a non-square CSV file.

    Args:
        filename_out (str): Filename for the data to be written to.

    """
    protein_objects = self.data.get_protein_objects(fdr_restricted=True)
    protein_export_list = [
        [
            "Protein",
            "Score",
            "Number_of_Peptides",
            "Identifier_Type",
            "GroupID",
            "Other_Potential_Identifiers",
        ]
    ]
    for groups in protein_objects:
        for prots in groups:
            if prots == groups[0]:
                protein_export_list.append([prots.identifier])
                protein_export_list[-1].append(prots.score)
                protein_export_list[-1].append(prots.num_peptides)
                if prots.reviewed:
                    protein_export_list[-1].append("Reviewed")
                else:
                    protein_export_list[-1].append("Unreviewed")
                protein_export_list[-1].append(prots.group_identification)
            else:
                protein_export_list[-1].append(prots.identifier)
    with open(filename_out, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(protein_export_list)

csv_export_leads_restricted(filename_out)

Method that outputs a subset of the passing proteins based on FDR. Only proteins that pass FDR will be output, and only lead proteins will be output.

This method returns a non-square CSV file.

Parameters:
  • filename_out (str) –

    Filename for the data to be written to.

Source code in pyproteininference/export.py
def csv_export_leads_restricted(self, filename_out):
    """
    Method that outputs a subset of the passing proteins based on FDR.
    Only Proteins that pass FDR will be output and only Lead proteins will be output

    This method returns a non-square CSV file.

    Args:
        filename_out (str): Filename for the data to be written to.

    """
    protein_objects = self.data.get_protein_objects(fdr_restricted=True)
    protein_export_list = [
        [
            "Protein",
            "Score",
            "Number_of_Peptides",
            "Identifier_Type",
            "GroupID",
            "Peptides",
        ]
    ]
    for groups in protein_objects:
        protein_export_list.append([groups[0].identifier])
        protein_export_list[-1].append(groups[0].score)
        protein_export_list[-1].append(groups[0].num_peptides)
        if groups[0].reviewed:
            protein_export_list[-1].append("Reviewed")
        else:
            protein_export_list[-1].append("Unreviewed")
        protein_export_list[-1].append(groups[0].group_identification)
        for peps in sorted(groups[0].peptides):
            protein_export_list[-1].append(peps)

    with open(filename_out, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(protein_export_list)
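The lead-selection step in the loop above can be sketched in isolation: the first protein in each FDR-passing group is treated as the lead and is the only one written. The group contents below are hypothetical stand-ins.

```python
# Hypothetical FDR-restricted groups: each inner list is one protein group,
# ordered with the lead protein first (as in self.data.get_protein_objects).
groups = [["P11111", "P22222"], ["Q33333"]]

# csv_export_leads_restricted only touches groups[0] of each group.
leads = [group[0] for group in groups]
```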

csv_export_q_value_all(filename_out)

Method that outputs all proteins with Q values. Non-lead proteins are also output: the entire group gets output, with each protein in the group written on its own row.

This method returns a non-square CSV file.

Parameters:
  • filename_out (str) –

    Filename for the data to be written to.

Source code in pyproteininference/export.py
def csv_export_q_value_all(self, filename_out):
    """
    Method that outputs all proteins with Q values.
    Non-lead proteins are also output; the entire group gets output,
    with each protein in the group written on its own row.

    This method returns a non-square CSV file.

    Args:
        filename_out (str): Filename for the data to be written to.

    """
    protein_export_list = [
        [
            "Protein",
            "Score",
            "Q_Value",
            "Number_of_Peptides",
            "Identifier_Type",
            "GroupID",
            "Peptides",
        ]
    ]
    for groups in self.data.protein_group_objects:
        for proteins in groups.proteins:
            protein_export_list.append([proteins.identifier])
            protein_export_list[-1].append(proteins.score)
            protein_export_list[-1].append(groups.q_value)
            protein_export_list[-1].append(proteins.num_peptides)
            if proteins.reviewed:
                protein_export_list[-1].append("Reviewed")
            else:
                protein_export_list[-1].append("Unreviewed")
            protein_export_list[-1].append(groups.number_id)
            for peps in sorted(proteins.peptides):
                protein_export_list[-1].append(peps)

    with open(filename_out, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(protein_export_list)

csv_export_q_value_comma_sep(filename_out)

Method that outputs all lead proteins with Q values. Proteins in the groups of lead proteins will also be output in the same row as the lead.

This method returns a non-square CSV file.

Parameters:
  • filename_out (str) –

    Filename for the data to be written to.

Source code in pyproteininference/export.py
def csv_export_q_value_comma_sep(self, filename_out):
    """
    Method that outputs all lead proteins with Q values.
    Proteins in the groups of lead proteins will also be output in the same row as the lead.

    This method returns a non-square CSV file.

    Args:
        filename_out (str): Filename for the data to be written to.

    """
    protein_export_list = [
        [
            "Protein",
            "Score",
            "Q_Value",
            "Number_of_Peptides",
            "Identifier_Type",
            "GroupID",
            "Other_Potential_Identifiers",
        ]
    ]
    for groups in self.data.protein_group_objects:
        lead_protein = self._format_lead_protein(groups)
        protein_export_list.append([lead_protein.identifier])
        protein_export_list[-1].append(lead_protein.score)
        protein_export_list[-1].append(groups.q_value)
        protein_export_list[-1].append(lead_protein.num_peptides)
        if lead_protein.reviewed:
            protein_export_list[-1].append("Reviewed")
        else:
            protein_export_list[-1].append("Unreviewed")
        protein_export_list[-1].append(groups.number_id)
        for other_prots in groups.proteins[1:]:
            protein_export_list[-1].append(other_prots.identifier)

    with open(filename_out, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(protein_export_list)

csv_export_q_value_leads(filename_out)

Method that outputs all lead proteins with Q values.

This method returns a non-square CSV file.

Parameters:
  • filename_out (str) –

    Filename for the data to be written to.

Source code in pyproteininference/export.py
def csv_export_q_value_leads(self, filename_out):
    """
    Method that outputs all lead proteins with Q values.

    This method returns a non-square CSV file.

    Args:
        filename_out (str): Filename for the data to be written to.

    """
    protein_export_list = [
        [
            "Protein",
            "Score",
            "Q_Value",
            "Number_of_Peptides",
            "Identifier_Type",
            "GroupID",
            "Peptides",
        ]
    ]
    for groups in self.data.protein_group_objects:
        lead_protein = self._format_lead_protein(groups)
        protein_export_list.append([lead_protein.identifier])
        protein_export_list[-1].append(lead_protein.score)
        protein_export_list[-1].append(groups.q_value)
        protein_export_list[-1].append(lead_protein.num_peptides)
        if lead_protein.reviewed:
            protein_export_list[-1].append("Reviewed")
        else:
            protein_export_list[-1].append("Unreviewed")
        protein_export_list[-1].append(groups.number_id)
        peptides = lead_protein.peptides
        for peps in sorted(peptides):
            protein_export_list[-1].append(peps)

    with open(filename_out, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(protein_export_list)

csv_export_q_value_leads_long(filename_out)

Method that outputs all lead proteins with Q values.

This method returns a long formatted result file with one peptide on each row.

Parameters:
  • filename_out (str) –

    Filename for the data to be written to.

Source code in pyproteininference/export.py
def csv_export_q_value_leads_long(self, filename_out):
    """
    Method that outputs all lead proteins with Q values.

    This method returns a long formatted result file with one peptide on each row.

    Args:
        filename_out (str): Filename for the data to be written to.

    """
    protein_export_list = [
        [
            "Protein",
            "Score",
            "Q_Value",
            "Number_of_Peptides",
            "Identifier_Type",
            "GroupID",
            "Peptides",
        ]
    ]
    for groups in self.data.protein_group_objects:
        lead_protein = self._format_lead_protein(groups)
        for peps in sorted(lead_protein.peptides):
            protein_export_list.append([lead_protein.identifier])
            protein_export_list[-1].append(lead_protein.score)
            protein_export_list[-1].append(groups.q_value)
            protein_export_list[-1].append(lead_protein.num_peptides)
            if lead_protein.reviewed:
                protein_export_list[-1].append("Reviewed")
            else:
                protein_export_list[-1].append("Unreviewed")
            protein_export_list[-1].append(groups.number_id)
            protein_export_list[-1].append(peps)

    with open(filename_out, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(protein_export_list)
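The difference between this long layout and the ragged layout of csv_export_q_value_leads can be sketched from the same lead protein: the wide form appends all peptides to one row, while the long form repeats the metadata once per peptide. The lead-protein values here are hypothetical stand-ins.

```python
# Hypothetical lead-protein record (identifier, score, q-value, peptides).
lead = {
    "identifier": "P12345",
    "score": 9.7,
    "q_value": 0.001,
    "peptides": ["PEPB", "PEPA"],
}

# Ragged/wide form: one row per protein, sorted peptides as trailing cells.
wide_row = [lead["identifier"], lead["score"], lead["q_value"]] + sorted(lead["peptides"])

# Long form: one row per (protein, peptide) pair, metadata repeated each row.
long_rows = [
    [lead["identifier"], lead["score"], lead["q_value"], pep]
    for pep in sorted(lead["peptides"])
]
```

The long form trades file size for easier downstream processing, since every row has the same number of columns.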

csv_export_q_value_leads_peptides(filename_out, peptide_delimiter=' ')

Method that outputs all lead proteins with Q values in rectangular format. This method outputs unique peptides per protein.

This method returns a rectangular CSV file.

Parameters:
  • filename_out (str) –

    Filename for the data to be written to.

  • peptide_delimiter (str, default: ' ' ) –

    String to separate peptides by in the "Peptides" column of the csv file

Source code in pyproteininference/export.py
def csv_export_q_value_leads_peptides(self, filename_out, peptide_delimiter=" "):
    """
    Method that outputs all lead proteins with Q values in rectangular format.
    This method outputs unique peptides per protein.

    This method returns a rectangular CSV file.

    Args:
        filename_out (str): Filename for the data to be written to.
        peptide_delimiter (str): String to separate peptides by in the "Peptides" column of the csv file
    """
    protein_export_list = [
        [
            "Protein",
            "Score",
            "Q_Value",
            "Number_of_Peptides",
            "Identifier_Type",
            "GroupID",
            "Peptides",
        ]
    ]
    for groups in self.data.protein_group_objects:
        lead_protein = self._format_lead_protein(groups)
        protein_export_list.append([lead_protein.identifier])
        protein_export_list[-1].append(lead_protein.score)
        protein_export_list[-1].append(groups.q_value)
        protein_export_list[-1].append(lead_protein.num_peptides)
        if lead_protein.reviewed:
            protein_export_list[-1].append("Reviewed")
        else:
            protein_export_list[-1].append("Unreviewed")
        protein_export_list[-1].append(groups.number_id)
        peptides = peptide_delimiter.join(list(sorted(lead_protein.peptides)))
        protein_export_list[-1].append(peptides)

    with open(filename_out, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(protein_export_list)
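The "Peptides" cell built by this method can be sketched on its own: the unique peptides are sorted, then joined with peptide_delimiter into a single cell, which is what keeps the CSV rectangular. The peptide strings below are hypothetical stand-ins.

```python
# Hypothetical set of unique peptides mapped to one lead protein.
peptides = {"PEPC", "PEPA", "PEPB"}
peptide_delimiter = " "

# Sorting first gives a deterministic cell value regardless of set order.
cell = peptide_delimiter.join(sorted(peptides))
```

Choosing a delimiter that cannot appear inside a peptide sequence (the default space is safe for plain amino-acid strings) keeps the cell unambiguous to split later.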

csv_export_q_value_leads_psm_ids(filename_out, peptide_delimiter=' ')

Method that outputs all lead proteins with Q values in rectangular format. PSMs are output as their psm_id values, so sequence information is not output.

This method returns a rectangular CSV file.

Parameters:
  • filename_out (str) –

    Filename for the data to be written to.

  • peptide_delimiter (str, default: ' ' ) –

    String to separate psm_ids by in the "Peptides" column of the csv file.

Source code in pyproteininference/export.py
def csv_export_q_value_leads_psm_ids(self, filename_out, peptide_delimiter=" "):
    """
    Method that outputs all lead proteins with Q values in rectangular format.
    PSMs are output as their psm_id values, so sequence information is not output.

    This method returns a rectangular CSV file.

    Args:
        filename_out (str): Filename for the data to be written to.
        peptide_delimiter (str): String to separate psm_ids by in the "Peptides" column of the csv file.
    """
    protein_export_list = [
        [
            "Protein",
            "Score",
            "Q_Value",
            "Number_of_Peptides",
            "Identifier_Type",
            "GroupID",
            "Peptides",
        ]
    ]
    for groups in self.data.protein_group_objects:
        lead_protein = self._format_lead_protein(groups)
        protein_export_list.append([lead_protein.identifier])
        protein_export_list[-1].append(lead_protein.score)
        protein_export_list[-1].append(groups.q_value)
        protein_export_list[-1].append(lead_protein.num_peptides)
        if lead_protein.reviewed:
            protein_export_list[-1].append("Reviewed")
        else:
            protein_export_list[-1].append("Unreviewed")
        protein_export_list[-1].append(groups.number_id)
        psms = peptide_delimiter.join(sorted(lead_protein.get_psm_ids()))
        protein_export_list[-1].append(psms)

    with open(filename_out, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(protein_export_list)

csv_export_q_value_leads_psms(filename_out, peptide_delimiter=' ')

Method that outputs all lead proteins with Q values in rectangular format. This method outputs all PSMs for the protein, not just unique peptide identifiers.

This method returns a rectangular CSV file.

Parameters:
  • filename_out (str) –

    Filename for the data to be written to.

  • peptide_delimiter (str, default: ' ' ) –

    String to separate peptides by in the "Peptides" column of the csv file.

Source code in pyproteininference/export.py
def csv_export_q_value_leads_psms(self, filename_out, peptide_delimiter=" "):
    """
    Method that outputs all lead proteins with Q values in rectangular format.
    This method outputs all PSMs for the protein, not just unique peptide identifiers.

    This method returns a rectangular CSV file.

    Args:
        filename_out (str): Filename for the data to be written to.
        peptide_delimiter (str): String to separate peptides by in the "Peptides" column of the csv file.
    """
    protein_export_list = [
        [
            "Protein",
            "Score",
            "Q_Value",
            "Number_of_Peptides",
            "Identifier_Type",
            "GroupID",
            "Peptides",
        ]
    ]
    for groups in self.data.protein_group_objects:
        lead_protein = self._format_lead_protein(groups)
        protein_export_list.append([lead_protein.identifier])
        protein_export_list[-1].append(lead_protein.score)
        protein_export_list[-1].append(groups.q_value)
        protein_export_list[-1].append(lead_protein.num_peptides)
        if lead_protein.reviewed:
            protein_export_list[-1].append("Reviewed")
        else:
            protein_export_list[-1].append("Unreviewed")
        protein_export_list[-1].append(groups.number_id)
        psms = peptide_delimiter.join(sorted([x.non_flanking_peptide for x in lead_protein.psms]))
        protein_export_list[-1].append(psms)

    with open(filename_out, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(protein_export_list)

export_to_csv(output_filename=None, directory=None, export_type='q_value')

Method that dispatches to one of the many export methods given an export_type input.

The filepath is determined based on the directory argument and information from the DataStore object.

This method sets the filepath variable.

Parameters:
  • output_filename (str, default: None ) –

    Filepath to write to. If set as None will auto generate filename and will write to directory variable.

  • directory (str, default: None ) –

    Directory to write the result file to. If None, will write to current working directory.

  • export_type (str, default: 'q_value' ) –

    Must be a value in EXPORT_TYPES and determines the output format.

Example

export = pyproteininference.export.Export(data=data)
export.export_to_csv(output_filename=None, directory="/path/to/output/dir/", export_type="psms")

Source code in pyproteininference/export.py
def export_to_csv(self, output_filename=None, directory=None, export_type="q_value"):
    """
    Method that dispatches to one of the many export methods given an export_type input.

    filepath is determined based on directory arg and information from
    [DataStore object][pyproteininference.datastore.DataStore].

    This method sets the `filepath` variable.

    Args:
        output_filename (str): Filepath to write to. If set as None will auto generate filename and
            will write to directory variable.
        directory (str): Directory to write the result file to. If None, will write to current working directory.
        export_type (str): Must be a value in `EXPORT_TYPES` and determines the output format.

    Example:
        >>> export = pyproteininference.export.Export(data=data)
        >>> export.export_to_csv(output_filename=None, directory="/path/to/output/dir/", export_type="psms")

    """

    if not directory:
        directory = os.getcwd()

    data = self.data
    tag = data.parameter_file_object.tag

    if self.EXPORT_LEADS == export_type:
        filename = "{}_leads_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
        complete_filepath = os.path.join(directory, filename)
        if output_filename:
            complete_filepath = output_filename
        logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
        self.csv_export_leads_restricted(filename_out=complete_filepath)

    elif self.EXPORT_ALL == export_type:
        filename = "{}_all_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
        complete_filepath = os.path.join(directory, filename)
        if output_filename:
            complete_filepath = output_filename
        logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
        self.csv_export_all_restricted(complete_filepath)

    elif self.EXPORT_COMMA_SEP == export_type:
        filename = "{}_comma_sep_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
        complete_filepath = os.path.join(directory, filename)
        if output_filename:
            complete_filepath = output_filename
        logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
        self.csv_export_comma_sep_restricted(complete_filepath)

    elif self.EXPORT_Q_VALUE_COMMA_SEP == export_type:
        filename = "{}_q_value_comma_sep_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
        complete_filepath = os.path.join(directory, filename)
        if output_filename:
            complete_filepath = output_filename
        logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
        self.csv_export_q_value_comma_sep(complete_filepath)

    elif self.EXPORT_Q_VALUE == export_type:
        filename = "{}_q_value_leads_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
        complete_filepath = os.path.join(directory, filename)
        if output_filename:
            complete_filepath = output_filename
        logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
        self.csv_export_q_value_leads(complete_filepath)

    elif self.EXPORT_Q_VALUE_ALL == export_type:
        filename = "{}_q_value_all_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
        complete_filepath = os.path.join(directory, filename)
        if output_filename:
            complete_filepath = output_filename
        logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
        self.csv_export_q_value_all(complete_filepath)

    elif self.EXPORT_PEPTIDES == export_type:
        filename = "{}_q_value_leads_peptides_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
        complete_filepath = os.path.join(directory, filename)
        if output_filename:
            complete_filepath = output_filename
        logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
        self.csv_export_q_value_leads_peptides(complete_filepath)

    elif self.EXPORT_PSMS == export_type:
        filename = "{}_q_value_leads_psms_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
        complete_filepath = os.path.join(directory, filename)
        if output_filename:
            complete_filepath = output_filename
        logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
        self.csv_export_q_value_leads_psms(complete_filepath)

    elif self.EXPORT_PSM_IDS == export_type:
        filename = "{}_q_value_leads_psm_ids_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
        complete_filepath = os.path.join(directory, filename)
        if output_filename:
            complete_filepath = output_filename
        logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
        self.csv_export_q_value_leads_psm_ids(complete_filepath)

    elif self.EXPORT_LONG == export_type:
        filename = "{}_q_value_long_{}_{}.csv".format(tag, data.short_protein_score, data.psm_score)
        complete_filepath = os.path.join(directory, filename)
        if output_filename:
            complete_filepath = output_filename
        logger.info("Exporting Protein Inference Data to File: {}".format(complete_filepath))
        self.csv_export_q_value_leads_long(complete_filepath)

    else:
        complete_filepath = os.path.join(directory, "protein_inference_results.csv")
        if output_filename:
            complete_filepath = output_filename

    self.filepath = complete_filepath

Protein

Bases: object

The following class is a representation of a Protein that stores characteristics/attributes of a protein for the entire analysis. We use `__slots__` to predefine the attributes the Protein object can have. This is done to speed up runtime of the PI algorithm.

Attributes:
  • identifier (str) –

    String identifier for the Protein object.

  • score (float) –

    Float that represents the protein score as output from Score object methods.

  • psms (list) –

    List of Psm objects.

  • group_identification (set) –

    Set of group Identifiers that the protein belongs to (int).

  • reviewed (bool) –

    True/False on if the identifier is reviewed.

  • unreviewed (bool) –

    True/False on if the identifier is unreviewed.

  • peptides (list) –

    List of non flanking peptide sequences.

  • peptide_scores (list) –

    List of Psm scores associated with the protein.

  • picked (bool) –

    True/False if the protein passes the picker algo. True if passes. False if does not pass.

  • num_peptides (int) –

    Number of peptides that map to the given Protein.

  • unique_peptides (list) –

    List of peptide strings that are unique to this protein across the analysis.

  • num_unique_peptides (int) –

    Number of unique peptides.

  • raw_peptides (list) –

    List of raw peptides. Includes flanking AA and Mods.

Source code in pyproteininference/physical.py
class Protein(object):
    """
    The following class is a representation of a Protein that stores characteristics/attributes of a protein for the
        entire analysis.
    We use __slots__ to predefine the attributes the Protein Object can have.
    This is done to speed up runtime of the PI algorithm.

    Attributes:
        identifier (str): String identifier for the Protein object.
        score (float): Float that represents the protein score as output from
            [Score object][pyproteininference.scoring.Score] methods.
        psms (list): List of [Psm][pyproteininference.physical.Psm] objects.
        group_identification (set): Set of group Identifiers that the protein belongs to (int).
        reviewed (bool): True/False on if the identifier is reviewed.
        unreviewed (bool): True/False on if the identifier is unreviewed.
        peptides (list): List of non flanking peptide sequences.
        peptide_scores (list): List of Psm scores associated with the protein.
        picked (bool): True/False if the protein passes the picker algo. True if passes. False if does not pass.
        num_peptides (int): Number of peptides that map to the given Protein.
        unique_peptides (list): List of peptide strings that are unique to this protein across the analysis.
        num_unique_peptides (int): Number of unique peptides.
        raw_peptides (list): List of raw peptides. Includes flanking AA and Mods.

    """

    __slots__ = (
        "identifier",
        "score",
        "psms",
        "group_identification",
        "reviewed",
        "unreviewed",
        "peptides",
        "peptide_scores",
        "picked",
        "num_peptides",
        "unique_peptides",
        "num_unique_peptides",
        "raw_peptides",
    )

    def __init__(self, identifier):
        """
        Initialization method for Protein object.

        Args:
            identifier (str): String identifier for the Protein object.

        Example:
            >>> protein = pyproteininference.physical.Protein(identifier = "PRKDC_HUMAN|P78527")

        """
        self.identifier = identifier
        self.score = None
        self.psms = []  # List of psm objects
        self.group_identification = set()
        self.reviewed = False
        self.unreviewed = False
        self.peptides = None  # Sequence info without flanking
        self.peptide_scores = None  # remove
        self.picked = True
        self.num_peptides = None  # remove
        self.unique_peptides = None  # remove
        self.num_unique_peptides = None  # remove
        self.raw_peptides = set()  # Includes Flanking Seq Info

    def get_psm_scores(self):
        """
        Retrieves psm scores for a given protein.

        Returns:
            list: List of psm scores for the given protein.

        """
        score_list = [x.main_score for x in self.psms]
        return score_list

    def get_psm_identifiers(self):
        """
        Retrieves a list of Psm identifiers.

         Returns:
             list: List of Psm identifiers.

        """
        psms = [x.identifier for x in self.psms]
        return psms

    def get_stripped_psm_identifiers(self):
        """
        Retrieves a list of Psm identifiers that have had mods removed and flanking AAs removed.

         Returns:
             list: List of Psm identifiers that have no mods or flanking AAs.

        """
        psms = [x.stripped_peptide for x in self.psms]
        return psms

    def get_unique_peptide_identifiers(self):
        """
        Retrieves the unique set of peptides for a protein.

         Returns:
             set: Set of peptide strings.

        """
        unique_peptides = set(self.get_psm_identifiers())
        return unique_peptides

    def get_unique_stripped_peptide_identifiers(self):
        """
        Retrieves the unique set of peptides for a protein that are stripped.

         Returns:
             set: Set of peptide strings that are stripped of mods and flanking AAs.

        """
        stripped_peptide_identifiers = set(self.get_stripped_psm_identifiers())
        return stripped_peptide_identifiers

    def get_num_psms(self):
        """
        Retrieves the number of Psms.

         Returns:
             int: Number of Psms.

        """
        num_psms = len(self.get_psm_identifiers())
        return num_psms

    def get_num_peptides(self):
        """
        Retrieves the number of peptides.

         Returns:
             int: Number of peptides.

        """
        num_peptides = len(self.get_unique_peptide_identifiers())
        return num_peptides

    def get_psm_ids(self):
        """
        Retrieves the Psm Ids.

         Returns:
            list: List of Psm Ids.

        """
        psm_ids = [x.psm_id for x in self.psms]
        return psm_ids

__init__(identifier)

Initialization method for Protein object.

Parameters:
  • identifier (str) –

    String identifier for the Protein object.

Example

protein = pyproteininference.physical.Protein(identifier = "PRKDC_HUMAN|P78527")

Source code in pyproteininference/physical.py
def __init__(self, identifier):
    """
    Initialization method for Protein object.

    Args:
        identifier (str): String identifier for the Protein object.

    Example:
        >>> protein = pyproteininference.physical.Protein(identifier = "PRKDC_HUMAN|P78527")

    """
    self.identifier = identifier
    self.score = None
    self.psms = []  # List of psm objects
    self.group_identification = set()
    self.reviewed = False
    self.unreviewed = False
    self.peptides = None  # Sequence info without flanking
    self.peptide_scores = None  # remove
    self.picked = True
    self.num_peptides = None  # remove
    self.unique_peptides = None  # remove
    self.num_unique_peptides = None  # remove
    self.raw_peptides = set()  # Includes Flanking Seq Info

get_num_peptides()

Retrieves the number of peptides.

Returns: int: Number of peptides.

Source code in pyproteininference/physical.py
def get_num_peptides(self):
    """
    Retrieves the number of peptides.

     Returns:
         int: Number of peptides.

    """
    num_peptides = len(self.get_unique_peptide_identifiers())
    return num_peptides

get_num_psms()

Retrieves the number of Psms.

Returns: int: Number of Psms.

Source code in pyproteininference/physical.py
def get_num_psms(self):
    """
    Retrieves the number of Psms.

     Returns:
         int: Number of Psms.

    """
    num_psms = len(self.get_psm_identifiers())
    return num_psms

get_psm_identifiers()

Retrieves a list of Psm identifiers.

Returns: list: List of Psm identifiers.

Source code in pyproteininference/physical.py
def get_psm_identifiers(self):
    """
    Retrieves a list of Psm identifiers.

     Returns:
         list: List of Psm identifiers.

    """
    psms = [x.identifier for x in self.psms]
    return psms

get_psm_ids()

Retrieves the Psm Ids.

Returns: list: List of Psm Ids.

Source code in pyproteininference/physical.py
def get_psm_ids(self):
    """
    Retrieves the Psm Ids.

     Returns:
        list: List of Psm Ids.

    """
    psm_ids = [x.psm_id for x in self.psms]
    return psm_ids

get_psm_scores()

Retrieves psm scores for a given protein.

Returns:
  • list

    List of psm scores for the given protein.

Source code in pyproteininference/physical.py
def get_psm_scores(self):
    """
    Retrieves psm scores for a given protein.

    Returns:
        list: List of psm scores for the given protein.

    """
    score_list = [x.main_score for x in self.psms]
    return score_list

get_stripped_psm_identifiers()

Retrieves a list of Psm identifiers that have had mods removed and flanking AAs removed.

Returns: list: List of Psm identifiers that have no mods or flanking AAs.

Source code in pyproteininference/physical.py
def get_stripped_psm_identifiers(self):
    """
    Retrieves a list of Psm identifiers that have had mods removed and flanking AAs removed.

     Returns:
         list: List of Psm identifiers that have no mods or flanking AAs.

    """
    psms = [x.stripped_peptide for x in self.psms]
    return psms

get_unique_peptide_identifiers()

Retrieves the unique set of peptides for a protein.

Returns: set: Set of peptide strings.

Source code in pyproteininference/physical.py
def get_unique_peptide_identifiers(self):
    """
    Retrieves the unique set of peptides for a protein.

     Returns:
         set: Set of peptide strings.

    """
    unique_peptides = set(self.get_psm_identifiers())
    return unique_peptides

get_unique_stripped_peptide_identifiers()

Retrieves the unique set of peptides for a protein that are stripped.

Returns: set: Set of peptide strings that are stripped of mods and flanking AAs.

Source code in pyproteininference/physical.py
def get_unique_stripped_peptide_identifiers(self):
    """
    Retrieves the unique set of peptides for a protein that are stripped.

     Returns:
         set: Set of peptide strings that are stripped of mods and flanking AAs.

    """
    stripped_peptide_identifiers = set(self.get_stripped_psm_identifiers())
    return stripped_peptide_identifiers

ProteinGroup

Bases: object

The following class is a physical Protein Group class that stores characteristics of a Protein Group for the entire analysis. We use `__slots__` to predefine the attributes the ProteinGroup object can have. This is done to speed up runtime of the PI algorithm.

Attributes:
  • number_id (int) –

    unique Integer to represent a group.

  • proteins (list) –

    List of Protein objects.

  • q_value (float) –

    Q value for the protein group that is calculated with method calculate_q_values.

Source code in pyproteininference/physical.py
class ProteinGroup(object):
    """
    The following class is a physical Protein Group class that stores characteristics of a Protein Group for the entire
        analysis.
    We use __slots__ to predefine the attributes the ProteinGroup Object can have.
    This is done to speed up runtime of the PI algorithm.

    Attributes:
        number_id (int): unique Integer to represent a group.
        proteins (list): List of [Protein][pyproteininference.physical.Protein] objects.
        q_value (float): Q value for the protein group that is calculated with method
            [calculate_q_values][pyproteininference.datastore.DataStore.calculate_q_values].

    """

    __slots__ = ("proteins", "number_id", "q_value")

    def __init__(self, number_id):
        """
        Initialization method for ProteinGroup object.

        Args:
            number_id (int): unique Integer to represent a group.

        Example:
            >>> pg = pyproteininference.physical.ProteinGroup(number_id = 1)
        """

        self.proteins = []
        self.number_id = number_id
        self.q_value = None

__init__(number_id)

Initialization method for ProteinGroup object.

Parameters:
  • number_id (int) –

    unique Integer to represent a group.

Example

pg = pyproteininference.physical.ProteinGroup(number_id = 1)

Source code in pyproteininference/physical.py
def __init__(self, number_id):
    """
    Initialization method for ProteinGroup object.

    Args:
        number_id (int): unique Integer to represent a group.

    Example:
        >>> pg = pyproteininference.physical.ProteinGroup(number_id = 1)
    """

    self.proteins = []
    self.number_id = number_id
    self.q_value = None

Psm

Bases: object

The following class is a physical Psm class that stores characteristics of a psm for the entire analysis. We use `__slots__` to predefine the attributes the Psm object can have. This is done to speed up runtime of the PI algorithm.

Attributes:
  • identifier (str) –

    Peptide Identifier: IE "K.DLIDEGH#AATQLVNQLHDVVVENNLSDK.Q".

  • percscore (float) –

    Percolator Score from input file if it exists.

  • qvalue (float) –

    Q value from input file if it exists.

  • pepvalue (float) –

    Pep value from input file if it exists.

  • possible_proteins (list) –

    List of protein strings that the Psm maps to based on the digest.

  • psm_id (str) –

    String that represents a global identifier for the Psm. Should come from input files.

  • custom_score (float) –

    Score that comes from a custom column in the input files.

  • main_score (float) –

    The Psm score to be used as the scoring variable for protein scoring. Can be percscore, qvalue, pepvalue, or custom_score.

  • stripped_peptide (str) –

    This is the identifier attribute that has had mods removed and flanking AAs removed IE: DLIDEGHAATQLVNQLHDVVVENNLSDK.

  • non_flanking_peptide (str) –

    This is the identifier attribute that has had flanking AAs removed IE: DLIDEGH#AATQLVNQLHDVVVENNLSDK. #NOTE Mods are still present here.

Source code in pyproteininference/physical.py
class Psm(object):
    """
    The following class is a physical Psm class that stores characteristics of a psm for the entire analysis.
    We use __slots__ to predefine the attributes the Psm Object can have.
    This is done to speed up runtime of the PI algorithm.

    Attributes:
        identifier (str): Peptide Identifier: IE "K.DLIDEGH#AATQLVNQLHDVVVENNLSDK.Q".
        percscore (float): Percolator Score from input file if it exists.
        qvalue (float): Q value from input file if it exists.
        pepvalue (float): Pep value from input file if it exists.
        possible_proteins (list): List of protein strings that the Psm maps to based on the digest.
        psm_id (str): String that represents a global identifier for the Psm. Should come from input files.
        custom_score (float): Score that comes from a custom column in the input files.
        main_score (float): The Psm score to be used as the scoring variable for protein scoring. Can be
            percscore, qvalue, pepvalue, or custom_score.
        stripped_peptide (str): This is the identifier attribute that has had mods removed and flanking AAs
            removed IE: DLIDEGHAATQLVNQLHDVVVENNLSDK.
        non_flanking_peptide (str): This is the identifier attribute that has had flanking AAs
            removed IE: DLIDEGH#AATQLVNQLHDVVVENNLSDK. #NOTE Mods are still present here.

    """

    __slots__ = (
        "identifier",
        "percscore",
        "qvalue",
        "pepvalue",
        "possible_proteins",
        "psm_id",
        "custom_score",
        "main_score",
        "stripped_peptide",
        "non_flanking_peptide",
    )

    # The regex removes anything between parentheses, including the parentheses - \([^()]*\)
    # The regex removes anything between brackets, including the brackets - \[.*?\]
    # And the regex removes anything that is not an A-Z character [^A-Z]
    MOD_REGEX = re.compile("\([^()]*\)|\[.*?\]|[^A-Z]")  # noqa W605

    FRONT_FLANKING_REGEX = re.compile("^[A-Z|-][.]")
    BACK_FLANKING_REGEX = re.compile("[.][A-Z|-]$")

    SCORE_ATTRIBUTE_NAMES = set(["pepvalue", "qvalue", "percscore", "custom_score"])

    def __init__(self, identifier):
        """
        Initialization method for the Psm object.
        This method also initializes the `stripped_peptide` and `non_flanking_peptide` attributes.

        Args:
            identifier (str): Peptide Identifier: IE "K.DLIDEGH#AATQLVNQLHDVVVENNLSDK.Q".

        Example:
            >>> psm = pyproteininference.physical.Psm(identifier = "K.DLIDEGHAATQLVNQLHDVVVENNLSDK.Q")

        """
        self.identifier = identifier
        self.percscore = None
        self.qvalue = None
        self.pepvalue = None
        self.possible_proteins = None
        self.psm_id = None
        self.custom_score = None
        self.main_score = None
        self.stripped_peptide = None
        self.non_flanking_peptide = None

        # Add logic to split the peptide and strip it of mods
        current_peptide = Psm.split_peptide(peptide_string=self.identifier)

        self.non_flanking_peptide = current_peptide

        if not current_peptide.isupper() or not current_peptide.isalpha():
            # If we have mods remove them...
            peptide_string = current_peptide.upper()
            stripped_peptide = Psm.remove_peptide_mods(peptide_string)
            current_peptide = stripped_peptide

        # Set stripped_peptide variable
        self.stripped_peptide = current_peptide

    @classmethod
    def remove_peptide_mods(cls, peptide_string):
        """
        This class method takes a string and uses a `MOD_REGEX` to remove mods from peptide strings.

        Args:
            peptide_string (str): Peptide string to have mods removed from.

        Returns:
            str: a peptide string with mods removed.

        """
        stripped_peptide = cls.MOD_REGEX.sub("", peptide_string)
        return stripped_peptide

    @classmethod
    def split_peptide(cls, peptide_string, delimiter="."):
        """
        This class method takes a peptide string with flanking AAs and removes them from the peptide string.
        This method uses string splitting and if the method produces a faulty peptide the method
            [split_peptide_pro][pyproteininference.physical.Psm.split_peptide_pro] will be called.

        Args:
            peptide_string (str): Peptide string to have flanking AAs removed from.
            delimiter (str): a string to indicate what separates a leading/trailing (flanking) AA from the
                peptide sequence.

        Returns:
            str: a peptide string with flanking AAs removed.

        """
        peptide_split = peptide_string.split(delimiter)
        if len(peptide_split) == 3:
            # If we get 3 chunks it will usually be ['A', 'ADGSDFGSS', 'F']
            # So take index 1
            peptide = peptide_split[1]
        elif len(peptide_split) == 1:
            # If we get 1 chunk it should just be ['ADGSDFGSS']
            # So take index 0
            peptide = peptide_split[0]
        else:
            # If we split the peptide and it is not length 1 or 3 then try to split with pro
            peptide = cls.split_peptide_pro(peptide_string=peptide_string, delimiter=delimiter)

        return peptide

    @classmethod
    def split_peptide_pro(cls, peptide_string, delimiter="."):
        """
        This class method takes a peptide string with flanking AAs and removes them from the peptide string.
        This is a specialized method of [split_peptide][pyproteininference.physical.Psm.split_peptide] that uses
         regex identifiers to replace flanking AAs as opposed to string splitting.


        Args:
            peptide_string (str): Peptide string to have flanking AAs removed from.
            delimiter (str): a string to indicate what separates a leading/trailing (flanking) AA from the peptide
                sequence.

        Returns:
            str: a peptide string with flanking AAs removed.

        """

        if delimiter != ".":
            front_regex = "^[A-Z|-][{}]".format(delimiter)
            cls.FRONT_FLANKING_REGEX = re.compile(front_regex)
            back_regex = "[{}][A-Z|-]$".format(delimiter)
            cls.BACK_FLANKING_REGEX = re.compile(back_regex)

        # Replace the front flanking with nothing
        peptide_string = cls.FRONT_FLANKING_REGEX.sub("", peptide_string)

        # Replace the back flanking with nothing
        peptide_string = cls.BACK_FLANKING_REGEX.sub("", peptide_string)

        return peptide_string

    def assign_main_score(self, score):
        """
        This method takes in a score type and assigns the variable main_score for a given Psm based on the score type.

        Args:
            score (str): This is a string representation of the Psm attribute that will get assigned to the main_score
                variable.

        """
        # Assign a main score based on user input
        if score not in self.SCORE_ATTRIBUTE_NAMES:
            raise ValueError("Score must be one of: '{}'".format(", ".join(self.SCORE_ATTRIBUTE_NAMES)))
        else:
            self.main_score = getattr(self, score)

__init__(identifier)

Initialization method for the Psm object. This method also initializes the stripped_peptide and non_flanking_peptide attributes.

Parameters:
  • identifier (str) –

    Peptide Identifier: IE "K.DLIDEGH#AATQLVNQLHDVVVENNLSDK.Q".

Example

psm = pyproteininference.physical.Psm(identifier = "K.DLIDEGHAATQLVNQLHDVVVENNLSDK.Q")

Source code in pyproteininference/physical.py
def __init__(self, identifier):
    """
    Initialization method for the Psm object.
    This method also initializes the `stripped_peptide` and `non_flanking_peptide` attributes.

    Args:
        identifier (str): Peptide Identifier: IE "K.DLIDEGH#AATQLVNQLHDVVVENNLSDK.Q".

    Example:
        >>> psm = pyproteininference.physical.Psm(identifier = "K.DLIDEGHAATQLVNQLHDVVVENNLSDK.Q")

    """
    self.identifier = identifier
    self.percscore = None
    self.qvalue = None
    self.pepvalue = None
    self.possible_proteins = None
    self.psm_id = None
    self.custom_score = None
    self.main_score = None
    self.stripped_peptide = None
    self.non_flanking_peptide = None

    # Add logic to split the peptide and strip it of mods
    current_peptide = Psm.split_peptide(peptide_string=self.identifier)

    self.non_flanking_peptide = current_peptide

    if not current_peptide.isupper() or not current_peptide.isalpha():
        # If we have mods remove them...
        peptide_string = current_peptide.upper()
        stripped_peptide = Psm.remove_peptide_mods(peptide_string)
        current_peptide = stripped_peptide

    # Set stripped_peptide variable
    self.stripped_peptide = current_peptide

assign_main_score(score)

This method takes in a score type and assigns the variable main_score for a given Psm based on the score type.

Parameters:
  • score (str) –

    This is a string representation of the Psm attribute that will get assigned to the main_score variable.

Source code in pyproteininference/physical.py
def assign_main_score(self, score):
    """
    This method takes in a score type and assigns the variable main_score for a given Psm based on the score type.

    Args:
        score (str): This is a string representation of the Psm attribute that will get assigned to the main_score
            variable.

    """
    # Assign a main score based on user input
    if score not in self.SCORE_ATTRIBUTE_NAMES:
        raise ValueError("Score must be one of: '{}'".format(", ".join(self.SCORE_ATTRIBUTE_NAMES)))
    else:
        self.main_score = getattr(self, score)
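The `getattr`-based dispatch in `assign_main_score` can be sketched in isolation; `MiniPsm` below is a hypothetical stand-in for `Psm`, showing only the score-dispatch pattern:

```python
class MiniPsm:
    """Hypothetical stand-in for Psm, showing only the score-dispatch pattern."""

    SCORE_ATTRIBUTE_NAMES = {"pepvalue", "qvalue", "percscore", "custom_score"}

    def __init__(self):
        self.pepvalue = 0.001
        self.qvalue = 0.01
        self.percscore = None
        self.custom_score = None
        self.main_score = None

    def assign_main_score(self, score):
        # getattr turns the score-type string into an attribute lookup
        if score not in self.SCORE_ATTRIBUTE_NAMES:
            raise ValueError("Score must be one of: '{}'".format(", ".join(self.SCORE_ATTRIBUTE_NAMES)))
        self.main_score = getattr(self, score)


psm = MiniPsm()
psm.assign_main_score("qvalue")
print(psm.main_score)  # 0.01
```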

remove_peptide_mods(peptide_string) classmethod

This class method takes a string and uses a MOD_REGEX to remove mods from peptide strings.

Parameters:
  • peptide_string (str) –

    Peptide string to have mods removed from.

Returns:
  • str

    a peptide string with mods removed.

Source code in pyproteininference/physical.py
@classmethod
def remove_peptide_mods(cls, peptide_string):
    """
    This class method takes a string and uses a `MOD_REGEX` to remove mods from peptide strings.

    Args:
        peptide_string (str): Peptide string to have mods removed from.

    Returns:
        str: a peptide string with mods removed.

    """
    stripped_peptide = cls.MOD_REGEX.sub("", peptide_string)
    return stripped_peptide
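The `MOD_REGEX` pattern shown in the source can be exercised directly with the standard `re` module; the input strings below are illustrative:

```python
import re

# Same pattern as Psm.MOD_REGEX: strip (...) groups, [...] groups,
# and any remaining character that is not A-Z.
MOD_REGEX = re.compile(r"\([^()]*\)|\[.*?\]|[^A-Z]")

print(MOD_REGEX.sub("", "DLIDEGH#AATQLVNQLHDVVVENNLSDK"))
# DLIDEGHAATQLVNQLHDVVVENNLSDK
print(MOD_REGEX.sub("", "PEPM[15.99]TIDEK"))
# PEPMTIDEK
```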

split_peptide(peptide_string, delimiter='.') classmethod

This class method takes a peptide string with flanking AAs and removes them from the peptide string. This method uses string splitting; if it produces a faulty peptide, split_peptide_pro is called instead.

Parameters:
  • peptide_string (str) –

    Peptide string to have flanking AAs removed from.

  • delimiter (str, default: '.' ) –

    a string to indicate what separates a leading/trailing (flanking) AA from the peptide sequence.

Returns:
  • str

    a peptide string with flanking AAs removed.

Source code in pyproteininference/physical.py
@classmethod
def split_peptide(cls, peptide_string, delimiter="."):
    """
    This class method takes a peptide string with flanking AAs and removes them from the peptide string.
    This method uses string splitting and if the method produces a faulty peptide the method
        [split_peptide_pro][pyproteininference.physical.Psm.split_peptide_pro] will be called.

    Args:
        peptide_string (str): Peptide string to have flanking AAs removed from.
        delimiter (str): a string to indicate what separates a leading/trailing (flanking) AA from the
            peptide sequence.

    Returns:
        str: a peptide string with flanking AAs removed.

    """
    peptide_split = peptide_string.split(delimiter)
    if len(peptide_split) == 3:
        # If we get 3 chunks it will usually be ['A', 'ADGSDFGSS', 'F']
        # So take index 1
        peptide = peptide_split[1]
    elif len(peptide_split) == 1:
        # If we get 1 chunk it should just be ['ADGSDFGSS']
        # So take index 0
        peptide = peptide_split[0]
    else:
        # If we split the peptide and it is not length 1 or 3 then try to split with pro
        peptide = cls.split_peptide_pro(peptide_string=peptide_string, delimiter=delimiter)

    return peptide
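The string-splitting branch can be traced with plain `str.split`; the standalone `split_flanking` function below is hypothetical and mirrors the logic rather than calling the class method:

```python
def split_flanking(peptide_string, delimiter="."):
    """Mirrors Psm.split_peptide: 3 chunks -> middle chunk, 1 chunk -> whole string."""
    parts = peptide_string.split(delimiter)
    if len(parts) == 3:
        # ['A', 'ADGSDFGSS', 'F'] -> take the middle chunk
        return parts[1]
    if len(parts) == 1:
        # ['ADGSDFGSS'] -> already has no flanking AAs
        return parts[0]
    # Any other chunk count would fall through to the regex-based split_peptide_pro
    return None


print(split_flanking("A.ADGSDFGSS.F"))  # ADGSDFGSS
print(split_flanking("ADGSDFGSS"))      # ADGSDFGSS
```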

split_peptide_pro(peptide_string, delimiter='.') classmethod

This class method takes a peptide string with flanking AAs and removes them from the peptide string. This is a specialized version of split_peptide that uses regular expressions to remove flanking AAs instead of string splitting.

Parameters:
  • peptide_string (str) –

    Peptide string to have flanking AAs removed from.

  • delimiter (str, default: '.' ) –

    a string to indicate what separates a leading/trailing (flanking) AA from the peptide sequence.

Returns:
  • str

    a peptide string with flanking AAs removed.

Source code in pyproteininference/physical.py
@classmethod
def split_peptide_pro(cls, peptide_string, delimiter="."):
    """
    This class method takes a peptide string with flanking AAs and removes them from the peptide string.
    This is a specialized method of [split_peptide][pyproteininference.physical.Psm.split_peptide] that uses
     regex identifiers to replace flanking AAs as opposed to string splitting.


    Args:
        peptide_string (str): Peptide string to have flanking AAs removed from.
        delimiter (str): a string to indicate what separates a leading/trailing (flanking) AA from the peptide
            sequence.

    Returns:
        str: a peptide string with flanking AAs removed.

    """

    if delimiter != ".":
        front_regex = "^[A-Z|-][{}]".format(delimiter)
        cls.FRONT_FLANKING_REGEX = re.compile(front_regex)
        back_regex = "[{}][A-Z|-]$".format(delimiter)
        cls.BACK_FLANKING_REGEX = re.compile(back_regex)

    # Replace the front flanking with nothing
    peptide_string = cls.FRONT_FLANKING_REGEX.sub("", peptide_string)

    # Replace the back flanking with nothing
    peptide_string = cls.BACK_FLANKING_REGEX.sub("", peptide_string)

    return peptide_string
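The two flanking regexes can likewise be exercised on their own; the example peptide below is illustrative and contains a mod mass with a "." in it, which is exactly the case where plain splitting on the delimiter yields neither 1 nor 3 chunks:

```python
import re

# Same patterns as Psm.FRONT_FLANKING_REGEX / Psm.BACK_FLANKING_REGEX
FRONT_FLANKING_REGEX = re.compile(r"^[A-Z|-][.]")
BACK_FLANKING_REGEX = re.compile(r"[.][A-Z|-]$")

# A mod mass containing "." makes str.split produce 4 chunks,
# so split_peptide falls back to this regex-based approach.
peptide = "K.PEPM[15.99]TIDE.R"
peptide = FRONT_FLANKING_REGEX.sub("", peptide)  # strips leading "K."
peptide = BACK_FLANKING_REGEX.sub("", peptide)   # strips trailing ".R"
print(peptide)  # PEPM[15.99]TIDE
```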

HeuristicPipeline

Bases: ProteinInferencePipeline

This is the Protein Inference Heuristic class which houses the logic to run the Protein Inference Heuristic method to determine the best inference method for the given data. Logic is executed in the execute method.

Attributes:
  • parameter_file (str) –

    Path to Protein Inference Yaml Parameter File.

  • database_file (str) –

    Path to Fasta database used in proteomics search.

  • target_files (str / list) –

    Path to Target Psm File (Or a list of files).

  • decoy_files (str / list) –

    Path to Decoy Psm File (Or a list of files).

  • combined_files (str / list) –

    Path to Combined Psm File (Or a list of files).

  • target_directory (str) –

    Path to Directory containing Target Psm Files.

  • decoy_directory (str) –

    Path to Directory containing Decoy Psm Files.

  • combined_directory (str) –

    Path to Directory containing Combined Psm Files.

  • output_directory (str) –

    Path to Directory where output will be written.

  • output_filename (str) –

    Path to Filename where output will be written. Will override output_directory.

  • id_splitting (bool) –

    True/False on whether to split protein IDs in the digest. Advanced usage only.

  • append_alt_from_db (bool) –

    True/False on whether to append alternative proteins from the DB digestion in Reader class.

  • pdf_filename (str) –

    Filepath to be written to by Heuristic Plotting method. This is optional and a default filename will be created in output_directory if this is left as None.

  • inference_method_list (list) –

    List of inference methods used in heuristic determination.

  • datastore_dict (dict) –

    Dictionary of DataStore objects generated in heuristic determination with the inference method as the key of each entry.

  • selected_methods (list) –

    a list of String representations of the selected inference methods based on the heuristic.

  • selected_datastores (dict) –

    a Dictionary of DataStore object objects as selected by the heuristic.

  • output_type (str) –

    How to output results. Can either be "all" or "optimal". Will either output all results or will only output the optimal results.

Source code in pyproteininference/heuristic.py
class HeuristicPipeline(ProteinInferencePipeline):
    """
    This is the Protein Inference Heuristic class which houses the logic to run the Protein Inference Heuristic method
     to determine the best inference method for the given data.
    Logic is executed in the [execute][pyproteininference.heuristic.HeuristicPipeline.execute] method.

    Attributes:
        parameter_file (str): Path to Protein Inference Yaml Parameter File.
        database_file (str): Path to Fasta database used in proteomics search.
        target_files (str/list): Path to Target Psm File (Or a list of files).
        decoy_files (str/list): Path to Decoy Psm File (Or a list of files).
        combined_files (str/list): Path to Combined Psm File (Or a list of files).
        target_directory (str): Path to Directory containing Target Psm Files.
        decoy_directory (str): Path to Directory containing Decoy Psm Files.
        combined_directory (str): Path to Directory containing Combined Psm Files.
        output_directory (str): Path to Directory where output will be written.
        output_filename (str): Path to Filename where output will be written. Will override output_directory.
        id_splitting (bool): True/False on whether to split protein IDs in the digest.
            Advanced usage only.
        append_alt_from_db (bool): True/False on whether to append
            alternative proteins from the DB digestion in Reader class.
        pdf_filename (str): Filepath to be written to by Heuristic Plotting method.
            This is optional and a default filename will be created in output_directory if this is left as None.
        inference_method_list (list): List of inference methods used in heuristic determination.
        datastore_dict (dict): Dictionary of [DataStore][pyproteininference.datastore.DataStore]
            objects generated in heuristic determination with the inference method as the key of each entry.
        selected_methods (list): a list of String representations of the selected inference methods based on the
            heuristic.
        selected_datastores (dict):
            a Dictionary of [DataStore object][pyproteininference.datastore.DataStore] objects as selected by the
            heuristic.
        output_type (str): How to output results. Can either be "all" or "optimal". Will either output all results
            or will only output the optimal results.

    """

    RATIO_CONSTANT = 2
    OUTPUT_TYPES = ["all", "optimal"]

    def __init__(
        self,
        parameter_file=None,
        database_file=None,
        target_files=None,
        decoy_files=None,
        combined_files=None,
        target_directory=None,
        decoy_directory=None,
        combined_directory=None,
        output_directory=None,
        output_filename=None,
        id_splitting=False,
        append_alt_from_db=True,
        pdf_filename=None,
        output_type="all",
    ):
        """

        Args:
            parameter_file (str): Path to Protein Inference Yaml Parameter File.
            database_file (str): Path to Fasta database used in proteomics search.
            target_files (str/list): Path to Target Psm File (Or a list of files).
            decoy_files (str/list): Path to Decoy Psm File (Or a list of files).
            combined_files (str/list): Path to Combined Psm File (Or a list of files).
            target_directory (str): Path to Directory containing Target Psm Files.
            decoy_directory (str): Path to Directory containing Decoy Psm Files.
            combined_directory (str): Path to Directory containing Combined Psm Files.
            output_directory (str): Path to Directory where output will be written.
            output_filename (str): Path to Filename where output will be written.
                Will override output_directory.
            id_splitting (bool): True/False on whether to split protein IDs in the digest.
                Advanced usage only.
            append_alt_from_db (bool): True/False on whether to append alternative proteins
                from the DB digestion in Reader class.
            pdf_filename (str): Filepath to be written to by Heuristic Plotting method.
                This is optional and a default filename will be created in output_directory if this is left as None.
            output_type (str): How to output results. Can either be "all" or "optimal". Will either output all
                results or will only output the optimal results.

        Returns:
            HeuristicPipeline: [HeuristicPipeline][pyproteininference.heuristic.HeuristicPipeline] object

        Example:
            >>> heuristic = pyproteininference.heuristic.HeuristicPipeline(
            >>>     parameter_file=yaml_params,
            >>>     database_file=database,
            >>>     target_files=target,
            >>>     decoy_files=decoy,
            >>>     combined_files=combined_files,
            >>>     target_directory=target_directory,
            >>>     decoy_directory=decoy_directory,
            >>>     combined_directory=combined_directory,
            >>>     output_directory=dir_name,
            >>>     output_filename=output_filename,
            >>>     append_alt_from_db=append_alt,
            >>>     pdf_filename=pdf_filename,
            >>>     output_type="all"
            >>> )
        """

        self.parameter_file = parameter_file
        self.database_file = database_file
        self.target_files = target_files
        self.decoy_files = decoy_files
        self.combined_files = combined_files
        self.target_directory = target_directory
        self.decoy_directory = decoy_directory
        self.combined_directory = combined_directory
        self.output_directory = output_directory
        self.output_filename = output_filename
        self.id_splitting = id_splitting
        self.append_alt_from_db = append_alt_from_db
        self.output_type = output_type
        if self.output_type not in self.OUTPUT_TYPES:
            raise ValueError("The variable output_type must be set to either 'all' or 'optimal'")
        if not pdf_filename:
            if self.output_directory and not self.output_filename:
                self.pdf_filename = os.path.join(self.output_directory, "heuristic_plot.pdf")
            elif self.output_filename:
                self.pdf_filename = os.path.join(os.path.split(self.output_filename)[0], "heuristic_plot.pdf")
            else:
                self.pdf_filename = os.path.join(os.getcwd(), "heuristic_plot.pdf")

        else:
            self.pdf_filename = pdf_filename

        self.inference_method_list = [
            Inference.INCLUSION,
            Inference.EXCLUSION,
            Inference.PARSIMONY,
            Inference.PEPTIDE_CENTRIC,
        ]
        self.datastore_dict = {}
        self.selected_methods = None
        self.selected_datastores = {}

        self._validate_input()

        self._set_output_directory()

        self._log_append_alt_from_db()

    def execute(self, fdr_threshold=0.05):
        """
        This method is the main driver of the heuristic method.
        This method calls other classes and methods that make up the heuristic pipeline.
        This includes but is not limited to:

        1. Loops over the main inference methods: Inclusion, Exclusion, Parsimony, and Peptide Centric.
        2. Determines the optimal inference method based on the input data as well as the database file.
        3. Outputs the results and indicates the optimal results.

        Args:
            fdr_threshold (float): The Qvalue/FDR threshold the heuristic method uses to base calculations from.

        Returns:
            None:

        Example:
            >>> heuristic = pyproteininference.heuristic.HeuristicPipeline(
            >>>     parameter_file=yaml_params,
            >>>     database_file=database,
            >>>     target_files=target,
            >>>     decoy_files=decoy,
            >>>     combined_files=combined_files,
            >>>     target_directory=target_directory,
            >>>     decoy_directory=decoy_directory,
            >>>     combined_directory=combined_directory,
            >>>     output_directory=dir_name,
            >>>     output_filename=output_filename,
            >>>     append_alt_from_db=append_alt,
            >>>     pdf_filename=pdf_filename,
            >>>     output_type="all"
            >>> )
            >>> heuristic.execute(fdr_threshold=0.05)

        """

        pyproteininference_parameters = pyproteininference.parameters.ProteinInferenceParameter(
            yaml_param_filepath=self.parameter_file
        )

        digest = pyproteininference.in_silico_digest.PyteomicsDigest(
            database_path=self.database_file,
            digest_type=pyproteininference_parameters.digest_type,
            missed_cleavages=pyproteininference_parameters.missed_cleavages,
            reviewed_identifier_symbol=pyproteininference_parameters.reviewed_identifier_symbol,
            max_peptide_length=pyproteininference_parameters.restrict_peptide_length,
            id_splitting=self.id_splitting,
        )
        if self.database_file:
            logger.info("Running In Silico Database Digest on file {}".format(self.database_file))
            digest.digest_fasta_database()
        else:
            logger.warning(
                "No Database File provided, Skipping database digest and only taking protein-peptide mapping from the "
                "input files."
            )

        for inference_method in self.inference_method_list:

            method_specific_parameters = copy.deepcopy(pyproteininference_parameters)

            logger.info("Overriding inference type {}".format(method_specific_parameters.inference_type))

            method_specific_parameters.inference_type = inference_method

            logger.info("New inference type {}".format(method_specific_parameters.inference_type))
            logger.info("FDR Threshold Set to {}".format(method_specific_parameters.fdr))

            reader = pyproteininference.reader.GenericReader(
                target_file=self.target_files,
                decoy_file=self.decoy_files,
                combined_files=self.combined_files,
                parameter_file_object=method_specific_parameters,
                digest=digest,
                append_alt_from_db=self.append_alt_from_db,
            )
            reader.read_psms()

            data = pyproteininference.datastore.DataStore(reader=reader, digest=digest)

            data.restrict_psm_data()

            data.recover_mapping()

            data.create_scoring_input()

            if method_specific_parameters.inference_type == Inference.EXCLUSION:
                data.exclude_non_distinguishing_peptides()

            score = pyproteininference.scoring.Score(data=data)
            score.score_psms(score_method=method_specific_parameters.protein_score)

            if method_specific_parameters.picker:
                data.protein_picker()
            else:
                pass

            pyproteininference.inference.Inference.run_inference(data=data, digest=digest)

            data.calculate_q_values()

            self.datastore_dict[inference_method] = data

        self.selected_methods = self.determine_optimal_inference_method(
            false_discovery_rate_threshold=fdr_threshold, pdf_filename=self.pdf_filename
        )
        self.selected_datastores = {x: self.datastore_dict[x] for x in self.selected_methods}

        if self.output_type == "all":
            self._write_all_results(parameters=method_specific_parameters)
        elif self.output_type == "optimal":
            self._write_optimal_results(parameters=method_specific_parameters)
        else:
            self._write_optimal_results(parameters=method_specific_parameters)

    def generate_roc_plot(self, fdr_max=0.2, pdf_filename=None):
        """
        This method produces a PDF ROC plot overlaying the 4 inference methods that are part of the heuristic algorithm.

        Args:
            fdr_max (float): Max FDR to display on the plot.
            pdf_filename (str): Filename to write roc plot to.

        Returns:
            None:

        """
        f = plt.figure()
        for inference_method in self.datastore_dict.keys():
            fdr_vs_target_hits = self.datastore_dict[inference_method].generate_fdr_vs_target_hits(fdr_max=fdr_max)
            fdrs = [x[0] for x in fdr_vs_target_hits]
            target_hits = [x[1] for x in fdr_vs_target_hits]
            plt.plot(fdrs, target_hits, '-', label=inference_method.replace("_", " "))
            target_fdr = self.datastore_dict[inference_method].parameter_file_object.fdr
            if inference_method in self.selected_methods:
                best_value = min(fdrs, key=lambda x: abs(x - target_fdr))
                best_index = fdrs.index(best_value)
                best_target_hit_value = target_hits[best_index]  # noqa F841

        plt.axvline(target_fdr, color="black", linestyle='--', alpha=0.75, label="Target FDR")
        plt.legend()
        plt.xlabel('Decoy FDR')
        plt.ylabel('Target Protein Hits')
        plt.xlim([-0.01, fdr_max])
        plt.legend(loc='lower right')
        plt.title("FDR vs Target Protein Hits per Inference Method")
        if pdf_filename:
            logger.info("Writing ROC plot to: {}".format(pdf_filename))
            f.savefig(pdf_filename)
        plt.close()

    def _write_all_results(self, parameters):
        """
        Internal method that loops over all results and writes them out.
        """
        for method in list(self.datastore_dict.keys()):
            datastore = self.datastore_dict[method]
            if method in self.selected_methods:
                inference_method_string = "{}_{}".format(method, "optimal_method")
            else:
                inference_method_string = method
            if not self.output_filename and self.output_directory:
                # If a filename is not provided then construct one using output_directory
                # Note: output_directory will always get set even if its set as None - gets set to cwd
                inference_filename = os.path.join(
                    self.output_directory,
                    "{}_{}_{}_{}_{}".format(
                        inference_method_string,
                        parameters.tag,
                        datastore.short_protein_score,
                        datastore.psm_score,
                        "protein_inference_results.csv",
                    ),
                )
            if self.output_filename:
                # If the user specified an output filename then split it apart and insert the inference method
                # Then reconstruct the file
                split = os.path.split(self.output_filename)
                path = split[0]
                filename = split[1]
                inference_filename = os.path.join(path, "{}_{}".format(inference_method_string, filename))
            export = pyproteininference.export.Export(data=self.datastore_dict[method])
            export.export_to_csv(
                output_filename=inference_filename,
                directory=self.output_directory,
                export_type=parameters.export,
            )

    def _write_optimal_results(self, parameters):
        """
        Internal method that writes out the optimized results.
        """

        for method in self.selected_methods:
            datastore = self.datastore_dict[method]
            inference_method_string = "{}_{}".format(method, "optimal_method")
            if not self.output_filename and self.output_directory:
                # If a filename is not provided then construct one using output_directory
                # Note: output_directory will always get set even if its set as None - gets set to cwd
                inference_filename = os.path.join(
                    self.output_directory,
                    "{}_{}_{}_{}_{}".format(
                        inference_method_string,
                        parameters.tag,
                        datastore.short_protein_score,
                        datastore.psm_score,
                        "protein_inference_results.csv",
                    ),
                )
            if self.output_filename:
                # If the user specified an output filename then split it apart and insert the inference method
                # Then reconstruct the file
                split = os.path.split(self.output_filename)
                path = split[0]
                filename = split[1]
                inference_filename = os.path.join(path, "{}_{}".format(inference_method_string, filename))
            export = pyproteininference.export.Export(data=self.selected_datastores[method])
            export.export_to_csv(
                output_filename=inference_filename,
                directory=self.output_directory,
                export_type=parameters.export,
            )

    def determine_optimal_inference_method(
        self,
        false_discovery_rate_threshold=0.05,
        upper_empirical_threshold=1,
        lower_empirical_threshold=0.5,
        pdf_filename=None,
    ):
        """
        This method determines the optimal inference method from Inclusion, Exclusion, Parsimony, Peptide-Centric.

        Args:
            false_discovery_rate_threshold (float): The fdr threshold to use in heuristic algorithm -
                This parameter determines the maximum fdr used when creating a range of finite FDR values.
            upper_empirical_threshold (float): Upper Threshold used for parsimony/peptide centric cutoff for
                the heuristic algorithm.
            lower_empirical_threshold (float): Lower Threshold used for inclusion/exclusion cutoff for
                the heuristic algorithm.
            pdf_filename (str): Filename to write heuristic density plot to.


        Returns:
            list: List of string representations of the recommended inference methods.

        """

        # Get the number of passing proteins
        number_stdev_from_mean_dict = {}
        fdrs = [false_discovery_rate_threshold * 0.01 * x for x in range(100)]
        for fdr in fdrs:
            stdev_from_mean = self.determine_number_stdev_from_mean(false_discovery_rate=fdr)
            number_stdev_from_mean_dict[fdr] = stdev_from_mean

        stdev_collection = collections.defaultdict(list)
        for fdr in fdrs:
            for key in number_stdev_from_mean_dict[fdr]:
                stdev_collection[key].append(number_stdev_from_mean_dict[fdr][key])

        heuristic_scores = self.generate_density_plot(
            number_stdevs_from_mean=stdev_collection, pdf_filename=pdf_filename
        )

        # Apply conditional statement with lower and upper thresholds
        if (
            heuristic_scores[Inference.PARSIMONY] <= lower_empirical_threshold
            or heuristic_scores[Inference.PEPTIDE_CENTRIC] <= lower_empirical_threshold
        ):
            # If parsimony or peptide centric are less than the lower empirical threshold
            # Then select the best method of the two
            logger.info(
                "Either parsimony {} or peptide centric {} pass empirical threshold {}. "
                "Selecting the best method of the two.".format(
                    heuristic_scores[Inference.PARSIMONY],
                    heuristic_scores[Inference.PEPTIDE_CENTRIC],
                    lower_empirical_threshold,
                )
            )
            sub_dict = {
                Inference.PARSIMONY: heuristic_scores[Inference.PARSIMONY],
                Inference.PEPTIDE_CENTRIC: heuristic_scores[Inference.PEPTIDE_CENTRIC],
            }

            if (
                heuristic_scores[Inference.PARSIMONY] <= lower_empirical_threshold
                and heuristic_scores[Inference.PEPTIDE_CENTRIC] <= lower_empirical_threshold
            ):
                # If both are under the threshold return both
                selected_methods = [Inference.PARSIMONY, Inference.PEPTIDE_CENTRIC]

            else:
                selected_methods = [min(sub_dict, key=sub_dict.get)]

        # If the above condition does not apply
        elif (
            heuristic_scores[Inference.EXCLUSION] <= upper_empirical_threshold
            or heuristic_scores[Inference.INCLUSION] <= upper_empirical_threshold
        ):
            # If exclusion or inclusion are less than the upper empirical threshold
            # Then select the best method of the two
            logger.info(
                "Either inclusion {} or exclusion {} pass empirical threshold {}. "
                "Selecting the best method of the two.".format(
                    heuristic_scores[Inference.INCLUSION],
                    heuristic_scores[Inference.EXCLUSION],
                    upper_empirical_threshold,
                )
            )
            sub_dict = {
                Inference.EXCLUSION: heuristic_scores[Inference.EXCLUSION],
                Inference.INCLUSION: heuristic_scores[Inference.INCLUSION],
            }

            if (
                heuristic_scores[Inference.EXCLUSION] <= upper_empirical_threshold
                and heuristic_scores[Inference.INCLUSION] <= upper_empirical_threshold
            ):
                # If both are under the threshold return both
                selected_methods = [Inference.INCLUSION, Inference.EXCLUSION]

            else:
                selected_methods = [min(sub_dict, key=sub_dict.get)]

        else:
            # If we have no conditional scenarios...
            # Select the best method
            logger.info("No methods pass empirical thresholds, selecting the best method")
            selected_methods = [min(heuristic_scores, key=heuristic_scores.get)]

        logger.info("Method(s) {} selected with the heuristic algorithm".format(", ".join(selected_methods)))
        return selected_methods

    def generate_density_plot(self, number_stdevs_from_mean, pdf_filename=None):
        """
        This method produces a PDF density plot overlaying the 4 inference methods that are part of the heuristic
        algorithm.

        Args:
            number_stdevs_from_mean (dict): a dictionary of the number of standard deviations from the mean per
                inference method for a range of FDRs.
            pdf_filename (str): Filename to write heuristic density plot to.

        Returns:
            dict: a dictionary of heuristic scores per inference method which correlates to the
                maximum point of the density plot per inference method.

        """
        f = plt.figure()

        heuristic_scores = {}
        for method in number_stdevs_from_mean:
            readable_method_name = Inference.INFERENCE_NAME_MAP[method]
            kwargs = dict(histtype='stepfilled', alpha=0.3, density=True, bins=40, ec="k", label=readable_method_name)
            counts, bin_edges, _ = plt.hist(number_stdevs_from_mean[method], **kwargs)
            # The heuristic score is the absolute position of the tallest histogram bin
            center = bin_edges[list(counts).index(max(counts))]
            heuristic_scores[method] = abs(center)

        plt.axvline(0, color="black", linestyle='--', alpha=0.75)
        plt.title("Density Plot of the Number of Standard Deviations from the Mean")
        plt.xlabel('Number of Standard Deviations from the Mean')
        plt.ylabel('Number of Observations')
        plt.legend(loc='upper right')
        if pdf_filename:
            logger.info("Writing Heuristic Density plot to: {}".format(pdf_filename))
            f.savefig(pdf_filename)
        else:
            plt.show()
        plt.close()

        logger.info("Heuristic Scores")
        logger.info(heuristic_scores)

        return heuristic_scores

    def determine_number_stdev_from_mean(self, false_discovery_rate):
        """
        This method calculates the mean of the number of proteins identified at a specific FDR of all
        4 methods and then for each method calculates the number of standard deviations
        from the previous calculated mean.

        Args:
            false_discovery_rate (float): The false discovery rate used as a cutoff for calculations.

        Returns:
            dict: a dictionary of the number of standard deviations away from the mean per inference method.

        """

        filtered_protein_objects = {
            x: self.datastore_dict[x].get_protein_objects(
                fdr_restricted=True, false_discovery_rate=false_discovery_rate
            )
            for x in self.datastore_dict.keys()
        }
        number_passing_proteins = {x: len(filtered_protein_objects[x]) for x in filtered_protein_objects.keys()}

        # Calculate how similar the number of passing proteins is for each method
        all_values = [x for x in number_passing_proteins.values()]
        mean = numpy.mean(all_values)
        standard_deviation = statistics.stdev(all_values)
        number_stdev_from_mean_dict = {}
        for key in number_passing_proteins.keys():
            cur_value = number_passing_proteins[key]
            number_stdev_from_mean_dict[key] = (cur_value - mean) / standard_deviation

        return number_stdev_from_mean_dict
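
Taken together, the two methods above can be sketched numerically. The protein counts below are hypothetical, and the per-method score is simplified here to a single |z-score| rather than the density-plot mode computed over a range of FDRs:

```python
import statistics

# Hypothetical counts of FDR-passing proteins per inference method.
counts = {"inclusion": 4000, "exclusion": 1500,
          "parsimony": 2400, "peptide_centric": 2500}
mean = statistics.mean(counts.values())
stdev = statistics.stdev(counts.values())
# Simplified stand-in for the heuristic score of each method
scores = {k: abs((v - mean) / stdev) for k, v in counts.items()}

lower, upper = 0.5, 1  # default empirical thresholds from the method above
pars, pep = scores["parsimony"], scores["peptide_centric"]
if pars <= lower or pep <= lower:
    # Prefer parsimony/peptide-centric when either clears the lower threshold
    if pars <= lower and pep <= lower:
        selected = ["parsimony", "peptide_centric"]
    else:
        selected = ["parsimony" if pars < pep else "peptide_centric"]
elif scores["inclusion"] <= upper or scores["exclusion"] <= upper:
    sub = {k: scores[k] for k in ("inclusion", "exclusion")}
    selected = list(sub) if all(v <= upper for v in sub.values()) else [min(sub, key=sub.get)]
else:
    selected = [min(scores, key=scores.get)]

print(selected)  # ['parsimony', 'peptide_centric']
```

With these counts, parsimony and peptide-centric both sit well under half a standard deviation from the mean, so both clear the lower threshold and are selected together.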

__init__(parameter_file=None, database_file=None, target_files=None, decoy_files=None, combined_files=None, target_directory=None, decoy_directory=None, combined_directory=None, output_directory=None, output_filename=None, id_splitting=False, append_alt_from_db=True, pdf_filename=None, output_type='all')

Parameters:
  • parameter_file (str, default: None ) –

    Path to Protein Inference Yaml Parameter File.

  • database_file (str, default: None ) –

    Path to Fasta database used in proteomics search.

  • target_files (str / list, default: None ) –

    Path to Target Psm File (Or a list of files).

  • decoy_files (str / list, default: None ) –

    Path to Decoy Psm File (Or a list of files).

  • combined_files (str / list, default: None ) –

    Path to Combined Psm File (Or a list of files).

  • target_directory (str, default: None ) –

    Path to Directory containing Target Psm Files.

  • decoy_directory (str, default: None ) –

    Path to Directory containing Decoy Psm Files.

  • combined_directory (str, default: None ) –

    Path to Directory containing Combined Psm Files.

  • output_directory (str, default: None ) –

    Path to Directory where output will be written.

  • output_filename (str, default: None ) –

    Path to Filename where output will be written. Will override output_directory.

  • id_splitting (bool, default: False ) –

    True/False on whether to split protein IDs in the digest. Advanced usage only.

  • append_alt_from_db (bool, default: True ) –

    True/False on whether to append alternative proteins from the DB digestion in Reader class.

  • pdf_filename (str, default: None ) –

    Filepath to be written to by Heuristic Plotting method. This is optional and a default filename will be created in output_directory if this is left as None

  • output_type (str, default: 'all' ) –

    How to output results. Can either be "all" or "optimal". Will either output all results or will only output the optimal results.

Returns:
  • HeuristicPipeline –

    HeuristicPipeline object

Example

heuristic = pyproteininference.heuristic.HeuristicPipeline(
    parameter_file=yaml_params,
    database_file=database,
    target_files=target,
    decoy_files=decoy,
    combined_files=combined_files,
    target_directory=target_directory,
    decoy_directory=decoy_directory,
    combined_directory=combined_directory,
    output_directory=dir_name,
    output_filename=output_filename,
    append_alt_from_db=append_alt,
    pdf_filename=pdf_filename,
    output_type="all"
)

Source code in pyproteininference/heuristic.py
def __init__(
    self,
    parameter_file=None,
    database_file=None,
    target_files=None,
    decoy_files=None,
    combined_files=None,
    target_directory=None,
    decoy_directory=None,
    combined_directory=None,
    output_directory=None,
    output_filename=None,
    id_splitting=False,
    append_alt_from_db=True,
    pdf_filename=None,
    output_type="all",
):
    """

    Args:
        parameter_file (str): Path to Protein Inference Yaml Parameter File.
        database_file (str): Path to Fasta database used in proteomics search.
        target_files (str/list): Path to Target Psm File (Or a list of files).
        decoy_files (str/list): Path to Decoy Psm File (Or a list of files).
        combined_files (str/list): Path to Combined Psm File (Or a list of files).
        target_directory (str): Path to Directory containing Target Psm Files.
        decoy_directory (str): Path to Directory containing Decoy Psm Files.
        combined_directory (str): Path to Directory containing Combined Psm Files.
        output_directory (str): Path to Directory where output will be written.
        output_filename (str): Path to Filename where output will be written.
            Will override output_directory.
        id_splitting (bool): True/False on whether to split protein IDs in the digest.
            Advanced usage only.
        append_alt_from_db (bool): True/False on whether to append alternative proteins
            from the DB digestion in Reader class.
        pdf_filename (str): Filepath to be written to by the heuristic plotting method. This is
            optional; a default filename will be created in output_directory if this is left as None.
        output_type (str): How to output results. Can either be "all" or "optimal". Will either
            output all results or only the optimal results.

    Returns:
        HeuristicPipeline: [HeuristicPipeline][pyproteininference.heuristic.HeuristicPipeline] object

    Example:
        >>> heuristic = pyproteininference.heuristic.HeuristicPipeline(
        >>>     parameter_file=yaml_params,
        >>>     database_file=database,
        >>>     target_files=target,
        >>>     decoy_files=decoy,
        >>>     combined_files=combined_files,
        >>>     target_directory=target_directory,
        >>>     decoy_directory=decoy_directory,
        >>>     combined_directory=combined_directory,
        >>>     output_directory=dir_name,
        >>>     output_filename=output_filename,
        >>>     append_alt_from_db=append_alt,
        >>>     pdf_filename=pdf_filename,
        >>>     output_type="all"
        >>> )
    """

    self.parameter_file = parameter_file
    self.database_file = database_file
    self.target_files = target_files
    self.decoy_files = decoy_files
    self.combined_files = combined_files
    self.target_directory = target_directory
    self.decoy_directory = decoy_directory
    self.combined_directory = combined_directory
    self.output_directory = output_directory
    self.output_filename = output_filename
    self.id_splitting = id_splitting
    self.append_alt_from_db = append_alt_from_db
    self.output_type = output_type
    if self.output_type not in self.OUTPUT_TYPES:
        raise ValueError("The variable output_type must be set to either 'all' or 'optimal'")
    if not pdf_filename:
        if self.output_directory and not self.output_filename:
            self.pdf_filename = os.path.join(self.output_directory, "heuristic_plot.pdf")
        elif self.output_filename:
            self.pdf_filename = os.path.join(os.path.split(self.output_filename)[0], "heuristic_plot.pdf")
        else:
            self.pdf_filename = os.path.join(os.getcwd(), "heuristic_plot.pdf")

    else:
        self.pdf_filename = pdf_filename

    self.inference_method_list = [
        Inference.INCLUSION,
        Inference.EXCLUSION,
        Inference.PARSIMONY,
        Inference.PEPTIDE_CENTRIC,
    ]
    self.datastore_dict = {}
    self.selected_methods = None
    self.selected_datastores = {}

    self._validate_input()

    self._set_output_directory()

    self._log_append_alt_from_db()

determine_number_stdev_from_mean(false_discovery_rate)

This method calculates the mean number of proteins identified at a specific FDR across all 4 methods, and then, for each method, calculates the number of standard deviations from that mean.

Parameters:
  • false_discovery_rate (float) –

    The false discovery rate used as a cutoff for calculations.

Returns:
  • dict

    a dictionary of the number of standard deviations away from the mean per inference method.

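The computation reduces to a per-method z-score of FDR-passing protein counts. A minimal sketch with hypothetical method counts (the numbers below are illustrative, not real output):

```python
import statistics

# Hypothetical counts of proteins passing the FDR cutoff per inference method
number_passing = {
    "inclusion": 4000,
    "exclusion": 1500,
    "parsimony": 2400,
    "peptide_centric": 2500,
}

counts = list(number_passing.values())
mean = statistics.mean(counts)
stdev = statistics.stdev(counts)  # sample standard deviation, as in the source

# Number of standard deviations each method sits from the mean
stdevs_from_mean = {m: (n - mean) / stdev for m, n in number_passing.items()}
```

Methods that identify unusually many proteins (here, inclusion) get a large positive score; unusually few (exclusion) a large negative one.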
Source code in pyproteininference/heuristic.py
def determine_number_stdev_from_mean(self, false_discovery_rate):
    """
    This method calculates the mean number of proteins identified at a specific FDR across all
    4 methods, and then, for each method, calculates the number of standard deviations
    from that mean.

    Args:
        false_discovery_rate (float): The false discovery rate used as a cutoff for calculations.

    Returns:
        dict: a dictionary of the number of standard deviations away from the mean per inference method.

    """

    filtered_protein_objects = {
        x: self.datastore_dict[x].get_protein_objects(
            fdr_restricted=True, false_discovery_rate=false_discovery_rate
        )
        for x in self.datastore_dict.keys()
    }
    number_passing_proteins = {x: len(filtered_protein_objects[x]) for x in filtered_protein_objects.keys()}

    # Calculate how similar the number of passing proteins is for each method
    all_values = [x for x in number_passing_proteins.values()]
    mean = numpy.mean(all_values)
    standard_deviation = statistics.stdev(all_values)
    number_stdev_from_mean_dict = {}
    for key in number_passing_proteins.keys():
        cur_value = number_passing_proteins[key]
        number_stdev_from_mean_dict[key] = (cur_value - mean) / standard_deviation

    return number_stdev_from_mean_dict

determine_optimal_inference_method(false_discovery_rate_threshold=0.05, upper_empirical_threshold=1, lower_empirical_threshold=0.5, pdf_filename=None)

This method determines the optimal inference method from Inclusion, Exclusion, Parsimony, Peptide-Centric.

Parameters:
  • false_discovery_rate_threshold (float, default: 0.05 ) –

    The FDR threshold to use in the heuristic algorithm. This parameter determines the maximum FDR used when creating a range of finite FDR values.

  • upper_empirical_threshold (float, default: 1 ) –

    Upper Threshold used for parsimony/peptide centric cutoff for the heuristic algorithm.

  • lower_empirical_threshold (float, default: 0.5 ) –

    Lower Threshold used for inclusion/exclusion cutoff for the heuristic algorithm.

  • pdf_filename (str, default: None ) –

    Filename to write heuristic density plot to.

Returns:
  • list

    List of string representations of the recommended inference methods.

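The threshold logic can be condensed as follows; the heuristic scores are hypothetical, lower means closer to the mean, and the thresholds match the defaults above:

```python
# Hypothetical heuristic scores per inference method (lower is better)
scores = {"inclusion": 1.6, "exclusion": 2.2, "parsimony": 0.3, "peptide_centric": 0.7}
lower, upper = 0.5, 1.0  # lower_empirical_threshold, upper_empirical_threshold

if scores["parsimony"] <= lower or scores["peptide_centric"] <= lower:
    # Prefer parsimony/peptide-centric when either clears the stricter cutoff
    both = scores["parsimony"] <= lower and scores["peptide_centric"] <= lower
    selected = (["parsimony", "peptide_centric"] if both
                else [min(("parsimony", "peptide_centric"), key=scores.get)])
elif scores["inclusion"] <= upper or scores["exclusion"] <= upper:
    # Otherwise fall back to inclusion/exclusion against the looser cutoff
    both = scores["inclusion"] <= upper and scores["exclusion"] <= upper
    selected = (["inclusion", "exclusion"] if both
                else [min(("inclusion", "exclusion"), key=scores.get)])
else:
    # No method passes either threshold: take the best overall score
    selected = [min(scores, key=scores.get)]
```

With these scores only parsimony clears the lower threshold, so it alone is selected.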
Source code in pyproteininference/heuristic.py
def determine_optimal_inference_method(
    self,
    false_discovery_rate_threshold=0.05,
    upper_empirical_threshold=1,
    lower_empirical_threshold=0.5,
    pdf_filename=None,
):
    """
    This method determines the optimal inference method from Inclusion, Exclusion, Parsimony, Peptide-Centric.

    Args:
        false_discovery_rate_threshold (float): The FDR threshold to use in the heuristic algorithm.
            This parameter determines the maximum FDR used when creating a range of finite FDR values.
        upper_empirical_threshold (float): Upper Threshold used for parsimony/peptide centric cutoff for
            the heuristic algorithm.
        lower_empirical_threshold (float): Lower Threshold used for inclusion/exclusion cutoff for
            the heuristic algorithm.
        pdf_filename (str): Filename to write heuristic density plot to.


    Returns:
        list: List of string representations of the recommended inference methods.

    """

    # Get the number of passing proteins
    number_stdev_from_mean_dict = {}
    fdrs = [false_discovery_rate_threshold * 0.01 * x for x in range(100)]
    for fdr in fdrs:
        stdev_from_mean = self.determine_number_stdev_from_mean(false_discovery_rate=fdr)
        number_stdev_from_mean_dict[fdr] = stdev_from_mean

    stdev_collection = collections.defaultdict(list)
    for fdr in fdrs:
        for key in number_stdev_from_mean_dict[fdr]:
            stdev_collection[key].append(number_stdev_from_mean_dict[fdr][key])

    heuristic_scores = self.generate_density_plot(
        number_stdevs_from_mean=stdev_collection, pdf_filename=pdf_filename
    )

    # Apply conditional statement with lower and upper thresholds
    if (
        heuristic_scores[Inference.PARSIMONY] <= lower_empirical_threshold
        or heuristic_scores[Inference.PEPTIDE_CENTRIC] <= lower_empirical_threshold
    ):
        # If parsimony or peptide centric are less than the lower empirical threshold
        # Then select the best method of the two
        logger.info(
            "Either parsimony {} or peptide centric {} pass empirical threshold {}. "
            "Selecting the best method of the two.".format(
                heuristic_scores[Inference.PARSIMONY],
                heuristic_scores[Inference.PEPTIDE_CENTRIC],
                lower_empirical_threshold,
            )
        )
        sub_dict = {
            Inference.PARSIMONY: heuristic_scores[Inference.PARSIMONY],
            Inference.PEPTIDE_CENTRIC: heuristic_scores[Inference.PEPTIDE_CENTRIC],
        }

        if (
            heuristic_scores[Inference.PARSIMONY] <= lower_empirical_threshold
            and heuristic_scores[Inference.PEPTIDE_CENTRIC] <= lower_empirical_threshold
        ):
            # If both are under the threshold return both
            selected_methods = [Inference.PARSIMONY, Inference.PEPTIDE_CENTRIC]

        else:
            selected_methods = [min(sub_dict, key=sub_dict.get)]

    # If the above condition does not apply
    elif (
        heuristic_scores[Inference.EXCLUSION] <= upper_empirical_threshold
        or heuristic_scores[Inference.INCLUSION] <= upper_empirical_threshold
    ):
        # If exclusion or inclusion are less than the upper empirical threshold
        # Then select the best method of the two
        logger.info(
            "Either inclusion {} or exclusion {} pass empirical threshold {}. "
            "Selecting the best method of the two.".format(
                heuristic_scores[Inference.INCLUSION],
                heuristic_scores[Inference.EXCLUSION],
                upper_empirical_threshold,
            )
        )
        sub_dict = {
            Inference.EXCLUSION: heuristic_scores[Inference.EXCLUSION],
            Inference.INCLUSION: heuristic_scores[Inference.INCLUSION],
        }

        if (
            heuristic_scores[Inference.EXCLUSION] <= upper_empirical_threshold
            and heuristic_scores[Inference.INCLUSION] <= upper_empirical_threshold
        ):
            # If both are under the threshold return both
            selected_methods = [Inference.INCLUSION, Inference.EXCLUSION]

        else:
            selected_methods = [min(sub_dict, key=sub_dict.get)]

    else:
        # If we have no conditional scenarios...
        # Select the best method
        logger.info("No methods pass empirical thresholds, selecting the best method")
        selected_methods = [min(heuristic_scores, key=heuristic_scores.get)]

    logger.info("Method(s) {} selected with the heuristic algorithm".format(", ".join(selected_methods)))
    return selected_methods

execute(fdr_threshold=0.05)

This method is the main driver of the heuristic method. This method calls other classes and methods that make up the heuristic pipeline. This includes but is not limited to:

  1. Loops over the main inference methods: Inclusion, Exclusion, Parsimony, and Peptide Centric.
  2. Determines the optimal inference method based on the input data as well as the database file.
  3. Outputs the results and indicates the optimal results.

Parameters:
  • fdr_threshold (float, default: 0.05 ) –

    The Qvalue/FDR threshold the heuristic method uses to base calculations from.

Returns:
  • None

Example

heuristic = pyproteininference.heuristic.HeuristicPipeline(
    parameter_file=yaml_params,
    database_file=database,
    target_files=target,
    decoy_files=decoy,
    combined_files=combined_files,
    target_directory=target_directory,
    decoy_directory=decoy_directory,
    combined_directory=combined_directory,
    output_directory=dir_name,
    output_filename=output_filename,
    append_alt_from_db=append_alt,
    pdf_filename=pdf_filename,
    output_type="all"
)
heuristic.execute(fdr_threshold=0.05)

Source code in pyproteininference/heuristic.py
def execute(self, fdr_threshold=0.05):
    """
    This method is the main driver of the heuristic method.
    This method calls other classes and methods that make up the heuristic pipeline.
    This includes but is not limited to:

    1. Loops over the main inference methods: Inclusion, Exclusion, Parsimony, and Peptide Centric.
    2. Determines the optimal inference method based on the input data as well as the database file.
    3. Outputs the results and indicates the optimal results.

    Args:
        fdr_threshold (float): The Qvalue/FDR threshold the heuristic method uses to base calculations from.

    Returns:
        None:

    Example:
        >>> heuristic = pyproteininference.heuristic.HeuristicPipeline(
        >>>     parameter_file=yaml_params,
        >>>     database_file=database,
        >>>     target_files=target,
        >>>     decoy_files=decoy,
        >>>     combined_files=combined_files,
        >>>     target_directory=target_directory,
        >>>     decoy_directory=decoy_directory,
        >>>     combined_directory=combined_directory,
        >>>     output_directory=dir_name,
        >>>     output_filename=output_filename,
        >>>     append_alt_from_db=append_alt,
        >>>     pdf_filename=pdf_filename,
        >>>     output_type="all"
        >>> )
        >>> heuristic.execute(fdr_threshold=0.05)

    """

    pyproteininference_parameters = pyproteininference.parameters.ProteinInferenceParameter(
        yaml_param_filepath=self.parameter_file
    )

    digest = pyproteininference.in_silico_digest.PyteomicsDigest(
        database_path=self.database_file,
        digest_type=pyproteininference_parameters.digest_type,
        missed_cleavages=pyproteininference_parameters.missed_cleavages,
        reviewed_identifier_symbol=pyproteininference_parameters.reviewed_identifier_symbol,
        max_peptide_length=pyproteininference_parameters.restrict_peptide_length,
        id_splitting=self.id_splitting,
    )
    if self.database_file:
        logger.info("Running In Silico Database Digest on file {}".format(self.database_file))
        digest.digest_fasta_database()
    else:
        logger.warning(
            "No Database File provided, Skipping database digest and only taking protein-peptide mapping from the "
            "input files."
        )

    for inference_method in self.inference_method_list:

        method_specific_parameters = copy.deepcopy(pyproteininference_parameters)

        logger.info("Overriding inference type {}".format(method_specific_parameters.inference_type))

        method_specific_parameters.inference_type = inference_method

        logger.info("New inference type {}".format(method_specific_parameters.inference_type))
        logger.info("FDR Threshold Set to {}".format(method_specific_parameters.fdr))

        reader = pyproteininference.reader.GenericReader(
            target_file=self.target_files,
            decoy_file=self.decoy_files,
            combined_files=self.combined_files,
            parameter_file_object=method_specific_parameters,
            digest=digest,
            append_alt_from_db=self.append_alt_from_db,
        )
        reader.read_psms()

        data = pyproteininference.datastore.DataStore(reader=reader, digest=digest)

        data.restrict_psm_data()

        data.recover_mapping()

        data.create_scoring_input()

        if method_specific_parameters.inference_type == Inference.EXCLUSION:
            data.exclude_non_distinguishing_peptides()

        score = pyproteininference.scoring.Score(data=data)
        score.score_psms(score_method=method_specific_parameters.protein_score)

        if method_specific_parameters.picker:
            data.protein_picker()
        else:
            pass

        pyproteininference.inference.Inference.run_inference(data=data, digest=digest)

        data.calculate_q_values()

        self.datastore_dict[inference_method] = data

    self.selected_methods = self.determine_optimal_inference_method(
        false_discovery_rate_threshold=fdr_threshold, pdf_filename=self.pdf_filename
    )
    self.selected_datastores = {x: self.datastore_dict[x] for x in self.selected_methods}

    if self.output_type == "all":
        self._write_all_results(parameters=method_specific_parameters)
    elif self.output_type == "optimal":
        self._write_optimal_results(parameters=method_specific_parameters)
    else:
        self._write_optimal_results(parameters=method_specific_parameters)

generate_density_plot(number_stdevs_from_mean, pdf_filename=None)

This method produces a PDF density plot overlaying the 4 inference methods that are part of the heuristic algorithm.

Parameters:
  • number_stdevs_from_mean (dict) –

    a dictionary of the number of standard deviations from the mean per inference method for a range of FDRs.

  • pdf_filename (str, default: None ) –

    Filename to write heuristic density plot to.

Returns:
  • dict

    a dictionary of heuristic scores per inference method which correlates to the maximum point of the density plot per inference method.

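The heuristic score per method is the absolute x-position of the tallest histogram bin of its standard-deviation values. A numpy-only sketch with synthetic data (no plotting; the source reads the bin position off `plt.hist`, while bin centers are used here for clarity):

```python
import numpy as np

# Synthetic per-FDR standard-deviation values for one method, centred near 0.4
values = np.random.default_rng(0).normal(loc=0.4, scale=0.2, size=100)

# Density histogram, mirroring the 40-bin density plot in the source
counts, edges = np.histogram(values, bins=40, density=True)
bin_centers = (edges[:-1] + edges[1:]) / 2

# Heuristic score: absolute position of the densest bin (0 = always at the mean)
heuristic_score = abs(bin_centers[int(np.argmax(counts))])
```

A method whose standard-deviation distribution peaks near zero tracks the consensus of the four methods across the FDR range, so a smaller score is better.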
Source code in pyproteininference/heuristic.py
def generate_density_plot(self, number_stdevs_from_mean, pdf_filename=None):
    """
    This method produces a PDF density plot overlaying the 4 inference methods that are part of
    the heuristic algorithm.

    Args:
        number_stdevs_from_mean (dict): a dictionary of the number of standard deviations from the mean per
            inference method for a range of FDRs.
        pdf_filename (str): Filename to write heuristic density plot to.

    Returns:
        dict: a dictionary of heuristic scores per inference method which correlates to the
            maximum point of the density plot per inference method.

    """
    f = plt.figure()

    heuristic_scores = {}
    for method in number_stdevs_from_mean:
        readible_method_name = Inference.INFERENCE_NAME_MAP[method]
        kwargs = dict(histtype='stepfilled', alpha=0.3, density=True, bins=40, ec="k", label=readible_method_name)
        x, y, _ = plt.hist(number_stdevs_from_mean[method], **kwargs)
        center = y[list(x).index(max(x))]
        heuristic_scores[method] = abs(center)

    plt.axvline(0, color="black", linestyle='--', alpha=0.75)
    plt.title("Density Plot of the Number of Standard Deviations from the Mean")
    plt.xlabel('Number of Standard Deviations from the Mean')
    plt.ylabel('Number of Observations')
    plt.legend(loc='upper right')
    if pdf_filename:
        logger.info("Writing Heuristic Density plot to: {}".format(pdf_filename))
        f.savefig(pdf_filename)
    else:
        plt.show()
    plt.close()

    logger.info("Heuristic Scores")
    logger.info(heuristic_scores)

    return heuristic_scores

generate_roc_plot(fdr_max=0.2, pdf_filename=None)

This method produces a PDF ROC plot overlaying the 4 inference methods that are part of the heuristic algorithm.

Parameters:
  • fdr_max (float, default: 0.2 ) –

    Max FDR to display on the plot.

  • pdf_filename (str, default: None ) –

    Filename to write roc plot to.

Returns:
  • None

Source code in pyproteininference/heuristic.py
def generate_roc_plot(self, fdr_max=0.2, pdf_filename=None):
    """
    This method produces a PDF ROC plot overlaying the 4 inference methods that are part of
    the heuristic algorithm.

    Args:
        fdr_max (float): Max FDR to display on the plot.
        pdf_filename (str): Filename to write roc plot to.

    Returns:
        None:

    """
    f = plt.figure()
    for inference_method in self.datastore_dict.keys():
        fdr_vs_target_hits = self.datastore_dict[inference_method].generate_fdr_vs_target_hits(fdr_max=fdr_max)
        fdrs = [x[0] for x in fdr_vs_target_hits]
        target_hits = [x[1] for x in fdr_vs_target_hits]
        plt.plot(fdrs, target_hits, '-', label=inference_method.replace("_", " "))
        target_fdr = self.datastore_dict[inference_method].parameter_file_object.fdr
        if inference_method in self.selected_methods:
            best_value = min(fdrs, key=lambda x: abs(x - target_fdr))
            best_index = fdrs.index(best_value)
            best_target_hit_value = target_hits[best_index]  # noqa F841

    plt.axvline(target_fdr, color="black", linestyle='--', alpha=0.75, label="Target FDR")
    plt.legend()
    plt.xlabel('Decoy FDR')
    plt.ylabel('Target Protein Hits')
    plt.xlim([-0.01, fdr_max])
    plt.legend(loc='lower right')
    plt.title("FDR vs Target Protein Hits per Inference Method")
    if pdf_filename:
        logger.info("Writing ROC plot to: {}".format(pdf_filename))
        f.savefig(pdf_filename)
    plt.close()

LogElementHandler

Bases: Handler

A logging handler that emits messages to a log element.

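The same handler pattern works with any object exposing a push() method; ListElement below is a hypothetical stand-in for the ui.log element, so the handler can be exercised without a GUI:

```python
import logging

class ListElement:
    """Hypothetical stand-in for a GUI log element with a push() method."""

    def __init__(self):
        self.lines = []

    def push(self, msg):
        self.lines.append(msg)

class ListElementHandler(logging.Handler):
    """Same shape as LogElementHandler: format each record and push it to the element."""

    def __init__(self, element, level=logging.NOTSET):
        self.element = element
        super().__init__(level)

    def emit(self, record):
        try:
            self.element.push(self.format(record))
        except Exception:
            self.handleError(record)

element = ListElement()
demo_logger = logging.getLogger("demo")
demo_logger.setLevel(logging.INFO)
demo_logger.addHandler(ListElementHandler(element))
demo_logger.info("pipeline started")
```

With no formatter attached, logging's default formatting yields just the message, so `element.lines` ends up holding `["pipeline started"]`.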
Source code in pyproteininference/gui/gui.py
class LogElementHandler(logging.Handler):
    """A logging handler that emits messages to a log element."""

    def __init__(self, element: ui.log, level: int = logging.NOTSET) -> None:
        self.element = element
        super().__init__(level)

    def emit(self, record: logging.LogRecord) -> None:
        try:
            msg = self.format(record)
            self.element.push(msg)
        except Exception:
            self.handleError(record)

run_inference_analysis_async(q, config)

Run the protein inference analysis, updating the progress bar through the queue.

Source code in pyproteininference/gui/gui.py
def run_inference_analysis_async(q: Queue, config) -> str:
    """Run some heavy computation that updates the progress bar through the queue."""
    pipeline = pyproteininference.pipeline.ProteinInferencePipeline.create_from_gui_config(q, config)
    pipeline.execute()
    return "Protein Inference Analysis Completed"