Convert a tabular peptides dataset into its corresponding proteins

In this example, we:

  1. load a tabular dataset of peptide intensities,

  2. aggregate it to protein level, either as the number of peptides per protein or as the sum of peptide intensities,

  3. export the results as Pandas data frames and join them for further processing.

First, load the required libraries.

[1]:
import pandas as pd
import omicspylib as opl
from omicspylib import PeptidesDataset
print(f'omicspylib version: {opl.__version__}')
omicspylib version: 0.0.7

Then prepare your data as a Pandas data frame. You need to specify the column containing the peptide identifier (peptide_id in this example), the column containing the protein identifier used for the group-by operation (protein_id in this example), and the column names of all experimental conditions, as shown below.

You are expected to perform any cleaning required for your use case beforehand (e.g. removal of reverse hits, contaminants, modified peptides, or peptides shared across proteins).
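
As a rough illustration of such cleaning, the sketch below filters a toy table with plain pandas. The is_reverse and is_contaminant flag columns are hypothetical: they are not part of the example dataset, and search engines name such columns differently. Shared peptides are detected here under the assumption that a peptide mapping to several proteins appears on more than one row.

```python
import pandas as pd

# Toy peptide table; 'is_reverse' and 'is_contaminant' are hypothetical flag columns.
raw_df = pd.DataFrame({
    'peptide_id': ['pept1', 'pept2', 'pept3', 'pept4', 'pept4'],
    'protein_id': ['prot0', 'prot0', 'prot1', 'prot1', 'prot2'],
    'is_reverse': [False, True, False, False, False],
    'is_contaminant': [False, False, True, False, False],
    'c1_rep1': [100.0, 50.0, 75.0, 20.0, 20.0],
})

# Drop reverse hits and contaminants.
keep = ~(raw_df['is_reverse'] | raw_df['is_contaminant'])
clean_df = raw_df[keep]

# Drop peptides shared across proteins (assumed: one row per peptide-protein pair).
n_proteins = clean_df.groupby('peptide_id')['protein_id'].transform('nunique')
clean_df = clean_df[n_proteins == 1].drop(columns=['is_reverse', 'is_contaminant'])
```

In this toy table only pept1 survives: pept2 is a reverse hit, pept3 is a contaminant, and pept4 maps to two proteins.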

[2]:
data_df = pd.read_csv('data/peptides_dataset.tsv', sep='\t')

config = {
    'id_col': 'peptide_id',
    'conditions': {
        'c1': ['c1_rep1', 'c1_rep2', 'c1_rep3', 'c1_rep4', 'c1_rep5'],
        'c2': ['c2_rep1', 'c2_rep2', 'c2_rep3', 'c2_rep4', 'c2_rep5'],
        'c3': ['c3_rep1', 'c3_rep2', 'c3_rep3', 'c3_rep4', 'c3_rep5'],
    },
    'protein_id_col': 'protein_id',
}
data_df.head(3)
[2]:
peptide_id protein_id c1_rep1 c1_rep2 c1_rep3 c1_rep4 c1_rep5 c2_rep1 c2_rep2 c2_rep3 c2_rep4 c2_rep5 c3_rep1 c3_rep2 c3_rep3 c3_rep4 c3_rep5
0 pept147 prot0 1740.912460 0.000000 1393.260017 4685.874636 513.393605 502.109101 949.462139 0.000000 3006.548317 671.891115 4123.628101 11583.385623 3114.882410 2812.034141 2195.550530
1 pept424 prot0 3668.876134 0.000000 0.000000 303.011791 1314.382432 404.828763 3723.604607 11838.405382 7586.141805 0.000000 336.363330 0.000000 200.425728 3891.630707 1395.146624
2 pept631 prot0 0.000000 3138.459061 3409.906069 1712.639948 987.488051 0.000000 8197.162348 0.000000 2067.977126 1111.872036 9229.125064 0.000000 19303.065270 2427.103374 491.195810
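
Before building the dataset, it can help to confirm that every column named in the configuration exists in the data frame. The following is a minimal sketch of such a check; missing_columns is a hypothetical helper written for this example, not part of omicspylib, and it only assumes the config layout shown above.

```python
import pandas as pd

def missing_columns(df, config):
    """Return the column names referenced in config that are absent from df."""
    expected = [config['id_col'], config['protein_id_col']]
    for replicate_cols in config['conditions'].values():
        expected.extend(replicate_cols)
    return [col for col in expected if col not in df.columns]

# Toy example with one replicate column deliberately left out of the data.
toy_df = pd.DataFrame({
    'peptide_id': ['pept1'],
    'protein_id': ['prot0'],
    'c1_rep1': [1.0],
})
toy_config = {
    'id_col': 'peptide_id',
    'conditions': {'c1': ['c1_rep1', 'c1_rep2']},
    'protein_id_col': 'protein_id',
}
print(missing_columns(toy_df, toy_config))  # prints ['c1_rep2']
```

An empty list means the configuration and the table agree on column names.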

Next, create the PeptidesDataset object, which wraps the specified experimental conditions and abstracts related operations. For example, you could perform normalization at the peptide level before calculating protein abundance values.

[3]:
peptides_dataset = PeptidesDataset.from_df(data_df, **config)

Use the to_proteins method to aggregate the peptides dataset into a proteins dataset. This method returns a ProteinsDataset that you can keep using, e.g. to perform a pairwise comparison between two experimental conditions.

In this example, we calculate either the sum of peptide intensities, by passing sum to the agg_method argument, or the number of peptides, by passing counts. Optionally, you can rename the columns on the fly by adding a prefix, which avoids name conflicts later, when you join the data back into one table.

[4]:
# create ProteinsDataset objects from the PeptidesDataset using different aggregation methods.
proteins_dataset_int = peptides_dataset.to_proteins(agg_method='sum', add_prefix='intensity_')
proteins_dataset_pept_counts = peptides_dataset.to_proteins(agg_method='counts', add_prefix='n_peptides_')

# extract data as Pandas data frames
prot_int = proteins_dataset_int.to_table()
prot_counts = proteins_dataset_pept_counts.to_table()

# merge to one table for further processing
proteins_dataset = prot_counts.merge(prot_int, on='protein_id', how='left')
proteins_dataset.head(3)
[4]:
n_peptides_c1_rep1 n_peptides_c1_rep2 n_peptides_c1_rep3 n_peptides_c1_rep4 n_peptides_c1_rep5 n_peptides_c2_rep1 n_peptides_c2_rep2 n_peptides_c2_rep3 n_peptides_c2_rep4 n_peptides_c2_rep5 ... intensity_c2_rep1 intensity_c2_rep2 intensity_c2_rep3 intensity_c2_rep4 intensity_c2_rep5 intensity_c3_rep1 intensity_c3_rep2 intensity_c3_rep3 intensity_c3_rep4 intensity_c3_rep5
protein_id
prot0 7 5 8 8 9 8 8 7 7 8 ... 10905.868929 29978.048873 39279.995012 16447.245791 15476.546193 22106.072629 18638.301516 41024.311866 69277.036001 10825.451618
prot1 9 7 8 7 5 4 7 9 9 7 ... 6798.912758 17346.823885 14392.470307 22301.020074 34467.525125 14886.915695 35368.129215 5180.276570 13585.279883 36810.953282
prot10 10 9 10 6 7 8 7 6 7 7 ... 50901.080878 16509.573663 5293.420927 31092.895805 13232.936859 9335.053026 11066.451746 5719.300722 18697.848325 8109.855595

3 rows × 30 columns
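
For intuition, the two aggregations above can be approximated with a plain pandas group-by on a toy table. This is a sketch, not omicspylib's implementation. In particular, it assumes that counts means the number of peptides with a non-zero intensity in each sample, which is what the varying per-sample counts in the output suggest but is not confirmed here.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'peptide_id': ['pept1', 'pept2', 'pept3'],
    'protein_id': ['prot0', 'prot0', 'prot1'],
    'c1_rep1': [10.0, 0.0, 5.0],
    'c1_rep2': [20.0, 30.0, 0.0],
})
sample_cols = ['c1_rep1', 'c1_rep2']

# Sum of peptide intensities per protein and sample.
intensity = toy_df.groupby('protein_id')[sample_cols].sum().add_prefix('intensity_')

# Peptides with a non-zero intensity per protein and sample (assumed meaning of 'counts').
quantified = toy_df[sample_cols].gt(0)
n_peptides = quantified.groupby(toy_df['protein_id']).sum().add_prefix('n_peptides_')

# Both tables are indexed by protein_id, so they can be joined on the index.
combined = n_peptides.join(intensity, how='left')
```

Here prot0 ends up with one quantified peptide in c1_rep1 and two in c1_rep2, with summed intensities of 10.0 and 50.0. The distinct prefixes keep the count and intensity columns apart after the join.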