Convert a tabular peptides dataset into its corresponding proteins
In this example, we:
- load a tabular dataset of peptide intensities,
- aggregate it to the protein level, either as peptide counts or as the sum of peptide intensities, and
- export the results as Pandas data frames and join them for further processing.
First, load the required libraries.
[1]:
import pandas as pd
import omicspylib as opl
from omicspylib import PeptidesDataset
print(f'omicspylib version: {opl.__version__}')
omicspylib version: 0.0.7
Then prepare your data as a Pandas data frame. You need to specify the column containing the peptide identifier (peptide_id in this example), the column containing the protein identifier used for the group-by operation (protein_id in this example), and the column names for all experimental conditions, as shown below.
You are expected to perform any cleaning required for your use case beforehand (e.g. removal of reverse hits, contaminants, modified peptides, or peptides shared across proteins).
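The cleaning step above depends on how your search engine labels its output, so the following is only a minimal pandas sketch on a toy table. The `is_reverse` and `is_contaminant` flag columns are hypothetical names chosen for illustration; substitute whatever your own data provides.

```python
import pandas as pd

# Toy peptide table. The `is_reverse` and `is_contaminant` columns are
# hypothetical; real search-engine output may name or encode them differently.
raw_df = pd.DataFrame({
    'peptide_id': ['pept1', 'pept2', 'pept3', 'pept3', 'pept4'],
    'protein_id': ['prot0', 'prot0', 'prot0', 'prot1', 'prot1'],
    'is_reverse': [False, True, False, False, False],
    'is_contaminant': [False, False, False, False, False],
    'c1_rep1': [100.0, 50.0, 30.0, 30.0, 80.0],
})

# Drop reverse hits and contaminants.
clean_df = raw_df[~raw_df['is_reverse'] & ~raw_df['is_contaminant']]

# Drop peptides that map to more than one protein (shared peptides).
n_proteins = clean_df.groupby('peptide_id')['protein_id'].transform('nunique')
clean_df = clean_df[n_proteins == 1].reset_index(drop=True)

print(clean_df[['peptide_id', 'protein_id']])
```

Here `pept2` is removed as a reverse hit and `pept3` because it maps to both `prot0` and `prot1`, leaving one unambiguous peptide per protein.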
[2]:
data_df = pd.read_csv('data/peptides_dataset.tsv', sep='\t')
config = {
    'id_col': 'peptide_id',
    'conditions': {
        'c1': ['c1_rep1', 'c1_rep2', 'c1_rep3', 'c1_rep4', 'c1_rep5'],
        'c2': ['c2_rep1', 'c2_rep2', 'c2_rep3', 'c2_rep4', 'c2_rep5'],
        'c3': ['c3_rep1', 'c3_rep2', 'c3_rep3', 'c3_rep4', 'c3_rep5'],
    },
    'protein_id_col': 'protein_id',
}
data_df.head(3)
[2]:
| | peptide_id | protein_id | c1_rep1 | c1_rep2 | c1_rep3 | c1_rep4 | c1_rep5 | c2_rep1 | c2_rep2 | c2_rep3 | c2_rep4 | c2_rep5 | c3_rep1 | c3_rep2 | c3_rep3 | c3_rep4 | c3_rep5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | pept147 | prot0 | 1740.912460 | 0.000000 | 1393.260017 | 4685.874636 | 513.393605 | 502.109101 | 949.462139 | 0.000000 | 3006.548317 | 671.891115 | 4123.628101 | 11583.385623 | 3114.882410 | 2812.034141 | 2195.550530 |
| 1 | pept424 | prot0 | 3668.876134 | 0.000000 | 0.000000 | 303.011791 | 1314.382432 | 404.828763 | 3723.604607 | 11838.405382 | 7586.141805 | 0.000000 | 336.363330 | 0.000000 | 200.425728 | 3891.630707 | 1395.146624 |
| 2 | pept631 | prot0 | 0.000000 | 3138.459061 | 3409.906069 | 1712.639948 | 987.488051 | 0.000000 | 8197.162348 | 0.000000 | 2067.977126 | 1111.872036 | 9229.125064 | 0.000000 | 19303.065270 | 2427.103374 | 491.195810 |
Next, create the PeptidesDataset object, which wraps the specified experimental conditions and abstracts related operations. For example, you could perform normalization at the peptide level before calculating protein abundance values.
[3]:
peptides_dataset = PeptidesDataset.from_df(data_df, **config)
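To make the idea of peptide-level normalization concrete, here is a minimal pandas sketch of one common choice, median normalization, applied to a toy table. It does not use the omicspylib API, and both the method and the choice to treat zeros as missing values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy peptide intensities; the second sample is systematically twice as intense.
toy_df = pd.DataFrame({
    'peptide_id': ['pept1', 'pept2', 'pept3'],
    'c1_rep1': [100.0, 200.0, 300.0],
    'c1_rep2': [200.0, 400.0, 600.0],
})
sample_cols = ['c1_rep1', 'c1_rep2']

# Per-sample median of quantified peptides (assumption: zero means missing).
medians = toy_df[sample_cols].replace(0, np.nan).median()

# Scale every sample so that its median matches the mean of all sample medians.
target = medians.mean()
normalized_df = toy_df.copy()
normalized_df[sample_cols] = toy_df[sample_cols] * (target / medians)

print(normalized_df)
```

After scaling, both samples share the same median, so differences that remain between them are less likely to reflect loading or instrument effects.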
Use the to_proteins method to aggregate the peptides dataset into a proteins dataset. This method returns a ProteinsDataset that you can keep using, e.g. to perform a pairwise comparison between two experimental conditions.
In this example, we calculate the sum of peptide intensities by passing 'sum' to the agg_method argument, and the number of peptides by passing 'counts'. Optionally, you can rename the columns on the fly by adding a prefix, which avoids name conflicts when you later join the data back into one table.
[4]:
# create ProteinsDataset objects from the PeptidesDataset using different aggregation methods.
proteins_dataset_int = peptides_dataset.to_proteins(agg_method='sum', add_prefix='intensity_')
proteins_dataset_pept_counts = peptides_dataset.to_proteins(agg_method='counts', add_prefix='n_peptides_')
# extract data as Pandas data frames
prot_int = proteins_dataset_int.to_table()
prot_counts = proteins_dataset_pept_counts.to_table()
# merge to one table for further processing
proteins_dataset = prot_counts.merge(prot_int, on='protein_id', how='left')
proteins_dataset.head(3)
[4]:
| protein_id | n_peptides_c1_rep1 | n_peptides_c1_rep2 | n_peptides_c1_rep3 | n_peptides_c1_rep4 | n_peptides_c1_rep5 | n_peptides_c2_rep1 | n_peptides_c2_rep2 | n_peptides_c2_rep3 | n_peptides_c2_rep4 | n_peptides_c2_rep5 | ... | intensity_c2_rep1 | intensity_c2_rep2 | intensity_c2_rep3 | intensity_c2_rep4 | intensity_c2_rep5 | intensity_c3_rep1 | intensity_c3_rep2 | intensity_c3_rep3 | intensity_c3_rep4 | intensity_c3_rep5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| prot0 | 7 | 5 | 8 | 8 | 9 | 8 | 8 | 7 | 7 | 8 | ... | 10905.868929 | 29978.048873 | 39279.995012 | 16447.245791 | 15476.546193 | 22106.072629 | 18638.301516 | 41024.311866 | 69277.036001 | 10825.451618 |
| prot1 | 9 | 7 | 8 | 7 | 5 | 4 | 7 | 9 | 9 | 7 | ... | 6798.912758 | 17346.823885 | 14392.470307 | 22301.020074 | 34467.525125 | 14886.915695 | 35368.129215 | 5180.276570 | 13585.279883 | 36810.953282 |
| prot10 | 10 | 9 | 10 | 6 | 7 | 8 | 7 | 6 | 7 | 7 | ... | 50901.080878 | 16509.573663 | 5293.420927 | 31092.895805 | 13232.936859 | 9335.053026 | 11066.451746 | 5719.300722 | 18697.848325 | 8109.855595 |
3 rows × 30 columns
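For intuition, the two aggregation modes used above can be approximated with a plain pandas group-by on a toy table. This is a sketch, not the library's implementation. In particular, counting only peptides with a non-zero intensity is an assumption, and omicspylib may handle zeros or missing values differently.

```python
import pandas as pd

# Toy peptide table with two proteins and two samples.
toy_df = pd.DataFrame({
    'peptide_id': ['pept1', 'pept2', 'pept3', 'pept4'],
    'protein_id': ['prot0', 'prot0', 'prot1', 'prot1'],
    'c1_rep1': [100.0, 0.0, 30.0, 80.0],
    'c1_rep2': [50.0, 25.0, 0.0, 0.0],
})
sample_cols = ['c1_rep1', 'c1_rep2']

# Sum of peptide intensities per protein and sample.
intensity = (
    toy_df.groupby('protein_id')[sample_cols].sum().add_prefix('intensity_')
)

# Number of peptides per protein and sample.
# Assumption: a peptide counts only if its intensity is non-zero.
n_peptides = (
    (toy_df[sample_cols] > 0)
    .groupby(toy_df['protein_id'])
    .sum()
    .add_prefix('n_peptides_')
)

# Both tables are indexed by protein_id, so they can be joined directly.
combined = n_peptides.join(intensity)
print(combined)
```

The prefixes play the same role as the add_prefix argument above: without them, both tables would carry identical column names and the join would fail or require suffixes.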