Convert a tabular peptides dataset into its corresponding proteins

In this example, we:

load a tabular dataset of peptide intensities,
convert it into the corresponding proteins, either as peptide counts or as the sum of peptide intensities, and
export the data into Pandas data frames and join them for further processing.

First, load the required libraries.

[1]:

import pandas as pd
import omicspylib as opl
from omicspylib import PeptidesDataset
print(f'omicspylib version: {opl.__version__}')

omicspylib version: 0.0.7

Then prepare your data as a Pandas data frame. You need to specify the name of the column containing the peptide identifier (peptide_id in this example), the name of the column containing the protein identifier, which is required for the group-by operation (protein_id in this example), and the column names of each experimental condition, as in the configuration below.

It is expected that you perform any cleaning required for your use case beforehand (e.g. removal of reverse hits, contaminants, modified peptides, or peptides shared across proteins).

[2]:

data_df = pd.read_csv('data/peptides_dataset.tsv', sep='\t')

config = {
    'id_col': 'peptide_id',
    'conditions': {
        'c1': ['c1_rep1', 'c1_rep2', 'c1_rep3', 'c1_rep4', 'c1_rep5'],
        'c2': ['c2_rep1', 'c2_rep2', 'c2_rep3', 'c2_rep4', 'c2_rep5'],
        'c3': ['c3_rep1', 'c3_rep2', 'c3_rep3', 'c3_rep4', 'c3_rep5'],
    },
    'protein_id_col': 'protein_id',
}
data_df.head(3)

[2]:

	peptide_id	protein_id	c1_rep1	c1_rep2	c1_rep3	c1_rep4	c1_rep5	c2_rep1	c2_rep2	c2_rep3	c2_rep4	c2_rep5	c3_rep1	c3_rep2	c3_rep3	c3_rep4	c3_rep5
0	pept147	prot0	1740.912460	0.000000	1393.260017	4685.874636	513.393605	502.109101	949.462139	0.000000	3006.548317	671.891115	4123.628101	11583.385623	3114.882410	2812.034141	2195.550530
1	pept424	prot0	3668.876134	0.000000	0.000000	303.011791	1314.382432	404.828763	3723.604607	11838.405382	7586.141805	0.000000	336.363330	0.000000	200.425728	3891.630707	1395.146624
2	pept631	prot0	0.000000	3138.459061	3409.906069	1712.639948	987.488051	0.000000	8197.162348	0.000000	2067.977126	1111.872036	9229.125064	0.000000	19303.065270	2427.103374	491.195810
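
The cleaning step mentioned above depends on how your search engine annotates its output, so the following is only a minimal sketch in plain Pandas. The flag columns is_reverse and is_contaminant, and the use of ';' to separate the proteins of a shared peptide, are hypothetical and are not part of the example dataset.

```python
import pandas as pd

# Toy peptide table. The flag columns `is_reverse` and `is_contaminant`
# are hypothetical; adapt the names to your own search engine output.
raw_df = pd.DataFrame({
    'peptide_id': ['pept1', 'pept2', 'pept3', 'pept4', 'pept5'],
    'protein_id': ['prot0', 'prot0', 'prot1', 'prot1;prot2', 'prot2'],
    'is_reverse': [False, False, True, False, False],
    'is_contaminant': [False, True, False, False, False],
    'c1_rep1': [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Keep only forward hits that are not flagged as contaminants.
keep = ~raw_df['is_reverse'] & ~raw_df['is_contaminant']

# Drop peptides shared across proteins (assumed here to be ';'-separated).
keep &= ~raw_df['protein_id'].str.contains(';', regex=False)

clean_df = (
    raw_df.loc[keep]
    .drop(columns=['is_reverse', 'is_contaminant'])
    .reset_index(drop=True)
)
print(clean_df)
```

A table cleaned this way can then be passed on in place of data_df.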

Next, create the PeptidesDataset object, which wraps the specified experimental conditions and abstracts related operations. For example, you could perform normalization at the peptide level prior to calculating protein abundance values.

[3]:

peptides_dataset = PeptidesDataset.from_df(data_df, **config)
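
To illustrate what peptide-level normalization can mean, here is a rough sketch of median scaling done directly on a data frame with Pandas, before the data is wrapped in a PeptidesDataset. It does not use the omicspylib API, and it assumes that zero intensities denote missing values, which may not hold for your data. The toy column names s1 and s2 are made up for the illustration.

```python
import numpy as np
import pandas as pd

# Toy peptide table with two samples (hypothetical column names).
toy = pd.DataFrame({
    'peptide_id': ['p1', 'p2', 'p3'],
    'protein_id': ['prot0', 'prot0', 'prot1'],
    's1': [100.0, 200.0, 300.0],
    's2': [200.0, 0.0, 600.0],
})
sample_cols = ['s1', 's2']

# Assumption: zeros mean "not quantified", so exclude them from the medians.
observed = toy[sample_cols].replace(0.0, np.nan)
medians = observed.median(axis=0)

# Scale every sample so that its median matches the mean of all medians.
factors = medians.mean() / medians

normalized = toy.copy()
normalized[sample_cols] = toy[sample_cols] * factors
print(normalized)
```

After scaling, both samples share the same median over their quantified peptides, while zeros stay at zero.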

Use the to_proteins method to aggregate the peptides dataset into a proteins dataset. This method returns a ProteinsDataset that you can keep using, e.g. to perform a pairwise comparison between two experimental conditions.

In this example, we calculate the sum of peptide intensities by passing the value sum to the agg_method argument, and the number of peptides by passing the value counts. Optionally, you can rename the columns on the fly by adding a prefix, so that you avoid name conflicts later, when you join the data back into one table.

[4]:

# create ProteinsDataset objects from the PeptidesDataset using different aggregation methods.
proteins_dataset_int = peptides_dataset.to_proteins(agg_method='sum', add_prefix='intensity_')
proteins_dataset_pept_counts = peptides_dataset.to_proteins(agg_method='counts', add_prefix='n_peptides_')

# extract data as Pandas data frames
prot_int = proteins_dataset_int.to_table()
prot_counts = proteins_dataset_pept_counts.to_table()

# merge to one table for further processing
proteins_dataset = prot_counts.merge(prot_int, on='protein_id', how='left')
proteins_dataset.head(3)

[4]:

	n_peptides_c1_rep1	n_peptides_c1_rep2	n_peptides_c1_rep3	n_peptides_c1_rep4	n_peptides_c1_rep5	n_peptides_c2_rep1	n_peptides_c2_rep2	n_peptides_c2_rep3	n_peptides_c2_rep4	n_peptides_c2_rep5	...	intensity_c2_rep1	intensity_c2_rep2	intensity_c2_rep3	intensity_c2_rep4	intensity_c2_rep5	intensity_c3_rep1	intensity_c3_rep2	intensity_c3_rep3	intensity_c3_rep4	intensity_c3_rep5
protein_id
prot0	7	5	8	8	9	8	8	7	7	8	...	10905.868929	29978.048873	39279.995012	16447.245791	15476.546193	22106.072629	18638.301516	41024.311866	69277.036001	10825.451618
prot1	9	7	8	7	5	4	7	9	9	7	...	6798.912758	17346.823885	14392.470307	22301.020074	34467.525125	14886.915695	35368.129215	5180.276570	13585.279883	36810.953282
prot10	10	9	10	6	7	8	7	6	7	7	...	50901.080878	16509.573663	5293.420927	31092.895805	13232.936859	9335.053026	11066.451746	5719.300722	18697.848325	8109.855595

3 rows × 30 columns
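
To make the two aggregation modes more concrete, the sketch below reproduces the same idea with a plain Pandas group-by on a toy table. It is not the library's implementation. In particular, it assumes that the counts mode counts peptides with a non-zero intensity in each sample, which is consistent with the per-replicate counts shown above but is not confirmed here.

```python
import pandas as pd

# Toy peptide table; zeros stand for peptides not quantified in a sample.
toy = pd.DataFrame({
    'peptide_id': ['p1', 'p2', 'p3', 'p4'],
    'protein_id': ['prot0', 'prot0', 'prot0', 'prot1'],
    'c1_rep1': [10.0, 0.0, 5.0, 7.0],
    'c1_rep2': [0.0, 0.0, 2.0, 3.0],
})
sample_cols = ['c1_rep1', 'c1_rep2']

# Sum of peptide intensities per protein and sample.
intensity = (
    toy.groupby('protein_id')[sample_cols]
    .sum()
    .add_prefix('intensity_')
)

# Number of peptides per protein and sample
# (assumption: only non-zero intensities are counted).
n_peptides = (
    toy[sample_cols].gt(0)
    .groupby(toy['protein_id'])
    .sum()
    .add_prefix('n_peptides_')
)

# Both tables are indexed by protein_id, so an index join is enough.
combined = n_peptides.join(intensity)
print(combined)
```

The prefixes play the same role as the add_prefix argument above: they keep the column names distinct once the two tables are joined.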
