Hierarchically clustered heatmaps#

In this tutorial, we start with a proteins dataset and perform hierarchical clustering on the data. Then plot a heatmap where we overlay the clustering tree and color the entries based on their group. This implementation uses internally the clustermap plotting function from the Seaborn library.

[1]:
from omicspylib.analysis.clusters import HierarchicallyClusteredHeatmap
import pandas as pd
import matplotlib.pyplot as plt
from omicspylib import ProteinsDataset
from omicspylib import __version__

print(f'omicspylib version: {__version__}')
omicspylib version: 0.0.8

Load our proteins table containing the protein identifier and the quantitative value per experimental condition.

[2]:
data = pd.read_csv('data/protein_dataset.tsv', sep='\t')
data.head(2)
[2]:
protein_id c1_rep1 c1_rep2 c1_rep3 c1_rep4 c1_rep5 c2_rep1 c2_rep2 c2_rep3 c2_rep4 c2_rep5 c3_rep1 c3_rep2 c3_rep3 c3_rep4 c3_rep5
0 p0 1748947.964 2655665.55 1812807.047 3179830.747 3002006.748 0.000 1357720.520 0.000 2087116.684 0.000 2558776.479 2655657.487 2115434.782 2889376.029 0.0
1 p1 1689613.957 0.00 1953790.640 2447525.246 2877005.859 1438315.297 1198347.576 1864606.985 0.000 1414141.418 0.000 3070691.996 3149289.453 0.000 2264704.9

To create the heatmap, you need to keep in the table only the quantitative values. The unique identifiers (protein id in this case) should be placed in the index.

[3]:
hm_inputs = data.set_index('protein_id')
hm_inputs.head(2)
[3]:
c1_rep1 c1_rep2 c1_rep3 c1_rep4 c1_rep5 c2_rep1 c2_rep2 c2_rep3 c2_rep4 c2_rep5 c3_rep1 c3_rep2 c3_rep3 c3_rep4 c3_rep5
protein_id
p0 1748947.964 2655665.55 1812807.047 3179830.747 3002006.748 0.000 1357720.520 0.000 2087116.684 0.000 2558776.479 2655657.487 2115434.782 2889376.029 0.0
p1 1689613.957 0.00 1953790.640 2447525.246 2877005.859 1438315.297 1198347.576 1864606.985 0.000 1414141.418 0.000 3070691.996 3149289.453 0.000 2264704.9

To create the heatmap use the HierarchicallyClusteredHeatmap object. It will return a filtered version of the data, a plot object and two lists with the group number of the rows and columns. Returned data might have less rows than the provided dataset, based on the filtering applied. For example, you might want to consider only the cases identified across 5 or more experiments. The unique row identifier is still in the index, so you can join back to the original dataset.

[4]:
heatmap = HierarchicallyClusteredHeatmap(min_frequency=5, n_row_clusters=10, n_col_clusters=3)

result = heatmap.eval(data=hm_inputs, figsize=(5, 8))
plt.show()

print(f'N. rows before: {hm_inputs.shape[0]}')
print(f'N. rows after: {result.filtered_data.shape[0]}')
print(f'Row group idx: {result.row_groups}')
print(f'Col group idx: {result.col_groups}')
../_images/notebooks_hierrarchically-clustered-heatmap_7_0.png
N. rows before: 100
N. rows after: 97
Row group idx: [0, 1, 0, 1, 2, 1, 0, 1, 3, 0, 4, 3, 5, 1, 0, 3, 6, 3, 6, 7, 3, 8, 2, 2, 0, 6, 6, 6, 6, 6, 6, 0, 3, 0, 0, 0, 3, 0, 6, 3, 0, 0, 0, 6, 3, 3, 0, 4, 3, 5, 5, 5, 5, 3, 3, 6, 6, 1, 6, 0, 3, 3, 0, 4, 3, 3, 3, 1, 3, 3, 0, 0, 0, 0, 0, 3, 3, 5, 2, 3, 8, 0, 6, 2, 2, 6, 7, 8, 6, 3, 3, 1, 4, 7, 9, 4, 1]
Col group idx: [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2]

Alternatively your data are wrapped around a ProteinsDataset object, you can extract the table and proceed.

[5]:
conditions = {
    'c1': [f'c1_rep{i+1}' for i in range(5)],
    'c2': [f'c2_rep{i+1}' for i in range(5)],
    'c3': [f'c3_rep{i+1}' for i in range(5)],
}
dataset = ProteinsDataset.from_df(data, id_col='protein_id', conditions=conditions)
data_in = dataset.to_table()

heatmap = HierarchicallyClusteredHeatmap(n_col_clusters=None)
result = heatmap.eval(data=data_in)
# plt.savefig('my-plot.png')  # save as image
# plt.savefig('my-plot.pdf')  # save as vector-based image
plt.show()
../_images/notebooks_hierrarchically-clustered-heatmap_9_0.png