ProteinsDataset#
- class omicspylib.datasets.proteins.ProteinsDataset(conditions: list[ProteinsDatasetExpCondition])#
Bases: TabularDataset
A proteins dataset object. It contains multiple experimental conditions with one or more experiments per condition.
Constructor#
- ProteinsDataset.__init__(conditions: list[ProteinsDatasetExpCondition]) None#
Proteins dataset object constructor. This object wraps multiple experiments under sets of experimental conditions.
- Parameters:
conditions
- classmethod ProteinsDataset.from_df(data: DataFrame, id_col: str, conditions: dict[str, list]) ProteinsDataset#
Initialize a ProteinsDataset from a pandas DataFrame.
- Parameters:
data (pd.DataFrame) – The input DataFrame containing protein data.
id_col (str) – The name of the column in the DataFrame that represents the protein IDs.
conditions (dict[str, list]) – A dictionary mapping condition names to lists of column names representing the corresponding experimental conditions in the DataFrame.
- Returns:
A ProteinsDataset object created from the input DataFrame.
- Return type:
ProteinsDataset
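As a minimal sketch of the shape of the conditions argument, the snippet below builds a mapping from condition names to column names and checks it against the columns of a table. The column names, the condition names, and the check_conditions helper are hypothetical and are not part of omicspylib; a plain list stands in for the columns of the DataFrame.

```python
# Hypothetical column names of the input table (stand-in for df.columns).
columns = ["protein_id", "ctrl_1", "ctrl_2", "treated_1", "treated_2"]

# Condition name -> columns holding the experiments of that condition.
conditions = {
    "control": ["ctrl_1", "ctrl_2"],
    "treated": ["treated_1", "treated_2"],
}


def check_conditions(columns, id_col, conditions):
    """Return the experiment columns that are absent from the table.

    Illustrative helper only; it is not an omicspylib function.
    """
    if id_col not in columns:
        raise ValueError(f"id column {id_col!r} not found")
    return [c for cols in conditions.values() for c in cols if c not in columns]


missing = check_conditions(columns, "protein_id", conditions)
print(missing)  # []
```

With a real DataFrame df, a mapping of this shape would be passed as ProteinsDataset.from_df(data=df, id_col="protein_id", conditions=conditions), following the signature above.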
- classmethod ProteinsDataset.from_maxquant(data: str | DataFrame, conditions: dict[str, list], rm_reverse: bool = True, rm_contaminants: bool = True, rm_only_modified: bool = True, id_col: str = 'Majority protein IDs', rename_id_col: str | None = 'protein_id') ProteinsDataset#
Create a ProteinsDataset object from MaxQuant proteinGroups.txt file.
- Parameters:
data (str or pd.DataFrame) – The input data as a path to a TSV file or a pandas DataFrame.
conditions (dict[str, list]) – A dictionary mapping condition names to a list of corresponding samples.
rm_reverse (bool, optional) – If True, remove reverse hits from the dataset, by default True.
rm_contaminants (bool, optional) – If True, remove contaminant hits from the dataset, by default, True.
rm_only_modified (bool, optional) – If True, remove proteins with only modified peptides, by default True.
id_col (str, optional) – The column name containing the protein IDs, by default ‘Majority protein IDs’.
rename_id_col (str or None, optional) – The new column name for the protein IDs after renaming, by default ‘protein_id’.
- Returns:
The assembled ProteinsDataset object.
- Return type:
ProteinsDataset
Properties#
- ProteinsDataset.condition_names#
List experimental condition names.
- Returns:
A list of experimental condition names.
- Return type:
list
- ProteinsDataset.n_conditions#
Return the number of experimental conditions included in the dataset.
- Returns:
Number of experimental conditions included in the dataset.
- Return type:
int
- ProteinsDataset.n_experiments#
Returns the number of experiments included in the dataset, across all experimental conditions.
- Returns:
Number of experiments included in the dataset.
- Return type:
int
- ProteinsDataset.n_records#
Returns the number of unique records in the dataset.
- Returns:
The total number of unique records.
- Return type:
int
Methods#
- ProteinsDataset.append(new_obj: ProteinsDataset, skip_duplicates: bool = False) ProteinsDataset#
Append the experimental conditions of another dataset to the current one.
- Parameters:
new_obj (ProteinsDataset) – A ProteinsDataset object to join.
skip_duplicates (bool) – If False, an error is raised when an experimental condition (name) already exists. If True, the duplicate condition is omitted.
- Returns:
A new object containing the experimental conditions of the two datasets.
- Return type:
ProteinsDataset
- Raises:
ValueError – If the class of the provided object differs from the existing one, the id_col column name differs, or an experimental condition already exists and skip_duplicates is False.
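The duplicate handling described above can be sketched with plain dictionaries. This is an illustration of the documented behavior under the assumption that a duplicate incoming condition is dropped and the existing one is kept; append_conditions is a hypothetical helper, not the library's implementation.

```python
def append_conditions(existing, new, skip_duplicates=False):
    """Merge two {condition name: experiment names} mappings into a new one."""
    merged = dict(existing)  # leave the inputs untouched
    for name, experiments in new.items():
        if name in merged:
            if skip_duplicates:
                continue  # assumed: keep the existing condition, omit the new one
            raise ValueError(f"condition {name!r} already exists")
        merged[name] = experiments
    return merged


a = {"control": ["c1", "c2"]}
b = {"control": ["c3"], "treated": ["t1", "t2"]}
merged = append_conditions(a, b, skip_duplicates=True)
print(merged)  # {'control': ['c1', 'c2'], 'treated': ['t1', 't2']}
```

Calling append_conditions(a, b) with the default skip_duplicates=False raises ValueError because "control" appears in both mappings.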
- ProteinsDataset.describe() Dict[str, Any]#
Returns basic information about the dataset. Includes fields like number of experimental conditions, number of records in total, total number of experiments and statistics for each experimental condition.
- Returns:
Dataset statistics.
- Return type:
dict
- ProteinsDataset.drop(exp: str | list | None = None, cond: str | list | None = None) T#
Drop the specified experiment(s) and/or condition(s).
- Parameters:
exp (str, list, optional) – Experiment name(s) to be dropped.
cond (str, list, optional) – Experimental condition(s) to be dropped.
- Returns:
An object of the same instance type without the specified experiment(s) and/or condition(s).
- ProteinsDataset.experiment_names(condition: str | None = None) List[str]#
Get experiment names from the dataset. If experimental condition name is provided, experiment names will be limited to that case.
- Parameters:
condition (str or None) – Name of the experimental condition to retrieve names for, or None to retrieve all experimental conditions.
- Returns:
A list of experiment names.
- Return type:
list
- ProteinsDataset.filter(exp: str | list | None = None, cond: list | None = None, min_frequency: int | None = None, na_threshold: float = 0.0) T#
Filter dataset based on a given set of properties.
- Parameters:
exp (list, str, optional) – Experiment name, or list of experiment names, to keep. Leave empty to keep all experiments.
cond (list, optional) – List of experimental condition names. If provided, only the conditions specified will remain in the dataset.
min_frequency (int or None, optional) – If specified, only records with a frequency greater than or equal to this value are kept.
na_threshold (float or None, optional) – Values below or equal to this threshold are considered missing. It is used to filter records based on the number of missing values.
- Returns:
A new instance of the dataset object, filtered based on the user’s input.
- Return type:
- ProteinsDataset.frequency(na_threshold: float = 0.0, join_method: Literal['left', 'right', 'inner', 'outer', 'cross'] = 'outer', axis: int = 1, conditions: List[str] | None = None) DataFrame#
Calculate the number of experiments within each experimental condition with quantitative value above the specified threshold, and return a merged data frame for all conditions.
By default, an outer join is performed across all conditions. Adjust accordingly if needed.
- Parameters:
na_threshold (float) – Values below or equal to this threshold are considered missing.
join_method (MergeHow) – Method of joining records of each experimental condition in the output.
axis (int) – Axis on which to calculate the frequency. Use 1 for row by row and 0 for column by column.
conditions (List[str], optional) – If specified, only the specified conditions are considered.
- Returns:
A Pandas data frame containing, for each condition, the number of experiments with a value above the threshold.
- Return type:
pd.DataFrame
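As a rough illustration of what a row-by-row frequency (axis=1) means, the sketch below counts, for a single record, the experiments per condition whose value is above na_threshold. The helper names and the sample data are hypothetical, and treating None and NaN as missing is an assumption; this does not reproduce the library's internals.

```python
import math


def is_missing(value, na_threshold=0.0):
    """Treat absent, NaN, and at-or-below-threshold values as missing."""
    return value is None or math.isnan(value) or value <= na_threshold


def frequency(row, conditions, na_threshold=0.0):
    """Count, per condition, the experiments with a value above the threshold."""
    return {
        name: sum(not is_missing(row.get(exp), na_threshold) for exp in exps)
        for name, exps in conditions.items()
    }


conditions = {"control": ["c1", "c2"], "treated": ["t1", "t2"]}
row = {"c1": 10.0, "c2": 0.0, "t1": float("nan"), "t2": 7.5}
print(frequency(row, conditions))  # {'control': 1, 'treated': 1}
```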
- ProteinsDataset.impute(method: Literal['fixed', 'global min', 'global mean', 'global median', 'global row min', 'global row median', 'global row mean', 'group row min', 'group row mean', 'group row median'], na_threshold: float = 0.0, value: float | None = None, shift: float = 0.0, random_noise: bool = False) T#
Impute missing values with any of the specified methods. Note that missing value imputation may introduce artifacts in the analysis. Consider the level of missing value imputation before interpreting your results.
- Parameters:
method (str) – Imputation method. Can be one of:
- fixed: A fixed value. All values below the given threshold will be set to that value. To use this method, you also need to specify the value parameter.
- global min|mean|median: First the min|mean|median value of the dataset is calculated, and then missing values are set to that fixed value. You can also specify the shift parameter to shift the calculated min by a fixed step.
- global row min|mean|median: Similar to global min, but the min|mean|median value refers to the row entry instead of the value across all entries of the table.
- group row min|mean|median: Similar to the previous, but now the min|mean|median is based on the values of the group.
na_threshold (float) – Values below or equal to this threshold are considered missing.
value (float, optional) – If the fixed method is specified, you also need to set that value here.
shift (float, optional) – If a global or group min method is specified, you can also decrease that value by a fixed step.
random_noise (bool, optional) – If specified, random noise based on the global or within-group variability will be added. Imputed values are drawn from a normal distribution whose mean is the selected value (depending on the method) and whose standard deviation is the within-group or global standard deviation (depending on the method). Because values are drawn from a normal distribution, consider transforming your data first so that it approximates one (e.g., apply a log2 transformation). After imputation, you can back-transform to the original scale.
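To make the difference between the global and row-based methods concrete, here is a minimal sketch of global min (with shift) and global row mean imputation on a small table. The helpers are hypothetical, the table holds made-up log2-scale values, and subtracting shift from the minimum is an assumption based on the parameter description; random noise is not modeled.

```python
import math


def is_missing(value, na_threshold=0.0):
    """Treat absent, NaN, and at-or-below-threshold values as missing."""
    return value is None or math.isnan(value) or value <= na_threshold


def impute_global_min(table, na_threshold=0.0, shift=0.0):
    """Fill missing values with the smallest observed value minus `shift`."""
    observed = [v for row in table.values() for v in row
                if not is_missing(v, na_threshold)]
    fill = min(observed) - shift
    return {rid: [fill if is_missing(v, na_threshold) else v for v in row]
            for rid, row in table.items()}


def impute_row_mean(table, na_threshold=0.0):
    """Fill missing values with the mean of the observed values in the same row."""
    imputed = {}
    for rid, row in table.items():
        observed = [v for v in row if not is_missing(v, na_threshold)]
        # Rows without any observed value cannot be imputed this way.
        fill = sum(observed) / len(observed) if observed else None
        imputed[rid] = [fill if is_missing(v, na_threshold) else v for v in row]
    return imputed


table = {"P1": [20.0, 0.0, 22.0], "P2": [18.0, 19.0, None]}
print(impute_global_min(table, shift=1.0))
# {'P1': [20.0, 17.0, 22.0], 'P2': [18.0, 19.0, 17.0]}
print(impute_row_mean(table))
# {'P1': [20.0, 21.0, 22.0], 'P2': [18.0, 19.0, 18.5]}
```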
- ProteinsDataset.log2_backtransform() T#
Calculate the exponential with base 2. It is used to invert the log2 transformation and convert values back to their original scale.
- Return type:
An object of the same instance with the values transformed.
- ProteinsDataset.log2_transform() T#
Perform log2 transformation in all experiments.
- Return type:
An object of the same instance with the values transformed.
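The relationship between log2_transform and log2_backtransform can be sketched as a round trip on a few values. How the library treats zero or missing values during the transformation is not stated here, so leaving them as None is an assumption of this example.

```python
import math

values = [1024.0, 8.0, 0.0]  # 0.0 stands in for a missing intensity

# Assumption: only positive values are transformed; others are kept as None.
logged = [math.log2(v) if v > 0 else None for v in values]

# Back-transformation: raise 2 to the log2 value to recover the original scale.
restored = [2.0 ** v if v is not None else None for v in logged]

print(logged)    # [10.0, 3.0, None]
print(restored)  # [1024.0, 8.0, None]
```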
- ProteinsDataset.missing_values(na_threshold: float = 0.0) Tuple[DataFrame, int, int]#
Returns number of missing values per experiment and condition. Missing values are considered the cases that are either missing or are below the specified threshold.
- Parameters:
na_threshold (float) – Values below or equal to this threshold are considered missing.
- Returns:
pd.DataFrame – A Pandas data frame with the number of missing cases per experiment and condition.
int – Number of missing values.
int – Total number of values.
- ProteinsDataset.mean(na_threshold: float = 0.0, join_method: Literal['left', 'right', 'inner', 'outer', 'cross'] = 'inner', axis: int = 1) DataFrame#
Calculate the average value for each record within each experimental condition and return a merged data frame for all conditions.
Missing values (and values below or equal to the specified threshold) are omitted.
By default, an inner join is performed across all conditions. Adjust accordingly if needed.
- Parameters:
na_threshold (float) – Values below or equal to this threshold are considered missing.
join_method (MergeHow, optional) – Method of joining records of each experimental condition in the output.
axis (int) – 1 for row by row and 0 for column by column.
- Returns:
A Pandas data frame containing the average value for each condition.
- Return type:
pd.DataFrame
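A minimal sketch of the per-condition average for one record, with values at or below na_threshold left out of the mean rather than counted as zeros. The condition_means helper and the sample data are hypothetical, and returning None for a condition without observed values is an assumption.

```python
import math


def condition_means(row, conditions, na_threshold=0.0):
    """Average the observed values of one record within each condition."""
    means = {}
    for name, exps in conditions.items():
        observed = [
            row[e] for e in exps
            if row.get(e) is not None
            and not math.isnan(row[e])
            and row[e] > na_threshold
        ]
        means[name] = sum(observed) / len(observed) if observed else None
    return means


conditions = {"control": ["c1", "c2", "c3"], "treated": ["t1", "t2"]}
row = {"c1": 10.0, "c2": 0.0, "c3": 14.0, "t1": 8.0, "t2": 8.0}
print(condition_means(row, conditions))  # {'control': 12.0, 'treated': 8.0}
```

Including the zero in the control condition would give 8.0 instead of 12.0, which is why the threshold matters when interpreting the averages.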
- ProteinsDataset.normalize(method: Literal['mean'], ref_exp: str | None = None, ref_condition: str | None = None, use_common_records: bool = False, na_threshold: float = 0.0) T#
Normalize the dataset. Any required transformations must be done before calling the normalize method. For clarity, this is not handled internally.
For example, you might use the log2_transform method first, then normalize, and finally log2_backtransform the normalized values to return to the same units.
Normalization methods:
- Mean without common records, without a ref_exp:
Find the experiment with the most records and consider it the reference.
Calculate the mean intensity of each experiment and its difference from the reference.
Shift each experiment’s intensity by its difference from the reference.
- Mean without common records, with a ref_exp:
Like above, but the reference experiment is defined by the user.
- Mean with common records, without a ref_exp:
Find the experiment with the most records and consider it the reference.
Perform a pairwise comparison of each experiment with the reference, where: i. Filter on common records. ii. Calculate the difference from the reference. iii. Shift all intensities of that experiment based on that difference.
- Mean with common records, with a ref_exp:
Similar to the previous (pairwise comparison on common records), but the reference is defined by the user.
- Mean with common records, with a ref condition:
Similar to the previous, but the reference is selected automatically from the specified condition.
- Parameters:
method – Normalization method. For the moment, only normalization to the mean is supported.
ref_exp (str, optional) – If specified, this experiment will be considered the reference. If this is set, the ref_condition field is ignored.
ref_condition (str, optional) – If specified, the experiment of that condition with the most records will be considered the reference. Note that ref_exp should not be set.
use_common_records (bool) – If set to True, common records, in a pairwise comparison with the reference, will be considered for normalization.
na_threshold (float) – Values below or equal to this threshold are considered missing.
- Return type:
A new instance of the same object with normalized values.
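The mean-normalization variants described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's code: normalize_mean and the sample data are hypothetical, values are taken to be on a log2 scale, missing records are simply absent from each experiment, and the sign convention (subtracting the difference so that the means coincide) is assumed.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def normalize_mean(experiments, ref_exp=None, use_common_records=False):
    """Shift each experiment so that its mean matches the reference mean.

    experiments: {experiment name: {record id: log2 intensity}}.
    """
    # Without a user-defined reference, use the experiment with the most records.
    if ref_exp is None:
        ref_exp = max(experiments, key=lambda name: len(experiments[name]))
    ref = experiments[ref_exp]

    normalized = {}
    for name, records in experiments.items():
        if use_common_records:
            # Pairwise comparison: both means use only the shared records.
            common = records.keys() & ref.keys()
            diff = mean(records[r] for r in common) - mean(ref[r] for r in common)
        else:
            diff = mean(records.values()) - mean(ref.values())
        # Assumed sign convention: remove the difference from every intensity.
        normalized[name] = {r: v - diff for r, v in records.items()}
    return normalized


experiments = {
    "e1": {"P1": 10.0, "P2": 12.0, "P3": 14.0},
    "e2": {"P1": 11.0, "P2": 13.0},
}
print(normalize_mean(experiments)["e2"])
# {'P1': 11.0, 'P2': 13.0}
print(normalize_mean(experiments, use_common_records=True)["e2"])
# {'P1': 10.0, 'P2': 12.0}
```

In this toy data both experiments have the same overall mean, so the first call leaves e2 unchanged. Restricting the comparison to the shared records P1 and P2 reveals a one-unit offset, which the second call removes. The sketch does not handle experiments that share no records with the reference.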
- ProteinsDataset.to_table(join_method: Literal['left', 'right', 'inner', 'outer', 'cross'] = 'outer') DataFrame#
Merge individual experimental conditions to one table. You might use this method to extract a Pandas data frame from the dataset and keep working using common procedures.
Note that the entry identifier is in the index of the data frame.
- Parameters:
join_method (MergeHow, optional) – Method of joining records of each experimental condition in the output.
- Returns:
A Pandas data frame containing all experimental conditions.
- Return type:
pd.DataFrame
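The effect of join_method on the merged table can be illustrated with a small sketch that joins per-condition tables on the record identifier. Only the outer and inner cases are shown; join_conditions, the column layout, and the data are hypothetical, and None stands in for the missing values that a DataFrame would hold.

```python
def join_conditions(tables, columns, how="outer"):
    """Join {condition: {record id: {experiment: value}}} on the record id."""
    key_sets = [set(table) for table in tables.values()]
    if how == "outer":
        ids = set().union(*key_sets)       # records present in any condition
    elif how == "inner":
        ids = set.intersection(*key_sets)  # records present in every condition
    else:
        raise ValueError("this sketch supports only 'outer' and 'inner'")

    merged = {}
    for rid in sorted(ids):
        row = {}
        for cond, table in tables.items():
            for col in columns[cond]:
                row[col] = table.get(rid, {}).get(col)  # None when absent
        merged[rid] = row
    return merged


tables = {
    "control": {"P1": {"c1": 10.0}, "P2": {"c1": 12.0}},
    "treated": {"P2": {"t1": 9.0}, "P3": {"t1": 7.0}},
}
columns = {"control": ["c1"], "treated": ["t1"]}

print(join_conditions(tables, columns, how="outer"))
# {'P1': {'c1': 10.0, 't1': None}, 'P2': {'c1': 12.0, 't1': 9.0}, 'P3': {'c1': None, 't1': 7.0}}
print(join_conditions(tables, columns, how="inner"))
# {'P2': {'c1': 12.0, 't1': 9.0}}
```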
- ProteinsDataset.unique_records() list#
Returns a list of unique entry ids across all experimental conditions.
- Returns:
A list of unique entry ids.
- Return type:
list