Data Analysis PXD040621#

Plan

read data and log2 transform intensity values
aggregate peptide intensities to protein intensities
format data from long to wide format
remove contaminant proteins
check for missing values
Clustermap of sample and proteins
differential analysis (Volcano Plots)
Enrichment Analysis
check for maltose update pathway (Fig. 3 in paper)

%pip install acore vuecore

Show code cell output

Hide code cell output

Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: acore in /home/runner/.local/lib/python3.12/site-packages (0.2.0)
Requirement already satisfied: vuecore in /home/runner/.local/lib/python3.12/site-packages (0.0.5)
Requirement already satisfied: numpy<2.1.0 in /home/runner/.local/lib/python3.12/site-packages (from acore) (2.0.2)
Requirement already satisfied: pandas in /home/runner/.local/lib/python3.12/site-packages (from acore) (2.3.1)
Requirement already satisfied: dsp-pandas in /home/runner/.local/lib/python3.12/site-packages (from acore) (0.0.5)
Requirement already satisfied: scipy in /home/runner/.local/lib/python3.12/site-packages (from acore) (1.15.3)
Requirement already satisfied: scikit-learn>=1.5 in /home/runner/.local/lib/python3.12/site-packages (from acore) (1.7.1)
Requirement already satisfied: networkx in /home/runner/.local/lib/python3.12/site-packages (from acore) (3.5)
Requirement already satisfied: biopython in /home/runner/.local/lib/python3.12/site-packages (from acore) (1.85)
Requirement already satisfied: combat in /home/runner/.local/lib/python3.12/site-packages (from acore) (0.3.3)
Requirement already satisfied: gseapy!=1.1.5 in /home/runner/.local/lib/python3.12/site-packages (from acore) (1.1.9)
Requirement already satisfied: kmapper in /home/runner/.local/lib/python3.12/site-packages (from acore) (2.1.0)
Requirement already satisfied: lifelines in /home/runner/.local/lib/python3.12/site-packages (from acore) (0.30.0)
Requirement already satisfied: pingouin in /home/runner/.local/lib/python3.12/site-packages (from acore) (0.5.5)
Requirement already satisfied: python-louvain in /home/runner/.local/lib/python3.12/site-packages (from acore) (0.16)
Requirement already satisfied: PyWGCNA!=2.2.0 in /home/runner/.local/lib/python3.12/site-packages (from acore) (1.16.0)
Requirement already satisfied: snfpy in /home/runner/.local/lib/python3.12/site-packages (from acore) (0.2.2)
Requirement already satisfied: umap-learn in /home/runner/.local/lib/python3.12/site-packages (from acore) (0.5.9.post2)
Requirement already satisfied: statsmodels in /home/runner/.local/lib/python3.12/site-packages (from acore) (0.14.5)
Requirement already satisfied: inmoose in /home/runner/.local/lib/python3.12/site-packages (from acore) (0.8.0)
Requirement already satisfied: requests in /usr/lib/python3/dist-packages (from acore) (2.31.0)
Requirement already satisfied: rarfile in /home/runner/.local/lib/python3.12/site-packages (from acore) (4.2)
Requirement already satisfied: plotly in /home/runner/.local/lib/python3.12/site-packages (from vuecore) (5.15.0)
Requirement already satisfied: beautifulsoup4 in /home/runner/.local/lib/python3.12/site-packages (from vuecore) (4.13.4)
Requirement already satisfied: dash in /home/runner/.local/lib/python3.12/site-packages (from vuecore) (3.2.0)
Requirement already satisfied: matplotlib in /home/runner/.local/lib/python3.12/site-packages (from vuecore) (3.10.5)
Requirement already satisfied: kaleido in /home/runner/.local/lib/python3.12/site-packages (from vuecore) (0.2.0)
Requirement already satisfied: pyvis in /home/runner/.local/lib/python3.12/site-packages (from vuecore) (0.3.2)
Requirement already satisfied: wordcloud in /home/runner/.local/lib/python3.12/site-packages (from vuecore) (1.9.4)
Requirement already satisfied: cyjupyter in /home/runner/.local/lib/python3.12/site-packages (from vuecore) (0.2.0)
Requirement already satisfied: nltk in /home/runner/.local/lib/python3.12/site-packages (from vuecore) (3.9.1)
Requirement already satisfied: webweb in /home/runner/.local/lib/python3.12/site-packages (from vuecore) (0.0.37)
Requirement already satisfied: dash-cytoscape in /home/runner/.local/lib/python3.12/site-packages (from vuecore) (1.0.2)
Requirement already satisfied: contourpy>=1.0.1 in /home/runner/.local/lib/python3.12/site-packages (from matplotlib->vuecore) (1.3.3)
Requirement already satisfied: cycler>=0.10 in /home/runner/.local/lib/python3.12/site-packages (from matplotlib->vuecore) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /home/runner/.local/lib/python3.12/site-packages (from matplotlib->vuecore) (4.59.0)
Requirement already satisfied: kiwisolver>=1.3.1 in /home/runner/.local/lib/python3.12/site-packages (from matplotlib->vuecore) (1.4.8)
Requirement already satisfied: packaging>=20.0 in /home/runner/.local/lib/python3.12/site-packages (from matplotlib->vuecore) (24.2)
Requirement already satisfied: pillow>=8 in /home/runner/.local/lib/python3.12/site-packages (from matplotlib->vuecore) (10.4.0)
Requirement already satisfied: pyparsing>=2.3.1 in /usr/lib/python3/dist-packages (from matplotlib->vuecore) (3.1.1)
Requirement already satisfied: python-dateutil>=2.7 in /usr/lib/python3/dist-packages (from matplotlib->vuecore) (2.8.2)
Requirement already satisfied: anndata>=0.8.0 in /home/runner/.local/lib/python3.12/site-packages (from PyWGCNA!=2.2.0->acore) (0.12.1)
Requirement already satisfied: biomart>=0.9.2 in /home/runner/.local/lib/python3.12/site-packages (from PyWGCNA!=2.2.0->acore) (0.9.2)
Requirement already satisfied: reactome2py>=3.0.0 in /home/runner/.local/lib/python3.12/site-packages (from PyWGCNA!=2.2.0->acore) (3.0.0)
Requirement already satisfied: rsrc>=0.1.3 in /home/runner/.local/lib/python3.12/site-packages (from PyWGCNA!=2.2.0->acore) (0.1.3)
Requirement already satisfied: seaborn>=0.11.1 in /home/runner/.local/lib/python3.12/site-packages (from PyWGCNA!=2.2.0->acore) (0.13.2)
Requirement already satisfied: setuptools>=59.8.0 in /usr/lib/python3/dist-packages (from PyWGCNA!=2.2.0->acore) (68.1.2)
Requirement already satisfied: pytz>=2020.1 in /usr/lib/python3/dist-packages (from pandas->acore) (2024.1)
Requirement already satisfied: tzdata>=2022.7 in /home/runner/.local/lib/python3.12/site-packages (from pandas->acore) (2025.2)
Requirement already satisfied: ipython>=5.3.0 in /home/runner/.local/lib/python3.12/site-packages (from pyvis->vuecore) (9.4.0)
Requirement already satisfied: jinja2>=2.9.6 in /usr/lib/python3/dist-packages (from pyvis->vuecore) (3.1.2)
Requirement already satisfied: jsonpickle>=1.4.1 in /home/runner/.local/lib/python3.12/site-packages (from pyvis->vuecore) (4.1.1)
Requirement already satisfied: joblib>=1.2.0 in /home/runner/.local/lib/python3.12/site-packages (from scikit-learn>=1.5->acore) (1.5.1)
Requirement already satisfied: threadpoolctl>=3.1.0 in /home/runner/.local/lib/python3.12/site-packages (from scikit-learn>=1.5->acore) (3.6.0)
Requirement already satisfied: patsy>=0.5.6 in /home/runner/.local/lib/python3.12/site-packages (from statsmodels->acore) (1.0.1)
Requirement already satisfied: soupsieve>1.2 in /home/runner/.local/lib/python3.12/site-packages (from beautifulsoup4->vuecore) (2.7)
Requirement already satisfied: typing-extensions>=4.0.0 in /usr/lib/python3/dist-packages (from beautifulsoup4->vuecore) (4.10.0)
Requirement already satisfied: mpmath>=1.1.0 in /home/runner/.local/lib/python3.12/site-packages (from combat->acore) (1.3.0)
Requirement already satisfied: ipywidgets>=7.0.0 in /home/runner/.local/lib/python3.12/site-packages (from cyjupyter->vuecore) (8.1.7)
Requirement already satisfied: Flask<3.2,>=1.0.4 in /home/runner/.local/lib/python3.12/site-packages (from dash->vuecore) (3.1.1)
Requirement already satisfied: Werkzeug<3.2 in /home/runner/.local/lib/python3.12/site-packages (from dash->vuecore) (3.1.3)
Requirement already satisfied: importlib-metadata in /home/runner/.local/lib/python3.12/site-packages (from dash->vuecore) (8.7.0)
Requirement already satisfied: retrying in /home/runner/.local/lib/python3.12/site-packages (from dash->vuecore) (1.4.2)
Requirement already satisfied: nest-asyncio in /home/runner/.local/lib/python3.12/site-packages (from dash->vuecore) (1.6.0)
Requirement already satisfied: tenacity>=6.2.0 in /home/runner/.local/lib/python3.12/site-packages (from plotly->vuecore) (9.1.2)
Requirement already satisfied: openpyxl in /home/runner/.local/lib/python3.12/site-packages (from dsp-pandas->acore) (3.1.5)
Requirement already satisfied: fastcluster in /home/runner/.local/lib/python3.12/site-packages (from inmoose->acore) (1.3.0)
Requirement already satisfied: autograd>=1.5 in /home/runner/.local/lib/python3.12/site-packages (from lifelines->acore) (1.8.0)
Requirement already satisfied: autograd-gamma>=0.3 in /home/runner/.local/lib/python3.12/site-packages (from lifelines->acore) (0.5.0)
Requirement already satisfied: formulaic>=0.2.2 in /home/runner/.local/lib/python3.12/site-packages (from lifelines->acore) (1.2.0)
Requirement already satisfied: click in /usr/lib/python3/dist-packages (from nltk->vuecore) (8.1.6)
Requirement already satisfied: regex>=2021.8.3 in /home/runner/.local/lib/python3.12/site-packages (from nltk->vuecore) (2025.7.34)
Requirement already satisfied: tqdm in /home/runner/.local/lib/python3.12/site-packages (from nltk->vuecore) (4.67.1)
Requirement already satisfied: pandas-flavor in /home/runner/.local/lib/python3.12/site-packages (from pingouin->acore) (0.7.0)
Requirement already satisfied: tabulate in /home/runner/.local/lib/python3.12/site-packages (from pingouin->acore) (0.9.0)
Requirement already satisfied: numba>=0.51.2 in /home/runner/.local/lib/python3.12/site-packages (from umap-learn->acore) (0.61.2)
Requirement already satisfied: pynndescent>=0.5 in /home/runner/.local/lib/python3.12/site-packages (from umap-learn->acore) (0.5.13)
Requirement already satisfied: array-api-compat>=1.7.1 in /home/runner/.local/lib/python3.12/site-packages (from anndata>=0.8.0->PyWGCNA!=2.2.0->acore) (1.12.0)
Requirement already satisfied: h5py>=3.8 in /home/runner/.local/lib/python3.12/site-packages (from anndata>=0.8.0->PyWGCNA!=2.2.0->acore) (3.14.0)
Requirement already satisfied: legacy-api-wrap in /home/runner/.local/lib/python3.12/site-packages (from anndata>=0.8.0->PyWGCNA!=2.2.0->acore) (1.4.1)
Requirement already satisfied: natsort in /home/runner/.local/lib/python3.12/site-packages (from anndata>=0.8.0->PyWGCNA!=2.2.0->acore) (8.4.0)
Requirement already satisfied: zarr!=3.0.*,>=2.18.7 in /home/runner/.local/lib/python3.12/site-packages (from anndata>=0.8.0->PyWGCNA!=2.2.0->acore) (3.1.1)
Requirement already satisfied: blinker>=1.9.0 in /home/runner/.local/lib/python3.12/site-packages (from Flask<3.2,>=1.0.4->dash->vuecore) (1.9.0)
Requirement already satisfied: itsdangerous>=2.2.0 in /home/runner/.local/lib/python3.12/site-packages (from Flask<3.2,>=1.0.4->dash->vuecore) (2.2.0)
Requirement already satisfied: markupsafe>=2.1.1 in /usr/lib/python3/dist-packages (from Flask<3.2,>=1.0.4->dash->vuecore) (2.1.5)
Requirement already satisfied: interface-meta>=1.2.0 in /home/runner/.local/lib/python3.12/site-packages (from formulaic>=0.2.2->lifelines->acore) (1.3.0)
Requirement already satisfied: narwhals>=1.17 in /home/runner/.local/lib/python3.12/site-packages (from formulaic>=0.2.2->lifelines->acore) (2.0.1)
Requirement already satisfied: wrapt>=1.0 in /home/runner/.local/lib/python3.12/site-packages (from formulaic>=0.2.2->lifelines->acore) (1.17.2)
Requirement already satisfied: decorator in /home/runner/.local/lib/python3.12/site-packages (from ipython>=5.3.0->pyvis->vuecore) (5.2.1)
Requirement already satisfied: ipython-pygments-lexers in /home/runner/.local/lib/python3.12/site-packages (from ipython>=5.3.0->pyvis->vuecore) (1.1.1)
Requirement already satisfied: jedi>=0.16 in /home/runner/.local/lib/python3.12/site-packages (from ipython>=5.3.0->pyvis->vuecore) (0.19.2)
Requirement already satisfied: matplotlib-inline in /home/runner/.local/lib/python3.12/site-packages (from ipython>=5.3.0->pyvis->vuecore) (0.1.7)
Requirement already satisfied: pexpect>4.3 in /usr/lib/python3/dist-packages (from ipython>=5.3.0->pyvis->vuecore) (4.9.0)
Requirement already satisfied: prompt_toolkit<3.1.0,>=3.0.41 in /home/runner/.local/lib/python3.12/site-packages (from ipython>=5.3.0->pyvis->vuecore) (3.0.51)
Requirement already satisfied: pygments>=2.4.0 in /usr/lib/python3/dist-packages (from ipython>=5.3.0->pyvis->vuecore) (2.17.2)
Requirement already satisfied: stack_data in /home/runner/.local/lib/python3.12/site-packages (from ipython>=5.3.0->pyvis->vuecore) (0.6.3)
Requirement already satisfied: traitlets>=5.13.0 in /home/runner/.local/lib/python3.12/site-packages (from ipython>=5.3.0->pyvis->vuecore) (5.14.3)
Requirement already satisfied: comm>=0.1.3 in /home/runner/.local/lib/python3.12/site-packages (from ipywidgets>=7.0.0->cyjupyter->vuecore) (0.2.3)
Requirement already satisfied: widgetsnbextension~=4.0.14 in /home/runner/.local/lib/python3.12/site-packages (from ipywidgets>=7.0.0->cyjupyter->vuecore) (4.0.14)
Requirement already satisfied: jupyterlab_widgets~=3.0.15 in /home/runner/.local/lib/python3.12/site-packages (from ipywidgets>=7.0.0->cyjupyter->vuecore) (3.0.15)
Requirement already satisfied: llvmlite<0.45,>=0.44.0dev0 in /home/runner/.local/lib/python3.12/site-packages (from numba>=0.51.2->umap-learn->acore) (0.44.0)
Requirement already satisfied: json5>=0.8.4 in /home/runner/.local/lib/python3.12/site-packages (from reactome2py>=3.0.0->PyWGCNA!=2.2.0->acore) (0.12.0)
Requirement already satisfied: memoir>=0.0.3 in /home/runner/.local/lib/python3.12/site-packages (from rsrc>=0.1.3->PyWGCNA!=2.2.0->acore) (0.0.3)
Requirement already satisfied: reprit>=0.3.0 in /home/runner/.local/lib/python3.12/site-packages (from rsrc>=0.1.3->PyWGCNA!=2.2.0->acore) (0.9.0)
Requirement already satisfied: zipp>=3.20 in /home/runner/.local/lib/python3.12/site-packages (from importlib-metadata->dash->vuecore) (3.23.0)
Requirement already satisfied: et-xmlfile in /home/runner/.local/lib/python3.12/site-packages (from openpyxl->dsp-pandas->acore) (2.0.0)
Requirement already satisfied: xarray in /home/runner/.local/lib/python3.12/site-packages (from pandas-flavor->pingouin->acore) (2025.7.1)
Requirement already satisfied: parso<0.9.0,>=0.8.4 in /home/runner/.local/lib/python3.12/site-packages (from jedi>=0.16->ipython>=5.3.0->pyvis->vuecore) (0.8.4)
Requirement already satisfied: wcwidth in /home/runner/.local/lib/python3.12/site-packages (from prompt_toolkit<3.1.0,>=3.0.41->ipython>=5.3.0->pyvis->vuecore) (0.2.13)
Requirement already satisfied: donfig>=0.8 in /home/runner/.local/lib/python3.12/site-packages (from zarr!=3.0.*,>=2.18.7->anndata>=0.8.0->PyWGCNA!=2.2.0->acore) (0.8.1.post1)
Requirement already satisfied: numcodecs>=0.14 in /home/runner/.local/lib/python3.12/site-packages (from numcodecs[crc32c]>=0.14->zarr!=3.0.*,>=2.18.7->anndata>=0.8.0->PyWGCNA!=2.2.0->acore) (0.16.1)
Requirement already satisfied: executing>=1.2.0 in /home/runner/.local/lib/python3.12/site-packages (from stack_data->ipython>=5.3.0->pyvis->vuecore) (2.2.0)
Requirement already satisfied: asttokens>=2.1.0 in /home/runner/.local/lib/python3.12/site-packages (from stack_data->ipython>=5.3.0->pyvis->vuecore) (3.0.0)
Requirement already satisfied: pure-eval in /home/runner/.local/lib/python3.12/site-packages (from stack_data->ipython>=5.3.0->pyvis->vuecore) (0.2.3)
Requirement already satisfied: pyyaml in /home/runner/.local/lib/python3.12/site-packages (from donfig>=0.8->zarr!=3.0.*,>=2.18.7->anndata>=0.8.0->PyWGCNA!=2.2.0->acore) (6.0.2)
Requirement already satisfied: crc32c>=2.7 in /home/runner/.local/lib/python3.12/site-packages (from numcodecs[crc32c]>=0.14->zarr!=3.0.*,>=2.18.7->anndata>=0.8.0->PyWGCNA!=2.2.0->acore) (2.7.1)
Note: you may need to restart the kernel to use updated packages.

from pathlib import Path

import acore.differential_regulation
import acore.enrichment_analysis
import acore.normalization
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import scipy.stats
import seaborn as sns
import vuecore
from acore.io.uniprot import fetch_annotations, process_annotations
from vuecore.viz import get_enrichment_plots

Read in the data#

file_in: input file with the quantified peptide data in MSstats format as provided by quantms

The file will be loaded from the repository if it is not present.

file_in = Path("data/PXD040621/processed/PXD040621.sdrf_openms_design_msstats_in.csv")
if not file_in.exists():
    file_in = (
        "https://raw.githubusercontent.com/biosustain/dsp_course_proteomics_intro/HEAD"
        "/data/PXD040621/processed/PXD040621.sdrf_openms_design_msstats_in.csv"
    )
df = pd.read_csv(file_in, sep=",", header=0)  # .set_index([])
df.head()

	ProteinName	PeptideSequence	PrecursorCharge	FragmentIon	IsotopeLabelType	Condition	BioReplicate	Run	Intensity	Reference
0	sp\|P00959\|SYM_ECOLI	AAAAPVTGPLADDPIQETITFDDFAK	2	NaN	L	Control	1	1	201,065,600.000	20220830_JL-4884_Forster_Ecoli_DMSO_rep1_EG-1....
1	sp\|P00959\|SYM_ECOLI	AAAAPVTGPLADDPIQETITFDDFAK	2	NaN	L	Control	2	2	74,844,780.000	20220830_JL-4884_Forster_Ecoli_DMSO_rep2_EG-2....
2	sp\|P00959\|SYM_ECOLI	AAAAPVTGPLADDPIQETITFDDFAK	2	NaN	L	Control	3	3	67,591,390.000	20220830_JL-4884_Forster_Ecoli_DMSO_rep3_EG-3....
3	sp\|P00959\|SYM_ECOLI	AAAAPVTGPLADDPIQETITFDDFAK	2	NaN	L	Control	4	4	76,388,810.000	20220830_JL-4884_Forster_Ecoli_DMSO_rep4_EG-4....
4	sp\|P00959\|SYM_ECOLI	AAAAPVTGPLADDPIQETITFDDFAK	2	NaN	L	Sulforaphane	5	5	116,247,100.000	20220830_JL-4884_Forster_Ecoli_Suf_rep1_EG-5.mzML

define the output folder for our VueGen report which we will create later

out_dir = "data/PXD040621/report/"
out_dir = Path(out_dir)
out_dir.mkdir(parents=True, exist_ok=True)

We have the following columns in the data:

cols = [
    "ProteinName",
    "PeptideSequence",
    "PrecursorCharge",
    "FragmentIon",
    "ProductCharge",
    "IsotopeLabelType",
    "Condition",
    "BioReplicate",
    "Run",
    "Intensity",
    "Reference",
]

Log2 transform the intensity values#

log2 transformations are common for lognormal distributed data

df["Intensity"] = np.log2(df["Intensity"].astype(float))
df.head()

	ProteinName	PeptideSequence	PrecursorCharge	FragmentIon	IsotopeLabelType	Condition	BioReplicate	Run	Intensity	Reference
0	sp\|P00959\|SYM_ECOLI	AAAAPVTGPLADDPIQETITFDDFAK	2	NaN	L	Control	1	1	27.583	20220830_JL-4884_Forster_Ecoli_DMSO_rep1_EG-1....
1	sp\|P00959\|SYM_ECOLI	AAAAPVTGPLADDPIQETITFDDFAK	2	NaN	L	Control	2	2	26.157	20220830_JL-4884_Forster_Ecoli_DMSO_rep2_EG-2....
2	sp\|P00959\|SYM_ECOLI	AAAAPVTGPLADDPIQETITFDDFAK	2	NaN	L	Control	3	3	26.010	20220830_JL-4884_Forster_Ecoli_DMSO_rep3_EG-3....
3	sp\|P00959\|SYM_ECOLI	AAAAPVTGPLADDPIQETITFDDFAK	2	NaN	L	Control	4	4	26.187	20220830_JL-4884_Forster_Ecoli_DMSO_rep4_EG-4....
4	sp\|P00959\|SYM_ECOLI	AAAAPVTGPLADDPIQETITFDDFAK	2	NaN	L	Sulforaphane	5	5	26.793	20220830_JL-4884_Forster_Ecoli_Suf_rep1_EG-5.mzML

Exploratory and Data Quality Plots (peptide level)#

df[“BioReplicate”] = df[“BioReplicate”].replace({5: 1, 6: 2, 7: 3, 8: 4}) fg = sns.displot( data=df.rename(columns={“BioReplicate”: “Rep”, “Condition”: “C.”}), x=”Intensity”, col=”C.”, row=”Rep”, # hue=”Reactor_ID”, kind=”kde”, height=2, aspect=1.1, )

Aggregate the peptide intensities to protein intensities#

we use the median of the peptide intensities for each protein

There are more sophisticated ways to do this, e.g. using MaxLFQ, iBAQ, FlashLFQ, DirectLFQ, etc.

shorten sample name for readability

proteins = (
    df.groupby(["ProteinName", "Reference"])["Intensity"].median().unstack(level=0)
)
proteins.index = proteins.index.str.split("_").str[4:6].str.join("_")
proteins

ProteinName	CON_ENSEMBL:ENSBTAP00000024462;CON_ENSEMBL:ENSBTAP00000024466	CON_O76013	CON_P00761	CON_P01966	CON_P02070	CON_P02533	CON_P02538	CON_P02662	CON_P02663	CON_P02666	...	sp\|Q47319\|TAPT_ECOLI	sp\|Q47536\|YAIP_ECOLI	sp\|Q47622\|SAPA_ECOLI	sp\|Q47679\|YAFV_ECOLI	sp\|Q47710\|YQJK_ECOLI	sp\|Q57261\|TRUD_ECOLI	sp\|Q59385-2\|COPA_ECOLI	sp\|Q59385\|COPA_ECOLI	sp\|Q7DFV3\|YMGG_ECOLI	sp\|Q93K97\|ADPP_ECOLI
Reference
DMSO_rep1	NaN	NaN	31.055	24.202	26.890	28.336	25.828	NaN	26.762	NaN	...	NaN	NaN	25.343	NaN	27.038	28.411	23.555	27.640	28.513	27.223
DMSO_rep2	NaN	25.647	30.117	25.711	NaN	25.595	26.210	NaN	27.593	NaN	...	NaN	NaN	NaN	NaN	26.841	27.941	25.240	27.244	27.621	25.291
DMSO_rep3	25.529	NaN	33.767	NaN	NaN	26.992	25.441	28.129	28.895	29.559	...	NaN	NaN	24.576	NaN	26.609	27.070	NaN	27.525	27.679	24.359
DMSO_rep4	NaN	NaN	33.702	NaN	NaN	27.205	26.093	NaN	27.764	NaN	...	NaN	NaN	25.945	23.902	27.164	26.680	22.524	27.404	27.256	25.767
Suf_rep1	NaN	NaN	34.068	NaN	NaN	NaN	24.442	NaN	25.938	NaN	...	NaN	NaN	25.836	NaN	26.819	27.995	NaN	27.499	28.090	25.956
Suf_rep2	NaN	NaN	33.252	NaN	26.571	26.495	25.070	NaN	NaN	NaN	...	NaN	NaN	NaN	24.162	27.268	27.055	NaN	27.667	27.526	25.231
Suf_rep3	NaN	NaN	31.844	25.927	NaN	26.249	28.169	22.968	26.466	NaN	...	25.463	NaN	NaN	NaN	24.741	27.313	NaN	27.708	27.814	26.103
Suf_rep4	NaN	NaN	33.857	NaN	NaN	27.208	24.690	NaN	NaN	NaN	...	NaN	24.468	24.757	24.040	27.071	26.643	NaN	27.848	27.605	26.178

8 rows × 2306 columns

Remove contaminant proteins#

Remove the contaminant proteins which were added to the fasta file used in the data processing. Contaminant proteins are e.g. creation which gets into the sample from the human skin or hair when the sample is prepared.

These are filtered out as they are most of the time not relevant, but a contamination.

decoy_proteins = proteins.filter(like="CON_", axis=1)
proteins = proteins.drop(decoy_proteins.columns, axis=1)
proteins

ProteinName	sp\|A5A613\|YCIY_ECOLI	sp\|P00350\|6PGD_ECOLI	sp\|P00363\|FRDA_ECOLI	sp\|P00370\|DHE4_ECOLI	sp\|P00393\|NDH_ECOLI	sp\|P00448\|SODM_ECOLI	sp\|P00452\|RIR1_ECOLI	sp\|P00490\|PHSM_ECOLI	sp\|P00509\|AAT_ECOLI	sp\|P00547\|KHSE_ECOLI	...	sp\|Q47319\|TAPT_ECOLI	sp\|Q47536\|YAIP_ECOLI	sp\|Q47622\|SAPA_ECOLI	sp\|Q47679\|YAFV_ECOLI	sp\|Q47710\|YQJK_ECOLI	sp\|Q57261\|TRUD_ECOLI	sp\|Q59385-2\|COPA_ECOLI	sp\|Q59385\|COPA_ECOLI	sp\|Q7DFV3\|YMGG_ECOLI	sp\|Q93K97\|ADPP_ECOLI
Reference
DMSO_rep1	27.180	28.152	30.247	27.459	26.824	25.610	NaN	27.864	29.979	26.065	...	NaN	NaN	25.343	NaN	27.038	28.411	23.555	27.640	28.513	27.223
DMSO_rep2	NaN	27.926	30.262	26.873	26.757	24.901	NaN	26.439	29.048	NaN	...	NaN	NaN	NaN	NaN	26.841	27.941	25.240	27.244	27.621	25.291
DMSO_rep3	NaN	27.653	29.970	26.600	25.442	25.054	27.172	26.382	28.777	NaN	...	NaN	NaN	24.576	NaN	26.609	27.070	NaN	27.525	27.679	24.359
DMSO_rep4	NaN	27.152	29.471	26.439	25.799	24.790	NaN	26.820	29.485	25.524	...	NaN	NaN	25.945	23.902	27.164	26.680	22.524	27.404	27.256	25.767
Suf_rep1	NaN	27.442	30.005	27.400	26.671	25.564	NaN	27.685	29.295	NaN	...	NaN	NaN	25.836	NaN	26.819	27.995	NaN	27.499	28.090	25.956
Suf_rep2	NaN	27.032	30.086	27.189	26.886	25.378	27.364	27.531	29.284	NaN	...	NaN	NaN	NaN	24.162	27.268	27.055	NaN	27.667	27.526	25.231
Suf_rep3	NaN	27.815	29.904	27.139	26.711	25.318	26.062	27.545	29.357	26.265	...	25.463	NaN	NaN	NaN	24.741	27.313	NaN	27.708	27.814	26.103
Suf_rep4	NaN	27.587	29.575	27.224	26.321	25.360	25.101	27.705	29.584	26.427	...	NaN	24.468	24.757	24.040	27.071	26.643	NaN	27.848	27.605	26.178

8 rows × 2269 columns

Create a label for each sample based on the metadata.

we will use a string in the sample name, but you can see how the metadata is organized

meta = df[["Condition", "BioReplicate", "Run", "Reference"]].drop_duplicates()
meta

	Condition	BioReplicate	Run	Reference
0	Control	1	1	20220830_JL-4884_Forster_Ecoli_DMSO_rep1_EG-1....
1	Control	2	2	20220830_JL-4884_Forster_Ecoli_DMSO_rep2_EG-2....
2	Control	3	3	20220830_JL-4884_Forster_Ecoli_DMSO_rep3_EG-3....
3	Control	4	4	20220830_JL-4884_Forster_Ecoli_DMSO_rep4_EG-4....
4	Sulforaphane	5	5	20220830_JL-4884_Forster_Ecoli_Suf_rep1_EG-5.mzML
5	Sulforaphane	6	6	20220830_JL-4884_Forster_Ecoli_Suf_rep2_EG-6.mzML
6	Sulforaphane	7	7	20220830_JL-4884_Forster_Ecoli_Suf_rep3_EG-7.mzML
7	Sulforaphane	8	8	20220830_JL-4884_Forster_Ecoli_Suf_rep4_EG-8.mzML

label_encoding = {0: "control", 1: "10 µm sulforaphane"}
label_suf = pd.Series(
    proteins.index.str.contains("Suf_").astype(int),
    index=proteins.index,
    name="label_suf",
    dtype=np.int8,
).map(label_encoding)
label_suf

Reference
DMSO_rep1               control
DMSO_rep2               control
DMSO_rep3               control
DMSO_rep4               control
Suf_rep1     10 µm sulforaphane
Suf_rep2     10 µm sulforaphane
Suf_rep3     10 µm sulforaphane
Suf_rep4     10 µm sulforaphane
Name: label_suf, dtype: object

Plot the data completeness for each protein.#

view_name = "Protein"
out_dir_subsection = out_dir / "1_data" / "completeness"
out_dir_subsection.mkdir(parents=True, exist_ok=True)

view_name = "Protein"
ax = (
    proteins.notna()
    .sum()
    .sort_values()
    .plot(
        rot=45,
        ylabel=f"Number of Samples {view_name.lower()} was observed in",
    )
)
ax.get_figure().savefig(
    out_dir_subsection / f"data_completeness_step_plot.png",
    bbox_inches="tight",
    dpi=300,
)

_images/920b0d32a0e87b1e85101ba9ad25adea4228e33fd151c22b4b3d88e442cb2578.png

view_name = "Protein"
ax = (
    proteins.notna()
    .sum()
    .value_counts()
    .sort_index(ascending=False)
    .plot(
        kind="bar",
        title=f"Data Completeness per {view_name}",
        xlabel=f"Number of Samples {view_name.lower()} was observed in",
        ylabel=f"Number of {view_name}s",
        color="steelblue",
        figsize=(10, 6),
    )
)
ax.get_figure().savefig(
    out_dir_subsection / f"data_completeness_bar_plot.png",
    bbox_inches="tight",
    dpi=300,
)

_images/3148ba61f260e734a61eed9aa6901c190e0099b7f39dfd4401d5d4655f810812.png

# Explode column names to examine split by '|'
proteins_meta = (
    proteins.columns.str.split("|", expand=True)
    .to_frame()
    .dropna(how="any", axis=1)
    .reset_index(drop=True)
)
proteins_meta.columns = ["Source", "ProteinName", "GeneName"]
proteins_meta["GeneName"] = proteins_meta["GeneName"].str.split("_").str[0]
proteins_meta.index = proteins.columns
proteins_meta.index.name = "identifier"
proteins_meta

	Source	ProteinName	GeneName
identifier
sp\|A5A613\|YCIY_ECOLI	sp	A5A613	YCIY
sp\|P00350\|6PGD_ECOLI	sp	P00350	6PGD
sp\|P00363\|FRDA_ECOLI	sp	P00363	FRDA
sp\|P00370\|DHE4_ECOLI	sp	P00370	DHE4
sp\|P00393\|NDH_ECOLI	sp	P00393	NDH
...	...	...	...
sp\|Q57261\|TRUD_ECOLI	sp	Q57261	TRUD
sp\|Q59385-2\|COPA_ECOLI	sp	Q59385-2	COPA
sp\|Q59385\|COPA_ECOLI	sp	Q59385	COPA
sp\|Q7DFV3\|YMGG_ECOLI	sp	Q7DFV3	YMGG
sp\|Q93K97\|ADPP_ECOLI	sp	Q93K97	ADPP

2269 rows × 3 columns

For later in the enrichment analysis let’s replace the protein identifier from the Fasta file with the UNIPROT ID

proteins.columns = proteins_meta["ProteinName"].rename("UniprotID")
proteins

UniprotID	A5A613	P00350	P00363	P00370	P00393	P00448	P00452	P00490	P00509	P00547	...	Q47319	Q47536	Q47622	Q47679	Q47710	Q57261	Q59385-2	Q59385	Q7DFV3	Q93K97
Reference
DMSO_rep1	27.180	28.152	30.247	27.459	26.824	25.610	NaN	27.864	29.979	26.065	...	NaN	NaN	25.343	NaN	27.038	28.411	23.555	27.640	28.513	27.223
DMSO_rep2	NaN	27.926	30.262	26.873	26.757	24.901	NaN	26.439	29.048	NaN	...	NaN	NaN	NaN	NaN	26.841	27.941	25.240	27.244	27.621	25.291
DMSO_rep3	NaN	27.653	29.970	26.600	25.442	25.054	27.172	26.382	28.777	NaN	...	NaN	NaN	24.576	NaN	26.609	27.070	NaN	27.525	27.679	24.359
DMSO_rep4	NaN	27.152	29.471	26.439	25.799	24.790	NaN	26.820	29.485	25.524	...	NaN	NaN	25.945	23.902	27.164	26.680	22.524	27.404	27.256	25.767
Suf_rep1	NaN	27.442	30.005	27.400	26.671	25.564	NaN	27.685	29.295	NaN	...	NaN	NaN	25.836	NaN	26.819	27.995	NaN	27.499	28.090	25.956
Suf_rep2	NaN	27.032	30.086	27.189	26.886	25.378	27.364	27.531	29.284	NaN	...	NaN	NaN	NaN	24.162	27.268	27.055	NaN	27.667	27.526	25.231
Suf_rep3	NaN	27.815	29.904	27.139	26.711	25.318	26.062	27.545	29.357	26.265	...	25.463	NaN	NaN	NaN	24.741	27.313	NaN	27.708	27.814	26.103
Suf_rep4	NaN	27.587	29.575	27.224	26.321	25.360	25.101	27.705	29.584	26.427	...	NaN	24.468	24.757	24.040	27.071	26.643	NaN	27.848	27.605	26.178

8 rows × 2269 columns

And let’s save a table with the data for inspection

proteins_meta.to_csv(out_dir_subsection / "proteins_identifiers.csv")
proteins.to_csv(out_dir_subsection / "proteins.csv")

Hierarchical Clustering of data#

using completely observed data only Find correlations in data

out_dir_subsection = out_dir / "1_data" / "clustermap"
out_dir_subsection.mkdir(parents=True, exist_ok=True)

_group_labels = label_encoding.values()
lut = dict(zip(_group_labels, [f"C{i}" for i in range(len(_group_labels))]))
row_colors = label_suf.map(lut).rename("group color")
row_colors

Reference
DMSO_rep1    C0
DMSO_rep2    C0
DMSO_rep3    C0
DMSO_rep4    C0
Suf_rep1     C1
Suf_rep2     C1
Suf_rep3     C1
Suf_rep4     C1
Name: group color, dtype: object

vuecore.set_font_sizes(7)
cg = sns.clustermap(
    proteins.dropna(how="any", axis=1),
    method="ward",
    row_colors=row_colors,
    figsize=(11, 6),
    robust=True,
    xticklabels=True,
    yticklabels=True,
)
fig = cg.figure
cg.ax_heatmap.set_xlabel("Proteins")
cg.ax_heatmap.set_ylabel("Sample ID")
vuecore.select_xticks(cg.ax_heatmap)
handles = [
    plt.Line2D([0], [0], marker="o", color="w", markerfacecolor=lut[name], markersize=8)
    for name in lut
]
cg.ax_cbar.legend(
    handles, _group_labels, title="Groups", loc="lower left", bbox_to_anchor=(2, 0.5)
)
fname = out_dir_subsection / "clustermap_ward.png"
# vuecore.savefig(fig, fname, pdf=True, dpi=600, tight_layout=False)
fig.savefig(
    out_dir_subsection / "clustermap_ward.png",
    bbox_inches="tight",
    dpi=300,
)

_images/4bad06fb6a61297da44d3d544ae66ea19228dd83cf9d2dcd265f72add96d5184.png

Analytical Plots#

data distribution (e.g. histogram)
coefficient of variation (CV)
number of identified proteins per sample

# ToDo: bin width functionaity: bins should match between all plots (see pimms)
ax = proteins.T.hist(layout=(2, 4), bins=20, sharex=True, sharey=True, figsize=(8, 4))

_images/3096c753a24e027bf6c815e944321d264ac206b1d1f7d7532004ee2eb5bef529.png

Coefficient of Variation (CV)#

CV = standard deviation / mean
per group

df_cvs = (
    proteins.groupby(label_suf)  # .join(metadata[grouping])
    # .agg(scipy.stats.variation)
    .agg([scipy.stats.variation, "mean"])  # .rename_axis(["feat", "stat"], axis=1)
)
df_cvs

UniprotID	A5A613		P00350		P00363		P00370		P00393		...	Q57261		Q59385-2		Q59385		Q7DFV3		Q93K97
	variation	mean	variation	mean	variation	mean	variation	mean	variation	mean	...	variation	mean	variation	mean	variation	mean	variation	mean	variation	mean
label_suf
10 µm sulforaphane	NaN	NaN	0.010	27.469	0.006	29.892	0.004	27.238	0.008	26.647	...	0.018	27.252	NaN	NaN	0.004	27.680	0.008	27.759	0.015	25.867
control	NaN	27.180	0.013	27.721	0.011	29.987	0.014	26.843	0.023	26.205	...	0.025	27.525	NaN	23.773	0.005	27.453	0.017	27.767	0.040	25.660

2 rows × 4538 columns

df_cvs = df_cvs.stack(0, future_stack=True).reset_index().dropna()
df_cvs

	label_suf	UniprotID	variation	mean
1	10 µm sulforaphane	P00350	0.010	27.469
2	10 µm sulforaphane	P00363	0.006	29.892
3	10 µm sulforaphane	P00370	0.004	27.238
4	10 µm sulforaphane	P00393	0.008	26.647
5	10 µm sulforaphane	P00448	0.004	25.405
...	...	...	...	...
4,532	control	Q47710	0.008	26.913
4,533	control	Q57261	0.025	27.525
4,535	control	Q59385	0.005	27.453
4,536	control	Q7DFV3	0.017	27.767
4,537	control	Q93K97	0.040	25.660

3144 rows × 4 columns

default_args = dict(
    facet_col="label_suf",
    # facet_row="Time",
    labels={
        "label_suf": "group",
        "variation": "CV",
    },
)
fig = px.scatter(
    data_frame=df_cvs,
    x="variation",
    y="mean",
    trendline="ols",
    **default_args,
)
fname = "cv_vs_mean"
# ? save
fig

Hierarchical Clustering of normalized data#

using completely observed data only Checkout the recipe on normalization methods.

normalization_method = "median"
X = acore.normalization.normalize_data(
    proteins.dropna(how="any", axis=1), normalization_method
)
X

UniprotID	P00350	P00363	P00370	P00393	P00448	P00490	P00509	P00550	P00561	P00562	...	Q46868	Q46893	Q46920	Q46948	Q47147	Q47710	Q57261	Q59385	Q7DFV3	Q93K97
Reference
DMSO_rep1	27.858	29.953	27.165	26.530	25.316	27.570	29.685	26.429	26.826	27.150	...	29.139	27.865	27.059	26.919	26.365	26.744	28.117	27.346	28.219	26.929
DMSO_rep2	28.064	30.399	27.011	26.894	25.039	26.576	29.185	25.603	26.598	26.720	...	29.868	27.814	26.840	27.330	26.516	26.978	28.078	27.381	27.758	25.429
DMSO_rep3	27.854	30.170	26.800	25.643	25.254	26.582	28.977	27.217	26.693	27.039	...	29.720	28.852	27.007	27.230	26.680	26.809	27.271	27.725	27.879	24.559
DMSO_rep4	27.319	29.638	26.606	25.966	24.957	26.987	29.652	26.835	26.685	26.529	...	28.653	28.294	27.308	27.574	26.441	27.331	26.847	27.571	27.423	25.935
Suf_rep1	27.261	29.824	27.219	26.490	25.383	27.504	29.114	26.637	26.764	27.028	...	29.604	28.407	26.867	27.706	26.765	26.638	27.815	27.318	27.910	25.775
Suf_rep2	27.029	30.084	27.187	26.884	25.375	27.529	29.281	27.030	27.099	26.015	...	28.981	28.579	26.920	27.505	26.613	27.266	27.053	27.665	27.523	25.228
Suf_rep3	27.817	29.907	27.141	26.714	25.321	27.548	29.359	27.011	26.773	26.364	...	28.755	28.379	27.096	27.254	26.284	24.743	27.316	27.711	27.817	26.106
Suf_rep4	27.533	29.521	27.169	26.266	25.306	27.650	29.529	26.921	26.931	26.808	...	27.866	28.746	27.093	27.622	26.954	27.017	26.589	27.793	27.551	26.123

8 rows × 1446 columns

X.median(axis="columns")

Reference
DMSO_rep1   27.278
DMSO_rep2   27.278
DMSO_rep3   27.278
DMSO_rep4   27.278
Suf_rep1    27.278
Suf_rep2    27.278
Suf_rep3    27.278
Suf_rep4    27.278
dtype: float64

vuecore.set_font_sizes(7)
cg = sns.clustermap(
    X,
    method="ward",
    row_colors=row_colors,
    figsize=(11, 6),
    robust=True,
    xticklabels=True,
    yticklabels=True,
)
fig = cg.figure
cg.ax_heatmap.set_xlabel("Proteins")
cg.ax_heatmap.set_ylabel("Sample ID")
vuecore.select_xticks(cg.ax_heatmap)
handles = [
    plt.Line2D([0], [0], marker="o", color="w", markerfacecolor=lut[name], markersize=8)
    for name in lut
]
cg.ax_cbar.legend(
    handles, _group_labels, title="Groups", loc="lower left", bbox_to_anchor=(2, 0.5)
)
fname = out_dir_subsection / "clustermap_ward.png"
# vuecore.savefig(fig, fname, pdf=True, dpi=600, tight_layout=False)
fig.savefig(
    out_dir_subsection / f"clustermap_ward_{normalization_method}.png",
    bbox_inches="tight",
    dpi=300,
)

_images/ff383dccd7e80624435931351261a7603a998225d19ad67c6e97aa127a5e7112.png

Differential Regulation#

out_dir_subsection = out_dir / "2_differential_regulation"
out_dir_subsection.mkdir(parents=True, exist_ok=True)

Retain all proteins with at least 3 observations in each group#

this is a requirement for a standard t-test
you could look into imputation methods to fill in missing values)
- protein in at least two samples per group?
- missing all in one condition?

Let’s not impute, but filter for proteins with at least 3 observations in each group

group_counts = proteins.groupby(label_suf).count()
group_counts

UniprotID	A5A613	P00350	P00363	P00370	P00393	P00448	P00452	P00490	P00509	P00547	...	Q47319	Q47536	Q47622	Q47679	Q47710	Q57261	Q59385-2	Q59385	Q7DFV3	Q93K97
label_suf
10 µm sulforaphane	0	4	4	4	4	4	3	4	4	2	...	1	1	2	2	4	4	0	4	4	4
control	1	4	4	4	4	4	1	4	4	2	...	0	0	3	1	4	4	3	4	4	4

2 rows × 2269 columns

Then we can filter the proteins to only those with at least 3 observations in each grou

mask = group_counts.groupby("label_suf").transform(lambda x: x >= 3).all(axis=0)
mask

UniprotID
A5A613      False
P00350       True
P00363       True
P00370       True
P00393       True
            ...  
Q57261       True
Q59385-2    False
Q59385       True
Q7DFV3       True
Q93K97       True
Length: 2269, dtype: bool

view = proteins.loc[:, mask].join(label_suf)
group = "label_suf"
diff_reg = acore.differential_regulation.run_anova(
    view,
    alpha=0.15,
    drop_cols=[],
    subject=None,
    group=group,
).sort_values("pvalue", ascending=True)
diff_reg["rejected"] = diff_reg["rejected"].astype(bool)
diff_reg.sort_values("pvalue")

	identifier	T-statistics	pvalue	mean_group1	mean_group2	std(group1)	std(group2)	log2FC	test	correction	padj	rejected	group1	group2	FC	-log10 pvalue	Method
957	P10384	-11.514	0.000	26.430	27.475	0.131	0.088	-1.045	t-Test	FDR correction BH	0.044	True	control	10 µm sulforaphane	0.485	4.589	Unpaired t-test
1,678	Q46835	-8.146	0.000	26.723	27.677	0.194	0.058	-0.954	t-Test	FDR correction BH	0.127	True	control	10 µm sulforaphane	0.516	3.735	Unpaired t-test
120	P08200	-7.471	0.000	28.759	29.455	0.072	0.145	-0.696	t-Test	FDR correction BH	0.127	True	control	10 µm sulforaphane	0.617	3.527	Unpaired t-test
1,542	P75726	7.460	0.000	28.082	26.852	0.222	0.180	1.230	t-Test	FDR correction BH	0.127	True	control	10 µm sulforaphane	2.345	3.524	Unpaired t-test
38	P02943	-7.006	0.000	26.117	27.920	0.428	0.124	-1.802	t-Test	FDR correction BH	0.143	True	control	10 µm sulforaphane	0.287	3.375	Unpaired t-test
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
983	P14407	-0.003	0.998	28.561	28.562	0.393	0.656	-0.001	t-Test	FDR correction BH	0.999	False	control	10 µm sulforaphane	0.999	0.001	Unpaired t-test
560	P0ABB0	0.002	0.998	31.306	31.305	0.318	0.051	0.000	t-Test	FDR correction BH	0.999	False	control	10 µm sulforaphane	1.000	0.001	Unpaired t-test
1,531	P69797	0.002	0.998	30.255	30.254	0.391	0.328	0.001	t-Test	FDR correction BH	0.999	False	control	10 µm sulforaphane	1.000	0.001	Unpaired t-test
452	P0A9L3	-0.002	0.998	30.452	30.453	0.433	0.106	-0.001	t-Test	FDR correction BH	0.999	False	control	10 µm sulforaphane	1.000	0.001	Unpaired t-test
1,059	P23538	-0.002	0.999	26.283	26.284	0.109	0.192	-0.000	t-Test	FDR correction BH	0.999	False	control	10 µm sulforaphane	1.000	0.001	Unpaired t-test

1696 rows × 17 columns

diff_reg.sort_values("pvalue").head(20)

	identifier	T-statistics	pvalue	mean_group1	mean_group2	std(group1)	std(group2)	log2FC	test	correction	padj	rejected	group1	group2	FC	-log10 pvalue	Method
957	P10384	-11.514	0.000	26.430	27.475	0.131	0.088	-1.045	t-Test	FDR correction BH	0.044	True	control	10 µm sulforaphane	0.485	4.589	Unpaired t-test
1,678	Q46835	-8.146	0.000	26.723	27.677	0.194	0.058	-0.954	t-Test	FDR correction BH	0.127	True	control	10 µm sulforaphane	0.516	3.735	Unpaired t-test
120	P08200	-7.471	0.000	28.759	29.455	0.072	0.145	-0.696	t-Test	FDR correction BH	0.127	True	control	10 µm sulforaphane	0.617	3.527	Unpaired t-test
1,542	P75726	7.460	0.000	28.082	26.852	0.222	0.180	1.230	t-Test	FDR correction BH	0.127	True	control	10 µm sulforaphane	2.345	3.524	Unpaired t-test
38	P02943	-7.006	0.000	26.117	27.920	0.428	0.124	-1.802	t-Test	FDR correction BH	0.143	True	control	10 µm sulforaphane	0.287	3.375	Unpaired t-test
1,455	P62768	-6.288	0.001	28.999	29.682	0.147	0.118	-0.682	t-Test	FDR correction BH	0.213	False	control	10 µm sulforaphane	0.623	3.123	Unpaired t-test
1,508	P68066	-5.856	0.001	31.551	32.216	0.101	0.169	-0.665	t-Test	FDR correction BH	0.235	False	control	10 µm sulforaphane	0.630	2.960	Unpaired t-test
579	P0ABK5	-5.841	0.001	29.630	30.524	0.246	0.100	-0.894	t-Test	FDR correction BH	0.235	False	control	10 µm sulforaphane	0.538	2.955	Unpaired t-test
1,099	P25437	-5.304	0.002	29.088	29.744	0.183	0.112	-0.657	t-Test	FDR correction BH	0.324	False	control	10 µm sulforaphane	0.634	2.739	Unpaired t-test
1,108	P25539	-5.230	0.002	26.862	27.166	0.061	0.080	-0.304	t-Test	FDR correction BH	0.324	False	control	10 µm sulforaphane	0.810	2.708	Unpaired t-test
553	P0AB80	-5.149	0.002	28.177	28.853	0.205	0.098	-0.675	t-Test	FDR correction BH	0.324	False	control	10 µm sulforaphane	0.626	2.674	Unpaired t-test
1,382	P45577	-5.069	0.002	29.028	29.467	0.049	0.142	-0.439	t-Test	FDR correction BH	0.324	False	control	10 µm sulforaphane	0.738	2.640	Unpaired t-test
950	P0CK95	-4.969	0.003	26.408	27.282	0.295	0.076	-0.874	t-Test	FDR correction BH	0.330	False	control	10 µm sulforaphane	0.546	2.597	Unpaired t-test
610	P0AC33	-4.725	0.003	26.764	27.356	0.206	0.069	-0.592	t-Test	FDR correction BH	0.364	False	control	10 µm sulforaphane	0.663	2.489	Unpaired t-test
80	P06721	-4.719	0.003	25.529	26.244	0.198	0.172	-0.715	t-Test	FDR correction BH	0.364	False	control	10 µm sulforaphane	0.609	2.487	Unpaired t-test
1,268	P36943	6.160	0.004	NaN	28.375	NaN	0.133	NaN	t-Test	FDR correction BH	0.364	False	control	10 µm sulforaphane	NaN	2.397	Unpaired t-test
812	P0AEZ9	-4.510	0.004	28.393	28.921	0.178	0.098	-0.528	t-Test	FDR correction BH	0.364	False	control	10 µm sulforaphane	0.693	2.392	Unpaired t-test
873	P0AFU8	-4.479	0.004	28.742	29.429	0.230	0.132	-0.686	t-Test	FDR correction BH	0.364	False	control	10 µm sulforaphane	0.621	2.377	Unpaired t-test
491	P0AA39	-4.438	0.004	25.367	26.097	0.271	0.087	-0.730	t-Test	FDR correction BH	0.364	False	control	10 µm sulforaphane	0.603	2.358	Unpaired t-test
1,057	P23331	-4.404	0.005	27.976	28.253	0.077	0.078	-0.277	t-Test	FDR correction BH	0.364	False	control	10 µm sulforaphane	0.825	2.342	Unpaired t-test

diff_reg.plot(x="log2FC", y="-log10 pvalue", kind="scatter", title=group)

<Axes: title={'center': 'label_suf'}, xlabel='log2FC', ylabel='-log10 pvalue'>

_images/1c0c0407a54d2c29e0eb12210702676dd4584652a42566f5d63f7e3ee1aea983.png

Interactive Volcano Plot#

str_cols = diff_reg.dtypes[diff_reg.dtypes == "object"].index.tolist()
hover_data = {
    "rejected": ":.0f",
    **{
        c: ":.4f"
        for c in [
            "padj",
            "FC",
        ]
    },
    **{c: True for c in str_cols},
}
fig = px.scatter(
    diff_reg,
    x="log2FC",
    y="-log10 pvalue",
    color="rejected",
    hover_data=hover_data,
    width=1200,
    height=800,
    title=f"Volcano plot for {view_name}s",
)
fig

Save result to subsection folder

fig.write_json(
    out_dir_subsection / "0_volcano_plot.json",
    pretty=False,
)
diff_reg.to_csv(out_dir_subsection / "1_differential_regulation.csv")

Enrichment Analysis#

out_dir_subsection = out_dir / "uniprot_annotations"
out_dir_subsection.mkdir(parents=True, exist_ok=True)

Fetch the annotations from UniProt API.#

fname_annotations = out_dir_subsection / "annotations.csv"
try:
    annotations = pd.read_csv(fname_annotations, index_col=0)
    print(f"Loaded annotations from {fname_annotations}")
except FileNotFoundError:
    print(f"Fetching annotations for {proteins.columns.size} UniProt IDs.")
    FIELDS = "go_p,go_c,go_f"
    annotations = fetch_annotations(proteins.columns, fields=FIELDS)
    annotations = process_annotations(annotations, fields=FIELDS)
    # cache the annotations
    fname_annotations.parent.mkdir(exist_ok=True, parents=True)
    annotations.to_csv(fname_annotations, index=True)

annotations

Fetching annotations for 2269 UniProt IDs.
Fetched: 500 / 2269
Fetched: 1000 / 2269
Fetched: 1500 / 2269
Fetched: 2000 / 2269
Fetched: 2269 / 2269

	identifier	source	annotation
3	P00350	Gene Ontology (biological process)	D-gluconate metabolic process [GO:0019521]
4	P00350	Gene Ontology (cellular component)	cytosol [GO:0005829]
5	P00350	Gene Ontology (molecular function)	guanosine tetraphosphate binding [GO:0097216]
6	P00350	Gene Ontology (molecular function)	identical protein binding [GO:0042802]
7	P00350	Gene Ontology (molecular function)	NADP binding [GO:0050661]
...	...	...	...
15,989	Q93K97	Gene Ontology (molecular function)	magnesium ion binding [GO:0000287]
15,990	Q93K97	Gene Ontology (molecular function)	protein homodimerization activity [GO:0042803]
15,991	Q93K97	Gene Ontology (molecular function)	pyrophosphatase activity [GO:0016462]
15,992	Q93K97	Gene Ontology (biological process)	response to heat [GO:0009408]
15,993	Q93K97	Gene Ontology (biological process)	ribose phosphate metabolic process [GO:0019693]

14994 rows × 3 columns

Run the enrichment analysis#

background is the set of identified proteins in the experiment (not the whole proteome of the organisim, here E. coli)
The enrichment is performed separately for the up- and down-regulated proteins (‘rejected’), which are few in our example where we had to set the adjusted p-value to 0.15.

In the enrichment we set the cutoff for the adjusted p-value to 0.2, which is a bit arbitrary to see some results.

enriched = acore.enrichment_analysis.run_up_down_regulation_enrichment(
    regulation_data=diff_reg,
    annotation=annotations,
    min_detected_in_set=1,
    lfc_cutoff=1,
    pval_col="padj",  # toggle if it does not work
    correction_alpha=0.2,  # adjust the p-value to see more or less results
)
enriched

	terms	identifiers	foreground	background	foreground_pop	background_pop	pvalue	padj	rejected	direction	comparison
0	citrate CoA-transferase activity [GO:0008814]	P75726	1	0	1	1,696	0.001	0.002	True	upregulated	control~10 µm sulforaphane
1	citrate metabolic process [GO:0006101]	P75726	1	2	1	1,696	0.002	0.002	True	upregulated	control~10 µm sulforaphane
3	ATP-independent citrate lyase complex [GO:0009...	P75726	1	2	1	1,696	0.002	0.002	True	upregulated	control~10 µm sulforaphane
4	acetyl-CoA metabolic process [GO:0006084]	P75726	1	2	1	1,696	0.002	0.002	True	upregulated	control~10 µm sulforaphane
5	citrate (pro-3S)-lyase activity [GO:0008815]	P75726	1	1	1	1,696	0.001	0.002	True	upregulated	control~10 µm sulforaphane
2	cytoplasm [GO:0005737]	P75726	1	29	1	1,696	0.018	0.018	True	upregulated	control~10 µm sulforaphane
0	long-chain fatty acid transporting porin acti...	P10384	1	0	2	1,696	0.001	0.002	True	downregulated	control~10 µm sulforaphane
2	maltodextrin transmembrane transporter activi...	P02943	1	0	2	1,696	0.001	0.002	True	downregulated	control~10 µm sulforaphane
3	maltose transmembrane transport [GO:1904981]	P02943	1	0	2	1,696	0.001	0.002	True	downregulated	control~10 µm sulforaphane
4	maltose transmembrane transporter activity [G...	P02943	1	0	2	1,696	0.001	0.002	True	downregulated	control~10 µm sulforaphane
5	maltose transporting porin activity [GO:0015481]	P02943	1	0	2	1,696	0.001	0.002	True	downregulated	control~10 µm sulforaphane
14	cell outer membrane [GO:0009279]	P02943,P10384	2	43	2	1,696	0.001	0.002	True	downregulated	control~10 µm sulforaphane
15	ligand-gated channel activity [GO:0022834]	P10384	1	0	2	1,696	0.001	0.002	True	downregulated	control~10 µm sulforaphane
8	polysaccharide transport [GO:0015774]	P02943	1	0	2	1,696	0.001	0.002	True	downregulated	control~10 µm sulforaphane
16	long-chain fatty acid transport [GO:0015909]	P10384	1	0	2	1,696	0.001	0.002	True	downregulated	control~10 µm sulforaphane
13	carbohydrate transmembrane transporter activit...	P02943	1	1	2	1,696	0.002	0.003	True	downregulated	control~10 µm sulforaphane
1	maltodextrin transmembrane transport [GO:0042...	P02943	1	1	2	1,696	0.002	0.003	True	downregulated	control~10 µm sulforaphane
11	virus receptor activity [GO:0001618]	P02943	1	1	2	1,696	0.002	0.003	True	downregulated	control~10 µm sulforaphane
7	outer membrane protein complex [GO:0106234]	P02943	1	2	2	1,696	0.004	0.004	True	downregulated	control~10 µm sulforaphane
6	monoatomic ion transport [GO:0006811]	P02943	1	2	2	1,696	0.004	0.004	True	downregulated	control~10 µm sulforaphane
10	porin activity [GO:0015288]	P02943	1	5	2	1,696	0.007	0.008	True	downregulated	control~10 µm sulforaphane
9	pore complex [GO:0046930]	P02943	1	5	2	1,696	0.007	0.008	True	downregulated	control~10 µm sulforaphane
12	DNA damage response [GO:0006974]	P02943	1	51	2	1,696	0.060	0.060	True	downregulated	control~10 µm sulforaphane

fig = get_enrichment_plots(
    enriched,
    identifier="anything",  # ToDo: figure out what this does
    args=dict(title="Enrichment Analysis"),
)
fig = fig[0]
fig.write_json(
    out_dir_subsection / "enrichment_analysis.json",
    pretty=True,
)
fig

Check for Maltose Uptake#

out_dir_subsection = out_dir / "3_maltose_uptake"
out_dir_subsection.mkdir(parents=True, exist_ok=True)

apply filtering of ‘differentially abundant proteins’ as described in the paper

Differentially abundant proteins were determined as those with log2 fold-change

1 and < -1, and p < 0.05 This means not multiple testing correction was applied.

view = diff_reg.query("pvalue < 0.05 and FC > 1")  # .shape[0]
view.to_csv(
    out_dir_subsection / "1_differently_regulated_as_in_paper.csv",
    index=False,
)
view

	identifier	T-statistics	pvalue	mean_group1	mean_group2	std(group1)	std(group2)	log2FC	test	correction	padj	rejected	group1	group2	FC	-log10 pvalue	Method
1,542	P75726	7.460	0.000	28.082	26.852	0.222	0.180	1.230	t-Test	FDR correction BH	0.127	True	control	10 µm sulforaphane	2.345	3.524	Unpaired t-test
1,553	P75825	4.344	0.005	26.903	25.378	0.327	0.512	1.525	t-Test	FDR correction BH	0.364	False	control	10 µm sulforaphane	2.877	2.314	Unpaired t-test
580	P0ABK9	4.238	0.005	26.363	25.637	0.230	0.187	0.726	t-Test	FDR correction BH	0.364	False	control	10 µm sulforaphane	1.654	2.264	Unpaired t-test
1,288	P37349	3.868	0.008	25.935	25.251	0.296	0.079	0.683	t-Test	FDR correction BH	0.438	False	control	10 µm sulforaphane	1.606	2.082	Unpaired t-test
1,610	P76440	3.796	0.009	27.182	26.639	0.233	0.084	0.543	t-Test	FDR correction BH	0.438	False	control	10 µm sulforaphane	1.457	2.045	Unpaired t-test
961	P11349	3.722	0.010	27.459	26.939	0.223	0.094	0.521	t-Test	FDR correction BH	0.445	False	control	10 µm sulforaphane	1.435	2.007	Unpaired t-test
1,632	P77252	3.624	0.011	26.540	26.012	0.131	0.215	0.528	t-Test	FDR correction BH	0.445	False	control	10 µm sulforaphane	1.441	1.957	Unpaired t-test
142	P09152	3.542	0.012	26.822	26.250	0.228	0.162	0.572	t-Test	FDR correction BH	0.445	False	control	10 µm sulforaphane	1.487	1.914	Unpaired t-test
1,098	P25397	3.191	0.019	27.632	27.003	0.177	0.292	0.629	t-Test	FDR correction BH	0.577	False	control	10 µm sulforaphane	1.546	1.726	Unpaired t-test
261	P0A7E3	3.181	0.019	27.335	26.631	0.146	0.355	0.704	t-Test	FDR correction BH	0.577	False	control	10 µm sulforaphane	1.629	1.720	Unpaired t-test
64	P05637	2.910	0.027	26.621	26.034	0.262	0.232	0.587	t-Test	FDR correction BH	0.589	False	control	10 µm sulforaphane	1.503	1.569	Unpaired t-test
443	P0A9I1	2.907	0.027	28.099	26.997	0.555	0.351	1.102	t-Test	FDR correction BH	0.589	False	control	10 µm sulforaphane	2.147	1.567	Unpaired t-test
1,535	P69829	2.725	0.034	26.906	26.284	0.058	0.391	0.621	t-Test	FDR correction BH	0.614	False	control	10 µm sulforaphane	1.538	1.463	Unpaired t-test
1,190	P31142	2.582	0.042	28.110	27.149	0.203	0.612	0.961	t-Test	FDR correction BH	0.647	False	control	10 µm sulforaphane	1.947	1.380	Unpaired t-test
1,094	P24232	2.577	0.042	27.136	26.324	0.530	0.129	0.811	t-Test	FDR correction BH	0.647	False	control	10 µm sulforaphane	1.755	1.377	Unpaired t-test
1,275	P37127	2.483	0.048	25.203	24.641	0.226	0.320	0.561	t-Test	FDR correction BH	0.680	False	control	10 µm sulforaphane	1.476	1.322	Unpaired t-test

Let’s find the proteins highlighted in the volcano plot in Figure 3.

highlighted_genes = ["LamB", "MalE", "Malk", "CitF", "CitT", "CitE", "Frd"]
highlighted_genes = "|".join([p.upper() for p in highlighted_genes])
highlighted_genes = proteins_meta.query(
    f"`GeneName`.str.contains('{highlighted_genes}')"
)
highlighted_genes

	Source	ProteinName	GeneName
identifier
sp\|P00363\|FRDA_ECOLI	sp	P00363	FRDA
sp\|P02943\|LAMB_ECOLI	sp	P02943	LAMB
sp\|P0A8Q0\|FRDC_ECOLI	sp	P0A8Q0	FRDC
sp\|P0A8Q3\|FRDD_ECOLI	sp	P0A8Q3	FRDD
sp\|P0A9I1\|CITE_ECOLI	sp	P0A9I1	CITE
sp\|P0AC47\|FRDB_ECOLI	sp	P0AC47	FRDB
sp\|P0AE74\|CITT_ECOLI	sp	P0AE74	CITT
sp\|P0AEX9\|MALE_ECOLI	sp	P0AEX9	MALE
sp\|P68187\|MALK_ECOLI	sp	P68187	MALK

highlighted_proteins = "|".join([p.upper() for p in highlighted_genes["ProteinName"]])
view = diff_reg.query(f"`identifier`.str.contains('{highlighted_proteins}')")
view = view.set_index("identifier").join(proteins_meta.set_index("ProteinName"))
view.to_csv(
    out_dir_subsection / "2_highlighted_proteins_in_figure3.csv",
    index=False,
)
sel_cols = [
    "identifier",
    "GeneName",
    "log2FC",
    "pvalue",
    "padj",
    "rejected",
    "group1",
    "group2",
    "Method",
]
view.reset_index()[sel_cols].sort_values("log2FC", ascending=False)

	identifier	GeneName	log2FC	pvalue	padj	rejected	group1	group2	Method
1	P0A9I1	CITE	1.102	0.027	0.589	False	control	10 µm sulforaphane	Unpaired t-test
2	P0AE74	CITT	0.872	0.110	0.754	False	control	10 µm sulforaphane	Unpaired t-test
3	P0A8Q0	FRDC	0.815	0.159	0.756	False	control	10 µm sulforaphane	Unpaired t-test
4	P00363	FRDA	0.095	0.676	0.965	False	control	10 µm sulforaphane	Unpaired t-test
5	P0AC47	FRDB	0.062	0.808	0.981	False	control	10 µm sulforaphane	Unpaired t-test
6	P0AEX9	MALE	0.012	0.959	0.997	False	control	10 µm sulforaphane	Unpaired t-test
0	P02943	LAMB	-1.802	0.000	0.143	True	control	10 µm sulforaphane	Unpaired t-test

Let’s see their original data

view_proteins = (
    (
        proteins[highlighted_genes["ProteinName"].to_list()].T.join(
            proteins_meta.set_index("ProteinName")["GeneName"]
        )
    )
    .set_index("GeneName", append=True)
    .T
)  # to check]
view_proteins.to_csv(
    out_dir_subsection / "3_highlighted_proteins_in_figure3_intensities.csv",
    index=True,
)
view_proteins

UniprotID	P00363	P02943	P0A8Q0	P0A8Q3	P0A9I1	P0AC47	P0AE74	P0AEX9	P68187
GeneName	FRDA	LAMB	FRDC	FRDD	CITE	FRDB	CITT	MALE	MALK
DMSO_rep1	30.247	26.752	30.348	27.389	29.031	31.223	27.544	27.099	NaN
DMSO_rep2	30.262	26.012	28.815	27.061	27.933	30.947	26.364	26.785	24.881
DMSO_rep3	29.970	26.151	30.569	27.291	27.569	30.664	25.292	26.854	NaN
DMSO_rep4	29.471	25.555	29.097	24.906	27.865	30.815	26.608	27.037	25.423
Suf_rep1	30.005	27.844	28.350	26.703	27.296	31.360	25.485	26.518	26.047
Suf_rep2	30.086	28.095	29.444	NaN	26.744	30.317	25.691	26.864	24.964
Suf_rep3	29.904	27.971	29.172	NaN	26.563	30.818	25.549	27.536	25.545
Suf_rep4	29.575	27.769	28.602	25.294	27.386	30.907	25.594	26.809	26.134

How to explain the differences?

Data Analysis PXD040621

Contents