Data Analysis PXD040621#

Plan

  • read data and log2 transform intensity values

  • aggregate peptide intensities to protein intensities

  • format data from long to wide format

  • remove contaminant proteins

  • check for missing values

  • Clustermap of sample and proteins

  • differential analysis (Volcano Plots)

  • Enrichment Analysis

  • check for maltose update pathway (Fig. 3 in paper)

%pip install acore vuecore "pingouin<0.6.0"

Hide code cell output

Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: acore in /home/runner/.local/lib/python3.12/site-packages (0.2.2)
Requirement already satisfied: vuecore in /home/runner/.local/lib/python3.12/site-packages (0.1.1)
Requirement already satisfied: pingouin<0.6.0 in /home/runner/.local/lib/python3.12/site-packages (0.5.5)
Requirement already satisfied: numpy in /home/runner/.local/lib/python3.12/site-packages (from acore) (2.4.6)
Requirement already satisfied: pandas in /home/runner/.local/lib/python3.12/site-packages (from acore) (2.3.3)
Requirement already satisfied: dsp-pandas in /home/runner/.local/lib/python3.12/site-packages (from acore) (0.0.5)
Requirement already satisfied: scipy in /home/runner/.local/lib/python3.12/site-packages (from acore) (1.17.1)
Requirement already satisfied: scikit-learn>=1.5 in /home/runner/.local/lib/python3.12/site-packages (from acore) (1.9.0)
Requirement already satisfied: networkx in /home/runner/.local/lib/python3.12/site-packages (from acore) (3.6.1)
Requirement already satisfied: biopython in /home/runner/.local/lib/python3.12/site-packages (from acore) (1.87)
Requirement already satisfied: combat in /home/runner/.local/lib/python3.12/site-packages (from acore) (0.3.3)
Requirement already satisfied: gseapy!=1.1.5 in /home/runner/.local/lib/python3.12/site-packages (from acore) (1.2.1)
Requirement already satisfied: kmapper in /home/runner/.local/lib/python3.12/site-packages (from acore) (2.1.0)
Requirement already satisfied: lifelines in /home/runner/.local/lib/python3.12/site-packages (from acore) (0.30.3)
Requirement already satisfied: python-louvain in /home/runner/.local/lib/python3.12/site-packages (from acore) (0.16)
Requirement already satisfied: PyWGCNA!=2.2.0 in /home/runner/.local/lib/python3.12/site-packages (from acore) (1.16.0)
Requirement already satisfied: snfpy in /home/runner/.local/lib/python3.12/site-packages (from acore) (0.2.2)
Requirement already satisfied: umap-learn in /home/runner/.local/lib/python3.12/site-packages (from acore) (0.5.12)
Requirement already satisfied: statsmodels in /home/runner/.local/lib/python3.12/site-packages (from acore) (0.14.6)
Requirement already satisfied: inmoose in /home/runner/.local/lib/python3.12/site-packages (from acore) (0.9.1)
Requirement already satisfied: requests in /usr/lib/python3/dist-packages (from acore) (2.31.0)
Requirement already satisfied: rarfile in /home/runner/.local/lib/python3.12/site-packages (from acore) (4.2)
Requirement already satisfied: pandera in /home/runner/.local/lib/python3.12/site-packages (from acore) (0.31.1)
Requirement already satisfied: plotly in /home/runner/.local/lib/python3.12/site-packages (from vuecore) (5.24.1)
Requirement already satisfied: beautifulsoup4 in /home/runner/.local/lib/python3.12/site-packages (from vuecore) (4.14.3)
Requirement already satisfied: dash in /home/runner/.local/lib/python3.12/site-packages (from vuecore) (4.2.0)
Requirement already satisfied: matplotlib in /home/runner/.local/lib/python3.12/site-packages (from vuecore) (3.10.9)
Requirement already satisfied: kaleido in /home/runner/.local/lib/python3.12/site-packages (from vuecore) (0.2.1)
Requirement already satisfied: pyvis in /home/runner/.local/lib/python3.12/site-packages (from vuecore) (0.3.2)
Requirement already satisfied: wordcloud in /home/runner/.local/lib/python3.12/site-packages (from vuecore) (1.9.6)
Requirement already satisfied: cyjupyter in /home/runner/.local/lib/python3.12/site-packages (from vuecore) (0.2.0)
Requirement already satisfied: nltk in /home/runner/.local/lib/python3.12/site-packages (from vuecore) (3.9.4)
Requirement already satisfied: webweb in /home/runner/.local/lib/python3.12/site-packages (from vuecore) (0.0.37)
Requirement already satisfied: dash-cytoscape in /home/runner/.local/lib/python3.12/site-packages (from vuecore) (1.0.2)
Requirement already satisfied: pydantic in /home/runner/.local/lib/python3.12/site-packages (from vuecore) (2.13.4)
Requirement already satisfied: nbformat>=4.2.0 in /home/runner/.local/lib/python3.12/site-packages (from vuecore) (5.10.4)
Requirement already satisfied: pandas-flavor in /home/runner/.local/lib/python3.12/site-packages (from pingouin<0.6.0) (0.8.1)
Requirement already satisfied: seaborn in /home/runner/.local/lib/python3.12/site-packages (from pingouin<0.6.0) (0.13.2)
Requirement already satisfied: tabulate in /home/runner/.local/lib/python3.12/site-packages (from pingouin<0.6.0) (0.10.0)
Requirement already satisfied: contourpy>=1.0.1 in /home/runner/.local/lib/python3.12/site-packages (from matplotlib->vuecore) (1.3.3)
Requirement already satisfied: cycler>=0.10 in /home/runner/.local/lib/python3.12/site-packages (from matplotlib->vuecore) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /home/runner/.local/lib/python3.12/site-packages (from matplotlib->vuecore) (4.63.0)
Requirement already satisfied: kiwisolver>=1.3.1 in /home/runner/.local/lib/python3.12/site-packages (from matplotlib->vuecore) (1.5.0)
Requirement already satisfied: packaging>=20.0 in /home/runner/.local/lib/python3.12/site-packages (from matplotlib->vuecore) (26.2)
Requirement already satisfied: pillow>=8 in /home/runner/.local/lib/python3.12/site-packages (from matplotlib->vuecore) (12.2.0)
Requirement already satisfied: pyparsing>=3 in /usr/lib/python3/dist-packages (from matplotlib->vuecore) (3.1.1)
Requirement already satisfied: python-dateutil>=2.7 in /usr/lib/python3/dist-packages (from matplotlib->vuecore) (2.8.2)
Requirement already satisfied: fastjsonschema>=2.15 in /home/runner/.local/lib/python3.12/site-packages (from nbformat>=4.2.0->vuecore) (2.21.2)
Requirement already satisfied: jsonschema>=2.6 in /home/runner/.local/lib/python3.12/site-packages (from nbformat>=4.2.0->vuecore) (4.26.0)
Requirement already satisfied: jupyter-core!=5.0.*,>=4.12 in /home/runner/.local/lib/python3.12/site-packages (from nbformat>=4.2.0->vuecore) (5.9.1)
Requirement already satisfied: traitlets>=5.1 in /home/runner/.local/lib/python3.12/site-packages (from nbformat>=4.2.0->vuecore) (5.15.0)
Requirement already satisfied: pytz>=2020.1 in /usr/lib/python3/dist-packages (from pandas->acore) (2024.1)
Requirement already satisfied: tzdata>=2022.7 in /home/runner/.local/lib/python3.12/site-packages (from pandas->acore) (2026.2)
Requirement already satisfied: anndata>=0.8.0 in /home/runner/.local/lib/python3.12/site-packages (from PyWGCNA!=2.2.0->acore) (0.12.16)
Requirement already satisfied: biomart>=0.9.2 in /home/runner/.local/lib/python3.12/site-packages (from PyWGCNA!=2.2.0->acore) (0.9.2)
Requirement already satisfied: reactome2py>=3.0.0 in /home/runner/.local/lib/python3.12/site-packages (from PyWGCNA!=2.2.0->acore) (3.0.0)
Requirement already satisfied: rsrc>=0.1.3 in /home/runner/.local/lib/python3.12/site-packages (from PyWGCNA!=2.2.0->acore) (0.1.3)
Requirement already satisfied: setuptools>=59.8.0 in /usr/lib/python3/dist-packages (from PyWGCNA!=2.2.0->acore) (68.1.2)
Requirement already satisfied: ipython>=5.3.0 in /home/runner/.local/lib/python3.12/site-packages (from pyvis->vuecore) (9.14.0)
Requirement already satisfied: jinja2>=2.9.6 in /usr/lib/python3/dist-packages (from pyvis->vuecore) (3.1.2)
Requirement already satisfied: jsonpickle>=1.4.1 in /home/runner/.local/lib/python3.12/site-packages (from pyvis->vuecore) (4.1.2)
Requirement already satisfied: joblib>=1.4.0 in /home/runner/.local/lib/python3.12/site-packages (from scikit-learn>=1.5->acore) (1.5.3)
Requirement already satisfied: narwhals>=2.0.1 in /home/runner/.local/lib/python3.12/site-packages (from scikit-learn>=1.5->acore) (2.22.0)
Requirement already satisfied: threadpoolctl>=3.5.0 in /home/runner/.local/lib/python3.12/site-packages (from scikit-learn>=1.5->acore) (3.6.0)
Requirement already satisfied: patsy>=0.5.6 in /home/runner/.local/lib/python3.12/site-packages (from statsmodels->acore) (1.0.2)
Requirement already satisfied: soupsieve>=1.6.1 in /home/runner/.local/lib/python3.12/site-packages (from beautifulsoup4->vuecore) (2.8.4)
Requirement already satisfied: typing-extensions>=4.0.0 in /home/runner/.local/lib/python3.12/site-packages (from beautifulsoup4->vuecore) (4.15.0)
Requirement already satisfied: mpmath>=1.1.0 in /home/runner/.local/lib/python3.12/site-packages (from combat->acore) (1.4.1)
Requirement already satisfied: ipywidgets>=7.0.0 in /home/runner/.local/lib/python3.12/site-packages (from cyjupyter->vuecore) (8.1.8)
Requirement already satisfied: Flask<3.2,>=1.0.4 in /home/runner/.local/lib/python3.12/site-packages (from dash->vuecore) (3.1.3)
Requirement already satisfied: Werkzeug<3.2 in /home/runner/.local/lib/python3.12/site-packages (from dash->vuecore) (3.1.8)
Requirement already satisfied: importlib-metadata in /home/runner/.local/lib/python3.12/site-packages (from dash->vuecore) (9.0.0)
Requirement already satisfied: retrying in /home/runner/.local/lib/python3.12/site-packages (from dash->vuecore) (1.4.2)
Requirement already satisfied: nest-asyncio in /home/runner/.local/lib/python3.12/site-packages (from dash->vuecore) (1.6.0)
Requirement already satisfied: janus>=1.0.0 in /home/runner/.local/lib/python3.12/site-packages (from dash->vuecore) (2.0.0)
Requirement already satisfied: tenacity>=6.2.0 in /home/runner/.local/lib/python3.12/site-packages (from plotly->vuecore) (9.1.4)
Requirement already satisfied: openpyxl in /home/runner/.local/lib/python3.12/site-packages (from dsp-pandas->acore) (3.1.5)
Requirement already satisfied: fastcluster in /home/runner/.local/lib/python3.12/site-packages (from inmoose->acore) (1.3.0)
Requirement already satisfied: autograd>=1.5 in /home/runner/.local/lib/python3.12/site-packages (from lifelines->acore) (1.8.0)
Requirement already satisfied: autograd-gamma>=0.3 in /home/runner/.local/lib/python3.12/site-packages (from lifelines->acore) (0.5.0)
Requirement already satisfied: formulaic>=0.2.2 in /home/runner/.local/lib/python3.12/site-packages (from lifelines->acore) (1.2.2)
Requirement already satisfied: click in /usr/lib/python3/dist-packages (from nltk->vuecore) (8.1.6)
Requirement already satisfied: regex>=2021.8.3 in /home/runner/.local/lib/python3.12/site-packages (from nltk->vuecore) (2026.5.9)
Requirement already satisfied: tqdm in /home/runner/.local/lib/python3.12/site-packages (from nltk->vuecore) (4.67.3)
Requirement already satisfied: xarray in /home/runner/.local/lib/python3.12/site-packages (from pandas-flavor->pingouin<0.6.0) (2026.4.0)
Requirement already satisfied: typeguard in /home/runner/.local/lib/python3.12/site-packages (from pandera->acore) (4.5.2)
Requirement already satisfied: typing_inspect>=0.6.0 in /home/runner/.local/lib/python3.12/site-packages (from pandera->acore) (0.9.0)
Requirement already satisfied: annotated-types>=0.6.0 in /home/runner/.local/lib/python3.12/site-packages (from pydantic->vuecore) (0.7.0)
Requirement already satisfied: pydantic-core==2.46.4 in /home/runner/.local/lib/python3.12/site-packages (from pydantic->vuecore) (2.46.4)
Requirement already satisfied: typing-inspection>=0.4.2 in /home/runner/.local/lib/python3.12/site-packages (from pydantic->vuecore) (0.4.2)
Requirement already satisfied: numba>=0.51.2 in /home/runner/.local/lib/python3.12/site-packages (from umap-learn->acore) (0.65.1)
Requirement already satisfied: pynndescent>=0.5 in /home/runner/.local/lib/python3.12/site-packages (from umap-learn->acore) (0.6.0)
Requirement already satisfied: array-api-compat>=1.7.1 in /home/runner/.local/lib/python3.12/site-packages (from anndata>=0.8.0->PyWGCNA!=2.2.0->acore) (1.14.0)
Requirement already satisfied: h5py>=3.8 in /home/runner/.local/lib/python3.12/site-packages (from anndata>=0.8.0->PyWGCNA!=2.2.0->acore) (3.16.0)
Requirement already satisfied: legacy-api-wrap in /home/runner/.local/lib/python3.12/site-packages (from anndata>=0.8.0->PyWGCNA!=2.2.0->acore) (1.5)
Requirement already satisfied: natsort in /home/runner/.local/lib/python3.12/site-packages (from anndata>=0.8.0->PyWGCNA!=2.2.0->acore) (8.4.0)
Requirement already satisfied: scverse-misc>=0.0.3 in /home/runner/.local/lib/python3.12/site-packages (from anndata>=0.8.0->PyWGCNA!=2.2.0->acore) (0.0.7)
Requirement already satisfied: zarr!=3.0.*,>=2.18.7 in /home/runner/.local/lib/python3.12/site-packages (from anndata>=0.8.0->PyWGCNA!=2.2.0->acore) (3.2.1)
Requirement already satisfied: blinker>=1.9.0 in /home/runner/.local/lib/python3.12/site-packages (from Flask<3.2,>=1.0.4->dash->vuecore) (1.9.0)
Requirement already satisfied: itsdangerous>=2.2.0 in /home/runner/.local/lib/python3.12/site-packages (from Flask<3.2,>=1.0.4->dash->vuecore) (2.2.0)
Requirement already satisfied: markupsafe>=2.1.1 in /usr/lib/python3/dist-packages (from Flask<3.2,>=1.0.4->dash->vuecore) (2.1.5)
Requirement already satisfied: interface-meta>=1.2.0 in /home/runner/.local/lib/python3.12/site-packages (from formulaic>=0.2.2->lifelines->acore) (2.0.1)
Requirement already satisfied: wrapt>=1.0 in /home/runner/.local/lib/python3.12/site-packages (from formulaic>=0.2.2->lifelines->acore) (2.2.1)
Requirement already satisfied: decorator>=5.1.0 in /home/runner/.local/lib/python3.12/site-packages (from ipython>=5.3.0->pyvis->vuecore) (5.3.1)
Requirement already satisfied: ipython-pygments-lexers>=1.0.0 in /home/runner/.local/lib/python3.12/site-packages (from ipython>=5.3.0->pyvis->vuecore) (1.1.1)
Requirement already satisfied: jedi>=0.18.2 in /home/runner/.local/lib/python3.12/site-packages (from ipython>=5.3.0->pyvis->vuecore) (0.20.0)
Requirement already satisfied: matplotlib-inline>=0.1.6 in /home/runner/.local/lib/python3.12/site-packages (from ipython>=5.3.0->pyvis->vuecore) (0.2.2)
Requirement already satisfied: pexpect>4.6 in /usr/lib/python3/dist-packages (from ipython>=5.3.0->pyvis->vuecore) (4.9.0)
Requirement already satisfied: prompt_toolkit<3.1.0,>=3.0.41 in /home/runner/.local/lib/python3.12/site-packages (from ipython>=5.3.0->pyvis->vuecore) (3.0.52)
Requirement already satisfied: psutil>=7 in /home/runner/.local/lib/python3.12/site-packages (from ipython>=5.3.0->pyvis->vuecore) (7.2.2)
Requirement already satisfied: pygments>=2.14.0 in /usr/lib/python3/dist-packages (from ipython>=5.3.0->pyvis->vuecore) (2.17.2)
Requirement already satisfied: stack_data>=0.6.0 in /home/runner/.local/lib/python3.12/site-packages (from ipython>=5.3.0->pyvis->vuecore) (0.6.3)
Requirement already satisfied: comm>=0.1.3 in /home/runner/.local/lib/python3.12/site-packages (from ipywidgets>=7.0.0->cyjupyter->vuecore) (0.2.3)
Requirement already satisfied: widgetsnbextension~=4.0.14 in /home/runner/.local/lib/python3.12/site-packages (from ipywidgets>=7.0.0->cyjupyter->vuecore) (4.0.15)
Requirement already satisfied: jupyterlab_widgets~=3.0.15 in /home/runner/.local/lib/python3.12/site-packages (from ipywidgets>=7.0.0->cyjupyter->vuecore) (3.0.16)
Requirement already satisfied: attrs>=22.2.0 in /usr/lib/python3/dist-packages (from jsonschema>=2.6->nbformat>=4.2.0->vuecore) (23.2.0)
Requirement already satisfied: jsonschema-specifications>=2023.03.6 in /home/runner/.local/lib/python3.12/site-packages (from jsonschema>=2.6->nbformat>=4.2.0->vuecore) (2025.9.1)
Requirement already satisfied: referencing>=0.28.4 in /home/runner/.local/lib/python3.12/site-packages (from jsonschema>=2.6->nbformat>=4.2.0->vuecore) (0.37.0)
Requirement already satisfied: rpds-py>=0.25.0 in /home/runner/.local/lib/python3.12/site-packages (from jsonschema>=2.6->nbformat>=4.2.0->vuecore) (2026.5.1)
Requirement already satisfied: platformdirs>=2.5 in /usr/local/lib/python3.12/dist-packages (from jupyter-core!=5.0.*,>=4.12->nbformat>=4.2.0->vuecore) (4.9.6)
Requirement already satisfied: llvmlite<0.48,>=0.47.0dev0 in /home/runner/.local/lib/python3.12/site-packages (from numba>=0.51.2->umap-learn->acore) (0.47.0)
Requirement already satisfied: json5>=0.8.4 in /home/runner/.local/lib/python3.12/site-packages (from reactome2py>=3.0.0->PyWGCNA!=2.2.0->acore) (0.14.0)
Requirement already satisfied: memoir>=0.0.3 in /home/runner/.local/lib/python3.12/site-packages (from rsrc>=0.1.3->PyWGCNA!=2.2.0->acore) (0.0.3)
Requirement already satisfied: reprit>=0.3.0 in /home/runner/.local/lib/python3.12/site-packages (from rsrc>=0.1.3->PyWGCNA!=2.2.0->acore) (0.9.0)
Requirement already satisfied: mypy-extensions>=0.3.0 in /home/runner/.local/lib/python3.12/site-packages (from typing_inspect>=0.6.0->pandera->acore) (1.1.0)
Requirement already satisfied: zipp>=3.20 in /home/runner/.local/lib/python3.12/site-packages (from importlib-metadata->dash->vuecore) (4.1.0)
Requirement already satisfied: et-xmlfile in /home/runner/.local/lib/python3.12/site-packages (from openpyxl->dsp-pandas->acore) (2.0.0)
Requirement already satisfied: parso<0.9.0,>=0.8.6 in /home/runner/.local/lib/python3.12/site-packages (from jedi>=0.18.2->ipython>=5.3.0->pyvis->vuecore) (0.8.7)
Requirement already satisfied: wcwidth in /home/runner/.local/lib/python3.12/site-packages (from prompt_toolkit<3.1.0,>=3.0.41->ipython>=5.3.0->pyvis->vuecore) (0.7.0)
Requirement already satisfied: session-info2 in /home/runner/.local/lib/python3.12/site-packages (from scverse-misc>=0.0.3->anndata>=0.8.0->PyWGCNA!=2.2.0->acore) (0.4.1)
Requirement already satisfied: executing>=1.2.0 in /home/runner/.local/lib/python3.12/site-packages (from stack_data>=0.6.0->ipython>=5.3.0->pyvis->vuecore) (2.2.1)
Requirement already satisfied: asttokens>=2.1.0 in /home/runner/.local/lib/python3.12/site-packages (from stack_data>=0.6.0->ipython>=5.3.0->pyvis->vuecore) (3.0.1)
Requirement already satisfied: pure-eval in /home/runner/.local/lib/python3.12/site-packages (from stack_data>=0.6.0->ipython>=5.3.0->pyvis->vuecore) (0.2.3)
Requirement already satisfied: donfig>=0.8 in /home/runner/.local/lib/python3.12/site-packages (from zarr!=3.0.*,>=2.18.7->anndata>=0.8.0->PyWGCNA!=2.2.0->acore) (0.8.1.post1)
Requirement already satisfied: google-crc32c>=1.5 in /home/runner/.local/lib/python3.12/site-packages (from zarr!=3.0.*,>=2.18.7->anndata>=0.8.0->PyWGCNA!=2.2.0->acore) (1.8.0)
Requirement already satisfied: numcodecs>=0.14 in /home/runner/.local/lib/python3.12/site-packages (from zarr!=3.0.*,>=2.18.7->anndata>=0.8.0->PyWGCNA!=2.2.0->acore) (0.16.5)
Requirement already satisfied: pyyaml in /home/runner/.local/lib/python3.12/site-packages (from donfig>=0.8->zarr!=3.0.*,>=2.18.7->anndata>=0.8.0->PyWGCNA!=2.2.0->acore) (6.0.3)
Note: you may need to restart the kernel to use updated packages.
from pathlib import Path

import acore.differential_regulation
import acore.enrichment_analysis
import acore.normalization
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import scipy.stats
import seaborn as sns
import vuecore
from acore.io.uniprot import fetch_annotations, process_annotations
from vuecore.viz import get_enrichment_plots

Paramters#

  • file_in: input file with the quantified peptide data in MSstats format as provided by quantms

  • out_dir: output directory for the results of the data analysis, which will be used later for the report generation with VueGen.

The file will be loaded from the online repository if it is not present.

file_in: str = Path(
    "data/PXD040621/processed/PXD040621.sdrf_openms_design_msstats_in.csv"
) # input file with the quantified peptide data in MSstats format as provided by quantms
out_dir = "data/PXD040621/report/" # output directory for the results of the data analysis, which will be used later for the report generation with VueGen.
min_obs_per_group: int = 3 # minimum number of observations per group for a protein to be included in the differential regulation analysis

We have the following columns in the data:

cols = [
    "ProteinName",
    "PeptideSequence",
    "PrecursorCharge",
    "FragmentIon",
    "ProductCharge",
    "IsotopeLabelType",
    "Condition",
    "BioReplicate",
    "Run",
    "Intensity",
    "Reference",
]
if not file_in.exists():
    file_in = (
        "https://raw.githubusercontent.com/biosustain/dsp_course_proteomics_intro/HEAD"
        "/data/PXD040621/processed/PXD040621.sdrf_openms_design_msstats_in.csv"
    )
df = pd.read_csv(file_in, sep=",", header=0)  # .set_index([])
df.head()
ProteinName PeptideSequence PrecursorCharge FragmentIon ProductCharge IsotopeLabelType Condition BioReplicate Run Intensity Reference
0 sp|P00959|SYM_ECOLI AAAAPVTGPLADDPIQETITFDDFAK 2 NaN 0 L Control 1 1 201,065,600.000 20220830_JL-4884_Forster_Ecoli_DMSO_rep1_EG-1....
1 sp|P00959|SYM_ECOLI AAAAPVTGPLADDPIQETITFDDFAK 2 NaN 0 L Control 2 2 74,844,780.000 20220830_JL-4884_Forster_Ecoli_DMSO_rep2_EG-2....
2 sp|P00959|SYM_ECOLI AAAAPVTGPLADDPIQETITFDDFAK 2 NaN 0 L Control 3 3 67,591,390.000 20220830_JL-4884_Forster_Ecoli_DMSO_rep3_EG-3....
3 sp|P00959|SYM_ECOLI AAAAPVTGPLADDPIQETITFDDFAK 2 NaN 0 L Control 4 4 76,388,810.000 20220830_JL-4884_Forster_Ecoli_DMSO_rep4_EG-4....
4 sp|P00959|SYM_ECOLI AAAAPVTGPLADDPIQETITFDDFAK 2 NaN 0 L Sulforaphane 5 5 116,247,100.000 20220830_JL-4884_Forster_Ecoli_Suf_rep1_EG-5.mzML

VueGen report output folder#

define the output folder for our VueGen report which we will create later

out_dir = "data/PXD040621/report/"
out_dir = Path(out_dir)
out_dir.mkdir(parents=True, exist_ok=True)

Log2 transform the intensity values#

  • log2 transformations are common for lognormal distributed data

df["Intensity"] = np.log2(df["Intensity"].astype(float))
df.head()
ProteinName PeptideSequence PrecursorCharge FragmentIon ProductCharge IsotopeLabelType Condition BioReplicate Run Intensity Reference
0 sp|P00959|SYM_ECOLI AAAAPVTGPLADDPIQETITFDDFAK 2 NaN 0 L Control 1 1 27.583 20220830_JL-4884_Forster_Ecoli_DMSO_rep1_EG-1....
1 sp|P00959|SYM_ECOLI AAAAPVTGPLADDPIQETITFDDFAK 2 NaN 0 L Control 2 2 26.157 20220830_JL-4884_Forster_Ecoli_DMSO_rep2_EG-2....
2 sp|P00959|SYM_ECOLI AAAAPVTGPLADDPIQETITFDDFAK 2 NaN 0 L Control 3 3 26.010 20220830_JL-4884_Forster_Ecoli_DMSO_rep3_EG-3....
3 sp|P00959|SYM_ECOLI AAAAPVTGPLADDPIQETITFDDFAK 2 NaN 0 L Control 4 4 26.187 20220830_JL-4884_Forster_Ecoli_DMSO_rep4_EG-4....
4 sp|P00959|SYM_ECOLI AAAAPVTGPLADDPIQETITFDDFAK 2 NaN 0 L Sulforaphane 5 5 26.793 20220830_JL-4884_Forster_Ecoli_Suf_rep1_EG-5.mzML

Exploratory and Data Quality Plots (peptide level)#

Shows the overal distribution of the log2-normal intensities per sample

df["BioReplicate"] = df["BioReplicate"].replace({5: 1, 6: 2, 7: 3, 8: 4})
fg = sns.displot(
    data=df.rename(columns={"BioReplicate": "Rep", "Condition": "C."}),
    x="Intensity",
    col="C.",
    row="Rep",
    # hue="Reactor_ID",
    kind="kde",
    height=2,
    aspect=1.1,
)
_images/7c4aee0a65142170db4e7405dabca3c3c02d687760abfaa378faedca29e22556.png

Aggregate the peptide intensities to protein intensities#

  • we use the median of the peptide intensities for each protein

There are more sophisticated ways to do this, e.g. using MaxLFQ, iBAQ, FlashLFQ, DirectLFQ, etc.

  • shorten sample name for readability

proteins = (
    df.groupby(["ProteinName", "Reference"])["Intensity"].median().unstack(level=0)
)
proteins.index = proteins.index.str.split("_").str[4:6].str.join("_")
proteins
ProteinName CON_ENSEMBL:ENSBTAP00000024462;CON_ENSEMBL:ENSBTAP00000024466 CON_O76013 CON_P00761 CON_P01966 CON_P02070 CON_P02533 CON_P02538 CON_P02662 CON_P02663 CON_P02666 ... sp|Q47319|TAPT_ECOLI sp|Q47536|YAIP_ECOLI sp|Q47622|SAPA_ECOLI sp|Q47679|YAFV_ECOLI sp|Q47710|YQJK_ECOLI sp|Q57261|TRUD_ECOLI sp|Q59385-2|COPA_ECOLI sp|Q59385|COPA_ECOLI sp|Q7DFV3|YMGG_ECOLI sp|Q93K97|ADPP_ECOLI
Reference
DMSO_rep1 NaN NaN 31.055 24.202 26.890 28.336 25.828 NaN 26.762 NaN ... NaN NaN 25.343 NaN 27.038 28.411 23.555 27.640 28.513 27.223
DMSO_rep2 NaN 25.647 30.117 25.711 NaN 25.595 26.210 NaN 27.593 NaN ... NaN NaN NaN NaN 26.841 27.941 25.240 27.244 27.621 25.291
DMSO_rep3 25.529 NaN 33.767 NaN NaN 26.992 25.441 28.129 28.895 29.559 ... NaN NaN 24.576 NaN 26.609 27.070 NaN 27.525 27.679 24.359
DMSO_rep4 NaN NaN 33.702 NaN NaN 27.205 26.093 NaN 27.764 NaN ... NaN NaN 25.945 23.902 27.164 26.680 22.524 27.404 27.256 25.767
Suf_rep1 NaN NaN 34.068 NaN NaN NaN 24.442 NaN 25.938 NaN ... NaN NaN 25.836 NaN 26.819 27.995 NaN 27.499 28.090 25.956
Suf_rep2 NaN NaN 33.252 NaN 26.571 26.495 25.070 NaN NaN NaN ... NaN NaN NaN 24.162 27.268 27.055 NaN 27.667 27.526 25.231
Suf_rep3 NaN NaN 31.844 25.927 NaN 26.249 28.169 22.968 26.466 NaN ... 25.463 NaN NaN NaN 24.741 27.313 NaN 27.708 27.814 26.103
Suf_rep4 NaN NaN 33.857 NaN NaN 27.208 24.690 NaN NaN NaN ... NaN 24.468 24.757 24.040 27.071 26.643 NaN 27.848 27.605 26.178

8 rows × 2306 columns

Remove contaminant proteins#

Remove the contaminant proteins which were added to the fasta file used in the data processing. Contaminant proteins are e.g. creation which gets into the sample from the human skin or hair when the sample is prepared.

These are filtered out as they are most of the time not relevant, but a contamination.

decoy_proteins = proteins.filter(like="CON_", axis=1)
proteins = proteins.drop(decoy_proteins.columns, axis=1)
proteins
ProteinName sp|A5A613|YCIY_ECOLI sp|P00350|6PGD_ECOLI sp|P00363|FRDA_ECOLI sp|P00370|DHE4_ECOLI sp|P00393|NDH_ECOLI sp|P00448|SODM_ECOLI sp|P00452|RIR1_ECOLI sp|P00490|PHSM_ECOLI sp|P00509|AAT_ECOLI sp|P00547|KHSE_ECOLI ... sp|Q47319|TAPT_ECOLI sp|Q47536|YAIP_ECOLI sp|Q47622|SAPA_ECOLI sp|Q47679|YAFV_ECOLI sp|Q47710|YQJK_ECOLI sp|Q57261|TRUD_ECOLI sp|Q59385-2|COPA_ECOLI sp|Q59385|COPA_ECOLI sp|Q7DFV3|YMGG_ECOLI sp|Q93K97|ADPP_ECOLI
Reference
DMSO_rep1 27.180 28.152 30.247 27.459 26.824 25.610 NaN 27.864 29.979 26.065 ... NaN NaN 25.343 NaN 27.038 28.411 23.555 27.640 28.513 27.223
DMSO_rep2 NaN 27.926 30.262 26.873 26.757 24.901 NaN 26.439 29.048 NaN ... NaN NaN NaN NaN 26.841 27.941 25.240 27.244 27.621 25.291
DMSO_rep3 NaN 27.653 29.970 26.600 25.442 25.054 27.172 26.382 28.777 NaN ... NaN NaN 24.576 NaN 26.609 27.070 NaN 27.525 27.679 24.359
DMSO_rep4 NaN 27.152 29.471 26.439 25.799 24.790 NaN 26.820 29.485 25.524 ... NaN NaN 25.945 23.902 27.164 26.680 22.524 27.404 27.256 25.767
Suf_rep1 NaN 27.442 30.005 27.400 26.671 25.564 NaN 27.685 29.295 NaN ... NaN NaN 25.836 NaN 26.819 27.995 NaN 27.499 28.090 25.956
Suf_rep2 NaN 27.032 30.086 27.189 26.886 25.378 27.364 27.531 29.284 NaN ... NaN NaN NaN 24.162 27.268 27.055 NaN 27.667 27.526 25.231
Suf_rep3 NaN 27.815 29.904 27.139 26.711 25.318 26.062 27.545 29.357 26.265 ... 25.463 NaN NaN NaN 24.741 27.313 NaN 27.708 27.814 26.103
Suf_rep4 NaN 27.587 29.575 27.224 26.321 25.360 25.101 27.705 29.584 26.427 ... NaN 24.468 24.757 24.040 27.071 26.643 NaN 27.848 27.605 26.178

8 rows × 2269 columns

Factor variable#

Create a label for each sample based on the metadata.

  • we will use a string in the sample name, but you can see how the metadata is organized

This could be provided in a separate metadata file.

meta = df[["Condition", "BioReplicate", "Run", "Reference"]].drop_duplicates()
meta
Condition BioReplicate Run Reference
0 Control 1 1 20220830_JL-4884_Forster_Ecoli_DMSO_rep1_EG-1....
1 Control 2 2 20220830_JL-4884_Forster_Ecoli_DMSO_rep2_EG-2....
2 Control 3 3 20220830_JL-4884_Forster_Ecoli_DMSO_rep3_EG-3....
3 Control 4 4 20220830_JL-4884_Forster_Ecoli_DMSO_rep4_EG-4....
4 Sulforaphane 1 5 20220830_JL-4884_Forster_Ecoli_Suf_rep1_EG-5.mzML
5 Sulforaphane 2 6 20220830_JL-4884_Forster_Ecoli_Suf_rep2_EG-6.mzML
6 Sulforaphane 3 7 20220830_JL-4884_Forster_Ecoli_Suf_rep3_EG-7.mzML
7 Sulforaphane 4 8 20220830_JL-4884_Forster_Ecoli_Suf_rep4_EG-8.mzML
label_encoding = {0: "control", 1: "10 µm sulforaphane"}
label_suf = pd.Series(
    proteins.index.str.contains("Suf_").astype(int),
    index=proteins.index,
    name="label_suf",
    dtype=np.int8,
).map(label_encoding)
label_suf
Reference
DMSO_rep1               control
DMSO_rep2               control
DMSO_rep3               control
DMSO_rep4               control
Suf_rep1     10 µm sulforaphane
Suf_rep2     10 µm sulforaphane
Suf_rep3     10 µm sulforaphane
Suf_rep4     10 µm sulforaphane
Name: label_suf, dtype: object

Plot the data completeness for each protein.#

  • see the how many proteins are observed across how many samples.

We create a subfolder in our results folder to keep it organized for the report generation later.

Hide code cell source

view_name = "Protein"
out_dir_subsection = out_dir / "1_data" / "completeness"
out_dir_subsection.mkdir(parents=True, exist_ok=True)

From the line plot (in staircase shape) you can see that over half of the proteins are observed in all 8 samples.

Hide code cell source

view_name = "Protein"
ax = (
    proteins.notna()
    .sum()
    .sort_values()
    .plot(
        rot=45,
        ylabel=f"Number of Samples {view_name.lower()} was observed in",
    )
)
ax.get_figure().savefig(
    out_dir_subsection / f"data_completeness_step_plot.png",
    bbox_inches="tight",
    dpi=300,
)
_images/9586a6d63ef5af6776290c428553c98320b36bc04d170d9e62a6d676b1ecc2bc.png

The bar plot shows the same information, but it is easier to see the exact number of proteins observed in how many samples.

Hide code cell source

view_name = "Protein"
ax = (
    proteins.notna()
    .sum()
    .value_counts()
    .sort_index(ascending=False)
    .plot(
        kind="bar",
        title=f"Data Completeness per {view_name}",
        xlabel=f"Number of Samples {view_name.lower()} was observed in",
        ylabel=f"Number of {view_name}s",
        color="steelblue",
        figsize=(10, 6),
    )
)
ax.get_figure().savefig(
    out_dir_subsection / f"data_completeness_bar_plot.png",
    bbox_inches="tight",
    dpi=300,
)
_images/9e5575e5e9c96f47998811b8ce66bcf51b40c622465326ee1561ecf5521476bd.png

Protein identifiers metadata#

Keep the protein identifiers metadata. Will be used to query UNIProt.

Hide code cell source

# Explode column names to examine split by '|'
proteins_meta = (
    proteins.columns.str.split("|", expand=True)
    .to_frame()
    .dropna(how="any", axis=1)
    .reset_index(drop=True)
)
proteins_meta.columns = ["Source", "ProteinName", "GeneName"]
proteins_meta["GeneName"] = proteins_meta["GeneName"].str.split("_").str[0]
proteins_meta.index = proteins.columns
proteins_meta.index.name = "identifier"
proteins_meta
Source ProteinName GeneName
identifier
sp|A5A613|YCIY_ECOLI sp A5A613 YCIY
sp|P00350|6PGD_ECOLI sp P00350 6PGD
sp|P00363|FRDA_ECOLI sp P00363 FRDA
sp|P00370|DHE4_ECOLI sp P00370 DHE4
sp|P00393|NDH_ECOLI sp P00393 NDH
... ... ... ...
sp|Q57261|TRUD_ECOLI sp Q57261 TRUD
sp|Q59385-2|COPA_ECOLI sp Q59385-2 COPA
sp|Q59385|COPA_ECOLI sp Q59385 COPA
sp|Q7DFV3|YMGG_ECOLI sp Q7DFV3 YMGG
sp|Q93K97|ADPP_ECOLI sp Q93K97 ADPP

2269 rows × 3 columns

For later in the enrichment analysis let’s replace the protein identifier from the Fasta file with the UNIPROT ID

proteins.columns = proteins_meta["ProteinName"].rename("UniprotID")
proteins
UniprotID A5A613 P00350 P00363 P00370 P00393 P00448 P00452 P00490 P00509 P00547 ... Q47319 Q47536 Q47622 Q47679 Q47710 Q57261 Q59385-2 Q59385 Q7DFV3 Q93K97
Reference
DMSO_rep1 27.180 28.152 30.247 27.459 26.824 25.610 NaN 27.864 29.979 26.065 ... NaN NaN 25.343 NaN 27.038 28.411 23.555 27.640 28.513 27.223
DMSO_rep2 NaN 27.926 30.262 26.873 26.757 24.901 NaN 26.439 29.048 NaN ... NaN NaN NaN NaN 26.841 27.941 25.240 27.244 27.621 25.291
DMSO_rep3 NaN 27.653 29.970 26.600 25.442 25.054 27.172 26.382 28.777 NaN ... NaN NaN 24.576 NaN 26.609 27.070 NaN 27.525 27.679 24.359
DMSO_rep4 NaN 27.152 29.471 26.439 25.799 24.790 NaN 26.820 29.485 25.524 ... NaN NaN 25.945 23.902 27.164 26.680 22.524 27.404 27.256 25.767
Suf_rep1 NaN 27.442 30.005 27.400 26.671 25.564 NaN 27.685 29.295 NaN ... NaN NaN 25.836 NaN 26.819 27.995 NaN 27.499 28.090 25.956
Suf_rep2 NaN 27.032 30.086 27.189 26.886 25.378 27.364 27.531 29.284 NaN ... NaN NaN NaN 24.162 27.268 27.055 NaN 27.667 27.526 25.231
Suf_rep3 NaN 27.815 29.904 27.139 26.711 25.318 26.062 27.545 29.357 26.265 ... 25.463 NaN NaN NaN 24.741 27.313 NaN 27.708 27.814 26.103
Suf_rep4 NaN 27.587 29.575 27.224 26.321 25.360 25.101 27.705 29.584 26.427 ... NaN 24.468 24.757 24.040 27.071 26.643 NaN 27.848 27.605 26.178

8 rows × 2269 columns

And let’s save a table with the data for inspection

proteins_meta.to_csv(out_dir_subsection / "proteins_identifiers.csv")
proteins.to_csv(out_dir_subsection / "proteins.csv")

Hierarchical Clustering of data#

  • using completely observed data only find correlations in data

out_dir_subsection = out_dir / "1_data" / "clustermap"
out_dir_subsection.mkdir(parents=True, exist_ok=True)
_group_labels = label_encoding.values()
lut = dict(zip(_group_labels, [f"C{i}" for i in range(len(_group_labels))]))
row_colors = label_suf.map(lut).rename("group color")
row_colors
Reference
DMSO_rep1    C0
DMSO_rep2    C0
DMSO_rep3    C0
DMSO_rep4    C0
Suf_rep1     C1
Suf_rep2     C1
Suf_rep3     C1
Suf_rep4     C1
Name: group color, dtype: object
vuecore.set_font_sizes(7)
cg = sns.clustermap(
    proteins.dropna(how="any", axis=1),
    method="ward",
    row_colors=row_colors,
    figsize=(11, 6),
    robust=True,
    xticklabels=True,
    yticklabels=True,
)
fig = cg.figure
cg.ax_heatmap.set_xlabel("Proteins")
cg.ax_heatmap.set_ylabel("Sample ID")
vuecore.select_xticks(cg.ax_heatmap)
handles = [
    plt.Line2D([0], [0], marker="o", color="w", markerfacecolor=lut[name], markersize=8)
    for name in lut
]
cg.ax_cbar.legend(
    handles, _group_labels, title="Groups", loc="lower left", bbox_to_anchor=(2, 0.5)
)
fname = out_dir_subsection / "clustermap_ward.png"
# vuecore.savefig(fig, fname, pdf=True, dpi=600, tight_layout=False)
fig.savefig(
    out_dir_subsection / "clustermap_ward.png",
    bbox_inches="tight",
    dpi=300,
)
_images/b455229f270134321620e89cbdd3d97761a5d0e7bb5690b6328dd3d824ea5144.png

Quality Control Plots (protein level)#

  • data distribution (histogram and boxplot) to identify drift

  • coefficient of variation (CV)

  • number of identified proteins per sample

Data distribution (histogram)#

min_int, max_int = int(proteins.min().min()), int(proteins.max().max())
bins = range(min_int, max_int+1, 1)
ax = proteins.T.hist(layout=(2, 4), bins=bins, sharex=True, sharey=True, figsize=(8, 4))
_images/e32d335737fc7659047c375feed80b9468204a2b079d5ac9ecd61ecd037b5614.png

Data distribution (boxplot)#

fig, ax = plt.subplots(figsize=(8, 4))
proteins.T.boxplot(ax=ax)
ax.set_xlabel("Sample ID")
ax.set_ylabel("log2 intensity")
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right")
fname = out_dir_subsection / "boxplot_log2_unnormalized.png"
vuecore.savefig(fig, fname, pdf=True, dpi=600, tight_layout=False)
print(f"Saved boxplot to {fname}")
Saved boxplot to data/PXD040621/report/1_data/clustermap/boxplot_log2_unnormalized.png
_images/85ebf858146e04593cbf095d2ab6c3092eaf5f9977028861eb6cc514e0a04095.png

Coefficient of Variation (CV)#

  • CV = standard deviation / mean $\( CV = \frac{\sigma}{\mu} \)$

  • per group

Use scipy.stats.variation to calculate the CV.

df_cvs = (
    proteins.groupby(label_suf)  # .join(metadata[grouping])
    # .agg(scipy.stats.variation)
    .agg([scipy.stats.variation, "mean"])  # .rename_axis(["feat", "stat"], axis=1)
)
df_cvs
UniprotID A5A613 P00350 P00363 P00370 P00393 ... Q57261 Q59385-2 Q59385 Q7DFV3 Q93K97
variation mean variation mean variation mean variation mean variation mean ... variation mean variation mean variation mean variation mean variation mean
label_suf
10 µm sulforaphane NaN NaN 0.010 27.469 0.006 29.892 0.004 27.238 0.008 26.647 ... 0.018 27.252 NaN NaN 0.004 27.680 0.008 27.759 0.015 25.867
control NaN 27.180 0.013 27.721 0.011 29.987 0.014 26.843 0.023 26.205 ... 0.025 27.525 NaN 23.773 0.005 27.453 0.017 27.767 0.040 25.660

2 rows × 4538 columns

df_cvs = df_cvs.stack(0, future_stack=True).reset_index().dropna()
df_cvs
label_suf UniprotID variation mean
1 10 µm sulforaphane P00350 0.010 27.469
2 10 µm sulforaphane P00363 0.006 29.892
3 10 µm sulforaphane P00370 0.004 27.238
4 10 µm sulforaphane P00393 0.008 26.647
5 10 µm sulforaphane P00448 0.004 25.405
... ... ... ... ...
4,532 control Q47710 0.008 26.913
4,533 control Q57261 0.025 27.525
4,535 control Q59385 0.005 27.453
4,536 control Q7DFV3 0.017 27.767
4,537 control Q93K97 0.040 25.660

3144 rows × 4 columns

Visualize the relationship between the coefficient of variation (CV) and the mean intensity per group: x-axis: CV (‘variation’), y-axis: mean intensity.

Here no further filtering is needed, but many researchers filter for proteins with a mean intensity above a certain threshold (aroung 0.3). This is best done on quality control samples, e.g. a pooled sample, for large datasets.

Hide code cell source

default_args = dict(
    facet_col="label_suf",
    # facet_row="Time",
    labels={
        "label_suf": "group",
        "variation": "CV",
    },
)
fig = px.scatter(
    data_frame=df_cvs,
    x="variation",
    y="mean",
    trendline="ols",
    hover_data={"UniprotID": True, "mean": True},  # hide the index in the hover
    **default_args,
)
fname = "cv_vs_mean"
fig.write_json(
    out_dir_subsection / f"{fname}.json",
    pretty=True,
)
fig

Hierarchical Clustering of normalized data#

Let’s see the effect of normalization on the clustering.

normalization_method = "median"
X = acore.normalization.normalize_data(
    proteins.dropna(how="any", axis=1), normalization_method
)
# X = acore.normalization.normalize_data(
#     X, "zscore"
# )  # to make it more comparable across proteins, but not necessary for the clustering
X
UniprotID P00350 P00363 P00370 P00393 P00448 P00490 P00509 P00550 P00561 P00562 ... Q46868 Q46893 Q46920 Q46948 Q47147 Q47710 Q57261 Q59385 Q7DFV3 Q93K97
Reference
DMSO_rep1 27.858 29.953 27.165 26.530 25.316 27.570 29.685 26.429 26.826 27.150 ... 29.139 27.865 27.059 26.919 26.365 26.744 28.117 27.346 28.219 26.929
DMSO_rep2 28.064 30.399 27.011 26.894 25.039 26.576 29.185 25.603 26.598 26.720 ... 29.868 27.814 26.840 27.330 26.516 26.978 28.078 27.381 27.758 25.429
DMSO_rep3 27.854 30.170 26.800 25.643 25.254 26.582 28.977 27.217 26.693 27.039 ... 29.720 28.852 27.007 27.230 26.680 26.809 27.271 27.725 27.879 24.559
DMSO_rep4 27.319 29.638 26.606 25.966 24.957 26.987 29.652 26.835 26.685 26.529 ... 28.653 28.294 27.308 27.574 26.441 27.331 26.847 27.571 27.423 25.935
Suf_rep1 27.261 29.824 27.219 26.490 25.383 27.504 29.114 26.637 26.764 27.028 ... 29.604 28.407 26.867 27.706 26.765 26.638 27.815 27.318 27.910 25.775
Suf_rep2 27.029 30.084 27.187 26.884 25.375 27.529 29.281 27.030 27.099 26.015 ... 28.981 28.579 26.920 27.505 26.613 27.266 27.053 27.665 27.523 25.228
Suf_rep3 27.817 29.907 27.141 26.714 25.321 27.548 29.359 27.011 26.773 26.364 ... 28.755 28.379 27.096 27.254 26.284 24.743 27.316 27.711 27.817 26.106
Suf_rep4 27.533 29.521 27.169 26.266 25.306 27.650 29.529 26.921 26.931 26.808 ... 27.866 28.746 27.093 27.622 26.954 27.017 26.589 27.793 27.551 26.123

8 rows × 1446 columns

X.median(axis="columns")
Reference
DMSO_rep1   27.278
DMSO_rep2   27.278
DMSO_rep3   27.278
DMSO_rep4   27.278
Suf_rep1    27.278
Suf_rep2    27.278
Suf_rep3    27.278
Suf_rep4    27.278
dtype: float64

Hide code cell source

vuecore.set_font_sizes(7)
cg = sns.clustermap(
    X,
    method="ward",
    row_colors=row_colors,
    figsize=(11, 6),
    robust=True,
    xticklabels=True,
    yticklabels=True,
)
fig = cg.figure
cg.ax_heatmap.set_xlabel("Proteins")
cg.ax_heatmap.set_ylabel("Sample ID")
vuecore.select_xticks(cg.ax_heatmap)
handles = [
    plt.Line2D([0], [0], marker="o", color="w", markerfacecolor=lut[name], markersize=8)
    for name in lut
]
cg.ax_cbar.legend(
    handles, _group_labels, title="Groups", loc="lower left", bbox_to_anchor=(2, 0.5)
)
fname = out_dir_subsection / "clustermap_ward.png"
# vuecore.savefig(fig, fname, pdf=True, dpi=600, tight_layout=False)
fig.savefig(
    out_dir_subsection / f"clustermap_ward_{normalization_method}.png",
    bbox_inches="tight",
    dpi=300,
)
_images/045fbcfce7d0562b7cc150da753cc9671a1788792182f487a895bc12833e6239.png

Exercise:

  1. Try out different normalization methods and see how the clustering changes.

  2. Maybe you want to combine a sample based with a protein based normalization method.

Differential Regulation#

out_dir_subsection = out_dir / "2_differential_regulation"
out_dir_subsection.mkdir(parents=True, exist_ok=True)

Retain all proteins with at least 3 observations in each group

  • this is a requirement for a standard t-test

  • you could look into imputation methods to fill in missing values)

    • protein in at least two samples per group?

    • missing all in one condition?

Let’s not impute, but filter for proteins with at least 3 observations in each group

group_counts = proteins.groupby(label_suf).count()
group_counts
UniprotID A5A613 P00350 P00363 P00370 P00393 P00448 P00452 P00490 P00509 P00547 ... Q47319 Q47536 Q47622 Q47679 Q47710 Q57261 Q59385-2 Q59385 Q7DFV3 Q93K97
label_suf
10 µm sulforaphane 0 4 4 4 4 4 3 4 4 2 ... 1 1 2 2 4 4 0 4 4 4
control 1 4 4 4 4 4 1 4 4 2 ... 0 0 3 1 4 4 3 4 4 4

2 rows × 2269 columns

Let’s see how many observations are present by condition (group)

  • The order is dermined by the treatment group here

  • Some proteins are absent in the control group that are observed in the treatment group, which is interesting for a separate analysis (not shown).

Hide code cell source

frac_non_na_by_group = (
    proteins.join(label_suf, how="inner")
    .groupby(label_suf.name)
    .apply(lambda x: x.notna().sum(), include_groups=False)
    .T
)
frac_non_na_by_group = frac_non_na_by_group.sort_values(by=frac_non_na_by_group.columns[0])
fig, ax = plt.subplots(figsize=(7, 3))
n_groups = len(frac_non_na_by_group.columns)
jitter_span = 0.1
offsets = np.linspace(-jitter_span / 2, jitter_span / 2, n_groups) if n_groups > 1 else [0.0]
for i, g in enumerate(frac_non_na_by_group.columns):
    (frac_non_na_by_group[g] + offsets[i]).plot(
        kind="line",
        ax=ax,
        style=".",
        alpha=0.5,
    )
ax.legend(title="Group")
ax.set_ylabel("Ratio of non-missing values\n(completeness)")
ax.set_title("Number of observations")
Text(0.5, 1.0, 'Number of observations')
_images/984e9e26ea97f1c63ea038646ea7c0baba0d5eaca394be5c02da00f986aa6273.png

For now, we filter the proteins to only those with at least 3 observations in each group.

mask = group_counts.groupby("label_suf").transform(
    lambda x: x >= min_obs_per_group
).all(axis=0)
mask
UniprotID
A5A613      False
P00350       True
P00363       True
P00370       True
P00393       True
            ...  
Q57261       True
Q59385-2    False
Q59385       True
Q7DFV3       True
Q93K97       True
Length: 2269, dtype: bool

We use a simple t-test for the differential regulation analysis.

Experiment with the normalization method and it’s effect on the results.

view = X.loc[:, mask].join(label_suf)
group = label_suf.name
diff_reg = acore.differential_regulation.run_anova(
    view,
    alpha=0.15,
    drop_cols=[],
    subject=None,
    group=group,
).sort_values("pvalue", ascending=True)
diff_reg["rejected"] = diff_reg["rejected"].astype(bool)
diff_reg.sort_values("pvalue")
identifier T-Statistics pvalue mean(group1) mean(group2) std(group1) std(group2) log2FC test correction padj rejected group1 group2 FC -log10 pvalue Method
1,321 P75726 13.440 0.000 28.135 26.793 0.131 0.113 1.341 t-Test FDR correction BH 0.015 True control 10 µm sulforaphane 2.534 4.978 Unpaired t-test
1,375 P76440 10.610 0.000 27.235 26.580 0.082 0.068 0.655 t-Test FDR correction BH 0.030 True control 10 µm sulforaphane 1.575 4.384 Unpaired t-test
827 P0CK95 -9.341 0.000 26.461 27.223 0.126 0.063 -0.762 t-Test FDR correction BH 0.032 True control 10 µm sulforaphane 0.590 4.069 Unpaired t-test
513 P0ABK5 -8.911 0.000 29.683 30.465 0.064 0.138 -0.783 t-Test FDR correction BH 0.032 True control 10 µm sulforaphane 0.581 3.953 Unpaired t-test
35 P02943 -8.791 0.000 26.170 27.861 0.282 0.178 -1.691 t-Test FDR correction BH 0.032 True control 10 µm sulforaphane 0.310 3.920 Unpaired t-test
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
101 P07650 -0.003 0.998 27.861 27.861 0.191 0.123 -0.000 t-Test FDR correction BH 0.999 False control 10 µm sulforaphane 1.000 0.001 Unpaired t-test
47 P04846 0.002 0.998 25.750 25.749 0.086 0.285 0.000 t-Test FDR correction BH 0.999 False control 10 µm sulforaphane 1.000 0.001 Unpaired t-test
1,038 P31120 -0.001 0.999 28.647 28.647 0.106 0.087 -0.000 t-Test FDR correction BH 0.999 False control 10 µm sulforaphane 1.000 0.000 Unpaired t-test
1,291 P67660 0.001 0.999 26.369 26.368 0.210 0.269 0.000 t-Test FDR correction BH 0.999 False control 10 µm sulforaphane 1.000 0.000 Unpaired t-test
774 P0AG24 0.001 0.999 25.513 25.513 0.858 0.175 0.001 t-Test FDR correction BH 0.999 False control 10 µm sulforaphane 1.000 0.000 Unpaired t-test

1446 rows × 17 columns

diff_reg.sort_values("pvalue").head(20)
identifier T-Statistics pvalue mean(group1) mean(group2) std(group1) std(group2) log2FC test correction padj rejected group1 group2 FC -log10 pvalue Method
1,321 P75726 13.440 0.000 28.135 26.793 0.131 0.113 1.341 t-Test FDR correction BH 0.015 True control 10 µm sulforaphane 2.534 4.978 Unpaired t-test
1,375 P76440 10.610 0.000 27.235 26.580 0.082 0.068 0.655 t-Test FDR correction BH 0.030 True control 10 µm sulforaphane 1.575 4.384 Unpaired t-test
827 P0CK95 -9.341 0.000 26.461 27.223 0.126 0.063 -0.762 t-Test FDR correction BH 0.032 True control 10 µm sulforaphane 0.590 4.069 Unpaired t-test
513 P0ABK5 -8.911 0.000 29.683 30.465 0.064 0.138 -0.783 t-Test FDR correction BH 0.032 True control 10 µm sulforaphane 0.581 3.953 Unpaired t-test
35 P02943 -8.791 0.000 26.170 27.861 0.282 0.178 -1.691 t-Test FDR correction BH 0.032 True control 10 µm sulforaphane 0.310 3.920 Unpaired t-test
1,431 Q46835 -8.614 0.000 26.776 27.618 0.124 0.115 -0.842 t-Test FDR correction BH 0.032 True control 10 µm sulforaphane 0.558 3.871 Unpaired t-test
832 P10384 -6.345 0.001 26.482 27.416 0.213 0.140 -0.934 t-Test FDR correction BH 0.147 True control 10 µm sulforaphane 0.524 3.144 Unpaired t-test
129 P09152 6.197 0.001 26.875 26.191 0.074 0.176 0.684 t-Test FDR correction BH 0.147 True control 10 µm sulforaphane 1.607 3.090 Unpaired t-test
1,237 P60785 -5.771 0.001 28.240 28.460 0.056 0.035 -0.220 t-Test FDR correction BH 0.164 False control 10 µm sulforaphane 0.859 2.928 Unpaired t-test
1,113 P37349 5.711 0.001 25.988 25.193 0.235 0.053 0.795 t-Test FDR correction BH 0.164 False control 10 µm sulforaphane 1.735 2.904 Unpaired t-test
991 P27249 -5.709 0.001 24.921 25.351 0.067 0.112 -0.430 t-Test FDR correction BH 0.164 False control 10 µm sulforaphane 0.742 2.903 Unpaired t-test
487 P0AB67 -5.519 0.001 27.878 28.381 0.089 0.131 -0.503 t-Test FDR correction BH 0.179 False control 10 µm sulforaphane 0.705 2.827 Unpaired t-test
514 P0ABK9 5.370 0.002 26.416 25.578 0.147 0.226 0.838 t-Test FDR correction BH 0.190 False control 10 µm sulforaphane 1.787 2.767 Unpaired t-test
1,329 P75825 4.990 0.002 26.956 25.320 0.220 0.524 1.636 t-Test FDR correction BH 0.236 False control 10 µm sulforaphane 3.109 2.606 Unpaired t-test
959 P25437 -4.910 0.003 29.141 29.686 0.181 0.064 -0.545 t-Test FDR correction BH 0.236 False control 10 µm sulforaphane 0.685 2.571 Unpaired t-test
1,394 P77252 4.851 0.003 26.593 25.953 0.119 0.195 0.639 t-Test FDR correction BH 0.236 False control 10 µm sulforaphane 1.558 2.545 Unpaired t-test
27 P00968 -4.802 0.003 26.452 26.880 0.083 0.130 -0.429 t-Test FDR correction BH 0.236 False control 10 µm sulforaphane 0.743 2.524 Unpaired t-test
110 P08200 -4.789 0.003 28.812 29.396 0.182 0.107 -0.584 t-Test FDR correction BH 0.236 False control 10 µm sulforaphane 0.667 2.518 Unpaired t-test
562 P0ACD8 4.767 0.003 30.243 29.895 0.090 0.089 0.348 t-Test FDR correction BH 0.236 False control 10 µm sulforaphane 1.273 2.508 Unpaired t-test
399 P0A9I1 4.499 0.004 28.152 26.938 0.357 0.301 1.214 t-Test FDR correction BH 0.280 False control 10 µm sulforaphane 2.320 2.387 Unpaired t-test

Interactive Volcano Plot#

  • y-axis: -log10(p-value)

  • x-axis: log2(fold-change)

  • color: rejected (significant or not after multiple testing correction)

Hover over the points to see additional information

Hide code cell source

str_cols = diff_reg.dtypes[diff_reg.dtypes == "object"].index.tolist()
hover_data = {
    "rejected": ":.0f",
    **{
        c: ":.4f"
        for c in [
            "padj",
            "FC",
        ]
    },
    **{c: True for c in str_cols},
}
fig = px.scatter(
    diff_reg,
    x="log2FC",
    y="-log10 pvalue",
    color="rejected",
    hover_data=hover_data,
    width=1200,
    height=800,
    title=f"Volcano plot for {view_name}s",
)
fig

Save result to subsection folder

fig.write_json(
    out_dir_subsection / "0_volcano_plot.json",
    pretty=False,
)
diff_reg.to_csv(out_dir_subsection / "1_differential_regulation.csv")

Enrichment Analysis#

  • You can reload the annotations, but for latency, I added a cached version to the repository

out_dir_subsection = out_dir / "uniprot_annotations"
out_dir_subsection.mkdir(parents=True, exist_ok=True)

Fetch the annotations from UniProt API.#

fname_annotations = out_dir_subsection / "annotations.csv"
try:
    annotations = pd.read_csv(fname_annotations, index_col=0)
    print(f"Loaded annotations from {fname_annotations}")
except FileNotFoundError:
    print(f"Fetching annotations for {proteins.columns.size} UniProt IDs.")
    FIELDS = "go_p,go_c,go_f"
    annotations = fetch_annotations(proteins.columns, fields=FIELDS)
    annotations = process_annotations(annotations, fields=FIELDS)
    # cache the annotations
    fname_annotations.parent.mkdir(exist_ok=True, parents=True)
    annotations.to_csv(fname_annotations, index=True)

annotations
Loaded annotations from data/PXD040621/report/uniprot_annotations/annotations.csv
identifier source annotation
6 P00350 Gene Ontology (biological process) D-gluconate catabolic process [GO:0046177]
7 P00350 Gene Ontology (cellular component) cytosol [GO:0005829]
8 P00350 Gene Ontology (molecular function) guanosine tetraphosphate binding [GO:0097216]
9 P00350 Gene Ontology (molecular function) identical protein binding [GO:0042802]
10 P00350 Gene Ontology (molecular function) NADP binding [GO:0050661]
... ... ... ...
32,263 ADPP_ECOLI Gene Ontology (molecular function) magnesium ion binding [GO:0000287]
32,264 ADPP_ECOLI Gene Ontology (molecular function) protein homodimerization activity [GO:0042803]
32,265 ADPP_ECOLI Gene Ontology (molecular function) pyrophosphatase activity [GO:0016462]
32,266 ADPP_ECOLI Gene Ontology (biological process) response to heat [GO:0009408]
32,267 ADPP_ECOLI Gene Ontology (biological process) ribose phosphate metabolic process [GO:0019693]

30283 rows × 3 columns

Run the enrichment analysis#

  • background is the set of identified proteins in the experiment (not the whole proteome of the organisim, here E. coli)

  • The enrichment is performed separately for the up- and down-regulated proteins (‘rejected’), which are few in our example where we had to set the adjusted p-value to 0.15.

In the enrichment we set the cutoff for the adjusted p-value to 0.2, which is a bit arbitrary to see some results.

enriched = acore.enrichment_analysis.run_up_down_regulation_enrichment(
    regulation_data=diff_reg,
    annotation=annotations,
    min_detected_in_set=1,
    lfc_cutoff=1,
    pval_col="padj",  # toggle if it does not work
    correction_alpha=0.2,  # adjust the p-value to see more or less results
)
enriched.set_index('identifiers')
terms foreground background foreground_pop background_pop pvalue padj rejected direction comparison
identifiers
P75726 citrate CoA-transferase activity [GO:0008814] 1 0 1 1446 0.001 0.002 True upregulated control~10 µm sulforaphane
P75726 citrate metabolic process [GO:0006101] 1 1 1 1446 0.001 0.002 True upregulated control~10 µm sulforaphane
P75726 ATP-independent citrate lyase complex [GO:0009... 1 1 1 1446 0.001 0.002 True upregulated control~10 µm sulforaphane
P75726 acetyl-CoA metabolic process [GO:0006084] 1 1 1 1446 0.001 0.002 True upregulated control~10 µm sulforaphane
P75726 citrate (pro-3S)-lyase activity [GO:0008815] 1 1 1 1446 0.001 0.002 True upregulated control~10 µm sulforaphane
P75726 cytoplasm [GO:0005737] 1 28 1 1446 0.020 0.020 True upregulated control~10 µm sulforaphane
P02943 maltodextrin transmembrane transporter activi... 1 0 1 1446 0.001 0.002 True downregulated control~10 µm sulforaphane
P02943 maltose transmembrane transport [GO:1904981] 1 0 1 1446 0.001 0.002 True downregulated control~10 µm sulforaphane
P02943 maltose transmembrane transporter activity [G... 1 0 1 1446 0.001 0.002 True downregulated control~10 µm sulforaphane
P02943 maltose transporting porin activity [GO:0015481] 1 0 1 1446 0.001 0.002 True downregulated control~10 µm sulforaphane
P02943 polysaccharide transport [GO:0015774] 1 0 1 1446 0.001 0.002 True downregulated control~10 µm sulforaphane
P02943 maltodextrin transmembrane transport [GO:0042... 1 1 1 1446 0.001 0.002 True downregulated control~10 µm sulforaphane
P02943 carbohydrate transmembrane transporter activit... 1 1 1 1446 0.001 0.002 True downregulated control~10 µm sulforaphane
P02943 virus receptor activity [GO:0001618] 1 1 1 1446 0.001 0.002 True downregulated control~10 µm sulforaphane
P02943 outer membrane protein complex [GO:0106234] 1 2 1 1446 0.002 0.003 True downregulated control~10 µm sulforaphane
P02943 monoatomic ion transport [GO:0006811] 1 2 1 1446 0.002 0.003 True downregulated control~10 µm sulforaphane
P02943 porin activity [GO:0015288] 1 5 1 1446 0.004 0.005 True downregulated control~10 µm sulforaphane
P02943 pore complex [GO:0046930] 1 5 1 1446 0.004 0.005 True downregulated control~10 µm sulforaphane
P02943 cell outer membrane [GO:0009279] 1 37 1 1446 0.026 0.028 True downregulated control~10 µm sulforaphane
P02943 DNA damage response [GO:0006974] 1 42 1 1446 0.030 0.030 True downregulated control~10 µm sulforaphane

Plot the enrichment scores for the up- and down-regulated proteins separately.

  • y-axis: GO term

  • x-axis: enrichment score (e.g. -log10(p-pvalue adjusted))

Hide code cell source

fig = get_enrichment_plots(
    enriched,
    identifier="anything",  # ToDo: figure out what this does
    args=dict(title="Enrichment Analysis"),
)
fig = fig[0]
fig.write_json(
    out_dir_subsection / "enrichment_analysis.json",
    pretty=True,
)
fig

Check for Maltose Uptake#

out_dir_subsection = out_dir / "3_maltose_uptake"
out_dir_subsection.mkdir(parents=True, exist_ok=True)

Apply filtering of ‘differentially abundant proteins’ as described in the paper

“Differentially abundant proteins were determined as those with log2 fold-change \(log2FC > 1\) and \(log2FC < -1\), and \(p < 0.05\)

This means no multiple testing correction was applied.

view = diff_reg.query("pvalue < 0.05 and FC > 1")  # .shape[0]
view.to_csv(
    out_dir_subsection / "1_differently_regulated_as_in_paper.csv",
    index=False,
)
view.set_index("identifier")[['pvalue', 'log2FC', 'padj', 'rejected', 'mean(group1)', 'mean(group2)',
       'std(group1)', 'std(group2)',  'test', 'correction', 
       'group1', 'group2', 'FC', '-log10 pvalue', 'Method']]
pvalue log2FC padj rejected mean(group1) mean(group2) std(group1) std(group2) test correction group1 group2 FC -log10 pvalue Method
identifier
P75726 0.000 1.341 0.015 True 28.135 26.793 0.131 0.113 t-Test FDR correction BH control 10 µm sulforaphane 2.534 4.978 Unpaired t-test
P76440 0.000 0.655 0.030 True 27.235 26.580 0.082 0.068 t-Test FDR correction BH control 10 µm sulforaphane 1.575 4.384 Unpaired t-test
P09152 0.001 0.684 0.147 True 26.875 26.191 0.074 0.176 t-Test FDR correction BH control 10 µm sulforaphane 1.607 3.090 Unpaired t-test
P37349 0.001 0.795 0.164 False 25.988 25.193 0.235 0.053 t-Test FDR correction BH control 10 µm sulforaphane 1.735 2.904 Unpaired t-test
P0ABK9 0.002 0.838 0.190 False 26.416 25.578 0.147 0.226 t-Test FDR correction BH control 10 µm sulforaphane 1.787 2.767 Unpaired t-test
P75825 0.002 1.636 0.236 False 26.956 25.320 0.220 0.524 t-Test FDR correction BH control 10 µm sulforaphane 3.109 2.606 Unpaired t-test
P77252 0.003 0.639 0.236 False 26.593 25.953 0.119 0.195 t-Test FDR correction BH control 10 µm sulforaphane 1.558 2.545 Unpaired t-test
P0ACD8 0.003 0.348 0.236 False 30.243 29.895 0.090 0.089 t-Test FDR correction BH control 10 µm sulforaphane 1.273 2.508 Unpaired t-test
P0A9I1 0.004 1.214 0.280 False 28.152 26.938 0.357 0.301 t-Test FDR correction BH control 10 µm sulforaphane 2.320 2.387 Unpaired t-test
P0A7E3 0.005 0.816 0.288 False 27.388 26.572 0.072 0.317 t-Test FDR correction BH control 10 µm sulforaphane 1.760 2.317 Unpaired t-test
P19931 0.006 0.605 0.345 False 27.319 26.714 0.109 0.232 t-Test FDR correction BH control 10 µm sulforaphane 1.521 2.189 Unpaired t-test
P05637 0.009 0.699 0.369 False 26.674 25.975 0.131 0.288 t-Test FDR correction BH control 10 µm sulforaphane 1.623 2.062 Unpaired t-test
P25397 0.009 0.741 0.385 False 27.685 26.944 0.205 0.271 t-Test FDR correction BH control 10 µm sulforaphane 1.671 2.031 Unpaired t-test
P37637 0.013 0.326 0.468 False 28.952 28.626 0.130 0.096 t-Test FDR correction BH control 10 µm sulforaphane 1.254 1.888 Unpaired t-test
P24232 0.014 0.923 0.470 False 27.189 26.266 0.418 0.203 t-Test FDR correction BH control 10 µm sulforaphane 1.896 1.860 Unpaired t-test
P37765 0.015 0.359 0.495 False 27.570 27.211 0.032 0.182 t-Test FDR correction BH control 10 µm sulforaphane 1.283 1.822 Unpaired t-test
P68767 0.016 0.589 0.510 False 28.552 27.963 0.128 0.279 t-Test FDR correction BH control 10 µm sulforaphane 1.504 1.799 Unpaired t-test
P0AGL2 0.019 0.420 0.586 False 30.640 30.221 0.105 0.203 t-Test FDR correction BH control 10 µm sulforaphane 1.338 1.721 Unpaired t-test
P31142 0.022 1.073 0.624 False 28.163 27.090 0.122 0.590 t-Test FDR correction BH control 10 µm sulforaphane 2.103 1.667 Unpaired t-test
P69811 0.022 0.636 0.624 False 29.010 28.374 0.094 0.347 t-Test FDR correction BH control 10 µm sulforaphane 1.554 1.657 Unpaired t-test
P24228 0.025 0.794 0.674 False 25.663 24.869 0.304 0.348 t-Test FDR correction BH control 10 µm sulforaphane 1.734 1.605 Unpaired t-test
P11349 0.025 0.632 0.674 False 27.512 26.880 0.367 0.032 t-Test FDR correction BH control 10 µm sulforaphane 1.550 1.603 Unpaired t-test
P37127 0.025 0.673 0.674 False 25.256 24.583 0.085 0.384 t-Test FDR correction BH control 10 µm sulforaphane 1.595 1.599 Unpaired t-test
P0A6W9 0.028 0.834 0.677 False 27.536 26.702 0.302 0.401 t-Test FDR correction BH control 10 µm sulforaphane 1.783 1.553 Unpaired t-test
P69829 0.030 0.733 0.677 False 26.959 26.226 0.216 0.395 t-Test FDR correction BH control 10 µm sulforaphane 1.662 1.518 Unpaired t-test
P0AB52 0.035 2.149 0.733 False 28.457 26.308 1.055 0.877 t-Test FDR correction BH control 10 µm sulforaphane 4.435 1.457 Unpaired t-test
P63235 0.038 0.878 0.733 False 30.551 29.673 0.290 0.497 t-Test FDR correction BH control 10 µm sulforaphane 1.837 1.415 Unpaired t-test
P16659 0.039 0.422 0.733 False 29.434 29.012 0.263 0.089 t-Test FDR correction BH control 10 µm sulforaphane 1.339 1.407 Unpaired t-test
P18390 0.041 0.856 0.733 False 26.130 25.274 0.401 0.406 t-Test FDR correction BH control 10 µm sulforaphane 1.809 1.390 Unpaired t-test
P0AE74 0.041 0.984 0.733 False 26.505 25.521 0.643 0.138 t-Test FDR correction BH control 10 µm sulforaphane 1.978 1.385 Unpaired t-test
P37773 0.043 0.718 0.735 False 27.938 27.221 0.149 0.463 t-Test FDR correction BH control 10 µm sulforaphane 1.644 1.365 Unpaired t-test
P0AE45 0.046 1.216 0.768 False 26.325 25.110 0.516 0.660 t-Test FDR correction BH control 10 µm sulforaphane 2.323 1.340 Unpaired t-test
P0AGI8 0.047 0.640 0.773 False 26.223 25.583 0.422 0.142 t-Test FDR correction BH control 10 µm sulforaphane 1.558 1.325 Unpaired t-test
P0AET5 0.048 0.393 0.773 False 30.623 30.230 0.267 0.063 t-Test FDR correction BH control 10 µm sulforaphane 1.313 1.322 Unpaired t-test
P30850 0.048 0.526 0.773 False 28.148 27.623 0.142 0.338 t-Test FDR correction BH control 10 µm sulforaphane 1.440 1.322 Unpaired t-test
P06961 0.049 0.303 0.773 False 25.429 25.126 0.135 0.166 t-Test FDR correction BH control 10 µm sulforaphane 1.234 1.306 Unpaired t-test

Let’s find the proteins highlighted in the volcano plot in Figure 3 in the paper: Figure 3

Hide code cell source

highlighted_genes = ["LamB", "MalE", "Malk", "CitF", "CitT", "CitE", "Frd"]
highlighted_genes = "|".join([p.upper() for p in highlighted_genes])
highlighted_genes = proteins_meta.query(
    f"`GeneName`.str.contains('{highlighted_genes}')"
)
highlighted_genes
Source ProteinName GeneName
identifier
sp|P00363|FRDA_ECOLI sp P00363 FRDA
sp|P02943|LAMB_ECOLI sp P02943 LAMB
sp|P0A8Q0|FRDC_ECOLI sp P0A8Q0 FRDC
sp|P0A8Q3|FRDD_ECOLI sp P0A8Q3 FRDD
sp|P0A9I1|CITE_ECOLI sp P0A9I1 CITE
sp|P0AC47|FRDB_ECOLI sp P0AC47 FRDB
sp|P0AE74|CITT_ECOLI sp P0AE74 CITT
sp|P0AEX9|MALE_ECOLI sp P0AEX9 MALE
sp|P68187|MALK_ECOLI sp P68187 MALK

Hide code cell source

highlighted_proteins = "|".join([p.upper() for p in highlighted_genes["ProteinName"]])
view = diff_reg.query(f"`identifier`.str.contains('{highlighted_proteins}')")
view = view.set_index("identifier").join(proteins_meta.set_index("ProteinName"))
view.to_csv(
    out_dir_subsection / "2_highlighted_proteins_in_figure3.csv",
    index=True,
)
sel_cols = [
    "GeneName",
    "log2FC",
    "pvalue",
    "padj",
    "rejected",
    "group1",
    "group2",
    "Method",
]
view[sel_cols].sort_values("log2FC", ascending=False)
GeneName log2FC pvalue padj rejected group1 group2 Method
identifier
P0A9I1 CITE 1.214 0.004 0.280 False control 10 µm sulforaphane Unpaired t-test
P0AE74 CITT 0.984 0.041 0.733 False control 10 µm sulforaphane Unpaired t-test
P0A8Q0 FRDC 0.927 0.114 0.833 False control 10 µm sulforaphane Unpaired t-test
P00363 FRDA 0.206 0.342 0.855 False control 10 µm sulforaphane Unpaired t-test
P0AC47 FRDB 0.174 0.383 0.866 False control 10 µm sulforaphane Unpaired t-test
P0AEX9 MALE 0.124 0.655 0.914 False control 10 µm sulforaphane Unpaired t-test
P02943 LAMB -1.691 0.000 0.032 True control 10 µm sulforaphane Unpaired t-test
  • See normalized data.

Hide code cell source

view_X = (
    (
        X.loc[:, X.columns.isin(highlighted_genes["ProteinName"].to_list())].T.join(
            proteins_meta.set_index("ProteinName")["GeneName"]
        )
    )
    .set_index("GeneName", append=True)
    .T
)  # to check]
view_X.to_csv(
    out_dir_subsection / "3_highlighted_proteins_in_figure3_intensities_normalized.csv",
    index=True,
)
view_X
UniprotID P00363 P02943 P0A8Q0 P0A9I1 P0AC47 P0AE74 P0AEX9
GeneName FRDA LAMB FRDC CITE FRDB CITT MALE
DMSO_rep1 29.953 26.458 30.054 28.737 30.930 27.250 26.805
DMSO_rep2 30.399 26.149 28.953 28.071 31.085 26.501 26.922
DMSO_rep3 30.170 26.351 30.769 27.769 30.864 25.492 27.055
DMSO_rep4 29.638 25.722 29.264 28.032 30.983 26.776 27.205
Suf_rep1 29.824 27.663 28.169 27.115 31.179 25.304 26.337
Suf_rep2 30.084 28.092 29.441 26.742 30.315 25.688 26.862
Suf_rep3 29.907 27.973 29.175 26.565 30.820 25.551 27.538
Suf_rep4 29.521 27.714 28.548 27.331 30.852 25.540 26.754

Let’s see their original data:

  • See unnormalized data.

  • note that proteins can be discarded based on your filtering criteria.

How would the analysis change with imputation?

Hide code cell source

view_proteins = (
    (
        proteins[highlighted_genes["ProteinName"].to_list()].T.join(
            proteins_meta.set_index("ProteinName")["GeneName"]
        )
    )
    .set_index("GeneName", append=True)
    .T
)  # to check]
view_proteins.to_csv(
    out_dir_subsection / "3_highlighted_proteins_in_figure3_intensities.csv",
    index=True,
)
view_proteins
UniprotID P00363 P02943 P0A8Q0 P0A8Q3 P0A9I1 P0AC47 P0AE74 P0AEX9 P68187
GeneName FRDA LAMB FRDC FRDD CITE FRDB CITT MALE MALK
DMSO_rep1 30.247 26.752 30.348 27.389 29.031 31.223 27.544 27.099 NaN
DMSO_rep2 30.262 26.012 28.815 27.061 27.933 30.947 26.364 26.785 24.881
DMSO_rep3 29.970 26.151 30.569 27.291 27.569 30.664 25.292 26.854 NaN
DMSO_rep4 29.471 25.555 29.097 24.906 27.865 30.815 26.608 27.037 25.423
Suf_rep1 30.005 27.844 28.350 26.703 27.296 31.360 25.485 26.518 26.047
Suf_rep2 30.086 28.095 29.444 NaN 26.744 30.317 25.691 26.864 24.964
Suf_rep3 29.904 27.971 29.172 NaN 26.563 30.818 25.549 27.536 25.545
Suf_rep4 29.575 27.769 28.602 25.294 27.386 30.907 25.594 26.809 26.134

How to explain the differences?