4.1 Data preparation

4.1.1 Loading the raw proteomics data

📌 Remember: Load the library before starting the analysis.

library(tidyverse)


Load and prepare the data.

read_csv() from the readr package (part of the tidyverse) reads CSV files into a tibble.

data <- read_csv("data-01/PXD040621_peptides.csv", show_col_types = FALSE)


📌 Remember: You can use the DT package to visualize the data.

DT::datatable(
    data = head(data, 1000), # show only the first 1000 rows
    rownames = FALSE,
    extensions = c("Buttons", "Scroller"),
    options = list(
        dom = "Bfrtip",
        buttons = c("copy", "csv"),
        deferRender = TRUE,
        scrollX = TRUE,
        scrollY = 200,
        scroller = TRUE
    ),
    caption = "Peptide-level proteomics data (first 1000 rows)"
)


4.1.2 Data transformation

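Log2 transformation is commonly applied to data that are approximately log-normally distributed, such as mass spectrometry intensities. On the log2 scale the values are closer to a normal distribution, and a difference of 1 corresponds to a two-fold change.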

data <- data %>%
    mutate(Intensity = log2(Intensity))
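
One caveat worth checking before the transformation: log2(0) evaluates to -Inf in R, which would later distort medians and plots. The sketch below assumes that zeros, if present, encode missing measurements (the source file may not contain any). The `toy` tibble and its values are made up for illustration.

```r
library(dplyr)

# Hypothetical raw intensities, including a zero and a missing value
toy <- tibble(Intensity = c(1024, 0, NA, 2^20))

toy_log <- toy %>%
    # na_if() turns zeros into NA so that log2() does not return -Inf
    mutate(Intensity = log2(na_if(Intensity, 0)))

toy_log
```

Whether zeros should be treated as missing depends on how the upstream software reports undetected peptides, so check the data before adopting this step.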

❓ Question: Do you remember what the %>% (pipe) does in the code?

4.1.3 Data aggregation

Let’s aggregate the peptide intensities to protein intensities. We use the median of the peptide intensities for each protein. Also, let’s shorten sample names for better readability.

protein_data <- data %>%
    group_by(ProteinName, Reference) %>%
    summarize(Intensity = median(Intensity, na.rm = TRUE), .groups = "drop") %>%
    # Small pipeline to shorten Reference names
    mutate(Reference_parts = strsplit(Reference, "_")) %>%
    mutate(Reference_parts = lapply(Reference_parts, function(parts) parts[4:6])) %>%
    mutate(Reference = sapply(Reference_parts, function(x) paste(x, collapse = "_"))) %>%
    select(-Reference_parts) %>% # Remove the temporary column from the data frame.
    mutate(Reference = str_remove(Reference, "^Ecoli_"))
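
To see what the aggregation step does, here is a minimal sketch on a made-up tibble (the protein names, sample names and values are hypothetical). It uses the same group_by() and summarize() calls and additionally counts how many peptide values went into each median, which is an extra that the pipeline above does not include.

```r
library(dplyr)

# Hypothetical log2 peptide intensities: three peptides for P1, two for P2
toy <- tibble(
    ProteinName = c("P1", "P1", "P1", "P2", "P2"),
    Reference   = c("s1", "s1", "s1", "s1", "s1"),
    Intensity   = c(20, 22, NA, 30, 31)
)

toy_protein <- toy %>%
    group_by(ProteinName, Reference) %>%
    summarize(
        # Count first: arguments that come later would see the summarized Intensity
        n_peptides = sum(!is.na(Intensity)),
        Intensity = median(Intensity, na.rm = TRUE),
        .groups = "drop"
    )

toy_protein
```

Proteins supported by very few peptides have less robust medians, so a count column like this can help when judging individual results.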

💡 Tip: Try to run the code above line by line.
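
As an illustration of going step by step, the name-shortening part can be traced on a single string with base R. The raw sample name below is hypothetical (the real names in the file may look different); it only assumes that the condition and replicate sit in positions 4 to 6 when splitting on underscores.

```r
# Hypothetical raw sample name, invented for this example
ref <- "X1_X2_X3_Ecoli_DMSO_rep1"

# Step 1: split the name at every underscore
parts <- strsplit(ref, "_")[[1]]

# Step 2: keep only elements 4 to 6
kept <- parts[4:6]

# Step 3: glue the kept elements back together
joined <- paste(kept, collapse = "_")

# Step 4: drop the leading "Ecoli_" prefix (base R counterpart of str_remove())
short <- sub("^Ecoli_", "", joined)

short
```

Printing `parts`, `kept` and `joined` after each step shows how the intermediate values change.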

4.1.4 Removing contaminants

We remove the contaminant proteins, which were included in the FASTA file used for data processing.

Contaminants are proteins such as keratin from skin or hair that are often accidentally introduced during sample handling.

protein_data <- protein_data %>%
    filter(!str_detect(ProteinName, "CON_"))
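
Before dropping rows it can be useful to count how many proteins the pattern matches. The following is a small sketch on made-up data; the contaminant entry `CON_TOY|KERATIN_HUMAN` is a hypothetical name, and the exact contaminant prefix in a real file depends on the FASTA file that was used.

```r
library(dplyr)
library(stringr)

# Toy data: two E. coli proteins and one hypothetical contaminant entry
toy <- tibble(
    ProteinName = c("sp|P00350|6PGD_ECOLI", "CON_TOY|KERATIN_HUMAN", "sp|P00363|FRDA_ECOLI"),
    Intensity   = c(28.2, 25.0, 30.2)
)

# Number of distinct proteins flagged as contaminants
n_contaminants <- toy %>%
    filter(str_detect(ProteinName, "CON_")) %>%
    distinct(ProteinName) %>%
    nrow()

# Same filter as in the text, keeping everything that is not flagged
toy_clean <- toy %>%
    filter(!str_detect(ProteinName, "CON_"))

n_contaminants
```

A count of zero, or an unexpectedly large count, would suggest that the pattern does not match the naming scheme in the file.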


📌 Remember: It is a good idea to check the data frequently while processing it.

head(protein_data, n = 10)
## # A tibble: 10 × 3
##    ProteinName          Reference Intensity
##    <chr>                <chr>         <dbl>
##  1 sp|A5A613|YCIY_ECOLI DMSO_rep1      27.2
##  2 sp|P00350|6PGD_ECOLI DMSO_rep1      28.2
##  3 sp|P00350|6PGD_ECOLI DMSO_rep2      27.9
##  4 sp|P00350|6PGD_ECOLI DMSO_rep3      27.7
##  5 sp|P00350|6PGD_ECOLI DMSO_rep4      27.2
##  6 sp|P00350|6PGD_ECOLI Suf_rep1       27.4
##  7 sp|P00350|6PGD_ECOLI Suf_rep2       27.0
##  8 sp|P00350|6PGD_ECOLI Suf_rep3       27.8
##  9 sp|P00350|6PGD_ECOLI Suf_rep4       27.6
## 10 sp|P00363|FRDA_ECOLI DMSO_rep1      30.2

❓ Question: Can you think of a different way of inspecting the data?

4.1.5 Cleaning names

Split the ProteinName column into Source, Protein and Gene columns. The original ProteinName column is kept and renamed to Identifier.

After splitting, we also remove the _ECOLI suffix from the Gene column.

protein_data_parsed <- protein_data %>%
    separate(ProteinName, into = c("Source", "Protein", "Gene"), sep = "\\|", extra = "drop", remove = FALSE) %>%
    mutate(Gene = str_remove(Gene, "_ECOLI")) %>%
    rename("Identifier" = "ProteinName")
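
The effect of the split is easiest to see on a single identifier. This sketch reuses one identifier from the output shown earlier in a one-row toy tibble. As a small variation, the pattern "_ECOLI$" is anchored with `$` so that only a trailing _ECOLI is removed.

```r
library(dplyr)
library(tidyr)
library(stringr)

# One-row toy tibble with an identifier in the sp|accession|name format
toy <- tibble(ProteinName = "sp|P00350|6PGD_ECOLI")

toy_parsed <- toy %>%
    # "\\|" escapes the pipe character, which otherwise means "or" in a regex
    separate(ProteinName, into = c("Source", "Protein", "Gene"),
             sep = "\\|", remove = FALSE) %>%
    mutate(Gene = str_remove(Gene, "_ECOLI$"))

toy_parsed
```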


Finally, add a column with the experimental condition labels.

protein_data_parsed <- protein_data_parsed %>%
    mutate(Label = if_else(str_detect(Reference, "DMSO"), "DMSO", "Sulforaphane"))
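
Note that if_else() labels every sample that does not contain "DMSO" as Sulforaphane, including names that match neither condition. A more defensive alternative, sketched here with case_when() on made-up sample names (`Blank_rep1` is hypothetical), leaves unmatched samples as NA so they are easy to spot.

```r
library(dplyr)
library(stringr)

# Toy sample names; "Blank_rep1" is a hypothetical name that fits neither condition
toy <- tibble(Reference = c("DMSO_rep1", "Suf_rep1", "Blank_rep1"))

labels <- toy %>%
    mutate(Label = case_when(
        str_detect(Reference, "DMSO") ~ "DMSO",
        str_detect(Reference, "Suf")  ~ "Sulforaphane",
        TRUE                          ~ NA_character_  # no match: flag as missing
    ))

labels
```

For a data set with exactly two conditions both approaches give the same result; the difference only matters when unexpected sample names are present.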


Let’s have a look at the final data frame.

head(protein_data_parsed, n = 10)
## # A tibble: 10 × 7
##    Identifier      Source Protein Gene  Reference Intensity Label
##    <chr>           <chr>  <chr>   <chr> <chr>         <dbl> <chr>
##  1 sp|A5A613|YCIY… sp     A5A613  YCIY  DMSO_rep1      27.2 DMSO 
##  2 sp|P00350|6PGD… sp     P00350  6PGD  DMSO_rep1      28.2 DMSO 
##  3 sp|P00350|6PGD… sp     P00350  6PGD  DMSO_rep2      27.9 DMSO 
##  4 sp|P00350|6PGD… sp     P00350  6PGD  DMSO_rep3      27.7 DMSO 
##  5 sp|P00350|6PGD… sp     P00350  6PGD  DMSO_rep4      27.2 DMSO 
##  6 sp|P00350|6PGD… sp     P00350  6PGD  Suf_rep1       27.4 Sulf…
##  7 sp|P00350|6PGD… sp     P00350  6PGD  Suf_rep2       27.0 Sulf…
##  8 sp|P00350|6PGD… sp     P00350  6PGD  Suf_rep3       27.8 Sulf…
##  9 sp|P00350|6PGD… sp     P00350  6PGD  Suf_rep4       27.6 Sulf…
## 10 sp|P00363|FRDA… sp     P00363  FRDA  DMSO_rep1      30.2 DMSO