4.1 Data preparation
4.1.1 Loading the raw proteomics data
📌 Remember: Load the library before starting the analysis.
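A minimal sketch of that step, assuming the analysis relies on the tidyverse meta-package (this section uses functions from readr, dplyr, tidyr and stringr, all of which it attaches):

```r
# Assumption: the tidyverse meta-package is installed.
# It attaches readr, dplyr, tidyr and stringr, among others.
library(tidyverse)
```

DT does not need to be attached separately when its functions are called with the `DT::` prefix.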
Load and prepare the data.
The read_csv() function from the readr package (part of the tidyverse) reads CSV files into a tibble.
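The loading step could look roughly like the sketch below. The path of the real input file is not given in this section, so the example writes a small toy file to a temporary location first; the file content and the name `toy_data` are illustrative assumptions, with column names borrowed from the tables shown later.

```r
library(readr)

# Hypothetical toy file standing in for the real proteomics export.
toy_path <- tempfile(fileext = ".csv")
writeLines(
  c(
    "ProteinName,Reference,Intensity",
    "sp|P00350|6PGD_ECOLI,DMSO_rep1,150000",
    "sp|P00350|6PGD_ECOLI,Suf_rep1,120000"
  ),
  toy_path
)

# read_csv() guesses the column types; show_col_types = FALSE silences the type message.
toy_data <- read_csv(toy_path, show_col_types = FALSE)
toy_data
```

For the actual analysis, replace `toy_path` with the path to your own CSV file.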
📌 Remember: You can use the DT package to view the data as an interactive table.
DT::datatable(
  data = head(data, 1000), # show only the first 1000 rows
  rownames = FALSE,
  extensions = c("Buttons", "Scroller"),
  options = list(
    dom = "Bfrtip",
    buttons = c("copy", "csv"),
    deferRender = TRUE,
    scrollX = TRUE,
    scrollY = 200,
    scroller = TRUE
  ),
  caption = "proteomics metadata"
)
4.1.2 Data transformation
Intensities that follow a log-normal distribution are commonly log2-transformed, which makes them approximately normally distributed.
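A minimal sketch of such a transformation, using a toy tibble with made-up intensities (the names `toy` and `toy_log` are assumptions for illustration, not objects from the course data):

```r
library(dplyr)

# Toy intensities chosen as powers of two so the result is easy to read.
toy <- tibble(
  ProteinName = c("A", "A", "B"),
  Intensity = c(1024, 4096, 65536)
)

# Overwrite the raw intensities with their log2 values.
toy_log <- toy %>%
  mutate(Intensity = log2(Intensity))

toy_log
```

On the log2 scale, a difference of 1 between two values corresponds to a two-fold difference in raw intensity.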
❓ Question: Do you remember what the pipe operator (%>%) does in the code?
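As a reminder of the idea behind the question above, the sketch below writes the same computation once with nested calls and once with the pipe. The vector `x` is a made-up example.

```r
library(dplyr) # re-exports the %>% operator from magrittr

x <- c(1, 2, 4, 8)

# Nested calls are read from the inside out.
nested <- round(mean(log2(x)), 1)

# The pipe passes the left-hand side as the first argument of the next call,
# so the same steps are read from left to right.
piped <- x %>%
  log2() %>%
  mean() %>%
  round(1)

identical(nested, piped)
```

Both versions perform exactly the same steps; only the reading order differs.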
4.1.3 Data aggregation
Let’s aggregate the peptide intensities to protein intensities. We use the median of the peptide intensities for each protein. Also, let’s shorten sample names for better readability.
protein_data <- data %>%
  group_by(ProteinName, Reference) %>%
  summarize(Intensity = median(Intensity, na.rm = TRUE), .groups = "drop") %>%
  # Small pipeline to shorten Reference names
  mutate(Reference_parts = strsplit(Reference, "_")) %>%
  mutate(Reference_parts = lapply(Reference_parts, function(parts) parts[4:6])) %>%
  mutate(Reference = sapply(Reference_parts, function(x) paste(x, collapse = "_"))) %>%
  select(-Reference_parts) %>% # Remove the temporary column from the data frame.
  mutate(Reference = str_remove(Reference, "^Ecoli_"))
💡 Tip: Try to run the code above line by line.
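To see what the name-shortening steps above do, it can help to apply them to a single string. The raw sample names are not shown in this section, so the value of `raw_reference` below is a hypothetical name that simply has the shape the pipeline expects (at least six underscore-separated parts, with "Ecoli" as the fourth).

```r
library(stringr)

# Hypothetical raw sample name; the real names in the data may look different.
raw_reference <- "20240101_MS01_run07_Ecoli_DMSO_rep1"

# Step 1: split the name at every underscore.
parts <- strsplit(raw_reference, "_")[[1]]

# Step 2: keep only parts 4 to 6.
kept <- parts[4:6]

# Step 3: glue the kept parts back together.
joined <- paste(kept, collapse = "_")

# Step 4: drop the leading "Ecoli_" prefix.
short <- str_remove(joined, "^Ecoli_")

short
```

Printing `parts`, `kept` and `joined` one after another shows how each intermediate result feeds the next step.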
4.1.4 Removing contaminants
Next, we remove the contaminant proteins that were included in the FASTA file used for data processing.
Typical contaminants are proteins such as keratin from skin or hair, which are often introduced accidentally during sample handling.
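One way to sketch this filtering step is shown below. How contaminants are marked depends on the FASTA file and the search software, and this section does not say which marker was used, so the "CON__" prefix, the contaminant entry and the toy tibble are all assumptions for illustration.

```r
library(dplyr)
library(stringr)

# Toy protein table with one made-up contaminant entry.
toy_proteins <- tibble(
  ProteinName = c("sp|P00350|6PGD_ECOLI", "CON__P04264", "sp|P00363|FRDA_ECOLI"),
  Intensity = c(28.2, 25.1, 30.2)
)

# Assumption: contaminant entries start with "CON__". Adjust to your own data.
contaminant_pattern <- "^CON__"

# Keep only the rows whose name does NOT match the contaminant pattern.
toy_clean <- toy_proteins %>%
  filter(!str_detect(ProteinName, contaminant_pattern))

toy_clean
```

Comparing `nrow(toy_proteins)` with `nrow(toy_clean)` is a quick way to confirm how many entries were removed.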
📌 Remember: It is always a good idea to check the data often while processing it.
## # A tibble: 10 × 3
## ProteinName Reference Intensity
## <chr> <chr> <dbl>
## 1 sp|A5A613|YCIY_ECOLI DMSO_rep1 27.2
## 2 sp|P00350|6PGD_ECOLI DMSO_rep1 28.2
## 3 sp|P00350|6PGD_ECOLI DMSO_rep2 27.9
## 4 sp|P00350|6PGD_ECOLI DMSO_rep3 27.7
## 5 sp|P00350|6PGD_ECOLI DMSO_rep4 27.2
## 6 sp|P00350|6PGD_ECOLI Suf_rep1 27.4
## 7 sp|P00350|6PGD_ECOLI Suf_rep2 27.0
## 8 sp|P00350|6PGD_ECOLI Suf_rep3 27.8
## 9 sp|P00350|6PGD_ECOLI Suf_rep4 27.6
## 10 sp|P00363|FRDA_ECOLI DMSO_rep1 30.2
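A few quick checks of the kind the reminder above has in mind could look like the following sketch. It uses a small toy tibble built from the first rows of the output above; the object names are assumptions.

```r
library(dplyr)

# Toy table copied from the first three rows of the printed output.
toy_protein_data <- tibble(
  ProteinName = c("sp|A5A613|YCIY_ECOLI", "sp|P00350|6PGD_ECOLI", "sp|P00350|6PGD_ECOLI"),
  Reference = c("DMSO_rep1", "DMSO_rep1", "DMSO_rep2"),
  Intensity = c(27.2, 28.2, 27.9)
)

head(toy_protein_data, 10) # look at the first rows
dim(toy_protein_data)      # number of rows and columns

# How many proteins were quantified in each sample?
n_per_sample <- count(toy_protein_data, Reference)

# How many intensities are missing?
n_missing <- sum(is.na(toy_protein_data$Intensity))
```

Running checks like these after every processing step makes it easier to spot where something unexpected happened.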
4.1.5 Cleaning names
Split the ProteinName column into Source, Protein and Gene columns, and keep the original full name as the Identifier column.
After splitting, we also remove the _ECOLI suffix from the Gene column.
protein_data_parsed <- protein_data %>%
  separate(ProteinName, into = c("Source", "Protein", "Gene"), sep = "\\|", extra = "drop", remove = FALSE) %>%
  mutate(Gene = str_remove(Gene, "_ECOLI")) %>%
  rename("Identifier" = "ProteinName")
Finally, add a column with the experimental condition labels.
protein_data_parsed <- protein_data_parsed %>%
  mutate(Label = if_else(str_detect(Reference, "DMSO"), "DMSO", "Sulforaphane"))
Let’s have a look at the final data frame.
## # A tibble: 10 × 7
## Identifier Source Protein Gene Reference Intensity Label
## <chr> <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 sp|A5A613|YCIY… sp A5A613 YCIY DMSO_rep1 27.2 DMSO
## 2 sp|P00350|6PGD… sp P00350 6PGD DMSO_rep1 28.2 DMSO
## 3 sp|P00350|6PGD… sp P00350 6PGD DMSO_rep2 27.9 DMSO
## 4 sp|P00350|6PGD… sp P00350 6PGD DMSO_rep3 27.7 DMSO
## 5 sp|P00350|6PGD… sp P00350 6PGD DMSO_rep4 27.2 DMSO
## 6 sp|P00350|6PGD… sp P00350 6PGD Suf_rep1 27.4 Sulf…
## 7 sp|P00350|6PGD… sp P00350 6PGD Suf_rep2 27.0 Sulf…
## 8 sp|P00350|6PGD… sp P00350 6PGD Suf_rep3 27.8 Sulf…
## 9 sp|P00350|6PGD… sp P00350 6PGD Suf_rep4 27.6 Sulf…
## 10 sp|P00363|FRDA… sp P00363 FRDA DMSO_rep1 30.2 DMSO
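The parsing and labelling steps of this subsection can be tried out on a small, self-contained table. The sketch below rebuilds three rows from the output above as a toy tibble (the names `toy`, `toy_parsed` and `label_check` are assumptions) and then counts the Label and Reference combinations to confirm that every sample received the intended label.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Toy table copied from three rows of the printed output.
toy <- tibble(
  ProteinName = c("sp|P00350|6PGD_ECOLI", "sp|P00350|6PGD_ECOLI", "sp|P00363|FRDA_ECOLI"),
  Reference = c("DMSO_rep1", "Suf_rep1", "DMSO_rep1"),
  Intensity = c(28.2, 27.4, 30.2)
)

toy_parsed <- toy %>%
  # Split at the literal "|" character; keep the original column as well.
  separate(ProteinName, into = c("Source", "Protein", "Gene"),
           sep = "\\|", extra = "drop", remove = FALSE) %>%
  mutate(Gene = str_remove(Gene, "_ECOLI")) %>%
  rename(Identifier = ProteinName) %>%
  # Samples without "DMSO" in their name are treated as Sulforaphane samples.
  mutate(Label = if_else(str_detect(Reference, "DMSO"), "DMSO", "Sulforaphane"))

# Sanity check: which label did each sample receive?
label_check <- count(toy_parsed, Label, Reference)
label_check
```

Because the label is derived only from whether the sample name contains "DMSO", a table like `label_check` is a simple way to catch sample names that do not follow the expected pattern.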