Organisation: Data Science Platform
Responsible: Juliana Assis ()

1 Introduction to data visualization

In today’s data-driven world, the ability to visualize data effectively is essential for uncovering trends, making informed decisions, and communicating complex information. This workshop will guide you through the theoretical foundations and the practical applications of data visualization, ensuring you leave with a solid understanding of the subject.

1.1 Agenda

  • The importance of data visualization for making complex data accessible and understandable

  • Good practices for creating clear, compelling visualizations.

  • Practical skills in data visualization using R, focusing on RNotebook, ggplot2, dplyr, tidyr, and Plotly, with hands-on experience in creating both static and interactive plots

2 Practical applications and Workshop Material

If you prefer to run the Workshop in your own computer, please, follow the instructions at the Github.

Otherwise you are welcome to join our cloud. The link will be not pasted here, it will be only available in our workshop (for security reasons).

Independent of the method you chose, the next steps will be the same to everyone.

2.1 Installing Required R Packages

This section ensures all the necessary R packages for the workshop are installed and loaded. Here’s what each part of the code does:

  1. List of Packages: We define a list of required packages to install and load.
  2. Check for Missing Packages: Compares the required packages with those already installed. If any are missing, they are installed using install.packages.
  3. Load Libraries: Loads each package into the R session so that its functions are available for use.
  4. Install Bioconductor Manager: Ensures the BiocManager package is installed, which is needed for some specialized packages.
  5. Install ComplexHeatmap: Uses BiocManager to install the ComplexHeatmap package, which is not available on CRAN.

Below is the R code implementing these steps:

# List of packages to install
packages <- c("ggpubr", "grid", "tidyr", "reshape2", "reshape", "ggrepel", "ggh4x", "pheatmap", "RColorBrewer", "patchwork","DT", "kableExtra", "plotly", "bookdown", "heatmaply", "gganimate", "gifski")

# Install missing packages
new_packages <- packages[!(packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

# Load the libraries
lapply(packages, library, character.only = TRUE)

# BiocManager
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("ComplexHeatmap")

2.1.1 Loading libraries

In this section, we load all the required libraries for the workshop. These libraries provide functions for data manipulation, visualization, and interactive reporting. Here’s what each package is used for:

  • ggplot2: A powerful package for creating elegant and customizable visualizations in R.
  • ggpubr: Simplifies creating publication-ready plots based on ggplot2.
  • grid: Provides tools for creating and manipulating graphical layouts in R.
  • tidyr: Helps tidy messy datasets into a clean, consistent format.
  • reshape2: Transforms data between wide and long formats.
  • ggrepel: Prevents text labels in ggplot2 plots from overlapping.
  • ggh4x: Extends ggplot2 with additional customization options for axes and legends.
  • pheatmap: Creates pretty heatmaps with minimal customization.
  • RColorBrewer: Provides color palettes for visualizations.
  • patchwork: Combines multiple ggplot2 plots into a single layout.
  • DT: Creates interactive, searchable, and sortable tables in HTML.
  • dplyr: A grammar of data manipulation, making data wrangling fast and intuitive.
  • kableExtra: Enhances the appearance of tables created with knitr::kable.
  • ComplexHeatmap: Creates highly customizable and complex heatmaps.
  • plotly: Creates interactive plots from ggplot2 or directly in Plotly.
  • bookdown: Generates books and reports in various formats from R Markdown.
  • heatmaply: Creates interactive heatmaps with Plotly integration.
  • circlize: Creates circular visualizations for data.

Code on how to load the packages (one by one, you can also make a loop)

library(ggplot2)
library(ggpubr)
library(grid)
library(tidyr)
library(reshape2)
library(ggrepel)
library(ggh4x)
library(pheatmap)
library(RColorBrewer)
library(patchwork)
library(DT)
library(dplyr)
library(kableExtra)
library(ComplexHeatmap)
library(plotly)
library(bookdown)
library(heatmaply)
library(circlize) 
#Extra
library(gganimate)
library(gifski)

3 Explanation about the dataset

Ecological Adaptation and Succession of Human Fecal Microbial Communities in an Automated In Vitro Fermentation System

Summary

Here, by employing 16S rRNA gene amplicon sequencing, we investigated the adaptation and succession of human fecal microbial communities in an automated multistage fermentor. We performed two independent experiments using different human donor fecal samples, one configured with two units of three colon compartments each studied for 22 days and another with one unit of two colon compartments studied for 31 days. The fermentor maintained a trend of increasing microbial alpha diversity along colon compartments. Within each experiment, microbial compositions followed compartment-specific trajectories and reached independent stable configurations. While compositions were highly similar between replicate units, they were clearly separated between different experiments, showing that they maintained the individuality of fecal inoculum rather than converging on a fermentor-specific composition. While some fecal amplicon sequence variants (ASVs) were undetected in the fermentor, many ASVs undetected in the fecal samples flourished in vitro. These bloomer ASVs accounted for significant proportions of the population and included prominent health-associated microbes such as Bacteroides fragilis and Akkermansia muciniphila. Turnover in community compositions is likely explained by feed composition and pH, suggesting that these communities can be easily modulated. Our results suggest that in vitro fermentors are promising tools to study complex microbial communities harboring important members of human gut microbiota.

Experimental Design

Alpha Diversity: To be reproduced

4 Downloading Required Files and Creating Directory Structure

In this section, we are automating the setup for our workshop by downloading necessary files from the GitHub repository. Here’s what the script does:

  1. Define Base URLs: A list of base URLs where the files are hosted.
  2. Specify Files to Download: A list of files organized by subdirectory.
  3. Create Directories: Ensure the required directory structure exists locally.
  4. Download Files: Loop through each file and download it to the appropriate directory.

Below is the R code that performs these tasks:

# Base URLs for each directory on GitHub
base_urls <- list(
  "assets" = "https://github.com/biosustain/dsp_workshop_datavizR/raw/main/01_Code/assets/",
  "assets/fonts" = "https://github.com/biosustain/dsp_workshop_datavizR/raw/main/01_Code/assets/fonts/",
  "data/raw" = "https://github.com/biosustain/dsp_workshop_datavizR/raw/main/01_Code/data/raw/"
)

# List of files to download, organized by subdirectory
file_paths <- list(
  "assets" = c(
    "_footer-lab.Rmd",
    "_header-lab.Rmd",
    "logo.png",
    "logo.svg",
    "report.css"
  ),
  "assets/fonts" = c(
    "JetBrainsMono-Regular.ttf",
    "SourceSansPro-Italic.ttf",
    "SourceSansPro-Regular.ttf",
    "SourceSansPro-SemiBold.ttf",
    "SourceSansPro-SemiBoldItalic.ttf"
  ),
  "data/raw" = c(
    "01_Alpha_Diversity.tsv",
    "02_ord_DataFrame.tsv",
    "03_HeatMap_Exp1_DCs.tsv",
    "04_HeatMap.rds",
    "data.csv"
  )
)

# Create necessary local directories with the specified structure
dir.create("./assets/fonts", recursive = TRUE, showWarnings = FALSE)
dir.create("./data/raw", recursive = TRUE, showWarnings = FALSE)

# Loop through each directory and download files
for (subdir in names(file_paths)) {
  for (file_name in file_paths[[subdir]]) {
    # Construct the full URL for each file
    url <- paste0(base_urls[[subdir]], file_name)
    
    # Define the local destination path
    dest <- file.path(".", subdir, file_name)
    
    # Download the file
    download.file(url, destfile = dest, mode = "wb")
  }
}

message("All files have been downloaded successfully into the specified directory structure!")

5 Loading the Alpha Diversity Data

Here, we are loading the alpha diversity data from a TSV file using the read.table function.
The arguments specify the following: - header=T: The first row contains column headers.
- sep="\t": The file is tab-delimited.
- row.names=1: The first column contains row names.
- check.names=T: Ensures column names are syntactically valid in R.

Code

# Load the TSV files
alpha_info_tab <- read.table("data/raw/01_Alpha_Diversity.tsv", header=T, sep="\t", row.names=1, check.names=T)
#ord_dataframe <- read.table("data/raw/02_ord_DataFrame.tsv", header = TRUE, sep = "\t")
#heatmap_exp1 <- read.table("data/raw/03_HeatMap_Exp1_DCs.tsv", header = TRUE, sep = "\t")

# Load the RDS file
#heatmap_data <- readRDS("data/raw/04_HeatMap.rds")

# Load the CSV file
#data_csv <- read.csv("data/raw/data.csv")

6 Summary: Sample Information

In this section, we will display the alpha_info_tab dataset as an interactive table using the DT package. The table will include features such as:

  • The ability to copy table content to the clipboard.
  • An option to export the data to a CSV file.
  • Horizontal and vertical scrolling for large dataset.

Here, you can see the code and the output bellow.

#Summary: Sample Information

DT::datatable(data = alpha_info_tab, rownames = TRUE, 
              extensions = c('Buttons', 'Scroller'), 
              options = list(dom = 'Bfrtip', buttons = c('copy', 'csv'),
                             deferRender = TRUE, scrollX = T,
                             scrollX = T,
                             scrollY = 200,
                             scroller = TRUE,
                             caption = 'Sample Alpha Diversity'))

7 Filtering the data using dyplr package

This section filters the dataset to include only rows where the Study column is equal to "Exp1". The dplyr package’s filter() function is used for this purpose, allowing for easy subsetting of the data. After filtering, we display the dataset.

# Filter the data for Study == "Exp1"
alpha_info_tab_filtered <- alpha_info_tab %>%
  filter(Study == "Exp1")

# View the filtered data
alpha_info_tab_filtered

8 BoxPlot: Our first try

BoxPlot of Observed Values by Compartment for Exp1

In this section, we filter the dataset for Study == "Exp1" and create a boxplot of the Observed values, grouped by the Compartment column. The ggplot2 package is used for plotting, with geom_boxplot() creating the boxplot and theme_pubr() applied to give the plot a professional appearance. The labs() function is used to label the x-axis and y-axis, while the plot title is set with ggtitle().

The code also includes an option to add individual data points with geom_jitter(), which has been commented out for now. You can remove the comment and see what happens.

# Filter for Exp1 and create the boxplot
alpha_info_tab %>%
  filter(Study == "Exp1") %>%
  ggplot(aes(x = Compartment, y = Observed, fill = Compartment)) +
  geom_boxplot() +
  #geom_jitter(width = 0.2) +  # Adds individual data points
  theme_pubr(border = TRUE) +
  labs(x = "Compartment", y = "Observed") +
  ggtitle("Boxplot of Observed Values in Exp1 by Compartment") 

8.1 Sorting by Compartment and Adding Color

In this section, we define two methods for assigning colors to the Compartment variable in the dataset.

  1. Manual Color Assignment: The first method manually assigns specific colors to each compartment using a named vector (cols_compartment). This is useful for smaller dataset.

  2. Automatic Color Assignment: The second method uses the RColorBrewer package to automatically generate a color palette based on the unique values in the Compartment column. The brewer.pal() function generates a set of colors from the “Set3” palette, and setNames() associates the colors with the unique compartments. For more details about available color palettes, you can visit the RColorBrewer website.

cols_compartment <- c("Fecal" = "#D46C4E", "AC" = "#77A515", "TC" = "#264D59", "DC" = "#43978D")

# Using color brewer
# Get the unique compartments
unique_compartments <- unique(alpha_info_tab_filtered$Compartment)

# Automatically generate a color palette (you can use a predefined set or any other color scale)
cols_compartment_auto <- setNames(
  RColorBrewer::brewer.pal(length(unique_compartments), "Set3"), 
  unique_compartments
)

8.2 Plotting the Boxplot with Custom Colors

We filter the data for Study == "Exp1" and then reorder the Compartment factor levels manually to ensure the boxplot appears in a specific order. The plot is created using ggplot2, with compartments on the x-axis and observed values on the y-axis. We apply custom colors to the boxplot using the scale_fill_manual() function, which can either use a predefined color vector (cols_compartment) or an automatically generated color palette (cols_compartment_auto).

  • The mutate(Compartment = factor(Compartment, levels = c("Fecal", "AC", "TC", "DC"))) part ensures the compartments appear in a specific order in the boxplot.

  • geom_boxplot() creates the actual boxplot, and scale_fill_manual(values = cols_compartment) applies the custom color scheme to each compartment.

  • Finally, the plot is saved using ggsave() (although this line is commented out). Remove the comment if you want to save it.

#BoxPlot<-
alpha_info_tab %>%
  filter(Study == "Exp1") %>%
  mutate(Compartment = factor(Compartment, levels = c("Fecal", "AC", "TC", "DC"))) %>%
  ggplot(aes(x = Compartment, y = Observed, fill = Compartment)) +
  geom_boxplot() +
  theme_pubr(border = TRUE) +
  labs(x = "Compartment", y = "Observed") +
  ggtitle("Boxplot of Observed Values in Exp1 by Compartment") +
  #scale_fill_manual(values = cols_compartment_auto)
  scale_fill_manual(values = cols_compartment)

  #ggsave("BoxPlotTest.png", BoxPlot, width = 8.0, height = 5.5)

8.3 Facet Wrap and Data Transformation

In this section, we filter and transform the data using dplyr functions to create a boxplot with the facet_wrap() function in ggplot2.

  • Data Transformation:
    • We filter for Study == "Exp1" to focus on this subset of data.
    • We create a new column Units that duplicates the Fecal compartment into two entries: Unit1 and Unit2. This is achieved using the bind_rows() function to add rows to the dataset for both Unit1 and Unit2 while maintaining the original Unit for other compartments.
    • The mutate() function is used to ensure that the Compartment variable has the desired factor levels (Fecal, AC, TC, DC) for the boxplot, ensuring that the compartments are ordered.
    • We filter out any rows where the Compartment or Units columns have missing values using filter(!is.na(Compartment) & !is.na(Units)).
  • Plotting:
    • The ggplot() function is used to create the boxplot of Observed values by Compartment. We use the fill aesthetic to color the boxes by compartment, and apply a custom color scheme using scale_fill_manual().
    • The facet_wrap(~ Units, scales = "free_x") creates separate plots (facets) for each value of the Units variable. This allows us to view the distributions for Unit1 and Unit2 separately, with free x-axis scaling to accommodate different numbers of categories across the facets.

The result is a faceted boxplot showing observed values for Exp1 across the Compartment categories, separated by Units.

test <- alpha_info_tab %>%
  filter(Study == "Exp1") %>% 
  bind_rows(
    alpha_info_tab %>%
      filter(Compartment == "Fecal") %>%
      mutate(Units = "Unit1"),
    alpha_info_tab %>%
      filter(Compartment == "Fecal") %>%
      mutate(Units = "Unit2"),
    alpha_info_tab %>%
      filter(Compartment != "Fecal") %>%
      mutate(Units = as.character(Unit))  
  ) %>%
  mutate(Compartment = factor(Compartment, levels = c("Fecal", "AC", "TC", "DC"))) %>%
  filter(!is.na(Compartment) & !is.na(Units)) %>%
  ggplot(aes(x = Compartment, y = Observed, fill = Compartment)) +
  geom_boxplot() + 
  geom_jitter(width = 0.2) +  # Adds individual data points
  theme_pubr(border = TRUE) +  
  labs(x = "Compartment", y = "Observed") +
  ggtitle("Boxplot of Observed Values in Exp1 by Compartment and Units") +
  scale_fill_manual(values = cols_compartment) +  
  facet_wrap(~ Units, scales = "free_x")
test

9 Longitudinal Analysis

Here, we are visualizing the longitudinal trends of Observed (Richness) and Shannon diversity across different time points (Day) for the Exp1 study. The data is reshaped and plotted with several customizations.

  • Data Transformation:
    • We filter the data to focus on Study == "Exp1" using filter().
    • The gather() function is used to reshape the data from wide format (separate columns for Observed and Shannon) to long format, where both metrics are stored in a single column (metric), and their values in another (value).
    • The mutate() function is used to rename the metrics directly in the data, so that Observed is shown as “Richness” and Shannon is retained as “Shannon”.
  • Plotting:
    • The ggplot() function is used to create the plot with Day on the x-axis and value on the y-axis, colored by Compartment and grouped by Study and Compartment_Unit using interaction().
    • We add vertical lines at day intervals (geom_vline(xintercept = 1:23)) to visually separate different days.
    • Points (geom_point()) and lines (geom_line()) are plotted for each combination of Study, Compartment, and Unit.
    • Customizations for axis text size, legend position, and color are applied using theme() and scale_colour_manual().
    • The x-axis is customized with scale_x_discrete() to set the limits, which include the Fecal compartment and days 1 to 23.
  • Faceting:
    • The facet_grid() function is used to create separate panels for each metric (Observed and Shannon), with scales = "free_y" to allow for different y-axis scales for each metric.

This visualization provides insights into the changes in microbial diversity across different compartments and units over time, making it easier to compare the trends for both richness and Shannon diversity in the same plot.

The plot shows the longitudinal trends of Richness and Shannon diversity, faceted by metric, with custom color, line types, and axis adjustments.

# Extra

# Define the y-axis scale
#scales <- scale_y_continuous(limits = c(0, 129))

alpha_info_tab %>%
  filter(Study == "Exp1") %>%  # Filter for Exp1
  gather(key = "metric", value = "value", c("Observed", "Shannon")) %>%  # Reshape data
  mutate(
    metric = recode(metric, "Observed" = "Richness", "Shannon" = "Shannon")  # Change names directly in the data
  ) %>% 
  
  ggplot(aes(x = Day, y = value, color = Compartment, group = interaction(Study, Compartment_Unit))) +
  
  # Add vertical lines for day intervals
  geom_vline(xintercept = 1:23, linetype = 'solid', colour = "grey", alpha = 0.3) +
  
  # Plot points and lines
  geom_point(size = 4) +
  geom_line(aes(linetype = Unit)) +
  
  # Customize plot theme
  theme_pubr(border = TRUE) +
  theme(
    axis.text.x = element_text(size = 8, hjust = 0.5),
    axis.text.y = element_text(size = 8, hjust = 1),
    legend.position = "top"
  ) +
  
  # Custom color and line types
  scale_colour_manual(values = cols_compartment) +
  scale_linetype_manual(values = c("dotted", "solid", "dashed", "longdash")) +
  
  # Define x-axis labels
  scale_x_discrete(limits = c("Fecal", as.character(1:23))) +
  
  # Add labels for axes
  labs(x = "", y = "") +
  
  # Facet the plot by metric
  facet_grid(rows = vars(metric), scales = "free_y", space = "free_x") 

10 Beta Diversity: Principal Coordinate Analysis (PCoA)

In this section, we are focusing on the ordination data to visualize Beta Diversity through Principal Coordinate Analysis (PCoA).

  • Data Import:
    • The ord_DataFrame.tsv file is loaded into R using the read.table() function. This file contains the results from ordination analysis, where the rows represent individual samples and the columns represent the principal coordinates (or other variables).
    • The header = TRUE argument indicates that the first row contains column names. The row.names = 1 argument designates the first column as row names (e.g., sample IDs).
    • The sep = "\t" option specifies that the file is tab-delimited. The check.names = TRUE ensures that column names are properly formatted.
  • Displaying the Data:
    • The data is rendered in a formatted table using knitr::kable(). This function outputs the data in a readable format for the R Markdown document.
    • The kableExtra::kable_styling() function is applied to make the table more visually appealing, with bootstrap_options = c("striped", "hover") to add alternating row colors and hover effects.
ord_DataFrame <- read.table("data/raw/02_ord_DataFrame.tsv",  header=T, sep="\t", row.names=1, check.names=T)

ord_DataFrame %>%
  knitr::kable() %>%
  kableExtra::kable_styling(bootstrap_options = c("striped", "hover"))

Beta Diversity

ggplot(ord_DataFrame, aes(x = X45.7., y = X19.4., color = Compartment_Unit, shape = Unit)) +
  geom_point(size = 5) #+ 

  #scale_shape_manual(values = c(18, 19, 17)) +
  #theme_pubr(border = TRUE) +
  #coord_fixed(ratio = 1) 

10.1 Beta Diversity: PCoA Plot with Custom Colors and Shapes

In this section, we are visualizing the Beta Diversity results from the Principal Coordinate Analysis (PCoA) using a scatter plot. The plot is customized to display samples from different compartments and units, with specific color and shape assignments for better visualization.

10.1.1 Steps:

  1. Compartment Unit:
    • The unique() function is used to inspect the unique values in the Compartment_Unit column of the ordination data frame (ord_DataFrame). This helps identify the different categories that will be represented in the plot.
  2. Color Mapping:
    • The compartment_unit_col vector is defined to assign specific colors to each unique Compartment_Unit combination. The colors chosen are visually distinct for each compartment and unit combination.
  3. Plotting:
    • The ggplot() function is used to create the PCoA plot. The x and y axes are set to X45.7. and X19.4., which represent the first two principal coordinates (PC1 and PC2).
    • The aes() function is used to map the Compartment_Unit to color and Unit to shape. This allows for clear differentiation of the points based on their respective categories.
    • Horizontal and vertical dashed lines at x = 0 and y = 0 are added using geom_hline() and geom_vline() to help visually assess the distribution of the samples.
    • Points are plotted using geom_point(), and the size of the points is set to 5 for better visibility.
    • Custom shapes are applied to the points using scale_shape_manual(), with different shapes for different Unit values.
    • The theme_pubr() function is used to apply a professional plot theme with borders.
    • The aspect ratio is kept equal using coord_fixed(ratio = 1) to maintain proportionality between the axes.
  4. Customization:
    • Font sizes and legend positioning are adjusted using theme(). Axis labels, title fonts, and legend settings are customized for better readability.
    • Custom colors are applied to the points using scale_color_manual() with the compartment_unit_col palette.
  5. Labels:
    • Axis labels for the plot are set to reflect the explained variance of each principal component (PC1 [45.7%] and PC2 [19.4%]).
    • The legend is configured to show the “Compartment_Unit” and “Unit” information.
  6. Plot Display:
    • The plot is displayed using print(ordplot2) to visualize the results.
unique(ord_DataFrame$Compartment_Unit)
## [1] "Fecal" "AC1"   "AC2"   "DC1"   "DC2"   "TC1"   "TC2"
# compartment Unit
compartment_unit_col <- c("Fecal" = "#264D59", "AC1" = "#77A515", "TC1" = "#D46C4E", "DC1" = "#43978D", "AC2" = "#77A515", "TC2" = "#D46C4E", "DC2" = "#43978D")

ordplot2 <- ggplot(ord_DataFrame, aes(x = X45.7., y = X19.4., color = Compartment_Unit, shape = Unit)) +
  geom_hline(yintercept = 0, linetype = 'dashed', alpha = 0.3) + 
  geom_vline(xintercept = 0, linetype = 'dashed', alpha = 0.3) +
  geom_point(size = 5) + 
  #geom_text_repel(aes(label = Study), nudge_x = 0.06, size = 3.0, segment.alpha = 0.5) +
  scale_shape_manual(values = c(18, 19, 17)) +  
  theme_pubr(border = TRUE) +
  coord_fixed(ratio = 1) + # Keep aspect ratio 1:1
  theme(axis.text = element_text(size = 14),
        axis.text.x = element_text(size = 12, hjust = 0.5),
        axis.title.y = element_text(size = 18),
        legend.text = element_text(size = 14),
        legend.title = element_text(size = 0),
        legend.position = "bottom", #top, null etc
        axis.title.x = element_text(size = 18),
        strip.text.x = element_text(size = 20, face = "bold")) +
        scale_color_manual(values = compartment_unit_col) +
  labs(x = "PCo1 [45.7%]", y = "PCo2 [19.4%]", color = "Compartment_Unit", shape = "Unit") #+
  #annotate("text", x = min(ord_DataFrame$X45.7.), y = min(ord_DataFrame$X19.4.), 
           #label = "Exp2", hjust = 0.1, vjust = 0.4, size = 5, color = "black")

# To view the plot
print(ordplot2)

11 Heatmap: Preparing the Data

In this section, we load and prepare data for heatmap visualization. The distinction between a data frame and a matrix is important because many heatmap plotting functions, such as those in the ComplexHeatmap or pheatmap packages, require the input to be in matrix format.

After loading the data

  1. Check the Class:
    • The class() function is used to check the type of the loaded object, which will initially be a data frame.
  2. Convert to Matrix:
    • The as.matrix() function converts the data frame into a matrix format.
    • The class() function is then used to confirm that the object is now a matrix.

11.0.1 Data Frame vs. Matrix:

Feature Data Frame Matrix
Structure A list of equal-length vectors. A 2D array with elements of the same type.
Column Types Can hold different data types (e.g., numeric, character, etc.). Must hold elements of the same type.
Usage in Heatmaps Often not directly usable in heatmap functions. Preferred for heatmap plotting.
Row and Column Names Both row and column names are supported. Row and column names are supported.
Conversion Can be converted to a matrix with as.matrix(). Cannot be directly converted to a data frame.

Why Convert to a Matrix? Heatmap functions require uniform data types (e.g., numeric values) for all elements, as they rely on mathematical operations such as clustering or scaling. Matrices ensure that all entries are of the same type (typically numeric), avoiding issues caused by mixed data types in a data frame.

heat_df <- read.table("data/raw/03_HeatMap_Exp1_DCs.tsv", header=T, sep="\t", row.names=1, check.names=T)
class(heat_df)
## [1] "data.frame"
heat_mat <- as.matrix(heat_df)
class(heat_mat)
## [1] "matrix" "array"

11.1 Filtering the Top 20 Taxa for Heatmap Visualization

This code filters the heatmap data to include only the top 20 taxa based on their abundance across samples. This is a common preprocessing step to focus the visualization on the most significant or abundant features, reducing clutter in the heatmap.

Key Steps:

  1. Calculate Total Abundance:
    • A new column, Total, is added to the data frame. This column contains the row sums for each taxon, representing its total abundance across all samples.
    • Alternatively, you can use rowMeans() if you want to base the selection on average abundance instead of the sum.
  2. Filter the Top 20:
    • The arrange(desc(Total)) function sorts the taxa in descending order of their total abundance.
    • The slice(1:20) function selects the top 20 taxa with the highest abundance.
    • The select(-Total) function removes the Total column, as it is no longer needed.
  3. Convert to Matrix:
    • The filtered data frame is converted to a matrix using as.matrix() to ensure compatibility with heatmap functions.
# Step 1: Calculate row sums or another measure to identify top taxa
heat_df$Total <- rowSums(heat_df) # or use `rowMeans` if preferred

# Step 2: Filter the top 20 rows based on Total
top20_data <- heat_df %>%
  arrange(desc(Total)) %>%
  slice(1:20) %>%
  select(-Total)

# Step 3: Convert to matrix for ComplexHeatmap
data_matrix <- as.matrix(top20_data)

11.2 Plotting the Heatmap

This code generates a heatmap to visualize the abundance of the top 20 taxa across samples. Heatmaps are a powerful tool for exploring patterns, clusters, and relationships within the data.

  1. Data Matrix:
    • The data_matrix created earlier is used as the input for the heatmap. It represents the abundance of the top 20 taxa across samples.
  2. Color Mapping:
    • The colorRamp2() function defines a gradient color scheme for the heatmap:
      • Blue: Represents the minimum abundance in the dataset.
      • White: Represents the median abundance.
      • Red: Represents the maximum abundance.
  3. Row and Column Names:
    • Both row (taxa) and column (samples) names are displayed for easy interpretation.
  4. Clustering:
    • Hierarchical clustering is applied to rows (taxa) and columns (samples) using the cluster_rows = TRUE and cluster_columns = TRUE options. This groups similar taxa and samples together, revealing patterns in the data.
Heatmap(data_matrix,
        name = "Abundance",
        col = colorRamp2(c(min(data_matrix), median(data_matrix), max(data_matrix)), 
                         c("blue", "white", "red")),
        show_row_names = TRUE,
        show_column_names = TRUE,
        cluster_rows = TRUE,
        cluster_columns = TRUE
)

11.3 Adding Sample Annotations to the Heatmap

This section builds upon the previous heatmap by adding annotations for the samples, specifically their compartments. Sample annotations provide additional context to the heatmap, enhancing interpretability.

  1. Define Compartments:
    • The compartment variable labels each column (sample) in the data_matrix with its corresponding compartment.
    • In this case, all samples are labeled as "DC1".
  2. Check Matching Lengths:
    • To ensure proper alignment, the length of the compartment vector is checked against the number of columns in data_matrix.
  3. Compartment Colors:
    • A custom color palette is defined for the compartments, with "DC1" mapped to "#43978D".
  4. Create Annotations:
    • The HeatmapAnnotation() function is used to define an annotation for the heatmap. It assigns each sample to its respective compartment with the specified color.
  5. Plot Heatmap with Annotation:
    • The heatmap is generated as before, but with the addition of the top_annotation argument. This places a color-coded bar above the heatmap to represent the compartments of the samples.
# Define sample compartments
ncol(data_matrix)
## [1] 13
# Create Compartment labels for the samples, repeating "DC1" for all columns in data_matrix
compartment <- rep("DC1", ncol(data_matrix))

# Check if the lengths match
length(compartment) == ncol(data_matrix)  # Should return TRUE
## [1] TRUE
# Colors for compartments
compartment_colors <- c("DC1" = "#43978D")


# Create the annotation
sample_annotation <- HeatmapAnnotation(
    Compartment = compartment,
    col = list(Compartment = compartment_colors)
)

# Create the heatmap with the annotation
Heatmap(
  data_matrix,
  name = "Abundance",
  col = colorRamp2(
    c(min(data_matrix), median(data_matrix), max(data_matrix)),  # Color scale points
    c("blue", "white", "red")  # Colors corresponding to min, median, and max values
  ),
  show_row_names = TRUE,       # Show row names
  show_column_names = TRUE,     # Show column names
  cluster_rows = TRUE,          # Cluster rows
  cluster_columns = TRUE,       # Cluster columns
  top_annotation = HeatmapAnnotation(
    Compartment = anno_simple(
      compartment,               # Variable for compartment annotation
      col = compartment_colors   # Custom colors for compartments
    )
  )
)

11.4 Heatmap Visualization Using pheatmap

This section demonstrates how to create a heatmap using the pheatmap package. It includes customizations such as annotation and color schemes, providing a comprehensive approach to visualizing the data.

## Change colors
colsHeat<- c("#F7F7F7", "#92C5DE", "#0571B0", "#F4A582", "#CA0020")

# Create an annotation dataframe
annotation_df <- data.frame(Compartment = compartment)

# Define annotation colors
annotation_colors <- list(Compartment = c("DC1" = "#43978D"))

# Plot heatmap using pheatmap
pheatmap(data_matrix,
         cluster_cols = FALSE,
         cluster_rows = TRUE,
         #scale = "column",
         #gaps_row = 5, 
         clustering_distance_rows = "euclidean",
         clustering_distance_cols  = "euclidean",
         annotation_colors = annotation_colors, 
         annotation_col = annotation_df,  
         show_colnames = TRUE,
         color = colorRampPalette(c(colsHeat))(50),
         border_color = "#f8edeb",
         display_numbers = FALSE)

11.5 Sort and “normalization”

The code is performing a column-wise normalization on data_matrix by dividing each element in a column by the mean of that column.

Purpose of the Code The result, stored in data, is a transformed version of data_matrix where each column has been scaled so that the average value in each column is 1. This transformation is often done to normalize the data across columns to account for differences in scale or to control for sample-specific variation in abundance data.

# Row-wise normalization
#data <- apply(data_matrix, 1, function(x) { x / mean(x) })

# Column-wise normalization
data <- apply(data_matrix, 2, function(x) { x / mean(x) })

# Different normalization
#sequencing_depth <- 20000
#read_counts_matrix <- data_matrix * sequencing_depth
# Step 2: Log-normalize each value in the matrix
#data <- log10(1 + read_counts_matrix) / max(log10(1 + read_counts_matrix))

# Sort by Days
colnames(data) <- colnames(data)[order(as.numeric(gsub("Exp1_(\\d+).*", "\\1", colnames(data))))]

pheatmap(data,
         cluster_cols = FALSE,
         cluster_rows = TRUE,
         #scale = "column",
         #gaps_row = 5, 
         clustering_distance_rows = "euclidean",
         clustering_distance_cols  = "euclidean",
         annotation_colors = annotation_colors, 
         annotation_col = annotation_df,  # Add annotation dataframe here
         show_colnames = TRUE,
         color = colorRampPalette(c(colsHeat))(50),
         border_color = "#f8edeb",
         display_numbers = FALSE)

12 Interactive Heatmap with Plotly

This section demonstrates how to create an interactive heatmap using the plotly package. This approach allows users to interact with the data visualization, enhancing the analysis experience.

  1. Interactive Visualization:
    • The heatmap provides hover functionality, enabling users to view exact data values and row/column labels.
    • Facilitates zooming and panning for better exploration of large datasets.
  2. Data Inputs:
    • colnames(data): Used as labels for the x-axis, representing the columns of the data matrix.
    • rownames(data): Used as labels for the y-axis, representing the rows of the data matrix.
    • data: The actual data matrix to be visualized.
  3. Color Scheme:
    • The colors argument specifies the color palette for the heatmap. In this case, colsHeat is used as a custom color scheme for data representation.
  4. Custom Layout:
    • Adjustments are made to the layout for better readability:
      • margin = list(l = 120): Adds space on the left for row labels.
      • xaxis and yaxis: Customize axis ticks and labels for clarity.
  5. Color Bar:
    • A color bar is added to the plot, with a title ("Abundance") indicating the meaning of the color gradient.
  6. Benefits of Using Plotly:
    • Fully interactive visualizations for better user engagement.
    • Customizable layout and color settings.
    • Ideal for presenting data in web-based applications or reports.
p <- plot_ly(
    x = colnames(data), 
    y = rownames(data),   
    z = data,             
    type = "heatmap", 
    colors = colsHeat,
    showscale = TRUE,
    colorbar = list(title = "Abundance")  # Optional colorbar title
) %>%
    layout(
        margin = list(l = 120),  # Space for row names on the left
        xaxis = list(showticklabels = TRUE, ticks = ""),
        yaxis = list(tickvals = 1:nrow(data), ticktext = rownames(data))
    )

p

12.1 Heatmap with Compartment Annotation

This section demonstrates how to enhance a heatmap visualization with an additional annotation layer. Specifically, a single-row heatmap is added to indicate a compartment (e.g., “DC1”) for all columns of the main heatmap.

  1. Compartment Annotation:
    • A single-row annotation heatmap is created to display the “DC1” compartment for all samples in the data.
    • Each column corresponds to a column in the main data matrix, with the same label for all samples.
  2. Custom Colors for Annotation:
    • The colorscale is fixed to a single color (#43978D), representing “DC1.”
    • The color scale does not include a gradient since all annotation values are identical (0).
  3. Plotly Integration:
    • plot_ly is used to create both the main heatmap (p) and the annotation heatmap (p_compartment).
    • The subplot function combines these two visualizations, stacking the annotation on top of the main heatmap.
  4. Layout Adjustments:
    • Heights of the subplots are adjusted (heights = c(0.1, 0.9)), giving a smaller proportion to the annotation heatmap.
    • Axes are shared along the x-axis (shareX = TRUE), ensuring alignment between the main heatmap and the annotation strip.
# Create the compartment annotation heatmap (only DC1, shown in #43978D)
compartment_annotation <- rep(0, ncol(data))  # Only 0s for DC1

p_compartment <- plot_ly(
    x = colnames(data),
    y = "Compartment",  # Annotation row title
    z = matrix(compartment_annotation, nrow = 1),  # Annotation as a single-row matrix
    type = "heatmap",
    colorscale = list(list(0, "#43978D"), list(1, "#43978D")),  # Fixed color for DC1 only
    showscale = FALSE  # No colorbar for annotation
) %>%
    layout(
        xaxis = list(showticklabels = TRUE, ticks = ""),
        yaxis = list(
            tickvals = c(0),       # Position the annotation title
            ticktext = "Compartment"
        )
    )

# Combine the main heatmap and the annotation strip
subplot(p_compartment, p, nrows = 2, heights = c(0.1, 0.9), shareX = TRUE)

13 Extra: Animation and Iteractive

Extra activities for those who finish the activities or for those who want to do homework

13.1 Extra: Animation

Animate the PCoA Plot for Beta Diversity

ord_DataFrame$Compartment_Day <- paste(ord_DataFrame$Compartment_Unit, ord_DataFrame$Day, sep = " - ")

# Ensure 'Day' is numeric and sorted
ord_DataFrame$Day <- as.numeric(as.character(ord_DataFrame$Day))
ord_DataFrame <- ord_DataFrame[order(ord_DataFrame$Day), ]

# Create the animation
anim <- ggplot(ord_DataFrame, aes(x = X45.7., y = X19.4., color = Compartment, shape = Unit)) +
  geom_point(size=5, alpha=0.8) +
  geom_text_repel(aes(label = sprintf("%.1f", Day)), nudge_x = 0.06, size = 5.0, segment.alpha = 0.8) +  # Format Day to 1 decimal
  scale_color_manual(values = cols_compartment) +
  scale_shape_manual(values = c(18, 19, 17)) +
  theme_pubr(border = TRUE) +
  theme(legend.position="right") +
  transition_states(Day) +  # transition in numeric order of Day
  facet_wrap(~ Study) +
  ease_aes('linear')

animate(anim, duration = 25, fps = 20, renderer = gifski_renderer())

anim_save("ord.gif")

13.2 Interactive Heatmap Using a Different Package

This section showcases the use of the heatmaply package to create an interactive heatmap with enhanced functionality. The focus is on exploring data visually, including options for clustering, color customization, and annotations.

heatmaply(data,
          show_dendrogram = c(TRUE, FALSE),  # Cluster rows and columns
          column_text_angle = 90,  # Orientation of the column labels
          colors = colorRampPalette(c("#F7F7F7", "#92C5DE", "#0571B0", "#F4A582", "#CA0020"))(50), 
          show_row_names = TRUE,
          show_column_names = FALSE,
          annotation_col = data.frame(Compartment = compartment_annotation), 
          row_text_angle = 0)

14 Packages and versions

sessionInfo()
## R version 4.4.1 (2024-06-14)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sonoma 14.3
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Europe/Copenhagen
## tzcode source: internal
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] gifski_1.32.0-1       gganimate_1.0.9       circlize_0.4.16      
##  [4] heatmaply_1.5.0       viridis_0.6.5         viridisLite_0.4.2    
##  [7] bookdown_0.40         plotly_4.10.4         ComplexHeatmap_2.20.0
## [10] kableExtra_1.4.0      dplyr_1.1.4           DT_0.33              
## [13] patchwork_1.3.0       RColorBrewer_1.1-3    pheatmap_1.0.12      
## [16] ggh4x_0.2.8           ggrepel_0.9.6         reshape2_1.4.4       
## [19] tidyr_1.3.1           ggpubr_0.6.0          ggplot2_3.5.1        
## 
## loaded via a namespace (and not attached):
##  [1] gridExtra_2.3       rlang_1.1.4         magrittr_2.0.3     
##  [4] clue_0.3-65         GetoptLong_1.0.5    matrixStats_1.4.1  
##  [7] compiler_4.4.1      png_0.1-8           systemfonts_1.1.0  
## [10] vctrs_0.6.5         stringr_1.5.1       pkgconfig_2.0.3    
## [13] shape_1.4.6.1       crayon_1.5.3        fastmap_1.2.0      
## [16] backports_1.5.0     labeling_0.4.3      ca_0.71.1          
## [19] utf8_1.2.4          rmarkdown_2.28      purrr_1.0.2        
## [22] xfun_0.48           cachem_1.1.0        jsonlite_1.8.9     
## [25] progress_1.2.3      highr_0.11          tweenr_2.0.3       
## [28] prettyunits_1.2.0   broom_1.0.7         parallel_4.4.1     
## [31] cluster_2.1.6       R6_2.5.1            bslib_0.8.0        
## [34] stringi_1.8.4       car_3.1-3           jquerylib_0.1.4    
## [37] Rcpp_1.0.13         assertthat_0.2.1    iterators_1.0.14   
## [40] knitr_1.48          IRanges_2.38.1      tidyselect_1.2.1   
## [43] rstudioapi_0.16.0   abind_1.4-8         yaml_2.3.10        
## [46] TSP_1.2-4           doParallel_1.0.17   codetools_0.2-20   
## [49] tibble_3.2.1        plyr_1.8.9          withr_3.0.1        
## [52] evaluate_1.0.0      xml2_1.3.6          pillar_1.9.0       
## [55] carData_3.0-5       foreach_1.5.2       stats4_4.4.1       
## [58] generics_0.1.3      hms_1.1.3           S4Vectors_0.42.1   
## [61] munsell_0.5.1       scales_1.3.0        glue_1.8.0         
## [64] lazyeval_0.2.2      tools_4.4.1         dendextend_1.18.0  
## [67] data.table_1.16.0   webshot_0.5.5       ggsignif_0.6.4     
## [70] registry_0.5-1      crosstalk_1.2.1     seriation_1.5.6    
## [73] colorspace_2.1-1    Formula_1.2-5       cli_3.6.3          
## [76] fansi_1.0.6         svglite_2.1.3       gtable_0.3.5       
## [79] rstatix_0.7.2       sass_0.4.9          digest_0.6.37      
## [82] BiocGenerics_0.50.0 rjson_0.2.23        htmlwidgets_1.6.4  
## [85] farver_2.1.2        htmltools_0.5.8.1   lifecycle_1.0.4    
## [88] httr_1.4.7          GlobalOptions_0.1.2