4.10 Detecting Outliers
The PCA above uses only the 500 most variable genes (the default ntop = 500 in DESeq2's plotPCA()). Here we run PCA on the full VST matrix and inspect a scree plot, a sample plot, and a biplot to assess whether any single sample accounts for an unusual share of the variance, a common sign of a technical outlier.
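Before plotting, a numeric check can complement the visual inspection. The sketch below is a minimal, self-contained illustration on simulated data: the matrix `expr` and the inflated C3 sample are hypothetical stand-ins for `t(assay(vsd))`, not the real data. It computes the proportion of variance per PC and each sample's share of the PC1 and PC2 variance. A sample that accounts for most of a leading PC on its own is worth a closer look.

```r
# Simulated stand-in for t(assay(vsd)): 6 samples x 200 genes (hypothetical data)
set.seed(1)
samples <- c("C1", "C2", "C3", "sac1", "sac2", "sac3")
expr <- matrix(rnorm(6 * 200), nrow = 6, dimnames = list(samples, NULL))

# Inflate the noise in one sample to mimic a technical outlier (assumption)
expr["C3", ] <- expr["C3", ] + rnorm(200, mean = 0, sd = 4)

# Same call as in this section: genes are centred but not scaled
pca_demo <- prcomp(expr)

# Proportion of variance explained by each PC (what a scree plot displays)
var_explained <- pca_demo$sdev^2 / sum(pca_demo$sdev^2)
round(100 * var_explained, 1)

# Each sample's share of the variance on PC1 and PC2:
# squared score divided by the sum of squared scores on that PC
scores  <- pca_demo$x[, 1:2]
contrib <- sweep(scores^2, 2, colSums(scores^2), "/")
round(100 * contrib, 1)

# Sample with the largest share of PC1
top_sample <- rownames(contrib)[which.max(contrib[, "PC1"])]
top_sample
```

On the real data, the same calculations could be applied to `pca_full$sdev` and `pca_full$x`. What counts as too large a share is a judgment call rather than a fixed threshold.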
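# PCA on all genes, samples as rows (assumes DESeq2 and factoextra are loaded)
pca_full <- prcomp(t(assay(vsd)))

# Scree plot: percentage of variance explained by each PC
screeplot <- fviz_screeplot(pca_full, addlabels = TRUE,
                            main = "Scree plot: variance per PC")
screeplot

# Sample positions on PC1 and PC2, with non-overlapping labels
pca_ind <- fviz_pca_ind(pca_full, geom = c("point", "text"), repel = TRUE,
                        title = "PCA: sample positions (full gene matrix)")
pca_ind

# Biplot: samples plus the 50 genes contributing most to the plotted PCs
pca_biplot <- fviz_pca_biplot(pca_full,
                              repel = TRUE,
                              select.var = list(contrib = 50), # top genes only
                              title = "Biplot: top 50 contributing genes and samples",
                              col.var = "#92C5DE",
                              col.ind = "black")
pca_biplot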
ggsave(
  filename = file.path(git_root, "results", "plots", "pca_screeplot.png"),
  plot = screeplot,
  width = 8,
  height = 6,
  dpi = 300
)
ggsave(
  filename = file.path(git_root, "results", "plots", "pca_individuals.png"),
  plot = pca_ind,
  width = 8,
  height = 6,
  dpi = 300
)
ggsave(
  filename = file.path(git_root, "results", "plots", "pca_biplot.png"),
  plot = pca_biplot,
  width = 8,
  height = 6,
  dpi = 300
)

How to read a biplot:
Dots = samples (C1, C2, C3, sac1, sac2, sac3)
Arrows/lines = genes. The direction of an arrow points toward the samples in which that gene is more highly expressed, and its length shows how strongly the gene contributes to the two plotted PCs:
- Genes pointing right → higher expression in treatment (the sign of a PC is arbitrary, so confirm which side each group falls on)
- Genes pointing left → higher expression in control
- Genes pointing up/down → contribute more to PC2 (here, within-condition variation)
- Genes near the centre → contribute little to either PC
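The reading rules above can be checked numerically. The sketch below uses simulated data: `expr`, `gene1`, and `gene2` are hypothetical, with `gene1` raised in the treatment samples and `gene2` raised in the controls by construction. It assumes the arrow coordinates are the `prcomp` loadings scaled by each PC's standard deviation, which is my understanding of how factoextra draws variables for a `prcomp` object. Because the sign of a PC is arbitrary, the code first works out which side of PC1 the treatment samples fall on.

```r
# Simulated data: 6 samples x 50 genes (hypothetical, for illustration only)
set.seed(42)
samples <- c("C1", "C2", "C3", "sac1", "sac2", "sac3")
genes   <- paste0("gene", 1:50)
expr <- matrix(rnorm(6 * 50, sd = 0.3), nrow = 6,
               dimnames = list(samples, genes))
expr[4:6, "gene1"] <- expr[4:6, "gene1"] + 5  # higher in treatment
expr[1:3, "gene2"] <- expr[1:3, "gene2"] + 5  # higher in control

pca_demo <- prcomp(expr)

# Arrow coordinates on PC1/PC2: loadings scaled by the PC standard deviations
arrows_xy <- sweep(pca_demo$rotation[, 1:2], 2, pca_demo$sdev[1:2], "*")

# Arrow length in the PC1/PC2 plane
arrow_len <- sqrt(rowSums(arrows_xy^2))

# Side of PC1 occupied by the treatment samples (+1 = right, -1 = left)
trt_side <- sign(mean(pca_demo$x[4:6, "PC1"]))

# TRUE for genes whose arrow points toward the treatment side of PC1
toward_trt <- sign(arrows_xy[, "PC1"]) == trt_side

# Longest arrows and the side they point to
longest <- names(sort(arrow_len, decreasing = TRUE))[1:2]
data.frame(gene = longest, toward_treatment = toward_trt[longest])
```

In this toy example the two longest arrows belong to the two genes that were shifted, and they point to opposite sides of PC1, matching the groups in which each gene was raised.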
What to look for:
- Genes with long arrows along PC1 are your strongest candidates for driving the treatment response, and they are likely to appear as significant DE genes
- If many arrows point in the same direction, this suggests coordinated regulation (possibly a pathway-level response)
- A gene pointing toward C3 specifically (along PC2) could help explain the C3 separation flagged in the PCA and heatmap
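These checks can also be done without the plot by ranking genes on their contribution to each PC. The sketch below is a simulated illustration only: `expr`, the block `gene1` to `gene10`, and `gene50` are hypothetical. It uses the fact that the loadings of each PC form a unit vector, so the squared loadings sum to 1 and can be read as percentage contributions. A block of co-shifted genes should dominate PC1, and a gene raised only in C3 should dominate PC2 and point toward C3.

```r
# Simulated data: 6 samples x 100 genes (hypothetical, for illustration only)
set.seed(7)
samples <- c("C1", "C2", "C3", "sac1", "sac2", "sac3")
genes   <- paste0("gene", 1:100)
expr <- matrix(rnorm(6 * 100, sd = 0.2), nrow = 6,
               dimnames = list(samples, genes))
expr[4:6, 1:10] <- expr[4:6, 1:10] + 4               # coordinated treatment response
expr["C3", "gene50"] <- expr["C3", "gene50"] + 6     # raised in C3 only

pca_demo <- prcomp(expr)

# Percentage contribution of each gene to PC1 and PC2 (squared loadings)
contrib_pc1 <- 100 * pca_demo$rotation[, "PC1"]^2
contrib_pc2 <- 100 * pca_demo$rotation[, "PC2"]^2

# Ten strongest PC1 contributors: candidates for the treatment response
top_pc1 <- names(sort(contrib_pc1, decreasing = TRUE))[1:10]
top_pc1

# Strongest PC2 contributor and whether its arrow points toward C3
top_pc2 <- names(which.max(contrib_pc2))
points_to_c3 <- sign(pca_demo$rotation[top_pc2, "PC2"]) ==
  sign(pca_demo$x["C3", "PC2"])
c(gene = top_pc2, points_to_C3 = points_to_c3)
```

On the real data, `pca_full$rotation` and `pca_full$x` would take the place of the simulated objects. A high contribution marks a gene as a candidate for follow-up and does not replace the formal DE test.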