5.1 Differential Expression Analysis
⚠️ Warning: Always use raw integer counts as input. Do not use TPM, FPKM, or any normalised values — DESeq2 handles normalisation internally.
DESeq2 fits a negative binomial model to the raw counts and performs a Wald test for each gene. Internally it runs three steps in sequence: size factor estimation (normalisation), dispersion estimation, and the Wald test. These can also be run separately, but DESeq() handles all three in one call.
What DESeq2 does internally — three steps:
1. Size factor estimation (normalisation)
Each sample gets a size factor that accounts for differences in sequencing depth. DESeq2 uses the median-of-ratios method: for each gene, it computes the ratio of each sample’s count to the geometric mean count across all samples, then takes the median ratio across all genes as the size factor. This is more robust than simply dividing by total counts because it is not affected by a small number of highly expressed genes dominating the library.
2. Dispersion estimation
RNA-seq count data is overdispersed — the variance is greater than the mean (unlike a Poisson distribution which assumes variance = mean). DESeq2 models this with a negative binomial distribution and estimates a dispersion parameter per gene, which describes how much counts vary between replicates beyond what is expected by chance. Genes with few replicates get their dispersion estimates shrunk toward a fitted trend across all genes (empirical Bayes shrinkage) — this stabilises estimates for genes with low counts.
3. Wald test
For each gene, DESeq2 fits the negative binomial model and tests whether the log2 fold change between conditions is significantly different from zero using a Wald test. The resulting p-values are then corrected for multiple testing using the Benjamini-Hochberg procedure to control the false discovery rate (FDR).
Why does the order matter?
resultsNames(dds) shows condition_treatment_vs_control — this means the
fold change is calculated as treatment / control:
- Positive LFC → gene is upregulated in treatment
- Negative LFC → gene is downregulated in treatment
If control were not set as the reference level, the fold change would be inverted and your biological interpretation would be backwards.
git_root <- system("git rev-parse --show-toplevel", intern = TRUE)
dds <- readRDS(file.path(git_root, "results", "rds", "dds_ecoli_MG1655.rds"))
cat("✅ DDS loaded\n")## ✅ DDS loaded
## Dimensions : 3698 genes × 6 samples
## Conditions : control vs treatment
📌 Remember: The variable of interest should be at the end of the design formula, and the control group must be the first (reference) factor level. Both were set in the QC script.
Check the results
## [1] "Intercept" "condition_treatment_vs_control"