3.3 Loading Count Data

The nf-core/rnaseq pipeline was run with -profile prokaryotic, which uses Bowtie2 for alignment and Salmon for quantification. Since E. coli has no spliceosomal introns*, splice-aware aligners like STAR are unnecessary. We load the SummarizedExperiment object produced by the pipeline, extract the raw count matrix, and assign gene symbols as row names.

*Dai & Zimmerly (2002), "The dispersal of five group II introns among natural populations of Escherichia coli": at least five distinct group II introns exist naturally in E. coli strains. These are self-splicing group II introns (retroelements), not spliceosomal introns like those in eukaryotes, so they do not affect RNA-seq quantification the way eukaryotic introns do. This is why Bowtie2, which is not splice-aware, works fine for E. coli.

⭐ Important: Raw counts must be supplied as integers; DESeq2’s statistical model requires this.
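To illustrate why the integer conversion deserves care, here is a minimal sketch on a made-up matrix of fractional estimated counts (Salmon typically reports non-integer values). The object names `toy_counts`, `truncated` and `rounded` and all numbers are invented for illustration and are not taken from the pipeline output. It compares `as.integer()` applied column by column, which truncates toward zero, with rounding followed by a change of storage mode.

```r
# Toy matrix of fractional estimated counts (values invented for illustration).
toy_counts <- matrix(
  c(10.7, 0.4, 3.6, 99.9),
  nrow = 2,
  dimnames = list(c("gene_a", "gene_b"), c("sample_1", "sample_2"))
)

# as.integer() truncates toward zero; used via apply() it also drops row names.
truncated <- apply(toy_counts, 2, as.integer)

# Rounding first, then switching the storage mode, keeps the dimnames intact.
rounded <- round(toy_counts)
storage.mode(rounded) <- "integer"

print(truncated)  # 10.7 becomes 10, 99.9 becomes 99
print(rounded)    # 10.7 becomes 11, 99.9 becomes 100
```

Whether truncation or rounding is preferable depends on the analysis, but rounding is the more common choice when estimated counts are passed to count-based models.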

💡 Tip: If you are unsure which assay name or rowData columns are available in your RDS, inspect them first with assayNames(count_x) and names(rowData(count_x)).

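# Load the gene-level SummarizedExperiment written by nf-core/rnaseq.
# Assumes the SummarizedExperiment package is already attached and that
# git_root points at the repository root.
count_x <- readRDS(
  file.path(git_root, "data", "nf-core_rnaseq",
            "salmon.merged.gene.SummarizedExperiment.rds")
)

# Take the first assay, assumed here to hold the counts; confirm with assayNames(count_x).
count_genes <- assay(count_x, assayNames(count_x)[1])

# Gene annotation columns used to build the row names
gene_symbols <- rowData(count_x)$gene_name
gene_ids     <- rowData(count_x)$gene_id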

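# Prefer the gene symbol; fall back to the gene ID when the symbol is missing or empty.
# De-duplicate once, after combining, so the final row names are guaranteed unique.
gene_symbols_saved <- make.unique(ifelse(
  !is.na(gene_symbols) & nzchar(gene_symbols),
  as.character(gene_symbols),
  as.character(gene_ids)
))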

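# Salmon reports estimated (possibly fractional) counts, so round before
# converting; as.integer() alone would truncate toward zero.
count_genes               <- round(as.matrix(count_genes))
storage.mode(count_genes) <- "integer"
rownames(count_genes)     <- gene_symbols_saved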

cat("Dimensions (genes × samples):", dim(count_genes), "\n")
## Dimensions (genes × samples): 4523 6
print(head(rownames(count_genes)))
## [1] "b0001" "b0002" "b0003" "b0004" "b0005" "b0006"
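
A few quick checks after loading can catch problems before the matrix reaches DESeq2. The following is a minimal sketch rather than part of the pipeline: `make_gene_labels()` and `check_count_matrix()` are hypothetical helpers, and the toy symbols, IDs and counts are invented. It reproduces the symbol-with-ID-fallback logic on a small example, then verifies that the resulting matrix is integer-typed with unique, non-empty row names.

```r
# Hypothetical helper: prefer the gene symbol, fall back to the gene ID,
# then de-duplicate the combined vector in a single pass.
make_gene_labels <- function(symbols, ids) {
  symbols    <- as.character(symbols)
  ids        <- as.character(ids)
  has_symbol <- !is.na(symbols) & nzchar(symbols)
  make.unique(ifelse(has_symbol, symbols, ids))
}

# Hypothetical helper: raise an error if the matrix is not ready for DESeq2.
check_count_matrix <- function(m) {
  stopifnot(
    is.matrix(m),
    is.integer(m),                       # integer storage, not double
    !anyNA(m),                           # no missing counts
    all(m >= 0L),                        # counts cannot be negative
    !is.null(rownames(m)),
    !anyNA(rownames(m)),
    all(nzchar(rownames(m))),            # no empty labels
    anyDuplicated(rownames(m)) == 0L     # labels are unique
  )
  invisible(TRUE)
}

# Toy annotation: one empty symbol, one missing symbol, one duplicated symbol.
toy_symbols <- c("geneA", "", NA, "geneA")
toy_ids     <- c("b0001", "b0002", "b0003", "b0004")

toy_labels <- make_gene_labels(toy_symbols, toy_ids)
print(toy_labels)
## [1] "geneA"   "b0002"   "b0003"   "geneA.1"

# Toy integer count matrix using those labels as row names.
toy_matrix <- matrix(
  1:8,
  nrow = 4,
  dimnames = list(toy_labels, c("sample_1", "sample_2"))
)
check_count_matrix(toy_matrix)  # passes silently
```

In a real session the same check could be applied to `count_genes` directly; a failure points to either non-integer storage or a problem with the row names.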