4.3 Loading Count Data

The nf-core/rnaseq pipeline was run with -profile prokaryotic, which uses Bowtie2 for alignment and Salmon for quantification. Because E. coli has no spliceosomal introns*, splice-aware aligners such as STAR are unnecessary. We load the SummarizedExperiment object produced by the pipeline, extract the raw count matrix, and assign gene symbols as row names, falling back to gene IDs where no symbol is available.

* Strictly speaking, E. coli is not entirely intron-free. Dai & Zimmerly (2002), "The dispersal of five group II introns among natural populations of Escherichia coli", report at least five distinct group II introns occurring naturally in E. coli strains. These are self-splicing group II introns (retroelements), not spliceosomal introns as in eukaryotes, so they do not affect RNA-seq quantification the way eukaryotic introns do. This is why a non-splice-aware aligner such as Bowtie2 works well for E. coli.

⭐ Important: Raw counts must remain integers, because DESeq2’s statistical model requires integer counts.
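
Salmon reports estimated counts, which are generally fractional, so converting them to integers involves a choice. The sketch below uses a made-up 2 × 2 matrix (`est` and its values are illustrative, not pipeline output) to contrast truncation with as.integer() against rounding with round(). Rounding is what DESeq2's DESeqDataSetFromTximport() does; which option suits a given analysis is a judgment call.

```r
# Toy matrix of fractional "estimated counts" (values are made up)
est <- matrix(
  c(10.7, 0.4, 3.6, 99.99),
  nrow = 2,
  dimnames = list(c("geneA", "geneB"), c("sample1", "sample2"))
)

# Option 1: truncate toward zero; as.integer() also drops the row names
truncated <- apply(est, 2, as.integer)

# Option 2: round to the nearest integer, then switch the storage mode
# so the matrix is integer-typed while keeping its dimnames
rounded <- round(est)
storage.mode(rounded) <- "integer"

print(truncated)
print(rounded)
```

With truncation, 10.7 becomes 10 and 99.99 becomes 99; with rounding they become 11 and 100. Over many genes, truncation shifts counts systematically downward, which matters most for lowly expressed genes.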

💡 Tip: If you are unsure which assay name or rowData columns are available in your RDS, inspect them first with assayNames(count_x) and names(rowData(count_x)).
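
As a minimal sketch of that inspection step, the example below builds a small stand-in SummarizedExperiment and lists its assays and rowData columns. The object `toy_se`, its assay names, and all values are hypothetical; the real RDS may contain different assays and columns.

```r
library(SummarizedExperiment)

# Toy object standing in for the pipeline output (all names and values are made up)
toy_se <- SummarizedExperiment(
  assays = list(
    counts    = matrix(c(5, 0, 12, 7), nrow = 2),
    abundance = matrix(c(1.5, 0, 3.2, 2.1), nrow = 2)
  ),
  rowData = S4Vectors::DataFrame(
    gene_id   = c("id1", "id2"),
    gene_name = c("geneA", "geneB")
  )
)

print(assayNames(toy_se))      # which assays are stored
print(names(rowData(toy_se)))  # which per-gene annotation columns exist

# Selecting the assay by name is less fragile than relying on its position
assay_name <- if ("counts" %in% assayNames(toy_se)) "counts" else assayNames(toy_se)[1]
toy_counts <- assay(toy_se, assay_name)
```

Selecting by name guards against the count assay not being the first one stored in the object.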

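library(SummarizedExperiment)  # provides assay(), assayNames(), rowData()

# git_root is assumed to have been defined earlier as the repository root path
count_x <- readRDS(
  file.path(git_root, "data", "nf-core_rnaseq",
            "salmon.merged.gene.SummarizedExperiment.rds")
)

# Take the first assay; check assayNames(count_x) to confirm it holds the counts
count_genes <- assay(count_x, assayNames(count_x)[1])

# Per-gene annotation used for the row names
gene_symbols <- rowData(count_x)$gene_name
gene_ids     <- rowData(count_x)$gene_id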

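# Prefer the gene symbol; fall back to the gene ID when the symbol is missing or empty.
# make.unique() runs after the choice so the final row names cannot collide.
has_symbol <- !is.na(gene_symbols) & nzchar(as.character(gene_symbols))
gene_symbols_saved <- make.unique(ifelse(
  has_symbol,
  as.character(gene_symbols),
  as.character(gene_ids)
))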

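# Convert each column to integer. as.integer() truncates toward zero and
# drops the original row names, so they are reassigned on the next line.
count_genes           <- apply(count_genes, 2, as.integer)
rownames(count_genes) <- gene_symbols_saved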

cat("Dimensions (genes × samples):", dim(count_genes), "\n")
## Dimensions (genes × samples): 4523 6
print(head(rownames(count_genes)))
## [1] "b0001" "b0002" "b0003" "b0004" "b0005" "b0006"
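
The fallback from gene symbol to gene ID can be checked in isolation on a toy example. The sketch below shows one way to build unique row labels: choose symbol or ID per gene first, then call make.unique() once on the combined result. The vectors `symbols` and `ids` are made-up values, not taken from the pipeline output.

```r
# Toy annotation vectors (made-up values, not from the pipeline output)
symbols <- c("geneA", "geneB", NA, "", "geneB")
ids     <- c("id1", "id2", "id3", "id4", "id5")

# TRUE where a usable symbol exists (neither NA nor an empty string)
has_symbol <- !is.na(symbols) & nzchar(symbols)

# Choose symbol or ID per gene, then de-duplicate the combined vector once
labels <- make.unique(ifelse(has_symbol, symbols, ids))

print(labels)
## [1] "geneA"   "geneB"   "id3"     "id4"     "geneB.1"

# Row names must be unique; anyDuplicated() returns 0 when they are
stopifnot(anyDuplicated(labels) == 0)
```

Applying make.unique() after the choice, rather than to symbols and IDs separately, guarantees that the final vector has no duplicates even if a symbol happens to coincide with another gene's ID.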