About

The course

At the Data Science platform we are integrating with the research data management team and building pipelines for our researchers at DTU - Biosustain. To that end, we build and deploy scalable, portable, and reproducible bioinformatics workflows in Azure using Nextflow. Nextflow is a workflow orchestration tool designed for developing and executing reproducible and scalable data analysis pipelines. Its workflow scripting language is based on Groovy, allowing for easy integration with existing software and tools; however, you do not need to learn Groovy to get started with Nextflow.
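To make this concrete, here is a minimal sketch of what a Nextflow (DSL2) script looks like; the file name hello.nf, the process name SAY_HELLO, and the greeting values are illustrative assumptions, not part of the course material:

    // hello.nf - a minimal illustrative Nextflow (DSL2) script
    nextflow.enable.dsl = 2

    // A process wraps a command; each input value runs as its own task.
    process SAY_HELLO {
        input:
        val name

        output:
        stdout

        script:
        """
        echo "Hello, ${name}!"
        """
    }

    // Channels feed values into processes; here two values run in parallel.
    workflow {
        names_ch = Channel.of('DTU', 'Biosustain')
        SAY_HELLO(names_ch) | view
    }

Running nextflow run hello.nf would print both greetings; the order may vary because the two tasks execute concurrently.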

Nextflow supports parallelization and can run on multiple computing environments, including local systems, high-performance clusters, and cloud platforms. Nextflow has a strong community behind it in the bioinformatics field and is supported by the nf-core project, which provides a large repository of pre-built, community-maintained bioinformatics pipelines optimized for Nextflow. Mainly for these characteristics it has been our workflow language of choice, and we would like to teach it to our fellows at DTU.
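This portability comes from the configuration file: the same pipeline can be pointed at different infrastructures without changing the script. The sketch below uses hypothetical profile and queue names; the executor identifiers (local, slurm, azurebatch) are standard Nextflow executors:

    // nextflow.config - a sketch of execution profiles (names are illustrative)
    profiles {
        standard {
            process.executor = 'local'      // run tasks on the local machine
        }
        cluster {
            process.executor = 'slurm'      // submit tasks to a SLURM HPC scheduler
            process.queue    = 'workq'      // hypothetical queue name
        }
        cloud {
            process.executor = 'azurebatch' // run tasks on Azure Batch
        }
    }

The same script then runs unchanged with, for example, nextflow run hello.nf -profile cluster.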

Objectives

In this course you will learn:

  • What is Nextflow?

  • Why is it needed?

  • Nextflow concepts (e.g. channels, processes, operators, parallelism, reentrancy, reusability)

  • Core features (e.g. Portability, Scalability, Reproducibility, Modularity)

  • Installing Nextflow (requirements; you will use a prepared dev environment; see the command sketch after this list)

  • Write and run your first Nextflow script (script file, config file, results, working directory)

  • Run a Nextflow pipeline in a Docker container (for example, a short version of an rnaseq pipeline)

  • Seqera Platform (monitoring your pipeline execution)

  • nf-core community (113 standardized bioinformatics workflows)

  • Resources to keep learning
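The sketch below shows the typical commands behind the installation and pipeline-run steps, assuming a Unix-like shell with a recent Java runtime available; exact versions and pipeline options will differ in the prepared dev environment:

    # Install Nextflow (requires a recent Java runtime)
    curl -s https://get.nextflow.io | bash

    # Verify the installation with the classic hello pipeline
    ./nextflow run hello

    # Run a short test of the nf-core rnaseq pipeline inside Docker
    ./nextflow run nf-core/rnaseq -profile test,docker --outdir results

    # Optionally monitor the run in Seqera Platform (requires an access token)
    # ./nextflow run nf-core/rnaseq -profile test,docker --outdir results -with-tower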

Data Science platform

Data Science has become an essential component in both academia and industry, accelerating the extraction of insights from the datasets being generated. As a strategy to integrate high-level analytics at DTU - Biosustain, we created a centralized Data Science platform (DSP) that supports our researchers while promoting standardized data and data processes.

The DSP team aims to make data science more accessible and inclusive, not only at DTU Biosustain but across the wider DTU community. The platform follows a data-centric approach that treats data infrastructure, processes, and outputs as ongoing, evolving products rather than one-time projects. Each data product is designed as a multidisciplinary collaboration that spans the entire data lifecycle, pursues standardization and automation, and keeps data usage and reuse in mind.

The DSP is based on four pillars:

– Support: assisting our research fellows with statistics, programming, data analytics, and machine learning

– Education: co-organizing Data club with DTU - Bioengineering and organizing Data Science workshops

– Innovation: introducing researchers to new computational biology methods and technologies

– Tooling: implementing standard, open-source tools

You can contact us via the Data Science platform email.