First Script

We will write a basic nextflow script to perform QC on sequencing reads using FastQC.

Before getting started with the nextflow script, add the tools needed for todays container (overwrite yesterdays assignment/save elsewhere). We will be working from our local clone of the rtp_workshop repository today.

name: test_env
channels:
 - bioconda
dependencies:
 - fastqc
 - multiqc
 - gffread
 - kallisto

FROM nfcore/base:1.14
LABEL authors="Barry Digby" \
    description="Docker container containing fastqc"

WORKDIR ./
COPY environment.yml ./
RUN conda env create -f environment.yml && conda clean -a
ENV PATH /opt/conda/envs/test_env/bin:$PATH

Your local directory of the repository should look like:

barry@YT-1300:/data/github/test$ ls -la
total 321416
drwxrwxr-x  4 barry barry      4096 Dec 13 14:22 .
drwxrwxr-x 21 barry barry      4096 Dec 13 16:50 ..
-rw-rw-r--  1 barry barry       245 Dec 13 12:20 Dockerfile
-rw-rw-r--  1 barry barry        61 Dec 13 12:20 environment.yml
drwxrwxr-x  8 barry barry      4096 Dec 13 15:07 .git
drwxrwxr-x  3 barry barry      4096 Dec 13 14:22 .github
-rw-rw-r--  1 barry barry         6 Dec 13 14:06 .gitignore
-rw-rw-r--  1 barry barry        32 Dec 13 15:07 README.md
-rwxrwxr-x  1 barry barry 329093120 Dec 13 13:48 test.img

Warning

We will build on this directory as the day goes on - make sure you have everything in order now.

Scripting Language

Nextflow scripts use groovy as the main scripting language however, the script body within processes are polyglot - one of the main attractions of nextflow.

#!/usr/bin/env nextflow

params.foo = "String"
params.bar = 5

println params.foo.size()

process TEST{

    echo true

    input:
    val(foo) from params.foo
    val(bar) from params.bar

    script:
    """
    echo "Script body printing foo: $foo, bar: $bar"
    """
}

Save the script to a file test.nf and run it using nextflow run test.nf:

nextflow run test.nf
N E X T F L O W  ~  version 21.04.1
Launching `test.nf` [nice_austin] - revision: 56da2768ff
6
executor >  local (1)
[ab/90ba6d] process > TEST [100%] 1 of 1 ✔
Script body printing foo: String, bar: 5

Warning

Please use 4 whitespaces as indentation for process blocks. Do not use tabs.

Notice that the scripting language outside of the process (println) is written in groovy. The process body script automatically uses bash - but we can perscribe a different language using a shebang line:

#!/usr/bin/env nextflow

params.foo = "String"
params.bar = 5

println params.foo.size()

process TEST{

    echo true

    input:
    val(foo) from params.foo
    val(bar) from params.bar

    script:
    """
    #!/usr/bin/perl

    print scalar reverse ("Script body printing foo:, $foo, bar:, $bar")
    """
}

nextflow run test.nf
N E X T F L O W  ~  version 21.04.1
Launching `test.nf` [gloomy_perlman] - revision: 6e0da47179
6
executor >  local (1)
[17/92a7c9] process > TEST [100%] 1 of 1 ✔
5 ,:rab ,gnirtS ,:oof gnitnirp ydob tpircS

Channels

Channels are used to stage files and values in nextflow. There are two types of channels - queue channels and value channels. Broadly speaking, queue channels are used to connect files to processes and cannot be reused. Value channels on the other hand hold a value (or file value - i.e a path to a file), and can be re-used mutliple times.

Let’s use some simulated RNA-Seq reads:

wget https://github.com/BarryDigby/circ_data/releases/download/RTP/test-datasets.tar.gz && tar -xvzf test-datasets.tar.gz

ls -la test-datasets/fastq
total 151M
-rw-rw-r-- 1 barry 11M Nov 22 12:16 fust1_rep1_1.fastq.gz
-rw-rw-r-- 1 barry 12M Nov 22 12:16 fust1_rep1_2.fastq.gz
-rw-rw-r-- 1 barry 14M Nov 22 12:16 fust1_rep2_1.fastq.gz
-rw-rw-r-- 1 barry 15M Nov 22 12:16 fust1_rep2_2.fastq.gz
-rw-rw-r-- 1 barry 14M Nov 22 12:16 fust1_rep3_1.fastq.gz
-rw-rw-r-- 1 barry 16M Nov 22 12:16 fust1_rep3_2.fastq.gz
-rw-rw-r-- 1 barry 11M Nov 22 12:16 N2_rep1_1.fastq.gz
-rw-rw-r-- 1 barry 12M Nov 22 12:16 N2_rep1_2.fastq.gz
-rw-rw-r-- 1 barry 12M Nov 22 12:16 N2_rep2_1.fastq.gz
-rw-rw-r-- 1 barry 15M Nov 22 12:16 N2_rep2_2.fastq.gz
-rw-rw-r-- 1 barry 11M Nov 22 12:16 N2_rep3_1.fastq.gz
-rw-rw-r-- 1 barry 13M Nov 22 12:16 N2_rep3_2.fastq.gz

Queue Channels

Now that we have real data to work with, practice staging the files using the fromFilePairs() operator:

#!/usr/bin/env nextflow

Channel.fromFilePairs("test-datasets/fastq/*_{1,2}.fastq.gz", checkIfExists: true)
       .set{ ch_reads }

ch_reads.view()

Overwrite the test.nf script and run it using nextflow run test.nf. The output should look like:

nextflow run foo.nf
N E X T F L O W  ~  version 21.04.1
Launching `foo.nf` [sleepy_brahmagupta] - revision: d316cf84b0
[fust1_rep3, [/data/test/test-datasets/fastq/fust1_rep3_1.fastq.gz, /data/test/test-datasets/fastq/fust1_rep3_2.fastq.gz]]
[N2_rep3, [/data/test/test-datasets/fastq/N2_rep3_1.fastq.gz, /data/test/test-datasets/fastq/N2_rep3_2.fastq.gz]]
[fust1_rep1, [/data/test/test-datasets/fastq/fust1_rep1_1.fastq.gz, /data/test/test-datasets/fastq/fust1_rep1_2.fastq.gz]]
[fust1_rep2, [/data/test/test-datasets/fastq/fust1_rep2_1.fastq.gz, /data/test/test-datasets/fastq/fust1_rep2_2.fastq.gz]]
[N2_rep2, [/data/test/test-datasets/fastq/N2_rep2_1.fastq.gz, /data/test/test-datasets/fastq/N2_rep2_2.fastq.gz]]
[N2_rep1, [/data/test/test-datasets/fastq/N2_rep1_1.fastq.gz, /data/test/test-datasets/fastq/N2_rep1_2.fastq.gz]]

The files have been stored in a tuple, which is similar to dictionaries in python, or a list of lists. The fromFilePairs() operator automatically names each tuple according to the grouping key - e.g fust1_rep3 - and places the fastq file pairs in a list within the tuple.

When used as inputs, the process will submit a job for each line in the channel in parallel.

Note

Queue channels are FIFO.

To read in a single file, use the fromPath() operator:

#!/usr/bin/env nextflow

Channel.fromPath("test-datasets/reference/chrI.gtf")
    .set{ ch_gtf }

ch_gtf.view()

N E X T F L O W  ~  version 21.04.1
Launching `foo.nf` [scruffy_marconi] - revision: 45988ab471
/data/test/test-datasets/reference/chrI.gtf

One can also use wildcard glob patterns in conjunction with fromPath():

#!/usr/bin/env nextflow

Channel.fromPath("test-datasets/reference/*")
    .set{ ch_reference_files }

ch_reference_files.view()

nextflow run foo.nf
N E X T F L O W  ~  version 21.04.1
Launching `foo.nf` [soggy_descartes] - revision: e3125b3a9e
/data/test/test-datasets/reference/mature.fa
/data/test/test-datasets/reference/chrI.fa.fai
/data/test/test-datasets/reference/chrI.gtf
/data/test/test-datasets/reference/chrI.fa

This is not a great idea in this example - you will have to manually extract each file from the channel. It makes more sense to stage each file in their own channel for downstream analysis.

Value Channels

Value channels (singleton channels) are bound to a single variable and can be read mutliple times - unlike queue channels.

One would typically stage a single file path here, or a parameter variable:

#!/usr/bin/env nextflow

Channel.value("test-datasets/reference/chrI.gtf")
   .set{ ch_gtf }

ch_gtf.view()
ch_gtf.view()

nextflow run foo.nf
N E X T F L O W  ~  version 21.04.1
Launching `foo.nf` [sleepy_thompson] - revision: 76d154a8f4
test-datasets/reference/chrI.gtf
test-datasets/reference/chrI.gtf

Note

You cannot perform operations on a value channel.

#!/usr/bin/env nextflow

Channel.value("test-datasets/reference/chrI.gtf")
    .set{ ch_gtf }

ch_gtf.map{ it -> it.baseName }.view()

nextflow run foo.nf
N E X T F L O W  ~  version 21.04.1
Launching `foo.nf` [clever_mclean] - revision: 4cf48e7013
No such variable: baseName

-- Check script 'foo.nf' at line: 6 or see '.nextflow.log' file for more details

Channel.value(file())

There exists a workaround for staging a value channel that can both be re-used and allow operations.

nf-core devs never raised an issue with my using this method, as far as I am aware it is legitimate.

#!/usr/bin/env nextflow

Channel.value(file("test-datasets/reference/chrI.gtf"))
    .set{ ch_gtf }

ch_gtf.view()
ch_gtf.map{ it -> it.baseName }.view()

nextflow run foo.nf
N E X T F L O W  ~  version 21.04.1
Launching `foo.nf` [gloomy_almeida] - revision: 6b54fe867d
/data/test/test-datasets/reference/chrI.gtf
chrI

Processes

After staging the sequncing reads, we will create a process called FASTQC to perform quality control analysis:

#!/usr/bin/env nextflow

Channel.fromFilePairs("test-datasets/fastq/*_{1,2}.fastq.gz", checkIfExists: true)
       .set{ ch_reads }

process FASTQC{
    publishDir "./fastqc", mode: 'copy'

    input:
    tuple val(base), file(reads) from ch_reads

    output:
    file("*.{html,zip}") into ch_multiqc

    script:
    """
    fastqc -q $reads
    """
}

To run the script, we need to point to the container which holds the FastQC executable. To do this, we specify -with-singularity 'path/to/image'.

nextflow run test.nf -with-singularity 'test.img'

This should raise an error about ‘no such file or directory’. In short, the singularity container does not know where to look for the files when we run the script.

Configuration file

This brings us along nicely to the nextflow.config file. This file is used to specify nextflow variables and parameters for the workflow.

In the file below, we specify the bind path of the container for each process, and enable singularity (we could specify podman, docker, etc here if we needed to).

In the same directory, save the contents below to a file named nextflow.config:

process{
  containerOptions = '-B /data/'
}

singularity.enabled = true
singularity.autoMounts = true

Now run the script again:

nextflow run test.nf -with-singularity 'test.img' -c nextflow.config

Tip

You can save the file under ~/.nextflow/config - nextflow will automatically check this location for a configuration file, bypassing the need to specify the -c flag.

The results of fastqc are stored in the output directory fastqc/. We specified two output file types, .html and .zip, and as such, these are the files published in the output directory.

Parameters

Parameters are variables passed to the nextflow workflow.

It is poor practice to hardcode paths within a workflow - nextflow offers two methods to pass parameters to a workflow:

Via the command line
Via a configuration file

Command Line Parameters

Using the previous script as an example, we will remove the hardcoded variables and pass the parameter via the command line. Edit your script like so (I’m only showing the relevant lines):

#!/usr/bin/env nextflow

Channel.fromFilePairs( params.input, checkIfExists: true )
       .set{ ch_reads }

Pass the path to params.input:

$ nextflow run test.nf --input "test-dataset/fastq/*_{1,2}.fastq.gz" -with-singularity 'test.img' -c nextflow.config

Configuration Parameters

Alternatively, we can specify parameters via any *.config file. You can supply multiple configuration profiles to a workflow. Please bear in mind that the order matters - duplicate parameters will be overwritten by subsequent configuration profiles.

For now, add them to the nextflow.config file we created:

process{
  containerOptions = '-B /data/'
}

params{
  input = "/data/test/test-dataset/fastq/*_{1,2}.fastq.gz"
}

singularity.enabled = true
singularity.autoMounts = true

This circumvents the need to pass multiple parameters via the command line.

nextflow run test.nf -with-singularity 'test.img' -c nextflow.config

Note

Please use double quotes when using a wildcard glob pattern.

Note

It is good practice to provide the absolute paths to files.

Please complete Assignment II Part 1.