Assignment I
Part I: QC Sequencing Reads
Download some simulated RNA-Seq reads:
wget https://github.com/BarryDigby/circ_data/releases/download/RTP/fastq.tar.gz && tar -xvzf fastq.tar.gz
ls -lh test-datasets/fastq
total 151M
-rw-rw-r-- 1 barry 11M Nov 22 12:16 fust1_rep1_1.fastq.gz
-rw-rw-r-- 1 barry 12M Nov 22 12:16 fust1_rep1_2.fastq.gz
-rw-rw-r-- 1 barry 14M Nov 22 12:16 fust1_rep2_1.fastq.gz
-rw-rw-r-- 1 barry 15M Nov 22 12:16 fust1_rep2_2.fastq.gz
-rw-rw-r-- 1 barry 14M Nov 22 12:16 fust1_rep3_1.fastq.gz
-rw-rw-r-- 1 barry 16M Nov 22 12:16 fust1_rep3_2.fastq.gz
-rw-rw-r-- 1 barry 11M Nov 22 12:16 N2_rep1_1.fastq.gz
-rw-rw-r-- 1 barry 12M Nov 22 12:16 N2_rep1_2.fastq.gz
-rw-rw-r-- 1 barry 12M Nov 22 12:16 N2_rep2_1.fastq.gz
-rw-rw-r-- 1 barry 15M Nov 22 12:16 N2_rep2_2.fastq.gz
-rw-rw-r-- 1 barry 11M Nov 22 12:16 N2_rep3_1.fastq.gz
-rw-rw-r-- 1 barry 13M Nov 22 12:16 N2_rep3_2.fastq.gz
Your task is to create a container, using both a conda environment.yml file and a suitable Dockerfile, that hosts the following quality control tools:
fastqc
multiqc
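As a starting point, the two files might look like the sketch below, written as shell heredocs so that they can be pasted into a terminal in an empty project directory. The environment name qc, the continuumio/miniconda3 base image and the unpinned tool versions are assumptions made for illustration, not the reference solution.

```shell
# Write a minimal conda environment file.
# The environment name "qc" is an assumption; any name works if the Dockerfile matches.
cat > environment.yml <<'EOF'
name: qc
channels:
  - conda-forge
  - bioconda
dependencies:
  - fastqc
  - multiqc
EOF

# Write a matching Dockerfile.
# The base image is an assumption; any image that ships conda should do.
cat > Dockerfile <<'EOF'
FROM continuumio/miniconda3
COPY environment.yml /
RUN conda env create -f /environment.yml && conda clean -a -y
ENV PATH=/opt/conda/envs/qc/bin:$PATH
EOF
```

The ENV line puts the environment's bin directory on $PATH, so fastqc and multiqc are available as soon as you shell into the container.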
Once the container has been created, shell into the container and run fastqc on the sequencing reads. Once all of the outputs have been generated for each fastq file ("*.{html,zip}"), run multiqc to generate a summary report.
Bonus
Push your Docker container (which should have both fastqc and multiqc installed) to DockerHub.
Download the container using the singularity pull command - we are mimicking behaviour on an HPC here, where Docker is not available to us.
Write a bash script that loops over each fastq file performing fastqc.
At the end of the script, run multiqc on the outputs of the fastqc runs.
Run the script from within the container by using the singularity shell command. Be careful to specify the correct bind path using -B.
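The bonus steps could be sketched roughly as follows. The function name qc_all and the output directory name are hypothetical, and the image file name in the comments is only what singularity pull typically produces, so check what lands in your directory.

```shell
#!/usr/bin/env bash
# Sketch of the bonus script. Names below are assumptions, not a reference solution.
# Fetch and enter the container first (placeholders as in the docker build command):
#   singularity pull docker://<dockerhub_username>/<repo_name>
#   singularity shell -B $(pwd) <repo_name>_latest.sif

qc_all() {
    fastq_dir=$1   # directory holding the *.fastq.gz files
    out_dir=$2     # where fastqc writes its *.html and *.zip outputs
    mkdir -p "$out_dir"
    for fq in "$fastq_dir"/*.fastq.gz; do
        [ -e "$fq" ] || continue            # skip the literal pattern if nothing matches
        fastqc --outdir "$out_dir" "$fq" || return 1
    done
    # Summarise every fastqc report found in out_dir.
    multiqc --outdir "$out_dir" "$out_dir"
}

# Example call, run from the directory where fastq.tar.gz was unpacked:
#   qc_all test-datasets/fastq fastqc_out
```

Binding the current directory with -B $(pwd) makes the fastq files visible inside the container at the same path, so the relative paths in the example call keep working.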
Part II: Advanced Container Creation
You are tasked with creating a container to faithfully reproduce the analysis performed by Zhao et al.
An excerpt of the methods is given in the screenshot below - create a container using a Dockerfile and an environment.yml file as shown in previous examples.
Note
There are three tools which you will need to install manually: 1) RSEM v1.3.0 2) BLASTX v2.6.0 3) CNCI v2.0. You will have to perform a dry-run installation of these tools locally first.
Note
Use pinned tool versions! We want the precise versions used in the analysis.
Note
If a tool is present in multiple channels, be sure to specifically select the channel you want to download it from, e.g. conda-forge::<tool>=<version>. If you do not do this, conda falls back on channel priority and may install from a channel you did not intend, or fail during the install.
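The two notes above combine into dependency lines of the form channel::tool=version. The sketch below writes such a file with placeholder versions. The environment name is an assumption, and <version> is deliberately left unfilled because the real versions must come from the methods excerpt. Running conda search -c bioconda <tool> is one way to see which versions a channel offers.

```shell
# Sketch of pinned, channel-qualified dependencies.
# <version> is a placeholder: replace it with the version stated in the methods.
# The environment name "zhao_et_al" is an assumption.
cat > environment.yml <<'EOF'
name: zhao_et_al
channels:
  - conda-forge
  - bioconda
dependencies:
  - bioconda::tophat=<version>
  - bioconda::cufflinks=<version>
  - bioconda::deeptools=<version>
EOF

# List any pins that still need a real version before building.
grep -n '<version>' environment.yml
```

The final grep prints every line that still carries a placeholder, which is a quick check to run before docker build.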
RSEM
RSEM is written in C++ and requires a bunch of dependencies which are beyond the scope of this workshop. I have included the dependencies for RSEM installation in the Dockerfile for you:
# Add dependencies
# update and install share one RUN line so the package index is never stale
RUN apt-get update && apt-get install --yes build-essential \
    gcc-multilib \
    tar \
    unzip \
    ncurses-base \
    zlib1g \
    liblzma5 \
    libbz2-1.0 \
    gcc \
    g++ \
    zlib1g-dev && \
    apt-get clean
When installing RSEM in the Dockerfile, chain the wget, tar -zxvf, cd, make and make install commands using &&.
Each RUN line creates a new layer and starts in a fresh shell, so a cd on one RUN line is forgotten by the next - splitting the installation commands over multiple RUN lines will fail. Thank you Bianca! :)
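The effect can be reproduced outside Docker with plain shell processes, as in the sketch below. The directory RSEM-src is a stand-in, and the download URL and unpacked directory name in the closing comment are assumptions that you should check against the RSEM release page.

```shell
# Mimic Docker's behaviour: every RUN line starts a fresh shell process.
workdir=$(mktemp -d)
mkdir "$workdir/RSEM-src"      # stand-in for the unpacked source directory
cd "$workdir"

# Two separate "RUN" lines: the cd is forgotten when the first shell exits.
separate=$(sh -c 'cd RSEM-src'; sh -c 'basename "$(pwd)"')

# One chained "RUN" line: cd and the following command share a shell.
chained=$(sh -c 'cd RSEM-src && basename "$(pwd)"')

echo "separate: $separate"     # name of the temporary directory, not RSEM-src
echo "chained:  $chained"

# Dockerfile form (URL and directory name are assumptions):
#   RUN wget https://github.com/deweylab/RSEM/archive/v1.3.0.tar.gz && \
#       tar -zxvf v1.3.0.tar.gz && \
#       cd RSEM-1.3.0 && \
#       make && \
#       make install
```

Only the chained variant runs its second command inside the source directory, which is where make has to run.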
CNCI
CNCI is available on Github at the following link. There are two issues here:
The authors never bothered to make a stable release, so you cannot download a versioned tarball containing the contents of the repository.
Running git clone in a Dockerfile will fail (Host key verification failed.). To get around this you would need to generate unique ssh keys for the container, which are then saved in the image layer. This is extremely insecure - don’t do this.
To overcome these issues, I forked the repository and created a stable release - I cloned the repo locally, compressed it into a tarball and uploaded the tarball as a release file. The stable release is available at the following link.
Within the Dockerfile, use wget to download the archived repository. You can follow the installation steps from there.
Hint
Once downloaded and de-compressed, make the CNCI folder fully accessible: chmod -R 777 CNCI/. You must do this in order to add the executables to your $PATH.
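The permissions and $PATH steps can be tried locally with a stand-in folder before you write them into the Dockerfile. In the sketch below the CNCI.py file is a dummy stub, not the real tool, and the paths and placeholders in the closing comment are assumptions.

```shell
# Build a stand-in CNCI folder with a dummy script (not the real tool).
demo=$(mktemp -d)
mkdir "$demo/CNCI"
printf '#!/bin/sh\necho "CNCI stub"\n' > "$demo/CNCI/CNCI.py"

# Make the folder fully accessible, then expose its executables on $PATH.
chmod -R 777 "$demo/CNCI"
export PATH="$demo/CNCI:$PATH"

# The script can now be called by name from any directory.
CNCI.py

# In the Dockerfile the same steps would follow the download, e.g.
#   RUN wget <release_tarball_url> && tar -zxvf <tarball> && chmod -R 777 CNCI/
#   ENV PATH=/CNCI:$PATH     # assumes the tarball was unpacked in /
```

Without the executable bit the shell refuses to run the script by name, which is why the chmod has to come before the $PATH check.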
Check Installations
If you need a reminder, the steps to build the container are:
docker build -t <dockerhub_username>/<repo_name> $(pwd) # run in directory containing both Dockerfile and environment.yml file
docker run -it <dockerhub_username>/<repo_name>
Check the installs worked:
tophat
cufflinks
rsem-bam2wig
makeblastdb -help
CPC2.py
CNCI.py -h
computeMatrix
All of the tools should work except for deepTools (computeMatrix). This looks like a nasty error to debug, especially as the tool comes from the Anaconda repository. You will come across situations like this that force you either to look for alternative tools, or to comb through the source code to locate and remedy the error.
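A quick first pass, before calling each tool by hand, is to confirm that every executable is on $PATH. The helper below is a hypothetical sketch (the name check_tools is not part of any tool). It only tests presence, so a tool such as computeMatrix can be reported as found and still fail when run.

```shell
# Report which of the expected executables are missing from $PATH.
# This only checks presence: a tool can be found and still fail at runtime.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=$((missing + 1))
        fi
    done
    # Exit status is the number of missing tools (0 means all were found).
    return "$missing"
}

# Example call inside the container:
#   check_tools tophat cufflinks rsem-bam2wig makeblastdb CPC2.py CNCI.py computeMatrix
```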
Once you are happy with the installations, push your changes to GitHub to trigger an automated build (i.e. push the Dockerfile and environment.yml to your repo).