
HDTM Pipeline

This is the reproducibility repository for the HDTM pipeline.

Project structure

.
├── data
│   ├── inputs
│   ├── interim
│   └── outputs
├── figures
│   ├── by_chrom
│   └── by_reads
├── notebooks
│   ├── 00-preprocess.ipynb
│   ├── 00-preprocess.md
│   ├── 01.00-hdtm_pipeline_template.ipynb
│   ├── 01.00-hdtm_pipeline_template.md
│   ├── 01-hdtm_pipeline.ipynb
│   ├── 01-hdtm_pipeline.md
│   ├── 01-hdtm_pipeline_notebooks
│   ├── 10-build_trackhub.ipynb
│   ├── 10-build_trackhub.md
│   ├── 11-plot_results.ipynb
│   ├── 11-plot_results.md
│   ├── 12-interactive_vizu.ipynb
│   ├── 12-interactive_vizu.md
│   └── scibelt
├── environment.yml
└── README.md
  • data contains the pipeline data:
    • inputs holds the data needed to run the pipeline.
    • interim holds the intermediate files produced by the pipeline.
    • outputs holds the 'final' files, such as the generated trackhub and the bigWig files.
  • figures contains the figures generated by the pipeline.
  • notebooks is the most important folder; it contains all the Jupyter notebooks that run the pipeline.
  • notebooks/scibelt is a Python module that contains helper functions needed by the pipeline.
  • environment.yml is the conda environment file (see the installation section).

Installation

All the dependencies are managed with conda. [Anaconda][anaconda], or at least [miniconda][miniconda], is needed.

You can then install the dependencies with conda env create -f environment.yml or via the Anaconda manager.

You also have to register the kernel properly:

conda activate hdtm
python -m ipykernel install --user --name hdtm

You can then start jupyter notebook inside the right conda environment (conda activate hdtm) and run the different notebooks of the pipeline.

The notebooks

00.00-fetch_data

This notebook retrieves the inputs from the server (TODO: from the dropbox, from an external server?), converts the GenBank references, and builds a file that summarizes the different chromosomes and their sizes.

00.00-process_references

This notebook converts the GenBank references, builds a file that summarizes the different chromosomes and their sizes, and extracts the CDS from the references.
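
The summary step can be sketched as follows. This is a minimal illustration, assuming the GenBank references have already been parsed into (chromosome_name, sequence) pairs; the function name and the TSV layout are assumptions, not the notebook's actual code.

```python
# Sketch of the chromosome-summary step. Assumption: the GenBank
# references have already been parsed into (name, sequence) pairs
# (the real notebook does this from the files in data/inputs).

def summarize_chromosomes(records):
    """Return TSV lines listing each chromosome and its size."""
    lines = ["chromosome\tsize"]
    for name, seq in records:
        lines.append(f"{name}\t{len(seq)}")
    return "\n".join(lines) + "\n"

# Illustrative records, not real data:
records = [("chr_A", "ATGC" * 250), ("chr_B", "AT" * 300)]
print(summarize_chromosomes(records))
```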

01-hdtm_pipeline

This notebook uses papermill to build one sub-notebook per read set from the template 01.00-hdtm_pipeline_template (the generated notebooks are stored in notebooks/01-hdtm_pipeline_notebooks/).
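
The fan-out can be sketched like this. The parameter name (reads) and the output-file naming are assumptions; the executor is injected so the sketch runs without papermill installed, but in the real notebook it would be papermill.execute_notebook.

```python
from pathlib import Path

# Sketch of the papermill fan-out: one sub-notebook per read set,
# built from the parameterized template. `execute` stands in for
# papermill.execute_notebook; the `reads` parameter name and the
# output naming scheme are illustrative assumptions.

TEMPLATE = Path("notebooks/01.00-hdtm_pipeline_template.ipynb")
OUT_DIR = Path("notebooks/01-hdtm_pipeline_notebooks")

def run_pipeline(read_sets, execute):
    """Build one sub-notebook per read set from the template."""
    outputs = []
    for reads in read_sets:
        out = OUT_DIR / f"01-hdtm_pipeline_{reads}.ipynb"
        execute(str(TEMPLATE), str(out), parameters={"reads": reads})
        outputs.append(out)
    return outputs

# In the real notebook one would pass papermill's executor:
#   import papermill as pm
#   run_pipeline(["sample_1", "sample_2"], pm.execute_notebook)
```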

The steps are the following:

  • Use trimmomatic to trim the raw reads.
  • Concatenate the relevant chromosomes into a single fasta file.
  • Align the trimmed reads on that reference file using bwa mem, then filter the result using samtools view -q 10.
  • Convert the SAM file into a bedGraph (see notebooks/scibelt/cigar.py).
  • Convert the bedGraph file into a bigWig one (which is required to build the trackhub).
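
The SAM-to-bedGraph idea can be sketched in pure Python. This is a hypothetical re-implementation for illustration only (the real code lives in notebooks/scibelt/cigar.py): walk each CIGAR string, add coverage over the reference bases it consumes, and emit bedGraph-style intervals, skipping alignments below the MAPQ 10 threshold.

```python
import re

# Hypothetical sketch of the SAM -> bedGraph conversion. For each
# alignment we walk the CIGAR string and add coverage over the
# reference positions it consumes (M/D/N/=/X operations consume
# reference bases; I/S/H/P do not).

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def reference_span(cigar):
    """Number of reference bases consumed by a CIGAR string."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MDN=X")

def coverage_intervals(alignments, min_mapq=10):
    """alignments: (chrom, 0-based pos, mapq, cigar) tuples.
    Returns bedGraph-style (chrom, start, end, depth) intervals."""
    depth = {}
    for chrom, pos, mapq, cigar in alignments:
        if mapq < min_mapq:  # mimic the samtools -q 10 filter
            continue
        for i in range(pos, pos + reference_span(cigar)):
            depth[(chrom, i)] = depth.get((chrom, i), 0) + 1
    # merge runs of equal depth into intervals
    intervals = []
    for (chrom, i), d in sorted(depth.items()):
        if intervals and intervals[-1][0] == chrom \
                and intervals[-1][2] == i and intervals[-1][3] == d:
            intervals[-1][2] = i + 1
        else:
            intervals.append([chrom, i, i + 1, d])
    return [tuple(x) for x in intervals]
```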

10-build_trackhub

Once all the reads have been processed, this notebook builds a trackhub. It can easily be put online by replacing the host (from localhost to a remote server) and the remote_dir argument.
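
For reference, a UCSC track hub is anchored by a small hub.txt stanza; a minimal sketch of generating it is below. The hub name, labels, and email are illustrative, and the real notebook builds the full hub (hub.txt, genomes.txt, trackDb.txt) rather than just this stanza.

```python
# Sketch of the hub.txt stanza a UCSC track hub starts from.
# All values below are illustrative placeholders.

def hub_txt(name, short_label, long_label, email, genomes_file="genomes.txt"):
    return (
        f"hub {name}\n"
        f"shortLabel {short_label}\n"
        f"longLabel {long_label}\n"
        f"genomesFile {genomes_file}\n"
        f"email {email}\n"
    )

print(hub_txt("hdtm", "HDTM", "HDTM pipeline tracks", "user@example.org"))
```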

11-plot_results

This notebook builds one figure per experiment (showing the alignment to the relevant chromosome), then one figure per chromosome (comparing all the experiments linked to that chromosome).

The figures are in the figures/ folder.
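
The per-chromosome comparison boils down to grouping experiments by their chromosome before plotting; a small sketch of that grouping, with illustrative experiment names:

```python
from collections import defaultdict

# Sketch of the grouping behind the per-chromosome figures: collect,
# for each chromosome, the experiments linked to it. The pairs below
# are illustrative, not real experiment names.

def by_chromosome(experiments):
    """experiments: (experiment_name, chromosome) pairs."""
    groups = defaultdict(list)
    for exp, chrom in experiments:
        groups[chrom].append(exp)
    return dict(groups)

by_chromosome([("exp1", "chrA"), ("exp2", "chrA"), ("exp3", "chrB")])
```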

12-interactive_vizu

This notebook provides an interactive dashboard that displays, for each chromosome, the reads linked to it.

Acknowledgement

The pipeline steps come from Frédéric Grenier's original work [link]; the experimental data come from Kevin Huguet's work.