HDTM Pipeline
This is the reproducibility repository for the HDTM pipeline.
Project structure
.
├── data
│ ├── inputs
│ ├── interim
│ └── outputs
├── figures
│ ├── by_chrom
│ └── by_reads
├── notebooks
│ ├── 00-preprocess.ipynb
│ ├── 00-preprocess.md
│ ├── 01.00-hdtm_pipeline_template.ipynb
│ ├── 01.00-hdtm_pipeline_template.md
│ ├── 01-hdtm_pipeline.ipynb
│ ├── 01-hdtm_pipeline.md
│ ├── 01-hdtm_pipeline_notebooks
│ ├── 10-build_trackhub.ipynb
│ ├── 10-build_trackhub.md
│ ├── 11-plot_results.ipynb
│ ├── 11-plot_results.md
│ ├── 12-interactive_vizu.ipynb
│ ├── 12-interactive_vizu.md
│ └── scibelt
├── environment.yml
└── README.md
- `data` contains the data:
  - `inputs` are the data needed to run the pipeline.
  - `interim` are the files produced by the pipeline.
  - `outputs` are the 'final' files, such as the produced trackhub and the bigWigs.
- `figures` contains the figures generated by the pipeline.
- `notebooks` is the most important folder: it contains all the Jupyter notebooks that run the pipeline.
- `notebooks/scibelt` is a Python module that contains the helper functions needed by the pipeline.
- `environment.yml` is the `conda` environment file (see the installation section).
Installation
All the dependencies are handled with conda, so [Anaconda][anaconda], or at least [miniconda][miniconda], is required.
You can then install the dependencies with `conda env create -f environment.yml` or via the Anaconda manager.
You also have to register the Jupyter kernel:
```shell
conda activate hdtm
python -m ipykernel install --user --name hdtm
```
You can then run `jupyter notebook` in the proper conda environment (`conda activate hdtm`) and run the different notebooks of the pipeline.
The notebooks
00.00-fetch_data
This notebook retrieves the inputs from the server (TODO: from Dropbox? from an external server?), converts the GenBank references, and builds a file that summarizes the different chromosomes and their sizes.
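As an illustration of the summary step, chromosome sizes can be collected from the converted FASTA references and written in the two-column `chrom.sizes` layout. This is a minimal sketch, not the notebook's actual code; the `chrom_sizes` and `write_chrom_sizes` helpers are hypothetical names.

```python
# Sketch: build a '<chrom>\t<size>' summary file from FASTA references.
# Illustrative only -- see notebooks/00-preprocess.ipynb for the real code.

def chrom_sizes(fasta_text):
    """Map each sequence name to its length in a FASTA string."""
    sizes = {}
    name = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]  # header up to first whitespace
            sizes[name] = 0
        elif name is not None:
            sizes[name] += len(line)
    return sizes

def write_chrom_sizes(fasta_text, path):
    """Write a two-column '<chrom>\t<size>' file (UCSC chrom.sizes layout)."""
    with open(path, "w") as fh:
        for name, size in chrom_sizes(fasta_text).items():
            fh.write(f"{name}\t{size}\n")
```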
00.00-process_references
This notebook converts the GenBank references, builds a file that summarizes the different chromosomes and their sizes, and extracts the CDS from the references.
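For reference, CDS features sit in the feature table of a GenBank record. A deliberately simplified sketch of pulling out simple CDS coordinates is shown below; real records (joins, multi-line locations) need a proper parser such as Biopython's `SeqIO`, and the `extract_cds` helper here is hypothetical.

```python
import re

# Simplified sketch: pull CDS coordinates out of a GenBank feature table.
# Handles only single-interval locations like '190..255' and
# 'complement(3300..4037)'; use Biopython for anything real.
CDS_RE = re.compile(r"^\s{5}CDS\s+(complement\()?<?(\d+)\.\.>?(\d+)\)?")

def extract_cds(genbank_text):
    """Return (start, end, strand) tuples for simple CDS locations."""
    cds = []
    for line in genbank_text.splitlines():
        m = CDS_RE.match(line)
        if m:
            strand = "-" if m.group(1) else "+"
            cds.append((int(m.group(2)), int(m.group(3)), strand))
    return cds
```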
01-hdtm_pipeline
This notebook uses papermill to build one sub-notebook per read set from the template 01.00-hdtm_pipeline_template (once built, these are stored in `notebooks/01-hdtm_pipeline_notebooks/`).
The steps are the following:
- Use trimmomatic to trim the raw reads.
- Concatenate the relevant chromosomes into a single FASTA file.
- Align the trimmed reads on that reference file using `bwa mem`, then filter the result using `samtools view -q 10`.
- Convert the SAM file into a bedGraph (see `notebooks/scibelt/cigar.py`).
- Convert the bedGraph file into a bigWig one (which is required to build the trackhub).
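The SAM-to-bedGraph conversion hinges on reading CIGAR strings. As an illustration of the idea (not the actual `scibelt/cigar.py` code, and with hypothetical helper names), the reference span covered by a read follows directly from its CIGAR:

```python
import re

# Sketch: how a CIGAR string determines per-base reference coverage.
# M, D, N, = and X consume reference bases; I, S, H, P do not.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def reference_span(cigar):
    """Number of reference bases consumed by a CIGAR string."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MDN=X")

def add_coverage(coverage, pos, cigar):
    """Increment per-base coverage for a read aligned at 0-based `pos`."""
    for i in range(pos, pos + reference_span(cigar)):
        coverage[i] = coverage.get(i, 0) + 1
```

Accumulating `add_coverage` over all aligned reads, then emitting runs of equal coverage, yields the bedGraph intervals.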
10-build_trackhub
Once all the reads have been processed, this notebook builds a trackhub. It can easily be put online by changing the host (from localhost to a remote server) and the `remote_dir` argument.
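For context, a UCSC track hub is just a small set of plain-text files (`hub.txt`, `genomes.txt`, `trackDb.txt`). A minimal sketch of writing them by hand is below; the `build_trackhub` helper, labels, and file names are illustrative placeholders, not the notebook's actual arguments.

```python
from pathlib import Path

def build_trackhub(out_dir, genome, bigwigs, hub_name="hdtm"):
    """Write the minimal hub.txt / genomes.txt / trackDb.txt trio of a
    UCSC track hub. `bigwigs` maps track names to bigWig file names."""
    out = Path(out_dir)
    (out / genome).mkdir(parents=True, exist_ok=True)
    (out / "hub.txt").write_text(
        f"hub {hub_name}\n"
        f"shortLabel {hub_name}\n"
        f"longLabel {hub_name} track hub\n"
        "genomesFile genomes.txt\n"
        "email user@example.org\n"
    )
    (out / "genomes.txt").write_text(
        f"genome {genome}\ntrackDb {genome}/trackDb.txt\n"
    )
    stanzas = [
        f"track {name}\n"
        f"bigDataUrl {bw}\n"
        f"shortLabel {name}\n"
        f"longLabel {name}\n"
        "type bigWig\n"
        for name, bw in bigwigs.items()
    ]
    (out / genome / "trackDb.txt").write_text("\n".join(stanzas))
```

Serving `out_dir` over HTTP and pointing a genome browser at `hub.txt` is then enough to load the tracks.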
11-plot_results
This notebook builds one figure per experiment (with the alignment to the relevant chromosome), then one figure per chromosome (to compare all the experiments linked to that chromosome).
The figures are saved in the `figures/` folder.
12-interactive_vizu
This notebook provides an interactive dashboard that displays, for each chromosome, the reads linked to it.
Acknowledgements
The pipeline steps come from Frédéric Grenier's original work [link]; the experimental data come from Kevin Huguet's work.