Singularity preprocessing
Arguably the greatest strength of the dandelion package is a streamlined preprocessing setup making use of a variety of specialised single cell VDJ algorithms:
1. V(D)J gene reannotation with igblastn and parsed to AIRR format with changeo’s MakeDb.py [Gupta2015], with the pipeline strengthened by running blastn in parallel and using the best alignments
2. Reassigning heavy chain IG V gene alleles with TIgGER [Gadala-Maria15]
3. Reassigning IG constant region calls by blasting against a curated set of highly specific C gene sequences
4. Quantifying mutations via SHazaM’s observedMutations
However, running this workflow requires many dependencies and databases, which can be troublesome to set up. As such, we’ve put together a Singularity container that comes pre-configured with all of the required software and resources, so that installation is easy and the pre-processing pipeline can be run with a single call.
Setup and running
Once you have Singularity installed, you can download the Dandelion container. The command below will create sc-dandelion_latest.sif; note its location.
singularity pull library://kt16/default/sc-dandelion:latest
In order to prepare your VDJ data for ingestion, create a folder for each sample you’d like to analyse, name it with your sample ID, and store the Cell Ranger all_contig_annotations.csv and all_contig.fasta output files inside.
5841STDY7998693
├── all_contig_annotations.csv
└── all_contig.fasta
Please ensure that the only subfolders present in the directory holding your samples are these per-sample folders containing the .csv and .fasta files.
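Before launching the container, it can be handy to check that the folder layout matches the description above. The following is a minimal sketch, not part of dandelion: the helper name check_sample_folders is hypothetical, and it only assumes the two file names given in this section.

```python
from pathlib import Path

# File names expected in every sample folder, as described above.
REQUIRED = ("{prefix}_contig_annotations.csv", "{prefix}_contig.fasta")


def check_sample_folders(root, prefix="all"):
    """Return {subfolder name: [missing file names]} for subfolders of root.

    Hypothetical helper for illustration only. Any subfolder lacking
    the expected files is reported, including stray non-sample folders.
    """
    problems = {}
    for sub in sorted(Path(root).iterdir()):
        if not sub.is_dir():
            continue
        expected = [template.format(prefix=prefix) for template in REQUIRED]
        missing = [name for name in expected if not (sub / name).is_file()]
        if missing:
            problems[sub.name] = missing
    return problems


if __name__ == "__main__":
    # Run from the directory that holds all the sample folders.
    for folder, missing in check_sample_folders(".").items():
        print(f"{folder}: missing {', '.join(missing)}")
```

If nothing is printed, every subfolder contains both input files.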
You can then navigate to the directory holding all your sample folders and run Dandelion pre-processing like so:
singularity run -B $PWD /path/to/sc-dandelion_latest.sif dandelion-preprocess
Any optional arguments get added at the end of this line.
If you’re running TR data rather than IG data, specify --chain TR to skip steps 2-4 in the preprocessing. This notably works with TRGD data, which most versions of Cell Ranger struggle to annotate correctly. However, the contigs are still reconstructed, so Dandelion’s preprocessing can annotate them for you.
You can provide the --filter_to_high_confidence flag to only keep the contigs that Cell Ranger has called as high confidence. If you wish to process files that have a different prefix than all, e.g. filtered_contig_annotations.csv and filtered_contig.fasta, provide the desired file prefix with --file_prefix. In that case, be sure that your input folder contains those files rather than the all ones. We use all as default as it’s possible to subset contigs to relevant ones later.
Recommended parameterisation
If you have gene expression data that the VDJ data will be integrated with, the following parameterisation is likely to yield the best results:
singularity run -B $PWD /path/to/sc-dandelion_latest.sif dandelion-preprocess \
--filter_to_high_confidence
Part of the Cell Ranger VDJ filtering criteria is whether the algorithm thinks the contig is in a cell or not, for which you will have superior information based on the gene expression data. The other half of the Cell Ranger VDJ filtering process, requiring the contig to be high confidence, is retained by providing the --filter_to_high_confidence flag.
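To make the effect of --filter_to_high_confidence concrete, here is a rough sketch of the kind of row filter it describes, applied to a tiny made-up excerpt of a contig annotation CSV. It is an illustration under assumptions rather than dandelion's own code: the function name is hypothetical, and accepting both "True" and "true" is a convenience added here.

```python
import csv
import io


def keep_high_confidence(rows):
    """Yield only the rows whose high_confidence field reads as true.

    Hypothetical helper. Lower-casing the value is an assumption made
    here so that both "True" and "true" are accepted.
    """
    for row in rows:
        if row.get("high_confidence", "").strip().lower() == "true":
            yield row


# Made-up two-contig excerpt, for illustration only.
example = io.StringIO(
    "barcode,contig_id,high_confidence\n"
    "AAACCTGAGAAACCAT-1,AAACCTGAGAAACCAT-1_contig_1,True\n"
    "AAACCTGAGAAACCAT-1,AAACCTGAGAAACCAT-1_contig_2,False\n"
)
kept = list(keep_high_confidence(csv.DictReader(example)))
print([row["contig_id"] for row in kept])
```

Note that nothing here looks at whether Cell Ranger considered the contig to be in a cell; that decision is left to the gene expression data, as described above.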
Optional arguments
By default, this workflow will analyse all provided IG samples jointly with TIgGER to maximise inference power, and in the event of multiple input folders will prepend the sample IDs to the cell barcodes to avoid erroneously merging barcodes overlapping between samples at this stage. TIgGER should be run on a per-individual level. If running the workflow on multiple individuals’ worth of data at once, or wanting to flag the cell barcodes in a non-default manner, information can be provided to the script in the form of a CSV file passed through the --meta argument:
The first row of the CSV needs to be a header identifying the information in the columns, and the first column needs to contain sample IDs.
Barcode flagging can be controlled by an optional prefix/suffix column. The pipeline will then add the specified prefixes/suffixes to the barcodes of the samples. This may be desirable, as corresponding gene expression samples are likely to have different IDs, and providing the matched ID will pre-format the VDJ output to match the GEX nomenclature.
Individual information for TIgGER can be specified in an optional individual column. If specified, TIgGER will be run for each unique value present in the column, pooling the corresponding samples.
It’s possible to provide only the prefix/suffix column or only the individual column. An excerpt of a sample CSV file that could be used as input:
sample,suffix,individual
5841STDY7998693,5841STDY7991475,A37
5841STDY7998694,5841STDY7991476,A37
5841STDY7998695,5841STDY7991477,A37
WSSS_A_LNG9030827,WSSS_A_LNG8986832,A51
WSSS8090101,WSSS8015042,A40
WSSS8090102,WSSS8015043,A40
[...]
If specifying a metadata file, only subfolders with names provided in the sample column will be processed.
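As a quick sanity check on a metadata file, the sketch below groups samples by the individual column, mirroring how the samples would be pooled for TIgGER according to the description above. The helper name group_by_individual is hypothetical, and the embedded text is a shortened copy of the excerpt shown earlier.

```python
import csv
import io
from collections import defaultdict

# Shortened copy of the metadata excerpt above.
META = """sample,suffix,individual
5841STDY7998693,5841STDY7991475,A37
5841STDY7998694,5841STDY7991476,A37
WSSS8090101,WSSS8015042,A40
"""


def group_by_individual(meta_text):
    """Map each value of the individual column to its sample IDs.

    Hypothetical helper. The sample IDs are read from the first
    column, whatever its header is called.
    """
    reader = csv.DictReader(io.StringIO(meta_text))
    sample_column = reader.fieldnames[0]
    groups = defaultdict(list)
    for row in reader:
        groups[row["individual"]].append(row[sample_column])
    return dict(groups)


groups = group_by_individual(META)
for individual, samples in groups.items():
    print(individual, samples)
```

Each printed line corresponds to one set of samples that would be analysed together.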
The delimiter between the barcode and the prefix/suffix can be controlled with the --sep argument. By default, the workflow will strip out the trailing "-1" from the Cell Ranger output barcode names; pass --keep_trailing_hyphen_number if you don’t want to do that. Pass --clean_output if you want to remove intermediate files and just keep the primary output. The intermediate files may be useful for more detailed inspection.
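The barcode handling described above can be pictured with a small sketch. This is only an illustration of the documented behaviour, not dandelion's implementation: the function format_barcode is hypothetical, its keyword names simply echo the command line options, and the order in which the steps are applied is an assumption.

```python
import re


def format_barcode(barcode, prefix=None, suffix=None, sep="_",
                   keep_trailing_hyphen_number=False):
    """Illustrate barcode flagging as described in this section.

    Hypothetical helper: strips a trailing hyphen number such as "-1"
    unless asked to keep it, then attaches the prefix and/or suffix
    using the separator.
    """
    if not keep_trailing_hyphen_number:
        barcode = re.sub(r"-\d+$", "", barcode)
    if prefix:
        barcode = f"{prefix}{sep}{barcode}"
    if suffix:
        barcode = f"{barcode}{sep}{suffix}"
    return barcode


print(format_barcode("AAACCTGAGAAACCAT-1", suffix="5841STDY7991475"))
```

Comparing a few barcodes formatted this way against the barcodes in the matching gene expression object is a simple way to confirm that the chosen prefix/suffix and separator will line up.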
For the full list of optional arguments, run:
singularity run -B $PWD /path/to/sc-dandelion_latest.sif dandelion-preprocess --help
usage: dandelion_preprocess.py [-h] [--meta META] [--chain CHAIN] [--org ORG]
[--file_prefix FILE_PREFIX] [--db DB]
[--strain STRAIN] [--sep SEP]
[--flavour FLAVOUR]
[--filter_to_high_confidence]
[--keep_trailing_hyphen_number]
[--skip_format_header] [--skip_tigger]
[--skip_reassign_dj] [--skip_correct_c]
[--clean_output]
options:
-h, --help show this help message and exit
--meta META Optional metadata CSV file, header required, first
column for sample ID matching folder names in the
directory this is being ran in. Can have a
"prefix"/"suffix" column for barcode alteration, and
"individual" to provide tigger groupings that isn't
analysing all of the samples jointly.
--chain CHAIN Whether the data is TR or IG, as the preprocessing
pipelines differ. Defaults to "IG".
--org ORG organism for running the reannotation. human or mouse.
--file_prefix FILE_PREFIX
Which set of contig files to take for the folder. For
a given PREFIX, will use PREFIX_contig_annotations.csv
and PREFIX_contig.fasta. Defaults to "all".
--db DB Which database to use for reannotation. imgt or ogrdb.
--strain STRAIN Which mouse strain to use for running the
reannotation. Only for ogrdb. Defaults to all (None)
mouse strains.
--sep SEP The separator to place between the barcode and
prefix/suffix. Uses sample names as a prefix for BCR
data if metadata CSV file absent and more than one
sample to process. Defaults to "_".
--flavour FLAVOUR The "flavour" for running igblastn reannotation.
Accepts either "strict" or "original". strict will
enforce evalue and penalty cutoffs.
--filter_to_high_confidence
If passed, limits the contig space to ones that are
set to "True" in the high_confidence column of the
contig annotation.
--keep_trailing_hyphen_number
If passed, do not strip out the trailing hyphen
number, e.g. "-1", from the end of barcodes.
--skip_format_header If passed, skips formatting of contig headers.
--skip_tigger If passed, skips TIgGER reassign alleles step.
--skip_reassign_dj If passed, skips reassigning d/j calls with blastn
when flavour=strict.
--skip_correct_c If passed, skips correcting c calls at assign_isotypes
stage. Only if Chain == IG.
--clean_output If passed, remove intermediate files that aren't the
primary output from the run reults. The intermediate
files may be occasionally useful for inspection.
Output
The main file of interest will be dandelion/all_contig_dandelion.tsv, stored in a new subfolder of each sample folder. This is an AIRR formatted export of the corrected contigs, which can be used for downstream analysis by both dandelion itself, and other packages like scirpy [Sturm2020] and changeo [Gupta2015].
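Because the export is a plain tab-separated AIRR table, it can be inspected without any specialised tooling. The sketch below counts contigs per locus using only the standard library. It assumes the standard AIRR locus column is present, the helper name count_by_locus is hypothetical, and the three-row table is made up for illustration.

```python
import csv
import io
from collections import Counter


def count_by_locus(handle):
    """Count rows per value of the AIRR locus column in a TSV handle.

    Hypothetical helper for a first look at the output.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["locus"] for row in reader)


# Made-up stand-in for a real output file.
demo = io.StringIO(
    "sequence_id\tlocus\tproductive\n"
    "cellA_contig_1\tIGH\tT\n"
    "cellA_contig_2\tIGK\tT\n"
    "cellB_contig_1\tIGH\tF\n"
)
demo_counts = count_by_locus(demo)
print(demo_counts)
```

For a real run, open the per-sample file instead, for example with open("5841STDY7998693/dandelion/all_contig_dandelion.tsv", newline="") and pass the resulting handle to the function.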
The file above features a contig space filtered with Immcantation. If this is not of interest to you and you wish to see the full contig space as provided on input, refer to dandelion/tmp/all_contig_igblast_db-all.tsv.
The plots showing the impact of TIgGER are in <tigger>/<tigger>_reassign_alleles.pdf, for each TIgGER folder (one per unique individual if using --meta, tigger otherwise). The impact of C gene reannotation is shown in dandelion/data/assign_isotype.pdf for each sample.
If you’re interested in more detail about the pre-processing this offers, or wish to use the workflow in a more advanced manner (e.g. by using your own databases), proceed to the pre-processing section of the advanced guide.