V(D)J clustering

On the topic of finding clones/clonotypes, there are many ways used for clustering BCRs, almost all involving some measure based on sequence similarity. There are also a lot of very well established guidelines and criterias maintained by the BCR community. For example, immcantation uses a number of model-based methods [Gupta2015] to group clones based on the distribution of length-normalised junctional hamming distance while others use the whole BCR V(D)J sequence to define clones as shown in this paper [Bashford-Rogers2019].

Import modules

[1]:

import os
import pandas as pd
import dandelion as ddl

ddl.logging.print_header()

dandelion==0.3.4.dev29 pandas==2.0.1 numpy==1.24.3 matplotlib==3.7.1 networkx==3.1 scipy==1.11.2

[2]:

# change directory to somewhere more workable
os.chdir(os.path.expanduser("~/Downloads/dandelion_tutorial/"))
# I'm importing scanpy here to make use of its logging module.
import scanpy as sc

sc.settings.verbosity = 3
import warnings

warnings.filterwarnings("ignore")
sc.logging.print_header()

scanpy==1.9.3 anndata==0.9.1 umap==0.5.3 numpy==1.24.3 scipy==1.11.2 pandas==2.0.1 scikit-learn==1.3.0 statsmodels==0.14.0 python-igraph==0.10.6 pynndescent==0.5.10

Read in the previously saved files

I will work with the same example from the previous section since I have the filtered V(D)J data stored in a Dandelion class.

[3]:

vdj = ddl.read_h5ddl("dandelion_results.h5ddl")
vdj

[3]:

Dandelion class object with n_obs = 2229 and n_contigs = 7357
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'duplicate_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'ambiguous', 'rearrangement_status'
    metadata: 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'duplicate_count_B_VDJ', 'duplicate_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'

Finding clones

The following is dandelion’s implementation of a rather conventional method to define clones, ddl.tl.find_clones.

Clone definition criterion

Clone definition is based on the following criterion:

Identical V- and J-gene usage in the VDJ chain (IGH/TRB/TRD).
Identical CDR3 junctional/CDR3 sequence length in the VDJ chain.
VDJ chain junctional/CDR3 sequences attains a minimum of % sequence similarity, based on hamming distance. The similarity cut-off is tunable (default is 85%; change to 100% if analyzing TCR data).
VJ chain (IGK/IGL/TRA/TRG) usage. If cells within clones use different VJ chains, the clone will be split following the same conditions for VDJ chains in (1-3) as above.

Running `ddl.tl.find_clones`

The function will take a file path, a pandas DataFrame (for example if you’ve used pandas to read in the filtered file already), or a Dandelion class object. The default mode for calculation of junctional hamming distance is to use the CDR3 junction amino acid sequences, specified via the key option (None defaults to junction_aa). You can switch it to using CDR3 junction nucleotide sequences (key = 'junction'), or even the full V(D)J amino acid sequence (key = 'sequence_alignment_aa'), as long as the column name exists in the .data slot.

If you want to use the alleles for defining V-J gene usage, specify:

by_alleles = True

Clustering TCR is possible with the same setup but requires changing of default parameters (covered in the TCR section).

[4]:

ddl.tl.find_clones(vdj)
vdj

Finding clonotypes

Finding clones based on B cell VDJ chains : 100%|██████████| 220/220 [00:00<00:00, 1481.66it/s]
Finding clones based on B cell VJ chains : 100%|██████████| 210/210 [00:00<00:00, 4790.91it/s]
Refining clone assignment based on VJ chain pairing : 100%|██████████| 2229/2229 [00:00<00:00, 577782.81it/s]

 finished: Updated Dandelion object:
   'data', contig-indexed AIRR table
   'metadata', cell-indexed observations table
 (0:00:00)

[4]:

Dandelion class object with n_obs = 2229 and n_contigs = 7357
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'duplicate_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'ambiguous', 'rearrangement_status', 'clone_id'
    metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'duplicate_count_B_VDJ', 'duplicate_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'

This will return a new column with the column name 'clone_id' as per convention. If a file path is provided as input, it will also save the file automatically into the base directory of the file name. Otherwise, a Dandelion object will be returned.

Clonotype definition criterion

The clone_id follows an A_B_C_D_E_F format and largely reflects the conditions above where:

{A} indicates if the contigs use the same V and J genes in the VDJ chain.

{B} indicates if junctional/CDR3 sequences are equal in length in the VDJ chain.

{C} indicates if clones are split based on junctional/CDR3 hamming distance threshold (for VDJ chain).

{D} indicates if the contigs use the same V and J genes in the VJ chain.

{E} indicates if junctional/CDR3 sequences are equal in length in the VJ chain.

{F} indicates if clones are split based on junctional/CDR3 hamming distance threshold (for VJ chain).

Also, to prevent issues with clonotype ids matching between B cells and T cells, there will be a prefix added to the clone_id to reflect whether or not it’s a B, abT or gdT clone.

Also, to reduce ambiguity, the A_B_C segment will have the VDJ prefix and the D_E_F segment will have the VJ suffix.

Therefor, a complete B cell clonotype id will look something like:

B_VDJ_1_1_2_VJ_2_1_1

For an Orphan VDJ, it would be B_VDJ_1_1_2.

For an Orphan VJ, it would be B_VJ_2_1_1.

There is also an alternate column called clone_id_by_size which is a simple numerical version of the clone_id which corresponds to the size of the clonotype - 1 is the largest clonotype, 2 is the second largest, and so on.

[5]:

vdj.metadata

[5]:

	clone_id	clone_id_by_size	sample_id	locus_VDJ	locus_VJ	productive_VDJ	productive_VJ	v_call_genotyped_VDJ	d_call_VDJ	j_call_VDJ	...	d_call_B_VDJ_main	j_call_B_VDJ_main	v_call_B_VJ_main	j_call_B_VJ_main	isotype	isotype_status	locus_status	chain_status	rearrangement_status_VDJ	rearrangement_status_VJ
sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG	B_VJ_165_2_7	70	sc5p_v2_hs_PBMC_10k	None	IGK	None	T	None	None	None	...	None	None	IGKV1-33,IGKV1D-33	IGKJ4	None	None	Orphan IGK	Orphan VJ	None	standard
sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC	B_VDJ_119_3_2_VJ_80_2_3	1952	sc5p_v2_hs_PBMC_10k	IGH	IGK	T	T	IGHV1-69,IGHV1-69D	IGHD3-22	IGHJ3	...	IGHD3-22	IGHJ3	IGKV1-8	IGKJ1	IgM	IgM	IGH + IGK	Single pair	standard	standard
sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG	B_VDJ_42_1_2_VJ_54_1_1	1567	sc5p_v2_hs_PBMC_10k	IGH	IGL	T	T	IGHV1-2	None	IGHJ3	...	None	IGHJ3	IGLV5-45	IGLJ3	IgM	IgM	IGH + IGL	Single pair	standard	standard
sc5p_v2_hs_PBMC_10k_AAACCTGTCTTGAGAC	B_VDJ_38_4_4_VJ_191_1_1	1568	sc5p_v2_hs_PBMC_10k	IGH	IGK	T	T	IGHV5-51	None	IGHJ3	...	None	IGHJ3	IGKV1D-8	IGKJ2	IgM	IgM	IGH + IGK	Single pair	standard	standard
sc5p_v2_hs_PBMC_10k_AAACGGGAGCGACGTA	B_VDJ_55_2_1_VJ_184_2_7	1569	sc5p_v2_hs_PBMC_10k	IGH	IGL	T	T	IGHV4-4	IGHD6-13	IGHJ3	...	IGHD6-13	IGHJ3	IGLV3-19	IGLJ3,IGLJ2	IgM	IgM	IGH + IGL	Single pair	standard	standard
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
vdj_v1_hs_pbmc3_TTTCCTCAGCAATATG	B_VDJ_41_2_1_VJ_26_2_8	796	vdj_v1_hs_pbmc3	IGH	IGK	T	T	IGHV2-5	IGHD5/OR15-5a,IGHD5/OR15-5b	IGHJ5,IGHJ4	...	IGHD5/OR15-5a,IGHD5/OR15-5b	IGHJ5,IGHJ4	IGKV4-1	IGKJ4	IgM	IgM	IGH + IGK	Single pair	standard	standard
vdj_v1_hs_pbmc3_TTTCCTCAGCGCTTAT	B_VDJ_2_6_3_VJ_87_1_3	797	vdj_v1_hs_pbmc3	IGH	IGK	T	T	IGHV3-30	IGHD4-17	IGHJ6	...	IGHD4-17	IGHJ6	IGKV2-30	IGKJ2	IgM	IgM	IGH + IGK	Single pair	standard	standard
vdj_v1_hs_pbmc3_TTTCCTCAGGGAAACA	B_VDJ_1_1_1_VJ_139_4_16	798	vdj_v1_hs_pbmc3	IGH	IGK	T	T	IGHV4-59	IGHD6-13	IGHJ2	...	IGHD6-13	IGHJ2	IGKV1D-39,IGKV1-39	IGKJ1	IgM	IgM	IGH + IGK	Single pair	standard	standard
vdj_v1_hs_pbmc3_TTTGCGCCATACCATG	B_VDJ_47_4_1_VJ_103_3_4	799	vdj_v1_hs_pbmc3	IGH	IGL	T	T	IGHV1-69,IGHV1-69D	IGHD2-15	IGHJ6	...	IGHD2-15	IGHJ6	IGLV1-47	IGLJ3	IgM	IgM	IGH + IGL	Single pair	standard	standard
vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG	B_VDJ_184_5_1_VJ_121_3_3	2336	vdj_v1_hs_pbmc3	IGH	IGL	T	T	IGHV3-23,IGHV3-23D	None	IGHJ4	...	None	IGHJ4	IGLV2-11	IGLJ3,IGLJ2	IgM	IgM	IGH + IGL	Single pair	standard	standard

2229 rows × 47 columns

Alternative : Running `tl.define_clones`

Alternatively, a wrapper to call changeo’s DefineClones.py [Gupta2015] is also included. To run it, you need to choose the distance threshold for clonal assignment. To facilitate this, the function pp.calculate_threshold will run shazam’s distToNearest function and return a plot showing the length normalized hamming distance distribution and automated threshold value.

Again, pp.calculate_threshold will take a file path, pandas DataFrame or Dandelion object as input. If a dandelion object is provided, the threshold value will be inserted into the .threshold slot. For more fine control, please use shazam’s distToNearest and changeo’s DefineClones.py functions directly.

[6]:

ddl.pp.calculate_threshold(vdj)

Calculating threshold

R[write to console]: Error in (function (db, sequenceColumn = "junction", vCallColumn = "v_call",  :
  361 cell(s) with multiple heavy chains found. One heavy chain per cell is expected.

Rerun this after filtering. For now, switching to heavy mode.
      Threshold method 'density' did not return with any values. Switching to method = 'gmm'.

../_images/notebooks_3_dandelion_findingclones-10x_data_14_3.png


 finished: Updated Dandelion object:
   'threshold', threshold value for tuning clonal assignment
 (0:01:04)

[7]:

# see the actual value in .threshold slot
vdj.threshold

[7]:

0.2613216828025309

You can also manually select a value as the threshold if you wish. Note that rerunning this with manual_threshold is just for reproducing the plot but with the line at 0.1 in this tutorial. You can just edit vdj.threshold directly if you wish, i.e. vdj.threshold = 0.1.

[8]:

ddl.pp.calculate_threshold(vdj, manual_threshold=0.1)

Calculating threshold

R[write to console]: Error in (function (db, sequenceColumn = "junction", vCallColumn = "v_call",  :
  361 cell(s) with multiple heavy chains found. One heavy chain per cell is expected.

Rerun this after filtering. For now, switching to heavy mode.
      Threshold method 'density' did not return with any values. Switching to method = 'gmm'.

../_images/notebooks_3_dandelion_findingclones-10x_data_17_3.png


 finished: Updated Dandelion object:
   'threshold', threshold value for tuning clonal assignment
 (0:01:08)

[9]:

# see the updated .threshold slot
vdj.threshold

[9]:

0.1

[10]:

vdj

[10]:

Dandelion class object with n_obs = 2229 and n_contigs = 7357
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'duplicate_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'ambiguous', 'rearrangement_status', 'clone_id'
    metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'duplicate_count_B_VDJ', 'duplicate_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'

We can run ddl.tl.define_clones to call changeo’s DefineClones.py; see here for more info. Note, if a pandas.DataFrame or file path is provided as the input, the value in dist option (corresponds to threshold value) needs to be manually supplied. If a Dandelion object is provided, it will automatically retrieve it from the threshold slot. Additional options for ddl.tl.define_clones to provide to DefineClones.py can be supplied as a list to the additional_args option.

[11]:

ddl.tl.define_clones(vdj, key_added="changeo_clone_id")
vdj

Finding clones
Running command: DefineClones.py -d /var/folders/_r/j_8_fj3x28n2th3ch0ckn9c40000gt/T/tmp04n9y0u_/tmp/dandelion_define_clones_heavy-clone.tsv -o /var/folders/_r/j_8_fj3x28n2th3ch0ckn9c40000gt/T/tmp04n9y0u_/dandelion_define_clones_heavy-clone.tsv --act set --model ham --norm len --dist 0.1 --nproc 1 --vf v_call_genotyped

/opt/homebrew/Caskroom/mambaforge/base/envs/dandelion/lib/python3.11/site-packages/Bio/pairwise2.py:278: BiopythonDeprecationWarning: Bio.pairwise2 has been deprecated, and we intend to remove it in a future release of Biopython. As an alternative, please consider using Bio.Align.PairwiseAligner as a replacement, and contact the Biopython developers if you still need the Bio.pairwise2 module.
  warnings.warn(

       START> DefineClones
        FILE> dandelion_define_clones_heavy-clone.tsv
   SEQ_FIELD> junction
     V_FIELD> v_call_genotyped
     J_FIELD> j_call
 MAX_MISSING> 0
GROUP_FIELDS> None
      ACTION> set
        MODE> gene
    DISTANCE> 0.1
     LINKAGE> single
       MODEL> ham
        NORM> len
         SYM> avg
       NPROC> 1

/opt/homebrew/Caskroom/mambaforge/base/envs/dandelion/lib/python3.11/site-packages/Bio/pairwise2.py:278: BiopythonDeprecationWarning: Bio.pairwise2 has been deprecated, and we intend to remove it in a future release of Biopython. As an alternative, please consider using Bio.Align.PairwiseAligner as a replacement, and contact the Biopython developers if you still need the Bio.pairwise2 module.
  warnings.warn(
/opt/homebrew/Caskroom/mambaforge/base/envs/dandelion/lib/python3.11/site-packages/Bio/pairwise2.py:278: BiopythonDeprecationWarning: Bio.pairwise2 has been deprecated, and we intend to remove it in a future release of Biopython. As an alternative, please consider using Bio.Align.PairwiseAligner as a replacement, and contact the Biopython developers if you still need the Bio.pairwise2 module.
  warnings.warn(
/opt/homebrew/Caskroom/mambaforge/base/envs/dandelion/lib/python3.11/site-packages/Bio/pairwise2.py:278: BiopythonDeprecationWarning: Bio.pairwise2 has been deprecated, and we intend to remove it in a future release of Biopython. As an alternative, please consider using Bio.Align.PairwiseAligner as a replacement, and contact the Biopython developers if you still need the Bio.pairwise2 module.
  warnings.warn(

PROGRESS> [Grouping sequences] 23:53:07 (2110) 0.0 min

PROGRESS> [Assigning clones] 23:53:09 |####################| 100% (2,110) 0.0 min

 OUTPUT> dandelion_define_clones_heavy-clone.tsv
 CLONES> 2056
RECORDS> 2110
   PASS> 2110
   FAIL> 0
    END> DefineClones

 finished: Updated Dandelion object:
   'data', contig-indexed AIRR table
   'metadata', cell-indexed observations table
 (0:00:08)

[11]:

Dandelion class object with n_obs = 2229 and n_contigs = 7357
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'duplicate_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'ambiguous', 'rearrangement_status', 'clone_id', 'changeo_clone_id'
    metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'duplicate_count_B_VDJ', 'duplicate_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ', 'changeo_clone_id'

Note that I specified the option key_added and this adds the output from tl.define_clones into a separate column. If left as default (None), it will write into clone_id column. The same option can be specified in tl.find_clones earlier.

Generation of V(D)J network

dandelion generates a network to facilitate visualisation of results, inspired from [Bashford-Rogers2013]. This uses the full V(D)J contig sequences instead of just the junctional sequences to chart a tree-like network for each clone. The actual visualization will be achieved through scanpy later.

`ddl.tl.generate_network`

First we need to generate the network. ddl.tl.generate_network will take a V(D)J table that has clones defined, specifically under the 'clone_id' column. The default mode is to use amino acid sequences for constructing Levenshtein distance matrices, but can be toggled using the key option.

If you have a pre-processed table parsed from immcantation’s method, or any other method as long as it’s in a AIRR format, the table can be used as well.

You can specify the clone_key option for generating the network for the clone id definition of choice as long as it exists as a column in the .data slot.

Before proceeding, let’s do a bit of subsetting. Here I want to remove the Orphan VJ cells (lacking BCR heavy chain i.e. VDJ information). Whether or not you want to do this is up to you. I’m doing this because I want to focus on the BCR heavy chain for now. You may elect to keep everything and that can be your starting point for further analysis.

[12]:

vdj = vdj[
    vdj.metadata.chain_status.isin(
        ["Single pair", "Extra pair", "Extra pair-exception", "Orphan VDJ"]
    )
].copy()
vdj

[12]:

Dandelion class object with n_obs = 2071 and n_contigs = 4882
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'duplicate_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'ambiguous', 'rearrangement_status', 'clone_id', 'changeo_clone_id'
    metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'duplicate_count_B_VDJ', 'duplicate_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ', 'changeo_clone_id'

[13]:

ddl.tl.generate_network(vdj)

Generating network

Setting up data: 4289it [00:00, 8643.10it/s]
Calculating distances : 100%|██████████| 2200/2200 [00:00<00:00, 11934.79it/s]
Aggregating distances : 100%|██████████| 5/5 [00:00<00:00, 60.78it/s]
Sorting into clusters : 100%|██████████| 2200/2200 [00:00<00:00, 4111.08it/s]
Calculating minimum spanning tree : 100%|██████████| 33/33 [00:00<00:00, 1313.07it/s]
Generating edge list : 100%|██████████| 33/33 [00:00<00:00, 4986.38it/s]
Computing overlap : 100%|██████████| 2200/2200 [00:00<00:00, 4023.25it/s]
Adjust overlap : 100%|██████████| 122/122 [00:00<00:00, 4728.33it/s]
Linking edges : 100%|██████████| 2026/2026 [00:00<00:00, 49217.86it/s]

Computing network layout
Computing expanded network layout
 finished: Updated Dandelion object:
   'data', contig-indexed AIRR table
   'metadata', cell-indexed observations table
   'layout', graph layout
   'graph', network constructed from distance matrices of VDJ- and VJ- chains (0:00:11)

In dandelion version >=0.2.2, the default layout_method is changed to sfdp, which is implemented through graph-tool package. This is significantly faster than the default modified Fruchterman-Reingold layout which while will work reasonably fast here, it will take quite a while when a lot of contigs are provided (>100k cells may take 1 hour). You can toggle this behaviour with:

ddl.tl.generate_network(vdj, layout_method = 'mod_fr') # for the original
ddl.tl.generate_network(vdj, layout_method = 'sfdp') # for sfdp

Generating graph without layout

If you don’t care for the layout and simply want access to the network/graph, you can do:

ddl.tl.generate_network(vdj, compute_layout = False)

and use the networkx graphs in the graph slot and compute your own layout as you wish.

In previous versions of dandelion, it used to be possible to generate the entire distance matrix for every pair of cell but this functionality was removed because it was too time consuming. If you are after this, please reach out to me and we can try and see if we can reimplement it!

down sampling data/graph

You can also downsample the number of cells. This will return a new object as a downsampled copy of the original with its own distance matrix. We will add use_existing_graph=False for this to work (otherwise it will just reuse the previous graph to recompute a layout; it will throw an error as it doesn’t know what to do with downsampling).

[14]:

vdj_downsample = ddl.tl.generate_network(
    vdj, downsample=500, use_existing_graph=False
)
vdj_downsample

Generating network
Downsampling to 500 cells.

Setting up data: 1030it [00:00, 8020.00it/s]
Calculating distances : 100%|██████████| 553/553 [00:00<00:00, 9683.26it/s]
Aggregating distances : 100%|██████████| 4/4 [00:00<00:00, 338.15it/s]
Sorting into clusters : 100%|██████████| 553/553 [00:00<00:00, 5603.23it/s]
Calculating minimum spanning tree : 100%|██████████| 5/5 [00:00<00:00, 1051.26it/s]
Generating edge list : 100%|██████████| 5/5 [00:00<00:00, 7082.58it/s]
Computing overlap : 100%|██████████| 553/553 [00:00<00:00, 5171.85it/s]
Adjust overlap : 100%|██████████| 36/36 [00:00<00:00, 4731.90it/s]
Linking edges : 100%|██████████| 493/493 [00:00<00:00, 82536.70it/s]

Computing network layout
Computing expanded network layout
 finished: Updated Dandelion object:
   'data', contig-indexed AIRR table
   'metadata', cell-indexed observations table
   'layout', graph layout
   'graph', network constructed from distance matrices of VDJ- and VJ- chains (0:00:01)

[14]:

Dandelion class object with n_obs = 499 and n_contigs = 1030
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'duplicate_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'ambiguous', 'rearrangement_status', 'clone_id', 'changeo_clone_id'
    metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'duplicate_count_B_VDJ', 'duplicate_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'
    layout: layout for 499 vertices, layout for 7 vertices
    graph: networkx graph of 499 vertices, networkx graph of 7 vertices

check the newly re-initialized Dandelion object

[15]:

vdj

[15]:

Dandelion class object with n_obs = 2071 and n_contigs = 4882
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'duplicate_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'ambiguous', 'rearrangement_status', 'clone_id', 'changeo_clone_id'
    metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'duplicate_count_B_VDJ', 'duplicate_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ', 'changeo_clone_id'
    layout: layout for 2071 vertices, layout for 70 vertices
    graph: networkx graph of 2071 vertices, networkx graph of 70 vertices

The graph/networks can be accessed through the .graph slot as an networkx graph object if you want to extract the data for network statistics or make any changes to the network.

At this point, we can save the dandelion object; the file can be quite big because the distance matrix is not sparse. I recommend some form of compression (I use bzip2 below but that can impact on read/write times significantly). See here for compression options.

[16]:

vdj.write_h5ddl("dandelion_results.h5ddl", complib="bzip2")

[ ]:

V(D)J clustering

Finding clones

Running ddl.tl.find_clones

Alternative : Running tl.define_clones

Generation of V(D)J network

ddl.tl.generate_network

down sampling data/graph

Running `ddl.tl.find_clones`

Alternative : Running `tl.define_clones`

`ddl.tl.generate_network`