V(D)J clustering
On the topic of finding clones/clonotypes, there are many ways used for clustering BCRs, almost all involving some measure based on sequence similarity. There are also a lot of very well established guidelines and criterias maintained by the BCR community. For example, immcantation uses a number of model-based methods [Gupta2015] to group clones based on the distribution of length-normalised junctional hamming distance while others use the whole BCR V(D)J sequence to define clones as shown in this paper [Bashford-Rogers2019].
Import modules
[1]:
import os
import pandas as pd
import dandelion as ddl
ddl.logging.print_header()
dandelion==0.3.4.dev29 pandas==2.0.1 numpy==1.24.3 matplotlib==3.7.1 networkx==3.1 scipy==1.11.2
[2]:
# change directory to somewhere more workable
os.chdir(os.path.expanduser("~/Downloads/dandelion_tutorial/"))
# I'm importing scanpy here to make use of its logging module.
import scanpy as sc
sc.settings.verbosity = 3
import warnings
warnings.filterwarnings("ignore")
sc.logging.print_header()
scanpy==1.9.3 anndata==0.9.1 umap==0.5.3 numpy==1.24.3 scipy==1.11.2 pandas==2.0.1 scikit-learn==1.3.0 statsmodels==0.14.0 python-igraph==0.10.6 pynndescent==0.5.10
Read in the previously saved files
I will work with the same example from the previous section since I have the filtered V(D)J data stored in a Dandelion
class.
[3]:
vdj = ddl.read_h5ddl("dandelion_results.h5ddl")
vdj
[3]:
Dandelion class object with n_obs = 2229 and n_contigs = 7357
data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'duplicate_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'ambiguous', 'rearrangement_status'
metadata: 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'duplicate_count_B_VDJ', 'duplicate_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'
Finding clones
The following is dandelion’s implementation of a rather conventional method to define clones, ddl.tl.find_clones
.
Clone definition criterion
Clone definition is based on the following criterion:
Identical V- and J-gene usage in the VDJ chain (IGH/TRB/TRD).
Identical CDR3 junctional/CDR3 sequence length in the VDJ chain.
VDJ chain junctional/CDR3 sequences attains a minimum of % sequence similarity, based on hamming distance. The similarity cut-off is tunable (default is 85%; change to 100% if analyzing TCR data).
VJ chain (IGK/IGL/TRA/TRG) usage. If cells within clones use different VJ chains, the clone will be split following the same conditions for VDJ chains in (1-3) as above.
Running ddl.tl.find_clones
The function will take a file path, a pandas DataFrame
(for example if you’ve used pandas to read in the filtered file already), or a Dandelion
class object. The default mode for calculation of junctional hamming distance is to use the CDR3 junction amino acid sequences, specified via the key
option (None
defaults to junction_aa
). You can switch it to using CDR3 junction nucleotide sequences (key = 'junction'
), or even the full V(D)J amino acid sequence
(key = 'sequence_alignment_aa'
), as long as the column name exists in the .data
slot.
If you want to use the alleles for defining V-J gene usage, specify:
by_alleles = True
Clustering TCR is possible with the same setup but requires changing of default parameters (covered in the TCR section).
[4]:
ddl.tl.find_clones(vdj)
vdj
Finding clonotypes
Finding clones based on B cell VDJ chains : 100%|██████████| 220/220 [00:00<00:00, 1481.66it/s]
Finding clones based on B cell VJ chains : 100%|██████████| 210/210 [00:00<00:00, 4790.91it/s]
Refining clone assignment based on VJ chain pairing : 100%|██████████| 2229/2229 [00:00<00:00, 577782.81it/s]
finished: Updated Dandelion object:
'data', contig-indexed AIRR table
'metadata', cell-indexed observations table
(0:00:00)
[4]:
Dandelion class object with n_obs = 2229 and n_contigs = 7357
data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'duplicate_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'ambiguous', 'rearrangement_status', 'clone_id'
metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'duplicate_count_B_VDJ', 'duplicate_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'
This will return a new column with the column name 'clone_id'
as per convention. If a file path is provided as input, it will also save the file automatically into the base directory of the file name. Otherwise, a Dandelion
object will be returned.
Clonotype definition criterion
The clone_id
follows an A_B_C_D_E_F
format and largely reflects the conditions above where:
{A} indicates if the contigs use the same V and J genes in the VDJ chain.
{B} indicates if junctional/CDR3 sequences are equal in length in the VDJ chain.
{C} indicates if clones are split based on junctional/CDR3 hamming distance threshold (for VDJ chain).
{D} indicates if the contigs use the same V and J genes in the VJ chain.
{E} indicates if junctional/CDR3 sequences are equal in length in the VJ chain.
{F} indicates if clones are split based on junctional/CDR3 hamming distance threshold (for VJ chain).
Also, to prevent issues with clonotype ids matching between B cells and T cells, there will be a prefix added to the clone_id
to reflect whether or not it’s a B
, abT
or gdT
clone.
Also, to reduce ambiguity, the A_B_C
segment will have the VDJ
prefix and the D_E_F
segment will have the VJ
suffix.
Therefor, a complete B cell clonotype id will look something like:
B_VDJ_1_1_2_VJ_2_1_1
For an Orphan VDJ
, it would be B_VDJ_1_1_2
.
For an Orphan VJ
, it would be B_VJ_2_1_1
.
There is also an alternate column called clone_id_by_size
which is a simple numerical version of the clone_id
which corresponds to the size of the clonotype - 1
is the largest clonotype, 2
is the second largest, and so on.
[5]:
vdj.metadata
[5]:
clone_id | clone_id_by_size | sample_id | locus_VDJ | locus_VJ | productive_VDJ | productive_VJ | v_call_genotyped_VDJ | d_call_VDJ | j_call_VDJ | ... | d_call_B_VDJ_main | j_call_B_VDJ_main | v_call_B_VJ_main | j_call_B_VJ_main | isotype | isotype_status | locus_status | chain_status | rearrangement_status_VDJ | rearrangement_status_VJ | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG | B_VJ_165_2_7 | 70 | sc5p_v2_hs_PBMC_10k | None | IGK | None | T | None | None | None | ... | None | None | IGKV1-33,IGKV1D-33 | IGKJ4 | None | None | Orphan IGK | Orphan VJ | None | standard |
sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC | B_VDJ_119_3_2_VJ_80_2_3 | 1952 | sc5p_v2_hs_PBMC_10k | IGH | IGK | T | T | IGHV1-69,IGHV1-69D | IGHD3-22 | IGHJ3 | ... | IGHD3-22 | IGHJ3 | IGKV1-8 | IGKJ1 | IgM | IgM | IGH + IGK | Single pair | standard | standard |
sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG | B_VDJ_42_1_2_VJ_54_1_1 | 1567 | sc5p_v2_hs_PBMC_10k | IGH | IGL | T | T | IGHV1-2 | None | IGHJ3 | ... | None | IGHJ3 | IGLV5-45 | IGLJ3 | IgM | IgM | IGH + IGL | Single pair | standard | standard |
sc5p_v2_hs_PBMC_10k_AAACCTGTCTTGAGAC | B_VDJ_38_4_4_VJ_191_1_1 | 1568 | sc5p_v2_hs_PBMC_10k | IGH | IGK | T | T | IGHV5-51 | None | IGHJ3 | ... | None | IGHJ3 | IGKV1D-8 | IGKJ2 | IgM | IgM | IGH + IGK | Single pair | standard | standard |
sc5p_v2_hs_PBMC_10k_AAACGGGAGCGACGTA | B_VDJ_55_2_1_VJ_184_2_7 | 1569 | sc5p_v2_hs_PBMC_10k | IGH | IGL | T | T | IGHV4-4 | IGHD6-13 | IGHJ3 | ... | IGHD6-13 | IGHJ3 | IGLV3-19 | IGLJ3,IGLJ2 | IgM | IgM | IGH + IGL | Single pair | standard | standard |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
vdj_v1_hs_pbmc3_TTTCCTCAGCAATATG | B_VDJ_41_2_1_VJ_26_2_8 | 796 | vdj_v1_hs_pbmc3 | IGH | IGK | T | T | IGHV2-5 | IGHD5/OR15-5a,IGHD5/OR15-5b | IGHJ5,IGHJ4 | ... | IGHD5/OR15-5a,IGHD5/OR15-5b | IGHJ5,IGHJ4 | IGKV4-1 | IGKJ4 | IgM | IgM | IGH + IGK | Single pair | standard | standard |
vdj_v1_hs_pbmc3_TTTCCTCAGCGCTTAT | B_VDJ_2_6_3_VJ_87_1_3 | 797 | vdj_v1_hs_pbmc3 | IGH | IGK | T | T | IGHV3-30 | IGHD4-17 | IGHJ6 | ... | IGHD4-17 | IGHJ6 | IGKV2-30 | IGKJ2 | IgM | IgM | IGH + IGK | Single pair | standard | standard |
vdj_v1_hs_pbmc3_TTTCCTCAGGGAAACA | B_VDJ_1_1_1_VJ_139_4_16 | 798 | vdj_v1_hs_pbmc3 | IGH | IGK | T | T | IGHV4-59 | IGHD6-13 | IGHJ2 | ... | IGHD6-13 | IGHJ2 | IGKV1D-39,IGKV1-39 | IGKJ1 | IgM | IgM | IGH + IGK | Single pair | standard | standard |
vdj_v1_hs_pbmc3_TTTGCGCCATACCATG | B_VDJ_47_4_1_VJ_103_3_4 | 799 | vdj_v1_hs_pbmc3 | IGH | IGL | T | T | IGHV1-69,IGHV1-69D | IGHD2-15 | IGHJ6 | ... | IGHD2-15 | IGHJ6 | IGLV1-47 | IGLJ3 | IgM | IgM | IGH + IGL | Single pair | standard | standard |
vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG | B_VDJ_184_5_1_VJ_121_3_3 | 2336 | vdj_v1_hs_pbmc3 | IGH | IGL | T | T | IGHV3-23,IGHV3-23D | None | IGHJ4 | ... | None | IGHJ4 | IGLV2-11 | IGLJ3,IGLJ2 | IgM | IgM | IGH + IGL | Single pair | standard | standard |
2229 rows × 47 columns
Alternative : Running tl.define_clones
Alternatively, a wrapper to call changeo’s DefineClones.py
[Gupta2015] is also included. To run it, you need to choose the distance threshold for clonal assignment. To facilitate this, the function pp.calculate_threshold
will run shazam’s distToNearest function and return a plot showing the length normalized hamming distance distribution and
automated threshold value.
Again, pp.calculate_threshold
will take a file path, pandas DataFrame
or Dandelion
object as input. If a dandelion object is provided, the threshold value will be inserted into the .threshold
slot. For more fine control, please use shazam’s distToNearest and changeo’s DefineClones.py functions directly.
[6]:
ddl.pp.calculate_threshold(vdj)
Calculating threshold
R[write to console]: Error in (function (db, sequenceColumn = "junction", vCallColumn = "v_call", :
361 cell(s) with multiple heavy chains found. One heavy chain per cell is expected.
Rerun this after filtering. For now, switching to heavy mode.
Threshold method 'density' did not return with any values. Switching to method = 'gmm'.
finished: Updated Dandelion object:
'threshold', threshold value for tuning clonal assignment
(0:01:04)
[7]:
# see the actual value in .threshold slot
vdj.threshold
[7]:
0.2613216828025309
You can also manually select a value as the threshold if you wish. Note that rerunning this with manual_threshold
is just for reproducing the plot but with the line at 0.1 in this tutorial. You can just edit vdj.threshold
directly if you wish, i.e. vdj.threshold = 0.1
.
[8]:
ddl.pp.calculate_threshold(vdj, manual_threshold=0.1)
Calculating threshold
R[write to console]: Error in (function (db, sequenceColumn = "junction", vCallColumn = "v_call", :
361 cell(s) with multiple heavy chains found. One heavy chain per cell is expected.
Rerun this after filtering. For now, switching to heavy mode.
Threshold method 'density' did not return with any values. Switching to method = 'gmm'.
finished: Updated Dandelion object:
'threshold', threshold value for tuning clonal assignment
(0:01:08)
[9]:
# see the updated .threshold slot
vdj.threshold
[9]:
0.1
[10]:
vdj
[10]:
Dandelion class object with n_obs = 2229 and n_contigs = 7357
data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'duplicate_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'ambiguous', 'rearrangement_status', 'clone_id'
metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'duplicate_count_B_VDJ', 'duplicate_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'
We can run ddl.tl.define_clones
to call changeo’s DefineClones.py
; see here for more info. Note, if a pandas.DataFrame
or file path is provided as the input, the value in dist option (corresponds to threshold value) needs to be manually supplied. If a Dandelion
object is provided, it will automatically retrieve it from the threshold slot. Additional options for ddl.tl.define_clones
to provide to
DefineClones.py
can be supplied as a list to the additional_args
option.
[11]:
ddl.tl.define_clones(vdj, key_added="changeo_clone_id")
vdj
Finding clones
Running command: DefineClones.py -d /var/folders/_r/j_8_fj3x28n2th3ch0ckn9c40000gt/T/tmp04n9y0u_/tmp/dandelion_define_clones_heavy-clone.tsv -o /var/folders/_r/j_8_fj3x28n2th3ch0ckn9c40000gt/T/tmp04n9y0u_/dandelion_define_clones_heavy-clone.tsv --act set --model ham --norm len --dist 0.1 --nproc 1 --vf v_call_genotyped
/opt/homebrew/Caskroom/mambaforge/base/envs/dandelion/lib/python3.11/site-packages/Bio/pairwise2.py:278: BiopythonDeprecationWarning: Bio.pairwise2 has been deprecated, and we intend to remove it in a future release of Biopython. As an alternative, please consider using Bio.Align.PairwiseAligner as a replacement, and contact the Biopython developers if you still need the Bio.pairwise2 module.
warnings.warn(
START> DefineClones
FILE> dandelion_define_clones_heavy-clone.tsv
SEQ_FIELD> junction
V_FIELD> v_call_genotyped
J_FIELD> j_call
MAX_MISSING> 0
GROUP_FIELDS> None
ACTION> set
MODE> gene
DISTANCE> 0.1
LINKAGE> single
MODEL> ham
NORM> len
SYM> avg
NPROC> 1
/opt/homebrew/Caskroom/mambaforge/base/envs/dandelion/lib/python3.11/site-packages/Bio/pairwise2.py:278: BiopythonDeprecationWarning: Bio.pairwise2 has been deprecated, and we intend to remove it in a future release of Biopython. As an alternative, please consider using Bio.Align.PairwiseAligner as a replacement, and contact the Biopython developers if you still need the Bio.pairwise2 module.
warnings.warn(
/opt/homebrew/Caskroom/mambaforge/base/envs/dandelion/lib/python3.11/site-packages/Bio/pairwise2.py:278: BiopythonDeprecationWarning: Bio.pairwise2 has been deprecated, and we intend to remove it in a future release of Biopython. As an alternative, please consider using Bio.Align.PairwiseAligner as a replacement, and contact the Biopython developers if you still need the Bio.pairwise2 module.
warnings.warn(
/opt/homebrew/Caskroom/mambaforge/base/envs/dandelion/lib/python3.11/site-packages/Bio/pairwise2.py:278: BiopythonDeprecationWarning: Bio.pairwise2 has been deprecated, and we intend to remove it in a future release of Biopython. As an alternative, please consider using Bio.Align.PairwiseAligner as a replacement, and contact the Biopython developers if you still need the Bio.pairwise2 module.
warnings.warn(
PROGRESS> [Grouping sequences] 23:53:07 (2110) 0.0 min
PROGRESS> [Assigning clones] 23:53:09 |####################| 100% (2,110) 0.0 min
OUTPUT> dandelion_define_clones_heavy-clone.tsv
CLONES> 2056
RECORDS> 2110
PASS> 2110
FAIL> 0
END> DefineClones
finished: Updated Dandelion object:
'data', contig-indexed AIRR table
'metadata', cell-indexed observations table
(0:00:08)
[11]:
Dandelion class object with n_obs = 2229 and n_contigs = 7357
data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'duplicate_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'ambiguous', 'rearrangement_status', 'clone_id', 'changeo_clone_id'
metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'duplicate_count_B_VDJ', 'duplicate_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ', 'changeo_clone_id'
Note that I specified the option key_added
and this adds the output from tl.define_clones
into a separate column. If left as default (None
), it will write into clone_id
column. The same option can be specified in tl.find_clones
earlier.
Generation of V(D)J network
dandelion generates a network to facilitate visualisation of results, inspired from [Bashford-Rogers2013]. This uses the full V(D)J contig sequences instead of just the junctional sequences to chart a tree-like network for each clone. The actual visualization will be achieved through scanpy
later.
ddl.tl.generate_network
First we need to generate the network. ddl.tl.generate_network
will take a V(D)J table that has clones defined, specifically under the 'clone_id'
column. The default mode is to use amino acid sequences for constructing Levenshtein distance matrices, but can be toggled using the key
option.
If you have a pre-processed table parsed from immcantation’s method, or any other method as long as it’s in a AIRR format, the table can be used as well.
You can specify the clone_key
option for generating the network for the clone id definition of choice as long as it exists as a column in the .data
slot.
Before proceeding, let’s do a bit of subsetting. Here I want to remove the Orphan VJ cells (lacking BCR heavy chain i.e. VDJ information). Whether or not you want to do this is up to you. I’m doing this because I want to focus on the BCR heavy chain for now. You may elect to keep everything and that can be your starting point for further analysis.
[12]:
vdj = vdj[
vdj.metadata.chain_status.isin(
["Single pair", "Extra pair", "Extra pair-exception", "Orphan VDJ"]
)
].copy()
vdj
[12]:
Dandelion class object with n_obs = 2071 and n_contigs = 4882
data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'duplicate_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'ambiguous', 'rearrangement_status', 'clone_id', 'changeo_clone_id'
metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'duplicate_count_B_VDJ', 'duplicate_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ', 'changeo_clone_id'
[13]:
ddl.tl.generate_network(vdj)
Generating network
Setting up data: 4289it [00:00, 8643.10it/s]
Calculating distances : 100%|██████████| 2200/2200 [00:00<00:00, 11934.79it/s]
Aggregating distances : 100%|██████████| 5/5 [00:00<00:00, 60.78it/s]
Sorting into clusters : 100%|██████████| 2200/2200 [00:00<00:00, 4111.08it/s]
Calculating minimum spanning tree : 100%|██████████| 33/33 [00:00<00:00, 1313.07it/s]
Generating edge list : 100%|██████████| 33/33 [00:00<00:00, 4986.38it/s]
Computing overlap : 100%|██████████| 2200/2200 [00:00<00:00, 4023.25it/s]
Adjust overlap : 100%|██████████| 122/122 [00:00<00:00, 4728.33it/s]
Linking edges : 100%|██████████| 2026/2026 [00:00<00:00, 49217.86it/s]
Computing network layout
Computing expanded network layout
finished: Updated Dandelion object:
'data', contig-indexed AIRR table
'metadata', cell-indexed observations table
'layout', graph layout
'graph', network constructed from distance matrices of VDJ- and VJ- chains (0:00:11)
In dandelion
version >=0.2.2
, the default layout_method
is changed to sfdp
, which is implemented through graph-tool
package. This is significantly faster than the default modified Fruchterman-Reingold layout which while will work reasonably fast here, it will take quite a while when a lot of contigs are provided (>100k cells may take 1 hour). You can toggle this behaviour with:
ddl.tl.generate_network(vdj, layout_method = 'mod_fr') # for the original
ddl.tl.generate_network(vdj, layout_method = 'sfdp') # for sfdp
Generating graph without layout
If you don’t care for the layout and simply want access to the network/graph, you can do:
ddl.tl.generate_network(vdj, compute_layout = False)
and use the networkx graphs in the graph slot and compute your own layout as you wish.
In previous versions of dandelion, it used to be possible to generate the entire distance matrix for every pair of cell but this functionality was removed because it was too time consuming. If you are after this, please reach out to me and we can try and see if we can reimplement it!
down sampling data/graph
You can also downsample the number of cells. This will return a new object as a downsampled copy of the original with its own distance matrix. We will add use_existing_graph=False
for this to work (otherwise it will just reuse the previous graph to recompute a layout; it will throw an error as it doesn’t know what to do with downsampling).
[14]:
vdj_downsample = ddl.tl.generate_network(
vdj, downsample=500, use_existing_graph=False
)
vdj_downsample
Generating network
Downsampling to 500 cells.
Setting up data: 1030it [00:00, 8020.00it/s]
Calculating distances : 100%|██████████| 553/553 [00:00<00:00, 9683.26it/s]
Aggregating distances : 100%|██████████| 4/4 [00:00<00:00, 338.15it/s]
Sorting into clusters : 100%|██████████| 553/553 [00:00<00:00, 5603.23it/s]
Calculating minimum spanning tree : 100%|██████████| 5/5 [00:00<00:00, 1051.26it/s]
Generating edge list : 100%|██████████| 5/5 [00:00<00:00, 7082.58it/s]
Computing overlap : 100%|██████████| 553/553 [00:00<00:00, 5171.85it/s]
Adjust overlap : 100%|██████████| 36/36 [00:00<00:00, 4731.90it/s]
Linking edges : 100%|██████████| 493/493 [00:00<00:00, 82536.70it/s]
Computing network layout
Computing expanded network layout
finished: Updated Dandelion object:
'data', contig-indexed AIRR table
'metadata', cell-indexed observations table
'layout', graph layout
'graph', network constructed from distance matrices of VDJ- and VJ- chains (0:00:01)
[14]:
Dandelion class object with n_obs = 499 and n_contigs = 1030
data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'duplicate_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'ambiguous', 'rearrangement_status', 'clone_id', 'changeo_clone_id'
metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'duplicate_count_B_VDJ', 'duplicate_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'
layout: layout for 499 vertices, layout for 7 vertices
graph: networkx graph of 499 vertices, networkx graph of 7 vertices
check the newly re-initialized Dandelion object
[15]:
vdj
[15]:
Dandelion class object with n_obs = 2071 and n_contigs = 4882
data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'duplicate_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'ambiguous', 'rearrangement_status', 'clone_id', 'changeo_clone_id'
metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'duplicate_count_B_VDJ', 'duplicate_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ', 'changeo_clone_id'
layout: layout for 2071 vertices, layout for 70 vertices
graph: networkx graph of 2071 vertices, networkx graph of 70 vertices
The graph/networks can be accessed through the .graph
slot as an networkx graph object if you want to extract the data for network statistics or make any changes to the network.
At this point, we can save the dandelion object; the file can be quite big because the distance matrix is not sparse. I recommend some form of compression (I use bzip2
below but that can impact on read/write times significantly). See here for compression options.
[16]:
vdj.write_h5ddl("dandelion_results.h5ddl", complib="bzip2")
[ ]: