{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# V(D)J clustering\n", "\n", "On the topic of finding clones/clonotypes, there are many ways to cluster BCRs, almost all involving some measure of sequence similarity. There are also well-established guidelines and criteria maintained by the BCR community. For example, *immcantation* uses a number of model-based [methods](https://changeo.readthedocs.io/en/stable/methods/clustering.html) [[Gupta2015]](https://academic.oup.com/bioinformatics/article/31/20/3356/195677) to group clones based on the distribution of length-normalised junctional Hamming distance, while others use the whole BCR V(D)J sequence to define clones, as shown in [[Bashford-Rogers2019]](https://www.nature.com/articles/s41586-019-1595-3)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import modules" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "dandelion==0.5.5.dev16 pandas==2.2.3 numpy==2.1.3 matplotlib==3.10.1 networkx==3.4.2 scipy==1.15.2\n" ] } ], "source": [ "import os\n", "import dandelion as ddl\n", "\n", "ddl.logging.print_header()" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "scanpy==1.10.3 anndata==0.11.3 umap==0.5.7 numpy==2.1.3 scipy==1.15.2 pandas==2.2.3 scikit-learn==1.6.1 statsmodels==0.14.4 igraph==0.11.8 pynndescent==0.5.13\n" ] } ], "source": [ "# change directory to somewhere more workable\n", "os.chdir(os.path.expanduser(\"~/Downloads/dandelion_tutorial/\"))\n", "# I'm importing scanpy here to make use of its logging module.\n", "import scanpy as sc\n", "\n", "sc.settings.verbosity = 3\n", "import warnings\n", "\n", "warnings.filterwarnings(\"ignore\")\n", "sc.logging.print_header()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ 
"Read in the previously saved files\n", "\n", "I will work with the same example from the previous section since I have the filtered V(D)J data stored in a `Dandelion` class." ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "Dandelion class object with n_obs = 2238 and n_contigs = 7355\n", " data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 
'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'ambiguous', 'extra', 'rearrangement_status'\n", " metadata: 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'umi_count_B_VDJ', 'umi_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vdj = ddl.read_h5ddl(\"dandelion_results.h5ddl\")\n", "vdj" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Finding clones\n", "\n", "The following is *dandelion*'s implementation of a rather conventional method to define clones, `ddl.tl.find_clones`. \n", "\n", "
| sequence_id | v_call_genotyped | j_call | c_call |
|---|---|---|---|
| sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1 | IGKV1-33*01,IGKV1D-33*01 | IGKJ4*01 | IGKC |
| sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2 | IGHV1-69*01,IGHV1-69D*01 | IGHJ3*02 | IGHM |
| sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_1 | IGKV1-8*01 | IGKJ1*01 | IGKC |
| sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_1 | IGLV5-45*02 | IGLJ3*02 | IGLC3 |
| sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_2 | IGHV1-2*02 | IGHJ3*02 | IGHM |
| ... | ... | ... | ... |
| vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC_contig_1 | IGHV1-46*01 | IGHJ5*02 | IGHM |
| vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_2 | IGHV1-69*01,IGHV1-69D*01 | IGHJ6*02 | IGHM |
| vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_1 | IGLV1-47*01 | IGLJ3*02 | IGLC3 |
| vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_2 | IGLV2-11*01 | IGLJ2*01,IGLJ3*01,IGLJ3*02 | IGLC |
| vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_1 | IGHV3-23*01,IGHV3-23D*01 | IGHJ4*02 | IGHM |

7355 rows × 3 columns

| sequence_id | v_call_genotyped | j_call | c_call |
|---|---|---|---|
| sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1 | IGKV1-33 | IGKJ4 | IGKC |
| sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2 | IGHV1-69 | IGHJ3 | IGHM |
| sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_1 | IGKV1-8 | IGKJ1 | IGKC |
| sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_1 | IGLV5-45 | IGLJ3 | IGLC3 |
| sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_2 | IGHV1-2 | IGHJ3 | IGHM |
| ... | ... | ... | ... |
| vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC_contig_1 | IGHV1-46 | IGHJ5 | IGHM |
| vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_2 | IGHV1-69 | IGHJ6 | IGHM |
| vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_1 | IGLV1-47 | IGLJ3 | IGLC3 |
| vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_2 | IGLV2-11 | IGLJ2 | IGLC |
| vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_1 | IGHV3-23 | IGHJ4 | IGHM |

7355 rows × 3 columns
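
Comparing the two tables above, the second appears to keep only the first comma-separated call in each cell and to drop the allele suffix (the part after `*`). The snippet below is a minimal sketch of that transformation in plain Python, using three values copied from the first table. The helper name `first_gene` is hypothetical and is not part of *dandelion*; the library's own routine may differ in its details.

```python
def first_gene(call):
    '''Keep the first comma-separated call and drop its allele suffix.'''
    if not call:
        # Leave missing or empty calls untouched.
        return call
    first_call = call.split(',')[0].strip()
    return first_call.split('*')[0]


# Values copied from the first table; the dictionary keys are only labels.
raw_calls = {
    'v_call_genotyped': 'IGKV1-33*01,IGKV1D-33*01',
    'j_call': 'IGLJ2*01,IGLJ3*01,IGLJ3*02',
    'c_call': 'IGKC',
}

simplified = {column: first_gene(call) for column, call in raw_calls.items()}
print(simplified)
```

Calls without an allele suffix, such as `IGKC`, pass through unchanged, so the helper can be applied to all three columns alike.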

| | clone_id | clone_id_by_size | sample_id | locus_VDJ | locus_VJ | productive_VDJ | productive_VJ | v_call_genotyped_VDJ | d_call_VDJ | j_call_VDJ | ... | d_call_B_VDJ_main | j_call_B_VDJ_main | v_call_B_VJ_main | j_call_B_VJ_main | isotype | isotype_status | locus_status | chain_status | rearrangement_status_VDJ | rearrangement_status_VJ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG | B_VJ_186_2_8 | 169 | sc5p_v2_hs_PBMC_10k | None | IGK | None | T | None | None | None | ... | None | None | IGKV1-33 | IGKJ4 | | | Orphan IGK | Orphan VJ | None | standard |
| sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC | B_VDJ_129_4_2_VJ_142_2_3 | 1989 | sc5p_v2_hs_PBMC_10k | IGH | IGK | T | T | IGHV1-69 | IGHD3-22 | IGHJ3 | ... | IGHD3-22 | IGHJ3 | IGKV1-8 | IGKJ1 | IgM | IgM | IGH + IGK | Single pair | standard | standard |
| sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG | B_VDJ_81_1_2_VJ_41_1_1 | 1603 | sc5p_v2_hs_PBMC_10k | IGH | IGL | T | T | IGHV1-2 | None | IGHJ3 | ... | None | IGHJ3 | IGLV5-45 | IGLJ3 | IgM | IgM | IGH + IGL | Single pair | standard | standard |
| sc5p_v2_hs_PBMC_10k_AAACCTGTCTTGAGAC | B_VDJ_130_4_5_VJ_30_1_1 | 1604 | sc5p_v2_hs_PBMC_10k | IGH | IGK | T | T | IGHV5-51 | None | IGHJ3 | ... | None | IGHJ3 | IGKV1D-8 | IGKJ2 | IgM | IgM | IGH + IGK | Single pair | standard | standard |
| sc5p_v2_hs_PBMC_10k_AAACGGGAGCGACGTA | B_VDJ_46_2_1_VJ_101_2_7 | 1605 | sc5p_v2_hs_PBMC_10k | IGH | IGL | T | T | IGHV4-4 | IGHD6-13 | IGHJ3 | ... | IGHD6-13 | IGHJ3 | IGLV3-19 | IGLJ2 | IgM | IgM | IGH + IGL | Single pair | standard | standard |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| vdj_v1_hs_pbmc3_TTTCCTCAGCAATATG | B_VDJ_174_1_1_VJ_71_2_8 | 813 | vdj_v1_hs_pbmc3 | IGH | IGK | T | T | IGHV2-5 | IGHD5/OR15-5a | IGHJ4 | ... | IGHD5/OR15-5a | IGHJ4 | IGKV4-1 | IGKJ4 | IgM | IgM | IGH + IGK | Single pair | standard | standard |
| vdj_v1_hs_pbmc3_TTTCCTCAGCGCTTAT | B_VDJ_93_5_6_VJ_153_1_3 | 814 | vdj_v1_hs_pbmc3 | IGH | IGK | T | T | IGHV3-30 | IGHD4-17 | IGHJ6 | ... | IGHD4-17 | IGHJ6 | IGKV2-30 | IGKJ2 | IgM | IgM | IGH + IGK | Single pair | standard | standard |
| vdj_v1_hs_pbmc3_TTTCCTCAGGGAAACA | B_VDJ_83_1_1_VJ_83_4_13 | 815 | vdj_v1_hs_pbmc3 | IGH | IGK | T | T | IGHV4-61 | IGHD6-13 | IGHJ2 | ... | IGHD6-13 | IGHJ2 | IGKV1-39 | IGKJ1 | IgM | IgM | IGH + IGK | Single pair | standard | standard |
| vdj_v1_hs_pbmc3_TTTGCGCCATACCATG | B_VDJ_38_7_2_VJ_160_3_5 | 816 | vdj_v1_hs_pbmc3 | IGH | IGL | T | T | IGHV1-69 | IGHD2-15 | IGHJ6 | ... | IGHD2-15 | IGHJ6 | IGLV1-47 | IGLJ3 | IgM | IgM | IGH + IGL | Single pair | standard | standard |
| vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG | B_VDJ_144_5_4_VJ_79_3_2 | 2389 | vdj_v1_hs_pbmc3 | IGH | IGL | T | T | IGHV3-23 | None | IGHJ4 | ... | None | IGHJ4 | IGLV2-11 | IGLJ2 | IgM | IgM | IGH + IGL | Single pair | standard | standard |

2238 rows × 47 columns
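
To make the idea of a "conventional" clone definition more concrete, here is a rough, self-contained sketch. It is not *dandelion*'s implementation of `ddl.tl.find_clones`. It assumes the criteria that are commonly used for this kind of method: identical V gene, identical J gene and identical junction length, followed by single-linkage grouping of junctions whose normalised Hamming similarity reaches a threshold. The 0.85 threshold, the function names, the record fields and the toy junction sequences are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def hamming_similarity(a, b):
    '''Fraction of matching positions between two equal-length strings.'''
    if len(a) != len(b):
        raise ValueError('sequences must have equal length')
    if not a:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def assign_clones(records, min_similarity=0.85):
    '''Map each cell_id to an integer clone label (illustrative only).'''
    # Step 1: bucket sequences by V gene, J gene and junction length.
    buckets = defaultdict(list)
    for rec in records:
        key = (rec['v_gene'], rec['j_gene'], len(rec['junction_aa']))
        buckets[key].append(rec)

    clones = {}
    next_label = 0
    for key in sorted(buckets):
        members = buckets[key]
        # Step 2: single-linkage grouping inside the bucket via union-find.
        parent = list(range(len(members)))

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i

        for i, j in combinations(range(len(members)), 2):
            similarity = hamming_similarity(
                members[i]['junction_aa'], members[j]['junction_aa']
            )
            if similarity >= min_similarity:
                parent[find(i)] = find(j)

        # Step 3: give every connected group its own clone label.
        labels = {}
        for i, rec in enumerate(members):
            root = find(i)
            if root not in labels:
                labels[root] = next_label
                next_label += 1
            clones[rec['cell_id']] = labels[root]
    return clones


# Toy heavy-chain records; the junction sequences are made up.
records = [
    {'cell_id': 'cell_a', 'v_gene': 'IGHV1-69', 'j_gene': 'IGHJ3',
     'junction_aa': 'CARDYGSGSYYFDYW'},
    {'cell_id': 'cell_b', 'v_gene': 'IGHV1-69', 'j_gene': 'IGHJ3',
     'junction_aa': 'CARDYGSGSYYFDVW'},
    {'cell_id': 'cell_c', 'v_gene': 'IGHV1-69', 'j_gene': 'IGHJ3',
     'junction_aa': 'CARGGWELLRAFDIW'},
    {'cell_id': 'cell_d', 'v_gene': 'IGHV3-23', 'j_gene': 'IGHJ4',
     'junction_aa': 'CARDYGSGSYYFDYW'},
]

clones = assign_clones(records)
print(clones)
```

In this toy input, `cell_a` and `cell_b` differ at a single junction position and share a label, `cell_c` has the same genes and junction length but a dissimilar junction, and `cell_d` uses different V and J genes, so both end up in clones of their own.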