Interoperability with scirpy

It is now possible to convert the file formats between dandelion>=0.1.1 and scirpy>=0.6.2 [Sturm2020] to enhance the collaboration between the analysis toolkits.

We will download the airr_rearrangement.tsv file from here:

# bash
wget https://cf.10xgenomics.com/samples/cell-vdj/4.0.0/sc5p_v2_hs_PBMC_10k/sc5p_v2_hs_PBMC_10k_b_airr_rearrangement.tsv

Gene expression data can also be obtained here

# bash
wget https://cf.10xgenomics.com/samples/cell-vdj/4.0.0/sc5p_v2_hs_PBMC_10k/sc5p_v2_hs_PBMC_10k_filtered_feature_bc_matrix.h5

Import dandelion module

[1]:
# import sys
# sys.path.append("C://Users//Amos Choo//Desktop//dandelion")
import os

import dandelion as ddl


# change directory to somewhere more workable

# os.chdir(os.path.expanduser("~/Downloads/dandelion_tutorial/"))

ddl.logging.print_versions()
dandelion==0.3.4.dev74 pandas==2.0.1 numpy==1.24.3 matplotlib==3.7.1 networkx==3.1 scipy==1.11.2
[2]:
import scirpy as ir
import scanpy as sc


ir.__version__
[2]:
'0.14.0'

dandelion

[3]:
# read in the airr_rearrangement.tsv file
file_location = "sc5p_v2_hs_PBMC_10k_b_airr_rearrangement.tsv"

# read in gene expression data
adata = sc.read_10x_h5("sc5p_v2_hs_PBMC_10k_filtered_feature_bc_matrix.h5")
adata.var_names_make_unique()

vdj = ddl.read_10x_airr(file_location)
vdj
/opt/homebrew/Caskroom/mambaforge/base/envs/dandelion/lib/python3.11/site-packages/anndata/_core/anndata.py:1832: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
[3]:
Dandelion class object with n_obs = 994 and n_contigs = 2093
    data: 'cell_id', 'sequence_id', 'sequence', 'sequence_aa', 'productive', 'rev_comp', 'v_call', 'v_cigar', 'd_call', 'd_cigar', 'j_call', 'j_cigar', 'c_call', 'c_cigar', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'junction_length', 'junction_aa_length', 'v_sequence_start', 'v_sequence_end', 'd_sequence_start', 'd_sequence_end', 'j_sequence_start', 'j_sequence_end', 'c_sequence_start', 'c_sequence_end', 'consensus_count', 'duplicate_count', 'is_cell', 'locus', 'rearrangement_status'
    metadata: 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'duplicate_count_B_VDJ', 'duplicate_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'

The test file contains a blank clone_id column so we run find_clones to populate it first.

[4]:
ddl.tl.find_clones(vdj)
Finding clones based on B cell VDJ chains : 100%|██████████| 157/157 [00:00<00:00, 7623.45it/s]
Finding clones based on B cell VJ chains : 100%|██████████| 164/164 [00:00<00:00, 10267.42it/s]
Refining clone assignment based on VJ chain pairing : 100%|██████████| 994/994 [00:00<00:00, 586129.37it/s]

ddl.to_scirpy : Converting dandelion to scirpy

[5]:
irdata = ddl.to_scirpy(vdj)
irdata
/opt/homebrew/Caskroom/mambaforge/base/envs/dandelion/lib/python3.11/site-packages/anndata/_core/aligned_mapping.py:54: ExperimentalFeatureWarning: Support for Awkward Arrays is currently experimental. Behavior may change in the future. Please report any issues you may encounter!
[5]:
MuData object with n_obs × n_vars = 994 × 0
  1 modality
    airr:   994 x 0
      obsm: 'airr'

Conversion to AnndData in scirpy format is also available

[6]:
mudata = ddl.to_scirpy(vdj, to_mudata=False)
mudata
[6]:
AnnData object with n_obs × n_vars = 994 × 0
    obsm: 'airr'

If you have gene expression data, the parameter gex_adata supports the gene expression data in AnnData format.

Please note that this will slice to the same cell_id that are present in the same in the AIRR data.

[7]:
irdata = ddl.to_scirpy(vdj, to_mudata=False, gex_adata=adata)
irdata
/opt/homebrew/Caskroom/mambaforge/base/envs/dandelion/lib/python3.11/site-packages/anndata/_core/anndata.py:117: ImplicitModificationWarning: Transforming to str index.
[7]:
AnnData object with n_obs × n_vars = 984 × 36601
    var: 'gene_ids', 'feature_types', 'genome'
    obsm: 'airr'
[8]:
mudata = ddl.to_scirpy(vdj, to_mudata=True, gex_adata=adata)
mudata
[8]:
MuData object with n_obs × n_vars = 10563 × 36601
  2 modalities
    gex:    10553 x 36601
      var:  'gene_ids', 'feature_types', 'genome'
    airr:   994 x 0
      obsm: 'airr'

Use scirpy’s get functions to retrieve the relevant airr info (https://scirpy.scverse.org/en/latest/generated/scirpy.get.airr.html)

[9]:
ir.get.airr(irdata, "clone_id")
WARNING: No chain indices found under adata.obsm['chain_indices']. Running scirpy.pp.index_chains with default parameters.
[9]:
VJ_1_clone_id VDJ_1_clone_id VJ_2_clone_id VDJ_2_clone_id
AAACCTGTCATATCGG-1 B_VJ_98_1_2 None None None
AAACCTGTCCGTTGTC-1 B_VDJ_113_3_1_VJ_152_2_1 B_VDJ_113_3_1_VJ_152_2_1 None None
AAACCTGTCGAGAACG-1 B_VDJ_107_4_1_VJ_162_1_1 B_VDJ_107_4_1_VJ_162_1_1 None None
AAACCTGTCTTGAGAC-1 B_VDJ_157_1_1_VJ_4_1_1 B_VDJ_157_1_1_VJ_4_1_1 None None
AAACGGGAGCGACGTA-1 B_VDJ_98_1_2_VJ_63_2_3 B_VDJ_98_1_2_VJ_63_2_3 None None
... ... ... ... ...
TTTGCGCTCTAACGGT-1 B_VDJ_88_3_1_VJ_122_1_2 B_VDJ_88_3_1_VJ_122_1_2 None None
TTTGGTTGTAGCCTAT-1 B_VDJ_40_1_1_VJ_112_2_1 B_VDJ_40_1_1_VJ_112_2_1 None None
TTTGGTTTCAGAGCTT-1 B_VDJ_102_5_2_VJ_153_1_1 B_VDJ_102_5_2_VJ_153_1_1 None None
TTTGGTTTCAGTGTTG-1 B_VDJ_114_1_1_VJ_1_2_1 B_VDJ_114_1_1_VJ_1_2_1 None None
TTTGGTTTCGGTGTCG-1 B_VDJ_131_1_1_VJ_158_3_1 B_VDJ_131_1_1_VJ_158_3_1 None None

984 rows × 4 columns

[10]:
ir.get.airr(mudata, "clone_id")
WARNING: No chain indices found under adata.obsm['chain_indices']. Running scirpy.pp.index_chains with default parameters.
[10]:
VJ_1_clone_id VDJ_1_clone_id VJ_2_clone_id VDJ_2_clone_id
cell_id
AAACCTGTCATATCGG-1 B_VJ_98_1_2 None None None
AAACCTGTCCGTTGTC-1 B_VDJ_113_3_1_VJ_152_2_1 B_VDJ_113_3_1_VJ_152_2_1 None None
AAACCTGTCGAGAACG-1 B_VDJ_107_4_1_VJ_162_1_1 B_VDJ_107_4_1_VJ_162_1_1 None None
AAACCTGTCTTGAGAC-1 B_VDJ_157_1_1_VJ_4_1_1 B_VDJ_157_1_1_VJ_4_1_1 None None
AAACGGGAGCGACGTA-1 B_VDJ_98_1_2_VJ_63_2_3 B_VDJ_98_1_2_VJ_63_2_3 None None
... ... ... ... ...
TTTGCGCTCTAACGGT-1 B_VDJ_88_3_1_VJ_122_1_2 B_VDJ_88_3_1_VJ_122_1_2 None None
TTTGGTTGTAGCCTAT-1 B_VDJ_40_1_1_VJ_112_2_1 B_VDJ_40_1_1_VJ_112_2_1 None None
TTTGGTTTCAGAGCTT-1 B_VDJ_102_5_2_VJ_153_1_1 B_VDJ_102_5_2_VJ_153_1_1 None None
TTTGGTTTCAGTGTTG-1 B_VDJ_114_1_1_VJ_1_2_1 B_VDJ_114_1_1_VJ_1_2_1 None None
TTTGGTTTCGGTGTCG-1 B_VDJ_131_1_1_VJ_158_3_1 B_VDJ_131_1_1_VJ_158_3_1 None None

994 rows × 4 columns

Or you can add transfer = True, which will perform dandelion’s tl.transfer.

[11]:
irdatax = ddl.to_scirpy(vdj, transfer=True)
irdatax
[11]:
MuData object with n_obs × n_vars = 994 × 0
  1 modality
    airr:   994 x 0
      obs:  'clone_id', 'clone_id_by_size', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'duplicate_count_B_VDJ', 'duplicate_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'
      obsm: 'airr'
[12]:
irdatax = ddl.to_scirpy(vdj, transfer=True, to_mudata=False)
irdatax
[12]:
AnnData object with n_obs × n_vars = 994 × 0
    obs: 'clone_id', 'clone_id_by_size', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'duplicate_count_B_VDJ', 'duplicate_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'
    obsm: 'airr'

ddl.from_scirpy : Converting scirpy to dandelion

Converting MuData back to Dandelion

[13]:
vdjx = ddl.from_scirpy(mudata)
vdjx
[13]:
Dandelion class object with n_obs = 994 and n_contigs = 2093
    data: 'c_call', 'c_cigar', 'c_sequence_end', 'c_sequence_start', 'clone_id', 'consensus_count', 'd_call', 'd_cigar', 'd_sequence_end', 'd_sequence_start', 'duplicate_count', 'germline_alignment', 'is_cell', 'j_call', 'j_cigar', 'j_sequence_end', 'j_sequence_start', 'junction', 'junction_aa', 'junction_aa_length', 'junction_length', 'locus', 'productive', 'rearrangement_status', 'rev_comp', 'sequence', 'sequence_aa', 'sequence_alignment', 'sequence_id', 'v_call', 'v_cigar', 'v_sequence_end', 'v_sequence_start', 'cell_id'
    metadata: 'clone_id', 'clone_id_by_size', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'duplicate_count_B_VDJ', 'duplicate_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'

Converting AnnData back to Dandelion

[14]:
vdjx = ddl.from_scirpy(irdata)
vdjx
[14]:
Dandelion class object with n_obs = 984 and n_contigs = 2073
    data: 'c_call', 'c_cigar', 'c_sequence_end', 'c_sequence_start', 'clone_id', 'consensus_count', 'd_call', 'd_cigar', 'd_sequence_end', 'd_sequence_start', 'duplicate_count', 'germline_alignment', 'is_cell', 'j_call', 'j_cigar', 'j_sequence_end', 'j_sequence_start', 'junction', 'junction_aa', 'junction_aa_length', 'junction_length', 'locus', 'productive', 'rearrangement_status', 'rev_comp', 'sequence', 'sequence_aa', 'sequence_alignment', 'sequence_id', 'v_call', 'v_cigar', 'v_sequence_end', 'v_sequence_start', 'cell_id'
    metadata: 'clone_id', 'clone_id_by_size', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'duplicate_count_B_VDJ', 'duplicate_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'
[15]:
vdjx.metadata
[15]:
clone_id clone_id_by_size locus_VDJ locus_VJ productive_VDJ productive_VJ v_call_VDJ d_call_VDJ j_call_VDJ v_call_VJ ... d_call_B_VDJ_main j_call_B_VDJ_main v_call_B_VJ_main j_call_B_VJ_main isotype isotype_status locus_status chain_status rearrangement_status_VDJ rearrangement_status_VJ
AAACCTGTCATATCGG-1 B_VJ_98_1_2 6 None IGK None True None None None IGKV1-8 ... None None IGKV1-8 IGKJ4 None None Orphan IGK Orphan VJ None standard
AAACCTGTCCGTTGTC-1 B_VDJ_113_3_1_VJ_152_2_1 957 IGH IGK True True IGHV1-69D IGHD3-22 IGHJ3 IGKV1-8 ... IGHD3-22 IGHJ3 IGKV1-8 IGKJ1 IgM IgM IGH + IGK Single pair standard standard
AAACCTGTCGAGAACG-1 B_VDJ_107_4_1_VJ_162_1_1 773 IGH IGL True True IGHV1-2 nan IGHJ3 IGLV5-45 ... nan IGHJ3 IGLV5-45 IGLJ3 IgM IgM IGH + IGL Single pair standard standard
AAACCTGTCTTGAGAC-1 B_VDJ_157_1_1_VJ_4_1_1 772 IGH IGK True True IGHV5-51 nan IGHJ3 IGKV1D-8 ... nan IGHJ3 IGKV1D-8 IGKJ2 IgM IgM IGH + IGK Single pair standard standard
AAACGGGAGCGACGTA-1 B_VDJ_98_1_2_VJ_63_2_3 771 IGH IGL True True IGHV4-59 nan IGHJ3 IGLV3-19 ... nan IGHJ3 IGLV3-19 IGLJ2 IgM IgM IGH + IGL Single pair standard standard
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
TTTGCGCTCTAACGGT-1 B_VDJ_88_3_1_VJ_122_1_2 386 IGH IGL True True IGHV3-43 nan IGHJ6 IGLV2-8 ... nan IGHJ6 IGLV2-8 IGLJ3 IgM IgM IGH + IGL Single pair standard standard
TTTGGTTGTAGCCTAT-1 B_VDJ_40_1_1_VJ_112_2_1 385 IGH IGK True True IGHV4-39 nan IGHJ2 IGKV6-21 ... nan IGHJ2 IGKV6-21 IGKJ4 IgM IgM IGH + IGK Single pair standard standard
TTTGGTTTCAGAGCTT-1 B_VDJ_102_5_2_VJ_153_1_1 384 IGH IGK True True IGHV7-4-1 IGHD3-10 IGHJ4 IGKV3-11 ... IGHD3-10 IGHJ4 IGKV3-11 IGKJ5 IgM IgM IGH + IGK Single pair standard standard
TTTGGTTTCAGTGTTG-1 B_VDJ_114_1_1_VJ_1_2_1 383 IGH IGL True True IGHV2-5 nan IGHJ4 IGLV2-23 ... nan IGHJ4 IGLV2-23 IGLJ2 IgM IgM IGH + IGL Single pair standard standard
TTTGGTTTCGGTGTCG-1 B_VDJ_131_1_1_VJ_158_3_1 1149 IGH IGK True True IGHV3-21 nan IGHJ2 IGKV3-11 ... nan IGHJ2 IGKV3-11 IGKJ4 IgM IgM IGH + IGK Single pair standard standard

984 rows × 46 columns

This time, find clones with scirpy’s method.

[16]:
ir.tl.chain_qc(irdata)
ir.pp.ir_dist(irdata, metric="hamming", sequence="aa")
ir.tl.define_clonotypes(irdata)
irdata
[16]:
AnnData object with n_obs × n_vars = 984 × 36601
    obs: 'receptor_type', 'receptor_subtype', 'chain_pairing', 'clone_id', 'clone_id_size'
    var: 'gene_ids', 'feature_types', 'genome'
    uns: 'chain_indices', 'ir_dist_aa_hamming', 'ir_dist_nt_identity', 'clone_id'
    obsm: 'airr', 'chain_indices'

Visualising with scirpy’s plotting tools

You can now also plot dandelion networks using scirpy’s functions.

[17]:
ddl.tl.generate_network(vdj, key="junction")
Setting up data: 2093it [00:00, 21322.38it/s]
Calculating distances : 100%|██████████| 1154/1154 [00:00<00:00, 11734.08it/s]
Aggregating distances : 100%|██████████| 4/4 [00:00<00:00, 381.53it/s]
Sorting into clusters : 100%|██████████| 1154/1154 [00:00<00:00, 4742.88it/s]
Calculating minimum spanning tree : 100%|██████████| 8/8 [00:00<00:00, 1173.89it/s]
Generating edge list : 100%|██████████| 8/8 [00:00<00:00, 4087.52it/s]
Computing overlap : 100%|██████████| 1154/1154 [00:00<00:00, 3972.95it/s]
Adjust overlap : 100%|██████████| 82/82 [00:00<00:00, 3896.20it/s]
Linking edges : 100%|██████████| 984/984 [00:00<00:00, 93606.29it/s]
Computing network layout
Computing expanded network layout
[18]:
irdata.obs["scirpy_clone_id"] = irdata.obs["clone_id"]  # stash it
ddl.tl.transfer(
    irdata, vdj, overwrite=True
)  # overwrite scirpy's clone_id definition
[19]:
ir.tl.clonotype_network(irdata, min_cells=2)
ir.pl.clonotype_network(irdata, color="clone_id", panel_size=(7, 7))
[19]:
<Axes: >
../_images/notebooks_1c_dandelion_scirpy_32_1.png

to swap to a shorter clone_id name (ordered by size)

[20]:
ddl.tl.transfer(irdata, vdj, clone_key="clone_id_by_size")
ir.tl.clonotype_network(irdata, clonotype_key="clone_id_by_size", min_cells=2)
ir.pl.clonotype_network(irdata, color="clone_id_by_size", panel_size=(7, 7))
[20]:
<Axes: >
../_images/notebooks_1c_dandelion_scirpy_34_1.png

you can also collapse the networks to a single node and plot by size

[21]:
ddl.tl.transfer(irdata, vdj, clone_key="clone_id_by_size", collapse_nodes=True)
ir.tl.clonotype_network(irdata, clonotype_key="clone_id_by_size", min_cells=2)
ir.pl.clonotype_network(irdata, color="scirpy_clone_id", panel_size=(7, 7))
[21]:
<Axes: >
../_images/notebooks_1c_dandelion_scirpy_36_1.png
[ ]: