V(D)J analysis

[1]:
import dandelion as ddl
import scanpy as sc
import warnings
import os

warnings.filterwarnings("ignore")
sc.settings.set_figure_params(dpi=80)

Let’s run through some of what Dandelion can do in terms of analysis. In order to kickstart this tutorial, we prepared GEX and VDJ objects with four demo 10X samples parsed for your convenience. The previous notebook shows how this was done, with the VDJ loading into Dandelion likely of most interest due to the syntax required.

[2]:
if not os.path.exists("demo-gex.h5ad"):
    os.system("wget ftp://ftp.sanger.ac.uk/pub/users/kp9/demo-gex.h5ad")

if not os.path.exists("demo-vdj.h5ddl"):
    os.system("wget ftp://ftp.sanger.ac.uk/pub/users/kp9/demo-vdj.h5ddl")

Let’s import the objects. Dandelion’s h5ddl files can be read via ddl.read_h5ddl().

[3]:
adata = sc.read("demo-gex.h5ad")
vdj = ddl.read_h5ddl("demo-vdj.h5ddl")

At this point you’re probably wondering why there’s a separate Dandelion object. The reason is AIRR compliance. Some of the AIRR columns have more complex typing than what Scanpy can currently support within its objects. However, it’s quite straightforward to link up a Scanpy object with a Dandelion one.

[4]:
vdj, adata = ddl.pp.check_contigs(vdj, adata)

Preparing data: 6490it [00:02, 2896.00it/s]
Scanning for poor quality/ambiguous contigs: 100%|██████████| 3158/3158 [00:17<00:00, 175.60it/s]

This filters the contigs and synchronises relevant information between the objects. Once linked up like this, any new information can be copied over from the Dandelion object via ddl.tl.transfer(). There will be an example later in the notebook.

For now, let’s take a look at the chain status (as gotten from the Dandelion object) and known BCR marker expression.

[5]:
sc.pl.umap(adata, color=["IGHM", "JCHAIN", "chain_status"])
../_images/notebooks_Q3-analysis_9_0.png

Under the hood, the Dandelion object is essentially two data frames. .data holds the AIRR-compliant contig space table, while .metadata is an .obs equivalent that parses the contig information to a cell level and can be easily integrated with a Scanpy object. There are also ddl.to_scirpy() and ddl.from_scirpy() for interoperability with Scirpy, as explored in a notebook in the advanced guide. Scirpy also offers its own conversion functions.

The thing you’re most likely to find yourself doing manually with the Dandelion object is modifying cell names to match your GEX naming convention. The cell names can be found in .data.cell_id, change those however you see fit and then call .update_metadata() to regenerate the per-cell .obs equivalent.

vdj.data.cell_id = [result of modification procedure on existing vdj.data.cell_id]
vdj.update_metadata()
[6]:
vdj

[6]:
Dandelion class object with n_obs = 3092 and n_contigs = 7340
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'c_call', 'consensus_count', 'duplicate_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'cell_id_blastn', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'mu_count', 'mu_freq', 'rearrangement_status', 'ambiguous'
    metadata: 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'duplicate_count_B_VDJ', 'duplicate_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'

Now that we’ve got the gist of basic handling of the Dandelion object, let’s use it for some analysis!

A core element of VDJ analysis is clonotype calling, roughly equivalent to clustering cells in GEX processing. Dandelion requires the clones it calls to have identical V and J genes, along with no more than 15% mismatches in the CDR3 sequences (common practice in BCR analysis).

For TCR clonotype calling, you can perform common practice nucleotide sequence identity by passing identity=1 and key="junction" to the function.

[7]:
ddl.tl.find_clones(vdj)

Finding clones based on B cell VDJ chains : 100%|██████████| 256/256 [00:00<00:00, 2348.45it/s]
Refining clone assignment based on VJ chain pairing : 100%|██████████| 2817/2817 [00:00<00:00, 363806.83it/s]

We can compute a graph based on Levenshtein distance of the complete contig sequence. A NetworkX representation of it is now saved in vdj.graph.

[8]:
ddl.tl.generate_network(vdj)

Setting up data: 6229it [00:02, 2934.43it/s]
Calculating distances... : 100%|██████████| 3021/3021 [00:01<00:00, 3002.97it/s]
Generating edge list : 100%|██████████| 75/75 [00:00<00:00, 909.32it/s]
Computing overlap : 100%|██████████| 3021/3021 [00:03<00:00, 833.44it/s]
Linking edges : 100%|██████████| 2722/2722 [00:00<00:00, 19081.08it/s]

Since we now know what our clonotype calls are, we can quantify clonal expansion. It’s possible to cap this at a desired maximum clonotype size.

[9]:
ddl.tl.clone_size(vdj)
# this makes an independent column with the provided max_size in its name
ddl.tl.clone_size(vdj, max_size=3)

Now that our Dandelion object has analysis information inside it, we can copy it over to the Scanpy object to have access to it there. The graph gets turned into the Scanpy standard forms of .obsp['vdj_distances'] and .obsp['vdj_connectivites'] for potential downstream use.

[10]:
ddl.tl.transfer(adata, vdj)

Let’s take a look at what we made!

[11]:
ddl.pl.clone_network(adata, color="clone_id_size")
sc.pl.umap(adata, color="clone_id_size")
../_images/notebooks_Q3-analysis_21_0.png
../_images/notebooks_Q3-analysis_21_1.png

Wait, why are we seeing some clone size 0 in the plots? Orphan chains.

[12]:
ddl.pl.clone_network(adata, color="clone_id_size_max_3")
sc.pl.umap(adata, color="clone_id_size_max_3")
../_images/notebooks_Q3-analysis_23_0.png
../_images/notebooks_Q3-analysis_23_1.png

Dandelion comes with a number of plotting functions for your convenience. However, those functions tend to operate best without the Scanpy plotting defaults in place. You can reset Matplotlib’s configuration prior to using them.

[13]:
import matplotlib as mpl
mpl.rcParams.update(mpl.rcParamsDefault)
%matplotlib inline

We’ve got bar plots.

[14]:
ddl.pl.barplot(
    vdj[vdj.metadata.isotype_status != "Multi"],  # remove multi from the plots
    color="v_call_genotyped_VDJ",
    xtick_fontsize=5,
)
[14]:
(<Figure size 800x300 with 1 Axes>,
 <AxesSubplot:title={'center':'v call genotyped VDJ usage'}, ylabel='proportion'>)
../_images/notebooks_Q3-analysis_27_1.png

All of the plotting functions have a number of parameters that can be fiddled with for desired visualisation outcomes. For example, let’s disable automatic descending sorting, show counts rather than proportions, and change the palette.

[15]:
ddl.pl.barplot(
    vdj[vdj.metadata.isotype_status != "Multi"],
    color="v_call_genotyped_VDJ",
    normalize=False,
    sort_descending=None,
    palette="tab20",
    xtick_fontsize=5,
)
[15]:
(<Figure size 800x300 with 1 Axes>,
 <AxesSubplot:title={'center':'v call genotyped VDJ usage'}, ylabel='count'>)
../_images/notebooks_Q3-analysis_29_1.png

We’ve got stacked bar plots.

[16]:
ddl.pl.stackedbarplot(
    vdj[vdj.metadata.isotype_status != "Multi"],
    color="isotype_status",
    groupby="locus_status",
    xtick_rotation=0,
    figsize=(4, 3),
)
[16]:
(<Figure size 400x300 with 1 Axes>,
 <AxesSubplot:title={'center':'multiple stacked bar plot : isotype status usage'}, ylabel='count'>)
../_images/notebooks_Q3-analysis_31_1.png

These can be normalised to add up to 1 for each column.

[17]:
ddl.pl.stackedbarplot(
    vdj[vdj.metadata.isotype_status != "Multi"],
    color="v_call_genotyped_VDJ",
    groupby="isotype_status",
    normalize=True,
    xtick_fontsize=5,
)
[17]:
(<Figure size 800x300 with 1 Axes>,
 <AxesSubplot:title={'center':'multiple stacked bar plot : v call genotyped VDJ usage'}, ylabel='proportion'>)
../_images/notebooks_Q3-analysis_33_1.png

We’ve also got a spectratype plot, which shows the distribution of the CDR3 length for the various contigs.

[18]:
ddl.pl.spectratype(
    vdj[vdj.metadata.isotype_status != "Multi"],
    color="junction_length",
    groupby="c_call",
    locus="IGH",
    width=2.3,
)
[18]:
(<Figure size 500x300 with 1 Axes>,
 <AxesSubplot:xlabel='junction_length', ylabel='count'>)
../_images/notebooks_Q3-analysis_35_1.png

Another common VDJ analysis request is to examine the distribution of shared clonotypes between cells of different metadata groups. Dandelion can do this as a circos plot.

[19]:
ddl.tl.clone_overlap(
    adata, groupby="leiden", colorby="leiden", weighted_overlap=True
)
ddl.pl.clone_overlap(
    adata, groupby="leiden", colorby="leiden", weighted_overlap=True
)
../_images/notebooks_Q3-analysis_37_0.png

There’s also a heatmap on offer.

[20]:
ddl.pl.clone_overlap(
    adata,
    groupby="leiden",
    colorby="leiden",
    weighted_overlap=True,
    as_heatmap=True,
    # seaborn clustermap kwargs
    cmap="Blues",
    annot=True,
    figsize=(8, 8),
    annot_kws={"size": 10},
)
../_images/notebooks_Q3-analysis_39_0.png

Save the objects, like so.

[21]:
adata.write("demo-gex-processed.h5ad")
vdj.write("demo-vdj-processed.h5ddl")