Filtering

We now move on to filtering out BCR contigs (and corresponding cells if necessary) from the BCR data and transcriptome object loaded in scanpy.

[1]:

import os
import dandelion as ddl
import pandas as pd
import scanpy as sc

ddl.set_backend("base")

# change directory to holding tutorial data
os.chdir("dandelion_tutorial")

Prepare the gene expression data

[2]:

samples = [
    "sc5p_v2_hs_PBMC_1k_b",
    "sc5p_v2_hs_PBMC_10k_b",
    "vdj_v1_hs_pbmc3_b",
    "vdj_nextgem_hs_pbmc3_b",
]
adata_list = []
for sample in samples:
    adata = sc.read_10x_h5(
        sample + "/filtered_feature_bc_matrix.h5", gex_only=True
    )
    adata.obs["sample_id"] = sample
    # rename cells to sample id + barcode
    adata.obs_names = [str(sample) + "_" + str(j) for j in adata.obs_names]
    adata.var_names_make_unique()
    adata_list.append(adata)
adata = sc.concat(adata_list)
# rename the obs_names again, this time cleaving the trailing -#
adata.obs_names = [str(j).split("-")[0] for j in adata.obs_names]
adata

/opt/homebrew/Caskroom/miniforge/base/envs/dandelion/lib/python3.12/site-packages/anndata/_core/anndata.py:1880: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
/opt/homebrew/Caskroom/miniforge/base/envs/dandelion/lib/python3.12/site-packages/anndata/_core/anndata.py:1880: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
/opt/homebrew/Caskroom/miniforge/base/envs/dandelion/lib/python3.12/site-packages/anndata/_core/anndata.py:1880: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
/opt/homebrew/Caskroom/miniforge/base/envs/dandelion/lib/python3.12/site-packages/anndata/_core/anndata.py:1880: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
/opt/homebrew/Caskroom/miniforge/base/envs/dandelion/lib/python3.12/site-packages/anndata/_core/anndata.py:1880: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
/opt/homebrew/Caskroom/miniforge/base/envs/dandelion/lib/python3.12/site-packages/anndata/_core/anndata.py:1880: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
/opt/homebrew/Caskroom/miniforge/base/envs/dandelion/lib/python3.12/site-packages/anndata/_core/anndata.py:1880: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
/opt/homebrew/Caskroom/miniforge/base/envs/dandelion/lib/python3.12/site-packages/anndata/_core/anndata.py:1880: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.

[2]:

AnnData object with n_obs × n_vars = 30471 × 31915
    obs: 'sample_id'

I’m using a wrapper called pp.recipe_scanpy_qc to run through a generic scanpy workflow. You can skip this if you already have a pre-processed AnnData object for the subsequent steps.

[3]:

ddl.pp.recipe_scanpy_qc(adata, mito_cutoff=None)  # use a gmm model to decide
# we can continue with those that survive qc
adata = adata[adata.obs["filter_rna"] == "False"].copy()
adata

OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
/opt/homebrew/Caskroom/miniforge/base/envs/dandelion/lib/python3.12/functools.py:912: UserWarning: zero-centering a sparse array/matrix densifies it.
/Users/uqztuong/Documents/GitHub/dandelion/src/dandelion/external/scanpy/__init__.py:142: FutureWarning: In the future, the default backend for leiden will be igraph instead of leidenalg.

 To achieve the future defaults please pass: flavor="igraph" and n_iterations=2.  directed must also be False to work with igraph's implementation.
/opt/homebrew/Caskroom/miniforge/base/envs/dandelion/lib/python3.12/site-packages/numpy/_core/fromnumeric.py:3860: RuntimeWarning: Mean of empty slice.
/opt/homebrew/Caskroom/miniforge/base/envs/dandelion/lib/python3.12/site-packages/numpy/_core/_methods.py:144: RuntimeWarning: invalid value encountered in scalar divide

[3]:

AnnData object with n_obs × n_vars = 25063 × 31915
    obs: 'sample_id', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'gmm_pct_count_clusters_keep', 'scrublet_score', 'is_doublet', 'filter_rna'

Filter cells that are potental doublets and poor quality in both the V(D)J data and transcriptome data

We use ddl.pp.check_contigs to mark and filter out cells and contigs from both the V(D)J data and transcriptome data in AnnData. The operation will remove bad quality cells based on transcriptome information as well as remove V(D)J doublets (multiplet heavy/long chains, and/or light/short chains) from the V(D)J data. In some situations, a single cell can have multiple heavy/long and light/short chain contigs although they have an identical V(D)J+C alignment; in situations like this, the contigs with lesser UMIs will be dropped and the UMIs transferred to umi_count column. The same procedure is applied to both chains before further checks of the annotation quality, UMI and consensus count distributions.

Cells in the gene expression object without V(D)J information will not be affected which means that the AnnData object can hold non-B/T cells.

[4]:

# first we read in the 4 bcr files
bcr_files = []
for sample in samples:
    file_location = sample + "/dandelion/filtered_contig_dandelion.tsv"
    bcr_files.append(pd.read_csv(file_location, sep="\t"))
bcr = pd.concat(bcr_files, ignore_index=True)
bcr.reset_index(inplace=True, drop=True)
bcr

[4]:

	sequence_id	sequence	rev_comp	productive	v_call	d_call	j_call	sequence_alignment	germline_alignment	junction	...	v_sequence_alignment_aa	d_sequence_alignment_aa	j_sequence_alignment_aa	complete_vdj	j_call_multimappers	j_call_multiplicity	j_call_sequence_start_multimappers	j_call_sequence_end_multimappers	j_call_support_multimappers	mu_count
0	sc5p_v2_hs_PBMC_1k_b_AACTCCCAGGCTAGGT_contig_1	ACTGCGGGGGTAAGAGGTTGTGTCCACCATGGCCTGGACTCCTCTC...	F	T	IGLV5-45*03	NaN	IGLJ3*02	CAGGCTGTGCTGACTCAGCCGTCTTCC...CTCTCTGCATCTCCTG...	CAGGCTGTGCTGACTCAGCCGTCTTCC...CTCTCTGCATCTCCTG...	TGTATGATTTGGCACAGCAGCGCTTGGGTGTTC	...	QAVLTQPSSLSASPGASASLTCTLRSGINVGTYRIYWYQQKPGSPP...	NaN	VFGGGTKLTVL	NaN	["IGLJ3*01"]	1	[397]	[431]	[5.29e-13]	0
1	sc5p_v2_hs_PBMC_1k_b_AACTCCCAGGCTAGGT_contig_2	ATACTTTCTGAGAGTCCTGGACCTCCTGTGCAAGAACATGAAACAT...	F	T	IGHV4-61*02	IGHD3-3*01	IGHJ6*02	CAGGTGCAGCTGCAGGAGTCGGGCCCA...GGACTGGTGAAGCCTT...	CAGGTGCAGCTGCAGGAGTCGGGCCCA...GGACTGGTGAAGCCTT...	TGTGCGAGAGAAAATTACGATTTTTGGAGTGGTTATTACCACGGTG...	...	QVQLQESGPGLVKPSQTLSLTCTVSGGSISSGSYYWSWIRQPAGKG...	YDFWSGY	YHGADVWGQGTTVTVSS	NaN	["IGHJ6*02"]	1	[416]	[470]	[3.41e-19]	3
2	sc5p_v2_hs_PBMC_1k_b_AACTCCCAGGCTAGGT_contig_3	GGCTGGGGTCTCAGGAGGCAGCGCTCTGGGGACGTCTCCACCATGG...	F	F	IGLV2-5*01	NaN	IGLJ3*02	CAGTCTGCCCTGATTCAGCCTCCCTCC...GTGTCCGGGTCTCCTG...	CAGTCTGCCCTGATTCAGCCTCCCTCC...GTGTCCGGGTCTCCTG...	TGCTGCTCATATACAAGCAGTGCCACTTTCTTGGGTGTTC	...	QSALIQPPSVSGSPGQSVTISCTGTSSDVGSYDYVSWYQQHPGTVP...	NaN	LGVRRRDQADRP	NaN	["IGLJ3*02"]	1	[396]	[433]	[2.45e-16]	0
3	sc5p_v2_hs_PBMC_1k_b_AACTCTTGTCATCGGC_contig_2	AGCTCTGAGAGAGGAGCCTTAGCCCTGGATTCCAAGGCCTATCCAC...	F	T	IGHV3-21*01	IGHD3-22*01	IGHJ4*02	GAGGTGCAGCTGGTGGAGTCTGGGGGA...GGCCTGGTCAAGCCTG...	GAGGTGCAGCTGGTGGAGTCTGGGGGA...GGCCTGGTCAAGCCTG...	TGTGCGAGACGTTACTATGATAGTAGTGGTTATTCCGCAAACTTTG...	...	EVQLVESGGGLVKPGGSLRLSCAASGFTFSSYSMNWVRQAPGKGLE...	YYDSSGY	FDYWGQGTLVTVSS	NaN	["IGHJ4*02"]	1	[462]	[506]	[2.81e-20]	0
4	sc5p_v2_hs_PBMC_1k_b_AACTCTTGTCATCGGC_contig_1	AGAGCTCTGGGGAGTCTGCACCATGGCTTGGACCCCACTCCTCTTC...	F	T	IGLV4-69*01	NaN	IGLJ1*01	CAGCTTGTGCTGACTCAATCGCCCTCT...GCCTCTGCCTCCCTGG...	CAGCTTGTGCTGACTCAATCGCCCTCT...GCCTCTGCCTCCCTGG...	TGTCAGACCTGGGGCACTGGCATTTATGTCTTC	...	QLVLTQSPSASASLGASVKLTCTLSSGHSSYAIAWHQQQPEKGPRY...	NaN	YVFGTGTKVTVL	NaN	["IGLJ1*01"]	1	[379]	[416]	[2.39e-16]	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
7380	vdj_nextgem_hs_pbmc3_b_TTTGCGCTCTGTCAAG_contig_2	ATCACATAACAACCACATTCCTCCTCTAAAGAAGCCCCCGGGAGCC...	F	T	IGHV1-6901,IGHV1-69D01	IGHD3-22*01	IGHJ4*02	CAGGTGCAGCTGGTGCAGTCTGGGGCT...GAAGTGAAGAAGCCTG...	CAGGTGCAGCTGGTGCAGTCTGGGGCT...GAGGTGAAGAAGCCTG...	TGTGCGAGGGGGAAGTATTACTATGATAAAAGTGGGTCTCCACCTC...	...	QVQLVQSGAEVKKPGSSVKVSCKVSGGIFSSYAISWVRQAPGQGLE...	YYYDKSG	FDYWGQGTLVTVSS	NaN	["IGHJ4*02"]	1	[457]	[500]	[1.18e-19]	16
7381	vdj_nextgem_hs_pbmc3_b_TTTGGTTGTAAGGATT_contig_1	AGAGCTCTGGAGAAGAGCTGCTCAGTTAGGACCCAGAGGGAACCAT...	F	T	IGKV3-20*01	NaN	IGKJ201,IGKJ202	GAAATTGTGTTGACGCAGTCTCCAGGCACCCTGTCTTTGTCTCCAG...	GAAATTGTGTTGACGCAGTCTCCAGGCACCCTGTCTTTGTCTCCAG...	TGTCAGCAGTATGATGAGTCACCTCTGACTTTT	...	EIVLTQSPGTLSLSPGERATLSCRASQSLTNSQLAWYQQKPGQAPR...	NaN	TFGQGTKLEIK	NaN	["IGKJ2*02"]	1	[396]	[429]	[3.59e-14]	11
7382	vdj_nextgem_hs_pbmc3_b_TTTGGTTGTAAGGATT_contig_2	AGCTCTGGGAGAGGAGCCCCAGCCCTGAGATTCCCAGGTGTTTCCA...	F	T	IGHV3-9*01	IGHD5-1801,IGHD5-501	IGHJ6*03	GAAGTGCAGCTGGTGGAGTCTGGGGGA...GGCTTGGTACAGCCTG...	GAAGTGCAGCTGGTGGAGTCTGGGGGA...GGCTTGGTACAGCCTG...	TGTGCAAAAGACGGATACAGCTATCGTTCGTCATACTACTTTTACA...	...	EVQLVESGGGLVQPGRSLRLSCAASGFSFDDYVMHWVRQAPGKGLE...	GYSYR	YYFYMDVWGKGTTVTVSS	NaN	["IGHJ6*03"]	1	[456]	[510]	[1.96e-22]	10
7383	vdj_nextgem_hs_pbmc3_b_TTTGTCACAGTAGAGC_contig_1	AGCTCTGAGAGAGGAGCCCAGCCCTGGGATTTTCAGGTGTTTTCAT...	F	T	IGHV3-2301,IGHV3-23D01	IGHD4-17*01	IGHJ4*02	GAGGTGCAGCTGTTGGAGTCTGGGGGA...GGCTTGGTACAGCCTG...	GAGGTGCAGCTGTTGGAGTCTGGGGGA...GGCTTGGTACAGCCTG...	TGTGCGAAAGATTTTAGGTCGCCATACGGTGACTACTACTTTGACT...	...	EVQLLESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLE...	YGD	YFDYWGQGTLVTVSS	NaN	["IGHJ4*02"]	1	[456]	[503]	[6.02e-22]	0
7384	vdj_nextgem_hs_pbmc3_b_TTTGTCACAGTAGAGC_contig_2	GTGGGTCCAGGAGGCAGAACTCTGGGTGTCTCACCATGGCCTGGAT...	F	T	IGLV3-25*03	NaN	IGLJ1*01	TCCTATGAGCTGACACAGCCACCCTCG...GTGTCAGTGTCCCCAG...	TCCTATGAGCTGACACAGCCACCCTCG...GTGTCAGTGTCCCCAG...	TGTCAATCAGCAGACAGCAGTGGTACTTATCTTTATGTCTTC	...	SYELTQPPSVSVSPGQTARITCSGDALPKQYAYWYQQKPGQAPVLV...	NaN	YVFGTGTKVTVL	NaN	["IGLJ1*01"]	1	[383]	[420]	[1.64e-16]	0

7385 rows × 120 columns

Annotating V/D/J gene functionality

A new preprocessing function now allows you to annotate the V/D/J gene functionality for each contig in the V(D)J data. The function will parse the v_call, d_call, and j_call columns and extract the gene name and functionality (functional (F), open reading frame (ORF; not considered a confirmed functional gene) and pesudogene (P)) from each call. The resulting annotations will be stored in new columns: <call>_functionality e.g. v_call_functionality. This information maye be useful before downstream analyses, such as filtering out these contigs or identifying specific gene usage patterns.

[5]:

# first, convert bcr to a Dandelion object
bcr = ddl.Dandelion(bcr)
bcr

/Users/uqztuong/Documents/GitHub/dandelion/src/dandelion/base/core/_core.py:1107: PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`

[5]:

Dandelion class object with n_obs = 3158 and n_contigs = 7385
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'rearrangement_status'
    metadata: 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'umi_count_B_VDJ', 'umi_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'

[6]:

germline_db = (
    "/Users/uqztuong/Documents/GitHub/dandelion/container/database/germlines"
)
ddl.pp.annotate_functionality(bcr, germline=germline_db, org="human", db="imgt")
bcr.data[
    [
        "v_call",
        "v_call_functionality",
        "d_call",
        "d_call_functionality",
        "j_call",
        "j_call_functionality",
    ]
]

[6]:

	v_call	v_call_functionality	d_call	d_call_functionality	j_call	j_call_functionality
sequence_id
sc5p_v2_hs_PBMC_10k_b_AAACCTGTCATATCGG_contig_1	IGKV1-3301,IGKV1D-3301	F		None	IGKJ4*01	F
sc5p_v2_hs_PBMC_10k_b_AAACCTGTCCGTTGTC_contig_2	IGHV1-6901,IGHV1-69D01	F	IGHD3-22*01	F	IGHJ3*02	F
sc5p_v2_hs_PBMC_10k_b_AAACCTGTCCGTTGTC_contig_1	IGKV1-8*01	F		None	IGKJ1*01	F
sc5p_v2_hs_PBMC_10k_b_AAACCTGTCGAGAACG_contig_1	IGLV5-45*02	F		None	IGLJ3*02	F
sc5p_v2_hs_PBMC_10k_b_AAACCTGTCGAGAACG_contig_2	IGHV1-2*02	F		None	IGHJ3*02	F
...	...	...	...	...	...	...
vdj_v1_hs_pbmc3_b_TTTCCTCTCGACAGCC_contig_1	IGHV1-46*01	F	IGHD2-15*01	F	IGHJ5*02	F
vdj_v1_hs_pbmc3_b_TTTGCGCCATACCATG_contig_2	IGHV1-6901,IGHV1-69D01	F	IGHD2-15*01	F	IGHJ6*02	F
vdj_v1_hs_pbmc3_b_TTTGCGCCATACCATG_contig_1	IGLV1-47*01	F		None	IGLJ3*02	F
vdj_v1_hs_pbmc3_b_TTTGGTTGTAGGCATG_contig_2	IGLV2-11*01	F		None	IGLJ201,IGLJ301,IGLJ3*02	F
vdj_v1_hs_pbmc3_b_TTTGGTTGTAGGCATG_contig_1	IGHV3-2301,IGHV3-23D01	F		None	IGHJ4*02	F

7385 rows × 6 columns

Library type

It is recommended to specify the library_type argument as it will remove all contigs that do not belong to the related loci. The rationale is that the choice of the library type should mean that the primers used would most likely amplify those related sequences and if there’s any unexpected loci, they likely represent artifacts and shouldn’t be analysed. The optional argument accepts: ig, tr-ab, tr-gd or None where None means all contigs will be kept.

The main output of this function are two an additional columns in vdj.data, extra and ambiguous, which flags T or F for contigs that were marked accordingly. The rules for marking contigs are as follows:

extra is marked as T if the contig passes the internal QC filters based on umi_count (or consensus_count if there are ties in the umi_count) in a cell. If you are only interested in just the top contig pair, you can set filter_extra=True to remove the extra contigs.

For VDJ chains, the current rule set is to keep the top 1 productive contig with the highest counts and mark the rest as extra (or ambiguous if appropriate). Toggle ntop_vdj to keep the top n (default 1) contigs.

For VJ chains, the current rule set is to keep the top 2 productive contigs with the highest counts and mark the rest as extra (or ambiguous if appropriate). Toggle ntop_vj to keep the top n (default 2) contigs.

ambiguous is marked as T if the contig is of poor quality annotation and would be removed from downstream analysis. Cells with multiple contigs with very low umi_counts and/or consensus_counts are also marked as ambiguous as it is not possible to distinguish which is the most representative contig.

Please note that the default for filter_extra is True. If you want to keep the extra contigs for whatever reasons e.g. interested in T/B-cell development datasets, you need to set filter_extra=False. We are setting this as False in this example because later on we want to visualise these extra contigs.

[7]:

vdj, adata = ddl.pp.check_contigs(
    bcr, adata, library_type="ig", filter_extra=False
)

Preparing data: 6513it [00:00, 7932.03it/s]
Scanning for poor quality/ambiguous contigs: 100%|██████████| 3158/3158 [00:08<00:00, 384.55it/s]

If you want to filter out contigs that are marked as pseudogenes or ORFs, you can use the filter_pseudo argument in check_contigs. This argument takes a dictionary with keys “v”, “d”, and “j” corresponding to the gene segments, and values as lists of functionality strings to filter out. For example, to only keep functional genes, you would set filter_pseudo={"v": ["P", "ORF"], "d": ["P", "ORF"], "j": ["P", "ORF"]}. Note that this requires that the functionality columns (v_call_functionality, d_call_functionality, j_call_functionality) are present in the AIRR table, which can be populated by running ddl.pp.annotate_functionality first.

Check the Dandelion object

[8]:

vdj

[8]:

Dandelion class object with n_obs = 2496 and n_contigs = 7385
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'rearrangement_status', 'v_call_functionality', 'd_call_functionality', 'j_call_functionality', 'ambiguous', 'extra'
    metadata: 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'umi_count_B_VDJ', 'umi_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'

Check the AnnData object as well

[9]:

adata

[9]:

AnnData object with n_obs × n_vars = 25063 × 31915
    obs: 'sample_id', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'gmm_pct_count_clusters_keep', 'scrublet_score', 'is_doublet', 'filter_rna', 'has_contig', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'umi_count_B_VDJ', 'umi_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'

These are the relevant columns for looking at the QC status of the cells and contigs in the .obs slot in the AnnData object (and also .metadata slot in the Dandelion object):

Relevant columns in obs

has_contig
whether cells have V(D)J chains.
locus_status
detailed information on chain status pairings (below).
chain_status
summarised information of the chain locus status pairings (similar to chain_pairing in scirpy).
rearrangement_status_VDJ and rearrangement_status_VJ
whether or not V(D)J gene usage are standard (i.e. all from the same locus).

So in a standard situation, I would remove cells flagged with Orphan VJ, Orphan VJ-exception, ambiguous in .metadata.chain_status, and also any cell marked as chimeric in the .metadata.rearrangement_status_VDJ and .metadata.rearrangement_status_VJ from downstream cell-level calculations/analysis.

Having said that, you will find that most of Dandelion’s functions will work without the need to requirement to perform additional filtering and filtering can be performed on the final AnnData object (described in the visualisation section).

Let’s take a look at these new columns

[10]:

pd.crosstab(adata.obs["chain_status"], adata.obs["locus_status"])

[10]:

locus_status	Extra VDJ + Extra VJ	Extra VDJ + IGK	Extra VDJ + IGL	IGH + Extra VJ	IGH + IGK	IGH + IGL	Orphan Extra VJ	Orphan IGH	Orphan IGK	Orphan IGL
chain_status
Extra pair	60	4	4	94	0	0	0	0	0	0
Orphan Extra VJ	0	0	0	0	0	0	10	0	0	0
Orphan VDJ	0	0	0	0	0	0	0	10	0	0
Orphan VJ	0	0	0	0	0	0	0	0	98	49
Single pair	0	0	0	0	1260	907	0	0	0	0

if there are multiple library types, i.e. ddl.pp.check_contigs was run with library_type = None, or if several tcr/bcr Dandelion objects are concatenated, there will be additional columns where the v/d/j/c calls and productive will be split into additional columns to reflect those that belong to a B cell, alpha-beta T cell, or gamma-delta T cell.

Now actually filter the AnnData object and run through a standard workflow starting by filtering genes and normalizing the data

Because the ‘filtered’ AnnData object was returned as a filtered but otherwise unprocessed object, we still need to normalize and run through the usual process here. The following is just a standard scanpy workflow.

[11]:

# standard preprocessing steps
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.raw = adata
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
adata = adata[:, adata.var.highly_variable].copy()
sc.pp.regress_out(adata, ["total_counts", "pct_counts_mt"])
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, svd_solver="arpack")
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.tl.leiden(adata)

Visualizing the clusters and whether or not there’s a corresponding V(D)J receptor

[12]:

sc.pl.umap(adata, color=["sample_id", "leiden", "chain_status"])

../../_images/notebooks_base_2_dandelion_filtering-10x_data_26_0.png

Visualizing some B cell genes

[13]:

sc.pl.umap(adata, color=["IGHM", "JCHAIN"])

../../_images/notebooks_base_2_dandelion_filtering-10x_data_28_0.png

[14]:

# We can save this AnnData object for now.
adata.write("adata.h5ad", compression="gzip")

Save dandelion object

To save the vdj object, we have two options - either save the .data and .metadata slots with pandas’ functions:

[15]:

vdj.data.to_csv("filtered_vdj_table.tsv", sep="\t")

Or save the whole Dandelion class object with either .write_h5ddl/.write, which uses h5py to save the class to a HDF5 format.

[16]:

vdj.write_h5ddl("dandelion_results.h5ddl")  # can add compression="gzip"

Running ddl.pp.check_contigs without AnnData

Finally, ddl.pp.check_contigs can also be run without an AnnData object:

[17]:

vdj3 = ddl.pp.check_contigs(bcr)
vdj3

Preparing data: 6513it [00:00, 8187.42it/s]
Scanning for poor quality/ambiguous contigs: 100%|██████████| 3158/3158 [00:08<00:00, 370.81it/s]

[17]:

Dandelion class object with n_obs = 3157 and n_contigs = 6382
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'rearrangement_status', 'v_call_functionality', 'd_call_functionality', 'j_call_functionality', 'ambiguous', 'extra'
    metadata: 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'umi_count_B_VDJ', 'umi_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'

[ ]: