Running from R

Foreword

dandelion is written in python==3.7.6 and it is primarily a single-cell BCR-seq analysis package. It makes use of some tools from the fantastic immcantation suite and the main idea is that it implements a workflow for the pre-processing and exploratory stages with integrated use of tools from immcantation for the BCR side of things and analysis tools from scanpy for the RNA-seq side of things. I hope to be able to introduce some new single-cell BCR-seq exploratory tools down the road through dandelion.

dandelion can be run in R through reticulate. This section will try to replicate the examples using R.

There are some issues with the conversion of dataframes between python and R so I would not recommend saving the final AnnData object as a final out file, but only use this to help generate the intermediate files from the BCR processing and the plots. Please store your original analysis some where safe. I would also skip quantify mutation step due to conflicts between rpy2 and reticulate.

For more details, please refer to the original python tutorial.

Let’s start!

First, install reticulate via if you don’t already have it:

install.packages("reticulate")

[1]:

library(reticulate)
use_condaenv("dandelion")
# or use Sys.setenv(RETICULATE_PYTHON = conda_python(envname='dandelion'))

You can check if the python config is set up properly with py_config()

[2]:

py_config()

Warning message in system2("poetry", c("env", "info", "--path"), stdout = TRUE):
“running command ''poetry' env info --path' had status 1”

python:         /opt/homebrew/Caskroom/mambaforge/base/envs/dandelion/bin/python
libpython:      /opt/homebrew/Caskroom/mambaforge/base/envs/dandelion/lib/libpython3.11.dylib
pythonhome:     /opt/homebrew/Caskroom/mambaforge/base/envs/dandelion:/opt/homebrew/Caskroom/mambaforge/base/envs/dandelion
version:        3.11.4 | packaged by conda-forge | (main, Jun 10 2023, 18:08:41) [Clang 15.0.7 ]
numpy:          /opt/homebrew/Caskroom/mambaforge/base/envs/dandelion/lib/python3.11/site-packages/numpy
numpy_version:  1.24.4

NOTE: Python version was forced by use_python function

To proceed with the analyses, we first change the working directory and also import the dandelion module.

[3]:

# change directory to somewhere more workable
setwd("~/Downloads/dandelion_tutorial/")
ddl <- import("dandelion")

As per reticulate convention, python . operators are to be swaped with $ in R.

Pre-processing

To being the BCR preprocessing, the minimum files you require are the following:

filtered_contig.fasta
filtered_contig_annotations.csv

Step 1: Formatting the headers of the Cell Ranger fasta file

This will do the following: 1) add the prefix provided to every sequence header

add the prefix provided to every contig_id in the annotation.csv file
create a folder called dandelion/data (if left as default) and saves a copy of these files in that directory.

[4]:

# the first option is a list of fasta files to format and the second option is the list of prefix to add to each file.
samples <- c("sc5p_v2_hs_PBMC_1k", "sc5p_v2_hs_PBMC_10k", "vdj_v1_hs_pbmc3", "vdj_nextgem_hs_pbmc3")
ddl$pp$format_fastas(samples, prefix = samples)

Step 2: Reannotate the V/D/J genes with igblastn.

ddl$pp$reannotate_genes uses changeo’s scripts to call igblastn to reannotate the fasta files. Depending on the fileformat option, it will parse out as either an airr (default) or changeo-legacy TSV file. Importantly, with the recent update to changeo v1.0.0, all the column headers, including changeo format, are now adhereing to the AIRR standard (lowercase and some column name changes). Specifying extended = True will return the additional 10x annotation of V/D/J genes but they are unnecessary at this stage.

[5]:

ddl$pp$reannotate_genes(samples)

you should see something like this in the terminal:

Assigning genes :   0%|                                        | 0/4 [00:00<?, ?it/s]
START> AssignGenes
 COMMAND> igblast
 VERSION> 1.15.0
    FILE> sc5p_v2_hs_PBMC_10k_b_filtered_contig.fasta
ORGANISM> human
    LOCI> ig
   NPROC> 4

PROGRESS> 12:30:37 |Running IgBLAST          | 0.0 min

Step 3 : Reassigning heavy chain V gene alleles (optional but recommended)

Next, we use immcantation’s TIgGER method to reassign allelic calls for heavy chain V genes with pp.reassign_alleles. As stated in TIgGER’s website and manuscript, ‘TIgGER is a computational method that significantly improves V(D)J allele assignments by first determining the complete set of gene segments carried by an individual (including novel alleles) from V(D)J-rearrange sequences. TIgGER can then infer a subject’s genotype from these sequences, and use this genotype to correct the initial V(D)J allele assignments.’

This impact’s on how contigs are chosen for finding clones later so it is highly recommended to run it. It is also important when considering to do mutational analysis.

However, the main caveat is that this needs to be run on multiple samples from the same subject to allow for more information to be used to confidently assign a genotyped v_call. In this tutorial, I’m assuming the four samples can be split into two sets where sets of two corresponds to a different/single individual. So while important, this step can be skipped if you don’t have the samples to do this.

pp.reassign_alleles requires the combined_folder option to be specified so that a merged/concatenated file can be produced for running TIgGER. The function also runs pp.create_germlines using the germline database updated with the germline corrections from TIgGER. The default behavior is that it will return a germline_alignment_d_mask column in the final output. This can be changed by specifying germ_types option; see here for other options.

Specifying fileformat = 'changeo' will run on changeo formatted files if this was run earlier; but it’s probably best to stick to airr’s standard format.

[6]:

# reassigning alleles on the first set of samples
ddl$pp$reassign_alleles(samples[1:2], combined_folder = "tutorial_scgp1")

[7]:

# reassigning alleles on the first set of samples
ddl$pp$reassign_alleles(samples[3:4], combined_folder = "tutorial_scgp2")

We can see that most of the original ambiguous V calls have now been corrected and only a few remain. These will be flagged as multi later on and can probably be excluded from detailed analyses. For now, leaving them in the data will not impact on subsequent analyses.

Step 4: Assigning constant region calls

Cell Ranger’s annotation files provides a c_gene column, but rather than simply relying on Cell Ranger’s annotation, it is common to use immcantation-presto’s MaskPrimers.py with a custom primer list.

As an alternative, dandelion includes a pre-processing function, pp.assign_isotypes, to use blastn to annotate constant region calls for all contigs and retrieves the call, merging it with the tsv files. This function will overwrite the output from previous steps and add a c_call column at the end, or replace the existing column if it already exists. The Cell Ranger calls are returned as c_call_10x.

Further, to deal with incorrect constant gene calls due to insufficient length, a pairwise alignment will be run against curated sequences that were deemed to be highly specific in distinguishing IGHA1 vs IGHA2, and IGHG1 to IGHG4. I have also curated sets of sequences that should help deal with IGLC3/6/7 as these are problematic too. If there is insufficient info, the c_call will be returned as a combination of the most aligned sets of sequences. Because of how similar the lambda light chains are, extremely ambiguous calls (only able to map to a common sequence across the light chains) will be returned as IGLC. This typically occurs when the constant sequence is very short. Those that have equal alignment scores between IGLC3/6/7 sequences and the common sequence will be returned as a concatenated call; for example, a contig initially annotated as IGLC3 will be returned as IGLC,IGLC3.

[8]:

ddl$pp$assign_isotypes(samples)

Note

Should you want to use a different reference fasta file for this step, run with the following option:

ddl$pp$assign_isotypes(samples, blastdb = "path/to/custom_BCR_constant.fasta")

The default option will return a summary plot that can be disabled with plot = FALSE.

It’s worthwhile to manually check the the sequences for constant calls returned as IGHA1-2, IGHG1-4 and the light chains and manually correct them if necessary.

Step 5: Quantify mutations (optional).

In the python tutorial, at this stage, I quantified the basic mutational load with ddl.pp.quantify_mutations before subsequent analyses. This would not run properly within this R session due to rpy2/reticulate conflict. Instead, i’d recommend you to run this separately on the. *_contig_dandelion.tsv file that was generated in the previous step, by following Shazam’s tutorial on basic mutational analysis, before continuing.

Filtering

Create a Seurat object from the transcriptome data

Let’s first import the gene expression data. Let’s try from Seurat object.

[5]:

setwd("~/Downloads/dandelion_tutorial/")
library(Seurat)
samples <- c("sc5p_v2_hs_PBMC_1k", "sc5p_v2_hs_PBMC_10k", "vdj_v1_hs_pbmc3", "vdj_nextgem_hs_pbmc3")
seurat_objects <- list()
for (i in 1:length(samples)) {
    filename <- paste0(samples[i], "/filtered_feature_bc_matrix.h5")
    data <- Read10X_h5(filename)
    seurat_objects[[i]] <- CreateSeuratObject(counts = data$`Gene Expression`)
}

Genome matrix has multiple modalities, returning a list of matrices for this genome

Genome matrix has multiple modalities, returning a list of matrices for this genome

Genome matrix has multiple modalities, returning a list of matrices for this genome

Genome matrix has multiple modalities, returning a list of matrices for this genome

[6]:

# merge them, there's probably better ways
merged1 <- merge(seurat_objects[[1]], seurat_objects[[2]], add.cell.ids = samples[1:2])
merged2 <- merge(seurat_objects[[3]], seurat_objects[[4]], add.cell.ids = samples[3:4])
merged <- merge(merged1, merged2)

[7]:

head(merged@meta.data)

A data.frame: 6 × 3
	orig.ident	nCount_RNA	nFeature_RNA
	<chr>	<dbl>	<int>
sc5p_v2_hs_PBMC_1k_AAACCTGCAGCCTGTG-1	SeuratProject	5029	1819
sc5p_v2_hs_PBMC_1k_AAACGGGTCGCTGATA-1	SeuratProject	4343	1738
sc5p_v2_hs_PBMC_1k_AAAGATGTCCTCGCAT-1	SeuratProject	898	554
sc5p_v2_hs_PBMC_1k_AAAGCAAAGTATGACA-1	SeuratProject	4999	1796
sc5p_v2_hs_PBMC_1k_AAAGCAATCCATTCTA-1	SeuratProject	601	416
sc5p_v2_hs_PBMC_1k_AAATGCCCACTTAAGC-1	SeuratProject	4998	1530

dandelion removes hyphen from the cell barcodes as a default behaviour, so we will do the same for the Seurat object.

[8]:

# remove the -# from the end of each cell name
merged <- RenameCells(merged, new.names = gsub("-.*", "", row.names(merged@meta.data)))
merged

An object of class Seurat
38224 features across 30471 samples within 1 assay
Active assay: RNA (38224 features, 0 variable features)

Standard Seurat pre-processing workflow

[9]:

merged[["percent.mt"]] <- PercentageFeatureSet(merged, pattern = "^MT-")
merged <- subset(merged, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
merged <- NormalizeData(merged)
merged <- FindVariableFeatures(merged, selection.method = "vst", nfeatures = 2000)
merged <- ScaleData(merged)
merged <- RunPCA(merged, features = VariableFeatures(object = merged))
merged <- FindNeighbors(merged, dims = 1:50)
merged <- FindClusters(merged)
merged <- RunUMAP(merged, dims = 1:50)

Centering and scaling data matrix

PC_ 1
Positive:  LYZ, S100A9, FCN1, S100A8, CST3, VCAN, MNDA, IFI30, SPI1, SERPINA1
           S100A12, CSTA, TYMP, CTSS, CD14, LST1, CYBB, FTL, MS4A6A, CSF3R
           TYROBP, CD68, TNFAIP2, CFD, NCF2, CD36, CEBPD, AIF1, GRN, LILRB2
Negative:  IFITM1, RPS27, IL32, LTB, RPS12, RPS18, IL7R, TCF7, CD7, TRBC2
           NOSIP, RPS29, LEF1, LINC00861, ETS1, CD247, CCR7, LIME1, CD2, MAL
           SELL, GZMM, TRBC1, AQP3, OXNAD1, AES, CD69, ARL4C, PIM2, TRAC
PC_ 2
Positive:  MS4A1, CD79A, BANK1, LINC00926, IGHM, HLA-DQA1, FCER2, HLA-DQB1, TCL1A, CD79B
           FCRLA, CD19, VPREB3, RALGPS2, CD22, HLA-DRA, FCRL1, HLA-DOB, AFF3, NIBAN3
           IGHD, POU2AF1, BLNK, BLK, HLA-DPB1, SPIB, IGKC, ADAM28, CD24, HVCN1
Negative:  NKG7, HCST, CST7, PRF1, GZMA, GNLY, CTSW, KLRD1, IFITM1, FGFBP2
           FCGR3A, KLRF1, SPON2, S100A4, HOPX, GZMB, SRGN, IFITM2, CCL5, CLIC3
           CD7, MATK, ITGB2, TBX21, IL2RB, CD247, KLRB1, GZMM, SH2D1B, S1PR5
PC_ 3
Positive:  NKG7, GNLY, GZMB, CST7, PRF1, KLRD1, GZMA, FCGR3A, FGFBP2, KLRF1
           SPON2, CLIC3, HOPX, SH2D1B, TBX21, CCL4, ADGRG1, S1PR5, MATK, PTGDR
           MYOM2, IL2RB, TTC38, CCL5, PRSS23, GZMH, CTSW, XCL2, CXXC5, PLEK
Negative:  TCF7, LEF1, TPT1, IL7R, VIM, RPS12, CCR7, RPL36A, MAL, NOSIP
           RGS10, SNHG29, PRKCQ-AS1, TRABD2A, RPL17, EEF1G, GAS5, OXNAD1, GIMAP5, RPS4Y1
           RGCC, BCL11B, ACTN1, TRAT1, RPS29, LTB, AL138963.4, GIMAP1, PCED1B-AS1, NELL2
PC_ 4
Positive:  CAVIN2, PPBP, TUBB1, PF4, GNG11, SPARC, GP9, CLU, MPIG6B, TREML1
           PRKAR2B, NRGN, ITGA2B, CMTM5, PTGS1, MMD, MYL9, AC147651.1, TRIM58, ACRBP
           MAP3K7CL, AL731557.1, DMTN, C2orf88, RGS18, CD9, TSC22D1, MTURN, THBS1, TMEM40
Negative:  AES, RPS12, SEPT9, RARRES3, SEPT6, JUNB, SEPT7, SEPT1, RPS18, AL138963.3
           DUSP1, KIAA1551, JUN, FOS, VIM, CD74, TPT1, S100A4, S100A10, S100A6
           HLA-DQA2, CD69, HLA-DRB5, NFKBIA, CD37, DUSP2, RPS27, RETN, HLA-DRA, ATP6V0B
PC_ 5
Positive:  RPS4Y1, TLE5, EEF1G, SNHG29, GAS5, SEPTIN9, SNHG6, PCED1B-AS1, RPL17, IL2RG
           PLAAT4, SEPTIN7, SEPTIN1, LINC01578, JUND, AL138963.4, TOMM6, SNHG5, CD81, SEPTIN6
           KLF2, RESF1, GIMAP5, ATP5PO, MICOS10, RPS17, RBIS, RNASEK, PRKCQ-AS1, RPL36A
Negative:  AES, SEPT9, SEPT6, SEPT1, SEPT7, RARRES3, AL138963.3, KIAA1551, GNG11, CLU
           CAVIN2, TUBB1, SPARC, PF4, PPBP, GP9, CDKN1A, JUNB, MPIG6B, TREML1
           ITGA2B, LMNA, PTGS1, CMTM5, MAP3K7CL, CD9, PRKAR2B, DUSP1, ACRBP, TRIM58

Computing nearest neighbor graph

Computing SNN

Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 18634
Number of edges: 809571

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9226
Number of communities: 24
Elapsed time: 2 seconds

Warning message:
“The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session”
00:02:23 UMAP embedding parameters a = 0.9922 b = 1.112

00:02:23 Read 18634 rows and found 50 numeric columns

00:02:23 Using Annoy for neighbor search, n_neighbors = 30

00:02:23 Building Annoy index with metric = cosine, n_trees = 50

0%   10   20   30   40   50   60   70   80   90   100%

[----|----|----|----|----|----|----|----|----|----|

*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
|

00:02:25 Writing NN index file to temp file /var/folders/_r/j_8_fj3x28n2th3ch0ckn9c40000gt/T//RtmpfKS4Qx/filea8756193a1af

00:02:25 Searching Annoy index using 1 thread, search_k = 3000

00:02:28 Annoy recall = 100%

00:02:29 Commencing smooth kNN distance calibration using 1 thread
 with target n_neighbors = 30

00:02:29 Initializing from normalized Laplacian + noise (using irlba)

00:02:31 Commencing optimization for 200 epochs, with 832570 positive edges

00:02:36 Optimization finished

[10]:

options(repr.plot.width = 6, repr.plot.height = 4)
DimPlot(merged, reduction = "umap")

../_images/notebooks_6_dandelion_running_from_R-10x_data_31_0.png

[11]:

merged

An object of class Seurat
38224 features across 18634 samples within 1 assay
Active assay: RNA (38224 features, 2000 variable features)
 2 dimensional reductions calculated: pca, umap

In order for dandelion to do the next step (marking/filtering of poor quality BCRs and BCR doublets), we need to include a column in the metadata, filter_rna. This column will tell dandelion whether or not the cell has passed transcriptomic QCs. So, we will set the column to FALSE i.e. every cell passed QC. This is important because we want to remove as many doublets and poor quality contigs to save time on computation and construction of the final data tables.

Convert to AnnData

We will need to convert the Seurat object to AnnData to be able to continue. AnnData .obs slot is essentially the same as @meta.data in seurat.

[12]:

library(reticulate)
sc <- import("scanpy")
# convert the meta.data slot to a python friendly object
obs <- r_to_py(merged@meta.data)
normcounts <- r_to_py(Matrix::t(GetAssayData(merged)))

[13]:

adata <- sc$AnnData(X = normcounts, obs = obs)
adata

AnnData object with n_obs × n_vars = 18634 × 38224
    obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'percent.mt', 'RNA_snn_res.0.8', 'seurat_clusters'

Read in the BCR files and merge them

[14]:

files <- list()
for (i in 1:length(samples)) {
    filename <- paste0(samples[i], "/dandelion/filtered_contig_dandelion.tsv")
    files[[i]] <- readr::read_tsv(filename)
}
combined_bcr <- do.call(rbind, files)
head(combined_bcr)

Rows: 183 Columns: 120
── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (54): sequence_id, sequence, v_call, d_call, j_call, sequence_alignment,...
dbl (60): junction_length, np1_length, np2_length, v_sequence_start, v_seque...
lgl  (6): rev_comp, productive, stop_codon, vj_in_frame, j_source, complete_vdj

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 2392 Columns: 120
── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (54): sequence_id, sequence, v_call, d_call, j_call, sequence_alignment,...
dbl (60): junction_length, np1_length, np2_length, v_sequence_start, v_seque...
lgl  (6): rev_comp, productive, stop_codon, vj_in_frame, j_source, complete_vdj

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1827 Columns: 120
── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (57): sequence_id, sequence, v_call, d_call, j_call, sequence_alignment,...
dbl (57): junction_length, np1_length, np2_length, v_sequence_start, v_seque...
lgl  (6): rev_comp, productive, stop_codon, vj_in_frame, j_source, complete_vdj

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 2955 Columns: 120
── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (57): sequence_id, sequence, v_call, d_call, j_call, sequence_alignment,...
dbl (57): junction_length, np1_length, np2_length, v_sequence_start, v_seque...
lgl  (6): rev_comp, productive, stop_codon, vj_in_frame, j_source, complete_vdj

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

A tibble: 6 × 120
sequence_id	sequence	rev_comp	productive	v_call	d_call	j_call	sequence_alignment	germline_alignment	junction	⋯	v_sequence_alignment_aa	d_sequence_alignment_aa	j_sequence_alignment_aa	complete_vdj	j_call_multimappers	j_call_multiplicity	j_call_sequence_start_multimappers	j_call_sequence_end_multimappers	j_call_support_multimappers	mu_count
<chr>	<chr>	<lgl>	<lgl>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	⋯	<chr>	<chr>	<chr>	<lgl>	<chr>	<dbl>	<chr>	<chr>	<chr>	<dbl>
sc5p_v2_hs_PBMC_1k_AACTCCCAGGCTAGGT_contig_1	ACTGCGGGGGTAAGAGGTTGTGTCCACCATGGCCTGGACTCCTCTCCTCCTCCTGTTCCTCTCTCACTGCACAGGTTCCCTCTCGCAGGCTGTGCTGACTCAGCCGTCTTCCCTCTCTGCATCTCCTGGAGCATCAGCCAGTCTCACCTGCACCTTGCGCAGTGGCATCAATGTTGGTACCTACAGGATATACTGGTACCAGCAGAAGCCAGGGAGTCCTCCCCAGTATCTCCTGAGGTACAAATCAGACTCAGATAAGCAGCAGGGCTCTGGAGTCCCCAGCCGCTTCTCTGGATCCAAAGATGCTTCGGCCAATGCAGGGATTTTACTCATCTCTGGGCTCCAGTCTGAGGATGAGGCTGACTATTACTGTATGATTTGGCACAGCAGCGCTTGGGTGTTCGGCGGAGGGACCAAGCTGACCGTCCTAGGTCAGCCCAAGGCTGCCCCCTCGGTCACTCTGTTCCCGCCCTCCTCTGAGGAGCTTCAAGCCAACAAGGCCACACTGGTGTGTCTCATAAGTGACTTCTACCCGGGAGCCGTGACAGTGGCCTGGAAGGCAGATAGCAGCCCCGTCAAGGCGGGAGTGGAGACCACCACACCCTCCAAACAAAGCAACAACAAGTACGCGGCCAGCAGCTA	FALSE	TRUE	IGLV5-45*03	NA	IGLJ3*02	CAGGCTGTGCTGACTCAGCCGTCTTCC...CTCTCTGCATCTCCTGGAGCATCAGCCAGTCTCACCTGCACCTTGCGCAGTGGCATCAATGTT.........GGTACCTACAGGATATACTGGTACCAGCAGAAGCCAGGGAGTCCTCCCCAGTATCTCCTGAGGTACAAATCAGAC.........TCAGATAAGCAGCAGGGCTCTGGAGTCCCC...AGCCGCTTCTCTGGATCCAAAGATGCTTCGGCCAATGCAGGGATTTTACTCATCTCTGGGCTCCAGTCTGAGGATGAGGCTGACTATTACTGTATGATTTGGCACAGCAGCGCTTGGGTGTTCGGCGGAGGGACCAAGCTGACCGTCCTAG	CAGGCTGTGCTGACTCAGCCGTCTTCC...CTCTCTGCATCTCCTGGAGCATCAGCCAGTCTCACCTGCACCTTGCGCAGTGGCATCAATGTT.........GGTACCTACAGGATATACTGGTACCAGCAGAAGCCAGGGAGTCCTCCCCAGTATCTCCTGAGGTACAAATCAGAC.........TCAGATAAGCAGCAGGGCTCTGGAGTCCCC...AGCCGCTTCTCTGGATCCAAAGATGCTTCGGCCAATGCAGGGATTTTACTCATCTCTGGGCTCCAGTCTGAGGATGAGGCTGACTATTACTGTATGATTTGGCACAGCAGCGCTTGGGTGTTCGGCGGAGGGACCAAGCTGACCGTCCTAG	TGTATGATTTGGCACAGCAGCGCTTGGGTGTTC	⋯	QAVLTQPSSLSASPGASASLTCTLRSGINVGTYRIYWYQQKPGSPPQYLLRYKSDSDKQQGSGVPSRFSGSKDASANAGILLISGLQSEDEADYYCMIWHSSA	NA	VFGGGTKLTVL	NA	IGLJ3*01	1	397	431	4.91e-13	0
sc5p_v2_hs_PBMC_1k_AACTCCCAGGCTAGGT_contig_2	ATACTTTCTGAGAGTCCTGGACCTCCTGTGCAAGAACATGAAACATCTGTGGTTCTTCCTCCTCCTGGTGGCAGCTCCCAGATGGGTCCTGTCCCAGGTGCAGCTGCAGGAGTCGGGCCCAGGACTGGTGAAGCCTTCACAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGCAGTGGTAGTTACTACTGGAGCTGGATCCGGCAGCCCGCCGGGAAGGGACTGGAGTGGATTGGGCGTATCTATACCAGTGGGAGCACCAACTACAACCCCTCCCTCAAGAGTCGAGTCACCATATCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCAGACACGGCCGTGTATTACTGTGCGAGAGAAAATTACGATTTTTGGAGTGGTTATTACCACGGTGCGGACGTCTGGGGCCAAGGGACCACGGTCACCGTCTCCTCAGGGAGTGCATCCGCCCCAACCCTTTTCCCCCTCGTCTCCTGTGAGAATTCCCCGTCGGATACGAGCAGCGTG	FALSE	TRUE	IGHV4-61*02	IGHD3-3*01	IGHJ6*02	CAGGTGCAGCTGCAGGAGTCGGGCCCA...GGACTGGTGAAGCCTTCACAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGC......AGTGGTAGTTACTACTGGAGCTGGATCCGGCAGCCCGCCGGGAAGGGACTGGAGTGGATTGGGCGTATCTATACCAGT.........GGGAGCACCAACTACAACCCCTCCCTCAAG...AGTCGAGTCACCATATCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCAGACACGGCCGTGTATTACTGTGCGAGAGAAAATTACGATTTTTGGAGTGGTTATTACCACGGTGCGGACGTCTGGGGCCAAGGGACCACGGTCACCGTCTCCTCA	CAGGTGCAGCTGCAGGAGTCGGGCCCA...GGACTGGTGAAGCCTTCACAGACCCTGTCCCTCACCTGCACTGTCTCTGGTGGCTCCATCAGC......AGTGGTAGTTACTACTGGAGCTGGATCCGGCAGCCCGCCGGGAAGGGACTGGAGTGGATTGGGCGTATCTATACCAGT.........GGGAGCACCAACTACAACCCCTCCCTCAAG...AGTCGAGTCACCATATCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCAGACACGGCCGTGTATTACTGTGCGAGAGANNATTACGATTTTTGGAGTGGTTATTACTACGGTATGGACGTCTGGGGCCAAGGGACCACGGTCACCGTCTCCTCA	TGTGCGAGAGAAAATTACGATTTTTGGAGTGGTTATTACCACGGTGCGGACGTCTGG	⋯	QVQLQESGPGLVKPSQTLSLTCTVSGGSISSGSYYWSWIRQPAGKGLEWIGRIYTSGSTNYNPSLKSRVTISVDTSKNQFSLKLSSVTAADTAVYYCAR	YDFWSGY	YHGADVWGQGTTVTVSS	NA	IGHJ6*02	1	416	469	1.14e-18	3
sc5p_v2_hs_PBMC_1k_AACTCCCAGGCTAGGT_contig_3	GGCTGGGGTCTCAGGAGGCAGCGCTCTGGGGACGTCTCCACCATGGCCTGGGCTCTGCTCCTCCTCACCTCCTCACTCAGGGCACAGGCTCTTGGGCCCAGTCTGCCCTGATTCAGCCTCCCTCCGTGTCCGGGTCTCCTGGACAGTCAGTCACCATCTCCTGCACTGGAACCAGCAGTGATGTTGGGAGTTATGACTATGTCTCCTGGTACCAACAGCACCCAGGCACAGTCCCCAAACCCATGATCTACAATGTCAATACTCAGCCCTCAGGGGTCCCTGATCGTTTCTCTGGCTCCAAGTCTGGCAATACGGCCTCCATGACCATCTCTGGACTCCAGGCTGAGGACGAGGCTGATTATTAGTGCTGCTCATATACAAGCAGTGCCACTTTCTTGGGTGTTCGGCGGAGGGACCAAGCTGACCGTCCTAGGTCAGCCCAAGGCTGCCCCCTCGGTCACTCTGTTCCCACCCTCCTCTGAGGAGCTTCAAGCCAACAAGGCCACACTGGTGTGTCTCATAAGTGACTTCTACCCGGGAGCCGTGACAGTGGCCTGGAAGGCAGATAGCAGCCCCGTCAAGGCGGGAGTGGAGACCACCACACCCTCCAAACAAAGCAACAACAAGTACGCGGCCAGCAGCTA	FALSE	FALSE	IGLV2-5*01	NA	IGLJ3*02	CAGTCTGCCCTGATTCAGCCTCCCTCC...GTGTCCGGGTCTCCTGGACAGTCAGTCACCATCTCCTGCACTGGAACCAGCAGTGATGTTGGG.........AGTTATGACTATGTCTCCTGGTACCAACAGCACCCAGGCACAGTCCCCAAACCCATGATCTACAATGTC.....................AATACTCAGCCCTCAGGGGTCCCT...GATCGTTTCTCTGGCTCCAAG......TCTGGCAATACGGCCTCCATGACCATCTCTGGACTCCAGGCTGAGGACGAGGCTGATTATTAGTGCTGCTCATATACAAGCAGTGCCACTTTCTTGGGTGTTCGGCGGAGGGACCAAGCTGACCGTCCTAG	CAGTCTGCCCTGATTCAGCCTCCCTCC...GTGTCCGGGTCTCCTGGACAGTCAGTCACCATCTCCTGCACTGGAACCAGCAGTGATGTTGGG.........AGTTATGACTATGTCTCCTGGTACCAACAGCACCCAGGCACAGTCCCCAAACCCATGATCTACAATGTC.....................AATACTCAGCCCTCAGGGGTCCCT...GATCGTTTCTCTGGCTCCAAG......TCTGGCAATACGGCCTCCATGACCATCTCTGGACTCCAGGCTGAGGACGAGGCTGATTATTAGTGCTGCTCATATACAAGCAGTGCCACTTNNTTGGGTGTTCGGCGGAGGGACCAAGCTGACCGTCCTAG	TGCTGCTCATATACAAGCAGTGCCACTTTCTTGGGTGTTC	⋯	QSALIQPPSVSGSPGQSVTISCTGTSSDVGSYDYVSWYQQHPGTVPKPMIYNVNTQPSGVPDRFSGSKSGNTASMTISGLQAEDEADY*CCSYTSSAT	NA	LGVRRRDQADRP	NA	IGLJ3*02	1	396	433	2.28e-16	0
sc5p_v2_hs_PBMC_1k_AACTCTTGTCATCGGC_contig_2	AGCTCTGAGAGAGGAGCCTTAGCCCTGGATTCCAAGGCCTATCCACTTGGTGATCAGCACTGAGCACCGAGGATTCACCATGGAACTGGGGCTCCGCTGGGTTTTCCTTGTTGCTATTTTAGAAGGTGTCCAGTGTGAGGTGCAGCTGGTGGAGTCTGGGGGAGGCCTGGTCAAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTCAGTAGCTATAGCATGAACTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCATCCATTAGTAGTAGTAGTAGTTACATATACTACGCAGACTCAGTGAAGGGCCGATTCACCATCTCCAGAGACAACGCCAAGAACTCACTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCTGTGTATTACTGTGCGAGACGTTACTATGATAGTAGTGGTTATTCCGCAAACTTTGACTACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCAGGGAGTGCATCCGCCCCAACCCTTTTCCCCCTCGTCTCCTGTGAGAATTCCCCGTCGGATACGAGCAGCGTG	FALSE	TRUE	IGHV3-21*01	IGHD3-22*01	IGHJ4*02	GAGGTGCAGCTGGTGGAGTCTGGGGGA...GGCCTGGTCAAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTC............AGTAGCTATAGCATGAACTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCATCCATTAGTAGTAGT......AGTAGTTACATATACTACGCAGACTCAGTGAAG...GGCCGATTCACCATCTCCAGAGACAACGCCAAGAACTCACTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCTGTGTATTACTGTGCGAGACGTTACTATGATAGTAGTGGTTATTCCGCAAACTTTGACTACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCAG	GAGGTGCAGCTGGTGGAGTCTGGGGGA...GGCCTGGTCAAGCCTGGGGGGTCCCTGAGACTCTCCTGTGCAGCCTCTGGATTCACCTTC............AGTAGCTATAGCATGAACTGGGTCCGCCAGGCTCCAGGGAAGGGGCTGGAGTGGGTCTCATCCATTAGTAGTAGT......AGTAGTTACATATACTACGCAGACTCAGTGAAG...GGCCGATTCACCATCTCCAGAGACAACGCCAAGAACTCACTGTATCTGCAAATGAACAGCCTGAGAGCCGAGGACACGGCTGTGTATTACTGTGCGAGANNTTACTATGATAGTAGTGGTTATTNNNNNNACTTTGACTACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCAG	TGTGCGAGACGTTACTATGATAGTAGTGGTTATTCCGCAAACTTTGACTACTGG	⋯	EVQLVESGGGLVKPGGSLRLSCAASGFTFSSYSMNWVRQAPGKGLEWVSSISSSSSYIYYADSVKGRFTISRDNAKNSLYLQMNSLRAEDTAVYYCAR	YYDSSGY	FDYWGQGTLVTVSS	NA	IGHJ4*02	1	462	506	2.61e-20	0
sc5p_v2_hs_PBMC_1k_AACTCTTGTCATCGGC_contig_1	AGAGCTCTGGGGAGTCTGCACCATGGCTTGGACCCCACTCCTCTTCCTCACCCTCCTCCTCCACTGCACAGGGTCTCTCTCCCAGCTTGTGCTGACTCAATCGCCCTCTGCCTCTGCCTCCCTGGGAGCCTCGGTCAAGCTCACCTGCACTCTGAGCAGTGGGCACAGCAGCTACGCCATCGCATGGCATCAGCAGCAGCCAGAGAAGGGCCCTCGGTACTTGATGAAGCTTAACAGTGATGGCAGCCACAGCAAGGGGGACGGGATCCCTGATCGCTTCTCAGGCTCCAGCTCTGGGGCTGAGCGCTACCTCACCATCTCCAGCCTCCAGTCTGAGGATGAGGCTGACTATTACTGTCAGACCTGGGGCACTGGCATTTATGTCTTCGGAACTGGGACCAAGGTCACCGTCCTAGGTCAGCCCAAGGCCAACCCCACTGTCACTCTGTTCCCGCCCTCCTCTGAGGAGCTCCAAGCCAACAAGGCCACACTAGTGTGTCTGATCAGTGACTTCTACCCGGGAGCTGTGACAGTGGCCTGGAAGGCAGATGGCAGCCCCGTCAAGGCGGGAGTGGAGACCACCAAACCCTCCAAACAGAGCAACAACAAGTACGCGGCCAGCAGCTA	FALSE	TRUE	IGLV4-69*01	NA	IGLJ1*01	CAGCTTGTGCTGACTCAATCGCCCTCT...GCCTCTGCCTCCCTGGGAGCCTCGGTCAAGCTCACCTGCACTCTGAGCAGTGGGCACAGC...............AGCTACGCCATCGCATGGCATCAGCAGCAGCCAGAGAAGGGCCCTCGGTACTTGATGAAGCTTAACAGTGAT.........GGCAGCCACAGCAAGGGGGACGGGATCCCT...GATCGCTTCTCAGGCTCCAGC......TCTGGGGCTGAGCGCTACCTCACCATCTCCAGCCTCCAGTCTGAGGATGAGGCTGACTATTACTGTCAGACCTGGGGCACTGGCATTTATGTCTTCGGAACTGGGACCAAGGTCACCGTCCTAG	CAGCTTGTGCTGACTCAATCGCCCTCT...GCCTCTGCCTCCCTGGGAGCCTCGGTCAAGCTCACCTGCACTCTGAGCAGTGGGCACAGC...............AGCTACGCCATCGCATGGCATCAGCAGCAGCCAGAGAAGGGCCCTCGGTACTTGATGAAGCTTAACAGTGAT.........GGCAGCCACAGCAAGGGGGACGGGATCCCT...GATCGCTTCTCAGGCTCCAGC......TCTGGGGCTGAGCGCTACCTCACCATCTCCAGCCTCCAGTCTGAGGATGAGGCTGACTATTACTGTCAGACCTGGGGCACTGGCATTTATGTCTTCGGAACTGGGACCAAGGTCACCGTCCTAG	TGTCAGACCTGGGGCACTGGCATTTATGTCTTC	⋯	QLVLTQSPSASASLGASVKLTCTLSSGHSSYAIAWHQQQPEKGPRYLMKLNSDGSHSKGDGIPDRFSGSSSGAERYLTISSLQSEDEADYYCQTWGTGI	NA	YVFGTGTKVTVL	NA	IGLJ1*01	1	379	416	2.22e-16	0
sc5p_v2_hs_PBMC_1k_AACTCTTGTCATCGGC_contig_3	AGCAGAGCTCTGGGGAGTCTGCACCATGGCTTGGACCCCACTCCTCTTCCTCACCCTCCTCCTCCACTGCACAGGTCAGGATGGCCCTCAGCACCCTGACCTCCAGCTCACTGATACCACCTCCCAAACTTATGCCAGGAATGTCCTTCCCTCTTTTCTTGACTCCAGCCGGTAATGGGTGTCTGTGTTTTCAGGGTCTCTCTCCCAGCTTGTGCTGACTCAATCGCCCTCTGCCTCTGCCTCCCTGGGAGCCTCGGTCAAGCTCACCTGCACTCTGAGCAGTGGGCACAGCAGCTACGCCATCGCATGGCATCAGCAGCAGCCAGAGAAGGGCCCTCGGTACTTGATGAAGCTTAACAGTGATGGCAGCCACAGCAAGGGGGACGGGATCCCTGATCGCTTCTCAGGCTCCAGCTCTGGGGCTGAGCGCTACCTCACCATCTCCAGCCTCCAGTCTGAGGATGAGGCTGACTATTACTGTCAGACCTGGGGCACTGGCATTCTTGGGTGTTCGGCGGAGGGACCAAGCTGACCGTCCTAGGTCAGCCCAAGGCTGCCCCCTCGGTCACTCTGTTCCCACCCTCCTCTGAGGAGCTTCAAGCCAACAAGGCCACACTGGTGTGTCTCATAAGTGACTTCTACCCGGGAGCCGTGACAGTGGCCTGGAAGGCAGATAGCAGCCCCGTCAAGGCGGGAGTGGAGACCACCACACCCTCCAAACAAAGCAACAACAAGTACGCGGCCAGCAGCTA	FALSE	FALSE	IGLV4-69*01	NA	IGLJ3*02	CAGCTTGTGCTGACTCAATCGCCCTCT...GCCTCTGCCTCCCTGGGAGCCTCGGTCAAGCTCACCTGCACTCTGAGCAGTGGGCACAGC...............AGCTACGCCATCGCATGGCATCAGCAGCAGCCAGAGAAGGGCCCTCGGTACTTGATGAAGCTTAACAGTGAT.........GGCAGCCACAGCAAGGGGGACGGGATCCCT...GATCGCTTCTCAGGCTCCAGC......TCTGGGGCTGAGCGCTACCTCACCATCTCCAGCCTCCAGTCTGAGGATGAGGCTGACTATTACTGTCAGACCTGGGGCACTGGCATTCTTGGGTGTTCGGCGGAGGGACCAAGCTGACCGTCCTAG	CAGCTTGTGCTGACTCAATCGCCCTCT...GCCTCTGCCTCCCTGGGAGCCTCGGTCAAGCTCACCTGCACTCTGAGCAGTGGGCACAGC...............AGCTACGCCATCGCATGGCATCAGCAGCAGCCAGAGAAGGGCCCTCGGTACTTGATGAAGCTTAACAGTGAT.........GGCAGCCACAGCAAGGGGGACGGGATCCCT...GATCGCTTCTCAGGCTCCAGC......TCTGGGGCTGAGCGCTACCTCACCATCTCCAGCCTCCAGTCTGAGGATGAGGCTGACTATTACTGTCAGACCTGGGGCACTGGCATTCTTGGGTGTTCGGCGGAGGGACCAAGCTGACCGTCCTAG	TGTCAGACCTGGGGCACTGGCATTCTTGGGTGTTC	⋯	QLVLTQSPSASASLGASVKLTCTLSSGHSSYAIAWHQQQPEKGPRYLMKLNSDGSHSKGDGIPDRFSGSSSGAERYLTISSLQSEDEADYYCQTWGTGI	NA	GCSAEGPSPS	NA	IGLJ3*02	1	504	541	2.67e-16	0

Run `ddl$pp$filter_contigs`

[15]:

vdj_results_list <- ddl$pp$filter_contigs(combined_bcr, adata)
vdj_results_list

[[1]]
Dandelion class object with n_obs = 812 and n_contigs = 1657
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'duplicate_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'rearrangement_status'
    metadata: 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'duplicate_count_B_VDJ', 'duplicate_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'

[[2]]
AnnData object with n_obs × n_vars = 18634 × 38224
    obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'percent.mt', 'RNA_snn_res.0.8', 'seurat_clusters', 'filter_rna', 'has_contig', 'filter_contig_quality', 'filter_contig_VDJ', 'filter_contig_VJ', 'contig_QC_pass', 'filter_contig', 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'duplicate_count_B_VDJ', 'duplicate_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'

This returns a two level list in R. The first level is the VDJ results, stored as a Dandelion python-class object and the second level is the accompanying AnnData object.

The Dandelion class is structured like a multi-slot object and the two data frames below are linked: 1) data <- AIRR table with row names as individual vdj contigs

metadata <- AIRR table collapsed to cell barcodes as row names

More details on the Dandelion class are in my python tutorials. The most important slot for now, is the metadata slot within Dandelion.

In order for the metadata to form properly, there must not be any duplicate barcodes, or incorrectly retrieved information from the data slot. If you end up with a Dandelion object that only contains the data slot filled, it means one of the two conditions happened. In those situations, I would recommend you to send me a copy of the file so I can check why it’s failing; it is usually due to coding eror that arise from string and float incompatibilities when constructing the object.

To save the Dandelion object, you can do the following:

[16]:

vdj_results_list[[1]]$write_pkl("vdj_save.pkl.pbz2")

The .pkl.pbz2 extension is basically a bzip2-compressed pickle file format from python.

You can also save using $write_h5ddl.

[17]:

vdj_results_list[[1]]$write_h5ddl("vdj_save.h5ddl")

To read the file back into R, you can do the following:

[18]:

vdj_data <- ddl$read_pkl("vdj_save.pkl.pbz2")
vdj_data

Dandelion class object with n_obs = 812 and n_contigs = 1657
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'duplicate_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'rearrangement_status'
    metadata: 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'duplicate_count_B_VDJ', 'duplicate_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'

or

[19]:

vdj_data2 <- ddl$read_h5ddl("vdj_save.h5ddl")
vdj_data2

Dandelion class object with n_obs = 812 and n_contigs = 1657
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'duplicate_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'rearrangement_status'
    metadata: 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'duplicate_count_B_VDJ', 'duplicate_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'

Finding clones

Dandelion comes with a method to define clones based on V-J gene usuage and CDR3 junction similarity but you can always run Immcantation/Change-O’s DefineClones with the filtered file from earlier using their tutorial. To use dandelion’s you just need to do the following:

[20]:

ddl$tl$find_clones(vdj_data)

Calculating size of clones

Sometimes it’s useful to evaluate the size of the clone. Here ddl$tl$clone_size does a simple calculation to enable that.

[21]:

ddl$tl$clone_size(vdj_data)

You can also specify max_size to clip off the calculation at a fixed value.

[22]:

ddl$tl$clone_size(vdj_data, max_size = 3)

I have blitz through the last 3 functions without showing you the output but don’t worry, they are all stashed in the Dandelion object.

[23]:

# compare the column names in the metadata slot of the object below with the one above.
vdj_data

Dandelion class object with n_obs = 812 and n_contigs = 1657
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'duplicate_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'rearrangement_status', 'clone_id'
    metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'duplicate_count_B_VDJ', 'duplicate_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ', 'clone_id_size', 'clone_id_size_max_3.0'

Generate BCR network visualization

The name of Dandelion came from the way it visualizes the BCR as networks, that look like dandelions in the field. We need to first generate the network. We will hopewfully visualize in Seurat later.

[24]:

ddl$tl$generate_network(vdj_data)

Integrating with Seurat

At this point, you might want to transfer the metadata slot back to Seurat so you can visualise some things. You can do that column by column directly from Dandelion object like as follows:

isotype = unlist(vdj_data$metadata$isotype) # because of the python to R conversion, it thinks it's a list rather than a vector. we can correct this with unlist
names(isotype) <- row.names(vdj_data$metadata)
merged <- AddMetaData(merged, isotype, 'isotype')
DimPlot(merged, group.by = 'isotype')

I will demonstrate how to do this via the AnnData object.

[25]:

adata2 <- vdj_results_list[[2]]
adata2

AnnData object with n_obs × n_vars = 18634 × 38224
    obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'percent.mt', 'RNA_snn_res.0.8', 'seurat_clusters', 'filter_rna', 'has_contig', 'filter_contig_quality', 'filter_contig_VDJ', 'filter_contig_VJ', 'contig_QC_pass', 'filter_contig', 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'duplicate_count_B_VDJ', 'duplicate_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'

Transfer `Dandelion` to `AnnData`

ddl$tl$transfer will act to transfer the metadata and graph slots from Dandelion object to AnnData.

[26]:

ddl$tl$transfer(adata2, vdj_data) # switch expanded_only to TRUE if you only want to get the coordinates for expanded clones

This will populate the adata .obs slot with the metadata from the Dandelion object.

[27]:

adata2

AnnData object with n_obs × n_vars = 18634 × 38224
    obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'percent.mt', 'RNA_snn_res.0.8', 'seurat_clusters', 'filter_rna', 'has_contig', 'filter_contig_quality', 'filter_contig_VDJ', 'filter_contig_VJ', 'contig_QC_pass', 'filter_contig', 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'duplicate_count_B_VDJ', 'duplicate_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ', 'clone_id', 'clone_id_by_size', 'clone_id_size', 'clone_id_size_max_3.0'
    uns: 'neighbors', 'rna_neighbors', 'clone_id'
    obsm: 'X_vdj'
    obsp: 'connectivities', 'distances', 'vdj_connectivities', 'vdj_distances'

Saving

There may be some issues with conversions between python and R and vice versa. So my recommendation at this stage is to save the three objects separately and load them up in a fresh session. There’s a high chance your session will crash and/or corrupt your file if you ignore this.

[28]:

adata2$write("adata_test.h5ad", compression = "gzip")

[29]:

saveRDS(merged, "merged.RDS")

[30]:

vdj_data$write_pkl("vdj_save.pkl.pbz2")

New Session: Transfer `AnnData` to `Seurat`

We start a new session and read in the files.

[1]:

setwd("~/Downloads/dandelion_tutorial/")
library(reticulate)
ddl <- import("dandelion")
sc <- import("scanpy")

[2]:

library(Seurat)
samples <- c("sc5p_v2_hs_PBMC_1k", "sc5p_v2_hs_PBMC_10k", "vdj_v1_hs_pbmc3", "vdj_nextgem_hs_pbmc3")
adata <- sc$read_h5ad("adata_test.h5ad")
# vdj = ddl$read('vdj_save.pkl.pbz2') # don't need this at this stage
merged <- readRDS("merged.RDS")

The legacy packages maptools, rgdal, and rgeos, underpinning the sp package,
which was just loaded, will retire in October 2023.
Please refer to R-spatial evolution reports for details, especially
https://r-spatial.org/r/2023/05/15/evolution4.html.
It may be desirable to make the sf package available;
package maintainers should consider adding sf to Suggests:.
The sp package is now running under evolution status 2
     (status 2 uses the sf package in place of rgdal)

Attaching SeuratObject

[3]:

adata

AnnData object with n_obs × n_vars = 18634 × 38224
    obs: 'orig.ident', 'nCount_RNA', 'nFeature_RNA', 'percent.mt', 'RNA_snn_res.0.8', 'seurat_clusters', 'filter_rna', 'has_contig', 'filter_contig_quality', 'filter_contig_VDJ', 'filter_contig_VJ', 'contig_QC_pass', 'filter_contig', 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'duplicate_count_B_VDJ', 'duplicate_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ', 'clone_id', 'clone_id_by_size', 'clone_id_size', 'clone_id_size_max_3.0'
    uns: 'clone_id', 'neighbors', 'rna_neighbors'
    obsm: 'X_vdj'
    obsp: 'connectivities', 'distances', 'vdj_connectivities', 'vdj_distances'

[4]:

merged

An object of class Seurat
38224 features across 18634 samples within 1 assay
Active assay: RNA (38224 features, 2000 variable features)
 2 dimensional reductions calculated: pca, umap

So there are a few cells missing from the AnnData object because they were filtered away. Let’s do a simple merge to populate the Seurat object’s meta.data slot.

[5]:

merged_meta <- merged@meta.data
head(merged_meta)

A data.frame: 6 × 6
	orig.ident	nCount_RNA	nFeature_RNA	percent.mt	RNA_snn_res.0.8	seurat_clusters
	<chr>	<dbl>	<int>	<dbl>	<fct>	<fct>
sc5p_v2_hs_PBMC_1k_AAACCTGCAGCCTGTG	SeuratProject	5029	1819	3.579240	19	19
sc5p_v2_hs_PBMC_1k_AAACGGGTCGCTGATA	SeuratProject	4343	1738	2.440709	4	4
sc5p_v2_hs_PBMC_1k_AAAGCAAAGTATGACA	SeuratProject	4999	1796	2.880576	4	4
sc5p_v2_hs_PBMC_1k_AAATGCCCACTTAAGC	SeuratProject	4998	1530	3.281313	2	2
sc5p_v2_hs_PBMC_1k_AACACGTCAACAACCT	SeuratProject	4818	1579	2.283105	2	2
sc5p_v2_hs_PBMC_1k_AACACGTGTTATCCGA	SeuratProject	7564	2220	2.525119	4	4

[6]:

# extract the metadata from the anndata object
adata_meta <- as.data.frame(adata$obs)
# two rounds to convert NAs
is.nan.data.frame <- function(x) {
  do.call(cbind, lapply(x, is.nan))
}

adata_meta[is.nan(adata_meta)] <- NA

is.nan.data.frame <- function(x) {
  do.call(cbind, lapply(x, function(y) y == "nan"))
}

adata_meta[is.nan(adata_meta)] <- NA

If you run into issues with the conversion, unfortunately there’s not much I can do about it. One alternative is to just transfer the .obs slot from the dandelion object one by one as above into the seurat object.

[7]:

# merge into a data frame and then make sure the format is correct
merged_data <- as.data.frame(merge(merged_meta, adata_meta, by = 0, all = TRUE))
rownames(merged_data) <- merged_data[, 1]
merged_data[, 1] <- NULL

[8]:

# now just replace the current Seurat@meta.data
merged@meta.data <- merged_data

[9]:

options(repr.plot.width = 6, repr.plot.height = 12, repr.plot.res = 200)
DimPlot(merged, group.by = c("contig_QC_pass", "isotype", "status"), ncol = 1)

Warning message in `[[.Seurat`(object, group.by):
“Cannot find the following bits of meta data: status”

../_images/notebooks_6_dandelion_running_from_R-10x_data_82_1.png

If you want to visualise the BCR network, you will have to subset to cells that contain BCR.

[10]:

merged_bcr <- subset(merged, subset = contig_QC_pass == "True")
merged_bcr

An object of class Seurat
38224 features across 812 samples within 1 assay
Active assay: RNA (38224 features, 2000 variable features)
 2 dimensional reductions calculated: pca, umap

Plotting BCR network

[11]:

X_vdj <- adata$obsm["X_vdj"]
X_vdj <- apply(X_vdj, 2, function(x) gsub("NaN", NA, x)) # convert python NAs to R NAs
X_vdj <- apply(X_vdj, 2, function(x) as.numeric(x)) # Make sure they are actually numbers
X_vdj <- as.matrix(X_vdj)
row.names(X_vdj) <- row.names(adata$obs) # will not work if the anndata .obs slot is malformed. the row.names for adata$obs and row.names(merged_bcr@meta.data) should be identical anyway, so just replace with row.names(merged_bcr@meta.data)
X_vdj <- X_vdj[!is.na(X_vdj[, 1]), ]
colnames(X_vdj) <- c("VDJ_1", "VDJ_2")

[12]:

merged_bcr[["VDJ"]] <- CreateDimReducObject(embeddings = X_vdj, key = "VDJ_", assay = DefaultAssay(merged_bcr))

[13]:

merged_bcr

An object of class Seurat
38224 features across 812 samples within 1 assay
Active assay: RNA (38224 features, 2000 variable features)
 3 dimensional reductions calculated: pca, umap, VDJ

[14]:

options(repr.plot.width = 6, repr.plot.height = 5, repr.plot.res = 200)
DimPlot(merged_bcr, reduction = "VDJ")

../_images/notebooks_6_dandelion_running_from_R-10x_data_89_0.png

[15]:

DimPlot(merged_bcr, reduction = "VDJ", group.by = "isotype")

../_images/notebooks_6_dandelion_running_from_R-10x_data_90_0.png

This concludes a quick primer on how to use dandelion from R. It may be a bit buggy from time to time due to how reticulate/rpy2 works but hopefully it will be overall alright.

The rest of the functions could be potentially run from R (just be changing . to $ for example), but I haven’t tested it. Might be easier to run it through python, or maybe with the new RStudio 4?

[16]:

sessionInfo()

R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin22.4.0 (64-bit)
Running under: macOS Ventura 13.4.1

Matrix products: default
BLAS:   /opt/homebrew/Cellar/openblas/0.3.23/lib/libopenblasp-r0.3.23.dylib
LAPACK: /opt/homebrew/Cellar/r/4.3.1/lib/R/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

time zone: Australia/Brisbane
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] SeuratObject_4.1.3 Seurat_4.3.0.1     reticulate_1.30

loaded via a namespace (and not attached):
  [1] deldir_1.0-9           pbapply_1.7-0          gridExtra_2.3
  [4] rlang_1.1.1            magrittr_2.0.3         RcppAnnoy_0.0.20
  [7] spatstat.geom_3.2-1    matrixStats_1.0.0      ggridges_0.5.4
 [10] compiler_4.3.1         png_0.1-8              vctrs_0.6.3
 [13] reshape2_1.4.4         stringr_1.5.0          pkgconfig_2.0.3
 [16] crayon_1.5.2           fastmap_1.1.1          ellipsis_0.3.2
 [19] labeling_0.4.2         utf8_1.2.3             promises_1.2.0.1
 [22] purrr_1.0.1            jsonlite_1.8.5         goftest_1.2-3
 [25] later_1.3.1            spatstat.utils_3.0-3   uuid_1.1-0
 [28] irlba_2.3.5.1          parallel_4.3.1         cluster_2.1.4
 [31] R6_2.5.1               ica_1.0-3              spatstat.data_3.0-1
 [34] stringi_1.7.12         RColorBrewer_1.1-3     parallelly_1.36.0
 [37] lmtest_0.9-40          scattermore_1.2        Rcpp_1.0.10
 [40] IRkernel_1.3.2         tensor_1.5             future.apply_1.11.0
 [43] zoo_1.8-12             base64enc_0.1-3        sctransform_0.3.5
 [46] httpuv_1.6.11          Matrix_1.5-4.1         splines_4.3.1
 [49] igraph_1.5.0           tidyselect_1.2.0       abind_1.4-5
 [52] spatstat.random_3.1-5  codetools_0.2-19       miniUI_0.1.1.1
 [55] spatstat.explore_3.2-1 listenv_0.9.0          lattice_0.21-8
 [58] tibble_3.2.1           plyr_1.8.8             shiny_1.7.4
 [61] withr_2.5.0            ROCR_1.0-11            evaluate_0.21
 [64] Rtsne_0.16             future_1.32.0          survival_3.5-5
 [67] polyclip_1.10-4        fitdistrplus_1.1-11    pillar_1.9.0
 [70] KernSmooth_2.23-21     plotly_4.10.2          generics_0.1.3
 [73] rprojroot_2.0.3        sp_2.0-0               IRdisplay_1.1
 [76] ggplot2_3.4.2          munsell_0.5.0          scales_1.2.1
 [79] globals_0.16.2         xtable_1.8-4           glue_1.6.2
 [82] lazyeval_0.2.2         tools_4.3.1            data.table_1.14.8
 [85] pbdZMQ_0.3-9           RANN_2.6.1             leiden_0.4.3
 [88] Cairo_1.6-0            cowplot_1.1.1          grid_4.3.1
 [91] tidyr_1.3.0            colorspace_2.1-0       nlme_3.1-162
 [94] patchwork_1.1.2        repr_1.1.6             cli_3.6.1
 [97] spatstat.sparse_3.0-2  fansi_1.0.4            viridisLite_0.4.2
[100] dplyr_1.1.2            uwot_0.1.14            gtable_0.3.3
[103] digest_0.6.31          progressr_0.13.0       ggrepel_0.9.3
[106] farver_2.1.1           htmlwidgets_1.6.2      htmltools_0.5.5
[109] lifecycle_1.0.3        httr_1.4.6             here_1.0.1
[112] mime_0.12              MASS_7.3-60

[ ]: