Filtering
We now move on to filtering out BCR contigs (and corresponding cells if necessary) from the BCR data and transcriptome object loaded in scanpy.
Import dandelion module
[1]:
import os
import dandelion as ddl
# change directory to somewhere more workable
os.chdir(os.path.expanduser("~/Downloads/dandelion_tutorial/"))
ddl.logging.print_header()
dandelion==0.3.4.dev29 pandas==2.0.1 numpy==1.24.3 matplotlib==3.7.1 networkx==3.1 scipy==1.11.2
Import modules for use with scanpy
[2]:
import pandas as pd
import numpy as np
import scanpy as sc
import warnings
warnings.filterwarnings("ignore")
sc.logging.print_header()
scanpy==1.9.3 anndata==0.9.1 umap==0.5.3 numpy==1.24.3 scipy==1.11.2 pandas==2.0.1 scikit-learn==1.3.0 statsmodels==0.14.0 python-igraph==0.10.6 pynndescent==0.5.10
Import the transcriptome data
[3]:
samples = [
"sc5p_v2_hs_PBMC_1k",
"sc5p_v2_hs_PBMC_10k",
"vdj_v1_hs_pbmc3",
"vdj_nextgem_hs_pbmc3",
]
adata_list = []
for sample in samples:
adata = sc.read_10x_h5(
sample + "/filtered_feature_bc_matrix.h5", gex_only=True
)
adata.obs["sampleid"] = sample
# rename cells to sample id + barcode
adata.obs_names = [str(sample) + "_" + str(j) for j in adata.obs_names]
adata.var_names_make_unique()
adata_list.append(adata)
adata = adata_list[0].concatenate(adata_list[1:])
# rename the obs_names again, this time cleaving the trailing -#
adata.obs_names = [str(j).split("-")[0] for j in adata.obs_names]
adata
[3]:
AnnData object with n_obs × n_vars = 30471 × 31915
obs: 'sampleid', 'batch'
var: 'feature_types', 'genome', 'gene_ids-0', 'gene_ids-1', 'gene_ids-2', 'gene_ids-3'
I’m using a wrapper called pp.recipe_scanpy_qc
to run through a generic scanpy workflow. You can skip this if you already have a pre-processed AnnData
object for the subsequent steps.
[4]:
ddl.pp.recipe_scanpy_qc(adata, mito_cutoff=None) # use a gmm model to decide
# we can continue with those that survive qc
adata = adata[adata.obs["filter_rna"] == "False"].copy()
adata
[4]:
AnnData object with n_obs × n_vars = 22985 × 31915
obs: 'sampleid', 'batch', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'gmm_pct_count_clusters_keep', 'scrublet_score', 'is_doublet', 'filter_rna'
var: 'feature_types', 'genome', 'gene_ids-0', 'gene_ids-1', 'gene_ids-2', 'gene_ids-3'
Filter cells that are potental doublets and poor quality in both the V(D)J data and transcriptome data
ddl.pp.filter_contigs
Updated workflow
Pre v0.2.4, the normal workflow involves using ddl.pp.filter_contigs
to remove poor quality contigs. From v0.2.4 onwards however, a separate function, ddl.pp.check_contigs
, will be an alternative mode to perform the QCs. The difference is that ddl.pp.filter_contigs
will remove contigs from the final data, whereas ddl.pp.check_contigs
simply marks the problematic contigs as ambiguous
. We will go through each option separately
We use the function pp.filter_contigs
to mark and filter out cells and contigs from both the V(D)J data and transcriptome data in AnnData
. The operation will remove bad quality cells based on transcriptome information as well as remove V(D)J doublets (multiplet heavy/long chains, and/or light/short chains) from the V(D)J data. In some situations, a single cell can have multiple heavy/long and light/short chain contigs although they have an identical V(D)J+C alignment; in situations like
this, the contigs with lesser umis will be dropped and the umis transferred to duplicate_count column. The same procedure is applied to both heavy chain and light chains before identifying doublets.
Cells in the gene expression object without V(D)J information will not be affected which means that the AnnData
object can hold non-B/T cells. Run ?ddl.pp.filter_contigs
to check what each option does.
[5]:
# first we read in the 4 bcr files
bcr_files = []
for sample in samples:
file_location = sample + "/dandelion/filtered_contig_dandelion.tsv"
bcr_files.append(pd.read_csv(file_location, sep="\t"))
bcr = pd.concat(bcr_files, ignore_index=True)
bcr.reset_index(inplace=True, drop=True)
bcr
[5]:
sequence_id | sequence | rev_comp | productive | v_call | d_call | j_call | sequence_alignment | germline_alignment | junction | ... | v_sequence_alignment_aa | d_sequence_alignment_aa | j_sequence_alignment_aa | complete_vdj | j_call_multimappers | j_call_multiplicity | j_call_sequence_start_multimappers | j_call_sequence_end_multimappers | j_call_support_multimappers | mu_count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | sc5p_v2_hs_PBMC_1k_AACTCCCAGGCTAGGT_contig_1 | ACTGCGGGGGTAAGAGGTTGTGTCCACCATGGCCTGGACTCCTCTC... | F | T | IGLV5-45*03 | NaN | IGLJ3*02 | CAGGCTGTGCTGACTCAGCCGTCTTCC...CTCTCTGCATCTCCTG... | CAGGCTGTGCTGACTCAGCCGTCTTCC...CTCTCTGCATCTCCTG... | TGTATGATTTGGCACAGCAGCGCTTGGGTGTTC | ... | QAVLTQPSSLSASPGASASLTCTLRSGINVGTYRIYWYQQKPGSPP... | NaN | VFGGGTKLTVL | NaN | IGLJ3*01 | 1.0 | 397 | 431 | 0.0 | 0 |
1 | sc5p_v2_hs_PBMC_1k_AACTCCCAGGCTAGGT_contig_2 | ATACTTTCTGAGAGTCCTGGACCTCCTGTGCAAGAACATGAAACAT... | F | T | IGHV4-61*02 | IGHD3-3*01 | IGHJ6*02 | CAGGTGCAGCTGCAGGAGTCGGGCCCA...GGACTGGTGAAGCCTT... | CAGGTGCAGCTGCAGGAGTCGGGCCCA...GGACTGGTGAAGCCTT... | TGTGCGAGAGAAAATTACGATTTTTGGAGTGGTTATTACCACGGTG... | ... | QVQLQESGPGLVKPSQTLSLTCTVSGGSISSGSYYWSWIRQPAGKG... | YDFWSGY | YHGADVWGQGTTVTVSS | NaN | IGHJ6*02 | 1.0 | 416 | 469 | 0.0 | 3 |
2 | sc5p_v2_hs_PBMC_1k_AACTCCCAGGCTAGGT_contig_3 | GGCTGGGGTCTCAGGAGGCAGCGCTCTGGGGACGTCTCCACCATGG... | F | F | IGLV2-5*01 | NaN | IGLJ3*02 | CAGTCTGCCCTGATTCAGCCTCCCTCC...GTGTCCGGGTCTCCTG... | CAGTCTGCCCTGATTCAGCCTCCCTCC...GTGTCCGGGTCTCCTG... | TGCTGCTCATATACAAGCAGTGCCACTTTCTTGGGTGTTC | ... | QSALIQPPSVSGSPGQSVTISCTGTSSDVGSYDYVSWYQQHPGTVP... | NaN | LGVRRRDQADRP | NaN | IGLJ3*02 | 1.0 | 396 | 433 | 0.0 | 0 |
3 | sc5p_v2_hs_PBMC_1k_AACTCTTGTCATCGGC_contig_2 | AGCTCTGAGAGAGGAGCCTTAGCCCTGGATTCCAAGGCCTATCCAC... | F | T | IGHV3-21*01 | IGHD3-22*01 | IGHJ4*02 | GAGGTGCAGCTGGTGGAGTCTGGGGGA...GGCCTGGTCAAGCCTG... | GAGGTGCAGCTGGTGGAGTCTGGGGGA...GGCCTGGTCAAGCCTG... | TGTGCGAGACGTTACTATGATAGTAGTGGTTATTCCGCAAACTTTG... | ... | EVQLVESGGGLVKPGGSLRLSCAASGFTFSSYSMNWVRQAPGKGLE... | YYDSSGY | FDYWGQGTLVTVSS | NaN | IGHJ4*02 | 1.0 | 462 | 506 | 0.0 | 0 |
4 | sc5p_v2_hs_PBMC_1k_AACTCTTGTCATCGGC_contig_1 | AGAGCTCTGGGGAGTCTGCACCATGGCTTGGACCCCACTCCTCTTC... | F | T | IGLV4-69*01 | NaN | IGLJ1*01 | CAGCTTGTGCTGACTCAATCGCCCTCT...GCCTCTGCCTCCCTGG... | CAGCTTGTGCTGACTCAATCGCCCTCT...GCCTCTGCCTCCCTGG... | TGTCAGACCTGGGGCACTGGCATTTATGTCTTC | ... | QLVLTQSPSASASLGASVKLTCTLSSGHSSYAIAWHQQQPEKGPRY... | NaN | YVFGTGTKVTVL | NaN | IGLJ1*01 | 1.0 | 379 | 416 | 0.0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
7352 | vdj_nextgem_hs_pbmc3_TTTGCGCTCTGTCAAG_contig_2 | ATCACATAACAACCACATTCCTCCTCTAAAGAAGCCCCCGGGAGCC... | F | T | IGHV1-69*01,IGHV1-69D*01 | IGHD3-22*01 | IGHJ4*02 | CAGGTGCAGCTGGTGCAGTCTGGGGCT...GAAGTGAAGAAGCCTG... | CAGGTGCAGCTGGTGCAGTCTGGGGCT...GAGGTGAAGAAGCCTG... | TGTGCGAGGGGGAAGTATTACTATGATAAAAGTGGGTCTCCACCTC... | ... | QVQLVQSGAEVKKPGSSVKVSCKVSGGIFSSYAISWVRQAPGQGLE... | YYYDKSG | FDYWGQGTLVTVSS | NaN | IGHJ4*02 | 1.0 | 457 | 500 | 1.1e-19 | 16 |
7353 | vdj_nextgem_hs_pbmc3_TTTGGTTGTAAGGATT_contig_1 | AGAGCTCTGGAGAAGAGCTGCTCAGTTAGGACCCAGAGGGAACCAT... | F | T | IGKV3-20*01 | NaN | IGKJ2*01,IGKJ2*02 | GAAATTGTGTTGACGCAGTCTCCAGGCACCCTGTCTTTGTCTCCAG... | GAAATTGTGTTGACGCAGTCTCCAGGCACCCTGTCTTTGTCTCCAG... | TGTCAGCAGTATGATGAGTCACCTCTGACTTTT | ... | EIVLTQSPGTLSLSPGERATLSCRASQSLTNSQLAWYQQKPGQAPR... | NaN | TFGQGTKLEIK | NaN | IGKJ2*02 | 1.0 | 396 | 429 | 3.33e-14 | 11 |
7354 | vdj_nextgem_hs_pbmc3_TTTGGTTGTAAGGATT_contig_2 | AGCTCTGGGAGAGGAGCCCCAGCCCTGAGATTCCCAGGTGTTTCCA... | F | T | IGHV3-9*01 | IGHD5-18*01,IGHD5-5*01 | IGHJ6*03 | GAAGTGCAGCTGGTGGAGTCTGGGGGA...GGCTTGGTACAGCCTG... | GAAGTGCAGCTGGTGGAGTCTGGGGGA...GGCTTGGTACAGCCTG... | TGTGCAAAAGACGGATACAGCTATCGTTCGTCATACTACTTTTACA... | ... | EVQLVESGGGLVQPGRSLRLSCAASGFSFDDYVMHWVRQAPGKGLE... | GYSYR | YYFYMDVWGKGTTVTVSS | NaN | IGHJ6*03 | 1.0 | 456 | 509 | 6.54e-22 | 10 |
7355 | vdj_nextgem_hs_pbmc3_TTTGTCACAGTAGAGC_contig_1 | AGCTCTGAGAGAGGAGCCCAGCCCTGGGATTTTCAGGTGTTTTCAT... | F | T | IGHV3-23*01,IGHV3-23D*01 | IGHD4-17*01 | IGHJ4*02 | GAGGTGCAGCTGTTGGAGTCTGGGGGA...GGCTTGGTACAGCCTG... | GAGGTGCAGCTGTTGGAGTCTGGGGGA...GGCTTGGTACAGCCTG... | TGTGCGAAAGATTTTAGGTCGCCATACGGTGACTACTACTTTGACT... | ... | EVQLLESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLE... | YGD | YFDYWGQGTLVTVSS | NaN | IGHJ4*02 | 1.0 | 456 | 503 | 5.59e-22 | 0 |
7356 | vdj_nextgem_hs_pbmc3_TTTGTCACAGTAGAGC_contig_2 | GTGGGTCCAGGAGGCAGAACTCTGGGTGTCTCACCATGGCCTGGAT... | F | T | IGLV3-25*03 | NaN | IGLJ1*01 | TCCTATGAGCTGACACAGCCACCCTCG...GTGTCAGTGTCCCCAG... | TCCTATGAGCTGACACAGCCACCCTCG...GTGTCAGTGTCCCCAG... | TGTCAATCAGCAGACAGCAGTGGTACTTATCTTTATGTCTTC | ... | SYELTQPPSVSVSPGQTARITCSGDALPKQYAYWYQQKPGQAPVLV... | NaN | YVFGTGTKVTVL | NaN | IGLJ1*01 | 1.0 | 383 | 420 | 1.53e-16 | 0 |
7357 rows × 120 columns
Library type
It is recommended to specify the library_type argument as it will remove all contigs that do not belong to the related loci. The rationale is that the choice of the library type should mean that the primers used would most likely amplify those related sequences and if there’s any unexpected loci, they likely represent artifacts and shouldn’t be analysed. The optional argument accepts: ig
, tr-ab
, tr-gd
or None
where None
means all contigs will be kept.
[6]:
# The function will return both objects.
vdj, adata2 = ddl.pp.filter_contigs(
bcr, adata, library_type="ig", filter_rna=True
) # filter_rna is set to True to speed up the rest of the analyses. Usually I leave it as False.
Preparing data: 6505it [00:00, 7730.10it/s]
Scanning for poor quality/ambiguous contigs: 100%|██████████| 3158/3158 [00:06<00:00, 463.86it/s]
Filtering parameters to consider
The default mode is to filter any extra VDJ chains (BCR heavy chains and TCR long chains) because of allelic exclusion, with some exceptions: IgM and IgD pairs will be kept and productive TCR delta chains will be kept in alpha-beta T cells because of allelic inclusion [Sleckman1998]. The option to change the behaviour (i.e keep all extra VDJ chains) is by toggling:
filter_extra_vdj_chains=False
The default mode is to keep any extra VJ chains (BCR light chains and TCR short chains), but some may be interested in removing them. The option to change the behaviour is by toggling:
filter_extra_vj_chains=True
If the cell in the V(D)J table cannot be found in the transcriptomic data, it will also be removed from the V(D)J data by default. This can be changed by toggling:
filter_missing=False
When contigs are marked as poor quality, the default behaviour is to remove the contigs associated with the barcode, and not the barcode from the transcriptome data. This can be toggled to remove the entire cell if the intention is to retain a conservative dataset for both V(D)J and transcriptome data:
filter_poorqualitycontig=True
The default behaviour is to rescue the chain/contig with the highest umi if there are multiple contigs for a single cell. The function requires a minimum fold-difference of 2 between the highest and lowest umi in order to rescue the contig. However, if the contigs have similar number of umis, or if the sum of the umis are very low, then the entire cell will be filtered. The fold-difference cut-off can be specified via the option
umi_foldchange_cutoff
. This can be toggled toFalse
i.e. drop all multiple chains/contigs:
keep_highest_umi=True
The default behaviour is to only consider productive contigs but some cell types may require examination of non-productive chains (e.g. developing early B/T cells, ILCs, NKTs etc.). Because the filtering of productive and non-productive contigs are kept separate, this should not impact on productive contigs. But specifying
productive_only=True
will remove all non-productive contigs.
productive_only=False
If you just want to mark which contigs to remove and not actually remove them from consideration, this can be toggled with:
filter_contig=False
If you want to keep the processed transcriptome data as is, and not make use of the V(D)J data to filter out potentially poor quality cells because of multiplet V(D)J detection, consider using:
filter_rna=False
This should keep the anndata
as per the input but with the .obs
columns appropriately filled in with the V(D)J QC metrics.
Lastly, if you just want to do a light filtering (like just check that the V(D)J+C genes are matching), then you can toggle
simple=True
. This will ignore all the other options:
simple=True
Check the output V(D)J table
The vdj table is returned as a Dandelion
class object in the .data
slot (described in further detail here); if a file was provided for filter_bcr
above, a new file will be created in the same folder with the filtered
prefix. Note that this V(D)J table is indexed based on contigs (sequence_id).
[7]:
vdj
[7]:
Dandelion class object with n_obs = 2043 and n_contigs = 4181
data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'duplicate_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'rearrangement_status'
metadata: 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'duplicate_count_B_VDJ', 'duplicate_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'
Check the AnnData object as well
And the AnnData
object is indexed based on cells.
[8]:
adata2
[8]:
AnnData object with n_obs × n_vars = 22920 × 31915
obs: 'sampleid', 'batch', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'gmm_pct_count_clusters_keep', 'scrublet_score', 'is_doublet', 'filter_rna', 'has_contig', 'filter_contig_quality', 'filter_contig_VDJ', 'filter_contig_VJ', 'contig_QC_pass', 'filter_contig', 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'duplicate_count_B_VDJ', 'duplicate_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'
var: 'feature_types', 'genome', 'gene_ids-0', 'gene_ids-1', 'gene_ids-2', 'gene_ids-3'
The .obs
slot in the AnnData
object now contains a few new columns related to the V(D)J chains:
Relevant columns in obs
has_contig
whether cells have V(D)J chains.
filter_contig_quality
recommendation for filtering cells identified as having poor quality contigs.
filter_contig_VDJ
recommendation for filtering cells identified as VDJ ‘multiplets’.
filter_contig_VJ
recommendation for filtering cells identifed as having multiple VJ contigs.
contig_QC_pass
cells where V(D)J chains successfully passed QC.
filter_contig
recommendation for filter for cells due to bad quality chains.
So this means that to go forward, you want to only select cells that have BCR that passed QC (has_contig == "True"
and contig_QC_pass == "True"
) with filtering recommendation to be false (filter_contig == "False"
).
The number of cells that actually has a matching BCR can be tabluated.
[9]:
pd.crosstab(adata2.obs["has_contig"], adata2.obs["filter_contig"])
[9]:
filter_contig | False |
---|---|
has_contig | |
No_contig | 20756 |
True | 2164 |
[10]:
pd.crosstab(adata2.obs["has_contig"], adata2.obs["contig_QC_pass"])
[10]:
contig_QC_pass | False | No_contig | True |
---|---|---|---|
has_contig | |||
No_contig | 0 | 20756 | 0 |
True | 121 | 0 | 2043 |
[11]:
pd.crosstab(adata2.obs["contig_QC_pass"], adata2.obs["filter_contig"])
[11]:
filter_contig | False |
---|---|
contig_QC_pass | |
False | 121 |
No_contig | 20756 |
True | 2043 |
ddl.pp.check_contigs
From v0.2.4
onwards, there’s a new function that performs similarly to ddl.pp.filter_contigs
, but relaxed settings so as not to forcefully remove contigs. The function is also simplified with reduced arguments. The main output of this function is an additional ambiguous
column in vdj.data
, which flags T
or F
for contigs that were marked as poor quality. The numbers of ambiguous
contigs would not tally with the number of contigs removed from ddl.pp.filter_contigs
because while ddl.pp.check_contigs
only assess ambiguity at the contig level (i.e. whether a contig can be considered good/bad on its own), ddl.pp.filter_contigs
imposes additional ‘strict’ assumptions (e.g. a cell should only contain 1 productive pair of VDJ pairs).
[12]:
# Usage is similar
vdj, adata = ddl.pp.check_contigs(bcr, adata, library_type="ig")
Preparing data: 0it [00:00, ?it/s]Preparing data: 6505it [00:00, 7073.10it/s]
Scanning for poor quality/ambiguous contigs: 100%|██████████| 3158/3158 [00:06<00:00, 504.69it/s]
Check the Dandelion object
[13]:
vdj
[13]:
Dandelion class object with n_obs = 2229 and n_contigs = 7357
data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'duplicate_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'ambiguous', 'rearrangement_status'
metadata: 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'duplicate_count_B_VDJ', 'duplicate_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'
Check the AnnData object as well
[14]:
adata
[14]:
AnnData object with n_obs × n_vars = 22985 × 31915
obs: 'sampleid', 'batch', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'gmm_pct_count_clusters_keep', 'scrublet_score', 'is_doublet', 'filter_rna', 'has_contig', 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'duplicate_count_B_VDJ', 'duplicate_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'
var: 'feature_types', 'genome', 'gene_ids-0', 'gene_ids-1', 'gene_ids-2', 'gene_ids-3'
The .obs
slot in the AnnData
object (and also .metadata
slot in the Dandelion
object) have different columns from the output of ddl.pp.filter_contigs
:
I will highlight the ones that are relevant at this stage:
Relevant columns in obs
has_contig
whether cells have V(D)J chains.
locus_status
detailed information on chain status pairings (below).
chain_status
summarised information of the chain locus status pairings (similar to
chain_pairing
inscirpy
).rearrangement_status_VDJ
andrearrangement_status_VJ
whether or not V(D)J gene usage are standard (i.e. all from the same locus).
So in a standard situation, I would remove cells flagged with Orphan VJ
, Orphan VJ-exception
, Extra pair
, ambiguous
in .metadata.chain_status
, and also any cell marked as chimeric
in the .metadata.rearrangement_status_VDJ
and .metadata.rearrangement_status_VJ
from downstream cell-level calculations/analysis.
Having said that, you will find that most of Dandelion
’s functions will work without the need to requirement to perform additional filtering and filtering can be performed on the final AnnData
object (described in the visualisation section).
Let’s take a look at these new columns
[15]:
pd.crosstab(adata.obs["chain_status"], adata.obs["locus_status"])
[15]:
locus_status | Extra VDJ + Extra VJ | Extra VDJ + IGK | Extra VDJ + IGL | IGH + Extra VJ | IGH + IGK | IGH + IGL | IgM/IgD + Extra VJ | IgM/IgD + IGK | IgM/IgD + IGL | No_contig | Orphan Extra VJ | Orphan IGH | Orphan IGK | Orphan IGL |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
chain_status | ||||||||||||||
Extra pair | 23 | 3 | 3 | 87 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Extra pair-exception | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 2 | 1 | 0 | 0 | 0 | 0 | 0 |
No_contig | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20756 | 0 | 0 | 0 | 0 |
Orphan VDJ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8 | 0 | 0 |
Orphan VJ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 38 | 0 | 83 | 37 |
Single pair | 0 | 0 | 0 | 0 | 1118 | 820 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
if there are multiple library types, i.e. ddl.pp.filter_contigs
or ddl.pp.check_contigs
was run with library_type = None
, or if several tcr/bcr Dandelion
objects are concatenated, there will be additional columns where the v/d/j/c calls
and productive
will be split into additional columns to reflect those that belong to a B cell, alpha-beta T cell, or gamma-delta T cell.
We will use this contig_checked
object going forward.
Now actually filter the AnnData object and run through a standard workflow starting by filtering genes and normalizing the data
Because the ‘filtered’ AnnData
object was returned as a filtered but otherwise unprocessed object, we still need to normalize and run through the usual process here. The following is just a standard scanpy workflow.
[16]:
# filter genes
sc.pp.filter_genes(adata, min_cells=3)
# Normalize the counts
sc.pp.normalize_total(adata, target_sum=1e4)
# Logarithmize the data
sc.pp.log1p(adata)
# Stash the normalised counts
adata.raw = adata
Identify highly-variable genes
[17]:
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
sc.pl.highly_variable_genes(adata)
Filter the genes to only those marked as highly-variable
[18]:
adata = adata[:, adata.var.highly_variable]
Regress out effects of total counts per cell and the percentage of mitochondrial genes expressed. Scale the data to unit variance.
[19]:
sc.pp.regress_out(adata, ["total_counts", "pct_counts_mt"])
sc.pp.scale(adata, max_value=10)
Run PCA
[20]:
sc.tl.pca(adata, svd_solver="arpack")
sc.pl.pca_variance_ratio(adata, log=True, n_pcs=50)
Computing the neighborhood graph, umap and clusters
[21]:
# Computing the neighborhood graph
sc.pp.neighbors(adata)
# Embedding the neighborhood graph
sc.tl.umap(adata)
# Clustering the neighborhood graph
sc.tl.leiden(adata)
Visualizing the clusters and whether or not there’s a corresponding V(D)J receptor
[22]:
sc.pl.umap(adata, color=["leiden", "chain_status"])
Visualizing some B cell genes
[23]:
sc.pl.umap(adata, color=["IGHM", "JCHAIN"])
Save AnnData
We can save this AnnData
object for now.
[24]:
adata.write("adata.h5ad", compression="gzip")
Save dandelion
To save the vdj object, we have two options - either save the .data
and .metadata
slots with pandas’ functions:
[25]:
vdj.data.to_csv("filtered_vdj_table.tsv", sep="\t")
Or save the whole Dandelion class object with either .write_h5ddl/.write_h5/.write
, which saves the class to a HDF5 format, or using a pickle-based .write_pkl
function.
[26]:
vdj.write_h5ddl("dandelion_results.h5ddl", compression="blosc:lz4")
[27]:
vdj.write_pkl(
"dandelion_results.pkl.pbz2"
) # this will automatically use bzip2 for compression, swith the extension to .gz for gzip
Running ddl.pp.filter_contigs
and ddl.pp.check_contigs
without AnnData
Finally, ddl.pp.filter_contigs
can also be run without an AnnData
object:
[28]:
vdj3 = ddl.pp.filter_contigs(bcr)
vdj3
Preparing data: 0it [00:00, ?it/s]Preparing data: 6505it [00:00, 8225.96it/s]
Scanning for poor quality/ambiguous contigs: 100%|██████████| 3158/3158 [00:06<00:00, 469.54it/s]
[28]:
Dandelion class object with n_obs = 2812 and n_contigs = 5754
data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'duplicate_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'rearrangement_status'
metadata: 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'duplicate_count_B_VDJ', 'duplicate_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'
[29]:
vdj4 = ddl.pp.check_contigs(bcr)
vdj4
Preparing data: 0it [00:00, ?it/s]Preparing data: 6505it [00:00, 6757.71it/s]
Scanning for poor quality/ambiguous contigs: 100%|██████████| 3158/3158 [00:06<00:00, 505.20it/s]
[29]:
Dandelion class object with n_obs = 3156 and n_contigs = 7357
data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'duplicate_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'ambiguous', 'rearrangement_status'
metadata: 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'duplicate_count_B_VDJ', 'duplicate_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'
[ ]: