Analyzing TCR data

With dandelion>=1.3 onwards, there will be the ability to start analyzing 10x single-cell TCR data with the existing setup for both alpha-beta and gamma-delta TCR data formats. Currently, the alpha-beta and gamma-delta data sets have to be analyzed separately.

We will download the various input formats of TCR files from 10x’s resource page as part of this tutorial:

# bash
mkdir -p dandelion_tutorial/sc5p_v2_hs_PBMC_10k;
mkdir -p dandelion_tutorial/sc5p_v1p1_hs_melanoma_10k;

cd dandelion_tutorial/sc5p_v2_hs_PBMC_10k;
wget https://cf.10xgenomics.com/samples/cell-vdj/4.0.0/sc5p_v2_hs_PBMC_10k/sc5p_v2_hs_PBMC_10k_filtered_feature_bc_matrix.h5;
wget https://cf.10xgenomics.com/samples/cell-vdj/4.0.0/sc5p_v2_hs_PBMC_10k/sc5p_v2_hs_PBMC_10k_t_airr_rearrangement.tsv;
wget https://cf.10xgenomics.com/samples/cell-vdj/4.0.0/sc5p_v2_hs_PBMC_10k/sc5p_v2_hs_PBMC_10k_t_filtered_contig_annotations.csv;
wget https://cf.10xgenomics.com/samples/cell-vdj/4.0.0/sc5p_v2_hs_PBMC_10k/sc5p_v2_hs_PBMC_10k_t_filtered_contig.fasta;

cd ../sc5p_v1p1_hs_melanoma_10k;
wget https://cf.10xgenomics.com/samples/cell-vdj/4.0.0/sc5p_v1p1_hs_melanoma_10k/sc5p_v1p1_hs_melanoma_10k_filtered_feature_bc_matrix.h5;
wget https://cf.10xgenomics.com/samples/cell-vdj/4.0.0/sc5p_v1p1_hs_melanoma_10k/sc5p_v1p1_hs_melanoma_10k_t_airr_rearrangement.tsv;
wget https://cf.10xgenomics.com/samples/cell-vdj/4.0.0/sc5p_v1p1_hs_melanoma_10k/sc5p_v1p1_hs_melanoma_10k_t_filtered_contig_annotations.csv;
wget https://cf.10xgenomics.com/samples/cell-vdj/4.0.0/sc5p_v1p1_hs_melanoma_10k/sc5p_v1p1_hs_melanoma_10k_t_filtered_contig.fasta;

Import dandelion module

[1]:
import os
import dandelion as ddl

# change directory to somewhere more workable
os.chdir(os.path.expanduser("~/Downloads/dandelion_tutorial/"))
ddl.logging.print_versions()
dandelion==0.3.4.dev29 pandas==2.0.1 numpy==1.24.3 matplotlib==3.7.1 networkx==3.1 scipy==1.11.2

I’m showing two examples for reading in the data: with or without reannotation.

Read in AIRR format

[2]:
# read in the airr_rearrangement.tsv file
file1 = "sc5p_v2_hs_PBMC_10k/sc5p_v2_hs_PBMC_10k_t_airr_rearrangement.tsv"
file2 = "sc5p_v1p1_hs_melanoma_10k/sc5p_v1p1_hs_melanoma_10k_t_airr_rearrangement.tsv"
[3]:
vdj1 = ddl.read_10x_airr(file1)
vdj1
[3]:
Dandelion class object with n_obs = 5351 and n_contigs = 10860
    data: 'cell_id', 'sequence_id', 'sequence', 'sequence_aa', 'productive', 'rev_comp', 'v_call', 'v_cigar', 'd_call', 'd_cigar', 'j_call', 'j_cigar', 'c_call', 'c_cigar', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'junction_length', 'junction_aa_length', 'v_sequence_start', 'v_sequence_end', 'd_sequence_start', 'd_sequence_end', 'j_sequence_start', 'j_sequence_end', 'c_sequence_start', 'c_sequence_end', 'consensus_count', 'duplicate_count', 'is_cell', 'locus', 'rearrangement_status'
    metadata: 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_abT_VDJ', 'd_call_abT_VDJ', 'j_call_abT_VDJ', 'v_call_abT_VJ', 'j_call_abT_VJ', 'c_call_abT_VDJ', 'c_call_abT_VJ', 'productive_abT_VDJ', 'productive_abT_VJ', 'duplicate_count_abT_VDJ', 'duplicate_count_abT_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_abT_VDJ_main', 'd_call_abT_VDJ_main', 'j_call_abT_VDJ_main', 'v_call_abT_VJ_main', 'j_call_abT_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'
[4]:
vdj2 = ddl.read_10x_airr(file2)
vdj2
[4]:
Dandelion class object with n_obs = 1560 and n_contigs = 2755
    data: 'cell_id', 'sequence_id', 'sequence', 'sequence_aa', 'productive', 'rev_comp', 'v_call', 'v_cigar', 'd_call', 'd_cigar', 'j_call', 'j_cigar', 'c_call', 'c_cigar', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'junction_length', 'junction_aa_length', 'v_sequence_start', 'v_sequence_end', 'd_sequence_start', 'd_sequence_end', 'j_sequence_start', 'j_sequence_end', 'c_sequence_start', 'c_sequence_end', 'consensus_count', 'duplicate_count', 'is_cell', 'locus', 'rearrangement_status'
    metadata: 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_abT_VDJ', 'd_call_abT_VDJ', 'j_call_abT_VDJ', 'v_call_abT_VJ', 'j_call_abT_VJ', 'c_call_abT_VDJ', 'c_call_abT_VJ', 'productive_abT_VDJ', 'productive_abT_VJ', 'duplicate_count_abT_VDJ', 'duplicate_count_abT_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_abT_VDJ_main', 'd_call_abT_VDJ_main', 'j_call_abT_VDJ_main', 'v_call_abT_VJ_main', 'j_call_abT_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'
[5]:
# combine into a singular object
# let's add the sample_id to each cell barcode so that we don't end up overlapping later on
sample_id = "sc5p_v2_hs_PBMC_10k"
vdj1.data["sample_id"] = sample_id
vdj1.data["cell_id"] = [sample_id + "_" + c for c in vdj1.data["cell_id"]]
vdj1.data["sequence_id"] = [
    sample_id + "_" + s for s in vdj1.data["sequence_id"]
]

sample_id = "sc5p_v1p1_hs_melanoma_10k"
vdj2.data["sample_id"] = sample_id
vdj2.data["cell_id"] = [sample_id + "_" + c for c in vdj2.data["cell_id"]]
vdj2.data["sequence_id"] = [
    sample_id + "_" + s for s in vdj2.data["sequence_id"]
]

# combine into a singular object
vdj = ddl.concat([vdj1, vdj2])
vdj
[5]:
Dandelion class object with n_obs = 6911 and n_contigs = 13615
    data: 'cell_id', 'sequence_id', 'sequence', 'sequence_aa', 'productive', 'rev_comp', 'v_call', 'v_cigar', 'd_call', 'd_cigar', 'j_call', 'j_cigar', 'c_call', 'c_cigar', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'junction_length', 'junction_aa_length', 'v_sequence_start', 'v_sequence_end', 'd_sequence_start', 'd_sequence_end', 'j_sequence_start', 'j_sequence_end', 'c_sequence_start', 'c_sequence_end', 'consensus_count', 'duplicate_count', 'is_cell', 'locus', 'rearrangement_status', 'sample_id'
    metadata: 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_abT_VDJ', 'd_call_abT_VDJ', 'j_call_abT_VDJ', 'v_call_abT_VJ', 'j_call_abT_VJ', 'c_call_abT_VDJ', 'c_call_abT_VJ', 'productive_abT_VDJ', 'productive_abT_VJ', 'duplicate_count_abT_VDJ', 'duplicate_count_abT_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_abT_VDJ_main', 'd_call_abT_VDJ_main', 'j_call_abT_VDJ_main', 'v_call_abT_VJ_main', 'j_call_abT_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'

Read in with reannotation

We specify the filename_prefix option because they have different prefixes that precedes _contig.fasta and _contig_annotations.csv.

[6]:
samples = ["sc5p_v2_hs_PBMC_10k", "sc5p_v1p1_hs_melanoma_10k"]
filename_prefixes = [
    "sc5p_v2_hs_PBMC_10k_t_filtered",
    "sc5p_v1p1_hs_melanoma_10k_t_filtered",
]
ddl.pp.format_fastas(samples, prefix=samples, filename_prefix=filename_prefixes)
Formatting fasta(s) : 100%|██████████| 2/2 [00:01<00:00,  1.71it/s]

Make sure to toggle loci = 'tr' for TCR data. I’m setting reassign_dj = True so as to try and force a reassignment of J genes (and D genes if it can) with stricter cut offs.

[7]:
ddl.pp.reannotate_genes(
    samples, loci="tr", reassign_dj=True, filename_prefix=filename_prefixes
)
Assigning genes :   0%|          | 0/2 [00:00<?, ?it/s]
         START> MakeDB
       COMMAND> igblast
  ALIGNER_FILE> sc5p_v2_hs_PBMC_10k_t_filtered_contig_igblast.fmt7
      SEQ_FILE> sc5p_v2_hs_PBMC_10k_t_filtered_contig.fasta
       ASIS_ID> False
    ASIS_CALLS> False
      VALIDATE> strict
      EXTENDED> True
INFER_JUNCTION> False

PROGRESS> 00:11:06 |Done                | 0.0 min

PROGRESS> 00:11:10 |####################| 100% (13,630) 0.1 min

OUTPUT> sc5p_v2_hs_PBMC_10k_t_filtered_contig_igblast_db-pass.tsv
  PASS> 12246
  FAIL> 1384
   END> MakeDb

         START> MakeDB
       COMMAND> igblast
  ALIGNER_FILE> sc5p_v2_hs_PBMC_10k_t_filtered_contig_igblast.fmt7
      SEQ_FILE> sc5p_v2_hs_PBMC_10k_t_filtered_contig.fasta
       ASIS_ID> False
    ASIS_CALLS> False
      VALIDATE> strict
      EXTENDED> True
INFER_JUNCTION> False

PROGRESS> 00:11:11 |Done                | 0.0 min

PROGRESS> 00:11:15 |####################| 100% (13,630) 0.1 min

OUTPUT> sc5p_v2_hs_PBMC_10k_t_filtered_contig_igblast_db-pass.tsv
  PASS> 12246
  FAIL> 1384
   END> MakeDb

Assigning genes :  50%|█████     | 1/2 [02:36<02:36, 156.10s/it]
         START> MakeDB
       COMMAND> igblast
  ALIGNER_FILE> sc5p_v1p1_hs_melanoma_10k_t_filtered_contig_igblast.fmt7
      SEQ_FILE> sc5p_v1p1_hs_melanoma_10k_t_filtered_contig.fasta
       ASIS_ID> False
    ASIS_CALLS> False
      VALIDATE> strict
      EXTENDED> True
INFER_JUNCTION> False

PROGRESS> 00:12:08 |Done                | 0.0 min

PROGRESS> 00:12:10 |####################| 100% (3,706) 0.0 min

OUTPUT> sc5p_v1p1_hs_melanoma_10k_t_filtered_contig_igblast_db-pass.tsv
  PASS> 3217
  FAIL> 489
   END> MakeDb

         START> MakeDB
       COMMAND> igblast
  ALIGNER_FILE> sc5p_v1p1_hs_melanoma_10k_t_filtered_contig_igblast.fmt7
      SEQ_FILE> sc5p_v1p1_hs_melanoma_10k_t_filtered_contig.fasta
       ASIS_ID> False
    ASIS_CALLS> False
      VALIDATE> strict
      EXTENDED> True
INFER_JUNCTION> False

PROGRESS> 00:12:10 |Done                | 0.0 min

PROGRESS> 00:12:11 |####################| 100% (3,706) 0.0 min

OUTPUT> sc5p_v1p1_hs_melanoma_10k_t_filtered_contig_igblast_db-pass.tsv
  PASS> 3217
  FAIL> 489
   END> MakeDb

Assigning genes : 100%|██████████| 2/2 [03:22<00:00, 101.49s/it]

There’s no need to run the the rest of the preprocessing steps.

We’ll read in the reannotated files like as follow:

[8]:
import pandas as pd

tcr_files = []
for sample in samples:
    file_location = (
        sample + "/dandelion/" + sample + "_t_filtered_contig_dandelion.tsv"
    )
    tcr_files.append(pd.read_csv(file_location, sep="\t"))
tcr = pd.concat(tcr_files, ignore_index=True)
tcr.reset_index(inplace=True, drop=True)
tcr
[8]:
sequence_id sequence rev_comp productive v_call d_call j_call sequence_alignment germline_alignment junction ... cdr3_aa sequence_alignment_aa v_sequence_alignment_aa d_sequence_alignment_aa j_sequence_alignment_aa j_call_multimappers j_call_multiplicity j_call_sequence_start_multimappers j_call_sequence_end_multimappers j_call_support_multimappers
0 sc5p_v2_hs_PBMC_10k_AAACCTGAGCGATAGC_contig_1 CTGGAAGACCACCTGGGCTGTCATTGAGCTCTGGTGCCAGGAGGAA... F T TRAV23/DV6*02 NaN TRAJ22*01 CAGCAGCAGGTGAAACAAAGTCCTCAATCTTTGATAGTCCAGAAAG... CAGCAGCAGGTGAAACAAAGTCCTCAATCTTTGATAGTCCAGAAAG... TGTGCAGCAAGCAAGGGTTCTGCAAGGCAACTGACCTTT ... AASKGSARQLT QQQVKQSPQSLIVQKGGIPIINCAYENTAFDYFPWYQQFPGKGPAL... QQQVKQSPQSLIVQKGGIPIINCAYENTAFDYFPWYQQFPGKGPAL... NaN GSARQLTFGSGTQLTVLP TRAJ22*01 1.0 412 466 2.61e-25
1 sc5p_v2_hs_PBMC_10k_AAACCTGAGCGATAGC_contig_2 GAGAGTCCTGCTCCCCTTTCATCAATGCACAGATACAGAAGACCCC... F T TRBV6-5*01 NaN TRBJ2-3*01 AATGCTGGTGTCACTCAGACCCCAAAATTCCAGGTCCTGAAGACAG... AATGCTGGTGTCACTCAGACCCCAAAATTCCAGGTCCTGAAGACAG... TGTGCCAGCAGTTACCGGGGGGGATCGGAAGATACGCAGTATTTT ... ASSYRGGSEDTQY NAGVTQTPKFQVLKTGQSMTLQCAQDMNHEYMSWYRQDPGMGLRLI... NAGVTQTPKFQVLKTGQSMTLQCAQDMNHEYMSWYRQDPGMGLRLI... NaN DTQYFGPGTRLTVL TRBJ2-3*01 1.0 423 466 3.47e-19
2 sc5p_v2_hs_PBMC_10k_AAACCTGAGTCACGCC_contig_2 CCTTTTCACCAATGCACAGACCCAGAGGACCCCTCCATCCTGCAGT... F T TRBV6-2*01,TRBV6-3*01 NaN TRBJ2-6*01 AATGCTGGTGTCACTCAGACCCCAAAATTCCGGGTCCTGAAGACAG... AATGCTGGTGTCACTCAGACCCCAAAATTCCGGGTCCTGAAGACAG... TGTGCCAGCAGTTATCTCCCCCGTAGACAGGACAGGGAATCCTCTG... ... ASSYLPRRQDRESSGANVLT NAGVTQTPKFRVLKTGQSMTLLCAQDMNHEYMYWYRQDPGMGLRLI... NAGVTQTPKFRVLKTGQSMTLLCAQDMNHEYMYWYRQDPGMGLRLI... NaN SGANVLTFGAGSRLTVL TRBJ2-6*01 1.0 422 474 3.5e-24
3 sc5p_v2_hs_PBMC_10k_AAACCTGAGTCACGCC_contig_1 GTAGCTCGTTGATATCTGTGTGGATAGGGAGCTGTGACGAGGGCAA... F T TRAV8-6*02 NaN TRAJ8*01 GCCCAGTCTGTGACCCAGCTTGACAGCCAAGTCCCTGTCTTTGAAG... GCCCAGTCTGTGACCCAGCTTGACAGCCAAGTCCCTGTCTTTGAAG... TGTGCTGTGAGTGCGTTTTTTCAGAAACTTGTATTT ... AVSAFFQKLV AQSVTQLDSQVPVFEEAPVELRCNYSSSVSVYLFWYVQYPNQGLQL... AQSVTQLDSQVPVFEEAPVELRCNYSSSVSVYLFWYVQYPNQGLQL... NaN FQKLVFGTGTRLLVSP TRAJ8*01 1.0 566 614 7.35e-22
4 sc5p_v2_hs_PBMC_10k_AAACCTGCACGTCAGC_contig_1 CCCACATGAAGTGTCTACCTTCTGCAGACTCCAATGGCTCAGGAAC... F T TRAV1-2*01 NaN TRAJ33*01 GGACAAAACATTGACCAG...CCCACTGAGATGACAGCTACGGAAG... GGACAAAACATTGACCAG...CCCACTGAGATGACAGCTACGGAAG... TGTGCTGTCATGGATAGCAACTATCAGTTAATCTGG ... AVMDSNYQLI GQNIDQPTEMTATEGAIVQINCTYQTSGFNGLFWYQQHAGEAPTFL... GQNIDQPTEMTATEGAIVQINCTYQTSGFNGLFWYQQHAGEAPTFL... NaN DSNYQLIWGAGTKLIIKP TRAJ33*01 1.0 407 463 2.01e-26
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
15458 sc5p_v1p1_hs_melanoma_10k_TTTGGTTTCACAGGCC_con... AATGGCTCAGGAACTGGGAATGCAGTGCCAGGCTCGTGGTATCCTG... F T TRAV1-2*01 NaN TRAJ20*01 GGACAAAACATTGACCAG...CCCACTGAGATGACAGCTACGGAAG... GGACAAAACATTGACCAG...CCCACTGAGATGACAGCTACGGAAG... TGTGCTGTGATGGGGGACTACAAGCTCAGCTTT ... AVMGDYKLS GQNIDQPTEMTATEGAIVQINCTYQTSGFNGLFWYQQHAGEAPTFL... GQNIDQPTEMTATEGAIVQINCTYQTSGFNGLFWYQQHAGEAPTFL... NaN DYKLSFGAGTTVTVRA TRAJ20*01 1.0 380 428 5.22e-22
15459 sc5p_v1p1_hs_melanoma_10k_TTTGGTTTCTTTACGT_con... TTCCTCTGCTCTGGCAGCAGATCTCCCAGAGGGAGCAGCCTGACCA... F T TRBV30*01 NaN TRBJ2-5*01 TCTCAGACTATTCATCAATGGCCAGCGACCCTGGTGCAGCCTGTGG... TCTCAGACTATTCATCAATGGCCAGCGACCCTGGTGCAGCCTGTGG... TGTGCCTGGAGTGAGCTAGCGGCCCAAGAGACCCAGTACTTC ... AWSELAAQETQY SQTIHQWPATLVQPVGSPLSLECTVEGTSNPNLYWYRQAAGRGLQL... SQTIHQWPATLVQPVGSPLSLECTVEGTSNPNLYWYRQAAGRGLQL... NaN QETQYFGPGTRLLVL TRBJ2-5*01 1.0 457 503 8.02e-21
15460 sc5p_v1p1_hs_melanoma_10k_TTTGGTTTCTTTACGT_con... GATCTTAATTGGGAAGAACAAGGATGACATCCATTCGAGCTGTATT... F F TRAV13-1*02 NaN TRAJ43*01 GGAGAGAATGTGGAGCAGCATCCTTCAACCCTGAGTGTCCAGGAGG... GGAGAGAATGTGGAGCAGCATCCTTCAACCCTGAGTGTCCAGGAGG... TGTGCAGCAAGTACAACCCGAAGGTTAGGCGGGGTGGGGTCAAAAA... ... AASTTRRLGGVGSKK*HA GENVEQHPSTLSVQEGDSAVIKCTYSDSASNYFPWYKQELGKRPQL... GENVEQHPSTLSVQEGDSAVIKCTYSDSASNYFPWYKQELGKRPQL... NaN *HALWSRDQTDSKT TRAJ43*01 1.0 388 439 2.49e-20
15461 sc5p_v1p1_hs_melanoma_10k_TTTGTCACATTTCACT_con... TCTGGGGATGTTCACAGAGGGCCTGGTCTGGAATATTCCACATCTG... F T TRBV12-4*01 NaN TRBJ1-1*01 GATGCTGGAGTTATCCAGTCACCCCGGCACGAGGTGACAGAGATGG... GATGCTGGAGTTATCCAGTCACCCCGGCACGAGGTGACAGAGATGG... TGTGCCAGCAGTTTAGGATGGGGAGACGGCACTGAAGCTTTCTTT ... ASSLGWGDGTEAF DAGVIQSPRHEVTEMGQEVTLRCKPISGHDYLFWYRQTMMRGLELL... DAGVIQSPRHEVTEMGQEVTLRCKPISGHDYLFWYRQTMMRGLELL... NaN TEAFFGQGTRLTVV TRBJ1-1*01 1.0 419 462 3.45e-19
15462 sc5p_v1p1_hs_melanoma_10k_TTTGTCACATTTCACT_con... AGATCAGAAGAGGAGGCTTCTCACCCTGCAGCAGGGACCTGTGAGC... F T TRAV38-2/DV8*01 NaN TRAJ32*01,TRAJ32*02 GCTCAGACAGTCACTCAGTCTCAACCAGAGATGTCTGTGCAGGAGG... GCTCAGACAGTCACTCAGTCTCAACCAGAGATGTCTGTGCAGGAGG... TGTGCTTATAGGAGCACCCAGATCCCCCAGCTCATCTTT ... AYRSTQIPQLI AQTVTQSQPEMSVQEAETVTLSCTYDTSESDYYLFWYKQPPSRQMI... AQTVTQSQPEMSVQEAETVTLSCTYDTSESDYYLFWYKQPPSRQMI... NaN LIFGTGTLLAVQP TRAJ32*02 1.0 408 449 4.25e-18

15463 rows × 108 columns

The reannotated file can be used with dandelion as per the BCR tutorial.

For the rest of the tutorial, I’m going to show how to proceed with 10x’s AIRR format file instead as there are some minor differences.

Import modules for use with scanpy

[9]:
import pandas as pd
import numpy as np
import scanpy as sc
import warnings
import functools
import seaborn as sns
import scipy.stats
import anndata as ad

warnings.filterwarnings("ignore")
sc.logging.print_header()
scanpy==1.9.3 anndata==0.9.1 umap==0.5.3 numpy==1.24.3 scipy==1.11.2 pandas==2.0.1 scikit-learn==1.3.0 statsmodels==0.14.0 python-igraph==0.10.6 pynndescent==0.5.10

Import the transcriptome data

[10]:
gex_files = {
    "sc5p_v2_hs_PBMC_10k": "sc5p_v2_hs_PBMC_10k/sc5p_v2_hs_PBMC_10k_filtered_feature_bc_matrix.h5",
    "sc5p_v1p1_hs_melanoma_10k": "sc5p_v1p1_hs_melanoma_10k/sc5p_v1p1_hs_melanoma_10k_filtered_feature_bc_matrix.h5",
}
[11]:
adata_list = []
for f in gex_files:
    adata_tmp = sc.read_10x_h5(gex_files[f], gex_only=True)
    adata_tmp.obs["sample_id"] = f
    adata_tmp.obs_names = [f + "_" + x for x in adata_tmp.obs_names]
    adata_tmp.var_names_make_unique()
    adata_list.append(adata_tmp)
adata = ad.concat(adata_list)
adata
[11]:
AnnData object with n_obs × n_vars = 17275 × 36601
    obs: 'sample_id'

Run QC on the transcriptome data.

[12]:
ddl.pp.recipe_scanpy_qc(adata)
adata
[12]:
AnnData object with n_obs × n_vars = 17275 × 36601
    obs: 'sample_id', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'scrublet_score', 'is_doublet', 'filter_rna'

Filtering TCR data.

Note that I’m using the Dandelion object as input rather than the pandas dataframe (yes both types of input will works. In fact, a file path to the .tsv will work too).

[13]:
# The function will return both objects.
vdj, adata = ddl.pp.filter_contigs(
    vdj, adata
)  # please look at the other tutorials for using `ddl.pp.check_contigs` as well
Preparing data: 13615it [00:01, 9454.49it/s]
Scanning for poor quality/ambiguous contigs: 100%|██████████| 6911/6911 [00:14<00:00, 473.35it/s]

Check the output V(D)J table

[14]:
vdj
[14]:
Dandelion class object with n_obs = 5333 and n_contigs = 10352
    data: 'cell_id', 'sequence_id', 'sequence', 'sequence_aa', 'productive', 'rev_comp', 'v_call', 'v_cigar', 'd_call', 'd_cigar', 'j_call', 'j_cigar', 'c_call', 'c_cigar', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'junction_length', 'junction_aa_length', 'v_sequence_start', 'v_sequence_end', 'd_sequence_start', 'd_sequence_end', 'j_sequence_start', 'j_sequence_end', 'c_sequence_start', 'c_sequence_end', 'consensus_count', 'duplicate_count', 'is_cell', 'locus', 'rearrangement_status', 'sample_id'
    metadata: 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_abT_VDJ', 'd_call_abT_VDJ', 'j_call_abT_VDJ', 'v_call_abT_VJ', 'j_call_abT_VJ', 'c_call_abT_VDJ', 'c_call_abT_VJ', 'productive_abT_VDJ', 'productive_abT_VJ', 'duplicate_count_abT_VDJ', 'duplicate_count_abT_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_abT_VDJ_main', 'd_call_abT_VDJ_main', 'j_call_abT_VDJ_main', 'v_call_abT_VJ_main', 'j_call_abT_VJ_main', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'

Check the AnnData object as well

[15]:
adata
[15]:
AnnData object with n_obs × n_vars = 17275 × 36601
    obs: 'sample_id', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'scrublet_score', 'is_doublet', 'filter_rna', 'has_contig', 'filter_contig_quality', 'filter_contig_VDJ', 'filter_contig_VJ', 'contig_QC_pass', 'filter_contig', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_abT_VDJ', 'd_call_abT_VDJ', 'j_call_abT_VDJ', 'v_call_abT_VJ', 'j_call_abT_VJ', 'c_call_abT_VDJ', 'c_call_abT_VJ', 'productive_abT_VDJ', 'productive_abT_VJ', 'duplicate_count_abT_VDJ', 'duplicate_count_abT_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_abT_VDJ_main', 'd_call_abT_VDJ_main', 'j_call_abT_VDJ_main', 'v_call_abT_VJ_main', 'j_call_abT_VJ_main', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'

The number of cells that actually has a matching BCR can be tabluated.

[16]:
pd.crosstab(adata.obs["has_contig"], adata.obs["filter_contig"])
[16]:
filter_contig False True
has_contig
No_contig 6995 3432
True 5438 1410
[17]:
pd.crosstab(adata.obs["has_contig"], adata.obs["contig_QC_pass"])
[17]:
contig_QC_pass False No_contig True
has_contig
No_contig 0 10427 0
True 1515 0 5333
[18]:
pd.crosstab(adata.obs["contig_QC_pass"], adata.obs["filter_contig"])
[18]:
filter_contig False True
contig_QC_pass
False 105 1410
No_contig 6995 3432
True 5333 0

Now actually filter the AnnData object and run through a standard workflow starting by filtering genes and normalizing the data

Because the ‘filtered’ AnnData object was returned as a filtered but otherwise unprocessed object, we still need to normalize and run through the usual process here. The following is just a standard scanpy workflow.

[19]:
# filter genes
sc.pp.filter_genes(adata, min_cells=3)
# Normalize the counts
sc.pp.normalize_total(adata, target_sum=1e4)
# Logarithmize the data
sc.pp.log1p(adata)
# Stash the normalised counts
adata.raw = adata

Identify highly-variable genes

[20]:
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
sc.pl.highly_variable_genes(adata)
../_images/notebooks_7_dandelion_TCR_data_10x_data_35_0.png

Filter the genes to only those marked as highly-variable

[21]:
adata = adata[:, adata.var.highly_variable]

Regress out effects of total counts per cell and the percentage of mitochondrial genes expressed. Scale the data to unit variance.

[22]:
sc.pp.regress_out(adata, ["total_counts", "pct_counts_mt"])
sc.pp.scale(adata, max_value=10)

Run PCA

[23]:
sc.tl.pca(adata, svd_solver="arpack")
sc.pl.pca_variance_ratio(adata, log=True, n_pcs=50)
../_images/notebooks_7_dandelion_TCR_data_10x_data_41_0.png

Computing the neighborhood graph, umap and clusters

[24]:
# Computing the neighborhood graph
sc.pp.neighbors(adata)
# Embedding the neighborhood graph
sc.tl.umap(adata)
# Clustering the neighborhood graph
sc.tl.leiden(adata)

Visualizing the clusters and whether or not there’s a corresponding contig.

[25]:
sc.pl.umap(adata, color=["leiden", "contig_QC_pass"])
../_images/notebooks_7_dandelion_TCR_data_10x_data_45_0.png

Visualizing some T cell genes.

[26]:
sc.pl.umap(adata, color=["CD3E", "CD8B"])
../_images/notebooks_7_dandelion_TCR_data_10x_data_47_0.png

Find clones.

Note

Here we specify identity = 1 so only cells with identical CDR3 nucleotide sequences (key = 'junction') are grouped into clones/clonotypes.

[27]:
ddl.tl.find_clones(vdj, identity=1, key="junction")
vdj
Finding clones based on abT cell VDJ chains : 100%|██████████| 501/501 [00:00<00:00, 3944.38it/s]
Finding clones based on abT cell VJ chains : 100%|██████████| 1548/1548 [00:00<00:00, 15106.42it/s]
Refining clone assignment based on VJ chain pairing : 100%|██████████| 5333/5333 [00:00<00:00, 632084.98it/s]
[27]:
Dandelion class object with n_obs = 5333 and n_contigs = 10352
    data: 'cell_id', 'sequence_id', 'sequence', 'sequence_aa', 'productive', 'rev_comp', 'v_call', 'v_cigar', 'd_call', 'd_cigar', 'j_call', 'j_cigar', 'c_call', 'c_cigar', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'junction_length', 'junction_aa_length', 'v_sequence_start', 'v_sequence_end', 'd_sequence_start', 'd_sequence_end', 'j_sequence_start', 'j_sequence_end', 'c_sequence_start', 'c_sequence_end', 'consensus_count', 'duplicate_count', 'is_cell', 'locus', 'rearrangement_status', 'sample_id', 'clone_id'
    metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_abT_VDJ', 'd_call_abT_VDJ', 'j_call_abT_VDJ', 'v_call_abT_VJ', 'j_call_abT_VJ', 'c_call_abT_VDJ', 'c_call_abT_VJ', 'productive_abT_VDJ', 'productive_abT_VJ', 'duplicate_count_abT_VDJ', 'duplicate_count_abT_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_abT_VDJ_main', 'd_call_abT_VDJ_main', 'j_call_abT_VDJ_main', 'v_call_abT_VJ_main', 'j_call_abT_VJ_main', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'

Generate TCR network.

The 10x-provided AIRR file is missing columns like sequence_alignment and sequence_alignment_aa so we will use the next best thing, which is sequence or sequence_aa. Note that these columns are not-gapped.

Specify key = 'sequence_aa' to toggle this behavior. Can also try junction or junction_aa if just want to visualise the CDR3 linkage.

[28]:
# again, i'm removing the Orphan VJ cells (lacking TRB chain i.e. VDJ information).
vdj = vdj[
    vdj.metadata.chain_status.isin(
        ["Single pair", "Extra pair", "Extra pair-exception", "Orphan VDJ"]
    )
].copy()
[29]:
ddl.tl.generate_network(vdj, key="sequence_aa")
Setting up data: 10352it [00:00, 21511.01it/s]
Calculating distances : 100%|██████████| 5727/5727 [00:00<00:00, 11658.99it/s]
Aggregating distances : 100%|██████████| 3/3 [00:00<00:00, 22.26it/s]
Sorting into clusters : 100%|██████████| 5727/5727 [00:02<00:00, 2295.37it/s]
Calculating minimum spanning tree : 100%|██████████| 39/39 [00:00<00:00, 1237.44it/s]
Generating edge list : 100%|██████████| 39/39 [00:00<00:00, 4374.56it/s]
Computing overlap : 100%|██████████| 5727/5727 [00:02<00:00, 2320.63it/s]
Adjust overlap : 100%|██████████| 445/445 [00:00<00:00, 4626.70it/s]
Linking edges : 100%|██████████| 5282/5282 [00:00<00:00, 117770.68it/s]
Computing network layout
Computing expanded network layout
[30]:
vdj
[30]:
Dandelion class object with n_obs = 5333 and n_contigs = 10352
    data: 'cell_id', 'sequence_id', 'sequence', 'sequence_aa', 'productive', 'rev_comp', 'v_call', 'v_cigar', 'd_call', 'd_cigar', 'j_call', 'j_cigar', 'c_call', 'c_cigar', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'junction_length', 'junction_aa_length', 'v_sequence_start', 'v_sequence_end', 'd_sequence_start', 'd_sequence_end', 'j_sequence_start', 'j_sequence_end', 'c_sequence_start', 'c_sequence_end', 'consensus_count', 'duplicate_count', 'is_cell', 'locus', 'rearrangement_status', 'sample_id', 'clone_id'
    metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_abT_VDJ', 'd_call_abT_VDJ', 'j_call_abT_VDJ', 'v_call_abT_VJ', 'j_call_abT_VJ', 'c_call_abT_VDJ', 'c_call_abT_VJ', 'productive_abT_VDJ', 'productive_abT_VJ', 'duplicate_count_abT_VDJ', 'duplicate_count_abT_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_abT_VDJ_main', 'd_call_abT_VDJ_main', 'j_call_abT_VDJ_main', 'v_call_abT_VJ_main', 'j_call_abT_VJ_main', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'
    layout: layout for 5333 vertices, layout for 84 vertices
    graph: networkx graph of 5333 vertices, networkx graph of 84 vertices

Plotting in scanpy.

[31]:
ddl.tl.transfer(
    adata, vdj
)  # this will include singletons. To show only expanded clones, specify expanded_only=True
[32]:
sc.set_figure_params(figsize=[5, 5])
ddl.pl.clone_network(adata, color=["sample_id"], edges_width=1, size=15)
../_images/notebooks_7_dandelion_TCR_data_10x_data_56_0.png
[33]:
adata
[33]:
AnnData object with n_obs × n_vars = 17275 × 2305
    obs: 'sample_id', 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'scrublet_score', 'is_doublet', 'filter_rna', 'has_contig', 'filter_contig_quality', 'filter_contig_VDJ', 'filter_contig_VJ', 'contig_QC_pass', 'filter_contig', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_abT_VDJ', 'd_call_abT_VDJ', 'j_call_abT_VDJ', 'v_call_abT_VJ', 'j_call_abT_VJ', 'c_call_abT_VDJ', 'c_call_abT_VJ', 'productive_abT_VDJ', 'productive_abT_VJ', 'duplicate_count_abT_VDJ', 'duplicate_count_abT_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_abT_VDJ_main', 'd_call_abT_VDJ_main', 'j_call_abT_VDJ_main', 'v_call_abT_VJ_main', 'j_call_abT_VJ_main', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ', 'leiden', 'clone_id', 'clone_id_by_size'
    var: 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'log1p', 'hvg', 'pca', 'neighbors', 'umap', 'leiden', 'leiden_colors', 'contig_QC_pass_colors', 'rna_neighbors', 'clone_id', 'sample_id_colors'
    obsm: 'X_pca', 'X_umap', 'X_vdj'
    varm: 'PCs'
    obsp: 'distances', 'connectivities', 'rna_connectivities', 'rna_distances', 'vdj_connectivities', 'vdj_distances'
[34]:
sc.set_figure_params(figsize=[4.5, 5])
ddl.pl.clone_network(
    adata,
    color=[
        "chain_status",
        "rearrangement_status_VDJ",
        "rearrangement_status_VJ",
    ],
    ncols=1,
    legend_fontoutline=3,
    size=10,
    edges_width=1,
)
../_images/notebooks_7_dandelion_TCR_data_10x_data_58_0.png
[35]:
ddl.tl.transfer(adata, vdj, expanded_only=True)
[36]:
sc.set_figure_params(figsize=[5, 5])
ddl.pl.clone_network(adata, color=["sample_id"], edges_width=1, size=50)
../_images/notebooks_7_dandelion_TCR_data_10x_data_60_0.png
[37]:
sc.set_figure_params(figsize=[4.5, 5])
ddl.pl.clone_network(
    adata,
    color=[
        "locus_status",
        "rearrangement_status_VDJ",
        "rearrangement_status_VJ",
    ],
    ncols=1,
    legend_fontoutline=3,
    edges_width=1,
    size=50,
)
../_images/notebooks_7_dandelion_TCR_data_10x_data_61_0.png

Using scirpy to plot

You can also use scirpy’s functions to plot the network.

A likely use case is if you have a lot of cells and you don’t want to wait for dandelion to generate the layout because it’s taking too long. Or you simply prefer scirpy’s style of plotting.

You can run ddl.tl.generate_network(..., compute_layout = False) and it will finish ultra-fast, and after transfer to scirpy, you can use its plotting functions to visualise the networks - the clone network is generated very quickly but visualising it using spring layout does take quite a while.

[38]:
import scirpy as ir

ir.tl.clonotype_network(adata, min_cells=2)
ir.pl.clonotype_network(adata, color="clone_id", panel_size=(7, 7))
[38]:
<Axes: >
../_images/notebooks_7_dandelion_TCR_data_10x_data_63_1.png

You can change the clonotype labels by transferring with a different clone_key. For example, from numerically ordered from largest to smallest.

[39]:
ddl.tl.transfer(adata, vdj, clone_key="clone_id_by_size")
ir.tl.clonotype_network(adata, clonotype_key="clone_id_by_size", min_cells=2)
ir.pl.clonotype_network(adata, color="clone_id_by_size", panel_size=(7, 7))
[39]:
<Axes: >
../_images/notebooks_7_dandelion_TCR_data_10x_data_65_1.png

You can also transfer with the clones collapsed for plotting as pie-charts as per how scirpy does it.

[40]:
ddl.tl.transfer(adata, vdj, clone_key="clone_id_by_size", collapse_nodes=True)
ir.tl.clonotype_network(adata, clonotype_key="clone_id_by_size", min_cells=2)
ir.pl.clonotype_network(adata, color="sample_id", panel_size=(7, 7))
[40]:
<Axes: >
../_images/notebooks_7_dandelion_TCR_data_10x_data_67_1.png

Finish.

We can save the files.

[41]:
adata.write("adata_tcr.h5ad", compression="gzip")
[42]:
vdj.write_h5ddl("dandelion_results_tcr.h5ddl", complib="blosc:lz4")
[ ]: