Dandelion class

Much of the functionality of the dandelion package revolves around the Dandelion class object. The class acts as an intermediary object for storage and for flexible interaction with other tools. This section is a quick primer on the Dandelion class.

Import modules

[1]:
import os

os.chdir(os.path.expanduser("~/Downloads/dandelion_tutorial/"))
import dandelion as ddl

ddl.logging.print_versions()
dandelion==0.5.5.dev16 pandas==2.2.3 numpy==2.1.3 matplotlib==3.10.1 networkx==3.4.2 scipy==1.15.2

[2]:
vdj = ddl.read_h5ddl("dandelion_results.h5ddl")
# let's run find_clones again as this was not stored.
ddl.tl.find_clones(vdj)
vdj
Finding clones based on B cell VDJ chains : 100%|██████████| 222/222 [00:00<00:00, 3567.62it/s]
Finding clones based on B cell VJ chains : 100%|██████████| 209/209 [00:00<00:00, 5862.71it/s]
Refining clone assignment based on VJ chain pairing : 100%|██████████| 2238/2238 [00:00<00:00, 647413.78it/s]
[2]:
Dandelion class object with n_obs = 2238 and n_contigs = 7355
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 
'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'ambiguous', 'extra', 'rearrangement_status', 'clone_id'
    metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'umi_count_B_VDJ', 'umi_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'
    layout: layout for 2112 vertices, layout for 71 vertices
    graph: networkx graph of 2112 vertices, networkx graph of 71 vertices

In essence, the .data slot holds the AIRR contig table, while the .metadata slot holds a per-cell collapsed version that can be combined with AnnData’s .obs slot. You can access these slots as ordinary attributes; for example, to retrieve the metadata:
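
To make the relationship between the two slots concrete, here is a minimal plain-Python sketch of collapsing a contig-level table into one record per cell. It is not dandelion's implementation: the toy rows, the collapse_to_cells helper and the use of | as a joiner are assumptions for illustration. The only domain fact relied on is that IGH, TRB and TRD chains are VDJ chains, while the other loci are VJ chains.

```python
from collections import defaultdict

# Loci whose chains carry a D segment (VDJ); all other loci are VJ chains.
VDJ_LOCI = {"IGH", "TRB", "TRD"}

# Toy contig-level table (hypothetical rows, loosely mirroring AIRR fields).
contigs = [
    {"sequence_id": "cellA_contig_1", "cell_id": "cellA", "locus": "IGH", "v_call": "IGHV1-2"},
    {"sequence_id": "cellA_contig_2", "cell_id": "cellA", "locus": "IGL", "v_call": "IGLV5-45"},
    {"sequence_id": "cellB_contig_1", "cell_id": "cellB", "locus": "IGK", "v_call": "IGKV1D-33"},
]


def collapse_to_cells(rows, column):
    """Collapse contig rows into one record per cell, split by chain type."""
    per_cell = defaultdict(lambda: {"VDJ": [], "VJ": []})
    for row in rows:
        chain = "VDJ" if row["locus"] in VDJ_LOCI else "VJ"
        per_cell[row["cell_id"]][chain].append(row[column])
    # Join several contigs of the same chain type with "|"; None when absent.
    return {
        cell: {
            f"{column}_{chain}": "|".join(values) if values else None
            for chain, values in chains.items()
        }
        for cell, chains in per_cell.items()
    }


cell_table = collapse_to_cells(contigs, "v_call")
for cell_id, record in cell_table.items():
    print(cell_id, record)
```

Because the collapsed table is keyed by cell barcode, it can be aligned row by row with any other per-cell table, which is what makes it compatible with a slot such as AnnData’s .obs.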

[3]:
vdj.metadata
[3]:
clone_id clone_id_by_size sample_id locus_VDJ locus_VJ productive_VDJ productive_VJ v_call_genotyped_VDJ d_call_VDJ j_call_VDJ ... d_call_B_VDJ_main j_call_B_VDJ_main v_call_B_VJ_main j_call_B_VJ_main isotype isotype_status locus_status chain_status rearrangement_status_VDJ rearrangement_status_VJ
sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG B_VJ_76_2_7 169 sc5p_v2_hs_PBMC_10k None IGK None T None None None ... None None IGKV1D-33,IGKV1-33 IGKJ4 Orphan IGK Orphan VJ None standard
sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC B_VDJ_191_3_2_VJ_185_2_3 1988 sc5p_v2_hs_PBMC_10k IGH IGK T T IGHV1-69,IGHV1-69D IGHD3-22 IGHJ3 ... IGHD3-22 IGHJ3 IGKV1-8 IGKJ1 IgM IgM IGH + IGK Single pair standard standard
sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG B_VDJ_9_1_2_VJ_153_1_1 1602 sc5p_v2_hs_PBMC_10k IGH IGL T T IGHV1-2 None IGHJ3 ... None IGHJ3 IGLV5-45 IGLJ3 IgM IgM IGH + IGL Single pair standard standard
sc5p_v2_hs_PBMC_10k_AAACCTGTCTTGAGAC B_VDJ_92_4_2_VJ_47_1_1 1603 sc5p_v2_hs_PBMC_10k IGH IGK T T IGHV5-51 None IGHJ3 ... None IGHJ3 IGKV1D-8 IGKJ2 IgM IgM IGH + IGK Single pair standard standard
sc5p_v2_hs_PBMC_10k_AAACGGGAGCGACGTA B_VDJ_15_2_1_VJ_83_2_6 1604 sc5p_v2_hs_PBMC_10k IGH IGL T T IGHV4-4 IGHD6-13 IGHJ3 ... IGHD6-13 IGHJ3 IGLV3-19 IGLJ2,IGLJ3 IgM IgM IGH + IGL Single pair standard standard
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
vdj_v1_hs_pbmc3_TTTCCTCAGCAATATG B_VDJ_61_2_1_VJ_129_2_7 812 vdj_v1_hs_pbmc3 IGH IGK T T IGHV2-5 IGHD5/OR15-5a,IGHD5/OR15-5b IGHJ4,IGHJ5 ... IGHD5/OR15-5a,IGHD5/OR15-5b IGHJ4,IGHJ5 IGKV4-1 IGKJ4 IgM IgM IGH + IGK Single pair standard standard
vdj_v1_hs_pbmc3_TTTCCTCAGCGCTTAT B_VDJ_37_5_2_VJ_49_1_3 813 vdj_v1_hs_pbmc3 IGH IGK T T IGHV3-30 IGHD4-17 IGHJ6 ... IGHD4-17 IGHJ6 IGKV2-30 IGKJ2 IgM IgM IGH + IGK Single pair standard standard
vdj_v1_hs_pbmc3_TTTCCTCAGGGAAACA B_VDJ_145_1_1_VJ_35_4_14 814 vdj_v1_hs_pbmc3 IGH IGK T T IGHV4-61 IGHD6-13 IGHJ2 ... IGHD6-13 IGHJ2 IGKV1-39,IGKV1D-39 IGKJ1 IgM IgM IGH + IGK Single pair standard standard
vdj_v1_hs_pbmc3_TTTGCGCCATACCATG B_VDJ_48_4_2_VJ_50_3_5 815 vdj_v1_hs_pbmc3 IGH IGL T T IGHV1-69,IGHV1-69D IGHD2-15 IGHJ6 ... IGHD2-15 IGHJ6 IGLV1-47 IGLJ3 IgM IgM IGH + IGL Single pair standard standard
vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG B_VDJ_117_5_3_VJ_102_3_4 2388 vdj_v1_hs_pbmc3 IGH IGL T T IGHV3-23,IGHV3-23D None IGHJ4 ... None IGHJ4 IGLV2-11 IGLJ2,IGLJ3 IgM IgM IGH + IGL Single pair standard standard

2238 rows × 47 columns

slicing

You can slice the Dandelion object with the indices of either the .data or the .metadata slot; the behavior is similar to slicing a pandas DataFrame or an AnnData object.
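
The idea that a selection made on one slot carries over to the other can be sketched in plain Python as follows. This is an illustrative toy, not dandelion's code: the TwoLevelTable class, its method names and the barcodes are all hypothetical, and it assumes that selecting cells keeps every contig belonging to those cells.

```python
class TwoLevelTable:
    """Toy container with a contig-level table and a cell-level table."""

    def __init__(self, data, metadata):
        self.data = data          # {sequence_id: {"cell_id": ...}}
        self.metadata = metadata  # {cell_id: {...}}

    @property
    def shape(self):
        # (number of cells, number of contigs)
        return len(self.metadata), len(self.data)

    def subset_by_contigs(self, keep):
        """Keep the listed contigs, then keep only the cells they belong to."""
        data = {sid: row for sid, row in self.data.items() if sid in keep}
        cells = {row["cell_id"] for row in data.values()}
        metadata = {cid: row for cid, row in self.metadata.items() if cid in cells}
        return TwoLevelTable(data, metadata)

    def subset_by_cells(self, keep):
        """Keep the listed cells, then keep every contig of those cells."""
        metadata = {cid: row for cid, row in self.metadata.items() if cid in keep}
        data = {sid: row for sid, row in self.data.items() if row["cell_id"] in metadata}
        return TwoLevelTable(data, metadata)


# Hypothetical barcodes: three cells with 1, 2 and 2 contigs, plus one extra cell.
data = {
    "c1_contig_1": {"cell_id": "c1"},
    "c2_contig_1": {"cell_id": "c2"},
    "c2_contig_2": {"cell_id": "c2"},
    "c3_contig_1": {"cell_id": "c3"},
    "c3_contig_2": {"cell_id": "c3"},
    "c4_contig_1": {"cell_id": "c4"},
}
metadata = {cid: {} for cid in ("c1", "c2", "c3", "c4")}
table = TwoLevelTable(data, metadata)

by_contig = table.subset_by_contigs(
    {"c1_contig_1", "c2_contig_1", "c2_contig_2", "c3_contig_1", "c3_contig_2"}
)
by_cell = table.subset_by_cells({"c3"})
print(by_contig.shape, by_cell.shape)
```

In this toy, selecting five contigs from three cells yields three cells and five contigs, and selecting a single cell yields that cell together with its two contigs.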

slicing .data

[4]:
# get the largest clone
largest_clone = vdj.data["clone_id"].value_counts().idxmax()

vdj[vdj.data["clone_id"] == largest_clone]
[4]:
Dandelion class object with n_obs = 566 and n_contigs = 2802
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 
'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'ambiguous', 'extra', 'rearrangement_status', 'clone_id'
    metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'umi_count_B_VDJ', 'umi_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'
    layout: layout for 547 vertices, layout for 9 vertices
    graph: networkx graph of 547 vertices, networkx graph of 9 vertices
[5]:
vdj[
    vdj.data_names.isin(
        [
            "sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1",
            "sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2",
            "sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_1",
            "sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_1",
            "sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_2",
        ]
    )
]
[5]:
Dandelion class object with n_obs = 3 and n_contigs = 5
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 
'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'ambiguous', 'extra', 'rearrangement_status', 'clone_id'
    metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'umi_count_B_VDJ', 'umi_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'
    layout: layout for 2 vertices, layout for 0 vertices
    graph: networkx graph of 2 vertices, networkx graph of 0 vertices

slicing .metadata

[6]:
vdj[vdj.metadata["productive_VDJ"].isin(["T", "T|T"])]
[6]:
Dandelion class object with n_obs = 2112 and n_contigs = 5052
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 
'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'ambiguous', 'extra', 'rearrangement_status', 'clone_id'
    metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'umi_count_B_VDJ', 'umi_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'
    layout: layout for 2112 vertices, layout for 71 vertices
    graph: networkx graph of 2112 vertices, networkx graph of 71 vertices
[7]:
vdj[vdj.metadata_names == "vdj_v1_hs_pbmc3_TTTCCTCAGCGCTTAT"]
[7]:
Dandelion class object with n_obs = 1 and n_contigs = 2
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 
'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'ambiguous', 'extra', 'rearrangement_status', 'clone_id'
    metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'umi_count_B_VDJ', 'umi_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'
    layout: layout for 1 vertices, layout for 0 vertices
    graph: networkx graph of 1 vertices, networkx graph of 0 vertices

copy

You can deep copy the Dandelion object to another variable which will inherit all slots:
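
What "deep" means here can be illustrated with Python's standard copy module on a toy object. The Container class below is hypothetical and only stands in for an object with nested slots; it makes no claim about how dandelion implements its own copy method.

```python
import copy


class Container:
    """Toy object with two nested, mutable slots."""

    def __init__(self, data, metadata):
        self.data = data
        self.metadata = metadata


original = Container(
    data=[{"sequence_id": "cell1_contig_1", "clone_id": "A"}],
    metadata={"cell1": {"clone_id": "A"}},
)

deep = copy.deepcopy(original)  # nested objects are duplicated
shallow = copy.copy(original)   # nested objects are shared with the original

# Editing the deep copy leaves the original untouched.
deep.metadata["cell1"]["clone_id"] = "changed in deep copy"
untouched = original.metadata["cell1"]["clone_id"]

# Editing the shallow copy leaks into the original, because the dict is shared.
shallow.metadata["cell1"]["clone_id"] = "changed in shallow copy"
leaked = original.metadata["cell1"]["clone_id"]

print(untouched, "|", leaked)
```

A deep copy is therefore the safe choice when you want to experiment on a second object without altering the first.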

[8]:
vdj2 = vdj.copy()
vdj2.metadata
[8]:
clone_id clone_id_by_size sample_id locus_VDJ locus_VJ productive_VDJ productive_VJ v_call_genotyped_VDJ d_call_VDJ j_call_VDJ ... d_call_B_VDJ_main j_call_B_VDJ_main v_call_B_VJ_main j_call_B_VJ_main isotype isotype_status locus_status chain_status rearrangement_status_VDJ rearrangement_status_VJ
sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG B_VJ_76_2_7 169 sc5p_v2_hs_PBMC_10k None IGK None T None None None ... None None IGKV1D-33,IGKV1-33 IGKJ4 Orphan IGK Orphan VJ None standard
sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC B_VDJ_191_3_2_VJ_185_2_3 1988 sc5p_v2_hs_PBMC_10k IGH IGK T T IGHV1-69,IGHV1-69D IGHD3-22 IGHJ3 ... IGHD3-22 IGHJ3 IGKV1-8 IGKJ1 IgM IgM IGH + IGK Single pair standard standard
sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG B_VDJ_9_1_2_VJ_153_1_1 1602 sc5p_v2_hs_PBMC_10k IGH IGL T T IGHV1-2 None IGHJ3 ... None IGHJ3 IGLV5-45 IGLJ3 IgM IgM IGH + IGL Single pair standard standard
sc5p_v2_hs_PBMC_10k_AAACCTGTCTTGAGAC B_VDJ_92_4_2_VJ_47_1_1 1603 sc5p_v2_hs_PBMC_10k IGH IGK T T IGHV5-51 None IGHJ3 ... None IGHJ3 IGKV1D-8 IGKJ2 IgM IgM IGH + IGK Single pair standard standard
sc5p_v2_hs_PBMC_10k_AAACGGGAGCGACGTA B_VDJ_15_2_1_VJ_83_2_6 1604 sc5p_v2_hs_PBMC_10k IGH IGL T T IGHV4-4 IGHD6-13 IGHJ3 ... IGHD6-13 IGHJ3 IGLV3-19 IGLJ2,IGLJ3 IgM IgM IGH + IGL Single pair standard standard
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
vdj_v1_hs_pbmc3_TTTCCTCAGCAATATG B_VDJ_61_2_1_VJ_129_2_7 812 vdj_v1_hs_pbmc3 IGH IGK T T IGHV2-5 IGHD5/OR15-5a,IGHD5/OR15-5b IGHJ4,IGHJ5 ... IGHD5/OR15-5a,IGHD5/OR15-5b IGHJ4,IGHJ5 IGKV4-1 IGKJ4 IgM IgM IGH + IGK Single pair standard standard
vdj_v1_hs_pbmc3_TTTCCTCAGCGCTTAT B_VDJ_37_5_2_VJ_49_1_3 813 vdj_v1_hs_pbmc3 IGH IGK T T IGHV3-30 IGHD4-17 IGHJ6 ... IGHD4-17 IGHJ6 IGKV2-30 IGKJ2 IgM IgM IGH + IGK Single pair standard standard
vdj_v1_hs_pbmc3_TTTCCTCAGGGAAACA B_VDJ_145_1_1_VJ_35_4_14 814 vdj_v1_hs_pbmc3 IGH IGK T T IGHV4-61 IGHD6-13 IGHJ2 ... IGHD6-13 IGHJ2 IGKV1-39,IGKV1D-39 IGKJ1 IgM IgM IGH + IGK Single pair standard standard
vdj_v1_hs_pbmc3_TTTGCGCCATACCATG B_VDJ_48_4_2_VJ_50_3_5 815 vdj_v1_hs_pbmc3 IGH IGL T T IGHV1-69,IGHV1-69D IGHD2-15 IGHJ6 ... IGHD2-15 IGHJ6 IGLV1-47 IGLJ3 IgM IgM IGH + IGL Single pair standard standard
vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG B_VDJ_117_5_3_VJ_102_3_4 2388 vdj_v1_hs_pbmc3 IGH IGL T T IGHV3-23,IGHV3-23D None IGHJ4 ... None IGHJ4 IGLV2-11 IGLJ2,IGLJ3 IgM IgM IGH + IGL Single pair standard standard

2238 rows × 47 columns

Retrieving entries with update_metadata

The .metadata slot of the Dandelion class is initialized automatically whenever the .data slot is filled. However, it only contains a pre-specified, standard set of columns. To retrieve other columns from the .data slot, update the metadata with ddl.update_metadata and specify the retrieve and retrieve_mode options.

The following modes determine how the retrieval is completed:

split and unique only - splits the retrieval into VDJ and VJ chains. A | separates the unique elements.

split and merge - splits the retrieval into VDJ and VJ chains. A | separates every element.

merge and unique only - similar to the above, but merged into a single column.

split - splits the retrieval into individual columns for each contig.

merge - merges the retrieval into a single column, where a | separates every element.

For numerical columns, there are additional options:

split and sum - splits the retrieval into VDJ and VJ chains and sums each separately.

split and average - similar to the above, but averages instead of summing.

sum - sums the retrievals into a single column.

average - averages the retrievals into a single column.

If retrieve_mode is not specified, it defaults to split and merge.
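
The modes listed above can be read as small aggregation rules applied to the values of one cell. The following plain-Python sketch follows that reading; it is not dandelion's implementation. The retrieve function, its input layout and the choice of None for empty chains are assumptions made for illustration, and the plain split mode is left out because its output column naming is not described here.

```python
def retrieve(values_by_chain, mode="split and merge"):
    """Aggregate one cell's values; values_by_chain is {"VDJ": [...], "VJ": [...]}."""
    vdj, vj = values_by_chain["VDJ"], values_by_chain["VJ"]
    merged = vdj + vj

    def unique(values):
        # Drop duplicates while preserving first-seen order.
        return list(dict.fromkeys(values))

    def join(values):
        return "|".join(str(v) for v in values) if values else None

    def total(values):
        return sum(values) if values else None

    def average(values):
        return sum(values) / len(values) if values else None

    split_ops = {
        "split and unique only": lambda v: join(unique(v)),
        "split and merge": join,
        "split and sum": total,
        "split and average": average,
    }
    merged_ops = {
        "merge and unique only": lambda v: join(unique(v)),
        "merge": join,
        "sum": total,
        "average": average,
    }
    if mode in split_ops:
        op = split_ops[mode]
        return {"VDJ": op(vdj), "VJ": op(vj)}
    if mode in merged_ops:
        return merged_ops[mode](merged)
    raise ValueError(f"unsupported mode: {mode!r}")


# Hypothetical cell with two identical heavy-chain calls and one light-chain call.
calls = {"VDJ": ["IGHJ4", "IGHJ4"], "VJ": ["IGKJ1"]}
counts = {"VDJ": [3, 5], "VJ": [2]}

for mode in ("split and unique only", "split and merge", "merge and unique only", "merge"):
    print(mode, "->", retrieve(calls, mode))
for mode in ("split and sum", "split and average", "sum", "average"):
    print(mode, "->", retrieve(counts, mode))
```

The split modes return one value per chain type, which corresponds to separate _VDJ and _VJ columns, while the merged modes return a single value per cell.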

Example: retrieving fwr1 sequences

[9]:
vdj.update_metadata(retrieve="fwr1")
vdj
[9]:
Dandelion class object with n_obs = 2238 and n_contigs = 7355
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 
'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'ambiguous', 'extra', 'rearrangement_status', 'clone_id'
    metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'umi_count_B_VDJ', 'umi_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ', 'fwr1_VJ', 'fwr1_VDJ'
    layout: layout for 2112 vertices, layout for 71 vertices
    graph: networkx graph of 2112 vertices, networkx graph of 71 vertices

Note the additional fwr1_VDJ and fwr1_VJ columns in the metadata slot.

By default, dandelion will not try to merge numerical columns, because doing so can create mixed-dtype columns.

There is a new sub-function that will try to retrieve frequently used columns such as np1_length and np2_length:

[10]:
vdj.update_plus()
vdj
/opt/homebrew/Caskroom/miniforge/base/envs/dandelion/lib/python3.11/site-packages/numpy/_core/fromnumeric.py:3904: RuntimeWarning: Mean of empty slice.
/opt/homebrew/Caskroom/miniforge/base/envs/dandelion/lib/python3.11/site-packages/numpy/_core/_methods.py:147: RuntimeWarning: invalid value encountered in scalar divide
[10]:
Dandelion class object with n_obs = 2238 and n_contigs = 7355
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 
'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'ambiguous', 'extra', 'rearrangement_status', 'clone_id'
    metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'umi_count_B_VDJ', 'umi_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ', 'fwr1_VJ', 'fwr1_VDJ', 'mu_count_VDJ', 'mu_count_VJ', 'mu_count', 'junction_length_VDJ', 'junction_length_VJ', 'junction_aa_length_VDJ', 'junction_aa_length_VJ', 'np1_length_VDJ', 'np1_length_VJ', 'np2_length_VDJ'
    layout: layout for 2112 vertices, layout for 71 vertices
    graph: networkx graph of 2112 vertices, networkx graph of 71 vertices

Renaming barcodes

You can rename the barcodes with a single function call, which updates the sequence ids and the cell ids at the same time. This is useful when you want to give the barcodes more meaningful names. The renaming is always applied to the indices that were originally used to create the Dandelion object, so running the function a second time does not stack another prefix/suffix onto the already renamed indices; it replaces the previous one.
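The "always rename from the original indices" behaviour can be pictured with a minimal sketch. This is not dandelion's implementation: `IdStore` and its methods are hypothetical names, used only to illustrate why repeated calls replace a prefix instead of stacking it.

```python
# Hypothetical illustration only; this is NOT dandelion's actual code.
class IdStore:
    """Keeps the original ids so that every rename starts from them."""

    def __init__(self, ids):
        self._original = list(ids)  # never modified after creation
        self.ids = list(ids)        # what the user currently sees

    def add_prefix(self, prefix, sep="_"):
        # Build the new ids from the ORIGINAL ids, not the current ones,
        # so a second call replaces the earlier prefix.
        self.ids = [f"{prefix}{sep}{i}" for i in self._original]

    def add_suffix(self, suffix, sep="_"):
        self.ids = [f"{i}{sep}{suffix}" for i in self._original]

    def reset_ids(self):
        # Go back to the ids the store was created with.
        self.ids = list(self._original)


store = IdStore(["AAACCTGTCATATCGG", "AAACCTGTCCGTTGTC"])
store.add_prefix("test", sep="-")
first_rename = list(store.ids)
store.add_prefix("test2", sep="_")  # replaces "test-", does not stack
second_rename = list(store.ids)
store.reset_ids()
print(first_rename, second_rename, store.ids, sep="\n")
```

The cells below show the same pattern on a real object: adding `test-`, then `test2_`, then resetting.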

[11]:
# original
print(vdj.data[["sequence_id", "cell_id"]]), print(vdj.metadata_names)
                                                                                 sequence_id  \
sequence_id
sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1  sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1
sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2  sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2
sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_1  sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_1
sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_1  sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_1
sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_2  sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_2
...                                                                                      ...
vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC_contig_1          vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC_contig_1
vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_2          vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_2
vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_1          vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_1
vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_2          vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_2
vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_1          vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_1

                                                                            cell_id
sequence_id
sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1  sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG
sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2  sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC
sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_1  sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC
sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_1  sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG
sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_2  sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG
...                                                                             ...
vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC_contig_1          vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC
vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_2          vdj_v1_hs_pbmc3_TTTGCGCCATACCATG
vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_1          vdj_v1_hs_pbmc3_TTTGCGCCATACCATG
vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_2          vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG
vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_1          vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG

[7355 rows x 2 columns]
Index(['sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG',
       'sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC',
       'sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG',
       'sc5p_v2_hs_PBMC_10k_AAACCTGTCTTGAGAC',
       'sc5p_v2_hs_PBMC_10k_AAACGGGAGCGACGTA',
       'sc5p_v2_hs_PBMC_10k_AAACGGGCACTGTTAG',
       'sc5p_v2_hs_PBMC_10k_AAAGATGAGGATGCGT',
       'sc5p_v2_hs_PBMC_10k_AAAGATGGTCGAATCT',
       'sc5p_v2_hs_PBMC_10k_AAAGATGGTGAGGGAG',
       'sc5p_v2_hs_PBMC_10k_AAAGTAGCAGATCCAT',
       ...
       'vdj_v1_hs_pbmc3_TTGTAGGTCGCAAGCC', 'vdj_v1_hs_pbmc3_TTTACTGTCAGCTGGC',
       'vdj_v1_hs_pbmc3_TTTATGCGTCAGAATA', 'vdj_v1_hs_pbmc3_TTTATGCTCAGGATCT',
       'vdj_v1_hs_pbmc3_TTTATGCTCCTAGAAC', 'vdj_v1_hs_pbmc3_TTTCCTCAGCAATATG',
       'vdj_v1_hs_pbmc3_TTTCCTCAGCGCTTAT', 'vdj_v1_hs_pbmc3_TTTCCTCAGGGAAACA',
       'vdj_v1_hs_pbmc3_TTTGCGCCATACCATG', 'vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG'],
      dtype='object', length=2238)
[11]:
(None, None)
[12]:
# let's add a 'test-' as a prefix. There's also the suffix option
vdj.add_sequence_prefix("test", sep="-")
print(vdj.data[["sequence_id", "cell_id"]]), print(vdj.metadata_names)
                                                                                          sequence_id  \
sequence_id
test-sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1  test-sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_cont...
test-sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2  test-sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_cont...
test-sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_1  test-sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_cont...
test-sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_1  test-sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_cont...
test-sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_2  test-sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_cont...
...                                                                                               ...
test-vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC_contig_1         test-vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC_contig_1
test-vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_2         test-vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_2
test-vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_1         test-vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_1
test-vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_2         test-vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_2
test-vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_1         test-vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_1

                                                                                      cell_id
sequence_id
test-sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1  test-sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG
test-sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2  test-sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC
test-sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_1  test-sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC
test-sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_1  test-sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG
test-sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_2  test-sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG
...                                                                                       ...
test-vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC_contig_1          test-vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC
test-vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_2          test-vdj_v1_hs_pbmc3_TTTGCGCCATACCATG
test-vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_1          test-vdj_v1_hs_pbmc3_TTTGCGCCATACCATG
test-vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_2          test-vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG
test-vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_1          test-vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG

[7355 rows x 2 columns]
Index(['test-sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG',
       'test-sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC',
       'test-sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG',
       'test-sc5p_v2_hs_PBMC_10k_AAACCTGTCTTGAGAC',
       'test-sc5p_v2_hs_PBMC_10k_AAACGGGAGCGACGTA',
       'test-sc5p_v2_hs_PBMC_10k_AAACGGGCACTGTTAG',
       'test-sc5p_v2_hs_PBMC_10k_AAAGATGAGGATGCGT',
       'test-sc5p_v2_hs_PBMC_10k_AAAGATGGTCGAATCT',
       'test-sc5p_v2_hs_PBMC_10k_AAAGATGGTGAGGGAG',
       'test-sc5p_v2_hs_PBMC_10k_AAAGTAGCAGATCCAT',
       ...
       'test-vdj_v1_hs_pbmc3_TTGTAGGTCGCAAGCC',
       'test-vdj_v1_hs_pbmc3_TTTACTGTCAGCTGGC',
       'test-vdj_v1_hs_pbmc3_TTTATGCGTCAGAATA',
       'test-vdj_v1_hs_pbmc3_TTTATGCTCAGGATCT',
       'test-vdj_v1_hs_pbmc3_TTTATGCTCCTAGAAC',
       'test-vdj_v1_hs_pbmc3_TTTCCTCAGCAATATG',
       'test-vdj_v1_hs_pbmc3_TTTCCTCAGCGCTTAT',
       'test-vdj_v1_hs_pbmc3_TTTCCTCAGGGAAACA',
       'test-vdj_v1_hs_pbmc3_TTTGCGCCATACCATG',
       'test-vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG'],
      dtype='object', length=2238)
[12]:
(None, None)
[13]:
# add_cell_prefix has the same functionality as above; the new prefix replaces 'test-'
vdj.add_cell_prefix("test2", sep="_")
print(vdj.data[["sequence_id", "cell_id"]]), print(vdj.metadata_names)
                                                                                          sequence_id  \
sequence_id
test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_cont...  test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_con...
test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_cont...  test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_con...
test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_cont...  test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_con...
test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_cont...  test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_con...
test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_cont...  test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_con...
...                                                                                               ...
test2_vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC_contig_1       test2_vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC_contig_1
test2_vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_2       test2_vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_2
test2_vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_1       test2_vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_1
test2_vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_2       test2_vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_2
test2_vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_1       test2_vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_1

                                                                                       cell_id
sequence_id
test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_cont...  test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG
test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_cont...  test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC
test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_cont...  test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC
test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_cont...  test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG
test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_cont...  test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG
...                                                                                        ...
test2_vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC_contig_1         test2_vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC
test2_vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_2         test2_vdj_v1_hs_pbmc3_TTTGCGCCATACCATG
test2_vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_1         test2_vdj_v1_hs_pbmc3_TTTGCGCCATACCATG
test2_vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_2         test2_vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG
test2_vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_1         test2_vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG

[7355 rows x 2 columns]
Index(['test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG',
       'test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC',
       'test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG',
       'test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCTTGAGAC',
       'test2_sc5p_v2_hs_PBMC_10k_AAACGGGAGCGACGTA',
       'test2_sc5p_v2_hs_PBMC_10k_AAACGGGCACTGTTAG',
       'test2_sc5p_v2_hs_PBMC_10k_AAAGATGAGGATGCGT',
       'test2_sc5p_v2_hs_PBMC_10k_AAAGATGGTCGAATCT',
       'test2_sc5p_v2_hs_PBMC_10k_AAAGATGGTGAGGGAG',
       'test2_sc5p_v2_hs_PBMC_10k_AAAGTAGCAGATCCAT',
       ...
       'test2_vdj_v1_hs_pbmc3_TTGTAGGTCGCAAGCC',
       'test2_vdj_v1_hs_pbmc3_TTTACTGTCAGCTGGC',
       'test2_vdj_v1_hs_pbmc3_TTTATGCGTCAGAATA',
       'test2_vdj_v1_hs_pbmc3_TTTATGCTCAGGATCT',
       'test2_vdj_v1_hs_pbmc3_TTTATGCTCCTAGAAC',
       'test2_vdj_v1_hs_pbmc3_TTTCCTCAGCAATATG',
       'test2_vdj_v1_hs_pbmc3_TTTCCTCAGCGCTTAT',
       'test2_vdj_v1_hs_pbmc3_TTTCCTCAGGGAAACA',
       'test2_vdj_v1_hs_pbmc3_TTTGCGCCATACCATG',
       'test2_vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG'],
      dtype='object', length=2238)
[13]:
(None, None)
[14]:
# you can also reset the ids
vdj.reset_ids()
print(vdj.data[["sequence_id", "cell_id"]]), print(vdj.metadata_names)
                                                                                 sequence_id  \
sequence_id
sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1  sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1
sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2  sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2
sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_1  sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_1
sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_1  sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_1
sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_2  sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_2
...                                                                                      ...
vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC_contig_1          vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC_contig_1
vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_2          vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_2
vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_1          vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_1
vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_2          vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_2
vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_1          vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_1

                                                                            cell_id
sequence_id
sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1  sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG
sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2  sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC
sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_1  sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC
sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_1  sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG
sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_2  sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG
...                                                                             ...
vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC_contig_1          vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC
vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_2          vdj_v1_hs_pbmc3_TTTGCGCCATACCATG
vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_1          vdj_v1_hs_pbmc3_TTTGCGCCATACCATG
vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_2          vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG
vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_1          vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG

[7355 rows x 2 columns]
Index(['sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG',
       'sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC',
       'sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG',
       'sc5p_v2_hs_PBMC_10k_AAACCTGTCTTGAGAC',
       'sc5p_v2_hs_PBMC_10k_AAACGGGAGCGACGTA',
       'sc5p_v2_hs_PBMC_10k_AAACGGGCACTGTTAG',
       'sc5p_v2_hs_PBMC_10k_AAAGATGAGGATGCGT',
       'sc5p_v2_hs_PBMC_10k_AAAGATGGTCGAATCT',
       'sc5p_v2_hs_PBMC_10k_AAAGATGGTGAGGGAG',
       'sc5p_v2_hs_PBMC_10k_AAAGTAGCAGATCCAT',
       ...
       'vdj_v1_hs_pbmc3_TTGTAGGTCGCAAGCC', 'vdj_v1_hs_pbmc3_TTTACTGTCAGCTGGC',
       'vdj_v1_hs_pbmc3_TTTATGCGTCAGAATA', 'vdj_v1_hs_pbmc3_TTTATGCTCAGGATCT',
       'vdj_v1_hs_pbmc3_TTTATGCTCCTAGAAC', 'vdj_v1_hs_pbmc3_TTTCCTCAGCAATATG',
       'vdj_v1_hs_pbmc3_TTTCCTCAGCGCTTAT', 'vdj_v1_hs_pbmc3_TTTCCTCAGGGAAACA',
       'vdj_v1_hs_pbmc3_TTTGCGCCATACCATG', 'vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG'],
      dtype='object', length=2238)
[14]:
(None, None)

Simplifying the V/D/J/C call annotations

Sometimes the V/D/J/C call annotations can be quite verbose. You can simplify them with the .simplify() function. For calls that list several comma-separated genes, it keeps only the first element, and it also strips the allele designation (e.g. *01). This is useful when you want to simplify the V/D/J/C calls for plotting purposes.
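The rule described above (keep the first comma-separated call, drop the allele) can be sketched in a few lines of plain Python. `simplify_call` is a hypothetical helper written for illustration; it is not the code behind `.simplify()`.

```python
def simplify_call(call):
    """Keep the first comma-separated call and drop its allele suffix."""
    if call is None:
        return None  # cells without a call stay empty
    first = call.split(",")[0]   # "IGKV1-33*01,IGKV1D-33*01" -> "IGKV1-33*01"
    return first.split("*")[0]   # "IGKV1-33*01" -> "IGKV1-33"


examples = [
    "IGKV1-33*01,IGKV1D-33*01",
    "IGLJ2*01,IGLJ3*01,IGLJ3*02",
    "IGHJ4,IGHJ5",
    None,
]
for call in examples:
    print(call, "->", simplify_call(call))
```

The example strings are taken from the before/after output in this section, so the results can be compared directly with what `.simplify()` returns.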

[15]:
# before
(
    vdj.data[["v_call_genotyped", "j_call"]],
    vdj.metadata[["v_call_genotyped_VDJ", "j_call_VDJ"]],
)
[15]:
(                                                       v_call_genotyped  \
 sequence_id
 sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1  IGKV1-33*01,IGKV1D-33*01
 sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2  IGHV1-69*01,IGHV1-69D*01
 sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_1                IGKV1-8*01
 sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_1               IGLV5-45*02
 sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_2                IGHV1-2*02
 ...                                                                 ...
 vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC_contig_1                   IGHV1-46*01
 vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_2      IGHV1-69*01,IGHV1-69D*01
 vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_1                   IGLV1-47*01
 vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_2                   IGLV2-11*01
 vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_1      IGHV3-23*01,IGHV3-23D*01

                                                                    j_call
 sequence_id
 sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1                    IGKJ4*01
 sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2                    IGHJ3*02
 sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_1                    IGKJ1*01
 sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_1                    IGLJ3*02
 sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_2                    IGHJ3*02
 ...                                                                   ...
 vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC_contig_1                        IGHJ5*02
 vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_2                        IGHJ6*02
 vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_1                        IGLJ3*02
 vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_2      IGLJ2*01,IGLJ3*01,IGLJ3*02
 vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_1                        IGHJ4*02

 [7355 rows x 2 columns],
                                      v_call_genotyped_VDJ   j_call_VDJ
 sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG                 None         None
 sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC   IGHV1-69,IGHV1-69D        IGHJ3
 sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG              IGHV1-2        IGHJ3
 sc5p_v2_hs_PBMC_10k_AAACCTGTCTTGAGAC             IGHV5-51        IGHJ3
 sc5p_v2_hs_PBMC_10k_AAACGGGAGCGACGTA              IGHV4-4        IGHJ3
 ...                                                   ...          ...
 vdj_v1_hs_pbmc3_TTTCCTCAGCAATATG                  IGHV2-5  IGHJ4,IGHJ5
 vdj_v1_hs_pbmc3_TTTCCTCAGCGCTTAT                 IGHV3-30        IGHJ6
 vdj_v1_hs_pbmc3_TTTCCTCAGGGAAACA                 IGHV4-61        IGHJ2
 vdj_v1_hs_pbmc3_TTTGCGCCATACCATG       IGHV1-69,IGHV1-69D        IGHJ6
 vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG       IGHV3-23,IGHV3-23D        IGHJ4

 [2238 rows x 2 columns])
[16]:
# after
vdj.simplify()
# the same columns, now simplified
(
    vdj.data[["v_call_genotyped", "j_call"]],
    vdj.metadata[["v_call_genotyped_VDJ", "j_call_VDJ"]],
)
[16]:
(                                              v_call_genotyped j_call
 sequence_id
 sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1         IGKV1-33  IGKJ4
 sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2         IGHV1-69  IGHJ3
 sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_1          IGKV1-8  IGKJ1
 sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_1         IGLV5-45  IGLJ3
 sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_2          IGHV1-2  IGHJ3
 ...                                                        ...    ...
 vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC_contig_1             IGHV1-46  IGHJ5
 vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_2             IGHV1-69  IGHJ6
 vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_1             IGLV1-47  IGLJ3
 vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_2             IGLV2-11  IGLJ2
 vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_1             IGHV3-23  IGHJ4

 [7355 rows x 2 columns],
                                      v_call_genotyped_VDJ j_call_VDJ
 sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG                 None       None
 sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC             IGHV1-69      IGHJ3
 sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG              IGHV1-2      IGHJ3
 sc5p_v2_hs_PBMC_10k_AAACCTGTCTTGAGAC             IGHV5-51      IGHJ3
 sc5p_v2_hs_PBMC_10k_AAACGGGAGCGACGTA              IGHV4-4      IGHJ3
 ...                                                   ...        ...
 vdj_v1_hs_pbmc3_TTTCCTCAGCAATATG                  IGHV2-5      IGHJ4
 vdj_v1_hs_pbmc3_TTTCCTCAGCGCTTAT                 IGHV3-30      IGHJ6
 vdj_v1_hs_pbmc3_TTTCCTCAGGGAAACA                 IGHV4-61      IGHJ2
 vdj_v1_hs_pbmc3_TTTGCGCCATACCATG                 IGHV1-69      IGHJ6
 vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG                 IGHV3-23      IGHJ4

 [2238 rows x 2 columns])

Concatenating multiple objects

This is a simple function to concatenate (append) two or more Dandelion objects or pandas DataFrames. Note that it operates on the .data slot and not the .metadata slot.

[17]:
vdj
[17]:
Dandelion class object with n_obs = 2238 and n_contigs = 7355
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 
'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'ambiguous', 'extra', 'rearrangement_status', 'clone_id'
    metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'umi_count_B_VDJ', 'umi_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'
    layout: layout for 2112 vertices, layout for 71 vertices
    graph: networkx graph of 2112 vertices, networkx graph of 71 vertices
[18]:
# simple concatenation of the same object three times. compare the cell (n_obs) and contig (n_contigs) numbers of this object with those of vdj
vdj_concat = ddl.concat([vdj, vdj, vdj])
vdj_concat
[18]:
Dandelion class object with n_obs = 6714 and n_contigs = 22065
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 
'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'ambiguous', 'extra', 'rearrangement_status', 'clone_id'
    metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'umi_count_B_VDJ', 'umi_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'
[19]:
vdj_concat.data[["sequence_id", "cell_id"]].head()
[19]:
                                                                                     sequence_id                                 cell_id
sequence_id
sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1_0  sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1_0  sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_0
sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1_1  sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1_1  sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_1
sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1_2  sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1_2  sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_2
sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2_0  sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2_0  sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_0
sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_1_0  sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_1_0  sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_0

ddl.concat also lets you supply custom prefixes/suffixes to add to the sequence ids. If none are provided and it detects that the sequence ids are not unique, it appends _0, _1 etc. as a suffix, as seen above.
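The idea of suffixing ids only when they clash can be sketched as follows. `make_unique` is a hypothetical helper, not part of dandelion, and its `sep="_"` default is chosen only to match the `_0`, `_1`, `_2` suffixes visible in the output above.

```python
def make_unique(id_lists, suffixes=None, sep="_"):
    """Concatenate several id lists, adding suffixes only if ids clash.

    Hypothetical illustration; dandelion's own logic may differ.
    """
    flat = [i for ids in id_lists for i in ids]
    if len(set(flat)) == len(flat):
        return flat  # already unique, nothing to do
    if suffixes is None:
        # Default: number the input objects "0", "1", ...
        suffixes = [str(n) for n in range(len(id_lists))]
    # Tag every id with the suffix of the object it came from.
    return [f"{i}{sep}{s}" for ids, s in zip(id_lists, suffixes) for i in ids]


contigs = ["cellA_contig_1", "cellA_contig_2"]
print(make_unique([contigs, contigs, contigs]))
print(make_unique([contigs, contigs], suffixes=["run1", "run2"]))
print(make_unique([["cellA_contig_1"], ["cellB_contig_1"]]))
```

This sketch keeps the input order, whereas the real output above places the copies of each contig next to each other; only the naming rule is being illustrated.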

Read/write

A Dandelion object can be saved with the .write_h5ddl and .write_pkl functions, optionally with compression (e.g. gzip). write_h5ddl primarily uses the h5py library, while write_pkl simply uses pickle. The read_h5ddl and read_pkl functions read the respective file formats.

[20]:
%time vdj.write_h5ddl('dandelion_results_test.h5ddl', compression="gzip")
/Users/uqztuong/Documents/GitHub/dandelion/src/dandelion/utilities/_utilities.py:476: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
CPU times: user 6.39 s, sys: 137 ms, total: 6.53 s
Wall time: 6.79 s

If you see any warnings above, they are due to mixed dtypes somewhere in the object. Check the affected columns if you think this will interfere with downstream usage.
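One way to do that checking is to look for columns that hold more than one Python type. The sketch below uses a plain dict of lists so that it runs without any dependencies; `mixed_type_columns` is a hypothetical helper, and the column values are made up for illustration. On a pandas DataFrame such as `vdj.data`, a comparable per-column check would be, for example, `vdj.data[col].map(type).nunique() > 1`.

```python
def mixed_type_columns(table):
    """Return {column: sorted type names} for columns with more than one type.

    `table` is a plain dict mapping column name -> list of values.
    Missing values (None) are ignored.
    """
    mixed = {}
    for name, values in table.items():
        kinds = {type(v).__name__ for v in values if v is not None}
        if len(kinds) > 1:
            mixed[name] = sorted(kinds)
    return mixed


# Made-up example data, not taken from the tutorial object.
table = {
    "umi_count": [12, 7, "3"],       # int mixed with str
    "productive": ["T", "F", None],  # only str plus a missing value
}
report = mixed_type_columns(table)
print(report)
```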

[21]:
%time vdj_1 = ddl.read_h5ddl('dandelion_results_test.h5ddl')
vdj_1
CPU times: user 1.19 s, sys: 60.9 ms, total: 1.25 s
Wall time: 1.37 s
[21]:
Dandelion class object with n_obs = 2238 and n_contigs = 7355
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 
'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'ambiguous', 'extra', 'rearrangement_status', 'clone_id'
    metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'umi_count_B_VDJ', 'umi_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'
    layout: layout for 2112 vertices, layout for 71 vertices
    graph: networkx graph of 2112 vertices, networkx graph of 71 vertices

Reading and writing with pickle can be faster or slower from case to case, and the resulting files can be smaller or larger, depending on which compression is used.

[22]:
%time vdj.write_pkl('dandelion_results_test.pkl.gz')
CPU times: user 7.4 s, sys: 89.6 ms, total: 7.49 s
Wall time: 8.15 s
[23]:
%time vdj_2 = ddl.read_pkl('dandelion_results_test.pkl.gz')
vdj_2
CPU times: user 127 ms, sys: 13.7 ms, total: 141 ms
Wall time: 146 ms
[23]:
Dandelion class object with n_obs = 2238 and n_contigs = 7355
    data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 
'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'ambiguous', 'extra', 'rearrangement_status', 'clone_id'
    metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'umi_count_B_VDJ', 'umi_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'
    layout: layout for 2112 vertices, layout for 71 vertices
    graph: networkx graph of 2112 vertices, networkx graph of 71 vertices
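
To make the point about compression concrete, here is a minimal sketch that pickles the same object with no compression, with gzip and with bz2, then compares the file sizes. It uses only the Python standard library and does not call dandelion. The `payload` dictionary and the file names are stand-ins chosen for illustration, and nothing here is meant to describe how `write_pkl` is implemented.

```python
import bz2
import gzip
import os
import pickle
import tempfile

# Stand-in payload (an assumption for illustration); a real Dandelion
# object holds far more data than this.
payload = {
    "sequence_id": [f"contig_{i}" for i in range(10_000)],
    "umi_count": list(range(10_000)),
}

# Map an illustrative file name to the function used to open it.
openers = {
    "plain.pkl": open,
    "compressed.pkl.gz": gzip.open,
    "compressed.pkl.bz2": bz2.open,
}

sizes = {}
with tempfile.TemporaryDirectory() as tmpdir:
    for name, opener in openers.items():
        path = os.path.join(tmpdir, name)
        # Write the pickle through the chosen (de)compressor.
        with opener(path, "wb") as handle:
            pickle.dump(payload, handle, protocol=pickle.HIGHEST_PROTOCOL)
        # Read it back to confirm the round trip is lossless.
        with opener(path, "rb") as handle:
            restored = pickle.load(handle)
        assert restored == payload
        sizes[name] = os.path.getsize(path)

for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

Wrapping each step in `%time` or `time.perf_counter()` shows the other half of the trade-off: stronger compression usually costs more time on write, as in the timings above.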

There are also other writing functions, such as .write_airr and .write_10x, which write the object to a .tsv or .csv file that is compatible with the AIRR and 10x formats, respectively.

[24]:
import pandas as pd

vdj2.write_airr("test.airr.tsv")
df = pd.read_csv("test.airr.tsv", sep="\t")
df
[24]:
sequence_id sequence rev_comp productive v_call d_call j_call sequence_alignment germline_alignment junction ... j_call_multimappers j_call_multiplicity j_call_sequence_start_multimappers j_call_sequence_end_multimappers j_call_support_multimappers mu_count ambiguous extra rearrangement_status clone_id
0 sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1 TGGGGAGGAGTCAGTCCCAACCAGGACACGGCCTGGACATGAGGGT... F T IGKV1-33*01,IGKV1D-33*01 NaN IGKJ4*01 GACATCCAGATGACCCAGTCTCCATCCTCCCTGTCTGCATCTGTGG... GACATCCAGATGACCCAGTCTCCATCCTCCCTGTCTGCATCTGTAG... TGTCAACAATATGACGAACTTCCCGTCACTTTC ... IGKJ4*01 1.0 385.0 412.0 3.56e-09 27 F F standard B_VJ_76_2_7
1 sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2 ATCACATAACAACCACATTCCTCCTCTAAAGAAGCCCCTGGGAGCA... F T IGHV1-69*01,IGHV1-69D*01 IGHD3-22*01 IGHJ3*02 CAGGTGCAGCTGGTGCAGTCTGGGGCT...GAGGTGAAGAAGCCTG... CAGGTGCAGCTGGTGCAGTCTGGGGCT...GAGGTGAAGAAGCCTG... TGTGCGACTACGTATTACTATGATAGTAGTGGTTATTACCAGAATG... ... IGHJ3*02 1.0 445.0 494.0 4.5799999999999995e-23 0 F F standard B_VDJ_191_3_2_VJ_185_2_3
2 sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_1 AGGAGTCAGACCCTGTCAGGACACAGCATAGACATGAGGGTCCCCG... F T IGKV1-8*01 NaN IGKJ1*01 GCCATCCGGATGACCCAGTCTCCATCCTCATTCTCTGCATCTACAG... GCCATCCGGATGACCCAGTCTCCATCCTCATTCTCTGCATCTACAG... TGTCAACAGTATTATAGTTACCCTCGGACGTTC ... IGKJ1*01 1.0 380.0 415.0 2.7e-15 0 F F standard B_VDJ_191_3_2_VJ_185_2_3
3 sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_1 ACTGTGGGGGTAAGAGGTTGTGTCCACCATGGCCTGGACTCCTCTC... F T IGLV5-45*02 NaN IGLJ3*02 CAGGCTGTGCTGACTCAGCCGTCTTCC...CTCTCTGCATCTCCTG... CAGGCTGTGCTGACTCAGCCGTCTTCC...CTCTCTGCATCTCCTG... TGTATGATTTGGCACAGCAGCGCTTGGGTGGTC ... IGLJ3*01 1.0 402.0 431.0 6.84e-12 8 F F standard B_VDJ_9_1_2_VJ_153_1_1
4 sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_2 GGGAGCATCACCCAGCAACCACATCTGTCCTCTAGAGAATCCCCTG... F T IGHV1-2*02 NaN IGHJ3*02 CAGGTGCAACTGGTGCAGTCTGGGGGT...GAGGTAAAGAAGCCTG... CAGGTGCAGCTGGTGCAGTCTGGGGCT...GAGGTGAAGAAGCCTG... TGTGCGAGAGAGATAGAGGGGGACGGTGTTTTTGAAATCTGG ... IGHJ3*02 1.0 433.0 479.0 4.48e-18 22 F F standard B_VDJ_9_1_2_VJ_153_1_1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7350 vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC_contig_1 ATCATCCAACAACCACATCCCTTCTCTACAGAAGCCTCTGAGAGGA... F T IGHV1-46*01 IGHD2-15*01 IGHJ5*02 CAGGTGCAGCTGGTGCAGTCTGGGGCT...GAGGTGAAGAAGCCTG... CAGGTGCAGCTGGTGCAGTCTGGGGCT...GAGGTGAAGAAGCCTG... TGTGCGAGAGAGGGATATTGTAGTGGTGGTAGCTGCTACTCCCCCG... ... IGHJ5*02 1.0 461 506 7.83e-21 0 T T standard NaN
7351 vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_2 ATCACATAACAACCACATTCCTCCTCTAAAGAAGCCCCTGGGAGCA... F T IGHV1-69*01,IGHV1-69D*01 IGHD2-15*01 IGHJ6*02 CAGGTGCAGCTGGTGCAGTCTGGGGCT...GAGGTGAAGAAGCCTG... CAGGTGCAGCTGGTGCAGTCTGGGGCT...GAGGTGAAGAAGCCTG... TGTGCGAGATCTCTGGATATTGTAGTGGTGGTAGCACTCTACTACT... ... IGHJ6*02 1.0 439 497 4.57e-28 0 F F standard B_VDJ_48_4_2_VJ_50_3_5
7352 vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_1 AGCTTCAGCTGTGGTAGAGAAGACAGGATTCAGGACAATCTCCAGC... F T IGLV1-47*01 NaN IGLJ3*02 CAGTCTGTGCTGACTCAGCCACCCTCA...GCGTCTGGGACCCCCG... CAGTCTGTGCTGACTCAGCCACCCTCA...GCGTCTGGGACCCCCG... TGTGCAGCATGGGATGACAGCCTGAGTGGTTGGGTGTTC ... IGLJ3*02 1.0 397 434 2.46e-16 0 F F standard B_VDJ_48_4_2_VJ_50_3_5
7353 vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_2 GGCTGGGGTCTCAGGAGGCAGCACTCTCGGGACGTCTCCACCATGG... F T IGLV2-11*01 NaN IGLJ2*01,IGLJ3*01,IGLJ3*02 CAGTCTGCCCTGACTCAGCCTCGCTCA...GTGTCCGGGTCTCCTG... CAGTCTGCCCTGACTCAGCCTCGCTCA...GTGTCCGGGTCTCCTG... TGCTGCTCATATGCAGGCAGCTACACTGTGTTTTTC ... IGLJ3*01 1.0 393 430 2.46e-11 4 F F standard B_VDJ_117_5_3_VJ_102_3_4
7354 vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_1 AGCTCTGAGAGAGGAGCCCAGCCCTGGGATTTTCAGGTGTTTTCAT... F T IGHV3-23*01,IGHV3-23D*01 NaN IGHJ4*02 GAGGTGCAGGTGTTGGAGTCTGGGGGA...GGCTTGGAACAGCCTG... GAGGTGCAGCTGTTGGAGTCTGGGGGA...GGCTTGGTACAGCCTG... TGTGCGGGGAGTCGGTGGTTATATTCTTTTGACTACTGG ... IGHJ4*02 1.0 449 491 1.65e-17 8 F F standard B_VDJ_117_5_3_VJ_102_3_4

7355 rows × 124 columns
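
The table above shows two conventions of the AIRR format: boolean fields such as `rev_comp` and `productive` are stored as `T`/`F`, and ambiguous gene calls are comma-separated. As a minimal sketch, the snippet below parses a small AIRR-style TSV with the standard library and converts both. The two records copy a few columns from the output above; the helper `parse_airr_bool` is a hypothetical name, not part of dandelion, and the empty `d_call` field is an assumption about how the `NaN` shown above is written to disk.

```python
import csv
import io

# Two records with a subset of the columns shown above, tab-separated.
airr_text = (
    "sequence_id\trev_comp\tproductive\tv_call\td_call\tj_call\n"
    "sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1\tF\tT\t"
    "IGKV1-33*01,IGKV1D-33*01\t\tIGKJ4*01\n"
    "sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2\tF\tT\t"
    "IGHV1-69*01,IGHV1-69D*01\tIGHD3-22*01\tIGHJ3*02\n"
)


def parse_airr_bool(value):
    """Hypothetical helper: map 'T'/'F' to bool, anything else to None."""
    return {"T": True, "F": False}.get(value)


rows = []
for record in csv.DictReader(io.StringIO(airr_text), delimiter="\t"):
    record["rev_comp"] = parse_airr_bool(record["rev_comp"])
    record["productive"] = parse_airr_bool(record["productive"])
    # Ambiguous calls are comma-separated; split them into a list.
    record["v_call"] = record["v_call"].split(",")
    rows.append(record)

for record in rows:
    print(record["sequence_id"], record["productive"], record["v_call"])
```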

[25]:
vdj2.write_10x(
    folder="10x_test",
    filename_prefix="all",
)  # this writes both the contig_annotations.csv and contig.fasta files
df = pd.read_csv("10x_test/all_contig_annotations.csv")
df
[25]:
barcode contig_id length chain v_gene d_gene j_gene c_gene full_length productive cdr3 cdr3_nt reads umis raw_clonotype_id raw_consensus_id
0 sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1 556 IGK IGKV1-33*01,IGKV1D-33*01 NaN IGKJ4*01 IGKC NaN True CQQYDELPVTF TGTCAACAATATGACGAACTTCCCGTCACTTTC 9139 68 B_VJ_76_2_7 B_VJ_76_2_7
1 sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2 565 IGH IGHV1-69*01,IGHV1-69D*01 IGHD3-22*01 IGHJ3*02 IGHM NaN True CATTYYYDSSGYYQNDAFDIW TGTGCGACTACGTATTACTATGATAGTAGTGGTTATTACCAGAATG... 4161 51 B_VDJ_191_3_2_VJ_185_2_3 B_VDJ_191_3_2_VJ_185_2_3
2 sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_1 551 IGK IGKV1-8*01 NaN IGKJ1*01 IGKC NaN True CQQYYSYPRTF TGTCAACAGTATTATAGTTACCCTCGGACGTTC 5679 43 B_VDJ_191_3_2_VJ_185_2_3 B_VDJ_191_3_2_VJ_185_2_3
3 sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_1 642 IGL IGLV5-45*02 NaN IGLJ3*02 IGLC3 NaN True CMIWHSSAWVV TGTATGATTTGGCACAGCAGCGCTTGGGTGGTC 13160 90 B_VDJ_9_1_2_VJ_153_1_1 B_VDJ_9_1_2_VJ_153_1_1
4 sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_2 550 IGH IGHV1-2*02 NaN IGHJ3*02 IGHM NaN True CAREIEGDGVFEIW TGTGCGAGAGAGATAGAGGGGGACGGTGTTTTTGAAATCTGG 5080 47 B_VDJ_9_1_2_VJ_153_1_1 B_VDJ_9_1_2_VJ_153_1_1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7350 vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC_contig_1 577 IGH IGHV1-46*01 IGHD2-15*01 IGHJ5*02 IGHM NaN True CAREGYCSGGSCYSPDPNNGWFDPW TGTGCGAGAGAGGGATATTGTAGTGGTGGTAGCTGCTACTCCCCCG... 2960 28 NaN NaN
7351 vdj_v1_hs_pbmc3_TTTGCGCCATACCATG vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_2 568 IGH IGHV1-69*01,IGHV1-69D*01 IGHD2-15*01 IGHJ6*02 IGHM NaN True CARSLDIVVVVALYYYYGMDVW TGTGCGAGATCTCTGGATATTGTAGTGGTGGTAGCACTCTACTACT... 2464 32 B_VDJ_48_4_2_VJ_50_3_5 B_VDJ_48_4_2_VJ_50_3_5
7352 vdj_v1_hs_pbmc3_TTTGCGCCATACCATG vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_1 645 IGL IGLV1-47*01 NaN IGLJ3*02 IGLC3 NaN True CAAWDDSLSGWVF TGTGCAGCATGGGATGACAGCCTGAGTGGTTGGGTGTTC 2457 28 B_VDJ_48_4_2_VJ_50_3_5 B_VDJ_48_4_2_VJ_50_3_5
7353 vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_2 641 IGL IGLV2-11*01 NaN IGLJ2*01,IGLJ3*01,IGLJ3*02 IGLC NaN True CCSYAGSYTVFF TGCTGCTCATATGCAGGCAGCTACACTGTGTTTTTC 2744 36 B_VDJ_117_5_3_VJ_102_3_4 B_VDJ_117_5_3_VJ_102_3_4
7354 vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_1 562 IGH IGHV3-23*01,IGHV3-23D*01 NaN IGHJ4*02 IGHM NaN True CAGSRWLYSFDYW TGTGCGGGGAGTCGGTGGTTATATTCTTTTGACTACTGG 1915 22 B_VDJ_117_5_3_VJ_102_3_4 B_VDJ_117_5_3_VJ_102_3_4

7355 rows × 16 columns
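
In the 10x-style table each row is one contig, and contigs from the same cell share a `barcode`. As a rough sketch of how such a file can be summarised per cell, the snippet below groups the chains by barcode using only the standard library. The three records copy the `barcode`, `contig_id`, `chain` and `umis` values from the first rows of the output above; the variable names are illustrative only.

```python
import csv
import io
from collections import defaultdict

# First three contigs from the output above, reduced to four columns.
csv_text = """barcode,contig_id,chain,umis
sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG,sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1,IGK,68
sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC,sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2,IGH,51
sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC,sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_1,IGK,43
"""

# Collect (chain, UMI count) pairs for every cell barcode.
chains_per_cell = defaultdict(list)
for row in csv.DictReader(io.StringIO(csv_text)):
    chains_per_cell[row["barcode"]].append((row["chain"], int(row["umis"])))

for barcode, chains in chains_per_cell.items():
    print(barcode, chains)
```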
