Dandelion class
Much of the functionality of the dandelion package revolves around the Dandelion class object. The class acts as an intermediary object for storage and for flexible interaction with other tools. This section gives a quick primer on the Dandelion class.
Import modules
[1]:
import os
os.chdir(os.path.expanduser("~/Downloads/dandelion_tutorial/"))
import dandelion as ddl
ddl.logging.print_versions()
dandelion==0.5.5.dev16 pandas==2.2.3 numpy==2.1.3 matplotlib==3.10.1 networkx==3.4.2 scipy==1.15.2
[2]:
vdj = ddl.read_h5ddl("dandelion_results.h5ddl")
# let's run find_clones again as this was not stored.
ddl.tl.find_clones(vdj)
vdj
Finding clones based on B cell VDJ chains : 100%|██████████| 222/222 [00:00<00:00, 3567.62it/s]
Finding clones based on B cell VJ chains : 100%|██████████| 209/209 [00:00<00:00, 5862.71it/s]
Refining clone assignment based on VJ chain pairing : 100%|██████████| 2238/2238 [00:00<00:00, 647413.78it/s]
[2]:
Dandelion class object with n_obs = 2238 and n_contigs = 7355
data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 
'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'ambiguous', 'extra', 'rearrangement_status', 'clone_id'
metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'umi_count_B_VDJ', 'umi_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'
layout: layout for 2112 vertices, layout for 71 vertices
graph: networkx graph of 2112 vertices, networkx graph of 71 vertices
Essentially, the .data slot holds the AIRR contig table (one row per contig), while the .metadata slot holds a collapsed, per-cell version that can be combined with AnnData’s .obs slot. You can access these slots like ordinary attributes of the object; for example, to get the metadata:
[3]:
vdj.metadata
[3]:
| | clone_id | clone_id_by_size | sample_id | locus_VDJ | locus_VJ | productive_VDJ | productive_VJ | v_call_genotyped_VDJ | d_call_VDJ | j_call_VDJ | ... | d_call_B_VDJ_main | j_call_B_VDJ_main | v_call_B_VJ_main | j_call_B_VJ_main | isotype | isotype_status | locus_status | chain_status | rearrangement_status_VDJ | rearrangement_status_VJ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG | B_VJ_76_2_7 | 169 | sc5p_v2_hs_PBMC_10k | None | IGK | None | T | None | None | None | ... | None | None | IGKV1D-33,IGKV1-33 | IGKJ4 | | | Orphan IGK | Orphan VJ | None | standard |
| sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC | B_VDJ_191_3_2_VJ_185_2_3 | 1988 | sc5p_v2_hs_PBMC_10k | IGH | IGK | T | T | IGHV1-69,IGHV1-69D | IGHD3-22 | IGHJ3 | ... | IGHD3-22 | IGHJ3 | IGKV1-8 | IGKJ1 | IgM | IgM | IGH + IGK | Single pair | standard | standard |
| sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG | B_VDJ_9_1_2_VJ_153_1_1 | 1602 | sc5p_v2_hs_PBMC_10k | IGH | IGL | T | T | IGHV1-2 | None | IGHJ3 | ... | None | IGHJ3 | IGLV5-45 | IGLJ3 | IgM | IgM | IGH + IGL | Single pair | standard | standard |
| sc5p_v2_hs_PBMC_10k_AAACCTGTCTTGAGAC | B_VDJ_92_4_2_VJ_47_1_1 | 1603 | sc5p_v2_hs_PBMC_10k | IGH | IGK | T | T | IGHV5-51 | None | IGHJ3 | ... | None | IGHJ3 | IGKV1D-8 | IGKJ2 | IgM | IgM | IGH + IGK | Single pair | standard | standard |
| sc5p_v2_hs_PBMC_10k_AAACGGGAGCGACGTA | B_VDJ_15_2_1_VJ_83_2_6 | 1604 | sc5p_v2_hs_PBMC_10k | IGH | IGL | T | T | IGHV4-4 | IGHD6-13 | IGHJ3 | ... | IGHD6-13 | IGHJ3 | IGLV3-19 | IGLJ2,IGLJ3 | IgM | IgM | IGH + IGL | Single pair | standard | standard |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| vdj_v1_hs_pbmc3_TTTCCTCAGCAATATG | B_VDJ_61_2_1_VJ_129_2_7 | 812 | vdj_v1_hs_pbmc3 | IGH | IGK | T | T | IGHV2-5 | IGHD5/OR15-5a,IGHD5/OR15-5b | IGHJ4,IGHJ5 | ... | IGHD5/OR15-5a,IGHD5/OR15-5b | IGHJ4,IGHJ5 | IGKV4-1 | IGKJ4 | IgM | IgM | IGH + IGK | Single pair | standard | standard |
| vdj_v1_hs_pbmc3_TTTCCTCAGCGCTTAT | B_VDJ_37_5_2_VJ_49_1_3 | 813 | vdj_v1_hs_pbmc3 | IGH | IGK | T | T | IGHV3-30 | IGHD4-17 | IGHJ6 | ... | IGHD4-17 | IGHJ6 | IGKV2-30 | IGKJ2 | IgM | IgM | IGH + IGK | Single pair | standard | standard |
| vdj_v1_hs_pbmc3_TTTCCTCAGGGAAACA | B_VDJ_145_1_1_VJ_35_4_14 | 814 | vdj_v1_hs_pbmc3 | IGH | IGK | T | T | IGHV4-61 | IGHD6-13 | IGHJ2 | ... | IGHD6-13 | IGHJ2 | IGKV1-39,IGKV1D-39 | IGKJ1 | IgM | IgM | IGH + IGK | Single pair | standard | standard |
| vdj_v1_hs_pbmc3_TTTGCGCCATACCATG | B_VDJ_48_4_2_VJ_50_3_5 | 815 | vdj_v1_hs_pbmc3 | IGH | IGL | T | T | IGHV1-69,IGHV1-69D | IGHD2-15 | IGHJ6 | ... | IGHD2-15 | IGHJ6 | IGLV1-47 | IGLJ3 | IgM | IgM | IGH + IGL | Single pair | standard | standard |
| vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG | B_VDJ_117_5_3_VJ_102_3_4 | 2388 | vdj_v1_hs_pbmc3 | IGH | IGL | T | T | IGHV3-23,IGHV3-23D | None | IGHJ4 | ... | None | IGHJ4 | IGLV2-11 | IGLJ2,IGLJ3 | IgM | IgM | IGH + IGL | Single pair | standard | standard |
2238 rows × 47 columns
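To make the relationship between the two slots more concrete, here is a minimal sketch in plain Python of how a contig-level table could be collapsed into one row per cell. It is not dandelion's actual implementation: the toy rows, the `collapse_to_cells` helper, and the rule that assigns a locus to the VDJ or VJ chain are all assumptions made for illustration.

```python
from collections import defaultdict

# Toy contig-level rows, loosely mimicking a few AIRR columns (hypothetical values).
contigs = [
    {"sequence_id": "cellA_contig_1", "cell_id": "cellA", "locus": "IGH", "v_call": "IGHV1-69"},
    {"sequence_id": "cellA_contig_2", "cell_id": "cellA", "locus": "IGK", "v_call": "IGKV1-8"},
    {"sequence_id": "cellB_contig_1", "cell_id": "cellB", "locus": "IGK", "v_call": "IGKV4-1"},
]

# Assumption for this sketch: these loci count as "VDJ" chains, everything else as "VJ".
VDJ_LOCI = {"IGH", "TRB", "TRD"}


def collapse_to_cells(rows):
    """Collapse contig rows into one dict per cell with v_call_VDJ / v_call_VJ columns."""
    per_cell = defaultdict(lambda: {"v_call_VDJ": [], "v_call_VJ": []})
    for row in rows:
        chain = "VDJ" if row["locus"] in VDJ_LOCI else "VJ"
        per_cell[row["cell_id"]][f"v_call_{chain}"].append(row["v_call"])
    # Join several contigs of the same chain type with "|"; use None when a chain is absent.
    return {
        cell: {col: "|".join(vals) if vals else None for col, vals in cols.items()}
        for cell, cols in per_cell.items()
    }


metadata = collapse_to_cells(contigs)
for cell, cols in metadata.items():
    print(cell, cols)
```

Three contig rows become two cell rows here, which mirrors why n_contigs is larger than n_obs in the object summary.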
slicing
You can slice the Dandelion object through the index of either the .data or the .metadata slot, for example with a boolean mask; the behavior is similar to slicing a pandas DataFrame or an AnnData object.
slicing .data
[4]:
# get the largest clone
largest_clone = vdj.data["clone_id"].value_counts().idxmax()
vdj[vdj.data["clone_id"] == largest_clone]
[4]:
Dandelion class object with n_obs = 566 and n_contigs = 2802
data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 
'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'ambiguous', 'extra', 'rearrangement_status', 'clone_id'
metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'umi_count_B_VDJ', 'umi_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'
layout: layout for 547 vertices, layout for 9 vertices
graph: networkx graph of 547 vertices, networkx graph of 9 vertices
[5]:
vdj[
    vdj.data_names.isin(
        [
            "sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1",
            "sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2",
            "sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_1",
            "sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_1",
            "sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_2",
        ]
    )
]
[5]:
Dandelion class object with n_obs = 3 and n_contigs = 5
data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 
'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'ambiguous', 'extra', 'rearrangement_status', 'clone_id'
metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'umi_count_B_VDJ', 'umi_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'
layout: layout for 2 vertices, layout for 0 vertices
graph: networkx graph of 2 vertices, networkx graph of 0 vertices
slicing .metadata
[6]:
vdj[vdj.metadata["productive_VDJ"].isin(["T", "T|T"])]
[6]:
Dandelion class object with n_obs = 2112 and n_contigs = 5052
data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 
'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'ambiguous', 'extra', 'rearrangement_status', 'clone_id'
metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'umi_count_B_VDJ', 'umi_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'
layout: layout for 2112 vertices, layout for 71 vertices
graph: networkx graph of 2112 vertices, networkx graph of 71 vertices
[7]:
vdj[vdj.metadata_names == "vdj_v1_hs_pbmc3_TTTCCTCAGCGCTTAT"]
[7]:
Dandelion class object with n_obs = 1 and n_contigs = 2
data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 
'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'ambiguous', 'extra', 'rearrangement_status', 'clone_id'
metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'umi_count_B_VDJ', 'umi_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'
layout: layout for 1 vertices, layout for 0 vertices
graph: networkx graph of 1 vertices, networkx graph of 0 vertices
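The two small slicing examples above suggest how the slots stay in sync: selecting five contigs leaves the three cells that own them, and selecting one cell keeps both of its contigs. The following is a rough sketch of that bookkeeping in plain Python. The dictionaries and the two helper functions are hypothetical stand-ins, and the real implementation may differ in its details.

```python
# Toy stand-ins for the two slots (hypothetical values, not real dandelion objects).
data = {  # contig-level: sequence_id -> cell_id
    "cellA_contig_1": "cellA",
    "cellA_contig_2": "cellA",
    "cellB_contig_1": "cellB",
    "cellC_contig_1": "cellC",
    "cellC_contig_2": "cellC",
}
metadata = {"cellA": {}, "cellB": {}, "cellC": {}}  # cell-level: one entry per cell


def slice_by_contigs(data, metadata, keep_contigs):
    """Keep the chosen contigs, then keep only the cells that still own a contig."""
    new_data = {sid: cell for sid, cell in data.items() if sid in keep_contigs}
    cells = set(new_data.values())
    new_meta = {cell: v for cell, v in metadata.items() if cell in cells}
    return new_data, new_meta


def slice_by_cells(data, metadata, keep_cells):
    """Keep the chosen cells, then keep every contig belonging to those cells."""
    new_meta = {cell: v for cell, v in metadata.items() if cell in keep_cells}
    new_data = {sid: cell for sid, cell in data.items() if cell in new_meta}
    return new_data, new_meta


d1, m1 = slice_by_contigs(data, metadata, {"cellA_contig_1", "cellC_contig_2"})
print(len(m1), "cells,", len(d1), "contigs")  # 2 cells, 2 contigs

d2, m2 = slice_by_cells(data, metadata, {"cellC"})
print(len(m2), "cells,", len(d2), "contigs")  # 1 cells, 2 contigs
```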
copy
You can deep copy the Dandelion object to another variable; the copy inherits all slots:
[8]:
vdj2 = vdj.copy()
vdj2.metadata
[8]:
| | clone_id | clone_id_by_size | sample_id | locus_VDJ | locus_VJ | productive_VDJ | productive_VJ | v_call_genotyped_VDJ | d_call_VDJ | j_call_VDJ | ... | d_call_B_VDJ_main | j_call_B_VDJ_main | v_call_B_VJ_main | j_call_B_VJ_main | isotype | isotype_status | locus_status | chain_status | rearrangement_status_VDJ | rearrangement_status_VJ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG | B_VJ_76_2_7 | 169 | sc5p_v2_hs_PBMC_10k | None | IGK | None | T | None | None | None | ... | None | None | IGKV1D-33,IGKV1-33 | IGKJ4 | | | Orphan IGK | Orphan VJ | None | standard |
| sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC | B_VDJ_191_3_2_VJ_185_2_3 | 1988 | sc5p_v2_hs_PBMC_10k | IGH | IGK | T | T | IGHV1-69,IGHV1-69D | IGHD3-22 | IGHJ3 | ... | IGHD3-22 | IGHJ3 | IGKV1-8 | IGKJ1 | IgM | IgM | IGH + IGK | Single pair | standard | standard |
| sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG | B_VDJ_9_1_2_VJ_153_1_1 | 1602 | sc5p_v2_hs_PBMC_10k | IGH | IGL | T | T | IGHV1-2 | None | IGHJ3 | ... | None | IGHJ3 | IGLV5-45 | IGLJ3 | IgM | IgM | IGH + IGL | Single pair | standard | standard |
| sc5p_v2_hs_PBMC_10k_AAACCTGTCTTGAGAC | B_VDJ_92_4_2_VJ_47_1_1 | 1603 | sc5p_v2_hs_PBMC_10k | IGH | IGK | T | T | IGHV5-51 | None | IGHJ3 | ... | None | IGHJ3 | IGKV1D-8 | IGKJ2 | IgM | IgM | IGH + IGK | Single pair | standard | standard |
| sc5p_v2_hs_PBMC_10k_AAACGGGAGCGACGTA | B_VDJ_15_2_1_VJ_83_2_6 | 1604 | sc5p_v2_hs_PBMC_10k | IGH | IGL | T | T | IGHV4-4 | IGHD6-13 | IGHJ3 | ... | IGHD6-13 | IGHJ3 | IGLV3-19 | IGLJ2,IGLJ3 | IgM | IgM | IGH + IGL | Single pair | standard | standard |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| vdj_v1_hs_pbmc3_TTTCCTCAGCAATATG | B_VDJ_61_2_1_VJ_129_2_7 | 812 | vdj_v1_hs_pbmc3 | IGH | IGK | T | T | IGHV2-5 | IGHD5/OR15-5a,IGHD5/OR15-5b | IGHJ4,IGHJ5 | ... | IGHD5/OR15-5a,IGHD5/OR15-5b | IGHJ4,IGHJ5 | IGKV4-1 | IGKJ4 | IgM | IgM | IGH + IGK | Single pair | standard | standard |
| vdj_v1_hs_pbmc3_TTTCCTCAGCGCTTAT | B_VDJ_37_5_2_VJ_49_1_3 | 813 | vdj_v1_hs_pbmc3 | IGH | IGK | T | T | IGHV3-30 | IGHD4-17 | IGHJ6 | ... | IGHD4-17 | IGHJ6 | IGKV2-30 | IGKJ2 | IgM | IgM | IGH + IGK | Single pair | standard | standard |
| vdj_v1_hs_pbmc3_TTTCCTCAGGGAAACA | B_VDJ_145_1_1_VJ_35_4_14 | 814 | vdj_v1_hs_pbmc3 | IGH | IGK | T | T | IGHV4-61 | IGHD6-13 | IGHJ2 | ... | IGHD6-13 | IGHJ2 | IGKV1-39,IGKV1D-39 | IGKJ1 | IgM | IgM | IGH + IGK | Single pair | standard | standard |
| vdj_v1_hs_pbmc3_TTTGCGCCATACCATG | B_VDJ_48_4_2_VJ_50_3_5 | 815 | vdj_v1_hs_pbmc3 | IGH | IGL | T | T | IGHV1-69,IGHV1-69D | IGHD2-15 | IGHJ6 | ... | IGHD2-15 | IGHJ6 | IGLV1-47 | IGLJ3 | IgM | IgM | IGH + IGL | Single pair | standard | standard |
| vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG | B_VDJ_117_5_3_VJ_102_3_4 | 2388 | vdj_v1_hs_pbmc3 | IGH | IGL | T | T | IGHV3-23,IGHV3-23D | None | IGHJ4 | ... | None | IGHJ4 | IGLV2-11 | IGLJ2,IGLJ3 | IgM | IgM | IGH + IGL | Single pair | standard | standard |
2238 rows × 47 columns
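Why a deep copy matters can be shown with the standard library alone. The sketch below uses a nested dictionary as a stand-in for the object's slots and assumes that .copy() behaves like `copy.deepcopy`, as the description above implies; the toy data is hypothetical.

```python
import copy

# A toy object with nested, mutable state standing in for the Dandelion slots.
original = {"metadata": {"cellA": {"isotype": "IgM"}}, "data": ["cellA_contig_1"]}

alias = original                 # same object under a second name
clone = copy.deepcopy(original)  # independent copy of every nested container

# Editing the deep copy does not leak back into the original.
clone["metadata"]["cellA"]["isotype"] = "IgG"

print(original["metadata"]["cellA"]["isotype"])  # IgM
print(alias is original)  # True
print(clone is original)  # False
```

Plain assignment only creates a second name for the same object, so edits made through `alias` would change `original` as well.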
Retrieving entries with update_metadata
The .metadata slot of the Dandelion class is initialized automatically whenever the .data slot is filled. However, it only contains a standard, pre-specified set of columns. To retrieve other columns from the .data slot, update the metadata with the object's update_metadata method and specify the retrieve and retrieve_mode options.
The following modes determine how the retrieval is completed:
split and unique only - splits the retrieval into VDJ and VJ chains. A | separates unique elements.
split and merge - splits the retrieval into VDJ and VJ chains. A | separates every element.
merge and unique only - similar to split and unique only, but merged into a single column.
split - splits the retrieval into individual columns for each contig.
merge - merges the retrieval into a single column, where a | separates every element.
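The modes listed above can be illustrated with a small sketch in plain Python. This is only an interpretation of the descriptions, not dandelion's code: the `retrieve` helper, the ordering of VDJ before VJ in merged output, and the numbered column names used for the split mode are all assumptions. The toy cell carries the same VJ value twice so that the "unique only" variants visibly differ from the others.

```python
def unique_in_order(values):
    """Drop duplicates while keeping first-seen order."""
    return list(dict.fromkeys(values))


def retrieve(contigs, mode="split and merge"):
    """Combine one cell's (chain, value) pairs according to `mode`."""
    vdj = [value for chain, value in contigs if chain == "VDJ"]
    vj = [value for chain, value in contigs if chain == "VJ"]
    if mode == "split and unique only":
        return {"VDJ": "|".join(unique_in_order(vdj)), "VJ": "|".join(unique_in_order(vj))}
    if mode == "split and merge":
        return {"VDJ": "|".join(vdj), "VJ": "|".join(vj)}
    if mode == "merge and unique only":
        return "|".join(unique_in_order(vdj + vj))
    if mode == "merge":
        return "|".join(vdj + vj)
    if mode == "split":
        # One column per contig; the numbered column names are made up for this sketch.
        columns = {f"VDJ_{i}": value for i, value in enumerate(vdj, start=1)}
        columns.update({f"VJ_{i}": value for i, value in enumerate(vj, start=1)})
        return columns
    raise ValueError(f"unknown mode: {mode!r}")


# One cell with a single VDJ contig and two VJ contigs sharing the same call.
cell = [("VDJ", "IGHJ4"), ("VJ", "IGKJ1"), ("VJ", "IGKJ1")]
for mode in ("split and unique only", "split and merge", "merge and unique only", "split", "merge"):
    print(f"{mode!r}: {retrieve(cell, mode)}")
```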
For numerical columns, there are additional options:
split and sum - splits the retrieval into VDJ and VJ chains and sums each separately.
split and average - similar to split and sum, but averages instead of summing.
sum - sums the retrievals into a single column.
average - averages the retrievals into a single column.
If retrieve_mode is not specified, it defaults to split and merge.
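For the numerical modes, a comparable sketch looks like this. Again, it is an illustration of the descriptions above rather than the library's implementation: the `retrieve_numeric` helper and the choice to return None for a chain with no contigs are assumptions.

```python
def average(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def retrieve_numeric(contigs, mode):
    """Aggregate one cell's (chain, number) pairs according to a numerical mode."""
    if mode not in {"split and sum", "split and average", "sum", "average"}:
        raise ValueError(f"unknown mode: {mode!r}")
    aggregate = sum if mode.endswith("sum") else average
    vdj = [value for chain, value in contigs if chain == "VDJ"]
    vj = [value for chain, value in contigs if chain == "VJ"]
    if mode.startswith("split and"):
        # Assumption: a chain without contigs yields None instead of a number.
        return {
            "VDJ": aggregate(vdj) if vdj else None,
            "VJ": aggregate(vj) if vj else None,
        }
    values = vdj + vj
    return aggregate(values) if values else None


# One cell with a single VDJ contig and two VJ contigs (hypothetical UMI counts).
cell = [("VDJ", 10), ("VJ", 4), ("VJ", 6)]
for mode in ("split and sum", "split and average", "sum", "average"):
    print(f"{mode!r}: {retrieve_numeric(cell, mode)}")
```

Guarding against empty lists before averaging avoids taking the mean of nothing, which is the kind of situation that produces "mean of empty slice" warnings in numerical libraries.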
Example: retrieving fwr1 sequences
[9]:
vdj.update_metadata(retrieve="fwr1")
vdj
[9]:
Dandelion class object with n_obs = 2238 and n_contigs = 7355
data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 
'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'ambiguous', 'extra', 'rearrangement_status', 'clone_id'
metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'umi_count_B_VDJ', 'umi_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ', 'fwr1_VJ', 'fwr1_VDJ'
layout: layout for 2112 vertices, layout for 71 vertices
graph: networkx graph of 2112 vertices, networkx graph of 71 vertices
Note the additional fwr1_VDJ and fwr1_VJ columns in the metadata slot.
By default, dandelion will not try to merge numerical columns, as this can create mixed-dtype columns.
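One way a mixed-dtype column can arise is sketched below. It assumes, purely for illustration, that a string-style merge leaves a single value as a number but joins several values into a `|`-separated string; the `merge_numeric` helper and the toy values are hypothetical and do not describe dandelion's internals.

```python
# Per-cell np1_length values taken from each cell's contigs (hypothetical numbers).
cells = {"cellA": [3], "cellB": [5, 7], "cellC": []}


def merge_numeric(values):
    """Mimic a string-style merge: one value stays numeric, several are joined with '|'."""
    if not values:
        return None
    if len(values) == 1:
        return values[0]
    return "|".join(str(v) for v in values)


merged = {cell: merge_numeric(values) for cell, values in cells.items()}
print(merged)  # {'cellA': 3, 'cellB': '5|7', 'cellC': None}

# The resulting column mixes integers, strings and missing values.
print(sorted({type(v).__name__ for v in merged.values()}))  # ['NoneType', 'int', 'str']
```

A column like this can no longer be summed or averaged directly, which is why the sum and average modes are the safer choice for numbers.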
There is also a newer helper method, update_plus, that tries to retrieve frequently used columns such as np1_length and np2_length:
[10]:
vdj.update_plus()
vdj
[10]:
Dandelion class object with n_obs = 2238 and n_contigs = 7355
data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 
'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'ambiguous', 'extra', 'rearrangement_status', 'clone_id'
metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'umi_count_B_VDJ', 'umi_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ', 'fwr1_VJ', 'fwr1_VDJ', 'mu_count_VDJ', 'mu_count_VJ', 'mu_count', 'junction_length_VDJ', 'junction_length_VJ', 'junction_aa_length_VDJ', 'junction_aa_length_VJ', 'np1_length_VDJ', 'np1_length_VJ', 'np2_length_VDJ'
layout: layout for 2112 vertices, layout for 71 vertices
graph: networkx graph of 2112 vertices, networkx graph of 71 vertices
Renaming barcodes
You can now use simple functions to rename the barcodes, updating both the sequence ids and the cell ids at the same time. This is useful when you want to give the barcodes more meaningful names. Renaming always works from the indices that were originally used to create the Dandelion object, so running a function a second time does not stack another prefix/suffix onto the already renamed indices; it simply re-derives the names from the original indices.
[11]:
# original
print(vdj.data[["sequence_id", "cell_id"]]), print(vdj.metadata_names)
sequence_id \
sequence_id
sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1 sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1
sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2 sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2
sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_1 sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_1
sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_1 sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_1
sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_2 sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_2
... ...
vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC_contig_1 vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC_contig_1
vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_2 vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_2
vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_1 vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_1
vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_2 vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_2
vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_1 vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_1
cell_id
sequence_id
sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1 sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG
sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2 sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC
sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_1 sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC
sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_1 sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG
sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_2 sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG
... ...
vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC_contig_1 vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC
vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_2 vdj_v1_hs_pbmc3_TTTGCGCCATACCATG
vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_1 vdj_v1_hs_pbmc3_TTTGCGCCATACCATG
vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_2 vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG
vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_1 vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG
[7355 rows x 2 columns]
Index(['sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG',
'sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC',
'sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG',
'sc5p_v2_hs_PBMC_10k_AAACCTGTCTTGAGAC',
'sc5p_v2_hs_PBMC_10k_AAACGGGAGCGACGTA',
'sc5p_v2_hs_PBMC_10k_AAACGGGCACTGTTAG',
'sc5p_v2_hs_PBMC_10k_AAAGATGAGGATGCGT',
'sc5p_v2_hs_PBMC_10k_AAAGATGGTCGAATCT',
'sc5p_v2_hs_PBMC_10k_AAAGATGGTGAGGGAG',
'sc5p_v2_hs_PBMC_10k_AAAGTAGCAGATCCAT',
...
'vdj_v1_hs_pbmc3_TTGTAGGTCGCAAGCC', 'vdj_v1_hs_pbmc3_TTTACTGTCAGCTGGC',
'vdj_v1_hs_pbmc3_TTTATGCGTCAGAATA', 'vdj_v1_hs_pbmc3_TTTATGCTCAGGATCT',
'vdj_v1_hs_pbmc3_TTTATGCTCCTAGAAC', 'vdj_v1_hs_pbmc3_TTTCCTCAGCAATATG',
'vdj_v1_hs_pbmc3_TTTCCTCAGCGCTTAT', 'vdj_v1_hs_pbmc3_TTTCCTCAGGGAAACA',
'vdj_v1_hs_pbmc3_TTTGCGCCATACCATG', 'vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG'],
dtype='object', length=2238)
[11]:
(None, None)
[12]:
# let's add a 'test-' as a prefix. There's also the suffix option
vdj.add_sequence_prefix("test", sep="-")
print(vdj.data[["sequence_id", "cell_id"]]), print(vdj.metadata_names)
sequence_id \
sequence_id
test-sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1 test-sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_cont...
test-sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2 test-sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_cont...
test-sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_1 test-sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_cont...
test-sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_1 test-sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_cont...
test-sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_2 test-sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_cont...
... ...
test-vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC_contig_1 test-vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC_contig_1
test-vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_2 test-vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_2
test-vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_1 test-vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_1
test-vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_2 test-vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_2
test-vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_1 test-vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_1
cell_id
sequence_id
test-sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1 test-sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG
test-sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2 test-sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC
test-sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_1 test-sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC
test-sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_1 test-sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG
test-sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_2 test-sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG
... ...
test-vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC_contig_1 test-vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC
test-vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_2 test-vdj_v1_hs_pbmc3_TTTGCGCCATACCATG
test-vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_1 test-vdj_v1_hs_pbmc3_TTTGCGCCATACCATG
test-vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_2 test-vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG
test-vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_1 test-vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG
[7355 rows x 2 columns]
Index(['test-sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG',
'test-sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC',
'test-sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG',
'test-sc5p_v2_hs_PBMC_10k_AAACCTGTCTTGAGAC',
'test-sc5p_v2_hs_PBMC_10k_AAACGGGAGCGACGTA',
'test-sc5p_v2_hs_PBMC_10k_AAACGGGCACTGTTAG',
'test-sc5p_v2_hs_PBMC_10k_AAAGATGAGGATGCGT',
'test-sc5p_v2_hs_PBMC_10k_AAAGATGGTCGAATCT',
'test-sc5p_v2_hs_PBMC_10k_AAAGATGGTGAGGGAG',
'test-sc5p_v2_hs_PBMC_10k_AAAGTAGCAGATCCAT',
...
'test-vdj_v1_hs_pbmc3_TTGTAGGTCGCAAGCC',
'test-vdj_v1_hs_pbmc3_TTTACTGTCAGCTGGC',
'test-vdj_v1_hs_pbmc3_TTTATGCGTCAGAATA',
'test-vdj_v1_hs_pbmc3_TTTATGCTCAGGATCT',
'test-vdj_v1_hs_pbmc3_TTTATGCTCCTAGAAC',
'test-vdj_v1_hs_pbmc3_TTTCCTCAGCAATATG',
'test-vdj_v1_hs_pbmc3_TTTCCTCAGCGCTTAT',
'test-vdj_v1_hs_pbmc3_TTTCCTCAGGGAAACA',
'test-vdj_v1_hs_pbmc3_TTTGCGCCATACCATG',
'test-vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG'],
dtype='object', length=2238)
[12]:
(None, None)
[13]:
# same functionality as above
vdj.add_cell_prefix("test2", sep="_")
print(vdj.data[["sequence_id", "cell_id"]]), print(vdj.metadata_names)
sequence_id \
sequence_id
test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_cont... test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_con...
test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_cont... test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_con...
test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_cont... test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_con...
test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_cont... test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_con...
test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_cont... test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_con...
... ...
test2_vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC_contig_1 test2_vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC_contig_1
test2_vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_2 test2_vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_2
test2_vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_1 test2_vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_1
test2_vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_2 test2_vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_2
test2_vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_1 test2_vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_1
cell_id
sequence_id
test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_cont... test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG
test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_cont... test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC
test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_cont... test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC
test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_cont... test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG
test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_cont... test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG
... ...
test2_vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC_contig_1 test2_vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC
test2_vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_2 test2_vdj_v1_hs_pbmc3_TTTGCGCCATACCATG
test2_vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_1 test2_vdj_v1_hs_pbmc3_TTTGCGCCATACCATG
test2_vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_2 test2_vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG
test2_vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_1 test2_vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG
[7355 rows x 2 columns]
Index(['test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG',
'test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC',
'test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG',
'test2_sc5p_v2_hs_PBMC_10k_AAACCTGTCTTGAGAC',
'test2_sc5p_v2_hs_PBMC_10k_AAACGGGAGCGACGTA',
'test2_sc5p_v2_hs_PBMC_10k_AAACGGGCACTGTTAG',
'test2_sc5p_v2_hs_PBMC_10k_AAAGATGAGGATGCGT',
'test2_sc5p_v2_hs_PBMC_10k_AAAGATGGTCGAATCT',
'test2_sc5p_v2_hs_PBMC_10k_AAAGATGGTGAGGGAG',
'test2_sc5p_v2_hs_PBMC_10k_AAAGTAGCAGATCCAT',
...
'test2_vdj_v1_hs_pbmc3_TTGTAGGTCGCAAGCC',
'test2_vdj_v1_hs_pbmc3_TTTACTGTCAGCTGGC',
'test2_vdj_v1_hs_pbmc3_TTTATGCGTCAGAATA',
'test2_vdj_v1_hs_pbmc3_TTTATGCTCAGGATCT',
'test2_vdj_v1_hs_pbmc3_TTTATGCTCCTAGAAC',
'test2_vdj_v1_hs_pbmc3_TTTCCTCAGCAATATG',
'test2_vdj_v1_hs_pbmc3_TTTCCTCAGCGCTTAT',
'test2_vdj_v1_hs_pbmc3_TTTCCTCAGGGAAACA',
'test2_vdj_v1_hs_pbmc3_TTTGCGCCATACCATG',
'test2_vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG'],
dtype='object', length=2238)
[13]:
(None, None)
[14]:
# you can also reset the ids
vdj.reset_ids()
print(vdj.data[["sequence_id", "cell_id"]]), print(vdj.metadata_names)
sequence_id \
sequence_id
sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1 sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1
sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2 sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2
sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_1 sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_1
sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_1 sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_1
sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_2 sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_2
... ...
vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC_contig_1 vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC_contig_1
vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_2 vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_2
vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_1 vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_1
vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_2 vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_2
vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_1 vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_1
cell_id
sequence_id
sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1 sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG
sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2 sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC
sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_1 sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC
sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_1 sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG
sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_2 sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG
... ...
vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC_contig_1 vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC
vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_2 vdj_v1_hs_pbmc3_TTTGCGCCATACCATG
vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_1 vdj_v1_hs_pbmc3_TTTGCGCCATACCATG
vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_2 vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG
vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_1 vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG
[7355 rows x 2 columns]
Index(['sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG',
'sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC',
'sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG',
'sc5p_v2_hs_PBMC_10k_AAACCTGTCTTGAGAC',
'sc5p_v2_hs_PBMC_10k_AAACGGGAGCGACGTA',
'sc5p_v2_hs_PBMC_10k_AAACGGGCACTGTTAG',
'sc5p_v2_hs_PBMC_10k_AAAGATGAGGATGCGT',
'sc5p_v2_hs_PBMC_10k_AAAGATGGTCGAATCT',
'sc5p_v2_hs_PBMC_10k_AAAGATGGTGAGGGAG',
'sc5p_v2_hs_PBMC_10k_AAAGTAGCAGATCCAT',
...
'vdj_v1_hs_pbmc3_TTGTAGGTCGCAAGCC', 'vdj_v1_hs_pbmc3_TTTACTGTCAGCTGGC',
'vdj_v1_hs_pbmc3_TTTATGCGTCAGAATA', 'vdj_v1_hs_pbmc3_TTTATGCTCAGGATCT',
'vdj_v1_hs_pbmc3_TTTATGCTCCTAGAAC', 'vdj_v1_hs_pbmc3_TTTCCTCAGCAATATG',
'vdj_v1_hs_pbmc3_TTTCCTCAGCGCTTAT', 'vdj_v1_hs_pbmc3_TTTCCTCAGGGAAACA',
'vdj_v1_hs_pbmc3_TTTGCGCCATACCATG', 'vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG'],
dtype='object', length=2238)
[14]:
(None, None)
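The renaming behaviour shown in this section (each new prefix replaces the previous one, and the ids can be reset) can be sketched with a small toy class. This is only an illustration of the idea of always deriving names from the stored original ids; `BarcodeRenamer` and its methods are hypothetical and are not part of dandelion.

```python
class BarcodeRenamer:
    """Toy container that remembers the ids it was created with."""

    def __init__(self, cell_ids):
        self._original = list(cell_ids)  # never modified after creation
        self.cell_ids = list(cell_ids)   # what the user currently sees

    def add_prefix(self, prefix, sep="_"):
        # Always rebuild from the original ids, so a repeated call replaces
        # the previous prefix instead of stacking on top of it.
        self.cell_ids = [f"{prefix}{sep}{cid}" for cid in self._original]

    def reset_ids(self):
        # Go back to the ids the object was created with.
        self.cell_ids = list(self._original)


renamer = BarcodeRenamer(["AAACCTG", "TTTGGTT"])
renamer.add_prefix("test", sep="-")
renamer.add_prefix("test2", sep="_")
print(renamer.cell_ids)  # ['test2_AAACCTG', 'test2_TTTGGTT']
renamer.reset_ids()
print(renamer.cell_ids)  # ['AAACCTG', 'TTTGGTT']
```

Keeping an untouched copy of the original ids is what makes both the non-stacking prefixes and the reset possible.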
Simplifying the V/D/J/C call annotations
Sometimes the V/D/J/C call annotations can be quite verbose. You can simplify them with the .simplify() function. For calls that list several comma-separated genes, this function keeps only the first element, and it also strips the alleles (for example, IGKV1-33*01,IGKV1D-33*01 becomes IGKV1-33). This is useful when you want to simplify the V/D/J/C calls for plotting purposes.
[15]:
# before
(
vdj.data[["v_call_genotyped", "j_call"]],
vdj.metadata[["v_call_genotyped_VDJ", "j_call_VDJ"]],
)
[15]:
( v_call_genotyped \
sequence_id
sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1 IGKV1-33*01,IGKV1D-33*01
sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2 IGHV1-69*01,IGHV1-69D*01
sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_1 IGKV1-8*01
sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_1 IGLV5-45*02
sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_2 IGHV1-2*02
... ...
vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC_contig_1 IGHV1-46*01
vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_2 IGHV1-69*01,IGHV1-69D*01
vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_1 IGLV1-47*01
vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_2 IGLV2-11*01
vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_1 IGHV3-23*01,IGHV3-23D*01
j_call
sequence_id
sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1 IGKJ4*01
sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2 IGHJ3*02
sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_1 IGKJ1*01
sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_1 IGLJ3*02
sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_2 IGHJ3*02
... ...
vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC_contig_1 IGHJ5*02
vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_2 IGHJ6*02
vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_1 IGLJ3*02
vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_2 IGLJ2*01,IGLJ3*01,IGLJ3*02
vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_1 IGHJ4*02
[7355 rows x 2 columns],
v_call_genotyped_VDJ j_call_VDJ
sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG None None
sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC IGHV1-69,IGHV1-69D IGHJ3
sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG IGHV1-2 IGHJ3
sc5p_v2_hs_PBMC_10k_AAACCTGTCTTGAGAC IGHV5-51 IGHJ3
sc5p_v2_hs_PBMC_10k_AAACGGGAGCGACGTA IGHV4-4 IGHJ3
... ... ...
vdj_v1_hs_pbmc3_TTTCCTCAGCAATATG IGHV2-5 IGHJ4,IGHJ5
vdj_v1_hs_pbmc3_TTTCCTCAGCGCTTAT IGHV3-30 IGHJ6
vdj_v1_hs_pbmc3_TTTCCTCAGGGAAACA IGHV4-61 IGHJ2
vdj_v1_hs_pbmc3_TTTGCGCCATACCATG IGHV1-69,IGHV1-69D IGHJ6
vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG IGHV3-23,IGHV3-23D IGHJ4
[2238 rows x 2 columns])
[16]:
# after
vdj.simplify()
(
vdj.data[["v_call_genotyped", "j_call"]],
vdj.metadata[["v_call_genotyped_VDJ", "j_call_VDJ"]],
)
[16]:
( v_call_genotyped j_call
sequence_id
sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1 IGKV1-33 IGKJ4
sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2 IGHV1-69 IGHJ3
sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_1 IGKV1-8 IGKJ1
sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_1 IGLV5-45 IGLJ3
sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_2 IGHV1-2 IGHJ3
... ... ...
vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC_contig_1 IGHV1-46 IGHJ5
vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_2 IGHV1-69 IGHJ6
vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_1 IGLV1-47 IGLJ3
vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_2 IGLV2-11 IGLJ2
vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_1 IGHV3-23 IGHJ4
[7355 rows x 2 columns],
v_call_genotyped_VDJ j_call_VDJ
sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG None None
sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC IGHV1-69 IGHJ3
sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG IGHV1-2 IGHJ3
sc5p_v2_hs_PBMC_10k_AAACCTGTCTTGAGAC IGHV5-51 IGHJ3
sc5p_v2_hs_PBMC_10k_AAACGGGAGCGACGTA IGHV4-4 IGHJ3
... ... ...
vdj_v1_hs_pbmc3_TTTCCTCAGCAATATG IGHV2-5 IGHJ4
vdj_v1_hs_pbmc3_TTTCCTCAGCGCTTAT IGHV3-30 IGHJ6
vdj_v1_hs_pbmc3_TTTCCTCAGGGAAACA IGHV4-61 IGHJ2
vdj_v1_hs_pbmc3_TTTGCGCCATACCATG IGHV1-69 IGHJ6
vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG IGHV3-23 IGHJ4
[2238 rows x 2 columns])
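The transformation seen in the before/after outputs can be approximated with a few lines of plain Python. This is a minimal sketch of the string logic only (first comma-separated element, allele removed); `simplify_call` is a hypothetical helper and not the code that .simplify() runs internally.

```python
def simplify_call(call):
    """Keep the first comma-separated gene and drop its allele suffix."""
    if call is None:
        return None
    first = call.split(",")[0]   # 'IGKV1-33*01,IGKV1D-33*01' -> 'IGKV1-33*01'
    return first.split("*")[0]   # 'IGKV1-33*01' -> 'IGKV1-33'


examples = [
    "IGKV1-33*01,IGKV1D-33*01",
    "IGLJ2*01,IGLJ3*01,IGLJ3*02",
    "IGHJ4,IGHJ5",
    None,
]
for call in examples:
    print(call, "->", simplify_call(call))
```

The example strings are taken from the outputs above, so the results can be compared directly with the simplified tables.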
Concatenating multiple objects
ddl.concat is a simple function that concatenates (appends) two or more Dandelion objects or pandas DataFrames. Note that it operates on the .data slot and not the .metadata slot.
[17]:
vdj
[17]:
Dandelion class object with n_obs = 2238 and n_contigs = 7355
data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 
'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'ambiguous', 'extra', 'rearrangement_status', 'clone_id'
metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'umi_count_B_VDJ', 'umi_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'
layout: layout for 2112 vertices, layout for 71 vertices
graph: networkx graph of 2112 vertices, networkx graph of 71 vertices
[18]:
# just simple concatenation x 3. check the difference between the cell and contig numbers between this object and just vdj
vdj_concat = ddl.concat([vdj, vdj, vdj])
vdj_concat
[18]:
Dandelion class object with n_obs = 6714 and n_contigs = 22065
data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 
'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'ambiguous', 'extra', 'rearrangement_status', 'clone_id'
metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'umi_count_B_VDJ', 'umi_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'
[19]:
vdj_concat.data[["sequence_id", "cell_id"]].head()
[19]:
| sequence_id (index) | sequence_id | cell_id |
|---|---|---|
| sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1_0 | sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1_0 | sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_0 |
| sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1_1 | sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1_1 | sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_1 |
| sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1_2 | sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1_2 | sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_2 |
| sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2_0 | sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2_0 | sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_0 |
| sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_1_0 | sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_1_0 | sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_0 |
ddl.concat also lets you supply custom prefixes/suffixes to add to the sequence ids. If none are provided and it detects that the sequence ids are not unique, it will add _0, _1, etc. as a suffix, as seen above.
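The idea of only adding a numeric suffix when the combined ids would otherwise clash can be sketched as follows. This is a toy illustration on plain lists of ids; `concat_ids` and its `sep` argument are assumptions for this example and do not reproduce ddl.concat's actual signature or ordering of rows.

```python
def concat_ids(id_lists, sep="_"):
    """Concatenate lists of ids, suffixing with the list position if needed."""
    combined = [i for ids in id_lists for i in ids]
    if len(set(combined)) == len(combined):
        return combined  # already unique, leave the ids untouched
    # Duplicates detected: tag every id with the position of its source list.
    return [f"{i}{sep}{n}" for n, ids in enumerate(id_lists) for i in ids]


same = ["contig_1", "contig_2"]
print(concat_ids([same, same, same]))
# ['contig_1_0', 'contig_2_0', 'contig_1_1', 'contig_2_1', 'contig_1_2', 'contig_2_2']
print(concat_ids([["contig_1"], ["contig_2"]]))
# ['contig_1', 'contig_2']
```

Suffixing every id, including those from the first object, keeps the naming consistent and makes the source of each row recoverable.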
Read/write
A Dandelion object can be saved with the .write_h5ddl and .write_pkl functions, with accompanying compression methods, e.g. gzip. write_h5ddl primarily uses the h5py library, while write_pkl simply uses pickle. The read_h5ddl and read_pkl functions read the respective file formats.
[20]:
%time vdj.write_h5ddl('dandelion_results_test.h5ddl', compression="gzip")
/Users/uqztuong/Documents/GitHub/dandelion/src/dandelion/utilities/_utilities.py:476: FutureWarning: Downcasting behavior in `replace` is deprecated and will be removed in a future version. To retain the old behavior, explicitly call `result.infer_objects(copy=False)`. To opt-in to the future behavior, set `pd.set_option('future.no_silent_downcasting', True)`
CPU times: user 6.39 s, sys: 137 ms, total: 6.53 s
Wall time: 6.79 s
If you see any warnings above, they are due to mixed dtypes somewhere in the object, so do some checking if you think they will interfere with downstream usage.
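One way to do that checking is to look for columns that hold more than one Python type. The sketch below uses a plain dictionary of lists as a stand-in for a table, so it runs without any dependencies; `mixed_type_columns` and the toy values are hypothetical, and with a real object you would apply the same idea to the columns of vdj.data or vdj.metadata.

```python
def mixed_type_columns(table):
    """Return columns holding more than one Python type (None is ignored)."""
    mixed = {}
    for name, values in table.items():
        kinds = {type(v).__name__ for v in values if v is not None}
        if len(kinds) > 1:
            mixed[name] = sorted(kinds)
    return mixed


# Toy table: 'umi_count' mixes integers with a string, 'locus' is clean.
toy = {
    "umi_count": [12, 7, "3|5"],
    "locus": ["IGH", "IGK", None],
}
print(mixed_type_columns(toy))  # {'umi_count': ['int', 'str']}
```

Columns flagged this way are the ones worth converting to a single type before saving or passing to downstream tools.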
[21]:
%time vdj_1 = ddl.read_h5ddl('dandelion_results_test.h5ddl')
vdj_1
CPU times: user 1.19 s, sys: 60.9 ms, total: 1.25 s
Wall time: 1.37 s
[21]:
Dandelion class object with n_obs = 2238 and n_contigs = 7355
data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 
'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'ambiguous', 'extra', 'rearrangement_status', 'clone_id'
metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'umi_count_B_VDJ', 'umi_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'
layout: layout for 2112 vertices, layout for 71 vertices
graph: networkx graph of 2112 vertices, networkx graph of 71 vertices
Depending on the situation, reading and writing with pickle can be faster or slower, and the resulting files can be smaller or larger, depending on which compression is used.
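As a rough illustration of how compression changes the size of a pickled file, the following sketch pickles a toy list of records with and without gzip using only the standard library. It is not what write_pkl does internally; the file names and records are made up, and the sizes you get with a real Dandelion object will differ.

```python
import gzip
import os
import pickle
import tempfile

# Toy stand-in for an object worth saving: many small, repetitive records.
records = [{"sequence_id": f"contig_{i}", "locus": "IGH"} for i in range(1000)]

with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "toy.pkl")
    gz_path = os.path.join(tmp, "toy.pkl.gz")

    # Uncompressed pickle.
    with open(plain_path, "wb") as fh:
        pickle.dump(records, fh)

    # Same data, written through a gzip stream.
    with gzip.open(gz_path, "wb") as fh:
        pickle.dump(records, fh)

    # Round trip from the compressed file.
    with gzip.open(gz_path, "rb") as fh:
        restored = pickle.load(fh)

    plain_size = os.path.getsize(plain_path)
    gz_size = os.path.getsize(gz_path)

print("round trip ok:", restored == records)
print("plain bytes:", plain_size, "gzip bytes:", gz_size)
```

Compression trades extra CPU time on write for a smaller file, which is consistent with the timings in this section. As with any pickle file, only load files from sources you trust.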
[22]:
%time vdj.write_pkl('dandelion_results_test.pkl.gz')
CPU times: user 7.4 s, sys: 89.6 ms, total: 7.49 s
Wall time: 8.15 s
[23]:
%time vdj_2 = ddl.read_pkl('dandelion_results_test.pkl.gz')
vdj_2
CPU times: user 127 ms, sys: 13.7 ms, total: 141 ms
Wall time: 146 ms
[23]:
Dandelion class object with n_obs = 2238 and n_contigs = 7355
data: 'sequence_id', 'sequence', 'rev_comp', 'productive', 'v_call', 'd_call', 'j_call', 'sequence_alignment', 'germline_alignment', 'junction', 'junction_aa', 'v_cigar', 'd_cigar', 'j_cigar', 'stop_codon', 'vj_in_frame', 'locus', 'c_call', 'junction_length', 'np1_length', 'np2_length', 'v_sequence_start', 'v_sequence_end', 'v_germline_start', 'v_germline_end', 'd_sequence_start', 'd_sequence_end', 'd_germline_start', 'd_germline_end', 'j_sequence_start', 'j_sequence_end', 'j_germline_start', 'j_germline_end', 'v_score', 'v_identity', 'v_support', 'd_score', 'd_identity', 'd_support', 'j_score', 'j_identity', 'j_support', 'fwr1', 'fwr2', 'fwr3', 'fwr4', 'cdr1', 'cdr2', 'cdr3', 'cell_id', 'consensus_count', 'umi_count', 'v_call_10x', 'd_call_10x', 'j_call_10x', 'junction_10x', 'junction_10x_aa', 'j_support_igblastn', 'j_score_igblastn', 'j_call_igblastn', 'j_call_blastn', 'j_identity_blastn', 'j_alignment_length_blastn', 'j_number_of_mismatches_blastn', 'j_number_of_gap_openings_blastn', 'j_sequence_start_blastn', 'j_sequence_end_blastn', 'j_germline_start_blastn', 'j_germline_end_blastn', 'j_support_blastn', 'j_score_blastn', 'j_sequence_alignment_blastn', 'j_germline_alignment_blastn', 'j_source', 'd_support_igblastn', 'd_score_igblastn', 'd_call_igblastn', 'd_call_blastn', 'd_identity_blastn', 'd_alignment_length_blastn', 'd_number_of_mismatches_blastn', 'd_number_of_gap_openings_blastn', 'd_sequence_start_blastn', 'd_sequence_end_blastn', 'd_germline_start_blastn', 'd_germline_end_blastn', 'd_support_blastn', 'd_score_blastn', 'd_sequence_alignment_blastn', 'd_germline_alignment_blastn', 'd_source', 'v_call_genotyped', 'germline_alignment_d_mask', 'sample_id', 'c_sequence_alignment', 'c_germline_alignment', 'c_sequence_start', 'c_sequence_end', 'c_score', 'c_identity', 'c_call_10x', 'junction_aa_length', 'fwr1_aa', 'fwr2_aa', 'fwr3_aa', 'fwr4_aa', 'cdr1_aa', 'cdr2_aa', 'cdr3_aa', 'sequence_alignment_aa', 'v_sequence_alignment_aa', 'd_sequence_alignment_aa', 
'j_sequence_alignment_aa', 'complete_vdj', 'j_call_multimappers', 'j_call_multiplicity', 'j_call_sequence_start_multimappers', 'j_call_sequence_end_multimappers', 'j_call_support_multimappers', 'mu_count', 'ambiguous', 'extra', 'rearrangement_status', 'clone_id'
metadata: 'clone_id', 'clone_id_by_size', 'sample_id', 'locus_VDJ', 'locus_VJ', 'productive_VDJ', 'productive_VJ', 'v_call_genotyped_VDJ', 'd_call_VDJ', 'j_call_VDJ', 'v_call_genotyped_VJ', 'j_call_VJ', 'c_call_VDJ', 'c_call_VJ', 'junction_VDJ', 'junction_VJ', 'junction_aa_VDJ', 'junction_aa_VJ', 'v_call_genotyped_B_VDJ', 'd_call_B_VDJ', 'j_call_B_VDJ', 'v_call_genotyped_B_VJ', 'j_call_B_VJ', 'c_call_B_VDJ', 'c_call_B_VJ', 'productive_B_VDJ', 'productive_B_VJ', 'umi_count_B_VDJ', 'umi_count_B_VJ', 'v_call_VDJ_main', 'v_call_VJ_main', 'd_call_VDJ_main', 'j_call_VDJ_main', 'j_call_VJ_main', 'c_call_VDJ_main', 'c_call_VJ_main', 'v_call_B_VDJ_main', 'd_call_B_VDJ_main', 'j_call_B_VDJ_main', 'v_call_B_VJ_main', 'j_call_B_VJ_main', 'isotype', 'isotype_status', 'locus_status', 'chain_status', 'rearrangement_status_VDJ', 'rearrangement_status_VJ'
layout: layout for 2112 vertices, layout for 71 vertices
graph: networkx graph of 2112 vertices, networkx graph of 71 vertices
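To illustrate how compression can shift both timing and file size for pickled objects, here is a minimal sketch using only the Python standard library. It does not call dandelion: the `payload` dictionary is a toy stand-in for a real object, and the gzip and bz2 compressors are assumptions chosen for illustration, not a statement of what `write_pkl` uses internally.

```python
import bz2
import gzip
import os
import pickle
import tempfile
import time

# Toy stand-in for a larger object (hypothetical data, not a Dandelion object).
payload = {
    "sequence_id": [f"contig_{i}" for i in range(50_000)],
    "umi_count": list(range(50_000)),
}

# Map a file suffix to the function used to open the file.
openers = {".pkl": open, ".pkl.gz": gzip.open, ".pkl.bz2": bz2.open}
results = {}

with tempfile.TemporaryDirectory() as tmpdir:
    for suffix, opener in openers.items():
        path = os.path.join(tmpdir, "toy" + suffix)

        # Time the write.
        start = time.perf_counter()
        with opener(path, "wb") as handle:
            pickle.dump(payload, handle, protocol=pickle.HIGHEST_PROTOCOL)
        write_s = time.perf_counter() - start

        # Time the read and confirm the round trip is lossless.
        start = time.perf_counter()
        with opener(path, "rb") as handle:
            restored = pickle.load(handle)
        read_s = time.perf_counter() - start

        assert restored == payload
        results[suffix] = (os.path.getsize(path), write_s, read_s)

for suffix, (size, write_s, read_s) in results.items():
    print(f"{suffix:>8}: {size:>9} bytes, write {write_s:.3f}s, read {read_s:.3f}s")
```

The exact timings vary by machine and data; the point is only that the trade-off between speed and size depends on the compressor.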
Other writing functions are also available, such as .write_airr and .write_10x, which write the object to a .tsv or .csv file compatible with the AIRR and 10x formats, respectively.
[24]:
import pandas as pd
vdj2.write_airr("test.airr.tsv")
df = pd.read_csv("test.airr.tsv", sep="\t")
df
[24]:
| | sequence_id | sequence | rev_comp | productive | v_call | d_call | j_call | sequence_alignment | germline_alignment | junction | ... | j_call_multimappers | j_call_multiplicity | j_call_sequence_start_multimappers | j_call_sequence_end_multimappers | j_call_support_multimappers | mu_count | ambiguous | extra | rearrangement_status | clone_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1 | TGGGGAGGAGTCAGTCCCAACCAGGACACGGCCTGGACATGAGGGT... | F | T | IGKV1-33*01,IGKV1D-33*01 | NaN | IGKJ4*01 | GACATCCAGATGACCCAGTCTCCATCCTCCCTGTCTGCATCTGTGG... | GACATCCAGATGACCCAGTCTCCATCCTCCCTGTCTGCATCTGTAG... | TGTCAACAATATGACGAACTTCCCGTCACTTTC | ... | IGKJ4*01 | 1.0 | 385.0 | 412.0 | 3.56e-09 | 27 | F | F | standard | B_VJ_76_2_7 |
| 1 | sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2 | ATCACATAACAACCACATTCCTCCTCTAAAGAAGCCCCTGGGAGCA... | F | T | IGHV1-69*01,IGHV1-69D*01 | IGHD3-22*01 | IGHJ3*02 | CAGGTGCAGCTGGTGCAGTCTGGGGCT...GAGGTGAAGAAGCCTG... | CAGGTGCAGCTGGTGCAGTCTGGGGCT...GAGGTGAAGAAGCCTG... | TGTGCGACTACGTATTACTATGATAGTAGTGGTTATTACCAGAATG... | ... | IGHJ3*02 | 1.0 | 445.0 | 494.0 | 4.5799999999999995e-23 | 0 | F | F | standard | B_VDJ_191_3_2_VJ_185_2_3 |
| 2 | sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_1 | AGGAGTCAGACCCTGTCAGGACACAGCATAGACATGAGGGTCCCCG... | F | T | IGKV1-8*01 | NaN | IGKJ1*01 | GCCATCCGGATGACCCAGTCTCCATCCTCATTCTCTGCATCTACAG... | GCCATCCGGATGACCCAGTCTCCATCCTCATTCTCTGCATCTACAG... | TGTCAACAGTATTATAGTTACCCTCGGACGTTC | ... | IGKJ1*01 | 1.0 | 380.0 | 415.0 | 2.7e-15 | 0 | F | F | standard | B_VDJ_191_3_2_VJ_185_2_3 |
| 3 | sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_1 | ACTGTGGGGGTAAGAGGTTGTGTCCACCATGGCCTGGACTCCTCTC... | F | T | IGLV5-45*02 | NaN | IGLJ3*02 | CAGGCTGTGCTGACTCAGCCGTCTTCC...CTCTCTGCATCTCCTG... | CAGGCTGTGCTGACTCAGCCGTCTTCC...CTCTCTGCATCTCCTG... | TGTATGATTTGGCACAGCAGCGCTTGGGTGGTC | ... | IGLJ3*01 | 1.0 | 402.0 | 431.0 | 6.84e-12 | 8 | F | F | standard | B_VDJ_9_1_2_VJ_153_1_1 |
| 4 | sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_2 | GGGAGCATCACCCAGCAACCACATCTGTCCTCTAGAGAATCCCCTG... | F | T | IGHV1-2*02 | NaN | IGHJ3*02 | CAGGTGCAACTGGTGCAGTCTGGGGGT...GAGGTAAAGAAGCCTG... | CAGGTGCAGCTGGTGCAGTCTGGGGCT...GAGGTGAAGAAGCCTG... | TGTGCGAGAGAGATAGAGGGGGACGGTGTTTTTGAAATCTGG | ... | IGHJ3*02 | 1.0 | 433.0 | 479.0 | 4.48e-18 | 22 | F | F | standard | B_VDJ_9_1_2_VJ_153_1_1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7350 | vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC_contig_1 | ATCATCCAACAACCACATCCCTTCTCTACAGAAGCCTCTGAGAGGA... | F | T | IGHV1-46*01 | IGHD2-15*01 | IGHJ5*02 | CAGGTGCAGCTGGTGCAGTCTGGGGCT...GAGGTGAAGAAGCCTG... | CAGGTGCAGCTGGTGCAGTCTGGGGCT...GAGGTGAAGAAGCCTG... | TGTGCGAGAGAGGGATATTGTAGTGGTGGTAGCTGCTACTCCCCCG... | ... | IGHJ5*02 | 1.0 | 461 | 506 | 7.83e-21 | 0 | T | T | standard | NaN |
| 7351 | vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_2 | ATCACATAACAACCACATTCCTCCTCTAAAGAAGCCCCTGGGAGCA... | F | T | IGHV1-69*01,IGHV1-69D*01 | IGHD2-15*01 | IGHJ6*02 | CAGGTGCAGCTGGTGCAGTCTGGGGCT...GAGGTGAAGAAGCCTG... | CAGGTGCAGCTGGTGCAGTCTGGGGCT...GAGGTGAAGAAGCCTG... | TGTGCGAGATCTCTGGATATTGTAGTGGTGGTAGCACTCTACTACT... | ... | IGHJ6*02 | 1.0 | 439 | 497 | 4.57e-28 | 0 | F | F | standard | B_VDJ_48_4_2_VJ_50_3_5 |
| 7352 | vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_1 | AGCTTCAGCTGTGGTAGAGAAGACAGGATTCAGGACAATCTCCAGC... | F | T | IGLV1-47*01 | NaN | IGLJ3*02 | CAGTCTGTGCTGACTCAGCCACCCTCA...GCGTCTGGGACCCCCG... | CAGTCTGTGCTGACTCAGCCACCCTCA...GCGTCTGGGACCCCCG... | TGTGCAGCATGGGATGACAGCCTGAGTGGTTGGGTGTTC | ... | IGLJ3*02 | 1.0 | 397 | 434 | 2.46e-16 | 0 | F | F | standard | B_VDJ_48_4_2_VJ_50_3_5 |
| 7353 | vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_2 | GGCTGGGGTCTCAGGAGGCAGCACTCTCGGGACGTCTCCACCATGG... | F | T | IGLV2-11*01 | NaN | IGLJ2*01,IGLJ3*01,IGLJ3*02 | CAGTCTGCCCTGACTCAGCCTCGCTCA...GTGTCCGGGTCTCCTG... | CAGTCTGCCCTGACTCAGCCTCGCTCA...GTGTCCGGGTCTCCTG... | TGCTGCTCATATGCAGGCAGCTACACTGTGTTTTTC | ... | IGLJ3*01 | 1.0 | 393 | 430 | 2.46e-11 | 4 | F | F | standard | B_VDJ_117_5_3_VJ_102_3_4 |
| 7354 | vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_1 | AGCTCTGAGAGAGGAGCCCAGCCCTGGGATTTTCAGGTGTTTTCAT... | F | T | IGHV3-23*01,IGHV3-23D*01 | NaN | IGHJ4*02 | GAGGTGCAGGTGTTGGAGTCTGGGGGA...GGCTTGGAACAGCCTG... | GAGGTGCAGCTGTTGGAGTCTGGGGGA...GGCTTGGTACAGCCTG... | TGTGCGGGGAGTCGGTGGTTATATTCTTTTGACTACTGG | ... | IGHJ4*02 | 1.0 | 449 | 491 | 1.65e-17 | 8 | F | F | standard | B_VDJ_117_5_3_VJ_102_3_4 |
7355 rows × 124 columns
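In the table above, flag columns such as `productive` and `rev_comp` come back from `pd.read_csv` as the strings `T` and `F` rather than booleans. A minimal sketch of converting them for filtering follows; the three-row table is made up for illustration and only borrows column names from the output above.

```python
import io

import pandas as pd

# Tiny made-up AIRR-style table (hypothetical rows, real column names).
airr_tsv = (
    "sequence_id\tproductive\trev_comp\tlocus\n"
    "contig_1\tT\tF\tIGK\n"
    "contig_2\tT\tF\tIGH\n"
    "contig_3\tF\tF\tIGL\n"
)

df = pd.read_csv(io.StringIO(airr_tsv), sep="\t")

# Map the T/F strings to a nullable boolean column; anything that is
# neither T nor F (for example a missing value) becomes <NA>.
flag_map = {"T": True, "F": False}
for col in ["productive", "rev_comp"]:
    df[col] = df[col].map(flag_map).astype("boolean")

# Keep only productive contigs, treating missing flags as not productive.
productive = df[df["productive"].fillna(False)]
print(productive)
```

Using the nullable `boolean` dtype keeps missing flags distinguishable from `False` until the point where a decision is actually made.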
[25]:
vdj2.write_10x(
    folder="10x_test",
    filename_prefix="all",
)  # this writes both the contig_annotations.csv and contig.fasta
df = pd.read_csv("10x_test/all_contig_annotations.csv")
df
[25]:
| | barcode | contig_id | length | chain | v_gene | d_gene | j_gene | c_gene | full_length | productive | cdr3 | cdr3_nt | reads | umis | raw_clonotype_id | raw_consensus_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG | sc5p_v2_hs_PBMC_10k_AAACCTGTCATATCGG_contig_1 | 556 | IGK | IGKV1-33*01,IGKV1D-33*01 | NaN | IGKJ4*01 | IGKC | NaN | True | CQQYDELPVTF | TGTCAACAATATGACGAACTTCCCGTCACTTTC | 9139 | 68 | B_VJ_76_2_7 | B_VJ_76_2_7 |
| 1 | sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC | sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_2 | 565 | IGH | IGHV1-69*01,IGHV1-69D*01 | IGHD3-22*01 | IGHJ3*02 | IGHM | NaN | True | CATTYYYDSSGYYQNDAFDIW | TGTGCGACTACGTATTACTATGATAGTAGTGGTTATTACCAGAATG... | 4161 | 51 | B_VDJ_191_3_2_VJ_185_2_3 | B_VDJ_191_3_2_VJ_185_2_3 |
| 2 | sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC | sc5p_v2_hs_PBMC_10k_AAACCTGTCCGTTGTC_contig_1 | 551 | IGK | IGKV1-8*01 | NaN | IGKJ1*01 | IGKC | NaN | True | CQQYYSYPRTF | TGTCAACAGTATTATAGTTACCCTCGGACGTTC | 5679 | 43 | B_VDJ_191_3_2_VJ_185_2_3 | B_VDJ_191_3_2_VJ_185_2_3 |
| 3 | sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG | sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_1 | 642 | IGL | IGLV5-45*02 | NaN | IGLJ3*02 | IGLC3 | NaN | True | CMIWHSSAWVV | TGTATGATTTGGCACAGCAGCGCTTGGGTGGTC | 13160 | 90 | B_VDJ_9_1_2_VJ_153_1_1 | B_VDJ_9_1_2_VJ_153_1_1 |
| 4 | sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG | sc5p_v2_hs_PBMC_10k_AAACCTGTCGAGAACG_contig_2 | 550 | IGH | IGHV1-2*02 | NaN | IGHJ3*02 | IGHM | NaN | True | CAREIEGDGVFEIW | TGTGCGAGAGAGATAGAGGGGGACGGTGTTTTTGAAATCTGG | 5080 | 47 | B_VDJ_9_1_2_VJ_153_1_1 | B_VDJ_9_1_2_VJ_153_1_1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7350 | vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC | vdj_v1_hs_pbmc3_TTTCCTCTCGACAGCC_contig_1 | 577 | IGH | IGHV1-46*01 | IGHD2-15*01 | IGHJ5*02 | IGHM | NaN | True | CAREGYCSGGSCYSPDPNNGWFDPW | TGTGCGAGAGAGGGATATTGTAGTGGTGGTAGCTGCTACTCCCCCG... | 2960 | 28 | NaN | NaN |
| 7351 | vdj_v1_hs_pbmc3_TTTGCGCCATACCATG | vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_2 | 568 | IGH | IGHV1-69*01,IGHV1-69D*01 | IGHD2-15*01 | IGHJ6*02 | IGHM | NaN | True | CARSLDIVVVVALYYYYGMDVW | TGTGCGAGATCTCTGGATATTGTAGTGGTGGTAGCACTCTACTACT... | 2464 | 32 | B_VDJ_48_4_2_VJ_50_3_5 | B_VDJ_48_4_2_VJ_50_3_5 |
| 7352 | vdj_v1_hs_pbmc3_TTTGCGCCATACCATG | vdj_v1_hs_pbmc3_TTTGCGCCATACCATG_contig_1 | 645 | IGL | IGLV1-47*01 | NaN | IGLJ3*02 | IGLC3 | NaN | True | CAAWDDSLSGWVF | TGTGCAGCATGGGATGACAGCCTGAGTGGTTGGGTGTTC | 2457 | 28 | B_VDJ_48_4_2_VJ_50_3_5 | B_VDJ_48_4_2_VJ_50_3_5 |
| 7353 | vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG | vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_2 | 641 | IGL | IGLV2-11*01 | NaN | IGLJ2*01,IGLJ3*01,IGLJ3*02 | IGLC | NaN | True | CCSYAGSYTVFF | TGCTGCTCATATGCAGGCAGCTACACTGTGTTTTTC | 2744 | 36 | B_VDJ_117_5_3_VJ_102_3_4 | B_VDJ_117_5_3_VJ_102_3_4 |
| 7354 | vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG | vdj_v1_hs_pbmc3_TTTGGTTGTAGGCATG_contig_1 | 562 | IGH | IGHV3-23*01,IGHV3-23D*01 | NaN | IGHJ4*02 | IGHM | NaN | True | CAGSRWLYSFDYW | TGTGCGGGGAGTCGGTGGTTATATTCTTTTGACTACTGG | 1915 | 22 | B_VDJ_117_5_3_VJ_102_3_4 | B_VDJ_117_5_3_VJ_102_3_4 |
7355 rows × 16 columns
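Once the contig annotations are in a DataFrame, a per-cell summary takes only a few lines of pandas. The sketch below uses a small made-up table with the `barcode`, `chain` and `umis` columns seen above; the barcodes and the "paired" rule (at least one IGH contig plus at least one IGK or IGL contig) are assumptions for illustration, not dandelion's own chain-status logic.

```python
import pandas as pd

# Made-up contig table (hypothetical barcodes, real column names).
contigs = pd.DataFrame(
    {
        "barcode": ["cell_A", "cell_B", "cell_B", "cell_C", "cell_C"],
        "chain": ["IGK", "IGH", "IGK", "IGL", "IGH"],
        "umis": [68, 51, 43, 90, 47],
    }
)

# Count contigs of each chain type per barcode.
chains_per_cell = pd.crosstab(contigs["barcode"], contigs["chain"])

# Total UMIs per barcode across all of its contigs.
umis_per_cell = contigs.groupby("barcode")["umis"].sum()

# Barcodes with a heavy chain and at least one light chain.
has_heavy = chains_per_cell["IGH"] > 0
has_light = (chains_per_cell["IGK"] + chains_per_cell["IGL"]) > 0
paired = chains_per_cell[has_heavy & has_light].index.tolist()

print(chains_per_cell)
print(umis_per_cell)
print(paired)
```

On a real file, `contigs` would instead come from `pd.read_csv` on the written contig annotations, and the chain columns present would depend on the data.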