Loading data for analysis

[1]:

import dandelion as ddl
import scanpy as sc
import warnings
import os

warnings.filterwarnings("ignore")

This notebook shows how to prepare both GEX and VDJ data for Dandelion analysis. Running it is optional, as the resulting objects are downloaded at the start of the subsequent notebook. However, the syntax shown here for loading and concatenating multiple samples with Dandelion is likely to be useful for your own data.

We’re using public 10X data, on which we ran the Dandelion preprocessing pipeline. We compressed the resulting folder, which holds both GEX and VDJ data ready for ingestion, and will download it shortly. The following describes in detail how that folder was constructed.

The GEX and VDJ were downloaded like so:

Download commands

We then returned to the dandelion_tutorial directory we created and constructed the following meta.csv file for Dandelion preprocessing to use:

sample,prefix,individual
vdj_v1_hs_pbmc3,vdj_v1_hs_pbmc3,vdj_v1_hs_pbmc3
vdj_nextgem_hs_pbmc3,vdj_nextgem_hs_pbmc3,vdj_nextgem_hs_pbmc3
sc5p_v2_hs_PBMC_10k,sc5p_v2_hs_PBMC_10k,sc5p_v2_hs_PBMC_10k
sc5p_v2_hs_PBMC_1k,sc5p_v2_hs_PBMC_1k,sc5p_v2_hs_PBMC_1k
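As a minimal sketch, a file with this layout could also be generated from a list of sample names using Python's standard csv module. The helper name `write_meta` is hypothetical and not part of Dandelion; it assumes, as in this tutorial, that the sample, prefix and individual columns all carry the sample name.

```python
import csv


def write_meta(samples, path="meta.csv"):
    """Write a meta.csv with one row per sample (hypothetical helper)."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        # header expected by the preprocessing step described in this notebook
        writer.writerow(["sample", "prefix", "individual"])
        for sample in samples:
            # assumption: each sample is its own prefix and its own individual
            writer.writerow([sample, sample, sample])
    return path
```

Calling `write_meta` with the four sample names listed above would reproduce the file shown. If several samples came from the same donor, the individual column would need to be filled in differently.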

The prefix column makes loading the data easier: the preprocessing prepends the sample ID to the cell barcodes, so we don’t have to do it ourselves. The individual column is explicitly included for TIgGER purposes. We then ran the preprocessing pipeline using the Dandelion singularity container:

singularity pull library://kt16/default/sc-dandelion:latest
singularity run -B $PWD sc-dandelion_latest.sif dandelion-preprocess --meta meta.csv --file_prefix filtered

We can now download the result of all of these operations and decompress the folder.

[2]:

if not os.path.exists("dandelion_tutorial"):
    os.system(
        "wget ftp://ftp.sanger.ac.uk/pub/users/kp9/dandelion_tutorial.tar.gz"
    )
    os.system("tar -xzf dandelion_tutorial.tar.gz")
    os.remove("dandelion_tutorial.tar.gz")

The folder contains the following samples.

[3]:

samples = [
    "sc5p_v2_hs_PBMC_1k",
    "sc5p_v2_hs_PBMC_10k",
    "vdj_v1_hs_pbmc3",
    "vdj_nextgem_hs_pbmc3",
]
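Before loading anything, it can help to confirm that every sample folder holds the files the later cells read. The following is a sketch only: `missing_inputs` is a hypothetical helper, and the two relative paths are simply the ones used in the loading cells of this notebook.

```python
from pathlib import Path


def missing_inputs(root, samples):
    """Return the expected input files that are absent under root."""
    # relative paths taken from the GEX and VDJ loading cells in this notebook
    expected = [
        "filtered_feature_bc_matrix.h5",
        "dandelion/filtered_contig_dandelion.tsv",
    ]
    missing = []
    for sample in samples:
        for relative in expected:
            path = Path(root) / sample / relative
            if not path.is_file():
                missing.append(str(path))
    return missing
```

An empty list from `missing_inputs("dandelion_tutorial", samples)` would indicate that the download and extraction completed as expected.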

Import the GEX data and combine it into a single object. Prepend the sample name to each cell barcode, separated with _.

[4]:

adata_list = []
for sample in samples:
    adata = sc.read_10x_h5(
        "dandelion_tutorial/" + sample + "/filtered_feature_bc_matrix.h5",
        gex_only=True,
    )
    adata.obs["sampleid"] = sample
    # rename cells to sample id + barcode, and cleave the trailing -#
    adata.obs_names = [
        str(sample) + "_" + str(j).split("-")[0] for j in adata.obs_names
    ]
    adata.var_names_make_unique()
    adata_list.append(adata)
# no need for index_unique as we already made barcodes unique by prepending the sample ID
adata = adata_list[0].concatenate(adata_list[1:], index_unique=None)
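The renaming step can be illustrated in isolation. This sketch applies the same string logic as the loop above to a made-up barcode (the sequence is hypothetical, not taken from the data), and shows why the prefix matters: an identical barcode in two samples becomes two distinct cell names.

```python
def prefix_barcode(sample, barcode):
    """Prepend the sample ID and drop the trailing '-<number>' suffix."""
    return f"{sample}_{barcode.split('-')[0]}"


# hypothetical barcode, used for illustration only
barcode = "AAACCTGAGAAACCAT-1"

print(prefix_barcode("sc5p_v2_hs_PBMC_1k", barcode))
# prints sc5p_v2_hs_PBMC_1k_AAACCTGAGAAACCAT

# the same raw barcode in two samples no longer collides after prefixing
names = {prefix_barcode(s, barcode) for s in ["vdj_v1_hs_pbmc3", "vdj_nextgem_hs_pbmc3"]}
print(len(names))
# prints 2
```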

Import the Dandelion preprocessing output and combine it into a single matching object as well. We don’t need to modify the cell names here, because specifying the prefix in meta.csv already prepended the sample ID to them.

[5]:

vdj_list = []
for sample in samples:
    vdj = ddl.read_10x_airr(
        "dandelion_tutorial/"
        + sample
        + "/dandelion/filtered_contig_dandelion.tsv"
    )
    # the dandelion output already has the sample ID prepended at the start of each contig
    vdj_list.append(vdj)
vdj = ddl.concat(vdj_list)
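Since the GEX and VDJ objects are meant to share cell names, a quick overlap count is a cheap sanity check on the naming scheme. This is a sketch with a hypothetical helper, `barcode_overlap`, that works on any two collections of cell names.

```python
def barcode_overlap(gex_barcodes, vdj_barcodes):
    """Count cell names shared between, and unique to, the two modalities."""
    gex = set(gex_barcodes)
    vdj = set(vdj_barcodes)
    return {
        "shared": len(gex & vdj),
        "gex_only": len(gex - vdj),
        "vdj_only": len(vdj - gex),
    }
```

In this notebook it might be called as `barcode_overlap(adata.obs_names, vdj.metadata.index)`, assuming the Dandelion object exposes its per-cell table as `vdj.metadata`. A non-zero `gex_only` count is expected, since not every cell carries a VDJ contig, whereas a `shared` count of zero would point to a mismatch in how the barcodes were renamed.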

Do standard GEX processing via Scanpy.

[6]:

sc.pp.filter_genes(adata, min_cells=3)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.raw = adata
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
# take a copy so that later in-place steps operate on a real object, not a view
adata = adata[:, adata.var.highly_variable].copy()
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, svd_solver="arpack")
sc.pp.neighbors(adata, n_pcs=20)
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=0.5)
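To make the first two transformations concrete, here is a plain-Python sketch of what total-count normalisation followed by log1p does to a small count matrix. It is an illustration of the arithmetic only, not Scanpy's implementation; the function name `normalize_log1p` and the toy counts are made up.

```python
import math


def normalize_log1p(counts, target_sum=1e4):
    """Scale each cell (row) to target_sum total counts, then apply log(1 + x)."""
    result = []
    for row in counts:
        total = sum(row)
        # an all-zero cell stays all-zero rather than dividing by zero
        scale = target_sum / total if total > 0 else 0.0
        result.append([math.log1p(value * scale) for value in row])
    return result


# two toy cells with three genes each
toy = [[1, 1, 2], [0, 10, 0]]
normalised = normalize_log1p(toy)
```

After scaling, both toy cells sum to 10,000 counts regardless of their original depth, so the first cell's values become 2,500, 2,500 and 5,000 before the logarithm is applied.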

And that’s it! Save the objects.

[7]:

adata.write("demo-gex.h5ad")
vdj.write("demo-vdj.h5ddl")