{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "provincial-toronto",
   "metadata": {},
   "source": [
    "# Loading data for analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "vocational-cambridge",
   "metadata": {},
   "outputs": [],
   "source": [
    "import dandelion as ddl\n",
    "import scanpy as sc\n",
    "import warnings\n",
    "import os\n",
    "\n",
    "warnings.filterwarnings(\"ignore\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "regular-consumer",
   "metadata": {},
   "source": [
    "This notebook shows how to prepare both GEX and VDJ data for Dandelion analysis. You don't have to run it if you don't want to, the resulting objects are downloaded at the start of the subsequent notebook. However, you're likely to find the provided syntax for loading and concatenating multiple samples with Dandelion useful.\n",
    "\n",
    "We're using public 10X data, and ran the Dandelion preprocessing pipeline on it. We compressed the resulting folder, with both GEX and VDJ data ready for ingestion, and will download it shortly. The following are detailed instructions of what happened to construct said folder.\n",
    "\n",
    "The GEX and VDJ were downloaded like so:\n",
    "\n",
    "<details>\n",
    "    <summary>Download commands</summary>\n",
    "\n",
    "    mkdir dandelion_tutorial;\n",
    "    mkdir -p dandelion_tutorial/vdj_nextgem_hs_pbmc3;\n",
    "    mkdir -p dandelion_tutorial/vdj_v1_hs_pbmc3;\n",
    "    mkdir -p dandelion_tutorial/sc5p_v2_hs_PBMC_10k;\n",
    "    mkdir -p dandelion_tutorial/sc5p_v2_hs_PBMC_1k;\n",
    "    cd dandelion_tutorial/vdj_v1_hs_pbmc3;\n",
    "    wget -O filtered_feature_bc_matrix.h5 https://cf.10xgenomics.com/samples/cell-vdj/3.1.0/vdj_v1_hs_pbmc3/vdj_v1_hs_pbmc3_filtered_feature_bc_matrix.h5;\n",
    "    wget -O filtered_contig_annotations.csv https://cf.10xgenomics.com/samples/cell-vdj/3.1.0/vdj_v1_hs_pbmc3/vdj_v1_hs_pbmc3_b_filtered_contig_annotations.csv;\n",
    "    wget -O filtered_contig.fasta https://cf.10xgenomics.com/samples/cell-vdj/3.1.0/vdj_v1_hs_pbmc3/vdj_v1_hs_pbmc3_b_filtered_contig.fasta;\n",
    "    cd ../vdj_nextgem_hs_pbmc3\n",
    "    wget -O filtered_feature_bc_matrix.h5 https://cf.10xgenomics.com/samples/cell-vdj/3.1.0/vdj_nextgem_hs_pbmc3/vdj_nextgem_hs_pbmc3_filtered_feature_bc_matrix.h5;\n",
    "    wget -O filtered_contig_annotations.csv https://cf.10xgenomics.com/samples/cell-vdj/3.1.0/vdj_nextgem_hs_pbmc3/vdj_nextgem_hs_pbmc3_b_filtered_contig_annotations.csv;\n",
    "    wget -O filtered_contig.fasta https://cf.10xgenomics.com/samples/cell-vdj/3.1.0/vdj_nextgem_hs_pbmc3/vdj_nextgem_hs_pbmc3_b_filtered_contig.fasta;\n",
    "    cd ../sc5p_v2_hs_PBMC_10k;\n",
    "    wget -O filtered_feature_bc_matrix.h5 https://cf.10xgenomics.com/samples/cell-vdj/4.0.0/sc5p_v2_hs_PBMC_10k/sc5p_v2_hs_PBMC_10k_filtered_feature_bc_matrix.h5;\n",
    "    wget -O filtered_contig_annotations.csv https://cf.10xgenomics.com/samples/cell-vdj/4.0.0/sc5p_v2_hs_PBMC_10k/sc5p_v2_hs_PBMC_10k_b_filtered_contig_annotations.csv;\n",
    "    wget -O filtered_contig.fasta https://cf.10xgenomics.com/samples/cell-vdj/4.0.0/sc5p_v2_hs_PBMC_10k/sc5p_v2_hs_PBMC_10k_b_filtered_contig.fasta;\n",
    "    cd ../sc5p_v2_hs_PBMC_1k;\n",
    "    wget -O filtered_feature_bc_matrix.h5 wget https://cf.10xgenomics.com/samples/cell-vdj/4.0.0/sc5p_v2_hs_PBMC_1k/sc5p_v2_hs_PBMC_1k_filtered_feature_bc_matrix.h5;\n",
    "    wget -O filtered_contig_annotations.csv wget https://cf.10xgenomics.com/samples/cell-vdj/4.0.0/sc5p_v2_hs_PBMC_1k/sc5p_v2_hs_PBMC_1k_b_filtered_contig_annotations.csv;\n",
    "    wget -O filtered_contig.fasta https://cf.10xgenomics.com/samples/cell-vdj/4.0.0/sc5p_v2_hs_PBMC_1k/sc5p_v2_hs_PBMC_1k_b_filtered_contig.fasta;\n",
    "\n",
    "</details>\n",
    "<br>\n",
    "\n",
    "We then returned to the `dandelion_tutorial` directory we created and constructed the following `meta.csv` file for Dandelion preprocessing to use:\n",
    "\n",
    "```\n",
    "sample,prefix,individual\n",
    "vdj_v1_hs_pbmc3,vdj_v1_hs_pbmc3,vdj_v1_hs_pbmc3\n",
    "vdj_nextgem_hs_pbmc3,vdj_nextgem_hs_pbmc3,vdj_nextgem_hs_pbmc3\n",
    "sc5p_v2_hs_PBMC_10k,sc5p_v2_hs_PBMC_10k,sc5p_v2_hs_PBMC_10k\n",
    "sc5p_v2_hs_PBMC_1k,sc5p_v2_hs_PBMC_1k,sc5p_v2_hs_PBMC_1k\n",
    "```\n",
    "\n",
    "The `prefix` column makes our life easier when loading the data by prepending the sample ID to the cell barcodes so we don't have to do it. The `individual` column is explicitly included for TIgGER purposes. We then ran the preprocessing pipeline with the aid of the Dandelion singularity container:\n",
    "\n",
    "```\n",
    "singularity pull library://kt16/default/sc-dandelion:latest\n",
    "singularity run -B $PWD sc-dandelion_latest.sif dandelion-preprocess --meta meta.csv --file_prefix filtered\n",
    "```\n",
    "\n",
    "We can now download the result of all of these operations and decompress the folder."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "downtown-outreach",
   "metadata": {},
   "outputs": [],
   "source": [
    "if not os.path.exists(\"dandelion_tutorial\"):\n",
    "    os.system(\n",
    "        \"wget ftp://ftp.sanger.ac.uk/pub/users/kp9/dandelion_tutorial.tar.gz\"\n",
    "    )\n",
    "    os.system(\"tar -xzf dandelion_tutorial.tar.gz\")\n",
    "    os.remove(\"dandelion_tutorial.tar.gz\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "speaking-luxembourg",
   "metadata": {},
   "source": [
    "The folder features the following samples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "mysterious-geology",
   "metadata": {},
   "outputs": [],
   "source": [
    "samples = [\n",
    "    \"sc5p_v2_hs_PBMC_1k\",\n",
    "    \"sc5p_v2_hs_PBMC_10k\",\n",
    "    \"vdj_v1_hs_pbmc3\",\n",
    "    \"vdj_nextgem_hs_pbmc3\",\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "better-wesley",
   "metadata": {},
   "source": [
    "Import the GEX data and combine it into a single object. Prepend the sample name to each cell barcode, separated with `_`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "basic-radio",
   "metadata": {},
   "outputs": [],
   "source": [
    "adata_list = []\n",
    "for sample in samples:\n",
    "    adata = sc.read_10x_h5(\n",
    "        \"dandelion_tutorial/\" + sample + \"/filtered_feature_bc_matrix.h5\",\n",
    "        gex_only=True,\n",
    "    )\n",
    "    adata.obs[\"sampleid\"] = sample\n",
    "    # rename cells to sample id + barcode, and cleave the trailing -#\n",
    "    adata.obs_names = [\n",
    "        str(sample) + \"_\" + str(j).split(\"-\")[0] for j in adata.obs_names\n",
    "    ]\n",
    "    adata.var_names_make_unique()\n",
    "    adata_list.append(adata)\n",
    "# no need for index_unique as we already made barcodes unique by prepending the sample ID\n",
    "adata = adata_list[0].concatenate(adata_list[1:], index_unique=None)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "disabled-brake",
   "metadata": {},
   "source": [
    "Import the Dandelion preprocessing output, and then combine that into a matching single object as well. We don't need to modify the cell names here, as they've already got the sample ID prepended to them by specifying the `prefix` in `meta.csv`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "harmful-davis",
   "metadata": {},
   "outputs": [],
   "source": [
    "vdj_list = []\n",
    "for sample in samples:\n",
    "    vdj = ddl.read_airr(\n",
    "        \"dandelion_tutorial/\"\n",
    "        + sample\n",
    "        + \"/dandelion/filtered_contig_dandelion.tsv\"\n",
    "    )\n",
    "    # the dandelion output already has the sample ID prepended at the start of each contig\n",
    "    vdj_list.append(vdj)\n",
    "vdj = ddl.concat(vdj_list)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "judicial-brown",
   "metadata": {},
   "source": [
    "Do standard GEX processing via Scanpy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "surprised-paintball",
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.pp.filter_genes(adata, min_cells=3)\n",
    "sc.pp.normalize_total(adata, target_sum=1e4)\n",
    "sc.pp.log1p(adata)\n",
    "adata.raw = adata\n",
    "sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)\n",
    "adata = adata[:, adata.var.highly_variable]\n",
    "sc.pp.scale(adata, max_value=10)\n",
    "sc.tl.pca(adata, svd_solver=\"arpack\")\n",
    "sc.pp.neighbors(adata, n_pcs=20)\n",
    "sc.tl.umap(adata)\n",
    "sc.tl.leiden(adata, resolution=0.5)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "alien-thought",
   "metadata": {},
   "source": [
    "And that's it! Save the objects."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "horizontal-celebration",
   "metadata": {},
   "outputs": [],
   "source": [
    "adata.write(\"demo-gex.h5ad\")\n",
    "vdj.write(\"demo-vdj.h5ddl\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "dandelion-tutorial",
   "language": "python",
   "name": "dandelion-tutorial"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}