dandelion.external.immcantation.polars.shazam.calculate_threshold

dandelion.external.immcantation.polars.shazam.calculate_threshold(data, mode='single-cell', manual_threshold=None, VJthenLen=False, model=None, normalize_method=None, threshold_method=None, edge=None, cross=None, subsample=None, threshold_model=None, cutoff=None, sensitivity=None, specificity=None, plot=True, plot_group=None, figsize=(4.5, 2.5), save_plot=None, n_cpus=1, **kwargs)

Calculates nearest-neighbor distances for tuning clonal assignment with shazam.

https://shazam.readthedocs.io/en/stable/vignettes/DistToNearest-Vignette/

Runs the following:

distToNearest

Gets the non-zero distance of every heavy chain (IGH) sequence (as defined by sequenceColumn) to its nearest sequence within a partition. A partition groups heavy chains sharing the same V gene, J gene, and junction length (VJL); or single cells whose heavy chains share the same heavy chain VJL combination; or single cells whose heavy and light chains share the same heavy chain VJL and light chain VJL combinations.
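The heavy-chain partitioning and nearest-neighbor search described above can be sketched in plain Python. This is an illustration only, not shazam's implementation: it assumes AIRR-style field names (v_call, j_call, junction), takes gene calls as-is without resolving ambiguous calls, and uses a length-normalized Hamming distance. The helper names hamming and dist_to_nearest are hypothetical.

```python
from collections import defaultdict


def hamming(a, b):
    """Count mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def dist_to_nearest(records):
    """Return, per record, the smallest non-zero length-normalized Hamming
    distance to another junction in the same (V, J, length) partition.

    Records with no non-identical partner in their partition get None.
    """
    # Partition record indices by V gene, J gene, and junction length (VJL).
    partitions = defaultdict(list)
    for idx, rec in enumerate(records):
        key = (rec["v_call"], rec["j_call"], len(rec["junction"]))
        partitions[key].append(idx)

    nearest = [None] * len(records)
    for (_, _, length), members in partitions.items():
        if length == 0:
            continue  # nothing to compare for empty junctions
        for i in members:
            for j in members:
                if i == j:
                    continue
                d = hamming(records[i]["junction"], records[j]["junction"]) / length
                # Keep the smallest distance, ignoring exact duplicates (d == 0).
                if d > 0 and (nearest[i] is None or d < nearest[i]):
                    nearest[i] = d
    return nearest


example = [
    {"v_call": "IGHV1-2", "j_call": "IGHJ4", "junction": "ACGTACGT"},
    {"v_call": "IGHV1-2", "j_call": "IGHJ4", "junction": "ACGTACGA"},
    {"v_call": "IGHV1-2", "j_call": "IGHJ4", "junction": "TTTTACGT"},
    {"v_call": "IGHV3-23", "j_call": "IGHJ6", "junction": "ACGTACGT"},
]
print(dist_to_nearest(example))  # [0.125, 0.125, 0.375, None]
```

The last record sits alone in its partition, so it has no nearest neighbor and is reported as None.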

findThreshold

Automatically determines an optimal threshold for clonal assignment of Ig sequences using a vector of nearest neighbor distances. It provides two alternative methods using either a Gamma/Gaussian Mixture Model fit (threshold_method=“gmm”) or kernel density fit (threshold_method=“density”).
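To make the idea behind a density-based threshold concrete, the sketch below smooths the nearest-neighbor distances with a Gaussian kernel and returns the lowest point between the two tallest modes. It is a deliberately simplified stand-in, not the algorithm shazam uses: the fixed bandwidth, the grid size, and the function names gaussian_kde and valley_threshold are all assumptions made for illustration.

```python
import math


def gaussian_kde(values, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate of values on grid."""
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((g - v) / bandwidth) ** 2) for v in values)
        for g in grid
    ]


def valley_threshold(distances, bandwidth=0.03, n_grid=201):
    """Return the grid point of minimum density between the two tallest modes.

    Raises ValueError when the smoothed density has fewer than two modes.
    """
    values = [d for d in distances if d is not None]
    if not values:
        raise ValueError("no distances supplied")

    # Pad the grid so that modes at the data extremes are interior points.
    lo = min(values) - 3 * bandwidth
    hi = max(values) + 3 * bandwidth
    grid = [lo + (hi - lo) * k / (n_grid - 1) for k in range(n_grid)]
    dens = gaussian_kde(values, grid, bandwidth)

    # Interior local maxima of the smoothed density.
    peaks = [
        k for k in range(1, n_grid - 1)
        if dens[k - 1] < dens[k] >= dens[k + 1]
    ]
    if len(peaks) < 2:
        raise ValueError("density is not bimodal; no threshold found")

    # Take the two tallest peaks and search for the valley between them.
    left, right = sorted(sorted(peaks, key=lambda k: dens[k])[-2:])
    valley = min(range(left, right + 1), key=lambda k: dens[k])
    return grid[valley]


distances = [0.02, 0.04, 0.05, 0.06, 0.08, 0.35, 0.38, 0.40, 0.42, 0.45]
print(valley_threshold(distances))
```

With the toy distances above, the returned value falls in the gap between the low-distance cluster (likely clonal relatives) and the high-distance cluster (likely unrelated sequences). A unimodal input raises ValueError, which is analogous to automatic thresholding failing.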

Parameters:
  • data (Dandelion | pd.DataFrame | str) – input Dandelion object, AIRR data as a pandas DataFrame, or path to a file.

  • mode (Literal[“single-cell”, “heavy”], optional) – accepts one of “heavy” or “single-cell”. Refer to https://shazam.readthedocs.io/en/stable/vignettes/DistToNearest-Vignette.

  • manual_threshold (float | None, optional) – value to manually plot in histogram.

  • VJthenLen (bool, optional) – logical value specifying whether to perform partitioning as a 2-stage process. If True, partitions are made first based on V and J gene, and then further split based on junction lengths corresponding to sequenceColumn. If False, perform partition as a 1-stage process during which V gene, J gene, and junction length are used to create partitions simultaneously. Defaults to False.

  • model (Literal[“ham”, “aa”, “hh_s1f”, “hh_s5f”, “mk_rs1nf”, “hs1f_compat”, “m1n_compat”] | None, optional) – underlying SHM model, which must be one of “ham”, “aa”, “hh_s1f”, “hh_s5f”, “mk_rs1nf”, “hs1f_compat”, “m1n_compat”.

  • normalize_method (Literal[“len”] | None, optional) – method of normalization. The default is “len”, which divides the distance by the length of the sequence group. If “none”, no normalization is performed.

  • threshold_method (Literal[“gmm”, “density”] | None, optional) – string defining the method to use for determining the optimal threshold. One of “gmm” or “density”.

  • edge (float | None, optional) – upper range as a fraction of the data density to rule initialization of Gaussian fit parameters. Default value is 0.9 (or 90%). Applies only when threshold_method=“density”.

  • cross (list[float] | None, optional) – supplementary nearest neighbor distance vector output from distToNearest for initialization of the Gaussian fit parameters. Applies only when threshold_method=“gmm”.

  • subsample (int | None, optional) – maximum number of distances to subsample to before threshold detection.

  • threshold_model (Literal[“norm-norm”, “norm-gamma”, “gamma-norm”, “gamma-gamma”] | None, optional) – selects one of four possible combinations of fitting curves: “norm-norm”, “norm-gamma”, “gamma-norm”, or “gamma-gamma”. Applies only when threshold_method=“gmm”.

  • cutoff (Literal[“optimal”, “intersect”, “user”] | None, optional) – method to use for threshold selection: the optimal threshold (“optimal”), the intersection point of the two fitted curves (“intersect”), or a user-defined value for either the sensitivity or the specificity (“user”). Applies only when threshold_method=“gmm”.

  • sensitivity (float | None, optional) – sensitivity required. Applies only when threshold_method=“gmm” and cutoff=“user”.

  • specificity (float | None, optional) – specificity required. Applies only when threshold_method=“gmm” and cutoff=“user”.

  • plot (bool, optional) – whether or not to return plot.

  • plot_group (str | None, optional) – determines the fill color and facets.

  • figsize (tuple[float, float], optional) – size of plot.

  • save_plot (str | None, optional) – if specified, the plot will be saved to this path.

  • n_cpus (int, optional) – number of CPUs used to run distToNearest. Defaults to 1.

  • **kwargs – passed to shazam’s distToNearest.

Returns:

threshold value for clonal assignment in DefineClones.

Return type:

float

Raises:

ValueError – if automatic thresholding failed.
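A minimal usage sketch, assuming dandelion and the shazam backend it wraps are installed. The argument names come from the signature above; the wrapper pick_threshold and its fallback argument are hypothetical helpers, not part of dandelion, and the choice of threshold_method=“density” is only an example.

```python
def pick_threshold(airr_path, fallback=None):
    """Return an automatically determined threshold, or `fallback` on failure.

    `pick_threshold` and `fallback` are hypothetical and shown only to
    illustrate handling of the documented ValueError.
    """
    # Imported lazily so this sketch can be loaded without dandelion installed.
    from dandelion.external.immcantation.polars.shazam import calculate_threshold

    try:
        return calculate_threshold(
            airr_path,                    # path to an AIRR file (str input)
            mode="single-cell",
            threshold_method="density",
            plot=False,                   # skip plotting in batch runs
            n_cpus=1,
        )
    except ValueError:
        # Documented failure mode: automatic thresholding failed.
        if fallback is None:
            raise
        return fallback
```

Re-raising when no fallback is given keeps a failed automatic fit visible instead of silently substituting an arbitrary cutoff.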