dandelion.polars.preprocessing.check_contigs

dandelion.polars.preprocessing.check_contigs(vdj, adata=None, productive_only=True, library_type=None, umi_foldchange_cutoff=2.0, consensus_foldchange_cutoff=5.0, ntop_vdj=1, ntop_vj=2, filter_missing=True, filter_extra=True, filter_ambiguous=False, save=None, verbose=True, **kwargs)

Check whether contigs should be considered ambiguous.

This function identifies and marks contigs as ambiguous, extra, or chimeric based on UMI/consensus dominance tests and gene call consistency. Uses vectorized polars operations for high performance.

Parameters:
  • vdj (DandelionPolars | pl.DataFrame | str) – V(D)J AIRR data to check. Can be DandelionPolars object, polars DataFrame, or file path to AIRR .tsv file.

  • adata (AnnData | None, optional) – AnnData object to filter. If provided, will track which cells have contigs. If None, assumes all cells in AIRR table should be kept.

  • productive_only (bool, default=True) – Whether to retain only productive contigs.

  • library_type (Literal["ig", "tr-ab", "tr-gd"] | None, optional) –

    If specified, filter based on expected contig types:
    • ig: IGH, IGK, IGL

    • tr-ab: TRA, TRB

    • tr-gd: TRG, TRD

  • umi_foldchange_cutoff (float, default=2.0) – Minimum UMI fold-change threshold for dominance test.

  • consensus_foldchange_cutoff (float, default=5.0) – Minimum consensus count fold-change threshold for dominance test.

  • ntop_vdj (int, default=1) – Number of top VDJ contigs to keep (IGH, TRB, TRD).

  • ntop_vj (int, default=2) – Number of top VJ contigs to keep (IGK, IGL, TRA, TRG).

  • filter_missing (bool, default=True) – If True and adata provided, remove cells not found in AnnData object.

  • filter_extra (bool, default=True) – Whether to remove contigs marked as extra.

  • filter_ambiguous (bool, default=False) – Whether to remove contigs marked as ambiguous.

  • save (str | None, optional) – If provided, save filtered table with _checked.tsv suffix.

  • verbose (bool, default=True) – Whether to print progress messages.

  • **kwargs – Additional kwargs passed to DandelionPolars constructor.
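
To illustrate the library_type filter and the VDJ/VJ chain grouping described in the parameters above, here is a minimal sketch in plain Python. It is not the library's implementation: the names LIBRARY_LOCI, VDJ_LOCI, VJ_LOCI, filter_by_library_type, and chain_class are hypothetical, and contigs are modelled as plain dicts carrying the AIRR locus field.

```python
# Hypothetical sketch; names and structure are assumptions, not dandelion's code.

# Loci expected for each library_type, as listed in the parameter description.
LIBRARY_LOCI = {
    "ig": {"IGH", "IGK", "IGL"},
    "tr-ab": {"TRA", "TRB"},
    "tr-gd": {"TRG", "TRD"},
}

# Chain grouping used by ntop_vdj and ntop_vj.
VDJ_LOCI = {"IGH", "TRB", "TRD"}
VJ_LOCI = {"IGK", "IGL", "TRA", "TRG"}


def filter_by_library_type(contigs, library_type=None):
    """Keep only contigs whose locus matches the expected library type."""
    if library_type is None:
        # No library type given: keep every contig.
        return list(contigs)
    allowed = LIBRARY_LOCI[library_type]
    return [c for c in contigs if c["locus"] in allowed]


def chain_class(locus):
    """Return 'VDJ' or 'VJ' for a locus, following the grouping above."""
    if locus in VDJ_LOCI:
        return "VDJ"
    if locus in VJ_LOCI:
        return "VJ"
    raise ValueError(f"unrecognised locus: {locus!r}")


contigs = [{"locus": "IGH"}, {"locus": "TRA"}, {"locus": "IGK"}]
ig_only = filter_by_library_type(contigs, "ig")
```

With this input, ig_only keeps the IGH and IGK records and drops the TRA record.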

Returns:

  • If adata is provided: (DandelionPolars object, updated AnnData)

  • If adata is None: DandelionPolars object only

Return type:

tuple[DandelionPolars, AnnData] | DandelionPolars
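
Because the return shape depends on whether adata was passed, calling code that handles both cases may want to normalise the result. The helper below is a hypothetical convenience, not part of dandelion; it assumes only that a DandelionPolars object is not itself a tuple.

```python
def split_result(result):
    """Normalise a check_contigs-style return value to (vdj, adata_or_None).

    Hypothetical helper: assumes the two-element tuple form is returned
    only when an AnnData object was supplied.
    """
    if isinstance(result, tuple):
        vdj, adata = result
        return vdj, adata
    # Only the V(D)J object came back (adata was None).
    return result, None
```

Callers can then unpack both forms the same way, for example vdj, adata = split_result(check_contigs(vdj_input)).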

Raises:
  • IndexError – If no contigs pass filtering.

  • ValueError – If save filename doesn’t end with .tsv.
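
The .tsv requirement on save can be checked before calling the function. The sketch below is an assumption about how the check and the _checked.tsv suffix might combine (here, by replacing the trailing .tsv); the helper name checked_output_path is hypothetical and the library may derive the output name differently.

```python
def checked_output_path(save):
    """Validate a save path and derive a '_checked.tsv' output name.

    Hypothetical sketch: assumes the suffix replaces the trailing '.tsv'.
    """
    if not save.endswith(".tsv"):
        raise ValueError(f"save must end with .tsv, got {save!r}")
    # Strip the extension and append the documented suffix.
    return save[: -len(".tsv")] + "_checked.tsv"
```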

Notes

This function:
  1. Filters by productive status and library type (if specified)
  2. Marks ambiguous/extra contigs using vectorized dominance tests
  3. Marks chimeric contigs (mismatched BCR/TCR genes)
  4. Optionally filters contigs based on flags
  5. Creates DandelionPolars object with metadata

The vectorized implementation uses mark_ambiguous_contigs_vec for 10-100x performance improvement over the original pandas-based version.
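
One plausible reading of a fold-change dominance test is sketched below in plain, non-vectorized Python. It is an assumption made for illustration, not the logic of mark_ambiguous_contigs_vec: the function name is_dominant is hypothetical, and the real implementation may use a strict comparison, handle ties or zero counts differently, or combine the UMI and consensus tests in another way.

```python
def is_dominant(counts, foldchange_cutoff):
    """Return True if the largest count is at least `foldchange_cutoff`
    times the second largest (hypothetical dominance rule)."""
    ranked = sorted(counts, reverse=True)
    if len(ranked) < 2:
        # A single contig has nothing to compete with.
        return True
    top, runner_up = ranked[0], ranked[1]
    if runner_up == 0:
        # Avoid division by zero; any positive top count dominates.
        return top > 0
    return top / runner_up >= foldchange_cutoff


# Per-contig UMI counts for one chain in one cell (made-up numbers).
clear_winner = is_dominant([10, 3], foldchange_cutoff=2.0)
too_close = is_dominant([10, 6], foldchange_cutoff=2.0)
```

Under this rule, counts of 10 and 3 pass the default UMI cutoff of 2.0, while 10 and 6 do not, so the second case would be a candidate for the ambiguous flag.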

Examples

>>> from dandelion.polars.preprocessing import check_contigs
>>> # Basic usage with DandelionPolars object
>>> ddl_polars = check_contigs(ddl_polars)
>>> # With AnnData filtering
>>> ddl_polars, adata = check_contigs(ddl_polars, adata=adata)
>>> # Custom thresholds
>>> ddl_polars = check_contigs(
...     ddl_polars,
...     umi_foldchange_cutoff=3.0,
...     consensus_foldchange_cutoff=10.0,
...     ntop_vdj=2,
...     ntop_vj=3
... )
>>> # From file path
>>> ddl_polars = check_contigs("filtered_contig_annotations.tsv")

See also

mark_ambiguous_contigs_vec

Core vectorized function for marking contigs

check_chimeric_genes_vec

Detects chimeric gene calls