dandelion.polars.preprocessing.check_contigs
- dandelion.polars.preprocessing.check_contigs(vdj, adata=None, productive_only=True, library_type=None, umi_foldchange_cutoff=2.0, consensus_foldchange_cutoff=5.0, ntop_vdj=1, ntop_vj=2, filter_missing=True, filter_extra=True, filter_ambiguous=False, save=None, verbose=True, **kwargs)
Check whether contigs should be considered ambiguous.
This function identifies and marks contigs as ambiguous, extra, or chimeric based on UMI/consensus dominance tests and gene call consistency. It uses vectorized polars operations for performance.
- Parameters:
vdj (DandelionPolars | pl.DataFrame | str) – V(D)J AIRR data to check. Can be DandelionPolars object, polars DataFrame, or file path to AIRR .tsv file.
adata (AnnData | None, optional) – AnnData object to filter. If provided, will track which cells have contigs. If None, assumes all cells in AIRR table should be kept.
productive_only (bool, default=True) – Whether to retain only productive contigs.
library_type (Literal["ig", "tr-ab", "tr-gd"] | None, optional) –
- If specified, filter based on expected contig types:
ig: IGH, IGK, IGL
tr-ab: TRA, TRB
tr-gd: TRG, TRD
umi_foldchange_cutoff (float, default=2.0) – Minimum UMI fold-change threshold for dominance test.
consensus_foldchange_cutoff (float, default=5.0) – Minimum consensus count fold-change threshold for dominance test.
ntop_vdj (int, default=1) – Number of top VDJ contigs to keep (IGH, TRB, TRD).
ntop_vj (int, default=2) – Number of top VJ contigs to keep (IGK, IGL, TRA, TRG).
filter_missing (bool, default=True) – If True and adata provided, remove cells not found in AnnData object.
filter_extra (bool, default=True) – Whether to remove contigs marked as extra.
filter_ambiguous (bool, default=False) – Whether to remove contigs marked as ambiguous.
save (str | None, optional) – If provided, save filtered table with _checked.tsv suffix.
verbose (bool, default=True) – Whether to print progress messages.
**kwargs – Additional kwargs passed to DandelionPolars constructor.
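The two fold-change cutoffs can be read as a dominance test between the top-ranked contig and its runner-up. The sketch below is a minimal plain-Python illustration under that assumption. The function name `top_is_dominant` is hypothetical, and the exact rule used inside check_contigs (tie handling, treatment of zero counts, how UMI and consensus tests are combined) is not specified on this page.

```python
def top_is_dominant(counts, foldchange_cutoff):
    """Illustrative dominance test (assumed rule, not the library's code).

    Returns True when the largest count is at least `foldchange_cutoff`
    times the second-largest count.
    """
    ranked = sorted(counts, reverse=True)
    if len(ranked) < 2:
        # A single contig has no competitor to be compared against.
        return True
    top, runner_up = ranked[0], ranked[1]
    if runner_up == 0:
        # Avoid division by zero: any positive count dominates zero.
        return top > 0
    return top / runner_up >= foldchange_cutoff


# UMI counts for two contigs of the same chain in one cell.
print(top_is_dominant([10, 4], foldchange_cutoff=2.0))  # 10 / 4 = 2.5
print(top_is_dominant([10, 6], foldchange_cutoff=2.0))  # 10 / 6 is about 1.67
```

Under this reading, raising umi_foldchange_cutoff or consensus_foldchange_cutoff makes it harder for a single contig to be called dominant, so more contigs would be flagged.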
- Returns:
If adata is provided: (DandelionPolars object, updated AnnData).
If adata is None: DandelionPolars object only.
- Return type:
tuple[DandelionPolars, AnnData] | DandelionPolars
- Raises:
IndexError – If no contigs pass filtering.
ValueError – If save filename doesn’t end with .tsv.
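The interaction between the save parameter and the ValueError above can be sketched as follows. This is an assumption-laden illustration: the helper name `checked_output_path` is hypothetical, and it assumes the `_checked.tsv` suffix replaces the trailing `.tsv` of the given filename, which this page does not state explicitly.

```python
def checked_output_path(save):
    """Derive an output path with a `_checked.tsv` suffix (assumed behaviour)."""
    if not save.endswith(".tsv"):
        # Mirrors the documented ValueError for non-.tsv filenames.
        raise ValueError(f"save must end with '.tsv', got {save!r}")
    # Strip the trailing extension and append the documented suffix.
    return save[: -len(".tsv")] + "_checked.tsv"


print(checked_output_path("sample1/filtered_contig.tsv"))
```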
Notes
This function:
1. Filters by productive status and library type (if specified).
2. Marks ambiguous/extra contigs using vectorized dominance tests.
3. Marks chimeric contigs (mismatched BCR/TCR genes).
4. Optionally filters contigs based on flags.
5. Creates a DandelionPolars object with metadata.
The vectorized implementation uses mark_ambiguous_contigs_vec for 10-100x performance improvement over the original pandas-based version.
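The first step in the list above (filtering by productive status and library type) can be illustrated with plain Python. This is a sketch, not the library's vectorized implementation: the helper `prefilter` is hypothetical, contigs are represented as dictionaries, and the field names `locus` and `productive` are assumed to follow the AIRR rearrangement schema with `productive` already parsed to a boolean. The locus sets come from the library_type description above.

```python
# Expected loci per library type, as listed for the library_type parameter.
LIBRARY_LOCI = {
    "ig": {"IGH", "IGK", "IGL"},
    "tr-ab": {"TRA", "TRB"},
    "tr-gd": {"TRG", "TRD"},
}


def prefilter(contigs, productive_only=True, library_type=None):
    """Keep contigs that pass the productive and library-type filters."""
    kept = []
    for contig in contigs:
        if productive_only and not contig["productive"]:
            continue  # drop non-productive contigs
        if library_type is not None and contig["locus"] not in LIBRARY_LOCI[library_type]:
            continue  # drop loci not expected for this library type
        kept.append(contig)
    return kept


example = [
    {"sequence_id": "c1", "locus": "IGH", "productive": True},
    {"sequence_id": "c2", "locus": "IGK", "productive": False},
    {"sequence_id": "c3", "locus": "TRB", "productive": True},
]
print([c["sequence_id"] for c in prefilter(example, library_type="ig")])
```

With library_type=None only the productive filter applies, so a TRB contig in a mixed table would be retained.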
Examples
>>> # Basic usage with DandelionPolars object
>>> ddl_polars = check_contigs(ddl_polars)
>>> # With AnnData filtering
>>> ddl_polars, adata = check_contigs(ddl_polars, adata=adata)
>>> # Custom thresholds
>>> ddl_polars = check_contigs(
...     ddl_polars,
...     umi_foldchange_cutoff=3.0,
...     consensus_foldchange_cutoff=10.0,
...     ntop_vdj=2,
...     ntop_vj=3
... )
>>> # From file path
>>> ddl_polars = check_contigs("filtered_contig_annotations.tsv")
See also
mark_ambiguous_contigs_vec – Core vectorized function for marking contigs.
check_chimeric_genes_vec – Detects chimeric gene calls.