dandelion.base.preprocessing.check_contigs
- dandelion.base.preprocessing.check_contigs(data, adata=None, productive_only=True, library_type=None, umi_foldchange_cutoff=2, consensus_foldchange_cutoff=5, ntop_vdj=1, ntop_vj=2, allow_exceptions=True, filter_missing=True, filter_extra=True, filter_ambiguous=False, save=None, verbose=True, **kwargs)
Check contigs for whether they can be considered as ambiguous or not.
Returns an ambiguous column with boolean T/F in the data. If the sequence_alignment is an exact match between contigs, the contigs are merged into the one with the highest UMI count, summing the UMI/duplicate counts. If a cell still has multiple contigs after this merge, they are checked for clear dominance in terms of UMI count, with two possible outcomes: 1) if one contig is dominant, all other contigs are flagged as ambiguous; 2) if none is dominant, all contigs are flagged as ambiguous. This is repeated for each cell, separately for its productive and non-productive VDJ and VJ contigs. Dominance is assessed by whether the UMI counts show a fold change greater than umi_foldchange_cutoff.
There are some exceptions: 1) IgM and IgD are allowed to co-exist in the same B cell if no other isotypes are detected; 2) TRD and TRB contigs are allowed in the same cell because rearrangement of the TRB and TRD loci happens at the same time during development and TRD variable region genes exhibit allelic inclusion. As a result, T cells expressing productive TRA-TRB chains can sometimes also express productive TRD chains. These exceptions can be toggled with the allow_exceptions argument.
Default behaviour is to consider only productive contigs and remove all non-productive contigs before checking; this is toggled by the productive_only argument.
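The merge and dominance steps described above can be sketched in plain Python. This is an illustrative simplification, not dandelion's implementation: the helper names merge_identical and flag_ambiguous are hypothetical, the record keys (sequence_id, sequence_alignment, umi_count) are assumed AIRR-style field names, and the sketch ignores ntop_vdj, ntop_vj, consensus_foldchange_cutoff and the allow_exceptions rules.

```python
from collections import defaultdict


def merge_identical(contigs):
    """Merge contigs that share an identical sequence_alignment.

    Hypothetical helper: the merged record keeps the fields of the contig
    with the highest UMI count, and its umi_count is the sum over the group.
    """
    groups = defaultdict(list)
    for contig in contigs:
        groups[contig["sequence_alignment"]].append(contig)
    merged = []
    for group in groups.values():
        best = max(group, key=lambda c: c["umi_count"])
        merged.append({**best, "umi_count": sum(c["umi_count"] for c in group)})
    return merged


def flag_ambiguous(contigs, umi_foldchange_cutoff=2):
    """Return {sequence_id: is_ambiguous} for one cell and one chain group.

    Assumption: "dominance" means the top contig has more than
    umi_foldchange_cutoff times the UMI count of the runner-up.
    """
    ranked = sorted(contigs, key=lambda c: c["umi_count"], reverse=True)
    if len(ranked) < 2:
        # A single contig has nothing to compete with.
        return {c["sequence_id"]: False for c in ranked}
    top, runner_up = ranked[0], ranked[1]
    dominant = top["umi_count"] > umi_foldchange_cutoff * runner_up["umi_count"]
    # Start with everything ambiguous, then rescue the top contig if dominant.
    flags = {c["sequence_id"]: True for c in ranked}
    if dominant:
        flags[top["sequence_id"]] = False
    return flags


# Toy data for one cell: c1 and c2 have identical alignments.
cell = [
    {"sequence_id": "c1", "sequence_alignment": "ACGT", "umi_count": 10},
    {"sequence_id": "c2", "sequence_alignment": "ACGT", "umi_count": 4},
    {"sequence_id": "c3", "sequence_alignment": "TTGA", "umi_count": 3},
]
merged = merge_identical(cell)   # c1 absorbs c2 and now carries 14 UMIs
flags = flag_ambiguous(merged)   # 14 > 2 * 3, so only c3 is flagged
```

In this toy cell the merged c1 contig is clearly dominant, so it is kept and c3 is flagged. If the two remaining contigs had similar UMI counts, both would be flagged as ambiguous.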
If library_type is provided, all contigs that do not belong to the related loci are removed. The rationale is that the choice of library type should mean that the primers used would most likely amplify those related sequences, so if there are any unexpected loci, they likely represent artifacts and shouldn’t be analysed.
If an adata object is provided, contigs with no corresponding cell barcode in the AnnData object are filtered out of the output if filter_missing is True.
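As a rough illustration of the library_type filter, the sketch below keeps only contigs whose locus matches the chosen library type. The locus sets come from the parameter description on this page; the helper name filter_by_library_type, the use of an AIRR-style locus key, and the ValueError for unknown values are assumptions made for the example, not dandelion's actual behaviour.

```python
# Expected loci per library type, as listed for the library_type parameter.
EXPECTED_LOCI = {
    "ig": {"IGH", "IGK", "IGL"},
    "tr-ab": {"TRA", "TRB"},
    "tr-gd": {"TRG", "TRD"},
}


def filter_by_library_type(contigs, library_type=None):
    """Hypothetical helper: drop contigs from loci the library should not amplify."""
    if library_type is None:
        # No library type given: keep everything.
        return list(contigs)
    if library_type not in EXPECTED_LOCI:
        # Assumption for this sketch only; the real function may behave differently.
        raise ValueError(f"unknown library_type: {library_type!r}")
    keep = EXPECTED_LOCI[library_type]
    return [c for c in contigs if c["locus"] in keep]


contigs = [
    {"sequence_id": "c1", "locus": "TRA"},
    {"sequence_id": "c2", "locus": "TRB"},
    {"sequence_id": "c3", "locus": "IGH"},  # unexpected in a tr-ab library
]
kept = filter_by_library_type(contigs, library_type="tr-ab")
```

Here the IGH contig is treated as a likely artifact of a TR alpha/beta library and is removed, while passing library_type=None leaves the input untouched.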
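The effect of filter_missing can be pictured as a barcode lookup. In the minimal sketch below, the helper name drop_missing_cells and the AIRR-style cell_id key are assumptions, and adata_barcodes stands in for the cell barcodes of the AnnData object (for example its obs_names).

```python
def drop_missing_cells(contigs, adata_barcodes, filter_missing=True):
    """Hypothetical helper: keep only contigs whose cell barcode is in the AnnData object."""
    if not filter_missing:
        return list(contigs)
    known = set(adata_barcodes)  # set lookup keeps the filter linear in the number of contigs
    return [c for c in contigs if c["cell_id"] in known]


contigs = [
    {"sequence_id": "c1", "cell_id": "AAAC-1"},
    {"sequence_id": "c2", "cell_id": "GGGT-1"},  # barcode absent from the AnnData object
]
kept = drop_missing_cells(contigs, adata_barcodes=["AAAC-1", "TTTG-1"])
```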
- Parameters:
data (Dandelion | pd.DataFrame | str) – V(D)J AIRR data to check. Can be a Dandelion object, a pandas DataFrame, or a file path to an AIRR .tsv file.
adata (AnnData | None, optional) – AnnData object to filter. If not provided, all cells in the AIRR table are kept and only a Dandelion object is returned.
productive_only (bool, optional) – whether or not to retain only productive contigs.
library_type (Literal["ig", "tr-ab", "tr-gd"] | None, optional) – if specified, it will first filter based on the expected type of contigs:
- ig: IGH, IGK, IGL
- tr-ab: TRA, TRB
- tr-gd: TRG, TRD
umi_foldchange_cutoff (int, optional) – related to minimum fold change of UMI count, required to rescue contigs/barcode otherwise they will be marked as extra/ambiguous.
consensus_foldchange_cutoff (int, optional) – related to minimum fold change of consensus count, required to rescue contigs/barcode otherwise they will be marked as extra/ambiguous.
ntop_vdj (int, optional) – number of top VDJ contigs to consider for dominance check.
ntop_vj (int, optional) – number of top VJ contigs to consider for dominance check.
allow_exceptions (bool, optional) – whether or not to allow exceptions for certain loci.
filter_missing (bool, optional) – whether or not cells in the V(D)J data that are not found in the AnnData object will be removed from the dandelion object.
filter_extra (bool, optional) – whether or not to remove contigs that are marked as extra.
filter_ambiguous (bool, optional) – whether or not to remove contigs that are marked as ambiguous. This step is only for cleaning up the .data slot as the ambiguous contigs are not passed to the .metadata slot anyway.
save (str | None, optional) – Only used if a pandas data frame or dandelion object is provided. Specifying will save the formatted vdj table with a _checked.tsv suffix extension.
verbose (bool, optional) – whether to print progress when marking contigs.
**kwargs – additional kwargs passed to dandelion.utilities._core.Dandelion.
- Returns:
checked dandelion V(D)J object and AnnData object.
- Return type:
tuple[Dandelion, AnnData]
- Raises:
IndexError – if no contigs passed filtering.
ValueError – if save file name is not suitable.