dandelion.polars.preprocessing.reannotate_genes

dandelion.polars.preprocessing.reannotate_genes(data, igblast_db=None, germline=None, org='human', loci='ig', extended=True, filename_prefix=None, flavour='strict', min_j_match=7, min_d_match=9, v_evalue=0.0001, d_evalue=0.001, j_evalue=0.0001, reassign_dj=True, overwrite=True, dust='no', db='imgt', lightchain_db=None, strain=None, additional_args={'assigngenes': [], 'blastn_d': [], 'blastn_j': [], 'igblastn': [], 'makedb': []})

Reannotate cellranger fasta files with igblastn and parse the results to AIRR format.

Parameters:

data (list[str]) – list of fasta file locations, or folder name containing fasta files. If provided as a single string, it is first converted to a list, so the function can be run on a single sample or on multiple samples.
igblast_db (str | None, optional) – path to igblast database folder. Defaults to the IGDATA environment variable.
germline (str | None, optional) – path to germline database folder. Defaults to the GERMLINE environment variable.
org (Literal[“human”, “mouse”], optional) – organism of germline database.
loci (Literal[“ig”, “tr”], optional) – mode for igblastn. ‘ig’ for BCRs, ‘tr’ for TCRs.
extended (bool, optional) – whether or not to transfer additional 10X annotations to output file.
filename_prefix (list[str] | str | None, optional) – list of prefixes of file names preceding ‘_contig’. None defaults to ‘all’.
flavour (Literal[“strict”, “original”], optional) – Either ‘strict’ or ‘original’. Determines how igblastn is run. The ‘strict’ flavour adds the evalue and min_d_match options to the run.
min_j_match (int, optional) – Minimum J gene nucleotide matches. This controls the threshold for J gene detection. You can set the minimal number of required consecutive nucleotide matches between the query sequence and the J genes based on your own criteria.
min_d_match (int, optional) – Minimum D gene nucleotide matches. This controls the threshold for D gene detection. You can set the minimal number of required consecutive nucleotide matches between the query sequence and the D genes based on your own criteria. Note that the matches do not include overlapping matches at V-D or D-J junctions.
v_evalue (float, optional) – Statistical significance (EXPECT) threshold for reporting V gene matches against database sequences. Lower EXPECT thresholds are more stringent and report only high-similarity matches. Choose a higher EXPECT value (for example 1 or more) if you expect low identity between your query sequence and the targets.
d_evalue (float, optional) – Statistical significance (EXPECT) threshold for reporting D gene matches against database sequences. Lower EXPECT thresholds are more stringent and report only high-similarity matches. Choose a higher EXPECT value (for example 1 or more) if you expect low identity between your query sequence and the targets.
j_evalue (float, optional) – Statistical significance (EXPECT) threshold for reporting J gene matches against database sequences. Lower EXPECT thresholds are more stringent and report only high-similarity matches. Choose a higher EXPECT value (for example 1 or more) if you expect low identity between your query sequence and the targets.
reassign_dj (bool, optional) – whether or not to perform a targeted blastn reassignment for D and J genes.
overwrite (bool, optional) – whether or not to overwrite the assignment if flavour = ‘strict’.
dust (str | None, optional) – dustmasker options for filtering the query sequence with DUST. Format: ‘yes’, or ‘no’ to disable. If None, defaults to ‘20 64 1’.
db (Literal[“imgt”, “ogrdb”, “kiarva”, “gkhlab”], optional) – database to use for igblastn. Defaults to ‘imgt’.
lightchain_db (Literal[“imgt”, “ogrdb”] | None, optional) – database to use for light chain annotation. None defaults to db. However, if db is ‘kiarva’ or ‘gkhlab’, None defaults to ‘imgt’; this option can also be set to ‘ogrdb’ if desired.
strain (Literal[“c57bl6”, “balbc”, “129S1_SvImJ”, “AKR_J”, “A_J”, “BALB_c_ByJ”, “BALB_c”, “C3H_HeJ”, “C57BL_6J”, “C57BL_6”, “CAST_EiJ”, “CBA_J”, “DBA_1J”, “DBA_2J”, “LEWES_EiJ”, “MRL_MpJ”, “MSM_MsJ”, “NOD_ShiLtJ”, “NOR_LtJ”, “NZB_BlNJ”, “PWD_PhJ”, “SJL_J”] | None, optional) – strain of mouse to use for germline sequences. Only for db=“ogrdb”. Note that only “c57bl6”, “balbc”, “CAST_EiJ”, “LEWES_EiJ”, “MSM_MsJ”, “NOD_ShiLtJ” and “PWD_PhJ” contain both heavy chain and light chain germline sequences as a set. With the other strains, igblastn and MakeDb.py will not generate a successful AIRR table (check the failed file). “c57bl6” and “balbc” are merged databases of “C57BL_6” with “C57BL_6J” and “BALB_c” with “BALB_c_ByJ” respectively. None defaults to all combined.
additional_args (dict[str, list[str]], optional) – additional arguments to pass to AssignGenes.py, MakeDb.py, igblastn and blastn. This accepts a dictionary whose keys are the names of the sub-functions (assigngenes, makedb, igblastn, blastn_j and blastn_d) and whose values are lists of arguments to pass to the relevant scripts/tools.
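
Example:

The parameters above might be combined as in the following sketch. It only assembles keyword arguments and defers the import, so the block runs without dandelion installed. The folder names, the make_additional_args and run helpers, and the extra igblastn flag are illustrative assumptions, not part of dandelion; whether a given flag conflicts with one the function already sets is not covered here. Actually calling run additionally assumes that IgBLAST, the germline databases and the IGDATA/GERMLINE variables are set up.

```python
# Sketch: assembling keyword arguments for reannotate_genes.
# Folder names, helper names and the extra igblastn flag are assumptions.

# The five keys documented for `additional_args`.
ALLOWED_TOOLS = ("assigngenes", "blastn_d", "blastn_j", "igblastn", "makedb")


def make_additional_args(**per_tool):
    """Return a dict with every documented key, using [] for unused ones."""
    unknown = set(per_tool) - set(ALLOWED_TOOLS)
    if unknown:
        raise KeyError(f"unknown sub-function(s): {sorted(unknown)}")
    return {tool: list(per_tool.get(tool, [])) for tool in ALLOWED_TOOLS}


kwargs = dict(
    data=["sample1", "sample2"],  # hypothetical folders containing fasta files
    org="human",
    loci="ig",
    flavour="strict",
    min_d_match=9,
    v_evalue=1e-4,
    d_evalue=1e-3,
    j_evalue=1e-4,
    # Extra flag forwarded to igblastn (illustrative choice).
    additional_args=make_additional_args(igblastn=["-num_threads", "4"]),
)


def run(call_kwargs):
    """Call reannotate_genes; needs dandelion, IgBLAST and the databases."""
    from dandelion.polars.preprocessing import reannotate_genes

    return reannotate_genes(**call_kwargs)
```

Building a fresh dictionary for each call also avoids sharing one list object between runs.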

Raises:

FileNotFoundError – if the path to a fasta file cannot be found.
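
One way to avoid this error is to check the input locations before starting a run. The sketch below uses two hypothetical helpers, normalise_data and missing_paths, that are not part of dandelion. It mirrors the documented conversion of a single string to a list and reports entries that do not exist on disk; how the function locates fasta files inside a folder is not reproduced here.

```python
from pathlib import Path


def normalise_data(data):
    """Wrap a single string in a list, as documented for `data`."""
    return [data] if isinstance(data, str) else list(data)


def missing_paths(data):
    """Return the entries of `data` that do not exist on disk."""
    return [p for p in normalise_data(data) if not Path(p).exists()]
```

If missing_paths returns a non-empty list, those entries can be corrected before reannotate_genes is called.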