dandelion.polars.preprocessing.reannotate_genes

dandelion.polars.preprocessing.reannotate_genes(data, igblast_db=None, germline=None, org='human', loci='ig', extended=True, filename_prefix=None, flavour='strict', min_j_match=7, min_d_match=9, v_evalue=0.0001, d_evalue=0.001, j_evalue=0.0001, reassign_dj=True, overwrite=True, dust='no', db='imgt', strain=None, additional_args={'assigngenes': [], 'blastn_d': [], 'blastn_j': [], 'igblastn': [], 'makedb': []})

Reannotate cellranger fasta files with igblastn and parse the results to AIRR format.

Parameters:
  • data (list[str]) – list of fasta file locations, or of folder names containing fasta files. If provided as a single string, it is first converted to a list, so the function can be run on a single sample or on multiple samples.

  • igblast_db (str | None, optional) – path to igblast database folder. Defaults to the IGDATA environment variable.

  • germline (str | None, optional) – path to germline database folder. Defaults to the GERMLINE environment variable.

  • org (Literal[“human”, “mouse”], optional) – organism of germline database.

  • loci (Literal[“ig”, “tr”], optional) – mode for igblastn. ‘ig’ for BCRs, ‘tr’ for TCRs.

  • extended (bool, optional) – whether or not to transfer additional 10X annotations to output file.

  • filename_prefix (list[str] | str | None, optional) – list of prefixes of file names preceding ‘_contig’. None defaults to ‘all’.

  • flavour (Literal[“strict”, “original”], optional) – Either ‘strict’ or ‘original’. Determines how igblastn should be run. Running in ‘strict’ flavour adds the evalue and min_d_match options to the run.

  • min_j_match (int, optional) – Minimum J gene nucleotide matches. This controls the threshold for J gene detection: the minimal number of required nucleotide matches between the query sequence and the J genes.

  • min_d_match (int, optional) – Minimum D gene nucleotide matches. This controls the threshold for D gene detection. You can set the minimal number of required consecutive nucleotide matches between the query sequence and the D genes based on your own criteria. Note that the matches do not include overlapping matches at V-D or D-J junctions.

  • v_evalue (float, optional) – statistical significance (EXPECT) threshold for reporting V gene matches against database sequences. Lower thresholds are more stringent and report only high-similarity matches. Choose a higher value (for example 1 or more) if you expect low identity between your query sequence and the targets.

  • d_evalue (float, optional) – statistical significance (EXPECT) threshold for reporting D gene matches against database sequences. Lower thresholds are more stringent and report only high-similarity matches. Choose a higher value (for example 1 or more) if you expect low identity between your query sequence and the targets.

  • j_evalue (float, optional) – statistical significance (EXPECT) threshold for reporting J gene matches against database sequences. Lower thresholds are more stringent and report only high-similarity matches. Choose a higher value (for example 1 or more) if you expect low identity between your query sequence and the targets.

  • reassign_dj (bool, optional) – whether or not to perform a targeted blastn reassignment for D and J genes.

  • overwrite (bool, optional) – whether or not to overwrite the assignment if flavour = ‘strict’.

  • dust (str | None, optional) – dustmasker option: whether to filter the query sequence with DUST. Accepts a str: ‘yes’ to enable, or ‘no’ to disable. If None, defaults to ‘20 64 1’.

  • db (Literal[“imgt”, “ogrdb”], optional) – database to use for igblastn. Defaults to ‘imgt’.

  • strain (Literal[“c57bl6”, “balbc”, “129S1_SvImJ”, “AKR_J”, “A_J”, “BALB_c_ByJ”, “BALB_c”, “C3H_HeJ”, “C57BL_6J”, “C57BL_6”, “CAST_EiJ”, “CBA_J”, “DBA_1J”, “DBA_2J”, “LEWES_EiJ”, “MRL_MpJ”, “MSM_MsJ”, “NOD_ShiLtJ”, “NOR_LtJ”, “NZB_BlNJ”, “PWD_PhJ”, “SJL_J”] | None, optional) – strain of mouse to use for germline sequences. Only applies when db=“ogrdb”. Note that only “c57bl6”, “balbc”, “CAST_EiJ”, “LEWES_EiJ”, “MSM_MsJ”, “NOD_ShiLtJ” and “PWD_PhJ” contain both heavy chain and light chain germline sequences as a set. The other strains will not allow igblastn and MakeDb.py to generate a successful airr table (check the failed file). “c57bl6” and “balbc” are merged databases of “C57BL_6” with “C57BL_6J” and “BALB_c” with “BALB_c_ByJ” respectively. None defaults to all strains combined.

  • additional_args (dict[str, list[str]], optional) – additional arguments to pass to AssignGenes.py, MakeDb.py, igblastn and blastn. This accepts a dictionary whose keys are the names of the sub-functions (assigngenes, makedb, igblastn, blastn_j and blastn_d) and whose values are lists of arguments to pass to the relevant scripts/tools.
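
The parameters above can be combined as in the following minimal sketch. It assumes that dandelion, IgBLAST and the databases are installed, that the import path mirrors the documented name, and that the IGDATA and GERMLINE environment variables are set. The helper names, the sample paths and the `-num_threads` example flag are illustrative assumptions, not part of the documented API.

```python
from typing import Any, Dict, List, Optional

# Keys accepted by additional_args, copied from the default in the signature.
ADDITIONAL_ARG_KEYS = ("assigngenes", "blastn_d", "blastn_j", "igblastn", "makedb")


def build_reannotate_kwargs(
    extra: Optional[Dict[str, List[str]]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: assemble keyword arguments for reannotate_genes."""
    # Start from the documented default: one empty list per tool.
    additional_args: Dict[str, List[str]] = {key: [] for key in ADDITIONAL_ARG_KEYS}
    for key, args in (extra or {}).items():
        if key not in additional_args:
            raise KeyError(f"unknown additional_args key: {key!r}")
        additional_args[key] = list(args)
    return {
        "org": "human",
        "loci": "ig",          # 'ig' for BCRs, 'tr' for TCRs
        "flavour": "strict",
        "additional_args": additional_args,
    }


def run_reannotation(samples: List[str]) -> None:
    """Hypothetical wrapper; needs dandelion and the external tools installed."""
    # Imported lazily so the helper above can be used without dandelion.
    from dandelion.polars.preprocessing import reannotate_genes

    reannotate_genes(samples, **build_reannotate_kwargs())


# Example: pass an extra flag through to igblastn (illustrative flag choice).
kwargs = build_reannotate_kwargs({"igblastn": ["-num_threads", "4"]})
```

Calling `run_reannotation(["sample1", "sample2"])` with real sample folders would then run the reannotation with the defaults shown; the validation of dictionary keys is a convenience of this sketch rather than documented behaviour of the function.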

Raises:

  • FileNotFoundError – if the path to a fasta file cannot be found.
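
Since a missing input raises FileNotFoundError, it can be useful to check paths before starting a long run. The sketch below uses two hypothetical helpers, `missing_inputs` and `check_inputs`; they only test whether each entry exists on disk and make no assumption about how reannotate_genes locates fasta files inside a folder.

```python
from pathlib import Path
from typing import List, Union


def missing_inputs(data: Union[str, List[str]]) -> List[str]:
    """Return the entries of data that do not exist on disk (hypothetical helper)."""
    # Mirror the documented behaviour: a single string is treated as a one-item list.
    items = [data] if isinstance(data, str) else list(data)
    return [item for item in items if not Path(item).exists()]


def check_inputs(data: Union[str, List[str]]) -> None:
    """Raise FileNotFoundError early, listing every missing entry at once."""
    missing = missing_inputs(data)
    if missing:
        raise FileNotFoundError(f"input path(s) not found: {', '.join(missing)}")
```

Running `check_inputs(samples)` before the reannotation call reports all missing paths together instead of stopping at the first one.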