dandelion.polars.tools.generate_network
- dandelion.polars.tools.generate_network(vdj, adata=None, key=None, clone_key=None, min_size=2, sample=None, force_replace=False, verbose=True, compute_graph=True, compute_layout=True, layout_method='mod_fr2', singleton_mass=0.5, expanded_only=False, use_existing_distance=True, use_existing_graph=True, n_cpus=1, sequential_chain=False, distance_mode='clone', dist_func=None, pad_to_max=False, lazy=False, zarr_path=None, chunk_size=None, memory_limit_gb=None, memory_safety_fraction=0.3, compress=True, random_state=None, **kwargs)
Generate a Levenshtein distance network based on VDJ and VJ sequences.
The distance matrices are then combined into a single matrix.
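As a rough illustration of the idea (not the library's implementation), the sketch below computes one Levenshtein matrix per chain for a few toy cells and adds them element-wise into a single matrix. The helper names (`levenshtein`, `pairwise`, `combine`) and the toy sequences are hypothetical, and element-wise addition is assumed as the combination rule.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def pairwise(seqs):
    """Full pairwise distance matrix for one chain, as a list of lists."""
    return [[levenshtein(x, y) for y in seqs] for x in seqs]


def combine(*matrices):
    """Add per-chain matrices element-wise (assumed combination rule)."""
    n = len(matrices[0])
    return [[sum(m[i][j] for m in matrices) for j in range(n)] for i in range(n)]


# Toy data: one VDJ and one VJ sequence per cell (made up for illustration).
vdj_seqs = ["CARDYW", "CARDFW", "CTRGYW"]
vj_seqs = ["CQQSYT", "CQQSYT", "CQQYNT"]

total = combine(pairwise(vdj_seqs), pairwise(vj_seqs))
for row in total:
    print(row)
# [0, 1, 4]
# [1, 0, 5]
# [4, 5, 0]
```

Cells 0 and 1 differ by a single residue in the VDJ chain only, so they end up closest in the combined matrix.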
- Parameters:
vdj (DandelionPolars) – Dandelion object.
key (str | None, optional) – column name for distance calculations. None defaults to ‘sequence_alignment_aa’.
clone_key (str | None, optional) – column name to build network on.
min_size (int, optional) – For visualization purposes, two graphs are created: one containing all cells and a second, trimmed graph. This value specifies the minimum number of edges a node must have; nodes with fewer edges are trimmed from the second graph.
sample (int | None, optional) – If specified, cells will be randomly sampled to the integer provided. If the integer is larger than the number of cells, sampling with replacement is used and the same cell may appear multiple times with different sequence and cell ids. If None, no resampling is performed. A new Dandelion class will be returned.
force_replace (bool, optional) – whether or not to sample with replacement when sample is smaller than or equal to the number of cells.
verbose (bool, optional) – whether or not to print the progress bars.
compute_graph (bool, optional) – whether or not to generate the graph after distance matrix calculation.
compute_layout (bool, optional) – whether or not to generate the layout. May be time consuming if too many cells.
layout_method (Literal[“mod_fr”, “mod_fr2”, “mod_fr2_gpu”, “mod_fr_bh”, “mod_fr_bh_gpu”, “fa2”], optional) – Layout algorithm. Options:
  - ‘mod_fr’: Original Python modified FR layout
  - ‘mod_fr2’: Numba-accelerated modified FR (faster CPU)
  - ‘mod_fr2_gpu’: PyTorch GPU modified FR (auto-tiles for >100K nodes)
  - ‘mod_fr_bh’: Barnes-Hut O(N log N) CPU layout (scalable for large graphs)
  - ‘mod_fr_bh_gpu’: Barnes-Hut O(N log N) GPU layout (scalable for large graphs, requires CUDA)
  - ‘fa2’: ForceAtlas2 (requires fa2-modified)
singleton_mass (float, optional) – Mass assigned to singleton nodes (no edges) in Barnes-Hut layouts. Lower values reduce their impact on pushing connected components apart. Default 0.5. Only used with ‘mod_fr_bh’ and ‘mod_fr_bh_gpu’.
expanded_only (bool, optional) – whether or not to only compute layout on expanded clonotypes.
use_existing_distance (bool, optional) – whether or not to use the pre-computed distance matrix in vdj.distances if it exists. If False, distances will be re-computed even if they already exist.
use_existing_graph (bool, optional) – whether or not to just compute the layout using the existing graph if it exists in the object.
n_cpus (int, optional) – number of cores to use for parallelizable steps. -1 uses all available cores.
sequential_chain (bool, optional) – whether or not to use the original distance calculation method, in which each chain is calculated separately and sequentially added to the total distance matrix. This method is slower but more precise. If False, concatenated sequences with a long separator are used for distance calculation. Ignored if lazy=True, as the lazy method always uses the long separator approach. The long separator approach inserts a long string of consistent characters on a per-chain basis to ensure that distances between chains are large and do not interfere with intra-chain distances.
distance_mode (Literal[“clone”, “full”], optional) – method to compute distance matrix. ‘clone’ refers to the original membership-based distance calculation, where only distances within clones are calculated, whereas ‘full’ computes the full pairwise distance matrix.
dist_func (Callable | str | None, optional) – distance function to use. If None, polyleven.levenshtein is used. If a string is provided, it will use Bio.Align’s substitution matrices (e.g., ‘BLOSUM62’, ‘PAM250’). See Bio.Align.substitution_matrices.load for available options.
pad_to_max (bool, optional) – whether or not to pad sequences to the maximum length in the dataset before distance calculation. This will allow for distance calculations that need sequences of the same length (e.g., Hamming distance). Note that this may increase memory usage and computation time.
lazy (bool, optional) – If True, computation will be performed lazily using Dask/Zarr arrays. Setting this to True also returns a Dask array view of the distance matrix stored on disk instead of a NumPy array stored in memory.
zarr_path (Path | str | None, optional) – Path to store Zarr array when using lazy mode. If None, “distance_matrix.zarr” will be created in the current working directory.
chunk_size (int | None, optional) – Chunk size for distance matrix computation when using lazy mode. If None, chunk size is automatically computed based on available memory and number of cores. The automatic chunk size can be further adjusted using memory_limit_gb and memory_safety_fraction parameters.
memory_limit_gb (float | None, optional) – Memory limit per worker in GB for Dask. None defaults to all available memory/cores.
memory_safety_fraction (float, optional) – Fraction of available memory to use. Defaults to 0.3 (i.e., 30% of available memory will be used for chunk size calculation).
compress (bool, optional) – Whether to compress the Zarr array using Blosc with zstd.
random_state (int | np.random.RandomState | None, optional) – Random state for reproducible sampling.
**kwargs – additional kwargs passed to layout functions in generate_layout.
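The difference between the two strategies described under sequential_chain can be sketched in plain Python. This is an illustration under stated assumptions rather than the library's code: the separator (`"#" * 50`), the helper names, and the toy cells are all hypothetical. The sequential variant computes one distance per chain and adds them; the separator variant joins the chains of a cell with a long run of a character that does not occur in any sequence and computes a single distance.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


# Hypothetical separator: a long run of a character absent from the sequences.
SEPARATOR = "#" * 50

# Each toy cell is a (VDJ, VJ) pair of made-up amino-acid sequences.
cells = [
    ("CARDYW", "CQQSYT"),
    ("CARDFW", "CQQSYT"),
    ("CTRGYW", "CQQYNT"),
]


def sequential_distance(x, y):
    """One distance per chain, added together."""
    return sum(levenshtein(cx, cy) for cx, cy in zip(x, y))


def separator_distance(x, y):
    """A single distance on the chains joined by the long separator."""
    return levenshtein(SEPARATOR.join(x), SEPARATOR.join(y))


pairs = [(0, 1), (0, 2), (1, 2)]
sequential = [sequential_distance(cells[i], cells[j]) for i, j in pairs]
joined = [separator_distance(cells[i], cells[j]) for i, j in pairs]

print(sequential)  # [1, 4, 5]
print(joined)      # [1, 4, 5]
```

On this toy input the two approaches agree. The single-pass separator variant needs only one distance call per pair of cells, which is the trade-off the parameter description alludes to; agreement on arbitrary inputs should not be assumed from this example.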
- Returns:
DandelionPolars object with .edges, .layout, .graph initialized.
- Return type:
DandelionPolars | tuple[DandelionPolars, AnnData]
- Raises:
ValueError – if there are any errors with the Dandelion input.
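A minimal usage sketch, assuming dandelion is installed and that `vdj` is an existing DandelionPolars object. The wrapper name `build_clone_network` is hypothetical; the keyword arguments are taken from the signature above. Because the documented return type may be either a single object or a tuple, the wrapper passes the result through unchanged.

```python
def build_clone_network(vdj, n_cpus=1, random_state=0):
    """Call generate_network with a few of the documented keyword arguments."""
    # Imported lazily so that this sketch can be loaded without dandelion installed.
    from dandelion.polars.tools import generate_network

    return generate_network(
        vdj,
        min_size=2,                # trimming threshold for the second graph
        distance_mode="clone",     # only compute distances within clones
        layout_method="mod_fr2",   # Numba-accelerated modified FR layout
        compute_layout=True,
        n_cpus=n_cpus,
        random_state=random_state,
    )
```

A call such as `result = build_clone_network(vdj)` would then be expected to yield an object with .edges, .layout and .graph initialized, per the Returns section.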