API

Import SNPmanifold as:

import SNPmanifold

Main Object

Object of type SNP_VAE for clustering with binomial mixture model

Parameters to initialize and load data into the main object SNP_VAE:

path (string) - path of cellSNP-lite output folder which contains cellSNP.tag.AD.mtx, cellSNP.tag.DP.mtx, and cellSNP.base.vcf.gz

SNP_mask (list of strings) - list of variant names to mask from the VAE; if you use a VCF as input, refer to the internal variant names in VCF['TEXT'] (default: [])

AD (string) - path of AD matrix in scipy.sparse.coo_matrix format with shape (SNP, cell)

DP (string) - path of DP matrix in scipy.sparse.coo_matrix format with shape (SNP, cell)

VCF (string) - path of VCF.gz file

variant_name (string) - path of variant_name.tsv file, which is a list of custom variant names stored as a pandas dataframe without header and index

SNPread (string) - optional observed-SNP normalization, ‘normalized’ or ‘unnormalized’ (default: ‘normalized’)

missing_value (float between 0 and 1) - value imputed for missing allele frequencies in the AF matrix, i.e. where DP = 0 (default: 0.5)

cell_weight (string) - optional cost normalization for each cell, ‘normalized’ or ‘unnormalized’ (default: ‘unnormalized’)

prior (string) - path of prior weights of mutation for each variant in csv format (default: None)
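A minimal end-to-end sketch of how the object might be used, built only from the signatures listed on this page. The helper name run_snpmanifold_workflow, its arguments, and the three threshold values are hypothetical placeholders. The call order (filtering, training, clustering, phylogeny) is inferred from the Attributes section rather than stated as a requirement.

```python
def run_snpmanifold_workflow(cellsnp_dir, use_gpu=False):
    """Run filtering, training, clustering and phylogeny in sequence (sketch)."""
    # Deferred import: SNPmanifold is only loaded when the workflow is run.
    import SNPmanifold

    # Load a cellSNP-lite output folder; DP = 0 entries are imputed as 0.5.
    vae = SNPmanifold.SNP_VAE(path=cellsnp_dir, missing_value=0.5)

    # Placeholder thresholds (assumptions, not recommendations). Leaving a
    # threshold as None asks for the value after the plot is shown.
    vae.filtering(
        cell_SNPread_threshold=10,
        SNP_DPmean_threshold=0.1,
        SNP_logit_var_threshold=0.1,
    )

    # Train the VAE with the documented defaults, on GPU only if requested.
    vae.training(num_epoch=2000, stepsize=0.0001, is_cuda=use_gpu)

    # Leiden clustering in the full-dimensional latent space.
    vae.clustering(algorithm="leiden_full", resolution=1)

    # Phylogenetic tree and SNP ranking.
    vae.phylogeny()
    return vae
```

Calling run_snpmanifold_workflow("path/to/cellsnp_output") would return the populated object, whose attributes are described at the end of this page.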

Functions

class SNPmanifold.SNP_VAE(path=None, SNP_mask=[], AD=None, DP=None, VCF=None, variant_name=None, SNPread='normalized', missing_value=0.5, cell_weight='unnormalized', prior=None, UMI_correction=None)

AF_scatter(SNP_name, dpi=100.0)

Visualize allele frequency of one particular SNP in latent space

Parameters:
  • SNP_name (string) – name of the SNP to visualize

  • dpi (float) – dpi resolution for figure

SNP_heatmap(SNP_name, dpi=100.0, bad_color='blue', fontsize_c=None, fontsize_x=None, fontsize_y=None, cmap_heatmap=<matplotlib.colors.ListedColormap object>)

Visualize allele frequency of specific SNPs in heatmap

Parameters:
  • SNP_name (list) – list of names of the SNPs to visualize

  • dpi (float) – dpi resolution for figures

  • bad_color (string) – color of heatmap when allele frequency is missing, i.e. DP = 0 (default: ‘blue’)

  • fontsize_c (float) – fontsize of cluster labels on heatmap

  • fontsize_x (float) – fontsize of cell labels on heatmap

  • fontsize_y (float) – fontsize of SNP labels on heatmap

  • cmap_heatmap – colormap used for heatmap visualization (default: mpl.colormaps['rocket'])

cluster_heatmap(cluster_order, SNP_no=50, dpi=100.0, bad_color='blue', fontsize_c=None, fontsize_x=None, fontsize_y=None, cmap_heatmap=<matplotlib.colors.ListedColormap object>, SNP_ranking='AF_diff')

Visualize allele frequency of specific clusters in heatmap

Parameters:
  • cluster_order (list) – list of clusters to visualize

  • SNP_no (integer) – number of top-ranked SNPs to be visualized in heatmap (default: 50)

  • dpi (float) – dpi resolution for figures

  • bad_color (string) – color of heatmap when allele frequency is missing, i.e. DP = 0 (default: ‘blue’)

  • fontsize_c (float) – fontsize of cluster labels on heatmap

  • fontsize_x (float) – fontsize of cell labels on heatmap

  • fontsize_y (float) – fontsize of SNP labels on heatmap

  • cmap_heatmap – colormap used for heatmap visualization (default: mpl.colormaps['rocket'])

  • SNP_ranking (string) – method for ranking SNPs, ‘variance’ or ‘AF_diff’ (default: ‘AF_diff’)

clustering(algorithm='leiden_full', max_cluster=15, resolution=1)

Cluster cells using k-means clustering or Leiden clustering in SCANPY, in either full-dimensional latent space or 3D UMAP

Parameters:
  • algorithm (string) – ‘kmeans_umap3d’, ‘kmeans_full’, ‘leiden_umap3d’, or ‘leiden_full’ (default: ‘leiden_full’)

  • max_cluster (integer) – for k-means clustering only, maximum number of clusters (default: 15)

  • resolution (float) – for Leiden clustering only, resolution of clusters (default: 1)

clustering_summary(dpi=100.0)

Re-display figures shown in clustering with higher dpi

Parameters:
  • dpi (float) – dpi resolution for figures

filtering(save_memory=False, cell_SNPread_threshold=None, SNP_DPmean_threshold=None, SNP_logit_var_threshold=None, filtering_only=False, num_neighbour=3, what_to_do='skip')

Filter low-quality cells and SNPs based on the number of observed SNPs for each cell, the mean coverage of each SNP, and the logit-variance of each SNP

Parameters:
  • save_memory (boolean) – if True, raw matrices and VCF will be deleted from the object to save memory (default: False)

  • cell_SNPread_threshold (float) – minimal number of observed SNPs for a cell to be included for analysis, input after showing the plot if None (default: None)

  • SNP_DPmean_threshold (float) – minimal cell-average coverage for a SNP to be included for analysis, input after showing the plot if None (default: None)

  • SNP_logit_var_threshold (float) – minimal logit-variance for a SNP to be included for analysis, input after showing the plot if None (default: None)

  • filtering_only (boolean) – if True, skip processing of the AF matrices to speed up filtering; these matrices are required for subsequent analyses (default: False)

  • num_neighbour (integer) – for missing_value = neighbour only, number of neighbouring cells for imputation (default: 3)

  • what_to_do (string) – what to do for cells with 0 observed SNPs after filtering (default: ‘skip’)
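The first two criteria can be illustrated on a toy depth matrix. This is a sketch only, not the package's internal code: the matrix, the threshold values, and the variable names are made up, and the logit-variance criterion is left out because its exact definition is not given here.

```python
import numpy as np

# Toy DP (depth) matrix with shape (SNP, cell), matching the DP input layout.
DP = np.array([
    [4, 0, 2, 0],
    [1, 0, 0, 0],
    [3, 5, 2, 0],
])

# Number of observed SNPs per cell: SNPs covered by at least one read.
observed_snps_per_cell = (DP > 0).sum(axis=0)

# Cell-average coverage per SNP.
mean_depth_per_snp = DP.mean(axis=1)

# Hypothetical thresholds, playing the role of cell_SNPread_threshold
# and SNP_DPmean_threshold.
cell_keep = observed_snps_per_cell >= 2
snp_keep = mean_depth_per_snp >= 1.0

print(observed_snps_per_cell)  # [3 1 2 0]
print(cell_keep)               # [ True False  True False]
print(snp_keep)                # [ True False  True]
```

Here the second and fourth cells are dropped for having too few observed SNPs, and the second SNP is dropped for low average coverage.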

filtering_summary(dpi=100.0)

Re-display figures shown in filtering with higher dpi

Parameters:
  • dpi (float) – dpi resolution for figures

phylogeny(cluster_no=None, pair_no=100, SNP_no=50, bad_color='blue', cmap_heatmap=<matplotlib.colors.ListedColormap object>, SNP_ranking='AF_diff')

Construct phylogenetic tree of cells in full-dimensional latent space and rank SNPs according to p-values

Parameters:
  • cluster_no (integer) – for k-means clustering only, number of clusters for phylogenetic tree construction and ranking of SNPs (default: None)

  • pair_no (integer) – number of pairs of cells to consider between each pair of clusters when constructing phylogenetic tree (default: 100)

  • SNP_no (integer) – number of top-ranked SNPs to be visualized in heatmap (default: 50)

  • bad_color (string) – color of heatmap when allele frequency is missing, i.e. DP = 0 (default: ‘blue’)

  • cmap_heatmap (mpl.colormaps) – colormap used for heatmap visualization (default: mpl.colormaps['rocket'])

  • SNP_ranking (string) – method for ranking SNPs, ‘variance’ or ‘AF_diff’ (default: ‘AF_diff’)

phylogeny_summary(SNP_no=None, dpi=100.0, bad_color='blue', fontsize_c=None, fontsize_x=None, fontsize_y=None, cmap_heatmap=<matplotlib.colors.ListedColormap object>, SNP_ranking='AF_diff', tree_fig_size=(12, 10))

Re-display figures shown in phylogeny with a higher dpi or a different number of SNPs, colors, and font sizes

Parameters:
  • SNP_no (integer) – number of top-ranked SNPs to be visualized in heatmap (default: None)

  • dpi (float) – dpi resolution for figures

  • bad_color (string) – color of heatmap when allele frequency is missing, i.e. DP = 0 (default: ‘blue’)

  • fontsize_c (float) – fontsize of cluster labels on heatmap

  • fontsize_x (float) – fontsize of cell labels on heatmap

  • fontsize_y (float) – fontsize of SNP labels on heatmap

  • cmap_heatmap – colormap used for heatmap visualization (default: mpl.colormaps['rocket'])

  • SNP_ranking (string) – method for ranking SNPs, ‘variance’ or ‘AF_diff’ (default: ‘AF_diff’)

  • tree_fig_size (tuple of numbers with length 2) – figure size of phylogenetic tree (default: (12, 10))

retrain_umap()

Re-train UMAP in the same latent space of VAE

training(num_epoch=2000, stepsize=0.0001, z_dim=None, beta=0, num_batch=5, is_cuda=True)

Train VAE using Adam optimizer and visualize latent space using PCA and UMAP

Parameters:
  • num_epoch (integer) – number of epochs for training VAE (default: 2000)

  • stepsize (float) – stepsize of Adam optimizer (default: 0.0001)

  • z_dim (integer) – dimension of latent space (default: None, which uses half of the number of filtered SNPs)

  • beta (float) – strength of standard Gaussian prior in cost of VAE (default: 0)

  • num_batch (integer) – number of batches for training VAE (default: 5)

  • is_cuda (boolean) – if True, train on CUDA; if False, train on CPU (default: True)
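For orientation, if the cost follows the common β-VAE form (an assumption; the exact cost used by SNP_VAE is not spelled out on this page), beta scales the Kullback-Leibler term that pulls the approximate posterior q(z | x) towards the standard Gaussian prior:

```latex
\mathcal{L}(x) = \mathbb{E}_{q(z \mid x)}\left[ -\log p(x \mid z) \right]
               + \beta \, \mathrm{KL}\left( q(z \mid x) \,\|\, \mathcal{N}(0, I) \right)
```

Under this form, the default beta = 0 removes the prior term, so the model is trained on the reconstruction term alone.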

training_summary(dpi=100.0)

Re-display figures shown in training with higher dpi

Parameters:
  • dpi (float) – dpi resolution for figures

Attributes

After running SNP_VAE.filtering():

cell_filter (np.array of booleans) - boolean filter for all input cells

SNP_filter (np.array of booleans) - boolean filter for all input SNPs

cell_total (integer) - total number of cells after filtering

SNP_total (integer) - total number of SNPs after filtering

AD_filtered (np.array with shape (cell_total, SNP_total)) - AD matrix after filtering

DP_filtered (np.array with shape (cell_total, SNP_total)) - DP matrix after filtering

AF_filtered (torch.tensor with shape (cell_total, SNP_total)) - AF matrix which is the input to VAE

VCF_filtered (pd.DataFrame) - VCF after filtering which contains variant names
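The relationship between the raw (SNP, cell) matrices, the two boolean filters, and the filtered (cell_total, SNP_total) matrices can be sketched with NumPy. This is an illustration under assumptions, not the package's internal code: the toy matrices and filters are made up, and the AF matrix is kept as a NumPy array here, whereas AF_filtered is documented as a torch.tensor.

```python
import numpy as np

# Toy AD/DP matrices with shape (SNP, cell) = (4, 3).
AD = np.array([[1, 0, 2],
               [0, 0, 0],
               [3, 5, 0],
               [0, 1, 4]])
DP = np.array([[4, 0, 2],
               [1, 0, 0],
               [3, 5, 0],
               [2, 1, 4]])

# Boolean filters over all input cells and all input SNPs (made-up values).
cell_filter = np.array([True, False, True])
SNP_filter = np.array([True, False, True, True])

# Keep the selected SNPs and cells, then transpose to (cell, SNP).
AD_filtered = AD[np.ix_(SNP_filter, cell_filter)].T
DP_filtered = DP[np.ix_(SNP_filter, cell_filter)].T

# Allele frequency AD / DP, with DP = 0 entries set to missing_value.
missing_value = 0.5
AF = np.full(DP_filtered.shape, missing_value, dtype=float)
observed = DP_filtered > 0
AF[observed] = AD_filtered[observed] / DP_filtered[observed]

print(AF.shape)  # (2, 3), i.e. (cell_total, SNP_total)
print(AF)
# [[0.25 1.   0.  ]
#  [1.   0.5  1.  ]]
```

The single 0.5 in the second row is the imputed entry: that cell has no reads at the corresponding SNP.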

After running SNP_VAE.training():

model (VAE_normalized) - trained VAE model implemented in PyTorch

latent (np.array with shape (cell_total, z_dim)) - latent factors of all cells after filtering

pc (np.array with shape (cell_total, z_dim)) - principal components of PCA of the latent space

embedding_2d (np.array with shape (cell_total, 2)) - 2D UMAP embedding of the latent space

embedding_3d (np.array with shape (cell_total, 3)) - 3D UMAP embedding of the latent space

After running SNP_VAE.clustering() and SNP_VAE.phylogeny():

cluster_no (integer) - total number of clusters

assigned_label (np.array of integers with shape (cell_total)) - assigned cluster labels of all cells after filtering

clusters (list of np.arrays with length (cluster_no)) - np.where(assigned_label == r) for each cluster r

colors (np.array with shape (cluster_no, 4)) - colors of all clusters in figures

edge (list of tuples with length (cluster_no - 1)) - all connected edges in the phylogenetic tree

f_stat (np.array with shape (SNP_total)) - F-statistics of all SNPs after filtering

p_value (np.array with shape (SNP_total)) - P-values of all SNPs after filtering

rank_SNP (np.array with shape (SNP_total)) - ranking of SNPs from the lowest p-value
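How these attributes relate to each other can be sketched on toy data. The labels, p-values, and variant names below are made up. The sketch assumes that rank_SNP holds SNP indices ordered by ascending p-value, which is one reading of the description above, and it takes element [0] of np.where to obtain a plain index array.

```python
import numpy as np

# Toy cluster labels for six filtered cells.
assigned_label = np.array([0, 2, 1, 0, 2, 2])
cluster_no = 3

# Indices of the cells belonging to each cluster r.
clusters = [np.where(assigned_label == r)[0] for r in range(cluster_no)]
print([c.tolist() for c in clusters])  # [[0, 3], [2], [1, 4, 5]]

# Toy p-values for four filtered SNPs, with hypothetical variant names.
p_value = np.array([0.20, 0.001, 0.05, 0.0004])
variant_names = ["SNP_a", "SNP_b", "SNP_c", "SNP_d"]

# SNP indices ordered from the lowest p-value to the highest.
rank_SNP = np.argsort(p_value)
top_two = [variant_names[i] for i in rank_SNP[:2]]
print(rank_SNP.tolist())  # [3, 1, 2, 0]
print(top_two)            # ['SNP_d', 'SNP_b']
```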