API
Import SNPmanifold as:
import SNPmanifold
Main Object
Object of type SNP_VAE for clustering with binomial mixture model
Parameters to initialize and load data into the main object SNP_VAE:
path (string) - path of cellSNP-lite output folder which contains cellSNP.tag.AD.mtx, cellSNP.tag.DP.mtx, and cellSNP.base.vcf.gz
SNP_mask (list of strings) - list of variant names to mask from the VAE; if you use a VCF as input, refer to the internal variant names in VCF[‘TEXT’] (default: [])
AD (string) - path of AD matrix in scipy.sparse.coo_matrix format with shape (SNP, cell)
DP (string) - path of DP matrix in scipy.sparse.coo_matrix format with shape (SNP, cell)
VCF (string) - path of VCF.gz file
variant_name (string) - path of variant_name.tsv file, which is a list of custom variant names stored as a pandas dataframe without header and index
SNPread (string) - optional observed-SNP normalization, ‘normalized’ or ‘unnormalized’ (default: ‘normalized’)
missing_value (float between 0 and 1) - imputed value for missing allele frequencies in the AF matrix, i.e. where DP = 0 (default: 0.5)
cell_weight (string) - optional cost normalization for each cell, ‘normalized’ or ‘unnormalized’ (default: ‘unnormalized’)
prior (string) - path of prior weights of mutation for each variant in csv format (default: None)
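To make the role of missing_value concrete, here is a minimal sketch of how an AF matrix could be derived from AD and DP, with entries where DP = 0 filled by missing_value. The helper allele_frequency is hypothetical and written with plain Python lists for illustration; it is not the library's own implementation, which may differ (for example when neighbour-based imputation is used).

```python
def allele_frequency(ad, dp, missing_value=0.5):
    """Return AD / DP element-wise, using missing_value where DP == 0.

    Hypothetical helper for illustration only; not part of SNPmanifold.
    """
    return [
        [a / d if d > 0 else missing_value for a, d in zip(ad_row, dp_row)]
        for ad_row, dp_row in zip(ad, dp)
    ]


# Toy matrices with shape (SNP, cell): 2 SNPs x 3 cells.
AD = [[0, 2, 0],
      [1, 0, 4]]
DP = [[4, 2, 0],
      [2, 0, 4]]

AF = allele_frequency(AD, DP)
print(AF)  # [[0.0, 1.0, 0.5], [0.5, 0.5, 1.0]]
```

Note that an observed heterozygous-like value (1 / 2) and an imputed entry (DP = 0) both appear as 0.5 here, which is why the heatmap functions take a separate bad_color for missing entries.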
Functions
- class SNPmanifold.SNP_VAE(path=None, SNP_mask=[], AD=None, DP=None, VCF=None, variant_name=None, SNPread='normalized', missing_value=0.5, cell_weight='unnormalized', prior=None, UMI_correction=None)
- AF_scatter(SNP_name, dpi=100.0)
Visualize allele frequency of one particular SNP in latent space
- Parameters:
SNP_name (string) – name of the SNP to visualize
dpi (float) – dpi resolution for figure
- SNP_heatmap(SNP_name, dpi=100.0, bad_color='blue', fontsize_c=None, fontsize_x=None, fontsize_y=None, cmap_heatmap=<matplotlib.colors.ListedColormap object>)
Visualize allele frequency of specific SNPs in heatmap
- Parameters:
SNP_name (list) – list of names of the SNPs to visualize
dpi (float) – dpi resolution for figures
bad_color (string) – color of heatmap when allele frequency is missing, i.e. DP = 0 (default: ‘blue’)
fontsize_c (float) – fontsize of cluster labels on heatmap
fontsize_x (float) – fontsize of cell labels on heatmap
fontsize_y (float) – fontsize of SNP labels on heatmap
cmap_heatmap – colormap used for heatmap visualization (default: mpl.colormaps[‘rocket’])
- cluster_heatmap(cluster_order, SNP_no=50, dpi=100.0, bad_color='blue', fontsize_c=None, fontsize_x=None, fontsize_y=None, cmap_heatmap=<matplotlib.colors.ListedColormap object>, SNP_ranking='AF_diff')
Visualize allele frequency of specific clusters in heatmap
- Parameters:
cluster_order (list) – list of clusters to visualize
SNP_no (integer) – number of top-ranked SNPs to be visualized in heatmap (default: 50)
dpi (float) – dpi resolution for figures
bad_color (string) – color of heatmap when allele frequency is missing, i.e. DP = 0 (default: ‘blue’)
fontsize_c (float) – fontsize of cluster labels on heatmap
fontsize_x (float) – fontsize of cell labels on heatmap
fontsize_y (float) – fontsize of SNP labels on heatmap
cmap_heatmap – colormap used for heatmap visualization (default: mpl.colormaps[‘rocket’])
SNP_ranking (string) – method for ranking SNPs, ‘variance’ or ‘AF_diff’ (default: ‘AF_diff’)
- clustering(algorithm='leiden_full', max_cluster=15, resolution=1)
Cluster cells using k-means clustering or Leiden clustering in SCANPY, in either full-dimensional latent space or 3D UMAP
- Parameters:
algorithm (string) – ‘kmeans_umap3d’, ‘kmeans_full’, ‘leiden_umap3d’, or ‘leiden_full’ (default: ‘leiden_full’)
max_cluster (integer) – for k-means clustering only, maximum number of clusters (default: 15)
resolution (float) – for Leiden clustering only, resolution of clusters (default: 1)
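The algorithm string combines a clustering method with the space it operates on, and each method reads only one of the two remaining parameters. The sketch below spells out that mapping; split_algorithm and relevant_option are hypothetical helpers for illustration and do not exist in SNPmanifold.

```python
VALID_ALGORITHMS = ("kmeans_umap3d", "kmeans_full", "leiden_umap3d", "leiden_full")


def split_algorithm(algorithm="leiden_full"):
    """Split an algorithm string into (method, space). Hypothetical helper."""
    if algorithm not in VALID_ALGORITHMS:
        raise ValueError(f"algorithm must be one of {VALID_ALGORITHMS}")
    method, space = algorithm.split("_")
    return method, space


def relevant_option(algorithm):
    """Name the parameter that the chosen method uses, per the list above."""
    method, _ = split_algorithm(algorithm)
    return "max_cluster" if method == "kmeans" else "resolution"


print(split_algorithm("kmeans_umap3d"))  # ('kmeans', 'umap3d')
print(relevant_option("leiden_full"))    # resolution
```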
- clustering_summary(dpi=100.0)
Re-display figures shown in clustering with higher dpi
- Parameters:
dpi (float) – dpi resolution for figures
- filtering(save_memory=False, cell_SNPread_threshold=None, SNP_DPmean_threshold=None, SNP_logit_var_threshold=None, filtering_only=False, num_neighbour=3, what_to_do='skip')
Filter low quality cells and SNPs based on number of observed SNPs for each cell, mean coverage of each SNP, and logit-variance of each SNP
- Parameters:
save_memory (boolean) – if True, raw matrices and VCF will be deleted from the object to save memory (default: False)
cell_SNPread_threshold (float) – minimal number of observed SNPs for a cell to be included for analysis; if None, the value is entered after the plot is shown (default: None)
SNP_DPmean_threshold (float) – minimal cell-average coverage for a SNP to be included for analysis; if None, the value is entered after the plot is shown (default: None)
SNP_logit_var_threshold (float) – minimal logit-variance for a SNP to be included for analysis; if None, the value is entered after the plot is shown (default: None)
filtering_only (boolean) – if True, skip processing of the AF matrices to speed up filtering; these matrices are required for subsequent analyses (default: False)
num_neighbour (integer) – for missing_value = neighbour only, number of neighbouring cells for imputation (default: 3)
what_to_do (string) – what to do for cells with 0 observed SNPs after filtering (default: ‘skip’)
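As a rough illustration of the first two thresholds, the sketch below derives boolean cell and SNP filters from a toy DP matrix with shape (SNP, cell). It rests on several assumptions: quality_filters is a hypothetical helper, the thresholds are treated as inclusive minimums, the mean coverage is taken over all input cells, and the logit-variance filter is left out because its exact definition is not given here. The library's own filtering may differ in each of these points.

```python
def quality_filters(dp, cell_SNPread_threshold, SNP_DPmean_threshold):
    """Return (cell_filter, SNP_filter) as lists of booleans.

    Hypothetical helper for illustration only; not part of SNPmanifold.
    dp is a list of rows with shape (SNP, cell).
    """
    n_snp, n_cell = len(dp), len(dp[0])
    # Number of SNPs with at least one read (DP > 0) in each cell.
    observed_per_cell = [
        sum(1 for s in range(n_snp) if dp[s][c] > 0) for c in range(n_cell)
    ]
    # Coverage of each SNP averaged over cells.
    mean_dp_per_snp = [sum(row) / n_cell for row in dp]
    cell_filter = [n >= cell_SNPread_threshold for n in observed_per_cell]
    snp_filter = [m >= SNP_DPmean_threshold for m in mean_dp_per_snp]
    return cell_filter, snp_filter


# Toy DP matrix: 3 SNPs x 3 cells; the last cell has no reads at all.
DP = [[4, 2, 0],
      [2, 0, 0],
      [0, 1, 0]]

cell_filter, SNP_filter = quality_filters(DP, cell_SNPread_threshold=1,
                                          SNP_DPmean_threshold=0.5)
cell_total, SNP_total = sum(cell_filter), sum(SNP_filter)
print(cell_filter, SNP_filter, cell_total, SNP_total)
# [True, True, False] [True, True, False] 2 2
```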
- filtering_summary(dpi=100.0)
Re-display figures shown in filtering with higher dpi
- Parameters:
dpi (float) – dpi resolution for figures
- phylogeny(cluster_no=None, pair_no=100, SNP_no=50, bad_color='blue', cmap_heatmap=<matplotlib.colors.ListedColormap object>, SNP_ranking='AF_diff')
Construct phylogenetic tree of cells in full-dimensional latent space and rank SNPs according to p-values
- Parameters:
cluster_no (integer) – for k-means clustering only, number of clusters for phylogenetic tree construction and ranking of SNPs (default: None)
pair_no (integer) – number of pairs of cells to consider between each pair of clusters when constructing phylogenetic tree (default: 100)
SNP_no (integer) – number of top-ranked SNPs to be visualized in heatmap (default: 50)
bad_color (string) – color of heatmap when allele frequency is missing, i.e. DP = 0 (default: ‘blue’)
cmap_heatmap (mpl.colormaps) – colormap used for heatmap visualization (default: mpl.colormaps[‘rocket’])
SNP_ranking (string) – method for ranking SNPs, ‘variance’ or ‘AF_diff’ (default: ‘AF_diff’)
- phylogeny_summary(SNP_no=None, dpi=100.0, bad_color='blue', fontsize_c=None, fontsize_x=None, fontsize_y=None, cmap_heatmap=<matplotlib.colors.ListedColormap object>, SNP_ranking='AF_diff', tree_fig_size=(12, 10))
Re-display figures shown in phylogeny with higher dpi, a different number of SNPs, and different colors and font sizes
- Parameters:
SNP_no (integer) – number of top-ranked SNPs to be visualized in heatmap (default: 50)
dpi (float) – dpi resolution for figures
bad_color (string) – color of heatmap when allele frequency is missing, i.e. DP = 0 (default: ‘blue’)
fontsize_c (float) – fontsize of cluster labels on heatmap
fontsize_x (float) – fontsize of cell labels on heatmap
fontsize_y (float) – fontsize of SNP labels on heatmap
cmap_heatmap – colormap used for heatmap visualization (default: mpl.colormaps[‘rocket’])
SNP_ranking (string) – method for ranking SNPs, ‘variance’ or ‘AF_diff’ (default: ‘AF_diff’)
tree_fig_size (tuple of numbers with length 2) – figure size of phylogenetic tree (default: (12, 10))
- retrain_umap()
Re-train UMAP in the same latent space of VAE
- training(num_epoch=2000, stepsize=0.0001, z_dim=None, beta=0, num_batch=5, is_cuda=True)
Train VAE using Adam optimizer and visualize latent space using PCA and UMAP
- Parameters:
num_epoch (integer) – number of epochs for training VAE (default: 2000)
stepsize (float) – stepsize of Adam optimizer (default: 0.0001)
z_dim (integer) – dimension of latent space (default: half of number of filtered SNPs)
beta (float) – strength of standard Gaussian prior in cost of VAE (default: 0)
num_batch (integer) – number of batches for training VAE (default: 5)
is_cuda (boolean) – Set True if you want to use CUDA, set False if you want to use CPU (default: True)
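To show what a prior strength of beta = 0 means, here is a minimal sketch of a beta-weighted VAE cost in its textbook form: a reconstruction term plus beta times the KL divergence between the encoder's diagonal Gaussian and a standard Gaussian. This is an assumption about the general shape of the cost, not the library's exact objective, and kl_to_standard_normal and vae_cost are hypothetical names.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, exp(log_var)) || N(0, 1) ), summed over latent dimensions."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def vae_cost(reconstruction_loss, mu, log_var, beta=0.0):
    """Textbook beta-VAE cost; with beta = 0 the prior term drops out."""
    return reconstruction_loss + beta * kl_to_standard_normal(mu, log_var)


mu, log_var = [1.0, 0.0], [0.0, 0.0]
print(kl_to_standard_normal(mu, log_var))     # 0.5
print(vae_cost(3.0, mu, log_var, beta=0.0))   # 3.0
print(vae_cost(3.0, mu, log_var, beta=2.0))   # 4.0
```

With the default beta = 0, only the reconstruction term contributes in this form, so the latent space is not pulled towards the standard Gaussian.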
- training_summary(dpi=100.0)
Re-display figures shown in training with higher dpi
- Parameters:
dpi (float) – dpi resolution for figures
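The functions above can be chained into one analysis run. The sketch below uses only the signatures documented in this section and assumes the order filtering, training, clustering, phylogeny, as suggested by the attribute groups that follow. The threshold values in FILTER_KWARGS are illustrative placeholders, not recommendations, and run_pipeline is a hypothetical wrapper; SNPmanifold is imported inside it so the snippet can be loaded even where the package is not installed.

```python
# Illustrative placeholder thresholds; choose real values from the filtering plots.
FILTER_KWARGS = {
    "cell_SNPread_threshold": 50,
    "SNP_DPmean_threshold": 0.1,
    "SNP_logit_var_threshold": 0.01,
}


def run_pipeline(path, is_cuda=False):
    """Run filtering, training, clustering and phylogeny on a cellSNP-lite folder.

    Hypothetical wrapper for illustration; path is the cellSNP-lite output folder.
    """
    import SNPmanifold  # lazy import: requires the package to be installed

    vae = SNPmanifold.SNP_VAE(path=path)
    # Passing all three thresholds avoids entering them after the plots.
    vae.filtering(**FILTER_KWARGS)
    vae.training(is_cuda=is_cuda)
    vae.clustering(algorithm="leiden_full", resolution=1)
    vae.phylogeny()
    return vae
```

A call such as run_pipeline("path/to/cellsnp_output") would then return the SNP_VAE object with the attributes listed below filled in; the folder name here is a placeholder.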
Attributes
After running SNP_VAE.filtering():
cell_filter (np.array of booleans) - boolean filter for all input cells
SNP_filter (np.array of booleans) - boolean filter for all input SNPs
cell_total (integer) - total number of cells after filtering
SNP_total (integer) - total number of SNPs after filtering
AD_filtered (np.array with shape (cell_total, SNP_total)) - AD matrix after filtering
DP_filtered (np.array with shape (cell_total, SNP_total)) - DP matrix after filtering
AF_filtered (torch.tensor with shape (cell_total, SNP_total)) - AF matrix which is the input to VAE
VCF_filtered (pd.DataFrame) - VCF after filtering which contains variant names
After running SNP_VAE.training():
model (VAE_normalized) - trained VAE model implemented in PyTorch
latent (np.array with shape (cell_total, z_dim)) - latent factors of all cells after filtering
pc (np.array with shape (cell_total, z_dim)) - principal components of PCA of the latent space
embedding_2d (np.array with shape (cell_total, 2)) - 2D UMAP embedding of the latent space
embedding_3d (np.array with shape (cell_total, 3)) - 3D UMAP embedding of the latent space
After running SNP_VAE.clustering() and SNP_VAE.phylogeny():
cluster_no (integer) - total number of clusters
assigned_label (np.array of integers with shape (cell_total)) - assigned cluster labels of all cells after filtering
clusters (list of np.arrays with length (cluster_no)) - np.where(assigned_label == r) for each cluster r
colors (np.array with shape (cluster_no, 4)) - colors of all clusters in figures
edge (list of tuples with length (cluster_no - 1)) - all connected edges in the phylogenetic tree
f_stat (np.array with shape (SNP_total)) - F-statistics of all SNPs after filtering
p_value (np.array with shape (SNP_total)) - P-values of all SNPs after filtering
rank_SNP (np.array with shape (SNP_total)) - ranking of SNPs from the lowest p-value
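The relationship between assigned_label, clusters, p_value and rank_SNP can be illustrated with toy values. The sketch below uses plain Python lists in place of np.arrays and makes two assumptions: cluster labels run from 0 to cluster_no - 1, and rank_SNP holds SNP indices ordered by ascending p-value (the equivalent of np.argsort). The values are made up for illustration.

```python
# Toy cluster labels for 5 cells in 3 clusters.
assigned_label = [0, 1, 0, 2, 1]
cluster_no = 3

# Pure-Python counterpart of np.where(assigned_label == r) for each cluster r.
clusters = [
    [i for i, label in enumerate(assigned_label) if label == r]
    for r in range(cluster_no)
]
print(clusters)  # [[0, 2], [1, 4], [3]]

# Toy p-values for 4 SNPs; rank SNP indices from the lowest p-value.
p_value = [0.20, 0.001, 0.05, 0.5]
rank_SNP = sorted(range(len(p_value)), key=lambda i: p_value[i])
print(rank_SNP)  # [1, 2, 0, 3]
```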