swnn.utils.process module

@CreateDate: 2020/07/18 @Author: Xingyan Liu @File: process.py @Project: stagewiseNN

swnn.utils.process.check_dirs(path)
swnn.utils.process.reverse_dict(d)

The values of the dict must be list-like.
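A minimal sketch of what inverting such a dict could look like, assuming each element of a value list becomes a key that maps back to its original key. The helper name `reverse_dict_sketch` is hypothetical. How the actual `reverse_dict` treats an element that appears under several keys is not documented here; in this sketch the last key wins.

```python
def reverse_dict_sketch(d):
    """Map every element of each list-like value back to its key (illustrative only)."""
    reversed_d = {}
    for key, values in d.items():
        for v in values:
            # Assumption: if an element occurs under several keys, the last one wins.
            reversed_d[v] = key
    return reversed_d


# Hypothetical marker lists per cell type.
marker_to_type = reverse_dict_sketch({"T cell": ["CD3D", "CD3E"], "B cell": ["CD79A"]})
print(marker_to_type)
```

A plain string value would be iterated character by character, which is presumably why list-like values are required.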

swnn.utils.process.describe_dataframe(df, **kwargs)
swnn.utils.process.describe_series(srs, max_cats=100, asstr=False)

Inspect the data structure.

swnn.utils.process.make_binary(mat)
swnn.utils.process.set_adata_hvgs(adata, gene_list=None, indicator=None, slim=True, copy=False)

Set the given (possibly pre-computed) set of genes as highly variable. If copy is False, the changes are made to the input adata in place. If slim is True and adata.raw is None, the raw data is backed up.

swnn.utils.process.change_names(seq, mapping=None, **kwmaps)
Return type

list

swnn.utils.process.normalize_default(adata, target_sum=None, copy=False, log_only=False)

Normalizing datasets with default settings (total-counts normalization followed by log(x+1) transform).

Parameters
  • adata – AnnData object

  • target_sum – scale factor of total-count normalization

  • copy – whether to copy the dataset

  • log_only – whether to skip the “total-counts normalization” and only perform log(x+1) transform

Returns

Return type

AnnData or None
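The two steps can be illustrated on plain Python lists. This is only a sketch of the arithmetic, not the library's implementation: the name `normalize_default_sketch` is hypothetical, rows stand for cells, and falling back to the median of the per-cell totals when `target_sum` is None is an assumption borrowed from the description of `quick_preprocess_raw`.

```python
import math
from statistics import median


def normalize_default_sketch(counts, target_sum=None, log_only=False):
    """Total-count normalization followed by log(x + 1), on a list of per-cell rows."""
    rows = [[float(x) for x in row] for row in counts]
    if not log_only:
        totals = [sum(row) for row in rows]
        # Assumption: when target_sum is None, use the median of the per-cell totals.
        target = median(totals) if target_sum is None else target_sum
        rows = [
            [x * target / total if total > 0 else 0.0 for x in row]
            for row, total in zip(rows, totals)
        ]
    return [[math.log1p(x) for x in row] for row in rows]


# Two hypothetical cells with the same composition but different sequencing depth.
counts = [[1, 3, 0], [10, 30, 0]]
normalized = normalize_default_sketch(counts, target_sum=100)
```

After normalization both cells carry the same values, since they differ only in depth. With `log_only=True` the depth difference would remain.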

swnn.utils.process.normalize_log_then_total(adata, target_sum=None, copy=False)

For SplitSeq data, performing log(x+1) BEFORE total-sum normalization will result in a better UMAP visualization (e.g. clusters will be less confounded by differences in total counts).
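The reversed order can be sketched as follows. The name `log_then_total_sketch` and the explicit default `target_sum=1.0` are assumptions made for illustration; what the real function does when `target_sum` is None is not stated here.

```python
import math


def log_then_total_sketch(counts, target_sum=1.0):
    """Apply log(x + 1) first, then rescale each row so it sums to target_sum."""
    logged = [[math.log1p(x) for x in row] for row in counts]
    out = []
    for row in logged:
        total = sum(row)
        out.append([x * target_sum / total if total > 0 else 0.0 for x in row])
    return out


# Two hypothetical cells whose raw depths differ 100-fold.
shallow, deep = log_then_total_sketch([[2, 1, 1], [200, 100, 100]])
print(sum(shallow), sum(deep))
```

Because the rescaling happens last, every cell ends up with the same total in log space, whatever its raw depth.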

swnn.utils.process.groupwise_hvgs_freq(adata, groupby='batch', return_hvgs=True, **hvg_kwds)

Separately compute highly variable genes (HVGs) for each group, and count the frequencies of genes being selected as HVGs among those groups.

Parameters
  • adata – the AnnData object

  • groupby – a column name in adata.obs specifying the batches or groups for which HVGs are computed independently.

  • return_hvgs (bool) – whether to return the computed dict of HVG-lists for each group

  • hvg_kwds – other keyword parameters for sc.pp.highly_variable_genes

Returns

  • hvg_freq (dict) – the HVG frequencies

  • hvg_dict (dict) – returned only if return_hvgs is True
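The frequency-counting step can be sketched with `collections.Counter`, assuming the per-group HVG lists are already available. The helper name and the gene and batch names below are hypothetical, and the HVG selection itself (done by sc.pp.highly_variable_genes) is not reproduced.

```python
from collections import Counter


def hvg_frequency_sketch(hvg_dict):
    """Count in how many groups each gene was selected as an HVG."""
    freq = Counter()
    for genes in hvg_dict.values():
        # set() guards against a gene being listed twice within one group.
        freq.update(set(genes))
    return dict(freq)


# Hypothetical per-batch HVG lists.
hvg_dict = {
    "batch1": ["GeneA", "GeneB", "GeneC"],
    "batch2": ["GeneB", "GeneC"],
    "batch3": ["GeneC", "GeneD"],
}
hvg_freq = hvg_frequency_sketch(hvg_dict)
print(hvg_freq)
```

Genes with a high count were picked in many groups, so they are less likely to reflect the variability of a single batch.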

swnn.utils.process.take_high_freq_elements(freq, min_freq=3)
swnn.utils.process.set_precomputed_neighbors(adata, distances, connectivities, n_neighbors=15, metric='cosine', method='umap', metric_kwds=None, use_rep=None, n_pcs=None, key_added=None)
swnn.utils.process.quick_preprocess_raw(adata, target_sum=None, hvgs=None, batch_key=None, copy=True, log_first=False, **hvg_kwds)

Run the data-analysis pipeline: normalization, HVG selection, and z-scoring (centering and scaling).

Parameters
  • adata (AnnData) – the AnnData object

  • target_sum (Optional[int]) – the target total counts after normalization. If None, after normalization, each observation (cell) has a total count equal to the median of total counts for observations (cells) before normalization.

  • hvgs (Optional[Sequence]) – highly variable genes to be used for dimensionality reduction (centering and PCA)

  • batch_key – a column name in adata.obs specifying the batch labels

  • copy – whether to make a copy of the input data. If False, the data object is changed in place.

  • log_first (bool) – for some data distributions, performing log(x+1) before total-count normalization might give a better result (e.g. clustering results may be less affected by sequencing depth)

  • hvg_kwds – other keyword parameters for sc.pp.highly_variable_genes

Return type

AnnData
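The final z-scoring step (centering and scaling each gene) can be illustrated on plain lists. This is a sketch under stated assumptions, not the pipeline's code: the helper name is hypothetical, and it uses the population standard deviation, whereas a given library may use the sample standard deviation or clip extreme values.

```python
from statistics import mean, pstdev


def zscore_columns_sketch(rows):
    """Center each column (gene) to mean 0 and scale it to unit standard deviation."""
    scaled_columns = []
    for col in zip(*rows):
        mu = mean(col)
        sd = pstdev(col)
        # Constant columns have sd == 0; they are set to zero instead of dividing.
        scaled_columns.append([(x - mu) / sd if sd > 0 else 0.0 for x in col])
    return [list(row) for row in zip(*scaled_columns)]


# Three hypothetical cells (rows) by two genes (columns); the second gene is constant.
expr = [[1.0, 5.0], [2.0, 5.0], [3.0, 5.0]]
scaled = zscore_columns_sketch(expr)
```

After this step every non-constant gene contributes on a comparable scale to a downstream PCA.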

swnn.utils.process.label_binarize_each(labels, classes, sparse_out=True)
swnn.utils.process.group_mean(X, labels, binary=False, classes=None, features=None, print_groups=True)

This function may be more efficient than df.groupby().mean() when handling sparse matrices.

Parameters
  • X (shape (n_samples, n_features)) –

  • labels (shape (n_samples, )) –

  • classes (optional) – names of groups

  • features (optional) – names of features

  • print_groups (bool) – whether to inspect the groups
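One reason a dedicated routine can be cheaper on sparse input is that it only needs to visit the stored non-zero entries. The sketch below illustrates that idea with a toy sparse format (one `{feature_index: value}` dict per sample). The helper name and the data layout are assumptions for illustration and do not reflect how `group_mean` is implemented.

```python
from collections import Counter


def group_mean_sparse_sketch(sparse_rows, labels, n_features):
    """Per-group feature means, touching only the stored non-zero entries."""
    sizes = Counter(labels)
    sums = {g: [0.0] * n_features for g in sizes}
    for row, label in zip(sparse_rows, labels):
        for j, value in row.items():
            sums[label][j] += value
    # Divide by the full group size so that implicit zeros count towards the mean.
    return {g: [s / sizes[g] for s in sums[g]] for g in sizes}


# Three hypothetical samples with two features, in groups "a", "a" and "b".
sparse_rows = [{0: 2.0}, {0: 4.0, 1: 1.0}, {1: 3.0}]
means = group_mean_sparse_sketch(sparse_rows, ["a", "a", "b"], n_features=2)
print(means)
```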

swnn.utils.process.group_mean_dense(X, labels, binary=False, index_name='group', classes=None)
swnn.utils.process.group_median_dense(X, labels, binary=False, index_name='group', classes=None)
swnn.utils.process.group_mean_adata(adata, groupby, features=None, binary=False, use_raw=False)

Compute averaged feature-values for each group

Parameters
  • adata (AnnData) –

  • groupby (str) – a column name in adata.obs

  • features – a subset of names in adata.var_names (or adata.raw.var_names)

  • binary (bool) – if True, the results are the non-zero proportions for all (or the given) features

  • use_raw (bool) – whether to access adata.raw to compute the averages.

Returns

Return type

a pd.DataFrame with features as index and groups as columns
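What the binary option computes can be sketched without AnnData or pandas. The helper below is hypothetical and returns a nested dict laid out like the documented result (features as the outer keys, groups as the inner keys). The gene and group names are made up.

```python
def nonzero_proportions_sketch(rows, groups, feature_names):
    """Fraction of samples per group with a non-zero value, as {feature: {group: p}}."""
    group_names = sorted(set(groups))
    sizes = {g: groups.count(g) for g in group_names}
    counts = {f: {g: 0 for g in group_names} for f in feature_names}
    for row, g in zip(rows, groups):
        for f, value in zip(feature_names, row):
            if value != 0:
                counts[f][g] += 1
    return {
        f: {g: counts[f][g] / sizes[g] for g in group_names}
        for f in feature_names
    }


# Three hypothetical cells (rows) by two genes (columns).
rows = [[0, 2], [1, 0], [3, 0]]
groups = ["ctrl", "ctrl", "stim"]
props = nonzero_proportions_sketch(rows, groups, ["GeneA", "GeneB"])
print(props)
```

For expression data, such a proportion reads as the fraction of cells in each group in which the gene is detected.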