corelay.processor.embedding

A module that contains processors for embedding algorithms.

Module Attributes

UMAP

Uniform Manifold Approximation and Projection

Classes

EigenDecomposition

A spectral embedding Processor that performs eigenvalue decomposition.

Embedding

The abstract base class for embedding processors.

LLEEmbedding

An embedding Processor that uses the locally linear embedding (LLE) algorithm to reduce the dimensionality of the input data.

PCAEmbedding

An embedding Processor that uses the principal component analysis (PCA) algorithm to reduce the dimensionality of the input data.

TSNEEmbedding

An embedding Processor that uses the t-SNE algorithm to reduce the dimensionality of the input data.

UMAPEmbedding

An embedding Processor that uses the Uniform Manifold Approximation and Projection (UMAP) algorithm to reduce the dimensionality of the input data.

class corelay.processor.embedding.UMAP[source]

Bases: BaseEstimator, ClassNamePrefixFeaturesOutMixin

Performs the Uniform Manifold Approximation and Projection (UMAP) dimensionality reduction algorithm, which will find a low dimensional embedding of the data that approximates an underlying manifold.

Note

Since the UMAP library is an optional dependency of CoRelAy, it is imported using the corelay.utils.import_or_stub() function, which tries to import the module/type/function specified. If the import fails, it returns a stub instead, which will raise an exception when used. The exception message will tell users how to install the missing dependencies for the functionality to work.

Returns:

Returns a UMAP estimator, which can be fitted to the data.

Return type:

sklearn.base.TransformerMixin
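The optional-import behaviour described in the note above can be illustrated with a minimal sketch. It is not the actual implementation of corelay.utils.import_or_stub(): the function name import_or_stub_sketch, its parameters, and the stub class are hypothetical and only show the general pattern of deferring the failure until the missing dependency is used.

```python
import importlib


def import_or_stub_sketch(module_name, attribute=None, hint=""):
    """Return the requested module or attribute, or a stub that fails on use.

    Hypothetical helper for illustration only, not the CoRelAy function.
    """
    try:
        module = importlib.import_module(module_name)
        return getattr(module, attribute) if attribute else module
    except (ImportError, AttributeError) as error:
        message = f"'{module_name}' is required for this functionality. {hint}".strip()
        cause = error  # keep a reference; 'error' is cleared when the except block ends

        class MissingDependencyStub:
            """Placeholder that raises as soon as it is instantiated."""

            def __init__(self, *args, **kwargs):
                raise RuntimeError(message) from cause

        return MissingDependencyStub


# An available import is returned unchanged.
sqrt = import_or_stub_sketch("math", "sqrt")

# A missing import yields a stub; the error only surfaces when the stub is used.
Missing = import_or_stub_sketch(
    "a_module_that_is_not_installed_xyz", "Thing", hint="Install the optional dependency."
)
```

With this pattern, importing the embedding module never fails because of a missing optional package; only instantiating the stub does.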

__init__(n_neighbors=15, n_components=2, metric='euclidean', metric_kwds=None, output_metric='euclidean', output_metric_kwds=None, n_epochs=None, learning_rate=1.0, init='spectral', min_dist=0.1, spread=1.0, low_memory=True, n_jobs=-1, set_op_mix_ratio=1.0, local_connectivity=1.0, repulsion_strength=1.0, negative_sample_rate=5, transform_queue_size=4.0, a=None, b=None, random_state=None, angular_rp_forest=False, target_n_neighbors=-1, target_metric='categorical', target_metric_kwds=None, target_weight=0.5, transform_seed=42, transform_mode='embedding', force_approximation_algorithm=False, verbose=False, tqdm_kwds=None, unique=False, densmap=False, dens_lambda=2.0, dens_frac=0.3, dens_var_shift=0.1, output_dens=False, disconnection_distance=None, precomputed_knn=(None, None, None))[source]
__repr__()[source]

Return repr(self).

fit(X, y=None, ensure_all_finite=True, **kwargs)[source]

Fit X into an embedded space.

Optionally use y for supervised dimension reduction.

Parameters:
  • X (array, shape (n_samples, n_features) or (n_samples, n_samples)) – If the metric is ‘precomputed’ X must be a square distance matrix. Otherwise it contains a sample per row. If the method is ‘exact’, X may be a sparse matrix of type ‘csr’, ‘csc’ or ‘coo’.

  • y (array, shape (n_samples)) – A target array for supervised dimension reduction. How this is handled is determined by parameters UMAP was instantiated with. The relevant attributes are target_metric and target_metric_kwds.

  • ensure_all_finite (bool or ‘allow-nan’) – Whether to raise an error on np.inf, np.nan, pd.NA in array. The possibilities are:

    • True: Force all values of array to be finite.

    • False: accepts np.inf, np.nan, pd.NA in array.

    • ‘allow-nan’: accepts only np.nan and pd.NA values in array. Values cannot be infinite.

  • **kwargs (optional) – Any additional keyword arguments are passed to _fit_embed_data.
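The three ensure_all_finite modes listed above can be sketched as a small validation routine. This is an illustrative stand-in that works on a flat sequence of floats; the name check_finite_sketch is hypothetical, and the real validation is performed by the underlying libraries on arrays.

```python
import math


def check_finite_sketch(values, ensure_all_finite=True):
    """Validate a flat sequence of floats according to the three modes.

    Hypothetical helper for illustration only.
    """
    if ensure_all_finite is False:
        # Accept everything, including NaN and infinity.
        return
    for value in values:
        if math.isinf(value):
            # Infinity is rejected by both True and 'allow-nan'.
            raise ValueError("Input contains infinity.")
        if math.isnan(value) and ensure_all_finite != "allow-nan":
            # NaN is rejected only when all values must be finite.
            raise ValueError("Input contains NaN.")


check_finite_sketch([1.0, 2.0])                                        # passes
check_finite_sketch([1.0, float("nan")], ensure_all_finite="allow-nan")  # passes
check_finite_sketch([float("inf")], ensure_all_finite=False)            # passes
```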

fit_transform(X, y=None, ensure_all_finite=True, **kwargs)[source]

Fit X into an embedded space and return that transformed output.

Parameters:
  • X (array, shape (n_samples, n_features) or (n_samples, n_samples)) – If the metric is ‘precomputed’ X must be a square distance matrix. Otherwise it contains a sample per row.

  • y (array, shape (n_samples)) – A target array for supervised dimension reduction. How this is handled is determined by parameters UMAP was instantiated with. The relevant attributes are target_metric and target_metric_kwds.

  • ensure_all_finite (bool or ‘allow-nan’) – Whether to raise an error on np.inf, np.nan, pd.NA in array. The possibilities are:

    • True: Force all values of array to be finite.

    • False: accepts np.inf, np.nan, pd.NA in array.

    • ‘allow-nan’: accepts only np.nan and pd.NA values in array. Values cannot be infinite.

  • **kwargs (optional) – Any additional keyword arguments are passed to _fit_embed_data.

Returns:

  • X_new (array, shape (n_samples, n_components)) – Embedding of the training data in low-dimensional space. If the output_dens flag is set, a tuple (X_new, r_orig, r_emb) is returned instead, which additionally includes:

  • r_orig (array, shape (n_samples)) – Local radii of data points in the original data space (log-transformed).

  • r_emb (array, shape (n_samples)) – Local radii of data points in the embedding (log-transformed).

inverse_transform(X)[source]

Transform X in the existing embedded space back into the input data space and return that transformed output.

Parameters:

X (array, shape (n_samples, n_components)) – New points to be inverse transformed.

Returns:

X_new – Generated data points in data space.

Return type:

array, shape (n_samples, n_features)

set_fit_request(*, ensure_all_finite: bool | None | str = '$UNCHANGED$') → UMAP[source]

Request metadata passed to the fit method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • ensure_all_finite (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for ensure_all_finite parameter in fit.

  • self (UMAP)

Returns:

self – The updated object.

Return type:

object

set_transform_request(*, ensure_all_finite: bool | None | str = '$UNCHANGED$') → UMAP[source]

Request metadata passed to the transform method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:
  • ensure_all_finite (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for ensure_all_finite parameter in transform.

  • self (UMAP)

Returns:

self – The updated object.

Return type:

object

transform(X, ensure_all_finite=True)[source]

Transform X into the existing embedded space and return that transformed output.

Parameters:
  • X (array, shape (n_samples, n_features)) – New data to be transformed.

  • ensure_all_finite (bool or ‘allow-nan’) – Whether to raise an error on np.inf, np.nan, pd.NA in array. The possibilities are:

    • True: Force all values of array to be finite.

    • False: accepts np.inf, np.nan, pd.NA in array.

    • ‘allow-nan’: accepts only np.nan and pd.NA values in array. Values cannot be infinite.

Returns:

X_new – Embedding of the new data in low-dimensional space.

Return type:

array, shape (n_samples, n_components)

class corelay.processor.embedding.Embedding[source]

Bases: Processor

The abstract base class for embedding processors.

Parameters:
  • is_output (bool) – A value indicating whether this Embedding processor is the output of a Pipeline. Defaults to False.

  • is_checkpoint (bool | None) – A value indicating whether check-pointed pipeline computations should start at this point, if there exists a previously computed checkpoint value. Defaults to False.

  • io (Storable | None) – An IO object that is used to cache intermediate results of the Pipeline, which can then be re-used in this run or in subsequent runs of the Pipeline. Defaults to an instance of NoStorage.

  • kwargs (dict[str, Any]) – Additional keyword arguments for the embedding algorithm. Defaults to an empty dict.

kwargs: Annotated[dict[str, Any], Param]

Additional keyword arguments to pass to the embedding function.

Return type:

Plug

__tracked__: collections.OrderedDict[str, Any]

A collections.OrderedDict with all public class attributes, i.e., all class attributes not enclosed with double underscores.

class corelay.processor.embedding.EigenDecomposition[source]

Bases: Embedding

A spectral embedding Processor that performs eigenvalue decomposition.

Parameters:
  • is_output (bool) – A value indicating whether this EigenDecomposition embedding processor is the output of a Pipeline. Defaults to False.

  • is_checkpoint (bool | None) – A value indicating whether check-pointed pipeline computations should start at this point, if there exists a previously computed checkpoint value. Defaults to False.

  • io (Storable | None) – An IO object that is used to cache intermediate results of the Pipeline, which can then be re-used in this run or in subsequent runs of the Pipeline. Defaults to an instance of NoStorage.

  • kwargs (dict[str, Any]) – Additional keyword arguments for the eigenvalue decomposition embedding algorithm. Defaults to an empty dict.

  • n_eigval (int) – The number of eigenvalues and eigenvectors to compute. Defaults to 32.

  • which (str) – The type of eigenvalues and eigenvectors to compute. Defaults to “LM” (largest in magnitude).

  • normalize (bool) – A value indicating whether to normalize the eigenvectors. Defaults to True.

n_eigval: Annotated[int, Param]

The number of eigenvalues and eigenvectors to compute. Defaults to 32.

Return type:

Plug

which: Annotated[str, Param]

The type of eigenvalues and eigenvectors to compute. The options are:

  • “LM”: Largest (in magnitude) eigenvalues.

  • “SM”: Smallest (in magnitude) eigenvalues.

  • “LA”: Largest (algebraic) eigenvalues.

  • “SA”: Smallest (algebraic) eigenvalues.

  • “BE”: Half (k/2) from each end of the spectrum.

Defaults to “LM” (largest in magnitude).

Note

If the input is a complex Hermitian matrix, ‘BE’ is invalid.

Return type:

Plug
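To make the five selection modes of which concrete, the following sketch picks k values from an already computed list of real eigenvalues. The function select_eigenvalues is hypothetical and is not how the processor works internally (an iterative solver never computes the full spectrum first). For “BE” with odd k, the sketch assumes that the extra value is taken from the high end of the spectrum.

```python
def select_eigenvalues(values, k, which="LM"):
    """Pick k eigenvalues from a full real spectrum according to 'which'.

    Hypothetical helper for illustration only. Returns the selection in
    ascending order.
    """
    if k < 1:
        raise ValueError("k must be at least 1.")
    ordered = sorted(values)
    if which == "LM":
        chosen = sorted(values, key=abs, reverse=True)[:k]  # largest magnitude
    elif which == "SM":
        chosen = sorted(values, key=abs)[:k]                # smallest magnitude
    elif which == "LA":
        chosen = ordered[-k:]                               # largest algebraic
    elif which == "SA":
        chosen = ordered[:k]                                # smallest algebraic
    elif which == "BE":
        low = k // 2                                        # from the low end
        high = k - low                                      # from the high end
        chosen = ordered[:low] + ordered[-high:]
    else:
        raise ValueError(f"Invalid eigenvalue type: {which!r}")
    return sorted(chosen)


spectrum = [-3.0, -1.0, 0.5, 2.0, 4.0]
largest_magnitude = select_eigenvalues(spectrum, 2, "LM")  # [-3.0, 4.0]
both_ends = select_eigenvalues(spectrum, 3, "BE")          # [-3.0, 2.0, 4.0]
```

The example spectrum shows why “LM” and “LA” differ: the value -3.0 is large in magnitude but small algebraically.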

normalize: Annotated[bool, Param]

A value indicating whether to normalize the eigenvectors. Defaults to True.

Return type:

Plug

function(data: Any) → Any[source]

Computes the spectral embedding of the input data using eigenvalue decomposition.

Note

We use the fact that (I-A)v = (1-λ)v and thus compute the largest eigenvalues of the identity minus the data and return one minus the eigenvalue.

Parameters:

data (Any) – The input data to be embedded. The data should be a NumPy array of shape (number_of_samples, number_of_features).

Raises:

ValueError – The eigenvalue and eigenvector type is not valid.

Returns:

Returns a tuple containing the eigenvalues and eigenvectors of the input data.

Return type:

Any
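The identity from the note above, (I-A)v = (1-λ)v, can be checked numerically. The sketch below is not the processor's implementation: it assumes a small symmetric matrix, uses a dense solver (np.linalg.eigvalsh), and orders eigenvalues algebraically rather than by magnitude. It shows that the k largest eigenvalues of I-A, mapped through one minus the value, are the k smallest eigenvalues of A.

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a small symmetric matrix A (assumption: the input is symmetric).
raw = rng.random((6, 6))
matrix = (raw + raw.T) / 2
k = 3

# Eigenvalues of I - A in ascending order; take the k largest.
shifted = np.linalg.eigvalsh(np.eye(6) - matrix)
largest_shifted = shifted[-k:]

# Map back: if (I - A)v = mu * v, then A v = (1 - mu) * v.
recovered = 1.0 - largest_shifted

# Reference: the k smallest eigenvalues of A computed directly.
smallest_direct = np.linalg.eigvalsh(matrix)[:k]

print(np.allclose(np.sort(recovered), smallest_direct))
```

Solving for the largest eigenvalues of the shifted matrix is the typical reason for this trick, since iterative solvers tend to converge faster at the large end of the spectrum.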

__tracked__: collections.OrderedDict[str, Any]

A collections.OrderedDict with all public class attributes, i.e., all class attributes not enclosed with double underscores.

class corelay.processor.embedding.TSNEEmbedding[source]

Bases: Embedding

An embedding Processor that uses the t-SNE algorithm to reduce the dimensionality of the input data.

Parameters:
  • is_output (bool) – A value indicating whether this TSNEEmbedding embedding processor is the output of a Pipeline. Defaults to False.

  • is_checkpoint (bool | None) – A value indicating whether check-pointed pipeline computations should start at this point, if there exists a previously computed checkpoint value. Defaults to False.

  • io (Storable | None) – An IO object that is used to cache intermediate results of the Pipeline, which can then be re-used in this run or in subsequent runs of the Pipeline. Defaults to an instance of NoStorage.

  • kwargs (dict[str, Any]) – Additional keyword arguments for the t-SNE embedding algorithm. Defaults to an empty dict.

  • n_components (int) – The number of dimensions to reduce the data to. Defaults to 2.

  • metric (str) – The distance metric to use. Defaults to “euclidean”.

  • perplexity (float) – The perplexity parameter for the t-SNE algorithm. Defaults to 30.

  • early_exaggeration (float) – The early exaggeration parameter for the t-SNE algorithm. Defaults to 12.

n_components: Annotated[int, Param]

The number of dimensions to reduce the data to. Defaults to 2.

Return type:

Plug

metric: Annotated[str, Param]

The distance metric to use. Can be one of

  • “braycurtis”

  • “canberra”

  • “chebychev”, “chebyshev”, “cheby”, “cheb”, “ch”

  • “cityblock”, “cblock”, “cb”, “c”

  • “correlation”, “co”

  • “cosine”, “cos”

  • “dice”

  • “euclidean”, “euclid”, “eu”, “e”

  • “hamming”, “hamm”, “ha”, “h”

  • “minkowski”, “mi”, “m”

  • “pnorm”

  • “jaccard”, “jacc”, “ja”, “j”

  • “jensenshannon”, “js”

  • “kulczynski1”

  • “mahalanobis”, “mahal”, “mah”

  • “rogerstanimoto”

  • “russellrao”

  • “seuclidean”, “se”, “s”

  • “sokalmichener”

  • “sokalsneath”

  • “sqeuclidean”, “sqe”, “sqeuclid”

  • “yule”

Defaults to “euclidean”.

Return type:

Plug

perplexity: Annotated[float, Param]

The perplexity parameter for the t-SNE algorithm. Defaults to 30.

Return type:

Plug

early_exaggeration: Annotated[float, Param]

The early exaggeration parameter for the t-SNE algorithm. Defaults to 12.

Return type:

Plug

function(data: Any) → Any[source]

Computes the t-SNE embedding of the input data.

Parameters:

data (Any) – The input data to be embedded. The data should be a NumPy array of shape (number_of_samples, number_of_features) or (number_of_samples, number_of_samples).

Returns:

Returns the t-SNE embedding of the input data as a NumPy array of shape (number_of_samples, number_of_components).

Return type:

Any

__tracked__: collections.OrderedDict[str, Any]

A collections.OrderedDict with all public class attributes, i.e., all class attributes not enclosed with double underscores.

class corelay.processor.embedding.PCAEmbedding[source]

Bases: Embedding

An embedding Processor that uses the principal component analysis (PCA) algorithm to reduce the dimensionality of the input data.

Parameters:
  • is_output (bool) – A value indicating whether this PCAEmbedding embedding processor is the output of a Pipeline. Defaults to False.

  • is_checkpoint (bool | None) – A value indicating whether check-pointed pipeline computations should start at this point, if there exists a previously computed checkpoint value. Defaults to False.

  • io (Storable | None) – An IO object that is used to cache intermediate results of the Pipeline, which can then be re-used in this run or in subsequent runs of the Pipeline. Defaults to an instance of NoStorage.

  • kwargs (dict[str, Any]) – Additional keyword arguments for the PCA embedding algorithm. Defaults to an empty dict.

  • n_components (int) – The number of dimensions to reduce the data to. Defaults to 2.

  • whiten (bool) – A value indicating whether to whiten the data. Defaults to False.

n_components: Annotated[int, Param]

The number of dimensions to reduce the data to. Defaults to 2.

Return type:

Plug

__tracked__: collections.OrderedDict[str, Any]

A collections.OrderedDict with all public class attributes, i.e., all class attributes not enclosed with double underscores.

whiten: Annotated[bool, Param]

A value indicating whether to whiten the data. Defaults to False.

Return type:

Plug

function(data: Any) → Any[source]

Computes the PCA embedding of the input data.

Parameters:

data (Any) – The input data to be embedded. The data should be a NumPy array of shape (number_of_samples, number_of_features).

Returns:

Returns the PCA embedding of the input data as a NumPy array of shape (number_of_samples, number_of_components).

Return type:

Any
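What n_components and whiten do can be illustrated with a minimal PCA written directly on top of the singular value decomposition. This is a sketch under stated assumptions, not the processor's implementation: the function pca_sketch is hypothetical, and whitening is taken to mean rescaling each projected component to unit sample variance.

```python
import numpy as np


def pca_sketch(data, n_components=2, whiten=False):
    """Project data onto its first principal components via SVD.

    Hypothetical helper for illustration only.
    """
    centered = data - data.mean(axis=0)
    # Rows of 'components' are the principal axes, sorted by explained variance.
    _, singular_values, components = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ components[:n_components].T
    if whiten:
        # Each component has standard deviation s / sqrt(n_samples - 1);
        # dividing by it gives unit sample variance per component.
        scale = singular_values[:n_components] / np.sqrt(len(data) - 1)
        projected = projected / scale
    return projected


rng = np.random.default_rng(0)
samples = rng.normal(size=(50, 5))

embedding = pca_sketch(samples, n_components=2)
whitened = pca_sketch(samples, n_components=2, whiten=True)
print(embedding.shape)                 # (50, 2)
print(whitened.var(axis=0, ddof=1))    # approximately [1. 1.]
```

Without whitening, the first component carries the most variance; with whitening, all returned components are put on the same scale.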

class corelay.processor.embedding.LLEEmbedding[source]

Bases: Embedding

An embedding Processor that uses the locally linear embedding (LLE) algorithm to reduce the dimensionality of the input data.

Parameters:
  • is_output (bool) – A value indicating whether this LLEEmbedding embedding processor is the output of a Pipeline. Defaults to False.

  • is_checkpoint (bool | None) – A value indicating whether check-pointed pipeline computations should start at this point, if there exists a previously computed checkpoint value. Defaults to False.

  • io (Storable | None) – An IO object that is used to cache intermediate results of the Pipeline, which can then be re-used in this run or in subsequent runs of the Pipeline. Defaults to an instance of NoStorage.

  • kwargs (dict[str, Any]) – Additional keyword arguments for the LLE embedding algorithm. Defaults to an empty dict.

  • n_components (int) – The number of dimensions to reduce the data to. Defaults to 2.

  • n_neighbors (int) – The number of neighbors to use for the LLE algorithm. Defaults to 5.

__tracked__: collections.OrderedDict[str, Any]

A collections.OrderedDict with all public class attributes, i.e., all class attributes not enclosed with double underscores.

n_components: Annotated[int, Param]

The number of dimensions to reduce the data to. Defaults to 2.

Return type:

Plug

n_neighbors: Annotated[int, Param]

The number of neighbors to use for the LLE algorithm. Defaults to 5.

Return type:

Plug

function(data: Any) → Any[source]

Computes the LLE embedding of the input data.

Parameters:

data (Any) – The input data to be embedded. The data should be a NumPy array of shape (number_of_samples, number_of_features).

Returns:

Returns the LLE embedding of the input data as a NumPy array of shape (number_of_samples, number_of_components).

Return type:

Any

class corelay.processor.embedding.UMAPEmbedding[source]

Bases: Embedding

An embedding Processor that uses the Uniform Manifold Approximation and Projection (UMAP) algorithm to reduce the dimensionality of the input data.

Parameters:
  • is_output (bool) – A value indicating whether this UMAPEmbedding embedding processor is the output of a Pipeline. Defaults to False.

  • is_checkpoint (bool | None) – A value indicating whether check-pointed pipeline computations should start at this point, if there exists a previously computed checkpoint value. Defaults to False.

  • io (Storable | None) – An IO object that is used to cache intermediate results of the Pipeline, which can then be re-used in this run or in subsequent runs of the Pipeline. Defaults to an instance of NoStorage.

  • kwargs (dict[str, Any]) – Additional keyword arguments for the UMAP embedding algorithm. Defaults to an empty dict.

  • n_neighbors (int) – The number of neighbors to use for the UMAP algorithm. Defaults to 15.

  • min_dist (float) – The minimum distance between points in the UMAP algorithm. Defaults to 0.1.

  • metric (str) – The distance metric to use for the UMAP algorithm. Defaults to “correlation”.

__tracked__: collections.OrderedDict[str, Any]

A collections.OrderedDict with all public class attributes, i.e., all class attributes not enclosed with double underscores.

n_neighbors: Annotated[int, Param]

The number of neighbors to use for the UMAP algorithm. Defaults to 15.

Return type:

Plug

min_dist: Annotated[float, Param]

The minimum distance between points in the UMAP algorithm. Defaults to 0.1.

Return type:

Plug

metric: Annotated[str, Param]

The distance metric to use for the UMAP algorithm. This can be one of “euclidean”, “manhattan”, “chebyshev”, “minkowski”, “canberra”, “braycurtis”, “mahalanobis”, “wminkowski”, “seuclidean”, “cosine”, “correlation”, “haversine”, “hamming”, “jaccard”, “dice”, “russelrao”, “kulsinski”, “ll_dirichlet”, “hellinger”, “rogerstanimoto”, “sokalmichener”, “sokalsneath”, or “yule”. Defaults to “correlation”.

Return type:

Plug
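Since “correlation” is the default metric here, a short sketch of what a correlation distance usually measures may help: one minus the Pearson correlation of two vectors. The function below is a plain-Python illustration of that textbook definition; it is not UMAP's implementation, which may handle edge cases such as constant vectors differently.

```python
import math


def correlation_distance(u, v):
    """Return 1 - Pearson correlation of two equally long sequences.

    Illustration only; undefined (division by zero) for constant vectors.
    """
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    centered_u = [x - mean_u for x in u]
    centered_v = [y - mean_v for y in v]
    numerator = sum(a * b for a, b in zip(centered_u, centered_v))
    denominator = math.sqrt(
        sum(a * a for a in centered_u) * sum(b * b for b in centered_v)
    )
    return 1.0 - numerator / denominator


same_trend = correlation_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])      # close to 0
opposite_trend = correlation_distance([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])  # close to 2
```

The distance ignores offset and scale: two samples are close when their features rise and fall together, regardless of their absolute magnitudes.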

function(data: Any) → Any[source]

Computes the UMAP embedding of the input data.

Note

For information on the UMAP algorithm, see the UMAP documentation.

Parameters:

data (Any) – The input data to be embedded. The data should be a NumPy array of shape (number_of_samples, number_of_features).

Returns:

Returns the UMAP embedding of the input data as a NumPy array of shape (number_of_samples, number_of_new_features).

Return type:

Any