corelay.processor.embedding
A module that contains processors for embedding algorithms.
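Module Attributes
UMAP | Uniform Manifold Approximation and Projection
Classes
EigenDecomposition | A spectral embedding Processor that performs eigenvalue decomposition.
Embedding | The abstract base class for embedding processors.
LLEEmbedding | An embedding Processor that uses the locally linear embedding (LLE) algorithm.
PCAEmbedding | An embedding Processor that uses the principal component analysis (PCA) algorithm.
TSNEEmbedding | An embedding Processor that uses the t-SNE algorithm.
UMAPEmbedding | An embedding Processor that uses the Uniform Manifold Approximation and Projection (UMAP) algorithm.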
- class corelay.processor.embedding.UMAP[source]
Bases:
BaseEstimator, ClassNamePrefixFeaturesOutMixin
Performs the Uniform Manifold Approximation and Projection (UMAP) dimensionality reduction algorithm, which finds a low-dimensional embedding of the data that approximates an underlying manifold.
Note
Since the UMAP library is an optional dependency of CoRelAy, it is imported using the corelay.utils.import_or_stub() function, which tries to import the specified module, type, or function. If the import fails, it returns a stub instead, which raises an exception when used. The exception message tells users how to install the missing dependencies for the functionality to work.
- Returns:
Returns a UMAP estimator, which can be used to fit the data.
- Return type:
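The optional-import behaviour described in the note above can be sketched with the standard library alone. The helper name `import_or_stub_sketch`, its parameters, and the error type are hypothetical and chosen for illustration; this is not the implementation of corelay.utils.import_or_stub(), only the general pattern of deferring the failure until the missing dependency is actually used.

```python
import importlib


def import_or_stub_sketch(module_name, install_hint):
    """Return the imported module, or a stub that raises on first use.

    Hypothetical helper for illustration only; it is not CoRelAy's
    implementation of corelay.utils.import_or_stub().
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        class _Stub:
            def _fail(self, *args, **kwargs):
                raise RuntimeError(
                    f"'{module_name}' is not installed. {install_hint}"
                )

            # Both attribute access and calls on the stub trigger the error.
            __getattr__ = _fail
            __call__ = _fail

        return _Stub()


# A module that exists is returned unchanged.
math_module = import_or_stub_sketch("math", "It ships with Python.")

# A missing module yields a stub; the failure is deferred until use.
missing = import_or_stub_sketch(
    "surely_not_an_installed_module", "Install the optional extra to use it."
)
try:
    missing.UMAP
except RuntimeError as error:
    message = str(error)
```

With this pattern, importing the embedding module never fails because of a missing optional dependency; only code paths that touch the stub do.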
- __init__(n_neighbors=15, n_components=2, metric='euclidean', metric_kwds=None, output_metric='euclidean', output_metric_kwds=None, n_epochs=None, learning_rate=1.0, init='spectral', min_dist=0.1, spread=1.0, low_memory=True, n_jobs=-1, set_op_mix_ratio=1.0, local_connectivity=1.0, repulsion_strength=1.0, negative_sample_rate=5, transform_queue_size=4.0, a=None, b=None, random_state=None, angular_rp_forest=False, target_n_neighbors=-1, target_metric='categorical', target_metric_kwds=None, target_weight=0.5, transform_seed=42, transform_mode='embedding', force_approximation_algorithm=False, verbose=False, tqdm_kwds=None, unique=False, densmap=False, dens_lambda=2.0, dens_frac=0.3, dens_var_shift=0.1, output_dens=False, disconnection_distance=None, precomputed_knn=(None, None, None))[source]
- fit(X, y=None, ensure_all_finite=True, **kwargs)[source]
Fit X into an embedded space.
Optionally use y for supervised dimension reduction.
- Parameters:
X (array, shape (n_samples, n_features) or (n_samples, n_samples)) – If the metric is ‘precomputed’ X must be a square distance matrix. Otherwise it contains a sample per row. If the method is ‘exact’, X may be a sparse matrix of type ‘csr’, ‘csc’ or ‘coo’.
y (array, shape (n_samples)) – A target array for supervised dimension reduction. How this is handled is determined by parameters UMAP was instantiated with. The relevant attributes are target_metric and target_metric_kwds.
ensure_all_finite – Whether to raise an error on np.inf, np.nan, pd.NA in array. The possibilities are:
- True: Force all values of array to be finite.
- False: accepts np.inf, np.nan, pd.NA in array.
- ‘allow-nan’: accepts only np.nan and pd.NA values in array. Values cannot be infinite.
**kwargs (optional) – Any additional keyword arguments are passed to _fit_embed_data.
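The three `ensure_all_finite` options listed above can be illustrated with a small NumPy check. The function `check_finite_sketch` is a hypothetical stand-in written for this example; it is not the validation code used by UMAP or scikit-learn, and it leaves out pd.NA handling to stay free of a pandas dependency.

```python
import numpy as np


def check_finite_sketch(X, ensure_all_finite=True):
    """Mimic the three documented ensure_all_finite options (illustrative only)."""
    X = np.asarray(X, dtype=float)
    if ensure_all_finite is False:
        return X  # inf and NaN are both accepted
    if ensure_all_finite == "allow-nan":
        if np.isinf(X).any():
            raise ValueError("Input contains infinity.")
        return X  # NaN is accepted, inf is not
    if ensure_all_finite is True:
        if not np.isfinite(X).all():
            raise ValueError("Input contains NaN or infinity.")
        return X
    raise ValueError(f"Unsupported option: {ensure_all_finite!r}")


data_with_nan = [[0.0, np.nan], [1.0, 2.0]]

# 'allow-nan' lets the NaN through.
accepted = check_finite_sketch(data_with_nan, ensure_all_finite="allow-nan")

# The default (True) rejects the same input.
try:
    check_finite_sketch(data_with_nan, ensure_all_finite=True)
    rejected = False
except ValueError:
    rejected = True
```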
- fit_transform(X, y=None, ensure_all_finite=True, **kwargs)[source]
Fit X into an embedded space and return that transformed output.
- Parameters:
X (array, shape (n_samples, n_features) or (n_samples, n_samples)) – If the metric is ‘precomputed’ X must be a square distance matrix. Otherwise it contains a sample per row.
y (array, shape (n_samples)) – A target array for supervised dimension reduction. How this is handled is determined by parameters UMAP was instantiated with. The relevant attributes are target_metric and target_metric_kwds.
ensure_all_finite – Whether to raise an error on np.inf, np.nan, pd.NA in array. The possibilities are:
- True: Force all values of array to be finite.
- False: accepts np.inf, np.nan, pd.NA in array.
- ‘allow-nan’: accepts only np.nan and pd.NA values in array. Values cannot be infinite.
**kwargs (optional) – Any additional keyword arguments are passed to _fit_embed_data.
- Returns:
X_new (array, shape (n_samples, n_components)) – Embedding of the training data in low-dimensional space.
Or a tuple (X_new, r_orig, r_emb) if the output_dens flag is set, which additionally includes:
r_orig (array, shape (n_samples)) – Local radii of data points in the original data space (log-transformed).
r_emb (array, shape (n_samples)) – Local radii of data points in the embedding (log-transformed).
- inverse_transform(X)[source]
Transform X in the existing embedded space back into the input data space and return that transformed output.
- Parameters:
X (array, shape (n_samples, n_components)) – New points to be inverse transformed.
- Returns:
X_new – Generated data points in data space.
- Return type:
array, shape (n_samples, n_features)
- set_fit_request(*, ensure_all_finite: bool | None | str = '$UNCHANGED$') UMAP[source]
Request metadata passed to the fit method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to fit.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- set_transform_request(*, ensure_all_finite: bool | None | str = '$UNCHANGED$') UMAP[source]
Request metadata passed to the transform method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to transform.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- transform(X, ensure_all_finite=True)[source]
Transform X into the existing embedded space and return that transformed output.
- Parameters:
X (array, shape (n_samples, n_features)) – New data to be transformed.
ensure_all_finite – Whether to raise an error on np.inf, np.nan, pd.NA in array. The possibilities are:
- True: Force all values of array to be finite.
- False: accepts np.inf, np.nan, pd.NA in array.
- ‘allow-nan’: accepts only np.nan and pd.NA values in array. Values cannot be infinite.
- Returns:
X_new – Embedding of the new data in low-dimensional space.
- Return type:
array, shape (n_samples, n_components)
- class corelay.processor.embedding.Embedding[source]
Bases:
Processor
The abstract base class for embedding processors.
- Parameters:
is_output (bool) – A value indicating whether this Embedding processor is the output of a Pipeline. Defaults to False.
is_checkpoint (bool | None) – A value indicating whether check-pointed pipeline computations should start at this point, if there exists a previously computed checkpoint value. Defaults to False.
io (Storable | None) – An IO object that is used to cache intermediate results of the Pipeline, which can then be re-used in this run or in subsequent runs of the Pipeline. Defaults to an instance of NoStorage.
kwargs (dict[str, Any]) – Additional keyword arguments for the embedding algorithm. Defaults to an empty dict.
- kwargs: Annotated[dict[str, Any], Param]
Additional keyword arguments to pass to the embedding function.
- __tracked__: collections.OrderedDict[str, Any]
A collections.OrderedDict with all public class attributes, i.e., all class attributes not enclosed with double underscores.
- class corelay.processor.embedding.EigenDecomposition[source]
Bases:
Embedding
A spectral embedding Processor that performs eigenvalue decomposition.
- Parameters:
is_output (bool) – A value indicating whether this EigenDecomposition embedding processor is the output of a Pipeline. Defaults to False.
is_checkpoint (bool | None) – A value indicating whether check-pointed pipeline computations should start at this point, if there exists a previously computed checkpoint value. Defaults to False.
io (Storable | None) – An IO object that is used to cache intermediate results of the Pipeline, which can then be re-used in this run or in subsequent runs of the Pipeline. Defaults to an instance of NoStorage.
kwargs (dict[str, Any]) – Additional keyword arguments for the eigenvalue decomposition embedding algorithm. Defaults to an empty dict.
n_eigval (int) – The number of eigenvalues and eigenvectors to compute. Defaults to 32.
which (str) – The type of eigenvalues and eigenvectors to compute. Defaults to “LM” (largest in magnitude).
normalize (bool) – A value indicating whether to normalize the eigenvectors. Defaults to True.
- n_eigval: Annotated[int, Param]
The number of eigenvalues and eigenvectors to compute. Defaults to 32.
- which: Annotated[str, Param]
The type of eigenvalues and eigenvectors to compute. The options are:
“LM”: Largest (in magnitude) eigenvalues.
“SM”: Smallest (in magnitude) eigenvalues.
“LA”: Largest (algebraic) eigenvalues.
“SA”: Smallest (algebraic) eigenvalues.
“BE”: Half (k/2) from each end of the spectrum.
Defaults to “LM” (largest in magnitude).
Note
If the input is a complex Hermitian matrix, ‘BE’ is invalid.
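To make the `which` options above concrete, the following pure-Python sketch selects k values from a real spectrum under each option. The function `select_eigenvalues_sketch` is hypothetical and only illustrates the selection rules; an actual solver computes the requested eigenvalues directly rather than sorting a full spectrum. How an odd k is split for “BE” is an assumption here (the extra value is taken from the high end).

```python
def select_eigenvalues_sketch(eigenvalues, k, which="LM"):
    """Pick k eigenvalues from a real spectrum according to `which`."""
    if which == "LM":
        return sorted(eigenvalues, key=abs, reverse=True)[:k]
    if which == "SM":
        return sorted(eigenvalues, key=abs)[:k]
    if which == "LA":
        return sorted(eigenvalues, reverse=True)[:k]
    if which == "SA":
        return sorted(eigenvalues)[:k]
    if which == "BE":
        ordered = sorted(eigenvalues)
        low = k // 2
        high = k - low  # assumption: for odd k the extra value comes from the high end
        return ordered[:low] + ordered[len(ordered) - high:]
    raise ValueError(f"Invalid eigenvalue type: {which!r}")


spectrum = [-3.0, -0.5, 0.1, 1.0, 2.0]

largest_magnitude = select_eigenvalues_sketch(spectrum, 2, "LM")  # [-3.0, 2.0]
largest_algebraic = select_eigenvalues_sketch(spectrum, 2, "LA")  # [2.0, 1.0]
both_ends = select_eigenvalues_sketch(spectrum, 3, "BE")          # [-3.0, 1.0, 2.0]
```

Note how “LM” and “LA” differ as soon as the spectrum contains negative values of large magnitude.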
- normalize: Annotated[bool, Param]
A value indicating whether to normalize the eigenvectors. Defaults to True.
- function(data: Any) Any[source]
Computes the spectral embedding of the input data using eigenvalue decomposition.
Note
If Av = λv, then (I-A)v = (1-λ)v. We therefore compute the largest eigenvalues of the identity minus the data and return one minus each of these eigenvalues.
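The identity in the note can be checked numerically. The snippet below is a minimal NumPy sketch on a small symmetric matrix with made-up values; it uses a dense solver for clarity and is not the code that the processor runs. It shows that the largest eigenvalues of I - A, mapped back with one minus the eigenvalue, are the smallest eigenvalues of A.

```python
import numpy as np

# A small symmetric matrix standing in for the input data (illustrative values).
A = np.array([
    [0.0, 0.5, 0.2],
    [0.5, 0.0, 0.3],
    [0.2, 0.3, 0.0],
])
k = 2

# Eigenvalues of I - A, returned in ascending order.
shifted_eigenvalues = np.linalg.eigvalsh(np.eye(3) - A)

# Take the k largest eigenvalues of I - A and map them back with 1 - mu.
recovered = np.sort(1.0 - shifted_eigenvalues[-k:])

# They coincide with the k smallest eigenvalues of A itself.
direct = np.linalg.eigvalsh(A)[:k]
```

The eigenvectors are shared between A and I - A, so only the eigenvalues need to be mapped back.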
- Parameters:
data (Any) – The input data to be embedded. The data should be a NumPy array of shape (number_of_samples, number_of_features).
- Raises:
ValueError – The eigenvalue and eigenvector type is not valid.
- Returns:
Returns a tuple containing the eigenvalues and eigenvectors of the input data.
- Return type:
- __tracked__: collections.OrderedDict[str, Any]
A collections.OrderedDict with all public class attributes, i.e., all class attributes not enclosed with double underscores.
- class corelay.processor.embedding.TSNEEmbedding[source]
Bases:
Embedding
An embedding Processor that uses the t-SNE algorithm to reduce the dimensionality of the input data.
- Parameters:
is_output (bool) – A value indicating whether this TSNEEmbedding embedding processor is the output of a Pipeline. Defaults to False.
is_checkpoint (bool | None) – A value indicating whether check-pointed pipeline computations should start at this point, if there exists a previously computed checkpoint value. Defaults to False.
io (Storable | None) – An IO object that is used to cache intermediate results of the Pipeline, which can then be re-used in this run or in subsequent runs of the Pipeline. Defaults to an instance of NoStorage.
kwargs (dict[str, Any]) – Additional keyword arguments for the t-SNE embedding algorithm. Defaults to an empty dict.
n_components (int) – The number of dimensions to reduce the data to. Defaults to 2.
metric (str) – The distance metric to use. Defaults to “euclidean”.
perplexity (float) – The perplexity parameter for the t-SNE algorithm. Defaults to 30.
early_exaggeration (float) – The early exaggeration parameter for the t-SNE algorithm. Defaults to 12.
- metric: Annotated[str, Param]
The distance metric to use. Can be one of
“braycurtis”
“canberra”
“chebychev”, “chebyshev”, “cheby”, “cheb”, “ch”
“cityblock”, “cblock”, “cb”, “c”
“correlation”, “co”
“cosine”, “cos”
“dice”
“euclidean”, “euclid”, “eu”, “e”
“hamming”, “hamm”, “ha”, “h”
“minkowski”, “mi”, “m”
“pnorm”
“jaccard”, “jacc”, “ja”, “j”
“jensenshannon”, “js”
“kulczynski1”
“mahalanobis”, “mahal”, “mah”
“rogerstanimoto”
“russellrao”
“seuclidean”, “se”, “s”
“sokalmichener”
“sokalsneath”
“sqeuclidean”, “sqe”, “sqeuclid”
“yule”
Defaults to “euclidean”.
- perplexity: Annotated[float, Param]
The perplexity parameter for the t-SNE algorithm. Defaults to 30.
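As background on what the perplexity value measures: in t-SNE, the perplexity of a neighbor distribution P is defined as 2 raised to its Shannon entropy in bits, so it can be read as an effective number of neighbors. The short sketch below computes that quantity for two hand-made distributions; the function and the example distributions are illustrative and not part of CoRelAy.

```python
import math


def perplexity(probabilities):
    """Return 2 ** H(P), with the Shannon entropy H measured in bits."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2.0 ** entropy


# A uniform distribution over 30 neighbors has a perplexity of about 30.
uniform_over_30 = [1.0 / 30] * 30
uniform_perplexity = perplexity(uniform_over_30)

# A distribution dominated by a single neighbor has a perplexity close to 1.
peaked = [0.97] + [0.001] * 30
peaked_perplexity = perplexity(peaked)
```

A larger perplexity setting therefore asks the algorithm to balance each point against a larger effective neighborhood.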
- early_exaggeration: Annotated[float, Param]
The early exaggeration parameter for the t-SNE algorithm. Defaults to 12.
- function(data: Any) Any[source]
Computes the t-SNE embedding of the input data.
- Parameters:
data (Any) – The input data to be embedded. The data should be a NumPy array of shape (number_of_samples, number_of_features) or (number_of_samples, number_of_samples).
- Returns:
Returns the t-SNE embedding of the input data as a NumPy array of shape (number_of_samples, number_of_components).
- Return type:
- __tracked__: collections.OrderedDict[str, Any]
A collections.OrderedDict with all public class attributes, i.e., all class attributes not enclosed with double underscores.
- class corelay.processor.embedding.PCAEmbedding[source]
Bases:
Embedding
An embedding Processor that uses the principal component analysis (PCA) algorithm to reduce the dimensionality of the input data.
- Parameters:
is_output (bool) – A value indicating whether this PCAEmbedding embedding processor is the output of a Pipeline. Defaults to False.
is_checkpoint (bool | None) – A value indicating whether check-pointed pipeline computations should start at this point, if there exists a previously computed checkpoint value. Defaults to False.
io (Storable | None) – An IO object that is used to cache intermediate results of the Pipeline, which can then be re-used in this run or in subsequent runs of the Pipeline. Defaults to an instance of NoStorage.
kwargs (dict[str, Any]) – Additional keyword arguments for the PCA embedding algorithm. Defaults to an empty dict.
n_components (int) – The number of dimensions to reduce the data to. Defaults to 2.
whiten (bool) – A value indicating whether to whiten the data. Defaults to False.
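The effect of the `n_components` and `whiten` parameters can be illustrated with a plain NumPy PCA based on the singular value decomposition. The function `pca_sketch` and the random test data are assumptions made for this example; the processor itself may delegate to a library implementation whose details differ.

```python
import numpy as np


def pca_sketch(data, n_components=2, whiten=False):
    """Project data onto its top principal components via an SVD (illustrative only)."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    u, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    if whiten:
        # Rescale every component to unit sample variance.
        projected /= singular_values[:n_components] / np.sqrt(len(data) - 1)
    return projected


rng = np.random.default_rng(0)
samples = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

plain = pca_sketch(samples, n_components=2)
whitened = pca_sketch(samples, n_components=2, whiten=True)
```

Without whitening, the first component keeps a much larger variance than the second; with whitening, both components have unit sample variance.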
- __tracked__: collections.OrderedDict[str, Any]
A collections.OrderedDict with all public class attributes, i.e., all class attributes not enclosed with double underscores.
- class corelay.processor.embedding.LLEEmbedding[source]
Bases:
Embedding
An embedding Processor that uses the locally linear embedding (LLE) algorithm to reduce the dimensionality of the input data.
- Parameters:
is_output (bool) – A value indicating whether this LLEEmbedding embedding processor is the output of a Pipeline. Defaults to False.
is_checkpoint (bool | None) – A value indicating whether check-pointed pipeline computations should start at this point, if there exists a previously computed checkpoint value. Defaults to False.
io (Storable | None) – An IO object that is used to cache intermediate results of the Pipeline, which can then be re-used in this run or in subsequent runs of the Pipeline. Defaults to an instance of NoStorage.
kwargs (dict[str, Any]) – Additional keyword arguments for the LLE embedding algorithm. Defaults to an empty dict.
n_components (int) – The number of dimensions to reduce the data to. Defaults to 2.
n_neighbors (int) – The number of neighbors to use for the LLE algorithm. Defaults to 5.
- __tracked__: collections.OrderedDict[str, Any]
A collections.OrderedDict with all public class attributes, i.e., all class attributes not enclosed with double underscores.
- n_neighbors: Annotated[int, Param]
The number of neighbors to use for the LLE algorithm. Defaults to 5.
- class corelay.processor.embedding.UMAPEmbedding[source]
Bases:
Embedding
An embedding Processor that uses the Uniform Manifold Approximation and Projection (UMAP) algorithm to reduce the dimensionality of the input data.
- Parameters:
is_output (bool) – A value indicating whether this UMAPEmbedding embedding processor is the output of a Pipeline. Defaults to False.
is_checkpoint (bool | None) – A value indicating whether check-pointed pipeline computations should start at this point, if there exists a previously computed checkpoint value. Defaults to False.
io (Storable | None) – An IO object that is used to cache intermediate results of the Pipeline, which can then be re-used in this run or in subsequent runs of the Pipeline. Defaults to an instance of NoStorage.
kwargs (dict[str, Any]) – Additional keyword arguments for the UMAP embedding algorithm. Defaults to an empty dict.
n_neighbors (int) – The number of neighbors to use for the UMAP algorithm. Defaults to 15.
min_dist (float) – The minimum distance between points in the UMAP algorithm. Defaults to 0.1.
metric (str) – The distance metric to use for the UMAP algorithm. Defaults to “correlation”.
- __tracked__: collections.OrderedDict[str, Any]
A collections.OrderedDict with all public class attributes, i.e., all class attributes not enclosed with double underscores.
- n_neighbors: Annotated[int, Param]
The number of neighbors to use for the UMAP algorithm. Defaults to 15.
- min_dist: Annotated[float, Param]
The minimum distance between points in the UMAP algorithm. Defaults to 0.1.
- metric: Annotated[str, Param]
The distance metric to use for the UMAP algorithm. This can be one of “euclidean”, “manhattan”, “chebyshev”, “minkowski”, “canberra”, “braycurtis”, “mahalanobis”, “wminkowski”, “seuclidean”, “cosine”, “correlation”, “haversine”, “hamming”, “jaccard”, “dice”, “russelrao”, “kulsinski”, “ll_dirichlet”, “hellinger”, “rogerstanimoto”, “sokalmichener”, “sokalsneath”, or “yule”. Defaults to “correlation”.
- function(data: Any) Any[source]
Computes the UMAP embedding of the input data.
Note
For information on the UMAP algorithm, see the UMAP documentation.