corelay.processor.clustering

A module that contains processors for clustering algorithms.

Module Attributes

hdbscan

Performs the HDBSCAN clustering algorithm on a vector or distance matrix.

Classes

AgglomerativeClustering

A clustering Processor that performs the Agglomerative Clustering algorithm.

Clustering

The abstract base class for Processors that perform clustering.

DBSCAN

A clustering Processor that performs the DBSCAN clustering algorithm.

Dendrogram

A clustering Processor that plots the hierarchical clustering as a dendrogram.

HDBSCAN

A clustering Processor that performs the HDBSCAN clustering algorithm.

KMeans

A clustering Processor that performs the k-Means clustering algorithm.

corelay.processor.clustering.hdbscan

Performs the HDBSCAN clustering algorithm on a vector or distance matrix.

Note

Since the HDBSCAN library is an optional dependency of CoRelAy, it is imported using the import_or_stub() function, which tries to import the module/type/function specified. If the import fails, it returns a stub instead, which will raise an exception when used. The exception message will tell users how to install the missing dependencies for the functionality to work.

Returns:

Returns an HDBSCAN cluster estimator, which can be used to fit the data.

Return type:

sklearn.base.ClusterMixin
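The optional-import behaviour described in the note above can be sketched in a few lines of standard-library Python. This is not CoRelAy's implementation: the names import_or_stub_sketch and MissingDependencyStub, the exception type, and the message wording are all assumptions made for illustration.

```python
import importlib
from typing import Any


class MissingDependencyStub:
    """Hypothetical placeholder returned when an optional module is missing."""

    def __init__(self, module_name: str, install_hint: str) -> None:
        self._module_name = module_name
        self._install_hint = install_hint

    def _fail(self) -> None:
        # The message tells the user how to obtain the missing dependency.
        raise RuntimeError(
            f"Optional dependency '{self._module_name}' is not installed. "
            f"Install it with: {self._install_hint}"
        )

    def __getattr__(self, name: str) -> Any:
        # Only called for attributes that are not found normally,
        # e.g. stub.HDBSCAN, so the private fields above stay accessible.
        self._fail()

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        self._fail()


def import_or_stub_sketch(module_name: str, install_hint: str) -> Any:
    """Return the imported module, or a stub that raises on first use."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return MissingDependencyStub(module_name, install_hint)


# A module that exists is returned unchanged.
json_module = import_or_stub_sketch("json", "part of the standard library")

# A module that does not exist yields a stub; the import itself does not fail.
missing = import_or_stub_sketch(
    "module_that_is_not_installed_xyz", "pip install example-package"
)

try:
    missing.HDBSCAN  # first use of the stub triggers the error
except RuntimeError as error:
    print(error)
```

The point of the pattern is that importing the package never fails because of an optional dependency; the error surfaces only when the missing functionality is actually used.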

class corelay.processor.clustering.Clustering

Bases: Processor

The abstract base class for Processors that perform clustering.

Parameters:
  • is_output (bool) – A value indicating whether this Clustering processor is the output of a Pipeline. Defaults to False.

  • is_checkpoint (bool | None) – A value indicating whether check-pointed pipeline computations should start at this point, if there exists a previously computed checkpoint value. Defaults to False.

  • io (Storable | None) – An IO object that is used to cache intermediate results of the Pipeline, which can then be re-used in this run or in subsequent runs of the Pipeline. Defaults to an instance of NoStorage.

  • kwargs (dict[str, Any]) – Additional keyword arguments for the clustering algorithm. Defaults to an empty dict.

kwargs: Annotated[dict[str, Any], Param]

A dict of additional keyword arguments for the clustering algorithm. Defaults to an empty dict.

Return type:

Plug

__tracked__: collections.OrderedDict[str, Any]

A collections.OrderedDict with all public class attributes, i.e., all class attributes not enclosed with double underscores.
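As a rough illustration of the rule "not enclosed with double underscores", the following sketch collects such attributes from a plain class. The helper collect_public_attributes and the class ExampleProcessor are hypothetical; how CoRelAy actually populates __tracked__ is not shown in this reference.

```python
from collections import OrderedDict
from typing import Any


def collect_public_attributes(cls: type) -> "OrderedDict[str, Any]":
    """Collect class attributes whose names are not enclosed in double underscores."""
    return OrderedDict(
        (name, value)
        for name, value in vars(cls).items()
        if not (name.startswith("__") and name.endswith("__"))
    )


class ExampleProcessor:
    """Hypothetical stand-in for a processor class."""

    n_clusters = 2
    metric = "euclidean"
    __tracked__ = None  # enclosed in double underscores, therefore skipped


tracked = collect_public_attributes(ExampleProcessor)
print(list(tracked))
```

The attributes appear in definition order, and names such as __doc__ or __tracked__ itself are filtered out.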

class corelay.processor.clustering.KMeans

Bases: Clustering

A clustering Processor that performs the k-Means clustering algorithm.

Parameters:
  • is_output (bool) – A value indicating whether this KMeans clustering processor is the output of a Pipeline. Defaults to False.

  • is_checkpoint (bool | None) – A value indicating whether check-pointed pipeline computations should start at this point, if there exists a previously computed checkpoint value. Defaults to False.

  • io (Storable | None) – An IO object that is used to cache intermediate results of the Pipeline, which can then be re-used in this run or in subsequent runs of the Pipeline. Defaults to an instance of NoStorage.

  • kwargs (dict[str, Any]) – Additional keyword arguments for the k-Means clustering algorithm. Defaults to an empty dict.

  • n_clusters (int) – The number of clusters to form. Defaults to 2.

  • index (tuple[int | slice]) – The indices of the data to be clustered. Defaults to an empty slice.

n_clusters: Annotated[int, Param]

The number of clusters to form. Defaults to 2.

Return type:

Plug

index: Annotated[SupportsIndex | tuple[SupportsIndex, ...], Param]

The indices of the data to be clustered. Defaults to an empty slice.

Return type:

Plug

function(data: Any) → Any

Performs k-Means clustering on the data.

Parameters:

data (Any) – The data to be clustered. The data should be a NumPy array or a sparse matrix.

Returns:

Returns a NumPy array of shape (number_of_samples,), that contains the cluster labels, where each label corresponds to a cluster assigned to the data point.

Return type:

Any
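The following sketch shows the kind of input and output described above, using scikit-learn's KMeans directly rather than the CoRelAy processor. It assumes that index is applied to the data with ordinary NumPy indexing before clustering and that the "empty slice" default corresponds to (slice(None),); both are assumptions, not statements about CoRelAy's internals.

```python
import numpy as np
from sklearn.cluster import KMeans as SklearnKMeans

# Two well-separated blobs of 20 samples with 3 features each.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(20, 3))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(20, 3))
data = np.concatenate([blob_a, blob_b])

# Assumed default: an "empty slice" that selects every sample and feature.
index = (slice(None),)
selected = data[index]

# A narrower index, e.g. all samples but only the first two features.
first_two_features = data[(slice(None), slice(0, 2))]

# n_clusters=2 mirrors the documented default of the processor.
estimator = SklearnKMeans(n_clusters=2, n_init=10, random_state=0)
labels = estimator.fit_predict(selected)

print(labels.shape)  # one label per sample
```

Each entry of labels is the cluster assigned to the sample at the same position in the selected data.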

__tracked__: collections.OrderedDict[str, Any]

A collections.OrderedDict with all public class attributes, i.e., all class attributes not enclosed with double underscores.

class corelay.processor.clustering.HDBSCAN

Bases: Clustering

A clustering Processor that performs the HDBSCAN clustering algorithm.

Parameters:
  • is_output (bool) – A value indicating whether this HDBSCAN clustering processor is the output of a Pipeline. Defaults to False.

  • is_checkpoint (bool | None) – A value indicating whether check-pointed pipeline computations should start at this point, if there exists a previously computed checkpoint value. Defaults to False.

  • io (Storable | None) – An IO object that is used to cache intermediate results of the Pipeline, which can then be re-used in this run or in subsequent runs of the Pipeline. Defaults to an instance of NoStorage.

  • kwargs (dict[str, Any]) – Additional keyword arguments for the HDBSCAN clustering algorithm. Defaults to an empty dict.

  • n_clusters (int) – The number of clusters to form. Defaults to 5.

  • metric (str) – The distance metric to use. This can be one of “euclidean”, “l2” (equivalent to “euclidean”), “minkowski”, “p” (equivalent to “minkowski”), “manhattan”, “cityblock” (equivalent to “manhattan”), “l1” (equivalent to “manhattan”), “chebyshev”, “infinity” (equivalent to “chebyshev”), “seuclidean”, “mahalanobis”, “wminkowski”, “hamming”, “canberra”, “braycurtis”, “matching”, “jaccard”, “dice”, “kulsinski”, “rogerstanimoto”, “russellrao”, “sokalmichener”, “sokalsneath”, “haversine”, “cosine” (since the cosine distance is not a true distance measure, it is not supported; using “cosine” will use the “arccos” distance instead), and “arccos”. Defaults to “euclidean”.

Notes

GitHub repository including documentation for HDBSCAN: https://github.com/scikit-learn-contrib/hdbscan.

n_clusters: Annotated[int, Param]

The number of clusters to form. Defaults to 5.

Return type:

Plug

metric: Annotated[str, Param]

The distance metric to use. Defaults to “euclidean”.

Return type:

Plug
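The remark in the parameter list that the cosine distance is not a true distance measure can be checked numerically: it violates the triangle inequality, while the angle between two vectors does not. The sketch below uses the plain angle in radians as the "arccos" distance; the exact normalisation used by the HDBSCAN library is not stated in this reference, so that choice is an assumption (rescaling does not affect the comparison).

```python
import numpy as np


def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    return 1.0 - cosine_similarity(u, v)


def arccos_distance(u: np.ndarray, v: np.ndarray) -> float:
    # Clip to guard against rounding slightly outside [-1, 1].
    return float(np.arccos(np.clip(cosine_similarity(u, v), -1.0, 1.0)))


# Three vectors at 0, 45 and 90 degrees.
a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
c = np.array([0.0, 1.0])

# Cosine distance: going via b is "shorter" than going directly,
# which a true metric does not allow.
cosine_direct = cosine_distance(a, c)
cosine_detour = cosine_distance(a, b) + cosine_distance(b, c)

# Angular distance: the detour is never shorter than the direct path.
arccos_direct = arccos_distance(a, c)
arccos_detour = arccos_distance(a, b) + arccos_distance(b, c)

print(cosine_detour < cosine_direct)
print(arccos_detour >= arccos_direct - 1e-12)
```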

function(data: Any) → Any

Performs the HDBSCAN clustering algorithm on the data.

Parameters:

data (Any) – The data to be clustered. The data should be a NumPy array of shape (number_of_samples, number_of_features) or a sparse matrix.

Returns:

Returns a NumPy array of shape (number_of_samples,), that contains the cluster labels, where each label corresponds to a cluster assigned to the data point.

Return type:

Any

__tracked__: collections.OrderedDict[str, Any]

A collections.OrderedDict with all public class attributes, i.e., all class attributes not enclosed with double underscores.

class corelay.processor.clustering.DBSCAN

Bases: Clustering

A clustering Processor that performs the DBSCAN clustering algorithm.

Parameters:
  • is_output (bool) – A value indicating whether this DBSCAN clustering processor is the output of a Pipeline. Defaults to False.

  • is_checkpoint (bool | None) – A value indicating whether check-pointed pipeline computations should start at this point, if there exists a previously computed checkpoint value. Defaults to False.

  • io (Storable | None) – An IO object that is used to cache intermediate results of the Pipeline, which can then be re-used in this run or in subsequent runs of the Pipeline. Defaults to an instance of NoStorage.

  • kwargs (dict[str, Any]) – Additional keyword arguments for the DBSCAN clustering algorithm. Defaults to an empty dict.

  • metric (str) – The distance metric to use. Defaults to “euclidean”.

  • eps (float) – The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function. Defaults to 0.5.

  • min_samples (int) – The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself. If min_samples is set to a higher value, DBSCAN will find denser clusters, whereas if it is set to a lower value, the found clusters will be more sparse. Defaults to 5.

metric: Annotated[str, Param]

The distance metric to use. Can be one of

  • “cityblock”

  • “cosine”

  • “euclidean”

  • “haversine”

  • “l1”

  • “l2”

  • “manhattan”

  • “nan_euclidean”

  • “precomputed”.

Defaults to “euclidean”.

Return type:

Plug

eps: Annotated[float, Param]

The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function. Defaults to 0.5.

Return type:

Plug

min_samples: Annotated[int, Param]

The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself. If DBSCAN.min_samples is set to a higher value, DBSCAN will find denser clusters, whereas if it is set to a lower value, the found clusters will be more sparse. Defaults to 5.

Return type:

Plug

function(data: Any) → Any

Performs the DBSCAN clustering algorithm on the data.

Parameters:

data (Any) – The data to be clustered. The data should be a NumPy array of shape (number_of_samples, number_of_features) or a sparse matrix.

Returns:

Returns a NumPy array of shape (number_of_samples,), that contains the cluster labels, where each label corresponds to a cluster assigned to the data point. Noisy points will be labeled as -1.

Return type:

Any
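To make the roles of eps and min_samples and the -1 noise label concrete, here is a small sketch that calls scikit-learn's DBSCAN directly with the documented defaults. It does not go through the CoRelAy processor, and the toy data set is made up for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN as SklearnDBSCAN

# Two tight groups of five points each, plus one far-away outlier.
group = np.array([
    [0.00, 0.00],
    [0.00, 0.10],
    [0.10, 0.00],
    [0.10, 0.10],
    [0.05, 0.05],
])
data = np.concatenate([group, group + 5.0, np.array([[20.0, 20.0]])])

# With eps=0.5 and min_samples=5, every point of a group has exactly five
# neighbours within eps (itself included), so all of them are core points.
labels = SklearnDBSCAN(eps=0.5, min_samples=5, metric="euclidean").fit_predict(data)

# Raising min_samples to 6 demands denser neighbourhoods than the data offers,
# so no point qualifies as a core point and everything is labelled as noise.
strict_labels = SklearnDBSCAN(eps=0.5, min_samples=6).fit_predict(data)

print(labels)
print(strict_labels)
```

The outlier receives the label -1 in both runs; the two groups receive labels 0 and 1 only when min_samples is low enough for their points to count as core points.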

__tracked__: collections.OrderedDict[str, Any]

A collections.OrderedDict with all public class attributes, i.e., all class attributes not enclosed with double underscores.

class corelay.processor.clustering.AgglomerativeClustering

Bases: Clustering

A clustering Processor that performs the Agglomerative Clustering algorithm.

Parameters:
  • is_output (bool) – A value indicating whether this AgglomerativeClustering clustering processor is the output of a Pipeline. Defaults to False.

  • is_checkpoint (bool | None) – A value indicating whether check-pointed pipeline computations should start at this point, if there exists a previously computed checkpoint value. Defaults to False.

  • io (Storable | None) – An IO object that is used to cache intermediate results of the Pipeline, which can then be re-used in this run or in subsequent runs of the Pipeline. Defaults to an instance of NoStorage.

  • kwargs (dict[str, Any]) – Additional keyword arguments for the agglomerative clustering algorithm. Defaults to an empty dict.

  • n_clusters (int) – The number of clusters to form. Defaults to 5.

  • metric (str) – The distance metric to use. Defaults to “euclidean”.

  • linkage (str) – The linkage method to use. This determines which distance to use between two newly formed clusters. The algorithm will merge the pairs of clusters that minimize this criterion. Defaults to “ward”.

__tracked__: collections.OrderedDict[str, Any]

A collections.OrderedDict with all public class attributes, i.e., all class attributes not enclosed with double underscores.

n_clusters: Annotated[int, Param]

The number of clusters to form. Defaults to 5.

Return type:

Plug

metric: Annotated[str, Param]

The distance metric to use. Can be one of

  • “euclidean”

  • “l1”

  • “l2”

  • “manhattan”

  • “cosine”

  • “precomputed”.

Defaults to “euclidean”.

Return type:

Plug

linkage: Annotated[str, Param]

The linkage method to use. This determines which distance to use between two newly formed clusters. The algorithm will merge the pairs of clusters that minimize this criterion. Can be one of

  • “ward” minimizes the variance of the clusters being merged.

  • “average” uses the average of the distances of each observation of the two clusters.

  • “complete” uses the maximum of the distances between all observations of the two clusters.

  • “single” uses the minimum of the distances between all observations of the two clusters.

Defaults to “ward”.

Return type:

Plug
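The four linkage criteria listed above can be computed by hand for two small clusters. The sketch below uses NumPy only and Euclidean distances; for “ward” it uses one common formulation, the increase in the within-cluster sum of squares caused by the merge, which is an assumption about how "minimizes the variance" is measured.

```python
import numpy as np

# Two clusters of two observations each, all on the x-axis.
cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[3.0, 0.0], [5.0, 0.0]])

# All pairwise distances between observations of the two clusters,
# shape (len(cluster_a), len(cluster_b)); here 3, 5, 2 and 4.
pairwise = np.linalg.norm(cluster_a[:, None, :] - cluster_b[None, :, :], axis=-1)

single = float(pairwise.min())    # minimum of the distances
complete = float(pairwise.max())  # maximum of the distances
average = float(pairwise.mean())  # average of the distances


def within_cluster_sse(points: np.ndarray) -> float:
    """Sum of squared distances of the points to their centroid."""
    return float(((points - points.mean(axis=0)) ** 2).sum())


# Ward (assumed formulation): how much the sum of squares grows on merging.
merged = np.concatenate([cluster_a, cluster_b])
ward_increase = (
    within_cluster_sse(merged)
    - within_cluster_sse(cluster_a)
    - within_cluster_sse(cluster_b)
)

print(single, complete, average, ward_increase)
```

Whichever criterion is chosen, the pair of clusters with the smallest value is merged next.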

function(data: Any) → Any

Performs the Agglomerative Clustering algorithm on the data.

Parameters:

data (Any) – The data to be clustered. The data should be a NumPy array of shape (number_of_samples, number_of_features) or (number_of_samples, number_of_samples), or a sparse matrix.

Returns:

Returns a NumPy array of shape (number_of_samples,), that contains the cluster labels, where each label corresponds to a cluster assigned to the data point.

Return type:

Any

class corelay.processor.clustering.Dendrogram

Bases: Clustering

A clustering Processor that plots the hierarchical clustering as a dendrogram.

Parameters:
  • is_output (bool) – A value indicating whether this Dendrogram clustering processor is the output of a Pipeline. Defaults to False.

  • is_checkpoint (bool | None) – A value indicating whether check-pointed pipeline computations should start at this point, if there exists a previously computed checkpoint value. Defaults to False.

  • io (Storable | None) – An IO object that is used to cache intermediate results of the Pipeline, which can then be re-used in this run or in subsequent runs of the Pipeline. Defaults to an instance of NoStorage.

  • kwargs (dict[str, Any]) – Additional keyword arguments for the hierarchical clustering algorithm. Defaults to an empty dict.

  • output_file (str | io.IOBase) – The path to a file or a file descriptor to save the dendrogram plot to.

  • metric (str) – The distance metric to use for the clustering. Defaults to “euclidean”.

  • linkage (str) – The linkage criterion to use. This determines which distance to use between sets of observations. Defaults to “ward”.

__tracked__: collections.OrderedDict[str, Any]

A collections.OrderedDict with all public class attributes, i.e., all class attributes not enclosed with double underscores.

output_file: Annotated[str | IOBase, Param]

The path to a file or a file descriptor to save the dendrogram plot to.

Return type:

Plug

metric: Annotated[str, Param]

The distance metric to use for the clustering. This can be one of “braycurtis”, “canberra”, “chebyshev”, “cityblock”, “correlation”, “cosine”, “dice”, “euclidean”, “hamming”, “jaccard”, “jensenshannon”, “kulczynski1”, “mahalanobis”, “minkowski”, “rogerstanimoto”, “russellrao”, “seuclidean”, “sokalmichener”, “sokalsneath”, “sqeuclidean”, or “yule”. Defaults to “euclidean”.

Return type:

Plug

linkage: Annotated[str, Param]

The linkage method used by the Dendrogram Processor. It determines the distance between two newly formed clusters during hierarchical clustering; the algorithm merges the pairs of clusters that minimize this criterion. The following linkage methods are supported:

  • “ward” minimizes the variance of the clusters being merged.

  • “average” uses the average of the distances of each observation of the two clusters.

  • “complete” uses the maximum of the distances between all observations of the two clusters.

  • “single” uses the minimum of the distances between all observations of the two clusters.

  • “centroid” uses the distance between the centroids of the two clusters.

  • “median” uses the median of the centroids of the two clusters.

  • “weighted” assigns the weighted distance between the two original clusters and a third remaining cluster to the new cluster.

Defaults to “ward”.

Return type:

Plug

function(data: Any) → Any

Performs the hierarchical clustering algorithm on the data and generates a dendrogram plot.

Parameters:

data (Any) – The data to be clustered. The data should be a NumPy array that contains a condensed distance matrix. A condensed distance matrix is a flat array containing the upper triangle of the distance matrix. Alternatively, an array of shape (number_of_observations, number_of_dimensions) may be passed in.

Raises:

ValueError – The linkage method is invalid.

Returns:

Returns the data that was passed in. The data is not modified.

Return type:

Any
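The condensed distance matrix mentioned above is the format produced by SciPy's pdist. The following sketch builds one, shows how it relates to the square matrix, and computes a hierarchical clustering from it with SciPy directly. That the Dendrogram processor relies on these SciPy functions is an assumption; the plot itself is skipped here with no_plot=True.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist, squareform

# Four observations in two dimensions.
observations = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])
n = len(observations)

# Condensed form: a flat array with n * (n - 1) / 2 entries, i.e. the
# upper triangle of the distance matrix without the diagonal.
condensed = pdist(observations, metric="euclidean")

# Square form, for comparison: a symmetric (n, n) matrix.
square = squareform(condensed)
upper_triangle = square[np.triu_indices(n, k=1)]

# Hierarchical clustering on the condensed matrix; each of the n - 1 rows
# of the result describes one merge.
merges = linkage(condensed, method="ward")

# Compute the dendrogram layout without drawing it.
tree = dendrogram(merges, no_plot=True)

print(condensed.shape, square.shape, merges.shape)
print(tree["leaves"])
```

Passing the observations array itself to linkage instead of the condensed matrix corresponds to the alternative input shape (number_of_observations, number_of_dimensions) described above.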