corelay.processor.clustering
A module that contains processors for clustering algorithms.
Module Attributes
hdbscan | Performs the HDBSCAN clustering algorithm on a vector or distance matrix.
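Classes
AgglomerativeClustering | A clustering Processor that performs the Agglomerative Clustering algorithm.
Clustering | The abstract base class for Processors that perform clustering.
DBSCAN | A clustering Processor that performs the DBSCAN clustering algorithm.
Dendrogram | A clustering Processor that plots the hierarchical clustering as a dendrogram.
HDBSCAN | A clustering Processor that performs the HDBSCAN clustering algorithm.
KMeans | A clustering Processor that performs the k-Means clustering algorithm.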
- corelay.processor.clustering.hdbscan[source]
Performs the HDBSCAN clustering algorithm on a vector or distance matrix.
Note
Since the HDBSCAN library is an optional dependency of CoRelAy, it is imported using the import_or_stub() function, which tries to import the specified module, type, or function. If the import fails, it returns a stub instead, which raises an exception when used. The exception message tells users how to install the missing dependencies for the functionality to work.
- Returns:
Returns an HDBSCAN cluster estimator, which can be used to fit the data.
- Return type:
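The import-or-stub pattern described in the note above can be sketched roughly as follows. This is a minimal illustration and not CoRelAy's actual implementation: the helper name `import_or_stub_sketch`, its signature, and the wording of the error message are assumptions made for this example.

```python
import importlib
from typing import Any


def import_or_stub_sketch(module_name: str, install_hint: str) -> Any:
    """Hypothetical helper: returns the module if it can be imported, otherwise a stub."""
    try:
        return importlib.import_module(module_name)
    except ImportError:

        class Stub:
            """Placeholder object that raises as soon as an attribute is accessed."""

            def __getattr__(self, name: str) -> Any:
                raise RuntimeError(
                    f"'{module_name}' is required for this functionality. {install_hint}"
                )

        return Stub()


# A module that is available is returned unchanged.
json_module = import_or_stub_sketch("json", "It ships with Python.")

# A missing module yields a stub; the error only surfaces on first use.
missing = import_or_stub_sketch(
    "surely_not_an_installed_module_xyz", "Install it with your package manager."
)
try:
    missing.HDBSCAN
    stub_error = None
except RuntimeError as error:
    stub_error = str(error)
```

Deferring the error to first use means that importing the clustering module does not fail when the optional dependency is absent; only code paths that actually need it are affected.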
- class corelay.processor.clustering.Clustering[source]
Bases: Processor
The abstract base class for Processors that perform clustering.
- Parameters:
is_output (bool) – A value indicating whether this Clustering processor is the output of a Pipeline. Defaults to False.
is_checkpoint (bool | None) – A value indicating whether check-pointed pipeline computations should start at this point, if there exists a previously computed checkpoint value. Defaults to False.
io (Storable | None) – An IO object that is used to cache intermediate results of the Pipeline, which can then be re-used in this run or in subsequent runs of the Pipeline. Defaults to an instance of NoStorage.
kwargs (dict[str, Any]) – Additional keyword arguments for the clustering algorithm. Defaults to an empty dict.
- kwargs: Annotated[dict[str, Any], Param]
A dict of additional keyword arguments for the clustering algorithm. Defaults to an empty dict.
- __tracked__: collections.OrderedDict[str, Any]
A collections.OrderedDict with all public class attributes, i.e., all class attributes not enclosed in double underscores.
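As a rough illustration of what such a mapping contains, the sketch below collects the attributes of a class whose names are not enclosed in double underscores. The function `collect_public_attributes` and the class `ExampleProcessor` are hypothetical names introduced for this example; how CoRelAy actually populates `__tracked__` is not shown here.

```python
from collections import OrderedDict
from typing import Any


def collect_public_attributes(cls: type) -> "OrderedDict[str, Any]":
    """Collects the class's own attributes whose names are not enclosed in double underscores."""
    return OrderedDict(
        (name, value)
        for name, value in vars(cls).items()
        if not (name.startswith("__") and name.endswith("__"))
    )


class ExampleProcessor:  # hypothetical stand-in, not a CoRelAy class
    n_clusters = 2
    metric = "euclidean"
    __private_marker__ = "ignored"


tracked = collect_public_attributes(ExampleProcessor)
```

Note that this sketch only inspects attributes defined directly on the class, in definition order; inherited attributes would need to be merged separately.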
- class corelay.processor.clustering.KMeans[source]
Bases: Clustering
A clustering Processor that performs the k-Means clustering algorithm.
- Parameters:
is_output (bool) – A value indicating whether this KMeans clustering processor is the output of a Pipeline. Defaults to False.
is_checkpoint (bool | None) – A value indicating whether check-pointed pipeline computations should start at this point, if there exists a previously computed checkpoint value. Defaults to False.
io (Storable | None) – An IO object that is used to cache intermediate results of the Pipeline, which can then be re-used in this run or in subsequent runs of the Pipeline. Defaults to an instance of NoStorage.
kwargs (dict[str, Any]) – Additional keyword arguments for the k-Means clustering algorithm. Defaults to an empty dict.
n_clusters (int) – The number of clusters to form. Defaults to 2.
index (tuple[int | slice]) – The indices of the data to be clustered. Defaults to an empty slice.
- index: Annotated[SupportsIndex | tuple[SupportsIndex, ...], Param]
The indices of the data to be clustered. Defaults to an empty slice.
- __tracked__: collections.OrderedDict[str, Any]
A collections.OrderedDict with all public class attributes, i.e., all class attributes not enclosed in double underscores.
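For orientation, k-Means alternates between assigning each point to its nearest centroid and recomputing each centroid as the mean of its assigned points. The following pure-Python sketch illustrates that idea with `n_clusters=2`; it is not the implementation used by the KMeans processor, and the function name `kmeans_sketch`, the deterministic initialization from the first points, and the fixed iteration count are assumptions made for this example.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def kmeans_sketch(
    points: Sequence[Point], n_clusters: int = 2, n_iterations: int = 10
) -> Tuple[List[int], List[Point]]:
    """Illustrative Lloyd iteration; initializes centroids with the first n_clusters points."""
    centroids = [tuple(point) for point in points[:n_clusters]]
    labels = [0] * len(points)
    for _ in range(n_iterations):
        # Assignment step: label each point with the index of its nearest centroid.
        labels = [
            min(range(n_clusters), key=lambda k: math.dist(point, centroids[k]))
            for point in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(n_clusters):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                centroids[k] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


data = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0), (1.0, 0.0), (11.0, 10.0)]
labels, centroids = kmeans_sketch(data, n_clusters=2)
```

Production implementations typically use randomized or k-means++ initialization and a convergence check rather than a fixed number of iterations.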
- class corelay.processor.clustering.HDBSCAN[source]
Bases: Clustering
A clustering Processor that performs the HDBSCAN clustering algorithm.
- Parameters:
is_output (bool) – A value indicating whether this HDBSCAN clustering processor is the output of a Pipeline. Defaults to False.
is_checkpoint (bool | None) – A value indicating whether check-pointed pipeline computations should start at this point, if there exists a previously computed checkpoint value. Defaults to False.
io (Storable | None) – An IO object that is used to cache intermediate results of the Pipeline, which can then be re-used in this run or in subsequent runs of the Pipeline. Defaults to an instance of NoStorage.
kwargs (dict[str, Any]) – Additional keyword arguments for the HDBSCAN clustering algorithm. Defaults to an empty dict.
n_clusters (int) – The number of clusters to form. Defaults to 2.
metric (str) – The distance metric to use. This can be one of “euclidean”, “l2” (equivalent to “euclidean”), “minkowski”, “p” (equivalent to “minkowski”), “manhattan”, “cityblock” (equivalent to “manhattan”), “l1” (equivalent to “manhattan”), “chebyshev”, “infinity” (equivalent to “chebyshev”), “seuclidean”, “mahalanobis”, “wminkowski”, “hamming”, “canberra”, “braycurtis”, “matching”, “jaccard”, “dice”, “kulsinski”, “rogerstanimoto”, “russellrao”, “sokalmichener”, “sokalsneath”, “haversine”, “cosine”, and “arccos”. Because the cosine distance is not a true distance measure, it is not supported directly; specifying “cosine” uses the “arccos” distance instead. Defaults to “euclidean”.
Notes
GitHub repository including documentation for HDBSCAN: https://github.com/scikit-learn-contrib/hdbscan.
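The remark that the cosine distance is not a true distance measure can be checked numerically: it can violate the triangle inequality, whereas the angle between two vectors (the arccosine of their cosine similarity) does not. The sketch below is a pure-Python illustration; the function names are chosen for this example, and the exact definition or normalization of the “arccos” metric inside the HDBSCAN library is not taken from this document.

```python
import math
from typing import Sequence


def cosine_similarity(x: Sequence[float], y: Sequence[float]) -> float:
    """Cosine of the angle between x and y, clamped to [-1, 1] against rounding errors."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return max(-1.0, min(1.0, dot / norms))


def cosine_distance(x: Sequence[float], y: Sequence[float]) -> float:
    return 1.0 - cosine_similarity(x, y)


def angular_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Angle between x and y in radians."""
    return math.acos(cosine_similarity(x, y))


# Three vectors at 0, 45, and 90 degrees.
a, b, c = (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)

# Cosine distance: going directly from a to c is "longer" than the detour via b.
cosine_violates_triangle = cosine_distance(a, c) > cosine_distance(a, b) + cosine_distance(b, c)

# Angular distance: the triangle inequality holds (up to floating-point tolerance).
angular_satisfies_triangle = (
    angular_distance(a, c) <= angular_distance(a, b) + angular_distance(b, c) + 1e-9
)
```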
- function(data: Any) → Any[source]
Performs the HDBSCAN clustering algorithm on the data.
- Parameters:
data (Any) – The data to be clustered. The data should be a NumPy array of shape (number_of_samples, number_of_features) or a sparse matrix.
- Returns:
Returns a NumPy array of shape (number_of_samples,) that contains, for each data point, the label of the cluster it was assigned to.
- Return type:
- __tracked__: collections.OrderedDict[str, Any]
A collections.OrderedDict with all public class attributes, i.e., all class attributes not enclosed in double underscores.
- class corelay.processor.clustering.DBSCAN[source]
Bases: Clustering
A clustering Processor that performs the DBSCAN clustering algorithm.
- Parameters:
is_output (bool) – A value indicating whether this DBSCAN clustering processor is the output of a Pipeline. Defaults to False.
is_checkpoint (bool | None) – A value indicating whether check-pointed pipeline computations should start at this point, if there exists a previously computed checkpoint value. Defaults to False.
io (Storable | None) – An IO object that is used to cache intermediate results of the Pipeline, which can then be re-used in this run or in subsequent runs of the Pipeline. Defaults to an instance of NoStorage.
kwargs (dict[str, Any]) – Additional keyword arguments for the DBSCAN clustering algorithm. Defaults to an empty dict.
metric (str) – The distance metric to use. Defaults to “euclidean”.
eps (float) – The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function. Defaults to 0.5.
min_samples (int) – The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself. If min_samples is set to a higher value, DBSCAN will find denser clusters, whereas if it is set to a lower value, the found clusters will be more sparse. Defaults to 5.
- metric: Annotated[str, Param]
The distance metric to use. Can be one of
“cityblock”
“cosine”
“euclidean”
“haversine”
“l1”
“l2”
“manhattan”
“nan_euclidean”
“precomputed”.
Defaults to “euclidean”.
- eps: Annotated[float, Param]
The maximum distance between two samples for one to be considered as in the neighborhood of the other. This is not a maximum bound on the distances of points within a cluster. This is the most important DBSCAN parameter to choose appropriately for your data set and distance function. Defaults to 0.5.
- min_samples: Annotated[int, Param]
The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself. If DBSCAN.min_samples is set to a higher value, DBSCAN will find denser clusters, whereas if it is set to a lower value, the found clusters will be more sparse. Defaults to 5.
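The interplay of eps and min_samples can be illustrated with a small pure-Python sketch of the core-point test: a point is a core point if at least min_samples points, including itself, lie within distance eps of it. The function name `find_core_points` is an assumption for this example, only the Euclidean metric is covered, and the sketch does not perform the subsequent cluster expansion that DBSCAN builds on top of this test.

```python
import math
from typing import List, Sequence, Tuple


def find_core_points(
    points: Sequence[Tuple[float, ...]], eps: float = 0.5, min_samples: int = 5
) -> List[int]:
    """Returns the indices of all points that have at least min_samples neighbors within eps."""
    core_indices = []
    for index, point in enumerate(points):
        # The neighborhood includes the point itself (distance 0 <= eps).
        neighbor_count = sum(1 for other in points if math.dist(point, other) <= eps)
        if neighbor_count >= min_samples:
            core_indices.append(index)
    return core_indices


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]

# With min_samples=4, the four nearby points are core points; the outlier is not.
core_loose = find_core_points(points, eps=0.5, min_samples=4)

# Raising min_samples to 5 demands a denser neighborhood, so no core points remain.
core_strict = find_core_points(points, eps=0.5, min_samples=5)
```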
- function(data: Any) → Any[source]
Performs the DBSCAN clustering algorithm on the data.
- Parameters:
data (Any) – The data to be clustered. The data should be a NumPy array of shape (number_of_samples, number_of_features) or a sparse matrix.
- Returns:
Returns a NumPy array of shape (number_of_samples,) that contains, for each data point, the label of the cluster it was assigned to. Noisy points are labeled -1.
- Return type:
- __tracked__: collections.OrderedDict[str, Any]
A collections.OrderedDict with all public class attributes, i.e., all class attributes not enclosed in double underscores.
- class corelay.processor.clustering.AgglomerativeClustering[source]
Bases: Clustering
A clustering Processor that performs the Agglomerative Clustering algorithm.
- Parameters:
is_output (bool) – A value indicating whether this AgglomerativeClustering clustering processor is the output of a Pipeline. Defaults to False.
is_checkpoint (bool | None) – A value indicating whether check-pointed pipeline computations should start at this point, if there exists a previously computed checkpoint value. Defaults to False.
io (Storable | None) – An IO object that is used to cache intermediate results of the Pipeline, which can then be re-used in this run or in subsequent runs of the Pipeline. Defaults to an instance of NoStorage.
kwargs (dict[str, Any]) – Additional keyword arguments for the agglomerative clustering algorithm. Defaults to an empty dict.
n_clusters (int) – The number of clusters to form. Defaults to 5.
metric (str) – The distance metric to use. Defaults to “euclidean”.
linkage (str) – The linkage method to use. This determines which distance to use between two newly formed clusters. The algorithm will merge the pairs of clusters that minimize this criterion. Defaults to “ward”.
- __tracked__: collections.OrderedDict[str, Any]
A collections.OrderedDict with all public class attributes, i.e., all class attributes not enclosed in double underscores.
- metric: Annotated[str, Param]
The distance metric to use. Can be one of
“euclidean”
“l1”
“l2”
“manhattan”
“cosine”
“precomputed”.
Defaults to “euclidean”.
- linkage: Annotated[str, Param]
The linkage method to use. This determines which distance to use between two newly formed clusters. The algorithm will merge the pairs of clusters that minimize this criterion. Can be one of
“ward” minimizes the variance of the clusters being merged.
“average” uses the average of the distances between all observations of the two clusters.
“complete” uses the maximum of the distances between all observations of the two clusters.
“single” uses the minimum of the distances between all observations of the two clusters.
Defaults to “ward”.
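To make the difference between the distance-based linkage methods concrete, the following pure-Python sketch computes the “single”, “complete”, and “average” linkage distances between two small clusters using the Euclidean metric. The function names are chosen for this illustration only; “ward” is omitted because it is based on the increase in within-cluster variance rather than on pairwise distances alone.

```python
import math
from itertools import product
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def pairwise_distances(cluster_a: Sequence[Point], cluster_b: Sequence[Point]) -> List[float]:
    """Euclidean distances between every observation in cluster_a and every one in cluster_b."""
    return [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]


def single_linkage(cluster_a: Sequence[Point], cluster_b: Sequence[Point]) -> float:
    return min(pairwise_distances(cluster_a, cluster_b))


def complete_linkage(cluster_a: Sequence[Point], cluster_b: Sequence[Point]) -> float:
    return max(pairwise_distances(cluster_a, cluster_b))


def average_linkage(cluster_a: Sequence[Point], cluster_b: Sequence[Point]) -> float:
    distances = pairwise_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)


cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(3.0, 0.0), (5.0, 0.0)]

# The pairwise distances are 3, 5, 2, and 4.
single = single_linkage(cluster_a, cluster_b)      # closest pair
complete = complete_linkage(cluster_a, cluster_b)  # farthest pair
average = average_linkage(cluster_a, cluster_b)    # mean over all pairs
```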
- function(data: Any) → Any[source]
Performs the Agglomerative Clustering algorithm on the data.
- Parameters:
data (Any) – The data to be clustered. The data should be a NumPy array of shape (number_of_samples, number_of_features) or (number_of_samples, number_of_samples), or a sparse matrix.
- Returns:
Returns a NumPy array of shape (number_of_samples,) that contains, for each data point, the label of the cluster it was assigned to.
- Return type:
- class corelay.processor.clustering.Dendrogram[source]
Bases: Clustering
A clustering Processor that plots the hierarchical clustering as a dendrogram.
- Parameters:
is_output (bool) – A value indicating whether this Dendrogram clustering processor is the output of a Pipeline. Defaults to False.
is_checkpoint (bool | None) – A value indicating whether check-pointed pipeline computations should start at this point, if there exists a previously computed checkpoint value. Defaults to False.
io (Storable | None) – An IO object that is used to cache intermediate results of the Pipeline, which can then be re-used in this run or in subsequent runs of the Pipeline. Defaults to an instance of NoStorage.
kwargs (dict[str, Any]) – Additional keyword arguments for the hierarchical clustering algorithm. Defaults to an empty dict.
output_file (str | io.IOBase) – The path to a file or a file descriptor to save the dendrogram plot to.
metric (str) – The distance metric to use for the clustering. Defaults to “euclidean”.
linkage (str) – The linkage criterion to use. This determines which distance to use between sets of observations. Defaults to “ward”.
- __tracked__: collections.OrderedDict[str, Any]
A collections.OrderedDict with all public class attributes, i.e., all class attributes not enclosed in double underscores.
- output_file: Annotated[str | IOBase, Param]
The path to a file or a file descriptor to save the dendrogram plot to.
- metric: Annotated[str, Param]
The distance metric to use for the clustering. This can be one of “braycurtis”, “canberra”, “chebyshev”, “cityblock”, “correlation”, “cosine”, “dice”, “euclidean”, “hamming”, “jaccard”, “jensenshannon”, “kulczynski1”, “mahalanobis”, “minkowski”, “rogerstanimoto”, “russellrao”, “seuclidean”, “sokalmichener”, “sokalsneath”, “sqeuclidean”, or “yule”. Defaults to “euclidean”.
- linkage: Annotated[str, Param]
The linkage method used by the Dendrogram Processor. The linkage method is used to determine the distance between two newly formed clusters when performing hierarchical clustering. The hierarchical clustering algorithm used by the Dendrogram Processor will merge the pairs of clusters that minimize this criterion. The following linkage methods are supported:
“ward” minimizes the variance of the clusters being merged.
“average” uses the average of the distances between all observations of the two clusters.
“complete” uses the maximum of the distances between all observations of the two clusters.
“single” uses the minimum of the distances between all observations of the two clusters.
“centroid” uses the centroid of the new cluster that would be formed by merging the two clusters.
“median” uses the median of the centroids of the two clusters.
“weighted” assigns the weighted distance between the two original clusters and a third remaining cluster to the new cluster.
Defaults to “ward”.
- function(data: Any) → Any[source]
Performs the hierarchical clustering algorithm on the data and generates a dendrogram plot.
- Parameters:
data (Any) – The data to be clustered. The data should be a NumPy array that contains a condensed distance matrix. A condensed distance matrix is a flat array containing the upper triangle of the distance matrix, excluding the diagonal. Alternatively, an array of shape (number_of_observations, number_of_dimensions) may be passed in.
- Raises:
ValueError – The linkage method is invalid.
- Returns:
Returns the data that was passed in. The data is not modified.
- Return type:
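The condensed distance matrix mentioned for the data parameter can be illustrated with a short pure-Python sketch that lists the upper triangle of the pairwise distance matrix row by row, skipping the diagonal. The function name `condensed_distance_matrix` is an assumption for this example, and only the Euclidean metric is shown.

```python
import math
from typing import List, Sequence, Tuple


def condensed_distance_matrix(points: Sequence[Tuple[float, ...]]) -> List[float]:
    """Flat list of the distances d(i, j) for all i < j, ordered row by row."""
    count = len(points)
    return [
        math.dist(points[i], points[j])
        for i in range(count)
        for j in range(i + 1, count)
    ]


points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]

# Three observations yield 3 * (3 - 1) / 2 = 3 entries: d(0, 1), d(0, 2), d(1, 2).
condensed = condensed_distance_matrix(points)
```

For n observations the condensed form has n * (n - 1) / 2 entries, which is roughly half the storage of the full symmetric matrix.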