Clustering high-dimensional data points is challenging. Standard techniques such as k-means, DBSCAN, HDBSCAN, and agglomerative clustering all suffer from the non-intuitive properties of distance metrics in high dimensions. However, experiments have shown that applying a dimension reduction technique before running a clustering algorithm can substantially improve the results. In particular, using UMAP to reduce dimensionality followed by HDBSCAN to identify clusters performs reasonably well at recovering the underlying clusters, and generally better than using PCA for the dimensionality reduction. The following figure shows results from a study of the Pendigits data set: clustering accuracies of five clustering algorithms on the original high-dimensional points and on low-dimensional points obtained with PCA or UMAP, measured with the adjusted Rand index (ARI) and the adjusted mutual information (AMI).
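For reference, here is a minimal sketch of such a UMAP-then-HDBSCAN pipeline. It uses scikit-learn's small digits data set as a stand-in for Pendigits, and the UMAP/HDBSCAN parameter values are illustrative rather than tuned; the scores are computed with the same ARI and AMI metrics reported in the figure.

```python
# Minimal UMAP -> HDBSCAN sketch (stand-in data set, untuned parameters).
import hdbscan
import umap
from sklearn.datasets import load_digits
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

X, y = load_digits(return_X_y=True)

# Reduce to a low-dimensional embedding suitable for density-based clustering.
embedding = umap.UMAP(n_neighbors=30, min_dist=0.0, n_components=2,
                      random_state=42).fit_transform(X)

# Cluster the embedded points; HDBSCAN labels low-density points as -1 (noise).
labels = hdbscan.HDBSCAN(min_cluster_size=50).fit_predict(embedding)

print("ARI:", adjusted_rand_score(y, labels))
print("AMI:", adjusted_mutual_info_score(y, labels))
```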
Previous exploration work can be found in the `archive` folder.
The new algorithm we are currently testing works as follows:

1. Build a graph whose edge weights estimate the probability that one point is the nearest neighbor of another. This yields a directed graph with probabilities assigned to the edges.
2. Run single-linkage clustering on the resulting graph. This is most easily done by computing a minimum spanning forest and processing it into a forest of merge trees.
3. Condense the merge trees and extract clusters using HDBSCAN-style condensed-tree techniques. Any extraction method could be used, but we went with leaf extraction because it seemed to work better.
4. The graph clustering picks out most of the clusters, but it leaves a great deal of data labelled as noise. We fix that by running a label propagation through the graph. Since we still want the ability to label points as noise, and want to keep the propagation soft, we propagate probability vectors over the possible labels (including a noise label), applying a Bayesian update of each point's distribution at every propagation round and iterating until relative convergence.
See Notebook 00 for the code.
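As a rough illustration of this pipeline (not the implementation in Notebook 00), the sketch below substitutes a k-nearest-neighbour distance graph for the edge-probability graph, lets HDBSCAN's leaf extraction stand in for the minimum-spanning-forest and condensed-tree steps, and then runs a soft, Bayesian-style label propagation with an explicit noise label.

```python
# Simplified sketch of the graph-clustering-plus-label-propagation idea.
# All parameter values and the k-NN / HDBSCAN substitutions are assumptions.
import numpy as np
import hdbscan
from sklearn.neighbors import kneighbors_graph

def cluster_sketch(X, n_neighbors=15, min_cluster_size=25, n_rounds=20):
    # 1. Build a k-NN distance graph (stand-in for the directed graph with
    #    nearest-neighbour probabilities on its edges).
    knn = kneighbors_graph(X, n_neighbors=n_neighbors, mode="distance")

    # 2-3. Seed clusters with HDBSCAN leaf extraction, standing in for the
    #      minimum-spanning-forest / condensed-tree steps of the algorithm.
    seeds = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size,
                            cluster_selection_method="leaf").fit_predict(X)
    n_clusters = seeds.max() + 1

    # Probability vectors over [cluster_0, ..., cluster_{k-1}, noise]:
    # uninformative for unlabelled points, one-hot for seeded points.
    probs = np.full((X.shape[0], n_clusters + 1), 1.0 / (n_clusters + 1))
    for i, label in enumerate(seeds):
        if label >= 0:
            probs[i] = 0.0
            probs[i, label] = 1.0

    # 4. Soft label propagation: multiply each point's distribution by the
    #    evidence gathered from its graph neighbours (a Bayesian-style update),
    #    then renormalise, for a fixed number of rounds.
    adjacency = ((knn + knn.T) > 0).astype(float)
    for _ in range(n_rounds):
        evidence = adjacency @ probs
        probs = probs * (evidence + 1e-12)
        probs /= probs.sum(axis=1, keepdims=True)

    labels = probs.argmax(axis=1)
    labels[labels == n_clusters] = -1   # last column is the noise label
    return labels
```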
One very nice aspect of the proposed algorithm is that it adapts naturally to new, unseen data points. Given a data partition, possibly with noise, and a set of new points, we can predict to which of the existing clusters the new points should be assigned, including a possible noise assignment. We currently have two versions of this predict function, but neither has been tested extensively so far. (The code is part of the 00 notebook.)
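Purely as an illustration (this is not either of the two predict functions in the notebook), one way such a predict step could work is a distance-weighted vote over a new point's nearest already-clustered neighbours, falling back to a noise label when the vote is not decisive. The sketch assumes `labels_train` is a NumPy array of cluster labels with -1 marking noise.

```python
# Hypothetical predict sketch: soft vote over nearest clustered neighbours,
# keeping -1 (noise) as an option when the vote is too weak.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def predict_sketch(X_train, labels_train, X_new, n_neighbors=15,
                   min_confidence=0.5):
    nn = NearestNeighbors(n_neighbors=n_neighbors).fit(X_train)
    distances, indices = nn.kneighbors(X_new)

    predictions = np.full(X_new.shape[0], -1, dtype=int)
    for i, (dist, idx) in enumerate(zip(distances, indices)):
        neighbor_labels = labels_train[idx]
        weights = 1.0 / (dist + 1e-12)          # closer neighbours count more
        candidates = np.unique(neighbor_labels[neighbor_labels >= 0])
        if candidates.size == 0:
            continue                            # all neighbours are noise
        scores = np.array([weights[neighbor_labels == c].sum()
                           for c in candidates])
        best = scores.argmax()
        # Only assign the winning cluster if it carries a clear majority of
        # the weight; otherwise the new point stays labelled as noise (-1).
        if scores[best] / weights.sum() >= min_confidence:
            predictions[i] = candidates[best]
    return predictions
```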
When investigating the predict function, we have encountered stability issues. Under random samples containing 90% of the 70,000 MNIST data points, the adjusted Rand index of our clusterings varies from 0.85 to 0.92. The scores fall into two groups: clusterings that yielded 10 clusters (matching the ground truth) and clusterings that yielded 11 clusters. More remains to be learned from this experiment; the relevant notebooks are listed below, followed by a sketch of the subsampling procedure.
- StabilitySamplingExperiments_HighDClustering
- StabitilySamplingResults_0.9_MNIST
- StabilitySamplingResults_decisionBoundary
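The sketch below outlines the subsampling procedure behind these experiments, using a UMAP-plus-HDBSCAN stand-in for the algorithm under test (the actual experiments use the new graph-based algorithm); the MNIST loading call, parameter values, and number of runs are illustrative assumptions.

```python
# Stability sketch: cluster repeated 90% samples of MNIST and record the
# ARI against ground truth and the number of clusters found in each run.
import numpy as np
import hdbscan
import umap
from sklearn.datasets import fetch_openml
from sklearn.metrics import adjusted_rand_score

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
rng = np.random.default_rng(0)

results = []
for run in range(10):
    sample = rng.choice(X.shape[0], size=int(0.9 * X.shape[0]), replace=False)
    embedding = umap.UMAP(n_neighbors=30, min_dist=0.0).fit_transform(X[sample])
    labels = hdbscan.HDBSCAN(min_cluster_size=500).fit_predict(embedding)
    ari = adjusted_rand_score(y[sample], labels)
    n_clusters = labels.max() + 1
    results.append((run, n_clusters, ari))
    print(f"run {run}: {n_clusters} clusters, ARI = {ari:.3f}")
```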
The proposed algorithm has a large number of parameters. These parameters are not always intuitive to set and are far from independent. Can the parameter space be transformed into a simpler, more intuitive one?
The min-cluster-size parameter is used to build the condensed tree that in turn seeds the label propagation step of the algorithm. Because the label propagation grows the seeded clusters, the min-cluster-size given as a parameter can be much smaller than the size of the smallest cluster in the final clustering. With the Japanese character Kuzushiji-MNIST data set, we observe the following problem: if min-cluster-size is not small enough, some of the smaller clusters are simply labelled as noise; if we make min-cluster-size small enough to fix that, we introduce a new problem, namely that larger clusters get split into smaller ones.
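A small helper like the following can be used to explore that trade-off. It assumes `X` and `y` hold the Kuzushiji-MNIST features and labels (loaded elsewhere) and uses HDBSCAN leaf extraction as a stand-in for the condensed-tree seeding step; the candidate sizes are illustrative.

```python
# Scan several min-cluster-size values and report cluster count and noise
# fraction, to see where small classes vanish or large classes fragment.
import numpy as np
import hdbscan

def scan_min_cluster_size(X, y, sizes=(10, 25, 50, 100, 250)):
    for size in sizes:
        labels = hdbscan.HDBSCAN(min_cluster_size=size,
                                 cluster_selection_method="leaf").fit_predict(X)
        n_clusters = labels.max() + 1
        noise_fraction = np.mean(labels == -1)
        # Too large a size: small classes get absorbed into noise.
        # Too small a size: large classes split into several clusters.
        print(f"min_cluster_size={size}: {n_clusters} clusters, "
              f"{noise_fraction:.1%} noise "
              f"(ground truth has {len(np.unique(y))} classes)")
```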
The nearest-neighbor probability models constructed at each node are obtained via an iterative process: each is first assigned a prior, which is then updated using the neighbors' models. The models are Gaussian, so at each step, we update a
This git repository was built using the Easydata framework, which aims to make your data science workflow reproducible.
- Make
- conda >= 4.8 (via Anaconda or Miniconda)
- Git
If you haven't yet done so, we recommend following the instructions in Setting up git and Checking Out the Repo in order to check out the code and set up your remote branches.
- Make note of the path to your conda binary:
  ```bash
  $ which conda
  ~/miniconda3/bin/conda
  ```
- Ensure your `CONDA_EXE` environment variable is set to this value (or edit `Makefile.include` directly):
  ```bash
  export CONDA_EXE=~/miniconda3/bin/conda
  ```
- Create and switch to the virtual environment:
  ```bash
  cd HighDimensionalClustering
  make create_environment
  conda activate HighDimensionalClustering
  ```
Now you're ready to run `jupyter notebook` (or `jupyterlab`) and explore the notebooks in the `notebooks` directory.
For more instructions on setting up and maintaining your environment (including how to point your environment at your custom forks and work in progress) see Setting up and Maintaining your Conda Environment Reproducibly.
- `LICENSE`
- `Makefile` - Top-level makefile. Type `make` for a list of valid commands.
- `Makefile.include` - Global includes for makefile routines. Included by `Makefile`.
- `Makefile.env` - Command for maintaining reproducible conda environment. Included by `Makefile`.
- `README.md` - this file
- `catalog` - Data catalog. This is where config information such as data sources and data transformations are saved.
  - `catalog/config.ini` - Local Data Store. This configuration file is for local data only, and is never checked into the repo.
- `data` - Data directory. Often symlinked to a filesystem with lots of space.
  - `data/raw` - Raw (immutable) hash-verified downloads.
  - `data/interim` - Extracted and interim data representations.
    - `data/interim/cache` - Dataset cache.
  - `data/processed` - The final, canonical data sets ready for analysis.
- `docs` - Sphinx-format documentation files for this project.
  - `docs/Makefile` - Makefile for generating HTML/LaTeX/other formats from Sphinx-format documentation.
- `notebooks` - Jupyter notebooks. Naming convention is a number (for ordering), the creator's initials, and a short `-`-delimited description, e.g. `1.0-jqp-initial-data-exploration`.
- `reference` - Data dictionaries, documentation, manuals, scripts, papers, or other explanatory materials.
  - `reference/easydata` - Easydata framework and workflow documentation.
  - `reference/templates` - Templates and code snippets for Jupyter.
  - `reference/dataset` - Resources related to datasets; e.g. dataset creation notebooks and scripts.
- `reports` - Generated analysis as HTML, PDF, LaTeX, etc.
  - `reports/figures` - Generated graphics and figures to be used in reporting.
- `environment.yml` - The user-readable YAML file for reproducing the conda/pip environment.
- `environment.(platform).lock.yml` - Resolved versions, result of processing `environment.yml`.
- `setup.py` - Turns contents of `src` into a pip-installable python module (`pip install -e .`) so it can be imported in python code.
- `src` - Source code for use in this project.
  - `src/__init__.py` - Makes `src` a Python module.
  - `src/data` - Scripts to fetch or generate data.
  - `src/analysis` - Scripts to turn datasets into output products.
This project was built using Easydata, a Python framework aimed at making your data science workflow reproducible.