
Constrained Hierarchical Agglomerative Clustering


Background

Constrained HAC is hierarchical agglomerative clustering in which each observation is associated with a position, and the clustering is constrained so that only adjacent clusters can be merged. These methods are useful in various application fields, including ecology (Quaternary data) and bioinformatics (e.g., in Genome-Wide Association Studies (GWAS), see https://doi.org/10.1186/s12859-015-0556-6). However, the complexity of constrained HAC is intrinsically quadratic in the number p of items to cluster. In common situations where p is of the order of 10^5-10^6, this algorithm is not applicable.

In several applications, including GWAS, the similarity between two items is a decreasing function of their physical distance, and it is reasonable to assume that the similarity between two items is close to 0 when they are more than a constant h apart. Under this assumption, it is possible to propose an algorithm whose complexity in p is quasi-linear in time (O(p(log(p)+h))) and linear in space (O(ph)). This algorithm is described in this PhD thesis (chapter 3) for the Ward criterion. It takes as input a band similarity matrix of size ph, and combines pre-calculation of certain partial sums of similarities with efficient storage of the successive merges in a binary heap. A beta package implementing this algorithm is already available at https://github.com/pneuvial/adjclust, with the Ward criterion, and produces the dendrogram associated with the constrained HAC (the positions of the rows in the similarity matrix are the positions used for the adjacency constraints).
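
As an illustration of the band input format mentioned above, the snippet below sketches one way to build a p × h band from a full similarity matrix. The function name `extractBand` and the exact storage convention are illustrative only, not the package's API.

```r
## Illustrative sketch (not the package API): store, for each row i of a
## p x p similarity matrix S, the similarities to its h right neighbors,
## i.e. band[i, k] = S[i, i + k] (NA where i + k exceeds p).
extractBand <- function(S, h) {
  p <- nrow(S)
  band <- matrix(NA_real_, nrow = p, ncol = h)
  for (i in seq_len(p)) {
    jmax <- min(p, i + h)
    if (jmax > i) band[i, seq_len(jmax - i)] <- S[i, (i + 1):jmax]
  }
  band
}
```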

This project aims to turn this beta package into a package that can be submitted to CRAN.

Related works

The R package rioja implements a similar algorithm (it takes dist objects as input and uses either the Ward criterion or the single linkage criterion). This algorithm does not scale well with p, as it is quadratic in both time and space.

Details of your coding project

The package needs:

  • proper documentation
    • following the CRAN policies
  • a user-friendly interface
    • in the current version, only a part of the matrix, located in the neighborhood of the diagonal, is passed to the function; this format is not standard, and a function to build these data automatically is needed. This function should accept similarity matrices given as standard matrices, as sparse matrix objects, or as dist objects, and should output a standard hclust object (see the sketch after this list)
  • more merging criteria
    • further merging criteria (single linkage, complete linkage, …) should be added to the package
  • tests
    • small tests are needed to check the equivalence between the similarity and dissimilarity cases (in the simple Euclidean case) and the correct behavior of the algorithm (i.e., increasing merging heights)
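
A hedged sketch of the conversion helper described in the "user-friendly interface" item above is given below; the function name `toBand` is hypothetical, and the dissimilarity-to-similarity step is only a placeholder, not the conversion the package would actually use.

```r
## Hedged sketch (hypothetical names, not the current package API): accept a
## standard matrix, a sparse 'Matrix' object, or a 'dist' object, and build the
## band representation expected by the low-level clustering routine.
library(Matrix)  # only needed for sparse inputs

toBand <- function(x, h) {
  S <- if (inherits(x, "dist")) {
    ## placeholder conversion from dissimilarity to similarity; the package
    ## should use a proper (e.g. Euclidean/kernel-based) conversion here
    -as.matrix(x)
  } else {
    as.matrix(x)  # works for both base and sparse matrices
  }
  extractBand(S, h)  # band format as in the earlier sketch
}
## A full wrapper would then pass this band to the low-level routine
## (e.g. adjclustBand) and reshape its result into an object of class 'hclust'
## (components 'merge', 'height', 'order', 'labels').
```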

Expected impact

The issue addressed by this method is a very standard problem, especially in bioinformatics. In this field, datasets are usually very large (many thousands of items), and an efficient constrained HAC package with all linkage options would have many applications (GWAS, Hi-C analysis, …).

Mentors

Please get in touch with Pierre Neuvial and Nathalie Villa-Vialaneix for this project.

Tests

Several tests that potential students can do to demonstrate their capabilities for this particular project:

  • Easy: use the R package rioja to obtain a constrained HAC (with the function chclust), using both implemented “method” values, on this dataset (which is a dissimilarity matrix). Provide the script and output, and explain possible problems (the dissimilarity matrix is not Euclidean). A minimal starting point is sketched after this list.
  • Medium: fork our github repository and compile the package. The package includes a vignette describing how to use the low-level function ‘HeapHop’. Its input is a vector containing a band (of width h) centered on the diagonal of a similarity matrix.
    • using this dataset (a Euclidean dissimilarity matrix), create the similarity matrix using the last equation on page 6 of this article. Then extract the band centered on the diagonal with width 5 and use the function adjclustBand to obtain the clustering of this dataset. Provide your entire script (from the similarity construction to the use of the ‘HeapHop’ function) and your outputs (RMarkdown or similar preferred); a rough outline is sketched after this list
    • compare the results obtained with ‘HeapHop’ to those obtained with ‘rioja’ on this dissimilarity matrix, varying the band width in ‘HeapHop’
    • implement a basic R function that takes a similarity matrix and a value for h and calls ‘HeapHop’ to obtain an ‘hclust’ object
  • Hard: the package uses C++ scripts.
    • Have a look at the src directory and describe what the different functions are used for and how they relate to each other (a textual description with a single line per function should be enough, plus a dependency graph for instance)
    • Modify the C++ scripts in the src directory to implement the merging of clusters by the single linkage criterion (the Ward criterion is the one already implemented). Post a script with the functions that are modified (replace the current functions with the ones implementing the single linkage criterion; do not try to add an option for choosing between the two merging criteria)
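
For the easy test, a minimal script along the following lines could serve as a starting point; the file name is a placeholder for the dataset linked above, and the two “method” values shown are, to the best of our knowledge, the ones implemented in chclust.

```r
## Minimal sketch for the easy test ('diss.csv' is a placeholder file name).
library(rioja)

D <- as.dist(read.csv("diss.csv", row.names = 1))  # provided dissimilarity matrix
fit_coniss   <- chclust(D, method = "coniss")      # constrained incremental sum of squares (Ward-like)
fit_conslink <- chclust(D, method = "conslink")    # constrained single linkage
plot(fit_coniss)
plot(fit_conslink)
```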
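For the medium test, the outline below sketches the expected structure of the script. The similarity construction shown here is the classical MDS double-centering formula, used only as a placeholder for the equation referenced in the article, and the final call is commented out because the exact interface of ‘adjclustBand’/‘HeapHop’ should be taken from the package vignette.

```r
## Hedged outline for the medium test (placeholder formula and names).
D <- as.matrix(mydist)   # 'mydist': the provided Euclidean dissimilarity (a 'dist' object)
p <- nrow(D)

## placeholder similarity: double centering of squared dissimilarities
## (classical MDS); replace with the equation referenced in the article
J <- diag(p) - matrix(1 / p, p, p)
S <- -0.5 * J %*% (D^2) %*% J

band <- extractBand(S, h = 5)   # band of width 5, as in the earlier sketch

## indicative only: check the vignette for the actual interface before running
## fit <- adjclustBand(band, h = 5)
## plot(fit)
```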

Solutions of tests

Students, please post a link to your test results here.
