-
Notifications
You must be signed in to change notification settings - Fork 124
Add HDBSCAN from HorseML.jl #273
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
MommaWatasu
wants to merge
47
commits into
JuliaStats:master
Choose a base branch
from
MommaWatasu:master
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+391
−0
Open
Changes from 8 commits
Commits
Show all changes
47 commits
Select commit
Hold shift + click to select a range
23bbed1
[add function or file]add hdbscan
MommaWatasu fa44398
[test]add test for hdbscan
MommaWatasu 4e64fdf
[fix]change Int64 to Int
MommaWatasu e294565
[fix]change all Int64 into Int
MommaWatasu b851997
[change]change usage and remove extra code
MommaWatasu 7822a7c
[test]update test
MommaWatasu 8901cfe
[docs]add docs for HDBSCAN
MommaWatasu 6de5d02
[fix]export HdbscanCluster
MommaWatasu 85c5644
[docs]merge docs of HDBSCAN with DBSCAN.md
MommaWatasu edcc70a
[clean]refactoring HDBSCANGraph
MommaWatasu 939ce65
[docs]fix docs
MommaWatasu 2f67d07
[clean]refactoring
MommaWatasu 61463f2
[clean]refactoring
MommaWatasu 09ed174
[test]update test
MommaWatasu 6039c0c
[fix]change isnothing into ===
MommaWatasu 3cc7689
[fix]remove println
MommaWatasu 1798148
[docs]update docs
MommaWatasu a0a819e
[docs]fix docs
MommaWatasu bf38eb6
[fix]fix docstring
MommaWatasu 380acf1
[fix]add isnoise to the list of exprted function
MommaWatasu 6fdebe6
core_distances() tweaks
alyst ff98806
Hdbscan: simplify results generation
alyst 64a202a
remove heappush!: unused
alyst 661edbc
hdbscan test: small tweaks
alyst 082c5e6
fixup hdbscan assignments
alyst c9db368
hdbscan: further opt core_dists
alyst c205575
hdbscan: optimize edges generation
alyst bdf1aec
HDBSCANGraph -> HdbscanGraph
alyst 9aa9841
HdbscanEdge
alyst c9b9374
HdbscanGraph: rename edges to adj_edges
alyst 092ac40
MSTEdge remove no-op expand() method
alyst c972caa
refactor HdbscanMSTEdge
alyst 57b05e6
hdbscan: fix graph vertices, remove add_edges!
alyst f616795
hdbscan_minspantree(): refactor
alyst 0f6e992
prune_clusters!(): cleanup
alyst e01609b
hdbscan: fix 1.0 compat
alyst de6b83a
[docs]fix docstring
MommaWatasu 38212ca
[clean]rename and refactoring
MommaWatasu 159858d
[test]add test for unionfind
MommaWatasu a03c224
hdbscan_minspantree: fix edges sorting
alyst 3a577ea
hdbscan_clusters(): fix cost type
alyst 744af22
hdbscan_clusters: improve MST iteration
alyst 8803573
Merge branch 'master' of https://github.com/MommaWatasu/Clustering.jl
MommaWatasu b94de25
[clean]rename the result structure
MommaWatasu 7078223
[hotfix]apply hot fix
MommaWatasu b0791d9
[test]add test about min_size
MommaWatasu dc5dd40
[add function or file]support ClusteringResult
MommaWatasu File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -16,6 +16,7 @@ makedocs( | |
"mcl.md", | ||
"affprop.md", | ||
"dbscan.md", | ||
"hdbscan.md", | ||
"fuzzycmeans.md", | ||
], | ||
"validate.md", | ||
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
# HDBSCAN | ||
|
||
Hierarchical Density-based Spatial Clustering of Applications with Noise(HDBSCAN) is similar to DBSCAN but uses hierarchical tree to find clusters. | ||
|
||
## Interface | ||
The implementation of *HDBSCAN* algorithm is provided by [`hdbscan`](@ref) function | ||
```@docs | ||
hdbscan | ||
HdbscanResult | ||
HdbscanCluster | ||
``` |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,296 @@ | ||
struct HDBSCANGraph | ||
edges::Vector{Vector{Tuple{Int, Float64}}} | ||
MommaWatasu marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
HDBSCANGraph(n) = new([Tuple{Int, Float64}[] for _ in 1 : n]) | ||
MommaWatasu marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
end | ||
|
||
function add_edge(G::HDBSCANGraph, v::Tuple{Int, Int, Float64}) | ||
i, j, c = v | ||
push!(G.edges[i], (j, c)) | ||
push!(G.edges[j], (i, c)) | ||
end | ||
MommaWatasu marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
|
||
Base.getindex(G::HDBSCANGraph, i::Int) = G.edges[i] | ||
|
||
MommaWatasu marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
""" | ||
HdbscanCluster(..., points, stability, ...) | ||
- `points`: vector of points which belongs to the cluster | ||
- `stability`: stablity of the cluster(-1 for noise clusters) | ||
MommaWatasu marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
|
||
You can use `length` to know the number of pooints in the cluster. And `isnoise` funciton is also available to know whether the cluster is noise or not. | ||
MommaWatasu marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
""" | ||
mutable struct HdbscanCluster | ||
parent::Int | ||
children::Vector{Int} | ||
points::Vector{Int} | ||
λp::Vector{Float64} | ||
stability::Float64 | ||
children_stability::Float64 | ||
function HdbscanCluster(noise::Bool, points::Vector{Int}) | ||
if noise | ||
return new(0, [], [], [], -1, -1) | ||
else | ||
return new(0, [], points, [], 0, 0) | ||
end | ||
end | ||
MommaWatasu marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
end | ||
|
||
Base.length(c::HdbscanCluster) = size(c.points, 1) | ||
isnoise(c::HdbscanCluster) = c.stability == -1 | ||
hasstability(c::HdbscanCluster) = c.stability != 0 | ||
MommaWatasu marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
function compute_stability(c::HdbscanCluster, λbirth) | ||
c.stability += sum(c.λp.-λbirth) | ||
end | ||
MommaWatasu marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
|
||
""" | ||
HdbscanResult(k, minpts, clusters) | ||
- `k`: we will define "core distance of point A" as the distance between point A and the `k` th neighbor point of point A. | ||
- `min_cluster_size`: minimum number of points in the cluster | ||
- `clusters`: result vector of clusters | ||
""" | ||
MommaWatasu marked this conversation as resolved.
Show resolved
Hide resolved
|
||
mutable struct HdbscanResult | ||
k::Int | ||
min_cluster_size::Int | ||
clusters::Vector{HdbscanCluster} | ||
end | ||
MommaWatasu marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
|
||
""" | ||
hdbscan(points::AbstractMatrix, k::Int, min_cluster_size::Int; gen_mst::Bool=true, mst=nothing) | ||
Density-Based Clustering Based on Hierarchical Density Estimates. | ||
This algorithm performs clustering as follows. | ||
1. generate a minimum spanning tree | ||
2. build a HDBSCAN hierarchy | ||
3. extract the target cluster | ||
4. generate the list of cluster assignment for each point | ||
The detail is so complex it is difficult to explain the detail in here. But, if you want to know more about this algorithm, you should read [this docs](https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html). | ||
|
||
# Parameters | ||
- `points`: the d×n matrix, where each column is a d-dimensional coordinate of a point | ||
- `k`: we will define "core distance of point A" as the distance between point A and the `k` th neighbor point of point A. | ||
- `min_cluster_size`: minimum number of points in the cluster | ||
- `gen_mst`: whether to generate minimum-spannig-tree or not | ||
- `mst`: when is specified and `gen_mst` is false, new mst won't be generated | ||
""" | ||
MommaWatasu marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
function hdbscan(points::AbstractMatrix, k::Int, min_cluster_size::Int; gen_mst::Bool=true, mst=nothing) | ||
if min_cluster_size < 1 | ||
throw(DomainError(min_cluster_size, "The `min_cluster_size` must be greater than or equal to 1")) | ||
end | ||
n = size(points, 1) | ||
if gen_mst | ||
#calculate core distances for each point | ||
core_dists = core_dist(points, k) | ||
#calculate mutual reachability distance between any two points | ||
mrd = mutual_reachability_distance(core_dists, points) | ||
#compute a minimum spanning tree by prim method | ||
mst = prim(mrd, n) | ||
elseif mst == nothing | ||
throw(ArgumentError("if you set `gen_mst` to false, you must pass a minimum spanning tree as `mst`")) | ||
end | ||
#build a HDBSCAN hierarchy | ||
hierarchy = build_hierarchy(mst, min_cluster_size) | ||
#extract the target cluster | ||
extract_cluster!(hierarchy) | ||
#generate the list of cluster assignment for each point | ||
result = HdbscanCluster[] | ||
noise_points = fill(-1, n) | ||
for (i, j) in enumerate(hierarchy[2n-1].children) | ||
c = hierarchy[j] | ||
push!(result, c) | ||
for k in c.points | ||
noise_points[k] = 0 | ||
end | ||
end | ||
push!(result, HdbscanCluster(true, Int[])) | ||
result[end].points = findall(x->x==-1, noise_points) | ||
return HdbscanResult(k, min_cluster_size, result) | ||
end | ||
|
||
function core_dist(points, k) | ||
MommaWatasu marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
core_dists = Array{Float64}(undef, size(points, 1)) | ||
for i in 1 : size(points, 1) | ||
p = points[i:i, :] | ||
dists = vec(sum((@. (points - p)^2), dims=2)) | ||
sort!(dists) | ||
core_dists[i] = dists[k] | ||
end | ||
return core_dists | ||
end | ||
|
||
function mutual_reachability_distance(core_dists, points) | ||
MommaWatasu marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
n = size(points, 1) | ||
graph = HDBSCANGraph(div(n * (n-1), 2)) | ||
for i in 1 : n-1 | ||
for j in i+1 : n | ||
c = max(core_dists[i], core_dists[j], sum((points[i, :]-points[j, :]).^2)) | ||
MommaWatasu marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
add_edge(graph, (i, j, c)) | ||
end | ||
end | ||
return graph | ||
end | ||
|
||
function prim(graph, n) | ||
MommaWatasu marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
minimum_spanning_tree = Array{Tuple{Float64, Int, Int}}(undef, n-1) | ||
MommaWatasu marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
|
||
marked = falses(n) | ||
marked_cnt = 1 | ||
MommaWatasu marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
marked[1] = true | ||
|
||
h = [] | ||
MommaWatasu marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
|
||
for (i, c) in graph[1] | ||
heappush!(h, (c, 1, i)) | ||
MommaWatasu marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
end | ||
|
||
while marked_cnt < n | ||
c, i, j = popfirst!(h) | ||
|
||
marked[j] == true && continue | ||
MommaWatasu marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
minimum_spanning_tree[marked_cnt] = (c, i, j) | ||
marked[j] = true | ||
marked_cnt += 1 | ||
|
||
for (k, c) in graph[j] | ||
marked[k] == true && continue | ||
MommaWatasu marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
heappush!(h, (c, j, k)) | ||
end | ||
end | ||
return minimum_spanning_tree | ||
end | ||
|
||
function build_hierarchy(mst, min_size) | ||
MommaWatasu marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
n = length(mst) + 1 | ||
cost = 0 | ||
uf = UnionFind(n) | ||
Hierarchy = Array{HdbscanCluster}(undef, 2n-1) | ||
if min_size == 1 | ||
for i in 1 : n | ||
Hierarchy[i] = HdbscanCluster(false, [i]) | ||
end | ||
else | ||
for i in 1 : n | ||
Hierarchy[i] = HdbscanCluster(true, Int[]) | ||
end | ||
end | ||
MommaWatasu marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
sort!(mst) | ||
|
||
for i in 1 : n-1 | ||
c, j, k = mst[i] | ||
cost += c | ||
λ = 1 / cost | ||
#child clusters | ||
c1 = group(uf, j) | ||
c2 = group(uf, k) | ||
#reference to the parent cluster | ||
Hierarchy[c1].parent = Hierarchy[c2].parent = n+i | ||
nc1, nc2 = isnoise(Hierarchy[c1]), isnoise(Hierarchy[c2]) | ||
if !(nc1 || nc2) | ||
#compute stability | ||
compute_stability(Hierarchy[c1], λ) | ||
compute_stability(Hierarchy[c2], λ) | ||
#unite cluster | ||
unite!(uf, j, k) | ||
#create parent cluster | ||
points = members(uf, group(uf, j)) | ||
Hierarchy[n+i] = HdbscanCluster(false, points) | ||
MommaWatasu marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
elseif !(nc1 && nc2) | ||
if nc2 == true | ||
(c1, c2) = (c2, c1) | ||
end | ||
#record the lambda value | ||
append!(Hierarchy[c2].λp, fill(λ, length(Hierarchy[c1]))) | ||
#unite cluster | ||
unite!(uf, j, k) | ||
#create parent cluster | ||
points = members(uf, group(uf, j)) | ||
Hierarchy[n+i] = HdbscanCluster(false, points) | ||
else | ||
#unite the noise cluster | ||
unite!(uf, j, k) | ||
#create parent cluster | ||
points = members(uf, group(uf, j)) | ||
if length(points) < min_size | ||
Hierarchy[n+i] = HdbscanCluster(true, Int[]) | ||
else | ||
Hierarchy[n+i] = HdbscanCluster(false, points) | ||
end | ||
end | ||
end | ||
return Hierarchy | ||
MommaWatasu marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
end | ||
|
||
function extract_cluster!(hierarchy::Vector{HdbscanCluster}) | ||
for i in 1 : length(hierarchy)-1 | ||
if isnoise(hierarchy[i]) | ||
MommaWatasu marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
c = hierarchy[i] | ||
push!(hierarchy[c.parent].children, i) | ||
hierarchy[c.parent].children_stability += c.stability | ||
else | ||
c = hierarchy[i] | ||
if c.stability > c.children_stability | ||
push!(hierarchy[c.parent].children, i) | ||
hierarchy[c.parent].children_stability += c.stability | ||
else | ||
append!(hierarchy[c.parent].children, c.children) | ||
hierarchy[c.parent].children_stability += c.children_stability | ||
end | ||
end | ||
end | ||
end | ||
|
||
# Below are utility functions for building hierarchical trees | ||
heappush!(h, v) = insert!(h, searchsortedfirst(h, v), v) | ||
|
||
mutable struct UnionFind{T <: Integer} | ||
parent:: Vector{T} # parent[root] is the negative of the size | ||
label::Dict{Int, Int} | ||
cnt::Int | ||
MommaWatasu marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
|
||
function UnionFind{T}(nodes::T) where T<:Integer | ||
if nodes <= 0 | ||
throw(ArgumentError("invalid argument for nodes: $nodes")) | ||
end | ||
|
||
parent = -ones(T, nodes) | ||
label = Dict([(i, i) for i in 1 : nodes]) | ||
MommaWatasu marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
new{T}(parent, label, nodes) | ||
end | ||
end | ||
|
||
UnionFind(nodes::Integer) = UnionFind{typeof(nodes)}(nodes) | ||
group(uf::UnionFind, x)::Int = uf.label[root(uf, x)] | ||
members(uf::UnionFind, x::Int) = collect(keys(filter(n->n.second == x, uf.label))) | ||
MommaWatasu marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
|
||
function root(uf::UnionFind{T}, x::T)::T where T<:Integer | ||
if uf.parent[x] < 0 | ||
return x | ||
else | ||
return uf.parent[x] = root(uf, uf.parent[x]) | ||
end | ||
end | ||
|
||
function issame(uf::UnionFind{T}, x::T, y::T)::Bool where T<:Integer | ||
return root(uf, x) == root(uf, y) | ||
end | ||
|
||
function Base.size(uf::UnionFind{T}, x::T)::T where T<:Integer | ||
return -uf.parent[root(uf, x)] | ||
end | ||
|
||
function unite!(uf::UnionFind{T}, x::T, y::T)::Bool where T<:Integer | ||
x = root(uf, x) | ||
y = root(uf, y) | ||
if x == y | ||
return false | ||
end | ||
if uf.parent[x] > uf.parent[y] | ||
x, y = y, x | ||
end | ||
# unite smaller tree(y) to bigger one(x) | ||
uf.parent[x] += uf.parent[y] | ||
uf.parent[y] = x | ||
uf.cnt += 1 | ||
uf.label[y] = uf.cnt | ||
for i in members(uf, group(uf, x)) | ||
uf.label[i] = uf.cnt | ||
end | ||
return true | ||
end |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
using Test | ||
using Clustering | ||
|
||
@testset "HDBSCAN" begin | ||
# make moons for test | ||
upper_x = [i for i in 0:π/50:π] | ||
lower_x = [i for i in π/2:π/50:3/2*π] | ||
upper_y = sin.(upper_x) .+ rand(50)./10 | ||
lower_y = cos.(lower_x) .+ rand(51)./10 | ||
data = hcat([lower_x; upper_x], [lower_y; upper_y]) | ||
#test for main function | ||
@test_throws DomainError hdbscan(data, 5, 0) | ||
@test_throws ArgumentError hdbscan(data, 5, 3; gen_mst=false) | ||
@test_nowarn result = hdbscan(data, 5, 3) | ||
|
||
# tests for result | ||
result = hdbscan(data, 5, 3) | ||
@test sum([length(c) for c in result.clusters]) == 101 | ||
MommaWatasu marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||
end |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.