
Commit 0cb7296

Further expository text updates
1 parent 26ffa4b commit 0cb7296

File tree

1 file changed: 15 additions, 6 deletions

notebooks/Benchmarking scalability of clustering implementations.ipynb

Lines changed: 15 additions & 6 deletions
@@ -745,7 +745,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Now we run that for each of our pre-existing datasets to extrapolate out predicted performance on the relevant dataset sizes. A little pandas wrangling later and we've produced a table of roughly how large a dataset you can tackle in each time frame with each implementation. I had to leave out the scipy KMeans timings because the noise in timing results caused the model to be unrealistic at larger data sizes. Note how the $O(n\\log n)$ algorithms utterly dominate here. In the meantime, for medium sizes data sets you can still get quite a lot done with HDBSCAN."
+"Now we run that for each of our pre-existing datasets to extrapolate out predicted performance on the relevant dataset sizes. A little pandas wrangling later and we've produced a table of roughly how large a dataset you can tackle in each time frame with each implementation."
 ]
 },
 {
@@ -888,6 +888,15 @@
 "datasize_table"
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"I had to leave out the scipy KMeans timings because the noise in timing results caused the model to be unrealistic at larger data sizes. It is also worth keeping in mind that some of the results of the models for larger numbers are simply false -- you'll recall that Fastcluster and Scipy's single linkage both didn't scale at all well past 40000 points on my laptop, so I'm certainly not going to manage 50000 or 100000 over lunch. The same applies to DeBaCl and the slower Sklearn implementations, as they also produce the full pairwise distance matrix during computations.\n",
+"\n",
+"The main thing to note is how the $O(n\\log n)$ algorithms utterly dominate here. In the meantime, for medium sized data sets you can still get quite a lot done with HDBSCAN."
+]
+},
 {
 "cell_type": "markdown",
 "metadata": {},
@@ -911,21 +920,21 @@
 ],
 "metadata": {
 "kernelspec": {
-"display_name": "Python 2",
+"display_name": "Python 3",
 "language": "python",
-"name": "python2"
+"name": "python3"
 },
 "language_info": {
 "codemirror_mode": {
 "name": "ipython",
-"version": 2
+"version": 3
 },
 "file_extension": ".py",
 "mimetype": "text/x-python",
 "name": "python",
 "nbconvert_exporter": "python",
-"pygments_lexer": "ipython2",
-"version": "2.7.10"
+"pygments_lexer": "ipython3",
+"version": "3.4.3"
 },
 "nbformat": 4,
