Datasets for DBSCAN evaluation

Helmut Neukirchen, 20. June 2019

For evaluating implementations of the popular DBSCAN clustering algorithm, various publications use several datasets. Pointers to these datasets and information on parameters (e.g. normalisation, epsilon and minPts) are collected here. You are welcome to contact me if you have further (big) datasets that are good benchmarks for DBSCAN.

Sarma et al.: μDBSCAN: An Exact Scalable DBSCAN Algorithm for Big Data Exploiting Spatial Locality

TODO: Check in detail which datasets are used. Some of them are the same datasets as in the other publications below, but the paper also states: "In addition, we have also used a few other real datasets: 3D Road Network (3DSRN) [32] contains vehicular GPS data; Household Power (HHP*) and KDDBIO145K (KDDB*) datasets are borrowed from UCI Repository [33]."

Gan, Tao: DBSCAN Revisited: Mis-Claim, Un-Fixability, and Approximation

Data normalized to [0, 10^5] for every dimension.
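
The normalization can be sketched as a per-dimension min-max scaling. This is only an illustration: that a plain min-max scaling was used is an assumption, and the function name `normalize_columns` and the sample points are made up for this example.

```python
def normalize_columns(points, upper=1e5):
    """Min-max scale every dimension of `points` to [0, upper]."""
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]
    scaled = []
    for p in points:
        row = []
        for d in range(dims):
            span = highs[d] - lows[d]
            # A constant dimension has no spread; mapping it to 0 is an
            # arbitrary choice made for this sketch.
            row.append(0.0 if span == 0 else (p[d] - lows[d]) / span * upper)
        scaled.append(row)
    return scaled


# Hypothetical 2D sample, not taken from any of the datasets.
sample = [(1.0, 200.0), (2.0, 400.0), (3.0, 300.0)]
normalized = normalize_columns(sample)
```

After such a scaling, an epsilon of 5000 corresponds to 5% of the value range of each dimension.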

MinPts = 100, Epsilon = 5000 and higher. (Note: these epsilon values are far too high and turn almost the entire dataset into a single cluster -- the mis-claim is on their side!).
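
Why an overly large epsilon merges everything into one cluster can be seen with a naive DBSCAN sketch. This is a textbook-style O(n^2) illustration written for this page, not the implementation of any of the cited papers; the toy points and parameter values are made up.

```python
from math import dist  # Python 3.8+


def dbscan(points, eps, min_pts):
    """Naive O(n^2) DBSCAN; returns one label per point (-1 = noise)."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbours(i):
        # The eps-neighbourhood includes the point itself.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            reach = neighbours(j)
            if len(reach) >= min_pts:
                queue.extend(reach)  # j is a core point: keep expanding
    return labels


# Two well-separated toy groups of four points each.
toy = [(0, 0), (1, 0), (0, 1), (1, 1), (50, 50), (51, 50), (50, 51), (51, 51)]
tight = dbscan(toy, eps=2, min_pts=3)    # two clusters
loose = dbscan(toy, eps=100, min_pts=3)  # everything in one cluster
```

With `eps=2` the two groups stay separate; with `eps=100` every point is a neighbour of every other point and the structure is lost.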

Their preprocessed datasets

  • PAMAP2 (3,850,505 4D points),
  • Farm (3,627,086 5D points),
  • Household (2,049,280 7D points)

can be obtained from their webpage.

Mai, Assent, Jacobsen, Storgaard Dieu: Anytime parallel density-based clustering

  • The same Household dataset as used by Gan, Tao.
  • PAMAP2 is also used, but it is stated to have 974,479 39D points, whereas Gan and Tao reduced it to 4 dimensions using PCA and state that it has 3,850,505 points.
  • In addition, the UCI Gas Sensor dataset by Fonollosa et al. is used: 4,208,261 16D points (DETAILS NOT PROVIDED IN PAPER).

Kriegel, Schubert, Zimek: The (black) art of runtime evaluation: Are we comparing algorithms or implementations?

  • The same PAMAP2, Farm and Household datasets as used by Gan, Tao (also including smaller epsilon values, as these make more sense).
  • In addition, for higher-dimensional data, the Amsterdam Library of Object Images (ALOI) dataset from Geusebroek et al. is used, namely the 110,250 HSV/HSB color histograms provided on the ELKI Multi-View Data Sets webpage. Specifically, the eight-dimensional dataset (two divisions per HSV color component; I assume this is the 2x2x2 dataset) is used with epsilon=0.01 and minPts=20.
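
Why two divisions per HSV component yield eight dimensions can be sketched as follows. This only illustrates the 2x2x2 binning idea: the bin order, the normalisation and the function name `hsv_histogram` are assumptions and need not match how the ELKI files were actually produced.

```python
import colorsys


def hsv_histogram(pixels, bins=2):
    """Normalised bins x bins x bins HSV histogram of 8-bit RGB pixels."""
    counts = [0] * bins ** 3
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
        # Each component lies in [0, 1]; clamp the value 1.0 into the last bin.
        hi, si, vi = (min(int(c * bins), bins - 1) for c in (h, s, v))
        counts[(hi * bins + si) * bins + vi] += 1
    total = len(pixels)
    return [c / total for c in counts]


# Hypothetical four-pixel "image": red, black, blue, white.
feature = hsv_histogram([(255, 0, 0), (0, 0, 0), (0, 0, 255), (255, 255, 255)])
```

Each image is thus reduced to one 8D point (2 * 2 * 2 bins), which is the kind of vector DBSCAN would then cluster.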

Patwary, Satish, Sundaram, Manne, Habib, Dubey: Pardicle: parallel approximate density-based clustering

PDSDBSCAN

A subsampled version of the above Millennium Run dataset has also been used in the paper "A new scalable parallel DBSCAN algorithm using the disjoint-set data structure", which describes and evaluates PDSDBSCAN and has the same main author as Pardicle. That author has also published a 50,000 10D point dataset that is used in that paper as well.

Götz, Bodenstein, Riedel: HPDBSCAN: highly parallel DBSCAN

The Bremen 3D point cloud and Twitter 2D GPS locations are available as full and subsampled (small) datasets: DOI: 10.23728/b2share.7f0c22ba9a5a44ca83cdf4fb304ce44e (Note: the original publication refers to the dataset via a handle.net handle that no longer works).

  • Twitter (dataset t): 16,602,137 2D points (eps=0.01, minPts=40). Note that this dataset contains some bogus artefacts (most likely Twitter spam with bogus GPS coordinates).
  • Twitter small (dataset ts): 3,704,351 2D points (eps=0.01, minPts=40)
  • Bremen (dataset b): 81,398,810 3D points (eps=100, minPts=10000)
  • Bremen small (dataset bs): 2,543,712 3D points (eps=100, minPts=312)
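
The point counts above allow a quick check of the subsampling ratios. The arithmetic below uses only the numbers listed; that minPts for Bremen small was derived by dividing the full-dataset value by the subsampling ratio is merely my reading of these numbers, not something the publication is quoted as stating here.

```python
# Point counts and minPts values as listed above.
datasets = {
    "twitter": {"full": 16_602_137, "small": 3_704_351,
                "minpts_full": 40, "minpts_small": 40},
    "bremen": {"full": 81_398_810, "small": 2_543_712,
               "minpts_full": 10_000, "minpts_small": 312},
}

ratios = {name: d["full"] / d["small"] for name, d in datasets.items()}

# Hypothetical rule: scale minPts by the subsampling ratio and truncate.
scaled_minpts = {
    name: int(d["minpts_full"] / ratios[name]) for name, d in datasets.items()
}

for name in datasets:
    print(name, round(ratios[name], 2), scaled_minpts[name])
```

For Bremen, the ratio is almost exactly 32 and the scaled value agrees with the listed minPts=312. For Twitter, the ratio is roughly 4.5, yet minPts=40 is kept unchanged, so the two subsampled datasets are apparently not parameterised by the same rule.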

Neukirchen: Elephant against Goliath: Performance of Big Data versus High-Performance Computing DBSCAN Clustering Implementations

The same Twitter small dataset as provided by Götz et al. has been used with the same parameters.