Helmut Neukirchen

Datasets for DBSCAN evaluation

Helmut Neukirchen, 20. June 2019

Various publications use several datasets to evaluate implementations of the popular DBSCAN clustering algorithm. Pointers to these datasets and information on parameters (e.g. normalisation, epsilon and minPts) are collected here. You are welcome to contact me if you have further (big) datasets that are good benchmarks for DBSCAN.

Sarma et al.: μDBSCAN: An Exact Scalable DBSCAN Algorithm for Big Data Exploiting Spatial Locality

TODO: check the datasets used in detail. Some of them are the same datasets as used in the other publications below, but the paper also states: "In addition, we have also used a few other real datasets: 3D Road Network (3DSRN) [32] contains vechicular GPS data; Household Power (HHP*) and KDDBIO145K (KDDB*) datasets are borrowed from UCI Repository [33]."

Gan, Tao: DBSCAN Revisited: Mis-Claim, Un-Fixability, and Approximation

Data normalized to [0, 10^5] for every dimension.

MinPts = 100, Epsilon = 5000 and higher. (Note: this epsilon value is far too high and turns almost the entire dataset into a single cluster -- the mis-claim is on their side!)
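To make the scale of these parameters concrete, here is a minimal sketch of per-dimension rescaling to [0, 10^5]. It assumes plain min-max scaling, which the text above does not confirm; the function name normalise and the sample points are hypothetical. It also shows that Epsilon = 5000 corresponds to 5% of each dimension's full range.

```python
from typing import List, Sequence


def normalise(points: Sequence[Sequence[float]], upper: float = 1e5) -> List[List[float]]:
    """Rescale every dimension independently to [0, upper] (assumed min-max scaling)."""
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]
    scaled = []
    for p in points:
        row = []
        for d in range(dims):
            span = highs[d] - lows[d]
            # A constant dimension has no spread; map it to 0 to avoid dividing by zero.
            row.append(0.0 if span == 0 else (p[d] - lows[d]) / span * upper)
        scaled.append(row)
    return scaled


# Hypothetical toy data: three 2D points.
points = [[1.0, 200.0], [3.0, 400.0], [2.0, 250.0]]
scaled = normalise(points)

# Epsilon relative to the normalised range of a single dimension.
eps_fraction = 5000 / 1e5
print(scaled, eps_fraction)
```

After such a rescaling, epsilon values from different datasets become comparable, because each one can be read as a fraction of the per-dimension range.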

Their preprocessed datasets

PAMAP2 (3,850,505 4D points),
Farm (3,627,086 5D points),
Household (2,049,280 7D points)

can be obtained from their webpage.

Mai, Assent, Jacobsen, Storgaard Dieu: Anytime parallel density-based clustering

The same Household dataset is used as by Gan and Tao.
PAMAP2 is used as well, but it is stated to contain 974,479 39D points, whereas Gan and Tao reduced it to 4 dimensions using PCA and state 3,850,505 points.
In addition, the UCI Gas Sensor dataset by Fonollosa et al. is used: 4,208,261 16D points (DETAILS NOT PROVIDED IN PAPER).

Kriegel, Schubert, Zimek: The (black) art of runtime evaluation: Are we comparing algorithms or implementations?

The same PAMAP2, Farm and Household datasets are used as by Gan and Tao (including smaller epsilon values as well, as these make more sense).
In addition, for higher dimensional data, the Amsterdam Library of Object Images (ALOI) dataset from Geusebroek et al. is used, namely the 110,250 HSV/HSB color histograms provided on the ELKI Multi-View Data Sets webpage. Specifically, the eight-dimensional dataset (two divisions per HSV color component; I assume this is the 2x2x2 dataset) is used with epsilon=0.01 and minPts=20.
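To illustrate why two divisions per HSV component give eight dimensions (2 x 2 x 2 bins), here is a minimal sketch of such a color histogram. It is not the procedure used to produce the ALOI/ELKI files: the bin ordering, the normalisation to relative frequencies, the function name hsv_histogram and the sample pixels are all assumptions made for illustration.

```python
import colorsys
from typing import Iterable, List, Tuple


def hsv_histogram(pixels: Iterable[Tuple[int, int, int]], bins: int = 2) -> List[float]:
    """Return a bins^3-dimensional HSV histogram as relative frequencies (assumed layout)."""
    counts = [0] * (bins ** 3)
    total = 0
    for r, g, b in pixels:
        # colorsys expects RGB components in [0, 1] and returns H, S, V in [0, 1].
        h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
        # min() keeps a component of exactly 1.0 inside the last bin.
        hi, si, vi = (min(int(c * bins), bins - 1) for c in (h, s, v))
        counts[(hi * bins + si) * bins + vi] += 1
        total += 1
    return [c / total for c in counts] if total else [0.0] * len(counts)


# Hypothetical toy image: red, blue, black and white pixels.
pixels = [(255, 0, 0), (0, 0, 255), (0, 0, 0), (255, 255, 255)]
hist = hsv_histogram(pixels)
print(len(hist), hist)
```

Each image thus becomes one 8D point. If the histograms are relative frequencies as assumed here, all coordinates lie in [0, 1], which would fit a small value such as epsilon=0.01.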

Patwary, Satish, Sundaram, Manne, Habib, Dubey: Pardicle: parallel approximate density-based clustering

Halo and galaxy formation datasets in astrophysics, taken from the database on Millennium Run described by Lemson and the Virgo Consortium. (Online access via wget to the data stored in an SQL database is possible.) These datasets contain far more dimensions than used by Patwary et al., so it is unclear which subsets of the dimensions have been used (DETAILS NOT PROVIDED IN PAPER).

1,056M 8D points (eps=0.005, minPts=2): This MPAGalaxiesDeLucia2006a (md) dataset (attributed by Patwary et al. to De Lucia et al.: The hierarchical formation of the brightest cluster galaxies) probably refers to the DeLucia2006a dataset.
1,015M 8D points (eps=0.005, minPts=2): This DGalaxiesBower2006a (db) dataset (attributed by Patwary et al. to Bower et al.: Breaking the hierarchy of galaxy formation) probably refers to the Bower2006a dataset.
1,015M 8D points (eps=0.005, minPts=2): For this MPAGalaxiesBertone2007 (mb) dataset (attributed by Patwary et al. to Bertone et al.: The recycling of gas and metals in galaxy formation: predictions of a dynamical feedback model), it is not obvious which dataset from the database on Millennium Run it refers to.
761M 9D points (eps=0.001, minPts=2): This MPAHaloTreesMhalo (mm) dataset (attributed by Patwary et al. to Guo et al.: Galaxy formation in WMAP1 and WMAP7 cosmologies) probably refers to the MPAHalo dataset.

92M and 116M 10D points (eps=20, minPts=5 and eps=50, minPts=2): generated using the IBM Quest synthetic data generator. However, the details of the generation have not been documented, so these datasets cannot be reproduced.

PDSDBSCAN

A subsampled version of the above Millennium Run dataset has also been used in the paper A new scalable parallel DBSCAN algorithm using the disjoint-set data structure, which describes and evaluates PDSDBSCAN and has the same main author as Pardicle. That author has also published a 50,000 10D point dataset that is used in that paper as well.

Götz, Bodenstein, Riedel: HPDBSCAN: highly parallel DBSCAN

The Bremen 3D point cloud and Twitter 2D GPS locations are available as full and subsampled (small) datasets: DOI: 10.23728/b2share.7f0c22ba9a5a44ca83cdf4fb304ce44e (Note: the original publication refers to the dataset via a handle.net handle which does not work anymore).

Twitter (dataset t): 16,602,137 2D points (eps=0.01, minPts=40). Note that this dataset contains some bogus artefacts (most likely Twitter spam with bogus GPS coordinates).
Twitter small (dataset ts): 3,704,351 2D points (eps=0.01, minPts=40)
Bremen (dataset b): 81,398,810 3D points (eps=100, minPts=10000)
Bremen small (dataset bs): 2,543,712 3D points (eps=100, minPts=312)
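The list above does not say how minPts was chosen for the subsampled Bremen dataset. The following arithmetic sketch only checks one possible reading, namely that minPts was scaled down by the subsampling factor; this is an assumption, and it does not hold for the Twitter datasets, where minPts=40 is used for both sizes.

```python
# Point counts and minPts values as listed above for the Bremen datasets.
full_points = 81_398_810
small_points = 2_543_712
full_min_pts = 10_000
small_min_pts = 312

# Subsampling factor between the full and the small dataset.
factor = full_points / small_points

# Assumption: minPts scaled by the same factor and rounded down.
scaled_min_pts = int(full_min_pts / factor)

print(round(factor, 3), scaled_min_pts, scaled_min_pts == small_min_pts)
```

Under this reading, eps stays at 100 because subsampling thins out the points but leaves the spatial extent unchanged, so only the neighbour count within a given radius drops.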

Neukirchen: Elephant against Goliath: Performance of Big Data versus High-Performance Computing DBSCAN Clustering Implementations

The same Twitter small dataset as provided by Götz et al. has been used with the same parameters.


Contact

Dr. Helmut Neukirchen

Professor of Computer Science and Software Engineering

Faculty of Industrial Engineering, Mechanical Engineering and Computer Science

Deputy head of faculty (Autumn 2024-Spring 2026)

University of Iceland
Department of Computer Science

Gróska building, 3rd floor (stairway A or B), room 306
Bjargargata 1
102 Reykjavik
Iceland

E-Mail: helmut at hi. is
(Encrypted e-mail welcome: my public PGP key, also available at key servers -- X.509 based S/MIME encryption possible on request.)
Phone (mobile): 615 2554