Datasets for DBSCAN evaluation
Publications that evaluate implementations of the popular DBSCAN clustering algorithm use a variety of datasets. Pointers to these datasets and information on parameters (e.g. normalisation, epsilon and minPts) are collected here. You are welcome to contact me if you have further (big) datasets that are good benchmarks for DBSCAN.
Sarma et al.: μDBSCAN: An Exact Scalable DBSCAN Algorithm for Big Data Exploiting Spatial Locality
TODO: check in detail which datasets are used. Some are the datasets also used in the other publications below, and the paper adds: "In addition, we have also used a few other real datasets: 3D Road Network (3DSRN) [32] contains vehicular GPS data; Household Power (HHP*) and KDDBIO145K (KDDB*) datasets are borrowed from UCI Repository [33]."
Gan, Tao: DBSCAN Revisited: Mis-Claim, Un-Fixability, and Approximation
Data normalized to [0, 10^5 ] for every dimension.
MinPts = 100, Epsilon = 5000 and higher. (Note: these epsilon values are far too high and turn almost the entire dataset into a single cluster -- the mis-claim is on their side!)
Their preprocessed datasets
- PAMAP2 (3,850,505 4D points),
- Farm (3,627,086 5D points),
- Household (2,049,280 7D points)
can be obtained from their webpage.
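The per-dimension normalisation to [0, 10^5] mentioned above can be sketched as follows. This is only an illustration that assumes plain min-max scaling; the exact procedure used by Gan and Tao is not described here, and the function name `normalise_columns` and the toy data are made up for this sketch. It also shows how large Epsilon = 5000 is relative to the normalised range.

```python
from typing import List, Sequence

TARGET_MAX = 10**5  # upper bound of the target range, as stated above


def normalise_columns(points: Sequence[Sequence[float]],
                      target_max: float = TARGET_MAX) -> List[List[float]]:
    """Min-max scale every dimension independently to [0, target_max].

    Assumption: simple min-max scaling; the original preprocessing may differ.
    """
    if not points:
        return []
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]
    scaled = []
    for p in points:
        row = []
        for d in range(dims):
            span = highs[d] - lows[d]
            # A constant dimension has no spread; map it to 0.
            row.append(0.0 if span == 0 else (p[d] - lows[d]) / span * target_max)
        scaled.append(row)
    return scaled


# Toy data, invented for illustration (not taken from any dataset above).
toy = [[1.0, 200.0], [2.0, 400.0], [3.0, 300.0]]
normalised = normalise_columns(toy)

# Epsilon = 5000 expressed as a fraction of each dimension's normalised range.
eps_fraction = 5000 / TARGET_MAX
print(normalised)
print(eps_fraction)
```

With this scaling, Epsilon = 5000 corresponds to 5% of the full range in every dimension, which gives a feeling for why the note above considers such values large.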
Mai, Assent, Jacobsen, Storgaard Dieu: Anytime parallel density-based clustering
- The same household dataset as used by Gan, Tao.
- PAMAP2 is also used, but stated to have 974,479 39D points, whereas Gan and Tao reduced it to 4 dimensions using PCA yet report 3,850,505 points.
- In addition, the UCI Gas Sensor dataset by Fonollosa et al. is used: 4,208,261 16D points (DETAILS NOT PROVIDED IN PAPER).
Kriegel, Schubert, Zimek: The (black) art of runtime evaluation: Are we comparing algorithms or implementations?
- The same PAMAP2, Farm and Household datasets as used by Gan, Tao (also with smaller epsilon values, as these make more sense).
- In addition, for higher-dimensional data, the Amsterdam Library of Object Images (ALOI) dataset from Geusebroek et al. is used, namely the 110,250 HSV/HSB color histograms provided on the ELKI Multi-View Data Sets webpage. Specifically, the eight-dimensional dataset (two divisions per HSV color component; I assume this is the 2x2x2 dataset) is used with epsilon=0.01 and minPts=20.
Patwary, Satish, Sundaram, Manne, Habib, Dubey: Pardicle: parallel approximate density-based clustering
- Halo and galaxy formation datasets in astrophysics, taken from the database on Millennium Run described by Lemson and the Virgo Consortium. (Online access via wget to the data stored in an SQL database is possible.) These datasets contain far more dimensions than used by Patwary et al., so it is unclear which subsets of the dimensions have been used (DETAILS NOT PROVIDED IN PAPER).
- 1,056M 8D points (eps=0.005, minPts=2): This MPAGalaxiesDeLucia2006a (md) dataset (attributed by Patwary et al. to De Lucia et al. The hierarchical formation of the brightest cluster galaxies) probably refers to the DeLucia2006a dataset.
- 1,015M 8D points (eps=0.005, minPts=2): This DGalaxiesBower2006a (db) dataset (attributed by Patwary et al. to Bower et al. Breaking the hierarchy of galaxy formation) probably refers to the Bower2006a dataset.
- 1,015M 8D points (eps=0.005, minPts=2): For this MPAGalaxiesBertone2007 (mb) dataset (attributed by Patwary et al. to Bertone et al. The recycling of gas and metals in galaxy formation: predictions of a dynamical feedback model), it is not obvious which dataset from the database on Millennium Run it refers to.
- 761M 9D points (eps=0.001, minPts=2): This MPAHaloTreesMhalo (mm) dataset (attributed by Patwary et al. to Guo et al. Galaxy formation in WMAP1 and WMAP7 cosmologies) probably refers to the MPAHalo dataset.
- 92M and 116M 10D points (eps=20, minPts=5 and eps=50, minPts=2): generated using the IBM Quest synthetic data generator. However, the details of the generation have not been documented, so these datasets cannot be reproduced.
PDSDBSCAN
A subsampled version of the above Millennium Run dataset has also been used in the paper A new scalable parallel DBSCAN algorithm using the disjoint-set data structure, which describes and evaluates PDSDBSCAN and is by the same main author as Pardicle. That author has also published a 50,000 10D point dataset that is used in that paper as well.
Götz, Bodenstein, Riedel: HPDBSCAN: highly parallel DBSCAN
The Bremen 3D point cloud and Twitter 2D GPS locations are available as full and subsampled (small) datasets: DOI: 10.23728/b2share.7f0c22ba9a5a44ca83cdf4fb304ce44e (Note: the original publication refers to the dataset via a handle.net handle which does not work anymore).
- Twitter (dataset t): 16,602,137 2D points (eps=0.01, minPts=40). Note that this dataset contains some bogus artefacts (most likely Twitter spam with bogus GPS coordinates).
- Twitter small (dataset ts): 3,704,351 2D points (eps=0.01, minPts=40)
- Bremen (dataset b): 81,398,810 3D points (eps=100, minPts=10000)
- Bremen small (dataset bs): 2,543,712 3D points (eps=100, minPts=312)
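As a small worked check of the numbers listed above, the following sketch compares the subsampling ratio of each small dataset with the change in minPts. It only does arithmetic on the figures from this page; the interpretation (that minPts for Bremen small looks like the full-dataset value divided by the subsampling ratio) is my own reading of the numbers and not a statement taken from the paper. The dictionary layout and variable names are made up for this sketch.

```python
# Point counts and minPts values exactly as listed above.
datasets = {
    "Twitter": {"full": 16_602_137, "small": 3_704_351,
                "minpts_full": 40, "minpts_small": 40},
    "Bremen": {"full": 81_398_810, "small": 2_543_712,
               "minpts_full": 10_000, "minpts_small": 312},
}

ratios = {}
scaled_minpts = {}
for name, d in datasets.items():
    # How many times larger the full dataset is than the small one.
    ratios[name] = d["full"] / d["small"]
    # minPts of the full dataset divided by that ratio.
    scaled_minpts[name] = d["minpts_full"] / ratios[name]
    print(f"{name}: ratio={ratios[name]:.2f}, "
          f"minPts/ratio={scaled_minpts[name]:.1f}, "
          f"listed small minPts={d['minpts_small']}")
```

For Bremen the ratio is very close to 32, and 10000 divided by it truncates to the listed value of 312. For Twitter the ratio is roughly 4.5, yet minPts is 40 for both variants, so the same scaling does not appear to have been applied there.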
Neukirchen: Elephant against Goliath: Performance of Big Data versus High-Performance Computing DBSCAN Clustering Implementations
The same Twitter small dataset as provided by Götz et al. has been used with the same parameters.