{"id":1658,"date":"2019-06-20T15:12:56","date_gmt":"2019-06-20T15:12:56","guid":{"rendered":"http:\/\/uni.hi.is\/helmut\/?p=1658"},"modified":"2020-01-24T15:10:24","modified_gmt":"2020-01-24T15:10:24","slug":"datasets-for-dbscan-evaluation","status":"publish","type":"post","link":"https:\/\/uni.hi.is\/helmut\/2019\/06\/20\/datasets-for-dbscan-evaluation\/","title":{"rendered":"Datasets for DBSCAN evaluation"},"content":{"rendered":"<p>For evaluating implementations of the popular <a href=\"https:\/\/en.wikipedia.org\/wiki\/DBSCAN\">DBSCAN<\/a> clustering algorithm, various publications use several datasets. Pointers to these datasets and information on parameters (e.g. normalisation, epsilon and minpts) are collected here. You are welcome to contact me if you have further (big) datasets that are good benchmarks for DBSCAN.<\/p>\n<h3>Sarma et al.: <a href=\"https:\/\/doi.org\/10.1109\/CLUSTER.2019.8891020\">\u03bcDBSCAN: An Exact Scalable DBSCAN Algorithm for Big Data Exploiting Spatial Locality<\/a><\/h3>\n<p>TODO: check the datasets used in detail. Some are the same datasets used in the other publications below, but the paper also states: \"In addition, we have also used a few other real datasets: 3D Road Network (3DSRN) [32] contains vechicular GPS data; Household Power (HHP*) and KDDBIO145K (KDDB*) datasets are borrowed from UCI Repository [33].\"<\/p>\n<h3>Gan, Tao: <a href=\"https:\/\/doi.org\/10.1145\/2723372.2737792\">DBSCAN Revisited: Mis-Claim, Un-Fixability, and Approximation<\/a><\/h3>\n<p>Data normalized to [0, 10^5] for every dimension.<\/p>\n<p>MinPts = 100, Epsilon = 5000 and higher. 
(Note: this is far too high a value, turning almost the entire dataset into a single cluster -- the mis-claim is on their side!).<\/p>\n<p>Their preprocessed datasets<\/p>\n<ul>\n<li>PAMAP2 (3,850,505 4D points),<\/li>\n<li>Farm (3,627,086 5D points),<\/li>\n<li>Household (2,049,280 7D points)<\/li>\n<\/ul>\n<p>can be obtained from their <a href=\"https:\/\/sites.google.com\/view\/approxdbscan\/datasets\">webpage<\/a>.<\/p>\n<h3>Mai, Assent, Jacobsen, Storgaard Dieu: <a href=\"https:\/\/doi.org\/10.1007\/s10618-018-0562-1\">Anytime parallel density-based clustering<\/a><\/h3>\n<ul>\n<li>The same household dataset as used by Gan, Tao.<\/li>\n<li>PAMAP2 is also used, but it is stated to have 974,479 39D points, whereas Gan and Tao, who reduced it to 4 dimensions using PCA, state 3,850,505 points.<\/li>\n<li>In addition, the UCI <a href=\"https:\/\/archive.ics.uci.edu\/ml\/datasets\/Gas+Sensor+Array+Drift+Dataset+at+Different+Concentrations\">Gas Sensor dataset<\/a> by <a href=\"https:\/\/doi.org\/10.1016\/j.dib.2015.01.003\">Fonollosa et al.<\/a> is used: 4,208,261 16D points (DETAILS NOT PROVIDED IN PAPER).<\/li>\n<\/ul>\n<h3>Kriegel, Schubert, Zimek: <a href=\"https:\/\/doi.org\/10.1007\/s10115-016-1004-2\">The (black) art of runtime evaluation: Are we comparing algorithms or implementations?<\/a><\/h3>\n<ul>\n<li>The same PAMAP2, Farm and household datasets as used by Gan, Tao (also including smaller epsilon values, as these make more sense).<\/li>\n<li>In addition, for higher dimensional data, the <a href=\"http:\/\/aloi.science.uva.nl\/\">Amsterdam Library of Object Images<\/a> (ALOI) dataset from <a href=\"http:\/\/dx.doi.org\/10.1023\/B:VISI.0000042993.50813.60\">Geusebroek et al.<\/a> is used, namely the 110,250 HSV\/HSB color histograms provided on <a href=\"https:\/\/elki-project.github.io\/datasets\/multi_view\">the ELKI Multi-View Data Sets webpage<\/a>. Specifically, the eight-dimensional dataset (two divisions per HSV color component; I assume this is the 2x2x2 dataset) is used with 
epsilon=0.01 and minPts=20.\n<\/li>\n<\/ul>\n<h3>Patwary, Satish, Sundaram, Manne, Habib, Dubey: <a href=\"https:\/\/doi.org\/10.1109\/SC.2014.51\">Pardicle: parallel approximate density-based clustering<\/a><\/h3>\n<ul>\n<li>Halo and galaxy formation datasets in astrophysics, taken from the <a href=\"http:\/\/gavo.mpa-garching.mpg.de\/Millennium\/\">database on Millennium Run<\/a> described by <a href=\"https:\/\/arxiv.org\/pdf\/astro-ph\/0608019.pdf\">Lemson and the Virgo Consortium<\/a>. (<a href=\"http:\/\/gavo.mpa-garching.mpg.de\/Millennium\/Help?page=onlineaccess\">Online access via wget<\/a> to the data stored in a SQL database is possible.) These datasets contain far more dimensions than used by Patwary et al., so it is unclear which subsets of the dimensions have been used (DETAILS NOT PROVIDED IN PAPER).\n<\/li>\n<ul>\n<li>1,056M 8D points (eps=0.005, minPts=2): This MPAGalaxiesDeLucia2006a (md) dataset (attributed by Patwary et al. to De Lucia et al. <a href=\"https:\/\/doi.org\/10.1111\/j.1365-2966.2006.11287.x\">The hierarchical formation of the brightest cluster galaxies<\/a>) probably refers to the <a href=\"http:\/\/gavo.mpa-garching.mpg.de\/Millennium\/Help?page=databases\/millimil\/delucia2006a\">DeLucia2006a<\/a> dataset.<\/li>\n<li>1,015M 8D points (eps=0.005, minPts=2): This DGalaxiesBower2006a (db) dataset (attributed by Patwary et al. to Bower et al. <a href=\"https:\/\/doi.org\/10.1111\/j.1365-2966.2006.10519.x\">Breaking the hierarchy of galaxy formation<\/a>) probably refers to the <a href=\"http:\/\/gavo.mpa-garching.mpg.de\/Millennium\/Help?page=databases\/millimil\/bower2006a\">Bower2006a<\/a> dataset.<\/li>\n<li>1,015M 8D points (eps=0.005, minPts=2): For this MPAGalaxiesBertone2007 (mb) dataset (attributed by Patwary et al. to Bertone et al. 
<a href=\"https:\/\/doi.org\/10.1111\/j.1365-2966.2007.11997.x\">The recycling of gas and metals in galaxy formation: predictions of a dynamical feedback model<\/a>), it is not obvious which dataset from the <a href=\"http:\/\/gavo.mpa-garching.mpg.de\/Millennium\/\">database on Millennium Run<\/a> it refers to.<\/li>\n<li>761M 9D points (eps=0.001, minPts=2): This MPAHaloTreesMhalo (mm) dataset (attributed by Patwary et al. to Guo et al. <a href=\"https:\/\/doi.org\/10.1093\/mnras\/sts115\">Galaxy formation in WMAP1 and WMAP7 cosmologies<\/a>) probably refers to the <a href=\"http:\/\/gavo.mpa-garching.mpg.de\/Millennium\/Help?page=databases\/millimil\/mpahalo\">MPAHalo<\/a> dataset.<\/li>\n<\/ul>\n<li>92M and 116M 10D points (eps=20, minPts=5 and eps=50, minPts=2): generated using the <a href=\"https:\/\/github.com\/zakimjz\/IBMGenerator\">IBM Quest synthetic data generator<\/a>; however, the details of the generation have not been documented, so the datasets cannot be reproduced.<\/li>\n<\/ul>\n<h3>PDSDBSCAN<\/h3>\n<p>A subsampled version of the above Millennium Run dataset has also been used in the paper <a href=\"https:\/\/doi.org\/10.1109\/SC.2012.9\">A new scalable parallel DBSCAN algorithm using the disjoint-set data structure<\/a>, which describes and evaluates <a href=\"http:\/\/cucis.ece.northwestern.edu\/projects\/Clustering\/download_code_dbscan.html\">PDSDBSCAN<\/a> and is by the same main author as Pardicle, who also published a <a href=\"http:\/\/cucis.ece.northwestern.edu\/projects\/Clustering\/download_data.html\">50,000 10D point dataset<\/a> that is used in that paper as well.<\/p>\n<h3>G\u00f6tz, Bodenstein, Riedel: <a href=\"https:\/\/doi.org\/10.1145\/2834892.2834894\">HPDBSCAN: highly parallel DBSCAN<\/a><\/h3>\n<p>The Bremen 3D point cloud and Twitter 2D GPS locations are available as full and subsampled (small) datasets: <a href=\"http:\/\/doi.org\/10.23728\/b2share.7f0c22ba9a5a44ca83cdf4fb304ce44e\">DOI: 10.23728\/b2share.7f0c22ba9a5a44ca83cdf4fb304ce44e<\/a> (Note: the original publication refers to the dataset via a 
handle.net handle which no longer works).<\/p>\n<ul>\n<li>Twitter (dataset t): 16,602,137 2D points (eps=0.01, minPts=40). Note that this dataset contains some bogus artefacts (most likely Twitter spam with bogus GPS coordinates).\n<\/li>\n<li>Twitter small (dataset ts): 3,704,351 2D points (eps=0.01, minPts=40)<\/li>\n<li>Bremen (dataset b): 81,398,810 3D points (eps=100, minPts=10000)<\/li>\n<li>Bremen small (dataset bs): 2,543,712 3D points (eps=100, minPts=312)<\/li>\n<\/ul>\n<h3>Neukirchen: <a href=\"https:\/\/doi.org\/10.1007\/978-3-319-96271-9_16\">Elephant against Goliath: Performance of Big Data versus High-Performance Computing DBSCAN Clustering Implementations<\/a><\/h3>\n<p>The same Twitter small dataset as provided by G\u00f6tz et al. has been used with the same parameters.<\/p>\n<p><!-- Reminder: Grid-based approaches are supposed to be exponential w.r.t. dimensionality: investigate memory consumption w.r.t. dimensions. Grid-based approaches support Minkowski distances only (e.g. Manhattan distance, Euclidean distance, or Chebyshev distance). --><\/p>\n","protected":false},"excerpt":{"rendered":"<p>For evaluating implementations of the popular DBSCAN clustering algorithm, various publications use several datasets. Pointers to these datasets and information on parameters (e.g. normalisation, epsilon and minpts) are collected here. You are welcome to contact me if you have further (big) datasets that are good benchmarks for DBSCAN. 
Sarma et al.: \u03bcDBSCAN: An Exact Scalable [&hellip;]<\/p>\n","protected":false},"author":512,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[79],"tags":[],"class_list":["post-1658","post","type-post","status-publish","format-standard","hentry","category-research"],"_links":{"self":[{"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/posts\/1658","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/users\/512"}],"replies":[{"embeddable":true,"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/comments?post=1658"}],"version-history":[{"count":40,"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/posts\/1658\/revisions"}],"predecessor-version":[{"id":1941,"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/posts\/1658\/revisions\/1941"}],"wp:attachment":[{"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/media?parent=1658"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/categories?post=1658"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/tags?post=1658"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}