Comparison of Big Data and High-Performance Computing platforms and applications (since 2017)

Big data analysis requires parallel processing. While High-Performance Computing (HPC) is the established technology for huge, tightly-coupled (rather than embarrassingly parallel) computational problems, big data processing frameworks such as Apache Hadoop and Apache Spark are highly praised contenders for huge parallel processing problems. To help decide whether HPC or big data platforms are better suited for big data problems, this project investigates and compares the two paradigms and their platforms.
As a case study, the run-time performance and scalability of different implementations of the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) clustering algorithm are investigated and compared.
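
For readers unfamiliar with the case-study algorithm, the following is a minimal sketch of textbook DBSCAN (following Ester et al., 1996) in plain Python. It is a naive O(n²) illustration of the clustering logic only, not one of the big data or HPC implementations benchmarked in this project; the names dbscan, eps, and min_pts are chosen here for illustration.

```python
import numpy as np

def dbscan(points, eps, min_pts):
    """Naive O(n^2) DBSCAN: returns a cluster id per point, -1 for noise."""
    n = len(points)
    labels = [None] * n                     # None = not yet visited
    next_cluster = 0

    def neighbours(i):
        # Indices of all points within distance eps of point i (incl. i).
        dists = np.linalg.norm(points - points[i], axis=1)
        return np.flatnonzero(dists <= eps).tolist()

    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1                  # noise (may later become a border point)
            continue
        labels[i] = next_cluster            # i is a core point: start a new cluster
        queue = seeds
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = next_cluster    # former noise: claimed as border point
            elif labels[j] is None:
                labels[j] = next_cluster
                j_seeds = neighbours(j)
                if len(j_seeds) >= min_pts: # j is also a core point: expand cluster
                    queue.extend(j_seeds)
        next_cluster += 1
    return labels

if __name__ == "__main__":
    pts = np.array([[0., 0.], [0., 1.], [1., 0.],
                    [10., 10.], [10., 11.], [50., 50.]])
    print(dbscan(pts, eps=2.0, min_pts=2))  # -> [0, 0, 0, 1, 1, -1]
```

The two parameters, eps (neighbourhood radius) and min_pts (density threshold), are the standard DBSCAN parameters shared by all implementations; how efficiently the neighbourhood queries and the cluster expansion are parallelised is precisely where the big data and HPC implementations differ.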

Publications

Helmut Neukirchen.
Elephant against Goliath: Performance of Big Data versus High-Performance Computing DBSCAN Clustering Implementations. Simulation Science. First International Workshop, SimScience 2017, Göttingen, Germany, April 27–28, 2017, Revised Selected Papers. Communications in Computer and Information Science (CCIS), volume 889, DOI: 10.1007/978-3-319-96271-9_16, Springer 2018.

Helmut Neukirchen.
Performance of Big Data versus High-Performance Computing: Some Observations.
Extended Abstract. Clausthal-Göttingen International Workshop on Simulation Science, April 27–28, 2017, Göttingen, Germany. Proceedings of Accepted Abstracts, Clausthal-Göttingen Simulation Science Center, 2017, pp. 93–95.

Helmut Neukirchen.
Survey and Performance Evaluation of DBSCAN Spatial Clustering Implementations for Big Data and High-Performance Computing Paradigms. Technical Report VHI-01-2016, Engineering Research Institute, University of Iceland, Reykjavik, Iceland, November 2016.