Helmut Neukirchen

PhD Defense Standards-based Models and Architectures to Automate Scalable and Distributed Data Processing and Analysis

Helmut Neukirchen, 7. October 2019

Shahbaz Memon successfully defended his PhD thesis in Computer Science on Standards-based Models and Architectures to Automate Scalable and Distributed Data Processing and Analysis. The thesis covers Scientific Workflows and middlewares for High-Performance Computing and High-Throughput Computing.

This PhD is an example of the collaboration between the Faculty of Industrial Engineering, Mechanical Engineering and Computer Science and Jülich Supercomputing Centre (JSC).

PhD candidate, opponents, dean, and PhD committee

Members of the PhD commitee were Morris Riedel, Helmut Neukirchen, and Matthias Book, opponents were Ramin Yahyapour and Robert Lovas. The head of faculty, Rúnar Unnþórsson, was steering the defense. While we have some on cultural diversity involved, we need to improve on gender diversity! More photos can be found on flickr.

Research, Teaching

European Researcher's Night: From the next generation supercomputer DEEP-EST to your smartphone -- real-time object detection using neural network

Helmut Neukirchen, 28. September 2019

The DEEP-EST research project is at Vísindavaka, part of the European Researcher's Night, in Reykjavik, 28. September 2019.

DEEP-EST Booth at European Researchers Night

Use the camera of your smartphone to detect objects in real-time. While neural networks are still best trained on a supercomputer, such as DEEP-EST with its Data Analysis Module, the trained neural network even runs in the browser of a smartphone. Bring your smartphone and objects such as apples, bananas or teddy bears to let your smartphone detect these objects.

Just open the following web page and allow your browser to use the camera: https://nvndr.csb.app/.

(Allow a few seconds for loading the trained model and initialisation.)

The used approach is Single Shot Detector (SSD) (the percentage shows how sure the neural network is about the classification) using the Mobilenet neural network architecture. The dataset used for training is COCO (Common Objects in Context), i.e. only objects of the labeled object classes contained in COCO will get detected. The Javascript code that is running in your browser uses Tensorflow Lite and its Object Detection API and model zoo.

If you want learn more about DEEP-EST, have a look at the poster below (click for PDF version):

Research

Research project European Open Science Cloud (EOSC)-Nordic starting

Helmut Neukirchen, 1. September 2019

University of Iceland was successful in a consortium applying for funding from the European Horizon 2020 research programme with the European Open Science Cloud (EOSC)-centric proposal EOSC-Nordic.

EOSC-Nordic aims to foster and advance the take-up of the European Open Science Cloud (EOSC) at the Nordic level by coordinating the EOSC-relevant initiatives taking place in Finland, Sweden, Norway, Denmark, Iceland, Estonia, Latvia, Lithuania, Netherlands and Germany. EOSC-Nordic aims to facilitate the coordination of EOSC relevant initiatives within the Nordic and Baltic countries and exploit synergies to achieve greater harmonisation at policy and service provisioning across these countries, in compliance with EOSC agreed standards and practices. By doing so, the project will seek to establish the Nordic and Baltic countries as frontrunners in the take-up of the EOSC concept, principles and approach. EOSC-Nordic brings together a strong consortium including e-Infrastructure providers, research performing organisations and expert networks, with national mandates with regards to the provision of research services and open science policy, and wide experience of engaging with the research community and mobilising national governments, funding agencies, international bodies and global initiatives and high-level experts on EOSC strategic matters.

A successful EOSC-Nordic will reinforce Nordic research area capability and competitiveness, create a profile of a leading knowledge based region, increase the ability of the region to attract talent and investments, enhance its appeal as a partner in cooperation, and strengthen the Nordic region and its efforts in the overall EOSC, through the creation of a cross-border cooperation model for Europe.

The project is coordinated by the Nordic e-Infrastructure Collaboration (NeIC) and the University of Iceland is one of the project participants. The University of Iceland's diverse team is lead by Ebba Þóra Hvannberg. Helmut Neukirchen and Morris Riedel contribute their knowledge with respect to e-Science, such as scalable, parallel machine learning, scientific workflows, and data federation. In addition to these researchers from the University's Computer Science department, experts from other departments of the University of Iceland contribute to EOSC-Nordic.

Project duration is 1st of September 2019 to 31st of August 2022. More information can be found on the EOSC-Nordic web page and also on my local page covering this research project.

EOSC Partners Group Photo

Research

12th Nordic Workshop on Multi-Core Computing (MCC2019)

Helmut Neukirchen, 30. August 2019

The objective of MCC is to bring together Nordic researchers and practitioners from academia and industry to present and discuss recent work in the area of multi-core computing. This year's edition is hosted by Blekinge Institute of Technology in Karlskrona, Sweden.

The scope of the workshop is both hardware and software aspects of multi-core computing, including design and development as well as practical usage of systems. The topics of interest include, but is not limited to, the following:

Architecture of multi-core processors, GPUs, accelerators, heterogeneous systems, memory systems, interconnects and on-chip networks
Parallel programming models, languages, environments
Parallel algorithms and applications
Compiler optimizations and techniques for multi-core systems
Hardware/software design trade-offs in multi-core systems
Operating system, middleware, and run-time system support for multi-core systems
Correctness and performance analysis of parallel hardware and software
Tools and methods for development and evaluation of multi-core systems

There are two types of papers eligible for submission. The first type is original research work and the second type is work already published in 2018 or later.

Participants submitting original work are asked to send an electronic version of the paper that does not exceed four pages using the ACM proceedings format, http://www.acm.org/publications/proceedings-template, to https://easychair.org/conferences/?conf=mcc20190.

The same URL is to be used should you want to present an already published paper as described above. In that case, you need to clearly specify that the paper is already published and where the paper has been published.

No proceedings will be distributed. Contributions will not disqualify subsequent publication in conferences or journals.

The conference web page is https://sites.google.com/view/mcc2019.

Important dates

Sep. 29 2019: Submission deadline
Oct. 27 2019: Author notification
Nov. 18 2019: Registration deadline
Nov. 27-28 2019: MCC Workshop

Research

PhD Defense GraphTyper: A pangenome method for identifying sequence variants at a population-scale

Helmut Neukirchen, 26. June 2019

Hannes Pétur Eggertsson successfully defended his PhD thesis in Computer Science on GraphTyper: A pangenome method for identifying sequence variants at a population-scale. I had the honor to steer this defense in my role as vice head of faculty.

As you notice, only men are occurring here. We need to improve on this! More pictures can be found on flickr.

Research, Teaching

Datasets for DBSCAN evaluation

Helmut Neukirchen, 20. June 2019

For evaluating implementations of the popular DBSCAN clustering algorithm, various publications use several datasets. Pointers to these datasets and information on paramaters (e.g. normalisation, epsilon and minpts) are collected here. You are welcome to contact me if you have further (big) datasets that are good benchmarks for DBSCAN.

Sarma et al.: μDBSCAN: An Exact Scalable DBSCAN Algorithm for Big Data Exploiting Spatial Locality

TODO: check in detail datasets used, but some are those datasets used in some of the other publications below, but "In addition, we have also used a few other real datasets: 3D Road Network (3DSRN) [32] contains vechicular GPS data; Household Power (HHP*) and KDDBIO145K (KDDB*) datasets are borrowed from UCI Repository [33]."

Gan, Tao: DBSCAN Revisited: Mis-Claim, Un-Fixability, and Approximation

Data normalized to [0, 10^5 ] for every dimension.

MinPts = 100, Epsilon = 5000 and higher. (Note: far too high value turning almost the entire dataset into a single cluster -- the mis-claim is on their side!).

Their preprocessed datasets

PAMAP2 (3,850,505 4D points),
Farm (3,627,086 5D points),
Houshold (2,049,280 7D points)

can be obtained from their webpage.

Mai, Assent, Jacobsen, Storgaard Dieu: Anytime parallel density-based clustering

Same household datasets used as by Gan, Tao.
Also PAMAP2 is used, but claimed to be 974,479 39D points whereas Gan and Tao reduced it to 4 dimensions using PCA, but claim to have 3,850,505 points.
In addition, the UCI Gas Sensor dataset by Fonollosa et al. is used: 4,208,261 16D points (DETAILS NOT PROVIDED IN PAPER).

Kriegel, Schubert, Zimek: The (black) art of runtime evaluation: Are we comparing algorithms or implementations?

Same PAMAP2, Farm and household datasets used as by Gan, Tao (including also smaller epsilon values as these make more sense).
In addition, for higher dimensional data, the Amsterdam Library of Object Images (ALOI) dataset from Geusebroek et al is used, namely the 110250 HSV/HSB color histograms provided on the ELKI Multi-View Data Sets webpage. Namly, the eight dimensions (two divisions per HSV color component) dataset (I assume, this is the 2x2x2 dataset) with epsilon=0.01 and minPts=20.

Patwary, Satish, Sundaram, Manne, Habib, Dubey: Pardicle: parallel approximate density-based clustering

Halo and galaxy formation datasets in astrophysics, taken from the database on Millennium Run described by Lemson and the Virgo Consortium. (Online access via wget to the data stored in a SQLdatabase is possible.) These datasets contain far more dimensions than used by Patwary et al., so it is unclear which subsets of the dimensions have been used (DETAILS NOT PROVIDED IN PAPER).

1,056M 8D points (eps=0.005, minPts=2): This MPAGalaxiesDeLucia2006a (md) dataset (attributed by Patwary et al. to De Lucia et al. The hierarchical formation of the brightest cluster galaxies) probably refers to the DeLucia2006adataset
1,015M 8D points (eps=0.005, minPts=2): This DGalaxiesBower2006a (db) dataset (attributed by Patwary et al. to Bower et al. Breaking the hierarchy of galaxy formation) probably refers to the Bower2006a dataset.
1,015M 8D points (eps=0.005, minPts=2): For this MPAGalaxiesBertone2007 (mb) dataset (attributed by Patwary et al. to Bertone et al. The recycling of gas and metals in galaxy formation: predictions of a dynamical feedback model), it is not obvious to which dataset from the database on Millennium Run it refers to.
761M 9D points (eps=0.001, minPts=2). This MPAHaloTreesMhalo (mm) dataset (attributed by Patwary et al. to Guo et al.; Galaxy formation in WMAP1 and WMAP7 cosmologies) probably refers to the MPAHalo dataset.

92M and 116M 10 D points (eps=20, minPts=5 and eps=50, minPts=2): generated using the IBM Quest synthetic data generator, however details of generation have not been documented, so it cannot be reproduced.

PDSDBSCAN

A subsampled version of the above Millenium Run dataset has also been used in the paper A new scalable parallel DBSCAN algorithm using the disjoint-set data structure by the same main author as Pardicle describing and evaluating PDSDBSCAN who published also a 50,000 10D point dataset used also in that paper.

Götz, Bodenstein, Riedel: HPDBSCAN: highly parallel DBSCAN

The Bremen 3D point cloud and Twitter 2D GPS locations are available as full and subsampled (small) datasets: DOI: 10.23728/b2share.7f0c22ba9a5a44ca83cdf4fb304ce44e (Note: the original publication refers to the dataset via a handle.net handle which does not work anymore).

Twitter (dataset t): 16,602,137 2D points (eps=0.01, minPts=40). Note that this dataset contains some bogus artefacts (most likely Twitter spam with bogus GPS coordinates).
Twitter small (dataset ts): 3,704,351 2D points (eps=0.01, minPts=40)
Bremen (dataset b): 81,398,810 3D points (eps=100, minPts=10000)
Bremen small (dataset bs): 2,543,712 2D points (eps=100, minPts=312)

Neukirchen: Elephant against Goliath: Performance of Big Data versus High-Performance Computing DBSCAN Clustering Implementations

The same Twitter small dataset as provided by Götz et al. has been used with the same parameters.

Research

Towards Exascale Computing: European DEEP-EST research project

Helmut Neukirchen, 17. May 2019

The DEEP-EST ("Dynamical Exascale Entry Platform - Extreme Scale Technologies") project is funded as part of the European Commission's Horizon 2020 ambitious Future and Emerging Technologies (FET) programme in order to create the blueprints of the next generation ("pre-exascale") supercomputer hardware and software.
The current goal in supercomputing is to reach exascale performance: a quintillion in American culture or a trillion in European culture or 10 to the power of 18 floating point arithmetic operations per second (FLOPS). These are needed to drive large-scale scientific simulations and big data analytics forward. Current supercomputers are able to achieve 0.2 exaFLOPS (or 200 petaFLOPS or 200 thousand teraFLOPS) (for comparison: if you have a very high-end personal computer, it's CPU can maybe compute half a teraFLOP).

Exascale computing is some sort of "wall", i.e. it is hard to reach it and in particular to go beyond anytime soon. While according Moore's law the number of transistors in a CPU doubles every two years, the performance of a CPU does not anymore double that fast (the transistors go into more cores and more caches). Currently, the only way to boost performance is to use not generic CPUs, but specialised "accelerators", e.g. graphical processors (GPUs), but also accelerators in other parts of a supercomputer, e.g. the network fabric that inter-connects the many CPU nodes of a supercomputer or the storage. DEEP-EST therefore suggest a Modular Supercomputing Architecture (MSA) where the supercomputer is composed of multiple modules, each being specialised in a particular domain, e.g. a GPU-heavy booster for computations that scale well and are suitable for GPUs, a "normal" CPU cluster module for applications that do not scale that well, a data analysis module having hardware specialised for machine learning.

Talking about accelerators: one of our project partners is CERN and the project meeting took place there: we were lucky enough that the Large Hadron Collider (LHC) and particle accelerator is currently in maintenance/upgrade phase, so we where able to see one of the detectors (when it is running, the collisions create lots of radiation). -- Find the human in the picture below:

LHC detector

DEEP-EST has reached the middle of the project duration and the first module, the CPU cluster module has been installed. Since an additional barrier in exascale computing is energy, which also means heat created by the computers that need to be cooled down, DEEP-EST is also working on novel cooling solutions, e.g. water cooling. While typical data centres use air cooling, i.e. extra energy is needed to cool down air that is then blown into the racks, the DEEP-EST water cooling allows to use water at normal temperatures and pipe it through those components that create most of the heat. This will warm up the water and the energy contained in this warm water can then be even used for something else. I.e. instead of needing extra energy from cooling, the DEEP-EST warm water cooling allows to even gain energy (of course, this is energy inserted in the system by the electrical power that the supercomputing components consume). You see the water pipes of the newly installed CPU cluster module in the middle rack below:

Rack with water cooling

Talking about energy efficiency: another trend are field-programmable gate arrays (FPGAs) that are more energy efficient than CPUs or GPUs. These are as well used in one of the specialised DEEP-EST modules.

The downside of the usage of accelerators is that they need special programming. University of Iceland is as DEEP-EST member developing machine learning software that exploits the DEEP-EST Modular Supercomputing Architecture (MSA) as good as possible. This includes clustering (DBSCAN) and classification via Support Vector Machines (SVMs) and Deep Learning/Deep Neural Networks.

You can follow the progress this project on the DEEP-EST web site and Twitter channel.

Research

Scientists for Future / Fridays for Future / Protests for more climate protection

Helmut Neukirchen, 16. March 2019

Climate change is real and will affect us all. So it is good that the Fridays for Future protests have reached Iceland. Scientists in German-speaking countries made their statement that these concerns are justified and supported by the best available science: The current measures for climate, biodiversity, forest, marine, and soil protection are far from sufficient.

I am participating in the eSTICC (eScience Tools for Investigating Climate Change at High Northern Latitudes) NordForsk-funded research project. As part of the project an impressing (or depressing) simulation of the Greenland ice sheet and climate change has been created (the simulations ran on a supercomputer located in Iceland) that shows the surface air temperature in the Arctic and Greenland glacier ice thickness, e.g. when will the Arctic sea ice be gone during summer (we got used to already now) and during winter (=no ice at the North pole in winter -- imagine this) according to the simulations:

We all should act:

A carbon tax would be a strong incentive for all sectors to reduce CO2, so urge politicians to introduce a carbon tax. (The 2018 Nobel Prize in Economic Sciences was about the effects of a carbon tax on climate while still allowing economic growth.)
No joke: Consume less meat and milk products! What many do not have in mind is the climate impact of animal farming (for meat and dairy): e.g. cows emit a lot of methane (which has 28 times the impact on temperature of a carbon dioxide emission of the same mass).
And finally: reduce flights! One transatlantic flight (back and forth) is in terms of CO2 comparable to one year of average car usage. While living on a far-away island like Iceland involves flying, at least compensate by CO2 offsetting, e.g. via atmosfair or myclimate.

Research

1st Workshop on Evaluation and Experimental Design in Data Mining and Machine Learning (EDML 2019)

Helmut Neukirchen, 22. November 2018

My experience with evaluating implementations of machine learning algorithms is that the content of many accepted research papers cannot be reproduced, in particular because the used implementations are not open-source and the authors typically do not even answer emails requesting to use their implementations. This is one aspect of the

1st Workshop on Evaluation and Experimental Design in Data Mining and Machine Learning (EDML 2019)
Workshop at the SIAM International Conference on Data Mining (SDM19), May 2‑4, 2019

Description

A vital part of proposing new machine learning and data mining approaches is evaluating them empirically to allow an assessment of their capabilities. Numerous choices go into setting up such experiments: how to choose the data, how to preprocess them (or not), potential problems associated with the selection of datasets, what other techniques to compare to (if any), what metrics to evaluate, etc. and last but not least how to present and interpret the results. Learning how to make those choices on-the-job, often by copying the evaluation protocols used in the existing literature, can easily lead to the development of problematic habits. Numerous, albeit scattered, publications have called attention to those questions and have occasionally called into question published results, or the usability of published methods. At a time of intense discussions about a reproducibility crisis in natural, social, and life sciences, and conferences such as SIGMOD, KDD, and ECML/PKDD encouraging researchers to make their work as reproducible as possible, we therefore feel that it is important to bring researchers together, and discuss those issues on a fundamental level.

An issue directly related to the first choice mentioned above is the following: even the best-designed experiment carries only limited information if the underlying data are lacking. We therefore also want to discuss questions related to the availability of data, whether they are reliable, diverse, and whether they correspond to realistic and/or challenging problem settings.

Topics

In this workshop, we mainly solicit contributions that discuss those questions on a fundamental level, take stock of the state-of-the-art, offer theoretical arguments, or take well-argued positions, as well as actual evaluation papers that offer new insights, e.g. question published results, or shine the spotlight on the characteristics of existing benchmark data sets.
As such, topics include, but are not limited to

Benchmark datasets for data mining tasks: are they diverse/realistic/challenging?
Impact of data quality (redundancy, errors, noise, bias, imbalance, ...) on qualitative evaluation
Propagation/amplification of data quality issues on the data mining results (also interplay between data and algorithms)
Evaluation of unsupervised data mining (dilemma between novelty and validity)
Evaluation measures
(Automatic) data quality evaluation tools: What are the aspects one should check before starting to apply algorithms to given data?
Issues around runtime evaluation (algorithm vs. implementation, dependency on hardware, algorithm parameters, dataset characteristics)
Design guidelines for crowd-sourced evaluations

The workshop will feature a mix of invited speakers, a number of accepted presentations with ample time for questions since those contributions will be less technical, and more philosophical in nature, and a panel discussion on the current state, and the areas that most urgently need improvement, as well as recommendation to achieve those improvements. An important objective of this workshop is a document synthesizing these discussions that we intend to publish at a prominent venue.

Submission

Papers should be submitted as PDF, using the SIAM conference proceedings style, available at https://www.siam.org/Portals/0/Publications/Proceedings/soda2e_061418.zip?ver=2018-06-15-102100-887. Submissions should be limited to nine pages and submitted via Easychair at https://easychair.org/conferences/?conf=edml19.

Important dates

Submission deadline: February 15, 2019
Notification: March 15, 2019
SDM pre-registration deadline: April 2, 2019
Camera ready: April 15, 2019
Conference dates: May 2-4, 2019

Further info

Web page

Uncategorized

11th Nordic Workshop on Multi-Core Computing (MCC2018)

Helmut Neukirchen, 19. September 2018

The objective of MCC is to bring together Nordic researchers and practitioners from academia and industry to present and discuss recent work in the area of multi-core computing. This year's edition is hosted by the Chalmers University of Technology (Gothenburg, Sweden).

There are two types of papers eligible for submission. The first type is original research work and the second type is work already published in 2017 or later. Participants submitting original work are asked to send an electronic version of the paper that does not exceed four pages using the ACM proceedings format, http://www.acm.org/publications/proceedings-template, to https://easychair.org/conferences/?conf=mcc2018. The same URL is to be used should you want to present an already published paper as described above. In that case, you need to clearly specify that the paper is already published and where the paper has been published.

No proceedings will be distributed. Contributions will not disqualify subsequent publications in conferences or journals. (This is a real "work"shop to facilitate discussion.)

Call for Papers (CfP).

The conference web page is https://sites.google.com/site/mccworkshop2018.

Full Paper Submission: October 8th, 2018
Author Notification: November 2nd, 2018
Registration Deadline: November 22nd, 2018
MCC Workshop: November 29th - 30th, 2018

The workshop will be held at Chalmers University of Technology, Gothenburg, Sweden.

Uncategorized

Contact

Dr. Helmut Neukirchen

Professor of Computer Science and Software Engineering

Faculty of Industrial Engineering, Mechanical Engineering and Computer Science

Deputy head of faculty (Autumn 2024-Spring 2026)

University of Iceland
Department of Computer Science

Gróska building, 3rd floor (stairway A or B), room 306
Bjargargata 1
102 Reykjavik
Iceland

E-Mail: helmut at hi. is
(Encrypted e-mail welcome: my public PGP key, also available at key servers -- X.509 based S/MIME encryption possible on request.)
Phone (mobile): 615 2554
University of Iceland
Meta

Professor of Computer Science & Software Engineering, Uni Iceland

Sarma et al.: μDBSCAN: An Exact Scalable DBSCAN Algorithm for Big Data Exploiting Spatial Locality

Gan, Tao: DBSCAN Revisited: Mis-Claim, Un-Fixability, and Approximation

Mai, Assent, Jacobsen, Storgaard Dieu: Anytime parallel density-based clustering

Kriegel, Schubert, Zimek: The (black) art of runtime evaluation: Are we comparing algorithms or implementations?

Patwary, Satish, Sundaram, Manne, Habib, Dubey: Pardicle: parallel approximate density-based clustering

PDSDBSCAN

Götz, Bodenstein, Riedel: HPDBSCAN: highly parallel DBSCAN

Neukirchen: Elephant against Goliath: Performance of Big Data versus High-Performance Computing DBSCAN Clustering Implementations

1st Workshop on Evaluation and Experimental Design in Data Mining and Machine Learning (EDML 2019) Workshop at the SIAM International Conference on Data Mining (SDM19), May 2‑4, 2019

Description

Topics

Submission

Important dates

Further info

Contact

University of Iceland

Meta

1st Workshop on Evaluation and Experimental Design in Data Mining and Machine Learning (EDML 2019)
Workshop at the SIAM International Conference on Data Mining (SDM19), May 2‑4, 2019