Category: Research

Software Engineering for High-Performance Computing Survey

Helmut Neukirchen, 10. November 2017

If you are a member of the HPC community, i.e. you have some experience in HPC (either in an HPC expert role at a computing centre or in a user role, e.g. as a scientist), please fill out the questionnaire below, where we ask about your usage of software development best practices. Filling out the survey takes just 5 minutes:

Software Engineering for High-Performance Computing Survey

If you want to advertise the survey, below is a slide and a flyer:
Slide (PPT)
Flyer (PPT) / Flyer (PDF)

Nordic Center of Excellence eSTICC at Arctic Circle Assembly 2017

Helmut Neukirchen, 26. September 2017

The NordForsk-funded Nordic Center of Excellence (NCoE) eSTICC (eScience Tools for Investigating Climate Change at High Northern Latitudes) is holding a breakout session at the Arctic Circle Assembly 2017 in Reykjavik. The University of Iceland is a member of eSTICC, and this session has been organised and is chaired by Helmut Neukirchen from the Faculty of Industrial Engineering, Mechanical Engineering and Computer Science of the University of Iceland.

This breakout session presents results from top research groups working in the fields of climate research supported by eScience, such as extensive simulations of climate change. The session takes place on Saturday, October 14th at 17:30-19:00 in room Hafnarkot on the first floor of Harpa Conference Centre. The talks can be found on the program of the Arctic Circle website and here.

Successful application for European H2020 funding (FET PROACTIVE – HIGH PERFORMANCE COMPUTING): Co-design of HPC systems and applications

Helmut Neukirchen, 29. January 2017

A consortium including the University of Iceland participated successfully in the European Commission's Horizon 2020 research program call FET PROACTIVE – HIGH PERFORMANCE COMPUTING: Co-design of HPC systems and applications. The University of Iceland's team is led by Helmut Neukirchen together with Morris Riedel. The secured funding for the University of Iceland (387 860 EUR for the three-year project duration starting from 1st July 2017) will, among other things, be used to hire a researcher who will perform ambitious research by providing scientific parallel applications from the field of machine learning for extreme-scale/pre-exascale high-performance computing, i.e. creating next-generation software for next-generation supercomputing hardware.

More details can be found here.

Deadline extension: Clausthal-Göttingen International Workshop on Simulation Science

Helmut Neukirchen, 22. January 2017

Update: deadline extended until 3 February 2017!

Due to the fast development of information technology, the understanding of phenomena in the natural, engineering, economic and social sciences increasingly relies on computer simulations. Simulation-based analysis and engineering techniques are traditionally a research focus of Clausthal University of Technology and the University of Göttingen, which is especially reflected in their joint interdisciplinary research cluster "Simulation Science Center Clausthal-Göttingen". In this context, the first "Clausthal-Göttingen International Workshop on Simulation Science" aims to bring together researchers and practitioners from both industry and academia to report on the latest advances in simulation science.

The workshop considers the broad area of modeling & simulation with a focus on:

  • Simulation and optimization in networks:
    Public & transportation networks, computer & sensor networks, queuing networks, Internet of Things (IoT) environments, simulation of uncertain optimization problems, simulation of complex stochastic systems
  • Simulation of materials:
    Development and applications of computational techniques in material and process simulation, simulation at micro (atomistic), meso and macro (continuum) scales including scale bridging, diffusive, convective transport and chemical processes in materials, simulation of granular matter
  • Distributed simulations:
    Technology enabler for distributed simulation (e.g., simulation support for vector and parallel computing architectures, grid-based systems and cloud-based systems), methods for distributed simulation (e.g., agent-based simulation, multi-level simulation, and simulation for big data analytics, fusion and mining), application examples (e.g., simulation-based quality assurance and high-energy physics)

27 - 28 April 2017, Göttingen, Germany

Extended Abstract (2-3 pages) Submission: 20 Jan 2017

Workshop web page

Call for Papers: Download

Is Supercomputing dead in the age of Big Data processing?

Helmut Neukirchen, 9. November 2016

In the age of Big Data and big data frameworks such as Apache Spark, one might be tempted to think that supercomputing/high-performance computing (HPC) is obsolete. But in fact, Big Data processing and HPC are different, and one platform cannot replace the other. I outlined this in a presentation at the Science Day of the University of Iceland's School of Engineering and Natural Sciences on Saturday, 29 October 2016. (Note that there is nowadays some convergence, and there is a graph-processing top-500 benchmark list intended to resemble the less CPU-intensive workloads in HPC.)

Furthermore, the available open-source implementations of algorithms (e.g. clustering using DBSCAN) are currently much faster on HPC, and the available Big Data implementations do in fact not even scale beyond a handful of nodes. Results of a case study performed during my guest research stay with the research group High Productivity Data Processing of the Federated Systems and Data division at the Jülich Supercomputing Centre (JSC) are published in this Technical Report.

One of the reviewers of the 1st IEEE International Workshop on Big Spatial Data (BSD 2016) seems not to have liked the message that Big Data needs to do its homework to match HPC, hence my paper was rejected. While I assume that an HPC conference (such as ISC) might accept it, it would be nice to get the message to the Big Data community: I might submit it to The 6th International Workshop on Parallel and Distributed Computing for Large Scale Machine Learning and Big Data Analytics or later to BDCloud 2017: The 7th IEEE International Conference on Big Data and Cloud Computing. Implementations whose source is not public may also be worth considering: A novel scalable DBSCAN algorithm with Spark or A Parallel DBSCAN Algorithm Based on Spark (provided we get access to the implementations; the lacking possibility of reproducing/verifying scientific results is another story covered in my Technical Report). Also, I might add threats to validity (such as construct, internal and external validity [Carver, J., VanVoorhis, J., Basili, V., August 2004. Understanding the impact of assumptions on experimental validity.]).

Update from 9.11.2016: Erich Schubert (thanks!) pointed me to this related article "The (black) art of runtime evaluation: Are we comparing algorithms or implementations?" (DOI: 10.1007/s10115-016-1004-2), which supports my findings. A statement from that article on k-means: "Judging from the measured runtime and even assuming zero network overhead, we must assume that a C++ implementation using all cores of a single modern PC will outperform a 100-node cluster easily." For DBSCAN, they show that a C++ implementation is one order of magnitude faster than the Java ELKI (which confirms my measurements concerning the C++ HPDBSCAN and the Java ELKI) on the dataset they used. They also support my claim that the implementation matters: "Good implementations with index accelerations would process this data set in less than 2 seconds, whereas the fastest linear scan implementation took over 90 seconds, and naïve implementations would frequently require over 100 times the best runtime. But not every index we evaluated was implemented correctly, and sometimes an index was even slower than the linear scan. Between different implementations using linear scan, we can observe runtime differences of more than two orders of magnitude. Comparing the fastest implementation measured (optimized C++ code for this task) to the slowest implementation (outdated versions of Weka), we observe four orders of magnitude: less than two seconds instead of over four hours."

Some notes on using a Spark cluster

Helmut Neukirchen, 18. August 2016

The following notes are mainly for my personal use referring to the Spark 1.6/YARN cluster that I access, but maybe they are helpful for you as well...

Upload to HDFS

By default (i.e. used implicitly by all HDFS operations), HDFS paths are relative to your HDFS home directory: it needs to be created first by the administrator!
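
For reference, the administrator would typically create it along these lines (a sketch assuming the usual /user/<username> layout; user and group names are placeholders):
hdfs dfs -mkdir -p /user/username
hdfs dfs -chown username:username /user/username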

While piping through SSH should work ( cat test.txt | ssh username@masternode "hdfs dfs -put - hadoopFoldername/" ), it is reported to be slow -- I never checked this, but as I used rather small data anyway, I instead did an scp to the local file system of the master node, followed by an hdfs dfs -put:
scp localFile username@masternode:
hdfs dfs -put twitterSmall.csv Twitter

Concatenate HDFS files (all inside an HDFS directory) and store in local file system (without sorting)

hdfs dfs -getmerge HdfFolderContainingSplitResultFiles LocalFileToBeCreated

Note that Spark does not overwrite output files in HDFS by default. Either take care when you re-run jobs that the old output files have been (re-)moved (see the command below), or allow overwriting in the Spark conf of your program:  conf.set("spark.hadoop.validateOutputSpecs","false")
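
For the first option, the old output directory can simply be removed in HDFS before re-running the job (the path is just an example):
hdfs dfs -rm -r -skipTrash twitterSmall.out_with_912_partitions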

Debugging

  1. See http://spark.apache.org/docs/latest/running-on-yarn.html
  2. Use spark-submit --verbose
  3. If executor processes are killed, this is mainly due to insufficient RAM (garbage collection takes too long, thus timeouts occur, or there are plain out-of-memory/OOM exceptions). In this case, the log of the driver on the spark-submit console only shows "exit code 143"; the details need to be found in the logs of the nodes/executors. This may not be possible via the Web UI because the executor nodes are firewalled -- in this case use:
    yarn logs -applicationId application_1470137500465_0147
    (App ID to be taken from the ID column in the Cluster Web UI. Works only for completed runs, not the current run.) In these logs, you can then search for java.lang.OutOfMemoryError: GC overhead limit exceeded or java.lang.OutOfMemoryError: Java heap space (see the example after this list).
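
For example, to scan the aggregated logs of a completed run directly for out-of-memory errors (the application ID is just the example from above), something along these lines should work:
yarn logs -applicationId application_1470137500465_0147 | grep -A 2 "java.lang.OutOfMemoryError"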

Performance tuning

  1. Note that due to the HDFS block size of 128 MB, partitions of this size are created by default when reading data. To enforce a higher number of partitions/higher parallelism, use the optional number-of-partitions parameter already at the file read stage (many other RDD-creating operations support it as well).
  2. Some introduction: https://www.mapr.com/blog/resource-allocation-configuration-spark-yarn
    http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
    (In particular: more than 5 cores per executor is said to lead to bad HDFS throughput. Note that an "executor" is not identical to a "node"; thus, instead of running one executor with 24 cores on one node, rather run 4 executors with 5 cores or 8 executors with 3 cores on each node! Note that then, however, the overall memory of a node needs to be divided by the number of executors per node, e.g. 5 GB per executor with 8 executors per node on a 40 GB RAM node. A spark-submit sketch of such a configuration follows after this list.)
  3. Config for RAM-intensive jobs (= 1 core per executor only & 1 core per node only, using 40 GB heap space and 2 GB overhead for Spark/YARN itself => on each of the 38 nodes only one core is used, which thus can make use of all available RAM); in addition, increase timeouts and message size:
    spark-submit --conf "spark.network.timeout=300s" --conf "spark.akka.frameSize=2000" --driver-memory 30g --num-executors 38 --executor-cores 1 --conf "yarn.nodemanager.resource.cpu-vcores=1" --executor-memory 40g --conf "spark.yarn.executor.memoryOverhead=2000" --conf "spark.driver.cores=4" --conf "spark.driver.maxResultSize=0"
    (Note: not sure about the driver memory and cores: this seems to have no influence -- is it too late to set it here?)
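
As a sketch of the balanced configuration described in item 2 (the numbers assume the 38-node cluster with 24 cores and 40 GB RAM per node used here: 8 executors per node * 38 nodes = 304 executors with 3 cores each, and roughly 40 GB / 8 = 5 GB per executor split into 4 GB heap plus YARN overhead; class and jar names are placeholders):
spark-submit --num-executors 304 --executor-cores 3 --executor-memory 4g --conf "spark.yarn.executor.memoryOverhead=1000" --class YourMainClass --master yarn yourApp.jar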

DBSCAN evaluation

Helmut Neukirchen, 17. August 2016

This post is used to document some DBSCAN command line parameters used in an evaluation of DBSCAN implementations. Once a paper referencing it is published, the data will go to EUDAT B2SHARE and thus get a persistent handle.

Conversion of HDF5 file into CSV/SSV

h5dump -d "/DBSCAN" -o out.txt twitterSmall.h5
Yields lines (12,0): 53.3243, -1.12341,
Remove Id in first column:
cut out.txt -d ':' -f 2 >out.txt2
Yields lines 53.3243, -1.12341,
Remove extra comma at end of line
cut out.txt2 -d ',' -f 1,2 >out.csv
It seems that the first line is just an empty line; remove it using:
tail -n +2 out.csv >twitterSmall.csv

The mraad DBSCAN implementation expects SSV format with IDs (i.e. remove brackets after h5dump run)
cut out.txt -c 5- > out_withoutleadingbracket.txt
cut out_withoutleadingbracket.txt -d ':' -f 1,2 --output-delimiter=',' > out3.txt
cut out3.txt -d ')' -f 1,2 --output-delimiter=',' > out4.txt
cut out4.txt -d ',' -f 1,4,5 --output-delimiter=',' > twitterBig_withIds.csv
cut -d ',' twitterBig_withIds.csv -f 1-3 --output-delimiter=' ' > twitterBig_withIds.ssv_with_doublespaces
cut -d ' ' -f 1,3,5 twitterBig_withIds.ssv_with_doublespaces >twitterBig_withIds.ssv

The Dianwei Han DBSCAN expects SSV without IDs: remove Id from mraad format:
cut twitterSmall.ssv -f 2,3 -d ' ' >twitterSmallNoIds.ssv
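
A quick sanity check that the conversions produced the expected column layouts (just inspecting the first lines; assumes the twitterSmall variants of the files created above):
head -2 twitterSmall.csv twitterSmall.ssv twitterSmallNoIds.ssv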

Running on Spark cluster

A single "assembly" jar needs to be created. In some cases, the sbt build file does not match the JVM/Scala/library versions and thus, minor adjustments were needed.

Find out optimal number of threads (cores) per executor for RDD-DBSCAN

Typically, 5 is recommended to avoid problems with parallel HDFS access by the same JVM; candidates tried here: 3, 4, 6 or more(?). Note we have 39*24 = 936 cores, but we should leave a few cores for the master and other stuff.

912 cores, 3 cores per executor process:
spark-submit --driver-memory 2g --driver-cores 3 --num-executors 304 --executor-cores 3 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions 25000 0.01 40 : 12mins, 55sec with 912 partitions enforced in sc.textFile(src) and no repartition to a single file when saving result output (repartition to single file would crash process that has to do this "reduce" step).

932 cores, 4 cores per executor process:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_233executors_4coreseach 25000 0.01 40 : 11mins, 3sec with 912 partitions enforced in sc.textFile(src) and no repartition to a single file when saving result output.

912 cores, 6 cores per executor process:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 152 --executor-cores 6 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_233executors_4coreseach 25000 0.01 40 : 11mins, 38sec
928 cores = 116 executors * 8 cores per executor process:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 116 --executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_116executors_8coreseach 25000 912 0.01 40 : 10mins, 26sec
912 cores, 76 executors * 12 cores per executor process:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 76 --executor-cores 12 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_76executors_12coreseach 25000 912 0.01 40 : 12mins, 4sec
924 cores = 42 executors * 22 cores per executor process:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 42 --executor-cores 22 --executor-memory 10gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_42executors_22coreseach 25000 912 0.01 40 : 13mins, 58sec

Trying to find out optimal number of partitions for RDD-DBSCAN

spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_233executors_4coreseach 25000 0.01 40 : 11mins, 3sec with 912 partitions enforced in sc.textFile(src) and no repartition to a single file when saving result output.
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_456_partitions_233executors_4coreseach 25000 0.01 40 : 12mins, 16sec
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_228_partitions_233executors_4coreseach 25000 0.01 40 : 11mins, 17sec
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_114_partitions_233executors_4coreseach 25000 0.01 40 : 11mins, 57sec
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_57_partitions_233executors_4coreseach 25000 0.01 40 : 10mins, 22sec
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_28_partitions_233executors_4coreseach 25000 0.01 40 : 11mins, 29sec
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 10 --executor-cores 6 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_57_partitions_10executors_6coreseach 25000 0.01 40 : 11mins, 49sec
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 2 --executor-cores 6 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_12_partitions_2executors_6coreseach 25000 0.01 40 : 21mins, 20sec
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 4 --executor-cores 6 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_24_partitions_24executors_6coreseach 25000 24 0.01 40 : 17mins, 11sec
Note that there is an imbalance in one of the stages (stage 3 at line 127): while the median task in this stage takes 22s, the longest task takes 3.5 minutes! But with a smaller partition size, this maybe becomes less imbalanced and thus slightly faster? (Or is this rather less overhead?)

Experiment with partition size (MaxPoints) parameter

Investigate whether this results in a different amount of noise being filtered out!

TwitterSmall: 3 704 351 points.
Lower bound of partition size that is possible to achieve with the given eps is 25 000 (=each partition has less than 25 000 points, the majority much less, maybe about 9000). 3 704 351 / 25 000 means that at least 148 partitions are created, with partition size of 9000: 3 704 351 / 9 000 = more than 400 partitions.
Or: if we want to have 912 partitions (=slightly less than the available cores): 3 704 351 / 912 = 4061 points per partition would be best! TODO: try this value! (Note: if the resulting partitioning rectangle reaches size 2*eps, no smaller partitioning is possible and in this case, this rectangle will contain more points than specified.)
MaxPoints=4061 (a couple of "cannot split" messages occurred):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 6gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_233executors_4coreseach_4061_epsmaxPoints 4061 912 0.01 40 : 19mins, 17sec
MaxPoints=9000 (a couple of "cannot split" messages occurred):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 116 --executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_116executors_8coreseach_9000maxPoints 9000 912 0.01 40 : 13mins, 43sec
MaxPoints=20000 (Can't split: (DBSCANRectangle(51.4999988488853,-0.13999999687075615,51.51999884843826,-0.11999999731779099) -> 21664) (maxSize: 20000)
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 116 --executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_116executors_8coreseach_20000maxPoints 20000 912 0.01 40 : 14mins, 27sec
MaxPoints=25000:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 116 --executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_116executors_8coreseach_25000maxPoints 25000 912 0.01 40 : 11mins, 15sec
MaxPoints=30000:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_233executors_4coreseach_30000maxPoints 30000 912 0.01 40 : 9mins, 31sec
MaxPoints=32500:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_233executors_4coreseach_32500maxPoints 32500 912 0.01 40 : 10mins, 53sec but second run: 9mins, 6sec third run: 9mins, 13sec, fourth run: 9mins, 20sec
MaxPoints=35000:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 6gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_233executors_4coreseach_35000_epsmaxPoints 35000 912 0.01 40 : 9mins, 42sec
MaxPoints=40000:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 116 --executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_116executors_8coreseach_40000maxPoints 40000 912 0.01 40 : 10mins, 56sec
MaxPoints=50000:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 116 --executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_116executors_8coreseach_50000maxPoints 50000 912 0.01 40 : 14mins, 6sec

Scaling test

Increase number of nodes or executors!

116 executors with 8 cores each (=928 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 116 --executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_116executors_8coreseach_25000maxPoints 25000 912 0.01 40 : 11mins, 15sec
116 executors with 4 cores each (=464 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 116 --executor-cores 4 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_116executors_4coreseach_25000maxPoints 25000 912 0.01 40 : 10mins, 40sec
58 executors with 8 cores each (=464 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 58 --executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_58executors_8coreseach_25000maxPoints 25000 912 0.01 40 : 10mins, 24sec
With 464 cores, it is not slower than with 928 cores (either too much overhead with 928 cores or not enough partitioning).

29 executors with 8 cores each (=232 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 29 --executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_29executors_8coreseach_25000maxPoints 25000 912 0.01 40 : 11mins, 3sec
29 executors with 4 cores each (=116 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 29 --executor-cores 4 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_29executors_4coreseach_25000maxPoints 25000 912 0.01 40 : 11mins, 47sec
29 executors with 2 cores each (=58 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 29 --executor-cores 2 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_29executors_2coreseach_25000maxPoints 25000 912 0.01 40 : 10mins, 22sec
29 executors with 1 core each (=29 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 29 --executor-cores 1 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_29executors_1coreseach_25000maxPoints 25000 912 0.01 40 : (13mins, 33sec is this value from the 14 executors run?)
14 executors with 1 core each (=14 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 14 --executor-cores 1 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_14executors_1coreseach_25000maxPoints 25000 912 0.01 40 : (13mins, 33sec is this value from the 29 executors run?) 12mins, 45sec
32 executors with 1 core each (=32 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 32 --executor-cores 1 --executor-memory 50g --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_32executors_1coreseach_25000maxPoints 25000 912 0.01 40 : 13mins, 52sec
16 executors with 1 core each (=16 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 16 --executor-cores 1 --executor-memory 50g --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_16executors_1coreseach_25000maxPoints 25000 912 0.01 40 : 16mins, 57sec and 14mins, 49sec and 14mins, 49sec

8 executors with 1 core each (=8 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 8 --executor-cores 1 --executor-memory 50g --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_8executors_1coreseach_25000maxPoints 25000 912 0.01 40 : 20mins, 19sec

4 executors with 1 core each (=4 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 4 --executor-cores 1 --executor-memory 50g --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_4executors_1coreseach_25000maxPoints 25000 912 0.01 40 : 33mins, 14sec
2 executors with 1 core each (=2 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 2 --executor-cores 1 --executor-memory 50g --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_2executors_1coreseach_25000maxPoints 25000 912 0.01 40 : 58mins, 41sec

1 executor with 1 core (=1 core):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 1 --executor-cores 1 --executor-memory 50g --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_1executors_1coreseach_25000maxPoints 25000 912 0.01 40 : 2hrs, 1mins, 51sec
Note: due to the skewed data, there are single long-running tasks that delay the whole job. Hence, even though not everything is processed in parallel (fewer executor threads/cores than tasks) and tasks are queued, executing the queued tasks is still faster than the single long-running task caused by the skewed data.

Twitter big

spark-submit --conf "spark.akka.timeout=300s" --conf "spark.network.timeout=300s" --conf "spark.akka.frameSize=2000" --driver-memory 30g --num-executors 38 --executor-cores 1 --executor-memory 40g --conf "yarn.nodemanager.resource.cpu-vcores=1" --conf "spark.yarn.executor.memoryOverhead=2000" --conf "spark.driver.cores=20" --conf "spark.driver.maxResultSize=0" --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterBig.csv twitterBig.out_with_456_partitions_38executors_1coreseach 25000 456 0.01 40 : 1hrs, 28mins, 55sec

DBSCAN on Spark (https://github.com/mraad/dbscan-spark)

By default: space-separated, only single-space separator supported (cut -d ' ' -f 1,4,6 twitterSmall.ssv > twitterSmall.ssv_no_extra_spaces to remove extra spaces in Twitter file)
Using a property file (if run on JUDGE, the paths refer to HDFS paths!):
input.path=Twitter/twitterSmall.ssv
output.path=twitterSmall.out_mraad
dbscan.eps=0.01
dbscan.min.points=40
dbscan.num.partitions=912

To avoid needing to edit the properties files, we always output to the same directory and then do a rename after the run: hdfs dfs -mv twitterSmall.out_mraad twitterSmall.out_mraad116executors8cores

spark-submit --driver-memory 20g --num-executors 116 --executor-cores 8 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 3mins, 28sec
spark-submit --driver-memory 20g --num-executors 58 --executor-cores 8 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 3mins, 3sec
spark-submit --driver-memory 20g --num-executors 29 --executor-cores 8 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 2mins, 53sec
spark-submit --driver-memory 20g --num-executors 29 --executor-cores 4 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 2mins, 47sec
spark-submit --driver-memory 20g --num-executors 29 --executor-cores 2 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 2mins, 49sec
spark-submit --driver-memory 20g --num-executors 32 --executor-cores 1 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 2mins, 27sec
spark-submit --driver-memory 20g --num-executors 29 --executor-cores 1 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 3mins, 3sec
spark-submit --driver-memory 20g --num-executors 16 --executor-cores 1 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 2mins, 30sec
spark-submit --driver-memory 20g --num-executors 14 --executor-cores 1 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 3mins, 20sec
spark-submit --driver-memory 20g --num-executors 8 --executor-cores 1 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 2mins, 54sec
spark-submit --driver-memory 20g --num-executors 7 --executor-cores 1 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 3mins, 41sec
spark-submit --driver-memory 20g --num-executors 4 --executor-cores 1 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 5mins, 30sec
spark-submit --driver-memory 20g --num-executors 2 --executor-cores 1 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 9mins, 34sec
spark-submit --driver-memory 20g --num-executors 1 --executor-cores 1 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 18mins, 25sec

spark-submit --driver-memory 20g --num-executors 29 --executor-cores 2 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix_58partitions smaller boxes : 1mins, 38sec
spark-submit --driver-memory 20g --num-executors 29 --executor-cores 2 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix_116partitions0.01cellsize : 1mins, 24sec
spark-submit --driver-memory 20g --num-executors 29 --executor-cores 4 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix_116partitions0.01cellsize : 1mins, 20sec and 1mins, 37sec

TwitterBig runs

Seems to need a lot of RAM: only use 1 core on each worker so that this core can use the full RAM (otherwise, jobs abort due to timeouts, which typically result from crashes because of out-of-memory errors).
spark-submit --conf "spark.akka.timeout=300s" --conf "spark.network.timeout=300s" --conf "spark.akka.frameSize=2000" --driver-memory 30g --num-executors 38 --executor-cores 1 --executor-memory 30g --conf "yarn.nodemanager.resource.cpu-vcores=1" --conf "spark.yarn.executor.memoryOverhead=12000" --conf "spark.driver.cores=20" --conf "spark.driver.maxResultSize=0" --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterBig.properties_hdfs_noprefix_14592partitions0.01cellsize : 1hrs, 3mins, 58sec
spark-submit --conf "spark.akka.timeout=300s" --conf "spark.network.timeout=300s" --conf "spark.akka.frameSize=2000" --driver-memory 30g --num-executors 38 --executor-cores 1 --executor-memory 30g --conf "yarn.nodemanager.resource.cpu-vcores=1" --conf "spark.yarn.executor.memoryOverhead=12000" --conf "spark.driver.cores=20" --conf "spark.driver.maxResultSize=0" --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterBig.properties_hdfs_noprefix_1824partitions0.01cellsize : 24mins, 51sec

Spark DBSCAN (https://github.com/alitouka/spark_dbscan)

Uses CSV

Need to change the source code slightly (to avoid passing a master URL consisting just of "YARN" and the subsequent complaints about it). Furthermore, I added hardcoded 912 partitions in its IOHelper class where the initial file is read. In addition, build.sbt had a problem using version numbers that are not available; the changes were:
< scalaVersion := "2.10.6"
<
< val sparkVersion = "1.6.1"
<
< // Added % "provided" so that it gets not included in assembly jar
< libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion % "provided"
<
<
< // Added % "provided" so that it gets not included in assembly jar
< // Elsewhere "spark-mllib_2.10" is used (the 2.10 might refer to Scala 2.10?)
< libraryDependencies += "org.apache.spark" %% "spark-mllib" % sparkVersion % "provided"
<
< // ScalaTest used by the tests provided by Spark MLlib
< libraryDependencies += "org.scalatest" %% "scalatest" % "2.2.6" % "test"
libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.1.0" % "provided"
22a9
> libraryDependencies += "org.scalatest" % "scalatest_2.10" % "2.1.3" % "test"

spark-submit --num-executors 116 --executor-cores 8 --executor-memory 12gb --class org.alitouka.spark.dbscan.DbscanDriver --master yarn /home/helmut/ScalaProjects/spark_dbscan/spark_dbscan-assembly-0.0.4.jar --ds-master yarn --ds-jar /home/helmut/ScalaProjects/spark_dbscan/spark_dbscan-assembly-0.0.4.jar --ds-input Twitter/twitterSmall.csv --ds-output twitterSmall_alitouka912initialpartitions4061npp --npp 4061 --eps 0.01 --numPts 40 : 53mins, 6sec

spark-submit --num-executors 116 --executor-cores 8 --executor-memory 12gb --class org.alitouka.spark.dbscan.DbscanDriver --master yarn /home/helmut/ScalaProjects/spark_dbscan/spark_dbscan-assembly-0.0.4.jar --ds-master yarn --ds-jar /home/helmut/ScalaProjects/spark_dbscan/spark_dbscan-assembly-0.0.4.jar --ds-input Twitter/twitterSmall.csv --ds-output twitterSmall_alitouka912initialpartitions25000npp --npp 25000 --eps 0.01 --numPts 40 : 40mins, 10sec

twitterBig

spark-submit --conf "spark.akka.timeout=300s" --conf "spark.network.timeout=300s" --conf "spark.akka.frameSize=2000" --driver-memory 30g --num-executors 38 --executor-cores 1 --executor-memory 40g --conf "yarn.nodemanager.resource.cpu-vcores=1" --conf "spark.yarn.executor.memoryOverhead=2000" --conf "spark.driver.cores=20" --conf "spark.driver.maxResultSize=0" --class org.alitouka.spark.dbscan.DbscanDriver --master yarn /home/helmut/ScalaProjects/spark_dbscan/spark_dbscan-assembly-0.0.4.jar --ds-master yarn --ds-jar /home/helmut/ScalaProjects/spark_dbscan/spark_dbscan-assembly-0.0.4.jar --ds-input Twitter/twitterBig.csv --ds-output twitterBig_alitouka912initialpartitions25000npp --npp 25000 --eps 0.01 --numPts 40
fails with
java.lang.Exception: Box for point Point at (51.382, -2.3846); id = 6618196; box = 706; cluster = -2; neighbors = 0 was not found

ELKI

time java -Xmx35G -jar elki-bundle-0.7.1.jar KDDCLIApplication -dbc.in /home/helmut/DataIn/twitterSmall.csv -db.index "tree.metrical.covertree.SimplifiedCoverTree$Factory" -covertree.distancefunction minkowski.EuclideanDistanceFunction -algorithm clustering.DBSCAN -dbscan.epsilon 0.01 -dbscan.minpts 40 > twitterSmall.out_elki

time java -Xmx73G -jar elki-bundle-0.7.1.jar KDDCLIApplication -dbc.in /home/helmut/DataIn/twitterBig.csv -db.index "tree.metrical.covertree.SimplifiedCoverTree$Factory" -covertree.distancefunction minkowski.EuclideanDistanceFunction -algorithm clustering.DBSCAN -dbscan.epsilon 0.01 -dbscan.minpts 40 > twitterBig.out_elki

The 9th System Analysis and Modeling (SAM) conference

Helmut Neukirchen, 1. June 2016

Call for Papers

The System Analysis and Modeling (SAM) conference provides an open arena for participants from academia and industry to present and discuss the most recent innovations, trends, experiences and concerns in modeling, specification and analysis of distributed, communication and real-time systems using ITU-T’s Specification and Description Language (SDL) and Message Sequence Charts (MSC), as well as related system design languages (including but not limited to UML, ASN.1, TTCN, SysML and URN). 

As in previous editions, SAM 2016 will be co-located with MODELS 2016. The SAM conference originates from the use of languages and techniques for telecommunications applications, whereas MODELS has a background in the application of UML. However, UML is also used for telecommunications, and the languages standardized by ITU-T (ASN.1, SDL-2010, MSC, TTCN-3, URN) are also used for other applications. The MODELS 2016 conference week is a unique opportunity to attend both of these events with overlapping domains of interest.

Scope and Topics

The 2016 edition of the conference is under the theme of Technology-specific aspects of Models. This theme includes domain-specific aspects of models and peculiarities of using models for different technologies, including, but not limited to the Internet of Things (IoT), automotive software, cloud applications, and embedded software. Moreover, we encourage people to consider publishing information about the usage of models for different purposes and the combination with different software engineering technologies, including, but not limited to software testing, requirements engineering, and automated code generation. 

In addition to our theme, we also invite contributions from a broader range of topics from the following non-exhaustive list:

Models and quality:

  • model quality assurance; quality of models and model artefacts; design of reusable model artefacts; reuse of model artefacts; characteristics of model quality.

Language development:

  • domain-specific languages and language extensions; standardization of language profiles; evolution of language standards; modular language design; semantics; evaluation of languages; languages for real-time systems; performance and other non-functional properties.

Model-driven development:

  • systems engineering; analysis and transformation of models; verification and validation of models; simulation; systematic testing based on and applied to models; tool support.

Applications:

  • Using Specification and Description Language, Message Sequence Charts, UML, SysML, ASN.1, TTCN-3, User Requirements Notation, and related languages.
  • Industrial usage reports; experiences from education; domain-specific applicability (e.g., in automotive, aerospace, telecommunication, process automation and healthcare); methodologies for applications.
  • Application reports should focus on what is effective (and ineffective) in applying a technique preferably backed up by some measurements. A report should not just describe an implementation, though new application areas are of interest.

Location and Venue

SAM 2016 will be held in Saint-Malo, France, on October 3rd – 4th 2016. The conference will be co-located with MODELS 2016.

Submission and Publications

All accepted papers will be published in the well-known Springer Lecture Notes in Computer Science. Submissions must be previously unpublished, written in English, and use the LNCS style as described in the LNCS Author and Editor Guidelines. Authors are strongly encouraged to use the LaTeX version of the template. Papers accepted or under review for other events are ineligible for submission to SAM 2016. Submissions in the following categories are solicited:

  • Full papers describing original, unpublished results (max. 16 pages in LNCS style)
  • Short papers describing work in progress (max. 8 pages in LNCS style)

All page limits include illustrations, bibliography and appendices. Failure to use the LNCS style or to comply with the page limit will lead to a desk rejection of the submission.

The SAM 2016 Program Committee will evaluate the technical contributions of each submission as well as its accessibility to the audience. Papers will be judged based on significance, originality, substance, correctness, and clarity.

Information on how to submit your paper can be found on the submission page.

Important Dates

  • Submission of Abstracts: Sunday, June 19th 2016
  • Submission of Papers: Sunday, June 26th 2016
  • Notification: Wednesday July 27th 2016
  • Camera Ready: Friday, August 5th 2016
  • SAM 2016: Monday, October 3rd and Tuesday, October 4th 2016

Requirements for authors

Accepted papers have to be presented by one of the authors at SAM 2016. A full SAM 2016 conference registration is required for each accepted paper. Failure to comply with these requirements may result in the exclusion of the paper from the proceedings.

Ph.D. student position in Computer Science at the School of Engineering and Natural Sciences, University of Iceland

Helmut Neukirchen, 24. September 2014

The Department of Industrial Engineering, Mechanical Engineering and Computer Science seeks applicants to fill a Ph.D. student position in Computer Science. The project is part of a larger Nordic research project called "eScience tools for investigating Climate Change in Northern High Latitudes" (eSTICC), funded by NordForsk – an organisation under the Nordic Council of Ministers that provides funding for Nordic research cooperation.

The goal of the overall research project is a more accurate description of the high-latitude feedback processes in the climate system by improving the eScience tools of the climate research community. The Ph.D. student will, from a Computer Science perspective, develop scientific workflow schemes and tools to integrate the different data and software that are used by the climate researchers. The idea is to exploit existing workflow solutions, for example from the Grid computing or Multiphysics communities, and customise them to enable interoperability of the climate research eScience tools in use.

The eSTICC project runs from 2014 to 2018, and the Ph.D. student is funded for 3 years and 7 months. The project will be the foundation for the Ph.D. thesis, and Helmut Neukirchen and Ebba Þóra Hvannberg, professors in Computer Science at the University of Iceland, will supervise it. The Ph.D. student will work closely with the other project partners in Northern Europe and visit them.

Applicants should have an MSc degree in Computer Science, Software Engineering, Computational Engineering or a closely related field. Knowledge or interest in high-performance computing or eScience is an advantage. Applicants need to be able to work independently and be active in shaping the project as it progresses in co-operation with the supervisors and the international research team. Good communication skills, an ability to work in a team and willingness to travel are required. The selected candidate will need to send a formal application for a Ph.D. studentship at the University of Iceland in due time.

The application shall include a description of the applicant's interest in the project and how they can specifically contribute to it. The application should be no longer than three pages. The following shall be appended to the application: i) Curriculum Vitae, ii) degree certificates, iii) a copy of the Master's dissertation or another extensive research essay, iv) names of two referees and their contact addresses.

Applications should be sent to starfsumsoknir@hi.is, marked HI1409135. Applications that are not sent electronically should be sent in duplicate to the Human Resource Division, University of Iceland, Main Building, Sæmundargötu 2, 101 Reykjavík. All applications will be answered, and applicants will be informed about the appointment once a decision has been made.

Further Information about the eSTICC project can be found on the project's webpage http://esticc.nilu.no. For further information, please contact either Dr. Helmut Neukirchen (helmut@hi.is) or Dr. Ebba Þóra Hvannberg (ebba@hi.is).

Appointments to the University of Iceland take into account the Equal Rights Project of the University of Iceland.

The salary is determined by the doctoral scholarship in accordance with the wage contract between the Minister of Finance and the appropriate trade union.

About 990 students study Computer Science or Engineering at the Faculty of Industrial Engineering, Mechanical Engineering, and Computer Science. Of these, more than 90 are graduate students, both Master's and Ph.D. students. The academic staff at the faculty numbers about 25. About 450 students are studying Computer Science, and the number of academic Computer Science staff members is 10. More information can be found on the website of the University of Iceland: http://english.hi.is/.

Around 300 highly qualified employees at the School of Engineering and Natural Sciences conduct cutting-edge research and teach in programs that offer diverse and ambitious courses in the field of Engineering and Natural Sciences. The work environment is international and the ratio of international students and employees is constantly increasing.

The School's research institutes are highly sought-after affiliates of international universities and serve a significant role in the scientific community. These are the Engineering Research Institute, the Institute of Life and Environmental Sciences, the Institute for Sustainability Studies, and the Science Institute, which is divided into the Institute of Physical Sciences and the Institute of Earth Sciences.

The University of Iceland is the largest teaching, research, and science institute in Iceland. The University provides a wide range of education in various fields of study and serves institutions, private businesses and the government. According to Times Higher Education, the University of Iceland is among the top 300 universities in the world.

Nordic Center of Excellence (NCoE) on eScience Tools for Investigating Climate Change at High Northern Latitudes (eSTICC) started

Helmut Neukirchen, 3. February 2014

We are a Nordic Center of Excellence (NCoE) now: the research project eScience Tools for Investigating Climate Change at High Northern Latitudes (eSTICC) has started its work. eSTICC has gathered 13 top research groups from the Nordic countries working in the fields of climate research and eScience. The University of Iceland's team is led by Helmut Neukirchen and focuses on the high-performance computing aspects of the project (in particular workflows); it also contributes to training and education within the project. The project is funded by NordForsk as a Nordic Center of Excellence (NCoE). The University of Iceland receives funding of approximately 26.5 million ISK. The project runs from 1/2014 to 12/2018 and includes funding for a PhD student from the start of September 2014 to mid-April 2018.