Some notes on using a Spark cluster

Helmut Neukirchen, 18. August 2016

The following notes are mainly for my personal use referring to the Spark 1.6/YARN cluster that I access, but maybe they are helpful for you as well...

Upload to HDFS

By default (i.e. used implicitly by all HDFS operations), HDFS paths are relative to your HDFS home directory; it needs to be created first by the administrator!
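For reference, creating such a home directory typically amounts to something like the following, run by the administrator as the HDFS superuser (the user name is a placeholder):
hdfs dfs -mkdir -p /user/username
hdfs dfs -chown username:username /user/username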

While piping through SSH should work ( cat test.txt | ssh username@masternode "hdfs dfs -put - hadoopFoldername/" ), it is reported to be slow -- I never checked this, but as my data was rather small anyway, I instead did an scp to the local file system of the master node followed by an hdfs put:
scp twitterSmall.csv username@masternode:
hdfs dfs -put twitterSmall.csv Twitter

Concatenate HDFS files (all inside an HDFS directory) and store in local file system (without sorting)

hdfs dfs -getmerge HdfsFolderContainingSplitResultFiles LocalFileToBeCreated

Note that Spark does not overwrite output files in HDFS by default. Either take care when you re-run jobs that the output files have been (re-)moved, or allow overwriting in the Spark conf of your program:  conf.set("spark.hadoop.validateOutputSpecs","false")
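For completeness, a minimal Scala sketch of where this setting would go (the app name is only a placeholder):
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf().setAppName("DbscanRun") // placeholder app name
conf.set("spark.hadoop.validateOutputSpecs", "false") // allow overwriting existing HDFS output paths
val sc = new SparkContext(conf)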

Debugging

  1. See http://spark.apache.org/docs/latest/running-on-yarn.html
  2. Use spark-submit --verbose
  3. If executor processes are killed, this is mainly due to insufficient RAM (garbage collection takes too long, so timeouts occur, or plain out-of-memory/OOM exceptions are thrown). While in this case you only see "exit code 143" in the log of the driver on the spark-submit console, the details need to be found in the logs of the nodes/executors. This may not be possible via the Web UI due to the executor nodes being firewalled -- in this case use:
    yarn logs -applicationId application_1470137500465_0147
    (The application ID is to be taken from the ID column in the cluster Web UI. This works only for completed runs, not the current run.) In these logs, you can then search for java.lang.OutOfMemoryError: GC overhead limit exceeded or java.lang.OutOfMemoryError: Java heap space
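    For example, to locate the OOM messages directly in the aggregated logs (a simple grep over the yarn logs output):
    yarn logs -applicationId application_1470137500465_0147 | grep -B 2 -A 5 "java.lang.OutOfMemoryError"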

Performance tuning

  1. Note that due to the HDFS block size of 128 MB, partitions of this size are created by default when reading data. To enforce a higher number of partitions/higher parallelism, use the optional numberOfPartitions parameter already at the file read stage (many other RDD-creating operations support it as well); see the sketch after this list.
  2. Some introduction https://www.mapr.com/blog/resource-allocation-configuration-spark-yarn
    http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
    (In particular: more than 5 cores per executor is said to lead to bad HDFS throughput. Note that “executor” is not identical to “node”; thus, instead of running one executor with 24 cores on one node, rather run 4 executors with 5 cores each or 8 executors with 3 cores each on every node! Note that then, however, the overall memory of a node needs to be divided by the number of executors per node, e.g. 5 GB per executor with 8 executors per node on a 40 GB RAM node.)
  3. Config for RAM-intensive jobs (= only 1 core per executor and only 1 core per node, using 40 GB heap space and 2 GB overhead for Spark/YARN itself => on each of the 38 nodes only one core is used, which can thus make use of all available RAM); in addition, increase timeouts and message size:
    spark-submit --conf "spark.network.timeout=300s" --conf "spark.akka.frameSize=2000" --driver-memory 30g --num-executors 38 --executor-cores 1 --conf "yarn.nodemanager.resource.cpu-vcores=1" --executor-memory 40g --conf "spark.yarn.executor.memoryOverhead=2000" --conf "spark.driver.cores=4" --conf "spark.driver.maxResultSize=0"
    (Note: not sure about the driver memory and cores: they seem to have no influence -- is it too late to set them here?)
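As an illustration of item 1 above, a minimal Scala sketch that enforces a higher number of partitions already when reading (912 is a value used in the runs further below; sc is the usual SparkContext):
val points = sc.textFile("Twitter/twitterSmall.csv", 912) // second argument = minimum number of partitions
println(points.getNumPartitions) // check how many partitions were actually created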

DBSCAN evaluation

Helmut Neukirchen, 17. August 2016

This post is used to document some DBSCAN command line parameters used in a DBSCAN implementation evaluation. Once a paper referencing it is published, the data will go to EUDAT B2SHARE and thus get a persistent handle.

Conversion of HDF5 file into CSV/SSV

h5dump -d "/DBSCAN" -o out.txt twitterSmall.h5
Yields lines (12,0): 53.3243, -1.12341,
Remove Id in first column:
cut out.txt -d ':' -f 2 >out.txt2
Yields lines 53.3243, -1.12341,
Remove extra comma at end of line
cut out.txt2 -d ',' -f 1,2 >out.csv
It seems that the first line is just an empty line; remove it using
tail -n +2 out.csv >twitterSmall.csv
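For reference, the whole conversion can also be done in one go (a sketch using the same file names as above):
h5dump -d "/DBSCAN" -o out.txt twitterSmall.h5
cut -d ':' -f 2 out.txt | cut -d ',' -f 1,2 | tail -n +2 > twitterSmall.csv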

The mraad DBSCAN implementation expects SSV format with IDs (i.e. remove brackets after h5dump run)
cut out.txt -c 5- > out_withoutleadingbracket.txt
cut out_withoutleadingbracket.txt -d ':' -f 1,2 --output-delimiter=',' > out3.txt
cut out3.txt -d ')' -f 1,2 --output-delimiter=',' > out4.txt
cut out4.txt -d ',' -f 1,4,5 --output-delimiter=',' > twitterBig_withIds.csv
cut -d ',' twitterBig_withIds.csv -f 1-3 --output-delimiter=' ' > twitterBig_withIds.ssv_with_doublespaces
cut -d ' ' -f 1,3,5 twitterBig_withIds.ssv_with_doublespaces >twitterBig_withIds.ssv

The Dianwei Han DBSCAN expects SSV without IDs: remove Id from mraad format:
cut twitterSmall.ssv -f 2,3 -d ' ' >twitterSmallNoIds.ssv

Running on Spark cluster

A single "assembly" jar needs to be created. In some cases, the sbt build file does not match the JVM/Scala/library versions and thus, minor adjustments were needed.

Find out optimal number of threads (cores) per executor for RDD-DBSCAN

(Typically, 5 is recommended to avoid problems with parallel HDFS access from the same JVM.) Candidates: 3, 4, 6 or more(?). Note that we have 39*24 = 936 cores, but we should leave a few cores for the master and other stuff.

912 cores, 3 cores per executor process:
spark-submit --driver-memory 2g --driver-cores 3 --num-executors 304 --executor-cores 3 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions 25000 0.01 40 : 12mins, 55sec with 912 partitions enforced in sc.textFile(src) and no repartition to a single file when saving result output (repartition to single file would crash process that has to do this "reduce" step).

932 cores, 4 cores per executor process:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_233executors_4coreseach 25000 0.01 40 : 11mins, 3sec with 912 partitions enforced in sc.textFile(src) and no repartition to a single file when saving result output.

912 cores, 6 cores per executor process:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 152 --executor-cores 6 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_233executors_4coreseach 25000 0.01 40 : 11mins, 38sec
928 cores = 116 executors * 8 cores per executor process:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 116 --executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_116executors_8coreseach 25000 912 0.01 40 : 10mins, 26sec
912 cores, 76 executors * 12 cores per executor process:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 76 --executor-cores 12 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_76executors_12coreseach 25000 912 0.01 40 : 12mins, 4sec
924 cores = 42 executors * 22 cores per executor process:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 42 --executor-cores 22 --executor-memory 10gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_42executors_22coreseach 25000 912 0.01 40 : 13mins, 58sec

Trying to find out optimal number of partitions for RDD-DBSCAN

spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_233executors_4coreseach 25000 0.01 40 : 11mins, 3sec with 912 partitions enforced in sc.textFile(src) and no repartition to a single file when saving result output.
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_456_partitions_233executors_4coreseach 25000 0.01 40 : 12mins, 16sec
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_228_partitions_233executors_4coreseach 25000 0.01 40 : 11mins, 17sec
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_114_partitions_233executors_4coreseach 25000 0.01 40 : 11mins, 57sec
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_57_partitions_233executors_4coreseach 25000 0.01 40 : 10mins, 22sec
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_28_partitions_233executors_4coreseach 25000 0.01 40 : 11mins, 29sec
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 10 --executor-cores 6 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_57_partitions_10executors_6coreseach 25000 0.01 40 : 11mins, 49sec
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 2 --executor-cores 6 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_12_partitions_2executors_6coreseach 25000 0.01 40 : 21mins, 20sec
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 4 --executor-cores 6 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_24_partitions_24executors_6coreseach 25000 24 0.01 40 : 17mins, 11sec
Note that there is an imbalance in one of the stages (stage 3 at line 127): while the median task of this stage takes 22 s, the longest task takes 3.5 minutes! But with a smaller partition size, this maybe becomes less imbalanced and thus slightly faster? (Or is this rather due to less overhead?)

Experiment with partition size (MaxPoints) parameter

Investigate whether this results in a different amount of noise being filtered out!

TwitterSmall: 3 704 351 points.
The lower bound of the partition size that is achievable with the given eps is 25 000 (= each partition has less than 25 000 points, the majority much less, maybe about 9 000). 3 704 351 / 25 000 means that at least 148 partitions are created; with a partition size of 9 000: 3 704 351 / 9 000 = more than 400 partitions.
Or: if we want to have 912 partitions (= slightly less than the available cores): 3 704 351 / 912 = 4 061 points per partition would be best! TODO: try this value! (Note: if a resulting partitioning rectangle reaches size 2*eps, no smaller partitioning is possible, and in this case this rectangle will contain more points than specified.)
MaxPoints=4061 (a couple of "cannot split" messages occurred):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 6gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_233executors_4coreseach_4061_epsmaxPoints 4061 912 0.01 40 : 19mins, 17sec
MaxPoints=9000 (a couple of "cannot split" messages occurred):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 116 --executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_116executors_8coreseach_9000maxPoints 9000 912 0.01 40 : 13mins, 43sec
MaxPoints=20000 (Can't split: (DBSCANRectangle(51.4999988488853,-0.13999999687075615,51.51999884843826,-0.11999999731779099) -> 21664) (maxSize: 20000)):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 116 --executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_116executors_8coreseach_20000maxPoints 20000 912 0.01 40 : 14mins, 27sec
MaxPoints=25000:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 116 --executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_116executors_8coreseach_25000maxPoints 25000 912 0.01 40 : 11mins, 15sec
MaxPoints=30000:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_233executors_4coreseach_30000maxPoints 30000 912 0.01 40 : 9mins, 31sec
MaxPoints=32500:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_233executors_4coreseach_32500maxPoints 32500 912 0.01 40 : 10mins, 53sec but second run: 9mins, 6sec third run: 9mins, 13sec, fourth run: 9mins, 20sec
MaxPoints=35000:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 6gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_233executors_4coreseach_35000_epsmaxPoints 35000 912 0.01 40 : 9mins, 42sec
MaxPoints=40000:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 116 --executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_116executors_8coreseach_40000maxPoints 40000 912 0.01 40 : 10mins, 56sec
MaxPoints=50000:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 116 --executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_116executors_8coreseach_50000maxPoints 50000 912 0.01 40 : 14mins, 6sec

Scaling test

Increase number of nodes or executors!

116 executors with 8 cores each (=928 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 116 --executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_116executors_8coreseach_25000maxPoints 25000 912 0.01 40 : 11mins, 15sec
116 executors with 4 cores each (=464 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 116 --executor-cores 4 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_116executors_4coreseach_25000maxPoints 25000 912 0.01 40 : 10mins, 40sec
58 executors with 8 cores each (=464 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 58 --executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_58executors_8coreseach_25000maxPoints 25000 912 0.01 40 : 10mins, 24sec
With 464 cores, it is not slower than with 928 cores (either too much overhead with 928 or not enough partitioning).

29 executors with 8 cores each (=232 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 29 --executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_29executors_8coreseach_25000maxPoints 25000 912 0.01 40 : 11mins, 3sec
29 executors with 4 cores each (=116 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 29 --executor-cores 4 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_29executors_4coreseach_25000maxPoints 25000 912 0.01 40 : 11mins, 47sec
29 executors with 2 cores each (=58 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 29 --executor-cores 2 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_29executors_2coreseach_25000maxPoints 25000 912 0.01 40 : 10mins, 22sec
29 executors with 1 core each (=29 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 29 --executor-cores 1 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_29executors_1coreseach_25000maxPoints 25000 912 0.01 40 : (13mins, 33sec is this value from the 14 executors run?)
14 executors with 1 core each (=14 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 14 --executor-cores 1 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_14executors_1coreseach_25000maxPoints 25000 912 0.01 40 : (13mins, 33sec is this value from the 29 executors run?) 12mins, 45sec
32 executors with 1 core each (=32 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 32 --executor-cores 1 --executor-memory 50g --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_32executors_1coreseach_25000maxPoints 25000 912 0.01 40 : 13mins, 52sec
16 executors with 1 core each (=16 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 16 --executor-cores 1 --executor-memory 50g --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_16executors_1coreseach_25000maxPoints 25000 912 0.01 40 : 16mins, 57sec and 14mins, 49sec and 14mins, 49sec

8 executors with 1 core each (=8 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 8 --executor-cores 1 --executor-memory 50g --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_8executors_1coreseach_25000maxPoints 25000 912 0.01 40 : 20mins, 19sec

4 executors with 1 core each (=4 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 4 --executor-cores 1 --executor-memory 50g --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_4executors_1coreseach_25000maxPoints 25000 912 0.01 40 : 33mins, 14sec
2 executors with 1 core each (=2 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 2 --executor-cores 1 --executor-memory 50g --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_2executors_1coreseach_25000maxPoints 25000 912 0.01 40 : 58mins, 41sec

1 executor with 1 core (=1 core):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 1 --executor-cores 1 --executor-memory 50g --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_1executors_1coreseach_25000maxPoints 25000 912 0.01 40 : 2hrs, 1mins, 51sec
Note: due to skewed data, there are single tasks that delay the whole job. Hence, even though not everything is processed in parallel (fewer executor threads/cores than tasks) and tasks are queued, the queued execution is still faster than the single long-running task caused by the skewed data.

Twitter big

spark-submit --conf "spark.akka.timeout=300s" --conf "spark.network.timeout=300s" --conf "spark.akka.frameSize=2000" --driver-memory 30g --num-executors 38 --executor-cores 1 --executor-memory 40g --conf "yarn.nodemanager.resource.cpu-vcores=1" --conf "spark.yarn.executor.memoryOverhead=2000" --conf "spark.driver.cores=20" --conf "spark.driver.maxResultSize=0" --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterBig.csv twitterBig.out_with_456_partitions_38executors_1coreseach 25000 456 0.01 40 : 1hrs, 28mins, 55sec

DBSCAN on Spark (https://github.com/mraad/dbscan-spark)

By default: space-separated; only a single-space separator is supported (cut -d ' ' -f 1,4,6 twitterSmall.ssv > twitterSmall.ssv_no_extra_spaces removes the extra spaces in the Twitter file).
Using a property file (if run on JUDGE, the paths refer to HDFS paths!):
input.path=Twitter/twitterSmall.ssv
output.path=twitterSmall.out_mraad
dbscan.eps=0.01
dbscan.min.points=40
dbscan.num.partitions=912

To avoid needing to edit the properties files, we always output to the same directory and then do a rename after the run: hdfs dfs -mv twitterSmall.out_mraad twitterSmall.out_mraad116executors8cores

spark-submit --driver-memory 20g --num-executors 116 --executor-cores 8 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 3mins, 28sec
spark-submit --driver-memory 20g --num-executors 58 --executor-cores 8 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 3mins, 3sec
spark-submit --driver-memory 20g --num-executors 29 --executor-cores 8 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 2mins, 53sec
spark-submit --driver-memory 20g --num-executors 29 --executor-cores 4 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 2mins, 47sec
spark-submit --driver-memory 20g --num-executors 29 --executor-cores 2 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 2mins, 49sec
spark-submit --driver-memory 20g --num-executors 32 --executor-cores 1 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 2mins, 27sec
spark-submit --driver-memory 20g --num-executors 29 --executor-cores 1 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 3mins, 3sec
spark-submit --driver-memory 20g --num-executors 16 --executor-cores 1 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 2mins, 30sec
spark-submit --driver-memory 20g --num-executors 14 --executor-cores 1 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 3mins, 20sec
spark-submit --driver-memory 20g --num-executors 8 --executor-cores 1 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 2mins, 54sec
spark-submit --driver-memory 20g --num-executors 7 --executor-cores 1 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 3mins, 41sec
spark-submit --driver-memory 20g --num-executors 4 --executor-cores 1 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 5mins, 30sec
spark-submit --driver-memory 20g --num-executors 2 --executor-cores 1 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 9mins, 34sec
spark-submit --driver-memory 20g --num-executors 1 --executor-cores 1 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 18mins, 25sec

spark-submit --driver-memory 20g --num-executors 29 --executor-cores 2 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix_58partitions smaller boxes : 1mins, 38sec
spark-submit --driver-memory 20g --num-executors 29 --executor-cores 2 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix_116partitions0.01cellsize : 1mins, 24sec
spark-submit --driver-memory 20g --num-executors 29 --executor-cores 4 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix_116partitions0.01cellsize : 1mins, 20sec and 1mins, 37sec

TwitterBig runs

The TwitterBig runs seem to need a lot of RAM: only use 1 core on each worker so that this core can use the full RAM (otherwise: aborts due to timeouts, which typically are caused by crashes because of out of memory).
spark-submit --conf "spark.akka.timeout=300s" --conf "spark.network.timeout=300s" --conf "spark.akka.frameSize=2000" --driver-memory 30g --num-executors 38 --executor-cores 1 --executor-memory 30g --conf "yarn.nodemanager.resource.cpu-vcores=1" --conf "spark.yarn.executor.memoryOverhead=12000" --conf "spark.driver.cores=20" --conf "spark.driver.maxResultSize=0" --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterBig.properties_hdfs_noprefix_14592partitions0.01cellsize : 1hrs, 3mins, 58sec
spark-submit --conf "spark.akka.timeout=300s" --conf "spark.network.timeout=300s" --conf "spark.akka.frameSize=2000" --driver-memory 30g --num-executors 38 --executor-cores 1 --executor-memory 30g --conf "yarn.nodemanager.resource.cpu-vcores=1" --conf "spark.yarn.executor.memoryOverhead=12000" --conf "spark.driver.cores=20" --conf "spark.driver.maxResultSize=0" --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterBig.properties_hdfs_noprefix_1824partitions0.01cellsize : 24mins, 51se

Spark DBSCAN (https://github.com/alitouka/spark_dbscan)

Uses CSV

The source code needed to be changed slightly (to avoid passing a master URL consisting just of "YARN" and the subsequent complaints about it). Furthermore, 912 partitions were hardcoded in its IOHelper class where the initial file is read. In addition, build.sbt had a problem using version numbers that are not available; the diff of my changes:
< scalaVersion := "2.10.6"
<
< val sparkVersion = "1.6.1"
<
< // Added % "provided" so that it gets not included in assembly jar
< libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion % "provided"
<
<
< // Added % "provided" so that it gets not included in assembly jar
< // Elsewhere "spark-mllib_2.10" is used (the 2.10 might refer to Scala 2.10?)
< libraryDependencies += "org.apache.spark" %% "spark-mllib" % sparkVersion % "provided"
<
< // ScalaTest used by the tests provided by Spark MLlib
< libraryDependencies += "org.scalatest" %% "scalatest" % "2.2.6" % "test"
libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.1.0" % "provided"
22a9
> libraryDependencies += "org.scalatest" % "scalatest_2.10" % "2.1.3" % "test"
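The partitioning change mentioned above was roughly of the following form (a sketch only -- the actual variable and method names in alitouka's IOHelper differ):
val lines = sc.textFile(inputPath, 912) // hardcode 912 initial partitions instead of relying on the HDFS block size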

spark-submit --num-executors 116 --executor-cores 8 --executor-memory 12gb --class org.alitouka.spark.dbscan.DbscanDriver --master yarn /home/helmut/ScalaProjects/spark_dbscan/spark_dbscan-assembly-0.0.4.jar --ds-master yarn --ds-jar /home/helmut/ScalaProjects/spark_dbscan/spark_dbscan-assembly-0.0.4.jar --ds-input Twitter/twitterSmall.csv --ds-output twitterSmall_alitouka912initialpartitions4061npp --npp 4061 --eps 0.01 --numPts 40 : 53mins, 6sec

spark-submit --num-executors 116 --executor-cores 8 --executor-memory 12gb --class org.alitouka.spark.dbscan.DbscanDriver --master yarn /home/helmut/ScalaProjects/spark_dbscan/spark_dbscan-assembly-0.0.4.jar --ds-master yarn --ds-jar /home/helmut/ScalaProjects/spark_dbscan/spark_dbscan-assembly-0.0.4.jar --ds-input Twitter/twitterSmall.csv --ds-output twitterSmall_alitouka912initialpartitions25000npp --npp 25000 --eps 0.01 --numPts 40 : 40mins, 10sec

twitterBig

spark-submit --conf "spark.akka.timeout=300s" --conf "spark.network.timeout=300s" --conf "spark.akka.frameSize=2000" --driver-memory 30g --num-executors 38 --executor-cores 1 --executor-memory 40g --conf "yarn.nodemanager.resource.cpu-vcores=1" --conf "spark.yarn.executor.memoryOverhead=2000" --conf "spark.driver.cores=20" --conf "spark.driver.maxResultSize=0" --class org.alitouka.spark.dbscan.DbscanDriver --master yarn /home/helmut/ScalaProjects/spark_dbscan/spark_dbscan-assembly-0.0.4.jar --ds-master yarn --ds-jar /home/helmut/ScalaProjects/spark_dbscan/spark_dbscan-assembly-0.0.4.jar --ds-input Twitter/twitterBig.csv --ds-output twitterBig_alitouka912initialpartitions25000npp --npp 25000 --eps 0.01 --numPts 40
fails with
java.lang.Exception: Box for point Point at (51.382, -2.3846); id = 6618196; box = 706; cluster = -2; neighbors = 0 was not found

ELKI

time java -Xmx35G -jar elki-bundle-0.7.1.jar KDDCLIApplication -dbc.in /home/helmut/DataIn/twitterSmall.csv -db.index "tree.metrical.covertree.SimplifiedCoverTree$Factory" -covertree.distancefunction minkowski.EuclideanDistanceFunction -algorithm clustering.DBSCAN -dbscan.epsilon 0.01 -dbscan.minpts 40 > twitterSmall.out_elki

time java -Xmx73G -jar elki-bundle-0.7.1.jar KDDCLIApplication -dbc.in /home/helmut/DataIn/twitterBig.csv -db.index "tree.metrical.covertree.SimplifiedCoverTree$Factory" -covertree.distancefunction minkowski.EuclideanDistanceFunction -algorithm clustering.DBSCAN -dbscan.epsilon 0.01 -dbscan.minpts 40 > twitterBig.out_elki

No teaching in autumn 2016

Helmut Neukirchen, 28. June 2016

I will be on research sabbatical in autumn 2016 and will thus focus on research without any teaching obligations.

Typically, I would then have taught HBV101F Software Maintenance. Currently, it is planned to be taught one year later, in autumn 2017. Students who would have needed to take that course in autumn 2016 can get an exemption and take another course instead.

If anyone wants to start a new M.Sc. thesis during that time, I will only accept topics with a strong research focus thus leading to a publication at an international conference, for example related to big-data processing with Apache Spark.

About Defending a Master's thesis

Helmut Neukirchen, 19. June 2016

Note from 2023: the text below is partly outdated. On Ugla, SENS has pretty good info on the timelines and the webforms to be filled out, i.e. ignore the timelines and webforms mentioned below.

The official regulations are in articles 7 and 8 of Regulation no. 994-2017 / Reglur um meistaranám við Verkfræði- og náttúruvísindasvið Háskóla Íslands, nr. 994/2017 (and article 69, items 9-15 of Regulation for the University of Iceland no. 569-2009 / Reglur fyrir Háskóla Íslands Nr. 569/2009). The text below should be in accordance with these -- if not, it needs to be updated... In charge of M.Sc. theses at the VoN student service is Donna, reachable via the HÍ email user alias sensgraduate.

If you want to graduate, take care that, at the latest in parallel to your Master's project, you finish all your coursework; e.g., Software Engineering students have three mandatory HBV courses. If you are a student coming from a non-Computer Science/non-Software Engineering Bachelor, you typically have to take extra courses as part of your admission to the Master's programme!

A Master's thesis needs:

  • A supervisor (i. leiðbeinandi): Supervises the student during the whole thesis project.
  • An M.Sc. committee (i. meistaraprófsnefnd) consisting of the supervisor and at least one other person who needs to have an M.Sc. degree -- typically another university teacher -- unofficially called "secondary supervisor" (i. meðleiðbeinandi): often only gets into the game once the student is almost finished (internally, 10% of the overall supervision effort is assumed, but it may be up to 25% or even 50%), i.e. has a more or less final draft of the thesis available. Gives comments to improve your draft. So this person should be somewhat familiar with the topic.
  • An external thesis examiner (i. prófdómari): if possible, should be from outside HÍ (in the old days, that person was from the faculty, and thus the old English term "faculty representative" is often used; in your English thesis, use the official translation "External examiner"). That person needs a final draft (release-candidate status) before the defense, but must not otherwise be involved in the supervision.

Two web forms need to be filled out by the supervisor at the latest 1 week before the defense in order to book a defense: one for advertising the defense, one for appointing the external examiner (the web forms can be reached via the VoN intranet page in UGLA).
There, all the information that is needed (name, title, abstract (for writing an abstract, see also item 4 from Kent Beck), day, supervisors, a photo of the student that will be used for advertising -- but more recently, it seems that the photo is not used anyway --, etc.) has to be provided.

To make it to the next graduation ceremony (i. brautskráning), which takes place in February and June each year (there is still some deadline in October, but no ceremony), there is a deadline (at the latest 3 weeks before the graduation ceremony, by which obviously all the grades need to be handed in). A few days before that deadline, there is typically an event called meistaradagurinn where the idea is that all the students of the faculty defend their theses. Someone organises this and needs to be contacted in order to participate. But of course, it is also possible to defend a thesis on another day than meistaradagurinn.

The schedule of the defense is as follows (on meistaradagurinn: typically 45 minutes for talk and discussion, but there are 60 minutes between defenses to allow time for setting up the presentation):

  • A few introductory words by the supervisor (including an explanation of the procedure).
  • 20-30 minutes presentation of the thesis by the student. No need to be nervous: you know your topic and thesis best (also, your supervisor would not allow you to defend if you were likely to fail)! Learn the introductory words (to get your presentation started fluently) and the concluding words (to avoid an abrupt termination of your talk) by heart.
  • Max. 15 minutes of questions from the audience (note: in practice, 5 minutes for the audience and 10 minutes for internal discussion works best).
  • The audience leaves the room, only the student and the three teachers remain. Now some more private discussion (what was good/bad) and further questions are possible.
  • Finally, the student leaves the room and the teachers discuss the grade (e.g. using a grading scheme) for the thesis and after this, the student is called in again and is told the grade.

Note that the grade is filled into a form that needs to be signed by those involved in grading; this form gets prepared by the administration based on the above web form (typically, a PDF of the form is sent to the main supervisor via e-mail by Sigríður Sif Magnúsdóttir a few days before the defense).

Based on the comments given during the defense, some minor changes to the thesis might be required.
Students need to submit an electronic copy of their thesis to skemman.is at the latest three weeks before the next graduation ceremony (i. brautskráning), so that the confirmation of submission can be sent before the deadline (a printed version is not required anymore, nor is an ISBN number: remove that line if it is part of your thesis template). Students also need to simultaneously fill out a declaration of access. The declaration of access template is available in English and Icelandic. For commercial settings, access to the thesis can be closed (but not for longer than 4 years); if the thesis is closed, please also send the final PDF to your supervisors, because otherwise they cannot access it either.

If the thesis is accepted, the student will receive an e-mail confirming this. The student must send the confirmation from Skemman to sensgraduate at hi.is or to Sigríður Sif Magnúsdóttir by email (before the deadline by which all grades for brautskráning need to be available). There is also an UGLA page on brautskráning that hopefully is still available when you read this...

To allow the supervisors to read and comment on the thesis, a first draft needs to be finished in time:

  • First draft for the supervisor: latest 1 month before the defense. Preferably, use an agile approach of delivering early drafts as soon as a new chapter is finished. (Do not start with the Introduction -- that is often the last chapter written.)
  • Release candidate draft for the co-supervisor and "prófdómari": at the latest 1 week before the defense, better much earlier. This deadline also applies to filling out the above-mentioned web forms.
  • The poster needs to be printed at the latest 1 day before the defense on Meistaradagurinn (e.g., Háskólaprent does this within 15-30 minutes).

If you finish your thesis in August/September/October you may not need to pay tuition fees for the new academic year.

There is an UGLA page with the various deadlines of the graduation process.

Templates for thesis, presentation, and poster

Note: Since 10/2021, HÍ has a new corporate identity that is covered here:

  • An MS Word template exists, but I really recommend using the LaTeX templates for writing the thesis. As "Advisors", list first your supervisor and in the next line the secondary/co-supervisors. Note that while the template may contain an ISBN number, you have to remove that line, as nowadays everything is electronic only. Have a look at some older M.Sc. theses to get an idea of the typical contents.
  • For the defense, a PPT slide template is available on the HÍ corporate design web page -> Hönnunarstaðall (at top right corner) -> Rafrænar einingar -> PowerPoint (and then the download is at the bottom). If you would rather use LaTeX for your presentation, Katrín Halldórsdóttir created a LaTeX Beamer template that, however, uses the old 2010 corporate design (the tex file is GPL, but the logos are property of the University). Any volunteers to update it to the new design?
  • Furthermore, a poster is displayed on Meistaradagurinn (if the defense is on a different day, you are still supposed to prepare a poster to be displayed later on Meistaradagurinn). You find templates with the new 2021 look on the HÍ corporate design web page -> Hönnunarstaðall (at top right corner) -> Prentmiðlar -> Veggspjöld (and then the download is at the bottom). However, the template contained in the ppt download is not very helpful -- you would rather need the provided Adobe InDesign template. As most will not have a license for Adobe InDesign, I provide here a PPT template that has been converted from the Adobe InDesign template. Note that you need to download all fonts of the Google font family "Jost" and install them (if you do not have it installed, PowerPoint will use another font; the print shop probably has the Jost font, and then the layout does not match anymore).
    I suggest using a smaller font size than in the template: compare with the font size used in the PPT template using the old look (but note that the old template uses A0 page size, while the new one uses A4 page size -- which will then be scaled up when printed in, e.g., A0. Hence, display both side by side for comparison.)
    You should also refer to that old template to get an idea of the typical contents, such as adding the names of the supervisors, etc. As a backup, I provide here a copy of that old template.
    Háskólaprent can print the poster (typically in A0 size).

Note that our School of Engineering and Natural Sciences offers a Course on thesis skills such as writing (and you even get ECTS credit points for it). In addition, you will find on the web other general information on thesis writing.

The 9th System Analysis and Modeling (SAM) conference

Helmut Neukirchen, 1. June 2016

Call for Papers

The System Analysis and Modeling (SAM) conference provides an open arena for participants from academia and industry to present and discuss the most recent innovations, trends, experiences and concerns in modeling, specification and analysis of distributed, communication and real-time systems using ITU-T’s Specification and Description Language (SDL) and Message Sequence Charts (MSC), as well as related system design languages (including but not limited to UML, ASN.1, TTCN, SysML and URN). 

As in previous editions, SAM 2016 will be co-located with the MODELS 2016. The SAM conference originates from the use of languages and techniques for telecommunications applications, whereas MODELS has a background in the application of UML. However, UML is also used for telecommunications, and the languages standardized by ITU-T (ASN.1, SDL-2010, MSC, TTCN-3, URN) are also used for other applications. The MODELS 2016 conference week is a unique opportunity to attend both of these events with overlapping domains of interest.

Scope and Topics

The 2016 edition of the conference is under the theme of Technology-specific aspects of Models. This theme includes domain-specific aspects of models and peculiarities of using models for different technologies, including, but not limited to the Internet of Things (IoT), automotive software, cloud applications, and embedded software. Moreover, we encourage people to consider publishing information about the usage of models for different purposes and the combination with different software engineering technologies, including, but not limited to software testing, requirements engineering, and automated code generation. 

In addition to our theme, we also invite contributions from a broader range of topics from the following non-exhaustive list:

Models and quality:

  • models quality assurance; quality of models and model artefacts; design of reusable models artefacts; reuse of model artefacts; characteristics of model quality.

Language development:

  • domain-specific languages and language extensions; standardization of language profiles; evolution of language standards; modular language design; semantics; evaluation of languages; languages for real-time systems; performance and other non-functional properties.

Model-driven development:

  • systems engineering; analysis and transformation of models; verification and validation of models; simulation; systematic testing based on and applied to models; tool support.

Applications:

  • Using Specification and Description Language, Message Sequence Charts, UML, SysML, ASN.1, TTCN-3, User Requirements Notation, and related languages.
  • Industrial usage reports; experiences from education; domain-specific applicability (e.g., in automotive, aerospace, telecommunication, process automation and healthcare); methodologies for applications.
  • Application reports should focus on what is effective (and ineffective) in applying a technique preferably backed up by some measurements. A report should not just describe an implementation, though new application areas are of interest.

Location and Venue

SAM 2016 will be held in Saint-Malo, France on October 3rd – 4th 2016. The conference will be co-located with the MODELS 2016.

Submission and Publications

All accepted papers will be published in the well-known Springer Lecture Notes on Computer Science. Submissions must be previously unpublished, written in English, and use the LNCS style as described in the LNCS Author and Editor Guidelines. Authors are strongly encouraged to use the LaTeX version of the template. Papers accepted or under review for other events are ineligible for submission to SAM 2016. Submissions in the following categories are solicited:

  • Full papers describing original, unpublished results (max. 16 pages in LNCS style)
  • Short papers describing work in progress (max. 8 pages in LNCS style)

All page limits include illustrations, bibliography and appendices. Failure to use the LNCS style or to comply with the page limit will lead to a desk-reject of the submission.

The SAM 2016 Program Committee will evaluate the technical contributions of each submission as well as its accessibility to the audience. Papers will be judged based on significance, originality, substance, correctness, and clarity.

Information on how to submit your paper can be found on the submission page.

Important Dates

  • Submission of Abstracts: Sunday, June 19th 2016
  • Submission of Papers: Sunday, June 26th 2016
  • Notification: Wednesday July 27th 2016
  • Camera Ready: Friday, August 5th 2016
  • SAM 2016: Monday, October 3rd and Tuesday, October 4th 2016

Requirements for authors

Accepted papers have to be presented by one of the authors at SAM 2016. A full SAM 2016 conference registration is required for each accepted paper. Failure to comply with these requirements may result in the exclusion of the paper from the proceedings.

CORBA remote object IORs in a NAT environment

Helmut Neukirchen, 23. October 2015

When running CORBA remote objects in a NAT environment (assuming Internet protocols are used), the IIOP IOR remote object references that will be created (and registered at some nameservice) will contain the private IP address (to convince yourself: dump the IOR as string and paste that string in http://www2.parc.com/istl/projects/ILU/parseIOR/). As a result, when a client outside the NAT environment looks up the IOR, it will get one containing the private IP and access to the remote object does of course not work. For the Oracle OpenJDK CORBA implementation, the following command line parameter needs to be provided to both the ORB and the JVM running at the remote object side:
-ORBServerHost PublicIPofServer

Concerning the ports:
By default, the Oracle OpenJDK is using TCP port 1049 for the activation service. You can change this port via the ORB command line parameter -port.

The port used for the CORBA Naming Service (which is automatically provided by the OpenJDK Java ORB) depends on whether orbd is started as root or as an ordinary user: when started as root, TCP port 900 is used, otherwise TCP port 1049 (because ports lower than 1024 can only be opened by root). Unfortunately, TCP port 1049 is also used by the activation service as described above. Hence, a port collision (= exceptions) will occur (what a stupid design)!
In this case, let the ORB start the Naming Service e.g. on TCP port 1050:
orbd -ORBInitialPort 1050

When changing the Naming Service port from the default 900, client and server JVMs that use that Naming Service also need to know about the changed Naming Service port number: Start the JVMs with additional parameter:
java -ORBInitialPort 1050

When running client and server on different hosts, take care that they use the same Naming Service. Assuming that the Naming Service running on the server's host is used: the server will anyway use this local Naming Service, but the client needs to know the hostname of the server's Naming Service: start the client JVM with additional parameter:
java -ORBInitialHost nameserverhost

Note that in addition to these standard services (Activation and Naming), CORBA uses by default dynamically assigned TCP ports (=expect difficulties with firewalls) for all further objects such as your own remote objects that are contained in the IORs. However, you can enforce a port to be used by a servant created within a JVM using the additional parameter:
java -ORBServerPort port
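Putting these parameters together, a NATed setup could look as follows (a sketch only; the public IP, the port numbers, and the class names MyServer and MyClient are placeholders):
orbd -ORBInitialPort 1050 &
java -ORBInitialPort 1050 -ORBServerHost 203.0.113.10 -ORBServerPort 2222 MyServer
java -ORBInitialHost serverhost -ORBInitialPort 1050 MyClient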

Giving external students access to UGLA documents

Helmut Neukirchen, 16. October 2015

Sometimes, students that are not registered for a course (but have an UGLA account) need access to course material in UGLA. This can be achieved as follows:

  1. Operations -> Users and groups
  2. New group
  3. Give the new group some name, e.g. External access. Confirm. (UGLA allows selecting registered students to be added here, but leave the group empty!)
  4. On the group overview page click on the newly created group.
  5. Add user
  6. In the SSI/kennitala field: either enter kennitala or the person's full HÍ email address. Save.
  7. All operations -> Change group permission. Change permissions accordingly. NOTE: giving permissions for a folder does not recursively apply to the files contained in the folder -- you need to change each and every individual file as well!
  8. Using the URL that you get from "Front Page" link (or simply via the link provided in the course catalogue), the persons should be able to access the folder.

Alternatively, in the Files and Folder area, you can change the access permissions for individual files and folders using the Edit/pen symbol -> Access Permissions.

If you have a person without an UGLA account, the only possibility is to make the whole course page visible world-wide:

  1. All Operations -> Change front page title -> At Access to the Teaching Web, select Open for everybody (no authentication)

This and other things (e.g. electronic homework submission, incl. the student view of it) are also explained (in Icelandic) at Kennslumiðstöð.

Errata CDK5 book on Distributed Systems 5th edition by Coulouris, Dollimore, Kindberg and Blair

Helmut Neukirchen, 1. September 2015

While the CDK5 homepage lists some errata, I found more, which are listed below (I reported them; however, they did not make it into the official errata):

Table in Figure 3.23: Lower bound of 3G phone bandwidth is 0.384 Mbps, not 384 Mbps. As an update: 4G currently is up to 300 Mbps and has latencies as low as 5 ms.

The table in Figure 3.23 is at least outdated if not wrong, e.g. 10Base5 is only about the 500 m STP; the "T" standards are about twisted-pair cables, hence listing coaxial cable (STP) lengths there makes no sense. 1000BaseT nowadays allows 100 m twisted-pair cables. The "fibre" lines rather refer to the "F" standards, not the "T" standards as the column headings suggest. Furthermore, mono-mode fibre length for 1000BaseF has made significant advancements. Finally, 10GBase, 40GBase and 100GBase are now available.

Further errata to come...

Debian Linux on Thinkpad X250

Helmut Neukirchen, 4. March 2015

What I did to install Debian Linux (Jessie) on Thinkpad X250:

Booting from a USB device (to install Debian) was a bit of a challenge: in particular, USB 3 needed to be disabled in the BIOS (and maybe some more BIOS tweaks that I cannot remember anymore).

To make the Trackpoint keys work:

In BIOS, disable Touchpad (anyway a good idea to prevent accidental touches there).

Added file /etc/modprobe.d/x250.conf with content
options psmouse proto=imps

Added file /usr/share/X11/xorg.conf.d/20-thinkpad.conf with content (works only if Touchpad is disabled in BIOS)

Section "InputClass"
Identifier "Trackpoint Wheel Emulation"
MatchProduct "PPS/2 IBM TrackPoint|DualPoint Stick|Synaptics Inc. Composite TouchPad / TrackPoint|ThinkPad USB Keyboard with TrackPoint|USB Trackpoint pointing device|Composite TouchPad / TrackPoint|PS/2 Synaptics TouchPad"
MatchDevicePath "/dev/input/event*"
Option "EmulateWheel" "true"
Option "EmulateWheelButton" "2"
Option "Emulate3Buttons" "false"
Option "XAxisMapping" "6 7"
Option "YAxisMapping" "4 5"
EndSection

Also, to make the side button of my Logitech USB mouse act as a middle button:
Added file 20-logitech-mouse-side-button.conf with content

Section "InputClass"
Identifier "Logitech mouse side button remap"
MatchProduct "Logitech USB Receiver"
MatchDevicePath "/dev/input/event*"
Option "ButtonMapping" "1 0 3 4 5 6 7 2 9 10"
EndSection

(Still, sometimes the Logitech mouse stops working completely; then unplugging the USB receiver from the docking station helps -- I still need to investigate that. Update: it seems that plugging the USB receiver into another USB port (= a port of another USB type) helps.)

I also sometimes experience that my external Dell monitor, connected via a DP cable and my dock, blanks for half a second: a firmware update of the dock is needed, but it is only available as an MS Windows executable. Any hints on how to do this via Linux are welcome! (A BIOS update via Linux is possible and worked.)
I do not have that problem when using the DVI-D port and cable of the dock -- however, for 4k resolution, DP is better than DVI!

I also had an old 1440x900 display that did not report its native resolution when connected via VGA (which, btw., reports as DP2). While I could probably add a modeline to some Xorg config file, as I last did probably 10 years ago, I did the following:

cvt 1440 900
Then pasted the modeline generated by cvt:
xrandr --output DP2 --newmode "1440x900_60.00" 106.50 1440 1528 1672 1904 900 903 909 934 -hsync +vsync
xrandr --addmode DP2 "1440x900_60.00"
xrandr --output DP2 --mode 1440x900_60.00

Also, my other display sometimes does not get recognised:

cvt 1920 1080
Then pasted the modeline generated by cvt:
xrandr --output DP2 --newmode "1920x1080_60.00" 173.00 1920 2048 2248 2576 1080 1083 1088 1120 -hsync +vsync
xrandr --addmode DP2 "1920x1080_60.00"
xrandr --output DP2 --mode 1920x1080_60.00
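
Note that such xrandr settings do not survive an X restart. A hedged way to re-apply them automatically is a small script run at login (e.g. via KDE autostart or ~/.xprofile, which many -- not all -- display managers source); the file name and output name below are just an example:

#!/bin/sh
# re-apply the manually defined mode for the display on DP2
xrandr --newmode "1920x1080_60.00" 173.00 1920 2048 2248 2576 1080 1083 1088 1120 -hsync +vsync
xrandr --addmode DP2 "1920x1080_60.00"
xrandr --output DP2 --mode 1920x1080_60.00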

For getting cloned display output with the KDE "Display and Monitor" configuration system settings pane, the two screens have to be dragged onto each other. However, I like
the old "Size & Orientation" pane more, which can be obtained by installing the kde-workspace-randr package.

Just as a reminder for me: to use Gutenprint for the photo printer, first create in CUPS (e.g. via the web interface) an entry for the photo printer so that the printer gets its own queue. Then, in Gimp, this queue can be used when setting up the photo printer there. In case the Print with Gutenprint menu entry does not show up in Gimp, an extra package needs to be installed: IIRC for Debian it is the package gimp-gutenprint.
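
Creating that queue can also be done from the command line instead of the web interface; a minimal sketch, where the queue name, device URI and driver string are placeholders (lpinfo lists the real values):

lpinfo -v                          # list detected device URIs
lpinfo -m | grep -i gutenprint     # list available Gutenprint drivers/PPDs
lpadmin -p photoprinter -E -v "usb://Example/Photoprinter" -m "gutenprint.5.3://example-model/expert"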

Update 27.5.2024: With Debian Bookworm, I could not detect the photo printer in the CUPS web interface. Installing the package printer-driver-gutenprint made the printer show up in the CUPS administration interface.

But then, I got an error message about an incorrect paper format. I then compiled the latest version of Gutenprint manually -- but this did not compile the Gimp plugin, so I first had to install libgimp2.0-dev.
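
For reference, that manual build was the usual autotools procedure; a sketch (the exact configure flags were not recorded; libgimp2.0-dev must be installed before running configure so that the Gimp plugin gets built):

apt install libgimp2.0-dev
./configure
make
make install    # as root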

Still, that did not work, so I had to downgrade to the Gutenprint version prior to the regression, i.e. remove the Debian packages and manually install the following packages from Jammy (a rough sketch of the commands follows below the list):

libgutenprint-common/jammy,jammy,now 5.3.3-9 all
libgutenprint9/jammy,now 5.3.3-9 amd64
printer-driver-gutenprint/jammy,now 5.3.3-9
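
A hedged sketch of that downgrade (the .deb file names are assumed from the versions above and would first need to be downloaded from the corresponding Jammy archive):

apt remove printer-driver-gutenprint libgutenprint9 libgutenprint-common
dpkg -i libgutenprint-common_5.3.3-9_all.deb libgutenprint9_5.3.3-9_amd64.deb printer-driver-gutenprint_5.3.3-9_amd64.deb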

For version pinning, create a file in /etc/apt/preferences.d with the following contents:

Package: libgutenprint-common
Pin: version 5.3.3-5
Pin-Priority: 1000
Explanation: Newer versions in Debian have a regression https://sourceforge.net/p/gimp-print/discussion/4359/thread/8fca54c027/

Package: libgutenprint9
Pin: version 5.3.3-5
Pin-Priority: 1000
Explanation: Newer versions in Debian have a regression https://sourceforge.net/p/gimp-print/discussion/4359/thread/8fca54c027/

Package: printer-driver-gutenprint
Pin: version 5.3.3-5
Pin-Priority: 1000
Explanation: Newer versions in Debian have a regression https://sourceforge.net/p/gimp-print/discussion/4359/thread/8fca54c027/

Once Debian has versions as new as 5.3.4-2023-08-23 (e.g. in sid), those newer packages can be used again and the pins removed.
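
To verify that the pinning takes effect, the candidate versions can be checked:

apt-cache policy libgutenprint-common libgutenprint9 printer-driver-gutenprint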

RICOH drivers
When I tried to install some Ricoh printer-specific PPDs (offered by CUPS), that gave an error:


"The PPD version (5.3.3) is not compatible with Gutenprint 5.3.4-2023-12-14T01-00-6a3da773. Please run `/usr/sbin/cups-genppdupdate' as administrator."

Running that command did not resolve the problem, so I chose a generic PDF driver offered by CUPS in the Ricoh section, and that one worked. However, that PPD offered only A4, not A3. But by copying over all lines containing A3 from the non-working printer-specific PPD, A3 printing worked. I should probably do the same for A4, because the printer itself always complains that this is the wrong A4 and I need to confirm printing to A4 on the printer's panel (which I do not have to do for the copied-over A3 format).
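
A hedged sketch of extracting those lines (the PPD file name is a placeholder; the relevant entries are keywords such as *PageSize A3, *PageRegion A3, *ImageableArea A3 and *PaperDimension A3, which then need to be pasted into the corresponding sections of the working PPD under /etc/cups/ppd/):

grep 'A3' /etc/cups/ppd/ricoh-specific.ppd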

After I then had a fresh Debian 12 Bookworm install, the above RICOH problem did not occur: I just used the Gutenprint/CUPS driver for the model (just be aware that there are both a RICOH and a Ricoh category, which have different entries). As that fresh install did not have my above package mix from the old Jammy, the above Ricoh problem might in fact have been caused by my messing around with CUPS...

Update 5.1.2025: After a fresh Debian 12 Bookworm install, the drivers for my Brother laser printer were missing. That was solved by: apt install foomatic-db-engine foomatic-db openprinting-ppds psutils

Promotion to full professor: Inaugural lecture

Helmut Neukirchen, 18. November 2014

On Wednesday, 19.11.2014, I will celebrate my promotion to full professor. In an inaugural lecture, I will give an overview of my research areas: distributed systems and software engineering. You are welcome to attend in room 132 of the Askja building from 15:00 to 15:40. The lecture will be recorded and will later on be accessible via the School of Engineering and Natural Sciences web page.

[Image: boðskort (invitation card)]

The lecture was recorded and uploaded to YouTube by someone who is not familiar with German names.