Helmut Neukirchen

No teaching in autumn 2017 / Underfinancing of Icelandic universities #háskólaríhættu / How the University deals with it

Helmut Neukirchen, 28. January 2017

Update on teaching: I will teach HBV101F Software Maintenance in spring 2018, and TÖL503M Distributed Systems will very likely be taught in fall 2017 by an external teacher.

I will not be teaching in autumn 2017, hence the courses HBV101F Software Maintenance and TÖL503M/TÖL102F Distributed Systems will not be taught by me. Due to insufficient financing of public universities by the Icelandic government, it is currently not possible to pay someone to teach these courses. If state financing for universities improves, this might change!

Students who would have needed to take the course HBV101F Software Maintenance (which is mandatory in the Software Engineering study line) can get an exemption and take another course instead.

Some background on public university financing: in Iceland, the state spends a little less than 1.3 million krona (at the current exchange rate: 10 660 EUR) per student and year (which covers not only salaries of all kinds of staff, but also infrastructure such as buildings or research infrastructure), whereas the average in Iceland's Nordic neighbour states is more than 2.2 million krona (at the current exchange rate: 18 000 EUR) per student and year. As a result, I am not allowed to work overtime to teach beyond my teaching obligation as I did in the past. (Well, I could work overtime, but I would not get paid, and then the state would rely on stupid professors working for free and lower the funding even further.) While permanent overtime is typically more expensive than hiring additional staff, a professor has a 48% teaching obligation, 12% administration and 40% research obligation. Hence, hiring a new professor just in order to add more teaching capacity does not pay off: only 48% of that salary would go into teaching. Hence, permanent overtime work by professors to ensure teaching (not talking about research -- of course, a good university needs to do both) in fact makes economic sense and is thus often the norm. Reducing the funding of universities to such an extent that the only way for them to save money is to reduce overtime payments therefore leads to problems with respect to teaching offerings and teaching quality! Of course, instead of working and paying overtime, it would be best to employ further professors, because these ensure both teaching and research, which are both pillars of universities!

If you think the underfinancing of public universities by the Icelandic government is a shame, then you have not yet read how the University administration deals with the current underfunding in the fiscal year 2017:

Our Faculty of Industrial Engineering, Mechanical Engineering and Computer Science (Icelandic: IVT) is part of the School of Engineering and Natural Sciences (SENS or Icelandic: VoN). For determining how the budget is distributed to the individual faculties, the University of Iceland applies a distribution model ("deililíkan", see the Icelandic description in Deililíkan Háskóla Íslands -- Skýrsla til rektors -- Tillögur um breytingar og úrbætur and MPA thesis Árangursstjórnun í háskólum á Íslandi, or for English texts, section 1.5.9. of Evaluation System for Public Higher Education Institutions Description and Self‐Review -- December 9, 2016 and section 3.4 of DOI:10.13177/irpa.a.2016.12.1.9, however the formula in the latter contains some typos) that involves an allocation formula that takes (among others) the teaching (in terms of number of students) and research activities (in terms of publications and acquired funding) into account. While this is calculated individually for each faculty, the money does not go directly to each faculty, but instead SENS receives the money for all its faculties. However, this money is not forwarded by the head of SENS to the faculties according to the distribution model! Instead we (who save money) get less and others (who do not save money) get more (in fact, they get our money):

While our IVT faculty is, together with a smaller one, the only faculty of SENS that manages to operate within the budget of the distribution model (we even use less because we do not have as many permanent positions staffed as we should -- see above), the other faculties do not, but exceed their budget. Because our IVT faculty is so good at reducing costs (for example by doing teaching as cheaper overtime -- see above), our money is taken away and given instead to all the faculties that do not manage to stay within their budget. In fact, we are even requested to save even more (see above: no overtime payments) while the other faculties are allowed to continue spending more than the budget distributed to them according to the distribution model allows.

TL;DR: we are forced by the University administration to cut down our budget far beyond our allocation ("earned" by us due to our performance indicators used as input for the distribution model) in a way that we sacrifice our teaching offering and quality -- only to feed the other faculties that need more money than their allocation provides (they either have to improve their performance indicators, convince everyone to change the distribution model, or save money). Due to a lack of transparency, we cannot even check whether the others at least try to save money (e.g. while we cancel courses with fewer than 10 participants, we do not know whether they do this at all): the dean of SENS only gave us their overall budget need, but no justification for their budget.

I leave it up to you to decide whether this makes SENSe or not.

P.S.: In January 2017, we were not paid any overtime: this overtime payment refers to overtime worked in 2016, i.e. the word is (we were never officially informed about the reason -- see lack of transparency above) that the dean of SENS refuses to pay for work that we did back in 2016 (and even earlier in some cases) even though it was not forbidden to work overtime in those days, but overtime was rather ordered. This is a clear violation of the collective wage agreements (so the University administration also relies on stupid professors working for free). At least, I get my normal fixed salary paid -- in contrast to a part-time teacher who did not get paid at all until he threatened to go on hunger strike. Maybe we should do the same...

P.P.S.: Notably, the university mastered the severe financial crash of 2008 in Iceland without the above problems. It took a new government to underfinance the university in the flourishing economy of 2016/2017. That government was elected (over the one that cleaned up the mess after the 2008 crisis and allowed indebted house owners to write off debts that were higher than 110% of their property's value) because it promised even more write-offs of housing debts (which is one of the reasons why that government has no money left to finance the universities). It just came to light that those who benefited most from these write-offs were those with high income who took high loans -- in contrast to those who wisely avoided high debts or even made no debts at all. Does this remind you of the above faculties that do not stay within their budget and thus get even more money, and our faculty that wisely stays within its budget...?

Uncategorized

Deadline extension: Clausthal-Göttingen International Workshop on Simulation Science

Helmut Neukirchen, 22. January 2017

Update: deadline extended until 3. February 2017!

Due to the fast development of information technology, the understanding of phenomena in natural, engineering, economic and social sciences increasingly relies on computer simulations. Simulation-based analysis and engineering techniques are traditionally a research focus of Clausthal University of Technology and University of Göttingen, which is especially reflected in their common interdisciplinary research cluster "Simulation Science Center Clausthal-Göttingen". In this context, the first "Clausthal-Göttingen International Workshop on Simulation Science" aims to bring together researchers and practitioners from both industry and academia to report on the latest advances in simulation science.

The workshop considers the broad area of modeling & simulation with a focus on:

Simulation and optimization in networks:
Public & transportation networks, computer & sensor networks, queuing networks, Internet of Things (IoT) environments, simulation of uncertain optimization problems, simulation of complex stochastic systems
Simulation of materials:
Development and applications of computational techniques in material and process simulation, simulation at micro (atomistic), meso and macro (continuum) scales including scale bridging, diffusive, convective transport and chemical processes in materials, simulation of granular matter
Distributed simulations:
Technology enabler for distributed simulation (e.g., simulation support for vector and parallel computing architectures, grid-based systems and cloud-based systems), methods for distributed simulation (e.g., agent-based simulation, multi-level simulation, and simulation for big data analytics, fusion and mining), application examples (e.g., simulation-based quality assurance and high-energy physics)

27 - 28 April 2017, Göttingen, Germany

Extended Abstract (2-3 pages) Submission: 20 Jan 2017

Workshop web page

Call for Papers: Download

Organisational, Research

EGU session on eScience, ensemble methods and environmental changes in high latitudes

Helmut Neukirchen, 18. November 2016

The eSTICC project is holding a session on "eScience, ensemble methods and environmental changes in high latitudes" at EGU (European Geosciences Union General Assembly) 2017 Vienna, Austria, 23-28 April 2017.

Convener: Ignacio Pisso
Co-Conveners: Andreas Stohl, Michael Schulz, Torben R. Christensen, Risto Makkonen, Tuula Aalto, Helmut Neukirchen, Alberto Carrassi, Laurent Bertino.

The multiple environmental feedback processes at high latitudes involve interactions between the land, ocean, cryosphere, biosphere and atmosphere. For trustworthy computational predictions of future climate change, these interactions need to be taken into account by eScience tools. In particular, this requires: 1) Integration of existing measurement data and enhanced information flow between disciplines; 2) Representation of the current process understanding in Earth System Models (ESMs) for which computational limitations require balancing the process simplifications; and 3) Improved process understanding. eScience such as High-Performance Computing (HPC), big data or scientific workflows is central in all of these areas.
Contributions in fields related to the intersection of environmental change (such as, but not restricted to, measurements, inverse modeling, data assimilation, process parametrizations, ESMs) and eScience (such as, but not restricted to, HPC, scientific workflows, big data, ensemble methods) are welcome.


The deadline for receipt of abstracts is 11 January 2017, 13:00 CET. You are welcome to submit abstracts via the session's web page.

Uncategorized

Is Supercomputing dead in the age of Big Data processing?

Helmut Neukirchen, 9. November 2016

In the age of Big Data and big data frameworks such as Apache Spark, one might be tempted to think that supercomputing/high-performance computing (HPC) is obsolete. But in fact, Big Data processing and HPC are different, and one platform cannot replace the other. I outline this in a presentation given at the Science Day of the University of Iceland's School of Engineering and Natural Sciences on Saturday, October 29, 2016. (Note that there is nowadays some convergence, and a graph-processing benchmark top 500 list to reflect less CPU-intensive workloads in HPC.)

Furthermore, the available open-source implementations of algorithms (e.g. clustering using DBSCAN) are currently much faster in HPC, and the available Big Data implementations in fact do not even scale beyond a handful of nodes. Results of a case study performed during my guest research stay at the research group High Productivity Data Processing of the Federated Systems and Data division at the Jülich Supercomputing Centre (JSC) are published in this Technical Report.

One of the reviewers of the 1st IEEE International Workshop on Big Spatial Data (BSD 2016) seems not to like the message that Big Data needs to do its homework to match HPC, hence my paper was rejected. While I assume that an HPC conference (such as ISC) might accept it, it would be nice to get the message to the Big Data community: I might submit it to The 6th International Workshop on Parallel and Distributed Computing for Large Scale Machine Learning and Big Data Analytics or later to BDCloud 2017: The 7th IEEE International Conference on Big Data and Cloud Computing. Implementations whose source is not public may also be worth considering: A novel scalable DBSCAN algorithm with Spark or A Parallel DBSCAN Algorithm Based on Spark. (That is, if we get access to the implementations; the lacking possibility of reproducing/verifying scientific results is another story covered in my Technical Report.) Also, I might add threats to validity (such as construct, internal and external validity [Carver, J., VanVoorhis, J., Basili, V., August 2004. Understanding the impact of assumptions on experimental validity.]).

Update from 9.11.2016: Erich Schubert (thanks!) pointed me to this related article "The (black) art of runtime evaluation: Are we comparing algorithms or implementations?" (DOI: 10.1007/s10115-016-1004-2) which supports my findings. A statement from that article on k-means: "Judging from the measured runtime and even assuming zero network overhead, we must assume that a C++ implementation using all cores of a single modern PC will outperform a 100-node cluster easily." For DBSCAN, they show that a C++ implementation is one order of magnitude faster than the Java ELKI (which confirms my measurements concerning the C++ HPDBSCAN and the Java ELKI) on the dataset they used. They also support my claim that the implementation matters: "Good implementations with index accelerations would process this data set in less than 2 seconds, whereas the fastest linear scan implementation took over 90 seconds, and naïve implementations would frequently require over 100 times the best runtime. But not every index we evaluated was implemented correctly, and sometimes an index was even slower than the linear scan. Between different implementations using linear scan, we can observe runtime differences of more than two orders of magnitude. Comparing the fastest implementation measured (optimized C++ code for this task) to the slowest implementation (outdated versions of Weka), we observe four orders of magnitude: less than two seconds instead of over four hours."

Research

Fake / predatory (Open Access) Journals

Helmut Neukirchen, 8. November 2016

Fake / predatory journals (typically open access journals that publish everything as long as they get paid for it) are a problem for scholars. A good starting point to identify them is Beall’s List with lists of publishers that publish a range of fake journals, single fake journals which are not related to the above publishers, as well as hijacked journals that look like the submission web page of the original version. Searching the above web site is also a good idea.

Update from 2018: The above web pages do not exist anymore in 2018 (but 2017 versions can be retrieved via http://archive.org). In addition, there is https://beallslist.weebly.com/ that even adds new entries. Another blog covering this topic is http://flakyj.blogspot.com/.

In addition to the above blacklists, there is also a whitelist by the Directory of Open Access Journals (DOAJ). But beware: some journals even appear on both the blacklist and the whitelist...

Fake / predatory conferences are also a problem, for example those hosted by IARIA: I was once myself a TPC member of The First International Conference on Advances in System Testing and Validation Lifecycle (VALID 2009). As it was the first one and even published by IEEE, it was not obvious to me at that time that this is a bogus conference. Only when I, as a reviewer, never got access to the reviews of the other reviewers did it become obvious that no rigorous academic standards apply, and I no longer accepted being a TPC member of any IARIA conference (nor submitted there, of course).

Anyway, University of Iceland respects most publications listed in ISI - Web of Knowledge and Scopus which contain so far only serious publication targets.

Organisational

Tahoma and Tahoma bold font in Wine/CrossOver

Helmut Neukirchen, 27. October 2016

Even if the free Microsoft Core fonts are installed, Tahoma is missing. A Microsoft knowledge base support entry offered it for download as Tahoma32.exe, however that link is broken. Hence, download the files contained therein (tahoma.ttf and tahomabd.ttf) from elsewhere (this seems to be legal as Microsoft offered them to the public anyway), e.g. https://github.com/caarlos0/msfonts/tree/master/fonts

Copy the font files to the ~/.fonts directory and run fc-cache -fv
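The copy step can be sketched as a small shell function. The function name install_fonts and its two arguments are made up for illustration; only the ~/.fonts directory, the two file names and fc-cache -fv come from the note above.

```shell
#!/bin/sh
# Sketch: copy the two Tahoma files from a download directory into a font
# directory and refresh the fontconfig cache.
# install_fonts is a hypothetical helper name, not an existing tool.
install_fonts() {
    src_dir="$1"                    # directory holding tahoma.ttf and tahomabd.ttf
    dest_dir="${2:-$HOME/.fonts}"   # per-user font directory (default as in the note)
    mkdir -p "$dest_dir" || return 1
    cp "$src_dir"/tahoma.ttf "$src_dir"/tahomabd.ttf "$dest_dir"/ || return 1
    # Rebuild the font cache only if fontconfig is installed.
    if command -v fc-cache >/dev/null 2>&1; then
        fc-cache -fv "$dest_dir" >/dev/null
    fi
}
```

A call such as install_fonts ~/Downloads would then install into ~/.fonts; Wine/CrossOver may need a restart to pick up the new fonts.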

Tech

Some notes on using a Spark cluster

Helmut Neukirchen, 18. August 2016

The following notes are mainly for my personal use referring to the Spark 1.6/YARN cluster that I access, but maybe they are helpful for you as well...

Upload to HDFS

By default (=used implicitly by all HDFS operations), HDFS paths are relative to your HDFS home directory: it needs to be created first by the administrator!

While piping through SSH should work (cat test.txt | ssh username@masternode "hdfs dfs -put - hadoopFoldername/"), it is reported to be slow -- I never checked this, but as I used rather small data anyway, I instead did an scp to the local file system of the master node and afterwards used an hdfs put there:
scp localFile username@masternode:
hdfs dfs -put twitterSmall.csv Twitter

Concatenate HDFS files (all inside an HDFS directory) and store in local file system (without sorting)

hdfs dfs -getmerge HdfFolderContainingSplitResultFiles LocalFileToBeCreated

Note that Spark does not overwrite output files in HDFS by default. Either take care when you re-run jobs that the output files have been (re-)moved or you have to allow it in the Spark conf of your program: conf.set("spark.hadoop.validateOutputSpecs","false")

Debugging

See http://spark.apache.org/docs/latest/running-on-yarn.html
Use spark-submit --verbose
If executor processes are killed, this is mainly due to insufficient RAM (garbage collection takes too long, thus timeouts occur, or simply out of memory/OOM exceptions). While in this case you only see "exit code 143" in the log of the driver on the spark-submit console, the details need to be found in the logs of the nodes/executors. This may not be possible via the Web UI due to executor nodes being firewalled -- in this case use:
yarn logs -applicationId application_1470137500465_0147
(App ID to be taken from the ID column in the Cluster Web UI. Works only for completed runs, not the current run.) In these logs, you can then find / search for java.lang.OutOfMemoryError: GC overhead limit exceeded or java.lang.OutOfMemoryError: Java heap space
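Searching the aggregated logs for the two error messages can be sketched with grep. The function name find_oom is made up for illustration; it just reads log text from standard input.

```shell
#!/bin/sh
# Sketch: filter log text (read from stdin) for the two OutOfMemoryError
# messages mentioned above and print matching lines with their line numbers.
# find_oom is a hypothetical helper name.
find_oom() {
    grep -n -E 'java\.lang\.OutOfMemoryError: (GC overhead limit exceeded|Java heap space)'
}

# Intended use on the cluster (application ID taken from the Cluster Web UI):
#   yarn logs -applicationId application_1470137500465_0147 | find_oom
```

The exit status of find_oom is that of grep, so it is non-zero when no such error occurs in the logs.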

Performance tuning

Note that due to the HDFS block size of 128 MB, partitions of this size are created by default when reading data. To enforce a higher number of partitions/higher parallelism, use the optional numberOfPartitions parameter already at the file read stage (many other RDD-creating operations support it as well).
Some introduction https://www.mapr.com/blog/resource-allocation-configuration-spark-yarn
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
(in particular: more than 5 cores per executor is said to lead to bad HDFS throughput. Note that “executor” is not identical to “node”, thus instead of running one executor with 24 cores on one node, rather run 4 executors with 5 cores on each node or 8 executors with 3 cores! Note that then, however, the overall memory of a node needs to be divided by the number of executors per node, e.g. 5 GB per executor with 8 executors per node on a 40 GB RAM node.)
Config for RAM-intensive jobs (=1 core per executor only & 1 core per node only, using 40GB heap space and 2GB overhead for Spark/Yarn itself => on each of the 38 nodes only one core is used that thus can make use of all available RAM), in addition increase timeouts and message size:
spark-submit --conf "spark.network.timeout=300s" --conf "spark.akka.frameSize=2000" --driver-memory 30g --num-executors 38 --executor-cores 1 --conf "yarn.nodemanager.resource.cpu-vcores=1" --executor-memory 40g --conf "spark.yarn.executor.memoryOverhead=2000" --conf "spark.driver.cores=4" --conf "spark.driver.maxResultSize=0"
(Note: not sure about the driver memory and cores: this seems to have no influence -- is it too late to set it here?)
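The rule of thumb about executors per node and memory per executor can be checked with a few lines of shell arithmetic. This is only a sketch: per_node_layout is a made-up helper, it uses whole numbers only, and it ignores the extra memory overhead that Spark/YARN needs per executor.

```shell
#!/bin/sh
# Sketch: split one node's cores and RAM among executors.
# per_node_layout is a hypothetical helper name.
per_node_layout() {
    node_cores="$1"           # cores available on one node, e.g. 24
    node_mem_gb="$2"          # RAM of one node in GB, e.g. 40
    cores_per_executor="$3"   # e.g. 3 or 5
    executors=$(( node_cores / cores_per_executor ))   # integer division
    mem_per_executor=$(( node_mem_gb / executors ))
    echo "$executors executors per node, $mem_per_executor GB each"
}

per_node_layout 24 40 3   # the 8-executor case from the text
per_node_layout 24 40 5   # the 4-executor case from the text
```

For the 24-core, 40 GB node this prints 8 executors with 5 GB each and 4 executors with 10 GB each; the value actually passed to --executor-memory would have to be somewhat lower to leave room for the overhead.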

Research, Tech

DBSCAN evaluation

Helmut Neukirchen, 17. August 2016

This post is used to document some DBSCAN command line parameters used in a DBSCAN implementation evaluation. Once a paper referencing it is published, it will go to EUDAT B2SHARE and thus get a persistent handle.

Conversion of HDF5 file into CSV/SSV

h5dump -d "/DBSCAN" -o out.txt twitterSmall.h5
Yields lines (12,0): 53.3243, -1.12341,
Remove Id in first column:
cut out.txt -d ':' -f 2 >out.txt2
Yields lines 53.3243, -1.12341,
Remove extra comma at end of line
cut out.txt2 -d ',' -f 1,2 >out.csv
Seems that the first line is just an empty line, remove using
tail -n +2 out.csv >twitterSmall.csv
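The two cut steps above can be tried on a single sample line without intermediate files. The function name h5line_to_csv is made up for illustration; the sample line is the one quoted above.

```shell
#!/bin/sh
# Sketch: the same two cut steps chained with a pipe.
# h5line_to_csv is a hypothetical helper name.
h5line_to_csv() {
    cut -d ':' -f 2 |      # drop the "(row,col)" ID in front of the colon
    cut -d ',' -f 1,2      # keep the two coordinates, drop the trailing comma
}

printf '(12,0): 53.3243, -1.12341,\n' | h5line_to_csv
```

Note that the blank after the colon is kept, so each output line starts with a space.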

The mraad DBSCAN implementation expects SSV format with IDs (i.e. remove brackets after h5dump run)
cut out.txt -c 5- > out_withoutleadingbracket.txt
cut out_withoutleadingbracket.txt -d ':' -f 1,2 --output-delimiter=',' > out3.txt
cut out3.txt -d ')' -f 1,2 --output-delimiter=',' > out4.txt
cut out4.txt -d ',' -f 1,4,5 --output-delimiter=',' > twitterBig_withIds.csv
cut -d ',' twitterBig_withIds.csv -f 1-3 --output-delimiter=' ' > twitterBig_withIds.ssv_with_doublespaces
cut -d ' ' -f 1,3,5 twitterBig_withIds.ssv_with_doublespaces >twitterBig_withIds.ssv
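The chain of cut calls above can be traced on one sample line by piping the steps together. This sketch assumes that h5dump indents each data line by three spaces in front of the opening bracket (which is what cut -c 5- suggests) and that GNU cut is used (for --output-delimiter); the function name h5line_to_ssv_with_id is made up.

```shell
#!/bin/sh
# Sketch: same steps as above, chained with pipes instead of temporary files.
# h5line_to_ssv_with_id is a hypothetical helper name.
h5line_to_ssv_with_id() {
    cut -c 5- |                                     # "12,0): 53.3243, -1.12341,"
    cut -d ':' -f 1,2 --output-delimiter=',' |      # "12,0), 53.3243, -1.12341,"
    cut -d ')' -f 1,2 --output-delimiter=',' |      # "12,0,, 53.3243, -1.12341,"
    cut -d ',' -f 1,4,5 |                           # "12, 53.3243, -1.12341"
    cut -d ',' -f 1-3 --output-delimiter=' ' |      # "12  53.3243  -1.12341"
    cut -d ' ' -f 1,3,5                             # "12 53.3243 -1.12341"
}

printf '   (12,0): 53.3243, -1.12341,\n' | h5line_to_ssv_with_id
```

Tracing one line this way makes it easier to see why the last two steps are needed: replacing the commas leaves double spaces, which the final cut removes.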

The Dianwei Han DBSCAN expects SSV without IDs: remove Id from mraad format:
cut twitterSmall.ssv -f 2,3 -d ' ' >twitterSmallNoIds.ssv

Running on Spark cluster

A single "assembly" jar needs to be created. In some cases, the sbt build file does not match the JVM/Scala/library versions and thus, minor adjustments were needed.

Find out optimal number of threads (cores) per executor for RDD-DBSCAN

(typically, 5 is recommended to avoid problems with parallel HDFS access by the same JVM): 3, 4 or 6 or more(?). Note we have 39*24 = 936 cores, but we should leave a few cores for master and other stuff.

912 cores, 3 cores per executor process:
spark-submit --driver-memory 2g --driver-cores 3 --num-executors 304 --executor-cores 3 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions 25000 0.01 40 : 12mins, 55sec with 912 partitions enforced in sc.textFile(src) and no repartition to a single file when saving result output (repartition to single file would crash process that has to do this "reduce" step).

932 cores, 4 cores per executor process:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_233executors_4coreseach 25000 0.01 40 : 11mins, 3sec with 912 partitions enforced in sc.textFile(src) and no repartition to a single file when saving result output.

912 cores, 6 cores per executor process:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 152 --executor-cores 6 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_233executors_4coreseach 25000 0.01 40 : 11mins, 38sec
928 cores = 116 executors * 8 cores per executor process:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 116 --executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_116executors_8coreseach 25000 912 0.01 40 : 10mins, 26sec
912 cores, 76 executors * 12 cores per executor process:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 76 --executor-cores 12 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_76executors_12coreseach 25000 912 0.01 40 : 12mins, 4sec
924 cores = 42 executors * 22 cores per executor process:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 42 --executor-cores 22 --executor-memory 10gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_42executors_22coreseach 25000 912 0.01 40 : 13mins, 58sec
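How many of the 39 * 24 cores each of the above layouts requests can be re-checked with a small loop. This is only a sketch of the arithmetic; cores_used is a made-up helper and the pairs are the --num-executors / --executor-cores values from the commands above.

```shell
#!/bin/sh
# Sketch: cores requested by each executor layout compared with the cluster size.
total_cores=$(( 39 * 24 ))

# cores_used is a hypothetical helper: executors * cores per executor.
cores_used() {
    echo $(( $1 * $2 ))
}

for layout in "304 3" "233 4" "152 6" "116 8" "76 12" "42 22"; do
    set -- $layout   # split the pair into $1 (executors) and $2 (cores each)
    echo "$1 executors x $2 cores = $(cores_used "$1" "$2") of $total_cores cores"
done
```

All layouts stay slightly below the total, which leaves a few cores for the driver and other services.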

Trying to find out optimal number of partitions for RDD-DBSCAN

spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_233executors_4coreseach 25000 0.01 40 : 11mins, 3sec with 912 partitions enforced in sc.textFile(src) and no repartition to a single file when saving result output.
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_456_partitions_233executors_4coreseach 25000 0.01 40 : 12mins, 16sec
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_228_partitions_233executors_4coreseach 25000 0.01 40 : 11mins, 17sec
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_114_partitions_233executors_4coreseach 25000 0.01 40 : 11mins, 57sec
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_57_partitions_233executors_4coreseach 25000 0.01 40 : 10mins, 22sec
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_28_partitions_233executors_4coreseach 25000 0.01 40 : 11mins, 29sec
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 10 --executor-cores 6 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_57_partitions_10executors_6coreseach 25000 0.01 40 : 11mins, 49sec
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 2 --executor-cores 6 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_12_partitions_2executors_6coreseach 25000 0.01 40 : 21mins, 20sec
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 4 --executor-cores 6 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_24_partitions_24executors_6coreseach 25000 24 0.01 40 : 17mins, 11sec
Note that there is an imbalance in one of the stages (stage 3 at line 127): while the median of this task is 22s, the longest task takes 3.5 minutes! But with smaller partition size, this becomes maybe less imbalanced and thus slightly faster? (Or is this rather less overhead?)

Experiment with Partitions size (MaxPoints) parameter

Investigate whether this results in a different amount of noise being filtered out!

TwitterSmall: 3 704 351 points.
The lower bound of the partition size that can be achieved with the given eps is 25 000 (=each partition has fewer than 25 000 points, the majority far fewer, maybe about 9000). 3 704 351 / 25 000 ≈ 148.2 means that at least 149 partitions are created; with a partition size of 9000: 3 704 351 / 9 000 = more than 400 partitions.
Or: if we want to have 912 partitions (=slightly less than the available cores): 3 704 351 / 912 = 4061 points per partition would be best! TODO: try this value! (Note: if the resulting partitioning rectangle reaches size 2*eps, no smaller partitioning is possible and in this case, this rectangle will contain more points than specified.)
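The partition arithmetic can be reproduced with integer arithmetic in the shell. This is a sketch; the two function names are made up, and the minimum number of partitions is obtained by rounding up.

```shell
#!/bin/sh
# Sketch: partition counts and sizes for the TwitterSmall data set.
points=3704351

# Smallest number of partitions if no partition may hold more than $1 points
# (ceiling division). min_partitions is a hypothetical helper name.
min_partitions() {
    echo $(( (points + $1 - 1) / $1 ))
}

# Points per partition if the data were spread evenly over $1 partitions
# (integer division, remainder dropped). Hypothetical helper name as well.
points_per_partition() {
    echo $(( points / $1 ))
}

min_partitions 25000        # prints 149
min_partitions 9000         # prints 412
points_per_partition 912    # prints 4061
```

These numbers are only bounds for an even spread: as noted above, a rectangle that has reached size 2*eps cannot be split any further and may contain more points than MaxPoints.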
MaxPoints=4061 (a couple of "cannot split" messages occurred):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 6gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_233executors_4coreseach_4061_epsmaxPoints 4061 912 0.01 40 : 19mins, 17sec
MaxPoints=9000 (a couple of "cannot split" messages occurred):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 116 --executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_116executors_8coreseach_9000maxPoints 9000 912 0.01 40 : 13mins, 43sec
MaxPoints=20000 (Can't split: (DBSCANRectangle(51.4999988488853,-0.13999999687075615,51.51999884843826,-0.11999999731779099) -> 21664) (maxSize: 20000)):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 116 --executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_116executors_8coreseach_20000maxPoints 20000 912 0.01 40 : 14mins, 27sec
MaxPoints=25000:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 116 --executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_116executors_8coreseach_25000maxPoints 25000 912 0.01 40 : 11mins, 15sec
MaxPoints=30000:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_233executors_4coreseach_30000maxPoints 30000 912 0.01 40 : 9mins, 31sec
MaxPoints=32500:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_233executors_4coreseach_32500maxPoints 32500 912 0.01 40 : 10mins, 53sec but second run: 9mins, 6sec third run: 9mins, 13sec, fourth run: 9mins, 20sec
MaxPoints=35000:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 6gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_233executors_4coreseach_35000_epsmaxPoints 35000 912 0.01 40 : 9mins, 42sec
MaxPoints=40000:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 116 --executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_116executors_8coreseach_40000maxPoints 40000 912 0.01 40 : 10mins, 56sec
MaxPoints=50000:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 116 --executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_116executors_8coreseach_50000maxPoints 50000 912 0.01 40 : 14mins, 6sec
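To compare the MaxPoints runs above, the reported durations can be converted to seconds and ranked. The following Python sketch does that; the helper name and the dictionary are illustrative only. Note that the runs did not all use the same executor configuration (116 x 8 cores vs. 233 x 4 cores, 6gb vs. 8gb), so the ranking is only a rough indication.

```python
import re


def to_seconds(text: str) -> int:
    """Convert a duration such as '2hrs, 1mins, 51sec' into seconds."""
    factors = {"hrs": 3600, "mins": 60, "sec": 1}
    return sum(int(number) * factors[unit]
               for number, unit in re.findall(r"(\d+)\s*(hrs|mins|sec)", text))


# MaxPoints -> durations of all runs listed above for that value
runs = {
    4061: ["19mins, 17sec"],
    9000: ["13mins, 43sec"],
    20000: ["14mins, 27sec"],
    25000: ["11mins, 15sec"],
    30000: ["9mins, 31sec"],
    32500: ["10mins, 53sec", "9mins, 6sec", "9mins, 13sec", "9mins, 20sec"],
    35000: ["9mins, 42sec"],
    40000: ["10mins, 56sec"],
    50000: ["14mins, 6sec"],
}

# Use the fastest run per MaxPoints value and rank by it.
best_time = {mp: min(to_seconds(d) for d in durations)
             for mp, durations in runs.items()}
for mp, secs in sorted(best_time.items(), key=lambda item: item[1]):
    print(f"MaxPoints={mp:>5}: {secs} s")
```

With the fastest run per value, MaxPoints=32500 (546 s) and MaxPoints=30000 (571 s) come out on top; if only the first run of each value is considered, 30000 is fastest.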

Scaling test

Increase number of nodes or executors!

116 executors with 8 cores each (=928 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 116 --executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_116executors_8coreseach_25000maxPoints 25000 912 0.01 40 : 11mins, 15sec
116 executors with 4 cores each (=464 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 116 --executor-cores 4 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_116executors_4coreseach_25000maxPoints 25000 912 0.01 40 : 10mins, 40sec
58 executors with 8 cores each (=464 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 58 --executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_58executors_8coreseach_25000maxPoints 25000 912 0.01 40 : 10mins, 24sec
With 464 cores, the run is not slower than with 928 cores (either there is too much overhead with 928 cores, or the data is not partitioned finely enough).

29 executors with 8 cores each (=232 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 29 --executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_29executors_8coreseach_25000maxPoints 25000 912 0.01 40 : 11mins, 3sec
29 executors with 4 cores each (=116 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 29 --executor-cores 4 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_29executors_4coreseach_25000maxPoints 25000 912 0.01 40 : 11mins, 47sec
29 executors with 2 cores each (=58 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 29 --executor-cores 2 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_29executors_2coreseach_25000maxPoints 25000 912 0.01 40 : 10mins, 22sec
29 executors with 1 core each (=29 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 29 --executor-cores 1 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_29executors_1coreseach_25000maxPoints 25000 912 0.01 40 : (13mins, 33sec is this value from the 14 executors run?)
14 executors with 1 core each (=14 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 14 --executor-cores 1 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_14executors_1coreseach_25000maxPoints 25000 912 0.01 40 : (13mins, 33sec is this value from the 29 executors run?) 12mins, 45sec
32 executors with 1 core each (=32 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 32 --executor-cores 1 --executor-memory 50g --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_32executors_1coreseach_25000maxPoints 25000 912 0.01 40 : 13mins, 52sec
16 executors with 1 core each (=16 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 16 --executor-cores 1 --executor-memory 50g --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_16executors_1coreseach_25000maxPoints 25000 912 0.01 40 : 16mins, 57sec and 14mins, 49sec and 14mins, 49sec

8 executors with 1 core each (=8 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 8 --executor-cores 1 --executor-memory 50g --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_8executors_1coreseach_25000maxPoints 25000 912 0.01 40 : 20mins, 19sec

4 executors with 1 core each (=4 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 4 --executor-cores 1 --executor-memory 50g --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_4executors_1coreseach_25000maxPoints 25000 912 0.01 40 : 33mins, 14sec
2 executors with 1 core each (=2 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 2 --executor-cores 1 --executor-memory 50g --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_2executors_1coreseach_25000maxPoints 25000 912 0.01 40 : 58mins, 41sec

1 executor with 1 core (=1 core):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 1 --executor-cores 1 --executor-memory 50g --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_1executors_1coreseach_25000maxPoints 25000 912 0.01 40 : 2hrs, 1mins, 51sec
Note: due to skewed data, a few single long-running tasks delay the whole job. Hence, even when there are fewer executor threads/cores than tasks, so that tasks are queued instead of all running in parallel, the queued tasks still finish before the long-running task caused by the skewed data does.
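The effect of the skew shows up when speedup and parallel efficiency are computed from the runs with 1-core executors and 50g executor memory listed above. This is a minimal Python sketch; the names are illustrative, and for 16 cores the best of the three reported runs is used.

```python
def seconds(hours: int = 0, minutes: int = 0, secs: int = 0) -> int:
    """Wall-clock time in seconds."""
    return hours * 3600 + minutes * 60 + secs


# cores -> measured runtime of the runs with 1 core per executor (50g memory)
runtimes = {
    1: seconds(2, 1, 51),
    2: seconds(0, 58, 41),
    4: seconds(0, 33, 14),
    8: seconds(0, 20, 19),
    16: seconds(0, 14, 49),  # best of the three runs listed above
    32: seconds(0, 13, 52),
}


def speedup(cores: int) -> float:
    """Runtime on one core divided by runtime on the given number of cores."""
    return runtimes[1] / runtimes[cores]


def efficiency(cores: int) -> float:
    """Speedup per core; 1.0 would be perfect scaling."""
    return speedup(cores) / cores


for cores in sorted(runtimes):
    print(f"{cores:>2} cores: speedup {speedup(cores):5.2f}, "
          f"efficiency {efficiency(cores):4.2f}")
```

With these numbers, the speedup on 32 cores stays below 9, which fits the observation that a few long-running tasks dominate the runtime.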

Twitter big

spark-submit --conf "spark.akka.timeout=300s" --conf "spark.network.timeout=300s" --conf "spark.akka.frameSize=2000" --driver-memory 30g --num-executors 38 --executor-cores 1 --executor-memory 40g --conf "yarn.nodemanager.resource.cpu-vcores=1" --conf "spark.yarn.executor.memoryOverhead=2000" --conf "spark.driver.cores=20" --conf "spark.driver.maxResultSize=0" --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterBig.csv twitterBig.out_with_456_partitions_38executors_1coreseach 25000 456 0.01 40 : 1hrs, 28mins, 55sec

DBSCAN on Spark (https://github.com/mraad/dbscan-spark)

By default: space-separated input; only a single space is supported as separator (run cut -d ' ' -f 1,4,6 twitterSmall.ssv > twitterSmall.ssv_no_extra_spaces to remove the extra spaces in the Twitter file).
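As an alternative to cut, the extra spaces could also be removed with a few lines of Python. This is only a sketch and not what was used for the runs here: unlike the cut call, which picks fixed field positions, it collapses every run of whitespace into a single space and keeps all columns. The function name is made up for illustration.

```python
def collapse_spaces(line: str) -> str:
    """Replace every run of whitespace by a single space and trim the ends."""
    return " ".join(line.split())


# Example with made-up values: several spaces between the columns
print(collapse_spaces("1   51.5  -0.14"))  # 1 51.5 -0.14
```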
Using a property file (if run on JUDGE, the paths refer to HDFS paths!):
input.path=Twitter/twitterSmall.ssv
output.path=twitterSmall.out_mraad
dbscan.eps=0.01
dbscan.min.points=40
dbscan.num.partitions=912

To avoid needing to edit the properties files, we always output to the same directory and then do a rename after the run: hdfs dfs -mv twitterSmall.out_mraad twitterSmall.out_mraad116executors8cores

spark-submit --driver-memory 20g --num-executors 116 --executor-cores 8 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 3mins, 28sec
spark-submit --driver-memory 20g --num-executors 58 --executor-cores 8 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 3mins, 3sec
spark-submit --driver-memory 20g --num-executors 29 --executor-cores 8 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 2mins, 53sec
spark-submit --driver-memory 20g --num-executors 29 --executor-cores 4 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 2mins, 47sec
spark-submit --driver-memory 20g --num-executors 29 --executor-cores 2 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 2mins, 49sec
spark-submit --driver-memory 20g --num-executors 32 --executor-cores 1 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 2mins, 27sec
spark-submit --driver-memory 20g --num-executors 29 --executor-cores 1 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 3mins, 3sec
spark-submit --driver-memory 20g --num-executors 16 --executor-cores 1 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 2mins, 30sec
spark-submit --driver-memory 20g --num-executors 14 --executor-cores 1 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 3mins, 20sec
spark-submit --driver-memory 20g --num-executors 8 --executor-cores 1 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 2mins, 54sec
spark-submit --driver-memory 20g --num-executors 7 --executor-cores 1 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 3mins, 41sec
spark-submit --driver-memory 20g --num-executors 4 --executor-cores 1 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 5mins, 30sec
spark-submit --driver-memory 20g --num-executors 2 --executor-cores 1 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 9mins, 34sec
spark-submit --driver-memory 20g --num-executors 1 --executor-cores 1 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 18mins, 25sec
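To get a rough idea of how much of the runtime does not parallelise, the Karp-Flatt metric (the experimentally determined serial fraction) can be computed from the 1-core-per-executor runs above. This metric is an addition made here for illustration and was not part of the original measurements; only the runs with 1, 2, 4, 8, 16 and 32 executors are used.

```python
def karp_flatt(speedup: float, cores: int) -> float:
    """Experimentally determined serial fraction e = (1/S - 1/p) / (1 - 1/p)."""
    return (1 / speedup - 1 / cores) / (1 - 1 / cores)


# executors (1 core each) -> measured runtime in seconds, from the runs above
runtimes = {
    1: 18 * 60 + 25,
    2: 9 * 60 + 34,
    4: 5 * 60 + 30,
    8: 2 * 60 + 54,
    16: 2 * 60 + 30,
    32: 2 * 60 + 27,
}

for cores, runtime in runtimes.items():
    if cores == 1:
        continue  # the metric is undefined for a single core
    measured_speedup = runtimes[1] / runtime
    print(f"{cores:>2} cores: speedup {measured_speedup:4.2f}, "
          f"serial fraction {karp_flatt(measured_speedup, cores):.3f}")
```

For these measurements, the serial fraction is roughly 0.04 with 2 executors and roughly 0.11 with 32 executors, i.e. adding more executors beyond 8 to 16 gains very little.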

spark-submit --driver-memory 20g --num-executors 29 --executor-cores 2 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix_58partitions smaller boxes : 1mins, 38sec
spark-submit --driver-memory 20g --num-executors 29 --executor-cores 2 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix_116partitions0.01cellsize : 1mins, 24sec
spark-submit --driver-memory 20g --num-executors 29 --executor-cores 4 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix_116partitions0.01cellsize : 1mins, 20sec and 1mins, 37sec

TwitterBig runs

These runs seem to need a lot of RAM: use only 1 core on each worker so that this core can use the full RAM (otherwise, the run aborts due to timeouts, which are typically caused by crashes because of running out of memory).
spark-submit --conf "spark.akka.timeout=300s" --conf "spark.network.timeout=300s" --conf "spark.akka.frameSize=2000" --driver-memory 30g --num-executors 38 --executor-cores 1 --executor-memory 30g --conf "yarn.nodemanager.resource.cpu-vcores=1" --conf "spark.yarn.executor.memoryOverhead=12000" --conf "spark.driver.cores=20" --conf "spark.driver.maxResultSize=0" --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterBig.properties_hdfs_noprefix_14592partitions0.01cellsize : 1hrs, 3mins, 58sec
spark-submit --conf "spark.akka.timeout=300s" --conf "spark.network.timeout=300s" --conf "spark.akka.frameSize=2000" --driver-memory 30g --num-executors 38 --executor-cores 1 --executor-memory 30g --conf "yarn.nodemanager.resource.cpu-vcores=1" --conf "spark.yarn.executor.memoryOverhead=12000" --conf "spark.driver.cores=20" --conf "spark.driver.maxResultSize=0" --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterBig.properties_hdfs_noprefix_1824partitions0.01cellsize : 24mins, 51sec
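The memory that each executor requests from YARN is the executor memory plus the memory overhead (spark.yarn.executor.memoryOverhead is given in MiB). The following Python sketch adds the two up for the settings used in the TwitterBig runs on this page; the function name is illustrative, and the rounding that YARN applies to its allocation increment is deliberately not modelled.

```python
def container_mib(executor_memory_gib: int, memory_overhead_mib: int) -> int:
    """Memory requested from YARN per executor container, in MiB."""
    return executor_memory_gib * 1024 + memory_overhead_mib


# --executor-memory 30g with spark.yarn.executor.memoryOverhead=12000
print(container_mib(30, 12000))  # 42720

# --executor-memory 40g with spark.yarn.executor.memoryOverhead=2000
print(container_mib(40, 2000))   # 42960
```

Both settings request roughly 42 GiB per executor, which is why only one executor with one core is placed on each worker.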

Spark DBSCAN (https://github.com/alitouka/spark_dbscan)

Uses CSV

The source code needs to be changed slightly (to avoid passing the master URL, and the subsequent complaints about a master URL consisting of just "YARN"). Furthermore, 912 partitions were hardcoded in its IOHelper class, where the initial file is read.
In addition, build.sbt had a problem because it used version numbers that are not available:
< scalaVersion := "2.10.6"
<
< val sparkVersion = "1.6.1"
<
< // Added % "provided" so that it gets not included in assembly jar
< libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion % "provided"
<
<
< // Added % "provided" so that it gets not included in assembly jar
< // Elsewhere "spark-mllib_2.10" is used (the 2.10 might refer to Scala 2.10?)
< libraryDependencies += "org.apache.spark" %% "spark-mllib" % sparkVersion % "provided"
<
< // ScalaTest used by the tests provided by Spark MLlib
< libraryDependencies += "org.scalatest" %% "scalatest" % "2.2.6" % "test"
libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.1.0" % "provided"
22a9
> libraryDependencies += "org.scalatest" % "scalatest_2.10" % "2.1.3" % "test"

spark-submit --num-executors 116 --executor-cores 8 --executor-memory 12gb --class org.alitouka.spark.dbscan.DbscanDriver --master yarn /home/helmut/ScalaProjects/spark_dbscan/spark_dbscan-assembly-0.0.4.jar --ds-master yarn --ds-jar /home/helmut/ScalaProjects/spark_dbscan/spark_dbscan-assembly-0.0.4.jar --ds-input Twitter/twitterSmall.csv --ds-output twitterSmall_alitouka912initialpartitions4061npp --npp 4061 --eps 0.01 --numPts 40 : 53mins, 6sec

spark-submit --num-executors 116 --executor-cores 8 --executor-memory 12gb --class org.alitouka.spark.dbscan.DbscanDriver --master yarn /home/helmut/ScalaProjects/spark_dbscan/spark_dbscan-assembly-0.0.4.jar --ds-master yarn --ds-jar /home/helmut/ScalaProjects/spark_dbscan/spark_dbscan-assembly-0.0.4.jar --ds-input Twitter/twitterSmall.csv --ds-output twitterSmall_alitouka912initialpartitions25000npp --npp 25000 --eps 0.01 --numPts 40 : 40mins, 10sec

twitterBig

spark-submit --conf "spark.akka.timeout=300s" --conf "spark.network.timeout=300s" --conf "spark.akka.frameSize=2000" --driver-memory 30g --num-executors 38 --executor-cores 1 --executor-memory 40g --conf "yarn.nodemanager.resource.cpu-vcores=1" --conf "spark.yarn.executor.memoryOverhead=2000" --conf "spark.driver.cores=20" --conf "spark.driver.maxResultSize=0" --class org.alitouka.spark.dbscan.DbscanDriver --master yarn /home/helmut/ScalaProjects/spark_dbscan/spark_dbscan-assembly-0.0.4.jar --ds-master yarn --ds-jar /home/helmut/ScalaProjects/spark_dbscan/spark_dbscan-assembly-0.0.4.jar --ds-input Twitter/twitterBig.csv --ds-output twitterBig_alitouka912initialpartitions25000npp --npp 25000 --eps 0.01 --numPts 40
fails with
java.lang.Exception: Box for point Point at (51.382, -2.3846); id = 6618196; box = 706; cluster = -2; neighbors = 0 was not found

ELKI

time java -Xmx35G -jar elki-bundle-0.7.1.jar KDDCLIApplication -dbc.in /home/helmut/DataIn/twitterSmall.csv -db.index 'tree.metrical.covertree.SimplifiedCoverTree$Factory' -covertree.distancefunction minkowski.EuclideanDistanceFunction -algorithm clustering.DBSCAN -dbscan.epsilon 0.01 -dbscan.minpts 40 > twitterSmall.out_elki

time java -Xmx73G -jar elki-bundle-0.7.1.jar KDDCLIApplication -dbc.in /home/helmut/DataIn/twitterBig.csv -db.index 'tree.metrical.covertree.SimplifiedCoverTree$Factory' -covertree.distancefunction minkowski.EuclideanDistanceFunction -algorithm clustering.DBSCAN -dbscan.epsilon 0.01 -dbscan.minpts 40 > twitterBig.out_elki

Research

No teaching in autumn 2016

Helmut Neukirchen, 28. June 2016

I will be on research sabbatical in autumn 2016 and thus focus on research without any teaching obligations.

Typically, I would have taught HBV101F Software Maintenance then. Currently, it is planned to teach it one year later, in autumn 2017. Students who would have needed to take that course in autumn 2016 can get an exemption and take another course instead.

If anyone wants to start a new M.Sc. thesis during that time, I will only accept topics with a strong research focus that lead to a publication at an international conference, for example topics related to big-data processing with Apache Spark.

Organisational, Teaching

About Defending a Master's thesis

Helmut Neukirchen, 19. June 2016

Note from 2023: the text below is partly outdated. On Ugla, SENS has pretty good info on the timelines and the webforms to be filled out, i.e. ignore the timelines and webforms mentioned below.

The official regulations are in articles 7. and 8. of Regulation no. 994-2017 / Reglur um meistaranám við Verkfræði- og náttúruvísindasvið Háskóla Íslands, nr. 994/2017 (and article 69, items 9-15 of Regulation for the University of Iceland no. 569-2009 / Reglur fyrir Háskóla Íslands Nr. 569/2009). The text below should be in accordance with them -- if not, it needs to be updated... In charge of MSc. theses at the VoN student service is Donna, reachable via the HÍ email user alias sensgraduate.

If you want to graduate, take care that you finish all your coursework at the latest in parallel to your Master's project, e.g. Software Engineering students have three mandatory HBV courses. If you are a student coming from a non-Computer Science/non-Software Engineering Bachelor, you typically have to take extra courses as part of your admission to the Master's program!

A Master's thesis needs:

A supervisor (i. leiðbeinandi): Supervises the student during the whole thesis project.
An M.Sc. committee (i. meistaraprófsnefnd) consisting of the supervisor and at least one other person who needs to have an MSc. degree -- typically another university teacher -- (unofficially called "secondary supervisor" (i. meðleiðbeinandi)): Often only gets involved once the student is almost finished, i.e. has a more or less final draft of the thesis available (internally, 10% of the overall supervision effort is assumed, but it may be up to 25% and 50%). Gives comments to improve your draft, so this person should be somewhat familiar with the topic.
An external thesis examiner (i. prófdómari): If possible, should be from outside HÍ (in the old days, that person was from the faculty and thus the old English term "faculty representative" is often used; in your English thesis, use the official translation "External examiner"). That person needs a final draft (release candidate status) before the defense, but must not otherwise be involved in the supervision.

Two web forms need to be filled out by the supervisor at the latest 1 week before the defense to book a defense: one for advertising the defense, one for appointing the external examiner (the web forms can be reached via the VoN intranet page in UGLA).
There, all the information that is needed has to be provided (name, title, abstract (for writing an abstract, see also item 4 from Kent Beck), day, supervisors, a photo of the student that will be used for advertising -- though more recently, it seems that the photo is not used anyway --, etc.).

To make it for the next graduation ceremony (i. brautskráning), which is in February and June each year (there is still some deadline in October, but no ceremony), there is a deadline (at the latest 3 weeks before the graduation ceremony, by which obviously all the grades need to be handed in). A few days before that deadline, there is typically an event called meistaradagurinn where the idea is that all the students of the faculty defend their thesis. Someone organises this and needs to be contacted to participate. But of course, it is also possible to defend a thesis on another day than on meistaradagurinn.

The schedule of the defense is as follows (meistaradagurinn: typically 45 minutes for talk and discussion, but there is 60 minutes time between defenses to allow time for setting up the presentation):

A few introducing words by the supervisor (including an explanation of the procedure).
20-30 minutes presentation of the thesis by the student. No need to be nervous: you know best about your topic and thesis (also, your supervisor would not allow you to defend if you would likely fail)! Learn the introducing words (to get your presentation started fluently) and the concluding words (to avoid an abrupt termination of your talk) by heart.
Max. 15 minutes questions from the audience (Note: in practice, 5 minutes for the audience and 10 minutes for internal discussion is best)
The audience leaves the room, only the student and the three teachers remain. Now some more private discussion (what was good/bad) and further questions are possible.
Finally, the student leaves the room and the teachers discuss the grade (e.g. using a grading scheme) for the thesis and after this, the student is called in again and is told the grade.

Note that the grade is filled into a form that needs to be signed by those involved in grading. This form gets prepared by the administration based on the above web form (typically, a PDF of the form is sent to the main supervisor via e-mail by Sigríður Sif Magnúsdóttir a few days before the defense).

Based on the comments that are given during the defense, some minor changes to the thesis might be required.
Students need to submit an electronic copy of their thesis to skemman.is at the latest three weeks before the next graduation ceremony (i. brautskráning), so that you can send the confirmation of submission before the deadline (a printed version is not required anymore, nor is an ISBN required: remove that line if it is part of your thesis template). At the same time, students also need to fill out a declaration of access. The declaration of access template is available in English and Icelandic. For commercial settings, access to the thesis can be closed (but not for longer than 4 years); if the thesis is closed, please also send the final PDF to your supervisors, because they cannot access it otherwise.

If the thesis is accepted, the student will receive an e-mail confirming this. The student must send the confirmation from skemman to sensgraduate at hi.is or to Sigríður Sif Magnúsdóttir by email (before the deadline where all grades for brautskráning need to be available). There is also an UGLA page on brautskráning that hopefully is still available when you read this...

To allow the supervisors to read and comment on the thesis, a first draft needs to be finished in time:

First draft for the supervisor: at the latest 1 month before the defense. Preferably, use an agile approach of delivering early drafts as soon as a new chapter is finished. (Do not start with the Introduction -- that is often the last chapter written.)
Release candidate draft for the co-supervisor and "prófdómari": at the latest 1 week before the defense, better much earlier. This deadline also applies to filling out the above-mentioned web forms.
Poster needs to be printed at the latest 1 day before the defense on Meistaradagurinn (e.g., Háskólaprent does this within 15-30 minutes).

If you finish your thesis in August/September/October you may not need to pay tuition fees for the new academic year.

There is an UGLA page with the various deadlines of the graduation process.

Templates for thesis, presentation, and poster

Note: Since 10/2021, HÍ has a new corporate identity that is covered here:

There is an MS Word template, but I really recommend using the LaTeX templates for writing the thesis. As "Advisors", list first your supervisor and in the next line the secondary/co-supervisors. Note that while the template may contain an ISBN, you have to remove that line, as nowadays everything is electronic only. Have a look at some older MSc. theses to get an idea of the typical contents.
For the defense, a PPT slide template is available on the HÍ corporate design web page -> Hönnunarstaðall (at top right corner) -> Rafrænar einingar -> PowerPoint (and then the download is at the bottom). If you would rather use LaTeX for your presentation, Katrín Halldórsdóttir created a LaTeX Beamer template, which however uses the old 2010 corporate design (the tex file is GPL, however the logos are property of the University). Any volunteers to update it to the new design?
Furthermore, a poster is displayed on Meistaradagurinn (if the defense is on a different day, you are still supposed to prepare a poster to be displayed later on Meistaradagurinn). You find templates with the new 2021 look on the HÍ corporate design web page -> Hönnunarstaðall (at top right corner) -> Prentmiðlar -> Veggspjöld (and then the download is at the bottom). However, the template contained in the ppt download is not very helpful -- you would rather need the provided Adobe InDesign template. As most will not have a license for Adobe InDesign, I provide here a PPT template that has been converted from the Adobe InDesign template. Note that you need to download all fonts of the Google font family "Jost" and install them (if you do not have it installed, PowerPoint will use another font, but the print shop probably has the Jost font, and then the layout does not match anymore).
I suggest using a smaller font size than in the template: compare with the font size used in the PPT template using the old look (but note that the old template uses A0 page size, while the new one uses A4 page size, which will then be scaled up when printed in, e.g., A0; hence, display both side by side for comparison).
You should also refer to that old template to get an idea of the typical contents, such as adding the names of the supervisors, etc. As a backup, I provide here a copy of that old template.
Háskólaprent can print the poster (typically in A0 size).

Note that our School of Engineering and Natural Sciences offers a Course on thesis skills such as writing (and you even get ECTS credit points for it). In addition, you will find on the web other general information on thesis writing.

Organisational, Teaching, Theses

Contact

Dr. Helmut Neukirchen

Professor of Computer Science and Software Engineering

Faculty of Industrial Engineering, Mechanical Engineering and Computer Science

Deputy head of faculty (Autumn 2024-Spring 2026)

University of Iceland
Department of Computer Science

Gróska building, 3rd floor (stairway A or B), room 306
Bjargargata 1 (will be renamed to Kristínargata 1)
102 Reykjavik
Iceland

E-Mail: helmut at hi. is
(Encrypted e-mail welcome: my public PGP key, also available at key servers -- X.509 based S/MIME encryption possible on request.)
Phone (mobile): 615 2554