Successful application for European H2020 funding (FET PROACTIVE – HIGH PERFORMANCE COMPUTING: Co-design of HPC systems and applications)

Helmut Neukirchen, 29. January 2017

A consortium including the University of Iceland participated successfully in the European Commission's Horizon 2020 research program call FET PROACTIVE – HIGH PERFORMANCE COMPUTING: Co-design of HPC systems and applications. The University of Iceland's team is led by Helmut Neukirchen together with Morris Riedel. The secured funding for the University of Iceland (387 860 EUR for the three-year project duration starting 1 July 2017) will, among other things, be used to hire a researcher who will perform ambitious research by providing scientific parallel applications from the field of machine learning for extreme-scale/pre-exascale high-performance computing, i.e. creating next-generation software for the next-generation supercomputing hardware.

More details can be found here.

No teaching in autumn 2017 / Underfinancing of Icelandic universities #háskólaríhættu / How the University deals with it

Helmut Neukirchen, 28. January 2017

Update on teaching: I will teach HBV101F Software Maintenance in spring 2018, and TÖL503M Distributed Systems will very likely be taught in autumn 2017 by an external teacher.

I will not be teaching in autumn 2017, hence the courses HBV101F Software Maintenance and TÖL503M/TÖL102F Distributed Systems will not be taught by me. Due to the lack of sufficient financing of public universities by the Icelandic government, it is currently not possible to pay someone else to teach these courses. If state financing for universities improves, this might change!

Students who would have needed to take the course HBV101F Software Maintenance (which is mandatory in the Software Engineering study line) can get an exemption and take another course instead.

Some background on public university financing: in Iceland, the state spends a little less than 1.3 million krona (at the current exchange rate: 10 660 EUR) per student and year (which covers not only salaries of all kinds of staff, but also infrastructure such as buildings, or infrastructure to do research), whereas the average in Iceland's Nordic neighbour states is more than 2.2 million krona (at the current exchange rate: 18 000 EUR) per student and year. As a result, I am not allowed to work overtime to teach beyond my teaching obligation as I did in the past. (Well, I could work overtime, but I would not get paid, and then the state would rely on stupid professors working for free and lower the funding even further.)

While permanent overtime is typically more expensive than hiring additional staff, a professor has a 48% teaching obligation, 12% administration and a 40% research obligation. Hence, hiring a new professor just in order to add more teaching capacity does not pay off: only 48% of that salary would go into teaching. Hence, permanent overtime work by professors to ensure teaching (not talking about research -- of course, a good university needs to do both) in fact makes economic sense and is thus often the norm. Reducing the funding of universities to such an extent that the only way for the universities to save money is to reduce overtime payments therefore leads to problems with respect to teaching offerings and teaching quality! Of course, instead of working and paying overtime, it would be best to employ further professors, because these ensure both teaching and research, which are both pillars of universities!

If you think the underfinancing of public universities by the Icelandic government is a shame, then you have not yet read how the University administration deals with the current underfunding in the fiscal year 2017:

Our Faculty of Industrial Engineering, Mechanical Engineering and Computer Science (Icelandic: IVT) is part of the School of Engineering and Natural Sciences (SENS or Icelandic: VoN). For determining how the budget is distributed to the individual faculties, the University of Iceland applies a distribution model ("deililíkan", see the Icelandic description in Deililíkan Háskóla Íslands -- Skýrsla til rektors -- Tillögur um breytingar og úrbætur and the MPA thesis Árangursstjórnun í háskólum á Íslandi, or for English texts, section 1.5.9 of Evaluation System for Public Higher Education Institutions Description and Self-Review -- December 9, 2016 and section 3.4 of DOI:10.13177/irpa.a.2016.12.1.9, however the formula in the latter contains some typos) that involves an allocation formula which takes (among others) the teaching activities (in terms of the number of students) and research activities (in terms of publications and acquired funding) into account. While this is calculated individually for each faculty, the money does not go directly to each faculty; instead, SENS receives the money for all its faculties. However, this money is not forwarded by the head of SENS to the faculties according to the distribution model! Instead, we (who save money) get less and others (who do not save money) get more (in fact, they get our money):

While our IVT faculty is, together with a smaller one, the only faculty of SENS that manages to operate within the budget of the distribution model (we even use less because we do not have as many permanent positions staffed as we should -- see above), the other faculties do not, but exceed their budget. Because our IVT faculty is so good at reducing costs (for example by doing teaching as cheaper overtime -- see above), our money is taken away and given instead to all the faculties that do not manage to stay within their budget. In fact, we are even requested to save even more (see above: no overtime payments), while the other faculties are allowed to continue spending more than the budget allocated to them according to the distribution model.

TL;DR: we are forced by the University administration to cut down our budget far beyond our allocation ("earned" by us via the performance indicators used as input for the distribution model) in a way that we sacrifice our teaching offering and quality -- only to feed the other faculties that need more money than their allocation provides (they would either have to improve their performance indicators, convince everyone to change the distribution model, or save money). Due to a lack of transparency, we cannot even check whether the others at least try to save money (e.g. while we cancel courses with fewer than 10 participants, we do not know whether they do this at all): the dean of SENS only gave us their overall budget needs, but no justification for them.

I leave it up to you to decide whether this makes SENSe or not.

P.S.: In January 2017, we were not paid any overtime: this overtime payment refers to overtime worked in 2016, i.e. the word is (we were never officially informed about the reason -- see the lack of transparency above) that the dean of SENS refuses to pay for work that we did back in 2016 (and even earlier in some cases), even though working overtime was not forbidden in those days -- rather, overtime was ordered. This is a clear violation of the collective wage agreements (so also the University administration relies on stupid professors working for free). At least I get my normal fixed salary paid -- in contrast to a part-time teacher who did not get paid at all until he threatened to go on hunger strike. Maybe we should do the same...

P.P.S.: Notably, the university mastered the severe 2008 financial crash in Iceland without the above problems. It took a new government to underfinance the university in the times of a flourishing economy in 2016/2017. That government was elected (over the one that cleaned up the mess after the 2008 crisis and allowed indebted house owners to write off debts that were higher than 110% of their property's value) because it promised even more write-offs of housing debts (which is one of the reasons why that government has no money left to finance the universities). It just came to light that those who benefited most from these write-offs were those with high incomes who took high loans -- in contrast to those who wisely avoided high debts or made no debts at all. Does this remind you of the above faculties that do not stay within their budget and thus get even more money, and our faculty that wisely stays within its budget...?

Deadline extension: Clausthal-Göttingen International Workshop on Simulation Science

Helmut Neukirchen, 22. January 2017

Update: deadline extended until 3 February 2017!

Due to the fast development of information technology, the understanding of phenomena in the natural, engineering, economic and social sciences increasingly relies on computer simulations. Simulation-based analysis and engineering techniques are traditionally a research focus of Clausthal University of Technology and the University of Göttingen, which is especially reflected in their joint interdisciplinary research cluster "Simulation Science Center Clausthal-Göttingen". In this context, the first "Clausthal-Göttingen International Workshop on Simulation Science" aims to bring together researchers and practitioners from both industry and academia to report on the latest advances in simulation science.

The workshop considers the broad area of modeling & simulation with a focus on:

  • Simulation and optimization in networks:
    Public & transportation networks, computer & sensor networks, queuing networks, Internet of Things (IoT) environments, simulation of uncertain optimization problems, simulation of complex stochastic systems
  • Simulation of materials:
    Development and applications of computational techniques in material and process simulation, simulation at micro (atomistic), meso and macro (continuum) scales including scale bridging, diffusive, convective transport and chemical processes in materials, simulation of granular matter
  • Distributed simulations:
    Technology enabler for distributed simulation (e.g., simulation support for vector and parallel computing architectures, grid-based systems and cloud-based systems), methods for distributed simulation (e.g., agent-based simulation, multi-level simulation, and simulation for big data analytics, fusion and mining), application examples (e.g., simulation-based quality assurance and high-energy physics)

27 - 28 April 2017, Göttingen, Germany

Extended Abstract (2-3 pages) Submission: 20 Jan 2017

Workshop web page

Call for Papers: Download

EGU session on eScience, ensemble methods and environmental changes in high latitudes

Helmut Neukirchen, 18. November 2016

The eSTICC project is holding a session on "eScience, ensemble methods and environmental changes in high latitudes" at EGU (European Geosciences Union General Assembly) 2017 Vienna, Austria, 23-28 April 2017.

Convener: Ignacio Pisso
Co-Conveners: Andreas Stohl, Michael Schulz, Torben R. Christensen, Risto Makkonen, Tuula Aalto, Helmut Neukirchen, Alberto Carrassi, Laurent Bertino.

The multiple environmental feedback processes at high latitudes involve interactions between the land, ocean, cryosphere, biosphere and atmosphere. For trustworthy computational predictions of future climate change, these interactions need to be taken into account by eScience tools. In particular, this requires: 1) Integration of existing measurement data and enhanced information flow between disciplines; 2) Representation of the current process understanding in Earth System Models (ESMs), for which computational limitations require balancing the process simplifications; and 3) Improved process understanding. eScience approaches such as High-Performance Computing (HPC), big data or scientific workflows are central to all of these areas.

The session welcomes contributions in fields related to the intersection of environmental change (such as, but not restricted to, measurements, inverse modeling, data assimilation, process parametrizations, ESMs) and eScience (such as, but not restricted to, HPC, scientific workflows, big data, ensemble methods).

The deadline for receipt of abstracts is 11 January 2017, 13:00 CET. You are welcome to submit abstracts via the session's web page.

Is Supercomputing dead in the age of Big Data processing?

Helmut Neukirchen, 9. November 2016

In the age of Big Data and big data frameworks such as Apache Spark, one might be tempted to think that supercomputing/high-performance computing (HPC) is obsolete. But in fact, Big Data processing and HPC are different, and one platform cannot replace the other. I outlined this in a presentation at the Science Day of the University of Iceland's School of Engineering and Natural Sciences on Saturday, 29 October 2016. (Note that there is nowadays some convergence, e.g. a graph-processing benchmark top-500 list intended to resemble less CPU-intensive workloads in HPC.)

Furthermore, the available open-source implementations of algorithms (e.g. clustering using DBSCAN) are currently much faster on HPC, and the available Big Data implementations do in fact not even scale beyond a handful of nodes. Results of a case study performed during my guest research stay at the research group High Productivity Data Processing of the Federated Systems and Data division at the Jülich Supercomputing Centre (JSC) are published in this Technical Report.

One of the reviewers of the 1st IEEE International Workshop on Big Spatial Data (BSD 2016) seems not to like the message that Big Data needs to do its homework to match HPC, hence my paper was rejected. While I assume that an HPC conference (such as ISC) might accept it, it would be nice to get the message to the Big Data community: I might submit it to The 6th International Workshop on Parallel and Distributed Computing for Large Scale Machine Learning and Big Data Analytics, or later to BDCloud 2017: The 7th IEEE International Conference on Big Data and Cloud Computing. Closed-source implementations may also be worth considering: A novel scalable DBSCAN algorithm with Spark or A Parallel DBSCAN Algorithm Based on Spark (if we get access to the implementations -- the lacking possibility of reproducing/verifying scientific results is another story, covered in my Technical Report). Also, I might add threats to validity (such as construct, internal and external validity [Carver, J., VanVoorhis, J., Basili, V., August 2004. Understanding the impact of assumptions on experimental validity.]).

Update from 9.11.2016: Erich Schubert (thanks!) pointed me to the related article "The (black) art of runtime evaluation: Are we comparing algorithms or implementations?" (DOI: 10.1007/s10115-016-1004-2) which supports my findings. A statement from that article on k-means: "Judging from the measured runtime and even assuming zero network overhead, we must assume that a C++ implementation using all cores of a single modern PC will outperform a 100-node cluster easily." For DBSCAN, they show that a C++ implementation is one order of magnitude faster than the Java ELKI (which confirms my measurements concerning the C++ HPDBSCAN and the Java ELKI) on the dataset they used. They also support my claim that the implementation matters: "Good implementations with index accelerations would process this data set in less than 2 seconds, whereas the fastest linear scan implementation took over 90 seconds, and naïve implementations would frequently require over 100 times the best runtime. But not every index we evaluated was implemented correctly, and sometimes an index was even slower than the linear scan. Between different implementations using linear scan, we can observe runtime differences of more than two orders of magnitude. Comparing the fastest implementation measured (optimized C++ code for this task) to the slowest implementation (outdated versions of Weka), we observe four orders of magnitude: less than two seconds instead of over four hours."

Fake / predatory (Open Access) Journals

Helmut Neukirchen, 8. November 2016

Fake / predatory journals (typically open-access journals that publish everything as long as they get paid for it) are a problem for scholars. A good starting point to identify them is Beall's List, with lists of publishers that publish a range of fake journals, of single fake journals that are not related to those publishers, as well as of hijacked journals that look like the submission web page of the original journal. Also, searching the above web site is a good idea.

Update from 2018: The above web pages do not exist anymore in 2018 (but 2017 versions can be retrieved via http://archive.org). In addition, there is https://beallslist.weebly.com/ which even adds new entries. Another blog covering this topic is http://flakyj.blogspot.com/.

In addition to the above blacklists, there is also a whitelist by the Directory of Open Access Journals (DOAJ). But beware: some journals even appear on both the blacklist and the whitelist...

Fake / predatory conferences are also a problem, for example those hosted by IARIA: I was once myself a TPC member of The First International Conference on Advances in System Testing and Validation Lifecycle (VALID 2009). As it was the first one and even published by IEEE, it was not obvious to me at that time that this is a bogus conference. Only when I, as a reviewer, never got access to the reviews of the other reviewers did it become obvious that no rigorous academic standards apply, and I have since not accepted to be a TPC member of any IARIA conference (nor, of course, submitted there).

Anyway, the University of Iceland recognises most publications listed in ISI Web of Knowledge and Scopus, which so far contain only serious publication targets.

Tahoma and Tahoma bold font in Wine/CrossOver

Helmut Neukirchen, 27. October 2016

Even if the free Microsoft Core fonts are installed, Tahoma is missing. A Microsoft knowledge base support entry offers it for download as Tahoma32.exe; however, this is a broken link. Hence, download the therein contained files (tahoma.ttf and tahomabd.ttf) from elsewhere (this seems to be legal as Microsoft offered them to the public anyway), e.g. https://github.com/caarlos0/msfonts/tree/master/fonts

Copy the font files to the ~/.fonts directory and run fc-cache -fv.
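
For reference, as a concrete sketch (assuming the two .ttf files were downloaded to the current directory):

# assuming tahoma.ttf and tahomabd.ttf are in the current directory
mkdir -p ~/.fonts
cp tahoma.ttf tahomabd.ttf ~/.fonts/
fc-cache -fv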

Some notes on using a Spark cluster

Helmut Neukirchen, 18. August 2016

The following notes are mainly for my personal use referring to the Spark 1.6/YARN cluster that I access, but maybe they are helpful for you as well...

Upload to HDFS

By default (i.e. used implicitly by all HDFS operations), HDFS paths are relative to your HDFS home directory: it needs to be created first by the administrator!
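
For reference, creating such a home directory would typically be done by the administrator along these lines (a sketch only; the /user/username layout and the username are assumptions):

# to be run by the HDFS administrator; /user/username is just the usual default layout
hdfs dfs -mkdir -p /user/username
hdfs dfs -chown username /user/username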

While piping through SSH should work (cat test.txt | ssh username@masternode "hdfs dfs -put - hadoopFoldername/"), it is reported to be slow -- I never checked this, but as I used rather small data anyway, I instead did an scp to the local file system of the master node and afterwards an hdfs put:
scp localFile username@masternode:
hdfs dfs -put twitterSmall.csv Twitter

Concatenate HDFS files (all inside an HDFS directory) and store them in the local file system (without sorting)

hdfs dfs -getmerge HdfFolderContainingSplitResultFiles LocalFileToBeCreated

Note that Spark does not overwrite output files in HDFS by default. Either take care that the output files have been (re-)moved when you re-run jobs, or allow overwriting in the Spark conf of your program: conf.set("spark.hadoop.validateOutputSpecs", "false")
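
The same setting should also be possible to pass on the spark-submit command line instead of changing the program (a sketch only; class and jar names are placeholders):

# placeholder class/jar names; --conf sets the same property that conf.set(...) would set in the program
spark-submit --conf "spark.hadoop.validateOutputSpecs=false" --class MyJob --master yarn myAssembly.jar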

Debugging

  1. See http://spark.apache.org/docs/latest/running-on-yarn.html
  2. Use spark-submit --verbose
  3. If executor processes are killed, this is mostly due to insufficient RAM (garbage collection takes too long, thus timeouts occur, or simply out-of-memory/OOM exceptions are thrown). While in this case you only see "exit code 143" in the log of the driver on the spark-submit console, the details need to be found in the logs of the nodes/executors. This may not be possible via the Web UI due to executor nodes being firewalled -- in this case use:
    yarn logs -applicationId application_1470137500465_0147
    (The application Id is to be taken from the ID column in the cluster Web UI. This works only for completed runs, not the currently running one.) In these logs, you can then search for java.lang.OutOfMemoryError: GC overhead limit exceeded or java.lang.OutOfMemoryError: Java heap space (see the grep example below).
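
Since these aggregated logs can get long, grepping for the exception is usually enough (an example only, using the application Id from above):

# search the aggregated YARN logs for out-of-memory errors, with two lines of context before each match
yarn logs -applicationId application_1470137500465_0147 | grep -B 2 "java.lang.OutOfMemoryError"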

Performance tuning

  1. Note that due to the HDFS block size of 128 MB, partitions of this size are created by default when reading data. To enforce a higher number of partitions/higher parallelism, use the optional minPartitions parameter of sc.textFile already at the file read stage (many other RDD-creating operations support such a parameter as well).
  2. Some introduction https://www.mapr.com/blog/resource-allocation-configuration-spark-yarn
    http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
    (in particular: more than 5 cores per executor is said to lead to bad HDFS throughput. Note that "executor" is not identical to "node", thus instead of running one executor with 24 cores on one node, rather run 4 executors with 5 cores each on each node, or 8 executors with 3 cores! Note that then, however, the overall memory of a node needs to be divided by the number of executors per node, e.g. 5 GB per executor with 8 executors per node on a 40 GB RAM node.)
  3. Config for RAM-intensive jobs (= only 1 core per executor & only 1 core per node, using 40 GB heap space and 2 GB overhead for Spark/Yarn itself => on each of the 38 nodes only one core is used, which can thus make use of all available RAM); in addition, increase timeouts and message size:
    spark-submit --conf "spark.network.timeout=300s" --conf "spark.akka.frameSize=2000" --driver-memory 30g --num-executors 38 --executor-cores 1 --conf "yarn.nodemanager.resource.cpu-vcores=1" --executor-memory 40g --conf "spark.yarn.executor.memoryOverhead=2000" --conf "spark.driver.cores=4" --conf "spark.driver.maxResultSize=0"
    (Note: not sure about the driver memory and cores: this seems to have no influence -- is it too late to set it here?)

DBSCAN evaluation

Helmut Neukirchen, 17. August 2016

This post is used to document some DBSCAN command line parameters used in a DBSCAN implementation evaluation. Once a paper referencing it is published, this will go to EUDAT B2SHARE and thus get a persistent handle.

Conversion of HDF5 file into CSV/SSV

h5dump -d "/DBSCAN" -o out.txt twitterSmall.h5
This yields lines such as: (12,0): 53.3243, -1.12341,
Remove the Id in the first column:
cut out.txt -d ':' -f 2 >out.txt2
This yields lines such as: 53.3243, -1.12341,
Remove the extra comma at the end of each line:
cut out.txt2 -d ',' -f 1,2 >out.csv
It seems that the first line is just an empty line; remove it using:
tail -n +2 out.csv >twitterSmall.csv
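
Alternatively, the whole conversion might be done in one pass with an awk one-liner (an untested sketch that assumes the h5dump output lines shown above):

# untested sketch: keep only the coordinates, drop the trailing comma, skip the empty line
awk -F': ' 'NF > 1 { sub(/,[[:space:]]*$/, "", $2); print $2 }' out.txt > twitterSmall.csv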

The mraad DBSCAN implementation expects SSV format with IDs (i.e. remove the brackets after the h5dump run):
cut out.txt -c 5- > out_withoutleadingbracket.txt
cut out_withoutleadingbracket.txt -d ':' -f 1,2 --output-delimiter=',' > out3.txt
cut out3.txt -d ')' -f 1,2 --output-delimiter=',' > out4.txt
cut out4.txt -d ',' -f 1,4,5 --output-delimiter=',' > twitterBig_withIds.csv
cut -d ',' twitterBig_withIds.csv -f 1-3 --output-delimiter=' ' > twitterBig_withIds.ssv_with_doublespaces
cut -d ' ' -f 1,3,5 twitterBig_withIds.ssv_with_doublespaces >twitterBig_withIds.ssv

The Dianwei Han DBSCAN implementation expects SSV without IDs; remove the Id from the mraad format:
cut twitterSmall.ssv -f 2,3 -d ' ' >twitterSmallNoIds.ssv

Running on Spark cluster

A single "assembly" jar needs to be created. In some cases, the sbt build file does not match the JVM/Scala/library versions and thus, minor adjustments were needed.

Find out optimal number of threads (cores) per executor for RDD-DBSCAN

Typically, 5 cores per executor are recommended to avoid problems with parallel HDFS access by the same JVM; tried here: 3, 4, 6 or more(?). Note that we have 39*24 = 936 cores, but we should leave a few cores for the master and other stuff.

912 cores, 3 cores per executor process:
spark-submit --driver-memory 2g --driver-cores 3 --num-executors 304 --executor-cores 3 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions 25000 0.01 40 : 12mins, 55sec with 912 partitions enforced in sc.textFile(src) and no repartition to a single file when saving result output (repartition to single file would crash process that has to do this "reduce" step).

932 cores, 4 cores per executor process:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_233executors_4coreseach 25000 0.01 40 : 11mins, 3sec with 912 partitions enforced in sc.textFile(src) and no repartition to a single file when saving result output.

912 cores, 6 cores per executor process:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 152 --executor-cores 6 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_233executors_4coreseach 25000 0.01 40 : 11mins, 38sec
928 cores = 116 executors * 8 cores per executor process:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 116 --executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_116executors_8coreseach 25000 912 0.01 40 : 10mins, 26sec
912 cores, 76 executors * 12 cores per executor process:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 76 --executor-cores 12 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_76executors_12coreseach 25000 912 0.01 40 : 12mins, 4sec
924 cores = 42 executors * 22 cores per executor process:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 42 --executor-cores 22 --executor-memory 10gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_42executors_22coreseach 25000 912 0.01 40 : 13mins, 58sec

Trying to find out optimal number of partitions for RDD-DBSCAN

spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_233executors_4coreseach 25000 0.01 40 : 11mins, 3sec with 912 partitions enforced in sc.textFile(src) and no repartition to a single file when saving result output.
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_456_partitions_233executors_4coreseach 25000 0.01 40 : 12mins, 16sec
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_228_partitions_233executors_4coreseach 25000 0.01 40 : 11mins, 17sec
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_114_partitions_233executors_4coreseach 25000 0.01 40 : 11mins, 57sec
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_57_partitions_233executors_4coreseach 25000 0.01 40 : 10mins, 22sec
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_28_partitions_233executors_4coreseach 25000 0.01 40 : 11mins, 29sec
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 10 --executor-cores 6 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_57_partitions_10executors_6coreseach 25000 0.01 40 : 11mins, 49sec
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 2 --executor-cores 6 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_12_partitions_2executors_6coreseach 25000 0.01 40 : 21mins, 20sec
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 4 --executor-cores 6 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_24_partitions_24executors_6coreseach 25000 24 0.01 40 : 17mins, 11sec
Note that there is an imbalance in one of the stages (stage 3 at line 127): while the median task takes 22 s, the longest task takes 3.5 minutes! But with a smaller partition size, this maybe becomes less imbalanced and thus slightly faster? (Or is this rather due to less overhead?)

Experiment with the partition size (MaxPoints) parameter

Investigate whether this results in a different amount of noise being filtered out!

TwitterSmall: 3 704 351 points.
The lower bound of the partition size that is possible to achieve with the given eps is 25 000 (= each partition has fewer than 25 000 points, the majority much fewer, maybe about 9 000). 3 704 351 / 25 000 means that at least 148 partitions are created; with a partition size of 9 000: 3 704 351 / 9 000 = more than 400 partitions.
Or: if we want to have 912 partitions (=slightly less than the available cores): 3 704 351 / 912 = 4061 points per partition would be best! TODO: try this value! (Note: if the resulting partitioning rectangle reaches size 2*eps, no smaller partitioning is possible and in this case, this rectangle will contain more points than specified.)
MaxPoints=4061 (a couple of "cannot split" messages occurred):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 6gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_233executors_4coreseach_4061_epsmaxPoints 4061 912 0.01 40 : 19mins, 17sec
MaxPoints=9000 (a couple of "cannot split" messages occurred):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 116 --executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_116executors_8coreseach_9000maxPoints 9000 912 0.01 40 : 13mins, 43sec
MaxPoints=20000 (Can't split: (DBSCANRectangle(51.4999988488853,-0.13999999687075615,51.51999884843826,-0.11999999731779099) -> 21664) (maxSize: 20000)):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 116 --executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_116executors_8coreseach_20000maxPoints 20000 912 0.01 40 : 14mins, 27sec
MaxPoints=25000:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 116 --executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_116executors_8coreseach_25000maxPoints 25000 912 0.01 40 : 11mins, 15sec
MaxPoints=30000:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_233executors_4coreseach_30000maxPoints 30000 912 0.01 40 : 9mins, 31sec
MaxPoints=32500:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_233executors_4coreseach_32500maxPoints 32500 912 0.01 40 : 10mins, 53sec but second run: 9mins, 6sec third run: 9mins, 13sec, fourth run: 9mins, 20sec
MaxPoints=35000:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 6gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_233executors_4coreseach_35000_epsmaxPoints 35000 912 0.01 40 : 9mins, 42sec
MaxPoints=40000:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 116 --executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_116executors_8coreseach_40000maxPoints 40000 912 0.01 40 : 10mins, 56sec
MaxPoints=50000:
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 116 --executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_116executors_8coreseach_50000maxPoints 50000 912 0.01 40 : 14mins, 6sec

Scaling test

Increase number of nodes or executors!

116 executors with 8 cores each (=928 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 116 --executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_116executors_8coreseach_25000maxPoints 25000 912 0.01 40 : 11mins, 15sec
116 executors with 4 cores each (=464 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 116 --executor-cores 4 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_116executors_4coreseach_25000maxPoints 25000 912 0.01 40 : 10mins, 40sec
58 executors with 8 cores each (=464 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 58 --executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_58executors_8coreseach_25000maxPoints 25000 912 0.01 40 : 10mins, 24sec
With 464 cores, it is not slower than with 928 cores (either too much overhead with 928, or not enough partitioning).

29 executors with 8 cores each (=232 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 29 --executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_29executors_8coreseach_25000maxPoints 25000 912 0.01 40 : 11mins, 3sec
29 executors with 4 cores each (=116 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 29 --executor-cores 4 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_29executors_4coreseach_25000maxPoints 25000 912 0.01 40 : 11mins, 47sec
29 executors with 2 cores each (=58 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 29 --executor-cores 2 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_29executors_2coreseach_25000maxPoints 25000 912 0.01 40 : 10mins, 22sec
29 executors with 1 core each (=29 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 29 --executor-cores 1 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_29executors_1coreseach_25000maxPoints 25000 912 0.01 40 : (13mins, 33sec is this value from the 14 executors run?)
14 executors with 1 core each (=14 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 14 --executor-cores 1 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_14executors_1coreseach_25000maxPoints 25000 912 0.01 40 : (13mins, 33sec is this value from the 29 executors run?) 12mins, 45sec
32 executors with 1 core each (=32 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 32 --executor-cores 1 --executor-memory 50g --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_32executors_1coreseach_25000maxPoints 25000 912 0.01 40 : 13mins, 52sec
16 executors with 1 core each (=16 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 16 --executor-cores 1 --executor-memory 50g --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_16executors_1coreseach_25000maxPoints 25000 912 0.01 40 : 16mins, 57sec and 14mins, 49sec and 14mins, 49sec

8 executors with 1 core each (=8 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 8 --executor-cores 1 --executor-memory 50g --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_8executors_1coreseach_25000maxPoints 25000 912 0.01 40 : 20mins, 19sec

4 executors with 1 core each (=4 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 4 --executor-cores 1 --executor-memory 50g --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_4executors_1coreseach_25000maxPoints 25000 912 0.01 40 : 33mins, 14sec
2 executors with 1 core each (=2 cores):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 2 --executor-cores 1 --executor-memory 50g --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_2executors_1coreseach_25000maxPoints 25000 912 0.01 40 : 58mins, 41sec

1 executor with 1 core (=1 core):
spark-submit --driver-memory 2g --driver-cores 1 --num-executors 1 --executor-cores 1 --executor-memory 50g --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterSmall.csv twitterSmall.out_with_912_partitions_1executors_1coreseach_25000maxPoints 25000 912 0.01 40 : 2hrs, 1mins, 51sec
Note: due to the skewed data, there are single tasks that delay the whole run. Hence, even though not everything is processed in parallel (fewer executor threads/cores than tasks) and tasks are queued, the queued execution still finishes earlier than the single long-running task caused by the skewed data.

Twitter big

spark-submit --conf "spark.akka.timeout=300s" --conf "spark.network.timeout=300s" --conf "spark.akka.frameSize=2000" --driver-memory 30g --num-executors 38 --executor-cores 1 --executor-memory 40g --conf "yarn.nodemanager.resource.cpu-vcores=1" --conf "spark.yarn.executor.memoryOverhead=2000" --conf "spark.driver.cores=20" --conf "spark.driver.maxResultSize=0" --class SampleRddDbscanCsv --master yarn /home/helmut/ScalaProjects/H5SparkRddDbscan/target/scala-2.10/H5SparkRddDbscan-assembly-0.1.jar Twitter/twitterBig.csv twitterBig.out_with_456_partitions_38executors_1coreseach 25000 456 0.01 40 : 1hrs, 28mins, 55sec

DBSCAN on Spark (https://github.com/mraad/dbscan-spark)

By default space-separated; only a single-space separator is supported (use cut -d ' ' -f 1,4,6 twitterSmall.ssv > twitterSmall.ssv_no_extra_spaces to remove the extra spaces in the Twitter file).
Using a property file (if run on JUDGE, the paths refer to HDFS paths!):
input.path=Twitter/twitterSmall.ssv
output.path=twitterSmall.out_mraad
dbscan.eps=0.01
dbscan.min.points=40
dbscan.num.partitions=912

To avoid needing to edit the properties files, we always output to the same directory and then do a rename after the run: hdfs dfs -mv twitterSmall.out_mraad twitterSmall.out_mraad116executors8cores

spark-submit --driver-memory 20g --num-executors 116 --executor-cores 8 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 3mins, 28sec
spark-submit --driver-memory 20g --num-executors 58 --executor-cores 8 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 3mins, 3sec
spark-submit --driver-memory 20g --num-executors 29 --executor-cores 8 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 2mins, 53sec
spark-submit --driver-memory 20g --num-executors 29 --executor-cores 4 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 2mins, 47sec
spark-submit --driver-memory 20g --num-executors 29 --executor-cores 2 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 2mins, 49sec
spark-submit --driver-memory 20g --num-executors 32 --executor-cores 1 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 2mins, 27sec
spark-submit --driver-memory 20g --num-executors 29 --executor-cores 1 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 3mins, 3sec
spark-submit --driver-memory 20g --num-executors 16 --executor-cores 1 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 2mins, 30sec
spark-submit --driver-memory 20g --num-executors 14 --executor-cores 1 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 3mins, 20sec
spark-submit --driver-memory 20g --num-executors 8 --executor-cores 1 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 2mins, 54sec
spark-submit --driver-memory 20g --num-executors 7 --executor-cores 1 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 3mins, 41sec
spark-submit --driver-memory 20g --num-executors 4 --executor-cores 1 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 5mins, 30sec
spark-submit --driver-memory 20g --num-executors 2 --executor-cores 1 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 9mins, 34sec
spark-submit --driver-memory 20g --num-executors 1 --executor-cores 1 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix : 18mins, 25sec

spark-submit --driver-memory 20g --num-executors 29 --executor-cores 2 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix_58partitions smaller boxes : 1mins, 38sec
spark-submit --driver-memory 20g --num-executors 29 --executor-cores 2 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix_116partitions0.01cellsize : 1mins, 24sec
spark-submit --driver-memory 20g --num-executors 29 --executor-cores 4 --executor-memory 12g --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterSmall.properties_hdfs_noprefix_116partitions0.01cellsize : 1mins, 20sec and 1mins, 37sec

TwitterBig runs

These seem to need a lot of RAM: use only 1 core on each worker so that this core can use the full RAM (otherwise aborts occur due to timeouts, which typically are caused by crashes because of out-of-memory):
spark-submit --conf "spark.akka.timeout=300s" --conf "spark.network.timeout=300s" --conf "spark.akka.frameSize=2000" --driver-memory 30g --num-executors 38 --executor-cores 1 --executor-memory 30g --conf "yarn.nodemanager.resource.cpu-vcores=1" --conf "spark.yarn.executor.memoryOverhead=12000" --conf "spark.driver.cores=20" --conf "spark.driver.maxResultSize=0" --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterBig.properties_hdfs_noprefix_14592partitions0.01cellsize : 1hrs, 3mins, 58sec
spark-submit --conf "spark.akka.timeout=300s" --conf "spark.network.timeout=300s" --conf "spark.akka.frameSize=2000" --driver-memory 30g --num-executors 38 --executor-cores 1 --executor-memory 30g --conf "yarn.nodemanager.resource.cpu-vcores=1" --conf "spark.yarn.executor.memoryOverhead=12000" --conf "spark.driver.cores=20" --conf "spark.driver.maxResultSize=0" --master yarn /home/helmut/ScalaProjects/mraad_dbscan-spark/target/scala-2.10/dbscan-spark-assembly-0.1.jar /home/helmut/DataIn/twitterBig.properties_hdfs_noprefix_1824partitions0.01cellsize : 24mins, 51se

Spark DBSCAN (https://github.com/alitouka/spark_dbscan)

Uses CSV

The source code needs to be changed slightly (to avoid passing, and subsequent complaints concerning, a master URL consisting just of "YARN"). In addition, build.sbt had the problem of using version numbers that are not available. Furthermore, I added hardcoded 912 partitions in its IOHelper class where the initial file is read. The build.sbt changes:
< scalaVersion := "2.10.6"
<
< val sparkVersion = "1.6.1"
<
< // Added % "provided" so that it gets not included in assembly jar
< libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion % "provided"
<
<
< // Added % "provided" so that it gets not included in assembly jar
< // Elsewhere "spark-mllib_2.10" is used (the 2.10 might refer to Scala 2.10?)
< libraryDependencies += "org.apache.spark" %% "spark-mllib" % sparkVersion % "provided"
<
< // ScalaTest used by the tests provided by Spark MLlib
< libraryDependencies += "org.scalatest" %% "scalatest" % "2.2.6" % "test"
libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.1.0" % "provided"
22a9
> libraryDependencies += "org.scalatest" % "scalatest_2.10" % "2.1.3" % "test"

spark-submit --num-executors 116 --executor-cores 8 --executor-memory 12gb --class org.alitouka.spark.dbscan.DbscanDriver --master yarn /home/helmut/ScalaProjects/spark_dbscan/spark_dbscan-assembly-0.0.4.jar --ds-master yarn --ds-jar /home/helmut/ScalaProjects/spark_dbscan/spark_dbscan-assembly-0.0.4.jar --ds-input Twitter/twitterSmall.csv --ds-output twitterSmall_alitouka912initialpartitions4061npp --npp 4061 --eps 0.01 --numPts 40 : 53mins, 6sec

spark-submit --num-executors 116 --executor-cores 8 --executor-memory 12gb --class org.alitouka.spark.dbscan.DbscanDriver --master yarn /home/helmut/ScalaProjects/spark_dbscan/spark_dbscan-assembly-0.0.4.jar --ds-master yarn --ds-jar /home/helmut/ScalaProjects/spark_dbscan/spark_dbscan-assembly-0.0.4.jar --ds-input Twitter/twitterSmall.csv --ds-output twitterSmall_alitouka912initialpartitions25000npp --npp 25000 --eps 0.01 --numPts 40 : 40mins, 10sec

twitterBig

spark-submit --conf "spark.akka.timeout=300s" --conf "spark.network.timeout=300s" --conf "spark.akka.frameSize=2000" --driver-memory 30g --num-executors 38 --executor-cores 1 --executor-memory 40g --conf "yarn.nodemanager.resource.cpu-vcores=1" --conf "spark.yarn.executor.memoryOverhead=2000" --conf "spark.driver.cores=20" --conf "spark.driver.maxResultSize=0" --class org.alitouka.spark.dbscan.DbscanDriver --master yarn /home/helmut/ScalaProjects/spark_dbscan/spark_dbscan-assembly-0.0.4.jar --ds-master yarn --ds-jar /home/helmut/ScalaProjects/spark_dbscan/spark_dbscan-assembly-0.0.4.jar --ds-input Twitter/twitterBig.csv --ds-output twitterBig_alitouka912initialpartitions25000npp --npp 25000 --eps 0.01 --numPts 40
fails with
java.lang.Exception: Box for point Point at (51.382, -2.3846); id = 6618196; box = 706; cluster = -2; neighbors = 0 was not found

ELKI

time java -Xmx35G -jar elki-bundle-0.7.1.jar KDDCLIApplication -dbc.in /home/helmut/DataIn/twitterSmall.csv -db.index "tree.metrical.covertree.SimplifiedCoverTree$Factory" -covertree.distancefunction minkowski.EuclideanDistanceFunction -algorithm clustering.DBSCAN -dbscan.epsilon 0.01 -dbscan.minpts 40 > twitterSmall.out_elki

time java -Xmx73G -jar elki-bundle-0.7.1.jar KDDCLIApplication -dbc.in /home/helmut/DataIn/twitterBig.csv -db.index "tree.metrical.covertree.SimplifiedCoverTree$Factory" -covertree.distancefunction minkowski.EuclideanDistanceFunction -algorithm clustering.DBSCAN -dbscan.epsilon 0.01 -dbscan.minpts 40 > twitterBig.out_elki

No teaching in autumn 2016

Helmut Neukirchen, 28. June 2016

I will be on research sabbatical in autumn 2016 and will thus focus on research without any teaching obligations.

Typically, I would have taught HBV101F Software Maintenance then. Currently, it is planned to be taught one year later, in autumn 2017. Students who would have needed to take that course in autumn 2016 can get an exemption and take another course instead.

If anyone wants to start a new M.Sc. thesis during that time, I will only accept topics with a strong research focus thus leading to a publication at an international conference, for example related to big-data processing with Apache Spark.