﻿{"id":770,"date":"2016-08-17T00:29:19","date_gmt":"2016-08-17T00:29:19","guid":{"rendered":"http:\/\/uni.hi.is\/helmut\/?p=770"},"modified":"2017-11-13T15:22:46","modified_gmt":"2017-11-13T15:22:46","slug":"dbscan-evaluation","status":"publish","type":"post","link":"https:\/\/uni.hi.is\/helmut\/2016\/08\/17\/dbscan-evaluation\/","title":{"rendered":"DBSCAN evaluation"},"content":{"rendered":"<p>This post documents the DBSCAN command line parameters used in a DBSCAN implementation evaluation. Once a paper referencing it is published, the material will be deposited in EUDAT B2SHARE and thus get a persistent handle.<\/p>\n<h3>Conversion of HDF5 file into CSV\/SSV<\/h3>\n<p>h5dump -d \"\/DBSCAN\" -o out.txt twitterSmall.h5<br \/>\nYields lines such as (12,0): 53.3243, -1.12341,<br \/>\nRemove the Id in the first column:<br \/>\ncut out.txt -d ':'  -f 2 &gt;out.txt2<br \/>\nYields lines  53.3243, -1.12341,<br \/>\nRemove the extra comma at the end of each line:<br \/>\ncut out.txt2 -d ','  -f 1,2 &gt;out.csv<br \/>\nThe first line seems to be just an empty line; remove it using<br \/>\ntail -n +2 out.csv &gt;twitterSmall.csv<\/p>\n<p>The mraad DBSCAN implementation expects SSV format with IDs (i.e. remove the brackets after the h5dump run):<br \/>\ncut out.txt  -c 5- &gt; out_withoutleadingbracket.txt<br \/>\ncut out_withoutleadingbracket.txt -d ':' -f 1,2 --output-delimiter=',' &gt; out3.txt<br \/>\ncut out3.txt -d ')' -f 1,2 --output-delimiter=',' &gt; out4.txt<br \/>\ncut out4.txt -d ',' -f 1,4,5 --output-delimiter=',' &gt; twitterBig_withIds.csv<br \/>\ncut -d ',' twitterBig_withIds.csv -f 1-3 --output-delimiter=' ' &gt; twitterBig_withIds.ssv_with_doublespaces<br \/>\ncut -d ' ' -f 1,3,5 twitterBig_withIds.ssv_with_doublespaces &gt;twitterBig_withIds.ssv<\/p>\n<p>The Dianwei Han DBSCAN expects SSV without IDs; remove the Id from the mraad format:<br \/>\ncut twitterSmall.ssv -f 2,3 -d ' ' &gt;twitterSmallNoIds.ssv<\/p>\n<h3>Running on Spark cluster<\/h3>\n<p>A single \"assembly\" jar needs to be created. 
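Incidentally, each of the cut chains in the conversion section above could be collapsed into a single awk pass. This is a sketch only, untested against the real data; it assumes h5dump lines of the exact form (12,0): 53.3243, -1.12341, shown above, and the sample file names are made up:

```shell
# One-pass alternatives to the cut pipelines above.
# Assumes h5dump output lines like "(12,0): 53.3243, -1.12341,".
printf '(12,0): 53.3243, -1.12341,\n(13,0): 48.1000, 11.5000,\n' > out_sample.txt

# CSV without Ids (fields: lat, lon); the NF guard skips empty lines:
awk -F'[(,): ]+' 'NF>=5 {print $4", "$5}' out_sample.txt > sample.csv

# SSV with Ids for the mraad implementation (fields: id lat lon):
awk -F'[(,): ]+' 'NF>=5 {print $2" "$4" "$5}' out_sample.txt > sample_withIds.ssv

# SSV without Ids for the Dianwei Han implementation:
awk -F'[(,): ]+' 'NF>=5 {print $4" "$5}' out_sample.txt > sampleNoIds.ssv
```

Unlike the cut chains, this needs no intermediate files and no separate tail step for the empty first line.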
In some cases, the sbt build file did not match the available JVM\/Scala\/library versions, so minor adjustments were needed.<\/p>\n<h4>Find out optimal number of threads (cores) per executor for RDD-DBSCAN<\/h4>\n<p>Typically, 5 cores per executor is recommended to avoid problems with parallel HDFS access by the same JVM; here we test 3, 4, 6 and more. Note that we have 39*24 = 936 cores, but a few cores should be left for the master and other services.<\/p>\n<p>912 cores, 3 cores per executor process:<br \/>\nspark-submit  --driver-memory 2g --driver-cores 3 --num-executors 304 --executor-cores 3 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn \/home\/helmut\/ScalaProjects\/H5SparkRddDbscan\/target\/scala-2.10\/H5SparkRddDbscan-assembly-0.1.jar Twitter\/twitterSmall.csv twitterSmall.out_with_912_partitions 25000 0.01 40 : 12mins, 55sec with 912 partitions enforced in sc.textFile(src) and no repartition to a single file when saving the result output (repartitioning to a single file would crash the process that has to do this \"reduce\" step).<\/p>\n<p>932 cores, 4 cores per executor process:<br \/>\nspark-submit  --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn \/home\/helmut\/ScalaProjects\/H5SparkRddDbscan\/target\/scala-2.10\/H5SparkRddDbscan-assembly-0.1.jar Twitter\/twitterSmall.csv twitterSmall.out_with_912_partitions_233executors_4coreseach 25000 0.01 40 : 11mins, 3sec with 912 partitions enforced in sc.textFile(src) and no repartition to a single file when saving the result output.<\/p>\n<p>912 cores, 6 cores per executor process:<br \/>\nspark-submit  --driver-memory 2g --driver-cores 1 --num-executors 152 --executor-cores 6 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn \/home\/helmut\/ScalaProjects\/H5SparkRddDbscan\/target\/scala-2.10\/H5SparkRddDbscan-assembly-0.1.jar Twitter\/twitterSmall.csv twitterSmall.out_with_912_partitions_233executors_4coreseach 25000 0.01 40 : 11mins, 38sec<br 
\/>\n928 cores = 116 executors * 8 cores per executor process:<br \/>\nspark-submit  --driver-memory 2g --driver-cores 1 --num-executors 116 --executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn \/home\/helmut\/ScalaProjects\/H5SparkRddDbscan\/target\/scala-2.10\/H5SparkRddDbscan-assembly-0.1.jar Twitter\/twitterSmall.csv twitterSmall.out_with_912_partitions_116executors_8coreseach 25000 912 0.01 40 : 10mins, 26sec<br \/>\n912 cores, 76 executors * 12 cores per executor process:<br \/>\nspark-submit  --driver-memory 2g --driver-cores 1 --num-executors 76 --executor-cores 12 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn \/home\/helmut\/ScalaProjects\/H5SparkRddDbscan\/target\/scala-2.10\/H5SparkRddDbscan-assembly-0.1.jar Twitter\/twitterSmall.csv twitterSmall.out_with_912_partitions_76executors_12coreseach 25000 912 0.01 40 : 12mins, 4sec<br \/>\n924 cores = 42 executors * 22 cores per executor process:<br \/>\nspark-submit  --driver-memory 2g --driver-cores 1 --num-executors 42 --executor-cores 22 --executor-memory 10gb --class SampleRddDbscanCsv --master yarn \/home\/helmut\/ScalaProjects\/H5SparkRddDbscan\/target\/scala-2.10\/H5SparkRddDbscan-assembly-0.1.jar Twitter\/twitterSmall.csv twitterSmall.out_with_912_partitions_42executors_22coreseach  25000 912 0.01 40 : 13mins, 58sec <\/p>\n<h4>Trying to find out optimal number of partitions for RDD-DBSCAN<\/h4>\n<p>spark-submit  --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn \/home\/helmut\/ScalaProjects\/H5SparkRddDbscan\/target\/scala-2.10\/H5SparkRddDbscan-assembly-0.1.jar Twitter\/twitterSmall.csv twitterSmall.out_with_912_partitions_233executors_4coreseach 25000 0.01 40 : 11mins, 3sec with 912 partitions enforced in sc.textFile(src) and no repartition to a single file when saving result output.<br \/>\nspark-submit  --driver-memory 2g --driver-cores 1 --num-executors 233 
--executor-cores 4 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn \/home\/helmut\/ScalaProjects\/H5SparkRddDbscan\/target\/scala-2.10\/H5SparkRddDbscan-assembly-0.1.jar Twitter\/twitterSmall.csv twitterSmall.out_with_456_partitions_233executors_4coreseach 25000 0.01 40 : 12mins, 16sec<br \/>\nspark-submit  --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn \/home\/helmut\/ScalaProjects\/H5SparkRddDbscan\/target\/scala-2.10\/H5SparkRddDbscan-assembly-0.1.jar Twitter\/twitterSmall.csv twitterSmall.out_with_228_partitions_233executors_4coreseach 25000 0.01 40 : 11mins, 17sec<br \/>\nspark-submit  --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn \/home\/helmut\/ScalaProjects\/H5SparkRddDbscan\/target\/scala-2.10\/H5SparkRddDbscan-assembly-0.1.jar Twitter\/twitterSmall.csv twitterSmall.out_with_114_partitions_233executors_4coreseach 25000 0.01 40 : 11mins, 57sec<br \/>\nspark-submit  --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn \/home\/helmut\/ScalaProjects\/H5SparkRddDbscan\/target\/scala-2.10\/H5SparkRddDbscan-assembly-0.1.jar Twitter\/twitterSmall.csv twitterSmall.out_with_57_partitions_233executors_4coreseach 25000 0.01 40 : 10mins, 22sec<br \/>\nspark-submit  --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn \/home\/helmut\/ScalaProjects\/H5SparkRddDbscan\/target\/scala-2.10\/H5SparkRddDbscan-assembly-0.1.jar Twitter\/twitterSmall.csv twitterSmall.out_with_28_partitions_233executors_4coreseach 25000 0.01 40 : 11mins, 29sec<br \/>\nspark-submit  --driver-memory 2g --driver-cores 1 --num-executors 10 --executor-cores 6 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn 
\/home\/helmut\/ScalaProjects\/H5SparkRddDbscan\/target\/scala-2.10\/H5SparkRddDbscan-assembly-0.1.jar Twitter\/twitterSmall.csv twitterSmall.out_with_57_partitions_10executors_6coreseach 25000 0.01 40 : 11mins, 49sec<br \/>\nspark-submit  --driver-memory 2g --driver-cores 1 --num-executors 2 --executor-cores 6 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn \/home\/helmut\/ScalaProjects\/H5SparkRddDbscan\/target\/scala-2.10\/H5SparkRddDbscan-assembly-0.1.jar Twitter\/twitterSmall.csv twitterSmall.out_with_12_partitions_2executors_6coreseach  25000 0.01 40 : 21mins, 20sec<br \/>\nspark-submit  --driver-memory 2g --driver-cores 1 --num-executors 4 --executor-cores 6 --executor-memory 5gb --class SampleRddDbscanCsv --master yarn \/home\/helmut\/ScalaProjects\/H5SparkRddDbscan\/target\/scala-2.10\/H5SparkRddDbscan-assembly-0.1.jar Twitter\/twitterSmall.csv twitterSmall.out_with_24_partitions_24executors_6coreseach  25000 24 0.01 40 : 17mins, 11sec<br \/>\nNote that there is an imbalance in one of the stages (stage 3 at line 127): while the median task takes 22s, the longest task takes 3.5 minutes! With a smaller partition size this may become less imbalanced and thus slightly faster (or is the gain rather due to less overhead?).<\/p>\n<h4>Experiment with partition size (MaxPoints) parameter<\/h4>\n<p>Investigate whether this results in a different amount of noise being filtered out!<\/p>\n<p>TwitterSmall: 3 704 351 points.<br \/>\nThe lower bound of the partition size achievable with the given eps is 25 000 (i.e. each partition has fewer than 25 000 points, the majority much fewer, maybe about 9 000).  3 704 351 \/ 25 000 means that at least 148 partitions are created; with a partition size of 9 000: 3 704 351 \/ 9 000 = more than 400 partitions.<br \/>\nOr: if we want to have 912 partitions (=slightly fewer than the available cores): 3 704 351 \/ 912 = 4061 points per partition would be best! TODO: try this value! 
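The points-per-partition arithmetic above as a quick script (the numbers are taken from the text; the variable names are made up):

```shell
# Candidate MaxPoints: dataset size divided by the desired partition count.
points=3704351      # twitterSmall
partitions=912      # slightly fewer than the 936 available cores
echo $(( points / partitions ))   # prints 4061
```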
(Note: if the resulting partitioning rectangle reaches size 2*eps, no smaller partitioning is possible; in this case, the rectangle will contain more points than specified.)<br \/>\nMaxPoints=4061 (a couple of \"cannot split\" messages occurred):<br \/>\nspark-submit  --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 6gb --class SampleRddDbscanCsv --master yarn \/home\/helmut\/ScalaProjects\/H5SparkRddDbscan\/target\/scala-2.10\/H5SparkRddDbscan-assembly-0.1.jar Twitter\/twitterSmall.csv twitterSmall.out_with_912_partitions_233executors_4coreseach_4061_epsmaxPoints 4061 912 0.01 40 : 19mins, 17sec<br \/>\nMaxPoints=9000 (a couple of \"cannot split\" messages occurred):<br \/>\nspark-submit  --driver-memory 2g --driver-cores 1 --num-executors 116 --executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn \/home\/helmut\/ScalaProjects\/H5SparkRddDbscan\/target\/scala-2.10\/H5SparkRddDbscan-assembly-0.1.jar Twitter\/twitterSmall.csv twitterSmall.out_with_912_partitions_116executors_8coreseach_9000maxPoints 9000 912 0.01 40 : 13mins, 43sec<br \/>\nMaxPoints=20000 (Can't split: (DBSCANRectangle(51.4999988488853,-0.13999999687075615,51.51999884843826,-0.11999999731779099) -&gt; 21664) (maxSize: 20000)):<br \/>\nspark-submit  --driver-memory 2g --driver-cores 1 --num-executors 116 --executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn \/home\/helmut\/ScalaProjects\/H5SparkRddDbscan\/target\/scala-2.10\/H5SparkRddDbscan-assembly-0.1.jar Twitter\/twitterSmall.csv twitterSmall.out_with_912_partitions_116executors_8coreseach_20000maxPoints 20000 912 0.01 40 : 14mins, 27sec<br \/>\nMaxPoints=25000:<br \/>\nspark-submit  --driver-memory 2g --driver-cores 1 --num-executors 116 --executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn \/home\/helmut\/ScalaProjects\/H5SparkRddDbscan\/target\/scala-2.10\/H5SparkRddDbscan-assembly-0.1.jar Twitter\/twitterSmall.csv 
twitterSmall.out_with_912_partitions_116executors_8coreseach_25000maxPoints 25000 912 0.01 40 : 11mins, 15sec<br \/>\nMaxPoints=30000:<br \/>\nspark-submit  --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn \/home\/helmut\/ScalaProjects\/H5SparkRddDbscan\/target\/scala-2.10\/H5SparkRddDbscan-assembly-0.1.jar Twitter\/twitterSmall.csv twitterSmall.out_with_912_partitions_233executors_4coreseach_30000maxPoints 30000 912 0.01 40 : 9mins, 31sec<br \/>\nMaxPoints=32500:<br \/>\nspark-submit  --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn \/home\/helmut\/ScalaProjects\/H5SparkRddDbscan\/target\/scala-2.10\/H5SparkRddDbscan-assembly-0.1.jar Twitter\/twitterSmall.csv twitterSmall.out_with_912_partitions_233executors_4coreseach_32500maxPoints 32500 912 0.01 40 : 10mins, 53sec but second run: 9mins, 6sec third run: 9mins, 13sec, fourth run: 9mins, 20sec<br \/>\nMaxPoints=35000:<br \/>\nspark-submit  --driver-memory 2g --driver-cores 1 --num-executors 233 --executor-cores 4 --executor-memory 6gb --class SampleRddDbscanCsv --master yarn \/home\/helmut\/ScalaProjects\/H5SparkRddDbscan\/target\/scala-2.10\/H5SparkRddDbscan-assembly-0.1.jar Twitter\/twitterSmall.csv twitterSmall.out_with_912_partitions_233executors_4coreseach_35000_epsmaxPoints 35000 912 0.01 40 : 9mins, 42sec<br \/>\nMaxPoints=40000:<br \/>\nspark-submit  --driver-memory 2g --driver-cores 1 --num-executors 116 --executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn \/home\/helmut\/ScalaProjects\/H5SparkRddDbscan\/target\/scala-2.10\/H5SparkRddDbscan-assembly-0.1.jar Twitter\/twitterSmall.csv twitterSmall.out_with_912_partitions_116executors_8coreseach_40000maxPoints 40000 912 0.01 40 : 10mins, 56sec<br \/>\nMaxPoints=50000:<br \/>\nspark-submit  --driver-memory 2g --driver-cores 1 --num-executors 116 
--executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn \/home\/helmut\/ScalaProjects\/H5SparkRddDbscan\/target\/scala-2.10\/H5SparkRddDbscan-assembly-0.1.jar Twitter\/twitterSmall.csv twitterSmall.out_with_912_partitions_116executors_8coreseach_50000maxPoints 50000 912 0.01 40 : 14mins, 6sec <\/p>\n<h4>Scaling test<\/h4>\n<p>Increase the number of nodes or executors!<\/p>\n<p>116 executors with 8 cores each (=928 cores):<br \/>\nspark-submit  --driver-memory 2g --driver-cores 1 --num-executors 116 --executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn \/home\/helmut\/ScalaProjects\/H5SparkRddDbscan\/target\/scala-2.10\/H5SparkRddDbscan-assembly-0.1.jar Twitter\/twitterSmall.csv twitterSmall.out_with_912_partitions_116executors_8coreseach_25000maxPoints 25000 912 0.01 40 : 11mins, 15sec<br \/>\n116 executors with 4 cores each (=464 cores):<br \/>\nspark-submit  --driver-memory 2g --driver-cores 1 --num-executors 116 --executor-cores 4 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn \/home\/helmut\/ScalaProjects\/H5SparkRddDbscan\/target\/scala-2.10\/H5SparkRddDbscan-assembly-0.1.jar Twitter\/twitterSmall.csv twitterSmall.out_with_912_partitions_116executors_4coreseach_25000maxPoints 25000 912 0.01 40 : 10mins, 40sec<br \/>\n58 executors with 8 cores each (=464 cores):<br \/>\nspark-submit  --driver-memory 2g --driver-cores 1 --num-executors 58 --executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn \/home\/helmut\/ScalaProjects\/H5SparkRddDbscan\/target\/scala-2.10\/H5SparkRddDbscan-assembly-0.1.jar Twitter\/twitterSmall.csv twitterSmall.out_with_912_partitions_58executors_8coreseach_25000maxPoints 25000 912 0.01 40 : 10mins, 24sec<br \/>\nWith 464 cores this is not slower than with 928 cores (either too much overhead with 928, or not enough partitioning).<\/p>\n<p>29 executors with 8 cores each (=232 cores):<br \/>\nspark-submit  --driver-memory 2g --driver-cores 1 --num-executors 29 
--executor-cores 8 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn \/home\/helmut\/ScalaProjects\/H5SparkRddDbscan\/target\/scala-2.10\/H5SparkRddDbscan-assembly-0.1.jar Twitter\/twitterSmall.csv twitterSmall.out_with_912_partitions_29executors_8coreseach_25000maxPoints 25000 912 0.01 40 : 11mins, 3sec<br \/>\n29 executors with 4 cores each (=116 cores):<br \/>\nspark-submit  --driver-memory 2g --driver-cores 1 --num-executors 29 --executor-cores 4 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn \/home\/helmut\/ScalaProjects\/H5SparkRddDbscan\/target\/scala-2.10\/H5SparkRddDbscan-assembly-0.1.jar Twitter\/twitterSmall.csv twitterSmall.out_with_912_partitions_29executors_4coreseach_25000maxPoints 25000 912 0.01 40 : 11mins, 47sec<br \/>\n29 executors with 2 cores each (=58 cores):<br \/>\nspark-submit  --driver-memory 2g --driver-cores 1 --num-executors 29 --executor-cores 2 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn \/home\/helmut\/ScalaProjects\/H5SparkRddDbscan\/target\/scala-2.10\/H5SparkRddDbscan-assembly-0.1.jar Twitter\/twitterSmall.csv twitterSmall.out_with_912_partitions_29executors_2coreseach_25000maxPoints 25000 912 0.01 40 : 10mins, 22sec<br \/>\n29 executors with 1 core each (=29 cores):<br \/>\nspark-submit  --driver-memory 2g --driver-cores 1 --num-executors 29 --executor-cores 1 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn \/home\/helmut\/ScalaProjects\/H5SparkRddDbscan\/target\/scala-2.10\/H5SparkRddDbscan-assembly-0.1.jar Twitter\/twitterSmall.csv twitterSmall.out_with_912_partitions_29executors_1coreseach_25000maxPoints 25000 912 0.01 40 : (13mins, 33sec; is this value from the 14 executors run?)<br \/>\n14 executors with 1 core each (=14 cores):<br \/>\nspark-submit  --driver-memory 2g --driver-cores 1 --num-executors 14 --executor-cores 1 --executor-memory 8gb --class SampleRddDbscanCsv --master yarn 
\/home\/helmut\/ScalaProjects\/H5SparkRddDbscan\/target\/scala-2.10\/H5SparkRddDbscan-assembly-0.1.jar Twitter\/twitterSmall.csv twitterSmall.out_with_912_partitions_14executors_1coreseach_25000maxPoints 25000 912 0.01 40 : (13mins, 33sec; is this value from the 29 executors run?) 12mins, 45sec<br \/>\n32 executors with 1 core each (=32 cores):<br \/>\nspark-submit  --driver-memory 2g --driver-cores 1 --num-executors 32 --executor-cores 1 --executor-memory 50g --class SampleRddDbscanCsv --master yarn \/home\/helmut\/ScalaProjects\/H5SparkRddDbscan\/target\/scala-2.10\/H5SparkRddDbscan-assembly-0.1.jar Twitter\/twitterSmall.csv twitterSmall.out_with_912_partitions_32executors_1coreseach_25000maxPoints 25000 912 0.01 40 : 13mins, 52sec<br \/>\n16 executors with 1 core each (=16 cores):<br \/>\nspark-submit  --driver-memory 2g --driver-cores 1 --num-executors 16 --executor-cores 1 --executor-memory 50g --class SampleRddDbscanCsv --master yarn \/home\/helmut\/ScalaProjects\/H5SparkRddDbscan\/target\/scala-2.10\/H5SparkRddDbscan-assembly-0.1.jar Twitter\/twitterSmall.csv twitterSmall.out_with_912_partitions_16executors_1coreseach_25000maxPoints 25000 912 0.01 40 : 16mins, 57sec and 14mins, 49sec and 14mins, 49sec <\/p>\n<p>8 executors with 1 core each (=8 cores):<br \/>\nspark-submit  --driver-memory 2g --driver-cores 1 --num-executors 8 --executor-cores 1 --executor-memory 50g --class SampleRddDbscanCsv --master yarn \/home\/helmut\/ScalaProjects\/H5SparkRddDbscan\/target\/scala-2.10\/H5SparkRddDbscan-assembly-0.1.jar Twitter\/twitterSmall.csv twitterSmall.out_with_912_partitions_8executors_1coreseach_25000maxPoints 25000 912 0.01 40 : 20mins, 19sec <\/p>\n<p>4 executors with 1 core each (=4 cores):<br \/>\nspark-submit  --driver-memory 2g --driver-cores 1 --num-executors 4 --executor-cores 1 --executor-memory 50g --class SampleRddDbscanCsv --master yarn \/home\/helmut\/ScalaProjects\/H5SparkRddDbscan\/target\/scala-2.10\/H5SparkRddDbscan-assembly-0.1.jar 
Twitter\/twitterSmall.csv twitterSmall.out_with_912_partitions_4executors_1coreseach_25000maxPoints 25000 912 0.01 40 : 33mins, 14sec<br \/>\n2 executors with 1 core each (=2 cores):<br \/>\nspark-submit  --driver-memory 2g --driver-cores 1 --num-executors 2 --executor-cores 1 --executor-memory 50g --class SampleRddDbscanCsv --master yarn \/home\/helmut\/ScalaProjects\/H5SparkRddDbscan\/target\/scala-2.10\/H5SparkRddDbscan-assembly-0.1.jar Twitter\/twitterSmall.csv twitterSmall.out_with_912_partitions_2executors_1coreseach_25000maxPoints 25000 912 0.01 40 : 58mins, 41sec <\/p>\n<p>1 executor with 1 core (=1 core):<br \/>\nspark-submit  --driver-memory 2g --driver-cores 1 --num-executors 1 --executor-cores 1 --executor-memory 50g --class SampleRddDbscanCsv --master yarn \/home\/helmut\/ScalaProjects\/H5SparkRddDbscan\/target\/scala-2.10\/H5SparkRddDbscan-assembly-0.1.jar Twitter\/twitterSmall.csv twitterSmall.out_with_912_partitions_1executors_1coreseach_25000maxPoints 25000 912 0.01 40 : 2hrs, 1mins, 51sec<br \/>\nNote: due to the skewed data, single long-running tasks delay the whole job. Hence, even when there are fewer executor threads\/cores than tasks and tasks have to queue, the queued execution is still largely hidden behind the long-running tasks caused by the skew.<\/p>\n<h4>Twitter big<\/h4>\n<p>spark-submit --conf \"spark.akka.timeout=300s\" --conf \"spark.network.timeout=300s\" --conf \"spark.akka.frameSize=2000\" --driver-memory 30g --num-executors 38 --executor-cores 1 --executor-memory 40g --conf \"yarn.nodemanager.resource.cpu-vcores=1\"  --conf \"spark.yarn.executor.memoryOverhead=2000\" --conf \"spark.driver.cores=20\" --conf \"spark.driver.maxResultSize=0\" --class SampleRddDbscanCsv --master yarn \/home\/helmut\/ScalaProjects\/H5SparkRddDbscan\/target\/scala-2.10\/H5SparkRddDbscan-assembly-0.1.jar Twitter\/twitterBig.csv twitterBig.out_with_456_partitions_38executors_1coreseach 25000 456 0.01 40 : 1hrs, 28mins, 55sec 
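From the scaling test above, the speedup of the largest configuration over the single-core run can be computed directly from the recorded wall-clock times (a quick sketch; times copied from the runs above):

```shell
# Speedup of the 928-core run over the 1-core run (twitterSmall, MaxPoints=25000).
t_1core=$(( 2*3600 + 1*60 + 51 ))    # 2hrs, 1mins, 51sec in seconds
t_928cores=$(( 11*60 + 15 ))         # 11mins, 15sec in seconds
echo "speedup on 928 cores: $(( t_1core / t_928cores ))x"
```

Only about a 10x speedup on 928 cores, which is consistent with the skew and overhead observations above.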
<\/p>\n<h4>DBSCAN on Spark (https:\/\/github.com\/mraad\/dbscan-spark)<\/h4>\n<p>By default, input is space-separated and only a single-space separator is supported (cut -d ' ' -f 1,4,6 twitterSmall.ssv &gt; twitterSmall.ssv_no_extra_spaces removes the extra spaces in the Twitter file).<br \/>\nUsing a property file (if run on JUDGE, the paths refer to HDFS paths!):<br \/>\ninput.path=Twitter\/twitterSmall.ssv<br \/>\noutput.path=twitterSmall.out_mraad<br \/>\ndbscan.eps=0.01<br \/>\ndbscan.min.points=40<br \/>\ndbscan.num.partitions=912<\/p>\n<p>To avoid needing to edit the properties files, we always output to the same directory and then rename after the run: hdfs dfs -mv twitterSmall.out_mraad twitterSmall.out_mraad116executors8cores<\/p>\n<p>spark-submit --driver-memory 20g --num-executors 116 --executor-cores 8 --executor-memory 12g --master yarn \/home\/helmut\/ScalaProjects\/mraad_dbscan-spark\/target\/scala-2.10\/dbscan-spark-assembly-0.1.jar \/home\/helmut\/DataIn\/twitterSmall.properties_hdfs_noprefix : 3mins, 28sec<br \/>\nspark-submit --driver-memory 20g --num-executors 58 --executor-cores 8 --executor-memory 12g --master yarn \/home\/helmut\/ScalaProjects\/mraad_dbscan-spark\/target\/scala-2.10\/dbscan-spark-assembly-0.1.jar \/home\/helmut\/DataIn\/twitterSmall.properties_hdfs_noprefix : 3mins, 3sec<br \/>\nspark-submit --driver-memory 20g --num-executors 29 --executor-cores 8 --executor-memory 12g --master yarn \/home\/helmut\/ScalaProjects\/mraad_dbscan-spark\/target\/scala-2.10\/dbscan-spark-assembly-0.1.jar \/home\/helmut\/DataIn\/twitterSmall.properties_hdfs_noprefix : 2mins, 53sec<br \/>\nspark-submit --driver-memory 20g --num-executors 29 --executor-cores 4 --executor-memory 12g --master yarn \/home\/helmut\/ScalaProjects\/mraad_dbscan-spark\/target\/scala-2.10\/dbscan-spark-assembly-0.1.jar \/home\/helmut\/DataIn\/twitterSmall.properties_hdfs_noprefix : 2mins, 47sec<br \/>\nspark-submit --driver-memory 20g --num-executors 29 --executor-cores 2 --executor-memory 12g 
--master yarn \/home\/helmut\/ScalaProjects\/mraad_dbscan-spark\/target\/scala-2.10\/dbscan-spark-assembly-0.1.jar \/home\/helmut\/DataIn\/twitterSmall.properties_hdfs_noprefix : 2mins, 49sec<br \/>\nspark-submit --driver-memory 20g --num-executors 32 --executor-cores 1 --executor-memory 12g --master yarn \/home\/helmut\/ScalaProjects\/mraad_dbscan-spark\/target\/scala-2.10\/dbscan-spark-assembly-0.1.jar \/home\/helmut\/DataIn\/twitterSmall.properties_hdfs_noprefix : 2mins, 27sec<br \/>\nspark-submit --driver-memory 20g --num-executors 29 --executor-cores 1 --executor-memory 12g --master yarn \/home\/helmut\/ScalaProjects\/mraad_dbscan-spark\/target\/scala-2.10\/dbscan-spark-assembly-0.1.jar \/home\/helmut\/DataIn\/twitterSmall.properties_hdfs_noprefix : 3mins, 3sec<br \/>\nspark-submit --driver-memory 20g --num-executors 16 --executor-cores 1 --executor-memory 12g --master yarn \/home\/helmut\/ScalaProjects\/mraad_dbscan-spark\/target\/scala-2.10\/dbscan-spark-assembly-0.1.jar \/home\/helmut\/DataIn\/twitterSmall.properties_hdfs_noprefix : 2mins, 30sec<br \/>\nspark-submit --driver-memory 20g --num-executors 14 --executor-cores 1 --executor-memory 12g --master yarn \/home\/helmut\/ScalaProjects\/mraad_dbscan-spark\/target\/scala-2.10\/dbscan-spark-assembly-0.1.jar \/home\/helmut\/DataIn\/twitterSmall.properties_hdfs_noprefix : 3mins, 20sec<br \/>\nspark-submit --driver-memory 20g --num-executors 8 --executor-cores 1 --executor-memory 12g --master yarn \/home\/helmut\/ScalaProjects\/mraad_dbscan-spark\/target\/scala-2.10\/dbscan-spark-assembly-0.1.jar \/home\/helmut\/DataIn\/twitterSmall.properties_hdfs_noprefix : 2mins, 54sec<br \/>\nspark-submit --driver-memory 20g --num-executors 7 --executor-cores 1 --executor-memory 12g --master yarn \/home\/helmut\/ScalaProjects\/mraad_dbscan-spark\/target\/scala-2.10\/dbscan-spark-assembly-0.1.jar \/home\/helmut\/DataIn\/twitterSmall.properties_hdfs_noprefix : 3mins, 41sec<br \/>\nspark-submit --driver-memory 20g 
--num-executors 4 --executor-cores 1 --executor-memory 12g --master yarn \/home\/helmut\/ScalaProjects\/mraad_dbscan-spark\/target\/scala-2.10\/dbscan-spark-assembly-0.1.jar \/home\/helmut\/DataIn\/twitterSmall.properties_hdfs_noprefix : 5mins, 30sec<br \/>\nspark-submit --driver-memory 20g --num-executors 2 --executor-cores 1 --executor-memory 12g --master yarn \/home\/helmut\/ScalaProjects\/mraad_dbscan-spark\/target\/scala-2.10\/dbscan-spark-assembly-0.1.jar \/home\/helmut\/DataIn\/twitterSmall.properties_hdfs_noprefix : 9mins, 34sec<br \/>\nspark-submit --driver-memory 20g --num-executors 1 --executor-cores 1 --executor-memory 12g --master yarn \/home\/helmut\/ScalaProjects\/mraad_dbscan-spark\/target\/scala-2.10\/dbscan-spark-assembly-0.1.jar \/home\/helmut\/DataIn\/twitterSmall.properties_hdfs_noprefix : 18mins, 25sec <\/p>\n<p>spark-submit --driver-memory 20g --num-executors 29 --executor-cores 2 --executor-memory 12g --master yarn \/home\/helmut\/ScalaProjects\/mraad_dbscan-spark\/target\/scala-2.10\/dbscan-spark-assembly-0.1.jar \/home\/helmut\/DataIn\/twitterSmall.properties_hdfs_noprefix_58partitions (smaller boxes) : 1mins, 38sec<br \/>\nspark-submit --driver-memory 20g --num-executors 29 --executor-cores 2 --executor-memory 12g --master yarn \/home\/helmut\/ScalaProjects\/mraad_dbscan-spark\/target\/scala-2.10\/dbscan-spark-assembly-0.1.jar \/home\/helmut\/DataIn\/twitterSmall.properties_hdfs_noprefix_116partitions0.01cellsize : 1mins, 24sec<br \/>\nspark-submit --driver-memory 20g --num-executors 29 --executor-cores 4 --executor-memory 12g --master yarn \/home\/helmut\/ScalaProjects\/mraad_dbscan-spark\/target\/scala-2.10\/dbscan-spark-assembly-0.1.jar \/home\/helmut\/DataIn\/twitterSmall.properties_hdfs_noprefix_116partitions0.01cellsize : 1mins, 20sec and 1mins, 37sec<\/p>\n<h4>TwitterBig runs<\/h4>\n<p>These seem to need a lot of RAM: use only 1 core on each worker so that this core can use the full RAM (otherwise runs abort due to timeouts which typically are due 
to crashes caused by running out of memory)<br \/>\nspark-submit --conf \"spark.akka.timeout=300s\" --conf \"spark.network.timeout=300s\" --conf \"spark.akka.frameSize=2000\" --driver-memory 30g --num-executors 38 --executor-cores 1 --executor-memory 30g --conf \"yarn.nodemanager.resource.cpu-vcores=1\"  --conf \"spark.yarn.executor.memoryOverhead=12000\" --conf \"spark.driver.cores=20\" --conf \"spark.driver.maxResultSize=0\"  --master yarn \/home\/helmut\/ScalaProjects\/mraad_dbscan-spark\/target\/scala-2.10\/dbscan-spark-assembly-0.1.jar \/home\/helmut\/DataIn\/twitterBig.properties_hdfs_noprefix_14592partitions0.01cellsize : 1hrs, 3mins, 58sec<br \/>\nspark-submit --conf \"spark.akka.timeout=300s\" --conf \"spark.network.timeout=300s\" --conf \"spark.akka.frameSize=2000\" --driver-memory 30g --num-executors 38 --executor-cores 1 --executor-memory 30g --conf \"yarn.nodemanager.resource.cpu-vcores=1\"  --conf \"spark.yarn.executor.memoryOverhead=12000\" --conf \"spark.driver.cores=20\" --conf \"spark.driver.maxResultSize=0\"  --master yarn \/home\/helmut\/ScalaProjects\/mraad_dbscan-spark\/target\/scala-2.10\/dbscan-spark-assembly-0.1.jar \/home\/helmut\/DataIn\/twitterBig.properties_hdfs_noprefix_1824partitions0.01cellsize : 24mins, 51sec<\/p>\n<h4>Spark DBSCAN (https:\/\/github.com\/alitouka\/spark_dbscan)<\/h4>\n<p>Uses CSV <\/p>\n<p>The source code needed to be changed slightly (to avoid passing a master URL consisting just of \"YARN\", and the subsequent complaints). 
In addition, build.sbt referenced version numbers that were not available:<br \/>\nFurthermore, 912 partitions were hardcoded in its IOHelper class where the initial file is read.<br \/>\n&lt; scalaVersion := &quot;2.10.6&quot;<br \/>\n&lt;<br \/>\n&lt; val sparkVersion = &quot;1.6.1&quot;<br \/>\n&lt;<br \/>\n&lt; \/\/ Added % &quot;provided&quot; so that it does not get included in the assembly jar<br \/>\n&lt; libraryDependencies += &quot;org.apache.spark&quot; %% &quot;spark-core&quot; % sparkVersion % &quot;provided&quot;<br \/>\n&lt;<br \/>\n&lt;<br \/>\n&lt; \/\/ Added % &quot;provided&quot; so that it does not get included in the assembly jar<br \/>\n&lt; \/\/ Elsewhere &quot;spark-mllib_2.10&quot; is used (the 2.10 might refer to Scala 2.10?)<br \/>\n&lt; libraryDependencies += &quot;org.apache.spark&quot; %% &quot;spark-mllib&quot; % sparkVersion % &quot;provided&quot;<br \/>\n&lt;<br \/>\n&lt; \/\/ ScalaTest used by the tests provided by Spark MLlib<br \/>\n&lt; libraryDependencies += &quot;org.scalatest&quot; %% &quot;scalatest&quot; % &quot;2.2.6&quot; % &quot;test&quot;<br \/>\n libraryDependencies += \"org.apache.spark\" % \"spark-core_2.10\" % \"1.1.0\" % \"provided\"<br \/>\n22a9<br \/>\n&gt; libraryDependencies += \"org.scalatest\" % \"scalatest_2.10\" % \"2.1.3\" % \"test\"<\/p>\n<p>spark-submit --num-executors 116 --executor-cores 8 --executor-memory 12gb --class org.alitouka.spark.dbscan.DbscanDriver --master yarn \/home\/helmut\/ScalaProjects\/spark_dbscan\/spark_dbscan-assembly-0.0.4.jar --ds-master yarn --ds-jar \/home\/helmut\/ScalaProjects\/spark_dbscan\/spark_dbscan-assembly-0.0.4.jar --ds-input Twitter\/twitterSmall.csv  --ds-output twitterSmall_alitouka912initialpartitions4061npp --npp 4061 --eps 0.01 --numPts 40 : 53mins, 6sec <\/p>\n<p>spark-submit --num-executors 116 --executor-cores 8 --executor-memory 12gb --class org.alitouka.spark.dbscan.DbscanDriver --master yarn 
\/home\/helmut\/ScalaProjects\/spark_dbscan\/spark_dbscan-assembly-0.0.4.jar --ds-master yarn --ds-jar \/home\/helmut\/ScalaProjects\/spark_dbscan\/spark_dbscan-assembly-0.0.4.jar --ds-input Twitter\/twitterSmall.csv  --ds-output twitterSmall_alitouka912initialpartitions25000npp --npp 25000 --eps 0.01 --numPts 40 : 40mins, 10sec<\/p>\n<h4>twitterBig<\/h4>\n<p>spark-submit --conf \"spark.akka.timeout=300s\" --conf \"spark.network.timeout=300s\" --conf \"spark.akka.frameSize=2000\" --driver-memory 30g --num-executors 38 --executor-cores 1 --executor-memory 40g --conf \"yarn.nodemanager.resource.cpu-vcores=1\"  --conf \"spark.yarn.executor.memoryOverhead=2000\" --conf \"spark.driver.cores=20\" --conf \"spark.driver.maxResultSize=0\" --class org.alitouka.spark.dbscan.DbscanDriver --master yarn \/home\/helmut\/ScalaProjects\/spark_dbscan\/spark_dbscan-assembly-0.0.4.jar --ds-master yarn --ds-jar \/home\/helmut\/ScalaProjects\/spark_dbscan\/spark_dbscan-assembly-0.0.4.jar --ds-input Twitter\/twitterBig.csv  --ds-output twitterBig_alitouka912initialpartitions25000npp --npp 25000 --eps 0.01 --numPts 40<br \/>\nfails with<br \/>\n java.lang.Exception: Box for point Point at (51.382, -2.3846); id = 6618196; box = 706; cluster = -2; neighbors = 0 was not found<\/p>\n<h3>ELKI<\/h3>\n<p>time java -Xmx35G -jar elki-bundle-0.7.1.jar KDDCLIApplication -dbc.in \/home\/helmut\/DataIn\/twitterSmall.csv -db.index \"tree.metrical.covertree.SimplifiedCoverTree$Factory\" -covertree.distancefunction minkowski.EuclideanDistanceFunction -algorithm clustering.DBSCAN -dbscan.epsilon 0.01 -dbscan.minpts 40 &gt; twitterSmall.out_elki<\/p>\n<p>time java -Xmx73G -jar elki-bundle-0.7.1.jar KDDCLIApplication -dbc.in \/home\/helmut\/DataIn\/twitterBig.csv -db.index \"tree.metrical.covertree.SimplifiedCoverTree$Factory\" -covertree.distancefunction minkowski.EuclideanDistanceFunction -algorithm clustering.DBSCAN -dbscan.epsilon 0.01 -dbscan.minpts 40 &gt; 
twitterBig.out_elki<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This post is used to document some DBSCAN command line parameters used in a DBSCAN implementation evaluation. Once a paper will be published referencing it, it will go to EUDAT B2SHARE and get thus a persistent handle. Conversion of HDF5 file into CSV\/SSV h5dump -d \"\/DBSCAN\" -o out.txt twitterSmall.h5 Yields lines (12,0): 53.3243, -1.12341, Remove [&hellip;]<\/p>\n","protected":false},"author":512,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[79],"tags":[],"class_list":["post-770","post","type-post","status-publish","format-standard","hentry","category-research"],"_links":{"self":[{"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/posts\/770","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/users\/512"}],"replies":[{"embeddable":true,"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/comments?post=770"}],"version-history":[{"count":18,"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/posts\/770\/revisions"}],"predecessor-version":[{"id":773,"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/posts\/770\/revisions\/773"}],"wp:attachment":[{"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/media?parent=770"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/categories?post=770"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/uni.hi.is\/helmut\/wp-json\/wp\/v2\/tags?post=770"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}