Analysis of Particle Physics Data Using the MapReduce Paradigm (2012-2013)
(Project finished.)
Huge scientific data sets, such as the petabytes of data generated by the Large Hadron Collider (LHC) experiments at CERN, are nowadays analysed on Grid computing infrastructures using a hierarchic filtering approach to reduce the amount of data. In practice, this means that an individual scientist has no access to the underlying raw data; furthermore, the accessible data is often outdated, as filtering and distribution of data only take place every few months. The advent of Cloud computing promises to make a “private computing Grid” available to everyone via the Internet. Together with Google’s MapReduce divide-and-conquer paradigm for the efficient processing of data-intensive applications, the Cloud may be a viable alternative for a scientist to perform analyses of huge scientific data sets.
This project aims at investigating the applicability of the MapReduce paradigm, in particular as implemented by Apache Hadoop, for analysing LHC data, and at investigating the feasibility of Cloud computing, e.g. as provided by the Amazon Cloud, for this task instead of Grid computing. Even though it will not be possible to assess the applicability to petabyte-sized problems, this project shall serve as a case study to establish whether this approach is applicable at all.
Main objectives
This project aims at investigating the applicability of an Apache Hadoop-based MapReduce approach for the analysis of Particle Physics data. This includes:
- Adapt some of the existing LHC data analyses to the MapReduce paradigm.
- Implement the adapted analyses so that they can be used with Apache Hadoop.
- Investigate the applicability of the approach by setting up and running the analyses in the Amazon EC2 public Cloud using Amazon Elastic MapReduce.
State of the art
State of the art in analysis of particle physics data
The data units recorded by a particle physics experiment are so-called events. At the Large Hadron Collider (LHC), every event contains the information from a collision between two proton bunches. The quantities of interest in a typical particle physics analysis are probabilities or probability density functions for a certain process or configuration to occur. Numerical estimates are obtained by means of histograms (e.g. the momentum distribution within a particle decay), i.e. simple counters for how often a certain condition is observed.
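As a minimal illustration of this counter view (using invented momentum values and an arbitrary bin width, not data from an actual LHC analysis), a histogram can be filled sequentially as follows:

    import java.util.Map;
    import java.util.TreeMap;

    public class MomentumHistogram {
        public static void main(String[] args) {
            // Invented momentum values in GeV/c, purely for illustration.
            double[] momenta = {1.2, 3.7, 2.9, 3.1, 0.4, 2.2};
            double binWidth = 1.0; // one bin per GeV/c

            // A histogram is simply a map from bin index to a counter.
            Map<Integer, Long> histogram = new TreeMap<Integer, Long>();
            for (double p : momenta) {
                int bin = (int) Math.floor(p / binWidth);
                Long count = histogram.get(bin);
                histogram.put(bin, count == null ? 1L : count + 1L);
            }

            // Each counter records how often a momentum fell into its bin.
            for (Map.Entry<Integer, Long> entry : histogram.entrySet()) {
                double low = entry.getKey() * binWidth;
                System.out.printf("[%.1f, %.1f) GeV/c: %d%n",
                        low, low + binWidth, entry.getValue());
            }
        }
    }

Distributing exactly this kind of counting over many machines is what the MapReduce approach described below provides.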
The standard software framework for statistical analysis and visualisation in high energy physics is called ROOT [Antcheva 2009, ROOT website 2010]. Based on ROOT, several higher-level analysis frameworks have been implemented for the analysis of LHC data, e.g. Gaudi [Gaudi Website 2010] and DaVinci [DaVinci Website 2010]. A scientist who wants to perform a specific analysis thus writes C++ or Python code that makes use of the functionality provided by, e.g., the Gaudi or DaVinci framework.
The LHC experiments are expected to generate up to 15 petabytes of data each year. Hence, the analyses described above cannot be performed on local computing clusters. Instead, the Worldwide LHC Computing Grid (WLCG) is currently used for processing the data of the LHC experiments [Lamanna 2004]. It follows a hierarchic tier model: the Tier-0 centre at CERN is the only place where the original data is available as a whole. To reduce the amount of data, a filtering (“stripping”) takes place, and this filtered data is distributed to several Tier-1 centres: large national computer centres with sufficient storage capacity. In addition to serving as a distributed back-up for Tier-0, the Tier-1 centres make the filtered data available to Tier-2 centres, each consisting of one or several collaborating computing facilities, which can store sufficient data and provide adequate computing power for specific analysis tasks. Individual scientists access these facilities through Tier-3 computing resources, which consist of local clusters in a research department. To perform the analyses described above, the executable code of the implemented analyses is run on those nodes in the Grid where the data to be analysed is available. While this approach works, the hierarchic approach with its initial filtering (“stripping”) of data has to be considered a major drawback: at the LHC, scientists are searching for new, yet unknown particles or events. When the data is filtered before it is passed on to the scientists, there is a danger that interesting data is simply filtered out and never subjected to analysis.
State of the art in Cloud computing
Cloud computing is a new paradigm for affordable, scalable computing. Instead of having to invest in their own hardware, users can employ external computing resources offered by a Cloud computing provider in a fine-granular, pay-per-use manner: based on virtualisation technology, a cloud user can configure her own computing cluster using “elastic” resources, i.e. machines can be acquired or released on-the-fly [Armbrust et al. 2009, Buyya 2009]. One may distinguish between ‘public clouds’, in which a ‘cloud provider’ offers hardware as a utility for ‘cloud users’ to build applications on, and ‘private clouds’, in which an organization delivers services internally based on Cloud computing technology. As long as the same underlying Cloud technology is used, even a “hybrid cloud” approach is possible: as long as the private local cluster provides sufficient computing power, the private cloud is used; as soon as more computing power is needed, the resources of a public cloud provider are used in addition. Due to the underlying virtualisation of hardware, a cloud user cannot and need not distinguish whether machines from the private cloud or from a public cloud are being used. An example of a public cloud provider and cloud middleware is Amazon’s Elastic Compute Cloud (EC2) service [EC2 Website 2010]: computing hardware that is available at Amazon anyway and might otherwise be idling is utilized by renting it out via Cloud technology. In comparison to the classic approach of using a dedicated cluster that may be idle if no jobs are running, Cloud providers are able to offer computing power much more cheaply through economies of scale, by multiplexing computing time between multiple users to avoid idle time, and by locating the Cloud hardware where electricity is cheap. Finally, Cloud computing allows computing resources to be used without upfront investment: only the computing time used, the data storage and the network traffic are charged. Using 1000 machines for one hour costs the same as using one machine for 1000 hours (e.g. at $0.085 per CPU hour on Amazon EC2, about $85 in either case). This makes Cloud computing an ideal experimental platform when a huge number of machines is needed for a short amount of time.
State of the art in MapReduce
MapReduce is a programming model for processing large (terabyte- to petabyte-sized) data sets [Dean, Ghemawat 2008]. It is based on a classical divide-and-conquer approach, and many real-world problems are expressible in this model, in particular generating histograms. The “Map” step takes the input, divides it into smaller sub-problems, and distributes these to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. Each worker node processes its smaller problem and passes the answer back to its master node. In the “Reduce” step, the answers to all the sub-problems are collected and combined to form the answer to the original problem. The advantage of MapReduce is that it allows for distributed processing of the Map and Reduce operations, and a MapReduce infrastructure automatically parallelizes and executes the Map and Reduce steps on a large number of machines. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. A MapReduce approach has already been used as a case study for Particle Physics analysis [Ekanayake et al. 2008], however not within a Cloud computing environment.
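As a hedged sketch of how such a histogram maps onto the Hadoop MapReduce Java API: the Map step assigns each event to a bin and emits (bin, 1), and the Reduce step sums the counts per bin. The sketch assumes, purely for illustration, that events have been exported to text files with one momentum value per line (actual LHC data is stored in ROOT files, which is precisely the gap this project investigates); all class names are hypothetical.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class HistogramMapReduce {

        // Map step: assign each event (one momentum value per line) to a bin.
        public static class BinMapper
                extends Mapper<LongWritable, Text, IntWritable, LongWritable> {
            private static final double BIN_WIDTH = 1.0; // GeV/c, illustrative
            private final IntWritable bin = new IntWritable();
            private final LongWritable one = new LongWritable(1);

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                double momentum = Double.parseDouble(line.toString().trim());
                bin.set((int) Math.floor(momentum / BIN_WIDTH));
                context.write(bin, one); // emit (bin, 1)
            }
        }

        // Reduce step: sum the per-bin counts produced by all mappers.
        public static class SumReducer
                extends Reducer<IntWritable, LongWritable, IntWritable, LongWritable> {
            @Override
            protected void reduce(IntWritable bin, Iterable<LongWritable> counts,
                    Context context) throws IOException, InterruptedException {
                long total = 0;
                for (LongWritable c : counts) {
                    total += c.get();
                }
                context.write(bin, new LongWritable(total)); // one histogram entry
            }
        }
    }

Because addition is associative, the same reducer class can also act as a combiner, so partial sums are already formed on the mapper nodes before data is shuffled across the network.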
Several implementations of the MapReduce approach exist, e.g. the open-source, Java-based Apache Hadoop software [Apache Hadoop Website 2010], which comes with an underlying distributed file-system implementation. For usage within the EC2 Cloud environment, Amazon provides a pre-configured version of Apache Hadoop called “Amazon Elastic MapReduce” [Amazon Elastic MapReduce Website 2010] that has been adapted to use the Amazon Simple Storage Service (Amazon S3) as its distributed file system.
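A driver for the sketch above could then be pointed at S3 instead of HDFS when the job is submitted to Amazon Elastic MapReduce; the bucket and path names below are placeholders, and the s3n:// URI scheme is the one used by the Hadoop versions of that period.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class HistogramDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "momentum-histogram");
            job.setJarByClass(HistogramDriver.class);
            job.setMapperClass(HistogramMapReduce.BinMapper.class);
            job.setCombinerClass(HistogramMapReduce.SumReducer.class); // local pre-aggregation
            job.setReducerClass(HistogramMapReduce.SumReducer.class);
            job.setOutputKeyClass(IntWritable.class);
            job.setOutputValueClass(LongWritable.class);

            // On Amazon Elastic MapReduce, input and output can reside in S3
            // rather than HDFS (the bucket name is a placeholder).
            FileInputFormat.addInputPath(job, new Path("s3n://my-bucket/events/"));
            FileOutputFormat.setOutputPath(job, new Path("s3n://my-bucket/histogram/"));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }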
Scientific value of using Apache Hadoop in a Cloud computing environment for analyzing Particle Physics data
Physicists are concerned that important information gets lost during the filtering (“stripping”) that is currently performed in the WLCG Grid; instead, they would like to be able to work directly on the full data. Apache Hadoop seems to be a promising candidate for performing these analyses without major software infrastructure development, and using Cloud technology avoids investment in hardware. This project shall serve as a case study of whether a MapReduce approach as implemented by Apache Hadoop is feasible for this kind of analysis. Since the computing resources in the WLCG are used for the analyses that follow the described hierarchical tier approach, a public Cloud provider shall be used in this project to provide the computing and storage resources required for this small-scale proof-of-concept case study. Should this approach prove worthwhile, it is intended to apply to other funding sources in order to apply it on a large scale.
References
Amazon EC2 Website (2011). http://aws.amazon.com/ec2/
Amazon Elastic MapReduce Website (2011). http://aws.amazon.com/elasticmapreduce/
Antcheva et al. (2009). ROOT – A C++ framework for petabyte data storage, statistical analysis and visualisation. Computer Physics Communications 180 (2009) 2499–2512.
Apache Hadoop Website (2011). http://hadoop.apache.org/
Armbrust, Fox et al. (2009). Above the Clouds: A Berkeley View of Cloud Computing. Technical Report No. UCB/EECS-2009-28, EECS Department, University of California, Berkeley.
Buyya, Yeo et al. (2009). Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility. Future Generation Computer Systems 25(6): 599-616.
DaVinci Website (2011). http://lhcb-release-area.web.cern.ch/LHCb-release-area/DOC/davinci/
Dean, Ghemawat (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM 51(1): 107-113.
Ekanayake, Pallickara, Fox (2008). MapReduce for Data Intensive Scientific Analyses. Fourth IEEE International Conference on eScience, IEEE.
Gaudi Website (2011). http://proj-gaudi.web.cern.ch/proj-gaudi/
Lamanna (2004). The LHC computing grid project at CERN. Nucl. Instrum. Meth. A 534 (2004).
ROOT Website (2011). http://root.cern.ch/
Publications
Fabian Glaser, Helmut Neukirchen: Analysing High-Energy Physics Data Using the MapReduce Paradigm in a Cloud Computing Environment. Technical Report VHI-01-2012, Engineering Research Institute, University of Iceland, Reykjavik, Iceland, second, extended edition, July 2012.
Fabian Glaser: A MapReduce Input Format for Analyzing Big High-Energy Physics Data Stored in ROOT Framework Files. MSc thesis (supervised by Helmut Neukirchen), 2013.
The student was invited to present the results at the Max Planck Institute for Nuclear Physics (23.8.2013).
Fabian Glaser, Helmut Neukirchen, Thomas Rings, Jens Grabowski: Using MapReduce for High Energy Physics Data Analysis. Proceedings of the 2013 International Symposium on MapReduce and Big Data Infrastructure (MR.BDI 2013), 3–5 December 2013, Sydney, Australia. IEEE, 2013.