The statement “Complex experiments, such as ALICE, generate huge amounts of data that need to be processed and analyzed” can hardly be called revelatory. But it does trigger an interesting question: “How does this processing and analysis actually work?”. This brings up further questions: Does each physicist have a supercomputer, designed for such "noble tasks", under their desk? Or, on the contrary, is a regular laptop more than enough? Are the computer programs physicists use a scientific secret, or are they publicly available and free? And... what exactly do those programs do, and how do they work?
2. A brief history of data analysis
When the High Energy Physics (HEP) experiments started, physicists did not have an easy task analyzing the recorded data to extract physics results. For some detectors, such as cloud and bubble chambers, films, similar to those used for cinema movies, were employed to record the trajectories of particles as they traversed the sensitive material of the detectors. Scientists sat with rulers, protractors and other geometric tools to analyze such images. The analysis took a lot of time and was possible only when the number of particles (visualized as tracks) was small - they all had to be visible in one image! Fortunately, together with the increasing energy of accelerators came the development of computers. This allowed digitizing the obtained data and automating many of the procedures. This holds not only for high energy physics, but for science in general. We develop the algorithms, and then the computers carry out the data processing much faster (by many orders of magnitude) than we ever could.
Before electronic data analysis, physicists visually examined photographs of Bubble Chamber particle interactions.
3. Data analysis in ALICE
Different physics experiments used to develop their own software for their data analysis. To provide a common basic functionality, an open-source data-analysis framework was developed at CERN: ROOT. It is an object-oriented software package, written in C++, originally targeted at particle physics data analysis. By now, almost all big HEP experiments in the world use ROOT as the basis for the development of their own software. Naturally, many of the actual computations are experiment-specific; for example, the reconstruction process needs to take into account the geometry of our detector. So each experiment needs to use its own specific algorithms.
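To give a flavour of the kind of basic functionality ROOT provides, consider its most everyday object: the histogram, into which physicists "fill" measured values event by event. The following is a toy re-implementation in plain C++ of what ROOT's histogram classes do under the hood (the class name and layout here are illustrative, not ROOT's actual code):

```cpp
#include <vector>

// Toy fixed-range histogram: N equal-width bins over [xlow, xhigh),
// plus underflow/overflow counters - the core idea behind a ROOT
// histogram, stripped of drawing, fitting and I/O.
struct ToyHist {
    double xlow, xhigh;
    std::vector<long> bins;
    long underflow = 0, overflow = 0;

    ToyHist(int nbins, double lo, double hi)
        : xlow(lo), xhigh(hi), bins(nbins, 0) {}

    // Add one measured value to the histogram.
    void Fill(double x) {
        if (x < xlow)   { ++underflow; return; }
        if (x >= xhigh) { ++overflow;  return; }
        int i = static_cast<int>((x - xlow) / (xhigh - xlow) * bins.size());
        ++bins[i];
    }

    // Total number of Fill() calls, including under/overflow.
    long Entries() const {
        long n = underflow + overflow;
        for (long b : bins) n += b;
        return n;
    }
};
```

In an analysis one would fill such a histogram once per particle or per event, then inspect its shape; ROOT adds on top of this the plotting, fitting and storage machinery that every experiment needs.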
In ALICE we developed AliRoot. It contains ALICE-specific libraries which are added on top of standard ROOT. These include software packages for the various tasks of the experiment's data analysis, from simulation and reconstruction to the final extraction of physics results. AliRoot is organized in a modular way - for example, each detector group develops its own software, which is then integrated, as a mostly independent module, into the whole system (that is why in the structure of AliRoot one can see directories like “TPC”, “TOF”, “VZERO”, etc.). The physics analysis packages are organized in a similar way: each Physics Working Group (PWG) has its own directory, where the different Physics Analysis Groups can add their code for data analysis. Each module can be excluded from AliRoot without affecting the core of the package. Of course, every new piece of software added to the system must be checked for correctness and must comply with the strict ALICE computing rules.
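The modular idea described above can be sketched as a registry into which each detector group plugs its own processing step, and from which a module can be removed without touching the core. This is a hypothetical illustration of the pattern, not AliRoot's actual interfaces; the module names are taken from the directory names mentioned above:

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

// A processing step owned by one detector group: it turns raw event
// data into that detector's contribution to the reconstruction.
using Step = std::function<std::string(const std::string&)>;

struct ModuleRegistry {
    std::map<std::string, Step> modules;

    // Each group registers its module under its own name ("TPC", "TOF", ...).
    void Register(const std::string& name, Step s) { modules[name] = s; }

    // A module can be dropped without affecting the core or the others.
    void Unregister(const std::string& name) { modules.erase(name); }

    // Core driver: run every currently registered module on the event.
    std::vector<std::string> RunAll(const std::string& rawEvent) const {
        std::vector<std::string> results;
        for (const auto& [name, step] : modules)
            results.push_back(name + ": " + step(rawEvent));
        return results;
    }
};
```

The core driver never needs to know which detectors exist; that is what lets each group develop and validate its code independently before integration.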
ROOT is a framework for data processing. Every day, thousands of physicists use ROOT applications to analyze their data or to perform simulations.
Some collaborations restrict access to their data analysis software to their members only. In the case of ALICE, AliRoot is publicly available and free for everyone to use (the SVN repository is easily accessible on the Web). Moreover, the policy of the experiment is that all software and procedures developed and used to obtain any published physics result must be incorporated into AliRoot, so that the results are reproducible and can be studied by everyone interested, including people outside the collaboration.
4. PROOF and GRID
As mentioned at the beginning, computers are fast, much faster than humans. They do what we tell them to do, repetitively and usually without errors. However, the amount of data delivered by the LHC is so huge that data storage and data processing have become a formidable challenge. These are the main problems we must cope with in our everyday work. What is the solution? Fast internet connections and parallelism.
The petabytes of data recorded by ALICE cannot be stored in one place. Therefore, we use fast internet connections to distribute them to many computer centers around the world. These data are then made accessible to users everywhere for physics analyses.
Parallel computing can be introduced at different levels. The first level is the multicore processor that every modern computer is equipped with nowadays. At the second level one can employ clusters of computers - computers connected by fast LAN links and running the same environment. For parallel data processing, the ROOT framework provides a package called PROOF (Parallel ROOT Facility), which can be installed on any computer cluster (for example, you can do it at your institute). The third level is the GRID - computer clusters spread around the world, connected to each other by fast internet connections. That means the analysis (in the language of computers, the “job”) can be executed on different computers all around the globe.
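The pattern PROOF applies - split the event list among workers, process the chunks in parallel, then merge the partial results - can be sketched in a few lines of standard C++. Here `std::async` stands in for the cluster workers, and the "analysis" is just a sum; the function names are illustrative, not part of PROOF:

```cpp
#include <algorithm>
#include <future>
#include <numeric>
#include <vector>

// Stand-in "analysis" of one chunk of events: sum the selected values.
double processChunk(const std::vector<double>& data, size_t begin, size_t end) {
    return std::accumulate(data.begin() + begin, data.begin() + end, 0.0);
}

// Split the event list among nWorkers, process chunks concurrently,
// then merge the partial results - the same split/merge idea PROOF
// applies across the machines of a cluster.
double parallelAnalyze(const std::vector<double>& data, size_t nWorkers) {
    std::vector<std::future<double>> parts;
    size_t chunk = (data.size() + nWorkers - 1) / nWorkers;
    for (size_t w = 0; w < nWorkers; ++w) {
        size_t b = std::min(w * chunk, data.size());
        size_t e = std::min(b + chunk, data.size());
        parts.push_back(std::async(std::launch::async, processChunk,
                                   std::cref(data), b, e));
    }
    double total = 0.0;  // merge step: combine the partial results
    for (auto& f : parts) total += f.get();
    return total;
}
```

The essential point is that the chunks are independent, so the result does not depend on where (or in what order) each one is processed - which is exactly what allows a GRID job to run on any site holding a copy of the data.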
What is more, physicists do not need to know where the job is executed. From our point of view it does not matter whether the job ran, for example, on a computer in the US which analyzed data from Hiroshima. The important part of the process is getting the result - and that, too, we can do from anywhere (all we need is a fast internet connection).
ALICE GRID sites all over the world
We can now answer the question posed at the beginning: for contemporary experimental physicists, a fast connection to the internet is much more important than having a better computer on their desk, because most of the computations are done somewhere in the world, in dedicated computer centers.
The management and analysis of the huge amount of data collected by the ALICE experiment is far from an easy task. AliRoot, the experiment's dedicated data analysis software, enables data processing in a fast and efficient way. In addition, due to its public availability, it allows sharing information with people all over the world.