The O2 project has been launched to address this extremely ambitious programme, which requires a new common organization for the DAQ, HLT and offline projects. The main objective is to develop together the new ALICE computing system, which should be ready for Run 3 of the LHC. This system will collect, process and compress up to 1 TByte of raw data per second, thus being able to cope with the increased flow of data.
Following the LS1 period and the subsequent upgrades of the LHC and the ALICE detectors, a larger number of events will be recorded. More specifically, the rate of heavy-ion events handled by the online systems, up to permanent data storage, should increase to 50 kHz; this corresponds to an increase of roughly two orders of magnitude with respect to the present system. This represents a new challenge for the amount of data that ALICE will have to record and use for physics analysis. The mean event size has been estimated to be of the order of 23 MByte, resulting in a global data throughput of approximately 1150 GByte/sec read out from the detectors; this large volume of data clearly needs to be substantially reduced. The estimated data throughput to mass storage after data compression is of the order of 83 GByte/sec at peak, with an average of 12 GByte/sec.
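The rates quoted above are mutually consistent, which a short back-of-the-envelope check makes explicit. All figures below come from the text; the script is an illustrative consistency check, not O2 code.

```python
# Consistency check of the data rates quoted in the text (illustrative only).

rate_hz = 50_000            # peak Pb-Pb event rate handled by the online systems
event_size_mbyte = 23       # estimated mean event size

readout_gbyte_s = rate_hz * event_size_mbyte / 1000
assert readout_gbyte_s == 1150      # matches the ~1150 GByte/sec quoted above

# Compression factor implied by the quoted throughput to mass storage
peak_storage_gbyte_s = 83           # GByte/sec peak after compression
avg_storage_gbyte_s = 12            # GByte/sec average after compression
peak_compression = readout_gbyte_s / peak_storage_gbyte_s
print(peak_compression)             # roughly a factor 14 at peak
```

The implied compression factor of roughly 14 at peak (and far more on average) is what motivates the emphasis on online data reduction.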
Among the challenges that O2 will have to address, the physics simulation, the online calibration, the online reconstruction and the data quality control are the toughest. The ALICE collaboration of course has in-depth experience in these areas. However, the challenge will be compounded by the continuous read-out of some detectors, which will result in an unceasing flow of data with a mix of superimposed interactions. This will require a complete redesign of all the relevant algorithms.
The requirement to run at a peak rate of 50 kHz translates into an average rate of 20 kHz over a complete fill of the LHC. One month of Pb-Pb running will result in the acquisition of 2 x 10^10 events, two orders of magnitude more than the data taken in the 2011 Pb-Pb run. This means that while, at the current event rates, the data are reconstructed in two months with 10^4 cores, after the upgrade 10^6 cores would be needed to reconstruct the data from Pb-Pb runs within the same period. In other words, even taking into account the advancement in the performance of computing systems, the performance of the code needs to be increased by at least a factor of 6 in order to cope with these requirements, and this factor must come from the optimization of the current code.
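The scaling behind the "factor of 6" can be sketched numerically. The assumed hardware gain (performance per core doubling every 18 months over roughly six years between the runs) is an illustrative assumption, not a figure from the text.

```python
# Rough scaling behind the factor-of-6 argument (illustrative assumptions).

events_run3 = 2e10               # events from one month of upgraded Pb-Pb running
events_run1 = events_run3 / 100  # two orders of magnitude fewer in 2011
cores_run1 = 1e4                 # cores reconstructing Run 1 data in two months

# Naive linear scaling at constant code performance: 100x events -> 100x cores
cores_run3 = cores_run1 * events_run3 / events_run1
assert cores_run3 == 1e6

# Assumption: per-core performance doubles every 18 months over ~6 years,
# giving roughly 2**(6 / 1.5) = 16x from hardware alone.
hw_gain = 2 ** (6 / 1.5)
code_gain_needed = (events_run3 / events_run1) / hw_gain
print(code_gain_needed)          # remainder, ~6x, must come from the software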
Secondly, the ALICE upgrade will be based on a combination of continuous and triggered read-out. The ITS and TPC detectors are considering implementing continuous read-out, while TOF and TRD will use triggered read-out based on the L0 trigger. Moreover, a delayed trigger will be available for slower detectors. The detector read-out is the only part of the system that must be capable of handling the 100 kHz rate from the beginning, and a total capacity of 25 Tbit/s is currently foreseen. The new read-out architecture requires a profound redesign of the DAQ and HLT systems.
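The 25 Tbit/s read-out capacity can be compared with the detector data flow quoted earlier; the short check below is illustrative only and uses the figures from the text.

```python
# Comparing the detector data flow with the foreseen read-out capacity
# (figures from the text; illustrative consistency check only).

readout_gbyte_s = 1150                      # detector data flow quoted earlier
readout_tbit_s = readout_gbyte_s * 8 / 1000
print(readout_tbit_s)                       # ~9.2 Tbit/s of payload

capacity_tbit_s = 25                        # total read-out capacity foreseen
headroom = capacity_tbit_s / readout_tbit_s
print(headroom)                             # ~2.7x margin over the raw payload
```

The foreseen capacity thus leaves a margin of roughly a factor 2.7 over the raw payload, which plausibly covers protocol overhead and future growth.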
Finally, it should be noted that data acquisition and processing will be carried out by a large processor farm based on GPUs and CPUs. The full event reconstruction will be performed in a single event-building and processing node, where the final data compression and data recording will take place before local data storage.
The computing hardware is another area requiring substantial R&D. There is also considerable experience in this area within ALICE, but the computing arena has evolved significantly in recent years. The former model of a single quasi-monopolistic provider of CPU chips (Intel) delivering chips with the same architecture for more than 30 years and doubling their performance every 18 months is gone. The increase in performance now comes from including more and more cores in each chip, not from increases in the clock rate. New actors and new architectures are also emerging with the explosion of mobile computing and of powerful co-processors (GPUs, many-core processors, FPGAs). These changes will require revisiting the current usage of parallel platforms. In order to exploit the parallel hardware offered by vendors, the code has to be adapted to the hardware to a large extent. The calibration and reconstruction of the data will have to run very efficiently on a highly parallel system, with several levels of parallelism of different granularities. This poses another important challenge: the code will have to be optimized for the newly emerging parallel architectures.
In order to start working on these objectives, the O2 project has set up a dozen Computing Working Groups (CWGs) on different topics such as the architecture, the tools, the dataflow, the data model, control and configuration, and the software lifecycle. To ease and encourage the joint effort, these CWGs are formed by members of the three projects working together on the same topic. Members of the project meet on a regular basis during plenary meetings to share and review the status of the CWGs. All the ALICE experts in these areas are active in the CWGs dedicated to these issues.
This large effort comes on top of all the other tasks to be accomplished during the LS1 period: analysing data, writing physics papers and preparing for Run 2. The project plans to present a Technical Design Report to the LHCC in September 2014. There is no such thing as a quiet period in HEP experiments!