New ALICE Offline Coordinator
A.M. First of all, congratulations on your new position! Where were you working before moving to ALICE?
P.B. Thank you very much. I should say that this is not the first time I will be working in ALICE; in a way this is my comeback to ALICE. I have worked for many years on heavy-ion experiments. More than 23 years ago, just after I had finished my studies at Zagreb University, I joined a group at the Rudjer Boskovic Institute that was building a RICH detector for the NA35 experiment. The group leader was Guy Paic, who is still active in ALICE; many of you know him, and some will probably meet him, as he will be celebrating his 75th birthday during the forthcoming ALICE Physics Week in Mexico. That is how I was introduced to heavy-ion physics in its very early days, when NA35 was one of the first experiments in the field. It ran for several years until we realized that we needed a purpose-built detector that could collect data at higher rates. That is how the NA49 experiment came to be built; I worked on it from 1994 until 1999.
With its four large TPC detectors, the experiment generated more data within three weeks of running during the heavy-ion period than all four LEP experiments did during an entire year. Clearly, this was the computing challenge of the time, and during that period I committed myself fully to solving it, discovering my personal passion for computing in the process. In this experiment I worked on many computing aspects, from data management and databases to reconstruction and visualization. It may be interesting to remind readers that the initial development of ROOT, from 1995 to 1999, also took place in NA49. While I never worked on the ROOT project, I always enjoyed endless discussions with Rene Brun on all aspects of computing in HEP. By the end of 1999 it was clear that the future of heavy-ion physics lay with ALICE, as the LHC was opening new possibilities in this field. This is when I joined ALICE for the first time.
My first job in ALICE was to build a framework for distributed data processing and to run the simulations needed for the first Physics Performance Report on the emerging Grid infrastructure. I appropriately named the project AliEn@Grid, as it was meant to provide the ALICE Environment on the Grid. This project is still very much alive and has definitely outlived my own stay in ALICE. In 2004 I briefly moved to the CERN IT department to work on the architecture of gLite, a common EU-funded Grid project. Finally, two years later, I moved back to the Physics Department, to the PH/SFT group, where I am currently leading an R&D project on virtualization. This latest project develops a specially crafted virtual machine, called CernVM, that allows experiments to tap into so-far-unused computing resources, such as individual workstations and laptops, either to build volunteer computing clouds or to seamlessly extend their computing capacity onto the emerging public and private cloud infrastructures.
A.M. Could you tell us more about the first project that you developed for ALICE?
P.B. I like to work with concrete tasks and specific deadlines, since that allows everyone to focus. In the case of AliEn, within six months I had to organize a distributed computing environment that would allow us to simulate enough events for the first Physics Performance Report. While the initial system architecture and choice of technologies seemed at odds with mainstream Grid developments at the time, within a couple of years it became apparent that the model of a distributed web of services used in the original AliEn was indeed the direction in which industry wanted to go. While the academic community continued to develop Grid middleware, industry invented the Cloud concept by marrying the Web-services approach with virtualization technology. With our current AliEn middleware we are still closer to the Grid than to the Cloud model, and this is something we will need to change gradually. Still, the current system has shown remarkable scaling capabilities: with the first version it took us 7 days to run 5,000 simulation jobs, and today we can easily run 50,000 simulation, reconstruction and analysis jobs concurrently in the system. Given that all of this happens in a completely distributed and heterogeneous environment, it is not surprising that occasional glitches still occur. Improving overall performance and turnaround time, in particular for analysis jobs, will be very high on our priority list.
A.M. This brings me to my next question. What will be your main priorities as the Coordinator of the ALICE Offline group?
P.B. Over the next couple of years there will be two rather distinct lines to follow in the development of ALICE software. One is to continue developing the current software, making it more robust and improving its quality. This also means improving the algorithms we use in order to gain speed and reduce resource consumption. At the same time, we will have to start R&D activities towards the future upgrades of ALICE. Following the endorsement of the Letter of Intent for the ALICE upgrade, we have to write a technical report. Given the huge amount of data that will be generated after the ALICE upgrade, we may have to radically rethink our data processing and distributed computing model and prepare for the likely convergence of the Data Acquisition, Offline and HLT activities. This convergence is a challenging task that brings together groups that in past decades had little or no interaction. We will have to learn how to work together on what will be our common computing platform six years from now, when data will be collected and calibrated simultaneously, and possibly reconstructed on the same facility. That is an exciting task and I look forward to working on it.
A.M. What would you like to share with all ALICE members?
P.B. With our detector weighing 10,000 tonnes, it is easy for most of us to grasp its size and respect its complexity. At the same time, most people tend to underestimate or ignore the complexity of the software, because it is not tangible. Yet in many ways our software is as heavy as the ALICE magnet, and it has huge inertia. You can think of it as a big ship, which is difficult to steer: once it takes a given direction, you cannot redirect it immediately. I think we should keep this picture in mind every time we raise a complaint or send a new request to the software community. We are always happy to implement changes, but we cannot steer the Offline ship as if it were a speedboat, and changes cannot take immediate effect, especially if we want to maintain and increase software quality by adding more testing and QA procedures. I will put all my effort into streamlining the procedures and making our software lighter, more manageable and easier to handle, especially as we move towards the ALICE upgrade, but there will always be limits to what can be achieved.
Our main goal remains to reduce the time to publication, and that means improving not only the software but also all the pre-processing, calibration and QA procedures, which requires a great deal of communication between the various groups (e.g. between the Physics Working Groups and Offline, between Offline and Online, and so on). Overall, I think we all need to work together, in a true collaborative spirit, on improving software performance, procedures and communication if we really want to substantially improve our data processing speed.
A.M. Finally, I would like to ask you about the current computing power of ALICE: is it sufficient for ALICE's future needs? And what role can smaller institutions play?
P.B. We are always short of resources of all kinds! Lack of storage is perhaps the most critical point at the moment, but I would say that the lack of manpower is even more important. People often think that most of the computation happens in the big computing centres, but the statistics show that exactly 50% of the overall computing is done in our smaller computing centres around the world, so nobody should feel neglected or isolated. In addition, when thinking about how to improve our software, there is much that can be done by applying new techniques and approaches such as concurrent or parallel programming. From such engineering modifications we may gain improvements measured in tens of percent, which is huge if you consider the overall resources we need. However, even bigger gains, measured in factors, can be achieved by applying new algorithms, and this is where the creativity of each and every individual becomes really important. That is why I would like to strongly emphasize that everyone is equally valuable and that we always seek new ideas. I think this is the true meaning of being a member of a collaboration like ALICE.