Following 14 years of a successful career at ALICE, Federico Carminati has recently defended his PhD thesis on a topic that he has clearly studied in depth: “Design, realisation and exploitation of the data processing system of the ALICE experiment for simulation, reconstruction and analysis”. In his thesis, pursued under the supervision of Yves Schutz, Federico explores the details of the software framework that ALICE uses for the processing of experimental data.
Federico submitted his thesis to the University of Nantes last February and we met him for this interview a few weeks later in his new office at the SFT group, where he is now working on his new project: the so-called GEANT5. In this interview Federico recalls many of the exciting moments that he had with ALICE and narrates them in his own personal style.
Q. Tell me a bit about your professional history before joining ALICE.
A. Let’s start with a bit of personal history. I obtained my degree in particle physics at the University of Pavia in 1981. My thesis was about the analysis of particle flow in p-p interactions in a bubble chamber experiment. In hindsight it was not such a bad job, and it led to the publication of two papers. After my Laurea (something between a BSc and a Master’s), I had no time (or willingness) for a doctorate, which in any case was not offered by the Italian university system, and I immediately started working on the UA2 experiment, in which the University of Pavia was involved. I still remember the night when we got word that the Z0 had been discovered by the “other” experiment! Following my experience at CERN, I moved to the Los Alamos National Laboratory, where I worked on muon decay during 1983. My next appointment was at Caltech under the direction of Harvey Newman, a knowledgeable physicist and a gentleman, to work on the L3 experiment. Destiny was definitely showing me the way back to CERN, where I spent most of 1984, paying only one visit to Caltech.
Federico Carminati getting ready to defend his PhD thesis on the ALICE software at the University of Nantes.
At CERN I had the chance to start working with René Brun, Francis Bruyant and Julius Zoll and I became more and more interested in computing. At the end of the year, I was offered a position at CERN and in 1985 I began to work for the IT (at the time still DD) division in the Program Library Office led by Harry Renshall. These were years of intense learning and hard work in a very stimulating and friendly environment. I became responsible for the CERN Program Library in 1988 and then responsible for the GEANT 3 simulation project in 1991.
In 1993 the world of HEP computing started its move from FORTRAN to a new language, which quickly turned out to be C++. The people involved were all respectable, but the process turned out to be very controversial and too often technical discussions turned into personal arguments. Moreover, in this controversy, I initially supported the move to FORTRAN90, which was the losing alternative. The exceptionally friendly and productive working environment I had known till then had degraded considerably. Right at that time Carlo Rubbia, who was ending his mandate as Director General, asked me to join him as simulation expert, to work on his new idea for a clean and safe source of nuclear power, the Energy Amplifier. I stayed with Rubbia from 1995 to 1998 and it was a unique working experience that left a lasting impression on me.
I was out of the big storm that was agitating CERN and the HEP computing world, but these years were far from being “plain sailing” for me. After four years of intense, exhausting but also exciting and groundbreaking work, I felt it was time for me to come back to my previous work in HEP computing. The major Energy Amplifier report was completed and we had performed two experiments that verified the major assumptions behind the idea.
Q. How did you decide after that experience to work for the LHC experiments and, more specifically, for ALICE?
A. Well, I talked to the computing gurus of the four experiments and they told me “Thanks for your interest, we will contact you.” I got one phone call; for the others, I am still waiting (laughs). This was the time when the ALICE experiment started building the offline infrastructure and I was supposed to take a part in it, which I definitely did.
I had the advantage of being more “battle-worn” than some of my colleagues, so I started doing things that looked logical to me, without knowing that these were forbidden, political heresy or simply considered impossible. I immediately decided to adopt the ROOT package that René Brun had been developing since 1992 as the basis for the ALICE Offline framework. First “faux pas”, as the director of computing at the time had just issued an official memorandum forbidding the usage of ROOT for the LHC experiments. By the way, I believe that this memo was never revoked (laughs). This decision of mine understandably caused a lot of uneasiness, within and without ALICE, and it was soon followed by others that were also contrary to the prevailing doxa, and which created an image of ALICE as “the different one” in the LHC computing landscape.
With time, most of the unconventional and controversial choices that my colleagues from the ALICE Offline and I made were afterwards validated and adopted by the other experiments, or in any case turned out to be functional for ALICE. I have to say, in all honesty, that I derive some satisfaction from this. However it has not been an easy road to follow. Being “right in advance” just means that you are always “wrong” and in disagreement with many of your colleagues. And when you are finally right on a given subject, the remark “I told you so” helps very little in the way of improving your working relations.
Congratulations Dr. Carminati!
I think most of it also comes from my temperament. Surely another person could have done the same, or even better, with less friction.
Q. Which were the specificities of ALICE?
A. A colleague of mine, perhaps after one too many beers, explained to me that “engrained in the DNA of a nuclear physicist is the fact that the night before the beam comes, he rolls his apparatus into the beam line, sets up his own data acquisition system and, when beam comes, he registers the data on the disk of his laptop. Once data taking is over, he starts writing the software to analyse the data.” A late-night exaggeration without any doubt; however, it is true that the nuclear physics community is less experienced in large collider experiments than the particle physics community. Moreover, ALICE has a computing challenge that is of the same order of magnitude as that of ATLAS and CMS, with half the people and arguably less experience in large and integrated computing infrastructures than the communities who worked on UA, LEP and CDF (colleagues from RHIC of course being the exception).
At that time the other experiments were developing two lines of software, one for the immediate needs of detector design, mostly still in FORTRAN, and the other for the long-term data processing, in C++. However, our manpower situation simply did not allow for that, and we had to “take what was available”, make it work immediately and, at the same time, evolve it into the final computing framework. This is why we took ROOT and GEANT3, both well tested and ready to go, even if marrying them was not easy and there were many open questions on their future and support. The task was even more difficult as the user community was conversant with FORTRAN, but not so much with C++, and many physicists had to learn “on the job” a new language and a new framework in constant development and, at the same time, produce results. It has to be said that even we, “the experts”, were learning “on the job”, as C++ is a very complex language that takes time and experience to “tame”.
Again, to make the most of the scarce manpower we had, we decided to have the physicists doing detector simulation and design and the computing experts working in a single group. This put considerable pressure on the computing experts to support a production system and develop it at the same time, as well as a great strain on the patience of physicists who saw their everyday working tool in continuous evolution. The bright side of this choice was that ALICE was the first experiment to make a complete transition to C++ and it never had a “legacy system” to deal with.
Another consequence of the limited manpower was that the whole framework had to be developed at CERN, as the detector groups had just enough resources to provide the detector-specific code, beyond the already daunting task of building and testing the detectors. So ALICE ended up with a very small offline team, much smaller than those of the other experiments, but one which was all concentrated at CERN, where we had more people dedicated to Offline than the others. This was not an easy situation to defend politically, even if it turned out to be very effective, as a good and motivated team with closely-knit relations can be incredibly productive.
Here I have to say that, although I regard myself as a person who had some good ideas and had the determination to fight for them, I would not have achieved much without my colleagues of the Offline team. I still wonder at the incredible luck I had to find so many exceptionally talented people who accepted to accompany me in this adventure. Many of them went on to brilliant careers, often at CERN.
Another consequence of this situation was that we had to look for an appropriate model of software development. So, while the pundits of Software Engineering at CERN were evangelising the usage of formal software design techniques (document, design, write, debug), we were preaching “agile development” (open emacs and start programming). Another open front, another battle to fight. Now of course these techniques are widely accepted, and the enormous success of the open source revolution has established them as standards, but at the time the discussion was raging. (Let me give you just a small anecdote. We went to a CERN School of Computing to present our ideas on Agile Software Development. There was an uproar from students and teachers alike and, certainly by accident, we were the only speakers for whom there was no place at the official table the evening of the banquet.)
Q. And what about the Grid?
A. When I started working for ALICE I announced my “vision”: to provide all physicists in ALICE with equal access to data and resources, irrespective of their location and provenance. People thought I was just expressing a wish, nice but unrealistic, and not a practical objective, so far was technology from imagining such a system. But I was very serious, and I set about making it come true. Well, to make a very long story short, we indeed managed to do it. It all started from a stroke of genius by Predrag Buncic who, in 2001-2002, decided to diverge from the “official” Grid projects and to write his own Grid. Hell broke loose when this was announced, as people were appalled and scandalised that he could dare set about developing alone what hundreds of people were painfully putting together in the official Grid projects. It was not hard to convince me to enthusiastically support his idea. Predrag announced to me that he wanted to call this product AliLite. I thought that this could be seen as a provocation, so I vetoed the name. Predrag accepted the veto on the condition that I would accept his next proposal. I agreed and he came up with ALICE Environment, i.e. AliEn. This sounded like an even worse provocation. Just imagine: “would you let AliEn take over your computing centre?” and all jokes of that kind! However, I decided to accept right away because the next proposal could have been more extreme.
Federico discussing the Grid and the specificities of ALICE.
One year later ALICE had its own Grid fully functional and deployed on 10 computing centres. Most of the features and functionalities of AliEn are still unparalleled by other systems, and the initial design has survived more than ten years and is still serving us very well on almost ten times more centres and 100 times more CPUs.
Was this a miracle? In some sense yes, but there was a trick. Predrag, again moving along the lines of agile software development and the same principle that guided us to choose ROOT, took what was already there and working. There was in fact a very large quantity of very high quality Open Source components out there, waiting to be assembled via a clever design. The initial version of AliEn was 5000 lines of code of what I called “Brainware”, defining the architecture of the system, and millions of lines of Open Source code. Again, exceptionally talented people joined in and continued along the road opened by Predrag. A big step forward was made when the MonALISA monitoring system was adopted, giving us complete control of the Grid parameters. On the basis of MonALISA we developed a very sophisticated job control and management system, which allows us to run large distributed productions by “pushing a button”. Another “jewel of the crown” is the global file catalogue, architected by Pablo Saiz. It is like a gigantic disk giving you transparent, efficient, secure and ubiquitous access to over 300 million files distributed all over the world, from São Paulo to Seoul. It grows at the rate of one million files a day and it shows no hint of a performance bottleneck. If you google-trend Big Data, you will find out that the buzz started a couple of years ago. We have been doing Big Data for ten years! As is often the case, HEP computing is today where computing will be tomorrow. But all these advanced technical tools would not have been so useful had the whole production system not been “tuned to perfection” over the years by Latchezar Betev, master in command of the production system.
Again we had to pay a “political price” for our “diversity”, but again it was well justified and far from being just a “caprice”. As I said before, the ALICE computing challenge, both in complexity and in terms of resources, is comparable to that of CMS and ATLAS. However, resources are allocated by the funding agencies in a way that is approximately proportional to the number of national physicists working in an experiment. ALICE computing is, so to say, twice as expensive per capita as the computing of ATLAS and CMS, and therefore it tends to be chronically underfunded, both in terms of sheer resources (disk and CPU) and in terms of support personnel at the different centres. To survive, the ALICE Grid had to be efficient and to run as much as possible “hands off”. Another consequence is that the resources had to be “mutualised” as much as possible. It has been a hard uphill fight convincing the various communities that it was in their best interest to share their resources with the whole of ALICE instead of dedicating them to the local physicists. Perhaps surprisingly, “emerging countries” have been exemplary in receiving the message and in sharing.
All in all the “dream” came true, and really physicists can submit, and have been submitting, jobs to AliEn from any location, having access to all the resources of the ALICE Grid within their quota and priority. The Grid has been a fantastic human and technical adventure, and the fact of learning to work together has been enriching and fulfilling, beyond the technical aspect. It has also been a resounding technical success, as all ALICE physics has been produced on the Grid.
Such has been the enthusiasm and momentum behind this adventure that one of the large ALICE centres, hosted at the KISTI/GSDC computing centre at Daejeon in Korea, has decided to move from a Tier 2 centre to a fully fledged Tier 1. This is the first Tier 1 to be added to the 11 historical Worldwide LHC Computing Grid centres established at its beginning 12 years ago.
Q. So do you think that you always made the right choices?
A. Oh no, far from it! Mistakes were made, and many. There are many things I would do differently. I would give more attention to planning, which at times was lacking. I would also try to share and discuss my vision much more, as I realised that at times I was so convinced that we were on the right path that I did not take the time to explain it to the larger ALICE community and to my colleagues in the other experiments. In the end I think I can say I fulfilled my mandate by providing ALICE with the tools to analyse its data; however, a more careful consensus-building strategy could have helped to smooth out some rough edges and also to gather some more support.
Technically I believe we did good things; however, I perhaps lacked the courage to correct some deeply rooted “original sins” in the computing framework, privileging stability instead. One of the symptoms of this is the high wall-time-to-CPU-time ratio of some ALICE jobs, which was only partially corrected in the last year of my mandate.
Probably I should have devoted more resources and attention to code optimisation. ALICE is relatively late in exploiting parallelism and advanced architectures with its framework, in spite of having very good experts in high-performance computing in the collaboration.
But it is not so bad to realise that some things could have been done better. On one hand this shows that experience is worth something and on the other it leaves some challenges to my successor, so that his work can be interesting and challenging too (laughs).
Q. So, what were your thoughts when you left ALICE after all these years of adventure?
A. Well, I think that “I brought the ship into the harbour” at the end. I have still ten years in front of me and there is time for another project. It is the right time to start something new. Also, as much as I have loved my job, and the challenges it has offered me, and as much I appreciate my colleagues, I somehow felt that my time in ALICE was over. Your capacity for innovation is, by definition, limited, and it wears out the more you stay on the same job. ALICE is facing exceptional challenges with the upgrade and I think it needs a new captain at the Offline helm. Luckily I believe ALICE has found the right person who will certainly bring new ideas, while being an old friend.
Being Computing Coordinator is not an easy job. You have to take tough decisions and impose discipline on an unruly community, which does not always see the need for it. You become Mr. NO, and even with the best goodwill, frictions are inevitable. Moreover, if everything works, you are just perceived as restraining people’s “freedom” to do whatever they want with their software. If something goes wrong, and things do go wrong, whether by human mistake, miscommunication or bad luck, you are guilty not only of imposing limiting and ineffective rules, but also of hindering people’s work by offering a poor service. With time all this takes its toll on your resilience. I have been the longest-serving computing coordinator in the history of the LHC, 12 years in office against an average of two. I believe that it is time to give my colleagues a break, and to give myself the proof that I can do something else in life.
The close bonds of friendship I have with some of my colleagues are still there, and we meet regularly. I have described my experience in ALICE in a book that I have edited and in my doctoral thesis, which I defended at the University of Nantes last February, with the honour of having Yves Schutz as thesis director.
Q. And what are you off to now?
A. My new project is called GEANT5. The idea is to develop the fastest possible detector simulation code for the new architectures that are coming out every week. It is essential for HEP to be able to better exploit these new machines. According to recent measurements, HEP code uses between 5% and 10% of the theoretical peak speed of a modern machine, and if nothing is done, this percentage is bound to decrease. This is not as bad as it sounds, as even the best code reaches no more than 50%; however, there is still ample room for improvement and we should go for it.
Just think, the WLCG Grid has 300,000 cores running flat out 24x7x365. Imagine if we could gain a factor of two in performance. In terms of money this would be worth millions. We started with simulation because it is more and more essential for the LHC and for future machines, and it is largely experiment independent, so the same code can satisfy many different experiments. CERN has tremendous experience, being the major development centre for GEANT4, the world standard for particle transport, and I have myself done quite a lot of work on GEANT3 and, with Rubbia, on simulation.
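The scale of such a gain can be sketched with a back-of-envelope estimate. The core count and the factor of two are from the interview; the cost per core-year below is a purely hypothetical placeholder, chosen only to illustrate how "millions" arises:

```python
# Back-of-envelope: what a 2x code speedup on the WLCG could be worth.
# cores and speedup are from the interview; cost_per_core_year is a
# hypothetical assumption for illustration only.

cores = 300_000            # WLCG cores quoted in the interview
speedup = 2.0              # hoped-for performance factor
cost_per_core_year = 100   # hypothetical cost (EUR) of one core for one year

# A 2x speedup means half the cores deliver the same work,
# effectively freeing the equivalent of the other half.
cores_freed = cores * (1 - 1 / speedup)
savings_per_year = cores_freed * cost_per_core_year

print(f"Equivalent cores freed: {cores_freed:,.0f}")
print(f"Notional savings per year: EUR {savings_per_year:,.0f}")
```

Even with a deliberately modest cost assumption, the notional figure lands in the millions per year, consistent with the order of magnitude suggested above.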
The hope is that this work, together with all the parallel activities on code optimisation that are going on in the SFT group, will also offer models and algorithms to the experiments to optimise their code.
I have to say that I have found new energy in this job; I am back to programming in C++ and discussing code design, as we did at the beginning of AliRoot. My new group, SFT, has welcomed me very warmly and I feel well integrated, enthusiastic and surrounded once more by competent people. I am developing code and studying High Performance Computing: it is quite a welcome change from the hardships of running data production under pressure (smiles).
This is a long-term project and I look forward to working for it in the years to come.