With the increasing number of Machine Learning and Deep Learning applications in High Energy Physics, easy access to dedicated Infrastructure as a Service (IaaS) resources is a requirement for fast and efficient R&D. This work explores different types of cloud services for training a Generative Adversarial Network (GAN) in a parallel environment, using a TensorFlow data-parallel strategy. More specifically, we parallelize the training process on multiple GPUs and Google Tensor Processing Units (TPUs), and we compare two algorithms: the TensorFlow built-in logic and a custom loop, optimised to give greater control over the elements assigned to each GPU worker or TPU core. The quality of the generated data is compared to Monte Carlo simulation. Linear speed-up of the training process is obtained, while retaining most of the performance in terms of physics results. Additionally, we benchmark the aforementioned approaches, at scale, over multiple GPU nodes, deploying the training process on different cloud vendor service offerings and seeking overall efficiency and cost-effectiveness. The combination of data science, cloud deployment options and associated economics allows us to burst out onto heterogeneous resources, exploring the full potential of cloud-based services.
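The data-parallel idea behind the abstract above can be illustrated without any framework. The sketch below is not the authors' TensorFlow code: it is a minimal, hypothetical pure-Python example (the toy linear model and the names `shard_batch`, `local_gradient`, `all_reduce_mean` and `train_step` are all assumptions made for illustration). It shows synchronous data parallelism: a global batch is split into per-replica shards, each replica computes a gradient on its own shard, and the gradients are averaged before a single shared weight update.

```python
# Minimal sketch of synchronous data-parallel training (illustrative only).
# Toy model: y = w * x with a mean-squared-error loss. All names are hypothetical.

def shard_batch(batch, num_replicas):
    """Split a global batch into equal, contiguous per-replica shards."""
    if len(batch) % num_replicas != 0:
        raise ValueError("global batch size must be divisible by num_replicas")
    size = len(batch) // num_replicas
    return [batch[i * size:(i + 1) * size] for i in range(num_replicas)]

def local_gradient(w, shard):
    """Gradient of the mean squared error of y = w * x over one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for the all-reduce step: average the per-replica gradients."""
    return sum(values) / len(values)

def train_step(w, batch, num_replicas, lr=0.01):
    """One synchronous step: shard, compute local gradients, average, update."""
    shards = shard_batch(batch, num_replicas)
    grads = [local_gradient(w, s) for s in shards]  # would run in parallel on real hardware
    return w - lr * all_reduce_mean(grads)

# Toy data generated from y = 3x (assumed for illustration).
batch = [(float(x), 3.0 * x) for x in range(1, 9)]

# With equal-sized shards, one step on 4 replicas matches one step on 1 device.
w_single = train_step(0.0, batch, num_replicas=1)
w_parallel = train_step(0.0, batch, num_replicas=4)

# Repeating the step drives w towards the true slope of 3.
w = 0.0
for _ in range(50):
    w = train_step(w, batch, num_replicas=4)
```

In TensorFlow, the sharding and gradient aggregation are handled by a distribution strategy such as `tf.distribute.MirroredStrategy` or `tf.distribute.TPUStrategy`. A custom training loop, as opposed to the built-in logic, is the place where one decides explicitly which elements each replica receives, which is the role `shard_batch` plays in this toy version.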
Case Study Summary:
- The scientific problem we tackled:
In particle physics, the simulation of particle transport through detectors requires an enormous amount of computational resources, using more than 50% of the resources of the current Worldwide LHC Computing Grid (WLCG) infrastructure. This challenge has motivated the investigation of different, faster approaches to replace standard Monte Carlo simulations. Deep Learning Generative Adversarial Networks are among the most promising alternatives and, indeed, multiple prototypes are being explored for the simulation of different particle detectors. In this context, access to heterogeneous infrastructure resources, such as cloud-based hardware accelerators, is essential to enable the development and subsequent deployment of complex models.
- The computational methods we used:
We demonstrated different approaches to distributed deep neural network (DNN) training on diversified hardware (TPUs and GPUs). We also compared different cloud provisioning and orchestration methods: a Kubeflow-based deployment on Google Cloud Platform (GCP) versus a Machine Learning as a Service (MLaaS) offering on Azure.
- The cloud resources we used:
Multiple GPU nodes on GCP and Azure, and TPUs on GCP.
- The differences we’ve observed between locally-provided and cloud-provided resources:
Some cloud-based resources are either unavailable to us on-premises (e.g., TPUs) or not available at the required scale (hundreds of GPUs), which makes a hybrid cloud model very interesting to explore.
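The benchmarking described in this summary rests on a few simple ratios: speed-up, parallel efficiency, and cost per training epoch. The sketch below shows one way to compute them. The function names, timings, and hourly price are hypothetical placeholders, not measurements or prices from this study.

```python
# Minimal sketch of scaling and cost metrics (all numbers are hypothetical).

def speedup(t_single, t_parallel):
    """Ratio of single-worker time to multi-worker time for the same work."""
    return t_single / t_parallel

def parallel_efficiency(t_single, t_parallel, num_workers):
    """Speed-up divided by worker count; 1.0 corresponds to linear scaling."""
    return speedup(t_single, t_parallel) / num_workers

def cost_per_epoch(epoch_seconds, num_workers, hourly_price_per_worker):
    """Cost of one epoch, given a per-worker hourly price."""
    return epoch_seconds / 3600.0 * num_workers * hourly_price_per_worker

# Placeholder inputs: one epoch takes 3600 s on 1 worker and 450 s on 8 workers,
# at an assumed price of 2.0 currency units per worker-hour.
t1, t8, price = 3600.0, 450.0, 2.0

s = speedup(t1, t8)
e = parallel_efficiency(t1, t8, num_workers=8)
cost_1 = cost_per_epoch(t1, 1, price)
cost_8 = cost_per_epoch(t8, 8, price)
```

With these placeholder inputs the scaling is exactly linear, so the cost per epoch is the same on one worker and on eight while the wall-clock time drops by a factor of eight. Any efficiency below 1.0 would raise the cost per epoch as workers are added.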
Sofia Vallecorsa, who holds a PhD in Physics, is a CERN physicist and the AI and Quantum research lead at CERN openlab. Dr. Vallecorsa has extensive experience as a software developer in the field of High Energy Physics, with significant expertise across the full research chain, from real data analysis to simulation workloads. She has an extensive background in Quantum Machine Learning and classical Deep Learning architectures, frameworks, and methods for distributed training and hyperparameter optimization, in environments ranging from commercial clouds to HPC. Dr. Vallecorsa is currently the coordinator for Quantum Computing within the CERN Quantum Technology Initiative.
For further information:
RRoCCET21 is a conference that was held virtually by CloudBank from August 10th through 12th, 2021. Its intention is to inspire you to consider utilizing the cloud in your research, by way of sharing the success stories of others. We hope the proceedings, of which this case study is a part, give you an idea of what is possible and act as a “recipe book” for mapping powerful computational resources onto your own field of inquiry.