Horovod is an open source framework created to make distributed training of deep neural networks fast and easy for TensorFlow, PyTorch, and MXNet models. Horovod's API makes it easy to take an existing training script and scale it to run on hundreds of GPUs, but provisioning a Horovod job with hundreds of GPUs can often be a challenge for users who lack access to HPC systems preconfigured with tools like MPI. The recently added Elastic Horovod API provides fault tolerance and auto-scaling capabilities, but requires additional infrastructure scaffolding to configure. In this talk, you will learn how Horovod on Ray can be used to easily provision large distributed Horovod jobs and take advantage of Ray's auto-scaling and fault tolerance with Elastic Horovod out of the box. With Ray Tune integration, Horovod can further be used to accelerate your time-constrained hyperparameter search jobs. Finally, we'll show you how Ray and Horovod are helping to define the future of machine learning workflows at scale.
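To give a flavor of what the talk covers: a minimal sketch of provisioning a Horovod job on a Ray cluster with the `RayExecutor` API from `horovod.ray`. This assumes `ray` and `horovod[ray]` are installed and a Ray cluster is reachable; the worker count and the body of the training function are placeholders, not the speaker's actual example.

```python
# Sketch only: launch a small Horovod job on Ray via RayExecutor.
# Assumes `pip install ray horovod[ray]` and a reachable Ray cluster.
import ray
from horovod.ray import RayExecutor

def train_fn():
    # A real script would run its framework-specific training loop here;
    # this placeholder just initializes Horovod and reports its rank.
    import horovod.torch as hvd
    hvd.init()
    print(f"worker {hvd.rank()} of {hvd.size()}")

ray.init()  # connect to (or start) a Ray cluster

# RayExecutor provisions the Horovod worker group on top of Ray,
# replacing the MPI/HPC scaffolding mentioned in the abstract.
settings = RayExecutor.create_settings(timeout_s=30)
executor = RayExecutor(settings, num_workers=2, use_gpu=False)
executor.start()
executor.run(train_fn)  # runs train_fn on every Horovod worker
executor.shutdown()
```

The same executor pattern is what the Elastic Horovod and Ray Tune integrations build on: Ray supplies the cluster provisioning and fault handling, while Horovod handles the allreduce-based training itself.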
Travis Addair is a software engineer at Uber working on the Michelangelo machine learning platform. He leads the Horovod project and chairs its Technical Steering Committee within the Linux Foundation. In the past, he’s worked on scaling machine learning systems at Google and...