Ray Summit 2020 has ended
Please note: All Sessions are in Pacific Daylight Time (PDT), UTC-7

Thursday, October 1 • 4:50pm - 5:20pm
Distributed Deep Learning with Horovod on Ray - Travis Addair, Uber

Horovod is an open source framework created to make distributed training of deep neural networks fast and easy for TensorFlow, PyTorch, and MXNet models. Horovod's API makes it easy to take an existing training script and scale it to run on hundreds of GPUs, but provisioning a Horovod job at that scale is often a challenge for users who lack access to HPC systems preconfigured with tools like MPI. The recently added Elastic Horovod API brings fault tolerance and auto-scaling capabilities, but requires further infrastructure scaffolding to configure. In this talk, you will learn how Horovod on Ray can be used to provision large distributed Horovod jobs easily and take advantage of Ray's auto-scaling and fault tolerance with Elastic Horovod out of the box. With Ray Tune integration, Horovod can also be used to accelerate your time-constrained hyperparameter search jobs. Finally, we'll show you how Ray and Horovod are helping to define the future of machine learning workflows at scale.

Travis Addair

Senior Software Engineer II, Uber Technologies
Travis Addair is a software engineer at Uber working on the Michelangelo machine learning platform. He leads the Horovod project and chairs its Technical Steering Committee within the Linux Foundation. In the past, he’s worked on scaling machine learning systems at Google and...

Thursday October 1, 2020 4:50pm - 5:20pm PDT
Virtual 4
  Case Studies
  • Slides Included: Yes