Dask is a parallel computing library for Python that provides parallel ndarray and dataframe abstractions mimicking the interfaces of NumPy and Pandas. Beneath these interfaces, Dask automatically partitions data into chunks and composes a task graph representing NumPy and Pandas operations over those chunks. A Dask scheduler then executes this task graph, parallelizing tasks across cores and/or machines and yielding both task and data parallelism.
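To make this concrete, here is a minimal sketch of the chunking-and-task-graph model using Dask’s array API (the array size and chunk shape are illustrative):

    import dask.array as da

    # A 10,000 x 10,000 array, automatically partitioned into
    # 1,000 x 1,000 chunks; each chunk is a plain NumPy array.
    x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))

    # Operations are lazy: this composes a task graph of chunk-wise
    # NumPy operations rather than computing anything immediately.
    y = (x + x.T).mean(axis=0)

    # A Dask scheduler executes the graph, running independent
    # chunk tasks in parallel across cores and/or machines.
    result = y.compute()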
This talk is about a Dask-on-Ray scheduler that was recently added to Ray. By implementing a Dask scheduler that farms Dask tasks out to a Ray cluster, we can run the entirety of the Dask ecosystem on top of Ray. This lets us leverage Ray’s innovations: decentralized, peer-to-peer, resource- and locality-aware, local-first scheduling; caching of scheduling decisions for fast worker-to-worker RPCs; zero-copy intra-node data sharing; task, worker, and GCS fault tolerance; and scheduling throughput that is unaffected by cluster-state inspection. The result is a fast, scalable, fault-tolerant Dask backend with great ops tooling at no performance cost, capable of executing any Dask graph.
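As a sketch of how this looks in practice (assuming Ray’s ray.util.dask module, which exposes the Dask-on-Ray scheduler as ray_dask_get), the scheduler plugs into Dask’s standard scheduler hook:

    import ray
    import dask.array as da
    from ray.util.dask import ray_dask_get

    ray.init()  # connect to (or locally start) a Ray cluster

    x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
    y = (x + x.T).mean(axis=0)

    # Passing the Dask-on-Ray scheduler to .compute() farms each Dask
    # task out to Ray; intermediate chunks live in Ray's shared-memory
    # object store, giving zero-copy intra-node data sharing.
    result = y.compute(scheduler=ray_dask_get)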
Finally, we’ll talk about how Descartes Labs is leveraging Ray and this Dask-on-Ray integration to power its geospatial data analysis product, Workflows. Built on a highly scalable Ray-based backend, Workflows provides both low-latency interactive data analysis and massively scalable batch computation, distributing work across tens of thousands of cores and letting users process petabytes of remotely sensed data with ease.
Clark Zinzow is a software engineer at Descartes Labs, where he's building a platform for geospatial data analysis that puts tens of thousands of cores and petabytes of remote sensing data at the user's fingertips. He loves working at the boundaries of service, data, and compute scale...