Dask

Caution

Currently, the dask backend can only be used if your workflow code is organized as a package. This is due to how pytask imports your code and how dask serializes task functions (issue).
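As an illustration of what "organized as a package" means, a project might use a src layout like the following hypothetical sketch (module names are placeholders):

my_project/
├── pyproject.toml
└── src/
    └── my_project/
        ├── __init__.py
        ├── config.py
        └── task_data.py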

Dask is a flexible library for parallel and distributed computing. You may know it from dask.dataframe, which enables lazy processing of big data. Here, we use distributed, which provides an interface similar to concurrent.futures.Executor, to parallelize execution.
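As a minimal sketch of that Executor-like interface (independent of pytask), a client can submit plain Python functions and collect results just like a concurrent.futures executor:

from dask.distributed import Client

client = Client()  # with no address given, this starts a local cluster
executor = client.get_executor()  # concurrent.futures-compatible interface

future = executor.submit(sum, [1, 2, 3])
print(future.result())  # 6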

There are a couple of ways to use dask.

Local

By default, using dask as the parallel backend launches a distributed.LocalCluster with worker processes on your local machine.

pytask --parallel-backend dask -n 2

or, equivalently, configure it in your pyproject.toml:

[tool.pytask.ini_options]
parallel_backend = "dask"
n_workers = 2
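Under the hood, this is roughly equivalent to the following sketch; the exact configuration pytask-parallel applies may differ:

from dask.distributed import Client
from dask.distributed import LocalCluster

# Two local worker processes, mirroring -n 2 above.
cluster = LocalCluster(n_workers=2, processes=True)
executor = Client(cluster).get_executor()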

Local or Remote - Connecting to a Scheduler

It is also possible to connect to an existing scheduler and use it to execute tasks. The scheduler can be launched on your local machine or in a remote environment. This approach also lets you inspect the dask dashboard for more information about the execution.

Start by launching a scheduler in a terminal on a machine of your choice.

dask scheduler

After the launch, the scheduler's IP address will be displayed. Copy it. Then, open more terminals and launch as many dask workers as you like with

dask worker <scheduler-ip>

Finally, write a function that builds the dask client and register it as the dask backend. Place the code somewhere in your codebase, preferably in config.py or another module that stores your project's main configuration and is imported during execution.

from concurrent.futures import Executor

from dask.distributed import Client

from pytask_parallel import ParallelBackend
from pytask_parallel import registry


def _build_dask_executor(n_workers: int) -> Executor:
    # Connect to the running scheduler and expose it as an executor.
    # n_workers is ignored here: the number of workers is determined by
    # how many `dask worker` processes you launched.
    return Client(address="<scheduler-ip>").get_executor()


registry.register_parallel_backend(ParallelBackend.DASK, _build_dask_executor)

You can also register it under pytask_parallel.ParallelBackend.CUSTOM, which lets you switch back to the default dask executor quickly.
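As a sketch, registering the same builder under the custom backend only changes the enum member; the backend is then selected on the command line (assuming the enum's value is "custom"):

from pytask_parallel import ParallelBackend
from pytask_parallel import registry

# Keeps the default dask backend intact, so you can switch between
# --parallel-backend dask and --parallel-backend custom.
registry.register_parallel_backend(ParallelBackend.CUSTOM, _build_dask_executor)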

See also

You can find more information in the documentation for dask.distributed.

Remote

You can learn how to deploy your tasks to a remote dask cluster in this guide. The dask developers recommend using coiled for deployment to cloud providers.

coiled is a product built on top of dask that eases the deployment of your workflow to many cloud providers like AWS, GCP, and Azure.

If you want to run the tasks in your project on a cluster managed by coiled, read this guide.

Otherwise, follow the instructions in dask’s guide.
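As a rough sketch, wiring a coiled cluster into pytask could look like the following, assuming a configured coiled account; the cluster options shown are placeholders:

from concurrent.futures import Executor

import coiled
from dask.distributed import Client

from pytask_parallel import ParallelBackend
from pytask_parallel import registry


def _build_coiled_executor(n_workers: int) -> Executor:
    # coiled.Cluster provisions workers with your cloud provider; Client
    # connects to the cluster's scheduler and exposes an Executor interface.
    cluster = coiled.Cluster(n_workers=n_workers)
    return Client(cluster).get_executor()


registry.register_parallel_backend(ParallelBackend.DASK, _build_coiled_executor)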