Python Tutorial: Delaying Computation with Dask
---
We've introduced generators to defer computation and control memory use. Let's use Dask to simplify this process.
We'll get to real data soon, but we'll start with simple function composition. We define three ordinary functions f, g, and h with the def keyword as usual. Each takes a single numerical input and returns a single numerical output. We then perform a sequence of computations, assigning the intermediate results to x, y, and z, and the final result to w. This is, of course, equivalent to nesting the function calls without labeling intermediate results.
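As a minimal sketch of the plain, eager version: the video does not show the bodies of f, g, and h, so the ones below are placeholders, and the exact assignment labels are illustrative.

```python
# Placeholder bodies for f, g, and h (not shown in the video)
def f(z):
    return z + 1

def g(z):
    return z * 2

def h(z):
    return z ** 2

x = 4          # input value
y = h(x)       # first intermediate result
z = g(y)       # second intermediate result
w = f(z)       # final result, equivalent to w = f(g(h(4)))
print(w)
```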
We repeat this computation using delayed from the dask library. This is a higher-order function, or decorator, that maps an input function to a modified output function. The value w, then, is delayed(f) called on delayed(g) called on delayed(h) of 4, that is, delayed(f)(delayed(g)(delayed(h)(4))). If we examine w, it is a dask Delayed object rather than a numerical value. The delayed decorator defers computation until the method compute() is invoked.
The Dask Delayed object has another method, visualize(), that displays a task graph in some IPython shells. This linear graph shows the execution sequence and the flow of data for this computation.
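A hedged sketch of the deferred version, reusing the same placeholder bodies; compute() and visualize() are the Delayed methods described above, and visualize() relies on the optional graphviz dependency.

```python
from dask import delayed

# Same placeholder bodies as before
def f(z):
    return z + 1

def g(z):
    return z * 2

def h(z):
    return z ** 2

# Wrap each call in delayed: nothing is evaluated yet
w = delayed(f)(delayed(g)(delayed(h)(4)))
print(w)               # a dask Delayed object, not a number

result = w.compute()   # evaluation happens only now
print(result)

# Render the linear task graph; inline in some IPython/Jupyter shells,
# or written to a file. Requires the optional graphviz package.
w.visualize(filename='task_graph.png')
```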
Let's repeat our computation, this time reassigning the identifiers f, g, and h. The result is the same, but the functions f, g, and h are now permanently decorated by delayed. This means they always return Delayed objects that defer computation until the compute() method is called.
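A sketch of that rebinding, again with placeholder bodies:

```python
from dask import delayed

def f(z):
    return z + 1

def g(z):
    return z * 2

def h(z):
    return z ** 2

# Rebind the names so every later call is automatically deferred
f = delayed(f)
g = delayed(g)
h = delayed(h)

w = f(g(h(4)))       # a Delayed object; nothing has run yet
print(w.compute())   # now the whole chain is evaluated
```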
To recap, we can define a function like f and rebind the label f to the new function obtained by applying the decorator delayed to the original f. The @ symbol is an equivalent shorthand notation for decorating functions in this manner. Here, the @ symbol means "apply the decorator function delayed to the function defined below and bind that decorated function to the name f".
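The decorator form, equivalent to the rebinding above (placeholder body again):

```python
from dask import delayed

@delayed              # same as writing f = delayed(f) after the def
def f(z):
    return z + 1      # placeholder body

print(f(10))            # a Delayed object
print(f(10).compute())  # 11
```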
As another example, let's use the delayed decorator with some new functions: increment, double, and add. This calculation involves repeated function evaluations within a loop, and the dependencies are a little trickier: c depends on both a and b within each iteration, and its computed value is appended to the list output.
The final result `total` is a `Delayed` object and `output` is a list of intermediate `Delayed` objects. The `visualize` method displays the sequence of evaluations needed to compute `total`.
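A sketch of that loop. The input data and the use of sum() to produce total are assumptions for illustration; only the shape of the loop follows the description above.

```python
from dask import delayed

@delayed
def increment(x):
    return x + 1

@delayed
def double(x):
    return 2 * x

@delayed
def add(x, y):
    return x + y

data = [1, 2, 3, 4, 5]   # assumed input values
output = []
for x in data:
    a = increment(x)
    b = double(x)
    c = add(a, b)        # c depends on both a and b
    output.append(c)     # a list of intermediate Delayed objects

total = delayed(sum)(output)   # a single Delayed object
print(total.compute())
total.visualize()              # branching task graph (needs graphviz)
```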
This is where Dask helps significantly. Dask uses a variety of heuristic schedulers for complicated execution sequences like this. The scheduler automatically assigns tasks in parallel to extra threads or processes. In particular, Dask users do not have to decompose computations themselves.
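Continuing from the sketch above, compute() also accepts a scheduler keyword if you want to override Dask's default choice; the video does not show this explicitly.

```python
# Standard Dask scheduler names, passed through compute()
result = total.compute(scheduler='threads')        # thread pool
# result = total.compute(scheduler='processes')    # process pool
# result = total.compute(scheduler='synchronous')  # single-threaded, handy for debugging
print(result)
```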
To bring this together, let's repeat the Yellow Cab ride data analysis using Dask instead of generators. As before, we start with a list of 12 filenames for the CSV files. We define the function count_long_trips as before, this time adding the @delayed decorator. We also define a @delayed function read_file.
We construct a pipeline, starting with a list comprehension of Delayed objects called totals. We accumulate the sum of totals in annual_totals and then invoke the compute method of the Delayed object annual_totals. The computation takes about 10 seconds and yields a Pandas DataFrame with a single row and two columns. Notice this is all done with a straightforward use of Dask delayed functions rather than generators. Finally, the result yields the fraction of trips over 20 minutes, as before.
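A hedged sketch of the whole pipeline. The filename pattern, the 'duration' column, and the 20-minute threshold logic are assumptions for illustration; only the overall structure follows the description above.

```python
import pandas as pd
from dask import delayed

# Assumed filename pattern for the 12 monthly CSV files
filenames = ['yellow_tripdata_2015-{:02d}.csv'.format(m) for m in range(1, 13)]

@delayed
def read_file(fname):
    return pd.read_csv(fname)

@delayed
def count_long_trips(df):
    # Assumes a 'duration' column in minutes; counts trips over 20 minutes
    is_long = df['duration'] > 20
    return pd.DataFrame({'n_long': [is_long.sum()], 'n_total': [len(df)]})

# Pipeline: a list comprehension of Delayed objects, then their sum
totals = [count_long_trips(read_file(fname)) for fname in filenames]
annual_totals = sum(totals)          # still a Delayed object

annual_df = annual_totals.compute()  # DataFrame with one row, two columns
print(annual_df['n_long'] / annual_df['n_total'])  # fraction of long trips
```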
There's a lot to absorb here, so take some time to get used to the delayed decorator from Dask in these exercises.
#PythonTutorial #Python #DataCamp #BigData #parallelprogramming #dask #Delaying #Computation