A Dask DataFrame is a large parallel DataFrame composed of many smaller pandas DataFrames, split along the index. These pandas DataFrames may live on disk for larger-than-memory computing on a single machine, or on many different machines in a cluster.
At its core, the dask.dataframe module implements a “blocked parallel” DataFrame object that looks and feels like the pandas API, but for parallel and distributed workflows.
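As a minimal sketch of that structure (the column name and partition count are illustrative), the real dd.from_pandas constructor makes the pandas-DataFrames-under-the-hood design visible:

```python
import pandas as pd
import dask.dataframe as dd

# Build a Dask DataFrame from an in-memory pandas DataFrame
pdf = pd.DataFrame({"x": range(8)})
df = dd.from_pandas(pdf, npartitions=4)

print(df.npartitions)  # 4 -- four smaller pandas DataFrames under the hood
print(type(df.get_partition(0).compute()))  # pandas.core.frame.DataFrame
```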
Most common pandas operations can be used in the same way on Dask DataFrames. This example shows how to slice the data with a boolean mask condition and then compute the standard deviation of the x column.
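A sketch of that example, assuming a numeric column x and a second column y to filter on:

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"x": range(1000), "y": range(1000)})
df = dd.from_pandas(pdf, npartitions=4)

# Boolean-mask filter, then the standard deviation of column x.
# Operations are lazy; .compute() runs them in parallel.
result = df[df.y > 500].x.std()
print(result.compute())
```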
class dask.dataframe.DataFrame(expr): a DataFrame-like Expr collection. The constructor takes the expression that represents the query as input, but the class is not meant to be instantiated directly; instead, use one of the IO connectors from Dask.
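A hedged illustration of that point: an IO connector such as dd.read_parquet (the path here is hypothetical) builds the expression and returns the collection for you:

```python
import dask.dataframe as dd

# Rather than calling dask.dataframe.DataFrame(expr) directly, let an IO
# connector build the expression and return the collection.
df = dd.read_parquet("events/")  # hypothetical dataset path
print(df.expr)  # the query expression wrapped by the collection
```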
Similar to pandas, Dask provides dtype-specific methods under various accessors. These are separate namespaces within Series that only apply to specific data types.
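For example, the datetime and string accessors (.dt and .str), shown here on illustrative columns:

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({
    "when": pd.date_range("2024-01-01", periods=6, freq="D"),
    "name": ["alice", "bob", "carol", "dan", "eve", "fay"],
})
df = dd.from_pandas(pdf, npartitions=2)

# .dt is available only on datetime-typed Series
years = df.when.dt.year
# .str is available only on string/object-typed Series
upper = df.name.str.upper()

print(years.compute())
print(upper.compute())
```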
Dask DataFrames can read and store data in many of the same formats as pandas DataFrames. In this example we read and write data in the popular CSV and Parquet formats and discuss best practices for using them.
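A sketch with hypothetical paths (the glob pattern and column name are illustrative):

```python
import dask.dataframe as dd

# Read many CSV files into a single collection
df = dd.read_csv("data/2024-*.csv")

# Parquet is generally preferable: columnar, compressed, and it lets you
# load only the columns you need.
df.to_parquet("data/parquet/")
df2 = dd.read_parquet("data/parquet/", columns=["x"])
```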
dask.dataframe.from_pandas splits an in-memory pandas DataFrame into several parts and constructs a Dask DataFrame from those parts, which Dask can then operate on in parallel. By default, the input DataFrame is sorted by its index to produce cleanly divided partitions (with known divisions).
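A small sketch; the exact division values depend on how Dask chunks the index, so the tuple in the comment is illustrative:

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"x": range(10)})
df = dd.from_pandas(pdf, npartitions=3)

# With the default sort=True the index is sorted, so divisions are known.
print(df.divisions)  # e.g. (0, 4, 8, 9): each partition's first index, plus the last
```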
In addition to pandas-style indexing, Dask DataFrame also supports indexing at a partition level with DataFrame.get_partition() and DataFrame.partitions. These can be used to select subsets of the data by partition, rather than by position in the entire DataFrame or index label.
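For instance (partition count and data are illustrative):

```python
import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(pd.DataFrame({"x": range(100)}), npartitions=10)

# A single partition, still lazy, as a one-partition Dask DataFrame
first = df.get_partition(0)

# Partition-level slicing with the .partitions accessor
evens = df.partitions[::2]
print(evens.npartitions)  # 5
```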
Dask use is widespread, across industries and scales. Dask is used wherever Python is used and people run into the limits of large-scale data or compute-intensive workloads.