Code Explanation:
Importing Required Libraries
import dask.dataframe as dd
import pandas as pd
Explanation:
pandas: A library used for creating and manipulating tabular data (DataFrames).
dask.dataframe: Provides a pandas-like interface, but it can handle very large datasets that don’t fit in memory by splitting the data into smaller chunks (partitions) and processing them in parallel.
Think of Dask as “Pandas for big data, with parallel power.”
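For context, here is a minimal sketch of how Dask is typically used on data that is genuinely too large for pandas. The file path and column name below are made up for illustration:

import dask.dataframe as dd

# Read many CSV files lazily; each chunk becomes one partition
ddf = dd.read_csv("data/events-*.csv")  # hypothetical path

# Build a lazy computation that spans all partitions
total = ddf["amount"].sum()  # hypothetical column name

# Trigger the parallel computation and get back an ordinary number
print(total.compute())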
Creating a Pandas DataFrame
df = pd.DataFrame({'x': [1, 2, 3, 4, 5]})
Explanation:
This line creates a small pandas DataFrame named df.
It has one column (x) and five rows (values 1 to 5).
Example of what df looks like:
index   x
0       1
1       2
2       3
3       4
4       5
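A quick way to confirm this, assuming the df created above:

print(df)        # prints the five rows shown in the table
print(df.shape)  # (5, 1): five rows, one column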
Converting the Pandas DataFrame to a Dask DataFrame
ddf = dd.from_pandas(df, npartitions=2)
Explanation:
dd.from_pandas() converts a pandas DataFrame into a Dask DataFrame.
npartitions=2 tells Dask to split the data into 2 partitions (chunks).
Example of the split:
Partition 1 → rows [1, 2, 3]
Partition 2 → rows [4, 5]
Why?
In real-world big data, splitting allows Dask to process each partition on a different CPU core or even a different machine, which can give a massive speed-up for large datasets.
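If you want to see the split for yourself, the sketch below inspects the ddf created above (the exact row boundaries are chosen by Dask, but for this small example they match the split shown):

# How many partitions Dask created
print(ddf.npartitions)  # 2

# Materialize each partition as a small pandas DataFrame
print(ddf.partitions[0].compute())  # rows with x = 1, 2, 3
print(ddf.partitions[1].compute())  # rows with x = 4, 5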
Calculating the Mean Using Dask
print(ddf.x.mean().compute())
Explanation:
Let’s break this down step by step:
ddf.x → Selects the column x from the Dask DataFrame.
.mean() → Creates a lazy Dask computation to find the mean of column x.
Lazy means Dask doesn’t compute immediately — it builds a task graph (a plan for what to calculate).
.compute() → Executes that computation.
Dask computes a partial result for each partition in parallel (for a mean, the sum and the row count), then combines them to produce the final result.
Output
3.0
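That 3.0 is the mean of the values 1 to 5. To make the laziness visible, here is a small sketch using the same ddf as above; the variable name lazy_mean is only for illustration:

lazy_mean = ddf.x.mean()
print(lazy_mean)            # a lazy Dask scalar (the plan), not a number yet
print(lazy_mean.compute())  # 3.0, the actual result

# The same answer, combined manually from the partial results Dask tracks:
print(ddf.x.sum().compute() / ddf.x.count().compute())  # 15 / 5 = 3.0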

