Introduction
Data wrangling is a critical step in the data science workflow: it transforms raw, unstructured data into a clean, organized format ready for analysis and modeling. In this tutorial, we'll explore how to use Pandas, a powerful Python library, to import data, clean it, and perform common data manipulation tasks. These techniques help you prepare your datasets for further analysis and machine learning.
Prerequisites
Importing Required Packages
import pandas as pd
import numpy as np
Optional: Creating Demo Data
For this tutorial, you can generate a synthetic dataset to follow along. If you already have a dataset, feel free to skip this section.
# Set the seed for reproducibility
np.random.seed(42)

# Create synthetic data for demo purposes
data = {
    "id": np.arange(1, 101),
    "name": [f"Item {i}" for i in range(1, 101)],
    "price": np.random.uniform(10, 100, 100).round(2),
    "category": np.random.choice(["A", "B", "C"], 100),
    "date": pd.date_range(start="2024-01-01", periods=100, freq="D")
}
df = pd.DataFrame(data)

# Save the demo data so the examples below can read it back
df.to_csv("demo_data.csv", index=False)
Data Import
Pandas makes it simple to read data from various file formats. One of the most common operations is reading data from a CSV file.
Example: Reading a CSV File
# Read data from the demo CSV file
df = pd.read_csv("demo_data.csv")

# Display the first few rows of the DataFrame
print(df.head())
   id    name  price category        date
0   1  Item 1  43.71        C  2024-01-01
1   2  Item 2  95.56        C  2024-01-02
2   3  Item 3  75.88        A  2024-01-03
3   4  Item 4  63.88        A  2024-01-04
4   5  Item 5  24.04        B  2024-01-05
This code loads the data into a DataFrame, the two-dimensional labeled data structure that forms the backbone of Pandas operations.
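One detail worth checking after an import is the column types: by default, read_csv leaves a date column as plain text. The sketch below illustrates this with a few inline rows in the same layout as the demo file (read through io.StringIO so it runs without any file on disk) and uses the parse_dates parameter to request datetime values. The variable names df_raw and df_typed are only illustrative.

```python
import io

import pandas as pd

# A few rows in the same column layout as demo_data.csv.
csv_text = """id,name,price,category,date
1,Item 1,43.71,C,2024-01-01
2,Item 2,95.56,C,2024-01-02
3,Item 3,75.88,A,2024-01-03
"""

# Without parse_dates, the 'date' column is read as text.
df_raw = pd.read_csv(io.StringIO(csv_text))

# parse_dates asks read_csv to convert the listed columns to datetime values.
df_typed = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

print(df_raw.dtypes)
print(df_typed.dtypes)
```

With a datetime column in place, date-based operations such as df_typed["date"].dt.month become available.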
Data Cleaning
Once the data is imported, it often needs to be cleaned to handle missing values, correct data types, and remove duplicates. Pandas offers a variety of functions to address these issues.
Example: Cleaning a DataFrame
# Load the data
df = pd.read_csv("demo_data.csv")

# Drop rows with missing values (.copy() keeps df_clean independent of df)
df_clean = df.dropna().copy()

# Convert the 'price' column to numeric (if needed)
df_clean['price'] = pd.to_numeric(df_clean['price'], errors='coerce')

# Remove duplicate rows
df_clean = df_clean.drop_duplicates()

# Display the cleaned data
print(df_clean.head())
In this example, we remove rows with missing data, convert the ‘price’ column to a numeric type, and eliminate duplicate rows.
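The synthetic demo data contains no missing values or duplicates, so the steps above have little visible effect on it. As a rough sketch, the small hand-made DataFrame below (hypothetical values, named messy for illustration) shows how you might inspect the problems first and then clean them. Note one ordering choice that differs from the example above: because errors='coerce' turns unparseable strings into NaN, this sketch converts to numeric before dropping missing rows.

```python
import pandas as pd

# Hypothetical messy data: one missing price, one non-numeric price, one duplicate row.
messy = pd.DataFrame({
    "id": [1, 2, 3, 3, 4],
    "price": ["10.5", None, "20.0", "20.0", "n/a"],
    "category": ["A", "B", "C", "C", "A"],
})

# Count missing values per column and fully duplicated rows before changing anything.
missing_per_column = messy.isna().sum()
n_duplicates = messy.duplicated().sum()
print(missing_per_column)
print("duplicate rows:", n_duplicates)

# Convert first: "n/a" becomes NaN, which dropna() then removes along with the None.
cleaned = messy.copy()
cleaned["price"] = pd.to_numeric(cleaned["price"], errors="coerce")
cleaned = cleaned.dropna().drop_duplicates()

print(cleaned)
```

Whether to drop incomplete rows or fill them (for example with fillna) depends on the dataset; dropping is simply the approach this tutorial uses.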
Data Manipulation
After cleaning the data, you can manipulate it to extract insights. Common tasks include filtering, grouping, and aggregating data.
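Filtering is usually done with a boolean mask. The sketch below uses a small stand-in DataFrame with illustrative values (not the full demo file) and selects the rows in category C that cost more than 50; the threshold and variable names are assumptions made for the example.

```python
import pandas as pd

# Small stand-in for the demo data (values are illustrative).
df = pd.DataFrame({
    "name": ["Item 1", "Item 2", "Item 3", "Item 4"],
    "price": [43.71, 95.56, 75.88, 24.04],
    "category": ["C", "C", "A", "B"],
})

# Boolean mask: wrap each condition in parentheses and combine with & (and) or | (or).
expensive_c = df[(df["category"] == "C") & (df["price"] > 50)]

# query() expresses the same filter as a string.
same_result = df.query("category == 'C' and price > 50")

print(expensive_c)
```

Both forms return a new DataFrame and leave df unchanged; which one reads better is largely a matter of taste.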
Example: Grouping and Aggregating Data
# Load and clean the data
df = pd.read_csv("demo_data.csv").dropna().drop_duplicates()

# Group data by the 'category' column and calculate the mean price for each group
grouped = df.groupby("category")["price"].mean()

print("Average price by category:")
print(grouped)
Average price by category:
category
A    54.332222
B    50.723548
C    51.612727
Name: price, dtype: float64
This example groups the data by category and computes the average price for each group, demonstrating how Pandas can be used to summarize and analyze data.
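When a single statistic is not enough, the same groupby can return several at once through agg. The following is a minimal sketch on a tiny hand-made DataFrame (illustrative values); the output column names mean_price, max_price, and n_items are chosen for this example only.

```python
import pandas as pd

# Tiny illustrative dataset with two categories.
df = pd.DataFrame({
    "category": ["A", "A", "B", "B", "B"],
    "price": [10.0, 20.0, 5.0, 15.0, 25.0],
})

# Named aggregation: each keyword becomes an output column,
# given as (input column, aggregation function).
summary = df.groupby("category").agg(
    mean_price=("price", "mean"),
    max_price=("price", "max"),
    n_items=("price", "count"),
)

print(summary)
```

The result is a DataFrame indexed by category, which can be sorted or merged like any other.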
Conclusion
Data wrangling with Pandas is essential for transforming raw data into a structured format that drives analysis and decision-making. By mastering techniques for data import, cleaning, and manipulation, you can streamline your data science workflow and focus on extracting meaningful insights. Experiment with these examples and adapt them to your own datasets to fully harness the power of Pandas.
Further Reading
- Data Visualization with Matplotlib
- Data Visualization with Seaborn
- Machine Learning with Scikit‑Learn
Happy coding, and enjoy transforming your data with Pandas!
Citation
@online{kassambara2024,
author = {Kassambara, Alboukadel},
title = {Data {Wrangling} with {Pandas}},
date = {2024-02-07},
url = {https://www.datanovia.com/learn/programming/python/data-science/data-wrangling-with-pandas.html},
langid = {en}
}