Data Wrangling with Pandas

Introduction

Data wrangling is a critical step in the data science workflow: it transforms raw, messy data into a clean, organized format ready for analysis and modeling. In this tutorial, we’ll use Pandas, a widely used Python library for tabular data, to import data, clean it, and perform common manipulation tasks such as grouping and aggregating. These are the same preparation steps you would typically run before further analysis or machine learning.

Prerequisites

Importing Required Packages

import pandas as pd
import numpy as np

For this tutorial, you can generate a synthetic dataset to follow along. If you already have a dataset, feel free to skip this section.

# Set the seed for reproducibility
np.random.seed(42)

# Create synthetic data for demo purposes
data = {
    "id": np.arange(1, 101),
    "name": [f"Item {i}" for i in range(1, 101)],
    "price": np.random.uniform(10, 100, 100).round(2),
    "category": np.random.choice(["A", "B", "C"], 100),
    "date": pd.date_range(start="2024-01-01", periods=100, freq="D")
}

df = pd.DataFrame(data)
df.to_csv("demo_data.csv", index=False)

Data Import

Pandas makes it simple to read data from various file formats. One of the most common operations is reading data from a CSV file.

Example: Reading a CSV File

# Read data from the demo CSV file
df = pd.read_csv("demo_data.csv")

# Display the first few rows of the DataFrame
print(df.head())

   id    name  price category        date
0   1  Item 1  43.71        C  2024-01-01
1   2  Item 2  95.56        C  2024-01-02
2   3  Item 3  75.88        A  2024-01-03
3   4  Item 4  63.88        A  2024-01-04
4   5  Item 5  24.04        B  2024-01-05

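This code loads the data into a DataFrame, a two-dimensional labeled table of rows and columns that is the core data structure in Pandas. Note that read_csv reads the date column as plain strings unless you ask it to parse dates.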
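If you want typed columns straight from the import step, read_csv accepts the parse_dates and dtype arguments. The sketch below is only illustrative: it uses a small in-memory CSV (via io.StringIO) as a stand-in for demo_data.csv, and the name df_typed and the choice to store category as a categorical column are assumptions, not part of the original example.

```python
import io

import pandas as pd

# A few sample rows standing in for demo_data.csv (same columns as the demo file)
csv_text = """id,name,price,category,date
1,Item 1,43.71,C,2024-01-01
2,Item 2,95.56,C,2024-01-02
3,Item 3,75.88,A,2024-01-03
"""

# parse_dates converts the listed columns to datetime64;
# dtype lets you choose a column type up front (here: categorical)
df_typed = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["date"],
    dtype={"category": "category"},
)

# Inspect the resulting column types
print(df_typed.dtypes)
```

With a real file you would pass the file path instead of the StringIO object; the other arguments stay the same.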

Data Cleaning

Once the data is imported, it often needs to be cleaned to handle missing values, correct data types, and remove duplicates. Pandas offers a variety of functions to address these issues.

Example: Cleaning a DataFrame

# Load the data
df = pd.read_csv("demo_data.csv")

# Convert the 'price' column to numeric; unparseable values become NaN
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Drop rows with missing values (including any NaN introduced by the conversion)
df_clean = df.dropna()

# Remove duplicate rows
df_clean = df_clean.drop_duplicates()

# Display the cleaned data
print(df_clean.head())

In this example, we convert the ‘price’ column to a numeric type, remove rows with missing data, and eliminate duplicate rows. The conversion runs first because errors='coerce' turns unparseable values into NaN, so calling dropna() afterwards removes those rows as well.
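Before dropping anything, it usually helps to see how much data is missing or duplicated, and dropping is not the only option. The following is a minimal sketch on a small made-up DataFrame (the rows and the names raw, filled, missing_per_column, and duplicate_count are hypothetical). It counts missing values and duplicates, then fills gaps instead of removing rows; median imputation and the "Unknown" label are illustrative choices, not a recommendation for every dataset.

```python
import pandas as pd

# Small made-up dataset with a bad price, missing values, and one duplicate row
raw = pd.DataFrame({
    "id": [1, 2, 2, 3, 4],
    "price": ["10.5", "20.0", "20.0", "n/a", None],
    "category": ["A", "B", "B", None, "C"],
})

# Coerce prices to numbers; "n/a" and None both become NaN
raw["price"] = pd.to_numeric(raw["price"], errors="coerce")

# Count missing values per column and fully duplicated rows
missing_per_column = raw.isna().sum()
duplicate_count = raw.duplicated().sum()
print(missing_per_column)
print("Duplicate rows:", duplicate_count)

# Alternative to dropna(): fill the gaps instead of discarding rows
filled = raw.copy()
filled["price"] = filled["price"].fillna(filled["price"].median())
filled["category"] = filled["category"].fillna("Unknown")
print(filled)
```

Whether to drop or fill depends on how many rows are affected and how the column is used downstream.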

Data Manipulation

After cleaning the data, you can manipulate it to extract insights. Common tasks include filtering, grouping, and aggregating data.
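
Example: Filtering Rows

Filtering is typically done with boolean masks. The sketch below uses a small hand-written DataFrame that mirrors the first five demo rows; the variable names (items, expensive_a, selected) and the price threshold are assumptions chosen for illustration.

```python
import pandas as pd

# Five rows mirroring the head of the demo dataset
items = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "price": [43.71, 95.56, 75.88, 63.88, 24.04],
    "category": ["C", "C", "A", "A", "B"],
})

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
expensive_a = items[(items["category"] == "A") & (items["price"] > 70)]

# isin() keeps rows whose value matches any entry in a list;
# sort_values() then orders the result by price, highest first
selected = items[items["category"].isin(["B", "C"])].sort_values(
    "price", ascending=False
)

print(expensive_a)
print(selected)
```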

Example: Grouping and Aggregating Data

# Load and clean the data
df = pd.read_csv("demo_data.csv").dropna().drop_duplicates()

# Group data by the 'category' column and calculate the mean price for each group
grouped = df.groupby("category")["price"].mean()

print("Average price by category:")
print(grouped)

Average price by category:
category
A    54.332222
B    50.723548
C    51.612727
Name: price, dtype: float64

This example groups the data by category and computes the average price for each group, demonstrating how Pandas can be used to summarize and analyze data.
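
A single groupby call can also compute several statistics at once. The following sketch uses named aggregation on a small hand-written DataFrame; the output column names (mean_price, max_price, n_items) are hypothetical labels chosen here, not part of the original example.

```python
import pandas as pd

# Five rows mirroring the head of the demo dataset
items = pd.DataFrame({
    "price": [43.71, 95.56, 75.88, 63.88, 24.04],
    "category": ["C", "C", "A", "A", "B"],
})

# Named aggregation: each keyword becomes an output column,
# and its value names the aggregation function to apply
summary = items.groupby("category")["price"].agg(
    mean_price="mean",
    max_price="max",
    n_items="count",
)

print(summary)
```

The result is a DataFrame indexed by category, which is convenient to sort, plot, or merge back onto the original data.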

Conclusion

Data wrangling with Pandas turns raw data into a structured format that supports analysis and decision-making. With the techniques covered here for importing, cleaning, and manipulating data, you can streamline your data science workflow and spend more time on extracting insights. Experiment with these examples and adapt them to your own datasets.
