Data Wrangling with Pandas

Data Import, Cleaning, and Manipulation for Data Science

Learn how to efficiently import, clean, and manipulate data using Pandas in Python. This tutorial demonstrates practical techniques for data wrangling within a data science workflow.

Published

February 7, 2024

Modified

February 8, 2025

Keywords

Pandas tutorial, data wrangling in Python, Python data cleaning, data manipulation with Pandas, data science Pandas

Introduction

Data wrangling is a critical step in the data science workflow: it transforms raw, messy data into a clean, organized format ready for analysis and modeling. In this tutorial, we’ll use Pandas, a widely used Python library, to import data, clean it, and perform common data manipulation tasks. These techniques help you prepare your datasets for further analysis and machine learning.



Prerequisites

Importing Required Packages

import pandas as pd
import numpy as np

Optional: Creating Demo Data

For this tutorial, you can generate a synthetic dataset to follow along. If you already have a dataset, feel free to skip this section.

# Set the seed for reproducibility
np.random.seed(42)

# Create synthetic data for demo purposes
data = {
    "id": np.arange(1, 101),
    "name": [f"Item {i}" for i in range(1, 101)],
    "price": np.random.uniform(10, 100, 100).round(2),
    "category": np.random.choice(["A", "B", "C"], 100),
    "date": pd.date_range(start="2024-01-01", periods=100, freq="D")
}

df = pd.DataFrame(data)
df.to_csv("demo_data.csv", index=False)

Data Import

Pandas makes it simple to read data from various file formats. One of the most common operations is reading data from a CSV file.
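CSV is only one of the formats Pandas can read. As a minimal sketch, the snippet below round-trips a small table through JSON using to_json and read_json. The frame df_small and its values are illustrative assumptions, not part of the demo dataset, and the JSON text is kept in memory rather than written to a file.

```python
import io

import pandas as pd

# Hypothetical three-column table used only for this illustration
df_small = pd.DataFrame({
    "id": [1, 2],
    "name": ["Item 1", "Item 2"],
    "price": [43.71, 95.56],
})

# Serialize to a JSON string with one object per row
json_text = df_small.to_json(orient="records")

# Read the JSON back into a DataFrame; StringIO makes the string file-like
df_from_json = pd.read_json(io.StringIO(json_text), orient="records")

print(df_from_json)
```

The same pattern applies to a file on disk: pass a path instead of the StringIO object.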

Example: Reading a CSV File

# Read data from the demo CSV file
df = pd.read_csv("demo_data.csv")

# Display the first few rows of the DataFrame
print(df.head())
   id    name  price category        date
0   1  Item 1  43.71        C  2024-01-01
1   2  Item 2  95.56        C  2024-01-02
2   3  Item 3  75.88        A  2024-01-03
3   4  Item 4  63.88        A  2024-01-04
4   5  Item 5  24.04        B  2024-01-05

This code loads the data into a DataFrame, the two-dimensional, labeled data structure that forms the backbone of Pandas operations.
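One detail worth checking after an import is the column types. By default, read_csv leaves date columns as text. The sketch below, which uses a tiny in-memory stand-in for the demo file (an assumption made so the snippet runs on its own), shows how the parse_dates and dtype arguments can set types at read time.

```python
import io

import pandas as pd

# A tiny in-memory stand-in for demo_data.csv (rows are illustrative only)
csv_text = """id,name,price,category,date
1,Item 1,43.71,C,2024-01-01
2,Item 2,95.56,C,2024-01-02
3,Item 3,75.88,A,2024-01-03
"""

# Default import: the 'date' column is read as text
raw = pd.read_csv(io.StringIO(csv_text))

# parse_dates converts 'date' to datetime; dtype stores 'category' as categorical
typed = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["date"],
    dtype={"category": "category"},
)

print(raw.dtypes)
print(typed.dtypes)
```

With a real file, replace the StringIO object with the file path, for example "demo_data.csv".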

Data Cleaning

Once the data is imported, it often needs to be cleaned to handle missing values, correct data types, and remove duplicates. Pandas offers a variety of functions to address these issues.

Example: Cleaning a DataFrame

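# Load the data
df = pd.read_csv("demo_data.csv")

# Convert columns to their intended types first; values that cannot be
# parsed become NaN/NaT (errors='coerce') and are dropped in the next step
df['price'] = pd.to_numeric(df['price'], errors='coerce')
df['date'] = pd.to_datetime(df['date'], errors='coerce')

# Drop rows with missing values, then remove duplicate rows
df_clean = df.dropna().drop_duplicates()

# Display the cleaned data
print(df_clean.head())

In this example, we convert the ‘price’ column to a numeric type and the ‘date’ column to datetime, remove rows with missing data, and eliminate duplicate rows. The order matters: with errors='coerce', unparseable values become missing values, so the conversion should run before dropna to keep them out of the cleaned data.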
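Dropping rows is not the only way to handle missing values. As a hedged sketch of an alternative, the snippet below counts the gaps per column and then fills them instead of discarding rows. The frame df_gaps, the median strategy, and the "Unknown" placeholder are all assumptions chosen for illustration; the right strategy depends on your data.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame with gaps (not the tutorial's demo data)
df_gaps = pd.DataFrame({
    "price": [10.0, np.nan, 30.0, np.nan],
    "category": ["A", "B", None, "A"],
})

# Count missing values per column before choosing a strategy
missing_counts = df_gaps.isna().sum()
print(missing_counts)

# Fill numeric gaps with the column median and text gaps with a placeholder
df_filled = df_gaps.fillna({
    "price": df_gaps["price"].median(),
    "category": "Unknown",
})
print(df_filled)
```

Filling keeps all four rows, whereas dropna would keep only the rows with no gaps at all.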

Data Manipulation

After cleaning the data, you can manipulate it to extract insights. Common tasks include filtering, grouping, and aggregating data.
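Filtering is usually the first of these tasks. A minimal sketch, assuming a small hypothetical frame df_small with the same column names as the demo data: rows can be selected with a boolean mask or, equivalently, with the query method.

```python
import pandas as pd

# Hypothetical sample with the same columns as the demo data
df_small = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "price": [43.71, 95.56, 75.88, 24.04],
    "category": ["C", "C", "A", "B"],
})

# Boolean mask: rows in category C priced above 50
mask = (df_small["category"] == "C") & (df_small["price"] > 50)
expensive_c = df_small[mask]

# The same filter expressed as a query string
expensive_c_query = df_small.query("category == 'C' and price > 50")

print(expensive_c)
```

Note the parentheses around each condition in the mask: the & operator binds more tightly than the comparisons, so omitting them raises an error.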

Example: Grouping and Aggregating Data

# Load and clean the data
df = pd.read_csv("demo_data.csv").dropna().drop_duplicates()

# Group data by the 'category' column and calculate the mean price for each group
grouped = df.groupby("category")["price"].mean()

print("Average price by category:")
print(grouped)
Average price by category:
category
A    54.332222
B    50.723548
C    51.612727
Name: price, dtype: float64

This example groups the data by category and computes the average price for each group, demonstrating how Pandas can be used to summarize and analyze data.
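When one statistic per group is not enough, several can be computed in a single pass. The sketch below uses named aggregation with agg on a small hypothetical frame; the output column names mean_price, max_price, and n_items are arbitrary labels chosen for this illustration.

```python
import pandas as pd

# Hypothetical sample (values chosen so the results are easy to verify)
df_small = pd.DataFrame({
    "category": ["A", "A", "B", "C", "C"],
    "price": [10.0, 20.0, 30.0, 40.0, 60.0],
})

# Named aggregation: output_column=(input_column, function)
summary = df_small.groupby("category").agg(
    mean_price=("price", "mean"),
    max_price=("price", "max"),
    n_items=("price", "count"),
)

print(summary)
```

The result is a DataFrame indexed by category with one column per requested statistic.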

Conclusion

Data wrangling with Pandas is essential for transforming raw data into a structured format that drives analysis and decision-making. By mastering techniques for data import, cleaning, and manipulation, you can streamline your data science workflow and focus on extracting meaningful insights. Experiment with these examples and adapt them to your own datasets to fully harness the power of Pandas.

Happy coding, and enjoy transforming your data with Pandas!

Citation

BibTeX citation:
@online{kassambara2024,
  author = {Kassambara, Alboukadel},
  title = {Data {Wrangling} with {Pandas}},
  date = {2024-02-07},
  url = {https://www.datanovia.com/learn/programming/python/data-science/data-wrangling-with-pandas.html},
  langid = {en}
}
For attribution, please cite this work as:
Kassambara, Alboukadel. 2024. “Data Wrangling with Pandas.” February 7, 2024. https://www.datanovia.com/learn/programming/python/data-science/data-wrangling-with-pandas.html.