Data Manipulation in Python vs. R: dplyr vs. pandas

Introduction

Effective data manipulation is essential in any data science workflow. Both R and Python provide a widely used library for the task: dplyr in R and pandas in Python. Although the syntax differs, the core operations (filtering, grouping, summarizing, and joining data) map closely onto one another. This tutorial gives side-by-side examples of these operations in dplyr and pandas, so you can see the similarities and differences as you move between the two ecosystems.

Filtering Data

Filtering is one of the most fundamental operations. The examples below keep only the rows where the value column is greater than 5.

R Example (dplyr)

library(dplyr)

# Create a sample data frame
data <- data.frame(
  id = 1:10,
  value = c(5, 3, 6, 2, 8, 7, 4, 9, 1, 10)
)

# Filter rows where value > 5
filtered_data <- data %>% filter(value > 5)
print(filtered_data)

Python Example (pandas)

import pandas as pd

# Create a sample DataFrame
data = pd.DataFrame({
    'id': list(range(1, 11)),
    'value': [5, 3, 6, 2, 8, 7, 4, 9, 1, 10]
})

# Filter rows where value > 5
filtered_data = data[data['value'] > 5]
print(filtered_data)

   id  value
2   3      6
4   5      8
5   6      7
7   8      9
9  10     10
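
The boolean mask above is only one way to write a filter in pandas. As a minimal sketch on the same sample data, the snippet below shows DataFrame.query, which takes the condition as a string and reads somewhat like dplyr's filter(), and a combined condition built with &. The variable names by_mask, by_query, and combined, and the extra "even id" condition, are illustrative and not part of the original example.

```python
import pandas as pd

# Same sample DataFrame as in the filtering example
data = pd.DataFrame({
    'id': list(range(1, 11)),
    'value': [5, 3, 6, 2, 8, 7, 4, 9, 1, 10]
})

# Boolean mask, as in the example above
by_mask = data[data['value'] > 5]

# query() takes the condition as a string expression
by_query = data.query('value > 5')

# Several conditions: wrap each one in parentheses and combine with &
# (the even-id condition is a made-up illustration)
combined = data[(data['value'] > 5) & (data['id'] % 2 == 0)]

print(by_query)
print(combined)
```

Both by_mask and by_query select the same rows; which form to use is largely a matter of readability.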

Grouping and Summarizing Data

Grouping data and computing summary statistics is crucial for understanding data distributions. Below, we group by a categorical variable and calculate the average of a numeric variable.

R Example (dplyr)

library(dplyr)

# Create sample data with a grouping variable
data <- data.frame(
  group = rep(c("A", "B"), each = 5),
  value = c(5, 3, 6, 2, 8, 7, 4, 9, 1, 10)
)

# Group by 'group' and summarize average value
summary_data <- data %>%
  group_by(group) %>%
  summarize(avg_value = mean(value))
print(summary_data)

# A tibble: 2 × 2
  group avg_value
  <chr>     <dbl>
1 A           4.8
2 B           6.2

Python Example (pandas)

import pandas as pd

# Create sample DataFrame with a grouping column
data = pd.DataFrame({
    'group': ['A']*5 + ['B']*5,
    'value': [5, 3, 6, 2, 8, 7, 4, 9, 1, 10]
})

# Group by 'group' and calculate the mean of 'value'
summary_data = data.groupby('group')['value'].mean().reset_index()
print(summary_data)

  group  value
0     A    4.8
1     B    6.2
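
Note that the pandas result keeps the column name value, whereas the dplyr version names it avg_value. One way to get a matching column name is pandas' named aggregation, sketched below on the same sample data. This is an alternative spelling rather than the approach used above.

```python
import pandas as pd

# Same sample DataFrame as in the grouping example
data = pd.DataFrame({
    'group': ['A'] * 5 + ['B'] * 5,
    'value': [5, 3, 6, 2, 8, 7, 4, 9, 1, 10]
})

# Named aggregation: output_column=(input_column, function)
# as_index=False keeps 'group' as a regular column, so no reset_index() is needed
summary_data = (
    data.groupby('group', as_index=False)
        .agg(avg_value=('value', 'mean'))
)
print(summary_data)
```

The same agg call accepts several name=(column, function) pairs, which roughly corresponds to passing several expressions to summarize().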

Joining Data

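Joining (merging) datasets is a common task when combining data from multiple sources. Below is an example of a left join, which keeps every row of the first table and fills in missing values (NA in R, NaN in pandas) where the second table has no matching id.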

R Example (dplyr)

library(dplyr)

# Create two sample data frames
df1 <- data.frame(
  id = 1:5,
  value1 = c("A", "B", "C", "D", "E")
)

df2 <- data.frame(
  id = c(3, 4, 5, 6),
  value2 = c("X", "Y", "Z", "W")
)

# Left join df1 with df2 on 'id'
joined_data <- left_join(df1, df2, by = "id")
print(joined_data)

  id value1 value2
1  1      A   <NA>
2  2      B   <NA>
3  3      C      X
4  4      D      Y
5  5      E      Z

Python Example (pandas)

import pandas as pd

# Create two sample DataFrames
df1 = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'value1': ["A", "B", "C", "D", "E"]
})

df2 = pd.DataFrame({
    'id': [3, 4, 5, 6],
    'value2': ["X", "Y", "Z", "W"]
})

# Perform a left merge on 'id'
joined_data = pd.merge(df1, df2, on="id", how="left")
print(joined_data)

   id value1 value2
0   1      A    NaN
1   2      B    NaN
2   3      C      X
3   4      D      Y
4   5      E      Z
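
When checking a left join, it can help to see which rows of the left table found no partner. As a small sketch on the same two DataFrames, the indicator=True option of pd.merge adds a _merge column that records where each row came from. The variable names checked and unmatched are illustrative.

```python
import pandas as pd

# Same sample DataFrames as in the join example
df1 = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'value1': ["A", "B", "C", "D", "E"]
})

df2 = pd.DataFrame({
    'id': [3, 4, 5, 6],
    'value2': ["X", "Y", "Z", "W"]
})

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'
checked = pd.merge(df1, df2, on="id", how="left", indicator=True)

# Rows of df1 that had no matching id in df2
unmatched = checked[checked['_merge'] == 'left_only']
print(unmatched[['id', 'value1']])
```

Here the unmatched rows are the ones with id 1 and 2, the same rows that show NaN in value2 above. The row with id 6 exists only in df2, so a left join drops it.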

Conclusion

This tutorial has provided side-by-side examples comparing data manipulation techniques in R using dplyr and in Python using pandas. Whether you’re filtering rows, grouping data for summary statistics, or joining datasets, both ecosystems offer powerful and similar tools to accomplish your data manipulation tasks. Understanding these parallels can greatly ease the transition between R and Python and help you choose the right tool for your data science projects.
