Data Manipulation in Python vs. R: dplyr vs. pandas

Comparing Data Wrangling Techniques in R and Python

This tutorial compares data manipulation techniques using R’s dplyr and Python’s pandas libraries. Through side-by-side examples, learn how to filter, group, summarize, and join data to streamline your data science workflow.

Programming

Published: February 13, 2024
Modified: March 11, 2025
Keywords: dplyr vs pandas, data manipulation in R and Python, pandas tutorial, dplyr tutorial, R vs Python data wrangling

Introduction

Data manipulation is a core step in any data science workflow. Both R and Python provide a mature library for the task: dplyr in R and pandas in Python. The syntax differs, but the core operations (filtering, grouping, summarizing, and joining data) map closely onto each other. This tutorial walks through side-by-side examples of these operations in dplyr and pandas, so you can see the similarities and differences as you move between the two ecosystems.



Filtering Data

Filtering, which keeps only the rows that satisfy a condition, is one of the most fundamental operations. The examples below keep the rows where the value column is greater than 5.

R (dplyr):

library(dplyr)

# Create a sample data frame
data <- data.frame(
  id = 1:10,
  value = c(5, 3, 6, 2, 8, 7, 4, 9, 1, 10)
)

# Filter rows where value > 5
filtered_data <- data %>% filter(value > 5)
print(filtered_data)

Output:

  id value
1  3     6
2  5     8
3  6     7
4  8     9
5 10    10

Python (pandas):

import pandas as pd

# Create a sample DataFrame
data = pd.DataFrame({
    'id': list(range(1, 11)),
    'value': [5, 3, 6, 2, 8, 7, 4, 9, 1, 10]
})

# Filter rows where value > 5
filtered_data = data[data['value'] > 5]
print(filtered_data)

Output:

   id  value
2   3      6
4   5      8
5   6      7
7   8      9
9  10     10
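
The two outputs above differ in their row labels: dplyr renumbers the result from 1, while pandas keeps the original index labels (2, 4, 5, 7, 9). The sketch below reuses the same sample data and shows one way to get a fresh index with reset_index(drop=True), along with two other ways of writing a filter: DataFrame.query and a combined boolean mask. The variable names and the extra condition on id are illustrative assumptions, not part of the original example.

```python
import pandas as pd

data = pd.DataFrame({
    'id': list(range(1, 11)),
    'value': [5, 3, 6, 2, 8, 7, 4, 9, 1, 10]
})

# Same filter as above, but with the index renumbered from 0
filtered_reset = data[data['value'] > 5].reset_index(drop=True)

# Equivalent filter written as a query string
filtered_query = data.query('value > 5')

# Combine conditions with & (and) or | (or); each condition needs parentheses
filtered_both = data[(data['value'] > 5) & (data['id'] <= 6)]

print(filtered_reset)
print(filtered_both)
```

In dplyr, the combined condition would be written as filter(value > 5, id <= 6), since comma-separated conditions are joined with a logical AND.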

Grouping and Summarizing Data

Grouping data and computing summary statistics per group is a standard way to understand how values are distributed across categories. Below, we group by a categorical variable (group) and calculate the mean of a numeric variable (value) within each group.

R (dplyr):

library(dplyr)

# Create sample data with a grouping variable
data <- data.frame(
  group = rep(c("A", "B"), each = 5),
  value = c(5, 3, 6, 2, 8, 7, 4, 9, 1, 10)
)

# Group by 'group' and summarize average value
summary_data <- data %>%
  group_by(group) %>%
  summarize(avg_value = mean(value))
print(summary_data)

Output:

# A tibble: 2 × 2
  group avg_value
  <chr>     <dbl>
1 A           4.8
2 B           6.2

Python (pandas):

import pandas as pd

# Create sample DataFrame with a grouping column
data = pd.DataFrame({
    'group': ['A']*5 + ['B']*5,
    'value': [5, 3, 6, 2, 8, 7, 4, 9, 1, 10]
})

# Group by 'group' and calculate the mean of 'value'
summary_data = data.groupby('group')['value'].mean().reset_index()
print(summary_data)

Output:

  group  value
0     A    4.8
1     B    6.2
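
In the pandas output above, the summary column keeps the name value, whereas the dplyr version names it avg_value. As a sketch, named aggregation in agg can produce the same column name and compute several statistics in one pass. The extra max_value and n columns are illustrative additions, not part of the original example.

```python
import pandas as pd

data = pd.DataFrame({
    'group': ['A']*5 + ['B']*5,
    'value': [5, 3, 6, 2, 8, 7, 4, 9, 1, 10]
})

# Named aggregation: new_column=(source_column, aggregation_function)
summary_data = (
    data
    .groupby('group')
    .agg(
        avg_value=('value', 'mean'),  # mean of value per group
        max_value=('value', 'max'),   # largest value per group
        n=('value', 'size'),          # number of rows per group
    )
    .reset_index()
)
print(summary_data)
```

The closest dplyr counterpart would be summarize(avg_value = mean(value), max_value = max(value), n = n()).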

Joining Data

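Joining (merging) datasets is a common task when combining data from multiple sources. Below is an example of a left join: every row of the left table is kept, and rows with no match in the right table get a missing value (NA in R, NaN in pandas) in the right table's columns.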

R (dplyr):

library(dplyr)

# Create two sample data frames
df1 <- data.frame(
  id = 1:5,
  value1 = c("A", "B", "C", "D", "E")
)

df2 <- data.frame(
  id = c(3, 4, 5, 6),
  value2 = c("X", "Y", "Z", "W")
)

# Left join df1 with df2 on 'id'
joined_data <- left_join(df1, df2, by = "id")
print(joined_data)

Output:

  id value1 value2
1  1      A   <NA>
2  2      B   <NA>
3  3      C      X
4  4      D      Y
5  5      E      Z

Python (pandas):

import pandas as pd

# Create two sample DataFrames
df1 = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'value1': ["A", "B", "C", "D", "E"]
})

df2 = pd.DataFrame({
    'id': [3, 4, 5, 6],
    'value2': ["X", "Y", "Z", "W"]
})

# Perform a left merge on 'id'
joined_data = pd.merge(df1, df2, on="id", how="left")
print(joined_data)

Output:

   id value1 value2
0   1      A    NaN
1   2      B    NaN
2   3      C      X
3   4      D      Y
4   5      E      Z
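
To check which rows of the left table actually found a match, pd.merge accepts indicator=True, which adds a _merge column to the result. The sketch below reuses the two sample DataFrames. The names joined and unmatched and the 'none' placeholder are illustrative assumptions, and the unmatched step roughly corresponds to dplyr's anti_join(df1, df2, by = "id").

```python
import pandas as pd

df1 = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'value1': ["A", "B", "C", "D", "E"]
})

df2 = pd.DataFrame({
    'id': [3, 4, 5, 6],
    'value2': ["X", "Y", "Z", "W"]
})

# Left merge with a _merge column marking each row as 'both' or 'left_only'
joined = pd.merge(df1, df2, on='id', how='left', indicator=True)

# Rows of df1 that have no partner in df2
unmatched = joined.loc[joined['_merge'] == 'left_only', ['id', 'value1']]

# Replace the missing value2 entries with a placeholder string
joined['value2'] = joined['value2'].fillna('none')

print(joined)
print(unmatched)
```

Inspecting the _merge column before filling missing values is a quick way to confirm that the join keys line up as expected.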

Conclusion

This tutorial has provided side-by-side examples comparing data manipulation techniques in R using dplyr and in Python using pandas. Whether you’re filtering rows, grouping data for summary statistics, or joining datasets, both ecosystems offer powerful and similar tools to accomplish your data manipulation tasks. Understanding these parallels can greatly ease the transition between R and Python and help you choose the right tool for your data science projects.

Happy coding, and enjoy harnessing the power of both dplyr and pandas in your data science workflows!

Citation

BibTeX citation:
@online{kassambara2024,
  author = {Kassambara, Alboukadel},
  title = {Data {Manipulation} in {Python} Vs. {R:} Dplyr Vs. Pandas},
  date = {2024-02-13},
  url = {https://www.datanovia.com/learn/programming/transition/data-manipulation-dplyr-vs-pandas.html},
  langid = {en}
}
For attribution, please cite this work as:
Kassambara, Alboukadel. 2024. “Data Manipulation in Python Vs. R: Dplyr Vs. Pandas.” February 13, 2024. https://www.datanovia.com/learn/programming/transition/data-manipulation-dplyr-vs-pandas.html.