Introduction
In data science, building and evaluating predictive models is a core task. Both R and Python offer robust ecosystems for machine learning—R with tidymodels (or caret) and Python with scikit-learn. This tutorial provides side-by-side examples using the well-known iris dataset to compare how each language approaches model training, evaluation, and prediction. By the end, you’ll understand the similarities and differences between these workflows, helping you choose the best toolset for your projects.
Comparative Example: Training a Classification Model on the Iris Dataset
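We will split the iris dataset into training and testing sets, train a multinomial logistic regression model in each language (multinom_reg() with the nnet engine in R, LogisticRegression in Python), and then evaluate each model on the held-out test set.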
#| label: r-tidymodels
# Load required libraries
library(tidymodels)
# Load the iris dataset
data(iris)
# Split data into training and testing sets
set.seed(123)
iris_split <- initial_split(iris, prop = 0.8)
iris_train <- training(iris_split)
iris_test <- testing(iris_split)
# Define a logistic regression model specification for multi-class classification
model_spec <- multinom_reg() %>%
  set_engine("nnet") %>%
  set_mode("classification")
# Create a workflow and add the model and formula
iris_workflow <- workflow() %>%
  add_model(model_spec) %>%
  add_formula(Species ~ .)
# Fit the model
iris_fit <- fit(iris_workflow, data = iris_train)
# Make predictions on the test set
iris_pred <- predict(iris_fit, new_data = iris_test) %>%
  bind_cols(iris_test)
# Evaluate model performance
metrics <- iris_pred %>% metrics(truth = Species, estimate = .pred_class)
print(metrics)
Output:
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy multiclass     0.967
2 kap      multiclass     0.946
#| label: python-scikit-learn
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
# Load the iris dataset
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['Species'] = iris.target
# Split the dataset into training and testing sets
X = df.drop('Species', axis=1)
y = df['Species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
# Initialize and fit a logistic regression model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
# Make predictions on the test set
pred = model.predict(X_test)
# Evaluate model performance
accuracy = accuracy_score(y_test, pred)
print("Accuracy:", accuracy)
print("Classification Report:")
print(classification_report(y_test, pred))
Output:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        13
           1       1.00      1.00      1.00         6
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
The support column gives the number of true instances (actual observations) of each class in the test set: here 13, 6, and 11 of the 30 test samples. It shows how the classes are distributed in the test set, which helps you judge how much weight each per-class score deserves.
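The support values can also be computed directly from the test labels. The following is a minimal sketch, assuming the same split settings as the Python example above; it uses NumPy arrays instead of the DataFrame, and the variable name support is chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Same split settings as the scikit-learn example (20% test, seed 123)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

# Count how many test samples fall in each class (0, 1, 2).
# These counts are what classification_report lists as "support".
support = np.bincount(y_test, minlength=3)
print(dict(enumerate(support.tolist())))
```

The counts always sum to the size of the test set, so a class with very low support contributes scores that rest on only a handful of observations.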
Discussion
Workflow Comparison
Data Splitting:
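Both examples hold out 20% of the iris dataset for testing and fix a random seed (set.seed(123) in R, random_state=123 in Python) so that the split is reproducible. The two random number generators are unrelated, so the same seed value does not select the same rows in both languages, and the reported accuracies are not directly comparable.
Model Training: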
- R (tidymodels): Uses a workflow that integrates model specification and formula-based modeling.
- Python (scikit-learn): Implements a more imperative approach, directly fitting a logistic regression model.
Evaluation:
Both workflows score predictions on the test set, but they report different things. In tidymodels, metrics() returns a tibble with accuracy and Cohen's kappa (kap), which fits naturally into a pipe. In scikit-learn, accuracy_score() returns a single number and classification_report() prints per-class precision, recall, F1-score, and support.
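To compare like with like, the scikit-learn side can report the same two metrics that the tidymodels output shows. This is a minimal sketch, assuming that cohen_kappa_score from sklearn.metrics is an acceptable counterpart to the kap row; the scores dictionary is an illustrative name, not part of either library.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import train_test_split

# Same data, split, and model settings as the scikit-learn example
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)
model = LogisticRegression(max_iter=200).fit(X_train, y_train)
pred = model.predict(X_test)

# Collect both metrics in one dict, loosely mirroring the rows
# of the tidymodels metrics table (accuracy and kap)
scores = {
    "accuracy": accuracy_score(y_test, pred),
    "kap": cohen_kappa_score(y_test, pred),
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Because the R and Python splits contain different rows, the numbers will still not match the R output exactly; the sketch only aligns which metrics are reported.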
Strengths of Each Approach
tidymodels (R):
Offers a highly modular, consistent workflow that integrates seamlessly with other tidyverse tools, making it easy to extend and customize.
scikit-learn (Python):
Provides a straightforward, widely adopted interface with robust support for a variety of machine learning algorithms and preprocessing tools.
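The closest scikit-learn counterpart to a tidymodels workflow is a pipeline that bundles preprocessing and the model into one estimator. The sketch below is one possible arrangement, not the setup used in the example above: StandardScaler is an illustrative preprocessing step that the original code does not apply, and pipe is a name chosen here.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

# Bundle a preprocessing step and the model into a single estimator.
# StandardScaler is an assumption for illustration; the original
# example fits the model on unscaled features.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))
pipe.fit(X_train, y_train)

# score() returns mean accuracy on the held-out data
print("Accuracy:", pipe.score(X_test, y_test))
```

Fitting the pipeline learns the scaling parameters from the training data only and reapplies them at prediction time, which is the same guarantee a workflow gives when a preprocessing recipe is attached to it.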
Conclusion
This side-by-side comparison demonstrates that both tidymodels in R and scikit-learn in Python offer powerful workflows for building and evaluating machine learning models. By understanding the nuances of each approach, you can select the environment that best aligns with your data science needs—or even combine them to leverage the strengths of both.
Further Reading
- Python for R Users: Transitioning to Python for Data Science
- Data Manipulation in Python vs. R: dplyr vs. pandas
- R Syntax vs. Python Syntax: A Comparative Guide
Happy coding, and enjoy exploring machine learning workflows in both R and Python!
Citation
@online{kassambara2024,
author = {Kassambara, Alboukadel},
title = {Machine {Learning} {Workflows:} Tidymodels Vs. Scikit-Learn},
date = {2024-02-13},
url = {https://www.datanovia.com/learn/programming/transition/machine-learning-workflows.html},
langid = {en}
}