Extract Text from PDF in R

23 Oct

Alboukadel

Text Mining

This article describes how to extract text from a PDF file in R using the pdftools package.

Contents:

Installation
Load the package
Extract the PDF text content
Render the pdf pages as images
Summary

Installation

On macOS and Windows, you can install the package directly from the CRAN repository:

install.packages("pdftools")

On Linux and other Unix systems, you may first need to install the poppler library on your computer. Use the shell command that matches your operating system:

On Debian/Ubuntu: sudo apt-get install libpoppler-cpp-dev
On Fedora or CentOS: sudo yum install poppler-cpp-devel
On macOS (only if you build the package from source): brew install poppler

Load the package

library("pdftools")
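For scripts that may run on a machine where pdftools is not yet installed, a guarded load can be convenient. The sketch below shows one common pattern; the CRAN mirror URL is an assumption and any mirror can be used instead.

```r
# Install pdftools only if it is missing (mirror URL is an assumed choice)
if (!requireNamespace("pdftools", quietly = TRUE)) {
  install.packages("pdftools", repos = "https://cloud.r-project.org")
}
library("pdftools")

# Report which version was loaded
print(packageVersion("pdftools"))
```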

Extract the PDF text content

# Download a demo PDF file (mode = "wb" keeps the binary file intact on Windows)
pdf.file <- "https://www.datanovia.com/en/wp-content/uploads/dn-tutorials/book-preview/clustering_en_preview.pdf"
download.file(pdf.file, destfile = "clustering.pdf", mode = "wb")

# Extract the text of all pages (a character vector with one element per page)
pdf.text <- pdf_text("clustering.pdf")
# Display the text of the third page
cat(pdf.text[[3]])

## 0.1. PREFACE                                                                            3
## 0.1       Preface
## Large amounts of data are collected every day from satellite images, bio-medical,
## security, marketing, web search, geo-spatial or other automatic equipment. Mining
## knowledge from these big data far exceeds human’s abilities.
## Clustering is one of the important data mining methods for discovering knowledge
## in multidimensional data. The goal of clustering is to identify pattern or groups of
## similar objects within a data set of interest.
## In the litterature, it is referred as “pattern recognition” or “unsupervised machine
## learning” - “unsupervised” because we are not guided by a priori ideas of which
## variables or samples belong in which clusters. “Learning” because the machine
## algorithm “learns” how to cluster.
## Cluster analysis is popular in many fields, including:
##    • In cancer research for classifying patients into subgroups according their gene
##        expression profile. This can be useful for identifying the molecular profile of
##        patients with good or bad prognostic, as well as for understanding the disease.
##    • In marketing for market segmentation by identifying subgroups of customers with
##        similar profiles and who might be receptive to a particular form of advertising.
##    • In City-planning for identifying groups of houses according to their type, value
##        and location.
##    This book provides a practical guide to unsupervised machine learning or cluster
##    analysis using R software. Additionally, we developped an R package named factoextra
##    to create, easily, a ggplot2-based elegant plots of cluster analysis results. Factoextra
##    official online documentation: http://www.sthda.com/english/rpkgs/factoextra
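As the output above shows, each page comes back as one long string in which line breaks and runs of spaces reproduce the page layout. For text mining it is often easier to work line by line. The following is a minimal sketch using base R only; the `page` string is a made-up stand-in for one element of the extracted text, so the example runs without the PDF.

```r
# Hypothetical stand-in for one page of pdf_text() output
page <- "0.1. PREFACE                  3\n0.1       Preface\nLarge amounts of data are collected every day.\n"

# Split the page into lines, trim the padding, and drop empty lines
page_lines <- trimws(strsplit(page, "\n", fixed = TRUE)[[1]])
page_lines <- page_lines[nzchar(page_lines)]

# Collapse the runs of spaces that were used for layout
page_lines <- gsub(" {2,}", " ", page_lines)
print(page_lines)
```

With the real document, `page` would be replaced by `pdf.text[[3]]`.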

Render the pdf pages as images

# Render page 3 of the PDF to a bitmap array
bitmap <- pdf_render_page("clustering.pdf", page = 3)

# Save the bitmap image (create the output directory first if it does not exist)
dir.create("images", showWarnings = FALSE)
png::writePNG(bitmap, "images/clustering-page-3.png")
jpeg::writeJPEG(bitmap, "images/clustering-page.jpeg")
webp::write_webp(bitmap, "images/clustering-page.webp")

[Image: rendered clustering PDF page]
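To render every page rather than a single one, the same call can be placed in a loop. The sketch below is one possible approach, not the only one: it builds a small two-page PDF with base R graphics so that it runs without a download, and the file names (`demo.pdf`, `demo-page-%d.png`) and the 150 dpi resolution are illustrative choices.

```r
library("pdftools")

# Create a two-page demo PDF with base R graphics (illustrative file name)
pdf("demo.pdf")
plot(1:10, main = "Page 1")
plot(10:1, main = "Page 2")
dev.off()

# Make sure the output directory exists before writing
dir.create("images", showWarnings = FALSE)

# Number of pages, read from the document metadata
n_pages <- pdf_info("demo.pdf")$pages

# Render each page at 150 dpi (the default is 72) and save it as PNG
for (i in seq_len(n_pages)) {
  bitmap <- pdf_render_page("demo.pdf", page = i, dpi = 150)
  png::writePNG(bitmap, file.path("images", sprintf("demo-page-%d.png", i)))
}
```

A higher `dpi` gives sharper images at the cost of larger files. pdftools also provides `pdf_convert()`, which writes image files for a set of pages in a single call.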