Extract Text from PDF


SelectPdf Library for .NET can be used to extract text from PDF using the Pdf To Text Converter.

The main class of the PDF to Text Converter is PdfToText. A PDF is loaded with one of the Load methods. The text can then be extracted as a plain text string with GetText, or as an HTML string (the plain text wrapped in HTML tags) with GetHtml.

Alternatively, the text (or HTML) can be written directly to a file using the SaveText or SaveHtml methods.

The PdfToText class provides a few other features:

Get number of pages in the PDF document using the GetPageCount method.
Get the information of the PDF document using the GetInfo method.

Starting with version 22.1 of SelectPdf, text can be extracted from specific coordinates in the PDF document using the method ExtractText(Int32, Double, Double, Double, Double).
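The page count and file output features mentioned above can be combined in a short console program. This is a minimal sketch rather than an official sample: the file paths are hypothetical, and it assumes that SaveText has an overload taking a file path and that GetPageCount returns an integer.

```csharp
using System;
using SelectPdf;

class PdfToTextOverview
{
    static void Main()
    {
        // Hypothetical input and output paths, used only for illustration.
        string inputPath = "document.pdf";
        string outputPath = "document.txt";

        // instantiate a pdf to text converter object and load the PDF file
        PdfToText pdfToText = new PdfToText();
        pdfToText.Load(inputPath);

        // report the number of pages in the loaded document
        int pageCount = pdfToText.GetPageCount();
        Console.WriteLine("Pages: " + pageCount);

        // write the extracted text directly to a file
        // (assumption: SaveText accepts a file path)
        pdfToText.SaveText(outputPath);
    }
}
```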

Quick Start

Using the Pdf To Text Converter is very easy. The first step is to import the namespace.

using SelectPdf;

Imports SelectPdf

Sample code that shows how to extract the text from a PDF document using Select.Pdf Library for .NET:

// instantiate a pdf to text converter object
PdfToText pdfToText = new PdfToText();

// load PDF file
pdfToText.Load(filePdf);

// set the properties
pdfToText.Layout = textLayout;
pdfToText.StartPageNumber = startPage;
pdfToText.EndPageNumber = endPage;

// extract the text
string text = pdfToText.GetText();

' instantiate a pdf to text converter object
Dim pdfToText As New PdfToText()

' load PDF file
pdfToText.Load(filePdf)

' set the properties
pdfToText.Layout = textLayout
pdfToText.StartPageNumber = startPage
pdfToText.EndPageNumber = endPage

' extract the text
Dim text As String = pdfToText.GetText()

Pdf To Text Converter Properties

StartPageNumber - The page number from where the current operation will start on the PDF file. The default value is 1 which means that the operation will start from the first page.

EndPageNumber - The page number where the current operation will end on the PDF file. The default value is 0, which means that the whole PDF document is processed, starting from the StartPageNumber page.

UserPassword - The user password to be used to open the PDF document for reading. The default value is null, which means that no password will be used to open the PDF document.

Layout - Gets or sets the TextLayout of the output text. The default value is SelectPdf.TextLayout.Original.

ClipText - Controls whether hidden text from the PDF document is returned.
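The StartPageNumber and EndPageNumber defaults described above (1 for the first page, 0 for "through the last page") can be illustrated with a small helper that resolves them to a concrete page range. This is only a sketch: PageRange and Resolve are hypothetical names that are not part of SelectPdf, and clamping out-of-range values is an assumption about what a caller may want, not documented library behavior.

```csharp
using System;

public static class PageRange
{
    // Resolves start/end settings to a 1-based, inclusive page range.
    // endPage <= 0 (or past the last page) resolves to the last page;
    // startPage below 1 resolves to the first page.
    public static (int First, int Last) Resolve(int startPage, int endPage, int pageCount)
    {
        if (pageCount < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(pageCount));
        }

        int first = Math.Max(1, startPage);
        int last = (endPage <= 0 || endPage > pageCount) ? pageCount : endPage;

        if (first > last)
        {
            throw new ArgumentException("The start page is after the end page.");
        }

        return (first, last);
    }
}
```

A caller could pass the value returned by GetPageCount as pageCount and assign the resolved values to StartPageNumber and EndPageNumber before extracting the text.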

Pdf To Text Converter Sample

This sample shows how to use SelectPdf Pdf Library for .NET to extract text from a PDF document, also setting a few properties.

// the test file
string filePdf = Server.MapPath("~/files/selectpdf.pdf");

// settings
string text_layout = DdlTextLayout.SelectedValue;
TextLayout textLayout = (TextLayout)Enum.Parse(typeof(TextLayout),
    text_layout, true);

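int startPage;
if (!int.TryParse(TxtStartPage.Text, out startPage))
{
    // fall back to the first page when the input is not a valid integer
    startPage = 1;
}

int endPage;
if (!int.TryParse(TxtEndPage.Text, out endPage))
{
    // 0 means "process through the last page"
    endPage = 0;
}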

// instantiate a pdf to text converter object
PdfToText pdfToText = new PdfToText();

// load PDF file
pdfToText.Load(filePdf);

// set the properties
pdfToText.Layout = textLayout;
pdfToText.StartPageNumber = startPage;
pdfToText.EndPageNumber = endPage;

// extract the text
string text = pdfToText.GetText();

// convert text to UTF-8 bytes
byte[] utf8 = System.Text.Encoding.UTF8.GetBytes(text);

// send text to browser
Response.Clear();
Response.ClearHeaders();

Response.AddHeader("Content-Type", "text/plain; charset=UTF-8");
Response.AddHeader("Content-Length", utf8.Length.ToString());
Response.AppendHeader("content-disposition", 
    "attachment;filename=\"output.txt\"");

Response.BinaryWrite(utf8);
Response.End();

' the test file
Dim filePdf As String = Server.MapPath("~/files/selectpdf.pdf")

' settings
Dim text_layout As String = DdlTextLayout.SelectedValue
Dim textLayout As TextLayout = _
    DirectCast([Enum].Parse(GetType(TextLayout), text_layout, True),  _
    TextLayout)

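Dim startPage As Integer
If Not Integer.TryParse(TxtStartPage.Text, startPage) Then
    ' fall back to the first page when the input is not a valid integer
    startPage = 1
End If

Dim endPage As Integer
If Not Integer.TryParse(TxtEndPage.Text, endPage) Then
    ' 0 means "process through the last page"
    endPage = 0
End If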

' instantiate a pdf to text converter object
Dim pdfToText As New PdfToText()

' load PDF file
pdfToText.Load(filePdf)

' set the properties
pdfToText.Layout = textLayout
pdfToText.StartPageNumber = startPage
pdfToText.EndPageNumber = endPage

' extract the text
Dim text As String = pdfToText.GetText()

' convert text to UTF-8 bytes
Dim utf8 As Byte() = System.Text.Encoding.UTF8.GetBytes(text)

' send text to browser
Response.Clear()
Response.ClearHeaders()

Response.AddHeader("Content-Type", "text/plain; charset=UTF-8")
Response.AddHeader("Content-Length", utf8.Length.ToString())
Response.AppendHeader("content-disposition", _
                      "attachment;filename=""output.txt""")

Response.BinaryWrite(utf8)
Response.[End]()

Reference

SelectPdf

PdfToText

Other Resources

Select.Pdf Online Demo with C# Sample Code

Select.Pdf Online Demo with Vb.Net Sample Code