Pdf Library for .NET

Extract Text from PDF

SelectPdf Library for .NET can be used to extract text from PDF using the Pdf To Text Converter.

The main class of the PDF to Text Converter is PdfToText. The PDF can be loaded using the Load methods. The text can be extracted from the PDF as a plain text string or as an HTML string (the plain text string wrapped in HTML tags) using the GetText or GetHtml methods.

Alternatively, the text (or HTML) can be written directly to a file using the SaveText or SaveHtml methods.
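As a rough sketch of the two output paths described above, the snippet below gets the HTML form as a string and then writes the text and HTML forms directly to files. The file paths are placeholders, and the snippet assumes that SaveText and SaveHtml each take the destination file path as a single string argument.

```csharp
using SelectPdf;

class ExtractToFileExample
{
    static void Main()
    {
        // hypothetical input path, replace with a real PDF file
        string filePdf = @"C:\temp\input.pdf";

        // instantiate the converter and load the document
        PdfToText pdfToText = new PdfToText();
        pdfToText.Load(filePdf);

        // in-memory result: plain text wrapped in HTML tags
        string html = pdfToText.GetHtml();

        // write the results directly to disk
        // (assumed signature: a single destination file path)
        pdfToText.SaveText(@"C:\temp\output.txt");
        pdfToText.SaveHtml(@"C:\temp\output.html");
    }
}
```

Saving directly to a file avoids holding the whole extracted text in memory when the string itself is not needed by the calling code.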

The PdfToText class provides a few other features:

Starting with version 22.1 of SelectPdf, text can be extracted from specific coordinates in the PDF document using the method ExtractText(Int32, Double, Double, Double, Double).

Quick Start

Using the Pdf To Text Converter is very easy. The first step is to import the namespace.

using SelectPdf;

Sample code that shows how to extract the text from a PDF document using Select.Pdf Library for .NET:

// filePdf, textLayout, startPage and endPage are assumed to be defined by the caller
// instantiate a pdf to text converter object
PdfToText pdfToText = new PdfToText();

// load PDF file
pdfToText.Load(filePdf);

// set the properties
pdfToText.Layout = textLayout;
pdfToText.StartPageNumber = startPage;
pdfToText.EndPageNumber = endPage;

// extract the text
string text = pdfToText.GetText();

Pdf To Text Converter Properties

StartPageNumber - The page number where the current operation starts in the PDF file. The default value is 1, which means that the operation starts from the first page.

EndPageNumber - The page number where the current operation ends in the PDF file. The default value is 0, which means that the whole PDF document is processed, starting from the StartPageNumber page.

UserPassword - The user password to be used to open the PDF document for reading. The default value is null, which means that no password will be used to open the PDF document.

Layout - Gets or sets the TextLayout of the output text. The default value is SelectPdf.TextLayout.Original.

ClipText - Controls whether hidden text from the PDF document is returned.
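The properties above can be combined as in the following minimal sketch, which reads pages 2 to 3 of a password-protected document. The file path and password are placeholders, and setting UserPassword before calling Load is an assumption based on the property being used to open the document.

```csharp
using System;
using SelectPdf;

class PropertiesExample
{
    static void Main()
    {
        PdfToText pdfToText = new PdfToText();

        // hypothetical password; assumed to be needed before the file is opened
        pdfToText.UserPassword = "user-password";

        // hypothetical input path, replace with a real PDF file
        pdfToText.Load(@"C:\temp\protected.pdf");

        // keep the default layout explicitly and limit the page range
        pdfToText.Layout = TextLayout.Original;
        pdfToText.StartPageNumber = 2;
        pdfToText.EndPageNumber = 3;

        // extract only the selected pages
        string text = pdfToText.GetText();
        Console.WriteLine(text);
    }
}
```

Leaving EndPageNumber at its default of 0 instead would process everything from page 2 to the end of the document.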

Pdf To Text Converter Sample

This sample shows how to use SelectPdf Pdf Library for .NET to extract text from a PDF document, also setting a few properties.

// the test file
string filePdf = Server.MapPath("~/files/selectpdf.pdf");

// settings
string text_layout = DdlTextLayout.SelectedValue;
TextLayout textLayout = (TextLayout)Enum.Parse(typeof(TextLayout),
    text_layout, true);

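int startPage;
if (!int.TryParse(TxtStartPage.Text, out startPage))
{
    // fall back to the first page when the input is not a valid integer
    startPage = 1;
}

int endPage;
if (!int.TryParse(TxtEndPage.Text, out endPage))
{
    // 0 means the document is processed up to the last page
    endPage = 0;
}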

// instantiate a pdf to text converter object
PdfToText pdfToText = new PdfToText();

// load PDF file
pdfToText.Load(filePdf);

// set the properties
pdfToText.Layout = textLayout;
pdfToText.StartPageNumber = startPage;
pdfToText.EndPageNumber = endPage;

// extract the text
string text = pdfToText.GetText();

// convert text to UTF-8 bytes
byte[] utf8 = System.Text.Encoding.UTF8.GetBytes(text);

// send text to browser
Response.Clear();
Response.ClearHeaders();

Response.AddHeader("Content-Type", "text/plain; charset=UTF-8");
Response.AddHeader("Content-Length", utf8.Length.ToString());
Response.AppendHeader("content-disposition", 
    "attachment;filename=\"output.txt\"");

Response.BinaryWrite(utf8);
Response.End();