Extract Text from PDF
SelectPdf Library for .NET can be used to extract text from PDF using the Pdf To Text Converter.
The main class of the PDF to Text Converter is PdfToText. A PDF is loaded with one of the Load methods. The text can then be extracted as a plain text string using the GetText method, or as an HTML string (the plain text wrapped in HTML tags) using the GetHtml method.
Alternatively, the text (or HTML) can be written directly to a file using the SaveText or SaveHtml methods.
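A minimal sketch of the in-memory and file-based variants described above. The command-line arguments and output file names are hypothetical, and passing a file path to SaveText and SaveHtml is an assumption about the available overloads, so check the API reference before relying on it.

```csharp
using System;
using SelectPdf;

class SaveTextExample
{
    static void Main(string[] args)
    {
        // Hypothetical arguments: input PDF path and output base name
        string inputPdf = args[0];
        string outputBase = args[1];

        PdfToText pdfToText = new PdfToText();
        pdfToText.Load(inputPdf);

        // In-memory variant: the extracted text wrapped in HTML tags
        string html = pdfToText.GetHtml();
        Console.WriteLine("HTML length: " + html.Length);

        // File variants (assumed to accept an output file path)
        pdfToText.SaveText(outputBase + ".txt");
        pdfToText.SaveHtml(outputBase + ".html");
    }
}
```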
The PdfToText class provides a few other features:
Get number of pages in the PDF document using the GetPageCount method.
Get the information of the PDF document using the GetInfo method.
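The two features above might be used as in the following sketch. GetPageCount and GetInfo are the methods named in the list; the Title and Author properties read from the returned object are assumptions, as is the PDF path taken from the command line.

```csharp
using System;
using SelectPdf;

class PdfInfoExample
{
    static void Main(string[] args)
    {
        PdfToText pdfToText = new PdfToText();
        pdfToText.Load(args[0]); // hypothetical path to a PDF file

        // Number of pages in the loaded document
        int pageCount = pdfToText.GetPageCount();
        Console.WriteLine("Pages: " + pageCount);

        // Assumption: the returned object exposes Title and Author
        var info = pdfToText.GetInfo();
        Console.WriteLine("Title: " + info.Title);
        Console.WriteLine("Author: " + info.Author);
    }
}
```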
Using the Pdf To Text Converter is straightforward. The first step is to import the SelectPdf namespace.
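As a rough illustration, the import is a single using directive at the top of the file. The console program around it is a hypothetical wrapper, and args[0] is assumed to be the path to a PDF file.

```csharp
using System;
using SelectPdf; // namespace that contains PdfToText and TextLayout

class Program
{
    static void Main(string[] args)
    {
        // With the namespace imported, PdfToText can be used unqualified
        PdfToText pdfToText = new PdfToText();
        pdfToText.Load(args[0]);
        Console.WriteLine(pdfToText.GetText());
    }
}
```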
Sample code that shows how to extract the text from a PDF document using Select.Pdf Library for .NET:
// instantiate a pdf to text converter object
PdfToText pdfToText = new PdfToText();

// load PDF file
pdfToText.Load(filePdf);

// set the properties
pdfToText.Layout = textLayout;
pdfToText.StartPageNumber = startPage;
pdfToText.EndPageNumber = endPage;

// extract the text
string text = pdfToText.GetText();
StartPageNumber - The page number from where the current operation will start on the PDF file. The default value is 1 which means that the operation will start from the first page.
EndPageNumber - The page number where the current operation will end on the PDF file. The default value is 0, which means that the whole PDF document is processed, starting from the StartPageNumber page.
UserPassword - The user password to be used to open the PDF document for reading. The default value is null, which means that no password will be used to open the PDF document.
Layout - Gets or sets the TextLayout of the output text. The default value is SelectPdf.TextLayout.Original.
ClipText - Controls whether hidden text from the PDF document is returned.
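The page range properties above can be combined with GetPageCount to extract text one page at a time, as in this sketch. It assumes that GetText can be called repeatedly on the same loaded document and that UserPassword has to be set before Load; the command-line arguments are hypothetical.

```csharp
using System;
using SelectPdf;

class PageByPageExample
{
    static void Main(string[] args)
    {
        PdfToText pdfToText = new PdfToText();

        // Assumption: the password is read when the file is loaded;
        // null means no password is used
        pdfToText.UserPassword = args.Length > 1 ? args[1] : null;
        pdfToText.Load(args[0]);

        int pageCount = pdfToText.GetPageCount();
        for (int page = 1; page <= pageCount; page++)
        {
            // Restrict the operation to a single page
            pdfToText.StartPageNumber = page;
            pdfToText.EndPageNumber = page;

            string pageText = pdfToText.GetText();
            Console.WriteLine("--- page " + page + " (" + pageText.Length + " chars) ---");
        }
    }
}
```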
This sample shows how to use SelectPdf Pdf Library for .NET to extract text from a PDF document, also setting a few properties.
// the test file
string filePdf = Server.MapPath("~/files/selectpdf.pdf");

// settings
string text_layout = DdlTextLayout.SelectedValue;
TextLayout textLayout = (TextLayout)Enum.Parse(typeof(TextLayout), text_layout, true);

int startPage = 1;
try { startPage = Convert.ToInt32(TxtStartPage.Text); } catch { }

int endPage = 0;
try { endPage = Convert.ToInt32(TxtEndPage.Text); } catch { }

// instantiate a pdf to text converter object
PdfToText pdfToText = new PdfToText();

// load PDF file
pdfToText.Load(filePdf);

// set the properties
pdfToText.Layout = textLayout;
pdfToText.StartPageNumber = startPage;
pdfToText.EndPageNumber = endPage;

// extract the text
string text = pdfToText.GetText();

// convert text to UTF-8 bytes
byte[] utf8 = System.Text.Encoding.UTF8.GetBytes(text);

// send text to browser
Response.Clear();
Response.ClearHeaders();
Response.AddHeader("Content-Type", "text/plain; charset=UTF-8");
Response.AddHeader("Content-Length", utf8.Length.ToString());
Response.AppendHeader("content-disposition", "attachment;filename=\"output.txt\"");
Response.BinaryWrite(utf8);
Response.End();
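The empty catch blocks in the sample fall back to the default page numbers when the text box input cannot be converted. A similar fallback can be sketched without exceptions using int.TryParse. The PageInput class and its ParseOrDefault method are hypothetical helpers, not part of SelectPdf, and rejecting negative values is an added assumption about what counts as a valid page number.

```csharp
using System;

static class PageInput
{
    // Hypothetical helper: returns the parsed value, or the fallback when
    // the input is null, not an integer, or negative.
    public static int ParseOrDefault(string input, int fallback)
    {
        int value;
        if (int.TryParse(input, out value) && value >= 0)
        {
            return value;
        }
        return fallback;
    }
}
```

With this helper, the two settings could be read as PageInput.ParseOrDefault(TxtStartPage.Text, 1) and PageInput.ParseOrDefault(TxtEndPage.Text, 0).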