Extract Text from PDF
SelectPdf Library for .NET can be used to extract text from PDF using the Pdf To Text Converter.
The main class of the PDF to Text Converter is PdfToText. A PDF document is loaded using one of the Load methods. The text can then be extracted as a plain text string using the GetText method, or as an HTML string (the plain text wrapped in HTML tags) using the GetHtml method.
Alternatively, the text (or HTML) can be written directly to a file using the SaveText or SaveHtml methods.
The PdfToText class provides a few other features:
Get number of pages in the PDF document using the GetPageCount method.
Get the information of the PDF document using the GetInfo method.
Starting with version 22.1 of SelectPdf, text can be extracted from specific coordinates in the PDF document using the method ExtractText(Int32, Double, Double, Double, Double).
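The page count can be combined with the StartPageNumber and EndPageNumber properties to pull the text one page at a time. The following is a minimal sketch rather than an official SelectPdf sample. It assumes that GetPageCount returns an int, that page numbers are 1-based (consistent with the StartPageNumber default of 1), and that the page range can be changed between successive GetText calls on the same instance. The PageByPageExtractor class and its ExtractPages helper are hypothetical names introduced for illustration.

```csharp
using System.Collections.Generic;
using SelectPdf;

public static class PageByPageExtractor
{
    // Hypothetical helper: returns the text of each page as a separate string.
    public static List<string> ExtractPages(string pdfPath)
    {
        PdfToText pdfToText = new PdfToText();
        pdfToText.Load(pdfPath);

        // Assumption: GetPageCount returns the number of pages as an int.
        int pageCount = pdfToText.GetPageCount();

        var pages = new List<string>(pageCount);
        for (int page = 1; page <= pageCount; page++)
        {
            // Restrict the operation to a single page.
            pdfToText.StartPageNumber = page;
            pdfToText.EndPageNumber = page;
            pages.Add(pdfToText.GetText());
        }
        return pages;
    }
}
```

A call such as PageByPageExtractor.ExtractPages("input.pdf") (the path is a placeholder) would then yield one string per page, which is convenient when each page has to be indexed or stored separately.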
Using the Pdf To Text Converter is straightforward. The first step is to import the SelectPdf namespace (using SelectPdf;).
Sample code that shows how to extract the text from a PDF document using Select.Pdf Library for .NET:
    // instantiate a pdf to text converter object
    PdfToText pdfToText = new PdfToText();

    // load PDF file
    pdfToText.Load(filePdf);

    // set the properties
    pdfToText.Layout = textLayout;
    pdfToText.StartPageNumber = startPage;
    pdfToText.EndPageNumber = endPage;

    // extract the text
    string text = pdfToText.GetText();
StartPageNumber - The page number from where the current operation will start on the PDF file. The default value is 1 which means that the operation will start from the first page.
EndPageNumber - The page number where the current operation will end on the PDF file. The default value is 0, which means that the whole PDF document is processed, starting from the StartPageNumber page.
UserPassword - The user password to be used to open the PDF document for reading. The default value is null, which means that no password will be used to open the PDF document.
Layout - Gets or sets the TextLayout of the output text. The default value is SelectPdf.TextLayout.Original.
ClipText - Controls whether hidden text from the PDF document is returned.
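As an illustration of how these properties fit together, the sketch below reads a page range from a password-protected document. It is an assumption-laden example, not an official sample: the ProtectedPdfReader class, the ReadProtected helper, and the page range 2 to 5 are hypothetical, and UserPassword is set before Load only as a precaution, since the required order is not stated above.

```csharp
using SelectPdf;

public static class ProtectedPdfReader
{
    // Hypothetical helper: extracts pages 2 through 5 of a password-protected PDF.
    public static string ReadProtected(string pdfPath, string userPassword)
    {
        PdfToText pdfToText = new PdfToText();

        // Assumption: setting the password before Load is safe whichever
        // point the library actually opens the file at.
        pdfToText.UserPassword = userPassword;
        pdfToText.Load(pdfPath);

        // Process only pages 2..5 (page numbers start at 1).
        pdfToText.StartPageNumber = 2;
        pdfToText.EndPageNumber = 5;

        return pdfToText.GetText();
    }
}
```

Leaving EndPageNumber at its default of 0 would instead process everything from page 2 to the end of the document.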
Text layout - controlled by Layout, type TextLayout:
TextLayout.Original (default) preserves the spatial column layout of the source PDF with padding. Best for forms, invoices and tables where the visual column structure carries meaning.
TextLayout.Reading produces the text in reading order, one run after another. Best for free-running prose or when feeding the output into a search index or natural-language pipeline.
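To see the difference between the two layouts on a given document, it can help to extract the same file twice and compare the results side by side. The sketch below does that using only the members described on this page. The LayoutComparison class, the Extract helper, and the input.pdf path are hypothetical, and a fresh PdfToText instance is created per layout as a conservative choice.

```csharp
using System;
using SelectPdf;

public static class LayoutComparison
{
    // Hypothetical helper: extracts the whole document with the given layout.
    public static string Extract(string pdfPath, TextLayout layout)
    {
        PdfToText pdfToText = new PdfToText();
        pdfToText.Load(pdfPath);
        pdfToText.Layout = layout;
        return pdfToText.GetText();
    }

    public static void Main()
    {
        string pdfPath = "input.pdf"; // placeholder path

        string original = Extract(pdfPath, TextLayout.Original);
        string reading = Extract(pdfPath, TextLayout.Reading);

        Console.WriteLine("--- TextLayout.Original ---");
        Console.WriteLine(original);
        Console.WriteLine("--- TextLayout.Reading ---");
        Console.WriteLine(reading);
    }
}
```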
Plain text vs. HTML. GetText returns a plain .NET String. GetHtml returns the same text wrapped in basic HTML tags - useful for quick previews in a browser or rich-text control, while GetText is the right choice for downstream processing (indexing, diffing, search).
Encoding. The strings returned by GetText/GetHtml are standard .NET UTF-16 strings. When persisting the result to a file or stream, pick a text encoding explicitly - for example System.Text.Encoding.UTF8 - so the output round-trips correctly for non-ASCII content (the sample below demonstrates this).
Hidden text. Use ClipText to control whether content outside the PDF page media box (hidden text) is included in the output.
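For the encoding point, a file-based variant might look like the sketch below. It uses only standard .NET APIs (File.WriteAllText, File.ReadAllText, UTF8Encoding) and takes the already extracted string as input, so it does not depend on SelectPdf itself. The TextPersistence class and its method names are hypothetical, and writing without a byte order mark is a design choice made for this example, not a library requirement.

```csharp
using System.IO;
using System.Text;

public static class TextPersistence
{
    // Writes the extracted text as UTF-8 without a byte order mark.
    public static void SaveUtf8(string path, string text)
    {
        File.WriteAllText(path, text, new UTF8Encoding(encoderShouldEmitUTF8Identifier: false));
    }

    // Reads the file back, decoding it as UTF-8.
    public static string LoadUtf8(string path)
    {
        return File.ReadAllText(path, Encoding.UTF8);
    }
}
```

The string returned by GetText would typically be passed in, for example TextPersistence.SaveUtf8("output.txt", pdfToText.GetText()), where output.txt is a placeholder file name.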
This sample shows how to use SelectPdf Pdf Library for .NET to extract text from a PDF document, also setting a few properties.
    // the test file
    string filePdf = Server.MapPath("~/files/selectpdf.pdf");

    // settings
    string text_layout = DdlTextLayout.SelectedValue;
    TextLayout textLayout = (TextLayout)Enum.Parse(typeof(TextLayout), text_layout, true);

    int startPage = 1;
    try { startPage = Convert.ToInt32(TxtStartPage.Text); } catch { }

    int endPage = 0;
    try { endPage = Convert.ToInt32(TxtEndPage.Text); } catch { }

    // instantiate a pdf to text converter object
    PdfToText pdfToText = new PdfToText();

    // load PDF file
    pdfToText.Load(filePdf);

    // set the properties
    pdfToText.Layout = textLayout;
    pdfToText.StartPageNumber = startPage;
    pdfToText.EndPageNumber = endPage;

    // extract the text
    string text = pdfToText.GetText();

    // convert text to UTF-8 bytes
    byte[] utf8 = System.Text.Encoding.UTF8.GetBytes(text);

    // send text to browser
    Response.Clear();
    Response.ClearHeaders();
    Response.AddHeader("Content-Type", "text/plain; charset=UTF-8");
    Response.AddHeader("Content-Length", utf8.Length.ToString());
    Response.AppendHeader("content-disposition", "attachment;filename=\"output.txt\"");
    Response.BinaryWrite(utf8);
    Response.End();