Add KB article for extracting text from PDF

dessyordanova · dessyordanova · commit 5fd44ac3965b · 2025-01-28T14:58:55.000+02:00
diff --git a/introduction.md b/introduction.md
@@ -28,13 +28,13 @@ table th:first-of-type {
 
 Telerik Document Processing features the following libraries:
 
-|Library|Description||
-|----|----|----|
-| [RadPdfProcessing]({%slug radpdfprocessing-overview%})|A processing library that allows you to create, import, and export PDF documents from your code. You can use it in any web or desktop .NET application without relying on third-party software like Adobe Acrobat.|![Pdf](images/dpl-pdf.png)|
-|[RadSpreadProcessing]({%slug radspreadprocessing-overview%})|A powerful library that enables you to create applications with native support for spreadsheet documents. With RadSpreadProcessing, you can create spreadsheets from scratch, modify existing documents or convert between the most common spreadsheet formats. You can save the generated workbook to a local file, stream, or stream it to the client browser.|![Spread](images/dpl-spread.png)| 
-|[RadSpreadStreamProcessing]({%slug radspreadstreamprocessing-overview%})|Spread streaming is a document processing paradigm that allows you to create or read big spreadsheet documents with great performance and minimal memory footprint. The key for the memory efficiency is that the spread streaming library writes the spreadsheet content directly to a stream without creating and preserving the spreadsheet document model in memory.|![SpreadStream](images/dpl-spread.png)| 
-|[RadWordsProcessing]({%slug radwordsprocessing-overview%})|A processing library that allows you to create, modify and export documents to a variety of formats. Through the API, you can access each element in the document and modify, remove it or add a new one. The generated content you can save as a stream, as a file, or sent it to the client browser.|![Words](images/dpl-words.png)|  
-|[RadZipLibrary]({%slug radziplibrary-overview%})| It allows you to compress and combine files in ZIP archives, browse and extract files from existing ZIP archives and compress streams for easy file shipping and reduced storage space.|![Zip](images/dpl-zip.png)|  
+|Library|Description|
+|----|----|
+| [RadPdfProcessing]({%slug radpdfprocessing-overview%}) ![Pdf](images/dpl-pdf.png)|A processing library that allows you to create, import, and export PDF documents from your code. You can use it in any web or desktop .NET application without relying on third-party software like Adobe Acrobat.|
+|[RadSpreadProcessing]({%slug radspreadprocessing-overview%}) ![Spread](images/dpl-spread.png)|A powerful library that enables you to create applications with native support for spreadsheet documents. With RadSpreadProcessing, you can create spreadsheets from scratch, modify existing documents, or convert between the most common spreadsheet formats. You can save the generated workbook to a local file or a stream, or stream it to the client browser.| 
+|[RadSpreadStreamProcessing]({%slug radspreadstreamprocessing-overview%}) ![SpreadStream](images/dpl-spread.png)|Spread streaming is a document processing paradigm that allows you to create or read big spreadsheet documents with great performance and minimal memory footprint. The key to the memory efficiency is that the spread streaming library writes the spreadsheet content directly to a stream without creating and preserving the spreadsheet document model in memory.| 
+|[RadWordsProcessing]({%slug radwordsprocessing-overview%}) ![Words](images/dpl-words.png)|A processing library that allows you to create, modify, and export documents to a variety of formats. Through the API, you can access each element in the document and modify or remove it, or add a new one. You can save the generated content as a stream or a file, or send it to the client browser.|  
+|[RadZipLibrary]({%slug radziplibrary-overview%}) ![Zip](images/dpl-zip.png)|A library that allows you to compress and combine files in ZIP archives, browse and extract files from existing ZIP archives, and compress streams for easy file shipping and reduced storage space.|  
 
 ## Key Features
 
diff --git a/knowledge-base/extract-text-from-pdf.md b/knowledge-base/extract-text-from-pdf.md
@@ -0,0 +1,50 @@
+---
+title: Extracting Text from PDF Documents
+description: Learn how to extract the text from a PDF document using RadPdfProcessing from the Telerik Document Processing libraries.
+type: how-to
+page_title: How to Extract Text from PDF Documents 
+slug: extract-text-from-pdf
+tags: pdf, document, processing, text, extract, content 
+res_type: kb
+ticketid: 1657503
+---
+
+## Environment
+
+| Version | Product | Author | 
+| ---- | ---- | ---- | 
+| 2025.1.128| RadPdfProcessing |[Desislava Yordanova](https://www.telerik.com/blogs/author/desislava-yordanova)| 
+
+## Description
+
+Learn how to extract the text content from a PDF document.
+
+## Solution
+
+Follow these steps:
+
+1\. Import the PDF document using the [PdfFormatProvider]({%slug radpdfprocessing-formats-and-conversion-pdf-pdfformatprovider%}).
+
+2\. Export the content of the imported RadFixedDocument to text using the [TextFormatProvider]({%slug radpdfprocessing-formats-and-conversion-plain-text-textformatprovider%}). Any text fragments that the PDF document contains are exported to the plain text result.
+
+```csharp
+            string filePath = "input.pdf";
+            PdfFormatProvider pdfProvider = new PdfFormatProvider();
+            RadFixedDocument fixedDocument;
+            using (Stream stream = File.OpenRead(filePath))
+            {
+                fixedDocument = pdfProvider.Import(stream); // Step 1: load the PDF into the document model.
+            }
+            Telerik.Windows.Documents.Fixed.FormatProviders.Text.TextFormatProvider textProvider = new Telerik.Windows.Documents.Fixed.FormatProviders.Text.TextFormatProvider();
+            // Step 2: export the document's text fragments to a plain string.
+            string documentContent = textProvider.Export(fixedDocument);
+            Debug.WriteLine(documentContent);
+```
+>important Depending on the internal content of the document, the **TextFormatProvider** may not cover all cases. A common scenario is a scanned document whose pages are images that contain text information. In this case, the above approach does not produce plain text because the visible text is not stored as text fragments but as image content or [Path]({%slug radpdfprocessing-model-path%}) elements. For such documents, use the [OcrFormatProvider]({%slug radpdfprocessing-formats-and-conversion-ocr-ocrformatprovider%}), which converts images of typed, handwritten, or printed text from a scanned document into machine-encoded text.
+
+## See Also
+
+- [RadPdfProcessing]({%slug radpdfprocessing-overview%})
+- [OcrFormatProvider]({%slug radpdfprocessing-formats-and-conversion-ocr-ocrformatprovider%})
+- [TextFormatProvider]({%slug radpdfprocessing-formats-and-conversion-plain-text-textformatprovider%}) 
+
diff --git a/libraries/radpdfprocessing/formats-and-conversion/ocr/ocrformatprovider.md b/libraries/radpdfprocessing/formats-and-conversion/ocr/ocrformatprovider.md
@@ -12,7 +12,7 @@ position: 1
 
 Since _Q1 2025_ the __RadPdfProcessing__ library supports Optical Character Recognition (OCR). OCR is the electronic or mechanical conversion of images of typed, handwritten, or printed text into machine-encoded text from a scanned document. The library uses the **OcrFormatProvider** class that allows you to import an image which is returned as a [RadFixedPage]({%slug radpdfprocessing-model-radfixedpage%}). By default, the **OcrFormatProvider** takes as a parameter a **TesseractOcrProvider** implementation which is achieved by using the third-party library [Tesseract](https://github.com/tesseract-ocr/tesseract), however you can provide any [custom implementation]({%slug radpdfprocessing-formats-and-conversion-ocr-custom-ocrprovider%}) instead.
 
-You can find all the dependencies and required steps for the implementation in the [Prerequisites]({%slug radpdfprocessing-formats-and-conversion-ocr-prerequisites%}) artilce.
+You can find all the dependencies and required steps for the implementation in the [Prerequisites]({%slug radpdfprocessing-formats-and-conversion-ocr-prerequisites%}) article.
 
 ## TesseractOcrProvider Public API
 
@@ -35,3 +35,4 @@ You can find all the dependencies and required steps for the implementation in t
 * [Prerequisites]({%slug radpdfprocessing-formats-and-conversion-ocr-prerequisites%})
 * [Timeout Mechanism]({%slug timeout-mechanism-in-dpl%})
 * [Implementing a Custom OCR Provider]({%slug radpdfprocessing-formats-and-conversion-ocr-custom-ocrprovider%})
+* [Extracting Text from PDF Documents]({%slug extract-text-from-pdf%})
diff --git a/libraries/radpdfprocessing/formats-and-conversion/plain-text/textformatprovider.md b/libraries/radpdfprocessing/formats-and-conversion/plain-text/textformatprovider.md
@@ -41,3 +41,4 @@ __Example 1__ shows how to use __TextFormatProvider__ to export __RadFixedDocume
 * [Plain text]({%slug radpdfprocessing-formats-and-conversion-plain-text-text%})
 * [TextFormatProvider Settings]({%slug radpdfprocessing-formats-and-conversion-plain-text-settings%})
 * [Timeout Mechanism]({%slug timeout-mechanism-in-dpl%})
+* [Extracting Text from PDF Documents]({%slug extract-text-from-pdf%})