GroupDocs.Parser is a document parsing and data extraction API. It extracts text, metadata, barcodes, structured fields, images, tables, and document entities from PDFs, Office files, emails, eBooks, and archives, and is built for search indexing, compliance, data capture, and content ingestion workflows.
- See the latest release notes on NuGet and Maven Central for parser engine improvements, faster template-based extraction, and better table detection.
- Updated sample apps show invoice data extraction, email parsing, and PDF text extraction scenarios.
- New how-tos on templated parsing and container file processing in the documentation.
High-performance APIs for document parsing on .NET Framework and .NET Core.
- GroupDocs.Parser-for-.NET: Core C# API for text, metadata, tables, and template-based extraction.
- Samples & Demos: Explore runnable examples in the repository to parse PDFs, DOCX, XLSX, PPTX, MSG/EML, EPUB, ZIP, and more.

    // Quick .NET Parsing Example
    using (var parser = new GroupDocs.Parser.Parser("invoice.pdf"))
    {
        // Extract plain text from the document
        using (var reader = parser.GetText())
        {
            // GetText() returns null when text extraction is not supported for the format
            Console.WriteLine(reader == null ? "Text extraction isn't supported" : reader.ReadToEnd());
        }
    }

Native Java library for text, metadata, and structured data extraction.
- GroupDocs.Parser-for-Java: Java API for PDF/Office/email parsing, table detection, and template-driven extraction.

    // Quick Java Parsing Example
    try (com.groupdocs.parser.Parser parser = new com.groupdocs.parser.Parser("contract.docx")) {
        java.io.Reader reader = parser.getText();
        if (reader != null) {
            char[] buffer = new char[2048];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                System.out.print(new String(buffer, 0, read));
            }
        }
    }

Cross-platform Python bindings for text, metadata, and structured data extraction.
- GroupDocs.Parser-for-Python-via-.NET: Python API for PDF/Office/email parsing, table detection, template-based field extraction, and attachments.

    # Quick Python Parsing Example
    from groupdocs.parser import Parser
    with Parser("sample.pdf") as parser:
        text = parser.GetText()
        print(text)

- Invoice & receipt data extraction: pull totals, dates, vendors, and line items via templates.
- Email & attachment parsing: extract headers, bodies, attachments, and metadata from MSG/EML.
- Contract analysis: capture clauses, signatures, and key fields from DOCX/PDF.
- PDF table extraction: pull line items and financial tables from PDFs (see table extraction sample).
- Content migration: normalize mixed file types into structured outputs.
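
To make the email and attachment use case more concrete, here is a minimal sketch of what "headers, bodies, attachments" means for an EML message. It deliberately uses only Python's standard `email` package, not the GroupDocs.Parser API, so the helper names `build_sample_eml` and `summarize_eml` and the sample message are illustrative assumptions rather than anything the library provides.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_sample_eml() -> bytes:
    """Build a small in-memory EML message (hypothetical sample data)."""
    msg = EmailMessage()
    msg["From"] = "billing@example.com"
    msg["To"] = "ap@example.com"
    msg["Subject"] = "Invoice 1001"
    msg.set_content("Please find the invoice attached.")
    # Placeholder bytes stand in for a real PDF attachment.
    msg.add_attachment(
        b"%PDF-1.4 placeholder",
        maintype="application",
        subtype="pdf",
        filename="invoice-1001.pdf",
    )
    return msg.as_bytes()


def summarize_eml(raw: bytes) -> dict:
    """Return selected headers, the plain-text body, and attachment names."""
    msg = message_from_bytes(raw, policy=policy.default)
    body_part = msg.get_body(preferencelist=("plain",))
    return {
        "from": str(msg["From"]),
        "to": str(msg["To"]),
        "subject": str(msg["Subject"]),
        "body": body_part.get_content().strip() if body_part else "",
        "attachments": [part.get_filename() for part in msg.iter_attachments()],
    }


summary = summarize_eml(build_sample_eml())
print(summary)
```

A parsing pipeline would typically hand each attachment found this way to a document parser in turn, which is the container-style workflow the use cases above describe.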
- High-fidelity text extraction for PDF, DOC/DOCX, XLS/XLSX, PPT/PPTX, HTML, RTF, TXT, EPUB.
- Template-based extraction to capture labeled fields, tables, and repeating blocks reliably.
- Table recognition with cell-by-cell extraction for spreadsheets and tabular PDFs.
- Metadata parsing (built-in and custom) for compliance and governance.
- Container support for ZIP, OST/PST, MSG/EML, and attachments within archived files.
- Image & embedded object extraction for logos, signatures, and inline graphics.
- Page-level & area-limited parsing to target specific regions for faster processing.
- Performance & scaling tuned for server-side, multi-document workloads.
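
As a rough illustration of the idea behind template-based extraction, the sketch below treats a template as a set of named fields, each located by a pattern, and applies it to text that has already been extracted from a document. This is a conceptual example in plain Python with regular expressions. The names `INVOICE_TEMPLATE` and `extract_fields` and the sample invoice text are hypothetical, and the library's own template model may be defined quite differently.

```python
import re

# Hypothetical template: field name -> regex with exactly one capture group.
INVOICE_TEMPLATE = {
    "invoice_number": r"Invoice\s+No\.?\s*:\s*(\S+)",
    "date": r"Date\s*:\s*(\d{4}-\d{2}-\d{2})",
    "total": r"Total\s*:\s*\$?([\d,]+\.\d{2})",
}


def extract_fields(text: str, template: dict) -> dict:
    """Apply each field pattern to the text; fields that are not found map to None."""
    result = {}
    for name, pattern in template.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        result[name] = match.group(1) if match else None
    return result


# Hypothetical text, standing in for the output of a text extraction step.
SAMPLE_TEXT = """ACME Supplies
Invoice No: INV-1001
Date: 2024-03-15
Total: $1,250.00
"""

fields = extract_fields(SAMPLE_TEXT, INVOICE_TEMPLATE)
print(fields)
```

Keeping the template as data, separate from the extraction loop, is what lets one piece of code handle many document layouts: supporting a new invoice format means adding a new template rather than new parsing logic.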
- Documentation: Comprehensive Guides and Tutorials.
- Support: Expert help at the GroupDocs Free Support Forum.
- Evaluation: Get a Temporary License for full feature testing.
- Live Demo: Try parsing online at GroupDocs.Parser apps.
groupdocs-parser document-parser pdf-parser text-extraction data-extraction metadata-parser email-parser invoice-parsing table-extraction template-based-parsing content-ingestion document-ai search-indexing enterprise-parsing