Text extraction limitations for large PDFs in Adobe Experience Manager as a Cloud Service (AEMaaCS)
This article explains why text extraction is incomplete for large PDF documents in Adobe Experience Manager as a Cloud Service (AEMaaCS): the extracted text is capped at 100,000 characters. The behavior is intentional, designed to optimize storage and processing efficiency, but it can affect workflows that require full-text extraction.
Description
Environment
Adobe Experience Manager as a Cloud Service (AEMaaCS)
Issue
When AEM’s out-of-the-box Asset Processing capabilities handle large PDF documents, such as those containing hundreds of pages, text extraction is incomplete. The extracted text ends prematurely because of a 100,000-character limit. Symptoms include:
- The /jcr:content/renditions/cqdam.text.txt file for a large PDF contains only part of the text, for example approximately the first 108 pages of a 580-page PDF.
- Full-text extraction is constrained because the text extraction process is limited to 100,000 characters.
- Only essential sections of the document are extracted through smart summarizing.
- This limitation aligns with Oak’s indexing capabilities within AEM and aims to optimize storage and processing efficiency.
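The symptoms above can be checked with a small script. The following is a minimal sketch, not an Adobe-provided tool: it assumes you have already obtained the contents of the cqdam.text.txt rendition as a string, and the function names, the `tolerance` parameter, and the 925 characters-per-page figure are hypothetical values chosen for illustration. How AEM cuts the text at the limit is not documented here, so treat the result as a hint rather than proof.

```python
# Limit quoted in this article; the exact cutoff behavior is an assumption.
CHAR_LIMIT = 100_000


def looks_truncated(extracted_text: str, limit: int = CHAR_LIMIT, tolerance: int = 0) -> bool:
    """Return True when the extracted text is at least (limit - tolerance) characters long.

    A rendition whose length sits at the limit was probably cut off;
    a shorter one most likely contains the complete text.
    """
    return len(extracted_text) >= limit - tolerance


def estimate_pages_covered(avg_chars_per_page: int, limit: int = CHAR_LIMIT) -> int:
    """Estimate how many whole pages fit under the character limit."""
    if avg_chars_per_page <= 0:
        raise ValueError("avg_chars_per_page must be positive")
    return limit // avg_chars_per_page


if __name__ == "__main__":
    sample = "a" * 100_000  # stand-in for the contents of cqdam.text.txt
    print(looks_truncated(sample))
    # Hypothetical average page density, used only to illustrate the arithmetic.
    print(estimate_pages_covered(925))
```

The page estimate is plain integer division: the denser the pages, the fewer of them fit before the limit is reached, which is why a text-heavy PDF can stop well short of its last page.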
Resolution
- The limitation is by design, to keep processing times and costs in AEM under control.
- An enhancement request (ASSETS-45872) has been raised for future AEM releases to address this limitation, potentially introducing a worker capable of processing larger PDF files.
- Review AEM release notes for announcements regarding changes or improvements to PDF text extraction in upcoming versions.