AEM truncates extracted text from large PDFs after 100K tokens
AEM limits PDF text extraction to 100,000 tokens by default, which can cause incomplete indexing for large documents. This impacts search accuracy and discoverability. You can resolve this by updating extraction and indexing configurations to allow full content indexing, ensuring all text in large PDFs becomes searchable.
Description
Environment
- Adobe Experience Manager (AEM) 6.5
Issue/Symptoms
AEM truncates text when indexing large PDFs from DAM (Digital Asset Management), limiting extraction to 100,000 tokens. Logs show: Extracted text size exceeded configured limit(100000).
Updating the Adobe CQ DAM Text Extraction configuration alone does not resolve the issue; logs continue to show truncation errors.
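To confirm that an instance is affected, the log can be searched for the truncation message quoted above. The following is a minimal sketch, not an Adobe tool: the function names are hypothetical, and the marker string is copied from the message in this article, so it may need adjusting if your log format differs.

```python
from pathlib import Path

# Message text taken from this article; the surrounding log format is an assumption.
TRUNCATION_MARKER = "Extracted text size exceeded configured limit"


def find_truncation_lines(log_text: str) -> list[str]:
    """Return the log lines that report truncated text extraction."""
    return [line for line in log_text.splitlines() if TRUNCATION_MARKER in line]


def scan_log_file(path: str) -> list[str]:
    """Read a log file and return its truncation lines.

    The caller supplies the path; undecodable bytes are replaced
    so that a stray character does not abort the scan.
    """
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return find_truncation_lines(text)
```

Calling `scan_log_file` on the instance's error.log (commonly under crx-quickstart/logs, though the location depends on the deployment) returns one entry per truncation event. Running it again after applying the resolution gives a quick indication of whether newly processed PDFs are still being cut off.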
Resolution
Use the following steps to extract and index full text from large PDFs:
- Update the OSGi (Open Services Gateway initiative) configuration to remove the limit on extracted tokens:
  - Go to Adobe CQ DAM Text Extraction (com.day.cq.dam.core.impl.process.TextExtractionProcess).
  - Set Activated to true.
  - Add application/pdf to MIME types.
  - Set Max Extracted Length to -1.

  Example config:

  ```
  /apps/system/config/com.day.cq.dam.core.impl.process.TextExtractionProcess.config
  apply=B"true"
  maxExtract=L"-1"
  mimeTypes=["application/pdf"]
  ```
- Modify the DAM Asset Lucene Index:
  - Set maxFieldLength to 99999999.
  - Add an aggregate path for jcr:content/text.
  - Set reindex to true.
- Edit the DAM Update Asset workflow:
  - Add a process step after Process Thumbnails:
    - Title: Adobe CQ DAM Text Extraction Process
    - Handler: com.day.cq.dam.core.impl.process.TextExtractionProcess
    - Enable Handler Advance
- Run large PDFs through the updated workflow. Optionally, use a single-step workflow for faster reprocessing.
- Test with large PDFs to confirm full content indexing.
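The index property changes from the steps above are normally made in CRXDE Lite, but they could also be scripted. The sketch below only builds the form body for a Sling POST request and does not send it. It assumes the index lives at /oak:index/damAssetLucene (verify the node name on your instance), and the function name is hypothetical. It does not cover adding the aggregate path for jcr:content/text, which requires creating a child node under the index's aggregates.

```python
from urllib.parse import urlencode

# Assumed index location; confirm the node name in CRXDE Lite before use.
INDEX_PATH = "/oak:index/damAssetLucene"


def index_update_form(max_field_length: int = 99999999) -> str:
    """Build a Sling POST form body that sets maxFieldLength and requests a reindex.

    The @TypeHint fields ask the Sling POST servlet to store the
    values as Long and Boolean instead of plain strings.
    """
    fields = [
        ("maxFieldLength", str(max_field_length)),
        ("maxFieldLength@TypeHint", "Long"),
        ("reindex", "true"),
        ("reindex@TypeHint", "Boolean"),
    ]
    return urlencode(fields)
```

The returned string would be sent as an application/x-www-form-urlencoded POST to the index path with an account that is allowed to modify index definitions. Reindexing a large DAM can take a long time, so try this on a non-production instance first.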
These changes allow AEM to extract and index full text from large PDFs, improving search accuracy and completeness.
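One way to check the result is to search for a phrase that occurs only near the end of a large PDF, beyond the point where extraction used to stop. The sketch below builds a QueryBuilder URL for such a search. The helper name, the host in the usage note, and the default /content/dam path are assumptions for illustration.

```python
from urllib.parse import urlencode


def fulltext_query_url(base_url: str, term: str, dam_path: str = "/content/dam") -> str:
    """Build a QueryBuilder URL that searches dam:Asset nodes for a term."""
    params = {
        "path": dam_path,      # restrict the search to the DAM subtree
        "type": "dam:Asset",   # only return assets
        "fulltext": term,      # phrase expected late in the PDF
        "p.limit": "10",       # cap the number of hits returned
    }
    return f"{base_url.rstrip('/')}/bin/querybuilder.json?{urlencode(params)}"
```

For example, `fulltext_query_url("http://localhost:4502", "appendix")` yields a URL that can be opened in an authenticated browser session. If the reprocessed PDF appears in the hits, text from the later part of the document reached the index.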