AEM truncates extracted text from large PDFs after 100K tokens

Last update: Wed Jun 04 2025 00:00:00 GMT+0000 (Coordinated Universal Time)

AEM limits PDF text extraction to 100,000 tokens by default, which can cause incomplete indexing for large documents and reduce search accuracy and discoverability. To resolve this, update the text extraction and indexing configurations so that all text in large PDFs is extracted and becomes searchable.

Description

Environment

Adobe Experience Manager (AEM) 6.5

Issue/Symptoms

AEM truncates text when indexing large PDFs from DAM (Digital Asset Management), limiting extraction to 100,000 tokens. Logs show: `Extracted text size exceeded configured limit(100000)`.

Updating the Adobe CQ DAM Text Extraction config on its own does not resolve the issue: the logs continue to show truncation errors.
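
To see whether truncation is still being reported after a configuration change, the log can be scanned for the message quoted above. The following is a minimal sketch, assuming the log lines are already available as a list of strings. The function name and the sample lines are illustrative and not taken from AEM.

```python
import re

# Matches the truncation message quoted in this article; the number in
# parentheses is the configured limit.
TRUNCATION_RE = re.compile(
    r"Extracted text size exceeded configured limit\((\d+)\)"
)


def find_truncations(lines):
    """Return (line_number, limit) for every line reporting truncation."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = TRUNCATION_RE.search(line)
        if match:
            hits.append((number, int(match.group(1))))
    return hits


# Made-up sample input: one unrelated line and one truncation message.
sample = [
    "some unrelated log line",
    "Extracted text size exceeded configured limit(100000)",
]
print(find_truncations(sample))
```

An empty result after reprocessing a large PDF suggests that the limit is no longer being hit.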

Resolution

Use the following steps to extract and index full text from large PDFs:

Update the OSGi (Open Services Gateway initiative) configuration to remove the limit on extracted tokens:

Go to Adobe CQ DAM Text Extraction (com.day.cq.dam.core.impl.process.TextExtractionProcess).
Set Activated to true.
Add application/pdf to MIME types.
Set Max Extracted Length to -1.

Example config:

`/apps/system/config/com.day.cq.dam.core.impl.process.TextExtractionProcess.config`

`apply=B"true"`
`maxExtract=L"-1"`
`mimeTypes=["application/pdf"]`

Modify the DAM Asset Lucene Index:
- Set maxFieldLength to 99999999.
- Add an aggregate path for jcr:content/text.
- Set reindex = true.
Edit the DAM Update Asset workflow:
- Add a process step after Process Thumbnails:
  - Title: Adobe CQ DAM Text Extraction Process
  - Handler: com.day.cq.dam.core.impl.process.TextExtractionProcess
  - Enable Handler Advance
Run large PDFs through the updated workflow. Optionally, use a single-step workflow for faster reprocessing.
Test with large PDFs to confirm full content indexing.
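
The index settings from the "Modify the DAM Asset Lucene Index" step can be checked against an exported copy of the index definition. The sketch below assumes the definition has been parsed into a nested dictionary in which child nodes appear as nested dictionaries (for example, from a JSON export of the index node). The function name, the sample structure, and the node names `dam:Asset` and `include0` in the sample are illustrative assumptions. `reindex` is not checked, because Oak resets it to false once reindexing completes.

```python
def index_problems(definition):
    """Return mismatches between an index definition and this article's settings."""
    problems = []

    if definition.get("maxFieldLength") != 99999999:
        problems.append("maxFieldLength is not 99999999")

    def has_text_include(node):
        # Walk all child nodes looking for an include whose path is
        # jcr:content/text, wherever it sits under the aggregates node.
        if node.get("path") == "jcr:content/text":
            return True
        return any(
            has_text_include(child)
            for child in node.values()
            if isinstance(child, dict)
        )

    if not has_text_include(definition.get("aggregates", {})):
        problems.append("no aggregate include for jcr:content/text")

    return problems


# Hypothetical sample definition that satisfies both checks.
sample = {
    "maxFieldLength": 99999999,
    "aggregates": {"dam:Asset": {"include0": {"path": "jcr:content/text"}}},
}
print(index_problems(sample))
```

An empty list means both settings are present. Any entries name the setting that still needs to be changed.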

These changes allow AEM to extract and index full text from large PDFs, improving search accuracy and completeness.
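
One way to confirm the result is to run a full-text search for a phrase that appears only near the end of a large PDF, beyond the point where text was previously cut off. The sketch below only builds the search URL. It assumes the QueryBuilder JSON servlet is reachable at `/bin/querybuilder.json` and that assets live under `/content/dam`. The host, the phrase, and the function name are placeholders.

```python
from urllib.parse import urlencode


def fulltext_query_url(base_url, phrase, root="/content/dam"):
    """Build a QueryBuilder URL that searches DAM assets for a phrase."""
    params = {
        "path": root,          # where to search
        "type": "dam:Asset",   # restrict results to assets
        "fulltext": phrase,    # text expected only late in the PDF
        "p.limit": "10",       # keep the result list short
    }
    return f"{base_url}/bin/querybuilder.json?{urlencode(params)}"


# Placeholder host and phrase; replace with values from your environment.
print(fulltext_query_url("http://localhost:4502", "closing appendix phrase"))
```

If the reprocessed PDF appears in the results for a phrase from its final pages, the full text has been indexed.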
