Tika configuration not working in AEMaaCS-Assets
In AEMaaCS Assets, custom Tika configurations (for example, one that excludes PDFs) are ignored in the cloud environment, which falls back to the out-of-the-box (OOTB) settings. To fix this issue, remove the aggregate clause for the text rendition from the damAssetLucene index, or delete the /jcr:content/renditions/cqdam.text.txt node after processing.
Description
Environment
- Product: Adobe Experience Manager as a Cloud Service (AEMaaCS) – Assets
- Instance: Development
Issue/Symptoms
- A custom Lucene index includes specific properties and a custom Apache Tika configuration.
- The Tika configuration excludes certain asset types (e.g., PDFs) from indexing and searching.
- The setup works correctly in the local environment.
- The cloud environment ignores the custom Tika configuration.
- The system defaults to out-of-the-box (OOTB) Tika settings.
- Log messages confirm that the default Tika configuration is being loaded instead of the custom one.
Resolution
To fix this issue, follow these steps:
- Modify the DAM Index Definition
  - Open the damAssetLucene index definition. For more information, refer to the AEM documentation on content search and indexing.
  - Remove the aggregate clause that targets the text rendition path (/jcr:content/renditions/cqdam.text.txt) to exclude extracted text from full-text search.
- Implement a Post-Processing Workflow
  - Create a custom AEM workflow that runs after the Asset Compute Service completes its processing.
  - In the workflow:
    - Add a step to delete the /jcr:content/renditions/cqdam.text.txt node.
    - Alternatively, replace the node with an empty file to prevent it from being indexed.
  - Deploy the workflow using Cloud Manager and test it to confirm that unwanted text indexing is suppressed.
Notes:
- In local/AEM SDK, Tika configuration directly influences how binary content (e.g., PDFs, PNGs, MP4s) is indexed. Indexing occurs within the same runtime using the defined Tika configurations.
- In AEM as a Cloud Service, the Asset Compute Service handles text and metadata extraction from binaries. This extracted data is then supplied to the DAM index. Tika’s OSGi configuration does not influence this process.
- You cannot override or customize full-text extraction for binaries in the Cloud using local Tika configurations. Tika’s settings only affect local renditions in AEM SDK and some legacy on-prem setups.
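The deletion step in the post-processing workflow first has to work out which node to remove for a given asset. The following is a minimal sketch of that path resolution only, not Adobe's implementation. The class name TextRenditionPath, the method forPayload, and the restriction to payloads under /content/dam/ are assumptions made for illustration; only the rendition path itself comes from this article.

```java
import java.util.Optional;

/**
 * Hypothetical helper for a custom post-processing workflow step.
 * It only computes the path of the extracted-text rendition; it does
 * not touch the repository.
 */
final class TextRenditionPath {

    // Relative path of the extracted-text rendition, as named in this article.
    static final String TEXT_RENDITION = "/jcr:content/renditions/cqdam.text.txt";

    // Assumption: the workflow only handles assets stored under /content/dam/.
    static final String DAM_ROOT = "/content/dam/";

    private TextRenditionPath() {
    }

    /**
     * Returns the text rendition path for a workflow payload path, or an
     * empty Optional when the payload does not identify an asset under
     * DAM_ROOT.
     */
    static Optional<String> forPayload(String payloadPath) {
        if (payloadPath == null || !payloadPath.startsWith(DAM_ROOT)) {
            return Optional.empty();
        }
        // The payload may point below the asset node (for example at the
        // original rendition), so cut everything from /jcr:content onward.
        int idx = payloadPath.indexOf("/jcr:content");
        String assetPath = idx >= 0 ? payloadPath.substring(0, idx) : payloadPath;
        if (assetPath.endsWith("/")) {
            assetPath = assetPath.substring(0, assetPath.length() - 1);
        }
        // Reject payloads that resolve to the DAM root itself.
        if (assetPath.length() < DAM_ROOT.length()) {
            return Optional.empty();
        }
        return Optional.of(assetPath + TEXT_RENDITION);
    }
}
```

In an actual workflow step, the resolved path would then be handed to the repository API to remove the node, for instance through Sling's ResourceResolver (getResource, delete, commit). That wiring depends on the project setup and is left out of this sketch.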
Reading
Customizing the Post-Processing Workflow in AEM Assets Tutorials.