Smart Translation Search allows the use of non-English search terms to resolve to English content. To set up AEM for Smart Translation Search, the Apache Oak Search Machine Translation OSGi bundle must be installed and configured, as well as the pertinent free and open source Apache Joshua language packs that contain the translation rules.
Hi, let’s take a look at how to set up Smart Translation Search for AEM. The first thing we need to do is download the Oak Search Machine Translation bundle. To do this, we head over to search.maven.org and we wanna search for oak-search-mt for machine translation. You’ll see we have our oak-search-mt ArtifactId, it’s part of the org.apache.jackrabbit group, and we have a number of versions. Let’s check out the versions. So what version do we wanna pick? To determine this, we need to figure out what version of Oak our AEM instance is running. We can head back over to AEM, open up the Web Console, open OSGi bundles, search for Oak Core. So over here you’ll see the version of the Oak Core bundle is 1.8.0, and that means the version of Oak we’re running is 1.8.0. So what we wanna do is download the corresponding version. I download the jar for 1.8.0. We can head back to the Web Console, staying on the Bundles tab, and we can install this new bundle.
Search for Machine Translation. We can see that we have this bundle installed now at version 1.8.0 and it is active. This bundle is ready to go. So the next thing we wanna do is download our language packs. So the language packs are provided by the Apache-Joshua project, so we can head over to their website to see what’s available.
And as you can see, there are a number of language packs facilitating a variety of translations. So it’s worth noting that not all language packs are created equal. Typically more common languages, such as Italian or French, Spanish or German are going to be of a higher translation fidelity than less common languages, such as Galician or Latin. So for our example, we’ll be downloading and installing the Spanish-English language pack. Some of these language packs can get very large. To download this, we simply click the Download button and you’ll be taken to the Apache-Joshua Dropbox site, where you can download this tgz file.
So I’ve already downloaded and unzipped the Spanish-English Apache-Joshua pack. Unzipped, this has a number of files in it. Before we install this into AEM, we’ll wanna make a change to the joshua.config. Open this in our favorite text editor and we are going to scroll down until we find the lines that start with feature-function equals language model, followed by some parameters. So as you can see here, line 140 starts with feature-function equals language model. So we wanna make sure that we put a pound sign in front of this line to comment it out, as well as the line two below it. Again, this line starts with feature-function equals language model, therefore we need to comment this out. Don’t be confused with the lines similar to this in the instructional text above. We have two lines that start with feature-function language model within this comment block. These are already commented out. Make sure that you don’t confuse the two lines below with these lines and forget to comment out your feature-function equals language model lines. This is very important to make sure that the appropriate lines are commented out below. As soon as we’ve made these changes, we can save them. We’ll head back to our Finder. And the next thing we wanna do is figure out how large this model’s directory is. So I can find the size of this folder, which is about nine gigabytes on disk. So this is important because when we restart AEM, we wanna make sure that we add additional space to the heap to accommodate these language pack files. So a good rule of thumb for each language pack is to take the size of the model folder and round up to the nearest two gigabytes. So because we’re at about nine gigabytes, we’ll be rounding up to 10 gigabytes and we’ll want to add that extra 10 gigabytes to our heap for this particular language pack.
Note if you’re using multiple language packs, you need to take into account the multiple folder sizes into the additional heap size. So last thing we need to do is copy this language pack over into our AEM’s opt folder. This is not 100% necessary, but it’s best practice to move the language pack files to the opt folder for consistency and convention. So I’ll cd into the opt folder for my quickstart and I will simply copy the apache-joshua-es-en folder into a folder named es-en in my opt folder. Do note that the AEM process must have read access to this folder and the files there.
The next thing we wanna do is restart our AEM instance. So I can simply stop my AEM instance and restart it. So as you can see, I started my AEM instance originally with two gig of heap space. When we’re running our language packs, we’ll want to make sure that we increase this to account for those language packs. Before we saw that our language pack rounded up to 10 gigabytes, so we’ll want to increase our heap size from our original two gig to have an additional 10 gig for a grand total of 12 gig. So let’s start AEM back up. Once AEM is started back up, we’ll head back over to the Configuration Manager in the Web Console, and we’ll look for the Apache Jackrabbit Oak Machine Translation Fulltext Query Terms Provider. And as you can see, this is an OSGi configuration factory, so we will want to add a configuration for each language pack we want to use. So I can click the plus. It’s going to ask for the path to the joshua.config file.
We have our joshua.config, so we simply have to provide the absolute path to this file.
We need to register the node types that will be candidates for this Smart Translation. The two most common node types are dam:Asset and cq:Page. For this sample, we’ll just add the single dam:Asset. However, multiple node types can be added.
Each translated word or phrase is assigned a confidence score. The confidence score is between zero and one, with zero being the least confident and one being the most confident. So the closer the score to zero, the more lenient the Smart Translation Search will be in terms of finding translation matches. It’s typically better to start the minimum score lower and slowly increase it if you see irrelevant results coming from your Smart Translation Search. We’re gonna set our minimum score to 0.2.
Any changes to that configuration or to the Joshua files require a re-save of this config. So now we should be good to test this out. So let’s head back to AEM. Let’s go to our Assets since we attached this to the dam:Asset node type, and let’s perform a full text search. So in this case, let’s put “running” in English. We have a number of search results coming up for running. Now let’s try searching for running using the Spanish term. And there you go. Our Smart Translation Search kicked in and it has automatically and intelligently translated the Spanish term for running and matched it to the English term for running, which is the actual term on our content.
Smart Translation Search must be set up on each AEM instance that requires it.
Download and install the Oak Search Machine Translation OSGi bundle
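Download the oak-search-mt bundle whose version matches the Oak version of the AEM instance, then install it into AEM via /system/console/bundles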
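As the video notes, the bundle version should match the version of the Oak Core bundle running in AEM (1.8.0 in the walkthrough). The sketch below builds a download URL for a given Oak version. It assumes the standard Maven Central repository layout; the function name `oak_search_mt_jar_url` is hypothetical, and you should confirm that the artifact actually exists for your version before relying on the URL.

```python
def oak_search_mt_jar_url(oak_version: str) -> str:
    """Build the Maven Central URL of the oak-search-mt jar for an Oak version.

    Assumes the standard Maven repository layout:
    <base>/<group path>/<artifact>/<version>/<artifact>-<version>.jar
    """
    base = "https://repo1.maven.org/maven2"
    group_path = "org/apache/jackrabbit"  # groupId org.apache.jackrabbit
    artifact = "oak-search-mt"
    return f"{base}/{group_path}/{artifact}/{oak_version}/{artifact}-{oak_version}.jar"


# Version taken from the Oak Core bundle shown in the AEM Web Console.
print(oak_search_mt_jar_url("1.8.0"))
```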
Download and update the Apache Joshua language packs
Download and unzip the desired Apache Joshua language packs.
Edit the joshua.config file and comment out the 2 lines that begin with:
feature-function = LanguageModel ...
Determine and record the size of the language pack’s model folder, as this influences how much extra heap space AEM will require.
Move the unzipped Apache Joshua language pack folder (with the joshua.config edits) to
.../crx-quickstart/opt/<source_language-target_language>
For example:
.../crx-quickstart/opt/es-en
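The joshua.config edit described above can also be scripted. The following is a minimal sketch, assuming the active lines look like `feature-function = LanguageModel ...` and that lines already starting with `#` (such as those in the instructional comment block) must be left alone. The function name and the sample parameters are hypothetical and only illustrate the shape of the file.

```python
import re

# Matches active (uncommented) lines such as:
#   feature-function = LanguageModel <parameters>
# Lines that already start with '#' do not match.
_LM_LINE = re.compile(r"^\s*feature-function\s*=\s*LanguageModel\b")


def comment_out_language_models(config_text: str) -> str:
    """Return config_text with active LanguageModel feature lines commented out."""
    out = []
    for line in config_text.splitlines(keepends=True):
        out.append("# " + line if _LM_LINE.match(line) else line)
    return "".join(out)


# Hypothetical excerpt; real joshua.config parameters will differ.
sample = (
    "# feature-function = LanguageModel (example in the comment block)\n"
    "feature-function = LanguageModel -lm_order 5\n"
    "top-n = 1\n"
    "feature-function = LanguageModel -lm_order 3\n"
)
print(comment_out_language_models(sample))
```

Running the function twice is harmless, because lines that are already commented no longer match the pattern. Review the result by hand before copying the pack into the opt folder.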
Restart AEM with updated heap memory allocation
Stop AEM
Determine the new required heap size for AEM
AEM’s pre-language-pack heap size + the size of the model directory rounded up to the nearest 2GB
For example: if, before any language packs are added, the AEM installation requires 8GB of heap to run, and the language pack’s model folder is 3.8GB uncompressed, the new heap size is the original 8GB + (3.8GB rounded up to the nearest 2GB, which is 4GB), for a total of 12GB
Verify the machine has this amount of extra available memory.
Update AEM’s start-up scripts to adjust for the new heap size
java -Xmx12g -jar cq-author-p4502.jar
Restart AEM with the increased heap size.
The required heap space for language packs can grow large, especially when multiple language packs are used.
Always make sure the instance has enough memory to accommodate the increases in allocated heap space.
The base heap must always be calculated to support acceptable performance without any language packs installed.
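The heap rule of thumb above is simple arithmetic and can be sketched as follows. The function names are hypothetical. Two assumptions are made here: the folder size is converted using 1 GB = 1024³ bytes, and with several language packs each model folder is rounded up separately before summing. Adjust either to match how you measure and plan capacity.

```python
import math
import os
from typing import Iterable


def dir_size_gb(path: str) -> float:
    """Total size of all regular files under path, in GB (1 GB = 1024**3 bytes)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):  # skip symlinks to avoid double counting
                total += os.path.getsize(full)
    return total / 1024**3


def extra_heap_gb(model_size_gb: float, step_gb: int = 2) -> int:
    """Round a model folder size up to the nearest multiple of step_gb."""
    return math.ceil(model_size_gb / step_gb) * step_gb


def required_heap_gb(base_heap_gb: int, model_sizes_gb: Iterable[float]) -> int:
    """Base heap plus the rounded-up size of every language pack's model folder."""
    return base_heap_gb + sum(extra_heap_gb(size) for size in model_sizes_gb)


# Written example: 8GB base heap, 3.8GB model folder.
print(required_heap_gb(8, [3.8]))  # 12
# Video example: 2GB base heap, model folder of about 9GB.
print(required_heap_gb(2, [9]))    # 12
```

The resulting number is what goes into the `-Xmx` value of the start-up command, for example `-Xmx12g`.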
Register the language packs via Apache Jackrabbit Oak Machine Translation Full-text Query Terms Provider OSGi configurations
For each language pack, create a new Apache Jackrabbit Oak Machine Translation Full-text Query Terms Provider OSGi configuration via the AEM Web Console’s Configuration manager.
Joshua Config Path is the absolute path to the joshua.config file. The AEM process must be able to read all files in the language pack’s folder.
Node types are the candidate node types whose full-text search will engage this language pack for translation.
Minimum score is the minimum confidence score a translated term must have in order to be used. For example, “hombre” (Spanish for “man”) may translate to the English word “man” with a confidence score of 0.9 and also translate to the English word “human” with a confidence score of 0.2. Tuning the minimum score to 0.3 would keep the “hombre” to “man” translation, but discard the “hombre” to “human” translation, as its score of 0.2 is less than the minimum score of 0.3.
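The effect of the minimum score can be illustrated with a small filter. This is only a sketch of the idea, not the Oak implementation: the function name is hypothetical, the scores are the ones from the “hombre” example, and whether a score exactly equal to the minimum is kept is an assumption made here.

```python
def filter_translations(candidates: dict, min_score: float) -> dict:
    """Keep only candidate translations whose confidence meets the minimum score.

    Assumption: a score equal to min_score is kept; the actual behavior at the
    boundary is not stated in the documentation.
    """
    return {term: score for term, score in candidates.items() if score >= min_score}


# Candidate English translations for the Spanish term "hombre".
hombre = {"man": 0.9, "human": 0.2}
print(filter_translations(hombre, 0.3))  # {'man': 0.9}
```

Lowering the minimum score to 0.2 or below in this example would let both translations through, which is why a lower threshold produces more lenient matching.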
Perform a full-text search against assets
Updating language packs
Apache Joshua language packs are wholly maintained by the Apache Joshua project, and their updating or correction is at the discretion of the Apache Joshua project.
If a language pack is updated, steps 2 - 4 above must be followed in order to install the updates in AEM, adjusting the heap size up or down as needed.
If AEM does not require a restart, then the relevant Apache Jackrabbit Oak Machine Translation Fulltext Query Terms Provider OSGi configuration(s) that pertain to the updated language pack(s) must be re-saved so AEM processes the updated files.
In order for AEM Smart Tags to be affected by AEM Smart Translation, AEM’s /oak:index/damAssetLucene index must be updated to mark the predictedTags (the system name for “Smart Tags”) to be part of the Asset’s aggregate Lucene index.
Under /oak:index/damAssetLucene/indexRules/dam:Asset/properties/predictedTags, ensure the configuration is as follows:
<damAssetLucene jcr:primaryType="oak:QueryIndexDefinition">
<indexRules jcr:primaryType="nt:unstructured">
<dam:Asset jcr:primaryType="nt:unstructured">
<properties jcr:primaryType="nt:unstructured">
...
<predictedTags
jcr:primaryType="nt:unstructured"
isRegexp="{Boolean}true"
name="jcr:content/metadata/predictedTags/*/name"
useInSpellcheck="{Boolean}true"
useInSuggest="{Boolean}true"
analyzed="{Boolean}true"
nodeScopeIndex="{Boolean}true"/>