Set up Smart Translation Search with AEM Assets

Smart Translation Search lets non-English search terms resolve to English content. To set up AEM for Smart Translation Search, install and configure the Apache Oak Search Machine Translation OSGi bundle, along with the relevant free, open-source Apache Joshua language packs that contain the translation rules.

Transcript
Hi, let’s take a look at how to set up Smart Translation Search for AEM. The first thing we need to do is download the Oak Search Machine Translation bundle. To do this, we head over to search.maven.org and we wanna search for oak-search-mt, for machine translation. You’ll see we have our oak-search-mt ArtifactId, it’s part of the org.apache.jackrabbit group, and we have a number of versions. Let’s check out the versions. So what version do we wanna pick? To determine this, we need to figure out what version of Oak our AEM instance is running. We can head back over to AEM, open up the Web Console, open OSGi bundles, and search for Oak Core. So over here you’ll see the version of the Oak Core bundle is 1.8.0, and that means the version of Oak we’re running is 1.8.0. So what we wanna do is download the corresponding version. I download the jar for 1.8.0. We can head back to the Web Console, staying on the Bundles tab, and we can install this new bundle.
Search for Machine Translation. We can see that we have this bundle installed now at version 1.8.0 and it is active. This bundle is ready to go. So the next thing we wanna do is download our language packs. The language packs are provided by the Apache Joshua project, so we can head over to their website to see what’s available.
And as you can see, there are a number of language packs facilitating a variety of translations. It’s worth noting that not all language packs are created equal. Typically more common languages, such as Italian, French, Spanish or German, are going to be of a higher translation fidelity than less common languages, such as Galician or Latin. So for our example, we’ll be downloading and installing the Spanish-English language pack. Some of these language packs can get very large. To download this, we simply click the Download button and you’ll be taken to the Apache Joshua Dropbox site, where you can download this tgz file.
So I’ve already downloaded and unzipped the Spanish-English Apache Joshua pack. Unzipped, this has a number of files in it. Before we install this into AEM, we’ll wanna make a change to the joshua.config. Open this in our favorite text editor and scroll down until we find the lines that start with feature-function equals LanguageModel, followed by some parameters. So as you can see here, line 140 starts with feature-function equals LanguageModel. We wanna make sure that we put a pound sign in front of this line to comment it out, as well as the line two below it. Again, this line starts with feature-function equals LanguageModel, therefore we need to comment this out. Don’t be confused by the similar lines in the instructional text above. We have two lines that start with feature-function LanguageModel within this comment block. These are already commented out. Make sure that you don’t confuse the two lines below with these lines and forget to comment out your feature-function equals LanguageModel lines. It is very important to make sure that the appropriate lines are commented out below. As soon as we’ve made these changes, we can save them. We’ll head back to our Finder. And the next thing we wanna do is figure out how large the model directory is. So I can find the size of this folder, which is about nine gigabytes on disk. This is important because when we restart AEM, we wanna make sure that we add additional space to the heap to accommodate these language pack files. A good rule of thumb for each language pack is to take the size of the model folder and round up to the nearest two gigabytes. So because we’re at about nine gigabytes, we’ll be rounding up to 10 gigabytes and we’ll want to add that extra 10 gigabytes to our heap for this particular language pack.
Note if you’re using multiple language packs, you need to take the multiple folder sizes into account in the additional heap size. So the last thing we need to do is copy this language pack over into our AEM’s opt folder. This is not 100% necessary, but it’s a best practice to move the language pack files to the opt folder for consistency and convention. So I’ll cd into the opt folder for my quickstart and I will simply copy the apache-joshua-es-en folder into a folder named es-en in my opt folder. Do note that the AEM process must have read access to this folder and the files there.
The next thing we wanna do is restart our AEM instance. So I can simply stop my AEM instance and restart it. As you can see, I originally started my AEM instance with two gig of heap space. When we’re running our language packs, we’ll want to make sure that we increase this to account for those language packs. Before, we saw that our language pack rounded up to 10 gigabytes, so we’ll want to increase our heap size from our original two gig to have an additional 10 gig, for a grand total of 12 gig. So let’s start AEM back up. Once AEM is started back up, we’ll head back over to the Configuration Manager in the Web Console, and we’ll look for the Apache Jackrabbit Oak Machine Translation Fulltext Query Terms Provider. And as you can see, this is an OSGi configuration factory, so we will want to add a configuration for each language pack we want to use. So I can click the plus. It’s going to ask for the path to the joshua.config file.
We have our joshua.config, so we simply have to provide the absolute path to this file.
We need to register the node types that will be candidates for this Smart Translation. The two most common node types are dam:Asset and cq:Page. For this example, we’ll just add the single dam:Asset. However, multiple node types can be added.
Each translated word or phrase is assigned a confidence score. The confidence score is between zero and one, with zero being the least confident and one being the most confident. So the closer the minimum score is to zero, the more lenient the Smart Translation Search will be in terms of finding translation matches. It’s typically better to start the minimum score lower and slowly increase it if you see irrelevant results coming from your Smart Translation Search. We’re gonna set our minimum score to 0.2.
Any changes to that configuration or to the Joshua files require a re-save of this config. So now we should be good to test this out. Let’s head back to AEM. Let’s go to our Assets, since we attached this to the dam:Asset node type, and let’s perform a full-text search. So in this case, let’s put running in English. We have a number of search results coming up for running. Now let’s try searching for running using the Spanish term. And there you go. Our Smart Translation Search kicked in and it has automatically and intelligently translated the Spanish term for running and matched it to the English term for running, which is the actual term on our content.
NOTE
Smart Translation Search must be set up on each AEM instance that requires it.
  1. Download and install the Oak Search Machine Translation OSGi bundle

  2. Download and update the Apache Joshua language packs

    • Download and unzip the desired Apache Joshua language packs.

    • Edit the joshua.config file and comment out the 2 lines that begin with:

      feature-function = LanguageModel ...
      
    • Determine and record the size of the language pack’s model folder, as this influences how much extra heap space AEM will require.

    • Move the unzipped Apache Joshua language pack folder (with the joshua.config edits) to

      .../crx-quickstart/opt/<source_language-target_language>
      

      For example:

      
       .../crx-quickstart/opt/es-en
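
The joshua.config edit in this step can be scripted. The following is a minimal Python sketch, not part of the Apache Joshua or AEM tooling. The function name `comment_out_language_models` is hypothetical, and the sketch assumes the lines to disable begin with `feature-function = LanguageModel` exactly as shown above, with only optional leading whitespace. Lines that are already commented out, such as those in the instructional comment block, are left untouched.

```python
import re

# Matches only active (uncommented) LanguageModel feature-function lines.
# Assumption: the key is written as "feature-function = LanguageModel".
LM_LINE = re.compile(r"^\s*feature-function\s*=\s*LanguageModel\b")


def comment_out_language_models(config_text: str) -> str:
    """Return config_text with each active LanguageModel line prefixed by '# '."""
    result = []
    for line in config_text.splitlines(keepends=True):
        if LM_LINE.match(line):
            # Put a pound sign in front of the line to comment it out.
            line = "# " + line
        result.append(line)
    return "".join(result)
```

To apply it to a file, read the text of joshua.config, pass it through the function, and write the result back. Keeping a backup copy of the original file first is a sensible precaution.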
      
  3. Restart AEM with updated heap memory allocation

    • Stop AEM

    • Determine the new required heap size for AEM

      • AEM’s pre-language-pack heap size + the size of the model directory rounded up to the nearest 2GB

      • For example: If pre-language packs the AEM installation requires 8GB of heap to run, and the language pack’s model folder is 3.8GB uncompressed, the new heap size is:

        The original 8GB + (3.8GB rounded up to the nearest 2GB, which is 4GB) for a total of 12GB

    • Verify the machine has this amount of extra available memory.

    • Update AEM’s start-up scripts to adjust for the new heap size

      • Ex. java -Xmx12g -jar cq-author-p4502.jar
    • Restart AEM with the increased heap size.

    NOTE
    The required heap space for language packs can grow large, especially when multiple language packs are used.
    Always make sure the instance has enough memory to accommodate the increases in allocated heap space.
    The base heap must always be calculated to support acceptable performance without any language packs installed.
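
The heap calculation in step 3 can be expressed as a short Python sketch. The helper names below are hypothetical. The sketch assumes sizes are measured in units of 1024³ bytes and that, with multiple language packs, each pack's model folder is rounded up separately before being added. The source only says that all folder sizes must be taken into account, so treat the per-pack rounding as one reasonable reading of the rule of thumb.

```python
import math
import os


def directory_size_bytes(path: str) -> int:
    """Sum the sizes of all regular files under path (symlinks are skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def bytes_to_gb(size_bytes: int) -> float:
    """Convert bytes to gigabytes, assuming 1GB = 1024**3 bytes."""
    return size_bytes / 1024**3


def required_heap_gb(base_heap_gb, model_sizes_gb, step_gb=2):
    """Base heap plus each model folder size rounded up to a multiple of step_gb."""
    extra = sum(math.ceil(size / step_gb) * step_gb for size in model_sizes_gb)
    return base_heap_gb + extra


# The example from step 3: 8GB base heap and a 3.8GB model folder.
new_heap = required_heap_gb(8, [3.8])  # 8 + 4 = 12
```

The result maps directly onto the `-Xmx` value in the start-up script, for example `-Xmx12g` for a result of 12.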
  4. Register the language packs via Apache Jackrabbit Oak Machine Translation Full-text Query Terms Provider OSGi configurations

    • For each language pack, create a new Apache Jackrabbit Oak Machine Translation Full-text Query Terms Provider OSGi configuration via the AEM Web Console’s Configuration manager.

      • Joshua Config Path is the absolute path to the joshua.config file. The AEM process must be able to read all files in the language pack’s folder.

      • Node types are the candidate node types whose full-text search will engage this language pack for translation.

      • Minimum score is the minimum confidence score for a translated term for it to be used.

        • For example, hombre (Spanish for “man”) may translate to the English word “man” with a confidence score of 0.9 and also to the English word “human” with a confidence score of 0.2. Setting the minimum score to 0.3 would keep the “hombre” to “man” translation but discard the “hombre” to “human” translation, because its score of 0.2 is less than the minimum score of 0.3.
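
The effect of the Minimum score setting can be illustrated with a few lines of Python. This is only a model of the behavior described above, not the Oak implementation: the function name `filter_translations` and the (term, score) pair representation are assumptions made for the example.

```python
def filter_translations(candidates, min_score):
    """Keep translated terms whose confidence score is at least min_score.

    candidates is a list of (translated_term, confidence_score) pairs.
    """
    return [term for term, score in candidates if score >= min_score]


# Candidate translations for "hombre", using the scores from the example above.
hombre_candidates = [("man", 0.9), ("human", 0.2)]

kept_at_03 = filter_translations(hombre_candidates, 0.3)  # ["man"]
kept_at_02 = filter_translations(hombre_candidates, 0.2)  # ["man", "human"]
```

A lower minimum score admits more candidate terms and so returns more, but potentially less relevant, search results.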
  5. Perform a full-text search against assets

    • Because dam:Asset is the node type this language pack is registered against, we must search for AEM Assets using full-text search to validate the setup.
    • Navigate to AEM > Assets and open Omnisearch. Search for a term in the language whose language pack was installed.
    • As needed, tune the Minimum Score in the OSGi configurations to ensure the accuracy of results.
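
As an alternative to Omnisearch, a full-text query can also be issued through AEM's QueryBuilder JSON servlet. The Python sketch below only builds the request URL. It assumes a local author instance at http://localhost:4502 with the servlet reachable at /bin/querybuilder.json, and it assumes that translation applies to this query path in the same way as to Omnisearch, which the steps above do not confirm. The function name and the search term "correr" are hypothetical examples.

```python
from urllib.parse import urlencode


def build_fulltext_query_url(base_url, term,
                             path="/content/dam", node_type="dam:Asset"):
    """Build a QueryBuilder URL for a full-text search restricted to one node type."""
    params = {
        "path": path,        # where to search
        "type": node_type,   # node type the language pack is registered against
        "fulltext": term,    # the non-English search term
        "p.limit": "10",     # cap the number of results returned
    }
    return f"{base_url}/bin/querybuilder.json?{urlencode(params)}"


# Assumed local author instance and an example Spanish search term.
url = build_fulltext_query_url("http://localhost:4502", "correr")
```

Opening the resulting URL in a browser session that is logged in to AEM returns the matching assets as JSON, which can be compared against the results for the equivalent English term.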
  6. Updating language packs

    • Apache Joshua language packs are wholly maintained by the Apache Joshua project, and their updating or correction is at the discretion of the Apache Joshua project.

    • If a language pack is updated, then to install the updates in AEM, steps 2 - 4 above must be followed, adjusting the heap size up or down as needed.

      • Note that when moving the unzipped language pack to the crx-quickstart/opt folder, move any existing language pack folder out of the way before copying in the new one.
    • If AEM does not require a restart, then the relevant Apache Jackrabbit Oak Machine Translation Fulltext Query Terms Provider OSGi configuration(s) that pertain to the updated language pack(s) must be re-saved so AEM processes the updated files.

Updating damAssetLucene Index

In order for AEM Smart Tags to be affected by AEM Smart Translation, AEM’s /oak:index/damAssetLucene index must be updated to mark predictedTags (the system name for “Smart Tags”) as part of the Asset’s aggregate Lucene index.

Under /oak:index/damAssetLucene/indexRules/dam:Asset/properties/predictedTags, ensure the configuration is as follows:

 <damAssetLucene jcr:primaryType="oak:QueryIndexDefinition">
        <indexRules jcr:primaryType="nt:unstructured">
            <dam:Asset jcr:primaryType="nt:unstructured">
                <properties jcr:primaryType="nt:unstructured">
                    ...
                    <predictedTags
                        jcr:primaryType="nt:unstructured"
                        isRegexp="{Boolean}true"
                        name="jcr:content/metadata/predictedTags/*/name"
                        useInSpellcheck="{Boolean}true"
                        useInSuggest="{Boolean}true"
                        analyzed="{Boolean}true"
                        nodeScopeIndex="{Boolean}true"/>
                </properties>
            </dam:Asset>
        </indexRules>
 </damAssetLucene>

Additional Resources
