Content and Commerce AI is in beta. The documentation is subject to change.
The keyword extraction service automatically extracts the keywords or keyphrases that best describe the subject of a text document. To extract keywords, it uses a combination of named entity recognition (NER) and unsupervised keyword extraction algorithms.
The named entities recognized by Content and Commerce AI are listed in the following table:
Entity name | Description |
---|---|
PERSON | People, including fictional. |
NORP | Nationalities or religious or political groups. |
GPE | Countries, cities, and states. |
LOC | Non-GPE locations, mountain ranges, bodies of water. |
FAC | Buildings, airports, highways, bridges, etc. |
ORG | Companies, agencies, institutions, etc. |
PRODUCT | Objects, vehicles, foods, etc. (Not services.) |
EVENT | Named hurricanes, battles, wars, sports events, etc. |
WORK_OF_ART | Titles of books, songs, etc. |
LAW | Named documents made into laws. |
LANGUAGE | Any named language. |
If you plan on processing PDFs, skip to the instructions for PDF keyword extraction within this document. Support for additional file types such as docx, ppt, and xml is set to be released at a later date.
API format
POST /services/v1/predict
Request
The following request extracts keywords from a document based on the input parameters provided in the payload.
Simplified JSON of the input file:
{
"application-id": "1234",
"language": "en",
"content-type": "inline",
"encoding": "utf-8",
"threshold": 0.01,
"top-N": 10,
"custom": {
"min-n": 2,
"entity-types": ["PERSON"]
},
"data": [
{
"content-id": "abc123",
"content": "But an influential faction on the ATP player council, which is chaired by Novak Djokovic, staged a rebellion against Kermodes regime in the spring, and he will leave the post on Dec 31"
}
]
}
See the table below the example payload for more information on the input parameters shown.
analyzer_id determines which Sensei Content Framework is used. Check that you have the proper analyzer_id before making your request. For the keyword extraction service, the analyzer_id is:
Feature:cintel-ner:Service-1a35aefb0f0f4dc0a3b5262370ebc709
curl -w'\n' -i -X POST https://sensei.adobe.io/services/v1/predict \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "Content-Type: multipart/form-data" \
-H "cache-control: no-cache,no-cache" \
-H "x-api-key: {API_KEY}" \
-F file="{
\"application-id\": \"1234\",
\"language\": \"en\",
\"content-type\": \"inline\",
\"encoding\": \"utf-8\",
\"threshold\": 0.01,
\"top-N\": 10,
\"custom\": {
\"min-n\": 2,
\"entity-types\": [\"PERSON\"]
},
\"data\": [{
\"content-id\": \"abc123\",
\"content\": \"But an influential faction on the ATP player council, which is chaired by Novak Djokovic, staged a rebellion against Kermodes regime in the spring, and he will leave the post on Dec 31\"
}]
}" \
-F 'contentAnalyzerRequests={
"enable_diagnostics":"true",
"requests":[{
"analyzer_id": "Feature:cintel-ner:Service-1a35aefb0f0f4dc0a3b5262370ebc709",
"parameters": {}
}]
}'
Property | Description | Mandatory |
---|---|---|
analyzer_id | The Sensei service ID that your request is deployed under. This ID determines which of the Sensei Content Frameworks are used. For custom services, please contact the Content and Commerce AI team to set up a custom ID. | Yes |
application-id | The ID of the created application. | Yes |
data | An array of JSON objects, with each object in the array representing a document. Any parameters passed as part of this array override the global parameters specified outside the data array. Any of the remaining properties outlined below in this table can be overridden from within data. | Yes |
language | Language of input text. The default value is en. | No |
content-type | Used to indicate whether the input is part of the request body or a signed URL for an S3 bucket. The default for this property is inline. | Yes |
encoding | The encoding format of input text. This can be utf-8 or utf-16. The default for this property is utf-8. | No |
threshold | The score threshold (0 to 1) above which results are returned. Use the value 0 to return all results. The default for this property is 0. | No |
top-N | The number of results to be returned (cannot be a negative integer). Use the value 0 to return all results. When used in conjunction with threshold, the number of results returned is the lesser of either limit set. The default for this property is 0. | No |
custom | Any custom parameters to be passed. This property requires a valid JSON object to function. See the appendix for more information on the custom parameters. | No |
content-id | The unique ID for the data element that is returned in the response. If this is not passed, an auto-generated ID is assigned. | No |
content | The content used by the keyword extraction service. The content can be raw text ('inline' content-type). If the content is a file on S3 ('s3-bucket' content-type), pass the signed URL. When the content is part of the request body, the list of data elements should have only one object. If more than one object is passed, only the first object is processed. | Yes |
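The parameter rules in the table above can be checked before a request is sent. The following is a minimal Python sketch, not part of the service or any SDK: the function name `build_inline_payload` and its arguments are hypothetical, and the checks only mirror the constraints stated in the table (threshold between 0 and 1, non-negative top-N, utf-8 or utf-16 encoding, a single data object for inline content).

```python
import json


def build_inline_payload(application_id, text, content_id=None, language="en",
                         encoding="utf-8", threshold=0.0, top_n=0, custom=None):
    """Assemble a keyword extraction payload for inline text (hypothetical helper)."""
    if not 0 <= threshold <= 1:
        raise ValueError("threshold must be between 0 and 1")
    if not isinstance(top_n, int) or top_n < 0:
        raise ValueError("top-N must be a non-negative integer")
    if encoding not in ("utf-8", "utf-16"):
        raise ValueError("encoding must be utf-8 or utf-16")

    document = {"content": text}
    if content_id is not None:
        # If omitted, the table states that an ID is auto-generated.
        document["content-id"] = content_id

    payload = {
        "application-id": application_id,
        "language": language,
        "content-type": "inline",
        "encoding": encoding,
        "threshold": threshold,
        "top-N": top_n,
        # For inline content only the first object in "data" is processed,
        # so this sketch always sends exactly one.
        "data": [document],
    }
    if custom:
        payload["custom"] = custom
    return payload


payload = build_inline_payload(
    "1234",
    "Novak Djokovic chairs the ATP player council.",
    content_id="abc123",
    threshold=0.01,
    top_n=10,
    custom={"min-n": 2, "entity-types": ["PERSON"]},
)
print(json.dumps(payload, indent=2))
```

The resulting JSON string can be passed as the file form field in the curl request shown earlier.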
Response
A successful response returns a JSON object containing the extracted keywords in the response array.
{
"status": 200,
"cas_responses": [
{
"status": 200,
"analyzer_id": "Feature:cintel-ner:Service-1a35aefb0f0f4dc0a3b5262370ebc709",
"content_id": "",
"result": {
"response_type": "feature",
"response": [
{
"feature_value": [
{
"feature_value": "success",
"feature_name": "status"
},
{
"feature_name": "labels",
"feature_value": [
{
"feature_name": "atp player",
"feature_value": [
{
"feature_value": "KEYWORD",
"feature_name": "type"
},
{
"feature_value": 0.007743432063478832,
"feature_name": "score"
}
]
},
{
"feature_name": "Novak Djokovic",
"feature_value": [
{
"feature_name": "type",
"feature_value": "PERSON"
},
{
"feature_name": "score",
"feature_value": 0
}
]
},
{
"feature_value": [
{
"feature_name": "type",
"feature_value": "KEYWORD"
},
{
"feature_value": 0.00899321792126428,
"feature_name": "score"
}
],
"feature_name": "player council"
},
{
"feature_value": [
{
"feature_value": "KEYWORD",
"feature_name": "type"
},
{
"feature_value": 0.007743432063478832,
"feature_name": "score"
}
],
"feature_name": "kermodes regime"
},
{
"feature_value": [
{
"feature_name": "type",
"feature_value": "KEYWORD"
},
{
"feature_name": "score",
"feature_value": 0.0006052376660884209
}
],
"feature_name": "atp player council"
}
]
}
],
"feature_name": "abc123"
}
]
}
}
],
"error": []
}
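Because every level of the response is expressed as feature_name and feature_value pairs, it can help to flatten it before use. The following Python sketch assumes the response has already been parsed into a dictionary with the shape shown above; the function name `extract_keywords` and the record layout are illustrative choices, not part of the API.

```python
def extract_keywords(response):
    """Flatten nested feature_name/feature_value pairs into simple records."""
    records = []
    for cas in response.get("cas_responses", []):
        for document in cas.get("result", {}).get("response", []):
            content_id = document["feature_name"]
            # Each document holds a list of features such as "status" and "labels".
            features = {f["feature_name"]: f["feature_value"]
                        for f in document["feature_value"]}
            for label in features.get("labels", []):
                # Each label holds its own list of attributes ("type", "score").
                attrs = {a["feature_name"]: a["feature_value"]
                         for a in label["feature_value"]}
                records.append({
                    "content_id": content_id,
                    "keyword": label["feature_name"],
                    "type": attrs.get("type"),
                    "score": attrs.get("score"),
                })
    return records


# A trimmed copy of the example response, keeping two of the labels.
sample = {
    "status": 200,
    "cas_responses": [{
        "status": 200,
        "result": {
            "response_type": "feature",
            "response": [{
                "feature_name": "abc123",
                "feature_value": [
                    {"feature_name": "status", "feature_value": "success"},
                    {"feature_name": "labels", "feature_value": [
                        {"feature_name": "Novak Djokovic", "feature_value": [
                            {"feature_name": "type", "feature_value": "PERSON"},
                            {"feature_name": "score", "feature_value": 0},
                        ]},
                        {"feature_name": "player council", "feature_value": [
                            {"feature_name": "type", "feature_value": "KEYWORD"},
                            {"feature_name": "score", "feature_value": 0.00899321792126428},
                        ]},
                    ]},
                ],
            }],
        },
    }],
    "error": [],
}

records = extract_keywords(sample)
for record in records:
    print(record)
```

Looking up attributes by feature_name rather than by position matters here, since the example response lists type and score in varying order.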
The keyword extraction service supports PDFs. However, you need to use a different analyzer_id for PDF files and change the document type to PDF. See the example below for more information.
API format
POST /services/v1/predict
Request
The following request extracts keywords from a PDF document based on the input parameters provided in the payload.
analyzer_id determines which Sensei Content Framework is used. Check that you have the proper analyzer_id before making your request. For PDF keyword extraction, the analyzer_id is:
Feature:cintel-ner:Service-7a87cb57461345c280b62470920bcdc5
curl -w'\n' -i -X POST https://sensei.adobe.io/services/v1/predict \
-H "Authorization: Bearer {ACCESS_TOKEN}" \
-H "Content-Type: multipart/form-data" \
-H "cache-control: no-cache,no-cache" \
-H "x-api-key: {API_KEY}" \
-F file=@TestPDF.pdf \
-F 'contentAnalyzerRequests={
"enable_diagnostics":"true",
"requests":[{
"analyzer_id": "Feature:cintel-ner:Service-7a87cb57461345c280b62470920bcdc5",
"parameters": {
"application-id": "1234",
"content-type": "file",
"encoding": "pdf",
"threshold": "0.01",
"top-N": "0",
"custom": {},
"data": [{
"content-id": "abc123",
"content": "file"
}]
}
}]
}'
Property | Description | Mandatory |
---|---|---|
analyzer_id | The Sensei service ID that your request is deployed under. This ID determines which of the Sensei Content Frameworks are used. For custom services, please contact the Content and Commerce AI team to set up a custom ID. | Yes |
application-id | The ID of the created application. | Yes |
data | An array of JSON objects, with each object in the array representing a document. Any parameters passed as part of this array override the global parameters specified outside the data array. Any of the remaining properties outlined below in this table can be overridden from within data. | Yes |
language | Language of input. The default value is en (English). | No |
content-type | Used to indicate the input's content type. This should be set to file. | Yes |
encoding | The encoding format of the input. This should be set to pdf. More encoding types are set to be supported at a later date. | Yes |
threshold | The score threshold (0 to 1) above which results are returned. Use the value 0 to return all results. The default for this property is 0. | No |
top-N | The number of results to be returned (cannot be a negative integer). Use the value 0 to return all results. When used in conjunction with threshold, the number of results returned is the lesser of either limit set. The default for this property is 0. | No |
custom | Any custom parameters to be passed. This property requires a valid JSON object to function. See the appendix for more information on the custom parameters. | No |
content-id | The unique ID for the data element that is returned in the response. If this is not passed, an auto-generated ID is assigned. | No |
content | This should be set to file. | Yes |
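Hand-written JSON inside a shell-quoted form field is easy to get wrong, so one option is to generate the contentAnalyzerRequests value programmatically. The following Python sketch is an illustration only: the function name `build_pdf_analyzer_requests` is hypothetical, and passing threshold and top-N as strings simply mirrors the example request rather than a documented requirement.

```python
import json

PDF_ANALYZER_ID = "Feature:cintel-ner:Service-7a87cb57461345c280b62470920bcdc5"


def build_pdf_analyzer_requests(application_id, content_id,
                                threshold=0.0, top_n=0, custom=None):
    """Serialize the contentAnalyzerRequests form field for a PDF request."""
    parameters = {
        "application-id": application_id,
        "content-type": "file",   # required value for PDF input
        "encoding": "pdf",        # required value for PDF input
        # The example request passes these two as strings; mirrored here.
        "threshold": str(threshold),
        "top-N": str(top_n),
        "custom": custom or {},
        "data": [{"content-id": content_id, "content": "file"}],
    }
    return json.dumps({
        "enable_diagnostics": "true",
        "requests": [{"analyzer_id": PDF_ANALYZER_ID, "parameters": parameters}],
    })


form_value = build_pdf_analyzer_requests("1234", "abc123", threshold=0.01)
print(form_value)
```

Since json.dumps produces the string, it is guaranteed to be syntactically valid JSON, which avoids problems such as stray trailing commas.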
Response
A successful response returns a JSON object containing the extracted keywords in the response array.
{
"statusCode": 200,
"body": {
"type": "JSON",
"matchType": "strict",
"json": {
"status": 200,
"content_id": "161hw2.pdf",
"cas_responses": [
{
"status": 200,
"analyzer_id": "Feature:cintel-ner:Service-7a87cb57461345c280b62470920bcdc5",
"content_id": "161hw2.pdf",
"result": {
"response_type": "feature",
"response": [
{
"feature_value": [
{
"feature_name": "status",
"feature_value": "success"
},
{
"feature_value": [
{
"feature_name": "delbick",
"feature_value": [
{
"feature_name": "score",
"feature_value": 0.03673855028832046
},
{
"feature_name": "type",
"feature_value": "KEYWORD"
}
]
},
{
"feature_name": "Ci",
"feature_value": [
{
"feature_name": "score",
"feature_value": 0
},
{
"feature_name": "type",
"feature_value": "PERSON"
}
]
}
],
"feature_name": "labels"
}
],
"feature_name": "abc123"
}
]
}
}
],
"error": []
}
}
}
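In the example above, the service response is nested under body.json rather than at the top level. The following Python sketch shows one way to handle both shapes and then group the extracted labels by type. It assumes the structure shown in this document; the function names `unwrap_pdf_response` and `labels_by_type` are hypothetical.

```python
def unwrap_pdf_response(envelope):
    """Return the inner service response, whether or not it is wrapped in body.json."""
    body = envelope.get("body")
    if isinstance(body, dict) and "json" in body:
        return body["json"]
    return envelope


def labels_by_type(response):
    """Group label names by their "type" attribute (for example KEYWORD or PERSON)."""
    grouped = {}
    for cas in response.get("cas_responses", []):
        for document in cas.get("result", {}).get("response", []):
            for feature in document["feature_value"]:
                if feature["feature_name"] != "labels":
                    continue
                for label in feature["feature_value"]:
                    attrs = {a["feature_name"]: a["feature_value"]
                             for a in label["feature_value"]}
                    grouped.setdefault(attrs.get("type"), []).append(label["feature_name"])
    return grouped


# A trimmed copy of the example PDF response.
envelope = {
    "statusCode": 200,
    "body": {"type": "JSON", "matchType": "strict", "json": {
        "status": 200,
        "cas_responses": [{
            "status": 200,
            "result": {"response_type": "feature", "response": [{
                "feature_name": "abc123",
                "feature_value": [
                    {"feature_name": "status", "feature_value": "success"},
                    {"feature_name": "labels", "feature_value": [
                        {"feature_name": "delbick", "feature_value": [
                            {"feature_name": "score", "feature_value": 0.03673855028832046},
                            {"feature_name": "type", "feature_value": "KEYWORD"},
                        ]},
                        {"feature_name": "Ci", "feature_value": [
                            {"feature_name": "score", "feature_value": 0},
                            {"feature_name": "type", "feature_value": "PERSON"},
                        ]},
                    ]},
                ],
            }]},
        }],
        "error": [],
    }},
}

grouped = labels_by_type(unwrap_pdf_response(envelope))
print(grouped)
```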
For more information, visit the CCAI PDF extraction worker GitHub repository. It contains a sample of PDF extraction along with instructions on how to set up, deploy, and integrate with the AEM cloud service.
The following table contains the available parameters that can be used within custom.
Name | Description | Mandatory |
---|---|---|
min-n | The minimum number of words required in the keywords. | No |
entity-types | Types of entities to be returned. See the named entity recognition table at the beginning of this document. | No |
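As an illustration of how these two parameters fit together, the following Python sketch builds a custom object and rejects entity types that do not appear in the named entity table at the beginning of this document. The function name `build_custom` is hypothetical, and the check that min-n is at least 1 is an assumption of this sketch, not a documented rule.

```python
# Entity names taken from the named entity table at the beginning of this document.
ENTITY_TYPES = {
    "PERSON", "NORP", "GPE", "LOC", "FAC", "ORG",
    "PRODUCT", "EVENT", "WORK_OF_ART", "LAW", "LANGUAGE",
}


def build_custom(min_n=None, entity_types=None):
    """Build the optional "custom" object, validating entity type names."""
    custom = {}
    if min_n is not None:
        # Assumption: a minimum word count below 1 is treated as invalid here.
        if not isinstance(min_n, int) or min_n < 1:
            raise ValueError("min-n must be a positive integer")
        custom["min-n"] = min_n
    if entity_types is not None:
        unknown = sorted(set(entity_types) - ENTITY_TYPES)
        if unknown:
            raise ValueError(f"unknown entity types: {unknown}")
        custom["entity-types"] = list(entity_types)
    return custom


custom = build_custom(min_n=2, entity_types=["PERSON", "ORG"])
print(custom)
```

Both parameters are optional, so calling build_custom() with no arguments returns an empty object, which matches the empty custom value used in the PDF request example.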