Automate content extraction

Last update: April 11, 2024

Topics:
PDF Extract API

CREATED FOR:

Beginner
Developer

Transcript

Learn how to automate the extraction of content from a PDF document using the PDF Extract API. Extracting PDF content helps unlock critical business data, which can then be used for a variety of downstream processes. The PDF Extract API uses Adobe Sensei AI to extract PDF content along with its structure and reading order. There are many ways to invoke the Extract API, such as using a programming language with the REST API, or, as in our example here, using Power Automate, Microsoft’s low-code automation solution.

The first step to get started is to generate the credentials required to invoke Acrobat Services. To do so, go to developer.adobe.com. Select Create New Project, then Add API, Document Cloud, and PDF Services API, and then select Next. There are two options for authentication. The connectors for Power Automate have recently been updated to include OAuth. This is the preferred method, as JWT authentication is being deprecated. Select the Enterprise PDF Services developer profile, and then save the configured API. Next, you’ll need to generate an access token. Once you’ve generated the access token, you’ll copy the required information into the Adobe Services Power Automate connector configuration.

Now let’s go ahead and create a new flow. Select Automated Cloud Flow and give it a name. A flow can be triggered by many different events, but this flow is triggered by a new document being added to a SharePoint folder. This is the source PDF to extract content from. The parameters highlighted in red are the ones that need to be customized when extracting content from a PDF. For this SharePoint action, we need to input a SharePoint site address and a folder ID.

The second action in our flow calls the PDF Extract API using the Power Automate Acrobat Services connector. It requires two inputs: the document from which you want to extract content, which in this case is the document uploaded to the SharePoint folder, and an instruction that defines what is to be extracted, such as images or text. So we’ll go ahead and add a new connection to the Extract API. After adding a connection name, we’ll copy the values from the Acrobat Services credentials that we created into the fields required by the Acrobat Services Power Automate connector. Here are the client ID and client secret values that we copied, and now we can create our connection.

The last section in this flow creates a file in SharePoint, and the highlighted parameters show what the file should be named, what the content of the file is, and where it should be saved.

This is a standard PDF document. It contains several different kinds of content: text, headers, fonts, various text positions, images, and tables. The API allows you to select what you would like to extract. Uploading a PDF document into the SharePoint folder triggers the Power Automate flow. The next section then calls the PDF Extract API via the Power Automate Acrobat Services connector. The last section in the flow puts the results into another SharePoint folder.

Let’s go ahead and see the final generated JSON output. Here you can see the associated PDF and the corresponding content that has been extracted into the JSON output document. That’s how the PDF Extract API can help automate the extraction of content from long-form documents, such as contracts and reports, so that it can be used downstream in qualitative analysis processes.
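To make the "used downstream" step more concrete, here is a minimal Python sketch that reads a JSON file like the one the flow saves to SharePoint and groups the extracted text under the heading it follows. It is not part of the video. The sample data is made up, the helper names `is_heading` and `text_by_heading` are hypothetical, and the key names (`elements`, `Path`, `Text`, `Page`) are assumptions about the shape of the Extract output that should be checked against a real result.

```python
import json

# Made-up sample shaped like an Extract result. The key names ("elements",
# "Path", "Text", "Page") are assumptions; verify them against real output.
SAMPLE_OUTPUT = """
{
  "elements": [
    {"Path": "//Document/H1", "Text": "Service Agreement", "Page": 0},
    {"Path": "//Document/P", "Text": "This agreement is made between the parties.", "Page": 0},
    {"Path": "//Document/Figure", "Page": 0},
    {"Path": "//Document/H2", "Text": "Payment Terms", "Page": 1},
    {"Path": "//Document/P[2]", "Text": "Invoices are due within 30 days.", "Page": 1}
  ]
}
"""


def is_heading(path):
    """Return True if the last path segment looks like H1, H2, ... (hypothetical helper)."""
    tag = path.rsplit("/", 1)[-1]   # e.g. "H1" or "P[2]"
    tag = tag.split("[", 1)[0]      # drop an index suffix such as "[2]"
    return len(tag) == 2 and tag[0] == "H" and tag[1].isdigit()


def text_by_heading(extract_json):
    """Group extracted text under the most recent heading, in reading order."""
    data = json.loads(extract_json)
    sections = {}
    current = "(no heading)"
    for element in data.get("elements", []):
        text = element.get("Text")
        if text is None:
            continue  # skip elements without text, such as figures
        text = text.strip()
        if is_heading(element.get("Path", "")):
            current = text
            sections.setdefault(current, [])
        else:
            sections.setdefault(current, []).append(text)
    return sections


if __name__ == "__main__":
    for heading, paragraphs in text_by_heading(SAMPLE_OUTPUT).items():
        print(heading)
        for paragraph in paragraphs:
            print("  " + paragraph)
```

Grouping by heading is only one possible downstream use; the same loop could just as well filter by page or collect table elements, depending on what the flow was instructed to extract.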
