
Parsing PDF - As easy as working with JSON data with PDF Extract API

Last update: January 26, 2024

Topics: Developer Tools

Created for: Experienced, Developer

Learn how the PDF Extract service in Adobe PDF Services API makes parsing PDF as easy as parsing JSON. In this session, learn what PDF Extract API does, why it’s different from anything you’ve seen before, and how to use it to create unique PDF experiences without knowing anything about the PDF specification or PDF “internals”. With PDF Extract API, it’s just JSON!


https://video.tv.adobe.com/v/337600/?quality=12&learn=on&hidetitle=true

Transcript

All right, I think we can get started, Joel. So, hello everybody, and welcome to the session today, parsing PDF as easy as working with JSON data with PDF extract API. I’m Ben Vandenberg, Principal Developer Evangelist here at Adobe. And Joel, why don’t you introduce yourself? Sure, my name is Joel Geraci. I’m the founder of Practical PDF. I’m also the president and CEO, and that sounds more impressive than it really is when you realize I’m also the janitor. So it’s just me. Awesome. So I hope you’ve been enjoying some of the rest of Adobe Developers Live. Earlier today, we had a couple of sessions around using PDF services API, as well as PDF embed API. Please, if you didn’t check those out, go check those out after the recordings become available. But this is all about getting data out of your PDFs so that you can reuse some of that content. So Joel, why don’t you switch to the next slide? There you go. All right, cool. So where we are in this is, we’re actually in that last section over on the right, which is PDF extract API. Taking all that data from your PDFs, which oftentimes are difficult to extract out and be able to do that automatically. Now we’re using Adobe Sensei to be able to help facilitate that. But one of the cool things about this, which Joel is going to walk us through today, is all the different ways that PDF extract becomes powerful for you to be able to pull that information out, be able to use that for automating a number of actions and providing insight into how to work with PDFs and use things like PDF services, as well as how you can also embed some of those experiences into the website. So Joel, why don’t you walk us through some of the awesome things that happen with PDF extract API? Sure. I’ll actually be spending about the next 28 minutes doing exactly that. As Ben mentioned, I’m going to be talking about the PDF extract API. And it’s really one of the most amazing tools I’ve seen. I’ve been working in this business for about 25 years. 
One of the most amazing things about what extract does is it’ll give you that table and image data and structured information out of your PDF file, regardless of what kind of PDF file it happens to be. As you know, PDF files can come from a lot of different places. They can be completely digital documents or they can be scans. And a lot of the tools that are out there to help you extract information from a PDF file will do a really good job of figuring out the layout of a document, but maybe they’re not going to do so well on tables. Or they’re going to be able to identify OCR, but they’re not going to really deconstruct the page properly. Or if you’ve got a completely digitally created PDF file, maybe it’s going to read across the columns instead of down the column. So it’s really hit or miss what you’re going to get. Extract API is the best tool I’ve ever seen. And I’m going to show you a little bit more about how that works in a video that I’m going to play in just a minute. But for today’s demonstration portion, I’m going to be showing using a combination of our PDF embed API, that publish column there, as well as the output from extract API to drive a unique PDF experience. And this, I think will help you understand how you can simply work with the PDF content as though it was JSON, because by the time extract is done with it, it actually is JSON. And then we’re just going to use that to drive this PDF experience in the embed API, which is also based on JSON. So it’s a really great set of tools to build these experiences in. So today we’re going to, basically I’ll give you the introduction to the Adobe extract API. That’s a video that I’m going to play. I’m going to have to switch over from my current screen share to a different screen share in order to get the audio out to you. So that’ll just, there’ll be a little hiccup. It won’t be quite as smooth as I’d like it to be, but I will be switching. So be prepared for that. 
Then I want to talk a little bit about the challenges of working with PDF in web applications, and then talk about what we really need to make PDF an equal citizen with HTML. So PDF can be difficult to work with, and I’ll talk about that a little bit later, but the extract API makes it easy. So then I’m going to go through the application demo and a code walk, and then we’ll do a wrap up of some resources for you on a slide towards the end of that, as well as let you know where you can go to sign up to begin using these tools. So with that, let me switch over to the video. I’m going to have to stop sharing this screen and start sharing a different one. Well, slightly different.

Adobe PDF extract API is a service that allows you to take your PDF documents and turn them into JSON data. PDF extract API goes beyond optical character recognition or OCR. It uses AI and machine learning powered by Adobe Sensei to understand the structure of PDF content and turn it into readable data that computers can easily understand. While there are services and libraries that exist to convert PDFs into other formats, most have challenges discerning the structure of document titles, tables, headers, and other content to reliably extract data at scale. For example, some systems may read from left to right and have difficulty detecting columns on a page. Some systems may extract the text, but entirely ignore styling information. Many fail to understand the relationship of paragraphs and tables that span across pages. And most systems find it difficult to accurately parse tables. PDF extract API helps solve all of these challenges with best in class PDF extraction. When you use PDF extract API, you simply pass your PDF to the service. The service will return a package with the content of the document as JSON data, the tables as CSV files, and the images that are embedded in the PDF. 
The extracted data and content can then be imported into your content management system, CRM, or databases, and processed downstream by machine learning or analytics engines. Let’s look at an example. In the PDF on the left, we can see the selected paragraph. On the right, we can see the paragraph in the JSON data, which includes all the associated metadata, such as the location on the page, as well as the styling information. This data can then be transformed into other formats, such as HTML or imported into databases. Similarly, tabular data in PDFs can also be extracted separately. If the tables span across pages, PDF extract API will recognize this and provide a CSV file with all the data in the table across pages. PDF extract API can be used for a variety of use cases, from republishing content into new formats like your website, extracting data tables and importing them into your databases, or helping eliminate manual tasks of rekeying information from PDFs into data systems by empowering automated data extraction from PDFs. Access to PDF extract API is available through Adobe PDF services API, which you can access through Node, Python, or Java SDKs. REST APIs are also available. With over 2.5 trillion PDFs in the world, unlock the data inside them using Adobe PDF extract API. So before I actually get into the demo portion, I want to talk a little bit about the challenges of working with PDF in web applications. So PDF is not anything like HTML. Reading it is, directly reading it is nearly impossible. You always have to have a library that’ll let you parse through the PDF file and figure out what’s going on. And it’s so different from HTML, common web skills just don’t easily transfer. PDF files are also random access. So the characters, the words, the pages, none of these things are necessarily in the sequential order and there’s a lot of really badly created PDF files out there. 
So most of the tools that are used to extract the, let’s say just the text from a PDF file, will actually do it in the order in which it’s painted on the page. So if you remember from the video where the extract tool, not our extract tool, but typical tools that are reading across the two columns, that’s a very common problem because that’s the way that the PDF file has been created. It’s been created based on the print stream and that print stream is how it would go out to something like a laser printer or something like that, where it’s just really painting from top to bottom with no regard to the actual column. So getting things in the right order can be a challenge. And that’s actually why we need to use artificial intelligence for our extract API. And then another aspect of dealing with PDF on the web is that the native PDF viewers really don’t have any kind of consistent APIs and oftentimes, they don’t even have a consistent experience. You’re gonna see a completely different view of a PDF file if you’re in Chrome versus if you’re in Safari or if you’re on a mobile device using one of the mobile viewers. And as a web developer who’s trying to create some kind of PDF experience, not just posting a PDF file up there for people to download and look at offline, but to create an experience inside the browser, you need that consistent API. So the native viewers are just pretty much impossible to accomplish that. So what do we need to make PDF an equal citizen to HTML? Well, the first thing we need is really some kind of normalizer, something to convert the PDF file into an easily digestible format. And I know this is gonna be shocking based on what we’ve talked about over the last few minutes, but that format is gonna be JSON. So everybody knows if you’ve done any development work at all on the web, you know how to deal with JSON. 
So we need to be able to convert it into this simple format, regardless of how the PDF file was constructed, including being able to get it out of scanned pages. So as I mentioned earlier, Extract API doesn’t care how that PDF file was created. It could be created all digitally, or it could have been scanned in. It’ll still give you that same structured output regardless of where that PDF file originally came from. It’s also able to identify tables, lists, and the structural relationships of the elements on the page. I recently saw a question came in on Stack Overflow. I’m looking at Stack Overflow every day. And one of the questions was, how do we detect a table tag in PDF? And from the page content level, I mean, you can get this in structured PDF, but from the page content level, there’s no such thing as a table. There’s just text that’s geometrically positioned on a grid that looks like a table to us. But in terms of PDF, there’s nothing like a table in there. Again, if you don’t deal with tagged PDF. But what we also want to be able to do is retain enough information about the PDF file using that tool to allow for interactivity. And then of course you need a consistent and predictable viewer with a rich set of APIs. And I know this might be another shocker, but I think embed API is actually the way to go with this because the only way to control the PDF experience is to control the PDF viewer itself. That way, by replacing the native viewers with one that you can control, you have complete programmatic control over what that experience is. And I’ll show you that really during the demo. So let’s look a little bit more in depth into what that JSON looks like. I’m gonna show you this PDF file, the real version of this PDF file in just a minute. But you can see that that Roman numeral I and then the word background in bold is actually being identified by extract API as an H1 or a first level heading. 
And you can see we’ve got the bounding box there where that’s located on the page. And I know that it’s on page one. So that gives me enough information to be able to look for other similarly tagged or other similar elements with that kind of path, H1s and H2s. I’ll talk more about that during the demo. So let’s build something. So what I’m gonna put together here is a unique PDF experience. You can throw one of these things together for just about any kind of PDF file. But what I’ve done here is on the left side, I’ll have a navigation pane and a search panel that it’s completely constructed using just HTML. And then the right side is actually embed API. Embed API does have a two way API, but in this instance, I’m only going one way. I’m gonna be going from the HTML to the embed API in order to control that experience programmatically. And not the other way around, but there you can of course have bi-directional communications. So let’s take a look at what this all looks like when I put it together. I’m gonna get out of here. I can minimize this now. And let’s take a look at that PDF file. So here’s a 41 page file. And if I just scroll down here, you can see there’s that background that I just showed you, that heading one that I just showed you earlier. And I’ve got a table, I’ve got some graphics. We’re gonna look at this table again, a little more in depth in just a minute. But I’ve got some graphics and then a little further down, if I keep going, we’ve got some other bold areas here, which we’re gonna think of those as our heading level twos. So let me go back up to the top. What you also see here is that I’ve got a set of bookmarks over here on the right side. And with PDF bookmarks, they’ll take you to the specific location in the PDF file. But by clicking on these, you can see that it’s basically just taking me to the same exact page. 
This isn’t really a navigational aid if I’ve got a 41 page document and the bookmarks that are in here are taking me to the same page. This is a real file by the way that I pulled down off of the internet. So these aren’t particularly useful. And what I wanna do is overwrite them. I wanna replace them with my own set of bookmarks based on the content in this document. So I could do this with PDF. I could do this with some kind of PDF library tool to rip apart the PDF and figure out where all those parts are. But instead, what I’m gonna do is run this through the extract API. So extract API is part of the PDF services API samples. You can just download that. And over on the side here, we’ve got a bunch of different samples. And the one that I’m gonna run is actually the one right here to extract the text table information with figures, tables and renditions. So what does that all mean? Well, I’m just gonna create my extract options builder. And what I’m gonna extract is the text. Obviously you don’t actually have to set this. It’ll always get the text for you. I’m also gonna get the tables ’cause I wanna show you how good a job it does on extracting tables, which can be really tough for a lot of PDF tools out there. And I want to create renditions of all the figures. Now a rendition is a bitmap that’s been created of the figure. Now, if the figure’s already an image, then obviously we’re just gonna pull the image out. But the images that you saw in the charts were actually vector based images. So we’re gonna create a bitmap of those rather than pull the vector artwork out. So let me go ahead and I’m just going to make sure I deleted my previous files. I did. I’m gonna go ahead and run that now. And what I will get out the other side is a zip file that contains all of the elements that I requested here. My tables will actually be in, I think, Excel format, ’cause I didn’t set any other options. 
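The options described above (text always, plus tables, plus renditions of the figures) can be sketched as a small Python helper that builds a request-parameter dictionary. This is only an illustration, not the sample that was run in the session: the function name build_extract_params is made up for this sketch, and the key names (elementsToExtract, renditionsToExtract, tableOutputFormat) are assumptions modeled on the REST request body, so check them against the current PDF Services documentation before relying on them.

```python
import json


def build_extract_params(tables: bool = True,
                         figure_renditions: bool = True,
                         table_format: str = "xlsx") -> dict:
    """Build a parameter dictionary mirroring the options described in the talk.

    Key names are assumptions, not a confirmed API contract.
    """
    # Text is always extracted, so it is always listed.
    params = {"elementsToExtract": ["text"]}
    renditions = []
    if tables:
        params["elementsToExtract"].append("tables")
        params["tableOutputFormat"] = table_format  # "xlsx" matches the Excel output seen in the demo
        renditions.append("tables")
    if figure_renditions:
        # A rendition is a bitmap created from the figure, including vector charts.
        renditions.append("figures")
    if renditions:
        params["renditionsToExtract"] = renditions
    return params


params = build_extract_params()
print(json.dumps(params, indent=2))
```

Switching table_format to "csv" would correspond to the CSV output mentioned in the introduction video, again assuming that value is accepted by the service.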
My figures will be in PNG and my text will be in that JSON object, which I’ll show you momentarily. So it’s now done. Let me open that up. Let me start with the figures. So I showed you earlier what that PDF file looked like. And if I open up one of the figures, you can see the vector artwork has been rendered into a PNG file. I could use this just as an image in a website, or I could repurpose this for some other kind of report, but I definitely got the image out of there. If I take a look at the tables, let me get this first one here. ’Cause I just, I love what extract did with this particular table. Let me expand that out a little bit. And I’ll throw this over onto the right side of my screen. And I’m going to put the PDF file on the, let me get rid of the bookmarks, on the left side of my screen. There we go. And let’s close that down a little bit. Let’s take a look at that table. Was it up here? Oh, I went too far down. And here it is. So you can see, this is a fairly complex table to get extracted from a PDF file. The first column here doesn’t quite align right with the sort of sub rows here. And there’s a little bit of interesting things going on with horizontal alignment in this area, especially with these word wraps over here. But you can see this table was captured perfectly into Excel, including the fact that the exchange traded derivatives are actually a single cell. And that all of these other sub categories within that section have their own line. So it just, it did a fantastic job. This is really hard to do just coming from the raw PDF file. But now let me bring up the part that we’re going to have fun with for the rest of this presentation. This is the structured data JSON from this file. And let me close that. And what I was looking for earlier, what I showed you in the presentation, was I was looking for the paths that have the slash Hs. So you can see here, I’ve got my first H1 is that Roman numeral one, and then the background. 
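To make the shape of that zip package concrete, here is a minimal Python sketch that sorts the entries of an archive into JSON, figures, and tables. The entry names used here (structuredData.json, a figures/ folder, a tables/ folder, and the fileoutpart names) are assumptions for illustration, and the archive is built in memory with placeholder bytes so the example runs without calling the service.

```python
import io
import json
import zipfile


def make_sample_package() -> bytes:
    """Create a stand-in archive; the entry names are hypothetical."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("structuredData.json", json.dumps({"elements": []}))
        archive.writestr("figures/fileoutpart0.png", b"placeholder")
        archive.writestr("tables/fileoutpart1.xlsx", b"placeholder")
    return buffer.getvalue()


def summarize_package(data: bytes) -> dict:
    """Group archive entries by the kind of output they appear to hold."""
    summary = {"json": [], "figures": [], "tables": []}
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for name in archive.namelist():
            if name.endswith(".json"):
                summary["json"].append(name)
            elif name.startswith("figures/"):
                summary["figures"].append(name)
            elif name.startswith("tables/"):
                summary["tables"].append(name)
    return summary


summary = summarize_package(make_sample_package())
print(summary)
```

With a real result you would pass the downloaded zip bytes to summarize_package instead of the in-memory sample.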
I got an H1, there’s a second occurrence there, which would be the second, the recommended alternatives. I’ve got another H2, I’ve got an H3. I’m not looking at the H3s for the demo. And so on and so forth. But I’ve got these elements here that correspond to parts of the document that I’m actually interested in for me to replace those bookmarks that we had in the PDF file with my own structure. So this is just where we’re looking at or what we’re looking at to construct the bookmarks panel. But I’ve also got a search pane in there, which I didn’t show you a screenshot of, but I’m gonna show you in the demo. And what the search is gonna do is search the text portion of this JSON as well. And I’ll end up looking for the word market in several cases. So you can see there’s 161 uses of the word market in here, and we’re gonna be able to locate those and actually perform some intelligent search on this. So this is just what the JSON looks like. You can see it, the bounding information is where it occurs on the page. The path is what type of element it is, and then the text is the text itself. The only thing that we need to be interested in for our PDF file later is the bounds. So notice that once I’ve run the PDF file through the extract API, I’m just working with JSON now. I’m not looking at the PDF. We’re actually done with using the PDF, it’s over. The only thing I needed for now is to load it into the preview that we have with embed API. So let’s take a look at how I put this all together. So here is an actual running copy of the advanced search tool that I put together. What I have here in this folder is the PDF file with an identically named JSON file. So the PDF file and the JSON are kind of like side by side in here. What’s driving this navigation pane is the actual JSON that I got from extract. So over here on the side, what I did is I output the areas that I’m interested in. So that first item, we can see background. 
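The three pieces of each element mentioned above (path, text, bounds) are enough to pick out the headings. The following Python sketch filters a list of elements down to H1 and H2 entries. The sample data is invented for illustration, and the field names (Path, Text, Page, Bounds) and the bracketed index on repeated paths such as H1[2] are assumptions about the output format rather than a confirmed schema.

```python
import re

# Hypothetical elements shaped like the structured JSON described in the talk.
SAMPLE_ELEMENTS = [
    {"Path": "//Document/H1", "Text": "I. Background ", "Page": 0,
     "Bounds": [72.0, 650.0, 200.0, 668.0]},
    {"Path": "//Document/P", "Text": "The market grew quickly.", "Page": 0,
     "Bounds": [72.0, 600.0, 520.0, 640.0]},
    {"Path": "//Document/H2", "Text": "A. Market Structure ", "Page": 1,
     "Bounds": [72.0, 700.0, 260.0, 716.0]},
    {"Path": "//Document/H3", "Text": "1. Detail ", "Page": 1,
     "Bounds": [72.0, 660.0, 180.0, 674.0]},
    {"Path": "//Document/H1[2]", "Text": "II. Recommended Alternatives ", "Page": 4,
     "Bounds": [72.0, 650.0, 330.0, 668.0]},
]

# Match a path whose last segment is H1 or H2, with an optional [n] index.
HEADING_PATH = re.compile(r"/H([12])(?:\[\d+\])?$")


def find_headings(elements):
    """Return level, text, page and bounds for every H1/H2 element."""
    found = []
    for element in elements:
        match = HEADING_PATH.search(element.get("Path", ""))
        if match and "Text" in element:
            found.append({
                "level": int(match.group(1)),
                "text": element["Text"].strip(),
                "page": element["Page"],
                "bounds": element["Bounds"],
            })
    return found


headings = find_headings(SAMPLE_ELEMENTS)
for heading in headings:
    print(heading["level"], heading["text"], heading["page"])
```

The H3 and paragraph elements are skipped, which matches the demo's choice to ignore H3s.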
Now notice that I stripped out the Roman numerals. I actually stripped out everything up until the first space there. So I modified the content a little bit to make it look more appropriate for being in the bookmark pane. But what I wanted was the, I wanted the H1s. The H1s are in the dark blue. The H2s are in the slightly brighter blue here. But again, it’s just JSON. But what do I do? I use this JSON to construct this panel with the H1s nested with the H2s. And what I’ve done is I’ve added an action inside each of these so that, when you click on them, they actually take you to the correct location in the PDF file. And I do this by using a combination of the bounding box and the page number. So all the way down here, each of the items that you see in the console are actually represented over here in the bookmark pane. And I’m just using JSON. I marched through all the JSON that was output by extract API. I look for the ones that contain the path parts that I actually want. So anything that ends in H1 or H2. And then I create an item in this list with an on-click event that takes me to the page and the bounding box defined in the JSON over here. So again, JSON is driving the navigation pane. And then the last part of this, which personally I think is the coolest part, is that we can now do advanced search as well. And what I’m searching for in here or what I’m searching against here is the JSON. So I’m gonna look for the word market and the word governance. Now you can put any contraction in here you want, or I’m sorry, any, yeah, not contraction, conjunction in there you want. I could do it in the same paragraph or the same sentence, but I’m gonna do the same paragraph. And when I hit search, what I’ve done is I searched against the JSON. I pulled out the text and you can see this is the text right here for that particular element. I’ve used just some jQuery actually to highlight the words within that paragraph. 
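The bookmark construction just described (strip everything up to the first space, nest H2s under the preceding H1, keep the page and bounding box as the click target) can be sketched in Python as follows. The heading list is hypothetical, the function names are invented for this example, and adding 1 to the page number rests on the assumption that the extracted page index is zero-based while the viewer counts pages from one.

```python
# Hypothetical headings, already filtered down to H1 and H2 elements.
HEADINGS = [
    {"level": 1, "text": "I. Background", "page": 0, "bounds": [72.0, 650.0, 200.0, 668.0]},
    {"level": 2, "text": "A. Market Structure", "page": 1, "bounds": [72.0, 700.0, 260.0, 716.0]},
    {"level": 2, "text": "B. Governance", "page": 2, "bounds": [72.0, 500.0, 190.0, 516.0]},
    {"level": 1, "text": "II. Recommended Alternatives", "page": 4, "bounds": [72.0, 650.0, 330.0, 668.0]},
]


def clean_title(text: str) -> str:
    """Drop everything up to the first space, e.g. the Roman numeral prefix."""
    head, separator, rest = text.strip().partition(" ")
    return rest if separator else head


def build_outline(headings):
    """Nest each H2 under the most recent H1 and record its jump target."""
    outline = []
    for heading in headings:
        node = {
            "title": clean_title(heading["text"]),
            "page": heading["page"] + 1,   # assumption: viewer pages start at 1
            "bounds": heading["bounds"],   # where on the page to scroll to
            "children": [],
        }
        if heading["level"] == 1 or not outline:
            outline.append(node)
        else:
            outline[-1]["children"].append(node)
    return outline


outline = build_outline(HEADINGS)
for item in outline:
    print(item["title"], "page", item["page"])
    for child in item["children"]:
        print("   ", child["title"], "page", child["page"])
```

In the demo each node would become a list item whose click handler asks the viewer to go to that page and location; that viewer call is left out here because it lives in the browser.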
You can see in the second occurrence, I was able to search for those, find those words again. But what I did was by using the bounding box information from that element that I located the text in, I used those bounds to create a highlight. That’s the second part here. And again, the highlight is just JSON. So the body value is the same as the text that I captured in the JSON. And then I used the bounding box to create a set of what are known as quad points to create this highlight right here. I did the same thing further on in the document so that you can click on any of the search results. It takes you to that location of the document and then it will highlight that section of the document of interest in line. So I can actually see where that is in the PDF file with all the other surrounding information. So that’s just a very basic look at how you can use the output from extract API to simply treat the PDF file as though it were JSON because you’re just dealing with JSON to interact with the PDF file through the embed API. So with that, let me bounce back to the presentation so I can take care of any questions, but also to go through some of the other resources that we have. So additional resources and Ben, if there are any questions that have come in, feel free to read them to me while I’m going through this. You can just interrupt me at any point. So a link to the Adobe PDF services API documentation is here, that’s the first one. We’ve also got the Adobe support community. This link will take you directly to the Document Cloud SDKs, so questions on that. We actually have a pretty active community there now. So if you’ve got a set of common questions, they’re probably already answered in there. And what I try to do is put together CodePens that respond to some of those questions as well. And I believe a couple of the other guys have put together some gists that’ll be on GitHub. We’ve also got the Adobe tech blog. 
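The search and highlight steps described above can be sketched in Python: find the elements whose text contains every search word (the "same paragraph" case), then turn each element's bounding box into quad points. The sample elements are made up, the bounds are assumed to be ordered left, bottom, right, top, and the quad point order used here (top-left, top-right, bottom-left, bottom-right) is an assumption that should be checked against the viewer's annotation documentation. The resulting dictionary is a simplified stand-in, not the full annotation format.

```python
import re

# Hypothetical paragraph elements from the structured JSON.
ELEMENTS = [
    {"Path": "//Document/P", "Page": 2, "Bounds": [72.0, 500.0, 520.0, 540.0],
     "Text": "Market governance improved after the reforms."},
    {"Path": "//Document/P[2]", "Page": 3, "Bounds": [72.0, 300.0, 520.0, 330.0],
     "Text": "The market was volatile."},
]


def bounds_to_quad_points(bounds):
    """Convert [left, bottom, right, top] into eight quad point values."""
    left, bottom, right, top = bounds
    # Assumed order: top-left, top-right, bottom-left, bottom-right.
    return [left, top, right, top, left, bottom, right, bottom]


def search_same_paragraph(elements, words):
    """Return a highlight description for each element containing all words."""
    wanted = [word.lower() for word in words]
    hits = []
    for element in elements:
        text = element.get("Text", "")
        tokens = set(re.findall(r"[a-z']+", text.lower()))
        if all(word in tokens for word in wanted):
            hits.append({
                "bodyValue": text,  # the talk notes the body value equals the extracted text
                "page": element["Page"],
                "quadPoints": bounds_to_quad_points(element["Bounds"]),
            })
    return hits


hits = search_same_paragraph(ELEMENTS, ["market", "governance"])
for hit in hits:
    print(hit["page"], hit["quadPoints"])
```

A "same sentence" search would follow the same pattern after splitting each element's text into sentences first.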
So the link that you have here will take you directly to the articles that pertain to the PDF services API. And then of course, I’ve got a link to my CodePens because I will not pass up a chance to provide shameless self-promotion. So Ben, how are we doing on questions?

So there was an interesting question from Ira, which was actually around using a PDF extract API for pulling out things like math, like technical math out of PDFs. And maybe you wanna take, how would extract do at pulling some of that information out?

So I’m gonna guess that he’s talking about complex formulas like the ones that Good Will Hunting was writing on the chalkboard. So if it’s those kinds of symbols, you’re probably going to get both an image and the characters for the math symbols. And then you would be able to use that bounding box information. By the way, I didn’t show this, but you can get the rectangle that surrounds every single character in the file using extract API. So not just rectangles around paragraphs, not just rectangles around words, you can get the rectangles for every single character. And I think by using the character that was captured by extract, as well as the rectangle area, you’d be able to reconstruct that formula probably with a decent amount of accuracy, but you’d also probably get an image of that as well extracted out. So I would have to run some tests on some specific files, but that’s my guess as to what would happen.

And as we’re going through this, if people have any specific questions that they wanna ask, please do drop it into the chat window, but also we can spend a few minutes over in the Document Cloud networking area afterwards. And if it’s much easier to speak some of your questions or have a conversation, we’re happy to do that too.

Yeah. And also what you have on your screen right now is where to go to get started with all of this. So the URL is at the bottom. 
It’s slightly shorter than the one in the browser window itself, but just go ahead and go to that link and you can get started with our document services APIs. If you haven’t been able to tell by my tone, this is one of the most exciting technologies to come out of Adobe that I’ve seen related to PDF, because I’ve been doing this for a long time. And this is really, really exciting to me. Awesome. Any other questions? Well, I know we’re at time. We’re at time. So Joel, thank you so much for walking us through that. I think I agree with you. This really excites me in some of the ways that it helps people transform PDFs into, helps republish content and really helps kind of bring some intelligence to a lot of the documents within people’s spaces here. So thank you again for walking through that. And again, we’ll spend a few minutes over on the Document Cloud Networking section. We’re happy to answer any questions over there, but thank you and enjoy the rest of Adobe Developers Live.
