Parsing PDF - Extracting PDF content intelligently into JSON format with PDF Extract API

A look at how the PDF Extract service in Adobe PDF Services API can make parsing PDF as easy as parsing JSON. In this session, you’ll learn what PDF Extract API does, why it’s different from anything you’ve seen before, and how to use it to create unique PDF experiences without knowing anything about the PDF specification or PDF “internals”. With PDF Extract API, it’s just JSON!

Transcript
Today I’m going to talk about parsing PDF and extracting PDF content intelligently. At Adobe, our mission is to change the world through digital experiences. If I talk about my personal experience, from the moment I wake up in the morning I do a lot of activities that directly or indirectly touch Adobe technologies. I’m consuming media and images, I’m watching advertisements, I’m browsing a lot of content on social networking sites, I’m reading some materials. All of that touches Adobe technologies. We have all the features in the Adobe document suite to automate and deliver a complete digital experience. It’s just a matter of choosing the right API and the right feature for the use case. It’s not only about finding a solution for a problem. We should look for a solution that is going to scale, because the data is going to grow with time. The target should not be to build a solution quickly for a short-term success that leads to a long-term failure. Our target should be to build a solution that produces success in both the short term and the long term, and that scales. PDF Extract API is going to give you that intelligence that you can apply to your PDF documents.
This is the agenda for today’s session. We will talk about PDF Extract API for five minutes, then I will tell you how to access the API and how to create your own credentials to get your hands dirty with it. Then I’ll give you a demo of PDF Extract API, and we will talk about some use cases.
PDF Extract API, as I said, is more than just OCR. People often get confused between OCR and the Extract API, so let me clarify that. OCR, that is, optical character recognition, is used to make a PDF searchable.
Let’s say you have a scanned copy of a PDF. If you open the document and press Ctrl+F to search for a keyword, the PDF will not let you do that, because it’s an image, a scanned copy of the document. You need to make it searchable in order to perform a search, and even if you want to extract content from that PDF file, you have to run OCR. So remember, OCR is going to make that PDF searchable. The Extract API works on both scanned and non-scanned documents. It does not matter whether your document is scanned or non-scanned; you can use the intelligence of the Extract API to extract content from that document.
Now when I say content, it’s not limited to text. This is powered by Adobe Sensei, the artificial intelligence and machine learning engine, and it is smart enough to recognize each and every bit of that PDF document: the title, headings, paragraphs, fonts, table data, and the reading order. The reading order is really important from an accessibility point of view. It even recognizes the styling, and it gives you a JSON file with all that data. The JSON will have the text elements, the API will also output images (tables can be output as images), and you can export table data as CSV that you can open in Excel. We will see all that live in action shortly.
Now, before I jump into the live demo of the Extract API and show you how to get access to the API, there are a few things I want to talk about. As I said, it’s more than OCR. It can extract text and table data from virtually any PDF document, regardless of whether it is scanned or non-scanned. So it is a single solution to extract content from PDF. You can extract images, text, tables, literally anything.
You can classify text objects such as headings, lists, and footnotes. Even if you have a table that flows over to the next page, it can get you the right information accurately, so that you do not miss out on sensitive data when you automate your processes using PDF Extract API. And because it uses Adobe Sensei AI technology, it delivers highly accurate data extraction for both scanned and non-scanned PDFs without requiring custom machine learning templates or model training. So you have a fully baked solution. There is an API available, and you just need to start using it.
Now, some of the features of the Extract API, if I count them on my fingers: first, there is content extraction, for sure. That is the basic, minimal feature that we have, and it offers more than that. It can read the document structure, and it can give you complete insight into what is in the document in terms of styling, with a high level of accuracy, which is more important. If an API is returning data to you, you have to be able to rely completely on that API’s accuracy. And then it is platform agnostic. Nowadays data is being accessed everywhere, whether on a computer, a large screen, an iPad, mobile devices, or small-screen devices. So we have to be very mindful of those things, and whatever we are building should be device agnostic.
So let me show you how to get access to this API, and then we will take a look at the demo. But first let me give you some exciting visuals of how this API works. Let’s say you have this PDF document right here. Let me take the laser. This is a PDF input file. It could be scanned or non-scanned. There is a heading, there is a title, and then you have this section that is content.
When you pass this PDF file to the Extract API, the API, which is powered by Sensei, the artificial intelligence engine, applies that intelligence and starts unlocking the document. It recognizes every piece of every section of the document. If you look at this, this is the text “Adobe Vendor Security Review Program White Paper”. It recognizes the spaces, the boundaries, and the styling. What is the language? What is the page number? What is the text? Then the line height, and then the font family. It gives you all the information you need to use that data intelligently outside of the Extract API.
Now, this slide shows an example of how intelligent search and some other use cases can be handled using PDF Extract API. Take intelligent search. You have the Extract API, and you have a document with different elements like headings, tables, and lists. Once you pass this document to PDF Extract API, it unlocks the document, gets all the data, and turns it into a JSON file. You can then use that for downstream processing. When I say downstream processing, you may have your own ERP system or CRM system where you want to use that data intelligently, or maybe you want to use the data to trigger further automated processes. Then there is invoice processing. Let’s say you have scanned copies, those legacy scanned images. You can also pass them to the Extract API, which will unlock that content and completely automate the invoice workflow. You can also retrieve certain sections from the document.
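As an illustration outside the talk itself, retrieving sections from the extracted JSON could be sketched in Python as follows. This is a minimal sketch under stated assumptions: the key names (`elements`, `Path`, `Text`, `Page`), the path syntax, and the sample values are assumptions made for the example, not details confirmed in the session, so check them against real output before relying on them.

```python
import json

# Hypothetical sample shaped like the element list described in the talk.
# Every key name and value below is an assumption made for illustration.
SAMPLE = json.loads("""
{
  "elements": [
    {"Path": "//Document/Title", "Text": "Adobe Vendor Security Review Program White Paper", "Page": 0},
    {"Path": "//Document/H1", "Text": "Overview", "Page": 0},
    {"Path": "//Document/P", "Text": "First paragraph.", "Page": 0},
    {"Path": "//Document/H1[2]", "Text": "Scope", "Page": 1},
    {"Path": "//Document/P[2]", "Text": "Second paragraph.", "Page": 1}
  ]
}
""")

HEADING_TAGS = {"H1", "H2", "H3", "H4", "H5", "H6"}


def tag_of(path):
    """Return the last path segment without its index, e.g. '//Document/H1[2]' -> 'H1'."""
    return path.rsplit("/", 1)[-1].split("[", 1)[0]


def sections(doc):
    """Group text elements under the most recent heading, in document order."""
    grouped = {}
    current = None
    for element in doc.get("elements", []):
        text = element.get("Text")
        if text is None:
            continue  # elements without text (for example figures) are skipped
        if tag_of(element.get("Path", "")) in HEADING_TAGS:
            current = text.strip()
            grouped[current] = []
        elif current is not None:
            grouped[current].append(text.strip())
    return grouped


if __name__ == "__main__":
    for heading, body in sections(SAMPLE).items():
        print(heading, "->", body)
```

Grouping by the most recent heading relies on the elements arriving in reading order, which is the property the talk highlights; text that appears before the first heading (the title here) is left out of the result.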
Let’s say you are looking for specific content: you can navigate through the document and have that processing done. Just imagine you have thousands of PDF documents, whether scanned or non-scanned. Let’s take the example of non-scanned documents that contain some data. You can build an automated system that goes through all those thousand PDFs and produces a corresponding JSON file for each. That JSON has the intelligent data along with the styling. Your process can then apply its own intelligence, get the data out of those JSON files, and you can further develop your own customization, which results in a good user experience. All of that is easily possible using the Extract API.
Now, if I talk about some of the use cases: since we’re extracting content, content processing is going to be the key, along with data analysis and content republishing. Data analysis is similar to what I just explained. You extract data from very complex documents: tables, cell data, column and row headers, table properties, even tables that flow over to the next page. You can use all of that with machine learning models to analyze and act, depending on the use case.
Let me show you how to get access to the API, and then I will show you a quick demo of it. This is the Adobe IO page, and if you come to this page you’ll see all these options. We are interested in PDF Extract. First, click “Get credentials”, and it will give you these options. Then select PDF Extract. You need to have an Adobe ID. If you don’t have one, go to adobe.io and register, and that will give you an Adobe ID. Then click “Start free trial”. Okay, now it is processing.
If you already have credentials, you can go to the IO console and manage them. If you don’t, or if you want to create a new set, go to “Get started”. The system will ask you to log in. Enter your username and password, and once you’re logged in, go to PDF Extract like this. Go to “Start free trial”, then “Get started”, and this will give you a new set of credentials. Once you have these credentials, you’ll get a zip file.
This is my code base. It has the SDK in Java, Node, and C#, as mentioned on the website. Depending on your technology of choice, you can use our existing SDKs or your own custom code. At the end of the day, it is just an API, so it does not matter what application you use to call it. You can use any technology. You just have to provide the PDF document, scanned or non-scanned, as input, and the API takes care of the rest.
So this is my sample code. Once you create a credential after going through the free trial, you’ll get a zip file containing the private key and your API credentials. Once you have that, you put it here in your code base. This takes care of authentication and authorization. Then these are the different operations that can be performed on the PDF file. There is one to get the text information from the PDF. If you run it, you get the text information. If you want the table information along with the text, there is another function for that. And if you want everything, the text, the tables, and everything else, you can use this one.
The sample document I’m going to use for today’s demo is this one. Let me open it up. If you look at this document, it’s Adobe Acrobat DC for Business.
It has a heading and then all this content. I chose this document because it’s a bit complex, and the complexity is here, in this table. This is a table, “Workforce productivity”, with one, two, three, four columns. Now look at the top of the table. What do you have? There is “Adobe Acrobat DC”, then the text “The complete PDF solution for today’s multi-device world”, and on the extreme right you have “Acrobat Standard” and “Acrobat Pro”, and these are placed vertically. Can you see this? They sit above those two columns, but they are not part of the table. So I’m wondering how the API is going to treat this kind of scenario, because they were not added as part of the columns, but I want these details to be output as a column, because these two vertical labels look like headings for these two columns. So let’s run this document against the API and see what the output is. This is a long table.
What I’m going to do is run the API that returns the text in JSON format, the images if there are any, and the tables in CSV format. I’ll use this document, and then I’ll also show you another example that has some images. So let’s jump into the demo right away. I have an output folder right here, and there is nothing in it. This is the PDF, Adobe Acrobat DC for Business. I’m going to run the API. Now it is applying the Sensei intelligence. It has unlocked the PDF, and now it is recognizing and extracting all that data. Ideally it does not take much time and should be done very quickly. Of course, it depends on the PDF’s complexity and size. There you go. It completed in less than 10 seconds, I believe. It didn’t even take that long. And we have a zip file as output.
So this is how it produces the output: there will be a zip file. If we extract this zip file, we get a folder. If I expand this folder, it has a nested folder named tables, and there is a JSON output. If I expand the tables folder, I have two images and two CSV files. That’s interesting.
Now if you look at this document, this is plain text. By the way, there are many third-party libraries available in the market to unlock PDF content. They do succeed to some extent, but the structure is kind of missing with those technologies. Let’s say this is a PDF and I want to read it. As a human, I may start from left to right, or right to left, or maybe top to bottom. What should the right order be? We may have a table that flows over to a new page, and it’s not always only the text that we are interested in. Those third-party libraries may give you the text information, but the styling information might be missing. PDF Extract API gives you all of it, with all the intelligence.
Now back to this table. I’m expecting this table as an image, and also all the data in the table as a CSV file. So let’s see our output. I have one JSON file, and I’m expecting this JSON to have all this content. We will take a look at it shortly. Now let’s get into the tables. Let’s open this image file out part3.png. If I open this PNG, look at this. It produced this image, and I would say it is very accurate in extracting this table as an image. If you look at the font and everything, it is quite precise, and the quality hasn’t been compromised. Now let’s look at the CSV corresponding to this image. Let me open it with Excel. There you go. This is the CSV. The output has this table in CSV format.
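As an illustration outside the talk itself, reading the table CSVs straight out of the output zip could be sketched in Python like this. It is a minimal sketch under stated assumptions: the archive is built in memory so the example runs on its own, and the file names, the `tables/` prefix, and the cell values are hypothetical stand-ins for whatever a real run produces.

```python
import csv
import io
import zipfile


def build_demo_zip():
    """Build an in-memory zip that mimics the layout seen in the demo.

    All file names and contents here are hypothetical placeholders.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("structuredData.json", '{"elements": []}')
        archive.writestr("tables/table0.csv", "Feature,Standard,Pro\r\nEdit text,yes,yes\r\n")
        archive.writestr("tables/table0.png", b"placeholder image bytes")
    buffer.seek(0)
    return buffer


def read_tables(zip_file):
    """Return {csv name: list of rows} for every CSV under the tables/ folder."""
    tables = {}
    with zipfile.ZipFile(zip_file) as archive:
        for name in archive.namelist():
            if name.startswith("tables/") and name.lower().endswith(".csv"):
                # utf-8-sig drops a byte order mark if one is present
                text = archive.read(name).decode("utf-8-sig")
                tables[name] = list(csv.reader(io.StringIO(text, newline="")))
    return tables


if __name__ == "__main__":
    for name, rows in read_tables(build_demo_zip()).items():
        print(name, "header:", rows[0], "data rows:", len(rows) - 1)
```

Reading from the archive directly avoids unpacking to disk, which may matter when many PDFs are processed in one run; `read_tables` also accepts a path to a zip file on disk.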
And if you look, we have this header intact. The table has a header as the first row, and it has four columns. So it has produced the output like this. Now what about these two headings that are vertically aligned? They are not inside the table, so strictly they are not part of it, but indirectly they are. I would expect an intelligent API like PDF Extract API to be able to connect the dots. Let’s see. Let’s go back to the output folder. I’ll come out of this. There is one more image. Let’s open it. So it actually created an image for that element, and it treated that element as a table. Isn’t that great? Now let’s open the CSV file and see what it has. I’m going to open it in Excel. Okay, there you go. What have we got? It’s like a table. Let me move things a little so that you can see it. If you look at this, it’s the same thing. It’s not in the table, but the API is smart enough to unlock it. This is the heading, and these two are the two headers. So it applies the intelligence and gives you the data. Now you, as a client application, have enough information to build a smarter solution and handle this data in a smarter way, because this API solves those problems for you. You have everything you need: the intelligence, the connectivity, and all the data.
Now let’s take a look at the JSON. Let me come out of this. I don’t want to save it. Let me open the JSON. There you go. This is my JSON, and this is the document. The JSON has the content. At the top of the document we have this Adobe Acrobat; this is a kind of figure. Then the text starts from here: “By combining Adobe Acrobat desktop software with premium features”.
If you look at this, we have that text right here, as key-value pairs. The text size is given here, the language is given here, and the text is given here. It starts with “By” and ends with “can” and a colon. Now please pay attention to the attributes. There is the path and the page number, and then the line height and the space after. So it gets all the metadata from the PDF document and gives it to you. If you scroll down, this is the styling information: the font weight, which is 400, and the family name, Adobe Clean. Is it italic? No, italic is false. If you make this text italic, then italic will be true. Likewise, if you scroll all the way to the bottom, you get all the data with all the attributes, with the boundaries, that is, the coordinates, and the alignment: everything you need to build some intelligence using the data.
Now, about the use cases. There are many, as I said: contract processing, invoices, and the Extract API can also be used to digitize. Take online exams as an example. Usually the content, even the exam papers, is a scanned document. We can digitize the exam material to create a content bank, a kind of future classroom content, and analyze test performance. PDF Extract will save a lot of labor, because with exams and study material we are talking about thousands of questions harvested from thousands of pages of PDF documents. Just imagine how much labor we save by using PDF Extract API, and, most important, while maintaining a more than 90% accuracy rate in extracting PDF formatting and structure. So you’re not only getting the content intelligently, you’re also getting the styling information with good accuracy.
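As an illustration outside the talk itself, the styling attributes walked through in the demo (font weight, family name, italic flag) could be used from Python roughly as follows. This is a sketch under stated assumptions: the key names (`Font`, `family_name`, `weight`, `italic`, `TextSize`, `Lang`) and every value other than weight 400, family Adobe Clean, and italic false are invented for the example and should be verified against real output.

```python
# Hypothetical elements; key names and most values are assumptions for illustration.
ELEMENTS = [
    {
        "Path": "//Document/P",
        "Text": "By combining Adobe Acrobat desktop software with premium features",
        "TextSize": 10.0,
        "Lang": "en",
        "Page": 0,
        "Font": {"family_name": "Adobe Clean", "weight": 400, "italic": False},
    },
    {
        "Path": "//Document/H2",
        "Text": "Workforce productivity",
        "TextSize": 12.0,
        "Lang": "en",
        "Page": 0,
        "Font": {"family_name": "Adobe Clean", "weight": 700, "italic": False},
    },
]


def font_summary(element):
    """Describe an element's font as 'family weight style'."""
    font = element.get("Font", {})
    style = "italic" if font.get("italic") else "regular"
    return f'{font.get("family_name", "unknown")} {font.get("weight", "?")} {style}'


def bold_texts(elements, min_weight=700):
    """Return the text of elements whose font weight is at least min_weight."""
    return [
        element["Text"]
        for element in elements
        if "Text" in element and element.get("Font", {}).get("weight", 0) >= min_weight
    ]


if __name__ == "__main__":
    for element in ELEMENTS:
        print(font_summary(element), "|", element["Text"])
    print("Bold:", bold_texts(ELEMENTS))
```

Filtering on weight is one simple way to pick out emphasized text when republishing content; the threshold of 700 is a common convention for bold and is an assumption here, not a value from the session.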
So a lot of use cases can be unlocked using PDF Extract API: as I said, content processing, data analysis, and content republishing. I think that’s all I wanted to cover in today’s session. Please go to adobe.io, as I showed you, get a free trial, create your own credentials, and get started. Thank you for joining today’s session. We are right on time.
