Convert PDF tables to Excel
Photo by Jonathan Simcoe
You’re facing a deadline. You need to convert your PDF table to Excel, and quickly. But the huge selection of online tools that claim to do the job is confusing. Perhaps they won’t work with your table structure. Or maybe they come with huge price tags.
If you want to convert your PDF table to Excel with ease, here’s how. Get an empty spreadsheet ready. We’re going to tell you how to do it while avoiding VBA and as much manual effort as possible.
First things first…
The most accurate way to get your PDF table into Excel depends on your document type, your table format and the volume of documents you’re working with. If you don’t know this information already, work through the following headings to identify which application will cope best with the nuances of your document:
Document type – you can extract tables from scanned or searchable PDFs, or from image files, like TIFs or JPEGs. Does the software you’re looking at allow for this variety?
Table formats – this is the format of the table or tables that sit amongst other text and data on the page. Ask yourself: are the columns in a consistent format across multiple tables? If not, the application might find it hard to read the table. Are there column headings or other lines you need to remove from the table? Finally, is there a table title that influences the positioning of columns or the desired output structure, and does it need to be considered? Not every application can handle these cases, so you need one that doesn’t see them as a challenge.
Line structure – the format of your individual lines is just as important as the format of the table they sit within. Consider whether records that need to appear on a single row in Excel have been printed over multiple lines in the PDF document. For example, the description of a transaction on a bank statement may span multiple lines. If so, the application might not recognise those lines as a single record.
Volumes – whether you have a single table, a large one-off job, or a recurring process in which many identical tables are created over and over again will have a considerable bearing on the most effective solution for you and your team.
As well as assessing the format of your tables in order to choose the right tool to convert your PDFs, you’ll also need to consider your business’s individual requirements. Which key features does the software need in order to convert your PDFs to Excel? For example:
Does the application need to be free, or can it be paid for?
Does it need to offer an offline solution or is your company happy with cloud-based technology?
How much time can you afford to spend manually manipulating your PDFs to get the extracted data into the required structure?
How will you ensure the extracted data is validated and can be easily analysed?
Do you require a dedicated support team from the provider, or will you work as a community to get the most from the application?
And how easily (and painlessly!) will you need to replicate the conversion process in the future?
With all that to consider, you might whittle down your options to just one or two. If so, here are the pitfalls and features to look out for next:
Searchable PDFs - free or low volume solutions
Though online tools to convert your table to Excel are often provided for free, they have some common pitfalls. For example, transactions that span multiple lines in your tables might not be aggregated correctly so that one row = one transaction. In particular, watch out for vertical formatting, where the transaction date is displayed in the middle or at the bottom of the transaction information.
Additionally, data might not be formatted as the correct data type or validated for immediate use in Excel. In this case, you’ll need to look out for a varying number of decimal places, and multiple date formats and locales. Finally, encrypted or locked PDFs may hinder or block extraction of the embedded text. All of this leads to increased time manipulating data in Excel.
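To illustrate the data type problem, here is a minimal Python sketch that tidies extracted text into consistent dates and amounts. The list of date formats, the day-first reading of dates, the currency symbols and the use of a comma as the thousands separator are all assumptions made for this example, so adjust them to match your own statements.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Assumed date layouts for this sketch (day-first where ambiguous).
DATE_FORMATS = ("%d/%m/%Y", "%d-%m-%Y", "%Y-%m-%d")


def parse_date(text):
    """Return a date for the first assumed format that fits, else None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None  # leave unrecognised dates for manual review


def parse_amount(text):
    """Return the amount rounded to two decimal places, else None.

    Assumes a comma is a thousands separator and a dot is the decimal point.
    """
    cleaned = text.strip().replace(",", "").replace("£", "").replace("$", "")
    try:
        return Decimal(cleaned).quantize(Decimal("0.01"))
    except InvalidOperation:
        return None  # leave unrecognised amounts for manual review
```

Returning None rather than guessing makes the rows that need a manual check easy to filter once the data is in Excel.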
If free software is your only option, you might need to set aside a few hours for your task. To speed things up, hold Alt while selecting text in a PDF reader that supports it, so you can copy one column at a time. Then tidy up the result in Excel using the ‘split by column’ feature, regex, row aggregation formulas, and the MID() and DATE() functions.
Alternatively, you can call us for a free demo of StatementReader, which will remove all of the problems identified above. We offer an offline OCR solution that integrates seamlessly into your existing data extraction process, with data validation built into the application to correct possible errors and highlight them to the user.
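If you are comfortable with a little scripting, the regex and row aggregation steps can also be done outside Excel. The sketch below is one possible approach in Python, not a description of how any particular tool works. It assumes each transaction starts with a DD/MM/YYYY date and ends with an amount with two decimal places, and that any line without that shape continues the description of the transaction above it. The pattern and the function name are hypothetical.

```python
import re

# Assumed row layout: date, free-text description, amount at the end.
ROW = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{4})\s+"
    r"(?P<description>.+?)\s+"
    r"(?P<amount>-?[\d,]+\.\d{2})$"
)


def rows_from_text(lines):
    """Split copied lines into columns so that one row = one transaction."""
    rows = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        match = ROW.match(text)
        if match:
            rows.append(match.groupdict())
        elif rows:
            # No date and amount: treat as a continuation of the description.
            rows[-1]["description"] += " " + text
        # Lines before the first transaction (such as headings) are dropped.
    return rows
```

Statements that print the date in the middle or at the bottom of a transaction would need a different rule for deciding where each record starts.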
Scanned images - free or low volume solutions
Applications which work with scanned images rather than searchable PDFs still tend to have the same pitfalls. In addition, they typically either have less stringent levels of data privacy because they rely on cloud processing, or have low accuracy in their output because they use local, open source solutions which only require low quality input.
Applications that work with searchable PDFs are preferable to those which convert scanned images, and the good news is you can create your own. Create a searchable PDF using your scanner or PDF reader software, then apply the tips outlined above to gain back time lost on manual data manipulation when processing a few pages, or read on for a higher volume solution.
Converting many pages with many users
You might need a platform which can convert multiple pages at a time or allow multiple users to access the programme through a shared network. Free versions of this type of software will typically have all the problems identified above, but even paid versions come with some further common pitfalls.
Typically, these platforms are more complex, so they will require training time to learn the new tools, particularly those that offer overlapping functionality with applications you already use, such as Excel. Additionally, a service provider that processes the data with a manual or OCR fallback after a part-automated service may take 24-48 hours to turn around your extracted data.
On a more technical level, column templates might not be adjusted for movements between pages that are introduced by the scanning process. Missing dates that aren’t printed will need to be completed, and the programme might lack the ability to extract other data on the page, like account numbers, sort codes or page numbers. It might not offer the ability to save and share templates between users and might not be compatible with the range of bank templates you work with. Finally, it might not recognise the multiple tables for debits and credits (as is common for USA bank statements) or split out the amounts into receipts and payments.
If you encounter these problems, we recommend finding a tool that exports directly into Excel to minimise additional training, and one which offers automated data validation and analysis features.
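Two of the technical pitfalls above, unprinted dates and separate debit and credit tables, are simple to picture in code. The Python sketch below is an illustration only: it assumes a missing date means the same date as the row above it, that dates are already ISO strings (YYYY-MM-DD) so they sort correctly, and that each row is a dictionary with date, description and amount keys. All of the names are hypothetical.

```python
from decimal import Decimal


def fill_missing_dates(rows):
    """Copy the last printed date down onto rows whose date is None."""
    filled, last_date = [], None
    for row in rows:
        last_date = row["date"] or last_date
        filled.append({**row, "date": last_date})
    return filled


def combine_tables(credits, debits):
    """Merge separate credit and debit tables into one date-ordered ledger."""
    zero = Decimal("0.00")
    ledger = [
        {"date": r["date"], "description": r["description"],
         "receipt": r["amount"], "payment": zero}
        for r in credits
    ]
    ledger += [
        {"date": r["date"], "description": r["description"],
         "receipt": zero, "payment": r["amount"]}
        for r in debits
    ]
    # sorted() is stable, so rows sharing a date keep their original order.
    return sorted(ledger, key=lambda r: r["date"])
```

The result has one row per transaction with separate receipt and payment columns, which is the shape most Excel analysis expects.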
To test StatementReader’s ability in this area, you can send us a sample of a bank statement that is representative of the worst quality and most complex table structure that you will have to process, and await the results from us. Because we’re able to extract data at less than a second per page, we promise to return the data in under 24 hours, and better still, we can provide a live demo of your data being processed.
Now you’ve come t