Automating Table Extraction from PDFs and Scanned Images
Jan 24, 2023
Chief AI for Everyone Officer
Modern organizational workflows heavily depend on searchable documents, which commonly contain tables that organize valuable information clearly and concisely (from a human’s perspective, anyway). Documents themselves vary in type (e.g., invoices, receipts, insurance documents, bills of lading, bank statements, and more), and the tables they include also vary (different row and column labels, counts, etc.).
Although humans can quickly discern different document types using context clues, and easily make sense of tables regardless of structure and formatting, machines aren’t so capable. This makes automating data extraction from complex documents a challenging task for most document extraction solutions, and doubly so when the information is presented in tables.
This article offers a deep dive into automating the extraction of tabular data from complex documents, scanned images, and other sources of valuable business information.
Where do businesses encounter tables?
Tables are ubiquitous in business data. When automating data extraction, they are most often encountered in the following places:
PDF documents: PDFs are commonly used in businesses and organizations, and often contain tables with important information that needs to be extracted.
Image-based documents: Scanned documents and photographs of documents must be converted into editable formats before the tables they contain can be processed further.
MS Office documents: MS Office documents like Word, Excel, and PowerPoint also contain tables that may require extraction.
Webpages: Web pages may contain tables that can be scraped and extracted for data analysis.
XML, JSON, CSV, and other formats: A variety of other data formats also contain tables that may need to be extracted for further analysis and processing.
What methods are used to automate table extraction?
Manually copying and pasting data from documents is inefficient: it often alters the table structure, making it hard to restore the data to its original organized form. Manual extraction also requires significant verification and reformatting, which is time-consuming and error-prone.
Businesses are constantly looking for ways to convert their documents, particularly those with abundant tabular data, into editable table formats such as Excel or CSV. Additionally, they seek methods to make their data searchable, thus facilitating the process of finding and extracting relevant information.
Automation tools can extract tables from various documents, such as PDFs, Word documents, scanned images, and HTML pages. Some common features of these tools include the ability to handle multiple file formats, support for batch processing of multiple documents, and the ability to export the extracted data in various file formats such as CSV or Excel. However, the quality of the output can vary depending on the quality of the input document and the specific tool being used.
Some automation tools used to extract tables from various kinds of sources include:
Optical Character Recognition (OCR) software is one of the most common tools used to recognize and extract text from images and scanned documents.
Web scraping tools are used to extract data from websites and web pages, which may contain tables.
PDF parsing libraries can be used to extract tabular data from PDF documents.
Spreadsheet software, such as Microsoft Excel or Google Sheets, can be used to extract data from CSV or other spreadsheet formats.
AI tools use a combination of machine learning (ML), deep neural networks, natural language processing (NLP) techniques (e.g., named entity recognition), and dependency parsing to train models for table detection and table structure recognition.
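As a minimal sketch of the web scraping and CSV export steps mentioned above, the following uses only Python's standard library to pull the rows out of an HTML table and write them as CSV. The names `TableParser` and `html_tables_to_csv` are illustrative, not part of any tool named in this article, and the sketch assumes flat (non-nested) tables with well-formed markup.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the rows of every <table> in an HTML document.

    Assumption: tables are not nested inside one another.
    """

    def __init__(self):
        super().__init__()
        self.tables = []   # each table is a list of rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Collapse internal whitespace so each cell is a single clean string.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None


def html_tables_to_csv(html):
    """Return one CSV string per table found in the HTML."""
    parser = TableParser()
    parser.feed(html)
    outputs = []
    for table in parser.tables:
        buffer = io.StringIO()
        csv.writer(buffer, lineterminator="\n").writerows(table)
        outputs.append(buffer.getvalue())
    return outputs


sample = """
<table>
  <tr><th>Item</th><th>Qty</th><th>Price</th></tr>
  <tr><td>Widget, large</td><td>2</td><td>9.50</td></tr>
</table>
"""
print(html_tables_to_csv(sample)[0])
```

The CSV writer quotes the cell that contains a comma, so the column structure survives the export. Real pages often need more than this (merged cells, nested tables, tables rendered by scripts), which is where dedicated scraping tools come in.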
How can businesses benefit from automated table extraction?
Automating the extraction of tables from PDFs and other sources has a plethora of benefits for businesses, which include extracting legacy data stored in a tabular format; digitizing information to streamline processes and improve data reliability; collecting and organizing invoice and form data more efficiently; and reducing the risk of data misplacement or inconsistencies. Some specific end-use cases for automated table extraction include:
End-use cases for automated table extraction
Business Intelligence: Data from tables are used in financial reports, annual reports, and other business documents to generate insights and make data-driven decisions.
Healthcare: Table data is important in generating medical reports, clinical trials, and other studies to support medical research and improve patient care.
E-commerce: Extracting data from product comparison tables, pricing tables, and product specifications can help create a database of products for comparison, price tracking, and analysis.
Supply Chain Management: Extracting data from tables in shipping, inventory, and logistics documents helps in tracking the movement of goods, optimizing supply chain processes, and reducing costs.
Legal Document Processing: Extracting data from tables in legal documents such as contracts, deeds, and patents is the first step in automating legal research and document management.
News and Media: Extracting data from tables in news articles and press releases can generate a database of news events, financial performance, and other relevant information.
Government and Public Sector: Extracting data from tables in government reports and public data sets can support policymaking, budgeting, and other critical decision-making processes.
Academic Research: Extracting data from tables in research papers, journals, and academic publications helps organize scientific research and discovery.
Real Estate: Extracting data from tables in property listings and other real estate data assists with analyzing prices, property details, and the market.
Human Resources: Extracting data from tables in resumes, job descriptions, and employee records is required for automating the recruitment process, tracking employee performance, and improving HR management.
Challenges with table extraction for legacy OCR
The challenges for legacy OCR and traditional data extraction tools when it comes to extracting tables from PDFs and other document types can be bucketed into two groups:
Table layout variations
Legacy OCR solutions struggle with the variety of structural layouts that tables can have, including different numbers of columns and rows, as well as different font sizes, colors, and styles. This makes it difficult for OCR algorithms to accurately identify and extract tabular data. Additional variables that can significantly hinder OCR performance include:
Poor image quality: OCR also struggles when the quality of the image containing the table is low. This can lead to errors or inaccuracies in the extracted data.
Limited pre-processing capabilities: The process of detecting tables often starts with pre-processing document images to enhance the data and borders in tables. However, many simple OCR tools don’t include flexible image pre-processing that is capable of improving the accuracy of data extraction.
Variations in cell arrangement: Cell padding, margins, and borders can also cause problems for traditional OCR technology. These elements can make it difficult for the OCR to accurately identify the boundaries of the cells and can lead to errors in the extracted data.
Complexity in table content: Documents like freight invoices, purchase orders, financial statements, and tax documents often have dense content because they contain a large amount of detailed information.
Freight invoices, for example, may contain information about multiple items being shipped, their weight, price, shipping details, taxes, and more.
Purchase orders may contain important information such as quantity, prices, shipping details, etc.
Financial statements and tax documents may contain information about income, expenses, assets, liabilities, and other financial details.
Content ambiguity: Acronyms and abbreviations can make extracting meaning from the text harder. The way in which numerical values are represented in a table can vary, making it difficult to convert the information accurately.
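To make the point about numerical representation concrete, here is a hedged sketch of how differently formatted amounts might be normalized after extraction. The function name `parse_amount` and its heuristics are assumptions made for illustration: when both separators appear, the last one is treated as the decimal mark; a single comma followed by exactly three digits is treated as a thousands separator; a single dot is treated as a decimal point. Genuinely ambiguous cells such as "1,234" cannot be resolved without knowing the document's locale.

```python
import re
from decimal import Decimal


def parse_amount(text):
    """Convert a cell such as '$1,234.50' or '(1.234,50)' to a Decimal.

    Returns None when the cell contains no digits.
    """
    cleaned = text.strip()
    # Accounting notation: parentheses or a leading/trailing minus mean negative.
    negative = (
        (cleaned.startswith("(") and cleaned.endswith(")"))
        or cleaned.startswith("-")
        or cleaned.endswith("-")
    )
    # Keep digits and separators only; this drops currency symbols and spaces.
    digits = re.sub(r"[^0-9.,]", "", cleaned)
    if not any(ch.isdigit() for ch in digits):
        return None

    last_dot, last_comma = digits.rfind("."), digits.rfind(",")
    if last_dot != -1 and last_comma != -1:
        # Both separators present: the one that appears last is the decimal mark.
        decimal_sep = "." if last_dot > last_comma else ","
    elif last_comma != -1:
        # Commas only: a lone comma not followed by exactly 3 digits is decimal.
        tail = len(digits) - last_comma - 1
        decimal_sep = "," if digits.count(",") == 1 and tail != 3 else None
    elif last_dot != -1:
        # Dots only: a lone dot is decimal, repeated dots are grouping.
        decimal_sep = "." if digits.count(".") == 1 else None
    else:
        decimal_sep = None

    if decimal_sep:
        whole, _, fraction = digits.rpartition(decimal_sep)
    else:
        whole, fraction = digits, ""
    whole = whole.replace(".", "").replace(",", "")
    value = Decimal(f"{whole or '0'}.{fraction or '0'}")
    return -value if negative else value


for cell in ["$1,234.50", "(1.234,50)", "12,5", "n/a"]:
    print(repr(cell), "->", parse_amount(cell))
```

Using `Decimal` rather than `float` avoids introducing rounding errors into monetary values during conversion.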
Table structure complexity
While tables are meant to present data in a clear and organized format, not all tables are created equal. Many different types of tables can be used to present data, each with its own strengths and weaknesses. The following is a list of some of the most common types of complex tables that OCR can have trouble extracting:
Tables that span multiple pages: These lengthy tables can be difficult for OCRs to process accurately; OCRs often misinterpret the parts on different pages as separate tables or misread row/column headers.
Multiple tables on the same page with different headers: This type of table typically consists of multiple smaller tables on a single page, each with its own distinct header. These tables may present different sets of related data or break up a large amount of data into more manageable chunks.
Large single table: A table that contains a large amount of data, often spanning multiple pages. These tables may be used to present data from a database or other large data source and may include a wide variety of columns and rows.
Column-style tables: This type of table is laid out like a newspaper or magazine column, with the data presented in a narrow, vertical format. These tables may present data in a more compact, easy-to-read format or fit more data into a smaller space.
Nested tables: Tables placed inside the cells of other tables. These may be used to present data in a hierarchical format, with data from the inner tables related to data from the outer tables.
Complex tables with unstructured rows and columns: Tables whose data is not arranged in a structured, predictable format. These may be used to present data from various sources or data that is not easily organized into a traditional table format.
Tables with handwriting: This type of table contains handwriting instead of typed-in data. It is mostly used in cases where data is recorded by hand or when data is obtained via surveys, questionnaires, etc.
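For the multi-page case in the list above, one simple post-processing heuristic is to stitch together consecutive table fragments whose header rows match. The sketch below is an illustration under stated assumptions, not a description of any particular product: `stitch_pages` is a hypothetical helper, and it assumes each fragment arrives in reading order with its header repeated as the first row. Continuation pages that omit the header would need a different signal, such as column count or position.

```python
def stitch_pages(page_tables):
    """Merge per-page table fragments into logical tables.

    page_tables: list of fragments in reading order. Each fragment is a
    list of rows, and its first row is assumed to be the header.
    Consecutive fragments with the same header (ignoring case and
    surrounding whitespace) are treated as one table.
    """
    merged = []
    for fragment in page_tables:
        if not fragment:
            continue  # skip empty detections
        header, body = fragment[0], fragment[1:]
        key = [cell.strip().lower() for cell in header]
        if merged and merged[-1]["key"] == key:
            # Same header as the previous fragment: continue that table.
            merged[-1]["rows"].extend(body)
        else:
            # Different header: start a new table.
            merged.append({"key": key, "header": header, "rows": list(body)})
    return [[table["header"]] + table["rows"] for table in merged]


pages = [
    [["Item", "Qty"], ["A", "1"]],             # page 1
    [["item ", "qty"], ["B", "2"]],            # page 2, same table continued
    [["Date", "Total"], ["2023-01-24", "10"]], # a different table
]
for table in stitch_pages(pages):
    print(table)
```

Because fragments with different headers are kept apart, the same logic also leaves multiple distinct tables on one page as separate outputs.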
Artificial intelligence to the rescue
AI and ML tools can be used to circumvent many of the challenges faced by OCR and legacy algorithms in table extraction. One of the main challenges with OCR is that it can have difficulty accurately recognizing tables with complex structures, such as nested tables or tables with handwriting. AI and ML algorithms can be used to analyze the table's structure and identify the data's location within it, even in cases where the table is not well structured or includes handwritten text.
Another challenge for OCR is accurately extracting data from tables that are presented in different languages or with different font styles and sizes. By using natural language processing (NLP) techniques and machine learning models, AI and ML tools can be trained to recognize and extract text from tables regardless of language or styling.
Unlike OCR and legacy computer vision algorithms that only recognize visual similarities between pixels and characters, AI tools can understand the context of the data, and make distinctions about what data is relevant or "makes sense." Artificial intelligence and machine learning techniques can be used to train models to understand data in context, which can help improve the accuracy of table extraction. For example, by using NLP and ML to understand the meaning of the text within the table and the relationships between the data, the model can better understand the tabular structure and improve the accuracy of the extraction process.
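The contextual understanding described above is learned by models rather than hand-coded, but the underlying idea of checking whether extracted data "makes sense" can be illustrated with a much simpler rule-based stand-in. The sketch below flags invoice rows where quantity times unit price does not match the extracted total, which often indicates a misread digit. The function name and the column names `quantity`, `unit_price`, and `total` are hypothetical.

```python
from decimal import Decimal


def flag_inconsistent_rows(rows, tolerance=Decimal("0.01")):
    """Return the indexes of rows where quantity * unit_price != total.

    rows: list of dicts with string values under the (hypothetical) keys
    'quantity', 'unit_price', and 'total'.
    """
    flagged = []
    for index, row in enumerate(rows):
        expected = Decimal(row["quantity"]) * Decimal(row["unit_price"])
        if abs(expected - Decimal(row["total"])) > tolerance:
            flagged.append(index)
    return flagged


rows = [
    {"quantity": "2", "unit_price": "9.50", "total": "19.00"},
    {"quantity": "3", "unit_price": "4.00", "total": "72.00"},  # "12" misread as "72"
]
print(flag_inconsistent_rows(rows))  # prints [1]
```

Rows flagged this way can be routed to a second model or to a human reviewer instead of being passed downstream unchecked.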
Super.AI was built to tackle complexity and variability
Super.AI’s Intelligent Document Processing (IDP) solution was built on top of our unified AI platform that can process any document or unstructured data type. The obstacles that OCR and other traditional data extraction solutions struggle with are the exact type of challenges our technology excels at. Rather than approach complex data processing tasks from a single angle, we break each task down into smaller parts, then leverage the best AI, human, or software worker for each component. When it comes to extracting tables from PDFs, our solution leverages the most effective and current AI models for pre-processing and extraction, then combines the results into a unified output.
Super.AI intentionally avoids proprietary OCR and AI models, and instead tests and adopts the most performant combination of tools for a given scenario. Additionally, our Data Processing Crowd, a high-quality, on-demand resource pool for data labeling, post-processing, and exception handling, makes it possible to leverage trained human workers to quickly process or correct tables that machines may have trouble understanding. Each human input is used to continuously train models, which can quickly learn new table formats and rapidly improve automation rates.