Table Extraction

An enormous volume of data is stored in tables inside documents. Documents, particularly PDFs, are the lingua franca of modern computer applications.

In the absence of universal exchange protocols between applications such as accounting and CRM systems, people export their data to PDF. Customers then move that data into other applications by retyping it, a time-consuming and error-prone task.

From 80 to 90 percent of the data generated and collected by organizations is unstructured, and its volume is growing rapidly, many times faster than that of structured databases (Unstructured Data). Businesses can only access this data manually: they either leave it unused or pay for costly manual data entry from the exported PDF into the system(s) they are transferring it to.

Demo

extract_tables.py
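
The following is a minimal sketch of what a script like extract_tables.py might do. It is not the actual contents of that file. It assumes the third-party Camelot library is installed and that the PDF is text-based rather than scanned. The function names extract_tables and rows_to_csv, and the default arguments, are hypothetical choices made for illustration.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of rows (each a list of cell values) to CSV text."""
    buffer = io.StringIO()
    # QUOTE_MINIMAL (the default) quotes only cells that contain commas,
    # quotes, or newlines.
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def extract_tables(pdf_path, pages="1", flavor="lattice"):
    """Return every table Camelot finds on the given pages as a list of rows.

    Assumption: flavor="lattice" suits tables drawn with ruling lines;
    flavor="stream" is Camelot's alternative for whitespace-separated tables.
    """
    # Imported lazily so rows_to_csv stays usable without Camelot installed.
    import camelot

    tables = camelot.read_pdf(pdf_path, pages=pages, flavor=flavor)
    # Each Camelot table exposes its cells as a pandas DataFrame in `.df`.
    return [table.df.values.tolist() for table in tables]
```

With a hypothetical file named invoice.pdf, calling extract_tables("invoice.pdf") would return one list of rows per detected table, and passing any of those lists to rows_to_csv gives text that can be pasted or imported into the target system instead of being retyped.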

References

  1. Table Detection & Information Extraction using Deep Learning Industrial use cases. / Business use cases.
  2. How to Extract tabular data from PDF document using Camelot in Python