The central thesis of Docs2Info is that documents are not just containers of text, but highly optimized cognitive interfaces evolved to fit the human brain.
Written communication is an external extension of human “mental programs”. Just as the brain relies on discrete symbols and recursive composition to encode complex ideas simply, documents evolved to minimize the “description length” required for a human to understand a concept.
We developed writing systems that align with our visual cortex’s preference for contrast and edges, specifically horizontal and vertical lines. A page layout isn’t arbitrary; it is a spatial map designed to overcome our cognitive limits.
Digital formats like PDF are not a break from the past but the latest iteration of a continuous evolutionary line.
The PDF as a Cognitive Artefact: A PDF mimics the fixed layout of paper because that layout contains vital information (grouping, reading order, emphasis) that raw text streams lose. The cues that make data extraction efficient are “embodied” in the file format itself.
To “read” a document like a human, software cannot simply OCR text from top to bottom. It must respect the Simplicity Principle: the cognitive system chooses the pattern that explains the data most simply.
Effective extraction must reverse-engineer this evolution, decoding not just the characters, but the visual and spatial structures (lines, layouts, hierarchies) that humans use to process information efficiently.
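To make the difference between top-to-bottom reading and layout-aware reading concrete, here is a minimal Python sketch. It is an illustration only, not the Docs2Info method: the `Block` schema, the function names, the fixed two-column assumption, and the 60% width threshold for a "full-width" block are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A text block on a page (hypothetical schema for this sketch)."""
    text: str
    x: float      # left edge
    y: float      # top edge; smaller y is higher on the page
    width: float


def naive_order(blocks):
    """Top-to-bottom, then left-to-right: ignores columns entirely."""
    return [b.text for b in sorted(blocks, key=lambda b: (b.y, b.x))]


def column_aware_order(blocks, page_width):
    """Keep full-width blocks in place; read each column region left column first."""
    mid = page_width / 2
    ordered = []
    run = []  # consecutive narrow blocks that form one column region

    def flush():
        # Inside a column region: left column top to bottom, then right column.
        left = sorted((b for b in run if b.x + b.width / 2 < mid), key=lambda b: b.y)
        right = sorted((b for b in run if b.x + b.width / 2 >= mid), key=lambda b: b.y)
        ordered.extend(left + right)
        run.clear()

    for b in sorted(blocks, key=lambda b: (b.y, b.x)):
        if b.width > 0.6 * page_width:  # assumed threshold for a full-width block
            flush()
            ordered.append(b)
        else:
            run.append(b)
    flush()
    return [b.text for b in ordered]


# A sample page: a title, two columns of two blocks each, and a footer.
PAGE_WIDTH = 100
SAMPLE = [
    Block("Title", 10, 0, 80),
    Block("L1", 5, 10, 40),
    Block("R1", 55, 10, 40),
    Block("L2", 5, 20, 40),
    Block("R2", 55, 20, 40),
    Block("Footer", 10, 30, 80),
]
```

On the sample page, `naive_order(SAMPLE)` interleaves the two columns (Title, L1, R1, L2, R2, Footer), while `column_aware_order(SAMPLE, PAGE_WIDTH)` returns Title, L1, L2, R1, R2, Footer, which is the order a human reader would follow. A real system would have to infer the column structure rather than assume it.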
The aim is to move from “people telling people what to do” (low efficiency) to tools that understand the document’s inherent structure, which is where the largest productivity gains are expected.