Creating DocumentExtractor

On the journey of starting my first 'made-to-launch' side project and the things I've learned.


A lot of valuable information is hidden in unstructured documents (GPT-4o)

After countless ideas started on GitHub, documented in Notion, or written down on post-its, it was time to start working on something with the goal of actually getting it to launch. Since May 2024, I've been working on DocumentExtractor.ai, a service I envisioned could turn any document into structured data.

Motivation

I've been drawing up ideas and building software my whole childhood (working on different projects, mostly in C++, PHP, Java, and Swift). As life caught up and I got busy with coursework and later with my full-time job as a consultant, I mostly dropped the craft to focus on the necessities of life. My last "valuable" piece of software that actually launched (a small event planning and ticketing platform) was finished over five years ago, in 2019, and my programming skills grew rustier by the day.

As I had also grown quite curious about AI in recent months, and noticed how useful Python could be at work, I decided it was time for a new challenge, and getting back into programming seemed like a good way to take one on. What better thing to work on than AI nowadays anyway, right?

Initial idea

The initial idea for DocumentExtractor came to me after randomly stumbling upon the capabilities of Apache Tika, an open-source toolkit for extracting text from various kinds of documents. "Wouldn't it be great if we could just feed that text into an LLM and get exactly the data we need, regardless of input document format?", I thought to myself. The idea for "Doc2API" was born: a service offering user-defined APIs to query a set of user-provided documents. The user provides a schema and gets a file-upload location plus a URL in return, where a REST API is available to interact with the data.
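To make the idea concrete, a user-defined schema might look something like the following. This is purely a hypothetical illustration of the concept; the field names and schema format are my own invention, not DocumentExtractor's actual API:

```json
{
  "name": "invoice_extraction",
  "fields": {
    "invoice_number": "string",
    "invoice_date": "date",
    "vendor_name": "string",
    "total_amount": "number"
  }
}
```

The service would then accept document uploads and expose the extracted records behind a REST endpoint matching this schema.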

After giving it some further thought, this quickly evolved into the vision of a full-blown workflow automation platform. The user would define extractions, and the platform could then automatically trigger workflows (such as calling other APIs, or posting data into databases or ERP systems) based on pre-defined rules.

Before thinking about getting to point Z, though, I first needed to start at A and get the processing right.

The POC

To validate that my imagined approach of feeding Tika's HTML output into an LLM to extract structured data would work, I set up a small proof of concept in a Jupyter notebook. I spun up a quick Docker container based on the initial image (I like having my dev environments portable and separate from my main workstation) and gathered some test data by collecting documents I'd saved on my disk, mostly scans (which later turned out to be a mistake). I spun up a Tika instance in another container, registered for an OpenAI Platform account (I was fine with OpenAI seeing the data from my test docs) and gave it a go. Despite some trouble getting started with Python, I had everything working within a couple of hours – at least as a POC.
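The shape of that POC can be sketched in a few lines. This is an illustrative reconstruction rather than the actual notebook code: it assumes a Tika server listening on `localhost:9998` (the default port of the official Tika Docker image) and stops short of the OpenAI call itself, which would simply send the built prompt to a chat-completions endpoint:

```python
import json
import urllib.request

# Assumed location of a local Apache Tika server container.
TIKA_URL = "http://localhost:9998/tika"


def extract_text(path: str) -> str:
    """Send a document to Tika's /tika endpoint and return its text content."""
    with open(path, "rb") as f:
        data = f.read()
    req = urllib.request.Request(
        TIKA_URL,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")


def build_prompt(document_text: str, schema: dict) -> str:
    """Ask the LLM to respond with JSON only, matching a user-defined schema."""
    return (
        "Extract the following fields from the document below and respond "
        "with JSON only, matching this schema:\n"
        f"{json.dumps(schema)}\n\n"
        f"--- DOCUMENT ---\n{document_text}"
    )
```

The prompt would then be passed to an LLM and the JSON response parsed back into the schema's fields. Asking for "JSON only" is the naive first pass; getting the model to honor that reliably is its own challenge.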

Turning it into a product

Extraction & Data Architecture

more to follow...

What's essential? MVP definition

Running into challenges

What's next?

Philipp Heller

phil@heller-web.com