You sort of have to use both: run OCR and an LLM, then correlate the two results. They're bad at very different things, but a follow-up call to a second LLM to pair the results together improves quality significantly, and you get both document understanding and context as well as bounding boxes, etc.
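For what it's worth, the pairing step doesn't always need a second LLM call; a cheap first pass is to fuzzy-match the LLM's extracted values back to the OCR tokens to recover bounding boxes. A minimal sketch of that idea (the token format and field names here are made up for illustration, not any particular OCR engine's output):

```python
from difflib import SequenceMatcher

def attach_boxes(llm_fields, ocr_tokens, threshold=0.8):
    """For each LLM-extracted field value, find the best-matching OCR token
    and attach its bounding box. Returns {field: (value, bbox_or_None)}."""
    paired = {}
    for field, value in llm_fields.items():
        best_score, best_box = 0.0, None
        for token in ocr_tokens:
            # Fuzzy similarity between the LLM's value and the OCR token text
            score = SequenceMatcher(None, value.lower(),
                                    token["text"].lower()).ratio()
            if score > best_score:
                best_score, best_box = score, token["bbox"]
        # Only trust the box if the match is strong enough
        paired[field] = (value, best_box if best_score >= threshold else None)
    return paired

# Hypothetical outputs from the two systems:
ocr = [{"text": "Jane Doe", "bbox": (40, 120, 180, 140)},
       {"text": "2024-05-01", "bbox": (40, 160, 150, 180)}]
fields = {"name": "Jane Doe", "date": "2024-05-01"}
print(attach_boxes(fields, ocr))
```

Anything below the threshold is the disagreement set you'd actually hand to the second LLM to adjudicate.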
I'm building a "never fill out paperwork again" app, if anyone is interested, would be happy to chat!
We think VLMs will outperform most OCR+LLM solutions in due time. I get that there's a need for these hybrid solutions today, but we're comparing 20+ years of mature tech against something that's roughly 1.5 years old.
Also, VLMs are end-to-end trainable, unlike OCR+LLM pipelines (whose components are trained separately), so these approaches should scale much better for domain-specific use cases or verticals.