
You sort of have to use both: OCR and an LLM, then correlate the two results. They are bad at very different things, but a subsequent call to a second LLM to pair the results improves quality significantly, and you get document understanding and context as well as bounding boxes, etc.
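A minimal sketch of what that correlation step might look like, assuming hypothetical data shapes: OCR returns tokens with bounding boxes but no semantics, the LLM returns semantic fields but no coordinates, and a fuzzy match ties each extracted value back to a location. The field names and sample values here are illustrative, not from any real system.

```python
# Hypothetical correlation step: ground LLM-extracted fields in OCR
# bounding boxes by fuzzy-matching the extracted value against OCR text.
from difflib import SequenceMatcher

ocr_tokens = [  # shape an OCR engine might return (illustrative)
    {"text": "Jane Doe", "bbox": (40, 100, 180, 120)},
    {"text": "2024-03-15", "bbox": (40, 140, 150, 160)},
]
llm_fields = {  # shape an LLM might extract from the same page (illustrative)
    "patient_name": "Jane Doe",
    "visit_date": "2024-03-15",
}

def locate(value, tokens, threshold=0.8):
    """Return the bbox of the OCR token best matching `value`, or None."""
    best, best_score = None, threshold
    for tok in tokens:
        score = SequenceMatcher(None, value.lower(), tok["text"].lower()).ratio()
        if score >= best_score:
            best, best_score = tok["bbox"], score
    return best

# Attach a bounding box (or None) to every field the LLM extracted.
grounded = {k: {"value": v, "bbox": locate(v, ocr_tokens)}
            for k, v in llm_fields.items()}
```

In practice the second LLM call would also arbitrate disagreements (e.g. OCR misreads a digit the LLM got right, or vice versa); the fuzzy match above only handles the grounding half.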

I'm building a "never fill out paperwork again" app; if anyone is interested, I'd be happy to chat!



We think VLMs will outperform most OCR+LLM solutions in due time. I get that there's a need for these hybrid solutions today, but we're comparing 20+ years of mature tech against something that's roughly 1.5 years old.

Also, VLMs are end-to-end trainable, unlike OCR+LLM pipelines (whose components are trained separately), so VLMs should scale much better for domain-specific use cases or verticals.


Any tips on how to prompt that second pairing step? And what sort of things should the LLM extract in step 1?


A VLM that invokes OCR as a tool is a compelling idea, and I would expect it to produce pretty good results.
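A rough sketch of how that tool-use wiring might look, with both the VLM and the OCR engine stubbed out since the point is just the dispatch shape. The tool schema follows the common JSON-schema function-calling convention; the tool name, region format, and sample output are all assumptions for illustration.

```python
# Hypothetical setup: expose an OCR engine as a callable tool that a VLM
# can invoke when it is unsure about fine print or dense text.

ocr_tool = {
    "name": "run_ocr",
    "description": "Run OCR on a page region; return text with bounding boxes.",
    "parameters": {
        "type": "object",
        "properties": {
            "region": {
                "type": "array",
                "items": {"type": "number"},
                "description": "x0, y0, x1, y1 in page pixels",
            },
        },
        "required": ["region"],
    },
}

def run_ocr(region):
    # Stub: a real implementation would call Tesseract, PaddleOCR, etc.
    return [{"text": "INVOICE #1047", "bbox": tuple(region)}]

TOOLS = {"run_ocr": run_ocr}

def dispatch(tool_call):
    """Execute a tool call emitted by the VLM and return its result."""
    return TOOLS[tool_call["name"]](**tool_call["arguments"])

# A VLM that cannot read a small header region might emit a call like:
result = dispatch({"name": "run_ocr", "arguments": {"region": [0, 0, 600, 80]}})
```

The appeal is that the VLM keeps the document-level reasoning while delegating exact transcription and coordinates to the OCR engine, rather than running both unconditionally and reconciling afterward.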



