Datalab released lift, a 9B open-weights vision model that turns PDFs and images into schema-matching JSON. It uses schema-constrained decoding for valid structure and trained abstention to return null instead of hallucinating absent fields, scoring 90.2% field accuracy on a 225-document benchmark.
The post Datalab Releases lift: A 9B Open-Weights Vision Model That Extracts Structured JSON From PDFs Using Schemas appeared first on MarkTechPost.
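To make "schema-matching JSON" with null for absent fields concrete, here is a minimal sketch of a conformance check on one model output. The field names and the `conforms` helper are hypothetical illustrations, not lift's actual schema format or API; the check only confirms that the output has the expected keys and that each value is either the expected type or null.

```python
import json

# Hypothetical schema for illustration: field name -> expected Python type.
# Every field may also be None, mirroring "return null when the field is absent".
SCHEMA = {
    "title": str,
    "year": int,
    "sample_size": int,
}

def conforms(raw_json, schema):
    """Return True if raw_json is an object with exactly the schema's keys,
    and each value is either None or an instance of the expected type."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict) or set(data) != set(schema):
        return False
    for field, expected_type in schema.items():
        value = data[field]
        if value is None:
            continue  # abstention: the field was not found in the document
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, expected_type):
            return False
    return True

ok = conforms('{"title": "A study", "year": 2024, "sample_size": null}', SCHEMA)
missing_key = conforms('{"title": "A study", "year": 2024}', SCHEMA)
wrong_type = conforms('{"title": "A study", "year": "2024", "sample_size": 10}', SCHEMA)
```

A check like this runs after generation; schema-constrained decoding, as described in the summary, enforces the structure during generation instead.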
In this tutorial, we build a full PDF-to-structured-data workflow around Lift, built for controlled evaluation rather than a one-off demo. We prepare a Colab GPU environment, load Lift in 4-bit NF4, and generate synthetic research reports with deliberate distractors. We then run schema-guided extraction, score every field against ground truth, and assemble the results into a queryable knowledge base. The result is a repeatable extraction benchmark, not just raw model outputs.
The post Using Lift to Turn Research PDFs into Structured JSON with Controlled, Schema-Guided Field-Level Evaluation appeared first on MarkTechPost.
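Scoring every field against ground truth can be sketched as below. This is an assumption-laden illustration, not the tutorial's code: the `field_accuracy` function, the exact-match comparison, and the sample fields are all hypothetical. A ground-truth value of None marks a field that is absent from the document, so a predicted null (or a missing key) counts as a correct abstention.

```python
def field_accuracy(predictions, ground_truth):
    """Fraction of ground-truth fields that the prediction matches exactly.

    predictions and ground_truth are parallel lists of dicts, one per document.
    A field missing from a prediction is treated the same as null.
    """
    correct = 0
    total = 0
    for pred, truth in zip(predictions, ground_truth):
        for field, expected in truth.items():
            total += 1
            if pred.get(field) == expected:
                correct += 1
    return correct / total if total else 0.0

# Hypothetical example: one document, three fields, one of them absent.
truth = [{"title": "A study", "year": 2024, "funder": None}]
pred = [{"title": "A study", "year": 2023, "funder": None}]

score = field_accuracy(pred, truth)  # title and funder match, year does not
```

Exact match is the strictest choice; a real evaluation might normalize whitespace, case, or number formats before comparing.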
Marc Isaacs’ film Synthetic Sincerity may look like a documentary, but its fictional premise – a lab that scrapes movies to harvest human emotions – shines a hard light on just how far AI can go
In Marc Isaacs’ latest film, the subversive documentary maker reveals that an AI research laboratory recently licensed his entire body of work. That’s a quarter-century of droll, deadpan studies of ordinary life in Britain – from the poetic Lift, about the comings and goings in a London tower block, and The Curious World of Frinton-on-Sea, set in the sleepy retirement town dubbed “God’s waiting room”, to Philip and His Seven Wives, in which a secondhand furniture dealer declares himself to be a Hebrew king. Isaacs agreed to let data analysts at the University of Southern England feed these and other documentaries into their system to harvest authentic human emotions from which AI characters could then be created. His film about the experience takes its name from the university’s lab: Synthetic
Enterprise Document Intelligence [Vol.1 #5ter] - Table cells, OCR, captions, headings: cloud-grade structure, running on your own machine. No key, no per-page bill, nothing leaves the building
The post Parse PDFs for RAG Locally with Docling: Rich Tables, No Cloud Upload appeared first on Towards Data Science.
A widely used JavaScript implementation of Google’s Protocol Buffers format is placing too much trust in untrusted data, exposing affected applications to remote code execution and other attacks.
Researchers at Cyera have disclosed six vulnerabilities affecting “protobuf.js,” all stemming from the library’s handling of schema and metadata. Attackers could exploit an input validation oversight to insert malicious data and influence an application’s behavior.
Protocol Buffers is a technology for packaging data in a compact, structured format to streamline the exchange of information between different applications. The protobuf.js library reportedly receives more than 50 million weekly downloads. It is commonly pulled into applications indirectly through dependencies such as gRPC tooling, Google Cloud libraries, and other frameworks, making it difficult for organizations to track.
Researchers disclosed six CVEs covering remote code execution, denial-of-service (DoS) conditions, prototype
Enterprise Document Intelligence [Vol.1 #4] - A diagnostic across PDFs and questions, and a map of the techniques the rest of the series will cover
The post From Regex to Vision Models: Which RAG Technique Fits Which Problem appeared first on Towards Data Science.
How I turned 100 messy PDFs into structured insights by building a deterministic loop around agents
The post Stop Using LLMs Like Giant Problem Solvers appeared first on Towards Data Science.