Recent Developments in LLM Architectures: KV Sharing, mHC, and Compressed Attention

BitcoinEthereumNews: apple amd deepseek v4 ethereum

Vitalik Buterin Pushes For Hardware-Diverse CROPS AI As DeepSeek V4 Runs Locally On Apple And AMD

The term “decentralized AI” gets thrown around often, but Ethereum co-founder Vitalik Buterin is drawing a sharper line. For him, the real test of an AI system that can serve crypto users isn’t just where the inference happens—it’s whether the model runs across a range of actual hardware, from a MacBook to an AMD rig. In an update posted to his personal site and flagged by the original report, Buterin pointed to a concrete benchmark: DeepSeek V4 now has a 2-bit quantized version that fits within about 90 GB of VRAM, hitting roughly 35 tokens per second on Apple hardware and about 7 tokens per second on AMD. That matters more than many realize. For months, the AI-crypto conversation has been split between centralized cloud inference and grand schemes for decentralized compute networks. Buterin’s “CROPS AI” concept—short for Consequential, Recove

May 29, 12:38 AM
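
The memory figure in the DeepSeek V4 item above can be sanity-checked with simple arithmetic. The sketch below is not taken from the article: it assumes decimal gigabytes and a flat bit width per weight, and it ignores everything else that occupies memory (quantization scales, KV cache, activations). Under those assumptions it gives only a ceiling on how many weights a 90 GB budget could hold at 2 bits each, plus the throughput ratio between the two quoted speeds. The function names are hypothetical.

```python
def max_weights_for_budget(budget_gb: float, bits_per_weight: float) -> float:
    """Ceiling on the weight count if the whole budget held nothing but weights."""
    budget_bits = budget_gb * 1e9 * 8  # decimal gigabytes -> bits (assumption)
    return budget_bits / bits_per_weight


def weight_memory_gb(n_weights: float, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return n_weights * bits_per_weight / 8 / 1e9


# Figures quoted in the item: about 90 GB, 2-bit weights, 35 vs 7 tokens/s.
ceiling = max_weights_for_budget(90, 2)
print(f"ceiling: {ceiling / 1e9:.0f} billion weights")
print(f"Apple/AMD throughput ratio: {35 / 7:.1f}x")
```

Real "2-bit" formats usually average somewhat more than 2 bits per weight once scales and higher-precision layers are counted, so the actual weight count that fits is lower than this ceiling.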

KDNugget: gemma 4 tools easy agentic tool calling

Easy Agentic Tool Calling with Gemma 4

In this tutorial, we will give Gemma 4 two new tools and watch the model decide, on its own, when to look around and when to compute.

May 22, 12:00 PM
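
As a rough illustration of the pattern the tutorial item above describes, a tool-calling loop can be reduced to three parts: a registry of tools, a model that emits a structured call, and a dispatcher. The sketch below does not use Gemma 4 or any real model API. `fake_model` is a hypothetical stand-in that returns a JSON tool call, the two tools (`look_around`, `compute`) are invented for the example, and the JSON shape is an assumption rather than Gemma 4's actual call format.

```python
import json
import operator

# Hypothetical tool 1: returns a canned description of a place.
def look_around(place: str) -> str:
    scenes = {"kitchen": "a table and two chairs"}
    return scenes.get(place, "nothing notable")

# Hypothetical tool 2: a four-function calculator.
OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def compute(a: float, op: str, b: float) -> float:
    return OPS[op](a, b)

# Registry mapping tool names to callables.
TOOLS = {"look_around": look_around, "compute": compute}

def fake_model(prompt: str) -> str:
    """Stand-in for a real model: picks a tool and returns the call as JSON text."""
    if any(ch.isdigit() for ch in prompt):
        call = {"tool": "compute", "args": {"a": 6, "op": "*", "b": 7}}
    else:
        call = {"tool": "look_around", "args": {"place": "kitchen"}}
    return json.dumps(call)

def run_turn(prompt: str):
    """Parse the model's tool call and dispatch it to the matching function."""
    call = json.loads(fake_model(prompt))
    tool = TOOLS.get(call["tool"])
    if tool is None:
        raise ValueError(f"unknown tool: {call['tool']}")
    return tool(**call["args"])

print(run_turn("What is 6 times 7?"))
print(run_turn("What is in the kitchen?"))
```

In a real setup the stub would be replaced by a model call, and the tool result would normally be fed back to the model so it can write the final answer.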

Crypto Briefing: google gemma 4 open duck 3d print

Google showcases Gemma 4-powered Open Duck robot that anyone can 3D print at home

Google's initiative democratizes AI robotics, enabling local, cloud-independent AI applications, potentially transforming personal and educational tech landscapes.

May 21, 6:06 PM

Decrypt: google gemma 4 hardware multi-token prediction

Google Found a Way to Make Local AI Up to 3x Faster—No New Hardware Required

Google's new Multi-Token Prediction drafters can make Gemma 4 run up to 3x faster on your own hardware—no cloud required, and no quality lost.

May 7, 2:13 PM

Ars Technica AI: google gemma 4 speculative decoding open ai models

Google's Gemma 4 open AI models use "speculative decoding" to get up to 3x faster

Up to 3x the speed with no loss of quality—is it too good to be true?

May 6, 3:44 PM

MarkTechPost: gemma 4 google ai mtp multi-token prediction

Google AI Releases Multi-Token Prediction (MTP) Drafters for Gemma 4: Delivering Up to 3x Faster Inference Without Quality Loss

Large language models are getting incredibly powerful, but let’s be honest—their inference speed is still a massive headache for anyone trying to use them in production. Google just launched Multi-Token Prediction (MTP) drafters for the Gemma 4 model family. This specialized speculative decoding architecture can actually triple (3x) your speed at inference time, all without […]

May 6, 8:23 AM
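
The "faster with no quality loss" claim in the items above rests on how speculative decoding works in general: a cheap drafter proposes several tokens, and the large model verifies them, keeping the longest matching prefix and substituting its own token at the first mismatch. The sketch below is a toy greedy version of that generic idea, not Google's MTP drafter implementation. Both "models" are hypothetical deterministic functions, and the drafter is deliberately built to disagree with the target at some positions.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # context -> next token (greedy choice)

def target_next(ctx: List[int]) -> int:
    """Hypothetical 'large' model: an arbitrary deterministic rule."""
    return (sum(ctx) * 31 + len(ctx)) % 10

def draft_next(ctx: List[int]) -> int:
    """Hypothetical cheap drafter: wrong whenever the context length is a multiple of 4."""
    t = target_next(ctx)
    return t if len(ctx) % 4 else (t + 1) % 10

def greedy_decode(model: Model, prompt: List[int], n: int) -> List[int]:
    out = list(prompt)
    for _ in range(n):
        out.append(model(out))
    return out

def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n: int, k: int = 4) -> Tuple[List[int], int]:
    out = list(prompt)
    goal = len(prompt) + n
    rounds = 0  # one verification round stands in for one target forward pass
    while len(out) < goal:
        # 1. The drafter proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target checks each position; a real system does this in one batched pass.
        rounds += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # correction token from the target
                break
            accepted.append(tok)
        else:
            accepted.append(target(out + accepted))  # bonus token: all drafts accepted
        out.extend(accepted)
    return out[:goal], rounds

prompt = [1, 2, 3]
spec, rounds = speculative_decode(target_next, draft_next, prompt, n=12)
print(spec == greedy_decode(target_next, prompt, 12), rounds)
```

Every emitted token is the target's own greedy choice for its prefix, so the output matches plain greedy decoding while needing fewer verification rounds than tokens. Sampling-based variants use an accept/reject rule instead of exact matching to preserve the target's output distribution.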

O'Reilly AI-ML: gemma 4 frontier models local models ai providers

Local AI

The release of Gemma 4 has added energy to the discussion of local models and their importance. Models that you can download and run on hardware you own are becoming competitive with the “frontier models” hosted by large AI providers. These models have gotten good enough for production use, good enough for tasks that until […]

May 1, 2:20 PM

InfoWorld AI: google gemma 4 audio servers

Google’s Gemma 4 shines on local systems – both big and small

Google’s Gemma 4 comes touted as the latest evolution of Google’s multi-modal model offerings. Gemma 4 not only offers reasoning and tool use, but vision and audio functionality, and it’s available in a range of model sizes that target servers and local devices. What’s striking about Gemma 4 is that even at the higher end of its size range, it’s still decently performant on personal hardware. Google claims this is due to innovations in the architecture of the model, but the proof is in the trying. Gemma 4 is quite responsive. To that end, I took Gemma 4 for a spin on my own hardware to see how it fared for its advertised tasks.

Gemma 4 model sizes

Gemma 4 comes in four basic sizes or “densities”:
E2B: 2.3 billion effective parameters, 5.1 billion total, 128K max context window.
E4B: 4.5 billion effective parameters, 8 billion total, 128K max context window.
31B: 31 billion parameters (the “dense” version), 256K max context window. (You will probably not use this one on your own machi

Apr 22, 9:00 AM