Google launched its Gemma 4 open models this spring, promising a new level of power and performance for local AI. Now the tech giant's edge AI lineup is set to get even faster with the release of Multi-Token Prediction (MTP) drafters for Gemma. Google describes these experimental models as leveraging a form of speculative decoding to predict future tokens, significantly accelerating generation compared to traditional token-by-token processing.
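To make the idea concrete, here is a minimal sketch of greedy speculative decoding: a cheap "drafter" proposes a short run of tokens, and the expensive target model verifies them, accepting the longest agreeing prefix. The toy models, vocabulary, and function names below are illustrative assumptions, not Gemma's actual MTP drafters.

```python
import random

random.seed(0)
VOCAB = 16  # toy vocabulary size

def target_next(context):
    # Stand-in for the large, slow model: deterministic "greedy" next token.
    return (sum(context) * 31 + len(context)) % VOCAB

def draft_next(context):
    # Stand-in for the cheap drafter: agrees with the target most of the time.
    guess = target_next(context)
    return guess if random.random() < 0.8 else (guess + 1) % VOCAB

def speculative_decode(context, k=4, total=16):
    out = list(context)
    while len(out) - len(context) < total:
        # 1) Drafter proposes k tokens in one cheap pass.
        proposal, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2) Target verifies; keep the longest prefix it agrees with.
        accepted, ctx = 0, list(out)
        for t in proposal:
            if target_next(ctx) == t:
                out.append(t)
                ctx.append(t)
                accepted += 1
            else:
                break
        # 3) On a mismatch, fall back to the target's own token, so every
        #    round emits at least one token and output matches the target.
        if accepted < k:
            out.append(target_next(ctx))
    return out[len(context):len(context) + total]

print(speculative_decode([1, 2, 3]))
```

The key property of this verify-and-accept scheme is that the output is identical to decoding with the target model alone; the drafter only lets several tokens land per expensive verification pass instead of one.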

The latest Gemma models are built on the same underlying technology that powers Google’s frontier Gemini AI. However, they are specifically tuned to run locally rather than on Google’s cloud infrastructure. While Gemini is optimized for Google’s custom TPU chips—operating in high-performance clusters with ultra-fast interconnects and memory—the Gemma 4 models are designed to function on consumer-grade hardware. A single high-power AI accelerator can run the largest Gemma 4 model at full precision, and quantization techniques enable it to operate on a consumer GPU.

Gemma empowers users to experiment with AI directly on their own devices, reducing reliance on cloud-based AI systems that require sharing data with third parties. In a notable shift, Google updated the license for Gemma 4 to Apache 2.0, a far more permissive open-source license compared to the custom terms used for previous Gemma releases. Despite these advancements, local AI models still face hardware limitations for many users—an issue MTP aims to address.