Gemma 4 cheatsheet

Gemma 4 is Google DeepMind's Apache 2.0 open-weights multimodal LLM family with frontier intelligence across four sizes, from mobile-edge to server-grade deployment.

#Getting Started

#Quick Start

from transformers import AutoProcessor, AutoModelForCausalLM
import torch

MODEL_ID = "google/gemma-4-E4B-it"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    dtype=torch.bfloat16,
    device_map="auto"
)

messages = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user",   "content": "Explain MoE briefly."},
]

text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)
inputs = processor(
    text=text, return_tensors="pt"
).to(model.device)
input_len = inputs["input_ids"].shape[-1]

outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(
    outputs[0][input_len:], skip_special_tokens=False
)
result = processor.parse_response(response)
# result["thinking"]  → internal reasoning chain
# result["response"] → final answer shown to user

#Introduction

Docs (ai.google.dev)
Model Card (ai.google.dev)
HuggingFace (huggingface.co)
Cookbook (github.com)

Released 2026-03-31 · Apache 2.0 · Built from Gemini 3 research · 140+ languages

#Install

$ pip install -U transformers torch accelerate

All models: google/<model-id> on HuggingFace

Size	Model ID
E2B	`gemma-4-E2B-it`
E4B	`gemma-4-E4B-it`
31B	`gemma-4-31b-it`
26B A4B	`gemma-4-26b-a4b-it`

#Sampling Params

Parameter	Value
`temperature`	`1.0`
`top_p`	`0.95`
`top_k`	`64`

#Best Practices

Place images/audio before text in any prompt
Multi-turn: omit thought blocks from conversation history
Include 75%+ CoT data to preserve reasoning capability
Freeze vision layers when text-only fine-tuning

#Model Family

#Model Specs

Model	Arch	Total Params	Active/Token	Layers	Context	Modalities
E2B	Dense+PLE	5.1B (2.3B eff)	2.3B	35	128K	Text+Image+Audio
E4B	Dense+PLE	8B (4.5B eff)	4.5B	42	128K	Text+Image+Audio
31B	Dense	30.7B	30.7B	60	256K	Text+Image
26B A4B	MoE	25.2B	3.8B	30	256K	Text+Image

Sliding window: 512 tokens (E2B/E4B) · 1024 tokens (31B/26B) · Vocab: 262K (all models)

#Architecture Features

#Hybrid Attention

Local layers: sliding window (512 or 1024 tokens)
Global layers: full context, interleaved with local
Final layer is always a global attention layer

#PLE — Edge Models (E2B/E4B)

Per-Layer Embeddings: each decoder layer has its own small embedding table
Large static tables use fast lookups — not dense matrix multiply
Effective compute parameters ≪ total loaded parameters
Enables inference under 1.5 GB VRAM (with 4-bit quantization)

#p-RoPE & Shared KV Cache

Proportional RoPE (p-RoPE) on global layers for long-range coherence
Shared KV Cache across global layers reduces peak memory usage
Supports stable 256K context without performance degradation

#MoE — 26B A4B

128 total experts + 1 always-active shared expert
8 experts activated per token at inference time
Speed similar to a 4B dense model, quality near 30B
Vision encoder: ~550M params (same as 31B)

#Memory Requirements

Model	BF16 (16-bit)	8-bit	4-bit
E2B	9.6 GB	4.6 GB	3.2 GB
E4B	15 GB	7.5 GB	5 GB
31B	58.3 GB	30.4 GB	17.4 GB
26B A4B	48 GB	25 GB	15.6 GB

Base weights only — add VRAM for KV cache.

#Pick a Model

Available VRAM	Recommended
< 5 GB	E2B (4-bit)
5–8 GB	E4B (4-bit)
15–20 GB	E4B (BF16)
24–32 GB	31B (4-bit)
48–80 GB	31B (BF16)
High throughput	26B A4B

#Benchmarks

#Core Benchmarks

Benchmark	31B	26B A4B	E4B	E2B	Gemma 3 27B
MMLU Pro	85.2%	82.6%	69.4%	60.0%	67.6%
MMMLU (multilingual)	88.4%	86.3%	76.6%	67.4%	70.7%
AIME 2026 (math)	89.2%	88.3%	42.5%	37.5%	20.8%
GPQA Diamond	84.3%	82.3%	58.6%	43.4%	42.4%
LiveCodeBench v6	80.0%	77.1%	52.0%	44.0%	29.1%
Codeforces ELO	2150	1718	940	633	110
BigBench Extra Hard	74.4%	64.8%	33.1%	21.9%	19.3%
Tau2 avg (agentic)	76.9%	68.2%	42.2%	24.5%	16.2%
HLE no tools	19.5%	8.7%	—	—	—
HLE with search	26.5%	17.2%	—	—	—

All results for instruction-tuned models with thinking mode enabled.

#Vision Benchmarks

Benchmark	31B	26B A4B	E4B	E2B
MMMU Pro	76.9%	73.8%	52.6%	44.2%
MATH-Vision	85.6%	82.4%	59.5%	52.4%
MedXPertQA MM	61.3%	58.1%	28.7%	23.5%
OmniDocBench↓	0.131	0.149	0.181	0.290

OmniDocBench = document edit distance (lower is better).

#Long Context

Benchmark	31B	26B A4B	E4B	E2B
MRCR v2 128K	66.4%	44.1%	25.4%	19.1%

#Arena AI (LMSYS ELO)

Model	ELO	Open Rank
Gemma 4 31B	1452	#3
Gemma 4 26B A4B	1441	#6

#Thinking Mode

#Thinking Control

Add <|think|> to the start of the system prompt:

messages = [
    {
        "role": "system",
        "content": "<|think|>You are a math expert."
    },
    {"role": "user", "content": "Solve: 3x + 7 = 22"}
]

text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True
)
outputs = model.generate(**inputs, max_new_tokens=2048)
response = processor.decode(
    outputs[0][input_len:], skip_special_tokens=False
)
result = processor.parse_response(response)
# result["thinking"]  → step-by-step reasoning
# result["response"] → final answer

#Thinking output structure

<|channel>thought
[Internal step-by-step reasoning — hidden from user]
<channel|>
[Final answer visible to user]

#Disabled thinking behavior

For 31B/26B A4B with enable_thinking=False, empty tags still emitted:

<|channel>thought
<channel|>
[Final answer]

E2B/E4B variants skip empty tags entirely when disabled.

#Control Tokens

Token	Purpose
`<\\|think\\|>`	Enable thinking in system prompt
`<\\|channel>thought\n`	Open internal thought block
`<channel\\|>`	Close thought block
`<\\|turn>`	Open conversation turn
`<turn\\|>`	Close conversation turn

#Multi-turn Rules

Do not include thought content in conversation history
Pass only result["response"] as the model turn
Longer max_new_tokens for complex reasoning tasks
Use enable_thinking=True for AIME, proofs, debugging

#Multimodal

#Image Inputs

Variable aspect ratio + configurable visual token budget:

Budget	Best For
70	Fast classification, video frames
140	Captions, thumbnails
280	General image understanding
560	Charts, diagrams
1120	OCR, PDF parsing, fine detail

# Image MUST come before text (required)
messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text",  "text": "Describe this chart."},
]}]
inputs = processor(
    text=text,
    images=image,
    return_tensors="pt"
).to(model.device)

Vision encoder: ~150M params (E2B/E4B) · ~550M params (31B/26B)

#Audio Inputs

E2B and E4B only (~300M audio encoder)

Max audio: 30 seconds
Tasks: ASR and speech translation

#ASR prompt

Transcribe the following speech in
{LANGUAGE} into {LANGUAGE} text.

#Translation prompt

Transcribe in {SRC_LANG}, then
translate to {TARGET_LANG}.

#Video Inputs

Processed as sequential image frames:

Max: 60 seconds at 1 fps = 60 frames
Use low token budget (70–140) per frame
E2B/E4B: also processes audio track simultaneously

# Pass frames as list of images
inputs = processor(
    text=text,
    images=[frame1, frame2, ..., frame60],
    return_tensors="pt"
)

#Modality Order

Always place image/audio before text in content:

# ✅ Correct order
content = [
    {"type": "image", "image": img},
    {"type": "text",  "text": "Describe it."},
]

# ❌ Text-first breaks alignment
content = [
    {"type": "text",  "text": "Describe it."},
    {"type": "image", "image": img},
]

#Deployment

#HuggingFace

$ pip install -U transformers accelerate

#BF16 (default)

from transformers import (
    AutoProcessor, AutoModelForCausalLM
)
import torch

mid = "google/gemma-4-31b-it"
processor = AutoProcessor.from_pretrained(mid)
model = AutoModelForCausalLM.from_pretrained(
    mid,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

#4-bit quantized

from transformers import BitsAndBytesConfig

bnb = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    mid,
    quantization_config=bnb,
    device_map="auto"
)

#Ollama

$ ollama pull gemma4           # 31B (default)
$ ollama pull gemma4:e4b       # edge 4B variant
$ ollama pull gemma4:e2b       # edge 2B variant
$ ollama run  gemma4           # interactive chat

#Custom GGUF (Modelfile)

FROM /path/to/fine-tuned.gguf
SYSTEM "You are a coding assistant."

$ ollama create mygemma -f Modelfile
$ ollama run mygemma

#vLLM Server

$ vllm serve google/gemma-4-31B-it \
  --max-model-len 8192 \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4

OpenAI-compatible API at http://localhost:8000/v1

#Cloud & API

Platform	Notes
Gemini API	`gemma-4-31b-it`
AI Studio	Browser playground
Vertex AI	Custom endpoint
Cloud Run	Serverless GPU
GKE + vLLM	Auto-scale

#Edge & Mobile

Runtime	Best For
AICore (Android)	System-level API
LiteRT-LM	IoT, Raspberry Pi
AI Edge Gallery	On-device eval
LM Studio	Desktop GUI
llama.cpp	CPU/GPU hybrid

#Fine-Tuning

#QLoRA Setup

Single 16 GB GPU (T4/free Colab or Kaggle):

from unsloth import FastModel

model, tokenizer = FastModel.from_pretrained(
    "google/gemma-4-E4B-it",
    load_in_4bit=True,
    max_seq_length=4096
)

model = FastModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=[
        "q_proj", "k_proj",
        "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
)

#Vision fine-tuning (E2B / E4B)

from unsloth import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained(
    "google/gemma-4-E4B-it",
    finetune_vision_layers=False,   # freeze encoder
    finetune_language_layers=True,
    load_in_4bit=True,
)

#MoE Fine-tuning

For 26B A4B — full fine-tune breaks expert routing:

Use LoRA bf16 only (never FFT)
Start with r=16, lora_alpha=16
Increase context length gradually after loss converges
Avoid targeting expert router weight matrices

#Data Requirements

Requirement	Value
CoT data ratio	≥ 75% of training set
Thinking format	Include `<\\|think\\|>` trigger
Multimodal order	Images/audio before text
Chat template	ShareGPT or OpenAI format
RL reward	Verifiable final answer

#Vision Fine-tune Tips

Freeze vision first — finetune_vision_layers=False
Fine-tune LLM + attention + MLP projectors only
Validate text-domain quality before unfreezing
Unfreeze vision only for domain-specific images
Low token budget (70–280) saves VRAM during training

#Gemmaverse

#Specialist Models

Model	Domain	Description
MedGemma 4B	Medical imaging	Multimodal X-ray/MRI analysis
MedGemma 27B	Clinical text	EHR + medical report reasoning
CodeGemma	Coding	Code completion & refactoring
PaliGemma 2	Visual-language	Fine-grained VLM & visual reasoning
ShieldGemma	Safety	LLM output safety classifier
DataGemma	Factual data	Grounded via Google Data Commons
FunctionGemma	Tool calling	Low-resource function call parsing

#Ecosystem

Libraries & Frameworks

ADK (Agent Development Kit)
JAX Gemma Library (google-deepmind/gemma)
Gemma Cookbook (google-gemma/gemma-cookbook)
google/adk-samples starter agents

Community Variants

100K+ community fine-tuned derivatives
RecurrentGemma (Griffin architecture)
EmbeddingGemma, T5Gemma 2
VaultGemma (differential privacy)

#Also see

Official Docs (ai.google.dev)
Model Card (ai.google.dev)
DeepMind Models Page (deepmind.google)
HuggingFace (huggingface.co)
Gemma Cookbook (github.com)
JAX Gemma Library (github.com)
vLLM Usage Guide (docs.vllm.ai)
Ollama Library (ollama.com)