LoRA hyperparameter scaling rules¶
Scope: empirically calibrated rules for setting LoRA post-training hyperparameters: the 10x learning-rate multiplier over full fine-tuning, hidden-size LR scaling per model family, rank selection from dataset capacity (LoRA params vs trained tokens), adapter placement (all matrices, especially MLP/MoE), and batch-size caveats. Sourced from Thinking Machines' LoRA Without Regret study and the tinker-cookbook hyperparam_utils formulas; mechanics of LoRA itself live in SFT & LoRA/QLoRA.
The Python (stdlib) blocks below reimplement the published formulas and were executed with all asserts passing. The formulas are empirical fits (calibrated on Llama and Qwen families as of mid-2026); treat constants as starting points and re-verify against the tinker-cookbook repo before long runs.
What it is¶
A small set of closed-form rules that replace grid search for the two hyperparameters that dominate LoRA fine-tuning outcomes, distilled from systematic sweeps in LoRA Without Regret (Schulman et al., Thinking Machines, 2025) and shipped as code in tinker-cookbook's hyperparam_utils:
- Learning rate: LR = 5e-5 * M_LoRA * (2000 / hidden_size)^P, with M_LoRA = 10 for LoRA (1 for full fine-tuning) and a per-family exponent P (0.781 for Llama, 0.0775 for Qwen).
- Rank: capacity-driven. For supervised learning, LoRA performs like full fine-tuning while trainable LoRA params >= trained (weight=1) completion tokens; rank buys capacity linearly. For RL, very small ranks match full fine-tuning because RL learns on the order of 1 bit per episode.
- Placement: adapt all weight matrices; MLP (and MoE expert) layers carry most of the capacity, and attention-only LoRA underperforms even at matched parameter count.
- The key invariance: the optimal LR is approximately independent of rank (a consequence of the alpha/r-scaled parametrization), so LR does not need re-tuning when rank changes.
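To make the alpha/r-scaled parametrization behind the invariance concrete, here is a minimal stdlib sketch of the adapter scale factor and forward pass. The fixed alpha of 32 follows the Tinker convention noted later on this page; the function names and the "alpha = 2 * rank" alternative are illustrative assumptions, not tinker-cookbook code.

```python
# Sketch only: names are illustrative, not tinker-cookbook API.

def lora_scale(rank, alpha=32.0):
    """Multiplier applied to B @ A; with fixed alpha it shrinks as 1/rank."""
    if rank < 1:
        raise ValueError("rank must be >= 1")
    return alpha / rank


def matvec(matrix, vector):
    """Plain-Python matrix-vector product (matrix is a list of rows)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha=32.0):
    """y = W x + (alpha / r) * B (A x); A is r x d_in, B is d_out x r."""
    rank = len(A)
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = lora_scale(rank, alpha)
    return [b + scale * d for b, d in zip(base, delta)]


# Fixed alpha (the convention these rules assume): scale falls as rank grows.
fixed = {r: lora_scale(r) for r in (8, 32, 128)}

# Hypothetical alternative, alpha = 2 * rank: scale no longer depends on rank.
proportional = {r: lora_scale(r, alpha=2.0 * r) for r in (8, 32, 128)}

W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 0.0]]          # rank-1 adapter, shape 1 x 2
B_zero = [[0.0], [0.0]]   # zero-initialized B: the adapter starts as a no-op
y_init = lora_forward(W, A, B_zero, [1.0, 2.0])
```

A stack that behaves like `proportional` rather than `fixed` uses a different effective update scale at each rank, so an LR tuned at one rank should not be assumed to carry over.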
Why use it¶
- The 10x rule kills the most common LoRA failure. Teams port a full fine-tuning LR to LoRA, see it underperform, and conclude LoRA is weak; the optimal LoRA LR is about 10x higher across model sizes (independently observed in Biderman et al. 2024).
- Sweeps are expensive at post-training scale. A closed-form LR within a few percent of optimal saves the 6-run grid per model-dataset pair, and rank-independence of LR removes a whole sweep axis.
- Capacity arithmetic predicts when LoRA stops being free. LoRA does not hit a hard loss floor on large datasets; it loses training efficiency once the dataset exceeds adapter capacity. Counting params vs tokens tells you in advance whether rank 32 suffices or the job wants rank 128 or full fine-tuning.
When to use it (and when not)¶
- Use LoRA with these rules for RL post-training (small ranks, even rank 1-8, match full fine-tuning), instruction-tuning and reasoning SFT on small-to-medium datasets, and any capacity-bounded adaptation where params >= trained tokens holds at an affordable rank.
- Prefer full fine-tuning (or accept degraded efficiency) when the SFT corpus exceeds LoRA capacity at practical ranks, or when large-batch training is required: LoRA pays a loss penalty beyond a batch-size threshold that higher rank does not fix (a property of the product-of-matrices parametrization).
- Recalibrate rather than extrapolate for families outside the fitted ones; the cookbook deliberately raises for uncalibrated models (DeepSeek, Kimi, gpt-oss, Nemotron as of mid-2026) instead of guessing.
Architecture¶
flowchart TB
TASK["Task type"] -->|"RL"| RSMALL["rank 8-32 (capacity ~ bits per episode)"]
TASK -->|"SFT small/medium data"| R32["rank 32 default"]
TASK -->|"SFT large corpus"| CAP["capacity check: params >= trained tokens"]
CAP -->|"fits"| R128["raise rank (64-128+)"]
CAP -->|"does not fit"| FULL["full fine-tune"]
subgraph LRBOX["Learning rate"]
H["hidden_size"] --> F["LR = 5e-5 x (2000/H)^P"]
F -->|"LoRA: x10"| OUT["peak LR + linear/cosine schedule"]
F -->|"full FT: x1"| OUT
end
R32 --> LRBOX
R128 --> LRBOX
RSMALL --> LRBOX
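The rank branch of the flowchart can be sketched as a small decision function. This is a hedged illustration: `choose_rank`, the RL default of 8 (the low end of the 8-32 band), and the `max_practical_rank` cutoff of 256 are assumptions made for the example; the per-rank parameter count is the Qwen3.5-9B total used elsewhere on this page.

```python
import math

# Sum of the mlp + attn + unembed per-rank values quoted for Qwen3.5-9B.
QWEN_9B_PER_RANK = 1_572_864 + 1_130_496 + 252_416


def choose_rank(task, trained_tokens=0, per_rank_params=QWEN_9B_PER_RANK,
                default_rank=32, rl_rank=8, max_practical_rank=256):
    """Return (method, rank) following the flowchart; rank is None for full FT.

    rl_rank and max_practical_rank are illustrative assumptions.
    """
    if task == "rl":
        return ("lora", rl_rank)
    if task != "sft":
        raise ValueError(f"unknown task type: {task!r}")
    # Capacity check: LoRA params should be >= trained (weight=1) tokens.
    needed = math.ceil(trained_tokens / per_rank_params)
    if needed <= default_rank:
        return ("lora", default_rank)
    if needed <= max_practical_rank:
        return ("lora", needed)
    return ("full_finetune", None)


small = choose_rank("sft", trained_tokens=50_000_000)
large = choose_rank("sft", trained_tokens=300_000_000)
huge = choose_rank("sft", trained_tokens=2_000_000_000)
```

In practice the returned rank would usually be rounded up to whatever rank the serving stack supports; that rounding is left out to keep the capacity arithmetic visible.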
How to use it¶
Reference template (needs tinker-cookbook; the same numbers apply to any LoRA stack):
from tinker_cookbook import hyperparam_utils
model = "Qwen/Qwen3.5-9B"
lr = hyperparam_utils.get_lr(model, is_lora=True) # ~5e-4 territory
ratio = hyperparam_utils.get_lora_lr_over_full_finetune_lr(model) # 10.0
params = hyperparam_utils.get_lora_param_count(model, lora_rank=32)
The executed block below reproduces the formula and asserts its load-bearing properties:
BASE_LR = 5e-5
LORA_MULTIPLIER = 10.0
FAMILY_EXPONENT = {"llama": 0.781, "qwen": 0.0775}
HIDDEN_SIZE = {
    "meta-llama/Llama-3.2-1B": 2048,
    "meta-llama/Llama-3.1-8B": 4096,
    "meta-llama/Llama-3.1-70B": 8192,
    "Qwen/Qwen3.5-35B-A3B": 2048,
    "Qwen/Qwen3.5-9B": 4096,
    "Qwen/Qwen3.5-27B": 5120,
}
def get_lr(model, is_lora=True):
    family = next((f for f in FAMILY_EXPONENT if f in model.lower()), None)
    if family is None:
        raise ValueError(f"no calibrated LR formula for {model}")
    lr = BASE_LR * (LORA_MULTIPLIER if is_lora else 1.0)
    return lr * (2000 / HIDDEN_SIZE[model]) ** FAMILY_EXPONENT[family]
# The LoRA-to-full-finetune LR ratio is 10 for every calibrated model.
for m in HIDDEN_SIZE:
    assert abs(get_lr(m, is_lora=True) / get_lr(m, is_lora=False) - 10.0) < 1e-12
# Spot value: Llama-3.1-8B LoRA LR = 5e-4 * (2000/4096)^0.781 ~= 2.86e-4.
lr_8b = get_lr("meta-llama/Llama-3.1-8B")
assert abs(lr_8b - 5e-4 * (2000 / 4096) ** 0.781) < 1e-19
assert 2.8e-4 < lr_8b < 2.9e-4
# Llama's exponent (0.781) makes LR fall steeply with width: ~2.95x from
# 1B (hidden 2048) to 70B (hidden 8192). Wider model, lower LR.
ratio = get_lr("meta-llama/Llama-3.2-1B") / get_lr("meta-llama/Llama-3.1-70B")
assert abs(ratio - (8192 / 2048) ** 0.781) < 1e-12 and 2.9 < ratio < 3.0
# Qwen's exponent (0.0775) is nearly flat: under 8% spread across the family.
qwen = [get_lr(m) for m in HIDDEN_SIZE if m.startswith("Qwen")]
assert max(qwen) / min(qwen) < 1.08
# Adversarial: an uncalibrated family must raise, not silently guess an LR.
try:
    get_lr("mistralai/Mistral-7B-v0.3")
    raise AssertionError("should have raised")
except ValueError:
    pass
print("LR scaling rules: OK")
Rank selection is capacity arithmetic. Per-rank trainable parameters are a fixed per-model constant (sum of adapted submodules), so required rank scales linearly with trained tokens (executed, asserts passing):
import math
# Trainable LoRA parameters per unit of rank, by adapted submodule
# (measured values from tinker-cookbook hyperparam_utils, mid-2026).
PER_RANK = {
    "Qwen/Qwen3.5-9B": {"mlp": 1_572_864, "attn": 1_130_496, "unembed": 252_416},
    "deepseek-ai/DeepSeek-V3.1": {"mlp": 94_307_328, "attn": 2_440_000, "unembed": 136_448},
}
def lora_param_count(model, rank, mlp=True, attn=True, unembed=True):
    parts = PER_RANK[model]
    selected = [parts["mlp"] * mlp, parts["attn"] * attn, parts["unembed"] * unembed]
    if sum(selected) == 0:
        raise ValueError("at least one submodule must be adapted")
    return rank * sum(selected)
def min_rank_for_dataset(model, completion_tokens):
    # SL rule of thumb: LoRA params should be >= trained (weight=1) tokens.
    per_rank = sum(PER_RANK[model].values())
    return math.ceil(completion_tokens / per_rank)
# Rank 32 on Qwen3.5-9B holds ~94.6M trainable params, enough for an SFT set
# with up to ~94.6M completion tokens.
assert lora_param_count("Qwen/Qwen3.5-9B", 32) == 94_584_832
# A 500M-completion-token corpus pushes the same model to rank >= 170:
# capacity, not the loss floor, is what degrades first on big datasets.
assert min_rank_for_dataset("Qwen/Qwen3.5-9B", 500_000_000) == 170
# Monotonic: more trained tokens never lowers the required rank.
ranks = [min_rank_for_dataset("Qwen/Qwen3.5-9B", n) for n in (1e6, 1e8, 1e9)]
assert ranks == sorted(ranks)
# Adversarial placement check: attention-only LoRA on Qwen3.5-9B keeps only
# ~38% of per-rank capacity, and on MoE models it is far worse: DeepSeek-V3.1
# concentrates 97% of per-rank params in the MLP experts.
full = lora_param_count("Qwen/Qwen3.5-9B", 32)
attn_only = lora_param_count("Qwen/Qwen3.5-9B", 32, mlp=False, unembed=False)
assert 0.37 < attn_only / full < 0.39
ds = PER_RANK["deepseek-ai/DeepSeek-V3.1"]
assert ds["mlp"] / sum(ds.values()) > 0.97
# Adversarial: adapting no submodules must raise.
try:
    lora_param_count("Qwen/Qwen3.5-9B", 32, mlp=False, attn=False, unembed=False)
    raise AssertionError("should have raised")
except ValueError:
    pass
print("LoRA capacity rules: OK")
Working defaults by method (tinker-cookbook recipe values, mid-2026): SFT LR 1e-4 to 5e-4 at rank 32; RL LR 1e-5 to 4e-5 at rank 32 with group sizes 4-16; DPO LR ~1e-5 at rank 32 with beta 0.1; distillation LR ~1e-4 at rank 128. Schedules: linear or cosine decay to zero over the planned steps; aim for at least 100 optimizer steps (1000+ for best results).
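The decay-to-zero schedules mentioned above can be sketched in a few lines. This is a minimal illustration with no warmup phase (none is specified here); `lr_at_step` is a hypothetical helper name, and the peak LR and step count in the usage lines are example values only.

```python
import math


def lr_at_step(step, total_steps, peak_lr, kind="linear"):
    """Learning rate at `step`, decaying from peak_lr to zero at total_steps."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    progress = step / total_steps
    if kind == "linear":
        return peak_lr * (1.0 - progress)
    if kind == "cosine":
        return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
    raise ValueError(f"unknown schedule kind: {kind!r}")


# Example: 1000 planned optimizer steps at a peak LR of 2e-4.
linear = [lr_at_step(s, 1000, 2e-4, "linear") for s in (0, 500, 1000)]
cosine = [lr_at_step(s, 1000, 2e-4, "cosine") for s in (0, 500, 1000)]
```

Both schedules start at the peak and end at zero; cosine holds the LR higher early and drops faster late, while the two agree at the midpoint.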
How to develop with it¶
- Sweep only what the rules do not pin. With LR from the formula and rank from capacity, remaining sweeps are dataset-specific (epochs, batch size). If sweeping anyway, exploit rank-independence: sweep LR at one rank, reuse it at others. Published guidance for Tulu3-style SFT uses LR around 2e-4 to 1e-3; the best rank is dataset-dependent (commonly 16 to 256).
- Transfer LRs across models, not runs. A known-good LR for model A maps to model B via LR_B = LR_A * mult(B) / mult(A), where mult is each model's multiplier from the LR formula above: the size term (2000 / hidden_size)^P for full fine-tuning, with LoRA adding the flat 10x.
- Count tokens before ranks. Compute the corpus's weight-1 token count (completion tokens under the loss mask, see chat rendering and loss masking) and check it against rank * per_rank_params before defaulting to rank 32.
- Scale LR with batch size cautiously: the sqrt(batch) heuristic applies until LoRA's large-batch penalty kicks in; past that point neither LR nor rank recovers the gap, so prefer more steps at moderate batch.
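Two of the checks in the list above, counting weight-1 tokens against adapter capacity and the sqrt(batch) heuristic, can be sketched as follows. The data layout (one list of per-token loss weights per example) and all function names are assumptions for illustration; no batch-size threshold is encoded because no number is given here.

```python
import math


def count_trained_tokens(examples):
    """Count tokens with loss weight 1 across examples.

    Assumed layout: each example is a list of per-token loss weights,
    0 for prompt tokens and 1 for completion tokens.
    """
    return sum(1 for weights in examples for w in weights if w == 1)


def fits_capacity(rank, per_rank_params, trained_tokens):
    """SL rule of thumb: LoRA params should be >= trained tokens."""
    return rank * per_rank_params >= trained_tokens


def scale_lr_for_batch(base_lr, base_batch, new_batch):
    """sqrt(batch) heuristic; only meaningful below LoRA's large-batch penalty."""
    if base_batch <= 0 or new_batch <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * math.sqrt(new_batch / base_batch)


# Toy corpus: two examples with 3 and 2 completion tokens respectively.
toy = [[0, 0, 1, 1, 1], [0, 1, 1]]
n_tokens = count_trained_tokens(toy)

# Qwen3.5-9B per-rank total from this page: rank 32 holds 94,584,832 params.
ok = fits_capacity(32, 2_955_776, 94_584_832)
too_big = fits_capacity(32, 2_955_776, 500_000_000)

lr_128 = scale_lr_for_batch(2e-4, base_batch=32, new_batch=128)
```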
How to maintain it¶
- Constants drift; re-read the source. The multiplier was a formula before it was flattened to 10; exponents get recalibrated as families are added. Pin the tinker-cookbook version and diff hyperparam_utils.py on upgrade.
- Hidden size is not always top-level. VL and some MoE configs nest hidden_size under text_config; a lookup that grabs the wrong field silently mis-scales the LR.
- Watch the alpha convention when porting. These rules assume the (alpha/r) * B @ A parametrization with fixed alpha (32 in Tinker); a stack that scales alpha with rank re-introduces rank-dependence into the optimal LR.
- Validate on short runs. The first few hundred steps at different ranks with the same LR should produce near-identical loss curves; divergence indicates a parametrization or masking difference, not a genuine rank effect.
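The nested hidden_size pitfall from the list above can be guarded with a small lookup that fails loudly. This is a sketch over a plain dict parsed from a model's config.json; `resolve_hidden_size` is a hypothetical helper, and preferring text_config over a top-level value is an assumption made here, not documented tinker-cookbook behavior.

```python
def resolve_hidden_size(config):
    """Return the language-model hidden size from a parsed config dict.

    Assumption: when a text_config block carries hidden_size, it describes
    the language model and takes precedence over any top-level value.
    """
    text_cfg = config.get("text_config")
    if isinstance(text_cfg, dict) and "hidden_size" in text_cfg:
        return text_cfg["hidden_size"]
    if "hidden_size" in config:
        return config["hidden_size"]
    # Raise instead of guessing: a wrong width silently mis-scales the LR.
    raise KeyError("hidden_size not found at top level or under text_config")


dense = resolve_hidden_size({"hidden_size": 4096})
nested = resolve_hidden_size({
    "text_config": {"hidden_size": 5120},
    "vision_config": {"hidden_size": 1152},  # must not be picked up
})
```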
How to run it in production¶
- RL fleets can run tiny ranks. Since small ranks match full fine-tuning for RL, multi-tenant training services and multi-adapter serving both get cheaper: more concurrent adapters per GPU pool, smaller checkpoints to sync (Tinker, multi-LoRA serving, delta weight sync).
- Budget rank by corpus, not habit. Distillation onto large reasoning corpora is the common case that wants rank 128+; single-task SFT rarely does. Higher rank costs adapter memory and sync bandwidth, not base-model compute.
- Keep LoRA compute accounting honest. LoRA trains at roughly the FLOP cost of a forward-backward pass through the frozen base; its efficiency advantage is memory and multi-tenancy. Capacity-driven rank increases are therefore cheap in compute, but adapter weights and their optimizer state grow linearly with rank.
- QLoRA changes the base precision, not these rules: LR and rank guidance carries over; see SFT & LoRA/QLoRA for the quantization mechanics.
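To put rough numbers on the adapter memory and sync-bandwidth cost of rank, here is a back-of-envelope sketch. The byte sizes are assumptions (2-byte bf16 adapter weights, 8 bytes per parameter for two fp32 Adam moments); the actual layout depends on the training stack. The per-rank parameter count is the Qwen3.5-9B total used earlier on this page.

```python
def adapter_footprint(rank, per_rank_params, weight_bytes=2,
                      optimizer_bytes_per_param=8):
    """Estimate adapter parameter count, checkpoint size, and optimizer state.

    weight_bytes and optimizer_bytes_per_param are assumed defaults
    (bf16 weights, two fp32 Adam moments), not measured values.
    """
    if rank < 1:
        raise ValueError("rank must be >= 1")
    params = rank * per_rank_params
    return {
        "params": params,
        "checkpoint_bytes": params * weight_bytes,
        "optimizer_bytes": params * optimizer_bytes_per_param,
    }


PER_RANK_QWEN_9B = 2_955_776  # mlp + attn + unembed per unit of rank

for r in (8, 32, 128):
    fp = adapter_footprint(r, PER_RANK_QWEN_9B)
    print(f"rank {r:>3}: {fp['params'] / 1e6:7.1f}M params, "
          f"checkpoint {fp['checkpoint_bytes'] / 1e6:7.1f} MB, "
          f"optimizer {fp['optimizer_bytes'] / 1e6:7.1f} MB")
```

Under these assumptions every quantity scales linearly with rank, so rank 128 costs 16x the checkpoint and optimizer memory of rank 8 per adapter.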
Failure modes¶
- Full fine-tuning LR reused for LoRA: 10x too low, slow or stalled learning, and the false conclusion that LoRA is inferior.
- LoRA LR reused for full fine-tuning: 10x too high, divergence or catastrophic forgetting.
- Attention-only adapters: underperform all-matrices placement even at matched parameter count; on MoE models attention-only forfeits ~97% of capacity.
- Rank 32 on a corpus 10x its capacity: training efficiency quietly degrades; the run "works" but underperforms full fine-tuning with no hard error.
- Large-batch LoRA: loss penalty beyond a batch-size threshold that rank cannot fix; symptoms look like a bad LR but are parametrization-inherent.
- Cross-family extrapolation: applying Llama's steep exponent to a flat family (or vice versa) mis-sets LR by 2-3x at the size extremes.
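Several of these failure modes produce no hard error, so a pre-launch check can help surface them. The sketch below is a hypothetical lint, not part of tinker-cookbook: `preflight`, its warning codes, and the 3x LR tolerance are all assumptions chosen for illustration, and `formula_lr` is expected to come from the LR formula for the model in question.

```python
def preflight(lr, formula_lr, rank, per_rank_params, trained_tokens,
              adapts_mlp=True, tolerance=3.0):
    """Return a list of warning codes for a planned LoRA run.

    tolerance is an assumed threshold: an LR more than 3x away from the
    formula value is flagged, which catches a 10x LoRA/full-FT mix-up.
    """
    warnings = []
    if lr < formula_lr / tolerance:
        warnings.append("lr_low")         # e.g. full fine-tuning LR reused for LoRA
    elif lr > formula_lr * tolerance:
        warnings.append("lr_high")        # e.g. LoRA LR reused for full fine-tuning
    if not adapts_mlp:
        warnings.append("attention_only") # MLP/MoE layers carry most capacity
    if trained_tokens > rank * per_rank_params:
        warnings.append("over_capacity")  # efficiency degrades without an error
    return warnings


# A run that ports a full fine-tuning LR, skips MLP adapters, and exceeds capacity.
bad = preflight(lr=4.7e-5, formula_lr=4.7e-4, rank=32,
                per_rank_params=2_955_776, trained_tokens=1_000_000_000,
                adapts_mlp=False)

# A run that follows the rules on this page.
good = preflight(lr=4.7e-4, formula_lr=4.7e-4, rank=32,
                 per_rank_params=2_955_776, trained_tokens=50_000_000)
```

The large-batch penalty is deliberately not checked, since no threshold value is given here to test against.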
References¶
- LoRA Without Regret (Thinking Machines): https://thinkingmachines.ai/blog/lora/
- Tinker docs, LoRA primer: https://tinker-docs.thinkingmachines.ai/tinker/lora-primer/
- tinker-cookbook hyperparam_utils.py: https://github.com/thinking-machines-lab/tinker-cookbook/blob/main/tinker_cookbook/hyperparam_utils.py
- LoRA Learns Less and Forgets Less (Biderman et al., 2024): https://arxiv.org/abs/2405.09673
- LoRA (Hu et al., 2021): https://arxiv.org/abs/2106.09685
Related: SFT & LoRA · Tinker · Fine-tuning · Chat Rendering & Loss Masking · Multi-LoRA Serving · RL Scaling Laws · Delta Weight Sync · Glossary