LoRA hyperparameter scaling rules

Scope: empirically calibrated rules for setting LoRA post-training hyperparameters: the 10x learning-rate multiplier over full fine-tuning, hidden-size LR scaling per model family, rank selection from dataset capacity (LoRA params vs trained tokens), adapter placement (all matrices, especially MLP/MoE), and batch-size caveats. Sourced from Thinking Machines' LoRA Without Regret study and the tinker-cookbook hyperparam_utils formulas; mechanics of LoRA itself live in SFT & LoRA/QLoRA.

The Python (stdlib) blocks below reimplement the published formulas and were executed with all asserts passing. The formulas are empirical fits (calibrated on Llama and Qwen families as of mid-2026); treat constants as starting points and re-verify against the tinker-cookbook repo before long runs.

What it is

A small set of closed-form rules that replace grid search for the two hyperparameters that dominate LoRA fine-tuning outcomes, distilled from systematic sweeps in LoRA Without Regret (Schulman et al., Thinking Machines, 2025) and shipped as code in tinker-cookbook's hyperparam_utils:

Learning rate: LR = 5e-5 * M_LoRA * (2000 / hidden_size)^P, with M_LoRA = 10 for LoRA (1 for full fine-tuning) and a per-family exponent P (0.781 for Llama, 0.0775 for Qwen).
Rank: capacity-driven. For supervised learning, LoRA performs like full fine-tuning while trainable LoRA params >= trained (weight=1) completion tokens; rank buys capacity linearly. For RL, very small ranks match full fine-tuning because RL learns on the order of 1 bit per episode.
Placement: adapt all weight matrices; MLP (and MoE expert) layers carry most of the capacity, and attention-only LoRA underperforms even at matched parameter count.
The key invariance: the optimal LR is approximately independent of rank (a consequence of the alpha/r-scaled parametrization), so LR does not need re-tuning when rank changes.

Why use it

The 10x rule kills the most common LoRA failure. Teams port a full fine-tuning LR to LoRA, see it underperform, and conclude LoRA is weak; the optimal LoRA LR is about 10x higher across model sizes (independently observed in Biderman et al. 2024).
Sweeps are expensive at post-training scale. A closed-form LR within a few percent of optimal saves the 6-run grid per model-dataset pair, and rank-independence of LR removes a whole sweep axis.
Capacity arithmetic predicts when LoRA stops being free. LoRA does not hit a hard loss floor on large datasets; it loses training efficiency once the dataset exceeds adapter capacity. Counting params vs tokens tells you in advance whether rank 32 suffices or the job wants rank 128 or full fine-tuning.

When to use it (and when not)

Use LoRA with these rules for RL post-training (small ranks, even rank 1-8, match full fine-tuning), instruction-tuning and reasoning SFT on small-to-medium datasets, and any capacity-bounded adaptation where params >= trained tokens holds at an affordable rank.
Prefer full fine-tuning (or accept degraded efficiency) when the SFT corpus exceeds LoRA capacity at practical ranks, or when large-batch training is required: LoRA pays a loss penalty beyond a batch-size threshold that higher rank does not fix (a property of the product-of-matrices parametrization).
Recalibrate rather than extrapolate for families outside the fitted ones; the cookbook deliberately raises an error for uncalibrated models (DeepSeek, Kimi, gpt-oss, Nemotron as of mid-2026) instead of guessing an LR.

Architecture

flowchart TB
  TASK["Task type"] -->|"RL"| RSMALL["rank 8-32 (capacity ~ bits per episode)"]
  TASK -->|"SFT small/medium data"| R32["rank 32 default"]
  TASK -->|"SFT large corpus"| CAP["capacity check: params >= trained tokens"]
  CAP -->|"fits"| R128["raise rank (64-128+)"]
  CAP -->|"does not fit"| FULL["full fine-tune"]
  subgraph LRBOX["Learning rate"]
    H["hidden_size"] --> F["LR = 5e-5 x (2000/H)^P"]
    F -->|"LoRA: x10"| OUT["peak LR + linear/cosine schedule"]
    F -->|"full FT: x1"| OUT
  end
  R32 --> LRBOX
  R128 --> LRBOX
  RSMALL --> LRBOX
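
The rank branch of the flowchart can be read as a small decision function. The sketch below is one possible encoding, not the cookbook's code: `choose_adaptation`, the RL rank of 8 (the low end of the chart's 8-32 range), and the `max_rank=256` affordability cap are illustrative assumptions. The per-rank parameter count is the Qwen3.5-9B total used in the capacity arithmetic of this document.

```python
import math


def choose_adaptation(task, completion_tokens=0, per_rank_params=0,
                      default_rank=32, max_rank=256):
    """Return ("lora", rank) or ("full", None), following the flowchart.

    max_rank is a hypothetical affordability cap, not a published constant.
    """
    if task == "rl":
        # RL absorbs little information per episode, so a small rank suffices.
        return ("lora", 8)
    if task != "sft":
        raise ValueError(f"unknown task type: {task}")
    if per_rank_params <= 0:
        raise ValueError("per_rank_params must be positive for SFT")
    # Capacity check: LoRA params should be >= trained completion tokens.
    needed = math.ceil(completion_tokens / per_rank_params)
    if needed <= default_rank:
        return ("lora", default_rank)
    if needed <= max_rank:
        return ("lora", needed)
    return ("full", None)


QWEN_9B_PER_RANK = 2_955_776  # mlp + attn + unembed params per unit of rank

print(choose_adaptation("rl"))                                       # ('lora', 8)
print(choose_adaptation("sft", 50_000_000, QWEN_9B_PER_RANK))        # ('lora', 32)
print(choose_adaptation("sft", 500_000_000, QWEN_9B_PER_RANK))       # ('lora', 170)
print(choose_adaptation("sft", 5_000_000_000, QWEN_9B_PER_RANK))     # ('full', None)
```

The learning rate is deliberately absent from this function: because the optimal LR is approximately rank-independent, the rank decision and the LR formula can be evaluated separately.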

How to use it

Reference template (needs tinker-cookbook; the same numbers apply to any LoRA stack):

from tinker_cookbook import hyperparam_utils

model = "Qwen/Qwen3.5-9B"
lr = hyperparam_utils.get_lr(model, is_lora=True)      # ~5e-4 territory
ratio = hyperparam_utils.get_lora_lr_over_full_finetune_lr(model)  # 10.0
params = hyperparam_utils.get_lora_param_count(model, lora_rank=32)

The executed block below reproduces the formula and asserts its load-bearing properties:

BASE_LR = 5e-5
LORA_MULTIPLIER = 10.0
FAMILY_EXPONENT = {"llama": 0.781, "qwen": 0.0775}
HIDDEN_SIZE = {
    "meta-llama/Llama-3.2-1B": 2048,
    "meta-llama/Llama-3.1-8B": 4096,
    "meta-llama/Llama-3.1-70B": 8192,
    "Qwen/Qwen3.5-35B-A3B": 2048,
    "Qwen/Qwen3.5-9B": 4096,
    "Qwen/Qwen3.5-27B": 5120,
}


def get_lr(model, is_lora=True):
    family = next((f for f in FAMILY_EXPONENT if f in model.lower()), None)
    if family is None:
        raise ValueError(f"no calibrated LR formula for {model}")
    lr = BASE_LR * (LORA_MULTIPLIER if is_lora else 1.0)
    return lr * (2000 / HIDDEN_SIZE[model]) ** FAMILY_EXPONENT[family]


# The LoRA-to-full-finetune LR ratio is 10 for every calibrated model.
for m in HIDDEN_SIZE:
    assert abs(get_lr(m, is_lora=True) / get_lr(m, is_lora=False) - 10.0) < 1e-12

# Spot value: Llama-3.1-8B LoRA LR = 5e-4 * (2000/4096)^0.781 ~= 2.86e-4.
lr_8b = get_lr("meta-llama/Llama-3.1-8B")
# Tolerance sits well above one float64 ulp (~5e-20 here) so rounding cannot trip it.
assert abs(lr_8b - 5e-4 * (2000 / 4096) ** 0.781) < 1e-15
assert 2.8e-4 < lr_8b < 2.9e-4

# Llama's exponent (0.781) makes LR fall steeply with width: ~2.95x from
# 1B (hidden 2048) to 70B (hidden 8192). Wider model, lower LR.
ratio = get_lr("meta-llama/Llama-3.2-1B") / get_lr("meta-llama/Llama-3.1-70B")
assert abs(ratio - (8192 / 2048) ** 0.781) < 1e-12 and 2.9 < ratio < 3.0

# Qwen's exponent (0.0775) is nearly flat: under 8% spread across the family.
qwen = [get_lr(m) for m in HIDDEN_SIZE if m.startswith("Qwen")]
assert max(qwen) / min(qwen) < 1.08

# Adversarial: an uncalibrated family must raise, not silently guess an LR.
try:
    get_lr("mistralai/Mistral-7B-v0.3")
    raise AssertionError("should have raised")
except ValueError:
    pass

print("LR scaling rules: OK")

Rank selection is capacity arithmetic. Per-rank trainable parameters are a fixed per-model constant (sum of adapted submodules), so required rank scales linearly with trained tokens (executed, asserts passing):

import math

# Trainable LoRA parameters per unit of rank, by adapted submodule
# (measured values from tinker-cookbook hyperparam_utils, mid-2026).
PER_RANK = {
    "Qwen/Qwen3.5-9B": {"mlp": 1_572_864, "attn": 1_130_496, "unembed": 252_416},
    "deepseek-ai/DeepSeek-V3.1": {"mlp": 94_307_328, "attn": 2_440_000, "unembed": 136_448},
}


def lora_param_count(model, rank, mlp=True, attn=True, unembed=True):
    parts = PER_RANK[model]
    selected = [parts["mlp"] * mlp, parts["attn"] * attn, parts["unembed"] * unembed]
    if sum(selected) == 0:
        raise ValueError("at least one submodule must be adapted")
    return rank * sum(selected)


def min_rank_for_dataset(model, completion_tokens):
    # SL rule of thumb: LoRA params should be >= trained (weight=1) tokens.
    per_rank = sum(PER_RANK[model].values())
    return math.ceil(completion_tokens / per_rank)


# Rank 32 on Qwen3.5-9B holds ~94.6M trainable params, enough for an SFT set
# with up to ~94.6M completion tokens.
assert lora_param_count("Qwen/Qwen3.5-9B", 32) == 94_584_832

# A 500M-completion-token corpus pushes the same model to rank >= 170:
# on big datasets the adapter runs out of capacity and trains less
# efficiently; it does not hit a hard loss floor.
assert min_rank_for_dataset("Qwen/Qwen3.5-9B", 500_000_000) == 170
# Monotonic: more trained tokens never lowers the required rank.
ranks = [min_rank_for_dataset("Qwen/Qwen3.5-9B", n) for n in (1e6, 1e8, 1e9)]
assert ranks == sorted(ranks)

# Adversarial placement check: attention-only LoRA on Qwen3.5-9B keeps only
# ~38% of per-rank capacity, and on MoE models it is far worse: DeepSeek-V3.1
# concentrates 97% of per-rank params in the MLP experts.
full = lora_param_count("Qwen/Qwen3.5-9B", 32)
attn_only = lora_param_count("Qwen/Qwen3.5-9B", 32, mlp=False, unembed=False)
assert 0.37 < attn_only / full < 0.39
ds = PER_RANK["deepseek-ai/DeepSeek-V3.1"]
assert ds["mlp"] / sum(ds.values()) > 0.97

# Adversarial: adapting no submodules must raise.
try:
    lora_param_count("Qwen/Qwen3.5-9B", 32, mlp=False, attn=False, unembed=False)
    raise AssertionError("should have raised")
except ValueError:
    pass

print("LoRA capacity rules: OK")

Working defaults by method (tinker-cookbook recipe values, mid-2026): SFT LR 1e-4 to 5e-4 at rank 32; RL LR 1e-5 to 4e-5 at rank 32 with group sizes 4-16; DPO LR ~1e-5 at rank 32 with beta 0.1; distillation LR ~1e-4 at rank 128. Schedules: linear or cosine decay to zero over the planned steps; aim for at least 100 optimizer steps (1000+ for best results).
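
The schedule guidance above can be sketched in a few lines. This is a minimal illustration assuming no warmup and a peak LR taken from the SFT range; `lr_at_step` and `optimizer_steps` are hypothetical helper names, not cookbook functions.

```python
import math


def lr_at_step(peak_lr, step, total_steps, kind="linear"):
    """Decay from peak_lr at step 0 to zero at total_steps (no warmup)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    if kind == "linear":
        return peak_lr * (1.0 - progress)
    if kind == "cosine":
        return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
    raise ValueError(f"unknown schedule: {kind}")


def optimizer_steps(num_examples, batch_size, epochs=1):
    """Number of optimizer updates for a run without gradient accumulation."""
    return math.ceil(num_examples / batch_size) * epochs


steps = optimizer_steps(num_examples=12_800, batch_size=128)
if steps < 100:
    print(f"only {steps} optimizer steps planned; aim for at least 100")

for kind in ("linear", "cosine"):
    samples = [lr_at_step(2e-4, s, steps, kind) for s in (0, steps // 2, steps)]
    print(kind, [f"{lr:.1e}" for lr in samples])
```

Both schedules start at the peak and end at zero; they differ only in how quickly the LR falls in between, so the choice rarely needs its own sweep.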

How to develop with it

Sweep only what the rules do not pin. With LR from the formula and rank from capacity, remaining sweeps are dataset-specific (epochs, batch size). If sweeping anyway, exploit rank-independence: sweep LR at one rank, reuse it at others. Published guidance for Tulu3-style SFT uses LR around 2e-4 to 1e-3; the best rank is dataset-dependent (commonly 16 to 256).
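Transfer LRs across models, not runs. A known-good LR for model A maps to model B via LR_B = LR_A * mult(B) / mult(A), where the full fine-tune multiplier is proportional to 1/sqrt(param_count) and the LoRA multiplier is that value times the flat 10x. The 10x cancels when both runs use LoRA, so the transfer reduces to LR_B = LR_A * sqrt(params_A / params_B).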
Count tokens before ranks. Compute the corpus's weight-1 token count (completion tokens under the loss mask, see chat rendering and loss masking) and check it against rank * per_rank_params before defaulting to rank 32.
Scale LR with batch size cautiously: the sqrt(batch) heuristic applies until LoRA's large-batch penalty kicks in; past that point neither LR nor rank recovers the gap, so prefer more steps at moderate batch.
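
Two of the steps above are simple arithmetic and can be sketched directly: the cross-model LR transfer and the weight-1 token count. The helper names and the round parameter counts are illustrative assumptions; real runs should use the cookbook's own parameter counts and the loss weights produced by the actual renderer.

```python
import math


def lr_multiplier(param_count, is_lora=True):
    """Relative LR multiplier: 1/sqrt(params), times a flat 10x for LoRA."""
    if param_count <= 0:
        raise ValueError("param_count must be positive")
    return (10.0 if is_lora else 1.0) / math.sqrt(param_count)


def transfer_lr(lr_a, params_a, params_b, is_lora=True):
    """LR_B = LR_A * mult(B) / mult(A)."""
    return lr_a * lr_multiplier(params_b, is_lora) / lr_multiplier(params_a, is_lora)


def count_trained_tokens(loss_weights):
    """Count weight-1 tokens; loss_weights holds one weight list per example."""
    return sum(1 for weights in loss_weights for w in weights if w == 1)


# Hypothetical round parameter counts: a 4x larger model halves the LR.
lr_b = transfer_lr(2e-4, params_a=8e9, params_b=32e9)
print(f"{lr_b:.1e}")  # 1.0e-04

# Prompt tokens carry weight 0, completion tokens weight 1.
batch = [[0, 0, 0, 1, 1], [0, 1, 1, 1]]
tokens = count_trained_tokens(batch)
print(tokens)  # 5

# Capacity check before defaulting to rank 32 (per-rank count for Qwen3.5-9B).
print(tokens <= 32 * 2_955_776)  # True
```

Counting only weight-1 tokens matters because prompt tokens are masked out of the loss and do not consume adapter capacity under the rule of thumb.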

How to maintain it

Constants drift; re-read the source. The multiplier was a formula before it was flattened to 10; exponents get recalibrated as families are added. Pin the tinker-cookbook version and diff hyperparam_utils.py on upgrade.
Hidden size is not always top-level. VL and some MoE configs nest hidden_size under text_config; a lookup that grabs the wrong field silently mis-scales the LR.
Watch the alpha convention when porting. These rules assume the (alpha/r) * B @ A parametrization with fixed alpha (32 in Tinker); a stack that scales alpha with rank re-introduces rank-dependence into the optimal LR.
Validate on short runs. The first few hundred steps at different ranks with the same LR should produce near-identical loss curves; divergence indicates a parametrization or masking difference, not a genuine rank effect.
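
Two of these maintenance hazards are easy to guard against in code. The sketch below assumes the model config has already been parsed into a plain dict (for example from `config.json`); `resolve_hidden_size`, `lora_scale`, and the example config values are hypothetical, and `alpha = 2 * rank` stands in for any convention that ties alpha to rank.

```python
def resolve_hidden_size(config):
    """Find the language model's hidden_size in a parsed config dict."""
    if "hidden_size" in config:
        return config["hidden_size"]
    # VL and some MoE configs nest the text model's settings.
    text_config = config.get("text_config")
    if isinstance(text_config, dict) and "hidden_size" in text_config:
        return text_config["hidden_size"]
    # Fail loudly rather than fall back to an unrelated field.
    raise KeyError("hidden_size not found at top level or under text_config")


def lora_scale(alpha, rank):
    """Multiplier applied to B @ A in the (alpha / r) parametrization."""
    return alpha / rank


# Illustrative configs; the vision tower's width must never be picked up.
text_only = {"hidden_size": 4096}
vision_language = {
    "vision_config": {"hidden_size": 1152},
    "text_config": {"hidden_size": 4096},
}
print(resolve_hidden_size(text_only), resolve_hidden_size(vision_language))

# Fixed alpha: the scale shrinks as 1/rank, which these rules assume.
print([lora_scale(32, r) for r in (8, 32, 128)])      # [4.0, 1.0, 0.25]
# Alpha tied to rank: the scale is constant, so the LR guidance no longer transfers.
print([lora_scale(2 * r, r) for r in (8, 32, 128)])   # [2.0, 2.0, 2.0]
```

Raising on a missing field mirrors the cookbook's choice to fail on uncalibrated models: a wrong hidden size would otherwise change the LR without any visible error.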

How to run it in production

RL fleets can run tiny ranks. Since small ranks match full fine-tuning for RL, multi-tenant training services and multi-adapter serving both get cheaper: more concurrent adapters per GPU pool, smaller checkpoints to sync (Tinker, multi-LoRA serving, delta weight sync).
Budget rank by corpus, not habit. Distillation onto large reasoning corpora is the common case that wants rank 128+; single-task SFT rarely does. Higher rank costs adapter memory and sync bandwidth, not base-model compute.
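Keep LoRA compute accounting honest. A LoRA step still pays for the forward and backward pass through the frozen base (LoRA Without Regret estimates slightly more than two-thirds of full fine-tuning's FLOPs per pass), so its main efficiency advantage is memory and multi-tenancy. Capacity-driven rank increases add little compute, but adapter weights and their optimizer state grow linearly with rank.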
QLoRA changes the base precision, not these rules: LR and rank guidance carries over; see SFT & LoRA/QLoRA for the quantization mechanics.
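
The memory side of rank budgeting is linear arithmetic. The sketch below assumes 2-byte (bf16) adapter weights and an Adam-style optimizer holding two 4-byte moments per parameter; both byte sizes and the helper name `adapter_footprint_bytes` are assumptions to adjust for the actual stack.

```python
def adapter_footprint_bytes(rank, per_rank_params, weight_bytes=2, moment_bytes=4):
    """Estimate adapter checkpoint and optimizer-state size in bytes."""
    if rank <= 0 or per_rank_params <= 0:
        raise ValueError("rank and per_rank_params must be positive")
    params = rank * per_rank_params
    return {
        "params": params,
        "weights": params * weight_bytes,        # what gets synced and served
        "optimizer": params * 2 * moment_bytes,  # two Adam moments per parameter
    }


QWEN_9B_PER_RANK = 2_955_776  # mlp + attn + unembed params per unit of rank

for rank in (8, 32, 128):
    fp = adapter_footprint_bytes(rank, QWEN_9B_PER_RANK)
    mib = {k: v / 2**20 for k, v in fp.items() if k != "params"}
    print(rank, f"weights={mib['weights']:.1f} MiB", f"optimizer={mib['optimizer']:.1f} MiB")
```

Under these assumptions the optimizer state is four times the checkpoint size, which is why high ranks are felt first in training memory and only second in sync bandwidth.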

Failure modes

Full fine-tuning LR reused for LoRA: 10x too low, slow or stalled learning, and the false conclusion that LoRA is inferior.
LoRA LR reused for full fine-tuning: 10x too high, divergence or catastrophic forgetting.
Attention-only adapters: underperform all-matrices placement even at matched parameter count; on MoE models attention-only forfeits ~97% of capacity.
Rank 32 on a corpus 10x its capacity: training efficiency quietly degrades; the run "works" but underperforms full fine-tuning with no hard error.
Large-batch LoRA: loss penalty beyond a batch-size threshold that rank cannot fix; symptoms look like a bad LR but are parametrization-inherent.
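Cross-family extrapolation: applying Llama's steep exponent to a flat family (or vice versa) mis-sets LR by roughly 2-3x on wide models (about 1.9x at hidden size 5120 and 2.7x at 8192), while models near hidden size 2048 are barely affected.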
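
Several of these failure modes are silent, so a pre-launch check can help surface them. The following is a sketch under stated assumptions: `preflight` is a hypothetical helper, and the 0.2x and 5x ratio thresholds are arbitrary illustrative cutoffs around the 10x gap, not calibrated values.

```python
def preflight(lr, formula_lr, rank, per_rank_params, trained_tokens, adapted):
    """Return human-readable warnings for a planned LoRA run."""
    warnings = []
    ratio = lr / formula_lr
    if ratio < 0.2:
        warnings.append("LR is far below the LoRA formula; a full fine-tuning LR may have been reused")
    elif ratio > 5.0:
        warnings.append("LR is far above the LoRA formula; expect instability")
    if "mlp" not in adapted:
        warnings.append("MLP/MoE layers are not adapted; most per-rank capacity is unused")
    if trained_tokens > rank * per_rank_params:
        warnings.append("trained tokens exceed adapter capacity; raise rank or consider full fine-tuning")
    return warnings


# A run that reuses a full fine-tuning LR, adapts attention only, and
# trains rank 32 on a corpus well beyond its capacity.
problems = preflight(
    lr=2.86e-5,
    formula_lr=2.86e-4,
    rank=32,
    per_rank_params=1_130_496,   # attention-only per-rank count for Qwen3.5-9B
    trained_tokens=500_000_000,
    adapted={"attn"},
)
for p in problems:
    print("WARNING:", p)
```

The large-batch penalty is not covered here because the section gives no numeric batch-size threshold to check against.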

References

LoRA Without Regret (Thinking Machines): https://thinkingmachines.ai/blog/lora/
Tinker docs, LoRA primer: https://tinker-docs.thinkingmachines.ai/tinker/lora-primer/
tinker-cookbook hyperparam_utils.py: https://github.com/thinking-machines-lab/tinker-cookbook/blob/main/tinker_cookbook/hyperparam_utils.py
LoRA Learns Less and Forgets Less (Biderman et al., 2024): https://arxiv.org/abs/2405.09673
LoRA (Hu et al., 2021): https://arxiv.org/abs/2106.09685