Model merging
Scope: combining several fine-tuned checkpoints into one model by arithmetic on their weights (no gradient, no data, no rollouts) via task vectors, interference resolution (TIES/DARE), and interpolation (SLERP/soups). The cheapest way to compose capabilities in the post-training stack; a sibling of finetuning (SFT/LoRA) and distillation (on-policy distillation) that needs no training loop at all.
Reference templates on real APIs; pin versions and validate before production use.
What it is
Model merging fuses the weights of several models that share a base into one model of the same size, with no further training. The unit is the task vector: the delta between a fine-tuned checkpoint and its base, τ = θ_ft − θ_base. Adding task vectors composes skills; negating one removes a behaviour ("Editing Models with Task Arithmetic").[1] The simplest merge is plain weight averaging, as in model soups: averaging fine-tunes of the same base can improve accuracy and robustness at no extra inference cost.[2]
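The arithmetic above can be sketched on toy weights. This is a minimal illustration, assuming a model is a plain dict of parameter name to a flat list of floats; the helper names (`task_vector`, `apply_task_vectors`, `uniform_soup`) are hypothetical and not part of any library. Real checkpoints would use tensors, but the operations are the same.

```python
from typing import Dict, List

# Toy stand-in for a state dict: parameter name -> flat list of values (assumption).
Weights = Dict[str, List[float]]


def task_vector(finetuned: Weights, base: Weights) -> Weights:
    """tau = theta_ft - theta_base, computed per parameter."""
    return {k: [f - b for f, b in zip(finetuned[k], base[k])] for k in base}


def apply_task_vectors(base: Weights, vectors: List[Weights], scales: List[float]) -> Weights:
    """theta = theta_base + sum_i scale_i * tau_i; a negative scale subtracts a behaviour."""
    merged = {k: list(v) for k, v in base.items()}
    for tau, scale in zip(vectors, scales):
        for k in merged:
            merged[k] = [m + scale * t for m, t in zip(merged[k], tau[k])]
    return merged


def uniform_soup(models: List[Weights]) -> Weights:
    """Plain weight averaging of fine-tunes that share one base."""
    n = len(models)
    return {k: [sum(vals) / n for vals in zip(*(m[k] for m in models))] for k in models[0]}


base = {"w": [1.0, 2.0, 3.0]}
ft_math = {"w": [1.5, 2.0, 3.0]}  # hypothetical maths fine-tune
ft_code = {"w": [1.0, 2.0, 2.0]}  # hypothetical code fine-tune

tau_math = task_vector(ft_math, base)
tau_code = task_vector(ft_code, base)

both = apply_task_vectors(base, [tau_math, tau_code], [1.0, 1.0])  # compose two skills
undone = apply_task_vectors(ft_math, [tau_math], [-1.0])           # negate one task vector
soup = uniform_soup([ft_math, ft_code])                            # plain average
print(both, undone, soup)
```

In this toy case the two task vectors touch different parameters, so addition composes them cleanly; overlapping deltas are where interference appears.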
The problem merging must solve is interference: when task vectors from different models overlap with redundant magnitudes or opposite signs, naive averaging cancels signal and degrades every parent. Two methods resolve it:
- TIES-Merging, in three steps: Trim small-magnitude delta params, Elect Sign (pick the dominant sign per parameter across models), then Disjoint Merge only the agreeing params.[3]
- DARE ("Drop And REscale"): randomly drop a fraction p of delta params and rescale survivors by 1/(1−p); it can zero 90–99% of deltas with little loss and is applied before merging (e.g. dare_ties).[4]
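The three TIES steps can be sketched on flat toy task vectors. This is a simplified reading of the method, not mergekit's implementation: it assumes one global `density` for all models, equal model weights, and breaks a zero-sum sign tie toward positive. The function names are hypothetical.

```python
from typing import List


def trim(tau: List[float], density: float) -> List[float]:
    """Trim: keep the top `density` fraction of entries by magnitude, zero the rest."""
    k = max(1, round(density * len(tau)))
    order = sorted(range(len(tau)), key=lambda i: abs(tau[i]), reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(tau)]


def ties_merge(taus: List[List[float]], density: float) -> List[float]:
    trimmed = [trim(t, density) for t in taus]
    merged = []
    for column in zip(*trimmed):  # one parameter position across all models
        # Elect sign: the sign of the summed deltas is the sign with more total mass.
        sign = 1.0 if sum(column) >= 0 else -1.0
        # Disjoint merge: average only the non-zero deltas that agree with the elected sign.
        agreeing = [v for v in column if v * sign > 0]
        merged.append(sum(agreeing) / len(agreeing) if agreeing else 0.0)
    return merged


tau_a = [0.9, -0.1, 0.4, 0.0]
tau_b = [-0.6, 0.05, 0.5, 0.1]
merged_tau = ties_merge([tau_a, tau_b], density=0.5)
naive_tau = [(a + b) / 2 for a, b in zip(tau_a, tau_b)]
print("ties :", merged_tau)
print("naive:", naive_tau)
```

At the first position the two models disagree in sign: naive averaging shrinks the delta to about 0.15, while the sign election keeps the dominant 0.9 and discards the conflicting value.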
SLERP (spherical linear interpolation) blends two models along an arc of the hypersphere rather than along the straight line between them, preserving norm. The standard tool is mergekit, which exposes all of these methods through a single YAML config.[5]
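For intuition on the SLERP claim, here is a minimal sketch of the textbook formula on two flat vectors. It is an illustration under assumptions, not mergekit's code: the function name and the `eps` threshold are hypothetical, and it falls back to linear interpolation when the vectors are nearly colinear or one has zero norm.

```python
import math
from typing import List


def slerp(t: float, a: List[float], b: List[float], eps: float = 1e-8) -> List[float]:
    """Spherical interpolation: t=0 returns a, t=1 returns b."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a < eps or norm_b < eps:
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    omega = math.acos(max(-1.0, min(1.0, cos_omega)))  # angle between the two vectors
    if math.sin(omega) < eps:  # nearly colinear: the arc degenerates to a line
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    w_a = math.sin((1 - t) * omega) / math.sin(omega)
    w_b = math.sin(t * omega) / math.sin(omega)
    return [w_a * x + w_b * y for x, y in zip(a, b)]


a, b = [1.0, 0.0], [0.0, 1.0]
mid_slerp = slerp(0.5, a, b)
mid_lerp = [0.5 * x + 0.5 * y for x, y in zip(a, b)]
norm = lambda v: math.sqrt(sum(x * x for x in v))
print(norm(mid_slerp), norm(mid_lerp))
```

For these two unit vectors the SLERP midpoint keeps unit norm, while the linear midpoint shrinks to roughly 0.71, which is the effect the paragraph above describes.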
Why use it
- Nearly free. A merge takes seconds to minutes on one GPU (or CPU) and needs no dataset, no training run, and no rollout engine. It is the cheapest capability move in post-training.
- Composes skills. Merge a maths finetune, a code finetune, and a chat finetune of the same base into one model that does all three, without multi-task retraining.[1]
- No inference tax. Unlike an ensemble, the output is one model of the original size: you get the accuracy benefit of averaging without the ensemble's serving cost.[2]
- Robustness and recovery. Souping sibling runs (different seeds/hyperparameters) averages out their idiosyncrasies; merging a domain finetune back toward its base can recover lost general ability.
When to use it (and when not)
- Use merging when you have multiple finetunes of the same base and want to combine or average them cheaply, or to build a strong starting point before a short finetune.
- Prefer finetuning or distillation when you need a capability that is in no parent; merging can only recombine what the task vectors already contain, it cannot create new skill.
- Requires homologous models: same architecture and tokenizer, same base lineage. You cannot meaningfully merge across families (Llama into Qwen).
- Never ship a merge unevaluated. Merges can silently regress on capabilities you did not test; always gate on a held-out eval (evaluation integrity).
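The eval gate in the last bullet can be as small as a score comparison. The sketch below assumes one possible policy (the merge must stay within `tolerance` of the best parent on every held-out task); the function name, task names, and scores are all hypothetical placeholders, and a real gate would use your own suite and thresholds.

```python
from typing import Dict, Tuple


def regressions(
    merged_scores: Dict[str, float],
    parent_scores: Dict[str, Dict[str, float]],
    tolerance: float = 0.02,
) -> Dict[str, Tuple[float, float]]:
    """Return {task: (merged_score, best_parent_score)} for every task that regressed."""
    failed = {}
    for task, score in merged_scores.items():
        best_parent = max(scores[task] for scores in parent_scores.values())
        if score < best_parent - tolerance:
            failed[task] = (score, best_parent)
    return failed


# Made-up numbers purely for illustration.
parents = {
    "math-ft": {"math": 0.62, "code": 0.30, "chat": 0.70},
    "code-ft": {"math": 0.35, "code": 0.48, "chat": 0.69},
}
merged = {"math": 0.61, "code": 0.40, "chat": 0.72}

failed = regressions(merged, parents)
print("promote" if not failed else f"blocked: {failed}")
```

Here the merge holds up on maths and chat but loses too much on code, so the gate blocks promotion. Comparing against the best parent is a strict choice; comparing against the base model is a looser alternative.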
Architecture
```mermaid
flowchart LR
    B["Shared base θ_base"] --> V1["Task vector τ1 = θ_ft1 − θ_base"]
    B --> V2["Task vector τ2 = θ_ft2 − θ_base"]
    B --> V3["Task vector τ3 = θ_ft3 − θ_base"]
    V1 --> RES{"Resolve interference"}
    V2 --> RES
    V3 --> RES
    RES -->|"TIES: trim + elect sign + disjoint merge"| CMB["Weighted combine (density, weight)"]
    RES -->|"DARE: drop + rescale 1/(1−p)"| CMB
    CMB --> M["Merged model θ_base + Σ wᵢ·τ̃ᵢ"]
    M --> EVAL{"Held-out eval gate"}
    EVAL -->|"regressed?"| RES
```
How to use it
mergekit takes a YAML config and writes a merged checkpoint. This TIES example (taken from the mergekit repo, with comments added) merges three Llama-2-13B finetunes over a shared base; density sets how much of each task vector to keep, and weight sets its contribution:
```yaml
# ties.yml: pin mergekit and verify keys on the installed version
models:
  - model: psmathur/orca_mini_v3_13b
    parameters:
      density: [1, 0.7, 0.1] # density gradient across layers
      weight: 1.0
  - model: garage-bAInd/Platypus2-13B
    parameters:
      density: 0.5
      weight: [0, 0.3, 0.7, 1] # weight gradient across layers
  - model: WizardLM/WizardMath-13B-V1.0
    parameters:
      density: 0.33
      weight:
        - filter: mlp # target MLP tensors specifically
          value: 0.5
        - value: 0
merge_method: ties
base_model: TheBloke/Llama-2-13B-fp16
parameters:
  normalize: true
  int8_mask: true
dtype: float16
```

```shell
# CPU-only works; --cuda uses the GPU, --lazy-unpickle cuts RAM for large models.
mergekit-yaml ties.yml ./merged-model --cuda
```
The output ./merged-model is a standard HF checkpoint: serve it on vLLM like any model, or use it as the base for a short finetune.
How to develop with it
Pick the method by how many models and how much interference:
| merge_method | Use it for | Notes |
|---|---|---|
| linear (model soup) | averaging sibling runs of one base | no interference handling; equal/weighted mean [2] |
| slerp | blending two models | interpolates on the hypersphere; preserves norm |
| task_arithmetic | add/subtract task vectors | the base operation; foundation for the rest [1] |
| ties | merging many task finetunes | trims + sign-elects to cut interference [3] |
| dare_ties / dare_linear | many finetunes, heavy overlap | DARE-sparsifies deltas first, then merges [4] |
| model_stock, della, breadcrumbs, sce | newer variants | geometric/adaptive pruning; verify support on the installed mergekit |
The two knobs that matter are density (the fraction of each task vector kept; lower values drop more, which reduces interference but risks losing signal) and weight (each model's contribution). normalize: true rescales the weights to sum to 1; int8_mask: true cuts merge memory. Both density and weight accept per-layer gradients or tensor filters (as in the TIES example above) for fine control. Sweep them and evaluate each merge; because this search space grows quickly, evolutionary merging automates recipe discovery over parameter space and data-flow space.[6]
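To make the per-layer gradient syntax concrete, the sketch below expands a short list such as `[0, 0.3, 0.7, 1]` into one value per layer. It assumes the anchors are evenly spaced over the layer stack and linearly interpolated; that is an assumption about how gradients behave, so verify it against the installed mergekit. `expand_gradient` is a hypothetical helper, not a mergekit function.

```python
from typing import List


def expand_gradient(anchors: List[float], num_layers: int) -> List[float]:
    """Linearly interpolate evenly spaced anchor values across `num_layers` layers."""
    if len(anchors) == 1 or num_layers == 1:
        return [float(anchors[0])] * num_layers
    values = []
    for layer in range(num_layers):
        # Position of this layer on the anchor axis, from 0 to len(anchors) - 1.
        pos = layer / (num_layers - 1) * (len(anchors) - 1)
        lo = min(int(pos), len(anchors) - 2)
        frac = pos - lo
        values.append(anchors[lo] * (1 - frac) + anchors[lo + 1] * frac)
    return values


ramp = expand_gradient([0.0, 1.0], 5)
tent = expand_gradient([0, 0.5, 1, 0.5, 0], 5)
print(ramp)
print(tent)
```

Under this reading, a two-value gradient is a straight ramp from the first layer to the last, and a longer list shapes the curve in between.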
How to scale it
Merging itself is cheap and does not scale like training; the expensive part is evaluating the merges. Practical scaling is a merge → eval loop: generate candidate configs, merge each one (CPU or one GPU, --lazy-unpickle for low RAM), score it against a held-out suite, and keep the winner. Evolutionary methods run this loop automatically and have discovered merges that beat their parents on held-out tasks.[6] Because a merge produces a standard checkpoint, the eval and promotion path is identical to that of any post-trained model (SRE/MLOps practices).
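The merge → eval loop can be sketched as a plain grid search. Everything here is a placeholder under stated assumptions: `merge_fn` and `eval_fn` are hypothetical callables you would supply (for example, one that writes a YAML config and runs `mergekit-yaml`, and one that runs your held-out suite), and the stubs below only fake a score so the loop itself is runnable.

```python
from itertools import product
from typing import Any, Callable, Dict, Iterable, Iterator, Optional, Tuple

Config = Dict[str, float]


def candidate_configs(densities: Iterable[float], weights: Iterable[float]) -> Iterator[Config]:
    """Enumerate a simple density x weight grid of candidate recipes."""
    for density, weight in product(densities, weights):
        yield {"density": density, "weight": weight}


def merge_eval_loop(
    configs: Iterable[Config],
    merge_fn: Callable[[Config], Any],
    eval_fn: Callable[[Any], float],
) -> Tuple[Optional[Config], float]:
    """Merge each candidate, score it on the held-out suite, keep the best."""
    best_cfg, best_score = None, float("-inf")
    for cfg in configs:
        checkpoint = merge_fn(cfg)   # in practice: write YAML, run the merge
        score = eval_fn(checkpoint)  # in practice: run the held-out eval
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score


# Stubs for illustration only: the "checkpoint" is just the config, and the fake
# score peaks at density=0.5, weight=1.0.
def fake_merge(cfg: Config) -> Config:
    return dict(cfg)


def fake_eval(checkpoint: Config) -> float:
    return -abs(checkpoint["density"] - 0.5) - abs(checkpoint["weight"] - 1.0)


best_cfg, best_score = merge_eval_loop(
    candidate_configs([0.3, 0.5, 0.7], [0.5, 1.0]), fake_merge, fake_eval
)
print(best_cfg, best_score)
```

A grid is the simplest search strategy; swapping `candidate_configs` for a smarter proposer is where evolutionary search would plug in.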
Cookbook (common use cases)
1. SLERP blend of two models
```yaml
# slerp.yml: exactly two models; t is the interpolation factor (0 = base_model, 1 = the other)
slices:
  - sources:
      - model: your-base/finetune-a
        layer_range: [0, 32]
      - model: your-base/finetune-b
        layer_range: [0, 32]
merge_method: slerp
base_model: your-base/finetune-a
parameters:
  t: [0, 0.5, 1, 0.5, 0] # per-layer interpolation schedule
dtype: bfloat16
```
2. Task-arithmetic negation (unlearn a behaviour)
```yaml
# Subtract a task vector to reduce an unwanted capability while keeping the rest.
models:
  - model: your-base/unwanted-behaviour-finetune
    parameters:
      weight: -1.0 # negate the task vector
merge_method: task_arithmetic
base_model: your-base/base-model
dtype: bfloat16
```
3. DARE-TIES merge of many finetunes: set merge_method: dare_ties and a low density (e.g. 0.5) per model to sparsify deltas before the TIES sign election. Of the methods here, this one tolerates the most interference.
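The sparsification step in item 3 is small enough to sketch directly. This is a toy version of DARE's drop-and-rescale on a flat list, with a hypothetical function name; it assumes the drop rate `p` corresponds to `1 − density` in the config, which you should confirm on the installed mergekit.

```python
import random
from typing import List


def dare(tau: List[float], p: float, rng: random.Random) -> List[float]:
    """Drop each delta with probability p and rescale survivors by 1/(1-p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in tau]


rng = random.Random(0)      # fixed seed so the sketch is repeatable
tau = [0.01] * 10_000       # toy task vector with identical deltas
sparse = dare(tau, p=0.9, rng=rng)

dropped = sum(1 for v in sparse if v == 0.0) / len(sparse)
mean_after = sum(sparse) / len(sparse)
print(f"dropped fraction: {dropped:.3f}, mean before: 0.0100, mean after: {mean_after:.4f}")
```

The rescale is what keeps the expected value of each delta unchanged: about 90% of entries are zeroed, yet the mean of the task vector stays close to its original value.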
Failure modes
- Interference from naive averaging. linear-merging conflicting task vectors cancels signal and degrades all parents; use ties/dare_ties when merging many finetunes.[3]
- Base/tokenizer mismatch. Merging models with different bases, architectures, or tokenizers produces garbage; keep the family homologous and set tokenizer_source deliberately.
- Density too low. Trimming/dropping too aggressively discards the very deltas that carry the skill; sweep density.
- No eval gate. A merge that looks fine can regress on untested capabilities; never promote on "it merged": gate on a held-out eval.
- Over-merging. Stacking many models dilutes each; more parents is not monotonically better.
- Merging LoRA vs full weights carelessly. Merging adapters into a base is different from merging task vectors of full finetunes; keep the two straight (SFT/LoRA).
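A cheap pre-merge check catches the structural half of the base/tokenizer mismatch failure. The sketch below compares parameter names and shapes between two models, assuming you have already extracted a name-to-shape mapping from each checkpoint; the helper name and the parameter names are hypothetical. It cannot detect two models that share shapes but come from different base lineages, so it complements the eval gate and does not replace it.

```python
from typing import Dict, List, Tuple

# Assumed input: parameter name -> tensor shape, extracted from each checkpoint.
Shapes = Dict[str, Tuple[int, ...]]


def homology_problems(base: Shapes, other: Shapes) -> List[str]:
    """List structural differences that would make a weight-space merge meaningless."""
    problems = []
    for name in sorted(base.keys() - other.keys()):
        problems.append(f"missing in other: {name}")
    for name in sorted(other.keys() - base.keys()):
        problems.append(f"extra in other: {name}")
    for name in sorted(base.keys() & other.keys()):
        if base[name] != other[name]:
            problems.append(f"shape mismatch: {name} {base[name]} vs {other[name]}")
    return problems


base_shapes = {"embed.weight": (32000, 4096), "layers.0.mlp.weight": (11008, 4096)}
# A fine-tune whose tokenizer gained 16 tokens, so the embedding matrix grew.
ft_shapes = {"embed.weight": (32016, 4096), "layers.0.mlp.weight": (11008, 4096)}

problems = homology_problems(base_shapes, ft_shapes)
print(problems or "structurally compatible")
```

An extended vocabulary is a common cause of this kind of mismatch, and it is the case where choosing the tokenizer source deliberately matters.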
References
- Model soups (weight averaging): https://arxiv.org/abs/2203.05482
- Editing Models with Task Arithmetic (task vectors): https://arxiv.org/abs/2212.04089
- TIES-Merging (trim, elect sign, disjoint merge): https://arxiv.org/abs/2306.01708
- DARE — Language Models are Super Mario (drop and rescale): https://arxiv.org/abs/2311.03099
- Arcee's MergeKit (the toolkit): https://arxiv.org/abs/2403.13257 · repo: https://github.com/arcee-ai/mergekit
- Evolutionary Optimization of Model Merging Recipes (Sakana AI): https://arxiv.org/abs/2403.13187
Related: SFT/LoRA · On-policy distillation · Fine-tuning and post-training · DPO · RLVR · Evaluation integrity · SRE/MLOps practices · Serving open-weight models · Glossary
1. Ilharco et al., Editing Models with Task Arithmetic: a task vector θ_ft − θ_base is a direction in weight space; adding vectors composes tasks, negating one forgets a task. https://arxiv.org/abs/2212.04089
2. Wortsman et al., Model soups: averaging the weights of multiple fine-tunes of the same base improves accuracy and robustness with no extra inference cost. https://arxiv.org/abs/2203.05482
3. Yadav et al., TIES-Merging: Trim low-magnitude deltas, Elect a dominant sign per parameter, and Disjoint-Merge only agreeing parameters to resolve interference. https://arxiv.org/abs/2306.01708
4. Yu et al., DARE: randomly drop delta parameters with ratio p and rescale survivors by 1/(1−p); removes up to 90–99% of deltas and is applied before merging (e.g. dare_ties). https://arxiv.org/abs/2311.03099
5. Goddard et al., Arcee's MergeKit: an open-source toolkit that merges checkpoints into multitask models without additional training, driven by a single YAML (merge_method, models/slices, base_model, parameters, dtype). https://arxiv.org/abs/2403.13257
6. Akiba et al. (Sakana AI), Evolutionary Optimization of Model Merging Recipes: evolutionary search over parameter-space and data-flow-space merge recipes discovers merges that beat their parents. https://arxiv.org/abs/2403.13187