Model merging

Scope: combining several fine-tuned checkpoints into one model by arithmetic on their weights (no gradients, no data, no rollouts), using task vectors, interference resolution (TIES/DARE), and interpolation (SLERP/soups). It is the cheapest way to compose capabilities in the post-training stack: a sibling of finetuning (SFT/LoRA) and distillation (on-policy distillation) that needs no training loop at all.

The configs and commands below are reference templates against real APIs; pin versions and validate before production use.

What it is

Model merging fuses the weights of several models that share a base into one model of the same size, with no further training. The unit is the task vector: the delta between a fine-tuned checkpoint and its base, τ = θ_ft − θ_base. Adding task vectors composes skills; negating one removes a behaviour ("Editing Models with Task Arithmetic").¹ The simplest merge is plain weight averaging: the model soups work shows that averaging fine-tunes of the same base improves accuracy and robustness at no inference cost.²
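To make the arithmetic concrete, here is a minimal NumPy sketch of task vectors on toy checkpoints. The function names, the single tensor named "w", and all values are hypothetical; real checkpoints hold many tensors, but the per-tensor operation is the same.

```python
import numpy as np

def task_vector(theta_ft, theta_base):
    """tau = theta_ft - theta_base, computed tensor by tensor."""
    return {name: theta_ft[name] - theta_base[name] for name in theta_base}

def apply_task_vectors(theta_start, vectors, weights):
    """theta_start + sum_i w_i * tau_i; a negative weight subtracts a behaviour."""
    merged = {name: w.copy() for name, w in theta_start.items()}
    for tau, w in zip(vectors, weights):
        for name in merged:
            merged[name] += w * tau[name]
    return merged

# Toy checkpoints with one tensor each (hypothetical values).
base = {"w": np.array([1.0, 1.0, 1.0])}
maths = {"w": np.array([1.5, 1.0, 1.0])}
code = {"w": np.array([1.0, 0.5, 1.0])}

tau_maths = task_vector(maths, base)
tau_code = task_vector(code, base)

# Addition composes both deltas on top of the base.
both = apply_task_vectors(base, [tau_maths, tau_code], [1.0, 1.0])
# Negation: subtracting the code task vector from the code finetune recovers the base.
undone = apply_task_vectors(code, [tau_code], [-1.0])
# A uniform soup is the plain mean of the fine-tuned weights.
soup = {name: (maths[name] + code[name]) / 2 for name in base}
```

In this toy case the two task vectors touch different coordinates, so addition is lossless; the interference problem appears once they overlap.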

The problem merging must solve is interference: when task vectors from different models touch the same parameters with redundant magnitudes or opposite signs, naive averaging cancels signal, and the merge ends up worse than each parent on that parent's own task. Two methods resolve it:

TIES-Merging works in three steps: Trim small-magnitude delta parameters, Elect Sign (pick the dominant sign per parameter across models), then Disjoint Merge (average only the parameters that agree with the elected sign).³
DARE ("Drop And REscale") randomly drops a fraction p of delta parameters and rescales the survivors by 1/(1−p); it can zero 90–99% of deltas with little loss and is applied before merging (e.g. dare_ties).⁴
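The three TIES steps can be sketched on flat arrays as follows. This is a simplified illustration of the procedure as described above, not mergekit's implementation: it ignores per-model weights and normalization, and the function name and toy values are hypothetical.

```python
import numpy as np

def ties_merge(theta_base, finetunes, density=0.5):
    """Flat-array sketch of Trim, Elect Sign, Disjoint Merge."""
    taus = np.stack([ft - theta_base for ft in finetunes])  # (n_models, n_params)

    # Trim: keep only the top `density` fraction of each task vector by magnitude.
    k = max(1, int(round(density * taus.shape[1])))
    trimmed = np.zeros_like(taus)
    for i, tau in enumerate(taus):
        keep = np.argsort(np.abs(tau))[-k:]
        trimmed[i, keep] = tau[keep]

    # Elect Sign: per parameter, the sign carrying the larger total magnitude wins.
    elected = np.sign(trimmed.sum(axis=0))

    # Disjoint Merge: average only the surviving entries that match the elected sign.
    agree = (np.sign(trimmed) == elected) & (trimmed != 0)
    counts = np.maximum(agree.sum(axis=0), 1)
    merged_tau = (trimmed * agree).sum(axis=0) / counts
    return theta_base + merged_tau

base = np.zeros(4)
ft_a = np.array([2.0, 0.1, -1.0, 0.0])
ft_b = np.array([1.0, -0.2, 3.0, 0.0])
merged = ties_merge(base, [ft_a, ft_b], density=0.5)
# Parameter 0: both agree, so the deltas are averaged (1.5).
# Parameter 2: signs conflict, the positive side wins, so only ft_b's 3.0 survives.
```

A plain average would give 1.0 at parameter 2, letting the conflicting -1.0 dilute the larger update; the sign election is what prevents that.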

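SLERP (spherical linear interpolation) blends two models along the arc between their weight vectors rather than the straight line between them, which avoids the norm shrinkage that linear interpolation causes when the two models point in different directions. The standard tool is mergekit, which expresses all of these methods as a single YAML config.⁵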
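As a rough illustration of the geometry, the sketch below applies the textbook SLERP formula to two flattened weight vectors and compares it with linear interpolation. The function name, the colinearity fallback, and the toy vectors are assumptions for the example rather than mergekit's exact code path.

```python
import numpy as np

def slerp(t, v0, v1, eps=1e-8):
    """Spherical interpolation between two flattened weight tensors."""
    u0 = v0 / (np.linalg.norm(v0) + eps)
    u1 = v1 / (np.linalg.norm(v1) + eps)
    omega = np.arccos(float(np.clip(np.dot(u0, u1), -1.0, 1.0)))  # angle between them
    if np.sin(omega) < eps:
        # Nearly colinear vectors: the arc degenerates, so fall back to lerp.
        return (1.0 - t) * v0 + t * v1
    s0 = np.sin((1.0 - t) * omega) / np.sin(omega)
    s1 = np.sin(t * omega) / np.sin(omega)
    return s0 * v0 + s1 * v1

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
mid_slerp = slerp(0.5, a, b)   # stays on the unit circle
mid_lerp = 0.5 * a + 0.5 * b   # cuts the chord, so its norm shrinks
```

For these orthogonal unit vectors the SLERP midpoint keeps norm 1, while the linear midpoint has norm of about 0.71.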

Why use it

Nearly free. A merge takes seconds to minutes on one GPU (or CPU) and needs no dataset, no training run, and no rollout engine. It is the cheapest capability move in post-training.
Composes skills. Merge a maths finetune, a code finetune, and a chat finetune of the same base into one model that does all three, without multi-task retraining.¹
No inference tax. Unlike an ensemble, the output is one model of the original size: you get the accuracy benefit of averaging without the serving cost.²
Robustness and recovery. Souping sibling runs (different seeds/hyperparameters) averages out their idiosyncrasies; merging a domain finetune back toward its base can recover lost general ability.

When to use it (and when not)

Use merging when you have multiple finetunes of the same base and want to combine or average them cheaply, or to build a strong starting point before a short finetune.
Prefer finetuning or distillation when you need a capability that is in no parent; merging can only recombine what the task vectors already contain and cannot create a new skill.
Requires homologous models: same architecture and tokenizer, same base lineage. You cannot meaningfully merge across families (Llama into Qwen).
Never ship a merge unevaluated. Merges can silently regress on capabilities you did not test; always gate on a held-out eval (evaluation integrity).
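One way to make the eval gate mechanical is a small check that compares the merge with its parents benchmark by benchmark. The sketch below assumes one possible policy (the merge may trail the best parent by at most a fixed tolerance); the function name, benchmark names, scores, and tolerance are all hypothetical.

```python
def regressions(merged_scores, parent_scores, tolerance=0.01):
    """Return the benchmarks where the merge trails the best parent by more than tolerance."""
    failed = {}
    for bench, merged in merged_scores.items():
        best_parent = max(parent[bench] for parent in parent_scores)
        if merged < best_parent - tolerance:
            failed[bench] = (merged, best_parent)
    return failed

# Hypothetical held-out scores for two parents and their merge.
parents = [
    {"maths": 0.71, "code": 0.40, "chat": 0.80},
    {"maths": 0.45, "code": 0.62, "chat": 0.78},
]
merge = {"maths": 0.72, "code": 0.55, "chat": 0.80}

failed = regressions(merge, parents)
promote = not failed  # promote only when nothing regressed
```

Here the merge holds its maths and chat scores but loses ground on code, so the gate blocks promotion.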

Architecture

flowchart LR
  B["Shared base θ_base"] --> V1["Task vector τ1 = θ_ft1 − θ_base"]
  B --> V2["Task vector τ2 = θ_ft2 − θ_base"]
  B --> V3["Task vector τ3 = θ_ft3 − θ_base"]
  V1 --> RES{"Resolve interference"}
  V2 --> RES
  V3 --> RES
  RES -->|"TIES: trim + elect sign + disjoint merge"| CMB["Weighted combine (density, weight)"]
  RES -->|"DARE: drop + rescale 1/(1−p)"| CMB
  CMB --> M["Merged model θ_base + Σ wᵢ·τ̃ᵢ"]
  M --> EVAL{"Held-out eval gate"}
  EVAL -->|"regressed?"| RES

How to use it

mergekit takes a YAML config and writes a merged checkpoint. This TIES example (from the mergekit repo) merges three Llama-2-13B finetunes over a shared base; density sets how much of each task vector to keep, and weight sets each model's contribution:

# ties.yml — pin mergekit and verify keys on the installed version
models:
  - model: psmathur/orca_mini_v3_13b
    parameters:
      density: [1, 0.7, 0.1]   # density gradient across layers
      weight: 1.0
  - model: garage-bAInd/Platypus2-13B
    parameters:
      density: 0.5
      weight: [0, 0.3, 0.7, 1] # weight gradient across layers
  - model: WizardLM/WizardMath-13B-V1.0
    parameters:
      density: 0.33
      weight:
        - filter: mlp          # target MLP tensors specifically
          value: 0.5
        - value: 0
merge_method: ties
base_model: TheBloke/Llama-2-13B-fp16
parameters:
  normalize: true
  int8_mask: true
dtype: float16

# CPU-only works; --cuda uses the GPU, --lazy-unpickle cuts RAM for large models.
mergekit-yaml ties.yml ./merged-model --cuda

The output ./merged-model is a standard HF checkpoint: serve it on vLLM like any model, or use it as the base for a short finetune.
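When merges are scripted (for example inside a sweep), the CLI call above can be wrapped in a small helper. The sketch below uses only the flags already named in this section; the helper names are hypothetical, and the merge itself is not run here.

```python
import subprocess

def build_merge_command(config_path, out_dir, cuda=False, lazy_unpickle=False):
    """Assemble the mergekit-yaml argv from the flags named in this section."""
    cmd = ["mergekit-yaml", config_path, out_dir]
    if cuda:
        cmd.append("--cuda")
    if lazy_unpickle:
        cmd.append("--lazy-unpickle")
    return cmd

def run_merge(config_path, out_dir, **flags):
    """Run the merge; raises CalledProcessError if mergekit exits non-zero."""
    subprocess.run(build_merge_command(config_path, out_dir, **flags), check=True)

cmd = build_merge_command("ties.yml", "./merged-model", cuda=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with paths.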

How to develop with it

Pick the method by how many models and how much interference:

| `merge_method` | Use it for | Notes |
| --- | --- | --- |
| `linear` (model soup) | averaging sibling runs of one base | no interference handling; equal/weighted mean² |
| `slerp` | blending two models | interpolates on the hypersphere; preserves norm |
| `task_arithmetic` | add/subtract task vectors | the base operation; foundation for the rest¹ |
| `ties` | merging many task finetunes | trims + sign-elects to cut interference³ |
| `dare_ties` / `dare_linear` | many finetunes, heavy overlap | DARE-sparsifies deltas first, then merges⁴ |
| `model_stock`, `della`, `breadcrumbs`, `sce` | newer variants | geometric/adaptive pruning; verify support on the installed mergekit |

The two knobs that matter are density (the fraction of each task vector kept; lower values drop more, which reduces interference but risks losing signal) and weight (each model's contribution). normalize: true rescales the weights to sum to 1; int8_mask: true cuts merge memory. Both density and weight accept per-layer gradients or tensor filters (as above) for fine control. Sweep them and evaluate each merge; the size of that search space is why evolutionary merging automates recipe discovery over parameter space and data-flow space.⁶
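To build intuition for a gradient such as [0, 0.3, 0.7, 1], the sketch below spreads a short list of anchor values across a layer stack and rescales a set of weights to sum to 1. It assumes linear interpolation over layer depth, which is an assumption for illustration; check the installed mergekit for the exact mapping. Both function names are hypothetical.

```python
import numpy as np

def expand_gradient(anchors, n_layers):
    """Linearly interpolate a short list of anchor values across n_layers."""
    anchors = np.asarray(anchors, dtype=float)
    if anchors.size == 1:
        return np.full(n_layers, anchors[0])
    anchor_pos = np.linspace(0.0, 1.0, anchors.size)  # where each anchor sits in depth
    layer_pos = np.linspace(0.0, 1.0, n_layers)       # relative depth of each layer
    return np.interp(layer_pos, anchor_pos, anchors)

def normalize_weights(weights):
    """Rescale per-model weights so they sum to 1."""
    weights = np.asarray(weights, dtype=float)
    return weights / weights.sum()

per_layer = expand_gradient([0, 0.3, 0.7, 1], n_layers=7)
shares = normalize_weights([1.0, 0.5, 0.5])
```

Under this assumption the model contributes nothing in the first layer, half weight in the middle layer, and full weight in the last.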

How to scale it

Merging itself is cheap and does not scale like training; the expensive part is evaluating the merges. Practical scaling is a merge → eval loop: generate candidate configs, merge each one (CPU or one GPU, --lazy-unpickle for low RAM), score it against a held-out suite, and keep the winner. Evolutionary methods run this loop automatically and have discovered merges that beat their parents on held-out tasks.⁶ Because a merge produces a standard checkpoint, the eval and promotion path is identical to that of any post-trained model (SRE/MLOps practices).
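The merge → eval loop can be sketched as a plain grid sweep. Everything here is illustrative: `merge_fn` and `eval_fn` are hypothetical callables standing in for a real merge run and a real held-out evaluation, and the stub score below exists only to make the example runnable.

```python
import itertools

def sweep(densities, weights, merge_fn, eval_fn):
    """Try every (density, weight) pair and keep the best-scoring config."""
    best = None
    for density, weight in itertools.product(densities, weights):
        config = {"density": density, "weight": weight}
        score = eval_fn(merge_fn(config))
        if best is None or score > best[0]:
            best = (score, config)
    return best

# Stubs: the "merge" just echoes its config, and the toy score peaks at
# density 0.5 and weight 1.0. Replace both with real merge and eval calls.
def fake_merge(config):
    return config

def fake_eval(model):
    return -(abs(model["density"] - 0.5) + abs(model["weight"] - 1.0))

best_score, best_config = sweep([0.3, 0.5, 0.7], [0.5, 1.0], fake_merge, fake_eval)
```

A grid is the simplest search strategy; the evolutionary approach cited above replaces the exhaustive product with a guided search over the same kind of config space.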

Cookbook (common use cases)

1. SLERP blend of two models

# slerp.yml — exactly two models; t is the interpolation factor (0 = base_model, 1 = the other)
slices:
  - sources:
      - model: your-base/finetune-a
        layer_range: [0, 32]
      - model: your-base/finetune-b
        layer_range: [0, 32]
merge_method: slerp
base_model: your-base/finetune-a
parameters:
  t: [0, 0.5, 1, 0.5, 0]   # per-layer interpolation schedule
dtype: bfloat16

2. Task-arithmetic negation (unlearn a behaviour)

# Subtract a task vector to reduce an unwanted capability while keeping the rest.
models:
  - model: your-base/unwanted-behaviour-finetune
    parameters:
      weight: -1.0            # negate the task vector
merge_method: task_arithmetic
base_model: your-base/base-model
dtype: bfloat16

3. DARE-TIES merge of many finetunes: set merge_method: dare_ties and give each model a low density (e.g. 0.5) so that DARE sparsifies the deltas before the TIES sign election. Of the methods here, this is the one that tolerates the most interference.
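The DARE step that runs before the sign election can be sketched as a random mask plus a rescale. This is a minimal illustration on a flat array, not mergekit's code; the function name and toy values are hypothetical, and the drop rate p corresponds to 1 − density in the config.

```python
import numpy as np

def dare(tau, p, rng):
    """Drop each delta with probability p and rescale survivors by 1/(1-p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    keep = rng.random(tau.shape) >= p
    return np.where(keep, tau / (1.0 - p), 0.0)

rng = np.random.default_rng(0)
tau = np.full(100_000, 0.01)        # toy task vector with identical deltas
sparse = dare(tau, p=0.9, rng=rng)  # roughly 10% of entries survive, each scaled by 10

kept_fraction = float((sparse != 0).mean())
mean_after = float(sparse.mean())   # stays close to the original mean of 0.01
```

The rescale is what keeps the expected value of each delta unchanged, which is why such aggressive sparsification can leave behaviour largely intact.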

Failure modes

Interference from naive averaging. linear-merging conflicting task vectors cancels signal and degrades all parents; use ties/dare_ties when merging many finetunes.³
Base/tokenizer mismatch. Merging models with different bases, architectures, or tokenizers produces garbage; keep the family homologous and set tokenizer_source deliberately.
Density too low. Trimming/dropping too aggressively discards the very deltas that carry the skill; sweep density.
No eval gate. A merge that looks fine can regress on untested capabilities; never promote on "it merged": gate on a held-out eval.
Over-merging. Stacking many models dilutes each; more parents is not monotonically better.
Merging LoRA vs full weights carelessly. Merging adapters into a base is different from merging task vectors of full finetunes; keep the two straight (SFT/LoRA).
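A cheap guard against the base/tokenizer mismatch failure is to compare tensor names and shapes before merging. The sketch below works on plain name-to-shape dictionaries; the function name, tensor names, and shapes are hypothetical. Matching shapes are necessary but not sufficient: two models from different lineages can share shapes and still merge into garbage.

```python
def incompatibilities(base_shapes, other_shapes):
    """List tensors that are missing or shaped differently relative to the base."""
    problems = []
    for name in sorted(set(base_shapes) | set(other_shapes)):
        if name not in other_shapes:
            problems.append(f"{name}: missing from candidate")
        elif name not in base_shapes:
            problems.append(f"{name}: not present in base")
        elif base_shapes[name] != other_shapes[name]:
            problems.append(
                f"{name}: shape {other_shapes[name]} != base {base_shapes[name]}"
            )
    return problems

# Hypothetical case: the candidate added tokens, so its embedding grew.
base_shapes = {"embed.weight": (32000, 4096), "lm_head.weight": (32000, 4096)}
cand_shapes = {"embed.weight": (32016, 4096), "lm_head.weight": (32000, 4096)}
problems = incompatibilities(base_shapes, cand_shapes)
```

An enlarged embedding like this one is a common sign that the tokenizer changed, which is the case where tokenizer_source needs a deliberate choice.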

References

Model soups (weight averaging): https://arxiv.org/abs/2203.05482
Editing Models with Task Arithmetic (task vectors): https://arxiv.org/abs/2212.04089
TIES-Merging (trim, elect sign, disjoint merge): https://arxiv.org/abs/2306.01708
DARE — Language Models are Super Mario (drop and rescale): https://arxiv.org/abs/2311.03099
Arcee's MergeKit (the toolkit): https://arxiv.org/abs/2403.13257 · repo: https://github.com/arcee-ai/mergekit
Evolutionary Optimization of Model Merging Recipes (Sakana AI): https://arxiv.org/abs/2403.13187

¹ Ilharco et al., Editing Models with Task Arithmetic: a task vector θ_ft − θ_base is a direction in weight space; adding vectors composes tasks, negating one forgets a task. https://arxiv.org/abs/2212.04089
² Wortsman et al., Model soups: averaging the weights of multiple fine-tunes of the same base improves accuracy and robustness with no extra inference cost. https://arxiv.org/abs/2203.05482
³ Yadav et al., TIES-Merging: Trim low-magnitude deltas, Elect a dominant sign per parameter, and Disjoint-Merge only agreeing parameters to resolve interference. https://arxiv.org/abs/2306.01708
⁴ Yu et al., DARE: randomly drop delta parameters with ratio p and rescale survivors by 1/(1−p); removes up to 90–99% of deltas and is applied before merging (e.g. dare_ties). https://arxiv.org/abs/2311.03099
⁵ Goddard et al., Arcee's MergeKit: an open-source toolkit that merges checkpoints into multitask models without additional training, driven by a single YAML (merge_method, models/slices, base_model, parameters, dtype). https://arxiv.org/abs/2403.13257
⁶ Akiba et al. (Sakana AI), Evolutionary Optimization of Model Merging Recipes: evolutionary search over parameter-space and data-flow-space merge recipes discovers merges that beat their parents. https://arxiv.org/abs/2403.13187