June 24, 2026 12 nodes #showcase #tech #ai #research
Beyond LoRA
A map of parameter-efficient fine-tuning after LoRA — how DoRA, PiSSA, VeRA and friends trade accuracy against memory on a single Pareto frontier.
The brief, in full
LoRA became the default for adapting large models: cheap, mergeable, zero inference overhead. But the 'Beyond LoRA' study shows the methods sit on an accuracy-vs-memory Pareto frontier, so the right choice depends on the task and there is no universal best.
The LoRA baseline
Low-rank B·A added to frozen weights
LoRA freezes the base weights W and learns a low-rank update B·A, so the adapted weight is W + B·A. Its weaknesses are structural: A is initialized with random noise and B with zeros, A and B share a single learning rate, and the update geometry is fixed in a way that diverges from full fine-tuning.
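A minimal NumPy sketch of the update described above, assuming a weight of shape (d_out, d_in) and the common alpha / r scaling. The sizes, the 0.01 noise scale and the alpha value are illustrative assumptions, not settings from the study.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 48, 4      # illustrative sizes (assumed)
alpha = 8                       # scaling hyperparameter (assumed)

W = rng.standard_normal((d_out, d_in))      # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01   # noise init
B = np.zeros((d_out, r))                    # zero init, so B @ A starts at 0


def lora_forward(x, W, A, B, alpha, r):
    """Base projection plus the scaled low-rank path x -> A -> B."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T


x = rng.standard_normal((5, d_in))
y_init = lora_forward(x, W, A, B, alpha, r)   # equals the base output at init

# Stand-in for a trained B, used only to show merging.
B_trained = rng.standard_normal((d_out, r)) * 0.1
W_merged = W + (alpha / r) * B_trained @ A    # fold the adapter into W

lora_params = r * (d_in + d_out)   # trainable parameters in A and B
full_params = d_out * d_in         # parameters in the frozen weight
```

Merging is what gives zero inference overhead: after training, `x @ W_merged.T` reproduces the adapted output with a single matrix.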
Better initialization
Start from what matters
Instead of random A and zero B, initialize adapters from the most informative directions of the existing weight — so training begins near the answer rather than from noise.
PiSSA
SVD init from principal components
PiSSA runs SVD on the original weight W, initializes the adapter factors from the principal singular values and vectors, and freezes the residual. Reported gains: Mistral-7B on GSM8K, 72.86% vs LoRA's 67.7%; a 4-bit QPiSSA beat QLoRA on LLaMA-3-70B (86.05% vs 81.73%).
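The initialization can be sketched in NumPy as follows. The factors are written in B·A order to match the LoRA description, and the even split of each singular value across the two factors (a square root on each side) is an assumption of this sketch; `pissa_init` is a hypothetical helper name.

```python
import numpy as np


def pissa_init(W, r):
    """Split W into a frozen residual and a rank-r adapter from the top-r SVD."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    sqrt_s = np.sqrt(S[:r])
    B = U[:, :r] * sqrt_s              # (d_out, r): columns scaled by sqrt(s_i)
    A = sqrt_s[:, None] * Vt[:r, :]    # (r, d_in): rows scaled by sqrt(s_i)
    W_res = W - B @ A                  # residual, kept frozen during training
    return W_res, A, B


rng = np.random.default_rng(0)
W = rng.standard_normal((32, 24))
W_res, A, B = pissa_init(W, r=4)

# The model output is unchanged at init: residual plus adapter rebuilds W.
reconstructed = W_res + B @ A
```

Unlike LoRA's zero-init B, the adapter here starts out carrying the largest singular directions of W, which is the sense in which training begins near the answer.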
Weight decomposition
Separate magnitude from direction
Split each weight into a learnable magnitude and a direction, then apply the low-rank update to the direction only. This reshapes LoRA's update to look more like full fine-tuning, at near-identical parameter cost and still mergeable.
DoRA
Direction-only low-rank update
DoRA (NVIDIA, ICML 2024 Oral) applies LoRA to the directional component only, adding just a magnitude vector. Commonsense reasoning on LLaMA-7B: 78.4% vs LoRA's 74.7% (+3.7), with ~0.01% more trainable params and zero added inference cost.
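A rough NumPy sketch of the decomposition, assuming a (d_out, d_in) weight with one magnitude per output row. Which axis the norm runs over depends on the weight layout, so treat that choice, the sizes and the helper name `dora_weight` as assumptions of this sketch.

```python
import numpy as np


def dora_weight(W0, A, B, m):
    """Magnitude m times the unit-norm direction of (W0 + B @ A)."""
    V = W0 + B @ A                                    # low-rank update hits the direction
    norm = np.linalg.norm(V, axis=1, keepdims=True)   # one norm per output row
    return m * V / norm


rng = np.random.default_rng(0)
d_out, d_in, r = 16, 12, 2
W0 = rng.standard_normal((d_out, d_in))               # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))                              # zero init, as in LoRA
m = np.linalg.norm(W0, axis=1, keepdims=True)         # magnitude starts at the norm of W0

W_init = dora_weight(W0, A, B, m)                     # equals W0 before training

# After B moves, row norms of the merged weight are still set by m alone.
B_moved = rng.standard_normal((d_out, r)) * 0.1
W_adapted = dora_weight(W0, A, B_moved, m)

extra_params = m.size   # the only parameters added on top of LoRA's A and B
```

Because `dora_weight` returns an ordinary matrix, it can be computed once after training and stored in place of W0, which keeps the method mergeable.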
Parameter sharing
When storage is the bottleneck
If you must serve many per-user or per-task adapters, the binding constraint is checkpoint size, not raw accuracy — so push trainable parameters as low as possible.
VeRA
Shared frozen randoms, tiny scaling vectors
VeRA freezes one pair of random low-rank matrices shared across all layers and trains only small per-layer scaling vectors. The random matrices can be regenerated from a seed, so checkpoints are tiny: roughly 10x fewer trainable params than LoRA at matched performance.
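To make the storage argument concrete, here is a small NumPy sketch under simplifying assumptions: square layers of equal size, standard-normal draws for the frozen matrices, and a 0.1 starting value for one scaling vector. The names `make_shared` and `vera_delta` are hypothetical.

```python
import numpy as np


def make_shared(seed, d_out, d_in, r):
    """Frozen random pair, fully determined by the seed."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((r, d_in))
    B = rng.standard_normal((d_out, r))
    return A, B


def vera_delta(A, B, d, b):
    """Weight update diag(b) @ B @ diag(d) @ A, with only d and b trainable."""
    return (b[:, None] * B) @ (d[:, None] * A)


d_out = d_in = 64
r, n_layers, seed = 8, 4, 1234
A, B = make_shared(seed, d_out, d_in, r)

# Per-layer trainable state: two small vectors. b starts at zero so the update is zero.
layers = [{"d": np.full(r, 0.1), "b": np.zeros(d_out)} for _ in range(n_layers)]
delta_init = vera_delta(A, B, layers[0]["d"], layers[0]["b"])

# What a checkpoint has to store: the seed plus the vectors.
vera_params = n_layers * (r + d_out)
# LoRA at the same rank stores full A and B for every layer.
lora_params = n_layers * r * (d_in + d_out)
```

With these toy sizes the count is 288 trainable values against 4096 for LoRA at equal rank; the ratio at matched performance depends on the ranks each method needs.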
Free tweaks on top
Same compute, more from it
Some improvements cost almost nothing — they change a hyperparameter rather than the architecture, and stack onto existing methods.
LoRA+
Different learning rates for A and B
A single shared LR for A and B is suboptimal for wide models. LoRA+ gives B a higher LR at a fixed ratio; reported gains are 1-2% in accuracy and up to ~2x faster fine-tuning at the same compute.
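In practice this amounts to two optimizer parameter groups. The sketch below is plain Python and builds a list of dicts in the shape PyTorch optimizers accept as per-parameter options; the `lora_A` / `lora_B` name suffixes, the helper name and the ratio of 16 are assumptions for illustration, not values from the text above.

```python
def loraplus_param_groups(named_params, base_lr, lr_ratio):
    """Put B matrices in their own group with learning rate base_lr * lr_ratio."""
    group_a = {"params": [], "lr": base_lr}
    group_b = {"params": [], "lr": base_lr * lr_ratio}
    for name, param in named_params:
        # Assumed naming convention: B matrices end in "lora_B".
        if name.endswith("lora_B"):
            group_b["params"].append(param)
        else:
            group_a["params"].append(param)
    return [group_a, group_b]


# Placeholder objects stand in for real tensors.
named_params = [
    ("layer0.lora_A", "A0"),
    ("layer0.lora_B", "B0"),
    ("layer1.lora_A", "A1"),
    ("layer1.lora_B", "B1"),
]
groups = loraplus_param_groups(named_params, base_lr=1e-4, lr_ratio=16)
```

Nothing about the adapter itself changes, which is why the tweak stacks onto existing LoRA-style methods.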
Memory wall
Quantize to fit big models on one GPU
When the constraint is VRAM, quantization comes first. QLoRA fine-tunes a 65B model on a single 48GB GPU via 4-bit NF4, double quantization and paged optimizers. Its Guanaco model reached 99.3% of ChatGPT's performance on the Vicuna benchmark after 24 hours of fine-tuning on a single GPU.
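A back-of-the-envelope check of why 4-bit storage matters at 65B, counting weights only. The block sizes (64 for weights, 256 for the second-level constants) and the 32-bit and 8-bit constant widths are assumptions taken from my reading of the QLoRA setup, not from the text above; activations, adapter weights and optimizer state are left out.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Storage for n_params weights at the given bit width, in GB (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N = 65e9                             # parameter count of the 65B model

fp16_gb = weight_memory_gb(N, 16)    # 16-bit baseline
nf4_gb = weight_memory_gb(N, 4)      # 4-bit weights, ignoring constants

# Overhead of quantization constants, in bits per parameter.
const_bits_plain = 32 / 64                   # one 32-bit constant per 64 weights
const_bits_dq = 8 / 64 + 32 / (64 * 256)     # constants themselves quantized to 8 bits

total_plain_gb = weight_memory_gb(N, 4 + const_bits_plain)
total_dq_gb = weight_memory_gb(N, 4 + const_bits_dq)
saved_gb = total_plain_gb - total_dq_gb      # what double quantization buys back
```

Under these assumptions the 16-bit weights alone need 130 GB, while the 4-bit weights with double-quantized constants come to roughly 33.5 GB, leaving headroom on a 48GB card for everything else.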
The Pareto choice
Task, budget, serving shape
Full fine-tuning for precision-critical domains; DoRA/PiSSA when you want closer-to-full quality at LoRA's budget; VeRA when serving many adapters; QLoRA when memory-bound; LoRA+ as a near-free add-on. The Beyond-LoRA benchmarks (e.g. OFT beating LoRA on an image task at lower memory) make the point: pick on the frontier.
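"Pick on the frontier" has a precise meaning: discard any method that another method beats on both axes. A minimal sketch, where the method names and numbers are hypothetical placeholders rather than results from the benchmarks:

```python
def pareto_frontier(points):
    """Return names of points not dominated by any other point.

    Each point is (name, accuracy, memory_gb); higher accuracy and
    lower memory are better.
    """
    frontier = []
    for name, acc, mem in points:
        dominated = any(
            (a >= acc and m <= mem) and (a > acc or m < mem)
            for _, a, m in points
        )
        if not dominated:
            frontier.append(name)
    return frontier


# Placeholder values for illustration only.
candidates = [
    ("method_a", 0.80, 40.0),   # most accurate, most memory
    ("method_b", 0.78, 24.0),
    ("method_c", 0.75, 30.0),   # worse than method_b on both axes
    ("method_d", 0.70, 10.0),   # least memory
]
frontier = pareto_frontier(candidates)
```

Here `method_c` drops out and the other three remain; choosing among those is where task, budget and serving shape come in.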