DeepThought

June 24, 2026 · 12 nodes · #showcase #tech #ai #research

Beyond LoRA

A map of parameter-efficient fine-tuning after LoRA — how DoRA, PiSSA, VeRA and friends trade accuracy against memory on a single Pareto frontier.

The brief, in full

LoRA became the default for adapting large models because it is cheap, mergeable, and adds zero inference overhead. But the 'Beyond LoRA' study shows that methods sit on an accuracy-vs-memory Pareto frontier: there is no universal best, and the right choice depends on the task.

The LoRA baseline

Low-rank B·A added to frozen weights

LoRA freezes the base weights W and learns a low-rank update B·A on top of them. Its weaknesses are structural: A starts as random noise while B starts at zero, A and B share a single learning rate, and the low-rank update changes magnitude and direction in a pattern that differs from full fine-tuning.
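To make the baseline concrete, here is a minimal NumPy sketch of a LoRA layer and its merge step. The layer shape, the rank r, and the alpha/r scaling factor are illustrative assumptions that the text above does not specify.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 48, 4      # hypothetical layer shape and rank
alpha = 8.0                     # hypothetical scaling hyperparameter (assumption)

W = rng.standard_normal((d_out, d_in))       # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01    # noise-initialised, trainable
B = np.zeros((d_out, r))                     # zero-initialised, trainable

def lora_forward(x, W, A, B, alpha, r):
    """Apply the effective weight W + (alpha / r) * B @ A without forming it."""
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

def merge(W, A, B, alpha, r):
    """Fold the adapter into the base weight, so inference needs one matmul."""
    return W + (alpha / r) * (B @ A)

x = rng.standard_normal((5, d_in))

# Because B starts at zero, the adapted layer initially equals the base layer.
starts_at_base = np.allclose(lora_forward(x, W, A, B, alpha, r), x @ W.T)

# Stand-in for training: move B away from zero, then compare with the merged weight.
B_trained = rng.standard_normal((d_out, r)) * 0.1
merged_matches = np.allclose(
    lora_forward(x, W, A, B_trained, alpha, r),
    x @ merge(W, A, B_trained, alpha, r).T,
)

trainable = A.size + B.size     # r * (d_in + d_out)
full = W.size                   # d_in * d_out
```

In this toy layer the adapter holds 448 trainable values against 3072 in the full weight, and the merged weight reproduces the adapter's output exactly, which is where the zero inference overhead comes from.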

Better initialization

Start from what matters

Instead of random A and zero B, initialize adapters from the most informative directions of the existing weight — so training begins near the answer rather than from noise.

PiSSA

SVD init from principal components

PiSSA runs SVD on the original weight W, initializes the adapter matrices from the top-r singular values and vectors, and freezes the residual. Reported gains on GSM8K: Mistral-7B reaches 72.86% vs LoRA's 67.7%, and 4-bit QPiSSA beats QLoRA on LLaMA-3-70B (86.05% vs 81.73%).
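The initialization can be sketched in a few lines of NumPy. This is an illustration under assumptions: the shapes and rank are made up, the names B and A follow the LoRA baseline's B·A ordering, and splitting the singular values evenly between the two factors via a square root is one natural choice rather than something the text above states.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 32, 24, 4                   # hypothetical shape and rank
W = rng.standard_normal((d_out, d_in))       # stand-in for a pretrained weight

# Thin SVD: W = U @ diag(S) @ Vt, singular values in descending order.
U, S, Vt = np.linalg.svd(W, full_matrices=False)

# Put the top-r components into the trainable adapter.
sqrt_S = np.sqrt(S[:r])
B = U[:, :r] * sqrt_S                        # (d_out, r), trainable
A = sqrt_S[:, None] * Vt[:r]                 # (r, d_in), trainable

# Everything else stays in a frozen residual.
W_res = W - B @ A

# At initialisation the model is unchanged: residual + adapter == original weight.
reconstructs = np.allclose(W_res + B @ A, W)

# The residual's largest singular value is the (r+1)-th singular value of W.
res_top = np.linalg.svd(W_res, compute_uv=False)[0]
```

Unlike the noise-and-zero start of plain LoRA, the trainable matrices here already carry the directions with the largest singular values, while the frozen residual holds only the smaller ones.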

Weight decomposition

Separate magnitude from direction

Split each weight into a magnitude and a direction, train the magnitude directly, and apply the low-rank update only to the direction. This makes LoRA's updates behave more like full fine-tuning, at near-identical parameter cost, and the result is still mergeable.

DoRA

Direction-only low-rank update

DoRA (NVIDIA, ICML 2024 Oral) applies LoRA to the directional component only, adding just a magnitude vector. Commonsense reasoning on LLaMA-7B: 78.4% vs LoRA's 74.7% (+3.7), with ~0.01% more trainable params and zero added inference cost.
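A rough sketch of the decomposition, under stated assumptions: the norm is taken per column of the weight matrix (implementations may pick a different axis), and the shapes, rank, and the stand-in "trained" values are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 32, 24, 4                   # hypothetical shape and rank
W0 = rng.standard_normal((d_out, d_in))      # frozen pretrained weight

def col_norm(M):
    """Norm of each column, kept as a (1, d_in) row for broadcasting."""
    return np.linalg.norm(M, axis=0, keepdims=True)

m = col_norm(W0)                             # trainable magnitude vector
A = rng.standard_normal((r, d_in)) * 0.01    # LoRA factors, applied to the direction
B = np.zeros((d_out, r))

def dora_weight(W0, m, A, B):
    """Magnitude times the unit-norm direction of (W0 + B @ A)."""
    V = W0 + B @ A
    return m * V / col_norm(V)

# With B = 0 and m = ||W0||, the layer starts exactly at the pretrained weight.
starts_at_base = np.allclose(dora_weight(W0, m, A, B), W0)

# Stand-in for training: change both the direction (via B) and the magnitude.
B_trained = rng.standard_normal((d_out, r)) * 0.1
m_trained = m * 1.5
W_new = dora_weight(W0, m_trained, A, B_trained)

# Column norms of the result are set by m alone, whatever B @ A did.
magnitude_is_m = np.allclose(col_norm(W_new), m_trained)

extra_params = m.size                        # the only addition over plain LoRA
```

Because the final weight is an ordinary matrix of the same shape as W0, it can be computed once after training and stored in place of the original, which is why the method stays mergeable.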

Parameter sharing

When storage is the bottleneck

If you must serve many per-user or per-task adapters, the binding constraint is checkpoint size, not raw accuracy — so push trainable parameters as low as possible.

VeRA

Shared frozen randoms, tiny scaling vectors

VeRA freezes one shared pair of random low-rank matrices across all layers and trains only small per-layer scaling vectors. The random matrices can be regenerated from a seed, so a checkpoint only needs to store the scaling vectors: roughly 10x fewer trainable parameters than LoRA at matched performance.
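The storage argument is easier to see in code. The sketch below assumes standard-normal random matrices, a starting value of 0.1 for one scaling vector and zero for the other, and toy layer sizes; none of these specifics come from the text above.

```python
import numpy as np

d_out, d_in, r = 32, 24, 8      # hypothetical layer shape and rank
n_layers, seed = 6, 1234        # hypothetical depth and seed

def shared_randoms(seed):
    """Regenerate the frozen random pair from the seed alone."""
    g = np.random.default_rng(seed)
    A = g.standard_normal((r, d_in))
    B = g.standard_normal((d_out, r))
    return A, B

A, B = shared_randoms(seed)     # one pair, reused by every layer

# Per-layer trainable state: a rank-sized vector d and an output-sized vector b.
layers = [{"d": np.full(r, 0.1), "b": np.zeros(d_out)} for _ in range(n_layers)]

def vera_delta(A, B, d, b):
    """Weight update diag(b) @ B @ diag(d) @ A, written with broadcasting."""
    return (b[:, None] * B) @ (d[:, None] * A)

# What a checkpoint has to store: the scaling vectors (plus the seed).
checkpoint_floats = sum(l["d"].size + l["b"].size for l in layers)

# What LoRA at the same rank would store for the same layers.
lora_floats = n_layers * r * (d_in + d_out)

# The frozen matrices come back bit-for-bit from the seed.
A2, B2 = shared_randoms(seed)
regenerates = np.array_equal(A, A2) and np.array_equal(B, B2)

# With b = 0 the update starts at zero, as LoRA's does.
starts_at_zero = not np.any(vera_delta(A, B, layers[0]["d"], layers[0]["b"]))
```

The counts here are toy numbers for a same-rank comparison, not a reproduction of the reported ratio; the point is only that per-layer storage grows with r + d_out instead of r × (d_in + d_out).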

Free tweaks on top

Same compute, more from it

Some improvements cost almost nothing — they change a hyperparameter rather than the architecture, and stack onto existing methods.

LoRA+

Different learning rates for A and B

A single shared learning rate for A and B is suboptimal for wide models. LoRA+ gives B a higher learning rate than A, at a fixed ratio. The reported result is a 1-2% accuracy improvement and up to ~2x faster fine-tuning at the same compute cost.
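In practice the tweak amounts to building two optimizer parameter groups. The helper below is a hypothetical sketch: the name `loraplus_param_groups`, the `lora_A`/`lora_B` naming convention, and the ratio of 16 in the usage lines are placeholders, not values from the text above. The list-of-dicts return value mirrors the parameter-group format that PyTorch optimizers accept.

```python
def loraplus_param_groups(named_params, base_lr, lr_ratio):
    """Split adapter parameters so B matrices train at base_lr * lr_ratio.

    named_params: iterable of (name, param) pairs; names ending in "lora_B"
    are treated as B matrices (a naming convention assumed for this sketch).
    """
    group_a, group_b = [], []
    for name, param in named_params:
        if name.endswith("lora_B"):
            group_b.append(param)
        else:
            group_a.append(param)
    return [
        {"params": group_a, "lr": base_lr},
        {"params": group_b, "lr": base_lr * lr_ratio},
    ]

# Usage with placeholder strings standing in for real tensors.
params = [
    ("layer0.lora_A", "A0"), ("layer0.lora_B", "B0"),
    ("layer1.lora_A", "A1"), ("layer1.lora_B", "B1"),
]
groups = loraplus_param_groups(params, base_lr=1e-4, lr_ratio=16.0)
```

Nothing about the adapter architecture changes, which is why this stacks onto other LoRA variants that keep separate A and B matrices.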

Memory wall

Quantize to fit big models on one GPU

When the constraint is VRAM, quantization comes first. QLoRA fine-tunes a 65B model on a single 48GB GPU by combining 4-bit NF4 storage, double quantization, and paged optimizers. Its Guanaco models reached 99.3% of ChatGPT's performance level on the Vicuna benchmark after 24 hours of fine-tuning on a single GPU.
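A back-of-the-envelope calculation shows why 4-bit storage is what makes a 65B model fit. The block sizes (64 weights per quantization constant, 256 constants per second-level block) and the 8-bit second-level constants are assumptions not stated in the text above, and the figures cover base weights only, not activations, adapters, or optimizer state.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 65e9                 # model size from the text

# 16-bit baseline.
fp16_gb = weight_memory_gb(n_params, 16)

# 4-bit weights need one quantization constant per block of weights.
block = 64                      # assumed block size
const_bits_plain = 32 / block   # one 32-bit constant per block

# Double quantization: quantize the constants themselves to 8 bits,
# with one 32-bit second-level constant per 256 first-level constants.
const_bits_double = 8 / block + 32 / (block * 256)

nf4_plain_gb = weight_memory_gb(n_params, 4 + const_bits_plain)
nf4_double_gb = weight_memory_gb(n_params, 4 + const_bits_double)

saved_bits = const_bits_plain - const_bits_double   # per-parameter saving
```

Under these assumptions the weights take about 130 GB at 16 bits, about 36.6 GB at 4 bits with plain constants, and about 33.5 GB with double quantization, which leaves headroom on a 48GB card for everything else.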

The Pareto choice

Task, budget, serving shape

Full fine-tuning for precision-critical domains; DoRA/PiSSA when you want closer-to-full quality at LoRA's budget; VeRA when serving many adapters; QLoRA when memory-bound; LoRA+ as a near-free add-on. The Beyond-LoRA benchmarks (e.g. OFT beating LoRA on an image task at lower memory) make the point: pick on the frontier.
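For readers who want to apply "pick on the frontier" to their own measurements, a Pareto filter is only a few lines. The function name and the toy numbers below are hypothetical placeholders, not benchmark results from the study.

```python
def pareto_front(points):
    """Return the names of points that no other point dominates.

    Each point is (name, accuracy, memory). A point is dominated when another
    point is at least as accurate and uses at most as much memory, and is
    strictly better on at least one of the two.
    """
    front = []
    for name, acc, mem in points:
        dominated = any(
            (a >= acc and m <= mem) and (a > acc or m < mem)
            for _, a, m in points
        )
        if not dominated:
            front.append(name)
    return front

# Toy measurements (made up for illustration): accuracy, memory in GB.
toy = [
    ("method_a", 0.80, 10.0),
    ("method_b", 0.78, 6.0),
    ("method_c", 0.75, 8.0),    # less accurate AND heavier than method_b
    ("method_d", 0.70, 3.0),
]
front = pareto_front(toy)
```

Here method_c drops out because method_b beats it on both axes. The remaining three are all defensible choices, and which one to pick depends on the task, budget, and serving shape.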