Chanl
Learning AI

Fine-Tune a 7B Model for $1,500 (Not $50,000)

Full fine-tuning costs $50K in H100s. QLoRA on an RTX 4090 costs $1,500. Learn how LoRA and QLoRA let you train only 0.1-1% of parameters with nearly identical results, with working code for fine-tuning models that understand your agent's tool schemas.

Dean Grover, Co-founder
March 20, 2026
19 min read
Illustration of a neural network with low-rank adapter matrices injected between layers, showing only a small percentage of parameters highlighted for training

We needed a model that understood our tool schemas. Not the generic function-calling format every LLM ships with, but our specific schemas, our parameter naming, our error-handling conventions. The kind of understanding that turns a 40% tool-call success rate into a 90% one.

The first quote we got: $50,000 in H100 GPU time for full fine-tuning of a 7B model. The second approach: QLoRA on an RTX 4090 we already owned. Cost: $0 in incremental compute (or roughly $1,500 if you're buying the card). Results: within 1% of the full fine-tune on every benchmark we cared about.

This article is the guide we wish we'd had. Not the theory. The actual process of fine-tuning open-source models for agent tool-calling, with LoRA and QLoRA, on hardware you can buy at Best Buy.


The Method Comparison

Before diving in, here's the landscape. Every number in this table comes from published benchmarks or our own testing.

| | Full Fine-Tuning | LoRA | QLoRA |
| --- | --- | --- | --- |
| Parameters trained | 100% (7B+) | 0.1-1% (~10M) | 0.1-1% (~10M) |
| Base model precision | FP16/BF16 | FP16/BF16 | 4-bit (NF4) |
| VRAM for 7B model | 100-120 GB | 16-24 GB | 8-10 GB |
| Minimum GPU | 2-4x H100 80GB | 1x A100 40GB | 1x RTX 4090 24GB |
| Hardware cost | ~$50,000+ | ~$8,000-15,000 | ~$1,500 |
| Training speed (7B, 10K samples) | 8-12 hours | 2-4 hours | 1-3 hours |
| Quality vs. full fine-tune | Baseline | 95-99% | 90-97% |
| Inference overhead | None | None (merged) | None (merged) |
| Adapter size | N/A | 10-50 MB | 10-50 MB |

The key insight: LoRA adapters merge into the base weights at inference time. Zero latency overhead. Your fine-tuned model runs at identical speed to the original.

Why Fine-Tune for Agents

Conventional wisdom says prompting is enough. The data says otherwise: our base model hallucinated parameter names 40% of the time. After fine-tuning on our tool schemas, that dropped to under 10%. General-purpose LLMs are good at general-purpose tasks. But production AI agents need tools -- and every tool system has its own conventions. Fine-tuning closes three specific gaps.

1. Tool schema adherence. Your agent has 15 tools with specific parameter schemas. A base model hallucinates parameter names 20-40% of the time. A fine-tuned model gets them right 90%+ of the time. Google's FunctionGemma research showed accuracy jumping from 58% to 85% after tool-calling fine-tuning.

2. Response format consistency. You need structured JSON with specific fields. Prompting gets you 80% of the way. Fine-tuning gets you to 98%. The difference between "usually works" and "production-ready."

3. Domain vocabulary. Medical billing codes, legal citation formats, manufacturing part numbers. Things that few-shot prompting can't reliably teach because the patterns are too numerous and too precise.

What fine-tuning does NOT fix: knowledge gaps (use RAG for that), reasoning failures (use better models), or integration bugs (use better engineering). Fine-tuning is for behavior, not knowledge.

How LoRA Works

The original LoRA paper (Hu et al., 2021) starts from a simple observation: the weight updates during fine-tuning have low intrinsic rank. You don't need to update all 7 billion parameters to change a model's behavior. You need to update a small subspace of them.

LoRA freezes the pretrained weight matrix W and adds a low-rank decomposition:

text
W' = W + BA

Where B is a d-by-r matrix, A is an r-by-d matrix, and r (the rank) is tiny -- typically 16 to 64. For a layer with d=4096, a rank-16 LoRA adds 4096 x 16 + 16 x 4096 = 131,072 trainable parameters. The original layer has 4096 x 4096 = 16,777,216. That's 0.78% of the parameters.

[Figure: LoRA injects low-rank matrices alongside frozen weights. The input x flows through both the frozen W (d x d) and the adapter path A (r x d) then B (d x r); the two results are summed to produce the output.]

At inference time, you compute W' = W + BA once and replace the original weights. The adapter disappears. No extra computation per forward pass. No latency penalty. This is why LoRA is the default choice for production fine-tuning -- you ship a normal model.
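To make the parameter arithmetic above concrete, here's a quick sanity check in plain Python using the same shapes from the text:

```python
def lora_param_count(d: int, r: int) -> int:
    """Trainable parameters added by a rank-r LoRA on a d x d layer:
    B is d x r and A is r x d, so d*r + r*d in total."""
    return 2 * d * r

d, r = 4096, 16
added = lora_param_count(d, r)       # 131,072 trainable parameters
full = d * d                         # 16,777,216 in the original layer
print(added, f"{added / full:.2%}")  # prints: 131072 0.78%
```

Apply that across every attention and MLP projection in a 7B model and the total trainable count lands in the tens of millions, which is exactly the 0.1-1% range in the comparison table.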

QLoRA: The Consumer GPU Unlock

QLoRA (Dettmers et al., 2023) asked: what if we also quantize the frozen weights to 4-bit? The adapters stay in 16-bit (they're tiny), but the base model -- which is 99% of the memory footprint -- drops from 16 bits to 4 bits per parameter.

Three techniques make this work:

NF4 quantization. A 4-bit data type designed for normally-distributed neural network weights, making it information-theoretically better suited than standard INT4 to the weight distributions you actually see in transformers.

Double quantization. The quantization constants themselves get quantized. Saves an additional ~0.4 bits per parameter, which adds up across billions of parameters.

Paged optimizers. When GPU memory spikes during gradient computation, optimizer states automatically page to CPU RAM instead of crashing. This is what makes the "fits on 24GB" claim actually reliable in practice.

The result: a 7B model that needs 100-120 GB for full fine-tuning fits in 8-10 GB. Your $1,500 RTX 4090 does the job that used to require a $50,000 H100 cluster.

| Model Size | Full FT VRAM | LoRA VRAM | QLoRA VRAM | Fits on RTX 4090? |
| --- | --- | --- | --- | --- |
| 3B (Phi-3.5, Gemma 3) | ~48 GB | ~8 GB | ~4 GB | Yes |
| 7-8B (Llama 3.1, Mistral, Qwen) | ~100-120 GB | ~20 GB | ~8-10 GB | Yes |
| 13B (Llama 2 13B) | ~200 GB | ~32 GB | ~14-16 GB | Yes |
| 70B (Llama 3.1 70B) | ~1 TB+ | ~80 GB | ~38-42 GB | No (needs A100/H100) |
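The VRAM numbers in the table follow from simple per-parameter accounting. A rough sketch, assuming standard Adam mixed-precision bookkeeping (16 bytes/param for full fine-tuning) and ~0.5-0.6 bytes/param for NF4 weights after double quantization:

```python
GIB = 2**30

def weights_vram_gib(n_params: float, bytes_per_param: float) -> float:
    """Rough VRAM for weights and per-parameter training state only;
    activations and framework overhead come on top."""
    return n_params * bytes_per_param / GIB

n = 7e9
# Full FT with Adam: bf16 weights (2) + bf16 grads (2) + fp32 master
# weights (4) + two fp32 Adam moments (8) = 16 bytes per parameter.
print(f"full ft: {weights_vram_gib(n, 16):.0f} GiB")   # ~104 GiB
# QLoRA: NF4 base weights after double quantization; adapter gradients
# and optimizer state are negligible by comparison.
print(f"qlora:   {weights_vram_gib(n, 0.56):.1f} GiB")  # ~3.7 GiB
```

Add activations, KV buffers, and CUDA overhead and you arrive at the 100-120 GB and 8-10 GB figures above.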

The 7-8B Sweet Spot

If you're fine-tuning for agent tool-calling, 7-8B parameter models hit the optimal cost-performance point. Three models dominate this tier:

Llama 3.1 8B. Meta's best small model. Strong instruction following out of the box, large community of LoRA adapters, and excellent tool-calling baselines. The default recommendation.

Mistral 7B. Fast inference, efficient architecture with grouped-query attention. Slightly less capable than Llama 3.1 8B on benchmarks, but noticeably faster at generation, which matters for latency-sensitive agent loops.

Qwen 2.5 7B. Alibaba's multilingual powerhouse. If your agents handle CJK languages, start here -- the tokenizer is optimized for Chinese, Japanese, and Korean, which means fewer tokens per message and more efficient context windows.

All three fit on a 12 GB GPU with QLoRA. All three benefit enormously from domain-specific fine-tuning. The difference between a base 8B model and a fine-tuned 8B model on your specific tool schemas is often larger than the difference between a base 8B and a base 70B.

Preparing Training Data

Training data for tool-calling fine-tuning uses a conversation format with tool definitions and expected function calls. Each line of training_data.jsonl is one example like the following (pretty-printed here for readability). The assistant turns are what the model learns to produce: first calling the right tool with the right parameters, then, after seeing the tool result, calling the follow-up tool:

json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a customer service agent. You have access to the following tools:\n\ncheck_order_status(order_id: str) -> OrderStatus\nsubmit_refund(order_id: str, reason: str, amount: float) -> RefundResult\nescalate_to_human(reason: str, priority: str) -> EscalationTicket"
    },
    {
      "role": "user",
      "content": "My order #A1234 arrived damaged. I want a refund."
    },
    {
      "role": "assistant",
      "content": null,
      "tool_calls": [
        {
          "function": {
            "name": "check_order_status",
            "arguments": "{\"order_id\": \"A1234\"}"
          }
        }
      ]
    },
    {
      "role": "tool",
      "content": "{\"status\": \"delivered\", \"delivered_at\": \"2026-03-15\"}"
    },
    {
      "role": "assistant",
      "content": null,
      "tool_calls": [
        {
          "function": {
            "name": "submit_refund",
            "arguments": "{\"order_id\": \"A1234\", \"reason\": \"damaged_item\", \"amount\": 49.99}"
          }
        }
      ]
    }
  ]
}

How much data do you need? For tool-calling fine-tuning, 1,000-5,000 examples covering your full tool surface area is the starting point. 10,000+ examples with edge cases (ambiguous requests, multi-tool chains, error handling) gets you to production quality. Each tool should appear in at least 50-100 training examples.
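The "50-100 examples per tool" rule is easy to check before you burn a training run. A small sketch that counts tool-call coverage in a JSONL file shaped like the example above:

```python
import json
from collections import Counter

def tool_coverage(filepath: str) -> Counter:
    """Count how many times each tool is called across the training set,
    so thin spots (under ~50 examples per tool) surface before training."""
    counts = Counter()
    with open(filepath) as f:
        for line in f:
            example = json.loads(line)
            for msg in example["messages"]:
                for call in msg.get("tool_calls") or []:
                    counts[call["function"]["name"]] += 1
    return counts

# Usage: flag under-represented tools before you train
# for tool, n in tool_coverage("training_data.jsonl").most_common():
#     print(f"{tool}: {n}" + ("  <-- needs more examples" if n < 50 else ""))
```

Tools that barely appear in training are exactly the ones the fine-tuned model will keep fumbling in production.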

Data generation shortcut. Use a frontier model (Claude 4.5 Sonnet, GPT-5) to generate synthetic training data. Give it your tool schemas and ask it to produce diverse user queries with correct tool-call responses. Then have humans review and correct the outputs. This is faster than hand-labeling and produces higher-quality data than manual creation alone.

The Token Estimation Trap

Before training, you need to estimate dataset size in tokens to set training parameters. Most tutorials tell you: divide character count by 4. This is dangerously wrong.

The len(text) / 4 heuristic assumes English prose with a BPE tokenizer. Reality is messier:

| Content Type | Actual Tokens/Char | Error vs. len/4 |
| --- | --- | --- |
| English prose | ~0.25 (1 token per 4 chars) | Baseline |
| JSON/structured data | ~0.35 (1 token per 2.8 chars) | +40% over estimate |
| Code (Python/TS) | ~0.30-0.35 | +20-40% |
| CJK (Chinese/Japanese/Korean) | ~0.8-1.0 (1 token per char) | +200-300% |
| Mixed (English + JSON + code) | ~0.30-0.33 | +20-30% |

For tool-calling training data (which is mostly JSON), the naive estimate is off by 20-40%. For multilingual agents, it's catastrophically wrong.

Always count tokens with the actual tokenizer:

python
import json
from transformers import AutoTokenizer
 
# Use the exact tokenizer for your base model
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
 
def count_dataset_tokens(filepath: str) -> dict:
    """Count actual tokens -- never trust len/4 for training budgets."""
    total_tokens = 0
    total_chars = 0
    examples = 0
    with open(filepath) as f:
        for line in f:
            example = json.loads(line)
            # Concatenate all message content (tool calls included) for counting
            text = " ".join(
                m.get("content") or json.dumps(m.get("tool_calls", []))
                for m in example["messages"]
            )
            total_tokens += len(tokenizer.encode(text))
            total_chars += len(text)
            examples += 1
    naive_estimate = total_chars // 4
    return {
        "examples": examples,
        "total_tokens": total_tokens,
        "avg_tokens_per_example": total_tokens // max(examples, 1),
        # How far off the len/4 heuristic would have been
        "naive_len4_estimate": naive_estimate,
        "naive_underestimate_pct": round(
            100 * (total_tokens - naive_estimate) / max(total_tokens, 1), 1
        ),
    }

Fine-Tuning with Unsloth

Unsloth is the fastest path from "I have training data" to "I have a fine-tuned model." It delivers 2x faster training with 70% less VRAM through custom CUDA kernels, with zero accuracy loss.

python
# pip install unsloth
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset
 
# 1. Load base model in 4-bit (QLoRA) -- fits in ~8GB VRAM
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,          # QLoRA: 4-bit base weights
    dtype=None,                 # Auto-detect: bfloat16 on Ampere+
)
 
# 2. Add LoRA adapters -- these are the only trainable parameters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                       # Rank 16: sweet spot for tool-calling
    lora_alpha=32,              # Alpha = 2x rank (Microsoft convention)
    lora_dropout=0.05,          # Light regularization
    target_modules=[            # Which layers get adapters
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention
        "gate_proj", "up_proj", "down_proj",       # MLP (feed-forward)
    ],
    use_rslora=True,            # Rank-stabilized LoRA: better at higher ranks
)
 
# 3. Load your tool-calling dataset
dataset = load_dataset("json", data_files="training_data.jsonl", split="train")
 
# 4. Train -- SFTTrainer handles chat template formatting
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=TrainingArguments(
        output_dir="./output",
        per_device_train_batch_size=4,     # 4 fits comfortably on 24GB
        gradient_accumulation_steps=4,      # Effective batch size: 16
        num_train_epochs=3,                 # 2-3 epochs for tool-calling
        learning_rate=2e-4,                 # Standard for LoRA
        lr_scheduler_type="cosine",         # Cosine decay works best
        warmup_ratio=0.05,                  # Brief warmup
        bf16=True,                          # Mixed precision (use fp16=True on pre-Ampere GPUs)
        logging_steps=10,
        save_strategy="epoch",
    ),
    max_seq_length=2048,
)
 
trainer.train()
 
# 5. Merge adapters into base model for deployment
model.save_pretrained_merged(
    "merged_model",
    tokenizer,
    save_method="merged_16bit",  # Full-precision merged weights
)

That's it. On an RTX 4090, this trains a 7B tool-calling model in 1-2 hours on 10K examples. The merged model runs at identical speed to the base model with no adapter overhead.

Fine-Tuning with PEFT and TRL

If you want more control, or Unsloth doesn't support your model yet, use HuggingFace's PEFT library directly. This is slightly slower (Unsloth is ~2x faster) but supports every model and every configuration.

python
from peft import LoraConfig, get_peft_model, TaskType, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTTrainer
import torch
 
# 1. Quantization config for QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4: optimal for neural net weights
    bnb_4bit_compute_dtype=torch.bfloat16,   # Compute in bf16 for speed
    bnb_4bit_use_double_quant=True,          # Double quantization: saves ~0.4 bits/param
)
 
# 2. Load quantized base model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # Enable gradient checkpointing
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
 
# 3. LoRA configuration
lora_config = LoraConfig(
    r=16,                           # Rank: 16 for general, 32-64 for code/schemas
    lora_alpha=32,                  # Scaling factor: 2x rank
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type=TaskType.CAUSAL_LM,
    use_rslora=True,                # Rank-stabilized scaling
)
 
# 4. Apply LoRA -- prints trainable vs total params
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 13,631,488 || all params: 8,030,261,248 || 0.17%

That last line is the aha moment. 0.17% of parameters. 13 million trainable out of 8 billion total. And the results are nearly indistinguishable from updating all 8 billion. This is how our tool-call success rate went from 40% to 90% -- not by buying a bigger model, but by teaching a smaller one exactly what our schemas look like.

Choosing Hyperparameters

Hyperparameter selection determines whether you get a great fine-tune or a wasted training run. Here's what actually matters, ranked by impact.

Rank (r). The single most important decision. This controls adapter capacity -- how much the model can learn.

| Task | Recommended Rank | Why |
| --- | --- | --- |
| Simple classification/routing | 8 | Low complexity, few new patterns |
| Tool-calling / instruction following | 16-32 | Medium complexity, schema learning |
| Code generation / complex schemas | 32-64 | High complexity, precise syntax |
| Multi-domain adaptation | 64+ | Many distinct patterns |

Recent research shows diminishing returns past rank 64, and when LoRA is applied to all linear layers, rank matters less than you'd expect. Start with 16 and only increase if evaluation metrics plateau.

Alpha. Controls the scaling of the LoRA update: scaling = alpha / r. Microsoft's convention is alpha = 2 * r (so the update is scaled by 2.0). With rank-stabilized LoRA (use_rslora=True), scaling becomes alpha / sqrt(r), which works better at higher ranks.
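The scaling formulas above are worth seeing side by side. A tiny sketch comparing the classic and rank-stabilized multipliers under the alpha = 2r convention:

```python
import math

def lora_scale(alpha: int, r: int, rslora: bool = False) -> float:
    """Effective multiplier on the BA update: alpha/r classically,
    alpha/sqrt(r) with rank-stabilized LoRA."""
    return alpha / math.sqrt(r) if rslora else alpha / r

for r in (16, 64):
    classic = lora_scale(2 * r, r)                  # always 2.0 with alpha = 2r
    stabilized = lora_scale(2 * r, r, rslora=True)  # grows with sqrt(r)
    print(r, classic, stabilized)
```

With alpha = 2r the classic scale is a constant 2.0, so raising the rank quietly dilutes each adapter dimension; the rank-stabilized form keeps the update's magnitude growing with sqrt(r), which is why it holds up better at ranks of 64 and beyond.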

Target modules. Apply LoRA to all linear layers (attention + MLP) for best results. The original paper only targeted attention layers, but subsequent research shows including MLP layers (gate_proj, up_proj, down_proj) improves quality with minimal extra cost.

Learning rate. 1e-4 to 3e-4 for LoRA. Higher than full fine-tuning because you're updating fewer parameters. Use cosine decay with 5% warmup.

Epochs. 2-3 for tool-calling. More epochs on small datasets leads to overfitting. Monitor eval loss -- if it starts rising while train loss keeps falling, stop.
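That stopping rule can be made mechanical rather than eyeballed. A minimal sketch (the patience threshold is a reasonable default, not a prescription):

```python
def should_stop(eval_losses: list, patience: int = 2) -> bool:
    """Stop when eval loss has failed to improve on its previous best
    for `patience` consecutive evaluations."""
    if len(eval_losses) <= patience:
        return False
    best = min(eval_losses[:-patience])
    return all(loss >= best for loss in eval_losses[-patience:])

# Still improving -> keep training; two stale evals -> stop
print(should_stop([1.00, 0.90, 0.80, 0.75]))  # prints: False
print(should_stop([1.00, 0.90, 0.95, 0.97]))  # prints: True
```

HuggingFace's trainer can enforce the same policy for you via its early-stopping callback with load_best_model_at_end enabled; the point is to key the decision on eval loss, never train loss.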

Evaluating Your Fine-Tune

Training loss going down means nothing if the model doesn't actually call your tools correctly. Build an evaluation set of 200-500 examples that the model never saw during training.

python
import json
 
def evaluate_tool_calling(model, tokenizer, eval_path: str) -> dict:
    """Measure what matters: does the model call the right tools correctly?"""
    correct_tool = 0
    correct_args = 0
    total = 0
 
    with open(eval_path) as f:
        for line in f:
            example = json.loads(line)
            # Build the prompt from system + user messages
            messages = [m for m in example["messages"] if m["role"] in ("system", "user")]
            expected = next(
                m for m in example["messages"]
                if m["role"] == "assistant" and m.get("tool_calls")
            )
 
            # Generate model response
            inputs = tokenizer.apply_chat_template(
                messages, add_generation_prompt=True, return_tensors="pt"
            ).to(model.device)
            outputs = model.generate(inputs, max_new_tokens=256)
            # Decode only the newly generated tokens, not the echoed prompt
            response = tokenizer.decode(
                outputs[0][inputs.shape[-1]:], skip_special_tokens=True
            )
 
            # Parse and compare tool calls
            predicted_call = parse_tool_call(response)  # Your parsing logic
            expected_call = expected["tool_calls"][0]["function"]
 
            total += 1
            if predicted_call and predicted_call["name"] == expected_call["name"]:
                correct_tool += 1
                if predicted_call["arguments"] == json.loads(expected_call["arguments"]):
                    correct_args += 1
 
    return {
        "tool_selection_accuracy": correct_tool / total,  # Right tool?
        "argument_accuracy": correct_args / total,         # Right params?
        "total_evaluated": total,
    }

What to target:

  • Tool selection accuracy > 90% (model picks the right tool)
  • Argument accuracy > 85% (model fills parameters correctly)
  • Refusal accuracy > 95% (model doesn't hallucinate tool calls when none are needed)

If tool selection is low, you need more diverse training examples per tool. If argument accuracy is low, your training data may have inconsistent parameter formats. If refusal accuracy is low, add examples where the correct response is text, not a tool call.
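Refusal accuracy is the easiest of the three targets to forget to measure. A minimal sketch, assuming you collect parse_tool_call outputs (None when the model answered in plain text) alongside a flag for whether each gold answer expects a tool:

```python
def refusal_accuracy(predictions: list, expects_tool: list) -> float:
    """Fraction of no-tool examples where the model correctly answered in
    text (prediction is None) instead of hallucinating a tool call."""
    no_tool_preds = [p for p, e in zip(predictions, expects_tool) if not e]
    if not no_tool_preds:
        return 1.0  # nothing to refuse in this eval set
    return sum(p is None for p in no_tool_preds) / len(no_tool_preds)
```

If your eval set has no plain-text examples at all, that's a data gap in itself: the model has never been graded on knowing when not to act.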

For production agents, connect this evaluation to your monitoring pipeline. Track tool-call success rates in production and feed failures back into your training data.

LoRAFusion: Multi-Job Training

Once you're fine-tuning multiple models (one per agent persona, one per language, one per tool domain), training efficiency becomes critical. LoRAFusion, presented at EuroSys 2026, batches multiple LoRA jobs sharing the same base model intelligently -- splitting adapters into groups and generating balanced microbatches.

Results: up to 1.96x speedup vs. Megatron-LM and a 1.29x average speedup vs. mLoRA. For agent platforms where each workspace fine-tunes for their specific tool schemas, this makes shared GPU infrastructure practical -- one cluster serves many fine-tuning jobs simultaneously.

Serving Fine-Tuned Models

You've trained a model. Now serve it. Two paths.

Path 1: Merge and deploy as a standard model. The LoRA adapters merge into the base weights. The result is a normal model file. Deploy it anywhere you'd deploy a model: vLLM, TGI, Ollama, or any inference server.

python
# Already shown above -- merge and save
model.save_pretrained_merged("merged_model", tokenizer, save_method="merged_16bit")
 
# Then serve with vLLM (fastest open-source inference)
# vllm serve merged_model --port 8000
typescript
// Call your fine-tuned model from an agent -- same as any OpenAI-compatible API
const response = await fetch("http://localhost:8000/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "merged_model",
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: userMessage },
    ],
    // Tool definitions -- the model now understands YOUR schema conventions
    tools: agentTools,
  }),
});

Path 2: Serve base model + swap adapters at request time. Keep one base model in memory and load different LoRA adapters per request. This is how multi-tenant fine-tuning works -- one GPU serves dozens of customized models.

vLLM and LoRAX support this natively. The base model stays in GPU memory; adapters are ~10-50 MB each and load in milliseconds.
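With vLLM, adapter selection rides on the OpenAI-compatible API: start the server with --enable-lora and register adapters via --lora-modules, then pick one per request with the model field. A stdlib-only sketch; the adapter name "billing-agent" and its path are hypothetical placeholders:

```python
import json
from urllib import request

# Assumes the server was started along these lines:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct \
#     --enable-lora --lora-modules billing-agent=/adapters/billing
def build_payload(adapter: str, messages: list, tools: list) -> dict:
    """The 'model' field names the LoRA adapter that handles this request;
    the base model stays resident in GPU memory."""
    return {"model": adapter, "messages": messages, "tools": tools}

def call_adapter(payload: dict,
                 url: str = "http://localhost:8000/v1/chat/completions") -> dict:
    req = request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```

Routing by adapter name is what makes multi-tenant serving cheap: switching customers is a payload field, not a model reload.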

For agents that use MCP tools, you get the best of both worlds: a base model fine-tuned to understand tool-calling conventions generally, with per-agent adapters that specialize in specific tool schemas. The MCP server provides tool definitions at runtime; the fine-tuned model knows how to use them reliably.

What's Next

Fine-tuning is one layer of a production agent stack. The model needs tools to call (build a tool system), memory to persist across sessions, and a full context engineering pipeline to assemble the right information at inference time.

Start small. Pick one agent with one tool domain. Fine-tune a Llama 3.1 8B on 1,000 examples of correct tool calls using QLoRA and Unsloth. Measure the before and after. If tool-call accuracy jumps from 60% to 90%, you've justified the investment in an afternoon.

We started with a $50,000 quote and a 40% tool-call success rate. We ended with a $1,500 GPU and 90% accuracy. The gap between "prompting a general model" and "fine-tuned for your specific task" is the difference between a demo and a product. LoRA and QLoRA make that gap crossable on a single consumer GPU.

Ship agents that use tools reliably

Chanl gives your AI agents tools, memory, and monitoring. Fine-tune the model; we handle everything around it.
