Unsloth: Fine-Tune LLMs 2-5x Faster with 80% Less Memory

Introduction

Unsloth is an open-source library that makes fine-tuning large language models (LLMs) significantly faster and more memory-efficient. It achieves 2-5x speedups over standard Hugging Face training while using up to 80% less GPU memory, with no loss in accuracy. Unsloth accomplishes this through custom CUDA kernels and intelligent memory optimization, making it possible to fine-tune models like Llama 3, Mistral, Gemma, and Phi on consumer-grade GPUs.

Fine-tuning LLMs is essential for adapting general-purpose models to specific domains, tasks, or conversational styles. However, the computational cost is often prohibitive. Unsloth addresses this by optimizing the LoRA and QLoRA training process, so you can fine-tune models that would otherwise require expensive hardware.

In this guide, you will learn how to install Unsloth, load a pretrained model, apply LoRA adapters, and run a fine-tuning job using standard Hugging Face datasets and trainers.

Overview

Key features of Unsloth:

2-5x faster fine-tuning: Custom Triton kernels optimize attention, cross-entropy, and RoPE computations
80% less memory: Gradient checkpointing and memory-efficient implementations dramatically reduce VRAM usage
No accuracy loss: Optimizations are mathematically equivalent to standard training
LoRA and QLoRA support: Seamless integration with parameter-efficient fine-tuning methods
Wide model support: Works with Llama 3, Mistral, Gemma, Phi, Qwen, and other popular architectures
Hugging Face compatible: Uses standard transformers and trl APIs, so existing training scripts need minimal changes

Use cases:

Fine-tuning LLMs on domain-specific data (medical, legal, financial)
Instruction tuning and chat model training
Creating task-specific models from base models
Training on consumer GPUs (single 16GB or 24GB card)

Getting Started

Install Unsloth:

pip install unsloth

Here is a complete working example that loads a model, applies LoRA, and prepares it for training:

from unsloth import FastLanguageModel

# Load a pretrained model with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Apply LoRA adapters
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                    # LoRA rank
    lora_alpha=16,           # LoRA scaling factor
    lora_dropout=0,          # No dropout for faster training
    target_modules=["q_proj", "k_proj", "v_proj",
                    "o_proj", "gate_proj", "up_proj", "down_proj"],
)

# Verify the model is ready
print(f"Model type: {type(model).__name__}")
print(f"Trainable parameters: {model.print_trainable_parameters()}")

Core Concepts

Loading Models with `FastLanguageModel`

The FastLanguageModel class is the main entry point. It wraps Hugging Face model loading with Unsloth’s optimizations:

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Mistral-7B-v0.3",  # HF model name or local path
    max_seq_length=4096,     # Maximum sequence length for training
    load_in_4bit=True,       # Enable 4-bit quantization (QLoRA)
    dtype=None,              # Auto-detect dtype (float16 or bfloat16)
)

Unsloth provides pre-quantized models on Hugging Face under the unsloth/ namespace for faster loading.

Applying LoRA Adapters

After loading the base model, apply LoRA adapters to make it trainable with minimal parameters:

model = FastLanguageModel.get_peft_model(
    model,
    r=16,                       # LoRA rank (higher = more capacity, more memory)
    lora_alpha=16,              # Scaling factor
    lora_dropout=0,             # Dropout rate on LoRA layers
    target_modules=["q_proj", "k_proj", "v_proj",
                    "o_proj", "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",  # Unsloth's optimized checkpointing
)

Training with the Hugging Face `SFTTrainer`

Unsloth integrates directly with the trl library’s SFTTrainer:

from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load a dataset
dataset = load_dataset("yahma/alpaca-cleaned", split="train")

# Configure training
training_args = TrainingArguments(
    output_dir="./outputs",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=10,
    save_strategy="epoch",
    optim="adamw_8bit",
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=training_args,
    max_seq_length=2048,
)

trainer.train()

Practical Examples

Example 1: Fine-Tuning a Chat Model

from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-1B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Apply LoRA
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj",
                    "o_proj", "gate_proj", "up_proj", "down_proj"],
)

# Define a chat template formatting function
def format_chat(example):
    chat = f"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    chat += f"{example['instruction']}<|eot_id|>"
    chat += f"<|start_header_id|>assistant<|end_header_id|>\n\n"
    chat += f"{example['output']}<|eot_id|>"
    return {"text": chat}

# Load and format dataset
dataset = load_dataset("yahma/alpaca-cleaned", split="train")
dataset = dataset.map(format_chat)

# Train
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=TrainingArguments(
        output_dir="./chat-model",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=25,
        optim="adamw_8bit",
    ),
    max_seq_length=2048,
)

trainer.train()

Example 2: Saving and Loading a Fine-Tuned Model

from unsloth import FastLanguageModel

# After training, save the LoRA adapter
model.save_pretrained("my-finetuned-adapter")
tokenizer.save_pretrained("my-finetuned-adapter")

# Save as a merged model in 16-bit for deployment
model.save_pretrained_merged(
    "my-finetuned-merged",
    tokenizer,
    save_method="merged_16bit",
)

# Save as GGUF for llama.cpp inference
model.save_pretrained_gguf(
    "my-finetuned-gguf",
    tokenizer,
    quantization_method="q4_k_m",
)

# Later, reload the LoRA adapter
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="my-finetuned-adapter",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Switch to inference mode
FastLanguageModel.for_inference(model)

# Generate text
inputs = tokenizer("What is machine learning?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Best Practices

Use 4-bit quantization: Set load_in_4bit=True to enable QLoRA, which dramatically reduces memory usage with negligible quality loss.
Start with rank 16: A LoRA rank of 16 is a good default. Increase to 32 or 64 for more complex tasks, decrease to 8 for simpler ones.
Use Unsloth’s gradient checkpointing: Set use_gradient_checkpointing="unsloth" for the most memory-efficient training.
Use adamw_8bit optimizer: This further reduces memory usage compared to standard AdamW.
Export for deployment: Use save_pretrained_merged for Hugging Face deployment or save_pretrained_gguf for local inference with llama.cpp.

Common pitfalls:

Unsloth is not a content generation API. It is a training optimization library that accelerates the fine-tuning process.
Not all model architectures are supported. Check the Unsloth documentation for the current list of compatible models.
Unsloth requires a CUDA-capable GPU. CPU-only training is not supported.
When using QLoRA (4-bit), ensure you have bitsandbytes installed.

Conclusion

Unsloth makes LLM fine-tuning accessible on consumer hardware by delivering substantial speedups and memory savings without compromising model quality. Its seamless integration with the Hugging Face ecosystem means you can adopt it in existing training workflows with minimal changes, making it an essential tool for anyone fine-tuning large language models.

Resources:

About this article. This article was generated by the Best-of-the-Best autonomous AI digest and reviewed by Ruslan Magana Vsevolodovna. Package metadata was last checked on 16 January 2026. See the data leaderboard and the GitHub repository for sources.

Share on

X Facebook LinkedIn Bluesky

Unsloth: Fine-Tune LLMs 2-5x Faster with 80% Less Memory

Introduction

Overview

Getting Started

Core Concepts

Loading Models with FastLanguageModel

Applying LoRA Adapters

Training with the Hugging Face SFTTrainer

Practical Examples

Example 1: Fine-Tuning a Chat Model

Example 2: Saving and Loading a Fine-Tuned Model

Best Practices

Conclusion

Share on

Loading Models with `FastLanguageModel`

Training with the Hugging Face `SFTTrainer`