How to Load Large Language Models on Your PC or Mac
Hello! Today we are going to download some pretrained models for text generation so that we can ask interesting questions offline. For example, if we ask some interesting questions about science, our bot will be able to answer them. I will show a simple text-generation web application.
Fig 1. A Llama model running on a laptop with 16 GB of RAM and a modest GPU with 8 GB of VRAM.
You can run this assistant on your local computer whenever you want.
Introduction
In recent years, Large Language Models (LLMs), also known as Foundational Models, have been trained using large datasets and a massive number of parameters, such as the well-known GPT-3 (175B parameters). Due to their size, these models require high-performance computing with GPUs to run fluently.
For this reason, Foundational Models currently run on several public cloud services such as those from IBM, Microsoft, OpenAI, Google, etc.
Therefore, researchers have focused on efficient fine-tuning, known as Parameter-Efficient Fine-Tuning (PEFT). For example, a LoRA network can be inserted into specific layers to make the model adaptable to different tasks. Instead of fine-tuning all the parameters of a large neural network, the approach shifts towards training a much smaller set of weights and combining them with the weights of specific layers of the original LLM.
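As a minimal sketch of the idea, here is how LoRA adapters can be attached with Hugging Face's peft library; the base model and hyperparameters below are illustrative assumptions, not a recipe from this article:

```python
# Minimal sketch: attach LoRA adapters to a small base model with peft.
# Model choice and hyperparameters are illustrative assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")  # small model for illustration

config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=32,              # scaling factor for the update
    target_modules=["c_attn"],  # GPT-2's fused attention projection layer
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the small adapter weights are trainable
```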
There are many models out there that you can test, such as:
Model | Description | Parameters | Runs on a Laptop | Speed
---|---|---|---|---
CodeLlama-7b-Instruct-hf | This is the repository for the 7B instruct-tuned version in the Hugging Face Transformers format. This model is designed for general code synthesis and understanding. Links to other models can be found in the index at the bottom. | 7B | Yes | A little slow
gpt-j-6b | GPT-J or GPT-J-6B is an open-source large language model developed by EleutherAI in 2021. As the name suggests, it is a generative pre-trained transformer model designed to produce human-like text that continues from a prompt. The "6B" in the name refers to the fact that it has 6 billion parameters. | 6B | Yes | Fast
gpt2 | GPT-2 is a transformer model pretrained on a very large corpus of English data in a self-supervised fashion. This means it was pretrained on raw text only, with no human labelling (which is why it can use lots of publicly available data), with an automatic process to generate inputs and labels from those texts. More precisely, it was trained to guess the next word in sentences. | 124M | Yes | Fast
gpt2-xl | GPT-2 XL is the 1.5B-parameter version of GPT-2, a transformer-based language model created and released by OpenAI. | 1.5B | Yes | Super fast
flan-t5-large | It has strong zero-shot, few-shot, and chain-of-thought abilities. Because of these abilities, FLAN-T5 is useful for a wide array of natural language tasks. This model is FLAN-T5-Large, the 780M-parameter version of FLAN-T5. | 783M | Yes | Fast
mpt-7b-instruct2 | MPT-7B-Instruct2 is a retrained version of the original MPT-7B-Instruct model. | 7B | Yes | Very slow
NousResearch/Llama-2-13b-hf | Llama 2 is a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. This is the repository for the 13B pretrained model, converted to the Hugging Face Transformers format. | 13B | Yes | Super slow
llama-2-70b-chat | Llama 2 is a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. | 70B | No | –
starcoder-15.5b | The StarCoder models are 15.5B-parameter models trained on 80+ programming languages. | 15.5B | No | –
mt0-xxl-13b | BLOOMZ & mT0 are a family of models capable of following human instructions in dozens of languages zero-shot, created by fine-tuning the BLOOM & mT5 pretrained multilingual language models on a crosslingual task mixture. | 13B | No | –
flan-ul2-20b | Flan-UL2 (20B params) from Google is the best open-source LLM out there, as measured on MMLU (55.7) and BigBench Hard (45.9). It surpasses Flan-T5-XXL (11B). It has been instruction fine-tuned with a 2048-token window. It uses the same configuration as the UL2 model. | 20B | No | –
There are three models I will focus on in more detail: CodeLlama, Flan-T5, and GPT-J 6B.
CodeLlama is a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 34 billion parameters. The advantage of this model is that it can run on a simple laptop, not a high-end machine with a powerful GPU.
FLAN-T5 is a pre-trained encoder-decoder model that can perform various text-to-text tasks. It is based on T5 but outperforms it on a large variety of tasks. It is multilingual and uses instruction fine-tuning to improve generalization and usability. FLAN-T5 is intended to be used as a conversational AI assistant that can answer questions, provide explanations, generate text, and engage in dialogue. Flan-T5 is released in different sizes: Small, Base, Large, XL, and XXL. XXL is the biggest version of Flan-T5, containing 11B parameters.
GPT-J-6B is an open-source, autoregressive language model created by a group of researchers called EleutherAI. It is one of the most advanced alternatives to OpenAI's GPT-3 and performs well on a wide array of natural language tasks such as chat, summarization, and question answering, to name a few.
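Before using the web UI, you can also sanity-check any of these models directly with the transformers library. Here is a minimal sketch using flan-t5-large; the prompt and settings are illustrative:

```python
# Minimal sketch: query flan-t5-large directly with the transformers pipeline.
from transformers import pipeline

generator = pipeline("text2text-generation", model="google/flan-t5-large")
result = generator("Answer the question: why is the sky blue?", max_new_tokens=100)
print(result[0]["generated_text"])
```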
Step 1. Set Up the Environment
First, we need to install Conda:
https://docs.conda.io/en/latest/miniconda.html
Then we create and activate a new conda environment:

```
conda create -n textgen python=3.10.9
conda activate textgen
```
We need to install PyTorch; depending on your system, you can choose the appropriate command:
System | GPU | Command
---|---|---
Windows | NVIDIA | `pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117`
Windows | CPU only | `pip3 install torch torchvision torchaudio`
MacOS + MPS | Any | `pip3 install torch torchvision torchaudio`
Linux/WSL | NVIDIA | `pip3 install torch torchvision torchaudio`
Linux/WSL | CPU only | `pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu`
Linux | AMD | `pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.4.2`
For example, I am using a laptop with an NVIDIA RTX 2070. First, I installed CUDA:
https://developer.nvidia.com/cuda-11-7-0-download-archive
and then ran:

```
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
```
The up-to-date commands can be found here: https://pytorch.org/get-started/locally/.
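After installing, a quick check confirms that PyTorch can actually see the GPU:

```python
# Quick sanity check that PyTorch was installed with GPU support.
import torch

print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```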
Then we install the web UI:

```
git clone https://github.com/Cloud-Data-Science/text-generation-webui.git
cd text-generation-webui
pip install -r requirements.txt
```
Step 2. Downloading Models
The second important step is to download the model. Here we must be careful: not all models can be loaded. You have to choose models that are compatible with the loader, and there are different loaders.
Models should be placed in the `text-generation-webui/models` folder. They are usually downloaded from Hugging Face.
The models come in different file formats. With this loader we can deal with the GPTQ and GGUF types:
- Transformers or GPTQ models are made of several files and must be placed in a subfolder.
- GGUF models are a single file and should be placed directly into `models` (see the GGUF loading sketch below).
In both cases, you can use the “Model” tab of the UI to download the model from Hugging Face automatically.
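For reference, here is a minimal sketch of loading a single-file GGUF model directly with the llama-cpp-python library (the library behind the web UI's llama.cpp loader); the file name below is purely illustrative:

```python
# Minimal sketch: run a single-file GGUF model with llama-cpp-python.
# The GGUF file name is an illustrative assumption.
from llama_cpp import Llama

llm = Llama(model_path="models/codellama-7b-instruct.Q4_K_M.gguf")
out = llm("Q: What is the capital of France? A:", max_tokens=32)
print(out["choices"][0]["text"])
```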
It is also possible to download via the command line with `python download-model.py organization/model` (use `--help` to see all the options).
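As an alternative sketch (not part of the web UI itself), you can also fetch a model snapshot with the huggingface_hub library and point it at the models folder; the repo id and local path below are assumptions for illustration:

```python
# Alternative sketch: download a model snapshot with huggingface_hub
# into the web UI's models folder (path is an assumption).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="google/flan-t5-large",
    local_dir="text-generation-webui/models/google_flan-t5-large",
)
```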
In our case we are going to install CodeLlama with 7 billion parameters, which uses the Transformers format; you will need 12.5 GB of extra space for the download:
python download-model.py codellama/CodeLlama-7b-Instruct-hf
flan-t5-large requires 2.91 GB:
python download-model.py google/flan-t5-large
For GPT-2 with 124M parameters you will need 10.4 GB:
python download-model.py gpt2
For EleutherAI/gpt-j-6b with 6 billion parameters you will require 22.5 GB:
python download-model.py EleutherAI/gpt-j-6b
And finally you can download gpt2-xl, which takes 5.99 GB:
python download-model.py gpt2-xl
Since I am planning to run the models on my laptop with 16 GB of RAM and an 8 GB NVIDIA 2070 GPU, it does not make sense to use models larger than 7 billion parameters, such as google/flan-ul2, which has a size of 36.7 GB.
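A quick back-of-the-envelope calculation makes this concrete: the weights of a model stored in float16 take roughly two bytes per parameter, so a 7B model already needs about 14 GB just for its weights:

```python
# Rough estimate of model weight size: 2 bytes per parameter in float16
# (4 bytes in float32). Activations and overhead add to this.
def model_size_gb(n_params_billion, bytes_per_param=2):
    """Approximate weight size in GB for a model stored in float16."""
    return n_params_billion * 1e9 * bytes_per_param / 1e9

print(model_size_gb(7))    # ~14 GB -> too big for an 8 GB GPU without CPU offloading
print(model_size_gb(1.5))  # ~3 GB  -> fits comfortably
```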
Starting the web UI
```
conda activate textgen
cd text-generation-webui
python server.py
```
If your environment was installed correctly, you can simply open your web browser at the local URL printed in the terminal (by default http://127.0.0.1:7860) and you will see the loader.
Step 3. Set Up the Loader
Once the program is loaded, go to the Model tab:
- First, click on Refresh models.
- Select the model, for example codellama.
- Assign the GPU memory; in my case I assigned 5240 MB.
- Assign the CPU memory; in my case I assigned 13800 MB.
- Add the rope freq value if needed.
The model parameters I will use are the default ones, so we do not modify them.
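For context, these GPU/CPU limits correspond roughly to the per-device memory budgets that transformers and accelerate accept when loading a model. This is a hedged sketch using the values chosen above, not the web UI's internal code:

```python
# Sketch: loading with explicit per-device memory budgets via
# transformers + accelerate (values mirror the UI settings above).
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-Instruct-hf",
    device_map="auto",                       # let accelerate split layers across devices
    max_memory={0: "5GiB", "cpu": "13GiB"},  # cap GPU 0 at ~5 GiB, CPU RAM at ~13 GiB
)
```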
Testing chat models.
Let us try asking our models a complex question:
How to solve the Schrödinger equation for the hydrogen atom?
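For reference, a good answer should at least state the equation being asked about, the time-independent Schrödinger equation for the electron in a Coulomb potential, and its bound-state energies:

```latex
% Time-independent Schrodinger equation for the hydrogen atom
\left[-\frac{\hbar^2}{2\mu}\nabla^2 - \frac{e^2}{4\pi\varepsilon_0 r}\right]\psi(\mathbf{r}) = E\,\psi(\mathbf{r})
% Separating variables in spherical coordinates gives the energy levels
E_n = -\frac{13.6\ \text{eV}}{n^2}, \qquad n = 1, 2, 3, \dots
```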
LLaMA (Large Language Model Meta AI) is a family of large language models (LLMs) released by Meta AI starting in February 2023. For the first version of LLaMA, four model sizes were trained: 7, 13, 33, and 65 billion parameters.
Let us first test a model with 7 billion parameters, CodeLlama-7b-Instruct-hf, developed by Meta (Facebook):
Moving on to GPT-J 6B, developed by EleutherAI, we got the following result:
With flan-t5-large, developed by Google, you get:
And for gpt2-xl, developed by OpenAI:
And finally, with mpt-7b-instruct2, we got the following results:
As you can see, different models give different results.
Conclusions.
We have tested the different models, and we rank them as follows:
Model | Comments
---|---
1. FLAN-T5 Large | Loads fast and gives good results
2. LLaMA-7B | Loads fast and gives excellent results
3. GPT-J 6B | Loads slowly but gives standard results
4. gpt2-xl | Loads super fast but the results are generic
5. MPT-7B | Loads slowly and the results are poor
6. LLaMA2-7B | Could not be loaded properly
The most accurate model in answering our question was flan-t5-large, which is also fast. Second place goes to CodeLlama-7b-Instruct-hf, which is also very concrete: its results are very good, but since it is not really fast, it lost the first position. If you do not care about performance, this model is a good fit for you. mpt-7b-instruct2 and gpt2-xl, at least for our questions, did not give accurate results.
Unfortunately, since the laptop used for the test has low specs, we cannot run the latest models with more than 13 billion parameters; Llama-2-13b-hf, for example, with its 13 billion parameters, ran slowly on our testing laptop.
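As a possible workaround (a hedged sketch, not something tested in this article), 4-bit quantization with bitsandbytes can shrink a 13B model enough to experiment with on limited hardware, at some cost in quality and speed:

```python
# Sketch: load a 13B model in 4-bit precision via transformers + bitsandbytes.
# Requires the bitsandbytes package and a CUDA GPU; untested on this laptop.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "NousResearch/Llama-2-13b-hf",
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)
```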
An additional test done by the Baichuan group used MMLU, an English evaluation dataset comprising 57 tasks encompassing elementary math, American history, computer science, law, etc. The difficulty ranges from high school level to expert level, and it is a mainstream LLM evaluation dataset, scored with an open-source evaluation approach.
7B Model Results
Model | MMLU (5-shot)
---|---
GPT-4 | 83.93 |
GPT-3.5 Turbo | 68.54 |
LLaMA-7B | 35.10 |
LLaMA2-7B | 45.73 |
MPT-7B | 27.93 |
Falcon-7B | 26.03 |
For more references about how to download more models, you can visit this site. If you are interested in seeing the latest benchmarks of LLM models, you can check this site.
Congratulations! We have tested some foundational models using our local computer.