Published 2024-01-14.
Last modified 2025-09-20.
Time to read: 5 minutes.
I've been playing with large language models (LLMs) online and locally. LLMs running on my local machines are not as powerful or as fast as the large models running on expensive hardware, but I have complete control over them, with no extra cost, censorship, restrictions, or privacy concerns.
Ollama is a way to run LLMs locally, using a client-server architecture. Ollama wraps LLMs in a server, and clients query that server.
Ollama is an open-source tool, built in Go, for running and packaging generative machine learning models. Ollama clients can include:
- Program code via the Ollama server REST interface
- Text chat
- Web interface
Any Ollama client can access any Ollama server.
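Because the server exposes a plain HTTP API on port 11434 by default, every client ultimately boils down to HTTP requests. As a quick sketch, assuming a server is already running locally, this lists the models it has installed:
$ curl -s http://localhost:11434/api/tags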
Installation
Installation instructions are simple.
On Windows with WSL, install the Linux version of Ollama inside WSL. Linux installation and update look like this:
$ curl -fsSL https://ollama.com/install.sh | sh
>>> Installing ollama to /usr/local
>>> Downloading Linux amd64 bundle
######################################################################## 100.0%
>>> Creating ollama user...
>>> Adding ollama user to render group...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> Enabling and starting ollama service...
Created symlink '/etc/systemd/system/default.target.wants/ollama.service' → '/etc/systemd/system/ollama.service'.
>>> NVIDIA GPU installed.
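A quick sanity check after installation, assuming systemd is enabled in your WSL distribution, is to confirm that the CLI responds and that the service is running:
$ ollama --version
$ systemctl is-active ollama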
Ollama Model Format vs. Hugging Face Model Format
GGUF (GPT-Generated Unified Format) is a file format for storing large language models and other AI models, optimized for fast loading and efficient, quantized inference on local hardware. It bundles model metadata and tensors into a single binary file and supports various quantization levels to reduce memory usage.
Although Ollama can directly use any GGUF-formatted model, caveats exist.
Your first source of Ollama-compatible models should be ollama.com.
Hugging Face provides models in its own format, and some of its models are also available in GGUF format. Ollama does not support models in the native Hugging Face format. Conversion to GGUF can take a long time and requires an understanding of the moving parts.
Most models found on Hugging Face were originally released in PyTorch tensor format and later converted to GGUF. The conversion can mangle some parameters. This is why your primary source for models should be ollama.com.
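That said, if you have a GGUF file you trust, Ollama can import it directly through a Modelfile. This is a minimal sketch; my-model.gguf is a hypothetical file name, and a real Modelfile would usually also define the model's prompt TEMPLATE and stop PARAMETER lines:
$ echo "FROM ./my-model.gguf" > Modelfile
$ ollama create my-model -f Modelfile
$ ollama run my-model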
Installing an Ollama-Compatible Model
To install or update a model without running it, type ollama pull, followed by the name of the model.
Ollama's default model tags are usually Q4 (4-bit quantized), which is faster but can be much less accurate than Q8 (8-bit quantized) models. Install Q8 versions if possible.
$ ollama pull deepseek-r1:8b  # install or update
pulling manifest
pulling e6a7edc1a4d7: 100% ▕████████████████████████████▏ 5.2 GB/5.2 GB  63 MB/s  0s
pulling c5ad996bda6e: 100% ▕████████████████████████████▏  556 B
pulling 6e4c38e1172f: 100% ▕████████████████████████████▏ 1.1 KB
pulling ed8474dc73db: 100% ▕████████████████████████████▏  179 B
pulling f64cd5418e4b: 100% ▕████████████████████████████▏  487 B
verifying sha256 digest
writing manifest
success
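To pull a Q8 build instead of the default, name the quantization explicitly in the tag. Available tags vary by model, so check the model's Tags page on ollama.com first; the tag below is only an example of the naming pattern:
$ ollama pull llama3:8b-instruct-q8_0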
You can install and run any Ollama-compatible model by typing ollama run, followed by the name of the model.
$ ollama run deepseek-r1:8b
Inspecting a Model
Inspect an installed model:
$ ollama show deepseek-r1:8b
  Model
    architecture        qwen3
    parameters          8.2B
    context length      131072
    embedding length    4096
    quantization        Q4_K_M

  Capabilities
    completion
    thinking

  Parameters
    stop           "<|begin▁of▁sentence|>"
    stop           "<|end▁of▁sentence|>"
    stop           "<|User|>"
    stop           "<|Assistant|>"
    temperature    0.6
    top_p          0.95

  License
    MIT License
    Copyright (c) 2023 DeepSeek
    ...
Just display the quantization:
$ ollama show deepseek-r1:8b | grep quantization
    quantization        Q4_K_M
My Favorite Models
Following are some open-source models that I have downloaded and played with on my PC. Larger versions could be run in the cloud, with providers like ShadowPC or AWS spot instances.
Model | Parameters | Purpose |
---|---|---|
codellama:7b | 7B | Old but good: General code synthesis and understanding using Llama 2. |
deepseek-r1:8b | 8B, Q4 | Uses the Qwen architecture; best for math-focused tasks with resource constraints. Outstanding performance across various benchmark evaluations, including mathematics, programming, and general logic. Performance is claimed to be similar to OpenAI o3 and Gemini 2.5 Pro. |
llama3:8b | 8B | Llama 3 is very capable. |
luna-ai-llama2-uncensored-gguf | 7B | |
llama2:13b | 13B | |
llama-3.1-8b-instruct | 8B | |
mistral | 7B | |
mistral-small3.2 | 24B | |
Each model has unique attributes. Some are designed for describing images, while others are designed for generating music or for other special purposes.
The 70B-parameter model really puts a strain on my computer, and takes much longer than the smaller models to yield a result.
Command Line Start
You can start the server from the command line, if it is not already running as a service:
$ ollama serve
2024/01/14 16:25:20 images.go:808: total blobs: 0
2024/01/14 16:25:20 images.go:815: total unused blobs removed: 0
2024/01/14 16:25:20 routes.go:930: Listening on 127.0.0.1:11434 (version 0.1.20)
2024/01/14 16:25:21 shim_ext_server.go:142: Dynamic LLM variants [cuda rocm]
2024/01/14 16:25:21 gpu.go:88: Detecting GPU type
2024/01/14 16:25:21 gpu.go:203: Searching for GPU management library libnvidia-ml.so
2024/01/14 16:25:21 gpu.go:248: Discovered GPU libraries: [/usr/lib/wsl/lib/libnvidia-ml.so.1]
2024/01/14 16:25:21 gpu.go:94: Nvidia GPU detected
2024/01/14 16:25:21 gpu.go:135: CUDA Compute Capability detected: 8.6
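As the log shows, the server only listens on 127.0.0.1:11434 by default. If you want clients on other machines to reach it, set the OLLAMA_HOST environment variable before starting the server. A sketch; note that the API has no authentication, so only do this on a trusted network:
$ OLLAMA_HOST=0.0.0.0:11434 ollama serve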
Ollama Models
Ollama loads models on demand and unloads them after they have been idle for a while. That means you do not have to restart Ollama after installing a new model or removing an existing one.
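Recent Ollama versions can show which models are currently loaded into memory, and when each will be unloaded, which makes this on-demand behavior easy to observe:
$ ollama ps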
My workstation has 64 GB RAM, a 13th-generation Intel i7, and a modest NVIDIA 3060. I decided to try the biggest model to see what might happen, so I downloaded the Llama 2 70B model with the following incantation. (Spoiler: an NVIDIA 4090 would have been a better video card for this model, and it would still be slow.)
$ ollama run llama2:70b
pulling manifest
pulling 68bbe6dc9cf4... 100% ▕████████████████████████████████████▏  38 GB
pulling 8c17c2ebb0ea... 100% ▕████████████████████████████████████▏ 7.0 KB
pulling 7c23fb36d801... 100% ▕████████████████████████████████████▏ 4.8 KB
pulling 2e0493f67d0c... 100% ▕████████████████████████████████████▏   59 B
pulling fa304d675061... 100% ▕████████████████████████████████████▏   91 B
pulling 7c96b46dca6c... 100% ▕████████████████████████████████████▏  558 B
verifying sha256 digest
writing manifest
removing any unused layers
success
>>> Send a message (/? for help)
I played around to learn what the available commands were. For more information, see Tutorial: Set Session System Message in Ollama CLI by Ingrid Stevens.
>>> /?
Available Commands:
  /set            Set session variables
  /show           Show model information
  /bye            Exit
  /?, /help       Help for a command
  /? shortcuts    Help for keyboard shortcuts

Use """ to begin a multi-line message.

>>> Send a message (/? for help)
>>> /show
Available Commands:
  /show info         Show details for this model
  /show license      Show model license
  /show modelfile    Show Modelfile for this model
  /show parameters   Show parameters for this model
  /show system       Show system message
  /show template     Show prompt template

>>> /show modelfile
# Modelfile generated by "ollama show"
# To build a new Modelfile based on this one, replace the FROM line with:
# FROM llama2:70b

FROM /usr/share/ollama/.ollama/models/blobs/sha256:68bbe6dc9cf42eb60c9a7f96137fb8d472f752de6ebf53e9942f267f1a1e2577
TEMPLATE """[INST] <<SYS>>{{ .System }}<</SYS>>

{{ .Prompt }} [/INST]
"""
PARAMETER stop "[INST]"
PARAMETER stop "[/INST]"
PARAMETER stop "<<SYS>>"

>>> /show system
No system message was specified for this model.
>>> /show template
[INST] <<SYS>>{{ .System }}<</SYS>>

{{ .Prompt }} [/INST]

>>> /bye
USER: and ASSISTANT: are helpful when writing a request for the model to reply to.
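For example, a one-shot request can embed those markers directly in the prompt passed to ollama run. This is only a sketch of the idea, not something the model requires:
$ ollama run llama2:70b 'USER: Why is the sky blue? ASSISTANT:'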
By default, Ollama models are stored in these directories:
- Linux: /usr/share/ollama/.ollama/models
- macOS: ~/.ollama/models
The Ollama library has many models available. OllamaHub has more. For applications that may not be safe for work, there is an equivalent uncensored Llama2 70B model that can be downloaded. Do not try to work with this model unless you have a really powerful machine!
$ ollama pull llama2-uncensored:70b
pulling manifest
pulling abca3de387b6... 100% ▕█████████████████████████████████████▏  38 GB
pulling 9224016baa40... 100% ▕█████████████████████████████████████▏ 7.0 KB
pulling 1195ea171610... 100% ▕█████████████████████████████████████▏ 4.8 KB
pulling 28577ba2177f... 100% ▕█████████████████████████████████████▏   55 B
pulling ddaa351c1f3d... 100% ▕█████████████████████████████████████▏   51 B
pulling 9256cd2888b0... 100% ▕█████████████████████████████████████▏  530 B
verifying sha256 digest
writing manifest
removing any unused layers
success
I then listed the models on my computer in another console:
$ ollama list
NAME          ID              SIZE    MODIFIED
llama2:70b    e7f6c06ffef4    38 GB   9 minutes ago
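Each of these 70B pulls occupies about 38 GB of disk, so it is worth removing models you no longer need:
$ ollama rm llama2-uncensored:70b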
Running Queries
Ollama queries can be run in many ways.
I used curl, jq and fold to write my first query from a bash prompt. The -s option for curl prevents the progress meter from cluttering up the screen, and the jq filter removes everything from the response except the desired text. The fold command wraps the text response to a width of 72 characters.
$ curl -s http://localhost:11434/api/generate -d '{
    "model": "llama2:70b",
    "prompt": "Why is there air?",
    "stream": false
  }' | jq -r .response | fold -w 72 -s
Air, or more specifically oxygen, is essential for life as we know it.
It exists because of the delicate balance of chemical reactions in
Earth’s atmosphere, which has allowed complex organisms like ourselves
to evolve.

But if you’re asking about air in a broader sense, it serves many
functions: it helps maintain a stable climate, protects living things
from harmful solar radiation, and provides buoyancy for various forms
of life, such as fish or birds.
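The newer /api/chat endpoint works the same way but accepts a list of role-tagged messages instead of a single prompt, which is more convenient for multi-turn conversations. A sketch against the same server and model:
$ curl -s http://localhost:11434/api/chat -d '{
    "model": "llama2:70b",
    "messages": [
      { "role": "user", "content": "Why is there air?" }
    ],
    "stream": false
  }' | jq -r .message.content | fold -w 72 -s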
Describing Images
I wrote this method to describe images.
def describe_image(image_filename)
  @client = Ollama.new(
    credentials: { address: @address },
    options: {
      server_sent_events: true,
      temperature: @temperature,
      connection: { request: { timeout: @timeout, read_timeout: @timeout } },
    }
  )
  result = @client.generate(
    {
      model: @model,
      prompt: 'Please describe this image.',
      images: [Base64.strict_encode64(File.read(image_filename))],
    }
  )
  puts result.map { |x| x['response'] }.join
end
The results were ridiculous: an example of the famous hallucinations that LLMs entertain their audience with. As the public becomes enculturated to these hallucinations, we may come to prefer them over human comedians. Certainly there will be a lot of material for human comedians to fight back with. For example, when describing the photo of me at the top of this page:
Another attempt, with an equally ridiculous result:
The llava model is supposed to be good at describing images, so I installed it and tried again, with excellent results:
$ ollama pull llava:13b
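With llava installed, you can also describe an image straight from the CLI by including the image's path in the prompt; the path below is a placeholder:
$ ollama run llava:13b 'Please describe this image: /path/to/photo.jpg'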
You can try the latest LLaVA model online.
Documentation
- CLI Reference
- Ollama API
  - There are lots of controls for various models.
- Ollama Web UI
  - An Apple M1 Air works great.
- Crafting Conversations with Ollama-WebUI: Your Server, Your Rules