Model Setup
OpenHuman Ollama Setup — Run Local LLMs for Free, No API Costs
One of OpenHuman's most powerful features is the ability to run entirely offline using local LLMs through Ollama. This gives you complete privacy, zero API costs, and full offline capability.
What is Ollama?
Ollama is a tool that lets you run large language models on your own hardware. It supports dozens of open-source models including Llama 3, Mistral, Qwen, DeepSeek, and many more. When paired with OpenHuman, you get a fully local AI assistant.
Installation
Install Ollama
Download from ollama.com and install:
- macOS: Download the .dmg and drag to Applications
- Linux:
curl -fsSL https://ollama.com/install.sh | sh - Windows: Download and run the installer
Download a Model
Open a terminal and pull a model. For a balance of quality and speed:
ollama pull llama3.2:3bFor better results with more RAM:
ollama pull llama3.1:8bConfigure OpenHuman to Use Ollama
Edit config.toml to point OpenHuman to your local Ollama instance:
[models.ollama] provider = "openai" api_key = "ollama" base_url = "http://localhost:11434/v1" model = "llama3.2" [model_routing] fast_model = "ollama" reasoning_model = "ollama"Setting Ollama as Your Only Model
If you want to run entirely offline with no API dependency:
[models] [models.ollama] provider = "openai" api_key = "ollama" base_url = "http://localhost:11434/v1" model = "llama3.2" [model_routing] reasoning_model = "ollama" fast_model = "ollama" vision_model = ""Performance Tips
- RAM matters: 3B models run on 4GB RAM, 8B needs 8GB+, 70B needs 32GB+
- GPU acceleration: Ollama automatically uses NVIDIA GPUs (CUDA) or Apple Metal
- Reduce context: In TokenJuice settings, set
max_context_tokens = 8192for smaller models - Quantized models: Use Q4 or Q5 quantizations for faster inference on consumer hardware
Pros and Cons
| Aspect | Local (Ollama) | Cloud API |
|---|---|---|
| Cost | Free | Pay per token |
| Privacy | 100% local | Data leaves device |
| Offline | ✅ | ❌ Requires internet |
| Speed | Depends on hardware | Fast (GPU clusters) |
| Quality | Smaller models | Frontier models |
Troubleshooting
Connection refused
Make sure Ollama is running (ollama serve). Check that the base_url ports match.
Slow responses
Try a smaller model (3B instead of 8B). Close other applications. Reduce context window in TokenJuice settings.