# Local Models in Calabi AI Builder
Calabi AI Builder integrates with Calabi Local Models, a local LLM runtime that serves large language models entirely within your Kubernetes cluster. When Calabi Local Models is enabled in your Calabi deployment, every AI Builder chatflow and agent flow can use local models — with zero data ever leaving your infrastructure.
## Why Use Local LLMs?
| Use Case | Recommendation |
|---|---|
| Processing highly sensitive data (PII, PHI, financial records) | Local — data never leaves the cluster |
| Air-gapped deployments with no external internet access | Local — required |
| High-volume workloads with predictable, fixed cost | Local — no per-token charges |
| Maximum response quality for general-purpose chat | Cloud (GPT-4o / Claude) |
| Complex multi-step reasoning, long documents | Cloud (GPT-4o / Claude) |
| Fast prototyping with small data | Either |
| Regulated industries (healthcare, finance, government) | Local preferred |
## Available Local Models
Calabi ships with the following models pre-configured in Calabi Local Models. Additional models can be pulled by your platform administrator.
| Model | Version | Parameters | Min RAM | VRAM (GPU) | Strengths |
|---|---|---|---|---|---|
| Llama 3 | 3.1 8B | 8B | 8 GB | 6 GB | Excellent general-purpose chat, instruction following |
| Llama 3 | 3.1 70B | 70B | 64 GB | 48 GB | Near-GPT-4 quality; best local model for complex tasks |
| Mistral | 7B v0.3 | 7B | 8 GB | 6 GB | Fast, efficient, strong at structured output |
| Mistral | Mixtral 8x7B | 47B (MoE) | 32 GB | 24 GB | High quality with mixture-of-experts efficiency |
| Phi-3 | Mini 3.8B | 3.8B | 4 GB | 3 GB | Lightweight; ideal for CPU-only nodes |
| Phi-3 | Medium 14B | 14B | 16 GB | 10 GB | Balanced quality and resource usage |
| Gemma | 2B | 2B | 3 GB | 2 GB | Ultra-lightweight; good for simple Q&A |
| Gemma | 7B | 7B | 8 GB | 6 GB | Strong code generation and reasoning |
| CodeLlama | 13B | 13B | 16 GB | 10 GB | Specialized for code generation, SQL, scripts |
| nomic-embed-text | v1.5 | 137M | 1 GB | 1 GB | Embedding model for RAG vector stores |
| mxbai-embed-large | v1 | 335M | 2 GB | 1.5 GB | High-quality embeddings for RAG |
Notes:
- RAM requirements are for CPU-only inference (quantized, `q4_K_M`).
- VRAM requirements are for GPU-accelerated inference (also quantized).
- Models run in `q4_K_M` quantization by default, balancing quality and resource use. Your administrator can configure a higher-precision quantization level for better quality at greater memory cost.
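To check which quantization a model actually uses, you can inspect it with the runtime's `ollama show` command. A quick check, assuming you have `kubectl` access to the local models pod (the namespace and deployment names match the examples under Pulling Additional Models below):

```bash
# Print model metadata, including its quantization level (e.g., Q4_K_M)
kubectl exec -n calabi-tenant-<id> deploy/calabi-ollama -- ollama show llama3.1:8b
```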
## Model Sizing Guide

### CPU-Only Nodes (No GPU)

If your Calabi cluster has no GPU nodes, choose a model whose Min RAM fits within your node's available memory:
| Node RAM | Recommended Models | Notes |
|---|---|---|
| 4 GB | Phi-3 Mini, Gemma 2B | Suitable for simple chatbots only |
| 8 GB | Llama 3 8B, Mistral 7B, Gemma 7B | Good general-purpose performance |
| 16 GB | Phi-3 Medium | Better quality, still single-node |
| 32 GB | Mixtral 8x7B | Near-GPT-3.5 quality on CPU |
| 64 GB | Llama 3 70B | Production-grade quality, higher latency |
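Before committing to a model, check how much memory your nodes can actually allocate. A minimal sketch using standard `kubectl` (allocatable memory is typically reported in Ki):

```bash
# List allocatable memory per node to compare against the Min RAM column above
kubectl get nodes -o custom-columns=NAME:.metadata.name,MEMORY:.status.allocatable.memory
```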
### GPU-Accelerated Nodes

GPU inference is typically 5–20x faster than CPU inference:
| GPU VRAM | Recommended Models | Tokens/sec (approx) |
|---|---|---|
| 8 GB | Llama 3 8B, Mistral 7B | 30–60 t/s |
| 16 GB | Phi-3 Medium, Gemma 7B | 40–80 t/s |
| 24 GB | Mixtral 8x7B | 25–50 t/s |
| 48 GB | Llama 3 70B | 20–40 t/s |
| 80 GB | Llama 3 70B (full precision) | 40–70 t/s |
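To verify that your cluster actually exposes GPU resources, you can query node allocatables. This sketch assumes NVIDIA GPUs advertised under the standard `nvidia.com/gpu` resource name:

```bash
# Show how many GPUs each node advertises; <none> means CPU-only
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```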
## When to Use Local vs Cloud LLMs

The decision table under Why Use Local LLMs? summarizes when each option fits; the privacy guarantees below are often the deciding factor.

### Privacy Considerations
When Calabi Local Models is used in Calabi AI Builder:
- No data leaves the cluster. User messages, retrieved document chunks, and LLM responses are processed entirely within the Kubernetes namespace.
- No telemetry. Calabi Local Models does not send usage data to any external service.
- Audit logging. All local model requests are logged to the Calabi audit log, including the model used, prompt token count, and completion token count (but not the prompt content by default).
- Encryption in transit. Communication between AI Builder and Calabi Local Models uses mutual TLS within the cluster.
- No training on your data. Local models do not learn from your prompts — model weights are static.
For regulated industries (HIPAA, GDPR, SOC 2), using local LLMs for chatflows that process patient, financial, or personal data is strongly recommended.
## Switching Models in AI Builder

### Change the LLM in a Chatflow

- Open the chatflow in AI Builder.
- Double-click the Chat (Local Models) node (or add one if the chatflow currently uses a cloud LLM).
- In the configuration drawer:
  - Base URL: `http://calabi-ollama:11434` (pre-filled in Calabi deployments)
  - Model Name: select from the dropdown or type the model name (e.g., `llama3.1:8b`)
  - Temperature: adjust as needed
  - Context Window: set to the model's maximum (see the table below)
### Available Model Names (for the Chat (Local Models) node)
| Display Name | Model Name String | Context Window |
|---|---|---|
| Llama 3.1 8B | llama3.1:8b | 128,000 tokens |
| Llama 3.1 70B | llama3.1:70b | 128,000 tokens |
| Mistral 7B | mistral:7b | 32,000 tokens |
| Mixtral 8x7B | mixtral:8x7b | 32,000 tokens |
| Phi-3 Mini | phi3:mini | 128,000 tokens |
| Phi-3 Medium | phi3:medium | 128,000 tokens |
| Gemma 2B | gemma:2b | 8,192 tokens |
| Gemma 7B | gemma:7b | 8,192 tokens |
| CodeLlama 13B | codellama:13b | 16,384 tokens |
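Because the `calabi-ollama` service name suggests the runtime exposes the standard Ollama HTTP API, you can sanity-check a model name string from the table before wiring it into a chatflow. A minimal sketch, run from inside the cluster (the `num_ctx` option overrides the runtime's default context window, which is usually smaller than the model's maximum):

```bash
# One-off chat request to verify the model name and context window setting
curl -s http://calabi-ollama:11434/api/chat -d '{
  "model": "llama3.1:8b",
  "messages": [{"role": "user", "content": "Reply with the word: ready"}],
  "options": {"num_ctx": 8192},
  "stream": false
}'
```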
### Change the Embedding Model
For RAG chatflows, the embedding model is configured on the Local Embeddings node:
- Double-click the Local Embeddings node.
- Set Model Name to `nomic-embed-text` or `mxbai-embed-large`.
- Set Base URL to `http://calabi-ollama:11434`.
If you change the embedding model in a RAG chatflow, you must re-index all documents. Vectors produced by different models are not compatible.
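One way to see why re-indexing is mandatory: the two embedding models produce vectors of different lengths, so one model's vectors cannot be searched against the other's index. A quick check, assuming the standard Ollama `/api/embeddings` endpoint and `jq` available in your shell:

```bash
# Print the vector dimensionality each embedding model produces
for model in nomic-embed-text mxbai-embed-large; do
  curl -s http://calabi-ollama:11434/api/embeddings \
    -d "{\"model\": \"$model\", \"prompt\": \"dimension check\"}" \
    | jq --arg m "$model" '{model: $m, dimensions: (.embedding | length)}'
done
```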
## Pulling Additional Models
Your platform administrator can make additional models available:
```bash
# Exec into the local models pod via kubectl and pull a model
kubectl exec -n calabi-tenant-<id> deploy/calabi-ollama -- ollama pull <model-name>

# Examples
kubectl exec -n calabi-tenant-<id> deploy/calabi-ollama -- ollama pull llama3.2:3b
kubectl exec -n calabi-tenant-<id> deploy/calabi-ollama -- ollama pull deepseek-r1:8b
kubectl exec -n calabi-tenant-<id> deploy/calabi-ollama -- ollama pull qwen2.5:7b
```
After pulling, the new model appears in the AI Builder node dropdown within 60 seconds.
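To confirm a pull succeeded, list the models the runtime currently serves:

```bash
# List available models (name, ID, size, modification time)
kubectl exec -n calabi-tenant-<id> deploy/calabi-ollama -- ollama list
```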
To pre-pull models at deployment time, add them to your Helm values:
```yaml
localModels:
  enabled: true
  models:
    - llama3.1:8b
    - mistral:7b
    - nomic-embed-text
    - phi3:mini
  resources:
    requests:
      memory: "16Gi"
      cpu: "4"
    limits:
      memory: "32Gi"
      cpu: "8"
  gpu:
    enabled: false  # Set to true if GPU nodes are available
    count: 1
```
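After editing the values file, apply it with a standard Helm upgrade. The release and chart names below are placeholders; substitute the ones from your Calabi installation:

```bash
# Apply the updated localModels configuration (placeholder release/chart names)
helm upgrade <release-name> <calabi-chart> -n calabi-tenant-<id> -f values.yaml
```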
## Performance Optimization

### Concurrent Requests
Calabi Local Models handles concurrent requests by queuing them against a single model instance. For high-traffic chatflows:
- Enable model parallelism in the Helm configuration:

  ```yaml
  localModels:
    env:
      OLLAMA_NUM_PARALLEL: "4"  # Allow 4 concurrent inference requests
  ```

- Scale the local models deployment to multiple replicas if your hardware supports it:

  ```yaml
  localModels:
    replicaCount: 2  # Requires 2x GPU nodes
  ```
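To observe the effect of `OLLAMA_NUM_PARALLEL`, you can fire several simultaneous requests and compare response times; requests beyond the parallel limit queue and finish later. A rough sketch, not a benchmark:

```bash
# Fire 8 concurrent requests; with OLLAMA_NUM_PARALLEL=4, roughly half will queue
for i in $(seq 1 8); do
  curl -s http://calabi-ollama:11434/api/generate \
    -d '{"model": "llama3.1:8b", "prompt": "Say hi", "stream": false}' \
    -o /dev/null -w "request $i: %{time_total}s\n" &
done
wait
```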
### Keep-Alive

By default, Calabi Local Models keeps a model loaded in memory for 5 minutes after the last request, then unloads it. For frequently used models, increase the keep-alive:
```yaml
localModels:
  env:
    OLLAMA_KEEP_ALIVE: "30m"  # Keep the model loaded for 30 minutes
```
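Keep-alive can also be set per request, which is handy when one chatflow needs its model pinned in memory without changing the global default. This uses the standard Ollama `keep_alive` request field:

```bash
# Keep llama3.1:8b loaded for one hour after this request completes
curl -s http://calabi-ollama:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "warm up",
  "keep_alive": "1h",
  "stream": false
}'
```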
## Related Pages
- Building Chatflows — How to use local models in the chatflow canvas
- RAG Agents — Use local embedding models for private document indexing
- Helm Configuration Reference — Full local models configuration options