
Local Models in Calabi AI Builder

Enterprise

Calabi AI Builder integrates with Calabi Local Models, a local LLM runtime that serves large language models entirely within your Kubernetes cluster. When Calabi Local Models is enabled in your Calabi deployment, every AI Builder chatflow and agent flow can use local models — with zero data ever leaving your infrastructure.


Why Use Local LLMs?

| Use Case | Recommendation |
| --- | --- |
| Processing highly sensitive data (PII, PHI, financial records) | Local — data never leaves the cluster |
| Air-gapped deployments with no external internet access | Local — required |
| High-volume workloads with predictable, fixed cost | Local — no per-token charges |
| Maximum response quality for general-purpose chat | Cloud (GPT-4o / Claude) |
| Complex multi-step reasoning, long documents | Cloud (GPT-4o / Claude) |
| Fast prototyping with small data | Either |
| Regulated industries (healthcare, finance, government) | Local preferred |

Available Local Models

Calabi ships with the following models pre-configured in Calabi Local Models. Additional models can be pulled by your platform administrator.

| Model | Version | Parameters | Min RAM | VRAM (GPU) | Strengths |
| --- | --- | --- | --- | --- | --- |
| Llama 3 | 3.1 8B | 8B | 8 GB | 6 GB | Excellent general-purpose chat, instruction following |
| Llama 3 | 3.1 70B | 70B | 64 GB | 48 GB | Near-GPT-4 quality; best local model for complex tasks |
| Mistral | 7B v0.3 | 7B | 8 GB | 6 GB | Fast, efficient, strong at structured output |
| Mistral | Mixtral 8x7B | 47B (MoE) | 32 GB | 24 GB | High quality with mixture-of-experts efficiency |
| Phi-3 | Mini 3.8B | 3.8B | 4 GB | 3 GB | Lightweight; ideal for CPU-only nodes |
| Phi-3 | Medium 14B | 14B | 16 GB | 10 GB | Balanced quality and resource usage |
| Gemma | 2B | 2B | 3 GB | 2 GB | Ultra-lightweight; good for simple Q&A |
| Gemma | 7B | 7B | 8 GB | 6 GB | Strong code generation and reasoning |
| CodeLlama | 13B | 13B | 16 GB | 10 GB | Specialized for code generation, SQL, scripts |
| nomic-embed-text | v1.5 | 137M | 1 GB | 1 GB | Embedding model for RAG vector stores |
| mxbai-embed-large | v1 | 335M | 2 GB | 1.5 GB | High-quality embeddings for RAG |

Notes:

  • RAM requirements are for CPU-only inference (quantized, q4_K_M).
  • VRAM requirements are for GPU-accelerated inference (also quantized).
  • Models run in q4_K_M quantization by default, balancing quality and resource use. Your administrator can configure higher-precision quantization levels.
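
To confirm which models are actually available in your deployment, you can list them directly from the local models pod. A minimal sketch, using the same calabi-ollama deployment and tenant namespace referenced later on this page; ask your platform administrator if the names differ in your environment:

# List the models currently available in the local runtime
kubectl exec -n calabi-tenant-<id> deploy/calabi-ollama -- ollama list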

Model Sizing Guide

CPU-Only Nodes (No GPU)

If your Calabi cluster does not have GPU nodes, choose models whose Min RAM fits within your node's available memory:

| Node RAM | Recommended Models | Notes |
| --- | --- | --- |
| 4 GB | Phi-3 Mini, Gemma 2B | Suitable for simple chatbots only |
| 8 GB | Llama 3 8B, Mistral 7B, Gemma 7B | Good general-purpose performance |
| 16 GB | Phi-3 Medium | Better quality, still single-node |
| 32 GB | Mixtral 8x7B | Near-GPT-3.5 quality on CPU |
| 64 GB | Llama 3 70B | Production-grade quality, higher latency |

GPU-Accelerated Nodes

GPU inference is 5–20x faster than CPU inference:

| GPU VRAM | Recommended Models | Tokens/sec (approx.) |
| --- | --- | --- |
| 8 GB | Llama 3 8B, Mistral 7B | 30–60 t/s |
| 16 GB | Phi-3 Medium, Gemma 7B | 40–80 t/s |
| 24 GB | Mixtral 8x7B | 25–50 t/s |
| 48 GB | Llama 3 70B | 20–40 t/s |
| 80 GB | Llama 3 70B (full precision) | 40–70 t/s |
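
Before picking a model size, it can help to check what your nodes actually have available. A minimal sketch using standard kubectl commands; <node-name> is a placeholder, and GPU resources only appear if a device plugin (for example, nvidia.com/gpu) is installed:

# Show allocatable memory and CPU for each node
kubectl get nodes -o custom-columns=NAME:.metadata.name,MEMORY:.status.allocatable.memory,CPU:.status.allocatable.cpu

# Inspect a single node's Allocatable section, including any GPU resources
kubectl describe node <node-name>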


Privacy Considerations

When Calabi Local Models is used in Calabi AI Builder:

  • No data leaves the cluster. User messages, retrieved document chunks, and LLM responses are processed entirely within the Kubernetes namespace.
  • No telemetry. Calabi Local Models does not send usage data to any external service.
  • Audit logging. All local model requests are logged to the Calabi audit log, including the model used, prompt token count, and completion token count (but not the prompt content by default).
  • Encryption in transit. Communication between AI Builder and Calabi Local Models uses mutual TLS within the cluster.
  • No training on your data. Local models do not learn from your prompts — model weights are static.

For regulated industries (HIPAA, GDPR, SOC 2), using local LLMs for chatflows that process patient, financial, or personal data is strongly recommended.


Switching Models in AI Builder

Change the LLM in a Chatflow

  1. Open the chatflow in AI Builder.
  2. Double-click the Chat (Local Models) node (or add one if the chatflow currently uses a cloud LLM).
  3. In the configuration drawer:
    • Base URL: http://calabi-ollama:11434 (pre-filled in Calabi deployments)
    • Model Name: Select from the dropdown or type the model name (e.g., llama3.1:8b)
    • Temperature: Adjust as needed
    • Context Window: Set to the model's maximum (see table below)
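
To confirm the Base URL is reachable and see the exact model name strings the runtime serves, you can query its model listing from a pod inside the cluster. A minimal sketch, assuming the Ollama-compatible /api/tags endpoint behind the pre-filled Base URL:

# List the model name strings served at the configured Base URL (run from inside the cluster)
curl -s http://calabi-ollama:11434/api/tags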

Available Model Names (for Chat (Local Models) node)

| Display Name | Model Name String | Context Window |
| --- | --- | --- |
| Llama 3.1 8B | llama3.1:8b | 128,000 tokens |
| Llama 3.1 70B | llama3.1:70b | 128,000 tokens |
| Mistral 7B | mistral:7b | 32,000 tokens |
| Mixtral 8x7B | mixtral:8x7b | 32,000 tokens |
| Phi-3 Mini | phi3:mini | 128,000 tokens |
| Phi-3 Medium | phi3:medium | 128,000 tokens |
| Gemma 2B | gemma:2b | 8,192 tokens |
| Gemma 7B | gemma:7b | 8,192 tokens |
| CodeLlama 13B | codellama:13b | 16,384 tokens |
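
If you want to verify a model name string outside of AI Builder, you can send a test request to the runtime directly. A minimal sketch, assuming the Ollama-compatible /api/chat endpoint behind the Base URL above; llama3.1:8b is just an example, and num_ctx is assumed here to be the option that corresponds to the Context Window field:

# Send a test chat request to the local runtime (run from a pod inside the cluster)
curl http://calabi-ollama:11434/api/chat -d '{
  "model": "llama3.1:8b",
  "messages": [{ "role": "user", "content": "Say hello in one sentence." }],
  "options": { "temperature": 0.2, "num_ctx": 8192 },
  "stream": false
}'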

Change the Embedding Model

For RAG chatflows, the embedding model is configured on the Local Embeddings node:

  1. Double-click the Local Embeddings node.
  2. Set Model Name to nomic-embed-text or mxbai-embed-large.
  3. Set Base URL to http://calabi-ollama:11434.

Re-index Required

If you change the embedding model in a RAG chatflow, you must re-index all documents. Vectors produced by different models are not compatible.
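
To sanity-check an embedding model outside of AI Builder, you can call the runtime's embeddings endpoint directly. A minimal sketch, assuming the Ollama-compatible /api/embeddings endpoint at the same Base URL:

# Generate a test embedding with the local embedding model (run from a pod inside the cluster)
curl http://calabi-ollama:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "Calabi AI Builder supports local embedding models."
}'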


Pulling Additional Models

Your platform administrator can make additional models available:

# Exec into the local models pod via kubectl
kubectl exec -n calabi-tenant-<id> deploy/calabi-ollama -- ollama pull <model-name>

# Examples
kubectl exec -n calabi-tenant-<id> deploy/calabi-ollama -- ollama pull llama3.2:3b
kubectl exec -n calabi-tenant-<id> deploy/calabi-ollama -- ollama pull deepseek-r1:8b
kubectl exec -n calabi-tenant-<id> deploy/calabi-ollama -- ollama pull qwen2.5:7b

After pulling, the new model appears in the AI Builder node dropdown within 60 seconds.

To pre-pull models at deployment time, add them to your Helm values:

localModels:
  enabled: true
  models:
    - llama3.1:8b
    - mistral:7b
    - nomic-embed-text
    - phi3:mini
  resources:
    requests:
      memory: "16Gi"
      cpu: "4"
    limits:
      memory: "32Gi"
      cpu: "8"
  gpu:
    enabled: false # Set to true if GPU nodes are available
    count: 1
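
To apply updated values, run a Helm upgrade against your existing Calabi release. A minimal sketch; the release name, chart reference, and values file below are illustrative placeholders, so substitute whatever your deployment already uses:

# Apply the updated values to the existing release (names are illustrative placeholders)
helm upgrade calabi calabi/calabi -n calabi-tenant-<id> -f values.yaml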

Performance Optimization

Concurrent Requests

Calabi Local Models handles concurrent requests by queuing them against a single model instance. For high-traffic chatflows:

  1. Enable model parallelism in the Helm configuration:

     localModels:
       env:
         OLLAMA_NUM_PARALLEL: "4" # Allow 4 concurrent inference requests

  2. Scale the local models deployment to multiple replicas if your hardware supports it:

     localModels:
       replicaCount: 2 # Requires 2x GPU nodes
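
To get a rough sense of whether OLLAMA_NUM_PARALLEL is taking effect, you can fire several requests at the runtime at once and compare wall-clock time before and after the change. A rough sketch using the Ollama-compatible /api/generate endpoint; the prompt and request count are arbitrary:

# Send 4 requests concurrently and time how long they take in total
for i in 1 2 3 4; do
  curl -s http://calabi-ollama:11434/api/generate \
    -d '{"model": "llama3.1:8b", "prompt": "Name one prime number.", "stream": false}' > /dev/null &
done
time wait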

Keep-Alive

By default, Calabi Local Models keeps a model loaded in memory for 5 minutes after the last request, then unloads it. For frequently-used models, increase the keep-alive:

localModels:
  env:
    OLLAMA_KEEP_ALIVE: "30m" # Keep model loaded for 30 minutes
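
To see which models are currently loaded in memory and how long they will stay resident, you can ask the runtime directly. A minimal sketch, using the same deployment and namespace names as the pull commands above:

# Show models currently loaded in memory and when they are due to be unloaded
kubectl exec -n calabi-tenant-<id> deploy/calabi-ollama -- ollama ps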