
Local Models in Calabi AI Builder

Enterprise

Calabi AI Builder integrates with Calabi Local Models, a local LLM runtime that serves large language models entirely within your Kubernetes cluster. When Calabi Local Models is enabled in your Calabi deployment, every AI Builder chatflow and agent flow can use local models — with zero data ever leaving your infrastructure.


Why Use Local LLMs?

| Use Case | Recommendation |
| --- | --- |
| Processing highly sensitive data (PII, PHI, financial records) | Local — data never leaves the cluster |
| Air-gapped deployments with no external internet access | Local — required |
| High-volume workloads with predictable, fixed cost | Local — no per-token charges |
| Maximum response quality for general-purpose chat | Cloud (GPT-4o / Claude) |
| Complex multi-step reasoning, long documents | Cloud (GPT-4o / Claude) |
| Fast prototyping with small data | Either |
| Regulated industries (healthcare, finance, government) | Local preferred |

Available Local Models

Calabi ships with the following models pre-configured in Calabi Local Models. Additional models can be pulled by your platform administrator.

| Model | Version | Parameters | Min RAM | VRAM (GPU) | Strengths |
| --- | --- | --- | --- | --- | --- |
| Llama 3 | 3.1 8B | 8B | 8 GB | 6 GB | Excellent general-purpose chat, instruction following |
| Llama 3 | 3.1 70B | 70B | 64 GB | 48 GB | Near-GPT-4 quality; best local model for complex tasks |
| Mistral | 7B v0.3 | 7B | 8 GB | 6 GB | Fast, efficient, strong at structured output |
| Mistral | Mixtral 8x7B | 47B (MoE) | 32 GB | 24 GB | High quality with mixture-of-experts efficiency |
| Phi-3 | Mini 3.8B | 3.8B | 4 GB | 3 GB | Lightweight; ideal for CPU-only nodes |
| Phi-3 | Medium 14B | 14B | 16 GB | 10 GB | Balanced quality and resource usage |
| Gemma | 2B | 2B | 3 GB | 2 GB | Ultra-lightweight; good for simple Q&A |
| Gemma | 7B | 7B | 8 GB | 6 GB | Strong code generation and reasoning |
| CodeLlama | 13B | 13B | 16 GB | 10 GB | Specialized for code generation, SQL, scripts |
| nomic-embed-text | v1.5 | 137M | 1 GB | 1 GB | Embedding model for RAG vector stores |
| mxbai-embed-large | v1 | 335M | 2 GB | 1.5 GB | High-quality embeddings for RAG |

Notes:

  • RAM requirements are for CPU-only inference (quantized, q4_K_M).
  • VRAM requirements are for GPU-accelerated inference (also quantized).
  • Models run in q4_K_M quantization by default, balancing quality and resource use. Your administrator can configure higher-precision quantization levels.
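
To confirm which models are actually available in your deployment, you can list them directly from the local models pod. A minimal sketch, using the same calabi-ollama deployment and tenant namespace referenced later on this page; ask your platform administrator if the names differ in your environment:

# List the models currently available in the local runtime
kubectl exec -n calabi-tenant-<id> deploy/calabi-ollama -- ollama list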

Model Sizing Guide

CPU-Only Nodes (No GPU)

If your Calabi cluster does not have GPU nodes, choose models whose Min RAM fits within your node's available memory:

| Node RAM | Recommended Models | Notes |
| --- | --- | --- |
| 4 GB | Phi-3 Mini, Gemma 2B | Suitable for simple chatbots only |
| 8 GB | Llama 3 8B, Mistral 7B, Gemma 7B | Good general-purpose performance |
| 16 GB | Phi-3 Medium | Better quality, still single-node |
| 32 GB | Mixtral 8x7B | Near-GPT-3.5 quality on CPU |
| 64 GB | Llama 3 70B | Production-grade quality, higher latency |

GPU-Accelerated Nodes

GPU inference is 5–20x faster than CPU inference:

| GPU VRAM | Recommended Models | Tokens/sec (approx.) |
| --- | --- | --- |
| 8 GB | Llama 3 8B, Mistral 7B | 30–60 t/s |
| 16 GB | Phi-3 Medium, Gemma 7B | 40–80 t/s |
| 24 GB | Mixtral 8x7B | 25–50 t/s |
| 48 GB | Llama 3 70B | 20–40 t/s |
| 80 GB | Llama 3 70B (full precision) | 40–70 t/s |
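
Before picking a model size, it can help to check what your nodes actually have available. A minimal sketch using standard kubectl commands; <node-name> is a placeholder, and GPU resources only appear if a device plugin (for example, nvidia.com/gpu) is installed:

# Show allocatable memory and CPU for each node
kubectl get nodes -o custom-columns=NAME:.metadata.name,MEMORY:.status.allocatable.memory,CPU:.status.allocatable.cpu

# Inspect a single node's Allocatable section, including any GPU resources
kubectl describe node <node-name>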


Privacy Considerations

When Calabi Local Models is used in Calabi AI Builder:

  • No data leaves the cluster. User messages, retrieved document chunks, and LLM responses are processed entirely within the Kubernetes namespace.
  • No telemetry. Calabi Local Models does not send usage data to any external service.
  • Audit logging. All local model requests are logged to the Calabi audit log, including the model used, prompt token count, and completion token count (but not the prompt content by default).
  • Encryption in transit. Communication between AI Builder and Calabi Local Models uses mutual TLS within the cluster.
  • No training on your data. Local models do not learn from your prompts — model weights are static.

For regulated industries (HIPAA, GDPR, SOC 2), using local LLMs for chatflows that process patient, financial, or personal data is strongly recommended.


Switching Models in AI Builder

Change the LLM in a Chatflow

  1. Open the chatflow in AI Builder.
  2. Double-click the Chat (Local Models) node (or add one if the chatflow currently uses a cloud LLM).
  3. In the configuration drawer:
    • Base URL: http://calabi-ollama:11434 (pre-filled in Calabi deployments)
    • Model Name: Select from the dropdown or type the model name (e.g., llama3.1:8b)
    • Temperature: Adjust as needed
    • Context Window: Set to the model's maximum (see table below)
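
To confirm the Base URL is reachable and see the exact model name strings the runtime serves, you can query its model listing from a pod inside the cluster. A minimal sketch, assuming the Ollama-compatible /api/tags endpoint behind the pre-filled Base URL:

# List the model name strings served at the configured Base URL (run from inside the cluster)
curl -s http://calabi-ollama:11434/api/tags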

Available Model Names (for Chat (Local Models) node)

| Display Name | Model Name String | Context Window |
| --- | --- | --- |
| Llama 3.1 8B | llama3.1:8b | 128,000 tokens |
| Llama 3.1 70B | llama3.1:70b | 128,000 tokens |
| Mistral 7B | mistral:7b | 32,000 tokens |
| Mixtral 8x7B | mixtral:8x7b | 32,000 tokens |
| Phi-3 Mini | phi3:mini | 128,000 tokens |
| Phi-3 Medium | phi3:medium | 128,000 tokens |
| Gemma 2B | gemma:2b | 8,192 tokens |
| Gemma 7B | gemma:7b | 8,192 tokens |
| CodeLlama 13B | codellama:13b | 16,384 tokens |
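
If you want to verify a model name string outside of AI Builder, you can send a test request to the runtime directly. A minimal sketch, assuming the Ollama-compatible /api/chat endpoint behind the Base URL above; llama3.1:8b is just an example, and num_ctx is assumed here to be the option that corresponds to the Context Window field:

# Send a test chat request to the local runtime (run from a pod inside the cluster)
curl http://calabi-ollama:11434/api/chat -d '{
  "model": "llama3.1:8b",
  "messages": [{ "role": "user", "content": "Say hello in one sentence." }],
  "options": { "temperature": 0.2, "num_ctx": 8192 },
  "stream": false
}'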

Change the Embedding Model

For RAG chatflows, the embedding model is configured on the Local Embeddings node:

  1. Double-click the Local Embeddings node.
  2. Set Model Name to nomic-embed-text or mxbai-embed-large.
  3. Set Base URL to http://calabi-ollama:11434.

Re-index Required

If you change the embedding model in a RAG chatflow, you must re-index all documents. Vectors produced by different models are not compatible.
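
To sanity-check an embedding model outside of AI Builder, you can call the runtime's embeddings endpoint directly. A minimal sketch, assuming the Ollama-compatible /api/embeddings endpoint at the same Base URL:

# Generate a test embedding with the local embedding model (run from a pod inside the cluster)
curl http://calabi-ollama:11434/api/embeddings -d '{
  "model": "nomic-embed-text",
  "prompt": "Calabi AI Builder supports local embedding models."
}'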


Pulling Additional Models

Your platform administrator can make additional models available:

# Exec into the local models pod via kubectl
kubectl exec -n calabi-tenant-<id> deploy/calabi-ollama -- ollama pull <model-name>

# Examples
kubectl exec -n calabi-tenant-<id> deploy/calabi-ollama -- ollama pull llama3.2:3b
kubectl exec -n calabi-tenant-<id> deploy/calabi-ollama -- ollama pull deepseek-r1:8b
kubectl exec -n calabi-tenant-<id> deploy/calabi-ollama -- ollama pull qwen2.5:7b

After pulling, the new model appears in the AI Builder node dropdown within 60 seconds.

To pre-pull models at deployment time, add them to your Helm values:

localModels:
  enabled: true
  models:
    - llama3.1:8b
    - mistral:7b
    - nomic-embed-text
    - phi3:mini
  resources:
    requests:
      memory: "16Gi"
      cpu: "4"
    limits:
      memory: "32Gi"
      cpu: "8"
  gpu:
    enabled: false # Set to true if GPU nodes are available
    count: 1
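
To apply updated values, run a Helm upgrade against your existing Calabi release. A minimal sketch; the release name, chart reference, and values file below are illustrative placeholders, so substitute whatever your deployment already uses:

# Apply the updated values to the existing release (names are illustrative placeholders)
helm upgrade calabi calabi/calabi -n calabi-tenant-<id> -f values.yaml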

Performance Optimization

Concurrent Requests

Calabi Local Models handles concurrent requests by queuing them against a single model instance. For high-traffic chatflows:

  1. Enable model parallelism in the Helm configuration:

     localModels:
       env:
         OLLAMA_NUM_PARALLEL: "4" # Allow 4 concurrent inference requests

  2. Scale the local models deployment to multiple replicas if your hardware supports it:

     localModels:
       replicaCount: 2 # Requires 2x GPU nodes
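
To get a rough sense of whether OLLAMA_NUM_PARALLEL is taking effect, you can fire several requests at the runtime at once and compare wall-clock time before and after the change. A rough sketch using the Ollama-compatible /api/generate endpoint; the prompt and request count are arbitrary:

# Send 4 requests concurrently and time how long they take in total
for i in 1 2 3 4; do
  curl -s http://calabi-ollama:11434/api/generate \
    -d '{"model": "llama3.1:8b", "prompt": "Name one prime number.", "stream": false}' > /dev/null &
done
time wait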

Keep-Alive

By default, Calabi Local Models keeps a model loaded in memory for 5 minutes after the last request, then unloads it. For frequently-used models, increase the keep-alive:

localModels:
  env:
    OLLAMA_KEEP_ALIVE: "30m" # Keep model loaded for 30 minutes
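
To see which models are currently loaded in memory and how long they will stay resident, you can ask the runtime directly. A minimal sketch, using the same deployment and namespace names as the pull commands above:

# Show models currently loaded in memory and when they are due to be unloaded
kubectl exec -n calabi-tenant-<id> deploy/calabi-ollama -- ollama ps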