# Getting Started
This document includes installation and configuration instructions for Omni-NLI.
## Installation
You can run Omni-NLI either by installing it as a Python package or by using a pre-built Docker image.
### Python Installation

The base package supports the Ollama and OpenRouter backends:
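A minimal install command, assuming the package is published on PyPI under the name `omni-nli`:

```bash
# Install the base package (Ollama and OpenRouter backends)
pip install omni-nli
```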
For local inference with HuggingFace (supports models from Ollama, OpenRouter, and HuggingFace):
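A sketch of the install command, assuming HuggingFace support ships as an optional extra named `huggingface`:

```bash
# Install with the optional HuggingFace extra for local inference
pip install "omni-nli[huggingface]"
```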
### Docker Installation
Pre-built Docker images are available from the GitHub Container Registry.
Generic CPU Image:
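The CPU image name below matches the one used in the `docker run` example further down:

```bash
docker pull ghcr.io/cogitatortech/omni-nli-cpu:latest
```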
GPU Image (for NVIDIA GPUs):
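The GPU image name below is an assumption based on the CPU image's naming scheme; check the list of available packages for the exact name:

```bash
# Assumed image name; running it on NVIDIA GPUs typically also requires --gpus all
docker pull ghcr.io/cogitatortech/omni-nli-gpu:latest
```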
Configuration can be passed to the container as environment variables. For example, to use OpenRouter with a custom model:
```bash
docker run --rm -it -p 8000:8000 \
  -e DEFAULT_BACKEND=openrouter \
  -e OPENROUTER_API_KEY=your-api-key \
  -e OPENROUTER_DEFAULT_MODEL=openai/gpt-5.2 \
  ghcr.io/cogitatortech/omni-nli-cpu:latest
```
**Tip:** When using the HuggingFace backend, the GPU image is recommended; inference is significantly faster on a GPU.
**Warning:** When using the Ollama backend with Docker, you must set `OLLAMA_HOST` to a reachable IP address or hostname where an Ollama server is running. The default `localhost` resolves to the container itself, so the connection will fail.
**Note:** The Docker images default to a single Gunicorn worker because MCP sessions are currently stored in memory per worker. For REST-only deployments (no MCP), you can increase the worker count for much better throughput:
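A sketch of how this might look, assuming the image reads a `GUNICORN_WORKERS` environment variable (the actual variable name is an assumption; check the image documentation):

```bash
# REST-only deployment: raise the worker count (variable name is assumed)
docker run --rm -it -p 8000:8000 \
  -e GUNICORN_WORKERS=4 \
  ghcr.io/cogitatortech/omni-nli-cpu:latest
```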
**Note:** The `latest` tag refers to the most recent release built from the main branch. You can replace `latest` with a specific version tag from the list of available packages.
## Configuration
The server can be configured using command-line arguments or environment variables.
Environment variables are read from a `.env` file if it exists, or from the system environment.
**Note:** Command-line arguments take precedence over environment variables.
You can copy the `.env.example` file from the project's repository into the directory where you run the server and customize it.
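A minimal `.env` sketch using variables from the configuration reference below (the values are illustrative, not recommendations):

```bash
# .env — example configuration
DEFAULT_BACKEND=ollama
OLLAMA_HOST=http://localhost:11434
OLLAMA_DEFAULT_MODEL=qwen3:8b
LOG_LEVEL=INFO
```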
### Configuration Reference
| Argument | Env Var | Description | Default |
|---|---|---|---|
| `--host` | `HOST` | Server host | `127.0.0.1` |
| `--port` | `PORT` | Server port | `8000` |
| `--log-level` | `LOG_LEVEL` | Logging level | `INFO` |
| `--debug` | `DEBUG` | Enable debug mode | `False` |
| `--default-backend` | `DEFAULT_BACKEND` | Default backend provider | `huggingface` |
| `--ollama-host` | `OLLAMA_HOST` | Ollama server URL | `http://localhost:11434` |
| `--ollama-default-model` | `OLLAMA_DEFAULT_MODEL` | Default Ollama model | `qwen3:8b` |
| `--huggingface-default-model` | `HUGGINGFACE_DEFAULT_MODEL` | Default HuggingFace model | `microsoft/Phi-3.5-mini-instruct` |
| `--openrouter-default-model` | `OPENROUTER_DEFAULT_MODEL` | Default OpenRouter model | `openai/gpt-5-mini` |
| `--huggingface-token` | `HUGGINGFACE_TOKEN` | HuggingFace API token (for gated models) | None |
| `--openrouter-api-key` | `OPENROUTER_API_KEY` | OpenRouter API key | None |
| `--hf-cache-dir` | `HF_CACHE_DIR` | HuggingFace models cache directory | OS default |
| `--max-thinking-tokens` | `MAX_THINKING_TOKENS` | Max tokens for thinking traces | `4096` |
| `--return-thinking-trace` | `RETURN_THINKING_TRACE` | Return raw thinking trace in response | `False` |
**Note:** The `PROVIDER_CACHE_SIZE` setting can only be configured via an environment variable; there is no corresponding CLI flag.
### Supported Backends

| Backend | Local | Example Models |
|---|---|---|
| Ollama | Yes | `qwen3:8b`, `deepseek-r1:7b`, `phi4:latest` |
| HuggingFace | Yes | `microsoft/Phi-3.5-mini-instruct`, `Qwen/Qwen2.5-1.5B-Instruct` |
| OpenRouter | No | `openai/gpt-5-mini`, `openai/gpt-5.2`, `arcee-ai/trinity-large-preview:free` |
## Running the Server
Start the server using the CLI command:
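Assuming the package installs an `omni-nli` console entry point (the actual command name may differ; check the project's README):

```bash
omni-nli
```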
The server will start at http://127.0.0.1:8000 by default.
CLI arguments example:
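An example using flags from the configuration reference above (the `omni-nli` command name is assumed):

```bash
# Bind to all interfaces on port 8080, use the Ollama backend, and enable debug logging
omni-nli --host 0.0.0.0 --port 8080 --default-backend ollama --log-level DEBUG
```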