https://api.venice.ai/api/v1 for private inference across uncensored, frontier models. It implements the OpenAI specification, so existing OpenAI clients and SDKs work by changing the base URL — then adds Venice-native features like server-side web search, characters, and wallet payments. The full spec is also available as an OpenAPI YAML file.
Quickstart
Point any OpenAI-compatible client at Venice’s base URL (https://api.venice.ai/api/v1). Create and manage keys in your API settings.
Authentication
All API requests require HTTP Bearer authentication:Your API key is a secret. Do not share it or expose it in any client-side code.
Differences from OpenAI
Venice is OpenAI-compatible. The main additions and differences:venice_parameters— Venice-only request options (web search, scraping, character personas, thinking controls). Reference below.- System prompts — Venice appends defaults tuned for natural, uncensored output; disable with
include_venice_system_prompt: false. Details below. - Models — Use Venice model IDs directly rather than OpenAI mappings. Browse models.
- Response headers — Balance, rate-limit, model, and content-safety metadata on every response. Reference below.
- Private inference — TEE-backed and end-to-end-encrypted model options. Privacy models.
- Payments — Credits, a daily DIEM allowance, or per-request USDC via x402. x402 guide.
System Prompts
Venice provides default system prompts designed to ensure uncensored and natural model responses. You have two options for handling system prompts:- Default Behavior: Your system prompts are appended to Venice’s defaults
- Custom Behavior: Disable Venice’s system prompts entirely
Disabling Venice System Prompts
Use thevenice_parameters option to remove Venice’s default system prompts:
Venice Parameters
Thevenice_parameters object allows you to access Venice-specific features not available in the standard OpenAI API:
| Parameter | Type | Description | Default |
|---|---|---|---|
character_slug | string | The character slug of a public Venice character (discoverable as “Public ID” on the published character page) | - |
strip_thinking_response | boolean | Strip <think></think> blocks from the response (models using legacy <think> tag format). See Reasoning Models. | false |
disable_thinking | boolean | On supported reasoning models, disable thinking and strip the <think></think> blocks from the response | false |
enable_web_search | string | Enable web search for this request (off, on, auto - auto enables based on model’s discretion)Additional usage-based pricing applies, see pricing. | off |
enable_web_scraping | boolean | Enable web scraping of up to 5 URLs detected in the user message. Scraped content augments responses and bypasses web search. Only successfully scraped URLs are billed. Additional usage-based pricing applies, see pricing. | false |
enable_x_search | boolean | Enable xAI’s native search (web + X/Twitter) for supported Grok models (e.g., grok-4-20-beta). Provides higher quality search results by using xAI’s search infrastructure. When enabled, Venice’s standard web search is bypassed.Additional usage-based pricing applies, see pricing. | false |
enable_web_citations | boolean | When web search is enabled, request that the LLM cite its sources using [REF]0[/REF] format | false |
include_search_results_in_stream | boolean | Experimental: Include search results in the stream as the first emitted chunk | false |
return_search_results_as_documents | boolean | Surface search results in an OpenAI-compatible tool call named venice_web_search_documents for LangChain integration | false |
include_venice_system_prompt | boolean | Whether to include Venice’s default system prompts alongside specified system prompts | true |
These parameters can also be specified as model suffixes appended to the model name (e.g.,
zai-org-glm-5:enable_web_search=auto). See Model Feature Suffixes for details.Prompt Caching
Venice supports prompt caching on select models to reduce latency and costs for repeated content. For supported models, Venice automatically caches system prompts—no code changes required. You can also manually mark content for caching using thecache_control property on message content.
| Parameter | Type | Description |
|---|---|---|
prompt_cache_key | string | Optional routing hint to improve cache hit rates. When supplied, Venice routes requests to the same backend infrastructure, increasing the likelihood of cache hits across multi-turn conversations. |
Response Headers
All Venice API responses include HTTP headers with request, rate-limit, model, and account-balance metadata. In addition to error codes returned from API responses, you can inspect these headers to get the unique ID of a particular request, monitor rate limiting, and track your account balance. Venice recommends logging request IDs (CF-RAY header) in production deployments for more efficient troubleshooting with our support team, should the need arise.
| Header | Type | Purpose | When Returned |
|---|---|---|---|
| Standard HTTP Headers | |||
Content-Type | string | MIME type of the response body (application/json, text/csv, image/png, etc.) | Always |
Content-Encoding | string | Encoding used to compress the response body (gzip, br) | When client sends Accept-Encoding header |
Content-Disposition | string | How content should be displayed (e.g., attachment; filename=export.csv) | When downloading files or exports |
Date | string | RFC 7231 formatted timestamp when the response was generated | Always |
| Request Identification | |||
CF-RAY | string | Unique identifier for this API request, used for troubleshooting and support requests | Always |
x-venice-version | string | Current version/revision of the Venice API service (e.g., 20250828.222653) | Always |
x-venice-timestamp | string | Server timestamp when the request was processed (ISO 8601 format) | When timestamp tracking is enabled |
x-venice-host-name | string | Hostname of the server that processed the request | Error responses and debugging scenarios |
| Model Information | |||
x-venice-model-id | string | Unique identifier of the AI model used for the request (e.g., venice-01-lite) | Inference endpoints using AI models |
x-venice-model-name | string | Friendly/display name of the AI model used (e.g., Venice Lite) | Inference endpoints using AI models |
x-venice-model-router | string | Router/backend service that handled the model inference | Inference endpoints when routing info available |
x-venice-model-deprecation-warning | string | Warning message for models scheduled for deprecation | When using a deprecated model |
x-venice-model-deprecation-date | string | Date when the model will be deprecated (ISO 8601 date) | When using a deprecated model |
| Rate Limiting Information | |||
x-ratelimit-limit-requests | number | Maximum number of requests allowed in the current time window | All authenticated requests |
x-ratelimit-remaining-requests | number | Number of requests remaining in the current time window | All authenticated requests |
x-ratelimit-reset-requests | number | Unix timestamp when the request rate limit resets | All authenticated requests |
x-ratelimit-limit-tokens | number | Maximum number of tokens (prompt + completion) allowed in the time window | All authenticated requests |
x-ratelimit-remaining-tokens | number | Number of tokens remaining in the current time window | All authenticated requests |
x-ratelimit-reset-tokens | number | Duration in seconds until the token rate limit resets | All authenticated requests |
x-ratelimit-type | string | Type of rate limit applied (user, api_key, global) | When rate limiting is enforced |
| Pagination Headers | |||
x-pagination-limit | number | Number of items per page | Paginated endpoints |
x-pagination-page | number | Current page number (1-based) | Paginated endpoints |
x-pagination-total | number | Total number of items across all pages | Paginated endpoints |
x-pagination-total-pages | number | Total number of pages | Paginated endpoints |
| Account Balance Information | |||
x-venice-balance-diem | string | Your DIEM token balance before the request was processed | All authenticated requests |
x-venice-balance-usd | string | Your USD credit balance before the request was processed | All authenticated requests |
| Content Safety Headers | |||
x-venice-is-blurred | string | Indicates if generated image was blurred due to content policies (true/false) | Image generation with Safe Venice enabled |
x-venice-is-content-violation | string | Indicates if content violates Venice’s content policies (true/false) | Content generation endpoints |
x-venice-is-adult-model-content-violation | string | Indicates if content violates adult model content policies (true/false) | Image generation endpoints |
x-venice-contains-minor | string | Indicates if image contains minors (true/false) | Image analysis endpoints with age detection |
| Client Information | |||
x-venice-middleface-version | string | Version of the Venice middleface client | Requests from Venice middleface clients |
x-venice-mobile-version | string | Version of the Venice mobile app client | Requests from mobile applications |
x-venice-request-timestamp-ms | number | Client-provided request timestamp in milliseconds | When client provides timestamp in request |
x-venice-control-instance | string | Control instance identifier for debugging | Image generation endpoints for debugging |
| Authentication Headers | |||
x-auth-refreshed | string | Indicates authentication token was refreshed during request (true/false) | When authentication tokens are auto-refreshed |
x-retry-count | number | Number of retry attempts for the request | When request retries occur |
Notes and an example of accessing headers
Notes and an example of accessing headers
- Header Name Case: HTTP headers are case-insensitive, but Venice uses lowercase with hyphens for consistency
- String Values: Boolean values in headers are returned as strings (
"true"or"false") - Numeric Values: Large numbers and balance values may be returned as strings to prevent precision loss
- Optional Headers: Not all headers are returned in every response; presence depends on the endpoint and request context
- Compression: Use
Accept-Encoding: gzip, brin requests to receive compressed responses where supported
Best Practices
- Rate Limiting: Monitor
x-ratelimit-remaining-requestsandx-ratelimit-remaining-tokensheaders and implement exponential backoff - Balance Monitoring: Track
x-venice-balance-usdandx-venice-balance-diemheaders to avoid service interruptions - System Prompts: Test with and without Venice’s system prompts to find the best fit for your use case
- API Keys: Keep your API keys secure and rotate them regularly
- Request Logging: Log
CF-RAYheader values for troubleshooting with support - Model Deprecation: Check for
x-venice-model-deprecation-warningheaders when using models
API Stability
Venice maintains backward compatibility for v1 endpoints and parameters. For model lifecycle policy, deprecation notices, and migration guidance, see Deprecations.Next Steps
- Quickstart guide — from API key to a working integration
- Endpoints — full reference with an interactive playground
- Models — the full model catalog with pricing and capabilities
- Rate limits — per-model request and token limits
- OpenAPI spec (YAML) — the complete specification for codegen and RAG