Is it really an OpenAI-compatible API?

Yes, end to end. If your code works with the OpenAI SDK, it works with h3llo: change base_url and api_key, and that's it. We support /v1/chat/completions, /v1/completions, /v1/embeddings, streaming, function calling, and JSON mode.

Which models are available?

Llama 3.1 (8B / 70B / 405B), Qwen 2.5 (7B / 32B / 72B), Mixtral 8x22B, plus the e5-mistral and multilingual-e5 embedding models. The list keeps growing: every 2 weeks we add something based on customer requests.

Where are the models physically running?

On our B300 and H100 GPUs in data centers in-country. No external APIs behind the scenes, no shipping your data across borders. FZ-152 and FZ-242 compliant.

What about fine-tuning?

Yes, on Llama 3.1 and Qwen 2.5. Upload a dataset (jsonl, up to 10 GB) and start a job via the CLI or web UI. We train on B300 and give you an endpoint serving your version of the model. Billing is per GPU hour.
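Before uploading, it can help to sanity-check the dataset locally. The sketch below is only an illustration: it assumes each jsonl line is a JSON object with a `messages` list in the OpenAI chat format, which this page does not confirm, and it reads the 10 GB limit as decimal gigabytes. The `validate_jsonl` helper is hypothetical, not part of any h3llo tooling.

```python
import json
import os

MAX_BYTES = 10 * 1000**3  # 10 GB limit from the FAQ; decimal GB is an assumption


def validate_jsonl(path):
    """Return a list of (line_number, problem) tuples; empty means no problems found."""
    problems = []
    if os.path.getsize(path) > MAX_BYTES:
        problems.append((0, "file exceeds 10 GB"))
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((n, f"invalid JSON: {exc.msg}"))
                continue
            # Assumed schema: {"messages": [{"role": ..., "content": ...}, ...]}
            messages = record.get("messages") if isinstance(record, dict) else None
            if not isinstance(messages, list) or not messages:
                problems.append((n, "missing or empty 'messages' list"))
    return problems
```

Running `validate_jsonl("train.jsonl")` on your own file returns an empty list when every line parses and carries a non-empty `messages` list.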

How much does it cost?

Inference: per-second, per-token billing by model (Llama 70B input ≈ 75 ₽/M tokens, output ≈ 240 ₽/M). Fine-tuning: 820 ₽/hr on a 141 GB GPU MIG instance. Dedicated: from 1,600 ₽/hr for a full B300. Exact prices are listed at /price.
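To get a feel for per-token billing, here is a rough estimator. The prices are copied from the model cards on this page and are approximate, so treat the table as a snapshot and check /price for exact numbers. The `estimate_cost_rub` helper is hypothetical and ignores anything billed per second.

```python
# Prices in ₽ per million tokens as (input, output), copied from this page.
# They are approximate and may change; /price is the source of truth.
PRICES_RUB_PER_M = {
    "llama-3.1-70b": (75, 240),
    "llama-3.1-405b": (320, 1100),
    "llama-3.1-8b": (12, 32),
    "qwen-2.5-72b": (84, 260),
}


def estimate_cost_rub(model, input_tokens, output_tokens):
    """Estimate the token cost of one request in rubles."""
    price_in, price_out = PRICES_RUB_PER_M[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# 2,000 prompt tokens and 500 completion tokens on llama-3.1-70b
print(estimate_cost_rub("llama-3.1-70b", 2_000, 500))  # 0.27
```

So a typical chat turn on Llama 70B lands well under one ruble, and a million tokens in each direction comes to about 315 ₽.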

Are there rate limits and quotas?

Yes. The defaults are 1,000 RPM and 200,000 TPM per account, enough for almost everyone. If you need more, we raise the limits on request within minutes.
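If you would rather stay under the defaults than handle rejected requests, a small client-side guard can help. This is a minimal sketch of a sliding 60-second window over both budgets; how the server actually counts requests and tokens is not described on this page, so the window logic and the `SlidingWindowLimiter` class are assumptions for illustration.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Client-side guard for requests-per-minute and tokens-per-minute budgets."""

    def __init__(self, rpm=1_000, tpm=200_000, clock=time.monotonic):
        self.rpm, self.tpm, self.clock = rpm, tpm, clock
        self.events = deque()  # (timestamp, tokens) for the last 60 seconds
        self.tokens_in_window = 0

    def _expire(self, now):
        # Drop events that are 60 seconds old or older.
        while self.events and now - self.events[0][0] >= 60:
            _, tokens = self.events.popleft()
            self.tokens_in_window -= tokens

    def try_acquire(self, tokens):
        """Record the request and return True if it fits both budgets, else False."""
        now = self.clock()
        self._expire(now)
        if len(self.events) >= self.rpm or self.tokens_in_window + tokens > self.tpm:
            return False
        self.events.append((now, tokens))
        self.tokens_in_window += tokens
        return True
```

Call `try_acquire(estimated_tokens)` before each request and back off briefly when it returns False. The injectable `clock` exists only to make the class easy to test.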

AI platform — h3llo cloud

OpenAI-compatible API on our GPUs in-country

OpenAI-compatible API on our GPUs in-country. Llama 3.1, Qwen 2.5, Mixtral — streaming, function calling, JSON mode. Fine-tuning on B300. FZ-152 compliant. No vendor lock-in and your data doesn't leave the country.

Llama · Qwen · Mixtral · embeddings · fine-tuning · FZ-152

# drop-in openai api replacement
$ curl https://api.h3llo.cloud/v1/chat/completions \
    -H "Authorization: Bearer h3llo_•••" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "llama-3.1-70b",
      "stream": true,
      "messages": [
        {"role": "user", "content": "Explain Raft in 3 sentences"}
      ]
    }'
# ~80 ms later, first token:
data: {"choices":[{"delta":{"content":"Raft"}}]}
data: {"choices":[{"delta":{"content":" is a"}}]}
data: {"choices":[{"delta":{"content":" consensus algorithm..."}}]}
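If you consume the stream without an SDK, the `data:` lines shown above can be stitched back into text with a few lines of Python. This sketch uses only the standard library; the `[DONE]` sentinel is what the OpenAI API sends at the end of a stream, and we assume an OpenAI-compatible endpoint does the same. The `iter_deltas` helper is hypothetical.

```python
import json


def iter_deltas(lines):
    """Yield content fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # ignore blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # end-of-stream sentinel (assumed, as in the OpenAI API)
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                yield content


sample = [
    'data: {"choices":[{"delta":{"content":"Raft"}}]}',
    'data: {"choices":[{"delta":{"content":" is a"}}]}',
    'data: {"choices":[{"delta":{"content":" consensus algorithm..."}}]}',
    "data: [DONE]",
]
print("".join(iter_deltas(sample)))  # Raft is a consensus algorithm...
```

In a real client, `lines` would be the decoded lines of the HTTP response body rather than a hard-coded list.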

# the same openai SDK, only base_url changed
from openai import OpenAI

client = OpenAI(
    base_url="https://api.h3llo.cloud/v1",
    api_key="h3llo_•••",
)
resp = client.chat.completions.create(
    model="llama-3.1-70b",
    messages=[{"role": "user", "content": "..."}],
    stream=True,
)
for chunk in resp:
    # delta.content is None on chunks that carry no text (e.g. the final one)
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

LLM · Llama

llama-3.1-70b

Workhorse for chat and RAG. Streaming, function calling, JSON mode.

Context: 128K

Input: 75 ₽/M

Output: 240 ₽/M

Tok/sec: 240

LLM · Llama

llama-3.1-405b

Frontier-class for the hardest tasks. Reasoning, long chains, agents.

Context: 128K

Input: 320 ₽/M

Output: 1,100 ₽/M

Tok/sec: 62

LLM · Llama · best ₽/token

llama-3.1-8b

Cheapest model in the catalog. Ideal for batch inference and simple tasks.

Context: 128K

Input: 12 ₽/M

Output: 32 ₽/M

Tok/sec: 920

LLM · Qwen

qwen-2.5-72b

Strong on code and multilingual. An alternative to Llama 70B with better Chinese/Russian.

Context: 128K

Input: 84 ₽/M

Output: 260 ₽/M

Tok/sec: 210

LLM · Qwen

qwen-2.5-32b

Mid-range: a balance of price and quality. Good for agents and tool use.

Context: 128K

Input: 38 ₽/M

Output: 120 ₽/M

Tok/sec: 480

LLM · MoE

mixtral-8x22b

Mixture-of-Experts: 141B params, 39B active. Good on reasoning and multi-turn.

Context: 64K

Input: 92 ₽/M

Output: 280 ₽/M

Tok/sec: 180

Embeddings

e5-mistral-embed

Multilingual embeddings for RAG. 4,096-dim vector, sentence-aware.

Context: 8K

Input: 5 ₽/M

Output: n/a

Tok/sec: n/a

Embeddings · lightweight

multilingual-e5-large

Cheap multilingual embedding model: 100+ languages, 1,024-dim vector.

Context: 512

Input: 2 ₽/M

Output: n/a

Tok/sec: n/a

What people typically plug us into

01 / rag

RAG in production

Embedding models + chat inference with function calling. p99 ≤ 800 ms on a corpus of 4M documents.
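The retrieval half of a RAG pipeline boils down to ranking stored vectors by similarity to the query vector. Here is a toy sketch of that step using cosine similarity: the 3-dim vectors stand in for real embeddings you would get from /v1/embeddings, and the `top_k` helper is hypothetical. A production setup would use a vector database rather than a Python dict.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=2):
    """Return the ids of the k documents most similar to the query."""
    scored = sorted(doc_vecs.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]


# Toy vectors standing in for real embeddings
docs = {
    "raft": [0.9, 0.1, 0.0],
    "paxos": [0.8, 0.3, 0.1],
    "pasta": [0.0, 0.2, 0.9],
}
print(top_k([1.0, 0.0, 0.0], docs))  # ['raft', 'paxos']
```

The retrieved documents then go into the chat prompt as context for the model.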

02 / agents

Agents and tool use

Function calling, JSON mode, structured outputs. Llama 70B and Qwen 72B hold up in complex multi-turn conversations.
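For function calling, the request body carries a `tools` list in the OpenAI format, and the model answers with `tool_calls` whose arguments arrive as a JSON string. The sketch below only builds and parses those structures without sending anything. The `get_order_status` tool and both helper functions are hypothetical examples, not part of any real API.

```python
import json


def build_tool_request(question):
    """Build a /v1/chat/completions body that offers the model one function."""
    return {
        "model": "llama-3.1-70b",
        "messages": [{"role": "user", "content": question}],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_order_status",  # hypothetical tool
                    "description": "Look up the status of an order by id.",
                    "parameters": {
                        "type": "object",
                        "properties": {"order_id": {"type": "string"}},
                        "required": ["order_id"],
                    },
                },
            }
        ],
        "tool_choice": "auto",
    }


def parse_tool_call(message):
    """Extract (name, arguments) from an assistant message, or None if no tool was called."""
    calls = message.get("tool_calls") or []
    if not calls:
        return None
    fn = calls[0]["function"]
    return fn["name"], json.loads(fn["arguments"])  # arguments arrive as a JSON string
```

With the openai SDK you would pass the same `tools` list to `client.chat.completions.create(...)`, run the named function yourself, and send its result back in a follow-up message with role `tool`.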

03 / fine-tune

Domain fine-tuning

Upload jsonl, kick off a job on B300. You get an endpoint with your version of Llama. From 4 hours.

04 / batch

Batch inference

Async jobs for millions of requests. 4× cheaper inference when latency isn't critical for the task.

Why another inference platform

OpenAI / Anthropic direct

h3llo AI platform

Run vLLM on your own VM

Drop-in OpenAI replacement — only base_url changes

Pay whichever way suits you

Guides and deep dives on AI infrastructure

GPU benchmark: real tokens/sec on our hardware

RAG architecture: hybrid retrieval, reranking, p95

Hyperion AI: 0 → 200 RPS in 6 weeks

26 items before shipping a fine-tune to production

Prompt injection, data leakage, audit

Ready modules: API + vector DB + observability

Change base_url and you're on h3llo

Get an API key

Change base_url

Pick a model

Ship whatever you want

What people usually ask

30 seconds and your code is on h3llo