Frontier models from OpenAI and Anthropic are great, but they live abroad. Open-source models aren't one click away. Your own VM with vLLM is brittle. We built what sits in the middle.
No new SDK to learn, no client to rewrite. Change base_url in the openai client — that's it, your code runs on our models.
# drop-in openai api replacement
$ curl https://api.h3llo.cloud/v1/chat/completions \
    -H "Authorization: Bearer h3llo_•••" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "llama-3.1-70b",
      "stream": true,
      "messages": [
        {"role": "user", "content": "Explain Raft in 3 sentences"}
      ]
    }'
# ~80 ms later — first token:
data: {"choices":[{"delta":{"content":"Raft"}}]}
data: {"choices":[{"delta":{"content":" is a"}}]}
data: {"choices":[{"delta":{"content":" consensus algorithm..."}}]}

# the same openai SDK, only base_url changed
from openai import OpenAI

client = OpenAI(
    base_url="https://api.h3llo.cloud/v1",
    api_key="h3llo_•••",
)
resp = client.chat.completions.create(
    model="llama-3.1-70b",
    messages=[{"role": "user", "content": "..."}],
    stream=True,
)
for chunk in resp:
    # delta.content may be None (for example on the final chunk), so fall back to ""
    print(chunk.choices[0].delta.content or "", end="")

Prices in rubles, per-second billing. Context length is the actual one, with no hidden caps.
| Model | Context (tokens) | Input, ₽ per 1M tokens | Output, ₽ per 1M tokens | Tokens/sec |
|---|---|---|---|---|
| llama-3.1-70b | 128K | 75 ₽ | 240 ₽ | 240 |
| llama-3.1-405b | 128K | 320 ₽ | 1 100 ₽ | 62 |
| llama-3.1-8b (best ₽/token) | 128K | 12 ₽ | 32 ₽ | 920 |
| qwen-2.5-72b | 128K | 84 ₽ | 260 ₽ | 210 |
| qwen-2.5-32b | 128K | 38 ₽ | 120 ₽ | 480 |
| mixtral-8x22b | 64K | 92 ₽ | 280 ₽ | 180 |
| e5-mistral-embed | 8K | 5 ₽ | — | — |
| multilingual-e5-large | 512 | 2 ₽ | — | — |
TPS (tokens per second) was measured at batch size 1 with int8 quantization on a B300 MIG 141 GB instance. With larger batches, TPS scales up; details are in /price.
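To make the table concrete, here is a minimal sketch of the arithmetic behind a single request's cost. It assumes that "₽/M" means rubles per 1M tokens and that cost is linear in token count, with no minimum charge or rounding. The function name `request_cost_rub` is hypothetical and not part of any h3llo SDK; the prices are copied from the table above.

```python
# Prices in rubles per 1M tokens, copied from the table: (input, output).
# Assumption: billing is linear in tokens, with no minimum charge or rounding.
PRICES_RUB_PER_M = {
    "llama-3.1-70b": (75, 240),
    "llama-3.1-405b": (320, 1100),
    "llama-3.1-8b": (12, 32),
    "qwen-2.5-72b": (84, 260),
    "qwen-2.5-32b": (38, 120),
    "mixtral-8x22b": (92, 280),
}


def request_cost_rub(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one chat request in rubles (hypothetical helper)."""
    price_in, price_out = PRICES_RUB_PER_M[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# A 2,000-token prompt with a 500-token answer on llama-3.1-70b:
# (2000 * 75 + 500 * 240) / 1_000_000 = 0.27
cost = request_cost_rub("llama-3.1-70b", 2_000, 500)
print(f"{cost:.2f} ₽")
```

Under the same assumptions, the identical request on llama-3.1-8b comes to about 0.04 ₽, which is why that row carries the best ₽/token label.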
Per-token billing for prototypes and variable load. Reserved GPU for predictable invoices. Dedicated — for regulated markets.
Real practices, benchmarks, case studies. No fluff and no marketing — grab the PDF, apply it Monday.
Llama 3.1 70B, Qwen 2.5: TPS, p99 latency, cost per 1M tokens across batches.
How we hold p95 ≤ 800 ms on a corpus of 4M documents. With formulas and code.
What we measured, what we optimized, which batches we hold. Real before/after numbers.
Dataset quality, validation, eval metrics, rollback. What you must check.
Threat model for a production LLM service. With examples and priorities.
Production-ready: gateway, OpenAI-compat proxy, Qdrant, Grafana. For dev/staging/prod.
No SDK to rewrite. No new API to learn. Just change base_url — and your code runs on our models in-country.
Get a key → h3llo auth login · h3llo ai key create, in 30 seconds.
https://api.openai.com/v1 → https://api.h3llo.cloud/v1. Done.
model: "llama-3.1-70b" instead of gpt-4. Streaming works.
No migrations, no rewrites, no vendor lock-in. Get a key, change base_url, and it works.
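If you would rather read the stream without any SDK, the `data:` lines from the curl example earlier on this page can be reassembled with the standard library alone. This is a rough sketch, not the service's reference client: the helper name `collect_stream_text` is hypothetical, and the closing `data: [DONE]` sentinel is the OpenAI convention, assumed here rather than confirmed for this API.

```python
import json


def collect_stream_text(lines):
    """Join delta.content from OpenAI-style `data:` lines (hypothetical helper)."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data field
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # assumed end-of-stream sentinel, as in the OpenAI API
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)


# The three chunks from the curl example, plus the assumed sentinel.
sample = [
    'data: {"choices":[{"delta":{"content":"Raft"}}]}',
    'data: {"choices":[{"delta":{"content":" is a"}}]}',
    'data: {"choices":[{"delta":{"content":" consensus algorithm..."}}]}',
    "data: [DONE]",
]
print(collect_stream_text(sample))  # Raft is a consensus algorithm...
```

The function takes any iterable of strings, so the same loop can be fed line by line from an HTTP response body instead of a list.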