● ai platform · openai-compatible

OpenAI-compatible API on our GPUs, in-country

Llama · Qwen · Mixtral · embeddings · fine-tuning · FZ-152

Why another inference platform

Frontier models from OpenAI and Anthropic are great, but they live abroad. Open-source models aren't one click away. Your own VM with vLLM is brittle. We built what sits in the middle.

alternative A

OpenAI / Anthropic direct

  • The newest frontier models
  • Data leaves the country
  • Prices in USD · FX risk
  • FZ-152 — no, FZ-242 — no
  • p99 latency ≥ 800 ms from inside the country
h3llo · ai

h3llo AI platform

  • OpenAI-compatible API — drop-in replacement
  • Open-source models on B300/H100 in-country
  • Prices in rubles, per-second billing
  • FZ-152, FZ-242, FSTEC — yes
  • p99 ≤ 80 ms in Moscow, fine-tuning out of the box
  • Own dashboard, RBAC, audit log
alternative B

Run vLLM on your own VM

  • Full control over inference
  • You tune batching and KV-cache yourself
  • You debug GPU utilization yourself
  • You keep autoscaling under load yourself
  • No drop-in OpenAI compatibility
● dx

One base_url — and the whole openai SDK is yours

No new SDK to learn, no client to rewrite. Change base_url in the openai client — that's it, your code runs on our models.

curl · drop-in
# drop-in openai api replacement
$ curl https://api.h3llo.cloud/v1/chat/completions \
    -H "Authorization: Bearer h3llo_•••" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "llama-3.1-70b",
      "stream": true,
      "messages": [
        {"role": "user", "content": "Explain Raft in 3 sentences"}
      ]
    }'

# ~80 ms later — first token:
data: {"choices":[{"delta":{"content":"Raft"}}]}
data: {"choices":[{"delta":{"content":" is a"}}]}
data: {"choices":[{"delta":{"content":" consensus algorithm..."}}]}
Python · openai SDK
# the same openai SDK, only base_url changed
from openai import OpenAI

client = OpenAI(
    base_url="https://api.h3llo.cloud/v1",
    api_key="h3llo_•••",
)

resp = client.chat.completions.create(
    model="llama-3.1-70b",
    messages=[{"role": "user", "content": "..."}],
    stream=True,
)
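for chunk in resp:
    # some chunks (role-only, final) carry no text, so skip empty deltas
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")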
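
If you read the stream without the SDK, the `data:` lines from the curl output can be stitched back together by hand. A minimal sketch, assuming the stream follows the usual OpenAI server-sent-events convention (one JSON chunk per `data:` line, closed by `data: [DONE]`). The helper name `collect_stream_text` is illustrative and not part of any h3llo tooling.

```python
import json
from typing import Iterable


def collect_stream_text(lines: Iterable[str]) -> str:
    """Join the delta.content pieces from OpenAI-style SSE lines."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # assumed end-of-stream sentinel
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            piece = choice.get("delta", {}).get("content")
            if piece:  # role-only and final chunks carry no content
                parts.append(piece)
    return "".join(parts)


# The three chunks from the curl example, plus the assumed terminator.
sample = [
    'data: {"choices":[{"delta":{"content":"Raft"}}]}',
    'data: {"choices":[{"delta":{"content":" is a"}}]}',
    'data: {"choices":[{"delta":{"content":" consensus algorithm..."}}]}',
    "data: [DONE]",
]
print(collect_stream_text(sample))  # Raft is a consensus algorithm...
```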
● catalog · updated regularly

Models running right now

Prices in rubles, per-second billing. Context length is the actual one — no hidden caps.

Model | Context | Input, ₽ per 1M tokens | Output, ₽ per 1M tokens | tokens/sec
llama-3.1-70b | 128K | 75 ₽ | 240 ₽ | 240
llama-3.1-405b | 128K | 320 ₽ | 1,100 ₽ | 62
llama-3.1-8b (best ₽/token) | 128K | 12 ₽ | 32 ₽ | 920
qwen-2.5-72b | 128K | 84 ₽ | 260 ₽ | 210
qwen-2.5-32b | 128K | 38 ₽ | 120 ₽ | 480
mixtral-8x22b | 64K | 92 ₽ | 280 ₽ | 180
e5-mistral-embed | 8K | 5 ₽ | - | -
multilingual-e5-large | 512 | 2 ₽ | - | -

TPS (tokens per second) is measured at batch size 1 with int8 quantization on a B300 MIG 141 GB partition. Aggregate TPS grows with larger batches; details in /price.
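
To turn the catalog into a budget, multiply token counts by the per-million prices. A rough sketch using the chat-model prices from the catalog above; the workload numbers (2,000 prompt tokens, 500 completion tokens) are a hypothetical example, and the function ignores reserved and dedicated tiers. Exact prices live in /price.

```python
# Prices in rubles per 1M tokens (input, output), copied from the catalog above.
PRICES_RUB_PER_M = {
    "llama-3.1-70b": (75, 240),
    "llama-3.1-405b": (320, 1100),
    "llama-3.1-8b": (12, 32),
    "qwen-2.5-72b": (84, 260),
    "qwen-2.5-32b": (38, 120),
    "mixtral-8x22b": (92, 280),
}


def estimate_cost_rub(model: str, input_tokens: int, output_tokens: int) -> float:
    """Pay-as-you-go cost of one request in rubles."""
    input_price, output_price = PRICES_RUB_PER_M[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: 2,000 prompt tokens and 500 completion tokens per request.
per_request = estimate_cost_rub("llama-3.1-70b", 2_000, 500)
print(f"{per_request:.2f} ₽ per request")                  # 0.27 ₽ per request
print(f"{per_request * 100_000:.0f} ₽ per 100k requests")  # 27000 ₽ per 100k requests
```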

● use cases

What people typically plug us into

01 / rag
RAG in production
Embedding models + chat inference with function calling. p99 ≤ 800 ms on a corpus of 4M documents.
02 / agents
Agents and tool use
Function calling, JSON mode, structured outputs. Llama 70B and Qwen 72B hold up in complex multi-turn conversations.
03 / fine-tune
Domain fine-tuning
Upload jsonl, kick off a job on B300. You get an endpoint with your version of Llama. From 4 hours.
04 / batch
Batch inference
Async jobs for millions of requests. 4× cheaper inference when latency isn't critical for the task.
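
The RAG use case above boils down to two calls: embed the query, then rank stored document vectors by similarity before passing the best matches to a chat model. A minimal sketch, assuming the embeddings endpoint behaves like OpenAI's `/v1/embeddings`. The brute-force ranking is for illustration only; a corpus of millions of documents would normally sit behind a vector index, and the helper names here are made up.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: Sequence[float], doc_vecs: List[Sequence[float]], k: int = 3) -> List[int]:
    """Indices of the k documents most similar to the query."""
    order = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return order[:k]


def embed(texts: List[str], model: str = "multilingual-e5-large") -> List[List[float]]:
    """Embed texts through the OpenAI-compatible embeddings endpoint."""
    from openai import OpenAI  # imported lazily so the ranking helpers work offline

    client = OpenAI(base_url="https://api.h3llo.cloud/v1", api_key="h3llo_•••")
    resp = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in resp.data]


# Toy 2-D vectors stand in for real embeddings.
docs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
print(top_k([0.9, 0.1], docs, k=2))  # [0, 1]
```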
● pricing

Pay whichever way suits you

Per-token billing for prototypes and variable load. Reserved GPU for predictable invoices. Dedicated — for regulated markets.

pay-as-you-go
Pay per token. No minimums, no subscriptions. Ideal for prototypes and variable load.
from 12 ₽ / 1M input · llama-3.1-8b
  • All catalog models
  • Streaming + function calling
  • Rate limit 1,000 RPM
  • Per-token, per-second billing
popular
reserved gpu
Reserved MIG partitions on B300. Guaranteed throughput, fixed invoice.
from 250 ₽ / hr · MIG 40 GB
  • Guaranteed p99 latency
  • MIG 40 / 80 / 141 GB or full B300
  • Hourly billing
  • Your own custom models on the endpoint
  • Rate limits disabled
dedicated
Whole B300 nodes for you, isolated inference. For the public sector and regulated markets.
on request
  • Dedicated B300 / H100 cluster
  • Isolated inference perimeter
  • FZ-152 / FSTEC / PCI DSS
  • Custom SLA from 99.99%
  • Dedicated ML engineer
● materials · free

Guides and deep dives on AI infrastructure

Real practices, benchmarks, case studies. No fluff and no marketing — grab the PDF, apply it Monday.

All materials →
● quickstart

Change base_url — and you're on h3llo

No SDK to rewrite. No new API to learn. Just change base_url — and your code runs on our models in-country.

Get a key →
1. Get an API key

h3llo auth login · h3llo ai key create, done in 30 seconds.

2. Change base_url

https://api.openai.com/v1 → https://api.h3llo.cloud/v1. Done.

3. Pick a model

model: "llama-3.1-70b" instead of gpt-4. Streaming works.

4. Ship whatever you want

Your code doesn't change, only the endpoint. Function calling, JSON mode, embeddings: all there.
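
Function calling is mentioned in the last step without an example, so here is a hedged sketch of the round trip. It assumes the platform accepts the OpenAI `tools` parameter and returns `tool_calls` the same way api.openai.com does. The `get_order_status` tool, its stub data, and the helper names are invented for illustration.

```python
import json

# Tool schema in the OpenAI `tools` format; get_order_status is a made-up tool.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the status of an order by its id",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]


def get_order_status(order_id: str) -> dict:
    return {"order_id": order_id, "status": "shipped"}  # stub data


HANDLERS = {"get_order_status": get_order_status}


def run_tool_call(name: str, arguments_json: str) -> str:
    """Run one tool call; the result is sent back as a `tool` message."""
    args = json.loads(arguments_json)
    return json.dumps(HANDLERS[name](**args))


def ask(question: str, model: str = "llama-3.1-70b") -> str:
    from openai import OpenAI  # lazy import: the helpers above run without the SDK

    client = OpenAI(base_url="https://api.h3llo.cloud/v1", api_key="h3llo_•••")
    messages = [{"role": "user", "content": question}]
    first = client.chat.completions.create(model=model, messages=messages, tools=TOOLS)
    msg = first.choices[0].message
    if not msg.tool_calls:
        return msg.content or ""
    messages.append(msg)  # keep the assistant turn that requested the tools
    for call in msg.tool_calls:
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": run_tool_call(call.function.name, call.function.arguments),
        })
    final = client.chat.completions.create(model=model, messages=messages, tools=TOOLS)
    return final.choices[0].message.content or ""


print(run_tool_call("get_order_status", '{"order_id": "42"}'))
```

Only `ask` touches the network; `run_tool_call` can be unit-tested on its own, which keeps tool logic separate from the model round trip.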
● faq

What people usually ask

Is it really an OpenAI-compatible API?
Yes, end to end. If your code works with the official openai SDK, it works with h3llo: change base_url and api_key, and that's it. We support /v1/chat/completions, /v1/completions, /v1/embeddings, streaming, function calling, and JSON mode.
Which models are available?
Llama 3.1 (8B / 70B / 405B), Qwen 2.5 (7B / 32B / 72B), Mixtral 8x22B, plus the e5-mistral and multilingual-e5 embedding models. The list keeps growing — every 2 weeks we add something based on customer requests.
Where are the models physically running?
On our B300 and H100 GPUs in data centers in-country. No external APIs behind the scenes, no shipping your data across borders. FZ-152 and FZ-242 compliant.
What about fine-tuning?
Yes, on Llama 3.1 and Qwen 2.5. Upload the dataset (jsonl, up to 10 GB), start a job via CLI or web UI — we train on B300 and give you an endpoint with your version of the model. Billing is per GPU hour.
How much does it cost?
Inference — per-second, per-token billing by model (Llama 70B input ≈ 75 ₽/M tokens, output ≈ 240 ₽/M). Fine-tuning — 820 ₽/hr on a GPU MIG 141 GB. Dedicated — from 1,600 ₽/hr for a full B300. Exact prices live in /price.
Are there rate limits and quotas?
Yes, by default 1,000 RPM and 200,000 TPM per account — enough for almost everyone. If you need more — we raise it on request within minutes.
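
If you do brush against the default limits, a client-side retry with exponential backoff usually smooths things over. A minimal sketch, assuming the API signals throttling the way OpenAI does (HTTP 429). `RateLimited` is a stand-in exception defined here; with the openai SDK the equivalent is `openai.RateLimitError`, and the client's `max_retries` option may already cover simple cases.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Stand-in for whatever your client raises on HTTP 429."""


def with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call` on RateLimited, doubling the delay after each failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # 0.5 s, 1 s, 2 s, ... plus up to 100 ms of jitter
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
    raise ValueError("max_attempts must be at least 1")


# Simulated endpoint that is throttled twice, then succeeds.
attempts = {"n": 0}


def flaky() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimited()
    return "ok"


print(with_backoff(flaky, sleep=lambda s: None))  # ok
```

The injectable `sleep` argument is there so the retry logic can be tested without real delays.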
● drop-in openai replacement

30 seconds, and your code is on h3llo

No migrations, no rewrites, no vendor lock-in. Get a key, change base_url — it works.

Get an API key → · Documentation