Overview
The EmberCloud API provides OpenAI-compatible endpoints for running inference on GLM models. You can use any existing OpenAI SDK or HTTP client — just change the base URL and API key.
The API supports streaming and non-streaming chat completions, tool calling, JSON mode, and more.
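For example, the OpenAI Python SDK can be pointed at EmberCloud by overriding only the base URL and API key. This is a minimal sketch; the model name and the EMBER_API_KEY environment variable are taken from the examples later in this page.

import os
from openai import OpenAI  # any OpenAI-compatible SDK works; only base_url and api_key change

client = OpenAI(
    base_url="https://api.embercloud.ai/v1",
    api_key=os.environ["EMBER_API_KEY"],
)

response = client.chat.completions.create(
    model="glm-4.7",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
)
print(response.choices[0].message.content)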
Authentication
All API requests require a Bearer token in the Authorization header.
Authorization: Bearer YOUR_API_KEY
Keep your API key secret. Do not expose it in client-side code or public repositories.
Base URL
All API endpoints are served from:
https://api.embercloud.ai/v1
For example, chat completions are available at https://api.embercloud.ai/v1/chat/completions
Chat Completions
/v1/chat/completions
Create a chat completion. Supports streaming and non-streaming responses, tool calling, and structured outputs.
curl https://api.embercloud.ai/v1/chat/completions \
-H "Authorization: Bearer $EMBER_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "glm-4.7",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello!"}
]
}'
Request Body
| Parameter | Type | Required | Description |
|---|---|---|---|
| messages | array | Required | A list of messages comprising the conversation. Each message has a role ("system", "user", "assistant", or "tool") and content. |
| → role | string | Required | The role of the message author: "system", "user", "assistant", "tool", or "developer". |
| → content | string or array | Required | The content of the message. Can be a string or an array of content parts (text, image_url). |
| model | string | Optional | The model to use. Defaults to "glm-4.7". Available: "glm-4.7", "glm-4.7-flash". |
| stream | boolean | Optional | If true, returns a stream of Server-Sent Events. Default: false. |
| stream_options | object | Optional | Options for streaming. Set { "include_usage": true } to receive usage stats in the final chunk. |
| temperature | number | Optional | Sampling temperature between 0 and 2. Higher values make output more random. Default: 1. |
| top_p | number | Optional | Nucleus sampling. The model considers tokens with top_p cumulative probability. Default: 1. |
| max_tokens | integer | Optional | Maximum number of tokens to generate. Model maximum: 32,768. |
| seed | integer | Optional | A seed for deterministic generation. Same seed + same input = same output. |
| stop | string or array | Optional | Up to 4 sequences where the model will stop generating. |
| frequency_penalty | number | Optional | Penalizes tokens based on their frequency in the text so far. Range: -2 to 2. |
| presence_penalty | number | Optional | Penalizes tokens based on whether they appear in the text so far. Range: -2 to 2. |
| tools | array | Optional | A list of tools (functions) the model may call. Each tool has a type, name, description, and parameters schema. |
| tool_choice | string or object | Optional | Controls tool calling: "none", "auto", "required", or { type: "function", function: { name: "..." } }. |
| response_format | object | Optional | Set { "type": "json_object" } to force JSON output. The model will return valid JSON. |
| logprobs | boolean | Optional | Whether to return log probabilities of output tokens. |
| top_logprobs | integer | Optional | Number of most likely tokens to return at each position (0-20). Requires logprobs: true. |
| logit_bias | object | Optional | Map of token IDs to bias values (-100 to 100). Use to increase or decrease likelihood of specific tokens. |
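The sketch below shows how tools and tool_choice fit together, using the OpenAI Python SDK as in the Overview. The get_weather function is purely hypothetical; only the request shape follows the table above.

import os
from openai import OpenAI

client = OpenAI(base_url="https://api.embercloud.ai/v1", api_key=os.environ["EMBER_API_KEY"])

# Hypothetical tool definition, in the OpenAI-style "function" tool format.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="glm-4.7",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",  # let the model decide whether to call the tool
)

choice = response.choices[0]
if choice.finish_reason == "tool_calls":
    # Each tool call carries the function name and JSON-encoded arguments.
    for call in choice.message.tool_calls:
        print(call.function.name, call.function.arguments)

To force JSON output instead of tool calls, pass response_format={"type": "json_object"} on the same call, as described in the table above.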
Response
A non-streaming response returns a chat.completion object:
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1707436800,
"model": "glm-4.7",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! How can I help you today?"
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 24,
"completion_tokens": 9,
"total_tokens": 33
}
}
Response Fields
| Parameter | Type | Required | Description |
|---|---|---|---|
| id | string | Required | Unique identifier for the completion (format: "chatcmpl-..."). |
| object | string | Required | Always "chat.completion". |
| created | integer | Required | Unix timestamp of when the completion was created. |
| model | string | Required | The model that generated the completion. |
| choices | array | Required | List of completion choices. Typically contains one choice. |
| → message | object | Required | The assistant's response message with role and content. |
| → finish_reason | string | Required | "stop" (natural end), "length" (max_tokens reached), "tool_calls" (tool call requested), or "content_filter". |
| usage | object | Required | Token usage statistics: prompt_tokens, completion_tokens, total_tokens. |
Streaming
Set "stream": true to receive Server-Sent Events (SSE). Each event contains a chat.completion.chunk object with a delta of the response.
curl https://api.embercloud.ai/v1/chat/completions \
-H "Authorization: Bearer $EMBER_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "glm-4.7",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": true
}'
Stream Event Format
Each SSE event is prefixed with data: followed by a JSON chunk. The stream ends with data: [DONE].
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1707436800,"model":"glm-4.7","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1707436800,"model":"glm-4.7","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1707436800,"model":"glm-4.7","choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1707436800,"model":"glm-4.7","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
Chunk Fields
| Parameter | Type | Required | Description |
|---|---|---|---|
| delta | object | Required | Partial message content. Contains role on the first chunk, content on subsequent chunks. |
| → delta.content | string or null | Optional | The next piece of generated text. |
| → delta.role | string or null | Optional | Present only in the first chunk. Always "assistant". |
| → delta.reasoning | string or null | Optional | Reasoning content from GLM-4.7 reasoning models, if applicable. |
| → delta.tool_calls | array or null | Optional | Tool call deltas, if the model is invoking a function. |
| finish_reason | string or null | Required | Null until the final content chunk, then "stop", "length", or "tool_calls". |
| usage | object or null | Optional | Included in the final chunk when stream_options.include_usage is true. |
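With an OpenAI-compatible SDK, these chunks arrive as an iterator. A minimal sketch, assuming the client from the Overview example; treating the usage-bearing final chunk as having an empty choices list is an assumption based on the OpenAI chunk format.

import os
from openai import OpenAI

client = OpenAI(base_url="https://api.embercloud.ai/v1", api_key=os.environ["EMBER_API_KEY"])

stream = client.chat.completions.create(
    model="glm-4.7",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
    stream_options={"include_usage": True},  # ask for usage in the final chunk
)

for chunk in stream:
    if chunk.choices:
        delta = chunk.choices[0].delta
        if delta.content:
            print(delta.content, end="", flush=True)
    if chunk.usage:
        # Assumed: the last chunk carries usage and no choices.
        print("\ntotal tokens:", chunk.usage.total_tokens)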
Models
/v1/models
List all available models and their capabilities.
curl https://api.embercloud.ai/v1/models \
  -H "Authorization: Bearer $EMBER_API_KEY"
Response
{
"object": "list",
"data": [
{
"id": "glm-4.7",
"name": "GLM-4.7",
"created": 1707436800,
"context_length": 128000,
"max_output_length": 32768,
"input_modalities": ["text"],
"output_modalities": ["text"],
"pricing": {
"prompt": "0.0000004",
"completion": "0.0000015"
}
},
{
"id": "glm-4.7-flash",
"name": "GLM-4.7 Flash",
"created": 1707436800,
"context_length": 128000,
"max_output_length": 32768,
"input_modalities": ["text"],
"output_modalities": ["text"],
"pricing": {
"prompt": "0.0000002",
"completion": "0.0000008"
}
}
]
}
Streaming Guide
Streaming allows you to receive partial responses as they are generated, reducing perceived latency. The API uses Server-Sent Events (SSE) — a standard HTTP streaming protocol.
How it works
- Send a request with "stream": true
- The server responds with Content-Type: text/event-stream
- Each event is a JSON object prefixed with data:
- The delta field contains the incremental content
- The stream terminates with data: [DONE]
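The same flow with a plain HTTP client, as a rough sketch using the Python requests library; the event parsing is kept deliberately minimal.

import json
import os
import requests

resp = requests.post(
    "https://api.embercloud.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['EMBER_API_KEY']}"},
    json={"model": "glm-4.7", "messages": [{"role": "user", "content": "Hello!"}], "stream": True},
    stream=True,  # read the SSE body incrementally instead of buffering it
)

for line in resp.iter_lines():
    if not line:
        continue  # skip blank lines between events
    payload = line.decode("utf-8").removeprefix("data: ")
    if payload == "[DONE]":
        break  # end-of-stream sentinel
    chunk = json.loads(payload)
    delta = chunk["choices"][0]["delta"]
    if "content" in delta:
        print(delta["content"], end="", flush=True)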
Usage statistics
To receive token usage in the final chunk, include "stream_options": { "include_usage": true } in your request.
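The final chunk then carries the same usage object as a non-streaming response. It looks roughly like the event below; the empty choices array on that chunk is an assumption based on the OpenAI chunk format.

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1707436800,"model":"glm-4.7","choices":[],"usage":{"prompt_tokens":24,"completion_tokens":9,"total_tokens":33}}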
Error Handling
Errors are returned as JSON with an error object containing message, type, and code.
{
"error": {
"message": "Rate limit exceeded. Please retry after a brief wait.",
"type": "rate_limit_error",
"code": 429
}
}
HTTP Status Codes
| Status | Type | Description |
|---|---|---|
| 200 | Success | Request succeeded. |
| 400 | invalid_request_error | Invalid request body, missing messages, or content filtered by the safety system. |
| 401 | authentication_error | Invalid or missing API key. |
| 429 | rate_limit_error | Too many requests or concurrent connections. Check the Retry-After header. |
| 502 | upstream_error | Failed to connect to upstream model provider. |
| 503 | service_unavailable | Service temporarily unavailable. Daily quota or plan limit reached. |
Retry Strategy
For 429 errors, respect the Retry-After header value (in seconds). For 503 errors, use exponential backoff starting from 5 seconds with a maximum of 60 seconds.
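A rough sketch of that retry policy with the Python requests library. The backoff values mirror the guidance above; the request body and helper name are illustrative only.

import os
import time
import requests

URL = "https://api.embercloud.ai/v1/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['EMBER_API_KEY']}"}
BODY = {"model": "glm-4.7", "messages": [{"role": "user", "content": "Hello!"}]}

def create_with_retries(max_attempts: int = 5) -> dict:
    backoff = 5  # seconds; starting point for 503 exponential backoff
    for _ in range(max_attempts):
        resp = requests.post(URL, headers=HEADERS, json=BODY, timeout=60)
        if resp.status_code == 429:
            # Respect the server-provided Retry-After value (in seconds).
            time.sleep(int(resp.headers.get("Retry-After", 1)))
            continue
        if resp.status_code == 503:
            time.sleep(backoff)
            backoff = min(backoff * 2, 60)  # cap the backoff at 60 seconds
            continue
        resp.raise_for_status()  # surface errors that should not be retried
        return resp.json()
    raise RuntimeError("request failed after retries")

# Example usage:
# print(create_with_retries()["choices"][0]["message"]["content"])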
Models & Pricing
All prices are in USD. The pricing fields returned by /v1/models are per token; the table below lists the equivalent rates per million tokens.
| Model | Context | Max Output | Input Price | Output Price |
|---|---|---|---|---|
| glm-4.7 (flagship reasoning model) | 128K | 32K | $0.40 / 1M tokens | $1.50 / 1M tokens |
| glm-4.7-flash (fast, cost-efficient variant) | 128K | 32K | $0.20 / 1M tokens | $0.80 / 1M tokens |
Pricing is subject to change. Both models support text input/output, tool calling, JSON mode, and streaming.