Overview

The EmberCloud API provides OpenAI-compatible endpoints for running inference on GLM models. You can use any existing OpenAI SDK or HTTP client — just change the base URL and API key.

We support streaming and non-streaming chat completions, as well as tool calling, JSON mode, and more.

Authentication

All API requests require a Bearer token in the Authorization header.

Header
Authorization: Bearer YOUR_API_KEY

Keep your API key secret. Do not expose it in client-side code or public repositories.

Base URL

All API endpoints are served from:

Base URL
https://api.embercloud.ai/v1

For example, chat completions are available at https://api.embercloud.ai/v1/chat/completions.
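
Because the API is OpenAI-compatible, an existing OpenAI client can simply be pointed at this base URL. Below is a minimal sketch using the official openai Python package; only the base_url and api_key differ from a standard OpenAI setup, and the API key is assumed to live in the EMBER_API_KEY environment variable.

import os

from openai import OpenAI

# Point an existing OpenAI client at EmberCloud by overriding base_url
# and supplying your EmberCloud API key.
client = OpenAI(
    base_url="https://api.embercloud.ai/v1",
    api_key=os.environ["EMBER_API_KEY"],
)

response = client.chat.completions.create(
    model="glm-4.7",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)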

Chat Completions

POST /v1/chat/completions

Create a chat completion. Supports streaming and non-streaming responses, tool calling, and structured outputs.

curl https://api.embercloud.ai/v1/chat/completions \
  -H "Authorization: Bearer $EMBER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-4.7",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello!"}
    ]
  }'

Request Body

messages (array, required): A list of messages comprising the conversation. Each message has a role ("system", "user", "assistant", or "tool") and content.
  role (string, required): The role of the message author: "system", "user", "assistant", "tool", or "developer".
  content (string | array, required): The content of the message. Can be a string or an array of content parts (text, image_url).
model (string, optional): The model to use. Defaults to "glm-4.7". Available: "glm-4.7", "glm-4.7-flash".
stream (boolean, optional): If true, returns a stream of Server-Sent Events. Default: false.
stream_options (object, optional): Options for streaming. Set { "include_usage": true } to receive usage stats in the final chunk.
temperature (number, optional): Sampling temperature between 0 and 2. Higher values make output more random. Default: 1.
top_p (number, optional): Nucleus sampling. The model considers tokens with top_p cumulative probability. Default: 1.
max_tokens (integer, optional): Maximum number of tokens to generate. Model maximum: 32,768.
seed (integer, optional): A seed for deterministic generation. Same seed + same input = same output.
stop (string | array, optional): Up to 4 sequences where the model will stop generating.
frequency_penalty (number, optional): Penalizes tokens based on their frequency in the text so far. Range: -2 to 2.
presence_penalty (number, optional): Penalizes tokens based on whether they appear in the text so far. Range: -2 to 2.
tools (array, optional): A list of tools (functions) the model may call. Each tool has a type, name, description, and parameters schema.
tool_choice (string | object, optional): Controls tool calling: "none", "auto", "required", or { "type": "function", "function": { "name": "..." } }.
response_format (object, optional): Set { "type": "json_object" } to force JSON output. The model will return valid JSON.
logprobs (boolean, optional): Whether to return log probabilities of output tokens.
top_logprobs (integer, optional): Number of most likely tokens to return at each position (0-20). Requires logprobs: true.
logit_bias (object, optional): Map of token IDs to bias values (-100 to 100). Use to increase or decrease likelihood of specific tokens.
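
As a sketch of the optional parameters, the request below asks for a tool call using the OpenAI-style tool schema described above, reusing the client configured earlier. The get_weather tool and its parameters are hypothetical and shown only for illustration.

# Hypothetical tool definition; the shape (type / function / parameters)
# follows the OpenAI-compatible format described above.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="glm-4.7",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",  # let the model decide whether to call the tool
    temperature=0.2,     # lower temperature for more deterministic output
    max_tokens=512,
)

choice = response.choices[0]
if choice.finish_reason == "tool_calls":
    for call in choice.message.tool_calls:
        print(call.function.name, call.function.arguments)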

Response

A non-streaming response returns a chat.completion object:

Response
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1707436800,
  "model": "glm-4.7",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 24,
    "completion_tokens": 9,
    "total_tokens": 33
  }
}

Response Fields

id (string, required): Unique identifier for the completion (format: "chatcmpl-...").
object (string, required): Always "chat.completion".
created (integer, required): Unix timestamp of when the completion was created.
model (string, required): The model that generated the completion.
choices (array, required): List of completion choices. Typically contains one choice.
  message (object, required): The assistant's response message with role and content.
  finish_reason (string, required): "stop" (natural end), "length" (max_tokens reached), "tool_calls" (tool call requested), or "content_filter".
usage (object, required): Token usage statistics: prompt_tokens, completion_tokens, total_tokens.
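
A minimal sketch of inspecting these fields with the Python client from the earlier examples; the attribute names mirror the response object above.

choice = response.choices[0]

if choice.finish_reason == "length":
    # Generation stopped because max_tokens was reached; the reply may be truncated.
    print("Response truncated at", response.usage.completion_tokens, "tokens")
elif choice.finish_reason == "stop":
    print(choice.message.content)

print("Total tokens used:", response.usage.total_tokens)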

Streaming

Set "stream": true to receive Server-Sent Events (SSE). Each event contains a chat.completion.chunk object with a delta of the response.

curl https://api.embercloud.ai/v1/chat/completions \
  -H "Authorization: Bearer $EMBER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-4.7",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

Stream Event Format

Each SSE event is prefixed with data: followed by a JSON chunk. The stream ends with data: [DONE].

Stream chunks
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1707436800,"model":"glm-4.7","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1707436800,"model":"glm-4.7","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1707436800,"model":"glm-4.7","choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1707436800,"model":"glm-4.7","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]

Chunk Fields

delta (object, required): Partial message content. Contains role on the first chunk, content on subsequent chunks.
delta.content (string | null, optional): The next piece of generated text.
delta.role (string | null, optional): Present only in the first chunk. Always "assistant".
delta.reasoning (string | null, optional): Reasoning content from GLM-4.7 reasoning models, if applicable.
delta.tool_calls (array | null, optional): Tool call deltas, if the model is invoking a function.
finish_reason (string | null, required): Null until the final content chunk, then "stop", "length", or "tool_calls".
usage (object | null, optional): Included in the final chunk when stream_options.include_usage is true.

Models

GET /v1/models

List all available models and their capabilities.

curl https://api.embercloud.ai/v1/models \
  -H "Authorization: Bearer $EMBER_API_KEY"

Response

Response
{
  "object": "list",
  "data": [
    {
      "id": "glm-4.7",
      "name": "GLM-4.7",
      "created": 1707436800,
      "context_length": 128000,
      "max_output_length": 32768,
      "input_modalities": ["text"],
      "output_modalities": ["text"],
      "pricing": {
        "prompt": "0.0000004",
        "completion": "0.0000015"
      }
    },
    {
      "id": "glm-4.7-flash",
      "name": "GLM-4.7 Flash",
      "created": 1707436800,
      "context_length": 128000,
      "max_output_length": 32768,
      "input_modalities": ["text"],
      "output_modalities": ["text"],
      "pricing": {
        "prompt": "0.0000002",
        "completion": "0.0000008"
      }
    }
  ]
}
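
A small sketch of reading this listing with Python and the requests library (an assumed dependency), picking out a few of the fields shown in the response above:

import os

import requests

# List available models and print the capability and pricing fields shown above.
resp = requests.get(
    "https://api.embercloud.ai/v1/models",
    headers={"Authorization": f"Bearer {os.environ['EMBER_API_KEY']}"},
)
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"], model["context_length"], model["pricing"]["prompt"])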

Streaming Guide

Streaming allows you to receive partial responses as they are generated, reducing perceived latency. The API uses Server-Sent Events (SSE) — a standard HTTP streaming protocol.

How it works

  1. Send a request with "stream": true
  2. The server responds with Content-Type: text/event-stream
  3. Each event is a JSON object prefixed with data:
  4. The delta field contains the incremental content
  5. The stream terminates with data: [DONE]
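
A minimal sketch of these steps without an SDK, assuming the requests library is available:

import json
import os

import requests

# 1. Send the request with "stream": true and read the response incrementally.
resp = requests.post(
    "https://api.embercloud.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['EMBER_API_KEY']}"},
    json={
        "model": "glm-4.7",
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": True,
    },
    stream=True,
)

# 3-5. Each event line starts with "data: "; the stream ends with "data: [DONE]".
for line in resp.iter_lines():
    if not line:
        continue  # skip blank lines between events
    payload = line.decode("utf-8").removeprefix("data: ")
    if payload == "[DONE]":
        break
    chunk = json.loads(payload)
    # 4. The delta field carries the incremental content.
    delta = chunk["choices"][0]["delta"]
    print(delta.get("content", ""), end="", flush=True)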

Usage statistics

To receive token usage in the final chunk, include "stream_options": { "include_usage": true } in your request.
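
With the OpenAI-compatible Python client from the earlier examples, a sketch of streaming with usage reporting might look like this (assuming the installed SDK version supports the stream_options parameter):

stream = client.chat.completions.create(
    model="glm-4.7",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in stream:
    # Guard against chunks without choices (e.g. a usage-only final chunk).
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
    if chunk.usage:
        print("\nTotal tokens:", chunk.usage.total_tokens)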

Error Handling

Errors are returned as JSON with an error object containing message, type, and code.

Error response
{
  "error": {
    "message": "Rate limit exceeded. Please retry after a brief wait.",
    "type": "rate_limit_error",
    "code": 429
  }
}

HTTP Status Codes

200 (Success): Request succeeded.
400 (invalid_request_error): Invalid request body, missing messages, or content filtered by safety system.
401 (authentication_error): Invalid or missing API key.
429 (rate_limit_error): Too many requests or concurrent connections. Check the Retry-After header.
502 (upstream_error): Failed to connect to upstream model provider.
503 (service_unavailable): Service temporarily unavailable. Daily quota or plan limit reached.

Retry Strategy

For 429 errors, respect the Retry-After header value (in seconds). For 503 errors, use exponential backoff starting from 5 seconds with a maximum of 60 seconds.
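
A sketch of that policy in Python, using the requests library (an assumed dependency); the backoff constants come from the guidance above.

import time

import requests

def post_with_retries(url, headers, body, max_attempts=5):
    """Retry 429s using Retry-After and 503s with exponential backoff (5s to 60s)."""
    backoff = 5  # initial backoff for 503s, per the guidance above
    for attempt in range(max_attempts):
        resp = requests.post(url, headers=headers, json=body)
        if resp.status_code == 429:
            # Respect the server-provided Retry-After value, in seconds.
            time.sleep(int(resp.headers.get("Retry-After", backoff)))
        elif resp.status_code == 503:
            time.sleep(backoff)
            backoff = min(backoff * 2, 60)  # cap exponential backoff at 60 seconds
        else:
            return resp
    return resp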

Models & Pricing

All prices are in USD. The per-token prices returned by /v1/models correspond to the per-million-token rates below.

glm-4.7 (Flagship reasoning model): 128K context, 32K max output, $0.40 / 1M input tokens, $1.50 / 1M output tokens.
glm-4.7-flash (Fast, cost-efficient variant): 128K context, 32K max output, $0.20 / 1M input tokens, $0.80 / 1M output tokens.

Pricing is subject to change. Both models support text input/output, tool calling, JSON mode, and streaming.
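
For example, a glm-4.7 request that consumes 10,000 prompt tokens and generates 1,000 completion tokens costs 10,000 × $0.0000004 + 1,000 × $0.0000015 = $0.0055. A small sketch of the same calculation, reusing the response object from the earlier Python examples:

# Per-token prices for glm-4.7 from the table above.
PROMPT_PRICE = 0.0000004      # USD per input token  ($0.40 / 1M)
COMPLETION_PRICE = 0.0000015  # USD per output token ($1.50 / 1M)

usage = response.usage
cost = usage.prompt_tokens * PROMPT_PRICE + usage.completion_tokens * COMPLETION_PRICE
print(f"Request cost: ${cost:.6f}")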