Overview

The EmberCloud API provides OpenAI-compatible endpoints for running inference on GLM models. You can use any existing OpenAI SDK or HTTP client — just change the base URL and API key.

We support streaming and non-streaming chat completions, as well as tool calling, JSON mode, and more.

Authentication

All API requests require a Bearer token in the Authorization header.

Header
Authorization: Bearer YOUR_API_KEY

Keep your API key secret. Do not expose it in client-side code or public repositories.

Base URL

All API endpoints are served from:

Base URL
https://api.embercloud.ai/v1

For example, chat completions are available at https://api.embercloud.ai/v1/chat/completions.
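
Because the API is OpenAI-compatible, an existing OpenAI client can simply be pointed at this base URL. Below is a minimal sketch using the official openai Python package; only the base_url and api_key differ from a standard OpenAI setup, and the API key is assumed to live in the EMBER_API_KEY environment variable.

import os

from openai import OpenAI

# Point an existing OpenAI client at EmberCloud by overriding base_url
# and supplying your EmberCloud API key.
client = OpenAI(
    base_url="https://api.embercloud.ai/v1",
    api_key=os.environ["EMBER_API_KEY"],
)

response = client.chat.completions.create(
    model="glm-4.7",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)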

Chat Completions

POST /v1/chat/completions

Create a chat completion. Supports streaming and non-streaming responses, tool calling, and structured outputs.

curl https://api.embercloud.ai/v1/chat/completions \
  -H "Authorization: Bearer $EMBER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-4.7",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Hello!"}
    ]
  }'

Request Body

messages (array, required): A list of messages comprising the conversation. Each message has a role ("system", "user", "assistant", or "tool") and content.
  role (string, required): The role of the message author: "system", "user", "assistant", "tool", or "developer".
  content (string | array, required): The content of the message. Can be a string or an array of content parts (text, image_url).
model (string, optional): The model to use. Defaults to "glm-4.7". Available: "glm-4.7", "glm-4.7-flash".
stream (boolean, optional): If true, returns a stream of Server-Sent Events. Default: false.
stream_options (object, optional): Options for streaming. Set { "include_usage": true } to receive usage stats in the final chunk.
temperature (number, optional): Sampling temperature between 0 and 2. Higher values make output more random. Default: 1.
top_p (number, optional): Nucleus sampling. The model considers tokens with top_p cumulative probability. Default: 1.
max_tokens (integer, optional): Maximum number of tokens to generate. Model maximum: 32,768.
seed (integer, optional): A seed for deterministic generation. Same seed + same input = same output.
stop (string | array, optional): Up to 4 sequences where the model will stop generating.
frequency_penalty (number, optional): Penalizes tokens based on their frequency in the text so far. Range: -2 to 2.
presence_penalty (number, optional): Penalizes tokens based on whether they appear in the text so far. Range: -2 to 2.
tools (array, optional): A list of tools (functions) the model may call. Each tool has a type, name, description, and parameters schema.
tool_choice (string | object, optional): Controls tool calling: "none", "auto", "required", or { "type": "function", "function": { "name": "..." } }.
response_format (object, optional): Set { "type": "json_object" } to force JSON output. The model will return valid JSON.
logprobs (boolean, optional): Whether to return log probabilities of output tokens.
top_logprobs (integer, optional): Number of most likely tokens to return at each position (0-20). Requires logprobs: true.
logit_bias (object, optional): Map of token IDs to bias values (-100 to 100). Use to increase or decrease likelihood of specific tokens.
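
As a sketch of the optional parameters, the request below asks for a tool call using the OpenAI-style tool schema described above, reusing the client configured earlier. The get_weather tool and its parameters are hypothetical and shown only for illustration.

# Hypothetical tool definition; the shape (type / function / parameters)
# follows the OpenAI-compatible format described above.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="glm-4.7",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",  # let the model decide whether to call the tool
    temperature=0.2,     # lower temperature for more deterministic output
    max_tokens=512,
)

choice = response.choices[0]
if choice.finish_reason == "tool_calls":
    for call in choice.message.tool_calls:
        print(call.function.name, call.function.arguments)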

Response

A non-streaming response returns a chat.completion object:

Response
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1707436800,
  "model": "glm-4.7",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 24,
    "completion_tokens": 9,
    "total_tokens": 33
  }
}

Response Fields

id (string, required): Unique identifier for the completion (format: "chatcmpl-...").
object (string, required): Always "chat.completion".
created (integer, required): Unix timestamp of when the completion was created.
model (string, required): The model that generated the completion.
choices (array, required): List of completion choices. Typically contains one choice.
  message (object, required): The assistant's response message with role and content.
  finish_reason (string, required): "stop" (natural end), "length" (max_tokens reached), "tool_calls" (tool call requested), or "content_filter".
usage (object, required): Token usage statistics: prompt_tokens, completion_tokens, total_tokens.
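
A minimal sketch of inspecting these fields with the Python client from the earlier examples; the attribute names mirror the response object above.

choice = response.choices[0]

if choice.finish_reason == "length":
    # Generation stopped because max_tokens was reached; the reply may be truncated.
    print("Response truncated at", response.usage.completion_tokens, "tokens")
elif choice.finish_reason == "stop":
    print(choice.message.content)

print("Total tokens used:", response.usage.total_tokens)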

Streaming

Set "stream": true to receive Server-Sent Events (SSE). Each event contains a chat.completion.chunk object with a delta of the response.

curl https://api.embercloud.ai/v1/chat/completions \
  -H "Authorization: Bearer $EMBER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-4.7",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

Stream Event Format

Each SSE event is prefixed with data: followed by a JSON chunk. The stream ends with data: [DONE].

Stream chunks
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1707436800,"model":"glm-4.7","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1707436800,"model":"glm-4.7","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1707436800,"model":"glm-4.7","choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1707436800,"model":"glm-4.7","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]

Chunk Fields

delta (object, required): Partial message content. Contains role on the first chunk, content on subsequent chunks.
delta.content (string | null, optional): The next piece of generated text.
delta.role (string | null, optional): Present only in the first chunk. Always "assistant".
delta.reasoning (string | null, optional): Reasoning content from GLM-4.7 reasoning models, if applicable.
delta.tool_calls (array | null, optional): Tool call deltas, if the model is invoking a function.
finish_reason (string | null, required): Null until the final content chunk, then "stop", "length", or "tool_calls".
usage (object | null, optional): Included in the final chunk when stream_options.include_usage is true.

Models

GET /v1/models

List all available models and their capabilities.

curl https://api.embercloud.ai/v1/models \
  -H "Authorization: Bearer $EMBER_API_KEY"

Response

Response
{
  "object": "list",
  "data": [
    {
      "id": "glm-4.7",
      "name": "GLM-4.7",
      "created": 1707436800,
      "context_length": 128000,
      "max_output_length": 32768,
      "input_modalities": ["text"],
      "output_modalities": ["text"],
      "pricing": {
        "prompt": "0.0000004",
        "completion": "0.0000015"
      }
    },
    {
      "id": "glm-4.7-flash",
      "name": "GLM-4.7 Flash",
      "created": 1707436800,
      "context_length": 128000,
      "max_output_length": 32768,
      "input_modalities": ["text"],
      "output_modalities": ["text"],
      "pricing": {
        "prompt": "0.0000002",
        "completion": "0.0000008"
      }
    }
  ]
}
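
A small sketch of reading this listing with Python and the requests library (an assumed dependency), picking out a few of the fields shown in the response above:

import os

import requests

# List available models and print the capability and pricing fields shown above.
resp = requests.get(
    "https://api.embercloud.ai/v1/models",
    headers={"Authorization": f"Bearer {os.environ['EMBER_API_KEY']}"},
)
resp.raise_for_status()
for model in resp.json()["data"]:
    print(model["id"], model["context_length"], model["pricing"]["prompt"])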

Streaming Guide

Streaming allows you to receive partial responses as they are generated, reducing perceived latency. The API uses Server-Sent Events (SSE) — a standard HTTP streaming protocol.

How it works

  1. Send a request with "stream": true
  2. The server responds with Content-Type: text/event-stream
  3. Each event is a JSON object prefixed with data:
  4. The delta field contains the incremental content
  5. The stream terminates with data: [DONE]
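
A minimal sketch of these steps without an SDK, assuming the requests library is available:

import json
import os

import requests

# 1. Send the request with "stream": true and read the response incrementally.
resp = requests.post(
    "https://api.embercloud.ai/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['EMBER_API_KEY']}"},
    json={
        "model": "glm-4.7",
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": True,
    },
    stream=True,
)

# 3-5. Each event line starts with "data: "; the stream ends with "data: [DONE]".
for line in resp.iter_lines():
    if not line:
        continue  # skip blank lines between events
    payload = line.decode("utf-8").removeprefix("data: ")
    if payload == "[DONE]":
        break
    chunk = json.loads(payload)
    # 4. The delta field carries the incremental content.
    delta = chunk["choices"][0]["delta"]
    print(delta.get("content", ""), end="", flush=True)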

Usage statistics

To receive token usage in the final chunk, include "stream_options": { "include_usage": true } in your request.
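
With the OpenAI-compatible Python client from the earlier examples, a sketch of streaming with usage reporting might look like this (assuming the installed SDK version supports the stream_options parameter):

stream = client.chat.completions.create(
    model="glm-4.7",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in stream:
    # Guard against chunks without choices (e.g. a usage-only final chunk).
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
    if chunk.usage:
        print("\nTotal tokens:", chunk.usage.total_tokens)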

Error Handling

Errors are returned as JSON with an error object containing message, type, and code.

Error response
{
  "error": {
    "message": "Rate limit exceeded. Please retry after a brief wait.",
    "type": "rate_limit_error",
    "code": 429
  }
}

HTTP Status Codes

200 (Success): Request succeeded.
400 (invalid_request_error): Invalid request body, missing messages, or content filtered by safety system.
401 (authentication_error): Invalid or missing API key.
429 (rate_limit_error): Too many requests or concurrent connections. Check the Retry-After header.
502 (upstream_error): Failed to connect to upstream model provider.
503 (service_unavailable): Service temporarily unavailable. Daily quota or plan limit reached.

Retry Strategy

For 429 errors, respect the Retry-After header value (in seconds). For 503 errors, use exponential backoff starting from 5 seconds with a maximum of 60 seconds.
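
A sketch of that policy in Python, using the requests library (an assumed dependency); the backoff constants come from the guidance above.

import time

import requests

def post_with_retries(url, headers, body, max_attempts=5):
    """Retry 429s using Retry-After and 503s with exponential backoff (5s to 60s)."""
    backoff = 5  # initial backoff for 503s, per the guidance above
    for attempt in range(max_attempts):
        resp = requests.post(url, headers=headers, json=body)
        if resp.status_code == 429:
            # Respect the server-provided Retry-After value, in seconds.
            time.sleep(int(resp.headers.get("Retry-After", backoff)))
        elif resp.status_code == 503:
            time.sleep(backoff)
            backoff = min(backoff * 2, 60)  # cap exponential backoff at 60 seconds
        else:
            return resp
    return resp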

Models & Pricing

All prices are in USD. The per-token prices returned by /v1/models correspond to the per-million-token rates below.

glm-4.7 (Flagship reasoning model): 128K context, 32K max output, $0.40 / 1M input tokens, $1.50 / 1M output tokens.
glm-4.7-flash (Fast, cost-efficient variant): 128K context, 32K max output, $0.20 / 1M input tokens, $0.80 / 1M output tokens.

Pricing is subject to change. Both models support text input/output, tool calling, JSON mode, and streaming.
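
For example, a glm-4.7 request that consumes 10,000 prompt tokens and generates 1,000 completion tokens costs 10,000 × $0.0000004 + 1,000 × $0.0000015 = $0.0055. A small sketch of the same calculation, reusing the response object from the earlier Python examples:

# Per-token prices for glm-4.7 from the table above.
PROMPT_PRICE = 0.0000004      # USD per input token  ($0.40 / 1M)
COMPLETION_PRICE = 0.0000015  # USD per output token ($1.50 / 1M)

usage = response.usage
cost = usage.prompt_tokens * PROMPT_PRICE + usage.completion_tokens * COMPLETION_PRICE
print(f"Request cost: ${cost:.6f}")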