inference
The shared contract every text-generation backend implements. `TextModel`, `Backend`, `Token`, and `Message` are stdlib-only types that compile on every platform regardless of GPU availability; GPU-specific runtimes (Metal, ROCm, CUDA, …) register themselves at import time. Built on `core/go`: `LoadModel` returns a `Result`, with zero external dependencies.
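Because `LoadModel` hands back a `Result` rather than a `(value, error)` pair, callers unwrap it by checking `OK` and type-asserting `Value`, as the quick start does. The sketch below is a self-contained stand-in for that shape. It assumes only the two fields the quick start reads (`OK` and `Value`); the real `core/go` `Result` may carry more. The local `Result`, `TextModel`, `stubModel`, `loadModel`, and `asTextModel` names are hypothetical. It also shows the comma-ok form of the type assertion, which turns an unexpected payload into an error instead of a panic.

```go
package main

import (
	"errors"
	"fmt"
)

// Result is a hypothetical stand-in for the core/go Result type: OK reports
// success, and Value holds either the payload or an error.
type Result struct {
	Value any
	OK    bool
}

// TextModel is a cut-down, hypothetical version of the real contract.
type TextModel interface {
	Name() string
	Close() error
}

type stubModel struct{ path string }

func (m *stubModel) Name() string { return "stub:" + m.path }
func (m *stubModel) Close() error { return nil }

// loadModel mimics a Result-returning loader (hypothetical).
func loadModel(path string) Result {
	if path == "" {
		return Result{Value: errors.New("empty model path")}
	}
	return Result{Value: &stubModel{path: path}, OK: true}
}

// asTextModel unwraps a Result with a comma-ok type assertion, so an
// unexpected payload type becomes an error instead of a panic.
func asTextModel(r Result) (TextModel, error) {
	if !r.OK {
		if err, ok := r.Value.(error); ok {
			return nil, err
		}
		return nil, errors.New("load failed")
	}
	m, ok := r.Value.(TextModel)
	if !ok {
		return nil, fmt.Errorf("unexpected value type %T", r.Value)
	}
	return m, nil
}

func main() {
	m, err := asTextModel(loadModel("/models/demo"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer m.Close()
	fmt.Println(m.Name()) // prints "stub:/models/demo"
}
```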
Install
```shell
go get dappco.re/go/inference@latest
```
Import
```go
import (
	"dappco.re/go/inference"

	// Pick one or more: each blank import registers a backend.
	_ "forge.lthn.ai/core/go-mlx" // "metal" backend on darwin/arm64
	// _ "forge.lthn.ai/core/go-rocm" // "rocm" backend on linux/amd64
	// _ "forge.lthn.ai/core/go-cuda" // "cuda" backend on linux/amd64
)
```
The base package compiles without any backend; calls fail cleanly with `no backend registered` instead of the build failing. This is what lets the same binary serve both a laptop without a GPU and a Mac Studio, without maintaining two separate GOOS builds.
Quick start
```go
r := inference.LoadModel("/path/to/safetensors/model/")
if !r.OK {
	return r
}
model := r.Value.(inference.TextModel)
defer model.Close()

for tok := range model.Generate(ctx, "Hello", inference.WithMaxTokens(256)) {
	fmt.Print(tok.Text)
}
```
Backend selection is automatic: the registry picks Metal on Apple Silicon, ROCm on Linux with AMD GPUs, and CUDA on Linux with NVIDIA GPUs, in that order of preference. You can also pin a backend explicitly:
```go
r := inference.LoadModel(path, inference.WithBackend("metal"))
```
`inference.List()` returns the names of the backends that registered at import time, which is useful for runtime configuration and diagnostics.
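Automatic selection with an optional pin can be modeled as a walk over a fixed preference list. The sketch below assumes the registry only needs to know which backend names are available on the current machine; `preference`, `selectBackend`, and the `available` set are hypothetical, and the real registry may probe hardware rather than take a map.

```go
package main

import (
	"errors"
	"fmt"
)

// preference mirrors the documented order: Metal, then ROCm, then CUDA.
var preference = []string{"metal", "rocm", "cuda"}

// selectBackend (hypothetical) honours an explicit pin first; otherwise it
// returns the first preferred backend that is available on this machine.
func selectBackend(pinned string, available map[string]bool) (string, error) {
	if pinned != "" {
		if !available[pinned] {
			return "", fmt.Errorf("pinned backend %q not available", pinned)
		}
		return pinned, nil
	}
	for _, name := range preference {
		if available[name] {
			return name, nil
		}
	}
	return "", errors.New("no backend registered")
}

func main() {
	avail := map[string]bool{"rocm": true, "cuda": true}
	name, _ := selectBackend("", avail)
	fmt.Println(name) // prints "rocm"
	name, _ = selectBackend("cuda", avail)
	fmt.Println(name) // prints "cuda"
}
```

Failing loudly on an unavailable pin, rather than silently falling back, keeps an explicit `WithBackend` choice meaningful.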
Generate options
Every generation call takes a variadic `GenerateOption`:
| Option | Effect |
|---|---|
| `WithMaxTokens(n)` | Cap output length |
| `WithTemperature(t)` | Sampling temperature (0 = greedy) |
| `WithTopK(k)` | Restrict sampling to the top K logits |
| `WithTopP(p)` | Nucleus sampling threshold |
| `WithStopTokens(ids...)` | Halt on any of these token IDs |
| `WithRepeatPenalty(p)` | Penalise repeated tokens |
| `WithLogits()` | Emit raw logits per token (for analysis) |
Load options
`LoadOption` configures the model at load time: settings that the backend can’t change without reloading the model.
| Option | Effect |
|---|---|
| `WithBackend(name)` | Pin a specific backend instead of auto-selecting |
| `WithContextLen(n)` | Override the model’s default context length |
| `WithGPULayers(n)` | How many transformer layers live on the GPU rather than the CPU |
| `WithParallelSlots(n)` | Number of concurrent generation slots |
| `WithAdapterPath(path)` | Layer a LoRA adapter over the base model |
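Options like these are usually implemented with Go's functional-options pattern: each `With…` call returns a closure that edits a config struct, and the loader applies them over defaults before validating once. The sketch below assumes that pattern; `loadConfig`, `resolve`, the local `LoadOption` type and `With…` constructors, the default values, and the `-1` "all layers on GPU" sentinel are hypothetical stand-ins named after the table, not the package's real types.

```go
package main

import (
	"errors"
	"fmt"
)

// loadConfig is a hypothetical load-time configuration; changing any field
// after loading would mean reloading the model.
type loadConfig struct {
	Backend       string // "" means auto-select
	ContextLen    int    // 0 means use the model's default
	GPULayers     int    // -1 is a hypothetical sentinel for "all layers"
	ParallelSlots int
	AdapterPath   string
}

// LoadOption is a hypothetical stand-in that mutates a loadConfig.
type LoadOption func(*loadConfig)

func WithBackend(name string) LoadOption  { return func(c *loadConfig) { c.Backend = name } }
func WithContextLen(n int) LoadOption     { return func(c *loadConfig) { c.ContextLen = n } }
func WithGPULayers(n int) LoadOption      { return func(c *loadConfig) { c.GPULayers = n } }
func WithParallelSlots(n int) LoadOption  { return func(c *loadConfig) { c.ParallelSlots = n } }
func WithAdapterPath(p string) LoadOption { return func(c *loadConfig) { c.AdapterPath = p } }

// resolve applies the options over defaults, then validates the result once.
func resolve(opts ...LoadOption) (loadConfig, error) {
	cfg := loadConfig{GPULayers: -1, ParallelSlots: 1}
	for _, opt := range opts {
		opt(&cfg)
	}
	if cfg.ContextLen < 0 {
		return cfg, errors.New("context length must be >= 0")
	}
	if cfg.ParallelSlots < 1 {
		return cfg, errors.New("need at least one parallel slot")
	}
	return cfg, nil
}

func main() {
	cfg, err := resolve(WithBackend("metal"), WithContextLen(8192), WithParallelSlots(4))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

Validating after all options are applied, rather than inside each closure, lets later options override earlier ones without tripping intermediate checks.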
Trainable models
For fine-tuning workflows, `LoadTrainable` returns a model that exposes gradients alongside generation:
```go
r := inference.LoadTrainable(path, inference.WithAdapterPath("lora/"))
if !r.OK {
	return r
}
trainable := r.Value.(inference.TrainableModel)

// Same generation surface
for tok := range trainable.Generate(ctx, "prompt") {
	// ...
}

// Plus the training surface
trainable.Backward(loss)
trainable.Step()
```
Backend support for the trainable path is opt-in; backends that ship inference-only return `not supported`.
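The split between `Backward(loss)` and `Step()` matches the usual gradient-descent loop: backward passes accumulate gradients, and a step applies them and clears the accumulator. The toy sketch below assumes that is roughly what the two calls mean. It uses a single scalar weight, a squared-error loss, and plain SGD; `toyTrainable`, `Loss`, `ComputeLoss`, `train`, and the learning rate are hypothetical, and real backends operate on adapter tensors rather than one number.

```go
package main

import "fmt"

// toyTrainable is a hypothetical one-weight model y = w*x.
type toyTrainable struct {
	w, grad, lr float64
}

// Loss is a hypothetical handle that remembers what Backward needs,
// standing in for an autograd graph.
type Loss struct {
	Value    float64
	residual float64 // prediction minus target
	x        float64
}

func (m *toyTrainable) Predict(x float64) float64 { return m.w * x }

// ComputeLoss returns the squared error for one example.
func (m *toyTrainable) ComputeLoss(x, y float64) Loss {
	r := m.Predict(x) - y
	return Loss{Value: r * r, residual: r, x: x}
}

// Backward accumulates dLoss/dw = 2 * (w*x - y) * x.
func (m *toyTrainable) Backward(l Loss) { m.grad += 2 * l.residual * l.x }

// Step applies one SGD update and zeroes the accumulated gradient.
func (m *toyTrainable) Step() {
	m.w -= m.lr * m.grad
	m.grad = 0
}

// train runs one Backward per example and one Step per epoch.
func train(m *toyTrainable, xs, ys []float64, epochs int) {
	for e := 0; e < epochs; e++ {
		for i := range xs {
			m.Backward(m.ComputeLoss(xs[i], ys[i]))
		}
		m.Step()
	}
}

func main() {
	m := &toyTrainable{lr: 0.01}
	train(m, []float64{1, 2, 3}, []float64{3, 6, 9}, 200)
	fmt.Printf("w = %.4f\n", m.w) // prints "w = 3.0000"
}
```

Keeping the two calls separate is what allows gradient accumulation: several `Backward` calls can feed one `Step`, as the per-epoch loop above does.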
Service registration
The canonical `core/go` Service shape lets a `Core` instance host an inference runtime that every consumer reaches through actions:
```go
c := core.New(core.Options{})
if r := inference.RegisterCore(c); !r.OK {
	return r
}

// Or with options
svc := inference.NewService(inference.Options{
	DefaultBackend: "metal",
	ModelPath:      "/srv/models/gemma-4-e2b",
})
if r := svc(c); !r.OK {
	return r
}
```
Probe bus
`NewProbeBus(sinks...)` wires up token-by-token introspection, useful for metrics, eval harnesses, or live UIs that show what the model is “thinking”:
```go
bus := inference.NewProbeBus(
	inference.NewTelemetrySink(c),
	inference.NewLogitSink(eval),
)

r := inference.LoadModel(path, inference.WithProbeBus(bus))
```
Each `ProbeSink` receives every token and, optionally, the logits behind it. The base package ships the bus and the interface; concrete sinks live in consumer code.
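A probe bus is essentially a fan-out: every token event is delivered to each registered sink in turn. The sketch below assumes a synchronous `Observe` method on the sink interface; `ProbeEvent`, `Observe`, `Publish`, `countingSink`, `transcriptSink`, and this local `NewProbeBus` are hypothetical and do not reflect the package's actual signatures.

```go
package main

import "fmt"

// ProbeEvent is a hypothetical per-token event; Logits may be nil when
// logits were not requested.
type ProbeEvent struct {
	Index  int
	Text   string
	Logits []float32
}

// ProbeSink is a hypothetical sink interface.
type ProbeSink interface {
	Observe(ev ProbeEvent)
}

// ProbeBus fans each event out to every sink, in registration order.
type ProbeBus struct {
	sinks []ProbeSink
}

func NewProbeBus(sinks ...ProbeSink) *ProbeBus {
	return &ProbeBus{sinks: sinks}
}

// Publish delivers one event to every sink synchronously.
func (b *ProbeBus) Publish(ev ProbeEvent) {
	for _, s := range b.sinks {
		s.Observe(ev)
	}
}

// countingSink counts tokens, like a minimal telemetry sink.
type countingSink struct{ tokens int }

func (s *countingSink) Observe(ProbeEvent) { s.tokens++ }

// transcriptSink rebuilds the generated text, as a live UI might.
type transcriptSink struct{ text string }

func (s *transcriptSink) Observe(ev ProbeEvent) { s.text += ev.Text }

func main() {
	counter := &countingSink{}
	transcript := &transcriptSink{}
	bus := NewProbeBus(counter, transcript)

	// Stand-in for a generation loop publishing each token.
	for i, tok := range []string{"Hel", "lo", "!"} {
		bus.Publish(ProbeEvent{Index: i, Text: tok})
	}
	fmt.Println(counter.tokens, transcript.text) // prints "3 Hello!"
}
```

Synchronous delivery keeps event ordering simple, but a slow sink then stalls generation; a production bus might buffer per sink instead.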
Capabilities
Each registered backend declares a `Capability` list that the registry exposes for discovery. A consumer can pick a backend based on what it supports, such as training, batched inference, logits, or KV-cache snapshots:
```go
for _, name := range inference.List() {
	caps := inference.BackendCaps(name)
	for _, capability := range caps {
		fmt.Printf("%s: %s (%s)\n", name, capability.ID, capability.Status)
	}
}
```
Sibling packages
- `go/ai`: higher-level multi-provider orchestrator built on inference plus remote API clients
- `go/mlx`: Apple Silicon Metal backend that registers as `"metal"`
- `go/rag`: retrieval-augmented pipeline that consumes inference for the generation stage
Source
github.com/dappcore/go-inference: full source, contract tests, and the capability registry.