dendrux
v0.2.0a1 · alpha

The pluggable side-channel hook for observing a run as it happens, with fail-open semantics so a broken notifier never kills a run.

Notifier

A notifier is an object you pass to agent.run(...) that gets called at every point where the loop mutates conversation history, completes an LLM call, finishes a tool, or emits a governance event. It is the official extension point for terminal output, metrics, logs, Slack/webhook fanout, real-time dashboards, and anything else that wants to watch a run without owning its durability.

Two rules apply everywhere: a notifier never blocks the run, and a notifier that raises never kills the run. Both are enforced by the framework, not by convention.

Where notifier fits next to recorder

Dendrux has two observers on the same set of hooks: the recorder and the notifier. They look similar but they do different jobs and have opposite failure policies.

| Property | LoopRecorder | LoopNotifier |
|---|---|---|
| Purpose | Durable audit trail in the DB | Side-channel for humans and external systems |
| Pluggable | No, framework-owned (PersistenceRecorder) | Yes, any implementation |
| Where you pass it | Not a user parameter | agent.run(..., notifier=...), per call |
| Failure policy | Fail-closed for authoritative writes | Fail-open, exceptions swallowed |
| What readers use it for | Replay, SSE, audit queries | Live output, metrics, alerts |

See Recorder for the other side of this pair.

The hook surface

LoopNotifier and LoopRecorder share the same ten methods. The runner fires the run-level hooks; the loop fires everything else. Every hook takes run_id as its first positional argument so a single notifier instance can disambiguate concurrent runs without contextvars or implicit state.

| Hook | Fires when | Emitted by |
|---|---|---|
| on_run_started | Once at the top of a fresh run or resume | Runner |
| on_run_finished | Run reaches a non-error terminal (success, paused, cancelled, max_iterations) | Runner |
| on_run_failed | Run errors out via an unhandled exception | Runner |
| on_message_appended | A message is appended to history (user, assistant, tool result) | Loop |
| on_llm_call_started | Immediately before provider.complete*() | Loop |
| on_llm_call_completed | After provider.complete*() returns successfully | Loop |
| on_llm_call_failed | When provider.complete*() raises | Loop |
| on_tool_started | Immediately before tool dispatch | Loop |
| on_tool_completed | After tool execution finishes (success or failure) | Loop |
| on_governance_event | Budget threshold, guardrail finding, approval request, etc. | Loop |

The pairing is deliberate: every "started" hook has a matching success hook ("completed", or "finished" for runs) and, for runs and LLM calls, a matching "failed" hook. Success and failure are mutually exclusive, so exactly one closing hook fires per started call. Tool failures arrive on on_tool_completed as ToolResult(success=False) because the loop already converts every dispatch error into a result; there is no separate on_tool_failed.
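
Because run_id arrives first on every hook, per-run state can live in a plain dictionary keyed by it. The following is a minimal sketch rather than dendrux code: MessageCounter and fake_run are hypothetical names, the messages are plain strings, and the driver calls the hook directly in place of a real loop.

```python
import asyncio
from collections import defaultdict


class MessageCounter:
    """Hypothetical notifier: counts appended messages per run."""

    def __init__(self):
        # run_id -> number of messages seen for that run
        self.counts = defaultdict(int)

    async def on_message_appended(self, run_id, message, iteration):
        self.counts[run_id] += 1


async def fake_run(notifier, run_id, n_messages):
    """Stand-in for the loop: fires the hook n_messages times."""
    for i in range(n_messages):
        await notifier.on_message_appended(run_id, f"msg-{i}", i)
        await asyncio.sleep(0)  # yield so the two runs interleave


async def main():
    counter = MessageCounter()
    # Two concurrent "runs" share one notifier instance.
    await asyncio.gather(
        fake_run(counter, "run-a", 3),
        fake_run(counter, "run-b", 5),
    )
    return dict(counter.counts)


counts = asyncio.run(main())
print(counts)
```

The two runs interleave on the event loop, yet their counts stay separate because nothing is stored outside the run_id key.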

The Protocol lives in dendrux/loops/base.py:

from typing import Protocol, runtime_checkable

@runtime_checkable
class LoopNotifier(Protocol):
    async def on_run_started(self, run_id, *, agent_name=None, agent_model=None): ...
    async def on_run_finished(self, run_id, result): ...
    async def on_run_failed(self, run_id, error, *, iteration=None): ...
 
    async def on_message_appended(self, run_id, message, iteration): ...
 
    async def on_llm_call_started(self, run_id, iteration, *, semantic_messages=None, semantic_tools=None): ...
    async def on_llm_call_completed(self, run_id, response, iteration, *, ...): ...
    async def on_llm_call_failed(self, run_id, iteration, error, *, duration_ms=None): ...
 
    async def on_tool_started(self, run_id, tool_call, iteration): ...
    async def on_tool_completed(self, run_id, tool_call, tool_result, iteration): ...
 
    async def on_governance_event(self, run_id, event_type, iteration, data, *, correlation_id=None): ...

@runtime_checkable means any class that defines all ten methods passes isinstance(obj, LoopNotifier) without inheriting from it. The check only confirms that the methods exist; it does not validate their signatures. For everyday work, subclass BaseNotifier instead; see the next section.
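
The structural check is standard typing behaviour and can be seen with a small stand-in. In this sketch TwoHookNotifier is a hypothetical two-method Protocol used in place of the real ten-method LoopNotifier, and the three classes are illustrative only.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class TwoHookNotifier(Protocol):
    """Hypothetical two-method stand-in for the ten-method LoopNotifier."""

    async def on_run_started(self, run_id, *, agent_name=None, agent_model=None): ...
    async def on_run_finished(self, run_id, result): ...


class Complete:
    # No inheritance: both methods are present, so the check passes.
    async def on_run_started(self, run_id, *, agent_name=None, agent_model=None):
        pass

    async def on_run_finished(self, run_id, result):
        pass


class Partial:
    # Missing on_run_finished, so the check fails.
    async def on_run_started(self, run_id, *, agent_name=None, agent_model=None):
        pass


class WrongSignature:
    # Both names exist, so the check passes even though the signatures differ.
    def on_run_started(self):
        pass

    def on_run_finished(self):
        pass


is_complete = isinstance(Complete(), TwoHookNotifier)
is_partial = isinstance(Partial(), TwoHookNotifier)
is_wrong_sig = isinstance(WrongSignature(), TwoHookNotifier)
print(is_complete, is_partial, is_wrong_sig)
```

The WrongSignature case is the reason an isinstance check alone is a weak guarantee: a class can pass it and still fail when a hook is actually called.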

BaseNotifier: the easy way to subclass

Implementing ten methods on every notifier is tedious, and it makes future protocol additions a breaking change for every existing implementation. dendrux.loops.base.BaseNotifier is a concrete class that ships no-op defaults for every hook. Subclass it and override only what you care about:

from dendrux.loops.base import BaseNotifier
 
class CaptureNotifier(BaseNotifier):
    """Minimal notifier — records every callback that fires."""
 
    def __init__(self):
        self.log = []
 
    async def on_run_started(self, run_id, *, agent_name=None, agent_model=None):
        self.log.append(f"on_run_started         run={run_id} agent={agent_name!r}")
 
    async def on_message_appended(self, run_id, message, iteration):
        self.log.append(f"on_message_appended    iter={iteration} role={message.role.value}")
 
    async def on_llm_call_completed(self, run_id, response, iteration, **kwargs):
        u = response.usage
        self.log.append(f"on_llm_call_completed  iter={iteration} input={u.input_tokens}")
 
    async def on_tool_completed(self, run_id, tool_call, tool_result, iteration):
        self.log.append(f"on_tool_completed      iter={iteration} tool={tool_call.name!r}")
 
    async def on_run_finished(self, run_id, result):
        self.log.append(f"on_run_finished        status={result.status.value}")

Five overrides, ten hooks supported. The other five (on_run_failed, on_llm_call_started, on_llm_call_failed, on_tool_started, on_governance_event) inherit no-op defaults from BaseNotifier — they fire, your subclass ignores them. When the protocol grows in a future release, your code keeps working without changes.
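
The no-op-default pattern is easy to see in miniature. This is a sketch under stated assumptions: MiniBaseNotifier is a hypothetical three-hook stand-in, not the real dendrux.loops.base.BaseNotifier, and the tool call is a plain string rather than a ToolCall object.

```python
import asyncio


class MiniBaseNotifier:
    """Hypothetical stand-in: every hook is a no-op by default."""

    async def on_run_started(self, run_id, *, agent_name=None, agent_model=None):
        pass

    async def on_tool_started(self, run_id, tool_call, iteration):
        pass

    async def on_tool_completed(self, run_id, tool_call, tool_result, iteration):
        pass


class ToolLog(MiniBaseNotifier):
    """Overrides one hook; the other two fall through to the no-ops."""

    def __init__(self):
        self.completed = []

    async def on_tool_completed(self, run_id, tool_call, tool_result, iteration):
        self.completed.append((run_id, tool_call))


async def main():
    n = ToolLog()
    await n.on_run_started("run-1", agent_name="calculator")  # inherited no-op
    await n.on_tool_started("run-1", "add", 1)                 # inherited no-op
    await n.on_tool_completed("run-1", "add", "42", 1)         # overridden
    return n.completed


completed = asyncio.run(main())
print(completed)
```

If the base class later gains a fourth hook with a no-op default, ToolLog needs no change, which is the forward-compatibility argument made above.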

Plugging one in

Unlike the recorder, the notifier is a per-call argument, not a constructor keyword. You pass it each time you call agent.run() or a resume method:

from dendrux import Agent
from dendrux.llm.anthropic import AnthropicProvider
from dendrux.notifiers import ConsoleNotifier
 
async with Agent(
    provider=AnthropicProvider(model="claude-haiku-4-5"),
    prompt="You are a calculator. Use the add tool.",
    tools=[add],
    database_url="sqlite+aiosqlite:///demo.db",
) as agent:
    await agent.run("What is 15 + 27?", notifier=ConsoleNotifier())

Per-call makes sense: a batch run wants metrics, a dev-loop invocation wants terminal output, and a production webhook wants Slack fanout. The same Agent can serve all three with different notifiers on different calls.

All submit_* and resume methods accept the same notifier= argument. If you pass a notifier to run() and a different one to submit_approval(), both are applied on their respective turns; nothing is carried over between calls.

What a real run looks like

Running a two-iteration ReAct query ("What is 15 + 27?" with an add tool) with a CaptureNotifier instance named capture, then printing capture.log, gives:

on_run_started         run=01HXX... agent='calculator'
on_message_appended    iter=0 role=user
on_llm_call_completed  iter=1 input=585
on_message_appended    iter=1 role=assistant
on_tool_completed      iter=1 tool='add'
on_message_appended    iter=1 role=tool
on_llm_call_completed  iter=2 input=667
on_message_appended    iter=2 role=assistant
on_run_finished        status=success

Nine recorded callbacks for a two-iteration run. on_run_started opens it, on_run_finished closes it, and the body is interleaved messages, LLM completions, and tool completions in the order the loop executed them. on_llm_call_started and on_tool_started also fired, but CaptureNotifier does not override them, so they leave no line in the log. Everything a human dashboard or an OpenTelemetry tracer needs to render a live transcript is here.

Why both started and completed

The lifecycle pairing was added so external observers can do correct work even when calls fail.

A pure completion-only surface looks tidy but breaks down under three real conditions:

  1. Provider exceptions never reach completion. provider.complete() can raise (network blip, rate limit, malformed schema). With only on_llm_call_completed, the notifier never hears about that call. With on_llm_call_started and on_llm_call_failed, the notifier opens a span on start, marks it errored on failure, and the trace tells the truth.
  2. Span timing is wrong without start. OpenTelemetry and similar tracers want a real start time, not "completion minus duration." The lifecycle pair gives them one.
  3. Some observers care about in-flight state. A dashboard that highlights "LLM call in progress" needs to hear that the call started before it hears that it ended.

The same logic applies to runs (on_run_started / on_run_finished / on_run_failed) and to tools (on_tool_started / on_tool_completed). Tool failures take a slightly different shape — the loop catches them and emits on_tool_completed with ToolResult(success=False) — because the loop already wraps every dispatch in error handling and there is nothing the notifier could do that the loop has not already done. LLM failures, by contrast, propagate, so the notifier needs an explicit failed hook.
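
A span-style observer built on the LLM lifecycle hooks might look like the sketch below. LLMSpanTracker is a hypothetical name, and keying open calls by (run_id, iteration) assumes at most one LLM call per iteration, which the text above does not guarantee. The driver calls the hooks directly in place of a real loop.

```python
import asyncio
import time


class LLMSpanTracker:
    """Hypothetical notifier: tracks open LLM calls keyed by (run_id, iteration)."""

    def __init__(self):
        self.open = {}    # (run_id, iteration) -> monotonic start time
        self.closed = []  # (run_id, iteration, status, elapsed_seconds)

    async def on_llm_call_started(self, run_id, iteration, *,
                                  semantic_messages=None, semantic_tools=None):
        self.open[(run_id, iteration)] = time.monotonic()

    async def on_llm_call_completed(self, run_id, response, iteration, **kwargs):
        self._close(run_id, iteration, "ok")

    async def on_llm_call_failed(self, run_id, iteration, error, *, duration_ms=None):
        self._close(run_id, iteration, "error")

    def _close(self, run_id, iteration, status):
        started = self.open.pop((run_id, iteration), None)
        if started is None:
            return  # no matching start was seen; nothing to close
        elapsed = time.monotonic() - started
        self.closed.append((run_id, iteration, status, elapsed))


async def main():
    t = LLMSpanTracker()
    # Iteration 1 succeeds, iteration 2 fails: one closing hook each.
    await t.on_llm_call_started("run-1", 1)
    await t.on_llm_call_completed("run-1", "response", 1)
    await t.on_llm_call_started("run-1", 2)
    await t.on_llm_call_failed("run-1", 2, RuntimeError("rate limited"))
    return t


tracker = asyncio.run(main())
statuses = [(iteration, status) for _, iteration, status, _ in tracker.closed]
print(statuses)
```

With a completion-only surface, the second call would never appear in closed at all; here it is recorded as an error with a real start time.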

ConsoleNotifier and CompositeNotifier

Dendrux ships two notifiers in dendrux.notifiers. Both subclass BaseNotifier.

ConsoleNotifier uses rich to render the run as a terminal panel with per-iteration steps:

╭──────────────────────────────────────────────────────╮
│ What is 15 + 27?                                     │
╰──────────────────────────────────────────────────────╯
    llm 654 tokens in 2.3s
 
  Step 1
    calling add a=15, b=27
    done    add 0.0s
    llm 676 tokens in 0.8s
 
  Step 2

It overrides on_message_appended, on_llm_call_completed, on_tool_completed, and on_governance_event. The lifecycle hooks (on_run_started etc.) fire and are ignored — ConsoleNotifier does not yet render them, but you could subclass it to add a banner.

CompositeNotifier fans a single set of callbacks out to a list of inner notifiers. If you want both terminal output and a metrics sink, wrap them: CompositeNotifier([console, metrics]). It implements every hook by forwarding to each child, swallowing per-child exceptions so one broken notifier does not prevent the others from running.
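
The fan-out behaviour described here can be sketched for a single hook. FanOut, Broken, and Collector are hypothetical names; this illustrates the pattern and is not the real CompositeNotifier implementation.

```python
import asyncio
import logging

logger = logging.getLogger("fanout_sketch")


class FanOut:
    """Hypothetical fan-out for one hook; not the real CompositeNotifier."""

    def __init__(self, children):
        self.children = list(children)

    async def on_message_appended(self, run_id, message, iteration):
        for child in self.children:
            try:
                await child.on_message_appended(run_id, message, iteration)
            except Exception:
                # One broken child must not starve the others.
                logger.warning("child %r failed", child, exc_info=True)


class Broken:
    async def on_message_appended(self, run_id, message, iteration):
        raise RuntimeError("webhook timed out")


class Collector:
    def __init__(self):
        self.seen = []

    async def on_message_appended(self, run_id, message, iteration):
        self.seen.append(message)


collector = Collector()
fan = FanOut([Broken(), collector])
asyncio.run(fan.on_message_appended("run-1", "hello", 0))
print(collector.seen)
```

The broken child is listed first on purpose: the collector still receives the message because the exception is caught per child, not per fan-out call.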

For OpenTelemetry, install dendrux[otel] and pass OpenTelemetryNotifier() from dendrux.notifiers.otel. It emits a GenAI-semconv span tree (invoke_agent → chat → execute_tool) on the host application's existing TracerProvider. This is a V1 integration. See the OpenTelemetry recipe for setup, span shape, and what's left out for now (cross-process trace continuity, native metrics, and log signals are deferred until real usage validates the design).

Fail-open semantics

The loop does not call your notifier directly. It calls a thin wrapper in dendrux/loops/_helpers.py that swallows exceptions:

async def notify_message(notifier, run_id, message, iteration, warnings=None):
    """Notify notifier of a message append, swallowing exceptions."""
    if notifier is None:
        return
    try:
        await notifier.on_message_appended(run_id, message, iteration)
    except Exception:
        logger.warning("Notifier.on_message_appended failed", exc_info=True)
        if warnings is not None:
            warnings.append(f"on_message_appended failed at iteration {iteration}")

Four things to pull out of that wrapper:

  1. None is fine. Passing no notifier is the common path. The wrapper short-circuits.
  2. Exceptions do not propagate. If your Slack webhook times out, the run carries on.
  3. The warning is logged. You will see the traceback in your log stream at warning level, so the bug is not silent, it is just not fatal.
  4. Warnings are collected on the run. A text label is appended to a per-run warnings list, which surfaces on the final RunResult.meta["notifier_warnings"] and, when persisted by the runner, on run_events. The run still succeeds, and the operator can see which callback failed where.

Every notifier hook is wrapped the same way (notify_run_started, notify_llm_started, notify_tool_started, etc.).
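
To see the fail-open behaviour end to end, the notify_message wrapper shown earlier in this section can be exercised against a notifier that always raises. The wrapper body is reproduced so the sketch runs on its own; FlakyNotifier and the logger name are hypothetical.

```python
import asyncio
import logging

logger = logging.getLogger("fail_open_sketch")


async def notify_message(notifier, run_id, message, iteration, warnings=None):
    """Same shape as the wrapper shown earlier, reproduced to be standalone."""
    if notifier is None:
        return
    try:
        await notifier.on_message_appended(run_id, message, iteration)
    except Exception:
        logger.warning("Notifier.on_message_appended failed", exc_info=True)
        if warnings is not None:
            warnings.append(f"on_message_appended failed at iteration {iteration}")


class FlakyNotifier:
    """Hypothetical notifier that always raises."""

    async def on_message_appended(self, run_id, message, iteration):
        raise TimeoutError("Slack webhook timed out")


async def main():
    warnings = []
    # No notifier: the wrapper returns without doing anything.
    await notify_message(None, "run-1", "hi", 0, warnings)
    # Raising notifier: the exception is logged and recorded, not propagated.
    await notify_message(FlakyNotifier(), "run-1", "hi", 3, warnings)
    return warnings


run_warnings = asyncio.run(main())
print(run_warnings)
```

Neither call raises into main, and the only trace of the failure is the logged traceback plus the text label in the warnings list.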

Why a side-channel at all

You might ask: if the recorder already writes run_events and an SSE client can read them back in order, do you need a notifier?

Yes, for three reasons.

  1. Latency. Reading run_events costs a DB round-trip, then an SSE poll interval, then a client render. A notifier runs in the same event loop as the LLM call and sees each event as soon as its hook is awaited. For terminal output, live metrics, or anything that wants sub-millisecond reaction, the notifier is the right channel.
  2. Richer payloads. A notifier receives the full Message, LLMResponse, ToolCall, and ToolResult objects. The DB event log stores a condensed projection (token counts, tool name, iteration, a correlation id). If you want the entire prompt, the entire response, or the full tool result, you get it in-memory in the callback. Persisting all of that would bloat the DB; putting it in the notifier avoids the tradeoff.
  3. Out-of-band destinations. Slack, Datadog, OpenTelemetry, a custom websocket, a tqdm progress bar: none of these are storage. They do not want SSE. A notifier lets them hook in without pretending to be a durability layer.
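
For slow out-of-band destinations, one common pattern is to keep the callback itself cheap and hand each event to a background task. The sketch below rests on assumptions: QueueNotifier and sender are hypothetical names rather than dendrux APIs, and dropping events when the queue is full is one possible policy, not a description of what the framework does.

```python
import asyncio


class QueueNotifier:
    """Hypothetical notifier: enqueues events for a background sender."""

    def __init__(self, maxsize=1000):
        self.queue = asyncio.Queue(maxsize=maxsize)
        self.dropped = 0

    async def on_tool_completed(self, run_id, tool_call, tool_result, iteration):
        try:
            # put_nowait never waits, so the callback returns immediately.
            self.queue.put_nowait((run_id, tool_call, iteration))
        except asyncio.QueueFull:
            self.dropped += 1  # live broadcast: drop rather than block


async def sender(queue, sent):
    """Stand-in for a slow destination such as a webhook call."""
    while True:
        event = await queue.get()
        await asyncio.sleep(0.01)  # pretend network latency
        sent.append(event)
        queue.task_done()


async def main():
    notifier = QueueNotifier(maxsize=2)
    sent = []
    # Three events against a queue of two: the third is dropped.
    for i in range(3):
        await notifier.on_tool_completed("run-1", "add", None, i)
    worker = asyncio.create_task(sender(notifier.queue, sent))
    await notifier.queue.join()  # wait until queued events are delivered
    worker.cancel()
    await asyncio.gather(worker, return_exceptions=True)
    return sent, notifier.dropped


sent, dropped = asyncio.run(main())
print(sent, dropped)
```

The trade-off matches the failure policy described in this document: a live broadcast may lose an event under pressure, while the recorder remains the place where nothing is dropped.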

The recorder is the run's canonical record. The notifier is its live broadcast. They coexist because their jobs are different, and the failure policies match those jobs: the canonical record refuses to drop rows, the live broadcast refuses to block the source.