AdaOS Supervisor
Goal
Introduce a small always-on local supervisor process that remains available while the main AdaOS runtime restarts, updates, validates a new slot, or rolls back.
adaos-supervisor is not a transport sidecar.
It is the local process-lifecycle and update-orchestration authority for one node.
Read this together with:
Why this exists
The current autostart/update flow couples:
- local admin API availability
- runtime process lifetime
- update status visibility
- slot validation and rollback logic
This creates an operator gap during restart-heavy paths:
- update-status becomes unavailable exactly when update progress matters most
- node reliability collapses into connection-refused even when the system is in a known transitional state
- stale restarting or applying state can survive longer than intended
The supervisor solves this by keeping the update and process-control surface alive while the runtime is stopped or being replaced.
Scope
adaos-supervisor owns:
- local runtime process lifecycle
- runtime start/stop/restart sequencing
- persisted update attempt state
- candidate-to-slot prepare / validate / rollback orchestration
- post-validation bootstrap/root promotion orchestration when bootstrap-managed files changed
- skill runtime migration orchestration for installed skills during core slot transition, including deferred commit after old runtime shutdown when early slot preparation is used
- restart and validation deadlines
- local admin/update API availability during runtime downtime
- recovery from interrupted update attempts
adaos-supervisor does not own:
- hub-root protocol semantics
- root-issued trust or cross-subnet truth
- business semantics of skills, scenarios, or local event handling
- realtime transport authority
- browser/member semantic channel policy
Target runtime split
Target local node layout:
- adaos-supervisor
  - always-on local control and update authority
  - owns persisted attempt state and process supervision
- adaos-runtime
  - main FastAPI/runtime process
  - production runtime is launched from the active slot manifest
  - owns local execution semantics, storage, skills, scenarios, and APIs
- adaos-realtime
  - optional transport-only sidecar
  - owns selected long-lived transport loops such as hub-root realtime transport
This means:
- the runtime is restartable without losing the local control surface
- the realtime sidecar can also be restarted independently
- update progress remains inspectable during shutdown, apply, and validate phases
- root checkout stays out of the production runtime path unless a developer explicitly launches it
For the target media-continuity path, this split also enables a more selective policy:
- if a member currently owns an active live media session, that member update should be deferred
- a hub runtime restart may still be allowed if the hub-side realtime sidecar can stay alive and continue serving the delegated realtime continuity path
- supervisor should therefore treat "restart runtime" and "tear down sidecar" as different actions once live media continuity depends on sidecar ownership
Runtime source rule
Production runtime must always come from slot A|B.
Root checkout is reserved for:
- bootstrap install
- supervisor/autostart/update-control code
- watchdog and other always-on control-plane helpers
- candidate preparation
- explicit developer-run workflows
Control-plane code must run from a stable root checkout and root virtual
environment, for example ~/adaos plus ~/adaos/.venv. It must not run from
state/core_slots/slots/<A|B>/repo or that slot's venv, because an update can
rewrite or replace a slot while the supervisor/watchdog still needs to survive
the transition and make rollback decisions.
This keeps slot switching fast, keeps production runtime independent from root checkout drift, and keeps the control-plane independent from slot mutation.
When the watchdog is re-enabled, it follows the same rule as supervisor:
- its service wrapper or launch spec uses the stable root checkout and root .venv
- it may observe, stop, or restart slot runtimes, but it is not itself slot-bound
- watchdog diagnostics must expose the effective Python/source path the same way autostart status exposes wrapper_python_is_core_slot
The watchdog is allowed to act on slot runtime health; it must not depend on a slot runtime interpreter to remain alive.
For runtime processes, slot resolution is process-local first:
- if ADAOS_ACTIVE_CORE_SLOT is set in the runtime environment, that process treats the slot as its effective runtime source
- otherwise the global slot marker from state/core_slots/active remains authoritative
This is important for warm-switch work because it allows a candidate runtime to boot from the inactive slot for prewarm/diagnostics without mutating the global active-slot marker before cutover is committed.
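The process-local-first rule can be sketched roughly as follows. This is an illustration only, not the AdaOS implementation: the helper name resolve_effective_slot is hypothetical, and the sketch assumes that the state/core_slots/active marker is a small text file containing just the slot letter.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

VALID_SLOTS = ("A", "B")


def resolve_effective_slot(
    state_dir: Path,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the slot this process should treat as its runtime source."""
    env = os.environ if env is None else env
    # The process-local override wins, so a candidate can boot from the
    # inactive slot without touching the global active-slot marker.
    override = env.get("ADAOS_ACTIVE_CORE_SLOT", "").strip().upper()
    if override:
        if override not in VALID_SLOTS:
            raise ValueError(f"unknown slot override: {override!r}")
        return override
    # Assumption: the global marker file holds only "A" or "B".
    marker = state_dir / "core_slots" / "active"
    slot = marker.read_text(encoding="utf-8").strip().upper()
    if slot not in VALID_SLOTS:
        raise ValueError(f"unknown slot in marker: {slot!r}")
    return slot
```

With this shape, a prewarmed candidate launched with ADAOS_ACTIVE_CORE_SLOT=B resolves to B while every other process on the node keeps reading the global marker.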
Slot-bound runtime ports
Supervisor should treat runtime ports as slot-owned rather than as one global mutable bind.
Target rule:
- slot A keeps a stable local runtime port
- slot B keeps a different stable local runtime port
- supervisor stays on its own always-on control port
Current MVP direction is:
- default A runtime port: 8777
- default B runtime port: 8778
- supervisor port: 8776
This creates a clean foundation for warm-switch behavior because the inactive slot can be prepared and, later, prewarmed without fighting for the active slot's listener.
The mapping must remain supervisor-visible in diagnostics and browser-safe transition payloads so local clients can discover:
- active runtime URL
- candidate runtime URL
- current transition mode
- whether warm-switch was admitted or downgraded to stop-and-switch
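A minimal sketch of how the slot-to-port mapping could feed a discovery payload is shown below. The port numbers are the MVP defaults listed above; the TransitionView field names, the helper names, and the loopback host are assumptions made for illustration.

```python
from dataclasses import asdict, dataclass
from typing import Optional

# MVP defaults from the text; the supervisor keeps its own control port.
SLOT_PORTS = {"A": 8777, "B": 8778}
SUPERVISOR_PORT = 8776


def runtime_url(slot: str, host: str = "127.0.0.1") -> str:
    """Build the slot-owned runtime URL (host is an assumption)."""
    return f"http://{host}:{SLOT_PORTS[slot]}"


@dataclass(frozen=True)
class TransitionView:
    # Hypothetical field names for the four facts clients need to discover.
    active_runtime_url: str
    candidate_runtime_url: Optional[str]
    transition_mode: Optional[str]
    warm_switch_admitted: Optional[bool]


def build_transition_view(
    active_slot: str,
    candidate_slot: Optional[str] = None,
    transition_mode: Optional[str] = None,
) -> dict:
    """Return a read-only payload describing active and candidate runtimes."""
    admitted = None if transition_mode is None else transition_mode == "warm_switch"
    view = TransitionView(
        active_runtime_url=runtime_url(active_slot),
        candidate_runtime_url=runtime_url(candidate_slot) if candidate_slot else None,
        transition_mode=transition_mode,
        warm_switch_admitted=admitted,
    )
    return asdict(view)
```

Because each slot owns its port, the candidate URL can be published before cutover without the active listener ever being rebound.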
Warm-switch admission
Warm-switch is desirable, but not always safe on constrained hardware.
Supervisor should therefore make an explicit admission decision before using a dual-runtime transition:
- if candidate slot uses a different reserved port and memory headroom is sufficient, transition mode is warm_switch
- if memory headroom is not sufficient, transition mode is stop_and_switch
- this decision must be visible to operator and browser-facing status surfaces before shutdown starts
Admission should be driven by a simple local resource gate such as:
- available memory
- current runtime RSS
- estimated candidate runtime footprint
- configured reserve that must remain free after candidate start
The important rule is that low-memory devices must fail safe into stop-and-switch instead of trying to start two full runtimes and getting stuck mid-transition.
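The resource gate can be sketched as a pure function over the four inputs listed above. The names, the default reserve, and the footprint estimate (reuse the active runtime's RSS with a floor) are assumptions for illustration, not measured AdaOS values.

```python
from dataclasses import dataclass

MIB = 1024 * 1024


@dataclass(frozen=True)
class MemorySnapshot:
    available_bytes: int     # memory the OS reports as available
    runtime_rss_bytes: int   # RSS of the currently active runtime


def estimate_candidate_footprint(runtime_rss_bytes: int,
                                 floor_bytes: int = 128 * MIB) -> int:
    # Assumption: the candidate needs about as much as the active runtime,
    # but never less than a fixed floor.
    return max(runtime_rss_bytes, floor_bytes)


def admit_transition_mode(mem: MemorySnapshot,
                          active_port: int,
                          candidate_port: int,
                          reserve_bytes: int = 256 * MIB) -> str:
    """Return "warm_switch" only when ports differ and headroom is sufficient."""
    if candidate_port == active_port:
        return "stop_and_switch"
    needed = estimate_candidate_footprint(mem.runtime_rss_bytes) + reserve_bytes
    # Fail safe: anything short of full headroom downgrades the transition.
    if mem.available_bytes >= needed:
        return "warm_switch"
    return "stop_and_switch"
```

Keeping the gate side-effect free makes it easy to compute the decision early and publish it to status surfaces before shutdown starts.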
Runtime instance identity
Warm-switch means active and candidate may overlap for a short period.
That requires identity stronger than just hub_id or subnet_id.
Each runtime process should therefore carry:
- runtime_instance_id
- transition_role (active or candidate)
- slot-bound runtime URL/port
This identity must flow through:
- supervisor runtime status
- browser-safe transition status
- root control lifecycle reports
- root core-update reports
- hub root/NATS session issuance and logs
Without that, a candidate process can look indistinguishable from the active process and accidentally steal or overwrite control-plane state.
Candidate passive mode
Before cutover is explicitly committed, a prewarmed candidate runtime must stay passive on root-facing traffic subjects.
That means a candidate may establish root control connectivity for diagnostics, but it must not yet subscribe to the same root-routed traffic subjects as the active runtime:
- tg.input.<hub_id>
- io.tg.in.<hub_id>.text
- route.v2.to_hub.<hub_id>.*
The intent is:
- root can see that a candidate runtime exists
- active runtime remains the only consumer of live hub traffic before cutover
- candidate reconnects or retries do not supersede the active runtime's root/NATS session merely because they share the same hub_id
The same rule must apply to local control discovery:
- candidate runtime surfaces must self-identify through lightweight probes such as /api/ping and /api/admin/update/status
- local fallback control resolvers must ignore a runtime that reports transition_role=candidate or admin_mutation_allowed=false
- a candidate runtime must reject mutating local update operations (update.start, update.cancel, update.rollback) with an explicit conflict instead of behaving like a second control plane
Fast cutover does not remove that rule. It only defines the moment when supervisor may end passive mode:
- supervisor explicitly authorizes the already-running candidate to promote itself through POST /api/admin/runtime/promote-active
- the candidate flips to transition_role=active, reconnects root-facing transport under that new authority, and only then becomes eligible to own live hub traffic
- supervisor adopts that promoted process as the managed active runtime instead of launching a second fresh process when warm-switch succeeds
- if promotion or adoption fails, supervisor tears the candidate down and falls back to the existing stop-and-switch launch path
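The local control discovery rules for passive candidates can be sketched as two small guards. The probe payload is modeled as a plain mapping carrying transition_role and admin_mutation_allowed as named in the text; the function names, the url key, and the CandidateConflict exception are hypothetical.

```python
from typing import Iterable, Mapping, Optional

MUTATING_OPS = frozenset({"update.start", "update.cancel", "update.rollback"})


class CandidateConflict(Exception):
    """Raised when a passive candidate is asked to mutate update state."""


def is_control_eligible(probe: Mapping[str, object]) -> bool:
    """A runtime may serve as local control target only if it is not passive."""
    if probe.get("transition_role") == "candidate":
        return False
    if probe.get("admin_mutation_allowed") is False:
        return False
    return True


def pick_control_runtime(
    probes: Iterable[Mapping[str, object]],
) -> Optional[Mapping[str, object]]:
    """Return the first eligible runtime probe, ignoring candidates."""
    for probe in probes:
        if is_control_eligible(probe):
            return probe
    return None


def guard_mutation(transition_role: str, operation: str) -> None:
    """Reject mutating update operations on a candidate with an explicit conflict."""
    if transition_role == "candidate" and operation in MUTATING_OPS:
        raise CandidateConflict(
            f"{operation} rejected: runtime is a passive candidate"
        )
```

In an HTTP handler the conflict would most naturally map to a 409 response, though the exact status code is not specified in this document.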
Authority boundary
Supervisor
The supervisor is the authority for:
- whether the runtime is expected to be running
- whether an update attempt is pending, active, failed, validated, or rolled back
- whether a rollback must be triggered after deadline expiry or failed validation
- operator-visible local state of runtime lifecycle
The supervisor is not the authority for:
- whether root-side protocol acks were accepted as global truth
- whether transport-level path selection is healthy beyond its delegated runtimes
- business-policy decisions inside degraded hub execution
But the supervisor must remain able to enforce locally visible continuity guards exposed by the runtime model, for example:
- defer member update while member-owned live media remains active
- refuse a hub restart path that would drop a sidecar continuity contract the system currently depends on
- distinguish "runtime restart allowed with sidecar continuity" from "full local media teardown"
The first conservative version of this policy is now implemented:
- supervisor reads runtime continuity guard data from GET /api/node/reliability
- update transitions are deferred into explicit planned/live_media_guard state when the continuity contract says restart would be unsafe
- manual runtime restart is refused until sidecar continuity becomes a real ready capability rather than only a planned target
Runtime
The runtime remains the authority for:
- local API semantics
- local scenario and skill execution
- local persistence and event bus behavior
- local degraded-mode execution once it is running
Realtime sidecar
The realtime sidecar remains the authority only for:
- transport lifecycle
- reconnect loops
- socket diagnostics
- local relay IO
It must not absorb supervisor responsibilities.
Local control surfaces
The target local APIs are:
Supervisor API
Always available while the node is booted:
- GET /api/supervisor/status
- GET /api/supervisor/update/status
- POST /api/supervisor/update/start
- POST /api/supervisor/update/cancel
- POST /api/supervisor/update/defer
- POST /api/supervisor/update/rollback
- POST /api/supervisor/update/complete
- POST /api/supervisor/runtime/restart
- POST /api/supervisor/runtime/candidate/start
- POST /api/supervisor/runtime/candidate/stop
This API is the source of truth for:
- update attempt state
- restart reason
- validation deadlines
- rollback decisions
- current managed child processes
- active and candidate runtime process identity/state (runtime_instance_id, transition_role, slot, port, readiness)
- current skill runtime migration diagnostics for the active core update attempt
- runtime liveness separate from listener bind and runtime API readiness
- active managed runtime command/executable source for the current slot
- active slot structure diagnostics (manifest/repo/venv/nested-slot anomalies)
For browser-facing observability, supervisor should also expose a limited read-only transition surface that can be polled without admin-mutating privileges. That surface is intended only for restart/update visibility such as:
- hub restarting
- update planned
- update applying
- rollback in progress
- root promotion pending
- root restart in progress
- subsequent transition queued
- update failed
It must not expose mutating control operations or become a substitute for the authenticated operator API.
Current MVP browser behavior may preserve and display the last known transition state during reconnect windows, and routed hub sessions can now consume that browser-safe transition state primarily as pushed core.update.status events over the control /ws channel, with /hubs/<id>/api/supervisor/public/update-status retained as a fallback when the control channel is unavailable.
The target end state is stronger: every supported browser entry topology should be able to poll that read-only supervisor transition surface directly, so the shell can keep moving from hub restarting to rollback in progress or root promotion pending from supervisor truth rather than only from the last runtime-visible snapshot.
Operator-facing surfaces are also expected to consume that same supervisor truth through the canonical control-plane model, so Infrascope and related overview projections can show core-runtime transition state in active_runtimes, health strips, and recent changes instead of presenting a restart only as generic hub instability.
That browser-safe surface now also includes candidate runtime diagnostics needed for warm-switch work:
- action
- candidate_runtime_instance_id
- candidate_runtime_state
- candidate_runtime_api_ready
- candidate_transition_role
- candidate_prewarm_state
- candidate_prewarm_message
- candidate_prewarm_ready_at
Runtime API
Available only while adaos-runtime is running:
- current node APIs
- current admin APIs that belong to runtime semantics
- cutover-only runtime identity operations such as POST /api/admin/runtime/promote-active
- reliability, scenario, skill, Yjs, media, and operator surfaces
Persisted state
The supervisor should persist explicit local attempt state, separate from transient runtime liveness:
- state/supervisor/runtime.json
- state/supervisor/update_attempt.json
- state/supervisor/last_result.json
Recommended fields for update_attempt.json:
- attempt_id
- action
- state
- phase
- target_slot
- target_rev
- target_version
- reason
- started_at
- deadline_at
- validated_at
- restored_slot
- failure_summary
- skill_runtime_migration
- scheduled_for
- planned_reason
- subsequent_transition
- subsequent_transition_requested_at
- subsequent_transition_request
The important rule is that this state is committed by the supervisor, not inferred only from whether the runtime currently listens on 127.0.0.1:8777.
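One way to make that commit robust against a crash mid-write is the usual temp-file-plus-rename pattern, sketched below. The helper names are hypothetical, and the document does not state how AdaOS actually writes these files.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def write_attempt(path: Path, attempt: dict) -> None:
    """Write the attempt record atomically: temp file, fsync, then rename."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name, suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(attempt, handle, indent=2, sort_keys=True)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace is atomic on the same filesystem, so readers see
        # either the old record or the new one, never a partial file.
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def read_attempt(path: Path) -> Optional[dict]:
    """Return the persisted attempt, or None when no attempt was recorded."""
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except FileNotFoundError:
        return None
```

On supervisor boot, read_attempt on state/supervisor/update_attempt.json would then be the starting point for resuming or resolving an interrupted attempt, independent of whether any runtime port is listening.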
Update flow
Target flow:
- operator or root-triggered action reaches supervisor
- supervisor writes update_attempt.json
- supervisor materializes the candidate source/artifact
- supervisor prepares the inactive slot from that candidate while the active runtime is still serving traffic
- supervisor starts countdown only after the target slot is materially ready
- supervisor requests graceful runtime shutdown
- supervisor commits deferred installed-skill runtime migration against the target core interpreter after the old runtime is down
- supervisor activates the target slot and either promotes the prewarmed candidate runtime to active authority or launches production runtime from that slot
- supervisor validates required runtime checks against that target-slot runtime
- on slot-validation success, supervisor commits the transition result
- if bootstrap-managed files changed, supervisor records root_promotion_required and promotes root from the same validated candidate
- on autostart-managed deployments, supervisor requests autostart-service restart so the root-based supervisor/bootstrap code actually switches over
- on failure or deadline expiry, supervisor rolls back the slot and records failure
Important invariants:
- the attempt record is not cleared before validation succeeds
- restarting and applying are bounded by deadlines
- interrupted supervisor boot resumes or resolves the last incomplete attempt
- if a new update signal arrives during an active transition, supervisor records exactly one deferred subsequent_transition and executes it once after the current transition reaches a terminal state
- minimum update interval gating schedules a future update window instead of rejecting the request outright
- installed skills do not silently inherit old runtime dependencies after core migration
- root/bootstrap promotion never happens before the candidate already passed slot validation
- root promotion must preserve any already-queued subsequent transition metadata so a self-update handoff does not lose the next requested transition
- prepared slot contents must not inherit another slot's git remotes or become the authority for future updates
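Two of these invariants, the single deferred subsequent_transition and the deadline bound, can be sketched over a plain attempt dict. The terminal state names and the choice that a newer request replaces an already-queued one are assumptions; the document only requires that exactly one follow-up is recorded.

```python
from typing import Optional

# Assumed terminal state names for illustration.
TERMINAL_STATES = frozenset({"validated", "failed", "rolled_back"})


def request_transition(attempt: Optional[dict], request: dict, now: float) -> dict:
    """Start a new attempt, or queue the request behind an active one."""
    if attempt is None or attempt["state"] in TERMINAL_STATES:
        return {
            "state": "pending",
            "request": request,
            "started_at": now,
            "deadline_at": None,
            "subsequent_transition": False,
        }
    # Exactly one deferred follow-up is kept. Assumption: a newer
    # request overwrites the one that was queued earlier.
    queued = dict(attempt)
    queued["subsequent_transition"] = True
    queued["subsequent_transition_requested_at"] = now
    queued["subsequent_transition_request"] = request
    return queued


def deadline_expired(attempt: dict, now: float) -> bool:
    """True when a bounded phase has outlived its deadline."""
    deadline = attempt.get("deadline_at")
    return deadline is not None and now >= deadline


def next_after_terminal(attempt: dict) -> Optional[dict]:
    """Return the queued request only once the current attempt is terminal."""
    if attempt["state"] in TERMINAL_STATES and attempt.get("subsequent_transition"):
        return attempt.get("subsequent_transition_request")
    return None
```

The queued request stays on the attempt record itself, which is also what allows root promotion to carry it across a supervisor restart.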
Bootstrap/root promotion
Bootstrap-managed code such as supervisor, autostart, and core-update orchestration is a separate promotion step.
Rules:
- slot validation always happens first
- root promotion is allowed only after the candidate is proven in a slot
- production runtime still restarts from the active slot after root promotion
- root promotion should use the same validated candidate source, not a fresh mutable branch tip
- current implementation promotes bootstrap-managed and operator-control files into the explicit validated root target recorded for that slot, writes a backup snapshot plus restore metadata, refreshes the autostart wrapper so the next supervisor process uses the stable root checkout/root .venv, records an explicit supervisor attempt state while waiting for restart, and on autostart-managed Linux deployments requests the service restart automatically so the new supervisor/bootstrap code becomes active
- root promotion checks effective root parity, not only the candidate manifest: if the current bootstrap/operator-control path list changes between rollouts, stale root files such as adaos node diagnostics are detected and promoted before acceptance tooling is trusted
- if another transition request arrives before that restart completes, it is queued as subsequent_transition on the supervisor attempt instead of being dropped or run concurrently
- manual adaos autostart update-complete remains the compatibility and retry path for older supervisors or environments where self-requested restart is unavailable
This keeps root updates out of the fast rollback path while preserving the slot-runtime model.
Autostart status should expose the wrapper Python and whether that Python lives
under state/core_slots/slots. wrapper_python_is_core_slot=true is a
control-plane isolation defect: it does not mean the active runtime slot is
wrong, but it means the next supervisor/watchdog restart is still coupled to a
mutable runtime slot.
Skill runtime migration lifecycle
Installed skills are not automatically valid just because the core slot booted.
Their runtime dependencies must be prepared against the new core interpreter and surfaced as explicit diagnostics.
If a skill uses optional data/internal, that data evolves with the runtime compatibility bucket (v<major>.<minor>), not with the A/B code slot. Cross-bucket schema changes should be implemented in the reserved migrations/data_migration.py file.
Target migration model
The target AdaOS rule is:
- slot activation is a pointer switch, not a promise of full process-memory migration
- live in-memory objects are disposable unless a runtime explicitly rebuilds them
- migration authority belongs to durable and slot-bound state, not to arbitrary Python object graphs
This means AdaOS should not treat "the skill restarted in memory" and "the skill was safely migrated" as the same thing. They are separate concerns:
- runtime cutover switches executable code and slot-bound paths
- state migration upgrades persisted state into a form that the new runtime can accept
- runtime rehydration rebuilds ephemeral caches, projections, and subscribers after cutover
The target storage classes per skill are:
- canonical durable state: long-lived business state that must survive restart, rollback, and projection rebuild
- bucket-bound schema state: data that evolves together with runtime schema and therefore belongs under v<major>.<minor>/data/internal
- derived runtime state: projections, indexes, caches, thread summaries, embeddings, and similar rebuildable material
- live process memory: in-flight handlers, imported modules, object instances, background tasks, subscriptions, and local caches that are not durable by default
The target kernel contract is that only the first two classes are migrated. Derived runtime state must be rebuilt deterministically, and live memory must be drained and recreated rather than copied forward implicitly.
Why this matters
The current implementation already follows the first half of this model:
- runtime slot activation switches active and previous markers atomically
- optional data/internal is migrated during prepare only when a new compatibility bucket is prepared with a migration hook
- rollback switches the active version/slot marker; patch rollback reuses the same bucket data, while minor rollback points back to the previous bucket data
- service skills are explicitly restarted on activate/rollback
- in-process skills reload code on next invocation by clearing skill modules and re-importing from the active slot
What is still only partial is the second half:
- there is no universal contract for draining long-lived in-process state before cutover
- there is no explicit rehydrate phase for non-service skills after activation
- migration diagnostics focus on slot/data preparation but not yet on runtime state rebuild
Target lifecycle
Per skill, the target lifecycle should become:
- prepare_runtime
- persist_before_switch
- migrate_durable_state
- activate_pointer
- rehydrate_runtime
- healthcheck
- rollback or deactivate on failure
The intended semantics are:
- prepare_runtime
  - stage code, interpreter, dependencies, and resolved manifest in the inactive slot
  - prepare shared bucket data; patch installs reuse it, minor installs run the declared migration hook into the new bucket
- persist_before_switch
  - flush debounced writes and checkpoint any skill-owned durable stores before mutating slot pointers
  - stop accepting new local work if the runtime cannot safely overlap old and new state
- migrate_durable_state
  - run schema migration only against canonical durable state and slot-bound schema state
  - never attempt generic object-memory serialization as the platform default
- activate_pointer
  - atomically switch active runtime version and slot
  - record previous slot and migration metadata for rollback
- rehydrate_runtime
  - rebuild projections, indexes, caches, subscriptions, and other derived state from durable truth
  - re-open service endpoints or re-subscribe platform listeners only after the new slot is authoritative
- healthcheck
  - validate that the new runtime is operational after rehydration, not merely importable
- failure handling
  - rollback if the old slot must be restored
  - deactivate if the core/runtime switch remains committed but this skill must be quarantined
Current implementation now also hardens the activation-failure path:
- if pointer cutover succeeds but rehydrate_runtime fails, the runtime attempts shutdown hooks on the newly active slot
- the runtime then restores the previous active version/slot selection, internal-data marker, and deactivation state
- lifecycle diagnostics for the failed target slot remain persisted for operator inspection
- runtime-wide drain and stop events now reuse the same skill lifecycle contract: subnet.draining fans out drain across active installed skills, while subnet.stopping fans out dispose and before_deactivate
Runtime hook direction
The platform should remain functional without custom skill hooks, but the target contract should support optional hooks for stateful skills:
- before_deactivate(): flush or checkpoint durable state before cutover
- migrate_state(payload): explicit migration entry point for schema-sensitive state if data/internal copy is insufficient
- after_activate(payload): lightweight post-switch initialization once the new slot is active
- rehydrate(): rebuild derived runtime state from durable truth
- dispose() or drain(): stop background work and release subscriptions/resources before rollback or deactivation
These hooks should be treated as bounded lifecycle hooks, not as an invitation to invent ad-hoc migration protocols per skill.
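Because the hooks are optional, the caller has to tolerate skills that define none of them. A minimal sketch of that dispatch follows; the hook names and the two event names come from this document, while call_optional_hook, fan_out, the status strings, and the hook ordering within an event are assumptions.

```python
from typing import Any, Dict

# Event-to-hook mapping as described for runtime-wide drain/stop events.
EVENT_HOOKS = {
    "subnet.draining": ("drain",),
    "subnet.stopping": ("dispose", "before_deactivate"),
}


def call_optional_hook(skill: Any, name: str, *args: Any) -> str:
    """Call skill.<name>(*args) if it exists; report "skipped", "ok", or an error."""
    hook = getattr(skill, name, None)
    if not callable(hook):
        return "skipped"
    try:
        hook(*args)
    except Exception as exc:
        # A failing hook is recorded, not propagated, so one skill
        # cannot block the lifecycle event for the others.
        return f"error: {exc}"
    return "ok"


def fan_out(event: str, skills: Dict[str, Any]) -> Dict[str, Dict[str, str]]:
    """Dispatch one runtime-wide event to every active skill."""
    return {
        skill_name: {
            hook: call_optional_hook(skill, hook) for hook in EVENT_HOOKS[event]
        }
        for skill_name, skill in skills.items()
    }
```

A skill with no hooks simply reports "skipped" for each one, which keeps the platform functional without custom skill code.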
Yjs and projection rule
For Yjs-backed experiences, the target rule should be explicit:
- Yjs is a live projection and collaboration layer
- Yjs is not the canonical migration authority for skill business state
- if a skill keeps canonical durable state elsewhere, Yjs must be rebuilt or reconciled from that durable truth after cutover
- if Yjs currently carries working state that must survive reloads, the platform should still persist a durable snapshot outside Yjs and rehydrate from that snapshot after rebuild
This is already the direction used by nlu_teacher:
- working state is visible under data.nlu_teacher.*
- durable state is also persisted under .adaos/state/skills/nlu_teacher/<webspace>.json
- rehydrate merges durable snapshot and current Yjs content after scenarios.synced
That pattern is target-aligned because it treats live projection state as rebuildable rather than as the sole owner of migration truth.
Target lifecycle per installed skill:
- prepare
- test
- activate
- rollback on activation or post-activation failure
- deactivate if core transition is committed but a subset of skills must be quarantined afterward
The supervisor remains the authority for the overall core-update decision, but individual skill runtime outcomes must be persisted as part of the update result.
Current MVP implementation now splits this into two moments:
- early slot preparation may build the inactive core slot and mark skill migration as deferred while the old runtime is still live
- the mutating skill runtime commit step still happens only after the old runtime has stopped, so countdown traffic does not start reading partially-switched skill runtime state
Current MVP implementation also starts a best-effort passive candidate runtime prewarm when:
- the inactive slot is already prepared
- slot ports are reserved distinctly
- supervisor admitted warm_switch
That prewarm now feeds a real fast-cutover path:
- candidate readiness is surfaced through supervisor runtime/public status
- candidate remains passive on root-routed traffic subjects until supervisor explicitly commits cutover
- once the old runtime is down and the prepared slot is activated, supervisor may promote/adopt the already-running candidate instead of starting a fresh runtime process
- if candidate promotion, root reconnect, or supervisor adoption fails, supervisor falls back to the existing stop-and-switch launch path from the same prepared slot
This keeps warm-switch opportunistic and reversible: the node gets a genuine low-downtime cutover path when the candidate is ready, but constrained or unhealthy cases still converge through the proven fallback path.
Recommended per-skill diagnostic fields:
- skill
- ok
- failed_stage
- prepared_version
- prepared_slot
- active_slot_before
- active_slot_after
- rollback_performed
- deactivated
- tests
- error
Operator surfaces such as Infra State and Infrascope should be able to answer:
- which skill failed migration
- whether the failure happened during prepare, tests, activate, rollback, or deactivate
- whether rollback was performed
- whether the node committed the core update with some skills intentionally deactivated
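A possible shape for the per-skill record and for the operator-facing summary is sketched below. The field names are the recommended ones; the field types, defaults, and the summarize helper are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, Optional


@dataclass
class SkillMigrationDiagnostic:
    skill: str
    ok: bool
    failed_stage: Optional[str] = None   # e.g. prepare, tests, activate
    prepared_version: Optional[str] = None
    prepared_slot: Optional[str] = None
    active_slot_before: Optional[str] = None
    active_slot_after: Optional[str] = None
    rollback_performed: bool = False
    deactivated: bool = False
    tests: Dict[str, Any] = field(default_factory=dict)
    error: Optional[str] = None


def summarize(diagnostics: Iterable[SkillMigrationDiagnostic]) -> dict:
    """Answer the operator questions: who failed, where, and what was done."""
    items = list(diagnostics)
    failed = [d for d in items if not d.ok]
    deactivated = sorted(d.skill for d in items if d.deactivated)
    return {
        "failed_skills": {d.skill: d.failed_stage for d in failed},
        "rolled_back": sorted(d.skill for d in failed if d.rollback_performed),
        "deactivated": deactivated,
        "committed_with_deactivated_skills": bool(deactivated),
    }
```

Keeping failed_stage as a single string makes it straightforward for a projection to tell a prepare failure from an activate or rollback failure without parsing error text.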
After a successful core switch, deactivate is the preferred local containment mechanism for individual broken skills when the node should keep the new core slot rather than trigger a full rollback.
The default target behavior is:
- runtime passes core post-switch validation
- supervisor runs post-commit checks for active skill runtimes
- failing skills are selectively deactivated
- core update remains committed, but operator surfaces show degraded skill set
Post-commit checks should not rely only on test suites. They should also treat persisted lifecycle diagnostics as first-class health signals, especially:
- rehydrate
- healthcheck
- shutdown-hook failures that indicate the skill cannot be safely recycled on the next transition
Current implementation now feeds lifecycle diagnostics into the skill runtime migration report and allows post-commit checks to fail and selectively deactivate a skill before test execution when runtime lifecycle health is already known to be bad.
Operator-facing projections now also distinguish lifecycle failures from plain test failures, so reports can show lifecycle/rehydrate instead of collapsing everything into tests.
Selective deactivation now also persists that same failure contract in the skill deactivation marker, so the runtime can distinguish "manually deactivated" from "quarantined after committed core switch because lifecycle/rehydrate failed".
That quarantine metadata is now propagated into operator-facing notes and post-validation status messages, so the supervisor-visible transition summary can name the quarantined skill and failing stage directly.
Roadmap checklist
Use the checklist below as the migration hardening path for the kernel/runtime layer:
- [x] document storage classes per skill explicitly in manifests/runtime docs: canonical durable state, slot-bound schema state, derived runtime state, live memory
- [x] add an explicit persist_before_switch stage to skill/core migration orchestration
- [ ] require migration logic to operate on durable and slot-bound state only, not generic process memory
- [x] define optional lifecycle hooks for before_deactivate, after_activate, rehydrate, and dispose/drain
- [x] make post-activation rehydration a declared runtime phase instead of an implicit side effect
- [x] persist per-skill migration diagnostics for persist, migrate, rehydrate, and healthcheck, not only prepare/test/activate
- [x] standardize rollback semantics when pointer switch succeeded but rehydration failed
- [x] connect global runtime drain/shutdown events to skill-level drain/dispose/before_deactivate
- [x] surface lifecycle-vs-test failure classes in operator-facing migration reports
- [x] standardize deactivation metadata when core switch stays committed but one skill cannot complete rehydration
- [ ] make projection-backed skills document which branches are canonical and which are rebuildable caches
- [ ] move Yjs-backed stateful skills toward "durable truth + projection rebuild" instead of "projection is the only truth"
- [ ] add tests that simulate restart/update/rollback with persisted state present before cutover
- [ ] add tests that prove derived state can be rebuilt deterministically after activation and rollback
Relationship to systemd
Target deployment:
- systemd manages adaos-supervisor
- adaos-supervisor manages adaos-runtime
- adaos-supervisor also manages adaos-realtime when sidecar mode is enabled in managed topology
This is preferred over systemd managing the main runtime process directly because systemd alone does not hold AdaOS-specific update semantics, slot state, or validation rules.
Relationship to realtime sidecar
The supervisor and realtime sidecar solve different problems:
- adaos-supervisor: process and update authority
- adaos-realtime: transport isolation
In the managed topology, the supervisor launches, monitors, and restarts the sidecar. Standalone runtime-owned sidecar startup remains only as compatibility fallback when supervisor is absent. The sidecar must remain transport-only. The sidecar must not become the hidden owner of update status, rollback state, or degraded-mode business policy.
Memory leak detection and profiling
The supervisor should also become the local authority for memory leak detection, profiling escalation, and remote retrieval of profiling evidence.
This work is intentionally prioritized ahead of the broader supervisor rollout because:
- core-slot evolution can change memory behavior even when update flow is otherwise healthy
- skill runtime evolution can introduce long-lived leaks outside the narrow core-update path
- low-memory devices need an explicit local guard before a leak turns into process death or unstable restart loops
Goal
Provide an always-on supervisor-owned memory watchdog that can:
- observe runtime process-family memory after start, restart, and slot switch
- distinguish normal warm-up from suspicious sustained growth
- restart the runtime in an explicit profiling mode when policy thresholds are crossed
- correlate memory growth with top-level runtime operations
- persist local profiling sessions and summaries
- publish profiling summaries and artifacts to root so they can be retrieved remotely by zone-scoped operator workflows
Operating model
The target model is policy-driven rather than "always profile everything".
Supervisor should run the managed runtime in one of these modes:
- normal
- sampled_profile
- trace_profile
Target behavior:
- runtime starts in normal
- supervisor samples process-family memory and records a rolling baseline
- if the memory policy detects suspicious growth, supervisor records a profiling session intent
- supervisor restarts the same slot in sampled_profile
- if the profile confirms continued abnormal growth, supervisor records a leak incident and keeps the node recoverable through restart / rollback / quarantine policy
- operator or root workflows can later retrieve the profiling summary and, when configured, the heavier profiling artifacts
The important rule is that profiling is an escalated diagnostic mode under supervisor policy, not a permanent runtime tax.
Restart-into-profile is also an availability-affecting action. The automatic policy path must therefore defer profiling restarts while a recent browser session or live member link is observed; critical low-memory restart policy remains the separate last-resort recovery path. A profiling window starts only after the restarted runtime API is ready, so slow bootstrap time is not counted as useful sampled-profile time and cannot prematurely stop the profiler before final artifacts are materialized.
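The deferral rule and the readiness-anchored profiling window can be sketched as follows. This is a minimal illustration, not the supervisor's actual code: ActivitySnapshot, should_defer_profile_restart, profile_window_deadline, and the 300-second "recent" window are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActivitySnapshot:
    # Seconds since a browser session was last observed; None if never seen.
    last_browser_session_age_s: Optional[float]
    # True while at least one live member link is observed.
    live_member_link: bool


def should_defer_profile_restart(activity: ActivitySnapshot,
                                 recent_window_s: float = 300.0) -> bool:
    """Return True when an automatic restart-into-profile should wait.

    The 300 s default is an illustrative assumption, not a documented value.
    """
    if activity.live_member_link:
        return True
    age = activity.last_browser_session_age_s
    return age is not None and age < recent_window_s


def profile_window_deadline(api_ready_at: float, window_s: float) -> float:
    """Measure the profiling window from API readiness, not from process spawn."""
    return api_ready_at + window_s
```

Anchoring the deadline to API readiness means a slow bootstrap shortens nothing: the full window is still available for sampling once the runtime can actually serve requests.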
Signals and admission rules
Supervisor should avoid triggering profiling from one instantaneous RSS sample.
Memory suspicion should be based on a combination of:
- absolute process-family RSS over a configured threshold
- positive RSS growth slope over a time window
- post-switch RSS significantly above the pre-switch baseline
- threshold breach sustained beyond a stabilization grace period
This is especially important because AdaOS runtimes may legitimately allocate memory during:
- slot boot and dependency import
- skill runtime preparation or activation
- workspace materialization
- model/session warm-up
- cache rebuild after update or rollback
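One way to combine the admission signals above is sketched below. The MemoryPolicy fields, the least-squares slope, and the exact combination (sustained threshold breach plus either a steep slope or a post-switch jump) are assumptions made for illustration; the real policy may weigh the signals differently. Samples are assumed to be sorted by timestamp.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Sample = Tuple[float, int]  # (timestamp in seconds, process-family RSS in bytes)


@dataclass
class MemoryPolicy:
    rss_threshold_bytes: int       # absolute RSS threshold
    min_slope_bytes_per_s: float   # growth slope that counts as suspicious
    post_switch_ratio: float       # allowed RSS ratio versus pre-switch baseline
    grace_period_s: float          # stabilization grace period after start/switch
    sustain_s: float               # how long the threshold breach must persist


def rss_slope(samples: List[Sample]) -> float:
    """Least-squares slope of RSS over time, in bytes per second."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_r = sum(r for _, r in samples) / n
    denom = sum((t - mean_t) ** 2 for t, _ in samples)
    if denom == 0:
        return 0.0
    return sum((t - mean_t) * (r - mean_r) for t, r in samples) / denom


def is_suspicious(samples: List[Sample], started_at: float, policy: MemoryPolicy,
                  pre_switch_baseline: Optional[int] = None) -> bool:
    # Ignore warm-up: only samples taken after the grace period count.
    settled = [s for s in samples if s[0] - started_at >= policy.grace_period_s]
    if len(settled) < 2:
        return False
    latest_t = settled[-1][0]
    # Never decide from a single instant: require enough observed history.
    if latest_t - settled[0][0] < policy.sustain_s:
        return False
    window = [r for t, r in settled if latest_t - t <= policy.sustain_s]
    if not all(r > policy.rss_threshold_bytes for r in window):
        return False
    growing = rss_slope(settled) >= policy.min_slope_bytes_per_s
    above_baseline = (
        pre_switch_baseline is not None
        and settled[-1][1] > pre_switch_baseline * policy.post_switch_ratio
    )
    return growing or above_baseline
```

A spike confined to the grace period never reaches the slope or baseline checks, which is how legitimate warm-up allocation stays out of the suspicion path in this sketch.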
Profiler strategy
The preferred default profiler strategy is:
- built-in tracemalloc for automatic supervisor-triggered profiling sessions
- optional heavier profiler adapters for deep-dive workflows on supported environments
The current target adapter split is:
- TracemallocProfilerAdapter
  - lowest operational complexity
  - safe for automated restart-into-profile mode
  - useful for Python allocation growth snapshots and diffs
- MemrayProfilerAdapter
  - optional deep-dive adapter for environments where native-allocation analysis is worth the overhead and platform support is available
  - not required for the first implementation
The supervisor must treat profilers as pluggable adapters. The policy engine decides when to escalate; the adapter decides how profiling is started, stopped, and materialized into artifacts.
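A pluggable adapter boundary along these lines could look like the sketch below. The ProfilerAdapter protocol and the TracemallocAdapterSketch class are hypothetical and only illustrate the split between policy and adapter; they are not the actual TracemallocProfilerAdapter. The tracemalloc calls themselves are standard-library APIs.

```python
import tracemalloc
from typing import Dict, List, Optional, Protocol


class ProfilerAdapter(Protocol):
    """Hypothetical adapter contract: the policy engine only calls start/stop."""

    def start(self) -> None: ...

    def stop(self) -> List[Dict[str, object]]: ...


class TracemallocAdapterSketch:
    """Illustrative tracemalloc-backed adapter that reports top growth sites."""

    def __init__(self, top_n: int = 5, frames: int = 1) -> None:
        self.top_n = top_n
        self.frames = frames
        self._baseline: Optional[tracemalloc.Snapshot] = None
        self._started_here = False

    def start(self) -> None:
        # Only start tracing if nobody else already did.
        if not tracemalloc.is_tracing():
            tracemalloc.start(self.frames)
            self._started_here = True
        self._baseline = tracemalloc.take_snapshot()

    def stop(self) -> List[Dict[str, object]]:
        if self._baseline is None:
            raise RuntimeError("start() must be called before stop()")
        final = tracemalloc.take_snapshot()
        if self._started_here:
            tracemalloc.stop()
            self._started_here = False
        # compare_to returns StatisticDiff entries ordered by largest change.
        diffs = final.compare_to(self._baseline, "lineno")
        self._baseline = None
        return [
            {
                "site": str(d.traceback),
                "size_diff_bytes": d.size_diff,
                "count_diff": d.count_diff,
            }
            for d in diffs[: self.top_n]
        ]
```

Because the policy engine sees only start and stop, a heavier adapter could be swapped in for deep-dive sessions without changing when escalation happens.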
Runtime launch contract
Escalation into profiling mode must not depend on ad-hoc runtime-specific flags.
The supervisor-owned launch contract should therefore reserve explicit runtime environment keys for memory profiling:
- ADAOS_SUPERVISOR_PROFILE_MODE
- ADAOS_SUPERVISOR_PROFILE_SESSION_ID
- ADAOS_SUPERVISOR_PROFILE_TRIGGER
Rules:
- normal remains the default when these keys are absent
- Phase 1 may expose these keys as part of the contract before restart-into-profile is implemented
- Phase 2 uses the same keys for the actual restart-into-profile flow instead of inventing a second launch mechanism
- the runtime may treat these keys as read-only diagnostic context and must not promote itself into profiling mode by local guesswork alone
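A minimal sketch of both sides of this contract follows. The three environment keys and the mode names come from the contract above; the helper names build_runtime_env and read_profile_context, and the choice to fall back to normal on an unrecognized mode value, are assumptions for illustration.

```python
from typing import Dict, Mapping, Optional

PROFILE_MODE_KEY = "ADAOS_SUPERVISOR_PROFILE_MODE"
PROFILE_SESSION_ID_KEY = "ADAOS_SUPERVISOR_PROFILE_SESSION_ID"
PROFILE_TRIGGER_KEY = "ADAOS_SUPERVISOR_PROFILE_TRIGGER"
CONTRACT_KEYS = (PROFILE_MODE_KEY, PROFILE_SESSION_ID_KEY, PROFILE_TRIGGER_KEY)
KNOWN_MODES = ("normal", "sampled_profile", "trace_profile")


def build_runtime_env(base: Mapping[str, str], mode: str = "normal",
                      session_id: Optional[str] = None,
                      trigger: Optional[str] = None) -> Dict[str, str]:
    """Supervisor side: derive the child environment for a runtime launch."""
    if mode not in KNOWN_MODES:
        raise ValueError(f"unknown profile mode: {mode!r}")
    env = dict(base)
    # Drop inherited values so a stale session never leaks into a new launch.
    for key in CONTRACT_KEYS:
        env.pop(key, None)
    if mode != "normal":
        env[PROFILE_MODE_KEY] = mode
        if session_id:
            env[PROFILE_SESSION_ID_KEY] = session_id
        if trigger:
            env[PROFILE_TRIGGER_KEY] = trigger
    return env


def read_profile_context(env: Mapping[str, str]) -> Dict[str, Optional[str]]:
    """Runtime side: read the keys as read-only diagnostic context."""
    mode = env.get(PROFILE_MODE_KEY, "normal")
    if mode not in KNOWN_MODES:
        mode = "normal"  # assumption: unknown values do not enable profiling
    return {
        "mode": mode,
        "session_id": env.get(PROFILE_SESSION_ID_KEY),
        "trigger": env.get(PROFILE_TRIGGER_KEY),
    }
```

The runtime only reads the context here; nothing in read_profile_context lets it promote itself into a profiling mode that the supervisor did not request.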
Top-level operation log
Profiling artifacts are much more useful when they can be aligned with top-level runtime activity.
The runtime should therefore emit a compact supervisor-consumable operation log for events such as:
- slot_started
- slot_promoted
- skill_loaded
- skill_activated
- skill_unloaded
- scenario_started
- workspace_opened
- model_session_started
- tool_invoked
- core_update_prepare
- core_update_apply
- core_update_activate
This log should stay high-level and bounded. It is not intended to mirror every internal event-bus message.
Recommended operations.ndjson record shape:
- contract_version
- event_id
- event
- emitted_at
- session_id
- profile_mode
- slot
- runtime_instance_id
- transition_role
- sample_source
- sequence
- details
Phase 1 should freeze this envelope even if the runtime is not yet emitting real operation traffic beyond supervisor-owned control intents.
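As an illustration of the envelope, the sketch below builds one record with the recommended fields and appends it to an NDJSON file. The field names come from the list above; the value types, the contract_version value of 1, and the helper names make_operation_record and append_operation are assumptions.

```python
import json
import uuid
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict, Optional


def make_operation_record(event: str, *, session_id: str, profile_mode: str,
                          slot: str, runtime_instance_id: str,
                          transition_role: str, sample_source: str,
                          sequence: int,
                          details: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build one operation-log record using the recommended envelope fields."""
    return {
        "contract_version": 1,  # assumed value, for illustration only
        "event_id": uuid.uuid4().hex,
        "event": event,
        "emitted_at": datetime.now(timezone.utc).isoformat(),
        "session_id": session_id,
        "profile_mode": profile_mode,
        "slot": slot,
        "runtime_instance_id": runtime_instance_id,
        "transition_role": transition_role,
        "sample_source": sample_source,
        "sequence": sequence,
        "details": details or {},
    }


def append_operation(path: Path, record: Dict[str, Any]) -> None:
    """Append one record as a single JSON line (NDJSON)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
```

One JSON object per line keeps the log appendable and lets a reader correlate records with telemetry samples by emitted_at and sequence without parsing the whole file.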
Persisted profiling state
In addition to the current supervisor state files, the target model should add local profiling storage under supervisor state.
Recommended target layout:
- state/supervisor/memory/runtime.json
- state/supervisor/memory/telemetry.ndjson
- state/supervisor/memory/sessions/<session_id>/summary.json
- state/supervisor/memory/sessions/<session_id>/operations.ndjson
- state/supervisor/memory/sessions/<session_id>/artifacts/...
- state/supervisor/memory/sessions/index.json
Recommended summary fields:
- session_id
- slot
- runtime_instance_id
- transition_role
- profile_mode
- trigger_reason
- trigger_threshold
- baseline_rss_bytes
- peak_rss_bytes
- rss_growth_bytes
- started_at
- finished_at
- suspected_leak
- top_growth_sites
- operation_window
- published_to_root
- artifact_refs
The important rule is that supervisor owns the diagnostic session record even if runtime produces the raw snapshots.
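A sketch of how the supervisor might persist a session summary under the recommended layout is shown below. The helper names are hypothetical, state_root stands for the state directory, and computing rss_growth_bytes as peak minus baseline is an assumption about that field's meaning. The write goes through a temporary file and os.replace so a crash cannot leave a half-written summary.json.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Dict


def session_dir(state_root: Path, session_id: str) -> Path:
    """Resolve <state_root>/supervisor/memory/sessions/<session_id>."""
    return Path(state_root) / "supervisor" / "memory" / "sessions" / session_id


def build_summary(session_id: str, baseline_rss_bytes: int,
                  peak_rss_bytes: int, **fields: Any) -> Dict[str, Any]:
    """Assemble a summary; growth is assumed to be peak minus baseline."""
    summary = dict(fields)
    summary.update(
        session_id=session_id,
        baseline_rss_bytes=baseline_rss_bytes,
        peak_rss_bytes=peak_rss_bytes,
        rss_growth_bytes=peak_rss_bytes - baseline_rss_bytes,
    )
    return summary


def write_summary(state_root: Path, summary: Dict[str, Any]) -> Path:
    """Atomically write summary.json for the session named in the summary."""
    target = session_dir(state_root, summary["session_id"]) / "summary.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=str(target.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(summary, fh, indent=2, sort_keys=True)
        os.replace(tmp_name, target)  # atomic rename within one directory
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return target
```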
Phase 1 implementation baseline
The current Phase 1 implementation establishes the contract and storage baseline without yet enabling automatic restart-into-profile behavior.
Current implementation artifacts:
- src/adaos/services/supervisor_memory.py
  - supervisor-owned schema normalization for memory telemetry samples, runtime state, session summaries, operation events, and artifact refs
  - dedicated state-path helpers for supervisor memory storage
- state/supervisor/memory/runtime.json
  - persisted memory profiling runtime contract and current live-mode summary
- state/supervisor/memory/sessions/index.json
  - persisted index for profiling sessions
- state/supervisor/memory/sessions/<session_id>/operations.ndjson
  - contract-stable per-session operation log used for later growth correlation
Current implementation surfaces:
- GET /api/supervisor/memory/status
- GET /api/supervisor/memory/telemetry
- GET /api/supervisor/memory/incidents
- GET /api/supervisor/memory/sessions
- GET /api/supervisor/memory/sessions/{session_id}
- GET /api/supervisor/memory/sessions/{session_id}/artifacts/{artifact_id}
- POST /api/supervisor/memory/profile/start
- POST /api/supervisor/memory/profile/{session_id}/stop
- POST /api/supervisor/memory/publish
Current implementation scope now spans the completed Phase 1 baseline plus the first active Phase 2 slice:
- keeps the authority boundary, launch contract, operation log, and local session store under supervisor ownership
- records rolling process-family telemetry under state/supervisor/memory/telemetry.ndjson
- persists baseline RSS, growth, slope, telemetry cadence, and compact suspicion state in runtime.json
- exposes implemented launch modes normal, sampled_profile, and trace_profile as supervisor-managed runtime truth
- lets profile/start and policy-created sessions converge through the same requested-profile workflow instead of separate ad-hoc paths
- applies requested profile mode through a controlled supervisor restart using the Phase 1 launch contract keys
- creates a supervisor-owned profiling session automatically when telemetry crosses both growth and slope thresholds
- materializes local tracemalloc start/final/top-growth artifacts under state/supervisor/memory/sessions/<session_id>/artifacts
- adds trace-oriented tracemalloc traceback artifacts when trace_profile is selected so that trace mode yields richer diagnostics than the sampled baseline
- records top growth sites and artifact refs back into the supervisor-owned session summary when a profiled runtime exits
- suppresses repeated policy-triggered restart loops with cooldown/circuit-breaker guards before opening another automatic profiling session
- exposes telemetry tail and richer per-session inspection so operators can inspect growth samples, operation log, and collected artifacts together
- exposes explicit retry flow for failed/cancelled/stopped profiling sessions instead of overloading manual start semantics
Current implementation control mode is phase2_supervisor_restart:
- profile/start creates a supervisor-owned profiling request and the monitor applies it through restart-into-profile
- profile/stop clears the requested mode and lets the monitor converge the runtime back to normal
- suspicion policy can create a sampled_profile request automatically when growth remains both large and steep
- publish records operator intent locally and now attempts the first dedicated root summary publication path, persisting published_ref/publish_result back into the supervisor-owned session record
- retry-created sessions now carry explicit retry-chain metadata (retry_of_session_id, retry_root_session_id, retry_depth) so later incidents can be grouped without reconstructing lineage from operations alone
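The retry-chain metadata can be derived from the session being retried, as in this sketch. The three field names come from the text above; their exact semantics (root points at the first session in the chain, depth counts retries from it) and the helper name retry_metadata are assumptions.

```python
from typing import Any, Dict


def retry_metadata(previous: Dict[str, Any]) -> Dict[str, Any]:
    """Derive retry-chain fields for a new session that retries `previous`."""
    # A first-generation session has no root yet, so it becomes the root.
    root = previous.get("retry_root_session_id") or previous["session_id"]
    return {
        "retry_of_session_id": previous["session_id"],
        "retry_root_session_id": root,
        "retry_depth": int(previous.get("retry_depth") or 0) + 1,
    }
```

With the root id carried on every retry, grouping incidents becomes a single equality filter on retry_root_session_id rather than a walk through operation logs.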
Current implementation deliberately does not yet:
- publish heavy profiling artifacts to root
The first active Phase 3 slice now exists:
- supervisor publishes memory-profile summaries to root through a dedicated memory_profile report family
- root-side retrieval can list those summaries by hub, optional session id, compact state filters, and suspected-only filters before heavy artifact transport is added
- operator surfaces can open one remotely published memory-profile session directly to inspect RSS deltas, retry lineage, telemetry tail size, and artifact summary metadata
- operator surfaces can inspect that remote summary path through adaos hub root reports --kind memory-profile
- root can expose an artifact catalog with explicit publish-policy status (inline_available, size_limit_exceeded, kind_not_allowed, and similar states) for one published session
- root can serve the currently allowed inline JSON artifact payloads for one published session, while heavy or disallowed artifacts remain local-only until a later transport policy is added
- operator tooling now has a normalized root-side delivery contract for artifacts: root can answer root_inline_content directly for small published JSON payloads, and can return an explicit local_control_pull contract for heavier local artifacts, including chunked utf-8/base64 transfer metadata for the direct pull path without pretending that those artifacts are replicated at root
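The publish-policy decision and the chunked pull path can be illustrated with the sketch below. The status and delivery names are taken from the text above; the mapping between them, the size limit parameter, the chunk field names, and the helper names are assumptions. Only the base64 variant of chunking is shown.

```python
import base64
from typing import Dict, List, Set


def classify_artifact(kind: str, size_bytes: int, inline_kinds: Set[str],
                      inline_limit_bytes: int) -> Dict[str, str]:
    """Pick a publish-policy status and delivery contract for one artifact."""
    if kind not in inline_kinds:
        return {"status": "kind_not_allowed", "delivery": "local_control_pull"}
    if size_bytes > inline_limit_bytes:
        return {"status": "size_limit_exceeded", "delivery": "local_control_pull"}
    return {"status": "inline_available", "delivery": "root_inline_content"}


def chunk_base64(data: bytes, chunk_size: int) -> List[Dict[str, object]]:
    """Split raw bytes into base64-encoded chunks for a direct pull."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        {
            "index": index,
            "encoding": "base64",
            "data": base64.b64encode(data[offset:offset + chunk_size]).decode("ascii"),
        }
        for index, offset in enumerate(range(0, len(data), chunk_size))
    ]


def reassemble(chunks: List[Dict[str, object]]) -> bytes:
    """Rebuild the original bytes from chunks, tolerating out-of-order arrival."""
    ordered = sorted(chunks, key=lambda c: c["index"])
    return b"".join(base64.b64decode(str(c["data"])) for c in ordered)
```

Chunking the raw bytes before encoding keeps each chunk independently decodable, so a client can resume a pull at any chunk index.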
Target-state integration: ProfileOps
The next target-state step for this profiling work should be explicitly named:
ProfileOps
ProfileOps means:
- supervisor keeps ownership of profiling policy, profiling-mode restarts, local telemetry, sessions, and artifacts
- root keeps the memory_profile report family as the publication and retrieval substrate
- Root MCP Foundation publishes typed profiler tools as the governed operational surface over that supervisor-owned state
This target state is intentionally not:
- direct remote control of supervisor-only endpoints by external MCP clients
- treating root report endpoints as if they were already the MCP product surface
- bypassing root policy, scope, and audit for profiling writes
The desired layering is:
supervisor profiling authority
  -> local API, session store, telemetry, artifacts
root memory_profile reports
  -> replicated summaries and selected artifacts
Root MCP Foundation / ProfileOps
  -> typed profiler reads
  -> typed bounded profiler controls
  -> scope checks, capability checks, audit, and client-facing contracts
Under ProfileOps, profiling should become a first-class MCP Operational Surface, not a side channel attached after the fact.
Local control surfaces
The target local supervisor memory API should include read-only status and explicit operator controls:
- GET /api/supervisor/memory/status
- GET /api/supervisor/memory/sessions
- GET /api/supervisor/memory/sessions/{session_id}
- POST /api/supervisor/memory/profile/start
- POST /api/supervisor/memory/profile/{session_id}/stop
- POST /api/supervisor/memory/publish
The browser-safe read-only surface should eventually expose a compact memory incident summary without exposing mutating controls.
Phase 1 and the early Phase 2 slice expose a compact browser-safe memory summary:
GET /api/supervisor/public/memory-status
That surface is intentionally small and read-only:
- current profile/control mode
- requested profiling intent, if any
- suspicion state
- compact baseline/growth summary
- session counters
- compact last-session summary
For manual controls, the safety policy should be explicit:
- manual profile start must be rejected while a core transition is already active
- only one active profiling intent/session may exist at a time unless a future multi-session policy is documented explicitly
- publish may record an operator request during Phase 1, but must not claim that root publication has completed until an explicit ack exists
- low-memory or degraded nodes may still downgrade artifact collection even when restart-into-profile is supported
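The first three safety rules can be expressed as small admission checks, as sketched here. The function names, the rejection reason strings, and the publish state labels are hypothetical; only the rules themselves come from the list above.

```python
from typing import Optional, Tuple


def admit_manual_profile_start(core_transition_active: bool,
                               active_session_id: Optional[str]) -> Tuple[bool, str]:
    """Decide whether a manual profile start may be accepted right now."""
    if core_transition_active:
        return False, "core_transition_active"
    if active_session_id is not None:
        # Single-session policy: one active intent/session at a time.
        return False, "profiling_session_already_active"
    return True, "accepted"


def publish_state(requested: bool, root_ack: bool) -> str:
    """Report publication honestly: 'published' only after an explicit ack."""
    if root_ack:
        return "published"
    return "publish_requested" if requested else "not_requested"
```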
Root retrieval model
Profiling evidence should follow the same general remote-access philosophy as other root control reports:
- the node keeps local authoritative copies first
- supervisor publishes summaries asynchronously
- root indexes those summaries by hub, subnet, and zone
- heavier artifacts can be fetched only when requested and authorized
Target root-facing capabilities:
- ingest a memory-profile summary report from a hub
- list memory-profile incidents by hub_id, subnet_id, and zone
- fetch a single profiling session summary
- retrieve profiling artifacts when policy and size constraints allow it
This should remain a separate report family rather than overloading generic lifecycle reports.
Safety and recovery rules
Memory profiling must not reduce the node to an unrecoverable state.
Supervisor policy should therefore preserve these rules:
- profiling restarts must stay bounded by timeouts
- repeated profile-trigger loops must trip a circuit breaker
- low-memory devices may skip heavy artifact collection and keep only summaries
- candidate prewarm and warm-switch memory admission must remain separate from leak suspicion policy
- a confirmed leak may trigger rollback or quarantine policy, but profiling itself must not silently mutate slot authority
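The circuit-breaker rule for repeated profile-trigger loops could be sketched as below. ProfileRestartBreaker and its three parameters are hypothetical: the sketch allows at most max_restarts automatic profiling restarts per sliding window and enforces a cooldown after each one.

```python
from collections import deque
from typing import Deque


class ProfileRestartBreaker:
    """Bound automatic restart-into-profile attempts (illustrative only)."""

    def __init__(self, max_restarts: int, window_s: float, cooldown_s: float) -> None:
        self.max_restarts = max_restarts
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self._restarts: Deque[float] = deque()

    def allow(self, now: float) -> bool:
        """Return True if another automatic profiling restart is permitted."""
        # Forget restarts that fell out of the sliding window.
        while self._restarts and now - self._restarts[0] > self.window_s:
            self._restarts.popleft()
        # Cooldown: refuse if the last restart was too recent.
        if self._restarts and now - self._restarts[-1] < self.cooldown_s:
            return False
        # Breaker: refuse once the window budget is spent.
        return len(self._restarts) < self.max_restarts

    def record(self, now: float) -> None:
        """Record that a profiling restart was actually performed."""
        self._restarts.append(now)
```

A typical call pattern is to check allow(now) before opening an automatic session and call record(now) only once the restart has been issued, so rejected attempts do not consume budget.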
Migration plan
Phase 1 - Memory watchdog architecture and state model
- freeze the supervisor-owned memory profiling authority boundary
- define memory telemetry, profiling session, and artifact metadata schemas
- document profiler adapter strategy with tracemalloc as the default automated path
- define top-level operation-log contracts needed to correlate memory growth with runtime behavior
- freeze the runtime launch contract for future restart-into-profile mode
- expose explicit operator profiling intents in supervisor-owned APIs and state, even before policy-driven automatic restart is implemented
Phase 2 - Local memory telemetry and profiler-mode restart
- add supervisor-owned rolling process-family memory telemetry
- add suspicion policy based on threshold + slope + stabilization window
- add explicit runtime launch modes normal, sampled_profile, and trace_profile
- implement restart-into-profile flow for the active slot when memory policy is breached using the Phase 1 launch contract
- persist local profiling sessions and summaries under state/supervisor/memory
Phase 3 - Root publication and remote retrieval
- publish memory-profile summaries to root as a dedicated report family
- scope retrieval by hub_id, subnet_id, and zone
- expose lightweight operator retrieval flows before large artifact transport
- keep local-first retention so profiling evidence survives root/network outages
Phase 3.5 - ProfileOps architecture fixation
- declare ProfileOps as the goal-state convergence of supervisor profiling and Root MCP
- freeze the first profiler tool ids and capability vocabulary for MCP-facing reads and writes
- define the split between root-published profiling evidence and target-routed profiling control actions
- document that supervisor remains profiling authority while Root MCP becomes the typed external surface
Exit criteria:
- architecture docs consistently describe profiling as a supervisor-owned surface projected through Root MCP
- the system no longer relies on implicit knowledge of raw supervisor/root endpoints to explain the target state
Phase 4 - ProfileOps read-only MCP surface
- add read-oriented profiler contracts to Root MCP for status, incidents, sessions, artifact catalogs, and artifact retrieval
- expose the same read surface through RootMcpClient
- expose the same read surface through the local Codex stdio bridge
- keep the existing root report endpoints as substrate and compatibility paths
Exit criteria:
- an MCP client can inspect profiling state and published evidence without bespoke knowledge of /v1/hubs/memory_profile/*
- profiler reads participate in standard Root MCP policy and audit flows
Phase 5 - ProfileOps bounded control surface
- add typed MCP write tools for start_profile, stop_profile, retry_profile, and publish_profile
- gate those tools on explicit target-published profiler capabilities
- keep control execution bounded and environment-scoped in the same style as other hub.* write operations
- preserve supervisor as the only authority that decides requested profile mode convergence and session lifecycle
Exit criteria:
- profiling writes can be triggered through Root MCP without bypassing root policy and audit
- profiler control paths are no longer special-cased outside the operational tool model
Phase 6 - Unified audit and consumer convergence
- align profiler actions with the shared Root MCP operational event model
- let Infrascope and Codex consume the same typed profiler contracts
- make capability-usage and activity views include profiler operations without a second audit vocabulary
- reserve raw endpoints for transport substrate, debugging, and compatibility rather than primary integration
Exit criteria:
- profiling has one governed operational surface with both human and agent consumers
- web and MCP clients do not need separate profiler-specific integration logic
Phase 7 - Documentation and baseline supervisor state model
- freeze supervisor authority boundary
- define persisted attempt schema
- teach CLI to prefer supervisor-style state when available
Current implementation baseline now covers this phase:
- supervisor-owned update attempt state persists under an explicit contract version instead of an ad-hoc free-form payload
- browser-safe and operator-facing update surfaces both expose normalized supervisor attempt state
- adaos autostart update-status prefers supervisor-backed state first, then falls back to the public supervisor transition surface before legacy runtime admin status
- operator-facing field meanings for the normalized attempt payload are documented in docs/guides/supervisor-update-attempts.md
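The status resolution order described above (supervisor-backed state, then the public supervisor transition surface, then legacy runtime admin status) can be sketched as an ordered fallback chain. The function name, the source labels, and the use of plain callables in place of real HTTP clients are assumptions for illustration.

```python
from typing import Any, Callable, Dict, Iterable, Optional, Tuple

Source = Tuple[str, Callable[[], Optional[Dict[str, Any]]]]


def resolve_update_status(sources: Iterable[Source]) -> Dict[str, Any]:
    """Return the first available status together with the source that served it."""
    for name, fetch in sources:
        try:
            payload = fetch()
        except OSError:
            # Connection refused, timeouts, and similar: try the next source.
            continue
        if payload is not None:
            return {"source": name, "status": payload}
    return {"source": None, "status": None}
```

Reporting which source answered lets the CLI tell an operator that a status came from the supervisor's transition surface rather than from a live runtime.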
Phase 8 - Resilience before full split
- add stale-attempt timeout handling
- stop clearing update plan before validation commit
- emit explicit failure state for interrupted restart/apply paths
Current implementation baseline now covers this phase:
- stale supervisor attempts now expire deterministically for both in-flight restart/apply transitions and awaiting_root_restart
- autostart keeps the pending update plan through launch and clears it only after validation reaches a terminal commit or failure
- boot-time recovery no longer degrades interrupted restarting/applying paths to generic idle; it writes an explicit failed transition state instead
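Deterministic expiry of a stale attempt might look like the sketch below. The state names restarting, applying, and awaiting_root_restart come from the text above; the attempt field names (state, updated_at, previous_state, failure_reason), the per-state deadlines, and the helper name are hypothetical.

```python
from typing import Any, Dict, Mapping


def expire_stale_attempt(attempt: Dict[str, Any], now: float,
                         deadlines_s: Mapping[str, float]) -> Dict[str, Any]:
    """Turn an in-flight attempt into an explicit failure once its deadline passes."""
    state = attempt.get("state")
    if state not in deadlines_s:
        return attempt  # terminal or untracked states are left untouched
    if now - float(attempt["updated_at"]) <= deadlines_s[state]:
        return attempt  # still within its deadline
    expired = dict(attempt)
    expired.update(
        state="failed",
        previous_state=state,
        failure_reason=f"stale_{state}_timeout",
        updated_at=now,
    )
    return expired
```

Writing failed together with the previous state, instead of resetting to idle, keeps the interrupted transition visible to operators after recovery.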
Phase 9 - Introduce standalone supervisor process
- add adaos supervisor serve
- move update state and admin/update endpoints into supervisor
- make systemd unit target supervisor instead of runtime
Phase 10 - Child runtime management
- launch runtime as a child process of supervisor
- move runtime restart and validation logic out of autostart_runner
- persist child process metadata and restart reason in supervisor state
Phase 11 - Sidecar alignment
- keep adaos-realtime lifecycle under supervisor in managed topology
- keep runtime-owned startup/shutdown only as standalone fallback when supervisor is absent
- keep sidecar contract transport-only
- keep warm candidates memory-bounded: warm-switch admission should account for runtime process-family RSS, and candidate prewarm should defer external service-skill startup until cutover
Phase 12 - Operator UX
- adaos autostart/update-status resolves to supervisor API first
- adaos autostart update-defer can reschedule a planned/countdown update window without losing the current supervisor attempt context
- adaos node reliability now falls back to browser-safe supervisor transition state and reports runtime_restarting_under_supervisor instead of only connection failure when runtime :8777 is temporarily unavailable during a managed transition
- Infra State surfaces supervisor attempt state alongside runtime readiness, including planned update, root promotion pending, root restart in progress, and subsequent transition queued
- Infra State and Infrascope surface skill runtime migration diagnostics for the current or last core update attempt
- browser header/status surfaces poll a read-only supervisor transition view so controlled restarts are not shown only as generic offline
- canonical control-plane projections keep supervisor-owned restart/promotion phases visible even when runtime API readiness has not converged yet
Exit criteria
The supervisor target state is complete when:
- update status remains available while runtime is down
- interrupted updates resolve to validated, failed, or rolled_back without manual file edits
- stale restarting/applying states expire deterministically
- rollback is a supervisor decision, not only a runtime-side best effort
- sidecar remains transport-only and does not absorb process/update authority