Issue Tracker
This document is the single living issue tracker for active AdaOS stabilization and delivery work.
Use sections as goals. Each goal owns task groups that can be extended, executed, and closed without creating a separate tracker document.
Device Identity and Access Usability
Goal
Make node/browser settings explain identity, editable human names, lifetime, and detach behavior without leaking transport implementation details into the operator UI.
Current Status
Snapshot date: 2026-05-14.
Local debugging found that the local hub can still be addressed by legacy
member:<local_node_id> refs in desktop settings. That is an addressing alias,
not a true member identity. Device access must normalize this alias to
hub:<subnet_id> so policies, name storage, and disabled hub-only actions are
derived from .adaos/node.yaml.
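The alias rule above can be sketched as a small pure function. This is a minimal illustration, not the shipped code: `LocalNodeConfig` and `normalize_device_ref` are hypothetical names, and the config fields are assumed to mirror what is read from .adaos/node.yaml.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalNodeConfig:
    """Hypothetical view of the fields read from .adaos/node.yaml."""
    node_id: str
    subnet_id: str
    role: str  # "hub" or "member"


def normalize_device_ref(ref: str, local: LocalNodeConfig) -> str:
    """Map the legacy member:<local_node_id> alias to hub:<subnet_id>.

    Only the local hub's own alias is rewritten; every other ref,
    including real member identities, passes through unchanged.
    """
    if local.role == "hub" and ref == f"member:{local.node_id}":
        return f"hub:{local.subnet_id}"
    return ref
```

Keeping the rewrite in one function means policies, name storage, and disabled-action checks can all call it before looking anything up, instead of each handling the alias separately.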
Tasks
DIAU-001: Normalize local hub identity in settings flows
Status: in progress.
Actions:
- [x] Treat member:<local_node_id> as hub:<subnet_id> when local node config says role: hub.
- [x] Keep hub display names editable through .adaos/node.yaml: node.node_names.
- [x] Keep hub lifetime and detach actions disabled with explicit reasons.
- [ ] Verify live desktop settings now shows kind=hub and ID hub:<subnet_id> after hub restart/client reload.
Human verification:
- Open local node Settings and confirm ID is hub:<subnet_id>, kind is hub, Save name enables after editing, and lifetime/detach remain disabled with policy hints.
DIAU-002: Harden settings modal controls
Status: in progress.
Actions:
- [x] Use native modal controls with pointer/click duplicate suppression for Settings actions.
- [x] Keep Close, Apps, Marketplace, Hide, Save name, Lifetime, and Detach responsive after editing text fields.
- [ ] Add broader browser regression coverage once the modal E2E harness is available.
Human verification:
- Open Settings, edit Name, click Close.
- Open Settings, click Hide, then Close.
- Open Settings, edit Name, then Apps/Marketplace.
DIAU-003: Clarify browser settings identity
Status: in progress.
Actions:
- [x] Show immutable browser Device ID separately from editable Browser name.
- [x] Add explicit Save name flow for browser-name edits.
- [x] Document that the name is hub-side access policy state and is not written back to the remote browser.
- [ ] Implement immediate remote browser logout on Detach, or add a control-plane event if the current runtime has no safe rail for it.
Human verification:
- Open [node] Browser settings, confirm Device ID is visible, edit Browser name, Save name, then refresh Browsers and confirm the new label remains.
DIAU-004: Align terminology around subnet endpoints
Status: open.
Actions:
- [x] Prefer "subnet endpoint" for software participants attached to a subnet: browser, member node, LLM agent, IoT bridge, or future headless client.
- [x] Keep "device" for the operator-facing managed/trusted endpoint class.
- [x] Keep "client" as a policy class for temporary browser access, not as the general architectural term.
- [ ] Audit UI copy and docs for places where "device", "browser", "member", and "client" are still conflated.
UI Runtime Diagnostics and Skill-Scoped Logs
Goal
Keep browser-side UI failures, in-process skill runtime logs, and service-skill logs attached to the skill being developed so LLM-assisted debugging can stay in the correct entity context.
Current Status
Snapshot date: 2026-05-13.
Implemented baseline is documented in
docs/architecture/ui-runtime-diagnostics.md.
Additional checkpoint on 2026-05-19: browser RuntimeDebugService keeps a
bounded ring in localStorage under adaos.runtime_debug.logs.v1, and the
node ingests bounded runtime-debug breadcrumbs through
/api/node/ui/diagnostics. The export path remains diagnostic-only and does not
write browser logs into primary Yjs state.
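A node-side bounded ingest for these breadcrumbs could look roughly like the sketch below. The class name, both caps, and the "message" field are assumptions for illustration; the real limits and payload shape are not stated in this tracker.

```python
from collections import deque

MAX_ENTRIES = 200          # assumed cap on retained breadcrumbs
MAX_MESSAGE_CHARS = 2000   # assumed cap per breadcrumb message


class BreadcrumbRing:
    """Keeps only the newest breadcrumbs, dropping the oldest on overflow."""

    def __init__(self, max_entries: int = MAX_ENTRIES) -> None:
        self._ring: deque = deque(maxlen=max_entries)

    def ingest(self, batch: list) -> int:
        """Append a batch, truncating oversized messages. Returns entries accepted."""
        accepted = 0
        for entry in batch:
            message = str(entry.get("message", ""))[:MAX_MESSAGE_CHARS]
            self._ring.append({**entry, "message": message})
            accepted += 1
        return accepted

    def snapshot(self) -> list:
        """Return retained breadcrumbs, oldest first."""
        return list(self._ring)
```

Bounding both the entry count and the per-entry size keeps a chatty browser from growing diagnostic state without limit.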
Tasks
UILOG-001: Complete skill-scoped diagnostics pipeline
Status: in progress.
Actions:
- [x] Add explicit skill log paths to CurrentSkill and PathProvider.
- [x] Route in-process skill-context adaos.* logs to service.<skill>.runtime.log instead of platform-wide adaos.log.
- [x] Send dev-mode browser UI diagnostics to the node.
- [x] Persist browser UI diagnostics to service.<skill>.ui_runtime.log.
- [x] Extend MCP get_skill_logs(skill=...) to include service.<skill>.*.log.
- [ ] Add widget-level ownership metadata for renderer failures that are not modal-owned.
- [x] Add bounded export/ingest for adaos.runtime_debug.logs.v1 so node-side tools can read Dev Browser breadcrumbs through skill/runtime diagnostic logs.
- [ ] Add a typed ABI schema for UI diagnostic payloads.
- [ ] Add rate limiting and duplicate suppression for repeated renderer errors.
- [ ] Feed skill logs into the future LLM skill-debugging MCP workflow.
Browser Startup and Progressive Hydration
Goal
Make the browser desktop usable immediately after login by rendering from available local state first, while live Yjs sync and materialization catch up in the background.
Current Status
Snapshot date: 2026-05-13.
The browser client now keeps a read-only last-good desktop render snapshot in
localStorage. YDocService uses it only as a fallback for missing ui, data,
and registry reads; live Yjs branches always take precedence, and Yjs
IndexedDB persistence remains opt-in. DesktopRendererComponent now binds the
desktop view before initFromHub() resolves, so login no longer blocks first
paint on first Yjs sync/materialization. Runtime data-source 401/403 responses
that arrive while Yjs bootstrap is still pending are now treated as transient
load failures instead of forcing a page reload, preventing startup reload loops
when cached first paint races ahead of live runtime authorization.
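The read precedence described above can be modelled in a few lines. The real client is Angular/TypeScript; this Python model is only for illustration, and `read_branch` and its dict-based arguments are hypothetical.

```python
# Branches that may fall back to the last-good snapshot, per the status note above.
FALLBACK_BRANCHES = {"ui", "data", "registry"}


def read_branch(name: str, live: dict, snapshot: dict):
    """Return the live branch when it exists, else the snapshot for fallback branches.

    A live branch wins even when it is empty: an empty value is still
    live state, not a signal to substitute cached data.
    """
    if name in live:
        return live[name]
    if name in FALLBACK_BRANCHES:
        return snapshot.get(name)
    return None
```

Testing for presence rather than truthiness is what stops the snapshot from masking a legitimately empty live branch.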
Tasks
BSPH-001: Render desktop before Yjs first-sync completion
Status: in progress.
Actions:
- [x] Add a read-only last-good render snapshot separate from Yjs persistence.
- [x] Save snapshots only from live interactive or ready materialization.
- [x] Let desktop schema/UI reads fall back to the snapshot while live branches are absent.
- [x] Start desktop rendering before YDocService.initFromHub() resolves.
- [x] Defer page reload on transient runtime data-source unauthorized responses while Yjs bootstrap is still pending.
- [x] Add focused Angular tests for snapshot fallback and non-blocking desktop startup.
- [ ] Add browser-visible "syncing latest state" affordance for cached first paint, without hiding normal link/Yjs diagnostics.
- [ ] Add an end-to-end timing assertion for login-to-first-desktop-paint once the browser E2E harness is available.
Operational Event Model Roadmap Consolidation
Goal
Keep event, projection, browser/runtime, platform-emitter, and heavy-skill migration work on one master delivery track so the project can execute the target event model without drifting between parallel roadmaps.
Current Status
Snapshot date: 2026-05-15.
The target architecture remains valid, but the documentation had two related roadmaps that could be read as competing priority sources:
- docs/architecture/operational-event-model-roadmap.md
- docs/architecture/projection-subscription-roadmap.md
The master roadmap is now explicitly authoritative. The projection roadmap is now a subordinate detailed checklist for projection record, browser subscription, Yjs adapter, shared dispatcher, Infrascope, and rollout tasks.
The next execution slice is intentionally contract-first:
- freeze a minimal shared event envelope
- align named-entity and status-card ABI work with the event model
- lock projection record and client subscription shapes
- implement the client subscription runtime
- add a shared demanded-projection dispatcher
- validate platform emitters before heavy skill migration
Tasks
OEM-001: Consolidate roadmap authority
Status: in progress.
Actions:
- [x] Mark Operational Event Model Roadmap as the single authoritative delivery sequence.
- [x] Recast Projection Subscription Roadmap as a subordinate detail checklist instead of a separate priority track.
- [x] Record the 2026-05-15 implementation boundary in the target event model: eventbus guardrails, named-entity ABI, node-aware compatibility surfaces, and remaining shared ABI gaps.
- [x] Add a top-level reference execution plan with coverage gates, contract shapes, review checklist, and completion definition.
- [x] Update roadmap progress for named-entity contract/runtime ABI and eventbus hot-topic guardrails.
- [ ] Define the minimal shared event envelope and compatibility rules for existing Event(type, payload, source, ts) producers.
- [ ] Bind STATUS-* work to the platform-emitter phase so status cards do not become a separate monitoring-only architecture.
OEM-002: Lock projection ABI before client/runtime migration
Status: planned.
Actions:
- [ ] Define canonical projection record fields: status, data, meta, error, lifecycle timestamps, version/fingerprint, access metadata, and source ownership.
- [ ] Define browser-written subscription records for page, widget, modal, and pinned panel consumers.
- [ ] Define compatibility rules for legacy Yjs branches during migration.
- [ ] Use registry.named_entities and planned status cards as reference examples before Infrascope migration.
OEM-003: Keep heavy-skill pilots behind platform-emitter validation
Status: planned.
Actions:
- [ ] Allow Infrascope inventory/tests that do not create a parallel projection ABI.
- [ ] Migrate status cards, notifications, diagnostics, or workspace-manager surfaces first through the shared projection contract.
- [ ] Start Infrascope split only after event envelope, projection ABI, client subscriptions, dispatcher, and at least one platform-emitter pilot are materially in place.
Modal Projection and Runtime Recovery Integrity
Goal
Keep desktop and modal data contracts explicit while recovering from missing
runtime projections. A widget that declares kind: y must render from Yjs; a
widget that declares kind: stream must render from stream data. Recovery may
request refresh/project work, but it must not silently substitute direct
skill/API payloads and hide broken projection paths.
Current Status
Snapshot date: 2026-05-11.
Recent local debugging found several related issues:
- Modal data for Subnet Environment, Infra Access, Infrastructure State, and Browsers could appear fixed by direct client fallbacks while the real projection/materialization contract was still broken.
- The full Python suite currently has a collection-order hazard around test modules that stub sys.modules["nats"]; fixing that locally exposes a separate set of pre-existing runtime API expectation failures that need their own cleanup pass.
- Workspace skill changes are delivered through adaos skill push, while root git commits track client/core/tests; CI needs a clearer way to prove the two layers remain compatible.
Tasks
MRI-001: Keep Yjs and stream data-source recovery contract-first
Status: in progress.
Actions:
- [x] Stop direct client fallback payloads from rendering Yjs modal data for the operational projections currently under debug.
- [x] Treat empty browser arrays as valid live Yjs data, not as missing data.
- [ ] Move the temporary client-side recovery registry toward declarative schema metadata such as dataSource.recovery / projection.refresh.
- [ ] Audit modal schemas and ensure each data source uses kind: y or kind: stream intentionally, with no implicit source swapping.
MRI-002: Make workspace skill publishing verifiable
Status: open.
Actions:
- [ ] Add a lightweight verification command or test fixture that confirms the pushed skill version used by tests contains the expected projection handlers.
- [ ] Document the expected workflow: edit workspace skill, run targeted tests, adaos skill push <name> -m ..., then commit root/client changes.
- [ ] Avoid root tests whose only passing implementation lives in ignored .adaos/workspace state unless the skill push/version is part of the test setup.
MRI-003: Restore full pytest suite health after nats test shadowing
Status: open.
Actions:
- [ ] Replace broad sys.modules["nats"] stubs with helpers that prefer the installed nats-py package and only stub when unavailable.
- [ ] After collection is stable, triage the currently exposed runtime API expectation failures separately from modal/projection work.
- [ ] Add a regression that tests/test_nats_ws_transport.py can import nats.errors regardless of test collection order.
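One possible shape for the "prefer installed, stub only when unavailable" helper is sketched below. `ensure_nats_importable` is a hypothetical name, and the stub's contents are a placeholder; a real helper would provide whatever attributes the affected tests actually use.

```python
import importlib
import sys
import types


def ensure_nats_importable() -> bool:
    """Return True when the real nats-py package is used, False when stubbed.

    The stub is installed only if importing the real package fails, so a
    test module collected earlier cannot shadow an installed nats-py.
    """
    try:
        importlib.import_module("nats.errors")
        return True
    except ImportError:
        pass

    # Minimal stand-in; the attribute set here is illustrative only.
    nats_stub = types.ModuleType("nats")
    errors_stub = types.ModuleType("nats.errors")

    class Error(Exception):
        """Placeholder base error for the stubbed nats.errors module."""

    errors_stub.Error = Error
    nats_stub.errors = errors_stub
    sys.modules["nats"] = nats_stub
    sys.modules["nats.errors"] = errors_stub
    return False
```

Registering both "nats" and "nats.errors" in sys.modules is what lets a later `import nats.errors` succeed in the stubbed case, which is the property the planned regression checks.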
MRI-004: Make weather provider behavior explicit
Status: in progress.
Actions:
- [x] Stop showing raw runtime i18n keys when a weather provider returns an error.
- [x] Migrate the legacy OpenWeatherMap endpoint to the no-key Open-Meteo path for local development fallback.
- [ ] Document provider selection and API-key behavior so 401 is actionable instead of looking like a modal rendering bug.
MRI-005: Keep scenario switching fast without hiding rebuild problems
Status: in progress.
Actions:
- [x] Move scenario-switch worker data.webspaces sync out of the ready path; fresh-doc rebuild remains explicit, while listing fanout is post-ready and coalesced.
- [x] Normalize nested/stringified webspace_id values before workspace index reads/writes, and dedupe malformed legacy rows from listing output.
- [ ] Investigate why collect_inputs dominates semantic rebuild time (~300-400 ms locally) and make resolver input collection cheaper without weakening projection contracts.
- [ ] Investigate why fresh-doc switches naturally replace all effective branches; decide whether branch fingerprint reuse can be preserved safely without reintroducing stale Yjs state.
Runtime, Catalog, and Member Sync Integrity
Goal
Make member and hub synchronization trustworthy by separating catalog, workspace source, and active runtime state; applying full lifecycle updates in production paths; and keeping Infrastructure State quiet unless a real operator-visible drift or degraded condition exists.
Success means:
- Member nodes can run without local git for normal production consumption.
- Hub/dev nodes have explicit git requirements for catalog authority and .adaos/dev LLM-assisted development.
- Production updates do not report success until the active runtime has been prepared and activated, not merely refreshed in source workspace.
- Infrastructure State shows the full installed skills/scenarios inventory by default, with a shared Drift only toggle for focused divergence review and compact status icons with tooltips.
- Infrastructure State is a thin operator surface, not the authority for inventory, drift, action eligibility, lifecycle health, or quarantine decisions. The same core-owned contracts must be reusable by other skills and future MCP/LLM developer surfaces.
- Scenario installation and update paths apply the skill lifecycle to required skill dependencies and expose dependency failures as structured operation results.
- Production CLI/control commands run against the active slot venv and code, or refuse state-changing work with an actionable diagnostic.
Current Status
Snapshot date: 2026-05-18.
Stand observations showed that a source refresh can temporarily clear skill
drift markers in Infrastructure State even when the installed active runtime is
still behind the registry. For example, infrastate_skill could appear current
after another skill update while the active runtime remained 0.75.2 and the
catalog had 0.75.3+. This exposed a modeling problem: catalog version,
workspace source version, and active runtime version are currently too easy to
collapse into a single "installed" value.
Code review confirms that git is already optional in the install/materialization
path through GitHub archive fallback. That is appropriate for member nodes, but
hub/dev operation needs a stricter policy because the hub owns catalog refresh,
runtime publishing, and future LLM development in .adaos/dev.
The 2026-05-19 memory checkpoint also found a service-skill observability
issue: rasa_nlu_service_skill health checks were successful, but the embedded
HTTP server wrote every /health probe to service.rasa_nlu_service_skill.log
every two seconds. That did not explain the core runtime RSS growth, but it
created a 162 MiB service log and unnecessary cache/journald pressure. The
service skill now suppresses /health access logs while keeping non-health
request/error logs visible.
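A log filter is one way to get this behavior. The sketch below assumes the embedded server emits access lines through the standard logging module in the common "GET /health HTTP/1.1" shape; neither assumption is confirmed by this tracker, and `SuppressHealthAccess` is a hypothetical name.

```python
import logging
import re

# Matches access lines such as: "GET /health HTTP/1.1" 200
_HEALTH_PROBE = re.compile(r"\b(?:GET|HEAD) /health(?:\?\S*)? ")


class SuppressHealthAccess(logging.Filter):
    """Drop routine /health access records; keep everything else."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Warnings and errors are always kept, even for /health.
        if record.levelno >= logging.WARNING:
            return True
        return _HEALTH_PROBE.search(record.getMessage()) is None
```

Attaching the filter to the access logger (for example with `logger.addFilter(SuppressHealthAccess())`) removes the two-second probe noise while leaving failed health checks visible.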
The same checkpoint found that .40 could finish booting the new slot while
core_update/status.json stayed at restarting/launch. This was a stale
supervisor finalization state, not a failed runtime. Supervisor reconciliation
now finalizes boot status when the managed runtime is API-ready on the target
slot, even if the previous attempt record has already been completed.
Product Rules
- Infrastructure State shows full Installed skills/scenarios by default so the operator keeps the complete picture. A shared Drift only control filters the same inventory to divergence/degradation rows when needed.
- Status is represented with icons and tooltips: behind catalog, ahead of catalog, workspace differs, active runtime differs, catalog unavailable, git unavailable, runtime inactive, and dependency lifecycle failed.
- A source workspace version is never treated as proof that the runtime is active. It can only be shown as workspace_source_version.
- In production, source-only refreshes are diagnostic/dev operations. Normal update actions must complete source refresh, prepare, activate, and projection rebuild as one operation.
- Dev workspace flows are explicit and scoped to .adaos/dev; they may expose source/runtime divergence intentionally.
- infrastate_skill may format labels, icons, streams, modal details, and local UI state. It must not be the long-term owner of catalog lookup, drift classification, lifecycle policy, action gating, scenario health, or operation artifact retention.
- Any operator-facing inventory or lifecycle action exposed in infrastate must have a corresponding core/API contract that can also be exposed through MCP for LLM-assisted development.
Tasks
RCMS-001: Enforce git policy by role and deployment mode
Status: in progress.
Progress: 15%.
Actions:
- [ ] Keep the no-git GitHub archive materialization path for member production nodes.
- [ ] Require git on hub when dev mode or LLM development workspace features are enabled.
- [ ] In hub production, either require git for catalog-authoritative update flows or enter an explicit degraded mode for catalog refresh and dev commands.
- [ ] Persist git.available, git.mode, git.source, and git.reason into diagnostics/capacity state.
- [ ] Surface git state in Infrastructure State only when it blocks an action or makes a displayed drift result stale.
- [ ] Add tests for hub/dev git-required behavior and member no-git archive install/update behavior.
- [x] Keep skill push / scenario push workspaces clean after a rebase content conflict by aborting the interrupted rebase and surfacing an actionable conflict diagnostic.
- [x] Bound sparse-checkout stale-file recovery so production auto-cleanup can remove repeated stale blockers without entering an unbounded retry loop.
Implementation notes:
- This is a guardrail before LLM-assisted conflict resolution: a detected git conflict now leaves the local commit intact and the worktree clean, so a future root/LLM resolver can build a bounded conflict pack from a stable repository state.
- Sparse-checkout stale blocker recovery is now iterative but capped through ADAOS_SPARSE_CHECKOUT_BLOCKER_RETRIES, preserving deterministic failure when the workspace cannot be safely repaired.
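The capped recovery loop might be structured along these lines. Only the environment variable name comes from the note above; `recover_stale_blockers`, the callback signatures, and the default of 3 are assumptions for illustration.

```python
import os
from typing import Callable, List

DEFAULT_RETRIES = 3  # assumed default; the real default is not stated here


def recover_stale_blockers(
    checkout: Callable[[], List[str]],
    remove: Callable[[str], None],
) -> int:
    """Retry checkout, removing reported stale blockers, up to a fixed cap.

    `checkout` returns the list of blocking paths (empty on success).
    Returns the number of cleanup rounds used; raises when the cap is hit.
    """
    raw = os.environ.get("ADAOS_SPARSE_CHECKOUT_BLOCKER_RETRIES", DEFAULT_RETRIES)
    cap = max(0, int(raw))
    blockers: List[str] = []
    for round_no in range(cap + 1):
        blockers = checkout()
        if not blockers:
            return round_no
        if round_no == cap:
            break  # cap reached: stop cleaning and fail deterministically
        for path in blockers:
            remove(path)
    raise RuntimeError(
        f"sparse checkout still blocked after {cap} cleanup rounds: {blockers}"
    )
```

Raising with the remaining blocker list keeps the failure actionable for the operator instead of retrying indefinitely.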
RCMS-002: Separate catalog, workspace source, and active runtime versions
Status: in progress.
Progress: 60%.
Actions:
- [x] Extend Infrastructure State skill/scenario rows with catalog_version, workspace_source_version, active_version, and skill slot.
- [x] Add catalog_commit, catalog_source, and runtime_bucket to the authoritative inventory model.
- [x] Classify behind/ahead/different drift independently for catalog vs workspace and catalog vs active runtime.
- [x] Add explicit unknown, unavailable, and no-git drift classifications.
- [ ] Add explicit stale-catalog drift classification once catalog snapshot freshness metadata is persisted.
- [ ] Treat workspace source as a fallback only when explicitly marked source=workspace_fallback.
- [x] Return Installed skills/scenarios to the full inventory view and add a shared Drift only toggle.
- [x] Order inventory columns by source flow: Catalog, Workspace, workspace actions, Runtime, runtime actions.
- [x] Register renderer table icons and render drift statuses as icons with tooltips.
- [x] Limit skill Activate visibility to missing runtime or workspace/runtime divergence.
- [x] Add a push-comment modal for skill workspace publish actions.
- [ ] Add row-level details/logs modal wiring so the current Logs icon opens the relevant skill diagnostics instead of only returning paths in the action result.
- [ ] Extend scenario source/runtime action buttons once scenario update/push lifecycle has the same safe operation surface as skills.
- [x] Add a regression proving a source refresh cannot clear a drift marker unless the active runtime version also changes.
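The independent comparisons can be sketched as follows. This covers only a subset of the classifications listed above (no unavailable, no-git, or stale-catalog states), and `Drift`, `parse_version`, `classify`, and `classify_row` are hypothetical names rather than the inventory model's real API.

```python
from enum import Enum
from typing import Dict, Optional, Tuple


class Drift(str, Enum):
    CURRENT = "current"
    BEHIND = "behind"
    AHEAD = "ahead"
    UNKNOWN = "unknown"


def parse_version(text: Optional[str]) -> Optional[Tuple[int, ...]]:
    """Parse '0.75.6' or 'v0.75.6' into a comparable tuple; None if unparseable."""
    if not text:
        return None
    try:
        return tuple(int(part) for part in text.strip().lstrip("vV").split("."))
    except ValueError:
        return None


def classify(candidate: Optional[str], reference: Optional[str]) -> Drift:
    """Classify `candidate` relative to `reference`."""
    c, r = parse_version(candidate), parse_version(reference)
    if c is None or r is None:
        return Drift.UNKNOWN
    if c < r:
        return Drift.BEHIND
    if c > r:
        return Drift.AHEAD
    return Drift.CURRENT


def classify_row(catalog: Optional[str], workspace: Optional[str],
                 active: Optional[str]) -> Dict[str, Drift]:
    """Compare workspace and active runtime against the catalog separately."""
    return {
        "workspace_vs_catalog": classify(workspace, catalog),
        "active_vs_catalog": classify(active, catalog),
    }
```

With catalog and workspace at 0.75.3 and the active runtime at 0.75.2, the row reports the workspace as current and the runtime as behind, so a source refresh alone cannot clear the runtime marker.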
RCMS-003: Make production skill updates runtime-atomic
Status: in progress.
Progress: 65%.
Actions:
- [x] Replace the API skills.update production path with source refresh, inactive-slot prepare, lifecycle activation, and webspace/projection rebuild.
- [x] Keep the previous active runtime if prepare or activation fails.
- [x] Return an operation result containing source version, active before/after version, active before/after slot, and migration result.
- [x] Include explicit prepared version, lifecycle stage list, and failure reason in a stable operation schema.
- [ ] Restrict lightweight runtime_update source-copy behavior to dev/debug flows where source/runtime drift is expected and visible.
- [x] Make API update success require active runtime convergence in production.
- [x] Make unqualified adaos skill activate <skill> prepare and activate the workspace source version when it differs from the active runtime.
- [x] Refresh same-runtime-bucket prepared sources when the workspace patch version advances, even if an earlier activation already moved the active version marker.
- [x] Correct CLI runtime drift direction so a newer workspace source reports runtime-behind, and semantically equal v0.75.6 / 0.75.6 versions do not show drift.
- [x] Add tests around update failure and drift visibility.
- [x] Add rollback-to-previous-active coverage for partial activation failures.
- [x] Suppress noisy /health access logs in rasa_nlu_service_skill and verify service-skill reinstall/restart picks up the new runtime code.
Implementation notes:
- refresh_skill_runtime now returns a stable operation schema with prepared_version, prepared_slot, activated_slot, failed_stage, failure_reason, and ordered lifecycle_stages.
- API skills.update returns the same runtime refresh payload on convergence failures through the 409.detail.runtime_refresh diagnostic object.
- Existing runtime activation tests cover smoke-import failures before slot switch and rehydrate failures after slot switch, including rollback to the previous active version.
- Regression coverage now includes the v0.75.6 -> 0.75.7 style case where both versions share one runtime bucket but the active slot still needs fresh workspace sources.
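For orientation, the operation schema named above could be modelled like this. The field names follow the notes; the types, the `ok` flag, the stage names in the usage example, and the `conflict_detail` helper are assumptions, not the actual payload definition.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class RuntimeRefreshResult:
    """Illustrative shape only; field names follow the notes, types are assumed."""
    ok: bool  # assumed convenience flag, not named in the notes
    prepared_version: Optional[str] = None
    prepared_slot: Optional[str] = None
    activated_slot: Optional[str] = None
    failed_stage: Optional[str] = None
    failure_reason: Optional[str] = None
    lifecycle_stages: List[str] = field(default_factory=list)


def conflict_detail(result: RuntimeRefreshResult) -> dict:
    """Body for a 409 response when the active runtime did not converge."""
    return {"detail": {"runtime_refresh": asdict(result)}}
```

A failed activation would then surface as, for example, `conflict_detail(RuntimeRefreshResult(ok=False, prepared_version="0.75.7", failed_stage="activate", failure_reason="smoke import failed"))`, leaving `activated_slot` empty so callers can see that the previous runtime is still active.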
RCMS-004: Treat scenario dependencies as lifecycle operations
Status: completed.
Progress: 100%.
Actions:
- [x] Make scenario dependency bootstrap return structured per-skill results instead of silently continuing after dependency lifecycle failures.
- [x] For each required skill dependency, run install/source sync, prepare_runtime, and activate_for_space.
- [x] Decide and implement production policy for required dependency failure: block scenario activation or activate the scenario as degraded with an explicit operation warning.
- [x] Include dependent skill lifecycle results in synchronous scenario install API payloads.
- [x] Include dependent skill lifecycle results in async scenario install operation payloads.
- [x] Include dependent skill lifecycle results in async scenario update operation payloads.
- [x] Surface dependency lifecycle failures in Infrastructure State and Operations details only when they affect active scenarios.
- [x] Add tests for dependency lifecycle result reporting.
- [x] Add tests for scenario install/update that pulls a dependent skill forward and applies its lifecycle through the full operation path.
Implementation notes:
- Async scenario install operations now persist the structured dependency_bootstrap result in the operation result payload, matching the synchronous scenario install API surface.
- Sync and async scenario update operations now run dependency bootstrap before Yjs projection rebuild and include the same dependency_bootstrap payload in the operation/result surface.
- Production scenario install/update now blocks scenario projection when required dependency lifecycle fails; dev mode may continue as degraded for explicit development workflows.
- Dependency bootstrap timeout/exception paths produce explicit dependency_bootstrap.ok=false diagnostics instead of dropping dependency lifecycle visibility from the operation result.
- Regression coverage now pins that scenario install refreshes a stale dependent skill through source install, runtime prepare, and activation before Yjs projection; async install/update operation payloads preserve the same per-skill lifecycle flags.
- Infrastructure State now marks active scenario rows with a dependency lifecycle warning only when a recent scenario operation reports failed required dependencies; inactive/unprojected scenarios stay quiet.
- Operation detail streams expose the captured dependency_bootstrap payload so an operator can inspect the failed skill lifecycle stage without dumping the full operation history into the main table.
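The production/dev policy from these notes reduces to a small decision function. Only the `ok` flag comes from the notes; the `skills` list, its per-skill keys, the returned action strings, and both function names are hypothetical.

```python
from typing import List


def scenario_activation_policy(dependency_bootstrap: dict, *, dev_mode: bool) -> str:
    """Decide what to do with a scenario given its dependency bootstrap result.

    A missing `ok` flag is treated as a failure, so lost diagnostics
    never read as success.
    """
    if dependency_bootstrap.get("ok", False):
        return "activate"
    return "activate_degraded" if dev_mode else "block"


def failed_skills(dependency_bootstrap: dict) -> List[str]:
    """List dependent skills whose lifecycle did not complete (assumed payload shape)."""
    return [
        item.get("skill", "<unknown>")
        for item in dependency_bootstrap.get("skills", [])
        if not item.get("ok", False)
    ]
```

Defaulting to failure when `ok` is absent mirrors the timeout/exception rule above: no result is reported as a failed bootstrap, not a silent pass.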
RCMS-005: Make production CLI/control commands slot-bound
Status: in progress.
Progress: 58%.
Actions:
- [x] Add a slot-bound CLI launcher/self-reexec path so production commands can run from the active core slot venv and code.
- [x] Apply active slot manifest env/cwd when the CLI is already running under the slot Python but tools/slot-shell.sh was not sourced.
- [x] Refuse or warn for state-changing production commands when the current interpreter, repo root, or package path does not match the active slot manifest.
- [x] Keep root checkout drift acceptable for production when only supervisor and sidecar are launched from root and the updater controls those processes.
- [x] Keep always-on supervisor/watchdog/update-orchestration root-bound while production CLI/runtime actions remain slot-bound; the root-bound processes use the stable root checkout/root .venv, not a runtime slot venv.
- [ ] Keep .adaos/dev development commands explicit and separate from production slot-bound commands.
- [x] Add a slot_shell_required diagnostic only when command context is unsafe, not as normal Infrastructure State noise.
- [x] Add tests for the forgotten tools/slot-shell.sh case.
- [x] Add tests for unsafe state-changing command warning and allowed dev override.
- [x] Self-heal stale restarting/launch core-update status when the supervisor can prove the target-slot runtime is already ready.
Implementation notes:
- adaos.exe wrapper re-exec no longer blocks the second active-slot re-exec, so normal production CLI use lands in the active slot automatically.
- If automatic binding is disabled or mismatched, state-changing production commands emit a slot_shell_required diagnostic; read-only commands and adaos dev ... remain quiet.
- Root-launched supervisor/sidecar paths now share one bootstrap-critical path list, and tests assert that the top-level supervisor/sidecar import surface is covered before root promotion.
- Runtime-ready status reconciliation prevents a completed slot switch from remaining operator-visible as an update still in progress.
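The "diagnostic only when unsafe" rule can be illustrated with an interpreter check. This sketch compares interpreter paths only (the real check also covers repo root and package path), and `slot_shell_diagnostic`, its parameters, and the returned dict keys other than the slot_shell_required code are assumptions.

```python
import os
import sys
from typing import Optional


def slot_shell_diagnostic(
    active_slot_python: str,
    *,
    state_changing: bool,
    dev_command: bool = False,
    current_python: Optional[str] = None,
) -> Optional[dict]:
    """Return a slot_shell_required diagnostic, or None when the context is safe or quiet."""
    current = os.path.realpath(current_python or sys.executable)
    expected = os.path.realpath(active_slot_python)
    if current == expected:
        return None  # already bound to the active slot
    if dev_command or not state_changing:
        return None  # read-only and dev commands stay quiet
    return {
        "code": "slot_shell_required",
        "current_python": current,
        "expected_python": expected,
    }
```

Returning None for read-only and dev commands keeps the diagnostic out of normal output, so it only appears when a mismatched interpreter is about to change production state.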
RCMS-006: Sync catalog snapshots from hub/root to members
Status: planned.
Actions:
- [ ] Persist a hub-provided catalog snapshot on members with commit, fetched_at, source, and staleness metadata.
- [ ] Use that snapshot for member drift calculations instead of requiring each member to fetch GitHub directly.
- [ ] Keep archive materialization available for members without git.
- [ ] Refresh member catalog snapshots on link/reconnect and after hub catalog update operations.
- [ ] Surface member catalog staleness only when it affects installed skill/scenario drift or update actions.
- [ ] Add tests for no-git member drift calculation from a hub snapshot.
RCMS-007: Keep operator interfaces thin over core contracts
Status: planned.
Goal:
Make infrastate_skill a reference operator interface rather than a source of
truth. The same inventory, diagnostics, health, and action contracts must be
usable by other skills, web surfaces, CLI flows, and future MCP/LLM developer
tools.
Boundaries:
- Core owns catalog/workspace/runtime inventory, drift classification, action eligibility, scenario dependency health, quarantine state, and durable operation artifacts.
- Interfaces own presentation: filtering, sorting, icons/tooltips, local view state, stream subscriptions, and modal layout.
- MCP exposes the same core read/action contracts as the UI; it should not scrape infrastate snapshots to understand system state.
Actions:
- [ ] Introduce a core ArtifactInventoryService for skills and scenarios that returns catalog/workspace/runtime versions, drift statuses, catalog freshness, git availability impact, and action eligibility.
- [ ] Move catalog lookup, stale-catalog classification, and no-git diagnostics out of infrastate_skill into the inventory service.
- [ ] Introduce a core scenario health model for active scenarios: ok, degraded, blocked, and rollout quarantined, including failed dependency lifecycle artifacts.
- [ ] Persist operation diagnostics needed by operators and LLM developers beyond the in-memory OperationManager retention window.
- [ ] Expose inventory, scenario health, operation details, and log/detail resources through stable API and Root MCP surfaces.
- [ ] Refactor infrastate_skill to consume the core inventory/health/detail payloads and keep only presentation logic.
- [ ] Add contract tests proving infrastate, API, and MCP read the same core payloads for drift, action eligibility, and dependency lifecycle failures.
Implementation notes:
- RCMS-004 already follows the intended direction for scenario dependency lifecycle: ScenarioManager, API, and OperationManager own the structured dependency_bootstrap payload; infrastate_skill only renders it.
- RCMS-002 still has transitional logic inside infrastate_skill for drift and action visibility. That is acceptable while proving the product behavior, but the target is to migrate those calculations into the core inventory contract.
- RCMS-006 supplies the member-side catalog snapshot foundation needed before no-git member drift can become a core-owned, MCP-readable truth.
Hub Memory Growth Under Snapshot and Webspace Fanout
Goal
Prevent runaway hub memory growth during snapshot, webspace rebuild, and Yjs fanout storms without hiding the underlying overload signal from operators, skills, or core diagnostics.
Success means:
- A hub does not grow from a normal working set into multi-gigabyte RSS during a 10-minute snapshot/rebuild storm.
- webio.stream.snapshot.requested and subnet.member.snapshot.changed bursts are coalesced into bounded work.
- Route, Yjs, and eventbus backpressure enter degraded mode before memory runaway, while preserving causal diagnostics.
- Guardrails reduce amplification but do not suppress evidence needed to fix the originating skill or core hot path.
- Policy-triggered memory profiling always leaves an operator-visible reason, state transition, and artifact trail even when the live profile mode cannot be applied immediately.
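One way to read "coalesced into bounded work" while keeping the overload evidence is a keyed coalescer that counts what it merges or refuses. This is a minimal sketch under that reading; `BurstCoalescer` and its counters are hypothetical, not the planned guardrail design.

```python
from collections import OrderedDict
from typing import List, Tuple


class BurstCoalescer:
    """Collapse repeated requests per key into one pending unit of work.

    Merged and refused requests are counted rather than silently
    discarded, so the overload signal stays visible to diagnostics.
    """

    def __init__(self, max_pending: int) -> None:
        self.max_pending = max_pending
        self._pending: "OrderedDict[str, int]" = OrderedDict()
        self.coalesced = 0  # duplicates merged into an existing entry
        self.rejected = 0   # new keys refused because the queue was full

    def submit(self, key: str) -> bool:
        """Queue work for `key`; return False when the bounded queue is full."""
        if key in self._pending:
            self._pending[key] += 1
            self.coalesced += 1
            return True
        if len(self._pending) >= self.max_pending:
            self.rejected += 1
            return False
        self._pending[key] = 1
        return True

    def drain(self) -> List[Tuple[str, int]]:
        """Return (key, request_count) pairs in arrival order and clear the queue."""
        work = list(self._pending.items())
        self._pending.clear()
        return work
```

Keyed by webspace, five snapshot requests for the same webspace become one rebuild with a request count of five, and the `coalesced` and `rejected` counters give operators the causal trail the success criteria ask for.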
Current Status
Snapshot date: 2026-05-06.
Incident reference:
- Live hub: ssh -i c:/Users/Zver/.ssh/adaos_linux_exp root@192.168.0.30
- Subnet: sn_92ffc943
- Runtime: rt-b-a-ff6605f0
- Growth window: 2026-05-06 18:12:50 UTC -> 18:22:21 UTC
- RSS growth in best 10-minute window: about 100 MiB -> 2.07 GiB
Observed behavior:
- NATS bridge was connected normally at runtime start, so this incident was not driven by a root reconnect loop.
- The hot path was a local storm of webio.stream.snapshot.requested, subnet.member.snapshot.changed, multi-webspace semantic rebuilds, and webio / Yjs fanout.
- In the critical window the hub emitted repeated slow async handlers for infrastate_skill, infrascope_skill, and webspace_runtime._on_subnet_member_snapshot_changed.
- The route layer showed repeated starvation via publish_slow, pending_data, and flush_slow.
- Yjs pressure warnings showed repeated large update bursts during the same window.
- The current sampled-profile session mem-78c3dab0 stayed stuck in requested, and supervisor repeatedly logged "failed to apply requested memory profile mode", so memory guardrails detected the incident but did not capture a useful growth artifact.
- A concurrent skill bug also appeared in the hot path: browsers_skill ... NameError: current_device_id is not defined.
- 2026-05-20 checkpoint on .30 and .40: memory growth is still in the runtime Python process, not in service-skill subprocesses or YStore files. .30 held about 2.5 GiB RSS, .40 about 3.6 GiB RSS with about 7% available RAM. The latest auto sampled-profile sessions on both stands had only start artifacts, exposing a finalize/attribution gap that must be closed before we call the core guard observability milestone complete.
Working hypothesis:
- The primary cause is internal snapshot/fanout amplification, not external root traffic.
- The dominant amplification chain is: snapshot.requested -> snapshot.changed -> multi-webspace rebuild -> repeated webio / Yjs publish -> route starvation -> websocket reconnect / reattach -> another snapshot cycle.
- The memory plateau near 2 GiB is consistent with a backlog-stuck runtime: allocations stop accelerating because useful processing has mostly stalled, not because retained memory was released.
- Guardrails must therefore be designed as observability-first reducers of amplification, not as opaque drops that erase the evidence needed to improve core and skills.
Tasks
HMG-001: Coalesce snapshot storms before they become fanout storms
Status: in progress. Wave 1 landed in core and skills: duplicate stream
snapshot requests are debounced/coalesced, and repeated
subnet.member.snapshot.changed bursts now collapse into bounded rebuild
cycles.
Evidence:
- Dense bursts of webio.stream.snapshot.requested source=events_ws.
- Repeated subnet.member.snapshot.requested / subnet.member.snapshot.changed cycles during websocket reconnects.
- Slow async handlers clustered around snapshot handlers in infrastate_skill and infrascope_skill.
Actions:
- [ ] Add a single in-flight snapshot guard per (stream, webspace, node, subscriber) key.
- [x] Coalesce repeated webio.stream.snapshot.requested events into a dirty flag plus last-request metadata instead of spawning duplicate work.
- [x] Add debounce / batch windows for subnet.member.snapshot.changed so one flap burst produces one bounded rebuild cycle.
- [ ] Separate full snapshot paths from incremental refresh paths; reconnect and resubscribe must prefer bounded incremental bootstrap where possible.
- [x] Emit first-wave per-key counters for requested, forced, and coalesced; extend the same boundary with executed, skipped_unchanged, and dropped_due_to_guardrail.
- [x] Make coalescing observable in logs and telemetry so operators can still see the original incoming pressure and the amount of suppressed duplicate work.
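The dirty-flag coalescing described in the actions above can be sketched as follows. This is a minimal illustration, not the AdaOS implementation: the class name SnapshotCoalescer, its methods, and the exact counter semantics are assumptions made for the example. Only the key shape (stream, webspace, node, subscriber) and the counter names requested, coalesced, and executed come from this tracker.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# Key shape taken from the tracker: (stream, webspace, node, subscriber).
Key = Tuple[str, str, str, str]


@dataclass
class _KeyState:
    in_flight: bool = False
    dirty: bool = False
    last_request: Optional[dict] = None
    counters: Counter = field(default_factory=Counter)


class SnapshotCoalescer:
    """Hypothetical helper: one in-flight snapshot per key, duplicates set a dirty flag."""

    def __init__(self) -> None:
        self._states: Dict[Key, _KeyState] = {}

    def request(self, key: Key, meta: dict) -> bool:
        """Record a request; return True only if the caller should start work now."""
        state = self._states.setdefault(key, _KeyState())
        state.counters["requested"] += 1   # raw incoming pressure stays visible
        state.last_request = meta          # keep only the newest request metadata
        if state.in_flight:
            state.dirty = True
            state.counters["coalesced"] += 1
            return False
        state.in_flight = True
        state.counters["executed"] += 1
        return True

    def finish(self, key: Key) -> bool:
        """Mark the current run done; return True if one rerun is needed."""
        state = self._states[key]
        if state.dirty:
            state.dirty = False
            state.counters["executed"] += 1
            return True                    # key stays in flight for the rerun
        state.in_flight = False
        return False

    def counters(self, key: Key) -> dict:
        return dict(self._states[key].counters)
```

A burst of N requests for one key then costs at most two executions, while the requested counter still reports N, which keeps the original pressure visible to operators.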
HMG-002: Bound webspace rebuild amplification
Status: in progress. Wave 1 and Wave 7 landed in core: overlapping rebuild
triggers for the same (node, webspace) key now coalesce into one active
rebuild plus at most one dirty rerun, with preserved trigger reasons,
counters, and operator-visible rebuild request IDs carried through dirty /
delayed / rerun states.
Evidence:
- In the incident window the same snapshot wave rebuilt desktop, default, test1, and test1-1 repeatedly.
- Semantic rebuild durations rose into hundreds of milliseconds and over a second for some spaces while new rebuild triggers were still arriving.
Actions:
- [x] Add first-wave per-webspace rebuild counters for requested, scheduled, rerun, coalesced_running, coalesced_interval, and delayed; extend with queue depth, newest generation, and oldest waiting age.
- [x] Skip or supersede stale rebuild requests when a newer request for the same key is already queued or executing.
- [x] Prevent one snapshot event from scheduling overlapping semantic rebuilds for the same webspace.
- [ ] Add a degraded rebuild mode that defers noncritical projections or secondary webspaces while the hub is in memory or route pressure.
- [x] Record which upstream event caused each rebuild so we can trace pressure back to a skill, browser, reconnect, or subnet state change.
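The "one active rebuild plus at most one dirty rerun" rule, with preserved trigger reasons, could look roughly like the asyncio sketch below. RebuildScheduler and its trigger method are hypothetical names, and the rebuild callable stands in for the real semantic rebuild; the counter names follow the ones listed in the actions above.

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable, Dict, List, Set, Tuple

Key = Tuple[str, str]  # (node, webspace)


class RebuildScheduler:
    """Hypothetical sketch: one running rebuild per key plus one batched rerun."""

    def __init__(self, rebuild: Callable[[Key, List[str]], Awaitable[None]]) -> None:
        self._rebuild = rebuild
        self._running: Set[Key] = set()
        self._pending: Dict[Key, List[str]] = defaultdict(list)
        self.counters: Dict[str, int] = defaultdict(int)

    async def trigger(self, key: Key, reason: str) -> None:
        self.counters["requested"] += 1
        if key in self._running:
            # A rebuild is active: remember why we were asked, do not start another.
            self._pending[key].append(reason)
            self.counters["coalesced_running"] += 1
            return
        self._running.add(key)
        reasons = [reason]
        try:
            while reasons:
                self.counters["scheduled"] += 1
                await self._rebuild(key, reasons)
                # Everything that arrived during the run collapses into one rerun.
                reasons = self._pending.pop(key, [])
                if reasons:
                    self.counters["rerun"] += 1
        finally:
            self._running.discard(key)
```

Passing the accumulated reasons into each run is what lets a later incident review trace a rebuild back to the skill, browser, reconnect, or subnet change that asked for it.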
HMG-003: Add route and Yjs guardrails that preserve root-cause visibility
Status: in progress. Wave 2 and Wave 5 landed in core: route starvation now
exposes a guardrail state with activation reasons, Yjs rooms publish reusable
pressure state, and noncritical load_mark / events.recent fanout now
downshifts under both Yjs pressure and route guardrail activation without
hiding the incoming pressure.
Evidence:
- hub-route starvation repeatedly reported publish_slow, pending_data, and flush_slow.
- Yjs owner-flow and yroom pressure warnings showed large update bursts in the same interval.
Actions:
- [x] Add first-wave degraded / pressure thresholds for route pending age/data and Yjs buffer, pending task, and update-size pressure; extend with explicit publish-latency and persist-backlog thresholds where still missing.
- [x] When a threshold is crossed, downshift the first noncritical stream paths: repeated load_mark and events.recent fanout are now suppressed under active Yjs pressure; extend the same policy to equivalent cosmetic fanout.
- [x] Preserve observability by logging both the original attempted work and the reduced emitted work through pressure-state transitions and suppression counters.
- [x] Export first-wave route metrics/state for pending age/data and guardrail activation; extend with pending messages, max flush latency, and suppressed publication totals.
- [x] Export Yjs metrics for update bytes, pending send/store tasks, replay bytes, and per-webspace pressure state; extend with persist queue depth where still missing.
- [x] Ensure every guardrail activation produces a structured reason record that points back to the triggering stream, webspace, skill, or event type.
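A guardrail that downshifts noncritical fanout while still counting attempted versus emitted work might be shaped like this. It is a sketch under stated assumptions: FanoutGuard, the single pending-bytes threshold, and the record fields are illustrative and not taken from the AdaOS code; only the stream names load_mark and events.recent come from this tracker.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Streams the tracker names as noncritical; the set itself is an assumption.
NONCRITICAL_STREAMS = {"load_mark", "events.recent"}


@dataclass
class FanoutGuard:
    pending_bytes_limit: int = 1_000_000  # assumed threshold, not a documented value
    active: bool = False
    attempted: int = 0
    emitted: int = 0
    suppressed: int = 0
    transitions: List[Dict[str, object]] = field(default_factory=list)

    def observe(self, pending_bytes: int, *, stream: str, webspace: str) -> None:
        """Update pressure state and record a structured reason on every transition."""
        now_active = pending_bytes > self.pending_bytes_limit
        if now_active != self.active:
            self.transitions.append({
                "state": "degraded" if now_active else "normal",
                "pending_bytes": pending_bytes,
                "limit": self.pending_bytes_limit,
                "stream": stream,
                "webspace": webspace,
            })
        self.active = now_active

    def should_emit(self, stream: str) -> bool:
        """Count every attempt; suppress only noncritical streams while degraded."""
        self.attempted += 1
        if self.active and stream in NONCRITICAL_STREAMS:
            self.suppressed += 1
            return False
        self.emitted += 1
        return True
```

Because attempted is incremented before the suppression decision, the difference between attempted and emitted is always explainable by the suppressed counter and the recorded transitions.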
HMG-004: Make eventbus and async backlog visible and bounded
Status: in progress. Wave 4 and Wave 5 landed in core: eventbus now bounds selected hot-topic async fanout through per-topic worker queues, preserves incoming / queued / dropped visibility, supersedes stale queued snapshot work, and exposes richer backlog state for incident artifacts.
Evidence:
- The incident produced about 210 slow async handler warnings in one window.
- Current logs show slow handlers, but not the complete backlog shape or the amount of queued overlapping async work.
Actions:
- [x] Add first-wave live backlog snapshot data for eventbus pending async tasks plus per-topic and per-handler in-flight counts; extend with oldest pending task age and per-handler slow-count totals where still missing.
- [x] Bound selected hot-path async fanout with first-wave per-topic work queues instead of unlimited create_task growth; extend the same approach to more hot topics as incident evidence evolves.
- [x] Add per-topic and per-handler cancellation / supersede semantics for stale snapshot work.
- [x] Keep raw incoming-event counters visible even when bounded execution drops or coalesces work.
- [x] Add an operator-facing incident summary that names the top topics and handlers contributing to backlog growth.
- [x] Add browser.session.changed to the core EventBus bounded/supersede defaults so browser reconnect/session churn remains observable but stale queued handler work cannot keep growing behind the current state.
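The bounded per-topic queue with supersede semantics can be illustrated with a small in-memory structure. BoundedTopicQueue, its drop-oldest policy, and the counter names incoming, superseded, and dropped are assumptions for this sketch; the real eventbus may bound and evict differently.

```python
from collections import Counter, OrderedDict
from typing import Any, Hashable, Optional, Tuple


class BoundedTopicQueue:
    """Hypothetical per-topic queue: newer work for a key replaces stale queued work."""

    def __init__(self, max_items: int) -> None:
        self._items: "OrderedDict[Hashable, Any]" = OrderedDict()
        self._max = max_items
        self.counters: Counter = Counter()

    def put(self, key: Hashable, event: Any) -> None:
        self.counters["incoming"] += 1          # raw incoming count is never reduced
        if key in self._items:
            self.counters["superseded"] += 1    # stale payload replaced, position kept
        elif len(self._items) >= self._max:
            self._items.popitem(last=False)     # assumed policy: evict the oldest entry
            self.counters["dropped"] += 1
        self._items[key] = event

    def get(self) -> Optional[Tuple[Hashable, Any]]:
        if not self._items:
            return None
        return self._items.popitem(last=False)

    def __len__(self) -> int:
        return len(self._items)
```

The queue depth can never exceed max_items, yet incoming, superseded, and dropped together still describe the full shape of the burst for an incident summary.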
HMG-005: Make memory incident capture reliable before the hub stalls
Status: in progress. Wave 3 and Wave 6 landed in supervisor: requested memory-profile sessions now expire instead of hanging indefinitely, apply failures persist structured context, supervisor writes local incident artifacts with telemetry, operations, Yjs pressure, route diagnostics, rebuild pressure, and eventbus backlog snapshots, and the artifact now includes a compact operator-facing incident summary/headline.
Evidence:
- Supervisor detected the growth threshold but left session mem-78c3dab0 in requested.
- Repeated "failed to apply requested memory profile mode" warnings prevented a useful memory artifact from being captured during the live incident.
Actions:
- [x] Fix the supervisor profile-mode transition so a triggered session cannot remain indefinitely in requested.
- [x] Persist a structured first-wave failure reason when automatic profile mode cannot be applied, including slot, runtime, requested mode, and the most recent blocking / apply-error context.
- [x] Add a fallback capture path that records growth context without requiring a full runtime restart; extend with allocator-level artifacts where available.
- [x] Tie memory incidents to the active operation and first-wave pressure context through telemetry, operation history, Yjs pressure, route diagnostics, and member snapshot rebuild pressure; extend with finer-grained snapshot/request counters where still missing.
- [x] Publish enough local-only artifacts to debug the next incident even if root publication is unavailable.
- [x] Preserve the profiling session id while stopping an active sampled/trace profile and record a local incident when the runtime does not leave a finalize marker, so policy profiles cannot silently end with only a start snapshot.
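The requirement that a requested session must either apply or expire with a structured reason can be sketched as a small reconcile step. ProfileSession, reconcile, the 60-second timeout, and the failure record fields are all hypothetical; the tracker only states that sessions now expire and that apply failures persist structured context.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProfileSession:
    session_id: str
    requested_mode: str
    requested_at: float
    state: str = "requested"
    failure: Optional[dict] = None


def reconcile(
    session: ProfileSession,
    *,
    now: float,
    applied: bool,
    last_error: Optional[str] = None,
    timeout_s: float = 60.0,  # assumed default, not a documented value
) -> ProfileSession:
    """Move a requested session to applied, or expire it with a reason record."""
    if session.state != "requested":
        return session  # terminal states are never rewritten
    if applied:
        session.state = "applied"
    elif now - session.requested_at >= timeout_s:
        session.state = "expired"
        session.failure = {
            "reason": "apply_timeout",
            "requested_mode": session.requested_mode,
            "waited_s": now - session.requested_at,
            "last_error": last_error,
        }
    return session
```

Calling reconcile on every supervisor tick guarantees that the session leaves requested within timeout_s, and the failure record gives the operator-visible trail that the goal section asks for.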
HMG-006: Fix skill-level amplifiers in snapshot and webio hot paths
Status: in progress. Wave 1 landed in skills: duplicate
webio.stream.snapshot.requested bursts are now debounced in
infrastate_skill and infrascope_skill before they can multiply into
repeated full publishes. Wave 8 hotfix extends the same policy for
infrastate_skill: noncritical streams are no longer eager-published on every
snapshot refresh, active pressure skips heavy detail snapshot construction, and
pressure-mode snapshot cache TTL expands to keep the hub responsive during a
burst. This is not the primary safety mechanism: the deliberately heavy
infrastate path remains a useful crash-test for kernel-level containment, and
the owner-quarantine work is tracked under HMG-007.
Evidence:
- The heaviest repeated slow handlers in the incident were infrastate_skill.on_webio_stream_snapshot_requested and infrascope_skill.on_webio_stream_snapshot_requested.
- A concurrent browsers_skill background task failed with NameError: current_device_id is not defined.
Actions:
- [x] Refactor infrastate_skill snapshot publishing to avoid bursty repeated republish of unchanged payloads.
- [x] Refactor infrascope_skill snapshot publishing to prefer cached or diff output when the source generation did not materially change.
- [ ] Ensure skill snapshot handlers are idempotent and generation-aware.
- [x] Add first-wave suppression for noncritical skill-triggered webio.stream.* fanout under active Yjs pressure; extend the same policy to degraded route and additional cosmetic receivers.
- [x] Extend infrastate_skill suppression from events.recent to all noncritical diagnostic/detail receivers while preserving operations.active as the small eager status stream.
- [x] Avoid constructing a full infrastate snapshot for detail stream requests when Yjs/route guardrails are already active; record suppression counters instead of hiding the dropped work.
- [x] Increase infrastate snapshot cache TTL under active primary-doc pressure so repeated browser refreshes and member snapshot flaps reuse bounded work.
- [ ] Fix the browsers_skill current_device_id bug and ensure background snapshot tasks fail noisily but safely, without leaving orphan churn behind.
- [ ] Review all skills subscribed to subnet.member.snapshot.changed and webio.stream.snapshot.requested for duplicate work, full-state publish, and missing debounce.
HMG-007: Keep guardrails observability-first
Status: in progress. Wave 1, Wave 2, Wave 5, Wave 6, and Wave 7 guardrails were
implemented with preserved evidence at the same logical boundary so
suppression does not hide the original incoming pressure. Wave 9 adds the
missing containment layer: write pressure can now promote from a local
write-boundary decision to a short-lived owner quarantine, so the same skill
cannot keep launching expensive tools while the primary Yjs document is already
in block or sustained throttle. Wave 10 fixes the first live regression in
that containment layer: implicit webspace events now publish quarantine service
state to the configured desktop webspace, and hot browser stream events are
coalesced by handler so stale queued subscription work cannot keep growing
after pressure has already been detected. Wave 11 fixes the live regression
seen on .30: Yjs owner quarantine no longer suppresses
webio.stream.snapshot.requested / webio.stream.subscription.changed
handlers, so a quarantined skill can still serve bounded stream variables while
Yjs writes remain blocked by the primary-doc guard and stream payloads remain
covered by the stream guard.
Principle:
- A guardrail must reduce amplification, not erase cause.
- If the hub suppresses or coalesces work, operators still need to see: what arrived, what would have run, what was skipped, why it was skipped, and which skill/core path created the pressure.
Actions:
- [x] For every new guardrail, define the preserved evidence set before implementing the drop/coalesce behavior.
- [x] Add structured first-wave counters for requested, forced, coalesced, scheduled, and rerun at the same logical boundary; extend the same pattern with suppressed, timed_out, and failed as degraded mode expands.
- [x] Ensure telemetry and logs distinguish "incoming load reduced by guardrail" from "incoming load disappeared".
- [x] Add kernel-level Yjs primary-doc governance at the write boundary: get_ydoc, async_get_ydoc, mutate_live_room, and direct YStore.write_update now evaluate the shared warn/throttle/block policy before persisting or broadcasting skill-owned writes.
- [x] Move ProjectionService onto the shared Yjs governor and tag already-governed writes to avoid double-throttling in downstream Yjs paths.
- [x] Attach explicit SDK Yjs ownership metadata for sync and async skill wrappers so LLM-generated skills are attributable even when they use the supported SDK facade.
- [x] Add owner-level Yjs pressure quarantine with TTL, visible deny counters, and structured skill_owner_quarantined tool results instead of silent fallback.
- [x] Run skill tool admission through the Yjs owner guard before skill runtime context is established, so an overloaded owner is stopped before it can build another full snapshot payload.
- [x] Notify quarantined skills through optional onQuarantine / on_quarantine tools with ttl_s, reason, blocked tool, owner, webspace, and quarantine metadata; this hook bypasses skill admission but remains subject to Yjs write governance.
- [x] Persist skill-local quarantine incidents to ADAOS_SKILL_MEMORY_PATH/logs/quarantine.jsonl so later LLM-assisted skill repair can recover the exact pressure event from the skill context.
- [x] Publish active Yjs owner quarantines into the primary doc service branch data.yjs_qrnt with items, by_owner, and by_skill, allowing web UI consumers to disable affected apps/widgets explicitly.
- [x] Run projection admission through the same owner guard before payload compaction and primary-doc mutation, preserving evidence while skipping avoidable work.
- [x] Surface active quarantine state in reliability and adaos node reliability output (quarantine=active, reason, trigger, retry-after, tool, path).
- [x] Normalize implicit Yjs owner-guard webspaces through runtime webspace policy instead of hard-coding default, so data.yjs_qrnt appears in the same webspace the browser is rendering.
- [x] Bound webio.stream.subscription.changed in the eventbus hot-topic queue and supersede stale queued webio.stream.snapshot.requested / webio.stream.subscription.changed work by handler before it can accumulate into memory pressure.
- [x] Exempt stream-control subscriptions from Yjs owner-guard quarantine: webio.stream.snapshot.requested and webio.stream.subscription.changed stay on the stream-control plane, while actual Yjs writes and stream payload publication remain governed at their own boundaries.
- [ ] Keep operator-visible correlation IDs or generation IDs across snapshot, rebuild, route, and Yjs stages. First wave landed for member snapshot rebuild pressure and incident summary; extend the same IDs into route and Yjs pressure payloads.
- [ ] Reject any guardrail that improves memory only by hiding the overload source from incident review.
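Owner-level quarantine with a TTL, visible deny counters, and a structured denial instead of a silent fallback could be sketched like this. OwnerQuarantine, admit, and the result dictionary shape are illustrative assumptions; only the skill_owner_quarantined marker and the TTL, reason, and retry-after concepts come from this tracker.

```python
from collections import Counter
from typing import Dict


class OwnerQuarantine:
    """Hypothetical admission guard keyed by skill owner."""

    def __init__(self) -> None:
        self._entries: Dict[str, dict] = {}
        self.denied: Counter = Counter()

    def quarantine(self, owner: str, *, now: float, ttl_s: float, reason: str) -> None:
        self._entries[owner] = {"until": now + ttl_s, "reason": reason}

    def admit(self, owner: str, tool: str, *, now: float) -> dict:
        """Return a structured result; denials are counted, never silent."""
        entry = self._entries.get(owner)
        if entry is not None and now < entry["until"]:
            self.denied[owner] += 1
            return {
                "ok": False,
                "error": "skill_owner_quarantined",
                "owner": owner,
                "tool": tool,
                "reason": entry["reason"],
                "retry_after_s": entry["until"] - now,
            }
        # TTL elapsed (or never quarantined): clear any stale entry and admit.
        self._entries.pop(owner, None)
        return {"ok": True}
```

Running admit before any skill runtime context is built matches the intent stated above: the overloaded owner is stopped before it can assemble another full snapshot payload, and the denied counter preserves the evidence.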
HMG-008: Make ProjectionService the normal skill write ingress
Status: planned. Kernel pressure governance is now the last-resort safety net, but the target architecture is stricter: LLM-authored skills should not write browser-visible primary Yjs state directly during normal operation.
Principle:
- ProjectionService is the normal skill-facing write boundary for primary shared document state.
- Direct skill-owned Yjs writes are legacy or capability-gated.
- Details and large diagnostics belong in section endpoints, streams, or 360log, not broad primary-doc branches.
Roadmap:
- [ ] Observe and count direct skill-owned Yjs writes with owner, source, channel, root, path, and update size.
- [x] Ensure infrastate_skill projections preserve skill identity when calling ProjectionService, preventing background refresh tasks from being mis-attributed as _by_owner/core.
- [ ] Emit deprecated_direct_skill_yjs_write warnings for skill paths that bypass ProjectionService.
- [ ] Apply stricter budgets to direct skill writes than to governed projection writes.
- [ ] Block broad direct skill writes to shared roots such as data, ui, registry, and desktop-wide branches unless explicitly allowlisted.
- [ ] Add skill.yaml capability declarations for direct Yjs exceptions and projection targets.
- [ ] Make direct skill-owned primary-doc writes deny-by-default outside declared capabilities.
- [x] Teach web_desktop and the client shell to consume data.yjs_qrnt and render quarantined apps/widgets as disabled with a visible reason and retry-after, rather than silently hiding or retry-spamming them.
- [ ] Add app/widget manifest metadata mapping UI entries to owning skill IDs so data.yjs_qrnt.by_skill[skill_id] can be applied consistently across desktop icons, widgets, modals, and details panes.
- [ ] Add migration tooling/reporting for skills that still depend on direct Yjs access.
- [ ] Update LLM skill templates and prompts so generated skills use projections, streams, HTTP details, or skill-local storage by default.
Realtime First 3 Minutes
Goal
Provide stable hub-root connectivity and error-free runtime behavior during the first 3 minutes after startup.
Success means:
- NATS-over-WS stays connected for at least 180 seconds without watchdog reconnects.
- Root-routed HTTP and WS requests do not time out during normal startup probes.
- Browser /ws and /yws handshakes complete without fallback-only operation.
- Yjs persistence does not create sustained high-pressure warnings.
- Startup and first browser attach do not block the event loop above diagnostic thresholds.
- Process memory is sampled during loading-to-ready and through the first 3 minutes; it reaches a stable startup plateau and does not show runaway growth.
Current Status
Snapshot date: 2026-05-01.
Overall completion: 99% for the expanded local + root-routed browser goal. Windows root-routed /nats is accepted again after fixing an AdaOS env-name collision: the legacy HUB_NATS_WS_PROXY=auto variable was treated by Python proxy discovery as a generic *_PROXY variable and selected the bad one-way route. The stable default is now HUB_NATS_WS_PROXY_MODE=auto, with the legacy name hidden during websockets.connect. Linux/RU root-routed browsers already load data and stay inside the first-window memory guard; the remaining work is a rollout reconfirmation and longer plateau soak, not a first-window blocker.
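The env-name collision can be reproduced with the Python standard library alone. The sketch below uses urllib.request.getproxies_environment(), which collects every environment variable whose name ends in _PROXY (case-insensitive). The helper discovered_proxies is a hypothetical name introduced for this example, and how a particular WebSocket client consumes the resulting mapping is outside the sketch.

```python
import os
import urllib.request
from typing import Dict


def discovered_proxies(extra_env: Dict[str, str]) -> Dict[str, str]:
    """Return what urllib's environment proxy discovery sees with extra_env applied."""
    saved = dict(os.environ)
    os.environ.update(extra_env)
    try:
        return urllib.request.getproxies_environment()
    finally:
        # Restore the caller's environment exactly.
        os.environ.clear()
        os.environ.update(saved)


if __name__ == "__main__":
    # Legacy name: ends in _PROXY, so it is reported as a proxy entry.
    print(discovered_proxies({"HUB_NATS_WS_PROXY": "auto"}).get("hub_nats_ws"))
    # Renamed variable: ends in _MODE, so proxy discovery ignores it.
    print("hub_nats_ws" in discovered_proxies({"HUB_NATS_WS_PROXY_MODE": "auto"}))
```

With the legacy name set, the mapping gains a hub_nats_ws entry whose value is the literal string auto; the renamed HUB_NATS_WS_PROXY_MODE variable does not match the suffix rule and leaves the mapping untouched.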
Done:
- Structured terminal/log diagnostics are now available for NATS WS receive failures, direct control frames, route reply lifecycle, root log extracts, event loop lag/hang, and Yjs owner pressure.
- Hot-path load_config() was removed from route key matching.
- Skill runtime status reads no longer force slot prepare/path creation during snapshot calls.
- Selected synchronous skill subscription handlers can run in worker threads.
- Root log extracts now summarize repeated incidents instead of flooding the terminal by default.
- Windows Selector loop is now an explicit diagnostic mode only.
- Startup native capacity and subnet directory registry work now runs off the event loop thread.
- YRoom pressure diagnostics no longer call ystore runtime filesystem/SQLite snapshot code from the realtime hot path by default.
- In the active local infrascope_skill workspace/runtime copy, background refresh target discovery now runs in a worker thread.
- ui.notify delivery no longer holds the eventbus critical path; RouterService schedules notification delivery in background and drains briefly on shutdown.
- Root MCP local SDK calls use a local-first embedded registry path for local runtime queries, so normal startup no longer probes the public Root MCP bridge or emits fetch failed fallback diagnostics.
- Yjs gateway persistence keeps immediate writes for durability, while owner-pressure diagnostics now treat gateway first-attach peak bursts separately from sustained pressure.
- The hub subnet-directory staler heartbeat/stale sweep no longer commits SQLite work on the event loop thread.
- NATS WS diagnostic JSONL writes are emitted from a worker thread instead of the NATS supervisor hot path.
- Active local infra_access_skill and infrastate_skill workspace/runtime copies no longer perform heavy snapshot refresh from sys.ready subscription callbacks.
- Active local infrastate_skill runtime event handling now returns from sys.ready without a worker-thread hop, eliminating the last startup slow-handler warning.
- Active .adaos skill hotfixes are present in the workspace skill registry repo through d208cd3; DEV Forge publish dry-run is not applicable because these are workspace-registry skills, not DEV Forge drafts.
- Final soak verification now includes process-tree memory sampling during loading-to-ready and the full 180-second window.
- Local API serve disables WebSocket per-message deflate to avoid CPU-heavy compression during root-routed Yjs first-sync bursts.
- /api/node/reliability and /api/node/reliability/summary build reliability payloads off the event loop, so browser polling no longer runs load_config() / runtime-state filesystem checks on the loop thread.
- Skill service discovery refresh no longer submits recurring watchdog work to the default thread executor, avoiding the observed Windows Thread.start() event-loop freeze path.
- Control lifecycle await-resume stack watcher is now opt-in diagnostics only, avoiding a fresh diagnostic thread on every control heartbeat during normal runs.
- Backend route-open retry is deployed and visible in root logs: open ack retry / open republish replaced the old fallback flush path.
- NATS-over-WS core transport now follows the stable tools behavior by default: websockets system proxy auto-detect (proxy=True) is used unless HUB_NATS_WS_PROXY_MODE=none explicitly forces direct-route diagnostics.
- The legacy HUB_NATS_WS_PROXY name remains backward-compatible but is no longer documented as the steady-state default because Python treats any *_PROXY environment variable as a proxy setting.
- NATS-over-WS control-frame handling now replies to coalesced root PING frames without corrupting MSG payload boundaries.
- Local and root-routed browser runs on 2026-04-30 initially confirmed stable /nats and /yws behavior after the proxy-auto core change.
- Normal diagnostic thresholds are relaxed out of deep-debug mode: loop-lag warnings now default to 1000ms and eventbus slow async warnings default to 250ms.
- Backend-origin Yjs updates are marked so the live room can fan them out to browsers without persisting the same detached diff again as gateway_ws.
- infrastate_skill.get_snapshot is read-only for HTTP callers by default and returns a compact client snapshot instead of projecting multi-megabyte diagnostic payloads into Yjs on every root-routed fallback probe.
- Supervisor memory telemetry still records growth, but automatic policy-triggered sampled-profile restarts are delayed for the first 300 seconds by default and are deferred while recent browser sessions are live, so diagnostics cannot break the first browser attach/interaction window.
- Supervisor sampled-profile timing now starts after the profiled runtime API becomes ready, preventing slow bootstrap from consuming the whole profiling window and producing empty start-only artifacts.
- infrascope_skill, infrastate_skill, and core Yjs load-mark streams now publish only active subscribed receivers, deduplicate unchanged payloads, and rate-limit high-churn diagnostics under browser load.
In progress:
- Keep an eye on residual sub-second event-loop drift and occasional infrastate / infrascope browser-runtime handlers, but do not treat them as connectivity blockers unless they exceed the normal thresholds.
- Reconfirm Windows after rollout from a clean operator environment where HUB_NATS_WS_PROXY is unset and HUB_NATS_WS_PROXY_MODE=auto is used.
- Reduce follow-up Linux/RU YStore replay pressure (sync_runtime: pressure, replay around 700 KiB) without reintroducing expensive live-backup work on the runtime hot path.
Known follow-up outside the current goal:
- If public remote Root MCP access to local hubs is required, design and deploy a backend/infra route that resolves upstream by hub route/NATS instead of direct ADAOS_BASE HTTP proxying.
Latest verification:
first3m_20260428_225403: 180-second soak, no NATS recv failure, no route timeout, no open ack fallback, no event loop lag; one shutdown idle wait was classified as a false-positive hang.first3m_20260428_230658: after YRoom hot-path diagnostic changes, no NATS recv failure, no route timeout, no open ack fallback, no event loop lag/hang, noruntime_snapshot()/Path.stat()stack.first3m_20260428_231152: afterinfrascope_skilltarget-discovery offload, ready in about 13 seconds, 180-second soak completed, no NATS recv failure/watchdog, no route timeout, no open ack fallback, no event loop lag/hang, no control resume warning stack. NATS diagnostics showed Proactor loop, connected read task,pending_data_size=0, and no task errors.first3m_20260429_065606: after RouterService backgroundui.notifydelivery, ready in about 15 seconds, 180-second soak completed, no NATS recv failure/watchdog, no route timeout, no open ack fallback, no event loop lag/hang, no slowui.notify, and no router background delivery failure. Remaining warnings were two off-threadsys.readydurations and two_by_owner/gateway_wsYjs pressure warnings.first3m_20260429_080100: final 180-second soak completed and stopped cleanly. Counts: NATS recv failure/watchdog/ConnectionClosedError/WinError 10054 = 0, route timeout/proxy failed = 0, open ack fallback = 0, real event loop lag/hang = 0, slow async handler = 0, slowui.notify= 0, Yjs owner pressure/unknown/gateway warning = 0, infra_access snapshot failure = 0, traceback = 0. Expected non-failing signals: one embedded Root MCP fallback debug line, one idle-wait hang suppression during shutdown, one NATS disconnect during requested shutdown.first3m_20260429_final_mem4: final 180-second soak with process-tree memory sampling. Ready in 13.461s. 
Counts: NATS recv failure/watchdog/ConnectionClosedError/WinError 10054 = 0, route timeout/proxy failed = 0, open ack fallback = 0, event loop lag = 0, real event loop hang = 0, slow async handler = 0, slowui.notify= 0, Yjs owner pressure = 0, infra_access snapshot failure = 0, traceback = 0, Root MCPfetch failed= 0, embedded Root MCP fallback = 0. Expected non-failing signals: one idle-wait hang suppression during shutdown and one NATS disconnect during requested shutdown. Memory: process tree WorkingSet first/ready/peak/last = 121.695/238.305/250.066/248.066 MB; PrivateMemory first/ready/peak/last = 95.172/218.930/230.555/228.117 MB; loading-to-ready sampled 121.695 -> 238.305 MB WorkingSet and 95.172 -> 218.930 MB PrivateMemory; no runaway growth observed.first3m_20260429_final_accept: repeat final acceptance run. Ready in 12.795s, browser/wsaccepted, andYRoom ready webspace=desktopobserved. Counts: NATS recv failure/watchdog/ConnectionClosedError/WinError 10054 = 0, route timeout/proxy failed = 0, open ack fallback = 0, event loop lag = 0, real event loop hang = 0, slow async handler = 0, slowui.notify= 0, Yjs owner pressure = 0, infra_access snapshot failure = 0, traceback = 0, Root MCPfetch failed= 0, embedded Root MCP fallback = 0. Expected non-failing signals: one idle-wait hang suppression during shutdown and one NATS disconnect during requested shutdown. Memory: process tree WorkingSet first/ready/peak/last = 30.949/136.863/146.445/145.836 MB; PrivateMemory first/ready/peak/last = 21.930/137.117/145.211/145.211 MB; no runaway growth observed.root_remote_browser_20260429_0753Z: live local + root-routed browser load reopened the goal. Local browser remained usable, but the remote browser repeatedly reconnected through root. Evidence: root reverse-proxy accepted/hubs/sn_6acf0c01/yws/desktopwith101, then nginx emitted repeatedSSL_read() failed ... 
bad record macon keepalive/upgraded paths; backendws-nats-proxyreported/natsclose1006withnatsKeepalivesSent=0,lastClientPongAgo_s=67.0, and only one client ping; hub-sidenats_ws_diag.jsonlshowedpending_data_size=0whilelast_rx_ago_sgrew above 300s andka_pings_rx=1. Conclusion: this is not local pending-queue starvation; the root WS-NATS tunnel lacks regular hub<->root application-level liveness traffic under remote browser load.root_remote_after_summary_offload_20260429_111439: 3+ minute local + root-routed browser diagnostic after disabling API WebSocket compression and offloading reliability summary generation. Counts in the verification window: NATS recv failure/ConnectionClosedError/WinError 10054/watchdog_reading_task= 0, event loop lag = 0, control-lifecycle warning stack = 0,node_reliability_summary/current_reliability_payloadwarning stack = 0. Expected signals only: onenats bridge connected, oneyws connection open, oneyws connection closedduring requested shutdown, and one NATS disconnect during requested shutdown. Caveat: this run used local raw NATS keepalive diagnostics; public root now intentionally keepsWS_NATS_PROXY_KEEPALIVE_ENABLE=0, so the remaining verification target is route-open retry / supersede behavior under that mode.root_remote_backend_deploy_20260429_0835Z: after latest backend deploy, remote browser still failed to load Yjs data. Root is intentionally configured withWS_NATS_PROXY_KEEPALIVE_ENABLE=0, so backendnatsKeepalivesSent=0is expected and is not the primary bug marker. Local log shows a repeated cycle: NATS bridge connects, root-routedywsopens,/natsfails withConnectionClosedError/WinError 121after about 20-25s, thenywscloses and reconnects. Root logs show the remote route can publishopenwhile the hub route subscription is not yet reinstalled after reconnect; the old fallback then flushes early Yjs frames without a local upstream, producingno_upstream. 
Patch prepared: root route proxy now retriesopeninstead of flushing early frames after missingopen_ack; WS-NATS supersede waits for the new connection's route subscription before closing old peers, with a 10s max grace; WS-NATS config is logged explicitly.codex_first3m_20260429_125941: local run after changing skill-service discovery refresh away from recurringasyncio.to_thread. Ready in about 16.5s. No NATS recv failure, noConnectionClosedError, noWinError, no route timeout, noopen ack fallback, nono_upstream, and noservice_supervisor/Thread.startstack. Minor short loop-lag diagnostics remained, and shutdown emitted only the expected idle-wait suppression.codex_first3m_20260429_130538: local + root-routed browser run after making control lifecycle await watcher opt-in. Ready in about 16.0s. The previous 60sservice_supervisor -> Thread.start()freeze did not recur. Counts still showed/natschurn under root-routed load:nats_recv_failed=8,nats_watchdog=40,ConnectionClosedError=64,WinError=14, withyws_open=8andyws_closed=8. Counts stayed clean for route-level symptoms:route_timeout=0,http_proxy_failed=0,open_ack_fallback=0,no_upstream=0. Memory stayed bounded: process tree WorkingSet about 108 MB first sample, 150 MB at ready, 174 MB at the end; PrivateMemory about 85 MB first sample, 143 MB at ready, 165 MB at the end.codex_heartbeat_ab_20260429_131015: A/B run with hub-sideHUB_NATS_WS_DATA_HEARTBEAT_S=10. It did not stabilize root/nats:nats_recv_failed=4,nats_watchdog=20,ConnectionClosedError=32,WinError=6, with repeated root-routed YWS open/close. Route-level symptoms stayed clean:route_timeout=0,http_proxy_failed=0,open_ack_fallback=0,no_upstream=0. Conclusion: hub-side heartbeat alone is insufficient when rootWS_NATS_PROXY_KEEPALIVE_ENABLE=0; the next required experiment is restoring root proxy application-level keepalive.root_remote_keepalive_enabled_20260429_1035Z: after root params were changed and backend recreated, remote browser still did not load data. 
  Local terminal shows `/nats` `ConnectionClosedError` with `WinError 64` and root-routed `yws connection closed` at the same time. Root close diagnostics now show `natsKeepalivesSent=2` / `upstreamNatsPingsSent=2`, so the application keepalive is active; however `wsPingsSent=0` / `wsPongsReceived=0`, and local `nats_ws_diag.jsonl` shows only one or zero NATS PINGs observed by the hub before the socket dies. Root logs also show `closing superseded hub ws-nats connection reason=route_ready` while browser route `open` retries are still possible. Conclusion: add explicit WS control ping for `/nats` and stop closing superseded `/nats` peers immediately on `route_ready`; keep them through a longer grace window.
- `root_remote_after_ws_ping_20260429_1150Z`: after backend deploy and updated root variables, remote browser still did not load data. The hub-root `/nats` tunnel again failed after about 22s with `ConnectionClosedError` / `WinError 64`, and the root-routed Yjs connection closed in the same window. A local diagnostic with `.env` forcing `HUB_NATS_WS_HEARTBEAT_S=10`, `HUB_NATS_WS_HEARTBEAT_FORCE=1`, and transport trace confirmed hub-side WS heartbeat traffic is active (`heartbeat_s=10.0`, repeated `nats ws heartbeat tx kind=PING`) but still does not prevent `/nats` churn. Route-level symptoms stayed clean in that diagnostic (`route timeout=0`, `http proxy failed=0`, `open ack fallback=0`, `no_upstream=0`). Conclusion: treat a missing root NATS keepalive PONG as a hard liveness failure and close/reopen the tunnel proactively.
- `root_nats_independent_tools_20260429_1746`: independent `tools` probes without `adaos api serve` split the problem. Raw `websockets` NATS framing against `wss://api.inimatic.com/nats` stayed healthy for 25-45s with nats-py-like CONNECT/SUB/PUB formatting, repeated client NATS PINGs, split PUB frames, and full echo delivery (`22/22`, `31/31`, `41/41`).
  Raw `aiohttp` framing stopped after 4 echo messages and failed with missing PONG / close `1006`; stock `nats-py`/aiohttp stopped after 3 echo messages and failed with `UnexpectedEOF` / close `1006`; AdaOS custom transport stopped after 4 echo messages and failed with `ConnectionClosedError` / close `1006` / `WinError 121` while TX continued. Conclusion: the public `/nats` channel is not generically unreachable; the failing path is WebSocket-client/proxy behavior under active nats-py-like traffic. Backend patch prepared to add per-connection frame counters (`clientFrames`, `upstreamWrites`, `upstreamFrames`, `downstreamSend*`) to close summaries so the next deploy shows exactly where frames stop.
- `root_remote_frame_accounting_20260429_1730Z`: after backend frame-accounting deploy, root confirms route traffic is alive before forced close: `PUB -> MSG -> downstream` counters increment, `downstreamSendErrors=0`, and YWS `open_ack` can be received for a fresh route key. The tunnel is then closed after a single root NATS keepalive miss: `natsKeepalivesSent=1`, `clientFrames.pong=1`, `clientFrames.pub=2`, `upstreamFrames.msg=8-9`, `wsPingsSent=1`, `wsPongsReceived=0`, followed by `nats keepalive pong missing: closing tunnel` and close `1006` at about 15s uptime. Conclusion: the next backend fix should stop treating one missed keepalive as fatal, stagger WS control ping and NATS-data keepalive, and close only after repeated misses with no client data / WS pong.
- `core_proxy_auto_20260429_223126`: decisive A/B after comparing `tools` vs core. Stable `tools/diag_nats_ws.py` runs used the `websockets` default `proxy=True` route, while AdaOS core forced `proxy=None` on Windows and selected a direct route that half-stalled after the first few `PUB` frames. After changing core default to proxy-auto, an isolated `nats-py + AdaOS WebSocketTransport` test stayed healthy for 45s (`sent=42`, `got=42`, clean close `1000`).
  A full `adaos api serve` soak of about 190s then completed with `nats ws recv failed=0`, `watchdog/ConnectionClosedError/WinError=0`, `route timeout/proxy failed/open-ack fallback/no_upstream=0`, `event loop lag/hang/traceback=0`, root PING/PONG continuing through the run, and clean NATS WS close `1000` during requested shutdown. Caveat: that memory CSV sampled the launcher wrapper rather than the uvicorn child, so memory acceptance remains covered by the earlier process-tree runs.
- `hub_browser_accept_20260430_0350` and `hub_browser_accept_20260430_0431`: user-confirmed Windows hub-browser connectivity restored. Two latest `api serve` windows show `nats bridge connected=1` each, `nats ws recv failed=0`, `watchdog/ConnectionClosedError/WinError=0`, `route timeout/proxy failed/open-ack fallback/no_upstream=0`, `traceback/error level=0`, and expected NATS disconnect only during requested shutdown. Root-routed Yjs connections opened and closed without route errors. Residual non-blocking issues: many sub-second loop-lag diagnostics under the old 250ms threshold and a few slow browser-runtime handlers in `infrastate_skill` / `infrascope_skill`; normal defaults now warn only above 1000ms loop drift and 250ms async-handler duration.
- `linux_ru_zone_split_20260430_0734`: Linux hub `sn_92ffc943` reports `hub_root: ready/stable` on `wss://ru.api.inimatic.com/nats`, with `control_subs=1` and `route_subs=1` after a clean autostart restart. Independent checks show `https://ru.api.inimatic.com/v1/browser/hub/status?hub_id=sn_92ffc943` returns `online`, while `https://api.inimatic.com/v1/browser/hub/status?hub_id=sn_92ffc943` returns `offline`; RU root logs show no current browser `route: open` / YWS attempts. Conclusion: the remaining Linux browser failure is zone selection in the browser client, not a broken Linux hub-root NATS channel.
  Patch prepared: browser root-proxy base now learns and probes `hub_id -> zone` before `/ws` / `/yws` attach.
- `linux_ru_two_browser_memory_guard_20260430_0758`: after zone-aware browser deploy, backend-origin Yjs dedupe, compact/read-only `infrastate` snapshots, capped load-mark history, and supervisor memory-profile grace, a two-browser Linux/RU soak stayed ready for more than 4 minutes. Counts in the verification window: `nats ws recv failed=0`, `route timeout/proxy failed=0`, `supervisor route watchdog reset=0`, `event-loop lag/hang=0`, `memory apply/complete profile restart=0`. Browser path was active: `hub_root_browser: ready/stable`, `route: ready`, `sync_runtime.yws=2`, live media peer `1/1`. Runtime RSS moved from about 256 MiB at 45s to about 304 MiB at 4m33s, later about 336 MiB at 5m15s; this is no longer the previous runaway-to-3GB behavior, but it still needs a longer plateau soak.
- `linux_ru_diag_polish_20260430_0939`: after raising the gateway tiny-write warning threshold and restarting autostart, a 3m45s Linux/RU soak stayed clean: `nats ws recv failed=0`, `route timeout/proxy failed=0`, `supervisor route watchdog reset=0`, `event-loop lag/hang=0`, `memory apply/complete profile restart=0`, and `YJS owner flow above threshold=0`. Runtime RSS stayed in a narrow first-window band of about 247 MiB at 31s to 276 MiB at 3m44s; supervisor public memory status exposes `auto_profile_min_uptime_sec=300.0` and remained `current_profile_mode=normal`, `suspicion_state=stable`.
- `hub_workspace_sync_20260430`: Linux hub workspace `/root/.adaos/workspace` was checked after the abnormal workstation reboot. The only hub workspace diff is `skills/infrastate_skill/handlers/main.py`; local `.adaos/workspace/skills/infrastate_skill/handlers/main.py` matches it semantically and is already present in the workspace HEAD commit `ea28d74` (`perf: memory menagement`).
  The remaining local workspace dirt is only `.gitignore`; it is unrelated to the Linux hub hotpatch.
- `hub_core_sync_20260430`: Linux hub core slots `A` and `B` were compared against local core changes after the abnormal workstation reboot. The runtime-hotpatched files `sdk/io/out.py`, `services/logging.py`, `services/router/service.py`, `services/webspace_id.py`, `services/yjs/doc.py`, `services/yjs/gateway_ws.py`, `services/yjs/load_mark.py`, `services/yjs/load_mark_history.py`, `services/yjs/update_origin.py`, and `services/yjs/webspace.py` match local source in both slots. The only remaining local deltas are intentional commit polish: `apps/api/node_api.py` formatting around the compact `infrastate/action` snapshot call and `apps/supervisor.py` keeping `suspicion_state=suspected` while recording `auto_profile_last_block_reason` instead of hiding the suspicion as `suppressed`.
- `windows_data_ping_regression_20260430`: Windows root-routed browser load regressed after the earlier acceptance runs. Evidence: `/nats` fails after about 40s with `ConnectionClosedError` / `WinError 121`, remote browser does not load data, while independent AdaOS transport tooling can keep the raw `/nats` echo stable. An initial hypothesis blamed client data `PING`, but the later `sn_6acf0c01-b5f3b8a6d2` run failed with `data_pings_tx=0`, `ka_pings_rx=0`, and root still reporting `route downstream send done`. Conclusion: client data ping is not the root cause and may be part of the confirmed Windows-stable profile. Patch prepared: restore Windows+websockets `HUB_NATS_WS_DATA_PING_S=auto` to a conservative 5s, while Linux stays disabled unless explicitly requested.
- `windows_raw_ws_channel_20260430`: raw WebSocket tools show the public `/nats` channel itself is healthy. `tools/diag_nats_ws_concurrent.py` held 90s with concurrent reader/writer (`tx_pub=88`, `rx_msg=88`, `rx_ping=10`, `tx_pong=10`, `errors=[]`). `tools/diag_nats_ws.py` with nats-py CONNECT style, empty queue/reply spacing, split PUB frames, and binary frames held 90s (`pubs_tx=80`, `msgs_rx=80`, `nats_pings_rx=10`, `nats_pongs_tx=10`, `errors=[]`).
  At that checkpoint, `tools/diag_nats_client.py` through AdaOS `WebSocketTransportWebsockets` and `tools/diag_nats_py_ws.py` through stock nats-py/aiohttp both stopped receiving after 3-4 echo messages and closed with `1006` / `WinError 121` or `UnexpectedEOF`. Conclusion: the failure was not a raw root channel failure; it was in nats-py-style transport/runtime behavior around the first keepalive window. Later `windows_ws_control_ping_guard_20260430` re-validated AdaOS transport in isolation after backend keepalive hardening.
- `windows_proxy_ping_termination_20260430`: follow-up root-log analysis shows the failing nats-py-style path correlates with Root/proxy NATS `PING` data frames delivered downstream to the hub: after the first proxy/upstream keepalive window, Root reports missing PONG/client data and the hub sees `ConnectionClosedError`. The confirmed Windows commit had `WS_NATS_PROXY_KEEPALIVE_ENABLE=0`; current defaults are restored to that. Backend patch prepared: for normal hub WS-NATS clients, Root now answers upstream NATS `PING` locally and strips those `PING` command frames before forwarding downstream; transparent realtime sidecar connections (`rt-*`) still receive raw NATS control frames. The stripper is protocol-aware and skips `MSG` / `HMSG` payload bytes, so route/Yjs payloads containing `PING\r\n` are not modified.
- `windows_legacy_keepalive_guard_20260430`: after deploying upstream-`PING` termination, the Windows regression persisted. Fresh client diagnostics show `data_pings_tx=0` but `ka_pings_rx=2`, and a partial root log read shows both `ping (upstream->proxy) answered and stripped downstream` and a separate `ping (keepalive -> client) sent`. Conclusion: the backend patch is active, but the active root env can still enable the older client-facing NATS-data keepalive.
  Patch prepared: legacy `WS_NATS_PROXY_KEEPALIVE_ENABLE=1` is ignored for normal hub clients unless `WS_NATS_PROXY_KEEPALIVE_FORCE=1`; focused diagnostics must use `WS_NATS_PROXY_CLIENT_KEEPALIVE_ENABLE=1`.
- `windows_ws_control_ping_guard_20260430`: after the legacy NATS-data keepalive guard, standalone `tools/diag_nats_client.py` held 75s cleanly on the AdaOS `WebSocketTransportWebsockets`, but full `adaos api serve` still dropped during browser/Yjs load. Root logs for `sn_6acf0c01-753137252c` show `ws ping enabled pingMs=10000`, `wsPingsSent=6`, `wsPongsReceived=2`, then close `1006`; no NATS-data client keepalive was sent (`natsKeepalivesSent=0`). Conclusion: a stale root `WS_NATS_PROXY_WS_PING=1` can still break Windows hub clients under full runtime load. Patch prepared: ignore legacy `WS_NATS_PROXY_WS_PING=1` for normal hub clients unless `WS_NATS_PROXY_WS_PING_FORCE=1`; explicit diagnostics use `WS_NATS_PROXY_CLIENT_WS_PING_ENABLE=1`. Realtime sidecar `rt-*` remains allowed to use its own WS ping.
- `windows_observe_and_transport_ab_20260501`: the weather observer hypothesis was ruled out again; the observed hard stall was in `services.observe._write_local`, where synchronous `events.log` file I/O blocked the event loop for about 56s during the two-browser load. Local observe logging now goes through a non-blocking queue and daemon writer thread. A follow-up aiohttp run had `observe.py=0`, `_write_local=0`, loop lag=0, and `flush_slow=0`, but still flapped every 20-50s with `RuntimeError: ws closed` / `ClientConnectionResetError: Cannot write to closing transport`.
  A follow-up websockets run also flapped under the current Root env, with no observe/loop starvation, indicating the remaining delta from the confirmed `0cfbc9e` profile is Root-side keepalive/proxy behavior rather than local file I/O. `0cfbc9e` used `HUB_NATS_WS_IMPL=websockets` and `WS_NATS_PROXY_KEEPALIVE_ENABLE=0`.
- `windows_supersede_grace_experiment_20260501`: after root env/container refresh, a 190-second Windows two-browser run still flapped (`nats ws recv failed=7`, `ConnectionClosedError=56`, `yws open/close=9/9`) while memory stayed bounded around 213 MiB and observe/weather/event-loop diagnostics stayed clean. Root logs showed `supersede_grace_ms=15000` for the fresh runtime tags, so the immediate-supersede hypothesis has not been tested yet. With current backend parsing, `WS_NATS_PROXY_SUPERSEDE_GRACE_MS=0` would be coerced back to `15000`; use `WS_NATS_PROXY_SUPERSEDE_GRACE_MS=1` for the focused no-code experiment, and only patch parser/defaults if that experiment proves useful.
- `windows_proxy_env_collision_20260501`: decisive bisection found a local env collision, not a Root channel failure. Raw `tools/diag_nats_ws.py` with `HUB_NATS_WS_PROXY=auto` changed the Root-observed source to `77.37.240.23` and reproduced one-way / `1006` behavior; the same raw probe with the variable unset used the stable `217.216.106.x` route and closed cleanly. Fix prepared and verified: introduce `HUB_NATS_WS_PROXY_MODE=auto`, keep legacy `HUB_NATS_WS_PROXY` as compatibility input only, and hide the legacy variable during `websockets.connect` so Python proxy discovery cannot consume it.
- `windows_proxy_env_sanitized_20260501`: standalone `tools/diag_nats_client.py` intentionally ran with legacy `HUB_NATS_WS_PROXY=auto` still set. The patched AdaOS transport held 45s cleanly (`tx_count=8`, `rx_count=7`, no task errors, close `1000`). Root confirmed the healthy route: `from=217.216.106.4`, `pub=8`, `msg=8`, `keepaliveMisses=0`, `downstreamSendErrors=0`, close `code=1000`.
- `windows_two_browser_accept_20260501`: full `adaos api serve` under browser load ran about 178s and stopped cleanly.
  Local logs: `nats ws recv failed=0`, `watchdog/ConnectionClosedError/WinError=0`, `route timeout/proxy failed/starvation=0`, `event-loop lag/hang=0`. Root logs for `rt-a-5358db7fb0`: `from=217.216.106.4`, `uptime_s=177.789`, `pub=1054`, `keepaliveMisses=0`, `downstreamSendErrors=0`, close `code=1000`.
- `windows_memory_recheck_20260501`: follow-up run monitored the real uvicorn PID instead of the launcher wrapper. RSS moved from 135.7 MiB at 5s to 165.3 MiB at 121s; PrivateMemory moved from 137.8 MiB to 169.5 MiB. Root confirmed a clean 121.6s NATS session (`code=1000`, `keepaliveMisses=0`, `downstreamSendErrors=0`). Residual non-blocking signals: expected shutdown disconnect, high first-attach `infrastate` YJS owner-flow bursts, and one slow weather handler at 0.264s.
- `linux_ru_two_browser_plateau_20260501`: live Linux hub `192.168.0.30` with two browsers attached was hotpatched with active-receiver/fingerprint stream guards for `infrascope_skill`, `infrastate_skill`, and core `yjs.load_mark`. A 6-minute window warmed from about 245 MiB RSS to about 439 MiB and plateaued; a follow-up 10-minute window moved from about 462 MiB to about 547 MiB, then stayed flat for the last 3-4 minutes. Connectivity stayed stable: `hub_root: ready/stable`, `hub_root_browser: ready/stable`, `media_runtime live_peers=2/2`, `nats ws recv failed=0`, `route timeout/proxy failed=0`, `event-loop lag/hang=0`. Remaining issue is bounded YStore replay pressure (`sync_runtime: pressure`, replay about 715 KiB), not the previous 3 GiB runaway/restart pattern. A manual live `/api/node/yjs/webspaces/*/backup` request did not return within 60s, so live compaction needs a safer off-hot-path design.
- `replay_pressure_semantics_20260510`: after auth-model rollout on RU stand, login reached connected YWS/WebRTC paths but diagnostics still showed `state-sync=degraded:aging`, `replay=32/32`, and `_by_owner/gateway_ws` pressure.
  Root cause: entry-limit YStore compaction kept a full replay tail (snapshot + replay_window), and state-sync treated bounded replay maintenance pressure as semantic sync degradation even when transport was attached, first sync was complete, and materialization was ready. Patch: entry-limit compaction now targets a smaller replay tail by default, bounded replay maintenance pressure stays visible as a blocker without turning ready materialized sync red, and the browser treats the same blocker as non-stale during mixed-version rollout.
- `replay_pressure_autocompact_20260510`: follow-up hardening now requests background YStore runtime compaction when reliability observes bounded replay pressure on an eligible webspace. The request is quiet-window guarded (`ADAOS_YSTORE_AUTOCOMPACT_REPLAY_PRESSURE_QUIET_SEC`, default 2s) and can be disabled with `ADAOS_YSTORE_AUTOCOMPACT_ON_REPLAY_PRESSURE=0`, so reliability polling does not perform snapshot encoding inline.
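The quiet-window guard from `replay_pressure_autocompact_20260510` might look roughly like the sketch below. This is not the AdaOS implementation: the class and method names are hypothetical, and it assumes "quiet window" means that no YStore write was seen for the webspace during the last N seconds. Only the two environment variable names and the 2s default come from the entry above.

```python
import os
import time


class ReplayPressureAutocompact:
    """Hypothetical helper: decides when reliability polling may request
    background compaction instead of encoding a snapshot inline."""

    def __init__(self, env=None, clock=time.monotonic):
        env = os.environ if env is None else env
        # Documented switch: "0" disables the auto-compaction request.
        self.enabled = env.get("ADAOS_YSTORE_AUTOCOMPACT_ON_REPLAY_PRESSURE", "1") != "0"
        # Documented quiet window, default 2 seconds.
        self.quiet_sec = float(env.get("ADAOS_YSTORE_AUTOCOMPACT_REPLAY_PRESSURE_QUIET_SEC", "2"))
        self._clock = clock
        self._last_write = {}     # webspace_id -> time of last observed write (assumed signal)
        self._requested = set()   # webspaces with a request already outstanding

    def note_write(self, webspace_id):
        """Record a write; this reopens the quiet window for the webspace."""
        self._last_write[webspace_id] = self._clock()
        self._requested.discard(webspace_id)

    def should_request(self, webspace_id):
        """Return True at most once per quiet period for a webspace."""
        if not self.enabled or webspace_id in self._requested:
            return False
        last = self._last_write.get(webspace_id)
        if last is not None and self._clock() - last < self.quiet_sec:
            return False  # still inside the quiet window
        self._requested.add(webspace_id)
        return True
```

The caller would hand a `True` result to a background worker; the polling path itself only performs dictionary lookups.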
Tasks
F3M-001: NATS-over-WS disconnects after 25-60 seconds
Status: accepted again for Windows root-routed /nats after the local proxy-env collision fix; Linux/RU first-window connectivity remains accepted. Local observe file I/O starvation is fixed.
Evidence:
- `nats ws recv failed ... ConnectionClosedError: no close frame received or sent code=1006`
- `ConnectionResetError: [WinError 10054]`
- watchdog reports `_reading_task terminated`.
- `nats_ws_diag.jsonl` shows `last_rx_ago_s` and `last_ping_rx_ago_s` growing before disconnect while `pending_data_size` stays near zero.
- Under remote browser load, backend `ws-nats-proxy` close diagnostics show `code=1006`, `natsKeepalivesSent=0`, and long `lastClientPongAgo_s` / `lastClientPingAgo_s`, while root-routed `/yws` requests are repeatedly accepted with `101`.
Working hypothesis:
- The disconnect is not caused by local NATS pending queue starvation.
- The public root `/nats` endpoint is reachable and stable with raw `websockets` tooling.
- The decisive difference was route selection: tooling used `websockets` system proxy auto-detect (`proxy=True`), while AdaOS core forced direct route (`proxy=None`) on Windows.
- The direct route can become one-way under active NATS traffic: local sends appear successful, but Root stops receiving client frames and later closes with `1006`.
- Proxy-auto fixed the first Windows regression and remains the correct core default, but the mode must be expressed as `HUB_NATS_WS_PROXY_MODE=auto` or left unset. The legacy `HUB_NATS_WS_PROXY=auto` name can perturb Python proxy discovery before AdaOS parses it.
- The current reopened regression is narrower: nats-py-style hub clients can stall after Root/proxy-originated or upstream NATS `PING` command frames are delivered downstream during route/Yjs load.
- Backend keepalive and frame-accounting diagnostics remain useful, but the current primary fix is to make Root terminate those NATS `PING` frames for normal hub WS-NATS clients while preserving transparent control frames for realtime sidecar clients.
- Aiohttp is not the stable fallback for Windows multi-browser root-routed load; it still fails with `Cannot write to closing transport`.
- `HUB_NATS_WS_IMPL=auto` should resolve to the patched websockets transport on both Windows and Linux; aiohttp remains an explicit diagnostic override.
- The latest Windows regression was not fixed by supersede or keepalive changes; it was fixed by preventing `HUB_NATS_WS_PROXY=auto` from leaking into Python proxy auto-discovery.
- Weather observer callbacks are not the blocker; slow-callback diagnostics stayed at zero during the focused transport runs.
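A minimal sketch of the proxy-mode resolution and legacy-variable hiding described in the hypothesis above. The helper names are hypothetical and this is not the AdaOS transport code; it also assumes that any legacy value other than `none` maps to auto. Only the two variable names and the `proxy=True` / `proxy=None` distinction come from this tracker.

```python
import os
from contextlib import contextmanager


def resolve_proxy_mode(env):
    """Return 'auto' or 'none'. HUB_NATS_WS_PROXY_MODE wins; the legacy
    HUB_NATS_WS_PROXY is read as compatibility input only."""
    raw = env.get("HUB_NATS_WS_PROXY_MODE") or env.get("HUB_NATS_WS_PROXY") or "auto"
    return "none" if raw.strip().lower() == "none" else "auto"


def connect_kwargs(mode):
    """Map the resolved mode to the `proxy` argument discussed above:
    True = system proxy auto-detect, None = forced direct route."""
    return {"proxy": None} if mode == "none" else {"proxy": True}


@contextmanager
def legacy_proxy_env_hidden(env=os.environ, name="HUB_NATS_WS_PROXY"):
    """Temporarily remove the legacy variable so proxy discovery in the
    WebSocket client cannot consume it, then restore it unchanged."""
    saved = env.pop(name, None)
    try:
        yield
    finally:
        if saved is not None:
            env[name] = saved
```

Intended use: resolve the mode first, then wrap the connect call in `with legacy_proxy_env_hidden():` and pass `**connect_kwargs(mode)` to it.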
Actions:
- [x] Log structured close/error diagnostics from the Python WS transport.
- [x] Keep client-side NATS ping interval task disabled for WS transport by default.
- [x] Add backend WS/NATS proxy keepalive and supersede diagnostics.
- [x] Make Windows Selector loop an explicit diagnostic mode only.
- [x] Run multiple 180-second soaks with `ADAOS_WIN_SELECTOR_LOOP=0`.
- [x] Confirm the current loop is `WindowsProactorEventLoopPolicy` / `ProactorEventLoop`.
- [x] Confirm no local pending-data backpressure during the verified 180-second runs.
- [x] Reopen after root-routed browser load captured repeated remote reconnects and quiet NATS WS diagnostics.
- [x] Compare root proxy upstream ping/pong cadence against client-side `last_rx_ago_s`.
- [x] Keep backend WS-NATS `nats keepalive -> client` available only as an explicit diagnostic opt-in (`WS_NATS_PROXY_CLIENT_KEEPALIVE_ENABLE=1` or legacy `WS_NATS_PROXY_KEEPALIVE_ENABLE=1` plus `WS_NATS_PROXY_KEEPALIVE_FORCE=1`).
- [x] Confirm root currently runs with `WS_NATS_PROXY_KEEPALIVE_ENABLE=0`; treat `natsKeepalivesSent=0` as expected in that mode.
- [x] Increase backend WS-NATS supersede max grace to 10s, close superseded peers on new route readiness, and log resolved proxy config on startup.
- [x] Deploy backend route-open retry / supersede-grace patch to root.
- [x] Confirm root logs show `open ack retry` / `open republish` and `route_ready` supersede behavior instead of fallback frame flush.
- [x] Run a local diagnostic with raw NATS keepalive and both browsers connected; confirm no NATS watchdog reconnect or remote YWS close loop before requested shutdown.
- [x] Run A/B with hub-side `HUB_NATS_WS_DATA_HEARTBEAT_S=10`; confirm it does not stabilize `/nats` while root proxy keepalive is disabled.
- [x] Re-enable root proxy application-level keepalive and confirm root close diagnostics increment `natsKeepalivesSent`.
- [x] Prepare backend patch for configurable `/nats` WS control ping: `WS_NATS_PROXY_WS_PING=1`, `WS_NATS_PROXY_WS_PING_MS=10000`.
- [x] Prepare backend patch to keep superseded `/nats` peers until grace timeout by default instead of closing them immediately on `route_ready`.
- [x] Deploy backend WS ping / supersede-grace patch to root.
- [x] Prepare backend patch to close and optionally terminate `/nats` when root NATS keepalive PONG is missing: `WS_NATS_PROXY_CLOSE_ON_KEEPALIVE_MISS=1`, `WS_NATS_PROXY_TERMINATE_ON_KEEPALIVE_MISS=1`.
- [x] Run independent raw WS and AdaOS/nats-py transport probes from `tools` without `adaos api serve`; confirm raw root `/nats` echo is stable while AdaOS/nats-py transport still stalls after the first few messages.
- [x] Align diagnostic tools with runtime default: prefer `wss://api.inimatic.com/nats`; probe `wss://nats.inimatic.com/nats` only when `HUB_NATS_PREFER_DEDICATED=1`.
- [x] Prepare backend WS-NATS frame-accounting patch to report client, upstream-write, upstream-read, and downstream-send counters on close.
- [x] Deploy backend frame-accounting patch to root and inspect the next `conn close` / `upstream close` summaries for `PUB -> MSG -> downstream` mismatches.
- [x] Confirm frame counters show route traffic reaches the hub before the root closes on the first keepalive miss.
- [x] Prepare backend patch to stagger WS ping vs NATS-data keepalive and close only after repeated keepalive misses (`WS_NATS_PROXY_KEEPALIVE_MAX_MISSES`, default 3).
- [x] Compare stable `tools` WebSocket route with failing AdaOS core route and identify the proxy-auto vs direct-route difference.
- [x] Change core default to `proxy=True` / system proxy auto-detect, with `HUB_NATS_WS_PROXY_MODE=none` as an explicit direct-route diagnostic override.
- [x] Add transport regression tests for proxy default and coalesced root `PING` control frames.
- [x] Verify isolated `nats-py + AdaOS WebSocketTransport` stays connected and echoes traffic for 45s through the proxy-auto route.
- [x] Re-run `adaos api serve` for about 190 seconds and confirm no NATS watchdog reconnect, route timeout, remote-route fallback, or event-loop lag/hang before requested shutdown.
- [x] Re-run live local + root-routed browser acceptance with the updated core and confirm the remote browser loads Yjs data.
- [x] Update code/env defaults so the stable route is the default: `HUB_NATS_WS_PROXY_MODE=auto` / unset, and direct route only via `HUB_NATS_WS_PROXY_MODE=none`.
- [x] Reopen the Windows 2026-04-30 regression and test with client-originated NATS data ping disabled; confirm `/nats` still fails without any `data_pings_tx`.
- [x] Restore the confirmed Windows profile: `HUB_NATS_WS_DATA_PING_S=auto` sends a conservative 5s NATS-data ping only on Windows+websockets; Linux remains disabled unless explicitly requested.
- [x] Validate raw `/nats` channel with `tools/diag_nats_ws.py` and `tools/diag_nats_ws_concurrent.py`; confirm raw WebSocket framing remains healthy for 90s under concurrent PUB/MSG and Root NATS keepalive traffic.
- [x] Confirm nats-py-style clients still fail while raw WebSocket clients stay healthy.
- [x] Prepare Root WS proxy fix for nats-py-style clients: disable proxy-originated client keepalive by default and terminate upstream NATS `PING` frames at the proxy for non-transparent hub clients.
- [x] Harden Root WS proxy against stale root env: ignore legacy `WS_NATS_PROXY_KEEPALIVE_ENABLE=1` for normal hub clients unless `WS_NATS_PROXY_KEEPALIVE_FORCE=1`; use `WS_NATS_PROXY_CLIENT_KEEPALIVE_ENABLE=1` only for targeted diagnostics.
- [x] Harden Root WS proxy against stale root WS control ping env: ignore legacy `WS_NATS_PROXY_WS_PING=1` for normal hub clients unless `WS_NATS_PROXY_WS_PING_FORCE=1`; use `WS_NATS_PROXY_CLIENT_WS_PING_ENABLE=1` only for targeted diagnostics.
- [x] Add `WebSocketTransportWebsockets` send-path diagnostics (`current_send`, `last_send`, send counters/errors) to prove whether local sends actually correspond to Root-received client frames.
- [x] Add weather observer slow-callback diagnostics and rule out weather as the source of the `/nats` disconnect.
- [x] Move local `events.log` observe writes off the event loop; confirm two-browser run no longer shows `observe.py`, `_write_local`, loop lag, or `flush_slow` starvation.
- [x] Re-test aiohttp under two-browser load; confirm the old aiohttp `Cannot write to closing transport` failure still exists.
- [x] Re-test websockets under the current Root env; confirm it still flaps, so the remaining blocker is not the local observe file I/O path alone.
- [x] Change `HUB_NATS_WS_IMPL=auto` back to the patched websockets transport on both Windows and Linux; keep aiohttp explicit-only.
- [x] Restore Root env to the confirmed Windows profile for normal hub clients: `WS_NATS_PROXY_KEEPALIVE_ENABLE=0`, `WS_NATS_PROXY_CLIENT_KEEPALIVE_ENABLE=0`, `WS_NATS_PROXY_KEEPALIVE_FORCE=0`, `WS_NATS_PROXY_WS_PING=0`, `WS_NATS_PROXY_CLIENT_WS_PING_ENABLE=0`, `WS_NATS_PROXY_WS_PING_FORCE=0`, and disable deep wiretap/ping trace unless diagnosing one run.
- [x] Detect that current backend parsing would coerce `WS_NATS_PROXY_SUPERSEDE_GRACE_MS=0` back to `15000`.
- [x] Run a no-code Root env experiment with `WS_NATS_PROXY_SUPERSEDE_GRACE_MS=1` and verify root logs show `supersede_grace_ms=1` / immediate `closing superseded hub ws-nats connection`; conclude this was not the decisive fix.
- [x] Identify and fix the decisive Windows route regression: do not expose the legacy `HUB_NATS_WS_PROXY=auto` environment variable to Python proxy discovery.
- [x] Re-run live Windows root-routed browser acceptance after the proxy-env fix; confirm remote Yjs data loads without `/nats` watchdog reconnects and without route starvation.
- [x] Decide not to patch backend supersede parser/defaults for the current acceptance path; keep `WS_NATS_PROXY_SUPERSEDE_GRACE_MS=1` as an optional Root experiment, not the primary stable profile.
- [ ] Reconfirm Linux/RU after rollout still uses `auto -> websockets` and keeps the accepted first-window behavior.
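For reference, a protocol-aware `PING` stripper of the kind the Root proxy fix describes could be sketched as follows. This is an illustration, not the backend code: the function name is hypothetical, it assumes each buffer holds whole NATS commands (a streaming proxy would have to carry partial data between reads), and it assumes space-delimited control lines. It relies on the NATS framing rule that the last field of an `MSG` / `HMSG` control line is the total byte count that follows.

```python
def strip_upstream_pings(data: bytes):
    """Remove NATS PING command lines from an upstream buffer without
    touching MSG/HMSG payload bytes. Returns (forwarded_bytes, ping_count)."""
    out = bytearray()
    pings = 0
    pos = 0
    while pos < len(data):
        eol = data.find(b"\r\n", pos)
        if eol < 0:
            out += data[pos:]  # incomplete control line: pass through as-is
            break
        line = data[pos:eol]
        end = eol + 2
        verb = line.split(b" ", 1)[0].upper()
        if verb in (b"MSG", b"HMSG"):
            # Last field is the total number of bytes after the control
            # line; skip them plus the trailing CRLF without inspection.
            total = int(line.rsplit(b" ", 1)[1])
            end = min(len(data), end + total + 2)
            out += data[pos:end]
        elif verb == b"PING":
            pings += 1  # dropped here; caller answers PONG upstream
        else:
            out += data[pos:end]
        pos = end
    return bytes(out), pings
```

The returned count tells the caller how many `PONG\r\n` replies to write upstream. Transparent `rt-*` connections would bypass the function entirely.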
F3M-002: Root-routed HTTP requests time out during startup
Status: closed for the current 3-minute goal.
Evidence:
- Repeated `http route: timeout` and `http proxy failed` for `/api/node/status`, `/api/node/reliability`, `/api/node/reliability/summary`, and `/api/node/infrastate/snapshot`.
- Root logs show many `route.v2.to_hub` requests reach the WS/NATS proxy and are sent downstream, but not all responses return before `15000ms`.
Working hypothesis:
- Route request delivery is not the only failure point.
- Missing or delayed hub/browser response handling, reconnect overlap, or slow local handlers may leave route replies unpublished.
Actions:
- [x] Add route request/reply lifecycle diagnostics.
- [x] Add route publish/flush slow warnings and pending-data diagnostics.
- [x] Remove route key hot-path config reload.
- [x] Verify latest 180-second soaks have no `http route: timeout` and no `http proxy failed`.
- [ ] Reopen and correlate one timed-out `keyTag` from root logs with hub route callback logs if a timeout recurs.
- [ ] Add a compact route timeout summary grouped by path and keyTag.
- [ ] Rate-limit or defer non-critical root probes while NATS is reconnecting.
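One possible shape for the still-open "compact route timeout summary" action is sketched below. The record layout (`event`, `path`, `key_tag`) is hypothetical and would need to be mapped from whatever the route lifecycle diagnostics actually emit.

```python
from collections import Counter


def summarize_route_timeouts(records):
    """Group timed-out route requests by (path, keyTag), most frequent first.

    `records` is an iterable of dicts with hypothetical keys
    'event', 'path', and 'key_tag'."""
    counts = Counter(
        (r["path"], r["key_tag"])
        for r in records
        if r.get("event") == "timeout"
    )
    return [
        f"{path} keyTag={tag} timeouts={n}"
        for (path, tag), n in counts.most_common()
    ]
```

One summary line per group, logged once per window, would replace a repeated `http route: timeout` line for every request.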
F3M-003: Browser WS open ack fallback is still observed
Status: fixed in backend and deployed; supersede close behavior refined locally and awaiting backend deploy.
Evidence:
- `ws route: open ack fallback elapsed`
- Early frame counters are present before `open_ack`.
- During root-routed remote Yjs reconnects, a browser `open` can be dropped while hub route subscription is absent; fallback then flushes early frames and the hub records `no_upstream`.
- Latest root logs after the backend deploy show bounded `open ack retry` / `open republish` behavior and no captured `open ack fallback` or `no_upstream` in the local diagnostics window, but a superseded `/nats` peer can still be closed on `route_ready` while route open retries are in flight.
Working hypothesis:
- The hub can receive early frames before the route open acknowledgement is returned to root.
- The previous fallback is harmless only when the hub actually processed `open` but did not send `open_ack`; it is harmful when `open` was dropped during NATS route reconnect because it forwards frames before an upstream tunnel exists.
Actions:
- [x] Add early frame count/bytes to open ack fallback logs.
- [x] Verify latest 180-second soaks have no `open ack fallback`.
- [x] Reopen and correlate fallback cases with NATS reconnect and route timeout windows.
- [x] Replace fallback frame flush with bounded `open` retry (`ROUTE_WS_OPEN_ACK_MAX_ATTEMPTS`, default 4).
- [x] Deploy backend route-open retry patch to root.
- [x] Verify latest root-routed browser diagnostics produce bounded `open ack retry` recovery, not `open ack fallback` frame flush.
- [x] Prepare backend switch `WS_NATS_PROXY_CLOSE_SUPERSEDED_ON_ROUTE_READY=0` so route readiness no longer cuts the grace window short by default.
- [ ] Reconfirm after root `/nats` keepalive is restored that root-routed browser load produces no `no_upstream` incident.
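The bounded `open` retry that replaced the fallback frame flush can be illustrated with the sketch below. It is written in Python for consistency with the hub code and is not the backend implementation. The callback names, the per-attempt timeout, and the exception type are assumptions; only the default of 4 attempts comes from `ROUTE_WS_OPEN_ACK_MAX_ATTEMPTS`.

```python
import asyncio


class RouteOpenFailed(Exception):
    """Raised when no open_ack arrives within the attempt budget."""


async def open_with_bounded_retry(publish_open, wait_open_ack,
                                  max_attempts=4, ack_timeout_s=1.0):
    """Republish `open` until `open_ack` arrives or attempts run out.

    Early frames stay buffered by the caller and are flushed only after
    this returns, so nothing is forwarded without an upstream tunnel."""
    for attempt in range(1, max_attempts + 1):
        await publish_open(attempt)
        try:
            await asyncio.wait_for(wait_open_ack(), timeout=ack_timeout_s)
            return attempt  # acknowledged on this attempt
        except asyncio.TimeoutError:
            continue  # hub subscription may not be reinstalled yet
    raise RouteOpenFailed(f"no open_ack after {max_attempts} attempts")
```

On `RouteOpenFailed` the caller closes the browser route instead of flushing buffered frames, which is what avoids the `no_upstream` case.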
F3M-004: Yjs write pressure during first attach
Status: closed for the current 3-minute goal.
Evidence:
- `YJS owner flow above threshold ... source=yjs.gateway_ws channel=core.yjs.gateway.live_room.persist`
- High write count and byte bursts around first browser attach.
Resolution:
- Initial gateway first-attach bursts are expected and preserve durable YStore/subnet replication semantics.
- The current fix keeps immediate persistence and changes diagnostics to alert on sustained gateway pressure rather than peak-only attach bursts.
Actions:
- [x] Attribute Yjs pressure by source/channel.
- [x] Split gateway persistence out of `_by_owner/unknown` as `_by_owner/gateway_ws`.
- [x] Move YRoom diagnostic ystore runtime snapshots out of the realtime hot path by default.
- [x] Confirm latest 180-second soak has no `_by_owner/unknown` pressure and no YRoom `runtime_snapshot()` / `Path.stat()` blocking stack.
- [x] Decide not to batch/debounce `gateway_ws` ystore writes for this goal; durability wins over cosmetic write smoothing.
- [x] Tune gateway-owner pressure alerts to suppress peak-only first-attach warnings while preserving sustained-pressure alerts.
- [x] Confirm final 180-second soak has no Yjs owner pressure warning.
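The difference between peak-only and sustained pressure alerts can be sketched as below. The class name, the byte threshold, and the three-interval window are hypothetical; the actual alert tuning is not described in this tracker.

```python
from collections import deque


class SustainedPressureAlert:
    """Alert only when every one of the last `window` sampling intervals
    exceeded the threshold, so a single first-attach burst stays quiet."""

    def __init__(self, threshold_bytes, window=3):
        self.threshold = threshold_bytes
        self.recent = deque(maxlen=window)  # True = interval over threshold

    def observe(self, bytes_in_interval):
        """Record one interval; return True when pressure is sustained."""
        self.recent.append(bytes_in_interval > self.threshold)
        return len(self.recent) == self.recent.maxlen and all(self.recent)
```

Writes themselves are untouched by this; only the diagnostic is smoothed, in line with the decision to keep immediate persistence.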
F3M-005: Event loop lag/hang during startup and shutdown
Status: locally fixed for the severe freeze path; final root-load verification pending after /nats keepalive is restored.
Evidence resolved:
- Earlier lag stacks pointed to route key config reload, skill runtime path preparation, subnet-directory SQLite commit, NATS diagnostic file append, and skill snapshot refreshes triggered by `sys.ready`.
- The final accepted run has no real event loop lag/hang, no control lifecycle delayed warning, and no slow async handlers.
- Shutdown can still emit an expected idle-wait suppression debug line and a requested NATS disconnect warning.
- The 2026-04-29 root-load run exposed a new severe Windows freeze stack: `service_supervisor._watchdog_loop -> refresh_discovered -> asyncio.to_thread -> run_in_executor -> ThreadPoolExecutor._adjust_thread_count -> Thread.start`.
- After making service discovery refresh inline and cached, the `service_supervisor` / `Thread.start` stack did not recur in follow-up diagnostics.
- A 2026-05-14 direct Root MCP smoke correctly classified remote MCP as `upstream_unavailable`, but the concurrently running local hub emitted a severe hang stack in `browsers_skill._on_refresh -> _refresh_snapshot_sync -> _run_coro -> Future.result()`. The smoke was not the cause; it exposed a skill refresh handler blocking the event loop while waiting for snapshot projection.
Resolution:
- Known synchronous startup/hot-path operations have been moved off the event loop or deferred out of sys.ready.
- Diagnostic writes remain enabled, but NATS WS JSONL append now runs in a worker thread.
- browsers_skill refresh now schedules snapshot projection on its single projection executor without waiting when it is invoked from an active event loop. Pending refreshes are coalesced by webspace, and projection failures are logged asynchronously.
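The non-blocking, coalesced refresh described in the last resolution item can be sketched as follows. This is a minimal illustration, not the browsers_skill implementation: the class name CoalescingProjector, its methods, and the errors list (a stand-in for asynchronous failure logging) are all hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


class CoalescingProjector:
    """Hypothetical helper: at most one queued projection per webspace."""

    def __init__(self, project: Callable[[str], None]) -> None:
        self._project = project
        # One worker keeps projections ordered and off the caller's thread.
        self._executor = ThreadPoolExecutor(max_workers=1)
        self._lock = threading.Lock()
        self._queued: set[str] = set()
        self.errors: list[tuple[str, Exception]] = []

    def request_refresh(self, webspace_id: str) -> bool:
        """Schedule a projection and return at once; False means coalesced."""
        with self._lock:
            if webspace_id in self._queued:
                return False
            self._queued.add(webspace_id)
        # No Future.result() here: the caller never waits for the projection.
        self._executor.submit(self._run, webspace_id)
        return True

    def _run(self, webspace_id: str) -> None:
        # Clear the queued flag first so a refresh that arrives while this
        # projection is running schedules exactly one follow-up run.
        with self._lock:
            self._queued.discard(webspace_id)
        try:
            self._project(webspace_id)
        except Exception as exc:  # stand-in for asynchronous failure logging
            self.errors.append((webspace_id, exc))

    def shutdown(self) -> None:
        self._executor.shutdown(wait=True)
```

An event-loop handler would call request_refresh(webspace_id) and return immediately; the projection result is never awaited on the loop.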
Actions:
- [x] Add structured loop lag/hang logs.
- [x] Move selected sync subscriptions to worker threads.
- [x] Make Selector loop opt-in diagnostics only.
- [x] Add per-topic/adapted-handler labels to slow handler warnings.
- [x] Move startup native capacity/subnet registry work to a worker thread.
- [x] Suppress idle Proactor wait stacks as hang false positives.
- [x] Move active local infrascope_skill background target discovery to a worker thread.
- [x] Move slow ui.notify network work away from eventbus critical path.
- [x] Move hub subnet-directory staler heartbeat/stale sweep SQLite work off the event loop.
- [x] Move NATS WS diagnostic file writes off the NATS supervisor hot path.
- [x] Preserve skill/handler labels through SDK bus adaptation so slow warnings identify the exact skill.
- [x] Remove heavy sys.ready refresh work from active local infra_access_skill and infrastate_skill workspace/runtime copies.
- [x] Avoid worker-thread hop for infrastate_skill.on_runtime_event on sys.ready.
- [x] Confirm final 180-second soak has no real event loop lag/hang and no slow async handler warnings.
- [x] Avoid recurring asyncio.to_thread submission in skill-service discovery refresh.
- [x] Make control lifecycle await-resume watcher opt-in so normal heartbeats do not start a diagnostic thread from the event loop.
- [x] Confirm follow-up diagnostics have no service_supervisor/Thread.start stack and no 60-second event-loop freeze.
- [x] Remove blocking Future.result() wait from browsers_skill refresh handlers and add regression coverage for event-loop invocation.
- [ ] Reconfirm no real loop lag/hang during the final root-routed browser acceptance after /nats keepalive is restored.
F3M-006: Root MCP local startup uses fallback as the normal path
Status: closed for the current 3-minute goal.
Evidence resolved:
- Earlier accepted runs emitted "Root MCP bridge upstream unavailable; using embedded local Root MCP operation=surface" because local SDK calls went through the public Root MCP bridge.
- The public backend bridge is a direct HTTP proxy to ADAOS_BASE/X-AdaOS-Base, which is not a reliable route from the public root service back to a local hub.
Resolution:
- Local SDK get_local_* Root MCP calls now mark local target contexts and use embedded local registry/session/token/audit operations first.
- The remote bridge fallback remains available for explicit non-local Root MCP usage and for resilience when local-first is disabled.
Actions:
- [x] Add ADAOS_ROOT_MCP_LOCAL_FIRST with local-first enabled by default.
- [x] Keep ADAOS_ROOT_MCP_LOCAL_FIRST=0 as an escape hatch for explicit bridge validation.
- [x] Add a regression test proving local runtime calls do not probe the bridge.
- [x] Confirm final 180-second soak has root_mcp_fetch_failed=0 and embedded_fallback=0.
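The local-first routing decision can be sketched as below. Only the ADAOS_ROOT_MCP_LOCAL_FIRST variable and its 0 escape hatch come from this tracker; the function names, the callable parameters, and the extra falsy spellings ("false", "no", "off") are assumptions made for illustration.

```python
import os
from typing import Any, Callable, Mapping


def local_first_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Local-first is on unless ADAOS_ROOT_MCP_LOCAL_FIRST is set to a falsy value."""
    raw = env.get("ADAOS_ROOT_MCP_LOCAL_FIRST", "1").strip().lower()
    # Only "0" is documented; the other spellings are an assumption.
    return raw not in {"0", "false", "no", "off"}


def call_root_mcp(
    operation: str,
    *,
    is_local_target: bool,
    local_call: Callable[[str], Any],
    bridge_call: Callable[[str], Any],
    env: Mapping[str, str] = os.environ,
) -> Any:
    """Route local targets to embedded operations without probing the bridge."""
    if is_local_target and local_first_enabled(env):
        return local_call(operation)
    return bridge_call(operation)
```

With this shape, a regression test only needs to assert that bridge_call is never invoked for a local target while the flag is at its default.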
F3M-006A: Keep node selectors out of Root MCP managed target IDs
Status: implemented, awaiting local UI confirmation.
Evidence:
- Infra Access issue_codex_session received the UI node UUID 8db40740-b3ff-44bf-baf5-9fb013b35b01 as target_id and Root MCP rejected it with "managed target ... is not registered".
- The current managed-target registry uses hub-scoped target IDs such as hub:<subnet_id>; UI node selectors and named-entity device refs are a different addressing layer.
Resolution:
- Root MCP SDK local target context now resolves local aliases such as node_id, node:<node_id>, device:member:<node_id>, and bare subnet_id to hub:<subnet_id>.
- Infra Access treats non-hub: selectors from the UI as node selectors and lets the SDK infer the local hub target instead of forwarding the selector to the Root MCP target registry.
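A minimal sketch of the alias resolution described above, assuming the local node ID and subnet ID are already known from node config. The function name resolve_managed_target is hypothetical, and raising ValueError for an unknown selector is an assumption; the tracker does not say how non-local selectors are handled.

```python
def resolve_managed_target(selector: str, *, local_node_id: str, subnet_id: str) -> str:
    """Map a UI/local selector to the hub-scoped managed target ID."""
    value = selector.strip()
    if value.startswith("hub:"):
        # Already a managed target ID; pass it through unchanged.
        return value
    local_aliases = {
        local_node_id,                     # bare node_id
        f"node:{local_node_id}",           # node:<node_id>
        f"device:member:{local_node_id}",  # device:member:<node_id>
        subnet_id,                         # bare subnet_id
    }
    if value in local_aliases:
        return f"hub:{subnet_id}"
    # Assumption: anything else is rejected rather than forwarded to the registry.
    raise ValueError(f"selector {value!r} is not a known local alias")
```

For example, with local_node_id set to the UI node UUID from the evidence and a hypothetical subnet sn_example, every alias form resolves to hub:sn_example, so the registry never sees the raw UUID.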
Actions:
- [x] Add SDK regression coverage for local selector to managed-target resolution.
- [x] Add Infra Access runtime coverage that issue_codex_connection does not pass a UI node UUID as the Root MCP target.
- [ ] Confirm manually from [Node 0] Infra Access: click the Codex session action and verify no "managed target '<node uuid>' is not registered" error.
F3M-006B: Align Codex ProfileOpsRead with advertised Root MCP read tools
Status: implemented and locally smoke-verified.
Evidence:
- The local stdio Codex bridge advertised operational read tools such as get_status, get_runtime_summary, get_operational_surface, and get_activity_log.
- Fresh ProfileOpsRead MCP session leases only received generic operations.read.* plus memory-profile capabilities, so the advertised tools returned policy-denied payloads instead of operational data.
Resolution:
- ProfileOpsRead now includes the read-only hub.get_* capabilities that the Codex bridge exposes.
- ProfileOpsControl now includes the same read set plus hub.run_healthchecks.
- The public backend capability-profile definition is kept in sync with the local hub implementation.
Verification:
- pytest tests/test_root_mcp_foundation.py passes.
- pytest tests/test_sdk_root_mcp.py tests/test_infra_access_skill_runtime.py passes in the focused MCP/infra_access slice.
- Local stdio MCP smoke against adaos-local-hub reports 37 tools and ok=true for foundation, get_status, get_runtime_summary, get_operational_surface, get_activity_log, get_skill_logs, and get_subnet_diagnostics.
F3M-006C: Classify direct remote MCP health separately from bearer validity
Status: implemented locally; public deployment/fresh-bearer validation pending.
Evidence:
- Fresh ProfileOpsRead MCP session for hub:sn_92ffc943 was active, but direct remote MCP smoke on 2026-05-14 returned HTTP 502 for: GET /v1/root/mcp/foundation, JSON-RPC initialize, JSON-RPC tools/list, and JSON-RPC tools/call:get_status.
- The same 502 class reproduced on the regional ru.api.inimatic.com endpoint and the global api.inimatic.com endpoint. This means the check is failing before useful bearer/tool-level validation, not as an ordinary 401/403 token rejection.
- Backend inspection showed the public /v1/root/mcp route was still installed as a legacy upstream proxy to ADAOS_BASE (http://127.0.0.1:8777 by default). In public zones that makes a healthy bearer look like an upstream outage because the backend is trying to reach its own localhost instead of a native Root MCP surface.
- The observed 2026-05-13 deny->allow transition for the same mcp_session_lease:* actor should be treated as profile/runtime drift during rollout, not expected steady-state behavior. Session leases should carry a frozen grant snapshot; after changing profiles or endpoint mode, issue a fresh bearer and correlate events by session id and issued-at time.
Resolution:
- Added adaos dev root mcp smoke so operator and LLM diagnostics use one repeatable transport check instead of manual curl snippets.
- The smoke command redacts auth by design, exits non-zero on failure, and classifies 401/403 as auth_failed, 404 as endpoint_not_found, JSON-RPC errors as jsonrpc_error, and 5xx responses such as 502 as upstream_unavailable.
- The public backend now installs only the native /v1/root/mcp HTTP/JSON-RPC route. The historical /v1/root/mcp -> ADAOS_BASE upstream proxy has been removed for the MVP to avoid ambiguous operator diagnostics.
- Follow-up live smoke after deployment still returned adaos_root_mcp_upstream_failed, proving the legacy proxy was still taking precedence in that deployment. After legacy removal, this response body means the deployed backend is stale.
- A later deployment attempt did not update backend because reverse-proxy health failed before slot cutover. nginx -t rejected ssl_verify_client off inside location blocks in vhost.d/api.inimatic.com. The API vhost now keeps ssl_verify_client optional only at server level; public routes do not need per-location disablement, and protected routes still enforce mTLS via $ssl_client_verify.
- Backend Root MCP ProfileOpsRead/ProfileOpsControl capabilities were aligned with the Python Root MCP profile shape, including hub.get_status, hub.get_runtime_summary, hub.get_operational_surface, activity/capability summaries, and memory read tools.
- After the route repair deployed, direct public smoke advanced from 502 to 401, which confirms the public request is reaching an auth-gated Root MCP handler instead of the removed legacy upstream proxy. The failing bearer had an rmcp_session_* prefix produced by the local SDK/infra_access_skill embedded session issuer, while the public backend native route stores and validates its own mcp_* session leases in backend Redis. A local hub-issued rmcp_session_* token is therefore valid for the local/embedded Root MCP context, but not for direct public https://api.inimatic.com/v1/root/mcp smoke.
- The backend auth fallback previously returned client_certificate_required for any unrecognized Root MCP credential. That made a bearer issuer mismatch look like an mTLS problem. The backend now reports invalid_token when an auth header is present but not accepted; the CLI smoke also surfaces JSON error/message bodies so operators can see the real rejection reason.
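The failure classification listed in the resolution can be expressed as a small pure function. This is a sketch of the documented mapping, not the CLI's code: the function name, the "ok" and "unexpected_status" labels, and the rule that a JSON-RPC error is detected by an "error" member in a 2xx body are assumptions.

```python
from typing import Any, Mapping, Optional


def classify_smoke_result(status: int, body: Optional[Mapping[str, Any]] = None) -> str:
    """Classify one smoke response by HTTP status, then by JSON-RPC body."""
    if status in (401, 403):
        return "auth_failed"
    if status == 404:
        return "endpoint_not_found"
    if 500 <= status <= 599:
        return "upstream_unavailable"
    if 200 <= status <= 299:
        # JSON-RPC 2.0 responses carry an "error" member on failure.
        if body is not None and "error" in body:
            return "jsonrpc_error"
        return "ok"  # hypothetical label for a clean response
    return "unexpected_status"  # hypothetical label for anything else
```

Under this mapping the 502 seen on 2026-05-14 reads as upstream_unavailable and the later 401 as auth_failed, which matches the progression recorded above.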
Verification:
- pytest tests/test_root_mcp_smoke.py covers 502, auth-failure, and JSON-RPC-error classification.
- npm run build:api passes in src/adaos/integrations/adaos-backend.
- The 2026-05-15 live response body {"error":"adaos_root_mcp_upstream_failed","detail":"fetch failed"} identifies the legacy proxy path rather than the native Root MCP handler.
- The later 2026-05-15 live response body {"error":"client_certificate_required","message":"Client certificate is required."} is auth-gated native behavior before the improved backend error text is deployed; with an unrecognized bearer it should become invalid_token.
- Manual check to repeat after backend/root route work: adaos dev root mcp smoke --mcp-http-url https://ru.api.inimatic.com/v1/root/mcp --auth-env-var ADAOS_ROOT_MCP_AUTH.
Actions:
- [x] Add CLI smoke check for direct remote MCP.
- [x] Document failure classification and human verification path.
- [x] Fix the public backend route shape so native Root MCP can answer initialize, tools/list, and get_status.
- [x] Remove the legacy Root MCP upstream proxy from the backend MVP.
- [x] Remove invalid location-level ssl_verify_client off directives from API nginx vhost templates so reverse-proxy health can pass.
- [x] Deploy the backend route repair to the target zone.
- [x] Surface JSON error bodies in adaos dev root mcp smoke output.
- [x] Return invalid_token instead of client_certificate_required when the public Root MCP route receives an auth header that does not resolve.
- [ ] Align Infra Access Fresh Bearer Token issuance with the selected endpoint: local bridge flows may keep rmcp_session_*, while direct public Root MCP smoke must use backend-native mcp_* sessions or a backend-accepted owner bearer.
- [ ] After deployment, issue a fresh backend-native ProfileOpsRead session and run the smoke against the fresh session, then record the target/tool result here.
F3M-006D: Split public API and mTLS API surfaces
Status: planned.
Context:
- The current MVP uses one api.inimatic.com nginx server with ssl_verify_client optional. This keeps public browser/bootstrap endpoints reachable while still forwarding $ssl_client_verify and certificate headers to backend routes that enforce mTLS.
- nginx chooses ssl_verify_client during TLS handshake, before a URI-specific location is selected. That means we cannot safely express "do not request a client cert for this public path" with ssl_verify_client off inside location; nginx rejects that config.
- For now, public routes rely on server-level optional, and protected routes enforce mTLS in backend/nginx routing by checking $ssl_client_verify.
Target architecture:
- Keep api.inimatic.com as the hub/node API surface that requests client certificates during TLS handshake with ssl_verify_client optional. Backend routes on this host use $ssl_client_verify and forwarded certificate headers to enforce mTLS where required.
- Add pub.inimatic.com as the public API surface with no client-certificate request during TLS handshake. Browser, bootstrap, pairing, operator bearer, and Codex/Root MCP public entrypoints should move here unless they explicitly need the mTLS-aware surface.
- Make backend route policy explicit: public bearer/JWT routes and mTLS routes should be distinguishable by host/surface, not only by path conventions.
Checklist:
- [x] Choose canonical host split: api.inimatic.com for mTLS-aware API, pub.inimatic.com for public API.
- [ ] Add nginx/vhost templates for pub.inimatic.com without ssl_verify_client.
- [ ] Keep api.inimatic.com configured with ssl_verify_client optional at server scope for hub/node mTLS-aware routes.
- [ ] Move browser/bootstrap/pairing/operator bearer/Codex Root MCP defaults to pub.inimatic.com.
- [ ] Add deploy smoke that runs nginx -t and validates both public and mTLS host routing before slot cutover.
- [ ] Update bootstrap/node docs once the host split is live.
F3M-007: First-3-minute memory footprint
Status: closed for the current 3-minute goal.
Evidence:
- User requested memory state as part of the final loading evaluation.
- A naive first sampler captured only a launcher stub; the final accepted sampler measures the whole process tree and the heaviest child process.
Resolution:
- The final accepted run sampled the adaos api serve process tree during loading-to-ready and throughout the 180-second soak.
- Memory reached a startup plateau and stayed bounded: process-tree peak PrivateMemory was 230.555 MB and the last sample was 228.117 MB.
- Repeat final acceptance with browser /ws attach stayed bounded as well: process-tree peak PrivateMemory was 145.211 MB and the last sample was 145.211 MB.
Actions:
- [x] Add process-tree memory sampling to the final soak verification.
- [x] Capture first, ready, peak, and final memory samples.
- [x] Confirm peak and final memory values are in the same plateau range.
- [x] Confirm no memory-related traceback, supervisor failure, or event-loop lag appears in the final accepted run.
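The "same plateau range" check can be sketched over a list of process-tree samples in MB. The helper names, the 5% tolerance, and the extra flat-tail test are assumptions; the flat-tail test is added because a peak-versus-final comparison alone would also pass a run whose memory is still climbing at the last sample.

```python
from typing import Dict, Sequence


def summarize(samples_mb: Sequence[float], ready_index: int) -> Dict[str, float]:
    """Pick the first, ready, peak, and final samples from one run."""
    if not samples_mb:
        raise ValueError("at least one memory sample is required")
    return {
        "first": samples_mb[0],
        "ready": samples_mb[ready_index],
        "peak": max(samples_mb),
        "final": samples_mb[-1],
    }


def is_plateau(samples_mb: Sequence[float], tolerance: float = 0.05) -> bool:
    """True when the final sample is near the peak and the tail is flat."""
    peak = max(samples_mb)
    final = samples_mb[-1]
    tail_start = samples_mb[len(samples_mb) // 2]
    near_peak = (peak - final) <= tolerance * peak
    # Guard against runaway growth, where final == peak by construction.
    flat_tail = abs(final - tail_start) <= tolerance * peak
    return near_peak and flat_tail
```

With the reported peak of 230.555 MB and final sample of 228.117 MB, the difference is about 1% of peak, well inside the assumed tolerance.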
F3M-008: Remote root-routed Yjs attach closes under browser load
Status: closed for the current connectivity goal; keep Yjs load performance as a watch item.
Evidence:
- With a local browser and a root-routed remote browser connected at the same time, the remote browser repeatedly hit "connection closed" while local access remained usable.
- Root reverse-proxy accepted remote /hubs/sn_6acf0c01/yws/desktop upgrades with 101, then emitted repeated "SSL_read() failed ... bad record mac" around keepalive/upgraded traffic.
- Hub logs showed "yws connection closed webspace=desktop" around the same window as NATS WS reconnects.
- Control lifecycle delay stacks under load pointed at WebSocket/Yjs send/write paths, including expensive websocket compression and Yjs load-mark history append.
Working hypothesis:
- The primary remote disconnect is downstream of root route/NATS liveness, not local browser failure.
- Large first-sync Yjs bursts should avoid WebSocket compression and avoid avoidable synchronous diagnostics on the event loop.
- Root-routed Yjs must tolerate a dropped open during hub route reconnect by retrying open, not by flushing browser frames before upstream exists.
- Latest diagnostics show route-open retry is working; the remaining remote browser close loop followed /nats ConnectionClosedError / watchdog reconnects.
- The core proxy-auto fix removes the reproduced /nats churn in isolated, full local API, and live hub-browser runs.
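The bounded open retry from the hypothesis above can be sketched as follows. This is illustrative only: the root backend is not written in Python, and the function name, the send_open/wait_ack callables, and the ack timeout are hypothetical. The attempt cap of 4 mirrors the ROUTE_WS_OPEN_ACK_MAX_ATTEMPTS default named in this tracker.

```python
import asyncio
from typing import Awaitable, Callable


async def open_route_with_retry(
    send_open: Callable[[], Awaitable[None]],
    wait_ack: Callable[[], Awaitable[None]],
    *,
    max_attempts: int = 4,
    ack_timeout_s: float = 1.0,
) -> int:
    """Resend open until it is acknowledged; return the successful attempt number."""
    for attempt in range(1, max_attempts + 1):
        await send_open()
        try:
            await asyncio.wait_for(wait_ack(), timeout=ack_timeout_s)
            return attempt
        except asyncio.TimeoutError:
            # The open was dropped (for example during hub route reconnect).
            continue
    # Give up instead of flushing browser frames to a missing upstream.
    raise ConnectionError(f"no open_ack after {max_attempts} attempts")
```

The caller keeps browser frames buffered until this coroutine returns, and drops the route if it raises, so frames are never flushed before the upstream exists.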
Actions:
- [x] Disable local uvicorn WebSocket per-message deflate for API serve.
- [x] Re-run a 3+ minute local + root-routed browser diagnostic with local raw NATS keepalive and confirm no NATS watchdog reconnect, no unexpected YWS close, and no compression-related control-lifecycle warning stack before requested shutdown.
- [x] Replace root route open_ack fallback frame flush with bounded open retry.
- [x] Deploy backend route-open retry / supersede-grace patch to root.
- [x] Confirm latest root logs show route retry/supersede behavior and no captured fallback frame flush.
- [x] Prepare backend WS-NATS liveness refinement: configurable WS ping and no immediate supersede close on route_ready.
- [x] Deploy backend WS ping / supersede-grace refinement to root.
- [x] Prepare backend keepalive-miss close refinement so a half-open /nats tunnel is proactively closed and replaced.
- [x] Align core /nats transport with stable raw tools route by defaulting to websockets proxy-auto.
- [x] Re-run root-routed browser soak with updated core and both browsers connected.
- [x] Confirm no permessage_deflate.encode control-lifecycle delay stack recurs in the latest accepted windows.
- [x] Confirm no "hub route frame arrived while upstream is not connected" / no_upstream incident recurs in the latest accepted windows.
- [x] Confirm /nats stays connected for at least 180 seconds and remote /yws does not close due to route errors before requested shutdown.
- [ ] If load-mark history append still appears in loop-delay stacks, move history append off the event loop or batch it under diagnostics-only mode.
F3M-009: Reliability summary polling blocks the event loop
Status: closed for the current 3-minute goal.
Evidence:
- While local and root-routed browsers were connected, repeated client polling of /api/node/reliability/summary produced a control-lifecycle warning stack through node_reliability_summary -> current_reliability_payload -> load_config -> runtime_state_mtime_ns -> Path.resolve.
- The endpoint response is relatively large and mostly stable, so the long-term architecture should move toward a reusable status plane with thin monitoring deltas.
Resolution:
- /api/node/reliability and /api/node/reliability/summary now build the reliability payload in an AnyIO worker thread.
- The follow-up architecture work is tracked under Reusable Status Plane And Thin Monitoring.
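The off-loop payload build can be sketched with the standard library. Note the swap: the resolution above uses an AnyIO worker thread, while this sketch uses asyncio.to_thread so it stays dependency-free. Both function names are hypothetical stand-ins, and the payload is a placeholder.

```python
import asyncio
import threading
from typing import Any, Dict


def build_reliability_payload() -> Dict[str, Any]:
    """Stand-in for the blocking builder (config load, path resolution)."""
    return {"built_on_thread": threading.get_ident()}


async def reliability_summary_handler() -> Dict[str, Any]:
    # The blocking work runs in a worker thread; the event loop stays free
    # to serve WebSocket and Yjs traffic while the payload is built.
    return await asyncio.to_thread(build_reliability_payload)
```

Calling asyncio.run(reliability_summary_handler()) returns a payload whose built_on_thread differs from the caller's thread ID, which is the property the fix relies on.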
Actions:
- [x] Move reliability payload construction for the high-frequency HTTP endpoints off the event loop.
- [x] Update the isolated reliability endpoint test double so it reflects current bootstrap imports.
- [x] Verify targeted reliability endpoint tests pass.
- [x] Re-run a 3+ minute local + root-routed browser diagnostic and confirm no node_reliability_summary/current_reliability_payload warning stack recurs.
F3M-010: Linux/RU root-routed browser selects the wrong root zone
Status: fixed for connectivity; follow-up memory pressure is tracked separately.
Evidence:
- Linux hub sn_92ffc943 is configured for zone=ru and keeps hub_root: ready/stable through wss://ru.api.inimatic.com/nats.
- https://ru.api.inimatic.com/v1/browser/hub/status?hub_id=sn_92ffc943 returns online, while the central root returns offline for the same hub.
- Earlier Linux runtime saw root-routed HTTP status probes but no /ws or /yws route-open attempts, so the remote browser data path was not reaching the RU hub runtime.
- After the zone-aware browser bundle deploy, two Linux root-routed browsers loaded data, confirming the route-zone selection fix.
Working hypothesis:
- AppComponent and pairing flows use the deployment-zone service, but AdaosClient.rootHubBaseUrl() independently falls back to ROOT_BASE.
- A browser can therefore pass status/pairing through the RU root while YJS/WS attaches through the central root, where the Linux hub is offline.
Actions:
- [x] Add browser-side hub_id -> zone persistence after successful root status and pairing approval.
- [x] Make YDoc root-proxy attach probe known zones through /v1/browser/hub/status and select the online root before setting /hubs/<hubId> base.
- [x] Fall back from adaos_hub_id to adaos_last_subnet_id when restoring a browser session.
- [x] Confirm client build succeeds after the async root-zone resolver change.
- [x] Deploy the updated client bundle and confirm Linux remote browsers open/load data through the RU root.
- [x] Confirm Linux reliability changes from sync_runtime.yws=0 rooms=0 opens=0/0 to active browser/YWS behavior after remote browser attach.
- [x] Mark raw hub-credential NATS diagnostic tools as potentially superseding the live runtime connection.
- [ ] Update raw diagnostic tools or root auth semantics so diagnostic NATS probes do not supersede the live runtime connection.
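The zone-aware root selection can be sketched as below. The real resolver lives in the browser client; this Python version only illustrates the selection order. The function name, the zone_roots mapping, and the probe_status callable (standing in for a GET of /v1/browser/hub/status?hub_id=<hubId>) are hypothetical.

```python
from typing import Callable, Mapping, Optional


def select_online_root(
    hub_id: str,
    zone_roots: Mapping[str, str],
    probe_status: Callable[[str, str], str],
    preferred_zone: Optional[str] = None,
) -> Optional[str]:
    """Return the /hubs/<hubId> base on the first root that reports the hub online."""
    zones = list(zone_roots)
    if preferred_zone in zone_roots:
        # Probe the persisted hub_id -> zone mapping first.
        zones.remove(preferred_zone)
        zones.insert(0, preferred_zone)
    for zone in zones:
        base = zone_roots[zone].rstrip("/")
        if probe_status(base, hub_id) == "online":
            return f"{base}/hubs/{hub_id}"
    return None  # no known root sees the hub as online
```

Selecting the base only after a successful status probe keeps the Yjs/WS attach on the same root that answered status and pairing, which is the mismatch the working hypothesis describes.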
F3M-011: Linux remote-browser attach triggers runaway memory growth
Status: fixed for the first-3-minute goal; long-run plateau confirmation pending.
Evidence:
- With two Linux root-routed browsers attached, browser data loaded, then links oscillated between recovery/degraded and the supervisor restarted the slot.
- Supervisor memory telemetry showed RSS growth around 1.8GB in the active runtime and growth slope above 800MB/min.
- yjs_load_mark.jsonl grew to hundreds of MB; recent load-mark rows showed both _by_owner/skill_infrastate_skill and _by_owner/gateway_ws carrying large sustained byte rates.
- This pattern indicates backend-originated detached Yjs diffs are persisted once by async_get_ydoc and then persisted again by the live room while being fanned out to browsers.
- 2026-05-08 regression pass: memory still climbed under destructive infrastate load even though tool-call quarantine fired. Hub logs showed skill:infrastate_skill quarantined via infrastate_skill:get_snapshot, while slow webio.stream.snapshot.requested subscription handlers kept running outside SkillManager.
- 2026-05-08 follow-up regression pass on slot A / 2ec14c5: after browser click activity, the hub reached about 1.16GB RSS. Logs showed more than 1000 webio.stream.snapshot.requested and about 1900 webio.stream.subscription.changed events since restart, while quarantine service state was written under default for events without explicit webspace_id; the desktop browser therefore missed the visible data.yjs_qrnt signal.
- Browser symptom was only "Action failed: skill_owner_quarantined"; the response already carried owner/tool/reason, but the client collapsed it to the error code. Skill-local quarantine logging also failed when ADAOS_SKILL_MEMORY_PATH pointed at data/db/skill_env.json instead of a directory.
- Scenario shortcut icons were present in the effective catalog as node-attributed scenario apps, but the client-side app filter treated every scenario app with a real node_id as remote/non-desktop and hid it.
Working hypothesis:
- Skill/core writes that skip the direct live-room fast path correctly write a detached diff to YStore, then apply that diff to the active room so browsers receive it.
- The active room currently treats that already-persisted backend diff like a browser-origin update and writes it to YStore again as gateway_ws.
- Under two remote browsers and active infrastate streams, duplicate YStore writes plus unbounded load-mark history amplify memory, disk, and diagnostic pressure.
- A second amplifier was /api/node/infrastate/snapshot: root-routed browser fallback probes called get_snapshot, which projected the full diagnostic snapshot into Yjs and returned multi-megabyte payloads.
- A third amplifier was supervisor policy profiling: the memory detector could restart the runtime into sampled_profile during the first browser attach, causing recovery/degraded oscillation even after transport was healthy.
- A fourth amplifier is any skill @subscribe(...) path that writes to Yjs without passing through SkillManager.run_tool; owner quarantine must gate skill event handlers as well as public tools.
Actions:
- [x] Add a short-lived exact-update marker for backend-originated room fanout updates that were already persisted.
- [x] Make DiagnosticYRoom skip only matching duplicate backend-origin YStore writes while preserving browser fanout and browser-origin persistence.
- [x] Add gateway tests covering duplicate skip and unmarked browser-update persistence.
- [x] Cap yjs_load_mark.jsonl by default and limit load-mark stream rows to the top pressure buckets.
- [x] Disable full event payload logging by default to avoid duplicating large io.out.stream.publish payloads into rotating logs.
- [x] Make infrastate_skill.get_snapshot read-only for HTTP callers unless project=True is explicitly requested.
- [x] Return a compact client snapshot with truncated diagnostic card content and last_refresh_ts so browsers stop repeatedly falling back to heavy HTTP snapshot loads.
- [x] Delay automatic policy-triggered memory profiling restarts until after the first 300 seconds of runtime uptime.
- [x] Deploy the active core/skill hotpatches to the Linux hub.
- [x] Sync the Linux hub workspace infrastate_skill hotpatch back into local .adaos/workspace for the next skill push.
- [x] Compare Linux hub core slots A and B with local source and identify the only remaining intentional local commit deltas.
- [x] Run a two-browser Linux soak and confirm the first 3 minutes do not hit NATS churn, route timeouts, event-loop hangs, or memory-profile restarts.
- [x] Confirm memory no longer grows toward the previous 3GB supervisor restart pattern during the first acceptance window.
- [x] Run a longer 10-15 minute Linux two-browser soak and confirm RSS reaches a bounded plateau.
- [x] Confirm load-mark no longer reports simultaneous sustained high byte rates for the same backend diff under both skill_infrastate_skill and gateway_ws.
- [x] Add Yjs owner-guard admission to SDK skill subscription wrappers before executing @subscribe handlers.
- [x] Make browser skill-action quarantine warnings include owner/tool/reason/retry instead of only skill_owner_quarantined.
- [x] Fix skill quarantine JSONL logging when the skill memory path resolves to data/db/skill_env.json.
- [x] Keep scenario shortcut apps visible on the desktop even when the effective catalog tags them with the local hub node_id.
- [x] Hotpatch the Linux hub active slot B with backend subscription/logging changes and restart the runtime under supervisor.
- [x] Fix owner-guard implicit webspace normalization so desktop browsers can see data.yjs_qrnt for subscription-triggered quarantines.
- [x] Add bounded/coalesced eventbus handling for webio.stream.subscription.changed and stale queued stream handlers.
- [x] Hotpatch Linux hub active slot A and Mediapoint/member active slot B with Wave 10 owner-guard/eventbus containment changes, then restart runtimes under supervisor.
- [ ] Rebuild/redeploy the client bundle so quarantine diagnostics and scenario shortcut filtering changes are visible in the browser.
- [ ] Run a destructive infrastate browser-load soak after the client deploy and confirm skill subscription quarantine stops the memory climb.
- [ ] Design safe off-hot-path YStore replay compaction for live browser load; direct live backup can exceed a 60s request window.
- [ ] Tune YStore replay defaults or background compaction so sync_runtime leaves pressure after browser warm-up.
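The short-lived exact-update marker from the first two actions can be sketched as a digest map with expiry. This is a minimal illustration, not the DiagnosticYRoom code: the class and method names are hypothetical, SHA-256 as the exact-match key is an assumption, and the 30-second default mirrors ADAOS_YJS_BACKEND_ROOM_UPDATE_SKIP_TTL_S=30 from the operating checklist.

```python
import hashlib
import time
from typing import Callable, Dict


class BackendUpdateMarkers:
    """Remember backend diffs that were already persisted, for a short time."""

    def __init__(self, ttl_s: float = 30.0, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl_s = ttl_s
        self._clock = clock
        self._expires: Dict[bytes, float] = {}

    def mark_persisted(self, update: bytes) -> None:
        """Call after a backend-origin diff has been written to YStore."""
        self._expires[hashlib.sha256(update).digest()] = self._clock() + self._ttl_s

    def should_skip_persist(self, update: bytes) -> bool:
        """True once for a marked, unexpired update; False for anything else."""
        now = self._clock()
        # Drop expired markers so the map stays bounded.
        self._expires = {k: v for k, v in self._expires.items() if v > now}
        # pop() makes the marker one-shot: a later identical browser-origin
        # update is persisted as usual.
        return self._expires.pop(hashlib.sha256(update).digest(), None) is not None
```

The live room would call should_skip_persist(update) before its YStore write and still fan the update out to browsers either way, so only the duplicate write is suppressed.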
Operating Checklist
Before a 3-minute soak:
- [ ] ADAOS_WIN_SELECTOR_LOOP=0.
- [ ] HUB_NATS_WS_DIAG_FILE=.adaos/diagnostics/nats_ws_diag.jsonl.
- [ ] HUB_ROOT_LOG_SNAPSHOT=1.
- [ ] HUB_NATS_WS_PROXY is unset in normal Windows and Linux runs. Use HUB_NATS_WS_PROXY_MODE=auto for the stable route and HUB_NATS_WS_PROXY_MODE=none only for direct-route diagnostics.
- [ ] For zoned root-routed browser acceptance, verify /v1/browser/hub/status?hub_id=<hubId> is online on the expected root zone and rootHubBaseUrl() resolves /hubs/<hubId> under that same zone.
- [ ] For root-routed browser acceptance, root backend can use its stable defaults: WS_NATS_PROXY_KEEPALIVE_ENABLE=0, WS_NATS_PROXY_CLIENT_KEEPALIVE_ENABLE=0, WS_NATS_PROXY_KEEPALIVE_FORCE=0, WS_NATS_PROXY_KEEPALIVE_MS=20000, WS_NATS_PROXY_KEEPALIVE_REQUIRE_HANDSHAKE=1, WS_NATS_PROXY_UPSTREAM_NATS_PING_MS=20000, WS_NATS_PROXY_WS_PING=0, WS_NATS_PROXY_CLIENT_WS_PING_ENABLE=0, WS_NATS_PROXY_WS_PING_FORCE=0, WS_NATS_PROXY_CLOSE_SUPERSEDED_ON_ROUTE_READY=0, WS_NATS_PROXY_SUPERSEDE_GRACE_MS=15000, WS_NATS_PROXY_CLOSE_ON_KEEPALIVE_MISS=1, WS_NATS_PROXY_TERMINATE_ON_KEEPALIVE_MISS=1, WS_NATS_PROXY_KEEPALIVE_MAX_MISSES=3.
- [ ] Capture process-tree memory samples during loading-to-ready and final acceptance runs.
- [ ] Keep normal diagnostic defaults common across Windows and Linux: ADAOS_LOG_EVENTS_PAYLOAD=0, ADAOS_YJS_LOAD_MARK_STREAM_MIN_INTERVAL_SEC=2.0, ADAOS_YJS_LOAD_MARK_STREAM_TICK_INTERVAL_SEC=2.0, ADAOS_YJS_LOAD_MARK_STREAM_UNCHANGED_KEEPALIVE_SEC=30.0, ADAOS_YJS_LOAD_MARK_STREAM_TOP_N=24, ADAOS_YJS_LOAD_MARK_GATEWAY_HIGH_WPS=64, ADAOS_YJS_LOAD_MARK_GATEWAY_CRITICAL_WPS=128, ADAOS_YJS_LOAD_MARK_HISTORY_MAX_BYTES=10485760, ADAOS_YJS_BACKEND_ROOM_UPDATE_SKIP_TTL_S=30, ADAOS_INFRASTATE_SNAPSHOT_CONTENT_MAX_BYTES=4096, ADAOS_SUPERVISOR_MEMORY_AUTO_PROFILE_MIN_UPTIME_SEC=300.
- [ ] Deep trace is off unless investigating one focused case.
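The first four pre-soak items can be checked mechanically. The sketch below is a hypothetical helper, not an existing AdaOS command; the variable names and expected values are copied from the checklist, while the function name and message format are assumptions.

```python
from typing import List, Mapping

# Expected values taken from the pre-soak checklist.
EXPECTED = {
    "ADAOS_WIN_SELECTOR_LOOP": "0",
    "HUB_NATS_WS_DIAG_FILE": ".adaos/diagnostics/nats_ws_diag.jsonl",
    "HUB_ROOT_LOG_SNAPSHOT": "1",
}
# Variables that should not be set in normal Windows and Linux runs.
MUST_BE_UNSET = ("HUB_NATS_WS_PROXY",)


def preflight(env: Mapping[str, str]) -> List[str]:
    """Return one message per checklist mismatch; an empty list means ready."""
    problems = []
    for name, want in EXPECTED.items():
        got = env.get(name)
        if got != want:
            problems.append(f"{name}: expected {want!r}, got {got!r}")
    for name in MUST_BE_UNSET:
        if env.get(name):
            problems.append(f"{name}: expected unset, got {env[name]!r}")
    return problems
```

Running preflight(os.environ) before a soak and refusing to start on a non-empty result would turn these manual checkboxes into a repeatable gate.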
During analysis:
- [ ] Start with .adaos/logs/adaos.log.
- [ ] Use .adaos/diagnostics/nats_ws_diag.jsonl to distinguish RX silence from local backpressure.
- [ ] Use .adaos/root_log_snapshots/*__extract.log to correlate root route timeout keyTags.
- [ ] Only request full terminal output when the local logs are missing the incident window.
Post-Goal Follow-Ups
The current local runtime goal is complete. Keep these follow-ups in the issue tracker so they are not lost:
- [ ] If public remote Root MCP access to local hubs is required, design a hub-routed backend/infra bridge rather than a direct upstream HTTP proxy.
- [ ] If any future 180-second run reopens NATS/route/Yjs/loop symptoms, add a new task under this same goal with the run id and exact log evidence.
Reusable Status Plane And Thin Monitoring
Goal
Replace high-frequency polling of large monitoring payloads with a reusable status-plane architecture that core services and skills can both feed.
Success means:
- The client no longer polls large mostly-static payloads such as /api/node/reliability/summary for badge/status UI.
- Core services and skills can publish small versioned status cards through one common SDK/service contract.
- Status cards and statusPlane are not a third data transport: they carry compact state, freshness, and references to Yjs/stream/details sources, while large rows, inventories, logs, and diagnostics stay on their declared routes.
- Browser-facing skills make the Yjs vs stream vs details route explicit before implementation; the runtime enforces budgets but does not silently reroute badly designed data flows.
- Primary Yjs carries only reconnect-stable bootstrap/control state and the current subscription/control surface. Operator-facing variables, active rows, telemetry, logs, and event tails move through bounded stream receivers.
- Heavy diagnostic data stays behind lazy streams, explicit details requests, or debug-only full snapshots.
- Yjs and stream guards expose limits, suppressions, quarantine, and correlation context as operator-visible diagnostics instead of hiding overload.
- infrastate_skill, browsers_skill, and infrascope_skill use the same reusable status projection pattern instead of maintaining parallel ad-hoc debounce, fingerprint, stream snapshot, and last-good-cache logic.
- Existing UI views remain compatible during migration through thin summary endpoints and backward-compatible stream receivers.
Current Status
Snapshot date: 2026-05-19.
Overall completion: 76%. First implementation slices landed the ABI/schema
contract, runtime preservation of receiver route metadata, router stream-guard
use of declared receiver budgets, per-receiver stream guard counters, and the
first SDK helper for replace-mode stream variables: skill.yaml:data_routes,
stream receiver budget/guard metadata, validator schema coverage, LLM
skill-template guidance, materialized data.webio receiver metadata, router
guard policy metadata, webio_stream_guard_snapshot(...), and
stream_variable_publish(...). The reliability full snapshot, compact summary,
and CLI now also expose stream-guard publish/suppress counters plus eventbus
webio.stream.snapshot.requested / webio.stream.subscription.changed
control-pressure counters by receiver. ProjectionService/Yjs governance now
records the projection route (scope, slot, path, root) behind the last
primary-doc pressure event. The first status-plane slice now provides
StatusCard, StatusRegistry, and adaos.sdk.status helpers for small
versioned status summaries that point to stream/tool details. The API bootstrap
now registers the shared status registry, /api/node/status/cards exposes
cheap filtered reads, and /api/node/reliability/summary includes a
compatibility statusPlane block. StatusPlane now also carries compact
derived guard cards for Yjs pressure, stream guard pressure, and stream-control
pressure, while HotEventBudget provides the shared debounce/window primitive
for converting hot raw events into stable operator status. The reliability
summary now has a migration-safe thin mode backed by statusPlane and
ETag/If-None-Match, so polling clients can avoid rebuilding or downloading
the compatibility summary when status cards are unchanged. The Angular
communication runtime now uses that contract: it probes mode=thin with a
cached ETag and only downloads mode=full when the status snapshot changed or
when the thin response cannot be interpreted as a runtime snapshot. Summary
responses now expose cache/body-size headers and
/api/node/reliability/summary/metrics so soak checks can count thin/full
responses, bytes, and 304 reuse. The same acceptance surface now includes
Yjs owner-guard attempted/allowed/blocked/throttled counters, active
quarantine state, TTL/retry context, and the last guarded path/tool, so a Yjs
quarantine does not require log scraping before a soak can be interpreted.
Status registry diagnostics now expose the
status-card compact-boundary budget (maxCardBytes, observed max bytes, and
oversized-card counters) so misuse of statusPlane as a data payload route is
visible during reviews and soaks. The SDK projection runtime now also records
per-event refresh pressure counters (requested, started, coalesced,
no_dirty, superseded, and dropped) so noisy event routing is attributable
before a skill is optimized.
The infrastate_skill migration now has a first data-route plan in
docs/architecture/infrastate-data-route-plan.md, plus matching
skill.yaml:data_routes and webui.json receiver budget/guard metadata in the
workspace skill; this is intentionally metadata-first and does not move runtime
payloads yet. Validation also restored the intended pressure split inside
infrastate_skill: Yjs block stops primary-doc projection, while Yjs
throttle stretches the projection interval and still lets stream snapshots
serve current variables.
Checkpoint on .30 confirmed a boundary bug: Yjs owner quarantine had been
blocking stream-control subscription handlers, leaving infrastate.skills and
infrastate.scenarios in their loading initial state. The core subscription
guard now keeps these control events on the stream plane instead of treating
them as Yjs writes.
The same checkpoint identified the next infrascope pressure source:
/api/node/control-plane/projections/overview was rebuilding a roughly 1.8 MB
payload in about 3.4 seconds because overview rows embedded full object
details, especially member/device actual_state. The current slice keeps
overview rows as compact route references and leaves full object state behind
inspector/detail streams. The follow-up core slice makes the overview API
compact by default: first-paint reads no longer serialize top-level objects,
heavy details, or the duplicated representations.operator; explicit
mode=full remains available for debugging. A .40/.30 memory checkpoint
then showed the compact response was smaller but still cost about 3 seconds to
build because the shared control-plane object cache was timestamped before the
expensive build; when the build exceeded the 1 second TTL, the cache entry was
stale on arrival. The cache is now stamped after build completion and protected
by per-webspace build locks so compact Overview/API and direct stream snapshot
bursts reuse the same materialized model instead of rebuilding it in parallel.
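The cache fix described above (stamp the entry after the build completes, and let concurrent same-webspace callers wait behind one builder) can be sketched as follows. This is a minimal illustration, not the AdaOS implementation: BuildCache, its get method, and the injectable clock are hypothetical names chosen for the example.

```python
import threading
import time
from typing import Any, Callable, Dict, Optional, Tuple


class BuildCache:
    """Hypothetical TTL cache that stamps entries after the build finishes."""

    def __init__(self, ttl_s: float = 1.0, clock: Callable[[], float] = time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}
        self._locks: Dict[str, threading.Lock] = {}
        self._locks_guard = threading.Lock()

    def _lock_for(self, key: str) -> threading.Lock:
        # One build lock per webspace key, created on first use.
        with self._locks_guard:
            return self._locks.setdefault(key, threading.Lock())

    def _fresh(self, key: str) -> Optional[Tuple[float, Any]]:
        entry = self._entries.get(key)
        if entry is not None and self._clock() - entry[0] < self._ttl_s:
            return entry
        return None

    def get(self, key: str, build: Callable[[], Any]) -> Any:
        entry = self._fresh(key)
        if entry is not None:
            return entry[1]
        with self._lock_for(key):
            # Another caller may have finished the build while we waited.
            entry = self._fresh(key)
            if entry is not None:
                return entry[1]
            value = build()
            # Stamp after the build so a slow build is not stale on arrival.
            self._entries[key] = (self._clock(), value)
            return value
```

With a 1 second TTL and a 3 second build, stamping before the build would make every entry expired as soon as it is stored; stamping afterwards gives immediate repeats a full TTL window to reuse the result.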
Problem statement:
- The client periodically requests http://127.0.0.1:8777/api/node/reliability/summary.
- The response is large, while most values are unchanged between requests.
- This creates unnecessary local CPU/serialization work, route traffic, and diagnostic noise during the realtime startup window.
- infrastate_skill and infrascope_skill already demonstrate the desired split: compact Yjs state plus heavy webio stream receivers, but each skill implements its own projection helpers and local guard-aware behavior.
- Thin summaries and status cards are a migration bridge for badge/status UI; if they start carrying operator tables, live rows, or diagnostics payloads, they become an accidental replacement for Yjs/stream data and must be treated as a design defect.
- The 2026-05 Yjs stability work showed that stream variables are the right route for high-churn operator data, but streams still need explicit first-paint, snapshot-on-subscribe, dedupe, freshness, rate, payload, and fanout rules.
- A noisy skill must become visible as a design defect through guard logs and quarantine. The target is not a runtime controller that invents routes; the LLM/developer owns the data-route decision and reviews it as part of skill design.
Design direction:
- Treat monitoring as a materialized status plane, not as repeated full snapshot construction.
- Use small status cards for stable operator summaries, stream receivers for live variables and warm/cold details, and explicit debug endpoints/tools for raw full diagnostics.
- Treat statusPlane as a compact index over the declared routes, never as route: status; manifests and reviews must reject status/statusPlane as a browser data route.
- Treat primary Yjs as bootstrap/control state: interface shape, small current status, selected ids, degraded/quarantine badges, and subscription metadata.
- Keep stream data bounded and recoverable: replace-mode variables for current state, append-mode receivers only for true tails, stable ids, seq/updated_at, dedupe keys, and honest initial/snapshot semantics.
- Give hot transport/session events such as browser.session.changed, route reconnects, YWS open/close, and guard/quarantine transitions their own debounce/budget before they become operator status.
- Make guards observable but not architectural owners: Yjs guard protects the primary document, stream guard protects publish/snapshot/fanout pressure, and both write logs and quarantine context that a future LLM repair loop can inspect.
- Make status-card compactness observable: oversized cards should not hide the overload by moving it out of Yjs/stream, and soak reports must include the compact-boundary counters.
- Make the pattern reusable for current and future skills.
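As an illustration of the "bounded and recoverable" stream rule above, the sketch below shows one way a replace-mode variable could coalesce unchanged payloads, enforce a payload budget, and let receivers reject stale messages by sequence number. ReplaceVariable, accept, and the reason strings are hypothetical and are not the SDK's stream_variable_publish(...) helper.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Any, Dict, Optional


def fingerprint(value: Any) -> str:
    """Stable hash of the payload; key order does not change the result."""
    raw = json.dumps(value, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


@dataclass
class ReplaceVariable:
    """Hypothetical replace-mode variable: only the latest value matters."""

    var_id: str
    max_payload_bytes: int
    seq: int = 0
    last_fingerprint: Optional[str] = None

    def publish(self, value: Any, updated_at: float) -> Dict[str, Any]:
        size = len(json.dumps(value, default=str).encode("utf-8"))
        if size > self.max_payload_bytes:
            # Over budget: report the suppression instead of publishing.
            return {"published": False, "reason": "payload_too_large", "bytes": size}
        fp = fingerprint(value)
        if fp == self.last_fingerprint:
            # Unchanged payloads are coalesced and do not consume a sequence number.
            return {"published": False, "reason": "unchanged"}
        self.seq += 1
        self.last_fingerprint = fp
        return {
            "published": True,
            "id": self.var_id,
            "value": value,
            "seq": self.seq,
            "updated_at": updated_at,
            "fingerprint": fp,
        }


def accept(current_seq: int, message: Dict[str, Any]) -> bool:
    """Receiver side: keep a message only if it is newer than what is shown."""
    return message["seq"] > current_seq
```

Because each published message carries the complete current value, a reconnecting receiver can recover from a single snapshot instead of replaying a tail.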
Execution order:
- Lock the YJS|Stream data-route contract and guard visibility in core/SDK.
- Add shared status-card and stream-variable helpers on top of that contract, with an explicit guardrail that status cards point to data routes instead of becoming one.
- Move core-owned inventory, health, quarantine, lifecycle, and operation details behind stable API/MCP contracts as tracked by RCMS-007.
- Finish core guard observability and hot-event budgeting before changing skill behavior.
- Use current infrastate_skill, browsers_skill, and infrascope_skill behavior as deliberate pressure fixtures while the core surfaces mature: do not quiet those skills just to make a soak green until the core can survive, attribute, throttle/block/quarantine, and log retry/TTL context.
- Convert infrastate_skill from broad local projection helpers to thin presentation over those contracts.
- Convert browsers_skill and infrascope_skill after the shared protection path is proven.
- Re-run browser-load/Yjs stability soaks and record whether the YJS indicator stays stable under Mobile and multi-browser load.
Tasks
STATUS-000: Lock the YJS|Stream data-route contract
Status: in progress.
Progress: 92%.
Purpose:
Establish the preparatory core/SDK boundary before converting heavy operator skills. Skill authors and LLM agents choose routes at design time; runtime guards enforce safety and explain failures.
Actions:
- [x] Define a small data-route schema for browser-facing surfaces: surface, route, owner, first_paint, recovery, update_source, budget, and guard_visibility.
- [x] Add manifest/schema guidance for declaring Yjs projections separately from stream receivers and details tools.
- [x] Add WebUI receiver schema metadata for route, stream budget, snapshot policy, freshness fields, and guard visibility.
- [x] Preserve WebUI receiver route/budget/guard metadata in the compact materialized data.webio runtime contract.
- [x] Expose stream receiver route metadata in router guard diagnostics so logs and owner-guard policy can say which skill, surface, route, and receiver created pressure.
- [x] Extend the same route metadata into ProjectionService/Yjs projection diagnostics.
- [x] Define stream-variable delivery semantics in the ABI: replace vs append, snapshot-on-subscribe, freshness/TTL, duplicate suppression, stale-event rejection, maximum payload, maximum publish rate, and maximum fanout.
- [x] Extend guard diagnostics to cover both Yjs and stream routes with common fields: owner, webspace, receiver/path, budget, observed pressure, suppression count, quarantine TTL, and correlation/generation id.
- [x] Enforce declared receiver budget.maxPayloadBytes in the router stream guard and pass budget, route, snapshot policy, and guard visibility into owner-guard policy.
- [x] Add per-receiver stream guard counters for attempted, published, suppressed, throttled, fanout, payload bytes, last reason, route surface, and declared budget.
- [x] Expose stream guard counters through reliability full snapshot, compact summary, and adaos node reliability.
- [x] Add receiver-scoped eventbus counters for stream control pressure: incoming, queued, superseded, and dropped webio.stream.snapshot.requested / webio.stream.subscription.changed work.
- [x] Add contract tests proving a skill can expose a status/card plus stream variables without writing broad primary-doc Yjs branches.
- [x] Update LLM skill templates and review checklist so every new browser-facing skill includes a route plan before implementation.
- [x] Add the first SDK helper for bounded replace-mode stream variables with id, value, seq, updated_at, fingerprint, and optional ttl_ms.
- [x] Keep status/statusPlane out of the route enum so manifests cannot declare the status registry as a browser data route.
Human verification:
- In a browser-facing skill, add data_routes to skill.yaml and stream budget/snapshotPolicy/guardVisibility to webui.json, then run adaos skill validate <skill>. The manifest should validate without any runtime behavior change.
- Intentionally set route: magic_runtime_autoroute or budget.maxPayloadBytes: 0; validation should fail and point to the schema violation.
- Intentionally set route: status; validation should fail, because status cards may reference Yjs/stream/details routes but are not a data route.
- Set a low receiver budget.maxPayloadBytes, rebuild the webspace, publish a larger stream payload, and confirm logs/guard diagnostics include receiver, owner, surface, route, budget, and quarantine retry context.
- Inspect webio_stream_guard_snapshot(...) from a local Python/debug context after stream activity; the row for the receiver should show attempted, published or suppressed totals, fanout, last reason, and declared budget.
- Run adaos node reliability after stream activity. The output should include webio_stream_guard, webio_stream_guard.top, eventbus, and eventbus.webio_control.top; for infrastate bursts the top control row should identify the receiver, source, incoming, queued, superseded, and dropped counts.
- Trigger a skill-owned Yjs projection under pressure and then run adaos node reliability. The yjs_pressure.last line should include the projection route kind and surface/slot, so the noisy scope.slot can be mapped back to the skill route plan.
- Request GET /api/node/reliability/summary?webspace_id=<id> under Yjs or stream pressure. statusPlane.cards should include guard:yjs_pressure, guard:webio_stream, and/or guard:webio_stream_control with guardRef owner, receiver/path, observed pressure, budget, suppression/coalescing counters, and quarantine fields where present.
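The validation failures listed above can be pictured with a small checker. This is a sketch under stated assumptions, not the validator behind adaos skill validate: the allowed route values (yjs, stream, details) are inferred from the "Yjs vs stream vs details" wording, and validate_data_route is a hypothetical name.

```python
from typing import Any, Dict, List

# Assumed route enum; the tracker only states that status/statusPlane are excluded.
ALLOWED_ROUTES = {"yjs", "stream", "details"}

# Field names taken from the data-route schema action item.
REQUIRED_FIELDS = (
    "surface", "route", "owner", "first_paint",
    "recovery", "update_source", "budget", "guard_visibility",
)


def validate_data_route(entry: Dict[str, Any]) -> List[str]:
    """Return human-readable violations; an empty list means the entry is valid."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in entry]
    route = entry.get("route")
    if route is not None and route not in ALLOWED_ROUTES:
        errors.append(f"route '{route}' is not one of {sorted(ALLOWED_ROUTES)}")
    budget = entry.get("budget")
    if isinstance(budget, dict):
        max_bytes = budget.get("maxPayloadBytes")
        if max_bytes is not None and (not isinstance(max_bytes, int) or max_bytes <= 0):
            errors.append("budget.maxPayloadBytes must be a positive integer")
    elif budget is not None:
        errors.append("budget must be a mapping")
    return errors
```

Keeping status out of the allowed set is what turns "status cards are not a data route" from a review guideline into a manifest-level failure.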
Next steps:
- Use those helpers to prepare the infrastate_skill data-route plan before moving active variables out of Yjs.
- Start the shared status-card contract and SDK helpers so infrastate_skill can migrate without growing another local projection framework.
STATUS-001: Define the shared status card contract
Status: in progress.
Progress: 95%.
Target shape:
- A status card has stable identity: id, owner, kind, scope, and optional webspace_id.
- A status card has operator-facing state: status, summary, severity, updated_at, ttl_ms, and optional incident_id.
- A status card has change tracking: version, fingerprint, and changed_at.
- A status card can point to details without embedding them: details_ref.kind, details_ref.receiver, details_ref.path, or details_ref.tool.
- A status card can identify the data route backing its details: route.kind, route.receiver, route.path, route.snapshot_policy, and optional guard_ref.
- A status card stays compact. It may contain a short summary, freshness, status, guard context, and references, but not live rows, inventories, operation tables, logs, or diagnostic tails.
Actions:
- [x] Define status values and normalization rules shared with CanonicalStatus.
- [x] Define JSON schema or typed dataclass for status cards.
- [x] Define staleness semantics when ttl_ms expires.
- [x] Define how cards map to incidents and active warnings.
- [x] Define how status cards reference stream variables and detail tools without embedding live rows or diagnostic tails.
- [x] Define compact degraded/quarantine card shape for Yjs and stream guard states.
- [x] Document examples for core, infrastate_skill, infrascope_skill, and a future third-party skill.
- [x] Add compact-boundary diagnostics so oversized status cards are visible through registry/thin-summary diagnostics instead of silently becoming a new transport.
Human verification:
- In a skill handler, call publish_status(...) with status="ready" and ttl_ms=30000; a registered StatusRegistry should expose an online/info card with stable fingerprint and version=1.
- Change only updated_at; the registry should keep the same version. Change status or summary; the version should increment.
- Publish a deliberately oversized card in a local test/debug registry; registry diagnostics should increment oversizedCardTotal and record the offending card id, owner, scope, and observed bytes.
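The versioning rule in the second check (a changed updated_at keeps the version, a changed status or summary increments it) follows from fingerprinting only the stable fields. The sketch below illustrates that rule with plain dictionaries; MiniRegistry and card_fingerprint are hypothetical stand-ins, not the StatusRegistry API.

```python
import hashlib
import json
from typing import Any, Dict

# Volatile fields named in the tracker; they must not affect the fingerprint.
VOLATILE_FIELDS = {"updated_at", "_age_s", "_ago_s"}
# Bookkeeping fields maintained by the registry itself.
DERIVED_FIELDS = {"version", "fingerprint", "changed_at"}


def card_fingerprint(card: Dict[str, Any]) -> str:
    stable = {
        k: v for k, v in card.items()
        if k not in VOLATILE_FIELDS and k not in DERIVED_FIELDS
    }
    raw = json.dumps(stable, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]


class MiniRegistry:
    """Hypothetical in-memory registry keyed by card id."""

    def __init__(self) -> None:
        self._cards: Dict[str, Dict[str, Any]] = {}

    def publish(self, card: Dict[str, Any]) -> Dict[str, Any]:
        fp = card_fingerprint(card)
        previous = self._cards.get(card["id"])
        if previous is not None and previous["fingerprint"] == fp:
            # Unchanged content: refresh freshness only, keep version and changed_at.
            stored = {**previous, "updated_at": card.get("updated_at")}
        else:
            version = 1 if previous is None else previous["version"] + 1
            stored = {
                **card,
                "fingerprint": fp,
                "version": version,
                "changed_at": card.get("updated_at"),
            }
        self._cards[card["id"]] = stored
        return stored
```

Separating updated_at (freshness) from changed_at (last meaningful change) lets a consumer distinguish "still alive" from "something happened" without comparing payloads.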
STATUS-002: Add a materialized status registry/service
Status: completed.
Progress: 100%.
Expected behavior:
- Producers publish small cards into an in-memory materialized registry.
- The registry deduplicates unchanged cards by fingerprint.
- The registry increments versions only on meaningful changes.
- The registry exposes cheap reads for thin UI summaries.
- The registry emits changed events for stream/push consumers.
Actions:
- [x] Add a core status registry service.
- [x] Add per-card fingerprinting that ignores volatile fields such as updated_at, _age_s, and _ago_s.
- [x] Add TTL/staleness sweep.
- [x] Add compact registry diagnostics: card count, changed count, stale count, and last publish latency.
- [x] Add compact-boundary diagnostics: max card budget, observed max bytes, oversized card total, and last oversized card identity.
- [x] Add unit tests for dedupe, versioning, TTL expiry, and owner scoping.
- [x] Wire the registry into API/server bootstrap and expose a read endpoint.
Human verification:
- Publish a card through adaos.sdk.status.publish_status(...), then request GET /api/node/status/cards?webspace_id=<id>. The response should include source=api.node.status.cards, diagnostics.cardCount, and the compact card.
- Request GET /api/node/reliability/summary?webspace_id=<id> and verify the response still omits full runtime/model payloads while including statusPlane.cards.
- Confirm statusPlane.diagnostics.oversizedCardTotal stays 0 in normal browser runs; any nonzero value is a route-design smell to investigate.
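To make the TTL and compact-boundary checks concrete, here is a minimal sketch that computes staleness and boundary diagnostics over a list of cards. It assumes updated_at is in seconds and recomputes totals from a snapshot, whereas the tracker describes counters incremented by the registry; is_stale, boundary_diagnostics, and the lastOversized key are hypothetical.

```python
import json
from typing import Any, Dict, List


def is_stale(card: Dict[str, Any], now_s: float) -> bool:
    """A card is stale once ttl_ms has elapsed since updated_at (assumed seconds)."""
    ttl_ms = card.get("ttl_ms")
    if ttl_ms is None:
        return False  # no TTL declared, so the card never expires
    return (now_s - card["updated_at"]) * 1000.0 > ttl_ms


def boundary_diagnostics(cards: List[Dict[str, Any]], max_card_bytes: int) -> Dict[str, Any]:
    """Measure serialized card sizes against the compact-boundary budget."""
    sizes = [(len(json.dumps(c, default=str).encode("utf-8")), c) for c in cards]
    oversized = [(size, c) for size, c in sizes if size > max_card_bytes]
    diag: Dict[str, Any] = {
        "maxCardBytes": max_card_bytes,
        "maxCardBytesObserved": max((s for s, _ in sizes), default=0),
        "oversizedCardTotal": len(oversized),
    }
    if oversized:
        size, card = oversized[-1]
        # Identity of the offending card, as asked for in the verification steps.
        diag["lastOversized"] = {
            "id": card.get("id"),
            "owner": card.get("owner"),
            "scope": card.get("scope"),
            "bytes": size,
        }
    return diag
```

A nonzero oversizedCardTotal in this picture points at the card that is being used as a payload route, which is the signal the soak reports are meant to surface.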
STATUS-003: Add skill-facing SDK helpers
Status: in progress.
Progress: 84%.
Expected API:
- publish_status(...) publishes one card.
- publish_status_many(...) publishes a small batch.
- publish_status_stream(...) binds a card to an existing webio stream receiver.
- publish_stream_variable(...) or equivalent helper publishes a bounded replace-mode live variable with freshness, sequence, and fingerprint metadata.
- Helpers normalize status tokens, compute fingerprints, and preserve skill/handler ownership.
Actions:
- [x] Add adaos.sdk.status or equivalent SDK module.
- [x] Preserve current skill identity in status ownership metadata.
- [x] Provide helpers for details_ref pointing to webio stream receivers.
- [x] Provide receiver helpers that coalesce unchanged payloads, attach seq/updated_at, enforce declared budgets, and surface stream-guard suppressions.
- [x] Provide a shared debounce/budget helper for hot event-to-status paths, starting with browser.session.changed, route reconnect, YWS open/close, and quarantine transitions.
- [x] Add SDK projection event-pressure diagnostics so dirty refreshes expose per-topic requested, started, coalesced, no-dirty, superseded, and dropped counters before skill-specific optimization hides the pressure source.
- [x] Add tests showing a skill can publish status without touching Yjs or rebuilding a full snapshot.
- [x] Add migration notes for skill authors.
Human verification:
- In a local debug context, create HotEventBudget(debounce_ms=1000, window_ms=10000, max_events=5) and call admit("browser.session.changed", key="<webspace>:<device>") repeatedly. The first call should be admitted, close repeats should return reason=debounce, and sustained bursts should return reason=budget_exceeded.
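The admit behavior described in this check can be modeled with a per-key sliding window. The class below is a hypothetical sketch named HotEventBudgetSketch to keep it distinct from the real HotEventBudget; the return shape, the "admitted" reason string, and the choice to count only admitted events against the budget are assumptions.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, Tuple


class HotEventBudgetSketch:
    """Hypothetical debounce plus sliding-window budget, keyed by (event, key)."""

    def __init__(
        self,
        debounce_ms: int,
        window_ms: int,
        max_events: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._debounce_s = debounce_ms / 1000.0
        self._window_s = window_ms / 1000.0
        self._max_events = max_events
        self._clock = clock
        self._admitted: Dict[Tuple[str, str], Deque[float]] = defaultdict(deque)

    def admit(self, event: str, key: str = "") -> Dict[str, object]:
        now = self._clock()
        history = self._admitted[(event, key)]
        # Forget admissions that have left the sliding window.
        while history and now - history[0] >= self._window_s:
            history.popleft()
        if history and now - history[-1] < self._debounce_s:
            return {"admitted": False, "reason": "debounce"}
        if len(history) >= self._max_events:
            return {"admitted": False, "reason": "budget_exceeded"}
        history.append(now)
        return {"admitted": True, "reason": "admitted"}
```

Keying the history by event and key means one noisy device does not consume the budget of another device in the same webspace.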
STATUS-004: Convert infrastate_skill to the shared status/data-route plane
Status: in progress.
Progress: 32%.
Current useful pattern and target:
- The current transitional implementation projects compact but still broad UI data into infrastate.snapshot.
- Target Yjs content is smaller: interface/bootstrap state, selected ids, small degraded/quarantine badges, and the current receiver/subscription list.
- High-churn sections use stream receivers such as infrastate.operations.active, infrastate.realtime, infrastate.yjs.load_mark, and infrastate.core_update_diagnostics.
- Projection helpers already perform fingerprinting and rate limiting, but the logic is local to the skill.
- Compact status projection is allowed during warn/throttle pressure so the widget can refresh first-paint status, while block still suppresses Yjs writes and streams/details remain the route for large sections.
- Operator-facing variables should become stream-backed rows/cards with bounded first-paint and snapshot-on-subscribe behavior, while raw evidence stays in diagnostics streams, detail tools, disk snapshots, or 360log.
Actions:
- [x] Write the infrastate data-route plan before code changes, listing every widget, modal section, current stream receiver, Yjs branch, detail tool, and expected budget.
- [x] Add skill.yaml:data_routes for current browser-facing infrastate surfaces without changing runtime behavior.
- [x] Add webui.json stream receiver budget, route, and guard visibility metadata for current infrastate.* receivers.
- [x] Preserve the YJS|Stream pressure split in infrastate_skill: block stops Yjs projection, throttle uses the longer Yjs projection interval, and stream snapshots continue to publish through the stream guard.
- [x] Keep get_snapshot(project=true) from starving the widget under warn/throttle: admit compact Yjs status projection into the existing throttled projection path, but continue to suppress on block.
- [x] Bound browser.session.changed at the core EventBus level before skill-specific optimization, preserving incoming counters while superseding stale queued handler work by (event, webspace, device).
- [ ] Shrink primary Yjs usage to minimal bootstrap/control state and remove variable/diagnostic tables that can be served by streams or details.
- [ ] Move current operator variables to replace-mode stream receivers with stable ids, fingerprints, freshness, and snapshot-on-subscribe semantics.
- [ ] Keep append-mode streams only for true event/log tails with explicit maxItems, truncation, and duplicate suppression.
- [ ] Add dedicated debounce/budget handling for browser.session.changed, YWS open/close/reconnect, route pressure, and guard/quarantine events before they update operator status.
- [ ] Identify infrastate status cards: runtime, route/realtime, Yjs, operations, core update, marketplace, and skill/scenario registry.
- [ ] Publish those cards through the shared SDK helpers.
- [ ] Keep existing stream receivers as details_ref targets.
- [ ] Remove or reduce duplicated local projection bookkeeping where the shared helper covers it.
- [ ] Confirm existing infrastate UI still receives current streams.
- [ ] Add regression tests around unchanged snapshot/card dedupe, stream resubscribe recovery, and guard-visible suppression/quarantine.
Human verification:
- Run adaos skill validate infrastate_skill; manifest and webui.json metadata should validate without behavior changes.
- Under synthetic Yjs policy_state=throttle, the first compact Yjs projection may write, close repeats are rate-limited, and stream snapshots still publish.
- Under synthetic Yjs policy_state=block, get_snapshot(project=true) should return the HTTP snapshot but not write compact Yjs state.
- Open [homepoint] Infrastructure State; installed skills/scenarios should still first paint from initialState and then fill from infrastate.skills / infrastate.scenarios streams.
- Request GET /api/node/reliability/summary?mode=thin&webspace_id=desktop; statusPlane.diagnostics.oversizedCardTotal should remain 0 during normal use.
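The throttle and block checks above describe a small decision table: block stops Yjs writes, throttle stretches the Yjs projection interval, and stream snapshots publish in every state. The sketch below encodes that table; the function name, the dataclass, and the interval values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProjectionDecision:
    write_yjs: bool          # may the compact status be written to the primary doc?
    yjs_interval_s: float    # minimum spacing between Yjs projections
    publish_streams: bool    # may stream snapshots still be published?


def decide(
    policy_state: str,
    base_interval_s: float = 1.0,   # assumed default, not taken from the tracker
    throttle_factor: float = 5.0,   # assumed stretch factor under throttle
) -> ProjectionDecision:
    if policy_state == "block":
        # Yjs writes stop; streams remain the route for current variables.
        return ProjectionDecision(False, base_interval_s, True)
    if policy_state == "throttle":
        # Yjs writes continue, but on a stretched interval.
        return ProjectionDecision(True, base_interval_s * throttle_factor, True)
    # Normal and warn states project at the base interval.
    return ProjectionDecision(True, base_interval_s, True)
```

The stream column never changes in this table, which reflects the checkpoint finding that Yjs pressure should not starve stream-plane delivery.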
STATUS-005: Convert infrascope_skill to the shared status plane
Status: in progress.
Progress: 38%.
Current useful pattern:
- Compact durable UI data is projected into infrascope.snapshot.
- High-churn and large sections use receivers such as infrascope.overview.*, infrascope.inventory.*, infrascope.operations.active, and infrascope.inspector.*.
- It already maintains last-good snapshots and per-webspace projection fingerprints locally.
Actions:
- [x] Capture .30 baseline: overview projection around 1.8 MB / 3.4 seconds; health_strip details dominated by member/device actual_state.
- [x] Stop embedding heavy object details in canonical overview rows; use details_ref / object ids so Overview can remain a compact index.
- [x] Make the control-plane Overview API compact by default, omitting first-paint objects, heavy details, and duplicated representations.operator; keep mode=full as an explicit debug route.
- [x] Strip legacy heavy details fields in infrascope_skill overview rows before they enter stream/tool payloads.
- [x] Declare first infrascope_skill data routes and receiver budgets for summary, overview streams, inventory, operations, and inspector streams.
- [x] Route webio.stream.snapshot.requested for overview, inventory, operations, and inspector receivers through per-receiver compact builders before falling back to the monolithic snapshot cache.
- [x] Fix the core control-plane object cache so slow builds are cached from completion time and concurrent same-webspace requests coalesce behind one builder.
- [ ] Identify infrascope status cards: overview, active incidents, inventory, browser/runtime state, registry, and operations.
- [ ] Publish cards through the shared SDK helpers.
- [ ] Keep overview/inventory/inspector streams as details targets.
- [ ] Ensure inspector data stays lazy and is not embedded in status cards.
- [x] Add tests proving overview/inventory stream snapshots can publish without building a full Infrascope snapshot.
- [ ] Add byte-size instrumentation for compact overview sections and direct receiver builders.
Human verification:
- Open [homepoint] Infrascope; Overview should first paint from compact Yjs summary and fill health/incidents/operations from streams without a multi-MB overview payload.
- Recheck .30: GET /api/node/control-plane/projections/overview should be materially smaller than the 2026-05-19 baseline. Use mode=full only when intentionally debugging raw canonical object state.
- Reopen Overview after reconnect; infrascope.overview.* streams should fill without waiting for the full infrascope.snapshot rebuild.
- Repeat GET /api/node/control-plane/projections/overview?webspace_id=desktop several times in one second; after the first build, immediate repeats should reuse the control-plane cache instead of taking the full multi-second path.
- During/after a managed core update, .adaos/state/core_update/status.json should move from restarting/launch to succeeded/validate once the runtime API is ready on the target slot.
STATUS-005B: Convert browsers_skill after core guard observability
Status: planned.
Dependency:
- Start after shared guard status cards, hot-event budgeting, and first infrastate/infrascope observations are available. Until then, browsers_skill remains a useful pressure source for proving the core diagnostics rather than hiding the problem inside the skill.
- Treat current browser/session churn as a load-test fixture. If it triggers a guard policy, first record whether core status, logs, and diagnostics identify owner, route, receiver/path, retry, TTL, and quarantine context; optimize the skill only after that evidence is sufficient.
Current useful pattern and target:
- The checkpoint on .30 showed browser.session.changed pressure can participate in Yjs owner-guard quarantine with both browsers_skill and infrastate_skill.
- Target Yjs content is limited to device/session bootstrap state, selected device ids, small auth/degraded badges, and current subscription/control state.
- Browser session churn, access-link updates, device registry details, and per-device diagnostics should become bounded stream variables or lazy detail reads.
Actions:
- [ ] Inventory current browsers_skill Yjs branches, stream receivers, action responses, and event subscriptions.
- [ ] Add a data-route plan for browser.session.changed, device rename/adopt, access-link changes, and session/auth state.
- [ ] Apply shared HotEventBudget to browser session churn before publishing operator status or stream variables.
- [ ] Identify status cards: browser runtime, session/auth, access-link registry, device registry, and guard pressure.
- [ ] Keep raw session churn in diagnostics streams/logs and publish only coalesced operator state.
- [ ] Add two-browser regression tests proving repeated session changes do not rebuild broad Yjs state or shake the status indicator.
STATUS-006: Make /api/node/reliability/summary thin and versioned
Status: in progress.
Progress: 55%.
Expected behavior:
- Default response is small and backed by the materialized status registry.
- Full diagnostic snapshot requires ?full=1 or a separate debug endpoint.
- The endpoint supports ETag or explicit version checks.
- Unchanged polling returns 304 Not Modified or a minimal unchanged response.
- Thin mode exposes status-card compact-boundary counters, not embedded replacement payloads.
Actions:
- [ ] Measure current response size and polling frequency.
- [x] Expose registry-backed statusPlane data inside the compatibility summary response and through /api/node/status/cards.
- [x] Add derived Yjs/stream guard cards to statusPlane so thin status clients can see pressure without requesting full diagnostics.
- [x] Add mode=thin or make thin mode the default with a compatibility flag for full mode.
- [x] Add ETag/If-None-Match support or since_version.
- [x] Keep a migration-safe full snapshot path for existing debug tools.
- [x] Add tests for unchanged response behavior and full-mode compatibility.
- [x] Include status-card compact-boundary diagnostics in thin mode so soaks can detect accidental statusPlane data transport growth.
Human verification:
- Request GET /api/node/reliability/summary?mode=thin&webspace_id=desktop. The response should contain mode=thin, statusPlane, ETag, and X-AdaOS-Summary-Mode: thin, without hubRootHardening or other compatibility diagnostic blocks.
- statusPlane.diagnostics.oversizedCardTotal should be 0 and maxCardBytesObserved should remain well below maxCardBytes during normal badge/status operation.
- Repeat the same request with If-None-Match: <etag from the first response>. If status cards are unchanged, the API should return 304 Not Modified.
- Request GET /api/node/reliability/summary?mode=full&webspace_id=desktop when a debug panel needs the compatibility summary.
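The thin-mode conditional request flow can be sketched as a pure function over the current cards. This is an illustration under assumptions, not the endpoint's code: the ETag is derived here from card id, version, and fingerprint, and If-None-Match is compared as a single exact value (real HTTP allows lists and weak validators). summary_etag and thin_summary are hypothetical names.

```python
import hashlib
from typing import Any, Dict, List, Optional, Tuple

Response = Tuple[int, Dict[str, str], Optional[Dict[str, Any]]]


def summary_etag(cards: List[Dict[str, Any]]) -> str:
    """Derive a validator from card identity and change tracking only."""
    parts = sorted(f"{c['id']}:{c['version']}:{c['fingerprint']}" for c in cards)
    digest = hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()[:20]
    return f'"{digest}"'


def thin_summary(cards: List[Dict[str, Any]], if_none_match: Optional[str]) -> Response:
    etag = summary_etag(cards)
    headers = {"ETag": etag, "X-AdaOS-Summary-Mode": "thin"}
    if if_none_match is not None and if_none_match == etag:
        # Nothing meaningful changed: no body is built or sent.
        return 304, headers, None
    return 200, headers, {"mode": "thin", "statusPlane": {"cards": cards}}
```

Because the validator ignores volatile fields such as updated_at, a card that is only refreshed does not invalidate the client's cached summary.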
STATUS-007: Move client monitoring from polling to push/delta
Status: in progress.
Progress: 52%.
Expected behavior:
- Client bootstraps from a small status snapshot.
- Client receives status changes through a stream or existing realtime channel.
- Client requests full details only when a panel/inspector is opened.
- Client must not treat statusPlane or thin summary as a replacement source for live variables, tables, inventory rows, or diagnostic tails.
- Active core update transitions (applying, restarting) remain visible even on dev-like stands; only planned/countdown noise is suppressed there.
Actions:
- [x] Identify the current caller(s) polling /api/node/reliability/summary.
- [x] Replace the communication-runtime reliability poll with mode=thin + If-None-Match and fetch mode=full only when status changed or a full runtime snapshot is still needed.
- [ ] Wire existing webio stream receivers as lazy detail sources.
- [x] Add client-side cache keyed by thin-summary ETag.
- [ ] Move badge/status UI to status-card versions once the cards cover all currently used runtime fields.
- [ ] Replace remaining badge/status polling with push/delta once the status stream/realtime channel is available; keep thin polling as the migration bridge, not the final transport.
- [x] Keep active hub restart badges visible in dev runtime while continuing to suppress planned/countdown update chatter.
- [x] Keep supervisor transition fallback probing available after the control events websocket is lost, so a missed restart event can still become an operator-visible informer.
- [x] Deduplicate same-target core update requests during countdown/validation and expose passive-candidate cleanup/reaping state, so manual retries do not schedule a redundant same-commit slot transition or leave confusing defunct runner processes in diagnostics. Same-target requests are now accepted as deduplicated instead of queued, and same-target queued follow-up transitions are dropped after the completed transition is validated.
- [x] Deduplicate direct active-slot same-target update requests before the minimum-update-period guard or planned-update refresh path, so probes/retries against the already active commit cannot create a delayed redundant slot transition.
- [x] Decouple supervisor/autostart/update-control from slot venvs: autostart now resolves the stable root checkout from the shared .env/root context, launches supervisor from the root .venv, refreshes the wrapper before a self-restart, and exposes wrapper_python_is_core_slot in autostart status. The same source rule applies to future watchdog/control-plane helpers: watchdog may observe and restart slot runtimes, but its own wrapper, interpreter, source root, and diagnostics must remain rooted in the stable root checkout/root .venv.
- [x] Resume due planned core updates through the prepare/slot-validation path instead of the legacy countdown-only path. A planned update that wakes after the minimum-update-period guard must still prepare an inactive slot, activate that slot, validate runtime boot, and only then complete root promotion.
- [x] Refuse root-promotion completion when the active slot manifest does not match the update target version. This prevents a false succeeded/validate state that reports a newer target while the runtime is still serving an older slot.
- [ ] Classify member-follow update expiry separately from Yjs/provider failures: if a member reports pending update expired before autostart runner picked it up, hub/status UI must surface stale member-update state and keep the last YWS/provider evidence separate from the member core-update failure.
- [ ] Verify the client no longer requests large summary payloads repeatedly during the first 3 minutes.
Human verification:
- Open the browser dev tools network tab on a connected stand.
- The repeated runtime health probe should call `/api/node/reliability/summary?mode=thin&webspace_id=<id>` with `If-None-Match` after the first response.
- When status cards are unchanged, the response should be `304`; `mode=full` should appear only after status changes, first bootstrap, or explicit debug reads.
- During a core update on a dev-like stand, disconnect/reconnect the browser or watch a natural runtime restart; the UI should show `hub restarting/applying update` when the transition is active, even if the websocket event was missed and the state is learned through fallback probing.
- `adaos autostart status --json` should show `wrapper_python_is_core_slot=false` after the next root-promotion restart; production runtime processes should still report slot `A|B` executables.
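The conditional-probe behaviour described in the verification steps above can be pictured with a small in-memory sketch. It is not the AdaOS endpoint: `make_etag`, `thin_summary_response`, and the hash-based ETag scheme are hypothetical names and assumptions used only to show why an unchanged set of status cards should produce an empty `304`.

```python
import hashlib
import json
from typing import Optional


def make_etag(cards: dict) -> str:
    """Derive a stable ETag from the compact status cards (assumed scheme)."""
    body = json.dumps(cards, sort_keys=True, separators=(",", ":")).encode()
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def thin_summary_response(cards: dict, if_none_match: Optional[str] = None):
    """Return (status, headers, body) for a thin summary probe."""
    etag = make_etag(cards)
    if if_none_match == etag:
        # Cards are unchanged: answer with an empty 304 so polling stays cheap.
        return 304, {"ETag": etag}, b""
    body = json.dumps({"mode": "thin", "cards": cards}, sort_keys=True).encode()
    return 200, {"ETag": etag}, body


cards = {"hub": "ready", "yjs": "green"}
status, headers, body = thin_summary_response(cards)              # first probe
status2, _, body2 = thin_summary_response(cards, headers["ETag"])  # unchanged
cards["hub"] = "restarting"
status3, _, _ = thin_summary_response(cards, headers["ETag"])      # changed
print(status, status2, len(body2), status3)
```

In this sketch the second probe returns `304` with zero body bytes, and the third returns `200` again because the cards changed.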
STATUS-008: Acceptance and observability
Status: in progress.
Progress: 92%.
Acceptance criteria:
- Repeated first-3-minute run shows no high-frequency large `/api/node/reliability/summary` responses.
- Browser attach with `Mobile` and a second browser does not produce sustained red/green YJS indicator flapping from `infrastate` stream or projection work.
- Known noisy skills may trigger warnings, throttling, block, or quarantine during stress; that is acceptable only when the runtime stays usable and the evidence identifies owner, route, policy, retry, and TTL.
- Thin status payload size is bounded and recorded.
- Full details remain available on demand.
- `statusPlane.diagnostics.oversizedCardTotal` remains `0`; if it rises, the offending card is mapped back to its declared Yjs/stream/details route and corrected instead of expanding the status-card schema.
- `infrastate_skill`, `browsers_skill`, and `infrascope_skill` publish status cards through the shared path after their migrations.
- Yjs and stream guard logs show route, owner, receiver/path, suppression counts, and quarantine TTL when limits are hit.
- Existing realtime stability criteria from `Realtime First 3 Minutes` remain green.
- Yjs room bootstrap cancellation is visible as a cancellation and does not continue as an empty-doc seed attempt.
- Dev Browser runtime breadcrumbs from `adaos.runtime_debug.logs.v1` are available to node-side diagnostics as bounded logs, not only in browser localStorage.
- Autostart/supervisor diagnostics reveal whether the always-on control-plane wrapper uses a stable root `.venv` or is accidentally coupled to a runtime slot venv.
- When watchdog is re-enabled, it exposes the same source-path diagnostic and reports false for any slot-bound Python/source check.
- Terminal update success is rejected when the active slot manifest does not match the requested `target_version`; false-positive `succeeded/validate` must become a failed validation with the active manifest attached.
- Stale terminal status from a previous update must not be allowed to complete or fail a fresh active update attempt unless the terminal status itself carries the attempted target or the active slot already matches it.
- Member-follow core update control must enter through the supervisor update contract when autostart is managed, not through the runtime-only `/api/admin/update/start` countdown path.
- Runtime `/api/admin/update/start` must behave as a compatibility shim in managed autostart, forwarding to `/api/supervisor/update/start` before using its legacy countdown-only fallback.
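The two terminal-status criteria above (reject a false success, ignore a stale targetless status) can be read as one small decision function. The sketch below is an illustration under assumptions, not the supervisor's code: `reconcile_terminal_success`, its parameters, and the returned labels are hypothetical, and targets are compared as already-resolved full commit strings.

```python
from typing import Optional


def reconcile_terminal_success(
    attempt_target: str,
    status_target: Optional[str],
    active_slot_commit: str,
) -> str:
    """Classify a terminal `succeeded/validate` status for the current attempt."""
    if active_slot_commit == attempt_target:
        # The active slot really serves the target: success is genuine.
        return "succeeded"
    if status_target != attempt_target:
        # Targetless or foreign status left over from an earlier update:
        # it may neither complete nor fail the fresh attempt.
        return "ignored_stale"
    # The status names this attempt's target, but the slot is still older.
    return "failed:active_slot_target_mismatch"


print(reconcile_terminal_success("bbb", "bbb", "aaa"))
print(reconcile_terminal_success("bbb", None, "aaa"))
print(reconcile_terminal_success("bbb", None, "bbb"))
```

Checking the active slot first keeps a genuine success from being discarded just because the status payload lacks a target.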
Actions:
- [x] Add log/metric for reliability summary mode, response bytes, and unchanged/304 counts.
- [x] Add compact acceptance diagnostics to `/api/node/reliability/summary/metrics`: status registry publish/change/unchanged/card-boundary counters, stream guard attempted/published/suppressed/throttled/fanout counters, stream-control snapshot-requested/queued/coalesced/dropped counters, and merged per-receiver rows.
- [x] Add Yjs owner-guard acceptance diagnostics to the same metrics surface: attempted/allowed/blocked/throttled counters, active quarantine state, quarantine TTL/retry context, denied count, and the last guarded path/tool.
- [x] Add a human-readable acceptance probe: `adaos node reliability-metrics --webspace desktop --receiver <receiver>` prints summary response/cache counters, status registry diagnostics, Yjs owner-guard counters, stream guard counters, stream-control coalescing, and per-receiver pressure rows.
- [x] Add SDK event-pressure counters for dirty projection refreshes so acceptance review can distinguish incoming hot-event load from SDK coalescing/no-dirty reductions.
- [x] Reject false-positive terminal core update success when the active slot does not match the requested target version. Runtime boot finalization and supervisor reconciliation now fail validation instead of completing the attempt as `succeeded`.
- [x] Keep the same guard from overfiring on stale terminal status: supervisor reconciliation now ignores targetless old `succeeded/validate` payloads for a fresh active target instead of borrowing the attempt target and producing a premature `active_slot_target_mismatch`.
- [x] Route member-follow update/cancel/rollback admin calls through `/api/supervisor/update/*` when a managed supervisor is available, falling back to runtime admin only if the supervisor route cannot be reached.
- [x] Add the same supervisor-first compatibility guard to runtime `/api/admin/update/start`, so older callers and root release reconciliation cannot bypass inactive-slot preparation on managed autostart nodes.
- [ ] Add status registry diagnostics to the final soak analysis.
- [ ] Add stream guard diagnostics to the final soak analysis: published, unchanged, coalesced, suppressed, snapshot-requested, and fanout counts by receiver. Reliability now carries the source counters for published/suppressed/fanout and snapshot-requested/queued/superseded/dropped by receiver; the remaining work is to run the soak and record the result. Yjs projection pressure now also reports the last projection route/surface through governance and `yjs_pressure.last`. The acceptance metrics endpoint also documents that skill-side unchanged stream dedupe is not visible to the core router unless the skill publishes that diagnostic; final soak should record `status_registry.unchanged_total` and summary `not_modified_total` as the current unchanged evidence.
- [x] Preserve `asyncio.CancelledError` during Yjs bootstrap instead of treating a cancelled `apply_updates` as an empty persisted document; this keeps update restarts from turning bootstrap timeout into a misleading seed/repair path.
- [x] Add browser-side breadcrumbs for transition visibility: last supervisor transition source, suppression reason, fallback probe URL/result, and current Yjs red reason. The browser runtime-debug cursor now exports compact `supervisor_transition`, `supervisor_transition_suppressed`, `supervisor_transition_probe`, and computed `yjs_status` fields, while the communication runtime records visible/suppressed supervisor transitions and supervisor fallback probe results.
- [x] Add bounded node ingest/export for the browser runtime-debug ring (`adaos.runtime_debug.logs.v1`) and include it in the standard skill/runtime log retrieval path. The client now exports a capped `ui.runtime_debug` tail to `/api/node/ui/diagnostics`, filtering its own diagnostics transport events so the export cannot self-amplify.
- [x] Reduce node-side browser breadcrumb noise: the browser keeps the full runtime-debug ring in localStorage, while node ingest skips normal `http.request` / fast `http.response` polling and keeps Yjs/control events, HTTP errors, slow responses, and tool/snapshot responses.
- [x] Fix the root-routed local HTTP hop for `/api/tools/call`: prefer the current process `ADAOS_RUNTIME_PORT` over persisted stale runtime state, use a `tools/call` timeout budget that fits Root's 60s outer budget, and avoid retrying a read-timed-out POST against a different slot port.
- [x] Bound read-only member snapshot RPC fallback: `get_snapshot` calls routed to a target member now use a shorter default member-link timeout and reuse the unavailable-cache even while the link still appears connected, preventing a slow/offline member from holding the browser control-plane for roughly a minute.
- [x] Make YWS reconnect storms non-destructive: hot browsers and multi-client reconnect storms now record diagnostic pressure and can replace stale sessions, but reconnect pressure does not quarantine a browser or the whole webspace for the full guard cooldown. Active/session limits and auth/policy denials remain hard guards.
- [x] Add a room-bootstrap attempt id to Yjs gateway logs and reliability diagnostics so `room ready timeout`, `stale bootstrap recovery`, `apply_updates cancelled`, and later `room ready` can be correlated without manually stitching timestamps. YWS room acquisition now carries the `yws_attempt_id` into room bootstrap diagnostics; reliability exposes bootstrap attempt/state/step/duration/error plus wait-timeout counters for the selected webspace.
- [x] Add YWS connection-attempt correlation to browser breadcrumbs and server guard logs, including the close code/reason seen by the browser and the server-side guard decision. The runtime now assigns `yws_attempt_id` to each YWS attempt, includes it in `browser.session.changed`, open/close logs, guard reject logs, and `transport.attempts`.
- [x] Add browser identity and client provider-attempt correlation to UI runtime diagnostics. The node-side log now carries browser `device_id`, family, OS, form factor, runtime-debug session/tab ids, and `client_yws_attempt_id` so a red Mobile/Opera tab can be matched to gateway `yws_attempt_id` logs without guessing from timestamps.
- [x] Export a compact browser runtime heartbeat/cursor to node diagnostics: last Yjs provider state, control-WS state, last close reason/code, last export time, and dropped-log counts. The current bounded log export proves failures well, but a quiet stable period should also be visible without inferring it from absence of new log rows. The client now emits `runtime_debug.cursor` to `/api/node/ui/diagnostics` on a bounded heartbeat.
- [x] Export the actual client-computed YJS indicator signal (`yjs.signal`) into the browser runtime-debug cursor, so node-side diagnostics can distinguish provider/sync green from the red/green indicator the operator sees.
- [x] Split YJS indicator truth from lagging status projection truth on the client: when the local provider is connected, synced, and materialized, keep the visible YJS signal green while still showing stale `state-sync` metadata in the diagnostic reason/cursor.
- [x] Add a bounded critical control-plane subscription budget: core update status, hub update status, subnet member link/snapshot/update-result events can pass owner-guard starvation with debounce/window limits, while hot `browser.session.changed` refreshes remain normally governed.
- [ ] Split logical status/control communication from skill communication in core: system-owned status/control events must feed compact status cards and epochs directly, while skills remain consumers/renderers whose Yjs/stream work can be throttled without hiding update/member truth.
- [x] Make realtime sidecar diagnostic/log paths independent of process cwd. Slot switch can remove the runtime cwd while ASGI still serves reliability diagnostics; default `.adaos/diagnostics/...` paths now resolve under `ADAOS_BASE_DIR` instead of calling `Path.cwd()`.
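The last action above swaps a cwd-relative default for one anchored to an environment variable. A minimal sketch of that idea follows; `diagnostics_path` is a hypothetical helper, and the exact directory layout beneath `ADAOS_BASE_DIR` as well as the example file name are assumptions rather than the project's real structure.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def diagnostics_path(name: str, env: Optional[Mapping[str, str]] = None) -> Path:
    """Resolve a diagnostics file under ADAOS_BASE_DIR, never under the cwd."""
    env = os.environ if env is None else env
    base = env.get("ADAOS_BASE_DIR")
    if not base:
        # Fail loudly rather than silently falling back to Path.cwd(),
        # which may have been deleted during slot cleanup.
        raise RuntimeError("ADAOS_BASE_DIR is not set")
    # Assumed layout: <base>/.adaos/diagnostics/<name>
    return Path(base) / ".adaos" / "diagnostics" / name


example = diagnostics_path("sidecar.json", {"ADAOS_BASE_DIR": "/srv/adaos"})
print(example)
```

Passing the environment as a parameter keeps the helper testable without mutating `os.environ`.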
2026-05-19 implementation checkpoint:
- Core YWS observability now has a two-level correlation chain: `yws_attempt_id` for the browser websocket attempt and `yroom-*` bootstrap attempt for room creation. Gateway room snapshots expose the bootstrap state, last step, duration, error, and the YWS attempt that caused the wait.
- `state-sync.bootstrap` in reliability includes the same selected-webspace fields, and hard bootstrap outcomes add a blocker such as `room_bootstrap_timeout:<yroom>/<yws>`.
- Browser runtime-debug cursor now carries the current computed YJS indicator state/reason, browser identity, client YWS provider attempt id, and supervisor transition/probe/suppression breadcrumbs, so the node can explain a red indicator without opening DevTools or guessing which device produced it.
- This improves investigation of YJS Red / red-green flicker without changing the Yjs or stream data route. Remaining acceptance work is a pressure/soak run that records status registry and stream guard diagnostics while known noisy skills remain useful load fixtures.
- `/api/node/reliability/summary/metrics` now includes an `acceptance` block that joins the cheap in-memory status registry, browser stream guard, and bounded stream-control eventbus counters. This keeps final soak evidence in one low-cost endpoint without replacing the Yjs or stream-data route.
- `adaos node reliability-metrics` provides the matching operator-facing view, so final soak notes can quote stable lines instead of pasting large JSON.
- Stand `.30` exposed a rollout-status blind spot: root checkout was at `437c31c2`, but the active runtime slot still served `1987ea1b` while update-status reported the newer target as `succeeded`. Core now treats that active-slot/target mismatch as failed validation; after rollout, re-run update-status before acceptance soaks and require active slot manifest == target version.
2026-05-19 checkpoint:
- Public `/api/node/infrastate/snapshot?webspace_id=desktop` on `.30` returned successfully while public `/api/tools/call` for `infrastate_skill:get_snapshot(project=true)` returned `502`.
- Direct local call to the active runtime on port `8777` succeeded. The public failure showed the hub-route fallback error for stale `127.0.0.1:8778`, which masked the first active-port local hop timing out under the old 2.5s read budget.
- During manual rollout verification, a duplicate update request with short `target_version=a9725121` exposed an update-prepare edge case: the checker treated short SHAs as commit targets but required exact 40-character equality during validation. The core updater now accepts a short SHA only when it is a prefix of the resolved full commit.
- After the route fix, the same public `tools/call` stopped returning `502` but took about 58.5s and returned degraded `target_member_unavailable`. This identified a separate member-link fallback budget issue, not a Yjs issue: hub-route was repaired, but readonly snapshot proxying could still wait for a slow connected member RPC.
- Runtime-debug export is now confirmed on `.30`: `/root/.adaos/logs/service.__ui_runtime__.ui_runtime.log` captured `yjs.provider.connection_close` with `reason=hub_open_ack_timeout`, followed by `yjs.provider.status=disconnected`. This is enough to distinguish a client provider/open-ack problem from server-side Yjs materialization, which was `attached/complete/ready/fresh` at the same time.
- Current conclusion: the working snapshot endpoint is a control-plane fallback and does not prove Yjs health by itself. Server-side reliability/YWS diagnostics must be checked separately; client-side `YJS Red` needs exported runtime-debug breadcrumbs from the browser rather than inference from the snapshot fallback.
- Follow-up investigation found the next YWS-specific amplifier: the session guard could turn browser reconnect loops into long per-client or webspace quarantines. That made the channel red even when server-side materialization was `attached/complete/ready/fresh` and HTTP fallback snapshots still worked. The guard now treats reconnect storms as observable pressure rather than a destructive quarantine; auth/policy denials and active-limit violations still reject the websocket.
- Stand rollout of `0b32b4f` on `.30` validated the corrected YWS behavior: active slot `B`, active runtime `8778`, `active_yws_connections=1`, `recent_open_60s=0`, `storm_detected=false`, `quarantined_total=0`, `incident_total=0`, `reject_total=0`, active client `dev_fcb0c380-64ad-4ed8-905b-9cfead2ca09f`.
- Server-side materialization stayed healthy after the fix: `ready=true`, `snapshot_source=live_ydoc`, no missing branches. Full reliability reported `transportState=attached`, `firstSyncState=complete`, `semanticState=ready`, `freshnessState=fresh`, `fallbackMode=off`, and `yjsPressure.policyState=ok`.
- Local fallback checks on `.30` behaved as intended: `/api/node/infrastate/snapshot?webspace_id=desktop` returned `200` with a compact roughly 12 KB payload. Local `/api/tools/call` returned `200` degraded by `skill_owner_quarantined/write_amplification` rather than route `502`, confirming the remaining degradation is skill/Yjs-owner pressure, not hub-route failure.
- The same rollout exposed a lifecycle/observability gap: repeated same-target update requests can schedule a redundant prepared transition, and an old candidate runner may appear as a defunct process until the supervisor reaps it. That did not leave two listening runtimes, but the status plane should classify it instead of making operators infer safety from `ps`/`ss`.
- Follow-up `.40` checkpoint showed a real member-follow gap, not a Yjs replacement issue: Mediapoint was stuck at `slot A | 2bb279c | restarting` and `adaos autostart update-status` reported `expired` because the pending member update was not picked up before TTL. Manual catch-up to `a710ba6f5fd02265c1f6d3083e79bfb5998fff22` succeeded on slot B.
- After the `.40` catch-up, local member reliability was available on active port `8778`, while hub `.30` reported YWS `active_yws_connections=1`, `storm_detected=false`, `quarantined_total=0`, `transportState=attached`, `firstSyncState=complete`, and `semanticState=ready`. The stale widget therefore came from delayed/stale projection and member-status propagation under pressure, not from replacing Yjs with HTTP fallback.
- Node-side browser breadcrumbs confirmed the missing observability edge: `service.__ui_runtime__.ui_runtime.log` captured the old sequence `1006 -> connected -> materialization.ready`, then became quiet. The new `runtime_debug.cursor` heartbeat closes that gap by publishing the last known browser-side provider/control/materialization cursor even when no new exceptional event is generated.
- Rollout of `4c1806aa70b040db61199707e0b739b244d7af04` reached `.40` `succeeded/validate` and `.30` `succeeded/validate`; `.30` recovered YWS on active runtime port `8778` with `transportState=attached`, `firstSyncState=complete`, `semanticState=ready`, `freshnessState=fresh`, and `fallbackMode=off`.
- The same rollout showed the observed YJS RED window was an update downtime plus heavy disk I/O symptom rather than a YWS guard regression: while bootstrap apply/root promotion was running, the runtime listener was absent or delayed in `jbd2_log_wait_commit`; after the listener came back, YWS opened with correlated `yws_attempt_id`.
- A manual same-target probe after `4c1806aa` exposed a remaining ordering bug: direct same-target `update-start` was caught by `minimum_update_period` before active-slot dedupe, creating a delayed redundant update plan. The probe plans were cancelled on `.30` and `.40`, and the supervisor now checks the active slot manifest first, returning `deduplicated/same_target` without changing the real update cadence.
- Root promotion on `.30` can leave the systemd wrapper path pointing at the root checkout slot (`slot A`) while the production runtime runs from active `slot B`. Hash parity proved it was not the immediate YWS regression source, but the wrapper/venv source is still an architecture defect: the always-on supervisor and future watchdog must survive slot mutation from a stable root checkout/root `.venv`, not from `state/core_slots/slots/*/venv`.
- The control-plane source rule is now encoded in code and architecture docs: `default_spec` prefers the stable root project derived from `ADAOS_SHARED_DOTENV_PATH`, core update commands use that root `.venv`, root promotion refuses to treat a slot repo as the effective root when a stable root checkout is available, refreshes the wrapper before self-restart, and autostart status exposes `wrapper_python_is_core_slot` for stand verification.
- The disabled watchdog is explicitly classified under the same future control-plane rule. Re-enabling it must add a watchdog-specific source diagnostic rather than inheriting a slot venv by convenience.
- Stand checkpoint after scheduling `9c7e221b` exposed a false-positive update completion: both `.30` and `.40` reported `succeeded/validate` for target `9c7e221b`, while active slots were still older (`2ba8453` and `e85fada`). Root `.venv` control-plane launch was correct, but planned-update resume used the countdown-only path and skipped inactive-slot preparation. The supervisor now resumes due planned updates through prepare, and root promotion refuses to complete if the active slot commit does not match the target.
- Rollout verification for `af22dfb48adfb944ee023b2b5882d29b004f7fcf`: `.30` converged through slot `B` and completed `succeeded/validate` with `root promotion restart completed`; runtime boot validated on slot B. `.40` was caught in the old false-success state before the fix could run, so it was manually caught up by preparing slot `A` with `core_update_apply`, syncing the stable root source, activating slot `A`, and restarting `adaos.service`. It now reports active slot `A` at `af22dfb`.
- Rollout verification for `273562eb48cf93c3691b3df850bd5da6e11f4ffa` on `.30` converged through slot `A` and completed `succeeded/validate` with the active slot matching the target. The slot-local CLI exposed `adaos node reliability-metrics`, and the endpoint returned compact acceptance metrics. The stable root CLI was still stale because `src/adaos/apps/cli/commands/node.py` had not been classified as a root operator-control path.
- Root promotion now treats `adaos node` diagnostics as operator-control code and checks effective root parity against the current bootstrap path list, not only the historical manifest `changed_paths`. This prevents acceptance soaks from silently using a stale root CLI after the runtime has moved forward.
- The first `4a9bf507` rollout attempt on `.30` proved the active-slot guard but also exposed an over-eager validation edge: pending/countdown statuses were checked against the current active slot before slot preparation could run. Runtime boot target validation now applies only during boot/validate phases, not while an update is merely planned or counting down.
- Manual `.30` catch-up to `ff9fec41667304f2bd850acd79a50d6b31f44bb4` exposed a second root-promotion lifecycle edge: auto-complete could request another service restart while an earlier `awaiting_root_restart` attempt was still waiting for runtime boot validation. Supervisor now treats a recorded `restart_requested_at` as evidence that the restart is already in flight and waits for validation/timeout instead of restarting in a loop. Both stands keep supervisor under `/root/adaos/.venv/...` and runtime under the active slot venv; thin reliability summaries return `ok=true`.
- Rollout of `dfb49eeeb1f1e04f57595ce268e94018ecec44a2` on `.30` converged through slot `A`; root promotion restarted once and completed `succeeded/validate`. A 180-second browser-attached pressure check kept the runtime alive and root `adaos node reliability-metrics` available, while logs attributed boot pressure to `infrastate_skill`, `infrascope_skill`, `browsers_skill`, `gateway_ws`, YStore, and hub-route flushes.
- The same check found a remaining acceptance blind spot: the log contained `YJS owner quarantined webspace=desktop owner=skill:infrastate_skill`, but `reliability-metrics` only printed status/stream counters. Acceptance metrics now include compact Yjs owner-guard counters and quarantine context alongside stream/status counters, without changing Yjs or stream data routes.
- Rollout of `d34864240d02d61fa0f211747e0928d37334eea9` converged on both `.30` and `.40`. `.30` runs active slot `B` and `.40` runs active slot `A`; both report `succeeded/validate`, `wrapper_python_is_core_slot=false`, root `/root/adaos/.venv/bin/adaos node reliability-metrics` works, and active runtimes still use slot venvs. The new `acceptance.yjs_guard` line was verified on both stands. `.30` reported `owner=skill:infrastate_skill`, `throttled=1`, `quarantined=yes`, `quarantine_total=1`, and TTL countdown; `.40` reported the same owner with `blocked=1`, `throttled=3`, and active quarantine context.
- Post-rollout `.30` 180-second acceptance with attached browsers stayed in `succeeded/validate`; supervisor/root CLI remained stable, runtime RSS moved roughly `464 MiB -> 478 MiB`, and `acceptance.yjs_guard` retained the causal owner/path/tool/TTL evidence while the noisy skill remained quarantined. This is enough to move from core guard/observability hardening into the planned skill optimization phase, while longer plateau/memory soaks remain useful during the skill migrations themselves.
- The docs-only rollout to `578aceb502a99fc94a478237c817297225529f1a` converged on `.30`, but `.40` stayed healthy on the previous runtime and marked the attempt failed as `active slot target mismatch`. The root cause was a supervisor reconciliation edge: an old targetless terminal status could be interpreted as the fresh active attempt by borrowing the attempt target. The fix keeps real target-bearing mismatch rejection, but ignores stale targetless terminal payloads until the update produces its own target-bearing status or the active slot actually matches.
- Rollout of `2c0f6d1a6aa33d4a3c922e476bdfb698d68b6476` validated the fix. `.30` converged automatically to slot `B`. `.40` needed one manual catch-up because its old supervisor failed before it could activate the fix: slot `B` was prepared, activated, root `src/adaos/apps/supervisor.py` was promoted from the slot, and `adaos.service` was restarted. Both stands now report `succeeded/validate`, active `slot B | 2c0f6d1`, live runtime `8778`, and `wrapper_python_is_core_slot=false`. Root `adaos node reliability-metrics` reports Yjs owner/quarantine evidence on both stands, so the core guard/observability milestone is ready for the planned skill optimization work. Longer plateau soaks remain useful during those skill migrations, not as a blocker for starting them.
- The subsequent docs-only rollout to `d798f73f29ec6fd46a2515ee6b82a9e08cfbca9c` exposed a second `.40` lifecycle edge: member-follow posted to the runtime `/api/admin/update/start` route, which writes countdown/pending restart without preparing a slot. The runtime then restarted the current slot and correctly rejected validation because active slot `2c0f6d1` did not match target `d798f73`. The fix keeps runtime admin as fallback, but managed autostart members now try the supervisor update route first so member-follow uses the same prepare/warm-switch contract as local operator updates.
- `.40` then converged to `38dfc15835590bd8599ce09c1eed9211620cb7c4` through the supervisor prepare path after the bad queued SHA probe was cleared and the planned transition was moved forward. `.30` independently confirmed that the remaining bypass can also come from runtime `/api/admin/update/start` callers, so the runtime endpoint is now guarded as a supervisor-first shim on managed autostart.
- Rollout of `d954e540904d429f53195e94a5466b2c63b0e3a0` converged on both `.30` and `.40` to slot `B`. `.30` required clearing the stale runtime-only countdown once, then completed root promotion and restart; `.40` completed via supervisor prepare/launch. A direct same-target probe against runtime `/api/admin/update/start` on both stands returned `_served_by=supervisor`, `deduplicated=True`, and `state=succeeded`, proving the compatibility shim no longer creates countdown-only pending restarts on managed autostart.
- The next browser checkpoint showed both stands on current autostart state, but Dev Browser still displayed stale Infra State cards (`ed61941` on one card, older `d954e54/restarting` on another) and YJS Red. Logs showed the YWS provider did reconnect, while the Yjs owner guard skipped `infrastate_skill` control-plane subscriptions such as `core.update.status`, `hub.core_update.status`, and `subnet.member.snapshot.changed`. This was a protection-side starvation bug, not a slot rollback. Core now lets only those critical status subscriptions through a bounded hot-event budget and records the actual browser YJS indicator as `yjs.signal`.
- The same rollout confirmed the architectural direction: the bounded budget is a stabilization valve, while the target fix is to split `status/control` communication from `skill` communication. A noisy operational skill should be quarantinable without blocking active-slot/member-status truth.
- Stand `.40` exposed an unrelated reliability endpoint crash after slot switch: `realtime_sidecar_diag_path()` called `Path.cwd()`, but the runtime cwd had been removed during slot lifecycle cleanup. This caused ASGI `FileNotFoundError` responses and could mask YJS diagnostics. The path helper now resolves default realtime diagnostics under stable `ADAOS_BASE_DIR`.
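The short-SHA rule recorded in this checkpoint (accept a short target only when it is a prefix of the resolved full commit) is simple enough to sketch. `target_matches` is a hypothetical helper, and the 7-character minimum is an assumption; the tracker only states the prefix requirement.

```python
import re

# Assumed bounds: at least 7 hex characters, at most a full 40-character SHA-1.
_SHORT_OR_FULL_SHA = re.compile(r"^[0-9a-f]{7,40}$")
_FULL_SHA = re.compile(r"^[0-9a-f]{40}$")


def target_matches(requested: str, resolved_full: str) -> bool:
    """True when `requested` is the full commit or a valid short prefix of it."""
    requested = requested.strip().lower()
    resolved_full = resolved_full.strip().lower()
    if not _SHORT_OR_FULL_SHA.match(requested):
        return False
    if not _FULL_SHA.match(resolved_full):
        return False
    return resolved_full.startswith(requested)


full = "a710ba6f5fd02265c1f6d3083e79bfb5998fff22"
print(target_matches("a710ba6f", full))
print(target_matches("a9725121", full))
```

Comparing against the resolved full commit, rather than requiring 40-character equality on the request, is what lets a short target validate without weakening the mismatch guard.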
Human verification:
- Request `GET /api/node/reliability/summary?mode=thin&webspace_id=desktop` and check `X-AdaOS-Summary-Mode`, `X-AdaOS-Summary-Cache`, and `X-AdaOS-Summary-Body-Bytes`.
- Repeat with `If-None-Match`; the `304` response should report `X-AdaOS-Summary-Cache: hit` and body bytes `0`.
- Request `GET /api/node/reliability/summary/metrics` and verify `metrics.modes.thin.not_modified_total` increases during unchanged polling.
- [x] Run a 180-second acceptance with browser attached.
- [x] Run a pressure-fixture soak without first optimizing `browsers_skill`/`infrastate_skill`/`infrascope_skill`; record whether the core survives and whether guard/status/log evidence is sufficient for later skill repair.
- [x] Verify `.30` and `.40` after the next rollout: supervisor process command should use `/root/adaos/.venv/...python` and `adaos autostart status --json` should report `wrapper_python_is_core_slot=false`; active runtimes should still come from the active slot venv.
- [x] Verify after the next rollout that root `/root/adaos/.venv/bin/adaos node reliability-metrics` is available without falling back to the slot-local CLI.
- [x] Verify the stale-terminal-status reconciliation fix on `.30` and `.40`; a docs-only update should either converge or remain active until real transition evidence appears, not fail immediately from an old targetless `succeeded/validate`.
- [x] Verify the member-follow supervisor-route fix on `.30` and `.40`; a member-follow docs-only update should prepare the inactive slot before any runtime restart.
- [x] Verify the runtime admin compatibility shim on `.30` and `.40`; direct `/api/admin/update/start` on managed autostart should return a supervisor response and should not create a countdown-only pending restart.
- [ ] After the critical control-plane subscription budget rollout, open Dev Browser and verify that Infra State converges to the active slot/status even when `infrastate_skill` is blocked or quarantined.
- [ ] Check node-side UI runtime logs for `yjs.signal`, `browser_identity`, and `client_yws_attempt_id` in `runtime_debug.cursor`; the cursor should match the visible red/green indicator and identify the exact browser device.
- [ ] After the local-document YJS indicator split rollout, verify Dev Browser, Mobile, and Opera/macOS report `local-doc=synced:ready` when the document is healthy, even if `state-sync` status metadata is stale.
- [ ] Verify `.40` reliability summary no longer throws `FileNotFoundError` from `realtime_sidecar_diag_path()` after the next rollout.
- [ ] Run a focused `infrastate` two-browser soak after conversion and capture Yjs owner pressure, stream pressure, route pressure, and quarantine counters.
- [ ] Record payload size reduction and polling reduction in this tracker.
- [ ] Close this goal only after logs confirm no large repeated monitoring responses during normal UI operation.
TEST-001: Make test_infrastate_skill_projection.py hermetic
Status: planned.
Observed while debugging stream-backed modals:
- The full file can fail when marketplace cache/remote-probe defaults leak into tests that expect mocked remote registry data.
- `_project_async` stream assertions can be affected by live/local Yjs pressure guard state unless the guardrail is explicitly mocked.
Actions:
- [ ] Clear marketplace caches and set remote-probe flags inside affected tests.
- [ ] Mock `_active_noncritical_stream_guardrail` in projection tests that assert exact stream publications.
- [ ] Re-run the full file as part of the stream modal regression suite.
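A minimal sketch of the guardrail-mocking action, assuming the guardrail is a method that can be patched with `unittest.mock`. `StreamProjection`, `project`, and `run_projection_with_guardrail` are hypothetical stand-ins, not the real projection code; only the name `_active_noncritical_stream_guardrail` comes from this tracker.

```python
from unittest import mock


class StreamProjection:
    """Hypothetical stand-in for the projection code under test."""

    def _active_noncritical_stream_guardrail(self) -> bool:
        # Assumption: the real guardrail reads live/local Yjs pressure state,
        # which is exactly what a hermetic test must not depend on.
        raise RuntimeError("guardrail consulted live state")

    def project(self, items):
        # Assumption: noncritical publications are dropped while the
        # guardrail reports itself as active.
        if self._active_noncritical_stream_guardrail():
            return []
        return list(items)


def run_projection_with_guardrail(active: bool, items):
    """Run the projection with the guardrail pinned to a known value."""
    with mock.patch.object(
        StreamProjection,
        "_active_noncritical_stream_guardrail",
        return_value=active,
    ):
        return StreamProjection().project(items)
```

Pinning the guardrail in each test that asserts exact publications keeps the assertion independent of whatever pressure state the host happens to be in.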
UI-RT-001: Forward UI runtime notifications to node skill logs
Status: in progress.
Expected behavior:
- Client-side runtime issues are visible in `[Node 0] Notifications` first.
- Dev mode may include diagnostic `details`; prod mode keeps the user-facing notification compact.
- The same notification envelope is eventually mirrored into node skill logs so an LLM/debugger can analyze UI contract mismatches without browser console access.
Actions:
- [x] Define a stable notification envelope for UI runtime issues.
- [x] Add a backend ingestion endpoint or stream receiver for client runtime notifications.
- [x] Mirror accepted notifications into node skill logs with webspace, node, scenario, widget, action, and modal context.
- [x] Export bounded Dev Browser runtime-debug breadcrumbs from `adaos.runtime_debug.logs.v1` as `ui.runtime_debug` diagnostics.
- [ ] Add LLM-oriented grouping for repeated contract issues.
- [ ] Add a stand smoke check that reads the resulting `service.__ui_runtime__.ui_runtime.log` tail through the standard skill log retrieval path.
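For illustration only, the envelope described above could look roughly like the following. The class name, field names, and `to_payload`/`to_log_line` helpers are assumptions made for this sketch; the tracker only states that the envelope carries webspace, node, scenario, widget, action, and modal context, and that diagnostic details are dev-only.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class UiRuntimeNotification:
    """Hypothetical envelope; field names are illustrative only."""

    message: str
    webspace: str
    node: str
    scenario: Optional[str] = None
    widget: Optional[str] = None
    action: Optional[str] = None
    modal: Optional[str] = None
    details: Dict[str, Any] = field(default_factory=dict)

    def to_payload(self, dev_mode: bool) -> Dict[str, Any]:
        """Return the user-facing payload; prod mode drops diagnostics."""
        payload = asdict(self)
        if not dev_mode:
            payload.pop("details")
        return payload

    def to_log_line(self) -> str:
        """Serialize the full envelope for a skill log mirror."""
        return json.dumps(self.to_payload(dev_mode=True), sort_keys=True)
```

Keeping one envelope for both the notification and the log mirror means a debugger reading the log sees the same context keys the UI used.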
Named Entity Registry and NLU Canonicalization
Goal
Keep human-facing names, localized labels, runtime-observed names, aliases, and canonical refs in one governed model so NLU, UI, skills, and LLM tooling can refer to the same objects without retraining models after every rename or language-specific alias change.
Current Status
Snapshot date: 2026-05-13.
Target architecture is documented in `docs/architecture/named-entities.md`.
Recommended implementation order:
- Contract and fixtures: `NamedEntityRecord`, `EntityResolutionResult`, topic constants, and ambiguity examples.
- Read-only registry: `NamedEntityService`, shared display helper, localized label metadata, and optional diagnostic projection.
- Event integration: observed/draft/name/alias events plus `entity.registry.changed` invalidation.
- NLU dry-run: resolver trace without dispatch changes.
- Governed writes: adopt, rename, alias add/remove/deprecate with conflict checks and audit metadata.
- MCP and skill migration: canonical descriptors for LLM tooling and removal of ad hoc fallback logic.
Integration progress:
- Overall: 98%.
- Completed: target architecture, addressing boundary, event model contract, initial roadmap, code-level record/result contracts, topic constants, read-only device entity adapter, modal/app/scenario/webspace lookup adapter, skill lookup adapter, browser draft-name helper, exact resolver, SDK read helpers, NLU dry-run trace subscriber, compact read-only `registry.named_entities` projection, live-room-safe NLU trace writes, voice/chat router live-room writes, read-only NLU Yjs reads, browser metadata capture from Yjs handshakes, access-links-driven `entity.registry.changed` invalidation, Root MCP/Codex read access to the compact named-entity registry, core node-display hostname-before-fallback behavior, client node-display helper alignment for legacy `Node N` fallback labels, client catalog/modal title enrichment from `registry.named_entities`, read-only registry label conflict diagnostics, localization-as-label-metadata architecture, compact registry label metadata, locale-aware resolver trace hints, per-locale conflict diagnostics, Root MCP `NLUAuthoringPlane` read-only context with canonical named entities, Teacher probe live entity matches, per-locale ambiguity evidence in NLU trace, runtime-only model-training evidence for alias resolution, first governed alias-add proposal/apply contract, SDK alias helpers, lifecycle event envelopes for alias add/conflict, durable device/browser alias persistence in `access_links`, `device_access.add_device_alias`, `sdk.data.entities.add_device_alias`, authoritative alias lifecycle event publishing, Root MCP / NLUAuthoringPlane `add_device_alias` write exposure guarded by `development.write.named_entities`, entity-level fingerprints, `base_fingerprint` stale-write protection, dedicated Root MCP `entity.alias.add` audit records, governed alias remove/deprecate proposal and apply flows, durable device/browser remove/deprecate persistence, NLUAuthoringPlane remove/deprecate write tools, dedicated `entity.alias.remove`/`entity.alias.deprecate` audit records, and focused tests, plus first authoritative device/browser observation, browser draft-name, and display-name lifecycle events from `access_links`.
- Current implementation slice: named-entity operational lifecycle events over authoritative device/browser registry changes.
- Not started yet: profile-owned aliases, conflict-resolution UX, remote target routing, and consumer migration.
- Verification note: targeted MCP/named-entity checks pass, and `test_root_mcp_foundation` is green again after test fixture alignment. The broader Yjs projection runs still expose pre-existing `AdaosMemoryYStore.started` drift; track that separately so it does not mask NER regressions.
Human verification:
- Check that docs consistently say human labels are not routing keys.
- Check that localization is described as label/alias selection, not as a change to canonical refs.
- Check that `Node N` is described as fallback-only.
- Check that the implementation starts read-only and does not change NLU dispatch until dry-run trace is visible.
- Check alias lifecycle manually: add a browser alias, deprecate it and confirm it remains visible for compatibility, then remove it and confirm NLU no longer resolves that phrase.
Next implementation steps:
- Start migrating node/browser labels to the shared display helper.
- Extend observed/draft/display-name lifecycle events from device/browser sources to workspace and manifest sources.
- Add conflict-resolution UX around Root MCP alias writes.
- Add profile-owned alias storage and policy boundaries.
- Migrate node/browser labels to shared display helpers in remaining skill projections.
Tasks
NER-001: Establish canonical named-entity read model
Status: in progress.
Actions:
- [x] Add `NamedEntityRecord` schema or dataclass.
- [x] Add `EntityResolutionResult` schema or dataclass.
- [x] Add shared `entity.*` event topic constants.
- [x] Add golden fixtures for node/browser/device alias and ambiguity examples.
- [x] Add golden fixtures for webspace, scenario, modal, and app examples.
- [x] Add golden fixtures for skill examples.
- [x] Document localized labels and aliases as read-model metadata.
- [ ] Build a read model over device inventory, node display, workspace manifests, system model objects, and desktop registry entries.
- [x] Build the first read-only device entity adapter over `DeviceInventoryService`.
- [x] Build the first read-only modal/app/scenario/webspace adapter over existing NLU lookup tables.
- [x] Preserve source authority: device access remains owned by `access_links`/`DeviceInventoryService`, not by the named-entity read model.
- [x] Project a compact read-only entity registry for UI/debug consumers.
NER-002: Make device and browser display names consistent
Status: planned.
Actions:
- [ ] Prefer user-confirmed display name, then node names, then observed hostname/browser+OS, then `Node N`.
- [ ] Preserve exact user-confirmed names while allowing localized aliases and localized system fallbacks.
- [x] Generate draft names for newly registered browsers.
- [x] Make core node display helpers use observed hostname before `Node N` fallback.
- [x] Make the client node-display helper treat `Node N` as fallback when observed hostname or registered names are present.
- [x] Use compact named-entity registry labels for client catalog and modal node display when the local label is still fallback-like.
- [x] Add locale metadata to compact registry labels while keeping `display_label` compatibility for current UI consumers.
- [ ] Make observed-only device rename flow explicitly adopt or adopt+rename.
- [x] Add read-only conflict diagnostics for duplicate display names or aliases in the compact registry payload.
- [ ] Surface conflict diagnostics in Notifications and operator-facing skill logs when user attention is useful.
- [ ] Invalidate display-name consumers through `entity.registry.changed` instead of reload-only behavior. Backend invalidation emission is in place; client/name-rendering consumers still need migration.
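The precedence in the first action above can be sketched as a single lookup. `pick_display_name` and its parameters are hypothetical names for this illustration; the order (user-confirmed name, node names, observed label, `Node N`) is taken from the task list, and the confirmed name is returned unchanged to reflect the "preserve exact user-confirmed names" action.

```python
from typing import Optional, Sequence


def pick_display_name(
    confirmed_name: Optional[str],
    node_names: Sequence[str],
    observed_label: Optional[str],
    node_index: int,
) -> str:
    """Return the first usable label in precedence order.

    Order: user-confirmed name, configured node names, observed
    hostname or browser+OS label, then the `Node N` fallback.
    """
    candidates = [confirmed_name, *node_names, observed_label]
    for candidate in candidates:
        # Skip missing or whitespace-only labels, but return the chosen
        # label exactly as stored so confirmed names are never rewritten.
        if candidate and candidate.strip():
            return candidate
    return f"Node {node_index}"
```

Having one function own the order is what lets the duplicate fallback rules in individual consumers be removed later.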
NER-003: Add NLU entity canonicalization
Status: in progress.
Actions:
- [x] Add a resolver dry-run mode that records NLU trace without changing dispatch behavior.
- [x] Resolve registered names and aliases before or alongside `nlp.intent.detect.request`.
- [x] Accept `request_locale` and `preferred_locales` as resolver hints.
- [x] Add `normalized_text`, `resolved_entities`, canonical refs, and ambiguity records to NLU trace.
- [x] Add per-locale conflict evidence to compact registry diagnostics.
- [x] Add per-locale ambiguity evidence to NLU trace.
- [x] Update Teacher probe output to show live entity resolver matches.
- [x] Add golden tests proving runtime aliases do not require Rasa/neural retraining.
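As a rough sketch of what a dry-run trace might contain, assuming an exact, whole-phrase alias match. The function name, the alias-table shape, the `browser:...` refs, and the `ambiguities` and `dispatch_changed` keys are all hypothetical; only `request_locale`, `preferred_locales`, `normalized_text`, and `resolved_entities` are names from this tracker.

```python
from typing import Dict, List, Sequence

# Hypothetical alias table: lowercase alias -> list of {"ref", "locale"}.
AliasTable = Dict[str, List[Dict[str, str]]]


def resolve_dry_run(
    text: str,
    aliases: AliasTable,
    request_locale: str = "und",
    preferred_locales: Sequence[str] = (),
) -> dict:
    """Build an NLU trace record without touching dispatch."""
    normalized = " ".join(text.lower().split())
    locales = [request_locale, *preferred_locales]
    resolved, ambiguous = [], []
    for alias, entries in aliases.items():
        # Whole-phrase match only, so "tv" does not match "activity".
        if f" {alias} " not in f" {normalized} ":
            continue
        # Prefer entries in the hinted locales; otherwise use all entries.
        in_locale = [e for e in entries if e["locale"] in locales]
        refs = sorted({e["ref"] for e in (in_locale or entries)})
        if not refs:
            continue
        if len(refs) == 1:
            resolved.append({"alias": alias, "ref": refs[0]})
        else:
            ambiguous.append({"alias": alias, "candidates": refs})
    return {
        "normalized_text": normalized,
        "resolved_entities": resolved,
        "ambiguities": ambiguous,
        "dispatch_changed": False,  # dry-run: trace only
    }
```

Because resolution is a table lookup at request time, adding or renaming an alias changes the table rather than any trained model.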
NER-004: Expose named entities to SDK/MCP/LLM tooling
Status: in progress.
Actions:
- [x] Add `sdk.data.entities` read helpers.
- [x] Add first governed alias-add proposal/apply service and SDK helpers.
- [x] Add durable device/browser alias write helper: `sdk.data.entities.add_device_alias`.
- [x] Expose named-entity descriptors through Root MCP read capabilities.
- [x] Include named entities in NLUAuthoringPlane context.
- [x] Expose governed device alias add through Root MCP / NLUAuthoringPlane with a write capability separated from `ProfileOpsRead`.
- [x] Expose entity `fingerprint` values and accept `base_fingerprint` on governed alias writes.
- [x] Expose governed device alias remove/deprecate through SDK and Root MCP / NLUAuthoringPlane with the same write capability and stale-write guard.
NER-005: Integrate named entities with the operational event model
Status: in progress.
Actions:
- [x] Emit `entity.observed` when authoritative device/browser access-link sources report observed labels.
- [ ] Extend `entity.observed` to workspace and manifest sources.
- [x] Emit `entity.draft_name.suggested` for generated browser draft names.
- [ ] Extend `entity.draft_name.suggested` to generated node draft names once node draft-name policy is explicit.
- [x] Emit alias lifecycle events from the first authoritative device/browser alias write path.
- [x] Emit display-name lifecycle events from the first authoritative device/browser display-name write path.
- [x] Emit alias remove/deprecate lifecycle events from authoritative device/browser write paths.
- [x] Return `entity.alias.added`, `entity.alias.conflict.detected`, and `entity.registry.changed` event envelopes from the governed alias-add apply contract.
- [x] Include `locale` or `locale: "und"` in the first authoritative alias-add lifecycle events.
- [x] Add stale-write protection through `base_fingerprint` and explicit `status: stale` results.
- [ ] Emit `entity.alias.conflict.detected`, `entity.resolution.ambiguous`, and `entity.resolution.failed` into Notifications and node skill logs when operator attention is useful.
- [x] Add dedicated audit trail records for Root MCP alias writes beyond the generic Root MCP invocation audit envelope.
- [x] Add dedicated audit trail records for Root MCP alias remove/deprecate writes.
- [ ] Treat `entity.registry.changed` as the cache invalidation signal for `EntityResolver` and demanded name-rendering projections. The compact Yjs projection already subscribes to this signal; resolver cache ownership is still pending.
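The stale-write guard listed above can be sketched as optimistic concurrency over a content hash. This assumes the fingerprint is a SHA-256 of a canonical JSON form, which the tracker does not specify; `entity_fingerprint`, `apply_alias_add`, the record shape, and the `applied` status are hypothetical, while `base_fingerprint`, `status: stale`, and the `"und"` locale default come from the task list.

```python
import hashlib
import json


def entity_fingerprint(record: dict) -> str:
    """Hash a canonical JSON form of the record (assumed scheme)."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def apply_alias_add(
    record: dict,
    alias: str,
    base_fingerprint: str,
    locale: str = "und",
) -> dict:
    """Apply an alias add only if the caller saw the current record."""
    current = entity_fingerprint(record)
    if base_fingerprint != current:
        # The record changed since the proposal was built; refuse to write.
        return {"status": "stale", "fingerprint": current}
    aliases = list(record.get("aliases", []))
    entry = {"text": alias, "locale": locale}
    if entry not in aliases:
        aliases.append(entry)
    updated = {**record, "aliases": aliases}
    return {
        "status": "applied",
        "record": updated,
        "fingerprint": entity_fingerprint(updated),
    }
```

A caller that receives `stale` would re-read the entity, rebuild the proposal against the returned fingerprint, and retry.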
NER-006: Migrate consumers away from ad hoc name fallback
Status: planned.
Actions:
- [x] Replace the first client-side node display fallback path with the shared named-entity display helper for catalog and modal titles.
- [ ] Extend client-side named-entity display enrichment to widget-level node badges and workspace manager surfaces.
- [ ] Update operator-facing skills to consume canonical refs and shared display names instead of raw labels.
- [ ] Remove duplicate fallback rules after the shared helper is adopted.
- [x] Add regression test proving a newly persisted browser alias resolves through NLU without model retraining.
- [ ] Add remaining regression tests for `Node N` fallback, hostname display, browser draft names, alias ambiguity, and renamed-device NLU resolution.