fix(sessions): surface gateway SSE failures and add polling fallback (#828)

* fix(sessions): surface gateway SSE failures and add polling fallback

- add a JSON probe mode for the gateway SSE endpoint
- detect watcher-unavailable 503s from the browser
- fall back to periodic session refresh with a toast
- add probe payload tests and endpoint coverage
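
The probe mode is selected with a `probe` query parameter on the SSE endpoint. A minimal sketch of that check, mirroring the expression used in `_handle_gateway_sse_stream`; the names `is_probe_request` and `TRUTHY` are hypothetical and only serve the illustration:

```python
from urllib.parse import urlparse, parse_qs

# Values treated as "probe requested"; the same set appears in the handler.
TRUTHY = {'1', 'true', 'yes'}


def is_probe_request(url: str) -> bool:
    """Return True when the URL carries a truthy ``probe`` query parameter."""
    query = urlparse(url).query
    # parse_qs maps each key to a list of values; a missing key falls back to [''].
    return parse_qs(query).get('probe', [''])[0].lower() in TRUTHY
```

With this check, `?probe=1`, `?probe=true` and `?probe=YES` all select the JSON probe, while a request without the parameter continues to the streaming path.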

Fixes #635

* fix(sessions): surface gateway SSE failures and add polling fallback (#826)

Absorbed from PR #826 by @cloudyun888 (fixes #635).

When the gateway watcher thread is not running, the browser now shows a
toast notification and falls back to polling every 30 seconds for session
sync. Previously the SSE failure was silent: the user received no feedback.
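
The decision the client makes from a probe response can be sketched as follows. This is a Python illustration, not the browser code (which is JavaScript): `choose_sync_mode` and `DEFAULT_POLL_MS` are hypothetical names, and treating a 404 as "feature disabled, do not poll" is an assumption. The payload keys (`ok`, `fallback_poll_ms`) are the ones returned by `_gateway_sse_probe_payload`.

```python
from typing import Optional, Tuple

DEFAULT_POLL_MS = 30000  # same value the probe payload reports as 'fallback_poll_ms'


def choose_sync_mode(status: int, payload: dict) -> Tuple[str, Optional[int]]:
    """Map a probe response to ('sse' | 'poll' | 'disabled', poll interval in ms)."""
    if status == 200 and payload.get('ok'):
        return 'sse', None
    if status == 404:
        # Assumption: a disabled feature needs neither SSE nor polling.
        return 'disabled', None
    # 503 (watcher not running) or any unexpected status: fall back to polling.
    return 'poll', int(payload.get('fallback_poll_ms', DEFAULT_POLL_MS))
```

For example, a 503 response with `{'ok': False, 'watcher_running': False, 'fallback_poll_ms': 30000}` yields `('poll', 30000)`, which corresponds to the toast-plus-polling path described above.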

Changes from original PR:
- Deleted misplaced test_gateway_sse_probe_unit.py (was at repo root, not
  discovered by `pytest tests/`); unit tests moved into tests/test_gateway_sync.py
- `_gateway_sse_probe_payload` now checks `watcher._thread.is_alive()` rather
  than just `watcher is not None`: a watcher instance with a dead poll thread
  now correctly reports unavailable and activates the polling fallback
- The `catch(e)` handler in `probeGatewaySSEStatus` now starts the polling
  fallback on network error rather than silently swallowing the failure
- Added 5 unit tests covering all watcher-alive/dead/missing/disabled branches
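
Why `watcher is not None` is not enough can be shown with a small stand-in. `CrashedWatcher` below is a hypothetical class, not the real `GatewayWatcher`; its poll loop returns immediately to simulate a loop that has exited (for example after an uncaught exception):

```python
import threading


class CrashedWatcher:
    """Hypothetical stand-in whose poll thread ends right after starting."""

    def __init__(self):
        self._thread = None

    def start(self):
        self._thread = threading.Thread(target=self._poll_loop, daemon=True)
        self._thread.start()

    def _poll_loop(self):
        # Simulates a poll loop that has exited; the instance itself lives on.
        return


watcher = CrashedWatcher()
watcher.start()
watcher._thread.join(timeout=1.0)  # wait until the thread has finished

instance_exists = watcher is not None      # the old check: still True
thread_alive = watcher._thread.is_alive()  # the new check: False
```

The instance still exists, so the old check would have handed out an SSE connection that never emits events; the thread-level check reports the watcher as unavailable.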

Co-authored-by: cloudyun888 <269269188+86cloudyun-afk@users.noreply.github.com>

* cleanup(gateway): public is_alive() + dedup probe/live watcher-alive check + changelog

Three small cleanups on top of @cloudyun888's PR #826 absorption:

1. Add GatewayWatcher.is_alive() public accessor so routes.py doesn't
   reach into the private _thread attribute.  The existing private-
   attribute check stays as a defensive fallback for any older in-
   memory instance or test double that doesn't implement the full API.

2. Dedupe the watcher_alive computation in _handle_gateway_sse_stream:
   the live-SSE path now calls _gateway_sse_probe_payload(...) and reads
   its watcher_running field instead of re-deriving the same logic
   inline.  Keeps probe and SSE in sync automatically.

3. CHANGELOG trailer was (#826, fixes #635, @cloudyun888) — this PR is
   #828, so updated to (#828, absorbs PR #826 by @cloudyun888, fixes
   #635) matching the repo convention for absorbed PRs (see #805).
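
The "prefer the public accessor, fall back to the private attribute" logic from item 1 can be sketched in isolation. `watcher_alive`, `PublicDouble` and `LegacyDouble` are hypothetical names for this illustration; the real helper computes the same value inside `_gateway_sse_probe_payload`:

```python
import threading


def watcher_alive(watcher) -> bool:
    """Report liveness via is_alive() when present, else via the private _thread."""
    if watcher is None:
        return False
    is_alive = getattr(watcher, 'is_alive', None)
    if callable(is_alive):
        return bool(is_alive())
    thread = getattr(watcher, '_thread', None)
    return thread is not None and thread.is_alive()


class PublicDouble:
    """Test double exposing the public API; _thread is deliberately misleading."""
    _thread = None

    def is_alive(self):
        return True


class LegacyDouble:
    """Test double with only the private attribute (thread never started)."""

    def __init__(self):
        self._thread = threading.Thread(target=lambda: None)
```

Here `watcher_alive(PublicDouble())` is `True` even though `_thread` is `None`, which shows the public method wins; `watcher_alive(LegacyDouble())` is `False` because the fallback sees an unstarted thread.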

Added two regression tests:
- test_gateway_watcher_is_alive_public_method — covers the three
  lifecycle states (before start, while running, after stop).
- test_probe_payload_prefers_public_is_alive — asserts the probe
  uses watcher.is_alive() rather than poking _thread when the
  public method exists.
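
The three lifecycle states checked by the first regression test can be illustrated with a reduced watcher. `MiniWatcher` is a hypothetical stand-in, not the real class; in particular, joining the thread inside `stop()` is an assumption made here so that the "after stop" state is deterministic:

```python
import threading


class MiniWatcher:
    """Hypothetical reduced watcher with the same is_alive() shape as the diff."""

    def __init__(self):
        self._thread = None
        self._stop_event = threading.Event()

    def start(self):
        self._stop_event.clear()
        self._thread = threading.Thread(target=self._poll_loop, daemon=True)
        self._thread.start()

    def _poll_loop(self):
        # Block until stop() is called; a real loop would poll periodically.
        self._stop_event.wait()

    def is_alive(self) -> bool:
        t = self._thread
        return t is not None and t.is_alive()

    def stop(self):
        self._stop_event.set()
        if self._thread is not None:
            self._thread.join(timeout=2.0)  # assumption: wait for the thread to exit


def lifecycle_states():
    """Return is_alive() before start, while running, and after stop."""
    w = MiniWatcher()
    before = w.is_alive()
    w.start()
    running = w.is_alive()
    w.stop()
    after = w.is_alive()
    return before, running, after
```

Calling `lifecycle_states()` returns `(False, True, False)` for this stand-in.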

Full suite: 1735 passed, 0 new failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: cloudyun888 <269269188+86cloudyun-afk@users.noreply.github.com>
Co-authored-by: nesquena-hermes <nesquena-hermes@users.noreply.github.com>
Co-authored-by: Nathan Esquenazi <nesquena@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Author: nesquena-hermes
Date: 2026-04-21 21:18:55 -07:00
Committed by: GitHub
Parent: 3daf2427f7
Commit: d4a3adb7b1
5 changed files with 264 additions and 8 deletions


@@ -119,6 +119,19 @@ class GatewayWatcher:
         self._thread = threading.Thread(target=self._poll_loop, daemon=True, name='gateway-watcher')
         self._thread.start()

+    def is_alive(self) -> bool:
+        """Return True when the poll thread is running.
+
+        Public accessor used by ``/api/sessions/gateway/stream`` probe mode and
+        the live SSE handler to detect a watcher instance whose poll thread
+        died silently (e.g. uncaught exception in ``_poll_loop``). Callers
+        use this to decide whether to return 503 and trigger the client-side
+        polling fallback, instead of handing out an SSE connection that would
+        never emit events.
+        """
+        t = self._thread
+        return t is not None and t.is_alive()
+
     def stop(self):
         """Stop the watcher thread."""
         self._stop_event.set()


@@ -762,7 +762,7 @@ def handle_get(handler, parsed) -> bool:
         return _handle_sse_stream(handler, parsed)

     if parsed.path == '/api/sessions/gateway/stream':
-        return _handle_gateway_sse_stream(handler)
+        return _handle_gateway_sse_stream(handler, parsed)

     if parsed.path == "/api/media":
         return _handle_media(handler, parsed)
@@ -1704,19 +1704,57 @@ def _handle_sse_stream(handler, parsed):
     return True


-def _handle_gateway_sse_stream(handler):
+def _gateway_sse_probe_payload(settings, watcher):
+    enabled = bool(settings.get('show_cli_sessions'))
+    # Use the public is_alive() accessor where available (current GatewayWatcher);
+    # fall back to the private _thread check for any older in-memory instance
+    # that might still be hanging around mid-upgrade, and for test doubles that
+    # don't implement the full public API.
+    if watcher is None:
+        watcher_alive = False
+    elif hasattr(watcher, 'is_alive') and callable(getattr(watcher, 'is_alive')):
+        watcher_alive = bool(watcher.is_alive())
+    else:
+        _t = getattr(watcher, '_thread', None)
+        watcher_alive = _t is not None and _t.is_alive()
+    payload = {
+        'enabled': enabled,
+        'fallback_poll_ms': 30000,
+        'ok': enabled and watcher_alive,
+        'watcher_running': watcher_alive,
+    }
+    if not enabled:
+        payload['error'] = 'agent sessions not enabled'
+        return payload, 404
+    if not watcher_alive:
+        payload['error'] = 'watcher not started'
+        return payload, 503
+    return payload, 200
+
+
+def _handle_gateway_sse_stream(handler, parsed):
     """SSE endpoint for real-time gateway session updates.
     Streams change events from the gateway watcher background thread.
     Only active when show_cli_sessions (show_agent_sessions) setting is enabled.
     """
-    # Check if the feature is enabled
     settings = load_settings()
-    if not settings.get('show_cli_sessions'):
-        return j(handler, {'error': 'agent sessions not enabled'}, status=404)
     from api.gateway_watcher import get_watcher
     watcher = get_watcher()
-    if watcher is None:
+    probe = parse_qs(parsed.query).get('probe', [''])[0].lower() in {'1', 'true', 'yes'}
+    if probe:
+        payload, status = _gateway_sse_probe_payload(settings, watcher)
+        return j(handler, payload, status=status)
+    # Check if the feature is enabled
+    if not settings.get('show_cli_sessions'):
+        return j(handler, {'error': 'agent sessions not enabled'}, status=404)
+    # Same watcher_alive semantics as the probe path — centralised via
+    # the helper so both branches stay in sync.
+    _probe_body, _probe_status = _gateway_sse_probe_payload(settings, watcher)
+    if not _probe_body['watcher_running']:
         return j(handler, {'error': 'watcher not started'}, status=503)
     handler.send_response(200)