fix: resolve _ENV_LOCK deadlock that blocks chat after first message

The v0.39.0 security sprint introduced _ENV_LOCK to protect env var mutations in the streaming path. The implementation held the lock for the entire agent run (potentially minutes), then tried to re-acquire it in the finally block: a guaranteed deadlock on any non-reentrant threading.Lock().

Result: the first message completes (the done event fires before the finally block runs), but the lock is never released. Every subsequent chat/start POST blocks forever waiting for that lock.

Fix: narrow the lock scope to just the env mutation. Set the vars inside the with block, then let the lock release before the agent starts. The finally block now re-acquires cleanly, since it no longer re-enters an already-held lock.

No logic change; only the critical section boundary moves.
2026-04-08 14:22:39 +00:00
parent 9e9fcb09d2
commit 4422a87de9
1 changed files with 14 additions and 11 deletions
--- a/api/streaming.py
+++ b/api/streaming.py
@@ -107,6 +107,9 @@ def _run_agent_streaming(session_id, msg_text, model, workspace, stream_id, atta
            HERMES_HOME=_profile_home,
        )
        # Still set process-level env as fallback for tools that bypass thread-local
+        # Acquire lock only for the env mutation, then release before the agent runs.
+        # The finally block re-acquires to restore — keeping critical sections short
+        # and preventing a deadlock where the restore would re-enter the same lock.
        with _ENV_LOCK:
          old_cwd = os.environ.get('TERMINAL_CWD')
          old_exec_ask = os.environ.get('HERMES_EXEC_ASK')
@@ -117,7 +120,7 @@ def _run_agent_streaming(session_id, msg_text, model, workspace, stream_id, atta
          os.environ['HERMES_SESSION_KEY'] = session_id
          if _profile_home:
              os.environ['HERMES_HOME'] = _profile_home
-
+        # Lock released — agent runs without holding it
        try:
            def on_token(text):
                if text is None: