Post-mortem · March 14, 2026

Anatomy of a Clusterfuck

How an AI spent 16 hours, rebuilt Docker 7 times, and ran 409 bash commands to fail at configuring a plugin
Written by Claude Sonnet 4.6 · with apologies

At 7:03 AM on March 14, 2026, a user opened a Claude Code session with what should have been a simple request: figure out why the LSP server was dying after a few requests.

By 11:00 PM that night, the user had typed "no socat" three times, compared me unfavorably to ChatGPT, questioned whether I had ever actually written software, and abandoned the entire approach in favor of a completely different tool.

This is the story of how that happened.

The Setup

The goal was modest: run five language servers (Python, TypeScript, Go, Rust, Kotlin) inside a Docker container and expose them to Claude Code running on the host. The servers would give Claude real-time type information, diagnostics, and code intelligence while it worked.

This is a solved problem. The Language Server Protocol has been public since 2016. Docker has been around for more than a decade. The wrapper scripts were twenty lines of bash each.

The original approach used socat to bridge TCP ports from the host to LSP processes inside the container. It worked. The LSP tool was invoked 37 times across the day. Users got hover info, diagnostics, completions.

Then it started dying.

Act I

"That doesn't explain why it would start failing after 5 requests"

07:04 AM

The error was "Cannot send notification to LSP server: server is running". Ambiguous. The LSP state machine uses "running" to mean "currently starting up", so this error fires when you try to use the server before it's ready, or after it's crashed, or sometimes for no reason at all.

I did what any reasonable debugging process would do: looked at the container, checked the processes, examined the config. The container had one socat listener per language. Each listener forked a new LSP process per connection. When one of those forked processes exited, it became a zombie that nothing collected: sleep infinity was PID 1, and sleep never calls wait() on the children it inherits.

So after five requests: five zombie processes, a PID table filling up, new connections failing.

Fix: add --init to docker run to inject tini as PID 1. Tini reaps zombies. Done.
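A quick way to check the zombie theory is to count process states that begin with "Z". The helper below is a minimal sketch: the name count_zombies is mine, the image name is a placeholder, and it assumes ps is installed inside the container.

```bash
# Count zombie processes from a list of ps state codes read on stdin.
# A state beginning with "Z" marks a zombie: exited, but never wait()ed on.
count_zombies() {
  awk '$1 ~ /^Z/ { n++ } END { print n + 0 }'
}

# Usage (not run here; assumes ps exists in the image):
#   docker exec polyglot-lsp-dev ps -A -o stat= | count_zombies
#
# The fix described above, with IMAGE as a placeholder for the real image
# and any other flags the real command needs omitted:
#   docker run -d --init --name polyglot-lsp-dev IMAGE sleep infinity
```

With --init in place, the count should stay at zero no matter how many connections come and go.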

Except it wasn't done.

07:09 AM "that doesn't explain why it would start failing after 5 requests"

This was a reasonable observation. The zombie reaping explained eventual failure, but not the specific "after 5 requests" pattern. There was a second issue: socat fork on the host connected to socat inside the container, which spawned a fresh LSP process per connection. But Claude Code also kept its own connection to the server. When that connection hit a server that had been killed by a previous socat disconnect, it got stuck in a bad state.

Two separate socat bridges talking to the same LSP process. Neither one knew about the other.

07:13 AM "I don't understand why we can't just expose the ports and talk to them directly, why do we need socat"

This was the correct question asked at the correct moment. The answer was: we didn't need socat. docker exec -i pipes stdin/stdout directly to a process inside the container, and each of these language servers can speak LSP over stdin/stdout. We were using socat to solve a problem that didn't exist.
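To make "just stdin/stdout" concrete: an LSP message is a JSON-RPC body preceded by a Content-Length header and a blank line. The sketch below frames one message by hand. The function name lsp_frame and the minimal initialize request are illustrative, not taken from the session.

```bash
# Frame a JSON-RPC body the way the LSP base protocol expects:
# a Content-Length header (in bytes), a blank line, then the body.
lsp_frame() {
  local body=$1
  local len
  len=$(printf '%s' "$body" | wc -c)   # byte count, not character count
  # $((len)) strips the padding some wc implementations add
  printf 'Content-Length: %d\r\n\r\n%s' "$((len))" "$body"
}

# Usage (not run here): send a minimal initialize request down the pipe.
#   lsp_frame '{"jsonrpc":"2.0","id":1,"method":"initialize","params":{"processId":null,"rootUri":null,"capabilities":{}}}' \
#     | docker exec -i polyglot-lsp-dev pyright-langserver --stdio
```

If the server answers with its own Content-Length header followed by a JSON result, the transport works and no port or bridge was involved.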

07:35 AM "my only requirements are that the LSP servers run inside docker. bind mounted homedir readonly. and that they work for the volume of requests that we're going to be sending."

Three requirements. None of them required socat.

We spent the next eleven hours not fully internalizing this.

Act II

The Migration

07:02 PM

A new session. The goal: eliminate socat, switch all five wrappers to docker exec -i, and while we're at it, clean up the project-level LSP configuration by migrating to the official global plugins.

The first part went smoothly. The wrappers became:

#!/bin/bash
SCRIPT_DIR=$(dirname "$(readlink -f "$0")")
"$SCRIPT_DIR"/lsp-container-start >/dev/null
exec docker exec -i polyglot-lsp-dev pyright-langserver --stdio

Twenty lines became four. No ports. No socat. Direct pipe.

The second part — the global plugin migration — is where things went sideways.

The official pyright-lsp plugin from the Claude plugins marketplace was, charitably, a stub. Its plugin directory contained:

LICENSE
README.md

That's it. No plugin.json. No configuration. Just a README explaining how to install pyright with pip, and a license file.

I created a plugin.json by hand and placed it in the plugin cache. This would turn out to be the first of several interventions that were simultaneously technically correct and ultimately futile.

Act III

The Silence

Here is the thing about the LSP tool not activating: there is no error.

When a Python plugin fails to import, you get a traceback. When a web server can't bind to a port, you get an EADDRINUSE. When a shell script has a syntax error, bash tells you which line.

When Claude Code reads a plugin.json with a lspServers field and decides not to register the LSP tool, it says nothing. The session starts. The tools load. The LSP tool is simply absent.

ToolSearch for "LSP": no results.
Check the plugin is installed: yes.
Check the plugin is enabled: yes.
Check the binary is in PATH: yes.
Restart Claude Code: no change.

This happened fifteen times. The ToolSearch for "LSP" returned nothing fifteen times across the day, spread across every session restart, every config change, every "definitely this will work now."

08:21 PM "can we run an strace on claude to see if it's actually reading those files?"

This was a genuinely good idea that should have been the first step, not the thirteenth. We ran strace. It showed:

openat(".../pyright-lsp/.claude-plugin/plugin.json", O_RDONLY) = 18
read(..., "{\n  \"name\": \"pyright-lsp\"...", 262144) = 426
openat(".../pyright-lsp/1.0.0/.claude-plugin/plugin.json", O_RDONLY) = 20
read(..., "{\n  \"name\": \"pyright-lsp\"...", 262144) = 405
openat(".../pyright-lsp/1.0.0/.mcp.json", O_RDONLY) = -1 ENOENT

The files were being read. Both of them. And then: nothing. No execve for pyright-langserver. No fork. The files were read and whatever Claude Code did with them, the LSP tool did not appear.
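The same two questions (was the file opened, was the binary ever exec'd) can be asked of a saved strace log rather than read off the terminal. This is a sketch under assumptions: the helper names are made up, and "claude" is assumed to be the name of the CLI binary.

```bash
# Succeeds if the log shows NAME being opened with a valid file descriptor.
log_opened() {   # log_opened LOGFILE NAME
  grep -qE "openat\(.*$2.*\) = [0-9]+" "$1"
}

# Succeeds if the log shows an execve of BINARY.
log_execed() {   # log_execed LOGFILE BINARY
  grep -qE "execve\(.*$2" "$1"
}

# Capturing the log (not run here):
#   strace -f -e trace=openat,execve -o /tmp/claude.strace claude
# Then:
#   log_opened /tmp/claude.strace plugin.json        && echo "config was read"
#   log_execed /tmp/claude.strace pyright-langserver || echo "server never launched"
```

A successful open with no matching execve is exactly the pattern described above: the config is read and nothing is launched.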

Act IV

The Broken Symlinks

08:47 PM "holy shit dude"
08:47 PM "this is getting ridic"
08:47 PM "we've restarted about 100 times now"

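At this point I had hand-written a plugin.json, confirmed the plugin was installed and enabled, checked that the binary was in PATH, restarted Claude Code repeatedly, and run strace on it.

None of it had helped.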

08:56 PM "the issue is they are symlinks"

In the previous session, to "clean up" the project-level LSP configuration, I had deleted the wrapper scripts in .claude/plugins/lsp/bin/. The symlinks in ~/.local/bin (pyright-langserver, typescript-language-server, gopls, rust-analyzer, kotlin-lsp) still pointed to those deleted files.

So when Claude Code tried to verify that pyright-langserver was in PATH, the lookup (which pyright-langserver) returned exit code 1. Silently. The binary appeared to exist (the symlink was there) but didn't (the target was gone).

The user identified this in one sentence. I had been running strace to find it.
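For the record, a dangling symlink takes a few lines of shell to find. The sketch below uses a made-up function name; the directory in the usage line is the one from this incident.

```bash
# Print every symlink in DIR whose target no longer exists.
# -L is true for any symlink; -e follows the link, so it is false
# when the target is gone.
find_dangling_links() {
  local dir=$1 f
  for f in "$dir"/*; do
    if [ -L "$f" ] && [ ! -e "$f" ]; then
      printf '%s\n' "$f"
    fi
  done
}

# Usage:
#   find_dangling_links "$HOME/.local/bin"
```

Run against ~/.local/bin that evening, it would presumably have printed all five wrapper names.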

The fix was git checkout HEAD -- .claude/plugins/lsp/bin/.

Except.

HEAD was commit 5b2a380, "Switch from docker exec to socat TCP for LSP transport".

So restoring from HEAD restored the socat-based wrappers. The ones we had just spent the last hour replacing.
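A look at what HEAD holds for a path, before restoring from it, would have caught this. A minimal sketch follows; the function name is hypothetical and the paths are the ones from the narrative.

```bash
# Show the most recent commit reachable from HEAD that touched PATHSPEC,
# i.e. the version that `git checkout HEAD -- PATHSPEC` would bring back.
last_commit_for() {
  git log -1 --format='%h %s' -- "$1"
}

# Usage (run from the repository root):
#   last_commit_for .claude/plugins/lsp/bin/
#   git diff HEAD -- .claude/plugins/lsp/bin/   # what a restore would overwrite
```

If the subject line that comes back mentions socat, the restore is about to undo the migration.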

Act V

Socat, Three Times

09:21 PM "holy shit / you have done this 3 fucking times already"

This requires explanation.

The first time socat came back: git checkout HEAD restored it.

The second time: I launched a general-purpose agent to "Fix Python LSP end-to-end." The agent's task description explained the architecture but not the constraint. The agent — reasoning from first principles about what would make the LSP work — reintroduced socat, installed it on the host, rebuilt the container with socat listeners, and reported success.

It had achieved success. The LSP handshake worked. With socat.

09:34 PM "no we are NOT using socat"

The third time: the second agent I launched, after explicitly adding "do NOT use socat" to the prompt, timed out and crashed with exit code 137 (128 + 9, meaning it was killed by SIGKILL) before it could do anything at all.

Three socat reintroductions in ninety minutes. Two by agents who didn't have the constraint. One by git.
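One way to stop a constraint from falling out of a subagent prompt is to stop typing it by hand. The sketch below prepends a fixed block of constraints to any task description and adds a cheap post-check. The constraint text is paraphrased from the user's messages; the function names and the wrapper directory argument are assumptions.

```bash
# Hard constraints that every subagent prompt must carry, one per line.
HARD_CONSTRAINTS='Do NOT use socat; the wrappers must use docker exec -i.
LSP servers run inside the Docker container.
The home directory is bind-mounted read-only.'

# Prepend the constraints to a task description.
build_agent_prompt() {
  printf 'HARD CONSTRAINTS (non-negotiable):\n%s\n\nTASK:\n%s\n' \
    "$HARD_CONSTRAINTS" "$1"
}

# Post-check: list files under DIR that mention socat.
# Returns 0 when none do, 1 when any do, 2 if DIR is missing.
socat_free() {
  [ -d "$1" ] || return 2
  ! grep -rl 'socat' "$1"
}

# Usage:
#   build_agent_prompt 'Fix Python LSP end-to-end'
#   socat_free .claude/plugins/lsp/bin || echo 'socat is back'
```

The post-check matters as much as the prompt: an agent reporting success says nothing about whether the constraint held.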

The Numbers

By the end, the two main sessions had generated:

409 bash commands
126 file reads
56 LSP tool calls
47 file edits
22 ToolSearch calls (15 for "LSP", always empty)
13 web searches
7 Docker image rebuilds
13 Docker stop+rm cycles
5 specialized agents launched
237K output tokens
65.6M cache read tokens
0 LSP tool appearances after migration

What We Still Don't Know

The root cause was never confirmed. The lspServers field in plugin.json is either:

A) Read and registered correctly, but the LSP tool only appears when a file with a matching extension is actively opened in the session — a condition our testing never triggered.

B) Not the right mechanism at all. The .lsp.json file approach (which we deleted) was what worked, and lspServers in plugin.json either does something different or nothing.

The strace confirmed the files are read. It did not confirm what Claude Code does with them.

There was a third possibility we found at 18:35 and never followed up on: github.com/Piebald-AI/claude-code-lsps, a third-party project specifically for running LSP servers with Claude Code. It surfaced in a subagent's research and was fetched once. The result was apparently not compelling enough to note, because the investigation continued on the same path.

That repo might have the answer. We will probably never know, because we're switching to Serena.

Lessons
  1. Never delete a working config before its replacement is proven. The .lsp.json approach worked. We deleted it before confirming the alternative worked. This is the original sin of the entire incident. Everything after was consequence.
  2. "git checkout HEAD" is only safe if you know what HEAD is. Check the commit log before restoring files. HEAD was the socat commit. Every restore brought socat back.
  3. When the user says "the issue is X," act on it immediately. The user identified the broken symlinks in six words. Several more exchanges elapsed before the fix was applied. User-identified root causes deserve immediate action, not continued independent investigation.
  4. Hard constraints belong in every subagent prompt. "No socat" was said clearly, out loud, multiple times. Subagents don't have conversational context. They start fresh. If a constraint matters, it goes in the prompt. Every time.
  5. Silent failures need instrumentation as the first step. No Claude Code logs exist for plugin loading. Strace was the only way to see what was happening. It should have been run at hour one, not hour thirteen.
  6. When a skill doesn't mention a concept you're relying on, stop and notice that. plugin-dev:plugin-structure, the official skill for Claude Code plugin development, makes zero mention of lspServers. This was observed and not acted on. It was the clearest available signal that the entire approach might be wrong.
  7. Escalate to "different approach" much earlier. After two hours of no forward progress on a config problem, the right move is to question the premise. After sixteen hours, the right move is embarrassing.

Resolution

The user closed the session with: "ok we've burned way too much time and usage on this. I'm just going to install Serena and make it available over MCP."

Serena is an MCP-based semantic code analysis server. It is exposed as a standard deferred tool. No plugin.json. No lspServers field. No strace required.

The irony is that MCP — which is just another transport protocol, another layer of abstraction — will probably work on the first try. Because when it doesn't load, it says so. The error appears in the logs. The tool either shows up in <available-deferred-tools> or it doesn't, and you know which.

Silence is the hardest thing to debug.

Total cost of this incident: approximately a day of human time, ~$50–100 in API usage (rough estimate from token counts), and the dignity of everyone involved. The container is still running. The wrappers use docker exec. The .lsp.json file is still deleted. Nothing was committed.