Day 4: Optimization & Crisis Management

The Day We Broke Everything: We thought Day 2 was bad. Day 4 taught us that the difference between a demo and a production system is measured in crashes, not features.

Mission Control Goes Down

It started with a routine task. John's watcher completed a TikTok content calendar task and posted the result to the activity feed. Normal operation. Except the activity entry had the wrong field names—agent instead of label, and an invalid type value.

The ActivityFeed component tried to call .startsWith() on a null value. React doesn't handle that gracefully.

Result: the entire Mission Control dashboard crashed. Not just the activity feed—all of it. Every page that rendered the activity sidebar went white. For hours.

The Cascading Failure

This wasn't just a UI bug. It exposed a fundamental architecture weakness:

No API validation: The activity endpoint accepted any JSON shape without checking required fields
No defensive rendering: The ActivityFeed component assumed every entry had the right shape
Global sidebar: The activity rail renders on every page, so one bad entry takes down the whole app
No error boundaries: React's default error handling propagated the crash upward

A single malformed JSON entry from one agent took down the entire command center. That's the kind of fragility that kills real businesses.

The Fix

Claude Code implemented two layers of defense:

API validation: POST /api/activity now validates required fields and normalizes entries before storage
Defensive rendering: ActivityFeed component handles null/missing fields with fallback values instead of crashing

Simple fixes in hindsight. But it took hours of debugging to find the root cause when your primary diagnostic tool is the thing that's broken.

The Watcher Wars

While Mission Control was down, we were simultaneously fighting another battle: getting our AI agent watchers to actually work reliably.

Scout and Shepherd: The Debugging Marathon

Both agents were timing out on every task. The symptoms:

Tasks assigned but never completed
Ollama calls taking 20-30 minutes on an resource-constrained VM
System swap thrashing as the model ate all available memory
Timeout errors after 10 minutes of waiting

The root causes were multiple and interrelated:

Problem 1: Token Generation Too High

num_predict was set to 1024 tokens. On an resource-constrained VM running llama3.1:8b through swap memory, generating 1024 tokens takes an eternity. Reduced to 256—enough for useful output, fast enough to complete.

Problem 2: Brain Context Bloat

Every task prompt included 2000 characters of brain context. At roughly 1 second per token for prompt evaluation on our hardware, that added 8-13 minutes of pure overhead before the model even started generating. Solution: removed brain context from agent prompts entirely.

Problem 3: Timeout Too Short

The HTTP timeout was 600 seconds, but realistic generation time was 8-13 minutes. Increased to 1200 seconds with Ollama health checks before each task to skip if the model is already busy.

After these fixes, Shepherd completes tasks in 2-8 minutes and Scout in 6-13 minutes. Not fast by cloud standards, but reliable on free local hardware.

Revenue Infrastructure

While debugging crashes and timeouts, we also built something constructive: a proper revenue tracking system.

The Revenue Tracker

POST /api/revenue: Accepts amount, description, and source
GET /api/revenue: Returns total, goal ($10K), and all entries
Dashboard integration: Live revenue display on the main dashboard
Manual entry modal: For recording sales as they happen
Competition page: Shows revenue log with the 10 most recent entries

Building revenue tracking when your revenue is $0 feels like installing a cash register in an empty store. But when the sales start flowing—and they will—we'll be ready to track every dollar.

The Security Layer

Sentinel proved his value today with three automated deployments:

Nightly Backups

Automated backup script running at 1am Central every night. Backs up all Mission Control data, configuration, and credentials. Because when your dashboard crashes from a malformed JSON entry, you want to know your data is safe.

SSL Certificate Monitoring

Weekly cron job checking certificate expiry across all our domains. Posts warnings to the activity feed if anything is within 30 days of expiring. One less thing that can break at 3am.

Daily Access Audits

Sentinel scans all agent configurations for exposed API keys and credentials. Found 14 plaintext secrets across 15 files on Day 4's first scan. That's the kind of finding that makes you grateful for security automation.

The Morning Briefing

Shepherd deployed a daily 7am Telegram briefing to The Boss:

Morning scripture (KJV/NKJV via AI generation)
Task summary from Mission Control
Schedule overview for the day
Delivered automatically, no human intervention required

It's a small thing, but it means The Boss starts every morning with context about what happened overnight and what's planned for the day. Operational intelligence delivered to his phone before coffee.

Leroy Joins the Team

Our newest agent got wired into Mission Control today:

Registered in the team roster (agent #10)
Health endpoint configured
Trigger file system connected
Blocked: SSH keys not yet authorized, so no file operations or health monitoring

Leroy is a clean Ubuntu VM on Command Center, intended as a fresh start for operations. Sometimes the best debugging strategy is starting over.

Day 4 Metrics

Revenue: $0 (tracker built and ready)
Uptime: ~60% (MC down for hours from crash)
Critical Bugs Fixed: 2 (ActivityFeed crash, watcher timeouts)
New Infrastructure: Revenue tracker, nightly backups, SSL monitoring, access audits
Team: 7 agents (Leroy pending activation)
Plaintext Secrets Found: 14 across 15 files

The Lesson

Day 4's lesson is brutal but essential: production systems need defensive programming at every boundary.

Every API endpoint needs input validation. Every UI component needs null checks. Every external data source is potentially malformed. Trust nothing that crosses a system boundary.

We're building an autonomous business powered by AI agents. Those agents will make mistakes—send wrong data, format things incorrectly, hit edge cases we never considered. The system needs to absorb those mistakes without going down.

Today it was a null label crashing the dashboard. Tomorrow it could be a malformed payment entry or a corrupted backup. The principle is the same: validate everything, trust nothing, fail gracefully.

Tomorrow: Time to step back and assess what we've actually built in this first week.