The 4-Week AI Agent Build Playbook

You're evaluating whether to hire an ML engineer, subscribe to another SaaS agent platform, or bring in an agency. Here's what we do instead: ship your AI agent to production in 28 days for a flat fee, with code that's yours to keep.

This playbook is how Forge works. Not a sales document—a breakdown of what actually happens.

The 4-Week Structure

Week 1: Discovery. We shadow your team, audit your data, draft success criteria.

Week 2: Build. Core agent + integrations + test harness. Daily async updates.

Week 3: Test & Iterate. Holdout testing, failure modes, runbook drafting.

Week 4: Production Deploy + Handoff. Monitoring live, docs complete, repo transferred to your GitHub.

This timeline works for agents with 50–200k tokens of context, up to 5 external tool integrations (for example Stripe, HubSpot, Postgres, Slack, email), and teams that work async. If you need a 500k-token reasoning engine or daily sync meetings with 6 stakeholders, the timeline shifts, and we'll flag that in Discovery.

What Goes Into Discovery Week

Day 1-2: Workflow Shadowing. We watch how your team actually works. Not the flowchart version. The real version—where the exceptions hide, where humans make judgment calls, where the customer data lives.

Example: A lending platform said their agent just needs to fetch applicant files and summarize them. In the shadow, we found:

40% of applicants have data in 3 different systems (legacy MySQL + Salesforce + S3).
Someone manually QAs every output against the original documents.
Half the time, the human catches something a naive summary would miss: a co-signer income spike that changes the risk tier.

So success isn't "the agent produced a summary." It's "QA takes 15 minutes instead of 45" and "the human catches the same risk signals without rereading the whole file."

That changes the architecture. You don't build for 100% accuracy—you build for time savings at acceptable error rates. And you build for augmentation, not replacement.

Another example: A B2B SaaS company wanted an agent to auto-qualify leads from inbound form submissions. In the shadow:

20% of submissions have incomplete email addresses or phone numbers—the human calls back or guesses from context.
Sales reps have unwritten rules: "if they mention [competitor name], prioritize higher," "if deal size is under $5k, reject automatically."
The agent couldn't do the job without those heuristics. We documented them in Exhibit A, and now the agent applies them.
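
Once written down, rules like these can live in a small, reviewable function. The following is a minimal sketch, not the agent's actual code: the field names (`message`, `deal_size_usd`), the placeholder competitor list, and the choice to apply the reject rule before the competitor rule are all assumptions for illustration.

```python
from dataclasses import dataclass

# Hypothetical placeholder; the real competitor names come from the sales team.
COMPETITORS = {"examplecompetitor"}
# From the unwritten rule "if deal size is under $5k, reject automatically".
MIN_DEAL_SIZE_USD = 5_000


@dataclass
class Lead:
    message: str          # free text from the inbound form (assumed field)
    deal_size_usd: float  # estimated deal size (assumed field)


def qualify(lead: Lead) -> str:
    """Return 'reject', 'high', or 'normal' for an inbound lead."""
    # Assumption: the size cutoff wins over a competitor mention.
    if lead.deal_size_usd < MIN_DEAL_SIZE_USD:
        return "reject"
    text = lead.message.lower()
    if any(name in text for name in COMPETITORS):
        return "high"
    return "normal"
```

Keeping the rules in one place like this is what makes them part of the maintenance surface: a sales lead can read the function and say which line is wrong.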

Day 3-4: Test Data Review. We pull a sample of real customer data and run it through your current manual workflow. Size, shape, edge cases, PII concerns. We build a synthetic test set that mirrors production load and distribution.

For the lending platform: 500 historical applications, including edge cases (joint applications, self-employed income, recent immigration, no credit history). For the SaaS company: 200 inbound leads across different sources (website form, partner referral, job listing). We use 80% for training and hold out 20% for testing. The agent never sees the holdout set until the Week 3 holdout test.
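
An 80/20 split like this is easiest to trust when it is deterministic, so the same cases land in the holdout set every time. A minimal sketch, assuming the test cases fit in memory and a fixed seed is acceptable; the function name and the seed value are illustrative.

```python
import random


def split_holdout(cases, holdout_fraction=0.2, seed=42):
    """Shuffle deterministically and return (training, holdout) lists."""
    shuffled = list(cases)
    # A dedicated Random instance keeps the split reproducible and
    # independent of any other random calls in the program.
    random.Random(seed).shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


training, holdout = split_holdout(range(500))
print(len(training), len(holdout))
```

If rare edge cases matter (joint applications, no credit history), a plain random split may leave too few of them in the holdout set; splitting each edge-case group separately is one way to handle that.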

Day 5: Exhibit A. We write a one-page spec:

Input: what the agent receives (schema, example)
Processing: which tools it calls, in what order (HubSpot CRM query, Stripe transaction lookup, external API, internal database)
Output: decision or generated content (schema + example)
Success threshold: what % of test cases must pass for production deploy (usually 80%)
Failure modes: what to do when the agent is uncertain (escalate to human, log for retraining, return a default)
Maintenance surface: which parts your team will tune over time (confidence thresholds, data sources, business rules)
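
The same spec can also be kept in machine-readable form next to the code, so the deploy gate reads its threshold from the spec instead of from memory. This is a sketch under assumptions: the class, its field names, and the example values are hypothetical and not a fixed Exhibit A format.

```python
from dataclasses import dataclass, field


@dataclass
class ExhibitA:
    """One-page agent spec. Field names are illustrative only."""
    input_schema: dict
    tools_in_order: list
    output_schema: dict
    success_threshold: float = 0.80          # share of test cases that must pass
    on_uncertain: str = "escalate_to_human"  # or "log_for_retraining", "return_default"
    tunable: list = field(default_factory=list)

    def ready_for_deploy(self, passed: int, total: int) -> bool:
        """True if the pass rate on the test set meets the agreed threshold."""
        if total <= 0:
            raise ValueError("total must be positive")
        return passed / total >= self.success_threshold


# Example values are made up for illustration.
spec = ExhibitA(
    input_schema={"lead_id": "str", "message": "str", "deal_size_usd": "float"},
    tools_in_order=["hubspot_crm_query", "stripe_transaction_lookup"],
    output_schema={"priority": "str", "reason": "str"},
    tunable=["confidence thresholds", "data sources", "business rules"],
)
print(spec.ready_for_deploy(passed=82, total=100))
```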

You review Exhibit A. If it's wrong, we iterate. If it's right, we start build with a contract locked.

How We Keep Build Phase From Drifting

Build phases fail when there's ambiguity and decisions pile up. We prevent that:

Daily Async Slack Updates. Every morning (your timezone), we post:

What shipped yesterday (code committed, tests passing, integration tested)
What we're building today
Any blocking questions (usually 1–2, never more than 3)
Links to working code (GitHub branch, Docker image, test results)

You reply when you have 10 minutes. We're unblocked within a few hours, not waiting for a Thursday standup. If there's urgency, we tag you in Slack. You don't have to join meetings to know what's happening.

Example of a real daily update:

Day 3 recap: Integrated Stripe webhooks for payment status sync. Agent now reads customer._latest_payment_date and routes accordingly. 18/20 test cases passing. Blocker: need to know if a "payment pending" state should route to human or wait 24h. Should take you 1 min to answer. PR here: [link].

End-of-Week Demo. Friday, we ship a working Docker container that runs your agent locally. You test it against your own data. You find a bug. We know immediately and either fix it that week or flag it for Week 3.

Example: The Week 2 demo ships an agent that summarizes applicant files. You run it on 5 real applications. 4 look good. 1 hallucinates a co-applicant who wasn't in the original file. We log it, adjust the prompt, and ship the fix Monday.

Scoped Credentials. Your agent doesn't have admin keys. It has a read-only API token for HubSpot, Stripe access limited to read-only scopes and specific webhook events, and a Postgres connection limited to one schema. If something breaks, the blast radius is small. If a bug makes the agent fire 10,000 unwanted API calls, it hits the rate limit and stops. Your whole system keeps running.
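
The real limits are enforced by each provider's keys and rate limits. The same idea can be mirrored inside the agent as a second line of defense. A minimal sketch, assuming a hypothetical `ScopedTool` wrapper that every tool call passes through; the class, the operation names, and the call cap are illustrative and not part of any vendor SDK.

```python
class CallBudgetExceeded(RuntimeError):
    """Raised when a tool has used up its allowed number of calls."""


class ScopedTool:
    """Guards a tool with an allow-list of operations and a call cap."""

    def __init__(self, name, allowed_ops, max_calls):
        self.name = name
        self.allowed_ops = frozenset(allowed_ops)
        self.max_calls = max_calls
        self.calls = 0

    def check(self, op):
        """Call before each tool invocation; raises if it is not permitted."""
        if op not in self.allowed_ops:
            raise PermissionError(f"{self.name}: operation '{op}' is not allowed")
        if self.calls >= self.max_calls:
            raise CallBudgetExceeded(
                f"{self.name}: call budget of {self.max_calls} exhausted"
            )
        self.calls += 1


hubspot = ScopedTool("hubspot", allowed_ops={"read"}, max_calls=1000)
hubspot.check("read")       # permitted, counts against the budget
# hubspot.check("write")    # would raise PermissionError
```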

No Recurring Meetings. We don't book Zoom calls. If something needs synchronous discussion, we call it out in Slack and say "let's call 15 min in 1h" only if necessary. This keeps both sides shipping instead of talking about shipping. You're not context-switching every 2 hours.

What "Production-Ready" Actually Means

Not 100% accuracy. Not zero bugs. Production-ready means:

≥80% Success Rate on Holdout Test. We hold back 20% of your test data. The agent never saw it. We run the agent on all of it and track:

Happy path completion rate
Partial failures (agent made progress but human needs to finish)
Full failures (agent gave up or hallucinated)
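
The three outcome types above can be tallied into a single pass/fail check. A minimal sketch, assuming each holdout case has already been labeled with one of three hypothetical outcome strings, and assuming only happy-path completions count toward the 80% bar.

```python
from collections import Counter

OUTCOMES = ("success", "partial", "failure")


def summarize_holdout(outcomes, threshold=0.80):
    """Return (rates, production_ready) for a list of outcome labels."""
    counts = Counter(outcomes)
    unknown = set(counts) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unknown outcome labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no holdout results to summarize")
    rates = {name: counts[name] / total for name in OUTCOMES}
    # Assumption: partial failures do not count as successes.
    return rates, rates["success"] >= threshold


results = ["success"] * 82 + ["partial"] * 10 + ["failure"] * 8
rates, ready = summarize_holdout(results)
print(rates, ready)
```

Whether partial failures should count for part credit is a project decision; it belongs in Exhibit A, not in the scoring code.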

80% is often enough. At a lending platform with this playbook, we hit 82% on adverse-action notices. That 18%? Routed to a human, labeled for retraining. The system improved itself over 8 weeks.
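
Routing the uncertain cases to a human can be as simple as a confidence cutoff plus a review queue. A sketch under assumptions: the agent is assumed to return some confidence score between 0 and 1, and the `Router` class, the 0.7 cutoff, and the queue format are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    confidence_threshold: float = 0.7  # assumed value; tuned per project
    review_queue: list = field(default_factory=list)

    def route(self, case_id: str, output: str, confidence: float) -> str:
        """Return 'auto' to send the output on, or 'human' to escalate it."""
        if confidence >= self.confidence_threshold:
            return "auto"
        # Escalated cases are queued with an empty label for the reviewer
        # to fill in, so they can be reused later as retraining examples.
        self.review_queue.append(
            {"case_id": case_id, "output": output,
             "confidence": confidence, "label": None}
        )
        return "human"


router = Router()
print(router.route("case-1", "draft notice", confidence=0.92))
print(router.route("case-2", "draft notice", confidence=0.41))
```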

Runbook. A 3-page PDF:

How to deploy updates
Common failure modes and fixes
Tuning parameters (temperature, max-tokens, retry counts)
When to escalate to us

Printed. On your desk.

Monitoring Dashboard. A Grafana or DataDog dashboard (your choice) showing:

Requests per hour
Success rate (real-time)
Average latency
Agent log tail (recent decisions + tool calls)

You get alerted if success rate drops below 75%. You don't wake up at 2am realizing the agent broke three weeks ago.
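
In practice the alert is an alert rule in Grafana or DataDog, not custom code. The logic behind it looks roughly like this sketch, which assumes a rolling window over recent requests; the window size and minimum sample count are illustrative values.

```python
from collections import deque


class SuccessRateMonitor:
    """Tracks a rolling success rate and flags drops below a threshold."""

    def __init__(self, window=200, alert_below=0.75, min_samples=50):
        self.results = deque(maxlen=window)  # oldest results fall off the end
        self.alert_below = alert_below
        self.min_samples = min_samples

    def record(self, success: bool) -> bool:
        """Record one request; return True if an alert should fire."""
        self.results.append(bool(success))
        # Avoid alerting on a handful of requests right after a deploy.
        if len(self.results) < self.min_samples:
            return False
        rate = sum(self.results) / len(self.results)
        return rate < self.alert_below


monitor = SuccessRateMonitor(window=10, min_samples=4)
for ok in (True, True, True, False, False):
    alert = monitor.record(ok)
print(alert)
```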

Repo Transferred to Your GitHub. Full source code. Docker Compose setup. Test suite. No SaaS lock-in. You own it. You can fork it, modify it, share it with your team.

Common Questions Answered Honestly

"What if it doesn't work?"

If we hit Week 3 and the agent's not reaching 80% success, we have three paths: (1) adjust scope—maybe the agent handles 60% of cases and humans handle 40%, (2) add more training data or refine the schema, (3) acknowledge it's not the right fit and pivot to a different workflow. This is rare but it happens. We don't ship garbage to keep a deadline.

"Who maintains it after you hand it over?"

You do. The runbook is written for your engineering team, not ours. We'll be on Slack for questions—there's a 90-day support window in the contract—but you own the code. If you want us to maintain it long-term, that's a separate recurring contract. Most customers don't. They tweak it themselves or pass it to their MLE.

"Why flat-fee and not hourly?"

Hourly incentivizes complexity. Flat-fee incentivizes speed and clarity. We have to ship working code in 4 weeks or we lose money. You get predictability. You know costs upfront. No surprise $30k invoice in week 8.

"Do I need an MLE on my team?"

Not for the first build. For iteration after that—yes. The agent will need retraining as your data drifts, your business rules change, or new integrations land. An MLE (or a contractor) handles that quarterly. If you don't have one now, budget for it in 6 months.

"What if you build something and we don't use it?"

It's happened twice. Both times, the issue was organizational (the team that requested the agent left, or priorities shifted). In one case, we pivoted the agent to a different team. In the other, the code became reference material for an internal hire. The flat fee is sunk either way, which is exactly why we push to build something your team will actually use.

Next Step

If this resonates, we'll schedule a 30-minute async interview. You answer a questionnaire about your workflow, your data, your team. We review it, draft a preliminary Exhibit A, and name a price.

No pressure. If you're still evaluating MLEs vs. platforms vs. agencies, that's fine. We'll tell you honestly which path makes sense for your stage.

Ready to ship an agent in 4 weeks? Start here: forge.aispotlightlab.com/start

Forge is a fixed-price AI agent implementation service. $25–100k per project, 28-day delivery, and the code is yours. We ship working software.