03 — AI / LLM Engineering · Q3 2026, 2 slots

AI that
actually ships.

From RAG and agents to fine-tuning and evals. Production-grade systems with p95 < 800ms, evaluation pipelines that run on every commit, and prompts you can version like code — not demo-day theatrics.

Fixed quote in 24h · See pricing · 14 AI systems shipped · 4.9★ avg
Stacks: 5 stacks · OpenAI · Anthropic · pgvector · LangGraph
Timeline: 10 wk · Brief → eval-gated launch
Investment: $20K–45K · Build tier · fixed, no surprises
Output: Live in production · + eval suite, runbook, drift monitor
01 — What We Build

AI that survives
Monday morning.

Production systems with evals before launch, traces after, and prompts you can git-blame. We build the AI layer, the retrieval layer, the eval layer, and the boring infra under all of it.

// 01 · RAG SYSTEMS

RAG that
cites its sources.

Hybrid retrieval (BM25 + dense), hierarchical chunking with tree-sitter for code, citation-required outputs. pgvector when you have Postgres; specialized stores when you don't.

pgvector · Anthropic · voyage-3 · OpenAI
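A minimal sketch of how BM25 and dense results can be merged into one ranking. It assumes reciprocal rank fusion (RRF), which is one common choice; the copy above does not name the fusion method. The function name, the document ids, and `k=60` are illustrative assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into a single list, best first.

    Each document scores 1 / (k + rank) in every list it appears in;
    the per-list scores are summed.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by doc id so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a keyword (BM25) and a vector (dense) retriever.
bm25_hits = ["doc_a", "doc_b", "doc_c"]
dense_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Documents that both retrievers rank highly rise to the top, and no score normalization between BM25 and cosine similarity is needed because only ranks are used.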
// 02 · AGENTS

Agents with
bounded autonomy.

Multi-step orchestration via LangGraph or Temporal, explicit tool grants, human-in-loop on consequential actions. Every step logged for post-hoc audit.

LangGraph · Temporal · MCP · Inngest
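A rough sketch of what "explicit tool grants" and "human-in-loop on consequential actions" can look like in code. This is plain Python, not LangGraph or Temporal API; `ToolPolicy`, the tool names, and the decision labels are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToolPolicy:
    granted: set            # tools the agent may call at all
    needs_approval: set     # granted tools that also need a human sign-off
    audit_log: list = field(default_factory=list)

    def authorize(self, tool, approved_by=None):
        """Return 'allowed', 'pending_approval', or 'denied' and log the decision."""
        if tool not in self.granted:
            decision = "denied"
        elif tool in self.needs_approval and approved_by is None:
            decision = "pending_approval"
        else:
            decision = "allowed"
        # Every decision is recorded, including denials, for post-hoc audit.
        self.audit_log.append(
            {"tool": tool, "decision": decision, "approved_by": approved_by}
        )
        return decision


policy = ToolPolicy(
    granted={"search_docs", "issue_refund"},
    needs_approval={"issue_refund"},
)
status = policy.authorize("issue_refund")  # no approver yet
```

The agent runtime would call `authorize` before every tool invocation and pause the run while a decision is pending.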
// 03 · FINE-TUNING

Fine-tunes.
When prompts plateau.

Synthetic data generation, LoRA / QLoRA for parameter-efficient training, hosted on Modal or your own GPUs. We tell you in week 1 whether prompting will be enough.

PyTorch · HuggingFace · vLLM · Modal
// 04 · EVALS & SAFETY

Evals first.
Ship-gates always.

50–300 cases with you in week 1, pass-rate gates in CI, drift monitoring in production. Output filters for policy, refusal handling for the long tail.

Promptfoo · Braintrust · Helicone · Pydantic
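A minimal sketch of a pass-rate gate, assuming eval results arrive as a list of records with a boolean `passed` field. The record shape, the function names, and the 0.95 target are assumptions for illustration; this is not Promptfoo or Braintrust API.

```python
def pass_rate(results):
    """Fraction of eval cases that passed."""
    if not results:
        raise ValueError("no eval results to score")
    return sum(1 for r in results if r["passed"]) / len(results)


def gate(results, target=0.95):
    """Return (ok, rate); ok is True only when the pass rate meets the target."""
    rate = pass_rate(results)
    return rate >= target, rate


# Hypothetical run: 20 cases, the first two fail.
results = [{"case": f"case-{i}", "passed": i >= 2} for i in range(20)]
ok, rate = gate(results, target=0.95)
print(f"pass rate {rate:.1%}, gate {'open' if ok else 'closed'}")
```

In CI, the job would exit non-zero whenever `ok` is false, so a prompt change that lowers the pass rate below target cannot merge.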
// 05 · PRODUCTION OPS

Observability,
from day one.

Token-level traces, latency p50/p95/p99 dashboards, cost-per-feature attribution, prompt versioning tied to commits. The runbook ships with the system.

LangSmith · Helicone · Sentry · Vercel AI
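For readers who want the arithmetic behind a p50/p95/p99 dashboard, here is a small sketch using the nearest-rank percentile definition. Tracing tools may interpolate differently; the sample latencies are made up.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    the data at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds.
latencies_ms = [120, 340, 410, 450, 480, 520, 560, 610, 790, 1500]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
```

With only ten samples, p95 and p99 both land on the slowest request, which is why tail percentiles are only meaningful over a large window of traffic.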
// 06 · VOICE & VISION

Multi-modal,
when it earns it.

Realtime voice via OpenAI / ElevenLabs, vision-grounded answering, document understanding. We'll tell you when text-only is the right answer.

Whisper · Realtime API · GPT-4o · Claude Vision
02 — How We Ship

Ten weeks.
Four gates.

Production AI has its own rhythm — eval suites before launch, drift monitoring after, prompt versioning in between. Our process is built around real reliability targets, not demo-day theatrics.

WK 0 · PROBLEM
WK 1–2 · EVALS
WK 3–9 · BUILD
WK 10 · SHIP
01
WK 0 · PROBLEM & DATA · 3–5 days

Scope by
metric, not vibe.

Which user task? What's success? What does failure look like, and what's the cost of each kind? Concrete metric targets in writing before any model is touched.

Eval targets · data audit, fixed quote
02
WK 1–2 · EVALS & BASELINE · 2 weeks

Evals first.
Always.

50–300 test cases written with you, baseline model + prompt scored against them, hard cases identified. The eval suite ships before the system ever does.

Eval suite · + scored baseline
03
WK 3–9 · BUILD & TUNE · 7 weeks

Iterate on
the suite.

Retrieval, prompts, fine-tunes — every change scored against the eval set in CI. Weekly demo, weekly written update, weekly pass-rate report.

Weekly demos · + eval reports
04
WK 10 · SHIP & MONITOR · 1 week

Eval-gated
launch.

Pass-rate target met, drift monitor wired, runbook delivered. We don't ship until the suite says so — and the suite keeps watching after.

Live in production · drift alerts + 30-day support
03 — Selected AI Work

Systems that
actually ship.

3 of 22 · curated for AI

Production AI systems — each survived its eval suite, made it to production, and stayed there.

Featured · Support Copilot · RAG · 08 / 22 · Ansa: RAG customer-support copilot
// Case 08 · 2026 · RAG · AI

Ansa
answers, grounded.

A customer-support chatbot grounded in your own help docs via RAG — hybrid retrieval, inline citations, and a human handoff when it isn't sure. 68% of tickets deflected, with zero ungrounded answers.

68% Deflection · 1.2s p50 answer · 0 Ungrounded
RAG · pgvector · Claude / GPT · Next.js
Generation · Meetings · 10 / 22 · Quill: AI proposal writing from meetings
// Case 10 · 2026 · Generation · AI

Quill
meeting to proposal.

Upload a client-call recording; Quill transcribes it, extracts scope, timeline and budget, and drafts a proposal with Llama 3.2 — export to PDF or Word in one click.

~12 min Draft time · 4 Steps automated · 2 Export formats
Llama 3.2 · ASR · Next.js · PDF / DOCX
Extraction · IDP · 12 / 22 · Parsr: AI document data extraction
// Case 12 · 2026 · Extraction · AI

Parsr
documents to data.

Drop in invoices, receipts or contracts — OCR + an LLM extract the fields with confidence scores, flag the uncertain ones, and sync the rest to accounting. 98% field accuracy, zero re-keying.

98% Field accuracy · ~5s Per document · 0 Re-keying
OCR · LLM · Next.js · CSV / API
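The "flag the uncertain ones, sync the rest" step in the Parsr description can be pictured as a simple threshold split. This is a sketch under assumptions: the field names, the 0.9 threshold, and the `(value, confidence)` shape are illustrative, not Parsr's actual schema.

```python
def route_fields(fields, threshold=0.9):
    """Split extracted fields into auto-accepted and flagged-for-review.

    `fields` maps a field name to a (value, confidence) pair.
    """
    accepted, flagged = {}, {}
    for name, (value, confidence) in fields.items():
        target = accepted if confidence >= threshold else flagged
        target[name] = value
    return accepted, flagged


# Hypothetical extraction output for one invoice.
extracted = {
    "invoice_number": ("INV-1042", 0.99),
    "total": ("1,280.00", 0.97),
    "due_date": ("2026-07-01", 0.62),
}
accepted, flagged = route_fields(extracted)
```

Only `accepted` would be synced downstream; `flagged` goes to a person for review.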
04 — Engagement

Three
ways to work.

Fixed-price sprints, full builds, or ongoing programs. We'll tell you which fits in the scoping call — and if none fit, who else to talk to.

// 01 · Sprint
14 days

Sprint
tier.

A fixed two-week burst. Best for prompt audits, eval-suite bootstrapping, or a focused RAG / agent prototype.

From $5k
Fixed price · 2 wks
1 ML engineer + designer · 20% reserved
  • Eval suite (50 cases)
  • Baseline scored against suite
  • Hardest-case audit doc
  • One demo at end of sprint
// 03 · Program
Monthly

Program
tier.

Embedded team for ongoing prompts, retraining cycles, and model upgrades. Monthly engagements, roadmap on-call.

From $8k / mo
Monthly · roll-off any time
2+ ML engineers + ops · dedicated
  • Embedded team in your stack
  • Weekly eval & cost reviews
  • Prompt + model upgrades on-call
  • Roadmap planning included
  • Fine-tunes & retraining cycles
05 — Questions, Answered

Before
you write back.

Missing a question? Ask in the brief and we'll answer in the reply.

Q·01 · Engineering · ★ Most asked

Which models do you default to — and how do you choose?

Claude Sonnet/Opus for chat and reasoning. GPT-4o when latency dominates. voyage-3 or text-embedding-3-large for retrieval. Llama 3.3 when self-hosting is required.

We benchmark on your task before locking it in — same prompts, same eval set, three candidate models, a written recommendation. The default isn't the choice; the eval is.

Q·02 · Process

Do you train custom models, or only prompt-engineer?

Both. ~70% of our builds get 90% of the way there with strong prompting + good retrieval. The other 30% need fine-tuning, usually because the task has tone, format, or domain constraints that prompts can't reliably enforce.

We tell you in week 1 which camp you're in. Fine-tuning gets quoted separately because it adds GPU spend and an eval cycle on top.

Q·03 · Operations

Who owns the models, data, and the API costs?

You do. API keys live in your accounts (Anthropic, OpenAI, etc). Vector DB in your cloud. Training data, weights, and eval sets are your IP. We're vendors on your bill, never on the contract.

Q·04 · Engineering · Included

How do you handle evals and regression testing?

Eval suite is the first deliverable, not the last. We write 50–300 test cases with you in week 1, run them on every prompt change in CI, and gate launches on a pass-rate target you sign off on.

After launch the suite keeps running. Drift detector flags when production traces diverge from the eval distribution. Promptfoo / Braintrust by default, custom rig if you need it.
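The answer above does not say how divergence is measured. One common option, used here purely as an assumption, is the population stability index (PSI) over a categorical label such as ticket topic. The 0.2 alert threshold is a widely used rule of thumb, not a figure from this page, and all names are hypothetical.

```python
import math
from collections import Counter

PSI_ALERT = 0.2  # assumed rule-of-thumb threshold


def _shares(labels, categories, eps=1e-6):
    """Share of each category, floored at eps so the log below is defined."""
    if not labels:
        raise ValueError("no labels")
    counts = Counter(labels)
    return {c: max(counts[c] / len(labels), eps) for c in categories}


def psi(expected_labels, actual_labels):
    """Population stability index between two categorical samples."""
    categories = set(expected_labels) | set(actual_labels)
    expected = _shares(expected_labels, categories)
    actual = _shares(actual_labels, categories)
    return sum(
        (actual[c] - expected[c]) * math.log(actual[c] / expected[c])
        for c in categories
    )


# Hypothetical topics: the eval set has no "returns" cases, production does.
eval_topics = ["billing"] * 50 + ["shipping"] * 50
prod_topics = ["billing"] * 20 + ["shipping"] * 30 + ["returns"] * 50
drifted = psi(eval_topics, prod_topics) > PSI_ALERT
```

A topic that appears in production but not in the eval set dominates the score, which is the signal to add new eval cases.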

Q·05 · Engineering

What about hallucinations and safety guardrails?

Three layers. Retrieval grounds outputs in your data with required citations. Output filters catch policy violations and out-of-scope responses. Bounded autonomy — agents need explicit permission for anything that touches money, customers, or production systems.

Every response is logged with retrieved sources and tool calls. You can audit any decision the system made in the last 90 days, line by line.
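A small sketch of what a "required citations" check might look like, assuming answers cite sources with bracketed numbers like `[2]` and that the retriever returns numeric source ids. Both conventions are assumptions made for this example.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")


def is_grounded(answer, retrieved_ids):
    """True when the answer cites at least one source and every cited id
    was actually retrieved for this request."""
    cited = {int(n) for n in CITATION.findall(answer)}
    return bool(cited) and cited <= set(retrieved_ids)


retrieved = [1, 2, 3]
good = is_grounded("Refunds post within 5 business days [1][3].", retrieved)
bad = is_grounded("Refunds post instantly [4].", retrieved)
uncited = is_grounded("Refunds post instantly.", retrieved)
```

This only checks that citations exist and point at retrieved sources; whether the cited passage supports the claim still needs an eval case or a human reviewer.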

Q·06 · Operations

Can you handle PII / HIPAA workloads — PHI, regulated data?

Yes. We've shipped HIPAA-aligned LLM systems with PHI-stripping pre-processing, BAAs across the vendor stack (Anthropic, AWS Bedrock), and SOC 2 Type II audit prep.

Regulated work lives in the Program tier. The compliance paperwork alone is real engineering — we won't squeeze it into a Build budget.
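To make "PHI-stripping pre-processing" concrete, here is a deliberately simple sketch that redacts three pattern-shaped identifiers before text reaches a model. The patterns and placeholder labels are assumptions; regexes alone do not amount to HIPAA de-identification and will miss names, addresses, and free-text identifiers.

```python
import re

# Illustrative patterns only: email, US SSN, and US-style phone numbers.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact(text):
    """Replace each matched identifier with a bracketed placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


note = "Reach Jane at jane.doe@example.com or 555-123-4567, SSN 123-45-6789."
clean = redact(note)
```

The name "Jane" passes through untouched here, which is why production pipelines typically add an entity-recognition pass and a review step on top of pattern matching.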

Q·07 · Engineering

Cloud, on-prem, or air-gapped?

All three. Cloud — fastest, cheapest. VPC / private endpoints via Bedrock or Azure OpenAI when data residency matters. Self-hosted Llama 3.3 / Qwen 2.5 on your GPUs when air-gap is non-negotiable.

We ship what your compliance allows. The architecture decision happens in week 0 with your security team in the room.

Q·08 · Pricing

What if I just need help with one prompt, not a system?

Sprint tier: $5k, two weeks. We audit your existing prompt, write evals, propose a structured version, run the A/B test, and ship the winner with a rollback plan. Most clients see a 20–40% quality lift on the first sprint.

06 — By the numbers

Four years of AI.
Production receipts.

Real numbers from production AI systems — pulled from LangSmith, Helicone, and provider dashboards (Anthropic, OpenAI). Updated quarterly.

// 01 · 14+ · AI systems in prod · ↑ 2 this quarter
// 02 · 96.4% · Avg eval pass rate · Stable · 18 suites
// 03 · 612ms · p95 latency · live · ↓ 80ms vs Q1
// 04 · 10wk · Brief → eval-gated · Fixed scope · evals before launch
Source · LangSmith · Helicone · Anthropic Console
Last updated Q2 2026 · refreshed quarterly
07 — Said About The Work

One quote.
From the right VP.

There's a wall of testimonials on the home page. This is the one that matters for AI — a support org that was drowning in tickets and let us deflect 38% of L1 in a single quarter, with zero hallucinated refunds.

★★★★★

We were two months behind on tickets and adding headcount wasn't an option. BytesGenX built a retrieval-grounded copilot in nine weeks, gated launch on a 312-case eval suite, and refused to ship until pass rate cleared 95%.

Three months in: 38% L1 deflection, zero hallucinated refunds, and a Slack channel where my agents argue about which suggestions to upvote. That's the part I didn't expect.

Claude Opus · pgvector · Build tier · 9 weeks · Shipped · Q1 2026
Also said about the work · +5 more on the home page
★★★★★

"Eval suite first. They wouldn't ship without one. Turns out that's the whole game."

Lin Chen · Eng Lead, Codex
★★★★★

"Bounded autonomy with full audit logs. Exactly what compliance asked for. First time IT signed off on an agent."

Aida Khoury · CTO, Pathfinder
★★★★★

"Production AI, not demo AI. Big difference — one survives Monday morning, the other doesn't."

03 · AI / LLM · Q3 2026 · 2 slots

Tell us about
your AI.

Whether it's a single prompt audit or a multi-quarter agent build, we reply within 4 hours — usually with a fixed quote, an eval-suite sketch, and a launch-gate target date.

Response time: ~4h on weekdays
Min. engagement: 2-week sprint
Slots, Q3 2026: 2 of 4 · AI
Studio location: Remote · 4 timezones

AI brief · ~90 sec
BUILD TIER · $15K–$40K
Budget scale: <$5K · $5K · $15K · $40K · $80K+
Encrypted · We never share your brief