AI / LLM Engineering

Q·01 · Engineering · ★ Most asked

Which models do you default to — and how do you choose?

Claude Sonnet/Opus for chat and reasoning. GPT-4o when latency dominates. voyage-3 or text-embedding-3-large for retrieval. Llama 3.3 when self-hosted is required.

We benchmark on your task before locking it in — same prompts, same eval set, three candidate models, a written recommendation. The default isn't the choice; the eval is.
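A minimal sketch of that comparison, assuming each candidate model can be wrapped as a plain function from prompt to answer and that a case passes when the answer contains an expected substring. The names `EvalCase`, `pass_rate`, and `rank_candidates` are illustrative, and the stub models stand in for real API clients.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    expected: str  # substring the answer must contain (a simplifying assumption)


# A "model" here is any callable mapping a prompt to a text answer.
Model = Callable[[str], str]


def pass_rate(model: Model, cases: list[EvalCase]) -> float:
    """Fraction of cases whose answer contains the expected substring."""
    passed = sum(case.expected.lower() in model(case.prompt).lower() for case in cases)
    return passed / len(cases)


def rank_candidates(candidates: dict[str, Model], cases: list[EvalCase]) -> list[tuple[str, float]]:
    """Score every candidate on the identical cases and sort best first."""
    scores = {name: pass_rate(model, cases) for name, model in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical eval set and stub models, for illustration only.
CASES = [
    EvalCase("What is the capital of France?", "Paris"),
    EvalCase("2 + 2 = ?", "4"),
]
CANDIDATES = {
    "candidate_a": lambda prompt: "Paris" if "France" in prompt else "4",
    "candidate_b": lambda prompt: "Paris",
    "candidate_c": lambda prompt: "unsure",
}

ranking = rank_candidates(CANDIDATES, CASES)
```

Because every candidate sees the same prompts and the same cases, the ranking reflects the task rather than the default. A real suite would likely replace the substring check with task-specific graders.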

Q·02 · Process

Do you train custom models, or only prompt-engineer?

Both. About 70% of our builds get 90% of the way there with strong prompting and good retrieval. The other 30% need fine-tuning, usually because the task has tone, format, or domain constraints that prompts can't reliably enforce.

We tell you in week 1 which camp you're in. Fine-tuning gets quoted separately because it adds GPU spend and an eval cycle on top.

Q·03 · Operations

Who owns the models, data, and the API costs?

You do. API keys live in your accounts (Anthropic, OpenAI, etc.). Vector DB in your cloud. Training data, weights, and eval sets are your IP. We're vendors on your bill, never on the contract.

Q·04 · Engineering · Included

How do you handle evals and regression testing?

Eval suite is the first deliverable, not the last. We write 50–300 test cases with you in week 1, run them on every prompt change in CI, and gate launches on a pass-rate target you sign off on.

After launch the suite keeps running. A drift detector flags when production traces diverge from the eval distribution. We use Promptfoo or Braintrust by default, and a custom rig if you need one.
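As a rough sketch of the two checks described above, assuming eval results arrive as a list of pass/fail booleans and that each trace carries a coarse topic label. The function names and the use of total variation distance for drift are illustrative choices, not a description of Promptfoo or Braintrust.

```python
from collections import Counter


def gate(results: list[bool], target: float) -> bool:
    """True when the suite's pass rate meets the signed-off target."""
    if not results:
        raise ValueError("eval suite produced no results")
    return sum(results) / len(results) >= target


def total_variation(eval_labels: list[str], prod_labels: list[str]) -> float:
    """Total variation distance between two label distributions.

    0.0 means the distributions are identical, 1.0 means they share no labels.
    """
    p, q = Counter(eval_labels), Counter(prod_labels)
    n_p, n_q = len(eval_labels), len(prod_labels)
    # Counter returns 0 for labels that appear in only one of the two sets.
    return 0.5 * sum(abs(p[k] / n_p - q[k] / n_q) for k in p.keys() | q.keys())


# Hypothetical data: 9 of 10 cases pass; production has shifted toward "legal".
launch_ok = gate([True] * 9 + [False], target=0.9)
drift = total_variation(
    ["billing", "billing", "refund", "refund"],
    ["billing", "legal", "legal", "legal"],
)
```

In CI, the gate result would typically become the job's exit status, for example `sys.exit(0 if launch_ok else 1)`, and the drift score would be compared against a threshold agreed per project.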

Q·05 · Engineering

What about hallucinations and safety guardrails?

Three layers. Retrieval grounds outputs in your data with required citations. Output filters catch policy violations and out-of-scope responses. Bounded autonomy — agents need explicit permission for anything that touches money, customers, or production systems.

Every response is logged with retrieved sources and tool calls. You can audit any decision the system made in the last 90 days, line by line.
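The bounded-autonomy layer and the audit trail can be sketched together. This assumes each tool call declares which sensitive areas it touches and that approvals are passed in as a set. `RESTRICTED`, `ToolCall`, `AuditLog`, and `authorize` are hypothetical names, and the 90-day retention window is left to whatever store holds the log.

```python
import json
import time
from dataclasses import dataclass, field

# Hypothetical areas that always require explicit approval.
RESTRICTED = frozenset({"money", "customers", "production"})


@dataclass
class ToolCall:
    name: str
    touches: frozenset[str]  # sensitive areas this call would affect
    args: dict


@dataclass
class AuditLog:
    entries: list[str] = field(default_factory=list)

    def record(self, call: ToolCall, sources: list[str], decision: str) -> None:
        """Append one JSON line with the call, its retrieved sources, and the decision."""
        self.entries.append(json.dumps({
            "ts": time.time(),
            "tool": call.name,
            "args": call.args,
            "touches": sorted(call.touches),
            "sources": sources,
            "decision": decision,
        }, sort_keys=True))


def authorize(call: ToolCall, approved: frozenset[str], sources: list[str], log: AuditLog) -> bool:
    """Allow the call only if every restricted area it touches has been approved."""
    needed = call.touches & RESTRICTED
    allowed = needed <= approved
    log.record(call, sources, "allowed" if allowed else "blocked")
    return allowed


log = AuditLog()
refund = ToolCall("issue_refund", frozenset({"money", "customers"}), {"amount": 40})
search = ToolCall("search_docs", frozenset(), {"query": "refund policy"})

refund_allowed = authorize(refund, frozenset(), ["policy.md#refunds"], log)
search_allowed = authorize(search, frozenset(), [], log)
```

Logging blocked calls as well as allowed ones is a deliberate choice here: an audit is usually more useful when it shows what the agent tried to do, not only what it did.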

Q·06 · Operations

Can you handle PII / HIPAA workloads — PHI, regulated data?

Yes. We've shipped HIPAA-aligned LLM systems with PHI-stripping pre-processing, BAAs across the vendor stack (Anthropic, AWS Bedrock), and SOC 2 Type II audit prep.

Regulated work lives in the Program tier. The compliance paperwork alone is real engineering — we won't squeeze it into a Build budget.
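To make "PHI-stripping pre-processing" concrete, here is a deliberately small sketch that masks a few identifier formats before text reaches a model. The patterns and the `strip_phi` name are illustrative assumptions. Regexes alone miss names, dates, and free-text identifiers, so a production pipeline would need considerably more than this.

```python
import re

# Illustrative patterns only: US-style SSNs and phone numbers, plus email addresses.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def strip_phi(text: str) -> str:
    """Replace each matched identifier with a bracketed placeholder label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


masked = strip_phi("Reach Jane at jane.doe@example.com or 555-123-4567, SSN 123-45-6789.")
```

Note that the name "Jane" passes through untouched, which is exactly the kind of gap that makes the compliance work real engineering rather than a regex exercise.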

Q·07 · Engineering

Cloud, on-prem, or air-gapped?

All three. Cloud — fastest, cheapest. VPC / private endpoints via Bedrock or Azure OpenAI when data residency matters. Self-hosted Llama 3.3 / Qwen 2.5 on your GPUs when air-gap is non-negotiable.

We ship what your compliance allows. The architecture decision happens in week 0 with your security team in the room.
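The three options above reduce to a short decision rule. The sketch below assumes two yes/no constraints gathered from the security team. The `Deployment` enum and `choose_deployment` function are hypothetical, and a real week-0 decision would weigh more factors than these two flags.

```python
from enum import Enum


class Deployment(Enum):
    CLOUD = "public cloud API"
    PRIVATE = "VPC / private endpoint"
    SELF_HOSTED = "self-hosted open-weight model"


def choose_deployment(air_gapped: bool, data_residency: bool) -> Deployment:
    """Pick the least restrictive option that still satisfies the constraints."""
    if air_gapped:
        return Deployment.SELF_HOSTED  # nothing may leave the client's network
    if data_residency:
        return Deployment.PRIVATE  # traffic stays inside a private endpoint
    return Deployment.CLOUD  # no constraint, so take the fastest and cheapest path


choice = choose_deployment(air_gapped=False, data_residency=True)
```

The air-gap check comes first because it is the stricter constraint: a client that needs both air-gap and data residency still ends up self-hosted.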

Q·08 · Pricing

What if I just need help with one prompt, not a system?

Sprint tier: $5k, two weeks. We audit your existing prompt, write evals, propose a structured version, run the A/B, and ship the winner with a rollback plan. Most clients see a 20–40% quality lift on the first sprint.
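One way to read "quality lift" is as the relative change in mean eval score between the existing prompt and the structured one. The sketch below assumes per-case scores between 0 and 1. `relative_lift` and `pick_winner` are illustrative names, and the scores are made up for the example.

```python
from statistics import mean


def relative_lift(baseline: float, variant: float) -> float:
    """Relative improvement of the variant over the baseline (0.3 means +30%)."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return (variant - baseline) / baseline


def pick_winner(baseline_scores: list[float], variant_scores: list[float], min_lift: float = 0.0) -> str:
    """Ship the variant only if its mean score beats the baseline by more than min_lift."""
    lift = relative_lift(mean(baseline_scores), mean(variant_scores))
    return "variant" if lift > min_lift else "baseline"


# Hypothetical per-case scores for the existing prompt and the structured version.
lift = relative_lift(mean([0.5, 0.5]), mean([0.6, 0.7]))
winner = pick_winner([0.5, 0.5], [0.6, 0.7], min_lift=0.1)
```

Keeping the baseline prompt deployable is what makes the rollback plan cheap: if the variant regresses in production, switching back is a configuration change.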

Tell us about your AI.

Whether it's a single prompt audit or a multi-quarter agent build, we reply within about 4 hours on weekdays, usually with a fixed quote, an eval-suite sketch, and a launch-gate target date.

// After you send

A real human reads it: a founder or lead engineer, never a sales team. Reply within ~4h.

A 30-minute scoping call: voice or video. We sketch a 90-day plan live and surface the hard parts early.

A written, fixed proposal: scope, milestones, fixed price, team. No “phase 2 TBD,” no surprises.

Response time: ~4h on weekdays

Min. engagement: 2-week sprint

Slots (Q3 2026): 2 of 4 · AI

Studio location: Remote · 4 timezones
