Why campaign-level KPIs do not grade AI work
A traditional creator program reports at the campaign level. Engagement rate, sales lift, earned media value, view-through reach. Those numbers grade the whole program at once. They are useful for the board deck and useless for telling an AI system which of its proposals to repeat.
The unit that needs grading is not the campaign. It is the recommendation. A specific outreach angle. A specific brief theme. A specific creator segment. A specific paid cutdown variant. A specific narrative for the report. Each one was a discrete AI bet. Each one deserves its own row in a learning ledger.
- Without a per-recommendation metric — post-mortems devolve into vibes. The strongest opinion in the room decides which AI behavior to encourage next time.
- Without a declared direction — a metric that moves up gets celebrated as a win even when the recommendation expected it to move down. The model gets trained on noise.
- Without a declared magnitude — a fractional lift gets coded as success on aggressive bets and a real lift gets coded as failure on conservative bets. The bar is invisible.
- Without a declared confidence — high-conviction proposals and exploratory shots are graded on the same scale. Either the program over-corrects on long shots or it kills its highest-conviction plays after one bad week.
The eight fields of the contract
A complete expected-metric contract is short. It fits inside the evidence object the recommendation already carries. The fields are not optional. A proposal missing any one of them is rejected at the propose gate.
- target_metric — The single primary number the recommendation expects to move. Examples: paid CPM on a creator cutdown, organic save-rate on a hook variant, reply-rate on an outreach opener, retail sell-through on a seeded SKU.
- direction — Up or down, declared in the same vocabulary the metric uses. CPM down is a win; reply-rate up is a win. Direction must be explicit so an absolute movement cannot be reframed after the fact.
- magnitude — A target band: minimum lift to call the bet a win, expected lift to call it a hit, ceiling to flag as suspicious for re-validation. Bands beat point estimates because creator campaigns are noisy.
- confidence — A scalar (0–1) declared at propose time. New patterns start low. Patterns with three or more successful realizations rise. Patterns with consistent underperformance fall. Confidence is what AI uses to decide what to surface next.
- time_window — How long the program will wait before grading. Outreach replies grade in days; sell-through grades in weeks; brand-lift grades in months. The window prevents the team from grading too early or too late.
- attribution_method — Last-touch, geo holdout, on-platform conversion, incrementality test, panel survey. Named up front so the post-mortem is not a methodology debate.
- owner — The named human accountable for both the bet and its grading. AI proposes; a person owns the line in the ledger.
- evidence_ref — Pointer to the evidence objects that justified the bet. A claim, a prior campaign learning, a creator post performance, a PDP signal. Without this pointer the recommendation cannot be defended six weeks later.
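The eight fields above map naturally onto a small typed record. The following is a minimal sketch, not a reference schema: the class names, the `time_window_days` unit, the tuple type for `evidence_ref`, and the sample values are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MagnitudeBand:
    """Target band, expressed as unsigned relative movement (assumed unit)."""
    minimum: float   # smallest movement that still counts as a win
    expected: float  # movement that counts as a hit
    ceiling: float   # anything above this is flagged for re-validation

    def __post_init__(self) -> None:
        if not 0 <= self.minimum <= self.expected <= self.ceiling:
            raise ValueError("band must satisfy 0 <= minimum <= expected <= ceiling")


@dataclass(frozen=True)
class ExpectedMetricContract:
    """One row in the learning ledger; every field is required."""
    target_metric: str
    direction: str                 # "up" or "down"
    magnitude: MagnitudeBand
    confidence: float              # scalar in [0, 1], declared at propose time
    time_window_days: int          # assumed unit: days until grading
    attribution_method: str
    owner: str
    evidence_ref: Tuple[str, ...]  # pointers to evidence objects

    def __post_init__(self) -> None:
        for name in ("target_metric", "attribution_method", "owner"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} must not be empty")
        if self.direction not in ("up", "down"):
            raise ValueError("direction must be 'up' or 'down'")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        if self.time_window_days <= 0:
            raise ValueError("time_window_days must be positive")
        if not self.evidence_ref:
            raise ValueError("at least one evidence pointer is required")


# Hypothetical example: a paid cutdown expected to lower CPM.
contract = ExpectedMetricContract(
    target_metric="paid_cpm",
    direction="down",
    magnitude=MagnitudeBand(minimum=0.05, expected=0.12, ceiling=0.30),
    confidence=0.35,
    time_window_days=14,
    attribution_method="on_platform_conversion",
    owner="jane.doe",
    evidence_ref=("ev-204",),
)
```

Making the record frozen is a design choice in this sketch: once the bet is declared, direction and attribution method cannot be edited after results arrive.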
Where the contract is written, and when
The contract is created at the propose gate, not at the draft gate. That ordering matters. Drafting anchors the team on prose. Once a caption is written, the conversation drifts to its tone, its wording, its hashtags. The chance to argue about the underlying bet has already passed.
Writing the contract before any artifact exists forces the right argument first: is this the metric we want to move on this audience, with this magnitude, given the evidence we have? Once that fight is settled, drafting is mechanical. The draft only has to express the approved bet faithfully.
In practice the contract is the propose-gate payload. AI emits a recommendation; the recommendation cannot be reviewed unless every field is populated. The reviewer signs the contract or rejects it. Drafting only starts after the signature.
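As a rough illustration of that ordering, the gate can be modeled as two checks: completeness first, signature second. This is a sketch under assumptions; the payload shape, the state names, and the function names are hypothetical rather than an actual platform API.

```python
REQUIRED_FIELDS = (
    "target_metric", "direction", "magnitude", "confidence",
    "time_window", "attribution_method", "owner", "evidence_ref",
)


def _is_empty(value) -> bool:
    """Treat None and zero-length strings or containers as unpopulated."""
    if value is None:
        return True
    return isinstance(value, (str, list, tuple, dict)) and len(value) == 0


def missing_fields(payload: dict) -> list:
    """Return the required contract fields the payload leaves unpopulated."""
    return [name for name in REQUIRED_FIELDS if _is_empty(payload.get(name))]


def propose_gate(payload: dict) -> dict:
    """Reject incomplete proposals; complete ones wait for a reviewer."""
    missing = missing_fields(payload)
    if missing:
        return {"state": "rejected", "missing": missing}
    return {"state": "awaiting_signature", "missing": []}


def can_start_drafting(gate_result: dict, signed_by: str) -> bool:
    """Drafting is unlocked only by a complete contract plus a signature."""
    return gate_result["state"] == "awaiting_signature" and bool(signed_by)


# Hypothetical proposal that forgot to declare an attribution method.
proposal = {
    "target_metric": "reply_rate",
    "direction": "up",
    "magnitude": {"minimum": 0.02, "expected": 0.05, "ceiling": 0.15},
    "confidence": 0.2,
    "time_window": "7d",
    "owner": "jane.doe",
    "evidence_ref": ["ev-118"],
}
result = propose_gate(proposal)
```

In this example `result` reports the proposal as rejected with `attribution_method` listed as missing, so no reviewer time is spent on it and no draft can start.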
Grading after execution
After the campaign ships, the realized movement attaches to the same record. Three outcomes are possible, and the system treats them differently.
- Realized inside the expected band — The bet was correct. Confidence on this pattern is promoted. Similar recommendations surface earlier in future campaigns, with a lighter review burden.
- Realized outside the expected band — Either the recommendation was wrong or the magnitude was miscalibrated. The pattern is logged with the realized number, the evidence used, and the owner. Confidence drops; recurrence frequency falls.
- Inconclusive within the window — The attribution method declared up front could not separate signal from noise. The contract is retired; the program learns to avoid betting on this metric with that method again.
Inconclusive is the result that most programs refuse to record. It is the most valuable one. It teaches the AI which bets it cannot grade with the current attribution stack, and where to invest in better measurement before proposing again.
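The three outcomes can be sketched as a single grading function. The assumptions here are loud ones: realized movement is a signed relative change (for example `-0.12` for a 12% drop), `None` stands for an attribution method that could not produce a separable estimate, and the extra `PENDING` state for contracts whose window is still open is an addition of this sketch, not one of the three outcomes in the text.

```python
from datetime import date, timedelta
from enum import Enum
from typing import Optional


class Grade(Enum):
    PENDING = "pending"            # window still open (sketch-only state)
    IN_BAND = "in_band"
    OUT_OF_BAND = "out_of_band"
    INCONCLUSIVE = "inconclusive"


def grade_contract(
    direction: str,
    minimum: float,
    ceiling: float,
    executed_on: date,
    window_days: int,
    today: date,
    realized_change: Optional[float],
) -> Grade:
    """Grade one contract against its declared direction and band."""
    if direction not in ("up", "down"):
        raise ValueError("direction must be 'up' or 'down'")
    # Refuse to grade before the declared time window has elapsed.
    if today < executed_on + timedelta(days=window_days):
        return Grade.PENDING
    # No separable estimate from the declared attribution method.
    if realized_change is None:
        return Grade.INCONCLUSIVE
    # Flip the sign so that movement in the declared direction is positive.
    aligned = realized_change if direction == "up" else -realized_change
    if minimum <= aligned <= ceiling:
        return Grade.IN_BAND
    return Grade.OUT_OF_BAND


# Hypothetical CPM-down bet with a 14-day window and a 5% to 30% band.
shipped = date(2024, 3, 1)
outcome = grade_contract("down", 0.05, 0.30, shipped, 14, date(2024, 3, 20), -0.12)
```

Because direction is applied before the band check, a metric that rose when the contract expected it to fall lands outside the band no matter how large the movement is. A movement above the ceiling also lands outside the band, which matches the idea of flagging it for re-validation rather than celebrating it.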
Confidence as the feedback signal
Confidence is the field that turns the contract into a loop. It is set at propose, updated at grade, and read by the model on the next campaign. That last step is what differentiates a learning program from a content factory.
The same pattern — a creator archetype, an outreach opener, a brief structure, an ad cutdown shape — accumulates a confidence track record across campaigns. High-confidence patterns graduate into creator campaign memory as reusable plays. Low-confidence patterns are quietly demoted out of the recommendation surface.
Pair this with creator campaign prediction so the next proposal cycle knows what the prior evidence actually predicts on this audience and product class, and with the creator matching score so confidence on the creator pick is grounded in more than the most recent post.
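One possible shape for the promote and demote loop is sketched below. The step size, the three-win threshold for promotion, and the two-consecutive-miss rule for demotion are illustrative assumptions, not a prescribed update rule; the grade labels are hypothetical strings.

```python
def updated_confidence(current: float, grades: list, step: float = 0.1) -> float:
    """Return a pattern's confidence after the latest grade in `grades`.

    `grades` is the pattern's history, oldest first, using the labels
    "in_band", "out_of_band", and "inconclusive".
    """
    # Inconclusive results say nothing about the bet, so confidence holds.
    if not grades or grades[-1] == "inconclusive":
        return current
    conclusive = [g for g in grades if g in ("in_band", "out_of_band")]
    wins = conclusive.count("in_band")
    if grades[-1] == "in_band" and wins >= 3:
        # Promote only once the pattern has three or more realizations.
        current += step
    elif conclusive[-2:] == ["out_of_band", "out_of_band"]:
        # Demote on consistent underperformance, not on one bad week.
        current -= step
    # Clamp to [0, 1] and round away floating-point residue.
    return round(min(1.0, max(0.0, current)), 4)


# Hypothetical outreach-opener pattern that started low and has now won three times.
new_confidence = updated_confidence(0.2, ["in_band", "in_band", "in_band"])
```

Requiring two consecutive misses before demoting is one way to avoid killing a high-conviction play after a single bad result; a real program would tune these thresholds against its own noise levels.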
What the contract is not
- Not a forecast — The magnitude is a band, not a point estimate. A program that punishes any miss against a single number trains AI to propose only safe bets.
- Not a budget control — Budget pacing belongs in the paid system. The contract grades whether the bet was right, not whether the spend was efficient. Mixing the two muddies both.
- Not a creator scorecard — Grading the recommendation is not grading the creator. Creator-level scoring lives on the creator’s record across campaigns, not on a single bet attached to one collaboration.
- Not a substitute for evidence — A contract without an evidence pointer is a guess in formal clothing. The contract sits on top of evidence; it does not replace it.
How the contract pairs with the broader system
The contract is one set of fields on a larger object. The Campaign Evidence Object schema defines the surrounding record: source class, rights scope, approval state, expiry, citation pointer. The contract is the metric-and-intent slice that grades the bet after execution.
The approval gates pipeline determines when the contract is written (propose), drafted against (draft), reviewed (approve), and fired (execute). The campaign evidence control plane is the surface where owners see live contracts across campaigns and step in when realized movement crosses a threshold.
The AI creator marketing source of truth ties it back: every artifact, every report row, every paid variant points to a recommendation, which points to a contract, which points to evidence. Six weeks later, defending any single number is a query, not an archaeology project.
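The pointer chain in that paragraph (artifact to recommendation to contract to evidence) can be illustrated with plain lookups. Everything below is a hypothetical in-memory stand-in: the identifiers, the dictionary names, and the `trace` function are assumptions for illustration, not a real storage layout.

```python
from typing import Dict, List, Union

# Hypothetical link tables; a real system would back these with a datastore.
ARTIFACT_TO_RECOMMENDATION: Dict[str, str] = {"caption-0042": "rec-017"}
RECOMMENDATION_TO_CONTRACT: Dict[str, str] = {"rec-017": "ctr-017"}
CONTRACT_TO_EVIDENCE: Dict[str, List[str]] = {"ctr-017": ["ev-204", "ev-231"]}


def trace(artifact_id: str) -> Dict[str, Union[str, List[str]]]:
    """Follow an artifact back to the evidence that justified it.

    Raises KeyError if any link in the chain is missing, which is itself
    a useful signal that an artifact shipped without a contract.
    """
    recommendation_id = ARTIFACT_TO_RECOMMENDATION[artifact_id]
    contract_id = RECOMMENDATION_TO_CONTRACT[recommendation_id]
    return {
        "artifact": artifact_id,
        "recommendation": recommendation_id,
        "contract": contract_id,
        "evidence": list(CONTRACT_TO_EVIDENCE[contract_id]),
    }


lineage = trace("caption-0042")
```

Here `lineage` holds the full chain for one caption, which is the "query, not an archaeology project" property in miniature.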
Where Storika fits
Storika runs the contract end to end. The platform forces every AI recommendation to declare target metric, direction, magnitude, confidence, time window, attribution method, owner, and evidence pointer at the propose gate. Drafting cannot start without it. After execution, realized movement attaches automatically, confidence updates, and the next campaign’s recommendation surface reflects the learning.
The result is an AI creator program that accumulates conviction over time rather than repeating last quarter’s noise at higher volume. The team works at AI speed and the portfolio of bets gets sharper every cycle.
Mistakes to avoid
Mistake 1: Letting magnitude be implicit
“Improve engagement” is not a contract. Without a target band, every directional move is negotiated into a win. Force the bands at propose time, even if they are wide. Wide bands are honest; missing bands are not.
Mistake 2: Picking attribution at grading time
If the attribution method is chosen after results come in, the result was chosen first. Lock the method into the contract at propose time so the post-mortem is not a methodology negotiation. Pair with influencer marketing ROI measurement.
Mistake 3: Treating confidence as a static label
Confidence that never moves is decoration. Realized movement must feed back into the same record so future proposals weight the pattern correctly. If confidence is set once and never updated, the AI is not learning.
Mistake 4: Grading before the window closes
Sell-through graded after a weekend, reply-rate graded after a holiday, brand-lift graded after two weeks — all are noise. The time window exists for a reason. Wait for it.
Mistake 5: Hiding inconclusive results
Inconclusive is data. A pattern that cannot be graded with the current attribution stack is a signal to invest in measurement before betting again. Programs that quietly delete inconclusive rows keep proposing the same un-gradable bet.
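Keeping inconclusive rows makes it possible to tally them. The sketch below counts inconclusive grades per metric and attribution method; the ledger row shape, the grade labels, and the threshold of two are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def ungradable_pairs(
    ledger: List[Dict[str, str]], min_inconclusive: int = 2
) -> List[Tuple[str, str]]:
    """Return (target_metric, attribution_method) pairs that keep coming
    back inconclusive, sorted for stable output."""
    counts = Counter(
        (row["target_metric"], row["attribution_method"])
        for row in ledger
        if row["grade"] == "inconclusive"
    )
    return sorted(pair for pair, n in counts.items() if n >= min_inconclusive)


# Hypothetical ledger rows.
ledger = [
    {"target_metric": "brand_lift", "attribution_method": "last_touch", "grade": "inconclusive"},
    {"target_metric": "brand_lift", "attribution_method": "last_touch", "grade": "inconclusive"},
    {"target_metric": "paid_cpm", "attribution_method": "on_platform_conversion", "grade": "in_band"},
    {"target_metric": "sell_through", "attribution_method": "geo_holdout", "grade": "inconclusive"},
]
flagged = ungradable_pairs(ledger)
```

In this example only the brand-lift and last-touch pairing crosses the threshold, which marks it as a candidate for better measurement before the next proposal.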
FAQ
What is an expected-metric contract?
An expected-metric contract is a declaration attached to every AI-generated recommendation in a creator campaign stating which metric it is expected to move, in which direction, by what magnitude, and with what confidence. Without it, AI output cannot be graded and the program never improves.
Why isn't a campaign-level KPI enough?
Campaign-level KPIs grade the whole program. They cannot say which AI recommendation worked, which failed, or which to repeat. The contract lives at the recommendation level so each AI bet has its own scoreboard.
What fields does the contract require?
Target metric, direction, magnitude, confidence, time window, attribution method, owner, and a pointer to the evidence objects that justify the bet. Anything less and the recommendation cannot be graded fairly after the campaign ships.
How does confidence change over time?
Confidence is set at the propose gate and updated after execution. Recommendations whose realized movement consistently matches the declared bet have their confidence promoted. Recommendations that consistently underperform have their confidence downgraded so AI stops surfacing similar bets.
What happens to recommendations without a contract?
They are rejected at the propose gate. A recommendation without a declared metric is unmeasurable by construction; letting it through means the AI program ships content without ever closing the learning loop.
How does this relate to attribution?
The contract names the attribution method up front — last-touch, incrementality test, geo holdout, on-platform conversion — so the grading after the fact is not a debate. Disagreement about attribution happens at proposal time, not in the post-mortem.
The contract is the learning loop
An AI creator program without an expected-metric contract ships more content every quarter and knows less every quarter. The contract is the smallest unit that turns AI output into AI learning: a declared bet, a graded outcome, an updated confidence, a sharper next proposal.
Adjacent guides: AI creator campaign approval gates; Campaign Evidence Object overview; Campaign Evidence Object schema; campaign evidence control plane; AI creator marketing source of truth; accurate, approved, measurable; creator campaign memory; creator campaign prediction; creator matching score; influencer marketing ROI measurement; AI outreach artifact provenance; and AI creative QA workflow.