The Eval Gate and the Retirement Loop: How…

87% of marketers now use generative AI in recurring workflows, up from 51% just two years ago. But the top barrier to adoption? Lack of in-house AI skills, cited by 43% of teams. The gap between those two numbers is where most Claude skill rollouts quietly die.

Emily Kramer's recent LinkedIn thread (115 comments and counting) surfaced a question most marketing ops pros are wrestling with right now: who actually has a good workflow for rolling out Claude skills and agents to a marketing team? The answers revealed a pattern worth unpacking.

The Real Problem Isn't Distribution

Devon Watts put it bluntly: "We have a shared skills library, but the way those skills get created, tested, and used is pretty inconsistent. That leads to low confidence and low adoption (ie people just building their own skills & agents instead of using what other people have built)." That's the failure mode. Not "we can't build skills." The failure is nobody trusts the skills that exist, so everyone builds their own, and the library rots.

Ishant Kumar flagged the governance gap: "The technical part seems easy compared to deciding who owns workflows, how they're reviewed, and when an agent should replace an existing process." Fair. The tooling is the easy part. The org design around the tooling is where teams stall.

So what separates the teams that compound from the ones stuck in perpetual pilot mode?

Two Gates That Actually Work

Slava Baranskyi, after nine months of shipping marketing skills to real workflows, landed on a pattern worth stealing: "Every skill is a versioned file in a repo, and no skill ships without an eval list — five or six expected outputs written down before anyone runs it. That one step kills the 'looks impressive, behaves like a horoscope' class of skills before the team wastes a month on them."

Read that again. The eval gate isn't a nice-to-have. It's the thing that prevents your skill library from becoming a graveyard of half-working prompts nobody trusts. Alex Lindahl reinforced this: "No one is really testing skills with evals, so no one knows the quality of them."

The second gate is retirement. Baranskyi runs a monthly usage review: which skills got invoked, which got ignored. "The ignored ones are not marketing problems, they are spec problems. We retire or rewrite them instead of re-announcing them." Most teams skip this. They launch skills, send a Slack announcement, and wonder why adoption plateaus at 30%.

Who Builds vs. Who Uses

Laura Beaulieu offered the most honest framing in the thread: "Not everyone updates agents or skills. A few select people do. Building a skill or an agent takes three things: mastery of your job, the brand, and the AI. That's the benchmark. Most people are still climbing on at least one."

Her model: a small group builds, everyone else uses. Big team? An AI Center of Excellence plus a few capable ICs. Small team? One or two champions, often a GTM engineer. Skills live in a shared GitHub repo. Nothing ships without QA review. A weekly Slack release lists every change.

Jonathan Martinez described an even leaner version: "load into GitHub > everyone downloads latest repo with skills > changes are made weekly by orchestrator > team updates to latest repo bi-weekly or weekly." Minimal process, but it works because someone owns the orchestration.

The pattern across all the serious answers: separate the builder role from the user role. Don't expect everyone to build. Do expect everyone to use. And put a named human between the two.

Run It This Week

Setup: Pick 3–5 repeatable, context-heavy workflows your team runs weekly (SEO audits, content repurposing, competitor briefs). Build each as a single-purpose skill, not one monolithic agent. Practitioner guidance consistently recommends small, composable skills over large ones.

Eval gate: Before any skill ships, write 5–6 expected outputs. Run the skill. Compare. If it "behaves like a horoscope," rewrite the skill, starting with its description. Check triggering too: Claude can under-trigger a skill unless its description is explicit, so make descriptions pushy, with clear trigger phrases tied to real user intent.
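Here is one way the eval gate could look in code. Treat it as a rough sketch, not Baranskyi's actual setup: the `EvalCase` shape, the `run_eval_gate` function, and the "required phrases" check are all assumptions made for illustration, and `horoscope_skill` is a stand-in for however your team actually invokes a skill. A real eval list would carry the five or six cases described above; two are shown to keep it short.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One entry in a skill's eval list, written before the skill is run."""
    prompt: str
    must_include: list[str]  # phrases a good output has to contain (assumed check)


def run_eval_gate(run_skill: Callable[[str], str],
                  cases: list[EvalCase],
                  min_pass_rate: float = 1.0) -> tuple[bool, list[str]]:
    """Return (ships, failures). A case passes only if every required phrase appears."""
    failures = []
    for case in cases:
        output = run_skill(case.prompt).lower()
        missing = [p for p in case.must_include if p.lower() not in output]
        if missing:
            failures.append(f"{case.prompt!r}: missing {missing}")
    # No eval list means no ship: an empty list scores 0.0.
    pass_rate = (len(cases) - len(failures)) / len(cases) if cases else 0.0
    return pass_rate >= min_pass_rate, failures


# Hypothetical stand-in for a real skill call. A horoscope-style skill
# returns the same pleasant, generic answer no matter what it is asked.
def horoscope_skill(prompt: str) -> str:
    return "Great opportunities lie ahead for your content strategy."


cases = [
    EvalCase("Audit /pricing for SEO issues", ["title tag", "meta description"]),
    EvalCase("Audit /blog for SEO issues", ["internal links", "title tag"]),
]
ships, failures = run_eval_gate(horoscope_skill, cases)
print(ships)     # False
print(failures)  # one line per failing case, listing the missing phrases
```

Phrase matching is deliberately crude. The point is that the expected outputs exist in writing before the run, so "looks impressive" has something concrete to be compared against.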

Launch: Assign one owner (your orchestrator). Skills go in a shared repo. Weekly Slack release with what changed. Q&A lives in that thread.

Readout (monthly): Which skills got used? Which got ignored? Ignored skills get rewritten or retired. No re-announcing.
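The monthly readout can be as simple as counting invocations per skill. The sketch below assumes you can export a flat list of skill names, one entry per invocation; that log format, the `monthly_review` function, and the `min_invocations` threshold are illustrative assumptions, not a specific tool's API.

```python
from collections import Counter


def monthly_review(all_skills: list[str],
                   invocation_log: list[str],
                   min_invocations: int = 1) -> dict[str, list[str]]:
    """Split the library into skills that were used and skills to rewrite or retire."""
    counts = Counter(invocation_log)  # missing skills count as 0
    used = sorted(s for s in all_skills if counts[s] >= min_invocations)
    ignored = sorted(s for s in all_skills if counts[s] < min_invocations)
    return {"used": used, "rewrite_or_retire": ignored}


# Hypothetical skill names and a hypothetical month of invocations.
skills = ["seo-audit", "content-repurpose", "competitor-brief"]
log = ["seo-audit", "seo-audit", "competitor-brief"]

review = monthly_review(skills, log)
print(review["used"])               # ['competitor-brief', 'seo-audit']
print(review["rewrite_or_retire"])  # ['content-repurpose']
```

Anything in the second list goes back to the builder as a spec problem, not into another Slack announcement.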

Hypothesis: If we gate every skill behind an eval list of 5–6 expected outputs and run monthly usage reviews, then adoption of shared skills will increase because trust in output quality goes up and redundant personal skills go down.

Success = 60%+ of team using shared skills weekly within 60 days. Guardrails = track skill invocation count and user-reported quality score (1–5). Stop-loss = if fewer than 3 skills pass eval after two build cycles, the problem is upstream (wrong workflows selected, not skill design).
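If you want the success and stop-loss checks to be unambiguous, they reduce to two comparisons. This is a minimal sketch using the thresholds stated above; the `readout` function, its parameter names, and the sample team are hypothetical.

```python
def readout(team_size: int,
            weekly_users: set[str],
            skills_passing_eval: int,
            build_cycles_done: int) -> dict[str, object]:
    """Apply the success threshold (60% weekly use) and the stop-loss rule."""
    adoption = len(weekly_users) / team_size if team_size else 0.0
    return {
        "adoption": adoption,
        "success": adoption >= 0.60,
        # Stop-loss: fewer than 3 skills pass eval after two build cycles.
        "stop_loss": build_cycles_done >= 2 and skills_passing_eval < 3,
    }


# Hypothetical team of 10, with 7 people using shared skills this week.
result = readout(
    team_size=10,
    weekly_users={"ana", "ben", "cy", "dee", "eli", "fay", "gus"},
    skills_passing_eval=4,
    build_cycles_done=2,
)
print(result["success"])    # True
print(result["stop_loss"])  # False
```

The quality score guardrail is left out here because it needs a survey or rating mechanism this sketch does not assume.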

The Trade-Off Nobody Mentions

This approach concentrates build authority in a small group. That's the point, and that's the risk. Senior practitioners save 8–10 hours per week with AI; junior team members save less. The builders become a bottleneck if you don't protect their time. And if a builder leaves without documentation, you're back to the "heroes holding institutional knowledge" problem that marketing ops exists to solve.

Baranskyi's versioned repo and eval lists aren't just quality controls. They're institutional memory. The skill file, the eval list, and the usage data together mean any new builder can pick up where the last one left off. That's the part most teams skip, and it's the part that makes the whole thing compound instead of collapse.