Agent Skills are the primary way to extend what coding agents know. But the ecosystem is growing faster than anyone's ability to evaluate it. If you're installing skills, how do you know which ones are worth the context window cost? If you're building skills, what actually makes one useful versus noise?
This report explores those questions across 673 published skills. The structural data is comprehensive. The behavioral data is exploratory (n = 19). The uncomfortable finding: the things that are easy to measure don't predict the things that matter. We can't make a deterministic checklist, but we can help you focus on key considerations.
Overview
Key Findings
- 22% fail validation — company-authored skills underperform community collections on structural compliance
- 52% of all tokens go to nonstandard files (LICENSE, build artifacts, schemas) that waste context window space
- Novelty is the key quality differentiator — most skills restate what the LLM already knows; craft dimensions cluster tightly but novelty varies independently
- 66 skills have hidden contamination invisible to SKILL.md-only analysis, carried in reference files
- No correlation between structural risk and actual degradation (r = 0.077, n = 19) — content-specific mechanisms like template propagation and API hallucination drive observed effects
Validation Results by Source
Validation checks whether each skill complies with the Agent Skills specification — required files present, correct structure, valid metadata, and properly formatted content. Skills that fail validation may not load correctly or may confuse the agent about how to use them.
Token Budget Composition
How token budgets break down across SKILL.md, references, assets, and nonstandard files. Nonstandard files (LICENSE, build artifacts, schemas) waste context window space.
If you're evaluating skills
Validation is a necessary first filter. The 22% of skills that don't match the spec may not load correctly, so checking compliance saves you from obvious failures. But passing validation tells you nothing about whether a skill improves agent output. Think of it like type checking: it catches a class of errors, but doesn't tell you whether the program does what you want.
Content Quality
Two automated metrics assess how well a skill's instruction file communicates with the agent. Information density measures the ratio of actionable content (code blocks, specific instructions, structured data) to total text — higher values mean less filler. Instruction specificity measures how concrete and directive the language is versus vague advisory text like "consider" or "be careful."
If you're building skills
Every token your skill consumes is a token unavailable for the user's actual task—their codebase context, conversation history, tool definitions. The 52% waste figure is mostly a packaging problem, not a content problem, which means it's straightforward to fix. Remove LICENSE files, build artifacts, etc. from your skills. Beyond that, writing for high information density and specificity means writing the way you'd write a good API reference: concrete, directive, minimal preamble.
Cross-Contamination
Skills that reference multi-interface tools or mix code examples across languages carry structural complexity that may cross-contaminate agent behavior. These scores measure structural language complexity. An exploratory behavioral evaluation (n = 19) found no correlation between these structural scores and measured degradation (r = 0.077), suggesting content-specific factors may matter more than language mixing.
High-Contamination Skills
Why We Measure This
Skills scoring ≥ 0.5 on our structural contamination metric. This dimension is motivated by research on Programming Language Confusion (Moumoula et al., 2025), which shows LLMs can generate code using patterns from the wrong language — particularly between syntactically similar pairs like C#/Java or JavaScript/TypeScript. Copy bias in in-context learning (Ali et al., 2024) and LLM susceptibility to irrelevant context (Shi et al., 2023) further support the concern that mixed-language skill content could interfere with code generation.
How the Score Works
The score combines three factors: multi-interface tools (e.g., a database with both shell and language SDKs), scope breadth (number of distinct technology categories referenced), and language mismatch. Mismatched categories are code language families that differ from the skill's primary language — each classified as application (Python, JavaScript, Java, .NET, mobile) or auxiliary (shell, config, query, markup). Mismatches are weighted by syntactic similarity, following the PLC finding that confusion occurs primarily between similar languages: application + application (e.g., Python + Java) carries weight 1.0, application + auxiliary (e.g., Python + shell) 0.25, and auxiliary + auxiliary just 0.1.
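The language-mismatch component can be sketched directly from the weights above. The category memberships and the averaging normalization are our assumptions; the report doesn't publish its exact formula:

```python
APPLICATION = {"python", "javascript", "java", "dotnet", "mobile"}
AUXILIARY = {"shell", "config", "query", "markup"}

def mismatch_weight(primary: str, other: str) -> float:
    """Pairwise weight following the similarity-based scheme described above."""
    if primary in APPLICATION and other in APPLICATION:
        return 1.0   # e.g. Python + Java: syntactically similar, highest confusion risk
    if primary in AUXILIARY and other in AUXILIARY:
        return 0.1   # e.g. shell + config: lowest risk
    return 0.25      # application + auxiliary, e.g. Python + shell

def language_mismatch_score(primary: str, languages: set[str]) -> float:
    """Average mismatch weight over every non-primary language family in the skill."""
    others = languages - {primary}
    if not others:
        return 0.0
    total = sum(mismatch_weight(primary, lang) for lang in others)
    return total / len(others)  # normalization choice is ours, not the report's
```

So a Python skill that also shows Java examples scores 1.0 on this component, while a Python skill that only adds shell commands scores 0.25.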
What We Actually Found
In theory, these mismatches could cause the model to bleed syntax or API patterns from one language into another. In practice, our behavioral evaluation found that structural scores did not predict actual degradation — content-specific mechanisms like template propagation and API hallucination mattered more. Cross-language code bleed (the PLC mechanism) did appear in some cases, but accounted for a minority of the total degradation observed. These scores are best understood as a measure of structural complexity, not a direct predictor of harm.
Hidden Contamination
The contamination scoring described above applies to SKILL.md content. But skills can also include reference files that are loaded into context alongside the instruction file. These 66 skills score as low-risk on their SKILL.md alone, yet carry medium or high contamination in their reference files — language mixing invisible to instruction-file-only analysis.
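Detecting this case only requires scoring reference files with the same metric and comparing against the SKILL.md score. The cutoff values below are assumptions; the report doesn't publish its exact risk thresholds:

```python
def hidden_contamination(skill_md_score: float, reference_scores: list[float],
                         low_cutoff: float = 0.2, medium_cutoff: float = 0.35) -> bool:
    """True when SKILL.md scores low-risk but some reference file scores medium or higher.

    The cutoffs are illustrative assumptions, not the report's published thresholds.
    """
    worst_ref = max(reference_scores, default=0.0)
    return skill_md_score < low_cutoff and worst_ref >= medium_cutoff
```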
If you're building for multi-language tools
If your tool has a CLI, multiple language SDKs, and a query language (think databases, cloud providers, observability platforms), these structural scores flag complexity worth being intentional about. Our behavioral data didn't find that structural complexity alone predicted problems, but the content interference mechanisms we did observe (API hallucination, cross-language bleed) tend to appear in exactly these kinds of skills. The structural score won't tell you if your skill will cause harm, but it can tell you where to focus your testing.
Behavioral Insights
An exploratory behavioral evaluation of 19 skills tested whether structural contamination scores predict actual code generation degradation. Each skill was evaluated across 5 task types under 3 conditions (baseline, skill-loaded, skill + realistic context), with 3 runs each at temperature 0.3.
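The evaluation grid itself is easy to make concrete. A sketch of the enumeration, with placeholder task-type names (the report doesn't list its five task types):

```python
from itertools import product

TASK_TYPES = ["bugfix", "feature", "refactor", "config", "review"]  # placeholder names
CONDITIONS = ["baseline", "skill", "skill_plus_context"]
RUNS = 3

def build_eval_matrix(skills: list[str]) -> list[dict]:
    """Enumerate every (skill, task, condition, run) cell of the evaluation grid."""
    return [
        {"skill": s, "task": t, "condition": c, "run": r, "temperature": 0.3}
        for s, t, c, r in product(skills, TASK_TYPES, CONDITIONS, range(RUNS))
    ]
```

For 19 skills this yields 19 × 5 × 3 × 3 = 855 generations, which is why the behavioral arm stays exploratory rather than ecosystem-wide.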
The Disconnect
react-native-best-practices (contamination 0.07, B-A = −0.384) produces the largest degradation despite near-zero structural risk. sharp-edges (contamination 0.62, B-A = −0.083) has high structural risk but minimal behavioral impact. Structural scores alone don't predict what hurts.
Content Interference Mechanisms
Six distinct mechanisms drive degradation — only one (cross-language code bleed) is captured by structural contamination scoring:
Template Propagation
Skill output templates reproduced verbatim in unrelated contexts. Invalid // comments in JSON templates bleed into all output.
claude-settings-audit (B-A = −0.483)
Textual Frame Leakage
Non-code skill content reshapes how the model frames responses, adding verbose commentary at the expense of code completeness.
monitoring-observability (B-A = −0.233)
Token Budget Competition
With the skill loaded, outputs allocate more tokens to explanatory text and fewer to code, producing incomplete implementations under output limits.
react-native-best-practices (B-A = −0.384)
API Hallucination
The model invents plausible but nonexistent API methods that follow naming conventions seen in skill content. The code is in the correct language — the API surface is wrong.
upgrade-stripe (B-A = −0.117)
Cross-Language Code Bleed
The classic programming language confusion: shell syntax in JavaScript, mongosh operators in JSON. The only mechanism structural scoring detects.
MongoDB skills
Architectural Pattern Bleed
Skill-specific architectural conventions (error handling, config patterns) propagate to unrelated code, even when the language is correct.
provider-resources (B-A = −0.317)
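Of these mechanisms, template propagation via invalid JSON comments is the cheapest to detect statically. A heuristic sketch: if output that should be JSON fails to parse and contains `//` lines, a template has likely been reproduced verbatim:

```python
import json

def flags_template_propagation(output: str) -> bool:
    """Heuristic check for '//' comment lines in output that should be JSON.

    JSON has no comment syntax, so '//' lines in otherwise-JSON output are a
    telltale sign of a skill template reproduced verbatim.
    """
    try:
        json.loads(output)
        return False  # valid JSON, nothing propagated
    except json.JSONDecodeError:
        return any(line.lstrip().startswith("//") for line in output.splitlines())
```

The other five mechanisms resist this kind of static check, which is why the report falls back to behavioral comparison.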
Context Mitigation
When skills are loaded alongside realistic agentic context (system prompt, tools, conversation history), mean degradation drops from −0.080 to −0.023 — a ~62–75% attenuation. Real-world impact may be substantially smaller than isolated evaluation suggests.
If you're debugging unexpected agent behavior
The six interference mechanisms above are hypotheses drawn from a small evaluation, not proven failure modes. But they describe recognizable patterns. If you've added a skill and noticed your agent suddenly producing verbose commentary instead of code, or inventing API methods that don't exist, or using shell syntax in your JavaScript—these categories give you a vocabulary for diagnosing what's happening. The context mitigation finding also matters: in realistic conditions, the effects attenuate substantially, so isolated testing may overstate real-world impact.
LLM-as-Judge Quality
All 673 skills scored by Claude Sonnet across 6 dimensions (1-5 scale): clarity, actionability, token efficiency, scope discipline, directive precision, and novelty.
Craft vs. Novelty
Skills cluster tightly on craft dimensions (clarity, actionability, efficiency, scope, precision) but spread independently on novelty — a two-factor quality structure.
Low Value-Add Risk
Skills that score low on novelty (score ≤ 2) and medium-to-high on structural contamination (score ≥ 0.2). The idea: if a skill doesn't teach the model anything new but does add mixed-language complexity to the context window, the theoretical cost-benefit is unfavorable. These skills are candidates for removal or consolidation.
That said, in our behavioral evaluation of 6 low value-add skills, the actual measured degradation was modest (mean B−A = −0.072) — suggesting that while these skills aren't helping, they may not be actively hurting as much as the structural scores imply. The strongest degradation we observed came from high-novelty skills with content-specific interference mechanisms, not from low-novelty ones. Skills in this quadrant are tagged evaluate in the skills table—if you're considering using one, it's worth testing its impact on your agent's output for your specific tasks before committing to it.
If you're deciding what to put in a skill
The craft dimensions (clarity, actionability, efficiency) are table stakes—most skills score similarly on them, and they're the kind of thing you can get right with careful editing. Novelty is harder and appears to matter more. A skill that restates what the model already knows is, at best, an expensive no-op. If you're considering whether a skill is worth creating and maintaining, the first question to ask is: does this teach the agent something it genuinely doesn't know? If the answer is no, the maintenance cost probably isn't justified, even if the skill is well-written.
Open Questions for Practitioners
This analysis raises questions we think are worth sitting with, even where the data doesn't yet support definitive answers.
How should you evaluate a skill before installing it?
Structural validation catches broken skills. LLM-as-judge scoring can flag low-novelty content. But neither predicted behavioral degradation in our evaluation. The honest answer may be that there isn't a reliable shortcut. Skills that look good on paper can still interfere with agent output in ways you'll only catch by testing on your own tasks.
What predicts whether a skill actually helps?
Our data points toward novelty as the strongest signal (r = +0.327 for degradation magnitude), but this is from 19 skills. This is a direction to investigate, not a conclusion to rely on. If novelty does turn out to be the key variable, it would reframe skill creation: the goal isn't to write well, it's to teach something new.
Can you generate skills from existing published content?
If most skills restate what the model already knows, and novelty is what differentiates useful ones, then auto-generating skills from documentation, articles, or other published content may produce exactly the kind of low-novelty content the ecosystem already has too much of. The skills that scored highest on novelty in our analysis tend to encode operational knowledge: the kind of hard-won expertise that doesn't live in docs.
When is a skill not worth maintaining?
A skill has ongoing costs: it consumes context tokens on every invocation, it needs updating as APIs and tools change, and it can interfere with the agent in ways that are hard to detect. If the skill doesn't teach the model something new—and especially if it adds structural complexity from mixed languages or multi-interface tools—the cost-benefit may not justify keeping it active.
How can you test skills?
If you're planning to distribute skills—especially as official company-provided resources—structural validation and manual review aren't enough. Our behavioral evaluation found interference patterns that no static analysis would catch: templates bleeding into unrelated output, plausible-but-wrong API methods, architectural conventions propagating where they don't belong. The only way to know if a skill helps or hurts is to test it against representative tasks with and without the skill loaded, and compare the outputs. That's expensive, but so is distributing a skill that degrades your users' experience in ways neither you nor they will easily trace back to the skill.
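The harness for that comparison is small; the expensive part is running it. A sketch of the B−A loop, where `run_task` is a stub you'd implement against your agent and scoring rubric:

```python
from statistics import mean
from typing import Callable

def ab_test_skill(
    run_task: Callable[[str, bool], float],  # (task, skill_loaded) -> quality score
    tasks: list[str],
    runs: int = 3,
) -> float:
    """Mean degradation (skill minus baseline) across tasks.

    Negative values mean the skill hurts; run_task is assumed to wrap your
    agent plus whatever scoring rubric you use (tests passing, judge score, ...).
    """
    deltas = []
    for task in tasks:
        for _ in range(runs):
            baseline = run_task(task, False)
            with_skill = run_task(task, True)
            deltas.append(with_skill - baseline)
    return mean(deltas)
```

Multiple runs per task matter even at low temperature; single-run deltas are noisy enough to flip sign.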
How can you decide whether to install a skill?
There's no reliable shortcut yet. Validation tells you if a skill is structurally sound. Novelty scoring can flag whether it's likely to teach the model something new. But the gap between "looks good" and "actually helps" is real—our best structural metrics didn't predict behavioral outcomes. If you're adopting a skill for your team or organization, treat it like any other dependency: try it on your actual workloads, watch for the interference patterns described above, and be willing to remove it if the results don't justify the context window cost. The ecosystem will get better tooling for this over time, but right now, informed skepticism is your best filter.