Agent Skills are the primary way to extend what coding agents know. But the ecosystem is growing faster than anyone's ability to evaluate it. If you're installing skills, how do you know which ones are worth the context window cost? If you're building skills, what actually makes one useful versus noise?
This report explores those questions across 673 published skills. The structural data is comprehensive. The behavioral data is exploratory (n = 19). The uncomfortable finding: the things that are easy to measure don't predict the things that matter. We can't make a deterministic checklist, but we can help you focus on key considerations.
Overview
Key Findings
- 22% fail validation — company-authored skills underperform community collections on structural compliance
- 52% of all tokens go to nonstandard files (LICENSE, build artifacts, schemas) that waste context window space
- Novelty is the key quality differentiator — most skills restate what the LLM already knows; craft dimensions cluster tightly but novelty varies independently
- 66 skills have hidden contamination invisible to SKILL.md-only analysis, carried in reference files
- No correlation between structural risk and actual degradation (r = 0.077, n = 19) — content-specific mechanisms like template propagation and API hallucination drive observed effects
Validation Results by Source
Validation checks whether each skill complies with the Agent Skills specification — required files present, correct structure, valid metadata, and properly formatted content. Skills that fail validation may not load correctly or may confuse the agent about how to use them.
Token Budget Composition
How token budgets break down across SKILL.md, references, assets, and nonstandard files. Nonstandard files (LICENSE, build artifacts, schemas) waste context window space.
If you're evaluating skills
Validation is a necessary first filter. The 22% of skills that don't match the spec may not load correctly, so checking compliance saves you from obvious failures. But passing validation tells you nothing about whether a skill improves agent output. Think of it like type checking: it catches a class of errors, but doesn't tell you whether the program does what you want.
Content Quality
Two automated metrics assess how well a skill's instruction file communicates with the agent. Information density measures the ratio of actionable content (code blocks, specific instructions, structured data) to total text — higher values mean less filler. Instruction specificity measures how concrete and directive the language is versus vague advisory text like "consider" or "be careful."
If you're building skills
Every token your skill consumes is a token unavailable for the user's actual task—their codebase context, conversation history, tool definitions. The 52% waste figure is mostly a packaging problem, not a content problem, which means it's straightforward to fix. Remove LICENSE files, build artifacts, etc. from your skills. Beyond that, writing for high information density and specificity means writing the way you'd write a good API reference: concrete, directive, minimal preamble.
Cross-Contamination
Skills that reference multi-interface tools or mix code examples across languages carry structural complexity that may cross-contaminate agent behavior. These scores measure structural language complexity. An exploratory behavioral evaluation (n = 19) found no correlation between these structural scores and measured degradation (r = 0.077), suggesting content-specific factors may matter more than language mixing.
High-Contamination Skills
Why We Measure This
Skills scoring ≥ 0.5 on our structural contamination metric. This dimension is motivated by research on Programming Language Confusion (Moumoula et al., 2025), which shows LLMs can generate code using patterns from the wrong language — particularly between syntactically similar pairs like C#/Java or JavaScript/TypeScript. Copy bias in in-context learning (Ali et al., 2024) and LLM susceptibility to irrelevant context (Shi et al., 2023) further support the concern that mixed-language skill content could interfere with code generation.
How the Score Works
The score combines three factors: multi-interface tools (e.g., a database with both shell and language SDKs), scope breadth (number of distinct technology categories referenced), and language mismatch. Mismatched categories are code language families that differ from the skill's primary language — each classified as application (Python, JavaScript, Java, .NET, mobile) or auxiliary (shell, config, query, markup). Mismatches are weighted by syntactic similarity, following the PLC finding that confusion occurs primarily between similar languages: application + application (e.g., Python + Java) carries weight 1.0, application + auxiliary (e.g., Python + shell) 0.25, and auxiliary + auxiliary just 0.1.
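The language-mismatch component can be sketched directly from the weights above. The category memberships and the averaging normalization are our assumptions; the report doesn't publish its exact formula:

```python
APPLICATION = {"python", "javascript", "java", "dotnet", "mobile"}
AUXILIARY = {"shell", "config", "query", "markup"}

def mismatch_weight(primary: str, other: str) -> float:
    """Pairwise weight following the similarity-based scheme described above."""
    if primary in APPLICATION and other in APPLICATION:
        return 1.0   # e.g. Python + Java: syntactically similar, highest confusion risk
    if primary in AUXILIARY and other in AUXILIARY:
        return 0.1   # e.g. shell + config: lowest risk
    return 0.25      # application + auxiliary, e.g. Python + shell

def language_mismatch_score(primary: str, languages: set[str]) -> float:
    """Average mismatch weight over every non-primary language family in the skill."""
    others = languages - {primary}
    if not others:
        return 0.0
    total = sum(mismatch_weight(primary, lang) for lang in others)
    return total / len(others)  # normalization choice is ours, not the report's
```

So a Python skill that also shows Java examples scores 1.0 on this component, while a Python skill that only adds shell commands scores 0.25.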
What We Actually Found
In theory, these mismatches could cause the model to bleed syntax or API patterns from one language into another. In practice, our behavioral evaluation found that structural scores did not predict actual degradation — content-specific mechanisms like template propagation and API hallucination mattered more. Cross-language code bleed (the PLC mechanism) did appear in some cases, but accounted for a minority of the total degradation observed. These scores are best understood as a measure of structural complexity, not a direct predictor of harm.
Hidden Contamination
The contamination scoring described above applies to SKILL.md content. But skills can also include reference files that are loaded into context alongside the instruction file. These 66 skills score as low-risk on their SKILL.md alone, yet carry medium or high contamination in their reference files — language mixing invisible to instruction-file-only analysis.
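Detecting this case only requires scoring reference files with the same metric and comparing against the SKILL.md score. The cutoff values below are assumptions; the report doesn't publish its exact risk thresholds:

```python
def hidden_contamination(skill_md_score: float, reference_scores: list[float],
                         low_cutoff: float = 0.2, medium_cutoff: float = 0.35) -> bool:
    """True when SKILL.md scores low-risk but some reference file scores medium or higher.

    The cutoffs are illustrative assumptions, not the report's published thresholds.
    """
    worst_ref = max(reference_scores, default=0.0)
    return skill_md_score < low_cutoff and worst_ref >= medium_cutoff
```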
If you're building for multi-language tools
If your tool has a CLI, multiple language SDKs, and a query language (think databases, cloud providers, observability platforms), these structural scores flag complexity worth being intentional about. Our behavioral data didn't find that structural complexity alone predicted problems, but the content interference mechanisms we did observe (API hallucination, cross-language bleed) tend to appear in exactly these kinds of skills. The structural score won't tell you if your skill will cause harm, but it can tell you where to focus your testing.
Behavioral Insights
An exploratory behavioral evaluation of 19 skills tested whether structural contamination scores predict actual code generation degradation. Each skill was evaluated across 5 task types under 3 conditions (baseline, skill-loaded, skill + realistic context), with 3 runs each at temperature 0.3.
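The evaluation grid itself is easy to make concrete. A sketch of the enumeration, with placeholder task-type names (the report doesn't list its five task types):

```python
from itertools import product

TASK_TYPES = ["bugfix", "feature", "refactor", "config", "review"]  # placeholder names
CONDITIONS = ["baseline", "skill", "skill_plus_context"]
RUNS = 3

def build_eval_matrix(skills: list[str]) -> list[dict]:
    """Enumerate every (skill, task, condition, run) cell of the evaluation grid."""
    return [
        {"skill": s, "task": t, "condition": c, "run": r, "temperature": 0.3}
        for s, t, c, r in product(skills, TASK_TYPES, CONDITIONS, range(RUNS))
    ]
```

For 19 skills this yields 19 × 5 × 3 × 3 = 855 generations, which is why the behavioral arm stays exploratory rather than ecosystem-wide.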
The Disconnect
react-native-best-practices (contamination 0.07, B-A = −0.384) produces the largest degradation despite near-zero structural risk. sharp-edges (contamination 0.62, B-A = −0.083) has high structural risk but minimal behavioral impact. Structural scores alone don't predict what hurts.
Content Interference Mechanisms
Six distinct mechanisms drive degradation — only one (cross-language code bleed) is captured by structural contamination scoring:
Template Propagation
Skill output templates reproduced verbatim in unrelated contexts. Invalid // comments in JSON templates bleed into all output.
claude-settings-audit (B-A = −0.483)
Textual Frame Leakage
Non-code skill content reshapes how the model frames responses, adding verbose commentary at the expense of code completeness.
monitoring-observability (B-A = −0.233)
Token Budget Competition
With the skill loaded, outputs allocate more tokens to explanatory text and fewer to code, producing incomplete implementations under output limits.
react-native-best-practices (B-A = −0.384)
API Hallucination
The model invents plausible but nonexistent API methods that follow naming conventions seen in skill content. The code is in the correct language — the API surface is wrong.
upgrade-stripe (B-A = −0.117)
Cross-Language Code Bleed
The classic programming language confusion: shell syntax in JavaScript, mongosh operators in JSON. The only mechanism structural scoring detects.
MongoDB skills
Architectural Pattern Bleed
Skill-specific architectural conventions (error handling, config patterns) propagate to unrelated code, even when the language is correct.
provider-resources (B-A = −0.317)
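Of these mechanisms, template propagation via invalid JSON comments is the cheapest to detect statically. A heuristic sketch: if output that should be JSON fails to parse and contains `//` lines, a template has likely been reproduced verbatim:

```python
import json

def flags_template_propagation(output: str) -> bool:
    """Heuristic check for '//' comment lines in output that should be JSON.

    JSON has no comment syntax, so '//' lines in otherwise-JSON output are a
    telltale sign of a skill template reproduced verbatim.
    """
    try:
        json.loads(output)
        return False  # valid JSON, nothing propagated
    except json.JSONDecodeError:
        return any(line.lstrip().startswith("//") for line in output.splitlines())
```

The other five mechanisms resist this kind of static check, which is why the report falls back to behavioral comparison.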
Context Mitigation
When skills are loaded alongside realistic agentic context (system prompt, tools, conversation history), mean degradation drops from −0.080 to −0.023 — a ~62–75% attenuation. Real-world impact may be substantially smaller than isolated evaluation suggests.
If you're debugging unexpected agent behavior
The six interference mechanisms above are hypotheses drawn from a small evaluation, not proven failure modes. But they describe recognizable patterns. If you've added a skill and noticed your agent suddenly producing verbose commentary instead of code, or inventing API methods that don't exist, or using shell syntax in your JavaScript—these categories give you a vocabulary for diagnosing what's happening. The context mitigation finding also matters: in realistic conditions, the effects attenuate substantially, so isolated testing may overstate real-world impact.
LLM-as-Judge Quality
All 673 skills scored by Claude Sonnet across 6 dimensions (1-5 scale): clarity, actionability, token efficiency, scope discipline, directive precision, and novelty.
Craft vs. Novelty
Skills cluster tightly on craft dimensions (clarity, actionability, efficiency, scope, precision) but spread independently on novelty — a two-factor quality structure.
Low Value-Add Risk
Skills that score low on novelty (score ≤ 2) and medium-to-high on structural contamination (score ≥ 0.2). The idea: if a skill doesn't teach the model anything new but does add mixed-language complexity to the context window, the theoretical cost-benefit is unfavorable. These skills are candidates for removal or consolidation.
That said, in our behavioral evaluation of 6 low value-add skills, the actual measured degradation was modest (mean B−A = −0.072) — suggesting that while these skills aren't helping, they may not be actively hurting as much as the structural scores imply. The strongest degradation we observed came from high-novelty skills with content-specific interference mechanisms, not from low-novelty ones. Skills in this quadrant are tagged evaluate in the skills table—if you're considering using one, it's worth testing its impact on your agent's output for your specific tasks before committing to it.
If you're deciding what to put in a skill
The craft dimensions (clarity, actionability, efficiency) are table stakes—most skills score similarly on them, and they're the kind of thing you can get right with careful editing. Novelty is harder and appears to matter more. A skill that restates what the model already knows is, at best, an expensive no-op. If you're considering whether a skill is worth creating and maintaining, the first question to ask is: does this teach the agent something it genuinely doesn't know? If the answer is no, the maintenance cost probably isn't justified, even if the skill is well-written.
Open Questions for Practitioners
This analysis raises questions we think are worth sitting with, even where the data doesn't yet support definitive answers.
How should you evaluate a skill before installing it?
Structural validation catches broken skills. LLM-as-judge scoring can flag low-novelty content. But neither predicted behavioral degradation in our evaluation. The honest answer may be that there isn't a reliable shortcut. Skills that look good on paper can still interfere with agent output in ways you'll only catch by testing on your own tasks.
What predicts whether a skill actually helps?
Our data points toward novelty as the strongest signal (r = +0.327 for degradation magnitude), but this is from 19 skills. This is a direction to investigate, not a conclusion to rely on. If novelty does turn out to be the key variable, it would reframe skill creation: the goal isn't to write well, it's to teach something new.
Can you generate skills from existing published content?
If most skills restate what the model already knows, and novelty is what differentiates useful ones, then auto-generating skills from documentation, articles, or other published content may produce exactly the kind of low-novelty content the ecosystem already has too much of. The skills that scored highest on novelty in our analysis tend to encode operational knowledge: the kind of hard-won expertise that doesn't live in docs.
When is a skill not worth maintaining?
A skill has ongoing costs: it consumes context tokens on every invocation, it needs updating as APIs and tools change, and it can interfere with the agent in ways that are hard to detect. If the skill doesn't teach the model something new—and especially if it adds structural complexity from mixed languages or multi-interface tools—the cost-benefit may not justify keeping it active.
How can you test skills?
If you're planning to distribute skills—especially as official company-provided resources—structural validation and manual review aren't enough. Our behavioral evaluation found interference patterns that no static analysis would catch: templates bleeding into unrelated output, plausible-but-wrong API methods, architectural conventions propagating where they don't belong. The only way to know if a skill helps or hurts is to test it against representative tasks with and without the skill loaded, and compare the outputs. That's expensive, but so is distributing a skill that degrades your users' experience in ways neither you nor they will easily trace back to the skill.
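The harness for that comparison is small; the expensive part is running it. A sketch of the B−A loop, where `run_task` is a stub you'd implement against your agent and scoring rubric:

```python
from statistics import mean
from typing import Callable

def ab_test_skill(
    run_task: Callable[[str, bool], float],  # (task, skill_loaded) -> quality score
    tasks: list[str],
    runs: int = 3,
) -> float:
    """Mean degradation (skill minus baseline) across tasks.

    Negative values mean the skill hurts; run_task is assumed to wrap your
    agent plus whatever scoring rubric you use (tests passing, judge score, ...).
    """
    deltas = []
    for task in tasks:
        for _ in range(runs):
            baseline = run_task(task, False)
            with_skill = run_task(task, True)
            deltas.append(with_skill - baseline)
    return mean(deltas)
```

Multiple runs per task matter even at low temperature; single-run deltas are noisy enough to flip sign.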
How can you decide whether to install a skill?
There's no reliable shortcut yet. Validation tells you if a skill is structurally sound. Novelty scoring can flag whether it's likely to teach the model something new. But the gap between "looks good" and "actually helps" is real—our best structural metrics didn't predict behavioral outcomes. If you're adopting a skill for your team or organization, treat it like any other dependency: try it on your actual workloads, watch for the interference patterns described above, and be willing to remove it if the results don't justify the context window cost. The ecosystem will get better tooling for this over time, but right now, informed skepticism is your best filter.