June 4, 2026 · Essay · 11 min

9 Things I've Learned Using Claude for Design Systems Work

Most AI-and-design-systems advice stays comfortably abstract. These nine concrete patterns are what actually changed how I work.


I've been pushing Claude hard against real design system work for a while now — auditing tokens, generating components, writing docs, hunting drift, sitting in critique with me as a second pair of eyes. A lot of "AI for design" advice stays comfortably abstract; this is the layer underneath. These are the nine things that have actually changed how I work, and that I now refuse to give up.

A note on framing: this is Claude-specific because Claude is what I use most. The patterns generalize across other agents, but the names — CLAUDE.md, skills, MCPs, sub-agents, plan mode — are real, concrete primitives you can wire up today, not metaphors for "AI workflows." That specificity is the whole point. The conversation about AI and design systems has been allergic to detail for a while now, and detail is what makes any of this useful.

1. CLAUDE.md is the most important design artifact you'll write this year

The file Claude loads at the start of every session is doing more work than any spec, wiki, or Notion page in your org. It's the first thing the agent sees, every time, and it sets the rails for everything that follows. Everything else — the codebase, the design tokens, the components — gets read selectively, lazily, sometimes not at all. CLAUDE.md is read every single time.

I used to treat it like a README. I now treat it like a contract. Brand voice and tone constraints. Token naming rules. Banned patterns and what to use instead. The canonical sources of truth and the order to consult them in. The list of things the agent must ask before doing. The list of things it must refuse to do without explicit human sign-off. Anything I find myself re-explaining in two consecutive sessions goes in this file.

The discipline that makes it work: keep it short, keep it current, link out for depth. A bloated CLAUDE.md is worse than no CLAUDE.md because the things you actually need get drowned in things you wrote once and forgot about. If a teammate has to re-explain something to Claude twice in a week, that's not a Claude bug. That's a CLAUDE.md bug. Treat it that way.

2. Skills are how you stop paying for the same context over and over

The first time I packaged a recurring workflow into a skill instead of pasting context every session, I felt stupid for not having done it sooner. A token-audit I used to set up by hand — explain the structure, paste the conventions, list the rules, name the output format — went from tens of thousands of tokens of prep to a single invocation in the hundreds. Same output. Different cost structure. Different posture: instead of building the runway every time, I was just taking off.

The mental model that finally clicked: prompts are for one-offs; skills are for anything you'll do more than twice. Write the workflow down once — the exact tools it uses, the exact constraints it respects, the exact shape of its output — and every future session starts from "ready to work" instead of "let me get oriented." A good skill is one you can hand to a colleague and they get the same result you did, which is also exactly the test of whether you understood your own workflow.

The caveat worth keeping in mind: skills can rot the same way documentation rots. If the system changes and the skill doesn't, you've just industrialized the wrong answer. Treat skill maintenance the way you'd treat a critical script in your CI — owned, dated, reviewed.

3. MCPs query, skills execute, rules constrain — pick the right channel

These three get conflated, and the distinction matters more than people realize. A skill is a capability you ship to the agent: load it once, run it cheap, repeat forever. An MCP server is a live connection the agent can query mid-task — "what's the current value of this token?" "Is there an existing component for this?" "Which icons match this keyword?" — answered against your real, current system instead of a snapshot in a file. Rules are the things in CLAUDE.md (or its siblings) that the agent reads every time and treats as always-on constraints.

A worked example. I want to audit a component for drift against the current design tokens. The rule that says "use tokens, never hex codes, no exceptions" lives in CLAUDE.md. The skill that says "here's how to run an audit and here's the output format I expect" lives as a skill. The actual current values of the tokens — which change underneath me — live behind an MCP server I query at the moment of the audit. Three different jobs, three different channels.

The failure mode I keep watching teams fall into is mixing them up. Tokens in CLAUDE.md go stale within a sprint. Brand rules behind an MCP server get politely queried and quietly ignored. Audit workflows pasted into every prompt cost a fortune. Once I separated those three jobs, my context costs dropped and my output got more reliable in the same week. Pick the right channel for the right kind of fact.

4. Sub-agents are how you parallelize the unglamorous work

This was the single biggest unlock for system-level work, and the one that surprised me most. Auditing a token rename across a codebase. Scanning for drift between Figma and code across dozens of components. Reviewing a hundred patterns for an accessibility regression. Hunting down every place an old icon set still leaks through. All of these used to be linear grinds — open a file, check it, close it, next — and now they fan out across sub-agents and come back as a single structured report.

The pattern I keep returning to: find → verify → synthesize. One sub-agent finds candidate issues across the whole surface. Another sub-agent — given each finding cold, with no knowledge of the first one's confidence — adversarially tries to disprove it. A third synthesizes whatever survives into the report I actually read. The adversarial step is what makes the output trustworthy; without it you're just amplifying the first agent's confident mistakes.

The bigger mindset shift is to stop thinking of Claude as "one assistant" and start thinking of it as a small org you can spin up on demand. Spawn a researcher. Spawn a critic. Spawn a fact-checker. Spawn a writer to compress what they found. The single-context-window framing leaves enormous capability on the table; the org-of-agents framing is where most of the leverage I've found actually shows up.

5. Plan mode beats vibe coding for anything system-level

For a one-off script, vibing is fine — sometimes the best move. For anything that touches the system — a new component, a token migration, a docs refactor, a renaming sweep — I make Claude plan first, every single time. State the goal in your own words. List the affected files. Sequence the steps. Name the risks. State the abort conditions. Ask before executing.

What a good plan contains, concretely: a clear scope statement, a file-by-file list of what's getting touched, the sequence (and why that sequence), the two or three failure modes that worry you most, and an explicit "if I see X, stop and ask" clause. Thirty seconds to draft. Saves hours of unwinding when something goes sideways.

The math is brutally simple and I keep relearning it the hard way: minutes spent planning replace hours spent recovering. Every time I've talked myself out of planning because the task "seemed small," it was the task I had to re-do. Plan mode isn't bureaucracy. It's a pre-mortem you can run in the time it takes to make coffee, and it pays for itself the first time it catches a "wait — that would break the icon names" before you start instead of after.

6. "Confidently wrong" is almost always your docs gap, not Claude's failure

This one stung when I first noticed it, and it keeps proving itself. Most of the times Claude has shipped me something wrong, the rule it broke wasn't actually written down anywhere it could see. The system's "obvious" conventions lived in my head, or in a Slack thread from six months ago, or in the muscle memory of the team — not in a file the agent could read. Claude wasn't failing the test; my system was failing the test of being legible enough to teach.

The diagnostic I now run when something goes wrong: could a brand-new designer learn the right answer from the documentation alone, without asking anyone? If the answer is no, the fix isn't "prompt Claude better." The fix is to write the rule down, in the place the agent reads, in language that doesn't require a human translator. Once it's written, the next twenty mistakes that would have made it through don't.

The trap to avoid is treating each wrong output as a one-off — re-prompting, getting a better answer, moving on. That fixes the symptom and leaves the cause. The compounding move is to fix the underlying gap the first time and never see that class of mistake again. Slow now, fast forever.

7. The economics of governance just flipped

This is the one I genuinely didn't see coming. Design system governance — the audits, the drift reports, the accessibility sweeps, the deprecated-token check-ins, the "are we using the new tokens yet" temperature checks — used to require either a dedicated tool with a real subscription or a dedicated person with real headcount. Often both. The work was important; the cost meant it happened quarterly at best.

I now run those audits as scripted Claude tasks. A full system health report — token usage across the codebase, deprecated patterns flagged, drift between design source and shipped code, components nobody imports anymore — costs cents, runs in minutes, and produces a more readable, more actionable output than the dashboards it replaces. The interesting effect isn't just the savings. It's that governance becomes a daily habit instead of a quarterly event. When the audit is free, you run it before merging.

The implication for design system teams is large. The price of "knowing the state of your system" used to be a real budget conversation. Now it's a habit you build. If you're still pricing governance like it's 2023 — line-item subscriptions, dedicated tooling spend, "we'll do an audit next quarter" — you're leaving the easiest budget win in the building on the table.

8. Code as canonical, design tool as visual prompt

The Figma-vs-code source-of-truth debate quietly ended for me the first time I watched Claude generate a component from a Figma frame, then watched a designer try to "fix" the Figma after the code shipped. The code was what users actually got. The Figma was now stale, but it looked authoritative, and that's worse than no source at all — it's a source pointing the wrong way.

I've inverted the relationship for any project I own. Code is canonical. The design tool is where I explore. I treat Figma as a visual prompt — sketches, intent, "something that looks roughly like this" — and the moment a decision is real, it lives in code and Storybook. Storybook in particular has quietly become my actual creative canvas; it's where I iterate, refine, and review components against their real props in their real contexts, with no rendering gap between what I'm designing and what ships.

Claude reinforces this whether you like it or not. It reads code accurately. It reads Figma approximately — through MCPs, plugins, exports — and at every step of that pipeline, information is lost. Working with the grain of that fact, instead of against it, removes a whole category of pain. It also forces a healthier discipline: if a decision exists only in the design tool, treat it as a draft, not a contract.

9. Taste, accessibility judgment, and "is this actually a good idea" still live with you

Worth saying out loud because the demos make it look otherwise. Claude can generate a hundred PRs in an hour. It cannot tell you whether the thing being PR'd is worth shipping. It can pass automated accessibility checks. It cannot tell you whether the experience is humane for a screen reader user on a slow connection. It can match your tokens flawlessly. It cannot tell you whether the system needed another card variant, or whether the right answer was to fix the three you have.

The work to deliberately protect: the taxonomy and naming decisions that shape how your team will think for years. The accessibility floor — the experience you guarantee for assistive tech users — verified by real humans using real tools. The "is this even the right shape" call that decides whether you're solving the problem or just decorating around it. The taste calls that separate a system people love from a system they tolerate. None of that compresses into a prompt.

The paradoxical effect of using Claude well is that this un-compressible work becomes more visible, not less. When everything around it gets cheaper and faster, the slow human judgment is what remains — and stands out more sharply. The teams I see thriving aren't the ones automating taste away. They're the ones using Claude to clear the runway so taste gets the time it deserves.


The thread running through all nine

The same line keeps surfacing under every one of these lessons: Claude makes whatever discipline you already had louder. If your system is well-documented, well-tokenized, well-versioned, well-governed, Claude amplifies it into an enormous productivity gain — the kind that genuinely changes what a small team can ship. If it isn't, Claude amplifies the gaps just as faithfully — confidently, fast, at scale, in places it's hard to roll back.

Which is the part I find most interesting, and the part I think gets undersold in the conversation about AI right now. The teams winning with Claude aren't the ones with the cleverest prompts or the wildest demos. They're the ones who quietly went back and did the boring foundational work nobody got promoted for — the structured docs, the named tokens, the explicit rules, the CLAUDE.md that actually reflects how decisions get made on the team. Claude is, in that sense, the most honest reviewer your design system has ever had. It can only work with what you've actually built.

If you take only one thing from this list and act on it this week, make it CLAUDE.md. Write the rules down. Write the banned patterns down. Write the sources of truth down. Everything else — the skills, the MCPs, the sub-agents, the economics — gets disproportionately easier the moment the foundation is real.