AI Usage Playbook
This playbook defines instructions and best practices for using AI effectively in software development. It is designed for the Practitioner and Expert levels of the AI Growth Path program, targeting team members across all roles — backend, frontend, DevOps, UX, QA, and more.
Core Principles: All recommendations in this playbook align with Xebia's official Core Principles for Working with AI. Refer to that document for the foundational rules that govern every AI interaction at Xebia.
How This Playbook Is Organized
This playbook consists of two complementary parts.
Part I — Tools, Techniques, and Landscape provides a broad overview of the AI-assisted software development ecosystem: model selection, cost optimization, coding assistants, spec-driven development, security, privacy, and more. Think of Part I as the reference library you come back to when you want to go deeper, explore alternatives, or make architectural decisions about your AI tooling. It gives you the knowledge foundation you need before jumping into hands-on work.
Part II — Quick Start is a hands-on, step-by-step guide for a recommended path for agentic development. It walks you through a concrete workflow for coding with an AI agent, from preparing your project all the way to committing reviewed code. The Quick Start uses Claude Code as the primary tool, GitHub for version control, and the agent's built-in planning capabilities — no additional spec-driven development frameworks required. The goal is simple: give you one clear path you can follow today, so you spend less time deciding how to work with AI and more time actually doing it. Once you internalize this workflow, you can adapt it to other tools, frameworks, and methodologies.
Start with Part I to build your understanding of the landscape. Then move to Part II (Quick Start) to put it into practice on a real task and build muscle memory.
Part I — Tools, Techniques, and Landscape
Introduction
This part covers the wider landscape of AI-assisted software development: how to choose models, manage costs, evaluate coding assistants, structure specifications, handle security and privacy, and more.
Use this part as a reference. When you need to compare tools, justify a model choice to a client, set up privacy controls, or go deeper on a topic, this is where you look. Everything here provides the foundation and context for the hands-on workflow described in Part II (Quick Start).
1. AI Assistants, Models, and Providers
Choosing the right AI model and assistant for a given task is a core Practitioner skill. Not every model excels at every job, and using a frontier reasoning model for a simple text classification task wastes both time and money. As an Expert, you should be able to recommend the right tool for any scenario and explain why.
Core Principle reminder: Choose the Right Model for the Task — Don't chase every new release, but periodically research which models are current best-in-class for your use cases. Core Principles for Working with AI
How to Compare and Evaluate AI Models
The AI model market moves fast — new releases weekly, bold benchmark claims daily. Artificial Analysis is your antidote: independent, vendor-neutral evaluations you can actually trust. Bookmark it and use it before making any model decision.
The three dimensions that matter:
Every model on the site is measured across three axes that map directly to engineering trade-offs:
- Intelligence — A composite score built from multiple benchmarks. Don't stop at the headline number. Switch to the Coding Index (code generation, completion, debugging) and Agentic Index (multi-step tool use, self-correction, planning) tabs — these are far more relevant to development work than the general ranking. A model ranked #4 overall might rank #1 for coding.
- Speed — Output tokens per second across API providers. Critical for agentic workflows and interactive assistants: latency compounds across tool calls. Check this whenever you're building anything that chains multiple model calls.
- Price — USD per 1M tokens (input and output shown separately). At CI/CD or high-volume scale, a 5× price difference is an architectural decision, not a detail.
How to get the most out of the site:
Don't treat the default leaderboard ranking as a universal truth. The most useful view is the Quality vs. Price scatter plot — it shows you the "efficient frontier," the cluster of models that deliver near-top performance at a fraction of the frontier cost. For most development tasks, the right model lives there, not at the very top.
On individual model pages, check the provider comparison table: the same model (e.g., Gemini 2.5 Pro) is often served by multiple API providers at meaningfully different prices and latency profiles. When you integrate a model into a product, this table tells you which provider to route through.
Model Selection Strategy
Model selection is an engineering decision. Match capability to the task — overkill wastes money, underpowered models waste time and quality.
Start with the right index, not the headline score. For coding tasks, use the Coding Index. For multi-step agents, use the Agentic Index. A model that ranks #5 overall may outperform everything else on the task you actually care about.
Match model tier to task complexity:
| Task type | What to optimize for | Where to look |
|---|---|---|
| Architecture analysis, security review, complex refactoring, long-context reasoning | Quality and depth — cost is secondary | Frontier models at the top of the Coding/Agentic Index |
| Feature implementation, code review, debugging, test generation | Quality-to-cost ratio | Models near the efficient frontier on the scatter plot |
| CI/CD automation, inline autocomplete, classification, short-context generation | Speed and price — quality threshold, not maximum | Lightweight models; check Speed and Price tabs |
Practical team workflow:
- Identify your top use cases and open the relevant Artificial Analysis index (Coding, Agentic, or general Intelligence)
- Shortlist two or three models clustering near the efficient frontier on the scatter plot
- Run a quick eval on representative samples from your actual codebase or tasks
- Record your findings in a shared decision log — include the date, since rankings shift
- Revisit quarterly or when a major new model drops
A model that is 5× cheaper and only marginally less capable for your specific task is almost always the correct call at volume. Use data to make that argument, not intuition.
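The eval-and-log step in the workflow above can be sketched in a few lines of Python. Everything here is illustrative: the model names, pass rates, and prices are placeholders you would replace with results measured on your own representative samples.

```python
import json
from datetime import date

# Hypothetical eval results: pass rates of two shortlisted models on
# representative samples from your own codebase (fill in real numbers).
results = {
    "frontier-model": {"pass_rate": 0.92, "usd_per_m_output": 15.00},
    "mid-tier-model": {"pass_rate": 0.89, "usd_per_m_output": 3.00},
}

# Rank by quality-to-cost ratio rather than raw quality.
ranked = sorted(results.items(),
                key=lambda kv: kv[1]["pass_rate"] / kv[1]["usd_per_m_output"],
                reverse=True)

# Record findings in a dated decision log entry, since rankings shift.
log_entry = {"date": date.today().isoformat(),
             "ranking": [name for name, _ in ranked]}
print(json.dumps(log_entry))
```

With these placeholder numbers the mid-tier model wins on quality-to-cost, which is exactly the "efficient frontier" argument made above.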
AI Assistants — Use Cases Beyond Coding
AI assistants are not just code generators. Practitioners and Experts should use them across the full scope of their work:
Research and analysis — Quickly survey a technology, compare frameworks, or understand a new domain. Feed documentation and ask the model to summarize trade-offs.
Explaining difficult concepts — Use AI to break down complex topics (distributed consensus, event sourcing, Kubernetes networking) for different audiences. Specify the expertise level of your target audience for best results.
Consultation and design review — Share your architecture sketch, API design, or database schema and ask the model to critique it. Provide your constraints and standards as context.
Client interview preparation — Review and refresh domain knowledge for the role, summarize a client's context, generate likely technical questions.
Pair learning — Use AI as a study partner to learn difficult topics, explain concepts from multiple angles.
Official Prompt Engineering Guides
Every major AI provider publishes prompt engineering documentation. These are the authoritative starting points — bookmark them and revisit periodically as they are updated with each model generation:
| Provider | Guide | Focus |
|---|---|---|
| Anthropic | Prompting best practices | Comprehensive reference for Claude models: clarity, examples, XML structuring, thinking, agentic systems |
| Anthropic | Effective context engineering for AI agents | Context management strategies for building reliable agents — the evolution beyond prompt engineering |
| OpenAI | Prompt engineering guide | Strategies for GPT and reasoning models, including agentic and coding-specific patterns |
| Google | Prompt design strategies (Gemini) | Zero-shot, few-shot, system instructions, and multimodal prompting for Gemini models |
| Microsoft / Azure | Prompt engineering techniques (Azure OpenAI) | Practical prompt construction for enterprise Azure deployments |
Why this matters: The Core Principles document (principles 1–9) covers what to do. These official guides from the model providers cover exactly how to do it with their specific models. Techniques that work well with one model family may need adaptation for another.
Agent Skills Standard
The Agent Skills open standard defines a portable format for teaching AI agents reusable capabilities. Understanding and creating skills is an Expert-level competency.
- Standard website: agentskills.io
- Anthropic implementation: Claude Agent Skills | Skill authoring best practices
- OpenAI implementation: Codex Skills (built into the Codex app and CLI)
A skill is a folder with a required SKILL.md file plus optional scripts, references, and assets. Skills can be installed per-user or per-project, enabling teams to standardize how agents handle common tasks like code review, documentation generation, or deployment workflows.
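A minimal SKILL.md sketch, assuming the frontmatter fields from the Agent Skills format; the skill's name, steps, and referenced file are purely illustrative:

```markdown
---
name: code-review
description: Reviews a diff against team conventions. Use when asked to review a PR or staged changes.
---

# Code Review

1. Read the diff and the conventions in `docs/conventions.md`.
2. Flag violations with file and line references.
3. Summarize findings ordered by severity.
```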
2. Cost & Efficiency
Every AI interaction has a price tag. Understanding how costs work and making smart choices is a Practitioner skill; designing cost-efficient strategies for teams and clients is an Expert responsibility.
Core Principle reminder: Choose the Right Model for the Task — While the AI landscape evolves rapidly, certain models excel at specific tasks. Don't chase every new release, but periodically research which models are current best-in-class for your use cases. Core Principles for Working with AI
Subscription vs. API Pricing
Subscriptions (ChatGPT Plus, Claude Pro, Copilot Pro, Cursor Pro) charge a flat monthly fee with usage caps. Best for individual daily productivity. When evaluating plans, check: message/request limits and overage costs, which models are included, privacy guarantees (free tiers may train on your data), IP indemnity, and team management features. For client work, business-tier plans are required for compliance — check current pricing on each provider's page.
API / pay-per-use charges per token (≈ 4 characters), with output tokens typically 3–5× more expensive than input. Cost per call: (input_tokens / 1M × input_price) + (output_tokens / 1M × output_price). This is what matters at scale. Watch for hidden multipliers: oversized context (some providers double rates above 200K tokens), reasoning/thinking tokens (billed as output even when invisible), agentic loops (each cycle is a full API call), and verbose output.
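The per-call formula translates directly to Python; the $3 / $15 per 1M token prices below are illustrative placeholders, not any provider's current rates:

```python
def cost_per_call(input_tokens: int, output_tokens: int,
                  input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost in USD for one API call, given prices per 1M tokens."""
    return (input_tokens / 1_000_000 * input_price_per_m
            + output_tokens / 1_000_000 * output_price_per_m)

# Illustrative prices only; note output is priced 5x input here.
print(cost_per_call(2_000, 1_000, 3.00, 15.00))  # 0.021
```

Even though the call sends twice as many input tokens as it receives output tokens, the output side dominates the bill — the asymmetry the paragraph above warns about.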
Cost Reduction Levers
| Lever | Impact |
|---|---|
| Model tiering | Highest impact. Frontier models for complex tasks only; mid-tier for daily work; lightweight for autocomplete, classification, simple generation |
| Prompt caching | ~90% discount on repeated context blocks (system prompts, project instructions, RAG knowledge bases) |
| Batch API | 50% discount for non-urgent workloads (nightly analysis, bulk generation) with 24-hour turnaround |
| Effort controls | Dial down reasoning depth per request — a frontier model at reduced effort can match mid-tier cost |
| Context window management | In agentic sessions, cost per message grows linearly with conversation length. Clear context between tasks, compact it proactively, delegate verbose operations to subagents |
| Prompt engineering | Shorter prompts, structured output formats, referencing instruction files instead of re-pasting context |
For teams: estimate token consumption before integrating any API-based workflow, set spending alerts on every provider dashboard, log AI costs per project for ROI tracking, and revisit model choices quarterly as pricing trends consistently downward.
Agentic Session Economics
A single agentic coding session is not a single API call — it is a loop of dozens to hundreds of calls, each carrying the full (and growing) conversation context. Understanding this cost multiplier is essential before adopting agentic workflows at scale.
Subscription vs. API — the first decision:
For individual developers, a subscription plan is almost always more cost-effective than API billing. Because every turn re-sends the growing conversation context, agentic sessions consume far more tokens than a glance at per-call pricing suggests. Most providers offer tiered subscription plans that absorb this cost at a flat rate, making them significantly cheaper than pay-per-token billing for daily agentic work.
Rule of thumb: If you use an agentic coding tool daily, compare subscription tiers against your estimated API spend — subscriptions typically offer a 5–25× cost advantage for active users. Reserve API billing for CI/CD automation, programmatic integrations, and team-wide deployments where you need fine-grained cost control and spending limits.
Important caveat: Subscription plans typically have usage caps that reset periodically. During sustained heavy use (multi-agent teams, large refactors), you may hit rate limits and experience throttling. API billing has no such caps — only the rate limits you configure. For teams running automated pipelines or needing guaranteed throughput, API billing may still be the right choice despite higher per-token cost.
Check current pricing and plan details on official pages: Anthropic Claude · OpenAI · GitHub Copilot · Cursor · Google Gemini
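A break-even sketch for the subscription-vs-API decision. All figures are assumptions for illustration — session counts, token volumes, the $3/$15 per-1M-token prices, and the $100 flat fee should be replaced with your own numbers:

```python
def monthly_api_estimate(sessions_per_day: float, workdays: int,
                         tokens_in: int, tokens_out: int,
                         price_in: float, price_out: float) -> float:
    """Estimated monthly API spend in USD; prices are per 1M tokens."""
    per_session = tokens_in / 1e6 * price_in + tokens_out / 1e6 * price_out
    return sessions_per_day * workdays * per_session

# Assumed figures: 4 agentic sessions/day, 20 workdays/month,
# ~500K input + ~150K output tokens per session.
api_cost = monthly_api_estimate(4, 20, 500_000, 150_000, 3.00, 15.00)
subscription_fee = 100.00  # placeholder flat monthly fee

print(f"estimated API spend: ${api_cost:.0f}/month vs ${subscription_fee:.0f} flat")
```

Under these assumptions the flat fee wins by 3×, before even counting cache misses or heavier months — and the gap widens with usage, since the subscription cost stays fixed.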
How costs compound in agentic sessions:
Not every AI interaction costs the same. The gap between a quick question and a full agentic workflow is enormous — and understanding where your work falls on this spectrum is a major factor in predicting your AI spend.
- Single API call — one request, one response. The baseline: you send a prompt, get an answer, done. This is what most pricing pages show you, and it barely reflects what agentic work actually costs.
- Interactive chat session — a multi-turn conversation where you and the model go back and forth (think ChatGPT or Claude web UI). Each turn re-sends the growing history, so costs accelerate as the conversation lengthens.
- Agentic coding session — the model works semi-autonomously: reading files, writing code, running tests, interpreting results, and looping back. Dozens to hundreds of API calls happen under the hood, each carrying the full conversation context plus tool outputs.
- Multi-agent team — several agents run in parallel (e.g., one plans, one implements, one reviews), each maintaining its own context window. Token consumption multiplies across agents, not just across turns.
| Interaction type | Typical token consumption | Cost multiplier vs. single call |
|---|---|---|
| Single API call | ~2K input + ~1K output | 1× |
| Interactive chat session (10–20 turns) | ~50K–200K input + ~10K–50K output | 25–75× |
| Agentic coding session (file reads, edits, test runs) | ~200K–500K input + ~50K–150K output | 100–400× |
| Multi-agent team (3–5 parallel agents) | ~1M–3M input + ~200K–500K output | 500–1500× |
Because each agent maintains its own context window, a multi-agent team consumes several times the tokens of a single-agent session. At team scale, these costs become an architectural decision, not a rounding error. Use the pricing comparison tools listed below to estimate actual costs for your model and provider.
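Turning the table above into dollar estimates is a one-liner per row. The prices are illustrative placeholders, and the token counts are rough midpoints of the ranges shown:

```python
PRICE_IN, PRICE_OUT = 3.00, 15.00  # illustrative USD per 1M tokens

# Rough midpoints of the token ranges in the table above.
profiles = {
    "single call":      (2_000,     1_000),
    "chat session":     (125_000,   30_000),
    "agentic session":  (350_000,  100_000),
    "multi-agent team": (2_000_000, 350_000),
}

def usd(tokens_in: int, tokens_out: int) -> float:
    return tokens_in / 1e6 * PRICE_IN + tokens_out / 1e6 * PRICE_OUT

baseline = usd(*profiles["single call"])
for name, (t_in, t_out) in profiles.items():
    cost = usd(t_in, t_out)
    print(f"{name:17s} ${cost:7.3f}  ({cost / baseline:5.0f}x)")
```

At these assumed prices a single call costs about two cents, while one multi-agent run costs over ten dollars — the multiplier column made concrete.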
Model tiering within a session:
You do not need to use the same model for every step. The most cost-effective pattern is to tier models by task complexity within a single workflow:
| Task phase | Recommended model tier | Why | Relative cost |
|---|---|---|---|
| Architecture analysis, complex planning, multi-step reasoning | Frontier (e.g., Opus, GPT-5.x, Gemini Ultra) | Highest intelligence — worth the premium for decisions that shape the entire feature | $$$ |
| Feature implementation, code generation, debugging, refactoring | Mid-tier (e.g., Sonnet, GPT-4.x, Gemini Pro) | Best quality-to-cost ratio for the bulk of coding work | $$ |
| Subagent tasks, linting, simple lookups, classification | Lightweight (e.g., Haiku, GPT-4o-mini, Gemini Flash) | Fast and cheap — ideal for delegated operations where speed matters more than depth | $ |
Most agentic coding tools let you switch models mid-session or set defaults per task type. In API-based workflows, route different pipeline stages to different models programmatically.
Rule of thumb: Plan with a frontier model, implement with mid-tier, delegate simple tasks to lightweight. This pattern can reduce session costs by 40–60% compared to using the frontier model for everything.
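The tiering table can be expressed as a small routing function — useful in API-based pipelines where each stage picks its own model. The phase names and model identifiers here are placeholders, not recommendations:

```python
from enum import Enum

class Tier(Enum):
    FRONTIER = "frontier"   # planning, architecture, complex reasoning
    MID = "mid-tier"        # implementation, debugging, refactoring
    LIGHT = "lightweight"   # subagents, lookups, classification

# Hypothetical mapping -- substitute your provider's actual model names.
MODEL_FOR_TIER = {Tier.FRONTIER: "frontier-model",
                  Tier.MID: "mid-tier-model",
                  Tier.LIGHT: "light-model"}

def pick_tier(phase: str) -> Tier:
    """Route a task phase to a model tier, per the table above."""
    if phase in {"architecture", "planning", "complex-reasoning"}:
        return Tier.FRONTIER
    if phase in {"implementation", "debugging", "refactoring", "code-review"}:
        return Tier.MID
    return Tier.LIGHT  # linting, lookups, classification, subagent tasks

print(MODEL_FOR_TIER[pick_tier("planning")])        # frontier-model
print(MODEL_FOR_TIER[pick_tier("debugging")])       # mid-tier-model
print(MODEL_FOR_TIER[pick_tier("classification")])  # light-model
```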
Context window is your primary cost driver:
In agentic sessions, every message you send includes the entire conversation history — all previous turns, file contents, tool outputs, and thinking tokens. This means cost per message grows linearly with conversation length. A message near the end of a long session can cost 10–25× more than the same message at the start, simply because of accumulated context.
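A short sketch makes the compounding visible. The base-context and per-turn token counts are assumed round numbers, not measurements:

```python
# Each turn re-sends the entire history, so per-turn input grows
# linearly and the session's cumulative input grows quadratically.
BASE_CONTEXT = 5_000   # system prompt + project instructions (tokens, assumed)
PER_TURN = 2_000       # new tokens added to the history per turn (assumed)

def input_tokens_at_turn(n: int) -> int:
    return BASE_CONTEXT + n * PER_TURN

print(input_tokens_at_turn(50) / input_tokens_at_turn(1))  # 15.0

total = sum(input_tokens_at_turn(n) for n in range(1, 51))
print(total)  # 2800000 input tokens across one 50-turn session
```

Under these assumptions a turn-50 message costs 15× a turn-1 message, and the 50-turn session sends 2.8M input tokens in total — which is why clearing and compacting context pays off.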
Practical strategies to keep context lean:
- Use `/clear` between tasks — When switching to unrelated work, clear the conversation. Stale context wastes tokens on every subsequent message. Use `/rename` before clearing so you can `/resume` later
- Use `/compact` proactively — When context grows large but you are still mid-task, `/compact` summarizes the conversation while preserving key details. Add focus hints: `/compact Focus on code samples and API usage`
- Auto-compaction is your safety net, not your strategy — Claude Code automatically compacts when approaching context limits, but by that point you have already paid for the bloated context across many messages. Compact earlier, not later
- Delegate verbose operations to subagents — Running a full test suite, fetching documentation, or processing log files can dump thousands of lines into your context. Delegate these to subagents — only a summary returns to your main conversation
- Write specific prompts — "Improve this codebase" triggers broad scanning across many files. "Add input validation to the login function in `auth.ts`" lets Claude work efficiently with minimal file reads
- Move detailed instructions from CLAUDE.md to skills — Your CLAUDE.md is loaded into every session. If it contains detailed instructions for specific workflows (PR reviews, database migrations), those tokens are present even when you are doing unrelated work. Skills load on-demand only when invoked — aim to keep CLAUDE.md under ~500 lines
Prompt caching — the hidden cost saver:
Every agentic turn is a new API call that re-sends the system prompt, tool definitions, CLAUDE.md content, and conversation history. Without caching, you pay full price for this repeated content on every single turn. With caching, repeated content costs 90% less after the first send.
Most providers now offer prompt caching — Anthropic, OpenAI, and Google all support it in some form. Some tools enable it automatically; for API-based workflows, check your provider's documentation.
- How it works under the hood: Caching is prefix-based — the provider matches what you send against what was sent previously, starting from the first token. The moment content diverges, everything after that point is billed as new input. This means stable content (system instructions, tool definitions, project files) should sit at the beginning of your prompt, and variable content (latest user message, tool results) at the end. If you build prompts via the API, keep this order — reordering or editing early sections invalidates the cache for everything that follows.
- What gets cached: System prompts, project instruction files, tool definitions, and the stable prefix of your conversation history. A cache hit typically costs a fraction of the standard input price (exact discount varies by provider)
- Cache lifetime: Usually a few minutes. As long as you send messages within this window, cached content stays warm. Some providers offer extended caching at higher write cost for batch workloads
- Cache-friendly habits: Keep project instruction files stable between sessions — even small edits early in the prompt invalidate everything downstream. Use the same system prompts across team members to maximize shared cache hits
- Where caching matters most: In a 50-turn agentic session, the system prompt and tool definitions are sent 50 times. Without caching, you pay full price each time. With caching, the savings compound significantly across long sessions and teams
Further reading: Prompt caching (Anthropic docs) | Prompt caching (OpenAI docs) | Context caching (Google docs) | Research paper: Don't Break the Cache: Prompt Caching for Long-Horizon Agentic Tasks
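The compounding effect of caching over a long session can be estimated directly. All numbers below are assumptions: prefix size, turn count, the $3 per-1M-token price, the ~90% read discount, and the cache-write premium all vary by provider:

```python
# Caching economics over one long agentic session (assumed figures).
PREFIX = 30_000             # stable prefix: system prompt, tools, instructions
TURNS = 50
PRICE_IN = 3.00             # USD per 1M input tokens (illustrative)
CACHE_DISCOUNT = 0.90       # cached reads ~90% cheaper (varies by provider)
CACHE_WRITE_PREMIUM = 1.25  # first write often costs a bit more (assumed)

uncached = TURNS * PREFIX / 1e6 * PRICE_IN
cached = (PREFIX / 1e6 * PRICE_IN * CACHE_WRITE_PREMIUM            # first send
          + (TURNS - 1) * PREFIX / 1e6 * PRICE_IN * (1 - CACHE_DISCOUNT))

print(f"without caching: ${uncached:.2f}")  # $4.50
print(f"with caching:    ${cached:.2f}")    # $0.55
```

An 8× saving on the stable prefix alone, for one session, for one developer — multiply by sessions and team size to see why keeping the prefix cache-friendly matters.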
Budgeting agentic workflows for teams:
Agentic coding is not free-form — it requires the same financial discipline as any infrastructure cost. Treat AI token spend as a line item in your project budget.
| Budget lever | How to use it |
|---|---|
| Workspace/project spend limits | Most providers offer spending caps or budget alerts via their console or dashboard. Set these up before enabling agentic workflows — they prevent runaway costs from long-running agents or automation |
| Rate limits per team size | Scale tokens-per-minute allocation per user as the team grows. Fewer users are active concurrently in larger teams, so per-user allocation can decrease |
| Monitor usage actively | Use your tool's built-in usage tracking (session cost reports, token counters, usage dashboards). Review these regularly — a single runaway session can consume a week's budget |
| Reasoning/thinking token budget | Reasoning tokens (extended thinking, chain-of-thought) are billed as output — the most expensive token type. For simpler tasks, reduce reasoning depth or effort level. Check your provider's docs for how to control this |
| Multi-agent governance | When running multiple agents in parallel, each maintains its own context window. Keep teams small, tasks focused, and clean up idle agents promptly |
Recommended team workflow for cost control:
- Start with a small pilot group (3–5 developers) to establish baseline usage patterns
- Set conservative workspace spend limits based on pilot data
- Log AI costs per project — track alongside other infrastructure costs
- Review monthly and adjust: model mix, context management habits, and automation scope
- Revisit quarterly as model pricing consistently trends downward
Pricing Comparison Resources
AI pricing changes constantly — do not rely on memorized numbers. Bookmark these:
- Artificial Analysis — Quality-vs-price scatter plots across all major models: artificialanalysis.ai
- Price Per Token — Daily-updated pricing for 300+ models with cost calculators: pricepertoken.com
- Helicone LLM Cost — Side-by-side cost comparison for 300+ models: helicone.ai/llm-cost
- Vellum LLM Cost Comparison — Visual cost comparison by input/output size: vellum.ai/llm-cost-comparison
Official pricing pages: Anthropic Claude · OpenAI · Google Gemini · GitHub Copilot · Cursor
3. AI Coding Assistants
AI coding assistants have evolved from simple autocomplete tools into full agentic systems capable of planning, implementing, testing, and iterating on code across entire projects.
Categories of AI Coding Tools
| Category | How It Works | Examples | Best For |
|---|---|---|---|
| IDE-integrated autocomplete | Real-time suggestions as you type; tab to accept | GitHub Copilot (inline), Cursor (Tab), JetBrains AI | Fast completions, boilerplate, repetitive patterns |
| IDE chat / edit mode | Conversational coding within your editor; targeted edits | GitHub Copilot Chat, Cursor Ask/Chat, JetBrains AI Chat | Explanations, targeted refactors, Q&A about code |
| IDE agent mode | Autonomous multi-file editing with tool use and self-correction | GitHub Copilot Agent Mode, Cursor Agent, Windsurf, JetBrains Junie | Feature implementation, multi-file refactors, complex tasks |
| Terminal-based agents | CLI tools that understand your repo and execute commands | Claude Code, OpenAI Codex CLI, Google Gemini CLI, Aider | Full project work, git workflows, DevOps tasks, CI/CD |
| Cloud-based agents | Asynchronous agents that work in cloud sandboxes on assigned tasks | GitHub Copilot Coding Agent, OpenAI Codex (cloud), Cursor Cloud Agents | Parallelized work, issue-based delegation, background tasks |
Tool Profiles
The table below compares the major AI coding assistants across key dimensions. Detailed profiles follow.
| GitHub Copilot | Cursor | Claude Code | OpenAI Codex | |
|---|---|---|---|---|
| Form factor | IDE plugin + CLI + cloud | AI-native IDE (VS Code fork) | Terminal agent + IDE extensions | Cloud + CLI + IDE + desktop app |
| IDE support | VS Code¹, Visual Studio, JetBrains, Eclipse, Xcode, Neovim | Own IDE; JetBrains (Agent Client Protocol) | Terminal; VS Code¹, JetBrains | VS Code¹; desktop app |
| Inline completions | ✓ + Next Edit Suggestions | ✓ + predictive edits | — | — |
| Agent mode | ✓ | ✓ | ✓ | ✓ |
| Cloud / async agents | Coding agent (Issues → PR) | Cloud Agents | GitHub / GitLab CI | Cloud sandboxes; parallel tasks |
| Plan mode | ✓ | ✓ | ✓ | via $create-plan skill |
| Project config | .github/copilot-instructions.md | .cursor/rules | CLAUDE.md | AGENTS.md |
| MCP | ✓ | ✓ | ✓ | ✓ |
| Models | GPT-5.x, Claude, Gemini | Claude, GPT, Gemini, Cursor | Claude | GPT-5.x-Codex |
| Plans | Free · Pro · Pro+ · Business · Enterprise | Hobby (free) · Pro · Pro+ · Ultra · Business | Pro / Max subscription or API | Plus / Pro / Business / Edu / Enterprise |
¹ Including VS Code forks (Cursor, Windsurf)
GitHub Copilot
GitHub Copilot is the most widely adopted AI coding assistant. Its capabilities span the full development lifecycle:
- Inline suggestions and Next Edit Suggestions (NES) — predicts not just the next line, but the next logical edit location
- Plan mode — analyzes your request, generates a step-by-step implementation plan, and lets you review and refine it before any code is written. Available across VS Code, JetBrains, Eclipse, and Xcode
- Agent mode — autonomously edits multiple files, runs terminal commands, and self-corrects errors
- Coding agent — asynchronous cloud-based agent that works on GitHub Issues, creates branches, and opens pull requests for review
- Code review — AI-generated review suggestions on pull requests
- Copilot CLI — terminal-native coding agent with plan mode, autopilot mode, built-in specialized agents, and MCP support
- Customization system — a layered configuration model (repository instructions, file-scoped rules, reusable prompts, custom agents, skills) for tailoring Copilot to your project
Documentation: docs.github.com/en/copilot | Copilot customization
⚠️ IDE feature parity varies. VS Code has the most complete Copilot feature set. JetBrains IDEs support the main instructions file, prompt files, custom agents, and skills, but do not support multiple `.instructions.md` files with `applyTo` patterns. Always verify which customization features your IDE supports in the official docs.
Cursor
Cursor is an AI-native IDE (a fork of VS Code) built from the ground up around AI workflows. Its key differentiators:
- Codebase indexing — indexes your entire project so AI can reason across all files, not just the one you have open
- Agent mode — full agentic workflow that plans, edits, runs commands, and iterates
- `.cursorrules` / `.cursor/rules` — project-level configuration files to enforce coding standards, architectural patterns, and team conventions
- Cloud Agents — autonomous agents running in their own cloud VMs with full development environments, computer use, and the ability to test changes and produce artifacts (screenshots, recordings, logs). Launchable from web, mobile, Slack, or GitHub
- Automations — always-on cloud agents that run on a schedule or in response to events
- JetBrains integration — available in IntelliJ, PyCharm, WebStorm and other JetBrains IDEs via Agent Client Protocol (ACP)
- Composer model — Cursor's own frontier model optimized for code editing, which Cursor reports as roughly 4× faster than comparably capable models
- Privacy Mode — configures zero data retention with model providers when enabled
Documentation: docs.cursor.com
Claude Code
Claude Code is Anthropic's agentic coding tool that runs in your terminal. It understands your codebase, edits files, executes commands, and manages git workflows through natural language:
- Terminal-native — lives in your terminal, no IDE dependency required
- CLAUDE.md — project-level documentation files that teach Claude about your codebase, conventions, and architecture. The better these are maintained, the better the results
- Plan mode — a read-only research and planning phase (`Shift+Tab` twice) where Claude analyzes the codebase, asks clarifying questions, and produces a structured plan before writing any code
- Agentic workflow — reads files, runs tests, commits code, creates PRs
- Multi-turn conversations — maintains context across a session for iterative work
- Sub-agents — can spawn focused sub-tasks in their own context
- Agent teams — coordinate multiple Claude Code instances working in parallel. One session acts as team lead, assigning tasks and synthesizing results, while teammates work independently in their own context windows and communicate directly with each other. Best suited for parallel code review, cross-layer changes, or debugging with competing hypotheses
- Multi-IDE integration — beyond the terminal, Claude Code offers official extensions for VS Code (and forks like Cursor), JetBrains IDEs, with interactive diff viewing, selection context sharing, and diagnostic integration
- GitHub integration — tag `@claude` on GitHub issues and PRs for automated code review and implementation
- MCP support — extend with Model Context Protocol servers for additional tool access
Documentation: code.claude.com/docs
OpenAI Codex
OpenAI Codex has evolved into a full-fledged software engineering agent platform, available as a cloud agent, CLI tool, and IDE extension:
- Cloud agent — runs tasks in isolated cloud sandboxes, works on multiple tasks in parallel, and proposes pull requests
- Codex CLI — lightweight local agent for terminal-based coding workflows
- Desktop app — dedicated interface for managing multiple agents in parallel and collaborating on long-running tasks
- Skills — extensible task bundles that combine instructions, scripts, and resources (e.g., `$skill-creator`, `$create-plan`)
- AGENTS.md support — respects project-level agent instructions
Documentation: developers.openai.com/codex | GPT-5 prompting guide (OpenAI)
Other Tools
The AI coding assistant landscape evolves rapidly. Beyond the tools covered here, options exist across several categories: IDE-integrated assistants, terminal/CLI-based agents, cloud provider-native tools (AWS, Google, Azure), and open-source multi-model agents. When evaluating any tool not listed in this playbook, apply the Use Approved Tools Only security principle and verify data handling policies, licensing, and organizational approval before use.
Model Context Protocol (MCP)
Every coding assistant listed above — Claude Code, Cursor, Codex CLI, GitHub Copilot — becomes significantly more powerful when it can reach beyond your code editor into the tools your team actually uses: issue trackers, databases, documentation wikis, monitoring dashboards. The Model Context Protocol (MCP) is the open standard (created by Anthropic in 2024, now governed by the Linux Foundation) that makes this possible without bespoke integrations. You write one MCP server for Jira, and it works with every MCP-compatible coding assistant — no per-tool, per-agent glue code.
How it works: Your coding assistant runs an MCP client; each external integration (a database, Jira, a browser, Sentry) runs as an MCP server exposing tools over the protocol. The assistant discovers available tools at runtime and calls them as needed during coding workflows — querying a database schema while generating a migration, pulling a Jira ticket description into context for implementation, or checking Sentry logs while debugging.
Common MCP servers for development teams:
| Server | What It Provides |
|---|---|
| Filesystem | Read, write, and search files on disk |
| PostgreSQL / MySQL | Query databases, inspect schemas |
| GitHub | Create issues, read PRs, search repositories |
| Jira / Confluence | Read and create tickets; search documentation |
| Playwright / Browser | Control a browser for testing and scraping |
| Sentry | Access error logs and stack traces |
| Context7 | Connect to up-to-date third-party documentation |
Directory of servers: github.com/modelcontextprotocol/servers
Setting up MCP servers:
- Claude Code: Configure in `.mcp.json` at project root. Use `claude mcp add <n> -- <command>`.
- Codex CLI: Configure in `~/.codex/config.toml`. Use `codex mcp add <n> -- <command>`.
- Cursor: Add via Settings → MCP or in `.cursor/mcp.json`.
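For illustration, a minimal `.mcp.json` for Claude Code might look like the following — the server packages and connection string are examples, so check each server's documentation for its actual invocation:

```json
{
  "mcpServers": {
    "postgres": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-postgres", "postgresql://localhost/devdb"]
    },
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "./docs"]
    }
  }
}
```

Commit this file to the repo so the whole team shares the same server setup, but keep credentials out of it — use environment variables or narrowly scoped tokens instead.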
MCP Security Risks
When you connect MCP servers to your coding assistant, you're giving an AI agent — one that processes untrusted input like code comments, issue descriptions, and web content — direct access to your databases, repositories, and internal tools, running with your permissions. Every major security research team that has examined MCP in coding workflows — Palo Alto Unit 42, Invariant Labs, JFrog, Checkmarx — has found serious, exploitable vulnerabilities in real-world scenarios.
| Risk | What Happens | Real Example | Mitigation |
|---|---|---|---|
| Tool poisoning | Malicious server embeds hidden instructions in tool descriptions that the coding assistant follows silently | Invariant Labs: a "random fact" server exfiltrated an entire WhatsApp history through a legitimate server connected to the same assistant session | Audit tool schema definitions in source code before installing — not just the README |
| Overprivileged tokens | Compromised server with broad token scopes leaks access to all connected services | GitHub MCP server with a broad Personal Access Token (PAT) allowed a prompt-injected agent to exfiltrate private repos into a public PR | Least privilege: narrowly scoped, short-lived, dedicated credentials per server — never personal all-access tokens |
| Rug pulls | Server silently changes tool definitions between sessions, adding capabilities you never approved | Documented by eSentire: tool approved on Day 1 can reroute API keys by Day 7 | Pin server versions; review changelogs before updating; prefer known publishers |
| Command injection | AI-generated input passed unsanitized to shell commands in server implementations | CVE-2025-6514 (CVSS 9.6) in mcp-remote — 437k+ downloads, affected Cloudflare and Hugging Face integrations | Never concatenate input into shell commands; use parameterized APIs; sandbox servers in containers |
| Lethal Trifecta via MCP | Your coding assistant gains private data access (database server) + external communication (HTTP tool); a prompt injection in a code comment or issue description chains them for exfiltration | Multiple demonstrations combining database servers with outbound HTTP tools | Don't combine sensitive-data and outbound-communication servers carelessly; require human approval for external actions |
| Supply chain | No centralized review for community servers; name impersonation (e.g., mcp-github mimicking github-mcp); tampered one-click installers | Academic researchers documented unofficial installers distributing tampered packages to coding tool users | Install only from trusted sources; verify publisher identity; review source code |
MCP security checklist:
✓ Audit tool descriptions and source code before installing any MCP server
✓ Apply least-privilege: narrow token scopes, restricted filesystem/network access
✓ Pin server versions; review changes before updating
✓ Sandbox MCP servers in containers when possible
✓ Never combine sensitive-data and external-communication servers carelessly
✓ Run SAST (Static Application Security Testing) / SCA (Software Composition Analysis) on custom MCP servers; treat them as production code
✓ Require human approval for high-risk tool calls (data export, sending messages, file deletion)
✓ Keep servers and dependencies updated — check for CVEs regularly
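To make the human-approval item concrete, here is a minimal, hypothetical sketch (not a real MCP client API — the tool names and dispatch logic are invented) of gating high-risk tool calls behind an approval callback:

```python
# Hypothetical sketch — not a real MCP client API. Illustrates the checklist item
# "require human approval for high-risk tool calls" with an approval callback.
HIGH_RISK_TOOLS = {"send_email", "http_post", "delete_file", "export_data"}

def call_tool(name, args, approve):
    """Execute a tool call, but block high-risk tools unless a human approves."""
    if name in HIGH_RISK_TOOLS and not approve(name, args):
        return {"status": "blocked", "tool": name}
    # A real client would dispatch to the MCP server here.
    return {"status": "executed", "tool": name}

# Read-only calls pass through; outbound communication requires explicit approval.
print(call_tool("query_schema", {"table": "users"}, approve=lambda n, a: False))
print(call_tool("http_post", {"url": "https://example.org"}, approve=lambda n, a: False))
```

The point is the shape, not the code: read-only tools can run freely, while anything that exfiltrates data or mutates state passes through a human checkpoint first.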
Further reading: MCP Security Best Practices (official spec) | Red Hat: MCP Security Risks | Palo Alto Unit 42: MCP Attack Vectors
Documentation: modelcontextprotocol.io | SDKs: github.com/modelcontextprotocol
Agent Skills Standard
Coding assistants also support the Agent Skills Standard — see Section 1.
Best Practices for AI Coding Assistants
These practices apply regardless of which tool you use. They build on and complement the Core Principles.
1. Invest in context engineering
The single most impactful thing you can do is provide excellent context. This means maintaining project-level instruction files (.cursorrules, CLAUDE.md, AGENTS.md, .github/copilot-instructions.md) that describe your architecture, coding standards, approved tech stack, and conventions. Keep these files updated and version-controlled.
Deep dive: Anthropic's engineering team published two essential articles on this topic: Effective context engineering for AI agents explains why context management has surpassed prompt writing as the critical skill, and Effective harnesses for long-running agents covers strategies for maintaining quality across multi-session projects.
Auto-loaded entry files must stay concise. Every coding assistant has a main entry file that is loaded automatically into every session (`CLAUDE.md` for Claude Code, `.github/copilot-instructions.md` for Copilot, `.cursorrules` for Cursor). Because these are always in context, keep them short and focused: tech stack, key commands, critical conventions, and references to detailed docs. Put granular rules in secondary files (`.github/instructions/`, Cursor's `.cursor/rules/`, Claude's sub-agent instructions) that load only when relevant. Always verify how your specific assistant handles instruction loading — for example, Copilot's `.instructions.md` files require an `applyTo` pattern to activate, and not all features work the same across IDEs.
# Example: CLAUDE.md (or equivalent)
## Architecture
- Microservices with event-driven communication via RabbitMQ
- CQRS pattern with PostgreSQL + Marten for event sourcing
- .NET 8, C# 12, ASP.NET Core
## Coding Standards
- PascalCase for public methods, _camelCase for private fields
- Async/await for all I/O, suffix async methods with "Async"
- Constructor injection, guard clauses at method start
- XML documentation for all public APIs
## Constraints
- No GPL dependencies in production code
- API responses must be <200ms at p95
- All external dependencies require ADR approval
2. Start with small, focused tasks
Don't ask an agent to build an entire feature in one go. Break work into focused steps: data model → repository → service layer → API endpoints → tests. Review and validate at each step.
3. Use approval modes wisely
Most agents offer tiered permission levels. For production codebases, prefer modes that require explicit approval before the agent runs commands or writes to files outside your working directory. Increase autonomy only as you build trust and establish guardrails — and keep in mind that guardrails don't always hold. They are susceptible to jailbreaking, context drift, and semantic inconsistency.
4. Review everything critically
AI-generated code can look plausible while containing subtle bugs, security holes, or performance issues. Apply the same rigor to AI-generated code that you would to any code review. Never commit code you don't understand. (See Core Principles for Working with AI.)
5. Maintain version control discipline
Make small, atomic commits with descriptive messages. Never bulk-commit large AI outputs without understanding each change. Your git history should tell a coherent story. (See Core Principles for Working with AI.)
Task Management and Planning Strategies
For complex projects, consider these agent-compatible planning approaches:
Plan → Implement → Verify loop — Before coding, ask the agent to create a plan. Review the plan, then authorize implementation. After implementation, run tests and verification. This three-phase approach catches issues early. Most major tools now have a dedicated plan mode for this: Copilot's Plan mode (select from agents dropdown or press Shift+Tab in CLI), Claude Code's Plan mode (Shift+Tab × 2 or --permission-mode plan), and Cursor's agent planning step.
Cross-model validation for critical plans — For high-stakes features, consider validating an agent's implementation plan with a second model before writing any code. After your primary agent (e.g., Claude Sonnet) produces a plan, feed that plan to a reasoning-focused model (e.g., o1/o3, Claude with extended thinking) with the prompt: "Find flaws, edge cases, and missing requirements in this implementation plan." This is not necessary for routine work, but for complex features with significant business impact, a five-minute cross-model review can catch architectural blind spots that a single model's planning step might miss. Think of it as a lightweight "second opinion" on the plan — not a mandatory gate.
Multi-agent parallelism — Tools like Claude Code (sub-agents), Codex (cloud tasks), and Cursor (background agents) let you run multiple agents in parallel on independent tasks. Use this for large refactors where tasks don't have heavy dependencies on each other. When tasks are sequential or interdependent, parallelism won't work — instead, create a tasks.md file with a clear checklist and work through it across focused sessions, manually tracking progress.
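A `tasks.md` for sequential work can be as simple as this sketch (the task names are illustrative):

```markdown
# Feature: user data export — task checklist
- [x] 1. Add export data model and migration
- [x] 2. Implement repository layer with unit tests
- [ ] 3. Add background export job (depends on 1–2)
- [ ] 4. Expose API endpoint with progress reporting (depends on 3)
- [ ] 5. Add integration tests and docs (depends on 3–4)
```

Check items off as each session completes them, and start each new session by pointing the agent at this file so it picks up exactly where the last one left off.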
4. Spec-Driven Development
Spec-Driven Development (SDD) represents the shift from ad-hoc "vibe coding" to structured, production-ready AI workflows.
What Is Spec-Driven Development?
SDD is a methodology where formal, detailed specifications serve as the primary input for AI code generation. Instead of chatting with an AI iteratively and hoping for the best, you invest upfront in writing clear specifications that describe what the system should do and how it should be built. The AI then implements against these specifications, producing more consistent, maintainable, and correct code.
The core workflow:
Specify → Plan → Tasks → Implement → Validate
- Specify — Write detailed requirements including functional behavior, technical constraints, architecture decisions, and acceptance criteria
- Plan — Ask the AI to produce an implementation plan; review and refine it before any code is written
- Tasks — Break the plan into discrete, ordered tasks with clear inputs and outputs
- Implement — The AI agent implements tasks one at a time, with review checkpoints
- Validate — Run tests, review diffs, and verify the implementation against the specification
How It Compares to Other Approaches
| Approach | Specification | Planning | Human Control | Best For |
|---|---|---|---|---|
| Vibe coding | None (ad-hoc prompts) | None | Low | Quick prototypes, throwaway scripts, exploration |
| Prompt engineering | Implicit in prompts | Minimal | Medium | Single-file tasks, targeted edits |
| Spec-Driven (SDD) | Formal, versioned specs | Full plan before code | High | Production features, team collaboration, complex systems |
SDD Tools
Several tools have been purpose-built for spec-driven workflows:
GitHub Spec Kit — An agent-agnostic toolkit that bootstraps the specification, planning, and task breakdown process. Creates a .specify folder in your repo with structured artifacts. Works with any coding agent.
Source: developer.microsoft.com/blog/spec-driven-development-spec-kit
AWS Kiro — Amazon's AI IDE built around spec-driven development, using "spec" artifacts to guide implementation with strong integration into AWS services.
Source: kiro.dev | Introducing Kiro | Specs documentation
JetBrains Junie — Supports SDD workflows through its planning and guidelines system, producing requirements.md, plan.md, and tasks.md files. The "Think More" toggle encourages deeper planning.
Source: blog.jetbrains.com/junie/2025/10/how-to-use-a-spec-driven-approach-for-coding-with-ai/
Claude Code / Codex — Both support SDD through their CLAUDE.md / AGENTS.md system and skills like $create-plan. Combine with task files for structured implementation. Community-maintained skill collections (e.g. replicating Kiro's spec workflow inside Claude Code) exist, but review them carefully before adopting — they're unofficial, may fall out of date, and can introduce unexpected behavior into your agent's workflow.
Get Shit Done (GSD) — A meta-prompting and context engineering system for Claude Code that manages the full workflow: discussion, planning, execution, verification, and shipping. Uses multi-agent orchestration, XML-formatted task specs, and atomic git commits. More opinionated and heavier than the other options here, but thorough.
Practical Example: SDD Prompt
Here is an example prompt that bootstraps a spec-driven workflow, adaptable to any agent:
# Spec-Driven Development: [Feature Name]
## Requirements
- User can upload a CSV file of up to 50MB through the web UI
- System parses the CSV, validates against the expected schema, and stores records in PostgreSQL
- Invalid rows are collected and returned as a downloadable error report
- Processing happens asynchronously; user sees a progress indicator
## Technical Constraints
- Backend: .NET 8, ASP.NET Core, EF Core 8
- Frontend: React 18 with TypeScript
- File storage: Azure Blob Storage for temporary uploads
- Max concurrent uploads per user: 3
- Must handle files with up to 500,000 rows
## Architecture
- Use background job (Hangfire) for CSV processing
- SignalR for real-time progress updates
- Repository pattern for data access
- Follow existing coding standards in .cursorrules
## Instructions
1. First, produce a `plan.md` with your implementation approach — do NOT write any code yet
2. After I review the plan, break it into discrete tasks in `tasks.md`
3. Implement tasks one at a time, waiting for my approval after each
When to Use SDD (and When Not To)
Use SDD when:
- Building production features that will be maintained long-term
- Working in a team where multiple people (or agents) need to understand intent
- The feature involves complex business logic or cross-cutting concerns
- You need traceability between requirements and implementation
Skip SDD when:
- Rapid prototyping or throwaway experiments
- Simple, well-scoped tasks (fix a typo, add a log line)
- You're exploring an unfamiliar domain and need to iterate freely
5. Privacy and Security Options
Every AI tool you use has a data handling policy. Understanding these policies and configuring tools correctly is a professional obligation to your clients and your company.
Core Principle reminder: Protect Sensitive Information and Control Data Usage — Understand your boundaries. Follow your organization's policies on what data can be shared with GenAI tools. Ensure your inputs aren't collected for training. (See Core Principles for Working with AI.)
Data Handling by Tool
Official data privacy documentation: Always verify current policies directly — they change frequently. Anthropic Privacy Center | OpenAI Enterprise Privacy | GitHub Copilot Trust Center | Cursor Privacy
| Tool | Data Used for Training? | Privacy/ZDR Option | Encryption | Compliance |
|---|---|---|---|---|
| Claude (API / Enterprise) | No | Zero data retention available | In transit + at rest | SOC 2 Type II |
| GitHub Copilot Business/Enterprise | No (code not used for training) | Business policy controls | In transit + at rest | SOC 2, FedRAMP |
| Cursor (Privacy Mode) | No (when Privacy Mode enabled) | Privacy Mode = zero data retention with providers | In transit + at rest | SOC 2 Type II |
| OpenAI Codex (ChatGPT Business/Enterprise) | No (Business/Enterprise) | Zero data retention | In transit + at rest | SOC 2 |
| ChatGPT Free/Plus | May be used for training | Opt-out available | In transit | Limited |
Rules for Xebia Teams
- Always verify client approval — Before using any AI tool on a client project, confirm that the client or project stakeholders have explicitly approved GenAI usage
- Use approved tiers only — Enterprise/Business tiers with proper data handling. Free tiers of consumer tools are prohibited for client work
- Sanitize all inputs — Remove credentials, API keys, PII, internal hostnames, and proprietary business logic before submitting to any AI tool
- Enable privacy modes — Turn on Privacy Mode in Cursor, use API-based access for Claude, use Business/Enterprise plans for Copilot and Codex
- Understand residency — Know where your data is processed. Some clients require data to stay within specific regions (EU, US)
- When in doubt, ask — Contact your project's security team or consult Xebia's compliance guidelines. Never assume it is okay
Data Sanitization Checklist
Before pasting anything into an AI tool, remove or replace:
✗ Passwords, API keys, tokens, secrets
✗ Database connection strings, SSH keys, certificates
✗ Names, emails, addresses, social security numbers
✗ Credit card numbers, financial account details
✗ Customer names, contract details, internal codenames
✗ Specific IPs, hostnames, internal URLs
✓ Use placeholders: "prod-db-01.company.com" → "[DB_HOST]"
✓ Use placeholders: "sk_live_a1b2c3..." → "[API_KEY]"
✓ Use placeholders: "Jan Kowalski" → "[CUSTOMER_NAME]"
✓ Use environment variable references: process.env.DB_PASSWORD
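Parts of this checklist can be automated. A minimal stdlib sketch — the patterns below are illustrative, not exhaustive, so always review the result by hand before pasting:

```python
import re

# Illustrative patterns only — extend for your own secrets, hosts, and PII.
PATTERNS = [
    (re.compile(r"sk_live_[A-Za-z0-9]+"), "[API_KEY]"),
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\b[\w-]+\.company\.com\b"), "[DB_HOST]"),
]

def sanitize(text: str) -> str:
    """Replace known sensitive patterns with placeholders before sharing with an AI tool."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

print(sanitize("connect to prod-db-01.company.com as jan@example.com with sk_live_a1b2c3"))
```

A pre-commit hook or clipboard wrapper running a script like this catches the routine leaks; judgment is still needed for proprietary business logic, which no regex can recognize.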
6. LLM-Specific Security Risks
Security when using AI is not an afterthought — it is a first-class concern.
Core Principles reminder: Review the full security section in Core Principles for Working with AI for Xebia's authoritative guidance on: verifying outputs, protecting sensitive data, IP protection, access controls, organizational policies, and approved tools.
OWASP Top 10 for LLM Applications (2025)
The OWASP Top 10 for LLM Applications is the industry-standard reference for securing AI-powered applications.
📄 Full reference: genai.owasp.org/llm-top-10
The Lethal Trifecta
Security researcher Simon Willison identified the "Lethal Trifecta" — a critical security pattern that every developer using AI agents must understand. When an AI agent combines these three capabilities, attackers can exploit prompt injection to steal your data:
┌─────────────────────┐ ┌────────────────────────┐ ┌──────────────────────┐
│ 1. Access to │ │ 2. Exposure to │ │ 3. Ability to │
│ PRIVATE DATA │ + │ UNTRUSTED CONTENT │ + │ COMMUNICATE │
│ (emails, files, │ │ (web pages, uploaded │ │ EXTERNALLY │
│ databases) │ │ docs, user inputs) │ │ (send data out) │
└─────────────────────┘ └────────────────────────┘ └──────────────────────┘
↓
⚠️ LETHAL TRIFECTA — Data theft possible
Why it matters: An attacker plants malicious instructions in a document or web page. When the agent processes that content, the instructions trick it into reading your private data and sending it to the attacker's server (e.g., via a URL in a Markdown link). This is not theoretical — it has been demonstrated against ChatGPT, Google Bard, Microsoft Copilot, GitHub Copilot Chat, Slack, and many others.
How to protect yourself:
- Avoid combining all three — If possible, ensure your agent doesn't have all three capabilities simultaneously
- Restrict external communication — Limit the agent's ability to make arbitrary network requests or embed URLs in outputs
- Treat all external content as untrusted — Documents, web pages, emails, and user inputs can all contain hidden prompt injection
- Use human-in-the-loop — Require manual approval before the agent takes high-risk actions (sending data, making API calls, modifying files outside the workspace)
- Monitor and audit — Log all agent actions, especially data access and external communications
📄 Full article: simonwillison.net/2025/Jun/16/the-lethal-trifecta/
7. Measuring AI Impact
Implementing AI tools in developers' daily work brings measurable benefits, but to optimize the process and communicate its value, those benefits must be quantified. This section shows how to measure the overall impact of working in the "developer + AI agent" model on a project.
These metrics need to clearly demonstrate value for the client — proving that working with AI-proficient engineers means faster time-to-market, higher quality, and better budget outcomes.
1. Delivery Output (DORA Metrics)
The most practical way to measure team efficiency is through DORA (DevOps Research and Assessment) metrics and closely related flow measures. Working with AI directly impacts two of these:
- Cycle Time: The time required to complete a task from when work begins until it is done (e.g., status change in Jira). Comparing the average Cycle Time of a solo developer against a "developer + AI agent" shows the difference concretely.
- Lead Time for Changes: The time from committing code to deploying it to production. Since AI speeds up writing integration tests and CI/CD configuration, this phase typically shortens noticeably.
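As a sketch of the second measure, lead time can be computed from (commit, deploy) timestamp pairs — the data here is made up for illustration; in practice you would pull it from your CI/CD system or git history:

```python
from datetime import datetime, timedelta
from statistics import mean

# Illustrative (commit_time, deploy_time) pairs.
changes = [
    (datetime(2025, 1, 6, 9, 0), datetime(2025, 1, 6, 15, 0)),   # 6 hours
    (datetime(2025, 1, 7, 10, 0), datetime(2025, 1, 8, 10, 0)),  # 24 hours
]

# Lead Time for Changes = time from commit to production deploy.
lead_times_hours = [(deploy - commit) / timedelta(hours=1) for commit, deploy in changes]
print(f"Average Lead Time for Changes: {mean(lead_times_hours):.1f} hours")
```

Track the same computation over consecutive sprints, before and after AI adoption, so the comparison is trend-based rather than anecdotal.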
2. Quality and Technical Debt
Delivery speed (Velocity) resulting from AI usage cannot come at the expense of quality. AI is not just a code generator — it's also effective at managing technical debt.
We recommend tracking industry-standard quality metrics:
- SonarQube Metrics: Observe test coverage, the number of Code Smells, and the overall Technical Debt Ratio. AI agents can be deliberately prompted to write missing unit tests and refactor code, which should directly reduce debt in SonarQube.
- QA Defect Rate (Number of bugs): Monitor the number of reported bugs per release or sprint. Developers using AI for early Code Review and logic validation before opening a PR (Pull Request) should deliver more stable code, noticeably reducing the QA team's workload.
3. Client Value & Cost Efficiency (ROI)
Most of our projects run on a Time and Materials (T&M) model, so every working hour saved by AI translates directly into client value — more features delivered in the same timeframe, or a lower final project cost.
To make this visible to management and clients, compare the cost of AI tool licenses against the value of the developer's saved time.
ROI Formula:
Value of saved time = (Estimated saved hours per month) × (Hourly rate)
ROI = Value of saved time / Monthly AI subscription cost
Example: A tool subscription (e.g., Claude Max or GitHub Copilot Enterprise) costs around $100 per month. If the tool — by quickly generating boilerplate, tests, or assisting in debugging — saves the developer just 5 hours a month, and their rate is $50/h, the generated savings amount to $250. In this scenario, the tool more than pays for itself (ROI of 2.5x). In practice, AI agents typically save considerably more time than that.
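The arithmetic from the example, as a small reusable sketch:

```python
def ai_roi(saved_hours_per_month: float, hourly_rate: float, subscription_cost: float) -> float:
    """ROI = (saved hours x hourly rate) / monthly subscription cost."""
    return (saved_hours_per_month * hourly_rate) / subscription_cost

# The example from the text: 5 hours saved at $50/h against a $100/month subscription.
print(ai_roi(5, 50, 100))  # 2.5
```

Plug in your team's actual rates and a conservative estimate of saved hours; if the ratio stays above 1 even under pessimistic assumptions, the business case is easy to defend.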
4. Developer Experience (DevEx)
Developer Experience is the fourth dimension — and often the most underrated. Track the subjective feeling of productivity among the engineers themselves (e.g., based on the SPACE framework).
- Cognitive Load: AI takes over repetitive, tedious tasks, allowing developers to focus on architecture and solving complex business problems.
- Satisfaction: Short surveys measuring satisfaction with AI collaboration. Better working comfort reduces team turnover and improves software delivery stability — which benefits the end client too.
8. Use Cases in Software Development
This section covers practical AI workflows that Practitioners should use daily and Experts should be able to teach and optimize for their teams.
8.1 Understanding an Existing Codebase
When joining a new project or navigating an unfamiliar module, AI can significantly cut onboarding time.
Effective approach:
I'm new to this project. Here is the directory structure:
[paste tree output]
And here is the main entry point:
[paste code]
Please explain:
1. The overall architecture and main components
2. How data flows through the system
3. Key dependencies and their roles
4. Any patterns or conventions I should follow
With terminal agents (Claude Code, Codex CLI): These tools can scan your entire repository and answer questions about code structure, data flow, and dependencies without you needing to paste anything.
Generating a visual Code Map (Claude Code): When you need a shareable visual overview — for onboarding, architecture discussions, or stakeholder communication — Claude Code's official Playground plugin can generate one. Invoke /playground and ask for a code map of your project. It produces a self-contained interactive HTML file that visualizes the codebase's structure, dependencies, and key components.
8.2 Generating Documentation
AI excels at generating documentation drafts that you then review and refine.
Use cases:
- API documentation from code signatures and comments
- README files for new repositories
- Architecture Decision Records (ADRs)
- Inline code comments for complex logic
- Migration guides and changelog entries
- Onboarding guides from existing documentation
Example prompt:
Based on the following controller and service code, generate OpenAPI
documentation in YAML format. Include descriptions for each endpoint,
request/response schemas, error codes, and example payloads.
[paste code]
With terminal agents (Claude Code, Codex CLI): These tools have direct access to your codebase, so they can generate documentation from your actual code without any pasting — just describe what you need documented.
8.3 Coding with Agents
With agent-based coding, you describe what you want at a high level, and the agent plans, implements, tests, and iterates.
Best practices for agentic coding:
Official guides: Claude Code best practices (Anthropic) | GitHub Copilot customization docs | GPT-5 prompting guide — agentic patterns (OpenAI)
- Write a CLAUDE.md / AGENTS.md first — Before you start any serious coding, invest 30 minutes in a context document that describes your project's architecture, conventions, and constraints. This pays for itself within the first task. Keep these files up to date and consistent, but not overloaded — consider what belongs in these files versus in separate instructions or skills
- Identify and offload "friction" tasks first — Before adopting full spec-driven workflows, look for daily, repetitive tasks that disrupt your flow or feel like a chore. You don't need a fully autonomous agent — start with manual, on-demand commands. Create specialized prompts or scripts for common friction points: unit test generation based on your project's testing standards, contextual code review that checks against your team's priorities, localization workflows (generating translations, replacing hardcoded strings with keys), boilerplate scaffolding for new modules or components. These quick wins build confidence and skill before tackling complex agentic workflows
- Use the spec-driven workflow for anything non-trivial (see Section 4)
- Provide feedback, not just approval — When the agent proposes code, explain why something should be different. This teaches it your preferences for subsequent turns
- Run tests after each step — or use pre/post hooks, which most agentic coding tools provide. Don't let the agent pile up five changes before verifying. Small steps, frequent verification
- Keep context window health in mind — Long sessions degrade quality. The agent typically reports context usage after key steps — when it's above ~50%, or you're switching to unrelated work, start a fresh session. For complex plans that require more work than a single session can handle, create a `tasks.md` file and split work across multiple sessions manually — parallelism via sub-agents is not always possible when tasks have sequential dependencies
8.4 Coding with Spec Kit
GitHub Spec Kit enables a structured workflow where specifications are first-class artifacts:
- Initialize with `npx speckit init` to create the `.specify` folder
- Draft a "constitution" (high-level project description) and specifications
- Run `clarify` and `analyze` phases to identify ambiguities
- Break into implementation tasks
- Hand off to any coding agent
This approach is especially powerful for teams where multiple developers (or agents) work on the same feature.
8.5 Writing Good Tests with AI
AI can write tests that compile and pass — but that doesn't mean they test the right things. This is one of the most dangerous pitfalls of AI-assisted development.
The Problem
AI-generated tests tend to:
- Test implementation details instead of behavior — verifying that a method was called (e.g., `mockRepo.Verify(x => x.Add(...), Times.Once)`) rather than checking the actual outcome
- Use trivial assertions — checking `!= null` instead of verifying specific values and states
- Mock everything — making tests pass by definition since they don't exercise any real logic
- Skip edge cases — generating happy-path tests only and ignoring nulls, boundaries, errors, and concurrency
- Pass without catching bugs — the worst kind of tests because they create false confidence
How to Write Good Tests with AI
1. Specify what to test, not just "write tests"
❌ "Write tests for UserService"
✅ "Write unit tests for UserService.AddUserAsync with the following scenarios:
- Successful creation with valid data
- Duplicate email → throws DuplicateEmailException with the email in the message
- Null user object → throws ArgumentNullException
- Empty username → throws ValidationException
- Password hashing is applied (verify hash differs from plaintext)
- CreatedAt audit field is set to current UTC time
- User is NOT persisted when validation fails"
Also provide examples of good tests — this is Few-Shot Prompting in practice, and it makes a noticeable difference in output quality.
2. Demand behavioral tests, not interaction tests
Tell the AI to verify outcomes and state, not just method calls. A test should answer: "Does the system behave correctly?" — not "Did the code run?"
3. Require the AAA pattern explicitly
Ask for Arrange / Act / Assert with one logical assertion per test and descriptive names:
[Test]
public async Task AddUserAsync_WithDuplicateEmail_ThrowsDuplicateEmailException()
4. Validate tests by breaking the code
After the AI generates tests, intentionally introduce a bug in the production code. If the tests still pass, they are not testing real behavior. This is the ultimate quality check.
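A minimal sketch of the idea in Python (the function and the bug are invented for illustration): after planting a bug, a trivial assertion still passes while a behavioral assertion fails — only the latter is doing its job.

```python
# Hypothetical code under test, with an intentionally introduced bug:
# the discount is ADDED instead of subtracted.
def apply_discount(price: float, percent: float) -> float:
    return price + price * (percent / 100)  # bug: should be price - price * (percent / 100)

def trivial_test() -> bool:
    # Only checks "not None" — passes even with the bug (false confidence).
    return apply_discount(100.0, 10.0) is not None

def behavioral_test() -> bool:
    # Checks the specific expected value — fails with the bug (catches it).
    return apply_discount(100.0, 10.0) == 90.0

print(trivial_test())     # True: the test is useless
print(behavioral_test())  # False: the test correctly detects the bug
```

If every test in a generated suite still passes after a deliberate bug like this, send the suite back to the AI with the bug as a counterexample and ask for assertions that would have caught it.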
5. Include a test review checklist
✓ Tests verify behavior, not implementation details
✓ Assertions check specific expected values (not just "not null")
✓ Error cases throw appropriate exceptions with helpful messages
✓ Edge cases are covered (null, empty, boundary values, concurrency)
✓ Tests are independent and can run in any order
✓ Tests use realistic data, not empty objects
✓ Mocks are used sparingly — integration points are tested where appropriate
✓ Tests fail when they should (verified by introducing bugs)
6. Consider property-based testing for complex logic
For algorithms, data transformations, or parsers, ask AI to generate property-based tests that verify invariants across thousands of random inputs, rather than a handful of specific examples.
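In practice you would reach for a library such as Hypothesis (Python) or FsCheck (.NET); the underlying idea can be sketched with the stdlib alone. Here a hypothetical `encode`/`decode` pair is checked against a round-trip invariant over many random inputs:

```python
import random
import string

# Hypothetical functions under test: a trivial run-length-style codec.
def encode(s: str) -> list[tuple[str, int]]:
    out: list[tuple[str, int]] = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1] = (ch, out[-1][1] + 1)
        else:
            out.append((ch, 1))
    return out

def decode(pairs: list[tuple[str, int]]) -> str:
    return "".join(ch * n for ch, n in pairs)

# Property: decoding an encoding returns the original string, for ANY input —
# a single invariant checked against a thousand random cases.
random.seed(0)
for _ in range(1000):
    s = "".join(random.choices(string.ascii_lowercase[:3], k=random.randint(0, 20)))
    assert decode(encode(s)) == s
print("round-trip property holds for 1000 random inputs")
```

Asking the AI to state the invariants first ("what must always be true of this function's output?") and then generate the property checks tends to produce far stronger tests than asking for examples directly.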
9. Staying Up to Date
The AI landscape changes weekly. Tools release new features, models improve, new security vulnerabilities are discovered, and best practices evolve. Staying current is a professional obligation at the Practitioner and Expert levels.
You can find current recommendations in Recommended Learning Resources.
10. Trainings and Certifications
Formal certifications validate your AI competency and demonstrate it to clients. You can find recommended options in Recommended Learning Resources.
Part II — Quick Start
Introduction
This part provides a recommended path for agentic development — a hands-on, step-by-step workflow for coding with an AI agent, from preparing your project all the way to committing reviewed code.
⚠️ This is a logical tutorial, not a technical one. The steps below describe the thinking process and the sequence of decisions you should follow when working with an AI coding agent. They are deliberately abstract and tool-agnostic in their logic — they outline what to do and why, not the exact keystrokes or CLI commands for every scenario. You will need to adapt the details to your specific tool, tech stack, and project context. The goal is to give you a mental framework that works regardless of which agent or IDE you use.
You have read Part I. You understand the landscape — models, tools, costs, security. You know the Core Principles: AI is an accelerator, not a replacement; provide context, iterate, and verify outputs. But now you have a real coding task in front of you and one burning question: "What do I actually do, step by step?"
This Quick Start answers that question. It gives you a concrete, repeatable workflow for agentic development — one that has been tested in real projects and works well as a starting point. It is opinionated on purpose: rather than presenting every possible tool combination and letting you figure it out, it picks one toolset and shows you exactly how to use it.
The toolset for this Quick Start:
- Claude Code — Anthropic's terminal-native agentic coding assistant (also available as a VS Code / JetBrains extension)
- GitHub — version control and collaboration
- Built-in plan mode — Claude Code's native planning capability (no external spec framework)
- GitHub Actions — for automated AI-powered code review in CI/CD
This is not the only way to work. But it is a good-enough way to start right now. Once you are comfortable with this workflow, you can swap components, add spec-driven development frameworks (see Part I, Section 4), or integrate other tools. Adapting a working workflow is always easier than building one from scratch.
Prerequisites: a Claude Pro, Max, or Team subscription (or an Anthropic Console account), plus Git and GitHub set up. See the official Claude Code docs for installation instructions.
Step 1 — Set Up Claude Code
Install Claude Code, get familiar with the interface, and tune its configuration.
1.1 Install Claude Code
Follow the official installation guide to set up Claude Code on your machine: Install Claude Code - Native Install (recommended)
The guide covers installation, authentication, and initial configuration for all supported platforms. Once installed, launch Claude Code from your project's root directory:
claude
On first launch, Claude Code will authenticate with your Anthropic account. It automatically reads CLAUDE.md files in your project and picks up git status, then explores other files as needed during the conversation.
1.2 Getting around Claude Code
Before diving into the workflow, spend a few minutes getting comfortable with how Claude Code actually works. The interface is deceptively simple — a terminal prompt — but there is quite a bit under the surface.
Slash commands
Type / to see the full list of available commands. The ones you will use most often:
| Command | What it does |
|---|---|
| /model | Switch between models mid-conversation. Use Opus for complex reasoning and architecture decisions, Sonnet for routine implementation, Haiku for quick questions. Switching models is free and instant — match the model to the task, not the other way around. |
| /effort | Control how hard the model thinks before responding. Lower effort means faster (and cheaper) answers for straightforward tasks; higher effort gives you deeper reasoning for complex problems. |
| /context | Visualize current context usage as a colored grid. Shows optimization suggestions when context gets heavy — useful for understanding how much room you have left in a long session. |
| /clear | Clear conversation history and free up context. Useful when context gets cluttered after a long session or when you are switching to an unrelated task. Claude Code preserves the previous session so you can resume it later. |
| /usage | Show your plan usage limits and current rate limit status. Useful for checking how much capacity you have left before hitting a rate limit. For per-session token costs, use /cost instead. |
The ! prefix — running shell commands
You do not need to leave Claude Code to interact with your terminal. Prefix any command with ! and it runs directly in your shell:
! git log --oneline -5
! npm test
! docker ps
The output lands in the conversation, so the agent can see it too.
@ — referencing files
When you want the agent to look at a specific file, type @ followed by the path. Claude Code treats this as an explicit reference and loads the file into context:
@src/auth/middleware.ts What does the token validation logic do here?
This is more precise than asking Claude Code to "look at the auth middleware" — it removes ambiguity and avoids unnecessary file search.
Pasting images
Claude Code is multimodal. You can paste screenshots, diagrams, or mockups directly into the conversation. This is practical for:
- Showing a UI bug ("this button should be aligned to the right")
- Sharing a design mockup as a reference for implementation
- Pasting an error screenshot from a browser or mobile device
Just paste the image into the terminal prompt — Ctrl+V in most terminals, Cmd+V in iTerm2, Alt+V on Windows. No special syntax needed.
Extending with plugins
Claude Code supports plugins that add new slash commands and skills. To manage them:
/plugin
This opens the plugin manager where you can discover, install, and remove plugins from official and community marketplaces. A few worth considering early on:
- code-review — structured multi-agent code review (used later in this workflow)
- claude-md-management — helps maintain and improve your CLAUDE.md over time
Plugins live locally — add or remove them at any time. They extend what Claude Code can do without changing the core workflow.
Further reading: The official Claude Code docs cover all of this in depth — keybindings, configuration files, permission modes, MCP integrations, and more. What is listed here is enough to get started.
1.3 Configure Claude Code
Configure permissions thoughtfully. Claude Code asks for permission before running commands or editing files. For production codebases, keep the default approval mode while you build trust. You can use the /permissions command to grant selective access with wildcard syntax:
/permissions
Bash(npm run:*) — allow all npm scripts
Bash(git:*) — allow git operations
Edit(/src/**) — allow edits within src/
⚠️ Security note: Avoid the --dangerously-skip-permissions command-line flag (passed when launching claude) at first, especially on client projects or production codebases. Guardrails exist for a reason — they prevent accidental destructive commands. As you become more advanced, you may find legitimate uses for it, but exercise caution when working with shared repositories.
Pro tip: Set up the status line. Run /statusline without arguments and Claude Code will auto-configure a status line from your shell prompt. It sits at the bottom of your terminal and gives you constant awareness of the current model, mode, and session state — no need to scroll up or guess. Worth doing early.
Step 2 — Prepare Your Project
Before you write your first prompt, invest time in setting up your project so the AI agent can understand your codebase, your standards, and your constraints.
2.1 Create your CLAUDE.md
CLAUDE.md is a project instruction file that Claude Code automatically loads at the start of every session. It tells the agent what the project is about, how it is structured, and what conventions to follow — build commands, testing practices, coding style, architecture decisions, and anything else that shapes how work gets done in the repo.
Generate it with Claude Code itself:
The fastest way to bootstrap a CLAUDE.md is to let Claude Code analyze your project:
> /init
This command scans your project structure, tech stack, and conventions, then generates a CLAUDE.md file automatically. Review the output and refine it.
A CLAUDE.md generated by /init is a solid starting point, but it can't capture everything — team conventions, preferred patterns, or lessons from real sessions tend to surface over time. For an existing CLAUDE.md that could use a refresh, the official claude-md-management plugin helps: its claude-md-improver skill analyzes your current file and suggests concrete improvements, and its /revise-claude-md command updates it with lessons learned from your current session.
Review the generated file carefully and refine it. A good CLAUDE.md is concise (aim for under 100–150 lines), specific (include actual commands, not generic advice), and maintained (update it as the project evolves).
What to put in CLAUDE.md — and what not to:
The key principle: don't explain things the model already knows. You don't need to describe what TDD is, how SOLID works, or what microservices mean. The model knows all of that. Instead, state your preferences and constraints — what you want the model to do differently from its defaults.
- Do: Use TDD with red-green-refactor approach — the model knows what this means, you just need to tell it to do it
- Do: Keep files under 200 lines — split into modules when they grow beyond that
- Do: Prefer composition over inheritance or Use functional style where possible
- Don't: Explain what TDD is, what SOLID stands for, or how dependency injection works
- Don't: Include generic best practices ("write clean code", "use meaningful variable names") — the model does this by default
Every line in CLAUDE.md costs tokens on every session. State preferences concisely, skip the explanations.
/init generates a solid starting structure — architecture, key commands, coding standards, constraints. After generating, review it and add your team's specific preferences that /init couldn't infer from the code alone (e.g., Use TDD (red-green-refactor), Keep files under 200 lines, preferred patterns for error handling or testing).
Pro tip: Reference existing files as examples rather than describing patterns in words. One concrete example file is worth a hundred words of explanation.
2.2 Set up coding standards and linters
The AI agent will follow your coding standards — but only if you make them explicit and enforceable. Before starting agentic work:
- Configure your linter/formatter — ESLint, Prettier, Black, dotnet format, Ruff — whatever your stack requires. Make sure these run via a simple command listed in CLAUDE.md.
- Set up pre-commit hooks or Claude Code hooks — Claude Code supports hooks that run automatically after tool use. Scope them carefully: PostToolUse hooks fire after every single file edit during active sessions. A slow hook here will noticeably degrade your workflow - if the agent edits 20 files, that hook runs 20 times.
Rule of thumb: PostToolUse is for fast, single-file operations only - formatting or linting the one changed file, nothing more. Never run project-wide commands here (full test suite, mypy, eslint ., static analysis across the codebase). For those, use Stop hooks - they fire once when Claude finishes the entire task, not after every individual edit.
Example — fast per-file formatting in PostToolUse:
// .claude/settings.json
{
  "hooks": {
    "PostToolUse": [{
      "matcher": "Write|Edit",
      "hooks": [{
        "type": "command",
        "command": "jq -r '.tool_input.file_path' | grep '\\.py$' | xargs -r black"
      }]
    }],
    "Stop": [{
      "hooks": [{
        "type": "command",
        "command": "npm test"
      }]
    }]
  }
}
In this example, the PostToolUse matcher fires on every Write or Edit tool call; the command reads the hook's JSON payload from stdin, extracts the changed file's path, and runs Black only when that one file is a .py file (fast, scoped to a single file). The full test suite runs once in the Stop hook at the end of the task (slow, project-wide). Notice that the formatter targets the single changed file, not the whole directory - this distinction matters.
- Commit your CLAUDE.md — This file belongs in version control. It is documentation for both humans and AI agents. Keep it updated alongside your codebase.
2.3 Set up AI-powered code review
AI code review works best as a two-layer strategy: iterative in-session review during implementation (primary) and CI-based review on pull requests (optional safety net). The first layer is where the real value lives. It catches issues early, costs almost nothing extra (the context is already loaded), and gives you multiple chances to fix things before the code leaves your machine. The second layer is useful but consumes API credits or subscription quota on every push — treat it as a final gate, not the main mechanism.
Layer 1: Iterative in-session review (primary)
The most effective review happens during implementation, not after it. Because the AI agent already has your codebase in context, in-session review adds minimal overhead while catching issues at the earliest — and cheapest — point.
Recommended practice: multi-pass review before creating a PR
After completing implementation (Step 4) and before creating a pull request, run 2–3 review iterations within your Claude Code session:
- Pass 1 — Built-in code review: Run /code-review (an official plugin for Claude Code) to launch parallel review agents that analyze your changes for logic errors, security issues, and standards violations
- Pass 2 — Project-specific skill: If you have created a custom review skill tailored to your project's specific concerns (domain rules, architecture constraints, common pitfalls), run it as a second pass. Project-specific skills catch things generic review cannot — they know your team's patterns, your client's requirements, and your codebase's known weak spots
- Pass 3 (optional) — Targeted review: If the feature touches security-sensitive code, performance-critical paths, or complex business logic, run a focused third pass with a specific prompt (e.g., "Review only the authorization logic in these changes for OWASP Top 10 vulnerabilities")
Why multiple passes work: Each review iteration operates with slightly different focus and heuristics. In practice, a second or third pass regularly turns up issues the first one missed — subtle logic errors, missing edge cases, convention violations. The cost is marginal (the context is already loaded), while the cost of shipping a bug to production is not.
Creating a project-specific review skill:
A custom review skill (stored as a command in .claude/commands/ or as an Agent Skill) should encode your team's specific review priorities. Example:
<!-- .claude/commands/project-review.md -->
Review the current changes with focus on our project-specific concerns:
1. All database queries use parameterized statements (no string concatenation)
2. New API endpoints have proper authorization attributes
3. Event handlers follow our idempotency pattern (see OrderEventHandler.cs)
4. No direct HttpClient usage — all external calls go through typed clients
5. DTOs use records with required properties, not mutable classes
6. Background jobs have proper retry policies and dead-letter handling
Layer 2: CI-based review on pull requests (optional safety net)
CI-based review should ideally find nothing new — that means Layer 1 did its job. Its value is as a final gate: it runs on a fresh context (no session drift), reviews the complete diff against the base branch, and leaves comments directly on the PR for team visibility.
Cost consideration: CI-based review triggers a full API call with the entire diff context on every push. For active PRs with frequent pushes, this adds up quickly. To manage costs: trigger only on opened and ready_for_review events (not synchronize), use a cost-efficient model (mid-tier rather than frontier), and rely on Layer 1 for iterative fixes. If budget is tight, Layer 1 alone covers most of the value.
Quick setup with Claude Code:
Requires GitHub CLI installed and authenticated (gh auth login).
claude
> /install-github-app
This command interactively walks you through the entire setup — it asks questions, gives you links to open, and tells you what commands to run. By the end, it installs the Claude GitHub App, configures the required secrets, and creates a PR with two workflow files. Once you merge that PR, both workflows are active:
- claude.yml — responds to @claude mentions in PR comments, review comments, and issues. Use it for on-demand questions, fixes, or implementation requests.
- claude-code-review.yml — runs automatic code review on every PR using the code-review plugin. Posts inline findings directly on the PR diff.
Who can set this up? Installing a GitHub App requires admin access to the GitHub repository. However, GitHub organization owners can block repo admins from installing apps — in that case, only the org owner can do it. On Enterprise Cloud plans, enterprise owners can restrict this further. Repository secrets (for the API key or OAuth token) also require GitHub repo admin access. If you don't have it, you'll need to coordinate with someone who does.
Tuning the generated workflows:
The default workflows work out of the box, but they are worth adjusting before you merge the PR. Here are the most impactful changes:
- Control costs: The default review workflow triggers on every push (synchronize), which adds up fast on active PRs. Remove it and keep only opened and ready_for_review:
# in claude-code-review.yml, change:
on:
pull_request:
types: [opened, ready_for_review] # was: [opened, synchronize, ready_for_review, reopened]
- Change the model if needed: In both claude.yml and claude-code-review.yml, add claude_args to the action's with: block. Model aliases (sonnet, opus, haiku) resolve to the latest available version:
- uses: anthropics/claude-code-action@v1
with:
claude_args: "--model sonnet"
Customize what gets flagged by adding a REVIEW.md file to your repository root. This is the official mechanism for review-specific guidance — Claude reads it during code review and treats it as additive rules on top of its default correctness checks. Use it to encode what to always flag, what to skip, and team-specific conventions. Your CLAUDE.md also influences reviews (violations are flagged as nits), but REVIEW.md keeps review-only rules separate.
Example REVIEW.md (adapted from official Claude Code docs):
# Code Review Guidelines
## Always check
- New API endpoints have corresponding integration tests
- Database migrations are backward-compatible
- Error messages don't leak internal details to users
## Style
- Prefer early returns over nested conditionals
- Use structured logging, not f-string interpolation in log calls
## Skip
- Generated files under `src/gen/`
- Formatting-only changes (our linter handles it)
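To illustrate the structured-logging rule above — a Python sketch with a hypothetical logger name: %-style placeholders keep the raw arguments on the log record (so log processors and aggregators can still see them) and defer formatting, while f-strings bake values into the message eagerly:

```python
import logging

logger = logging.getLogger("orders")
logger.setLevel(logging.INFO)

captured: list[logging.LogRecord] = []

class Capture(logging.Handler):
    """Test handler that records every LogRecord it receives."""
    def emit(self, record: logging.LogRecord) -> None:
        captured.append(record)

logger.addHandler(Capture())
user_id = 42

# Preferred: formatting is deferred, and record.args keeps the raw value.
logger.info("user %s placed an order", user_id)

# Flagged in review: the f-string is formatted eagerly, and the raw
# value is lost to downstream log processors.
logger.info(f"user {user_id} placed an order")

assert captured[0].args == (42,)   # raw argument preserved on the record
assert captured[1].args == ()      # value already baked into the message
assert captured[0].getMessage() == captured[1].getMessage()
```

Both calls render the same text, which is exactly why this slips past casual review — encoding the rule in REVIEW.md lets the AI reviewer flag it consistently.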
Further reading: Code Review (Claude Code docs) for full setup and customization options. For security-focused reviews, check out anthropics/claude-code-security-review — a dedicated GitHub Action for OWASP-aligned vulnerability detection.
Step 3 — Plan Before You Code
You have a task to implement. Whether it comes from a GitHub issue, a Jira ticket, or a Slack message, the first step is always the same: plan before you code.
No real task at hand? The best way to follow this Quick Start is with a genuine task from your current project — the learning sticks when the stakes are real. However, if that's not possible (e.g., you're between projects, onboarding, or your current work doesn't fit an agentic workflow yet), create your own mini project instead. Pick something small but non-trivial: a CLI tool, a REST API with a few endpoints, a data pipeline, or a utility library. The goal is to exercise the full workflow — plan, implement, review, commit — so even a personal side project works well.
3.1 Start in plan mode
Most agentic coding tools offer a plan or read-only mode where the agent can explore and reason about your codebase but cannot edit files or run commands. Start here. You want the agent to understand the problem fully before touching any code.
3.2 Write your prompt
Give Claude the task with enough context to produce a meaningful plan. Include the "what" (the desired outcome), link to requirements, and mention any constraints. You do not need to over-specify the "how" — that is what planning is for.
Example prompt in plan mode:
Implement the CSV upload feature from GitHub issue #42.
Requirements:
- Users upload CSV files (up to 50MB) through the web UI
- System validates against the CustomerImport schema
- Invalid rows collected into a downloadable error report
- Processing happens asynchronously with progress indicator
Constraints:
- Use Hangfire for background processing (already in our stack)
- Follow the patterns in OrderService.cs
- All new code needs unit tests following our test patterns
Clarify any details with me before proposing a plan.
Key prompting principles:
- Be specific about outcomes, not implementation steps — let the agent figure out the approach
- Reference existing code if you have it — point to a file that implements a similar pattern, so the agent follows the same approach
- Say "clarify details with me if needed" — this invites the agent to ask questions rather than guess
- Mention what NOT to do if there are common pitfalls (e.g., "Do not add new NuGet packages without asking me first")
3.3 Clarify requirements through dialogue
Good planning is conversational. Claude will likely ask clarifying questions — answer them. Push back on parts of the plan you disagree with. Ask for alternatives. This dialogue phase is where you catch misunderstandings that would be expensive to fix later.
Example exchange:
Claude: "Should the error report include row numbers from the original CSV?"
You: "Yes, include original row numbers and the specific validation error
for each failed row."
Claude: "The current schema validation uses FluentValidation. Should I follow
that pattern or use data annotations?"
You: "Use FluentValidation — it's our standard."
Claude: "I'm proposing to stream the CSV parsing rather than loading the
entire file into memory. This handles the 50MB limit efficiently.
Does that align with your expectations?"
You: "Yes, good call. Make sure it handles different encodings — we've had
issues with UTF-8 BOM and Windows-1252 from customer files."
Iterate until the plan feels right. Two or three rounds of clarification typically produce a solid plan.
3.4 Review and approve the plan
Once Claude presents a complete plan, review it critically:
- Does it cover all requirements from the issue?
- Does it follow the architectural patterns described in CLAUDE.md?
- Are there edge cases it missed?
- Is the scope reasonable — not over-engineered, not under-engineered?
If the plan is solid, approve it and move to implementation — or, even better, save it as a durable artifact first (see 3.5). If not, tell Claude exactly what to change.
3.5 Save the plan as a durable artifact
This is one of the most important habits to build. Before writing any code, ask the agent to save the approved plan to a file (e.g. docs/plans/feature-x-plan.md or simply feature-x-plan.md in the project root).
A saved plan file gives you several advantages:
- Team review before implementation — Share the plan with teammates or the tech lead for feedback before committing to an approach. Catching a wrong assumption in a plan file costs minutes; catching it in code costs hours
- Verification checklist after implementation — Once all code is written, ask the agent to verify the implementation against the plan (see Step 4.4). Agents sometimes skip or simplify details during longer sessions — the plan file keeps them honest
- Frequent context resets without losing progress — You can /clear the context at any point and resume from the plan file in a fresh session. This is particularly valuable for complex features that span multiple sessions
- Audit trail — The plan documents what was agreed and why, which is useful for code review, post-mortems, and onboarding
Context window health: The agent typically reports context usage after planning. If it's above ~50%, or the implementation task is straightforward, run /clear and start implementation with a fresh context — a clean context produces better code. Point the new session to the plan file and continue from there.
Step 4 — Implement with the Agent
4.1 Switch to implementation mode
Exit plan mode (press Shift+Tab to cycle back, or Escape). If you started a new session with the approved plan, point the agent to the plan file. For example:
Implement the plan in /docs/plans/csv-upload.md step by step. After each
logical step, run the tests and show me the results before moving on.
Commit each completed step with a descriptive commit message.
4.2 Work in small, reviewable steps
Do not let the agent implement everything in one shot. Break the work into logical chunks and review at each checkpoint:
Step 1: Data model and validation → review → commit
Step 2: Repository and data access → review → commit
Step 3: Background job and processing logic → review → commit
Step 4: API endpoints → review → commit
Step 5: Unit and integration tests → review → commit
After each step, Claude should run the relevant tests and linters. If anything fails, let it fix the issues before moving on.
What to watch for during implementation:
- Does the code follow your conventions? Check naming, patterns, error handling against what is in
CLAUDE.md - Are tests meaningful? AI-generated tests can look good but test nothing. Verify they test behavior, not just method calls. Ask Claude to introduce a bug and verify the test catches it
- Is it using approved dependencies? The agent might suggest new libraries. Check if they are necessary and approved
- Security considerations — sanitized inputs, no hardcoded credentials, proper authorization checks
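The "introduce a bug and verify the test catches it" check can be sketched as follows — `apply_discount` and its deliberate mutant are hypothetical illustrations of the technique:

```python
def apply_discount(price: float, percent: float) -> float:
    """Return price reduced by the given percentage."""
    return round(price * (1 - percent / 100), 2)

def apply_discount_buggy(price: float, percent: float) -> float:
    """Deliberate mutant: forgets to divide the percentage by 100."""
    return round(price * (1 - percent), 2)

def run_test(fn) -> bool:
    """The test suite, parameterized so we can aim it at the mutant."""
    try:
        assert fn(100.0, 10.0) == 90.0
        assert fn(19.99, 0.0) == 19.99
        return True
    except AssertionError:
        return False

assert run_test(apply_discount) is True         # real code passes
assert run_test(apply_discount_buggy) is False  # the mutant is caught
```

If the test suite still passes against the mutant, the tests are decorative — exactly the failure mode AI-generated tests are prone to. Mutation-testing tools automate this at scale, but a quick manual check like the above is enough during implementation.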
4.3 Use git discipline
Claude Code can manage git operations for you. Ask it to make atomic commits with clear messages after each completed step:
Commit the data model changes with a descriptive message following
conventional commits format (feat:, fix:, etc.)
Small, atomic commits make code review easier and allow you to revert individual changes if something goes wrong. Never let the agent bulk-commit an entire feature in one go.
4.4 Verify implementation against the plan
This is where the plan file from Step 3.5 pays off. After all steps are implemented, point the agent back at the plan and ask it to verify completeness:
Please verify whether all uncommitted changes are consistent
with feature-x-plan.md. Check that every requirement is addressed
and nothing was missed or simplified beyond what was agreed.
In practice, agents often skip or simplify small details during longer sessions — a validation rule that was in the plan but got lost after a /clear, an edge case that seemed obvious but never made it into code. This verification step catches those gaps reliably. It works especially well precisely because the plan is a separate file that was not affected by context resets or session drift.
4.5 Handle context window limits
For complex features that span multiple hours of work, you will hit context limits. The agent reports context usage — when it's above ~50%, or when you're switching to a different part of the task, do the following:
- Save progress — ensure all code is committed
- Update the plan file from Step 3.5 — mark completed steps and note what remains
- Start a new Claude Code session
- Point the new session to the plan file and continue from where you left off
> Read docs/plans/csv-upload.md and continue implementation from step 4.
The previous steps are committed. Verify the existing code before
proceeding.
Step 5 — Review and Finalize
5.1 Run your own code review
After the agent has completed all steps, run the multi-pass in-session review described in Step 2.3 (Layer 1): start with /code-review, then run your project-specific review skill if you have one, and optionally a targeted third pass for security-sensitive or complex changes.
But do not rely solely on automated review. Read the diff yourself:
git diff main..HEAD
Ask yourself: Do I understand every line? Would I be comfortable defending this code in a pull request review? If not, ask Claude to explain the parts you do not understand — and then decide whether the approach is correct.
5.2 Run the full test suite
# Run the complete test suite, not just the new tests
npm test # or dotnet test, pytest, etc.
If anything fails, ask Claude to investigate and fix. Always verify that existing tests still pass — regressions in unrelated code are a common AI pitfall.
5.3 Create the pull request
Once everything passes, create a PR:
> Create a pull request for the CSV upload feature. Reference issue #42.
Include a summary of what was implemented and any decisions made
during planning.
Claude Code will create the branch, push, and open the PR through the GitHub CLI. Your CI pipeline (with the Claude review action you set up in Step 2) will automatically review the PR, and your human teammates can review as well.
5.4 Address review feedback
If reviewers (human or AI) leave comments on the PR, you can address them directly from Claude Code:
> Read the review comments on PR #87 and address each one.
Run tests after each fix.
Or, if using the GitHub integration, simply reply to the review comment with @claude and a description of what to fix.
Step 6 — Evolve Your Workflow
Once you have completed a few tasks with this workflow, you will naturally start optimizing. Here are the most impactful next steps:
Create custom slash commands
For repetitive workflows, create reusable prompt templates in .claude/commands/:
<!-- .claude/commands/new-endpoint.md -->
Create a new API endpoint following the patterns in our existing controllers.
Steps:
1. Create the endpoint in the appropriate controller
2. Add request/response DTOs with FluentValidation
3. Add service layer method
4. Write unit tests following our test patterns
5. Update OpenAPI documentation
6. Run all tests and linter
Endpoint details: $ARGUMENTS
Then invoke it with: /new-endpoint POST /api/customers/import
Build skills for recurring patterns
Skills are the next step beyond slash commands — reusable, self-contained capability packages that the agent loads on demand. Use the skill-creator skill (from the official plugins repo) to generate skills from your existing workflows. If you notice you keep giving the agent the same instructions (e.g., "when writing tests, always use our factory pattern"), that's a skill waiting to be extracted.
Similarly, use the claude-md-management plugin and its claude-md-improver skill to keep your CLAUDE.md up to date with lessons learned from your sessions.
Practical tip — learning from mistakes: After a long session where the agent struggled and eventually found the right approach, open a second session and tell it: "In another session we worked on X and you had trouble with Y. Analyze what went wrong and do what you can — create a skill, update CLAUDE.md, whatever you think will prevent the same mistakes next time." Give the agent a brief summary of what happened — it can access past session files stored in ~/.claude/projects/, but searching through them is hit-or-miss with long conversations. Your summary plus the agent's ability to codify lessons into durable project knowledge (skills, CLAUDE.md updates) is a reliable combination.
Use sub-agents for larger tasks
When a feature has independent components, add "use subagents" to your prompt or planning feedback — the agent handles the rest. Sub-agents run in their own context, so they keep your main conversation clean and are especially useful for tasks that read many files or produce verbose output.
Consider Spec-Driven Development for complex features
The workflow described in this Quick Start uses Claude Code's built-in planning — this is sufficient for most tasks. However, for large, complex features (especially those involving multiple developers or agents), you may benefit from a more structured approach called Spec-Driven Development (SDD).
SDD introduces formal, versioned specification documents that serve as the contract between requirements and implementation. Tools like GitHub Spec Kit, AWS Kiro, and JetBrains Junie provide structured SDD workflows. Claude Code and Codex also support SDD through task files and CLAUDE.md/AGENTS.md conventions.
See Part I, Section 4 — Spec-Driven Development for a detailed explanation, comparison, and practical examples.
Quick Reference — The Workflow at a Glance
┌──────────────────────────────────────────────────────────────────┐
│ AGENTIC DEVELOPMENT WORKFLOW │
├──────────────────────────────────────────────────────────────────┤
│ │
│ SET UP (once) │
│ ├─ Install Claude Code │
│ ├─ Learn the interface (slash commands, !, @) │
│ └─ Configure permissions │
│ │
│ PREPARE (once per project) │
│ ├─ Generate and refine CLAUDE.md │
│ ├─ Set up linters, formatters, hooks │
│ └─ Set up AI-powered code review │
│ │
│ PLAN (every task) │
│ ├─ Enter plan mode (Shift+Tab × 2) │
│ ├─ Describe the task with context and constraints │
│ ├─ Clarify requirements through dialogue │
│ ├─ Review and approve the plan │
│ └─ Save approved plan to a file (e.g. feature-x-plan.md) │
│ │
│ IMPLEMENT (every task) │
│ ├─ Work in small steps with review at each checkpoint │
│ ├─ Run tests after each step │
│ ├─ Make atomic commits with descriptive messages │
│ ├─ Verify implementation against plan file │
│ └─ Monitor context window — /clear when needed │
│ │
│ REVIEW & SHIP (every task) │
│ ├─ Run /code-review + read the diff yourself │
│ ├─ Run full test suite │
│ ├─ Create PR (CI runs automated Claude review) │
│ └─ Address feedback, merge │
│ │
│ EVOLVE (ongoing) │
│ ├─ Create custom slash commands for common workflows │
│ ├─ Use sub-agents for independent components │
│ └─ Explore SDD for complex features (see Part I, Section 4) │
│ │
└──────────────────────────────────────────────────────────────────┘
This playbook is a living document. As tools and practices evolve, so will this guide. Contributions, corrections, and suggestions are welcome.
Core Principles for Working with AI: Read more