Five Levels of AI Development: Why the J-Curve Gets Everyone



90% of developers who consider themselves "AI-native" are sitting at Level 2 out of 5 [1]. Most believe they're done. They are not.

That's not my claim -- it's the central thesis of Dan Shapiro's framework "The Five Levels: from Spicy Autocomplete to the Dark Factory," published January 23, 2026 [1]. Shapiro is CEO of Glowforge and a Wharton Research Fellow, and his model has resonated across the developer community like few other frameworks in recent months.

At the same time, one of the most rigorous studies on the topic -- a randomized controlled trial by METR with 16 experienced open-source developers -- found that AI tools slowed work down by 19% on average [2]. Not sped up. Slowed down. And the developers didn't even notice.

Two data points that seem contradictory at first glance. Look closer, and they tell the same story: the tool was never the problem. How we use it is.

This is Part 1 of a three-part series on the "Dark Factory Gap" in software development. This post covers the why: why most organizations aren't seeing the promised returns on their AI investments. Part 2 dives into the architecture patterns that make Level 4 and 5 possible. Part 3 addresses the consequences for teams, roles, and organizations.

But this post stands on its own. If you take away one thing, let it be this: the gap between "we installed an AI tool" and "we rebuilt our development process around AI" is wider than most people think. And that gap is exactly where companies are burning time and money right now.

The Five Levels: From Spicy Autocomplete to the Dark Factory

Dan Shapiro deliberately modeled his framework on the NHTSA's five levels of driving automation [3]. That's no accident -- the parallel is precise. With autonomous driving, everyone thought Level 2 (Tesla Autopilot) was "almost self-driving." A decade later, we know: the jump from Level 2 to Level 4 isn't incremental. It requires a fundamentally different architecture.

Same thing with software.

Levels of AI Development

Level 0 -- Spicy Autocomplete

Tab completion on steroids. GitHub Copilot in its basic mode. "Not a single character hits the disk without your approval," as Shapiro puts it [1]. The developer writes code, the AI tool suggests the next block, you accept or reject.

This is where most organizations think they're "using AI." And yes, for repetitive boilerplate patterns it saves time. But it changes nothing about the workflow. You develop exactly as before, just with a slightly smarter autocomplete.

Level 1 -- Lane-Keeping and Cruise Control

You hand off discrete tasks to an "AI intern." "Write unit tests for this function." "Generate the Swagger docs for this endpoint." The developer sets pace and direction, the tool handles clearly scoped subtasks.

Development velocity doesn't change fundamentally. You save time on individual tasks, but the overall process stays the same. Like cruise control in a car: it holds the speed, but you're still steering.

Level 2 -- Highway Autopilot

Pair programming with AI. You work interactively, discuss architecture decisions, let the tool generate larger blocks, iterate together. The developer stays in the driver's seat, but AI handles longer stretches.

Shapiro estimates that 90% of "AI-native" developers are here [1]. And here's his central point: "This is where most people think they're done. They are not."

Level 2 feels productive. You feel faster. You have a copilot that thinks along. The problem: you haven't changed your workflow. You're doing the same thing as before, just with a faster tool. That's like cruise control plus lane assist -- more comfortable, but you're still sitting in the seat watching the road.

Level 3 -- Waymo with Safety Driver

Here the role shifts fundamentally: the developer becomes a code reviewer. "Your life is diffs," says Shapiro [1]. AI generates the code, the developer reviews, corrects, approves. You write less code and read more.

It sounds like a small difference from Level 2, but it's a categorical leap. At Level 2 you develop with AI. At Level 3, AI develops and you verify. That requires an entirely different competency: you need to review code you didn't write, at a pace that doesn't eat up the productivity gains.

And this is where almost everyone stalls -- in Shapiro's words: "Almost everyone tops out here" [1]. Why? Because the jump to Level 4 doesn't just require a better tool -- it requires a different way of working. A different organizational structure. Different roles.

Level 4 -- Robotaxi

The developer becomes a product manager. You write specifications, define agent skills, review plans -- and then walk away for 12 hours. When you come back, the code is written, tests are passing, the deployment is ready.

This isn't wishful thinking. It's already happening.

The best-documented example is StrongDM: a team of three -- Justin McCarthy, Jay Taylor, and Navan Chauhan -- has been working at this level since July 14, 2025. Their open-source agent "Attractor" is remarkable: the repository consists essentially of three Markdown specification files [4]. The output: CXDB with 16,000 lines of Rust, 9,500 lines of Go, and 6,700 lines of TypeScript. Simon Willison, one of the most respected voices in the developer community, called it "the most ambitious form of AI-assisted software development I've seen yet" [5].

Three people. Three spec files. Tens of thousands of lines of production code across multiple languages.

Level 5 -- The Dark Factory

The name comes from Fanuc's "lights-out" robotics factory in Japan -- a factory that runs in the dark because no humans are inside to need light. Applied to software: "A black box that turns specifications into software." Teams of fewer than five.

Level 5 is still theoretical for most use cases today. But the direction is clear. And organizations stuck at Level 2 who believe they're done will fall behind -- not in years, but in months.

The J-Curve: Why AI Tools Make You Slower First

Now for the part that hurts.

The METR Study: 19% Slower, Not Faster

In July 2025, METR (Model Evaluation & Threat Research) published a randomized controlled trial of unusual rigor [2]. 16 experienced open-source developers. 246 real tasks in repositories they actively contribute to. With and without AI tools.

The result: AI tools slowed work by 19%. The 95% confidence interval was -40% to -2% -- statistically significant. Not measurement error, not an artifact.

And here's where it gets really interesting: before the study, participants estimated they'd be 24% faster with AI. After the study -- after seeing the results -- they still believed they'd been 20% faster [2]. They weren't just wrong, they were wrong in the wrong direction.

METR Study: Perception vs. Reality

This isn't an isolated finding. The data landscape is complex:

  • Peng et al. [6]: 55.8% faster with GitHub Copilot -- but on exactly one task (an HTTP server in JavaScript). One task. One context. Limited generalizability.
  • Panto AI [7]: Median pull request size increases 17-23% with Copilot. More code doesn't mean better code. More code means more to review, more to test, more to maintain.
  • Sonar [8]: 38% of developers say reviewing AI-generated code takes more effort than reviewing human-written code.
  • Fu et al. [9]: 29.5% of Python snippets and 24.2% of JavaScript snippets from Copilot contain security vulnerabilities.
  • Stack Overflow Developer Survey 2025 [10]: 46% of developers don't trust the accuracy of AI output -- up from 31% the previous year.

The numbers paint a picture that's significantly more nuanced than the marketing narrative of "10x developer productivity." AI tools can deliver massive acceleration on isolated, well-defined tasks. In the full accounting of a complex software project -- with reviews, testing, debugging, and maintenance -- it looks different.

Brynjolfsson's J-Curve: The Theoretical Framework

Erik Brynjolfsson, one of the most influential economists studying technology and productivity, published a theory in the American Economic Journal: Macroeconomics in 2021 that explains why [11].

His thesis: General Purpose Technologies (GPTs) -- and AI coding tools clearly qualify -- require complementary investments before they deliver productivity gains. New processes. New skills. Organizational restructuring. And while those investments are underway, productivity drops first.

That's the J-Curve: down first, then up. The descent can take years. And exactly there -- at the bottom of the J-Curve -- is where most organizations are sitting right now.

The logic is compelling:

  1. Organizations buy AI tools (Copilot licenses, ChatGPT Enterprise, Cursor seats).
  2. Developers use the tools on top of their existing workflows.
  3. Productivity barely changes or even drops (METR effect: context switching, prompt engineering overhead, review overhead) [2].
  4. Management is disappointed, questions the ROI.
  5. Some organizations give up. The others -- the few that persist -- redesign their workflows.
  6. Only now, after the restructuring, do productivity gains materialize.
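The six-step arc above is just the J-curve in slow motion, and its shape is easy to sketch numerically. The model and every parameter below are invented for illustration -- the point is the trajectory (immediate drag, lagged payoff), not the specific numbers.

```python
# Toy model of the productivity J-curve -- all parameters are invented.
# Adoption drag hits early and fades; payoff from the complementary
# investments (new roles, specs, review skills) lags by months.

def productivity(month: int) -> float:
    baseline = 100.0
    # Drag ramps up as workflow churn peaks, then fades as habits settle.
    adoption_drag = 25.0 * month * (0.7 ** month)
    # Gains stay near zero until month 6, then ramp toward a higher plateau.
    lagged_gain = 35.0 * (1 - 0.8 ** max(0, month - 6))
    return round(baseline - adoption_drag + lagged_gain, 1)

trajectory = [productivity(m) for m in range(25)]
bottom = min(trajectory)                                    # floor of the J
recovery = next(m for m, p in enumerate(trajectory) if p > 100.0)

print(f"month 0: {trajectory[0]}, bottom: {bottom}, "
      f"above baseline again at month {recovery}")
```

Change the decay rates and the dip deepens or the recovery stretches out -- which is the practical question for any team: not whether the curve dips, but how long you can afford to sit in it.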

The field studies tell the same story. Brynjolfsson, Li, and Raymond [12] measured roughly 14% productivity gains in a Fortune 500 customer service operation -- that's the "layering" effect, AI assistance on top of existing processes. Dell'Acqua et al. [13] showed what happens when consultants deeply integrate AI into their workflows: 25.1% faster task completion, 12.2% more tasks finished, and over 40% higher quality -- but only within the capability frontier of the model used. Beyond that frontier, results dropped. The numbers come from field studies, not universal constants -- but the pattern is consistent.

The Gap in Numbers

McKinsey published a survey of software organizations in November 2025 that quantifies this [14]: there's a 15-percentage-point gap in performance between the best and worst AI-adopting software organizations. That's enormous.

The explanation isn't in the tools. McKinsey describes explicitly that strong effects require more than pure tool adoption -- it takes role and process overhaul [14]. Sonar quantifies the review overhead [8]. The METR study shows the gap between perception and measurement [2]. Together, a consistent picture emerges: layering AI on existing processes yields marginal gains. Fundamentally redesigned workflows deliver multiples of that.

Layering vs. Deep Integration (Field Studies)

Mollick [15] argues qualitatively in the same direction: not "AI bolted onto the old way," but redesigning processes from the ground up. Most organizations are still in the first phase. That's not a judgment -- organizational change takes time, budget, and the willingness to be slower in the short term. Exactly the resources that are scarcest in day-to-day operations.

Where the Level Model Meets the J-Curve

When you overlay Shapiro's five levels [1] with Brynjolfsson's J-Curve [11], a clear picture emerges:

The J-Curve of AI Adoption

Levels 0-2 = Above the J-Curve: You've installed a tool, see marginal improvements, everything feels fine. No workflow redesign needed. No organizational change. Copilot suggests code, Cursor generates functions, you work as before -- just with an assistant.

Levels 2-3 = The Descent into the J-Curve: You realize the easy wins are exhausted. Reviews become more demanding. Code quality fluctuates. Security issues surface [9]. You invest in prompts, context documents, better instructions -- and measurable productivity drops first. That's exactly what METR measured [2].

Levels 3-4 = The Bottom and the Climb: This is where the real transformation happens. You stop writing code with AI help -- you start writing specifications that AI turns into code. You need different skills (spec writing, code review instead of code writing), different processes (spec-driven instead of iterative development), different roles (PM-developer instead of individual contributor). These are the complementary investments Brynjolfsson describes [11].

Levels 4-5 = The Top: StrongDM with three people and three spec files [4]. That's the upper arm of the J-Curve. But you only get there if you've gone through the bottom.
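The Level 3-4 shift can be sketched as a control loop. Everything below is hypothetical scaffolding -- `generate_code` and `run_tests` are dummy stand-ins for an agent call and a test runner, not any real API. Only the shape of the workflow is the point: the human writes the spec and reviews the result; generation, testing, and retries run unattended.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Attempt:
    code: str
    tests_passed: bool

def generate_code(spec: str, feedback: str = "") -> str:
    # Dummy stand-in: a real loop would call a coding agent here,
    # passing the spec plus feedback from the previous failed attempt.
    revision = "# revised after feedback\n" if feedback else ""
    return f"# generated from spec: {spec}\n{revision}"

def run_tests(code: str) -> bool:
    # Dummy stand-in: a real loop would run the project's test suite.
    # Here it "passes" only once a revision exists, to exercise the retry.
    return "revised" in code

def spec_driven_build(spec: str, max_attempts: int = 5) -> Optional[Attempt]:
    """Generate, test, retry -- hand off to human review only on green tests."""
    feedback = ""
    for _ in range(max_attempts):
        code = generate_code(spec, feedback)
        if run_tests(code):
            return Attempt(code, True)  # next stop: human review, not main
        feedback = "tests failed; revise"
    return None  # budget exhausted: escalate to a human instead of looping

result = spec_driven_build("Store and query conversation context")
```

Production loops add plan review, sandboxing, and cost budgets on top of this skeleton, but the division of labor stays the same: specs and reviews are human, the inner loop is not.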

The structure creates a predictable dynamic: Level 2 feels comfortable. It feels productive. Looking down into the J-Curve from above feels risky. So you stay put. That's not stupidity -- it's a rational response to visible risk with uncertain reward. Every organization that stays at Level 2 has reasons. The system produces this behavior: quarterly targets reward short-term stability, not long-term transformation.

But it remains a local maximum, not a global one. And understanding the structure lets you make a conscious decision about whether and when the descent is worth it.

My Take

I don't know this pattern from observing it at clients. I know it because I've been through it myself. Multiple times.

Organizations buy Terraform. Use it like bash scripts. No state management, no modules, no testing. "Terraform doesn't work." Then you build it properly and suddenly understand why it was designed that way. Kubernetes. Docker. CI/CD. The same pattern every time: buy a tool, layer it on existing processes, be disappointed when the promised revolution doesn't materialize. Then, eventually, rebuild the process and realize: the tool was never the problem. The process was.

At Infralovers, we systematically evaluate AI workflows across different maturity levels -- because our clients aren't all at the same point and face different challenges at each stage. We validate approaches from structured prompting (Level 1) through review-centric workflows (Level 3) to spec-driven development (Level 4) in our own practice. My personal workflow today is at Level 4: I describe the desired outcome, AI agents handle execution, I review. But the real value isn't where we are. It's in the hard-won knowledge about the transitions between levels -- and the ability to meet teams where they are and help them move forward.

But the path wasn't what you'd put on a conference slide. We rebuilt workflows three times before something stuck. There were weeks where the old approach would have been faster. Three rebuilds. Three times "this isn't working." That was our J-Curve. And we had the advantage of going through it as a small team -- short decision loops and the freedom to be slower for two weeks without someone asking for the ROI report. Larger organizations don't have that luxury. More coordination overhead, quarterly planning, budget cycles, reporting structures -- the system creates inertia, not the people in it. A 20-person development team can't just clear two weeks to transition from Level 2 to Level 3. Not without a plan, without backing, and without a realistic picture of what the J-Curve looks like for their specific situation.

We experiment on ourselves so we can meet teams where they are. That doesn't mean "copy our setup." It means: we know the path, we know where it hurts, and we can help plan the descent into the J-Curve so it's survivable.

The gap isn't technological. It's organizational. And it can be closed -- with deliberate decisions, realistic timelines, and the willingness to go through the bottom of the J-Curve instead of standing in front of it.

What This Means for Engineering Leaders

If you're reading this as a CTO, VP Engineering, or Head of Development and wondering why your AI adoption isn't delivering the promised results -- here's my assessment of what helps.

1. Accept the J-Curve as Reality

The phase of lower productivity isn't a sign that AI tools don't work. It's a necessary passage. If your team has been using Copilot for six months and the metrics have barely moved, that's exactly what Brynjolfsson predicted [11]. The question isn't "why isn't this working?" but "which complementary investments are still missing?"

2. Measure with Hard Data

The METR study shows: developers overestimate their AI productivity gains by almost 40 percentage points [2]. That's not a criticism of developers -- it's a fundamental measurement problem. Without hard data, there's no valid picture. Lead time for changes, deployment frequency, change failure rate, time to restore service -- the DORA metrics are a good starting point. But measure before and after, not just after.
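"Measure before and after" can be made concrete with a few lines over a deployment log. The log format and every number below are invented; in practice the records come from your CI/CD system and issue tracker. Time-to-restore is omitted because it needs incident data.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical deployment log: when each deploy shipped, whether it caused
# a failure, and the commit time of the oldest change it contained.
deploys = [
    {"deployed": datetime(2026, 3, d), "failed": f,
     "oldest_commit": datetime(2026, 3, d) - timedelta(hours=h)}
    for d, f, h in [(2, False, 30), (5, True, 52), (9, False, 18),
                    (12, False, 44), (16, False, 21), (19, True, 60),
                    (23, False, 26), (27, False, 35)]
]

days_observed = 28

# Three of the four DORA keys over the observation window:
deploy_frequency = len(deploys) / (days_observed / 7)         # per week
change_failure_rate = sum(d["failed"] for d in deploys) / len(deploys)
lead_time_hours = median(
    (d["deployed"] - d["oldest_commit"]).total_seconds() / 3600
    for d in deploys
)

print(f"{deploy_frequency:.1f} deploys/week, "
      f"{change_failure_rate:.0%} change failure rate, "
      f"{lead_time_hours:.0f}h median lead time")
```

Run this over the quarter before the AI rollout and the quarter after the workflow change -- the delta between those two snapshots is the data the J-curve discussion needs, not anyone's gut feeling.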

3. Distinguish Tool Adoption from Workflow Transformation

Buying Copilot licenses for everyone is tool adoption. Rebuilding your development process so that AI agents can independently implement specifications is workflow transformation. The field studies show the difference: roughly 14% productivity gain from assistance layering [12], over 25% from deep workflow integration [13] -- and McKinsey measures a 15-percentage-point performance gap between the most and least effective software organizations [14].

Concretely: invest not just in licenses, but in spec-writing skills, code review competency for AI-generated code, prompt engineering as a core skill, and the documentation that AI agents need to work autonomously.

4. Start with a Pilot, Not a Rollout

Find one team, one project, one workflow. Give them the freedom (and the time) to move from Level 2 to Level 3. Let them experiment, fail, learn. Document the results. Then scale based on real data, not vendor promises.

5. Plan for the Role Shift

The path from Level 2 to Level 4 changes what a "developer" does. Less code writing, more spec writing. Less implementing, more reviewing. Less individual contributor, more product-manager-like. That's not for everyone. Some developers will thrive, others will struggle. Plan for it.

Concrete Next Steps

  • This week: Ask your team which Shapiro level they think they're at. Compare self-assessment with reality.
  • This month: Identify one workflow suitable for a Level 3 experiment (existing code with good test coverage is ideal).
  • This quarter: Invest in enablement, not more tool licenses. Spec-writing workshops, code review training for AI-generated code, documentation standards for AI-readable codebases.

This is Part 1 of a three-part series on the "Dark Factory Gap" in software development. In Part 2, we dive into the concrete architecture patterns that make Level 4 and 5 possible -- from spec-driven development to digital twins to the question of how StrongDM steers a multi-language project with three Markdown files.


How This Article Was Made

This article was researched and written with AI assistance -- source research via Gemini, ChatGPT and Claude, text generation with Claude Code, multiple rounds of fact-checking with manual source verification. All editorial decisions, assessments, and conclusions are my own. I describe the full workflow in AI-Assisted Knowledge Work: How I Am Rebuilding My Research and Writing Process.


  1. Shapiro, Dan (2026). The Five Levels: from Spicy Autocomplete to the Software Factory. danshapiro.com

  2. METR (2025). Measuring the Impact of AI Tools on Developer Productivity. Randomized controlled trial, 16 experienced open-source developers, 246 tasks. arXiv:2507.09089

  3. NHTSA (2013). Preliminary Statement of Policy Concerning Automated Vehicles. nhtsa.gov (PDF)

  4. StrongDM (2025). Attractor -- Open Source Spec-Driven Agent. github.com/strongdm/attractor

  5. Willison, Simon (2026). How StrongDM's AI team build serious software without even looking at the code. February 7, 2026. simonwillison.net

  6. Peng, Sida; Kalliamvakou, Eirini; Cihon, Peter; Demirer, Mert (2023). The Impact of AI on Developer Productivity: Evidence from GitHub Copilot. arXiv:2302.06590

  7. Panto AI (2025). GitHub Copilot Statistics: PR Size Impact. getpanto.ai

  8. Sonar (2026). State of Code Developer Survey Report. January 8, 2026. sonarsource.com

  9. Fu, Yujia; Liang, Peng; Tahir, Amjed; Li, Zengyang; Shahin, Mojtaba; Yu, Jiaxin; Chen, Jinfu (2025). Security Weaknesses of Copilot-Generated Code in GitHub Projects: An Empirical Study. ACM Transactions on Software Engineering and Methodology (TOSEM). DOI:10.1145/3716848

  10. Stack Overflow (2025). Developer Survey 2025. survey.stackoverflow.co

  11. Brynjolfsson, Erik; Rock, Daniel; Syverson, Chad (2021). The Productivity J-Curve: How Intangibles Complement General Purpose Technologies. American Economic Journal: Macroeconomics, 13(1), 333-372. DOI:10.1257/mac.20180386

  12. Brynjolfsson, Erik; Li, Danielle; Raymond, Lindsey R. (2023). Generative AI at Work. NBER Working Paper 31161. nber.org

  13. Dell'Acqua, Fabrizio; McFowland III, Edward; Mollick, Ethan R. et al. (2023). Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality. Harvard Business School Working Paper 24-013. hbs.edu (PDF)

  14. McKinsey (2025). Unlocking the value of AI in software development. November 3, 2025. mckinsey.com

  15. Mollick, Ethan. One Useful Thing. Qualitative analysis of AI integration in knowledge work, Wharton. oneusefulthing.org
