Five Levels of AI Development: Why the J-Curve Gets Everyone



90% of developers who consider themselves "AI-native" are sitting at Level 2 out of 5 [1]. Most believe they're done. They are not.

That's not my claim -- it's the central thesis of Dan Shapiro's framework "The Five Levels: from Spicy Autocomplete to the Dark Factory," published January 23, 2026 [1]. Shapiro is CEO of Glowforge and a Wharton Research Fellow, and his model has resonated across the developer community like few other frameworks in recent months.

At the same time, one of the most rigorous studies on the topic -- a randomized controlled trial by METR with 16 experienced open-source developers -- found that AI tools slowed work down by 19% on average [2]. Not sped up. Slowed down. And the developers didn't even notice.

Two data points that seem contradictory at first glance. Look closer, and they tell the same story: the tool was never the problem. How we use it is.

This is Part 1 of a three-part series on the "Dark Factory Gap" in software development. This post covers the why: why most organizations aren't seeing the promised returns on their AI investments. Part 2 dives into the architecture patterns that make Level 4 and 5 possible. Part 3 addresses the consequences for teams, roles, and organizations.

But this post stands on its own. If you take away one thing, let it be this: the gap between "we installed an AI tool" and "we rebuilt our development process around AI" is wider than most people think. And that gap is exactly where companies are burning time and money right now.

The Five Levels: From Spicy Autocomplete to the Dark Factory

Dan Shapiro deliberately modeled his framework on the NHTSA's five levels of driving automation [3]. That's no accident -- the parallel is precise. With autonomous driving, everyone thought Level 2 (Tesla Autopilot) was "almost self-driving." A decade later, we know: the jump from Level 2 to Level 4 isn't incremental. It requires a fundamentally different architecture.

Same thing with software.

Levels of AI Development

Level 0 -- Spicy Autocomplete

Tab completion on steroids. GitHub Copilot in its basic mode. "Not a single character hits the disk without your approval," as Shapiro puts it [1]. The developer writes code, the AI tool suggests the next block, you accept or reject.

This is where most organizations think they're "using AI." And yes, for repetitive boilerplate patterns it saves time. But it changes nothing about the workflow. You develop exactly as before, just with a slightly smarter autocomplete.

Level 1 -- Lane-Keeping and Cruise Control

You hand off discrete tasks to an "AI intern." "Write unit tests for this function." "Generate the Swagger docs for this endpoint." The developer sets pace and direction, the tool handles clearly scoped subtasks.

Development velocity doesn't change fundamentally. You save time on individual tasks, but the overall process stays the same. Like cruise control in a car: it holds the speed, but you're still steering.

Level 2 -- Highway Autopilot

Pair programming with AI. You work interactively, discuss architecture decisions, let the tool generate larger blocks, iterate together. The developer stays in the driver's seat, but AI handles longer stretches.

Shapiro estimates that 90% of "AI-native" developers are here [1]. And here's his central point: "This is where most people think they're done. They are not."

Level 2 feels productive. You feel faster. You have a copilot that thinks along. The problem: you haven't changed your workflow. You're doing the same thing as before, just with a faster tool. That's like cruise control plus lane assist -- more comfortable, but you're still sitting in the seat watching the road.

Level 3 -- Waymo with Safety Driver

Here the role shifts fundamentally: the developer becomes a code reviewer. "Your life is diffs," says Shapiro [1]. AI generates the code, the developer reviews, corrects, approves. You write less code and read more.

It sounds like a small difference from Level 2, but it's a categorical leap. At Level 2 you develop with AI. At Level 3, AI develops and you verify. That requires an entirely different competency: you need to review code you didn't write, at a pace that doesn't eat up the productivity gains.

And this is where almost everyone stalls -- in Shapiro's words: "Almost everyone tops out here" [1]. Why? Because the jump to Level 4 doesn't just require a better tool -- it requires a different way of working. A different organizational structure. Different roles.

Level 4 -- Robotaxi

The developer becomes a product manager. You write specifications, define agent skills, review plans -- and then walk away for 12 hours. When you come back, the code is written, tests are passing, the deployment is ready.

This isn't wishful thinking. It's already happening.

The best-documented example is StrongDM: a team of three -- Justin McCarthy, Jay Taylor, and Navan Chauhan -- has been working at this level since July 14, 2025. Their open-source agent "Attractor" is remarkable: the repository consists essentially of three Markdown specification files [4]. The output: CXDB with 16,000 lines of Rust, 9,500 lines of Go, and 6,700 lines of TypeScript. Simon Willison, one of the most respected voices in the developer community, called it "the most ambitious form of AI-assisted software development I've seen yet" [5].

Three people. Three spec files. Tens of thousands of lines of production code across multiple languages.

Level 5 -- The Dark Factory

The name comes from Fanuc's "lights-out" robotics factory in Japan -- a factory that runs in the dark because no humans are inside to need light. Applied to software: "A black box that turns specifications into software." Teams of fewer than five.

Level 5 is still theoretical for most use cases today. But the direction is clear. And organizations stuck at Level 2 who believe they're done will fall behind -- not in years, but in months.

The J-Curve: Why AI Tools Make You Slower First

Now for the part that hurts.

The METR Study: 19% Slower, Not Faster

In July 2025, METR (Model Evaluation & Threat Research) published a randomized controlled trial of unusual rigor [2]. 16 experienced open-source developers. 246 real tasks in repositories they actively contribute to. With and without AI tools.

The result: AI tools slowed work by 19%. The 95% confidence interval was -40% to -2% -- statistically significant. Not measurement error, not an artifact.

And here's where it gets really interesting: before the study, participants estimated they'd be 24% faster with AI. After the study -- after seeing the results -- they still believed they'd been 20% faster [2]. They weren't just wrong, they were wrong in the wrong direction.

METR Study: Perception vs. Reality

This isn't an isolated finding. The data landscape is complex:

  • Peng et al. [6]: 55.8% faster with GitHub Copilot -- but on exactly one task (an HTTP server in JavaScript). One task. One context. Limited generalizability.
  • Panto AI [7]: Median pull request size increases 17-23% with Copilot. More code doesn't mean better code. More code means more to review, more to test, more to maintain.
  • Sonar [8]: 38% of developers say reviewing AI-generated code takes more effort than reviewing human-written code.
  • Fu et al. [9]: 29.5% of Python snippets and 24.2% of JavaScript snippets from Copilot contain security vulnerabilities.
  • Stack Overflow Developer Survey 2025 [10]: 46% of developers don't trust the accuracy of AI output -- up from 31% the previous year.

The numbers paint a picture that's significantly more nuanced than the marketing narrative of "10x developer productivity." AI tools can deliver massive acceleration on isolated, well-defined tasks. In the full accounting of a complex software project -- with reviews, testing, debugging, and maintenance -- it looks different.

Brynjolfsson's J-Curve: The Theoretical Framework

Erik Brynjolfsson, one of the most influential economists studying technology and productivity, published a theory in the American Economic Journal: Macroeconomics in 2021 that explains why [11].

His thesis: General Purpose Technologies (GPTs) -- and AI coding tools clearly qualify -- require complementary investments before they deliver productivity gains. New processes. New skills. Organizational restructuring. And while those investments are underway, productivity drops first.

That's the J-Curve: down first, then up. The descent can take years. And exactly there -- at the bottom of the J-Curve -- is where most organizations are sitting right now.

The logic is compelling:

  1. Organizations buy AI tools (Copilot licenses, ChatGPT Enterprise, Cursor seats).
  2. Developers use the tools on top of their existing workflows.
  3. Productivity barely changes or even drops (METR effect: context switching, prompt engineering overhead, review overhead) [2].
  4. Management is disappointed, questions the ROI.
  5. Some organizations give up. The others -- the few that persist -- redesign their workflows.
  6. Only now, after the restructuring, do productivity gains materialize.
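The six-step arc above is just the J-curve in slow motion, and its shape is easy to sketch numerically. The model and every parameter below are invented for illustration -- the point is the trajectory (immediate drag, lagged payoff), not the specific numbers.

```python
# Toy model of the productivity J-curve -- all parameters are invented.
# Adoption drag hits early and fades; payoff from the complementary
# investments (new roles, specs, review skills) lags by months.

def productivity(month: int) -> float:
    baseline = 100.0
    # Drag ramps up as workflow churn peaks, then fades as habits settle.
    adoption_drag = 25.0 * month * (0.7 ** month)
    # Gains stay near zero until month 6, then ramp toward a higher plateau.
    lagged_gain = 35.0 * (1 - 0.8 ** max(0, month - 6))
    return round(baseline - adoption_drag + lagged_gain, 1)

trajectory = [productivity(m) for m in range(25)]
bottom = min(trajectory)                                    # floor of the J
recovery = next(m for m, p in enumerate(trajectory) if p > 100.0)

print(f"month 0: {trajectory[0]}, bottom: {bottom}, "
      f"above baseline again at month {recovery}")
```

Change the decay rates and the dip deepens or the recovery stretches out -- which is the practical question for any team: not whether the curve dips, but how long you can afford to sit in it.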

The field studies tell the same story. Brynjolfsson, Li, and Raymond [12] measured roughly 14% productivity gains in a Fortune 500 customer service operation -- that's the "layering" effect, AI assistance on top of existing processes. Dell'Acqua et al. [13] showed what happens when consultants deeply integrate AI into their workflows: 25.1% faster task completion, 12.2% more tasks finished, and over 40% higher quality -- but only within the capability frontier of the model used. Beyond that frontier, results dropped. The numbers come from field studies, not universal constants -- but the pattern is consistent.

The Gap in Numbers

McKinsey published a survey of software organizations in November 2025 that quantifies this [14]: there's a 15-percentage-point gap in performance between the best and worst AI-adopting software organizations. That's enormous.

The explanation isn't in the tools. McKinsey describes explicitly that strong effects require more than pure tool adoption -- it takes role and process overhaul [14]. Sonar quantifies the review overhead [8]. The METR study shows the gap between perception and measurement [2]. Together, a consistent picture emerges: layering AI on existing processes yields marginal gains. Fundamentally redesigned workflows deliver multiples of that.

Layering vs. Deep Integration (Field Studies)

Mollick [15] argues qualitatively in the same direction: not "AI bolted onto the old way," but redesigning processes from the ground up. Most organizations are still in the first phase. That's not a judgment -- organizational change takes time, budget, and the willingness to be slower in the short term. Exactly the resources that are scarcest in day-to-day operations.

Where the Level Model Meets the J-Curve

When you overlay Shapiro's five levels [1] with Brynjolfsson's J-Curve [11], a clear picture emerges:

The J-Curve of AI Adoption

Levels 0-2 = Above the J-Curve: You've installed a tool, see marginal improvements, everything feels fine. No workflow redesign needed. No organizational change. Copilot suggests code, Cursor generates functions, you work as before -- just with an assistant.

Levels 2-3 = The Descent into the J-Curve: You realize the easy wins are exhausted. Reviews become more demanding. Code quality fluctuates. Security issues surface [9]. You invest in prompts, context documents, better instructions -- and measurable productivity drops first. That's exactly what METR measured [2].

Levels 3-4 = The Bottom and the Climb: This is where the real transformation happens. You stop writing code with AI help -- you start writing specifications that AI turns into code. You need different skills (spec writing, code review instead of code writing), different processes (spec-driven instead of iterative development), different roles (PM-developer instead of individual contributor). These are the complementary investments Brynjolfsson describes [11].

Levels 4-5 = The Top: StrongDM with three people and three spec files [4]. That's the upper arm of the J-Curve. But you only get there if you've gone through the bottom.
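The Level 3-4 shift can be sketched as a control loop. Everything below is hypothetical scaffolding -- `generate_code` and `run_tests` are dummy stand-ins for an agent call and a test runner, not any real API. Only the shape of the workflow is the point: the human writes the spec and reviews the result; generation, testing, and retries run unattended.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Attempt:
    code: str
    tests_passed: bool

def generate_code(spec: str, feedback: str = "") -> str:
    # Dummy stand-in: a real loop would call a coding agent here,
    # passing the spec plus feedback from the previous failed attempt.
    revision = "# revised after feedback\n" if feedback else ""
    return f"# generated from spec: {spec}\n{revision}"

def run_tests(code: str) -> bool:
    # Dummy stand-in: a real loop would run the project's test suite.
    # Here it "passes" only once a revision exists, to exercise the retry.
    return "revised" in code

def spec_driven_build(spec: str, max_attempts: int = 5) -> Optional[Attempt]:
    """Generate, test, retry -- hand off to human review only on green tests."""
    feedback = ""
    for _ in range(max_attempts):
        code = generate_code(spec, feedback)
        if run_tests(code):
            return Attempt(code, True)  # next stop: human review, not main
        feedback = "tests failed; revise"
    return None  # budget exhausted: escalate to a human instead of looping

result = spec_driven_build("Store and query conversation context")
```

Production loops add plan review, sandboxing, and cost budgets on top of this skeleton, but the division of labor stays the same: specs and reviews are human, the inner loop is not.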

The structure creates a predictable dynamic: Level 2 feels comfortable. It feels productive. Looking down into the J-Curve from above feels risky. So you stay put. That's not stupidity -- it's a rational response to visible risk with uncertain reward. Every organization that stays at Level 2 has reasons. The system produces this behavior: quarterly targets reward short-term stability, not long-term transformation.

But it remains a local maximum, not a global one. And understanding the structure lets you make a conscious decision about whether and when the descent is worth it.

My Take

I don't know this pattern from observing it at clients. I know it because I've been through it myself. Multiple times.

Organizations buy Terraform. Use it like bash scripts. No state management, no modules, no testing. "Terraform doesn't work." Then you build it properly and suddenly understand why it was designed that way. Kubernetes. Docker. CI/CD. The same pattern every time: buy a tool, layer it on existing processes, be disappointed when the promised revolution doesn't materialize. Then, eventually, rebuild the process and realize: the tool was never the problem. The process was.

At Infralovers, we systematically evaluate AI workflows across different maturity levels -- because our clients aren't all at the same point and face different challenges at each stage. We validate approaches from structured prompting (Level 1) through review-centric workflows (Level 3) to spec-driven development (Level 4) in our own practice. My personal workflow today is at Level 4: I describe the desired outcome, AI agents handle execution, I review. But the real value isn't where we are. It's in the hard-won knowledge about the transitions between levels -- and the ability to meet teams where they are and help them move forward.

But the path wasn't what you'd put on a conference slide. We rebuilt workflows three times before something stuck. There were weeks where the old approach would have been faster. Three rebuilds. Three times "this isn't working." That was our J-Curve. And we had the advantage of going through it as a small team -- short decision loops and the freedom to be slower for two weeks without someone asking for the ROI report. Larger organizations don't have that luxury. More coordination overhead, quarterly planning, budget cycles, reporting structures -- the system creates inertia, not the people in it. A 20-person development team can't just clear two weeks to transition from Level 2 to Level 3. Not without a plan, without backing, and without a realistic picture of what the J-Curve looks like for their specific situation.

We experiment on ourselves so we can meet teams where they are. That doesn't mean "copy our setup." It means: we know the path, we know where it hurts, and we can help plan the descent into the J-Curve so it's survivable.

The gap isn't technological. It's organizational. And it can be closed -- with deliberate decisions, realistic timelines, and the willingness to go through the bottom of the J-Curve instead of standing in front of it.

What This Means for Engineering Leaders

If you're reading this as a CTO, VP Engineering, or Head of Development and wondering why your AI adoption isn't delivering the promised results -- here's my assessment of what helps.

1. Accept the J-Curve as Reality

The phase of lower productivity isn't a sign that AI tools don't work. It's a necessary passage. If your team has been using Copilot for six months and the metrics have barely moved, that's exactly what Brynjolfsson predicted [11]. The question isn't "why isn't this working?" but "which complementary investments are still missing?"

2. Measure with Hard Data

The METR study shows: developers overestimate their AI productivity gains by almost 40 percentage points [2]. That's not a criticism of developers -- it's a fundamental measurement problem. Without hard data, there's no valid picture. Lead time for changes, deployment frequency, change failure rate, time to restore service -- the DORA metrics are a good starting point. But measure before and after, not just after.
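"Measure before and after" can be made concrete with a few lines over a deployment log. The log format and every number below are invented; in practice the records come from your CI/CD system and issue tracker. Time-to-restore is omitted because it needs incident data.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical deployment log: when each deploy shipped, whether it caused
# a failure, and the commit time of the oldest change it contained.
deploys = [
    {"deployed": datetime(2026, 3, d), "failed": f,
     "oldest_commit": datetime(2026, 3, d) - timedelta(hours=h)}
    for d, f, h in [(2, False, 30), (5, True, 52), (9, False, 18),
                    (12, False, 44), (16, False, 21), (19, True, 60),
                    (23, False, 26), (27, False, 35)]
]

days_observed = 28

# Three of the four DORA keys over the observation window:
deploy_frequency = len(deploys) / (days_observed / 7)         # per week
change_failure_rate = sum(d["failed"] for d in deploys) / len(deploys)
lead_time_hours = median(
    (d["deployed"] - d["oldest_commit"]).total_seconds() / 3600
    for d in deploys
)

print(f"{deploy_frequency:.1f} deploys/week, "
      f"{change_failure_rate:.0%} change failure rate, "
      f"{lead_time_hours:.0f}h median lead time")
```

Run this over the quarter before the AI rollout and the quarter after the workflow change -- the delta between those two snapshots is the data the J-curve discussion needs, not anyone's gut feeling.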

3. Distinguish Tool Adoption from Workflow Transformation

Buying Copilot licenses for everyone is tool adoption. Rebuilding your development process so that AI agents can independently implement specifications is workflow transformation. The field studies show the difference: roughly 14% productivity gain from assistance layering [12], over 25% from deep workflow integration [13] -- and McKinsey measures a 15-percentage-point performance gap between the most and least effective software organizations [14].

Concretely: invest not just in licenses, but in spec-writing skills, code review competency for AI-generated code, prompt engineering as a core skill, and the documentation that AI agents need to work autonomously.

4. Start with a Pilot, Not a Rollout

Find one team, one project, one workflow. Give them the freedom (and the time) to move from Level 2 to Level 3. Let them experiment, fail, learn. Document the results. Then scale based on real data, not vendor promises.

5. Plan for the Role Shift

The path from Level 2 to Level 4 changes what a "developer" does. Less code writing, more spec writing. Less implementing, more reviewing. Less individual contributor, more product-manager-like. That's not for everyone. Some developers will thrive, others will struggle. Plan for it.

Concrete Next Steps

  • This week: Ask your team which Shapiro level they think they're at. Compare self-assessment with reality.
  • This month: Identify one workflow suitable for a Level 3 experiment (existing code with good test coverage is ideal).
  • This quarter: Invest in enablement, not more tool licenses. Spec-writing workshops, code review training for AI-generated code, documentation standards for AI-readable codebases.

This is Part 1 of a three-part series on the "Dark Factory Gap" in software development. In Part 2, we dive into the concrete architecture patterns that make Level 4 and 5 possible -- from spec-driven development to digital twins to the question of how StrongDM steers a multi-language project with three Markdown files.


How This Article Was Made

This article was researched and written with AI assistance -- source research via Gemini, ChatGPT and Claude, text generation with Claude Code, multiple rounds of fact-checking with manual source verification. All editorial decisions, assessments, and conclusions are my own. I describe the full workflow in AI-Assisted Knowledge Work: How I Am Rebuilding My Research and Writing Process.


  1. Shapiro, Dan (2026). The Five Levels: from Spicy Autocomplete to the Software Factory. danshapiro.com

  2. METR (2025). Measuring the Impact of AI Tools on Developer Productivity. Randomized controlled trial, 16 experienced open-source developers, 246 tasks. arXiv:2507.09089

  3. NHTSA (2013). Preliminary Statement of Policy Concerning Automated Vehicles. nhtsa.gov (PDF)

  4. StrongDM (2025). Attractor -- Open Source Spec-Driven Agent. github.com/strongdm/attractor

  5. Willison, Simon (2026). How StrongDM's AI team build serious software without even looking at the code. February 7, 2026. simonwillison.net

  6. Peng, Sida; Kalliamvakou, Eirini; Cihon, Peter; Demirer, Mert (2023). The Impact of AI on Developer Productivity: Evidence from GitHub Copilot. arXiv:2302.06590

  7. Panto AI (2025). GitHub Copilot Statistics: PR Size Impact. getpanto.ai

  8. Sonar (2026). State of Code Developer Survey Report. January 8, 2026. sonarsource.com

  9. Fu, Yujia; Liang, Peng; Tahir, Amjed; Li, Zengyang; Shahin, Mojtaba; Yu, Jiaxin; Chen, Jinfu (2025). Security Weaknesses of Copilot-Generated Code in GitHub Projects: An Empirical Study. ACM Transactions on Software Engineering and Methodology (TOSEM). DOI:10.1145/3716848

  10. Stack Overflow (2025). Developer Survey 2025. survey.stackoverflow.co

  11. Brynjolfsson, Erik; Rock, Daniel; Syverson, Chad (2021). The Productivity J-Curve: How Intangibles Complement General Purpose Technologies. American Economic Journal: Macroeconomics, 13(1), 333-372. DOI:10.1257/mac.20180386

  12. Brynjolfsson, Erik; Li, Danielle; Raymond, Lindsey R. (2023). Generative AI at Work. NBER Working Paper 31161. nber.org

  13. Dell'Acqua, Fabrizio; McFowland III, Edward; Mollick, Ethan R. et al. (2023). Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality. Harvard Business School Working Paper 24-013. hbs.edu (PDF)

  14. McKinsey (2025). Unlocking the value of AI in software development. November 3, 2025. mckinsey.com

  15. Mollick, Ethan. One Useful Thing. Qualitative analysis of AI integration in knowledge work, Wharton. oneusefulthing.org
