Andrej Karpathy, Code Agents, and the New Skill of Managing AI Loops

Software engineering is starting to feel less like writing code line by line and more like directing a swarm of semi-autonomous workers.

That is one of the most interesting ideas from Andrej Karpathy’s conversation on No Priors, where he described the sudden shift from traditional coding to working with AI code agents. His point was not simply that AI tools can autocomplete functions or help debug errors. The bigger change is that programmers can now delegate larger chunks of work, run multiple agents in parallel, and spend more time shaping intent, reviewing outputs, and designing loops.

In other words, the bottleneck is moving.

For a long time, the software developer was limited by typing speed, implementation time, and the number of things one person could hold in mind while working on a repository. With code agents, Karpathy suggests that the human bottleneck becomes something else: taste, judgment, task decomposition, instruction quality, verification, and the ability to manage many AI-driven processes at once.

This is why the moment feels both empowering and psychologically strange.

From writing code to directing agents

Karpathy jokes that “code” may no longer be the right verb. The work is becoming less about manually producing every line and more about expressing intent to agents.

That change matters because it moves software work into larger units of action. Instead of saying, “write this function,” a developer can say, “add this feature,” “research this approach,” “compare these options,” or “try several implementations and report back.”

This is not just faster autocomplete. It is a change in abstraction level.

The developer becomes more like a technical director. One agent might explore a design. Another might implement a feature. Another might review logs. Another might test a hypothesis. The human moves between them, checks their work, corrects direction, and decides what is worth keeping.
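
As a rough illustration of that director role, here is a minimal sketch of running several agent tasks in parallel and collecting their outputs for review. The run_agent function is a placeholder for whatever agent backend is in use, not a real API, and the task descriptions are invented for the example.

    import asyncio

    async def run_agent(task: str) -> str:
        # Placeholder: a real implementation would call an agent API here.
        await asyncio.sleep(0)  # simulate handing the task off
        return f"[draft output for: {task}]"

    async def direct_agents() -> dict[str, str]:
        # Each task is a larger unit of work, not a single edit.
        tasks = {
            "design": "Explore two designs for the caching layer and compare trade-offs.",
            "feature": "Implement retry logic for the upload endpoint, with tests.",
            "logs": "Review yesterday's error logs and summarize recurring failures.",
        }
        # Launch the agents concurrently and wait for all of them to finish.
        results = await asyncio.gather(*(run_agent(t) for t in tasks.values()))
        # The human then reviews each output and decides what is worth keeping.
        return dict(zip(tasks, results))

    print(asyncio.run(direct_agents()))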

That sounds efficient, but it also creates a new kind of pressure. If an agent is working and the human is waiting, the obvious next thought is: why not run another agent? And another?

Karpathy compares this feeling to the old anxiety of unused GPUs during machine learning research. If compute was available and not running, something felt wasted. Now the same emotional logic appears with tokens. If your AI subscriptions, context windows, or agent sessions are idle, it can feel as if you are not using your available leverage.

This is where the “skill issue” framing appears. When an AI workflow fails, that does not always mean the model is incapable. It may mean the human has not yet found the right instructions, memory structure, workflow, verification method, or agent setup.

That is both encouraging and exhausting. It means there is always something to improve.

The rise of macro-actions in programming

The most useful concept here is the macro-action.

Traditional coding is full of micro-actions: write a line, rename a variable, edit a file, run a test, fix a bug. Agentic coding allows larger moves. A developer can now ask for a whole feature, an architectural plan, a refactor, a benchmark, or a research pass.
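
One way to picture the change in abstraction level: a micro-action is a single edit, while a macro-action reads more like a work order, with a goal, boundaries, and a definition of done. The sketch below only illustrates that shape; the field names are invented for this post, not part of any agent framework.

    from dataclasses import dataclass, field

    @dataclass
    class MacroAction:
        # A macro-action is closer to a work order than to a single edit.
        goal: str                                             # what the agent should achieve
        constraints: list[str] = field(default_factory=list)  # boundaries to respect
        acceptance: list[str] = field(default_factory=list)   # how success is checked

    refactor = MacroAction(
        goal="Replace the ad-hoc config parsing with one validated settings module.",
        constraints=["Do not change the public CLI flags.", "Keep the diff reviewable."],
        acceptance=["All existing tests pass.", "The new module has its own tests."],
    )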

This does not remove the need for engineering skill. It changes where the skill sits.

The human still has to know what to ask for. The human has to notice when the agent is confidently wrong. The human has to understand whether two parallel tasks will interfere with each other. The human has to review outputs, protect the codebase, and decide when the work is good enough.

So the job does not disappear into magic. It becomes more managerial, more architectural, and in some ways more demanding.

Bad instructions can waste enormous amounts of compute. Weak review can merge bad work. Poor task decomposition can create loops where agents produce plausible but useless output. The new productivity is real, but it is not free.

Why AutoResearch matters

Karpathy’s idea of AutoResearch pushes this logic further.

The goal is to remove the human from the loop wherever the loop can be automated. Instead of a researcher constantly checking results, adjusting parameters, and deciding the next experiment, an AI system can be given an objective, a metric, boundaries, and permission to keep trying.

This works especially well when there is a clear evaluation method.

If the system is optimizing validation loss, benchmark speed, model performance, or a similar measurable target, then many experiments can be automated. The agent can try changes, evaluate them, compare results, and continue. The human does not need to inspect every intermediate step.
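
That loop can be sketched in a few lines. In the sketch below, propose_change and evaluate are stand-ins for the agent and the metric (an invented proxy for validation loss); the point is only the shape of the loop: try a change, measure it, keep what helps, stop at a budget.

    import random

    def propose_change(best_config: dict) -> dict:
        # Stand-in for an agent suggesting a tweak, e.g. a learning-rate change.
        new = dict(best_config)
        new["lr"] = best_config["lr"] * random.choice([0.5, 1.0, 2.0])
        return new

    def evaluate(config: dict) -> float:
        # Stand-in for a real training run reporting validation loss (lower is better).
        return abs(config["lr"] - 3e-4) + random.uniform(0.0, 0.01)

    def auto_research(budget: int = 20) -> tuple[dict, float]:
        best, best_loss = {"lr": 1e-3}, float("inf")
        for _ in range(budget):            # a fixed budget acts as a safety boundary
            candidate = propose_change(best)
            loss = evaluate(candidate)
            if loss < best_loss:           # keep only changes that measurably help
                best, best_loss = candidate, loss
        return best, best_loss

    print(auto_research())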

Karpathy describes using this kind of loop to improve training settings in a small language-model project. What is interesting is not only that the agent found improvements. It is that the workflow hints at a larger pattern: AI systems may increasingly improve other AI systems through automated experimentation.

That is a major idea for computer science and machine learning.

Research has often been limited by human attention. A researcher has ideas, runs experiments, reads results, adjusts assumptions, and tries again. But when the evaluation is objective enough, much of that loop can be delegated. The human role shifts toward setting goals, defining safe boundaries, designing metrics, and interpreting the larger direction.

But not everything can be automated

Karpathy also gives an important warning: these systems work best in verifiable domains.

If you can measure success clearly, agents can improve quickly. Faster CUDA kernels, better validation loss, passing unit tests, benchmark improvements – these are good fits. The system can try many things and keep what works.

But many human tasks are not like that.

Nuance, taste, humor, judgment, emotional context, unclear intent, and social interpretation are harder to evaluate. Karpathy points to the strange “jaggedness” of current AI systems: they can seem like brilliant technical collaborators in one moment and oddly naive in the next.

This is one of the most grounded parts of the discussion. The progress is real, but uneven.

Current models can perform impressive agentic tasks, yet still fail at soft judgment. They can write serious code, but still produce generic jokes. They can move quickly in domains optimized by reinforcement learning and objective feedback, but they can wander when the target is vague.

That means the future of AI work is not simply “the models get smarter at everything.” It may be more uneven. Some areas will accelerate dramatically. Others will remain dependent on human judgment for longer than the hype suggests.

The agentic web: when software becomes API-first

Another important idea from the conversation is that software itself may be reorganized around agents.

Karpathy describes building a home automation assistant that can interact with devices through APIs and natural language. The deeper point is not the smart home example. The deeper point is that many apps may exist mainly because humans needed interfaces.

If agents become the main users of software on our behalf, then the ideal interface may not be another app screen. It may be a clean API, good documentation, clear permissions, and reliable tool access.

This has large implications for software engineering.

Many current apps are built around human navigation: menus, buttons, dashboards, settings pages, onboarding flows. But an AI agent does not need all of that in the same way. It needs structured access, predictable behavior, safe permissions, and enough context to act correctly.

That suggests a future where some software becomes less visible to humans and more legible to agents.

The user might simply say what they want. The agent decides which tools to call, which APIs to use, what information to gather, and what actions to perform. In that world, a lot of software becomes infrastructure rather than destination.
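
In practice, the tools such an agent calls often look like a plain function plus a machine-readable description, rather than another screen. The sketch below, loosely tied to the home-automation example above, shows one common shape for that; the function and the schema fields are illustrative, not any particular vendor’s format.

    def set_thermostat(room: str, target_c: float) -> dict:
        """Set the target temperature for one room and return the new state."""
        # A real implementation would call the device API and enforce permissions.
        if not 5.0 <= target_c <= 30.0:   # a predictable, safe boundary
            raise ValueError("target_c outside the allowed range")
        return {"room": room, "target_c": target_c, "status": "ok"}

    # A structured description an agent can read instead of a settings page.
    SET_THERMOSTAT_TOOL = {
        "name": "set_thermostat",
        "description": "Set the target temperature for a single room.",
        "parameters": {
            "type": "object",
            "properties": {
                "room": {"type": "string"},
                "target_c": {"type": "number", "minimum": 5, "maximum": 30},
            },
            "required": ["room", "target_c"],
        },
    }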

What this means for programmers

The obvious fear is that code agents will reduce the need for programmers.

Karpathy’s view is more complicated. He suggests that software may become cheaper, which can unlock more demand. This resembles the Jevons paradox: when something becomes easier and cheaper to produce, total use can increase instead of decrease.

If software becomes less expensive to create, more people and companies may want custom tools, internal automations, experiments, prototypes, dashboards, workflows, and products. Code may become more ephemeral, more adjustable, and more closely shaped around specific needs.

That does not mean every programming job is safe. It does mean the simple story of “AI writes code, so programmers disappear” is probably too crude.

The more realistic shift is that programmers will need to become better at:

  • breaking work into agent-friendly tasks
  • writing precise instructions
  • reviewing AI-generated code
  • building evaluation loops
  • understanding systems well enough to catch subtle failures
  • using agents without blindly trusting them

Programming may become less about remembering syntax and more about designing reliable technical processes.
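
As one concrete example of such a process, a review gate can refuse an agent’s patch unless the existing test suite still passes. The sketch below assumes a pytest-based project; it is a starting point for a gate, not a complete review workflow.

    import subprocess

    def gate_agent_patch(repo_dir: str) -> bool:
        """Run the test suite against the agent's changes; accept only if it passes."""
        result = subprocess.run(
            ["python", "-m", "pytest", "-q"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            # Tests failed: surface the output to the human instead of merging.
            print(result.stdout[-2000:])
            return False
        return True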

Education may change too

Karpathy also makes a subtle point about education.

In the past, if someone created a technical project, they often needed to write documentation for humans: tutorials, guides, explanations, lectures, examples. But if agents can understand the project and explain it in many ways, then the author’s job changes.

Instead of explaining everything directly to every possible learner, the author may need to explain the core structure clearly enough for agents. Then the agent can translate that explanation into the learner’s level, language, goals, and questions.

This does not make good teaching irrelevant. It changes what good teaching produces.

The most valuable contribution may be the distilled insight, the clean structure, the right learning path, the few bits that are hard to discover. Agents can expand, personalize, and repeat. But humans still need to provide the conceptual compression that makes the subject understandable.

This is a useful distinction for anyone interested in AI, programming, science, mathematics, or education. The future may reward people who can create clear conceptual maps, not just long explanations.

The real lesson: leverage needs judgment

The conversation around code agents can easily become either panic or hype.

The more useful interpretation is that AI is changing the unit of work. In programming and research, people can now operate at a higher level of abstraction. They can delegate more, test more, parallelize more, and automate loops that used to require constant human attention.

But leverage without judgment is fragile.

Agents still make strange mistakes. They still need boundaries. They still perform best when the goal is measurable. They can amplify productivity, but they can also amplify confusion if the human does not know what good output looks like.

The future skill may not be “coding” in the old sense.

It may be knowing how to create the loop, define the metric, guide the agents, verify the result, and still think clearly when the system starts moving faster than one person can manually control.
