Building Real-World Agentic Software in the Age of Software 3.0

The software world is shifting. Again. With the rise of Large Language Models (LLMs), we’re witnessing the emergence of what Andrej Karpathy, Tesla’s former Director of AI, calls Software 3.0 - a paradigm where agents, not traditional APIs or scripts, are central to how systems are designed and run.
But between the keynote stages and investor decks, there’s a tension: the hype of autonomous agents promising to replace entire workflows overnight vs. the reality of building systems that are stable, repeatable, and safe.
The truth lies in between - and it’s unfolding now.
Software 3.0: From Code to Language, From Functions to Agents
“English is the new programming language.” - Jensen Huang, CEO of NVIDIA
Karpathy's keynote at AI Startup School lays out the skeleton of Software 3.0: programming moves from strict syntax to natural language, and LLMs act not as assistants but as core infrastructure. He describes today’s LLM agents as "stochastic (probabilistic) simulations of people" - capable of complex reasoning, but also unpredictable and non-deterministic - a framing only Andrej could come up with.
This shift opens up a powerful new dimension: we can “vibe code” systems - experimenting, iterating, and building with creative AI tooling.
But as Karpathy and others note, vibe coding alone is not enough for production systems.
When there is a bug in the application, you cannot tell your client, “I’ve exhausted my tokens for the day, let me fix it up tomorrow.”
The Case for Repeatable, Auditable Agents
In real-world use, you don’t want agents rethinking every execution. You want them to:
Draft, implement, test, and improve repeatable processes
Follow a defined contract with guardrails and fallbacks
Be composable, so you can chain them into larger workflows
Be observable, so decisions can be explained and audited
I don’t want agents to design every individual process execution from scratch. That would make the process unreliable, unauditable, and very expensive.
Repeatability isn’t a constraint. It’s the foundation for governance, reuse, and commercialization - whether you’re chaining agent services in an enterprise flow or exposing them as paid APIs.
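The "defined contract with guardrails and fallbacks" above can be sketched in code. This is a minimal illustration, not a reference implementation; all names (`AgentContract`, `run_with_contract`, the schemas) are hypothetical:

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

@dataclass
class AgentContract:
    """Hypothetical contract an agent must satisfy before joining a workflow."""
    name: str
    input_schema: Dict[str, type]     # fields the caller must supply
    output_schema: Dict[str, type]    # fields the agent guarantees downstream
    max_cost_usd: float               # guardrail: refuse runs above this budget
    fallback: Callable[[dict], dict]  # deterministic fallback if the agent fails

def run_with_contract(contract: AgentContract,
                      agent: Callable[[dict], dict],
                      payload: dict,
                      estimated_cost_usd: float = 0.0) -> dict:
    """Validate inputs, run the agent, and fall back on any failure."""
    missing = [k for k in contract.input_schema if k not in payload]
    if missing:
        raise ValueError(f"payload missing fields: {missing}")
    try:
        if estimated_cost_usd > contract.max_cost_usd:
            raise RuntimeError("over budget")   # guardrail: refuse expensive runs
        result = agent(payload)
    except Exception:
        result = contract.fallback(payload)     # deterministic fallback, never a raw failure
    # enforce the output contract so the next agent in the chain can rely on it
    return {k: result.get(k) for k in contract.output_schema}
```

Because every agent emits the shape its contract promises, the output of one `run_with_contract` call can be fed straight into the next - which is exactly what makes the agents composable.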
Don’t Forget the NFRs: Cost, Duration, Compliance
A critical yet often overlooked requirement is non-functional transparency. If agents are to be used at scale, they need to publish and adhere to metadata covering:
Duration (performance)
Cost per execution
Compliance footprints
Failure and fallback handling
Audit logs
This lets another agent or a human decide - intelligently and accountably - whether to invoke a particular agent service in a larger process chain.
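One way to make those NFRs machine-readable is to emit a structured record per execution. A minimal sketch, assuming a Python setting; `ExecutionRecord` and `timed_call` are illustrative names, not an established API:

```python
import json
import time
from dataclasses import dataclass, asdict
from typing import Any, Callable, List, Tuple

@dataclass
class ExecutionRecord:
    """Hypothetical NFR metadata an agent service publishes for every run."""
    agent: str
    duration_s: float          # performance
    cost_usd: float            # cost per execution
    compliance_tags: List[str] # e.g. ["gdpr", "no-pii-retained"]
    fallback_used: bool        # failure and fallback handling

    def to_audit_line(self) -> str:
        # one JSON line per execution doubles as the audit log entry
        return json.dumps(asdict(self))

def timed_call(agent_name: str,
               fn: Callable[[dict], Any],
               payload: dict,
               cost_usd: float,
               compliance_tags: List[str]) -> Tuple[Any, ExecutionRecord]:
    """Run an agent call and return its result alongside its NFR record."""
    start = time.monotonic()
    fallback_used = False
    try:
        result = fn(payload)
    except Exception:
        result, fallback_used = None, True
    duration = round(time.monotonic() - start, 4)
    return result, ExecutionRecord(agent_name, duration, cost_usd,
                                   compliance_tags, fallback_used)
```

A caller - human or agent - can inspect the record before deciding whether this service is cheap, fast, and compliant enough to slot into a larger chain.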
Michael Truell, CEO of Cursor, made headlines in a recent Y Combinator conversation by consistently level-setting the AI agent narrative. When the interviewer pushed visions of fully autonomous software engineering agents, Truell brought the discussion back to earth:
Agent-based development is real, but only for specific, narrow workflows
Human taste and review remain crucial
Guardrails, observability, and rollback mechanisms are essential
You can’t deploy vibe code without the engineering muscle behind it
Cursor's approach reflects what agentic systems should be: powerful, but bounded.
CI Pipelines: Where Vibe Coding Does Work - With Guardrails
Not all automation needs GPT-4-level cognition. In fact, CI/CD pipelines and DevOps flows are a perfect example where:
“Vibe coding” via agents can be extremely productive
The domain is structured and repeatable
Standardization allows the use of smaller, cheaper models
Prompt engineering + platform engineering enables reuse and compliance
In my experience, CI pipelines can be vibe coded. But you need to create the right systems - platform engineering and prompt engineering as part of the agentic software - and standardization is key.
This represents a realistic middle ground:
Creative LLMs to scaffold and evolve pipelines
Structured platforms to capture, govern, and improve them
Smaller models for performance-efficient execution within bounded use cases
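The "smaller models for bounded use cases" idea reduces to a routing decision. A hedged sketch - the model names and thresholds are placeholders, and real routers would weigh more signals than these two:

```python
# Placeholder model identifiers - substitute whatever your platform actually runs.
SMALL_MODEL = "small-model"
LARGE_MODEL = "large-model"

def route_model(task: dict) -> str:
    """Route standardised, bounded tasks to a small model; escalate the rest.

    Assumed task fields (hypothetical):
      template_id     - set when the task matches a governed, reusable template
      tokens_estimate - rough size of the prompt/context
    """
    if task.get("template_id") and task.get("tokens_estimate", 0) < 2000:
        return SMALL_MODEL   # structured, repeatable work, e.g. a CI pipeline step
    return LARGE_MODEL       # open-ended work still needs the bigger model
```

The point of the sketch: once CI work is captured as templates, the router can prove most executions never need the expensive model.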
Building Blocks for Real Agentic Systems
Here’s what a production-grade agentic platform should look like:
| Component | Purpose |
| --- | --- |
| Prompt Registry | Version and reuse proven prompts |
| Agent Templates | Define repeatable, parameterised behaviour |
| Agent Contracts | Standardise I/O, NFRs, and fallbacks |
| Observability Layer | Track outcomes, performance, and compliance |
| Model Router | Assign small vs large models based on context |
| Human Oversight Hooks | Ensure discretion where it matters |
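To make the first component concrete: a prompt registry is essentially versioned storage with a "latest" lookup. A minimal in-memory sketch, assuming Python; `PromptRegistry` and its methods are illustrative, and a production version would persist to a database and record provenance:

```python
from typing import Dict, Optional, Tuple

class PromptRegistry:
    """Minimal sketch of a versioned prompt registry (names are illustrative)."""

    def __init__(self) -> None:
        # (prompt name, version) -> template string
        self._prompts: Dict[Tuple[str, int], str] = {}

    def register(self, name: str, template: str) -> int:
        """Store a new version of a prompt and return its version number."""
        version = sum(1 for (n, _) in self._prompts if n == name) + 1
        self._prompts[(name, version)] = template
        return version

    def get(self, name: str, version: Optional[int] = None) -> str:
        """Fetch a specific version, or the latest if none is given."""
        if version is None:
            version = max(v for (n, v) in self._prompts if n == name)
        return self._prompts[(name, version)]
```

Pinning a workflow to an explicit version is what makes prompt changes auditable: an agent template references `("ci-scaffold", 2)` rather than whatever text happens to be current.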
Finally...
Software 3.0 isn’t just about generative intelligence. It’s about engineering systems where intelligent behaviour becomes repeatable, composable, and trustworthy.
That means:
Grounding agent systems in standardisation and contracts
Prioritising non-functional requirements
Treating prompt engineering as software engineering
And accepting that platforms, not just models, will define the winners in this new wave