TL;DR
- Claude Code built a complete programming language with LLVM backend over 3 months
- Human involvement: ~20 hours total (5% of time) for architecture decisions and debugging
- Used Ralph Wiggum loop pattern: each session inherits previous work automatically
- Best for: Complex multi-month projects with clear specifications and testable milestones
- Key insight: Quality of specification determines quality of output
A developer let Claude Code run autonomously for three months, producing a complete programming language with lexer, parser, Hindley-Milner type inference, and native binary compilation via LLVM.
Geoffrey had an ambitious goal.
Build a complete programming language. Lexer, parser, type system, compiler backend. All the way down to LLVM code generation.
“Not a toy language. A real language that could compile to native binaries.”
Programming languages are famously complex projects. Teams spend years on them. Companies dedicate departments.
Geoffrey decided to let Claude run. And keep running.
For three months.
The Setup
Geoffrey used the Ralph Wiggum technique — a loop that catches Claude’s exit and re-feeds the original prompt.
“Every time Claude finished a piece, the loop would restart it with the updated codebase and continue the work.”
The original prompt described the language: syntax, semantics, type system, compilation target. Everything Claude needed to know about what to build.
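The loop mechanics can be sketched in a few lines. This is an illustrative Python sketch, not Geoffrey's actual harness: `run_session` stands in for whatever invokes the coding agent, and `is_done` for whatever completion check (e.g. the full test suite passing) ends the run. The key property is that the prompt never changes; all progress lives in the codebase on disk.

```python
def ralph_wiggum_loop(prompt, run_session, is_done, max_iterations=1000):
    """Re-feed the same prompt every time the agent exits.

    State between iterations is carried by the codebase on disk,
    not by the conversation: each session starts fresh but inherits
    every file the previous sessions wrote.
    """
    for i in range(1, max_iterations + 1):
        run_session(prompt)   # e.g. invoke a coding-agent CLI (hypothetical)
        if is_done():         # e.g. "does the full test suite pass yet?"
            return i          # number of sessions it took
    return max_iterations
```

The design choice worth noting: because the prompt is static, specification quality is the whole ballgame, which is exactly the article's "key insight" above.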
Then Geoffrey stepped back.
The First Days
Early iterations built foundations.
Days 1-3: Lexer implementation. Tokenizing source code into meaningful units.
Days 4-7: Parser construction. Building abstract syntax trees from token streams.
Days 8-14: Basic type system. Type checking expressions and statements.
Each iteration picked up where the previous left off. The codebase grew. Claude kept building.
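To make the Days 1-3 work concrete, here is a minimal regex-driven lexer in Python. The token set is purely illustrative; the article never specifies the actual language's syntax or implementation language.

```python
import re

# Illustrative token classes, not the actual language's.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=()]"),
    ("SKIP",   r"\s+"),          # whitespace: matched but discarded
]
MASTER = re.compile("|".join(f"(?P<{k}>{p})" for k, p in TOKEN_SPEC))

def lex(source):
    """Turn source text into a list of (kind, text) tokens."""
    tokens, pos = [], 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if not m:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

The parser stage then consumes these (kind, text) pairs and assembles them into an abstract syntax tree, typically by recursive descent.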
The Middle Months
The project entered more complex territory.
Weeks 3-4: Control flow analysis. Understanding branches, loops, function calls.
Weeks 5-6: Type inference. Claude implemented Hindley-Milner, letting types be deduced rather than declared.
Weeks 7-8: Intermediate representation. Translating the AST to a form suitable for optimization.
“I’d check in periodically. The language was taking shape. Features I’d specified were appearing.”
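The heart of Hindley-Milner is unification: solving type equations so that types can be deduced rather than declared. Here is a toy slice of it in Python, with types encoded as strings (`"int"`), type variables as primed strings (`"'a"`), and constructors as tuples like `("fun", arg, ret)`. This encoding is an illustration, not the project's actual representation, and it omits the occurs check a real implementation needs.

```python
def resolve(t, subst):
    """Follow substitution chains until we hit a concrete type or free variable."""
    while isinstance(t, str) and t.startswith("'") and t in subst:
        t = subst[t]
    return t

def unify(a, b, subst):
    """Extend `subst` so that types a and b become equal, or fail."""
    a, b = resolve(a, subst), resolve(b, subst)
    if a == b:
        return subst
    if isinstance(a, str) and a.startswith("'"):   # a is a free type variable
        return {**subst, a: b}
    if isinstance(b, str) and b.startswith("'"):   # b is a free type variable
        return {**subst, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple) \
            and a[0] == b[0] and len(a) == len(b):
        for x, y in zip(a[1:], b[1:]):             # unify arguments pairwise
            subst = unify(x, y, subst)
        return subst
    raise TypeError(f"cannot unify {a} with {b}")
```

Unifying `'a -> int` with `int -> 'b` forces both variables to `int`, which is exactly how an unannotated function like `fn x -> x + 1` ends up typed `int -> int`.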
The LLVM Integration
The final challenge: generating real machine code.
LLVM is the industry-standard compiler backend. It handles optimization, code generation, platform targeting. But it’s complex to integrate.
Claude tackled it systematically.
IR Generation: Translating the language’s intermediate representation to LLVM IR.
Optimization passes: Hooking into LLVM’s optimization pipeline.
Code generation: Producing actual binaries for the target platform.
“By the end, you could write a program in the language, compile it, and run it. Native binary. Real execution.”
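The IR-generation step can be sketched by lowering a tiny AST straight to textual LLVM IR. The node shapes `("num", n)` and `("add", l, r)` are invented for the sketch; a real backend would go through LLVM's builder API rather than string concatenation.

```python
from itertools import count

def lower(node, lines, counter):
    """Lower one AST node, returning the LLVM value (register or immediate)."""
    kind = node[0]
    if kind == "num":
        return str(node[1])                     # constants appear inline
    if kind == "add":
        lhs = lower(node[1], lines, counter)
        rhs = lower(node[2], lines, counter)
        reg = f"%t{next(counter)}"              # fresh SSA register
        lines.append(f"  {reg} = add i32 {lhs}, {rhs}")
        return reg
    raise ValueError(f"unknown node kind {kind!r}")

def compile_main(ast):
    """Wrap an expression in an i32 @main that returns its value."""
    lines, counter = [], count()
    result = lower(ast, lines, counter)
    body = "\n".join(lines)
    return f"define i32 @main() {{\n{body}\n  ret i32 {result}\n}}\n"
```

Feed the output to `clang` or `llc` and LLVM takes over: optimization passes, instruction selection, platform targeting, the whole back half of the compiler for free.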
The Intervention Points
The three months weren’t fully autonomous.
Geoffrey intervened at key points:
Architectural decisions: When Claude faced design crossroads, Geoffrey provided direction.
Bug fixes: When Claude got stuck in loops or produced broken code, Geoffrey debugged and corrected.
Specification refinement: As the language evolved, Geoffrey clarified edge cases the original specification didn’t cover.
“Maybe 5% of the time I was active. But that 5% was critical.”
The Iteration Patterns
The loop didn’t run continuously for three months.
“It would run for hours or days. Then I’d review. Then restart.”
Multiple sessions. Cumulative progress. Each restart inherited everything the previous session produced.
“Think of it as a relay race where the baton is the codebase.”
The Code Quality
Three months of AI-generated code could have been a disaster.
“Actually, the code was surprisingly coherent. Because each iteration built on the previous, and Claude maintained context, the architecture stayed consistent.”
Not perfect. There were oddities. Redundancies. Occasional strange choices. But the overall structure was sound.
“Better than some human codebases I’ve seen that grew over years without coherent architecture.”
The Testing Approach
How do you validate a programming language?
Geoffrey built test suites alongside the language. Programs that should compile and run correctly. Programs that should produce errors.
“The tests were part of the specification. Claude knew what passing meant.”
Each iteration ran the test suite. Failing tests guided the next iteration’s focus.
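A two-sided suite like this can be sketched as a small runner: programs that must compile, and programs that must be rejected. `compile_fn` here is a stand-in for the language's actual compiler entry point, which the article doesn't name.

```python
def run_suite(compile_fn, should_pass, should_fail):
    """Return a list of failure messages; empty means the suite passed.

    A program in `should_pass` fails the suite if compilation raises;
    a program in `should_fail` fails the suite if compilation succeeds.
    """
    failures = []
    for src in should_pass:
        try:
            compile_fn(src)
        except Exception as e:
            failures.append(f"expected success, got {e!r}: {src!r}")
    for src in should_fail:
        try:
            compile_fn(src)
            failures.append(f"expected an error, but it compiled: {src!r}")
        except Exception:
            pass
    return failures
```

The failure list is what makes the loop self-steering: it doubles as the next session's worklist, so failing tests directly guide the next iteration's focus.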
The Token Investment
Three months of Claude usage consumed significant resources.
“I didn’t track exact costs, but it was substantial. Not prohibitive, but real.”
The economics depended on valuation. What was a working programming language worth? Against that value, the token costs were reasonable.
The Learning Capture
The project produced more than a language.
“I learned compiler construction by watching Claude do it. Better than any textbook.”
Each generated module was a lesson. Type inference implementation. IR design. LLVM integration patterns.
“It was like having a tireless tutor who showed rather than told.”
The Human Hours
Geoffrey estimated his time investment.
Active guidance: maybe 20 hours over three months.
Background monitoring: occasional check-ins to review progress and code.
“For a project that would have consumed years of full-time work, I spent a few weeks of part-time attention.”
The leverage was extreme. Human time multiplied by AI execution time.
The Final Product
At project end, the language worked.
Source files compiled to binaries. The type system caught errors. Performance was reasonable (LLVM handled optimization).
“Not production-quality. Not ready for widespread use. But a complete, working language that demonstrates the concepts.”
The proof was in the compilation.
The Documentation Bonus
Claude documented as it built.
Comments explained design decisions. README files described architecture. The project was more documented than most human projects.
“Documentation wasn’t an afterthought. It was part of the generation process.”
The Reproducibility Question
Could others replicate this?
“The approach is reproducible. The results depend on the specification quality and intervention skill.”
A vague language specification would produce a vague language. A precise specification, with expert intervention at key points, could produce something useful.
The Upper Bound Question
Three months suggested something about limits.
“What’s the upper bound for autonomous AI operation? We don’t know yet. Three months isn’t the ceiling.”
Other practitioners reported even longer runs. The technique scaled with patience and budget.
The Philosophical Reflection
Building a language this way felt different.
“I wasn’t a programmer. I was a director. Specifying what I wanted. Reviewing what I got. Guiding when needed.”
The craft shifted. From writing code to orchestrating generation. From implementation to specification.
“The skill wasn’t typing. It was knowing what to ask for.”
The Implications
The three-month language proved something.
Complex, multi-month projects were achievable through autonomous AI operation. Not just quick tasks. Not just simple scripts. Real engineering projects.
“If AI can build a programming language in three months, what else can it build?”
The question wasn’t rhetorical. It was an invitation to experiment.
The Current State
The language exists. Open source. Others have studied it.
“It’s a proof of concept. Not a production language. But the concept it proves is significant.”
AI-generated compilers. AI-generated systems. AI-generated complexity.
The three-month run demonstrated the frontier.