I am writing an article about what makes a good language for AI to program in and why, and also trying to make a case for Elm as a strong candidate here. There is definitely a big overlap between what is good for humans to program in and what is good for AI to program in - I think because compilers exist to enforce structure, accuracy, efficiency, reliability, safety, correctness and so on.
Here are 10 qualities that I think can be evaluated against:
Simple, predictable semantics
Strong static guarantees
Regular, orthogonal design
Memory safety & concurrency
Support for declarative DSLs
High-quality diagnostics (human- and machine-readable)
Useful feedback on partial programs
Rich tooling APIs
Readability and refactorability
Smooth spectrum from spec to implementation (properties, contracts, proofs)
And my evaluation for Elm and some other languages is below.
Elm is a ‘minor’ language compared to these, and that puts it at a disadvantage in terms of available code to train on, but I really do think it has great qualities in this area. Another criticism might be: if AI makes working with difficult languages like C++ easy, does having a better language even matter?
Do you agree with my assessment? And with my choice of evaluation criteria?
Would you like me to assess any other languages against these 10 criteria?
Summary
Elm: Very close to “ideal for AI” on most dimensions; small, pure, strongly typed, with excellent errors.
C++: Powerful but hostile on most of these criteria; good tooling via Clang, but language semantics and safety are poor for AI-generated code.
Java: Solid middle ground: safe, regular, tool-rich, but less expressive statically than ideal.
JavaScript: Needs TypeScript and tooling to approach your ideal; core language is dynamic and quirky.
Python: Friendly and tool-rich, but dynamic typing and runtime binding make strong static guarantees and precise feedback harder.
Elm
Semantics - Strong
Pure, expression-oriented, no null, no exceptions, managed effects, simple predictable semantics.
Static guarantees - Strong
Hindley-Milner types, no null/undefined, ADTs, exhaustive pattern matching; no effects system or ownership, but very safe.
Regular design - Strong
Small core, curated ecosystem, limited feature set; very little legacy or weird corner cases.
Memory safety & concurrency - Strong
Runs on JS, no raw memory; Elm’s concurrency model (tasks/ports/subscriptions) is tightly controlled.
Declarative DSLs - Strong
Entire UI model is declarative; HTML, styling, architecture are declarative DSLs embedded in Elm.
Diagnostics - Strong (edited)
Famous for excellent human-friendly errors. The structured/machine side exists via --report=json CLI option.
Partial programs - Medium
Good compiler guidance when things are missing, but no explicit “typed holes” in the sense of Agda/Idris. Still: very friendly to incremental edits.
Tooling APIs - Medium
Has language server support and fast compilation, but less extensive/introspective than, say, Rust/Java ecosystems.
Readability/refactorability - Strong
Enforced formatting, simple module system, no overloading or operator madness; code is usually very uniform.
Spec to implementation spectrum - Medium
Strong types and pattern matching help, but little in the way of built-in contracts/proofs beyond the type system and tests.
C++
Semantics - Weak
Complex, decades of accreted features, UB everywhere; tricky evaluation order and aliasing rules.
Static guarantees - Medium
Strong types and templates, but limited by UB and unsafe constructs; no native ownership/effects in the language (RAII helps but is not enforced by the type system).
Regular design - Weak
Many overlapping features and paradigms; historic baggage; “there are many ways to do it.”
Memory safety & concurrency - Weak
Raw pointers, manual memory, data races easy to express; safe subsets exist by convention, not by language design.
Declarative DSLs - Medium
Template metaprogramming enables embedded DSLs, but often in very complex ways; not designed for declarativity first.
Diagnostics - Medium
Modern compilers (Clang, GCC, MSVC) give better errors and machine-readable formats, but template errors are still notorious; structured diagnostics exist but are compiler-specific.
Partial programs - Weak/Medium
Compilers cope with syntax/type errors but no explicit “holes”; error recovery exists but not designed as an interactive, typed-hole experience.
Tooling APIs - Strong (via Clang, etc.)
Clang/LLVM and related tooling provide rich introspection; language itself doesn’t define APIs, but ecosystem is strong.
Readability/refactorability - Medium (highly style-dependent)
Possible to write very readable C++, but language allows highly complex, non-obvious constructs; refactoring relies heavily on external tools and discipline.
Spec to implementation spectrum - Weak/Medium
Some contract support in newer standards, plus external tools (static analyzers, formal methods), but not a central design goal.
Java
Semantics - Medium/Strong
Deterministic, well-specified, no UB in the C++ sense; but lots of legacy quirks and a large standard library.
Static guarantees - Medium
Nominal OO types, generics, null is pervasive; no effects system or ownership, but type system is sound and helpful.
Regular design - Medium
Core language relatively simple, but Java 8+ added lambdas/streams/etc.; still more regular than C++.
Memory safety & concurrency - Medium
Memory-safe (no raw pointers), but data-race safety not enforced; concurrency primitives are low-level.
Declarative DSLs - Medium
Streams, annotations, builder-style APIs allow semi-declarative code, but language itself is largely imperative/OOP.
Diagnostics - Strong
Good compiler errors, IDEs provide structured feedback; build tools and LSP support stable machine-readable diagnostics.
Partial programs - Medium
IDEs plus compiler handle incomplete code well, but no notion of typed holes as language constructs.
Readability/refactorability - Strong
Verbose but regular; strong IDE refactoring support; canonical style converges on readable, explicit code.
Spec to implementation spectrum - Medium
JML and similar tools exist; annotations and frameworks for validation, but not integrated deeply into the language core.
JavaScript
Static guarantees - Weak
Dynamic types, no static checking beyond linters; TypeScript exists precisely to fix this.
Regular design - Weak/Medium
Modern JS is more regular, but legacy features and multiple paradigms coexist; many “gotchas”.
Memory safety & concurrency - Medium
Memory-safe (no pointer arithmetic); shared-memory data races are only possible via workers and SharedArrayBuffer and are rare in typical browser JS; the async model is single-threaded but subtle.
Declarative DSLs - Medium/Strong (via ecosystem)
React/JSX, functional style, array combinators make a lot of UI/data-flow declarative; this is more library-level than language-level.
Diagnostics - Medium
Runtime errors often decent; static diagnostics depend on linters/TypeScript; machine-readable error formats exist but fragmented.
Partial programs - Medium
Tools (IDEs, browsers) handle incremental code reasonably well, but the language doesn’t have holes/typed feedback.
Tooling APIs - Strong (ecosystem)
Language servers, AST tools (Babel, ESLint), bundlers; excellent introspection through external tooling.
Readability/refactorability - Medium
Very style- and framework-dependent; you can write clean or very messy JS; refactoring relies heavily on TS/IDEs.
Spec to implementation spectrum - Weak/Medium
Test frameworks and schema validators help; no built-in contract/property language; most “spec” lives in tests and documentation.
Python
Semantics - Medium
Mostly simple and consistent at the surface, but dynamic features, metaprogramming, and import system quirks exist; still far friendlier than C++/JS.
Static guarantees - Weak/Medium
Dynamic by design; type hints + mypy/pyright improve things but are optional and unsound in many real-world uses.
Regular design - Medium
Core language is relatively small and consistent; some historical warts (Python 2 legacy, metaclasses, etc.).
Memory safety & concurrency - Medium
Memory-safe from the programmer’s view; GIL simplifies some concurrency concerns but is a performance and design constraint; no static race checking.
Diagnostics - Medium/Strong
Tracebacks are clear; newer versions add better error messages; type checkers give structured diagnostics; machine-readability via tools is good.
Partial programs - Medium
REPL culture and notebooks support incremental development; static analysis on incomplete programs is less robust than in strongly typed languages.
Tooling APIs - Strong
Rich introspection (reflection, inspect), language servers, static analyzers; good ecosystem for tools.
Readability/refactorability - Strong (by culture)
“There should be one obvious way”; enforced indentation; common style via PEP 8; dynamic nature still makes some large-scale refactors risky.
Spec to implementation spectrum - Medium
Property-based testing (hypothesis), contracts libraries, type hints; but nothing like a built-in, enforced spec language.
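To make the "light spec layer" above concrete, here is a hand-rolled sketch of the idea behind property-based testing (which the hypothesis library does far more thoroughly, with shrinking and smarter generation): state properties over random inputs rather than individual examples. Everything here uses only the stdlib; the generator and properties are illustrative, not from the original post.

```python
# Hand-rolled sketch of property-based testing using only the stdlib:
# generate random inputs and check *properties*, not individual examples.
# (The `hypothesis` library mentioned above does this properly, with
# input shrinking on failure.)
import random

def check_property(prop, gen, runs=200):
    """Run `prop` against `runs` randomly generated inputs from `gen`."""
    for _ in range(runs):
        xs = gen()
        assert prop(xs), f"property failed for {xs!r}"
    return True

def sort_is_well_behaved(xs):
    """Spec for sorting: length-preserving, idempotent, and ordered."""
    ys = sorted(xs)
    return (len(ys) == len(xs)
            and sorted(ys) == ys
            and all(a <= b for a, b in zip(ys, ys[1:])))

if __name__ == "__main__":
    gen = lambda: [random.randint(-100, 100)
                   for _ in range(random.randint(0, 20))]
    print(check_property(sort_is_well_behaved, gen))  # True
```

The point is the shape of the spec: a property quantified over inputs sits between a unit test (one example) and a proof (all inputs), which is exactly the "spectrum" criterion above.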
One concern I have regarding AI’s ability to work with Elm code (although this is conjectural rather than empirical) is that Elm tends to have large files (see “The Life of a File”), and that will pollute an AI’s context window if it tries to read the whole thing. It might be possible to make this better for an AI by providing a command-line tool that prints a particular definition from a file, so that the AI does not need to read the whole thing.
Elixir
Semantics - Strong
Functional, immutable data, pattern matching, and BEAM’s process model give it simple, well-defined, deterministic semantics with no UB in user code.
Static guarantees - Weak
Dynamically typed with optional typespecs + Dialyzer; useful for documentation and some checks but far weaker and less sound than ML/Rust-style static typing.
Regular design - Strong
Small, coherent core (modules, functions, pattern matching, processes, macros); most complexity is library-level and the core language stays uniform and orthogonal.
Memory safety & concurrency - Strong
Memory-safe via GC on BEAM; actor-style concurrency with isolated processes and message passing avoids shared-memory data races by construction.
Declarative DSLs - Strong
Macros and quoting make it very good for DSLs; frameworks like Phoenix and Ecto lean heavily on declarative routing, queries, schemas, and configurations.
Diagnostics - Medium
Runtime errors and stack traces are clear and helpful; ElixirLS/language tooling expose diagnostics, but there’s less emphasis on highly structured, versioned error codes than in Rust/TypeScript.
Partial programs - Medium
Great REPL (IEx), live reload, and a dynamic runtime make it easy to work with incomplete systems, but there’s no notion of typed holes or rich static feedback on partial terms.
Tooling APIs - Strong
mix, Hex, ElixirLS (LSP), and BEAM introspection provide rich, scriptable tooling and fast feedback loops attractive for AI-driven workflows.
Readability/refactorability - Strong
Pipeline operator, clear conventions, enforced formatting, and functional style make Elixir code generally very readable and modular, though dynamic typing limits fully automatic refactors.
Spec to implementation spectrum - Medium
Typespecs, @behaviour, docs, and property-based testing offer a light spec layer, but there’s no built-in contract or proof system tightly integrated with the language core.
A minor note. The Elixir LSP is Expert; the others are considered deprecated. My employer actually employs someone to work on Expert! They’ve put a ton of work into the LSP and it’s really great considering how much they’ve done in the past year or two. However, the LSP does have limited APIs. It was pointed out to me a few weeks ago that the LSP cannot do function renaming. Supposedly it’s a technical limitation, though I don’t know all the details (I can share links if there’s interest).
Interesting though that the article on Elixir that @wolfadex linked takes the empirical approach, measuring some kind of benchmarks across AI models and languages.
Rust is a strong contender too, and probably the best of the major languages.
Rust
Semantics - Strong
Well-specified, mostly deterministic semantics; the safe subset rules out undefined behavior, and the boundary with unsafe is explicit.
Static guarantees - Strong
Rich static types with ownership/borrowing and lifetimes, algebraic data types (enums), traits, and pattern matching; strong guarantees about memory safety and aliasing in safe code.
Regular design - Medium/Strong
Modern, mostly orthogonal core, but lifetimes, traits, and generics introduce real complexity compared to simpler ML-style languages.
Memory safety & concurrency - Strong
Safe Rust prevents data races and most memory bugs by construction; low-level unsafety is confined to unsafe blocks with clear syntactic fences.
Declarative DSLs - Medium
Macros, traits, and builder patterns allow embedded DSLs and declarative styles, though they’re heavier than in pure functional or macro-heavy languages like Haskell or Elixir.
Diagnostics - Strong (human + structured)
Excellent compiler errors with suggestions, spans, and notes, plus structured, machine-readable diagnostics (error codes, JSON output) and tight IDE integration via rust-analyzer.
Partial programs - Medium
Good error recovery and helpful messages when code is incomplete, but no first-class typed holes as in Agda/Idris/Haskell; the experience is “close but not explicit.”
Tooling APIs - Strong
rustc plus rust-analyzer, Cargo, Clippy, Miri, and stable compiler flags/JSON outputs provide rich introspection, fast incremental checking, and strong LSP-based tooling.
Readability/refactorability - Medium/Strong
Clear idioms, enforced formatting (rustfmt), and a culture that values explicitness and safety, but advanced features (lifetimes, complex generics) can make some code hard to read and refactor.
Spec to implementation spectrum - Medium/Strong
Types, traits, and pattern matching encode a lot of invariants; external tools (Prusti, Creusot, etc.) bring formal verification, though they’re not yet mainstream parts of typical Rust workflows.
I agree on the analysis and have found Claude capable of working with a large and complex codebase (Liikennematto), especially with recent models.
The main issue is the lack of app/algorithm/library examples in the training data. LLMs are no longer spitting out Haskell as Elm code, and they can use the compiler to fix their wrong assumptions. With TypeScript, by contrast, there are so many examples of almost anything in the training data that LLMs get a head start on understanding how to implement or fix something. The downside is that the training data is full of mediocre code, while Elm source in the wild tends to be of high quality.
Some of the issues can be worked around by first prototyping in a popular language and then writing the solution in idiomatic Elm.
When not explicitly instructed otherwise, Claude Code still writes silly tests that fail to capture the idea of the code being tested. In TypeScript, for example, the tests often make much more sense. Hence with Elm it’s better to write tests yourself or give very good foundations for the LLM to use.
But overall, given that Liikennematto is a niche thing (traffic simulation/city builder) in a niche language, using 100% my own coding style and way of reasoning (no other contributors), I think Claude is doing a good job at helping me with new features. I don’t let it run as an agent for long periods of time, as that often is a dead-end. Requires guidance.
I once deliberately wrote an Elm module that grew to at least 6K LOC before refactoring it into smaller modules. It was a good exercise for me to learn from, and it did also mean that the architecture was not imposed so much as responding to the needs of the application. What I learned from this has enabled me to be better at modularising Elm, so I would not repeat the exercise now that I feel confident of how to modularize better.
That said, I do agree that big files can be a real problem for AI coding and eating up too much context.
Currently I use Serena MCP with Claude Code, and that seems to do a much better job of helping it to search and edit files more efficiently. There is also jCodeMunch, which does not support Elm, but does look promising if/when it does. Right now with CC + Serena, I don’t consider big files to be a disadvantage.
I am not really sure what Serena or jCodeMunch do to index the code. The obvious options are a vector database, a graph database, or both! I don’t think either creates graphs, but I think that would be the really interesting approach: create and maintain a knowledge graph out of the code. It needs to be quite language-specific though, so it’s not surprising that I don’t see one for Elm.
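As a toy illustration of the knowledge-graph idea: even just scanning Elm import lines yields a module-dependency graph, which is the simplest node-and-edge structure such an index could maintain. The module names and sources below are invented; a real indexer would of course also extract types, exposed values, and call sites.

```python
# Toy sketch of the "knowledge graph from code" idea: build a module
# dependency graph by scanning Elm `import` lines. Module sources here
# are hypothetical; a real tool would parse far more than imports.
import re

IMPORT_RE = re.compile(r"^import\s+([A-Za-z0-9.]+)", re.MULTILINE)

def dependency_graph(modules: dict) -> dict:
    """Map each module name to the set of module names it imports."""
    return {name: set(IMPORT_RE.findall(src)) for name, src in modules.items()}

if __name__ == "__main__":
    modules = {
        "Main": "module Main exposing (main)\nimport Page.Home\nimport Api\n",
        "Page.Home": "module Page.Home exposing (view)\nimport Api\n",
        "Api": "module Api exposing (get)\n",
    }
    print(dependency_graph(modules))
```

From a graph like this, an agent could answer "what depends on Api?" without reading any file in full - which is precisely the context-window saving discussed above.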
Elm being a ‘minor’ language is an inevitable weakness. Despite that, it is surprising how well it holds up, thanks to its other qualities.
Recently I used an Elm graph package for creating and then walking/searching graphs. It turned out the package was not using tail-recursive algorithms, so it blew the stack on real-world problems. I asked CC to write me a new graph implementation that is tail recursive, and it accomplished that easily.
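The general technique behind that fix - replacing call-stack recursion with an explicit stack or accumulator, which is what tail-call-friendly Elm code compiles down to - can be sketched as follows (in Python here, since the original Elm package isn't quoted; the adjacency-dict graph shape is an assumption for illustration).

```python
# The general fix: replace call-stack recursion with an explicit stack,
# so deep graphs cannot overflow the runtime stack. The graph shape
# (adjacency dict) is assumed for illustration.
def dfs_reachable(graph: dict, start) -> set:
    """Iterative depth-first search; safe on graphs deeper than any recursion limit."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, []))
    return seen

if __name__ == "__main__":
    # A chain far deeper than Python's default recursion limit (~1000):
    # a naively recursive DFS would raise RecursionError here.
    chain = {i: [i + 1] for i in range(100_000)}
    chain[100_000] = []
    print(len(dfs_reachable(chain, 0)))  # 100001
```

In Elm the same shape appears as a helper taking the stack/accumulator as an argument, with the recursive call in tail position so the compiler turns it into a loop.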
Maybe that is too simple an example. What algorithms have you found it to have trouble with?
I’ve been using Elm + Tailwind with Opus 4.5/.6 for a couple of months now, and it works very, very well.
I’ve also spent some time implementing the same things in both Elm and Svelte, comparing the timings and quality between the two agents doing the work. The summary is that Elm is on par: a little more time spent vs Svelte (mainly for the JSON decoders), but fewer iterations to catch errors and more accurate results in terms of what I wanted to code. Nowadays I’m doing everything with Elm.
I’ve been thinking of a compiler/language-server feature in which, when an error happens, besides the error itself, excerpts of the relevant code are provided together with line numbers and module paths.
Elm error messages already do this? But perhaps you are talking about something else.
I gave Elm this score on error reporting:
Diagnostics - Strong (human), Medium (structured)
Famous for excellent human-friendly errors. The structured/machine side exists via tooling but is not as emphasized as in e.g. Rust.
As the readable error messages are great - there is just a little bit of a lack of structure for tool readability. This could be addressed quite easily for Elm, by having an optional mode where the errors are printed in machine readable form, probably JSON.
The elm compiler does have a way of outputting errors in a json format, although I suspect that the human-readable version already works quite well for LLMs
It does! I just never used it before… --report=json, example below.
I think AI does fine with the human-readable version, but a structured format can still be useful for error classification, tool-driven behaviour to fetch files and extract from them, and so on.
Thanks for pointing that out, I think Elm should be scored as Strong in this area.
{
"type": "compile-errors",
"errors": [
{
"path": "/home/rupert/sc/github/the-sett/elm-mlir/src/Mlir/Mlir.elm",
"name": "Mlir.Mlir",
"problems": [
{
"title": "UNFINISHED DEFINITION",
"region": {
"start": {
"line": 49,
"column": 7
},
"end": {
"line": 49,
"column": 7
}
},
"message": [
"I got stuck while parsing the `asdasd` definition:\n\n49| asdasd\n ",
{
"bold": false,
"underline": false,
"color": "RED",
"string": "^"
},
"\nI was expecting to see an argument or an equals sign next.\n\nHere is a valid definition (with a type annotation) for reference:\n\n greet : String -> String\n greet name =\n ",
{
"bold": false,
"underline": false,
"color": "yellow",
"string": "\"Hello \""
},
" ++ name ++ ",
{
"bold": false,
"underline": false,
"color": "yellow",
"string": "\"!\""
},
"\n\nThe top line (called a \"type annotation\") is optional. You can leave it off if\nyou want. As you get more comfortable with Elm and as your project grows, it\nbecomes more and more valuable to add them though! They work great as\ncompiler-verified documentation, and they often improve error messages!"
]
}
]
}
]
}
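To show the "tool-driven behaviour" this format enables, here is a minimal sketch of a consumer for the JSON report above. The field names (`errors`, `problems`, `region`, `message`) are taken from the example; the path in the demo input is invented. Note that `message` mixes plain strings with styled chunks, so flattening is needed to recover plain text.

```python
# Minimal sketch of consuming the Elm compiler's JSON report shown above.
# A real tool would json.loads() the compiler's stdout; here the report is
# inlined as a dict, with a hypothetical path, to keep the sketch runnable.
def flatten_problems(report: dict) -> list:
    """Flatten each problem's `message` (strings mixed with styled chunks)
    into plain text, keeping file/region info for tool-driven fixes."""
    results = []
    for err in report.get("errors", []):
        for prob in err.get("problems", []):
            text = "".join(
                part if isinstance(part, str) else part["string"]
                for part in prob["message"]
            )
            results.append({
                "path": err["path"],
                "title": prob["title"],
                "line": prob["region"]["start"]["line"],
                "text": text,
            })
    return results

if __name__ == "__main__":
    report = {
        "type": "compile-errors",
        "errors": [{
            "path": "src/Demo.elm",
            "name": "Demo",
            "problems": [{
                "title": "UNFINISHED DEFINITION",
                "region": {"start": {"line": 49, "column": 7},
                           "end": {"line": 49, "column": 7}},
                "message": ["I got stuck while parsing ",
                            {"bold": False, "underline": False,
                             "color": "RED", "string": "^"}],
            }],
        }],
    }
    for p in flatten_problems(report):
        print(f'{p["path"]}:{p["line"]} {p["title"]}')  # src/Demo.elm:49 UNFINISHED DEFINITION
```

With path, line, and title extracted, an agent can jump straight to the offending definition instead of re-reading the whole file.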
On partial programs or typed holes, there was talk about that some time ago; I don’t know if it’s something that was added to the Elm language server, or possibly elm-dev?
Typed holes could be nice, but it’s also super easy to put the wrong type in and look at the error. I do this sometimes when I know roughly the code I want but not the type. I’ll toss in an Int or () or some such that I know is plainly wrong and let Elm tell me what type I should put in there. It’s like magic!
Do you agree with my assessment? And with my choice of evaluation criteria?
Yes.
Anecdote
In February I started using Codex at my day job to write Python, lots of YAML, and to debug JupyterHub deployments on Kubernetes/OpenStack - it’s very good at this.
I’m also one of the maintainers of a fairly large & mature[1], open source Elm application. Besides elm-analyse, we have a few elm-review rules, and some unit & integration tests. On a whim I recently used codex-cli (5.3) to implement a small, but non-trivial feature in our backlog. It rapidly iterated to a working solution with almost no intervention on my part. Using elm-ui (simple DSL), a design system, plus the opinionated patterns of our architecture probably helped too. I’m looking forward to experimenting with it more.