I review a lot of PRs these days. As the job of a PR author becomes easier with AI, the job of a PR reviewer gets harder.1

AI can “assist” with code review, but I’m less optimistic about AI code review than AI code generation. Sure, Claude/Codex can be quite helpful as a first pass, but code review still requires a large amount of human taste.2

I care about the high-level abstractions my team uses in our codebase, and about how the pieces fit together. I care that our codebase can be intuitively understood by new team members. I care that code is tamper-resistant – that we build things robustly such that imperfect execution in the future doesn’t cause something to blow up. Systems should be decomposable. You should be able to fit all the components of the system in your head in a reasonably faithful mental model, but you shouldn’t need to fit all the implementation details of each component in your head to avoid breaking something.

Anyways.

I’ve been trying to speed up my review latency for PRs, and have given some thought to the heuristics I use to evaluate PRs. Heuristics are lossy, of course, but they’re necessary. If you haven’t given this much thought recently, it’s useful to consciously recalibrate the heuristics you use when reviewing code, now that so much code is generated by LLMs.

General Reviewability

  • Did the author provide a detailed & accurate PR description?
  • What level of sensitivity is this code? Is this performance or safety critical code that needs to be reviewed with a fine-tooth comb, line-by-line, or is it something peripheral like an internal UI or CLI that can be “good enough”?
  • Does the change appear reversible? Is the Git diff of the change human readable? I find LLMs are often really eager to make big changes that clobber the Git diff. Incremental change is usually preferable.
  • Is the PR of an actually reviewable size? My personal bar is: <500 lines is ideal, >1000 lines is borderline unreviewable.

Design & Abstractions

  • If this is greenfield code, does the author seem to be setting up suitable abstractions? Do these abstractions seem like they’ll compose in sane ways? Do the abstractions have reasonable boundaries that do not leak implementation details? (There’s a short sketch of a leaky boundary after this list.)
  • Can you zoom out your mind, picture the PR in your head, and “make it make sense” with your mental model of the code? Does it make sense at a conceptual level what is being proposed, or does this just have the veneer of “good code”?
  • Would the code be substantially improved by loading the PR into Claude Code and making a targeted one-sentence prompt? For example: “hey, could you deduplicate some of this logic between classes X and Y and make the Foo trait more modular”. (Fortunately, you can just put this as a review comment – the author will probably rewrite with an LLM anyways)
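
To make the leaky-boundary point concrete, here is a minimal Scala sketch with invented names (UserStoreLeaky, UserRepository, and User are hypothetical, not from any real codebase). The first trait exposes its storage technology in its signature; the second keeps that detail behind the boundary and returns domain types.

    import java.sql.{Connection, ResultSet}

    // Hypothetical example. The first trait leaks its implementation: every
    // caller must know about JDBC connections and result sets, so swapping the
    // storage layer (or composing this with anything non-SQL) touches everything.
    trait UserStoreLeaky {
      def findActive(conn: Connection): ResultSet
    }

    // A tighter boundary: the storage choice stays inside the implementation,
    // and callers only ever see domain types.
    final case class User(id: Long, name: String)

    trait UserRepository {
      def findActive(): List[User]
    }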

Vibe Code Smells

  • How much effort does the author seem to have put into their PR? N.B. I mean the author, not the AI that wrote the code on behalf of the author. Human effort and curation still leave signs behind.
  • Did the author leave vibe-coded comments in the PR? (Often this looks like iterative process comments, of the flavor // Now we’ll not use type X anymore, per your feedback.)
  • Are imports (especially for Python and sometimes Rust) splattered around the code, instead of being present at the top?
  • Is there a weird amount of defensive copying/cloning due to a misunderstanding of e.g. immutability-by-default in Scala or how to use ownership/lifetimes in Rust? (The sketch after this list shows a few of these smells in one place.)
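
Here is a hedged Scala sketch with invented names (ReportBuilder, Row, and RowCache are hypothetical) that packs three of these smells into one method: a leftover process comment, an import buried mid-method, and a defensive copy of a collection that is already immutable.

    // Hypothetical example: the names are made up, but the smells are real.
    case class Row(category: String, value: Int)
    case class Summary(byCategory: Map[String, List[Row]])

    object ReportBuilder {

      def summarize(rows: List[Row]): Summary = {
        // Now we no longer use RowCache here, per your feedback.  <- leftover process comment
        import scala.collection.immutable.ListMap // <- import splattered mid-method

        // Defensive copy of a List that is already immutable: a hint that the
        // author (or the model) didn't trust immutability-by-default.
        val safeRows = List(rows: _*)

        Summary(ListMap(safeRows.groupBy(_.category).toSeq: _*))
      }
    }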

Testing

  • For unit tests, do they cover common edge cases? Do the unit tests make assertions that meaningfully exercise the code, or are they sloppy assertions that reduce to assert!(true)?
  • For unit tests, do they have a weird number of extraneous edge cases that are unlikely to ever happen in practice? (Also a vibe code smell)
  • Do the tests mock out dependencies to the point where the entire test is useless/invalid (sketched after this list)?3
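
As a hedged illustration of the last two points (Scala, with invented names: PriceService and Checkout are hypothetical, and this is a sketch of the anti-pattern rather than a recommended test), here is an over-mocked test with a hollow assertion. Even the unit under test is stubbed, so the assertion only checks the stub against itself:

    import org.scalatest.funsuite.AnyFunSuite

    trait PriceService { def lookup(sku: String): BigDecimal }

    class Checkout(prices: PriceService) {
      def total(skus: List[String]): BigDecimal = skus.map(prices.lookup).sum
    }

    class CheckoutSuite extends AnyFunSuite {

      test("total sums the prices") {
        // Stub the collaborator...
        val stubPrices = new PriceService {
          def lookup(sku: String): BigDecimal = BigDecimal(10)
        }

        // ...and then stub the one interesting method on the unit under test,
        // so the real summing logic is never exercised.
        val checkout = new Checkout(stubPrices) {
          override def total(skus: List[String]): BigDecimal = BigDecimal(30)
        }

        // Morally assert(true): no real arithmetic, no empty cart, no failure path.
        assert(checkout.total(List("a", "b", "c")) == BigDecimal(30))
      }
    }

No individual line here is wrong in isolation; the smell is that the test can never fail for a reason that matters.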

Error Handling

  • Does the code have a weird level of paranoia about exceptions being thrown?
  • Does the code silently swallow errors with try/catch? (See the sketch after this list.)
  • Does the code allow exceptions/panics in areas that absolutely should not panic?
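
A hedged Scala sketch of the first two smells, with invented names (ConfigLoader and its helpers are hypothetical): a catch-all that silently swallows every failure, next to the opposite paranoia of wrapping pure, total code in try/catch “just in case”.

    import scala.util.control.NonFatal

    object ConfigLoader {

      // Smell: the catch-all swallows missing files, bad permissions, and
      // malformed config alike; callers proceed with an empty map and fail
      // much later, far from the real cause.
      def loadConfig(path: String): Map[String, String] =
        try parse(scala.io.Source.fromFile(path).mkString)
        catch { case NonFatal(_) => Map.empty }

      // Smell: paranoia. This function is pure and total, yet it is wrapped
      // in a try/catch that can only hide genuine bugs.
      def doubled(values: List[Int]): List[Int] =
        try values.map(_ * 2)
        catch { case NonFatal(_) => Nil }

      private def parse(raw: String): Map[String, String] =
        raw.linesIterator
          .map(_.split("=", 2))
          .collect { case Array(k, v) => k.trim -> v.trim }
          .toMap
    }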

None of these are intended to be knocks against individual PR authors. It’s useful to assume positive intent when reviewing code. SWEs are under various pressures, visible and invisible. The “system” we have today, broadly defined, results in much more code being produced by non-humans. The best source of truth for human coding taste is still, for now, humans. Therefore, humans still need to review a lot of non-human code, as we collectively chip away at the pieces of code taste that can be incorporated back into model intuition.


  1. As of late 2025, code generation is easier than code verification. Martin Kleppmann is onto something with his prediction that AI will make formal verification go mainstream. Our current set of review tooling isn’t sufficient for the tsunamis of code that will be generated without human oversight over the coming years. ↩︎

  2. Code review agents have gotten a lot better in the past year, but not at the same pace as code generation agents. As the model harnesses have gotten more agentic, code review agents have gotten significantly better at pulling in the context they need to review a PR. I’ve found that review agents are competent at picking out obvious catastrophic issues and medium-risk tactical coding mistakes. They’re not great with issues where the entire gestalt of a PR needs significant reworking, or where PRs have a lot of small things that need to be called out for improvement. There’s assuredly still low-hanging fruit for gains here. ↩︎

  3. I heard of an example of this recently, albeit not LLM related. The situation was described as the author “mock[ing] for the world they wanted, not the world as it is”. I still chuckle at this. :) ↩︎