
Is Google Docs a Good IDE? - The Real Bottleneck for Coding AIs

No. So why do so many people expect LLMs to succeed at coding by editing plain text?

Emre Gucer
3 minute read

This was a naive mistake we made early on while building Fume: we believed that building an AI SWE was a problem of code search, not code generation. That is simply not true. Here's why:

SWE-bench is one of the most popular LLM coding benchmarks, and Fume was one of the first agents to run a subset of it. In the original paper, "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?", one of the settings the researchers use to evaluate various LLMs is called "oracle." In this setting, models are given the issue statement along with the exact files edited by the reference ("gold") patch. This eliminates the need for the model to search the codebase, since it is handed the correct code to edit. If building an AI SWE were really a code search problem, you would expect models to perform extremely well in this setting. The reality is far from that.
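To make the oracle setting concrete, here is a minimal sketch of how such a prompt might be assembled. The function name and prompt wording are illustrative, not taken from the SWE-bench harness; the point is simply that the files edited by the gold patch are pasted straight into the context, so no retrieval happens.

```python
# Illustrative sketch of SWE-bench "oracle" retrieval (names and prompt
# wording are hypothetical, not the benchmark's actual code).

def build_oracle_prompt(issue_text: str, gold_patch_files: dict[str, str]) -> str:
    """gold_patch_files maps path -> contents for files edited by the gold patch."""
    context = "\n\n".join(
        f"### {path}\n{contents}" for path, contents in gold_patch_files.items()
    )
    return (
        "Resolve the following GitHub issue by editing the code below.\n\n"
        f"Issue:\n{issue_text}\n\n"
        f"Relevant files:\n{context}"
    )

prompt = build_oracle_prompt(
    "order_by() on a parent model crashes when Meta.ordering contains expressions.",
    {"django/db/models/sql/compiler.py": "class SQLCompiler:\n    ..."},
)
```

Even with the search problem solved for them like this, models still fail on a large fraction of issues.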

*Figure: SWE-bench results*

A reasonable counter-argument is that LLMs need more context from the rest of the codebase to understand "how the code works." But a closer inspection of the SWE-bench dataset shows that many of the issues are 1-10 line changes that hardly require context from the rest of the codebase. Here is one example among many:

Issue from django/django: order_by() on a parent model crashes when Meta.ordering contains expressions. Description (last modified by Jonny Fuller): Hi friends, During testing, I discovered a strange bug when using a query expression for ordering during multi-table inheritance. You can find the full write-up as well as a reproducible test repository at https://github.com/JonnyWaffles/djangoordermetabug. The bug occurs because the field is an OrderBy object, not a string, during get_order_dir. The linked stacktrace should make the issue obvious, but what I don't understand is why it only fails during test db setup, not during REPL or script use. I wish I could help more and come up with a real solution. Hopefully, this is enough for someone wiser to find the culprit.

Solution:

```diff
--- a/django/db/models/sql/compiler.py
+++ b/django/db/models/sql/compiler.py
@@ -722,6 +722,9 @@
     def find_ordering_name(self, name, opts, alias=None, default_order='ASC'):
         results = []
         for item in opts.ordering:
+            if isinstance(item, OrderBy):
+                results.append((item, False))
+                continue
             results.extend(self.find_ordering_name(item, opts, alias, order, already_seen))
         return results
```

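The shape of the bug is easy to reproduce outside Django. Below is a standalone toy version (not Django's real code; the class and function bodies are simplified stand-ins) of a helper that assumed every ordering item is a string like `"-name"`. Indexing into an `OrderBy` object raises a `TypeError`, which is exactly the kind of crash the issue describes; the fix is to handle expression objects before falling through to the string path.

```python
# Toy reproduction of the bug pattern; stand-ins for Django internals.

class OrderBy:
    """Simplified stand-in for django.db.models.expressions.OrderBy."""
    def __init__(self, expr: str, descending: bool = False):
        self.expr, self.descending = expr, descending

def get_order_dir(field: str, default: str = "ASC"):
    # Assumes `field` is a string like "name" or "-name".
    # Passing an OrderBy here raises TypeError: object is not subscriptable.
    if field[0] == "-":
        return field[1:], "DESC"
    return field, default

def resolve_ordering_item(item):
    if isinstance(item, OrderBy):  # the three added lines of the fix, in spirit
        return item.expr, "DESC" if item.descending else "ASC"
    return get_order_dir(item)

resolve_ordering_item("-created")       # ('created', 'DESC')
resolve_ordering_item(OrderBy("name"))  # ('name', 'ASC') instead of a TypeError
```

No knowledge of the wider codebase is needed to spot or fix this: the stack trace points directly at the string assumption.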
Even though great code search ability would be crucial in a real-life scenario with a large codebase, it is far from sufficient. So what's the ultimate solution? I think the answer is fairly simple: give LLMs the same tools you would need as an engineer. Just as you wouldn't edit code in Google Docs and push to prod (if you are doing this, you have my utmost respect), you shouldn't expect LLMs to work on plain text. Give them LSPs, debuggers, a Unix terminal, a browser... everything. Then let them run for hours. LLMs are not perfect, but they shine when they are in a tight feedback loop where they can make and fix mistakes again and again.

Yes, LLMs are not yet capable of using these complex tools as well as you can. But first, they are not good without them either, so you have nothing to lose. Second, in the fortunate scenario that LLMs get substantially 'smarter,' the possibilities are limitless. That is why we are building Fume inside an isolated development environment of its own, and how we plan to automate as much software work as possible for teams in the background.