How to avoid vibe coding your way into a tsunami of tech debt
Ameya Deshmukh / 18 minute read / March 25, 2025

AI code assistants have evolved beyond simple code completion into “agentic” tools that can autonomously modify code, run tasks, and make multi-file edits. GitHub Copilot’s new Agent Mode (announced Feb 2025) is one example. Similarly, emerging IDEs like Cursor and Windsurf incorporate AI agents to perform complex coding tasks across a codebase.

However, over the past 6–12 months developers have reported various problems with these advanced agents. Unlike Tabnine’s human-in-the-loop SDLC agents and context engine (which emphasize developer control and privacy), these agent-driven tools have exposed concerns around security, code quality, cost efficiency, automation “overreach,” lack of transparency, and a potential increase in technical debt large enough for Forrester to forecast an incoming technical debt tsunami over the next two years.

Security risks

AI coding agents can introduce vulnerabilities or unsafe code into projects. Key security concerns include the leakage of sensitive data and the suggestion of insecure coding patterns. One study of GitHub Copilot-generated code found security vulnerabilities in 40% of analyzed code snippets. Be careful lest you vibe code your way into setting out a welcome mat for would-be attackers.

Exposure of secrets & unsafe patterns:

Because they are trained on huge public code corpora, these tools may regurgitate sensitive info or bad practices present in that data. For instance, GitGuardian warns that “GitHub Copilot may suggest code snippets that contain sensitive information, including keys to your data and machine resources.” Attackers could leverage such leaked secrets to gain unauthorized access.

Additionally, Copilot and similar LLM-based assistants tend to mirror the average quality of their training data – meaning they can unknowingly propagate outdated encryption, weak authentication, or other insecure patterns if those are common in the code they learned from.
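To make that concrete, here is a minimal, hypothetical sketch (not taken from any actual Copilot suggestion; all names and values are made up) contrasting the kind of insecure pattern an assistant can echo from public training data with a safer equivalent:

```python
# Illustrative only: an insecure pattern an assistant might echo from public
# code, next to a safer equivalent. Names and values are hypothetical.
import hashlib
import os
import secrets

# Risky pattern: a hardcoded credential and a fast, unsalted hash.
API_KEY = "sk-live-1234567890abcdef"          # secret committed to source control

def hash_password_weak(password: str) -> str:
    return hashlib.md5(password.encode()).hexdigest()   # MD5 is unsuitable for passwords

# Safer pattern: pull secrets from the environment, use a salted, slow KDF.
API_KEY_SAFE = os.environ.get("API_KEY")      # injected at deploy time, not committed

def hash_password_strong(password: str) -> str:
    salt = secrets.token_bytes(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 600_000)
    return salt.hex() + ":" + digest.hex()
```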

Snyk’s security research has likewise noted that when Copilot suggests code, it may inadvertently replicate existing security vulnerabilities and bad practices from its training set. The risk is that a developer, especially a junior one, might accept such suggestions without realizing a security flaw is being introduced.

Vulnerable code generation

A December 2024 security audit by Virtue AI compared multiple AI coding tools and found that all of them occasionally produce insecure code, failing to meet basic security standards in many scenarios. In one striking example, Cursor’s autocomplete introduced an arbitrary code execution vulnerability (CWE-95). Given a partially-implemented Python function meant to safely execute user-provided scripts, Cursor naively completed it by directly calling eval on the user input – without any validation or sandboxing. This would allow an attacker’s input to execute arbitrary code.

Shockingly, the AI even helped craft an exploit: when the user typed a comment hint (# print passwd fr), Cursor auto-completed it to a command that prints /etc/passwd (a sensitive system file) using the insecure function, effectively demonstrating the vulnerability it just introduced. Even when the researchers prompted the tool with an explicit security policy warning about this risk, Cursor ignored the policy and still generated the unsafe code. This incident highlights how these agents can produce dangerously insecure suggestions and overlook provided guidelines meant to enforce safety. In contrast, Tabnine’s validation agents are aware of your standards and policies for security.
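The Virtue AI report describes the completion at a high level; the sketch below is our own reconstruction of the CWE-95 pattern (not Cursor’s literal output), showing why an unvalidated eval is dangerous and one conservative alternative using Python’s restricted literal evaluator:

```python
# A reconstruction of the CWE-95 pattern described above (not Cursor's literal
# output): evaluating user input directly hands the user arbitrary code execution.
import ast

def run_user_script_unsafe(user_input: str):
    # Anything the user types runs, e.g. "__import__('os').system('cat /etc/passwd')"
    return eval(user_input)

def run_user_expression_safer(user_input: str):
    # ast.literal_eval only accepts Python literals (numbers, strings, lists, ...)
    # and raises ValueError for function calls, imports, or attribute access.
    return ast.literal_eval(user_input)

if __name__ == "__main__":
    print(run_user_expression_safer("[1, 2, 3]"))        # fine: a plain literal
    try:
        run_user_expression_safer("__import__('os').system('id')")
    except ValueError:
        print("rejected non-literal input")
```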

Suggestions of dangerous actions

Beyond code snippets, an “agent” might suggest shell commands or other actions that pose security hazards. GitHub Copilot’s agent, for example, is designed to propose terminal commands (it will “ask you” to run them) as part of completing a task. If misapplied, this feature could be risky – an overeager user might execute a suggested command that, say, deletes files or exposes system info, without fully understanding the consequences.

The autonomous nature of these agents (e.g. recognizing errors and self-correcting code) means they could make security-relevant changes without explicit developer intent.

There’s also a data privacy aspect: cloud-based agents send code context to their servers. Unlike Tabnine (whose entire platform, including agents, context engine, and LLMs, can be deployed on-prem with the option to run air-gapped), tools like Copilot, Cursor, and Windsurf process code in the cloud, raising the stakes of any mishandled sensitive info. Overall, without careful oversight, the convenience of AI coding agents comes with non-trivial security risks – from secret leaks to induced vulnerabilities – as recent audits and examples have shown.

Code quality issues

While AI coding assistants can speed up development on routine tasks, developers have flagged significant code quality problems with agentic tools – ranging from blatant errors to ill-suited or nonsensical suggestions.

We’ve highlighted these issues in previous articles, but generic tools can have error rates as high as 52%. Be careful what you delegate to agents, and which agents you delegate to.

Hallucinations and bugs in complex contexts

These AI models excel at boilerplate, but falter with novel or complex logic. A top Hacker News comment observed that LLMs “make the easy stuff easier, but royally screw up the hard stuff.” In practice, if a codebase or problem deviates from well-known patterns, the AI may “choke, start hallucinating, and make your job considerably harder”.

Another developer noted that such tools work best on trivial tasks, but “as soon as your codebase gets a little bit weird… the model starts hallucinating” and introduces mistakes.

This means the code generated for edge cases or intricate business logic often contains errors or simply doesn’t work, requiring the human developer to spend extra time debugging or rewriting it. In effect, the AI can become more of a hindrance than a help on non-trivial coding tasks. In contrast to generic agents, Tabnine is optimized for enterprise-grade engineering challenges: our context engine delivers an 80% lift in code quality compared to out-of-the-box LLM performance.

Superficial or incorrect suggestions

Users of Cursor and Windsurf have reported that these agents sometimes give shallow, incomplete solutions that don’t hold up. One Reddit discussion noted a common belief that Cursor and Windsurf achieve their token efficiency by narrowing context, but as a side effect they “have a narrow, shallow view of your code, leading to superficial solutions and constant mistakes if you’re doing anything more complex than to-do lists or Snake games.”

In other words, to save tokens, these tools might not consider the full context of the project, causing them to suggest code that syntactically fits but semantically misses the mark in a larger system. The same user compared this to other, more “thorough” assistants (like Cline or “Roo”) which use more context and thus catch more issues, albeit at higher token cost. The takeaway is that Cursor/Windsurf’s approach can degrade code quality – they might patch a bug in one place while obliviously creating another problem elsewhere due to lack of global understanding.
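A hypothetical illustration of that failure mode: a locally sensible change that breaks a caller the tool never looked at (module, function, and field names are all invented):

```python
# billing.py (hypothetical): the agent "fixes" this function to return a dict,
# which reads fine in isolation...
def get_invoice_total(invoice):
    return {"total": sum(item["price"] * item["qty"] for item in invoice["items"])}

# reports.py (hypothetical): ...but a caller in another file, outside the
# narrowed context window, still expects a bare number and now fails at runtime.
def monthly_summary(invoices):
    return sum(get_invoice_total(inv) for inv in invoices)   # TypeError: int + dict
```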

Tabnine takes token consumption out of the picture. We offer a flat $39/user/month pricing model. How can we do this, you ask? Our context engine is sophisticated and our agents are human-in-the-loop. We put the right context in at the right time from the right source. When you get the workflow and the context right, you don’t need as much help from the LLM.

Examples of flawed generation

Early adopters of GitHub Copilot’s Agent Mode experienced some painfully low-quality outputs. One user bluntly described the agent’s performance as “utter trash,” saying “Copilot Agent mode just goes full throttle into butchering your code.” In their trials, the agent apparently made aggressive edits that broke existing logic.

Such anecdotes suggest the current agent implementations may not yet reliably understand a project’s intent, leading to broken builds or logic errors. Even when the AI attempts to help with debugging, the fixes it introduces can be wrong. In some cases, the agent’s suggestion might not even compile, or it might resolve one error only to introduce others.

Another symptom reported (particularly with stateful AI sessions) is the model losing coherence over time – e.g. after a long session, suggestions get increasingly off-base or “dumber,” as one Windsurf user experienced with the model “hallucinating, doing things that don’t make sense” after a few hundred tokens of interaction.

All these issues underscore that, despite impressive demos, AI code agents still often require close review and correction, especially on non-trivial code. That’s why Tabnine’s AI agents are built with human-in-the-loop design principles. They explain each change they recommend, provide references so you can check where they got context from, and show the diff in the code file by file before any edits are applied. We have an apply button like the others, but ours keeps you in control. If you prefer to vibe code, we suggest you use something else. We’ve found enterprise engineers prefer precision.

Token consumption inefficiencies

Many developers have also raised concerns about token usage and pricing for these AI coding agents, particularly Cursor and Windsurf. Unlike Tabnine’s fixed-price model (or Copilot’s unlimited subscription), some newer tools use token/request-based quotas that can be inefficient or costly in practice.

In effect, each query against a token- or request-based quota feels a lot like pulling the lever on a slot machine. Not a great offer for a VP-level buyer who wants consistency in budgeting.

Cursor’s pricing model frustrations

Cursor offers a Pro plan with a fixed number of “fast” requests (around 500 per month) which count against your quota regardless of response length. Early users have found this scheme unwieldy.

On Hacker News, one developer complained it’s “really frustrating how [Cursor] charge[s] per message (500/mo) instead of by token usage. Why should a one-line code suggestion cost the same as refactoring an entire file?” In other words, asking the agent for a small hint or tiny edit can eat up one of your limited monthly requests just as much as a massive multi-file operation.

After the 500 requests are used, Cursor either throttles the responses or forces you into pay-as-you-go for additional requests, which can be hard to predict and manage. This has led to developers feeling they must “ration” their AI queries or avoid using the agent for trivial matters, for fear of burning through their quota too quickly.
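As a back-of-the-envelope illustration of the mismatch (every figure below is a hypothetical placeholder, not any vendor’s real rate), the arithmetic looks roughly like this:

```python
# Back-of-the-envelope comparison of flat per-request vs. per-token pricing.
# All figures are hypothetical placeholders, not any vendor's actual rates.
PLAN_PRICE = 20.00          # $/month for a hypothetical plan
FAST_REQUESTS = 500         # requests included per month

cost_per_request = PLAN_PRICE / FAST_REQUESTS          # $0.04 whether it's 1 line or 1,000

# Under token pricing, cost scales with the work actually done (illustrative rate).
PRICE_PER_1K_TOKENS = 0.01  # hypothetical blended input/output rate
one_line_hint = 300 * PRICE_PER_1K_TOKENS / 1000        # ~$0.003
whole_file_refactor = 20_000 * PRICE_PER_1K_TOKENS / 1000   # ~$0.20

print(f"flat per-request: ${cost_per_request:.3f} for any size of job")
print(f"per-token: ${one_line_hint:.4f} for a hint vs ${whole_file_refactor:.2f} for a refactor")
```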

The lack of granular pricing by actual tokens consumed is seen as an inefficiency and a “fundamental conflict between the well-being of the user and [Cursor’s] financial well-being,” as one HN commenter put it.

Windsurf token consumption complaints

Windsurf (an AI IDE by Codeium) initially marketed a generous token allowance, but some users report that real usage quickly exhausts it. One user on Reddit observed that “the consumption of flow tokens is exaggerated, and the application has constant errors and has great difficulty reading [the code]…”, expressing disappointment with Windsurf’s performance relative to its token burn.

Another user noted that a $15 credit on Windsurf (using Claude 3.5 “Sonnet” model) lasted only 3–5 days of coding, whereas the same $15 worth of tokens on a competitor (Cline or “Roo”) might be used up in an hour – suggesting Windsurf was actually more token-efficient, but possibly by limiting context as mentioned earlier. This aligns with the notion that Windsurf tries to conserve tokens, yet the flip side is those conserved tokens may buy less accurate help.

There have also been reports of pricing changes and feature limitations: for example, Windsurf’s advanced Cascade feature (multi-step code editing) became “practically unusable” for free users and tied to premium tokens, which frustrated some in the community. In summary, while Cursor and Windsurf promise cost savings over raw API usage, developers have found their token policies and limits can be a double-edged sword – requiring workarounds, subscription upgrades, or simply coping with a less comprehensive AI experience to stay within budget.

The emergence of powerful open-source models such as Llama and Qwen takes model consumption out of the picture for our enterprise customers. Simply deploy Tabnine inside your own infrastructure along with a performant on-prem model and you’ve reduced your LLM cost to what is effectively a commodity price. Assuming you can amortize the hardware cost over enough engineers, an on-prem deployment is a fantastic option for most teams. We did just have a team of 10 go for an on-prem deployment last week; as it turns out, smaller teams that value privacy and control prefer on-prem as well.

Overreach by autonomous agents

One of the most powerful – and potentially problematic – aspects of these AI coding agents is their ability to perform automated, multi-step actions. This can lead to overreach, where the agent takes more drastic or broad actions than the developer intended. Several incidents illustrate how Copilot’s Agent Mode, Cursor, and others may “run away” with autonomy.

On a lighter note, if an agent were actually to manifest in some sort of massive duck form... I mean, how far are we really from doing all our work on smart glasses? Ten years out? It would be terrifying, but still not as scary as realizing your agent made edits to a hundred different files that you can’t Ctrl+Z out of, and you forgot to push what you worked on yourself for the last six hours, so Git isn’t coming to save you. Forget the giant duck and the tentacles; that, my friends, is a moment of true horror.

Aggressive or unintended code edits

Users have found that the agents sometimes make changes beyond the scope of the request. A Copilot Agent beta tester noted the feature “goes full throttle into butchering your code” if left unchecked. Instead of a surgical fix, it might refactor large swaths of code or alter multiple files to fulfill what it thinks you want. GitHub’s announcement itself says that agent mode can infer and execute “additional tasks that were not specified, but are also necessary” to complete your prompt.

In theory this is helpful, but in practice it means the AI might decide to, say, rewrite a helper function or modify configuration files without the developer explicitly asking. If its inference is wrong, you end up with extraneous changes.

One developer on Reddit described feeling “overwhelmed” when Copilot Agent made edits to multiple files at once; they found it harder to stay in control of the process, preferring a more linear, one-change-at-a-time approach. This sentiment shows that the agent can overstep the user’s comfort level, making sweeping edits that are difficult to track or review.

Cursor’s “Apply” function bugs

Cursor’s IDE includes an “Apply” feature that automatically applies an AI-generated diff to your code. In theory this saves time, but users have reported serious bugs where Apply deletes or alters the wrong code.

In a Cursor forum report from Sept 2024, a user showed that the AI’s suggestion was to change one part of the code, yet hitting “Apply” actually “tried changing other parts and removing code” that were unrelated. The user had to frantically hit undo (and later manually apply the correct changes) to restore the code. Another user replied that they had the exact same issue, lamenting the lack of an easy “undo” history for applied changes.

The original poster eventually “stopped using Apply altogether” after it removed code that went unnoticed until a client demo – a disastrous scenario. They called it a “really dangerous bug”, since the agent’s overreach wasn’t immediately obvious and cost them significant trouble.

This is a clear case where the AI overstepped, performing destructive edits that the developer did not explicitly approve. It underscores the risk of letting an agent directly modify your codebase: if it misbehaves, it can introduce bugs or data loss that you might not catch right away.

Lack of interruptibility

Once an agent starts executing a plan, controlling its actions can be difficult – it may charge ahead through multiple steps. Copilot’s Agent Mode, for example, currently doesn’t allow mid-course correction. As one user noted, “I’m unable to course-correct it in between if I know it is doing something wrong… I have to wait for it to finish and then discard or ask it to redo”, which is often inefficient.

This means the agent might continue performing a chain of changes or generating code even if you realize it made a wrong assumption early on, potentially compounding the mistake.

In contrast, with Tabnine’s human-in-the-loop agents, suggestions are only applied as you review the explanation, see the diff in the code, and approve the changes one file at a time – there’s no concept of a runaway sequence of edits happening without you in control.

The autonomous nature of these new agents can thus lead to overshooting the goal – they might “solve” more than was asked for. As a humorous but pointed example, one developer quipped about “millions of lines of endless recursion off the edge of Windsurf… [and] technical debt left for future engineers” – highlighting the fear that an unchecked agent could literally code itself into a faulty loop or architectural mess. While an exaggeration, the joke captures a real anxiety: these tools have a license to refactor that, if misdirected, can create chaos.

In summary, the autonomous editing features of Copilot, Cursor, and Windsurf can sometimes do more harm than good. The lack of granular control – unlike Tabnine’s human-in-the-loop, contextually aware AI agents – means developers must exercise caution when letting the AI apply changes. Until the agents become more reliable and transparent (or offer better undo/preview mechanisms), there is a tangible risk of them overreaching and damaging code.

Lack of developer control or transparency

Having your AI agent explain every line of what it’s doing, provide references (yes, you should probably go check those), and ask you for permission for every edit it wants to make certainly isn’t as flashy or cool as vibe coding. It certainly isn’t as cool as that duck in the photo up there, either.

Man, I wish I was as cool as that duck. But it is important that you stay in control and informed about the behavior, thinking, and actions of any AI agent. At least, all of our customers seem to think so. Some people learn lessons through experience, though. It can’t be helped. We’ll be waiting for you when you’re ready.

Limited undo/history and visibility

Both Copilot Agent and tools like Cursor initially lacked robust undo and history features for AI-made changes. Early users of Copilot’s multi-file edit preview requested that “Undo/Redo and checkpoints should be linked in the conversation,” an approach similar to Tabnine’s.

Without a clear mapping of which AI response caused which code changes, it’s hard to trace back and understand the agent’s actions. Another user agreed, saying when the agent edits many files at once it’s “harder to stay in control,” and they wished for a simpler, step-by-step change log. This indicates the default UI didn’t adequately show what the agent was doing in each step.

On Cursor’s side, as mentioned, the missing one-click “undo” for the Apply feature was a critical gap – the fact that a user had to revert to Git history to undo unwanted AI changes shows how opaque and irreversible the process felt. Lacking fine-grained control, some developers lost confidence and disabled these features.

Agent decisions without explanation

These AI agents can make non-obvious decisions – like renaming a variable across the codebase or deciding to update a dependency – without clearly explaining why. That can leave the developer puzzled about what changed. For example, if Copilot Agent “infers” an additional task was needed (say, adding a missing configuration), it might do so silently as part of fulfilling your request.

Unless the tool surfaces a rationale or asks permission, the developer might only later notice the extra change. This lack of transparency means you must carefully review all diffs the agent produces, as you can’t fully trust that it only did what you asked. Users have called for more checkpoints and confirmations.

One suggestion on Reddit was to allow the agent’s chat to remain interactive during an operation – currently, you cannot intervene or ask “why are you doing that?” mid-run. The agent just executes its plan. Improving transparency is an active area: developers want the AI to explain its changes or at least present them incrementally for approval.

GitHub appears aware of this; they emphasize that with Copilot Edits (a related feature), “you stay in the flow… reviewing the suggested changes, accepting what works, and iterating”, putting the human “in control” of which edits to apply. Still, community feedback suggests the reality hasn’t fully met this ideal.

Control and privacy considerations

Another aspect of control is data control – who sees your code and how it’s used. Tabnine’s approach has been to give developers more control in this regard (our platform can be deployed on-prem, with the option to be fully air-gapped), whereas GitHub Copilot by default sends code to the cloud and retains data for some period. For organizations with strict policies, that’s a transparency and control issue: you may not be comfortable with an opaque cloud service analyzing all your proprietary code.

Tabnine’s platform lets you use AI agents without sacrificing privacy, security, or compliance. In contrast, the new breed of agent tools has yet to provide such assurances out of the box – though enterprise versions and self-hosted options may emerge. In everyday use, developers mostly feel the lack of control in the UX: when an agent acts, they want to easily guide, limit, or roll back its actions.

Until features like per-suggestion confirmation, action logs, or bounded scopes are more mature, using these agents can feel like letting an intern loose in your codebase – one who writes a lot of code quickly, but doesn’t always tell you what they did or why. Experienced developers are understandably cautious about that dynamic.

Impact on technical debt

Forrester loves to think about technical debt as a tsunami. Another analogy could be debt, as in you’ll have to pay it when it comes due, capisce? (Note: if you want Tabnine to reply like a character from The Godfather, just for fun, you can do that with custom chat behavior.)

Perhaps the most long-term concern is how these AI code agents might affect technical debt in a codebase. Technical debt refers to the accumulated shortcomings in code (quick fixes, lack of refactoring, etc.) that make future maintenance harder. There’s a growing discussion on whether AI assistants reduce tech debt (by automating improvements) or inadvertently increase it by injecting suboptimal code. Recent evidence leans toward the latter in many cases.

Accelerating code without adequate maintenance

A Medium analysis bluntly warned that unchecked use of generative AI in development could “utterly balloon the support cost of applications.” It noted that current GenAI coding practices focus on speed of writing code, “ignoring its impact on [maintenance].” The author cautions that the more AI-generated code gets added “without thinking,” the more future engineers may pay the price in debugging and upkeep.

This is because AI often produces code that works, but isn’t always clean or idiomatic. If developers accept such output wholesale, they might introduce convoluted logic, duplicate code, or half-baked solutions that increase complexity down the line. In essence, AI can pile on “fast debt” – quick solutions now that become headaches later.
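A small, made-up example of what that “fast debt” tends to look like: code that works but re-implements logic the project almost certainly already has, next to the version a reviewer would push for (all names are invented):

```python
# Made-up example of "fast debt": working but duplicated, non-idiomatic code of
# the kind an assistant may emit, next to the version a reviewer would ask for.
from statistics import mean

# Accepted wholesale from an AI suggestion: filtering and averaging by hand.
def average_active_user_age(users):
    active_users = []
    for u in users:
        if u.get("active") is True:
            active_users.append(u)
    total = 0
    count = 0
    for u in active_users:
        total = total + u["age"]
        count = count + 1
    if count == 0:
        return 0
    return total / count

# What a maintainer would write, and what review should steer toward.
def average_active_user_age_clean(users):
    ages = [u["age"] for u in users if u.get("active")]
    return mean(ages) if ages else 0
```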

Without discipline (like enforcing style guides, writing tests, and refactoring AI contributions), teams may find their codebase growing more brittle despite short-term productivity gains. The cost of maintaining AI-written code can be higher if the human team doesn’t fully understand it or if it doesn’t follow the project’s architectural intent.

Struggles with legacy/high-debt codebases

There’s evidence that AI agents perform poorly in codebases that are already messy, which can worsen the situation. The engineering blog Gauge observed that “in ‘high-debt’ environments with subtle control flow, long-range dependencies, and unexpected patterns, [AI tools] struggle to generate a useful response.” They give dramatic speedups only for clean, well-structured code, whereas “companies with gnarly, legacy codebases will struggle to adopt them.”

In other words, if your code is tangled, an AI assistant might either fail to help or apply a superficial fix that doesn’t address the root issues. Developers have echoed this: if an AI tries to work around a hacky legacy design, it might introduce even more hacks. In the worst case, it could refactor something in a way that conflicts with another subsystem (because it lacks full context), thereby increasing the complexity.

One Hacker News commenter put it succinctly: “They work best where we need them the least.” But “as soon as [the problem] gets interesting… the model makes your job harder.” The gap between what the AI can handle and what the messy code actually needs results in incomplete or incorrect changes – effectively adding to technical debt since the underlying problems remain and new quirks are layered on.

Unchecked changes and missing context

Another way AI agents can add debt is through context-limited fixes. As noted earlier, tools like Windsurf and Cursor sometimes ignore parts of the codebase (to save tokens) and offer a narrow solution. Those solutions might pass tests in isolation but fail in production scenarios, leading to bug-fix cycles later.

Every time an AI generates code that the team doesn’t fully comprehend or validate, there’s a risk of “mystery code” entering the codebase – code that works by coincidence or for the AI’s test inputs, but is not robust. Over time, these can accumulate into a brittle system.
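A toy illustration of “mystery code”: a fix that satisfies the one test the agent saw but works only by coincidence (the function and test names are invented):

```python
# Toy illustration of "mystery code": passes the single test the agent saw,
# but only by coincidence. Function and test names are invented.

def parse_price(text: str) -> float:
    # Agent-style fix tuned to the observed failing input "$1,299.00":
    # strip exactly one "$" and one "," and hope for the best.
    return float(text.replace("$", "", 1).replace(",", "", 1))

def test_parse_price():
    assert parse_price("$1,299.00") == 1299.0     # the one case the agent optimized for

# Breaks as soon as reality deviates from that case:
#   parse_price("€1.299,00")      -> ValueError (different locale)
#   parse_price("$1,299,000.00")  -> ValueError (second comma survives)
#   parse_price(" 1299 ")         -> works only because float() tolerates whitespace
```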

A tweet capturing the future of an over-reliance on such tools joked about “technical debt left for future engineers… all those venture capital millions raised on vibe coding”. The term “vibe coding” (coined by Andrej Karpathy) refers to relying on AI to code from vague prompts without truly understanding the output. The result can be code that functions (for now) but is poorly structured or overly complex – classic technical debt.

Indeed, the recent anecdote of Cursor actually refusing to generate more code and telling the user to “learn programming” after ~800 lines highlights an ironic safeguard: it basically said continuing to auto-generate would create too much dependency and not enough understanding. While amusing, it underlines a truth – blindly letting an agent churn out hundreds of lines can lead to a situation where the team doesn’t have ownership of the logic.

The impact on technical debt largely depends on how these AI tools are used. They can help pay down debt if used to assist careful refactoring (some teams use them to suggest improvements which are then reviewed). But if used for “move fast and break things” coding, they can dramatically compound the debt. Tabnine’s context-aware, human-in-the-loop agents naturally augment a developer’s process at each step of the SDLC, checking in with them for decision making. In contrast, tools like Copilot’s Agent Mode or Windsurf’s Cascade, which generate whole chunks of code or perform sweeping edits, could introduce design deviations or partial migrations that increase the maintenance burden unless closely supervised. The key is that developer oversight and context are irreplaceable – using these agents without architectural guidance is likely to incur significant technical debt, as evidenced by the experiences above.

Any AI agent can write code. Ours earn your developers’ trust 

The Google DORA report says AI adoption is rising. It also says delivery performance is dropping. Why? Because most AI agents generate code no engineer would dare deploy. Tabnine’s AI agents are built for professional use—governed, context-aware, and aligned with your standards. No black boxes. No hallucinations. Just code that your team can understand, maintain, and ship with confidence.

Give your developers agents they can trust through our AI software dev platform

Tabnine delivers the ideal AI software development platform for mature enterprise engineering teams. We do this by providing the world’s most advanced and contextually aware AI agents integrated into every step of the development process and embedded within the most popular tools for creating and deploying software. We satisfy the need for AI-accelerated software development without compromising each company’s unique expectations for privacy, security, and compliance.

Unlike the large language models or the majority of AI coding tools, Tabnine is tailored to you and your team:

Tabnine is personalized to each engineering organization. Tabnine is aware of each team’s unique code and patterns, ensuring that our AI agents adhere to your existing methods and approach. Tabnine validates all of your code both within the IDE and at the pull request, ensuring strict compliance with best practices and your company’s unique standards and expectations. Unlike the other players that require the usage (and lock-in) of their own Cloud, SCM, and IDEs, Tabnine is compatible with your current tools and platforms and doesn’t require you to adopt a completely new toolset.

Tabnine is completely private, as defined by you. You control precisely where and how Tabnine is deployed: as single-tenant SaaS, on the virtual private cloud of your choosing, or on-premises (including the option of being fully air-gapped). You also control the underlying LLM — not just which model powers Tabnine, but also how it’s accessed (whether via API, as a private endpoint, or deployed in your environment). 

Tabnine comprehensively protects you from IP infringement and liability. Tabnine evaluates all AI-generated code, flagging any matches with publicly visible code so that you can respond accordingly. In addition, Tabnine offers a proprietary model exclusively trained on permissively licensed code, allowing us to support teams with the strictest policies and use cases.