When the Cloud starts working against your product
Imagine the same startup in two parallel realities, building the same product: a web app with a REST API (say, in Python), a relational model, and background data processing. OAuth + JWT for auth. Both use AWS (I’m using AWS as the example, but the post is about cloud native architectures in general, whether they’re built on AWS, Azure, Google Cloud, or anything else).
The product starts getting traction but still needs to prove its value. The small team needs to go faster. They don’t have many resources, and they need to focus their engineering efforts on the business logic. Massive scalability and concurrency are not core to the product’s business. They just need to be able to run the product in production.
Reality A’s team starts simple: two ECS Fargate containers behind a load balancer, one for the API, one for the background worker. Postgres on RDS. A queue backed by Postgres. Frontend on S3. ~200 lines of Terraform. One CI/CD pipeline for the monolith. Monthly cost: ~$350-500. In this reality, the small team keeps complete focus on the business logic inside the core application. They push multiple changes a day. They don’t have conflicts with each other. Testing the full application in development is straightforward.
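The “queue backed by Postgres” part is less exotic than it sounds: a jobs table plus one claim query. Here is a minimal sketch, with sqlite3 standing in for Postgres so it runs anywhere; on Postgres you would add `FOR UPDATE SKIP LOCKED` to the claim query so concurrent workers never grab the same row. Table and payload names are illustrative:

```python
import sqlite3

# A job queue on the relational database you already run: one table,
# enqueue is an INSERT, claiming a job flips its status atomically.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE jobs (
        id      INTEGER PRIMARY KEY,
        payload TEXT NOT NULL,
        status  TEXT NOT NULL DEFAULT 'pending'
    )
""")

def enqueue(payload):
    conn.execute("INSERT INTO jobs (payload) VALUES (?)", (payload,))
    conn.commit()

def claim_next():
    # Claim the oldest pending job; the status UPDATE is the "lock".
    # On Postgres: SELECT ... FOR UPDATE SKIP LOCKED inside one transaction.
    row = conn.execute(
        "SELECT id, payload FROM jobs WHERE status = 'pending' "
        "ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    conn.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (row[0],))
    conn.commit()
    return row

enqueue("send-welcome-email")
print(claim_next())  # (1, 'send-welcome-email')
```

No brokers, no extra services, and the whole thing is visible in a SQL console when something goes wrong.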
Reality B’s team goes full cloud native from the beginning, splitting the backend into separate services. API Gateway in front of Lambda functions. SQS + SNS for the job pipeline. Cognito for auth. RDS Aurora inside a VPC with a NAT Gateway for Lambda database access. DynamoDB as a secondary store. ~800 lines of Terraform. Multiple CI/CD pipelines, one per service. Monthly cost: ~$700-1000. Boy, the team is struggling in this reality… They realize they need deep knowledge of the infrastructure on top of their business logic. Some of the services don’t have a compatible open source equivalent to run locally, so they have to share environments and take turns for both development and testing.
Same product. Same audience. Roughly double the cost, four times the infrastructure code, and a system that requires provider-specific knowledge just to operate and iterate.
That gap is what this post is about. Not a gap in features (in both realities, the team shipped the same product). A gap in complexity that was imported, not earned.
The baseline we sometimes forget
Ship faster, learn faster, change direction faster. Architecture, tooling, processes, hiring… everything should be evaluated against speed and consistency principles. Does this help us move faster? Does this keep us focused on the actual product? Does this solve a problem we actually have right now?
Every decision should be challenged by these three questions. However, when it comes to product architecture, most engineers can’t resist preparing for the best-case future scenario, even when they don’t have a working product yet. Sometimes we get attracted by how other successful products are built or by what the cloud provider recommends. Sometimes we mix up complexity and ambition. Sometimes simplicity looks like a limitation.
The premise that runs through this post: complexity has a cost. Every abstraction layer you add is one more thing separating you from building your core business logic, another piece of knowledge that needs to be transferred, another service with its own pricing model and operational quirks. Each of those slows you down, pulls focus, and solves a problem you probably don’t have yet.
What a cloud provider actually is
There’s a misconception about what AWS, Azure, or Google Cloud actually are and the reason is that they are more than one thing at the same time. They are a set of infrastructure services with a managed operations layer on top and a web console instead of a terminal. You still need to understand networking, IAM, service limits, failure modes, and cost models. You just do it through provider-specific abstractions instead of Linux and networking fundamentals.
However, what they are changes a bit based on how you use them.
At a basic level, these providers make infrastructure available to people who don’t want to deal with maintaining it, which is a perfectly reasonable thing to want (for example, for most projects, I wouldn’t think twice about using RDS PostgreSQL instead of running my own cluster in ECS). You can keep taking advantage of battle-tested and well-documented open source services without the burden of maintaining them yourself in production.
A hyperscaler cloud is also a framework: a set of primitives, frequently not open source, that you must use to build a full-blown cloud native application. Through this framework, you are not just delegating service management; you are adapting your product architecture to the cloud’s guidelines, usually with the goal of building an application that scales massively in the cloud.
It’s fundamental to understand that your product architecture is not a casual decision. When you adopt a cloud native architecture framework, you fully depend on that particular cloud and its particular primitives, for good and for bad. More often than not, it’s for bad.
Complexity abstraction vs complexity adoption
Authentication and authorization are not usually part of your product’s business core, but you need them, and you don’t want to waste time building that functionality, especially with the risk of not doing it right. Auth0 and similar cloud services give you that. With minor adjustments, you have a full auth layer in your application that just works. If you do it right, you have a clean integration in your code, which you can mock for tests and easily replace for local development (or even if tomorrow you want to stop using it). Cognito, at the most basic level, can be integrated just as easily, but it has a very strong gravitational pull towards all those AWS primitives discussed earlier. It is part of the cloud native architecture “framework”, and it will lead you (if you let it) into very tight integration with other AWS primitives and logic inside your code.
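That “clean integration you can mock for tests” boils down to keeping the provider behind a small interface your code owns. A sketch of that shape, where `Auth0Verifier` is a hypothetical adapter (the real one would validate the JWT against the provider’s JWKS endpoint) and `FakeVerifier` is what tests and local dev use:

```python
from dataclasses import dataclass

@dataclass
class Claims:
    user_id: str
    scopes: list

class TokenVerifier:
    """The only auth interface the rest of the application ever sees."""
    def verify(self, token):
        raise NotImplementedError

class Auth0Verifier(TokenVerifier):
    """Hypothetical production adapter: fetch JWKS, check the token's
    signature, expiry, and audience. Network call elided in this sketch."""
    def verify(self, token):
        raise NotImplementedError("provider call elided")

class FakeVerifier(TokenVerifier):
    """Deterministic stand-in for tests and local development:
    zero cloud dependencies."""
    def __init__(self, known):
        self.known = known  # token -> Claims

    def verify(self, token):
        return self.known[token]

# Swapping providers (or dropping one) touches a single adapter,
# not the whole codebase.
verifier = FakeVerifier({"t-123": Claims(user_id="u1", scopes=["read"])})
print(verifier.verify("t-123").user_id)  # u1
```

The point is not the three classes; it’s that the provider never leaks past the adapter, which is exactly what Cognito’s gravitational pull makes hard to maintain.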
Background job execution is common practice in web applications as well. You can deploy background workers in ECS (with Celery, for example, in Python) and connect them to ElastiCache or just RDS PostgreSQL. That’s a classic, flexible architecture that you can run locally as well. Or you can go with the cloud native approach and design the background jobs around EventBridge, Lambda functions, SNS, SQS, etc.
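Stripped to its essence, a Celery-style worker is a loop pulling jobs off a queue and retrying failures. A stdlib-only sketch of that loop (the task and the retry policy are illustrative; Celery adds persistence, routing, scheduling, and monitoring on top):

```python
import queue
import threading

class Worker:
    """Minimal background worker: pulls (func, args, attempt) jobs off a
    queue and re-enqueues failures up to a bounded number of retries."""

    def __init__(self, jobs, max_retries=3):
        self.jobs = jobs
        self.max_retries = max_retries
        self.results = []

    def run_forever(self):
        while True:
            job = self.jobs.get()
            if job is None:          # sentinel: shut down cleanly
                return
            func, args, attempt = job
            try:
                self.results.append(func(*args))
            except Exception:
                if attempt < self.max_retries:
                    self.jobs.put((func, args, attempt + 1))

jobs = queue.Queue()
jobs.put((lambda x: x * 2, (21,), 0))  # hypothetical task
jobs.put(None)

worker = Worker(jobs)
thread = threading.Thread(target=worker.run_forever)
thread.start()
thread.join()
print(worker.results)  # [42]
```

The same loop works whether the queue lives in memory, in Redis, or in the Postgres table you already have; nothing in it is tied to a cloud provider.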
The patterns are usually similar. Flexible classic architectures make it easy to delegate complexity in production while still allowing a local dev environment with zero need for the cloud. In these cases, most of the time you are working with widely used and actively maintained open source software (either locally or under the hood in the cloud). With cloud native architectures, you embrace the cloud provider’s complexity in the hope of streamlining the development of large applications that can scale massively.
Heroku was one of the first to understand complexity abstraction in the cloud. It didn’t force you to design your application based on their philosophy. The abstraction happened in the deployment of your application and all the services around it. You could leave, and your application came with you. Other modern PaaS follow a similar philosophy, but that’s for another post.
When the architecture doesn’t match the problem
Cloud native application patterns like microservices, event-driven systems, or serverless didn’t appear out of nowhere. For example, Netflix fully embraced a microservices architecture because hundreds of engineers needed to deploy services independently without stepping on each other. Amazon built SQS to solve painfully real distributed systems problems at internal scale. These are not just good solutions on paper; they were built for real conditions. In those cases, they become fundamental to the success of the company.
Cloud providers market those same patterns to every customer. Our example startup, with four engineers, reads all the success stories, the architecture and security blog posts, the same documentation as Netflix. At this point, you wonder why not follow the same patterns. They work. These companies are successful and scale to infinity. Who doesn’t want that? Well, more often than not, you don’t need it.
Microservices require a specific team structure to work: services owned end-to-end, per-service CI/CD, distributed tracing, clear API contracts. Without those conditions, coordination, development, testing, releases, observability, and debugging become a nightmare. The friction to iterate on the product and release value frequently goes through the roof. Four engineers building microservices are creating an overhead that compounds daily.
Lambda and similar function-based compute are useful for sporadic, isolated workloads such as image resizes, webhook handlers, or a scheduled job that runs once a day. When Lambda functions become the primary application architecture, the technology starts fighting your expectations: cold start latency under load, execution time limits, complexity in RDS database access and connection management, and concurrency limits that need a support ticket. The other friction comes with code reusability. If your functions depend on your core application logic, you’re stuck: keep them inside the same backend to share libraries, and you’ll hit deployment size limits; separate them into another repo, and you have code duplication. There’s no clean way out.
If you are not prepared to embrace all these particularities and your goal is instead to build a product with the maximum flexibility and the least friction possible, choosing this kind of architecture will only slow you down and generate unbearable friction.
The real costs of cloud native architectures
As hinted earlier, the costs of a cloud native architecture must be balanced against the benefits. If mass scaling is core to the business and the team and its level of automation are prepared, surely the benefits of a cloud native architecture will significantly outweigh the costs. Hell, maybe it’s even the perfect option for your needs. If not, the following costs will haunt you every day.
Financial cost. At small/mid scale, a cloud native architecture is going to be significantly more expensive than a traditional one, before factoring in engineering time spent managing the additional complexity. The cost efficiency argument kicks in at a scale threshold most startups never reach. Money spent on infrastructure that doesn’t need to be that complex is money not spent on the product. At a large scale, the mindset shifts to paying for granular operations rather than reserved capacity. The bill may be higher in absolute terms, but every dollar maps to real usage.
Cognitive cost. How many engineers on your team can understand, debug, and modify the system without further preparation? That number determines the team’s bus factor, your incident response quality, and how long it takes to get a new hire contributing. There are probably one or two engineers whose lives are miserable because they spend more time helping manage cloud services than building the product.
Vendor and architecture lock-in. Traditional vendor lock-in means switching is expensive. Cloud native architecture lock-in is different: data models, event structures, and job logic are built around provider-specific primitives, so changing direction means rewriting the product, not just the infrastructure. Most of these services are not open source, not portable, and built on behavior that can change at the provider’s discretion.
Failure is expensive. One of the principles of every agile organization is to fail early. In tech, failing early should be fairly cheap. But you need flexibility for that. A full-blown cloud native architecture is anything but flexible. Every failure may require painful refactors, complex migrations, changes in the deployment strategy, and so on. Keep a simple architecture, and technology failures will be easy to solve.
Opacity. Cloud services fail in ways you often can’t see into. Sometimes it’s documented, sometimes it’s not. You can’t attach a debugger to a managed service. You can’t read its source. When something goes wrong and you can’t find the root cause in logs, notifications, or social networks, you open a support ticket and wait. That wasted time comes directly out of building the product. (BTW, on a different note but still about opacity in the cloud, who decided that YAML-based pipelines were the right direction for infrastructure automation?)
Development experience. If you can’t run your application locally, your development loop depends on the cloud, which means additional costs, additional environments, and additional coordination. A cloud native stack built on Lambda, SQS, Cognito, and DynamoDB cannot be fully replicated locally. Granted, LocalStack tries to alleviate the problem, but 1) it’s not going to behave exactly as the real cloud does, 2) you may end up paying extra for a license, and 3) if your product is deployed on Azure or Google, you have even fewer options. When you don’t have local emulation, sharing one cloud dev environment among n developers will make their lives miserable unless you invest in automating dynamic environments in the cloud. That’s a hell of a price to pay for having a cloud native application.
What AI changes (and what it doesn’t)
If AI assistants can hold the knowledge your team lacks, create API contracts, automate platform tests, generate Terraform code, explain the failure modes, and navigate the CloudWatch logs, should the complexity of a native cloud application still be concerning?
For specific tasks with a properly described architecture context, absolutely, AI assistants are genuinely useful: writing boilerplate, explaining documentation, helping you understand a service you’ve never used, and serving as a rubber duck to explore architecture ideas (keeping an eye on their sycophantic tendencies). That’s real.
But it doesn’t dissolve the structural problems of adopting an architecture you don’t need or are not prepared for. Vendor lock-in isn’t a knowledge problem; an AI assistant knowing how to migrate off DynamoDB doesn’t make the migration cheap or fast. Cognitive cost isn’t just about whether someone knows the system, but also about whether someone can act on it, evolving it in line with your product needs or stepping in when production breaks and the team is scrambling. Architecture decisions involve trade-offs that are organizational, not just technical: team skills, hiring plans, budget, product roadmap. An AI can throw options at you based on concrete scenarios; it can’t weigh them against your overall situation. Nor does it remove the extra financial cost mentioned above or improve the development experience.
The harder point is that, given a big enough knowledge gap between the team and the AI in the context of your product’s technology, AI will make the underlying problem worse in a specific way: it lowers the barrier to generating complexity you don’t understand. A developer can have a complete Terraform module for a cloud native architecture in an hour, with Lambda functions, SQS queues, and IAM roles wired together correctly. They don’t know the failure modes, the concurrency limits, or what happens when something breaks in production. The complexity just arrived faster, and at this point, the team will lag behind the AI.
The full loop of architecture, development, deployment, observability, incident response, and back to architecture requires a persistent understanding of a live system that changes over time. No agent today handles designing a whole complex architecture reliably, aligned with the product needs.
What “Start Simple” actually means
Use cloud providers. Use managed services. They’re genuinely useful, and the operational leverage is real. Or don’t. The question is always: does this serve the product, or does the product now serve this?
Keep your system simple enough that you can debug and iterate on it yourself (with or without AI assistance). Keep making it better, and even bigger, but don’t add layers that you can’t see through. In every iteration, ask yourself the question again: Is this the simplest approach I can take to get from today to the next product goal?
In practice, for most early-stage startup products, this means deploying backend containers behind a load balancer, using SQL on a managed instance, and a queue backed by the same SQL engine until you actually need something else. Very importantly, don’t let go of your local development environment. It should approximate production closely enough that developers can work without touching the cloud most of the time. At the beginning, a reliable and fast CI/CD approach that lets you iterate quickly and soundly matters more than a fancy architecture that may scale infinitely.
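For that stack, the local environment can be a single compose file mirroring production’s moving parts. The image versions, entrypoints, and credentials below are illustrative, not prescriptive:

```yaml
# Local stand-in for production: the same Postgres engine RDS runs,
# plus the API and the worker, all on one machine. Entrypoints are
# hypothetical; adapt them to your application.
services:
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: dev-only-password
    ports:
      - "5432:5432"
  api:
    build: .
    command: uvicorn app.main:app --reload   # hypothetical entrypoint
    environment:
      DATABASE_URL: postgresql://postgres:dev-only-password@db:5432/postgres
    ports:
      - "8000:8000"
    depends_on:
      - db
  worker:
    build: .
    command: python -m app.worker            # hypothetical entrypoint
    environment:
      DATABASE_URL: postgresql://postgres:dev-only-password@db:5432/postgres
    depends_on:
      - db
```

One `docker compose up` and every developer has the whole system on their laptop, which is precisely what the Lambda + SQS + Cognito stack can’t give you.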
Delegate the boring, isolated, non-core problems. Let someone else manage database replication and backups. That’s the biggest value of a managed service. Don’t design your application’s architecture around cloud-only services unless they are key to your product’s survival or success in production. That’s what our friends on the Reality B team wish they had known when they designed their architecture.
The “canary” question for any new managed service: do I have the ability to debug it when it breaks or hits a limit? If the answer is no, you’re not abstracting complexity; you’re hiding it somewhere you’ll find it at the worst possible time.