Sorry for the rant article, but this has been a major pain recently.
Something everyone working in SaaS can agree on: things will go wrong. And when they do, they typically need to be fixed immediately. Minutes, not hours or days.
This is where CloudFormation caught me off guard: it's not flexible enough to align with that requirement.
I've been the unfortunate recipient of a few CloudFormation stacks that have drifted far away from the source template. For example, a couple of databases missing, volumes changed/replaced, CloudWatch alarms updated, etc.
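To get a picture of how far a stack has drifted before touching anything, something like the following minimal sketch may help. It assumes `cfn` is a boto3 CloudFormation client (for example `boto3.client("cloudformation")`); the function name and the polling interval are my own choices for illustration, and raising on a failed detection is just one way to handle that case.

```python
import time


def list_drifted_resources(cfn, stack_name, poll_seconds=5):
    """Run drift detection and return (logical_id, drift_status) pairs.

    `cfn` is assumed to be a boto3 CloudFormation client.
    """
    detection_id = cfn.detect_stack_drift(StackName=stack_name)[
        "StackDriftDetectionId"
    ]

    # Drift detection is asynchronous, so poll until it is no longer running.
    while True:
        status = cfn.describe_stack_drift_detection_status(
            StackDriftDetectionId=detection_id
        )
        if status["DetectionStatus"] != "DETECTION_IN_PROGRESS":
            break
        time.sleep(poll_seconds)

    if status["DetectionStatus"] == "DETECTION_FAILED":
        raise RuntimeError(
            status.get("DetectionStatusReason", "drift detection failed")
        )

    # Collect only resources that were modified or deleted out of band,
    # following NextToken until every page has been read.
    drifted = []
    kwargs = {
        "StackName": stack_name,
        "StackResourceDriftStatusFilters": ["MODIFIED", "DELETED"],
    }
    while True:
        page = cfn.describe_stack_resource_drifts(**kwargs)
        for drift in page["StackResourceDrifts"]:
            drifted.append(
                (drift["LogicalResourceId"], drift["StackResourceDriftStatus"])
            )
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token
    return drifted
```

Called as `list_drifted_resources(boto3.client("cloudformation"), "my-stack")`, where `my-stack` is a placeholder name, this should return one pair per drifted resource. It only reports drift; it does nothing to fix it.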
I've used CDK to rewrite some of the stack deployment process, but one glaring fact remains: you cannot get the existing resources out of CloudFormation Drift Hell.
CloudFormation supports importing existing resources into a stack, but there is no equally direct way to release them.
It boggles my mind that I cannot simply modify the current state or release resources entirely from CloudFormation management. No, you MUST go through template updates, which will fail when there is this much drift.
To remove a managed resource, you need to:
- Update the template to set the DeletionPolicy attribute to Retain on the resource.
- Update the CloudFormation stack.
- Remove the resource from the template.
- Update the CloudFormation stack.
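The two template edits in the steps above can be sketched as plain dictionary operations. This is only an illustration under a few assumptions: the template is JSON that has already been loaded into a Python dict, and the helper names `with_retain_policy` and `without_resource` are hypothetical, not part of any AWS tooling. Each returned template would still need to be pushed through its own stack update.

```python
import copy


def with_retain_policy(template, logical_id):
    """Step 1: return a copy of the template with DeletionPolicy: Retain
    set on the given resource. The input template is left untouched."""
    updated = copy.deepcopy(template)
    updated["Resources"][logical_id]["DeletionPolicy"] = "Retain"
    return updated


def without_resource(template, logical_id):
    """Step 3: return a copy of the template with the resource removed.

    Refuses to run unless the resource is already marked Retain, as a
    guard against deleting the real resource by accident.
    """
    resource = template["Resources"][logical_id]
    if resource.get("DeletionPolicy") != "Retain":
        raise ValueError(
            f"{logical_id} is not marked DeletionPolicy: Retain; "
            "removing it from the template would delete the resource"
        )
    updated = copy.deepcopy(template)
    del updated["Resources"][logical_id]
    return updated


# Hypothetical two-resource template, used only for illustration.
template = {
    "Resources": {
        "LegacyDatabase": {"Type": "AWS::RDS::DBInstance", "Properties": {}},
        "AppBucket": {"Type": "AWS::S3::Bucket"},
    }
}

step_one = with_retain_policy(template, "LegacyDatabase")  # deploy this first
step_two = without_resource(step_one, "LegacyDatabase")    # then deploy this
```

The sketch ignores anything else in the template that references the removed resource (Ref, Fn::GetAtt, DependsOn, Outputs), which would have to be cleaned up by hand before the second update can succeed.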
None of this is feasible when the stack has drifted, since both stack updates have to succeed. Not to mention the danger of editing templates like this. Obviously you don't want to delete something by accident, which is just another risk of prolonged downtime.
Ideally, I should just be able to:
- Remove the resource from the CloudFormation stack.
- Update the CloudFormation template, as if the resource were never there.
It seems to me like a fundamental flaw that a user cannot modify the current state easily. Sure, it could have dangerous side effects, but some situations really call for it.
Maybe the future will be brighter, but for now, I suggest that you DO NOT let anyone change managed resources without going through CloudFormation template updates.
Because of the drift, I'm now facing a tedious maintenance window to migrate resources, since I can't feasibly retain them. If I could update the CloudFormation state and adjust the resource management directly, this could be a zero-downtime incremental migration.