this post was submitted on 23 Jun 2023
33 points (100.0% liked)

Experienced Devs


Curious to know how many people do zero-downtime deployment of backend code and how many people regularly take their service down, even if very briefly, to roll out new code.

Zero-downtime deployment is valuable in some applications and a complete waste of effort in others, of course, but that doesn't mean people do it when they should and skip it when it's not useful.

[–] [email protected] 10 points 1 year ago* (last edited 1 year ago) (2 children)

Answering my own question: My systems do zero-downtime deployment. Some of my services are managed using ECS and some using custom deployment scripts.

It's interesting that people mostly focus on the mechanics of launching the new code. To me, the interesting thing about zero-downtime deployment is what happens while the release is in progress, when there will be a mix of the old and new code versions accessing the same resources (databases, microservices, etc.) at the same time.

For example, you don't want to just drop a previously-mandatory column from a SQL database: even if your new release no longer references the column, the new code will break if you deploy code before updating the database, and the old code will break if you update the database before deploying code. Obviously there are ways to do this kind of thing (roll out the change in small backward-compatible steps) but they're extra work and can be easy to get wrong even if you're using ECS to launch the code. Whereas, if you're allowed to take downtime, you can do it all in one step without worrying about mixed-version environments.
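To make the column example concrete, here is a minimal Python sketch of the "small backward-compatible steps" idea. Everything in it is hypothetical and for illustration only: the `CodeVersion` type, the `insert_ok` check, and the `fax` column are assumptions, and the schema is modeled as a plain dict rather than a real database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeVersion:
    name: str
    writes: frozenset  # columns this release sets on INSERT


def insert_ok(schema: dict, code: CodeVersion) -> bool:
    """schema maps column name -> True if the column is mandatory (NOT NULL)."""
    # Writing to a column that no longer exists fails.
    if not code.writes <= set(schema):
        return False
    # Omitting a mandatory column fails.
    return all(col in code.writes for col, required in schema.items() if required)


# Hypothetical releases: v2 no longer references the "fax" column.
OLD = CodeVersion("v1", frozenset({"id", "email", "fax"}))
NEW = CodeVersion("v2", frozenset({"id", "email"}))

start = {"id": True, "email": True, "fax": True}

# Step 1: relax the constraint. Both releases work against this schema.
relaxed = {**start, "fax": False}
# Step 2: roll out v2 while v1 is still serving traffic (mixed versions).
# Step 3: drop the column only once no v1 instance is left.
dropped = {"id": True, "email": True}

for label, schema in [("start", start), ("relaxed", relaxed), ("dropped", dropped)]:
    print(label, "old:", insert_ok(schema, OLD), "new:", insert_ok(schema, NEW))
```

Only the intermediate `relaxed` schema is safe for both releases at once, which is why the change has to be split into separate steps when old and new code overlap.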

[–] [email protected] 5 points 1 year ago

if you're allowed to take downtime, you can do it all in one step without worrying about mixed-version environments.

You don't need to worry about mixed-version environments, but you do need to worry about whether you can roll back your changes without losing data. It's not as hard, but it seems to get overlooked if there haven't been any bad deployments lately.

[–] [email protected] 3 points 1 year ago

On the flip side, if something goes wrong and your service is backwards compatible, you can roll back without any more issues. If you allow downtime and backwards-incompatible changes, a rollback can cause even more problems and result in far longer outages and lots of very stressed programmers.

You should always be able to roll back code changes. And zero-downtime deployments are not that hard to do if you are already enforcing that.

[–] [email protected] 9 points 1 year ago

Only our most legacy system requires downtime to update. Everything else is zero downtime using ECS

[–] [email protected] 7 points 1 year ago

Zero-downtime for us using Kubernetes. It's built in. The Deployment gets updated, a new pod comes online, and once it's healthy, the old pod goes offline.

We do have a little code to handle graceful shutdowns to properly finish any active requests before going offline, but that was a trivial addition.
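For readers who have not written this kind of shutdown code, a rough Python sketch of the idea follows. It is not this commenter's implementation: the `Drainer` class and `install_sigterm_handler` helper are hypothetical names, and the sketch assumes the process receives SIGTERM when its pod is being stopped, which is what Kubernetes sends.

```python
import signal
import threading


class Drainer:
    """Tracks in-flight requests so shutdown can wait for them to finish."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._active = 0
        self._draining = False

    def try_begin(self) -> bool:
        """Call when a request arrives; False means refuse it, we are draining."""
        with self._cond:
            if self._draining:
                return False
            self._active += 1
            return True

    def end(self) -> None:
        """Call when a request has finished."""
        with self._cond:
            self._active -= 1
            self._cond.notify_all()

    def start_draining(self) -> None:
        # Plain attribute assignment, so it is safe to call from a signal handler.
        self._draining = True

    def wait_idle(self, timeout: float) -> bool:
        """Block until nothing is in flight; False if the timeout expired first."""
        with self._cond:
            return self._cond.wait_for(lambda: self._active == 0, timeout)


def install_sigterm_handler(drainer: Drainer) -> None:
    """Must be called from the main thread of the process."""
    signal.signal(signal.SIGTERM, lambda signum, frame: drainer.start_draining())
```

A server would wrap each request in `try_begin()` / `end()`, and after SIGTERM call `wait_idle()` with a timeout shorter than the pod's termination grace period before exiting.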

[–] [email protected] 7 points 1 year ago* (last edited 1 year ago)

Do not

There are deployment methodologies that exist to avoid downtime, which I posted about in [email protected]

Thanks for the content idea :)

TL;DR:

Rolling Deployment

Canary Deployment

Blue-Green Deployment

A/B Deployment

[–] [email protected] 6 points 1 year ago

For our batch workflows, we do have downtime on deploys. It's by design, because zero downtime doesn't add any value there. Downtime is usually 5 to 10 minutes. For our services, we rely on Lambdas or Kubernetes rolling deployments, so no downtime.

[–] [email protected] 6 points 1 year ago (1 children)

Zero-downtime deployment strategies, such as blue-green deployment, can get very complex for heavy-usage apps.

We decided to avoid the complexity with some practical workarounds.

  • Most deployments happen at 4am. "develop" branch merges deploy at 4am, and "master" branch merges deploy immediately.
  • We force browser refresh if the front end detects the back end has had breaking changes. We attempt to re-populate form field values.
  • During database migrations, we send 503 with Retry-After header in response to POSTs. Our client code knows to wait for that time and try again. If the time is too long, the user gets a friendly message that it will try again in X seconds. GETs are handled by an available read-replica, if possible.
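
The client side of the 503 / Retry-After approach in the last bullet might look roughly like the Python sketch below. The names `post_with_retry`, `send`, `notify`, and `notify_threshold_seconds` are hypothetical, the thresholds are made-up defaults, and only the delay-in-seconds form of the header is handled (the header may also carry an HTTP date).

```python
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str]]  # (status code, headers)


def post_with_retry(
    send: Callable[[], Response],
    notify_threshold_seconds: float = 5.0,
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
    notify: Callable[[float], None] = lambda seconds: None,
) -> Response:
    """Re-send a POST while the server answers 503 with a Retry-After header."""
    response = send()
    for _ in range(max_attempts - 1):
        status, headers = response
        retry_after = headers.get("Retry-After")
        if status != 503 or retry_after is None:
            break
        try:
            delay = float(retry_after)
        except ValueError:
            break  # HTTP-date form is not handled in this sketch
        if delay > notify_threshold_seconds:
            notify(delay)  # e.g. show "will try again in X seconds"
        sleep(delay)
        response = send()
    return response
```

Here `send` stands in for whatever actually issues the request, and `notify` for whatever shows the friendly message to the user; both are injected so the retry logic can be exercised without a network.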
[–] [email protected] 3 points 1 year ago (1 children)

We force browser refresh if the front end detects the back end has had breaking changes. We attempt to re-populate form field values.

Do users not find this disruptive?

[–] [email protected] 3 points 1 year ago* (last edited 1 year ago)

Yes, but it's a very rare event. Maintaining state (form fields) makes it less of an issue. As I said, most deploys are at 4am at extremely low usage (usually zero), and even then a refresh is only needed if the backend has had breaking changes. A severe bug requires a mid-day deploy, but in my experience most severe bug fixes are only a few lines, so they aren't breaking changes and don't require a refresh.

Our way wouldn't work well if you had 24 hours of heavy load, but most apps I've written have been US-only with low nightly usage (HR, K-12 admin, power grid, medical).

[–] [email protected] 5 points 1 year ago

Zero downtime with ECS and Lambda.

[–] [email protected] 4 points 1 year ago

Whenever possible, I've run projects to have zero-downtime deployments. Multiple stateless instances behind a load balancer. Deploy one instance at a time, run a health check, and move traffic to the fresh instances. Most cloud providers offer this out of the box. Database migrations are run well in advance. New functionality is hidden behind feature flags.
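
As an illustration of the "one instance at a time, then health check" loop, here is a toy Python sketch. It is an assumption-laden model, not any cloud provider's API: the fleet is a dict of instance name to version, and `rolling_deploy` and `is_healthy` are hypothetical names.

```python
from typing import Callable, Dict


def rolling_deploy(
    versions: Dict[str, str],
    new_version: str,
    is_healthy: Callable[[str, str], bool],
) -> bool:
    """Upgrade one instance at a time; restore the old fleet if a check fails."""
    previous = dict(versions)  # remembered so we can roll back
    for instance in list(versions):
        # In a real system the instance is taken out of rotation here.
        versions[instance] = new_version
        if not is_healthy(instance, new_version):
            versions.clear()
            versions.update(previous)  # automated rollback
            return False
        # Healthy: the load balancer may send traffic to it again.
    return True


fleet = {"a": "v1", "b": "v1", "c": "v1"}
ok = rolling_deploy(fleet, "v2", lambda instance, version: True)
print(ok, fleet)
```

Because the rest of the fleet keeps serving while one instance is upgraded, old and new versions overlap for the duration of the loop, which is why the migrations have to land well in advance.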

Zero downtime is nice, but the real benefit is that you force the teams to really think about deployments as migrations to accomplish this policy.

Your instrumentation and alerting need to be top-shelf, and you need to automate deployments fully, which means you can fully automate rollbacks.

The downside is that you have to build everything twice, deployments are slower, and there is significant descaffolding to do afterwards.

But that's a small price to pay not to be on call outside of business hours to deploy.

[–] [email protected] 4 points 1 year ago

We have a clustered/load-balanced application and do zero-downtime deployments with Elastic Beanstalk on AWS. I’m uncertain as to how popular Elastic Beanstalk is, but it makes managing this sort of stuff really easy.

[–] [email protected] 2 points 1 year ago

I write data pipeline code and there is zero downtime. We use Kafka to buffer messages from dozens of producers to dozens of consumers on Kubernetes.

[–] [email protected] 2 points 1 year ago* (last edited 1 year ago)

Yep. All of them. Each release for about 40 legacy ASP.NET Web Forms apps.

[–] [email protected] 2 points 1 year ago

I just use platforms that have it built in.

[–] [email protected] 2 points 1 year ago* (last edited 1 year ago)

I do not. I've got apps that go unused 5pm-9am or 1am-9am depending on the night, apps that have lower usage 5pm-8am, and one app that's basically unused 8am-5pm. Each one gets redeployed in its quiet period, with hours to spare to roll back any problems.

Actual downtime is usually around twenty seconds which is fine except for emergency midday builds.

I've been trying to get zero downtime deployments, but it's hard to justify the extra complexity when we've got such open service windows. Also, we're likely to have more downtime from an ISP service outage than our midday builds.

Other teams have much shorter service windows but deployments that take the whole window.

[–] [email protected] 2 points 1 year ago

Yeah, zero downtime. You ship out the new features but gate them using some system you can control. When all the new features are shipped, you turn them up until they get to 100%. This lets you observe the real-world behavior of the new features. If they don’t cache well, or cause 500s, or what have you, you can turn them off without having to ship new code.

Also, if you keep all these feature flags and you run into capacity problems, you can turn down features for the survival of the service as a whole.
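
One common way to "turn up" a flag gradually is to hash each user into a stable bucket and compare it with the current rollout percentage. The Python sketch below assumes that approach; the `bucket` and `is_enabled` functions and the `new_search` flag name are hypothetical, and the commenter's actual gating system is not described.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Map (flag, user) to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag: str, user_id: str, rollout_percent: dict) -> bool:
    """A flag missing from the config is treated as off (0%)."""
    return bucket(flag, user_id) < rollout_percent.get(flag, 0)


# Turning a feature up, down, or off is a config change, not a deploy.
config = {"new_search": 25}
users = [f"user-{i}" for i in range(200)]
enabled = [u for u in users if is_enabled("new_search", u, config)]
print(len(enabled), "of", len(users), "users see new_search")
```

Because the bucket depends only on the flag and the user, raising the percentage only ever adds users, and setting it to 0 acts as the kill switch for a misbehaving or expensive feature.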

[–] [email protected] 1 points 1 year ago

Very rarely. Most of our services are on ECS, which manages rolling deployments. Older/legacy systems are manually taken out of the load balancer, upgraded, then added back.

Only times I can think of have been backwards incompatible database changes/database engine upgrades. This is rare.

[–] [email protected] 1 points 1 year ago

Disclaimer: I work in the central Delivery Engineering org for a FAANG. We deploy somewhere in the range of 30-50,000 times per day through our system, and I could count on one hand the number of services that require downtime for an update.

Downtime is completely unnecessary in modern service development. If I experience a product that uses downtime for deployments, I take it as an indicator that the product is immature and probably not built in a way that I want to depend on for personal or professional life.

I totally get that for small businesses downtime may be a necessity, because avoiding it does require additional effort. But if you're building a product or service that people depend on, downtime severely erodes trust and frustrates users.

Always release backwards-compatible changes. If you need to do a schema migration, ensure the DDL can be performed online and if it can't, dual write.
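
A very rough Python sketch of the dual-write fallback follows, with in-memory dicts standing in for the old and new tables. `DualWriteStore` and `backfill` are hypothetical names, and a real system also has to deal with one of the two writes failing, which this sketch ignores.

```python
class DualWriteStore:
    """Writes go to both stores; reads switch over once the new one is complete."""

    def __init__(self, old: dict, new: dict, read_from_new: bool = False) -> None:
        self.old = old
        self.new = new
        self.read_from_new = read_from_new

    def put(self, key, value) -> None:
        self.old[key] = value  # old store stays authoritative during migration
        self.new[key] = value

    def get(self, key):
        source = self.new if self.read_from_new else self.old
        return source.get(key)


def backfill(old: dict, new: dict) -> None:
    """Copy historical rows without overwriting fresher dual-written ones."""
    for key, value in old.items():
        new.setdefault(key, value)


old_table = {"a": 1, "c": 3}
new_table = {}
store = DualWriteStore(old_table, new_table)
store.put("a", 10)            # lands in both stores
backfill(old_table, new_table)
store.read_from_new = True    # cut reads over once the stores match
print(store.get("a"), store.get("c"))
```

Once reads have moved to the new store and nothing depends on the old one, the writes to the old store can be dropped in a later release.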

[–] [email protected] 1 points 1 year ago

I use Google Cloud Build + Google Cloud Run and it’s built in, so no

[–] [email protected] 0 points 1 year ago (1 children)

Our backend is written in Go... CI/CD compiles the binary, uploads it to the server under a temporary name, mv's it into place and -HUP's the process. So no downtime at all.

[–] [email protected] 4 points 1 year ago (1 children)

What about after the HUP? The time between the HUP and the new binary starting up would be considered downtime.

[–] [email protected] 1 points 1 year ago

We're talking milliseconds. The whole thing is run through an nginx proxy which would immediately retry if it failed.
