Resgrid Disaster Recovery

Under the specter of the Caldor Fire, we’ve been hard at work on Resgrid’s backend and resiliency systems. We have some redundancy plans in place, but with a major event like this we felt it necessary to step up our efforts and ensure that Resgrid is always available.

Although we have some really cool features in the works, Push to Talk and Invoicing, we are halting work on those for now to ensure our platform can recover from a major event within a reasonable amount of time.

So what are we doing? First, we evaluated cloud platforms to see where it makes sense for us to house our backup infrastructure, and we decided on Linode (https://www.linode.com/). Resgrid originally started as a Microsoft Azure hosted service, but since opening up our source code we have tried to ensure our hosted environment matches what people can deploy locally, so we have to avoid vendor lock-in at all costs.

Linode has long hosted some of our other backend services. We’ve never had any issues with their platform, and it offers a great price-to-performance ratio. Our backup provider will be Digital Ocean, as we have used them a ton in the past (our URL Shortener and original Status Page were hosted there), but they have some restrictions on sending emails that could impact us.

What does this mean for you? Well, nothing changes. We will be working on this for the next little while, and there will be some downtime that we will have to schedule. You can see when that downtime is scheduled on our status page (https://resgrid.freshstatus.io/). During that window we will move our active hosting to the backup host to simulate a Disaster Recovery event and run there for a set period of time (probably a couple of weeks), after which there will be another scheduled downtime to migrate back.

Why would we move our active running instance to our DR site? Well, just having a Disaster Recovery plan doesn’t mean much if you’re not exercising it. This will allow us to verify the process and operations, and it also allows us to perform maintenance on our main data center. We have 2 very large servers (Canterbury and Donnager) that host Resgrid in an active-active fashion. We want to upgrade the RAM in both, and the DR testing gives us a great opportunity to do that.
