As many of you have seen, we are working to recover from yesterday's massive outage.
Our primary web server had a drive failure yesterday. Unfortunately, the backup drive was also not functioning properly, so we were unable to recover using our fail-over drive. After trying for hours to revive the server, the techs determined that a new server would need to be provisioned to replace the failed one. Provisioning the new server led to further delays.
Once the new server was ready, our software team had to restore all the proper server configurations and load all the software and patches onto the server. This process has taken many hours, and we are now testing the server. Our target is to go live at 9am PST on Leap Day (2/29/2012).
We are reviewing our processes and our data center hosting provider as a result of this major outage. We see many areas where a better approach would have greatly reduced the downtime, and where we can reduce it for any such major hardware failure in the future. We will share a more detailed explanation of the series of events for other companies that are reviewing hosting providers or trying to figure out a disaster recovery plan for their Internet services.
We do keep our database on a separate server, and this server was not impacted. Order history and information were not affected by this system failure.
We will keep an eye on the system for stability issues and bugs as we go live, and hope to have the system live and stabilized shortly.
We sincerely apologize to all our customers who have had to endure the annoyance that this outage has created.
Thank you.