[webkit-dev] bugs.webkit.org and trac.webkit.org down?
wsiegrist at apple.com
Mon Aug 20 09:20:01 PDT 2012
On Aug 9, 2012, at 3:41 AM, Mark Rowe <mrowe at apple.com> wrote:
> On 2012-08-09, at 03:14, Peter Beverloo <peter at chromium.org> wrote:
>> On Thu, Aug 9, 2012 at 11:09 AM, Mark Rowe <mrowe at apple.com> wrote:
>> On 2012-08-09, at 02:41, Osztrogonac Csaba <oszi at inf.u-szeged.hu> wrote:
>> > Hi,
>> > bugs.webkit.org and trac.webkit.org are unavailable again. :(
>> > Could you check it, please?
>> This is caused by a problem on a host that I don't have sufficient privileges on to be able to address the issue myself. I've pinged people that should be able to resolve the issue, but given that it's currently 3am in California it may be a few hours before they're awake and fixing it.
>> Thank you. Is there any way an on-call or monitoring system could be set up? While it fortunately occurs much less frequently nowadays than it did earlier in the year, it --please excuse my directness-- is unacceptable for a project the size of WebKit to have critical infrastructure unavailable like this for several hours. Even though it's 3am in California, there are many contributors in Asia and Europe who are severely impacted by this.
> We have people online virtually 24x7 that are capable of investigating and addressing issues with the webkit.org infrastructure. I'm one such person. This particular case is an unfortunate combination of events: a configuration error on a subset of the new hardware that webkit.org was recently migrated to has unintentionally limited the number of people that can access the host that is currently experiencing problems, and the person that such issues are escalated to is currently on vacation. It's an unfortunate combination, but also one that is unlikely to repeat.
> One thing we should look into is improving the process for reporting issues with webkit.org infrastructure. webkit-dev isn't an ideal way to report issues that our monitoring system hasn't noticed, as there's no separation between the regular discussion and more urgent issues, so the emails can be overlooked.
I am back from vacation and just wanted to let everyone know that I am sorry that so many things went wrong and combined to create significant downtime. Like Mark said, we do have systems in place for monitoring, and we usually catch and fix things quickly, even at 3am. The new hardware is a big change for us, and our monitoring system is also changing to accommodate the new environment, so I will be spending time this week getting to the bottom of the database issue and making sure we're covering everything we need to with our monitoring systems.
As for a better notification system, we can create a webkit-sysadmin list that contains the people with shell access and send admin at webkit mail there. Currently, I'm the only recipient of admin at webkit mail, so we should fix that single point of failure. Does that seem reasonable to everyone?