On Sun, Nov 13, 2016 at 11:17:58 +0100, Lorenzo Donati wrote:
> Just curious, could you explain in a bit more detail what happened? I was
> talking about catastrophic system failures to my students [1] last week and
> maybe this could make a nice case study? Of course feel free to ignore my
> request if you are too busy or if you cannot disclose the details.

The machine in question is a virtual machine, which means we actually got
to look at its console.  The console was full of messages along the lines
of:

INFO: task <process>:<pid> blocked for more than 120 seconds

In this instance, pretty much all the apps were affected, which suggests
that the IO subsystem for the VM had a hiccough.  The host system was fine,
so we are basically only going to blame bogons for the fault.
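
For a case study, here is a minimal sketch of how console output could be
scanned for that kind of message.  It is not something we actually run; the
function names and the regular expression are illustrative assumptions
based only on the message format quoted above.

```python
import re

# Assumed pattern, derived from the message format quoted above:
#   INFO: task <process>:<pid> blocked for more than 120 seconds
# The process name is matched greedily so that names which themselves
# contain a colon still split at the last ":<pid>".
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<process>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<seconds>\d+) seconds"
)


def find_hung_tasks(console_text):
    """Return a (process, pid, seconds) tuple for each hung-task line."""
    hits = []
    for line in console_text.splitlines():
        match = HUNG_TASK_RE.search(line)
        if match:
            hits.append(
                (
                    match.group("process"),
                    int(match.group("pid")),
                    int(match.group("seconds")),
                )
            )
    return hits


def distinct_processes(hits):
    """Return the sorted set of process names seen in the hits."""
    return sorted({process for process, _pid, _seconds in hits})
```

If distinct_processes() comes back with most of what runs on the box, as
it would have here, that points at something shared such as IO rather than
at any one application.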

As for not spotting it in time, that was simply human error combined with
bad error reporting design.  The box in question is our primary web server,
which means the monitoring apps present their reports there, and as a human
I simply didn't try to look at anything for about 12 hours.

In mitigation, I have in the past mentioned ways to contact me out of band
(since yes, that server is also the mail delivery box) but in the end it
was simply the sort of thing that a non-professional small-time hosting
provider hits.

Sorry again for the inconvenience.

D.

-- 
Daniel Silverstone                         http://www.digital-scurf.org/
PGP mail accepted and encouraged.            Key Id: 3CCE BABE 206C 3B69