Opened at 2008-07-02T00:27:38Z
Last modified at 2008-09-03T01:17:14Z
#485 closed task
server incident reporting — at Version 1
Reported by: | warner | Owned by: | somebody |
---|---|---|---|
Priority: | major | Milestone: | 1.3.0 |
Component: | operational | Version: | 1.1.0 |
Keywords: | | Cc: | |
Launchpad Bug: | | | |
Description (last modified by warner)
The current version of Foolscap has code to report "incidents": logs of the events that led up to some high-severity event. It also has an API for subscribing to these incidents.
We need to build a mechanism to gather these incidents. The storage servers on a commercial grid should report incidents to this gatherer, which can then summarize them and deliver them via email or an RSS feed.
See also #484, which addresses a similar issue on the client side.
Things to be wary of: overloading the gatherer, unbounded growth of the sender's queue, and thundering herds if many servers experience problems at the same time.
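A rough sketch of how the sending side might handle the queue-size and thundering-herd concerns: a fixed-size queue that discards the oldest pending incident when full, plus a randomized delay before each upload so that servers which fail at the same moment do not all contact the gatherer at once. The names `BoundedIncidentQueue`, `max_pending`, and `jittered_delay` are hypothetical and exist only for this illustration; none of this is Foolscap or Tahoe API.

```python
import random
from collections import deque


class BoundedIncidentQueue:
    """Hypothetical sender-side queue; not part of Foolscap."""

    def __init__(self, max_pending=100):
        # A deque with maxlen discards the oldest entry when a new one
        # is appended to a full queue, so memory use stays bounded.
        self._pending = deque(maxlen=max_pending)
        self.dropped = 0

    def add(self, incident_name):
        if len(self._pending) == self._pending.maxlen:
            self.dropped += 1  # the oldest entry is about to be discarded
        self._pending.append(incident_name)

    def pop_next(self):
        """Return the oldest pending incident, or None if the queue is empty."""
        return self._pending.popleft() if self._pending else None

    def __len__(self):
        return len(self._pending)


def jittered_delay(base_seconds=60.0, spread=0.5, rng=random):
    """Pick a delay uniformly in [base*(1-spread), base*(1+spread)]."""
    return base_seconds * rng.uniform(1.0 - spread, 1.0 + spread)
```

Dropping the oldest entry is only one possible policy. Dropping the newest, or keeping a count of discarded incidents to report later (as `dropped` does here), may suit an operator better.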
The gatherer's interface should have a way to manage the display of incidents: human operators should be able to say "yes, I know about that one", and not be distracted by well-known problems for which a fix is in progress. This implies a table of incident dispositions (new, still-troublesome, ignored), and maybe eventually a mechanism to automatically classify new incidents as belonging to a known category ("another 42 instances of Bug#123 were seen today").
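To make the disposition table concrete, here is a minimal sketch, assuming incidents have already been assigned a category string by some means (the automatic classifier itself is not shown). `Disposition`, `IncidentTable`, and their methods are hypothetical names, not an existing gatherer interface.

```python
from collections import Counter
from enum import Enum


class Disposition(Enum):
    NEW = "new"
    STILL_TROUBLESOME = "still-troublesome"
    IGNORED = "ignored"


class IncidentTable:
    """Hypothetical gatherer-side table; category keys are free-form strings."""

    def __init__(self):
        self._disposition = {}   # category -> Disposition
        self._seen = Counter()   # category -> number of incidents recorded

    def record(self, category):
        """Count one incident; unknown categories start out as NEW."""
        self._seen[category] += 1
        return self._disposition.setdefault(category, Disposition.NEW)

    def mark(self, category, disposition):
        """Operator decision, e.g. 'yes, I know about that one'."""
        self._disposition[category] = disposition

    def summary(self):
        """One line per category that the operator has not ignored."""
        return [
            f"{count} instances of {category} ({self._disposition[category].value})"
            for category, count in sorted(self._seen.items())
            if self._disposition[category] is not Disposition.IGNORED
        ]
```

Ignored categories are still counted, so an operator who later un-ignores one would see the full tally rather than a count starting from zero.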
Note that foolscap-0.2.8 has a bug in its Incident-handling code (it throws an exception during setLogDir if the incident-holding directory already exists), which makes it unsuitable for use. I've fixed the bug, but this ticket is blocked on the next Foolscap release, which will include the fix.
http://foolscap.lothar.com/trac/milestone/0.2.9 is the release in question.