Opened at 2008-07-02T00:27:38Z
Last modified at 2008-09-03T01:17:14Z
#485 closed task
server incident reporting — at Initial Version
Reported by: | warner | Owned by: | somebody |
---|---|---|---|
Priority: | major | Milestone: | 1.3.0 |
Component: | operational | Version: | 1.1.0 |
Keywords: | | Cc: | |
Launchpad Bug: | | | |
Description
The current version of Foolscap has code to report "incidents", which are logs of the events that led up to some high-severity event. There is also an API to subscribe to hear about these incidents.
We need to build a gathering mechanism for these incidents. The storage servers on a commercial grid should report Incidents to this gatherer, and the gatherer can then summarize and deliver them via email or an RSS feed.
See also #484, which addresses a similar issue on the client side.
Things to be wary of: overloading the gatherer, unbounded growth of the sender's queue, and thundering herds if many servers experience problems at the same time.
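To make the queue-size and thundering-herd concerns concrete, here is a minimal sketch of what the sending side might do. It uses only the Python standard library and does not use any Foolscap API; `BoundedIncidentQueue`, `jittered_delay`, and all parameter values are hypothetical names and numbers chosen for illustration.

```python
import random
from collections import deque


class BoundedIncidentQueue:
    """Hypothetical sender-side buffer holding at most max_size incidents.

    When the buffer is full, the oldest pending incident is discarded so
    that memory use stays bounded while the gatherer is unreachable.
    """

    def __init__(self, max_size=100):
        # A deque with maxlen silently drops the oldest item on overflow.
        self._pending = deque(maxlen=max_size)
        self.dropped = 0  # how many incidents were discarded

    def add(self, incident):
        if len(self._pending) == self._pending.maxlen:
            self.dropped += 1
        self._pending.append(incident)

    def drain(self, batch_size=10):
        """Remove and return up to batch_size incidents, oldest first."""
        batch = []
        while self._pending and len(batch) < batch_size:
            batch.append(self._pending.popleft())
        return batch

    def __len__(self):
        return len(self._pending)


def jittered_delay(attempt, base=1.0, cap=300.0, rng=random):
    """Exponential backoff with full jitter, in seconds.

    Each sender waits a random time in [0, min(cap, base * 2**attempt)],
    so servers that fail together do not all report at the same instant.
    """
    return rng.uniform(0, min(cap, base * 2 ** attempt))
```

Dropping the oldest incident is only one possible policy; dropping the newest, or keeping a count of suppressed incidents to report later, may suit the gatherer better. Sending in small batches via `drain()` also limits how much any one server can push to the gatherer at once.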
The gatherer's interface should have a way to manage display of incidents: human operators should be able to say "yes, I know about that one", and not be distracted by well-known problems for which a fix is in progress. This implies a table of incident dispositions (new, still-troublesome, ignored), and maybe eventually a mechanism to automatically classify new incidents as being in a known category ("another 42 instances of Bug#123 were seen today").
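A rough sketch of such a disposition table follows, assuming each incident has already been reduced to a category string by some classifier that is not specified here. The names `Disposition`, `IncidentTable`, and the category keys are hypothetical; nothing below is an existing Foolscap or Tahoe interface.

```python
from collections import Counter
from enum import Enum


class Disposition(Enum):
    NEW = "new"
    STILL_TROUBLESOME = "still-troublesome"
    IGNORED = "ignored"


class IncidentTable:
    """Hypothetical gatherer-side table of incident categories."""

    def __init__(self):
        self._disposition = {}    # category -> Disposition
        self._counts = Counter()  # category -> number of incidents seen

    def record(self, category):
        """Count one incident; unseen categories start out as NEW."""
        self._counts[category] += 1
        return self._disposition.setdefault(category, Disposition.NEW)

    def mark(self, category, disposition):
        """Operator action, e.g. 'yes, I know about that one'."""
        self._disposition[category] = disposition

    def summary(self):
        """One line per category that has not been marked IGNORED."""
        lines = []
        for category, count in sorted(self._counts.items()):
            disposition = self._disposition[category]
            if disposition is not Disposition.IGNORED:
                lines.append(f"{category}: {count} seen ({disposition.value})")
        return lines
```

For example, after recording several incidents under a category such as "Bug#123" and calling `mark("Bug#123", Disposition.IGNORED)`, that category no longer appears in `summary()`, while its count is still kept for a daily report.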