#485 closed task

server incident reporting — at Version 1

Reported by: warner Owned by: somebody
Priority: major Milestone: 1.3.0
Component: operational Version: 1.1.0
Keywords: Cc:
Launchpad Bug:

Description (last modified by warner)

The current version of Foolscap has code to report "incidents", which are logs of the events that led up to some high-severity event. There is also an API to subscribe to hear about these incidents.

We need to build a gathering mechanism for these incidents. The storage servers on a commercial grid should report Incidents to this gatherer, and the gatherer can then summarize them and deliver the summaries via email or an RSS feed.

See also #484, which addresses a similar issue on the client side.

Things to be wary of: overloading the gatherer, unbounded growth of the sender's queue, and thundering herds if many servers experience problems at the same time.
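One way to picture the sender side of these concerns is the sketch below. It is not Foolscap code: BoundedIncidentQueue, jittered_delay, and all of their parameters are hypothetical names chosen for illustration, and the drop-oldest policy is just one possible choice. The queue caps how many incidents a server holds while the gatherer is unreachable, and the randomized backoff spreads out retries so that many servers failing at once do not all reconnect at the same instant.

```python
import collections
import random


class BoundedIncidentQueue:
    """Hypothetical sender-side buffer holding at most max_pending incidents."""

    def __init__(self, max_pending=100):
        # A deque with maxlen discards its oldest entry when a new one
        # is appended to a full queue.
        self._pending = collections.deque(maxlen=max_pending)
        self.dropped = 0  # how many incidents were discarded to stay bounded

    def add(self, incident):
        if len(self._pending) == self._pending.maxlen:
            self.dropped += 1
        self._pending.append(incident)

    def take_batch(self, batch_size=10):
        """Remove and return up to batch_size of the oldest pending incidents."""
        batch = []
        while self._pending and len(batch) < batch_size:
            batch.append(self._pending.popleft())
        return batch

    def __len__(self):
        return len(self._pending)


def jittered_delay(attempt, base=1.0, cap=300.0, rng=random):
    """Exponential backoff with full jitter, in seconds.

    The delay is drawn uniformly from [0, min(cap, base * 2**attempt)].
    """
    return rng.uniform(0, min(cap, base * 2 ** attempt))
```

Sending in small batches (take_batch) also limits how much any single server can push at the gatherer in one go. A real sender would probably want to report the dropped count to the gatherer too, so that operators know the record is incomplete.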

The gatherer's interface should have a way to manage display of incidents: human operators should be able to say "yes, I know about that one", and not be distracted by well-known problems for which a fix is in progress. This implies a table of incident disposition (new, still-troublesome, ignored), and maybe eventually a mechanism to automatically classify new incidents as being in a known category ("another 42 instances of Bug#123 were seen today").
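A minimal sketch of such a disposition table follows, assuming each incident has already been assigned a category string by some classifier that is not shown here. The names Disposition and IncidentTable and their methods are hypothetical, not part of Foolscap or of any existing gatherer.

```python
import collections
import enum


class Disposition(enum.Enum):
    NEW = "new"
    STILL_TROUBLESOME = "still-troublesome"
    IGNORED = "ignored"


class IncidentTable:
    """Hypothetical gatherer-side table of incident categories."""

    def __init__(self):
        self._disposition = {}  # category -> Disposition
        self._counts = collections.Counter()  # category -> incidents seen

    def record(self, category):
        """Count one incident; categories not seen before start out as NEW."""
        self._counts[category] += 1
        self._disposition.setdefault(category, Disposition.NEW)

    def mark(self, category, disposition):
        """Operator action, e.g. "yes, I know about that one"."""
        self._disposition[category] = disposition

    def visible(self):
        """Categories an operator should still see (everything not IGNORED)."""
        return sorted(
            c for c, d in self._disposition.items()
            if d is not Disposition.IGNORED
        )

    def summary(self):
        """One line per category, suitable for an email or feed digest."""
        return [
            "%d instances of %s (%s)"
            % (self._counts[c], c, self._disposition[c].value)
            for c in sorted(self._counts)
        ]
```

For example, after recording 42 incidents under the category "Bug#123" and calling mark("Bug#123", Disposition.IGNORED), that category no longer appears in visible() but is still counted in summary().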

Change History (1)

comment:1 Changed at 2008-07-02T03:46:20Z by warner

  • Description modified (diff)

Note that foolscap-0.2.8 has a bug in its Incident-handling code (it throws an exception during setLogDir if the incident-holding directory already exists), which makes it unsuitable for use. I've fixed the bug, but this ticket is blocked on the next Foolscap release, which will include the fix.

http://foolscap.lothar.com/trac/milestone/0.2.9 is the release in question.
