.. -*- coding: utf-8-with-signature -*-

=============
Tahoe Logging
=============

1. `Overview`_
2. `Realtime Logging`_
3. `Incidents`_
4. `Working with flogfiles`_
5. `Gatherers`_

   1. `Incident Gatherer`_
   2. `Log Gatherer`_

6. `Adding log messages`_
7. `Log Messages During Unit Tests`_

Overview
========

Tahoe uses the Foolscap logging mechanism (known as the "flog" subsystem) to
record information about what is happening inside the Tahoe node. This is
primarily for use by programmers and grid operators who want to find out what
went wrong.

The Foolscap logging system is documented at
`<https://github.com/warner/foolscap/blob/latest-release/doc/logging.rst>`__.

The Foolscap distribution includes a utility named "``flogtool``" that is
used to get access to many Foolscap logging features. ``flogtool`` should get
installed into the same virtualenv as the ``tahoe`` command.


Realtime Logging
================

When you are working on Tahoe code and want to see what the node is doing,
the easiest tool to use is "``flogtool tail``". This connects to the Tahoe
node and subscribes to hear about all log events. These events are then
displayed to stdout, and optionally saved to a file.

"``flogtool tail``" connects to the "logport", whose FURL is stored in
``BASEDIR/private/logport.furl``. The following command will connect to
this port and start emitting log information::

  flogtool tail BASEDIR/private/logport.furl

The ``--save-to FILENAME`` option will save all received events to a file,
where they can be examined later with "``flogtool dump``" or "``flogtool
web-viewer``". The ``--catch-up`` option will ask the node to dump all stored
events before subscribing to new ones (without ``--catch-up``, you will only
hear about events that occur after the tool has connected and subscribed).

Incidents
=========

Foolscap keeps a short list of recent events in memory. When something goes
wrong, it writes all the history it has (and everything that gets logged in
the next few seconds) into a file called an "incident". These files go into
``BASEDIR/logs/incidents/``, in a file named
"``incident-TIMESTAMP-UNIQUE.flog.bz2``". The default definition of
"something goes wrong" is the generation of a log event at the ``log.WEIRD``
level or higher, but other criteria could be implemented.

The typical "incident report" we've seen in a large Tahoe grid is about 40kB
compressed, representing about 1800 recent events.

These "flogfiles" have a similar format to the files saved by "``flogtool
tail --save-to``". They are simply lists of log events, with a small header
to indicate which event triggered the incident.

The "``flogtool dump FLOGFILE``" command will take one of these ``.flog.bz2``
files and print its contents to stdout, one line per event. The raw event
dictionaries can be dumped with "``flogtool dump --verbose FLOGFILE``".

The "``flogtool web-viewer``" command can be used to examine the flogfile in
a web browser. It runs a small HTTP server and emits the URL on stdout. This
view provides more structure than the output of "``flogtool dump``": the
parent/child relationships of log events are displayed in a nested format.
"``flogtool web-viewer``" is still fairly immature.

Working with flogfiles
======================

The "``flogtool filter``" command can be used to take a large flogfile
(perhaps one created by the log-gatherer, see below) and copy a subset of
events into a second file. This smaller flogfile may be easier to work with
than the original. The arguments to "``flogtool filter``" specify filtering
criteria: a predicate that each event must match to be copied into the target
file. ``--before`` and ``--after`` are used to exclude events outside a given
window of time. ``--above`` will retain events above a certain severity
level. ``--from`` retains events sent by a specific tubid.
``--strip-facility`` removes events that were emitted with a given facility
(like ``foolscap.negotiation`` or ``tahoe.upload``).
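
These options amount to building a predicate over each event's fields and
keeping only the events that match. The following Python sketch illustrates
the idea; the event dictionaries and field names (``time``, ``level``,
``facility``) are simplified stand-ins, not foolscap's exact event schema.

```python
# Simplified sketch of the predicate-matching that "flogtool filter"
# performs. Real flogfiles store foolscap's own event schema; the
# dictionaries below are illustrative only.

def make_predicate(before=None, after=None, above=None, strip_facility=None):
    """Mimic the --before/--after/--above/--strip-facility options."""
    def predicate(event):
        t = event.get("time", 0)
        if before is not None and t >= before:
            return False          # outside the time window (too late)
        if after is not None and t <= after:
            return False          # outside the time window (too early)
        if above is not None and event.get("level", 0) < above:
            return False          # below the requested severity
        facility = str(event.get("facility", ""))
        if strip_facility is not None and facility.startswith(strip_facility):
            return False          # facility explicitly stripped
        return True
    return predicate

events = [
    {"time": 100, "level": 20, "facility": "tahoe.upload"},
    {"time": 200, "level": 30, "facility": "foolscap.negotiation"},
    {"time": 300, "level": 35, "facility": "tahoe.mutable.publish"},
]

# Retain severe events, dropping the foolscap.negotiation chatter:
keep = make_predicate(above=30, strip_facility="foolscap.negotiation")
filtered = [e for e in events if keep(e)]
```

Here ``filtered`` retains only the third event: the first falls below the
severity threshold and the second is stripped by facility.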

Gatherers
=========

In a deployed Tahoe grid, it is useful to get log information automatically
transferred to a central log-gatherer host. This offloads the (admittedly
modest) storage requirements to a different host and provides access to
logfiles from multiple nodes (web-API, storage, or helper) in a single place.

There are two kinds of gatherers: "log gatherer" and "stats gatherer". Each
produces a FURL which needs to be placed in the ``NODEDIR/tahoe.cfg`` file of
each node that is to publish to the gatherer, under the keys
``log_gatherer.furl`` and ``stats_gatherer.furl`` respectively. When the Tahoe
node starts, it will connect to the configured gatherers and offer its
logport: the gatherer will then use the logport to subscribe to hear about
events.

The gatherer will write to files in its working directory, which can then be
examined with tools like "``flogtool dump``" as described above.

Incident Gatherer
-----------------

The "incident gatherer" only collects Incidents: records of the log events
that occurred just before and slightly after some high-level "trigger event"
was recorded. Each incident is classified into a "category": a short string
that summarizes what sort of problem took place. These classification
functions are written after examining a new/unknown incident. The idea is to
recognize when the same problem is happening multiple times.

A collection of classification functions that are useful for Tahoe nodes is
provided in ``misc/incident-gatherer/support_classifiers.py``. There is
roughly one category for each ``log.WEIRD``-or-higher level event in the
Tahoe source code.
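
A classification function is essentially a pattern-match against an
incident's trigger event: it returns a category string when it recognizes
the problem, and ``None`` otherwise. The sketch below shows only the general
shape; the exact signature and event fields expected by the incident
gatherer are defined by Foolscap, so consult ``support_classifiers.py`` for
working examples.

```python
# Hedged sketch of a classification function. The trigger event is modeled
# as a plain dict, and the "format" field check is illustrative: real
# classifiers in support_classifiers.py typically match on the umid= values
# assigned in the Tahoe source code.

def classify_ucwe(trigger):
    """Recognize uncoordinated-write errors during mutable retrieve."""
    if "UncoordinatedWriteError" in str(trigger.get("format", "")):
        return "mutable-retrieve-uncoordinated-write-error"
    return None  # not recognized by this classifier

# A gatherer tries each registered classifier in turn; incidents nobody
# recognizes end up in classified/unknown.
def classify(trigger, classifiers):
    for f in classifiers:
        category = f(trigger)
        if category is not None:
            return category
    return "unknown"
```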

The incident gatherer is created with the "``flogtool
create-incident-gatherer WORKDIR``" command, and started with "``tahoe
run``". The generated "``gatherer.tac``" file should be modified to add
classifier functions.

The incident gatherer writes incident names (which are simply the relative
pathname of the ``incident-*.flog.bz2`` file) into ``classified/CATEGORY``.
For example, the ``classified/mutable-retrieve-uncoordinated-write-error``
file contains a list of all incidents which were triggered by an
uncoordinated write that was detected during mutable file retrieval (caused
when somebody changed the contents of the mutable file in between the node's
mapupdate step and the retrieve step). The ``classified/unknown`` file
contains a list of all incidents that did not match any of the classification
functions.

At startup, the incident gatherer will automatically reclassify any incident
report which is not mentioned in any of the ``classified/*`` files. So the
usual workflow is to examine the incidents in ``classified/unknown``, add a
new classification function, delete ``classified/unknown``, then restart the
gatherer with "``tahoe restart WORKDIR``". The incidents which can be
classified with the new functions will be added to their own
``classified/FOO`` lists, and the remaining ones will be put in
``classified/unknown``, where the process can be repeated until all events
are classifiable.

The incident gatherer is still fairly immature: future versions will have a
web interface and an RSS feed, so operations personnel can track problems in
the storage grid.

In our experience, each incident takes about two seconds to transfer from the
node that generated it to the gatherer. The gatherer will automatically catch
up on any incidents which occurred while it was offline.

Log Gatherer
------------

The "Log Gatherer" subscribes to hear about every single event published by
the connected nodes, regardless of severity. This server writes these log
events into a large flogfile that is rotated (closed, compressed, and
replaced with a new one) on a periodic basis. Each flogfile is named
according to the range of time it represents, with names like
"``from-2008-08-26-132256--to-2008-08-26-162256.flog.bz2``". The flogfiles
contain events from many different sources, making it easier to correlate
things that happened on multiple machines (such as comparing a client node
making a request with the storage servers that respond to that request).

Create the Log Gatherer with the "``flogtool create-gatherer WORKDIR``"
command, and start it with "``twistd -ny gatherer.tac``". Then copy the
contents of the ``log_gatherer.furl`` file it creates into the
``BASEDIR/tahoe.cfg`` file (under the key ``log_gatherer.furl`` in the
``[node]`` section) of all nodes that should be sending it log events.
(See :doc:`configuration`.)

The "``flogtool filter``" command, described above, is useful for cutting
the potentially large flogfiles down to a more focused form.

Busy nodes, particularly web-API nodes which are performing recursive
deep-size/deep-stats/deep-check operations, can produce a lot of log events.
To avoid overwhelming the node (and using an unbounded amount of memory for
the outbound TCP queue), publishing nodes will start dropping log events when
the outbound queue grows too large. When this occurs, there will be gaps
(non-sequential event numbers) in the log-gatherer's flogfiles.

Adding log messages
===================

When adding new code, the Tahoe developer should add a reasonable number of
new log events. For details, please see the Foolscap logging documentation,
but a few notes are worth stating here:

* use a facility prefix of "``tahoe.``", like "``tahoe.mutable.publish``"

* assign each severe (``log.WEIRD`` or higher) event a unique message
  identifier, as the ``umid=`` argument to the ``log.msg()`` call. The
  ``misc/coding_tools/make_umid`` script may be useful for this purpose.
  This will make it easier to write a classification function for these
  messages.

* use the ``parent=`` argument whenever the event is causally/temporally
  clustered with its parent. For example, a download process that involves
  three sequential hash fetches could announce the send and receipt of those
  hash-fetch messages with a ``parent=`` argument that ties them to the
  overall download process. However, each new web-API download request should
  be unparented.

* use the ``format=`` argument in preference to the ``message=`` argument.
  E.g. use ``log.msg(format="got %(n)d shares, need %(k)d", n=n, k=k)``
  instead of ``log.msg("got %d shares, need %d" % (n,k))``. This will allow
  later tools to analyze the event without needing to scrape/reconstruct the
  structured data out of the formatted string.

* pass extra information as extra keyword arguments, even if it isn't
  included in the ``format=`` string. This information will be displayed in
  the "``flogtool dump --verbose``" output, as well as being available to
  other tools. The ``umid=`` argument should be passed this way.

* use ``log.err`` for the catch-all ``addErrback`` that gets attached to the
  end of any given Deferred chain. When used in conjunction with
  ``FLOGTOTWISTED=1``, ``log.err()`` will tell Twisted about the error-nature
  of the log message, causing Trial to flunk the test (with an "ERROR"
  indication that prints a copy of the Failure, including a traceback).
  Don't use ``log.err`` for events that are ``BAD`` but handled (like hash
  failures: since these are often deliberately provoked by test code, they
  should not cause test failures): use ``log.msg(level=BAD)`` for those
  instead.
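
The value of the ``format=`` convention can be seen with a small stand-in
for ``log.msg`` (a sketch, not foolscap's actual implementation): the
keyword arguments survive as structured fields on the recorded event, and
the human-readable string is rendered only on demand.

```python
# Stand-in for foolscap's log.msg(), illustrating why format= plus keyword
# arguments is preferred over pre-formatted strings: the structured values
# (n, k, umid) remain available on the event for later tools.
events = []

def msg(format=None, **kwargs):
    event = dict(kwargs, format=format)
    events.append(event)
    return event

e = msg(format="got %(n)d shares, need %(k)d", n=7, k=3, umid="xyzzy")

# Tools can read the structured fields directly...
n, k = e["n"], e["k"]
# ...and the human-readable line is produced only when displayed:
rendered = e["format"] % e
```

With ``log.msg("got %d shares, need %d" % (n, k))``, by contrast, a later
tool would have to scrape ``n`` and ``k`` back out of the formatted string.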


Log Messages During Unit Tests
==============================

If a test is failing and you aren't sure why, start by enabling
``FLOGTOTWISTED=1`` like this::

  make test FLOGTOTWISTED=1

With ``FLOGTOTWISTED=1``, sufficiently-important log events will be written
into ``_trial_temp/test.log``, which may give you more ideas about why the
test is failing.

By default, ``_trial_temp/test.log`` will not receive messages below the
``level=OPERATIONAL`` threshold. You can change the threshold via the
``FLOGLEVEL`` variable, e.g.::

  make test FLOGLEVEL=10 FLOGTOTWISTED=1

(The level numbers are listed in ``src/allmydata/util/log.py``.)

To look at the detailed foolscap logging messages, run the tests like this::

  make test FLOGFILE=flog.out.bz2 FLOGLEVEL=1 FLOGTOTWISTED=1

The first environment variable will cause foolscap log events to be written
to ``./flog.out.bz2`` (instead of merely being recorded in the circular
buffers for the use of remote subscribers or incident reports). The second
will cause all log events to be written out, not just the higher-severity
ones. The third will cause twisted log events (like the markers that indicate
when each unit test is starting and stopping) to be copied into the flogfile,
making it easier to correlate log events with unit tests.

Enabling this form of logging appears to roughly double the runtime of the
unit tests. The ``flog.out.bz2`` file is approximately 2MB.

You can then use "``flogtool dump``" or "``flogtool web-viewer``" on the
resulting ``flog.out.bz2`` file.

("``flogtool tail``" and the log-gatherer are not useful during unit tests,
since there is no single Tub to which all the log messages are published.)

Setting these environment variables can cause spurious test failures in
tests with race-condition bugs. All known instances of this have been fixed
as of Tahoe-LAFS v1.7.1.