.. -*- coding: utf-8-with-signature -*-

=============
Tahoe Logging
=============

1. `Overview`_
2. `Realtime Logging`_
3. `Incidents`_
4. `Working with flogfiles`_
5. `Gatherers`_

   1. `Incident Gatherer`_
   2. `Log Gatherer`_

6. `Adding log messages`_
7. `Log Messages During Unit Tests`_

Overview
========

Tahoe uses the Foolscap logging mechanism (known as the "flog" subsystem) to
record information about what is happening inside the Tahoe node. This is
primarily for use by programmers and grid operators who want to find out what
went wrong.

The Foolscap logging system is documented at
`<https://github.com/warner/foolscap/blob/latest-release/doc/logging.rst>`__.

The Foolscap distribution includes a utility named "``flogtool``" that is
used to get access to many Foolscap logging features. ``flogtool`` should get
installed into the same virtualenv as the ``tahoe`` command.


Realtime Logging
================

When you are working on Tahoe code and want to see what the node is doing,
the easiest tool to use is "``flogtool tail``". This connects to the Tahoe
node and subscribes to hear about all log events. These events are then
displayed to stdout, and optionally saved to a file.

"``flogtool tail``" connects to the "logport", whose FURL is stored in
``BASEDIR/private/logport.furl``. The following command will connect to
this port and start emitting log information::

  flogtool tail BASEDIR/private/logport.furl

The ``--save-to FILENAME`` option will save all received events to a file,
where they can be examined later with "``flogtool dump``" or "``flogtool
web-viewer``". The ``--catch-up`` option will ask the node to dump all stored
events before subscribing to new ones (without ``--catch-up``, you will only
hear about events that occur after the tool has connected and subscribed).

Incidents
=========

Foolscap keeps a short list of recent events in memory. When something goes
wrong, it writes all the history it has (and everything that gets logged in
the next few seconds) into a file called an "incident". These files go into
``BASEDIR/logs/incidents/``, in a file named
"``incident-TIMESTAMP-UNIQUE.flog.bz2``". The default definition of
"something goes wrong" is the generation of a log event at the ``log.WEIRD``
level or higher, but other criteria could be implemented.

The typical "incident report" we've seen in a large Tahoe grid is about 40kB
compressed, representing about 1800 recent events.

These "flogfiles" have a similar format to the files saved by "``flogtool
tail --save-to``". They are simply lists of log events, with a small header
to indicate which event triggered the incident.

The "``flogtool dump FLOGFILE``" command will take one of these ``.flog.bz2``
files and print its contents to stdout, one line per event. The raw event
dictionaries can be dumped with "``flogtool dump --verbose FLOGFILE``".

The "``flogtool web-viewer``" command can be used to examine the flogfile in
a web browser. It runs a small HTTP server and emits the URL on stdout. This
view provides more structure than the output of "``flogtool dump``": the
parent/child relationships of log events are displayed in a nested format.
"``flogtool web-viewer``" is still fairly immature.

Working with flogfiles
======================

The "``flogtool filter``" command can be used to take a large flogfile
(perhaps one created by the log-gatherer, see below) and copy a subset of
events into a second file. This smaller flogfile may be easier to work with
than the original. The arguments to "``flogtool filter``" specify filtering
criteria: a predicate that each event must match to be copied into the target
file. ``--before`` and ``--after`` are used to exclude events outside a given
window of time. ``--above`` will retain events above a certain severity
level. ``--from`` retains events sent by a specific tubid.
``--strip-facility`` removes events that were emitted with a given facility
(like ``foolscap.negotiation`` or ``tahoe.upload``).
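
These options amount to building a predicate over each event's fields and
keeping only the events that match. The following Python sketch illustrates
the idea; the event dictionaries and field names (``time``, ``level``,
``facility``) are simplified stand-ins, not foolscap's exact event schema.

```python
# Simplified sketch of the predicate-matching that "flogtool filter"
# performs. Real flogfiles store foolscap's own event schema; the
# dictionaries below are illustrative only.

def make_predicate(before=None, after=None, above=None, strip_facility=None):
    """Mimic the --before/--after/--above/--strip-facility options."""
    def predicate(event):
        t = event.get("time", 0)
        if before is not None and t >= before:
            return False          # outside the time window (too late)
        if after is not None and t <= after:
            return False          # outside the time window (too early)
        if above is not None and event.get("level", 0) < above:
            return False          # below the requested severity
        facility = str(event.get("facility", ""))
        if strip_facility is not None and facility.startswith(strip_facility):
            return False          # facility explicitly stripped
        return True
    return predicate

events = [
    {"time": 100, "level": 20, "facility": "tahoe.upload"},
    {"time": 200, "level": 30, "facility": "foolscap.negotiation"},
    {"time": 300, "level": 35, "facility": "tahoe.mutable.publish"},
]

# Retain severe events, dropping the foolscap.negotiation chatter:
keep = make_predicate(above=30, strip_facility="foolscap.negotiation")
filtered = [e for e in events if keep(e)]
```

Here ``filtered`` retains only the third event: the first falls below the
severity threshold and the second is stripped by facility.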

Gatherers
=========

In a deployed Tahoe grid, it is useful to get log information automatically
transferred to a central log-gatherer host. This offloads the (admittedly
modest) storage requirements to a different host and provides access to
logfiles from multiple nodes (web-API, storage, or helper) in a single place.

There are two kinds of gatherers: "log gatherer" and "stats gatherer". Each
produces a FURL which needs to be placed in the ``NODEDIR/tahoe.cfg`` file of
each node that is to publish to the gatherer, under the keys
``log_gatherer.furl`` and ``stats_gatherer.furl`` respectively. When the Tahoe
node starts, it will connect to the configured gatherers and offer its
logport: the gatherer will then use the logport to subscribe to hear about
events.

The gatherer will write to files in its working directory, which can then be
examined with tools like "``flogtool dump``" as described above.

Incident Gatherer
-----------------

The "incident gatherer" only collects Incidents: records of the log events
that occurred just before and slightly after some high-level "trigger event"
was recorded. Each incident is classified into a "category": a short string
that summarizes what sort of problem took place. These classification
functions are written after examining a new/unknown incident. The idea is to
recognize when the same problem is happening multiple times.

A collection of classification functions that are useful for Tahoe nodes is
provided in ``misc/incident-gatherer/support_classifiers.py``. There is
roughly one category for each ``log.WEIRD``-or-higher level event in the
Tahoe source code.
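
A classification function is essentially a pattern-match against an
incident's trigger event: it returns a category string when it recognizes
the problem, and ``None`` otherwise. The sketch below shows only the general
shape; the exact signature and event fields expected by the incident
gatherer are defined by Foolscap, so consult ``support_classifiers.py`` for
working examples.

```python
# Hedged sketch of a classification function. The trigger event is modeled
# as a plain dict, and the "format" field check is illustrative: real
# classifiers in support_classifiers.py typically match on the umid= values
# assigned in the Tahoe source code.

def classify_ucwe(trigger):
    """Recognize uncoordinated-write errors during mutable retrieve."""
    if "UncoordinatedWriteError" in str(trigger.get("format", "")):
        return "mutable-retrieve-uncoordinated-write-error"
    return None  # not recognized by this classifier

# A gatherer tries each registered classifier in turn; incidents nobody
# recognizes end up in classified/unknown.
def classify(trigger, classifiers):
    for f in classifiers:
        category = f(trigger)
        if category is not None:
            return category
    return "unknown"
```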

The incident gatherer is created with the "``flogtool
create-incident-gatherer WORKDIR``" command, and started with "``tahoe
run``". The generated "``gatherer.tac``" file should be modified to add
classifier functions.

The incident gatherer writes incident names (which are simply the relative
pathname of the ``incident-*.flog.bz2`` file) into ``classified/CATEGORY``.
For example, the ``classified/mutable-retrieve-uncoordinated-write-error``
file contains a list of all incidents which were triggered by an
uncoordinated write that was detected during mutable file retrieval (caused
when somebody changed the contents of the mutable file in between the node's
mapupdate step and the retrieve step). The ``classified/unknown`` file
contains a list of all incidents that did not match any of the classification
functions.

At startup, the incident gatherer will automatically reclassify any incident
report which is not mentioned in any of the ``classified/*`` files. So the
usual workflow is to examine the incidents in ``classified/unknown``, add a
new classification function, delete ``classified/unknown``, then restart the
gatherer with "``tahoe restart WORKDIR``". The incidents which can be
classified with the new functions will be added to their own
``classified/FOO`` lists, and the remaining ones will be put in
``classified/unknown``, where the process can be repeated until all events
are classifiable.

The incident gatherer is still fairly immature: future versions will have a
web interface and an RSS feed, so operations personnel can track problems in
the storage grid.

In our experience, each incident takes about two seconds to transfer from the
node that generated it to the gatherer. The gatherer will automatically catch
up on any incidents which occurred while it was offline.

Log Gatherer
------------

The "Log Gatherer" subscribes to hear about every single event published by
the connected nodes, regardless of severity. This server writes these log
events into a large flogfile that is rotated (closed, compressed, and
replaced with a new one) on a periodic basis. Each flogfile is named
according to the range of time it represents, with names like
"``from-2008-08-26-132256--to-2008-08-26-162256.flog.bz2``". The flogfiles
contain events from many different sources, making it easier to correlate
things that happened on multiple machines (such as comparing a client node
making a request with the storage servers that respond to that request).

Create the Log Gatherer with the "``flogtool create-gatherer WORKDIR``"
command, and start it with "``twistd -ny gatherer.tac``". Then copy the
contents of the ``log_gatherer.furl`` file it creates into the
``BASEDIR/tahoe.cfg`` file (under the key ``log_gatherer.furl`` in the
``[node]`` section) of all nodes that should be sending it log events.
(See :doc:`configuration`.)

The "``flogtool filter``" command, described above, is useful for cutting
the potentially large flogfiles down to a more focused form.

Busy nodes, particularly web-API nodes which are performing recursive
deep-size/deep-stats/deep-check operations, can produce a lot of log events.
To avoid overwhelming the node (and using an unbounded amount of memory for
the outbound TCP queue), publishing nodes will start dropping log events when
the outbound queue grows too large. When this occurs, there will be gaps
(non-sequential event numbers) in the log-gatherer's flogfiles.

Adding log messages
===================

When adding new code, the Tahoe developer should add a reasonable number of
new log events. For details, please see the Foolscap logging documentation,
but a few notes are worth stating here:

* use a facility prefix of "``tahoe.``", like "``tahoe.mutable.publish``"

* assign each severe (``log.WEIRD`` or higher) event a unique message
  identifier, as the ``umid=`` argument to the ``log.msg()`` call. The
  ``misc/coding_tools/make_umid`` script may be useful for this purpose.
  This will make it easier to write a classification function for these
  messages.

* use the ``parent=`` argument whenever the event is causally/temporally
  clustered with its parent. For example, a download process that involves
  three sequential hash fetches could announce the send and receipt of those
  hash-fetch messages with a ``parent=`` argument that ties them to the
  overall download process. However, each new web-API download request should
  be unparented.

* use the ``format=`` argument in preference to the ``message=`` argument.
  E.g. use ``log.msg(format="got %(n)d shares, need %(k)d", n=n, k=k)``
  instead of ``log.msg("got %d shares, need %d" % (n,k))``. This will allow
  later tools to analyze the event without needing to scrape/reconstruct the
  structured data out of the formatted string.

* pass extra information as extra keyword arguments, even if it isn't
  included in the ``format=`` string. This information will be displayed in
  the "``flogtool dump --verbose``" output, as well as being available to
  other tools. The ``umid=`` argument should be passed this way.

* use ``log.err`` for the catch-all ``addErrback`` that gets attached to the
  end of any given Deferred chain. When used in conjunction with
  ``FLOGTOTWISTED=1``, ``log.err()`` will tell Twisted about the error-nature
  of the log message, causing Trial to flunk the test (with an "ERROR"
  indication that prints a copy of the Failure, including a traceback).
  Don't use ``log.err`` for events that are ``BAD`` but handled (like hash
  failures: since these are often deliberately provoked by test code, they
  should not cause test failures): use ``log.msg(level=BAD)`` for those
  instead.
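
The value of the ``format=`` convention can be seen with a small stand-in
for ``log.msg`` (a sketch, not foolscap's actual implementation): the
keyword arguments survive as structured fields on the recorded event, and
the human-readable string is rendered only on demand.

```python
# Stand-in for foolscap's log.msg(), illustrating why format= plus keyword
# arguments is preferred over pre-formatted strings: the structured values
# (n, k, umid) remain available on the event for later tools.
events = []

def msg(format=None, **kwargs):
    event = dict(kwargs, format=format)
    events.append(event)
    return event

e = msg(format="got %(n)d shares, need %(k)d", n=7, k=3, umid="xyzzy")

# Tools can read the structured fields directly...
n, k = e["n"], e["k"]
# ...and the human-readable line is produced only when displayed:
rendered = e["format"] % e
```

With ``log.msg("got %d shares, need %d" % (n, k))``, by contrast, a later
tool would have to scrape ``n`` and ``k`` back out of the formatted string.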


Log Messages During Unit Tests
==============================

If a test is failing and you aren't sure why, start by enabling
``FLOGTOTWISTED=1`` like this::

  make test FLOGTOTWISTED=1

With ``FLOGTOTWISTED=1``, sufficiently-important log events will be written
into ``_trial_temp/test.log``, which may give you more ideas about why the
test is failing.

By default, ``_trial_temp/test.log`` will not receive messages below the
``level=OPERATIONAL`` threshold. You can change the threshold via the
``FLOGLEVEL`` variable, e.g.::

  make test FLOGLEVEL=10 FLOGTOTWISTED=1

(The level numbers are listed in ``src/allmydata/util/log.py``.)

To look at the detailed foolscap logging messages, run the tests like this::

  make test FLOGFILE=flog.out.bz2 FLOGLEVEL=1 FLOGTOTWISTED=1

The first environment variable will cause foolscap log events to be written
to ``./flog.out.bz2`` (instead of merely being recorded in the circular
buffers for the use of remote subscribers or incident reports). The second
will cause all log events to be written out, not just the higher-severity
ones. The third will cause twisted log events (like the markers that indicate
when each unit test is starting and stopping) to be copied into the flogfile,
making it easier to correlate log events with unit tests.

Enabling this form of logging appears to roughly double the runtime of the
unit tests. The ``flog.out.bz2`` file is approximately 2MB.

You can then use "``flogtool dump``" or "``flogtool web-viewer``" on the
resulting ``flog.out.bz2`` file.

("``flogtool tail``" and the log-gatherer are not useful during unit tests,
since there is no single Tub to which all the log messages are published.)

Setting these environment variables can cause spurious test failures in
tests with race-condition bugs. All known instances of this have been fixed
as of Tahoe-LAFS v1.7.1.