Ticket #1225: stats.txt

File stats.txt, 14.2 KB (added by p-static, at 2010-10-13T06:30:35Z)

docs/stats.txt, converted to rST

================
Tahoe Statistics
================

1. `Overview`_
2. `Statistics Categories`_
3. `Running a Tahoe Stats-Gatherer Service`_
4. `Using Munin To Graph Stats Values`_

Overview
========

Each Tahoe node collects and publishes statistics about its operations as it
runs. These include counters of how many files have been uploaded and
downloaded, CPU usage information, performance numbers like latency of
storage server operations, and available disk space.

The easiest way to see the stats for any given node is to use the web
interface. From the main "Welcome Page", follow the "Operational Statistics"
link inside the small "This Client" box. If the welcome page lives at
http://localhost:3456/, then the statistics page will live at
http://localhost:3456/statistics . This presents a summary of the stats
block, along with a copy of the raw counters. To obtain just the raw counters
(in JSON format), use /statistics?t=json instead.
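
The JSON form is convenient for scripting. The following sketch assumes a
node whose web port is at http://localhost:3456/ ; the sample dictionary
below is abbreviated and illustrative, not a real node's output.

::

    import json
    from urllib.request import urlopen

    def fetch_stats(base_url="http://localhost:3456"):
        # Fetch the raw stats dictionary from a running Tahoe node.
        with urlopen(base_url + "/statistics?t=json") as f:
            return json.load(f)

    # The response is a dict with 'counters' and 'stats' keys, e.g.:
    sample = json.loads('''
    {"counters": {"uploader.files_uploaded": 12,
                  "uploader.bytes_uploaded": 45678},
     "stats": {"node.uptime": 3600.5}}
    ''')
    print(sample["counters"]["uploader.files_uploaded"])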

Statistics Categories
=====================

The stats dictionary contains two keys: 'counters' and 'stats'. 'counters'
are strictly counters: they are reset to zero when the node is started, and
grow upwards. 'stats' are non-incrementing values, used to measure the
current state of various systems. Some stats are actually booleans, expressed
as '1' for true and '0' for false (internal restrictions require all stats
values to be numbers).

Under both the 'counters' and 'stats' dictionaries, each individual stat has
a key with a dot-separated name, breaking them up into groups like
'cpu_monitor' and 'storage_server'.
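
For example, the dot-separated names can be split back into those groups with
a few lines of Python (the counter names and values here are illustrative):

::

    from collections import defaultdict

    # Illustrative counter names/values, in the dotted form described above:
    counters = {
        "storage_server.write": 103,
        "storage_server.close": 7,
        "uploader.files_uploaded": 2,
    }

    groups = defaultdict(dict)
    for name, value in counters.items():
        group, _, stat = name.partition(".")
        groups[group][stat] = value

    print(sorted(groups))   # ['storage_server', 'uploader']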

The currently available stats (as of release 1.6.0 or so) are described here:

**counters.storage_server.\***

    this group counts inbound storage-server operations. They are not provided
    by client-only nodes which have been configured to not run a storage server
    (with [storage]enabled=false in tahoe.cfg)

    allocate, write, close, abort
        these are for immutable file uploads. 'allocate' is incremented when a
        client asks if it can upload a share to the server. 'write' is
        incremented for each chunk of data written. 'close' is incremented when
        the share is finished. 'abort' is incremented if the client abandons
        the upload.

    get, read
        these are for immutable file downloads. 'get' is incremented
        when a client asks if the server has a specific share. 'read' is
        incremented for each chunk of data read.

    readv, writev
        these are for mutable file creation, publish, and retrieve. 'readv'
        is incremented each time a client reads part of a mutable share.
        'writev' is incremented each time a client sends a modification
        request.

    add-lease, renew, cancel
        these are for share lease modifications. 'add-lease' is incremented
        when an 'add-lease' operation is performed (which either adds a new
        lease or renews an existing lease). 'renew' is for the 'renew-lease'
        operation (which can only be used to renew an existing one). 'cancel'
        is used for the 'cancel-lease' operation.

    bytes_freed
        this counts how many bytes were freed when a 'cancel-lease'
        operation removed the last lease from a share and the share
        was thus deleted.

    bytes_added
        this counts how many bytes were consumed by immutable share
        uploads. It is incremented at the same time as the 'close'
        counter.

**stats.storage_server.\***

    allocated
        this counts how many bytes are currently 'allocated', which
        tracks the space that will eventually be consumed by immutable
        share upload operations. The stat is increased as soon as the
        upload begins (at the same time the 'allocate' counter is
        incremented), and goes back to zero when the 'close' or 'abort'
        message is received (at which point the 'disk_used' stat should
        be incremented by the same amount).

    disk_total, disk_used, disk_free_for_root, disk_free_for_nonroot, disk_avail, reserved_space
        these all reflect disk-space usage policies and status.
        'disk_total' is the total size of the disk where the storage
        server's BASEDIR/storage/shares directory lives, as reported
        by /bin/df or equivalent. 'disk_used', 'disk_free_for_root',
        and 'disk_free_for_nonroot' show related information.
        'reserved_space' reports the reservation configured by the
        tahoe.cfg [storage]reserved_space value. 'disk_avail'
        reports the remaining disk space available for the Tahoe
        server after subtracting reserved_space from
        disk_free_for_nonroot. All values are in bytes.

    accepting_immutable_shares
        this is '1' if the storage server is currently accepting uploads of
        immutable shares. It may be '0' if a server is disabled by
        configuration, or if the disk is full (i.e. disk_avail is less than
        reserved_space).

    total_bucket_count
        this counts the number of 'buckets' (i.e. unique
        storage-index values) currently managed by the storage
        server. It indicates roughly how many files are managed
        by the server.

    latencies.*.*
        these stats keep track of local disk latencies for
        storage-server operations. A number of percentile values are
        tracked for many operations. For example,
        'storage_server.latencies.readv.50_0_percentile' records the
        median response time for a 'readv' request. All values are in
        seconds. These are recorded by the storage server, starting
        from the time the request arrives (post-deserialization) and
        ending when the response begins serialization. As such, they
        are mostly useful for measuring disk speeds. The operations
        tracked are the same as the counters.storage_server.* counter
        values (allocate, write, close, get, read, add-lease, renew,
        cancel, readv, writev). The percentile values tracked are:
        mean, 01_0_percentile, 10_0_percentile, 50_0_percentile,
        90_0_percentile, 95_0_percentile, 99_0_percentile,
        99_9_percentile. (the last value, the 99.9 percentile, means that
        999 out of the last 1000 operations were faster than the
        given number, and is the same threshold used by Amazon's
        internal SLA, according to the Dynamo paper).
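
A percentile summary like this can be approximated with a simple
nearest-rank computation. This is only a sketch, using fabricated latency
samples; the storage server's exact method may differ.

::

    def percentile(samples, p):
        # Nearest-rank sketch: the value at or below which roughly p
        # percent of the samples fall.
        s = sorted(samples)
        rank = int(round(p / 100.0 * len(s)))
        return s[max(0, min(rank, len(s)) - 1)]

    samples = list(range(1, 1001))      # 1000 fake latencies
    print(percentile(samples, 50.0))    # the '50_0_percentile' (median)
    print(percentile(samples, 99.9))    # 999 of 1000 samples fall at or below this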

**counters.uploader.files_uploaded**

**counters.uploader.bytes_uploaded**

**counters.downloader.files_downloaded**

**counters.downloader.bytes_downloaded**

    These count client activity: a Tahoe client will increment these when it
    uploads or downloads an immutable file. 'files_uploaded' is incremented by
    one for each operation, while 'bytes_uploaded' is incremented by the size of
    the file.
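
Taken together, these two counters give the mean size of uploaded files. A
small sketch, with illustrative counter values:

::

    def mean_upload_size(counters):
        # Mean immutable-upload size, derived from the two counters above.
        files = counters["uploader.files_uploaded"]
        if files == 0:
            return 0.0
        return counters["uploader.bytes_uploaded"] / float(files)

    print(mean_upload_size({"uploader.files_uploaded": 4,
                            "uploader.bytes_uploaded": 4096}))  # 1024.0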

**counters.mutable.files_published**

**counters.mutable.bytes_published**

**counters.mutable.files_retrieved**

**counters.mutable.bytes_retrieved**

    These count client activity for mutable files. 'published' is the act of
    changing an existing mutable file (or creating a brand-new mutable file).
    'retrieved' is the act of reading its current contents.

**counters.chk_upload_helper.\***

    These count activity of the "Helper", which receives ciphertext from clients
    and performs erasure-coding and share upload for files that are not already
    in the grid. The code which implements these counters is in
    src/allmydata/immutable/offloaded.py .

    upload_requests
        incremented each time a client asks to upload a file

    upload_already_present
        incremented when the file is already in the grid

    upload_need_upload
        incremented when the file is not already in the grid

    resumes
        incremented when the helper already has partial ciphertext for
        the requested upload, indicating that the client is resuming an
        earlier upload

    fetched_bytes
        this counts how many bytes of ciphertext have been fetched
        from uploading clients

    encoded_bytes
        this counts how many bytes of ciphertext have been
        encoded and turned into successfully-uploaded shares. If no
        uploads have failed or been abandoned, encoded_bytes should
        eventually equal fetched_bytes.
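
One useful derived quantity is the difference between these two byte
counters: ciphertext already fetched from clients but not yet encoded into
successfully-uploaded shares. A monitoring sketch, with illustrative values:

::

    def helper_backlog(counters):
        # Ciphertext fetched but not yet encoded; should trend toward
        # zero when no uploads fail or are abandoned.
        return (counters["chk_upload_helper.fetched_bytes"]
                - counters["chk_upload_helper.encoded_bytes"])

    print(helper_backlog({"chk_upload_helper.fetched_bytes": 700000,
                          "chk_upload_helper.encoded_bytes": 650000}))  # 50000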

**stats.chk_upload_helper.\***

    These also track Helper activity:

    active_uploads
        how many files are currently being uploaded. 0 when idle.

    incoming_count
        how many cache files are present in the incoming/ directory,
        which holds ciphertext files that are still being fetched
        from the client

    incoming_size
        total size of cache files in the incoming/ directory

    incoming_size_old
        total size of 'old' cache files (more than 48 hours)

    encoding_count
        how many cache files are present in the encoding/ directory,
        which holds ciphertext files that are being encoded and
        uploaded

    encoding_size
        total size of cache files in the encoding/ directory

    encoding_size_old
        total size of 'old' cache files (more than 48 hours)

**stats.node.uptime**
    how many seconds since the node process was started

**stats.cpu_monitor.\***

    1min_avg, 5min_avg, 15min_avg
        estimate of what percentage of system CPU time was consumed by the
        node process, over the given time interval. Expressed as a float, 0.0
        for 0%, 1.0 for 100%

    total
        estimate of the total number of CPU seconds consumed by the node
        since the process was started. Ticket #472 indicates that .total may
        sometimes be negative due to wraparound of the kernel's counter.

**stats.load_monitor.\***

    When enabled, the "load monitor" continually schedules a one-second
    callback, and measures how late the response is. This estimates system load
    (if the system is idle, the response should be on time). This is only
    enabled if a stats-gatherer is configured.

    avg_load
        average "load" value (seconds late) over the last minute

    max_load
        maximum "load" value over the last minute


Running a Tahoe Stats-Gatherer Service
======================================

The "stats-gatherer" is a simple daemon that periodically collects stats from
several tahoe nodes. It could be useful, e.g., in a production environment,
where you want to monitor dozens of storage servers from a central management
host. It merely gathers statistics from many nodes into a single place: it
does not do any actual analysis.

The stats gatherer listens on a network port using the same Foolscap_
connection library that Tahoe clients use to connect to storage servers.
Tahoe nodes can be configured to connect to the stats gatherer and publish
their stats on a periodic basis. (In fact, what happens is that nodes connect
to the gatherer and offer it a second FURL which points back to the node's
"stats port", which the gatherer then uses to pull stats on a periodic basis.
The initial connection is flipped to allow the nodes to live behind NAT
boxes, as long as the stats-gatherer has a reachable IP address.)

.. _Foolscap: http://foolscap.lothar.com/trac

The stats-gatherer is created in the same fashion as regular tahoe client
nodes and introducer nodes. Choose a base directory for the gatherer to live
in (but do not create the directory). Then run:

::

   tahoe create-stats-gatherer $BASEDIR

and start it with "tahoe start $BASEDIR". Once running, the gatherer will
write a FURL into $BASEDIR/stats_gatherer.furl .

To configure a Tahoe client/server node to contact the stats gatherer, copy
this FURL into the node's tahoe.cfg file, in a section named "[client]",
under a key named "stats_gatherer.furl", like so:

::

    [client]
    stats_gatherer.furl = pb://qbo4ktl667zmtiuou6lwbjryli2brv6t@192.168.0.8:49997/wxycb4kaexzskubjnauxeoptympyf45y

or simply copy the stats_gatherer.furl file into the node's base directory
(next to the tahoe.cfg file): it will be interpreted in the same way.

The first time it is started, the gatherer will listen on a random unused TCP
port, so it should not conflict with anything else that you have running on
that host at that time. On subsequent runs, it will re-use the same port (to
keep its FURL consistent). To explicitly control which port it uses, write
the desired port number into a file named "portnum" (i.e. $BASEDIR/portnum),
and the next time the gatherer is started, it will start listening on the
given port. The portnum file is actually a "strports specification string",
as described in docs/configuration.txt .

Once running, the stats gatherer will create a standard python "pickle" file
in $BASEDIR/stats.pickle . Once a minute, the gatherer will pull stats
information from every connected node and write them into the pickle. The
pickle will contain a dictionary, in which node identifiers (known as "tubid"
strings) are the keys, and the values are a dict with 'timestamp',
'nickname', and 'stats' keys. d[tubid]['stats'] will contain the stats
dictionary as made available at http://localhost:3456/statistics?t=json . The
pickle file will only contain the most recent update from each node.

Other tools can be built to examine these stats and render them into
something useful. For example, a tool could sum the
'storage_server.disk_avail' values from all servers to compute a
total-disk-available number for the entire grid (however, the "disk watcher"
daemon, in misc/operations_helpers/spacetime/, is better suited for this
specific task).
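
Such a tool might look like the following sketch. It builds a stand-in
pickle with the layout described above (in place of a real
$BASEDIR/stats.pickle); the two node records are fabricated for illustration.

::

    import pickle

    # Stand-in for $BASEDIR/stats.pickle, with the described layout:
    demo = {
        "tubid-aaa": {"timestamp": 1286950000, "nickname": "server-1",
                      "stats": {"counters": {},
                                "stats": {"storage_server.disk_avail": 10**9}}},
        "tubid-bbb": {"timestamp": 1286950030, "nickname": "server-2",
                      "stats": {"counters": {},
                                "stats": {"storage_server.disk_avail": 2 * 10**9}}},
    }
    with open("demo-stats.pickle", "wb") as f:
        pickle.dump(demo, f)

    # A tool can then total disk_avail across all reporting servers:
    with open("demo-stats.pickle", "rb") as f:
        data = pickle.load(f)
    grid_avail = sum(node["stats"]["stats"]["storage_server.disk_avail"]
                     for node in data.values())
    print(grid_avail)  # total-disk-available for the (fake) grid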

Using Munin To Graph Stats Values
=================================

The misc/munin/ directory contains various plugins to graph stats for Tahoe
nodes. They are intended for use with the Munin_ system-management tool, which
typically polls target systems every 5 minutes and produces a web page with
graphs of various things over multiple time scales (last hour, last month,
last year).

.. _Munin: http://munin-monitoring.org/

Most of the plugins are designed to pull stats from a single Tahoe node, and
are configured with a URL like http://localhost:3456/statistics?t=json . The
"tahoe_stats" plugin is designed to read from the pickle file created by the
stats-gatherer. Some plugins are to be used with the disk watcher, and a few
(like tahoe_nodememory) are designed to watch the node processes directly
(and must therefore run on the same host as the target node).
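
A Munin plugin is simply a program that prints "<field>.value <number>"
lines when polled, and graph metadata when invoked with a "config" argument.
A minimal sketch follows; the graph title, field name, and node URL here are
hypothetical, not those of the real plugins in misc/munin/.

::

    import json
    import sys
    from urllib.request import urlopen

    NODE_URL = "http://localhost:3456/statistics?t=json"   # assumed node URL

    def config_lines():
        # Graph metadata, printed when Munin invokes the plugin with "config".
        return ["graph_title Tahoe files uploaded",
                "uploaded.label files uploaded"]

    def value_line(stats):
        # Normal poll output: "<field>.value <number>".
        return "uploaded.value %d" % stats["counters"]["uploader.files_uploaded"]

    def main(argv):
        if argv[1:] == ["config"]:
            print("\n".join(config_lines()))
        else:
            print(value_line(json.load(urlopen(NODE_URL))))

    # When installed as a plugin, Munin runs this file directly:
    # main(sys.argv)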

Please see the docstrings at the beginning of each plugin for details, and
the "tahoe-conf" file for notes about configuration and installing these
plugins into a Munin environment.