Ticket #1225: stats.txt

File stats.txt, 14.2 KB (added by p-static, at 2010-10-13T06:30:35Z)

docs/stats.txt, converted to rST

================
Tahoe Statistics
================

1. `Overview`_
2. `Statistics Categories`_
3. `Running a Tahoe Stats-Gatherer Service`_
4. `Using Munin To Graph Stats Values`_

Overview
========

Each Tahoe node collects and publishes statistics about its operations as it
runs. These include counters of how many files have been uploaded and
downloaded, CPU usage information, performance numbers like latency of
storage server operations, and available disk space.

The easiest way to see the stats for any given node is to use the web
interface. From the main "Welcome Page", follow the "Operational Statistics"
link inside the small "This Client" box. If the welcome page lives at
http://localhost:3456/, then the statistics page will live at
http://localhost:3456/statistics . This presents a summary of the stats
block, along with a copy of the raw counters. To obtain just the raw counters
(in JSON format), use /statistics?t=json instead.
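
The JSON form is convenient for scripting. The following sketch assumes a
node whose web port is at http://localhost:3456/ ; the sample dictionary
below is abbreviated and illustrative, not a real node's output.

::

    import json
    from urllib.request import urlopen

    def fetch_stats(base_url="http://localhost:3456"):
        # Fetch the raw stats dictionary from a running Tahoe node.
        with urlopen(base_url + "/statistics?t=json") as f:
            return json.load(f)

    # The response is a dict with 'counters' and 'stats' keys, e.g.:
    sample = json.loads('''
    {"counters": {"uploader.files_uploaded": 12,
                  "uploader.bytes_uploaded": 45678},
     "stats": {"node.uptime": 3600.5}}
    ''')
    print(sample["counters"]["uploader.files_uploaded"])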

Statistics Categories
=====================

The stats dictionary contains two keys: 'counters' and 'stats'. 'counters'
are strictly counters: they are reset to zero when the node is started, and
grow upwards. 'stats' are non-incrementing values, used to measure the
current state of various systems. Some stats are actually booleans, expressed
as '1' for true and '0' for false (internal restrictions require all stats
values to be numbers).

Under both the 'counters' and 'stats' dictionaries, each individual stat has
a key with a dot-separated name, breaking them up into groups like
'cpu_monitor' and 'storage_server'.
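
For example, the dot-separated names can be split back into those groups with
a few lines of Python (the counter names and values here are illustrative):

::

    from collections import defaultdict

    # Illustrative counter names/values, in the dotted form described above:
    counters = {
        "storage_server.write": 103,
        "storage_server.close": 7,
        "uploader.files_uploaded": 2,
    }

    groups = defaultdict(dict)
    for name, value in counters.items():
        group, _, stat = name.partition(".")
        groups[group][stat] = value

    print(sorted(groups))   # ['storage_server', 'uploader']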

The currently available stats (as of release 1.6.0 or so) are described here:

**counters.storage_server.\***

    this group counts inbound storage-server operations. They are not provided
    by client-only nodes which have been configured to not run a storage server
    (with [storage]enabled=false in tahoe.cfg)

    allocate, write, close, abort
        these are for immutable file uploads. 'allocate' is incremented when a
        client asks if it can upload a share to the server. 'write' is
        incremented for each chunk of data written. 'close' is incremented when
        the share is finished. 'abort' is incremented if the client abandons
        the upload.

    get, read
        these are for immutable file downloads. 'get' is incremented
        when a client asks if the server has a specific share. 'read' is
        incremented for each chunk of data read.

    readv, writev
        these are for mutable file creation, publish, and retrieve. 'readv'
        is incremented each time a client reads part of a mutable share.
        'writev' is incremented each time a client sends a modification
        request.

    add-lease, renew, cancel
        these are for share lease modifications. 'add-lease' is incremented
        when an 'add-lease' operation is performed (which either adds a new
        lease or renews an existing lease). 'renew' is for the 'renew-lease'
        operation (which can only be used to renew an existing one). 'cancel'
        is used for the 'cancel-lease' operation.

    bytes_freed
        this counts how many bytes were freed when a 'cancel-lease'
        operation removed the last lease from a share and the share
        was thus deleted.

    bytes_added
        this counts how many bytes were consumed by immutable share
        uploads. It is incremented at the same time as the 'close'
        counter.

**stats.storage_server.\***

    allocated
        this counts how many bytes are currently 'allocated', which
        tracks the space that will eventually be consumed by immutable
        share upload operations. The stat is increased as soon as the
        upload begins (at the same time the 'allocate' counter is
        incremented), and goes back to zero when the 'close' or 'abort'
        message is received (at which point the 'disk_used' stat should
        be incremented by the same amount).

    disk_total, disk_used, disk_free_for_root, disk_free_for_nonroot, disk_avail, reserved_space
        these all reflect disk-space usage policies and status.
        'disk_total' is the total size of the disk where the storage
        server's BASEDIR/storage/shares directory lives, as reported
        by /bin/df or equivalent. 'disk_used', 'disk_free_for_root',
        and 'disk_free_for_nonroot' show related information.
        'reserved_space' reports the reservation configured by the
        tahoe.cfg [storage]reserved_space value. 'disk_avail'
        reports the remaining disk space available for the Tahoe
        server after subtracting reserved_space from
        disk_free_for_nonroot. All values are in bytes.

    accepting_immutable_shares
        this is '1' if the storage server is currently accepting uploads of
        immutable shares. It may be '0' if a server is disabled by
        configuration, or if the disk is full (i.e. disk_avail is less than
        reserved_space).

    total_bucket_count
        this counts the number of 'buckets' (i.e. unique
        storage-index values) currently managed by the storage
        server. It indicates roughly how many files are managed
        by the server.

    latencies.*.*
        these stats keep track of local disk latencies for
        storage-server operations. A number of percentile values are
        tracked for many operations. For example,
        'storage_server.latencies.readv.50_0_percentile' records the
        median response time for a 'readv' request. All values are in
        seconds. These are recorded by the storage server, starting
        from the time the request arrives (post-deserialization) and
        ending when the response begins serialization. As such, they
        are mostly useful for measuring disk speeds. The operations
        tracked are the same as the counters.storage_server.* counter
        values (allocate, write, close, get, read, add-lease, renew,
        cancel, readv, writev). The percentile values tracked are:
        mean, 01_0_percentile, 10_0_percentile, 50_0_percentile,
        90_0_percentile, 95_0_percentile, 99_0_percentile,
        99_9_percentile. (the last value, the 99.9 percentile, means that
        999 out of the last 1000 operations were faster than the
        given number, and is the same threshold used by Amazon's
        internal SLA, according to the Dynamo paper).
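
A percentile summary like this can be approximated with a simple
nearest-rank computation. This is only a sketch, using fabricated latency
samples; the storage server's exact method may differ.

::

    def percentile(samples, p):
        # Nearest-rank sketch: the value at or below which roughly p
        # percent of the samples fall.
        s = sorted(samples)
        rank = int(round(p / 100.0 * len(s)))
        return s[max(0, min(rank, len(s)) - 1)]

    samples = list(range(1, 1001))      # 1000 fake latencies
    print(percentile(samples, 50.0))    # the '50_0_percentile' (median)
    print(percentile(samples, 99.9))    # 999 of 1000 samples fall at or below this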

**counters.uploader.files_uploaded**

**counters.uploader.bytes_uploaded**

**counters.downloader.files_downloaded**

**counters.downloader.bytes_downloaded**

    These count client activity: a Tahoe client will increment these when it
    uploads or downloads an immutable file. 'files_uploaded' is incremented by
    one for each operation, while 'bytes_uploaded' is incremented by the size of
    the file.
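
Taken together, these two counters give the mean size of uploaded files. A
small sketch, with illustrative counter values:

::

    def mean_upload_size(counters):
        # Mean immutable-upload size, derived from the two counters above.
        files = counters["uploader.files_uploaded"]
        if files == 0:
            return 0.0
        return counters["uploader.bytes_uploaded"] / float(files)

    print(mean_upload_size({"uploader.files_uploaded": 4,
                            "uploader.bytes_uploaded": 4096}))  # 1024.0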

**counters.mutable.files_published**

**counters.mutable.bytes_published**

**counters.mutable.files_retrieved**

**counters.mutable.bytes_retrieved**

    These count client activity for mutable files. 'published' is the act of
    changing an existing mutable file (or creating a brand-new mutable file).
    'retrieved' is the act of reading its current contents.

**counters.chk_upload_helper.\***

    These count activity of the "Helper", which receives ciphertext from clients
    and performs erasure-coding and share upload for files that are not already
    in the grid. The code which implements these counters is in
    src/allmydata/immutable/offloaded.py .

    upload_requests
        incremented each time a client asks to upload a file

    upload_already_present
        incremented when the file is already in the grid

    upload_need_upload
        incremented when the file is not already in the grid

    resumes
        incremented when the helper already has partial ciphertext for
        the requested upload, indicating that the client is resuming an
        earlier upload

    fetched_bytes
        this counts how many bytes of ciphertext have been fetched
        from uploading clients

    encoded_bytes
        this counts how many bytes of ciphertext have been
        encoded and turned into successfully-uploaded shares. If no
        uploads have failed or been abandoned, encoded_bytes should
        eventually equal fetched_bytes.
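
One useful derived quantity is the difference between these two byte
counters: ciphertext already fetched from clients but not yet encoded into
successfully-uploaded shares. A monitoring sketch, with illustrative values:

::

    def helper_backlog(counters):
        # Ciphertext fetched but not yet encoded; should trend toward
        # zero when no uploads fail or are abandoned.
        return (counters["chk_upload_helper.fetched_bytes"]
                - counters["chk_upload_helper.encoded_bytes"])

    print(helper_backlog({"chk_upload_helper.fetched_bytes": 700000,
                          "chk_upload_helper.encoded_bytes": 650000}))  # 50000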

**stats.chk_upload_helper.\***

    These also track Helper activity:

    active_uploads
        how many files are currently being uploaded. 0 when idle.

    incoming_count
        how many cache files are present in the incoming/ directory,
        which holds ciphertext files that are still being fetched
        from the client

    incoming_size
        total size of cache files in the incoming/ directory

    incoming_size_old
        total size of 'old' cache files (more than 48 hours)

    encoding_count
        how many cache files are present in the encoding/ directory,
        which holds ciphertext files that are being encoded and
        uploaded

    encoding_size
        total size of cache files in the encoding/ directory

    encoding_size_old
        total size of 'old' cache files (more than 48 hours)

**stats.node.uptime**
    how many seconds since the node process was started

**stats.cpu_monitor.\***

    1min_avg, 5min_avg, 15min_avg
        estimate of what percentage of system CPU time was consumed by the
        node process, over the given time interval. Expressed as a float, 0.0
        for 0%, 1.0 for 100%

    total
        estimate of the total number of CPU seconds consumed by the node
        since the process was started. Ticket #472 indicates that .total may
        sometimes be negative due to wraparound of the kernel's counter.

**stats.load_monitor.\***

    When enabled, the "load monitor" continually schedules a one-second
    callback, and measures how late the response is. This estimates system load
    (if the system is idle, the response should be on time). This is only
    enabled if a stats-gatherer is configured.

    avg_load
        average "load" value (seconds late) over the last minute

    max_load
        maximum "load" value over the last minute


Running a Tahoe Stats-Gatherer Service
======================================

The "stats-gatherer" is a simple daemon that periodically collects stats from
several tahoe nodes. It could be useful, e.g., in a production environment,
where you want to monitor dozens of storage servers from a central management
host. It merely gathers statistics from many nodes into a single place: it
does not do any actual analysis.

The stats gatherer listens on a network port using the same Foolscap_
connection library that Tahoe clients use to connect to storage servers.
Tahoe nodes can be configured to connect to the stats gatherer and publish
their stats on a periodic basis. (In fact, what happens is that nodes connect
to the gatherer and offer it a second FURL which points back to the node's
"stats port", which the gatherer then uses to pull stats on a periodic basis.
The initial connection is flipped to allow the nodes to live behind NAT
boxes, as long as the stats-gatherer has a reachable IP address.)

.. _Foolscap: http://foolscap.lothar.com/trac

The stats-gatherer is created in the same fashion as regular tahoe client
nodes and introducer nodes. Choose a base directory for the gatherer to live
in (but do not create the directory). Then run:

::

   tahoe create-stats-gatherer $BASEDIR

and start it with "tahoe start $BASEDIR". Once running, the gatherer will
write a FURL into $BASEDIR/stats_gatherer.furl .

To configure a Tahoe client/server node to contact the stats gatherer, copy
this FURL into the node's tahoe.cfg file, in a section named "[client]",
under a key named "stats_gatherer.furl", like so:

::

    [client]
    stats_gatherer.furl = pb://qbo4ktl667zmtiuou6lwbjryli2brv6t@192.168.0.8:49997/wxycb4kaexzskubjnauxeoptympyf45y

or simply copy the stats_gatherer.furl file into the node's base directory
(next to the tahoe.cfg file): it will be interpreted in the same way.

The first time it is started, the gatherer will listen on a random unused TCP
port, so it should not conflict with anything else that you have running on
that host at that time. On subsequent runs, it will re-use the same port (to
keep its FURL consistent). To explicitly control which port it uses, write
the desired port number into a file named "portnum" (i.e. $BASEDIR/portnum),
and the next time the gatherer is started, it will start listening on the
given port. The portnum file is actually a "strports specification string",
as described in docs/configuration.txt .

Once running, the stats gatherer will create a standard python "pickle" file
in $BASEDIR/stats.pickle . Once a minute, the gatherer will pull stats
information from every connected node and write them into the pickle. The
pickle will contain a dictionary, in which node identifiers (known as "tubid"
strings) are the keys, and the values are a dict with 'timestamp',
'nickname', and 'stats' keys. d[tubid]['stats'] will contain the stats
dictionary as made available at http://localhost:3456/statistics?t=json . The
pickle file will only contain the most recent update from each node.

Other tools can be built to examine these stats and render them into
something useful. For example, a tool could sum the
'storage_server.disk_avail' values from all servers to compute a
total-disk-available number for the entire grid (however, the "disk watcher"
daemon, in misc/operations_helpers/spacetime/, is better suited for this
specific task).
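
Such a tool might look like the following sketch. It builds a stand-in
pickle with the layout described above (in place of a real
$BASEDIR/stats.pickle); the two node records are fabricated for illustration.

::

    import pickle

    # Stand-in for $BASEDIR/stats.pickle, with the described layout:
    demo = {
        "tubid-aaa": {"timestamp": 1286950000, "nickname": "server-1",
                      "stats": {"counters": {},
                                "stats": {"storage_server.disk_avail": 10**9}}},
        "tubid-bbb": {"timestamp": 1286950030, "nickname": "server-2",
                      "stats": {"counters": {},
                                "stats": {"storage_server.disk_avail": 2 * 10**9}}},
    }
    with open("demo-stats.pickle", "wb") as f:
        pickle.dump(demo, f)

    # A tool can then total disk_avail across all reporting servers:
    with open("demo-stats.pickle", "rb") as f:
        data = pickle.load(f)
    grid_avail = sum(node["stats"]["stats"]["storage_server.disk_avail"]
                     for node in data.values())
    print(grid_avail)  # total-disk-available for the (fake) grid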

Using Munin To Graph Stats Values
=================================

The misc/munin/ directory contains various plugins to graph stats for Tahoe
nodes. They are intended for use with the Munin_ system-management tool, which
typically polls target systems every 5 minutes and produces a web page with
graphs of various things over multiple time scales (last hour, last month,
last year).

.. _Munin: http://munin-monitoring.org/

Most of the plugins are designed to pull stats from a single Tahoe node, and
are configured with a URL like http://localhost:3456/statistics?t=json . The
"tahoe_stats" plugin is designed to read from the pickle file created by the
stats-gatherer. Some plugins are to be used with the disk watcher, and a few
(like tahoe_nodememory) are designed to watch the node processes directly
(and must therefore run on the same host as the target node).
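
A Munin plugin is simply a program that prints "<field>.value <number>"
lines when polled, and graph metadata when invoked with a "config" argument.
A minimal sketch follows; the graph title, field name, and node URL here are
hypothetical, not those of the real plugins in misc/munin/.

::

    import json
    import sys
    from urllib.request import urlopen

    NODE_URL = "http://localhost:3456/statistics?t=json"   # assumed node URL

    def config_lines():
        # Graph metadata, printed when Munin invokes the plugin with "config".
        return ["graph_title Tahoe files uploaded",
                "uploaded.label files uploaded"]

    def value_line(stats):
        # Normal poll output: "<field>.value <number>".
        return "uploaded.value %d" % stats["counters"]["uploader.files_uploaded"]

    def main(argv):
        if argv[1:] == ["config"]:
            print("\n".join(config_lines()))
        else:
            print(value_line(json.load(urlopen(NODE_URL))))

    # When installed as a plugin, Munin runs this file directly:
    # main(sys.argv)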

Please see the docstrings at the beginning of each plugin for details, and
the "tahoe-conf" file for notes about configuration and installing these
plugins into a Munin environment.