================
Tahoe Statistics
================

1. `Overview`_
2. `Statistics Categories`_
3. `Running a Tahoe Stats-Gatherer Service`_
4. `Using Munin To Graph Stats Values`_

Overview
========

Each Tahoe node collects and publishes statistics about its operations as it
runs. These include counters of how many files have been uploaded and
downloaded, CPU usage information, performance numbers like latency of
storage server operations, and available disk space.

The easiest way to see the stats for any given node is to use the web
interface. From the main "Welcome Page", follow the "Operational Statistics"
link inside the small "This Client" box. If the welcome page lives at
http://localhost:3456/, then the statistics page will live at
http://localhost:3456/statistics . This presents a summary of the stats
block, along with a copy of the raw counters. To obtain just the raw counters
(in JSON format), use /statistics?t=json instead.
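
For scripted access, the JSON form can be parsed with the standard library.
A minimal sketch: the sample string below mirrors the response shape, and
against a live node the same document would be fetched from
http://localhost:3456/statistics?t=json instead.

```python
import json

# Parse a /statistics?t=json response. The sample string mirrors the
# document shape: a top-level dict with 'counters' and 'stats' keys.
sample = '{"counters": {"uploader.files_uploaded": 2}, "stats": {"node.uptime": 93.4}}'
data = json.loads(sample)

files_uploaded = data["counters"].get("uploader.files_uploaded", 0)
uptime = data["stats"]["node.uptime"]

# Against a live node, fetch the same document over HTTP, e.g. with
# urllib.request.urlopen("http://localhost:3456/statistics?t=json").
```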

Statistics Categories
=====================

The stats dictionary contains two keys: 'counters' and 'stats'. 'counters'
are strictly counters: they are reset to zero when the node is started, and
grow upwards. 'stats' are non-incrementing values, used to measure the
current state of various systems. Some stats are actually booleans, expressed
as '1' for true and '0' for false (internal restrictions require all stats
values to be numbers).

Under both the 'counters' and 'stats' dictionaries, each individual stat has
a key with a dot-separated name, breaking them up into groups like
'cpu_monitor' and 'storage_server'.

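
Because the keys are flat dot-separated strings, grouping them only takes a
string split. A sketch, using illustrative counter values:

```python
from collections import defaultdict

# Group flat dot-separated stat names by their first component,
# e.g. 'storage_server.allocate' falls into group 'storage_server'.
counters = {
    "storage_server.allocate": 12,
    "storage_server.write": 340,
    "uploader.files_uploaded": 2,
}

groups = defaultdict(dict)
for name, value in counters.items():
    group, _, rest = name.partition(".")
    groups[group][rest] = value

# groups["storage_server"] now maps 'allocate' and 'write' to their counts.
```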
The currently available stats (as of release 1.6.0 or so) are described here:

**counters.storage_server.\***

    this group counts inbound storage-server operations. They are not
    provided by client-only nodes which have been configured to not run a
    storage server (with [storage]enabled=false in tahoe.cfg)

    allocate, write, close, abort
        these are for immutable file uploads. 'allocate' is incremented when
        a client asks if it can upload a share to the server. 'write' is
        incremented for each chunk of data written. 'close' is incremented
        when the share is finished. 'abort' is incremented if the client
        abandons the upload.

    get, read
        these are for immutable file downloads. 'get' is incremented when a
        client asks if the server has a specific share. 'read' is
        incremented for each chunk of data read.

    readv, writev
        these are for mutable file creation, publish, and retrieve. 'readv'
        is incremented each time a client reads part of a mutable share.
        'writev' is incremented each time a client sends a modification
        request.

    add-lease, renew, cancel
        these are for share lease modifications. 'add-lease' is incremented
        when an 'add-lease' operation is performed (which either adds a new
        lease or renews an existing lease). 'renew' is for the 'renew-lease'
        operation (which can only be used to renew an existing one).
        'cancel' is used for the 'cancel-lease' operation.

    bytes_freed
        this counts how many bytes were freed when a 'cancel-lease'
        operation removed the last lease from a share and the share was thus
        deleted.

    bytes_added
        this counts how many bytes were consumed by immutable share uploads.
        It is incremented at the same time as the 'close' counter.

**stats.storage_server.\***

    allocated
        this counts how many bytes are currently 'allocated', which tracks
        the space that will eventually be consumed by immutable share upload
        operations. The stat is increased as soon as the upload begins (at
        the same time the 'allocate' counter is incremented), and goes back
        to zero when the 'close' or 'abort' message is received (at which
        point the 'disk_used' stat should be incremented by the same
        amount).

    disk_total, disk_used, disk_free_for_root, disk_free_for_nonroot, disk_avail, reserved_space
        these all reflect disk-space usage policies and status. 'disk_total'
        is the total size of the disk where the storage server's
        BASEDIR/storage/shares directory lives, as reported by /bin/df or
        equivalent. 'disk_used', 'disk_free_for_root', and
        'disk_free_for_nonroot' show related information. 'reserved_space'
        reports the reservation configured by the tahoe.cfg
        [storage]reserved_space value. 'disk_avail' reports the remaining
        disk space available for the Tahoe server after subtracting
        reserved_space from disk_free_for_nonroot. All values are in bytes.

    accepting_immutable_shares
        this is '1' if the storage server is currently accepting uploads of
        immutable shares. It may be '0' if a server is disabled by
        configuration, or if the disk is full (i.e. disk_avail is less than
        reserved_space).

    total_bucket_count
        this counts the number of 'buckets' (i.e. unique storage-index
        values) currently managed by the storage server. It indicates
        roughly how many files are managed by the server.

    latencies.*.*
        these stats keep track of local disk latencies for storage-server
        operations. A number of percentile values are tracked for many
        operations. For example,
        'storage_server.latencies.readv.50_0_percentile' records the median
        response time for a 'readv' request. All values are in seconds.
        These are recorded by the storage server, starting from the time the
        request arrives (post-deserialization) and ending when the response
        begins serialization. As such, they are mostly useful for measuring
        disk speeds. The operations tracked are the same as the
        counters.storage_server.* counter values (allocate, write, close,
        get, read, add-lease, renew, cancel, readv, writev). The percentile
        values tracked are: mean, 01_0_percentile, 10_0_percentile,
        50_0_percentile, 90_0_percentile, 95_0_percentile, 99_0_percentile,
        99_9_percentile. (The last value, the 99.9 percentile, means that
        999 out of the last 1000 operations were faster than the given
        number; this is the same threshold used by Amazon's internal SLA,
        according to the Dynamo paper.)
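
To pull one of these latency figures out of the flat stats dictionary, index
it by its full dotted name. A sketch that collects the median for each
operation, using illustrative values:

```python
# Extract median latencies per operation from a stats dict. The keys and
# values here are illustrative; real ones come from /statistics?t=json
# under the 'stats' key.
stats = {
    "storage_server.latencies.readv.50_0_percentile": 0.002,
    "storage_server.latencies.writev.50_0_percentile": 0.015,
    "storage_server.latencies.readv.99_9_percentile": 0.120,
}

medians = {
    key.split(".")[2]: value          # third component is the operation name
    for key, value in stats.items()
    if key.startswith("storage_server.latencies.")
    and key.endswith(".50_0_percentile")
}
# medians maps operation name ('readv', 'writev') to its median, in seconds.
```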

**counters.uploader.files_uploaded**

**counters.uploader.bytes_uploaded**

**counters.downloader.files_downloaded**

**counters.downloader.bytes_downloaded**

These count client activity: a Tahoe client will increment these when it
uploads or downloads an immutable file. 'files_uploaded' is incremented by
one for each operation, while 'bytes_uploaded' is incremented by the size of
the file.
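
Dividing the byte counter by the file counter gives the mean size of the
files uploaded so far. A sketch, with illustrative counter values:

```python
# Mean uploaded-file size from a node's counters. The values here are
# illustrative; real ones come from /statistics?t=json under 'counters'.
counters = {
    "uploader.files_uploaded": 4,
    "uploader.bytes_uploaded": 8 * 1024 * 1024,
}

files = counters.get("uploader.files_uploaded", 0)
mean_size = counters["uploader.bytes_uploaded"] / files if files else 0.0
# For the sample values, mean_size is 2 MiB per file.
```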

**counters.mutable.files_published**

**counters.mutable.bytes_published**

**counters.mutable.files_retrieved**

**counters.mutable.bytes_retrieved**

These count client activity for mutable files. 'published' is the act of
changing an existing mutable file (or creating a brand-new mutable file).
'retrieved' is the act of reading its current contents.

**counters.chk_upload_helper.\***

These count activity of the "Helper", which receives ciphertext from clients
and performs erasure-coding and share upload for files that are not already
in the grid. The code which implements these counters is in
src/allmydata/immutable/offloaded.py .

    upload_requests
        incremented each time a client asks to upload a file

    upload_already_present
        incremented when the file is already in the grid

    upload_need_upload
        incremented when the file is not already in the grid

    resumes
        incremented when the helper already has partial ciphertext for the
        requested upload, indicating that the client is resuming an earlier
        upload

    fetched_bytes
        this counts how many bytes of ciphertext have been fetched from
        uploading clients

    encoded_bytes
        this counts how many bytes of ciphertext have been encoded and
        turned into successfully-uploaded shares. If no uploads have failed
        or been abandoned, encoded_bytes should eventually equal
        fetched_bytes.
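
The relationship between 'fetched_bytes' and 'encoded_bytes' gives a rough
measure of how far the Helper has gotten through the ciphertext it has
accepted. A sketch, with illustrative values:

```python
# Rough Helper progress: the fraction of fetched ciphertext that has
# already been encoded into uploaded shares. Illustrative values; real
# ones come from the 'counters' dict under chk_upload_helper.*.
counters = {
    "chk_upload_helper.fetched_bytes": 1000,
    "chk_upload_helper.encoded_bytes": 750,
}

fetched = counters["chk_upload_helper.fetched_bytes"]
encoded = counters["chk_upload_helper.encoded_bytes"]
progress = encoded / fetched if fetched else 1.0
# progress == 0.75: a quarter of the fetched ciphertext is still pending.
```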

**stats.chk_upload_helper.\***

These also track Helper activity:

    active_uploads
        how many files are currently being uploaded. 0 when idle.

    incoming_count
        how many cache files are present in the incoming/ directory, which
        holds ciphertext files that are still being fetched from the client

    incoming_size
        total size of cache files in the incoming/ directory

    incoming_size_old
        total size of 'old' cache files (more than 48 hours)

    encoding_count
        how many cache files are present in the encoding/ directory, which
        holds ciphertext files that are being encoded and uploaded

    encoding_size
        total size of cache files in the encoding/ directory

    encoding_size_old
        total size of 'old' cache files (more than 48 hours)

**stats.node.uptime**

    how many seconds since the node process was started

**stats.cpu_monitor.\***

    1min_avg, 5min_avg, 15min_avg
        estimate of what percentage of system CPU time was consumed by the
        node process, over the given time interval. Expressed as a float,
        0.0 for 0%, 1.0 for 100%

    total
        estimate of the total number of CPU seconds consumed by the node
        since the process was started. Ticket #472 indicates that .total may
        sometimes be negative due to wraparound of the kernel's counter.

**stats.load_monitor.\***

When enabled, the "load monitor" continually schedules a one-second
callback, and measures how late the response is. This estimates system load
(if the system is idle, the response should be on time). This is only
enabled if a stats-gatherer is configured.

    avg_load
        average "load" value (seconds late) over the last minute

    max_load
        maximum "load" value over the last minute


Running a Tahoe Stats-Gatherer Service
======================================

The "stats-gatherer" is a simple daemon that periodically collects stats
from several tahoe nodes. It could be useful, e.g., in a production
environment, where you want to monitor dozens of storage servers from a
central management host. It merely gathers statistics from many nodes into a
single place: it does not do any actual analysis.

The stats gatherer listens on a network port using the same Foolscap_
connection library that Tahoe clients use to connect to storage servers.
Tahoe nodes can be configured to connect to the stats gatherer and publish
their stats on a periodic basis. (In fact, what happens is that nodes
connect to the gatherer and offer it a second FURL which points back to the
node's "stats port", which the gatherer then uses to pull stats on a
periodic basis. The initial connection is flipped to allow the nodes to live
behind NAT boxes, as long as the stats-gatherer has a reachable IP address.)

.. _Foolscap: http://foolscap.lothar.com/trac

The stats-gatherer is created in the same fashion as regular tahoe client
nodes and introducer nodes. Choose a base directory for the gatherer to live
in (but do not create the directory). Then run:

::

  tahoe create-stats-gatherer $BASEDIR

and start it with "tahoe start $BASEDIR". Once running, the gatherer will
write a FURL into $BASEDIR/stats_gatherer.furl .

To configure a Tahoe client/server node to contact the stats gatherer, copy
this FURL into the node's tahoe.cfg file, in a section named "[client]",
under a key named "stats_gatherer.furl", like so:

::

  [client]
  stats_gatherer.furl = pb://qbo4ktl667zmtiuou6lwbjryli2brv6t@192.168.0.8:49997/wxycb4kaexzskubjnauxeoptympyf45y

or simply copy the stats_gatherer.furl file into the node's base directory
(next to the tahoe.cfg file): it will be interpreted in the same way.

The first time it is started, the gatherer will listen on a random unused
TCP port, so it should not conflict with anything else that you have running
on that host at that time. On subsequent runs, it will re-use the same port
(to keep its FURL consistent). To explicitly control which port it uses,
write the desired port number into a file named "portnum" (i.e.
$BASEDIR/portnum), and the next time the gatherer is started, it will start
listening on the given port. The portnum file is actually a "strports
specification string", as described in docs/configuration.txt .

Once running, the stats gatherer will create a standard python "pickle" file
in $BASEDIR/stats.pickle . Once a minute, the gatherer will pull stats
information from every connected node and write them into the pickle. The
pickle will contain a dictionary, in which node identifiers (known as
"tubid" strings) are the keys, and the values are a dict with 'timestamp',
'nickname', and 'stats' keys. d[tubid]['stats'] will contain the stats
dictionary as made available at http://localhost:3456/statistics?t=json .
The pickle file will only contain the most recent update from each node.
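
The pickle can be read back with the standard library. A minimal sketch: the
tubid string and values below are illustrative stand-ins, and a real file
would be opened from $BASEDIR/stats.pickle.

```python
import pickle

# Build and read back a dict of the same shape the gatherer writes:
# tubid -> {'timestamp': ..., 'nickname': ..., 'stats': {...}}, where the
# inner 'stats' value is the /statistics?t=json document (which itself
# holds 'counters' and 'stats' keys). Illustrative data only.
sample = {
    "x7fjj2lecb6mxbrhg2ncpmqtcnogfgnh": {
        "timestamp": 1234567890.0,
        "nickname": "storage1",
        "stats": {"counters": {}, "stats": {"node.uptime": 3600.0}},
    }
}
blob = pickle.dumps(sample)

# With a real file, use: d = pickle.load(open(path, "rb"))
d = pickle.loads(blob)
for tubid, record in d.items():
    uptime = record["stats"]["stats"]["node.uptime"]
```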

Other tools can be built to examine these stats and render them into
something useful. For example, a tool could sum the
'storage_server.disk_avail' values from all servers to compute a
total-disk-available number for the entire grid (however, the "disk watcher"
daemon, in misc/operations_helpers/spacetime/, is better suited for this
specific task).
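
Such a summing tool can be sketched in a few lines against the gatherer's
pickle dictionary (the node data here is illustrative):

```python
# Sum 'storage_server.disk_avail' across all nodes in a gatherer-style
# dict (tubid -> record). Illustrative data; a real dict comes from
# $BASEDIR/stats.pickle. Client-only nodes simply lack the stat.
nodes = {
    "tubid-aaa": {"stats": {"stats": {"storage_server.disk_avail": 100}}},
    "tubid-bbb": {"stats": {"stats": {"storage_server.disk_avail": 250}}},
    "tubid-ccc": {"stats": {"stats": {}}},  # client-only node
}

total_avail = sum(
    record["stats"]["stats"].get("storage_server.disk_avail", 0)
    for record in nodes.values()
)
# total_avail is 350 bytes for the sample values.
```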

Using Munin To Graph Stats Values
=================================

The misc/munin/ directory contains various plugins to graph stats for Tahoe
nodes. They are intended for use with the Munin_ system-management tool,
which typically polls target systems every 5 minutes and produces a web page
with graphs of various things over multiple time scales (last hour, last
month, last year).

.. _Munin: http://munin-monitoring.org/

Most of the plugins are designed to pull stats from a single Tahoe node, and
are configured with a URL like http://localhost:3456/statistics?t=json . The
"tahoe_stats" plugin is designed to read from the pickle file created by the
stats-gatherer. Some plugins are to be used with the disk watcher, and a few
(like tahoe_nodememory) are designed to watch the node processes directly
(and must therefore run on the same host as the target node).

Please see the docstrings at the beginning of each plugin for details, and
the "tahoe-conf" file for notes about configuration and installing these
plugins into a Munin environment.
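
As an illustration of the general plugin shape (a hypothetical sketch, not
one of the plugins shipped in misc/munin/), a Munin plugin prints graph
metadata when run with the "config" argument, and "field.value N" lines
otherwise:

```python
import json
import sys
from urllib.request import urlopen

# A minimal Munin-style plugin sketch (hypothetical; not shipped in
# misc/munin/) that graphs stats.node.uptime from a single node.
STATS_URL = "http://localhost:3456/statistics?t=json"  # assumed node URL

def config_lines():
    # Munin runs the plugin with "config" to learn graph metadata.
    return [
        "graph_title Tahoe Node Uptime",
        "graph_vlabel seconds",
        "uptime.label uptime",
    ]

def value_lines(stats_doc):
    # stats_doc is the parsed /statistics?t=json document; emit one
    # "field.value N" line per graphed field.
    return ["uptime.value %s" % stats_doc["stats"]["node.uptime"]]

if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "config":
        print("\n".join(config_lines()))
    else:
        with urlopen(STATS_URL) as resp:
            print("\n".join(value_lines(json.load(resp))))
```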
---|