.. -*- coding: utf-8-with-signature -*-

================
Tahoe Statistics
================

1. `Overview`_
2. `Statistics Categories`_
3. `Using Munin To Graph Stats Values`_
4. `Scraping Stats Values in OpenMetrics Format`_

Overview
========

Each Tahoe node collects and publishes statistics about its operations as it
runs. These include counters of how many files have been uploaded and
downloaded, CPU usage information, performance numbers such as the latency of
storage-server operations, and available disk space.

The easiest way to see the stats for any given node is to use the web
interface. From the main "Welcome Page", follow the "Operational Statistics"
link inside the small "This Client" box. If the welcome page lives at
http://localhost:3456/, then the statistics page will live at
http://localhost:3456/statistics . This presents a summary of the stats
block, along with a copy of the raw counters. To obtain just the raw counters
(in JSON format), use /statistics?t=json instead.

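The JSON form can also be fetched and inspected programmatically. A minimal
sketch in Python (the base URL is the example above; the helper names and the
summarized keys are our own choices, not part of the Tahoe API):

```python
import json
from urllib.request import urlopen

def fetch_stats(base_url="http://localhost:3456"):
    """Fetch the raw stats dictionary from a node's web port."""
    with urlopen(base_url + "/statistics?t=json") as resp:
        return json.load(resp)

def summarize(stats):
    """Pull a few commonly watched values out of the stats dictionary.

    The two top-level keys ('counters' and 'stats') are described in
    the next section."""
    counters = stats.get("counters", {})
    gauges = stats.get("stats", {})
    return {
        "files_uploaded": counters.get("uploader.files_uploaded", 0),
        "uptime_seconds": gauges.get("node.uptime"),
    }
```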
Statistics Categories
=====================

The stats dictionary contains two keys: 'counters' and 'stats'. 'counters'
are strictly counters: they are reset to zero when the node is started, and
grow upwards. 'stats' are non-incrementing values, used to measure the
current state of various systems. Some stats are actually booleans, expressed
as '1' for true and '0' for false (internal restrictions require all stats
values to be numbers).

Under both the 'counters' and 'stats' dictionaries, each individual stat has
a key with a dot-separated name, breaking them up into groups like
'cpu_monitor' and 'storage_server'.

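Because every name is dot-separated, the flat dictionaries can be regrouped
by their first component. A sketch (the function name is ours):

```python
from collections import defaultdict

def group_by_prefix(flat):
    """Split dot-separated stat names (e.g. 'storage_server.allocate')
    into nested groups keyed by the first component."""
    groups = defaultdict(dict)
    for name, value in flat.items():
        prefix, _, rest = name.partition(".")
        groups[prefix][rest] = value
    return dict(groups)
```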
The currently available stats (as of release 1.6.0 or so) are described here:

**counters.storage_server.\***

  this group counts inbound storage-server operations. They are not provided
  by client-only nodes that have been configured not to run a storage server
  (with [storage]enabled=false in tahoe.cfg).

  allocate, write, close, abort
    these are for immutable file uploads. 'allocate' is incremented when a
    client asks if it can upload a share to the server. 'write' is
    incremented for each chunk of data written. 'close' is incremented when
    the share is finished. 'abort' is incremented if the client abandons
    the upload.

  get, read
    these are for immutable file downloads. 'get' is incremented
    when a client asks if the server has a specific share. 'read' is
    incremented for each chunk of data read.

  readv, writev
    these are for mutable file creation, publish, and retrieve. 'readv'
    is incremented each time a client reads part of a mutable share.
    'writev' is incremented each time a client sends a modification
    request.

  add-lease, renew, cancel
    these are for share lease modifications. 'add-lease' is incremented
    when an 'add-lease' operation is performed (which either adds a new
    lease or renews an existing lease). 'renew' is for the 'renew-lease'
    operation (which can only be used to renew an existing one). 'cancel'
    is used for the 'cancel-lease' operation.

  bytes_freed
    this counts how many bytes were freed when a 'cancel-lease'
    operation removed the last lease from a share and the share
    was thus deleted.

  bytes_added
    this counts how many bytes were consumed by immutable share
    uploads. It is incremented at the same time as the 'close'
    counter.

**stats.storage_server.\***

  allocated
    this counts how many bytes are currently 'allocated', which
    tracks the space that will eventually be consumed by immutable
    share upload operations. The stat is increased as soon as the
    upload begins (at the same time the 'allocate' counter is
    incremented), and goes back to zero when the 'close' or 'abort'
    message is received (at which point the 'disk_used' stat should
    be incremented by the same amount).

  disk_total, disk_used, disk_free_for_root, disk_free_for_nonroot, disk_avail, reserved_space
    these all reflect disk-space usage policies and status.
    'disk_total' is the total size of the disk where the storage
    server's BASEDIR/storage/shares directory lives, as reported
    by /bin/df or equivalent. 'disk_used', 'disk_free_for_root',
    and 'disk_free_for_nonroot' show related information.
    'reserved_space' reports the reservation configured by the
    tahoe.cfg [storage]reserved_space value. 'disk_avail'
    reports the remaining disk space available for the Tahoe
    server after subtracting reserved_space from
    disk_free_for_nonroot. All values are in bytes.

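The disk_avail relationship can be written out directly. A sketch of the
computation described above (flooring at zero is our assumption, since a
reservation larger than the free space leaves nothing available; all values
in bytes):

```python
def disk_avail(disk_free_for_nonroot, reserved_space):
    """Remaining space the Tahoe server may use: non-root free space
    minus the configured reservation, floored at zero (our assumption)."""
    return max(0, disk_free_for_nonroot - reserved_space)
```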
  accepting_immutable_shares
    this is '1' if the storage server is currently accepting uploads of
    immutable shares. It may be '0' if the server is disabled by
    configuration, or if the disk is full (i.e. disk_avail is less than
    reserved_space).

  total_bucket_count
    this counts the number of 'buckets' (i.e. unique
    storage-index values) currently managed by the storage
    server. It indicates roughly how many files are managed
    by the server.

  latencies.\*.\*
    these stats keep track of local disk latencies for
    storage-server operations. A number of percentile values are
    tracked for many operations. For example,
    'storage_server.latencies.readv.50_0_percentile' records the
    median response time for a 'readv' request. All values are in
    seconds. These are recorded by the storage server, starting
    from the time the request arrives (post-deserialization) and
    ending when the response begins serialization. As such, they
    are mostly useful for measuring disk speeds. The operations
    tracked are the same as the counters.storage_server.* counter
    values (allocate, write, close, get, read, add-lease, renew,
    cancel, readv, writev). The percentile values tracked are:
    mean, 01_0_percentile, 10_0_percentile, 50_0_percentile,
    90_0_percentile, 95_0_percentile, 99_0_percentile,
    99_9_percentile. (The last value, the 99.9 percentile, means that
    999 out of the last 1000 operations were faster than the
    given number, and is the same threshold used by Amazon's
    internal SLA, according to the Dynamo paper.)
    Percentiles are only reported when enough observations have
    accumulated for an unambiguous interpretation. For example,
    the 99.9th percentile is only distinguishable (at thousandths
    precision) from the 99th percentile for sample sizes of 1000
    or more, so the 99.9th percentile is only reported for samples
    of 1000 or more observations.

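For illustration, a nearest-rank percentile over a window of latency samples
might be computed like this (a sketch only, not necessarily the exact
algorithm the storage server uses; it assumes a non-empty sample list):

```python
def percentile(samples, p):
    """Nearest-rank p-th percentile of a list of latency samples (seconds).

    With 1000 samples and p=99.9, this picks the largest of the 1000
    values, matching the '999 out of the last 1000' reading above."""
    xs = sorted(samples)
    k = min(len(xs) - 1, int(len(xs) * p / 100.0))
    return xs[k]
```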

**counters.uploader.files_uploaded**

**counters.uploader.bytes_uploaded**

**counters.downloader.files_downloaded**

**counters.downloader.bytes_downloaded**

  These count client activity: a Tahoe client will increment these when it
  uploads or downloads an immutable file. 'files_uploaded' is incremented by
  one for each operation, while 'bytes_uploaded' is incremented by the size of
  the file.

**counters.mutable.files_published**

**counters.mutable.bytes_published**

**counters.mutable.files_retrieved**

**counters.mutable.bytes_retrieved**

  These count client activity for mutable files. 'published' is the act of
  changing an existing mutable file (or creating a brand-new mutable file).
  'retrieved' is the act of reading its current contents.

**counters.chk_upload_helper.\***

  These count activity of the "Helper", which receives ciphertext from clients
  and performs erasure-coding and share upload for files that are not already
  in the grid. The code which implements these counters is in
  src/allmydata/immutable/offloaded.py .

  upload_requests
    incremented each time a client asks to upload a file

  upload_already_present
    incremented when the file is already in the grid

  upload_need_upload
    incremented when the file is not already in the grid

  resumes
    incremented when the helper already has partial ciphertext for
    the requested upload, indicating that the client is resuming an
    earlier upload

  fetched_bytes
    this counts how many bytes of ciphertext have been fetched
    from uploading clients

  encoded_bytes
    this counts how many bytes of ciphertext have been
    encoded and turned into successfully-uploaded shares. If no
    uploads have failed or been abandoned, encoded_bytes should
    eventually equal fetched_bytes.

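The fetched/encoded pair implies a simple derived quantity: the ciphertext
the helper has accepted but not yet turned into shares. A sketch (the
function name is ours):

```python
def helper_backlog(fetched_bytes, encoded_bytes):
    """Ciphertext fetched from clients but not yet encoded into shares.
    Per the text above, this drains toward zero when no uploads fail
    or are abandoned."""
    return fetched_bytes - encoded_bytes
```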
**stats.chk_upload_helper.\***

  These also track Helper activity:

  active_uploads
    how many files are currently being uploaded. 0 when idle.

  incoming_count
    how many cache files are present in the incoming/ directory,
    which holds ciphertext files that are still being fetched
    from the client

  incoming_size
    total size of cache files in the incoming/ directory

  incoming_size_old
    total size of 'old' cache files (more than 48 hours old)

  encoding_count
    how many cache files are present in the encoding/ directory,
    which holds ciphertext files that are being encoded and
    uploaded

  encoding_size
    total size of cache files in the encoding/ directory

  encoding_size_old
    total size of 'old' cache files (more than 48 hours old)

**stats.node.uptime**

  how many seconds since the node process was started

**stats.cpu_monitor.\***

  1min_avg, 5min_avg, 15min_avg
    estimate of what percentage of system CPU time was consumed by the
    node process, over the given time interval. Expressed as a float, 0.0
    for 0%, 1.0 for 100%

  total
    estimate of the total number of CPU seconds consumed by the node since
    the process was started. Ticket #472 indicates that .total may
    sometimes be negative due to wraparound of the kernel's counter.


Using Munin To Graph Stats Values
=================================

The misc/operations_helpers/munin/ directory contains various plugins to
graph stats for Tahoe nodes. They are intended for use with the Munin_
system-management tool, which typically polls target systems every 5 minutes
and produces a web page with graphs of various things over multiple time
scales (last hour, last month, last year).

Most of the plugins are designed to pull stats from a single Tahoe node, and
are configured with a URL like http://localhost:3456/statistics?t=json. The
"tahoe_stats" plugin is designed to read from the JSON file created by the
stats-gatherer. Some plugins are designed for use with the disk watcher, and
a few (like tahoe_nodememory) are designed to watch the node processes
directly (and must therefore run on the same host as the target node).

Please see the docstrings at the beginning of each plugin for details, and
the "tahoe-conf" file for notes about configuration and installing these
plugins into a Munin environment.

.. _Munin: http://munin-monitoring.org/

Scraping Stats Values in OpenMetrics Format
===========================================

Time Series DataBase (TSDB) software like Prometheus_ and VictoriaMetrics_ can
parse statistics in OpenMetrics_ format from a URL like
http://localhost:3456/statistics?t=openmetrics . Software like Grafana_ can
then be used to graph and alert on these numbers. You can find a
pre-configured dashboard for Grafana at
https://grafana.com/grafana/dashboards/16894-tahoe-lafs/.

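A minimal sketch of reading such output, for simple unlabelled sample lines
(the metric name shown is hypothetical, not necessarily what Tahoe emits;
labels and timestamps are out of scope here):

```python
def parse_openmetrics_line(line):
    """Parse one simple 'name value' OpenMetrics sample line.

    Returns (name, value) for a sample, or None for blank lines and
    '#' comment/metadata lines. Labelled or timestamped samples are
    not handled by this sketch."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    name, _, value = line.partition(" ")
    return name, float(value)
```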
277 | |
---|
278 | .. _OpenMetrics: https://openmetrics.io/ |
---|
279 | .. _Prometheus: https://prometheus.io/ |
---|
280 | .. _VictoriaMetrics: https://victoriametrics.com/ |
---|
281 | .. _Grafana: https://grafana.com/ |
---|