Opened at 2009-11-01T01:13:48Z
Last modified at 2010-12-28T16:00:24Z
#823 new defect
WUI server should have a disallow-all robots.txt
Reported by: | davidsarah | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | undecided |
Component: | code-frontend-web | Version: | 1.5.0 |
Keywords: | privacy | Cc: | |
Launchpad Bug: |
Description
Currently, if a web crawler gets access to a Tahoe WUI gateway server then it will crawl all reachable links. This is probably undesirable, or at least not a sensible default (even though it is understood that robots.txt is not meant as a security mechanism).
WUI servers should have a disallow-all robots.txt:
User-agent: *
Disallow: /
The robots.txt specification is at http://www.robotstxt.org/orig.html
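As a quick sanity check of what the proposed file means to a well-behaved crawler, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not Tahoe code: the helper name `crawler_may_fetch` and the gateway URLs are hypothetical, chosen only for illustration, and it only models crawlers that actually honor robots.txt.

```python
from urllib.robotparser import RobotFileParser

# The disallow-all file proposed in this ticket.
ROBOTS_TXT = "User-agent: *\nDisallow: /\n"


def crawler_may_fetch(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if a robots.txt-compliant crawler may fetch `url`.

    Hypothetical helper for illustration; it parses the given robots.txt
    text and asks the standard-library parser for a verdict.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical gateway URLs; any path on the host is covered by "Disallow: /".
    for url in ("http://127.0.0.1:3456/", "http://127.0.0.1:3456/uri/example"):
        print(url, crawler_may_fetch("ExampleBot", url))
```

With the disallow-all rule every path on the gateway is refused for every user agent, whereas an empty robots.txt permits everything. This only affects cooperative crawlers, consistent with the note above that robots.txt is not a security mechanism.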
Change History (6)
comment:1 Changed at 2009-11-01T01:21:08Z by davidsarah
comment:2 Changed at 2009-11-01T02:04:17Z by davidsarah
The Welcome page does include the introducer FURL, which some users might want to keep private as per #562.
comment:3 Changed at 2009-11-01T04:42:06Z by zooko
I think it is kind of cool that I occasionally find files on a Tahoe-LAFS grid in Google search results.
comment:4 Changed at 2009-12-20T23:44:08Z by davidsarah
- Keywords privacy added
If you like this bug, you might also like #860.
comment:5 Changed at 2010-12-26T03:25:41Z by davidsarah
warner in ticket:127#comment:29 gives another reason to fix this ticket:
Incidentally, someone told me the other day that any URLs sent through various Google products (Google Talk the IM system, Gmail, anything you browse while the Google Toolbar is in your browser) get spidered and added to the public index. The person couldn't think of any conventions (beyond robots.txt) to convince them to *not* follow those links, but they could think of lots of things to encourage their spider even more.
I plan to do some tests of this (or just ask Google's spider to tell me about tests which somebody else has undoubtedly performed already).
I know, I know, it's one of those boiling the ocean things, it's really unfortunate that so many tools are so hostile to the really-convenient idea of secret URLs.
comment:6 Changed at 2010-12-28T16:00:24Z by zooko
I disagree with "WUI server should have a disallow-all robots.txt". I think if a web crawler gets access to a cap then it should crawl and index all the files and directories reachable from that cap. I suppose you can put a robots.txt file into a directory in Tahoe-LAFS if you want crawlers to ignore that directory.
On closer examination, the Welcome (root) page only links to statistics pages. OTOH, a directory page might be linked from elsewhere on the web, in which case everything reachable from that directory would be crawled. Anyway, it seems easy to fix.