Syndicated search engines broken - Part II
A few days ago I grumbled at the poor state of the search engines specializing in syndicated (RSS'd or Atomized) content.
Today, Marshall Kirkpatrick is enthusiastically supporting a proposed standard by Bloglines that tries to solve an apparent problem:
"'Everything you blog goes on your permanent record!' How many times have we heard that lately? From employment to family situations, many people have been frustrated to find out that things they intended to write for a personal audience are now discoverable by anyone in the world via search engines."
From the Bloglines proposal:
"As a result, we are proposing (and have implemented) an RSS and ATOM extension that allows publishers to indicate the distribution restrictions of a feed. Setting the access restriction to 'deny' will indicate the feed should not be re-distributed."
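As a sketch of what this looks like in practice (the namespace URI and channel-level placement follow my reading of the Bloglines proposal, so treat the details as assumptions), a feed opting out of redistribution might look like:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
     xmlns:access="http://www.bloglines.com/about/specs/fac-1.0">
  <channel>
    <title>A personal blog</title>
    <link>http://example.org/</link>
    <description>Example feed opting out of redistribution</description>
    <!-- 'deny' asks compliant aggregators not to redistribute or
         index this feed; omitting the element (or 'allow') permits it -->
    <access:restriction relationship="deny" />
  </channel>
</rss>
```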
I respectfully disagree with Marshall's view here, and as a user of these services, I cannot support the proposal, for three reasons:
1. Keeping stuff out of participating engines wouldn't prevent leakage. As one commenter ('007') on the quoted post has already pointed out, how do you avoid the repost scenario? If you really need to sneak stuff under the radar (to avoid getting fired?), use something other than a public blog site, because you will be found. Another reason: why wouldn't some service providers show up that don't adhere to the rules, catching exactly the content everyone else excludes? (I could imagine a 'Slimesearch'...) Private networks are fine for this kind of thing: group IM, SSL'd groups, etc. (even company email is considered leaky), but just don't use inherently public networks for this kind of stuff.
2. A common issue with search results is spam, and spammers won't use the tag. I realize this isn't a stated goal of the proposal, but it's worth pointing out, I think.
3. IMHO, these guys (Bloglines, Technorati, etc.) should be focused on solving precisely the reverse of the 'problem' they are trying to solve with an 'access:restriction' tag: they should be trying to build more complete indexes, not less complete ones.
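To make concrete what honoring the tag would actually involve, here is a minimal sketch in Python (the function name is my own invention, and the namespace URI is taken from my reading of the Bloglines proposal, so treat both as assumptions) of how a compliant aggregator could check a feed before indexing it:

```python
import xml.etree.ElementTree as ET

# Namespace URI from the Bloglines Feed Access Control proposal (assumed)
FAC_NS = "{http://www.bloglines.com/about/specs/fac-1.0}"

def may_redistribute(feed_xml: str) -> bool:
    """Return False if the feed carries an access:restriction element
    with relationship='deny'; True otherwise (the proposed default)."""
    root = ET.fromstring(feed_xml)
    for el in root.iter(FAC_NS + "restriction"):
        if el.get("relationship", "allow").lower() == "deny":
            return False
    return True

feed = """<rss version="2.0"
  xmlns:access="http://www.bloglines.com/about/specs/fac-1.0">
  <channel>
    <title>Example</title>
    <access:restriction relationship="deny"/>
  </channel>
</rss>"""

print(may_redistribute(feed))  # a compliant engine would skip this feed
```

Of course, nothing forces a non-participating engine to run a check like this, which is the crux of point 1 above.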
Overall, this syndicated content search space is broken. The priorities seem wrong here; I don't see this step getting us any closer to better services when there are other, much more fundamental issues that need solving.
Comments
PingBack from http://fuzzyblog.com/archives/2006/08/02/bloglines-search-proposal-tastes-great-less-filling/ - Anonymous
August 02, 2006
Lauren, the problem is not so much people spouting off anonymously as it is corporate sources not wanting their content indexed. This is not necessarily confidential material. Publishers have their own reasons for not wanting content indexed, and server side aggregators need to respect that. Currently, publishers have to make requests of individual aggregators, and this proposal would automate that process.
As far as robots.txt, I strongly believe that this is a misapplication of the robots.txt protocol. Simply polling a syndicated feed is NOT robotic behavior and imposing the robots.txt convention places undue burden on both sides of the wire. The issue here is not search engines, but aggregators, specifically server-based aggregators - Bloglines, NewsGator, Technorati, etc. - Anonymous
August 02, 2006
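[For contrast with the comment above, the robots.txt convention it rejects would look something like this; the paths are hypothetical, and whether the convention applies to feed polling at all is exactly the point in dispute:

```
# robots.txt at the site root
# Asks crawlers not to fetch the feed URLs; a feed poller that does
# not consider itself a 'robot' would never consult this file.
User-agent: *
Disallow: /feed/
Disallow: /index.xml
```
]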
typo: Marshall Kirkpatrick - Anonymous
August 02, 2006
I agree, there are lots of problems in the blogosphere, but if you look at this extension as a way of telling Bloglines that you don't want your RSS in Bloglines' search results, then it's a good fit. - Anonymous
August 02, 2006
http://dannyayers.com/2006/08/02/in-band-robots - Anonymous
August 02, 2006
"I can't think of why a program which periodically polls a feed wouldn't be considered a robot."
http://www.robotstxt.org/wc/faq.html#what
"A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced."
I don't think my conclusion that periodically polling a feed is NOT robotic behavior is entirely unjustified. - Anonymous
August 02, 2006
PingBack from http://kinrowan.net/blog/wp/archives/2006/08/02/alex-barnett-syndicated-search-and-feed-access-control - Anonymous
August 02, 2006
Man, where did that link on my pingback come from?? Oi! - Anonymous
August 02, 2006
cori - the pingbacks and trackback behaviours on this blogware are beyond my humble understanding... - Anonymous
August 02, 2006
I'm still going to have to disagree and refer you to the rest of the FAQ answer:
"Note that "recursive" here doesn't limit the definition to any specific traversal algorithm; even if a robot applies some heuristic to the selection and order of documents to visit and spaces out requests over a long space of time, it is still a robot."
Also:
Q: "What other kinds of robots are there?"
A: "..."What's New" monitoring..."
So an autonomous agent that semi-regularly requests a document from a webserver could definitely be categorized as a robot.
But then we're back where we started. If a search engine can scrape the blog site because robots.txt doesn't disallow it, then any information on that site can be found using a search engine. If the only goal is to block it from appearing in your RSS feed, then what's the point of 1) publishing it to the web in the first place and 2) publishing it to the RSS feed in the second place?
To publish anything to the web without any protections save some infinitely ignorable tag is an absurdity if your goal is to keep that information restricted. If security and privacy are what you want, you're better off using a password-protected blog than something publicly accessible. - Anonymous
August 02, 2006
Marshall Kirkpatrick - Anonymous
August 03, 2006
James - just waiting for you to notice ;-) - Anonymous
August 08, 2006
PingBack from http://savvis.dnska.com/~extralab/therssblog/?p=7 - Anonymous
August 08, 2007