Several people have asked me recently how they can extract useful “readership statistics” for content which they are making available via RSS, ATOM and the like.
It’s a thorny question – there are many challenges, dead ends and false starts, and I don’t have the time at the moment to analyse in depth what not to do and why.
Let’s just say “web bugs are not guaranteed, and neither is JavaScript; also people seem to hate click-throughs and partial feeds, and GoogleReader (which has the lion’s share of FeedReading and is only likely to grow) caches pretty much everything, so hundreds of people could hide behind a single URL retrieval.”
So here is my solution and my process. It works for me. If you don’t like it, please leave a comment.
- Make a list of the RSS or ATOM URLs in which you are interested; in the case of this example, we shall be interested in only one:
http://www.crypticide.com/dropsafe/index.rss
- From the HTTP server logs, obtain logs of all retrievals of the target URLs; note that we can be very specific about which URLs we’re logging, so this is a vastly reduced amount of data, much less than the whole corpus of logs. Note that each record contains a timestamp.
- Reduce that data once again, extracting only those retrievals which are sourced from GoogleReader, NewsGator or Bloglines; I am told that these three popular Blog Aggregators / Readers comprise more than 90% of the FeedReader market, and they provide their respective numbers of subscribers-per-feed as part of the “User-Agent” data, thusly:
Raw Apache Log Data:
65.214.44.29 - - [15/Mar/2007:08:33:31 -0400] "GET /dropsafe/index.rss HTTP/1.1" 200 77762 "-" "Bloglines/3.1 (http://www.bloglines.com; 37 subscribers)"
72.14.199.65 - - [15/Mar/2007:08:44:28 -0400] "GET /dropsafe/index.rss HTTP/1.1" 200 77762 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 43 subscribers; feed-id=9998166800354916924)"
38.102.128.140 - - [15/Mar/2007:09:11:28 -0400] "GET /dropsafe/index.rss HTTP/1.1" 200 77762 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1; Rojo 1.0; http://www.rojo.com/corporate/help/agg/; Aggregating on behalf of 1 subscriber(s) online at http://www.rojo.com/?feed-id=243732) Gecko/20021130"
64.78.155.100 - - [15/Mar/2007:09:16:42 -0400] "GET /dropsafe/index.rss HTTP/1.1" 200 12084 "-" "NewsGatorOnline/2.0 (http://www.newsgator.com; 8 subscribers)"
65.214.44.29 - - [15/Mar/2007:09:38:30 -0400] "GET /dropsafe/index.rss HTTP/1.1" 200 77762 "-" "Bloglines/3.1 (http://www.bloglines.com; 38 subscribers)"
72.14.199.65 - - [15/Mar/2007:09:44:29 -0400] "GET /dropsafe/index.rss HTTP/1.1" 200 77762 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 43 subscribers; feed-id=9998166800354916924)"
65.214.44.29 - - [15/Mar/2007:10:10:06 -0400] "GET /dropsafe/index.rss HTTP/1.1" 200 77762 "-" "Bloglines/3.1 (http://www.bloglines.com; 38 subscribers)"
65.214.44.29 - - [15/Mar/2007:10:38:44 -0400] "GET /dropsafe/index.rss HTTP/1.1" 200 77762 "-" "Bloglines/3.1 (http://www.bloglines.com; 38 subscribers)"
72.14.199.65 - - [15/Mar/2007:10:44:30 -0400] "GET /dropsafe/index.rss HTTP/1.1" 200 77762 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 43 subscribers; feed-id=9998166800354916924)"
…Rojo is included in the above, also; just because I could.
Note that some fetches are performed at hourly intervals, some at half-hourly, some daily, and so forth; this reinforces that each feedname’s statistics must be treated / graphed independently.
- Process this data and extract timestamp, feedname and subscriber count. Graph the number of readers per feedname against time. This will give you several trend lines.
- If the number of readers is generally rising against time, then you are doing something right. If the number is flat, you are not growing your readership, probably a bad thing. If the number is decreasing, you are doing something very wrong indeed.
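For what it’s worth, here is a minimal sketch of the filtering and extraction steps in Python. It is not the script I actually use, just an illustration; it assumes raw Apache “combined” format logs (with plain ASCII quotes and hyphens), and the names parse_line, collect_series and AGGREGATORS are made up for this example.

```python
import re
from datetime import datetime

# Apache "combined" format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"$'
)

# Matches "37 subscribers" (and would also match Rojo's "1 subscriber(s)").
SUBS_RE = re.compile(r'(\d+) subscriber')

# Substring of the User-Agent -> label for that aggregator's trend line.
# These labels are an assumption of this sketch, not anything official.
AGGREGATORS = {
    "Feedfetcher-Google": "GoogleReader",
    "Bloglines": "Bloglines",
    "NewsGatorOnline": "NewsGator",
}


def parse_line(line, feed_path="/dropsafe/index.rss"):
    """Return (timestamp, aggregator, subscribers), or None if not of interest."""
    m = LOG_RE.match(line.strip())
    if m is None or m.group("path") != feed_path:
        return None
    agent = m.group("agent")
    subs = SUBS_RE.search(agent)
    if subs is None:
        return None
    for needle, label in AGGREGATORS.items():
        if needle in agent:
            # %b expects English month abbreviations (the default C locale).
            when = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z")
            return when, label, int(subs.group(1))
    return None


def collect_series(lines, feed_path="/dropsafe/index.rss"):
    """Group (timestamp, subscribers) points by aggregator for one feed."""
    series = {}
    for line in lines:
        record = parse_line(line, feed_path)
        if record is not None:
            when, label, count = record
            series.setdefault(label, []).append((when, count))
    return series
```

Feed it the lines of a log file (for instance collect_series(open("access.log")), where the filename is hypothetical) and you get one list of (timestamp, subscriber count) points per aggregator, each of which can be graphed as its own trend line with whatever plotting tool you prefer.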
Two of the key points to remember are the elective nature of these statistics, and that they are per-feed based; by such means you will measure the value and “interestingness” of the feeds as a whole.
Regarding the elective nature: what you are getting here are the numbers of people who have chosen to read your feed via their preferred feed reader. They want to subscribe, they have subscribed, and it’s likely (in the nature of feed-readers) that the statistics will reflect individual people, rather than groups or teams.
This is probably what you actually want to know, when you think about it.
Regarding the per-feed nature: some people – particularly those who write stuff – might want to know how well individual articles within a feed have been received; this is fruitless and not measurable via any RSS mechanism. My usual analogy for this is:
Overall, do people subscribe to “Playboy” because of one or two article headlines that they have chanced to read, or is it because they are more interested in the general theme of the content?
I reckon that the latter is more likely, and further that said observation indicates the way towards better communication:
Do away with click-thru postings (“Click here to read the rest of the article…“) and instead put the real, unexpurgated posting into the feed, and then you should track the rate of growth of feed popularity. Deal in feeds and communication, not in high-school essays. If you want to know how interesting a particular article is, go count the page retrievals from the logs, but remember that you’ll need to compensate for all the people who read the whole thing (right?) via RSS.
In short: coercing people into having to “click-thru” is a barrier to communication, one imposed by personal shortsightedness and easily removed by not being shortsighted.
So, none of this will let you say precisely how many people read your blog, but by this technique you will be able to make a plausible argument for your being (say) 50% more popular now than you were six months ago.
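The arithmetic behind a claim like that is simple enough; as a rough sketch (the function name percent_change and the subscriber figures below are hypothetical, chosen only to illustrate the calculation):

```python
def percent_change(earlier, later):
    """Percentage change in subscriber count from `earlier` to `later`."""
    if earlier <= 0:
        raise ValueError("earlier count must be positive")
    return (later - earlier) / earlier * 100.0


# Hypothetical figures: 40 subscribers six months ago, 60 today.
growth = percent_change(40, 60)
print(f"{growth:+.0f}% over six months")  # prints "+50% over six months"
```

Compare like with like: take both counts from the same aggregator’s User-Agent string for the same feed, since each aggregator reports only its own subscribers.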