Recently (since I hit database quota limits), I've been proactively managing the logs I end up collecting on this site. One thing that caught my eye was the range of user agent strings. One look at them made it clear there is no end to the number of possibilities. Here is just a small sampling of what I saw (I am looking at just page logs for now, and ignoring the various RSS readers hitting my feed):
The browsers: The expected ones, but even here there are literally hundreds of variants, with different OS identifiers and with browser extensions and CLR versions all tacked on to the end.
Mozilla/5.0 (Windows; U; Windows NT 5.0; ca-AD; rv:1.7.8) Gecko/20050511 Firefox/1.0.4
Mozilla/4.0 (compatible; MSIE 5.17; Mac_PowerPC)
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.3) Gecko/20030313
Opera/8.0 (Windows NT 5.0; U; en)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)
The search engines: Good to know I am being indexed and archived by both the well-known and unknown search engines.
Googlebot/2.1 (+http://www.googlebot.com/bot.html)
msnbot/0.11 (+http://search.msn.com/msnbot.htm)
Overture-WebCrawler/3.8/Fresh (atw-crawler at fast dot no; http://fast.no/support/crawler.asp)
ia_archiver (presumably the Internet Archive)
Mozilla/5.0 (Slurp/cat; slurp@inktomi.com; http://www.inktomi.com/slurp.html)
http://www.almaden.ibm.com/cs/crawler (internal search engine at IBM?)
augurfind V-1.8 beta
Mobile as well: Cool, since I didn't think my site was usable in that form factor.
DoCoMo/1.0/N504i/c10/TB
UPG1 UP/4.0 (compatible; Blazer 1.0)
The surprising:
Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 OpenSSL/0.9.7d
And the totally esoteric ones: I wouldn't even know how to interpret these.
Scooter/3.3
HALO, the magical bot
Under the Rainbow 2.2
Rumours-Agent
No Browser/0.0 (this one clearly wins the award for most useless user agent string)
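To make the mess concrete, here is a rough sketch of the kind of substring sniffing it takes to sort strings like the ones above into buckets. The marker lists and the classify function are my own guesses based only on the samples in this post, not any standard or library, and they would certainly misfile plenty of real traffic.

```python
# Hypothetical substring markers, guessed from the samples in this post.
CRAWLER_MARKERS = ("bot", "crawler", "slurp", "ia_archiver", "augurfind")
MOBILE_MARKERS = ("docomo", "blazer")
BROWSER_MARKERS = ("firefox", "msie", "gecko", "opera", "lynx")


def classify(user_agent):
    """Return a coarse bucket for a raw user agent string."""
    ua = user_agent.lower()
    # Crawlers are checked first because some of them also claim "Mozilla".
    if any(marker in ua for marker in CRAWLER_MARKERS):
        return "search engine"
    if any(marker in ua for marker in MOBILE_MARKERS):
        return "mobile"
    if any(marker in ua for marker in BROWSER_MARKERS):
        return "browser"
    return "unknown"


if __name__ == "__main__":
    for sample in (
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)",
        "msnbot/0.11 (+http://search.msn.com/msnbot.htm)",
        "DoCoMo/1.0/N504i/c10/TB",
        "Rumours-Agent",
    ):
        print(classify(sample), "<-", sample)
```

Every new agent means another marker, and the order of the checks matters, which is exactly the kind of guesswork I would like to avoid.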
While these are undoubtedly fun to look at (once in a while), and perhaps amusing (for some time), the developer in me wants to scream. What a mess! Literally... Is there a user-agent format standard that might bring some sanity? If not, I'd love to go ahead and propose one. I don't know if anything will come of it. Either way, how about the following:
<device>, <product name & version>, <OS name & version> (extension1[=value1]; extension2[=value2]; ... extensionN)
Where device could be "Browser", "Indexer", "NewsReader" etc. Does this miss any key piece of information that should be included? I know this post might as well be categorized into the "rant bucket", but seriously, what do people think?
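As a minimal sketch of how little code a format like this would need on the server, here is one way it might be parsed. Everything here is an assumption for illustration: the UserAgent class, the parse_user_agent function, the regular expression, and the sample strings are all hypothetical, and the proposal itself does not say how the product and OS versions are delimited, so they are kept as raw text.

```python
import re
from dataclasses import dataclass, field

# Hypothetical pattern for the proposed layout:
#   <device>, <product>, <os> (ext1[=val1]; ext2[=val2]; ...)
# The parenthesized extension list is treated as optional.
_PROPOSED = re.compile(
    r"^\s*(?P<device>[^,()]+),"
    r"\s*(?P<product>[^,()]+),"
    r"\s*(?P<os>[^,()]+?)\s*"
    r"(?:\((?P<ext>[^()]*)\))?\s*$"
)


@dataclass
class UserAgent:
    device: str
    product: str
    os: str
    extensions: dict = field(default_factory=dict)


def parse_user_agent(value):
    """Parse a string in the proposed format, or return None if it does not fit."""
    match = _PROPOSED.match(value)
    if match is None:
        return None
    extensions = {}
    for item in (match.group("ext") or "").split(";"):
        item = item.strip()
        if not item:
            continue
        # An extension without "=value" is stored with a value of None.
        name, sep, val = item.partition("=")
        extensions[name.strip()] = val.strip() if sep else None
    return UserAgent(
        device=match.group("device").strip(),
        product=match.group("product").strip(),
        os=match.group("os").strip(),
        extensions=extensions,
    )


if __name__ == "__main__":
    # Made-up sample in the proposed format, not a real log entry.
    print(parse_user_agent("Browser, Firefox 1.0.4, Windows NT 5.0 (lang=en-US; U)"))
    # Today's strings do not fit the layout, so they come back as None.
    print(parse_user_agent("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"))
```

With the device in a fixed first position, telling a browser from an indexer becomes a field lookup rather than a pile of substring checks.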
Having a standardized format would only help anyone using these strings to do something intelligent on the server end. For example, I just hardcoded my .aspx pages as Uplevel, so I don't end up sending an ugly HTML 3.2 rendering simply because the user agent string was off the charts. I know a simpler standard format for this information would have greatly simplified the browser capabilities detection logic in ASP.NET.
Posted on Friday, 5/27/2005 @ 4:49 PM