The insanity around user agent strings

The hundreds (if not thousands) of user agent variations point to insanity, in my opinion. Here's a sampling of some of them, and a proposal for a standardized format, even if it just amounts to interesting reading...

Recently (since I hit database quota limits), I've been proactively managing the logs I end up collecting on this site. One thing that caught my eye was the range of user agent strings. One look at it and it became clear there was no end to the number of possibilities. Here is just a small sampling of what I saw (I am looking at just page logs for now, and ignoring the various RSS readers hitting my feed):

The browsers: The expected ones, but even here there are literally hundreds of variants, with different OS identifiers, browser extensions, and CLR versions all tacked on to the end.
Mozilla/5.0 (Windows; U; Windows NT 5.0; ca-AD; rv:1.7.8) Gecko/20050511 Firefox/1.0.4
Mozilla/4.0 (compatible; MSIE 5.17; Mac_PowerPC)
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.3) Gecko/20030313
Opera/8.0 (Windows NT 5.0; U; en)
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)

The search engines: Good to know I am being indexed and archived by both the well-known and unknown search engines.
Googlebot/2.1 (+http://www.googlebot.com/bot.html)
msnbot/0.11 (+http://search.msn.com/msnbot.htm)
Overture-WebCrawler/3.8/Fresh (atw-crawler at fast dot no; http://fast.no/support/crawler.asp)
ia_archiver (presumably the Internet Archive)
Mozilla/5.0 (Slurp/cat; slurp@inktomi.com; http://www.inktomi.com/slurp.html)
http://www.almaden.ibm.com/cs/crawler (internal search engine at IBM?)
augurfind V-1.8 beta

Mobile as well: Cool, since I didn't think my site was usable in that form factor.
DoCoMo/1.0/N504i/c10/TB
UPG1 UP/4.0 (compatible; Blazer 1.0)

The surprising:
Lynx/2.8.5rel.1 libwww-FM/2.14 SSL-MM/1.4.1 OpenSSL/0.9.7d

And the totally esoteric ones: I wouldn't even know how to interpret these.
Scooter/3.3
HALO, the magical bot
Under the Rainbow 2.2
Rumours-Agent
No Browser/0.0 (this one clearly wins the award for most useless user agent string)

While these are undoubtedly fun to look at (once in a while), and perhaps amusing (for some time), the developer in me wants to scream. What a mess! Literally... Is there a user-agent format standard that might bring some sanity? If not, I'd love to go ahead and propose one. I don't know if anything will come of it. Either way, how about the following:

<device>, <product name & version>, <OS name & version> (extension1[=value1]; extension2[=value2]; ... extensionN)

Where device could be "Browser", "Indexer", "NewsReader" etc. Does this miss any key piece of information that should be included? I know this post might as well be categorized into the "rant bucket", but seriously, what do people think?
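To make the proposal a little more concrete, here is a minimal Python sketch of what parsing such a string might look like. Everything in it is an assumption for illustration: the names `ClientInfo` and `parse_client`, the sample value, and the rule that the version is whatever follows the last space in a field. None of it comes from an existing standard.

```python
import re
from typing import NamedTuple, Optional


class ClientInfo(NamedTuple):
    """Hypothetical container for the fields of the proposed format."""
    device: str
    product: str
    product_version: str
    os_name: str
    os_version: str
    extensions: dict


# Three required comma-separated fields, then an optional
# parenthesized, semicolon-separated extension list.
_PATTERN = re.compile(
    r"^\s*([^,()]+),\s*([^,()]+),\s*([^,()]+?)\s*(?:\((.*)\))?\s*$"
)


def _split_name_version(field: str):
    # Assumption: the version is the text after the last space.
    name, _, version = field.strip().rpartition(" ")
    if not name:
        # No space at all: treat the whole field as the name.
        return version, ""
    return name, version


def parse_client(value: str) -> Optional[ClientInfo]:
    """Return the parsed fields, or None if the string does not fit."""
    match = _PATTERN.match(value)
    if match is None:
        return None
    device, product, os_field, ext = match.groups()

    extensions = {}
    for item in (ext or "").split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, val = item.partition("=")
        # Extensions without "=value" are stored with a value of None.
        extensions[key.strip()] = val.strip() or None

    name, version = _split_name_version(product)
    os_name, os_version = _split_name_version(os_field)
    return ClientInfo(device.strip(), name, version,
                      os_name, os_version, extensions)


if __name__ == "__main__":
    # Made-up sample value in the proposed format.
    print(parse_client("Browser, Firefox 1.0.4, Windows NT 5.0 (rv=1.7.8; U)"))
```

A string in today's style, such as the MSIE one listed earlier, has no comma-separated fields, so this sketch simply returns None for it. The server can check the device field first and skip the rest when it is not interested.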

Having a standardized format would only help anyone using these strings to do something intelligent on the server end. For example, I just hardcoded my .aspx pages as Uplevel, so I don't end up sending an ugly HTML 3.2 rendering simply because the user agent string was off the chart. I know a simpler standard format for this information would have greatly simplified the browser capabilities detection logic in ASP.NET.

Posted on Friday, 5/27/2005 @ 4:49 PM


Comments

11 comments have been posted.

Milan Negovan

Posted on 5/27/2005 @ 6:13 PM
Sounds like a good user agent string format. Let's lobby Dave Massy to cut over and start using it. Others would have to follow suit. :)

Eric Law

Posted on 5/28/2005 @ 12:31 AM
Actually, I am posting on behalf of Eric, PM from the IE team - who mailed me his comments while my comments were down... - Nikhil

Your proposal has merit, although it may get tricky to classify "Device" in some cases-- E.g. if I'm using an IE-based newsreader, is that a "NewsReader" or a "Browser", and so on. You'd also probably want to get more prescriptive about how OS & version are specified, lest we end up in the same string soup we have today. Also, we'd likely want to identify the architecture for cases where it matters (e.g. is this Windows2003 on x86 or Windowsx64 on Athlon64?). Lastly, it would be worthwhile to extend the DOM's window.navigator object with accessors for this information, to prevent gnarly string parsing code.

Identification of the browser itself is only half the battle, of course-- the other big challenge is in identifying the browser's capabilities. If ASPNET doesn't know the capabilities of the browser, identification alone isn't terribly useful. ASPNET does include a good range of capabilities information by default, but even then, there's quite a bit that it simply can't tell. It will tell you, for instance, that IE6 supports ActiveX controls, but it has no way of knowing whether or not the user has disabled them, or if they have permission to install new controls.

As you might imagine, backward-compatibility is the biggest obstacle; you have to balance the benefits you identify (easier identification) against the compatibility impact you incur. In the case of the IE7 UA string, ASPNET requires no code change to support the new UA. But nevertheless, a great many websites are coded in such a way that they don't recognize the UA, despite the fact that it's almost identical to the UA of its predecessor. You might try the proposed change using Fiddler (http://blogs.msdn.com/ie/archive/2005/04/27/412813.aspx) and see how much stuff breaks. Even if Firefox 1.1, IE7, and Opera 9 all standardized in this manner, it would still be quite a few years before web authors could safely remove all the old browser detection logic.

Nikhil Kothari

Posted on 5/28/2005 @ 12:34 AM
Eric, yes, I agree back-compat does constrain the scope of what is possible. It's almost like you'd really need to introduce a new header like HTTP_CLIENT or some such thing, while HTTP_USER_AGENT is phased out. The nice thing about ASP.NET is we can first look for the new header, and then fall back on the old one, so most apps are auto-upgraded without having to switch code...
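The "new header first, old header second" lookup could be sketched roughly as follows. This is Python rather than ASP.NET, and it assumes a CGI/WSGI-style dictionary of server variables. HTTP_CLIENT is still the made-up header name from above, and `describe_client` is a hypothetical helper, not an existing API.

```python
def describe_client(environ):
    """Return (source, value), preferring the hypothetical new header.

    `environ` is assumed to be a CGI/WSGI-style dict of server variables.
    """
    # Look for the proposed header first.
    new_value = environ.get("HTTP_CLIENT", "").strip()
    if new_value:
        return ("client", new_value)
    # Otherwise fall back on the classic user agent string.
    return ("user-agent", environ.get("HTTP_USER_AGENT", "").strip())


if __name__ == "__main__":
    # An old-style request that only carries a user agent string.
    print(describe_client({"HTTP_USER_AGENT": "Opera/8.0 (Windows NT 5.0; U; en)"}))
```

Returning the source alongside the value lets the caller pick the matching parser without guessing which format it was handed.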

Hermann Klinke

Posted on 5/28/2005 @ 2:15 AM
I like your idea. But why don't we even make it more explicit? For example: device=Browser, ProductName=Internet Explorer, ProductVersion=6.0, OperatingSystemName=Windows, OperatingSystemVersion=XP Professional, ...; you get the idea. That way we could ignore the order and it would be possible to omit information without breaking the format.
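A rough Python sketch of reading such a name=value list, assuming pairs are separated by commas and that field names are matched case-insensitively (both are assumptions, as is the helper name `parse_pairs`):

```python
def parse_pairs(value):
    """Parse 'Name=Value, Name=Value, ...' into a dict keyed by lowercase name."""
    fields = {}
    for part in value.split(","):
        key, sep, val = part.partition("=")
        # Skip fragments that are not name=value pairs.
        if not sep or not key.strip():
            continue
        fields[key.strip().lower()] = val.strip()
    return fields


if __name__ == "__main__":
    sample = ("device=Browser, ProductName=Internet Explorer, "
              "ProductVersion=6.0, OperatingSystemName=Windows, "
              "OperatingSystemVersion=XP Professional")
    print(parse_pairs(sample))
```

Because lookups go by name, the pairs can arrive in any order and missing ones simply do not appear in the result. A value that itself contains a comma would break this simple split, so a real format would need an escaping rule.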

Nikhil Kothari

Posted on 5/28/2005 @ 9:23 AM
One of the primary motivations is to make some information essential. Making those bits optional will essentially result in the same uninterpretable user agents we have today. It's also nice to see "device" first, and potentially be able to ignore the rest if it's a particular device, thereby saving some processing cycles.

Phil

Posted on 5/28/2005 @ 11:25 AM
Haha, nice try --- exact taxonomies, hierarchies, rigid standards ... doesn't it surprise the "developer in us" how often these kinds of things fail in the real world? I bet there is a recommendation for UA strings somewhere over at w3c, it's just being ignored. THAT's the merit of the web. HTML "messy"? Maybe, but it was never designed for easy processing, but for ease of use.

NoBrowser 0.0 will never conform to your idea. Should it therefore be discriminated and not be served any content? No. Should we therefore autofallback with our smart web engines to HTML 3.2 to serve nb 0.0? Maybe (think about the implications)...

Nikhil Kothari

Posted on 5/29/2005 @ 1:02 AM
W3C actually has a HTTP protocol spec (http://www.w3.org/Protocols/HTTP/HTTP2.html). The section on user agents simply says it should be <product>[/<version>] followed by individual words. Pretty loose...

'NoBrowser 0.0' actually conforms to the existing spec, though it's useless from the perspective of any server-side logic. Also the IE ones could be claimed as incorrect, because they start with 'Mozilla/4.0' instead of, say, 'MSIE', the true product name. Of course that choice was probably driven by the fact that sites may have been simply looking for Mozilla given the prevalence of Netscape back in those early Web days - again a function of the lack of useful semantics for this string, and of how the world is indeed "messy".
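To show how little that loose format gives a server to work with, here is a small Python sketch that reads a string as <product>[/<version>] followed by whitespace-separated words. The names `LooseAgent` and `parse_loose` are made up for illustration, and this follows only my summary of the spec above, not its exact grammar.

```python
from typing import List, NamedTuple, Optional


class LooseAgent(NamedTuple):
    product: str
    version: Optional[str]
    words: List[str]


def parse_loose(user_agent: str) -> Optional[LooseAgent]:
    """Split a user agent string into product, optional version, and the rest."""
    tokens = user_agent.split()
    if not tokens:
        return None
    # The first token is the product, optionally followed by "/version".
    product, slash, version = tokens[0].partition("/")
    return LooseAgent(product, version if slash else None, tokens[1:])


if __name__ == "__main__":
    agent = parse_loose("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)")
    print(agent.product, agent.version)  # Mozilla 4.0
    print(agent.words)
```

For the IE string, the product comes back as Mozilla and the real browser name is buried somewhere in the trailing words, which is exactly the problem.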

Jame Curran

Posted on 5/31/2005 @ 9:07 AM
>> Of course that choice was probably driven by the fact that sites may have been simply looking for Mozilla given the prevalence of Netscape back in those early Web days <<

I've been told that's because some early web scripts (PHP or Perl, I guess) used "Mozilla" in the UA to decide whether or not to use frames in the HTML.

Aaron Brown

Posted on 5/31/2005 @ 10:37 AM
FWIW, Scooter is AltaVista's spider.

Philip Chalmers

Posted on 11/17/2005 @ 4:33 AM
I like the suggestion of a new HTTP header to be sent by user agents which conform to any new standard.

Developers need to know the user agent's capabilities rather than its brand name, for example:
(a) Engine and version. E.g. several browsers use the Gecko engine and NN8 uses either Gecko or the IE engine (there's confusion about how much depends on user choice and how much is determined by the browser's opinion of the site). Would have to be very precise, e.g. does Opera use the same engine in PCs and mobile phones?
(b) Mono or colour screen (important for mobiles).
(c) Platform, e.g. PC or mobile. Many design / usability sites recommend a very different layout for PCs (including laptops) and for smaller devices.
(d) For accessibility enthusiasts, what assistive software is used, if any. Many accessibility sites recommend a mobile-like layout for users of assistive devices - main site nav at bottom, etc.
Compared with these things, I'm not sure browser brand name is important.

DrydenMaker

Posted on 6/22/2006 @ 2:58 PM
There was one point when I was branding browsers that I included the McAfee EICAR test string at the end of the user agent string. I smiled to myself, but never saw the effect. In the past, browsers identified themselves as other browsers so that they would not be blocked, or so that servers would give them the full version of the markup.

There is an interesting extension for Firefox called 'User Agent Switcher' that lets you tell the server you are whatever you want.

Being able to identify a browser is a double-edged sword. It may seem to help in some circumstances, but the bottom line is that we shouldn't have to care what browser is hitting our site. We should simply be able to blast standard xHTML/ECMA Script/CSS and be done with it.

Accessibility software does a good job of being standard. Mobile devices need to just deal with what they get. They will get there. Settop boxes have come a long way, and they will keep improving. I remember when web-tv couldn't browse to a personal website cause the character code they used for the tilde (~) was non-standard.

Progression is why all web developers need to be fanatic about using standard code and saying 'Screw <browser here> and its flaky <standard here> support.' They need to get it together.

Coding to a specific browser is like an international salesperson purposely picking up a heavy Cajun accent. That might work fine in the Deep South. What would happen when they try to go to Asia and sell something, or simply find a restroom? People who code only for IE or Firefox are gonna end up looking for a dark alley while they wet themselves.