A normal conversation in the GovUK (or any office I frequent) today went*: "Can we get some microformats on that page?", I suggest as I spot a section of our site outputting a boat-load of addresses. "No problem - but what's this about schema-org?". "Yeah, yeah.. we can hedge our bets and throw their mark-up in there too, it's just some extra itemprops. (Flippant scoff.) I'll send you a complete snippet example, because I'm just nice like that."
And that's what I did. And it looked like this:
<div class="vcard" itemscope itemtype="http://schema.org/Organization">
<p class="org" itemprop="name">Department for Transport</p>
<p class="adr" itemprop="address" itemscope
itemtype="http://schema.org/PostalAddress">
<span itemprop="streetAddress">
<span class="extended-address">Great Minster House</span>
<span class="street-address">76 Marsham Street</span>
</span>
<span class="locality" itemprop="addressLocality">London</span>
<span class="postcode" itemprop="postalCode">SW1P 4DR</span>
</p>
<p>Telephone: <span class="tel"
itemprop="telephone">0300 330 3000</span></p>
<p>Website: <a href="http://www.dft.gov.uk" class="url"
itemprop="url">www.dft.gov.uk</a></p>
<p>Email: <a
href="mailto:firstname.surname@dft.gsi.gov.uk"
class="email" itemprop="email">firstname.surname@dft.gsi.gov.uk</a></p>
</div>
Holy massive-code-snippet, Batman. I was surprised by the size. I know, I can feel people already digging up links attacking and defending the "bloat" of using microformats alone, but seriously guys, IT'S HUGE. I felt guilty saying "this is what you've gotta add to get this mark-up to mean something". Here's a more broken-down comparison:
Here's the address, raw, at just over a tweet's worth (167 chars):
Department for Transport
Great Minster House
76 Marsham Street
SW1P 4DR
Telephone: 0300 330 3000
Website: http://www.dft.gov.uk
Email: firstname.surname@dft.gsi.gov.uk
Here's the address with the elements on it to get at the separate pieces of the address, bringing us up to 356:
<p>Department for Transport</p>
<p>
<span>Great Minster House</span>
<span>76 Marsham Street</span>
<span>London</span>
<span>SW1P 4DR</span>
</p>
<p>Telephone: 0300 330 3000</p>
<p>Website: <a href="http://www.dft.gov.uk">www.dft.gov.uk</a></p>
<p>Email: <a href="mailto:firstname.surname@dft.gsi.gov.uk"
>firstname.surname@dft.gsi.gov.uk</a></p>
Now let's throw some classes on to those and get a bit of meaning in there (I mean, you may want to style them up, get things on new lines, etc., so the microformat classes are handy for that alone**). We've got a vCard, people! (565):
<div class="vcard">
<p class="org">Department for Transport</p>
<p class="adr">
<span class="extended-address">Great Minster House</span>
<span class="street-address">76 Marsham Street</span>
<span class="locality">London</span>
<span class="postcode">SW1P 4DR</span>
</p>
<p>Telephone: <span
class="tel">0300 330 3000</span></p>
<p>Website: <a href="http://www.dft.gov.uk"
class="url">www.dft.gov.uk</a></p>
<p>Email: <a
href="mailto:firstname.surname@dft.gsi.gov.uk"
class="email">firstname.surname@dft.gsi.gov.uk</a></p>
</div>
And now let's make it schema-org friendly using microdata (863):
<div class="vcard" itemscope itemtype="http://schema.org/Organization">
<p class="org" itemprop="name">Department for Transport</p>
<p class="adr" itemprop="address" itemscope
itemtype="http://schema.org/PostalAddress">
<span itemprop="streetAddress">
<span class="extended-address">Great Minster House</span>
<span class="street-address">76 Marsham Street</span>
</span>
<span class="locality" itemprop="addressLocality">London</span>
<span class="postcode" itemprop="postalCode">SW1P 4DR</span>
</p>
<p>Telephone: <span class="tel"
itemprop="telephone">0300 330 3000</span></p>
<p>Website: <a href="http://www.dft.gov.uk" class="url"
itemprop="url">www.dft.gov.uk</a></p>
<p>Email: <a
href="mailto:firstname.surname@dft.gsi.gov.uk"
class="email" itemprop="email">firstname.surname@dft.gsi.gov.uk</a></p>
</div>
And we're done. All I wanted to do was say "this, dear Computer, is an address". Just getting some frankly useless out-of-the-box HTML elements on the raw data more than doubles its size (167 to 356), then we more than double it again (356 to 863) to actually make it useful.
Now, I know size isn't everything, and this is a pedantic, slightly silly, and probably less than accurate example. We're not crazy obsessed with keeping our pages below a certain size anymore (Ah... I remember back when the BBC S&Gs insisted that every page had to be less than 200k down the wire including script and CSS AND images. Those were the days.), but it's not something to be sniffed at either. Particularly with mark-up. Increased size probably suggests increased complexity - more work for everyone, more chance of someone bungling the order or nesting, more simply "I can't be bothered". Colour me dubious. I just want to highlight how much we add on to HTML to make it actually do what we need.
Itemscope and itemtype, a brief diversion
I had one of those "Am I crazy, but why are there two properties on these things?" moments. When would you ever use one without the other? The spec says you can use itemscope alone, but without itemtype, it's a bit meaningless. I think I'd do away with itemscope and have itemtype only but with a value, either a URI or something meaningful to the internal vocabulary. itemscope seems to exist solely to say "the things inside me are related", but by the very nature of it being the parent of those items, that's already implied. A class name of something meaningful (say, hcard), or just the itemtype (with a useful value), would make it explicit to data consumers.
This isn't sarcasm: I would gratefully receive an explanation as to why there are two attributes instead of one.
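To make the two cases concrete, here's a rough sketch of my own, reusing the organisation name from the snippet above. The first block is an item with itemscope alone, which the spec allows but which names no vocabulary. The second adds itemtype, so a consumer that knows schema.org can tell what name means. (As far as I can tell from the spec, the reverse case, itemtype with no itemscope, isn't allowed.)

```html
<!-- Untyped item: itemscope groups the properties,
     but nothing says which vocabulary "name" belongs to -->
<div itemscope>
  <p itemprop="name">Department for Transport</p>
</div>

<!-- Typed item: itemtype ties "name" to the schema.org Organization type -->
<div itemscope itemtype="http://schema.org/Organization">
  <p itemprop="name">Department for Transport</p>
</div>
```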
Back in the room: Is this seriously what we expect authors to do?
I think I'm still struggling to understand why microdata is a separate specification (or even exists if it's not being used as a mechanism to get stuff into HTML long-term). You can achieve exactly this richness with the current attributes supplied in HTML, and I don't even mean just the microformats class way. The data- attribute is pretty handy, though, and seems ripe for stuffing with machine data (why shouldn't it take a URI if you really need it?).
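Purely as a thought experiment, the same address could be carried on data- attributes along these lines. This is a hypothetical sketch: data-type and data-prop are names I've made up for illustration, they belong to no published vocabulary, and no parser should be assumed to read them.

```html
<!-- Hypothetical: data-type and data-prop are invented attribute names,
     standing in for itemtype and itemprop -->
<div data-type="http://schema.org/Organization">
  <p data-prop="name">Department for Transport</p>
  <p data-type="http://schema.org/PostalAddress" data-prop="address">
    <span data-prop="streetAddress">76 Marsham Street</span>
    <span data-prop="addressLocality">London</span>
    <span data-prop="postalCode">SW1P 4DR</span>
  </p>
</div>
```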
But I digress.
Microdata with schema-org is solving a problem we've already solved in microformats, but in an equally not-quite-there way (having to specify itemtype with a URI more than once in a page for items that are the same, but not within the same parent, feels filthy, for example). They are just as bad as each other, in slightly varying ways. Useful for proving a point, allowing growth and putting out examples (not that all of these bonuses are currently being made the best of), but crappy if this is all we can muster for the long-term, high-volume, regularly published, data representation patterns in HTML. We're asking authors to jump through hoops still for things they shouldn't have to.
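The repeated itemtype gripe looks something like this in practice. It's a cut-down sketch reusing the address from earlier; where the two items sit on the page (main content and footer) is my assumption, for illustration only.

```html
<!-- First address, somewhere in the main content -->
<p itemscope itemtype="http://schema.org/PostalAddress">
  <span itemprop="addressLocality">London</span>
  <span itemprop="postalCode">SW1P 4DR</span>
</p>

<!-- Same kind of item elsewhere on the page (say, the footer).
     There's no shared parent, so the full type URI is spelled out again -->
<p itemscope itemtype="http://schema.org/PostalAddress">
  <span itemprop="addressLocality">London</span>
  <span itemprop="postalCode">SW1P 4DR</span>
</p>
```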
Microformats, schema-org, whatever... is this really our game plan now? Just keep throwing ever more bloat into already creaking elements when you just want to do something really common? What's the strategy for getting this stuff out of this mess and into the language?
You might be asking why bother aiming to get those stronger patterns into HTML, if this mechanism basically works for getting a machine to figure out what the hell you're trying to say, but you may as well be asking why you have any semantically meaningful elements in HTML at all if that's the case. HTML version 5 is redefining some elements to have better semantic meaning because HTML is the language of authors, and to authors and consumers meaning matters.
Without a plan for gathering evidence of popularly used patterns directly from microformats or microdata (and using them as formal methods of research, testing and development), or from what people (actual, real developers, not just the big search engines) are doing in general, we'll end up with no progress or the wrong progress in HTML. I believe a formal process for how and when this happens should be defined (i.e. what constitutes critical mass for a common pattern, how the information should be gathered, how patterns will be proposed formally in the WG and promoted into the language proper, etc.).
I want evidence-based HTML that will evolve using clearly defined mechanisms.
*Conversation shortened and re-written with an artistic license and possibly some (many; "nice" may be a stretch) inaccuracies.
**Yes, I'm casually suggesting that microformats are "free" if all you want to do is get your stuff out there with the minimum you'll need to be machine-friendly and human-eyes-pretty.
I was quoted a couple of weeks ago as saying, albeit in private, the following:
"HTML fails to be simple if it can't provide what authors regularly need and end up turning to other encodings" -- @phae
@slightlylate
For context, that was in response to a remark made by a friend that HTML fails if authors can't use it because it has become too complex and attempts to describe too much. My response was that it fails not because it's complicated, but when an author cannot express their content accurately with the toolkit they're supplied and have to go to another encoding to find what they're looking for. That's the language passing the buck, in my opinion.
Don't get me wrong - I'm not suggesting HTML should cover every niche semantic everyone is ever going to want to express ever. That would be crazy and confusing. HTML should express what is most commonly used, and at the moment it doesn't - which is why we still see microformats, microdata, component model, schema.org etc. trying to fill the gaps. And not just trying to fill the gaps, but trying to provide data on which decisions can be made about what should be in HTML.
HTML, and a platform that provides, should be the end goal. Microformats, et al., are the research grounds that should be directly contributing with the evidence and data they are able to garner. In fact, the most popular microformats, shown through demand and usage, should just be in HTML as a standard, by being provided for with semantically appropriate new elements.
We've seen this work. Microformats started doing things with dates, most specifically in hCalendar. It had a slightly kludgy way of marking up time, using abbr. The accessibility lot were rightfully less than impressed, and other patterns were tried: title and spans and all kinds of things. But in short, it was shown that time gets talked about a lot, and we needed something better. We got <time> in HTML. Hooray! The system works! Well, except when it doesn't. Go read Bruce Lawson's take, as the powers that be removed time and replaced it with data. Gee, thanks.
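For anyone who missed that saga, the before and after look roughly like this. The date and wording are made up for illustration; dtstart is hCalendar's start-date class.

```html
<!-- hCalendar's abbr pattern: the machine-readable date hides in title,
     which is the bit the accessibility lot objected to -->
<abbr class="dtstart" title="2011-11-10T09:00">9am on 10 November</abbr>

<!-- The time element: the same date in a dedicated datetime attribute -->
<time class="dtstart" datetime="2011-11-10T09:00">9am on 10 November</time>
```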
We shouldn't expect authors to go in search of richer mark-up from other sources when what they're trying to do is really common, when a need has been shown, and a pattern has been proven.