fberriman.com

Schema-org, microformats and more science please

A normal conversation in the GovUK office (or any office I frequent) today went*: "Can we get some microformats on that page?", I suggest, as I spot a section of our site outputting a boat-load of addresses. "No problem - but what's this about schema-org?" "Yeah, yeah... we can hedge our bets and throw their mark-up in there too, it's just some extra itemprops. (flippant scoff) I'll send you a complete snippet example, because I'm just nice like that."

And that's what I did. And it looked like this:


<div class="vcard" itemscope itemtype="http://schema.org/Organization">
  <p class="org" itemprop="name">Department for Transport</p>
  <p class="adr" itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
    <span itemprop="streetAddress">
      <span class="extended-address">Great Minster House</span>
      <span class="street-address">76 Marsham Street</span>
    </span>
    <span class="locality" itemprop="addressLocality">London</span>
    <span class="postcode" itemprop="postalCode">SW1P 4DR</span>
  </p>
  <p>Telephone: <span class="tel" itemprop="telephone">0300 330 3000</span></p>
  <p>Website: <a href="http://www.dft.gov.uk" class="url" itemprop="url">www.dft.gov.uk</a></p>
  <p>Email: <a href="mailto:firstname.surname@dft.gsi.gov.uk" class="email" itemprop="email">firstname.surname@dft.gsi.gov.uk</a></p>
</div>

Holy massive code snippet, Batman. I was surprised by the size. I know, I can already feel people digging up links in attack and defence of "bloat" when using microformats alone, but seriously guys, IT'S HUGE. I felt guilty saying "this is what you've gotta add to get this mark-up to mean something". Here's a more broken-down comparison.

Here's the address, raw, at just over a tweet's worth (167 chars):


Department for Transport
Great Minster House
76 Marsham Street
SW1P 4DR
Telephone: 0300 330 3000
Website: http://www.dft.gov.uk
Email: firstname.surname@dft.gsi.gov.uk

Here's the address with the elements on it to get at the separate pieces of the address, bringing us up to 356:


<p>Department for Transport</p>
<p>
  <span>Great Minster House</span>
  <span>76 Marsham Street</span>
  <span>London</span>
  <span>SW1P 4DR</span>
</p>

<p>Telephone: 0300 330 3000</p>
<p>Website: <a href="http://www.dft.gov.uk">www.dft.gov.uk</a></p>
<p>Email: <a href="mailto:firstname.surname@dft.gsi.gov.uk">firstname.surname@dft.gsi.gov.uk</a></p>

Now let's throw some classes on to those and get a bit of meaning in there (I mean, you may want to style them up, get things on new lines, etc., so using the microformat classes is handy for that alone**). We've got a vCard, people! (565):


<div class="vcard">
  <p class="org">Department for Transport</p>
  <p class="adr">
    <span class="extended-address">Great Minster House</span>
    <span class="street-address">76 Marsham Street</span>
    <span class="locality">London</span>
    <span class="postcode">SW1P 4DR</span>
  </p>
  <p>Telephone: <span class="tel">0300 330 3000</span></p>
  <p>Website: <a href="http://www.dft.gov.uk" class="url">www.dft.gov.uk</a></p>
  <p>Email: <a href="mailto:firstname.surname@dft.gsi.gov.uk" class="email">firstname.surname@dft.gsi.gov.uk</a></p>
</div>

And now let's make it schema-org friendly using microdata (863):


<div class="vcard" itemscope itemtype="http://schema.org/Organization">
  <p class="org" itemprop="name">Department for Transport</p>
  <p class="adr" itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
    <span itemprop="streetAddress">
      <span class="extended-address">Great Minster House</span>
      <span class="street-address">76 Marsham Street</span>
    </span>
    <span class="locality" itemprop="addressLocality">London</span>
    <span class="postcode" itemprop="postalCode">SW1P 4DR</span>
  </p>
  <p>Telephone: <span class="tel" itemprop="telephone">0300 330 3000</span></p>
  <p>Website: <a href="http://www.dft.gov.uk" class="url" itemprop="url">www.dft.gov.uk</a></p>
  <p>Email: <a href="mailto:firstname.surname@dft.gsi.gov.uk" class="email" itemprop="email">firstname.surname@dft.gsi.gov.uk</a></p>
</div>

And we're done. All I wanted to do was say "this, dear Computer, is an address". Just getting some frankly useless out-of-the-box HTML elements on the raw data more than doubles its size (167 to 356), then we more than double it again (356 to 863) to actually make it useful.

Now, I know size isn't everything, and this is a pedantic, slightly silly, and probably less than accurate example. We're not crazy obsessed with keeping our pages below a certain size anymore (Ah... I remember back when the BBC S&Gs insisted that every page had to be less than 200k down the wire including script and CSS AND images. Those were the days.), but it's not something to be sniffed at either. Particularly with mark-up. Increased size probably suggests increased complexity - more work for everyone, more chance of someone bungling the order or nesting, more simply "I can't be bothered". Colour me dubious. I just want to highlight how much we add on to HTML to make it actually do what we need.

Itemscope and itemtype, a brief diversion

I had one of those "Am I crazy, but why are there two properties on these things?" moments. When would you ever use one without the other? The spec says you can use itemscope alone, but without itemtype, it's a bit meaningless. I think I'd do away with itemscope and keep only itemtype, with a value that's either a URI or something meaningful to the internal vocabulary. itemscope seems to exist solely to say "the things inside me are related", but that's already implied by the very nature of it being the parent of those items. A meaningful class name (say, hcard), or just the itemtype with a useful value, would make it explicit to data consumers.

This isn't sarcasm: I would gratefully receive an explanation as to why there are two attributes instead of one.
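To make the question concrete, here's a rough sketch of the two cases as I understand the microdata spec. The untyped property names are only there for illustration; without an itemtype they aren't tied to any vocabulary.

```html
<!-- itemscope alone: a valid microdata item, but untyped.
     The property names here are illustrative and belong to no vocabulary. -->
<div itemscope>
  <span itemprop="name">Department for Transport</span>
  <span itemprop="telephone">0300 330 3000</span>
</div>

<!-- itemscope plus itemtype: the same item, typed against schema.org -->
<div itemscope itemtype="http://schema.org/Organization">
  <span itemprop="name">Department for Transport</span>
  <span itemprop="telephone">0300 330 3000</span>
</div>
```

The reverse isn't allowed: the spec says itemtype must not appear on an element that doesn't also have itemscope, so the only optional half of the pair is the one that carries the meaning.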

Back in the room: Is this seriously what we expect authors to do?

I think I'm still struggling to understand why microdata is a separate specification (or even exists if it's not being used as a mechanism to get stuff into HTML long-term). You can achieve exactly this richness with the current attributes supplied in HTML, and I don't even mean just the microformats class way. The data- attribute is pretty handy, though, and seems ripe for stuffing with machine data (why shouldn't it take a URI if you really need it?).
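As a thought experiment only, this is roughly what the same information could look like hung off data- attributes. The attribute names data-type and data-prop are made up for this sketch; they're not part of any specification and nothing consumes them today.

```html
<!-- Hypothetical sketch: data-type and data-prop are invented names,
     not defined by any spec and not read by any existing consumer. -->
<div data-type="http://schema.org/Organization">
  <p data-prop="name">Department for Transport</p>
  <p>Telephone: <span data-prop="telephone">0300 330 3000</span></p>
</div>
```

One caveat worth knowing: the HTML spec describes data- attributes as private to the page's own scripts rather than something for outside software to read, so a pattern like this would need agreement from consumers before it meant anything.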

But I digress.

Microdata with schema-org is solving a problem we've already solved in microformats, but in an equally not-quite-there way (having to specify itemtype with a URI more than once in a page for items that are the same, but not within the same parent, feels filthy, for example). They are just as bad as each other, in slightly varying ways. Useful for proving a point, allowing growth and putting out examples (not that all of these bonuses are currently being made the best of), but crappy if this is all we can muster for the long-term, high-volume, regularly published, data representation patterns in HTML. We're asking authors to jump through hoops still for things they shouldn't have to.
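Here's a small sketch of the repetition I mean, assuming two organisations that appear in different parts of a page. The second organisation and the paragraph in between are placeholders I've invented for the example.

```html
<!-- Each item restates itemscope and the full itemtype URI,
     because the two don't share a parent item. -->
<div itemscope itemtype="http://schema.org/Organization">
  <p itemprop="name">Department for Transport</p>
</div>

<p>Some unrelated content in between.</p>

<!-- Placeholder organisation, here only to show the repeated URI -->
<div itemscope itemtype="http://schema.org/Organization">
  <p itemprop="name">Another Department (placeholder)</p>
</div>
```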

Microformats, schema-org, whatever... is this really our game plan now? Just keep throwing ever more bloat into already creaking elements when you just want to do something really common? What's the strategy for getting this stuff out of this mess and into the language?

You might be asking why bother aiming to get those stronger patterns into HTML if this mechanism basically works for getting a machine to figure out what the hell you're trying to say. But if that's the case, you may as well ask why you have any semantically meaningful elements in HTML at all. HTML5 is redefining some elements to have better semantic meaning because HTML is the language of authors, and to authors and consumers, meaning matters.

Without a plan for gathering evidence of popularly used patterns, whether directly from microformats or microdata (using them as formal methods of research, testing and development) or from what people in general are doing (actual, real developers - not just the big search engines), we'll end up with no progress or the wrong progress in HTML. I believe a formal process for how and when this happens should be defined: what constitutes a critical mass for a common pattern, how the information should be gathered, how patterns will be proposed formally in the WG and promoted into the language proper, and so on.

I want evidence-based HTML that will evolve using clearly defined mechanisms.

*Conversation shortened and rewritten with artistic licence and possibly some inaccuracies (many; "nice" may be a stretch).

**Yes, I'm casually suggesting that microformats are "free" if all you want to do is get your stuff out there with the minimum you'll need to be machine-friendly and human-eyes-pretty.