Schema-org, microformats and more science please

A normal conversation in the GovUK (or any office I frequent) today went*: “Can we get some microformats on that page?”, I suggest as I spot a section of our site outputting a boat-load of addresses. “No problem – but what’s this about schema-org?”. “Yeah, yeah.. we can hedge our bets and throw their mark-up in there too, it’s just some extra itemprops. *flippant scoff* I’ll send you a complete snippet example, because I’m just nice like that.”

And that’s what I did. And it looked like this:

<div class="vcard" itemscope itemtype="">
  <p class="org" itemprop="name">Department for Transport</p>
  <p class="adr" itemprop="address" itemscope itemtype="">
    <span itemprop="streetAddress">
      <span class="extended-address">Great Minster House</span>
      <span class="street-address">76 Marsham Street</span>
    </span>
    <span class="locality" itemprop="addressLocality">London</span>
    <span class="postcode" itemprop="postalCode">SW1P 4DR</span>
  </p>
  <p>Telephone: <span class="tel" itemprop="telephone">0300 330 3000</span></p>
  <p>Website: <a href="" class="url" itemprop="url"></a></p>
  <p>Email: <a class="email" itemprop="email"></a></p>
</div>

Holy massive-code-snippet batman. I was surprised by the size. I know, I can feel people digging up links already on attack and defence of “bloat” when using microformats alone, but seriously guys, IT’S HUGE. I felt guilty saying “this is what you’ve gotta add to get this mark-up to mean something”. Here’s a more broken down comparison:

Here’s the address, raw, at just over a tweet’s worth (167 chars):

Department for Transport
Great Minster House
76 Marsham Street
London
SW1P 4DR
Telephone: 0300 330 3000

Here’s the address with the elements on it to get at the separate pieces of the address, bringing us up to 356:

<p>Department for Transport</p>
<p>
  <span>Great Minster House</span>
  <span>76 Marsham Street</span>
  <span>London</span>
  <span>SW1P 4DR</span>
</p>
<p>Telephone: 0300 330 3000</p>
<p>Website: <a href=""></a></p>
<p>Email: <a href=""></a></p>

Now let’s throw some classes on to those and get a bit of meaning in there (I mean, you may want to style them up, get things on new lines etc etc. so using the microformat classes is handy for that alone.**). We’ve got a vCard, people! (565):

<div class="vcard">
  <p class="org">Department for Transport</p>
  <p class="adr">
    <span class="extended-address">Great Minster House</span>
    <span class="street-address">76 Marsham Street</span>
    <span class="locality">London</span>
    <span class="postcode">SW1P 4DR</span>
  </p>
  <p>Telephone: <span class="tel">0300 330 3000</span></p>
  <p>Website: <a href="" class="url"></a></p>
  <p>Email: <a class="email"></a></p>
</div>

And now let’s make it schema-org friendly using microdata (863):

<div class="vcard" itemscope itemtype="">
  <p class="org" itemprop="name">Department for Transport</p>
  <p class="adr" itemprop="address" itemscope itemtype="">
    <span itemprop="streetAddress">
      <span class="extended-address">Great Minster House</span>
      <span class="street-address">76 Marsham Street</span>
    </span>
    <span class="locality" itemprop="addressLocality">London</span>
    <span class="postcode" itemprop="postalCode">SW1P 4DR</span>
  </p>
  <p>Telephone: <span class="tel" itemprop="telephone">0300 330 3000</span></p>
  <p>Website: <a href="" class="url" itemprop="url"></a></p>
  <p>Email: <a class="email" itemprop="email"></a></p>
</div>

And we’re done. All I wanted to do was say “this, dear Computer, is an address”. Just getting some frankly useless out-of-the-box HTML elements on the raw data more than doubles its size (167 to 356), then we more than double it again (356 to 863) to actually make it useful.
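To put rough numbers on that growth, here is a small sketch that only does arithmetic on the four character counts quoted above. The labels and the `growth` helper are made up for illustration, and the counts are taken on trust from the post rather than re-measured.

```python
# Character counts quoted in the post for each version of the address.
sizes = {
    "raw text": 167,
    "plain HTML elements": 356,
    "microformats classes": 565,
    "microformats + microdata": 863,
}


def growth(sizes, baseline="raw text"):
    """Return each version's size as a multiple of the baseline size."""
    base = sizes[baseline]
    return {name: round(count / base, 2) for name, count in sizes.items()}


ratios = growth(sizes)

# Share of the final snippet that is mark-up rather than address text.
final = sizes["microformats + microdata"]
markup_share = round((final - sizes["raw text"]) / final, 2)

for name, ratio in ratios.items():
    print(f"{name:26} {sizes[name]:4d} chars  x{ratio}")
print(f"mark-up share of the final snippet: {markup_share:.0%}")
```

On those figures the fully annotated version is a bit over five times the raw text, and roughly four fifths of it is mark-up.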

Now, I know size isn’t everything, and this is a pedantic, slightly silly, and probably less than accurate example. We’re not crazy obsessed with keeping our pages below a certain size anymore (Ah… I remember back when the BBC S&Gs insisted that every page had to be less than 200k down the wire including script and CSS AND images. Those were the days.), but it’s not something to be sniffed at either. Particularly with mark-up. Increased size probably suggests increased complexity – more work for everyone, more chance of someone bungling the order or nesting, more simply “I can’t be bothered”. Colour me dubious. I just want to highlight how much we add on to HTML to make it actually do what we need.

Itemscope and itemtype, a brief diversion

I had one of those “Am I crazy, but why are there two attributes on these things?” moments. When would you ever use one without the other? The spec says you can use itemscope alone, but without itemtype, it’s a bit meaningless. I think I’d do away with itemscope and have itemtype only but with a value, either a URI or something meaningful to the internal vocabulary. itemscope seems to exist solely to say “the things inside me are related”, but by the very nature of it being the parent of those items, that’s already implied, and with a class name of something meaningful (say, hcard), or just the itemtype (with a useful value), explicit to data consumers.

This isn’t sarcasm: I would gratefully receive an explanation as to why there are two attributes instead of one.
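For anyone wanting to poke at what the two attributes each contribute, here is a toy reader built on Python's standard `html.parser`. It is a sketch, not the microdata algorithm from the spec: it assumes well-formed mark-up with no void elements, only reads text values, and ignores `itemref`, `itemid` and URL-valued properties. The schema.org type URLs and the `note` property are assumptions picked for the example, since the original `itemtype` values did not survive in the snippets above.

```python
from html.parser import HTMLParser


class MicrodataSketch(HTMLParser):
    """Toy microdata reader: text-valued properties and nested items only."""

    def __init__(self):
        super().__init__()
        self.items = []    # top-level items, in document order
        self._scopes = []  # items currently open, innermost last
        self._open = []    # one record per open element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("itemprop")
        item = None
        if "itemscope" in attrs:
            # itemscope starts a new item; itemtype only names its type.
            item = {"type": attrs.get("itemtype"), "properties": {}}
            if prop and self._scopes:
                # The new item is itself the value of a property of its parent.
                self._scopes[-1]["properties"][prop] = item
            else:
                self.items.append(item)
            self._scopes.append(item)
            prop = None  # the value is the nested item, not the element's text
        self._open.append({"item": item, "prop": prop, "text": []})

    def handle_data(self, data):
        # Every open element that carries a text property collects this text.
        for record in self._open:
            if record["prop"]:
                record["text"].append(data)

    def handle_endtag(self, tag):
        if not self._open:
            return
        record = self._open.pop()
        if record["prop"] and self._scopes:
            value = " ".join("".join(record["text"]).split())
            self._scopes[-1]["properties"][record["prop"]] = value
        if record["item"] is not None:
            self._scopes.pop()


# Assumed type URLs and a hypothetical "note" property, for illustration only.
markup = """
<div itemscope itemtype="https://schema.org/Organization">
  <p itemprop="name">Department for Transport</p>
  <p itemprop="address" itemscope itemtype="https://schema.org/PostalAddress">
    <span itemprop="addressLocality">London</span>
    <span itemprop="postalCode">SW1P 4DR</span>
  </p>
</div>
<div itemscope>
  <span itemprop="note">untyped item</span>
</div>
"""

parser = MicrodataSketch()
parser.feed(markup)
parser.close()
org, untyped = parser.items
print(org)
print(untyped)
```

In this sketch itemscope is what draws the boundary of an item, and itemtype is just an optional label on it. The second `div` still produces an item, only one whose type is `None`, which is the "a bit meaningless" case.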

Back in the room: Is this seriously what we expect authors to do?

I think I’m still struggling to understand why microdata is a separate specification (or even exists if it’s not being used as a mechanism to get stuff into HTML long-term). You can achieve exactly this richness with the current attributes supplied in HTML, and I don’t even mean just the microformats class way. The data- attribute is pretty handy, though, and seems ripe for stuffing with machine data (why shouldn’t it take a URI if you really need it?).

But I digress.

Microdata with schema-org is solving a problem we’ve already solved in microformats, but in an equally not-quite-there way (having to specify itemtype with a URI more than once in a page for items that are the same, but not within the same parent, feels filthy, for example). They are just as bad as each other, in slightly varying ways. Useful for proving a point, allowing growth and putting out examples (not that all of these bonuses are currently being made the best of), but crappy if this is all we can muster for the long-term, high-volume, regularly published, data representation patterns in HTML. We’re asking authors to jump through hoops still for things they shouldn’t have to.

Microformats, schema-org, whatever… is this really our game plan now? Just keep throwing ever more bloat into already creaking elements when you just want to do something really common? What’s the strategy for getting this stuff out of this mess and into the language?

You might be asking why bother aiming to get those stronger patterns into HTML, if this mechanism basically works for getting a machine to figure out what the hell you’re trying to say, but you may as well be asking why you have any semantically meaningful elements in HTML at all if that’s the case. HTML version 5 is redefining some elements to have better semantic meaning because HTML is the language of authors, and to authors and consumers meaning matters.

Without a plan for gathering evidence for popularly used patterns directly from microformats or microdata (and using them as formal methods of research, testing and development), or what people (actual, real developers – not just the big search engines) are doing in general, we’ll end up with no progress or the wrong progress in HTML, and I believe that a formal process for how and when this happens should be made (i.e. definitions of what constitutes critical mass of common patterns, how the information should be gathered, how they will be proposed formally in the WG and promoted into the language proper, etc.).

I want evidence-based HTML that will evolve using clearly defined mechanisms.

*Conversation shortened and re-written with an artistic license and possibly some (many; “nice” may be a stretch) inaccuracies.

**Yes, I’m casually suggesting that microformats are “free” if all you want to do is get your stuff out there with the minimum you’ll need to be machine-friendly and human-eyes-pretty.

18 thoughts on “Schema-org, microformats and more science please”

  1. I do wonder if something like:

    Department for Transport

    Great Minster House
    76 Marsham Street
    SW1P 4DR

    Telephone: 0300 330 3000

    @prefix v: .
    [] a v:VCard;
    v:fn "Department for Transport" ;
    v:adr [
    v:extended-address "Great Minster House";
    v:street-address "76 Marsham Street";
    v:locality "London" ;
    v:postal-code "SW1P 4DR" ;
    ] ;
    v:tel [ v:Work; rdf:value "0300 330 3000" ] ;
    v:url ;
    v:email .

    Wouldn’t be easier?

    Now to find out if this comments field does escaping ;)

  2. IIRC the reason for having both itemscope="" and itemtype="" came from the Microdata usability study done at Google. The participants in the study made fewer markup errors in the two-attribute case.

  3. I posted some notes about it here:

    Unfortunately the raw data can’t be published for privacy reasons.

    I was really surprised by the itemscope/itemtype thing helping people, but it was a really stark result if I recall correctly. Originally I’d designed it with just one attribute “item”, whose value was optional but if present was the type. Confusion abounded in the usability lab when we tested that variant. We had a variant with the attributes split more or less like it is now, and the participants in the study were far more comfortable with that. One of the people who was tested on my original design saw the split variant near the end of their session, and it was like they had an epiphany.

    It was quite an educational experience for me as a language designer. Things that I thought were obvious (URLs are too long and unwieldy to be used everywhere, terse markup is better than verbose redundant markup) were repeatedly shown to be false. It really changed how I design languages.


  4. I was just about to post about the study by Hixie regarding itemscope and itemtype. I was quite saddened to read the results from said study and can’t help wondering if this really was statistically significant enough to let it impact the specification like this.

    Like the poster, I had a wtf moment reading about itemscope, and found the hixie study while looking for what could possibly be the reason for choosing this syntax.

  5. Ian:

    I agree with Karl about the methodological concerns here. 7 is a pretty small sample. More to the point, AFAICT this is a test of “can n00bs learn a thing this way”; “what works best over long-term use” seems to be something not studied by this survey. A study like this could be constructed using new forms of elements the participants already know that are designed to be “clearer” in this way (e.g., an output type="video" vs. the video tag).

    *Perhaps* we can eventually say that what’s good for new users is good for the experienced as well, but this research doesn’t seem to explore that, even in the small population. But it’s good to know what’s good for new users too.


  6. Ian:

    Which parts of the raw data prevent it being published?

    Excellent to hear that decisions are evidence-based, but without publishing the data it can’t be reviewed & debated.

    I know it’s not your intention, but the whole…
    “Evidence proves I’m right!”
    “Can I see this evidence?”
    “No, it’s secret”
    …thing is the folly of quacks

  7. I’m not sure that you can have a single attribute doing double-duty as both a boolean and a value, if you see what I mean.

    Consider if you got rid of itemscope and just had itemtype instead – what would you do in an XML serialisation? You’d need to do itemtype="itemtype", then the specification would need to use that as a reserved value, which would start to get messy.

  8. That was one massive WTF moment I had (and asked about, never getting an answer)

    The itemscope is utterly pointless. The scope of the property is /always/ the scope of the tag to which it is applied, which makes this extremely verbose and rather confusing.

  9. I only stumbled across this today. I was not aware that Microformats were still going. I thought they had died through lack of traction. Don’t get me wrong, I think the data driven semantic web is the way forward for many things. Including the open government, data exchange, commerce and the Internet of things.

    But I have an open question, which is possibly rather naive. Why be so concerned about how a Microformat is constructed? If the data is to be read by computer then it needs to be in a sensible format for machines, no matter how complex that may be. Most large scale web sites are created programmatically, so as long as the application code is constructed (once) correctly any number of Microformats can be created without error.

    I appreciate the lack of elegance in the code, but surely they are solved, or at least hidden, once the web application is written.

    Mark (out of work right now, with clearly nothing better to do than learn new stuff!)

  10. In my work I build websites that are used mostly in environments with poor, often expensive and flaky internet connection. My vacations usually also take me to such places. Which makes me care A LOT about page sizes.

    I use microformats when applicable, micro-data when needed and never so far, but I also ditch them if I find my page ballooning too much. It can be a very very mild version of Sophie’s choice.

    Not every place is wired like most (but not all) of Europe or Silicon Valley and I wish we would take this more into account when creating new standards. I expect the situation will improve eventually, but then again, standards will change too. HTML5 is after all a living document.

    Sorry if I am ranting too much.
