Schema-org, microformats and more science please

A normal conversation in the GovUK (or any office I frequent) today went*: “Can we get some microformats on that page?”, I suggest as I spot a section of our site outputting a boat-load of addresses. “No problem – but what’s this about schema-org?”. “Yeah, yeah.. we can hedge our bets and throw their mark-up in there too, it’s just some extra itemprops. *flippant scoff* I’ll send you a complete snippet example, because I’m just nice like that.”

And that’s what I did. And it looked like this:


<div class="vcard" itemscope itemtype="http://schema.org/Organization">
  <p class="org" itemprop="name">Department for Transport</p>
  <p class="adr" itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
    <span itemprop="streetAddress">
      <span class="extended-address">Great Minster House</span>
      <span class="street-address">76 Marsham Street</span>
    </span>
    <span class="locality" itemprop="addressLocality">London</span>
    <span class="postal-code" itemprop="postalCode">SW1P 4DR</span>
  </p>
  <p>Telephone: <span class="tel" itemprop="telephone">0300 330 3000</span></p>
  <p>Website: <a href="http://www.dft.gov.uk" class="url" itemprop="url">www.dft.gov.uk</a></p>
  <p>Email: <a href="mailto:firstname.surname@dft.gsi.gov.uk" class="email" itemprop="email">firstname.surname@dft.gsi.gov.uk</a></p>
</div>

Holy massive-code-snippet, Batman. I was surprised by the size. I know, I can feel people already digging up links in attack and defence of “bloat” when using microformats alone, but seriously guys, IT’S HUGE. I felt guilty saying “this is what you’ve gotta add to get this mark-up to mean something”. Here’s a more broken down comparison:

Here’s the address, raw, at just over a tweet’s worth (167 chars):


Department for Transport
Great Minster House
76 Marsham Street
SW1P 4DR
Telephone: 0300 330 3000
Website: http://www.dft.gov.uk
Email: firstname.surname@dft.gsi.gov.uk

Here’s the address with the elements on it to get at the separate pieces of the address, bringing us up to 356:


<p>Department for Transport</p>
<p>
  <span>Great Minster House</span>
  <span>76 Marsham Street</span>
  <span>London</span>
  <span>SW1P 4DR</span>
</p>

<p>Telephone: 0300 330 3000</p>
<p>Website: <a href="http://www.dft.gov.uk">www.dft.gov.uk</a></p>
<p>Email: <a href="mailto:firstname.surname@dft.gsi.gov.uk">firstname.surname@dft.gsi.gov.uk</a></p>

Now let’s throw some classes on to those and get a bit of meaning in there (I mean, you may want to style them up, get things on new lines etc. etc., so using the microformat classes is handy for that alone.**). We’ve got a vCard, people! (565):


<div class="vcard">
  <p class="org">Department for Transport</p>
  <p class="adr">
    <span class="extended-address">Great Minster House</span>
    <span class="street-address">76 Marsham Street</span>
    <span class="locality">London</span>
    <span class="postal-code">SW1P 4DR</span>
  </p>
  <p>Telephone: <span class="tel">0300 330 3000</span></p>
  <p>Website: <a href="http://www.dft.gov.uk" class="url">www.dft.gov.uk</a></p>
  <p>Email: <a href="mailto:firstname.surname@dft.gsi.gov.uk" class="email">firstname.surname@dft.gsi.gov.uk</a></p>
</div>

And now let’s make it schema-org friendly using microdata (863):


<div class="vcard" itemscope itemtype="http://schema.org/Organization">
  <p class="org" itemprop="name">Department for Transport</p>
  <p class="adr" itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
    <span itemprop="streetAddress">
      <span class="extended-address">Great Minster House</span>
      <span class="street-address">76 Marsham Street</span>
    </span>
    <span class="locality" itemprop="addressLocality">London</span>
    <span class="postal-code" itemprop="postalCode">SW1P 4DR</span>
  </p>
  <p>Telephone: <span class="tel" itemprop="telephone">0300 330 3000</span></p>
  <p>Website: <a href="http://www.dft.gov.uk" class="url" itemprop="url">www.dft.gov.uk</a></p>
  <p>Email: <a href="mailto:firstname.surname@dft.gsi.gov.uk" class="email" itemprop="email">firstname.surname@dft.gsi.gov.uk</a></p>
</div>

And we’re done. All I wanted to do was say “this, dear Computer, is an address”. Just getting some frankly useless out-of-the-box HTML elements on the raw data more than doubles its size (167 to 356), then we more than double it again to actually make it useful (356 to 863).

Now, I know size isn’t everything, and this is a pedantic, slightly silly, and probably less than accurate example. We’re not crazy obsessed with keeping our pages below a certain size anymore (Ah… I remember back when the BBC S&Gs insisted that every page had to be less than 200k down the wire including script and CSS AND images. Those were the days.), but it’s not something to be sniffed at either. Particularly with mark-up. Increased size probably suggests increased complexity – more work for everyone, more chance of someone bungling the order or nesting, more simply “I can’t be bothered”. Colour me dubious. I just want to highlight how much we add on to HTML to make it actually do what we need.

Itemscope and itemtype, a brief diversion

I had one of those “Am I crazy, but why are there two properties on these things?” moments. When would you ever use one without the other? The spec says you can use itemscope alone, but without itemtype, it’s a bit meaningless. I think I’d do away with itemscope and have itemtype only but with a value, either a URI or something meaningful to the internal vocabulary. itemscope seems to exist solely to say “the things inside me are related”, but by the very nature of it being the parent of those items, that’s already implied, and a class name of something meaningful (say, hcard), or just the itemtype (with a useful value), would make it explicit to data consumers.
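For illustration, here’s a sketch of the attributes used apart and together, based on a reading of the microdata spec rather than on anything a consumer is known to do with it. itemscope on its own creates an item whose property names belong to no vocabulary, and itemscope on an element that also carries itemprop makes that property’s value a nested item rather than the element’s text. The property names in the first block are made up for the example.

```html
<!-- itemscope alone: an untyped item. The names are grouped together,
     but no vocabulary says what "name" or "postcode" mean. -->
<div itemscope>
  <span itemprop="name">Department for Transport</span>
  <span itemprop="postcode">SW1P 4DR</span>
</div>

<!-- itemscope plus itemtype: the same grouping, now tied to a vocabulary. -->
<div itemscope itemtype="http://schema.org/Organization">
  <span itemprop="name">Department for Transport</span>
  <!-- itemprop plus itemscope: the value of "address" is a nested item,
       not the text content of the element. -->
  <div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
    <span itemprop="postalCode">SW1P 4DR</span>
  </div>
</div>
```

That nested case is possibly the strongest argument for a separate attribute, though an itemtype with a value would arguably signal the same thing.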

This isn’t sarcasm: I would gratefully receive an explanation as to why there are two attributes instead of one.

Back in the room: Is this seriously what we expect authors to do?

I think I’m still struggling to understand why microdata is a separate specification (or even exists if it’s not being used as a mechanism to get stuff into HTML long-term). You can achieve exactly this richness with the current attributes supplied in HTML, and I don’t even mean just the microformats class way. The data- attribute is pretty handy, though, and seems ripe for stuffing with machine data (why shouldn’t it take a URI if you really need it?).
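To make the data- idea concrete, here’s a sketch of the same address carried on data- attributes. The attribute names (data-type and data-prop) are invented for this example and no consumer is known to read them; it only shows that an existing attribute mechanism could hold the same information, URI included.

```html
<!-- Hypothetical data-type / data-prop attributes standing in for
     itemtype / itemprop. These names are assumptions, not a standard. -->
<div data-type="http://schema.org/Organization">
  <p data-prop="name">Department for Transport</p>
  <p data-prop="address" data-type="http://schema.org/PostalAddress">
    <span data-prop="streetAddress">76 Marsham Street</span>
    <span data-prop="addressLocality">London</span>
    <span data-prop="postalCode">SW1P 4DR</span>
  </p>
</div>
```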

But I digress.

Microdata with schema-org is solving a problem we’ve already solved in microformats, but in an equally not-quite-there way (having to specify itemtype with a URI more than once in a page for items that are the same, but not within the same parent, feels filthy, for example). They are just as bad as each other, in slightly varying ways. Useful for proving a point, allowing growth and putting out examples (not that all of these bonuses are currently being made the best of), but crappy if this is all we can muster for the long-term, high-volume, regularly published, data representation patterns in HTML. We’re asking authors to jump through hoops still for things they shouldn’t have to.

Microformats, schema-org, whatever… is this really our game plan now? Just keep throwing ever more bloat into already creaking elements when you just want to do something really common? What’s the strategy for getting this stuff out of this mess and into the language?

You might be asking why bother aiming to get those stronger patterns into HTML, if this mechanism basically works for getting a machine to figure out what the hell you’re trying to say, but you may as well be asking why you have any semantically meaningful elements in HTML at all if that’s the case. HTML version 5 is redefining some elements to have better semantic meaning because HTML is the language of authors, and to authors and consumers meaning matters.

Without a plan for gathering evidence of popularly used patterns directly from microformats or microdata (and using them as formal methods of research, testing and development), or from what people (actual, real developers – not just the big search engines) are doing in general, we’ll end up with no progress or the wrong progress in HTML. I believe there should be a formal process for how and when this happens (i.e. definitions of what constitutes critical mass of common patterns, how the information should be gathered, how they will be proposed formally in the WG and promoted into the language proper, etc.).

I want evidence-based HTML that will evolve using clearly defined mechanisms.

*Conversation shortened and re-written with an artistic license and possibly some (many; “nice” may be a stretch) inaccuracies.

**Yes, I’m casually suggesting that microformats are “free” if all you want to do is get your stuff out there with the minimum you’ll need to be machine-friendly and human-eyes-pretty.

Gold-plating the cow paths

I was quoted a couple of weeks ago as saying, albeit in private, the following:

“HTML fails to be simple if it can’t provide what authors regularly need and end up turning to other encodings” — @phae

@slightlylate

For context, that was in response to a remark made by a friend that HTML fails if authors can’t use it because it has become too complex and attempts to describe too much. My response was that it fails not because it’s complicated, but when an author cannot express their content accurately with the toolkit they’re supplied and have to go to another encoding to find what they’re looking for. That’s the language passing the buck, in my opinion.

Don’t get me wrong – I’m not suggesting HTML should cover every niche semantic everyone is ever going to want to express ever. That would be crazy and confusing. HTML should express what is most commonly used, and at the moment it doesn’t – which is why we still see microformats, microdata, component model, schema.org etc. trying to fill the gaps. And not just trying to fill the gaps, but trying to provide data on which decisions can be made about what should be in HTML.

HTML, and a platform that provides, should be the end goal. Microformats, et al., are the research grounds that should be directly contributing with the evidence and data they are able to garner. In fact, the most popular microformats, shown through demand and usage, should just be in HTML as a standard, by being provided for with semantically appropriate new elements.

We’ve seen this work. Microformats started doing things with dates, most specifically in hCalendar. It had a slightly kludgy way of marking up time, using abbr. The accessibility lot were rightfully less than impressed, and other patterns were tried – title and spans and all kinds of things. But in short, it was shown that time gets talked about a lot, and we needed something better. We got <time> in HTML. Hooray! The system works! Well, except when it doesn’t. Go read Bruce Lawson’s take, as the powers that be removed time and replaced it with data. Gee, thanks.
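For anyone who missed that saga, the two patterns looked roughly like this. The event and its date are invented for the example; the class name dtstart is hCalendar’s.

```html
<!-- hCalendar's abbr pattern: the machine-readable date lives in the title
     attribute, which a screen reader set to expand abbreviations may read
     aloud in place of the friendly text. -->
<abbr class="dtstart" title="2011-11-06T10:00:00Z">6 November at 10am</abbr>

<!-- The HTML5 time element: the same date in a dedicated attribute. -->
<time class="dtstart" datetime="2011-11-06T10:00:00Z">6 November at 10am</time>
```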

We shouldn’t expect authors to go in search of richer mark-up from other sources when what they’re trying to do is really common, when a need has been shown, and a pattern has been proven.

hgroups and sub-titles

I realise that queries or concerns about HTML 5 elements should make their way onto the WHATWG mailing list, but I just wanted to get a few thoughts out on here about what I’ve spent far too long discussing at work recently. It’s perfectly likely that I’ve totally got the wrong end of the proverbial, so this is just me trying to get my mind straight on why I feel something about this is unnatural and I welcome comments to help clarify or discuss.

So, hgroup, eh?

hgroup is one of the new elements featuring in the HTML 5 specification. Its purpose, quite simply, is to group two or more headings together into one block so that subheadings are treated differently and only the first heading becomes part of the document outline.

The hgroup element is typically used to group a set of one or more h1-h6 elements — to group, for example, a section title and an accompanying subtitle.

From the current HTML5 working draft

The WHATWG wiki has the following rationale for requiring the hgroup element:

The point of <hgroup> is to hide the subtitle from the outlining algorithm

WHATWG wiki

Over on HTML5 Doctor, John Allsopp appears to find fault with this element also and suggests that the requirement for hgroup is symptomatic of a flaw in the outlining algorithm. I can see his point, but I’m more concerned that it’s a fundamentally inaccurate use of a heading.

In my mind, headings are designed to denote sections. At least, that’s what they were used for in HTML 4. Things either went in a heading, because they denoted a new section of content, or they didn’t. This is Frances the idealist speaking, I realise this, but still.

Let’s say I had a new website about a children’s story about monsters, and I wanted to title it “Monsters live under my bed”, but it could also have a sub-title or strap-line. As an author, I either want my title to be “Monsters live under my bed. Where things go bump in the night” or I want it to be “Monsters live under my bed” and the next line is incidental and a supplementary strap-line and not something I would consider to be part of my title.

Currently, I might do any of the following:

<h1>Monsters live under my bed 
Where things go bump in the night
</h1>

Example wrapped for legibility, but my story title is the full text and is in a heading.

<h1>Monsters live under my bed 
<span>Where things go bump in the night</span>
</h1>

This one is mostly a stylistic example. The strapline needs to look like a strapline, so I’ve stuck a span around it (yeah, I know…), but fundamentally I’m still considering it to be part of the title. My story’s name is the full text.

<h1>Monsters live under my bed</h1>
<p>Where things go bump in the night</p>

In this case, the title of my story is only “Monsters live under my bed” and because HTML 4 doesn’t really offer a suitable element that I would consider “a sub header that isn’t a new section of the document” I’ve stuck the sub-title text in a paragraph.

<h1>Monsters live under my bed</h1>
<h2>Where things go bump in the night</h2>

This one suggests that I have a title and then the first chapter beneath the title is “Where things go bump in the night”. That second line is no longer the title of my kids’ story. The h2 would be a new indented item in an outline and would suggest that further within the document I may find more h2s and that I have stepped into the document by a level.

What HTML 5 says you should do if you want a sub-title/sub-heading is this:

<hgroup>
<h1>Monsters live under my bed</h1>
<h2>Where things go bump in the night</h2>
</hgroup>

This has the effect of making that h2 not appear in the outline, since it will no longer create a new section. The outline now considers that the title of my story is again “Monsters live under my bed”. Any content that comes after this would be within the section titled by the h1. The h2 doesn’t count as the start of a new section (as it would if there was no hgroup wrapper). The contents of the h2 are considered a special non-sectioning-heading case, but they’re still in a heading element. But if it’s meant to be a heading, why isn’t it in the h1? Gah!
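Put side by side, the difference looks roughly like this. The outlines in the comments are worked out by hand from the draft’s description of the outlining algorithm, not taken from an outliner tool.

```html
<!-- Without hgroup, the h2 opens a sub-section:
       1. Monsters live under my bed
          1.1 Where things go bump in the night -->
<h1>Monsters live under my bed</h1>
<h2>Where things go bump in the night</h2>

<!-- With hgroup, only the highest-ranked heading reaches the outline:
       1. Monsters live under my bed -->
<hgroup>
  <h1>Monsters live under my bed</h1>
  <h2>Where things go bump in the night</h2>
</hgroup>
```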

I kind of have the feeling that what we should have at our disposal is something that looks more like the following, which allows for a heading and some sort of sub-title(s) (naming isn’t my strong point, I’ve picked ‘strapline’ fairly arbitrarily, but essentially I imagine it as a non-heading sub-title of some nature – maybe even subheading?). It’s not as if hgroup is allowed to hold anything other than headings anyway.

<h1>Monsters live under my bed</h1>
<strapline>Where things go bump in the night</strapline>

It satisfies my problem with using lower-ranked headings for things you consider to either be associated as part of the first heading (or rather, supplementary to it) or not actually headings at all. If I want my full title to be all of the above, it can all go in the h1. If I don’t consider the second line to be part of a heading, it gets to go in its own non-heading supplementary titling element. The rationale quoted above specifically says “subtitle”, although I noticed the current editor’s draft for hgroup does mention “subheadings”.

Do you follow my drift?

If we’re in the business of having the opportunity to create new elements, can’t we just create one that actually satisfies the requirement explicitly, rather than sort of allowing authors to do things that seem somehow hypocritical to the point of heading elements in most other contexts? I also realise that purist intentions fall waaaaay down the list of priorities when compared to the requirements of paving existing usage, but as an author as well, I feel that there’s something fundamentally inaccurate about treating a heading as a non-heading. As an author I want to be able to be as accurate as possible.

Is it just time for me to let go of the idea that headings do the job of creating and naming sections in a document outline?

Aren’t semantics fun!

HTML5 Microdata – Over-cooked?

What is Microdata?

Microdata is HTML5’s answer to how we should go about embedding machine-readable data in our mark-up.

At a high level, microdata consists of a group of name-value pairs. The groups are called items, and each name-value pair is a property. Items and properties are represented by regular elements.

A simple example looks something like this:


<div item>
 <p>My name is <span itemprop="name">Frances</span>.</p>
 <p>I work for the <span itemprop="company">BBC</span>.</p>
 <p>I am <span itemprop="nationality">British</span>.</p>
</div>

Where the item has 3 properties with values (name:Frances, company:BBC, nationality:British).

You can then associate item properties with items that the property is not a direct descendant of, with the subject attribute.

Essentially, you have some new attributes at your disposal:

  • item – to specify a group.
  • itemprop – to define the property of an element inside an item.
  • subject – to associate a property with a non-parent item.

You can also type items with a URL, reverse DNS labels or a pre-defined type (and each itemprop can accept multiple properties, as you’d expect with class):

Here, the item is “org.example.animal.cat”:


<section item="org.example.animal.cat">
 <h1 itemprop="org.example.name">Hedral</h1>
 <p itemprop="org.example.desc">Hedral is a male american domestic
 shorthair, with a fluffy black fur with white paws and belly.</p>
 <img itemprop="org.example.img" src="hedral.jpeg" alt="" title="Hedral, age 18 months">
</section>

In this example the “org.example.animal.cat” item has three properties, an “org.example.name” (“Hedral”), an “org.example.desc” (“Hedral is…”), and an “org.example.img” (“hedral.jpeg”).

Quotes and examples (slightly personalised) come from the HTML5 working draft.

My reservations

My gut instinct with microdata is that it’s overcomplicating things. We have RDFa already if you really want to get into the nitty-gritty of machine-readable data and, dare I say it, microformats and good semantic practice for creating shared vocabularies for plain-old semantic HTML. I’m not sure HTML5 necessarily needs this sort of extra solution.

The last example above, with the reverse DNS typing, just looks so… heavy. Something about it just doesn’t feel right and its actual value to me remains unclear, or at least I can’t see the value of specifying the path on each element. Couldn’t that be inferred from the structure, or subject used where ambiguities appear, and then as a last resort specify it on each element?


<section item="org.example.animal.cat">
 <h1 itemprop="name">Hedral</h1>
 <p itemprop="desc">Hedral is a male american domestic
 shorthair, with a fluffy black fur with white paws and belly.</p>
 <img itemprop="img" src="hedral.jpeg" alt="" title="Hedral, age 18 months">
</section>

The itemprop attribute bothers me most. I can’t help but think that all the examples shown in the draft would still work if itemprop was replaced with class. The class attribute is already designed to take a semantically rich term for the element. Worse still, assuming class is used appropriately, you’ll end up with unnecessary repetition across the attributes.


<div item>
 <p>My name is <span class="name" itemprop="name">Frances</span>.</p>
...
</div>

The subject attribute examples aren’t great, which doesn’t help their case – they don’t seem that real world (although there are plenty of good reasons why you might need subject – just look at the microformat include-pattern for example, and how that would be improved with it). A few of the examples could be better represented and relationships then inferred from the element structure (and I wouldn’t mind, but HTML5 already offers a boat-load of new elements to take away much of the ambiguity that HTML4 had – but just sections and headers go a long way to tying information notionally together).

The microdata proposal seems to be about making explicit what could otherwise already be inferred from the actual elements and values (although I’ll concede that it’s often inaccurate or very difficult). Wanting to be exact isn’t a terrible idea (it works really well for the for attribute, for example) and I do like disambiguation. I just don’t think the current proposal really solves the right problems as it stands.

I do think that subject has the most legs of the new attributes, though, but surely it could be as simple as:


<div id="about">
<p>I'm Frances and I like to complain about things on the internet.</p>
</div>
...
<p subject="about">I own no cats. :(</p>

Let the subject do what for has done for label, but across all elements, tying wayward bits of information to an ID (or maybe simply use subject alone to tie pieces of information together – but then this starts to feel like a class job again).

Or an example with class in place of itemprop and using a pre-defined vocabulary:


<div id="vcard">
<p>I'm <span class="fn">Frances</span> and I like to complain about things on the internet.</p>
</div>
...
<p subject="vcard">I still own no cats. :( I do work for the <span class="company">BBC</span> though. </p>

My final concern, which actually could apply to HTML5 as a whole and is more of a general “are we ready for this yet?” thought, is that this is a lot for an author to consider. You look at the web as it stands now, and most of it isn’t well written. Elements are abused, misused or completely forgotten (and attributes fare worse).

HTML5 offers a raft of new elements and attributes to aid clarity in information, accessibility and flexibility. Do we really think that authors on the whole have a great track-record of implementing the specs well? These new microdata attributes make what could already be a simple lesson (use class meaningfully) into a much steeper learning curve, watering down the overall benefit.

I’m not suggesting that that should be an excuse to not make HTML5 as rich as possible, but it should always be in mind that the web is about enabling normal people to share information – it’s not just an intellectual experiment for web developers.

Microdata is in the early draft stage – so I realise things will change.

Disclaimer

It’s well known that I’m a microformats busy-body, but this has nothing to do with my distaste for microdata as the spec stands. Sure, the two things have similar aims, but microformats has always been a solution for the here-and-now. HTML5 still “supports” microformats, and when HTML5 is ready, microformats will simplify (using the time element can’t happen soon enough) and continue to do what they have always done. I like HTML5 and want it to succeed. I am in no way advocating microformats over microdata or generally comparing the two.