HTML Metadata

Metadata is specified rather differently in HTML and in the various XHTML versions. For example, consider the recommended ways to specify the author of an HTML page:

HTML 4:
<meta name="author" content="John Doe">

XHTML 1:
<head profile="http://dublincore.org/documents/dcq-html/">
<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" />
<meta name="DC.creator" content="John Doe" />
</head>

XHTML 2:
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:dc="http://purl.org/dc/elements/1.1/">
<meta property="dc:creator" content="John Doe" />

Link Relations: rel and rev

The rel and rev link relations can be used in both <link> and <a> elements. Link relations are used to point to another resource (e.g. an HTML page or an RDF resource). The rel attribute specifies what the target resource is in relation to the current document; the rev attribute specifies the reverse, namely what the current document is in relation to the target.
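
As a minimal sketch of how rel and rev mirror each other, the two fragments below express the same relationship from opposite ends. The file names chapter1.html and chapter2.html are hypothetical.

```html
<!-- In chapter1.html: "chapter2.html is the next document after this one" -->
<link rel="next" href="chapter2.html" />

<!-- In chapter2.html: "this document is the next one after chapter1.html" -->
<link rev="next" href="chapter1.html" />
```

The second form is seldom seen in practice; the same information is more commonly given with the inverse keyword, here rel="prev" in chapter2.html.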

The following keywords are defined in HTML 4 and XHTML 2:

alternate
Designates alternate versions for the document. When used together with the hreflang attribute, it implies a translated version of the document. When used together with the hrefmedia attribute, it indicates a version intended for that type of device.
stylesheet
Refers to an external style sheet. (Deprecated in XHTML 2.)
start
Refers to the first resource in a collection of resources. A typical use case might be a collection of chapters in a book.
next
Refers to the next resource (after the current one) in an ordered collection.
prev
Refers to the previous resource (before the current one) in an ordered collection.
up
Refers to the resource "above" in a hierarchically structured set. (New in XHTML 2.)
contents
Refers to a resource serving as a table of contents.
index
Refers to a resource providing an index.
glossary
Refers to a resource providing a glossary of terms.
copyright
Refers to a copyright statement for the resource.
chapter
Refers to a resource serving as a chapter in a collection.
section
Refers to a resource serving as a section in a collection.
subsection
Refers to a resource serving as a subsection in a collection.
appendix
Refers to a resource serving as an appendix in a collection.
help
Refers to a resource offering help (more information, links to other sources of information, etc.)
bookmark
Refers to a bookmark. A bookmark is a link to a key entry point within an extended document. The title attribute may be used, for example, to label the bookmark. Note that several bookmarks may be defined for a document.
meta
Refers to a resource that provides metadata, for instance in RDF. (New in XHTML 2.)
icon
Refers to a resource that represents an icon, similar to the favicon.ico file. (New in XHTML 2.)
shortcut icon
See icon. (Nonstandard keyword introduced by Internet Explorer.)
p3pv1
Refers to a P3P Policy Reference File. (New in XHTML 2.)

In addition, XHTML 2 defines the profile, role and cite keywords, but their usage is not entirely clear. The list at http://www.w3.org/TR/relations.html seems to be an older list of keywords; using those is not recommended.
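
To illustrate how several of these keywords fit together, here is a sketch of the head of the second chapter of a book, written in XHTML 1 syntax. All file names and titles are hypothetical.

```html
<head>
  <title>Chapter 2</title>
  <!-- Position of this page within the ordered collection -->
  <link rel="start" href="chapter1.html" title="Chapter 1" />
  <link rel="prev" href="chapter1.html" />
  <link rel="next" href="chapter3.html" />
  <link rel="contents" href="toc.html" />
  <link rel="copyright" href="copyright.html" />
  <!-- A French translation of this same chapter -->
  <link rel="alternate" hreflang="fr" href="chapter2.fr.html" title="Chapitre 2" />
</head>
```

Whether a browser exposes these links (for example in a navigation bar) depends on the browser; the markup only declares the relationships.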

Stylesheets

As seen above, the valid way to specify a stylesheet in HTML 4 and XHTML 1 is:

<link rel="stylesheet" href="style.css" type="text/css" media="screen" />

In XHTML2, the stylesheet keyword is deprecated in favour of the style element (HTML4 and XHTML1 already contain the style element, but not the src attribute):

<style src="style.css" type="text/css" media="screen" />

Finally, it is also possible to specify the stylesheet with a processing instruction in the XML prolog, though not all browsers support this. It is, however, the recommended way to specify style sheets in SVG images:

<?xml-stylesheet type="text/css" href="style.css" media="screen"?>

If the type is not given, HTML uses text/css by default, or whatever is given in the Content-Style-Type HTTP header.
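
The default matters mostly for inline style attributes, which cannot carry a type of their own. A minimal sketch, assuming the HTTP header cannot be set and the meta http-equiv substitute (see HTTP equivalent data) is used instead:

```html
<head>
  <!-- Declares text/css as the default style sheet language for this document -->
  <meta http-equiv="Content-Style-Type" content="text/css" />
</head>
<body>
  <!-- The style attribute has no type of its own, so it relies on the default -->
  <p style="color: green">Styled with the document's default style sheet language.</p>
</body>
```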

Meta tags

The meta element can have a property (in XHTML2) or name (in HTML4 and XHTML1) attribute, with a specific keyword. The following keywords are specified:

description
Gives a description of the resource.
generator
Identifies the software used to generate the resource.
keywords
Gives a comma-separated list of keywords describing the resource.
robots
Gives advisory information intended for automated web-crawling software.
title
Specifies a title for the resource.
author
Specifies the creator of the HTML page (deprecated in XHTML2)
copyright
Gives the copyright statement (deprecated in XHTML2)

HTML 4 did not formally specify any keywords, but its appendix mentions keywords, description and robots, while its examples use author and copyright, both of which are deprecated in XHTML 2.
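
As a sketch of how these keywords look in practice with the name attribute (HTML 4 and XHTML 1), consider the head below. The content values, including the generator name ExampleEditor, are made up for illustration.

```html
<head>
  <title>Example page</title>
  <meta name="description" content="A short overview of metadata in HTML and XHTML." />
  <meta name="keywords" content="HTML, XHTML, metadata, Dublin Core" />
  <meta name="generator" content="ExampleEditor 1.0" />
  <!-- Asks crawlers not to index this page, but still to follow its links -->
  <meta name="robots" content="noindex, follow" />
</head>
```

The robots value is advisory only: well-behaved crawlers honour it, but nothing enforces it.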

XHTML 2 also defines reference as the default keyword if none is present. This is only useful if the property attribute is used on elements other than the meta element. See http://www.w3.org/TR/xhtml2/mod-meta.html and http://www.w3.org/TR/xhtml2/mod-metaAttributes.html.

HTTP equivalent data

It is sometimes not possible to (easily) alter the HTTP headers. In those cases, it is possible to specify a substitute HTTP header using the meta element:

<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=UTF-8" />

The following HTTP headers describe the document or its caching, and are candidates for use in a meta element:

Content-Type
The MIME type of the body, with optional charset. For example "text/html; charset=ISO-8859-1" for HTML or "application/xhtml+xml; charset=UTF-8" for XHTML. Regrettably, Internet Explorer versions before 9 do not understand the application/xhtml+xml MIME type.
Content-Language
Describes the natural language(s) of the intended audience (thus not necessarily a list of all languages used in the document).
Content-Length
Size of the full HTTP body (thus all the HTML code), in bytes.
Content-Location
The URI of the original resource, in case it can be accessed at separate locations.
Content-MD5
Message integrity check (MIC) of the entity-body, using an MD5 checksum.
Expires
The Expires entity-header field gives the date/time after which the response is considered stale. Unfortunately, the required format is the rather clumsy RFC 1123 date format (e.g. Wed, 01 Dec 2010 16:00:00 GMT).
Last-Modified
Specifies the last modification date of the document. Specified in archaic RFC 1123 format.
Content-Style-Type
The default MIME type for style sheets. By default text/css. (Defined by HTML 4.)
Content-Script-Type
The default MIME type for scripts. By default text/javascript (Defined by HTML 4.)
Cache-Control
Specifies how end-hosts and intermediate proxies must cache the results. E.g. max-age=3600
Pragma
Obsolete header, defined for backwards-compatibility with HTTP 1.0. Pragma: no-cache has the same meaning as Cache-Control: no-cache.
PICS-Label
Obsolete header, defining the content rating of a document. The Internet Content Rating Association (ICRA) has now replaced PICS with ICRA labels, which use RDF files. For these new labels, use a link element with the meta keyword instead, e.g. <link rel="meta" href="icra-label.rdf" type="application/rdf+xml" />.

In addition, RFC 2616 (HTTP 1.1) defines the Allow, Content-Encoding, and Content-Range entity-headers, but these do not seem useful in an HTML meta element.
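
A sketch combining a few of the headers listed above in one head element follows. The values are illustrative only.

```html
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
  <meta http-equiv="Content-Language" content="en" />
  <!-- RFC 1123 date: weekday, day, month, year, time, always in GMT -->
  <meta http-equiv="Expires" content="Wed, 01 Dec 2010 16:00:00 GMT" />
  <!-- Caches may reuse the response for up to one hour (3600 seconds) -->
  <meta http-equiv="Cache-Control" content="max-age=3600" />
</head>
```

Keep in mind that these are only substitutes: intermediate proxies generally look at the real HTTP headers and do not parse the HTML, so caching directives given this way may have no effect outside the browser itself.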

External Metadata Specifications

All HTML variants allow the set of keywords to be extended using external namespaces. According to a survey by Google, the most popular namespaces are Dublin Core and XFN.

Specifying the external namespace

In HTML 4 and XHTML 1 (as recommended by the HTML 4 spec and the Dublin Core spec):

<head profile="http://dublincore.org/documents/dcq-html/">
  <link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" />
  <meta name="DC.creator" content="John Doe" />
</head>

While Dublin Core extends the keywords for the meta element, XFN extends the keywords for the rel attribute of the a and link elements. In addition, while Dublin Core is used for elements in the head element, XFN is typically used only for rel attributes of a elements in the body of the HTML page (so on a elements rather than link elements). Even so, for HTML 4, the profile attribute should be added to the head element, not to the body element:

<head profile="http://gmpg.org/xfn/11"></head>
<body>
  <a href="http://johndoe.example.com" rel="co-worker">John Doe</a>
</body>

The profile attribute allows multiple values, separated by spaces. However, the HTML 4 specification says that all values but the first URI may be ignored.
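
A sketch of a page that declares both profiles at once, reusing the Dublin Core and XFN URIs from the examples above, is shown below. The name Jane Doe is made up.

```html
<!-- Two profile URIs, separated by a space; a user agent may honour only the first -->
<head profile="http://dublincore.org/documents/dcq-html/ http://gmpg.org/xfn/11">
  <link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" />
  <meta name="DC.creator" content="Jane Doe" />
</head>
<body>
  <a href="http://johndoe.example.com" rel="co-worker">John Doe</a>
</body>
```

Since only the first URI is guaranteed to be considered, it seems sensible to list the profile you care about most first.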

In XHTML 2 (as shown in the XHTML 2 specs):

<html xmlns="http://www.w3.org/1999/xhtml" xmlns:dc="http://purl.org/dc/elements/1.1/">
<head>
  <meta property="dc:creator" content="John Doe" />
</head>

Alternatively, you can still use a profile, though this is specified with a link element rather than as an attribute of the head element. Since XHTML 2 is still in progress as of this writing, I expect that only one method will remain in the end:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <link rel="profile" href="http://purl.org/dc/elements/1.1/" />
  <meta property="creator" content="John Doe" />
</head>

If you use element refinements of the Dublin Core, like date.created rather than just date, it is not obvious how to specify this in XHTML, since the date element is defined in one namespace, while the refinement created is defined in another. There are in fact two equivalent ways to define it, as shown by these two meta elements:

<head profile="http://dublincore.org/documents/dcq-html/">
  <link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" />
  <link rel="schema.DCTERMS" href="http://purl.org/dc/terms/" />
  <meta name="DC.date.created" content="2001-07-18" />
  <meta name="DCTERMS.created" content="2001-07-18" />
</head>

See the articles on Dublin Core and RDF schemas for more information about other terminologies.