GRDDL parser (name grddl)

A parser for Gleaning Resource Descriptions from Dialects of Languages (GRDDL), W3C Proposed Recommendation of 2007-07-16. GRDDL allows reading XHTML and XML as RDF triples: profiles in the document declare XSLT transforms from the XHTML or XML content into RDF/XML or another RDF syntax, which can then be parsed.

The GRDDL parser is rather complex and differs from the other parsers in that it retrieves URIs, reads HTML documents (possibly with errors), transforms the documents with XSLT and turns the result into a single graph. The default configuration of the GRDDL parser also reads microformats (hcard, hcalendar) and follows <link> tags that point to RDF/XML. Parts of the GRDDL process can be altered by configuration, as described below.

The GRDDL parser defines 'base', 'Base' and 'url' XSLT parameters with the value of the base URI to allow some XSLT sheets to work. This set of parameters cannot be disabled.

If the XSLT transform returns an empty string, no further processing of the result is done, and a warning is generated. The xsl:output method is mapped to result document MIME types as follows: 'text' to text/plain, 'xml' to application/xml and 'html' to text/html. Any result that is of type 'application/xml' or of an unknown MIME type is assumed to be RDF/XML.
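The mapping above can be sketched as a small lookup. This is only an illustration of the rule as stated, not Raptor's internal code; the function names grddl_method_to_mime_type() and grddl_result_is_assumed_rdfxml() are hypothetical and do not exist in the Raptor API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the Raptor API: maps an xsl:output
 * method name to the MIME type listed in the text. Returns NULL for a
 * method the text does not list (treated here as "unknown"). */
const char *
grddl_method_to_mime_type(const char *method)
{
  assert(method != NULL);

  if(!strcmp(method, "text"))
    return "text/plain";
  if(!strcmp(method, "xml"))
    return "application/xml";
  if(!strcmp(method, "html"))
    return "text/html";

  return NULL;
}

/* Hypothetical helper: returns 1 when a result with this MIME type
 * would be assumed to be RDF/XML, that is, when the type is
 * application/xml or unknown (NULL); returns 0 otherwise. */
int
grddl_result_is_assumed_rdfxml(const char *mime_type)
{
  return mime_type == NULL || !strcmp(mime_type, "application/xml");
}
```

With this sketch, a transform declaring the 'html' method yields text/html and is not treated as RDF/XML, while a transform with an unlisted method falls through to the RDF/XML assumption.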

The URIs that are processed during GRDDL operations can be checked and skipped if required using a handler set with the raptor_parser_set_uri_filter() function. If the handler returns non-0, the URI is rejected. This uses raptor_www_set_uri_filter() internally.
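As a minimal sketch of the decision such a handler might make, the function below rejects any URI that is not http or https. The policy and the name grddl_reject_uri_string() are hypothetical. The function works on a plain C string so that it stands alone; in a real handler the string would have to be obtained from the URI object that Raptor passes in, and the exact handler signature should be taken from the Raptor headers rather than from this sketch.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical filter policy, for illustration only: accept http and
 * https URIs and reject everything else (file:, ftp:, and so on).
 * Follows the convention described in the text: a non-0 return value
 * means the URI is rejected, 0 means it may be retrieved. */
int
grddl_reject_uri_string(const char *uri_string)
{
  assert(uri_string != NULL);

  if(!strncmp(uri_string, "http://", 7))
    return 0;
  if(!strncmp(uri_string, "https://", 8))
    return 0;

  return 1;
}
```

A restrictive policy of this kind is one way to keep GRDDL processing from following profile or transform links to local files.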

If the value of option RAPTOR_OPTION_WWW_TIMEOUT is set to a number >0, it is used as the timeout in seconds for retrieving URIs during GRDDL processing. This uses raptor_www_set_connection_timeout() internally.

The hardcoded support for hcard and hcalendar microformats can be disabled by setting parser option RAPTOR_OPTION_MICROFORMATS to 0 or using raptor_parser_set_option() with option RAPTOR_OPTION_STRICT and a boolean value of 1.

The GRDDL parser by default will try an XML parser on the content followed by a lax HTML parser. This can be disabled by setting parser option RAPTOR_OPTION_HTML_TAG_SOUP to 0 or using raptor_parser_set_option() with option RAPTOR_OPTION_STRICT and a boolean value of 1.

The GRDDL parser by default will try to look for an HTML <link> tag that points to RDF/XML. This can be disabled by setting parser option RAPTOR_OPTION_HTML_LINK to 0 or using raptor_parser_set_option() with option RAPTOR_OPTION_STRICT and a boolean value of 1.