HOWTO: a ten-line URI dereferencer

The core parts of the Semantic Web architecture are:

  1. Serializing data as RDF, using appropriate vocabularies and ontologies;
  2. Publishing those data as a whole (as an RDF document) or in part, at dereferenceable URIs that can be followed to reassemble the whole;
  3. Making the rest of the Semantic Web aware of these documents or URIs by linking to them in other documents and by ‘pinging’ semantic search engines.

Additionally, see the classic article ‘How to Publish Linked Data on the Web’.

While URIs can refer to document fragments (e.g., http://foo.com/doc.rdf#MyThing ), for large datasets it is more efficient to dereference URIs on the fly, returning only the requested data in a dereferenced document (e.g., http://foo.com/sw/MyThing dereferences to http://foo.com/sw/MyThing.rdf via a 303 redirect, and that file contains a selection of data about http://foo.com/sw/MyThing only).
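On the wire, that exchange looks something like this (an illustrative sketch, using the example foo.com URIs from above):

```
GET /sw/MyThing HTTP/1.1
Host: foo.com
Accept: application/rdf+xml

HTTP/1.1 303 See Other
Location: http://foo.com/sw/MyThing.rdf
```

The client then follows the Location header, requests MyThing.rdf, and receives the RDF document.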

I had assumed it would be complicated to build a URI dereferencer, and I suspect the worry of having to set up a server and dereferencer is a significant barrier to getting more institutions and projects to host their data on the Semantic Web, with permanent, digital, dereferenceable identifiers for their resources, e.g., for museum collections (see the latest of Rod Page’s calls for this). In fact, it’s pretty simple to build a URI dereferencer, though it took a fair amount of googling and trial and error to find that out. So, in an attempt to persuade others it’s not hard, here’s a basic (ca.) 10-line dereferencer, in the form of a gawk script (called here rdfserver):

  #!/usr/bin/gawk -f
  BEGIN{

    # 1. Get URI and method arguments
    id = gensub(/^id=([^&]+)&method=.+$/,"\\1","G",ENVIRON["QUERY_STRING"])
    method = gensub(/^id=[^&]+&method=(.+)$/,"\\1","G",ENVIRON["QUERY_STRING"])

    # 2. Query 4store with SPARQL
    cmd = "curl -d 'query=CONSTRUCT { <http://foo.com/sw/" id \
          "> ?p ?o . ?s2 ?p2 <http://foo.com/sw/" id \
          "> . } WHERE  { <http://foo.com/sw/" id \
          "> ?p ?o . OPTIONAL { ?s2 ?p2 <http://foo.com/sw/" id \
          "> } }' http://localhost:8000/sparql/"
    RS = "\x04" ;  cmd | getline rdfdata ;  close(cmd) ;

    # 3. Return RDF in either RDFXML or as Turtle in plain text
    if (method == "rdf") {
      print "Content-type: application/rdf+xml\n" ; print rdfdata
    } else {
      print "Content-type: text/plain\n\n*** Turtle serialization of <http://foo.com/sw/" id "> ***\n" ;
      print rdfdata | "rapper -q -i rdfxml -o turtle - 'http://foo.com/sw/'" ;
    }

    exit;
  }

This suggestion comes with no warranties, may be a security risk, blah, blah, etc., but it does its job and can easily be modified to be more functional, secure, or pretty. Why gawk? Because it’s the best scripting language, hands down: fast to prototype, fast to execute, pre-installed everywhere, simple to learn, and surprisingly powerful. ’Nuff said. If you have a problem with awk, it’ll be trivial to rewrite this in perl or python. The following assumes a basic knowledge of unix scripting.

Setup

As written, dependencies are: Apache web server, 4store triplestore, and rapper from the Redland suite. And of course, gawk.
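Before going further, it may help to confirm those tools are on the PATH; a quick sketch (the tool names are just the dependencies listed above):

```shell
# Check that each dependency is installed
missing=""
for tool in gawk curl rapper; do
  command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
done
if [ -z "$missing" ]; then echo "all dependencies found"; else echo "missing:$missing"; fi
```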

This script can be placed anywhere in a website’s hierarchy, but is set up here to work in the root web directory. Aside from the script you need a few lines in your root .htaccess file (assuming an Apache web server):

  Options +ExecCGI
  <Files rdfserver>
    SetHandler cgi-script
  </Files>

  RewriteEngine On
  RewriteBase /
  RewriteCond %{HTTP_ACCEPT} application/rdf\+xml
  RewriteRule ^sw/([^\.]+)$ sw/$1.rdf [R=303,L]
  RewriteRule ^sw/([^\.]+)$ sw/$1.txt [R=303,L]
  RewriteRule ^sw/([^\.]+)\.rdf$ rdfserver?id=$1&method=rdf [L]
  RewriteRule ^sw/([^\.]+)\.txt$ rdfserver?id=$1&method=txt [L]

These lines tell the webserver:

  1. Treat the file rdfserver as an executable CGI script.
  2. If the client’s Accept header asks for application/rdf+xml, answer a request for sw/Thing with a 303 redirect to sw/Thing.rdf; otherwise redirect it to sw/Thing.txt.
  3. Behind the scenes, rewrite sw/Thing.rdf and sw/Thing.txt to calls of rdfserver, passing the resource name and requested format as query-string parameters.

All other addresses (e.g., sw/Thing.html) will lead to a 404 NOT FOUND error.

Next, place the script in the same top directory, and make it executable (chmod a+x rdfserver). Make sure it is only writable by the user, and that its directory is only writable by the user, or else you may get a suexec error.

The script

...is pretty self-explanatory, with a basic understanding of general scripting/programming constructs.

  1. Gawk makes environment variables available via the ENVIRON[] array, and the webserver passes the query string to the script in the environment variable QUERY_STRING. This is parsed with a regexp in gensub().
  2. Using the powerful getline construct, the script passes a microscript to the system (in this case a simple curl query to the 4store SPARQL endpoint) and reads the data back into the variable rdfdata. Setting RS = "\x04" (a character that will not appear in the data) makes sure the whole response is read as a single record. The SPARQL query used here asks for all triples with the queried resource as subject and, if there are any, all triples with the queried resource as object. The designer of a dereferencer is free to return any or all useful triples about a subject. As a rule, it’s also good to return some metadata about the document (Thing.rdf) itself, such as its modification date.
  3. Using the correct Content-type header, the data are sent back to the user agent. Since RDFXML is hard to read, a call to rapper converts it back to Turtle before it is served as plain text.
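The query-string parsing in step 1 can be sketched in plain POSIX shell, which is handy for testing outside the CGI environment (the QUERY_STRING value here is just an example):

```shell
# Example QUERY_STRING as Apache would pass it to the CGI script
QUERY_STRING="id=MyThing&method=rdf"

# Same parsing as the two gensub() calls, using parameter expansion
id=${QUERY_STRING%%&*}             # drop "&method=..."  -> "id=MyThing"
id=${id#id=}                       # drop "id=" prefix   -> "MyThing"
method=${QUERY_STRING##*method=}   # keep what follows "method="  -> "rdf"

echo "id=$id method=$method"
```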

The data server itself

4store

The easiest way to access the RDF data is through a triplestore. I have found 4store to be the fastest, simplest and most robust, though you’ll probably have to compile and install it yourself. Once it is built, set up and start a knowledge base:

  $ 4s-backend-setup myKB
  $ 4s-backend myKB
  $ 4s-httpd -p 8000 myKB

Make some RDF data in Turtle/N3, e.g.:

  $ cat test.ttl
  <http://foo.com/sw/Subject> <http://foo.com/sw/verb> 
      <http://foo.com/sw/Object> .
  <http://foo.com/sw/Subject2> <http://foo.com/sw/verb2> 
      <http://foo.com/sw/Subject> .

Convert it to RDFXML and load it:

  $ rapper -i turtle -o rdfxml test.ttl > test.rdf
  $ curl -T test.rdf http://localhost:8000/data/test.rdf

Check that 4store is working by visiting the built-in GUI at http://localhost:8000/test/ in a browser and running a query.
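The same check can be done from the command line by posting a query to the SPARQL endpoint, the way the rdfserver script does (this assumes 4store is running on port 8000, as above):

```
$ curl -d 'query=SELECT * WHERE { ?s ?p ?o } LIMIT 5' \
    http://localhost:8000/sparql/
```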

SQL

You can also manufacture RDF from any database. If you have a table:

  id      field1  field2
  ======  ======  ========
  Thing1  Data1   "Data 2"
  Thing2  Data3   "Data 4"

an SQL query of:

  SELECT CONCAT('<http://foo.com/sw/', id ,'> <http://foo.com/sw/field1> 
      <http://foo.com/sw/', field1 ,'> .') FROM table1 ;
  SELECT CONCAT('<http://foo.com/sw/', id ,'> <http://foo.com/sw/field2>
      "', field2, '" .') FROM table1 ;

These will generate Turtle/N-triples, which can be converted to RDFXML. Note that even if the output looks like ntriples, it’s safer to use -i turtle, to allow for UTF-8 characters (not allowed by ntriples). In our rdfserver script, replace section 2 with:

    cmd = "mysql -u user1 -ppassword1 myDB < query.sql | \
           rapper -q -i turtle -o rdfxml - 'http://foo.com/sw/'" ;
    RS = "\x04" ;  cmd | getline rdfdata ;  close(cmd) ;

Or... you could generate Turtle/N-triples from a plain text data file with another gawk script.
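For instance, here is a sketch of such a script, run over a tab-separated dump of the example table (the file name and separator are assumptions):

```shell
# Hypothetical tab-separated dump of table1: id, field1, field2
printf 'Thing1\tData1\tData 2\nThing2\tData3\tData 4\n' > table1.tsv

# awk one-liner emitting one triple per field, mirroring the SQL CONCAT queries
ttl=$(awk -F'\t' '{
  printf("<http://foo.com/sw/%s> <http://foo.com/sw/field1> <http://foo.com/sw/%s> .\n", $1, $2)
  printf("<http://foo.com/sw/%s> <http://foo.com/sw/field2> \"%s\" .\n", $1, $3)
}' table1.tsv)
echo "$ttl"
```

The result can be piped through rapper exactly as in the SQL variant above.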

Voilà

Pretty straightforward, hey? Give it a try, and then go ahead and start serving your data as RDF, and become a node in the Semantic Web.