(See this classic article on ‘How to Publish Linked Data on the Web’)
While URIs can refer to document fragments (e.g.,
http://foo.com/doc.rdf#MyThing), for large datasets it is more
efficient to be able to dereference the URIs on the fly, returning
only the requested data in a dereferenced document (e.g.,
http://foo.com/sw/MyThing dereferences, via a 303 redirect, to
http://foo.com/sw/MyThing.rdf, and that file contains a selection of
data about http://foo.com/sw/MyThing only).
I myself thought it would be complicated to build a URI dereferencer,
and I am sure that the worry of having to set up a server and
dereferencer is a significant barrier to getting more institutions and
projects to host their data via the Semantic Web, with permanent,
digital, dereferenceable identifiers for their resources, e.g., for
museum collections (see the latest of Rod Page's calls for this).
However, I've found it's actually pretty simple to build a URI
dereferencer, though it took a fair amount of googling and trial and
error. So, in an attempt to persuade others it's not hard, here's a basic
(ca.) 10-line dereferencer, in the form of a gawk script (called here
rdfserver):
#!/usr/bin/gawk -f
BEGIN{
# 1. Get URI and method arguments
id = gensub(/^id=([^&]+)&method=.+$/,"\\1","G",ENVIRON["QUERY_STRING"])
method = gensub(/^id=[^&]+&method=(.+)$/,"\\1","G",ENVIRON["QUERY_STRING"])
# 2. Query 4store with SPARQL
cmd = "curl -d 'query=CONSTRUCT { <http://foo.com/sw/" id \
"> ?p ?o . ?s2 ?p2 <http://foo.com/sw/" id \
"> . } WHERE { <http://foo.com/sw/" id \
"> ?p ?o . OPTIONAL { ?s2 ?p2 <http://foo.com/sw/" id \
"> } }' http://localhost:8000/sparql/"
RS = "\x04" ; cmd | getline rdfdata ; close(cmd) ;
# 3. Return RDF in either RDFXML or as Turtle in plain text
if (method == "rdf") {
print "Content-type: application/rdf+xml\n\n"; print rdfdata }
else {
print "Content-type: text/plain\n\n*** Turtle serialization of <http://foo.com/sw/" id "> ***\n" ;
print rdfdata | "rapper -q -i rdfxml -o turtle - 'http://foo.com/sw/'" ;}
exit;
}
This suggestion comes with no warranties, may be a security risk,
blah, blah, etc., but it does its job and can easily be modified to be
more functional, secure, or pretty. Why gawk? Because it's the best
scripting language hands down! Fast to prototype, fast to execute,
pre-installed everywhere, simple to learn and surprisingly
powerful. 'Nuff said. If you have a problem with awk, it'll be
trivial to rewrite this in perl or python. The following assumes a
basic knowledge of unix scripting.
As written, the dependencies are: the Apache web server, the 4store
triplestore, and rapper from the Redland suite. And of course, gawk.
This script can be placed anywhere in a website's hierarchy, but is
set up here to work in the root web directory. Aside from the script,
you need a few lines in your root .htaccess file (assuming an Apache
web server):
Options +ExecCGI
<Files rdfserver>
SetHandler cgi-script
</Files>
RewriteEngine On
RewriteBase /
RewriteCond %{HTTP_ACCEPT} application/rdf\+xml
RewriteRule ^sw/([^\.]+)$ sw/$1.rdf [R=303]
RewriteRule ^sw/([^\.]+)$ sw/$1.txt [R=303]
RewriteRule ^sw/([^\.]+)\.rdf$ rdfserver?id=$1&method=rdf
RewriteRule ^sw/([^\.]+)\.txt$ rdfserver?id=$1&method=txt
These lines tell the webserver to:
- let the rdfserver file act as a CGI script;
- use the web root, /, as the prefix for the rewritten addresses;
- respond to a request for sw/Thing (with no period in the Thing) with
a 303 redirect to sw/Thing.rdf, if the client's Accept header asks for
application/rdf+xml;
- otherwise respond to a request for sw/Thing with a 303 redirect to
sw/Thing.txt;
- when sw/Thing.rdf is asked for, call the script rdfserver with two
parameters, the first being the name of the resource, the second rdf;
- when sw/Thing.txt is asked for, call the script rdfserver with two
parameters, the first being the name of the resource, the second txt.
All other addresses (e.g., sw/Thing.html) will lead to a 404 NOT
FOUND error.
Next, place the script in the same top directory, and make it
executable (chmod a+x rdfserver). Make sure it is writable only by
the user, and that its directory is writable only by the user, or else
you may get a suexec error.
The script itself is pretty self-explanatory, given a basic
understanding of general scripting/programming constructs:
- Gawk reads environment variables through the ENVIRON[] array, and
the webserver makes the query string available to the script via the
variable $QUERY_STRING. This is parsed with a regexp in gensub().
- Using the getline construct, the script passes a microscript to the
system (in this case a simple query to the 4store SPARQL endpoint),
and reads the data back into the variable rdfdata. The 'RS = "\x04"'
is to make sure the data are read as a single record. The SPARQL query
used here asks for all triples with the queried resource as subject,
and, if there are any, all triples with the queried resource as
object. The designer of a dereferencer is free to return any or all
useful triples about a subject. As a rule, it's also good to return
some metadata about the document (Thing.rdf) itself, such as its
modification date.
- After printing the appropriate Content-type header, the data are
sent back to the browser agent. Since RDFXML is hard to read, a call
to rapper is made to convert it back to Turtle before it is served as
plain text.
The easiest way to access the RDF data is with a triplestore. I have
found 4store to be the fastest, simplest and most robust, but you'll
probably have to compile and install it yourself. Once you have built
it, set up and start a knowledge base:
$ 4s-backend-setup myKB
$ 4s-backend myKB
$ 4s-httpd -p 8000 myKB
Make some RDF data in Turtle/N3, e.g.:
$ cat test.ttl
<http://foo.com/sw/Subject> <http://foo.com/sw/verb>
<http://foo.com/sw/Object> .
<http://foo.com/sw/Subject2> <http://foo.com/sw/verb2>
<http://foo.com/sw/Subject> .
Convert it to RDFXML and load it:
$ rapper -i turtle -o rdfxml test.ttl > test.rdf
$ curl -T test.rdf http://localhost:8000/data/test.rdf
Check that 4store is working by visiting the built-in GUI at
http://localhost:8000/test/ in a browser and running a query.
You can also manufacture RDF from any database. If you have a table:
id field1 field2
====== ====== ========
Thing1 Data1 "Data 2"
Thing2 Data3 "Data 4"
an SQL query of:
SELECT CONCAT('<http://foo.com/sw/', id, '> <http://foo.com/sw/field1>
<http://foo.com/sw/', field1, '> .') FROM table1 ;
SELECT CONCAT('<http://foo.com/sw/', id, '> <http://foo.com/sw/field2>
"', field2, '" .') FROM table1 ;
will generate Turtle/N-triples, which can be converted to RDFXML. Note
that even if it looks like N-triples, it's safer to use -i turtle to
allow for UTF-8 characters (not allowed by N-triples). In our
rdfserver script, replace section 2 with:
cmd = "mysql -u user1 -ppassword1 myDB < query.sql | \
rapper -q -i turtle -o rdfxml - 'http://foo.com/sw/'" ;
RS = "\x04" ; cmd | getline rdfdata ; close(cmd) ;
Or... you could generate Turtle/N-triples from a plain text data
file with another gawk script.
Pretty straightforward, hey? Give it a try, and then go ahead and start serving your data as RDF, and become a node in the Semantic Web.