Recent Event Highlights: BUSCADORES SEMÁNTICOS: HAKIA FRENTE A GOOGLE (Parte 5), JISC calls for Linked Data projects… Talis can help, RDF meets NoSQL, Semantic Search Workshop 2010… and a bit of competition, Richard Stirling Talks about data.gov.uk, Putting the Semantic Web to work in e-Commerce with GoodRelations, and 472 more...
Created by mlang on Oct 29, 2008
Last updated: 03/11/10 at 04:00 PM
This final article on Hakia takes a closer look at Google's
network advantages over Hakia and discusses Hakia's innovative
tool for web advertising (CONTEXTA). It also explains how Hakia,
and semantic search engines in general, can help prevent the
opportunistic traffic that many web pages obtain by using
misleading words.
From a technical point of view, experts distinguish between
direct and indirect network advantages. Without going into detail,
the former correspond to situations in which the value of a
good or service increases immediately as the number of nodes it
can communicate with grows, as is the case with e-mail or the
XML standard (which eliminates syntactic interoperability problems
in document exchange). Another example of a direct network
advantage is associated with Google's AdSense contextual
advertising service, which uses the search engine's technology
to determine which ads are relevant to a small website and to
publish them. As AdSense has attracted advertisers, it has become
more valuable to more websites, which have subscribed to the
service, and this in turn has attracted more advertisers. And so
on…
Indirect network advantages are positive side effects of many
people using the same good or service; they may correspond to
effects in which the market plays a part or to the existence of
complementary goods or services.
Some examples of these indirect advantages are economies of
scale (as the number of users of a network grows, the price of the
product or service falls), user learning (if a network's expert
users share their knowledge with new users, the network will
expand rapidly), and users' reluctance to replace goods or
services they have become familiar with. In the case of software,
indirect network advantages mean that the number of users of an
operating system determines whether hardware manufacturers bring
compatible devices to market (complementary services); likewise,
software companies develop applications that run on the most
popular operating systems. In the case of video game consoles,
indirect network advantages mean that the most popular consoles
have more games available (complementary services), as well as
more magazines and websites devoted to them (user learning).
Google's main network advantages as a search engine are its
large number of users (which leads, for example, many companies to
advertise on it and therefore attracts more users), its users'
capacity to learn (combined with their inertia, which makes them
unlikely to switch to other search engines), and the value of a
well-known and highly regarded brand, both among users and in the
financial markets. In recent years the value of the Google brand
has overtaken that of companies such as General Electric,
Microsoft, Coca-Cola, IBM and Marlboro (bankruptcies and state
bailouts are not good for brands, which is why the brand value of
American and British banks has plummeted and I do not consider
them here). This perceived value of the Google brand means that
the media report on its new products, that its users are loyal to
the brand, and that many of them adopt whatever tool it brings to
market (Gmail, Google Maps, mobile phones, etc.). In fact, those
tools have been made possible by the search engine's indirect
network advantages.
Even so, Hakia can compete with Google on the quality of its
results, and that quality matters a great deal to users. When
Google started out, it returned barely more than a few hundred
results for searches that today would return hundreds of thousands
or millions, and Yahoo! was practically synonymous with search.
Google won the battle against Yahoo! by giving more accurate and
faster results; that is, by being the best at what users wanted.
What use is a polished graphical interface or wonderful customer
service if what the service's consumers want is fast searches?
Google takes a horizontal approach: it tries to be in every
sector, as if all the world's information could be reached through
the know-it-all white horizontal rectangle beneath six brightly
coloured letters. This approach can coexist perfectly well with
search engines that specialise in particular sectors and provide
precise, reliable answers, as Hakia does in the field of health
(http://health.hakia.com/).
The key to a search engine's long-term profitability is
advertising. Including sponsored links in Google's results has
earned the company billions of dollars. A semantic search engine
such as Hakia can make the advertising shown to users a better
match for what might interest them. Hakia is already changing web
advertising with commercial tools such as CONTEXTA, which analyses
web pages and extracts keywords that represent the meaning of the
page, together with their corresponding scores.
If CONTEXTA captures a significant niche of the advertising
market, basing web positioning on keyword-driven search engines
such as Google or Yahoo! will soon be as obsolete as measuring
audiences for black-and-white televisions.
For the time being, the popular blog ReadWriteWeb uses CONTEXTA,
configured to compare on the fly the semantic content of each post
with the criteria set by its sponsors, in order to show ads that
are relevant to readers. ReadWriteWeb's sponsors have supplied in
advance a few phrases that define their products or services.
In addition, Hakia can reduce the opportunistic traffic that many
pages obtain by inserting text that actually has nothing to do
with their content, and many users and organisations would be
willing to pay for a search engine that saves them from wasting
time and guarantees that its results are reliable. For example,
http://www.pauklein.com/experimento-captar-trafico-oportunista/
offers an interesting, properly quantified example of the
opportunistic traffic generated through Google by including in a
marketing blog the lyrics and title of a song by Billie The Vision
And The Dancers that was as catchy as it was fleetingly popular in
Spain.
This traffic exists only because Google is a conventional
keyword-based search engine, not a semantic one, and anyone can
fool it by including fashionable or popular words in their web
page. If Hakia were already available in Spanish, it would detect
that words such as "Estrella", "Damm", "Formentera" or
"Mediterráneamente" bear no similarity to the concepts denoted by
terms such as "online marketing" or "internet positioning". It
would therefore either not show the post "Anuncio Estrella Damm
Formentera 2009 Mediterráneamente" when someone searched for
information on marketing (it would consider the site hosting it
unreliable) or assign it a very low position in the list of
results.
As users, it is in our interest that search engines such as Hakia
or Bing compete with Google. When one company dominates a market,
the need to invest in improving its products diminishes or
disappears, and users end up with fossilised products or one-sided
commercial practices. Some companies seem to think "Why treat the
customer well, if we are alone in the market and nobody can even
scratch our market share?" or, more bluntly, "Pay up and keep
quiet: nobody else can help you".
In my opinion, we have grown used to treating search engines as
if they were vending machines: you enter a few keywords, press a
button and collect the results (except that sometimes you ask for
a can of Coca-Cola and get a bottle of Dom Pérignon, and other
times a glass of foul, rusty water). In reality, we should think
of them as specialist librarians to whom detailed, precise
questions ought to be put. An example: if you enter the sentences
"En este mundo vil, nada es gratuito. Todo se expía: el bien, como
el mal, se paga tarde o temprano. El bien mucho más caro,
lógicamente" ("In this vile world, nothing is free. Everything is
atoned for: good, like evil, is paid for sooner or later. Good
costs much more, naturally") into a search engine, you should get
results about the author's biography, his work, the halting style
of his prose, his influence on other writers, and so on. At
present, what you get from Google is some 5,400 results, of which
only about 10 relate to the author and only 3 give information
about him, very incompletely and without citing recognised or
reputable sources.
For now, Hakia is a showcase for what the future holds for
semantic search. A lovely showcase, it must be said.
http://www.wshoy.sidar.org/index.php?2010/03/09/50-buscadores-semanticos-hakia-frente-a-google-parte-5
Martin Hepp and Jamie Taylor answer questions about the
GoodRelations vocabulary in a podcast conversation, exploring
opportunities to enrich the way in which we compare goods and
services.
http://blogs.zdnet.com/semantic-web/?p=356
To quote Prime Minister Gordon Brown in his opening speech today
at the Global Investment Conference 2010, “from today you will
be able to identify centres of excellence at the click of a
button”. Obviously, in a speech on a global stage, a
Prime Minister cannot go into detail, but he was referring to a
project delivered in super-quick time to the UK Government,
which is launched today – The Research Funding
Explorer.
It was less than a month ago when the Department for Business, Innovation and
Skills (BIS) asked us at Talis if we could use the Linked Data
principles and practice demonstrated in our
work with data.gov.uk to
produce an application to help them with their mission.
Specifically, they wanted a way to demonstrate to those looking to
invest in the UK where the centres of excellence are located.
You can’t beat the focus of a fixed delivery date to stimulate
innovation. So when we were asked not only to come up with
the pilot for a real application but also to have it ready in two
weeks, in time for the preparation of the Prime Minister’s
Conference, the team behind it were filled with challenge and
trepidation in equal measure – especially as at this time we hadn’t
had a close look at the data.
They wanted something that could join the list of applications
on the data.gov.uk Apps List
and show how Linked Data from several sources could be brought
together to deliver real benefit in a way that each source alone
could not. The data originated from organisations such
as the Technology Strategy
Board, the Medical Research
Council, the Engineering and
Physical Sciences Research Council, and the Intellectual Property Office, mostly in the
form of large spreadsheets. The data was extracted from
these, transformed into RDF and loaded into the Talis Platform, utilising URIs
for concepts that will be compatible with the rest of the RDF to be
found on data.gov.uk. With great input from visualisation
developers at Iconomical, the Research Funding Explorer was
born.
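The spreadsheet-to-RDF step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the column names, the `http://example.org/research/` base URI and the `fundedBy` and `amountGBP` predicates are invented for the example and are not the vocabulary the project actually used.

```python
import csv
import io
from urllib.parse import quote

# Hypothetical base URI and predicates, invented for this sketch.
BASE = "http://example.org/research/"
FUNDED_BY = BASE + "def/fundedBy"
AMOUNT = BASE + "def/amountGBP"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def slug(text):
    """Turn a spreadsheet label into a URI-safe path segment."""
    return quote(text.strip().lower().replace(" ", "-"), safe="-")


def rows_to_ntriples(csv_text):
    """Yield N-Triples lines for each row of a spreadsheet export."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        project = "<" + BASE + "project/" + slug(row["project"]) + ">"
        funder = "<" + BASE + "org/" + slug(row["funder"]) + ">"
        amount = int(row["amount"])
        # One triple linking the project to its funder (a URI object).
        yield f"{project} <{FUNDED_BY}> {funder} ."
        # One triple carrying a typed literal for the amount.
        yield f'{project} <{AMOUNT}> "{amount}"^^<{XSD_INTEGER}> .'


# Made-up sample row, standing in for one line of a large spreadsheet.
SAMPLE = (
    "project,funder,amount\n"
    "Plastic Electronics Pilot,Technology Strategy Board,250000\n"
)

triples = list(rows_to_ntriples(SAMPLE))
for line in triples:
    print(line)
```

Minting the funder as a URI rather than a string is what lets rows from different spreadsheets join up once they are loaded into the same store.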
In the limited time available it was
not possible to ingest and display data for all research topics, so
four that demonstrate the UK’s investment in leading technologies
were chosen: RFID,
Advanced
Composites, Regenerative
Medicine, and Plastic
Electronics. Running the animation on the home page of
the site clearly shows the funding hot spots for these topics of UK
research. Zooming in to the map shows the location of
the organisations involved. The graph on the visualisation tracks
the national cumulative investment in these subjects, overlaid with
an indication of the number of patents granted for each.
The obvious wow of this application is the visualisation, but
the real power of storing this data in RDF, and using SPARQL to
query it, becomes apparent when you start navigating it via
the subjects,
regions, and
organisations,
seamlessly following the associations between them. For a
quick whiz through what the application is capable of, check out
the short screencast in the original post.
At the moment the data is all stored within a single Talis
Platform store (if you are at home with SPARQL, check
it out here). Over the next couple of weeks this data will be
made available via stores accessible through data.gov.uk so that it can
be used to drive other innovative applications.
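For readers less at home with SPARQL, here is a rough sketch of how a client might ask such a store which organisations work on a topic and where they are. The endpoint address and the `ex:` vocabulary are assumptions made up for the example, since the post does not give the store's real URL or predicates. The only thing taken from the standard is that the SPARQL protocol accepts a query over HTTP GET in a `query` parameter.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; the real store's address is not given in the post.
ENDPOINT = "http://example.org/stores/research/sparql"

# Hypothetical vocabulary (ex:subject, ex:carriedOutBy, ex:locatedIn).
QUERY = """
PREFIX ex: <http://example.org/research/def/>
SELECT ?org ?region
WHERE {
  ?project ex:subject ex:PlasticElectronics ;
           ex:carriedOutBy ?org .
  ?org ex:locatedIn ?region .
}
LIMIT 10
"""


def build_query_url(endpoint, query):
    """Return a GET URL that passes the query in the `query` parameter."""
    return endpoint + "?" + urlencode({"query": query.strip()})


url = build_query_url(ENDPOINT, QUERY)
print(url)
```

The sketch stops at building the request URL so that it runs without network access. Fetching it with any HTTP client and an `Accept` header for a SPARQL results format would be the next step.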
This is only a start, but already this project has demonstrated
that publishing data as Linked Data in a queryable store can
stimulate innovation beyond the ubiquitous demo mashup towards real
full-blown applications that can deliver commercial benefit.
http://feedproxy.google.com/~r/Nodalities/~3/BtEiMIIpkpI/linked-data-visualisation-launched-at-prime-ministers-conference.php
During the implementation phase
of the Rule Interchange Format (RIF), the Working
Group discovered a problem with the design of the Production
Rules Dialect. This problem is addressed with a new Last Call Working
Draft that changes the way actions are handled to more closely
match existing production rule engines. Please send comments and
RIF implementation reports to public-rif-comments@w3.org.
http://www.w3.org/blog/SW/2010/02/21/rif_production_rules_dialect_revised_las
I’m not in the least bit excited about the iPad, and it seems
I’m not alone. The mood seems to have changed since before the
launch, with countless tech journalists previously falling over
themselves to declare tablets the next big thing.
(Thankfully Rory Cellan-Jones from the BBC was more measured,
focusing on personal projectors as a more exciting development).
The mood since is considerably more downbeat, and I think more
realistic.
I may be missing some crucial usage context that reveals the
killer characteristics of the iPad, but I’ve tried really hard and
still nothing. There are many obvious practical issues with the
device:
it’s too big for a pocket, but not sufficiently more useful
than an iPhone or an HTC Hero.
it’s about the same size as a compact laptop, but with less
scope for comfortable rapid input.
it’s probably too big to cradle comfortably in my hand for
prolonged periods, and sitting with one ankle on the other knee is
not always practical.
The only scenarios I can conjure up where I could imagine using
the device are:
showing people my holiday photos.
reviewing design proofs without needing to print them out.
Neither of these, nor even both together, is very compelling. TVs
are getting good for viewing photos, by including e.g. an SD card
slot, and rumours of the death of paper are greatly exaggerated.
Perhaps the most annoying thing about the scenarios used to
promote the device is the one about the San Francisco to Tokyo
flight, watching video all the way without running out of battery.
Any airline with planes worth boarding has personal video screens.
I don’t want to bring my own. I’d rather use the space to carry a
decent pair of noise-canceling headphones, which I’m sure increase
my enjoyment of onboard media far more than a little bit of extra
screen real estate. The development I want to see is not a new
device that I have to prop on the flimsy airline table, hold tight
when we hit some turbulence, and stow away when my food arrives,
but the capability to connect my own device to the in-built screen
via USB or Bluetooth. Even a bare USB port with power but no
connectivity would be a start, allowing me to run low-powered
devices (that I already own) during long flights.
OK, so the flight reference is just a touchstone for how long
the device can run without mains power, but I think it demonstrates
a lack of grounding of the device in realistic scenarios.
Any new device has to have two key characteristics these days
for me to get excited: interoperability and
convergence. The iPad seems to have very little of either.
You could argue that it offers some convergence between smartphones
and e-readers, but that’s about as exciting as convergence between
a smartphone and a wall clock.
I’m left wondering what the iPad is competing against? I’m
guessing it’s paper, whether that’s in the form of a book,
brochure, newspaper, restaurant menu or whatever. Unfortunately for
Apple, paper is pretty well suited to each of these, especially
when you introduce bath water, the risk of theft, or just ketchup,
into the equation. Perhaps this is the end of electronic picture
frames as a dedicated device? Probably about time. Maybe the iPad
will make an excellent Spotify console for the living room. Who
knows? Whatever happens I can’t see this becoming a mass-market
product worthy of even a fraction of the hype.
Where I wish that Apple had expended their creative talent was
in addressing the power issue. Not in making sure I could watch 10
hours of back to back video, but in enabling me to spend that
energy in whatever way I choose, powering whichever device I
choose. It drives me crazy that I carry several batteries around,
and short of running my phone off my laptop via USB there is no
interoperability between these power sources. If Apple could
produce a universal power supply that was sleek, sexy, efficient
and interoperable, then I would be interested. Sadly this
doesn’t seem to be the way.
http://tomheath.com/blog/2010/02/wash-down-the-apple-tablet-with-a-gulp-of-kool-aid/
There's been a whole lot of discussion around Microdata, a new
approach for embedding machine-readable information into
forthcoming HTML5. What I find most attractive about Microdata is
the fact that it was designed by HTMLers, not RDFers. It's
refreshingly pragmatic, free of other RDF spec legacy, but still
capable of expressing most of RDF.
Unfortunately, RDFa lobbyists on the
HTML WG mailing list forced the spec out of HTML5 core for the time
being. This manoeuvre was understandable (a lot of energy went into
RDFa, after all), but in my opinion very short-sighted. How many
uphill battles did we have, trying to get RDF to the broader
developer community? And how many were successful? Atom,
microformats, OpenID, Portable Contacts, XRDS, Activity Streams
(well, not really), these are examples where RDFers tried, but
failed to promote some of their infrastructure into the respective
solutions. Now: HTML5, where the initial RDF lobbying actually had
an effect and led to a native mechanism for RDF-in-HTML. Yes,
native, not in some separate spec. This would have
become part of every HTML5 book, any HTML developer on this planet
would have learned about it. Finally a battle won. And what a great
one. HTML.
But no, Microdata wasn't developed by an RDF group, so they voted
it out again. Now, the really sad thing is, there could have been a
solution that would have served everybody sufficiently well, both
HTMLers and RDFers. The RDFa group recently realized that RDFa
needs to be revised anyway, there is going to be an RDFa 1.1 which
will require new parsers. If they'd swallowed their pride, they
would most probably have been able to define RDFa 1.1 as a proper
superset of Microdata.
Here is a short overview of RDF features supported by Microdata:
Explicit resource containers, via @itemscope (in RDFa, the
boundaries of a resource are often implicitly defined by @rel or
@typeof)
Subject declaration, via @itemid (RDFa uses @about)
Main subject typing, via @itemtype (RDFa uses @typeof)
Predicate declaration, via @itemprop (RDFa uses @property,
@rel, and @rev)
Literal objects, via node values (RDFa also allows hidden
values via @content)
Non-literal objects, via @href, @src, etc. (RDFa also allows
hidden values via @resource)
Object language, via @lang
Blank nodes
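To make the attribute-to-RDF mapping in the list above concrete, here is a deliberately small sketch built on Python's standard `html.parser`. It is not ARC's extractor and not the algorithm from the Microdata draft: it handles only flat items (no nesting, no `itemref`), treats `@itemtype` as an `rdf:type` statement, and uses property names exactly as written. The sample markup and its example.org URLs are invented.

```python
from html.parser import HTMLParser

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


class MicrodataSketch(HTMLParser):
    """Collect (subject, predicate, object) tuples from flat Microdata items."""

    def __init__(self):
        super().__init__()
        self.triples = []
        self.subject = None    # subject of the most recently opened item
        self.pending = None    # itemprop still waiting for its text value
        self.blank_count = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            # @itemid names the subject; without it, mint a blank node.
            if "itemid" in attrs:
                self.subject = attrs["itemid"]
            else:
                self.blank_count += 1
                self.subject = f"_:b{self.blank_count}"
            if "itemtype" in attrs:
                self.triples.append((self.subject, RDF_TYPE, attrs["itemtype"]))
        if "itemprop" in attrs and self.subject is not None:
            link = attrs.get("href") or attrs.get("src")
            if link:
                # Non-literal object, taken from @href or @src.
                self.triples.append((self.subject, attrs["itemprop"], link))
            else:
                # Literal object: the next non-blank text node.
                self.pending = attrs["itemprop"]

    def handle_data(self, data):
        if self.pending and data.strip():
            self.triples.append((self.subject, self.pending, data.strip()))
            self.pending = None


# Invented sample markup using full property URIs.
SAMPLE = """
<div itemscope itemid="http://example.org/people/alice"
     itemtype="http://xmlns.com/foaf/0.1/Person">
  <span itemprop="http://xmlns.com/foaf/0.1/name">Alice</span>
  <a itemprop="http://xmlns.com/foaf/0.1/homepage" href="http://example.org/">home</a>
</div>
"""

parser = MicrodataSketch()
parser.feed(SAMPLE)
for triple in parser.triples:
    print(triple)
```

Even this toy version shows why the format is attractive: four attributes cover subject, type, predicate and object without any namespace prefixes in the markup.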
I won't go into details why hiding semantics in RDFa will be
penalized by search engines as soon as spammers discover the
possibilities, why reusing RDF/XML's attribute names was probably
not a smart move with regard to attracting non-RDFers, why the new
@vocab idea is impractical, or why namespace prefixes, as handy as
they are in other RDF formats, are not too helpful in an HTML
context. Let's simply state that there is a trade-off between
extended features (RDFa) and simplicity (Microdata). So, what are
the core features that an RDFer would really need beyond Microdata:
the possibility to preserve markup, but probably not
necessarily as an explicit rdf:XMLLiteral
datatypes for literal objects (I personally never used them in
practice in the last 6 years that I've been developing RDF apps,
but I can see some use cases)
Markup preservation is currently turned on by default in RDFa and
can be disabled through @datatype in RDFa, so an RDFer-satisfying
RDFa 1.1 spec could probably just be Microdata + @datatype + a few
extended parsing rules to end up with the intended RDF. My
experience with watching RDF spec creation tells me that the RDFa
group won't pick this route (there simply is no "Kill
a Feature" mentality in the RDF community), but hey, hope dies
last.
I've been using Microdata in two of my recent RDF apps and the CMS
module of (ahem, still not documented) Trice, and it's been a great
experience. ARC is going to
get a "microRDF" extractor that supports the RDF-in-Microdata
markup below (Note: this output still requires a 2nd extraction
process, as the current Microdata draft's RDF mechanism only
produces intermediate RDF triples, which then still have to be
post-processed. I hope
my related suggestion will become official, but I seem to be
the only pro-Microdata RDFer on the HTML list right now, so it may
just stay as a convention):
Microdata:
http://bnode.org/blog/2010/01/26/microdata-semantic-markup-for-both-rdfers-and-non-rdfers
On the serendipitous joy of finding XML.
http://norman.walsh.name/2010/01/25/xml
We’re happy to announce the first Pellet maintenance release of
the 2.0 series, Pellet 2.0.1. This is the first release under our
new release management scheme whereby we’ve committed to four
quarterly point releases (2.1, 2.2, 2.3 and 2.4|3.0) during 2010
and maintenance releases for those versions as necessary.
The 2.0.1 release closes 16 tickets, all of which are detailed
at the Trac
page for this release. Pellet 2.0.1 is available for download.
We’ve also released an updated Pellet Reasoner Plug-in for
Protégé 4 to work with Pellet 2.0.1.
http://clarkparsia.com/weblog/2010/01/11/pellet-2-0-1-maintenance-release/
There have been a number of discussions about “Enterprise Linked
Data” recently, and I took part in a panel on precisely that topic
at ESTC 2009. Unfortunately
the panel was cut short due to time pressures so I didn’t get
the chance to say everything I’d hoped. In lieu of that debate here’s a
blog post containing a few thoughts on the subject.
When we refer to enterprise use of Linked Data, there are a
number of different facets to that discussion which are worth
highlighting. In my opinion the issues and justifications relating
to each of them are quite different. So different in fact that
we’re in danger of having a confused debate unless we tease out
these different aspects.
Aspects of the Debate
In my view there are three facets to the discussion:
Publishing Linked Data: What does an enterprise stand to
gain by publishing Linked Data?
Consuming Linked Data: What does an enterprise stand to
gain from consuming Linked Data?
Adopting Linked Data: What benefits can an enterprise
gain by deploying Linked Data technologies internally?
I think these facets, whilst obviously closely related, are
largely orthogonal. For example I could see a scenario in which an
organization consumed Linked Data but didn’t store or use it as
RDF, but just fed it into existing applications. Similarly
businesses could clearly adopt Linked Data as a technology without
publishing any data to the web, or using any data from it, at all.
These issues are also largely orthogonal to the Open Data
discussion: an enterprise might use, consume and publish Linked
Data but this might not be completely open for others to reuse. The
data may only be available behind the firewall, amongst authorised
business partners, or only available to licensed third-parties. So,
while the issue as to whether to publish open data is a very
important aspect of the discussion, it’s not a defining one.
Here are a few thoughts on each of these different facets.
Publishing Linked Data
So why might an enterprise publish Linked Data? And if that is a
worthwhile goal, then is it clear how to achieve it? Let’s tackle
the second question first as it’s the simpler.
There is an increasingly large amount of good advice available
online, as well as tools and applications, to support the
publishing of Linked Data. We’re making good strides towards
the important transition of moving Linked Data out of the
research area and into the hands of actual practitioners. The
How
to Publish Linked Data on the Web tutorial is a great resource
but to my mind Jeni Tennison’s recent series on
publishing Linked Data is an excellent end-to-end guide full of
great practical advice.
We can declare victory when someone writes the O’Reilly book on
the subject and does for Linked Data what RESTful Web Services
did for REST. (And the two would make great companion pieces.)
But technology issues aside, what are the benefits to an
organization in publishing Linked Data? There are several ways to
approach answering that question but I think in most discussions
Linked Data tends to get compared with Web APIs. The value of
creating an API is now reasonably well understood, and many of the
benefits that come from opening data through an API also apply to
Linked Data.
However the argument that Linked Data married with a SPARQL
endpoint is as easy for developers to use as a Web API is still a
little weak at this stage. SPARQL can be off-putting for developers
used to simpler more tightly defined APIs. As a community we ought
to consider it as a power tool and look for ways to make it easier
to get started with. It’s also worth recognising that a search API
is also a useful addition to a SPARQL endpoint as part of Linked
Data deployment.
But publishing Linked Data can’t be directly compared to just
creating an API, because it’s also largely
a pattern for web publishing in general. It’s increasingly
easy to instrument existing content management systems to expose
RDF(a) and Linked Data. So rather than create a custom API, which
will involve expensive development costs, particularly if it’s going
to scale, it’s possible to simply expose Linked Data as part of an
existing website.
By following the Linked Data pattern for web publishing, in
particular the use of strong identifiers, an enterprise can end up
with a single point of presence on the web for publishing all of
its human and machine-readable data, resulting in
a website that is strongly Search Engine Optimised. Search
engines can better crawl and index well structured websites and are
increasingly ingesting embedded RDFa to improve search results and
rankings. That’s a strong incentive to publish Linked Data by
itself.
Adopting Linked Data, particularly as part of a reorganization
of an existing web presence, could deliver improved search engine
rankings and exposure of content whilst saving on the costs of
developing and running a custom API. The longer term benefits of
being part of the growing web of data can be the icing on the
cake.
Consuming Linked Data
Next we can consider why an enterprise might want to consume
Linked Data.
To my knowledge organizations are currently only publishing
Linked Open Data (albeit with some wide
variations in licensing terms), so we’ll skip for the present
whether enterprises have an option of consuming non-open Linked
Data, e.g. as part of a privately licensed dataset.
The LOD Cloud is still growing and provides a great resource of
highly interlinked data. The main issues that face an organization
consuming this data are ones of quantity (there’s still a lot more
data that could be available); quality (how good is the data, and
how well is it modelled); and trust (picking and choosing reliable
sources).
To some extent these issues face any organization that begins
relying on a third-party API or dataset. However at present a lot
of the data in the LOD cloud is still from secondary sources. The
same can’t be said for the majority of web APIs, which tend to be
published by the original curators of the data.
These issues should resolve themselves over time as more primary
sources join the LOD cloud. Because Linked Data is all based on the
same data model bulk loading and merging data from external sources
is very simple. This gives enterprises the option of creating their
own mirrors of LOD data sources which will provide some additional
reassurances around stability and longevity.
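The claim that merging is simple can be illustrated with a toy model in which a graph is just a set of (subject, predicate, object) tuples, so merging two sources is a set union. The URIs below are invented for the example, and the sketch ignores blank nodes, which a real merge would have to keep distinct between sources.

```python
# Hypothetical identifiers, invented for this sketch.
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
MADE_IN = "http://example.com/def/madeIn"
PRODUCT = "http://example.com/id/product/42"
PLACE = "http://example.org/place/Birmingham"

# Data held inside the enterprise.
internal = {
    (PRODUCT, RDFS_LABEL, "Widget"),
    (PRODUCT, MADE_IN, PLACE),
}

# Data mirrored from an external source; one triple overlaps.
external = {
    (PLACE, RDFS_LABEL, "Birmingham"),
    (PRODUCT, RDFS_LABEL, "Widget"),
}

# Same data model on both sides, so a merge is a union and
# duplicate statements collapse automatically.
merged = internal | external


def objects(graph, subject, predicate):
    """Return the sorted objects of all triples matching subject and predicate."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


# Follow the shared identifier from the internal data into the external data.
place = objects(merged, PRODUCT, MADE_IN)[0]
place_label = objects(merged, place, RDFS_LABEL)[0]
print(len(merged), place_label)
```

No schema mapping step is needed because both sources already use the same identifier for the place. With two unrelated Web APIs, that join would have to be coded by hand.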
Linked Data, with its reliance on strong identifiers, is much
easier to navigate and process than other sources, even if you’re
not storing the results of that processing as RDF. There’s also a
much greater chance of serendipity, resulting in the discovery of
new data sources and new data items. Whereas there is virtually no
serendipity in a Web API as each API needs to be explicitly
integrated.
But this benefit is only going to become evident if we continue
to put effort into helping (enterprise) developers understand how
to consume Linked Data. How to do so, e.g. as part of existing
frameworks or using new data integration patterns, is another area
that needs more attention. The Consuming Linked Data
tutorial at ISWC 2009 was a good step in that direction, although
the message needs to be circulated more widely, outside of the core
semantic web community.
In my opinion it will be easier for enterprises to consume
Linked Data if they first begin to publish it. By publishing data
they are putting their identifiers out into the wild. These
identifiers become points for annotation and
reuse by the community, creating liminal
zones from which the enterprise can harvest and filter useful
data. This is a benefit that I think is unique to Linked Data, as
with a Web API the end results are typically mashups or widgets
displayed in a third-party application; these are just new silos
one step removed from the data publisher.
Adopting Linked Data
Finally, what value could be gained if an organization adopts
Linked Data internally as a means to manage and integrate data
behind the firewall?
The issues and potential benefits here are largely a mixture of
the above, except that there are few or no issues with trust as
all of the data comes from known sources. In a typical enterprise
environment Linked Data as an integration technology will be
compared to a wider range of systems, ranging from integrated
developer tools through to middleware systems. There’s a reason why
SOAP-based systems are still widely used in enterprise IT: most
organizations aren’t (yet?) internally organized as if they were
true microcosms of the web.
It’s interesting to see that Linked Data can potentially provide
a means for solving many of the issues that Master Data
Management is trying to address. Linked Data encourages strong
identifiers; clean modelling; and linking to, rather than
replicating data. These are core issues for data consolidation
within the enterprise. Coupled with the ability to link out to data
that is part of the LOD Cloud, or published by business partners,
Linked Data has the potential to provide a unifying infrastructure
for managing both internal and external data sources.
It’s worth noting, however, that semantic technologies in general,
e.g. document analysis, entity extraction, reasoning and ontologies,
seem to be much more widely deployed in enterprise systems than
Linked Data. This is no doubt in large part because the advantages
of those technologies may currently be much more easily articulated
as they’re more easily packaged into a product.
Summary
In this post I wanted to tease out some of the questions that
underpin the discussions about enterprise adoption of Linked Data.
I’ve presented a few thoughts on those questions and I’d love to
hear your opinions.
Along the way I’ve attempted to highlight some areas where we
need to focus to help transition from a researcher-led to a
practitioner-led community. More data, more documentation, and more
tools are the key themes.
http://www.ldodds.com/blog/2009/12/thoughts-on-enterprise-linked-data/
9 June 1923 — 26 November 2009.
http://norman.walsh.name/2009/12/26/dad
Following the Online
Information Conference 2009, at which I demonstrated live
examples of Linked Data in action, I have been asked several times
if my presentation had been videoed.
Unfortunately it had not. So I have tried to recreate the
presentation, if not the atmosphere, by recording this
screencast. In an attempt to find a quiet, uninterrupted
environment, I recorded this early on a Sunday morning. I
hope therefore you will forgive the odd clink of the first coffee
cup of the day.
A slides version of the presentation is also available on
SlideShare.
Technorati Tags: Semantic Web, Linked Data, online09, govdata,
linkeddata, talis, Talis Platform
http://feedproxy.google.com/~r/Nodalities/~3/cxop5fqhPZI/the-semantic-web-and-linked-data-in-action.php
It may be that I was not attentive enough, i.e., some of these may
be old(er) news. But I did hit two interesting
http://ivan-herman.name/2009/12/12/rdfa-usage-spreading%e2%80%a6/
I recently wrote about
http://webofdata.wordpress.com/2009/12/02/linked-open-data-caching-update/
SuRF 1.0.0 Beta
We are pleased to announce the release of SuRF 1.0.0 Beta. This
version includes some significant changes and improvements to the
interface, hence the major version number shift. SuRF is an
Object-RDF Mapper based on the popular rdflib Python library. It
exposes RDF triple sets as sets of resources and integrates them
into the object-oriented paradigm of Python, in a similar manner to
what ActiveRDF does for Ruby. New features in the 1.0.0 Beta version:
Improved resource querying. Can mix any of these features
together:
filter resources by attribute values
filter resources using SPARQL filter expressions
limit, offset, order ascending/descending
specify graph/context where resources should be loaded from and
later saved to
eager-load resource attributes
Improved attribute querying. All the querying features
available at resource level are also available at attribute
level.
Growing amount of documentation and examples. Still big gaps
there but the situation is improving.
Project Google Code site:
http://code.google.com/p/surfrdf/
Documentation: http://packages.python.org/SuRF/
You are very welcome to try it out, tell us about your experiences,
report bugs and participate!
Contact: Peteris Caune
http://blog.deri.ie/index.php?id=452&no_cache=1&tx_ttnews[tt_news]=592
Yong Yu and
Rudi Studer are editing a special issue of the Journal of Web Semantics on
semantic search that will appear in summer 2010. The
special issue will cover interdisciplinary topics between the
Semantic Web and search. See the
call for papers for a list of relevant topics and details on
how to submit papers, which are due by 20 January 2010.
http://ebiquity.umbc.edu/blogger/2009/11/11/cfp-jws-special-issue-on-semantic-search/
OWL 2, the new
version of the Web Ontology Language, officially became a W3C
standard yesterday. From the W3C press release:
“Today W3C announces a new version of a standard for
representing knowledge on the Web. OWL 2, part of W3C’s Semantic
Web toolkit, allows people to capture their knowledge about a
particular domain (say, energy or medicine) and then use tools to
manage information, search through it, and learn more from it.
Furthermore, as an open standard based on Web technology, it lowers
the cost of merging knowledge from multiple domains.”
http://ebiquity.umbc.edu/blogger/2009/10/27/owl-2-becomes-a-w3c-recommendation/
I realised recently that, while a lot of work has been done on
creating and exploring interesting extensions to the SPARQL query
language, there has yet to be a systematic survey of the range of
different extensions that are currently implemented in various RDF
triplestores. Or if there has been a survey, then I’ve clearly
missed it.
In order to get a better idea of what kinds of extensions are
available I’ve set myself the task of surveying those currently
implemented. I intend to write up and share the results of that
work through this blog.
Rationale
I think that pulling together a list of extensions is a useful
activity which should:
Help researchers and implementors to have a clearer view of
existing work, thereby encouraging further experimentation
Promote convergence on a core set of useful extensions that
could be implemented across a number of triplestores.
Help users to have a clearer understanding of what SPARQL
extensions are currently supported in particular triplestores,
letting them make informed decisions about which extensions to use
when writing and sharing queries
It looks like the SPARQL Working Group may well be adding a
standard library of extension functions into the next revision of
the query language, so the timing of this work should help
contribute to that effort. However, I’m looking beyond their
immediate goals and hope to encourage the implementor community to
explore models similar to the EXSLT
effort, which has been successful in creating a set of
community-designed extensions for XSLT transformations. I see no
reason why the same process can’t be applied to SPARQL
extensions.
Clarity of which extensions are portable across triplestores is
important to allow users to experiment with various triplestore
implementations and services.
If data is going to be truly portable, then this will be an
important consideration.
With that in mind I’ve begun digging into the available
documentation for a number of different triplestores. I’ve decided
to organize my work by surveying each of the three different types
of SPARQL extension.
Types of SPARQL Extension Function
It’s possible to extend the SPARQL query language in any of the
following three ways:
Extension Functions
Property Functions (aka “Magic Predicates”)
Language Extensions
Let’s look at each of these in turn.
Extension Functions
Extension Functions are explicitly described by the current
SPARQL specification under the banner of “extensible
value testing”. The standard
library of extensions that may be added to SPARQL 1.1 will fall
into this category. Extension Functions are simple function calls
that can be used within a FILTER in a SPARQL query to carry out
some specific extra logic that cannot be handled by matching triple
patterns. Examples of extension functions include substring
testing, string concatenation, date tests, etc.
The specification indicates that these extension functions
should have a unique URI, allowing them to be globally identified.
Few engines are publishing useful information at these URIs, but
this seems like it would be a useful thing to do. These URIs should
be grounded in the web too.
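As a rough illustration of the idea, the sketch below models extension functions as callables keyed by their URI and applies one to a set of query solutions the way a FILTER would. It is a toy in plain Python, not how any particular triplestore works; the example.org function URIs and the helper names are assumptions made up for this example.

```python
# Hypothetical registry mapping extension function IRIs to Python callables.
# The example.org IRIs are placeholders, not functions from any real engine.
EXTENSION_FUNCTIONS = {
    "http://example.org/fn#contains": lambda text, part: part in text,
    "http://example.org/fn#concat": lambda *parts: "".join(parts),
}


def call_extension(iri, *args):
    """Look up an extension function by its IRI and call it."""
    if iri not in EXTENSION_FUNCTIONS:
        raise ValueError(f"unsupported extension function: {iri}")
    return EXTENSION_FUNCTIONS[iri](*args)


def filter_solutions(solutions, iri, variable, *extra_args):
    """Keep the solutions (dicts of variable bindings) for which the
    extension function returns a true value, as a FILTER would."""
    return [
        s for s in solutions
        if call_extension(iri, s[variable], *extra_args)
    ]


solutions = [{"name": "Alice Smith"}, {"name": "Bob Jones"}]
kept = filter_solutions(
    solutions, "http://example.org/fn#contains", "name", "Smith"
)
print(kept)
```

Because the function is identified only by its URI, a query using it is portable only to engines that recognise that URI, which is why publishing a description at the URI would be helpful.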
Property Functions
Property Functions (aka “Magic Predicates”, or “Magic
Properties”) are extensions to the triple matching process that is
carried out when a SPARQL query is executed. This means that
property functions don’t appear in a FILTER expression like an
extension function. They instead appear within the graph pattern of
the query. Unlike extension functions, which have a syntax like a
conventional function call, property functions use Turtle syntax
and appear, to the untrained eye, as standard triple patterns.
For example, a property function that could split a resource
URI into a namespace and a localname might look like this in a
SPARQL query:
?uri a rdfs:Class.
?uri ex:splitURI (?namespace ?localname).
In that example the property function
ex:splitURI has as its input each of the URIs that are
bound to the ?uri variable, and as its output binds
the namespace URI and localname of those URIs to two new
variables.
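To make the input and output behaviour concrete, here is a minimal Python sketch of what an engine might do when it meets such a triple pattern. The splitting rule (cut after the last '#' or '/') and the function names are assumptions for illustration; ex:splitURI itself is only a hypothetical property function.

```python
def split_uri(uri):
    """Split a URI after its last '#' or '/' into (namespace, localname)."""
    cut = max(uri.rfind("#"), uri.rfind("/")) + 1
    return uri[:cut], uri[cut:]


def apply_split_uri(solutions):
    """Mimic the property function: read the ?uri binding of each solution
    and add new ?namespace and ?localname bindings to it."""
    extended = []
    for solution in solutions:
        namespace, localname = split_uri(solution["uri"])
        extended.append(
            {**solution, "namespace": namespace, "localname": localname}
        )
    return extended


rows = apply_split_uri([
    {"uri": "http://xmlns.com/foaf/0.1/Person"},
    {"uri": "http://www.w3.org/2000/01/rdf-schema#Class"},
])
for row in rows:
    print(row["namespace"], row["localname"])
```

The point of the sketch is that the "predicate" never matches stored triples; it computes new bindings from existing ones.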
There are other ways to structure the inputs and outputs of a
property function, depending on its purpose, but the important
things to recognise are that:
the property function is written as a conventional triple
pattern
parameters can be passed from either the subject or object
portions of the triple (or potentially both)
similarly, output can be bound to variables that appear in
either the subject or object portions of the triple
one technique for passing multiple parameters or generating
multiple output values is to allow specification of an RDF list in
the object portion of the triple
Property functions are very powerful as they can allow arbitrarily
complex logic to be used to extend the triple matching process. One
common use is to extend the matching process by calling out to
specialised indices or logic, e.g. for full-text indexing or
geospatial functions and reasoning.
It is worth noting that Property Functions are not
explicitly licensed by the current SPARQL specification. The
specification does not describe them at all: they are simply
allowed by the fact that they conform to the overall SPARQL
grammar.
Testing whether a query uses Property Functions would therefore
require a validator (such as the one that Dan Brickley describes)
either to have explicit knowledge of the function, e.g.
based on its URI, or to dereference the predicate URI, with
implementors publishing useful information at those locations so
that the validator can determine whether a specific predicate is
actually a “real” predicate or an extension. I’m not aware of any
implementation that currently does this.
Language Extensions
The final category of SPARQL extensions is extensions to the
language itself. This type of extension involves amending the
grammar of the language to include new operators, keywords, and
types of expression. Examples of this type of extension include
sub-queries and aggregates (e.g. min and max). The forthcoming
SPARQL 1.1 specification will standardise these and a few other
language extensions that have been commonly implemented.
Arguably, if you change the grammar of a language then you’re
creating a new language: “SPARQL plus some extensions”. So some
care needs to be taken with this type of extension if
you want queries to be portable.
In my view, while there is plenty of scope for the community to
collaborate and converge on common extensions of all of the types
I’ve described here, the best place for language extensions to be
formally ratified and agreed on is the SPARQL Working
Group. I personally don’t expect the Working Group to have to, or
want to, sign off on every extension function or property function,
but interoperability is ultimately best served by co-ordinating
language extensions through the Working Group. Naturally this
should happen after the implementor community have had a period of
experimentation and research. This is obviously the process that
has happened to date, and hopefully this will continue as the
language continues to evolve. A bit of collective action ought to
help ensure interoperability in other areas.
A Survey
For my survey of SPARQL extensions I’ve decided to tackle things
in the order in which I have presented them here: I will first look
at Extension Functions, then Property Functions, and then Language
Extensions. For the reasons I’ve already outlined, I
think the community is best served by organizing itself around
standardising the first two of those types of extension, and
Extension Functions seem like the lowest-hanging fruit.
I’m intending to do the survey in as open a way as possible, and
want to ensure that I include as many different implementations as
possible. Having said that, initially I’m going to impose some
editorial control simply to ensure consistency and quality.
Implementors, feel free to drop me a line providing me with
information on your extensions or, preferably, pointers to the
relevant documentation. I’ll also stress that while this survey has
obvious relevance for my day job, this is a personal project,
so things will progress as quickly as I’m able to find some time to
push things forward.
I’m going to send regular status updates to the
public-sparql-dev mailing list as that is the correct place for
further discussion. I’ll also summarize my findings in further blog
posts here. I’ve already begun the process of cataloguing Extension
Functions, as you can see from
my recent email to the mailing list. I still have to include
some additional information helpfully provided by OpenLink and
update the entries for Mulgara to list its support for some of
the EXSLT functions.
One other task I have on my list is to help provide some
guidance on how implementors should publish information about their
SPARQL extensions. It would be useful to have some descriptive
metadata for these available from the relevant URIs. I’m intending
to spend some time at Vocamp DC pulling
together a vocabulary for that purpose. Let me know if you’re
attending and want to collaborate.
http://www.ldodds.com/blog/2009/10/surveying-and-classifying-sparql-extensions/
Jonathan Rochkind describes a very typical decision sequence when
working with MARC data:
Frequently the answer to “How do I get this piece of data I
want” is along the lines of: “Well, it’ll be in this field, UNLESS
this other field is X, in which case it’ll be in field Y, UNLESS
field Y is being used for Z to try to and figure out if Z look at
fixed fields a, b, and c, the different combinations of all three
of which determine that, but there’s no guarantee they’re filled
out correct. Oh, and that’s assuming it’s a post-1972 record, in
older records they did things entirely differently and put the data
over in field N. Oh, and ALL of that is assuming this is AACR2
data, the corpus also includes Rare Books and Manuscripts data, and
those guys do things entirely differently, although it’s still in
MARC, you’ve got to look in this OTHER field for it. First check
fixed field q to see if it’s RBM data, and hope fixed field q is
right. Oh, and don’t forget to check if it’s encoded in UTF-8 or
MARC-8 by checking this other fixed field, which we know is wrong
most of the time.”
For non-librarians, MARC is a structured data format whose
fields have well-defined and rather precise semantics. The missing
pieces are that the data structure is not self-describing and there
is no strategy for discovering the rules (apart from a human
reading the documentation and then encoding the rules in code).
http://iandavis.com/blog/2009/09/being-structured-and-having-semantics-is-not-enough
These days, data provenance is a
big topic across the board, ranging from the linked data web, to RDF in general, to
any kind of data integration, with or without RDF. Especially with
scientific data we encounter the need for metadata and provenance,
repeatability of experiments, etc. Data without context is
worthless, yet the producers of said data do not always have a
model or budget for metadata. And if they do, the approach is often
a proprietary relational schema with web services in front.
RDF and linked data principles could evidently be a great help.
This is a large topic that goes into the culture of doing science
and will deserve a more extensive treatment down the road.
For now, I will talk about possible ways of dealing with
provenance annotations in Virtuoso at a
fairly technical level.
If data comes many-triples-at-a-time from some source (e.g.,
library catalogue, user of a social network), then it is often
easiest to put the data from each source/user into its own graph.
Annotations can then be made on the graph. The graph IRI will
simply occur as the subject of a triple in the same or some other
graph. For example, all such annotations could go into a special
annotations graph.
On the query side, having lots of distinct graphs does not have
to be a problem if the index scheme is the right one, i.e., the
4-index scheme discussed in the
Virtuoso documentation. If the query does not specify a graph,
then triples in any graph will be considered when evaluating the
query.
One could write queries like:
SELECT ?pub
WHERE
  {
    GRAPH ?g
      {
        ?person foaf:knows ?contact
      }
    ?contact foaf:name "Alice" .
    ?g xx:has_publisher ?pub
  }
This would return the publishers of graphs that assert that
somebody knows Alice.
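The shape of that query can be mimicked over an in-memory list of quads. The following is only a toy sketch in plain Python, not Virtuoso's storage or query engine; every identifier in the sample data (the g:, p:, pub: names and the xx:has_publisher property) is made up for illustration.

```python
# Quads are (graph, subject, predicate, object). All identifiers are
# placeholders invented for this sketch.
quads = [
    # One graph per data source.
    ("g:site-one", "p:bob", "foaf:knows", "p:alice"),
    ("g:site-two", "p:carol", "foaf:knows", "p:dave"),
    # Names and per-graph annotations live in a separate annotations graph.
    ("g:annotations", "p:alice", "foaf:name", "Alice"),
    ("g:annotations", "p:dave", "foaf:name", "Dave"),
    ("g:annotations", "g:site-one", "xx:has_publisher", "pub:example-org"),
    ("g:annotations", "g:site-two", "xx:has_publisher", "pub:other-org"),
]


def publishers_of_graphs_knowing(name):
    """Publishers of every graph asserting that somebody knows `name`."""
    # Resources carrying the given name, in any graph.
    named = {s for (_, s, p, o) in quads if p == "foaf:name" and o == name}
    # Graphs containing a foaf:knows statement pointing at such a resource.
    graphs = {
        g for (g, _, p, o) in quads if p == "foaf:knows" and o in named
    }
    # The graph IRI appears as the subject of the annotation triples.
    return sorted(
        o for (_, s, p, o) in quads
        if p == "xx:has_publisher" and s in graphs
    )


print(publishers_of_graphs_knowing("Alice"))
```

The key step is the last one, where a graph identifier is used as an ordinary subject.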
Of course, the RDF reification
vocabulary can be used as-is to say things about single
triples. It is however very inefficient and is not supported by any
specific optimization. Further, reification does not seem to get
used very much; thus there is no great pressure to specially
optimize it.
If we have to say things about specific triples and this occurs
frequently (i.e., for more than 10% or so of the triples), then
modifying the quad table becomes an option. For all its
inefficiency, the RDF reification vocabulary is applicable if
reification is a rarity.
Virtuoso's RDF_QUAD table can be altered to have
more columns. The problem with this is that space usage is
increased and the RDF loading and query functions will not know
about the columns. A SQL update statement
can be used to set values for these additional columns if one knows
the G,S,P,O.
Suppose we annotated each quad with the user who inserted it and
a timestamp. These would be columns in the RDF_QUAD
table. The next choice would be whether these were primary key
parts or dependent parts. If primary key parts, these would be
non-NULL and would occur on every index. The same quad
would exist for each distinct user and time this quad had been
inserted. For loading functions to work, these columns would need a
default. In practice, we think that having such metadata as a
dependent part is more likely, so that G,S,P,O are the
unique identifier of the quad. Whether one would then include these
columns on indices other than the primary key would depend on how
frequently they were accessed.
In SPARQL, one could
use an extension syntax like:
SELECT *
WHERE
  { ?person foaf:knows ?connection
      OPTION ( time ?ts ) .
    ?connection foaf:name "Alice" .
    FILTER ( ?ts > "2009-08-08T00:00:00"^^xsd:dateTime )
  }
This would return everybody recorded as knowing Alice by a quad
inserted after 2009-08-08. This presupposes that the quad table has
been extended with a datetime column.
The OPTION (time ?ts) syntax is not presently
supported but we can easily add something of the sort if there is
user demand for it. In practice, this would be an extension
mechanism enabling one to access extension columns of
RDF_QUAD via a column ?variable syntax in
the OPTION clause.
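The idea of a dependent timestamp column can be tried out with any relational store. The sketch below uses Python's built-in sqlite3 module rather than Virtuoso, so the table name quad, the column ts, and the default value are assumptions for illustration and do not reflect the real RDF_QUAD layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# G,S,P,O remain the unique key; ts is a dependent column with a default
# so that loaders unaware of the column keep working.
conn.execute("""
    CREATE TABLE quad (
        g TEXT, s TEXT, p TEXT, o TEXT,
        ts TEXT DEFAULT '1970-01-01T00:00:00',
        PRIMARY KEY (g, s, p, o)
    )
""")
# A loader that knows nothing about the extra column.
conn.executemany(
    "INSERT INTO quad (g, s, p, o) VALUES (?, ?, ?, ?)",
    [
        ("g1", "p:bob", "foaf:knows", "p:alice"),
        ("g1", "p:carol", "foaf:knows", "p:alice"),
        ("g1", "p:alice", "foaf:name", "Alice"),
    ],
)


def set_timestamp(g, s, p, o, ts):
    """Set the extra column with a plain SQL UPDATE, given G,S,P,O."""
    conn.execute(
        "UPDATE quad SET ts = ? WHERE g = ? AND s = ? AND p = ? AND o = ?",
        (ts, g, s, p, o),
    )


set_timestamp("g1", "p:bob", "foaf:knows", "p:alice", "2009-09-01T10:00:00")
set_timestamp("g1", "p:carol", "foaf:knows", "p:alice", "2009-07-15T09:30:00")

# Who is recorded as knowing Alice by a quad stamped after 2009-08-08?
rows = conn.execute(
    """
    SELECT k.s, k.ts
    FROM quad AS k
    JOIN quad AS n ON n.s = k.o AND n.p = 'foaf:name' AND n.o = ?
    WHERE k.p = 'foaf:knows' AND k.ts > ?
    ORDER BY k.ts
    """,
    ("Alice", "2009-08-08T00:00:00"),
).fetchall()
print(rows)
```

ISO 8601 timestamps stored as text compare correctly as strings, which keeps the sketch free of date parsing.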
If quad metadata were not for every quad but still relatively
frequent, another possibility would be making a separate table with
a key of GSPO and a dependent part of R,
where R would be the reification URI of the quad.
Reification statements would then be made with R as a
subject. This would be more compact than the reification vocabulary
and would not modify the RDF_QUAD table. The syntax
for referring to this could be something like:
SELECT *
WHERE
  { ?person foaf:knows ?contact
      OPTION ( reify ?r ) .
    ?r xx:assertion_time ?ts .
    ?contact foaf:name "Alice" .
    FILTER ( ?ts > "2008-08-08T00:00:00"^^xsd:dateTime )
  }
We could even recognize the reification vocabulary and convert
it into the reify option if this were really necessary. But since
it is so unwieldy I don't think there would be huge demand. Who
knows? You tell us.
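The separate-table variant can be sketched the same way. Again this uses Python's built-in sqlite3 module as a stand-in, so the table names quad and quad_reif, the r: identifiers, and the xx:assertion_time property are placeholders for illustration, not an actual Virtuoso schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE quad (
        g TEXT, s TEXT, p TEXT, o TEXT,
        PRIMARY KEY (g, s, p, o)
    );
    -- Side table: key G,S,P,O with the reification IRI R as dependent part.
    CREATE TABLE quad_reif (
        g TEXT, s TEXT, p TEXT, o TEXT,
        r TEXT NOT NULL,
        PRIMARY KEY (g, s, p, o)
    );
""")
conn.executemany("INSERT INTO quad VALUES (?, ?, ?, ?)", [
    ("g1", "p:bob", "foaf:knows", "p:alice"),
    ("g1", "p:carol", "foaf:knows", "p:alice"),
    ("g1", "p:dave", "foaf:knows", "p:alice"),  # no metadata for this quad
    ("g1", "p:alice", "foaf:name", "Alice"),
    # Statements about quads use the reification IRI as their subject.
    ("g1", "r:1", "xx:assertion_time", "2009-01-05T12:00:00"),
    ("g1", "r:2", "xx:assertion_time", "2007-03-01T08:00:00"),
])
conn.executemany("INSERT INTO quad_reif VALUES (?, ?, ?, ?, ?)", [
    ("g1", "p:bob", "foaf:knows", "p:alice", "r:1"),
    ("g1", "p:carol", "foaf:knows", "p:alice", "r:2"),
])

# Equivalent in spirit to the reify query: foaf:knows quads about Alice
# whose reification IRI has an assertion time after 2008-08-08.
rows = conn.execute(
    """
    SELECT k.s, x.r, t.o
    FROM quad AS k
    JOIN quad_reif AS x
      ON x.g = k.g AND x.s = k.s AND x.p = k.p AND x.o = k.o
    JOIN quad AS t ON t.s = x.r AND t.p = 'xx:assertion_time'
    JOIN quad AS n ON n.s = k.o AND n.p = 'foaf:name' AND n.o = ?
    WHERE k.p = 'foaf:knows' AND t.o > ?
    """,
    ("Alice", "2008-08-08T00:00:00"),
).fetchall()
print(rows)
```

Only quads that actually carry metadata get a row in the side table, so the main quad table stays unchanged.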
http://www.openlinksw.com/weblog/oerling/?id=1572
Jeni Tennison has written a wonderfully
insightful post on RDF and HTML5. She caps it off with two
sentences that every RDF advocate needs to pay attention to:
so why not just stop arguing and use the spare time and energy
doing?
so why not demonstrate RDF’s power in real-world
applications?
That’s all there is to say really. Stop waving hands around
promising things will be great if only you did X. Get out there and
build some things that actually make a difference.
http://iandavis.com/blog/2009/08/rdf-jfdi
I moved ARC's
codebase to a BZR repository
2 months ago but didn't really find the time to synchronize it
with the way I created bundles in the past. Today I finally linked
the repository and its TGZ creation feature from the main download page. This is
the first bundle since March, so there are quite a number of fixes.
Some tweaks were not logged, but from now on, the process should be
more professional (thanks to the proper versioning system).
Here is the raw list of changes; the most interesting are probably
the improved RDFa extractor (cheers to Toby Inkster and Masahide
Kanzaki for code) and the new auto-cleanup of unused values/hashes
in the RDF store. I received a couple more patches, which will be
integrated in the coming weeks:
new component: Resource
new method: completeQuery (PREFIX-injection)
Reader: new method: getResponseHeaders
RDFa: fixes, +3 test case PASSes (thx to Toby Inkster &
Masahide Kanzaki)
Class: auto-populate POST (php5 bug)
Class: refactored *PName methods
new methods: toIndex, toTriples, checkRegex
Parsers: unsetting reader object to fix garbage collection
SelectQueryHandler: improved LIKE-check for
REGEX-rewriting
Class: used prefixes were not logged, leading to serialization
gaps
Class: fixed root calculation bug in calcURI
Class: new methods: toDataURI/fromDataURI
ARC2_SPARQLScriptProcessor: improved automatic PREFIX
injection
ARC2_RemoteStore: added automatic PREFIX injection and
getResourceLabel method
ARC2_StoreSelectQueryHandler: fixed missing brackets in
getExpressionSQL.
Reader: Improved timeout handling
Reader: support for port in http header (thx to Roan
O'Sullivan)
Slowly starting to switch to inline PHPDoc documentation
Atom_Parser: Addition: support for link types
DeleteQueryHandler: Addition: cleanValueTables method
(auto-called every 500 DELETE queries)
Class: new method: resetErrors
Class: switch from getScriptURI to getRequestURI in init()
In related news:
Tuukka Hastrup created an ARC 2 Starter
Pack that simplifies the process of setting up an ARC
store.
Andrew Ritz created a WordPress extension
that lets you embed results from remote SPARQL endpoints directly
in your blog pages.
http://bnode.org/blog/2009/08/21/new-arc2-release
Way back in 2006 I wrote a blog post concerning the modelling of
time in RDF (see
Refactoring Bio With Einstein Part 3: Temporal Invariants). That
post also provoked some discussion in
the blogosphere. Although I haven’t written anything on the
subject for the past three years I haven’t stopped thinking about
it. In fact I’ve been working quite hard on the problem, mainly by
modelling real data, especially geographical information. This is
the first of a series of blog posts describing my experiments. I’d
like to thank Leigh Dodds
and Jeni Tennison
who gave me valuable feedback on an earlier version of this
write-up.
In a comment to my blog post Chris Mungall made an excellent
point about the importance of solving the time problem:
However, it’s also seems clear to me that this is a recipe for
trouble for the semantic web. Surely all real-world data that
concerns non-trivial applications such as science and electronic
health records, or any kind of human activity _must_ take time into
account? Which ever hack you make to account for time, it has to
propagate through all your ontologies. An ontology that treats the
world as time-slices can’t interoperate with one that has a
standard view of objects and processes. It may be just about
workable, but I can’t see it being anything other than tremendously
complicated. We’ll essentially end up with layering 3-place
relations on top of RDF in an extremely inelegant way.
This is not made clear when people are lured into the semantic
web with examples of toy ontologies about pizzas that live floating
in some mathematical space untroubled by time. Unless more is done
to address these issues (and I commend this article for tackling
this) the semantic web will face a huge backlash when people start
realising they have to warp their ontologies and refactor their
instance data to deal with time in order to represent real
entities. Why is there no best practices document on representing
instances that vary in time (that is, all real-world instances)? I
do find it curious that more people aren’t making noise about this
problem – I can only conclude that there’s a dearth of serious
applications using RDF or OWL for instance data.
Those comments are still true today and in fact they are being
accentuated by the wide availability of data brought about by the
successful Linked Data
project. For example, DBpedia, Freebase and GeoNames all have
descriptions of London (in England) and their URIs are all declared
to be owl:sameAs one another:
http://dbpedia.org/resource/London
http://sws.geonames.org/2643743/
http://rdf.freebase.com/ns/guid.9202a8c04000641f80000000000242b2
These descriptions assert population figures for London of
7,355,400, 7,421,209 and 7,512,400 respectively. Since all these
resources are owl:sameAs one another, I have three different
populations for exactly the same thing with no temporal context (to
be fair, both Freebase and DBpedia do attempt to assign a date, but
they both say it’s for the year 2006). Perhaps they are all correct
but were taken at different times, or perhaps they are actually
referring to slightly different definitions of “London”. Whatever
the cause, the effect is that the data is not particularly useful.
It would be helpful for them to indicate when the measurement of
population took place. This is not intended as a criticism of the
LOD project but to demonstrate that simplistic modelling of data
that ignores time can quickly produce unhelpful results.
As another example of the kind of data that I should be able
to model in RDF, consider the city of Istanbul. During its long and
varied history it has been named Byzantium, New Rome,
Constantinople and Stamboul (see Wikipedia’s page
on the names of Istanbul). At various times it has been the
capital city of the Roman Empire, the Byzantine/Eastern Roman Empire
(twice), the Latin Empire, the Ottoman Empire and modern Turkey, and
of course its extent has varied considerably over that period of
time too.
No existing geographic ontology can model that variation in
properties accurately enough for me to write a query to return the
name of that city during the sixth crusade.
My main requirements for modelling time are:
to be able to query the properties of and relations between
entities at any point in time
to be able to sequence data in relative terms such as before,
after and during
not to extend the RDF triple model beyond possibly allowing
named graphs
not to require changes to existing RDF schemas
to avoid duplication of data
In this post I’m going to explore some of the possible solutions
to this modelling problem. I take four main approaches:
Conditions were my invention to model the state of being of an
individual at a point in time (basically time slices like CYC
sub abstractions).
Named graphs, with one graph containing time interval
information about the other graphs.
Reification of all triples and attaching time interval
information to the reified statements.
N-ary relations representing contexts.
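To give a feel for the second of these approaches, here is a small Python sketch in which each named graph holds ordinary triples and a separate mapping records the interval of years during which that graph's contents held. It is only an illustration of the general idea under made-up data (the ex: and g: identifiers and the year 1850 are placeholders), not the modelling worked out in the later parts of this write-up.

```python
# Each named graph holds plain triples (placeholder identifiers).
graphs = {
    "g:before": [("ex:parish", "ex:inCounty", "ex:countyA")],
    "g:after": [("ex:parish", "ex:inCounty", "ex:countyB")],
}
# A separate "metadata graph": the half-open interval [start, end) of
# years in which each graph's contents held. None means unbounded.
validity = {
    "g:before": (None, 1850),
    "g:after": (1850, None),
}


def holds_in(year, interval):
    """True if `year` falls inside the half-open interval."""
    start, end = interval
    return (start is None or start <= year) and (end is None or year < end)


def objects_at(subject, predicate, year):
    """Objects of matching triples in the graphs valid for `year`."""
    return [
        o
        for graph, triples in graphs.items()
        if holds_in(year, validity[graph])
        for (s, p, o) in triples
        if s == subject and p == predicate
    ]


print(objects_at("ex:parish", "ex:inCounty", 1841))
print(objects_at("ex:parish", "ex:inCounty", 1861))
```

The triples themselves carry no dates, which is what keeps existing schemas untouched; all temporal information sits in the description of the graphs.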
I’m going to illustrate the various approaches using three
scenarios all drawn from problems in the genealogy field which
happens to be both an enduring interest of mine and a minefield for
time-insensitive applications:
In the first, a woman is born as “Maria Smith” in 1867 and
marries “Richard Johnson” in 1888. I want to write a SPARQL query
that gives me her name so I can find her in the 1891 census.
For scenario 2, imagine that I discover an ancestor in the 1861
census who claims to have been born in Widford, Gloucestershire.
However, when I check I find three Widfords: one in Essex, one in
Hertfordshire and one in Oxfordshire. Has there been an error in
the census transcription? The explanation is that prior to 1844 the
Oxfordshire Widford was actually in Gloucestershire. I want to
write a SPARQL query that finds out which county the parish was in
when the 1841 census was being taken.
The final scenario is where I have records of the addresses that
a person has lived at. I don’t have precise dates for the moves
between them because the information has been derived from locating
that person in public records. I know, for instance, that in 1870
this person lived in Lyme Regis, Dorset; in 1871 they were in
Charmouth, Dorset; and in 1881 they were in Hastings, Sussex. Given
that information, where is the most likely place to look for them
in 1874? Obviously, in the absence of any other information, I would
start looking in Charmouth and, if that proved fruitless, I would
move on to Hastings. Can I write a SPARQL query to give me that
ordering of possibilities?
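The ordering wanted in the final scenario can be stated precisely outside SPARQL. The sketch below is one reading of it in Python, assuming the rule is "last known place at or before the target year first, then the next known place after it"; the function name and data layout are invented for this illustration, and it says nothing about how the RDF modelling should look.

```python
from bisect import bisect_right


def likely_places(sightings, year):
    """Rank places to search for `year`, most likely first.

    `sightings` is a list of (year, place) records. The last place known
    at or before `year` comes first, then the next place known after it.
    """
    ordered = sorted(sightings)
    years = [y for y, _ in ordered]
    # Index of the first record strictly after the target year.
    i = bisect_right(years, year)
    candidates = []
    if i > 0:
        candidates.append(ordered[i - 1][1])
    if i < len(ordered):
        candidates.append(ordered[i][1])
    # Drop a repeated place while preserving order.
    return list(dict.fromkeys(candidates))


sightings = [
    (1870, "Lyme Regis, Dorset"),
    (1871, "Charmouth, Dorset"),
    (1881, "Hastings, Sussex"),
]
print(likely_places(sightings, 1874))
```

For 1874 this yields Charmouth followed by Hastings, matching the ordering described above; Lyme Regis drops out because a later record supersedes it.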
For completeness, there were approaches that I didn’t consider
in detail:
Temporal RDF introduces a fourth time component to the triple. I
chose not to cover this approach in detail because it extends
the RDF model in a way that no current triple store implements and
it requires a numeric time to be associated with each triple,
preventing relative times from being expressed.
It is worth noting that the scenarios analysed in these posts
are very specialist. Most data modelling is only concerned with
“The Now”. The data and corresponding queries I show in the
following posts are quite convoluted and don’t reflect usual usage
of RDF. This would likely be true of any data representation format
that attempted to model time-varying properties of arbitrary
things.
I have split this write-up into six parts, of which this post is
the first:
Part 1: Introduction
Part 2: Approach 1
Part 3: Approach 2
Part 4: Approach 3
Part 5: Approach 4
Part 6: References
http://iandavis.com/blog/2009/08/time-in-rdf-1
For those of us who have been doing Knowledge Representation for
decades, we judge a modeling tool on its power: How many whiz-bang
shortcuts for complex OWL restrictions or mass editing of similar
items or refactoring does it have? But when we try to get Modeling
to the Masses, or at least Modeling in the Enterprise, we find that
it isn't the power tools that they are interested in. Enterprise
knowledge workers will prefer pretty simple model editing tools.
But they insist that they have strict control over version
governance.
What exactly is version governance? Often the people who want it
aren't quite sure, but they know it when it isn't there. Someone
makes a change to a part of a model on someone else's turf. Someone
wants to try out a long-transaction 'better idea' to see how it
works - but we want to be able to toss it later on if we don't like
it. Or we find something wrong in a category - who changed it?
When? What was the model like when they changed it?
Some of this stuff comes for free when you use a version control
system like SVN or CVS. But these solutions, which are great for
managing versions of Java code, aren't intuitive to a team that is
organizing, say, a vocabulary project. They want something a bit
finer grained (who changed this term?) and with a bit of process
("I can propose a change, but only John can approve it").
That's why the biggest part of TopQuadrant's Enterprise
Vocabulary Management System (EVMS) is a system for
collaborating on model changes. You don't just use the EVMS to
change a vocabulary; you use it to build a sandbox in which you
make your changes. The changes then enter a (configurable)
workflow, where, if they get approved, they are committed to the
shared version. If not, well, then they aren't.
Now, that's pretty cool. After all, it lets teams collaborate on
their vocabulary management, lets them manage territory on a
term-by-term basis, and even provides a process for moving the
changes along. But the thing that I find most cool about this is
that it was all built using the TopBraid Ensemble assembly
platform.
You see, I never got the hang of coding Java, and I'm not really
a programmer. But I like making systems do what I want them to do,
so I am a big scripter. The entire EVMS collaboration control
system is written as a TopBraid Ensemble application.
What does this mean? It means a lot of things, but here is what it
meant for this project. I was talking to a colleague about how
to display the changes that had been made to a vocabulary. He said,
"to my mind, I want to be able to click on a term, and see all the
people who have changed it, and why!" Well, all that information is
modeled in the system - it is just a matter of querying it out with
SPARQL.
In the figure, we see the final step of this. We are looking at
a fragment of the NCI Thesaurus regarding Organisms. The change log
shows a rather silly argument over what we should call lab mice by
two of the taxonomists. Every change was made through the EVMS, so
we can track back the whole story about each term. Adding this to
the system was as easy as writing a SPARQL query and wiring it up
to the display components (a grid in the upper-right and a form in
the lower-right) so that the changes relevant to a chosen term
would be shown.
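As a rough sketch of the kind of query involved (this is not the EVMS data model or the TopBraid API; the predicate names onTerm, madeBy and reason and the sample changes are made up for illustration), the change log can be treated as a list of triples and filtered by the chosen term. The comment shows approximately what the SPARQL might look like under the same assumed vocabulary.

```python
# Hypothetical change-log triples: (subject, predicate, object).
TRIPLES = [
    ("change1", "onTerm", "LabMouse"),
    ("change1", "madeBy", "alice"),
    ("change1", "reason", "use the full name"),
    ("change2", "onTerm", "Organism"),
    ("change2", "madeBy", "bob"),
    ("change2", "reason", "fix typo"),
    ("change3", "onTerm", "LabMouse"),
    ("change3", "madeBy", "bob"),
    ("change3", "reason", "shorter label"),
]

# Roughly the SPARQL this mimics (the vocabulary is an assumption):
#   SELECT ?who ?why WHERE {
#     ?c :onTerm :LabMouse ; :madeBy ?who ; :reason ?why .
#   }


def value(subject, predicate, triples):
    """Return the first object found for (subject, predicate), or None."""
    for s, p, o in triples:
        if s == subject and p == predicate:
            return o
    return None


def changes_for(term, triples):
    """List (who, why) for every change recorded against `term`, in log order."""
    changes = [s for s, p, o in triples if p == "onTerm" and o == term]
    return [(value(c, "madeBy", triples), value(c, "reason", triples))
            for c in changes]


# The rows a grid component could display for the selected term.
history = changes_for("LabMouse", TRIPLES)
for who, why in history:
    print(f"{who}: {why}")
```

Wiring this to a display is then a matter of re-running the query whenever a different term is selected.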
http://dallemang.typepad.com/my_weblog/2009/07/governance-with-topbraid-ensemble.html
And in case you still doubted whether this is a company run by
nerds,
http://tripletalk.wordpress.com/2009/07/28/welcome-the-new-yahoo-homepage/
This post will appear in Nodalities Magazine, Issue 7
by Paul Miller
Much has been written about growing Enterprise use of social
media (usually Twitter, these days) to successfully track and
mitigate customer complaints. Many have been quick to spot that the
disproportionately high cost of satisfying (or, more cynically,
silencing) these early adopters is unlikely to scale effectively as
an increasingly large cohort of customers moves onto these services,
and it must remain an open question as to whether ComcastCares and
its peers can survive any move to the mainstream in recognisable
form.
It appears, though, that Enterprise engagement in the social
sphere changes the game far more significantly than merely enabling
a select few twitterati to jump the Customer Support queue, and
that this change is worth effort and investment in order to ensure
that it does scale. What’s actually happening is that a
relationship is being enabled between a brand and those that Seth
Godin might recognise as its tribe; a relationship in which
interactions are no longer driven predominantly by the desire to
seek redress. Rather than only raising those issues serious enough
for us to have written letters or endured telephone muzak in the
past, we now comment on issues at the periphery of a brand.
Collectively, we’ve moved from simply complaining about the worst
failures of companies, their products and their employees, toward
emitting an impressive stream of FYIs. Individually insignificant,
and possibly unimportant, together these light touches on and
around a brand build into an ever-changing and valuable commentary
that brands and the corporations they front would do well to take
notice of. The minor niggles about an otherwise exemplary service,
the human touches that made us smile, the odd inconsistencies in a
polished persona; none are enough to make us pick up the phone, but
we comment upon them endlessly in Twitter, Facebook, FriendFeed and
elsewhere, and by tapping into this fundamentally honest stream of
consciousness there is much for those about whom we comment to
learn. Good companies probably already know about fundamental
failings in a product long before their customer support operation
melts down under the weight of complaints or their quarterly sales
targets are seriously under-achieved. Do they have as good a handle
on the things we love? Do they have a clue about the minor gripes
of customers outside their pre-launch polling groups? Do they know
about the gut reaction to a colour, a touch, a smell, or a careless
word that persuaded a likely prospect to buy a technically or
aesthetically inferior product from the competition instead? All
this and more is there for the taking in the stream of online
chatter freely directed their way.
Semantic Technologies aren’t often directly associated with the
worlds of Marketing and Commerce, yet individuals such as Eric
Hillerbrand and Scott Brinker are hard at work to show just what
might be possible when the experiences of the Semantic Web are
applied to this space. Brands are no longer owned by the companies
in whose name they were created. Increasingly, ownership of various
forms is being asserted by the multitude of stakeholders with
effort and attention invested in the brand. They care about it,
they care about what it says about them, and they play a clear role
in the brand’s evolution whether its managers want them to or
not.
Brands need to engage in this conversation, as we are beginning
to see them do, but they also need to discover the means to
cost-effectively monitor and engage with a potential flood of third
party reaction whilst using the Business Intelligence tools
available to them in nimbly shaping public opinion to their
advantage wherever possible.
I spoke with Scott Brinker
last year, to explore his—then nascent—views on Semantic Marketing,
and look forward to hearing his latest thoughts at the Semantic
Technology Conference in San Jose in June.
More recently, Eric
Hillerbrand talked about some of his ideas with respect to
‘Social Commerce,’ and the ways in which commercial organisations
might seek to strengthen and exploit relationships with their
customers, aided by a range of semantic technologies.
We’re just beginning to grasp the realities of a world in which
tightly controlled and fiercely guarded brand attributes become
increasingly permeable. For those companies with the confidence and
foresight to loosen their grip, whilst simultaneously exploiting
the wealth of data and new opportunities to engage, there is much
to be gained. For the dinosaurs that hang on to ‘their’ brand in
spite of the world around them, there is everything to lose.
http://feedproxy.google.com/~r/Nodalities/~3/MLoFGDmH0zc/might-semantic-technologies-permit-meaningful-brand-relationships.php
Things have settled down a bit after the birth of baby #2 and
I'm starting to get a bit of time to program again: about an hour a
night. That means I'm thinking a lot about indexing structured data
again.
Here are my most up-to-date thoughts on a model for representing
aggregated structured data which I'm tentatively calling
'BTriples'. I'm writing this down mainly so I can refer to it in
future writing.
The purpose of BTriples is to be an internal model for an OLAP
database such that it can represent structured data from a variety
of popular formats (json, xml, csv, relational) and can index and
query across heterogeneous data sources.
A good candidate for such a model would appear to be RDF, but it
falls short on a couple of counts for my requirements:
The first issue is that in order to represent vanilla data as
RDF there's a certain amount of manual mapping that needs to be
done. You need to come up with a URI scheme for your imported data,
and you then need to do some schema and ontology work so that the
data can be semantically joined with other RDF data. This manual
import overhead removes the ability to do one-click database
imports, which is something I'd like to achieve with my database
tool.
The second issue is that the RDF model has strict semantic
constraints that are difficult to manage over a large set of
disconnected parties. Specifically the RDF model says that "URI
references have the same meaning whenever they occur". This 'same
meaning' is difficult to enforce without central control and makes
RDF brittle in the face of merging data from globally disconnected
teams.
TagTriples was my first
attempt at creating a simplified RDF-like model, but it suffers
from the problem that it can't represent anonymous nodes. This
makes importing tree structures like XML or JSON a tricky exercise
as you need to have some way to generate branch node labels from
data that has none. When I was designing tagtriples I was also
thinking in terms of an interchange format (like RDF). I no longer
think creating an interchange format is important - the world
already has plenty of these.
Btriples is basically my attempt at fixing the problems with
tagtriples. The format is triple based like RDF
and so I borrow a bunch of the terms from the RDF model.
BTriples Specification
The Btriples universe consists of a set of distinct
graphs (think: documents). Each graph consists of an
ordered set of statements. A statement is intended to
convey some information about a subject. Each statement has three
parts: a subject, a predicate (or property) and
an object.
A subject identity is anonymous and is local to the graph. This
means you can't refer to it outside the graph. (This is similar to
a 'blank node' in RDF).
A predicate is a literal symbol (e.g. strings, numbers).
An object is either a literal symbol or an internal reference
to a subject in the same graph.
Example (logical) statements:
// row data
#1 name "Phil Dawes"
#1 "hair colour" Brown
#1 plays "French Horn"
// array
#2 elem "Item 1"
#2 elem "Item 2"
#2 elem "Item 3"
#2 elem "Item 4"
// tree
#3 type feed
#3 entry #4
#4 title "BTriples - a model for aggregating structured data"
#4 content "blah blah ..RDF... blah"
That's it.
Notes:
Btriples is not an interchange format. I have deliberately not
defined a serialization of BTriples.
BTriples graphs are disconnected: Btriples does not define a
method for them to refer to each other.
Perhaps the biggest departure from RDF is that there are no
formal semantics in Btriples. The btriples model cannot tell you if
a subject in one graph denotes the same thing as a subject in
another.
Also the semantic meaning of symbols is not defined by BTriples
and is up to the user to decide. Two identical symbols do not
necessarily 'mean' the same thing.
The statements in a BTriples graph are *ordered*, so you can get
data out in the same order it went in.
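To make the model concrete, here is a minimal in-memory sketch of it. The post defines no API or serialization, so the class and method names (Ref, Graph, new_subject, add, objects) are hypothetical, and a plain list stands in for the ordered set of statements (this sketch does not remove duplicates).

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass(frozen=True)
class Ref:
    """Anonymous subject identity; only meaningful inside its own graph."""
    id: int


Literal = Union[str, int, float]        # literal symbols
Object = Union[Literal, Ref]            # a literal or an internal reference
Statement = Tuple[Ref, Literal, Object]


@dataclass
class Graph:
    """One graph (think: document) holding statements in insertion order."""
    statements: List[Statement] = field(default_factory=list)
    _next_id: int = 1

    def new_subject(self) -> Ref:
        """Mint a fresh anonymous subject local to this graph."""
        ref = Ref(self._next_id)
        self._next_id += 1
        return ref

    def add(self, subject: Ref, predicate: Literal, obj: Object) -> None:
        """Append a statement; order is preserved."""
        self.statements.append((subject, predicate, obj))

    def objects(self, subject: Ref, predicate: Literal) -> List[Object]:
        """Objects for (subject, predicate), in the order they were added."""
        return [o for s, p, o in self.statements
                if s == subject and p == predicate]


# The "tree" example from above.
feed_graph = Graph()
feed = feed_graph.new_subject()
entry = feed_graph.new_subject()
feed_graph.add(feed, "type", "feed")
feed_graph.add(feed, "entry", entry)
feed_graph.add(entry, "title",
               "BTriples - a model for aggregating structured data")

# The "array" example: ordering lets repeated predicates act as a list.
array_graph = Graph()
arr = array_graph.new_subject()
for item in ["Item 1", "Item 2", "Item 3", "Item 4"]:
    array_graph.add(arr, "elem", item)

print(array_graph.objects(arr, "elem"))
```

Nothing here links one graph to another, and two equal symbols in different graphs carry no shared meaning, which matches the notes above.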
I'm not crazy about the BTriples name. Maybe I'll change it.
http://phildawes.net/blog/2009/07/19/btriples-model-for-aggregating-structured-data/
DERI on PlanetRDF
DERI's Blog is now syndicated on PlanetRDF. Hello and welcome to
our new readers! http://planetrdf.com/
New video
on DERI's YouTube channel - one minute inside DERI - #1 What
is DERI? by Stefan Decker
We are pleased to announce a new series of videos, "One Minute
Inside DERI", which aims to give you an
overview of what we are doing in the Institute. In the first
video, Prof. Dr. Stefan Decker presents DERI's goal, research approach,
standardisation activities and industrial partners.
http://www.youtube.com/user/DERIGalway
New publication: "Exploiting Linked Data to Build Web
Applications" Michael Hausenblas
Semantic Web technologies have been around for a while. However,
such technologies have had little impact on the development of
real-world Web applications to date. With linked data, this
situation has changed dramatically in the past few months. This
article shows how linked data sets can be exploited to build rich
Web applications with little effort.
Michael Hausenblas, "Exploiting Linked
Data to Build Web Applications," IEEE Internet Computing, vol. 13,
no. 4, pp. 68-73, July/Aug. 2009, doi:10.1109/MIC.2009.79
http://doi.ieeecomputersociety.org/10.1109/MIC.2009.79
Innovative Enterprise Architecture Education and Training Based
on Web 2.0 Technologies - EATrain2
DERI is a partner in a new eLearning EU project: EATrain2.
EATrain2 kicked off in January 2009. It aims to fulfil the training
and educational needs of students and employees in the public and
private sector regarding Enterprise Architecture (EA), using
innovative problem-based pedagogies and Web 2.0 approaches.
EATrain2 aims at achieving five specific key objectives as
follows:
•   To determine the profile of an Enterprise
Architect by identifying the required competencies (skills,
knowledge and attitudes).
•   To determine the required EA competencies that
university students, public servants and private sector employees
need to acquire in order to become competent Enterprise
Architects.
•   To identify new training and teaching methods
based on Web 2.0 technologies and active, problem-based learning
approaches.
•   To develop a Web 2.0 EA learning platform and
course models for university students and employees. The foreseen
training courses will be created in English and will be translated
and adopted in German, Greek and Polish.
•   To undertake pilot trials for private sector
employees, public servants and university students.
DERI's role in this project is to develop a Web 2.0 EA learning
platform to support the developed methodology and underpin the
training and learning processes. http://www.eatraining.eu/
DERI Tutorial Videos
Most of the DERI Tutorial videos are available at the link
below. The goal of the first
series of DERI Tutorials was to give fundamental
lectures on the core topics of the different research units within
DERI, to bring all researchers up to speed on the
basics.
DERI Tutorial videos:
http://dev.deri.ie/confluence/display/PubVids/Home
http://blog.deri.ie/index.php?id=452&no_cache=1&tx_ttnews[tt_news]=570
Colleagues - one of my blog entries at Nature seems to have hit
a nerve - it has been zinging around the “twittersphere” and I’ve received
a number of responses in private not just commiserating, but
agreeing with the major points. I want to make it clear that
this is solely my own opinion, and it has not been carefully
researched, but given that so many US Semantic Web researchers have
shared the frustration that I express here, I thought I’d share it
on planetRDF as well (Europeans, believe it or not, on this
side of the ocean it is hard to get funding for Semantic Web
research - you have no idea how lucky you are!)
-Jim H
from blog entry: “Why NSF cannot fund high-risk, high-reward
research”
I just got turned down for a grant. That’s nothing new, you win
some and you lose some, and every senior professor has gotten used
to that over time. This time, however, I cannot find it in myself
to just say “oh well” and let it go at that. This time, I think I
need to go public, because I think what happened shows an endemic
problem with the US National Science Foundation and, I hope, points
out some things they could do to fix it.
Click here for the blog entry at Nature.com
http://tw.rpi.edu/weblog/2009/07/13/121/
SemTech 2009, along with W3C's significant participation in it,
is now behind us. Besides catching up on emails, I have spent the
past week reflecting on the enthusiasm, presentations, and flurry
of activities that constituted this year's event in San Jose, 14 to
18 June.
One strong feeling I had while in San Jose, was a sense of /deja
vu/ in the Web world. Stepping back, I realize that 2009 feels a
lot like 1999 when I was consulting with Allaire (remember CFML and
ColdFusion?) and attended their user group meetings teeming with
enthusiastic Web developers with war stories about their successes
and failures bringing Web development servers into organizations of
all types and sizes.
Ten years ago, many enterprises were just getting onto the
"e-commerce bus," having been either eclipsed or inspired by the
likes of innovative Web-centric companies such as Amazon.com and
eBay who launched in 1995, or early-adopter retailers like JCPenney
whose understanding of the catalogue business put them online
faster than many other retailers, or businesses for that matter.
Many mainline companies were in various phases of their Web
evolution in 1999 -- from brochureware to intranets to pilot
customer-facing interactive sites. And keep in mind that ten years
ago, Google was barely two.
In 1999 there was also a wide cross-section of skill sets and
diversity of understanding about what the Web was, how it worked,
and what people and tools to trust to bring one's vision onto the
Web. I remember sitting in focus groups with a number of HTML Web
designers who were impatient with their more senior corporate IT
colleagues who insisted on clear roadmaps, risk assessments and
cost-benefit analyses for the Web-based tools and technology
solutions their companies were considering.
The Java developers, engineers and system architects in other
discussion groups also weren't too keen on the irreverent attitudes
and huge amounts of money being thrown at these young people, who
just a few years earlier were teenagers playing video games at the
arcades. But understanding and trust continued to build, innovation
accelerated, communities with technical skills increased, and
revenues skyrocketed as a direct result of vendors developing and
companies embracing new Web technologies.
We fast forward to 2009 and see similar dynamics with Semantic
Web technologies. There are the early adopters and evangelists who
have already climbed aboard the "RDF-bus," understand what's
possible with W3C's Semantic Web technology standards, and can
point to impressive results in new tools, pilot projects and even
robust deployments within organizations, governments, and
enterprises.
Yet skeptics remain both in terms of understanding the paradigm
shift that the Semantic Web brings, just as the early Web
challenged the status quo, and in the legitimate need for better
tools and long-term architectural considerations for how to
successfully deploy Semantic Web technologies in large
enterprises.
Like the early Web and the W3C standards and subsequent
commercial tools, products and services that enabled its rapid
growth, the W3C Semantic Web stack is highly stable today. The
accelerating uptake of W3C Semantic Web standards, new tools and
applications were part of the buzz at this year's Semantic
Technologies Conference.
In addition to hearing and seeing many new use cases and case
studies, the call for commercialization was clear, as was the
amount of enthusiasm among the technologists doing good and
exciting work. The community's call to publish and link data in RDF
or RDFa is clearly being heard, with The New York Times joining the
ranks of large data holders eager and willing to publish to the
Linked Open Data Cloud.
Finally, the number of Semantic Web communities flourishing in
cities coast to coast across North America and in Europe, is
another healthy sign that the growth and adoption of Semantic Web
technologies has not only "crossed the chasm" (in keeping with
Geoffrey Moore's model), but has spawned strong beachheads of
support among highly skilled technology professionals across
business, industry, and government sectors.
It is my hope that at next year's Semantic Technologies
Conference -- which is changing venues to San Francisco -- we will
point to an even higher coordinate on the adoption curve and see
amazing new results and impact from the use of W3C Semantic Web
technologies. If I were Jean Luc Picard, I would, "Make it so." But
for now, I'll continue in my role of education and outreach for
W3C.... Look forward to seeing many of you throughout the year and
at next year's conference!
http://www.w3.org/QA/2009/06/reflections_on_semtech_2009.html
SemTech 2009 has
come and gone, and it was great. I was concerned—as were
others—that the state of the economy would depress the turnout and
enthusiasm for the show, but it seems that any such effects were at
least counterbalanced by a growing interest in semantic
technologies. Early reports are that attendance was up about 20%
from last year, and at sessions, coffee breaks, and the exhibit
hall there seemed to always be more people than I expected. Good
stuff.
Eric P. and I gave our SPARQL By Example tutorial to a crowd of
about 50 people on Monday. From the feedback I’ve received, it
seems that people found the session beneficial, and at least a
couple of people remarked on the fact that Eric and I seemed to be
having fun. If this whole semantic thing doesn’t work out, at least
we can fall back on our ad-hoc comedy routines.
Anyways, I wanted to share a couple of links with everyone. I
think they work nicely to supplement other SPARQL tutorials in
helping teach SPARQL to newcomers and infrequent practitioners.
SPARQL
By Example slides. I’ve probably posted this link before, but
the slides have now been updated with some new examples and with a
series of exercises that help reinforce each piece of SPARQL that
the reader encounters. Thanks to Eric P. for putting together all
of the exercises and to Leigh Dodds for the excellent space
exploration data set.
SPARQL
Cheat Sheet slides. This is a short set of about 10 slides
intended to be a concise reference for people learning to write
SPARQL queries. It includes things like common prefixes, the
structure of queries, how to encode SPARQL into an HTTP URL, and
more.
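As an illustration of that last point (this is not taken from the slides; the endpoint address is a placeholder), the SPARQL Protocol lets a query be sent as an HTTP GET with the query text URL-encoded in a parameter named query. A minimal sketch using only the Python standard library:

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Placeholder endpoint; substitute a real SPARQL endpoint address.
ENDPOINT = "http://example.org/sparql"

query_text = """SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10"""

# urlencode percent-encodes characters such as '?', '{' and newlines,
# and turns spaces into '+', so the query survives inside a URL.
url = ENDPOINT + "?" + urlencode({"query": query_text})
print(url)

# Round trip: decoding the URL gives back the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
```

The resulting URL can be pasted into a browser or fetched with any HTTP client.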
Enjoy, and, as always, I’d welcome any feedback, suggestions for
improvements, or pointers to how/where you’re able to make use of
these materials.
http://www.thefigtrees.net/lee/blog/2009/06/sparqling_at_semtech.html
Just saw Jim’s post on
What is the Semantic Web really all about?
I have been wondering about this problem too. What is the Semantic Web?
Yesterday I asked a question, “Why do few (or no) Web 2.0 sites provide
hierarchical tagging?”, on LinkedIn
and got some pretty good answers:
http://www.linkedin.com/answers?viewQuestion=&questionID=496785&askerID=14212719
For your convenience, I attached my LinkedIn post at the end of
this blog.
There are two things in the answers that drew my
attention:
* Many do _not_ believe tags, or even hierarchical tags, are
semantic; “semantics” means RDF
or triples, at least to them;
* Some believe that even implementing a hierarchical tagging system
is not easy in engineering or social aspects.
I think these two beliefs, among many other reasons, may explain
in part why the “Semantic Web” is still far from a reality. The
first is about the overestimation of what “semantics” is: triples are
one way to express semantics, but it is questionable whether they are
_the_ way. The second is about the underestimation of
“Web” scale: realizing a knowledge system, even if it is conceptually
“simple”, on the Web can lead to serious scalability
problems, both for machine (can you make
http://tw.rpi.edu/weblog/2009/06/19/i-will-pay-delicious-100-for-hierarchical-tagging/
It’s hardly new for this blog or our community to
cover issues of open access and making information useful for
users. But, what if we were to begin speaking in terms such
as: “A call for transparency,” or subtly replace user
with citizen? With little substantive shift of
core meaning, the whole message becomes one of rights,
responsibilities, and public duty.
I’ve been watching this week as the ember at the heart of this
dialogue has been fanned with air-time on mainstream media, and is
about to receive its fuel. First, UK Prime Minister Gordon Brown
asked Sir Tim Berners-Lee “to help us drive the opening up of
access to Government data in the web over the coming months”
appointing him to a special role advising Parliament. In an
interview
with BBC tech correspondent Rory Cellan-Jones, Sir Tim discussed
his position; explaining that he’s pushing for transparency: “This
is our data. This is our taxpayers’ money which has created this
data, so I would like to be able to see it, please.”
Sir Tim had the audience at the
tech-friendly TED conference chanting “Raw Data Now” back in
February, and he’s now been invited by a sitting government leader
to make this happen.
This week also saw the publication of the
Digital Britain report, outlining Parliament’s plans for a more
connected future. I must admit, for the record, that I haven’t read
all 239 pages of the report (made available via bbc.co.uk), rather,
I’ve skimmed it and read several overviews. The gist seems to be
that the UK plans to invest in the future of its citizens’ internet
connectivity, upgrading existing infrastructure and providing
access where there currently isn’t. This investment will cover both
wired broadband provision (with a stated aim of 2Mbps minimum for
every household) and wireless, encouraging investment in 3G
provision by allowing mobile companies to hold their network
licenses more permanently. It recommends subsidising
development wherever the market can’t provide; seemingly equating
net access with public utilities (The PM further
clarified his thoughts by saying the Internet is as vital as
water or gas). More information on this report can be found on the
summary page at the Guardian,
on twitter: hashtag #digitalbritain,
and Bill Thompson’s tech-centric
overview.
All this week needs is a major announcement of something moving
entirely to cloud-computing to look a bit like the convergence
I blogged about a few days ago ;).
So, what has this incredible week brought us? It’s a
governmental lead on opening up access to data. Their appointment
of TBL makes me think that it’s likely we’ll see more and more
linked-data projects coming from the public sector (not just access
to, but usable, linked data). Over the next few years, the UK plans
to improve its infrastructure and incentivize development on
communications networks, and they’ve begun to use language
suggesting that being part of the network and access to Public data
are rights issues.
Sir Tim spoke, in the interview, about beginning with
low-hanging fruit: pilot schemes which open up data and watch what
happens.
What are you building?
Image: “Sparks”, by Steven Wong via
flickr; Creative
Commons By, Share Alike License
http://feedproxy.google.com/~r/Nodalities/~3/9BQcdl8_NxI/britain2.php
In my latest podcast I talk with Erik Nemeth, a Senior Data
Specialist at the Getty Research Institute.
We discuss Erik’s ideas on the ways in which discipline-specific
databases need to evolve to remain competitive in the discovery of
scholarly literature.
During the conversation, we refer to the following
resources:
Bibliography of the History of Art
Erik’s paper
abstract at the 2009
Semantic Technology Conference
Getty Research Institute
Google Scholar
Thesaurus of Geographic Names
This conversation was recorded on Tuesday 26 May,
2009.
For other Talis podcasts in this Nodalities series, see
here. To
subscribe to updates from all of Talis’ podcast series, see
here.
http://feedproxy.google.com/~r/Nodalities/~3/6UZZ8revJ1Q/erik-nemeth-talks-about-the-place-of-disciplinary-research-databases-in-a-web-20-world.php
BlogTalk 2009
The 6th International Conf. on Social Software
September 1st and 2nd, 2009
Jeju Island, Korea
Overview
Following the international success of the last five BlogTalk
events, the next BlogTalk - to be held in Jeju Island, Korea on
September 1st and 2nd, 2009 - is continuing with its focus on
social software, while
remaining committed to the diverse cultures, practices and tools of
our emerging networked society. The conference (which this year
will be co-located with Lift Asia 09) is
designed to maintain a sustainable dialog between developers,
innovative academics and scholars who study social software and
social media, practitioners and administrators in corporate and
educational settings, and other general members of the social
software and social media communities.
We invite you to submit a proposal for presentation at the
BlogTalk 2009 conference. Possible areas include, but are not
limited to:
Forms and consequences of emerging social software
practices
Social software in enterprise and educational environments
The political impact of social software and social media
Applications, prototypes, concepts and standards
Participants and proposal categories
Due to the interdisciplinary nature of the conference, audiences
will come from different fields of practice and will have different
professional backgrounds. We strongly encourage proposals to bridge
these cultural differences and to be understandable for all groups
alike. Along those lines, we will offer three different submission
categories:
Academic
Developer
Practitioner
For academics, BlogTalk is an ideal conference for presenting
and exchanging research work from current and future social
software projects at an international level. For developers, the
conference is a great opportunity to fly ideas, visions and
prototypes in front of a distinguished audience of peers, to
discuss, to link-up and to learn (developers may choose to give a
practical demonstration rather than a formal presentation if they
so wish). For practitioners, this is a venue to discuss use cases
for social software and social media, and to report on any results
you may have with like-minded individuals.
Submitting your proposals
You must submit a one-page abstract of the work you intend to
present for review purposes (not to exceed 600 words). Please
upload your submission along with some personal information using
the EasyChair
conference area for BlogTalk 2009. You will receive a
confirmation of the arrival of your submission immediately. The
submission deadline is June 27th, 2009.
Following notification of acceptance, you will be invited to
submit a short or long paper (four or eight pages respectively) for
the conference proceedings. BlogTalk is a peer-reviewed conference.
Timeline and important dates
One-page
abstract submission deadline: June 27th, 2009
Notification of acceptance or rejection: July 13th,
2009
Full paper submission deadline: August 27th, 2009
(Due to the tight schedule we expect that there will be no
deadline extension. As with previous BlogTalk conferences, we will
work hard to endow a fund for supporting travel costs. As soon as
we review all of the papers we will be able to announce more
details.)
Topics
Application Portability
Bookmarking
Business
Categorisation
Collaboration
Content Sharing
Data Acquisition
Data Mining
Data Portability
Digital Rights
Education
Enterprise
Ethnography
Folksonomies and Tagging
Human Computer Interaction
Identity
Microblogging
Mobile
Multimedia
Podcasting
Politics
Portals
Psychology
Recommender Systems
RSS and Syndication
Search
Semantic Web
Social Media
Social Networks
Social Software
Transparency and Openness
Trend Analysis
Trust and Reputation
Virtual Worlds
Web 2.0
Weblogs
Wikis
http://www.johnbreslin.com/blog/2009/05/29/blogtalk-2009-6th-international-social-software-conference-call-for-proposals-september-1st-and-2nd-jeju-korea/
Google Wave looks
interesting. Google describes it as “a new tool for communication
and collaboration on the web” and it’s a funny mix of email,
instant messaging, wikis, and Facebook wall interactions. Or maybe
IRC for the new century. This is from a post,
Went Walkabout. Brought back Google Wave, on the Google
blog.
“A “wave” is equal parts conversation and document, where people
can communicate and work together with richly formatted text,
photos, videos, maps, and more. Here’s how it works: In Google Wave
you create a wave and add people to it. Everyone on your wave can
use richly formatted text, photos, gadgets, and even feeds from
other sources on the web. They can insert a reply or edit the wave
directly. It’s concurrent rich-text editing, where you see on your
screen nearly instantly what your fellow collaborators are typing
in your wave. That means Google Wave is just as well suited for
quick messages as for persistent content — it allows for both
collaboration and communication. You can also use “playback” to
rewind the wave and see how it evolved.”
Google Wave is not available yet, but you can sign up to be
notified when it’s launched.
Here’s a random thought. Our models for communication in
multiagent systems (e.g., KQML and FIPA) were informed by, if
not based on, email and, to a lesser degree, IM. If Wave is a useful
new communication model for humans, does it have a counterpart for
software agents? If so, I suspect that ideas from the Semantic Web
will be useful to provide “rich content” for agents.
For more views, see posts by
O’Reilly,
TechCrunch,
BusinessWeek and
Gabor Cselle.
http://ebiquity.umbc.edu/blogger/2009/05/28/google-wave-as-a-new-communication-model/
It is always interesting to
find out about new communities using Semantic Web technologies. So
here is one more: 3D Media. Although I already knew about some work
going on around this subject, a paper published by Michela
Spagnuolo and Bianca Falcidieno[1]
just drew my attention to it again. Unfortunately, the paper, being
published by IEEE, is not publicly available :-( but if you can get
hold of it, it is worth a look!
Using specific vocabularies it should be possible to provide
semantic annotation of 3D models, shapes, surfaces, etc., regardless
of how you present or construct the 3D virtual object itself. The
paper gives the example of a 3D surface described in terms of a
triangular mesh; in some cases, for example to improve display
efficiency, the mesh may be converted into a coarser mesh, but the
corresponding semantic description should stay unchanged. Exactly
how to do that, how to store the metadata, what is the exact
identification of the surface itself (i.e., what are the URIs to
use), what type of vocabularies should be developed by the
community, etc., are all R&D issues: there is now a community
growing up around these questions. If such semantic descriptions
are available then better organization of shapes, better search
facilities, better segmentation tools for 3D objects, etc., can be
developed. (It is not unlike what happens with images and their
possible semantic characterizations, except that moving from 2D to
3D significantly increases complexity.) It is also possible to
create and share repositories of 3D shapes, models, etc., and
possibly link them to other datasets around the Web. Such
activities are already happening, e.g., in the AIM@SHAPE project, or the
publication of some of the relevant OWL
ontologies but, at least to my knowledge, there have not been
many connections between the, shall we say, “core” Semantic Web
community (whatever that means :-) and this 3D Model activity. That
is why I was pleased to see the paper of Michela and Bianca; maybe
more connections can be made in the years to come!
(Full disclosure: I am biased. Indeed, in my previous life, I
was active in Computer Graphics and I actually had the pleasure of
co-organizing a workshop with Bianca Falcidieno[2]
a long time ago; I also met Michela at that Workshop. It was one of
the nicest events I took part in during the ’90s…)
[1] Spagnuolo, Michela and Falcidieno, Bianca (2009) “3D Media and
the Semantic Web.” IEEE Intelligent Systems, 24(2), pp. 90-96.
Available from:
http://www2.computer.org/portal/web/csdl/doi/10.1109/MIS.2009.20
[2] Falcidieno, Bianca, Herman, Ivan and Pienovi, Caterina (eds.)
(1992) Computer Graphics and Mathematics, Heidelberg, Springer
Verlag.
http://ivan-herman.name/2009/05/28/3d-media-and-the-semantic-web/
A couple of weeks ago, I wrote about the self-enforcing
value spiral that RDF data enables. Here is an example of
how RDFa can be used to support this
"Repurpose-Republish" loop.
While data exchange between different semantic web sources is
usually RDF-based (i.e. the data always maintains its semantics),
there is one major exception: SPARQL SELECT queries. This
developer-oriented operation returns tabular data (similar to
record sets in SQL). Once the query result is separated from the
query, the associated structural data is lost. You can't directly
feed SELECT results back into a triple store, even though querying
based on linked resources means that you have just
created knowledge. It's a pity to show this generated
information to human consumers only.
One of the demos at my NYC talk was a dynamic wiki item that pulled
in competitor information from Semantic CrunchBase and injected that
into a page template as HTML. The existing RDF infrastructure does
not let me cache the SELECT results locally as usable RDF. And a
semantic web client or crawler that indexes the wiki page will not
learn how the described resource (e.g. Twitter) is related to the
remote, linked entities.
However, by simply adding a single RDFa hook to the wiki item
template, the RDF relation (e.g. competitor) can be made available
again to apps that process my site content. This is basically how
Linked Data works. But here is
the really nifty thing: My site can be a consumer of its own pages,
too, recursively enriching its own data.
I tweaked the wiki script which now works like this: When the page
is saved, a first operation updates the wiki markup in the page's
graph (i.e. the not-yet-populated template). In a second step, the
page URL is retrieved via HTTP. This will return HTML with
RDFa-encoded remote data, which is then parsed by ARC, and finally
added to the same graph. We end up with a graph that contains not only
the wiki markup but also the RDFized information that was
integrated from remote sites. After adding this graph to the RDF
store, we can use a local query to generate the page and
occasionally reset the graph to enable copy-by-reference. And all
this without any custom API code.
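The round trip described above can be sketched in a few lines of Python. This is a toy illustration, not the ARC-based wiki script: the function and class names, the ex:competitor predicate, and the example.org URIs are all assumptions made for the example, and the extractor only understands the about, rel, and href attributes rather than full RDFa.

```python
from html import escape
from html.parser import HTMLParser


def render_competitors(subject_uri, predicate, rows):
    """Render tabular SELECT rows as HTML that keeps the relation via RDFa-style attributes.

    rows is a list of (uri, label) tuples, i.e. the "record set" a SELECT returns.
    """
    items = "\n".join(
        f'  <li><a rel="{escape(predicate)}" href="{escape(uri)}">{escape(label)}</a></li>'
        for uri, label in rows
    )
    return f'<ul about="{escape(subject_uri)}">\n{items}\n</ul>'


class ToyRdfaExtractor(HTMLParser):
    """Collects (subject, predicate, object) triples from about/rel/href only.

    Hypothetical helper: it ignores scoping, prefixes, and every other RDFa rule.
    """

    def __init__(self):
        super().__init__()
        self.subject = None
        self.triples = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "about" in attrs:
            self.subject = attrs["about"]
        if self.subject and "rel" in attrs and "href" in attrs:
            self.triples.append((self.subject, attrs["rel"], attrs["href"]))


def extract_triples(html):
    """Parse a page the site itself produced and recover the triples it carries."""
    parser = ToyRdfaExtractor()
    parser.feed(html)
    return parser.triples
```

Feeding the output of render_competitors back into extract_triples yields one triple per row, which is the information that a plain HTML rendering of the SELECT result would have lost.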
http://bnode.org/blog/2009/05/26/simple-rdfication-of-sparql-select-results-with-rdfa
Reading RESTful Web
Services by Leonard Richardson and Sam Ruby, it suddenly struck
me like thunder: yes indeed, it’s very often the case with
technologies that they are (often unknowingly) abused in obscure
ways, which the community then often perceives as good or
best practice. So much for a generic introduction to explain the
title. Let’s flesh it out ;)
A couple of years ago I used to develop Web applications using
JSP and relational databases (RDB). One pattern I often found (and,
I have to admit, followed myself in pretty much the same way): treating the
RDB only as a dump store without exploiting the features it offers.
So, you load whatever you need via some SQL command in the
beginning, process it in memory and when you’re done you dump it
back again into the RDB. Is this the way RDBs are supposed to be
used? Certainly not.
Then, as motivated by the RESTful WS book: HTTP naturally
provides a set of methods for CRUD
operations; however, certain so-called ‘Web’ solutions merely use
HTTP as a transport protocol and redefine most of the logic in
rather complex ways (RPC-style being one example, but hybrids also
exist that partially use HTTP for reading, but define their own
mechanisms to update resources).
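As a rough illustration of the contrast, the Python sketch below puts the commonly cited CRUD-to-HTTP mapping next to an RPC-style call that tunnels everything through POST. The resource paths, the /api endpoint, the action field, and the helper names are invented for this example, and real APIs vary (PUT can also create a resource, for instance).

```python
# Commonly cited mapping of CRUD operations onto HTTP methods.
CRUD_TO_HTTP = {
    "create": "POST",
    "read": "GET",
    "update": "PUT",
    "delete": "DELETE",
}


def restful_request(operation, resource_path):
    """Return (method, path, body): the HTTP method itself says what happens."""
    return (CRUD_TO_HTTP[operation], resource_path, None)


def rpc_style_request(operation, resource_path):
    """Return (method, path, body): HTTP is only the transport, the verb hides in the body."""
    return ("POST", "/api", {"action": operation, "target": resource_path})
```

In the first form, caches and intermediaries can tell a read from a delete by looking at the method; in the second they see only an opaque POST.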
Anyway, there seems to be a pattern and now I was wondering if
we know about such MalBestPracticing in the RDF world as well. What
comes to mind are the following (ok, very roughly, but feel free to
add yours):
Using RDF in a closed-world setup: often seen
and often seen failing. Whenever you have a closed-world
application, that is, something that’s supposed to do a job in an
environment you entirely control (Intranet, desktop, etc.) and
there is no need to share/incorporate other data, using RDF is
probably not a smart choice. You’re better off with the RDB of your
choice and some hand-coded rules, both in terms of complexity and
performance.
Thinking of RDF on the serialisation level.
Yes, there are a couple of RDF serialisations such as RDF/XML,
RDFa, Turtle, etc. but that’s not the point. If I want to I can put
my RDF-glasses on and view (nearly) everything as RDF, but one
should rather think of RDF on the data model level. The important
point is that RDF provides a way to express structured data in a
graph manner which happens to be the same as the Web from a
morphological point of view.
No interlinking between data. Huh, that’s a
heavy one. Publishing RDF without interlinking to other data in RDF
out there. But to be fair, this has been properly addressed by
TimBL in
his LinkedData
note and the community has
picked it up since. Imagine HTML documents on all of the computers
in the world … without a single hyperlink between them. Would you
call that the Web? Certainly not. Believe it or not, this was more
or less what we’ve been doing for more than six years or so in the
Semantic Web.
So, what’s your favorite MalBestPracticing in the Semantic Web
world?
http://webofdata.wordpress.com/2009/05/24/technology-malbestpracticing/
Do you write HTML? You’ve just heard that Google
now supports RDFa and you want to know where to start? Our own
Steven Pemberton just published a fantastic
RDFa introduction for HTML authors:
RDFa is a thin layer of markup you can add to your web pages
that makes them understandable for machines as well as people. You
could describe it as a CSS for meaning. By adding it, browsers,
search engines, and other software can understand more about the
pages, and in so doing offer more services or better results for
the user. For instance, if a browser knows that a page is about an
event such as a conference, it can offer to add it to your
calendar, show it on a map, locate hotels or flights, or any number
of other things.
http://rdfa.info/2009/05/14/rdfa-for-html-authors-start-here/
VoCamps provide the missing social interaction needed for
vocabulary creation and management on the Semantic Web: a space
where members of the community can discuss the current issues
related to vocabularies and semantic interoperability. Unlike
Semantic Web meetups which typically take just a few hours and
where the discussion focuses on a single presentation, VoCamps are
two-day events that allow in-depth discussions and working in small
groups.
Following the success of
VoCamp Ibiza, we are organizing another similar event at Yahoo,
but this time in the US, where VoCamps are now also taking hold.
(VoCampDC will
be organized at the end of May and has already reached its full
capacity!) This VoCamp will take place in Sunnyvale, directly after
the SemTech 2009
conference.
If you would like to join this next edition of the VoCamp
series, please sign up on the VoCampSunnyvale2009
wiki page! The space is limited, but we will try to expand if
necessary. Hope to see many of you in San Jose and Sunnyvale!
http://tripletalk.wordpress.com/2009/05/12/vocamp-sunnyvale-june-18-19-2009/
While I'm in the process of looking more closely at the
Microdata proposal, I wanted to note that today marked the end of
the first day of the code sprint for incorporating RDFa into the
core of Drupal 7.
Yes, when Drupal 7 hits the streets, 1.7 million Drupal web
sites, and counting, will have built-in support for RDFa.
http://realtech.burningbird.net/semantic-web/rdf-and-rdfa/rdfa-drupal-core
We’re excited about SemTech 2009 in June for many reasons,
including the 5 talks we’re giving:
Introducing OWL 2: A Panel and Community Discussion, featuring
Mike Smith, who’s one of the OWL 2 Test Suite editors; Ian Horrocks
(Oxford), Ivan Herman (W3C), Deb McGuinness (RPI), and
Zhe Wu (Oracle).
Faceted Browsing Tools, where Mike Grove will talk about jSpace,
our faceted visual query browser for SPARQL.
Government Sidebar, where Evren Sirin and Mike Smith will talk
about HTN-DL Planner and the work we’ve done
with Kevin Newman and John T. Carson from Lockheed Martin on
planning UAVs.
Managing OWL: Using Oracle and Pellet for
Massively Scalable, Highly Expressive OWL
Infrastructure, where Xavier Lopez from Oracle and I will talk
about the work we’ve done to combine Oracle’s scalable reasoning
with Pellet’s expressive reasoning in PelletDb, a new offering from
us later this year.
A Programmer’s Introduction to Pellet: How to Build Ontology-based
Semantic Applications, where Evren and Mike Smith will give a 3-hour
tutorial on how to build real OWL applications based on Pellet (where “real” means
“uses inference in anger” and “is semantics-respecting”).
At SemTech 2008 our talks focused on R&D work we’d been
doing around OWL reasoning with Pellet,
probabilistic reasoning with Pronto, etc. This year our talks are
much more tightly focused on industrial applications of Pellet,
automatic planning, and SPARQL tools, as
well as our standards work at W3C.
See you in San Jose!
http://clarkparsia.com/weblog/2009/05/08/5-talks-at-semtech-2009/
I'm happy to announce that the proceedings of the upcoming
ESWC2009 Workshop on Trust and Privacy on the Social and
Semantic Web (SPOT2009) have just been
published via CEUR-WS (ISSN:
1613-0073) as
Vol-447.
The proceedings include the 9 papers that have been selected for
the workshop (from a total of 16 submissions; a big thanks to all
the PC members for
their work!) as well as the abstract of the keynote that will be
given by Prof. Piero Bonatti and a workshop preface. This is the
opportunity to read them before the event in order to engage in
discussions at the workshop if you attend it. In any case you
will get an overview of recent research trends on the topic, with papers
varying from theoretical approaches to software implementations on
the topics of trust and privacy on the (Social and Semantic)
Web.
http://apassant.net/blog/2009/05/01/spot2009-proceedings
(Last of five posts related to the WWW 2009
conference, held the week of April 20, 2009.)
The social networks camp was interesting, with a special meeting
around Twitter. Half jokingly, we (that is, the OpenLink folks
attending) concluded that societies would never be completely
classless, although mobility between, as well as criteria for
membership in, given classes would vary with time and circumstance.
Now, there would be a new class division between people for whom
micro-blogging is obligatory and those for whom it is an
option.
In my experience, a great deal is possible in a short time, but
this possibility depends on focus and concentration. These are
increasingly rare. I am a great believer in core competence and
focus. This is not only for geeks — one can have a lot of
breadth-of-scope but this too depends on not getting sidetracked by
constant information
overload.
Insofar as personal success depends on constant reaction to
online social media, this comes at a cost in time and focus and
this cost will have to be managed somehow, for example by
automation or outsourcing. But if the social media is only
automated fronts tweeting and retweeting among themselves, a bit
like electronic trading systems do with securities, with or without
human operators, the value of the medium decreases.
There are contradictory requirements. On one hand, what is said
in electronic media is essentially permanent, so one had best only
say things that are well considered. On the other hand, one must
say these things without adequate time for reflection or analysis.
To cope with this, one must have a well-rehearsed position that is
compacted so that it fits in a short format and is easy to remember
and unambiguous to express. A culture of pre-cooked fast-food
advertising cuts down on depth. Real-world things are complex and
multifaceted. Besides, prevalent patterns of communication train
the brain for a certain mode of functioning. If we train for
rapid-fire 140-character messaging, we optimize one side but
probably at the expense of another. In the meantime, the world
continues developing increased complexity by all kinds of emergent
effects. Connectivity is good but don't get lost in it.
There is a CIA memorandum about
how analysts misinterpret data and see what they want to see.
This is a relevant resource for understanding some psychology of
perception and memory. With the information overload, largely
driven by user generated content, interpreting fragmented and
variously-biased real-time information is not only for the analyst
but for everyone who needs to intelligently function in
cyber-social space.
I participated in discussions on security and privacy and on
mobile social networks and context.
For privacy, the main thing turned out to be whether people
should be protected from themselves. Should information expire?
Will it get buried by itself under huge volumes of new content?
Well, for purposes of visibility, it will certainly get buried and
will require constant management to stay visible. But for purposes
of future finding of dirt, it will stay findable for those who are
looking.
There is also the corollary of setting security for resources,
like documents, versus setting security for statements, i.e.,
structured data like social networks. As I have blogged before,
policies à la
SQL do not work well when schema is
fluid and end-users can't be expected to formulate or understand
these. Remember Ted Nelson? A
user interface should be such that a beginner understands it in 10
seconds in an emergency. The user interaction question is how to
present things so that the user understands who will have access to
what content. Also, users should themselves be able to check what
potentially sensitive information can be found out about them. A
service along the lines of Garlic's Data Patrol should be a part of
the social web infrastructure of the future.
People at MIT have developed AIR (Accountability In RDF) for
expressing policies about what can be done with data and for
explaining why access is denied if it is denied. However, if we
look at all at the history of secrets, it is rather seldom that one
hears that access to information about X is restricted to
compartment so-and-so; it is much more common to hear that there is
no X. I would say that a policy system that just leaves out
information that is not supposed to be available will please the
users more. This is not only so for organizations; it is fully
plausible that an individual might not wish to expose even the
existence of some selected inner circle of friends, their parties
together, or whatever.
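The difference between denying access and simply leaving restricted statements out can be sketched as follows. This Python fragment is only an illustration of the idea in the preceding paragraph; it is not AIR, and the statement layout, group names, and function name are assumptions.

```python
def visible_statements(statements, viewer_groups):
    """Return the triples a viewer may see; restricted ones are silently omitted.

    statements: iterable of (subject, predicate, object, allowed_groups), where
    allowed_groups is None for public statements or a set of group names.
    No "access denied" marker is produced, so the viewer cannot tell that
    anything was withheld.
    """
    return [
        (s, p, o)
        for (s, p, o, allowed_groups) in statements
        if allowed_groups is None or allowed_groups & viewer_groups
    ]
```

A stranger querying for a person's contacts would get a shorter list rather than an error, which matches the "there is no X" behaviour described above.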
In conclusion, there is no self-evident solution for careless
use of social media. A site that requires people to confirm
multiple times that they know what they are doing when publishing a
photo will not get much use. We will see.
For mobility, there was some talk about the context of usage.
Again, this is difficult. For different contexts, one would for
example disclose one's location at the granularity of the city; for
some other purposes, one would say which conference room one is
in.
Embarrassing social situations may arise if mobile devices are
too clever: If information about travel is pushed into the social
network, one would feel like having to explain why one does not
call on such-and-such a person and so on. Too much initiative in
the mobile phone seems like a recipe for problems.
There is a thin line between convenience and having IT
infrastructure rule one's life. The complexities and subtleties of
social situations ought not to be reduced to the level of if-then
rules. People and their interactions are more complex than they
themselves often realize. A system is not its own metasystem, as
Gödel put it. Similarly, human self-knowledge, let alone knowledge about
another, is by this very principle only approximate. Not to forget
what psychology tells us about state-dependent recall and of how
circumstance can evoke patterns of behavior before one even
notices. The history of expert systems did show that people do not
do very well at putting their skills in the form of if-then rules.
Thus automating sociality past a certain point seems a problematic
proposition.
http://www.openlinksw.com/weblog/oerling/?id=1554
http://dallemang.typepad.com/my_weblog/2009/04/hi1tech-comm-infrastructure-at-edw.html
You know that situation. Every time you want to order flowers
via the Web or book a flight (where you do not happen to be a premium
member of the airline’s frequent fliers club having all the data
handy), you are urged to fill in the most stupid and plain forms.
Day in, day out, the same boring and time-consuming activity.
How about a sort of Mechanical Turk that
does the dirty work for you?
OK, enough words; here is my take on it: using RDForms’
fusion
capabilities, one can fill in one’s profile data from one’s FOAF
file:
profile auto-complete demo based on
RDForms fusion algorithm
Check out http://ld2sd.deri.org/pac/ and let
me know what and how we could extend that little toy (maybe
pre-selecting fields or protecting certain fields, etc.?). The code
is also available at the
pushback project svn.
Warning: this is a Sunday-morning hack, not a ready-to-go
product, so please, bear with it/me ;)
http://webofdata.wordpress.com/2009/04/05/pac-intro/
One concern about Virtuoso Cluster is
fault tolerance. This post talks about the basics of fault
tolerance and what we can do with this, from improving resilience
and optimizing performance to accommodating bulk loads without
impacting interactive response. We will see that this is yet
another step towards a 24/7 web-scale Linked Data Web. We will see
how large scale, continuous operation, and redundancy are
related.
It has been said many times — when things are large enough,
failures become frequent. In view of this, basic storage of
partitions in multiple copies is built into the Virtuoso cluster
from the start. Until now, this feature has not been tested or used
very extensively, aside from the trivial case of keeping all schema
information in
synchronous replicas on all servers.
Approaches to Fault Tolerance
Fault tolerance has many aspects but it starts with keeping
data in at least two copies. There are
shared-disk cluster databases like Oracle RAC that do
not depend on partitioning. With these, as long as the disk image
is intact, servers can come and go. The fault tolerance of the disk
in turn comes from mirroring done by the disk controller. RAID
levels other than mirroring are not really good for databases because
of write speed.
With shared-nothing setups like Virtuoso, fault tolerance is
based on multiple servers keeping the same logical data. The copies
are synchronized transaction-by-transaction but are not bit-for-bit
identical nor write-by-write synchronous as is the case with
mirrored disks.
There are asynchronous replication schemes generally based on
log shipping, where the replica replays the transaction log of the
master copy. The master copy gets the updates, the replica replays
them. Both can take queries. These do not guarantee an entirely
ACID fail-over but for many applications they come close
enough.
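A minimal sketch of log shipping, assuming a key-value store and an in-memory log (the class and method names are made up for illustration and have nothing to do with Virtuoso's actual implementation):

```python
class Master:
    """Takes the updates and records each committed change in an ordered log."""

    def __init__(self):
        self.data = {}
        self.log = []

    def commit(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


class Replica:
    """Replays the master's log; it can answer queries but may lag behind."""

    def __init__(self):
        self.data = {}
        self.applied = 0  # number of log records replayed so far

    def replay(self, log):
        for key, value in log[self.applied:]:
            self.data[key] = value
        self.applied = len(log)
```

Any commit made on the master after the last replay is missing on the replica, which is why a fail-over to the replica is not entirely ACID.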
In a tightly coupled cluster, it is possible to do synchronous,
transactional updates on multiple copies without great added cost.
Sending the message to two places instead of one does not make much
difference since it is the latency that counts. But once we go to
wide area networks, this becomes as good as unworkable for any sort
of update volume. Thus, wide area replication must in practice be
asynchronous.
This is a subject for another discussion. For now, the short
answer is that wide area log shipping must be adapted to the
application's requirements for synchronicity and consistency. Also,
exactly what content is shipped and to where depends on the
application. Some application-specific logic will likely be
involved; more than this one cannot say without a specific
context.
Basics of Partition Fail-Over
For now, we will be concerned with redundancy protecting against
broken hardware, software slowdown, or crashes inside a single
site.
The basic idea is simple: Writes go to all copies; reads that
must be repeatable or serializable (i.e., locking) go to the first
copy; reads that refer to committed state without guarantee of
repeatability can be balanced among all copies. When a copy goes
offline, nobody needs to know, as long as there is at least one
copy online for each partition. The exception in practice is when
there are open cursors or such stateful things as aggregations
pending on a copy that goes down. Then the query or transaction
will abort and the application can retry. This looks like a
deadlock to the application.
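The routing rule can be sketched like this in Python. It is a simplified model under stated assumptions: copies are just names, the operation kinds are strings invented for the example, and writes go to whichever copies are online, which glosses over the staged handling of failures.

```python
import itertools


class Partition:
    """Routes operations for one partition that is stored in several copies."""

    def __init__(self, copies):
        self.copies = list(copies)      # copy names in a fixed order
        self.online = set(self.copies)
        self._rr = itertools.count()    # round-robin counter for balanced reads

    def _live(self):
        live = [c for c in self.copies if c in self.online]
        if not live:
            raise RuntimeError("no copy of this partition is online")
        return live

    def route(self, kind):
        live = self._live()
        if kind == "write":
            return live                 # writes go to all (online) copies
        if kind == "locking_read":
            return [live[0]]            # repeatable/serializable reads use the first copy
        if kind == "committed_read":
            # reads of committed state are balanced among the copies
            return [live[next(self._rr) % len(live)]]
        raise ValueError(f"unknown operation kind: {kind}")
```

Because locking reads always pick the first online copy, all lock state for a partition lives in one place; only the reads that need no locks are spread out.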
Coming back online is more complicated. This requires
establishing that the recovering copy is actually in sync. In
practice this requires a short window during which no transactions
have uncommitted updates. Sometimes, forcing this can require
aborting some transactions, which again looks like a deadlock to
the application.
When an error is seen, such as a process no longer accepting
connections and dropping existing cluster connections, in
practice we go through two stages. First, the operations that directly
depended on this process are aborted, as well as any computation
being done on behalf of the disconnected server. At this stage,
attempting to read data from the partition of the failed server
will go to another copy but writes will still try to update all
copies and will fail if the failed copy continues to be offline.
After it is established that the failed copy will stay off for some
time, writes may be re-enabled — but now having the failed copy
rejoin the cluster will be more complicated, requiring an atomic
window to ensure sync, as mentioned earlier.
For the DBA, there can be intermittent software crashes where a
failed server automatically restarts itself, and there can be
prolonged failures where this does not happen. Both are alerts but
the first kind can wait. Since a system must essentially run
itself, it will wait for some time for the failed server to restart
itself. During this window, all reads of the failed partition go to
the spare copy and writes give an error. If the failed server does not come
back up in time, the system will automatically re-enable writes on
the spare but now the failed server may no longer rejoin the
cluster without a complex sync cycle. This all can happen in well
under a minute, faster than a human operator can react. The
diagnostics can be done later.
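The automatic reaction described above amounts to a small state machine. The following Python sketch makes the stages explicit; the class name, the "primary"/"spare" labels, and the grace period value are assumptions for illustration only, and time is passed in explicitly so that the behaviour is easy to follow.

```python
class WritesSuspended(Exception):
    """Raised while the system is still waiting for the failed server to restart."""


class FailoverState:
    def __init__(self, grace_seconds):
        self.grace_seconds = grace_seconds
        self.failed_at = None            # time the primary was seen failing
        self.needs_full_resync = False   # set once writes continued without it

    def mark_failed(self, now):
        self.failed_at = now

    def read_target(self):
        # Reads of the failed partition go to the spare copy right away.
        return "spare" if self.failed_at is not None else "primary"

    def write(self, now):
        """Return the copies a write is applied to at time `now`."""
        if self.failed_at is None:
            return ["primary", "spare"]
        if now - self.failed_at < self.grace_seconds:
            raise WritesSuspended("waiting for the failed server to restart")
        # Grace period is over: re-enable writes on the spare alone.
        self.needs_full_resync = True
        return ["spare"]

    def mark_restarted(self):
        """Return True if the server can rejoin directly, False if it must resync first."""
        if self.needs_full_resync:
            return False
        self.failed_at = None
        return True
```

A quick restart inside the grace period lets the server rejoin with no further work; once writes have been accepted on the spare alone, rejoining requires the more complex sync cycle.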
If the situation was a hardware failure, recovery consists of
taking a spare server and copying the database from the surviving
online copy. This done, the spare server can come on line. Copying
the database can be done while online and accepting updates but
this may take some time, maybe an hour for every 200G of data
copied over a network. In principle this could be automated by
scripting, but we would normally expect a human DBA to be
involved.
As a general rule, reacting to the failure goes automatically
without disruption of service but bringing the failed copy online
will usually require some operator action.
Levels of Tolerance and Performance
The only way to make failures totally invisible is to have everything
in duplicate and provisioned so that the system never runs at more
than half the total capacity. This is often not economical or
necessary. We can do better by using the spare capacity
for more than standby.
Imagine keeping a repository of linked data. Most of the content
will come in through periodic bulk replacement of data sets. Some
data will come in through pings from applications publishing FOAF
and similar. Some data will come through on-demand RDFization of
resources.
The performance of such a repository essentially depends on
having enough memory. Having this memory in duplicate is just added
cost. What we can do instead is have all copies store the whole
partition but when routing queries, apply range partitioning on top
of the basic hash partitioning. If one partition stores IDs 64K -
128K, the next partition 128K - 192K, and so forth, and all
partitions are stored in two full copies, we can route reads to the
first 32K IDs to the first copy and reads to the second 32K IDs to
the second copy. In this way, the copies will keep different
working sets. The RAM is used to full advantage.
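Using the numbers from the example above (partitions of 64K IDs, two full copies each), the read routing could look like this in Python. The function names are hypothetical and the sketch treats partitions as contiguous ID ranges, exactly as in the example.

```python
K = 1024
PARTITION_SIZE = 64 * K   # each partition covers 64K consecutive IDs


def partition_for(id_):
    """Index of the partition holding this ID (0 for IDs 0..64K-1, 1 for 64K..128K-1, ...)."""
    return id_ // PARTITION_SIZE


def read_copy_for(id_, copies=2):
    """Pick the copy that serves reads for this ID.

    Every copy stores the whole partition; only reads are split by range,
    so each copy keeps a different working set in RAM.
    """
    offset = id_ % PARTITION_SIZE          # position inside the partition
    slice_size = PARTITION_SIZE // copies  # 32K IDs per copy when there are two
    return offset // slice_size            # 0 = first copy, 1 = second copy
```

If one copy fails, all reads for the partition fall back to the surviving copy, whose cache then has to absorb the other half of the IDs.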
Of course, if there is a failure, then the working set will
degrade, but if this is not often and not for long, this can be
quite tolerable. The alternate expense is buying twice as much RAM,
likely meaning twice as many servers. This workload is memory
intensive, thus servers should have the maximum memory they can
have without going to parts that are so expensive one gets a new
server for the price of doubling memory.
Background Bulk Processing
When loading data, the system is online in principle, but query
response can be quite bad. A large RDF load will
involve most memory and queries will miss the cache. The load will
further keep most disks busy, so response is not good. This is the
case as soon as a server's partition of the database is four times
the size of RAM or greater. Whether the work is bulk-load or
bulk-delete makes little difference.
But if partitions are replicated, we can temporarily split the
database so that the first copies serve queries and the second
copies do the load. If the copies serving online activities do
some updates also, these updates will be committed on both copies.
But the load will be committed on the second copy only. This is
fully appropriate as long as the data are different. When the bulk
load is done, the second copy of each partition will have the full
up-to-date state, including changes that came in during the bulk
load. The online activity can now be redirected to the second
copies and the first copies can be overwritten in the background by
the second copies, so as to again have all data in duplicate.
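A toy model of this split, using plain dictionaries for the two copies of one partition (names are illustrative, and the sketch assumes, as the text does, that the bulk-loaded keys and the online-updated keys do not overlap):

```python
class ReplicatedPartition:
    def __init__(self, data):
        self.first = dict(data)    # keeps serving online queries during the load
        self.second = dict(data)   # receives the bulk load

    def online_update(self, key, value):
        # Small online updates are committed on both copies.
        self.first[key] = value
        self.second[key] = value

    def bulk_load(self, rows):
        # The bulk load is committed on the second copy only.
        self.second.update(rows)

    def finish_load(self):
        # The second copy now has the full state; overwrite the first from it
        # so that all data is again held in duplicate.
        self.first = dict(self.second)
```

During the load the first copy answers queries from its undisturbed cache; after finish_load both copies are identical again.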
Failures during such operations are not dangerous. If the copies
doing the bulk load fail, the bulk load will have to be restarted.
If the front end copies fail, the front end load goes to the copies
doing the bulk load. Response times will be bad until the bulk load
is stopped, but no data is lost.
This technique applies to all data intensive background tasks —
calculation of entity search ranks,
data cleansing, consistency checking, and so on. If two copies are
needed to keep up with the online load, then data can be kept just
as well in three copies instead of two. This method applies to any
data-warehouse-style workload which must coexist with online access
and occasional low volume updating.
Configurations of Redundancy
Right now, we can declare that two or more server processes in a
cluster form a group. All data managed by one member of the group
is stored by all others. The members of the group are
interchangeable. Thus, if there is four-servers-worth of data, then
there will be a minimum of eight servers. Each of these servers
will have one server process per core. The first hardware failure
will not affect operations. For the second failure, there is a 1/7
chance that it stops the whole system, if it falls on the server
whose pair is down. If groups consist of three servers, for a total
of 12, the first two failures are guaranteed not to interrupt
operations; for the third, there is at most a 1/10 chance that it will
(1/10 if the first two failures fell in the same group, none otherwise).
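These odds can be checked by brute-force enumeration. The Python sketch below assumes that failures hit distinct servers uniformly at random and that the system stops as soon as every member of some group is down; the function names are made up for the example.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations


def some_group_down(failed, group_size):
    """True if all members of at least one group appear among the failed servers."""
    counts = Counter(group for group, _ in failed)
    return any(n == group_size for n in counts.values())


def first_outage_probability(num_groups, group_size, k):
    """Probability that the k-th failure is the one that first stops the system."""
    servers = [(g, i) for g in range(num_groups) for i in range(group_size)]
    orders = list(permutations(servers, k))
    hits = sum(
        1
        for order in orders
        if some_group_down(order, group_size)
        and not some_group_down(order[:-1], group_size)
    )
    return Fraction(hits, len(orders))


def worst_case_third_failure(num_groups, group_size=3):
    """Chance that the third failure stops the system, given the first two hit the same group."""
    remaining = num_groups * group_size - 2
    return Fraction(1, remaining)
```

For four pairs, first_outage_probability(4, 2, 2) gives 1/7. For four groups of three, worst_case_third_failure(4) gives the 1/10 quoted above, which applies when the first two failures landed in the same group; averaged over all failure orders, first_outage_probability(4, 3, 3) is 1/55.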
We note that for big databases, as said before, the RAM cache
capacity is the sum of all the servers' RAM when in normal
operation.
There are other, more dynamic ways of splitting data among
servers, so that partitions migrate between servers and spawn extra
copies of themselves if not enough copies are online. The Google
File System (GFS) does something of this sort at the file system
level; Amazon's Dynamo does something similar at the database
level. The analogies are not exact, though.
If data is partitioned in this manner, for example into 1K
slices, each in duplicate, with the rule that the two duplicates
are never on the same physical server, the first failure will not
break operations but the second probably will. Without extra logic,
the second copies of the partitions formerly hosted by the failed
server are likely to be spread randomly over all the remaining
servers, so the failure of almost any one of them takes out the last
copy of some partition. This scheme equalizes load better but is
less resilient.
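To see why the second failure "probably will" break operations under random placement, here is a small sketch. The server count (8), the seed, and the function name second_failure_risk are assumptions made for illustration; the post only fixes the 1K slices and the rule that duplicates live on different servers.

```python
import random


def second_failure_risk(servers, partitions, failed, rng):
    """Fraction of surviving servers whose loss would destroy some partition.

    Each partition gets two copies on two distinct, randomly chosen servers.
    """
    placement = [rng.sample(range(servers), 2) for _ in range(partitions)]
    partners = set()
    for first, second in placement:
        # Record which server holds the only remaining copy of each
        # partition that was hosted on the failed server.
        if first == failed:
            partners.add(second)
        elif second == failed:
            partners.add(first)
    return len(partners) / (servers - 1)


rng = random.Random(0)  # fixed seed so the run is repeatable
risk = second_failure_risk(servers=8, partitions=1024, failed=0, rng=rng)
print(f"risk with random placement: {risk:.2f}")
print(f"risk with fixed pairs:      {1 / 7:.2f}")
```

With about 256 partition copies per server, practically every surviving server holds the last copy of something, so the risk approaches 1, against 1/7 for fixed pairs.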
Maintenance and Continuity
Databases may benefit from defragmentation, rebalancing of
indices, and so on. While these are possible online, by definition
they affect the working set and make response times quite bad as
soon as the database is significantly larger than RAM. With
duplicate copies, the problem is largely solved. Also, software
version changes need not involve downtime.
Present Status
The basics of replicated partitions are operational. The items
to finalize are about system administration procedures and
automatic synchronization of recovering copies. This must be
automatic because if it is not, the operator will find a way to
forget something or do some steps in the wrong order. This also
requires a management view that shows what the different processes
are doing and whether something is hung or failing repeatedly. All
this is for the recovery part; taking failed partitions offline is
easy.
http://www.openlinksw.com/weblog/oerling/?id=1540
Constructing a “real world” XProc pipeline: building the XProc
specification with XProc.
http://norman.walsh.name/2009/03/26/xprocWithXProc
It is relatively common for us to hear from clients that they
want to use Pellet without it accessing the network. Sometimes they
want to avoid network problems by caching locally; sometimes
they’re conforming to local security policy constraints; often,
people just like hacking on local copies before publishing their
ontologies on the Web. Regardless of motivation, they need to avoid
the network access used to fetch the contents of an ontology’s
imports closure. In this post I outline how a user can set up a
local ontology repository that will be used by Pellet’s Jena
loader.
The most common use case is a user hand-editing a collection of
local ontologies which use HTTP URLs. Until the ontologies are ready
to be published there is no content (or, even worse, outdated content)
at those URLs. The problem is that it’s cumbersome to change all the
URLs to file: URLs only to change them back when publishing.
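Consider two simple ontologies, presented below using Turtle
syntax. First,
@prefix owl: <http://www.w3.org/2002/07/owl#> .
<http://example.org/PeopleOntology> a owl:Ontology .
<...> a owl:Class .
Second,
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
<http://example.org/FriendOntology> a owl:Ontology ;
  owl:imports <http://example.org/PeopleOntology> .
<...> a owl:Class ;
  rdfs:subClassOf <...> .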
We want to use Pellet to iteratively check the inferred class
hierarchy as we develop the ontologies. To do this with the command
line tools, we normally issue the following command
:; pellet classify http://example.org/FriendOntology
But if we try this as-is, we’ll get an error. We need Pellet to
recognize that, while these ontologies are destined to be published
on the Web, for now they are in local files named
people.ttl and friend.ttl. To do this, we use a Jena LocationMapper
configuration file. We can set up the file, named
location-mapping.ttl, with the following Turtle content:
@prefix lm: <http://jena.hpl.hp.com/2004/08/location-mapping#> .
[] lm:mapping
  [ lm:name "http://example.org/PeopleOntology" ; lm:altName "file:people.ttl" ] ,
  [ lm:name "http://example.org/FriendOntology" ; lm:altName "file:friend.ttl" ] .
The only other change we need is to explicitly tell Pellet to
use the Jena loader; it uses the OWLAPI
loader by default. The command line looks like
:; pellet classify --loader Jena http://example.org/FriendOntology
With the location mapping configuration file in place, we no
longer get a timeout but instead see the class hierarchy we expect,
based on the content of the local files.
The second common use case is a user working with an ontology
they’ve found on the Web and which has an arbitrarily large imports
closure. This user wants to avoid network accesses to fetch
ontologies. There are three steps to addressing this: first, we need
to identify all of the ontology URLs in the imports closure; then we
need to store them in our local repository; finally, we need to
create an adequate mapping file.
To illustrate this example, we’ll use the LKIF-Core ontology. This ontology is interesting
because it has a moderate number of ontologies in its imports
closure. We could use a tool like Protégé 4 to identify the
ontologies in the imports closure; but we’re going to assume that
Pellet is the only ontology tool available. To find all the network
resources fetched, we can take advantage of some debug logging
available in Jena. Jena uses log4j, so we need to create a log4j
configuration file, called lm-log4j.properties, to
echo the interesting content to standard error.
log4j.rootLogger=WARN, stderr
log4j.appender.stderr=org.apache.log4j.ConsoleAppender
log4j.appender.stderr.target=System.err
log4j.appender.stderr.layout=org.apache.log4j.SimpleLayout
log4j.logger.com.hp.hpl.jena.util.FileManager=DEBUG
Once created, we set the system property
log4j.configuration to reference the file. If you’re
using the shell script included with Pellet-2.0 RC5 or newer, you
can do this with an environment variable as follows
:; export pellet_java_args="-Dlog4j.configuration=file:lm-log4j.properties"
Then proceed as before
:; pellet consistency --loader Jena http://www.estrellaproject.org/lkif-core/lkif-core.owl
There will be a lot of DEBUG messages,
but it’s easy to narrow in on the useful details with a simple grep
command, such as
:; pellet consistency --loader Jena http://www.estrellaproject.org/lkif-core/lkif-core.owl 2>&1 | grep 'Not mapped'
The output should be something like the following, enumerating all
of the URLs which are being retrieved:
DEBUG - Not mapped: http://www.estrellaproject.org/lkif-core/lkif-core.owl
DEBUG - Not mapped: http://www.estrellaproject.org/lkif-core/norm.owl
DEBUG - Not mapped: http://www.estrellaproject.org/lkif-core/legal-role.owl
DEBUG - Not mapped: http://www.estrellaproject.org/lkif-core/legal-action.owl
DEBUG - Not mapped: http://www.estrellaproject.org/lkif-core/role.owl
DEBUG - Not mapped: http://www.estrellaproject.org/lkif-core/expression.owl
DEBUG - Not mapped: http://www.estrellaproject.org/lkif-core/action.owl
DEBUG - Not mapped: http://www.estrellaproject.org/lkif-core/process.owl
DEBUG - Not mapped: http://www.estrellaproject.org/lkif-core/relative-places.owl
DEBUG - Not mapped: http://www.estrellaproject.org/lkif-core/time.owl
DEBUG - Not mapped: http://www.estrellaproject.org/lkif-core/mereology.owl
DEBUG - Not mapped: http://www.estrellaproject.org/lkif-core/lkif-top.owl
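If the list is long, the "Not mapped" lines can be reduced to a plain, de-duplicated list of URLs ready for downloading. This is a small sketch of that post-processing step, not something shipped with Pellet or Jena; the function name unmapped_urls is hypothetical, and the sample text reuses two URLs from the output above.

```python
import re

# Matches the URL following "Not mapped:", tolerating wrapped lines.
NOT_MAPPED = re.compile(r"Not\s+mapped:\s*(\S+)")


def unmapped_urls(log_text):
    """Return the URLs reported as 'Not mapped', in first-seen order."""
    matches = NOT_MAPPED.findall(log_text)
    # dict.fromkeys removes repeats while keeping the original order.
    return list(dict.fromkeys(matches))


sample = (
    "DEBUG - Not mapped: http://www.estrellaproject.org/lkif-core/lkif-core.owl\n"
    "DEBUG - Not mapped: http://www.estrellaproject.org/lkif-core/norm.owl\n"
    "DEBUG - Not mapped: http://www.estrellaproject.org/lkif-core/norm.owl\n"
)

for url in unmapped_urls(sample):
    print(url)
```

In practice the text would come from the captured standard error of the pellet command rather than from a literal string.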
After downloading each of these files and saving them locally,
we can create a location mapping file as above, with one map entry
per ontology. That said, the location mapping configuration file
supports more sophisticated mapping, and this is a great time to
take advantage of prefix based mapping. The following content in
location-mapping.ttl should be sufficient:
@prefix lm: <http://jena.hpl.hp.com/2004/08/location-mapping#> .
[] lm:mapping
  [ lm:prefix "http://www.estrellaproject.org/lkif-core/" ; lm:altPrefix "file:./" ] .
With this in place and all the files in the working directory,
if we rerun the previous command, grep doesn’t find any matches. To
disable the debug output
:; unset pellet_java_args
Then proceed as before
:; pellet consistency --loader Jena http://www.estrellaproject.org/lkif-core/lkif-core.owl
We’ve used the location mapping configuration to completely
avoid network access.
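The two kinds of mapping entries used in this post, exact names and prefixes, can be illustrated with a short sketch. This is not Jena's code: the function resolve and the lookup order (exact name first, then the longest matching prefix) are assumptions made for illustration, and the dictionaries simply mirror the two configuration files shown earlier.

```python
# Exact name -> alternative location (mirrors lm:name / lm:altName).
NAMES = {
    "http://example.org/PeopleOntology": "file:people.ttl",
    "http://example.org/FriendOntology": "file:friend.ttl",
}

# Prefix -> alternative prefix (mirrors lm:prefix / lm:altPrefix).
PREFIXES = {
    "http://www.estrellaproject.org/lkif-core/": "file:./",
}


def resolve(uri, names=NAMES, prefixes=PREFIXES):
    """Map a URI to its local location, or return it unchanged."""
    if uri in names:
        return names[uri]
    # Assumed rule: prefer the longest prefix that matches.
    best = None
    for prefix in prefixes:
        if uri.startswith(prefix) and (best is None or len(prefix) > len(best)):
            best = prefix
    if best is not None:
        return prefixes[best] + uri[len(best):]
    return uri  # not mapped: the loader would go to the network


print(resolve("http://example.org/FriendOntology"))
print(resolve("http://www.estrellaproject.org/lkif-core/norm.owl"))
print(resolve("http://example.org/Unknown"))
```

The first call resolves to file:friend.ttl, the second to file:./norm.owl, and the third is returned unchanged, which corresponds to the "Not mapped" case in the debug log.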
A few additional details are worth noting. First, Jena does some
searching for the location mapping configuration file, but the
easiest approach is to keep it in the working directory.
Alternatively, it can be explicitly named using the
LocationMap system property. This approach can be
attractive if you work on multiple ontology projects and would like
them to share a single local repository. E.g., you might use
:; export pellet_java_args="-DLocationMap=file:///etc/my-repository.ttl"
Second, in Pellet 2.0 RC5 this functionality is only available
if Pellet’s Jena loader is used. We’ve got
a ticket open to duplicate the functionality in the OWLAPI loader and hope to have it in place before
the final Pellet 2.0 release.
Feel free to comment on this functionality or any other aspect
of Pellet’s behavior on the pellet-users
mailing list. See you there.
Edited 2009-03-16 15:15 EDT to note:
There has been some public discussion, such as
this thread on public-owl-wg@w3.org about tools using
XML Catalogs to provide a
standardized map description format similar to the one provided by
the location mapper configuration file used here. We think that any
mechanism that is sane and supported by OWL tools in an interoperable way is a good thing.
Translation between the Jena format and XML Catalogs looks straightforward, so you needn’t
worry about backwards compatibility issues if Pellet supports
XML Catalogs in the future.
http://clarkparsia.com/weblog/2009/03/16/local-ontology-repositories-with-pellet/
Here is the video of the talk that Tim Berners-Lee gave at the
TED2009 conference on linked data. You can see the slides that TBL
used on the W3C site.
I may have missed it, but I don’t think he mentioned the phrase
“Semantic Web” once during the 16-minute talk.
http://ebiquity.umbc.edu/blogger/2009/03/14/video-from-tim-berners-lee-2009-ted-talk-on-linked-data/

