A walk through the semantic web using ~~a snorkel~~ Snorql and triplestore technologies

Introduction

Why this guide?

The first time I saw DBpedia, I was completely amazed by the endless possibilities it offers, and because I couldn’t find any comprehensive resource on how to use it, I wrote one.

Wikipedia is certainly the most amazing thing the internet has ever built. DBpedia sits on top of it and tries to organize this crazy amount of information in a way that makes sense for humans.

At the beginning, extracting information from DBpedia felt like black magic to me: the XML-ish URIs felt obscure, finding the correct information seemed even worse, and the interface I was using was constantly throwing errors. What a pain!

This guide intends to give some very practical examples showing a method you can use to write your own queries. By the end of this guide you should be able to create a wide range of queries using SPARQL and get answers to many questions whose answers are spread all over Wikipedia.

Some of the examples are silly, some are useful, like planning your next holiday without having to look at the advertising on TripAdvisor.

Our goal here is to give you an overview of the possibilities DBpedia offers.

Notice: I am not a SPARQL practitioner! I wrote this document simply because of the lack of comprehensive documentation for getting started with DBpedia.

Our challenge

We will try to find information that can be very difficult to find with a search engine. For example:

1) Find CEOs who run a company with more than 3210 employees but weren’t born yet when Neil Armstrong walked on the moon

2) List the zip code of every city in a country (you might find such a database on the internet, but good luck finding it for free)

3) Find the wealthiest people on earth, like Forbes does every year

4) Find all the natural places (with their GPS location, name, abstract, wiki page and image) in Australia for your next holiday

Prerequisite

Before starting, make sure you have read these documents:

DBpedia: https://en.wikipedia.org/wiki/DBpedia
RDF: https://en.wikipedia.org/wiki/Resource_Description_Framework
SPARQL: https://en.wikipedia.org/wiki/SPARQL
Triplestore: https://en.wikipedia.org/wiki/Triplestore
Ontology: https://en.wikipedia.org/wiki/Ontology_(information_science)

A method to get your answers

Tooling

DBpedia provides 2 SPARQL endpoints:

http://dbpedia.org/snorql
http://dbpedia.org/sparql

You can write SPARQL queries in both of them. However, it is much easier to start with snorql because:

it makes it easy to discover data
you can go back in history without losing the queries you made
it gives you some shortcuts (prefixes) by default
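
Both endpoints can also be queried from a script instead of a browser. Here is a minimal Python sketch, assuming http://dbpedia.org/sparql follows the standard SPARQL protocol (a GET request with a `query` parameter, JSON results requested through the Accept header). The helper names `build_url`, `bindings_to_rows` and `run_query` are made up for this illustration.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://dbpedia.org/sparql"  # one of the two endpoints listed above


def build_url(query, endpoint=ENDPOINT):
    """Return the GET URL that carries the SPARQL query (hypothetical helper)."""
    return endpoint + "?" + urllib.parse.urlencode({"query": query})


def bindings_to_rows(payload):
    """Flatten a SPARQL JSON result document into a list of plain dicts."""
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in payload["results"]["bindings"]
    ]


def run_query(query, endpoint=ENDPOINT):
    """Send the query over HTTP and return the rows (needs network access)."""
    request = urllib.request.Request(
        build_url(query, endpoint),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return bindings_to_rows(json.load(response))
```

Since snorql declares its prefixes for you, it is safest to write the PREFIX lines explicitly when you send a query yourself this way.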

Start your query

For our first challenge, we will start by extracting some companies. We begin with this page, which lists the ontology classes available in DBpedia.

dbo:Company

We can use it like this:

SELECT ?object WHERE{
  # BEGINNING OF OUR RDF triple.
  ?object # our output where we store object
  rdf:type # that have the type
  dbo:Company # company
  . # END OF A TRIPLE
  # we could have also done that in one line
  # ?object rdf:type dbo:Company .
  # or even
  # ?object a dbo:Company .
}
LIMIT 5

Note:

the LIMIT keyword comes AFTER the WHERE block. SPARQL supports many query modifiers such as GROUP BY, HAVING, ORDER BY, LIMIT, OFFSET. Nothing new if you know SQL
we terminate an RDF triple with a ‘.’
people often write ‘a’ instead of rdf:type. It is just a shortcut.
rdf:foo is just a shortcut for <http://www.w3.org/1999/02/22-rdf-syntax-ns#foo>. That’s basically why snorql is cool: by default, it declares a bunch of shortcuts to make our life easier (that’s the PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> line at the top of snorql)
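
To make the prefix mechanism concrete, here is a small Python sketch that expands a prefixed name into the full URI it stands for and rebuilds the PREFIX header. The `PREFIXES` table is an assumption: it lists the namespaces used in this guide as I believe snorql declares them, and `expand` and `prefix_header` are hypothetical helper names.

```python
# Prefix table (assumed to match the defaults snorql declares for these names).
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dbo": "http://dbpedia.org/ontology/",
    "dbpedia2": "http://dbpedia.org/property/",
}


def expand(prefixed_name, prefixes=PREFIXES):
    """Turn a prefixed name such as 'rdf:type' into the full URI it abbreviates."""
    prefix, _, local_name = prefixed_name.partition(":")
    return prefixes[prefix] + local_name


def prefix_header(prefixes=PREFIXES):
    """Build the PREFIX lines to paste on top of a query."""
    return "\n".join(f"PREFIX {name}: <{uri}>" for name, uri in prefixes.items())
```

A prefixed name is nothing more than string concatenation: the namespace URI followed by the local name.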

Filter data

The goal here is to filter the results down, keeping only companies that have more than 3210 employees.

To do so, we need to find this information somewhere. How do we get it?

First, we will try to list all the properties available for a Company resource:

SELECT ?property WHERE {
  {
    ?property rdfs:domain ?class .
    dbo:Company rdfs:subClassOf+ ?class .
  } UNION {
    ?property rdfs:domain dbo:Company .
  }
}

There, we can see a numberOfEmployees property. Let’s use it:

SELECT ?object ?employees
WHERE
{
  ?object rdf:type dbo:Company .
  ?object dbo:numberOfEmployees ?employees
  FILTER ( xsd:integer(?employees) >= 3210 ) .
}
ORDER BY DESC(xsd:integer(?employees))
LIMIT 2000

Exploring Data

Our goal now is to find the CEO. It seems easy, as there is even a ceo property.

SELECT ?object ?employees ?ceo
WHERE
{
  ?object rdf:type dbo:Company .
  ?object dbo:numberOfEmployees ?employees
  FILTER ( xsd:integer(?employees) >= 3210 ) .
  ?object dbo:ceo ?ceo
}
ORDER BY DESC(xsd:integer(?employees))
LIMIT 200

No results! Nothing! Unfortunately, it seems DBpedia doesn’t make use of this property …

As a plan B, let’s try to explore what kind of information can be found for a specific instance of a Company.

Let’s get back to our previous query:

SELECT ?object ?employees
WHERE
{
  ?object rdf:type dbo:Company .
  ?object dbo:numberOfEmployees ?employees
  FILTER ( xsd:integer(?employees) >= 3210 ) .
}
ORDER BY DESC(xsd:integer(?employees))
LIMIT 200

Clicking on Siemens should take you to: http://dbpedia.org/snorql/?describe=http%3A//dbpedia.org/resource/Siemens That’s what makes snorql cool: it runs queries on your behalf, so exploring is just a click away:

SELECT ?property ?hasValue ?isValueOf
WHERE {
  { <http://dbpedia.org/resource/Siemens> ?property ?hasValue }
  UNION
  { ?isValueOf ?property <http://dbpedia.org/resource/Siemens> }
}
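
The query snorql runs for you only depends on the resource URI, so it is easy to rebuild for any other resource. A minimal Python sketch, where `describe_query` is a hypothetical helper name and the character check is only a rough guard, not a full IRI validator:

```python
def describe_query(resource_uri):
    """Build the 'list every property' query for one resource URI."""
    # Characters that may not appear inside <...> in a SPARQL query.
    if any(ch in resource_uri for ch in '<>" {}|\\^`'):
        raise ValueError(f"not usable as an IRI in a SPARQL query: {resource_uri!r}")
    return (
        "SELECT ?property ?hasValue ?isValueOf\n"
        "WHERE {\n"
        f"  {{ <{resource_uri}> ?property ?hasValue }}\n"
        "  UNION\n"
        f"  {{ ?isValueOf ?property <{resource_uri}> }}\n"
        "}"
    )
```

Calling `describe_query("http://dbpedia.org/resource/Siemens")` gives back the same query as the one shown for Siemens.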

By browsing the page, we can see that there is no dbo:ceo. However, dbpedia2:keyPeople is quite interesting. Let’s use it:

```
SELECT ?object ?employees ?ceo
WHERE
{
  ?object rdf:type dbo:Company .
  ?object dbo:numberOfEmployees ?employees
  FILTER ( xsd:integer(?employees) >= 3210 ) .
  ?object dbpedia2:keyPeople ?ceo
}
ORDER BY DESC(xsd:integer(?employees))
LIMIT 200
```

If we click on the CEO of Siemens:

http://dbpedia.org/snorql/?describe=http%3A//dbpedia.org/resource/Joe_Kaeser

we see that a way to access:

- the CEO name would be to use foaf:givenName
- the CEO age would be to use dbo:birthDate

```
SELECT ?object ?employees ?ceo ?ceo_name ?ceo_birth
WHERE
{
  ?object rdf:type dbo:Company .
  ?object dbo:numberOfEmployees ?employees
  FILTER ( xsd:integer(?employees) >= 3210 ) .
  ?object dbpedia2:keyPeople ?ceo .
  ?ceo foaf:givenName ?ceo_name .
  ?ceo dbo:birthDate ?ceo_birth
  # born after 20 July 1969, the day of the Apollo 11 moon landing
  FILTER ( xsd:date(?ceo_birth) > "1969-07-20"^^xsd:date ) .
}
ORDER BY DESC(xsd:date(?ceo_birth))
```
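
The "not born yet when Neil Armstrong walked on the moon" condition can also be checked on the client side, once the rows are downloaded. A minimal Python sketch, assuming each row is a dict with a `ceo_birth` key holding an xsd:date string (YYYY-MM-DD); the function names and the sample dates in the usage note are made up for this illustration.

```python
from datetime import date

# Apollo 11 landed on 20 July 1969; used here as the cut-off for the challenge.
MOON_LANDING = date(1969, 7, 20)


def born_after_moon_landing(birth_date_literal):
    """Check an xsd:date string such as '1972-03-01' against the cut-off."""
    return date.fromisoformat(birth_date_literal) > MOON_LANDING


def keep_young_ceos(rows):
    """Keep the rows whose 'ceo_birth' value is after the moon landing.

    Rows with a missing or malformed date are skipped rather than guessed.
    """
    kept = []
    for row in rows:
        try:
            if born_after_moon_landing(row["ceo_birth"]):
                kept.append(row)
        except (KeyError, ValueError):
            continue
    return kept
```

For example, a row with `"ceo_birth": "1975-01-01"` is kept, while rows with `"1950-12-31"` or an unparsable value are dropped.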

Some issues

Some pain points:

because the CEO field is not really used on Wikipedia, we only have keyPeople, which is not always the CEO …
the data on DBpedia is not always fully up to date and can be old. Example: http://dbpedia.org/page/Google, where Sundar Pichai is not even mentioned! To mitigate this issue, we can use this interface: http://live.dbpedia.org/sparql, which gives fresher information

As we can see, Company does not give everything we could have expected at first.

Unfortunately, Wikipedia doesn’t provide absolutely every detail on everyone.

If Wikipedia had included a valuation field, it would have been fun to create a graph of the companies considered unicorns, but companies like Airbnb don’t put this kind of stuff on their Wikipedia page :/

The zipcode challenge

There is a crazy number of people who maintain paid databases of city zip codes across different countries. Why would you pay for that? And what if you want the location of those cities too?

Let’s get started

Recap of what we’ve already seen, in a different context

1) We are looking for cities. The first thing is to identify the ontology class we need to use; here it is dbo:City (see http://dbpedia.org/ontology for them all)

```
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?object WHERE {
  ?object rdf:type dbo:City .
}
LIMIT 100
```

2) We’ll have to find the properties we are interested in. It should be fairly simple to find them using snorql

3) Filter to remove the junk from the results

Final query

PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?country ?city_code ?city ?city_population ?city_location
WHERE {
  ?city rdf:type dbo:City .
  ?city foaf:name ?city_name .
  ?city <http://www.georss.org/georss/point> ?city_location .
  ?city dbpedia2:populationTotal ?city_population .
  ?city dbpedia2:postalCode ?city_code .

  ?city dbo:country ?country .
  ?country foaf:name ?country_name .
  # FILTER(?country_name = "Guatemala")
}
ORDER BY DESC(xsd:integer(?city_population))
LIMIT 2000

PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?country_name ?city_code ?city_name ?city_population ?city_location
WHERE {
  ?city rdf:type dbo:City .
  ?city rdfs:label ?city_name
  FILTER(langMatches(lang(?city_name), "EN"))

  ?city <http://www.georss.org/georss/point> ?city_location .
  ?city dbpedia2:populationTotal ?city_population .
  ?city dbpedia2:postalCode ?city_code .

  ?city dbo:country ?country .
  # str() drops the language tag so the label can be compared with a plain string
  ?country rdfs:label ?country_name
  FILTER(langMatches(lang(?country_name), "EN") && str(?country_name) = "Switzerland")
}
ORDER BY DESC(xsd:string(?country_name))

To run such a large query, you’ll have to host your own copy of DBpedia somewhere.
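
Before going that far, it may be worth trying to fetch the results in smaller pages with the LIMIT and OFFSET modifiers mentioned earlier. This is only a sketch of the idea, not something guaranteed to get past the limits of the public endpoint; `paged_queries` is a hypothetical helper name.

```python
def paged_queries(base_query, page_size, max_pages):
    """Yield copies of base_query with LIMIT/OFFSET appended, one per page.

    base_query must not already end with its own LIMIT or OFFSET, and it
    should contain an ORDER BY so that consecutive pages do not overlap.
    """
    for page in range(max_pages):
        yield f"{base_query}\nLIMIT {page_size}\nOFFSET {page * page_size}"
```

Each generated query can then be sent one after the other, stopping as soon as a page comes back with fewer than `page_size` rows.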

The challenge results

World’s wealthiest people

SELECT ?object ?name ?death_date ?wealth
WHERE
{
  ?object rdf:type dbo:Person .
  ?object foaf:givenName ?name .
  ?object dbo:networth ?wealth .
  MINUS {
    ?object dbo:deathDate ?death_date
  }
}
ORDER BY DESC(xsd:integer(?wealth))
LIMIT 300

Prepare your holiday in Australia

SELECT ?object ?name ?wikipage ?thumbnail ?location ?abstract
WHERE
{
  ?object rdf:type dbo:NaturalPlace .

  ?object foaf:name ?name .
  ?object foaf:isPrimaryTopicOf ?wikipage .
  ?object dbo:thumbnail ?thumbnail .
  ?object <http://www.georss.org/georss/point> ?location .
  ?object dbo:abstract ?abstract .
  FILTER(langMatches(lang(?abstract), "EN"))
  ?object dbpedia2:location ?country .
  ?country foaf:name ?country_name
  FILTER(xsd:string(?country_name) = 'Australia')
}
LIMIT 300
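
The ?location values come back as georss:point literals, which are a latitude and a longitude separated by a space. A minimal Python sketch to turn them into numbers you can put on a map; `parse_georss_point` is a hypothetical helper name and the coordinates in the usage note are made-up sample values.

```python
def parse_georss_point(point):
    """Split a georss:point literal ('latitude longitude') into two floats."""
    latitude_text, longitude_text = point.split()
    return float(latitude_text), float(longitude_text)
```

For instance, `parse_georss_point("-25.45 153.05")` gives the pair `(-25.45, 153.05)`; a value that does not contain exactly two numbers raises a ValueError.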

Instead of filling the pockets of TripAdvisor and clicking on their advertising, contribute to Wikipedia and fix the issues :)

Quick example: one of the most beautiful places on earth, Lake McKenzie, doesn’t appear because the page was badly done (http://dbpedia.org/page/Lake_McKenzie)! What a waste!

Some links to keep going

SPARQL query language for RDF: https://www.w3.org/TR/rdf-sparql-query/
a cheat sheet for SPARQL: http://www.slideshare.net/LeeFeigenbaum/sparql-cheat-sheet

Links to keep somewhere

http://mappings.dbpedia.org/server/ontology/classes/