Walkthrough DBpedia and Triplestore
A walk through the semantic web using Snorql and triplestore technologies
Introduction
Why this guide?
The first time I saw DBpedia, I was completely amazed by the endless possibilities it provides, and because I couldn't find any comprehensive resource on how to use it, I wrote one.
Wikipedia is certainly the most amazing thing the internet has ever built. DBpedia sits on top of it and tries to organize this crazy amount of information in a way that makes sense for humans.
At the beginning, extracting information from DBpedia felt like black magic to me: the XML-ish URIs felt obscure, finding the correct information seemed even worse, and the interface I was using was constantly throwing errors. What a pain!
This guide intends to give some very practical examples showing a method you can use to write your own queries. By the end of this guide you should be able to create a wide range of queries using SPARQL and answer many questions whose answers are spread all over Wikipedia.
Some of the examples are silly, some are useful, like planning your next holiday without having to look at the advertising on TripAdvisor.
Our goal here is to give you an overview of the possibilities DBpedia offers.
Notice: I am not a SPARQL practitioner! I wrote this document simply because of the lack of comprehensive documentation for getting started with DBpedia.
Our challenge
We will try to find information that is very difficult to get from a search engine. For example:
1) Find CEOs of companies with more than 3210 employees who weren't born yet when Neil Armstrong walked on the moon
2) List the zip code of every city in a country (you might find such a database on the internet, but good luck finding it for free)
3) Find the wealthiest people on earth, like Forbes does every year
4) Find all the natural places (with GPS location, name, abstract, wiki page and image) in Australia for your next holiday
Prerequisites
Before starting, make sure you have read these documents:
- DBpedia: https://en.wikipedia.org/wiki/DBpedia
- RDF: https://en.wikipedia.org/wiki/Resource_Description_Framework
- SPARQL: https://en.wikipedia.org/wiki/SPARQL
- Triplestore: https://en.wikipedia.org/wiki/Triplestore
- Ontology: https://en.wikipedia.org/wiki/Ontology_(information_science)
A method to get your answers
Tooling
DBpedia provides two SPARQL endpoints:
- http://dbpedia.org/snorql
- http://dbpedia.org/sparql
You can write SPARQL queries in both of them. However, it is much easier to start with snorql because:
- it makes it easy to discover data
- you can go back in history without losing the queries you made
- it gives you some shortcuts (prefixes) by default
Start your query
For our first challenge, we will first try to extract some companies. We start from the page that lists the classes available in the DBpedia ontology, where we find:
dbo:Company
We can use it like this:
SELECT ?object WHERE{
# BEGINNING OF OUR RDF triple.
?object # our output where we store object
rdf:type # that have the type
dbo:Company # company
. # END OF A TRIPLE
# we could have also done that in one line
# ?object rdf:type dbo:Company .
# or even
# ?object a dbo:Company .
}
LIMIT 5
Note:
- the LIMIT keyword comes AFTER the WHERE clause. SPARQL supports many query modifiers such as GROUP BY, HAVING, ORDER BY, LIMIT and OFFSET. Nothing new if you know SQL
- we terminate an RDF triple with a '.'
- examples often use 'a' instead of rdf:type. It is just a shortcut.
- rdf:foo is just a shortcut for http://www.w3.org/1999/02/22-rdf-syntax-ns#foo. That's basically why snorql is cool: by default, it declares a bunch of shortcuts to make our life easier (that's the PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> line at the top of snorql)
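Outside of snorql, these shortcuts have to be declared by hand. Here is a minimal sketch of the same query with explicit PREFIX lines, as it could be pasted into the plain /sparql endpoint. The rdf: namespace is the standard W3C one and the dbo: namespace is the one used later in this guide:

```sparql
# Same query as above, with the prefixes spelled out
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?object
WHERE {
  ?object rdf:type dbo:Company .
}
LIMIT 5
```

Without the PREFIX lines, each term would have to be written as a full URI between angle brackets, for example <http://dbpedia.org/ontology/Company>.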
Filter data
The goal here is to narrow the results down to companies that have more than 3210 employees.
To do so, we need to find this information somewhere. How do we get it?
First, we will list all the properties available for a Company resource:
select ?property where{
{
?property rdfs:domain ?class .
dbo:Company rdfs:subClassOf+ ?class.
} UNION {
?property rdfs:domain dbo:Company.
}}
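A possible variant of the query above also shows what kind of value each property is expected to hold. This is only a sketch: it assumes the ontology declares an rdfs:range for at least some properties, which is why that part is optional.

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo:  <http://dbpedia.org/ontology/>

SELECT DISTINCT ?property ?range
WHERE {
  {
    # properties declared on a superclass of Company
    ?property rdfs:domain ?class .
    dbo:Company rdfs:subClassOf+ ?class .
  } UNION {
    # properties declared directly on Company
    ?property rdfs:domain dbo:Company .
  }
  # not every property declares a range, so keep it optional
  OPTIONAL { ?property rdfs:range ?range }
}
ORDER BY ?property
```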
There, we can see a numberOfEmployees property. Let's use it:
SELECT ?object ?employees
WHERE
{
?object rdf:type dbo:Company .
?object dbo:numberOfEmployees ?employees
FILTER ( xsd:integer(?employees) >= 3210 ) .
}
ORDER BY DESC(xsd:integer(?employees))
LIMIT 2000
Exploring Data
Our goal now is to find the CEO. It seems easy, as there is even a ceo property.
SELECT ?object ?employees ?ceo
WHERE
{
?object rdf:type dbo:Company .
?object dbo:numberOfEmployees ?employees
FILTER ( xsd:integer(?employees) >= 3210 ) .
?object dbo:ceo ?ceo
}
ORDER BY DESC(xsd:integer(?employees))
LIMIT 200
No results! Nothing! Unfortunately, it seems DBpedia doesn't make use of this property…
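Before giving up on a property, it can be worth checking whether it is used at all, independently of the other conditions. A small sketch using a SPARQL 1.1 aggregate (the variable name ?companies_with_ceo is just an illustrative choice):

```sparql
PREFIX dbo: <http://dbpedia.org/ontology/>

# How many companies have any value at all for dbo:ceo?
SELECT (COUNT(DISTINCT ?object) AS ?companies_with_ceo)
WHERE {
  ?object a dbo:Company ;
          dbo:ceo ?ceo .
}
```

A count of 0 means the property is simply not populated for companies. A non-zero count would instead suggest that one of the other conditions in the query is too restrictive.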
As a plan B, let’s try to explore what kind of information can be found for a specific instance of a Company.
Let’s get back to our previous query:
SELECT ?object ?employees
WHERE
{
?object rdf:type dbo:Company .
?object dbo:numberOfEmployees ?employees
FILTER ( xsd:integer(?employees) >= 3210 ) .
}
ORDER BY DESC(xsd:integer(?employees))
LIMIT 200
Clicking on Siemens should take you to:
http://dbpedia.org/snorql/?describe=http%3A//dbpedia.org/resource/Siemens
That's what makes snorql cool: it runs queries on your behalf, so exploring is just a click away:
SELECT ?property ?hasValue ?isValueOf
WHERE {
{ <http://dbpedia.org/resource/Siemens> ?property ?hasValue }
UNION
{ ?isValueOf ?property <http://dbpedia.org/resource/Siemens> }
}
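When the list of properties is long, the same exploration can be narrowed down with a filter on the property URI. This is a sketch only: the search pattern is just a guess at what a CEO-like property might be called.

```sparql
# Outgoing properties of Siemens whose URI mentions "people", "person" or "ceo"
SELECT DISTINCT ?property ?hasValue
WHERE {
  <http://dbpedia.org/resource/Siemens> ?property ?hasValue .
  # STR() turns the property URI into a string; "i" makes the match case-insensitive
  FILTER ( REGEX(STR(?property), "people|person|ceo", "i") )
}
```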
By browsing the page, we can see that there is no dbo:ceo. However, dbpedia2:keyPeople looks quite interesting. Let's use it:
```
SELECT ?object ?employees ?ceo
WHERE
{
?object rdf:type dbo:Company .
?object dbo:numberOfEmployees ?employees
FILTER ( xsd:integer(?employees) >= 3210 ) .
?object dbpedia2:keyPeople ?ceo
}
ORDER BY DESC(xsd:integer(?employees))
LIMIT 200
```
If we click on the CEO of Siemens:
http://dbpedia.org/snorql/?describe=http%3A//dbpedia.org/resource/Joe_Kaeser
we see that:
- the CEO's name is available through foaf:givenName
- the CEO's age can be derived from dbo:birthDate
```
SELECT ?object ?employees ?ceo ?ceo_name ?ceo_birth
WHERE
{
?object rdf:type dbo:Company .
?object dbo:numberOfEmployees ?employees
FILTER ( xsd:integer(?employees) >= 3210 ) .
?object dbpedia2:keyPeople ?ceo .
?ceo foaf:givenName ?ceo_name .
?ceo dbo:birthDate ?ceo_birth
# keep only people born after Neil Armstrong's moonwalk (21 July 1969, UTC)
FILTER ( xsd:date(?ceo_birth) > "1969-07-21"^^xsd:date ) .
}
ORDER BY DESC(xsd:date(?ceo_birth))
```
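To turn a birth date into an approximate age, one option is to compute it inside the query. The sketch below assumes the birth date is stored in a form that starts with a four-digit year (such as 1957-06-23). It only compares years, so the result can be off by one.

```sparql
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?ceo_birth ?approx_age
WHERE {
  # Joe Kaeser is the example resource used above
  <http://dbpedia.org/resource/Joe_Kaeser> dbo:birthDate ?ceo_birth .
  # take the first four characters of the date as the birth year
  BIND ( xsd:integer(SUBSTR(STR(?ceo_birth), 1, 4)) AS ?birth_year )
  # difference between the current year and the birth year
  BIND ( YEAR(NOW()) - ?birth_year AS ?approx_age )
}
```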
Some issues
Some pain points:
- because the CEO field is not really used on Wikipedia, we can only rely on keyPeople, which is not always the CEO…
- the data on DBpedia is not always up to date and can be a bit old. Example: http://dbpedia.org/page/Google, where Sundar Pichai is not even mentioned! To mitigate this issue, we can use this interface, which gives fresher information: http://live.dbpedia.org/sparql
As we can see, Company doesn't give us everything we might have expected at first.
Unfortunately, Wikipedia doesn't provide absolutely every detail on everyone.
If Wikipedia included a valuation field, it would have been fun to create a graph of companies considered unicorns, but companies like Airbnb don't put this kind of information on their Wikipedia page :/
The zipcode challenge
A crazy number of people maintain paid databases of city zip codes across different countries. Why would you pay for that? And what if you want the location of those cities too?
Let’s get started
Recap of what we've already seen, in a different context
1) We are looking for cities. The first thing is to identify the ontology class we need to use; here it is dbo:City (http://dbpedia.org/ontology to see them all)
```
prefix dbo: <http://dbpedia.org/ontology/>
SELECT ?object WHERE { ?object rdf:type dbo:City . } LIMIT 100
```
2) We'll have to find the properties we are interested in. It should be fairly simple to find them using snorql
3) Filter to remove the junk from the results
Final query
prefix dbo: <http://dbpedia.org/ontology/>
SELECT ?country ?city_code ?city ?city_population ?city_location
WHERE {
?city rdf:type dbo:City .
?city foaf:name ?city_name .
?city <http://www.georss.org/georss/point> ?city_location .
?city dbpedia2:populationTotal ?city_population .
?city dbpedia2:postalCode ?city_code .
?city dbo:country ?country .
?country foaf:name ?country_name .
# FILTER(?country_name = "Guatemala")
}
ORDER BY DESC(xsd:integer(?city_population))
LIMIT 2000
The same query, this time using English labels and restricted to a single country:
prefix dbo: <http://dbpedia.org/ontology/>
SELECT ?country_name ?city_code ?city_name ?city_population ?city_location
WHERE {
?city rdf:type dbo:City .
?city rdfs:label ?city_name
FILTER(langMatches(lang(?city_name), "EN"))
?city <http://www.georss.org/georss/point> ?city_location .
?city dbpedia2:populationTotal ?city_population .
?city dbpedia2:postalCode ?city_code .
?city dbo:country ?country .
?country rdfs:label ?country_name
FILTER(langMatches(lang(?country_name), "EN") && str(?country_name) = "Switzerland")
}
ORDER BY DESC(xsd:string(?country_name))
To run such a large query, you'll have to host DBpedia yourself somewhere.
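For moderately large result sets, a lighter option that may sometimes be enough is to fetch the results page by page with LIMIT and OFFSET. This is a sketch only: the page size of 1000 is an arbitrary choice, dbpedia2: is assumed to be the http://dbpedia.org/property/ namespace that snorql declares by default, and whether the public endpoint accepts the sort on the full set is not guaranteed.

```sparql
PREFIX rdf:      <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo:      <http://dbpedia.org/ontology/>
PREFIX dbpedia2: <http://dbpedia.org/property/>

SELECT ?city ?city_code
WHERE {
  ?city rdf:type dbo:City .
  ?city dbpedia2:postalCode ?city_code .
}
# a stable order keeps the pages consistent from one request to the next
ORDER BY ?city
# page size
LIMIT 1000
# rows to skip: 0 for the first page, 1000 for the second, 2000 for the third...
OFFSET 2000
```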
The challenge results
World's wealthiest people
SELECT ?object ?name ?wealth
WHERE
{
?object rdf:type dbo:Person .
?object foaf:givenName ?name .
?object dbo:networth ?wealth .
minus {
?object dbo:deathDate ?death_date
}
}
ORDER BY DESC (xsd:integer(?wealth))
LIMIT 300
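The same "still alive" condition can also be written with FILTER NOT EXISTS, which some readers find closer to SQL. This is a sketch of an alternative formulation, not a different result: because ?object is shared with the rest of the pattern, it should keep the same rows as the MINUS version.

```sparql
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbo:  <http://dbpedia.org/ontology/>

SELECT ?object ?name ?wealth
WHERE {
  ?object rdf:type dbo:Person .
  ?object foaf:givenName ?name .
  ?object dbo:networth ?wealth .
  # keep only people without a recorded death date
  FILTER NOT EXISTS { ?object dbo:deathDate ?death_date }
}
ORDER BY DESC(xsd:integer(?wealth))
LIMIT 300
```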
Prepare your holiday in Australia
SELECT ?object ?name ?wikipage ?thumbnail ?location ?abstract
WHERE
{
?object rdf:type dbo:NaturalPlace .
?object foaf:name ?name .
?object foaf:isPrimaryTopicOf ?wikipage .
?object dbo:thumbnail ?thumbnail .
?object <http://www.georss.org/georss/point> ?location .
?object dbo:abstract ?abstract .
FILTER(langMatches(lang(?abstract), "EN"))
?object dbpedia2:location ?country .
?country foaf:name ?country_name
FILTER(str(?country_name) = 'Australia')
}
LIMIT 300
Instead of filling the pockets of TripAdvisor and clicking on their ads, contribute to Wikipedia and fix the issues :)
Quick example: one of the most beautiful places on earth, Lake McKenzie, doesn't appear because its page is badly done (http://dbpedia.org/page/Lake_McKenzie)! What a waste!
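One way to lose fewer places to incomplete pages is to make the nice-to-have details optional. The sketch below keeps a place even when its name, wiki page, thumbnail or coordinates are missing. It assumes the missing piece is one of those details: a page that lacks the dbo:NaturalPlace type or the location would still be left out. As before, dbpedia2: is assumed to be the http://dbpedia.org/property/ namespace declared by snorql.

```sparql
PREFIX rdf:      <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf:     <http://xmlns.com/foaf/0.1/>
PREFIX dbo:      <http://dbpedia.org/ontology/>
PREFIX dbpedia2: <http://dbpedia.org/property/>

SELECT ?object ?name ?wikipage ?thumbnail ?location
WHERE {
  # mandatory part: a natural place located in Australia
  ?object rdf:type dbo:NaturalPlace .
  ?object dbpedia2:location ?country .
  ?country foaf:name ?country_name .
  FILTER ( STR(?country_name) = "Australia" )
  # optional details: a missing value leaves the column empty
  # instead of removing the whole row
  OPTIONAL { ?object foaf:name ?name }
  OPTIONAL { ?object foaf:isPrimaryTopicOf ?wikipage }
  OPTIONAL { ?object dbo:thumbnail ?thumbnail }
  OPTIONAL { ?object <http://www.georss.org/georss/point> ?location }
}
LIMIT 300
```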
Some links to keep going
- SPARQL Query Language for RDF: https://www.w3.org/TR/rdf-sparql-query/
- a cheat sheet for SPARQL: http://www.slideshare.net/LeeFeigenbaum/sparql-cheat-sheet
Links to keep somewhere
- http://mappings.dbpedia.org/server/ontology/classes/