Comments (1)
Kyle Wesley
This is extremly insightful, thanks !!
In the era of Big Data, data quality - or how clean the data is - inevitably floats to the surface. Data scientists typically refer to data as dirty when it has a number of basic flaws:
Working with such dirty data can be exasperating with SQL data, as in many cases the only thing that can be done with a row of such data is to throw it out. However, one advantage of working with data as triples is that you focus on your data at the level of individual properties and cells, and as such can retrieve the useful data from a dataset without having to throw everything out.
Before digging deeply into such techniques, however, consider the bigger question about where such data comes from. Most data cleanup that occurs takes place because the data collected was application data that was focused on one particular use case. Sometimes you have no choice but to throw the baby out with the bathwater. One of the biggest issues involved with data cleanup is in ensuring that you are not in fact adding biases with any cleanup that you perform, that the data that you are adding does not in fact change the significance of the data you are working with.
There are a growing number of converters to take SQL and convert it into RDF. D2RQ, an open source Apache project, is a good place to start, both because it is open source and because it makes use of a number of W3C conversion specifications such as those given for R2RML.
D2RQ itself is a set of Java based command line routines written around the core Jena library, the same library that powers the Jena-Fuseki web server that often serves as people's first introduction to the world of RDF and SPARQL. The core D2RQ is a Turtle based mapping file that either can be built by hand or can be generated from the schema of a relational database, and can work with Oracle, mySQL, SQL Server, Postgres, and (with some work) any database that supports the jdbc protocols. As an example, creating a mapping for the "world" database from a local mySQL instance , creating an output mapping file (mapping.ttl) would look like the following:
> ./generate-mapping -u root -p Password1 \
-o mapping.ttl jdbc:mysql:///world
The mapping file itself (mapping.ttl) creates associations either in an application specific mapping language in Turtle, or uses the R2RML mappings from the W3C specification:
# mapping.ttl @prefix map: <#> . @prefix db: <> .prefix vocab: <vocab/> . .prefix vocab @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> . @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> . @prefix xsd: <http://www.w3.org/2001/XMLSchema#> . @prefix d2rq: <http://www.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/0.1#> . @prefix jdbc: <http://d2rq.org/terms/jdbc/> . map:database a d2rq:Database; d2rq:jdbcDriver "com.mysql.jdbc.Driver"; d2rq:jdbcDSN "jdbc:mysql:///world"; d2rq:username "root"; d2rq:password "Password1"; jdbc:autoReconnect "true"; jdbc:zeroDateTimeBehavior "convertToNull"; . # Table city map:city a d2rq:ClassMap; d2rq:dataStorage map:database; d2rq:uriPattern "city/@@city.ID@@"; d2rq:class vocab:city; d2rq:classDefinitionLabel "city"; . map:city__label a d2rq:PropertyBridge; d2rq:belongsToClassMap map:city; d2rq:property rdfs:label; d2rq:pattern "city #@@city.ID@@"; . map:city_ID a d2rq:PropertyBridge; d2rq:belongsToClassMap map:city; d2rq:property vocab:city_ID; d2rq:propertyDefinitionLabel "city ID"; d2rq:column "city.ID"; d2rq:datatype xsd:integer; . map:city_Name a d2rq:PropertyBridge; d2rq:belongsToClassMap map:city; d2rq:property vocab:city_Name; d2rq:propertyDefinitionLabel "city Name"; d2rq:column "city.Name"; . map:city_District a d2rq:PropertyBridge; d2rq:belongsToClassMap map:city; d2rq:property vocab:city_District; d2rq:propertyDefinitionLabel "city District"; d2rq:column "city.District"; . map:city_Population a d2rq:PropertyBridge; d2rq:belongsToClassMap map:city; d2rq:property vocab:city_Population; d2rq:propertyDefinitionLabel "city Population"; d2rq:column "city.Population"; d2rq:datatype xsd:integer; . map:city_CountryCode__ref a d2rq:PropertyBridge; d2rq:belongsToClassMap map:city; d2rq:property vocab:city_CountryCode; d2rq:refersToClassMap map:country; d2rq:join "city.CountryCode => country.Code"; . # Table country map:country a d2rq:ClassMap; d2rq:dataStorage map:database; d2rq:uriPattern "country/@@country.Code|urlify@@"; d2rq:class vocab:country; d2rq:classDefinitionLabel "country"; . map:country__label a d2rq:PropertyBridge; d2rq:belongsToClassMap map:country; d2rq:property rdfs:label; d2rq:pattern "country #@@country.Code@@"; . map:country_Code a d2rq:PropertyBridge; d2rq:belongsToClassMap map:country; d2rq:property vocab:country_Code; d2rq:propertyDefinitionLabel "country Code"; d2rq:column "country.Code"; . map:country_Name a d2rq:PropertyBridge; d2rq:belongsToClassMap map:country; d2rq:property vocab:country_Name; d2rq:propertyDefinitionLabel "country Name"; d2rq:column "country.Name"; . map:country_Continent a d2rq:PropertyBridge; d2rq:belongsToClassMap map:country; d2rq:property vocab:country_Continent; d2rq:propertyDefinitionLabel "country Continent"; d2rq:column "country.Continent"; . map:country_Region a d2rq:PropertyBridge; d2rq:belongsToClassMap map:country; d2rq:property vocab:country_Region; d2rq:propertyDefinitionLabel "country Region"; d2rq:column "country.Region"; . map:country_SurfaceArea a d2rq:PropertyBridge; d2rq:belongsToClassMap map:country; d2rq:property vocab:country_SurfaceArea; d2rq:propertyDefinitionLabel "country SurfaceArea"; d2rq:column "country.SurfaceArea"; d2rq:datatype xsd:double; . map:country_IndepYear a d2rq:PropertyBridge; d2rq:belongsToClassMap map:country; d2rq:property vocab:country_IndepYear; d2rq:propertyDefinitionLabel "country IndepYear"; d2rq:column "country.IndepYear"; d2rq:datatype xsd:integer; . map:country_Population a d2rq:PropertyBridge; d2rq:belongsToClassMap map:country; d2rq:property vocab:country_Population; d2rq:propertyDefinitionLabel "country Population"; d2rq:column "country.Population"; d2rq:datatype xsd:integer; . map:country_LifeExpectancy a d2rq:PropertyBridge; d2rq:belongsToClassMap map:country; d2rq:property vocab:country_LifeExpectancy; d2rq:propertyDefinitionLabel "country LifeExpectancy"; d2rq:column "country.LifeExpectancy"; d2rq:datatype xsd:double; . map:country_GNP a d2rq:PropertyBridge; d2rq:belongsToClassMap map:country; d2rq:property vocab:country_GNP; d2rq:propertyDefinitionLabel "country GNP"; d2rq:column "country.GNP"; d2rq:datatype xsd:double; . map:country_GNPOld a d2rq:PropertyBridge; d2rq:belongsToClassMap map:country; d2rq:property vocab:country_GNPOld; d2rq:propertyDefinitionLabel "country GNPOld"; d2rq:column "country.GNPOld"; d2rq:datatype xsd:double; . map:country_LocalName a d2rq:PropertyBridge; d2rq:belongsToClassMap map:country; d2rq:property vocab:country_LocalName; d2rq:propertyDefinitionLabel "country LocalName"; d2rq:column "country.LocalName"; . map:country_GovernmentForm a d2rq:PropertyBridge; d2rq:belongsToClassMap map:country; d2rq:property vocab:country_GovernmentForm; d2rq:propertyDefinitionLabel "country GovernmentForm"; d2rq:column "country.GovernmentForm"; . map:country_HeadOfState a d2rq:PropertyBridge; d2rq:belongsToClassMap map:country; d2rq:property vocab:country_HeadOfState; d2rq:propertyDefinitionLabel "country HeadOfState"; d2rq:column "country.HeadOfState"; . map:country_Capital a d2rq:PropertyBridge; d2rq:belongsToClassMap map:country; d2rq:property vocab:country_Capital; d2rq:propertyDefinitionLabel "country Capital"; d2rq:column "country.Capital"; d2rq:datatype xsd:integer; . map:country_Code2 a d2rq:PropertyBridge; d2rq:belongsToClassMap map:country; d2rq:property vocab:country_Code2; d2rq:propertyDefinitionLabel "country Code2"; d2rq:column "country.Code2"; . # Table countrylanguage map:countrylanguage a d2rq:ClassMap; d2rq:dataStorage map:database; d2rq:uriPattern "countrylanguage/@@countrylanguage.CountryCode|urlify@@/@@countrylanguage.Lan guage|urlify@@"; d2rq:class vocab:countrylanguage; d2rq:classDefinitionLabel "countrylanguage"; . map:countrylanguage__label a d2rq:PropertyBridge; d2rq:belongsToClassMap map:countrylanguage; d2rq:property rdfs:label; d2rq:pattern "countrylanguage #@@countrylanguage.CountryCode@@/@@countrylanguage.Language@@"; . map:countrylanguage_Language a d2rq:PropertyBridge; d2rq:belongsToClassMap map:countrylanguage; d2rq:property vocab:countrylanguage_Language; d2rq:propertyDefinitionLabel "countrylanguage Language"; d2rq:column "countrylanguage.Language"; . map:countrylanguage_IsOfficial a d2rq:PropertyBridge; d2rq:belongsToClassMap map:countrylanguage; d2rq:property vocab:countrylanguage_IsOfficial; d2rq:propertyDefinitionLabel "countrylanguage IsOfficial"; d2rq:column "countrylanguage.IsOfficial"; . map:countrylanguage_Percentage a d2rq:PropertyBridge; d2rq:belongsToClassMap map:countrylanguage; d2rq:property vocab:countrylanguage_Percentage; d2rq:propertyDefinitionLabel "countrylanguage Percentage"; d2rq:column "countrylanguage.Percentage"; d2rq:datatype xsd:double; . map:countrylanguage_CountryCode__ref a d2rq:PropertyBridge; d2rq:belongsToClassMap map:countrylanguage; d2rq:property vocab:countrylanguage_CountryCode; d2rq:refersToClassMap map:country; d2rq:join "countrylanguage.CountryCode => country.Code"; .
The mapping creates the relevant connection variables to the database, then from there identifies the mappings that convert tables into classes and fields into properties based upon the schema. This file can also be modified to generate different URIs based upon specified rules, allowing for a basic level of conversion that may produce additional keys, computed fields and so forth.
Once the mapping is generated, it can be used to create RDF dumps of the SQL databases (through the appropriately named dump-rdf API:
> ./dump-rdf -u root -f turtle \ -b http://example.com/ns/world/ -o world.ttl mapping.ttl ==> # world.ttl @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> . @prefix owl: <http://www.w3.org/2002/07/owl#> . @prefix xsd: <http://www.w3.org/2001/XMLSchema#> . @prefix vocab: <http://example.com/ns/world/vocab/> . @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> . @prefix map: <http://example.com/ns/world/#> . @prefix db: <http://example.com/ns/world/> . <http://example.com/ns/world/city/3782> a vocab:city ; rdfs:label "city #3782" ; vocab:city_CountryCode <http://example.com/ns/world/country/VNM> ; vocab:city_District "An Giang" ; vocab:city_ID 3782 ; vocab:city_Name "Long Xuyen" ; vocab:city_Population 132681 . <http://example.com/ns/world/city/1252> a vocab:city ; rdfs:label "city #1252" ; vocab:city_CountryCode <http://example.com/ns/world/country/IND> ; vocab:city_District "Andhra Pradesh" ; vocab:city_ID 1252 ; vocab:city_Name "Malkajgiri" ; vocab:city_Population 126066 . <http://example.com/ns/world/country/MAC> a vocab:country ; rdfs:label "country #MAC" ; vocab:country_Capital 2454 ; vocab:country_Code "MAC" ; vocab:country_Code2 "MO" ; vocab:country_Continent "Asia" ; vocab:country_GNP 5749.0E0 ; vocab:country_GNPOld 5940.0E0 ; vocab:country_GovernmentForm "Special Administrative Region of China" ; vocab:country_HeadOfState "Jiang Zemin" ; vocab:country_LifeExpectancy 81.6E0 ; vocab:country_LocalName "Macau/Aomen" ; vocab:country_Name "Macao" ; vocab:country_Population 473000 ; vocab:country_Region "Eastern Asia" ; vocab:country_SurfaceArea 18.0E0 . <http://example.com/ns/world/city/1385> a vocab:city ; rdfs:label "city #1385" ; vocab:city_CountryCode <http://example.com/ns/world/country/IRN> ; vocab:city_District "Teheran" ; vocab:city_ID 1385 ; vocab:city_Name "Karaj" ; vocab:city_Population 940968 . <http://example.com/ns/world/city/589> a vocab:city ; rdfs:label "city #589" ; vocab:city_CountryCode <http://example.com/ns/world/country/DOM> ; vocab:city_District "La Romana" ; vocab:city_ID 589 ; vocab:city_Name "La Romana" ; vocab:city_Population 140204 . ...
This mapping process can do a lot of your data cleansing. Fields that are set to NULL, for example, simply don't generate a corresponding triple for that row and column. Fields that are set to blank strings do, but you can modify an entry using a dt:condition property. For instance, to ensure that city names with blank values ("") don't generate triples, you would change the map:city entry from:
map:city_Name a d2rq:PropertyBridge; d2rq:belongsToClassMap map:city; d2rq:property vocab:city_Name; d2rq:propertyDefinitionLabel "city Name"; d2rq:column "city.Name"; .
to
map:city_Name a d2rq:PropertyBridge; d2rq:belongsToClassMap map:city; d2rq:property vocab:city_Name; d2rq:propertyDefinitionLabel "city Name"; d2rq:column "city.Name"; d2rq:condition city.Name <> ''; # check to make sure the database doesn't # use a blanks string. .
Similarly, the rdfs:label property currently defaults to "city #@@city.ID@@", where the expression within the ampersand corresponds to the ID of the specific city record. This can be changed so that it is the same as the city name followed by the country name by altering the property associated with rdfs:label for city:
map:city__label a d2rq:PropertyBridge; d2rq:belongsToClassMap map:city; d2rq:property rdfs:label; d2rq:pattern "@@city.Name@@"; .
These and similar mapping transformations are covered in greater detail at http://d2rq.org/d2rq-language. The transformations in this case are the SQL transformations, so if you wanted the rdfs:label to be in upper case, you'd use the triples:
map:city__label a d2rq:PropertyBridge; d2rq:belongsToClassMap map:city; d2rq:property rdfs:label; d2rq:sqlExpression "ucase(city.Name)"; .
You can also add new properties into the base set. For example, one very useful technique when working with updating content is to create a concatenated value from one or more changeable items in the table, then applying a hash function such as SHA1 to this concatenated value. If any of those fields change, the old computed hash will be different from the new hash. This way, the only time you would need to update a record would be when the old and new hashes diverge.
map:city__hash a d2rq:PropertyBridge; d2rq:belongsToClassMap map:city; d2rq:property vocab:city_Hash; d2rq:sqlExpression "sha1(concat(city.Name,city.District, city.CountryCode,city.ID))"; .
With these changes, a new entry now might look like the following:
http://example.com/ns/world/city/3122> a vocab:city ; rdfs:label "POTSDAM" ; vocab:city_CountryCode vocab:country_DEU ; vocab:city_District "Brandenburg" ; vocab:city_Hash "4958dc8728f5e4b4e0da9f8496a5a5c5b664d133" ; vocab:city_ID 3122 ; vocab:city_Name "Potsdam" ; vocab:city_Population 128983 .
Even this comparatively simple process has done a lot to clean up the data. Null fields and empty string entries are eliminated. We have added the first critical pieces of metadata (type, labels and hash entries).
The d2rq:sqlExpression property and d2rq:condition can also be used to do things like translate dates in the format MM/DD/YYYY into ISO YYYY-MM-DD formats, primarily by taking advantage of the SQL REGEX_REPLACE, REGEX_SUBSTR and SPLIT_PART functions. The conditional statement can then test whether the data in question is equal to or greater than a placeholder value (such as 9/9/9999 for a date, or -1 for a non-negative integer), and can eliminate that assertion prior to production of the RDF).
This is also a good place to add other metadata, especially if you plan on importing the mapping RDF into a separate graph in the triples database. This could include information such as range and domain information, SHACL validation code, or units information. For instance, the new hash property that was introduced might include the following entry:
map:city__hash a d2rq:PropertyBridge; d2rq:belongsToClassMap map:city; d2rq:property vocab:city_Hash; d2rq:sqlExpression "sha1(concat(city.Name,city.District, city.CountryCode,city.ID))"; . ex:HashPropertyShape a sh:PropertyShape ; sh:path vocab:city_Hash,vocab:country_Hash ; sh:name "Hash"; sh:description "An SHA1 encoding of several properties to determine when relevent information in the resource has changed." ; sh:pattern "[A-Fa-f0-9]{40}"; sh:minCount 0; sh:maxCount 1 .
This provides a label and description for the property, a regex pattern (here a forty character hexadecimal encoding), and an indication that the property may be optional (if no hash is provided, then the resource may be replaced automatically, or something similar).
Other assertions (such as those for population or area related data) might be provided for unit analysis:
ex:PopulationPropertyShape a sh:PropertyShape ; sh:path vocab:city_Population,vocab:country_Population; sh:minValueInclusive 0; sh:datatype xsd:long; sh:minCount 0; sh:maxCount 1; ex:units units:People; .
This provides a way of indicating a common, consistent definition of units (number of people, or square kilometers), along with potential conversion factors between units of the same kind (area) but different types (square meters, square kilometers, square miles, acres, hectares, etc.). This unit analysis can go a long way towards ensuring consistency even when the base units are different.
One other advantage that loading this mapping does is to provide provenance information, specifically what the source database for this information is. If you know a given class or property URI, you can automatically walk back from there to the database information (via the d2rq:belongsToClassMap and d2rq:dataStorage property).
The d2rq.org package also includes both a web server that can be run either standalone or as as tomcat or jetty WAR package, and the d2r-query tool that allows you to query the original SQL database via SPARQL without the need to import the data into a separate triple-store. For instance, if you wanted to know the population of a given city or state in the US, this can be done as follows:
> cat query.sp
@prefix vocab: <vocab/> .
var ?cityName ?countryName ?population where {
?city a vocab:city;
?city vocab:city_Name ?cityName.
?city vocab:city_Population ?population.
?city vocab:city_CountryCode ?countryCode.
?country vocab:country_Code ?countryCode.
?country vocab:country_Name ?countryName.
filter (?countryCode = 'USA')
}
> .d2r-query -f turtle mapping.ttl @query.sp
cityName | ?countryName | ?population
------------------------------------------------
Seattle | United States | 1872201
Portland | United States | 1256857
....
Normally the query would be provided as a string after the mapping, but it can be passed as a file by using the @ symbol.
A final note - the above "cleaning" exercise was also a key part of the ingestion process of data into a larger scale data hub. The ability to generate mappings significantly reduces the overhead of building such maps, meaning that most of the work of finalizing such maps can then be done in a matter of days rather than potentially weeks or longer.
As well - this process converts SQL data into RDF, but doesn't necessarily complete the round trip that many people would like to see with the ability to query such data using SQL (especially for analytics and visualization tools). While SPARQL is a powerful language (more powerful in many respects than SQL) it's not as widely deployed. Fortunately, many contemporary triple stores are now evolving ways of making specifically formatted RDF look like a relational database for query purposes. This will be discussed in more detail in a later post.
This is extremly insightful, thanks !!
Kurt is the founder and CEO of Semantical, LLC, a consulting company focusing on enterprise data hubs, metadata management, semantics, and NoSQL systems. He has developed large scale information and data governance strategies for Fortune 500 companies in the health care/insurance sector, media and entertainment, publishing, financial services and logistics arenas, as well as for government agencies in the defense and insurance sector (including the Affordable Care Act). Kurt holds a Bachelor of Science in Physics from the University of Illinois at Urbana–Champaign.
Leave your comments
Post comment as a guest