Geocoding with long tail linked data

The problem

Geocoding (sometimes called forward geocoding) uses a description of a location, typically a postal address or place name, to find geographic coordinates from spatial reference data such as building polygons, land parcels, street addresses, postal codes (e.g. ZIP codes, CEDEX) and so on. Geocoding facilitates spatial analysis using Geographic Information Systems and Enterprise Location Intelligence systems. A geocoder is a piece of software or a (web) service that implements a geocoding process.

There are many geocoders; the best known are Google Maps, Bing Maps and OpenStreetMap. The first two are commercial services that provide an API. The third is a collaborative open source project that provides both APIs and raw data. All of them pose some problems for companies that need to integrate geocoding information into their private information systems.

Unfortunately the Google Maps APIs, like many commercial geocoding services, are not suitable for generating data to be included in a corporate knowledge base because of their stringent license restrictions: the Google Maps geocoding APIs may only be used in conjunction with a Google Maps service; using geocoding results without displaying them on a map is prohibited. See the Google Maps Geocoding API Usage Limits for more information.

OpenStreetMap (OSM) does not suffer from such license restrictions and exposes both an API and a data interface. Although it is a very good service, the OSM API does not come with any SLA, and the public geocoding server is throttled: no more than one query per second is allowed. The OSM data interface, in turn, requires an expensive ETL (Extract, Transform, Load) process that must be repeated very often.
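
To give an idea of what the one-query-per-second limit means for a client, here is a minimal sketch of a throttle that spaces calls at least one second apart. It is only an illustration: the `Throttle` class and its parameters are hypothetical names, not part of OSM or GeocodIT, and the clock and sleep functions are injectable only to make the behaviour easy to check.

```python
import time


class Throttle:
    """Blocks callers so that consecutive calls are at least `min_interval` seconds apart."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # returns the current time in seconds
        self._sleep = sleep      # blocks for the given number of seconds
        self._last_call = None   # time of the previous call, None before the first one

    def wait(self):
        """Sleep just long enough to respect the minimum interval, then record the call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

A client would create one `Throttle(1.0)` and call `wait()` before every request to the public server. At this rate a batch of 100,000 addresses takes more than 27 hours, which is why a private geocoder becomes attractive.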

Besides this, in all of these systems civic numbers are not always accurate, mainly in rural areas.

As a consequence, creating and maintaining an accurate geocoding knowledge base using these data sets can be really expensive.

The solution

The GeocodIT project proposes a solution to this problem that leverages the many existing linked (open) data sets and semantic web practices to develop a near-zero-maintenance private geocoder.

The geocoder exposes:
  • a forward geocoder API that integrates Google Maps, Bing Maps and OSM with linked data;
  • a graph database containing all linked data;
  • a SPARQL query endpoint;
  • a shareable knowledge base description based on the KEES open language profile.

The most complex part of this solution is the set-up and operation of an RDF graph database able to manage potentially millions of pieces of information. These are barriers that prevent companies, especially SMEs, from using this approach. To overcome these barriers, GeocodIT uses the LinkedData.Center service, which provides a fully managed RDF graph database as a service starting from 69€/month (free plans available). This makes the set-up and operation of an RDF graph database cost effective and sustainable for any company.

Moreover, LinkedData.Center has software agents that recognize KEES descriptions and use them to manage automatic data ingestion and updating. This makes the upload of multiple datasets automatic and manageable.
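
As an illustration of how the SPARQL query endpoint listed above might be used, the following sketch builds a query that looks up coordinates by label and sends it following the standard SPARQL 1.1 Protocol. Several things here are assumptions: the endpoint URL is a placeholder, authentication for the private endpoint is omitted, and the use of `rdfs:label` with the W3C WGS84 `geo:lat` / `geo:long` properties is a guess at how addresses could be modelled, not the actual GeocodIT language profile.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint URL: replace it with the private endpoint of your knowledge base.
ENDPOINT = "https://example.org/sparql"


def build_query(label, limit=10):
    """Return a SPARQL query that looks up coordinates of resources with the given label."""
    # Escape backslashes, quotes and line breaks so the label is a valid string literal.
    safe = (label.replace("\\", "\\\\").replace('"', '\\"')
                 .replace("\n", "\\n").replace("\r", "\\r"))
    return (
        "PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>\n"
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?place ?lat ?long WHERE {\n"
        "  ?place rdfs:label ?label ; geo:lat ?lat ; geo:long ?long .\n"
        f'  FILTER(LCASE(STR(?label)) = LCASE("{safe}"))\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


def run_query(query, endpoint=ENDPOINT):
    """Send the query with HTTP GET and return a list of (place, lat, long) tuples."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return [
        (b["place"]["value"], float(b["lat"]["value"]), float(b["long"]["value"]))
        for b in data["results"]["bindings"]
    ]
```

Calling `run_query(build_query("Via Roma 1"))` would return the matching resources, provided the knowledge base actually uses the assumed properties.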

The big data sources

OpenStreetMap (OSM) is a collaborative project to create a free editable map of the world. Its data are released under the Open Data Commons Open Database License, a very open license that allows the reuse of data for any purpose. OSM data are available in a ready-to-use 5-star format (through the LinkedGeoData project). This allows LinkedData.Center to connect to them effortlessly.

Even though OSM coverage is still low compared to other renowned services (such as Google Maps or Bing Maps), the accuracy of its maps is astonishing, since all the data are collected and validated by people. Moreover, the number of active contributors is constantly rising.


Geoportale Nazionale is a portal produced by the Italian Ministry for the Environment and the Protection of Natural and Marine Resources. It offers a series of geo-referenced web services. Among the services offered there is a dataset of house numbers. This dataset is distributed under a Creative Commons BY-SA license.

The database was last updated in 2012 and, even though it is not complete and contains some inconsistencies, it features good coverage of the Italian territory. Data are supplied through a Web Map Service and must be converted before use.

Open data sources in the web's long tail

Data sets on the web are literally exploding. Of course they are still fragmented and of varying quality, but fixing that is only a matter of time. That is a great opportunity for companies. With Linked Open Data, it does not make sense to use only the biggest data sets, because the highest quality data are often in the smallest ones. These sit in the long tail of all the available data sets. Hence you have to use them, or you will lose a lot of value.

We spent some days scouting the web for resources containing specific geocoding data for the Italian territory. This scouting covered eGovernment portals (at national, regional and local level) and coordination with similar existing projects. A table containing some of the most interesting datasets analysed is available in the white paper that you can download from LinkedData.Center.

Focusing only on the Italian data released by institutional bodies, we found:
  • 119 municipal data sets;
  • 42 regional datasets;
  • 15 provincial datasets;
  • 2 national datasets;
  • 1 global dataset;

for a total of 178 different datasets from 54 unique data sources. This picture summarizes dataset size:




The GeocodIT architecture
The GeocodIT project system architecture is based on a data supply chain paradigm with three levels of data management services that link data providers to data subscribers:

  • Data services that read non-RDF data and transform them into RDF according to the defined language profile (GeocodIT Gateways);
  • Graph services (by LinkedData.Center) that integrate all data sources into a single big picture, providing automatic data ingestion, knowledge base hosting, data consolidation and querying through a private SPARQL endpoint;
  • Application services that consist of a set of APIs that query the graph database and provide data ready to be used by an application.
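
The first level, the data services, can be pictured with a small sketch that turns a tabular, non-RDF source into RDF triples serialized as N-Triples. This is not the project's gateway code (the actual gateways are written in PHP): the column names (`id`, `street`, `number`, `lat`, `long`), the `http://example.org/address/` namespace and the choice of `rdfs:label` plus the W3C WGS84 vocabulary are all assumptions made for the example, standing in for the real GeocodIT language profile.

```python
import csv
import io
from decimal import Decimal
from urllib.parse import quote

BASE = "http://example.org/address/"  # hypothetical namespace for address resources
GEO = "http://www.w3.org/2003/01/geo/wgs84_pos#"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
XSD_DECIMAL = "http://www.w3.org/2001/XMLSchema#decimal"


def escape_literal(text):
    """Escape the characters that are not allowed verbatim in an N-Triples literal."""
    return (text.replace("\\", "\\\\").replace('"', '\\"')
                .replace("\n", "\\n").replace("\r", "\\r"))


def row_to_ntriples(row):
    """Translate one CSV row (assumed columns: id, street, number, lat, long) to triples."""
    subject = "<" + BASE + quote(row["id"], safe="") + ">"
    label = escape_literal(row["street"] + " " + row["number"])
    # Decimal validates the number and "f" formatting avoids exponent notation.
    lat = format(Decimal(row["lat"]), "f")
    long_ = format(Decimal(row["long"]), "f")
    return [
        f'{subject} <{RDFS_LABEL}> "{label}" .',
        f'{subject} <{GEO}lat> "{lat}"^^<{XSD_DECIMAL}> .',
        f'{subject} <{GEO}long> "{long_}"^^<{XSD_DECIMAL}> .',
    ]


def csv_to_ntriples(csv_text):
    """Convert a whole CSV document to an N-Triples document, one triple per line."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        lines.extend(row_to_ntriples(row))
    return "".join(line + "\n" for line in lines)
```

Exposing such a conversion behind a URL is what lets the graph services ingest a three-star source as if it had been published as RDF in the first place.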

All required system components are available as a service or as open source code:

| Component name | Type | Description | Provided by |
| --- | --- | --- | --- |
| GeocodIT gateways | PHP code | A set of custom RESTful web services that translate each non-RDF open data source (three-star data source) into RDF according to the GeocodIT language profile. | GeocodIT project |
| RDF storage | service | Stores and indexes RDF data as a graph database. | LinkedData.Center |
| SPARQL endpoint | service | Allows querying the RDF graph database. | LinkedData.Center |
| Ingestion engine | service | Reads a KEES knowledge base configuration and automates dataset ingestion. | LinkedData.Center |
| Table API | PHP code | A web service that queries the SPARQL endpoint and produces tabular, paged data in a format suitable for a relational DBMS (CSV). | LinkedData.Center |
| GeocodIT API endpoint | PHP code | A RESTful web service that searches for an address on Google Maps, OSM and the GeocodIT knowledge base, returning geolocation info. | GeocodIT project |
| GeocodIT language profile | doc | A trimmed-down version or an extension of a set of ontologies that trades some expressive power for efficiency of reasoning in the GeocodIT domain. | GeocodIT project |
| GeocodIT KEES configuration | RDF dataset | A knowledge base configuration description expressed with the KEES language profile. | GeocodIT project |
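
The GeocodIT API endpoint described above searches several sources for the same address. How the real service combines their answers is not specified here, so the following is only a sketch of one plausible strategy: try each source in order and return the first answer. The `geocode` function, the provider names and the stub lookups are all hypothetical.

```python
from typing import Callable, Iterable, Optional, Tuple

Coordinates = Tuple[float, float]                  # (latitude, longitude)
Provider = Callable[[str], Optional[Coordinates]]  # returns None when nothing is found


def geocode(address: str,
            providers: Iterable[Tuple[str, Provider]]) -> Optional[Tuple[str, Coordinates]]:
    """Return (provider_name, coordinates) from the first provider that answers, else None."""
    for name, lookup in providers:
        try:
            result = lookup(address)
        except Exception:
            # A failing or unreachable source must not block the remaining ones.
            continue
        if result is not None:
            return name, result
    return None


# Stub lookups standing in for real calls to the knowledge base and to external services.
def knowledge_base_lookup(address: str) -> Optional[Coordinates]:
    return {"Via Roma 1, Lecco": (45.8566, 9.3977)}.get(address)


def external_lookup(address: str) -> Optional[Coordinates]:
    raise RuntimeError("service unavailable")


if __name__ == "__main__":
    sources = [("external", external_lookup), ("knowledge base", knowledge_base_lookup)]
    print(geocode("Via Roma 1, Lecco", sources))
```

The order of the sources is a design choice: putting the private knowledge base first keeps most queries away from rate-limited or license-restricted services.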



All GeocodIT components are released as open source PHP libraries. All the source code is shared in a GitHub repository under the MIT license.