Semantic Technologies and Linked Data Foundations

The last decade has seen a growing interest in the Semantic Web, which extends the web of documents to a web of data. This technology applies web-based standards for encoding datasets and linking them to other published datasets, so that applications can exploit data from many different sources. It also provides standards for encoding general knowledge in ontologies, so allowing enhancements based on automatic reasoning (improved querying, for example).
This article introduces Linked Data and related semantic technologies, and shows how they can be deployed in web applications.

We will describe a set of technologies that allows datasets to be published over the web, and queried effectively by applications. Compared with search engines such as Google and Yahoo, which are based on text-string matching, these technologies are "semantic". This means that information is represented not in a natural language like English or Spanish, but in a graph-based data model that facilitates extension, integration, inference and uniform querying. As a realistic application of semantic technologies, we consider the provision of a portal through which users can retrieve resources and information in the world of music. Consider for example the following tasks:
  • Retrieve a performance of the Beethoven violin concerto by a Chinese orchestra
  • Retrieve a photograph of the conductor of this performance
  • List male British rock musicians married to Scandinavians

Attempts to answer such queries through text-based search are unreliable: we might equally retrieve a performance in which the soloist was Chinese, or a rock musician who plays Scandinavian music. Using semantic technologies, resources such as the audio file of the performance, or the photograph of the conductor, can be annotated using the Resource Description Framework (RDF). In this framework, formal names can be assigned to what are called resources, which would include Beethoven, his violin concerto, the orchestra, and the conductor. Names can also be assigned to types (or classes) of resource (composers, concertos, etc.), and to relationships (or properties) that link resources (e.g., the "composed-by" relationship between composition and composer). By reasoning over facts encoded in this way, a query system can confirm that a performance was given by the Beijing Symphony Orchestra, that this orchestra is based in Beijing, that Beijing is located in China, and so forth -- thus combining geographical and musical knowledge in order to retrieve an answer.

In designing these semantic technologies, a key design decision was to leave open the naming of resources and properties, provided that names conform to the format for web resource names -- that is, provided they are Uniform Resource Identifiers or URIs.

All four of the above could be names for Beethoven, illustrating that the URI need not be human-readable (e.g., it might be an arbitrary string of letters and numbers), although identifiers should be resolvable to RDF representations that include human-readable labels, as explained below. If data from different sources are to be combined, it is therefore important to establish links, for instance through statements indicating that the above four URIs are synonymous. These statements, which can also be expressed in RDF, provide a means by which data published by many people or organisations can be combined into linked data.

In the following chapters, we will show through practical examples how to describe resources in RDF, how to convert data from other formats to RDF, how to publish RDF data, and how to link published RDF to other datasets. We will also consider how to utilize existing linked data in applications for querying, analysis, mining, and visualisation. All these topics will be illustrated by the case scenario of a music portal.

For examples of existing music portals, you can look at the BBC music reviews site, and the etree linked music site. These applications make use of a music ontology and a large dataset of musical information called MusicBrainz, which we will also exploit in our training material.

Background technologies

Linked data results from a confluence of earlier ideas and technologies, including hypertext, databases, ontologies, markup languages, the Internet, and the World Wide Web. In this section we provide background information on these technologies.

Internet

The Internet is an extension of the technology of computer networks. The earliest computers operated independently. In the 1960s and 1970s, it became common for computers in an organisation (e.g., university, government, company) to be linked together in a network. At the same time, there were early experiments in linking whole networks together, including the ARPANET in the United States. In the early 1980s, the Internet Protocol Suite (TCP/IP) for the ARPANET was standardised, to provide the basis for a network of networks that could embrace the whole world. The Internet spread mostly to Europe and Australia during the 1980s, and to the rest of the world during the 1990s.

The technology supporting the Internet includes the IP (Internet Protocol) system for addressing computers, so that messages can be routed from one computer to another. Each computer on the Internet is assigned an IP address, which can be written as four integers in the range 0 to 255 separated by dots. (To be precise, this convention holds for version 4 of the IP, but not for the more recent version 6.) The structure of messages is governed by application protocols that vary according to the service required (e.g., email, telephony, file transfer, hypertext). Examples of such protocols are FTP (File Transfer), USENET, and HTTP (HyperText Transfer).
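
The dotted notation can be explored with Python's standard `ipaddress` module. This is only a small sketch: the two addresses are taken from ranges reserved for documentation and are assumptions for illustration, not addresses of real machines.

```python
import ipaddress

# 192.0.2.1 lies in a range reserved for documentation (illustrative assumption).
addr = ipaddress.ip_address("192.0.2.1")

# An IPv4 address is a 32-bit number; the dotted form shows its four bytes.
octets = list(addr.packed)
print(addr.version, octets)

# An IPv6 address is longer and is written in a different notation.
addr6 = ipaddress.ip_address("2001:db8::1")
print(addr6.version, len(addr6.packed) * 8)
```

Each of the four integers fits in one byte, which is why its value is limited to the range 0 to 255.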

Hypertext

The concept of hypertext is normally dated from Vannevar Bush's 1945 article "As We May Think" [1], which proposed an organisation of external records (books, papers, photographs) corresponding to the association of ideas in human memory. By the 1960s, with more advanced computer technology, this concept was implemented by pioneers such as Douglas Engelbart and Ted Nelson in programs that allowed texts (or other media) to be viewed with some spans marked as hyperlinks, through which the reader could jump to another document.

World Wide Web

Informally people often use the terms "Internet" and "World Wide Web" (WWW) interchangeably, but this is inaccurate: the WWW is in fact just one of many services delivered over the Internet. The distinctive feature of the WWW is that it is a hypertext application, which exploits the Internet to allow cross-linking of documents all over the world.

The formal proposal for the WWW, and prototype software, were produced in 1990 by Tim Berners-Lee [2], and elaborated over the next few years. The basic idea is that a client application called a web browser obtains access to a document stored on another computer by sending a message, over the Internet, to a web server application, which sends back the source code for the document. Documents (or web pages) are written in the Hypertext Markup Language (HTML), which allows some spans to be marked as hyperlinks to a document at a specified location in the web, named using a Uniform Resource Locator (URL). When the user clicks on a hyperlink, the browser finds the IP address associated with the URL, and sends a message to this IP address requesting the HTML file at the given location in the server's file system; on receipt, this file is displayed in the browser.

Figure 1: Development of the WWW
Source: Radar Networks & Nova Spivack, 2007.
Citation: Nova Spivack's illustration of the evolution of the WWW.
License: CC (Some Rights Reserved)

Web 1.0 (static)

In 1993 came a turning point for the WWW with the introduction of the Mosaic web browser, which could display graphics as well as text. From that date, usage of the web grew rapidly, although most users operated only as consumers of content, not producers. During this early phase of web development, sometimes called Web 1.0, web pages were mostly static documents read from a server and displayed on a client, with no options for users to contribute content, or for content to be tailored to a user's specific demands.

Web 2.0 (dynamic)

Around 2000 a second phase of web development began with the increasing use of technologies allowing the user of a browser to interact with web pages and shape their content. There are basically two ways in which this can be done, known as client-side scripting, and server-side scripting.

Client-side scripting is achieved through program code incorporated into the HTML source, typically written in JavaScript. This code can be run on the user's computer, without any need for further messages to be sent to the server: hence "client-side".

Server-side scripting is achieved through messages to the server which invoke applications capable of creating the HTML source dynamically: the document eventually displayed to the user is therefore tailored in response to a specific request rather than retrieved from a previously stored file.
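
As a rough illustration of server-side scripting, the sketch below builds the HTML source from the parameters of a request instead of reading it from a stored file. The function name `render_page` and the `artist` parameter are hypothetical, chosen only for this example.

```python
from html import escape

def render_page(params):
    """Build an HTML page on the fly from request parameters (hypothetical handler)."""
    artist = params.get("artist", "unknown artist")
    # Escape user-supplied text so that it cannot inject markup into the page.
    heading = "Results for {}".format(escape(artist))
    return "<html><body><h1>{}</h1></body></html>".format(heading)

# The same code yields a different document for each request.
page = render_page({"artist": "The Beatles"})
print(page)
```

A real server would call such a function from its request handler and send the resulting string back to the browser.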

Social web

These Web 2.0 technologies have made possible a wide range of social web sites now familiar to everyone, including chat rooms, blogs, wikis, product reviews, e-markets, and crowdsourcing. Previously a consumer of content provided by others, the web user has now become a prosumer, capable of adding information to a web page, and in this way communicating not only with the server, but through the server with other clients as well.

Web 3.0 (semantic)

During the 1990s, Berners-Lee and collaborators developed proposals for a further stage of web development known as the Semantic Web. This far-reaching concept, first publicised in a 2001 article in the Scientific American [3], is partly implemented in the current stage of web development sometimes called Web 3.0. At present we cannot see clearly what lies beyond Web 3.0, but in Figure 1 we allow for future stages in Semantic Web development by including a loosely defined further stage "Web 4.0".

In their 2001 article, Berners-Lee and co-authors pointed out that existing web content was usable by people but not by computer applications. There were many computer applications available for tasks like planning, or scheduling, or analysis, but they worked only on data files in some standard logical format, not on information presented in natural language text. A person could plan an itinerary by looking at web pages giving flight schedules, hotel locations, and so forth, but it was not yet possible (then as now) for programs to extract such information reliably from text-based web pages. The initial aim of the Semantic Web is to provide standards through which people can publish documents that consist of data, or perhaps a mixture of data and text, so allowing programs to combine data from many datasets, just as a person can combine information from many text documents in order to solve a problem or perform a task.

Figure 2: From documents to data
Source: Own source.

Ontologies

Datasets usually encode facts about individual objects and events, such as the following two facts about the Beatles (shown here in English rather than a database format):

The Beatles are a music group
The Beatles are a group

There is something odd about this pair of facts: having said that the Beatles are a music group, why must we add the more generic fact that they are a group? Must we list these two facts for all music groups -- not to mention all groups of acrobats or actors etc.? Must we also add all other consequences of being a music group, such as performing music and playing musical instruments?

Ontologies allow more efficient storage and use of data by encoding generic facts about classes (or types of object), such as the following:

Every music group is a group
Every theatre group is a group

It is now sufficient to state that the Beatles (and the Rolling Stones, etc.) are music groups, and the more general fact that they are groups can be derived through inference. Ontologies thus enhance the value of data by allowing a computer application to infer, automatically, many essential facts that may be obvious to a person but not to a program.
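
The saving can be sketched in a few lines of Python. The class names and the dictionary `SUBCLASS_OF` are assumptions made for this illustration; a real system would read such generic facts from an ontology.

```python
# Generic facts: each class is mapped to its direct superclass (hypothetical names).
SUBCLASS_OF = {"MusicGroup": "Group", "TheatreGroup": "Group"}

def classes_of(asserted_class):
    """Return the asserted class followed by every class reached via subclass links."""
    result = [asserted_class]
    # Follow the chain upwards, stopping if a class would be visited twice.
    while result[-1] in SUBCLASS_OF and SUBCLASS_OF[result[-1]] not in result:
        result.append(SUBCLASS_OF[result[-1]])
    return result

# Only one fact is stated for the Beatles; the second one is derived.
print(classes_of("MusicGroup"))  # ['MusicGroup', 'Group']
```

Only the fact "the Beatles are a music group" has to be stored; membership of the more general class is computed when needed.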

To allow automatic inference, ontologies may be encoded in some version of mathematical logic. There are many formal logics, which vary in expressivity (the meanings that can be expressed) and tractability (the speed with which inferences can be drawn). To be useful in practical applications it is necessary to trade expressivity for tractability, and description logic, which is implemented in the Web Ontology Language OWL, does precisely this. However, despite these restrictions on expressivity, OWL cannot yet be used efficiently for inference over very large datasets, as required by Linked Data applications. For this reason, most reasoning for Linked Data relies on the far simpler logical resources of RDF-Schema, with OWL used sparingly if at all.

Background standards

The technologies described in the previous section are implemented through a number of standard protocols and languages, with probably familiar acronyms like HTTP, URI, XML, RDF, RDFS, OWL, SPARQL. You can look up details of these standards as needed, but as background it is useful to know a little about each one, and in particular what they are for. The later standards in this list build on the earlier ones, so they are often described as a stack of languages, as shown in Figure 3.

HTTP

From using the World Wide Web, most people are familiar with the HTTP prefix in front of web addresses. The meaning of this acronym is HyperText Transfer Protocol, and it refers to a set of conventions governing communication between a client and a server. More precisely, these conventions define the structure of request messages from client to server, and response messages from server to client. Message structure varies from one protocol to another: thus a different protocol such as FTP (File Transfer Protocol) will define a different message structure. A request message in HTTP consists essentially of a method to be applied to a resource. The fundamental method is GET, which requests the server to send back a representation of the resource, typically an HTML file that can be displayed in a browser pane. However, there are several other methods including DELETE, which deletes the resource, and POST, which submits data to be processed with respect to the resource. The resource, specified through a relative document ID (often a filename/path on the server), may be a document, or picture, or an executable that will generate data for the response.
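
To make the structure of a request message concrete, the following sketch formats a minimal HTTP/1.1 request as text. The helper `build_request` is hypothetical, and the host name `example.org` is a placeholder rather than a server discussed in this article.

```python
def build_request(method, path, host):
    """Format a minimal HTTP/1.1 request: request line, headers, then a blank line."""
    lines = [
        "{} {} HTTP/1.1".format(method, path),  # the method applied to a resource
        "Host: {}".format(host),                # names the server being addressed
        "Accept: text/html",                    # the representation we would like
        "",                                     # an empty line ends the headers
        "",
    ]
    return "\r\n".join(lines)

message = build_request("GET", "/doc/Frequently_Asked_Questions", "example.org")
print(message)
```

Replacing "GET" by "DELETE" changes the method while leaving the resource untouched, which is the separation of method and resource described above.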

URI

A Uniform Resource Identifier (URI) is defined in the standard [4] as "a compact sequence of characters that identifies an abstract or physical resource". The word "compact" here means that the string must contain no space characters (or other white-space padding). "Abstract or physical" means that the URI may refer to an abstract resource such as the concepts "Beethoven" and "symphony", as well as to a document or other file that can be retrieved from the WWW.

A URI that is linked to a retrievable resource is known also as a Uniform Resource Locator, or URL. For instance, the following URI for the MusicBrainz FAQ page is a URL:

The definition of a correctly formed URI is quite complicated, with constituents that vary according to the scheme (the initial constituent before the colon), which specifies the relevant internet protocol, such as HTTP. For an HTTP URI, the other constituents most relevant for our purposes are the authority, and the path, which occur in that order. The authority specifies the server where the resource (if it really exists) is located. Finally, the path locates the resource precisely within the server's directory structure.

Thus for the URL given above, "http" is the scheme, "" is the authority, and "/doc/Frequently_Asked_Questions" is the path; the other characters such as the colon are punctuation separating these constituents.
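
The same decomposition can be carried out with `urlsplit` from the Python standard library. In this sketch the authority `example.org` is a placeholder that we assume for illustration; only the path is taken from the text above. The second call uses a telephone number in the "tel" scheme to show that such a URI has no authority at all.

```python
from urllib.parse import urlsplit

# The authority "example.org" is a placeholder assumed for this example.
url = "http://example.org/doc/Frequently_Asked_Questions"
parts = urlsplit(url)
print(parts.scheme)   # the scheme, before the colon
print(parts.netloc)   # the authority
print(parts.path)     # the path within the server

# A URI in the "tel" scheme has a scheme and a number, but no authority.
tel = urlsplit("tel:+1-816-555-1212")
print(tel.scheme, repr(tel.netloc), tel.path)
```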

Note that the constituents following the scheme will be different for different schemes: thus the "tel" scheme, for example, is followed simply by a telephone number. Here are some examples indicating this variety:


Since URIs are typically long, and hence difficult to read and write, it is convenient to make use of abbreviated forms known as "compact URIs" or "CURIEs". A compact URI consists simply of a namespace and a local name, separated by a colon. Typically, the namespace includes the scheme, the authority, and perhaps the early part of the path; the local name contains the remainder of the URI, chosen so as to convey intuitively what the URI means, while observing some syntactic restrictions (e.g., there should be no further use of the characters "/" and "#"). Thus in the example just given, one could introduce a namespace "dbp" for the initial part of the URI, so reducing the URI to "dbp:Karlsruhe", where the local name preserves the substring that is significant to human readers. We will use this convenient method of abbreviation often in the rest of this book.
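
The abbreviation mechanism amounts to a lookup table from namespace names to URI prefixes, as the following sketch suggests. The namespace URI `http://example.org/resource/` is a hypothetical stand-in, and the functions `expand` and `compact` are names invented for this example.

```python
# Hypothetical namespace table; the URI is a placeholder, not a real dataset.
PREFIXES = {"dbp": "http://example.org/resource/"}

def expand(curie):
    """Turn a compact URI such as 'dbp:Karlsruhe' into a full URI."""
    prefix, local = curie.split(":", 1)
    return PREFIXES[prefix] + local

def compact(uri):
    """Abbreviate a full URI if a known namespace matches; otherwise return it unchanged."""
    for prefix, namespace in PREFIXES.items():
        if uri.startswith(namespace):
            local = uri[len(namespace):]
            # The local name should not contain further "/" or "#" characters.
            if "/" not in local and "#" not in local:
                return prefix + ":" + local
    return uri

full = expand("dbp:Karlsruhe")
print(full)
print(compact(full))
```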

XML

Extensible Markup Language (XML) is a refinement of Standard Generalised Markup Language (SGML), which was introduced in the 1980s as a meta-language suitable for defining particular mark-up languages -- for instance, languages for adding formatting information to documents. The basic concept, now well known from widespread use of HTML, is that labelled tags are placed around spans of text, thus indicating perhaps that the span should be formatted in italics:

<i>text in italics</i>

The italic tag "i" is part of HTML, not SGML, but the convention of placing tags within angle brackets, and distinguishing the closing tag by a forward slash character, comes from SGML, as does the syntax for adding attributes to the opening tag, as in this example yielding blue text:

<font color="blue">blue text</font>

SGML is versatile because it can be used simply for encoding data, as well as for adding structure to text.

In the mid-1990s, the newly formed World Wide Web Consortium (abbreviated W3C) set up a working group to simplify and rework SGML to meet the requirements of the WWW. The result was the first XML specification, which became a W3C recommendation in 1998, and has become the standard convention for data exchange over the web. The essential advance on SGML is that XML is simpler and stricter: to give just one example, it is permissible in SGML (but not in XML) to omit closing tags, as in the common practice of inserting <p> without a closing </p> when writing HTML.
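
The strictness of XML is easy to observe with the parser in the Python standard library, which rejects a document whose closing tags are missing. The helper `is_well_formed` and the two sample documents are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Every opening tag has a matching closing tag.
print(is_well_formed("<doc><p>one</p><p>two</p></doc>"))  # True

# The closing </p> tags are omitted, as is often done in HTML.
print(is_well_formed("<doc><p>one<p>two</doc>"))  # False
```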

RDF

Figure 3: Stack of Semantic Web Languages
Citation: Semantic Web Language stack - architectural layers.
License: Copyright (c) 2006 World Wide Web Consortium, (Massachusetts Institute of Technology, European Research Consortium for Informatics and Mathematics, Keio University). All rights reserved.

The Resource Description Framework (RDF) was introduced originally as a data model for metadata, which are attributes of a document, or image, or program, etc. such as its author, date, location, and coding standards. First published as a W3C recommendation in 1999 [5], the framework has since been updated, and generalised in its purpose to cover not only metadata (strictly interpreted) but knowledge of all kinds.

The basic idea of RDF is a very simple one: namely, that statements are represented as triples of the form subject--predicate--object, each triple expressing a relation (represented by the predicate resource) between the subject and object resources. Formally, the subject is expressed by a URI or a blank node, the predicate by a URI, and the object by a URI, a blank node, or a literal such as a number or string.
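
These restrictions on the three positions can be written down as a small data model. The sketch below is not a real RDF library: the class names and the `example.org` URIs are assumptions for illustration, and, following the RDF specification, a blank node is also accepted in the object position.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class URI:
    value: str

@dataclass(frozen=True)
class BlankNode:
    label: str

@dataclass(frozen=True)
class Literal:
    value: Union[str, int, float]

@dataclass(frozen=True)
class Triple:
    subject: object
    predicate: object
    obj: object

    def is_valid(self):
        """Check that each position holds one of the kinds of term allowed there."""
        return (isinstance(self.subject, (URI, BlankNode))
                and isinstance(self.predicate, URI)
                and isinstance(self.obj, (URI, BlankNode, Literal)))

# Hypothetical URIs in a placeholder namespace.
good = Triple(URI("http://example.org/The_Beatles"),
              URI("http://example.org/name"),
              Literal("The Beatles"))
bad = Triple(Literal("The Beatles"),
             URI("http://example.org/name"),
             URI("http://example.org/The_Beatles"))
print(good.is_valid(), bad.is_valid())  # True False
```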

The original W3C recommendation for exposing RDF data was that it should be encoded in XML syntax, sometimes called RDF/XML. It is for this reason that the semantic web "stack" of languages has RDF implemented on top of XML. However, notations have also been proposed which are easier for people to read and write, such as Turtle, in which statements are formed simply by listing the elements of the triple on a line, in the order subject-predicate-object, followed by a full stop, with URIs possibly shortened through the use of namespace abbreviations defined by "prefix" and "base" statements, as in the following example:

@base <>.
@prefix mo:<>.
<artist/b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d#_> a mo:MusicGroup.

Here the subject is abbreviated using the "base" statement, and the object is abbreviated using the "prefix" statement. The very simple predicate "a" relies on a further Turtle shorthand for very commonly used predicates, and refers to the "type" relation between a resource and its class. This can be seen from the following equivalent Turtle statement, in which all URIs are shown in their cumbersome unabbreviated form. Note that this statement should occupy a single line, although it is shown here with wrapping so that it fits on the page. The format in which every URI in a Turtle statement is fully expanded is also known as N-Triples.


Where multiple statements apply to the same subject, they can be abbreviated by placing a semi-colon after the first object, and then giving further predicate-object pairs separated by semi-colons, with a full stop after the final pair. For statements having the same subject and predicate, objects can be listed in a similar way separated by commas. These conventions are illustrated by the following statements:

@base <>.
@prefix mo:<>.
@prefix rdfs:<>.
@prefix owl:<>.
@prefix dbpedia:<>.
@prefix bbc:<>.

  rdfs:label "The Beatles";
  owl:sameAs dbpedia:The_Beatles,

RDFS

RDF Schema (RDFS) is an extension of RDF which allows resources to be classified explicitly as classes or properties; it also supports some further statements that depend on this classification, such as class-subclass or property-subproperty relationships, and domain and range of a property. Some important resources in RDFS are as follows (for brevity we use the "rdfs" prefix defined above):

rdfs:Class
A resource representing the class of all classes.
rdfs:subClassOf
Used as a predicate to mean that the subject is a subclass of the object.
rdfs:subPropertyOf
Used as a predicate to mean that the subject is a sub-property of the object.
rdfs:domain
Used as a predicate when the subject is a property and the object is the class that is the domain of this property.
rdfs:range
Used as a predicate when the subject is a property and the object is the class that is the range of this property.

The following statements in Turtle serve to illustrate these RDFS resources. Note that they use abbreviated URIs: the "mo" and "rdfs" prefixes are given above, and the "rdf" and "foaf" prefixes are assumed to be declared in the same way.

mo:member rdf:type rdf:Property.
mo:member rdfs:domain mo:MusicGroup.
mo:member rdfs:range foaf:Agent.
mo:MusicGroup rdfs:subClassOf foaf:Group.

In these statements, the resource "mo:member" denotes the property that relates a music group to each of its members -- for instance, the Beatles to John, Paul, George and Ringo, as in the following triple:

dbpedia:The_Beatles mo:member dbpedia:Ringo_Starr.

The second and third statements above give the domain and range of the property "mo:member". Intuitively, their meaning is that if "mo:member" is employed as predicate in a triple, its subject will belong to the class "mo:MusicGroup", and its object to the class "foaf:Agent". The fourth statement means that any resource belonging to the class "mo:MusicGroup" will also belong to the (more general) class "foaf:Group".

An important gain in adding such statements is that they allow new facts to be inferred from existing ones. Consider for instance how they may be combined with the statement (just given) that Ringo is a member of the Beatles. Using the domain and range statements for the property "mo:member", it follows directly that the Beatles are a music group, and that Ringo is an agent; using the subClassOf statement, it follows further that the Beatles are a group. Encoded in Turtle, these inferred facts are as follows:

dbpedia:The_Beatles rdf:type mo:MusicGroup.
dbpedia:Ringo_Starr rdf:type foaf:Agent.
dbpedia:The_Beatles rdf:type foaf:Group.
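
The derivation of these three facts can be reproduced by a small forward-chaining loop. This is a sketch only: it represents triples as plain tuples of strings, covers just the domain, range and subclass rules, and the function name `infer` is our own; production reasoners are considerably more elaborate.

```python
RDF_TYPE = "rdf:type"

schema = [
    ("mo:member", "rdfs:domain", "mo:MusicGroup"),
    ("mo:member", "rdfs:range", "foaf:Agent"),
    ("mo:MusicGroup", "rdfs:subClassOf", "foaf:Group"),
]
data = [("dbpedia:The_Beatles", "mo:member", "dbpedia:Ringo_Starr")]

def infer(triples):
    """Apply the domain, range and subclass rules until no new triple appears."""
    known = set(triples)
    while True:
        new = set()
        for s, p, o in known:
            for s2, p2, o2 in known:
                if p2 == "rdfs:domain" and s2 == p:
                    new.add((s, RDF_TYPE, o2))   # the subject belongs to the domain
                elif p2 == "rdfs:range" and s2 == p:
                    new.add((o, RDF_TYPE, o2))   # the object belongs to the range
                elif p2 == "rdfs:subClassOf" and p == RDF_TYPE and o == s2:
                    new.add((s, RDF_TYPE, o2))   # membership passes to the superclass
        if new <= known:
            return known - set(triples)          # only the inferred triples
        known |= new

inferred = infer(schema + data)
for triple in sorted(inferred):
    print(" ".join(triple) + ".")
```

The loop needs two passes here: the first derives that the Beatles are a music group and that Ringo is an agent, and the second uses the subclass statement to derive that the Beatles are a group.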

RDFS also contains some predicates for linking a resource to information useful in presentation and navigation, but not for inference. These include the following:

rdfs:comment
Associates a resource with a human-readable description of it.
rdfs:label
Associates a resource with a human-readable label for it.
rdfs:seeAlso
Associates a resource with another resource that might provide additional information about it.
rdfs:isDefinedBy
A sub-property of "rdfs:seeAlso", indicating a resource that contains a definition of the subject resource.

OWL

The Web Ontology Language (OWL) extends RDFS to provide an implementation of a description logic, capable of expressing more complex general statements about individuals, classes and properties.

OWL was developed in the early 2000s and became a W3C standard (along with RDFS) in 2004. The acronym OWL was preferred to the more logical WOL because it is easier to pronounce, provides a handy logo, and is suggestive of wisdom. Of course the name also reminds us of the character in "Winnie the Pooh" who misspells his name "Wol".

The reason for choosing description logic, rather than a more expressive kind of mathematical logic, has already been mentioned: the aim was to achieve fast scalable reasoning services, and hence to use a logic for which efficient reasoning algorithms were already available. In fact description logics are more a family of languages than a single language. They can be thought of as a palette of operators for constructing classes, properties and statements, from which the user can make different selections, so obtaining fragments with different profiles of expressivity and tractability.

Figure 4: Fragments of OWL 2
Citation: OWL 2 Fragments
License: CC Attribution 3.0

The OWL standard continues to be developed, and the current version, OWL 2, provides for the fragments shown in Figure 4; their meanings are as follows:

OWL 2 Full
Used informally to refer to RDF graphs considered as OWL 2 ontologies and interpreted using the RDF-Based Semantics.
OWL 2 DL
Used informally to refer to OWL 2 ontologies interpreted using the formal semantics of Description Logic ("Direct Semantics").
OWL 2 EL
A simple fragment limited to basic classification, allowing reasoning in polynomial time.
OWL 2 QL
A fragment designed to be translatable to querying in relational databases.
OWL 2 RL
A fragment designed to be efficiently implementable using rule-based reasoners.

As already explained, a detailed understanding of OWL is not necessary for working with Linked Data. When reasoning over huge amounts of data, only the simplest reasoning processes are computationally efficient, and these can for the most part be implemented using only the resources of RDFS. Very briefly, the additional resources in OWL are terms providing mainly for the following:

  • Class construction: forming new classes from existing classes, properties and individuals (e.g., ObjectIntersectionOf);
  • Property construction: distinguishing object properties (resources as values) from data properties (literals as values);
  • Class axioms: statements about classes, describing sub-class, equivalence and disjointness relationships;
  • Property axioms: statements about properties, including relationships such as equivalence and sub-property, and also attributes such as whether a property is functional, transitive, and so forth;
  • Individual axioms: statements about individuals, including class membership, and whether two resources represent the same individual or different individuals.
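
To give a feeling for what a property axiom contributes, the sketch below computes the consequences of declaring a property transitive. The property is represented simply as a set of pairs; the names `ex:Beijing`, `ex:China` and `ex:Asia` and the idea of a "located in" property are hypothetical examples, not terms from an existing ontology.

```python
def transitive_closure(pairs):
    """Add (a, d) whenever (a, b) and (b, d) are present, until nothing changes."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

# Asserted facts for a hypothetical transitive "located in" property.
located_in = {("ex:Beijing", "ex:China"), ("ex:China", "ex:Asia")}

closure = transitive_closure(located_in)
print(sorted(closure - located_in))  # [('ex:Beijing', 'ex:Asia')]
```

Without the axiom only the two asserted pairs are known; with it, a reasoner may add the third.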

SPARQL

The SPARQL Protocol and RDF Query Language (a recursive acronym, since it contains itself) is a language for formulating queries over RDF data. It is the Semantic Web's counterpart to SQL (Structured Query Language), which has been a standard language for querying relational databases since the 1980s. SPARQL is a recent addition to the Semantic Web stack of languages, having been recommended as a W3C standard in 2008 [6].

Since chapter 3 of this book is dedicated to SPARQL, we limit ourselves here to an example that illustrates its purpose. Comparing SPARQL with SQL, the key difference is that it is designed for retrieving information from sets of triples, rather than from data organised into relations (i.e., tables). Queries are therefore formulated using lists of RDF triples in which some URIs or literals are replaced by variables, as in the following:

PREFIX dc: <>
PREFIX foaf: <>
PREFIX dbpedia: <>
PREFIX music-ont: <>

SELECT ?album_name ?track_title
WHERE {
  dbpedia:The_Beatles foaf:made ?album .
  ?album dc:title ?album_name .
  ?album music-ont:track ?track .
  ?track dc:title ?track_title
}

Translated into English, the meaning of this query is as follows:

Retrieve a list of all album names AN and track titles TT in the data for which the following conditions hold:
  1. There is an album A made by the Beatles.
  2. Album A has the title AN.
  3. There is a track T on album A.
  4. Track T has the title TT.

Or more colloquially: retrieve the titles of all tracks on albums by the Beatles, along with the corresponding album titles. The response should be a list of pairs, each containing an album name and a track title.
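
How such a list of triple patterns is matched against data can be sketched in plain Python. This is a toy matcher, not a SPARQL engine: terms are simple strings, names starting with "?" are variables, and the four data triples (including the URIs `ex:album1` and `ex:track1`) are hypothetical sample data.

```python
def match(pattern, triple, binding):
    """Extend a binding so that the pattern equals the triple, or return None."""
    result = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was given earlier, if any.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result

def query(patterns, triples):
    """Return every binding of the variables that satisfies all patterns."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for t in triples
                    for b2 in [match(pattern, t, b)] if b2 is not None]
    return bindings

triples = [
    ("dbpedia:The_Beatles", "foaf:made", "ex:album1"),
    ("ex:album1", "dc:title", "Help!"),
    ("ex:album1", "music-ont:track", "ex:track1"),
    ("ex:track1", "dc:title", "Yesterday"),
]
patterns = [
    ("dbpedia:The_Beatles", "foaf:made", "?album"),
    ("?album", "dc:title", "?album_name"),
    ("?album", "music-ont:track", "?track"),
    ("?track", "dc:title", "?track_title"),
]

rows = [(b["?album_name"], b["?track_title"]) for b in query(patterns, triples)]
print(rows)  # [('Help!', 'Yesterday')]
```

Because the variable "?album" occurs in three patterns, it ties them together: the title and the track must belong to the same album that the Beatles made.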

This example shows the simplest kind of query, in which the WHERE clause is simply a list of triples (containing variables). SPARQL also provides some more sophisticated constructs: these include FILTER, which allows conditions on the values of variables (e.g., that a number should be between 1990 and 2000); also OPTIONAL, which specifies data that should be retrieved if available, while allowing the query to succeed even when they are unavailable. For more information on these more complex constructs, see Chapter 3.

Practically, to pose a query to a dataset you need to use a program or website that serves as a SPARQL endpoint. Typically, an endpoint interface provides text fields where you can type the URL of the dataset you wish to query, and the query itself (e.g., the SELECT query in the example above). On hitting the "Submit" button, you obtain a dynamically generated webpage listing the values of the query variables in a table. There are also libraries allowing you to incorporate SPARQL queries into your programs, such as the Java library Jena or the PHP library BOTK.
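
Behind the form, a query is typically sent to the endpoint as an HTTP request in which the query text travels in a parameter named "query", as defined by the SPARQL protocol. The sketch below only constructs such a request URL and does not contact any server; the endpoint address `http://example.org/sparql` is a hypothetical placeholder.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

ENDPOINT = "http://example.org/sparql"  # hypothetical endpoint, not a real service

QUERY = "SELECT ?s WHERE { ?s ?p ?o } LIMIT 5"

def build_query_url(endpoint, query):
    """Encode the query text as the 'query' parameter of a GET request URL."""
    return endpoint + "?" + urlencode({"query": query})

url = build_query_url(ENDPOINT, QUERY)
print(url)

# Decoding the URL again recovers the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded == QUERY)  # True
```

Libraries such as those mentioned above wrap this step and also parse the table of results returned by the endpoint.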


LinkedData.Center provides you with a fully configured and fully managed server that exposes a private SPARQL endpoint. Thanks to high-capacity RDF storage (included in the service), you can load any number of RDF triples and query the resulting graph using SPARQL.


This article is based on the results of the EUCLID project (EU FP7 - 296229). Except for third-party materials and where otherwise stated, the content of this article is made available under a Creative Commons Attribution 3.0 Unported License.