What is a dataset?

Within LinkedData.Center, a dataset is any web resource that provides structured data:
  • tabular data: tables of data that are normally downloadable as CSV files or spreadsheets;
  • object collections: e.g. XML documents or JSON files;
  • linked data: the web standard for describing data. You can find linked data in RDF resources or SPARQL endpoints.

As datasets are web resources, they must have a URL. Typically you discover dataset URLs in data portals or other data sources published by a data provider. Data sources often publish links to the datasets together with their metadata, which are structured data about the dataset itself (e.g. the dataset author, the time of the last update, etc.).

Examples:

  • Recorded Crime Summary Data for London Borough of Barnet: a tabular data resource exposed on the UK open data portal (i.e. the data source) by the London Borough of Barnet (i.e. the data provider). Here is the metadata URL and here is the download CSV URL.
  • Approved Building Permits in Boston: a tabular data resource exposed by the Socrata portal. Here is the metadata URL and here is the download CSV URL.
  • Italian census data: a linked data SPARQL endpoint provided by ISTAT. Note that we do not need separate metadata because such information is embedded in the linked data.
  • All Wikipedia data: a linked data SPARQL endpoint provided by DBpedia.

Where can I find datasets? And can I use them?

As you can imagine, the web hosts millions of data sources and billions of datasets that expose open data, that is, data that you are free to use for any purpose (even a commercial one). Below are some hints to start your open data scouting:


Besides open data, there are lots of free datasets with more restrictive licenses suitable for specific purposes (e.g. Google Maps), paid datasets and, of course, your private data. All these datasets can be used in LinkedData.Center.

Is there a size limit for a dataset?

From a theoretical point of view, there is no limit on dataset size. In practice, datasets bigger than 50 MB uncompressed are difficult to manage, mainly because they can cause timeouts during download. This is rarely a big issue, since nearly all data providers let you split or filter large datasets.

From a commercial point of view, in our standard Starter Kit offering we put some reasonable limits to keep the ontology design simple (see SLA).

Could I use private datasets?

Sure! You can use any private dataset with full security. LinkedData.Center supports the HTTP Basic Authentication scheme on both plain and encrypted channels.
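To make the authentication scheme concrete, here is a minimal Python sketch of how a client could attach HTTP Basic credentials to a dataset request. The URL, user name and password are hypothetical placeholders, and the helper names are ours; this is not LinkedData.Center code, only an illustration of the standard scheme.

```python
import base64
import urllib.request


def basic_auth_header(username: str, password: str) -> str:
    """Return the value of an HTTP Basic 'Authorization' header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")


def build_request(url: str, username: str, password: str) -> urllib.request.Request:
    """Prepare (but do not send) a GET request carrying the credentials."""
    request = urllib.request.Request(url)
    request.add_header("Authorization", basic_auth_header(username, password))
    return request


# Hypothetical dataset URL and credentials, for illustration only.
request = build_request("https://example.org/private/data.csv", "alice", "s3cret")
header_value = request.get_header("Authorization")
```

Basic credentials are only base64-encoded, not encrypted, so an encrypted (HTTPS) channel is the safer choice whenever the data provider offers one.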

OK, I got it, but what is needed to provide datasets to LinkedData.Center?

When you activate a Starter Kit service you will be asked to submit datasets to LinkedData.Center.

  • If your dataset is tabular, you'll have to provide:
    • the URL of a standard CSV dataset (we can manage both gzip and zip compressed files);
    • the URL or a textual description of the data structure (i.e. the metadata).
  • If your dataset is linked data, you'll have to provide the SPARQL endpoint and the SPARQL query (optionally a SPARQL query that uses the "CONSTRUCT" query form).
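As an illustration of the linked data case, the sketch below pairs a SPARQL endpoint with a CONSTRUCT query and builds the GET URL defined by the SPARQL protocol (the query travels in the `query` parameter). The endpoint address and the query itself are assumptions chosen for the example, not values required by LinkedData.Center.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint; replace it with the one published by your data provider.
ENDPOINT = "https://example.org/sparql"

# A CONSTRUCT query returns an RDF graph instead of a table of bindings.
CONSTRUCT_QUERY = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
CONSTRUCT { ?person foaf:name ?name }
WHERE     { ?person foaf:name ?name }
LIMIT 100
""".strip()


def sparql_get_url(endpoint: str, query: str) -> str:
    """Build the URL of a SPARQL protocol GET request for the given query."""
    return f"{endpoint}?{urlencode({'query': query})}"


url = sparql_get_url(ENDPOINT, CONSTRUCT_QUERY)
# Decoding the URL again gives back the original query text.
decoded_query = parse_qs(urlsplit(url).query)["query"][0]
```

When the request is actually sent, an `Accept` header such as `text/turtle` tells the endpoint which RDF serialization is wanted.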

In both cases, if data are private, you must give us the credentials to access the dataset. 

We are flexible enough to manage individual cases (e.g. Excel files, non-standard CSV files, etc.). Ask our experts to find a solution to your special case.

Sorry, my data are secret...

We do not need to know your real data, just their structure. If you have confidential data that you do not want to share with us, just provide us with fake data and temporary credentials. In your knowledge base, you will be able to change both the dataset URL and the access credentials whenever you decide.

My data are in a file on my desktop only

Just install SDaaS in a container and mount your data as a local volume.
If you have lots of data on your own network, please consider the LinkedData.Center Enterprise offering which moves all our technology to your premises.

What do I have to do if I want to publish my own datasets?

Just define your data license and data access criteria.

LinkedData.Center provides you with all tools to publish data (open or not) as tabular data or as linked data. You will be compliant with best practices and with the e-Government guidelines.

Do you mean that... Can I sell open data?

Sure, through LinkedData.Center you can easily create value around open data (aggregating, cleansing, validating, filtering, etc.) and then sell it.

Can I load a dataset directly in my Graph Database?

Yes, you can directly load any linked data. To load other types of datasets (i.e. tables or objects) you need to write a simple gateway. If you subscribe to the Starter Kit or the add dataset services we will do it for you.
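To give an idea of what such a gateway does, here is a minimal Python sketch that turns CSV rows into N-Triples lines that a graph database can load. It is not the gateway LinkedData.Center writes for you: the `http://example.org/` namespace, the sample rows and the function names are all hypothetical, and the sketch assumes that identifiers and column names are safe to use inside an IRI.

```python
import csv
import io

BASE = "http://example.org/"  # hypothetical namespace for the generated IRIs

# Made-up sample data; the second row has an empty 'permits' field.
SAMPLE_CSV = "id,city,permits\n1,Springfield,12\n2,Shelbyville,\n"


def escape_literal(value: str) -> str:
    """Escape the characters that N-Triples does not allow raw in a literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def csv_to_ntriples(csv_text: str, id_column: str) -> list[str]:
    """Emit one triple per non-empty cell; the id column names the subject."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}resource/{row[id_column]}>"
        for column, value in row.items():
            if column == id_column or value == "":
                continue  # skip the key itself and missing values
            predicate = f"<{BASE}property/{column}>"
            lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return lines


triples = csv_to_ntriples(SAMPLE_CSV, "id")
```

Every value is emitted as a plain string literal here; a real gateway would normally map columns to the properties and datatypes of a chosen vocabulary.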

Learn more

...about tabular data

Tabular data are lists of records, each one providing a fixed set of fields. The meaning of the fields is implicit or must be described elsewhere: in a book, on a web page, etc.

Tabular data can be managed by an RDBMS and queried through SQL.
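As a small illustration, the Python sketch below loads a CSV table into an in-memory SQLite database and queries it with SQL. The table name, the column names and the rows are made up for the example.

```python
import csv
import io
import sqlite3

# Made-up sample table: every record has the same three fields.
CSV_TEXT = "id,city,permits\n1,Springfield,12\n2,Shelbyville,7\n3,Springfield,5\n"

# Parse the CSV and convert the numeric fields explicitly.
rows = [
    (int(record["id"]), record["city"], int(record["permits"]))
    for record in csv.DictReader(io.StringIO(CSV_TEXT))
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE permits (id INTEGER PRIMARY KEY, city TEXT, permits INTEGER)")
conn.executemany("INSERT INTO permits VALUES (?, ?, ?)", rows)

# Aggregate the permits per city with plain SQL.
totals = conn.execute(
    "SELECT city, SUM(permits) FROM permits GROUP BY city ORDER BY city"
).fetchall()
conn.close()
```

Note that nothing in the table says what `permits` means; that knowledge lives in the metadata, outside the data themselves.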


...about object collections

The structure of objects can be implicit or explicit, following a formal model or not (i.e. schemaless). As with tabular data, the metadata should be defined elsewhere.

Objects can be managed by object databases and queried with specific languages like XPath. Any table can be translated into an object collection but not vice versa.
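The following Python sketch queries a small XML object collection with the XPath subset supported by the standard library. The document and its element names are invented for the example. The second object holds two `applicant` children, a nested, repeated field that a single flat table row cannot represent directly, which hints at why the translation does not always work in the opposite direction.

```python
import xml.etree.ElementTree as ET

# Made-up object collection; the second permit has a repeated child element.
XML_TEXT = """
<permits>
  <permit id="1" city="Springfield">
    <applicant>Ann</applicant>
  </permit>
  <permit id="2" city="Shelbyville">
    <applicant>Bob</applicant>
    <applicant>Cy</applicant>
  </permit>
</permits>
""".strip()

root = ET.fromstring(XML_TEXT)

# Select objects by attribute value.
shelbyville_ids = [p.get("id") for p in root.findall("./permit[@city='Shelbyville']")]

# Navigate into the nested, repeated field of one object.
applicants = [a.text for a in root.findall("./permit[@id='2']/applicant")]
```

`xml.etree.ElementTree` implements only a limited part of XPath; full XPath support needs a dedicated library.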


...about linked data

The structure of linked data is based on the Resource Description Framework (RDF). RDF is a W3C standard and the foundation of the new web architecture.

The meaning of the data (i.e. the semantics) is described by formal vocabularies (i.e. ontologies) that can be embedded in the dataset itself or in another linked resource. The same holds for metadata.

Linked data need to be managed by graph databases and are queried using SPARQL through a standard SPARQL endpoint.
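To give a feel for the graph model, the toy Python sketch below stores a few triples (subject, predicate, object) and matches them against a single pattern, which is the basic building block of a SPARQL query. It is not a SPARQL engine or a graph database: the `ex:` names and the `match` helper are made up for the illustration.

```python
from typing import Optional

Triple = tuple[str, str, str]

# A tiny made-up graph: every fact is one (subject, predicate, object) triple.
GRAPH: set[Triple] = {
    ("ex:permit1", "ex:city", "Springfield"),
    ("ex:permit1", "ex:applicant", "Ann"),
    ("ex:permit2", "ex:city", "Shelbyville"),
    ("ex:permit2", "ex:applicant", "Bob"),
}


def match(
    graph: set[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> list[Triple]:
    """Return the triples matching the pattern; None acts as a variable."""
    return sorted(
        t
        for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Roughly what "SELECT ?s WHERE { ?s ex:city 'Springfield' }" asks for.
in_springfield = [t[0] for t in match(GRAPH, p="ex:city", o="Springfield")]

# Everything known about one resource.
about_permit2 = match(GRAPH, s="ex:permit2")
```

A real SPARQL endpoint joins many such patterns and resolves prefixed names to full IRIs, which this sketch leaves out.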

RDF is able to describe any kind of dataset (e.g. tables, objects), adding power to information. RDF is the format we use internally to model all data.