Data Quality

For now, this page discusses a proposal for a short project (4-7 months) looking at data quality approaches for collaborative online sources of information. It could be an interesting fit for the geospatial strand of JISC funding call 15/10 on infrastructures for education and research.

Overview

INSPIRE does not mandate quality standards, but the Joint Research Centre recognises that not considering quality is an oversight.

The ISO standards regarding quality of geographic information are oriented towards quality assurance in the data production process. They also assume that the end user will accept the product as homogeneous, where a generic quality statement applies to all objects and areas.

This means a lack of focus on the value of data quality information from the end-user's perspective - what problems are we helping to solve by publishing data quality information?

Potentially, this would allow us to move from the question 'Is dataset X useful for task Y?' to 'Is dataset X useful for task Y at location Z?'. After all, each researcher works at a specific scale and for a specific purpose, so introducing scale and location explicitly into the decision making should assist data selection and fitness-for-purpose analysis.
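
To make the 'at location Z' question concrete, here is a minimal sketch in Python. It assumes a dataset carries one generic quality statement plus optional local statements tied to bounding boxes. The class names, the `positional_error_m` measure and the example figures are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class QualityRegion:
    """A quality statement valid only inside a bounding box (hypothetical structure)."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float
    positional_error_m: float  # assumed measure: expected positional error in metres

    def contains(self, x: float, y: float) -> bool:
        return self.min_x <= x <= self.max_x and self.min_y <= y <= self.max_y


@dataclass
class Dataset:
    """A dataset with a generic quality statement and optional local overrides."""
    name: str
    default_error_m: float
    regions: list = field(default_factory=list)

    def error_at(self, x: float, y: float) -> float:
        # Prefer the first local statement covering the location;
        # otherwise fall back to the generic statement.
        for region in self.regions:
            if region.contains(x, y):
                return region.positional_error_m
        return self.default_error_m


def fit_for_task(dataset: Dataset, x: float, y: float, max_error_m: float) -> bool:
    """Answer 'Is dataset X useful for task Y at location Z?' for one tolerance."""
    return dataset.error_at(x, y) <= max_error_m


# Illustrative values only: a dataset that is better surveyed in one area.
dataset_x = Dataset(
    name="dataset X",
    default_error_m=25.0,
    regions=[QualityRegion(0.0, 0.0, 10.0, 10.0, positional_error_m=2.0)],
)
```

With these made-up figures, a task tolerating 5 m of error is served inside the well-surveyed box but not outside it, which a single generic quality statement could not express.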

For example, OS Research has done extensive work on a "vernacular gazetteer" of shapes for social names, but data quality concerns prohibit its release, even for research.

In addition, emerging OS Research on the usability of geographical information is exposing the producer-centric nature of the datasets, and the need to develop novel, user-centric approaches to data production and delivery.

The geodata world has its domain-specific problems, but can benefit from looking at lighter-weight or differently conceived quality approaches from other domains.

The aim should be to encourage and support the publication of more data of variable, knowably unknown quality.

Quality currently looks like a niche issue. New developments in data sharing over the internet will raise the priority of machine-reusable descriptions of data quality (distributed databases; multiple copies of the same resource, unsynchronised or variably edited; more collaborative mapping projects along the lines of OSM and OpenAddresses; lossy or transient datastores; linked data pollution).

Fit for 15/10?

See the briefing paper on the JISC geospatial strand for more context - up to 9 months duration between Feb and Dec 2011.

The briefing emphasises infrastructure development and the re-use of tools and services, both those directly supported by JISC and others popular on the web.

We'd be looking at a mixture of service/tool re-use and structured interviews with academic geodata users regarding their concerns around quality.

We would also explore the concept of fit-for-use from a user-centred perspective (in contrast to a producer-centred view). This should guide the development of user-centred metadata discovery.

Themes

Starting with Nothing

The traditional ISO data quality model assumes theoretically perfect data. Many measures and tests can only be run in comparison with a known higher quality, more "authoritative" dataset or through ground truth.
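
As an illustration of a measure that only works against a reference, here is a small Python sketch of positional root-mean-square error between matched points in a test dataset and in a more "authoritative" one. The function name and the assumption that points are already paired and in the same projected coordinate system are ours, not part of any ISO text.

```python
import math


def positional_rmse(test_points, reference_points):
    """Root-mean-square positional offset between matched point pairs.

    Assumes both lists hold (x, y) tuples in the same projected coordinate
    system, and that test_points[i] corresponds to reference_points[i].
    """
    if not test_points or len(test_points) != len(reference_points):
        raise ValueError("need two non-empty lists of equal length")
    squared_offsets = [
        (tx - rx) ** 2 + (ty - ry) ** 2
        for (tx, ty), (rx, ry) in zip(test_points, reference_points)
    ]
    return math.sqrt(sum(squared_offsets) / len(squared_offsets))
```

Without the reference list there is nothing to compute, which is the "starting with nothing" problem for collaboratively produced data.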

Perfect and homogeneous data is not a reasonable assumption: in reality, data are collected at different times by different people, and even inside the same organisation standards and procedures change over time due to organisational and technological changes. A paradigm shift to a heterogeneous understanding of geographic datasets is needed.

Attestation

Peer review for data quality, and the social aspect of data sources. Research shows that data can improve through peer review, even when the users are not domain experts. This should suit the academic community well, given the trend of academic researchers releasing their source code and data sets for review. However, the relationships between experts and general users are somewhat more contested in the academic environment, and there is a need for expert overview and for mechanisms that allow the credibility and knowledge of a reviewer to be understood. For example, the assertion that a bus stop is mislocated on the map can be made by anyone, but the identification of a rare butterfly requires specific domain knowledge. In both cases, a non-professional scientist can contribute the information accurately (some hobbyists have more time on their hands to identify butterflies than professional researchers).

Edit-time quality reporting

The JOSM Validator model looks at the logical consistency of edits to OSM before commit. Again there's a data production bias here: how many research users of OSM, for example, are active editors?
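
To give a flavour of edit-time checking, here is a minimal Python sketch of logical consistency tests run on a single way before it is committed. It is not the JOSM Validator's code or API. The function name, the representation of a way as a list of node ids plus a tag dictionary, and the three rules are assumptions made for illustration.

```python
def validate_way(node_ids, tags):
    """Return a list of logical consistency problems for one edited way.

    node_ids: ordered list of node identifiers making up the way.
    tags: dict of key/value tags attached to the way.
    An empty result means the edit passes these (illustrative) checks.
    """
    problems = []
    if len(node_ids) < 2:
        problems.append("way has fewer than two nodes")
    # Compare each node with its successor to catch immediate repeats.
    if any(a == b for a, b in zip(node_ids, node_ids[1:])):
        problems.append("way repeats a node consecutively")
    if not tags:
        problems.append("way has no tags")
    return problems
```

Checks like these tell the editor that the data is internally well formed. They say nothing about whether it matches the world, which is the part a research user is more likely to care about.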

More generally, there is a balance between using the labour of contributors (while assuming that they are not very trustworthy) and using 'distributed intelligence' (which assumes that their analysis skills can be used).

Interviews

References

Participants in this document / proposal

Jo Walsh - EDINA, University of Edinburgh
Muki Haklay - University College London