Data Quality
For now, this page discusses a proposal for a short project (4-7 months) looking at data quality approaches to collaborative online sources of information. It could be an interesting fit for the geospatial strand of JISC funding call 15/10 on infrastructures for education and research.
Overview
- INSPIRE does not mandate quality standards, but the Joint Research Centre recognises that not considering quality is an oversight.
- The ISO standards regarding quality of geographic information are oriented towards quality assurance in the data production process
- This means a lack of focus on the value of data quality information from the end-user's perspective - what problems are we helping to solve by publishing data quality information?
- For example, OS Research has done extensive work on a "vernacular gazetteer" of shapes for social names, but data quality concerns prohibit its release, even for research.
- The geodata world has its own domain-specific problems, and can benefit from looking at lighter-weight or differently conceived quality approaches from other domains.
- The aim should be to encourage and support the publication of more data of variable, knowably unknown quality.
- Quality currently looks like a niche issue, but new developments in data sharing over the internet will raise the priority of machine-reusable descriptions of data quality (distributed databases; multiple copies of the same resource, unsynchronised or variably edited; more collaborative mapping projects along the lines of OSM and OpenAddresses; lossy or transient datastores; linked data pollution).
Fit for 15/10?
See the briefing paper on the JISC geospatial strand for more context; it allows up to 9 months' duration between Feb and Dec 2011.
- The briefing emphasises infrastructure development and re-use of tools and services, both those directly supported by JISC and others popular on the web.
- We'd be looking at a mixture of service/tool re-use and structured interviews with academic geodata users regarding their concerns around quality.
Themes
Starting with Nothing
The traditional ISO data quality model assumes theoretically perfect data. Many measures and tests can only be run in comparison with a known higher-quality, more "authoritative" dataset. Perfect data does not look like a reasonable assumption.
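To make the distinction concrete, here is a minimal sketch contrasting a measure that needs a reference dataset with one that does not. Both function names, and the representation of a dataset as a dictionary from feature id to an (x, y) tuple, are assumptions made for illustration; neither is taken from the ISO standards.

```python
import math


def positional_rmse(candidate, reference):
    """Root mean square error of positions for features present in both datasets.

    Only meaningful if `reference` is trusted to be of higher quality.
    """
    shared = candidate.keys() & reference.keys()
    if not shared:
        raise ValueError("no features in common with the reference dataset")
    total = 0.0
    for key in shared:
        (x1, y1), (x2, y2) = candidate[key], reference[key]
        total += (x1 - x2) ** 2 + (y1 - y2) ** 2
    return math.sqrt(total / len(shared))


def duplicate_coordinate_rate(candidate):
    """Fraction of features whose coordinates exactly repeat an earlier feature.

    Needs no reference dataset: it only looks at the candidate's own contents.
    """
    coords = list(candidate.values())
    if not coords:
        return 0.0
    return (len(coords) - len(set(coords))) / len(coords)
```

The first function cannot be run at all when starting with nothing; the second can, though it says far less about fitness for use.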
Attestation
Peer review for data quality / social aspect to data sources.
Edit-time quality reporting
The JOSM Validator model looks at the logical consistency of edits to OSM before commit. Again there's a data production bias here: how many research users of OSM, for example, are active editors?
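As a rough illustration of what edit-time checking involves, the sketch below runs a few logical consistency rules over edited ways before they are committed. It is not the JOSM Validator's code or API; the function names, the dictionary layout for a way, and the particular rules are all hypothetical choices for this example.

```python
def validate_way(way, known_node_ids):
    """Return a list of warning strings for one edited way (hypothetical rules)."""
    warnings = []
    nodes = way.get("nodes", [])
    tags = way.get("tags", {})

    if len(nodes) < 2:
        warnings.append("way has fewer than two nodes")

    # Every node the way refers to must exist in the edit session.
    missing = [n for n in nodes if n not in known_node_ids]
    if missing:
        warnings.append(f"way references unknown nodes: {missing}")

    # The same node listed twice in a row is almost certainly an editing slip.
    if any(a == b for a, b in zip(nodes, nodes[1:])):
        warnings.append("way repeats a node consecutively")

    # A way tagged as an area should start and end on the same node.
    if tags.get("area") == "yes" and len(nodes) >= 2 and nodes[0] != nodes[-1]:
        warnings.append("area way is not closed")

    if not tags:
        warnings.append("way has no tags")

    return warnings


def can_commit(ways, known_node_ids):
    """Return (ok, report), where report maps way id to its list of warnings."""
    report = {w["id"]: validate_way(w, known_node_ids) for w in ways}
    return all(not msgs for msgs in report.values()), report
```

The report is generated for the editor at commit time and then discarded, which is the production bias noted above: a research user who downloads the data later never sees it unless the results are published alongside the data.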