Reading the INSPIRE Metadata Draft

Metadata about geographic data is at the heart of INSPIRE. The metadata draft is the first in the set of "implementing rules" and it will underpin all the other implementing rules. The consultation process is open until 2007-03-30. While the documents are open access, comments can only be offered through an SDIC or Spatial Data Interest Community.

The Free and Open Source Geospatial Community has a voice through one of these SDICs thanks to Markus Neteler. This page contains preparatory material for a collective response through the FOSS GIS SDIC, from the POV of people implementing and managing metadata creation, collection and search services, working closely with many different data user communities.


 * The response proper will live at Response to INSPIRE Metadata Draft. Initial notes are included below in the Issues section.
 * It is interesting to read this in parallel with the North American Metadata Profile draft, which is also currently in consultation. It's hoped the OSGeo community will also be able to contribute to a Response to NAP Metadata Draft and get the geodata commons project involved in this.

Reading the draft

 * [here]
 * Pages 1-17 are metadata about the document itself, intentions and history, and can be safely skipped. Pages 43-104 are the Annexes.
 * Annex A is particularly interesting as there are details of the thinking exposed in the mapping to ISO19115/39 that aren't set out in the implementing rules. If you want to know what's likely to affect you but are short on time, at minimum read section 5 and Annex A.

Lightning Summary of the draft
The draft establishes a basic information model for metadata which is close to, but not specific to, ISO19115 and OGC Web Services.

It only mandates what metadata is published by and for public authorities covered by INSPIRE - it does not try to cover repository management or internal processes.

It separates out metadata properties into those useful for 'discovery', 'evaluation', and 'use'. It identifies one very high level "use case" for spatial data search services built from metadata being shared at this level.

It sorts properties into two Levels, 1 and 2, according to whether they are useful for 'non-specialist' or 'expert' users. Level 1 is always mandatory. This *includes* classification according to the data themes in the INSPIRE annexes, and keywords from controlled vocabularies which are not covered by the IR document but are left to Spatial Data Theme Communities. (How these communities are found, selected, and make their decisions, is unknown to us at this time.)

= Issues =

'''This list is an overview of what jumped out at me as something to address. I don't know how much of this is appropriate to send back, or how much can be fixed. - User:JoWalsh'''

Conceptual overview
The model maps quite well to the minimum useful subset identified in DCLite4G. It looks like a lightweight core. But the model and the draft break down the problem space of metadata in a way that is a reaction to artificial scarcity of data. The draft identifies three phases of the metadata use cycle:
 * discovery (of what data is out there)
 * evaluation (of whether the data will be useful for a specific purpose)
 * use (once access gained, how to best use the data)

It is illuminating to compare this with the North American Profile metadata draft, which talks about:


 * discovery
 * access
 * fitness for use (i.e. evaluation)
 * transfer

So the IRs neither address how to make the data more useful via metadata, nor say clearly whether a minimal subset is going to provide enough information to evaluate utility. Generally the draft dances around data licensing and access issues, and glosses over the over-engineering needed to work around artificial constraints on availability. IRs for evaluation and use of data based on metadata are not covered by this draft at all, but left up to the Spatial Data Theme communities for each of the 35 data themes identified in Annexes I-III of the INSPIRE text.

Issues with specific metadata properties
The model maps quite well to DCLite4G. It looks like a lightweight core.

Things that aren't there that should be
5.2.8 Resource responsible party. Each dataset *must* have one or more people/organisations responsible for it. The IR says that this can be freetext or can be in more structured form. This only includes the responsible party's name, but NOT any form of contact details.

Some form of electronic or telephonic contact address should be mandatory, if the org/person's details are mandatory. Why publish ownership information - especially if there are constraints on access and reuse of the described data - if you can't immediately get in personal contact with someone who can make assurances about the data?

Annex A on mapping to ISO 19115 mandates that contact persons and organisations be free text, not resource identifiers. There are two serious problems with the ISO 19115 mapping:


 * It does not ask or provide for contact details.
 * It looks *mandatory* that the responsible party be given a role, which in turn is one of N codes published by the Library of Congress to describe people's roles within organisations.
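
To make the contact-details point concrete, here is a minimal Python sketch of a responsible party record that is flagged when it carries a name but no way of reaching anyone. The class name `ResponsibleParty` and the fields `role`, `email` and `phone` are hypothetical, chosen for illustration only; they are not taken from the draft or from ISO 19115.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ResponsibleParty:
    """Hypothetical record; field names are illustrative, not from the draft."""
    name: str
    role: Optional[str] = None    # role code, if the mapping requires one
    email: Optional[str] = None   # assumed electronic contact
    phone: Optional[str] = None   # assumed telephonic contact

    def problems(self) -> List[str]:
        """Return human-readable validation problems (empty list if none)."""
        found = []
        if not self.name.strip():
            found.append("name is empty")
        if not (self.email or self.phone):
            found.append("no electronic or telephonic contact given")
        return found
```

A record such as `ResponsibleParty(name="Example Mapping Agency")` would satisfy a name-only rule, yet `problems()` reports the missing contact, which is the check we would like the IR to make mandatory.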

No discussion of formalising dataset accuracy / completeness - crucial for cost-benefit evaluation / evaluation of suitability for combining with other data sets.

Things that are there that probably shouldn't be
Every 'dataset or dataset series' published under INSPIRE *must* include both a Resource topic category and a set of resource keywords.

Topic categories are very high-level classifications which correspond to each of the Spatial Data Themes identified in Annexes I, II and III of the INSPIRE Directive.


 * Which topic category data fits in will often be a property of an organisation not any published data sets.

From an implementor's POV this will involve something like selecting a topic category for data at install time of the metadata publishing engine, and forgetting about it. The IRs place a lot of faith in the ability of simple keyword / classification code matches to enhance utility of search and discovery services for users.

But this already raises the bar for non-expert users (the domain vocabulary is jargon-specific or oriented towards specialist codes).

The IRs emphasise the fact that keywords should originate from a controlled vocabulary. The responsibility for creating one is not in the hands of the Drafting Teams but in the hands of Spatial Data Theme Communities. How these are constituted and how their decisions become binding are unclear.

Again, faith in keywords for search utility is misplaced. Reliance on them may lead to false negatives. Again, this assumes that a non-expert user has familiarity with, or the time and ability to learn about, what to expect in the domain, while an expert will need a better level of detail. Pitfalls of 'controlled' keywording:
 * intentional misclassification
 * lazy/default misclassification

Both of these are at 'Level 1 for discovery metadata' which implies that any INSPIRE compliant metadata set MUST have both topic category and associated keywords.
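
The false-negative risk with keyword matching can be shown with a small Python sketch. The catalogue records, titles and keywords here are invented for illustration, and `search_by_keyword` is a hypothetical stand-in for whatever matching a real discovery service performs.

```python
# Hypothetical catalogue records; titles and keywords are invented.
RECORDS = [
    {"title": "River network", "keywords": {"hydrography"}},
    {"title": "Stream gauges", "keywords": {"inland waters"}},
]


def search_by_keyword(term, records):
    """Return titles of records whose keyword set contains the exact term."""
    wanted = term.strip().lower()
    return [r["title"] for r in records if wanted in r["keywords"]]
```

A non-expert searching for "rivers" gets nothing back although both records are relevant, and a search for "hydrography" misses the record classified under a neighbouring term.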

Conformity
This is an IR and obligatory to deal with. But 5.3.4 just says "see Annex F". Annex F in its entirety says:

The way in which conformity is expressed in the INSPIRE IR will be defined in a subsequent draft based on discussions with the Drafting Team on Data specifications and harmonization.

(Is this where accuracy/completeness comes in? How can we know?)

Dataset series / Aggregate data
The IR talks about dataset series. Some of the diagrams talk about 'MD_Aggregates', a term that isn't used elsewhere. There is no conception in this model of one UrDataSet with many different potential sources according to how they are packaged or processed. As the IRs mandate properties for dataset series, we really need more clarity and examples about what they actually are.

Search / discovery services
The preamble (p.7) states that "separate IRs for discovery services are being prepared and are not the subject of this document." But the INSPIRE use case is predicated on the availability of 'Geoportal' style search services. What else *are* discovery services if they are not the search services discussed here? If there is only going to be an abstract model for discovery, and these IRs are careful to avoid imposing any constraints on internal data repository management, how much more can a discovery services draft provide?

Lack of machine-reusable data in general
Dataset 'lineage' is only a full-text field. If datasets result from recombination, that should be machine-traversable. Human descriptions of lineage will be so different that they won't be useful for building search / evaluation services.
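
As a sketch of what machine-traversable lineage could look like, assume each metadata record lists the identifiers of the datasets it was derived from. The dataset identifiers and the `LINEAGE` mapping below are hypothetical; nothing like this is specified in the draft.

```python
from typing import Dict, List, Set

# Hypothetical catalogue: dataset identifier -> identifiers of its direct sources.
LINEAGE: Dict[str, List[str]] = {
    "roads-generalised": ["roads-survey"],
    "flood-risk": ["roads-generalised", "elevation-model"],
    "roads-survey": [],
    "elevation-model": [],
}


def all_sources(dataset_id: str, lineage: Dict[str, List[str]]) -> Set[str]:
    """Collect every direct and indirect source of a dataset."""
    seen: Set[str] = set()
    stack = list(lineage.get(dataset_id, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guards against cycles and repeated sources
        seen.add(current)
        stack.extend(lineage.get(current, []))
    return seen
```

With identifiers in place of free text, a search or evaluation service could answer "what is this derived from?" by walking the records, which prose descriptions of lineage do not allow.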

Lack of engagement with packaging and re-use issues
Cf. Dataset series / aggregates. The examples have 'MasterMap' as one potential dataset! Real world use cases are going to need subsets of such huge data sets broken down into packages with smaller spatial extents or with fewer layers.

Bypassing of feature-level metadata from consideration
Once we get down to the feature level the interesting European problems appear - the fact that every local area may have its own classification schemes, that even inside one language community the same word is used to describe different looking things, and that across language barriers mappings from words to things don't tend to be 1-1. But by disregarding feature-level metadata - partly because it can't be mandated when the underlying geospatial objects aren't publicly inspectable and a certain amount of feature level metadata would mean the data itself is essentially public...

Overspecificness about internet- and webservices- based distribution models
We are actually causing ourselves unnecessary problems by putting everything on the Internet. Data sharing agreements over publicly maintained private networks with flat-rate membership are a clear potential future and 'middle way' in this domain. The draft as it stands is all about making access/use constraints *specific to data sets* and not specific to the relationship between the data provider or broker, the data user and the transport network between them.

So we have a 'distributed computing platform' metadata property that is required by the IRs. In the ISO19115 mapping in Annex A this is a free text field, yet 5.2.15 states that the property "is necessary for a client to bind to the service". If it must be mandated, it should be as a URI. It would be wonderful to have examples of what other than HTTP or OGC web services is envisaged NOW as a means of access to the backend of a distributed computing platform.
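
To illustrate why a URI is more useful to a client than free text, here is a minimal Python check using the standard library's `urllib.parse`. The function name `looks_like_binding_uri` and the example values are assumptions made for illustration; the draft does not define any such test.

```python
from urllib.parse import urlparse


def looks_like_binding_uri(value: str) -> bool:
    """True if the value parses as an absolute URI with a scheme and a host."""
    parts = urlparse(value.strip())
    return bool(parts.scheme) and bool(parts.netloc)
```

A value like `http://example.org/wms?service=WMS` passes, while a free-text description such as "OGC Web Map Service over HTTP" does not, and only the former gives a client something it can actually bind to.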