Reading the INSPIRE Metadata Draft

From OSGeo
Revision as of 04:59, 14 March 2007 by Ianibbo (Talk | contribs) (Search / discovery services)

Metadata about geographic data is at the heart of INSPIRE. The metadata draft is the first in the set of "implementing rules" and it will underpin all the other implementing rules. The consultation process is open until 2007-03-30. While the documents are open access, comments can only be offered through an SDIC or Spatial Data Interest Community.

The Free and Open Source Geospatial Community has a voice through one of these SDICs thanks to Markus Neteler. This page contains preparatory material for a collective response through the FOSS GIS SDIC, from the POV of people implementing and managing metadata creation, collection and search services, working closely with many different data user communities.

  • The response proper will live at Response to INSPIRE Metadata Draft. Initial notes are included below in the Issues section.
  • It is interesting to read this in parallel with the North American Metadata Profile draft which is also currently in consultation. It's hoped the OSGeo community will also be able to contribute to a Response to NAP Metadata Draft and get the geodata commons project involved in this.

Reading the draft

  • the Implementing Rules for Metadata Draft (pdf)
  • supporting / background material
  • Pages 1-17 are metadata about the document itself, intentions and history, and can be safely skipped. Pages 43-104 are the Annexes.
  • Annex A is particularly interesting as there are details of the thinking exposed in the mapping to ISO19115/39 that aren't set out in the implementing rules. If you want to know what's likely to affect you but are short on time, at minimum read section 5 and Annex A.

Lightning Summary of the draft

The draft establishes a basic information model for metadata which is close to, but not specific to, ISO19115 and OGC Web Services.

It only mandates what metadata is published by and for public authorities covered by INSPIRE - it does not try to cover repository management or internal processes.

It separates out metadata properties into those useful for 'discovery', 'evaluation', and 'use'. It identifies one very high level "use case" for spatial data search services built from metadata being shared at this level.

It differentiates between properties useful for 'non-specialist' and 'expert' users into 2 Levels, 1 and 2. Level 1 is always mandatory. This *includes* classification according to the data themes in the INSPIRE annexes, and keywords from controlled vocabularies which are not covered by the IR document but are left to Spatial Data Theme Communities. (How these communities are found, selected, and make their decisions, is unknown to us at this time.)

Issues

This list is an overview of what jumped out at me as something to address. I don't know how much of this is appropriate to send back, or how much can be fixed. - User:JoWalsh

I've added some random musings too. Again, I have no idea whether they are of use or even valid questions, but at least they are there and can be edited out, or used as the basis of discussions. - User:IanIbbo

Conceptual overview

The model maps quite well to the minimum useful subset identified in DCLite4G. It looks like a lightweight core. But, the model and the draft break down the problem space of metadata in a way that is a reaction to artificial scarcity of data. It identifies three phases of the metadata use cycle:

  • discovery (of what data is out there)
  • evaluation (of whether the data will be useful for specific purpose)
  • use (once access gained, how to best use the data)

It is illuminating to compare this with the North American Profile metadata draft, which talks about

  • discovery
  • access
  • fitness for use (i.e. evaluation)
  • transfer

So the IRs neither address how to make the data more useful via metadata, nor make clear whether a minimal subset will provide enough information to evaluate utility. Generally the draft dances around data licensing and access issues, and glosses over the over-engineering needed to work around artificial constraints on availability. IRs for evaluation and use of data based on metadata are not covered by this draft at all, but left up to the Spatial Data Theme communities for each of the 35 data themes identified in Annexes I-III of the INSPIRE text.

Issues with specific metadata properties

The model maps quite well to DCLite4G. It looks like a lightweight core.

Things that aren't there that should be

5.2.8 Resource responsible party. Each dataset *must* have one or more people/organisations responsible for it. The IR says that this can be freetext or can be in more structured form. This only includes the responsible party's name, but NOT any form of contact details.

Some form of electronic or telephonic contact address should be mandatory, if the org/person's details are mandatory. Why publish ownership information - especially if there are constraints on access and reuse of the described data - if you can't immediately get in personal contact with someone who can make assurances about the data?
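The proposal above can be sketched as a small validation rule. This is only an illustration of the suggested constraint, not anything taken from the draft: the `ResponsibleParty` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ResponsibleParty:
    # Hypothetical field names; the draft only requires a name.
    name: str
    email: Optional[str] = None
    telephone: Optional[str] = None


def contact_problems(party: ResponsibleParty) -> List[str]:
    """List the ways a record fails the proposed 'name plus contact' rule."""
    problems = []
    if not party.name.strip():
        problems.append("name is required")
    # The proposed addition: at least one electronic or telephonic contact.
    if not (party.email or party.telephone):
        problems.append("at least one of email or telephone is required")
    return problems
```

A record with only a name would pass under the draft as written, but would be flagged by this check.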

Annex A on mapping to ISO19115 mandates that contact persons and organisations be free text, not resource identifiers. There are two serious problems with the ISO 19115 mapping:

  • It does not ask or provide for contact details.
  • It looks *mandatory* that the responsible party be given a role, which in turn is one of N codes published by the Library of Congress to describe people's roles within organisations.

No discussion of formalising dataset accuracy / completeness - crucial for cost-benefit evaluation / evaluation of suitability for combining with other data sets.

Things that are there that probably shouldn't be

Every 'dataset or dataset series' published under INSPIRE *must* include both a Resource topic category and a set of resource keywords.

Topic categories are very high-level classifications which correspond to each of the Spatial Data Themes identified in Annexes I, II and III of the INSPIRE Directive.

  • Which topic category data fits in will often be a property of an organisation not any published data sets.

From an implementor's POV this will involve something like selecting a topic category for data at install time of metadata publishing engine, and forgetting about it. The IRs place a lot of faith in the ability of simple keyword / classification code matches to enhance utility of search and discovery services for users.

But this already raises the bar for non-expert users (the domain vocabulary is jargon-specific or oriented towards specialist codes).

The IRs emphasise the fact that keywords should originate from a controlled vocabulary. The responsibility for creating one is not in the hands of the Drafting Teams but in the hands of Spatial Data Theme Communities. How these are constituted and how their decisions become binding are unclear.

Again, faith in keywords for search utility is misplaced. Reliance on them may lead to false negatives. It also assumes that a non-expert user has familiarity with, or the time and ability to learn about, what to expect in the domain, while an expert will need a better level of detail. Pitfalls of 'controlled' keywording:

  • intentional misclassification
  • lazy/default misclassification

Both of these are at 'Level 1 for discovery metadata' which implies that any INSPIRE compliant metadata set MUST have both topic category and associated keywords.


Areas which are unclear

Conformity

This is an IR and obligatory to deal with. But 5.3.4 just says "see Annex F". Annex F in its entirety says:

The way in which conformity is expressed in the INSPIRE IR will be defined in a subsequent draft based on discussions with the Drafting Team on Data specifications and harmonization.

(Is this where accuracy/completeness comes in? How can we know?)

Dataset series / Aggregate data

The IR talks about dataset series. Some of the diagrams talk about 'MD_Aggregates'; this term isn't used elsewhere. There is no conception in this model of one UrDataSet with many different potential sources according to how they are packaged or processed. As the IRs mandate properties for dataset series, we really need more clarity / examples about what they actually are.

General concerns

Search / discovery services

The preamble (p.7) states that "separate IRs for discovery services are being prepared and are not the subject of this document." But the INSPIRE use case is predicated on the availability of 'Geoportal' style search services. What else *are* discovery services if they are not the search services treated here? If there is only going to be an abstract model for discovery, and these IRs are careful to avoid imposing any constraints on internal data repository management, how much more can a discovery services draft provide?

II: I think this observation is spot on, but perhaps for different reasons. I'm finding it difficult to express concrete concerns, but Section 5.2 "Discovery metadata elements" starts to set out a list of concepts seen to be core to the discovery process (the document hints at this, but does not directly say it). Section 5.3 then sets out the "Abstract discovery metadata element set". I *guess* the implication is that the concepts laid out in 5.2 are in some way even more abstract than those set out in 5.3. The document really isn't clear about what the abstract model is, or what it is for, before it starts enumerating the concepts.

Your later comment about being tied to web services is spot on here too. I'm really not sure "Service type version", "Operation name" and "Distributed computing platform" belong in an abstract discovery model (they probably *do* belong in some result record schema). These three attributes seem to belong specifically to a particular (and I would guess already existing) service binding, or as already said, to a very specific kind of returned result record. What I'd really like to see is a much clearer statement of what the purpose of the abstract discovery model is. Hopefully, once that is tightly defined, it should become easier to decide what lies inside the boundary of the abstract model, and what belongs in the domain of specific realisations of the abstract model.

(Actually, I should say that I'm biased by the information retrieval community generally, in that it's considered really important to have a separate abstract model for discovery (the search access points) and then bind that model on to as many backend schemas as needed. This decoupling is seen as best practice in the information retrieval domain, and it is the approach taken in the Z3950 GEO profile, http://www.blueangeltech.com/standards/GeoProfile/geo22.htm. Most of my concerns here arise from the apparent 1:1 mapping between the abstract model and the implementation.)

I'm a bit confused by the "Temporal Reference" element. 5.2.2 talks about what I would expect to see from a temporal reference, but 5.3.2 maps temporal reference on to "One of the dates of publication, last revision or creation of the resource". These three elements are already well defined by Dublin Core attributes. Maybe I've misunderstood what's implied by table 1 in 5.3.2. Also, similar issues to the spatial access point arise (with structured data, as opposed to text queries). In some UK datasets, periods such as "Neolithic" can be used instead of an ISO 19108 Date Time. (I see that note 11 under 5.3.4 talks about this, which is good. What's important is that, regardless of the outcome of the study, the IRs are extensible enough to cope with the eventual decision.) I'd consider separate access points for controlled vocabulary time periods and structured temporal data.

Geographic Extent: the doc seems a bit bounding box heavy. It would be nice to understand (have examples of) the specification of interior/exterior polygons. Servers only supporting minimal bounding boxes can gracefully degrade (since it's easy to calculate an MBR from a polygon) whilst allowing other servers to retain the full richness of polygons. It's not clear where the semantics for parsing these strings will be defined. For example, should geographic extent be encoded as OpenGIS strings? (That seems to make sense to me, but I'm biased by Oracle and MySQL's spatial functions.) This might seem a bit extreme for the abstract part of the document, but it's one of those make-or-break issues for interoperability, and might be worth the pain. Also, I think it's worth entertaining the idea that spatial specifications such as MBRs and polygons (structured spatial constructs) might be better exposed using their own abstract access point, with "Place Name" having its own access point. This will help server implementors avoid problems with disambiguation of search terms.
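To make the "graceful degradation" point concrete, here is a minimal sketch of deriving a minimum bounding rectangle from a polygon's exterior ring and writing both as WKT-style strings. The function names are hypothetical, the ring is assumed to be a closed list of (x, y) pairs, and the sketch ignores awkward cases such as extents crossing the antimeridian.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def minimum_bounding_rectangle(ring: List[Point]) -> Tuple[float, float, float, float]:
    """Return (min_x, min_y, max_x, max_y) for a polygon's exterior ring."""
    xs, ys = zip(*ring)
    return (min(xs), min(ys), max(xs), max(ys))


def mbr_to_ring(mbr: Tuple[float, float, float, float]) -> List[Point]:
    """Express the rectangle as a closed five-point ring."""
    min_x, min_y, max_x, max_y = mbr
    return [(min_x, min_y), (max_x, min_y), (max_x, max_y),
            (min_x, max_y), (min_x, min_y)]


def ring_to_wkt(ring: List[Point]) -> str:
    """Serialise a single exterior ring as a WKT POLYGON string."""
    coords = ", ".join(f"{x} {y}" for x, y in ring)
    return f"POLYGON(({coords}))"


if __name__ == "__main__":
    triangle = [(0, 0), (4, 1), (2, 5), (0, 0)]
    print(ring_to_wkt(triangle))
    print(ring_to_wkt(mbr_to_ring(minimum_bounding_rectangle(triangle))))
```

A server that only indexes rectangles could store the second string, while a richer server keeps the first; both can be produced from the same published record.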

I'm interested in what the expected semantics of resource language are on retrieval of language-neutral data sets. Should a result record not be selected because the user specified "Nor" as the search language, even when it matches other criteria (Geo Extent for example)? Normally in Info Retrieval this is a no-brainer, of course it shouldn't be selected, but I'm a bit less certain when we talk about result records that aren't primarily "Text" based. (Actually, this is a slightly wider concern about Annex A and those "CharacterString" elements. In IEEE LOM, for example, we have a "LangString" element that has a "Lang" attribute. That community chose to allow language variants of a resource to be expressed within one record by allowing an element to hold all language variants, for example:

 <title>
   <langstring lang="En">Hello</langstring>
   <langstring lang="Dk">Hej</langstring>
 </title>

The presence of a "Lang" attribute at the "Dataset" level might mean the intention is to support multi-language datasets by having several dataset records, one for each language, which is OK, but possibly not optimal for datasets that aren't primarily language based. If this is the case, is the "CharacterString" element in Annex A just redundant payload?)
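As a rough sketch of how a client might consume the LangString-style record shown above, the following treats the requested language as a display preference rather than a filter, so a language-neutral dataset is still returned. That choice is an assumption made for illustration, not something the draft specifies, and `pick_title` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from typing import Optional

RECORD = """<title>
  <langstring lang="En">Hello</langstring>
  <langstring lang="Dk">Hej</langstring>
</title>"""


def pick_title(title_xml: str, preferred: str) -> Optional[str]:
    """Return the title in the preferred language, else the first variant."""
    root = ET.fromstring(title_xml)
    # Map lower-cased language code -> text, keeping document order.
    variants = {
        el.get("lang", "").lower(): (el.text or "")
        for el in root.findall("langstring")
    }
    if preferred.lower() in variants:
        return variants[preferred.lower()]
    # No match: fall back to the first variant instead of dropping the record.
    return next(iter(variants.values()), None)


if __name__ == "__main__":
    print(pick_title(RECORD, "dk"))
    print(pick_title(RECORD, "nor"))
```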

Lack of machine-reusable data in general

Dataset 'lineage' is only a full-text field. If datasets result from recombination, that should be machine-traversable. Human descriptions of lineage will be so different that they won't be useful for building search / evaluation services.

II: It does tend to talk about "Lineage statement"... would making it (More along the conceptual lines of)

 <lineage>
   <dc:description>Text</dc:description>
 </lineage>

give you the extensibility to either use private extensions, or to specify recombination elements at a later date? (I didn't think this through in terms of the *actual* recombination operations, I just wanted to show how we might make lineage extensible without specifying it.)

 <lineage>
   <dc:description>This dataset is a recombination of X and Y</dc:description>
   <Jo:recombination>
     <Jo:source>X-URI</Jo:source>
     <Jo:source>Y-URI</Jo:source>
     <Jo:Rules>Overlap</Jo:Rules>
   </Jo:recombination>
 </lineage>

Should the lineage search point be called "LineageDescription"? (I think that's what I'll do in my SRW profile.)
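To show why a structured extension would make lineage machine-traversable, here is a small sketch that reads source identifiers out of an extended lineage element while still exposing the free-text description. The `Jo` namespace URI is a made-up placeholder, and the element names simply follow the illustrative snippet above; none of this is defined by the draft.

```python
import xml.etree.ElementTree as ET
from typing import List

NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "Jo": "http://example.org/jo#",  # hypothetical private extension namespace
}

LINEAGE = """<lineage xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:Jo="http://example.org/jo#">
  <dc:description>This dataset is a recombination of X and Y</dc:description>
  <Jo:recombination>
    <Jo:source>X-URI</Jo:source>
    <Jo:source>Y-URI</Jo:source>
    <Jo:Rules>Overlap</Jo:Rules>
  </Jo:recombination>
</lineage>"""


def lineage_description(xml_text: str) -> str:
    """Return the human-readable lineage statement, or '' if absent."""
    root = ET.fromstring(xml_text)
    return root.findtext("dc:description", default="", namespaces=NS)


def lineage_sources(xml_text: str) -> List[str]:
    """Return the source identifiers, or [] for text-only lineage."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.findall("./Jo:recombination/Jo:source", NS)]


if __name__ == "__main__":
    print(lineage_description(LINEAGE))
    print(lineage_sources(LINEAGE))
```

A client that does not understand the extension still gets the description; one that does can follow the sources to other records.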

Lack of engagement with packaging and re-use issues

Cf. Dataset series / aggregates. The examples have 'MasterMap' as one potential dataset! Real world use cases are going to need subsets of such huge data sets broken down into packages with smaller spatial extents or with fewer layers.

II: Indeed, as well as the srw/sru binding experiment, I've been wondering about the OAI binding, which I know you've already discussed elsewhere. What might be generally useful (And maybe this already exists) is a set of TREC style test data. Setting up a static gateway OAI server wouldn't be so hard, and might give us some valuable real-world information about this problem.

Bypassing of feature-level metadata from consideration

Once we get down to the feature level the interesting European problems appear - the fact that every local area may have its own classification schemes, that even inside one language community the same word is used to describe different looking things, and that across language barriers mappings from words to things don't tend to be 1-1. But by disregarding feature-level metadata - partly because it can't be mandated when the underlying geospatial objects aren't publicly inspectable and a certain amount of feature level metadata would mean the data itself is essentially public...

II: Aye, generally for discovery services it's nice to try and avoid mandating that users understand predefined controlled vocabularies, whilst allowing users who do know terms to qualify their discovery process, for example, in CQL I'd be tempted to allow a user to say "dc.subject=Something" or (The equivalent of) "authority=19115:2003 and dc.subject=Something" for users who know a specific term.
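The two query styles mentioned above can be sketched as a small query builder. The `authority` index name is taken from the comment above and is not a registered CQL index as far as this sketch knows; the SRU base URL in the usage is a placeholder.

```python
from typing import Optional
from urllib.parse import urlencode


def _quote(value: str) -> str:
    """Wrap a CQL term in double quotes, escaping backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def cql_subject_query(term: str, authority: Optional[str] = None) -> str:
    """Build a subject query, optionally qualified by a vocabulary authority."""
    clause = f"dc.subject={_quote(term)}"
    if authority:
        # 'authority' is an assumed index name, following the comment above.
        return f"authority={_quote(authority)} and {clause}"
    return clause


def sru_url(base: str, query: str) -> str:
    """Wrap a CQL query in an SRU 1.1 searchRetrieve request URL."""
    params = {"operation": "searchRetrieve", "version": "1.1", "query": query}
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    print(cql_subject_query("Something"))
    print(cql_subject_query("Something", authority="19115:2003"))
    print(sru_url("http://example.org/sru", cql_subject_query("Something")))
```

The unqualified form serves users who do not know any vocabulary; the qualified form lets users who do know a term narrow the match.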

There's quite a lot of work going on around Europe at the moment covering crosswalks of controlled vocabularies (mostly I know about crosswalking European educational levels, but it seems to be the same problem cast in a different way). If we can arrange for someone to do the intellectual work of cross-mapping, and make the data publicly available, then it becomes a "turning-the-handle" job for providers to support cross-vocab retrieval. Standards such as ZThes are being used quite a lot in the learning domain to transport this data around. The only effect on reviewing the IR is that it's important that the IR does not preclude this at a future date. (The whole design-for-unforeseen-use thing. Specifically, I think mandating a specific vocab in the IR might not be the right thing to do, and giving users a way to say which vocab they are using in the description and discovery process is a better way to go.)
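Once a crosswalk has been published, the "turning-the-handle" step might look like the sketch below: a query term tagged with its vocabulary is expanded to its mapped equivalents before searching. The vocabulary identifiers and terms are invented placeholders, and a real crosswalk would more likely be loaded from a published file than hard-coded.

```python
from typing import Dict, List, Set, Tuple

Term = Tuple[str, str]  # (vocabulary id, term)

# Invented example mappings; real data would come from a published crosswalk.
CROSSWALK: Dict[Term, Set[Term]] = {
    ("vocab-a", "river"): {("vocab-b", "watercourse")},
    ("vocab-b", "watercourse"): {("vocab-a", "river")},
}


def expand_term(vocab: str, term: str,
                crosswalk: Dict[Term, Set[Term]] = CROSSWALK) -> List[Term]:
    """Return the original term followed by its direct equivalents."""
    start = (vocab, term)
    equivalents = crosswalk.get(start, set())
    return [start] + sorted(equivalents - {start})


if __name__ == "__main__":
    print(expand_term("vocab-a", "river"))
    print(expand_term("vocab-a", "unknown"))
```

This only works if each keyword in a record says which vocabulary it comes from, which is the reason for asking that the IR let users state the vocabulary rather than mandating one.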

Overspecificness about internet- and webservices- based distribution models

We are actually causing ourselves unnecessary problems by putting everything on the Internet. Data sharing agreements over publicly maintained private networks with flat-rate membership are a clear potential future and 'middle way' in this domain. The draft now is all about making access/use constraints *specific to data sets* and not specific to the relationship between the data provider or broker, the data user and the transport network between them.

So we have a 'distributed computing platform' metadata property that is required by the IRs. In the ISO19115 mapping in Annex A this is a free text field, yet 5.2.15 states that the property "is necessary for a client to bind to the service". If it must be mandated, it should be as a URI. It would be wonderful to have examples of what other than HTTP or OGC web services is envisaged NOW as a means of access to the backend of a distributed computing platform.
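To illustrate the difference between a free text value and a URI for this property, here is a minimal syntactic check. It is a rough sketch only: `looks_like_uri` is a hypothetical helper, it tests shape rather than whether anything resolves, and the example values are invented.

```python
from urllib.parse import urlparse


def looks_like_uri(value: str) -> bool:
    """Rough check that a value has a scheme and something after it."""
    # Free text such as "OGC Web Services" contains spaces; a URI cannot.
    if not value or any(ch.isspace() for ch in value):
        return False
    parsed = urlparse(value)
    return bool(parsed.scheme) and bool(parsed.netloc or parsed.path)


if __name__ == "__main__":
    for candidate in ("HTTP", "OGC Web Services",
                      "http://example.org/wms", "urn:example:platform:soap"):
        print(candidate, looks_like_uri(candidate))
```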