Reading the INSPIRE Metadata Draft

From OSGeo

Metadata about geographic data is at the heart of INSPIRE. The metadata draft is the first in the set of "implementing rules" and it will underpin all the other implementing rules. The consultation process is open until 2007-03-30. While the documents are open access, comments can only be offered through an SDIC (Spatial Data Interest Community).

The Free and Open Source Geospatial Community has a voice through one of these SDICs thanks to Markus Neteler. This page contains preparatory material for a collective response through the FOSS GIS SDIC, from the POV of people implementing and managing metadata creation, collection and search services, working closely with many different data user communities.

  • The response proper will live at Response to INSPIRE Metadata Draft. Initial notes are included below in the Issues section.
  • It is interesting to read this in parallel with the North American Metadata Profile draft which is also currently in consultation. It's hoped the OSGeo community will also be able to contribute to a Response to NAP Metadata Draft and get the geodata commons project involved in this.

Reading the draft

  • the Implementing Rules for Metadata Draft (pdf)
  • supporting / background material
  • Pages 1-17 are metadata about the document itself, intentions and history, and can be safely skipped. Pages 43-104 are the Annexes.
  • Annex A is particularly interesting, as the mapping to ISO 19115/19139 exposes details of the thinking that aren't set out in the implementing rules. If you want to know what's likely to affect you but are short on time, at minimum read section 5 and Annex A.

Lightning Summary of the draft

The draft establishes a basic information model for metadata which is close to, but not specific to, ISO19115 and OGC Web Services.

It only mandates what metadata is published by and for public authorities covered by INSPIRE - it does not try to cover repository management or internal processes.

It separates out metadata properties into those useful for 'discovery', 'evaluation', and 'use'. It identifies one very high level "use case" for spatial data search services built from metadata being shared at this level.

It divides properties between those useful for 'non-specialist' and those useful for 'expert' users, in two levels, 1 and 2. Level 1 is always mandatory. This *includes* classification according to the data themes in the INSPIRE annexes, and keywords from controlled vocabularies which are not covered by the IR document but are left to Spatial Data Theme Communities. (How these communities are found and selected, and how they make their decisions, is unknown to us at this time.)

Issues

This list is an overview of what jumped out at me as something to address. I don't know how much of this is appropriate to send back, or how much can be fixed. - User:JoWalsh

I've added some random musings also; again, no idea if they are of use or even valid questions, but at least they are there and can be edited out, or used as the basis of discussions. - User:IanIbbo

Conceptual overview

The model maps quite well to the minimum useful subset identified in DCLite4G. It looks like a lightweight core. But, the model and the draft break down the problem space of metadata in a way that is a reaction to artificial scarcity of data. It identifies three phases of the metadata use cycle:

  • discovery (of what data is out there)
  • evaluation (of whether the data will be useful for specific purpose)
  • use (once access gained, how to best use the data)

It is illuminating to compare this with the North American Profile metadata draft, which talks about

  • discovery
  • access
  • fitness for use (i.e. evaluation)
  • transfer

So the IRs do not address how to make the data more useful via metadata, and they are vague about whether a minimal subset will provide enough information to evaluate utility. Generally the draft dances around data licensing and access issues, and glosses over the over-engineering needed to work around artificial constraints on availability. IRs for evaluation and use of data based on metadata are not covered by this draft at all, but are left to the Spatial Data Theme Communities for each of the 35 data themes identified in Annexes I-III of the INSPIRE text.

Issues with specific metadata properties

The model maps quite well to DCLite4G. It looks like a lightweight core.

Things that aren't there that should be

5.2.8 Resource responsible party. Each dataset *must* have one or more people/organisations responsible for it. The IR says that this can be freetext or can be in more structured form. This only includes the responsible party's name, but NOT any form of contact details.

Some form of electronic or telephonic contact address should be mandatory, if the org/person's details are mandatory. Why publish ownership information - especially if there are constraints on access and reuse of the described data - if you can't immediately get in personal contact with someone who can make assurances about the data?

I agree, but not only for data, but for metadata as well: Page 53, sec A.2.8.1 and A.2.8.2, and A.4.1: other identifying information is necessary, such as e-mail, URL, voice number.

Annex A on mapping to ISO 19115 mandates that contact persons and organisations be free text, not resource identifiers. There are 2 serious problems with the ISO 19115 mapping:

  • It does not ask or provide for contact details.
  • It looks *mandatory* that the responsible party be given a role, which in turn is one of N codes published by the Library of Congress to describe people's roles within organisations.

No discussion of formalising dataset accuracy / completeness - crucial for cost-benefit evaluation / evaluation of suitability for combining with other data sets.

I agree: Page 31. sec 5.2.12: The lineage element is not enough for data evaluation. The same problem is on Page 55. sec A.3.2.

  • Page 7. par. 6: "Separate IRs for discovery services are being prepared and are not the subject of this document". But I have found a lot of information about discovery (search) services in this document, and I think that information about discovery services should be a part of it. There is a conflict: we cannot be sure how to evaluate this document when we do not know what the IR for services will look like.
  • Page 17. sec. 4.1. Note 6, par 4 and 5: This part of service evaluation must be solved for the infrastructure, and INSPIRE must provide an IR for it. Evaluation of a service must be based on the purpose of its use. I believe that a case-study support system could help with this. From this point of view there is a gap in the metadata IR.
  • Page 22. sec 4.6., Par 6: "The content of these repositories need to be accessible. This should happen preferably via a standard interface and/or standard encoding format such as XML". It must be specified how. Without exact information on how to connect, this is not useful for implementation.
  • Page 30. sec 5.2.7.: There is no list of types. INSPIRE should name an authority that can define types of services. For example, the examples use the identifier WMS, but we use the identifier OGC:WMS in our catalogue. This can lead to inconsistency.
  • Page 30. sec 5.2.3.: Vertical extent is missing; it is necessary for geology, meteorology and climatology studies.
  • Page 57. sec A.3.6: A code list is good, but I cannot find it in the document.
  • Page 21. sec 4.6. Par 1 (numbering 1): "Search engine connected to a set of metadata repositories". How will the repositories be connected and selected: in real time, by replication, by harvesting? There is no information about this here or later in the text, but it should be defined.
  • Page 8: There is no information about cooperation with FGDC, FAO or other bodies that should later be connected to the INSPIRE infrastructure. Does this mean that the EU wants to stay out of the USA and UN infrastructures?
  • Page 64. sec B.8: Resource provider is not defined. Why not dc:publisher?
  • Page 77. Annex G: There is no information about usage of a resource, which is necessary to evaluate the quality of a resource.
  • Page 101. Annex I: There are no guidelines, just some unclear information.

Things that are there that probably shouldn't be

Every 'dataset or dataset series' published under INSPIRE *must* include both a Resource topic category and a set of resource keywords.

Topic categories are very high-level classifications which correspond to each of the Spatial Data Themes identified in Annexes I, II and III of the INSPIRE Directive.

  • Which topic category data fits in will often be a property of an organisation not any published data sets.

From an implementor's POV this will involve something like selecting a topic category for data at install time of metadata publishing engine, and forgetting about it. The IRs place a lot of faith in the ability of simple keyword / classification code matches to enhance utility of search and discovery services for users.

But this already raises the bar for non-expert users (the domain vocabulary is jargon-specific or oriented towards specialist codes).

The IRs emphasise the fact that keywords should originate from a controlled vocabulary. The responsibility for creating one is not in the hands of the Drafting Teams but in the hands of Spatial Data Theme Communities. How these are constituted and how their decisions become binding are unclear.

Again, faith in keywords for search utility is misplaced. Reliance on them may lead to false negatives. Again, this assumes that a non-expert user has familiarity with, or the time and ability to learn about, what to expect in the domain, while an expert will need a better level of detail. Pitfalls of 'controlled' keywording:

  • intentional misclassification
  • lazy/default misclassification

Both of these are at 'Level 1 for discovery metadata' which implies that any INSPIRE compliant metadata set MUST have both topic category and associated keywords.

I partly agree: it is true that a thesaurus for keywords should be defined for INSPIRE. INSPIRE is mainly oriented to environmental protection. There is the GEMET thesaurus, but there is no common agreement to use it. This part of metadata is not so simple, and I understand that defining an IR for it is very difficult.

Other issues:

  • Page 60. sec B.2.1: Temporal extent is always meaningful for discovery and searching.
  • Page 85. sec H.5.2: Too difficult to implement and not very useful; there are other ways to support multilingual descriptions.
  • Page 38. sec 6.2: Identifiers are of two types. This is problematic. It is usually better to have only one type of identifier. The preferred one should be a URL (URI). The document says that a URL can in some cases not be unique. If you want to make identification more unique, then mix the URL with a UUID, like this: http://gis.vsb.cz/01f8da38-10d7-11da-b569-000f1f1a7b03
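
The URL-plus-UUID idea in the last point can be sketched in a few lines of Python. This is only an illustration: the function name make_resource_uri is made up here, and a random (version 4) UUID is an assumption, since the comment does not say which kind of UUID to use.

```python
import uuid


def make_resource_uri(base_url: str) -> str:
    """Append a freshly generated UUID to base_url as its last path segment.

    Assumption: a random (version 4) UUID is acceptable as the unique part.
    """
    # Normalise the base so exactly one slash separates it from the UUID.
    return base_url.rstrip("/") + "/" + str(uuid.uuid4())


if __name__ == "__main__":
    # Prints something of the form http://gis.vsb.cz/<uuid>
    print(make_resource_uri("http://gis.vsb.cz"))
```

The catalogue host keeps the identifier resolvable as a URL, while the UUID part keeps it unique even if the same path would otherwise be reused.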

Areas which are unclear

Conformity

This is an IR and obligatory to deal with. But 5.3.4 just says "see Annex F". Annex F in its entirety says:

The way in which conformity is expressed in the INSPIRE IR will be defined in a subsequent draft based on discussions with the Drafting Team on Data specifications and harmonization.

(Is this where accuracy/completeness comes in? How can we know?)

Dataset series / Aggregate data

The IR talks about dataset series. Some of the diagrams talk about 'MD_Aggregates'; this term isn't used elsewhere. There is no conception in this model of one UrDataSet with many different potential sources according to how they are packaged or processed. As the IRs mandate properties for dataset series, we really need more clarity and examples about what they actually are.

Other

  • Page 17. sec. 4.1. Note 6, par. 2: "Services, including web services, are routinely measured in terms of availability and performance. These parameters are easily quantified and users can easily agree on their value: services that are available more often (seeking the elusive 99.9% "up-time") are more desirable than services that are available less often, and services that provide faster response time are more desirable than similar services that are slower to respond.". This is not exact. When a user wants to use a service, he must know more than the availability and speed of the service: for example cost, quality of the geodata used, the algorithms used and their quality, possibility of chaining, quality of self-description of the process, development, update, security, type of call (synchronous vs asynchronous). Perhaps everything that you have to know about software, because a service is software, and a user will use it not once but many times, and perhaps for a long time.
  • Page 30. sec 5.2.4.: Does this mean that a resource for INSPIRE cannot be encoded in another encoding such as Windows-1250? This is a problem that we have found in ISO 19115 and we still do not know how to handle it. We have a lot of geodata encoded in Windows-1250 and nobody is going to change the encoding, which means that we cannot describe them using ISO 19115 (or we do not know how, except by extending the list).
  • Page 33. sec 5.3.2: The resource responsible party must be searchable. The abstract should be searchable.
  • Page 39. sec 6.3: The formulations are not clear. They need to be specified better, with examples and schemas.
  • Page 58. A.3.8.2: Not consistent with A.3.8.1. Why is one conditional and the other mandatory?
  • Page 64. sec B.6: The keyword element must be mandatory for services as well.
  • Page 72. sec D4.2: I think that ISO 19139 must always be adopted, not only when there is no other XML schema. I could prepare my own schema compatible with ISO 19115 but not with ISO 19139, but this would not be useful for INSPIRE.
  • Page 19. sec. 4.3. Par 4 (numbering 3): This is not right. While ISO 19139 is not yet final, we cannot talk about maturity. Some of the services have adopted ISO 19115, but there is no conformity in adoption and many adoptions are probably wrong.
  • Page 27. sec 5.1. Par 2 (bullet 1): "the user query expressed through the search interface of the search engine and provided in a form compatible with the metadata repository interface". This is unclear; what does it mean? Can every repository have a different interface? That would be horrible, given that CSW 2.0 is now specified.
  • Page 28. sec 5.1. Par 9: "The discovery metadata elements are defined at an abstract level in order to make the Implementing Rules independent of ...". This is often wrong if no encoding is specified. In that case the IR for metadata can be useless.
  • Page 33. sec 5.3.3: The Operation name should be moved to Discovery level 1 and must be searchable; this is necessary for searching services.
  • Page 44. sec A1.2.1 Note 3: This kind of condition is problematic. Who will decide that a particular statement or quality report is required? I assure you, nobody will do more than is specified in the IR for metadata.
  • Page 49. sec A.2.3.1: EX_GeographicBoundingBox must always be defined for services too, not conditionally. When the service is not based on a GeographicBoundingBox, it should be defined as the whole world.
  • Page 45. sec A.1.2.2: “identifier: MD_identifier. condition: if the identifier is available”. An identifier will probably always be available, and if not, it should be generated.
  • Page 58. A.4: Why is metadata on metadata in level 2, and why is it not searchable?
  • Page 59. sec A.4.3: Free text. I do not understand why, when there is a list. Does the list not include all languages?
  • Page 62. sec B.3: I do not understand the extent definition; the description looks very strange.

General concerns

Search / discovery services

The preamble (p.7) states that "separate IRs for discovery services are being prepared and are not the subject of this document." But the INSPIRE use case is predicated on the availability of 'Geoportal' style search services. What else *are* discovery services if they are not the search services treated here? If there is only going to be an abstract model for discovery, and these IRs are careful to avoid imposing any constraints on internal data repository management, how much more can a discovery services draft provide?

II: I think this observation is spot on, but perhaps for different reasons. I'm finding it difficult to express concrete concerns, but Section 5.2 "Discovery metadata elements" starts to set out a list of concepts seen to be core to the discovery process (the document hints at this, but does not directly say it). Section 5.3 then sets out the "Abstract discovery metadata element set". I *guess* the implication is that the concepts laid out in 5.2 are in some way even more abstract than those set out in 5.3. The document really isn't clear about what the abstract model is, or what it is for, before it starts enumerating the concepts.

Your later comment about being tied to web services is spot on here also. I'm really not sure "Service type version", "Operation name" and "Distributed computing platform" belong in an abstract discovery model (they probably *do* belong in some result record schema). These three attributes seem to belong specifically to a particular (and I would guess already existing) service binding, or, as already said, to a very specific kind of returned result record. What I'd really like to see is a much clearer statement of what the purpose of the abstract discovery model is. Hopefully, once that is tightly defined, it should become easier to decide what lies inside the boundary of the abstract model, and what belongs in the domain of specific realisations of the abstract model.

(Actually, I should say that I'm biased by the information retrieval community generally, in that it's considered really important to have a separate abstract model for discovery (the search access points) and then bind that model on to as many backend schemas as needed. This decoupling is seen as best practice in the information retrieval domain, and most of my concerns here arise from the apparent 1:1 mapping between the abstract model and the implementation. The Z39.50 GEO profile ( http://www.blueangeltech.com/standards/GeoProfile/geo22.htm ) takes the decoupled approach.)

I'm a bit confused by the "Temporal Reference" element. 5.2.2 talks about what I would expect to see from a temporal reference, but 5.3.2 maps temporal reference on to "One of the dates of publication, last revision or creation of the resource". These three elements are already well defined by Dublin Core attributes. Maybe I've misunderstood what's implied by table 1 in 5.3.2. Also, issues similar to those for the spatial access point arise (with structured data, as opposed to text queries). In some UK datasets, periods such as "Neolithic" can be used instead of an ISO 19108 Date Time. (I see that note 11 under 5.3.4 talks about this, which is good. What's important is that, regardless of the outcome of the study, the IRs are extensible enough to cope with the eventual decision.) I'd consider separate access points for controlled-vocabulary time periods and structured temporal data. This seems a specific example where the abstract IR model needs to go beyond what is defined in the A2 binding.

Geographic Extent: the doc seems a bit bounding-box heavy. It would be nice to understand (have examples of) the specification of interior/exterior polygons. Servers only supporting minimal bounding boxes can gracefully degrade (since it's easy to calculate an MBR from a polygon) whilst allowing other servers to retain the full richness of polygons. It's not clear in the abstract model (it is in the A2 binding) where the semantics for parsing these strings will be defined. For example, should geographic extent be encoded as OpenGIS strings? (That seems to make sense to me, but I'm biased by Oracle and MySQL's spatial functions.) This might seem a bit extreme for the abstract part of the document, but it's one of those make-or-break issues for interoperability, and might be worth the pain. Also, I think it's worth entertaining the idea that spatial specifications such as MBRs and polygons (structured spatial constructs) might be better exposed using their own abstract access point, with "Place Name" having its own access point. This will help server implementors avoid problems with disambiguation of search terms.
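
To illustrate the "graceful degradation" point, here is a minimal Python sketch of deriving an MBR from a polygon's exterior ring. The function name and the (x, y) tuple representation are assumptions made for this example, and it deliberately ignores polygons that cross the antimeridian.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]  # (x, y), e.g. (longitude, latitude)


def minimum_bounding_rectangle(ring: Iterable[Point]) -> Tuple[float, float, float, float]:
    """Return (min_x, min_y, max_x, max_y) for the exterior ring of a polygon.

    Interior rings (holes) lie inside the exterior ring, so they never
    change the result. Antimeridian-crossing polygons are not handled.
    """
    points = list(ring)
    if not points:
        raise ValueError("ring must contain at least one point")
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))


if __name__ == "__main__":
    exterior = [(0.0, 0.0), (4.0, 1.0), (3.0, 5.0), (-1.0, 2.0), (0.0, 0.0)]
    print(minimum_bounding_rectangle(exterior))
```

A server that only indexes bounding boxes could apply something like this on ingest, while a richer server keeps the original polygon alongside it.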

I'm interested in what the expected semantics of resource language are on retrieval of language-neutral data sets. Should a result record not be selected because the user specified "Nor" as the search language, when the resource does match the other criteria (Geo Extent for example)? Normally in Info Retrieval this is a no-brainer, of course it shouldn't, but I'm a bit less certain when we talk about result records that aren't primarily "Text" based. (Actually, this is a slightly wider concern about Annex A and those "CharacterString" elements. In IEEE LOM, for example, we have a "LangString" element that has a "Lang" attribute. That community chose to allow language variants of a resource to be expressed within one record by allowing an element to hold all language variants, for example

 <title>
   <langstring lang="En">Hello</langstring>
   <langstring lang="Dk">Hej</langstring>
 </title>

The presence of a "Lang" attribute at the "Dataset" level might mean the intention is to support multi-language datasets by having several dataset records, one for each language, which is OK, but possibly not optimal for datasets that aren't primarily language based. If this is the case, is the "CharacterString" element in Annex A just redundant payload?)
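
As a rough sketch of how a client might consume a record that holds all language variants in one element, the Python below picks a title by language and falls back to the first variant when the requested language is missing. The fallback policy is an assumption made for this example, not something the draft or LOM prescribes, and pick_title is a made-up name.

```python
import xml.etree.ElementTree as ET
from typing import Optional

RECORD = """<title>
  <langstring lang="En">Hello</langstring>
  <langstring lang="Dk">Hej</langstring>
</title>"""


def pick_title(xml_text: str, preferred: str) -> Optional[str]:
    """Return the title variant for the preferred language.

    Assumed policy: if the preferred language is absent, return the first
    variant rather than excluding the record; None if there are no variants.
    """
    root = ET.fromstring(xml_text)
    variants = [(el.get("lang", "").lower(), el.text or "")
                for el in root.findall("langstring")]
    for lang, text in variants:
        if lang == preferred.lower():
            return text
    return variants[0][1] if variants else None


if __name__ == "__main__":
    print(pick_title(RECORD, "Dk"))
    print(pick_title(RECORD, "Nor"))
```

With this policy a search for "Nor" still returns the record, showing whichever variant exists, which is one possible answer to the retrieval question raised above.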

Lack of machine-reusable data in general

Dataset 'lineage' is only a full-text field. If datasets result from recombination, that should be machine-traversable. Human descriptions of lineage will be so different that they won't be useful for building search / evaluation services.

II: It does tend to talk about "Lineage statement"... would making it (More along the conceptual lines of)

 <lineage>
   <dc:description>Text</dc:description>
 </lineage>

give you the extensibility to either use private extensions, or to specify recombination elements at a later date? (I didn't think this through in terms of the *actual* recombination operations, I just wanted to show how we might make lineage extensible without specifying it.) For example:

 <lineage>
   <dc:description>This dataset is a recombination of X and Y</dc:description>
   <Jo:recombination>
     <Jo:source>X-URI</Jo:source>
     <Jo:source>Y-URI</Jo:source>
     <Jo:Rules>Overlap</Jo:Rules>
   </Jo:recombination>
 </lineage>

Should the lineage search point be called "LineageDescription"? (I think that's what I'll do in my SRW profile.)
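
To show what "machine-traversable" lineage could buy us, here is a hedged Python sketch that reads source URIs out of an extensible lineage element like the one above and walks them transitively. The Jo namespace URI, the records dictionary and both function names are hypothetical; only the Dublin Core namespace URI is real.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# "Jo" is a made-up private extension namespace for this sketch.
NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "Jo": "http://example.org/jo-lineage",
}

LINEAGE = """<lineage xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:Jo="http://example.org/jo-lineage">
  <dc:description>This dataset is a recombination of X and Y</dc:description>
  <Jo:recombination>
    <Jo:source>X-URI</Jo:source>
    <Jo:source>Y-URI</Jo:source>
    <Jo:Rules>Overlap</Jo:Rules>
  </Jo:recombination>
</lineage>"""


def lineage_sources(xml_text: str) -> List[str]:
    """Return the source URIs named in one lineage element."""
    root = ET.fromstring(xml_text)
    return [el.text or "" for el in root.findall("Jo:recombination/Jo:source", NS)]


def all_ancestors(uri: str, records: Dict[str, str]) -> List[str]:
    """Walk lineage transitively; records maps dataset URI -> lineage XML."""
    seen: List[str] = []
    stack = [uri]
    while stack:
        current = stack.pop()
        # Datasets without a known lineage record contribute no sources.
        for src in lineage_sources(records.get(current, "<lineage/>")):
            if src not in seen:
                seen.append(src)
                stack.append(src)
    return seen


if __name__ == "__main__":
    print(all_ancestors("Z-URI", {"Z-URI": LINEAGE}))
```

A client that does not understand the extension can still display the dc:description text, which is the extensibility argument made above.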

Lack of engagement with packaging and re-use issues

Cf. Dataset series / aggregates. The examples have 'MasterMap' as one potential dataset! Real world use cases are going to need subsets of such huge data sets broken down into packages with smaller spatial extents or with fewer layers.

II: Indeed, as well as the SRW/SRU binding experiment, I've been wondering about the OAI binding, which I know you've already discussed elsewhere. What might be generally useful (and maybe this already exists) is a set of TREC-style test data. Setting up a static gateway OAI server wouldn't be so hard, and might give us some valuable real-world information about this problem. I know the records won't be in the right schema, but something we can try and munge into gmd would be a real help.

Bypassing of feature-level metadata from consideration

Once we get down to the feature level the interesting European problems appear: every local area may have its own classification schemes, even inside one language community the same word is used to describe different-looking things, and across language barriers mappings from words to things don't tend to be 1:1. But by disregarding feature-level metadata - partly because it can't be mandated when the underlying geospatial objects aren't publicly inspectable and a certain amount of feature level metadata would mean the data itself is essentially public...

II: Aye, generally for discovery services it's nice to try and avoid mandating that users understand predefined controlled vocabularies, whilst allowing users who do know terms to qualify their discovery process, for example, in CQL I'd be tempted to allow a user to say "dc.subject=Something" or (The equivalent of) "authority=19115:2003 and dc.subject=Something" for users who know a specific term.

There's quite a lot of work going on around Europe at the moment covering crosswalks of controlled vocabularies (mostly I know about crosswalking European educational levels, but it seems to be the same problem cast in a different way). If we can arrange for someone to do the intellectual work of cross-mapping, and make the data publicly available, then it becomes a "turning-the-handle" job for providers to support cross-vocab retrieval. Standards such as ZThes are being used quite a lot in the learning domain to transport this data around. The only effect on reviewing the IR is that it's important that the IR does not preclude this at a future date. (The whole design-for-unforeseen-use thing: specifically, I think mandating a specific vocab in the IR might not be the right thing to do, and giving users a way to say which vocab they are using in the description and discovery process is a better way to go.)
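
As a sketch of that "turning-the-handle" job, the Python below expands a vocabulary-qualified subject term through a crosswalk table into a CQL-style query. Everything here is illustrative: the crosswalk entries and vocabulary names are invented, and "authority" is the hypothetical index name from the CQL example earlier in this section, not an index defined by any profile I know of.

```python
from typing import Dict, List, Tuple

Term = Tuple[str, str]  # (vocabulary id, term)

# Invented crosswalk data: a term in one vocabulary -> equivalents elsewhere.
CROSSWALK: Dict[Term, List[Term]] = {
    ("vocab-a", "river"): [("vocab-b", "watercourse")],
}


def quote(value: str) -> str:
    """Wrap a value in double quotes, backslash-escaping quotes and backslashes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def clause(term: Term) -> str:
    """Build one vocabulary-qualified subject clause."""
    vocab, word = term
    return f"(authority = {quote(vocab)} and dc.subject = {quote(word)})"


def cross_vocab_query(vocab: str, word: str) -> str:
    """OR together the original term and any crosswalked equivalents."""
    terms = [(vocab, word)] + CROSSWALK.get((vocab, word), [])
    return " or ".join(clause(t) for t in terms)


if __name__ == "__main__":
    print(cross_vocab_query("vocab-a", "river"))
```

The point is only that, once a crosswalk exists as data, query expansion is mechanical, provided the IR lets a record and a query say which vocabulary a keyword comes from.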

Over-specificity about internet- and web-services-based distribution models

We are actually causing ourselves unnecessary problems by putting everything on the Internet. Data sharing agreements over publicly maintained private networks with flat-rate membership are a clear potential future and 'middle way' in this domain. The draft as it stands is all about making access/use constraints *specific to data sets* and not specific to the relationship between the data provider or broker, the data user and the transport network between them.

So we have a 'distributed computing platform' metadata property that is required by the IRs. In the ISO 19115 mapping in Annex A this is a free text field, yet 5.2.15 states that the property "is necessary for a client to bind to the service". If it must be mandated, it should be as a URI. It would be wonderful to have examples of what other than HTTP or OGC web services is envisaged NOW as a means of access to the backend of a distributed computing platform.

Feelings

Too general

I am now finishing my reading, and my first feeling about the IR document is that some parts are too general and not helpful for implementation. It is like ISO 19115 or CSW 2.0: when you start to implement it, you have to find your own way to do it. This will probably lead to inconsistency in catalogue interfaces and content. I know that something is better than nothing and I welcome the INSPIRE metadata IR, but more should be specified than is in this document. We are now testing the CSW 2.0 implementation in GeoNetwork Open Source and we have found a lot of problematic parts (in the implementation and in the specification as well). I have some concrete comments on the metadata IR, but I have to summarise them first before publishing them here. Some of them are probably already mentioned here (in that case I will just comment that I agree with the mentioned issues).