Difference between revisions of "Geodata Metadata Requirements"

From OSGeo
Revision as of 16:06, 25 September 2006

One goal of the Public Geospatial Data Project is to offer, in the future, a repository of reusable public geographic data that can support open source geospatial software projects, both inside and outside the foundation.

One big requirement for a potential Geodata Repository is that there be a well-defined baseline for metadata. This can be seen as a quality assurance effort - data won't be accepted without a certain amount of metadata.

The Federal Geographic Data Committee (FGDC) metadata standard emphasises conformance, but doesn't emphasise exchangeability / reusability. FGDC is standard for "Spatial Data Infrastructure" efforts, but doesn't have much of a "geospatial web" orientation.

There are some properties in addition to FGDC which it would be really useful to have, for example distribution channels such as WFS and BitTorrent, which have come into existence since FGDC was originally defined. For many elements, FGDC asks for full-text descriptions. More structure in descriptions would help with automating discovery or re-use.

This is a perpetual work in progress. See also:

Draft Metadata model

Graph illustrating a basic metadata model generated from an RDF model of what OSGeo Geodata Committee participants have identified as their core needs for metadata.

This picks an arbitrary namespace for an OWL schema that maps to most, if not all, of the FGDC mandatory properties and provides some extra ones.

Graph0.png

Data Set

title

Title of the data set. Corresponds to Dublin Core title

description

Text description of the data set. Corresponds to Dublin Core description element.

originator

Person

A person responsible for publication of the data set - name and contact email address. These properties are well-defined in the FOAF vocabulary.

Organization

An organization responsible for publication of the data set - name and contact email address. These properties are well-defined in the FOAF vocabulary.

Spatial Data Organization

Vector, Raster or Point data, as described in FGDC. (cf http://biology.usgs.gov/fgdc.metadata/version2/sdorg.htm )

datasource

URL from which the data can be downloaded via different protocols

WFS

For Vector data in GML

WMS

For Raster data described in GML

File at HTTP URL

BitTorrent

URL of bittorrent .torrent tracker file.

Other Web API

For example, OpenStreetMap API ( http://wiki.openstreetmap.org/index.php/REST )

License information

Emphasis on public geographic data licenses: PGL, possible LPGL, Public Domain, Creative Commons-type licenses. These can be represented by URLs.

Publication date

Corresponds to Dublin Core date: ISO compliant date of publication.

timespan

Time Period

start date and end date

single date

extents

Spatial Domain

A lot of this can be inferred either using GDAL/OGR or collected from a WMS/WFS GetCapabilities. It would be nice to bypass human error on collecting this kind of metadata.
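As a rough illustration of collecting this metadata automatically, the sketch below reads a title and a geographic extent from a WMS 1.1.1 GetCapabilities response using only the Python standard library. The sample document, the function name `extract_metadata` and the returned dictionary layout are assumptions made for this example; a real capabilities document is far larger and may nest layers.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed WMS 1.1.1 capabilities response.
SAMPLE_CAPABILITIES = """\
<WMT_MS_Capabilities version="1.1.1">
  <Service>
    <Name>OGC:WMS</Name>
    <Title>Example elevation service</Title>
  </Service>
  <Capability>
    <Layer>
      <Title>Elevation</Title>
      <LatLonBoundingBox minx="-108.44" miny="28.229"
                         maxx="-96.223" maxy="34.353"/>
    </Layer>
  </Capability>
</WMT_MS_Capabilities>
"""


def extract_metadata(capabilities_xml):
    """Return the service title and the top-level layer extent, if present."""
    root = ET.fromstring(capabilities_xml)
    title = root.findtext("Service/Title")
    box = root.find("Capability/Layer/LatLonBoundingBox")
    extent = None
    if box is not None:
        # LatLonBoundingBox is always expressed in geographic coordinates.
        extent = {
            "west": float(box.get("minx")),
            "south": float(box.get("miny")),
            "east": float(box.get("maxx")),
            "north": float(box.get("maxy")),
        }
    return {"title": title, "extent": extent}


if __name__ == "__main__":
    print(extract_metadata(SAMPLE_CAPABILITIES))
```

The same approach could feed dc:title and dct:spatial without anyone typing coordinates by hand.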

bounding coordinates

FGDC specifies north, east, west, south bounding co-ordinates. It doesn't specify a projection in which these should be described. For reasons of simplicity it could make sense to require these be in WGS84 (EPSG:4326) - for the same reasons GeoRSS decided to mandate WGS84, rather than complicate matters by dictating that people also specify an SRS.
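If WGS84 were mandated, a repository could run a simple sanity check on submitted bounding coordinates. The following is a minimal sketch under that assumption; the class name `BoundingBox` and the method `problems` are invented for illustration. It deliberately does not require west to be less than east, since a box may cross the antimeridian.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Bounding coordinates in decimal degrees, assumed to be WGS84."""
    north: float
    east: float
    south: float
    west: float

    def problems(self):
        """Return a list of human-readable problems (empty if none found)."""
        issues = []
        for name in ("north", "south"):
            if not -90.0 <= getattr(self, name) <= 90.0:
                issues.append(f"{name} limit is outside -90..90 degrees")
        for name in ("east", "west"):
            if not -180.0 <= getattr(self, name) <= 180.0:
                issues.append(f"{name} limit is outside -180..180 degrees")
        if self.south > self.north:
            issues.append("south limit lies north of the north limit")
        return issues


if __name__ == "__main__":
    print(BoundingBox(34.353, -96.223, 28.229, -108.44).problems())
```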

Projection (Raster, Vector, Coverage)

Original projection of the data (reference to an ?)

Horizontal and vertical datum;

Horizontal and vertical units.

Resolution (Raster,Coverage)

(property of DataSet)

e.g. map units per pixel where map units are defined by SRS

can be different in horizontal / vertical axes e.g. non square pixels

Colour Depth (Raster)

8/16/24 bit etc - this is useful rather than required

Transparency (Raster)

Scale (Vector)

Map scale at which vectors are considered accurate

Quantified as a fractional/dimensionless number - 'inches per inch' - on a scale between 1 and 0 - or inverse scale such as 1:50000 - and we would want to store this in a consistent way.
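One possible way to store scale consistently is to normalise every notation to the integer denominator n of 1:n. The sketch below assumes that convention; the function name `scale_denominator` is hypothetical and nothing in this document prescribes this representation.

```python
from fractions import Fraction


def scale_denominator(value):
    """Normalise '1:50000', '1/50000' or 0.00002 to the integer 50000."""
    if isinstance(value, str):
        text = value.replace(" ", "")
        if ":" in text:
            left, right = text.split(":", 1)
            ratio = Fraction(left) / Fraction(right)
        else:
            ratio = Fraction(text)
    else:
        # Go through str() so that 0.00002 becomes exactly 1/50000
        # instead of the nearest binary floating point value.
        ratio = Fraction(str(value))
    if not 0 < ratio <= 1:
        raise ValueError("scale must be a fraction between 0 and 1")
    return round(1 / ratio)


if __name__ == "__main__":
    for sample in ("1:50000", "1/25000", 0.00002):
        print(sample, "->", scale_denominator(sample))
```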

Layers

DataSet has multiple Layers

Name

Description

Extent

can be non-rectangular

Scale Hinting

minscale / maxscale - cf resolution and scale - are these actually properties of layers and not really of data sets? (eg data set contains multiple layers - will they be in any way likely to contain different scale properties?)

Optional extra properties

Taxonomy/Ontology

Currently undecided; would be good to refer this to current well-known thesauri for data themes.

Discovery

Requirements

A discovery resource is essential to expose resultant metadata as per this document. Below are requirements:

Publish:

  • ability to publish/register a web service
  • ability to publish/register a static resource
  • ability to harvest and classify public and private resources
  • ability to establish and maintain user/group/role based authentication
  • ability to provide a RESTful authentication mechanism

Find:

  • ability to discover the existence of a web service
  • ability to discover the existence of a resource which is available via web services (e.g. WMS layer, WFS feature type)
  • ability to discover the existence of a static resource (dataset, document, etc.)

Bind:

  • ability to perform discovery operations with spatial, aspatial and temporal predicates
  • ability to provide a RESTful request API
  • ability to provide responses in XML
  • ability to expose resource/service metadata in a manner which facilitates dynamic connection to a resource/service
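To make "spatial, aspatial and temporal predicates" concrete, here is a small in-memory sketch of a discovery query that combines all three. The `Record` fields, the function names and the sample data are assumptions for illustration only; a real discovery service would sit behind a web API and a proper index.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Record:
    identifier: str
    title: str
    modified: date
    bbox: tuple  # (west, south, east, north), assumed WGS84


def intersects(a, b):
    """True if two (west, south, east, north) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def find(records, bbox=None, text=None, modified_since=None):
    """Filter records by spatial, aspatial (text) and temporal predicates."""
    hits = []
    for record in records:
        if bbox is not None and not intersects(record.bbox, bbox):
            continue
        if text is not None and text.lower() not in record.title.lower():
            continue
        if modified_since is not None and record.modified < modified_since:
            continue
        hits.append(record)
    return hits


# Purely fictive sample records.
CATALOG = [
    Record("a1", "Texas elevation grid", date(2004, 3, 1),
           (-108.44, 28.229, -96.223, 34.353)),
    Record("b2", "Swiss road network", date(2006, 9, 1),
           (5.9, 45.8, 10.5, 47.8)),
]

if __name__ == "__main__":
    print(find(CATALOG, bbox=(-100.0, 30.0, -99.0, 31.0), text="elevation"))
```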

Information model for metadata exchange

This is the http://tinyurl.com/kfkyv model (:->).

Some design considerations:

  • This is a minimal metadata information model intended for a metadata exchange protocol for harvesting (e.g. no filter or GML implementation needed), in line with the ideas about a Simple Catalog Interface/protocol.
  • Based on Dublin Core (DC) and Catalogue Services Specification 2.0.1, OGC 04-021r3, p.22.
  • Dublin Core needs refined semantics for some properties/attributes.
  • The abundant use of namespaces has been a source of difficulty. This is because DC specs and other XML 'practices' specialize property/attribute types instead of specializing whole classes.
  • All properties/attributes have cardinality [0..1], except for identifiers (which are mandatory) and for those attributes which really need to be unbounded for automation.
  • Gather as much information as possible in an automated manner, e.g. from the data set resource.

Details:

  • Services are included in attribute 'format' in the sense that WMS, etc. are just protocol bindings to geodata. Well-known services in their own right, like filter or label placement services, have a place there too. They could still be detected by challenging them with GetCapabilities (taken from OWS/WxS).
  • Indicating quality of service could be a nice task for a search service provider; no need to add it as an attribute.
  • Relationships between features are part of schema metadata: how to handle this...?

Minimal Information Model

  • Attempting to abstract this out into Geodata Metadata Model
  • Aligned and rearranged the model after some discussions on osgeodata-list and geotools-devel-list.

Core DC elements plus 'common' elements from OWS (GetCapabilities):

{|
|-
|'''Attr. name''' || '''Cardinality''' || '''Attr. type''' || '''Explanation''' || '''Poss. to autom.?''' || '''Status'''
|-
|dc:identifier || [1] || string || Unique id to identify a resource (URI); see UUID but also [http://www.openarchives.org/OAI/2.0/guidelines-oai-identifier.htm OAI-PMH]! || System generated || tbd.
|-
|dc:title || [0..1] || string || Title of the resource. || GetCapabilities || Ok
|-
|dc:description || [0..1] || URI or string || A description of the resource. Depending on dc:type, it is a string or a URI pointing to human readable information; in case of dc:type data_access_service it is the content (string) of the GetCapabilities document (why should we use dct:abstract as ISO19115 does?) || GetCapabilities || Ok
|-
|dct:type || [0..1] || string || DC: "The nature or genre of the content of the resource (text, image, sound)". Here mainly geodata, data_access_service, config. documents. How about vector, raster, grid geodata? || Can be derived; (GetCapabilities) || Ok?
|-
|dct:format || [0..*] || URI || A (machine readable) reference to an xml schema namespace URI, mime type, internet media type. || (GetCapabilities) || Ok?
|-
|dct:spatial || [0..1] || dcmiBox:Box with CRS || Subtype of dc:coverage. Take WGS84 as mandatory or as default CRS? || Boundary-Coords from file; GetCapabilities || Ok?
|-
|dct:modified || [0..1] || date || Date of last (published) change of resource. || Timestamp of file; GetCapabilities || Ok
|-
|dc:subject || [0..1] || string || A keyword list (comma separated?); could be ISO 19115 classifications || GetCapabilities || Ok?
|-
|cld:isAccessedVia || [0..*] || URI || If dc:type 'geodata': (Machine readable) Reference to either baseURL to WMS/WFS or a full path to ftp://host.com/path/filename, as well as to a 'Filter Service' (or 'WSDL'). See [http://dublincore.org/groups/collections/collection-application-profile/2006-08-24/#colcldisAccessedVia here]. Subtype of dc:relation. || Can be derived || tbd.
|-
|dcterms:hasPart || [0..*] || URI || If dc:type 'data access service': (Machine readable) Reference to . See [http://dublincore.org/groups/collections/collection-application-profile/2006-08-24/#coldctermshasPart here]. Subtype of dc:relation. || Can be derived || tbd.
|-
|dc:publisher || [0..1] || structure (refinement of string) || Civic Address or URI to point to (xAL/KML?) || Up to data owner, GetCapabilities || tbd.
|-
|dc:source || [0..1] || URI (preferred) or string || A reference to a resource from which the present resource is derived. (Human readable) reference to lineage information about the resource (Note: Server base URLs and file URIs are handled elsewhere) || Up to data owner || Ok?
|}


Remaining Core DC elements:

{|
|-
|'''Attr. name''' || '''Cardinality''' || '''Attr. type''' || '''Explanation''' || '''Poss. to autom.?''' || '''Status'''
|-
|dc:language || [0..1] || enum || RFC 1766 (ISO 639, followed optionally by country ISO 3166) || (ifndef: can be guessed) || Ok
|-
|dc:rights || [0..1] || URI or string || URI: (human readable) License information about the resource || Up to data owner, GetCapabilities || Ok?
|-
|dc:relation || [0..*] || URI || (Machine readable) Reference to other metadata providers in order to let discover other (meta) data providers. Note that OAI-PMH has such a relationship called 'friends' but on the metadata collection/set level. || Up to data owner || tbd.
|}

Sorted out or highly disputed (non-DC) elements:

{|
|-
|'''Attr. name''' || '''Cardinality''' || '''Attr. type''' || '''Explanation''' || '''Poss. to autom.?''' || '''Status'''
|-
|??:fees || [0..1] || string || ??? || GetCapabilities || tbd.
|-
|??:scalehint || [0..1] || positive integer || Denominator of scale 1:nnnnn || GetCapabilities || tbd.
|-
|??:harvestinterval || [0..1] || positive integer || (seconds? look at robots.txt/Sitemaps) ? || GetCapabilities || tbd.
|}

Legend: 'Poss. to autom.' means Possible to automate
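The cardinalities in the tables above lend themselves to an automated check. The sketch below encodes a subset of them and reports violations for a record given as a mapping from attribute name to a list of values. The record representation and the function name `check_cardinality` are assumptions for illustration; no encoding or schema has been agreed on yet.

```python
# (minimum, maximum) occurrences per attribute; None means unbounded.
# Subset of the tables above, using the attribute names as listed there.
CARDINALITY = {
    "dc:identifier": (1, 1),
    "dc:title": (0, 1),
    "dc:description": (0, 1),
    "dct:type": (0, 1),
    "dct:format": (0, None),
    "dct:spatial": (0, 1),
    "dct:modified": (0, 1),
    "dc:subject": (0, 1),
    "cld:isAccessedVia": (0, None),
    "dc:publisher": (0, 1),
    "dc:source": (0, 1),
    "dc:language": (0, 1),
    "dc:rights": (0, 1),
    "dc:relation": (0, None),
}


def check_cardinality(record):
    """Return a list of violations for a {name: [values]} record."""
    problems = []
    for name, (minimum, maximum) in CARDINALITY.items():
        count = len(record.get(name, []))
        if count < minimum:
            problems.append(f"{name}: expected at least {minimum}, found {count}")
        if maximum is not None and count > maximum:
            problems.append(f"{name}: expected at most {maximum}, found {count}")
    return problems


if __name__ == "__main__":
    print(check_cardinality({"dc:title": ["One title", "Another title"]}))
```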

Remarks:

  • General:
    • DC attributes/properties left as they are: Audience; Contributor; Creator.
    • All attributes/properties have a maximum cardinality of 1 except dc:relation (cld:isAccessedVia) and dct:format.
    • Depending on the modeling approach, even these elements can become cardinality 1. NOTE that geodata resources and data access services in principle have a many-to-many relationship: here a geodata resource (dc:type geodata) can have many cld:isAccessedVia elements, and a dc:type data_access_service has only one dc:description, which can be a GetCapabilities document.
    • No additional DC attributes/properties are required; a few of them needed to be specialized (see dct:...). See here for some general explanations about dc/dct.
    • Some attributes/properties still need a specialized recommended meaning (see tbd.).
    • Assume metadata (as opposed to geodata) is always free and open information, e.g. under Creative Commons Share Alike.
    • An encoding still has to be discussed (see the following example). Do we need a schemaLocation at OSGeo?
  • Details:
    • dct:modified and dct:spatial can be synced from the dataset.
    • Attribute 'relation': this hasn't been discussed yet. It simply helps harvesters to discover more (meta)data providers.
    • Attribute 'publisher': Carl mentioned such a structure here, which includes StreetAddress, addressee, primaryAddressNumber, streetName, city, state, zipCode, countryCode (like in KML and behind the Google geocoding service!?)
    • Keywords are included in attribute 'dc:subject'; I think people have a hard time agreeing on an enumerated list (see the success of folksonomy).
    • GetCapabilities adds the following attributes (not yet modelled here): Fees, ScaleHint and Style.
  • Note that OAI-PMH...
    • puts an XML envelope around this metadata and adds a header containing two attributes: 'identifier' to identify a metadata record and 'datestamp' as the date of the last (published) change of the metadata record.
    • requires a name to be defined for metadata sets. Let's not worry about this yet.
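The OAI-PMH envelope mentioned above can be sketched as follows: a record element holding a header (identifier and datestamp) and a metadata element that carries the payload. The OAI-PMH namespace and element names follow the OAI-PMH 2.0 specification; the identifier value, the payload and the helper names are fictive and chosen only for this example.

```python
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"
GEOMETA_NS = "http://www.osgeo.org/schemas/geometa/"  # as in the example record

ET.register_namespace("oai", OAI_NS)
ET.register_namespace("dc", DC_NS)
ET.register_namespace("geometadc", GEOMETA_NS)


def qname(namespace, tag):
    """Build the '{namespace}tag' form that ElementTree expects."""
    return "{%s}%s" % (namespace, tag)


def wrap_record(record_id, datestamp, payload):
    """Wrap a metadata payload element in an OAI-PMH style record."""
    record = ET.Element(qname(OAI_NS, "record"))
    header = ET.SubElement(record, qname(OAI_NS, "header"))
    ET.SubElement(header, qname(OAI_NS, "identifier")).text = record_id
    ET.SubElement(header, qname(OAI_NS, "datestamp")).text = datestamp
    metadata = ET.SubElement(record, qname(OAI_NS, "metadata"))
    metadata.append(payload)
    return record


def sample_payload():
    """A minimal, fictive payload with a single dc:title."""
    payload = ET.Element(qname(GEOMETA_NS, "qualifieddc"))
    ET.SubElement(payload, qname(DC_NS, "title")).text = "Example data set"
    return payload


if __name__ == "__main__":
    record = wrap_record("oai:example.org:f264", "2006-09-25", sample_payload())
    print(ET.tostring(record, encoding="unicode"))
```

Note that the header identifier and datestamp describe the metadata record itself, not the geodata resource it documents.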

Example

Notes:

  • Example values are only for explanation purposes and purely fictive.
  • XML Schema (= geometadc.xsd) still tbd.
  • This record is not yet validated!
  • Took 'geometadc' as envelope name.
 <geometadc:qualifieddc 
   xmlns:geometadc="http://www.osgeo.org/schemas/geometa/" 
   xmlns:dc="http://purl.org/dc/elements/1.1/" 
   xmlns:dct="http://purl.org/dc/terms/" 
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
   xsi:schemaLocation="http://www.osgeo.org/schemas/geometa/ geometadc.xsd">
   <dc:identifier>f264-77d2-09ce-aa39-f0f0</dc:identifier>
   <dc:title>National Elevation Mapping Service for Texas</dc:title>
   <dc:description>Elevation data collected for the National Elevation 
    Dataset (NED).</dc:description>
   <dc:subject>Elevation, Hypsography, and Contours</dc:subject>
   <dc:relation>g264-77d2-09ce-aa39-g0g0</dc:relation>
   <dc:type>grid geodata</dc:type>
   <dc:format>uri:http://www.osgeo.org/services/wms/</dc:format>
   <dc:format>uri:http://www.osgeo.org/geodata/ned_grid_georss.xml</dc:format>
   <dc:format>uri:http://www.osgeo.org/geodata/ned_grid.shp</dc:format>
   <dct:modified>2004-03-01</dct:modified>
   <dct:spatial>
     <Box projection="EPSG:4326" name="Geographic">
       <northlimit>34.353</northlimit>
       <eastlimit>-96.223</eastlimit>
       <southlimit>28.229</southlimit>
       <westlimit>-108.44</westlimit>
     </Box>
   </dct:spatial>
   <dc:language>en</dc:language>
   <dc:source>Lineage: Based on 30m horizontal and 15m vertical accuracy.</dc:source>
   <dc:rights>uri:http://www.usgs.gov/pubprod/</dc:rights>
   <dc:publisher>U.S. Geological Survey</dc:publisher>
 </geometadc:qualifieddc>
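A harvester would need to read such a record back. The following sketch parses a shortened, well-formed variant of the record above with the Python standard library. The function name `read_record` and the returned dictionary layout are assumptions, and since the XML Schema is still tbd., the element layout may change.

```python
import xml.etree.ElementTree as ET

NS = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "dct": "http://purl.org/dc/terms/",
}

# Shortened variant of the example record, with fictive values.
RECORD = """\
<geometadc:qualifieddc
  xmlns:geometadc="http://www.osgeo.org/schemas/geometa/"
  xmlns:dc="http://purl.org/dc/elements/1.1/"
  xmlns:dct="http://purl.org/dc/terms/">
  <dc:identifier>f264-77d2-09ce-aa39-f0f0</dc:identifier>
  <dc:title>National Elevation Mapping Service for Texas</dc:title>
  <dc:format>uri:http://www.osgeo.org/services/wms/</dc:format>
  <dc:format>uri:http://www.osgeo.org/geodata/ned_grid.shp</dc:format>
  <dct:spatial>
    <Box projection="EPSG:4326" name="Geographic">
      <northlimit>34.353</northlimit>
      <eastlimit>-96.223</eastlimit>
      <southlimit>28.229</southlimit>
      <westlimit>-108.44</westlimit>
    </Box>
  </dct:spatial>
</geometadc:qualifieddc>
"""


def read_record(xml_text):
    """Extract identifier, title, formats and bounding box from a record."""
    root = ET.fromstring(xml_text)
    result = {
        "identifier": root.findtext("dc:identifier", namespaces=NS),
        "title": root.findtext("dc:title", namespaces=NS),
        "formats": [e.text for e in root.findall("dc:format", NS)],
        "crs": None,
        "bbox": None,
    }
    # Box carries no namespace in the example, so it is matched unqualified.
    box = root.find("dct:spatial/Box", NS)
    if box is not None:
        result["crs"] = box.get("projection")
        result["bbox"] = {
            side: float(box.findtext(side + "limit"))
            for side in ("north", "east", "south", "west")
        }
    return result


if __name__ == "__main__":
    print(read_record(RECORD))
```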

Other Relevant Info

  • Simple_Catalog_Interface
  • OSGeodata on GISpunkt Wiki - These pages are about the search for an open, lean and mean "protocol for the incremental exchange of metadata about geographic resources between systems". Profiled specifications like WFS or OAI-PMH are currently on our short list. Delving into the 'Open Archives Initiative Protocol for Metadata Harvesting' (OAI-PMH) is strongly encouraged. It is a low-barrier interoperability specification based on a metadata harvesting model; it is stable (subsequent revisions are backwards compatible) and uses unqualified Dublin Core as its default metadata information model. Open source tools exist (like OAICat), and it has been adopted by Google and Yahoo!, among others. It is not a search protocol, however.
  • See here a comparison between CSW, WFS and OAI-PMH.

References

Geospatial

  • GeoAPI contains an implementation of ISO 19115

RDF

  • FOAF metadata model for people and organisations
  • DOAP metadata model for open source software projects and code repositories

Notes

Metadata isn't an easy task. The balance between completeness and people simply neglecting to generate it...

I wish I had had a preexisting plan of how to index and search for the data sets on extent and 'type' that we were adding

From Geodata Packaging Working Group:

    • Specifications of a data set
      • Creator
      • Date
      • License
      • Data Type
      • Topic
      • Spatial Extent
      • Coordinate System/Projection
      • Target Scale/Precision
      • Attribute Data

See Also

Comments

  • I would like to propose an additional element for the metadata model--data source (or lineage). If the data is derived from some other data, we should be able to backtrack and look at its parent/s. "Lineage" is a conditional element in FGDC but I think it's important enough that we should include it in our model. I suppose this can also be included in the Description but wouldn't it be nice to have this included as a required element? This is useful when checking for errors/consistency. -Perry