Geodata Metadata Requirements

One goal of the Public Geospatial Data Project is to offer, in the future, a repository of reusable public geographic data that can support open source geospatial software projects, both inside and outside the foundation.

One big requirement for a potential Geodata Repository is that there be a well-defined baseline for metadata. This can be seen as a quality assurance effort - data won't be accepted without a certain amount of metadata.

The US Federal Geographic Data Committee (FGDC) metadata standard emphasises conformance, but doesn't emphasise exchangeability / reusability. FGDC is standard for "Spatial Data Infrastructure" efforts, but doesn't have much of a "geospatial web" orientation.

Some properties beyond those in FGDC would be really useful to have, for example distribution channels such as WFS and BitTorrent, which have come into existence since FGDC was originally defined. For many elements, FGDC asks for full-text descriptions. More structure in descriptions would help with automating discovery or re-use.

This is a perpetual work in progress. See also:


 * Geodata Metadata Translata
 * Simple Catalog Interface
 * Geometa Engine
 * ebRIM

= Draft Metadata model =

This is an attempt to abstract the minimal metadata model below (DClite4G) out into the Geodata Metadata Model (see there), which serves as the model for a metadata management tool. This is what OSGeo Geodata Committee participants have identified as their core needs for metadata.

It was generated from an RDF model. This picks an arbitrary namespace for an OWL schema that maps to most, if not all, of the FGDC mandatory properties and provides some extra ones.

Data Set
This is a draft that defines a metadata information model for an OSGeo catalogue. Below there is another minimal metadata information model targeted at metadata exchange and harvesting (name proposed: DClite4G).


 * Title: Title of the data set. Corresponds to Dublin Core title
 * Description: Text description of the data set. Corresponds to Dublin Core description element.
 * Originator
 * Person: A person responsible for publication of the data set - name and contact email address. These properties are well-defined in the FOAF vocabulary.
 * Organization: An organization responsible for publication of the data set - name and contact email address. These properties are well-defined in the FOAF vocabulary.
 * Spatial Data Organization: Vector, Raster or Point data, as described in FGDC. (cf http://biology.usgs.gov/fgdc.metadata/version2/sdorg.htm )
 * Datasource: URL from which the data can be downloaded via different protocols
 * WFS: For Vector data in GML
 * File at HTTP URL: For Raster data described in GML
 * BitTorrent: URL of the BitTorrent .torrent file.
 * Other Web API: For example, OpenStreetMap API ( http://wiki.openstreetmap.org/index.php/REST )
 * License information: Emphasis on public geographic data licenses: PGL, possible LPGL, Public Domain, Creative Commons-type licenses. These can be represented by URLs.
 * Publication date: Corresponds to Dublin Core date: ISO compliant date of publication.
 * Timespan
 * Time Period
 * start date and end date
 * single date
 * Extents
 * Spatial Domain: A lot of this can be inferred either using GDAL/OGR or collected from a WMS/WFS GetCapabilities. It would be nice to bypass human error on collecting this kind of metadata.
 * bounding coordinates: FGDC specifies north, east, west, south bounding coordinates. It doesn't specify a projection in which these should be described. For reasons of simplicity it could make sense to require that these be in WGS84 (EPSG:4326), for the same reasons GeoRSS decided to mandate WGS84 rather than complicate matters by dictating that people also specify an SRS.
 * Projection (Raster, Vector, Coverage): Original projection of the data (reference to an ?)
 * Horizontal and vertical datum;
 * Horizontal and vertical units.
 * Resolution (Raster,Coverage): (property of DataSet). e.g. map units per pixel where map units are defined by SRS. can be different in horizontal / vertical axes e.g. non square pixels
 * Colour Depth (Raster): 8/16/24 bit etc - this is useful rather than required
 * Transparency (Raster)
 * Scale (Vector): Map scale at which vectors are considered accurate. Quantified either as a fractional/dimensionless number ('inches per inch') between 0 and 1, or as an inverse scale such as 1:50000. We would want to store this in a consistent way.
 * Layers: DataSet has multiple Layers
 * Name
 * Description
 * Extent: can be non-rectangular
 * Scale Hinting: minscale / maxscale - cf resolution and scale - are these actually properties of layers and not really of data sets? (eg data set contains multiple layers - will they be in any way likely to contain different scale properties?)
 * Optional extra properties
 * Taxonomy/Ontology: Currently undecided; would be good to refer this to current well-known thesauri for data themes.
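
Two of the properties above lend themselves to automated checking: bounding coordinates in WGS84 and a consistently stored scale. The following is a minimal sketch only; the names `BoundingBox` and `parse_scale` are hypothetical and not part of any agreed model, and it assumes coordinates are given in decimal degrees (EPSG:4326).

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class BoundingBox:
    """Bounding coordinates in WGS84 decimal degrees (assumed EPSG:4326)."""
    west: float
    south: float
    east: float
    north: float

    def __post_init__(self):
        # west > east is deliberately allowed (box crossing the antimeridian).
        if not (-180.0 <= self.west <= 180.0 and -180.0 <= self.east <= 180.0):
            raise ValueError("longitude out of range")
        if not (-90.0 <= self.south <= self.north <= 90.0):
            raise ValueError("latitude out of range or south > north")


def parse_scale(text: str) -> Fraction:
    """Normalise '1:50000' or '0.00002' to one representative fraction."""
    text = text.strip()
    if ":" in text:
        numerator, denominator = text.split(":", 1)
        scale = Fraction(int(numerator), int(denominator))
    else:
        scale = Fraction(text)
    if not 0 < scale <= 1:
        raise ValueError("scale must be greater than 0 and at most 1")
    return scale
```

Storing the scale as an exact fraction means that '1:50000' and '0.00002' compare equal, which is one possible answer to the consistency question raised for the Scale property.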

= Discovery =

Requirements
A discovery resource is essential to expose resultant metadata as per this document. Below are requirements:

Publish:
 * ability to publish/register a web service
 * ability to publish/register a static resource
 * ability to harvest and classify public and private resources
 * ability to establish and maintain user/group/role based authentication
 * ability to provide a RESTful authentication mechanism

Find:
 * ability to discover the existence of a web service
 * ability to discover the existence of a resource which is available via web services (i.e WMS layer, WFS feature type)
 * ability to discover the existence of a static resource (dataset, document, etc.)

Bind:
 * ability to perform discovery operations with spatial, aspatial and temporal predicates
 * ability to provide a RESTful request API
 * ability to provide responses in XML
 * ability to expose resource/service metadata in a manner which facilitates dynamic connection to a resource/service
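
To illustrate what discovery with spatial, aspatial and temporal predicates could look like, here is a rough in-memory sketch. The record layout (`bbox`, `subject`, `modified`) and the function names are assumptions made for this example, and the second sample record is invented; a real catalogue would sit behind a request API rather than a Python list.

```python
from datetime import date


def bbox_intersects(a, b):
    """a and b are (west, south, east, north) tuples in the same SRS."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def find(records, bbox=None, keyword=None, modified_since=None):
    """Return the records that satisfy every predicate that is not None."""
    hits = []
    for rec in records:
        # Spatial predicate: bounding boxes must overlap.
        if bbox is not None and not bbox_intersects(rec["bbox"], bbox):
            continue
        # Aspatial predicate: case-insensitive substring match on the subject.
        if keyword is not None and keyword.lower() not in rec["subject"].lower():
            continue
        # Temporal predicate: modified on or after the given day.
        if modified_since is not None and rec["modified"] < modified_since:
            continue
        hits.append(rec)
    return hits


# Fictional records, for illustration only.
records = [
    {"title": "Texas elevation", "subject": "Elevation, Hypsography, and Contours",
     "bbox": (-108.44, 28.229, -96.223, 34.353), "modified": date(2004, 3, 1)},
    {"title": "Zurich roads", "subject": "Transportation",
     "bbox": (8.4, 47.3, 8.6, 47.4), "modified": date(2006, 9, 26)},
]
```

For example, `find(records, bbox=(-100, 30, -99, 31))` selects only the Texas record, while combining `keyword` and `modified_since` narrows the result by all given predicates at once.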

Webcrawler's perspective, or: How to boost your geodata?
Geoinformation content needs to be published and disseminated before users can find it. The following are some thoughts on helping discovery/search services/brokers and their webcrawlers/harvesters/aggregators do a better job. The main question here is: what can content owners do to promote their information? The answer is mainly registration, declaration and citation. -- Stefan 11:25, 2 October 2006 (CEST) (http://tinyurl.com/ghhb2)


 * Self-registration through content provider
 * submit URL (like DNS)
 * Do a 'ping' (like RSS)
 * register through UDDI (if SOAP is 'unavoidable'...)
 * Self-declaration:
 * Machine-readable only ('invisible'):
 * GeoRSS feed of feeds which contains GeoRSS encoded metadata records ('RSS of RSS')
 * OPML (natural to RSS, but more complicated than just 'RSS of RSS')
 * Something similar to robots.txt
 * HTML link: 
 * Visible/human readable:
 * Feed icons
 * Chicklets
 * Citation through others:
 * XML link or relationship inside metadata record (see also 'friend' in OAI-PMH metadata set header)
 * HTML link pointing to Webpage (suboptimal)

Note: See also "RSS auto-discovery": e.g. '"Blog Optimization", SES San Jose (aug. 2006): 1. Submit your feeds... Or: "Pimp My Blog in 8 Steps" (sept. 2006): (...) 3. Sign up for feedburner.com, (...) 8. Enable auto discovery (feed icon, chicklet).'
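
As an illustration of the machine-readable 'HTML link' self-declaration, a crawler could look for feed auto-discovery links in a page's head. This sketch assumes the common `<link rel="alternate" type="application/rss+xml" ...>` convention; the class name, the sample page and the example.org URL are made up for illustration.

```python
from html.parser import HTMLParser

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collects href values of <link rel="alternate"> feed declarations."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rel = (attr.get("rel") or "").lower().split()
        mime = (attr.get("type") or "").lower()
        if "alternate" in rel and mime in FEED_TYPES and attr.get("href"):
            self.feeds.append(attr["href"])


def discover_feeds(html_text):
    """Return the feed URLs declared in an HTML document."""
    finder = FeedLinkFinder()
    finder.feed(html_text)
    return finder.feeds


# Fictional page of a content provider.
page = """<html><head>
<title>Example geodata provider</title>
<link rel="alternate" type="application/rss+xml" title="Metadata feed"
      href="http://example.org/geodata/metadata.rss">
<link rel="stylesheet" href="style.css">
</head><body></body></html>"""

print(discover_feeds(page))
```

The same approach would work for a GeoRSS 'feed of feeds', since the crawler only needs the declared URL to start harvesting.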

Information model for metadata exchange
This is the 'Dublin Core lite for Geo' (DClite4G) model (http://tinyurl.com/kfkyv).

Some design considerations:
 * This is a minimal metadata information model for a metadata exchange protocol used for harvesting (e.g. no filter or GML implementation needed), in line with the ideas about a Simple Catalog Interface/protocol.
 * Based on Dublin Core (DC) and Catalogue Services Specification 2.0.1, OGC 04-021r3, p.22.
 * Dublin Core needs refined semantics for some properties/attributes.
 * The abundant use of namespaces caused difficulties. This is because DC specs and other XML 'practices' specialize property/attribute types instead of specializing whole classes.
 * All properties/attributes have cardinality [0..1], except for identifiers (which are mandatory) and for those attributes which really need to be unbounded for automation.
 * Collect as much information as possible in an automated manner, e.g. from the data set resource itself.

Details:
 * Services are included in attribute 'format' in the sense that WMS, etc. are just protocol bindings to geodata. Real, well-known standalone services such as filter or label placement services have a place there too. They can still be detected by challenging them with GetCapabilities (taken from OWS/WxS).
 * Indicating quality of service could be a nice task for a search service provider; there is no need to add it as an attribute.
 * Relationships between features is part of schema metadata: How to handle this...?

Dublin Core lite for Geo (DClite4G)

 * Aligned and rearranged after some discussions on osgeodata-list and geotools-devel-list. Next steps: first consensus on approach/fields, then consensus about which encoding (DC, RDF or GeoRSS?). - Stefan 08:33, 26 September 2006 (CEST)


 * Figure: Analysis UML class diagram of minimal metadata information model 'Dublin Core lite for Geo' (DClite4G) which consists of a single entity set (green); two entities/instances/records are shown, a 'dataset' (left, grey) and a 'service' (right, red).

Dublin Core lite for Geo (DClite4G) (mandatory subset of DC elements plus georesource relationships; some XML content except dc:identifier may be null/empty):

Legend: 'Equal to' means possible to derive from iso19128 (= WxS GetCapabilities).

Remarks:
 * General:
 * DC attributes/properties left as they are: dc:Audience; dc:Contributor; dc:Creator.
 * All attributes/properties have at most cardinality 1 except iso19115: OnlineResource and dct:hasPart (and dc:relation from complementary part of DClite4G).
 * Depending on the modeling approach, even these elements can become cardinality 1. NOTE that datasets (geodata resources) and services (data access services) in principle have a many-to-many relationship: Here a geodata resource (dc:type dataset) can have many iso19115: OnlineResource elements and a dc:type data_access_service has only one dc:description which can be a GetCapabilities document.
 * No additional DC attributes/properties required; a few of them need to be specialized (see dct:...);
 * See here for some general explanations about dc/dct.
 * Assume metadata (as opposed to many geodata sets) is always free and open information, like Creative Commons Share Alike
 * An encoding still has to be discussed (see following example). need schemaLocation in OSGeo!?
 * Details:
 * GetCapabilities adds following attributes (not yet modeled here): Fees, ScaleHint and Style.
 * Sorted out or highly disputed (non-DC) elements: fees, scalehint, harvestinterval.
 * dct:modified and dct:spatial can be sync'ed from dataset.
 * Attribute 'relation': This hasn't been discussed yet. It simply helps harvesters to discover more (meta)data providers.
 * Attribute 'publisher': Carl mentioned such a structure here which includes StreetAddress, addressee, primaryAddressNumber, streetName, city, state, zipCode, countryCode (like in KML and behind Google geocoding service!?)
 * Keywords are included in attribute 'dc:subject'; I think people have a hard time agreeing on an enumerated list (see the success of folksonomy).
 * Note that OAI-PMH...
 * puts an XML envelope around this metadata and adds a header containing two attributes: 'identifier' to identify a metadata record and 'datestamp' as the date of the last (published) change of the metadata record.
 * requires defining a name for metadata sets. Let's not worry about this yet.
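
The cardinality remarks above can be checked mechanically. The sketch below assumes a record is given as a list of (element name, value) pairs and that the repeatable elements are the ones used in the examples on this page (`dclite4g:onlineResource`, `dct:hasPart`, `dc:relation`); the function name is hypothetical.

```python
from collections import Counter

# Elements assumed to be repeatable; everything else may occur at most once.
REPEATABLE = {"dclite4g:onlineResource", "dct:hasPart", "dc:relation"}


def cardinality_errors(pairs):
    """pairs is a list of (element_name, value) tuples for one record."""
    errors = []
    counts = Counter(name for name, _ in pairs)
    identifiers = [value for name, value in pairs if name == "dc:identifier"]
    # dc:identifier is the only mandatory, non-empty element.
    if len(identifiers) != 1 or not identifiers[0]:
        errors.append("dc:identifier must occur exactly once and be non-empty")
    for name, count in sorted(counts.items()):
        if count > 1 and name not in REPEATABLE and name != "dc:identifier":
            errors.append(f"{name} occurs {count} times, at most 1 allowed")
    return errors
```

An empty list means the record respects the cardinalities; any other result lists the offending elements.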

Examples
Some examples of DClite4G instances/records: (legend: 'literal' is a constant, // is a comment) (http://tinyurl.com/eaaaj)

A web mapping service instance example derived from GetCapabilities: Mapping OGC:WxS GetCapabilities to a service instance. Can also be called a 'data access point' (Note GetCapabilities finally needs to be delivered by some service owner!):

 dc:identifier     = identifier of the metadata record (possibly a machine-readable URI)
 dc:title          = WxS Service/Title
 dct:abstract      = WxS Service/Abstract
 dc:type           = 'service'
 dc:format         = namespace to OGC:WxS schema.xsd
 dct:spatial       = Root BoundingBox // from Capabilities XML
 dct:modified      = timestamp // e.g. from HTTP header or updateSequence
 dc:subject        = WxS Service/KeywordList
 dclite4g:onlineResource = baseURI of WxS // seems redundant to id but is the real link to the service
 dct:hasPart       = a dc:identifier which points to each Layer element
 dct:hasPart       = another dc:identifier, etc.
 dc:source         = null // N/A. Note: Not meant as an onlineResource
 dc:publisher      = WxS Service ContactInformation/Organization
 dc:language       = maybe HTTP header for lang (ISO 639 code), soon supported by WMS
 dc:rights         = WxS Service Fees/AccessConstraints
 dc:relation       = definition up to metadata provider

A dataset example derived from GetCapabilities: Mapping OGC:WxS GetCapabilities to a dataset/georesource/data access point (Note: There is no hasPart relationship in datasets):

 dc:identifier     = identifier of the metadata record (possibly a machine-readable URI)
 dc:title          = WxS Layer/Title
 dct:abstract      = WxS Layer/Abstract
 dc:type           = 'dataset'
 dc:format         = namespace to format
 dct:spatial       = BoundingBox from Layer/BoundingBox
 dct:modified      = timestamp from HTTP header maybe or updateSequence
 dc:subject        = Layer/KeywordList
 dclite4g:onlineResource = a baseURI of WxS (GetMap/GetFeature) // seems redundant to id. but is the real 'data access point'
 dclite4g:onlineResource = another baseURI of WxS // another dataset binding
 dc:source         = null // Note: Not meant as an onlineResource
 dc:publisher      = from Service if available?
 dc:language       = from Service if available?
 dc:rights         = from Service if available?
 dc:relation       = definition up to metadata provider
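
The service mapping above could be automated roughly as follows. This is only a sketch against a hand-written, WMS 1.1.1-style capabilities fragment (fictional content, example.org URL); the function name and the `record_id#LayerName` scheme used for dct:hasPart identifiers are assumptions, not something agreed for DClite4G.

```python
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Fictional, heavily shortened WMS 1.1.1 capabilities document.
CAPABILITIES = """<WMT_MS_Capabilities version="1.1.1">
  <Service>
    <Name>OGC:WMS</Name>
    <Title>Example elevation WMS</Title>
    <Abstract>Fictional service used for illustration.</Abstract>
    <KeywordList><Keyword>elevation</Keyword><Keyword>contours</Keyword></KeywordList>
    <OnlineResource xmlns:xlink="http://www.w3.org/1999/xlink"
                    xlink:href="http://example.org/wms"/>
  </Service>
  <Capability>
    <Layer>
      <Title>Root</Title>
      <LatLonBoundingBox minx="-108.44" miny="28.229" maxx="-96.223" maxy="34.353"/>
      <Layer><Name>ned_grid</Name><Title>NED grid</Title></Layer>
    </Layer>
  </Capability>
</WMT_MS_Capabilities>"""


def service_record(capabilities_xml, record_id):
    """Map a WMS capabilities document to a DClite4G-like service record."""
    root = ET.fromstring(capabilities_xml)
    service = root.find("Service")
    top_layer = root.find("Capability/Layer")
    box = top_layer.find("LatLonBoundingBox")
    return {
        "dc:identifier": record_id,
        "dc:title": service.findtext("Title", default=""),
        "dct:abstract": service.findtext("Abstract", default=""),
        "dc:type": "service",
        # Kept as (minx, miny, maxx, maxy); the final encoding is still open.
        "dct:spatial": tuple(float(box.get(k)) for k in ("minx", "miny", "maxx", "maxy")),
        "dc:subject": ", ".join(k.text for k in service.findall("KeywordList/Keyword")),
        "dclite4g:onlineResource": service.find("OnlineResource").get(XLINK_HREF),
        # One hasPart entry per named child layer (identifier scheme assumed).
        "dct:hasPart": [f"{record_id}#{n.text}" for n in top_layer.findall("Layer/Name")],
    }


record = service_record(CAPABILITIES, "urn:example:wms")
```

A dataset record per Layer element could be derived in the same way, taking title, abstract and bounding box from the Layer instead of the Service element.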

A dataset example delivered by a dataset owner:

 dc:identifier     = a data access point metadata defined by dataset owner's context
 dc:title          = entered by dataset owner
 dct:abstract      = entered by dataset owner
 dc:type           = 'dataset'
 dc:format         = namespace to format entered by dataset owner
 dct:spatial       = BoundingBox from data warehouse
 dct:modified      = file timestamp from data warehouse
 dc:subject        = keyword list entered by dataset owner
 dclite4g:onlineResource = a http URI entered by dataset owner // dataset binding
 dclite4g:onlineResource = a baseURI of a WxS entered by dataset owner // another binding
 dc:source         = entered by dataset owner // Note: Not online resource
 dc:publisher      = entered by dataset owner
 dc:language       = entered by dataset owner (or derived?)
 dc:rights         = entered by dataset owner
 dc:relation       = definition up to metadata provider

A dataset example with OAI-PMH XML encoding in DClite4G format:

Notes:
 * Example values are only for explanation purposes and purely fictive.
 * XML Schema (= geometadc.xsd? or dclite4g.xsd?) still tbd.
 * This record is not yet validated!
 * Took 'dclite4g' as envelope name.



 <dclite4g:qualifieddc>
 <dc:identifier>www.osgeo.org/geodata/:f264-77d2-09ce-aa39-f0f0</dc:identifier>
 <dc:title>National Elevation Mapping Service for Texas</dc:title>
 <dct:abstract>Elevation data collected for the National Elevation Dataset (NED).</dct:abstract>
 <dc:type>dataset</dc:type>
 <dc:format>...uri to the schema of the information model (xsd, realxng, schematron, ili, ...)</dc:format>
 <dct:spatial>34.353 -96.223 28.229 -108.44</dct:spatial>
 <dct:modified>2004-03-01</dct:modified>
 <dc:subject>Elevation, Hypsography, and Contours</dc:subject>
 <dclite4g:onlineResource>uri:http://www.osgeo.org/geodata/ned_grid_georss.xml</dclite4g:onlineResource>
 <dclite4g:onlineResource>uri:http://www.osgeo.org/services/wms/</dclite4g:onlineResource>
 <dclite4g:onlineResource>uri:http://www.osgeo.org/geodata/ned_grid.shp</dclite4g:onlineResource>
 <dc:source>Lineage: Based on 30m horizontal and 15m vertical accuracy.</dc:source>
 <dc:publisher>U.S. Geological Survey</dc:publisher>
 <dc:language>en</dc:language>
 <dc:rights>uri:http://www.usgs.gov/pubprod/</dc:rights>
 </dclite4g:qualifieddc>

Same record example as before but with unqualified DC encoding. Note that this unqualified DC record can be seen as a mapping from DClite4G that uses its well-defined semantics and content 'encoding'. See the explanations above to understand the semantics of these DC elements:

 ...
 <dc:identifier>www.osgeo.org/geodata/:f264-77d2-09ce-aa39-f0f0</dc:identifier>
 <dc:title>National Elevation Mapping Service for Texas</dc:title>
 <dc:description>Elevation data collected for the National Elevation Dataset (NED).</dc:description>
 <dc:type>dataset</dc:type>
 <dc:format>...uri to the schema of the information model (xsd, realxng, schematron, ili, ...)</dc:format>
 <dc:coverage>34.353 -96.223 28.229 -108.44</dc:coverage>
 <dc:date>2004-03-01</dc:date>
 <dc:subject>Elevation, Hypsography, and Contours</dc:subject>
 <dc:relation>uri:http://www.osgeo.org/services/wms/</dc:relation>
 <dc:source>Lineage: Based on 30m horizontal and 15m vertical accuracy.</dc:source>
 ...
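
The mapping from the qualified record to unqualified DC shown in these two examples (dct:abstract to dc:description, dct:spatial to dc:coverage, dct:modified to dc:date, onlineResource to dc:relation) could be sketched like this. The function name and the pair-list input are assumptions; unlike the hand-written example, this sketch turns every onlineResource into its own dc:relation.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Qualified or DClite4G-specific names and their unqualified DC counterparts.
TO_UNQUALIFIED = {
    "dct:abstract": "description",
    "dct:spatial": "coverage",
    "dct:modified": "date",
    "dclite4g:onlineResource": "relation",
}


def to_unqualified_dc(pairs):
    """pairs: list of (qualified_name, text). Returns a list of dc Elements."""
    ET.register_namespace("dc", DC_NS)
    elements = []
    for name, text in pairs:
        if name in TO_UNQUALIFIED:
            local = TO_UNQUALIFIED[name]
        elif name.startswith("dc:"):
            local = name[3:]  # plain DC elements are copied unchanged
        else:
            continue  # no unqualified counterpart assumed here; dropped
        element = ET.Element(f"{{{DC_NS}}}{local}")
        element.text = text
        elements.append(element)
    return elements


record = [
    ("dc:identifier", "www.osgeo.org/geodata/:f264-77d2-09ce-aa39-f0f0"),
    ("dct:abstract", "Elevation data collected for the National Elevation Dataset (NED)."),
    ("dct:modified", "2004-03-01"),
    ("dclite4g:onlineResource", "uri:http://www.osgeo.org/services/wms/"),
]
for element in to_unqualified_dc(record):
    print(ET.tostring(element, encoding="unicode"))
```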

Other Relevant Info

 * Simple_Catalog_Interface
 * OSGeodata on GISpunkt Wiki - These pages are about the search for an open, lean and mean "protocol for the incremental exchange of metadata about geographic resources between systems". Profiled specifications like WFS or OAI-PMH are currently on our short list. Delving into the 'Open Archives Initiative Protocol for Metadata Harvesting' (OAI-PMH) is strongly encouraged. It is a low-barrier interoperability specification based on a metadata harvesting model; it is stable (subsequent revisions are backwards compatible) and uses unqualified Dublin Core as its default metadata information model. Open source tools exist (like OAICat) and it has been adopted by Google and Yahoo!, among others, but it is not a search protocol.
 * See here a comparison between CSW, WFS and OAI-PMH.

Guidelines for a minimal OAI-PMH implementation
OAI-PMH means Open Archives Initiative Protocol for Metadata Harvesting. For an introduction to OAI-PMH 2 see here.

This is a draft guideline for a minimal OAI-PMH implementation for geospatial resources, consisting of the following five steps:


 * 1. Join the community: Register to the 'OAI Implementers list' at the OAI Homepage
 * 2. Read the spec.: The Harvesting Protocol (version 2.0) specification together with Implementation Guidelines
 * 3. Look at existing tools.
 * 4. Implement the protocol with the DC model following these recommendations and test it with the official OAI Repository Explorer.
 * 5. If needed, extend it later with specific metadata models like ISO19139 or DClite4G and their respective XML formats.

The following are more specific guidelines for a minimal OAI-PMH implementation of a so-called 'data provider' using only the mandatory 'unqualified' Dublin Core (DC):


 * Only three operations (verbs) are needed: Identify, ListMetadataFormats and ListRecords.
 * Following operations are not required (initially): ListIdentifiers, ListSets, GetRecord.
 * No incremental harvesting (resumption process for ListXxx operations with more than 1000 records)
 * No compression as defined in the OAI-PMH spec. (compression at lower http level still possible)
 * Date granularity may be 'day' not seconds (YYYY-MM-DD)
 * Keeping track of deleted records need not be supported (deletedRecord=no)
 * Mandatory DC is sufficient as a data model for a start, but with specific semantics (e.g. for coverage and relation; see also the unqualified DC example):
 * dc:description contains dct:abstract
 * dc:coverage contains the bounding box encoding as defined in http://georss.org/simple.html#Box
 * dc:date in fact means dct:modified
 * dc:relation is filled in with dclite4g:onlineResource. If dc:type='service', dct:hasPart can be derived from GetCapabilities.
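
From the harvester's side, talking to such a minimal data provider could look like the sketch below. The verb ListRecords, the metadataPrefix oai_dc and the XML namespaces come from OAI-PMH 2.0 and Dublin Core; the base URL, the function names and the sample response are fictional, and the optional 'from' argument is only shown to illustrate day granularity.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, from_date=None):
    """Build a ListRecords request for unqualified Dublin Core."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    if from_date:
        params["from"] = from_date  # day granularity: YYYY-MM-DD
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Extract header and a few DC elements from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.findall("oai:ListRecords/oai:record", NS):
        header = rec.find("oai:header", NS)
        dc = rec.find("oai:metadata/oai_dc:dc", NS)
        records.append({
            "identifier": header.findtext("oai:identifier", namespaces=NS),
            "datestamp": header.findtext("oai:datestamp", namespaces=NS),
            "title": dc.findtext("dc:title", namespaces=NS),
            "coverage": dc.findtext("dc:coverage", namespaces=NS),
            "relations": [e.text for e in dc.findall("dc:relation", NS)],
        })
    return records


# Fictional response, shortened to a single record.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2006-10-02T09:25:00Z</responseDate>
  <request verb="ListRecords" metadataPrefix="oai_dc">http://example.org/oai</request>
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:f264-77d2</identifier>
        <datestamp>2004-03-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>National Elevation Mapping Service for Texas</dc:title>
          <dc:coverage>34.353 -96.223 28.229 -108.44</dc:coverage>
          <dc:relation>http://example.org/wms</dc:relation>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

harvested = parse_list_records(SAMPLE)
```

The header fields 'identifier' and 'datestamp' are the two attributes of the OAI-PMH envelope; everything inside the metadata element is plain unqualified DC with the semantics listed above.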

OAI implementations
OAI with Dublin Core:
 * 'Geo-Metadatabase' (GMDB), open source (GPL, PHP)
 * GeoShop, infoGrips GmbH, Zurich, Switzerland (C, Java)
 * GeoNetwork, open source (GPL, Java)

= References =

Geospatial

 * FGDC geospatial metadata model
 * GEON geospatial metadata model
 * DIF geospatial metadata model
 * GeoRSS
 * WFS
 * GeoAPI contains an implementation of ISO 19115

RDF

 * Resource Description Framework
 * RDF Primer
 * OWL Web Ontology Language Guide
 * Semantic Web for Earth and Environmental Terminology OWL Ontologies at NASA
 * Dublin Core metadata model for documents
 * FOAF metadata model for people and organisations
 * DOAP metadata model for open source software projects and code repositories

From Geodata Packaging Working Group:

 * Specifications of a data set
 * Creator
 * Date
 * License
 * Data Type
 * Topic
 * Spatial Extent
 * Coordinate System/Projection
 * Target Scale/Precision
 * Attribute Data