Training Material for UN Open GIS OpenData
The following educational material has been drafted within the framework of the OSGeo UN Committee Educational Challenge - Open Geospatial Data and software for UN sustainable development goals. The overarching goal is to show that at this time, the combination of open (geo)data globally available and the significant developments of the free and open source solutions for geospatial is sufficient to initiate geospatial analysis, at worldwide level, at small and intermediate scales, to better understand our ecosystem. In that respect, we have employed OSGeo software solutions to process global open geospatial datasets to answer one selected indicator for a sustainable development goal. The selected indicator is 9.1.1 Proportion of the rural population who live within 2 km of an all-season road (C0901010) which supports the target of developing quality, reliable, sustainable and resilient infrastructure, including regional and transborder infrastructure, to support economic development and human well-being, with a focus on affordable and equitable access for all. The indicator has been chosen after a close analysis of all SGDs and the corresponding indicators as to comply with the following:
- to have a spatial dimension;
- to not be an indicator that is already addressed through another initiative, such as the GEO Wetlands Initiative, WHO Interactive Air Pollution Maps, GEO AquaWatch, ESA CoastColour etc.;
- if possible, to not be yet the subject of a published methodology.
We have prepared this educational material for researchers, educators and professionals in local, regional, national or international agencies with minimal to intermediate geospatial information knowledge. We assume our audience has already basic knowledge of geospatial data structures, formats and that they have already used a GIS software, as to have basic skills and understanding of how to work with geospatial and tabular data. In that respect, we have limited the interactions with the command line, however we have inserted references to it.
For the calculation of the SDG indicator, we have only used QGIS 3.4. We have also taken into consideration that changes that might occur from one version to another and thus focused on the functions used more than on a step-by-step guide.
After going through the entire educational material, one will be able to:
- Have a broader view on what are the types of geospatial data open at global scale, as well as what are their limitations.
- Have a more deeper understanding of working with geospatial data using a dedicated software
- Consistent knowledge of QGIS software fundamentals
- Learn how to create cartographic representations of the obtained results
Open geospatial information and its role in answering UN Sustainable Development Goals
The idea of this training material is based on the fact that in the last two decades significant amounts of geospatial data have become freely available and accessible. This situation translated into a multitude of worldwide initiatives, as well as local web services and projects on an array of topics and, as an cyclical cause-effect relationship, this lead to a open data movement that spread from research data to public data, to community driven data. The following table presents a few freely available geospatial datasets that have been considered, with respect to their applicability for a geospatial SDG indicator analysis. Please be aware, this table is far from exhaustive, but more a live one. We invite anyone to add other available geospatial datasets.
|No.||Topic||Name collection/dataset||Abstract||indicators||Producer/collector||Owner||License||Type of data||Format||Scale/spatial resolution||Edition||CRS||Geographic coverage||URL|
|1||Water||World Hydrogeological Map||The objectives of the World-wide Hydrogeological Mapping and Assessment Programme (WHYMAP) are to summarise groundwater information on global scale, show groundwater data on maps and map applications.||groundwater resources, river and groundwater basins, karst aquifer||UNESCO&all||BGR and UNESCO own the copyright on the data and maps provided here.||The maps may be reproduced without further permission from the copyright owner provided that an acknowledgement to BGR & UNESCO is included in presentations and publications.||vector||Esri Geodatabase, SHP||Example||2008||Geographic WGS84||Example||Example|
|2||Water||Global Dataset of River Widths||Global Dataset of River Widths Developed from Landsat Imagery||river widths||Allen, George H., & Pavelsky, Tamlin M.||n/a||CC BY 4.0||vector, raster||ESRI shapefile, TIFF||30 m||2018||Geographic WGS84||Global||https://zenodo.org/record/1297434|
|3||Water||GFPLAIN250m||A global high-resolution dataset of Earth’s floodplains.||flood plains||Nardi, F. et al. Figshare||n/a||CC BY 4.0||raster||TIF||250 m||2018||Geographic WGS84||Global||https://www.nature.com/articles/sdata2018309#data-records|
|4||Urban Development||Global Human Settlement||Multitemporal information layer on built-up presence as derived from Landsat image collections (GLS1975, GLS1990, GLS2000, and ad-hoc Landsat 8 collection 2013/2014).||build-up spatial distribution, population||EU- Joint Research Center||EU- Joint Research Center||EU Free and Open Data policy||raster||TIF||38 m||1975, 1990, 2000, 2014/ 2015||Spherical Mercator (EPSG:3857)||Global||https://ghsl.jrc.ec.europa.eu/datasets.php|
|5||Climate Change||FOODSEC Meteodata||FOODSEC receives daily, 10-daily and monthly outputs of the ECMWF (European Centre for Medium-Range Weather Forecast) global circulation models. All the data is aggregated for 10-day periods.||Rainfall estimation, solar radiation, evapo-transpiration and temperature (min, max, avg)||MeteoConsult||European Union, 2011-2014. EC-JRC-MARS data created by MeteoConsult based on ECWMF (European Centre for Medium Range Weather Forecasts) model outputs||n/a||raster||SPIRITS format||0.25 X 0.25 degrees||interim: 1989 - 2012 / operational: 2008 - present||Geographic WGS84||Global||http://spirits.jrc.ec.europa.eu/download/downloaddata/downloadmeteodata/|
|6||Climate change||CHIRPS||Climate Hazards Group InfraRed Precipitation with Station data (CHIRPS) is a 30+ year quasi-global rainfall dataset. Spanning 50°S-50°N (and all longitudes), starting in 1981 to near-present, CHIRPS incorporates 0.05° resolution satellite imagery with in-situ station data to create gridded rainfall time series for trend analysis and seasonal drought monitoring. As of February 12th, 2015, version 2.0 of CHIRPS is complete and available to the public.||rainfall||USGS||USGS (?)||CC0||raster||TIF; BIL, NetCDF||0.05 x 0.05 degree||1981 - near-present||Geographic WGS84||global||http://chg.geog.ucsb.edu/data/chirps/|
|7||Urban Development||GUF||GUF is a global raster map of the world’s settlement patterns built using the radar (SAR) satellite imagery of the two German satellites TerraSAR X and TanDEM X. A dataset of about 180,000 very high-resolution SAR images with about 3 m ground resolution was processed for this dataset.||human settlements||DLR||DLR||=HYPERLINK("https://www.dlr.de/eoc/en/Portaldata/60/Resources/dokumente/guf/DLR-GUF_LicenseAgreement-and-OrderForm.pdf","open for scientific and non-comercial scopes")||raster||GeoTIFF||12, 84 m||2016||Geographic WGS84||global||https://www.dlr.de/eoc/en/desktopdefault.aspx/tabid-11725/20508_read-47944/|
|8||Health, Nutrition and Population||Gridded Population of the World||The Gridded Population of the World (GPW) collection, now in its fourth version (GPWv4), models the distribution of human population (counts and densities) on a continuous global raster surface. The essential inputs to GPW have been population census tables and corresponding geographic boundaries. The purpose of GPW is to provide a spatially disaggregated population layer that is compatible with data sets from social, economic, and Earth science disciplines, and remote sensing.||spatially distribuited population numbers||Earth Insititute, Columbia University||Columbia University||CC BY 4.0||raster||GeoTIFF, ASCII||30 arc-seconds (app 1 km)||2000, 2005, 2010, 2015, 2020||Geographic WGS84||global||http://sedac.ciesin.columbia.edu/data/collection/gpw-v4|
|9||Health, Nutrition and Population||WorldPop||High resolution, contemporary data on human population distributions and their compositions are a prerequisite for the accurate measurement of the impacts of population growth, for monitoring changes and for planning interventions. The WorldPop project was initiated in 2013 to unite the continent-focussed AfriPop, AsiaPop and AmeriPop projects, with an aim of producing detailed and freely-available population distribution and composition maps for the whole of Central and South America, Africa and Asia.||population density, births, pregnancies, urban change, development and health indicators, age structures, depedencies rations, internal migration flows, global flight data||GeoData Institute, Univ. of Southampton||GeoData Institute, Univ. of Southampton||CC BY 4.0||raster||GeoTIFF||100 m||various, based on country||Geographic WGS84||Africa, Asia Central+South America||http://www.worldpop.org.uk/data/get_data/|
|10||Water||Global Water Surface||The datasets that are available are intended to show different facets of the spatial and temporal distribution of surface water over the last 32 years. Some of those datasets are intended to be mapped (e.g. the seasonality layer) and some are intended to show the temporal change at specific locations (i.e. the water history).||surface water occurrence, occurance change intensity, seasonality, recurrence, transition, maximum water extent, montlhy recurrence, monthly water history,||EU- Joint Research Center||EU- Joint Research Center||Copernicus open data policy||raster||GeoTIFF||30 m||1984 - 2015, 1984-1999 and 2000-2015, 2014-2015, 1984-2015, 2014-2015, 1984-2015||Geographic WGS84||global||https://global-surface-water.appspot.com/download|
|11||Environment and Natural Resources||Global Land Survey - GLS||The Global Land Survey collection consists of images acquired from 1972 to 2012 combined into one dataset. All Global Land Survey datasets contain the standard Landsat bands designated for each sensor.||Example||NASA&USGS||NASA&USGS||open data (?!)||raster||GeoTIFF||30 m, 60 m - MSS||1975, 1990, 2000, 2005, 2010||UTM||global||https://landsat.usgs.gov/global-land-surveys-gls|
|12||Environment and Natural Resources||Climate Change Initative - Land Cover||As part of the ESA Climate Change Inititiave (CCI), the Land_Cover_cci project is concerned with the generation of the land cover essential climate variable. Land cover is defined as the physical material at the surface of the earth. Land covers include grass, asphalt, trees, bare groung, water, etc.||land cover, NDVI, Snow, Burned Areasa, Water Bodies||European Space Agency||European Space Agency||Copernicus Open Data policy||raster||GeoTIFF, NetCDF4||300m spatial resolution for three 5-year epochs centred on the years 2010 (2008-2012), 2005 (2003-2007) and 2000 (1998-2002)||2010 (2008-2012), 2005 (2003-2007) and 2000 (1998-2002)||Geographic WGS84||global||https://www.esa-landcover-cci.org/?q=node/164|
|13||Agriculture and Food Security||The World Bank - World Development Indicators||The primary World Bank collection of development indicators, compiled from officially-recognized international sources. It presents the most current and accurate global development data available, and includes national, regional and global estimates.||Agriculture and Rural Development||World Bank||World Bank||open data||tabular||csv, xls||n/a||2005-2018||n/a||global||https://data.worldbank.org/indicator|
|14||Agriculture and Food Security||OECD-FAO Agricultural Outlook (Edition 2017)||The OECD databases on agriculture constitute a unique collection of agricultural statistics and provide a framework for quantifying and analysing the agricultural economy. This includes forecasts regarding the evolution of the main agricultural markets and commodities, detailed estimates of policy support, as well as indicators of environmental performance of agriculture. Data concern both OECD countries and non-member economies.||Agricultural output, agricultural policy, fisheries, sustainable agriculture||Organisation for Economic Co-operation and Development||Organisation for Economic Co-operation and Development||open data||tabular||xls, csv, xml||n/a||2016 - 2026||n/a||cvasi-global||https://data.oecd.org/agriculture.htm|
|15||Water||HydroSHEDS||HydroSHEDS (Hydrological data and maps based on SHuttle Elevation Derivatives at multiple Scales) provides hydrographic information in a consistent and comprehensive format for regional and global-scale applications.||stream networks, watershed boundaries, drainage directions, and ancillary data layers such as flow accumulations, distances, and river topology information.||U.S. Geological Survey (USGS); the International Centre for Tropical Agriculture (CIAT); The Nature Conservancy (TNC); McGill University, Montreal, Canada; the Australian National University, Canberra, Australia; and the Center for Environmental Systems Research (CESR), University of Kassel, Germany.||U.S. Geological Survey (USGS); the International Centre for Tropical Agriculture (CIAT); The Nature Conservancy (TNC); McGill University, Montreal, Canada; the Australian National University, Canberra, Australia; and the Center for Environmental Systems Research (CESR), University of Kassel, Germany.||open data||vector, raster||shp, ESRI GRID, ESRI BIL||3s (app90m), 15s (app 500m), 30s (app 1km), 5m (app 10 km)||depending on the region, 2006, 2007, 2008 or 2009||Geographic WGS84||global||https://hydrosheds.cr.usgs.gov/dataavail.php|
|16||Urban Development||Global Administrative Unit Layers (GAUL)||The GAUL compiles and disseminates the best available information on administrative units for all the countries in the world, providing a contribution to the standardization of the spatial dataset representing administrative units. Because GAUL works at global level, unsettled territories are reported. GAUL is released once a year and the target beneficiary of GAUL data is the UN community and other authorized international and national partners. Data might not be officially validated by authoritative national sources and cannot be distributed to the general public. A disclaimer should always accompany any use of GAUL data.||global layers with a unified coding system at country, first (e.g. departments) and second administrative levels (e.g. districts). Where data is available, it provides layers on a country by country basis down to third, fourth and lowers levels.||FAO-UN||FAO-UN||non-open data||vector, raster||n/a||n/a||1990 - 2015||Geographic WGS84||global||http://www.fao.org/geonetwork/srv/en/metadata.show?id=12691|
|17||Urban Development||Database of Global Administrative Areas||GADM wants to map the administrative areas of all countries, at all levels of sub-division. A high spatial resolution and an extensive set of attributes are employed.||administrative units||University of California, Davis.||n/a||non-commercial use||vector||Geopackage, shpafile, KMZ||n/a||2018||Geographic WGS84||global||https://gadm.org/data.html|
Population dataset description
Global Administrative Units Dataset description
For road related data, we have decided to use OpenStreetMap data as it is the only homogeneously designed globally available dataset. Without doubt, the amount and the quality of the available data for various regions around the world can vary consistently. However, given the clear and consistent definition of each map element and tag, this exercise should be reproducible in any other part of the world.
Yet, given our area of interest, the Tabora county from Tanzania, we must take into consideration specific developments for Africa, more precisely, the Highway Tag Africa - Topology of Road Network in African countries, and furthermore, the East Africa Tagging Guidelines.
However, with consideration to the global replicability of our educational material, we will also insert specifications on a more general scale. Of course, it must be acknowledged that the workflow presented here could require other adjustments with respect to the specificity of the road dataset used in calculation.
Preparing the geospatial data
For the scope of this exercise we have chosen the Tabora county of Tanzania. As we strive to create an educational material that can be applied no matter the region of interest, a decision was made to use the available datasets, on a global level.
The following table presents the datasets used:
|Topic||Name collection/dataset||Abstract||Indicators||Produce/collector||Owner||License||Type of data||Format||Scale/spatial resolution||Edition||CRS||Other URL|
|Administrative units||Database of Global Administrative Areas||GADM provides maps and spatial data for all countries and their sub-divisions.||administrative units||University of California, Berkeley,Museum of Vertebrate Zoology, and theInternational Rice Research Institute (Global Administrative Areas 2009)||GDAM||The data are freely available for academic use and other non-commercial use. Redistribution, or commercial use is not allowed without prior permission.||vector||Geopackage, shapefile, geodatabase. KMZ, R formats||n/a||April 2018||Geographic WGS84||https://gadm.org/metadata.html|
|World Population||WorldPop||Alpha version 2010 and 2015 estimates of numbers of people per grid square, with national totals adjusted to match UN population division estimates (http://esa.un.org/wpp/) and remaining unadjusted.||Settelments, Population numbers, birth and pregnancy, age structures, poverty spatial distribution etc.||GeoData Institute, University of Southampton||GeoData Institute, University of Southampton||CC BY 4.0||raster||GeoTIFF||100 m||July 2013||Geographic WGS84||http://www.worldpop.org.uk/data/methods/|
|World Population||Global Rural-Urban Mapping project (GRUMP), v1||To provide a polygon representation of urban areas with city or agglomeration name and time series population estimates.||urban geometries||Socioeconomic Data and Applications Center (sedac)||Socioeconomic Data and Applications Center (sedac)||CC BY 4.0||vector||shapefile||30 arc-second||2006||Geographic WGS84||n/a|
|World Population||Global Human Built-up And Settlement Extent (HBASE) Dataset From Landsat, v1 (2010)||To provide high spatial resolution estimates of global urban extent derived from global 30m Landsat satellite data for the target year 2010 and a companion dataset to the Global Man-made Impervious Surface (GMIS) dataset.||urban extent||Socioeconomic Data and Applications Center (sedac)||Socioeconomic Data and Applications Center (sedac)||CC BY 4.0||raster||GeoTiff||30 m||2017||Geographic WGS84, UTM||n/a|
|OpenStreetMap||OpenStreetMap)||OpenStreetMap is built by a community of mappers that contribute and maintain data about roads, trails, cafés, railway stations, and much more, all over the world.||road network, road condition||OpenStreetMap contributors||OpenStreetMap Foundation (OSMF)||Open Data Commons Open Database License (ODbL)||vector||osm||n/a||September 2018||Pseudo-Mercator, EPSG 3857||https://wiki.openstreetmap.org/wiki/Main_Page|
Add all listed data to your project. Create group layers as you bring in the data, so it is easier when you start processing to navigate through all datasets. Create and save your project, so you can pick up the work from where you left it.
/ [Layer]-->[Add layer] - will open the Data Source Manager that allows you to load the data.: the administrative unites (GDAM dataset), the population numbers (WorldPop - we will use the TZA_popmap15adj_v2b.tif file), the urban extent (we will use the global_urban_extent_polygons_v1.01.shp file).
As mentioned, we will use OpenStreetMap data for the roads geometry and condition. Bringing OSM data will require you install a new plugin - OSM Downloader.
[Plugins]-->[Manage and install plugins].
The datasets are in various projections, either the Geographic projection EPSG 4326 or the Pseudo_Mercator EPSG 3857. A geographic coordinate system is based on a spheroid and uses angular units (degrees). Thus, when using QGIS calculator, for example, it returns values in decimal degree and not meters. You can see the used units in a projection's description that can be retrieved from epsg.io. 
As we will work with road geometries, we must reproject all the datasets in a projected coordinate system, which is based on a 2D plane (with the spheroid projected on a 2D plane) and uses linear units, such as meters. For our study, we identify a suitable CRS  for our region of interest, the Tabora county in Tanzania. To do that we will use epsg.io. After a quick search, we find WGS 84 / UTM zone 36S-EPSG: 32736 to be the appropriate for our region.
To reproject vector data using QGIS, we have to save the file with the desired projection.
Click on the vector layer you want to reproject and choose [Export]-->[Save features as..]
For raster datasets, we will use gdalwrap that is available as a processing tool in the Processing toolbox.
[Processing]-->[Toolbox] We can search by typing the keyword 'reproject' in the search bar.
Then, we will cut all layers by the boundary of the selected county, Tabora.
Firstly, export from the administrative units level 1, Tabora county. Secondly, clip all layers by its geometry.
Step 3 produces the rural areas of the Tabora county. According to Wikipedia, Tanzania is divided into regions (GDAM administrative level 1), districts (GDAM administrative level 2) and wards, (GDAM administrative level 3).
We will calculate the Rural Access Index on wards, thus we will extract the rural regions from the administrative units level 3. The resulting dataset will be vector type. [Vector]-->[Geoprocessing tools]-->[Difference]
Step 4 prepares the dataset from which we will extract the population number for all rural areas of each administrative unit. We will use zonal statistics, a tool now available in the Processing Toolbox.
Bring in the roads!
Step 5 is the most time consuming processing stage and, more over, it may vary when this exercise will be applied to other regions in the world. For the roads geometry and condition we will use the OpenStreetMap data available.
As we will see, Tanzania is very well represented on the OSM map, even if we will encounter various situations mainly regarding quality of network connectivity, an important aspect for our study. That is because in 2015, the Crowd2Map Tanzania was launched and during the following years, there have been significant crowd mapping campaigns.
After step 1 and 2, the road dataset should be imported into QGIS, clipped, reprojected in EPSG 32736 and saved as a geopackage file.
Next, we will do a preliminary cleaning, by eliminating all roads segments that are not suitable for cars. The SDG indicator 9.1.1 Proportion of the rural population who live within 2 km of an all-season road refers to roads that are suitable for any kind of vehicle (average modern automobile), thus we will filtrate by: "highway" = 'cycleway' or "highway" = 'pedestrian' or "highway" = 'path' or "highway" = 'footway' or "highway" = ‘residential’ or "highway" = ‘service’. We have also deleted roads under construction, because that means that the roads can not be used for access.
OpenStreetMap defines map features by tags. Each tag has 2 elements: the key and the value. The key is used to describe a topic, category, or type of feature and the value describes the specific form of the key-specified feature. In our filter above, the tag "highway" = 'cycleway' is formed by the key "highway" and the value 'cycleway', which means that the map feature is a road for bicycles. More details on how OpenStreetMap is build can be found in the OSM wiki.
There is a number of tags considered in OSM road data that gives important indication related to the road condition. These are:
- Key:smoothness with 8 possible values: excellent, good, intermediate, bad, very_bad, horrible, very_horrible, impassable. According to the OSM community, in developing countries, the HDM model is using some values of smoothness to help define road quality. The HDM stands for Highway Development and Management (HDM-4) and it represents the World Bank tool for the management of roads network, particularly in developing countries.
- Key: surface
- Tag: tracktype with a gradual variation from grade1 - solid to grade5 - soft.
Considering the East Africa Tagging Guidelines, there is an emphasis in using surface=* with the generic separation of surface=paved for sealed roads and surface=unpaved for the others. Analyzing the available data for the area of interest, the Tabora county of Tanzania, we can identify that for a number of road segments, we have road quality particular tags identified. We can see that in the other_tags column. Looking at the structure of the attributes registered in this column, we observe a number of things:
- The tags (key+value) are separated by comma;
- The keys are separated by their values through "=>" ;
- Tags are not necessarily in the same order;
- Not all tags that are relevant for our exercise are registered.
To simplify the processing steps, we will break the other_tags column into one column per tag. We will do this using an external solution, LibreOffice. The file containing the attributes for each segment is the .dbf file (dBase file) which can be opened and manipulated by tabular data solutions as well. It is advisable though that any change on the file to be done with caution, as it can corrupt the file, and thus make it unusable.
-->open .dbf file using LibreOffice --> Text to column by separator comma
Next, we create one column for each road condition key available: surface, smoothness and tracktype with the values specific for each road segment. Afterwards, we populate the three new columns (surface, smoothness and tracktype) with the specific value for each road segment. For this, we will use the Field Calculator of QGIS introducing the following expression:
CASE WHEN "other_tags" LIKE '%surface%' THEN regexp_replace ("other_tags", 'surface"=>"', ) ELSE surface END
Other_tags - column from where we take the attribute value, and will be subsequently replaced by columns Tag1, Tag2 and so on for each of the three columns.
Surface - is the key name we are looking for in the attribute list.
After we have populated the three new columns (surface, smoothness and tracktype), we can extract information that will allow us to better understand the roads dataset we will use to calculate the RAI. We will make use of the statistics functionality of QGIS to show the total length of roads in our area of interest, but also various lengths, as for example the lengths of paved roads, or bad roads lengths and so on.
To calculate the roads length, we open the attribute table of the dataset, we create a new column and we write $length.
The next step leading to the RAI calculation is to determine the road condition that will be taken into account. This represents a difficult assessment to make, mainly due to lack of harmonization between OpenStreetMap defined tags and official internationally used methodologies for road condition calculations. For the purpose of our exercise, we will consider the proposed definition of what a “road in good condition” given in the World Bank 2016 document Measuring Rural Access: Using New Technologies (Transport & ICT. WB, 2016.):
"A road in good condition refers to:
* Paved road with IRI less than 6 meters/km and unpaved road with IRI less than 13 meters/km, when IRI data are available * Paved road in excellent, good, or fair condition and unpaved road in excellent or good condition, when IRI data are not available but other road condition data, such as the PCI or visual assessment by class value, are available.
In our case, there is no available IRI data available, thus our data falls in the second category, where we have a qualitative description: Paved road in excellent, good, or fair condition and unpaved road in excellent or good condition. In order to work with available OSM road data, we will chart a compatibility map between OSM tags that describe road condition and the given classification.
Analysing our data with GroupStats, we discovery the following situation:
GroupStats is a plugin for easy calculation of statistics for features in a vector layer. The plugin has a control panel that allows the user to make calculations (as in our example, length) for specific group features, as can be seen in attached file 1.
A look at the complete dataset tags shows that we can find any kind of combination of the tags we are interested in. We can observe that, for our test dataset, we have 100% the HIGHWAY tag completed for all roads segments.
Continuing the analysis, with consideration to the definitions given by the OSM when assigning the specific values, we will proceed in mapping the correspondence. And so, giving the tags descriptions and the definition of a road in “good or fair condition in rural areas”, we have considered the following mappings:
Next, we will clean the roads dataset of the categories that we have excluded, regardless of the highway=* value. Using GroupStats, we measure that we will exclude from RAI calculation. The filter GroupStats must be done by "smoothness" = 'very_bad' OR "smoothness" = 'very_horrible' OR "smoothness" = 'impassable' OR "smoothness" = 'horrible' OR "tracktype"='grade4' OR "tracktype" = 'grade5' and it will find 3082 matching features, with a total length of 3359.96 KM.
The spatial distribution of the roads eliminated from our analysis is presented in the map below:
After a quick glance, we can see that a significant central rural area is served by the roads with a too poor condition to be included in the RAI calculation.
Further on, we will work with a road layer that has been cleaned of the roads that are not in good condition. To do that, select in the attribute table using the following expression ["smoothness" = 'very_bad' OR "smoothness" = 'very_horrible' OR "smoothness" = 'impassable' OR "smoothness" = 'horrible' OR "tracktype"='grade4' OR "tracktype" = 'grade5'] and then delete the selected features.
Running GroupStats, we discover the following situation for Tabora county, with respect to road condition accepted in the RAI calculations:
The next step in preparing the road dataset resides in solving the irregularities related to the road network available. With consideration to the RAI calculation, to have a correct topological road network is desirable, but not a condition. As OpenStreetMap is an open collaborative mapping project, there is a high possibility that datasets collected have all sorts of inconsistency. In the following processing steps, we are aiming at identifying and correcting these inconsistencies as to eliminate as much as possible artificial results for RAI.
One situation to resolve is when we have segments with shorter lengths that are further away from any other connected road by at least 2 km. Why would this situation yield artificial values for RAI? Because even if rural population has access to such a road in good conditions it offers no real connectivity, it gets them only so far. Before proceeding, let us analyse the current situation: identify the percentage and spatial distribution of segments shorter than 2 km. As we can see in the attribute table, the road dataset for our exercise has 21811 features. After filtering, we identify 19709 segments that are shorter than 2km. Filter the attribute table by "length_m" < 2000 and then run GroupStats to get a sense of your data.
However, by looking at the geometry, we can see that a significant part of these short segments are actually completing the longer road segments. Thus, we need to connect all these segments, in order to be able to delete the features that are shorter than 2km AND not connected to any other road feature. Thus, we need to unite all road segments that are connected. As mentioned before, in our study crucial is the road network and not the topology. To do this, we will use a QGIS plugin, MergeLines. After running MergeLines, we are left with 13590 features, out of which 11298 are shorter than 2 km.
By comparing the numbers, we can see that running the MergeLines plugin represents a significant cleaning of the road dataset, with respect to RAI calculation. Almost 3000 km were road segments shorter than 2 km and would have increased the buffer calculation time artificially.
Next, we will create the 2km buffer zone around the road network. We need to identify whether there are any buffers disconnected from the network buffer. Such a case would indicate that again, there are roads in good or fair condition, that are not connected to the main road network, thus including into the RAI calculation would offer an artificial result, as it can be seen in the map below.
Before deleting the unconnected buffers, we will check to see whether the corresponding roads are not on the border. In this case, we decide not to remove them because, they might be connected to roads outside the Tabora county.
After deleting the disconnected buffer zones, we can identify a new situation that may lead to an artificial result for RAI. As shown in fig. 18, the issue can occur when we have 2 disconnected road segments that have overlapping 2km buffers and yet, the distance that one must walk from segment A to segment B is between 2.01 km and 3.99 km. Hence, the buffer should not include this area.
Why not just delete the unconnected segments? Because there are such road segments that would be in the 2km buffer from the road network but stretch beyond the buffer, so if we were to delete all together, we would lose the buffer region of the unconnected segment. This case is exemplified in fig. 11.
To overcome this issue, we will identify all connected segments of the road and create a 2km buffer zone for the entire network. As our dataset is substantial (over 13000 features), we will split all vectors by a 50X50 grid. [Vector]-->[Research tools]-->[Vector grid] - creates vector grid with specific extent and cell size; [Vector]--> [Geoprocessing Tools] --> [Intersection] - splits the road dataset by each grid cell; [Vector]--> [Data Management Tools] --> [Split vector layer] - creates a road layer for each grid cells.
Now that we have prepared the road dataset, we will run the script (by Jochen Schwarze) that creates networks, by merging adjacent lines. The script produces a new dataset from the input data with an attribute “subnet” that contains an incremented ID for each subnetwork created.
[Vector]--> [Geoprocessing Tools] --> [Dissolve] - the roads that all connected will be dissolved into one feature - we will run the algorithm in batch processing mode. [Vector]--> [Merge Vector Layers] - we merge all resulting layers into one road dataset. [Vector]--> [Dissolve by attribute value - subnet] - thus we will have one geometry for each subnet [GroupStats] - will be run on the Dissolve_merged dataset, because we want to extract how many road segments are in each subnet, as well as total length. The GroupStats must be configured as:
- rows=subnet /value=count,length #to count all unique segments in each subnet
- rows=subnet / value=sum,length #to calculate lengths for each subnet
Using LibreOffice, we will analyse how the road network for Tabora county looks like. We are attempting to identify road segments that can lead to artificial results of the RAI indicator.
[Layer] --> [Add] --> Delimited text layer -inserting the table created in the previous step, so we can see a spatial distribution with respect to the number of segments for each subnet, as well as by length.
Select the [dissolved_by_subnet] layer --> [Properties] --> Join - here we join the attributes calculated (segments count, sum of subnet length) to the geometry
Select the layer after joining the new attributes --> [Export..] - saving a new file as the join operation is valid only within the project.
A quick analysis of the spatial distribution shows that the short road segments do not represent small disconnected parts of the road network, but rather disparate segments that do not bring a significant improvement to the road network (fig.22, fig.23).
To clean the road network in preparation for the 2km buffer calculation, we will eliminate the roads that have 1 or 2 segments, with a total length below 2 km. Now, we calculate the corresponding 2km buffer for the cleaned OpenStreetMap road network in a good condition. [Vector]--> [Geoprocessing tools] --> [Fixed distance buffer] (lengthy operation!)
Calculation of the Rural Access Index
The provisional new RAI is the share of the population who live within 2 km of an all-seasons road. As we have prepared the datasets for population number, rural and urban areas, road network and road condition, we can calculate the indicator showing the number of people included - Rural Access Index.
Next, we will continue with the final series of data processing, that have already been encountered in this material. [Vector]--> [Geoprocessing tools] --> [Clip] - we will clip the buffer zones by the administrative unit layer, as we will report on the number of rural population who live within 2 km of an all-season road by administrative unit level 3 - wards. [Raster] --> [Zonal statistics] - we extract for each 2km buffer zone of each administrative unit, the number of people living in rural areas. The zonal statistics functionality provides the user with the opportunity to extract using a vector mask (in our case the 2 km buffer area) from a raster the following statistics on the selected band:
- Count: to count the number of pixels
- Sum: to sum the pixel values
- Mean: to get the mean of pixel values
- Median: to get the median of pixel values
- StDev: to get the standard deviation of pixel values
- Min: to get the minimum of pixel values
- Max: to get the maximum of pixel values
- Range: to get the range (max - min) of pixel values
- Minority: to get the less represented pixel value
- Majority: to get the most represented pixel value
- Variety: to count the number of distinct pixel values
For our specific exercise, we will use Sum. To facilitate working with the dataset, we will round up to an integer for the number of people in each buffer zone.
[Vector] --> [Geoprocessing tools] --> [Clip] -before proceeding to RAI calculation, we will clip the buffer vector layer by the boundaries of the administrative units level 3. [Vector] --> [Geoprocessing tools] --> [Difference] - so we can extract urban areas from the clipped buffer zone. After these small processing chains, we have the 2 main layers we will use for calculation of RAI for Tabora:
- vector layer - Tabora administrative units level 3 with population number
- vector layer 2 - 2 km buffer zone for all-seasons Tabora road network, with population number for each buffer segment within each administrative unit.
The last step is to add the sum population values of the total buffer zone for each administrative unit level 3 and then divide the rural population within the buffer area by the total rural population. We do that within the [Properties] of the Administrative units level 3 vector layer [Joins] --> [join attribute values] from the 2 km buffer zone by the name of the administrative unit.
We have enclosed all steps of the processing chain detailed above in a conceptual schema of the workflow necessary to calculate the Rural Access Index using open data and open source software.
Geospatial data representations
Although, it might not always be considered as such, a significant aspect of any study, be it geospatial or not, lies in the representation of results. There are well known rules and guidelines in any type of domain and the geospatial one is no exception. Most prevalent representations come in the form of static or dynamic maps, cartodiagrams, graphs developed using various solutions, largely visualisations that are closely related to the spatial dimension of the data. The geospatial open source solutions ecosystem has reached an advanced level that accommodates the creation of highly professional maps, diagrams be they static or interactive, on paper or online. For the visualisation of the Rural Access Index results for Tabora county, we used QGIS capabilities. With the third major release of QGIS - QGIS 3.X - significant developments have been achieved, yet one crucial improvement lies in the advancements of the map composer engine. There are sufficient online and offline resources that present complete tutorials on how to use the Map Compose/Layout functionality, so we consider there is no need to duplicate already existing work. However, we are including in this material the two map templates created and used to make all representations in this material.
Known limitations of the material
The scope of this material is to show that we have entered in a time when using FOSS and open data we can initiate complex analysis allowing understanding of various facets of our environment. Works in this direction are already underway with a high focus on the knowledge we can extract from satellite imagery that have an open data policy, such as the Landsat or Copernicus programs. These works translate into global initiatives such as the Global Forest Watch or Geo Wetlands Initiative and so many more. Yet, we must not underestimate the power of what collaborative open mapping efforts can bring to our knowledge ecosystem and the OpenStreetMap project is a successful exemplification of that. Of course, one must not overlook matters related to the quality of data when it comes to the open collaborative environments, but as we have proven through this material, they provide a very good starting point that must not be neglected. As done in this study, careful analysis on understanding the datasets we are working with is crucial for eliminating as much of artificial errors of the final results as possible. Such decision is as the one we made, to eliminate roads that are shorter than 2 km and don't have more than 2 segments that would have given an artificial value of the Rural Access Index in a multitude of situations, as we presented. Nonetheless, we are aware that this decision might have lead to delete segments of roads in good condition that would be viable in the index calculation. It was a compromise that we acknowledge, after the analysing the data. Yet, it must be highlighted that this material is a first step to calculate a Sustainable Development Goal indicator, when official data is not available. Thus, in a specific case a more thorough and manual inspection of the open datasets is highly recommended.
When it comes to software, FOSS is a stable choice as it has reached a maturity level that covers all sections of a geospatial analysis, from storing, to processing to visualization, be it online or offline, dynamic and interactive or static. For our material we have used a very popular GIS open source solution - QGIS 3, but it must be highlighted that it is by far the only solution. Most, if not all steps in the processing chain we made can be accomplished using other open source solutions, as well. Our choice to use QGIS was based on the idea of using an outright, intuitive and user friendly framework.
- EPSG.io is an open-source web service with a database of coordinates systems used in maps worldwide that allows discovery of coordinate reference systems utilized all over the world for creating maps and geodata and for identifying geo-position.
- CRS stands for Coordinate Reference System