Benchmarking 2010/Constellation-SDI

This page describes the experience of the Constellation-SDI team during the FOSS4G 2010 Benchmarking effort.

Benchmark design

We expended a great deal of effort attempting to understand how a benchmarking effort could be designed properly.

The 2009 and 2010 efforts were undertaken in the naive belief that one simply sets up different servers on the same data and makes requests of the servers to compare response times. Belatedly, the 2010 effort is demonstrating that a proper benchmark is more complex: the data might be in a format useful for a certain class of use cases but meaningless for another scale of usage; the testing can easily be hardware bound, limiting useful comparison between servers; servers can be set up to do very different work, especially in 'best effort'/anything goes configurations; and results can be compared only in superficial ways or for very narrow types of requests. To tackle these issues rather than merely pretend they were not serious, we examined what it would take to develop a useful benchmarking protocol, either to stress all the functionality of one particular server or to compare the performance and abilities of various, arbitrary WMS servers.

Developing a WMS benchmarking design which provides useful, comparative metrics of server performance is exceedingly hard. At the recent Java Language Summit, Joshua Bloch gave a talk entitled Performance Anxiety which describes the impossibility of developing performant software from first principles in any language due to the enhancements of compilation and machine instruction re-ordering, the necessity of testing to obtain concrete results, and the difficulty of developing proper, statistically rigorous testing metrics.

Since these issues were apparent to us even before this effort, we have been developing tools, benchmark designs and analytic methodologies to test the Constellation-SDI server. This work has been greatly extended during the FOSS4G 2010 benchmarking effort and expanded to consider how to test different WMS servers, possibly built for different uses.

Unfortunately, there is still much distance to go before achieving a solid benchmarking suite. This work will undoubtedly be continued in the future, most likely within the framework of Open Geospatial Consortium (OGC) testing.

Enhancements

This section describes enhancements resulting from the work done during the 2010 benchmarking effort, including improved understanding and workflow within the Constellation-SDI team and improvements to the code bases of the Geotoolkit.org library and of the Constellation-SDI server itself.

Benchmarking

Investigate numerous issues with jmeter.
Design simpler scripts.
Examine different configurations to stress different aspects of the WMS server experience.

Geotoolkit

Referencing: fix inverse projection for fake spherical mercator.
Referencing: accelerate raster reprojection.

Coverage: Create a reader for GeoTiff images.

Shapefile: reduce memory usage when leveraging a quad-tree index.
Shapefile: reduce styling to one single pass when painting by symbol rather than by feature.
Shapefile: reduce the reading of non-necessary parts of the files.
Shapefile: fix handling of large (over 2GB) DBF attribute files.

DataSource: Enable startup from a coverage mosaic, either a folder or a manager.

Renderer: bypass rendering engine for single raster requests.
Renderer: improve decimation algorithm for vector layers.
Renderer: switch to OGC conformant 96dpi assumption rather than industry standard 72.
Renderer: fix SLD parsing errors to handle <ogc:Literal> or its absence.
Renderer: optimize the colour model selected for multiple inputs.

Constellation-SDI

Configuration: greatly enhance configurability of the server, with hot reload of data, styles and rendering configuration.

Server: cache ServiceMetadata document for insanely slow data sources.
Server: fix envelopes for data sources in ServiceMetadata document.

Backend: fix multi-threading bug to use classes in a thread-safe manner.

JEE output: enable direct writing of images into output stream.

GUI: build a prototype interface.

Performance Results

This section details the results of running the jmeter scripts against the Constellation-SDI WMS server.

Note: All values reported are in units of responses per second, taken from the "Throughput" column produced by the 'summarizer.py' script. The values are those of the third pass over the various thread counts, from one to sixty-four, after the jmeter scripts have looped through two warmup passes; the warmup passes are not reported.
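As a rough illustration of the metric, throughput in this sense can be computed for one pass as the number of responses divided by the time between the start of the first request and the end of the last response. The sketch below is not the 'summarizer.py' script itself; the function name and the (start, elapsed) sample layout are assumptions made for illustration.

```python
def throughput(samples):
    """Responses per second over one pass.

    `samples` is a list of (start_ms, elapsed_ms) pairs, one per response.
    This layout is an assumption for illustration, not summarizer.py's format.
    """
    if not samples:
        return 0.0
    first_start = min(start for start, _ in samples)
    last_end = max(start + elapsed for start, elapsed in samples)
    span_s = (last_end - first_start) / 1000.0
    return len(samples) / span_s if span_s > 0 else float("inf")


# Four made-up responses that complete within a two-second window.
example = [(0, 400), (500, 500), (1000, 450), (1500, 500)]
rate = throughput(example)
print(f"{rate:.1f} responses per second")
```

Because the measure depends on the whole time span of the pass, a single slow response at the end lowers the figure for every request in that pass.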

There are two sets of runs for two kinds of scripts: older scripts where all the requests are repeated in every pass and newer scripts where every request is different. It seems that the size of the data set we are using, coupled with the number of requests, just allows some servers to escape being disk bound in the older scripts, with a concomitant order of magnitude jump in performance. In the general confusion of voting, ignored vote results and whatever else, we ended up making both sets of runs.

Session 2010.09.03 (New, non-repeating scripts)

This was the first full run of Constellation-SDI using the jmeter scripts.

The runs were performed with the more recent jmeter design where all three passes use different requests, so that the server is always asking for new files from disk.
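One way to build such a non-repeating request set is to draw a different bounding box for every request. The following is only a sketch of the idea, not the script used in the benchmark: the endpoint, layer name, extent and image size are hypothetical, and only standard WMS 1.1.1 GetMap parameters are used.

```python
import random
from urllib.parse import urlencode


def getmap_urls(base_url, layer, srs, extent, count, size=(800, 600), seed=0):
    """Yield `count` WMS 1.1.1 GetMap URLs, each with a different BBOX."""
    rng = random.Random(seed)  # seeded so a request set can be regenerated
    minx, miny, maxx, maxy = extent
    width, height = size
    # Largest window width that still fits the extent at the image aspect ratio.
    max_w = min(maxx - minx, (maxy - miny) * width / height)
    seen = set()
    while len(seen) < count:
        w = max_w * rng.uniform(0.01, 0.10)
        h = w * height / width
        x = rng.uniform(minx, maxx - w)
        y = rng.uniform(miny, maxy - h)
        bbox = f"{x:.2f},{y:.2f},{x + w:.2f},{y + h:.2f}"
        if bbox in seen:  # skip the (unlikely) duplicate
            continue
        seen.add(bbox)
        params = {
            "SERVICE": "WMS",
            "VERSION": "1.1.1",
            "REQUEST": "GetMap",
            "LAYERS": layer,
            "STYLES": "",
            "SRS": srs,
            "BBOX": bbox,
            "WIDTH": width,
            "HEIGHT": height,
            "FORMAT": "image/png",
        }
        yield f"{base_url}?{urlencode(params, safe=',/:')}"


# Hypothetical endpoint, layer name and extent, for illustration only.
urls = list(getmap_urls("http://localhost:8080/wms", "example_layer",
                        "EPSG:25831", (250000, 4480000, 540000, 4760000), 5))
for url in urls:
    print(url)
```

Using a different seed for each pass gives three request sets that do not overlap in practice, which is what keeps the server reading from disk rather than from cached file blocks.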

For lack of time, a second run was only done for the two raster request sets; nonetheless, the numbers give us a ballpark estimate of variability between runs.

Raster 3rd Pass (New, non-repeating scripts)
CRS (EPSG)	25831				3857
Threads	Run 1	Run 2	Run 3	Run 4	Run 1	Run 2	Run 3	Run 4
1	4.5	5.7	---	---	5.4	5.4	---	---
2	6.3	6.1	---	---	6.0	5.7	---	---
4	5.1	4.9	---	---	4.7	4.8	---	---
8	4.8	4.6	---	---	4.6	4.5	---	---
16	4.5	4.7	---	---	4.6	4.6	---	---
32	5.2	5.0	---	---	4.8	4.8	---	---
64	4.9	4.8	---	---	4.8	4.6	---	---

Vector 3rd Pass (New, non-repeating scripts)
CRS (EPSG)	4326				3857
Threads	Run 1	Run 2	Run 3	Run 4	Run 1	Run 2	Run 3	Run 4
1	1.5	---	---	---	1.5	---	---	---
2	2.1	---	---	---	2.1	---	---	---
4	2.1	---	---	---	2.3	---	---	---
8	2.2	---	---	---	2.3	---	---	---
16	2.1	---	---	---	2.2	---	---	---
32	2.2	---	---	---	2.3	---	---	---
64	1.8	---	---	---	1.9	---	---	---

The processing power (8 CPUs) of the machine does not seem to have been stressed at any point in the test runs.

The numbers differ from the results obtained on local servers, which showed a more distinct separation between vector and raster results. Variability also seems high enough that several runs would be needed to discriminate between the various configurations; we probably need to tighten up the testing to get anything meaningful from numbers such as these.
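The run-to-run variability can be estimated from the two raster runs reported above. The sketch below uses the 25831 raster values copied from the table and expresses the difference between Run 1 and Run 2 as a fraction of their mean; the choice of this particular measure is ours for illustration. For these figures the spread is about 24% at one thread and under 5% at the other thread counts.

```python
# Raster, 25831, third pass, newer non-repeating scripts: (Run 1, Run 2)
# per thread count, copied from the table above.
runs = {
    1: (4.5, 5.7),
    2: (6.3, 6.1),
    4: (5.1, 4.9),
    8: (4.8, 4.6),
    16: (4.5, 4.7),
    32: (5.2, 5.0),
    64: (4.9, 4.8),
}


def relative_spread(a, b):
    """Absolute difference between two runs as a fraction of their mean."""
    return abs(a - b) / ((a + b) / 2)


spread = {threads: relative_spread(*pair) for threads, pair in runs.items()}
worst = max(spread, key=spread.get)

for threads, value in spread.items():
    print(f"{threads:>2} threads: {value:.1%}")
print(f"largest spread at {worst} thread(s)")
```

With only two runs this is a ballpark figure rather than a statistical estimate; more runs per configuration would be needed to compute a meaningful standard deviation.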

Session 2010.09.03 afternoon (older, repeating scripts)

This session repeats the earlier work with the older scripts, which may leave the file blocks in memory.

The first raster runs, native CRS and reprojected, were performed with a mapserver running in the background, so the test was repeated.

Raster 3rd Pass (older, repeating scripts)
CRS (EPSG)	25831				3857
Threads	Run 1	Run 2	Run 3	Run 4	Run 1	Run 2	Run 3	Run 4
1	5.5	11.6	---	---	5.9	5.9	---	---
2	4.0	11.9	---	---	6.9	6.7	---	---
4	5.3	5.5	---	---	5.4	5.5	---	---
8	5.1	5.1	---	---	5.3	5.2	---	---
16	5.3	5.3	---	---	5.3	5.1	---	---
32	5.3	5.2	---	---	5.2	5.1	---	---
64	5.3	5.0	---	---	5.0	5.1	---	---

Vector 3rd Pass (older, repeating scripts)
CRS (EPSG)	4326				3857
Threads	Run 1	Run 2	Run 3	Run 4	Run 1	Run 2	Run 3	Run 4
1	1.9	---	---	---	1.8	---	---	---
2	2.2	---	---	---	2.1	---	---	---
4	2.6	---	---	---	2.7	---	---	---
8	2.4	---	---	---	2.5	---	---	---
16	2.5	---	---	---	2.6	---	---	---
32	2.3	---	---	---	2.4	---	---	---
64	2.0	---	---	---	2.0	---	---	---

This is not a dramatic change, although these numbers are much less consistent than those generated with the newer scripts. We can even burst up to well over 60 images per second for raster and 10 images per second for vector in some passes, which shows the danger of playing on this cusp when looking for meaningful results.

Session 2010.09.04

This session was split into two parts.

The first part of the session aimed to look at the difference in performance when running with the correct configuration (which had mistakenly not been used in yesterday's session).

Raster 3rd Pass
CRS (EPSG)	25831				3857
Threads	Run 1	Run 2	Run 3	Run 4	Run 1	Run 2	Run 3	Run 4
1	5.5	6.1	---	---	6.0	---	---	---
2	6.4	6.1	---	---	6.8	---	---	---
4	5.1	5.6	---	---	5.3	---	---	---
8	6.9	5.4	---	---	5.3	---	---	---
16	5.7	5.4	---	---	5.2	---	---	---
32	5.3	5.3	---	---	5.4	---	---	---
64	5.3	5.0	---	---	5.1	---	---	---

The results are slightly better, mainly in their consistency, but essentially in the same ballpark.

The second part of the session set out to look at performance not limited by the IO bottleneck.

Note that this is merely an academic exercise since for any non-trivial dataset, the size of data on disk will be so much larger than the size of available memory that these numbers will never be achieved.

To generate this behaviour, the session involved a series of runs with the same requests as in the earlier session, but with the scripts cut up and restructured so that the data would already have been read by the time the requests are made. The approach yielded somewhat stable numbers, with a variability between runs of around 10%.

The results reported are the last run performed for any thread group. The runs with 32 threads and 64 threads were done with three single passes to save time and to ensure the requests would not outstrip main memory.

Raster 3rd Pass
CRS (EPSG)	25831				3857
Threads	Results	-----	-----	-----	Results	-----	-----	-----
1	13.2	---	---	---	11.6	---	---	---
2	21.6	---	---	---	19.6	---	---	---
4	32.8	---	---	---	40.1	---	---	---
8	34.8	---	---	---	30.1	---	---	---
16	34.9	---	---	---	34.1	---	---	---
32	35.7	---	---	---	33.7	---	---	---
64	35.3	---	---	---	32.4	---	---	---

Clearly this is different behaviour; it only occurs after repeated requests for the same images, so it is not realistic in any way. Because the I/O throughput of the machine was not monitored while the test was running, we also cannot be sure that this is really peak, all-in-memory performance.

The result serves best as a reminder that benchmarks need to be set up carefully in order to measure the behaviour which one wants to measure, since order of magnitude changes in performance are lurking nearby.
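To put a number on the gap in this session, the sketch below divides the pre-read throughput by the disk-bound Run 1 throughput from the first part of the session, for the 25831 raster requests. The values are copied from the two tables above; for these figures the ratio ranges from about 2.4 at one thread to about 6.7 at 32 threads.

```python
# Raster, 25831, session 2010.09.04, copied from the tables above.
# Keys are thread counts; values are responses per second.
disk_bound = {1: 5.5, 2: 6.4, 4: 5.1, 8: 6.9, 16: 5.7, 32: 5.3, 64: 5.3}
pre_read = {1: 13.2, 2: 21.6, 4: 32.8, 8: 34.8, 16: 34.9, 32: 35.7, 64: 35.3}

# Speed-up of the pre-read runs over the disk-bound runs, per thread count.
ratio = {threads: pre_read[threads] / disk_bound[threads] for threads in disk_bound}

for threads, value in ratio.items():
    print(f"{threads:>2} threads: {value:.1f}x")

smallest = min(ratio, key=ratio.get)
largest = max(ratio, key=ratio.get)
print(f"from {ratio[smallest]:.1f}x ({smallest} thread) to {ratio[largest]:.1f}x ({largest} threads)")
```

The ratio grows with the thread count because the disk-bound figures stay roughly flat while the pre-read figures scale up to about four threads before levelling off.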

Session NEXT

This is a placeholder and template for future runs.

Raster 3rd Pass
CRS (EPSG)	25831				3857
Threads	Run 1	Run 2	Run 3	Run 4	Run 1	Run 2	Run 3	Run 4
1	---	---	---	---	---	---	---	---
2	---	---	---	---	---	---	---	---
4	---	---	---	---	---	---	---	---
8	---	---	---	---	---	---	---	---
16	---	---	---	---	---	---	---	---
32	---	---	---	---	---	---	---	---
64	---	---	---	---	---	---	---	---

Vector 3rd Pass
CRS (EPSG)	4326				3857
Threads	Run 1	Run 2	Run 3	Run 4	Run 1	Run 2	Run 3	Run 4
1	---	---	---	---	---	---	---	---
2	---	---	---	---	---	---	---	---
4	---	---	---	---	---	---	---	---
8	---	---	---	---	---	---	---	---
16	---	---	---	---	---	---	---	---
32	---	---	---	---	---	---	---	---
64	---	---	---	---	---	---	---	---