Infrastructure Transition Plan 2014

'''This is a draft document for the purposes of collaborative planning of the new server transition. This notice will be removed once SAC has determined its final course of action.'''

= Background =

The current physical machines hosted at Oregon State University's Open Source Lab (OSL) are entering the latter part of their life expectancy. With the recent replacement of hard drives and RAID card batteries affecting performance, it's time to start planning for the next 3-5 years of computing needs. We recently acquired a large backup machine at OSL with 9TB of usable space. OSGeo1 at Peer will be off as of May 2014.

Past Performance
Current hardware has for the most part met the original stated goals of hosting project websites, issue tracking, version control, and mailing lists. Uptime has generally been good; performance has occasionally suffered when things weren't configured right (open proxy, excessive WMS requests, large numbers of 404s from bots). Most services were not configured with redundancy because small amounts of downtime were deemed acceptable, which may no longer be the case.

Our biggest dilemma has been lack of people power. We currently have only about 4-5 people who take part in core system administration. Several other people kindly manage Nabble (mail archive hosting) and some other external resources. Ideas on how to balance the workload and recruit more help are important to keeping the systems running.

Future Needs

 * Build services
 * More projects are using static websites built from version control, primarily with Sphinx.
 * Some projects have expressed interest in continuous integration services.
 * There's a renewed interest in global mirroring or GeoCDN type setup for redundancy and speed. Something similar to OSM, or maybe even swapping space with OSM.
 * More redundancy to increase uptime of important websites.
 * Separate web serving from other operations
 * Long-term archive of FOSS4G sites

Projects, please list specific needs you would like met.


 * Avoid downtime longer than a few hours; this may require automatic mirror and failover setups (projects: GRASS GIS, ... others...?)
 * A Document Management System for storing Contributor Licensing Agreements for projects.
 * Support for Git & GitHub sync
 * More isolation so projects don't take each other out when one is misconfigured or has extremely heavy usage.

= Ideas =

Hardware

 * Buy new hardware (1U 8 drive $3000-$5000 USD or 10Krpm+SSD, 1U 4-6 drives $2600 or 1U 6 drive $2000-3000 USD)
 * Possibly use SSDs
 * or, pick hard disks that will last for years over faster things that need more careful maintenance
 * Take advantage of various free hosting
   * http://Readthedocs.org
   * https://travis-ci.org/
 * Pay for external hosting
   * github pro
   * hetzner (QGIS is currently renting a server)
   * bluehost
   * digitalocean
   * rackspace
   * linode
   * etc...
 * Pool resources with projects that have bigger budgets
 * Leverage Category:ICA OSGeo Lab Network for hosting nodes

Security
Implement OWASP guidelines or similar practices for adaptive blocking of bad bots and malicious exploit attempts. We currently use custom fail2ban filters on some servers.
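
To make the idea of adaptive blocking concrete, here is a minimal sketch in Python. It is not how our fail2ban filters actually work; it only illustrates the general approach of counting 404 responses per client and flagging the heavy offenders. The function name `ips_to_block`, the `max_404` threshold, and the assumption that the web server writes Common Log Format lines are all hypothetical choices made for illustration.

```python
import re
from collections import Counter

# Assumption: access logs are in Common Log Format, e.g.
# 203.0.113.5 - - [10/May/2014:12:00:00 +0000] "GET /x HTTP/1.1" 404 512
# The pattern captures the client IP and the three-digit status code.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "[^"]*" (\d{3}) ')


def ips_to_block(lines, max_404=20):
    """Return the sorted client IPs that produced more than max_404 404 responses.

    max_404 is a hypothetical threshold; a real deployment would tune it
    and would usually count within a time window rather than a whole file.
    """
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group(2) == "404":
            counts[match.group(1)] += 1
    return sorted(ip for ip, n in counts.items() if n > max_404)
```

The resulting list could be fed to whatever blocking mechanism a server already uses. Lines that do not match the assumed log format are skipped rather than treated as errors.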

VM configuration

 * Provide consistent, baseline setup in general, then implement faster hardware configurations where there is time, attention and resources to do so
   * Puppet, Chef, Juju, etc...
 * Optimize the hardware configuration for the type of virtualization
   * Ganeti - no RAID, mirror drives across machines, hotcopy failover, VMs based on logical groupings
   * Docker/OpenVZ/LXC - software RAID, all one big install with containers for each subproject

Admin crew

 * Funded sysadmin time? (Short of outsourcing everything, can we find ways to avoid relying solely on volunteer time to support the infrastructure? How can we handle funded sysadmin time in a fair way vs volunteer contributions? Are there good examples to follow in other non-profit orgs?)
 * Have a "fire crew" on alert throughout the entire day, 24x7
   * at least one "fire crew" member (ready to handle any emergency SAC issues) on alert at any time in the day, which includes one in a North American timezone, one in a European timezone, and one in an Asia/Pacific timezone
   * "fire crew" positions should be funded/paid
   * "fire crew" schedule and contact info should be made available publicly, so at any given time issues can be brought to the fire crew member on alert
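
One way to picture the three-timezone coverage is as a simple lookup from the current UTC hour to the region on alert. The sketch below is only an illustration: the 8-hour shift boundaries in `SHIFTS` and the function name `region_on_alert` are hypothetical, and no actual schedule has been agreed on.

```python
from datetime import datetime, timezone

# Hypothetical 8-hour shifts expressed in UTC hours: (start, end, region).
# The real boundaries would be set by whoever staffs the fire crew.
SHIFTS = [
    (0, 8, "Asia/Pacific"),
    (8, 16, "Europe"),
    (16, 24, "North America"),
]


def region_on_alert(when):
    """Return the region whose fire crew member is on alert at `when`.

    `when` must be a timezone-aware datetime; it is converted to UTC first.
    """
    if when.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    hour = when.astimezone(timezone.utc).hour
    for start, end, region in SHIFTS:
        if start <= hour < end:
            return region
    raise ValueError("no shift covers hour %d" % hour)
```

Because the three shifts cover hours 0 through 23 without gaps, every moment maps to exactly one region, which is the property the 24x7 requirement asks for.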

Mirrors

 * Mirrors or Distributed services (These people have offered to host some services or mirrors) (the GRASS GIS project has a running mirror system)
   * Las Vegas, NV - astrodog
   * Zurich - OSGL at ETH Zurich http://karlinapp.ethz.ch/osgl/index.html
 * CDN
   * Cloudflare

Storage

 * Dedicated Disk Storage
 * Put all files into networked storage mounted via NFS (or something similar)
 * GlusterFS (supports georeplication)
 * Use XFS (supported in Linux for a long time), ZFS, or something else good with lots of small files.
 * SSD
   * Static websites, Tile Caches, Downloads, extra disk based caching for web configurations
 * 10K rpm drives
   * Databases, Version Repositories
 * 7.2K rpm drives
   * FOSS4G Archives