Infrastructure Transition Plan 2014

 
= Background =

Current physical machines hosted at Oregon State University's Open Source Lab ([[OSL]]) are entering the latter part of their life expectancy. With the recent replacement of hard drives and RAID card batteries affecting performance, it's time to start planning for the next 3-5 years of computing needs. In 2014 (?) we acquired a large backup machine at [[OSL]] with 9 TB of usable space. OSGeo1 at Peer is off as of May 2014.
  
 
== Past Performance ==

[[Infrastructure_Transition_Plan_2010|Current hardware]] has for the most part met the original stated goals of hosting websites for projects, issue tracking, version control, and mailing lists. Uptime has been generally good; performance has occasionally suffered when things aren't configured right (open proxies, excessive WMS requests, large numbers of 404s from bots). Most services were not configured with redundancy because small amounts of downtime were deemed acceptable, which may no longer be the case.

Our biggest dilemma has been lack of people power. We currently have only about 4-5 people (or fewer) who take part in core system administration. Several other people kindly manage Nabble (mail archive hosting) and some other external resources. Ideas on how to balance the workload and recruit more help are important to keeping the systems running.
== Urgent Needs ==

* '''Urgent: fix the very slow osgeo4 machine''' (iowait is 20-50%). Solution: temporarily move the VMs off, reformat/reinstall the base system (RAID6 -> RAID5, XFS), then move the VMs back.
* Redo the projectsVM from scratch.
* Update trac.osgeo.org (both the Debian version and Trac itself, migrating to a fresh version).
* Update other outdated Debian servers to current Debian stable.
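The iowait figure cited above can be measured directly from <code>/proc/stat</code>. A minimal sketch (synthetic sample values, not real osgeo4 readings):

```python
# Hedged sketch: derive the iowait share from two /proc/stat "cpu" lines,
# the number behind the 20-50% iowait symptom on osgeo4.
# Field order per proc(5): user nice system idle iowait irq softirq steal.
def iowait_percent(before: str, after: str) -> float:
    """iowait as a percentage of all CPU ticks elapsed between samples."""
    b = [int(f) for f in before.split()[1:]]
    a = [int(f) for f in after.split()[1:]]
    delta = [x - y for x, y in zip(a, b)]
    return 100.0 * delta[4] / sum(delta)  # index 4 == iowait

# Synthetic samples: 100 ticks elapsed, 30 of them stuck in iowait.
s1 = "cpu 100 0 50 800 200 0 0 0"
s2 = "cpu 130 0 60 830 230 0 0 0"
print(iowait_percent(s1, s2))  # -> 30.0
```

On a live machine, read <code>/proc/stat</code> twice, a second apart; <code>iostat -x</code> additionally shows per-device %util, which helps pin down which array is saturated.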
== Current Needs ==

'''Projects, please list specific needs you would like met.'''

* Avoid downtime longer than a few hours; this may require automatic mirror and failover setups (projects: GRASS GIS done, ... others...?)
* Board: a Document Management System for storing Contributor Licensing Agreements for projects.
* Support for git & GitHub sync.
* More isolation so projects don't take each other out when one is misconfigured or has extremely heavy usage.
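The git/GitHub sync item can be done with git's <code>--mirror</code> mode. A self-contained sketch with local stand-in paths (in production the target would be e.g. <code>git@github.com:OSGeo/&lt;project&gt;.git</code>, pushed from a post-commit hook or cron job; names below are illustrative):

```shell
set -e
work=$(mktemp -d)

# Stand-in for the project's canonical repository
git init -q "$work/project"
git -C "$work/project" -c user.email=sac@osgeo.org -c user.name=sac \
    commit -q --allow-empty -m "initial"

# Stand-in for the (initially empty) GitHub remote
git init -q --bare "$work/github.git"

# A mirror clone tracks all refs; 'push --mirror' then replicates
# branches, tags, and deletions exactly, keeping the copy in sync.
git clone -q --mirror "$work/project" "$work/mirror.git"
git --git-dir="$work/mirror.git" push -q --mirror "$work/github.git"

git --git-dir="$work/github.git" log -1 --format=%s   # -> initial
```

Re-running the clone/push pair (or a <code>git remote update</code> followed by <code>push --mirror</code>) keeps the mirror current after each commit.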
  
 
== Future Needs ==

* Build services
** More projects are using static websites built from version control, primarily with Sphinx.
** Some projects have expressed interest in continuous integration services.
* There's a renewed interest in a global mirroring or GeoCDN-type setup for redundancy and speed. Something similar to OSM, or maybe even swapping space with OSM.
 
* More redundancy to increase uptime of important websites.
* Separate web serving from other operations.
* <strike>Long term archive of FOSS4G sites</strike> done: http://www.foss4g.org/
* Upgraded version control hosting.
** <strike>Upgrade Trac ([http://trac.osgeo.org/osgeo/ticket/592 #592])</strike> done: 4/2015
** Gitlab
 
  
 
= Ideas =
 
== Hardware ==

* Buy new hardware (1U 8-drive [http://www.siliconmechanics.com/quotes/270347?confirmation=141627023 $3000-$5000] USD, or [http://www.siliconmechanics.com/quotes/284582?confirmation=1523506908 10Krpm+SSD], or with a [http://www.siliconmechanics.com/quotes/284588?confirmation=1814222821 newer CPU]; 1U 4-6 drive [http://www.siliconmechanics.com/quotes/274487?confirmation=1948640496 $2600] USD, or 1U 6-drive [http://www.siliconmechanics.com/quotes/270350?confirmation=476256389 $2000-3000] USD)
 
** Possibly use SSDs
** or, pick hard disks that will last for years over faster drives that need more careful maintenance
 
** https://travis-ci.org/
* Pay for external hosting
** Leverage OSUOSL Supercell [https://osuosl.org/services/supercell] for build slaves
** GitHub pro or GitLab
** Hetzner (QGIS is currently renting a server)
** Bluehost
 
* Pool resources with [[ProjectsVM|Projects]] that have bigger budgets
* Leverage the [[:Category:ICA OSGeo Lab Network]] for hosting nodes

== Security ==

Implement OWASP or similar protocols for adaptive blocking of bad bots and malicious exploit attempts. We currently use custom fail2ban filters on some servers.
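As one concrete shape this could take, a fail2ban jail using the stock <code>apache-badbots</code> filter; the thresholds, log path, and values below are illustrative placeholders, not our live settings:

```ini
# /etc/fail2ban/jail.local (illustrative values only)
[apache-badbots]
enabled  = true
port     = http,https
filter   = apache-badbots
logpath  = /var/log/apache2/access.log
maxretry = 50
findtime = 60
bantime  = 3600
```

The same jail mechanism (maxretry hits within findtime seconds triggers a bantime ban) extends to custom filters for WMS abuse or 404 floods.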
  
 
== VM configuration ==

* Provide a consistent, baseline setup in general, then implement faster hardware configurations where there is time, attention, and resources to do so
** Puppet, Chef, Juju, etc.
* Optimize the hardware configuration for the type of virtualization
** Ganeti - no RAID; mirror drives across machines, hotcopy failover, VMs based on logical groupings
** Docker/OpenVZ/LXC - software RAID, all one big install with containers for each subproject
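For the container route, per-container resource caps are what deliver the isolation between projects. A hedged sketch using LXC 1.x cgroup keys (the container name and all limit values are made up for illustration):

```ini
# /var/lib/lxc/projectvm-mapserver/config (illustrative limits)
lxc.cgroup.memory.limit_in_bytes = 4G
lxc.cgroup.cpu.shares            = 512
lxc.cgroup.blkio.weight          = 500
```

With caps like these, one misconfigured or overloaded project container is throttled before it can starve its neighbours of RAM, CPU, or disk I/O.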
  
 
== Admin crew ==

* '''Funded sysadmin time?''' (Short of outsourcing everything, can we find ways to avoid relying solely on volunteer time to support the infrastructure? How can we handle funded sysadmin time fairly versus volunteer contributions? Are there good examples to follow in other non-profit orgs? '''Solve at least the urgent items above?''')
* Have a "fire crew" on alert throughout the entire day, 24x7
** at least one "fire crew" member (ready to handle any emergency SAC issues) on alert at any time of day, which means one in a North American timezone, one in a European timezone, and one in an Asia/Pacific timezone
** "fire crew" positions should be funded/paid
** "fire crew" schedule and contact info should be made available publicly, so at any given time issues can be brought to the fire crew member on alert

== Mirrors ==

* Mirrors or Distributed services (these people have offered to host some services or mirrors; the GRASS GIS project has a running mirror system)
* CDN
** Cloudflare

== Storage ==

* Dedicated Disk Storage
 
** Put all files into networked storage mounted via NFS (or something similar)
** GlusterFS (supports geo-replication)
** Use XFS (supported in Linux for a long time, works well), [http://zfsonlinux.org/debian.html ZFS], or [http://en.wikipedia.org/wiki/Comparison_of_file_systems something else] that handles lots of small files well
* SSD
** Static websites, tile caches, downloads, extra disk-based caching for web configurations
* 10K rpm drives
** Databases, version repositories
* 7.2K rpm drives
** FOSS4G archives
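A minimal sketch of the NFS option above; all hostnames, networks, and paths are placeholders rather than our actual layout:

```conf
# /etc/exports on the storage server (placeholder paths/network):
/srv/osgeo/web       10.0.0.0/24(rw,sync,no_subtree_check)
/srv/osgeo/download  10.0.0.0/24(ro,sync,no_subtree_check)

# /etc/fstab entry on a web VM mounting the share:
storage1:/srv/osgeo/web  /srv/web  nfs  rw,hard,noatime  0  0
```

Keeping downloads exported read-only to the web VMs limits the blast radius of a compromised or misconfigured frontend.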
= Plan =

* Buy a new machine - [http://www.siliconmechanics.com/quotes/284594 current quote]
** All SSD, 2x4-disk RAID 5 = RAID 50, with n-2 drives usable; at only 4 drives per RAID 5 group, rebuilds are faster and less intrusive.
** 1 spare drive already in the order
** 1/2 the power needs of osgeo4 (870W redundant)
* Astrodog has another 1U box with Debian-flavored FreeBSD for FreeBSD hosting and some VMs
* Retire osgeo4 once we get everything off it.
** How to handle backup/failover of the LDAP server? (Currently uses DRBD via Ganeti.)
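The capacity trade-off behind the RAID 50 layout above can be sketched numerically; the 480 GB drive size is an assumed figure for illustration only:

```python
# Sanity check on the proposed layout: 8 SSDs as RAID 50, i.e. a stripe
# across two 4-drive RAID 5 groups. Each group donates one drive to
# parity, which is where "n-2 drives usable" comes from.
def raid50_usable_gb(drives: int, groups: int, drive_gb: int) -> int:
    per_group = drives // groups
    assert per_group >= 3, "RAID 5 needs at least 3 drives per group"
    return groups * (per_group - 1) * drive_gb

print(raid50_usable_gb(8, 2, 480))  # 2 groups x 3 data drives -> 2880
print(raid50_usable_gb(8, 1, 480))  # one big RAID 5 -> 3360, slower rebuilds
```

The single 8-drive RAID 5 would yield one extra drive of capacity, but a rebuild then touches all 8 drives instead of 4, which is the intrusiveness the quote's layout avoids.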
=== Later ===

* osgeo3 should also be revamped or retired at some point; something for later
 
[[Category: Infrastructure]]

Latest revision as of 10:15, 23 April 2015

This is a draft document for the purposes of collaborative planning of the new server transition. This notice will be removed once SAC has determined its final course of action.

