Infrastructure Transition Plan 2010

'''This is a draft document for the collaborative planning of the new server transition. This notice will be removed once SAC has determined its final course of action.'''

= Background =

SAC and the board have allocated a budget to purchase new server machines. These new servers have been specified, quoted and ordered, with delivery expected by Feb 22, 2010. They will be physically hosted by the Open Source Lab (OSL), and the main host OS, on which the virtual machines will run, will be managed in part by OSL. We will continue to use the current Telescience blades but plan to discontinue use of PEER1 services for osgeo1 and osgeo2 once all services have been migrated.

OSL Hosting
During the setup process Wildintellect is the OSGeo liaison to OSL. Questions should be sent to the SAC mailing list or asked on IRC.

= New Hardware =

osgeo3 (previously osl1)

 * 2x 4 core 2.5 GHz CPU
 * 6x 146 GB 15K rpm, 3 Gb/s hard drives in RAID 5 configuration, ~730 GB
 * 48 GB of RAM
 * Dual NIC Ethernet

osgeo4 (previously osl1)

 * 2x 4 core 2.5 GHz CPU
 * 6x 300 GB 15K rpm, 6 Gb/s hard drives in RAID 6 configuration, ~1.17 TB
 * 48 GB of RAM
 * Dual NIC Ethernet

= Resource Allocation =

The plan includes running virtual machines on the new machines. OSL has suggested KVM, as it is their preferred VM solution and they can provide support for it. OSL plans to install Ganeti to manage the virtual machines. It allows live migration of VMs between machines, scaling of RAM, running VM creation/installation scripts, VNC connections to guests (in case ssh is down), etc.

Ideas (Virtual Machines)
Each line should be a suggested virtual machine (VM) (or, in the case of Telescience, one blade). There are many possible scenarios, but this list tries to capture the most common options (expect the final selection to be a subset).

One alternative is to simply give each service/project its own virtual machine (VM). This may make administration easier (for security) or harder (for backup and general management), and may not use resources efficiently. For example, if there were 12 or more VMs on any one machine, they would each have at most 4 GB of RAM. By pooling services that use the same infrastructure, we could instead balance 16 GB of RAM across 4 sites. Assuming heavy loads occur only occasionally, any one of the 4 sites could use most of the 16 GB when needed and would be unlikely to conflict with the other 3.
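The arithmetic behind this comparison can be sketched as follows. This is only an illustration: the 48 GB figure comes from the hardware list above, while the 1 GB idle footprint per site is an assumed value, and no RAM is reserved for the host OS.

```python
HOST_RAM_GB = 48  # RAM per new server, from the hardware list


def ram_per_vm(host_ram_gb, vm_count):
    """Even split of host RAM across VMs (ignores any host OS reserve)."""
    if vm_count <= 0:
        raise ValueError("vm_count must be positive")
    return host_ram_gb / vm_count


def pooled_headroom(pool_ram_gb, idle_use_gb, sites):
    """RAM one busy site could use while the other sites sit at idle_use_gb each."""
    return pool_ram_gb - idle_use_gb * (sites - 1)


# One VM per service: 12 VMs on a 48 GB host.
print(ram_per_vm(HOST_RAM_GB, 12))   # 4.0

# Pooled: 4 sites share a 16 GB VM; idle footprint of 1 GB is an assumption.
print(pooled_headroom(16, 1, 4))     # 13
```

Under these assumptions a busy site in the pooled VM could draw on roughly three times the RAM it would get in its own 4 GB VM.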

osgeo3
Suggested Name, Service, Details
 * tracsvn, Trac/SVN with or without Postgres - Trac from source
 * osgeoweb or web, Apache/PHP (Drupal + MediaWiki) (with or without MySQL + Postgres)
 * LAMP (Drupal + MySQL)
 * www.osgeo.org
 * mapguide.osgeo.org
 * fdo.osgeo.org
 * wiki,
 * LAPP (MediaWiki+Postgres)
 * wiki.osgeo.org
 * MySQL
 * Postgres
 * secure or ldap,Secure VM
 * LDAP
 * LDAP Python admin scripts.
 * Secure admin notes for OSGeo admins
 * '''Not''' using LDAP for logins.

osgeo4

 * lists or mail,Postfix/Mailman
 * download1,download.osgeo.org mirror (rsynced from telascience)
 * backup,Local Backup
 * qgis,QGIS VM (Apache/Joomla + MySQL)
 * qgis.org - main website (Joomla)
 * blog.qgis.org - developer blog (Drupal)
 * forum.qgis.org - forum (phpBB3)
 * pyqgis.org ?? - Ruby on Rails
 * grass,GRASS VM
 * GRASS web site (static from svn)
 * GRASS wiki (MediaWiki on MySQL)
 * automated linux builds (for binary distribution)
 * mapbender, Mapbender VM
 * Mapbender VM with main portal and development instances, uses OSGeo LDAP for authentication
 * mapbender wiki (MediaWiki on MySQL; currently hosted externally)
 * projects,Lower load project websites (hosted on xblade14 now - relatively low priority to migrate)
 * mapserver.org
 * gdal.org
 * geotools.org
 * webextra,?
 * planet.osgeo
 * geodictionary(Wiktionary)
 * FOSSGis
 * FOSS4G

Telescience Blades

 * Lower load project websites
 * Buildbot slaves
 * Offsite Backup
 * download.osgeo.org

osgeo3 & osgeo4

 * Note: OSL has stated that it's fine to double book cores in the VMs, and that they recommend a minimum of 2. We may want to reduce the number of VMs if we're concerned about maxing all the virtual cores at the same time.
 * DRBD: Keeps a failover copy on the other physical machine in the pair. This could lead to a slight slowdown, but ensures uptime for critical services.
 * DRBD: Can be run in synchronous or asynchronous mode. Async would not slow down I/O on the original machine, since DRBD won't wait for machine 2 to finish writing. Our preference is async mode, to reduce the possibility of I/O delay on the main image.
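The difference between the two DRBD modes can be illustrated with a toy latency model. This is a simplified sketch, not a measurement or DRBD's actual implementation: it assumes the local and remote writes proceed in parallel, and all millisecond figures are hypothetical.

```python
def write_latency_ms(local_ms, network_rtt_ms, remote_ms, synchronous):
    """Latency seen by the writer for a single write (simplified model).

    synchronous:  the write completes only once the peer has written too,
                  so the slower of the two paths sets the latency.
    asynchronous: the write completes after the local write; replication
                  to the peer continues in the background.
    """
    if synchronous:
        return max(local_ms, network_rtt_ms + remote_ms)
    return local_ms


# Hypothetical figures: 5 ms disk write on each machine, 1 ms network round trip.
print(write_latency_ms(5, 1, 5, synchronous=True))   # 6
print(write_latency_ms(5, 1, 5, synchronous=False))  # 5
```

The trade-off with async mode is that writes not yet replicated when the primary fails may be missing from the failover copy.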

Connecting: The VMs can be reached at $name.osgeo.osuosl.org. Primary admins have key-based access, and all SAC members have LDAP-based access to every VM.

osgeo1
?

osgeo2
All services on osgeo2 will be migrated as soon as possible so that this machine can be turned off and that part of our PEER1 contract eliminated.

Telescience Blades
Some upgrades to OS, rebalancing of loads and a clear backup sync from osgeo4 to telescience.

= Base Image =


 * Debian Stable 64bit + Backports
 * 10 GB HD (This is the default set by OSL, we can request a different size and the images can always be grown)
 * OSGeo recommended default is 20GB, unless you can justify more.
 * default swap is 4GB
 * 4 GB RAM (default; leaves enough RAM to run 2x the normal load of VMs, from 5-6 to 10-12)
 * 64 bit
 * Standard partitioning /boot, swap, / (This is OSL default for backup and management purposes, we can request something different.)
 * ext3 (ext4 not feasible with current kernel, consider newer kernel for Backup and Download as ext4 serves large files faster)


 * Except for the securevm, all virtual machines should be configured for LDAP-based authentication. This should probably be done as part of the template image used to make new virtual machines. That way the local admin users are already set up, ssh is on, and the machine is ready for login when booted for the first time.

Package List
Policy: Install from packages unless exception agreed on by SAC

Standard Packages
 * libwww-perl, liblwp-useragent-determined-perl
 * SSL Cert
 * OpenSSH server - done
 * libpam-ldap for client login authentication
 * Add Debian Backports repository for Debian Stable (5.0.4) - done
 * Munin monitoring tools, apt-get install munin-node from Debian [main] - done
 * Added postfix - removed exim - done
 * editors - emacs,joe - done
 * Bacula client (Backports version 5.0.x)
 * Misc
 * Emergency user account - Admins have a key based access method.

Undecided

 * Firewall? Shorewall? blocking all but 22, 80, 443 by default?

Selective Packages

 * Apache
 * PHP (Apache by default should be the non-PHP builds, except for the servers that require PHP)
 * MySQL
 * PostgreSQL
 * SVN
 * Postfix
 * Mailman

Source Exceptions
Packages that will be installed from source in order to obtain a specific version or customizations.


 * Trac (mod_wsgi? or mod_python?)

= Migration Plan & Schedule =

Priority

 * 1) Setup LDAP (Are we moving it or just configuring the new servers to use it? Does this need to happen at the start?)
 * 2) Setup backupVM so it's ready to store backups
 * 3) Migrate osgeo2 (Many of these could be done at the same time if the virtual machines are created)
 * 4) wiki.osgeo.org
 * 5) backups (how much data is this, might need to be scheduled in off hours, and split into smaller jobs, or throttled)
 * 6) moodle? ocs? wiktionary? fossgis wiki? community.osgeo.org? - these could just be archived on the backup server if not in use
 * 7) planet
 * 8) qgis.org joomla site
 * 9) Download mirror (one that can take files bigger than 2 GB)
 * 10) Trac/SVN
 * 11) Migrate OSGeo1

It's anticipated that each service will be switched over as soon as it has been tested on the new setup, at which point the DNS for that service is redirected to its new working home.
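A cutover check along these lines could confirm that a service's DNS already points at its new home before the old host is retired. This is a minimal sketch: the function name is hypothetical, and the IP address is a placeholder from the documentation range, not a real OSGeo address. The demo uses a stub resolver so it runs without network access.

```python
import socket


def dns_points_to(hostname, expected_ip, resolve=socket.gethostbyname):
    """Return True if hostname currently resolves to expected_ip."""
    try:
        return resolve(hostname) == expected_ip
    except OSError:  # covers socket.gaierror (lookup failure)
        return False


# Demo with a stub resolver; 192.0.2.10 is a placeholder address.
fake_dns = {"wiki.osgeo.org": "192.0.2.10"}


def stub_resolve(name):
    if name not in fake_dns:
        raise socket.gaierror("unknown host")
    return fake_dns[name]


print(dns_points_to("wiki.osgeo.org", "192.0.2.10", resolve=stub_resolve))  # True
print(dns_points_to("wiki.osgeo.org", "192.0.2.99", resolve=stub_resolve))  # False
```

Leaving out the `resolve` argument makes the function use the system resolver, which is what a real check would do.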

Schedule
(All dates are approximate, alternative schedule suggestions welcome)


 * Order - Feb 10, 2010 ✅
 * General Plan - Feb 26, 2010 ✅
 * Physical Installation - Feb 22-Mar 15 2010 ✅
 * Specific Plan - Mar 22, 2010
 * Software Setup (Start) - Mar 23, 2010
 * Configure Base image - Mar 24-28 ✅
 * Create Wiki, Secure, Backup - Mar 29 ✅
 * Create Qgis, Webextra, Web - May 12 ✅
 * Create Tracsvn - June ✅
 * Create Projects, Mail - June
 * Create ... TBD
 * Migration - March-May 2010

= TODO: List =


 * Create a base virtual machine image for all new VMs - OSL will do this for us.
 * OSGeo template image should be based on the OSL template but include LDAP based pam authentication against the OSGeo LDAP service for ssh login.
 * May also include standard backup scripts or mount of the backup virtual machine.
 * Naming scheme for virtual machines.
 * Upgrade Telescience blade OS (May require service shuffle rotation or downtime)
 * Contingency plan for unexpected hardware failure

= Questions to ask OSL/Ourselves =


 * Can RAM be increased/decreased live? No
 * Can RAM be increased/decreased via a web interface live or with a power cycle? With a power cycle via the Ganeti CLI
 * Is it easy to move VMs between the machines? Yes, using the Ganeti CLI
 * Should the LDAP be hosted on one of the Host OS' for reliability?
 * Would LVM snapshot backups of virtual machines be a viable backup method? Should be doable, still needs some testing.
 * Define our base VM: (OSL does not recommend Gentoo, though that is what they use as the base KVM host)
 * Choose a standard: Debian Stable + backports, Ubuntu LTS, CentOS ... (Does it need to implement SELinux or is that overkill?)
 * ext4 formatting? OSL still testing that backup and management tools work with ext4, otherwise ext3
 * 32 bit vs 64 bit - in some cases smaller VMs with only 2 GB etc. could perform better with 32 bit. Answer: 64 bit
 * default HD size? - remember to leave lots of room for /var, logs and database dumps even if there's not much in the VM : 10 GB
 * How much ram should we reserve for the host OS?
 * Naming of the Virtual Machines?
 * Latitude, Longitude, Northing, Easting, Parallels, etc.
 * Mercator, Albers, Robinson, Sinusoidal, etc.
 * wiki, mail, web, ldap, etc. (based on primary DNS of that vm)
 * vm1, vm2, vm3, etc.
 * Should we stagger the backups to reduce the chance of backups grinding things to a halt?
 * Times to choose from (GMT) for backup starts (Assuming blocks of 2-3 hours to complete backups)
 * A: 9 GMT (1-4 US,8-11 Europe,19 Australia,13-15 India)
 * B: 15 GMT (8-11 US,14-16 Europe,2 Australia,17-19 India)
 * C: 23 GMT (16-19 US, 22-24 Europe,10 Australia,3-5 India)
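If the backups are staggered, one simple approach is to spread the VMs round-robin over the three candidate windows. The sketch below assumes that approach; the function name is hypothetical, the window start hours are taken from options A, B and C above, and the VM names are taken from the suggested names earlier on this page.

```python
from itertools import cycle

# Candidate backup start hours (GMT), from options A, B and C above.
BACKUP_WINDOWS_GMT = {"A": 9, "B": 15, "C": 23}


def stagger_backups(vm_names, windows=BACKUP_WINDOWS_GMT):
    """Assign each VM a backup window round-robin so start times are spread out."""
    labels = cycle(sorted(windows))
    return {vm: next(labels) for vm in vm_names}


schedule = stagger_backups(["wiki", "web", "tracsvn", "qgis", "mail"])
for vm, label in schedule.items():
    print(f"{vm}: window {label}, starts {BACKUP_WINDOWS_GMT[label]:02d}:00 GMT")
```

A real schedule would probably also weigh backup size and each VM's busiest hours rather than assigning windows purely by position in the list.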