Dave N5UP
Dave monitoring progress during the server migration June 17, 2010

Dell R710 server
A technician at our data center adjusts a rack of Dell R710 servers

Tuesday, September 21, 2010

Server outage 0823 UTC Tuesday - Fixed

As of 13:40 we are back up and running again.

We had been experiencing some delays and lockups on a server, so I applied some new Windows and ColdFusion updates and restarted the server. For some reason it did not come back online. There were 5 different data center personnel involved in troubleshooting and fixing the problem. The primary disk passed all the tests. The RAM was replaced, and the BIOS was updated. Finally we ran CHKDSK on the drive, repairing any errors. At 1300 UTC we reinstalled the operating system, and that finally fixed the problem.

Thursday, September 16, 2010

System is up again

The problem with the network router at our hosting facility has been fixed, and the system should be accessible to everyone again.

Network router has made the system unavailable

Our hosting facility has had a network router failure. Our servers are behind this router, and so you cannot access our servers until they fix the router.

Sunday, August 1, 2010

Up-time Stats for July 2010

From our Fort Lauderdale, Florida testing center:
July 1 to August 1, 2010

2,973 attempts,
2,973 successful hits,
100.00% up-time
Average response time: 3.33 seconds for page load
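
As a rough cross-check of these numbers, the sketch below recomputes the up-time percentage and estimates how often the testing center probed the site. It assumes the reporting window runs from the start of July 1 to the start of August 1 and that the probes were evenly spaced; the function names are made up for illustration.

```python
from datetime import date


def uptime_percent(successes: int, attempts: int) -> float:
    """Percentage of probe attempts that succeeded."""
    return 100.0 * successes / attempts


def minutes_between_probes(start: date, end: date, attempts: int) -> float:
    """Average spacing between probes, assuming they are evenly spread."""
    window_minutes = (end - start).days * 24 * 60
    return window_minutes / attempts


attempts = 2973
successes = 2973

print(f"{uptime_percent(successes, attempts):.2f}% up-time")

spacing = minutes_between_probes(date(2010, 7, 1), date(2010, 8, 1), attempts)
print(f"about one probe every {spacing:.1f} minutes")
```

Under those assumptions the probes work out to roughly one every 15 minutes.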

Sunday, July 18, 2010

System will be down 0400 to 0500 July 19 UTC

During the conversion to the new server farm, I failed to create one of the index files that is necessary for the Account Manager to work properly. Without this index, it takes about 30 minutes for the screen to load, which makes the program totally useless. With the index in place, it takes a few seconds.

Unfortunately, it will take about an hour to create this index file, during which time the entire system has to be taken out of service.

The Account Manager is the only tool that will allow you to split your account into multiple accounts, with different time periods or QTHes, and that will move the eQSLs around to the proper new account automatically.

We are going to bite the bullet and create the index tonight, July 18 at 11pm Central Time, which is 0400 UTC July 19. Hopefully it will take less than 1 hour, but we really have no idea how long the index creation will take on these new faster servers.

73,
Dave N5UP

Wednesday, July 7, 2010

New System is working great!

So far, the new system has been working fantastically well. With a maximum of 240 simultaneous browser connections, the application server has been running at a maximum of 10% CPU utilization. Meanwhile, the database server is still serving over 99.5% of all database requests directly from memory instead of requiring a disk access.

Wednesday, June 30, 2010

Scary? I'll tell you what's scary... and sad...

Turning out the lights on 2 servers that have served you perfectly, without a hiccup, for 2 1/2 years, deleting all the files, hoping you did get everything moved over, checking your backup and your off-city backup, and then checking it all again, and then entering the service cancellation order and logging off for the last time.

Sad, and just a tad scary.

Friday, June 25, 2010

I am making this post from my iPhone to test the ability to post status information even if all the computers are down. This concludes the test.

The System is Up

The Application Server has been handling over 200 connections with less than 10% of the CPU resources.

The Database Server has been returning an average of 99.62% of the queries from memory without requiring a disk access.

Friday, June 18, 2010

A full day on the new servers

A full day of operation in which a couple of bugs were pointed out and fixed, but otherwise everything ran relatively smoothly. The only outstanding problem I'm aware of is that the Account Manager had to be disabled because it needs a new index, and I have to take down the entire system for an hour to build that index.

I'm looking at the performance monitor, and it shows 184 user connections only used a maximum of 9% of the CPU resource on the application server. Very cool.

The database server is the real hero here. It is able to keep almost a quarter of the entire database in memory, so that requests for data can be served from memory instead of requiring a disk access. My list of long-running queries shows the worst offenders to be running about 12 seconds. On the old servers, there were some queries that ran for 5 or 6 minutes!

Now that I can take a few minutes to actually breathe, there are 251 cards in the queue needing to be printed and mailed. Picked a bad day to run out of inkjet ink and card stock.

Thursday, June 17, 2010

First day on the new servers

One small issue: the eQSL applications were reporting the time as 1 hour ahead of the correct time that is set in Windows. The Java Runtime Environment was defaulting to British Summer Time instead of what I wanted, which was GMT, so I fixed it by overriding that default!
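
To illustrate why the clock looked 1 hour fast, here is a minimal sketch in Python (not the ColdFusion/Java stack the site runs on). It uses fixed offsets as stand-ins for the two zones: GMT at +0 and British Summer Time at +1. In a JVM, one common way to override the default zone is the `user.timezone` system property, though that is only an assumption about how it was done here.

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset stand-ins for the two zones (assumed for illustration).
GMT = timezone.utc
BST = timezone(timedelta(hours=1), "BST")

# One and the same instant, displayed under each zone setting.
instant = datetime(2010, 6, 17, 12, 0, tzinfo=GMT)
as_bst = instant.astimezone(BST)

print(instant.strftime("%H:%M %Z"))  # wall clock the applications should show
print(as_bst.strftime("%H:%M %Z"))   # wall clock shown with the BST default
```

Both values describe the same moment; only the displayed wall-clock hour differs, which matches the 1 hour discrepancy.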

Everything is going smoothly the first day on the new servers. I managed to get some sleep from 0700 to 1200 UTC.

I am adding a 1 Terabyte disk to my backup server, located in Houston, Texas (so I have a catastrophic backup in the event of a disaster in Dallas). That disk should come online in a few minutes, and I will start making backups of the Application server and the Database server.

Your logger that does Real-Time uploads of ADIF logs might fail occasionally over the next 72 hours. This is because the eQSL.cc domain name may take up to 3 days to propagate all around the world. There is nothing that can be done, except to wait a day and try again. In many regions, the eQSL.cc domain is already working properly, pointing to the new servers without a change in the URL to www.QSLCard.com. It should not take more than about 72 hours for this to occur.

If you logged into www.eQSL.cc but now your URL says www.QSLCard.com you can try logging out, then point your browser again to www.eQSL.cc and log in again. If it stays logged into www.eQSL.cc then you are all set. If it redirects you to www.QSLCard.com then your area does not yet have the updated eQSL.cc routing. Just wait another few hours.

In any event, everybody is now on the new servers and can use the system without limitations with their browser.


The system is running much faster from what I have been able to see. Right now I am seeing over 90 users logged in, and the longest wait for a database response has been on the order of 20 seconds, compared with 10 minutes or longer on the old machine.

Even the Power Users screen, which used to take 2 or 3 minutes (if it didn't time out first), now only takes a few seconds to display.

The new application server is easily able to handle those 90 users, and has never had to process more than 2 users at the same time, because it is handling their requests so fast. On the old machines, it was quite normal to have 10 to 20 users being processed simultaneously.

Most of the speed improvements are the result of the new database server, which has 24 Gigabytes of memory, along with 5 hard disks, of which 4 provide fully redundant, mirrored and striped (RAID 10) data storage spinning at 15,000 RPM for the database, and the 5th of which is a 1 Terabyte disk for storing database backups.

Please don't report any errors yet, unless they involve money ;) so we can have a chance to find errors ourselves and fix them. Otherwise you may overwhelm our email support volunteers with questions they cannot answer.

If everything continues to go this well, I will eliminate the time delays on the OutBox (I have already reduced it from 10 minutes to 1 minute) and the InBox, and other screens.

If you have questions or comments, feel free to post them here on this blog.

73!
Dave

We are running on the new server farm!

Please do not report any errors for the next few hours.
Our volunteers who answer support email will have no idea how to respond, and I will be working diligently to track down all errors.

Your logger that does Real-Time uploads of ADIF logs will probably FAIL. This is because the eQSL.cc domain name will take up to 3 days to propagate all around the world. There is simply nothing that can be done, except to wait a day and try again until it works.

Schedule

Time     Status        Task
0145     Done          Take down QSLCard.com; Restore database backup of June 14
0200     Done          Take down eQSL.cc; Copy final Transaction Log to new database (6 steps)
0245     Done          Restore Transaction Logs through June 16 - Taking longer than expected to get this working properly
0400     Done          Implement recommended speed improvements in new database - Part I
0415     Done          Implement recommended speed improvements in new database - Part II (Took longer than last time we tested it, but did OK)
0600     Done          Bring up new server at QSLCard.com (3 steps)
0615     We are UP!    Redirect all traffic from eQSL.cc to QSLCard.com (4 steps)
0615     In Progress   Change DNS for eQSL.cc and other domain names to point to new server
June 20                Final propagation of new IP addresses: eQSL.cc and other domain names point to new server from anywhere in the world


The plan is this:

0200 UTC June 17 (which is 9pm Central Time Wednesday the 16th), I will take down the current eQSL.cc system and start the final backups, moving the last programs, files, and database records over to the new servers.

Both the new and the old servers will have to be down while the last data is being restored from the old servers onto the new ones.

During those few hours, any accesses to eQSL.cc, eQSL.net, or the QSLVia domains will either fail or get a warning notice.

As soon as the new servers are running, you will be automatically redirected to QSLCard.com which is the first domain name that will "come alive" on the new servers.

After the new servers are up and running, it will take up to 3 days for the eQSL.cc domain name to be properly redirected, and users in various parts of the world may see error messages occasionally during those 3 days.

Continue to watch this page from now on, as I will post frequent status updates during the transition tonight. Since I will be drinking a lot of coffee, they could be quite humorous.

Tuesday, June 15, 2010

Well, we've got even more disk now

Got the 1TB disk installed and running. But I'm fighting with Microsoft's idea of a joke: the new Windows Server Backup. In this day and age you would think that a software company would understand that you DO NOT eliminate features when you introduce a new product that is supposed to improve on the last one. I've been working on this for hours now, and I simply cannot get it to back up one server onto the other server. Frustrating and annoying. It does not have any way of specifying a network location. All backup devices have to be local hard disks.

Anyway, the database backups at least are working, backing the database from one set of RAID 10 disks onto the 1TB backup drive. Next I'll install the backup client that allows my Houston datacenter computer to back up the backups. Confusing? Imagine what it's like sitting here amongst this pile of papers and keyboards and screens.

Memory problem may be solved

It looked as if I was going to need to install a HotFix or "Cumulative Update" to fix the memory problem, but instead, I set the minimum and maximum memory allocations, and the system has been running ever since that time without any problems.

I found out from my son Talon (who works for Microsoft) that the original Windows Server 2008, which is what I'm running in 32-bit, was based on Vista, while Windows Server 2008 R2, which is 64-bit only, was based on Windows 7. That would explain some of the unexpected differences I've found between the two.

Now my backups are running out of disk space because I don't have the cleanup task set up correctly. I'm addressing that, and seeing if I can add one more disk to the database server, just for some more room. I'm also making a few changes to the current servers to handle the new eWAS PSK endorsement, and then hopefully it will be about time to move to the new servers. The other web sites on the new server are running smoothly.

Thursday, June 10, 2010

Insufficient memory error message

The new database server is returning some error messages that say "There is insufficient memory in resource pool 'internal'".

I have tracked this down and am implementing the specified fix for it. In the meantime, if you see this message, hit the Refresh button on your browser.

Wednesday, June 9, 2010

OK, I almost have a plan




I've been moving some of my other web sites over to the new servers as guinea pigs, to find out what I'm forgetting.

I'm at the point right now of setting up the numerous backup processes that back up the database, data files, graphics, and programs onto duplicate backup machines.

What you can't see in the diagram above is the additional 2 machines that are the current servers. They are all cross-connected while I am moving software and databases around, and it's quite exciting trying to keep all of the Remote Desktop sessions straight between them! And launch some new eAwards, and work on new features, and solve problems that come in via the Support email.

So, anyway, we are now on Step 14 of a 47-step plan to migrate everything over to the new servers, and things are going pretty smoothly thus far. Someone is trying to break into our new web server and is hacking away with FTP, trying thousands and thousands of different passwords and usernames. I hope they have lots of time.

Monday, June 7, 2010

Getting the backup schedule right

I'm revamping the backup schedule for the database. There are 3 different types of backups: Full database backups, Differential database backups (just the things that have changed since the last full backup), and Transaction Log backups. These each eat up a substantial amount of disk space, and have to be scheduled carefully so they do not interfere with each other or cause the disk to run out of space, as happened this morning while I wasn't watching! Getting this right will be one of the last major hurdles before I cut everything over to the new servers.
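
The way these three backup types fit together at restore time can be sketched as follows. This is a simplified model, not the actual job configuration: the `Backup` class, the hour-based timestamps, and `restore_chain` are all made up for illustration, and it only restores up to the last backup taken at or before the target time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    kind: str      # "full", "diff", or "log"
    taken_at: int  # hours since an arbitrary starting point


def restore_chain(backups, target):
    """Return the backups to restore, in order, to get as close to `target` as possible."""
    usable = sorted(
        (b for b in backups if b.taken_at <= target),
        key=lambda b: b.taken_at,
    )
    fulls = [b for b in usable if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup at or before the target time")

    # Start from the most recent full backup.
    base = fulls[-1]
    chain = [base]

    # A differential holds everything changed since that full backup,
    # so only the newest one is needed.
    diffs = [b for b in usable if b.kind == "diff" and b.taken_at > base.taken_at]
    if diffs:
        base = diffs[-1]
        chain.append(base)

    # Then replay every transaction log backup taken after that point.
    chain.extend(b for b in usable if b.kind == "log" and b.taken_at > base.taken_at)
    return chain


backups = [
    Backup("full", 0),
    Backup("log", 6),
    Backup("diff", 12),
    Backup("log", 18),
    Backup("log", 24),
    Backup("full", 48),
]

for step in restore_chain(backups, target=30):
    print(step.kind, step.taken_at)
```

The log backup taken at hour 6 is skipped because the differential at hour 12 already covers it, which is why the relative timing of the three jobs matters for both disk space and restore time.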

Sunday, June 6, 2010

Those pesky domain names

When I change a domain name such as www.QSLCard.com over to our new IP address on the new servers, the change does not happen instantly. For several hours, or maybe even a day or more, your browser will still continue to fetch the page from the old server. The length of time is not under my control at all, but rather under the control of your Internet Service Provider (ISP).

This is making it a bit tricky to design a changeover strategy that will minimize downtime. I have set our DNS servers to ask your ISP to update its IP addresses every 10 minutes, but there is no guarantee that they will comply.
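
Here is a toy model of why the 10-minute setting is only a request. It assumes a caching resolver that either honors the published time-to-live (TTL) or enforces its own longer minimum; the `Resolver` class, the `min_ttl` knob, and the IP addresses (taken from the documentation-only 192.0.2.x range) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CachedRecord:
    ip: str
    fetched_at: int  # seconds since an arbitrary start
    ttl: int         # how long this resolver will keep the answer


class Resolver:
    """A caching resolver; min_ttl > 0 models an ISP that ignores short TTLs."""

    def __init__(self, authoritative, min_ttl=0):
        self.authoritative = authoritative  # name -> (ip, ttl in seconds)
        self.min_ttl = min_ttl
        self.cache = {}

    def lookup(self, name, now):
        record = self.cache.get(name)
        if record is not None and now - record.fetched_at < record.ttl:
            return record.ip  # cached answer is still considered fresh
        ip, ttl = self.authoritative[name]
        self.cache[name] = CachedRecord(ip, now, max(ttl, self.min_ttl))
        return ip


name = "www.QSLCard.com"
zone = {name: ("192.0.2.10", 600)}  # old server, 10-minute TTL

compliant = Resolver(zone)
stubborn = Resolver(zone, min_ttl=24 * 3600)

# Both resolvers cache the old address just before the change.
compliant.lookup(name, now=0)
stubborn.lookup(name, now=0)

# The record is switched to the new server.
zone[name] = ("192.0.2.20", 600)

print(compliant.lookup(name, now=900))       # 15 minutes later
print(stubborn.lookup(name, now=900))        # 15 minutes later
print(stubborn.lookup(name, now=24 * 3600))  # one day later
```

The compliant resolver picks up the new address after its 10 minutes expire, while the stubborn one keeps handing out the old address until its own, longer timer runs out.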

So AFTER WE ANNOUNCE THE CHANGEOVER IS BEING MADE, there will probably be a period of several hours during which your logger will not be able to connect with the eQSL server to upload ADIF files, or your email client won't be able to send us email, or your browser may be redirected to an IP address instead of the eQSL domain name.

Don't worry, this will all work itself out, usually within 72 hours, and everything will be back to normal.

Testing is going well so far, and we may be able to start the changeover sometime this week. Stay tuned!

Friday, June 4, 2010

Doing some cleanup while I'm at it

There have been numerous little things I've wanted to clean up in the system, but was unable to do while it was running. I'm finding this to be a good time to make those changes while nobody is using the new servers yet. They will make some of the administration less complicated and make expansion easier in the future.

For some reason the domains are not moving over for the mail server yet. Have to investigate that.

Thursday, June 3, 2010

New site is running in test mode

So far, so good. I have migrated 1 static site and 1 experimental site (www.AirBaboon.com), and both run nicely on the new servers. I then set up one of the domains that has been registered for eQSL.cc but is rarely used by anyone for that purpose (www.QSLCard.com) on the new servers for testing. It works well, and indeed is much, much faster! I will be doing most of my testing under this domain name, and when everything is looking good, will erase everything that was done there and migrate the other domain names over as well. Might be within a week or so.

Wednesday, June 2, 2010

Some more migration tasks completed

The IIS web server has been set up and a couple of static web sites configured for testing.

The mail server has been installed and the mailboxes and configurations copied over. I have changed the DNS zone for several of my static web sites so I can test the mail server and IIS configuration tomorrow.

It took 8 hours to make a full backup of the 30 GB of programs and data files on the eQSL.cc site and copy them over to the new application server.
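
For a sense of scale, the effective rate of that backup-and-copy job can be worked out as below. This is only back-of-the-envelope arithmetic: it treats 1 GB as 1,000 MB and lumps the backup and the copy together, since only the combined time is known.

```python
def average_rate_mb_per_s(gigabytes: float, hours: float) -> float:
    """Average throughput in MB/s, treating 1 GB as 1,000 MB."""
    return gigabytes * 1000 / (hours * 3600)


rate = average_rate_mb_per_s(30, 8)
print(f"{rate:.2f} MB/s")
```

That comes out to roughly 1 MB per second end to end.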

It took about an hour and a half to apply all of the recommended index changes suggested by the Database Tuning Advisor to speed up the new database server.

I think I'm gonna like this new database server

N2BJ is one of my worst case scenarios. Barry burns up the airwaves and has uploaded nearly 300,000 QSOs into his OutBox. When he checks his eAward status, it always takes a very, very long time. In fact, it usually times out his browser trying to check for things like ePFX300 and eZ40 standings. Until recently, he could barely check his InBox, and usually only late at night when the system wasn't heavily loaded.

So I use N2BJ whenever I'm testing a new performance improvement. He represents the worst case scenario, and I mean that in the nicest possible way! If I can make the system fast enough for people like N2BJ, K9QVB, and others with huge logs, it will work just fine for everyone else.

So, I was chomping at the bit to find out how much faster the new database machine is going to be than the old one. The old server had about 4GB of memory for the SQL Server engine to use. The new one has 24GB. Most database engines nowadays try to load as much of the actual data into memory as they can, so they can service queries directly from memory instead of having to do disk accesses. So obviously the more memory you can give a database engine, the faster things will be. I also have 2 CPUs with 4 cores each now, and I have 4 disks in a RAID 10 array that spin at 15,000 RPM on a SCSI bus instead of the slower, smaller disks on the old database server.

So, I ran a query this morning. It's a simple display of N2BJ's InBox. On the current server, it took 14 seconds to return 90 records. Then I ran it on the new server. Now mind you, I have not made any changes to the indexes or any other tweaks to enhance the performance. It only took 1 second to return the 90 records. That's 14 times faster. Then I ran the ePFX300 Standings query. It took 7 minutes 17 seconds on the current server. On the new server, it only took 38 seconds. That's about 11.5 times faster, a comparable improvement. And just by playing around with this query, I've found a few ways to speed it up even further.
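
The speed-up for each query is just the old elapsed time divided by the new one. A small sketch of that arithmetic, using the timings from the two test queries (the function name is made up for illustration):

```python
def speedup(old_seconds: float, new_seconds: float) -> float:
    """How many times faster the new run is than the old one."""
    return old_seconds / new_seconds


inbox = speedup(14, 1)
standings = speedup(7 * 60 + 17, 38)  # 7 min 17 s on the old server vs 38 s

print(f"InBox query: {inbox:.1f}x faster")
print(f"ePFX300 Standings query: {standings:.1f}x faster")
```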

I think I'm gonna like this new database server. Up to 14 times faster already! And I haven't even implemented the recommendations that should give me another 40% improvement. I think it will be fast enough that I can remove the current time delays (cache) from the InBox and OutBox.

This is my world at the moment. About a year ago I built up a fast desktop machine running Windows 7 64-bit and two monitors. It has an Intel Core 2 Quad CPU (Q9400) running at 2.66GHz and has 8GB of RAM with about 3.1 terabytes of disk hanging off it.

So, on the left monitor you see the brand new eQSL servers. The Windows Task Manager looks really cool there with 12 separate graphs for all the CPU cores. Nothing much is happening on those servers right now.

On the right monitor you can see the current eQSL application server on the top and the database server on the bottom. They are both running pretty strong right now, as we have 95 users logged into eQSL.

This is where I spend 90% of my day... that being from 8am to about midnight.

Tuesday, June 1, 2010

Status of the migration to new server farm

The database tuning advisor came up with recommendations that will produce a 40% improvement in the database performance. I will implement those in the test database, find out if there are any problems with them, and then also implement them again after I do the final copy of the production database.

Copying the database over to the new database server for a test run has already taken over 9 hours, and there are about 3 hours left to go.

It took me quite a while today to figure out all the peculiarities of Windows 2008 server and its firewalls and programs. For instance, did you know that Win2008 32-bit thinks that Daylight Saving Time is an option for UTC? Now Win2008 R2 (64-bit) knows that concept is absurd, but the 32-bit version presents a checkbox for selecting DST on UTC! Weird.

Everything is looking pretty good so far. I don't dare speculate as to what date this will all happen, but I will be able to bring up several of the smaller web sites first, and I might give you the URL to those after I get them running.

Status of the migration to new server farm

Completed a backup of all the non-eQSL web sites and moved them over to the new app server. It took a little over an hour to move it all across, so the eQSL programs should take about the same amount of time when I get ready. Since the eQSL site has tons of graphics files, and a lot of temporary files, I'll have to move it all twice: once to test it, and then again when I am ready to move everything for good.

I'm doing a full backup of the entire 120GB database of eQSL right now, and will move that across to the new database server for testing.

The database tuning advisor has been running for several days now, coming up with recommendations for changes to the eQSL database configuration, index files, etc.

Monday, May 31, 2010

Status of the migration to new server farm

In addition to eQSL.cc, I operate more than 10 other web sites. I say "more than 10", because there are actually 25 that I am moving over to the new servers, but some of them are personal sites for friends, and some are defunct enterprises that never went anywhere. Several of them are incubators for possible new businesses, and several are non-profits that I am assisting, such as the Miniature Schnauzer Rescue of North Texas (MSRNT.com) and the Dallas Chapter of the Red Cross.

I will be using several of these web sites as "guinea pigs" to test the new server farm. I will move their databases and their HTML over to the new servers and make sure everything works correctly before committing to move the eQSL system, which is by far the largest of all the web sites.

Sunday, May 30, 2010

Status of the migration to new server farm

On May 30, 2010, our data center completed the installation of 2 new servers. One of them will be the web/application/mail server, and the other will be a dedicated database server. Both of them have dual quad-core CPUs running at 2.26GHz. The app server is running a 32-bit operating system with 4GB of RAM because our graphics routines are not 64-bit, and we need to convert them first. The database server is running a 64-bit OS with 24GB of RAM dedicated to the database engine. We will play with these server configurations and do some testing first. There are about a hundred different ways to perform the migration, and we need to figure out the best way to do this to minimize downtime.