Disaster Recovery versus Disaster Avoidance… what’s the difference?

Several years ago I worked for EMC as a vSpecialist, which meant I played an evangelist role: a pre-sales resource who helped the account and marketing teams. I would travel to customers, trade shows, user groups and whoever else wanted to hear the gospel according to EMC, and pontificate on the benefits of running VMware virtualization technology on top of EMC storage products as opposed to the other guy’s platform. Now there’s nothing wrong with that, and in fact it’s a great way to make a living, especially if you happen to believe (as I did) that a) EMC DID in fact understand what people needed to succeed in the world of virtualization, and b) EMC wanted to give it to them. And because I’m a reasonably good nerd with plenty of hands-on experience and a love of the spotlight, it wasn’t very hard to stand up in front of an audience and talk convincingly about what makes a great infrastructure and how to architect it. It’s never hard to talk about something you’re passionate about and believe in, and I loved the way EMC focused on selling the strengths of its own solutions rather than the weaknesses of the other guy.

Now fast forward a few years and some of the concepts we used to demo are coming to light as actual products, not just from EMC but from myriad vendors in multiple flavors. Software-defined networking, mobile computing and storage tiering are all available as off-the-shelf components ready to be bolted onto infrastructures large and small. But one of the coolest of these was “long-distance vMotion”, otherwise known as stretched clusters.

The idea behind stretched clusters is simple. You get two sites with a really fat pipe between them and put half your compute nodes on one side and half on the other. You put all these nodes in a single cluster that spans both sites and bingo! You can vMotion between sites and enable VMware HA for increased availability. And you know what? It works! And it does exactly what the vendor described!

The key that really unlocked the potential of this capability was to introduce a storage virtualization layer which abstracted away the details of the underlying storage, similar to what products like FalconStor did years ago. But unlike previous products, the new generation could not only hide the details of the heterogeneous arrays behind the presentation layer, it could replicate data between sites and use advanced caching so that the hypervisor believed the storage it was accessing was local to the current site. Even if your data was in fact at the other end of the WAN pipe, the new tools could produce response times fast enough and throughput big enough that having the execution context at one site and the disk files at another wasn’t a problem anymore.

Awesome… so why write a blog post?

Because the devil’s in the details and I’m seeing folks who glossed over that part.

First, the context. I’ve had two clients in the last quarter looking to implement stretched clusters either in place of SRM or alongside it. That’s led to deeper discussions on disaster recovery versus disaster avoidance and where each technology fits. The second half of that conversation is about the details. Lots of details…

SRM or Stretched Clusters. Pick one

Right off the bat it’s important to understand that the design paradigms of Site Recovery Manager and stretched clusters are mutually exclusive, meaning you protect a workload with one technology or the other but not both simultaneously. That’s because the architecture of SRM dictates that each site has its own vCenter+SRM pair and each site communicates with the other. Clusters, in contrast, require that all nodes in a cluster be under the management of a single vCenter instance. Think about it… what’s the boundary for vMotion? The Datacenter! Do Datacenter structures in vSphere span multiple vCenters? No! So the nodes in a cluster have to live entirely within a single vCenter, which means they could theoretically be used in conjunction with SRM as resources on either the “logical” protected or recovery side, but we can’t divide the resources and make half the cluster appear as protected and half as recovery. Once within a vCenter instance, these resources are atomic and indivisible from an SRM perspective.

Disaster Recovery versus Disaster Avoidance

Stretched clusters and long-distance vMotion work wonderfully for clients who see the disaster coming. If you’re standing on the breakwater watching the storm clouds roll in, long-distance vMotion can definitely save your bacon. What it typically can’t do is recover well from a smoking hole in the ground you didn’t know was coming. And that’s because the typical stretched cluster lacks a few key components.

Most stretched clusters don’t maintain full copies of data at both sites simultaneously. Your plan may be different, but consider for a moment the implications:

  1. Synchronous replication is still subject to the laws of physics, meaning the greatest possible distance between two sites is still about a hundred miles (the answer varies +/- 50% with the scenario; see the quick arithmetic after this list). You may be able to mask the penalty during normal use via advanced caching algorithms, but if the data hasn’t made it across when the disaster happens, you still don’t have an RPO of zero.
  2. The enterprise must maintain a second array with sufficient capacity and avoid the temptation to consume part of that capacity. While many organizations commit to this concept at the time of purchase, I see time and time again that storage or VMware admins have consumed that space because ‘they had no choice’.
  3. Migrating the execution context from a server at one site to a server at the second site doesn’t necessarily migrate the ‘authoritative’ data source from one site to another. In fact, unless you explicitly configure the storage to follow the access point across sites, there’s a distinct possibility nothing changes and all storage IO operations are still occurring on the original array. Now before anyone screams that their product doesn’t work like this, let me point out that this is product- and implementation-dependent, so PoC testing is the only real way to know how your environment will react, and none of the clients I’ve been to in the last year have conducted a PoC.
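To put item 1 in perspective, here’s the back-of-the-envelope arithmetic. This is a rough sketch: the fibre speed is approximate and 160 km is just the hundred-mile figure from above.

```python
# Propagation delay for synchronous replication between two sites.
# Light in fibre covers roughly 200,000 km/s, i.e. about 200 km per millisecond.
distance_km = 160                 # ~100 miles between sites
fibre_speed_km_per_ms = 200
rtt_ms = 2 * distance_km / fibre_speed_km_per_ms
print(f"Propagation alone adds ~{rtt_ms:.1f} ms per synchronous write")  # ~1.6 ms
# Real writes also pay array, switch and protocol round trips, which is why
# ~100 miles is about the practical ceiling for synchronous replication.
```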

OK, so let’s assume for a minute that your organization has met all the basic requirements. Two arrays, all the capacity reserved, replication enabled, and a storage virtualization layer that masks the current data location from the stretched cluster. You’ve cleaned up your networking by stretching the guest networks and the vSphere management networks across sites, and you’ve confirmed everything works exactly as you want. Now can you ditch SRM?

Not yet.

Most folks want to ditch SRM in favor of stretched-cluster HA because SRM is a pain to configure and maintain. Every time workloads are added or removed, an admin has to go into the admin tool and update the protection groups and maybe edit the recovery plan. SRM used to play poorly with Storage vMotion and Storage DRS, so architects had to pick one or the other and live with their choice, but even with that fixed, folks still complain about having to maintain this “thing they never use”. My recent experience in the field is that stretched clusters are by and large perceived by customers as a set-and-forget solution to disaster recovery.

I don’t believe that’s the case at all.

Remember that SRM is primarily an orchestration engine, meaning you define what’s going to happen BEFORE it happens. VMware HA, on the other hand, has almost no orchestration abilities. Oh sure, you can specify which workloads to recover if HA is invoked, but can you do any of the following? (For a sense of what you’d have to hand-roll, see the sketch after this list.)

  • Build dependencies such that the application server doesn’t start until the database is up?
  • Build Recovery plans to bring the highest priority tier one workloads online before recovering the secondary systems?
  • Shut down or suspend workloads already present at the recovery site to make space for the about-to-be-recovered production systems?
  • Re-address workloads and connect them to different port groups if the enterprise elects not to stretch its networks across sites?
  • Produce an audit-acceptable test result which will prove the business requirements have been met?
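To make just the first bullet concrete, here’s a minimal, purely illustrative sketch of the dependency-ordered startup SRM gives you declaratively and vanilla HA doesn’t. The VM names and the power_on() helper are hypothetical stand-ins for whatever automation you’d actually drive (PowerCLI, pyVmomi, and so on):

```python
# Purely illustrative: start workloads in dependency order so the app tier
# never boots before its database. Requires Python 3.9+ for graphlib.
from graphlib import TopologicalSorter

# Map each workload to the set of workloads it depends on (hypothetical names).
dependencies = {
    "db-server":  set(),            # tier-one workload, nothing to wait for
    "app-server": {"db-server"},    # application waits for the database
    "web-tier":   {"app-server"},   # web front end waits for the app tier
}

def power_on(vm_name: str) -> None:
    # Hypothetical placeholder for a real power-on call.
    print(f"powering on {vm_name}")

# static_order() yields each VM only after everything it depends on.
for vm in TopologicalSorter(dependencies).static_order():
    power_on(vm)
```

And that’s only startup ordering; re-addressing, suspending workloads at the recovery site and producing audit evidence all need their own plumbing.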

If you didn’t address these very real requirements I suspect that any actual disaster would result in general panic and mayhem as engineers attempt to sort out these issues on the fly. Furthermore, I would respectfully suggest that even if you could architect your environment in such a way as to meet all of the above criteria in advance of a disaster, the amount of time and effort involved in building and maintaining the resulting fragile and complex Rube Goldberg type infrastructure would far exceed the effort required to implement and maintain an SRM based solution.

Does that mean solutions such as Peer Motion and VPLEX have been oversold by sales teams? Not at all. These technologies have tremendous benefits and can dramatically improve availability and performance. They can most definitely be used for both disaster avoidance and recovery, but only if architected properly. Simply drop-kicking them into the datacenter and walking away isn’t likely to produce the desired result, which is why I advocate that clients commit to a bake-off-style proof of concept. Not for the purpose of choosing stretched clusters OR Site Recovery Manager as the winner, but rather so they understand factually what these two technologies do, what their requirements are, and how they are complementary rather than competing solutions.

I’m a bad person… aka the rest of the SQL setup

Sometimes you start something and get lost in the details…

Like trying to get MS-Word to properly format a blog-post how-to you’ve written on the plane (try that with a 15″ laptop!) in native .docx format. Something you’d think would be easy, except when it isn’t. Pretty soon you’re trying to do conversions from one format to another, which leads to evaluating blogging tools and winds up as a study in solving quadratic equations. Eventually something that began as trivial takes on a scope just shy of world peace, with the predictable result that it gets dropped by the wayside. Thanks Bill!

That’s what’s happened to my series on setting up SQL. It turned into a wrestling match with MS-Word that eventually lost out to kids, life, work and beer. But mostly beer.

Still, it nags me that I’ve started it and finished it but not yet published it, so I’ve decided to go around the problem, skip the formatting step and post the original MS-Word version right here in .docx format. That way I can go back to wrestling with blogging tools with a clear conscience, and those who care can take the raw material and use it as they will. Hopefully one day I’ll find the right tool and post the full content directly into my blog, but for the moment I’m moving on. There are too many other great topics to talk about.

Enjoy!

Install SRM v1.5

vSphere+SRM Lab Database Setup – Part 1: Install & Configure SQL Express

Overview

This guide is for setting up a single server to host SQL Express and the databases for VC, VUM, SSO and SRM. Why on earth would anyone do that? Well, simple really… in a small lab there are limited resources, and installing everything on a single server makes the most efficient use of them. Memory, disk, Microsoft Windows activations… you get the picture!

Basic Server Setup

  1. Install Win2k8 R2
  2. Install Service Pack 1
  3. Configure networking (fixed IP), DNS Servers, Gateway & disable IPv6
  4. Activate windows (optional)
  5. Set the time zone
  6. Set the machine name including the FQDN.
  7. Enable Remote desktop
  8. Join machine to domain
  9. Disable Windows Firewall (including domain)
  10. Disable IE Enhanced Security Configuration (right-click on Computer -> Manage -> Configure IE ESC -> Off for Administrators and Off for Users)

Middleware components

  1. Install .NET 3.5 SP1: Start -> Administrative Tools -> Server Manager -> Features -> Add Features
    1. Do not install WCF Activation or you’ll wind up with IIS running on port 80, which will conflict with VMware services!

  2. Install the Microsoft JDBC driver for the vCenter installation. The KB article outlining installation and configuration is at http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2006642 and the driver itself is available from http://www.microsoft.com/en-us/download/details.aspx?displaylang=en&id=21599
    1. Install the JDBC driver.
    2. During the installation, when prompted for a location to store the files, specify a path such as C:\Program Files\Microsoft SQL Server JDBC Driver 3.0.
    3. Make a note of the path.
    4. Add the driver to the Java CLASSPATH variable (or script it; see the sketch after this list):
      1. Right-click Computer and click Properties.
      2. Click Advanced System Settings.
      3. Click Environment Variables.
      4. Click New under the System variables section.
      5. Enter CLASSPATH in the Variable name text box.
      6. Enter Path\sqljdbc_3.0\enu\sqljdbc4.jar in the Variable value text box, where Path is the path to the installation files you noted in step 3.
      7. Click OK.
      8. Reboot vCenter Server.
  3. Install SQL Express 2012 (see below)
  4. Install the older SQL Native Client for SRM
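If you’d rather not click through the Environment Variables dialogue, the CLASSPATH change can be scripted. A minimal sketch, assuming the driver was unpacked to the default path from step 2 (run from an elevated prompt):

```python
# Sets the machine-level CLASSPATH via setx; adjust 'jar' to the path you noted.
import subprocess

jar = r"C:\Program Files\Microsoft SQL Server JDBC Driver 3.0\sqljdbc_3.0\enu\sqljdbc4.jar"
subprocess.run(["setx", "CLASSPATH", jar, "/M"], check=True)  # /M = system-wide
```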

Notes

  • VUM needs a 32-bit SQL Native Client DSN
  • VCDB needs a 64-bit SQL Native Client DSN
  • SRM must use the older 64-bit SQL Native Client

SQL Server Express 2012

Installation of SQL Express 2012

You’ll need a 64 bit version of SQL. If you have a production version then fantastic otherwise for a lab build you’ll have to use “Express”. This product is not suitable for production use for a number of reasons:

  • 10GB per database maximum
  • 1 Physical CPU maximum, but multiple cores allowed (obviously this limits performance)
  • 1 GB RAM max. The system can have more than that but SQL Express will only use 1 GB.
  • No SQL Server Agent Service

The good news is that the download is FREE and comes with Management Studio for interfacing with the database server.

At present both can be downloaded in one package from:

http://www.microsoft.com/sqlserver/en/us/editions/2012-editions/express.aspx

You should wind up with a single file named SQLEXPRWT_x64_ENU.exe. Double-click the executable to begin the installation.

Pick the first option to install a new stand-alone server.

The installer will launch a “Rules Checker” to validate the installation, then present the license terms dialogue.

Click “Next” to accept the terms. You’ll see a quick rule check dialogue fly by (hopefully it doesn’t produce any errors and stop!). The next screen is the feature selection dialogue. There’s no need to make changes unless you’re looking to change the software’s installation location.

The next screen is the instance ID configuration option. I prefer to leave this at the defaults and add more instances later. I’m just never sure what other components will assume. At least if I leave the defaults in place I know nothing down the road will break.

You’ll briefly see the disk space requirements screen fly by followed by the Server Configuration panel.

On the Server Configuration screen I recommend changing the SQL Server Browser startup to automatic. You’ll have to do it eventually so why not now?

The next screen is the Database Engine Configuration panel. Definitely change the Authentication mode here to “Mixed” and don’t forget to record the “sa” password! This allows you to use Windows authentication OR to create users and passwords within the database. A word of caution here… sometimes ODBC connectors or VMware software have different rules regarding acceptable passwords, meaning you can enter the correct password but, because it contains something that violates the entry rules of the end application, it will be rejected as though it were incorrect. The point is to avoid special characters, especially wildcards!
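As an aside, once Mixed Mode is on you can create the SQL-authenticated logins from code rather than through Management Studio. A minimal sketch using pyodbc; the instance name, login and password are illustrative placeholders, and the password caveat above still applies:

```python
# Create a SQL-authenticated login on the local SQLEXPRESS instance.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={SQL Server Native Client 11.0};"
    "SERVER=localhost\\SQLEXPRESS;"
    "Trusted_Connection=yes;",   # use Windows auth for the admin work
    autocommit=True,
)
# CHECK_POLICY = OFF keeps Windows password policy out of the lab's way.
conn.execute("CREATE LOGIN srm WITH PASSWORD = 'Plain0ldPassword', CHECK_POLICY = OFF")
```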

Notice the Data Directories tab… If you wanted to place the actual database instance data on another disk this would be the place to do so. For production I would highly recommend doing so as IO characteristics are completely different for binaries versus data. However here in the lab it’s not so critical. For lab use I generally thin provision a 100 GB C drive and allow it to grow as required.

Error Reporting: No change.

After you click “Next”, the installation will run for a while… sometimes quite a while, with a few status updates along the way.

Once completed you’ll be prompted to restart the computer after setup is complete. Not a problem, just the usual locked files needing update.

Once the warning dialogue is acknowledged you’ll be left with the underlying “Complete” panel.

Close the SQL Server 2012 Setup and you’ll be left with the installation wizard. Close the wizard and restart the system. At this stage I usually do a reboot (to install the locked files), then a shutdown, then a snapshot. Given the length of time it takes to get to this stage, it’s probably a good time to take a snap before the DB config and VMware software installs.

Configuration of SQL Express 2012

By default, SQL Express does not permit connections from remote computers. Local configurations should work as long as you’re installing VC and/or SRM onto the same system as SQL; however, you may still encounter issues with communications protocol mismatches between the client and the server. A more consistently successful approach is to enable remote TCP/IP connections and disable the alternate protocols. This simplifies troubleshooting and increases the likelihood of success right out of the box. You must enable remote connections for each instance of SQL Express that you want to connect to from a remote computer.

  1. Click Start -> All Programs -> Microsoft SQL Server 2012 -> Configuration tools -> SQL Server Configuration Manager
  2. Expand SQL Server Network Configuration and click on Protocols for SQLEXPRESS
  3. In the right-hand pane are three transport protocols: Shared Memory, Named Pipes and TCP/IP. Right-click to disable the first two, then right-click on TCP/IP to enable it.
  4. You will receive a warning that changes will not take effect until the service is stopped and restarted. We will do that at the end.

  5. Again, right-click on TCP/IP and select Properties from the fly-out menu. You’ll be presented with a tabbed dialogue box. Select the IP Addresses tab and you’ll see a list of IP addresses and their status. All will likely show Active but not Enabled. I generally enable all the addresses so SQL is available no matter which interface a client chooses, but there is a potential pitfall here: if you clone this machine and change its address, SQL may try to start on the now non-existent interfaces and fail. Depending on the version of SQL and what specific options have been enabled, you will either be just fine or wind up with a perfectly good SQL Server that won’t start!
  6. Also while you’re here, blank the TCP Dynamic Ports field and change the TCP Port to 1433 for each interface. I realize we’re locking things down and making it difficult to clone the system, but we’re also making 100% sure connectivity won’t be a problem in our lab. Is this the way to do things in production? Most definitely not (firewall disabling?), but lab time is about messing with SRM’s bells and whistles, not troubleshooting database connectivity!

  7. Now repeat the steps for the SQL Native Client configuration. Make sure both the 32-bit and 64-bit clients are enabled for TCP/IP and not for Named Pipes or Shared Memory! Notice also that the clients are already set up to try port 1433 by default, so no port-level configuration changes are needed.

  8. Now, with all the changes made, restart SQL and the SQL Browser service. In the left pane, select SQL Server Services then, in the right pane:
    1. Stop (don’t restart) the Browser service
    2. Restart the SQL Server
    3. Start the SQL Server Browser

    Order is important here! SQL Server will pick up the configuration changes on a restart but the browser needs to start AFTER SQL Server has picked up its changes or it will run with old configuration data. If it’s all too much to follow just reboot the box and come back in a couple minutes!
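Once the services are back up, it’s worth proving the TCP path actually works before installing anything on top of it. A quick sanity check with pyodbc; the host name and sa password are placeholders for your lab:

```python
# Verify the instance answers on TCP 1433 from a remote machine.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={SQL Server Native Client 11.0};"
    "SERVER=tcp:sqlhost,1433;"   # force TCP so Shared Memory can't mask a failure
    "UID=sa;PWD=YourSaPassword;"
)
print(conn.cursor().execute("SELECT @@VERSION").fetchone()[0])
```

If that prints the SQL Server version string, the VC, VUM and SRM installers should have no trouble reaching their databases.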

Things the vendor forgot to mention…

Well it’s the end of another fun week onsite wrestling with SRM and 3rd party SRAs. Sometimes SRAs are fairly straightforward and well documented, sometimes they’re not. Despite a fairly thick package of documentation, IBM falls into the latter category. Not for what it included but for what it omitted.

The IBM SVC SRA comes with two PDF documents. One is a User Guide and the second is a set of Release Notes. They’re full of screenshots and step-by-step instructions for installing the SRA and running the small configuration utility, IBMSVCSRAUtil.exe.

Here’s where things go wrong… not BIG wrong but rather “Gee, it would’ve been good to know that!”

First issue.

The IBMSVCSRAUtil utility must be run as administrator! Failing to do so has two results. You won’t be able to close the utility no matter how many times you click the OK button, and if you do manage to close it, perhaps by hitting Cancel or “X”-ing the utility window itself, it won’t store the configuration you’ve entered. Meaning that if you enter your config and go back into the utility a second time, you’ll see nothing has been saved. No error messages, no warnings, nothing in the doc… just nothing saved! Now if you run as an admin, you’ll notice OK exits the utility, and a second trip through this little piece of quick-and-dirty will show the previously stored configuration in the bottom-left quadrant of the application panel.

Second issue

Where the utility clearly indicates SpaceEfficient Mode should be answered “Yes” or “No”, the IBM documentation shows an entry of “True”. So which is it? Yes? No? True? False? Barney?

[Screenshot: IBMSVCSRAUtil prompting for Yes/No while the IBM documentation shows “True”]

Third Issue (OK, now I’m just complaining)

So item #2 got us thinking… what should we be using? We tried “Yes”… and it accepted and stored the value as indicated in the lower-left quadrant. So we tried “True”, and again IBMSVCSRAUtil saved the value and happily confirmed it had been accepted. So on a lark we tried “Barney”. Guess what? It merrily accepted the value and confirmed it was saved. In fact you can enter absolutely anything you want here and it will be accepted! Hint: don’t use Barney.

Really?

Point to be made here… IBMSVCSRAUtil does zero input checking!

Second point? The correct answer here is “True/False”, or really just “True”. Anything else, including the word “Yes”, will result in a fully provisioned snapshot and consumption of 2x the disk space. Only the word “True” results in a space-efficient FlashCopy/snapshot!
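For the curious, the missing guardrail amounts to a few lines. A sketch of the check the utility ought to perform (my code, not IBM’s):

```python
# Only the literal string "True" buys a space-efficient FlashCopy,
# so anything else should be rejected up front.
def parse_space_efficient(value: str) -> bool:
    if value not in ("True", "False"):
        raise ValueError(f"SpaceEfficient must be 'True' or 'False', got {value!r}")
    return value == "True"
```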

These observations were made using the IBM SVC SRA version 2.1.0.121108 and SVC version 6.3.0.2.

Amateur hour at IBM

I think the world of IBM. I worked there in the 90s and anyone that wants to stuff my brain full of bat-shit tech goodness is more than welcome to do so. But everyone has high points and low points and the great IBM is no exception.

While working with the most recent IBM SVC Storage Replication Adapter, I discovered some surprising behaviour. The first discovery was that if you used a preconfigured setup (as specified in the SRA utility and on the SVC), you could use reduced (i.e. non-god-mode) permissions. But if you did that, then every mirror source, target and FlashCopy was simultaneously visible to your ESX hosts. Yeah, kudos for using the least required permissions, but DUDE WTF! (Welcome To Facebook!) Why do I have 3 LUNs claiming to be datastore “T1G1” across both sites?

OK, so this is a major issue, and it derives from several key pieces of information:

  1. Pre-configured means just that… all the mirrored sources and targets, as well as their FlashCopy/snapshots, are predefined (this is good)
  2. All the mapping (zoning) is also pre-defined (good until you realize what this means)
  3. The SRA now runs with only “copy_operator” permissions (awesome! All hail the first SRA vendor I know of to run with minimum permissions)

The problem is that the COPY_OPERATOR permission set doesn’t permit mapping/unmapping LUNs to hosts, meaning once this configuration is set up and the first HBA rescan is executed, ALL hosts can see ALL local LUNs, including the read-only Metro Mirror and Global Mirror targets as well as all the local FlashCopies (which pre-exist, because this is a pre-configured environment!).

The result isn’t immediately obvious but once a successful failover and failback is enabled both sides will have conducted a rescan and now the poor SRM administrator is looking at a “successfully recovered” environment with multiple SVC luns presenting themselves as the original datastore! Holy SCSI ids Batman, How in Gotham did THAT happen?

To confirm our theory, we sat down at the SVC console, logged in with our COPY_OPERATOR user ID and attempted to un-map the offending LUNs from the ESX hosts. The attempt failed with an insufficient-privilege error. When we tried to revise the privileges to add zoning/masking to our SRA ID, we found that the IBM/SVC model doesn’t support granular permissions! The next step up from COPY_OPERATOR? Full-blown ADMIN! Grrrrr!

Lesson learned… Cheers to IBM for leading the way with reduced permissioning but jeers for creating more problems than were solved!

Why I use snapshots with SRM…

Recently I posted a question regarding alternate backup and recovery plans for Site Recovery Manager on the VMware Communities forum and was surprised at the lack of folks taking out insurance on an SRM installation. Not because SRM is a piece of junk that blows up at every opportunity, but rather because it’s a complex system with multiple moving parts from multiple vendors spread across multiple sites. And sometimes… just sometimes… these pieces don’t all mesh the way they should. It’s kind of like a new hard drive: it’s either going to be dead out of the box or run for years, but getting past the initial setup is typically the hardest part, and it’s where I like to test ALL the functionality the client plans to use and even some of the things they don’t.

Now this type of issue happens a lot less than it used to. I’ve been playing with SRM since before it went GA and have probably done fifty to a hundred installs over the last few years, and much to my relief these failures are increasingly rare. But they still happen, and when they do they can sometimes leave the back-end database in an inconsistent state from which recovery is virtually impossible. Just visit Google and you’ll find lots of folks facing this dilemma. I’ve had re-purposed LUNs, ÜBER VNX VSAs, re-protect operations and even accidental power-offs cause failures of this type. Typically it happens on first test or first failover (they’re not the same) but rarely in a mature, stable environment. Sometimes it’s possible to recover through the GUI or even a simple reboot, and at other times we need to do a little digging in the database. This is made all the more difficult because VMware doesn’t publish the schema, but there are tips and tricks, and many, many things are possible. Still, there are occasions where these two methods don’t work and we’re left with a ticking clock… which is why I use snapshots.

Now before everyone gets their knickers in a twist, let’s be clear about a few things. I don’t snapshot production or running databases, and I’m not talking about leaving snapshots in place ad infinitum, but it’s fairly predictable where and when these issues occur, so a little insurance goes a long way.

Typically, we set up a lab or POC environment first. We want to get the customer thinking about how their array behaves in both test and failure scenarios. We’ll do a lab setup, conduct a non-disruptive test, then a failover, followed by a re-protect and finally a failback.

It can take days to get to this stage, especially if you’re starting from bare metal and things don’t go smoothly, but there are distinct advantages to doing it this way:

  • It proves the solution works
  • It shows firsthand the behaviour the client can expect.
  • It uncovers all the prerequisite hardware, software and configuration the customer will need to implement in production.
  • It trains the customer (clients always drive)
  • It leaves the customer with a working sandbox

When doing this for the first time, I’ll install and configure all the required components up to SRM. Then I’ll shut down the system(s) and take a snapshot. If it’s an all-in-one lab environment, that means two VMs, each consisting of SQL Express, VC and SRM. Because the system is shut down, there’s no issue with a running database or a running memory image. If there are individual servers, then I’ll shut down the VCs, DBs and SRM servers and snapshot those. Next I install SRM and the vendor SRA and repeat the snapshots.

Now we bring everything back up and begin configuration and testing. If anything is going to break, this is where it’s likely to happen. Sometimes we work through issues, and sometimes we know exactly what happened the moment the error pops up, but no matter what happens we have options for recovery. Choices. Considering how long it takes to install and configure all the components for SRM, it makes sense to have a checkpoint we can quickly revert to if all else fails. It also means that if something does happen and we want to work through it but are short on time, we can take another snapshot to capture the issue, revert to our post-SRA snap and continue onwards.

Once it’s all been sorted we can delete our snaps and commit to the working environment but in the interim I like the insurance. I’ve had scenarios where I’m in another city with a flight home in 24 hours and something messes up SRM to the point where only a reinstall will fix the database. That’s when I’m grateful for the snapshot.

Backing up and restoring the SRM database

Without the sa password!

Let’s suppose for just a moment that you’re in a situation where you can’t snapshot the SRM database server. Maybe it’s shared, or perhaps you can’t shut it down to get a clean snap. Or perhaps you don’t even have access to it… you’re going to have to back things up.

Backing up the DB generally goes without issue, but in reality you should stop SRM to get a consistent state. The permissions granted to the srm user are sufficient to complete a backup; recovery/restore is another matter.

First, a couple of requirements:

  1. You’ll need to stop the Site Recovery Manager Service. If you’re backing up then it’s just a BP. If you’re restoring then it’s going to be an actual requirement unless you enjoy failure.
  2. You’ll need to have db_owner permission: databases-> srm -> security -> users -> srm -> (right-click) Properties -> Membership. Make sure db_owner is checked!

  3. You’ll also need db_creator permissions: Security -> Logins -> (right click) Properties -> Server Roles. Make sure db_creator is checked!

Now just right-click the srm database and from the Tasks menu select Backup!
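Or, if you’d rather script it than click, here’s the same backup (and the eventual restore) as plain T-SQL driven from pyodbc. The path, host and password are placeholders; db_owner is sufficient for BACKUP DATABASE:

```python
# Back up the SRM database as the srm login over TCP.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={SQL Server Native Client 11.0};SERVER=tcp:sqlhost,1433;"
    "DATABASE=srm;UID=srm;PWD=SrmPassword;",
    autocommit=True,   # BACKUP/RESTORE can't run inside a transaction
)
cur = conn.execute("BACKUP DATABASE srm TO DISK = N'C:\\Backups\\srm.bak' WITH INIT")
while cur.nextset():   # drain progress messages so the backup runs to completion
    pass
# Restore later (SRM service stopped, dbcreator role in place):
#   RESTORE DATABASE srm FROM DISK = N'C:\Backups\srm.bak' WITH REPLACE
```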

Of course now that you have a backup of your database, you’ll never have to use it!