[Service Fabric] Auto-scaling your Service Fabric cluster–Part I
Like most people, whenever I need to build an ARM template to do something with Service Fabric, I browse around GitHub or wherever else I can find the bits and pieces of JSON that I need.
I was recently working on a project where they needed 3 things:
1. They wanted their Service Fabric cluster to use managed disks instead of Azure Storage accounts.
2. They wanted to have auto-scaling set up for their virtual machine scale set (VMSS). In this case, we started out using the Percentage CPU rule, which is based on the VM scale set hardware and nothing application specific. In Part II, I'll talk about auto-scaling via custom metrics and Application Insights.
3. They had a stateless service where they wanted to register their service event source to output logging information into the WADETWEventTable.
This template includes all 3 implementations. You can find the files here: https://github.com/larrywa/blogpostings/tree/master/VMSSScaleTemplate. This is a secure cluster implementation, so you'll need to have your primary certificate in Azure Key Vault.
Using managed disks
Some of the newer templates you'll find on GitHub already use managed disks for the Service Fabric cluster. If the one you are using doesn't, find the Microsoft.Compute/virtualMachineScaleSets resource in your JSON file and make the following modifications.
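As a rough sketch of what that change typically looks like (this is not copied from the linked template, and the parameter names are my own placeholders), the osDisk block in the scale set's storageProfile drops its vhdContainers and name entries and gains a managedDisk entry. The resource's apiVersion also needs to be one that supports managed disks, 2016-04-30-preview or later.

```json
{
  "storageProfile": {
    "imageReference": {
      "publisher": "[parameters('vmImagePublisher')]",
      "offer": "[parameters('vmImageOffer')]",
      "sku": "[parameters('vmImageSku')]",
      "version": "[parameters('vmImageVersion')]"
    },
    "osDisk": {
      "caching": "ReadOnly",
      "createOption": "FromImage",
      "managedDisk": {
        "storageAccountType": "[parameters('storageAccountType')]"
      }
    }
  }
}
```

With this in place, the storage account resources that only existed to hold the OS disk VHDs can usually be removed from the template, along with any dependsOn entries that point at them.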
Another important setting you need to make sure you have is overProvision = false (here placed in a variable).
This variable is then used in the properties of the Microsoft.Compute/virtualMachineScaleSets resource:
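A minimal sketch of how the variable and the property fit together, with everything else in the template left out (the comments text is mine, and the scale set's name and remaining properties are omitted):

```json
{
  "variables": {
    "overProvision": "false"
  },
  "resources": [
    {
      "type": "Microsoft.Compute/virtualMachineScaleSets",
      "comments": "Illustrative fragment only; name, sku and the other properties are omitted.",
      "properties": {
        "overprovision": "[variables('overProvision')]"
      }
    }
  ]
}
```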
More information about overprovisioning can be found here /en-us/rest/api/compute/virtualmachinescalesets/create-or-update-a-set. If this setting is missing or set to true, you may see more than the requested number of machines and nodes created at deployment, and then the ones that are not in use are turned off. This will cause errors to appear in the Service Fabric Explorer. Service Fabric will eventually clean up after itself, but when you first see the errors, you'll think you did something wrong.
Setting Up Auto-scale on your VMSS
At first my idea was to go to my existing cluster, turn on auto-scaling inside of the VMSS settings and then export the template from the Azure portal. I then discovered that my subscription was not registered to use the microsoft.insights resource provider. I'm not sure you'll run into this, but if you do, you can enable it in the portal under Subscriptions -> your subscription -> Resource providers -> Microsoft.Insights.
The 'microsoft.insights/autoscalesettings' resource is placed at the same level in the JSON file as the other major resources like the cluster, the virtual machine scale set etc. Although the portal shows auto-scaling as a setting of the scale set, it is not a sub-section of the VMSS resource; it is a separate resource, as shown in the next screenshot.
In the JSON outline editor, the auto-scale resource will look like this:
There is a rule with conditions for scaling up and scaling down based on Percentage CPU metrics. Before deploying, go in and set your own desired levels in the capacity area, line 838 in this sample. The values for the Percentage CPU I have set in this sample are ridiculously low just to be able to see something happen, and if you're impatient, you can make them even lower just to see scaling take place.
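For readers who don't have the template open, here is a hedged sketch of roughly what such an auto-scale resource looks like. It is not the resource from the linked sample: the variable name vmNodeType0Name, the capacity numbers, the thresholds and the time windows are all placeholder assumptions to replace with your own.

```json
{
  "type": "Microsoft.Insights/autoscaleSettings",
  "apiVersion": "2015-04-01",
  "name": "Autoscale",
  "comments": "Sketch only. vmNodeType0Name, capacity and thresholds are placeholder assumptions.",
  "location": "[resourceGroup().location]",
  "dependsOn": [
    "[resourceId('Microsoft.Compute/virtualMachineScaleSets', variables('vmNodeType0Name'))]"
  ],
  "properties": {
    "name": "Autoscale",
    "enabled": true,
    "targetResourceUri": "[resourceId('Microsoft.Compute/virtualMachineScaleSets', variables('vmNodeType0Name'))]",
    "profiles": [
      {
        "name": "CpuProfile",
        "capacity": {
          "minimum": "5",
          "maximum": "10",
          "default": "5"
        },
        "rules": [
          {
            "metricTrigger": {
              "metricName": "Percentage CPU",
              "metricResourceUri": "[resourceId('Microsoft.Compute/virtualMachineScaleSets', variables('vmNodeType0Name'))]",
              "timeGrain": "PT1M",
              "statistic": "Average",
              "timeWindow": "PT5M",
              "timeAggregation": "Average",
              "operator": "GreaterThan",
              "threshold": 60
            },
            "scaleAction": {
              "direction": "Increase",
              "type": "ChangeCount",
              "value": "1",
              "cooldown": "PT5M"
            }
          },
          {
            "metricTrigger": {
              "metricName": "Percentage CPU",
              "metricResourceUri": "[resourceId('Microsoft.Compute/virtualMachineScaleSets', variables('vmNodeType0Name'))]",
              "timeGrain": "PT1M",
              "statistic": "Average",
              "timeWindow": "PT5M",
              "timeAggregation": "Average",
              "operator": "LessThan",
              "threshold": 30
            },
            "scaleAction": {
              "direction": "Decrease",
              "type": "ChangeCount",
              "value": "1",
              "cooldown": "PT5M"
            }
          }
        ]
      }
    ]
  }
}
```

The first rule adds one instance when average CPU stays above the upper threshold for the time window, and the second removes one when it stays below the lower threshold. The cooldown value is how long auto-scale waits after a scale action before it acts again.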
Registering your Event Source provider
To register an event source provider, you need to add it to the Microsoft.Azure.Diagnostics section of the Microsoft.Compute/virtualMachineScaleSets resource.
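As an illustration, the entry normally sits in the EtwEventSourceProviderConfiguration array of the diagnostics extension's WadCfg settings. The provider name below is a made-up placeholder, and the transfer period is just an example value. The stock Service Fabric templates usually also give their own providers a DefaultEvents eventDestination to route them to named tables; this sketch leaves that out on the assumption that events then land in the default WADETWEventTable.

```json
{
  "WadCfg": {
    "DiagnosticMonitorConfiguration": {
      "EtwProviders": {
        "EtwEventSourceProviderConfiguration": [
          {
            "provider": "MyCompany-MyApplication-MyStatelessService",
            "scheduledTransferPeriod": "PT5M"
          }
        ]
      }
    }
  }
}
```

In a real template this entry would be added alongside the Service Fabric providers that are already listed in the same array, not in place of them.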
You will of course need to make sure that the 'provider' listed above matches the EventSource name in your ServiceEventSource.cs file of your service. This is what I have for my stateless service that I'll deploy to the cluster once it's up and running.
To create the Service Fabric cluster, there is a deploy.ps1 PowerShell script included, but you can deploy the ARM template with whatever script you would normally use for your other ARM templates. Make sure you go through the process of creating your cluster certificate and putting it in Azure Key Vault first, though; you will need that information for any secure cluster.
Note that in this example on GitHub, the parameters file is named 'empty-sfmanageddisk.parameters.json' and to get the deploy.ps1 to work, you need to get rid of the 'empty-' part of the name.
Once your cluster is up and running, confirm your auto-scale settings
Within your cluster resource group, click on the name of your VM scale set. Then click on the Scaling menu item. What you should see is something like this:
If you want to make changes to the levels for Scale out and Scale in, just click on the 'When' rules and an edit blade will appear where you can make those changes. Then click the Update button.
Next, to get notified of an auto-scale event, click on the Notify tab, enter an email address and then click the Save button.
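If you would rather keep the notification in the template than set it in the portal, the auto-scale resource accepts a notifications array in its properties, next to profiles. This is not part of the linked sample as far as I know, so treat the fragment below as a sketch; the email address is a placeholder.

```json
{
  "notifications": [
    {
      "operation": "Scale",
      "email": {
        "sendToSubscriptionAdministrator": false,
        "sendToSubscriptionCoAdministrators": false,
        "customEmails": [
          "ops@example.com"
        ]
      },
      "webhooks": []
    }
  ]
}
```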
Before moving on to the next step, make sure that your cluster nodes are up and running by clicking on the name of your cluster in your resource group and looking for a visual confirmation of available nodes:
Publish your Service Fabric application
For this example, I have a simple stateless ASP.NET Core 2.0 sample that uses the Kestrel communication listener. You can either choose a dynamic port or specify a port number in your ServiceManifest.xml file. Publish the service to the cluster using Visual Studio 2017.
In the WebAPI.cs file, you will notice some code that is commented out. I'll use this in Part II of this article when I discuss scaling via custom metrics.
You can find this code sample at https://github.com/larrywa/blogpostings/tree/master/SimpleStateless
Email from Azure during auto-scale event
Each time an auto-scale event happens, you will receive an email similar to the one below:
WARNING: The values that I have set for my auto-scale ranges are extremely aggressive. This is just so that initially I can see something happening. In an actual production environment, you cannot expect rapid scale-up and scale-down of Service Fabric nodes. Remember, a node (machine) is first created, then the Service Fabric bits are installed on the VM, and then all of that has to be spun up and registered, along with the rest of the Service Fabric magic. You are looking at several minutes for this to take place. If your scale-up and scale-down parameters are too aggressive, you could get the cluster into a situation where it is trying to spin up a node while taking down the same node. Your node will then go into an unknown state that you will have to correct manually by turning off auto-scaling until things settle down.