Having deployed a large Azure VMware Solution (AVS) environment last year, with just under 300 VMs across 7 nodes, we started seeing issues with our backup infrastructure: it was performing poorly and in some cases failing completely.
Before I detail my findings and the solution, I will give a brief overview of how AVS backups work. Microsoft Azure Backup Server (MABS) is the software used to connect to the AVS cluster and take, process and store the backups. MABS is essentially a slightly scaled-down version of Microsoft Data Protection Manager (DPM), which is commonplace in on-prem environments.
In order to function, MABS needs to run on one or more dedicated virtual machines, and therein lay the problem we were experiencing: scaling to manage capacity.
Initially we opted to run a single backup server instance with a view to scaling up to meet demand. This approach does have a few benefits, a key one being fully automatic enrollment of VMs into the backup policy at the root level, e.g. when a new folder is created in AVS it is automatically detected and added to the backup schedule. Remember that, we will come back to it later!
After some time using the single VM instance / scaling up approach, we noticed seemingly random backup failures, which on occasion caused the backup server VM to lock up and require a reboot. After some investigation it seemed this was caused by the VM running out of resources (CPU and RAM), so back to the drawing board!
Having experienced numerous failures and a general lack of reliability with the single instance approach, we decided to split the load across two instances, each with its own backup policies. This gave us the flexibility to stagger the backup times on each policy / instance, and we could then scale out as the AVS environment grew.
Now, remember I mentioned the automatic detection of new VMs / folders at the root level when we were using the single instance… well, we started building the first of the replacement split load backup servers and were surprised to find we couldn’t enroll any of the AVS VMs into the new policy. They showed an error saying they were already claimed by the initial single instance backup server, despite the fact that server had nothing enrolled in its backup policies!
We tried deleting the MABS local backup data, clearing the Recovery Services vault (RSV) and even removing all the MABS backup policies entirely. All to no avail.
Then I remembered… whenever a VM / folder is enrolled in a MABS backup, a custom attribute “DPMServer” is set in vCenter containing, you guessed it… the DPM server FQDN. The presence of this attribute effectively locks that VM / folder to the DPM server specified in the attribute value.
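To see which objects are still claimed, the check comes down to matching each object's custom values against the key of the “DPMServer” attribute definition. The following is a minimal sketch of that lookup, not a tested tool: the function name find_claims and the sample data are hypothetical, and the objects are simple stand-ins shaped like what pyVmomi exposes (definitions with key and name, entities with a customValue list of key/value entries). Against a real vCenter you would pass in the customFieldsManager.field list and the VMs / folders you retrieved yourself.

```python
from types import SimpleNamespace

ATTRIBUTE_NAME = "DPMServer"  # attribute name as described in this post


def find_claims(field_defs, entities, attribute_name=ATTRIBUTE_NAME):
    """Return {entity name: DPM server FQDN} for entities carrying the attribute."""
    # Custom values reference their definition by numeric key, not by name.
    keys = {d.key for d in field_defs if d.name == attribute_name}
    claims = {}
    for entity in entities:
        for custom_value in entity.customValue or []:
            if custom_value.key in keys and custom_value.value:
                claims[entity.name] = custom_value.value
    return claims


# Hypothetical sample data standing in for vCenter objects.
fields = [SimpleNamespace(key=101, name="DPMServer")]
entities = [
    SimpleNamespace(
        name="vm-a",
        customValue=[SimpleNamespace(key=101, value="mabs01.example.local")],
    ),
    SimpleNamespace(name="vm-b", customValue=[]),
]

claims = find_claims(fields, entities)
print(claims)
```

Anything that appears in the result is still pointing at a backup server, regardless of what that server's own policies say.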
So we need to remove the DPMServer attribute from the custom attributes. However, as with most things in AVS, it's not that simple! Even with the highest level of permission available (cloudadmin), I have not found a way to completely remove a custom attribute.
So here is my solution: rename the DPMServer custom attribute on the VMs / folders you want to re-enroll, for example “DPMServer” could become “DPMServerOld”. This means MABS can't see the DPMServer attribute anymore, so it's effectively just as good as a delete. In our case it was set at the vCenter root folder to enable the auto enrollment, so after renaming the attribute all the folders were once again able to be selected in MABS.
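If you would rather script the rename than click through the vSphere client, the vSphere API's CustomFieldsManager has a RenameCustomFieldDef(key, name) call. Here is a minimal sketch of the logic around it, under some assumptions: rename_attribute and FakeCustomFieldsManager are hypothetical names, the fake only exists so the example runs without a vCenter, and I have not confirmed this exact script against AVS with cloudadmin rights. Also bear in mind that, as far as I understand the API, the rename applies to the attribute definition itself and so affects every object carrying that attribute.

```python
from types import SimpleNamespace


class FakeCustomFieldsManager:
    """Offline stand-in mimicking the call shape of vCenter's CustomFieldsManager."""

    def __init__(self, names):
        self.field = [
            SimpleNamespace(key=100 + i, name=n) for i, n in enumerate(names)
        ]

    def RenameCustomFieldDef(self, key, name):
        for definition in self.field:
            if definition.key == key:
                definition.name = name


def rename_attribute(manager, old_name="DPMServer", new_name="DPMServerOld"):
    """Rename one custom attribute definition and return its key."""
    definitions = list(manager.field or [])
    # Refuse to clash with an attribute left over from an earlier rename.
    if any(d.name == new_name for d in definitions):
        raise ValueError(f"{new_name!r} already exists; pick another name")
    matches = [d for d in definitions if d.name == old_name]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one {old_name!r} definition, found {len(matches)}"
        )
    manager.RenameCustomFieldDef(key=matches[0].key, name=new_name)
    return matches[0].key


manager = FakeCustomFieldsManager(["Owner", "DPMServer"])
renamed_key = rename_attribute(manager)
print(renamed_key, [d.name for d in manager.field])
```

With pyVmomi, the manager argument would be the customFieldsManager from your own connected session rather than the fake. The clash check matters if you repeat the trick later, since “DPMServerOld” will already be taken.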
The only thing to consider with this approach is that you could end up with lots of messy, unused old attributes. If you are only working on a small number of VMs then deleting the DPMServer attribute value would also work. However, the value is not inherited, so using this approach on a folder won't apply down to child VMs, and if you have hundreds of child VMs, going through them one by one and deleting the value is not feasible!
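If you do want to go down the route of clearing the value instead, a script can at least do the one-by-one part for you. The sketch below loops over a list of objects and blanks the value wherever it is set, using the shape of the vSphere API's CustomFieldsManager.SetField(entity, key, value) call. The names clear_attribute_values and FakeCustomFieldsManager are hypothetical, the fake is only there so the example runs offline, and I am assuming that writing an empty string behaves the same as deleting the value by hand, which I have not verified against MABS.

```python
from types import SimpleNamespace


class FakeCustomFieldsManager:
    """Offline stand-in mimicking the call shape of vCenter's CustomFieldsManager."""

    def __init__(self):
        self.field = [SimpleNamespace(key=101, name="DPMServer")]

    def SetField(self, entity, key, value):
        for custom_value in entity.customValue:
            if custom_value.key == key:
                custom_value.value = value


def clear_attribute_values(manager, entities, attribute_name="DPMServer"):
    """Blank the attribute on every listed entity; return the names touched."""
    keys = {d.key for d in manager.field or [] if d.name == attribute_name}
    cleared = []
    # Values are not inherited, so the folder and each child VM are handled separately.
    for entity in entities:
        for custom_value in entity.customValue or []:
            if custom_value.key in keys and custom_value.value:
                manager.SetField(entity=entity, key=custom_value.key, value="")
                cleared.append(entity.name)
    return cleared


def claimed(name):
    """Build a hypothetical entity that is claimed by a backup server."""
    value = SimpleNamespace(key=101, value="mabs01.example.local")
    return SimpleNamespace(name=name, customValue=[value])


manager = FakeCustomFieldsManager()
entities = [claimed("Prod"), claimed("vm-a"), SimpleNamespace(name="vm-b", customValue=[])]
cleared = clear_attribute_values(manager, entities)
print(cleared)
```

You would still need to gather the full list of child VMs yourself, for example from a container view in pyVmomi, because nothing cascades down from the folder.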