Category Archives: Operations

Out of space – first steps when a datastore runs out of space

This is a situation that never should happen, and I had to deal with it only a couple of times in more than 10y working with VMware vSphere/ ESXi. In most cases, the reason for this was the usage of thin-provisioned disks together with small datastores. Yes, that’s a bad design. Yes, this should never happen.

There is a nearly 100% chance that this setup will fail one day. Either because someone dumps much data into the VMs, or because of VM snapshots. But such a setip WILL FAIL one day.

Yesterday was one of these days and five VMs have stopped working on a small ESXi in a site of one of my customers. A quick look into the vCenter confirmed my first assumption. The datastore was full. My second thought: Why are there so many VMs on that small ESXi host, and why they are thin-provisioned?

The vCenter showed me the following message on each VM:

There is no more space for virtual disk $VMNAME.vmdk. You might be able to continue this session by freeing disk space on the relevant volume, and clicking Retry. Click Cancel to terminate this session.

Okay, what to do? First things first:

  1. Is there any unallocated space left on the RAID group? If yes, expand the VMFS.
  2. Are there any VM snapshots left? If yes, remove them
  3. Configure 100% memory reservation for the VMs. This removes the VM memory swap files and releases a decent amout of disk space
  4. Remove ISO files from the datastore
  5. Remove VMs (if you have a backup and they are not necessary for the business)

This should allow you to continue the operation of the VMs. To solve the problem permanently:

  1. Add disks to the server and expand the VMFS, or create a new datastore
  2. Add a NFS datastore
  3. Remove unnecessary VMs
  4. Setup a working monitoring , setup alarms, do not overprovision datastores, or switch to eager-zeroed disks

Such an issues should not happen. It is not rude to say here: This is simply due to bad design and lack of operational processes.

Veeam Backup & Replication: Backup of Microsoft Active Directory Domain Controller VMs

To backup a virtual machine, Veeam Backup & Replication needs two permissions:

  • permission to access and backup the VM, as well as the
  • permission to do specific tasks inside the VM

to guarantee a consistent backup. The former persmission is granted by the user account that is used to access the VMware vCenter server (sorry for the VMW focust at this point). Usually, this account has the Administrator role granted at the vCenter Server level. The latter permission is granted by a user account that has permissions inside the guest operating system.

geralt / pixabay.com/ Creative Commons CC0

Something I often see in customer environments is the usage of the Domain Administrator account. But why? Because everything works when this account is used!

There are two reasons for this:

  • This account is part of the local Administrator group on every server and client
  • customers tend to grant the Administrator role to the Domain Admins group on vCenter Server level

In simple words: Many customers use the same account to connect to the vCenter, and for the application-aware processing of Veeam Backup & Replication. At least for Windows servers backups.

Houston, we have a problem!

Everything is fine until customers have to secure their environments. One of the very first things customers do, is to protect the Administrator account. And at this point, things might go wrong.

Using a service account to connect to the vCenter server is easy. This can be any account from the Active Directory, or from the embedded VMware SSO domain. I tend to create a dedicated AD-based service account. For the necessary permissions in the vCenter, you can grant this account Administrator permissions, or you can create a new user role in the vCenter. Veeam offers a PDF document which documents the necessary permissions for the different Veeam tasks.

The next challenge is the application-aware processing. For Microsoft SQL Server, the user account must have the sysadmin privileges on the Microsoft SQL Server. For Microsoft Exchange, the user must be member of the local Administrator group. But in case of a Active Directory Domain Contoller things get complicated.

A Domain Controller does not have a local user database (SAM). So what user account or group membership is needed to backup a domain controller using application-aware processing?

This statement is from a great Veeam blog post:

Permissions: Administrative rights for target Active Directory. Account of an enterprise administrator or domain administrator.

So the service account used to backup a domain controller is one of the most powerful accounts in the active directory.

There is no other way. You need a Domain or Enterprise Administrator account. I tend to create a dedicated account for this task.

I recommend to create a service account to connect the vCenter, and which is added to the local Administrator group on the servers to backup, and I create a dedicated Domain/ Enterprise Administrator account to backup the virtual Domain Controllers.

The advantage is that I can change apply different fine-grained password policies to this accounts. Sure, you can add more security by creating more accounts for different servers, and applications, add a dedicated role to the vCenter for Veeam etc. But this apporach is easy enough to implement, and adds a significant amount of user account security to every environment that is still using DOMAIN\Administrator to backup their VMs.

Security: If it doesn’t hurt, you’re doing it wrong!

This posting is ~2 years years old. You should keep this in mind. IT is a short living business. This information might be outdated.

The Informationsverbund Berlin-Bonn (IVBB), the secure network of the german government , was breached by an unknown hacker group. Okay, a secure government network might be a worthy target for an attack, but your network not, right? Do you use the same password for multiple accounts? There were multiple massive data breaches in the past. Have you ever checked if your data were also compromised? I can recommend haveibeenpwned.com. If you want to have some fun, scan GitHub for -----BEGIN RSA PRIVATE KEY-----. Do you use a full disk encryption on your laptop or PC? Do you sign and/ or encrypt emails using S/MIME or PGP? Do you use different passwords for different services? Do you use 2FA/ MFA to secury importan services? Do you never work with admin privileges when doing normal office tasks? No? Why? Because it’s uncomfortable to do it right, isn’t it?

My focus is on infrastructure, and I’m trying to educate my customers that hey have to take care about security. It’s not the missing dedicated management network, or the usage of self signed certificates that makes an infrastructre unsecure. Mostly it’s the missing user management, the same password for different admin users, doing office work with admin privileges, or missing security patches because of “never touch a running system”, or “don’t ruin my uptime”. I don’t khow how often I heard the story of ransomware attacks, that were caused by admins opening email attachments with admin privileges…

My theory

Security must approach infinitely near the point, where it becomes unusable.

Correlation Between Security and Comfort

Patrick Terlisten/ www.vcloudnine.de/ Creative Commons CC0

Security is nothing you can take care about later. It has to be part of the design. It has to be part of the processes. Most security incidents doesn’t happen because of 0-day exploits. It’s because of default passwords for admin accounts, missing security patches, and because of lazy admins or developers.

Don’t be lazy. Do it right. Even if it’s uncomfortable.

Why “Patch Tuesday” is only every four weeks – or never

This posting is ~2 years years old. You should keep this in mind. IT is a short living business. This information might be outdated.

Today, this tweet caught my attention.

Patch management is currently a hot topic, primarily because of the latest ransomware attacks.

After appearance of WannaCry, one of my older blog posts got unfamiliar attention: WSUS on Windows 2012 (R2) and KB3159706 – WSUS console fails to connect. Why? My guess: Many admins started updating their Windows servers after appearance of WannaCry. Nearly a year after Microsoft has published KB3159706, their WSUS servers ran into this issue.

The truth about patch management

I know many enterprises, that patch their Windows clients and servers only every four or eight weeks, mostly during a maintenance window. Some of them do this, because their change processes require the deployment and test of updates in a test environment. But some of them are simply to lazy to install updates more frequent. So they simply approve all needed updates every four or eight weeks, push them to their servers, and reboot them.

Trond mentioned golden images and templates in his blog posts. I strongly agree to what he wrote, because this is something I see quite often: You deploy a server from a template, and the newly deployed server has to install 172 updates. This is because the template was never updated since creation. But I also know companies that don’t use templates, or goldes master images. They simply create a new VM, mount an ISO file, and install the server from scratch. And because it’s urgent, the server is not patched when it goes into production.

Sorry, but that’s the truth about patch management: Either it is made irregular, made in too long intervals, or not made at all.

Change Management from hell

Frameworks, such as ITIL, play also their part in this tragedy. Applying change management processes to something like patch managent prevents companies to respond quickly to threats. If your change management process prevents you from deploying critical security patches ASAP, you have a problem –  a problem with your change management process.

If your change management process requires the deployment from patches in a test environment, you should change your change mangement process. What is the bigger risk? Deploying a faulty patch, or being the victim of an upcoming ransomware attack?

Microsoft Windows Server Update Service (WSUS) offers a way to automatically approve patches. This is something you want! You want to automatically approve critical security patches. And you also want that your servers automatically install these updates, and restart if necessary. If you can’t restart servers automatically when required, you need short maintenance windows every week to reboot these servers. If this is not possible at all, you have a problem with your infrastructure design. And this does not only apply to Microsoft updates. This applies to ALL systems in your environment. VMware ESXi hosts with uptimes > 100 days are not a sign of stability. It’s a sign of missing patches.

Validated environments are ransomwares best friends

This is another topic I meet regularly: Validated environments. An environmentsthat was installed with specific applications, in a specifig setup. This setup was tested according to a checklist, and it’s function was documented. At the end of this process, you have a validated environments and most vendors doesn’t support changes to this environments without a new validation process. Sorry, but this is pain in the rear! If you can’t update such an environment, place it behind a firewall, disconnect it from your network, and prohibit the use of removable media such as USB sticks. Do not allow this environment to be Ground Zero for a ransomware attack.

I know many environments with Windows 2000, XP, 2003, or even older stuff, that is used to run production facilities, test stands, or machinery. Partially, the software/ hardware vendor is no longer existing, thus making the application, that is needed to keep the machinery running, another security risk.

Patch quick, patch often

IT departments should install patches more often, and short after the release. The risk of deploying a faulty patch is lower than the risk of being hit by a security attack. Especially when we are talking about critical security patches.

IT departments should focus on the value that they deliver to the business. IT services that are down due to a security attack can’t deliver any value. Security breaches in general, are bad for reputation and revenue. If your customers and users complain about frequent maintenance windows due to critical security patches, you should improve your communication about why this is important.

Famous last words

I don’t install Microsoft patches instantly. Some years ago, Microsoft has published a patch that causes problems. Imagine, that a patch would cause our users can’t print?! That would be bad!

We don’t have time to install updates more often. We have to work off tickets.

We don’t have to automate our server deployment. We deploy only x servers a week/ month/ year.

We have a firewall from $VENDOR.