Tag Archives: operations management

Why “Patch Tuesday” is only every four weeks – or never

This posting is ~6 years old. You should keep this in mind. IT is a short-lived business. This information might be outdated.

Today, this tweet caught my attention.

Patch management is currently a hot topic, primarily because of the latest ransomware attacks.

After the appearance of WannaCry, one of my older blog posts got unusual attention: WSUS on Windows 2012 (R2) and KB3159706 – WSUS console fails to connect. Why? My guess: Many admins started updating their Windows servers after WannaCry appeared. Nearly a year after Microsoft published KB3159706, their WSUS servers ran into this issue.

The truth about patch management

I know many enterprises that patch their Windows clients and servers only every four or eight weeks, mostly during a maintenance window. Some of them do this because their change processes require updates to be deployed and tested in a test environment first. But some of them are simply too lazy to install updates more frequently. So they approve all needed updates every four or eight weeks, push them to their servers, and reboot them.

Trond mentioned golden images and templates in his blog posts. I strongly agree with what he wrote, because this is something I see quite often: You deploy a server from a template, and the newly deployed server has to install 172 updates. This is because the template has never been updated since its creation. But I also know companies that don’t use templates or golden master images. They simply create a new VM, mount an ISO file, and install the server from scratch. And because it’s urgent, the server is not patched when it goes into production.

Sorry, but that’s the truth about patch management: Either it is done irregularly, done at intervals that are too long, or not done at all.

Change Management from hell

Frameworks such as ITIL also play their part in this tragedy. Applying change management processes to something like patch management prevents companies from responding quickly to threats. If your change management process prevents you from deploying critical security patches ASAP, you have a problem: a problem with your change management process.

If your change management process requires the deployment of patches in a test environment first, you should change your change management process. What is the bigger risk? Deploying a faulty patch, or being the victim of an upcoming ransomware attack?

Microsoft Windows Server Update Services (WSUS) offers a way to approve patches automatically. This is something you want! You want to approve critical security patches automatically. And you also want your servers to install these updates automatically and restart if necessary. If you can’t restart servers automatically when required, you need short maintenance windows every week to reboot these servers. If this is not possible at all, you have a problem with your infrastructure design. And this does not only apply to Microsoft updates. This applies to ALL systems in your environment. VMware ESXi hosts with uptimes > 100 days are not a sign of stability. They are a sign of missing patches.

Validated environments are ransomware’s best friends

This is another topic I meet regularly: validated environments. A validated environment is one that was installed with specific applications in a specific setup. This setup was tested according to a checklist, and its function was documented. At the end of this process, you have a validated environment, and most vendors don’t support changes to this environment without a new validation process. Sorry, but this is a pain in the rear! If you can’t update such an environment, place it behind a firewall, disconnect it from your network, and prohibit the use of removable media such as USB sticks. Do not allow this environment to be Ground Zero for a ransomware attack.

I know many environments with Windows 2000, XP, 2003, or even older systems that are used to run production facilities, test stands, or machinery. In some cases, the software or hardware vendor no longer exists, which turns the application that is needed to keep the machinery running into another security risk.

Patch quick, patch often

IT departments should install patches more often, and shortly after their release. The risk of deploying a faulty patch is lower than the risk of being hit by a security attack, especially when we are talking about critical security patches.

IT departments should focus on the value that they deliver to the business. IT services that are down due to a security attack can’t deliver any value. Security breaches in general are bad for reputation and revenue. If your customers and users complain about frequent maintenance windows due to critical security patches, you should improve your communication about why this is important.

Famous last words

I don’t install Microsoft patches instantly. Some years ago, Microsoft published a patch that caused problems. Imagine a patch that stops our users from printing?! That would be bad!

We don’t have time to install updates more often. We have to work off tickets.

We don’t have to automate our server deployment. We deploy only x servers a week/month/year.

We have a firewall from $VENDOR.

How to monitor ESXi host hardware with SNMP

This posting is ~7 years old. You should keep this in mind. IT is a short-lived business. This information might be outdated.

The Simple Network Management Protocol (SNMP) is a protocol for monitoring and configuring network-attached devices. SNMP exposes data in the form of variables and values. These variables can then be queried or set. A query retrieves the value of a variable; a set operation assigns a value to a variable. The variables are organized in a hierarchy, and each variable is identified by an object identifier (OID). The management information base (MIB) describes this hierarchy. MIB files (simple text files) contain metadata for each OID. They are necessary to translate a numeric OID into a human-readable name. SNMP knows two types of devices:

  • the managed device which runs the SNMP agent
  • the network management station (NMS) which runs the management software

The NMS queries the SNMP agent with GET requests. Configuration changes are made using SET requests. The SNMP agent can inform the NMS about state changes using an SNMP trap message. The simplest form of authentication is the SNMP community string.
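To make these terms concrete, here is a minimal sketch in Python of the model described above: variables addressed by OIDs, a GET, a SET, and a walk over a subtree. It is a toy in-memory model, not an SNMP implementation. The class `ToyAgent`, its methods, and the sample values are assumptions made up for illustration; only the OIDs themselves are the standard ones for sysContact, sysName, and ifOperStatus.

```python
# Toy in-memory model of an SNMP agent's variables (hypothetical, for illustration only).
# OIDs are stored as tuples of integers so that sorting them yields walk order.

def parse_oid(text):
    """Turn '.1.3.6.1.2.1.1.5.0' into (1, 3, 6, 1, 2, 1, 1, 5, 0)."""
    return tuple(int(part) for part in text.strip(".").split("."))


class ToyAgent:
    def __init__(self, variables):
        self.variables = {parse_oid(oid): value for oid, value in variables.items()}

    def get(self, oid):
        """GET: return the value of exactly one variable."""
        return self.variables[parse_oid(oid)]

    def set(self, oid, value):
        """SET: assign a new value to an existing variable."""
        key = parse_oid(oid)
        if key not in self.variables:
            raise KeyError(oid)
        self.variables[key] = value

    def walk(self, subtree):
        """Walk: return every variable below the given subtree, in OID order."""
        prefix = parse_oid(subtree)
        return [
            (oid, value)
            for oid, value in sorted(self.variables.items())
            if oid[:len(prefix)] == prefix
        ]


agent = ToyAgent({
    ".1.3.6.1.2.1.1.5.0": "esx1",     # SNMPv2-MIB::sysName.0
    ".1.3.6.1.2.1.1.4.0": "",         # SNMPv2-MIB::sysContact.0
    ".1.3.6.1.2.1.2.2.1.8.1": 1,      # IF-MIB::ifOperStatus.1, 1 = up
    ".1.3.6.1.2.1.2.2.1.8.2": 2,      # IF-MIB::ifOperStatus.2, 2 = down
})

agent.set(".1.3.6.1.2.1.1.4.0", "IT operations")
print(agent.get(".1.3.6.1.2.1.1.4.0"))
for oid, value in agent.walk(".1.3.6.1.2.1.2"):
    print("." + ".".join(str(part) for part in oid), "=", value)
```

A real agent does the same thing over the network: the NMS sends the OID, and the agent answers with the value stored under it.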

SNMP is pretty handy and it’s still used, especially for monitoring and managing networking components. SNMP has the benefit that it’s very lightweight. Monitoring a system with WBEM or an API can cause slightly more load compared to SNMP. Furthermore, SNMP is an Internet standard protocol. Nearly every device supports SNMP.

Monitoring host hardware with SNMP

Why should I monitor my ESXi host hardware with SNMP? The vCenter Server can trigger an alarm and most customers use applications like VMware vRealize Operations, Microsoft System Center Operations Manager, or HPE Systems Insight Manager (SIM). There are better ways to monitor the overall health of an ESXi host. But sometimes you want to get some stats about the network interfaces (throughput), or you have a script that should do something, if a NIC goes down or something else happens. Again, SNMP is very resource-friendly and widely supported.

Configure SNMP on ESXi

I focus on ESXi 5.1 and beyond. The ESXi host acts as the SNMP agent. We don’t configure traps or trap destinations. We just want to poll the SNMP agent using SNMP GET requests. The configuration is done using esxcli. First of all, we need to set a community string and enable SNMP.

[[email protected]:~] esxcli system snmp set -c public -e true
[[email protected]:~] esxcli system snmp get
   Communities: public
   Enable: true
   Engineid: 00000063000000a100000000
   Hwsrc: indications
   Largestorage: true
   Loglevel: info
   Port: 161

That’s it! The necessary firewall ports and services are opened and started automatically.

Querying the SNMP agent

I use a CentOS VM to show you some queries. The Net-SNMP package contains the tools snmpwalk and snmpget. To install the Net-SNMP utils, simply use yum.

[[email protected] ~]# yum install net-snmp-utils.x86_64

Download the VMware SNMP MIB files, extract the ZIP file, and copy the content to /usr/share/snmp/mibs.

[[email protected] mibs]# ls -lt
total 3852
-rw-r--r--. 1 root root  50968 Jun  3 17:05 BRIDGE-MIB.mib
-rw-r--r--. 1 root root  59268 Jun  3 17:05 ENTITY-MIB.mib
-rw-r--r--. 1 root root  52586 Jun  3 17:05 HOST-RESOURCES-MIB.mib
-rw-r--r--. 1 root root  10583 Jun  3 17:05 HOST-RESOURCES-TYPES.mib
-rw-r--r--. 1 root root   7309 Jun  3 17:05 IANA-ADDRESS-FAMILY-NUMBERS-MIB.mib
-rw-r--r--. 1 root root  33324 Jun  3 17:05 IANAifType-MIB.mib
-rw-r--r--. 1 root root   3890 Jun  3 17:05 IANA-RTPROTO-MIB.mib
-rw-r--r--. 1 root root  76268 Jun  3 17:05 IEEE8021-BRIDGE-MIB.mib
-rw-r--r--. 1 root root  89275 Jun  3 17:05 IEEE8021-Q-BRIDGE-MIB.mib
-rw-r--r--. 1 root root  16082 Jun  3 17:05 IEEE8021-TC-MIB.mib
-rw-r--r--. 1 root root  44543 Jun  3 17:05 IEEE8023-LAG-MIB.mib
-rw-r--r--. 1 root root  71747 Jun  3 17:05 IF-MIB.mib
-rw-r--r--. 1 root root  16782 Jun  3 17:05 INET-ADDRESS-MIB.mib
-rw-r--r--. 1 root root  46405 Jun  3 17:05 IP-FORWARD-MIB.mib
-rw-r--r--. 1 root root 185967 Jun  3 17:05 IP-MIB.mib
-rw-r--r--. 1 root root    229 Jun  3 17:05 list-ids-diagnostics.txt
-rw-r--r--. 1 root root  77406 Jun  3 17:05 LLDP-V2-MIB.mib
-rw-r--r--. 1 root root  16108 Jun  3 17:05 LLDP-V2-TC-MIB.mib
-rw-r--r--. 1 root root  23777 Jun  3 17:05 notifications.txt
-rw-r--r--. 1 root root  39918 Jun  3 17:05 P-BRIDGE-MIB.mib
-rw-r--r--. 1 root root  84172 Jun  3 17:05 Q-BRIDGE-MIB.mib
-rw-r--r--. 1 root root   1465 Jun  3 17:05 README
-rw-r--r--. 1 root root 223872 Jun  3 17:05 RMON2-MIB.mib
-rw-r--r--. 1 root root 148032 Jun  3 17:05 RMON-MIB.mib
-rw-r--r--. 1 root root  22342 Jun  3 17:05 SNMP-FRAMEWORK-MIB.mib
-rw-r--r--. 1 root root   5543 Jun  3 17:05 SNMP-MPD-MIB.mib
-rw-r--r--. 1 root root   8259 Jun  3 17:05 SNMPv2-CONF.mib
-rw-r--r--. 1 root root  31588 Jun  3 17:05 SNMPv2-MIB.mib
-rw-r--r--. 1 root root   8932 Jun  3 17:05 SNMPv2-SMI.mib
-rw-r--r--. 1 root root  38048 Jun  3 17:05 SNMPv2-TC.mib
-rw-r--r--. 1 root root  28647 Jun  3 17:05 TCP-MIB.mib
-rw-r--r--. 1 root root  93608 Jun  3 17:05 TOKEN-RING-RMON-MIB.mib
-rw-r--r--. 1 root root  20951 Jun  3 17:05 UDP-MIB.mib
-rw-r--r--. 1 root root   3175 Jun  3 17:05 UUID-TC-MIB.mib
-rw-r--r--. 1 root root   2326 Jun  3 17:05 VMWARE-CIMOM-MIB.mib
-rw-r--r--. 1 root root  22411 Jun  3 17:05 VMWARE-ENV-MIB.mib
-rw-r--r--. 1 root root  53480 Jun  3 17:05 VMWARE-ESX-AGENTCAP-MIB.mib
-rw-r--r--. 1 root root   2328 Jun  3 17:05 VMWARE-HEARTBEAT-MIB.mib
-rw-r--r--. 1 root root   1699 Jun  3 17:05 VMWARE-NSX-MANAGER-AGENTCAP-MIB.mib
-rw-r--r--. 1 root root 146953 Jun  3 17:05 VMWARE-NSX-MANAGER-MIB.mib
-rw-r--r--. 1 root root  15641 Jun  3 17:05 VMWARE-OBSOLETE-MIB.mib
-rw-r--r--. 1 root root   2173 Jun  3 17:05 VMWARE-PRODUCTS-MIB.mib
-rw-r--r--. 1 root root   8305 Jun  3 17:05 VMWARE-RESOURCES-MIB.mib
-rw-r--r--. 1 root root   3736 Jun  3 17:05 VMWARE-ROOT-MIB.mib
-rw-r--r--. 1 root root  11142 Jun  3 17:05 VMWARE-SRM-EVENT-MIB.mib
-rw-r--r--. 1 root root   3872 Jun  3 17:05 VMWARE-SYSTEM-MIB.mib
-rw-r--r--. 1 root root   7017 Jun  3 17:05 VMWARE-TC-MIB.mib
-rw-r--r--. 1 root root   7611 Jun  3 17:05 VMWARE-VA-AGENTCAP-MIB.mib
-rw-r--r--. 1 root root   8777 Jun  3 17:05 VMWARE-VC-EVENT-MIB.mib
-rw-r--r--. 1 root root  38576 Jun  3 17:05 VMWARE-VCOPS-EVENT-MIB.mib
-rw-r--r--. 1 root root  26952 Jun  3 17:05 VMWARE-VMINFO-MIB.mib

Now we can use snmpwalk to “walk down the hierarchy”. This is only a small part of the complete output. The complete snmpwalk output has more than 4000 lines!

[[email protected] mibs]# snmpwalk -m ALL -c public -v 2c esx1.lab.local
SNMPv2-MIB::sysDescr.0 = STRING: VMware ESXi 6.0.0 build-3825889 VMware, Inc. x86_64
DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (402700) 1:07:07.00
SNMPv2-MIB::sysContact.0 = STRING:
SNMPv2-MIB::sysName.0 = STRING: esx1
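The sysUpTimeInstance line is a handy value for scripts. Timeticks count hundredths of a second, so 402700 ticks are 4027 seconds. The following Python sketch converts such a line into a duration and compares it with a threshold. The function names and the 100-day limit are assumptions for illustration. Keep in mind that sysUpTime counts from the start of the SNMP agent, which is not necessarily the boot time of the host.

```python
import re
from datetime import timedelta

# Matches the raw tick counter in a line such as "... = Timeticks: (402700) 1:07:07.00".
TIMETICKS = re.compile(r"Timeticks: \((\d+)\)")

# Assumed threshold, chosen only for this example.
MAX_UPTIME = timedelta(days=100)


def uptime_from_line(line):
    """Extract the tick counter and convert it (1 tick = 10 ms) to a timedelta."""
    match = TIMETICKS.search(line)
    if match is None:
        raise ValueError("no Timeticks value found in: " + line)
    return timedelta(milliseconds=int(match.group(1)) * 10)


def needs_attention(uptime, limit=MAX_UPTIME):
    """True if the agent has been running longer than the limit."""
    return uptime > limit


sample = "DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (402700) 1:07:07.00"
uptime = uptime_from_line(sample)
print(uptime, needs_attention(uptime))
```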

Now we can search for interesting parts. If you want to monitor the link status of the NICs, try this:

[[email protected] mibs]# snmpwalk -m ALL -c public -v 2c esx1.lab.local IF-MIB::ifDescr
IF-MIB::ifDescr.1 = STRING: Device vmnic0 at 03:00.0 bnx2
IF-MIB::ifDescr.2 = STRING: Device vmnic1 at 03:00.1 bnx2
IF-MIB::ifDescr.3 = STRING: Device vmnic2 at 04:00.0 bnx2
IF-MIB::ifDescr.4 = STRING: Device vmnic3 at 04:00.1 bnx2
IF-MIB::ifDescr.5 = STRING: Device vmnic4 at 06:00.0 bnx2
IF-MIB::ifDescr.6 = STRING: Device vmnic5 at 06:00.1 bnx2
IF-MIB::ifDescr.7 = STRING: Distributed Virtual VMware switch: DvsPortset-0
IF-MIB::ifDescr.8 = STRING: Virtual interface: vmk0 on port 33554442 DVS 6b a0 37 50 c6 24 04 b8-25 08 f5 ea 32 ef 48 27
IF-MIB::ifDescr.9 = STRING: Virtual interface: vmk1 on port 33554443 DVS 6b a0 37 50 c6 24 04 b8-25 08 f5 ea 32 ef 48 27
IF-MIB::ifDescr.10 = STRING: Virtual interface: vmk2 on port 33554444 DVS 6b a0 37 50 c6 24 04 b8-25 08 f5 ea 32 ef 48 27
IF-MIB::ifDescr.11 = STRING: Virtual interface: vmk3 on port 33554445 DVS 6b a0 37 50 c6 24 04 b8-25 08 f5 ea 32 ef 48 27

As you can see, I used a subtree of the whole hierarchy (IF-MIB::ifDescr). This is the “translated” OID. To get the numeric OID, you have to add the option -O fn to snmpwalk.

[[email protected] mibs]# snmpwalk -O fn -m ALL -c public -v 2c esx1.lab.local IF-MIB::ifDescr
.1.3.6.1.2.1.2.2.1.2.1 = STRING: Device vmnic0 at 03:00.0 bnx2
.1.3.6.1.2.1.2.2.1.2.2 = STRING: Device vmnic1 at 03:00.1 bnx2
.1.3.6.1.2.1.2.2.1.2.3 = STRING: Device vmnic2 at 04:00.0 bnx2
.1.3.6.1.2.1.2.2.1.2.4 = STRING: Device vmnic3 at 04:00.1 bnx2
.1.3.6.1.2.1.2.2.1.2.5 = STRING: Device vmnic4 at 06:00.0 bnx2
.1.3.6.1.2.1.2.2.1.2.6 = STRING: Device vmnic5 at 06:00.1 bnx2
.1.3.6.1.2.1.2.2.1.2.7 = STRING: Distributed Virtual VMware switch: DvsPortset-0
.1.3.6.1.2.1.2.2.1.2.8 = STRING: Virtual interface: vmk0 on port 33554442 DVS 6b a0 37 50 c6 24 04 b8-25 08 f5 ea 32 ef 48 27
.1.3.6.1.2.1.2.2.1.2.9 = STRING: Virtual interface: vmk1 on port 33554443 DVS 6b a0 37 50 c6 24 04 b8-25 08 f5 ea 32 ef 48 27
.1.3.6.1.2.1.2.2.1.2.10 = STRING: Virtual interface: vmk2 on port 33554444 DVS 6b a0 37 50 c6 24 04 b8-25 08 f5 ea 32 ef 48 27
.1.3.6.1.2.1.2.2.1.2.11 = STRING: Virtual interface: vmk3 on port 33554445 DVS 6b a0 37 50 c6 24 04 b8-25 08 f5 ea 32 ef 48 27

You can use snmptranslate to translate an OID in either direction.

[[email protected] mibs]# snmptranslate .1.3.6.1.2.1.2.2.1.2
[[email protected] mibs]# snmptranslate -O fn IF-MIB::ifDescr

So far, we have only the description of the interfaces. With a little searching, we find the status of the interfaces (I stripped the output).

IF-MIB::ifOperStatus.1 = INTEGER: up(1)
IF-MIB::ifOperStatus.2 = INTEGER: up(1)
IF-MIB::ifOperStatus.3 = INTEGER: down(2)
IF-MIB::ifOperStatus.4 = INTEGER: down(2)
IF-MIB::ifOperStatus.5 = INTEGER: up(1)
IF-MIB::ifOperStatus.6 = INTEGER: up(1)

ifOperStatus.1 corresponds to ifDescr.1, ifOperStatus.2 corresponds to ifDescr.2, and so on. The ifOperStatus matches the status of the NICs in the vSphere Web Client.
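Because both columns share the same index, a script can join them. The following Python sketch parses snmpwalk output in the textual format shown above and lists the interfaces that are down. It works on captured text, so it needs no SNMP library. The function names are my own, and the sample is a shortened copy of the output above.

```python
import re

# Matches lines such as "IF-MIB::ifDescr.3 = STRING: Device vmnic2 at 04:00.0 bnx2".
LINE = re.compile(r"^IF-MIB::(ifDescr|ifOperStatus)\.(\d+) = \w+: (.*)$")


def parse_if_table(text):
    """Group ifDescr and ifOperStatus values by their shared interface index."""
    table = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            column, index, value = match.groups()
            table.setdefault(int(index), {})[column] = value
    return table


def interfaces_down(table):
    """Return the descriptions of all interfaces whose status starts with 'down'."""
    return [
        row.get("ifDescr", f"ifIndex {index}")
        for index, row in sorted(table.items())
        if row.get("ifOperStatus", "").startswith("down")
    ]


SAMPLE = """\
IF-MIB::ifDescr.1 = STRING: Device vmnic0 at 03:00.0 bnx2
IF-MIB::ifDescr.2 = STRING: Device vmnic1 at 03:00.1 bnx2
IF-MIB::ifDescr.3 = STRING: Device vmnic2 at 04:00.0 bnx2
IF-MIB::ifDescr.4 = STRING: Device vmnic3 at 04:00.1 bnx2
IF-MIB::ifOperStatus.1 = INTEGER: up(1)
IF-MIB::ifOperStatus.2 = INTEGER: up(1)
IF-MIB::ifOperStatus.3 = INTEGER: down(2)
IF-MIB::ifOperStatus.4 = INTEGER: down(2)
"""

for description in interfaces_down(parse_if_table(SAMPLE)):
    print("DOWN:", description)
```

In a real script, the text would come from two snmpwalk calls, one for IF-MIB::ifDescr and one for IF-MIB::ifOperStatus.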


If you want to monitor the fans or power supplies, use these OIDs.

HOST-RESOURCES-MIB::hrDeviceDescr.35 = STRING: POWER Power Supply 1
HOST-RESOURCES-MIB::hrDeviceDescr.36 = STRING: POWER Power Supply 2
HOST-RESOURCES-MIB::hrDeviceDescr.37 = STRING: FAN Fan Block 1
HOST-RESOURCES-MIB::hrDeviceDescr.38 = STRING: FAN Fan Block 2
HOST-RESOURCES-MIB::hrDeviceDescr.39 = STRING: FAN Fan Block 3
HOST-RESOURCES-MIB::hrDeviceDescr.40 = STRING: FAN Fan Block 4

HOST-RESOURCES-MIB::hrDeviceStatus.35 = INTEGER: running(2)
HOST-RESOURCES-MIB::hrDeviceStatus.36 = INTEGER: running(2)
HOST-RESOURCES-MIB::hrDeviceStatus.37 = INTEGER: running(2)
HOST-RESOURCES-MIB::hrDeviceStatus.38 = INTEGER: running(2)
HOST-RESOURCES-MIB::hrDeviceStatus.39 = INTEGER: running(2)
HOST-RESOURCES-MIB::hrDeviceStatus.40 = INTEGER: running(2)
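A monitoring script usually needs a single verdict instead of a list of values. The HOST-RESOURCES-MIB defines five values for hrDeviceStatus: unknown(1), running(2), warning(3), testing(4), and down(5). The Python sketch below maps them to an exit code. The mapping to 0 (OK), 1 (warning), and 2 (critical) is an assumption in the style of common monitoring plugins, and the function names are hypothetical.

```python
import re

# hrDeviceStatus values as defined in the HOST-RESOURCES-MIB.
HR_DEVICE_STATUS = {1: "unknown", 2: "running", 3: "warning", 4: "testing", 5: "down"}

# Assumed severity per status: 0 = OK, 1 = warning, 2 = critical.
SEVERITY = {"running": 0, "testing": 0, "warning": 1, "unknown": 1, "down": 2}

# Accepts both "INTEGER: running(2)" and the purely numeric "INTEGER: 2".
STATUS_LINE = re.compile(r"hrDeviceStatus\.(\d+) = INTEGER: (?:\w+\()?(\d+)\)?")


def device_states(text):
    """Map each device index to the name of its status."""
    states = {}
    for match in STATUS_LINE.finditer(text):
        index, code = int(match.group(1)), int(match.group(2))
        states[index] = HR_DEVICE_STATUS.get(code, "unknown")
    return states


def exit_code(states):
    """Return the worst severity of all devices; no data counts as an error."""
    if not states:
        raise ValueError("no hrDeviceStatus values found")
    return max(SEVERITY[state] for state in states.values())


SAMPLE = """\
HOST-RESOURCES-MIB::hrDeviceStatus.35 = INTEGER: running(2)
HOST-RESOURCES-MIB::hrDeviceStatus.36 = INTEGER: running(2)
HOST-RESOURCES-MIB::hrDeviceStatus.37 = INTEGER: running(2)
HOST-RESOURCES-MIB::hrDeviceStatus.38 = INTEGER: running(2)
"""

print(exit_code(device_states(SAMPLE)))
```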

Many possibilities

SNMP offers a simple and lightweight way to monitor a managed device. It’s not a replacement for vCenter, vROps or SCOM. But it can be a useful addition, especially because SNMP is an Internet standard protocol.

Lean ITIL Service Operation

This posting is ~7 years old. You should keep this in mind. IT is a short-lived business. This information might be outdated.

The IT Infrastructure Library (ITIL) is a set of pre-defined processes and common practices (I try to avoid the term “best practice” when talking about ITIL) for IT service management (ITSM).

When I talk with customers about ITIL, they often complain about the overhead of ITSM processes that were designed according to ITIL. I already wrote about this in one of my previous blog posts (Is lean ITSM a myth?). Companies mainly have three problems during the implementation and/or operation of ITIL processes:

  • slow processes
  • complex processes
  • error-prone processes

ITIL doesn’t tell you how to design a process. ITIL is a collection of common practices. Usually, you have someone who helps you to design and implement the processes and functions. If you don’t have an experienced consultant, you might get processes that lead you in the wrong direction: big, fat, complex, ugly, error-prone processes.

In the end, your processes have to deliver value. But I saw so many crappy, slow, and complex processes that don’t deliver any value that I seriously began to doubt ITIL. But again: ITIL isn’t slow, complex or error-prone by default. The processes you design are slow, complex and error-prone. The success of ITSM with ITIL is based on the processes that you design and implement.

The ITIL life cycle

The biggest difference between ITIL v2 and ITIL v3 is that ITIL v3 focuses on the full life cycle of services, covering the entire IT organization.

ITIL v3 Lifecycle

Patrick Terlisten/ www.vcloudnine.de/ Creative Commons CC0

ITIL Service Operation

The ITIL Service Operation phase manages services in supported environments. The ITIL Service Operation volume describes the Service Operation phase as the part of the life cycle where the services and value are actually delivered to end-users and customers. The Service Operation phase is the only phase of the ITIL life cycle that consists of processes and functions.

Processes:

  • Event Management
  • Access Management
  • Request Fulfillment
  • Problem Management
  • Incident Management

Functions:

  • Service Desk
  • Technical Management
  • Application Management
  • IT Operations Management

When designing the processes and functions, it’s important to focus on the delivered value! Without this focus, you will never be able to develop your IT from a manual workshop into a factory.

Value Stream Mapping

Value Stream Mapping is a method for analyzing the current state and designing the future state of a process. Knowledge about the current state is mandatory for the design of the future state. Value Stream Mapping is a well-known method in Lean Management, and it can be applied to any value chain. A value chain is a series of activities that deliver a valuable product or service to a customer.

Value Stream Mapping can be used to analyze ITIL Service Operation processes and functions for potential waste. With the knowledge about potential waste, processes and functions can be optimized.

Lean Management, like the Toyota Production System, distinguishes three types of waste:

  • Mura (waste due to variation)
  • Muri (waste due to overburdening)
  • Muda (transportation, waiting, overproduction, defects, inventory, movement, extra processing)

I’m sure you can find all three types of waste in ITIL Service Operation processes and functions. And because of this, methods and instruments known from Lean Management can help to streamline ITIL Service Operation processes and functions.

Lean Management and ITIL

Lean Management offers a lot of instruments and methods that can be used together with the processes and functions of the ITIL Service Operation phase.

One of the greatest benefits is automation. Automate as much as you can. Kaizen (“improvement”) can be used as part of the Continuous Improvement of ITIL. Kanban can be used as part of the Service Desk, Incident or Problem Management. Problem-solving techniques, like A3, Kepner-Tregoe or 5W, can be used in the Problem and Incident Management processes. FMEA can be used for quality management as part of the Application, Technical and IT Operations Management. Total Productive Maintenance (TPM) can be used as part of the IT Operations Management. And there are many more ways to use methods and instruments of Lean Management as part of the ITIL Service Operation phase.


Every process may suffer from different types of waste. This can be due to bad design or bad execution. This can be a big problem for the processes and functions of ITIL Service Operation, because these processes and functions actually deliver services to end-users and customers. To provide the best possible service quality, you need effective and valuable processes and functions. Value Stream Mapping can help to analyze current processes. With the knowledge about the value-adding activities of the current processes, IT organizations are able to design valuable and waste-free processes and functions. Methods and instruments of Lean Management can help to achieve this.

Lean ITIL Service Operation must be the goal!

Industrialize your IT – after you have done your homework

This posting is ~7 years old. You should keep this in mind. IT is a short-lived business. This information might be outdated.

Today a tweet from Keith Townsend (@CTOAdvisor) has caught my attention:

Keith wrote a nice blog post and I really recommend reading it. His point is that automation enables business agility.

The point of automation is to enable business agility. Business agility isn’t achieved by automating inefficient processes. The start of an IT automation project begins by examining existing processes and eliminating inefficiency.

Agility is the ability of an organization to act flexibly, proactively, adaptively, and with initiative in times of change and uncertainty. Can IT automation enable business agility? No, I don’t think that IT automation can enable business agility.

It’s all about processes

Business processes are the key to business agility. Keith wrote that

Business processes are very organic systems that grow and evolve as a company’s business and culture matures.

and I’m totally fine with this statement. Processes tend to optimize themselves, if you let them. In cases where consultants are hired to improve faulty processes, the processes are often not the problem. The people who run them are often the problem.

Before you think about IT automation, you should optimize your processes. Keith has a similar view:

Automation comes after you’ve established a repeatable process that has few if any points of contention.

Lean processes are the key

A business process has to deliver value. That’s quite clear. But a process has to deliver value and it has to be lean. Otherwise the business would lose its agility.

What are the characteristics of a lean process?

  • It delivers value
  • It respects and involves the people who run the process
  • It’s streamlined and free of waste

Streamlining processes and avoiding waste are the anchors for IT automation! But you should only automate what you have fully understood. Keith mentioned a nice example:

Money and time are better spent implementing an IP management system and continuing the manual processes. The problem wasn’t self-provisioning of VM’s but waiting for IP address assignment.

This example clearly shows that it’s mandatory to check each and every process for potential waste and pain points. Only after this step, you will be able to design a new, lean process.

IT automation can’t enable business agility. But lean business processes can. IT automation can help you to streamline your business processes.

Times are changing

Finally, I would like to refer to an important statement in Keith’s blog post:

All of this goes back to my drumbeat that IT infrastructure practitioners acquire business alongside their technical skill set.

100% agree! The time of the people who sit in dark basements, with their “There are 10 types of people” shirts and the opinion that IT is the navel of the world, is coming to an end.

Solving problems: A structured approach

This posting is ~7 years old. You should keep this in mind. IT is a short-lived business. This information might be outdated.

What is a problem? A problem is an obstacle that has to be overcome. Solving a problem means overcoming obstacles. Or more generally: Problem solving is the process of getting from an unsatisfactory situation to a satisfactory one.

Most of us get paid for solving problems. It’s irrelevant if you are paid for solving technical problems (e.g. “My computer doesn’t work”), or if you are paid to create solutions for customers (e.g. designing the infrastructure for a Citrix XenApp farm). In the end, you solve a problem.

Every problem has characteristics, that can be used to describe it.

  • Solubility

Not every problem is solvable. Think about “Squaring the circle”. But often a problem seems to be unsolvable because it’s not well defined. If the initial situation, the obstacle, and the target situation are not clearly formulated, you won’t be able to solve the problem.

  • Decomposability

If you can decompose a problem into multiple subproblems, it is a hierarchical problem. Otherwise, it’s an elemental problem.

  • Effort

The effort to solve a problem differs from problem to problem. A problem may be theoretically solvable, but require such a high effort that it is practically unsolvable.

  • Subjectivity

Even if a problem is well defined, different people perceive its complexity differently.

How to start?

First of all

  • Understand and define the problem

This is the most important part. Before you try to solve a problem, make sure that you have really understood the problem. Then you should define the problem. Only a clearly defined problem can be solved. And it’s much easier to solve a clearly defined problem than a vague one. If it’s a complex problem, then you should try to

  • Simplify or decompose the problem

A simplified problem can help you stay focused. If you can’t simplify a problem, you can try to divide it into subproblems. With a clearly defined, simplified/ structured problem, you can start to

  • Find the root cause

Collecting information is the key. Collect information about what happened before, during, and after the problem occurred. Identifying the root cause of a problem can be a time-consuming task. But let me say this clearly: Information is the key. Information that helps to find the root cause is not limited to observations (e.g. logs, error messages etc.). You can also use the results of systematic tests. Collect as much data as you can.

Sometimes it can be useful to create a hypothesis.

Scientists generally base scientific hypotheses on previous observations that cannot satisfactorily be explained with the available scientific theories.

If you see that system A is affected, and system B should be affected too but is not, it might be time to create a hypothesis. With a hypothesis in mind, you can try to prove it. Test the hypothesis by performing tests and collecting data. This strategy is called “hypothesis testing”.

At some point, you should have identified the root cause. With the now known root cause, you can

  • Create solutions and select the best one

Sometimes it’s easy. But sometimes it’s not that easy. A trade-off analysis can help to identify the best of multiple solutions.
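One simple form of a trade-off analysis is a weighted decision matrix: you rate each solution against a few criteria and weight the criteria by importance. The Python sketch below shows the arithmetic. The criteria, weights, candidate solutions, and ratings are invented for this example, and other forms of trade-off analysis exist.

```python
def weighted_scores(weights, options):
    """Return the weighted average rating of each option."""
    total_weight = sum(weights.values())
    return {
        name: sum(weights[criterion] * ratings[criterion] for criterion in weights) / total_weight
        for name, ratings in options.items()
    }


# Hypothetical criteria with their importance (higher = more important).
weights = {"low effort": 2, "low risk": 3, "fixes root cause": 5}

# Hypothetical ratings from 1 (poor) to 5 (good) for each candidate solution.
options = {
    "Disable the feature": {"low effort": 5, "low risk": 4, "fixes root cause": 1},
    "Update the driver": {"low effort": 3, "low risk": 3, "fixes root cause": 5},
    "Replace the hardware": {"low effort": 1, "low risk": 4, "fixes root cause": 5},
}

scores = weighted_scores(weights, options)
best = max(scores, key=scores.get)
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{score:.1f}  {name}")
print("Best option:", best)
```

The numbers are only as good as the ratings, but writing them down makes the reasoning behind a decision visible to others.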

  • Create an action plan

Even if you only have to disable a specific feature, it’s a good idea to formulate an action plan, even if it consists of only three lines. You should state clearly

  • WHAT you do
  • WHY you have to do it, and
  • HOW you plan to check it

With these steps, you should be well prepared. It doesn’t matter what kind of problem you are trying to solve: The process is basically the same.

Other problem solving methods

Over the years many problem solving methods have been developed. Kepner-Tregoe is one of them. Other well known methods are:

  • A3 Problem Solving
  • PDCA
  • Eight Disciplines (8D) Problem Solving
  • Failure mode and effects analysis (FMEA)

A3 Problem Solving was developed at Toyota for the Toyota Production System (TPS). It’s an often-used method in Lean Manufacturing. A3 helps to solve problems by prescribing a structure (WHAT IS and WHAT IS NOT the problem, describe the problem, root cause, solution etc.). This structure is placed on a sheet of A3 paper (that’s why it’s called A3). The process is based on the principles of Deming’s PDCA cycle.

PDCA, or Plan-Do-Check-Act (sometimes called the Shewhart cycle), was made popular by Dr. W. Edwards Deming. Plan-Do-Check-Act refers to the four phases of this cycle.

  • Plan: Plan the change
  • Do: Implement the change
  • Check: Check the success of the implemented change
  • Act: Take action based on the results of “Check”

Eight Disciplines (8D) Problem Solving was developed by the Ford Motor Company. The D0 phase is the starting point for the 8D process, but it’s not counted.

  • D0:  Plan for solving the problem and determine the prerequisites
  • D1: Establish a team of people with the required skills and knowledge
  • D2: Describe the problem
  • D3: Define and implement containment actions
  • D4: Determine and verify the root causes
  • D5: Plan permanent corrective actions for the observed problem
  • D6: Implement the best permanent corrective actions
  • D7: Modify management systems to prevent a recurrence
  • D8: Congratulate your team!

Failure mode and effects analysis (FMEA) is a highly structured, systematic approach to failure analysis. There are different types of FMEA:

  • Functional
  • Design
  • Process

FMEA is based on inductive reasoning (forward logic). It follows a highly structured process, which can be represented as follows.

  • Structural analysis: A system is divided into its components
  • Functional analysis: Identify the function of each component
  • Failure analysis: Identify the possible failures for each component
  • Calculate the risk: Risk Priority Number = occurrence ranking x detection ranking x highest severity ranking
  • Optimize: Optimize the component to mitigate the risk
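The risk calculation in the list above is simple arithmetic, as the following Python sketch shows. It multiplies the occurrence ranking, the detection ranking, and the highest severity ranking of a failure mode, then sorts the failure modes by the result. The components and all rankings are invented for illustration, and the 1 to 10 scale is a common convention that I assume here.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    component: str
    failure: str
    effect_severities: tuple  # one severity ranking (1-10) per effect of this failure
    occurrence: int           # how likely the failure is (1-10)
    detection: int            # how hard the failure is to detect (1-10)

    @property
    def rpn(self):
        """Risk Priority Number = occurrence x detection x highest severity."""
        return self.occurrence * self.detection * max(self.effect_severities)


def prioritize(failure_modes):
    """Sort failure modes so that the highest risk comes first."""
    return sorted(failure_modes, key=lambda mode: mode.rpn, reverse=True)


# Hypothetical failure modes of a server, with made-up rankings.
modes = [
    FailureMode("Power supply", "fails", (7, 4), occurrence=3, detection=2),
    FailureMode("Fan block", "stalls", (6,), occurrence=4, detection=5),
    FailureMode("NIC", "link goes down", (8, 5), occurrence=2, detection=1),
]

for mode in prioritize(modes):
    print(f"{mode.rpn:4d}  {mode.component}: {mode.failure}")
```

In this made-up example the fan block ranks first, although its severity is the lowest: a failure that occurs often and is hard to detect can outweigh a more severe one.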

No matter what, stay organized

The key to solving problems successfully is to stay organized. Solving problems isn’t magic. It is a very structured process that gets better with increasing experience. Try to create your own structured method. Or use one of the mentioned problem solving methods. But in general:

  • Always try to describe a problem
  • Try to simplify or break it into smaller problems
  • Search for the root cause and verify it
  • Develop a solution

Certificate-based authentication of Azure Automation accounts

This posting is ~8 years old. You should keep this in mind. IT is a short-lived business. This information might be outdated.

Before you can manage Azure services with Azure Automation, you need to authenticate the Automation account against a subscription. This authentication process is part of each runbook. There are two different ways to authenticate against an Azure subscription:

  • Active Directory user
  • Certificate

If you want to use an Active Directory account, you have to create a credential asset in the Automation account and provide the username and password for that Active Directory account. Inside a runbook, you can retrieve the credentials using the Get-AutomationPSCredential activity. It returns a System.Management.Automation.PSCredential object, which can be used with Add-AzureAccount to connect to a subscription.

If you want to use a certificate, you need four assets in the Automation account: a certificate, and variables with the certificate name, the subscription ID, and the subscription name. The values of these assets can be retrieved with Get-AutomationVariable and Get-AutomationCertificate.


Before you start, you need a certificate. This can be a self-signed or a CA-signed certificate. Check this blog post from Alice Waddicor if you want to start with a self-signed certificate. I used a certificate that was signed by my lab CA.

At a Glance:

  • self- or CA-signed certificate
  • Base64 encoded DER format (file name extension .cer) to upload it as a management certificate
  • PKCS #12 format with private key (file name extension .pfx or .cer) to use it as an asset inside the Automation account

Upload the management certificate

First, you must upload the certificate to the management certificates. Log in to Azure and click “Settings”.


Patrick Terlisten/ www.vcloudnine.de/ Creative Commons CC0

Click on “Management Certificates”



and select “Upload” at the bottom of the website.



Make sure that the certificate has the correct format and file name extension (.cer).



Finish the upload dialog. After a few seconds, the certificate should appear in the listing.



Create a new Automation account

Now it’s time to create the Automation account. Select “Automation” from the left panel.



Click on “Create an Automation account”.



Give your Automation account a descriptive name and select a region. Please note that an Automation account can manage Azure services from all regions!



Click on the newly created account and click on “Assets”.



Select “Add setting” from the bottom of the website.



Add a credential asset by choosing “Add credential” and select “Certificate” as “Credential type”.



Enter a descriptive name for the certificate. You should remember this name. You will need it later. Now you have to upload the certificate. The certificate must have the file name extension .pfx or .cer and it must include the private key!



Finish the upload of the certificate. Now add three additional assets (variables).



Select the name, the value and the type from the table below. The name of the certificate is the descriptive name you entered when uploading the certificate.

Name                         Value                                      Type
AutomationCertificateName    Name of your certificate                   String
AzureSubscriptionName        Name of your subscription                  String
AzureSubscriptionID          36-character ID (GUID) of the subscription String

Done. You’ve uploaded and created all the required certificates and variables.

How to use it

To use the certificate and the variables to connect to an Azure subscription, you have to use the two cmdlets Get-AutomationCertificate and Get-AutomationVariable. I use this code block in my runbooks:

# Read the variable assets of the Automation account
$AzureSubscriptionName = Get-AutomationVariable -Name "AzureSubscriptionName"
$AzureSubscriptionID = Get-AutomationVariable -Name "AzureSubscriptionID"
$AutomationCertificateName = Get-AutomationVariable -Name "AutomationCertificateName"

# Get the certificate asset by its name
$Certificate = Get-AutomationCertificate -Name $AutomationCertificateName

# Register the subscription with the certificate and make it the current one
Set-AzureSubscription -SubscriptionName $AzureSubscriptionName -SubscriptionId $AzureSubscriptionID -Certificate $Certificate
Select-AzureSubscription -SubscriptionName $AzureSubscriptionName

Works like a charm.


Certificate-based authentication is an easy way to authenticate an Automation account against an Azure subscription. It’s easy to implement and you don’t have to maintain users and passwords. You can use different certificates for different Automation accounts. I really recommend this, especially if you have separate accounts for dev, test and production.

All you need to do is upload a certificate as a management certificate, and as a credential asset in the Automation account. You can use a self-signed or a CA-signed certificate. The subscription ID, the subscription name and the name of the certificate are stored in variables.

At the beginning of each runbook, you have to insert a code block. This code block takes care of authentication.

A brief introduction into Azure Automation


Automation is essential to reduce friction and to streamline operational processes. It’s indispensable when it comes to the automation of manual, error-prone and frequently repeated tasks in a cloud or enterprise environment. Automation is the key to IT industrialization. Azure Automation is used to automate operational processes within Microsoft Azure.

Automation account

The very first thing you have to create is an Automation account. You can have multiple Automation accounts per subscription. An Automation account allows you to separate automation resources from those of other Automation accounts. Automation resources are runbooks and assets (credentials, certificates, connection strings, variables, schedules etc.). So each Automation account has its own set of runbooks and assets. This is perfect to separate production from development. An Automation account is associated with an Azure region, but the Automation account can manage Azure services in all regions.


Runbooks

A runbook is a collection of PowerShell scripts or PowerShell workflows. You can automate nearly everything with it. If something provides an API, you can use a runbook and PowerShell to automate it. A runbook can run other runbooks, so you can build really complex automation processes. A runbook can access any service that can be reached from Microsoft Azure, regardless of whether it’s an internal or external service.

There are three types of runbooks:

  • Graphical runbooks
  • PowerShell Workflow runbooks
  • PowerShell runbooks

Graphical runbooks can be created and maintained with a graphical editor within the Azure portal. Graphical runbooks use PowerShell workflow code, but you can’t directly view or modify this code. Graphical runbooks are great for customers that don’t have much automation and/or PowerShell knowledge. Once you have created a graphical runbook with an Automation account, you can export it and import it into other Automation accounts, but you can modify the runbook only with the account which was used during the creation of the runbook.

PowerShell Workflow runbooks don’t have a graphical representation of the workflow. You can use a text editor to create and modify PowerShell Workflow runbooks. But you need to know how to deal with the logic of PowerShell Workflow code.

PowerShell runbooks are plain PowerShell code. Unlike a PowerShell Workflow runbook, a PowerShell runbook starts faster, because it doesn’t have to be compiled before the run. But you have to be familiar with PowerShell. There is no parallel processing and you can’t use checkpoints (if a workflow fails, it is suspended; with a checkpoint, the workflow can be resumed from the last successful checkpoint).
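
To make the two workflow-only features more concrete, here is a minimal sketch of a PowerShell Workflow runbook that uses parallel processing and a checkpoint. The workflow name and the server names are made up for illustration. The code needs the workflow engine (Azure Automation PowerShell Workflow runbooks or Windows PowerShell) and will not run as a plain PowerShell runbook.

```powershell
workflow Invoke-CheckpointDemo
{
    # Placeholder server names (assumption, for illustration only)
    $servers = @('srv01', 'srv02', 'srv03')

    # Parallel processing: the iterations run concurrently
    foreach -parallel ($server in $servers)
    {
        Write-Output -InputObject "Processing $server"
    }

    # Persist the workflow state. If the job is suspended later on,
    # resuming it continues here instead of repeating the loop above.
    Checkpoint-Workflow

    Write-Output -InputObject "All servers processed"
}
```

Placing checkpoints after expensive or non-repeatable steps is a common choice, because everything before the last checkpoint is not executed again when the job is resumed.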


Schedules

Schedules are used to run runbooks at a specific point in time. Runbooks and schedules have an M:N relationship. A schedule can be associated with one or more runbooks, and a runbook can be linked to one or more schedules.


This is only a brief introduction into Azure Automation. Azure Automation uses Automation accounts to execute runbooks. A runbook consists of PowerShell Workflow or plain PowerShell code. You can use runbooks to automate nearly all operations of Azure services. To execute runbooks at a specific point in time, you can use schedules. Runbooks, schedules and automation assets, like credentials, certificates etc., are associated with a specific Automation account. This helps you to keep different environments apart, e.g. with separate accounts for development and for production.

How to migrate from VMware vCOps to vROps – Part 3


I wrote about what’s new in vROps 6 and about the deployment of the virtual appliance. I also described how to migrate the data from the old vCOps vApp. Part 3 covers the decommissioning of the old vApp.

Enter the IP or FQDN of your UI VM into the browser. Log in as admin to the administration UI.



Before the vApp can be removed, vCOps needs to be unregistered from vCenter. Click “Unregister”.



A confirmation pop-up appears. Click “Yes”.



The process can take some time, depending on your environment. In my case the unregistration took about 5 minutes.



That’s it. After the successful unregistration, the vApp can be shut down and removed from vCenter. Say goodbye and enjoy your new vROps 6.0!


Some users complained about the absence of the Health Widget in the vSphere Web Client after the removal of the old vCOps. Michael White (@mwVme) posted the solution: vSphere Web Client Health State Widget has errors after vR Ops Migration? Thanks to Michael for sharing this!

How to migrate from VMware vCOps to vROps – Part 2


Part 1 of this series gave a short overview of vRealize Operations Manager 6.0 and covered the initial deployment of the virtual appliance. Now it’s time to bring it to life.

Open a browser and enter the IP of your newly deployed vROps appliance. You will get this nice initial setup screen. “New Installation” is always a good start. Click “New Installation”.



Within five little steps, the configuration of vROps will be done. You may have noticed the “Migrate Data” icon on the right of the screenshot. This will be important later. Start with clicking “Next”.



Enter the password for the built-in administrator account and click “Next”.



I really liked this part: Choose Certificate. No command line, no complex knowledge base articles. Simply choose “Install a certificate” and point the installer to a valid certificate. You can also replace the certificate later. Because of this, I’ve chosen “Use the default certificate” at this point.



Enter the name of the master node and a valid NTP server. In my case, the NTP server is my Active Directory Domain Controller.



Complete the initial setup by clicking “Finish”.



Now the vROps services must be started. Click “Start vRealize Operations Manager”. The pop up can be answered with “Yes”.



Log in with “admin” and the password you’ve chosen at the beginning of the initial setup.



Because I already had a vCOps running in my lab, I’ve chosen “Import Environment”. This selection allows you to import the data from your current vCOps vApp. Click “Next”.



Carefully read the EULA, enable the checkbox and click “Next”.



Now you have to add a license key. I used my vExpert vCloud Suite key.



Enter the IP address or the FQDN of your current vCOps UI VM. A pop-up appears that informs you that an agent is pushed to the VM. This step takes about 5 minutes to finish. Click “Next”.



You can add additional solutions at this point. This wasn’t necessary in my case. Click “Next”.



Everything’s fine. Simply click “Finish” and relax.



Now the data is imported. Depending on your resources and the amount of data, this step can take some time. In my case this step took about 10 minutes to finish.



Congratulations! That’s it. Easy, isn’t it? Now you can add additional nodes, e.g. a replica node or additional data nodes.



The migration process is really easy and straightforward. The collected data is the most valuable part of a vCOps installation. Because of this, I strongly recommend migrating the data from your current vCOps environment.

But what to do with the old vCOps vApp? It’s still active and consumes resources from your vSphere cluster. I covered this in part 3 of this series.

How to migrate from VMware vCOps to vROps – Part 1


VMware presented vRealize Operations Manager 6.0 at VMworld 2014 in Barcelona. In early December, vROps was available for download.

vROps 6.0 is the successor of VMware’s IT Operations Management suite vCenter Operations Manager, or vCOps. VMware has aligned the naming scheme with other products, so this release is the first release under the new brand vRealize.

VMware has made some major improvements to this release. One of the biggest advantages is the ability to scale out. In prior releases you had to deploy multiple vApps to scale. Now you can add additional vROps instances to a cluster. These appliances provide computing resources, as well as redundancy. This allows you to scale beyond the limits of vCOps 5. Redundancy is provided by a concept which is based on master, replica and data nodes. The first node in a vROps deployment is the master node. By adding a replica node, you add redundancy in case the master node fails. Master and replica node work in an active/standby relationship. The data nodes are the secret behind the scalability of vROps. A data node has only one task to perform: Collect data based on the assigned adapter.


An adapter is used to collect data from 3rd party systems. Adapters are provided by the available management packs. Check VMware’s whitepaper “VMware vRealize Operations Guide to Third-Party Solutions” for more information about available management packs. From my point of view the management packs (which provide the adapters) for HP 3PAR (HP StoreFront), HP OneView, Brocade SAN Analytics, Microsoft SCOM or SAP CCMS are really cool. Just to make this clear: vRealize Operations Manager 6.0 is NOT focused on VMware! You can use vROps with available management packs to get a much better overview of your IT infrastructure!

VMware has also improved the user interface (UI). There is no need anymore to switch between different UIs. The management, administration and customer UIs have been consolidated into a single UI. The first UI access after the initial deployment is redirected to a first-time wizard. This wizard helps you to deploy vROps, or to migrate your vCOps environment. In addition to the UI changes, VMware has added RBAC to vROps to simplify user access management.

When talking about the UI, it’s only a small step to data visualization and reporting. The visualization of collected data is one of the greatest features of vCOps/vROps. Only data visualization makes it possible to get a quick overview of the current health of the IT infrastructure. Prior versions included reports and some fixed dashboards. With the current release, vROps provides fully configurable reports and dashboards with the ability to include any data source, object or metric. Just think about the opportunities: You can build reports or dashboards across the whole stack, from the infrastructure to the application.

Smart Alerts provide a way to trigger an alarm when multiple symptoms are observed. This is much more flexible than the old alarms you know from vCOps. But even a dynamic threshold may be too inflexible in certain cases. Smart Alerts add more intelligence to the alerting system. Now you know that something is wrong in your datacenter. But how can you solve this? vROps is able to execute basic operations as a reaction to a Smart Alert. If you need more advanced reactions, vROps can utilize vRealize Orchestrator to accomplish this. Your datacenter will heal itself.

The capacity management was improved. The well-known “Demand & Allocation” based capacity model hasn’t changed, but it’s no longer limited to vSphere objects. The capacity management can now include all monitored objects. VMware also improved the “what-if” analysis. This feature has been extended and is now aware of projects. With this feature, you can add future hardware purchases to a “what-if” analysis.

VMware dramatically improved the analysis and monitoring of the storage subsystem. Unified Storage Visibility can show you the correlation between applications and your underlying storage infrastructure in an end-to-end manner.

The licensing was simplified. Customers can now install multiple editions in the same vROps deployment. This enables customers to deploy a-la-carte and suite licenses together, e.g. vCloud Suite Standard, vSOM Standard and vROps Standard together in one deployment. You can’t combine a Standard edition with an Advanced or Enterprise edition, e.g. vCloud Suite Standard with vSOM Advanced or vROps Advanced/Enterprise.

The deployment process

You can deploy vROps using a virtual appliance or on top of RHEL or Windows using suitable installation packages. I would always prefer the appliance deployment. A big advantage is that the new vROps appliance is a single appliance, not a vApp. Especially in environments without DRS, deploying the vCOps vApp was a pain. You can deploy vROps on every ESX/ESXi host running version 4.0 or later that is managed by a VMware vCenter Server 4.0 Update 2 or later.

I used the good old C# client to deploy the appliance. ;) Start with selecting “Deploy a OVF Template” from the vSphere client.



Simply click “Next” to get to the EULA.



Carefully read the EULA and click “Accept”. Then click “Next”.



Enter a new name for the vROps appliance or accept the default name. Select a location for the VM.



Select the configuration. If your environment consists of 2000 VMs or fewer, select “small”. In this case the appliance will be configured with 4 vCPUs and 16 GB memory. This is equal to the resources of the old vCOps vApp construct.



Select the host or the cluster to which the vROps appliance should be deployed. I selected my management host.



Select a resource pool if you have RPs configured.



I always deploy my VMs thin-provisioned in my lab. Select an appropriate disk format for your deployment.



Choose the port-group to which the appliance should be connected. In my case it’s my infrastructure port-group.



Now we need to add IP, subnet, gateway, DNS and time zone. Make sure that these settings match the port-group you have chosen one step earlier.



Double-check your settings and click “Finish”.



Depending on your equipment, wait a couple of seconds, get yourself a coffee or go out for a walk. Congratulations, it’s a vROps!



At this point, the initial deployment is finished and we can proceed further. As you may have noticed, there was already a vCOps deployed in my lab, so a vCOps 5.8.x and a vROps 6.0 were now running side by side. Time to migrate vCOps to vROps. This process is covered in part 2 of this series.