Category Archives: Software

CloudFlare API v4 and Fail2ban: Fixing the unban action

In January 2017, I wrote an article about how to protect your WordPress blog using the WP Fail2Ban plugin, fail2ban on your Linux/ FreeBSD host, and CloudFlare. Back then, fail2ban was using the CloudFlare API V1, which had already been deprecated since November 2016.


Although the actions were later updated to use the CloudFlare API V4, I still had problems with the unbanning of IP addresses. IP addresses were banned, but the unban action failed.

This is the unban action that is included in fail2ban (taken from fail2ban-0.10.3.1, which is shipped with FreeBSD 11.1-RELEASE-p10; reconstructed and abbreviated here, the _cf_api_url and _cf_api_prms variables are defined earlier in the action file):
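```
actionunban = curl -s -o /dev/null -X DELETE <_cf_api_prms> \
            <_cf_api_url>/$(curl -s -X GET <_cf_api_prms> \
            "<_cf_api_url>?mode=block&configuration_target=ip&configuration_value=<ip>&page=1&per_page=1" \
            | cut -d'"' -f6)
```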

And this is the unban action that finally solved the issue:
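```
actionunban = curl -s -o /dev/null -X DELETE <_cf_api_prms> \
            <_cf_api_url>/$(curl -s -X GET <_cf_api_prms> \
            "<_cf_api_url>?mode=block&configuration_target=ip&configuration_value=<ip>&page=1&per_page=1" \
            | tr -d '\n' | cut -d'"' -f6)
```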

I found the solution at serverfault.com. The only difference is an additional tr -d '\n' in the last line of the statement. Kudos to Jake for fixing this!

To prevent the action file from being overwritten during an update, you should copy the original cloudflare.conf located in the action.d directory, e.g. to mycloudflare.conf, and use the copied action file in your jail definition.
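A minimal jail definition using the copied action could look like this (a sketch; the wordpress-hard filter ships with the WP fail2ban plugin, the log path and credentials are examples):

```
[wordpress-hard]
enabled = true
filter  = wordpress-hard
action  = mycloudflare[cfuser="user@example.com", cftoken="your-api-key"]
logpath = /var/log/messages
```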

Windows Network Policy Server (NPS) won’t log failed login attempts

This is just a short, but interesting blog post. When you have to troubleshoot authentication failures in a network that uses Windows Network Policy Server (NPS), the Windows event log is absolutely indispensable. The event log offers everything you need. The success and failure event log entries include all necessary information to get you back on track. If only failure events were logged…


Today, I was playing with Alcatel-Lucent Enterprise OmniSwitches and Access Guardian in my lab. Access Guardian refers to a set of OmniSwitch security functions that work together to provide a dynamic, proactive network security solution:

  • Universal Network Profile (UNP)
  • Authentication, Authorization, and Accounting (AAA)
  • Bring Your Own Device (BYOD)
  • Captive Portal
  • Quarantine Manager and Remediation (QMR)

I plan to publish some blog posts about Access Guardian in the future, because it is a pretty interesting topic. So stay tuned. :)

802.1X was no big deal, but MAC-based authentication failed. Okay, let’s take a look into the event log of the NPS… okay, there are the success events for my 802.1X authentication… but where are the failed login attempts? Not a single one was logged. A short Google search pointed me in the right direction.

Failed logon/ logoff events were not logged

In this case, the NPS role was installed on a Windows Server 2016 domain controller. And it was a German installation, so the output of the commands is also in German. If you have an OS installed in English, you must replace “Netzwerkrichtlinienserver” with “Network Policy Server”.

Right-click the PowerShell icon and open it as administrator. Check the current settings with auditpol:
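```
auditpol /get /subcategory:"Netzwerkrichtlinienserver"
```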

As you can see, only successful logon and logoff events were logged.

The option /success:enable /failure:enable activates the logging of successful and failed logon and logoff attempts:
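```
auditpol /set /subcategory:"Netzwerkrichtlinienserver" /success:enable /failure:enable
```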

Veeam backup fails because of time differences

Last week I had an interesting incident at a customer. The customer reported that one of multiple Veeam backup jobs constantly failed.


The backup job included two VMs, and the backup of one of these VMs failed with this error:

We verified the credentials used for that job, but re-entering the password did not solve the issue. I then checked the Veeam backup logs located under %ProgramData%\Veeam\Backup (look for the Agent.Job_Name.Source.VM_Name.vmdk.log) and found VDDK Error 3014:

The user that was used to connect to the vCenter was an Active Directory account. The account was granted administrator privileges at the root of the vCenter. Switching from the AD account to Administrator@vsphere.local solved the issue. Next stop: vmware-sts-idmd.log on the vCenter Server Appliance. The error found in this log confirmed my theory that there was an issue with the authentication itself, not an issue with the AD account.

To make a long story short: time differences. The vCenter, the ESXi hosts and some servers had the wrong time. vCenter and ESXi hosts were using the domain controllers as their time source.

This is the ntpq output of the vCenter. You might notice the offset and jitter values on the right side, both noted in milliseconds.
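Output like this can be gathered on the appliance shell with the standard NTP query tool:

```
ntpq -p
```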

After some investigation, the root cause seemed to be a bad DCF77 receiver, which was connected to the domain controller that was hosting the PDC emulator role. The DCF77 receiver was connected using a USB-2-LAN converter. Instead of using a DCF77 receiver, the customer and I implemented an NTP hierarchy using a valid NTP source on the internet (pool.ntp.org).
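Pointing the PDC emulator at external NTP servers is a w32tm one-liner (a sketch; the pool servers are examples):

```
w32tm /config /manualpeerlist:"0.pool.ntp.org 1.pool.ntp.org 2.pool.ntp.org" /syncfromflags:manual /reliable:yes /update
net stop w32time && net start w32time
```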

Demystifying “Interfaces on which heartbeats are not seen”

By accident, I found a heartbeat/ VLAN issue on a NetScaler cluster at one of my customers. The NetScaler ADC appliances have three interfaces connected to a switch stack. Two of the three interfaces were configured as a channel (LAG). This is a snippet from the config (reconstructed from the description below; the channel member interfaces are assumptions):
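```
add channel LA/1 -tagall ON
bind channel LA/1 1/1
bind channel LA/1 1/2
bind vlan 10 -ifnum 1/3
bind vlan 10 -ifnum LA/1 -tagged
```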

On the switch stack, the port to which interface 1/3 is connected is configured as an access port. The ports to which the channel is connected are configured as trunk ports with some permitted VLANs. The customer is using HPE Comware based switches. The terminology is the same for Cisco. If you use HPE ProVision or Alcatel-Lucent Enterprise, translate “access” to “untagged” and “trunk” to “tagged”. Because the channel is configured as a trunk port on the switch, the tagall option was set.

Issue

While examining the output of show ha node, I saw this (shortened to the relevant line):
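```
> show ha node
...
Interfaces on which heartbeats are not seen: LA/1
...
```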

Because interface 1/3 was not affected, this had to be a VLAN issue. During the initial troubleshooting, I was able to discover heartbeat packets in VLAN 1 and in VLAN 10.

Solution

The solution was easy: remove the tagged option for VLAN 10 on LA/1, so that the binding looks like this (commands reconstructed):
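```
bind vlan 10 -ifnum LA/1
```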

instead of
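```
bind vlan 10 -ifnum LA/1 -tagged
```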

Because of the configured tagall option, all packets sourced by LA/1 are tagged with the corresponding VLAN ID. But because LA/1 is now explicitly bound to VLAN 10 without a tag, VLAN 10 is now also the native VLAN for LA/1.

Now the NetScaler was sending heartbeat packets with a tag for VLAN 10, and the issue was solved.

Explanation

Heartbeat packets are always sent without a VLAN tag (untagged). There are two exceptions:

  • The NSVLAN is configured with a specific VLAN ID, or
  • an interface used for heartbeats is configured with the tagall option.

In these cases, the heartbeat packets are tagged with the native VLAN ID of the interface. A show interface for the channel showed that the channel was using VLAN 1 as its native VLAN.

How does the NetScaler determine the native VLAN for an interface? The native VLAN is the VLAN to which an interface is bound untagged. An interface can only be bound untagged to a single VLAN, but it can be bound tagged to multiple VLANs.

If you take a look at the config snippet at the top of this blog post, you might notice that interface 1/3 is bound untagged to VLAN 10. So this is the native VLAN for interface 1/3. But this interface is not using the tagall option; therefore, heartbeat packets are not tagged. The channel LA/1 is bound tagged to VLAN 10. But it was also bound to VLAN 1, without the tagged option. As a result, VLAN 1 was used as the native VLAN for channel LA/1. And because LA/1 is configured with the tagall option, the heartbeats were tagged with a tag for VLAN 1. That’s why I was able to see the heartbeats that were sent over channel LA/1 in VLAN 1.

In the end, the NetScaler appliances were sending heartbeats from interface 1/3 to VLAN 10, and from channel LA/1 to VLAN 1. This caused the message “Interfaces on which heartbeats are not seen: LA/1”.

The Meltdown/ Spectre shortcut blog post for Windows, VMware and HPE


I will try to update this blog post regularly!

Change History

01-13-2018: Added information regarding VMSA-2018-0004
01-13-2018: HPE has pulled Gen8 and Gen9 system ROMs
01-13-2018: VMware has updated KB52345 due to issues with Intel microcode updates
01-18-2018: Updated VMware section
01-24-2018: Updated HPE section
01-28-2018: Updated Windows Client and Server section
02-08-2018: Updated VMware and HPE section
02-20-2018: Updated HPE section
04-17-2018: Updated HPE section


Many blog posts have been written about the two biggest security vulnerabilities discovered so far. In fact, we are talking about three different vulnerabilities:

  • CVE-2017-5715 (branch target injection)
  • CVE-2017-5753 (bounds check bypass)
  • CVE-2017-5754 (rogue data cache load)

CVE-2017-5715 and CVE-2017-5753 are known as “Spectre”, CVE-2017-5754 is known as “Meltdown”. If you want to read more about these vulnerabilities, please visit meltdownattack.com.

Multiple steps are necessary to be protected, and the necessary information is often repeated, but distributed over several websites: vendor sites, articles, blog posts and security announcements.

Two simple steps

Two (simple) steps are necessary to be protected against these vulnerabilities:

  1. Apply operating system updates
  2. Update the microcode (BIOS) of your server/ workstation/ laptop

If you use a hypervisor to virtualize guest operating systems, then you have to update your hypervisor as well. Just treat it like an ordinary operating system.

Sounds pretty simple, but it’s not. I will focus on three vendors in this blog post:

  • Microsoft
  • VMware
  • HPE

Let’s start with Microsoft. Microsoft has published the security advisory ADV180002 on 01/03/2018.

Microsoft Windows (Client)

The necessary security updates are available for Windows 7 (SP1), Windows 8.1, and Windows 10. The January 2018 security updates are ONLY offered in one of these cases (Source: Microsoft):

  • A supported anti-virus application is installed
  • Windows Defender Antivirus, System Center Endpoint Protection, or Microsoft Security Essentials is installed
  • A registry key was added manually

To add this registry key, please execute the following in an elevated CMD. Do not add this registry key if you are running an unsupported antivirus application! Please contact your antivirus application vendor! This key only has to be added manually if NO antivirus application is installed; otherwise your antivirus application will add it:
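```
REM antivirus compatibility key from ADV180002
reg add "HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\QualityCompat" /v cadca5fe-87d3-4b96-b7fb-a231484277cc /t REG_DWORD /d 0 /f
```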

| OS | Update |
|----|--------|
| Windows 10 (1709) | KB4056892 |
| Windows 10 (1703) | KB4056891 |
| Windows 10 (1607) | KB4056890 |
| Windows 10 (1511) | KB4056888 |
| Windows 10 (initial) | KB4056893 |
| Windows 8.1 | KB4056898 |
| Windows 7 SP1 | KB4056897 |

Please note that you also need a microcode update! Reach out to your vendor. On my Lenovo ThinkPad X250, the microcode update was offered automatically.

Update 01-28-2018

Microsoft has published an update to disable the mitigation against Spectre (variant 2) (Source: Microsoft). KB4078130 is available for Windows 7 SP1, Windows 8.1 and Windows 10, and it disables the mitigation against Spectre variant 2 (CVE-2017-5715) independently via registry setting changes. The registry changes are described in KB4073119.

A reboot is required to disable the mitigation.

Windows Server

The necessary security updates are available for Windows Server 2008 R2, Windows Server 2012 R2, Windows Server 2016 and Windows Server Core (1709). The security updates are NOT available for Windows Server 2008 and Server 2012! The January 2018 security updates are ONLY offered in one of these cases (Source: Microsoft):

  • A supported anti-virus application is installed
  • Windows Defender Antivirus, System Center Endpoint Protection, or Microsoft Security Essentials is installed
  • A registry key was added manually

To add this registry key, please execute the following in an elevated CMD. Do not add this registry key if you are running an unsupported antivirus application! Please contact your antivirus application vendor! This key only has to be added manually if NO antivirus application is installed; otherwise your antivirus application will add it. The key is the same QualityCompat key shown above for the client operating systems.

| OS | Update |
|----|--------|
| Windows Server, version 1709 (Server Core installation) | KB4056892 |
| Windows Server 2016 | KB4056890 |
| Windows Server 2012 R2 | KB4056898 |
| Windows Server 2008 R2 | KB4056897 |

After applying the security update, you have to enable the protection mechanism. This is different from Windows 7, 8.1 or 10! To enable the protection mechanism, you have to add three registry keys:
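These are the keys documented in Microsoft’s Windows Server guidance (KB4072698); the third key is only required on Hyper-V hosts:

```
reg add "HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management" /v FeatureSettingsOverride /t REG_DWORD /d 0 /f
reg add "HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management" /v FeatureSettingsOverrideMask /t REG_DWORD /d 3 /f
reg add "HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Virtualization" /v MinVmVersionForCpuBasedMitigations /t REG_SZ /d "1.0" /f
```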

The easiest way to distribute these registry keys is a Group Policy. In addition to that, you need a microcode update from your server vendor.

Update 01-28-2018

The update published for Windows 7 SP1, 8.1 and 10 (KB4078130) is not available for Windows Server. But the same registry keys apply to Windows Server, so it is sufficient to change the already set registry keys to disable the mitigation against Spectre variant 2 (CVE-2017-5715).

A reboot is required to disable the mitigation.

VMware vSphere

VMware has published three important VMware Security Advisories (VMSA):

  • VMSA-2018-0002
  • VMSA-2018-0004
  • VMSA-2018-0007

VMware Workstation Pro, Player, Fusion, Fusion Pro, and ESXi are affected by CVE-2017-5753 and CVE-2017-5715. VMware products seem not to be affected by CVE-2017-5754. On 01/09/2018, VMware published VMSA-2018-0004, which also addresses CVE-2017-5715. Just to make this clear:

  • Hypervisor-Specific Remediation (documented in VMSA-2018-0002.2)
  • Hypervisor-Assisted Guest Remediation (documented in VMSA-2018-0004)

I will focus on vCenter and ESXi. In case of VMSA-2018-0002, security updates are available for ESXi 5.5, 6.0 and 6.5. In case of VMSA-2018-0004, security updates are available for ESXi 5.5, 6.0 and 6.5, and for vCenter 5.5, 6.0 and 6.5. VMSA-2018-0007 covers VMware virtual appliance updates against side-channel analysis due to speculative execution.

Before you apply any security updates, please make sure that you read this:

  • Deploy the updated version of vCenter listed in the table (only if vCenter is used).
  • Deploy the ESXi security updates listed in the table.
  • Ensure that your VMs are using Hardware Version 9 or higher. For best performance, Hardware Version 11 or higher is recommended.

For more information about Hardware versions, read VMware KB article 1010675.

VMSA-2018-0002.2

| OS | Update |
|----|--------|
| ESXi 6.5 | ESXi650-201712101-SG |
| ESXi 6.0 | ESXi600-201711101-SG |
| ESXi 5.5 | ESXi550-201709101-SG |

In case of ESXi550-201709101-SG it is important to know that this patch mitigates CVE-2017-5715, but not CVE-2017-5753! Please see KB52345 for important information on ESXi microcode patches.

VMSA-2018-0004

| OS | Update |
|----|--------|
| ESXi 6.5 | ESXi650-201801401-BG and ESXi650-201801402-BG |
| ESXi 6.0 | ESXi600-201801401-BG and ESXi600-201801402-BG |
| ESXi 5.5 | ESXi550-201801401-BG |
| vCenter 6.5 | 6.5 U1e |
| vCenter 6.0 | 6.0 U3d |
| vCenter 5.5 | 5.5 U3g |

The patches ESXi650-201801402-BG, ESXi600-201801402-BG, and ESXi550-201801401-BG will patch the microcode for supported CPUs. And this is pretty interesting! To enable hardware support for branch target mitigation (CVE-2017-5715 aka Spectre) in vSphere, three steps are necessary (Source: VMware):

  • Update to one of the above listed vCenter releases
  • Update the ESXi 5.5, 6.0 or 6.5 with
    • ESXi650-201801401-BG
    • ESXi600-201801401-BG
    • ESXi550-201801401-BG
  • Apply microcode updates from your server vendor, OR apply these patches for ESXi
    • ESXi650-201801402-BG
    • ESXi600-201801402-BG
    • ESXi550-201801401-BG

In case of ESXi 5.5, the hypervisor and microcode updates are delivered in a single update (ESXi550-201801401-BG).

Update 01-13-2018

Please take a look into KB52345 if you are using Intel Haswell and Broadwell CPUs! The KB article includes a table with affected CPUs.

All you have to do is:

  • Update your vCenter to the latest update release, then
  • Update your ESXi hosts with all available security updates
  • Apply the necessary guest OS security updates and enable the protection (Windows Server)

For the required security updates, see the tables above.

Make sure that you also apply microcode updates from your server vendor!

VMSA-2018-0007

This VMSA, published on 02/08/2018, covers several VMware virtual appliances. Relevant appliances are:

  • vCloud Usage Meter (UM)
  • Identity Manager (vIDM)
  • vSphere Data Protection (VDP)
  • vSphere Integrated Containers (VIC), and
  • vRealize Automation (vRA)
| Product | Patch pending? | Mitigation/ Workaround |
|---------|----------------|------------------------|
| UM 3.x | yes | KB52467 |
| vIDM 2.x and 3.x | yes | KB52284 |
| VDP 6.x | yes | NONE |
| VIC 1.x | – | Update to 1.3.1 |
| vRA 6.x | yes | KB52497 |
| vRA 7.x | yes | KB52377 |

HPE ProLiant

HPE has published a customer bulletin (document ID a00039267en_us) with all necessary information:

HPE ProLiant, Moonshot and Synergy Servers – Side Channel Analysis Method Allows Improper Information Disclosure in Microprocessors (CVE-2017-5715, CVE-2017-5753, CVE-2017-5754)

CVE-2017-5715 requires that the System ROM be updated and a vendor-supplied operating system update be applied as well. CVE-2017-5753 and CVE-2017-5754 require only updates of a vendor-supplied operating system.

Update 01-13-2018

The following System ROMs were previously available but have since been removed from the HPE Support Site due to the issues Intel reported with the microcode updates included in them. Updated revisions of the System ROMs for these platforms will be made available after Intel provides updated microcodes with a resolution for these issues.

Update 01-24-2018

HPE will be releasing updated System ROMs for ProLiant and Synergy Gen10, Gen9, and Gen8 servers including updated microcodes that, along with an OS update, mitigate variant 2 (Spectre) of this issue. Note that processor vendors have NOT released updated microcodes for numerous processors, which gates HPE’s ability to release updated System ROMs.

I will update this blog post as soon as HPE releases new system ROMs.

For most Gen9 and Gen10 models, updated system ROMs are already available. Check the bulletin for the current list of servers for which updated system ROMs are available. Please note that you don’t need a valid support contract to download these updates!

Under “Software Type”, select “BIOS – (Entitlement Required)”. Note that entitlement is NOT required to download these firmware versions.

Update 02-09-2018

Nothing new. HPE has updated the bulletin on 01-31-2018 with an updated timeline for new system ROMs.

Update 02-25-2018

HPE has published Gen10 system ROMs. Check the advisory: HPE ProLiant, Moonshot and Synergy Servers – Side Channel Analysis Method Allows Improper Information Disclosure in Microprocessors (CVE-2017-5715, CVE-2017-5753, CVE-2017-5754).

Update 04-17-2018

HPE finally published updated System ROMs for several Gen10, Gen9, Gen8, G7 and even G6 models, including bread-and-butter servers like the ProLiant DL360 (G6 to Gen10) and DL380 (G6 to Gen10).

If you are running Windows on your ProLiant, you can use the online ROM flash component for Windows x64. If you are running VMware ESXi, you can use the System ROMPaq firmware upgrade for USB key media.

Citrix NetScaler and Exchange: Case-sensitivity of internal and external URLs

Exchange has known the concept of internal and external URLs for the different services (Outlook Web Access, OAB, EWS, ActiveSync etc.) since Exchange 2007. And it’s still confusing people. The internal URL is the URL that is used to access the desired service from the intranet. The external URL represents the URL that is used to access the service from the internet. Best practice is to use the same URL (the external) for both, use a certificate from a public CA, and use split DNS to access the external domain from the inside of your network.

People tend to assume that URLs are not case-sensitive. This seems to be true in most cases. The World Wide Web Consortium (W3C) states:

URLs in general are case-sensitive (with the exception of machine names). There may be URLs, or parts of URLs, where case doesn’t matter, but identifying these may not be easy. Users should always consider that URLs are case-sensitive.

Source: W3C

Citrix NetScaler and URLs

Citrix NetScaler handles URLs as case-sensitive.

A frequently used concept to load balance Microsoft Exchange with a NetScaler is Content Switching. Policies are used to identify traffic, and actions are used to take action on the traffic that matches the policies. The NetScaler uses the advanced policy engine to create expressions for the Content Switching policies. When you create a Content Switching policy with an expression that uses the CONTAINS operator, you might notice that the matching is case-sensitive.

This can be a problem in case of Microsoft Exchange, because /Autodiscover/Autodiscover.xml and /autodiscover/autodiscover.xml, or /ews/exchange.asmx and /EWS/Exchange.asmx, are handled differently.

Solution

To make sure that different cases are handled, you should add SET_TEXT_MODE(IGNORECASE) to your policy expression. Citrix describes this in CTX115528.
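A hypothetical Content Switching policy for Autodiscover with case-insensitive matching could look like this (the policy name is an example):

```
add cs policy pol_cs_autodiscover -rule "HTTP.REQ.URL.SET_TEXT_MODE(IGNORECASE).CONTAINS(\"/autodiscover/\")"
```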

I’ve changed my NetScaler setup script for Exchange to handle this behavior.

Exchange DAG member dies during snapshot creation

Yesterday, a customer called me and told me about a scary observation on one of his Exchange 2016 DAG (Database Availability Group) nodes.

In preparation of a security check, my customer created a snapshot of an Exchange 2016 DAG node. This node is part of a two-node Windows Server 2012 R2/ Exchange 2016 CU7 cluster.

That something went wrong was instantly clear after the first alarm messages were received. My customer opened a console window and saw that the VM was booting.

What went wrong?

Nothing. Everything worked as designed, except that the observed behaviour was not intended.

That a snapshot was created was clearly visible in the logs. Interesting was the amount of time the snapshot creation took: 5 minutes from the start of the snapshot creation until the task finished. During this time, quite a lot of data was written to the disks.

Figure: VMware vSphere throughput during snapshot creation

The server event log contained an entry that pointed me in the right direction.

Event Type: Information
Event ID: 1001
Source: BugCheck
Description: The computer has rebooted from a bugcheck. The bugcheck was 0x0000009E (0xffffe0001eccf900, 0x000000000000003c, 0x000000000000000a, 0x0000000000000000).

The Ask the Core Team blog wrote a nice post about this STOP error. In short: the failover clustering service incorporates a detection mechanism that may detect unresponsive user-mode processes. If an unresponsive user-mode process is detected, a HangRecoveryAction is called. Since Windows Server 2008, a STOP error (bugcheck) is caused on the cluster node.

Most likely hypothesis

My explanation of the observed behaviour is that my customer accidentally created a snapshot that included the VM memory. Because the Exchange server has 32 GB of memory, the snapshot creation took some time and the VM became unresponsive. When the VM was responding again, the HangRecoveryAction had already done its dirty job.

Check that the checkbox for the VM memory is cleared before you create a snapshot. Otherwise the bugcheck will do its job. Please note that you might see this behaviour in all Microsoft Windows failover clusters, not only with Microsoft Exchange.
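If you script your snapshots with PowerCLI, excluding the memory is explicit (a sketch; VM and snapshot names are examples):

```
# Create a snapshot without the VM memory state; quiesce the guest instead
New-Snapshot -VM "exchange-dag-node1" -Name "before-security-check" -Memory:$false -Quiesce:$true
```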

Exchange receive connector rejects incoming connections

As part of a bigger Microsoft Exchange migration, one of my customers moved the in- and outbound mailflow to a newly installed mail relay cluster. We modified the MX records to move the mailflow to the new mail relay, because the customer also switched the ISP. While changing the MX records for ~40 domains, with more and more mail being received through the new mail relay cluster, we noticed events from MSExchangeTransport (event ID 1021):

192.168.xxx.xxx is the mail relay cluster, which is used for the in- and outbound mailflow.

This event indicates that the remote server has reached the maximum number of simultaneous incoming connections to the receive connector. This value is specified by the MaxInboundConnectionPerSource parameter, and the default value is 20. You can easily increase the value using the Set-ReceiveConnector cmdlet.
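For example, to raise the limit to 40 on a hypothetical frontend receive connector:

```
Set-ReceiveConnector "EX01\Default Frontend EX01" -MaxInboundConnectionPerSource 40
```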

Microsoft has decreased this value over time. It was 100 in Exchange 2007, but 20 since Exchange 2010.

Data Protector Exchange GRE and IP-less Exchange DAG

When dealing with Microsoft Exchange restore requests, you will come across three different restore situations:

  • a database
  • a single mailbox
  • a single mailbox item (mail, calendar entry etc.)

Restoring a complete database is not a complicated task, but restoring a single mailbox, or a single mailbox item, is. First, you need to restore the database that includes the desired mailbox into a recovery database. Then you can restore the mailbox, or the mailbox items, from the recovery database. Some of the tasks can only be done with the Exchange Management Shell.

The HPE Data Protector Granular Recovery Extension (GRE) for Microsoft Exchange helps you to simplify the necessary steps to recover a single mailbox, or mailbox items. But the GRE can only assist you during the restore. It hides the above described tasks behind a nice GUI. The backup of Microsoft Exchange is still something you have to do with HPE Data Protector.

Database Availability Group without an Administrative Access Point

With Exchange 2013 SP1, Microsoft introduced the IP-less Database Availability Group (DAG). This type of DAG does not need a Cluster Name Object (CNO), and therefore has no IP address. With Exchange 2016, the IP-less DAG is the default DAG configuration.

But how do you back up a DAG that has no IP address and no name? It is easier than you might imagine. You have to create a DNS A-Record that includes all IP addresses of the cluster nodes, resulting in a DNS round-robin A-Record. You also have to install the Data Protector Disk Agent and On-line Extension on all cluster nodes. After that, you simply import the DAG into Data Protector by using the DNS A-Record. Then you can proceed with the creation and configuration of a backup job that uses the newly imported cluster.

Backup runs fine, but the GRE fails

During the test phase of a new Exchange 2016 cluster, a customer of mine discovered a strange error when he tried to restore a mailbox, or a mailbox item, using the Exchange GRE.

Figure: Data Protector Exchange GRE error

The customer and I double-checked the installation of the GRE on both nodes. Everything was fine. We also found out that Data Protector was able to list the backup objects. This is a shortened output of the command.

As you can see, dag-backup.domain.tld is the DNS A-Record, that was created to backup the DAG with Data Protector.

Connection between A-Record and DAG name

It took some time to get this sorted, but in the end, a new A-Record was the key. The DAG has a name, e.g. customer-dag1.domain.tld, but there is no matching A-Record, and the DAG has no IP address.

When the GRE searches for available database backups, it stumbles over the mismatch between the DAG name that is reported by the Exchange organization and the name of the Data Protector client that was used to back up the databases.

The key to success was to change the DNS A-Record from dag-backup.domain.tld to customer-dag1.domain.tld. The latter is the name of the DAG that was given during DAG creation. After removing the Data Protector client, re-importing the DAG with the new A-Record, and a successful backup, the customer was able to restore mailboxes and mailbox items using the GRE for Microsoft Exchange.

This process is not described in detail in the Data Protector documentation. All you will find is this footnote in the Data Protector Platform Integration Matrix (page 12, footnote 19):

Microsoft Exchange Server DAG configured without a Cluster Administrator Access Point is supported with Round Robin DNS mapping of DAG name to all the node IPs.

Make sure that the DNS round-robin A-Record matches your DAG name.

Simplemonitor – Python-based monitoring

While searching for a simple monitoring solution for my root servers, I stumbled over a Python-based software called SimpleMonitor. Other alternatives, like Nagios, or forks like Icinga etc., were a bit too much for my needs.

What is SimpleMonitor?

SimpleMonitor is a Python script which monitors hosts and network connectivity. It is designed to be quick and easy to set up and lacks complex features that can make things like Nagios, OpenNMS and Zenoss overkill for a small business or home network. Remote monitor instances can send their results back to a central location.

My requirements were simple:

  • Ping monitoring
  • TCP monitoring
  • HTTP monitoring
  • Service monitoring
  • Disk space monitoring

Monitoring is nothing without alerting, so I was pretty happy that SimpleMonitor is able to send messages into a Slack channel! But it can also send e-mails or SMS, or it can write into a log file. To get a full feature overview, visit the SimpleMonitor website.

The project is hosted on GitHub. If you are familiar with Python, you can contribute to the project, or you can add features as you need.

Installation & configuration

The installation is pretty simple: Just fetch the ZIP or the tarball from the project website, and extract it.

The configuration is split into two files:

  • monitor.ini
  • monitors.ini

The naming is a bit confusing: monitor.ini contains the basic monitoring configuration, like the interval for the checks and the alerting and reporting settings, while monitors.ini contains the configuration of the service checks. That confused me, so I renamed monitors.ini to services.ini.

The services.ini (monitors.ini) contains the service checks. This is a short example of a ping, a service check, a port check, and a disk space check (a sketch; the exact monitor types and parameters depend on your SimpleMonitor version, so check the documentation):
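```
[ping-gateway]
type=ping
host=192.168.0.1

[service-sshd]
type=rc
service=sshd

[port-https]
type=tcp
host=www.example.com
port=443

[disk-root]
type=diskspace
partition=/
limit=1G
```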

The alerting is configured in the monitor.ini. I’m using only the Slack notification. All you need is a webhook and the corresponding webhook URL.
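In monitor.ini, this boils down to registering a Slack alerter, roughly like this (the webhook URL is a placeholder):

```
[reporting]
alerters=slack

[slack]
type=slack
url=https://hooks.slack.com/services/XXXX/YYYY/ZZZZ
channel=#monitoring
```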

In case of a service fail, or service recovery, a notification is sent to the configured Slack channel.

To start SimpleMonitor, just run monitor.py. It expects the monitor.ini in the same directory.

Summary

I really like the simplicity of SimpleMonitor. Download, extract, configure, run, done. That’s what I was searching for. It is still under development, but you should not expect that it will gain much complexity. Even if features are added, it should remain a simple monitoring tool.