Nick Vitucci, a Senior IT Associate with NSK Inc, was at a client’s office routinely monitoring the system and came across some small, seemingly unrelated issues across a number of servers and applications. What first caught Vitucci’s attention was that printing was taking much longer than normal. While trying to troubleshoot that issue, he noticed that the interface to manage the printers was unresponsive.
Further investigation revealed that group policies were taking a very long time to open and that Active Directory replication seemed to be lagging. Changing a user’s password was taking far too long, and the user interface for Windows Server Update Services froze and ultimately crashed.
Vitucci immediately checked for networking and connectivity issues but found none, and the network switches weren’t showing any unusual activity. It wasn’t until he logged into vSphere (an application used to manage virtual servers) that he came across some disturbing performance graphs.
A number of servers displayed similar graphs. CPU usage had been growing steadily for the past several days and was now spiking at 100%. This was happening on most but not all virtual servers in the environment. Thinking about what the affected servers had in common, and what had changed in the last week when the activity started to increase, Vitucci realized that the servers involved all shared the following characteristics: a Windows operating system, a newly updated version of Trend Micro Antivirus, and the Pavis Server Management Agent.
While other servers had one or two of these characteristics, every affected server had all three. He quickly logged into one of the affected servers and found the OS so unresponsive that it was barely usable. Vitucci then checked a different, non-virtual server that had the same three factors and saw that it too was experiencing high CPU usage for no apparent reason, as were all servers with this combination (though luckily none of the physical machines were as severely affected as the virtual machines).
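The reasoning above, finding what every affected server has in common and then checking which other machines share that combination, can be sketched as a small set comparison. This is only an illustration: the server names, component labels, and inventory contents below are hypothetical and are not taken from the client's environment.

```python
# Hypothetical inventory: server name -> set of installed components.
# All names and values are illustrative assumptions.
inventory = {
    "dc01": {"windows", "trend_micro_updated", "pavis_agent"},
    "print01": {"windows", "trend_micro_updated", "pavis_agent"},
    "db01": {"windows", "trend_micro_updated", "pavis_agent"},
    "web01": {"windows", "pavis_agent"},
    "linux01": {"trend_micro_updated", "pavis_agent"},
}

# Servers observed with runaway CPU usage (hypothetical).
affected = {"dc01", "print01", "db01"}


def common_factors(inventory, affected):
    """Return the components present on every affected server."""
    component_sets = [inventory[name] for name in sorted(affected)]
    if not component_sets:
        return set()
    return set.intersection(*component_sets)


def servers_matching(inventory, factors):
    """Return every server that has all of the given components."""
    return {name for name, parts in inventory.items() if factors <= parts}


shared = common_factors(inventory, affected)
at_risk = servers_matching(inventory, shared)

print(sorted(shared))
print(sorted(at_risk))
```

If the set of servers matching the shared combination is the same as the set of affected servers, as it is in this made-up inventory, the combination is a strong candidate for the cause, which mirrors the check Vitucci made against the physical server.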
Vitucci began researching the issue and saw that a patch had recently been issued that was supposed to reduce processor usage by Trend Micro services. However, investigation directly on the servers themselves revealed that the problem was not caused entirely by processes related to Trend Micro, but was largely related to the Pavis monitoring agents installed on the machines. Even so, Vitucci reasoned that applying the patch from Trend Micro would eliminate the issue.
To solve the issue, Vitucci first had to get the servers back to a state where the user interface was usable, and then apply the patch from Trend Micro. Since the servers were all in production, he was limited in what he could simply reboot. Restarting servers right then and there would cause a large disruption to the client’s productivity, so Vitucci knew he had to be strategic about how he proceeded.
Vitucci knew he could easily reboot the servers that had active high availability and/or redundancy, such as the domain controllers. The other mission-critical servers (BlackBerry® Enterprise Server, print server, database server) weren’t as easy. On these servers, he had to manually kill processes until the machine became usable and then apply the Trend Micro patch. Because the servers were in use, he had to schedule an automatic reboot to happen overnight.
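The triage described above can be summarized as a simple decision rule. The sketch below only builds a list of planned steps per server and does not run anything against real machines; the `Server` class, the `redundant` flag, and the step names are hypothetical labels introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    redundant: bool  # True if a peer can take over while this server restarts


def plan_remediation(server):
    """Return the ordered remediation steps for one server (planning only)."""
    if server.redundant:
        # A redundant server can be restarted during business hours.
        return ["reboot now", "apply Trend Micro patch"]
    # A mission-critical server must stay up until after hours.
    return [
        "kill runaway processes until usable",
        "apply Trend Micro patch",
        "schedule overnight reboot",
    ]


# Hypothetical servers standing in for the roles named in the article.
servers = [
    Server("domain-controller-1", redundant=True),
    Server("print-server", redundant=False),
    Server("database-server", redundant=False),
]

for server in servers:
    print(server.name, "->", plan_remediation(server))
```

Keeping the plan separate from execution lets the steps be reviewed with the client before any production server is touched.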
After Vitucci manually killed processes and set up reboot schedules on each of the individual servers, the issue was resolved and the symptoms cleared up. What is most striking about all of this is that the client had no idea any of it was going on. Vitucci says that he discovered the problem on his own, while doing his job of providing proactive IT support and managed services for the client. Instead of just reacting to a client complaint, Nick Vitucci discovered and remedied a situation before it had a major impact on client productivity.