Friday, October 12, 2012

Advanced Windows Desktop Troubleshooting with Sysinternals Process Monitor

This is a special post on a troubleshooting event that I was able to participate in which occurred a short while ago.

My company started deploying Windows 7 computers throughout the organization during the summer. We use two different Windows deployment types depending on where the computer is going. There is one very large specialty “image” (100gb in size) and one smaller “image” (15gb) for the general staff. The large image has a great deal of large software installs embedded in the image, while the smaller image is a standard one that include Microsoft Office, Adobe Reader, and other similar core apps that one may need. One special note about the larger deployment type is that it has Faronics DeepFreeze installed on it to prevent system changes.
When these computers started getting deployed on our network, a strange issue was happening. When a user would log in with their domain credentials on a PC with the 100GB image, it would literally take anywhere from 7-15 minutes for the desktop to go from logon screen to desktop view. We knew that the computer was not frozen at the “Welcome” screen after logging in because the please wait circle icon was rotating. All the user had to do was sit and wait around 10 minutes (or longer) until the computer displayed the desktop. This issue did not occur on computers with the standard deployment image.

In the beginning (when this issue really started), I pretty much knew where to start looking to resolve the problem. I remembered a TechEd 2010 video called "The Case Of The Unexplained" by Mark Russinovich, and I remembered reading some of his blog posts about troubleshooting very slow logins in Windows. I emailed some of this information to the support personnel to inform them that since no one could find the root cause, I recommended to take a look at Mark's blog posts for guidance. A few weeks went by and I never heard anything about it. But after a short while, the issue exploded and the “blame game” started up. One person would claim, “The network is the problem!”, or “The servers are the problem!”, or “The deployment image is the problem!”. Some people started clamoring, “Give me back my old Windows XP computer. I didn’t have a problem until I got this Windows 7 PC!”. We’ve all heard that one at some point in time haven’t we? (cough, Vista)… Anywhoo, lets move on.

Now, this issue had been going on for a few months but it wasn't a big deal until now. Some of the company "higher-ups" contacted our I.T. Directors and demanded that the problem be fixed NOW. One of the I.T. directors asked for network team, security team, and the server team’s assistance in finding out what the heck was going on. In the beginning it made sense to bring in all of these people resources because more and more Windows 7 computers were being deployed and this was now a widespread issue.

Each team (Desktop support, Server, Network, Security) worked together trying to locate the root cause. We checked the Event Viewer, we ran Wireshark traces to see if the problem was network related, we re-checked the image settings, the security team checked over the computers believing it could be a worm or virus causing the problem, we even re-built the deployment images because we weren’t sure of what the root cause was. To our surprise, the Wireshark captures and Windows Event viewer showed nothing strange. The image settings were the same as the smaller image and the problem came back even after rebuilding the large image. There were no viruses or worms detected. Did we disable IPv6? You bet we did. Did that fix the problem? Nope. In the end, the problem largely eluded us.

I was working on another critical project at the time research of this problem began. Since others were already putting in time on the problem, I stayed on the sidelines. But it was “all hands on deck” to resolve this issue and everyone was looking for help to get this problem fixed. Before long, my supervisor asked me to take a quick look at the problem since I may be able to help. I took a step back and looked at exactly what was going on. In general, the problem was very common. For some reason the computer was taking forever to finish the logon process. After a user logs into the computer using their domain credentials the "Welcome" screen would display with the rotating circle and stay in that mode for 5 to 10 minutes before the desktop would finally display. Using Mark's blog post "Case of the Verrry Slow Logins" as a guide, I turned to Process Monitor to start troubleshooting the problem. After I logged into on of the affected PCs, I copied PROCMON to the desktop of the PC, ran it, and enabled the boot logging option.
image
I wanted to capture all processes and events both during boot and after logging in, so I checked the box to generate profiling events every second.
image
After choosing this option, I immediately rebooted the computer. I logged back in (waiting the 5-10 minutes for the desktop to display), then launched Process Monitor and chose to save the boot logs immediately to the desktop. To my surprise, the boot logs were spread over four files and totaled over 1.3GB in size! Double-clicking on one of them opened up the entire log and displayed over 3 million events.

I needed a starting point. For that, I used Mark's blog as a reference. I cut straight to the chase and opened the Process Tree by clicking Options >> Process Tree.
image
The Process Tree shows every process that ran, was running, or opened at some point, during a logged trace. Starting from the bottom of the list I started looking at WinLogon.exe. The only thing that really stood out was the Citrix Receiver application. This entire process monitor trace starts at 3:57pm and ends at 4:05pm. The Citrix Receiver app started at 3:59 and ended at 4:02, taking 3 minutes to execute.
image
Another clue was the dark green bar under the Life Time column. Dark green indicates how long the process took to complete within this boot trace timeline. Anything light green indicates that the process never stopped during the trace (which is usually an internal Windows process or service). However, I was very skeptical that this was the culprit because looking at the sub-processes the Receiver app itself launches and runs in the system tray at all times. Plus, I hadn't heard of the Citrix Receiver causing boot problems before. So from the bottom of the Process Tree working my way up, I started looking at all of the processes one-by-one. Half-way up the list four very distinct dark green bars which were stacked on top of each other, in order, immediately caught my attention.
image
Two WScript processes were running side-by-side. But one of them called an application called FRAMEPKG.EXE. WScript is the Windows Script Host process that runs the domain VBscript login scripts on this computer. Yes, this computer is a member of a domain and in this domain we run login scripts. After one of the WScript process started, it launched FRAMEPKG.EXE, and under that executed FRMINST.EXE, which invokes Windows Installer (MSIEXEC.EXE). Framepkg.exe is digitally signed by McAfee as shown in the display grid. After doing a web search, I found that it is a part of the McAfee Antivirus Enterprise product. It was clear that a re-installation of the McAfee Framework package was occurring during a logon session. Scrolling to the right confirmed by suspicion that it was highly possible that this was the root cause because the process started at 3:58 and didn't finish until after 4:02pm, more than 4 minutes. Not only that, looking under the Commands column confirmed that the Framework package was indeed kicking off a re-installation thanks to the "/ForceInstall" switch leaving little to the imagination.
image
Highlighting the FRAMEPKG.EXE process showed more information about how\why it was launched at the bottom of the window.
image
So, a login script is kicking off the installation of the McAfee Framework package for McAfee Antivirus Enterprise after each user logon. But there is one more gotcha with this instance…. If you remember from the beginning of this article, Faronics DeepFreeze is installed on these computers to prevent system changes and these computers do not have any domain user profiles stored on them.... Are you starting to see the problem here? A program installation is running side-by-side with the user profile creation process (that copies data from the Default User profile). The hard disk I\O is off the charts and is incredibly intensive with this going on...

Here is a rough rundown of exactly what was happening after these PCs boot up:
  • PC boots.
  • Faronics DeepFreeze driver loads, freezing system changes.
  • Login screen appears.
  • User enters domain login credentials to log in.
  • The Welcome splash screen appears with the rotating circle, not yet displaying the desktop.
  • Login scripts run in the background, with one of them forcing the re-installation of the McAfee Framework package.
  • The McAfee Framework package first removes the existing installation of the Framework package, then installs it all over again.
  • While the McAfee installation is running, the user's profile is built for the first time further impacting disk I\O.
  • Only after the installation finishes and the user's profile is built, the desktop finally displays and the PC operates normally.
  • When the PC reboots, any system changes are wiped out.
  • Wash, rinse, repeat.
I was fairly confident that I found the culprit. The question "Why is this login script launching an Antivirus installation package after a user logs into the computer, every time a user logs in after the PC has been rebooted?" couldn't be answered by me. For that, I needed to query the server admin team. I sent an email to our server admins and included these findings and asked them why the login script is forcing the installation of the McAfee software after every login. I also provided proof that this was happening by showing them the Process Monitor logs. The server admins were very surprised to hear that this script was still active because it was removed from the domain a few months before we started rolling out Windows 7. After a quick look at the GPOs, that McAfee login script was still active in one of the OUs where the PCs with the 100GB Windows 7 deployment image reside. After removing the GPO culprit, the slow login issues on these computers simply went away.

This problem had been going on for several months and Process Monitor helped us to resolve the problem within just a few hours. Through this story, you can see just how powerful Process Monitor is. What I learned throughout this event is how important proper troubleshooting of the Windows Desktop really is. Unless you are absolutely sure, don’t start placing blame elsewhere if you cannot figure out a problem. If you are having a problem on a Windows desktop OS, then start troubleshooting on the Windows desktop OS itself and branch out from there. For me, this is elementary. But for others, it may be a challenge. One way or another, locating evidence that leads to the solution starts where the issue was found. It is up to you to find it. Troubleshooting the Windows desktop is like crime scene investigations and forensics; If you dig deep enough, you will find the culprit.

There isn't much more to say here in this article other than to praise Mark Russinovich and Bryce Cogswell, as well as their Sysinternals colleagues, for creating these tools because without them I guarantee that we would still be working on trying to resolve the issue, or contacting Microsoft support services. This is only but one example of why it is important to use the correct tools to troubleshoot a problem. I will close out this article with the following quote from Mark Russinovich himself...

"When you take the approach of restart the app or restart the operating system or even re-install an operating system and start from scratch and hope that the problem doesn't happen again, you haven't learned anything, you don't know what caused the problem, you don't know how severe the problem is, or if the problem is going to come back and bite you again. It is worth spending even just a few minutes to go try to figure out what's happened. Often times, that will lead you to a solution." - Mark Russinovich, TechEd 2010 Session "The Case of the Unexplained, 2010: Troubleshooting with Mark Russinovich"

- Joe