Troubleshooting Network Issues With Sysinternals Tools

My last post was about the clever app Latch and how to create a simple plugin to integrate it into the Windows logon. It was good fun to create, write and share, but one issue arose during testing and was discussed in the comments section: the slow server responses I had reported did not seem right, or at least they weren’t shared by other users.

Determined to check whether the problem was on my side, I booted up again the virtual machine I had downloaded from modern.ie to develop and test the Latch plugin, and tried to recreate the issue. My scenario was as follows:

pGina Simulator Tool -> Fiddler -> Internet Connection -> Latch Servers

Fiddler offers a detailed list of the time spent on each step of the request/response communication and, as before, my results were:

Request Count:   1
Bytes Sent:      261        (headers:261; body:0)
Bytes Received:  803        (headers:425; body:378)

ACTUAL PERFORMANCE
--------------
ClientConnected:       00:47:23.719
ClientBeginRequest:    00:47:24.078
GotRequestHeaders:     00:47:24.078
ClientDoneRequest:     00:47:24.078
Determine Gateway:     0ms
DNS Lookup:            0ms
TCP/IP Connect:        0ms
HTTPS Handshake:       0ms
ServerConnected:       00:47:23.875
FiddlerBeginRequest:   00:47:24.078
ServerGotRequest:      00:47:24.078
ServerBeginResponse:   00:47:30.078
GotResponseHeaders:    00:47:30.078
ServerDoneResponse:    00:47:30.078
ClientBeginResponse:   00:47:30.078
ClientDoneResponse:    00:47:30.078

    Overall Elapsed:   0:00:06.000

My first impression on seeing that was that the Latch servers were the ones having issues, as the step taking the most time was the one between ServerGotRequest and ServerBeginResponse. Case closed!
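To make that reading of the timers explicit, here is a minimal Python sketch that takes a few of the timestamps from the Fiddler statistics above and reports the largest gap between consecutive events. The dictionary and the function name `largest_gap` are just illustrations of mine, not part of Fiddler.

```python
from datetime import datetime

# Timestamps copied from the Fiddler statistics shown above.
TIMERS = {
    "ClientConnected": "00:47:23.719",
    "ServerConnected": "00:47:23.875",
    "ClientBeginRequest": "00:47:24.078",
    "ServerGotRequest": "00:47:24.078",
    "ServerBeginResponse": "00:47:30.078",
    "ClientDoneResponse": "00:47:30.078",
}


def largest_gap(timers):
    """Return (earlier_event, later_event, seconds) for the biggest pause
    between two consecutive events once they are sorted by time."""
    ordered = sorted(
        timers.items(),
        key=lambda item: datetime.strptime(item[1], "%H:%M:%S.%f"),
    )
    best = None
    for (name_a, time_a), (name_b, time_b) in zip(ordered, ordered[1:]):
        delta = (
            datetime.strptime(time_b, "%H:%M:%S.%f")
            - datetime.strptime(time_a, "%H:%M:%S.%f")
        ).total_seconds()
        if best is None or delta > best[2]:
            best = (name_a, name_b, delta)
    return best


print(largest_gap(TIMERS))
```

With these values the biggest pause is the six seconds between ServerGotRequest and ServerBeginResponse, which is why the server looked like the obvious suspect.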

Or maybe not. So what was the next step? Trying to reproduce it from another machine and a different internet connection (the previous one was in Central London, now I was trying from sunny Madrid). And oh dear… it was a blast! Response received in less than a second. A solid 0.5 s response, every time.

So it looks like Fiddler was lying to me (or maybe not Fiddler, but the way Microsoft raises the different events that Fiddler might be listening to when intercepting/analysing the HTTP requests), and I was on a quest to identify my network problems.

The first thing was to reproduce the issue again, but this time running something other than Fiddler, so I ran Procmon on the physical machine to check the network-related events of the virtual machine:

Procmon filter

I also added some rules to highlight any network-related activity, to easily identify when the connection was made and when the response was received.

highlight filter

After that it was a matter of digging into the long list of generated events to try to identify the root of the problem. In my case the event corresponding to ServerGotRequest in Fiddler was:

08:59:00.6788252    VirtualBox.exe    9044    TCP Send    pato.hitronhub.home:3306 -> ec2-54-72-11-190.eu-west-1.compute.amazonaws.com:https    SUCCESS    Length: 293, startime: 77060, endtime: 77061, seqnum: 0, connid: 0

And then, for more than five seconds, no more TCP activity, until I got this:

08:59:05.9079370    VirtualBox.exe    9044    TCP Receive    pato.hitronhub.home:3306 -> ec2-54-72-11-190.eu-west-1.compute.amazonaws.com:https    SUCCESS    Length: 837, seqnum: 0, connid: 0
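Procmon prints the time of day with seven fractional digits, one more than Python's `%f` format accepts, so the gap between the two events above is easier to compute by hand. This is a small sketch; the helper name `procmon_seconds` is my own.

```python
def procmon_seconds(time_of_day):
    """Convert a Procmon 'Time of Day' value (HH:MM:SS.fffffff)
    into seconds since midnight."""
    hms, fraction = time_of_day.split(".")
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds + int(fraction) / 10 ** len(fraction)


# Timestamps of the TCP Send and TCP Receive events shown above.
sent = procmon_seconds("08:59:00.6788252")
received = procmon_seconds("08:59:05.9079370")
gap = received - sent

print(f"Silence between send and receive: {gap:.3f} s")
```

That comes out at a bit over 5.2 seconds of silence on the wire, in line with the delay Fiddler was reporting.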

Now my quest was to identify the culprit taking up those 5 seconds and not honouring the Amazon cloud servers’ speed. Luckily for us, Process Monitor offers a nice summary of the events happening for each process, so we can easily identify where to start looking:

events timeline

As we can clearly see in the picture above, there is a crazy peak of registry activity just before each network action (the request and the response are both represented on the graph), but I was only noticing it on the response, as Fiddler only measured after the request was initiated.

Checking the registry events, I found that the VirtualBox process was searching for different registry keys that didn’t exist. Those keys were related to network interfaces, and checking my network interfaces quickly revealed the problem:

networki

At least four different services were active on my Internet connection (before I took that screenshot). Some of them, like the DNE LightWeight Filter or the Microsoft Network Monitor 3 Driver, I was no longer using (and had forgotten to uninstall). Disabling/deleting them solved my problem, and I was now able to enjoy full-speed Latch servers once again.

Actually I found this out quite a while ago, but didn’t have proper time to write it up here until now. I posted it briefly on my Twitter account.

Using pGina and Latch to protect your Windows login

When Latch was revealed last December it immediately caught my attention. Such a simple idea and a great improvement in account security. Also, the team behind the product is putting a lot of effort into making it a success (obviously) and they are releasing plugins for different platforms and frameworks. All this fed into my idea of starting this blog with a nice, cool, not-another-hello-world and still useful post.

The final push to do so was the release of an SSH plugin for Latch, which allows you to configure your Linux machine to require your account to be unlocked by Latch in order to log in. “What about Windows? Will I be able to do the same for a Windows machine?” So I started my research.

The first step was to check how to use .NET (my C++ is not any good, to be honest) to develop a custom Credential Provider for Windows, and I came across pGina, an open source implementation of the Microsoft API that allows custom .NET plugins to be added to the logon process. Perfect!

Following the pGina documentation I managed to put together a simple Latch plugin that blocks access to a Windows machine depending on the user’s Latch configuration. I have shared it on GitHub and I am accepting comments, bugs and suggestions.

Latch protecting Windows login

Things the plugin is missing at the moment are:

  • Multi-user support: currently only one Latch user can be paired with the Windows machine
  • Two-factor authentication. pGina does not support adding UI control elements to the logon process. This might have to wait until they include it in the core or I find a way around the problem
  • Active Directory support anyone?

Finally, as a “here be dragons” section about things I learned and discovered while coding this plugin, here are some comments about pGina and Latch:

  • Latch is missing (or I couldn’t find a way to do it) a validate function to ensure the entered data (application id and secret) are valid while configuring the application. I had to use a little trick that I am not really proud of.
  • pGina runs as a Windows Service but, by default, the simulation runs as the current user. I am using the ProtectedData class to store the application data in the registry and forgot to use DataProtectionScope.LocalMachine instead of DataProtectionScope.CurrentUser. It took me some time to figure this out, shame on me.
  • The Latch service takes a while to respond. Sometimes this is up to 5 seconds after the request is made. I know it is not a lot, but it is still noticeable. UPDATE: I found this was due to multiple network services running on my physical machine.
  • When installed, pGina allows any admin user to log in even if the authorization plugins fail. This is to prevent possible account lockouts. It can be disabled with a little trick in the registry (or at least that is what the developers say)
  • And finally, Latch is a private solution. I’d love to see an open-source implementation that you can install in your own environment, to avoid being left unable to log in because Latch’s servers are down. Or maybe it is up to the plugin developers to implement a “safe” switch in case Latch’s servers stop responding.
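
As a rough illustration of that last “safe” switch idea, here is a minimal sketch in Python (not the .NET code of the plugin). The `check_status` callable is hypothetical and stands in for whatever call asks Latch whether the account is unlocked; it is not the Latch SDK API. The timeout value and the `allow_when_unreachable` flag are assumptions of mine too.

```python
import concurrent.futures


def latch_allows_login(check_status, timeout_seconds=2.0, allow_when_unreachable=True):
    """Ask Latch whether login is allowed, with a fallback policy.

    check_status: hypothetical callable returning True when the latch is
    open (login allowed) and False when it is locked.
    allow_when_unreachable: the "safe" switch, i.e. what to answer when the
    call times out or fails.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(check_status)
    try:
        return bool(future.result(timeout=timeout_seconds))
    except Exception:
        # Timeout, network error or anything else: apply the configured policy.
        return allow_when_unreachable
    finally:
        # Do not block the logon waiting for a stuck request to finish.
        executor.shutdown(wait=False)


# Example with stand-in status checks.
print(latch_allows_login(lambda: True))   # latch open
print(latch_allows_login(lambda: False))  # latch locked
```

Whether the switch should let users in or keep them out when Latch cannot be reached is a trade-off between avoiding lockouts and keeping the protection on, so I would make it a setting rather than hard-code it.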