For the last few months we have been experiencing intermittent issues with one of our production processes. The issue is one that has confounded us in its lack of consistency and ability to frustrate anyone assigned to troubleshoot the problem. I was asked to look into the situation and in the end was able to discover the root cause in just a few hours using Telerik JustDecompile coupled with Microsoft's SQL Profiler tool. The journey was exciting and I'll share what I can here.
- JustDecompile [telerik.com]
- SQL Server Profiler [msdn.microsoft.com]
- SQL Server Profiler deprecation - Replacement? [stackoverflow.com]
Over the last few months it was noticed that a search process would sometimes 'lose' results. There was no relationship between searches performed and the results that were dropped. This caused confusion and delay as searches had to be re-submitted and you would not know if they would succeed or fail until you tried.
Things took a turn for the worse around the new year: more and more results were being dropped until we weren't getting any search results back at all. There was speculation that the 'leap second' added on December 31st was causing the problem (fortunately this was a non-issue). About this time I was asked to take this troubleshooting task on as my priority.
Ordinarily a problem like this wouldn't be too hard to solve: add debugging or logging to the search tool/api/system, run a few test searches and see how far down the rabbit hole you can get. That approach was complicated by a few things:
- This is an old system and no one knows anything about how it is supposed to work
- While there is source code in our SCM, it appears that source control was circumvented for the last major update to the system, so we can't just add logging and recompile without redeploying an update that removes essential functionality
- Management was hesitant to make potentially breaking changes given the fragile nature of the system
Road to Recovery: Add Logging
By the time I was involved things were broken enough that Stakeholders were willing to let us make changes to the system. Given that we did not have an up to date source code, I turned to various .NET decompilers to try and reconstruct a Visual Studio project that we could use for troubleshooting. It was a surprise to see that my preferred tool for decompiling .NET assemblies wasn't up to the task. In the end only Telerik's JustDecompile tool was capable of decompiling our executable into a project I could immediately recompile without hassle.
We enabled a TON of logging in the tool and dropped it in as a replacement for the old executable. The logging showed that everything was working as it was supposed to, so we had to come up with another approach.
Road to Recovery: SQL Profiler Takes the Day
I pulled up SQL Profiler to help us understand how the search tool was querying and updating information in the database. The search tool only performs lookups when work items are marked witha 'ToDo' flag. If work items get mangled on the way into the database, the search tool will do nothing.
After leaving the SQL Profiler tool on for an extended amount of time I noticed that we were seeing search tool queries even though we had temporarily disabled it. This demonstrated that there were other factors in play, so I altered the SQL trace to include additional columns (including the HostName column) and waited.
Sure enough, we saw the search tool queries again- and this time we could see they were coming from another server in our environment. Upon further investigation we found that the virtual machine that hosts the search tool had been cloned and left running.
Once we knew that another system in the environment was running the search tool we investigated and found that all of our missing search results were cached on that system. The search tool was disabled on that server and the missing results were restored to their proper place.
It can be incredibly difficult to troubleshoot intermittent problems like this. Without appropriate tooling (in this case Telerik's JustDecompile and Microsoft's SQL Profiler) it would have been virtually impossible to present a viable path that management would feel comforable with us pursuing. It is disappointing to hear that Microsoft is deprecating Trace/Profiling in some future version of MSSQL- I hope they implement a capable replacement (Extended Events?).