Practical Data Analytics - Troubleshoot Your IIS Web App in Seconds with Kusto

How many of you find yourselves in tech-ish conversations where the topic shifts to doing something useful with all the data pouring out of your company's data tap? In my experience those conversations quickly turn to how data analytics will - one day - allow for predictive detection of anomalies, robotic responses, etc. If you are like me, it surely sounds like a cool and exciting future that we all aspire to make real. So why not get started today?

I believe that to start that journey we need to find practical ways to apply data analytics to the platforms we use right now, with the data we have right now, and use it to make a tangible impact on the job we do right now. So, let me show you HOW with a quick dive into the Microsoft Kusto data platform.
 

Kusto is a time-series data platform that Microsoft developed in-house for managing its own log data. Encouraged by that success, Microsoft turned it into an Azure service we can all use for applications hosted on-prem or in a cloud. Kusto shows up in several Azure services as the underpinning data logging platform ... you'll find it in Azure Log Analytics, Azure Application Insights, Azure Sentinel, Azure Data Explorer, etc. If you understand the concept on other platforms like Elastic and Splunk, you can make a simple equivalency if it's helpful. A good description of the Kusto syntax is provided by Microsoft ...

The syntax of the tabular expression statement has tabular data flow from one tabular query operator to another, starting with data source (e.g. a table in a database, or an operator that produces data) and then flowing through a set of data transformation operators that are bound together through the use of the pipe (|) delimiter. link
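
In other words, a query reads top to bottom like a pipeline: start with a table, then keep piping the rows through operators. Here's a minimal sketch of that pattern (the table and column names are only placeholders, not real tables in the demo workspace):

SomeTable                             // data source
| where TimeGenerated >= ago(1h)      // transformation: keep the last hour of rows
| summarize count() by SomeColumn     // transformation: aggregate
| sort by count_ desc                 // transformation: order the output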

For the sake of this discussion we'll be focusing on the Azure Log Analytics platform. Log Analytics enables you to deploy log ingestion agents on your servers or send log streams directly to a REST API. I'll be using a demo environment that Microsoft maintains and makes accessible to the world [Demo Log Analytics Link]. You'll need to sign up for access to the Azure Portal, which is easy. Once you sign in you'll be taken to the query interface in your browser. I highly recommend you use a Chromium-based browser.
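
As an aside, if you end up working in your own workspace rather than the demo one, a quick sanity check that your agents are actually connected is to query the Heartbeat table (a hedged sketch; it only returns rows if agents are reporting in):

Heartbeat
| summarize LastHeartbeat = max(TimeGenerated) by Computer   // most recent heartbeat per agent
| sort by LastHeartbeat desc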





The query editor is laid out with the left-hand panel showing you the tables that are present in the workspace. For this demo I'll use the W3CIISLog table that sits under the LogManagement node. Feel free to navigate around to see the attributes of that table.
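
If you'd rather explore the schema with a query than with the tree view, the getschema operator lists every column in the table along with its data type:

W3CIISLog
| getschema      // column names, ordinals, and data types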

So let's first find out how many computers are sending their IIS logs to this workspace over the last 2 days.


W3CIISLog
| where TimeGenerated >= ago(2d)
| summarize by Computer
| sort by Computer asc


I suggest you type this query in rather than cut and paste it, because you'll get to see the IntelliSense prompts that help you structure the query. This query uses the W3CIISLog table, looks for rows of data created over the past 2 days, then summarizes them by Computer name and sorts by name in ascending order. When I ran this query 928 rows were returned, which means we have a fleet of 928 servers running IIS and sending logs.
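
As a side note, if all you want is the number itself rather than the list of names, a small variation on the query above lets Kusto count the distinct computers for you:

W3CIISLog
| where TimeGenerated >= ago(2d)
| summarize dcount(Computer)      // distinct count of computers sending IIS logs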

TIP: For those who want to use the F5 shortcut to select and run their query ... DON'T DO IT (F5 will refresh your browser)! Simply highlight your query and use the SHIFT+ENTER shortcut.

Now let's get a feel for the overall health of our IIS fleet.


W3CIISLog
| where TimeGenerated >= ago(2d)
| summarize count() by scStatus
| sort by count_ desc


So we see in the data that the HTTP status code 200 is the most frequent response code, which is good.
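
If you want the proportions as numbers rather than eyeballing them, one way (a sketch using a let statement and toscalar to grab the total request count first) is to extend the summarized rows with a percentage column:

let TotalRequests = toscalar(      // total request volume over the window
    W3CIISLog
    | where TimeGenerated >= ago(2d)
    | count);
W3CIISLog
| where TimeGenerated >= ago(2d)
| summarize RequestCount = count() by scStatus
| extend Percent = round(100.0 * RequestCount / TotalRequests, 1)
| sort by Percent desc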




Now let's visualize the data in a quick pie chart by adding the render operator.


W3CIISLog
| where TimeGenerated >= ago(2d)
| summarize count() by scStatus
| sort by count_ desc
| render piechart






Now we can clearly see that about 67% of the web requests to our web servers are good and around 18% are failing with HTTP 500 errors. Let's now take a look and see which servers are generating the 500 errors.


W3CIISLog
| where TimeGenerated >= ago(2d)
| where scStatus == "500"
| summarize count() by Computer
| order by count_ desc


The result set returns over 900 rows, ordered from highest to lowest count. To help us focus, let's just look at the top ten by replacing the 'order by' line with 'top 10 by'.


W3CIISLog
| where TimeGenerated >= ago(2d)
| where scStatus == "500"
| summarize count() by Computer
| top 10 by count_ desc


Let's render this as a pie chart.


W3CIISLog
| where TimeGenerated >= ago(2d)
| where scStatus == "500"
| summarize count() by Computer
| top 10 by count_ desc
| render piechart
 

From our pie chart we can see that 2 servers in the 'contosohotels.com' domain are generating over 50% of the bad requests in our IIS fleet. Probably not a good user experience!
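
Before drilling into the requests themselves, it can be worth checking whether those errors are steady or spiking at particular times. A quick sketch (using the two server names from the chart) bins the errors by hour and renders a timechart:

W3CIISLog
| where TimeGenerated >= ago(2d)
| where scStatus == "500"
| where Computer in ("AppBE00.NA.contosohotels.com", "AppBE01.NA.contosohotels.com")
| summarize count() by bin(TimeGenerated, 1h), Computer   // hourly error counts per server
| render timechart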




Let's look at those 2 servers and find out which requests are linked to the bad responses. 


W3CIISLog
| where TimeGenerated >= ago(2d)
| where Computer in ("AppBE00.NA.contosohotels.com", "AppBE01.NA.contosohotels.com")
| where scStatus == "500"
| summarize count() by Computer, csUriStem




Well, it looks like requests to '/Service.svc' are the ones that are failing, so it seems we should dive in there next. It may also help to take 10 of the complete rows of data and investigate the other attributes to see if anything else jumps out.


W3CIISLog
| where TimeGenerated >= ago(2d)
| where Computer in ("AppBE00.NA.contosohotels.com", "AppBE01.NA.contosohotels.com")
| where scStatus == "500"
| take 10
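
If the raw rows are hard to scan, a narrower summarize that pulls in sSiteName (the IIS site column in this table) makes the hosting site easier to spot; this is just a sketch of one way to slice it:

W3CIISLog
| where TimeGenerated >= ago(2d)
| where Computer in ("AppBE00.NA.contosohotels.com", "AppBE01.NA.contosohotels.com")
| where scStatus == "500"
| summarize count() by Computer, sSiteName, csUriStem   // which site and URI the failures belong to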





Based on the results above I can see that the page is being hosted in the Default Web Site on the IIS server. At this point it would be a good idea to look into the Event Logs on the Windows server to see what else might be happening. 



Event
| where TimeGenerated >= ago(2d)
| where Computer in ("AppBE00.NA.contosohotels.com", "AppBE01.NA.contosohotels.com")
| where EventLog contains "Application"
| where EventLevelName contains "Error"
| summarize by Computer, EventID, RenderedDescription



Unfortunately, the demo Log Analytics workspace is not configured to collect the Windows event logs from these servers, so we get back zero rows. To resolve this we'd need to change the configuration of the workspace per this article. That's beyond the scope of this article and not something I have permission to do in this workspace anyway.
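
If you ever need to confirm that a gap like this is an ingestion problem rather than an over-aggressive filter, one quick check (a sketch using the union operator) is to compare row counts across tables for the same computers:

union withsource=SourceTable W3CIISLog, Event
| where TimeGenerated >= ago(2d)
| where Computer in ("AppBE00.NA.contosohotels.com", "AppBE01.NA.contosohotels.com")
| summarize count() by SourceTable      // how many rows each table holds for these computers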


There are additional ways to go much deeper than what is logged at the OS level by deploying Azure Application Insights to your IIS servers. It lets us drill down into end-to-end transactions, see failed dependency calls (think SQL, REST calls, etc.), and catch client-side exceptions. Again, this is beyond the scope of this article, yet I'll cover it in the not-too-distant future. Stay tuned!

      

 
Well, there you have it: you are now armed with practical data analytics that will enable you to troubleshoot your IIS web applications in seconds, at scale. I do hope this will serve as a jumping-off point for you and inspire you to dive into the depths of your data!