This blog contains reflections and thoughts on my work as a software engineer

onsdag den 14. oktober 2009

Definition of software quality #3

I have from time to time struggled hard to define the term "software quality". I have written about the subject before

...and on and off it has come back to us at work - we've discussed the subject for hours without any conclusions we could all back up 100%. The discussions were mainly taken whenever we had delivered incomplete features which proved to be - less than adequate tested...  :o)   At my current job we have no testers employed and our QA process is driven by developers so we do our best and take responsibility for our actions knowing that testing our own code is an antipattern. I had a small breakthrough last week - I think I've managed to figure out metrics for our business which actually is watertight - it was during a meeting when I thought it up and asked the others for feedback. They bought it immediatly and we've started gathering data for the first time... What we're measuring? First a little background story:

At work we handle a lot of asynchronous processing of online enrollments. An enrollment made by a user because he or she wants to attend an arrangement or a course (summer camps, soccer schools, gymnastics every Thursday throughout the winter etc.) ends up in our backend system as a "job" which will be processed asynchronously. A job could be an enrollment but it can also be updating an address change in various databases and 3rd party systems. Another job could be sending Emails to everybody attending a summercamp. We have a lot of business processes ending up as different kinds of jobs in our database. 

Processing jobs succesfully are highly critical for our business to function so if a job fails the business needs to attend it and fix the error. A likely cause for an error could be that a 3rd party webservice didn't respond in a timely fashion or that our support staff has blocked someone from attending more courses because the person in question didn't pay for the last course. That person is by design still able to make another enrollment but the job created will fail because there is a business rule somewhere blocking further money transactions on that particular person. Enter a developer who makes the necessary phonecalls to clarify if the person should be unblocked or not. The developer can then restart the job if our support staff unblocks that person in our backend systems.

Why do I tell you all this? Because I figured during a meeting last week that we could easily get a clear indication of our software quality by measuring the amount of failed jobs during a period of time. We are interested in measuring the quality of code. What code? That really was the question... And how to measure if that code was OK? You can get some of the way by using tools to analyze code and reviewing the codebase but I have realized that the main interest of the business is to know that whenever they use our backend as part of their business process they need to know that everything "works". How do we know that it doesn't? If a job fails... The only people who care about TDD, IoC, mocking frameworks, Cyclomatic Complexity and code coverage are developers - the business just wants to know that when they click "Send Email" that email was actually sent to the recipients. The quality of the code itself really comes second as long as your users are able to do their job. I had two other developers and my boss think about it for a few seconds and they bought it instantly. As of last week we've started saving an entry in a database table whenever a job fails us so we can track over time how many failed jobs we have under the current load. Then we also know which parts of the codebase and which business processes that needs improvement if the same kind of job fails repeatly under the same conditions. Sweeeeet........

If you (as a developer) are in doubt about the quality of your codebase ask yourself these questions:

  • Can you in any way as the software developer with insight measure when the software fails a user?
  • Can you in any way as the software developer with insight measure when the software fails a user - when the user doesn't know that an error occured?
  • Can you as a developer fix any problem in the codebase knowing all sideeffects of your changes?
  • Can you easily test and deploy a bugfix/patch to the production environment?

There are many ways to measure code quality but start focusing on the user experience and the shortcuts they make because the software doesn't work as expected. Software quality should be measured in

  1. Happy users being able to do what your software promises them they can do.
  2. Happy programmers feeling comfortable with the current state of the codebase

Until next time...

lørdag den 10. oktober 2009

Tracking down an IIS deadlock with WinDbg

We’ve had a few performance issues lately on the website I’m working on – while digging through code we’ve discovered a lot of technical debt which we’ve started paying more attention to. Also we discovered that the IIS on our farm servers went down on a daily basis – approximately one or two times per day the IIS on each server would recycle the application pool for reasons unknown to us. There was an entry in the Eventlog telling us that the IIS had deadlocked and hence a reset was issued. Surveillance didn't work as expected so we never got anything but random users once or twice a month complaining that "The site is slow sometimes"...

We developed quite a lot of theories along the way but due to gut feeling and what we could see from log files etc. we suspected either thread pool starvation because of the way we made some synchronous long-running webservive calls to i.e. clear the server’s CMS cache or an infinite loop spinning up eating all available ressources before the IIS decided that it needed to recycle the application pool to come back to life. We had little else apart from a Performance Monitor telling us a few minutes in advance that a server was going down due to a massive amount of SQL User connections suddenly craving all ressources.

Enter WinDbg – the coolest and geekiest debugging tool I have ever worked with. Until last week I had no knowledge of debugging using WinDbg apart from reading a few blogposts from time to time written by people who have made debugging their niche in the development community. I decided to take a dive into it because we needed some facts on the table instead of just creating theories. First of all I knew we had to collect some sort of dump of the process when it was in an unhealthy condition so se started to collect memory dumps using ADPlus and after a few hours of waiting the IIS on one server went down and generated a large 500+ MB dump named [something]_Process_Shutdown_[timestamps-something].dmp. I spent the next three days deep inside the dump, scratching my head and reading lots and lots of blogposts found on Google, peeking and poking and trying out commands which didn’t output anything useful. I wrote it off at first because I suspected my own missing competence for the lack of finding information until I came across a blogpost writing about false positive dumps… Which was exactly what I had. A false positive means that the dump was generated when the process was in a different state than expected – something you can only find out by debugging the content of the dump. What I had wasn’t a dump from a deadlocked process, oh no – I had a dump from the process right after the IIS had decided to recycle the application pool so every thread except the Finalizer had no ID which means it had been shut down nicely and was basicly just waiting for the garbage collector to put and end to it's misery. That was also why there were no callstacks on the threads, nothing useful in the heap… Heureka!!! Nothing but a lot of useless bytes in a big file, basicly.

A lot wiser but still with no data I started looking for a way to collect a dump when the IIS was in the deadlocked state. I managed to locate an MSDN-article which described how to set the metadata property OrphanWorkerProcess on the IIS to “true” so instead of the IIS automaticly recycling the application pool I could kick off a script invoking a WinDbg through the commandline which could write a dump of the process to the disc for later analysis. It took a while but after a bit of trial and error I managed to get the script working. The script worked fine apart from the IIS not calling filename.cmd with a parameter whenever the worker process was orphaned so whenever the IIS executed the script the script invoking the debugger failed because there was no process ID named %1… It took only another two days to figure out what was going on, thank you very much Microsoft. MSDN is an invaluable ressource for me but I didn't have anything nice to say about the author of the article for about 10-12 minutes   :o)

As of tonight - art break - I managed to get a positive-positive dump and it took only 15 minutes to figure out exactly what was going on. The last week’s trial and error, peeking and poking did prove to be a good investment – here’s what I did:

I downloaded the dump generated by WinDbg to my local machine and fired up WinDbg on my local machine. I had earlier downloaded the .NET Framework version from the farm server because even though you have version v2.0.50727 installed on your local machine there is a chance that it isn’t the exact same revision as the one you have installed on your server. In my case my local workstation’s copy of aspnet_isapi.dll is v2.0.50727.4016 while the server has 2.0.50727.3053. I figure it has something to do with the fact that I’m running Vista while the farm server is a Windows 2003. It hasn’t got any effect in your daily life but it is essential while debugging a crash dump that you have the *exact* same assemblies and symbol files as the machine where the dump was generated otherwise you might not be able to extract information out of the dump or – even worse – you might get corrupted or incorrect data which could very easy lead you in a completely wrong direction. So I zipped and downloaded the .NET Framework from the farmserver to my local machine and used the command “.exepath+ [path]” to search for matching assemblies in [path] if no matches in the default locations were found.

I loaded the dump into WinDbg using and thought back on our initial suspects. Our first suspicion was the infinite loop because it was the strongest theory we had. Debugging threading issues such as two threads deadlocking eachother would also be harder while an infinite loop would without a doubt reveal itself in a very long stacktrace on one of the active worker threads - given that I didn't have a false positive crash dump that is(!). So I used !threads to get a list of threads – here’s the output:

0:028> !threads 
ThreadCount: 52
UnstartedThread: 0
BackgroundThread: 39
PendingThread: 0
DeadThread: 6
Hosted Runtime: no
PreEmptive GC Alloc Lock
ID OSID ThreadOBJ State GC Context Domain Count APT Exception
13 1 52c 000ebc28 1808220 Enabled 00000000:00000000 000db760 0 MTA (Threadpool Worker)
15 2 1008 000f7ce0 b220 Enabled 00000000:00000000 000db760 0 MTA (Finalizer)
16 3 164c 00110208 80a220 Enabled 00000000:00000000 000db760 0 MTA (Threadpool Completion Port)
17 4 11c0 00113840 1220 Enabled 00000000:00000000 000db760 0 Ukn
11 5 314 0012f818 880a220 Enabled 00000000:00000000 000db760 0 MTA (Threadpool Completion Port)
18 6 12a0 0015ff88 180b220 Enabled 00000000:00000000 000db760 0 MTA (Threadpool Worker)
19 7 1760 0015a5c8 200b020 Enabled 00000000:00000000 001147a0 0 MTA
20 8 11fc 00164c40 200b020 Enabled 00000000:00000000 001147a0 0 MTA
21 9 f80 00158dc8 200b020 Enabled 00000000:00000000 001147a0 0 MTA
22 a 14dc 001597a8 200b020 Enabled 00000000:00000000 001147a0 0 MTA
23 b 170 0012a638 200b020 Enabled 00000000:00000000 001147a0 0 MTA
24 c 152c 00170c58 200b020 Enabled 00000000:00000000 001147a0 0 MTA
25 d 1668 06b33c10 200b020 Enabled 00000000:00000000 001147a0 0 MTA
27 e e70 06b6fb40 200b220 Enabled 00000000:00000000 001147a0 0 MTA
28 f 5e4 06b5e420 180b220 Enabled 00000000:00000000 001147a0 1 MTA (Threadpool Worker) System.Data.SqlClient.SqlException (1ee19b2c) (nested exceptions)
29 10 1470 06b6a650 200b220 Enabled 00000000:00000000 001147a0 0 MTA
30 11 fa4 06b7dbe8 180b220 Enabled 00000000:00000000 000db760 0 MTA (Threadpool Worker)
10 12 148c 06b80e60 220 Enabled 00000000:00000000 000db760 0 Ukn
7 14 1640 06b8e1b0 220 Enabled 00000000:00000000 000db760 0 Ukn
37 16 fd4 08f340d8 180b220 Enabled 00000000:00000000 000db760 0 MTA (Threadpool Worker)
39 18 8bc 09097448 200b220 Enabled 00000000:00000000 001147a0 0 MTA
40 17 104c 090532b8 200b220 Enabled 00000000:00000000 001147a0 0 MTA
41 19 1238 090967e0 200b220 Enabled 00000000:00000000 001147a0 0 MTA
42 1a 13fc 090a8bd8 200b220 Enabled 00000000:00000000 001147a0 0 MTA
43 1b 1408 09055108 200b220 Enabled 00000000:00000000 001147a0 0 MTA
44 1c 1580 090559c0 200b220 Enabled 00000000:00000000 001147a0 0 MTA
45 1d 184 09056278 200b220 Enabled 00000000:00000000 001147a0 0 MTA
46 1e 13dc 09056b30 200b220 Enabled 00000000:00000000 001147a0 0 MTA
47 1f 12e8 090573e8 200b220 Enabled 00000000:00000000 001147a0 0 MTA
48 20 1458 06c04298 200b220 Enabled 00000000:00000000 001147a0 0 MTA
49 21 93c 06c04878 200b220 Enabled 00000000:00000000 001147a0 0 MTA
50 22 d5c 06c052a8 200b220 Enabled 00000000:00000000 001147a0 0 MTA
51 23 15ec 06c05cd8 200b220 Enabled 00000000:00000000 001147a0 0 MTA
52 24 29c 06c06738 200b220 Enabled 00000000:00000000 001147a0 0 MTA
53 25 1098 06c07198 200b220 Enabled 00000000:00000000 001147a0 0 MTA
54 26 13cc 06c07bf8 200b220 Enabled 00000000:00000000 001147a0 0 MTA
55 27 12f0 06c08658 200b220 Enabled 00000000:00000000 001147a0 0 MTA
56 28 ce4 06c090b8 200b220 Enabled 00000000:00000000 001147a0 0 MTA
57 29 e60 06c09b00 200b220 Enabled 00000000:00000000 001147a0 0 MTA
58 2a d98 06c0a4e0 200b220 Enabled 00000000:00000000 001147a0 0 MTA
6 15 1060 090a6980 220 Enabled 00000000:00000000 000db760 0 Ukn
4 2b 590 090ddae8 220 Enabled 00000000:00000000 000db760 0 Ukn
3 13 cd4 1712e4f8 220 Enabled 00000000:00000000 000db760 0 Ukn
XXXX 2c 0 17133d50 1801820 Enabled 00000000:00000000 000db760 0 Ukn (Threadpool Worker)
XXXX 2d 0 17104970 1801820 Enabled 00000000:00000000 000db760 0 Ukn (Threadpool Worker)
XXXX 2e 0 090ac858 1801820 Enabled 00000000:00000000 000db760 0 Ukn (Threadpool Worker)
69 2f 99c 090ad248 180b220 Enabled 00000000:00000000 000db760 0 MTA (Threadpool Worker)
XXXX 30 0 090b5dd0 1801820 Enabled 00000000:00000000 000db760 0 Ukn (Threadpool Worker)
70 31 fcc 090b70e8 180b220 Enabled 00000000:00000000 000db760 0 MTA (Threadpool Worker)
XXXX 32 0 090b83e0 1801820 Enabled 00000000:00000000 000db760 0 Ukn (Threadpool Worker)
XXXX 33 0 090bb380 1801820 Enabled 00000000:00000000 000db760 0 Ukn (Threadpool Worker)
75 35 1134 1718fb70 200b220 Enabled 00000000:00000000 001147a0 1 MTA

There was an immediate suspect – thread #28 appeared to have a nested SqlException in it… So I switched to the thread by executing the command “~28s” and now I could get a callstack on the thread by using “!clrstack” with the following output:


…074fe5d4 0fc5c6f5 dgiCore.Log.ExceptionLogPublisher.Publish(System.Exception, System.Collections.Specialized.NameValueCollection, System.Collections.Specialized.NameValueCollection) 
[snip]
074fe920 0fc5c6f5 dgiCore.Log.ExceptionLogPublisher.Publish(System.Exception, System.Collections.Specialized.NameValueCollection, System.Collections.Specialized.NameValueCollection)
[snip]
074fec6c 0fc5c6f5 dgiCore.Log.ExceptionLogPublisher.Publish(System.Exception, System.Collections.Specialized.NameValueCollection, System.Collections.Specialized.NameValueCollection)
[snip]
074fee20 0f97d9e2 DGI.Web.layouts.DGI.Loginbox.btnLogin_Click(System.Object, System.EventArgs)
[snip]
074ff09c 01e1adf5 ASP.layouts_main_aspx.ProcessRequest(System.Web.HttpContext)
[snip]
074ff180 66083e1c System.Web.HttpRuntime.ProcessRequestInternal(System.Web.HttpWorkerRequest)
074ff1b4 66083ac3 System.Web.HttpRuntime.ProcessRequestNoDemand(System.Web.HttpWorkerRequest)
074ff1c4 66082c5c System.Web.Hosting.ISAPIRuntime.ProcessRequest(IntPtr, Int32)
074ff3d8 79f68cde [ContextTransitionFrame: 074ff3d8]
074ff40c 79f68cde [GCFrame: 074ff40c]
074ff568 79f68cde [ComMethodFrame: 074ff568]

I was quite surprised to find out that a click on the Submit-button on our login-page proved to be the cause of an infinite loop… What parameters was the methods being called with? Enter the command “!clrstack –p” and somewhere along the massive output of nested exceptions I found the call invoked by the user clicking the “Login” button:


074fed68 0d6a755b dgiCore.Security.Authenticate(System.String, System.String, System.String, dgiCore.Log.Exceptions.Fejl ByRef) 
PARAMETERS:
username =
password =
applicationName =
fejl = 0x074fee24

Empty username and password? Normally there would be a reference to an address in the memory… Just for the fun of it I tried to issue a login from the website under suspicion with an empty username and password and what do you know? The website hung until the IIS decided to recycle the application pool and come back to life with a new process ID... If you can imagine my surprise sitting late in the evening and just watching every little piece of the puzzle fall into place watching the IIS on the server choking harder and harder before it's inevitable death and immediate resurrection. It was in that exact splitsecond I fell in deep, unconditional love with WinDbg.


I won’t go into detail about the error itself and how it was fixed (I won’t say too much but part of it might include a bit of Javascript form validation) but I have had a major success using WinDbg (and Google) to track down and fix an error which by default only revealed itself through an entry in the server’s Event log. WinDbg is indeed a very useful tool in a developer’s toolbox and now I’ve got some experience with it I can only imagine how many hours I’ve wasted earlier on trying to come up with grand unified theories about the behaviour of a piece of code where I would now just attach a debugger / create a dump and check for myself what’s actually going on. It took me four to five days effectively to get to know the mindset and commands neccessary to make it useful but boy... They will come back a hundred times over the next three decades of programming. Try it out for yourself  :o)

onsdag den 7. oktober 2009

Stop whining!

This is a little out of the ordinary but what the heck... There has to be room for a little fun from time to time

You know Arnold Schwarzenegger? Of course you do... He has become famous for quotes like "I'll be back", "Aaaaargh" and various other sentences which primarily excels in containing words exceeding no more than two syllables. There's one quote in particular I like (and so does my fellow coworkers). It is "Stop whining" from Kindergarten Cop - when spoken McBain style nobody at our team can continue whining over bad code, cold coffee, stupid typos etc but inevitably start smiling because whining is such an easy trap to fall into when the right thing to do is just to get the job done and get going.

How does this relate to coding? Well, tonight I wondered how to play media files in C#... I never actually tried coding audio stuff in C# so I figured that I wanted to find a WAV file containing the infamous Arnold shouting "Stop whining"... And so I did and it turned out to be a very little deal getting it out. It was actually two lines of code:


var myPlayer = new System.Media.SoundPlayer("myfile.wav");
myPlayer.Play();


Then I figured that I wanted to play it using a keyboard stroke - to make some kind of Arnold soundboard just using keyboard shortcuts - how would that be possible? Capturing keyboard strokes isn't hard at all - if your program is in focus but I wanted to have some keyboard hook which would catch my keyboard strokes globally... I knew it had to be done invoking unmanaged code which isn't something I do on an everyday basis... so instead of doing all the WinAPI stuff myself (yes, I am a lazy bastard) I dug up this article on CodeProject which I could make use of. It basicly encapsulates the WinAPI goodies required to globally listen to whatever the user is doing with his keyboard. The hardest part was actually getting raw WAV-files with Arnold quotes - I managed to find some but it required me to save them in a different format as the .NET Sound API only allows PCM files. Enter Audacity, a load and a save later on each file and I was good to go. I needed to research a little getting the Windows Form to hide itself as a tray icon (yet another thing I've never tried before) but it was also a nobrainer.



So as of tonight I have whipped up a small application which can store itself in the tray and listen to keyboard events. If you press down S, T, O and P down, Arnold will in his informal voice speak out "Stop whining!". Try the following for yourself: LACK and COP... The implementation is here (I won't cover the GlobalKeyboardHook as it is covered in the CodeProject article):



    public partial class ArnoldForm : Form
{
private GlobalKeyboardHook _gkh;
private List _pressedKeys = new List();
public ArnoldForm()
{
InitializeComponent();

_gkh = new GlobalKeyboardHook();
_gkh.HookedKeys.Add(Keys.S);
_gkh.HookedKeys.Add(Keys.T);
_gkh.HookedKeys.Add(Keys.O);
_gkh.HookedKeys.Add(Keys.P);

_gkh.HookedKeys.Add(Keys.C);
_gkh.HookedKeys.Add(Keys.P);

_gkh.HookedKeys.Add(Keys.L);
_gkh.HookedKeys.Add(Keys.A);
_gkh.HookedKeys.Add(Keys.K);



_gkh.KeyDown += gkh_KeyDown;
_gkh.KeyUp += gkh_KeyUp;
}

private void gkh_KeyUp(object sender, KeyEventArgs e)
{
if (_pressedKeys.Contains(e.KeyCode))
_pressedKeys.Remove(e.KeyCode);
}

private void gkh_KeyDown(object sender, KeyEventArgs e)
{
_pressedKeys.Add(e.KeyCode);
var arnold = new Arnold();
arnold.Speak(_pressedKeys);
}

private void Form1_Resize(object sender, EventArgs e)
{
if (FormWindowState.Minimized == WindowState)
Hide();
}

private void notifyIcon1_DoubleClick(object sender, EventArgs e)
{
Show();
WindowState = FormWindowState.Normal;
}
}

public class Arnold
{
public void Speak(List keys)
{
string voiceFile = GetVoiceFile(keys);

if (!string.IsNullOrEmpty(voiceFile))
{
var asm = Assembly.GetExecutingAssembly();
var file = asm.GetManifestResourceStream(voiceFile);

var myPlayer = new System.Media.SoundPlayer(file);
myPlayer.Play();
}
}

private static string GetVoiceFile(ICollection keys)
{

if (keys.Contains(Keys.S) &&
keys.Contains(Keys.T) &&
keys.Contains(Keys.O) &&
keys.Contains(Keys.P))
{
return "ArnoldQuotesForm.media.stop_whining.wav";
}

if (keys.Contains(Keys.C) &&
keys.Contains(Keys.O) &&
keys.Contains(Keys.P))
{
return "ArnoldQuotesForm.media.cop.wav";
}

if (keys.Contains(Keys.L) &&
keys.Contains(Keys.A) &&
keys.Contains(Keys.C) &&
keys.Contains(Keys.K))
{
return "ArnoldQuotesForm.media.lack_discipline.wav";
}

return null;
}
}


Download the entire project here


It's not exactly rocket science but I had a few hours of fun building the thing. Comments are welcome - until next time...  :o)