Thursday, March 31, 2005

 

What is an error?

It occurred to me that a casual reader who stumbles on this site would have no idea what I am referring to, especially if s/he is not a programmer or someone conversant with the latest programming languages, so I will digress.


What is an error? What are we talking about with error handling? What is an exception? For someone coming out of a C/C++ or similar background (ok, there's a bit of an assumption that the reader at least knows that I am referring to a programming language) it is easy for them to visualize an error and what that means but many people have never dealt with an exception.


The basic problem is very simple...what happens when a program has a problem? For example, let's say you're writing a very sophisticated program that asks the user to perform a difficult chore, like "Enter a number from 1 to 10" and the user enters "43c65"....what should that program do? How does the program detect it and deal with it? In other words, when should a program go "ouch" or "oops"? What should the user see? What should it do? What other audiences are there for the "oops" report? In some ways a program is like a little child - it has trouble telling the adults (users) what's wrong with it.
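To make the "Enter a number from 1 to 10" example concrete, here is a minimal sketch in Python (chosen purely for illustration; nothing here is tied to a language). The function names parse_choice and describe are made up for this example, and the wording of the messages is just one possibility.

```python
def parse_choice(text, low=1, high=10):
    """Return the number the user typed, or raise ValueError saying why not."""
    stripped = text.strip()
    try:
        value = int(stripped)
    except ValueError:
        # Detection: the input is not a whole number at all.
        raise ValueError(f"'{stripped}' is not a whole number") from None
    if not low <= value <= high:
        # Detection: it is a number, but outside the range we asked for.
        raise ValueError(f"{value} is not between {low} and {high}")
    return value


def describe(text):
    """Turn the outcome into something a user could read."""
    try:
        return f"ok: {parse_choice(text)}"
    except ValueError as problem:
        return f"oops: {problem}"
```

Calling describe("43c65") goes down the "oops" path, while describe("7") does not. Even in something this small, detecting the problem and deciding what the user sees are two separate jobs.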

There are a lot of aspects to this, so I'll list some of them from the user's perspective...

  1. Is there a problem?
  2. How does the program determine there is a problem?
  3. What kind of problem is it (major vs. minor, or fatal vs. non-fatal)?
  4. What information should be reported?
  5. How should the information be reported or presented?
  6. What should the program do next?


There are probably a bunch of other things to deal with that I left out but I believe this hits all the high points. The important thing is that the issues are not related to a particular programming language or engineering methodology, operating system, application, or hardware platform. All software applications must deal with these problems. There is a lot of complexity in dealing with any single item on the list, and the complexity increases as these different aspects interact. In other words, the outcome of any particular step may be influenced by the nature or outcome of a different step, and the sheer complexity of a system makes it extremely difficult to devise a set of rules that will always make sense to follow.


These items all contribute to the robustness of a program. I am making an implicit assumption that a program should be robust and fault tolerant and that these are good things. How robust it needs to be is determined by how critical it is - is it controlling a space shuttle or your favorite game? The downside risk is often the determining factor.


#1 deals with the correctness of the result, the Garbage In - Garbage Out factor. If the program cannot detect that a problem exists then the program is so fatally flawed that there's little point in discussing what happens next. Some piece of code somewhere in the system must be capable of correctly determining that something is wrong; the user input data may be bad, a piece of hardware may have failed, an operation may have taken too long to complete, etc. There is an almost infinite list of things that can go "bump" in the computer's night.


#2 hints that there are relationships between software layers - the chunk of code that detects a problem and the chunk that must deal with it. This is concerned with the mechanisms of how one layer informs another layer that a problem exists.


This implies a separation between the layer that determines there is a problem and the layer that somehow "knows" what to do with the problem. For a lot of very good reasons we want to separate these layers - the evils of monolithic software layers are well known - but the separation does by itself increase the complexity (and therefore the fragility) of the system.
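A small sketch of that separation, again in Python for illustration only. StorageError, load_setting, window_width, and the default of 800 are all hypothetical names and values invented for this example.

```python
class StorageError(Exception):
    """Hypothetical error type raised by the low-level layer."""


def load_setting(store, key):
    # Low-level layer: it can tell that something is wrong,
    # but it has no idea what the application should do about it.
    if key not in store:
        raise StorageError(f"setting '{key}' is missing")
    return store[key]


def window_width(store):
    # Higher layer: it knows a missing width is harmless and picks a default.
    try:
        return int(load_setting(store, "width"))
    except StorageError:
        return 800
```

The low-level function only reports the problem; the caller supplies the policy. A different caller could treat the very same StorageError as fatal, which is exactly why the two layers are kept apart.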


#3 deals with interpreting the nature of the problem. In many cases a problem may be easily corrected ("enter the real password you ninny"), and in others it's give up and go home time (the hard drive caught fire/the dog ate my homework). This is where we have to start dealing with context - what was the program doing when the failure occurred? Can it be corrected? How does it get corrected? Does it require manual intervention, or can it be automated? Should it be retried, and when should it retry it?


This is related to #6 but I listed that as a separate item because there are a lot of side effects to that question.
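One way to picture the "can it be corrected, and should it be retried?" question is to classify failures up front. This is only a sketch under that assumption: the TransientError and FatalError types and the run_with_retry helper are hypothetical, and real systems rarely get such a clean split.

```python
import time


class TransientError(Exception):
    """Hypothetical: a problem that may go away if we try again."""


class FatalError(Exception):
    """Hypothetical: a problem that retrying cannot fix."""


def run_with_retry(operation, attempts=3, delay_seconds=0.0):
    # Retry only the failures classified as transient; anything else
    # (including FatalError) propagates to the caller on the first failure.
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise  # out of retries: let the caller decide what is next
            time.sleep(delay_seconds)
```

The hard part is not the loop; it is deciding which bucket a given failure belongs in, and that depends on what the program was doing at the time.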


#4 and #5 deal with the audience that is targeted when the error occurs; in other words, who will be peering into the forensic evidence of this error? Will it be a DU (Dumb-User), a PU (Power user), an administrator, tech support, a developer, another machine, etc.? The reporting of this must also take into account globalization and localization issues, security issues, and policy issues. For example, it does no good to display an error message in Chinese (or English) if the person (assuming it is a person) that sees the error message does not read Chinese (or English). It would also be a bad thing (for some degree of "bad") if the error message somehow wound its way into the hands of a hacker and the message contained your old password in cleartext.

The presentation of the error message also helps determine its contents - if it is displayed to the end user you probably don't want to include the contents of a network packet but if it is going to an automated logging system (e.g. NT EventLog) perhaps you do.
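Here is a rough sketch of one error being reported to two audiences. Everything in it (the LoginFailure type, its fields, the message wording) is invented for illustration, and it is written in Python only for brevity.

```python
class LoginFailure(Exception):
    """Hypothetical error carrying both public and diagnostic details."""

    def __init__(self, user, reason, packet):
        super().__init__(reason)
        self.user = user
        self.reason = reason
        self.packet = packet  # raw bytes, useful only to support staff


def message_for_user(error):
    # The end user gets what happened and what to do, nothing more.
    return f"Could not sign in as {error.user}. Check your password and try again."


def record_for_log(error):
    # The log gets the diagnostic detail, but the password is never stored
    # on the error object, so it cannot leak from here.
    return (f"login failed user={error.user} "
            f"reason={error.reason} packet={error.packet.hex()}")
```

The same failure produces a short, actionable line for the person at the keyboard and a denser record for whoever reads the log later. Neither one contains the password.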


I also believe that simply displaying a low-level error message (e.g. "System error 42 - please consult your handbook.") is a waste of everyone's time (and I put "Null reference exception occurred" in that same category). Context is king - what was the program doing when the problem occurred? Quite often simply knowing what it was trying to do will suggest a corrective course of action.
Ideally and if at all possible, a corrective course of action should be made known to the end user.
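As a sketch of "context is king", the snippet below wraps a low-level failure in an error that says what the program was trying to do and what the user might do about it. It is Python, used only for illustration; ReportError and average_for_report are hypothetical names.

```python
class ReportError(Exception):
    """Hypothetical error that says what the program was trying to do."""


def average_for_report(name, numbers):
    try:
        return sum(numbers) / len(numbers)
    except ZeroDivisionError as low_level:
        # Keep the low-level error as the cause for whoever debugs this,
        # but lead with the context and a suggested corrective action.
        raise ReportError(
            f"Could not compute the average for '{name}' because it contains "
            "no numbers. Add at least one value and run the report again."
        ) from low_level
```

A bare "division by zero" tells the reader almost nothing. The wrapped message names the report, the cause, and a fix, and the original error is still attached for the developer.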


#5 also concerns itself with presentation and recording issues. Should the error somehow be preserved in a permanent record? Displayed to a user? Where should the report go? How well does the reporting system scale? Does it need to handle one error per millennium or 1000 errors per second?

Each system will have different reporting requirements - the needs of a backend server are vastly different than the needs of an application displaying a bar chart.


#6 is the end result of the above. The ideal response is for the system to be able to fix itself - after all, as users we are concerned about getting the dang thing to work, not in getting a bazillion messages popping up telling us about every little thing that went wrong. If the system cannot fix itself then it needs to know what it ought to do next - should it support a "give me good data and I'll try again" mode? Should it abort and go back to waiting to be told what to do next? Should it give up and terminate the program? How does it "know" which of these it ought to do? Can it determine when one option is valid or invalid? And what defaults should it use?


There is no single answer to any of these questions. I hate to be situational but there really isn't a single right answer - "it depends" is often the correct answer. Life is hard and so is writing good software.

There's an item #7 even though it's not listed - an Unexpected Error (UE)! I didn't list it because it so perfectly captures the item itself - no one expects the UE (or the Spanish Inquisition). When things go horribly wrong, when you didn't see it coming...what do you do?
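One common (and by no means complete) response to the UE is a last-chance handler at the top of the program. The sketch below assumes a hypothetical run_guarded wrapper and a caller-supplied report function; it is Python for illustration only.

```python
import traceback


def run_guarded(main, report):
    """Run main(); if anything unforeseen escapes, report it and return 1."""
    try:
        main()
        return 0
    except Exception as unexpected:
        # Last line of defense: we did not plan for this, so record everything
        # we can (type, message, stack trace) before giving up.
        details = "".join(traceback.format_exception(
            type(unexpected), unexpected, unexpected.__traceback__))
        report(f"unexpected error: {details}")
        return 1
```

This does not answer what the program should do next. It only makes sure the evidence is not lost when nobody saw the failure coming.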

...to be continued...


Wednesday, March 23, 2005

 

Let's talk Exceptions, part 2

Before I start talking about exceptions I want to first talk about what I want to talk about...in other words, what's the plan?

The Topics
1. The mechanism. How exceptions are delivered to the user-mode application.
2. The basics of the try-catch-finally-fault syntax (C# specific)
3. The basics of throwing (when and how).
4. The basics of catching (when and how).
5. The basics of finally blocks.
6. The basics of finalizers.
7. The basics of the Dispose pattern.
8. The basics of unhandled exceptions.
9. Patterns to use when doing a...
   a. Throw
   b. Catch
   c. Finally
10. Problems and issues, some specific to v1.1, some that are non-specific.
11. Problems with Dispose and finalize.

And if that's not enough there'll probably be some digression into related areas. And when I'm done with the basics I plan to talk about some advanced stuff....lots of it.

In fact, here's the first digression...most of the help text and published articles I've read cover the basics fairly well (just not all in one place), but when they apply the basics to real world applications the wheels start to come off...the typical bromides that fall out of the basics are fine when applied to trivial applications, but they don't scale well to large, complex, distributed, organic applications. I have yet to see a decent discussion on how well exceptions work over time as a system evolves and mutates. I have not seen evidence that there is much active research on this particular problem.

That's really where all this is heading. I am a practical sort. I enjoy the theory of this, that, and the other, and the discussions that go along with it, but in the end I like to apply what I know. I like to make things work, not just talk about them. And my current interest is in making large systems work, and that includes how they evolve.

Enuf for now.

Monday, March 21, 2005

 

Exception talk 1

Let's talk just about exceptions...

Except that it really isn't possible to talk JUST about exceptions; you also need to talk about try, catch and finally (and fault) blocks, about finalization and Dispose patterns, about threads and aborts and system shutdown and appdomain unloading, system reliability and integrity, and all kinds of other stuff. It starts off simple and gets horribly complex, and I get reminded all over again that the software ankle is connected to the kneebone is connected to the hipbone is connected to...

This gets deep and hard fast, but that does not mean that it is impossible to talk about a few simple things, just that for every rule or best practice that can be devised there are exceptions :-) to that rule.

So I'll start off with the basics; just the facts Jack.

Fact 1: The exception mechanism used in .NET is built on top of Win32 SEH, the same foundation that the Visual C++ exception model rides on. See Matt Pietrek's most excellent article on how Win32 SEH is implemented, available here
http://www.microsoft.com/msj/0197/Exception/Exception.aspx
and the .NET story, as told by the most excellent Chris Brumme, available here
http://blogs.msdn.com/cbrumme/archive/2003/10/01/51524.aspx



Fact 2: Even though it is relatively easy to understand the actual mechanism used by the exception handling framework to deliver exceptions to code, it is very very difficult to come up with workable guidelines on how to design class libraries, components, modules, applications, and systems to use them effectively. It's not that guidelines don't exist, it's that they are either misleading, do not cover all the cases, or provide simple homilies and bromides that are little more than admonishments to eat all your vegetables (I put the famous guide "there should be 10 times as many finally blocks as there are catches" in that category).

Fact 3: Even something that is relatively easy (the mechanism) is actually full of very complicated details. Most of these we don't need to worry about (such as the exception tables that contain the info needed to determine where in a method the catch block lies), but there are others that do affect user code, such as the translation process that occurs when an exception transitions from managed to unmanaged code (and back again). Any place where information can get lost or translated, where one layer of code rubs against another, is a friction point, and it can cause problems; the world of interop is full of these points. I consider these to be moving parts, similar to those in a mechanical device, and each one represents a point of failure.

So....I plan to start writing about this stuff. I do not work for Microsoft and I have no special knowledge of the CLR, certainly far less than others have. I also do not have a great deal of time to spend on this, and I doubt that anyone will actually read this stuff, but hey, it's 4AM and I don't feel like writing code right now. But I have read a lot on this subject, thought about it a fair bit, and have even conducted some training sessions for my company on this subject.
