Thursday, March 31, 2005
What is an error?
It occurred to me that a casual reader who stumbles on this site might have no idea what I am referring to, especially if s/he is not a programmer or is not conversant with the latest programming languages, so I will digress.
What is an error? What are we talking about with error handling? What is an exception? For someone coming out of a C/C++ or similar background (ok, there's a bit of an assumption that the reader at least knows that I am referring to a programming language) it is easy for them to visualize an error and what that means but many people have never dealt with an exception.
The basic problem is very simple: what happens when a program has a problem? For example, let's say you're writing a very sophisticated program that asks the user to perform a difficult chore, like "Enter a number from 1 to 10", and the user enters "43c65". What should that program do? How does the program detect it and deal with it? In other words, when should a program go "ouch" or "oops"? What should the user see? What should the program do next? What other audiences are there for the "oops" report? In some ways a program is like a little child - it has trouble telling the adults (users) what's wrong with it.
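To make the "Enter a number from 1 to 10" example concrete, here is a minimal sketch in Python. The function name parse_choice and the wording of the messages are my own inventions for illustration; the point is only that the code has to notice two different kinds of bad input (not a number at all, and a number outside the range).

```python
def parse_choice(text, low=1, high=10):
    """Return the whole number typed by the user, or raise ValueError.

    parse_choice is a hypothetical helper, not part of any library.
    """
    try:
        value = int(text.strip())
    except ValueError:
        # The input is not a number at all, e.g. "43c65".
        raise ValueError(f"{text!r} is not a whole number") from None
    if not low <= value <= high:
        # The input is a number, but not one we asked for.
        raise ValueError(f"{value} is outside the range {low} to {high}")
    return value
```

Calling parse_choice("43c65") raises a ValueError that says what was wrong with the input. What the program does with that error is a separate question.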
There are a lot of aspects to this, so I'll list some of them from the user's perspective:
1. Is there a problem?
2. How does the program determine there is a problem?
3. What kind of problem is it (major vs. minor, or fatal vs. non-fatal)?
4. What information should be reported?
5. How should the information be reported or presented?
6. What should the program do next?
There are probably a bunch of other things to deal with that I left out but I believe this hits all the high points. The important thing is that the issues are not related to a particular programming language or engineering methodology, operating system, application, or hardware platform. All software applications must deal with these problems. There is a lot of complexity in dealing with any single item on the list, and the complexity increases as these different aspects interact. In other words, the outcome of any particular step may be influenced by the nature or outcome of a different step, and the sheer complexity of a system makes it extremely difficult to devise a set of rules that will always make sense to follow.
These items all contribute to the robustness of a program. I am making an implicit assumption that a program should be robust and fault tolerant and that these are good things. How robust it needs to be is determined by how critical it is - is it controlling a space shuttle or your favorite game? The downside risk is often the determining factor.
#1 deals with the correctness of the result, the Garbage In - Garbage Out factor. If the program cannot detect that a problem exists then the program is so fatally flawed that there's little point in discussing what happens next. Some piece of code somewhere in the system must be capable of correctly determining that something is wrong; the user input data may be bad, a piece of hardware may have failed, an operation may have taken too long to complete, etc. There is an almost infinite list of things that can go "bump" in the computer's night.
#2 hints that there are relationships between software layers - the chunk of code that detects a problem and the chunk that must deal with it. This is concerned with the mechanisms of how one layer informs another layer that a problem exists.
This implies a separation between the layer that determines there is a problem and the layer that somehow "knows" what to do with the problem. For a lot of very good reasons we want to separate these layers - the evils of monolithic software layers are well known - but the separation does by itself increase the complexity (and therefore the fragility) of the system.
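Here is a small sketch of that separation, using Python exceptions as the mechanism for one layer to inform another. All the names (ConfigError, read_port, start_server) and the fallback port are made up for illustration; other languages would use return codes or some other signal.

```python
class ConfigError(Exception):
    """Raised by the detecting layer. It reports; it does not decide."""


def read_port(settings):
    # Detection layer: knows how to spot a bad value, not what to do about it.
    raw = settings.get("port")
    if raw is None or not str(raw).isdecimal():
        raise ConfigError(f"port setting is missing or not numeric: {raw!r}")
    return int(raw)


def start_server(settings, default_port=8080):
    # Policy layer: knows that, in this program, a bad port is survivable.
    try:
        return read_port(settings)
    except ConfigError as err:
        print(f"warning: {err}; falling back to port {default_port}")
        return default_port
```

The low-level function never prints anything or picks a default. A different caller could catch the same ConfigError and decide to stop the program instead.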
#3 deals with interpreting the nature of the problem. In many cases a problem may be easily corrected ("enter the real password you ninny"), and in others it's give up and go home time (the hard drive caught fire/the dog ate my homework). This is where we have to start dealing with context - what was the program doing when the failure occurred? Can it be corrected? How does it get corrected? Does it require manual intervention, or can it be automated? Should it be retried, and when should it retry it?
This is related to #6 but I listed that as a separate item because there are a lot of side effects to that question.
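One way to sketch the "can it be retried?" question is to have the detecting code classify the failure, and let a helper retry only the kind that might go away by itself. The class names TransientError and FatalError and the helper run_with_retry are assumptions of mine, not a standard API.

```python
import time


class TransientError(Exception):
    """A problem that may go away by itself, e.g. a busy server."""


class FatalError(Exception):
    """A problem that retrying will not fix, e.g. the hard drive caught fire."""


def run_with_retry(operation, attempts=3, delay=0.0):
    """Call operation(), retrying only when it raises TransientError."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == attempts:
                raise  # Out of attempts: let the caller decide what is next.
            time.sleep(delay)
    # A FatalError (or anything else) is not caught here and propagates at once.
```

Even this tiny helper has to answer "how many times?" and "how long to wait?", and the right answers depend on what the program was doing at the time.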
#4 and #5 deal with the audience that is targeted when the error occurs; in other words, who will be peering into the forensic evidence of this error? Will it be a DU (Dumb-User), a PU (Power user), an administrator, tech support, a developer, another machine, etc.? The reporting of this must also take into account globalization and localization issues, security issues, and policy issues. For example, it does no good to display an error message in Chinese (or English) if the person (assuming it is a person) that sees the error message does not read Chinese (or English). It would also be a bad thing (for some degree of "bad") if the error message somehow wound its way into the hands of a hacker and the message contained your old password in cleartext.
The presentation of the error message also helps determine its contents - if it is displayed to the end user you probably don't want to include the contents of a network packet but if it is going to an automated logging system (e.g. NT EventLog) perhaps you do.
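As a rough sketch of "same error, different audiences", the function below builds one short message for the end user and one detailed entry for a log, and scrubs anything sensitive before it goes anywhere. The function name, the SENSITIVE_KEYS set, and the message wording are all hypothetical.

```python
# Hypothetical list of fields that must never appear in any report.
SENSITIVE_KEYS = {"password", "token"}


def describe_failure(action, detail):
    """Return (user_message, log_entry) for one failed action."""
    safe = {
        key: ("<redacted>" if key in SENSITIVE_KEYS else value)
        for key, value in detail.items()
    }
    # The end user gets the short version with no internals.
    user_message = f"Could not {action}. Please try again or contact support."
    # The log gets the details, minus anything sensitive.
    log_entry = f"failed to {action}: {safe}"
    return user_message, log_entry
```

A real system would also have to pick the user's language for user_message, which this sketch ignores.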
I also believe that simply displaying a low-level error message (e.g. "System error 42 - please consult your handbook.") is a waste of everyone's time (and I put "Null reference exception occurred" in that same category). Context is king - what was the program doing when the problem occurred? Quite often simply knowing what it was trying to do will suggest a corrective course of action.
Ideally and if at all possible, a corrective course of action should be made known to the end user.
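A minimal sketch of adding context: catch the low-level error at the point where the program knows what it was trying to do, and raise a new error that says so and suggests a fix, while keeping the original attached. ReportError and load_report are invented names, and the "monthly report" is just a stand-in for whatever the program was working on.

```python
class ReportError(Exception):
    """Says what the program was doing, not just what broke underneath."""


def load_report(path):
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except OSError as err:
        # "from err" keeps the low-level error attached for developers,
        # while the message itself gives context and a corrective action.
        raise ReportError(
            f"Could not load the monthly report from {path!r}. "
            "Check that the file exists and that you have permission to read it."
        ) from err
```

The low-level detail is not thrown away; it is still available on the new error's __cause__ for whoever ends up debugging it.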
#5 also concerns itself with presentation and recording issues. Should the error somehow be preserved in a permanent record? Displayed to a user? Where should the report go? How well does the reporting system scale? Does it need to handle one error per millennium or 1000 errors per second?
Each system will have different reporting requirements - the needs of a backend server are vastly different than the needs of an application displaying a bar chart.
#6 is the end result of the above. The ideal response is for the system to be able to fix itself - after all, as users we are concerned about getting the dang thing to work, not about getting a bazillion messages popping up telling us about every little thing that went wrong. If the system cannot fix itself then it needs to know what it ought to do next - should it support a "give me good data and I'll try again" mode? Should it abort and go back to waiting to be told what to do next? Should it give up and terminate the program? How does it "know" which of these it ought to do? Can it determine when one option is valid or invalid? And what defaults should it use?
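One simple (and certainly not the only) way to encode "what should it do next?" is a table that maps kinds of errors to actions, with an explicit default for everything the table does not mention. The Action names and the particular mapping below are assumptions for illustration; a real program would choose its own.

```python
from enum import Enum


class Action(Enum):
    ASK_AGAIN = "give me good data and I'll try again"
    ABORT_OPERATION = "abort and wait to be told what to do next"
    TERMINATE = "give up and terminate the program"


# Hypothetical policy table; the mapping is a judgment call per application.
POLICY = {
    ValueError: Action.ASK_AGAIN,
    TimeoutError: Action.ABORT_OPERATION,
    MemoryError: Action.TERMINATE,
}


def choose_action(err, default=Action.ABORT_OPERATION):
    """Pick the next step for err, falling back to a default."""
    for exc_type, action in POLICY.items():
        if isinstance(err, exc_type):
            return action
    return default
```

The table makes the policy visible in one place, but it does not answer the hard part: deciding what belongs in the table and what the default should be.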
There is no single answer to any of these questions. I hate to be situational, but "it depends" is often the correct answer. Life is hard and so is writing good software.
There's an item #7 even though it's not listed - an Unexpected Error (UE)! I didn't list it because it so perfectly captures the item itself - no one expects the UE (or the Spanish Inquisition). When things go horribly wrong, when you didn't see it coming...what do you do?
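For the error nobody expected, about the best a program can do is have a last line of defense that catches it and records as much as it can. The sketch below is one hedged way to do that in Python; run_guarded is a made-up name, and a real program would send the report to a log rather than just return it.

```python
import traceback


def run_guarded(task):
    """Run task() and return (True, result) or (False, report)."""
    try:
        return True, task()
    except Exception as err:
        # We did not see this coming, so capture everything we can.
        name = getattr(task, "__name__", repr(task))
        report = (
            f"Unexpected {type(err).__name__} while running {name}:\n"
            f"{traceback.format_exc()}"
        )
        return False, report
```

This does not tell the program what to do next; it only makes sure the surprise is not lost.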
...to be continued...