When Bad Things Happen to Good Computers

Why doesn't Apple just FIX THIS THING!!!
(or "When Bad Things Happen to Good Computers.")

 

by Tracy Valleau


One of the most frustrating things for computer users is the installation of a new piece of software... that doesn't work, or worse yet crashes the computer.

The swear words are often enough to embarrass the most savvy sailor.

These are usually followed by the posting of letters maligning the software, the publisher, the programmers, the computer manufacturer, and any other nearby and convenient target.

"Why doesn't (name) just fix this?" is the usual plaintive whine.

Well, here's why: because "just" doesn't apply.

There are only a few things that can cause software to misbehave: 1) hardware failure; 2) programmer's error; 3) software conflicts; 4) input or user error.

Hey: if your hardware is broken, well - fix it. There's nothing the software is going to do to repair your broken hard drive.

However, in this category, there is one insidious hardware/software problem that must be mentioned: the file tracking done on your hard drive with drivers.

All your software is stored on your hard drive, just as a collection of short stories is stored in a book.

To find a given story, you consult the table of contents, and it directs you to the proper location (page) in the book. You start reading, and when you finish one page, you assume that turning the page will take you to the next words in the story.

But suppose that there was a printing error, and you're 12 pages into "War and Peace" and turn to page 13, only to discover that page 13 contains the 33rd page of "Skippy the Bunny!"

In essence, this is what happens when the computer hard drive's directory gets corrupted.

In fact, it's much more complex than that, since, instead of each page following one on the other, a computer has to return to the table of contents (so to speak) to find out where the next page is, just as if you ripped out all the pages of "War and Peace," threw them up in the air and then gathered them back together in totally random order. (This happens on a drive because you save and delete things of different sizes, leaving bits of a file scattered in different places.)

If that "trail" of page after page gets even one page out of order, then the program crashes because it loaded in the wrong thing, and when it goes to work on it or to use it, what it expects to be there is something entirely different.

If you put on a parachute, boarded a seaplane and jumped out, you'd be in deep trouble if you did so before it took off, because instead of the air you expected to be falling through, you'd be in over your head in water.

Same thing with a computer program: if it expects to find one thing, but finds another because the directory was wrong, and it loaded in something else - you get "the big thud."
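To make the book analogy concrete, here is a minimal sketch of a toy "drive" in Python. It is not how any real file system is laid out; the block numbers, the `directory` and `next_block` tables, and the `read_file` helper are all invented for illustration. Each block records where the next "page" lives, and a single bad link is enough to hand the program the wrong story.

```python
# Toy model only: a directory maps a file name to its first block, and a
# table maps each block to the next one (None marks the end of the file).
# Every name and number here is hypothetical.

blocks = {
    0: "W&P page 12",
    7: "W&P page 13",
    3: "W&P page 14",
    5: "Skippy page 33",
}
directory = {"war_and_peace": 0}
next_block = {0: 7, 7: 3, 3: None, 5: None}


def read_file(name, directory, next_block, blocks):
    """Follow the chain of blocks for one file and return its pages in order."""
    pages = []
    block = directory[name]
    while block is not None:
        pages.append(blocks[block])
        block = next_block[block]
    return pages


# With an intact table, the scattered pages come back in the right order.
good = read_file("war_and_peace", directory, next_block, blocks)

# Corrupt one link: block 0 now points at a block belonging to another file.
corrupted = dict(next_block)
corrupted[0] = 5
bad = read_file("war_and_peace", directory, corrupted, blocks)

print(good)
print(bad)
```

Note that the corrupted read does not fail on its own. It quietly returns the wrong page, and the trouble only starts when the program tries to use it.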

How do you prevent directory corruption? Well, if you crash, or notice anything suspicious, take the time to run a directory repair program. The longer a corrupted directory is used, the more corrupted it becomes.

Other hardware problems, including loose cables, failing power supplies, bad RAM, drives on their last legs and so on, can all show up as trouble in your software. (Power supplies and hard drives both have a life expectancy of about 5 years, give or take.)

Programmer's errors not only do occur, but it is an axiom in the trade that there is no such thing as bug-free software. Why? Simply because you'd have to check every single line of code for every single possible contingency to cover all your bases. That would produce code so bloated that no machine could run it; so expensive that no publisher could afford it; and so impossible on the face of it, that it could never be done (since it would require that once the software was written, all change in the universe come to a complete halt, so that no different contingencies would ever arise.)

So, programs are written to cover all the most common problems and foreseeable errors. But even this has to be weighed in the light of reality. One recent iteration of a popular operating system was released with 22,000 known bugs! (That doesn't include the probable doubling of that number in unknown bugs once a few million users started mucking about with it on a few million different machines, either!)

Hey! We're all human, and programmer errors do happen, although languages such as C++, and the use of libraries of prebuilt and pre-debugged code have gone a long way toward cutting that source of errors down.

There are reasonably sophisticated automated debugging tools which encapsulate each line of code, and can track down such mysterious things as "dangling pointers", and "undisposed handles" (ways that programs access memory.)

Good code will "TRY" a process, and if it doesn't work, "CATCH" the error. Problems caused by programmer error are now WAY down from the early years of personal computers.
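As a rough illustration of "TRY" and "CATCH," here is a small Python sketch. The `load_settings` function, the `sample_rate` key and its default value are hypothetical; the point is only the pattern of attempting an operation and recovering from a foreseeable failure instead of crashing.

```python
import json


def load_settings(text):
    """TRY to parse the settings text; CATCH a parse failure and use defaults."""
    defaults = {"sample_rate": 48000}  # assumed default, for illustration only
    try:
        settings = json.loads(text)
    except json.JSONDecodeError:
        # The foreseeable problem: the settings text is damaged or malformed.
        return defaults
    if not isinstance(settings, dict):
        # Valid JSON, but not the shape this program expects.
        return defaults
    return {**defaults, **settings}


print(load_settings('{"sample_rate": 32000}'))
print(load_settings("this is not valid settings text"))
```

The second call would otherwise stop the program with an error. Catching it turns a crash into a sensible fallback, but only for the failures the programmer thought to anticipate.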

So, software problems come down to true errors (the programmer used an "add" command when he should have used "multiply," for example, which no software debugging tool can detect) and "bullet-proofing" - trying to keep the unforeseen from totally messing everything up.
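A "true error" of the add-instead-of-multiply kind might look like the following Python sketch. The function names and the frame-count scenario are made up for illustration. Both versions run without any complaint from the language or the tools; only a check against a known right answer shows that one of them is wrong.

```python
def frame_count_buggy(seconds, frames_per_second):
    """Intended to return the number of frames, but uses the wrong operator."""
    return seconds + frames_per_second  # should be multiplication


def frame_count(seconds, frames_per_second):
    """Return the number of frames in a clip of the given length."""
    return seconds * frames_per_second


# Neither call raises an error; the buggy one is simply wrong.
print(frame_count_buggy(10, 30))
print(frame_count(10, 30))
```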

This last is the most common problem, exacerbated by the fact that there are no two computers alike, except as they come off the assembly line. We add software, create documents; hang new hardware off it; add memory; update the operating system; customize the desktop; and so on. Honestly, the fact that most software runs as well as it does on 99.9% of the machines in the world is nothing short of amazing. (Which is of little consolation when it's you that is the remaining 0.1%...)

We're about to see what that means when it is you that's affected.

Software "conflicts" are by far the greatest single source of problems. In computer speak, a process (much like in normal speech) is just something that is going on. Running software starts a process; any software. Merely starting your machine in the morning starts hundreds of individual processes - some running the monitor; the keyboard; the drives; managing the memory and so on.

A conflict is a case where some combination of processes running on a given machine alters the state of that machine to the point where one (or more) processes can no longer function properly.

Let me give you an example. Process ONE wants to remember something, so it stores it in memory location "A". Process TWO wants to remember something, and also stores it in memory location "A". Process ONE now decides it needs what it just stored, and so retrieves it from memory location "A"... but now it's not what ONE stored there, but what TWO stored there.

ONE then tries to use that retrieved, and incorrect information, and, like slipping on the parachute, instead finds itself under water. Crash.

Now, that is a crude and, with modern systems and coding, unlikely example... but it's a perfectly valid one.
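The ONE and TWO example can be sketched in a few lines of Python. This is a deliberately crude model, in keeping with the example itself: the shared `memory` dictionary, the stored values and the function names are all hypothetical, and modern operating systems keep processes out of each other's memory.

```python
memory = {}  # stand-in for a memory area that both processes can write to


def one_store():
    memory["A"] = 48000        # ONE saves a number in location "A"


def two_store():
    memory["A"] = "clip_007"   # TWO saves some text in the same location


def one_use():
    return memory["A"] // 1000  # ONE expects to get its number back


# ONE running by itself works fine.
one_store()
alone = one_use()

# With TWO installed, ONE retrieves what TWO stored and falls over.
one_store()
two_store()
try:
    one_use()
    crashed = False
except TypeError:
    crashed = True

print(alone, crashed)
```

Notice that the failure surfaces inside ONE, even though nothing in ONE changed. Which process is "at fault" cannot be read off from where the crash happens.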

From the user's standpoint, his/her machine ran "just fine" until s/he installed the software that is Process TWO. Then the machine started crashing.

Must be the new software's fault, right?

Er... why? Maybe process ONE should not have been using Memory location "A". Maybe process ONE is the only software in the world that uses memory location "A" except for process TWO.

You might report the problem to the publisher of TWO, who might spend days, weeks, months trying to find the problem, when, in fact, there is absolutely nothing wrong with TWO.

Now this example is one of the easier kinds of problems to find, if you're willing to work with the publisher to help him find it. You can simply try removing everything and seeing if TWO continues to crash. (It won't, as long as ONE isn't running.) When you finally run ONE, and the crash in TWO happens, you'll have started toward finding the solution.
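The remove-everything-then-add-back approach is really a small search procedure, and it can be sketched as one. In this Python sketch, `crashes_with` is a hypothetical stand-in for the manual step of restarting with a given set of add-ons enabled and checking whether the crash occurs; the add-on names are invented.

```python
def find_conflict(candidates, crashes_with):
    """Re-enable add-ons one at a time; return the one that brings the crash back.

    crashes_with(enabled) must report whether the crash occurs with exactly
    the add-ons in `enabled` running. Returns None if nothing reproduces it.
    """
    enabled = []
    if crashes_with(enabled):
        # Crashes even in the clean state: the add-ons are not the cause.
        raise RuntimeError("crash occurs with nothing else running")
    for item in candidates:
        enabled.append(item)
        if crashes_with(enabled):
            return item
    return None


# Simulated machine for illustration: the crash appears whenever "ONE" runs.
def simulated_crash(enabled):
    return "ONE" in enabled


culprit = find_conflict(["FontTool", "ONE", "MouseTweak"], simulated_crash)
print(culprit)
```

This finds the add-on whose presence completes the conflict. It does not tell you which side is in the wrong, and it will not untangle a case where a third process is what pushes ONE into using location "A" in the first place.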

Where it gets hard is if ONE only uses Memory location "A" because some other process has made a change in the machine, requiring "A" to be used instead of ONE's preferred "B".

Yes - that was supposed to confuse you. If you're confused, and you're a thinking, adaptable human being, you'll begin to see how a blind dumb machine following a fixed set of rules, is prone to errors.

There is no "almost perfect" in computers; no "nearly right." If things are not perfectly set up, then bad things happen to good machines.

That is why the first thing a tech-support person will ask you to do is to restart the machine in a known good state, and then run the software to see if the problem still exists. Starting to diagnose what is going on from any other place is just plain impossible. (At least that's what happens with most Macintosh diagnosis; with Windows machines, the problem is multiplied several fold because of the intricacies of INI and a few dozen other setup files that are modified each time new software is installed.)

So, on the one hand, there are the possibilities of an infinite variety of software, and on the other, an infinite variety of machine configurations. And all it takes is one of those to not be perfect under all circumstances.

Finally, you may have triggered the error. Loading up a project incorrectly; telling the software that you're using 48K sound when you're using 32K, and then saving the project; switching from 32K to 48K in the middle of a tape; doing a forced break out of a program when it's writing to disk; restarting a program after a crash, without restarting the machine first; and so on.

And then there's the obvious one, which applies, naturally enough, to everyone in the world except you - read the instructions! Your expectations of what the program should do may not be what it's designed to do. That is not, my friends, a bug.

The truth be known, of all the tech support calls and bug reports a company gets, perhaps one in every 800-1000 is actually a problem with the software. Most often, it's "operator error" followed, at some distance, by software conflicts.

Real life -

However, some bugs are real. Here's an anecdote about finding one with Final Cut Pro 2.0.

The 2-pop board contained a couple of references to crashes when logging material. The usual suspects didn't seem to fix the problem. I noticed the problem myself, and had logged on to 2-pop to see if I was alone, and discovered that I was not.

So, as an experienced debugger, I started off tracking it myself.

First, I restarted with only the recommended extensions: no luck. (But that's where you should start, too!) I adjusted the memory: no luck. I trashed the preferences: no luck. I reinstalled the software: no luck. I looked at 2-pop: no luck.

I was intrigued. I next thought about what was obviously different between 1.25, which worked just fine, and 2.0, which crashed. The most obvious thing was in my face: the Audio Metering window (AMW).

So I closed it... ... and the problem went away.

Over the next several hours, I tried that new-found technique under various system configurations, and in every single case, closing the AMW fixed the problem, and in every single case opening it caused the problem. At this point I was pretty confident that I could post the technique to the board, and did.

As a developer, I also posted the issue to Apple's DTS (Developer Technical Services) bug board.

Within 90 minutes I had received a call from Apple.

Folks, this was to be expected from a quality software company - any such company, not just Apple, because we had a reproducible problem, shared by others, with a symptomatic relief that always worked.

Real problems, reproducible problems, are something that programmers and publishers want to catch and fix: it's only good business. It makes them money. No quality publisher is going to turn down the chance to improve his product, especially if it's malfunctioning! Really.

Compare that reproducibility with a tech support call like this: "You sleazy &**$!!!'s! Your software is worth %$#! I'm never going to buy your @#$! software again!"

"What seems to be the problem?"

"It crashes!"

er... um...

The lesson here is this: most people DON'T have your problem with the software; it's a problem likely unique to your machine. So, if you want tech support to help you, don't count on them being mind-reading magicians.

Remain calm. REPRODUCE the problem and write down the steps. Then, as a team, you and support can find the solution.

But I digress...

Apple, as I said, was very concerned about the problem. We spoke on the phone several times and exchanged dozens of emails. The folks at Apple were there until well after closing on a Friday. We tried several different things. But no matter how much we tried, it always crashed for me and it never crashed for them.

When Monday morning rolled around, they were right back on the case. Between us, we tried reinstalling the OS; installing as yet unreleased "OSen." I installed MacsBug and printed out stdlogs, showing the internals of the problem. I hooked up different hardware.

All to no avail. Then, at Apple's suggestion, we tried pulling the RAM, since I had 1.5 gigs installed. The suggestion was that the RAM, or the new firmware upgrade, might have caused the problem. (That is, it was beginning to look like a hardware problem, since nothing I did, from clean system installs on, seemed to make a whit of difference.)

With the hardware looming larger as a possibility, I contacted the other person (Mark) who had the crash, and we exchanged hardware setup information.

There was almost nothing in common: his was a nice clean setup, while mine was a cluttered mess of things - yet we both crashed. But I did notice one thing, and it led right back to Apple's RAM suggestion: we both had 1.5 gigs of RAM.

Mark tried pulling some of his RAM. He also reset the memory allocations so that FCP would use most of what was available as he did the RAM changes.

And the problem went away.

So: was it hardware?

The clue came when Mark reported back that he had reset his RAM allocation (minimum and preferred sizes) to 500000K and 800000K. Mine were set considerably smaller.

So I reset my sizes to match his... and the problem went away for me as well.

So, there we were. The problem was related to hardware, but was, in fact, a software problem, caused (most likely) by an assumption made somewhere along the line in the programming that an address in memory was in one memory bank (one DIMM, as specified by the upper bytes of the address itself) when, in certain specific circumstances, it was in another. By resetting the memory allocations, we forced the memory to be in the same bank, and the addressing error went away.
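To show how an assumption of that general kind can bite, here is a purely illustrative Python sketch. It is not Apple's code and makes no claim about what the actual bug was. The bank size, the addresses and the function names are all assumptions chosen for the example. It treats the upper bits of an address as the bank and the lower bits as the offset within that bank, then compares a calculation that silently assumes both ends of a buffer sit in the same bank with one that does not.

```python
BANK_BITS = 29                       # assumed: banks of 2**29 bytes (512 MB)
OFFSET_MASK = (1 << BANK_BITS) - 1   # low bits = position inside the bank


def bank(address):
    """Upper bits of the address select the bank."""
    return address >> BANK_BITS


def offset(address):
    """Lower bits of the address give the position within the bank."""
    return address & OFFSET_MASK


def length_buggy(start, end):
    """Wrongly assumes start and end share a bank, so compares offsets only."""
    return offset(end) - offset(start)


def length(start, end):
    """Uses the full addresses, so the bank boundary does not matter."""
    return end - start


# A buffer that sits entirely inside bank 0: both calculations agree.
inside = (0x10000000, 0x18000000)

# A buffer that starts in bank 0 and ends in bank 1: the shortcut goes wrong.
straddling = (0x1F000000, 0x21000000)

print(length_buggy(*inside), length(*inside))
print(length_buggy(*straddling), length(*straddling))
```

In this toy version, changing where the buffer lands in memory is enough to make the problem appear or vanish, which is the kind of behavior a change in memory allocation could plausibly produce.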

It took the concerted effort of many people, at about 72 hours each, to find the problem and provide enough information so that Apple could reproduce the problem and eventually fix it.

What's to be learned from this anecdote? For one thing - Apple IS listening and cares about the product. They responded almost instantly, and stayed with it until the problem was identified and fixed. But, without cooperation and help from the user (me and Mark) the problem could have lingered for weeks, or forever. Patience and perseverance prevailed.

So, when bad things happen to good computers, here's what you can do:

Remain calm (not always easy to do, I know).

Try to determine if it's just something on your machine:

Is your software / project / document set up correctly? Was everything just fine until your last effect was added? If so, did the project file get corrupted somehow? (Now you can see why backups are important, no?) Is the media corrupted?

Did you recently add some new software? Run an installer of any kind? Change your setup? Crash recently (even when using some other program) possibly corrupting files?

Did you just defragment your drive? (This is notorious for corrupting files - before you defragment a drive, back it up!)

Mac users: try trashing the preferences for your software; try rebuilding the desktop; run Disk First Aid, or DiskWarrior or TechTool Pro. Restart with only the extensions you need for the software - if that works, you've got an extension conflict.

PC users: try running Norton to verify the directory structure. Other than that, I'd have to say: call your IT guy, 'cause I don't know squat about how you'd find it on a PC, outside of returning the machine to a previously known good state, or uninstalling whatever you think might be the cause.

The point here, in both cases, is to try to find the problem yourself first.

Why should you? Well, as we just saw, the odds are better than 199 in 200 that it's something particular to your machine's current state, and finding it and fixing it yourself will get you up and running faster.

If you take these preliminary steps, then if you need to call tech support for the product, you'll be able to tell them that you've already done those steps, since that's exactly what they'll have you do anyway.

When and if you do call, remember that the guy or gal on the other end of the phone didn't write the software, and isn't personally responsible for your particular problem.

Remember that he or she has already been berated, yelled at, cursed, vilified and threatened 84 times today alone. So, if your goal is to solve your problem and get back to work as soon as possible, consider this counterpoint: be NICE. Overwhelm them with courtesy. Politely explain the problem, and the steps you've already taken to solve it. Have all the information at hand (such as your OS version; amount of RAM; serial number of the product and so on.)

After 84 jerks, you'll be a ray of sunshine - manna from heaven - such a delightful relief that they will work with you until hell freezes over to resolve your issue.

This works for me every time.

However, if you do get someone who has had one too many jerks, and is being rude to you, simply hang up and try again. If it's a big company, you'll get someone else on the second try. If you don't, then ask for a different tech, explaining that you're being civil, and expect the same in return.

Finally, there's this - try some preventative maintenance.

You don't expect your car to run forever without oil-changes, refilling the tank and tune-ups; treat your computer the same way. Don't load on every extension; utility; enhancement; gosh-O, golly-gee-whiz doodad that comes your way. Rebuild the directory every so often - not just when trouble appears. Stay up to date with your OS - use the most recent version (because it's likely to have fewer known bugs) as well as the most recent version of your software (for the same reason).

Mac users can run their software updater control panel; PC users - check the MS website for the latest DLLs and system updates.

Be aware that this maintenance is an ongoing process, and take the time to do it. Remember that software (such as Final Cut Pro) is developed to take advantage of the latest changes in the OS; and the OS evolves to provide bug fixes and new features. In short, these happen in parallel, so don't get out of sync here. Don't expect to run the latest software on an old OS, nor a new OS with old software. If you do, you're asking for trouble.

This is hardly an exhaustive treatment of what can go wrong - there are 6-year-old hard drives; bad removable media; faulty cables; and a few zillion other things.

But with patience and logic and perseverance, you'll be able to solve 90% of the problems yourself, and be back up and running, even when bad things happen to good computers.

Copyright 2001 Tracy Valleau


About the author
Tracy Valleau started programming computers in 1978, and is credited with one of the earliest multimedia / hypertext programs (1988). He is currently the CEO of Digital Light Studios, Inc., a multimedia and consulting firm whose clients have included McGraw-Hill, Sony, Apple, Silicon Graphics and others. Mr. Valleau can be reached at tracy@DigitalLightStudios.com.


copyright © Michael Horton 2000-2010 All rights reserved