Page 2 of 3 from Sun Screen
Sun's tepid and belated response comes despite the harsh lesson learned by Intel in late 1994, when it played down reports of a bug in a new Pentium chip and refused to make a sweeping fix. Intel took a drubbing, issued a recall and spent $450 million making things right. But Intel's Pentium problem involved a flaw that was mostly theoretical; it was hard to find a customer who could show that the error affected any computation. The Sun glitch, by contrast, has caused crashes at dozens of customer sites.
The problem involves "cache" memory chips, which store the most frequently needed code for instant access. In May, after months of struggling to identify the cause, Sun found it had been shipping servers whose cache modules contained faulty S-RAM (static random access memory) chips from a supplier it won't name. The faulty chips are easily disrupted by stray radiation, alpha particles or cosmic rays. The trouble occurs at the bit level: a one turns into a zero, or vice versa. When the computer detects an error in memory, it shuts down and reboots itself. High altitude, high temperatures and other factors can contribute to the problem.
"You can run tests for a long time, and the problem doesn't happen. Then you put the machine back into its environment, and you get the problem. It took us months just to figure out what was going on," says Shoemaker.
Engineers have long known that memory chips can be disrupted by radiation and other environmental factors. That is why Hewlett-Packard and IBM use error-correcting code (ECC), which detects cache errors and restores bits that were changed by mistake. Sun servers lack ECC protection. "Frankly, we just missed it. It's something we regret at this point," Shoemaker says. Sun's next high-end servers, based on a new processor called the UltraSparc III, will have ECC; they are to come out in mid-2001.
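To make the idea of detecting and restoring a flipped bit concrete, here is a minimal sketch of a textbook Hamming(7,4) code in Python. It illustrates the general principle only; nothing in the article says which code Hewlett-Packard, IBM or Sun actually use, and memory ECC in real hardware typically protects much wider words. The function names are hypothetical.

```python
# Illustrative only: a textbook Hamming(7,4) single-error-correcting code.
# Function names are hypothetical; this is not any vendor's actual ECC scheme.

def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword laid out as positions 1..7."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_correct(codeword):
    """Return (corrected codeword, 1-based position of the flipped bit or 0)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4  # syndrome spells out the bad position
    if position:
        c[position - 1] ^= 1  # flip the bad bit back
    return c, position

def hamming74_decode(codeword):
    """Correct at most one flipped bit, then extract the 4 data bits."""
    c, _ = hamming74_correct(codeword)
    return [c[2], c[4], c[5], c[6]]

if __name__ == "__main__":
    word = [1, 0, 1, 1]
    stored = hamming74_encode(word)
    stored[4] ^= 1  # simulate a stray particle flipping one stored bit
    print(hamming74_decode(stored) == word)
```

A single flipped bit anywhere in the 7-bit codeword is located by the three parity checks and flipped back, so the data survives without a reboot. Two simultaneous flips would defeat this small code, which is one reason production designs add further check bits.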
For now Sun is racing to deliver a new "mirrored" cache to replace defective modules in the field, although only where needed. The mirrored cache has two modules. If one fails, the other backs it up. The new modules begin shipping this month. Sun will install them, free, for customers whose machines have crashed. Some customers have fixed the problem themselves by installing special software that can find and correct memory errors.
At BellSouth Technology Service, Sun has already replaced modules on servers that crashed, says Richard Liddell, a BellSouth vice president. But a dot-com in San Francisco has been waiting several weeks for a repair. It bought a Sun 6500 server to run the database that is the core of its business. The server crashed and rebooted four times over a few months. "It's ridiculous. I've got a $300,000 server that doesn't work. The thing should be bulletproof," says the company's president.