This is from the RISKS newsgroup. It is a story of why you should study hard in CPS 110!
Date: Fri, 9 Jan 1998 14:13:58 -0800
From: Mike Jones
Subject: Re: What really happened on Mars? by Glenn Reeves (RISKS - ************************))

Date: Monday, December 15, 1997 10:28 AM
From: Glenn E Reeves
Subject: Re: [Fwd: FW: What really happened on Mars?]

What really happened on Mars?

By now most of you have read Mike's ([email protected]) summary of Dave Wilner's comments given at the IEEE Real-Time Systems Symposium. I don't know Mike and I didn't attend the symposium (though I really wish I had now) and I have not talked to Dave Wilner since before the talk. However, I did lead the software team for the Mars Pathfinder spacecraft. So, instead of trying to find out what was said I will just tell you what happened. You can make your own judgments.

I sent this message out to everyone who was a recipient of Mike's original that I had an e-mail address for. Please pass it on to anyone you sent the first one to. Mike, I hope you will post this wherever you posted the original.

Since I want to make sure the problem is clearly understood I need to step through each of the areas which contributed to the problem.

THE HARDWARE

The simplified view of the Mars Pathfinder hardware architecture looks like this. A single CPU controls the spacecraft. It resides on a VME bus which also contains interface cards for the radio, the camera, and an interface to a 1553 bus. The 1553 bus connects to two places: the "cruise stage" part of the spacecraft and the "lander" part of the spacecraft. The hardware on the cruise part of the spacecraft controls thrusters, valves, a sun sensor, and a star scanner. The hardware on the lander part provides an interface to accelerometers, a radar altimeter, and an instrument for meteorological science known as the ASI/MET. The hardware which we used to interface to the 1553 bus (at both ends) was inherited from the Cassini spacecraft.
This hardware came with a specific paradigm for its usage: the software will schedule activity at an 8 Hz rate. This **feature** dictated the architecture of the software which controls both the 1553 bus and the devices attached to it.

THE SOFTWARE ARCHITECTURE

The software to control the 1553 bus and the attached instruments was implemented as two tasks. The first task controlled the setup of transactions on the 1553 bus (called the bus scheduler or bc_sched task) and the second task handled the collection of the transaction results, i.e., the data. The second task is referred to as the bc_dist (for distribution) task. A typical timeline for the bus activity for a single cycle is shown below. It is not to scale. This cycle was constantly repeated.

        |< ------------- .125 seconds ------------------------>|
        |<***************|                    |********|   |**>|
                         |<- bc_dist active ->|
                                                       bc_sched active
        |< -bus active ->|                             |<->|
    ----|----------------|--------------------|--------|---|---|-------
        t1               t2                   t3       t4  t5  t1

The *** are periods when tasks other than the ones listed are executing. Yes, there is some idle time.

t1 - bus hardware starts via hardware control on the 8 Hz boundary. The transactions for this cycle had been set up by the previous execution of the bc_sched task.
t2 - 1553 traffic is complete and the bc_dist task is awakened.
t3 - bc_dist task has completed all of the data distribution.
t4 - bc_sched task is awakened to set up transactions for the next cycle.
t5 - bc_sched activity is complete.

The bc_sched and bc_dist tasks check each cycle to be sure that the other had completed its execution. The bc_sched task is the highest priority task in the system (except for the vxWorks "tExec" task). The bc_dist is third highest (a task controlling the entry and landing is second). All of the tasks which perform other spacecraft functions are lower. Science functions, such as imaging, image compression, and the ASI/MET task are still lower.

Data is collected from devices connected to the 1553 bus only when they are powered.
Most of the tasks in the system that access the information collected over the 1553 do so via a double buffered shared memory mechanism into which the bc_dist task places the latest data. The exception to this is the ASI/MET task which is delivered its information via an interprocess communication mechanism (IPC). The IPC mechanism uses the vxWorks pipe() facility. Tasks wait on one or more IPC "queues" for messages to arrive. Tasks use the select() mechanism to wait for message arrival. Multiple queues are used when both high and lower priority messages are required. Most of the IPC traffic in the system is not for the delivery of real-time data. However, again, the exception to this is the use of the IPC mechanism with the ASI/MET task. The cause of the reset on Mars was in the use and configuration of the IPC mechanism.

THE FAILURE

The failure was identified by the spacecraft as a failure of the bc_dist task to complete its execution before the bc_sched task started. The reaction to this by the spacecraft was to reset the computer. This reset reinitializes all of the hardware and software. It also terminates the execution of the current ground commanded activities. No science or engineering data is lost that has already been collected (the data in RAM is recovered so long as power is not lost). However, the remainder of the activities for that day were not accomplished until the next day.

The failure turned out to be a case of priority inversion (how we discovered this and how we fixed it are covered later). The higher priority bc_dist task was blocked by the much lower priority ASI/MET task that was holding a shared resource. The ASI/MET task had acquired this resource and then been preempted by several of the medium priority tasks. When the bc_sched task was activated, to set up the transactions for the next 1553 bus cycle, it detected that the bc_dist task had not completed its execution.
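
The inversion described above can be reproduced in a small tick-based simulation. What follows is a minimal sketch in Python, not flight code and not the vxWorks scheduler: the `Task` class, the `simulate` and `scenario` functions, the priority numbers, the tick counts, and the `DEADLINE` value are all made up for illustration. The three tasks stand in for the ASI/MET task (low priority), the medium priority tasks, and bc_dist (high priority), sharing one mutex. The `inherit` flag models priority inheritance, the semaphore option the message later describes as the fix.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Task:
    name: str
    priority: int                      # larger number = more urgent
    release: int                       # tick at which the task becomes ready
    program: list                      # ops: "lock", "work" (one tick), "unlock"
    pc: int = 0                        # index of the next op
    finished_at: Optional[int] = None  # tick at which the program completed


def simulate(tasks, inherit, max_ticks=50):
    """Preemptive priority scheduling with a single shared mutex.

    Assumes well-formed programs: every "lock" is followed by at least
    one "work" and later by an "unlock" issued by the same task.
    """
    owner = None     # task currently holding the mutex
    waiters = []     # tasks blocked on the mutex

    def effective(task):
        # With inheritance, the owner borrows its most urgent waiter's priority.
        if inherit and task is owner and waiters:
            return max(task.priority, max(w.priority for w in waiters))
        return task.priority

    for tick in range(max_ticks):
        running = None
        while running is None:
            ready = [t for t in tasks
                     if t.release <= tick and t.finished_at is None
                     and t not in waiters]
            if not ready:
                break                          # idle tick
            candidate = max(ready, key=effective)
            if candidate.program[candidate.pc] == "lock":
                if owner is None:
                    owner = candidate          # lock acquired, costs no time
                    candidate.pc += 1
                else:
                    waiters.append(candidate)  # blocked; pick someone else
            else:
                running = candidate
        if running is None:
            continue
        running.pc += 1                        # one tick of "work"
        while (running.pc < len(running.program)
               and running.program[running.pc] == "unlock"):
            owner = None
            waiters.clear()                    # waiters retry the lock when scheduled
            running.pc += 1
        if running.pc == len(running.program):
            running.finished_at = tick + 1
    return {t.name: t.finished_at for t in tasks}


def scenario():
    # Hypothetical workload; the numbers are illustrative only.
    return [
        Task("asi_met", 1, 0, ["lock", "work", "work", "unlock"]),
        Task("medium", 2, 1, ["work"] * 5),
        Task("bc_dist", 3, 2, ["lock", "work", "unlock"]),
    ]


DEADLINE = 6  # hypothetical tick at which a bc_sched-style check would run

for inherit in (False, True):
    result = simulate(scenario(), inherit)
    status = "met" if result["bc_dist"] <= DEADLINE else "MISSED"
    print(f"inherit={inherit}: {result} -> bc_dist deadline {status}")
```

With these made-up numbers, the bc_dist stand-in finishes at tick 8 without inheritance, because the medium task keeps the lock holder off the CPU, and at tick 4 with it. Only the second case meets the hypothetical deadline.
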
The resource that caused this problem was a mutual exclusion semaphore used within the select() mechanism to control access to the list of file descriptors that the select() mechanism was to wait on.

The select mechanism creates a mutual exclusion semaphore to protect the "wait list" of file descriptors for those devices which support select. The vxWorks pipe() mechanism is such a device and the IPC mechanism we used is based on using pipes. The ASI/MET task had called select, which had called pipeIoctl(), which had called selNodeAdd(), which was in the process of giving the mutex semaphore. The ASI/MET task was preempted and semGive() was not completed. Several medium priority tasks ran until the bc_dist task was activated. The bc_dist task attempted to send the newest ASI/MET data via the IPC mechanism which called pipeWrite(). pipeWrite() blocked, taking the mutex semaphore. More of the medium priority tasks ran, still not allowing the ASI/MET task to run, until the bc_sched task was awakened. At that point, the bc_sched task determined that the bc_dist task had not completed its cycle (a hard deadline in the system) and declared the error that initiated the reset.

HOW WE FOUND IT

The software that flies on Mars Pathfinder has several debug features within it that are used in the lab but are not used on the flight spacecraft (not used because some of them produce more information than we can send back to Earth). These features were not "fortuitously" left enabled but remain in the software by design. We strongly believe in the "test what you fly and fly what you test" philosophy.

One of these tools is a trace/log facility which was originally developed to find a bug in an early version of the vxWorks port (Wind River ported vxWorks to the RS 6000 processor for us for this mission). This trace/log facility was built by David Cummings who was one of the software engineers on the task.
Lisa Stanley, of Wind River, took this facility and instrumented the pipe services, msgQ services, interrupt handling, select services, and the tExec task. The facility initializes at startup and continues to collect data (in ring buffers) until told to stop. The facility produces a voluminous dump of information when asked.

After the problem occurred on Mars we did run the same set of activities over and over again in the lab. The bc_sched was already coded so as to stop the trace/log collection and dump the data (even though we knew we could not get the dump in flight) for this error. So, when we went into the lab to test it we did not have to change the software.

In less than 18 hours we were able to cause the problem to occur. Once we were able to reproduce the failure the priority inversion problem was obvious.

HOW WAS THE PROBLEM CORRECTED

Once we understood the problem the fix appeared obvious: change the creation flags for the semaphore so as to enable the priority inheritance. The Wind River folks, for many of their services, supply global configuration variables for parameters such as the "options" parameter for the semMCreate used by the select service (although this is not documented and those who do not have vxWorks source code or have not studied the source code might be unaware of this feature). However, the fix is not so obvious for several reasons:

1) The code for this is in the selectLib() and is common for all device creations.
When you change this global variable all of the select semaphores created after that point will be created with the new options. There was no easy way in our initialization logic to only modify the semaphore associated with the pipe used for bc_dist task to ASI/MET task communications.

2) If we make this change, and it is applied on a global basis, how will this change the behavior of the rest of the system?

3) The priority inversion option was deliberately left out by Wind River in the default selectLib() service for optimum performance. How will performance degrade if we turn the priority inversion on?

4) Was there some intrinsic behavior of the select mechanism itself that would change if the priority inversion was enabled?

We did end up modifying the global variable to include the priority inversion. This corrected the problem. We asked Wind River to analyze the potential impacts for (3) and (4). They concluded that the performance impact would be minimal and that the behavior of select() would not change so long as there was always only one task waiting for any particular file descriptor. This is true in our system. I believe that the debate at Wind River still continues on whether the priority inversion option should be on as the default. For (1) and (2) the change did alter the characteristics of all of the select semaphores. We concluded, both by analysis and test, that there was no adverse behavior. We tested the system extensively before we changed the software on the spacecraft.

HOW WE CHANGED THE SOFTWARE ON THE SPACECRAFT

No, we did not use the vxWorks shell to change the software (although the shell is usable on the spacecraft). The process of "patching" the software on the spacecraft is a specialized process. It involves sending the differences between what you have onboard and what you want (and have on Earth) to the spacecraft. Custom software on the spacecraft (with a whole bunch of validation) modifies the onboard copy.
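
The general idea of sending only the differences and validating them before and after they are applied can be sketched as follows. This is an illustration in Python under stated assumptions, not the Pathfinder patch format or its onboard software: the function names `digest`, `make_patch`, and `apply_patch`, the use of SHA-256, the patch layout, and the equal-length restriction are all hypothetical.

```python
import hashlib


def digest(data: bytes) -> str:
    """Fingerprint of an image; SHA-256 is an arbitrary choice for this sketch."""
    return hashlib.sha256(data).hexdigest()


def make_patch(onboard: bytes, desired: bytes) -> dict:
    """Ground side: record only the byte runs that differ between the images."""
    if len(onboard) != len(desired):
        raise ValueError("this sketch only handles equal-length images")
    edits, i = [], 0
    while i < len(desired):
        if onboard[i] != desired[i]:
            start = i
            while i < len(desired) and onboard[i] != desired[i]:
                i += 1
            edits.append((start, desired[start:i]))  # (offset, replacement bytes)
        else:
            i += 1
    return {"base": digest(onboard), "target": digest(desired), "edits": edits}


def apply_patch(onboard: bytes, patch: dict) -> bytes:
    """Receiving side: validate first, patch a copy, validate the result."""
    if digest(onboard) != patch["base"]:
        raise ValueError("onboard image does not match the patch's base image")
    image = bytearray(onboard)  # work on a copy, never the input itself
    for offset, data in patch["edits"]:
        if offset < 0 or offset + len(data) > len(image):
            raise ValueError("edit falls outside the image")
        image[offset:offset + len(data)] = data
    if digest(bytes(image)) != patch["target"]:
        raise ValueError("patched image failed validation")
    return bytes(image)


# Made-up images that differ in a single byte.
old_image = b"AAAA-option=0-BBBB"
new_image = b"AAAA-option=1-BBBB"
patch = make_patch(old_image, new_image)
print(patch["edits"])
print(apply_patch(old_image, patch) == new_image)
```

Checking the base image before touching anything, and the result before accepting it, means a patch built against the wrong onboard copy is rejected rather than partially applied.
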
If you want more info you can send me e-mail.

WHY DIDN'T WE CATCH IT BEFORE LAUNCH?

The problem would only manifest itself when ASI/MET data was being collected and intermediate tasks were heavily loaded. Our before launch testing was limited to the "best case" high data rates and science activities. The fact that data rates from the surface were higher than anticipated and the amount of science activities proportionally greater served to aggravate the problem. We did not expect nor test the "better than we could have ever imagined" case.

HUMAN NATURE, DEADLINE PRESSURES

We did see the problem before landing but could not get it to repeat when we tried to track it down. It was not forgotten nor was it deemed unimportant.

Yes, we were concentrating heavily on the entry and landing software. Yes, we considered this problem lower priority. Yes, we would have liked to have everything perfect before landing. However, I don't see any problem here other than we ran out of time to get the lower priority issues completed.

We did have one other thing on our side; we knew how robust our system was because that is the way we designed it.

We knew that if this problem occurred we would reset. We built in mechanisms to recover the current activity so that there would be no interruptions in the science data (although this wasn't used until later in the landed mission). We built in the ability (and tested it) to go through multiple resets while we were going through the Martian atmosphere. We designed the software to recover from radiation induced errors in the memory or the processor. The spacecraft would have even done a 60 day mission on its own, including deploying the rover, if the radio receiver had broken when we landed.
There are a large number of safeguards in the system to ensure robust, continued operation in the event of a failure of this type. These safeguards allowed us to designate problems of this nature as lower priority.

We had our priority right.

ANALYSIS AND LESSONS

Did we (the JPL team) make an error in assuming how the select/pipe mechanism would work? Yes, probably. But there was no conscious decision to not have the priority inversion enabled. We just missed it. There are several other places in the flight software where similar protection is required for critical data structures and the semaphores do have priority inversion protection. A good lesson when you fly COTS stuff - make sure you know how it works.

Mike is quite correct in saying that we could not have figured this out **ever** if we did not have the tools to give us the insight. We built many of the tools into the software for exactly this type of problem. We always planned to leave them in. In fact, the shell (and the stdout stream) were very useful the entire mission. If you want more detail send me a note.

SETTING THE RECORD STRAIGHT

First, I want to make sure that everyone understands how I feel in regard to Wind River. These folks did a fantastic job for us. They were enthusiastic and supported us when we came to them and asked them to do an affordable port of vxWorks. They delivered the alpha version in 3 months. When we had a problem they put some of the brightest engineers I have ever worked with on the problem. Our communication with them was fantastic. If they had not done such a professional job the Mars Pathfinder mission would not have been the success that it is.

Second, Dave Wilner did talk to me about this problem before he gave his talk. I could not find my notes where I had detailed the description of the problem. So, I winged it and I sure did get it wrong. Sorry Dave.

ACKNOWLEDGMENTS

First, thanks to Mike for writing a very nice description of the talk.
I think I have had probably 400 people send me copies. You gave me the push to write the part of the Mars Pathfinder End-of-Mission report that I had been procrastinating doing.

Special thanks to Steve Stolper for helping me do this. The biggest thanks should go to the software team that I had the privilege of leading and whose expertise allowed us to succeed: Pam Yoshioka, Dave Cummings, Don Meyer, Karl Schneider, Greg Welz, Rick Achatz, Kim Gostelow, Dave Smyth, Steve Stolper. Also, Miguel San Martin, Sam Sirlin, Brian Lazara (WRS), Mike Deliman (WRS), Lisa Stanley (WRS).

Glenn Reeves, Mars Pathfinder Flight Software Cognizant Engineer
[email protected]