(1 / 2)
Date: February 21, 1991 15:13
From: KIM::ALBAUGH
To: @SYS$MAIL:ENGINEER
(This file lives in ee$userdisk:[68k_us.memo]common_bugs.txt) WHO NEEDS TO READ THIS: every game programmer and every project leader whose game programmer says he/she has none of these bugs (they're lying). Hardware engineers might also find it a useful refresher. I have been doing a lot of poking around in dusty corners lately, trying to move people from older coin and eeprom routines to newer ones. In the process I have discovered a few bugs and "accidents waiting to happen". Each is described below, although I'm not claiming that they are new or that I didn't create some of them myself. First some background. Most of our games have (at least) two levels of interrupt. Some games add an extra level of "software interrupt" or "foreground". Others have multiple processors (you don't? how about that 6502 on your sound board?) All these things mean that the statements of a program are often executed in a different order than one might expect from the source listing. Keeping things on track requires a certain degree of (dare I say it) discipline. NOTE: These are not the only bugs in our code. They are just ones I have seen often, or lately. Each topic below is presented as: subject, background, the bug, and suggested changes. Mike SUBJECT: unrelated functions sharing a latch BACKGROUND: Our hardware guys are very fond of grouping miscellaneous bits together in four-, six-, or eight- bit latches. They are unlikely to change because a) it's cheaper b) it's not _that_ much of a hassle, usually and c) it's now the "standard". The problem arises because these latches are write-only, so they need to be shadowed. Unfortunately, when a game has two programmers (or one programmer in two different moods or seasons), there tend to be two (or more) shadows for the same latch. Also, we don't always take care about order of update or possible interrupts. THE BUG: Failure to use a single shadow, or to "guard" references to that shadow, leads to random setting/clearing of bits. This is especially prevalent in the _added_at_the_last_minute vcr code, which may, for example, change the resolution of your LETA... THE FIX: A single routine that has its own shadow and disables interrupts while mucking with it. the one below is called to set a some bits by: mod_latch( FIRST_BIT | OTHER_BIT ); and called to clear the same bits by: mod_latch( ~( FIRST_BIT | OTHER_BIT ) ); and it assumes that there is only one such latch, of up to 31 bits. The actual code snippet below assumes a 16-bit latch (and shadow), but all the '.W's could be '.B' or '.L' for 8 or 31 bits. It is _probably_ a real good idea if the latch address and the shadow address were _not_ available to anybody else, to "put some teeth" in the "convention" of using this routine. -------- sample code follows -------- xdef mod_latch * Modifies LATCH according to its param. If MSB set, clears any bits * that are clear in the param, else sets any that are set. NOTE: this * is the reverse of the original (ROMALOT) mod_latch, but nobody was * using that, and this is a more intuitive interface. mod_latch: MOVE SR,D1 OR #$700,SR * Lock INTS MOVE.L 4(SP),D0 * Read parameter LONG BMI.S ML10 * D31 signals 0: OR in bits 1:AND out bits OR.W bitshd,D0 BRA.S ML20 ML10: AND.W bitshd,D0 ML20: MOVE.W D0,LATCH Set LATCH MOVE.W D0,bitshd update shadow MOVE D1,SR RTS SUBJECT: Spinning on eer_busy() in Vblank BACKGROUND: Many of our games use EEPROM to hold statistics, High Scores, etc. One problem with EEPROM is that each byte (typically) takes time to write. This is normally handled by having a routine ( eer_hwt() ) that is called every VBlank actually do the write. If the test switch is changed while the game is writing, corrupt data could result. So, a routine ( eer_busy() ) is supplied to tell when it is safe to "turn over" the game from test to game or vice-versa. THE BUG: Somewhere, sometime, somone decided that the code to do this should look roughly like: VBLANK_IRQ:TST.W INSELF BEQ * We think we are in selftest, does the switch agree? IFTST * We need to transit from test to game. VB0: JSR eer_busy TST.L D0 BNE VB0 < code to fake a reset, which will find the test switch off and start game> A similar piece of code is used when the game vblank finds the test switch on. Since an EEPROM write can make no progress with Vblank IRQs essentially disabled, the eer_busy() will, if it ever returns TRUE, continue to do so forever (or until WATCHDOG kicks in). THE FIX: Don't just hang. Work towards finishing the write by calling eer_hwt() every Vblank until eer_busy() returns 0. This will let the game program continue too, until it is safe to "turn over". VBLANK_IRQ: TST.W INSELF BEQ * We think we are in selftest, does the switch agree? IFTST * We need to transit from test to game. JSR eer_busy TST.L D0 BEQ DIE JSR eer_hwt RTI DIE: < code to fake a reset, which will find the test switch off and start game> Incidentally, this is essentially what SYSTEM I originally did. I have no idea how the buggy code arose, but I have seen it in several games now. SUBJECT: VBACK, who hits it and when BACKGROUND: Most of our hardware designs generate an interrupt at the beginning of Vertical Blank. Most also have an address to "strobe" to acknowledge (and clear) this interrupt. Three warnings: 1) VAD-based systems cannot tolerate random gunk being stored in VBACK, always store a zero. You may want to use a CLR instruction but... 2) If you are using a 68010 in development and a 68000 in production, make sure your VBACK can tolerate being read (modern VADS can) because the 68000 (but not the 68010) does a read-modify-write cycle for CLR.W 3) Some games will generate another IRQ at line 256 if you strobe VBACK "too soon" (i.e. between lines 240 and 256). THE BUG: Several annoyances, mainly related to lunching VADS and getting spurious double Vblanks, which may have the side effect of inducing EEPROM errors. THE FIX: Make sure you are writing the proper value, and preferably late, rather than early in your Vblank IRQ routine. SUBJECT: How to set your priority level from C. BACKGROUND: Sometimes a routine written in C needs to disable interrupts during a "critical section". Since the MOVE xxx,SR instruction is not available in C, various snippets of assembly have been making the rounds. THE BUG: One of these snippets goes something like: OLD_IPL: DS.W 1 ... xdef ints_off,ints_on ints_off: MOVE.W SR,OLD_IPL ; Save old value of Interrupt priority MOVE.W 6(SP),SR ; Set new value from parameter RTS ints_on: MOVE.W OLD_IPL,SR RTS This works fine until the day that a mainline routine does an ints_off(0x2200) (saving 20xx in OLD_IPL), then is itself interrupted (say at level 4) by a routine that does an ints_off(0x2600), saving 22xx in OLD_IPL. BOTH calls to ints_on will "restore" the 22xx, leaving the "mainline" at level 2 forever. I've seen a variation on this theme which maintains a complicated stack in "private" memory, but does not check for consistency or stack overflow. THE FIX: Work with, rather than against, the language. The following: xdef set_ipl set_ipl: MOVEQ.L #0,D0 ; Not _strictly_ needed, but tidy MOVE.W SR,D0 MOVE.W 6(SP),SR RTS can be used in conjunction with a local variable in the procedure calling it: void foo() { int old_ipl; ... old_ipl = set_ipl(#COMM_LEVEL) ... (void) set_ipl(old_ipl); } SUBJECT: is "foreground" really vblank? BACKGROUND: Many of our games use an "interrupt level" between the lowest priority real interrupt and "background". This typically handles things that really should happen once a frame (like animation drivers), but are not critically dependant on the beam position. The Vblank IRQ routine checks to see if the "foreground" is active. If not, it is "activated" by marking it active, lowering the interrupt priority level, and "calling" the foreground. When the foreground returns, the IPL is once again raised, the "active" flag is cleared, and Vblank returns. If Vblank finds the foreground already active, it simply returns. THE BUG: Sometimes the VBlank routine does not lower its IPL, making the entire foreground routine essentially part of the Vblank IRQ. Another problem is covered below. THE FIX: Make sure the sequence of operations detailed above is followed, and that Vblank can be re-entered safely. SUBJECT: BIG "critical sections" BACKGROUND: In just about any game there are "critical sections", where we need to ensure that only one routine is mucking about with a given data structure. For quick changes to small pieces of data, simply raising the IPL will work just fine, but some games leave interrupts masked for multiple frames, to keep "foreground" from running while a large routine runs in "background". THE BUG: When VBlank is left masked for a long time, operations that actually need to be done during the blanking interval will actually get done at an essentially random time. The hardware may or may not allow this. Also if Vblank is left masked for a long time, the result may be two "back to back" Vblank interrupts, with subsequent errors in writes to the EEPROM. THE FIX: If you need to prevent the foreground from running, "mark" it in the same way Vblank does. Since presumably this is done in "background", there should be no hazards. If you really need to mask interrupts for a long time, perhaps you need to re-think the data structure and associated critical sections. SUBJECT: 6502 reading test-switch considered harmful BACKGROUND: For reasons lost in the mists of time, the sound processor (6502) has its own input for reading the test switch. Among other changes, the "coin read exception" will return current values of the coin inputs, rather than the "modulo four counters" if the test switch is found on. The main processor also reads the test switch, and can, therefore, have a different opinion about whether the pair are in "game" or "test" mode. THE BUG: When the test switch is switched on after some coins have been deposited, there is a 1 in 4 chance of "seeing" an extraneous coin on each mech. This will occur when the number of coins seen by a mech, modulo four, is three, because the 6502 will change to reporting '0', which is in fact the next valid value for the mech. The "credit" thus earned cannot usually be spent, since the main processor should see the test switch soon thereafter, but the EEPROM count of coins will be off from both the Electro-mechanical counters and the number of coins in the cashbox. Also, there is the possibility of a more permanent fault leading to bizarre behavior. THE FIX: There are a variety of options, ranging from converting to a scheme of "modulo three" counters, with zero a "forbidden value" to having a different exception for "raw coin switch" to having a command to set the 6502 in "test mode" rather than giving it access to the test switch. I welcome and solicit comment, but would frankly prefer all three... SUBJECT: calling eer_incs() et al. from interrupt level BACKGROUND: There are a number of routines in both the coin and eeprom code that cannot be safely re-entered. I also cannot make all of them "critical sections" because of deadly-embrace possibilties when called an arbitrary number of times from who-knows-where in a game. THE BUG: Calling certain routines with interrupts masked (like, from your VBlank routine) can lead to a deadly embrace. The most likely candidates are eer_incs() and cn_credits(). They are only most likely because you are most likely to want to call them then, although it never seems to be really needed. THE FIX: Don't! SUBJECT: calling com_[f]wrt() with an exception, or com_exc() with a command BACKGROUND: RPM accepts two kinds of "commands". The vast majority of them are inserted in a queue when received, to be processed as soon as the main loop has time. A special form, called an "exception" is processed as soon as it is received, and requires special handling. The code in RPMSLV.MAC (6502) and SND_COM.ASM (68000) take special pains to make everything work. This work is subverted when a game programmer uses com_exc() to send a "normal" command, or (more usually) uses com_[f]wrt() to send an exception. THE BUG: Mis-using these routines (as the handed-down-for-generations sound test code does, among others) can lead to spurious and/or lost coins. It is the number one reason for such spurious coins, despite 6+ years of my lectures. THE FIX: Use com_wrt() or com_fwrt() for normal commands, com_exc() for exceptions. If your old code from games gone by does not do this, fix it. While you are at it, make sure you check the return value from com_wrt() if you care whether or not it succeeded. Hardware guys might consider adding a (low priority) interrupt for comm_buffer_empty (maskable by writing a latch bit, of course) to speed up "queued" output with the sort of horrifying kluges I have glimpsed. SUBJECT: Who hits watchdog and when BACKGROUND: I may not have been the one who brought the "Watchdog" to Atari, but Rich Patak and I were the first I know of to have used one in a game here. The basic idea is simple. If you can determine an upper bound on the amount of "processing time" in your "main loop", you can have a timer that will reset the CPU if it is not reset at least that frequently. As a first approximation, this works fine, but there are a few problems. Concessions to cost lead to watchdog circuits with a period of 128 milliseconds or so, while our games sometimes have sections (particularly in initialization) longer than this. Also, our emulators often have problems with watchdog, so it is disabled during most of the development cycle, being tested as the game is going out the door. THE BUG (1): In an attempt to get around the tight timing of watchdog, some folks hit it in their Vblank IRQ. While this solves one problem, it creates another. In the absence of software "sanity checks", the mainline of a game can be "insane" and effectively hung up, while the Vblank routine is just fine and will never be reset. THE FIX (1): Add some sort of "sanity check", if only on such things as "stack pointer in bounds" or a longer interval software managed timer. A really spiffy scheme might include a software timer set at every "state transition" to the maximum length of the state, and a "scanner" to check critical data structures "in the background". THE BUG (2): As a quasi-religious reaction to bug (1), some folks shun hitting watchdog in Vblank at all, prefering to sprinkle watchdog hits semi- randomly through their code. Unfortunately, the choice of when and where is made without much time (remember, watchdog is usually disabled until management is breathing down your neck to release), and even choices that worked once may be invalidated by late changes in "tuning". In the face of these difficulties, a programmer often sprinkles enough "hits" to ensure that watchdog will never "bite", no matter how tangled the game gets, thus again effectively disabling it. The near-random nature of the "sprinkling" can cause other problems too. A few weeks ago I fixed a bug in the RAM test that had to do with WATCHDOG statements being placed between a CMPA.L and the branch that tested the result of the compare. These clobbered the condition codes. What is particularly noteworthy is that in the process of writing this memo, I came across a note from Jed Margolin (dated 16-JUN-1988) addressing this issue. I assume I changed _some_ copy of the RAM test at that time, but obviously not the "master". Old code rises from the grave... THE FIX (2): (OPINION ONLY) Abandon approach 2 and apply fix 1. I stress that this is my opinion, but I feel fairly strongly about this and am willing to debate at length issues of "correctness by construction" and the "partitioning" of software projects...
(2 / 2)
Date: March 08, 1991 14:56
From: KIM::ALBAUGH
To: @SYS$MAIL:EE
(I apologize to all you hardware guys who dropped hot soldering irons in your laps in your haste to read this, but there is no "Programmers" list) I short while ago, I sent out a memo about common bugs. One of them had to do with the 6502 and 68000 independantly deciding whether or not they were in self-test. I have modified SACOIN.MAC to deal with this problem by having a separate exception for self-test read of the raw switches. I also added provision for table-driven assignment of coin-counters to coin-mechs. I'd really like some brave volunteer to use this version, so I can know if it _really_ works. It should "drop right in" with no changes, if you can tolerate getting the incrementing counters on your self-test screen. Then, by adding an exception that uses E$TCSW (Test Coin SWitches) and using that, you will get the raw switches for your self test and avoid bogus coins in your EEPROM when you flip the test switch. The new code is five bytes shorter than the old code, but the added table space for the exception cancels some of that. In any case, it should pose no problem. Takers? Mike
Feb 21, 1991