Comba92's Site

NES emulator development guide with tips focused on fast optimizations


NES Emulation

This article is just a better-formatted version of the original paper by Brad Taylor. All credit goes to him.


Brad Taylor (BTTDgroup@hotmail.com)
4th release: April 23rd, 2004
Thanks to the NES community. http://nesdev.parodius.com.
Recommended literature: 2A03/2C02/FDS technical reference documents

Overview of document

Topics discussed


General PPU emulation


Most likely, the key to your emulator’s performance will be based on the speed at which it can render NES graphics. It’s pretty easy to write a slow PPU render engine, since overall there’s a good deal of work that has to be done. Accurate emulation of the PPU is difficult, due to all the trickery various NES games use to achieve special video effects (like split screen scrolling), otherwise not possible by “clean” or conventional means. In reality, all these “tricks” are simply accomplished by writing to the appropriate PPU (or related) registers at the right time during the rendering of a frame (picture).

On a hardware level, the CPU & PPU in the NES run simultaneously. This is why a game can be coded to make a write out to a PPU register at a certain time during a frame, and the result of this is that the (on-screen) effect occurs in a specific location on the screen. Thus, the first instinct one has for writing a NES emulator is to execute both the CPU & PPU engines alternately on every (NES) clock cycle. The results of this will give very accurate emulation, BUT- doing this will also be VERY processor intense (this will mostly be due to all the overhead of transferring program control to so many hardware emulation routines in so little time (1 CPU clock cycle)). As a result, emulators coded like this turn out to be the slowest ones.

PPU info


NES graphics consist of a single scrollable playfield, and 64 objects/sprites. The screen resolution is 256x240 pixels, and while games can control the graphics on a per-pixel basis, it is usually avoided since it’s pretty difficult. Instead, the PPU makes displaying graphics easier for the programmer by dividing the screen up into tiles, which index an 8x8 pixel bitmap to appear in that particular spot. Each object defines 1 or 2 tiles to be displayed at a randomly accessible XY coordinate on the screen. There are also 8 palette tables in the PPU that bitmap data can refer to (playfield & object bitmap data each have 4 palettes). Each palette has 3 indexable colors, as tile bitmaps only consist of 2 bits per pixel (the 00 combination is considered transparency). A single transparency color palette register is also defined, and is used as the on-screen color only when all overlapping playfield/object pixels at a screen position (due to objects being placed on the playfield) are defined as transparent.

As graphics are rendered (as described in the “2C02 technical reference” document), the name tables are accessed sequentially to reference a tile’s bitmap, which gets used as the pixel data for the area of the screen that the name table index entry corresponds to (offset by the scroll register values). Attribute tables, which are laid out the same way that name tables are (except with lower resolution- 1 attribute entry represents a 2x2 cluster of on-screen tiles), determine the palette select value for the group of tiles to use (1 of 4).

Object attribute memory (sprite RAM, or “OAM”, which contains private tile index and palette select information) is evaluated every single scanline (Y-coordinate entries are examined), and in-range objects have their tile bitmaps loaded into the PPU in between scanlines. The contents are then merged with the playfield’s pixel data in real time.
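As a rough sketch of this per-scanline evaluation, the in-range test for each object is just a Y-coordinate range check. All names here (`OamEntry`, `evaluate_oam`, `sprite_height`) are illustrative, not from the original text:

```c
#include <stdint.h>

/* One 4-byte OAM entry; byte 0 holds the sprite's top Y coordinate. */
typedef struct { uint8_t y, tile, attr, x; } OamEntry;

/* Collect up to 8 in-range sprite indices for a given scanline.
   sprite_height is 8 or 16 depending on the 8/16-scanline object bit. */
int evaluate_oam(const OamEntry oam[64], int scanline, int sprite_height,
                 int out_indices[8])
{
    int count = 0;
    for (int i = 0; i < 64; i++) {
        int dy = scanline - oam[i].y;
        if (dy >= 0 && dy < sprite_height) {
            if (count < 8)
                out_indices[count] = i;
            count++;            /* counting past 8 models sprite overflow */
        }
    }
    return count;               /* a result > 8 would set the overflow flag */
}
```

A real PPU does this with dedicated range-comparison hardware between scanlines; the loop above is just the software equivalent.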

Accurate & efficient PPU emulation


For the most part, PPU operations are linear and sequential, making them easy to design algorithms for. By rendering playfields and objects on a per-tile basis, emulation can be made quick & easy. However, games that use special effects (mid-frame PPU trickery) require special handling, which in turn complicates algorithm design.

By implementing a clock cycle counter in the CPU core, it is possible for the emulated PPU hardware to know exactly when a read/write is occurring to a PPU-related register (or otherwise, a register which is going to change the rendering of graphics from that point on). Therefore, when a write to a PPU register occurs, the PPU engine can determine whether the write is going to change the way the picture is rendered, and at exactly which clock cycle (which really translates into a screen position) the change takes effect.

For example, say the CPU engine is executing instructions. Then, on clock cycle 13000 (relative to the last VINT), a write to the PPU’s scroll registers is made (which causes a split-screen effect). Now, first the PPU translates 13000 CC’s into X/Y coordinates (in this case, on-screen scanline 93, roughly pixel #126 (the equations to do these calculations will be revealed later)). Ideally, all pixels before this point will now be rendered to a buffer, using the data in the PPU registers prior to the write. Now the screen area before the write occurred has been rendered accurately, and the screen will progressively continue to be updated in this fashion as more mid-frame writes occur. If no more occur, when the CPU arrives at the number of clock cycles per frame, the rest of the image (if any) can be rendered.
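This “catch-up” scheme can be sketched as follows. The structure layout and names (`Ppu`, `render_up_to`, `ppu_register_write`) are assumptions for illustration; the point is only that the old register state is rendered out *before* the write is applied:

```c
/* Minimal catch-up sketch: the PPU tracks the last pixel already drawn,
   and any mid-frame register write first renders everything up to the
   current CPU cycle using the pre-write register state. */
typedef struct {
    long last_rendered_ppucc;   /* last PPU cycle already drawn this frame */
    /* ...latched scroll/control/pattern-table state would live here... */
} Ppu;

void render_up_to(Ppu *ppu, long ppucc) {
    /* draw pixels in [ppu->last_rendered_ppucc, ppucc) with current state */
    if (ppucc > ppu->last_rendered_ppucc)
        ppu->last_rendered_ppucc = ppucc;
}

void ppu_register_write(Ppu *ppu, long cpucc, int reg, unsigned char value) {
    long ppucc = cpucc * 3;     /* the PPU renders 3 pixels per CPU cycle */
    render_up_to(ppu, ppucc);   /* old state applies to everything before */
    /* ...now apply `value` to register `reg`... */
    (void)reg; (void)value;
}
```

With this in place, a frame with no mid-frame writes costs one big render at frame end, while trick-heavy games degrade gracefully into several partial renders.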

Knowing when to update the screen


The following list describes PPU status registers/bits that, if a game modifies them mid-frame, would change the way the rest of the frame is rendered.

O = update objects, P = update playfield.

O object enable bit
O left column objects clipping
O 8/16 scanline objects
O active object pattern table
O pattern table bankswitch (which affects active object pattern table)

PO color emphasis bits
PO black & white/color select

P playfield enable bit
P left column playfield clipping
P scroll registers
P X/Y name table selection
P name table bankswitch (hypothetical)
P active playfield pattern table
P pattern table bankswitch (which affects active playfield pattern table)

Note that any PPU mapped memory (which means name, pattern, attribute & palette tables) can only be changed while objects & the playfield are disabled (unless cartridge hardware provides a way to do this through the CPU memory map). Since the screen is blanked to black during this time (regardless of the current transparency color the palette is programmed with), these writes do not affect how the screen is rendered, and subsequently, updating the screen can be postponed.

Collision flag


Games without hardware for scanline counting often poll this bit to find out when to make a write out to a PPU register which will result in a split screen, or a pattern table swap/bankswitch. The collision flag is set when the first non-transparent pixel of object 0 collides with a playfield pixel that is also non-transparent. Since the screen position of the first colliding pixel can be determined at any time (and therefore, the exact CPU clock cycle at which the collision is expected to occur), when a game requests the status of this flag for the first time, a routine in the PPU engine can calculate at which clock cycle this flag will be set (calculations will be shown later). Subsequent requests for the collision flag’s status after this would then only require the engine to compare the current CPU clock cycle to the calculated collision clock cycle. Whenever a mid-frame change occurs (whether it affects the playfield, or objects), the clock cycle at which the collision flag will go off will have to be recalculated (unless it has already gone off).
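The lazy compute-once-then-compare scheme might look like this. The names are illustrative, and the expensive scan of object 0 against the playfield is passed in as a precomputed value so the caching logic stands alone:

```c
/* Lazy sprite-0-hit prediction: the collision cycle is computed at most
   once per invalidation; later $2002 polls are just a compare. */
typedef struct {
    long collision_cpucc;   /* cached CPU cycle of the predicted hit */
    int  valid;             /* cleared on any relevant mid-frame change */
} HitPredictor;

/* call on any mid-frame change affecting the playfield or objects */
void invalidate(HitPredictor *p) { p->valid = 0; }

/* answer a collision-flag poll; `recalc_result` stands in for the
   (expensive) object-0-vs-playfield scan, performed only when needed */
int sprite0_hit(HitPredictor *p, long current_cpucc, long recalc_result) {
    if (!p->valid) {
        p->collision_cpucc = recalc_result;   /* normally computed here */
        p->valid = 1;
    }
    return current_cpucc >= p->collision_cpucc;
}
```

In a real emulator the recalculation would scan object 0’s non-transparent pixels against the playfield and convert the first overlap’s X/Y into a CPU cycle with the equations given later in this document.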

MMC3 IRQ timer


The MMC3’s IRQ timer relies on the toggling of the PPU’s A13 line 42 times a scanline. Basically, its counting operation proceeds at a more or less constant (meaning predictable) rate. However, when the PPU bus is disabled (via disabling the playfield & objects, or during the V-blank period), the counter must quit counting. Manual toggling of PPU address bits during this time will have to be intercepted, and the IRQ timer advanced appropriately.

CPUCC to X/Y coordinate equations


The PPU renders 3 pixels in one CPU clock. Therefore, by multiplying the CPU CC figure by 3, we get the total amount of pixels that have been rendered (including non-displayed ones) since the VINT.

341 pixels are rendered per scanline (although only 256 are displayed). Therefore, by dividing PPUCC by this, we get the number of completely rendered scanlines since the VINT.

21 blank scanlines are rendered before the first visible one is displayed. So, to get a scanline offset into the actual on-screen image, we simply subtract the number of non-displayed scanlines. Note that if this yields a negative number, the PPU is still in the V-blank period.

PPUCC = CPUCC x 3
Scanline (Y coordinate) = PPUCC div 341 - 21
PixelOfs (X coordinate) = PPUCC mod 341
CPUcollisionCC = ((Y + 21) x 341 + X) / 3

Note that if the PixelOfs equation yields a number higher than 255, the PPU is in the H-blank period.
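Transcribed directly into C (integer division and modulus match the `div`/`mod` above), the equations reproduce the earlier worked example, where CPU cycle 13000 lands on scanline 93, pixel 126:

```c
/* Direct transcription of the CPUCC <-> X/Y equations above. */
long cpu_to_ppucc(long cpucc)      { return cpucc * 3; }
int  ppucc_to_scanline(long ppucc) { return (int)(ppucc / 341) - 21; }
int  ppucc_to_pixel(long ppucc)    { return (int)(ppucc % 341); }
long collision_cpucc(int x, int y) { return ((long)(y + 21) * 341 + x) / 3; }
```

Negative scanline values mean the PPU is still in V-blank, and pixel offsets above 255 mean H-blank, exactly as noted in the text.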

Note on emulating Tengen’s Ms. Pac-Man game


For emulators with poor 6502 cycle count provisions, there is a small problem that may arise when trying to run this game. During initialization, this game will loop, waiting for $2002’s VBL flag to set. When an NMI occurs, the NMI routine reads $2002 and throws away the value. Even though the NMI routine saves the A register from the main loop (where $2002 was loaded), the only way the PC will exit this loop is if $2002 returns the VBL flag set just before the NMI is executed. Since the NMI is invoked pending the completion of the current instruction, and the VBL flag IS the NMI flag, the VBL flag must get set in the middle of the LDA instruction. Since there are 2 instructions in the main loop, there’s about a 50% chance of the read value from $2002 being pushed on the stack with the VBL bit set. A work-around for emulators that can’t handle this mid-instruction taboo is to set the VBL bit slightly before the NMI routine is invoked.

Other notes



Pixel rendering techniques


3 rendering techniques are described in this section. They are all real-time techniques. An unreleased version of this document discussed a tile cache-based rendering solution. However, tile caching quickly loses its effectiveness for those games that use mid-frame (or even mid-scanline) trickery to change character sets, or even palette values. Additionally, with powerful seventh-generation Pentium processor-based PCs being the slowest computers out there these days, there’s really no need to use bitmap caching algorithms to emulate NES graphics anymore, as was necessary in the days of 486-based PCs in order to achieve full NES framerate emulation.

Basic


This method, which is the most straightforward, is to store the PPU’s 52-color matrix as constant data in the VGA palette registers (or otherwise, other palette registers used for an 8-bit per pixel graphics mode). Before a pixel can be drawn, pixel color is calculated (via pattern table & palette select data). The PPU palette registers are looked up in some way or another, and the contents of the palette register element is written to a virtual frame buffer as the pixel data. This technique is the easiest to implement, and provides the most accurate PPU emulation. However, since every pixel drawn requires an independent palette look-up, this method is naturally very slow.

One way to speed up this rendering style is to create a palette table designed for looking up 2 or more pixels simultaneously. The advantages are clear: you could easily shave a lot of time (close to half with a 2-pixel simultaneous color lookup) off playfield rendering. The disadvantages are that the lookup table grows from 2^2x1=4 bytes for a single-pixel lookup, to 2^4x2=32 bytes for a 2-pixel lookup, to 2^8x4=1024 bytes for a 4-pixel lookup. Each of the palette’s 4 colors is also mirrored across these tables, and this has to be maintained. Since I’ve never tried this optimization technique, I can’t tell you how effective it is (or when it stops being effective).
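A 2-pixel version of that table might be built like this. The sketch assumes packed 2-bit pixels with the leftmost pixel in the most significant bits; the 16-entry table (2^4 entries x 2 bytes, as counted above) must be rebuilt whenever the 4-color palette changes:

```c
#include <stdint.h>

/* 16-entry table mapping every pair of 2-bit pixels to two final
   palette bytes at once. Rebuild on any palette change (this is the
   "mirrored colors must be maintained" cost mentioned above). */
static uint8_t pair_lut[16][2];

void build_pair_lut(const uint8_t palette[4]) {
    for (int a = 0; a < 4; a++)
        for (int b = 0; b < 4; b++) {
            pair_lut[a << 2 | b][0] = palette[a];
            pair_lut[a << 2 | b][1] = palette[b];
        }
}

/* Convert one packed byte (4 pixels) into 4 palette bytes with only
   2 table lookups instead of 4. */
void render4(uint8_t packed, uint8_t out[4]) {
    const uint8_t *hi = pair_lut[packed >> 4];
    const uint8_t *lo = pair_lut[packed & 0x0F];
    out[0] = hi[0]; out[1] = hi[1];
    out[2] = lo[0]; out[3] = lo[1];
}
```

The halving of lookup count is where the “close to half” saving comes from; the 4-pixel variant trades a 1K table for a single lookup per byte.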

Another way to increase the speed of this approach is to change the bit ordering of the pattern tables stored in memory to favor this rendering algorithm. For example, store the bitmap for any scanline of a tile as 8 packed 2-bit pixels, instead of the default planar method of 2 8-bit bitplanes. By doing this, the tile rendering routine can easily extract the 2-bit number for indexing the 4-color palette associated with the particular tile. Of course, by changing the pattern tables, whenever pattern table memory is read or written, the format of the data will have to be converted. Since this happens much less often (even in games that use CHR-RAM), it’s a good idea.
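The planar-to-packed conversion done on pattern table writes could look like the following (one scanline at a time; the choice of putting the leftmost pixel in the 2 MSBs is an assumption, any consistent ordering works):

```c
#include <stdint.h>

/* Convert one tile scanline from the NES's native 2-plane format
   (plane0 = low bits, plane1 = high bits of 8 pixels) into 8 packed
   2-bit pixels in a 16-bit word, leftmost pixel in the 2 MSBs. */
uint16_t planar_to_packed(uint8_t plane0, uint8_t plane1) {
    uint16_t packed = 0;
    for (int x = 0; x < 8; x++) {
        int bit = 7 - x;        /* bit 7 of each plane is the leftmost pixel */
        int px  = ((plane1 >> bit) & 1) << 1 | ((plane0 >> bit) & 1);
        packed  = (uint16_t)(packed << 2 | px);
    }
    return packed;
}
```

This runs once per pattern-memory write (rare, even with CHR-RAM games), while the renderer then extracts each pixel with a shift and mask every frame — which is the trade the paragraph above recommends.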

VGA palette register indexed


This method involves programming the VGA palette registers to reflect the direct values the PPU palette registers contain. The VGA palette would be divided into 64 4-color palettes. When sequential horizontal pixels are to be drawn, a large (32-bit or more) pattern table data fetch can occur (pixels for pattern table tiles should be organized in memory so that the 2 bits for each horizontally sequential pixel are stored in 8-bit increments). This chunk of fetched pixel data can then be masked (so that other pixel data from the chunk is not used), an indexed “VGA palette select” value can be added to it, and the result can then be written out to the virtual frame buffer in one single store operation. The “VGA palette select” value is fetched via the VGA palette select table, which corresponds to the 8 classic PPU palettes (4x2 elements in the table; therefore, a tile’s attribute data (either PF or OBJ) is used as the index into this table). This table indicates which 4-color group of the 64 groups in the VGA palette to use for color selection for the group of pixels being written. The idea is that when a mid-frame palette change occurs (or at any time, for that matter), the affected PPU palette in this table is changed to point to where the new palette modifications will be made in the VGA’s palette. The corresponding VGA palette entries will also have to be updated appropriately (generally, VGA palette updates will be made in a ring-buffer fashion: a pointer which keeps track of the first available 4 palette entries is incremented whenever any entries in a 4-color PPU palette are changed).

Basically, this method offers the fastest possible way to render NES graphics, since data is fetched from pattern table memory and written directly to the virtual frame buffer. The number of pixels processed simultaneously can be as high as 8 (with MMX instructions). However, the number of mid-screen PPU palette modifications possible is limited to 64 (or 32 for the PF and 32 for OBJs, if one of the bits in every pixel needs to be used to distinguish a playfield pixel from an object pixel), but multiple consecutive modifications to a single 4-color PPU palette only count as one actual modification.

MMX instruction-based rendering


In 1995, the x86 architecture became blessed with MMX instructions: a set of single-instruction, multiple-data (SIMD), RISC-like instructions that provide solutions for a large number of modern-day logic puzzles. Nearly all the instructions have a very low 1 or 2 clock cycle latency across all x86-class CPUs which support them, hence these instructions are very desirable to use. The instructions work around an 8-element (64 bits/element) flat register file, which overlaps the legacy x87’s mantissa registers. The BIG deal about the use of MMX instructions for pixel rendering is that 8 pixels can be operated on simultaneously, provided each pixel is no larger than a byte. The following assembly-written routine can fully calculate pixel color for 8 horizontally sequential pixels on every loop iteration (the example actually renders the first 4 scanlines of a tile).

Note: Pentium 4 and Athlon XP/64 processors support 128-bit versions of MMX instructions, so this could allow you to increase performance quite a bit more than what is already offered by the algorithm documented below. Very useful for virtual NES multitasking, when 20 or more NES screens need to be animated & displayed simultaniously in a high-resolution screen mode.

The pattern tables have already been reorganized so that the bitmap data for 4 scanlines of tile data can be loaded into an MMX register, and used in the most efficient way possible. Pixel data for 4 sequential scanlines under the same horizontal coordinate is stored in a single byte, with the 2 MSBs containing the lowest logical scanline coordinate. Sequential bytes, up to the 8th one, contain the pixel data for every successive horizontal position. Finally, the first 8 bytes of a tile’s pattern table data contain the full bitmap data for the first 4 scanlines of the tile, and the next 8 bytes contain the last 4 scanlines.

####################################

;register assignments
;--------------------
;EAX: destination pixel pointer
;EBX: points to the palette to be used for this tile (essentially
;     determined by the attribute table lookup)
;ESI: source pointer for 32-pixel bitmap to be loaded from pattern table
;MM4: (8 - fine horizontal scroll value)*8
;MM5: (    fine horizontal scroll value)*8

;fetch 32 pixels from pattern table, organized as 8 horizontal x 4 vertical.
    movq    mm3,[esi]
    mov     ecx,-4              ;load negative loop count

;move constants related to color calculation directly into registers. These
;have to be stored in memory since MMX instructions don't allow the use of
;immediate data as an operand.
@1: movq    mm0,_C0x8           ;contains C0C0C0C0C0C0C0C0h
    movq    mm1,_00x8           ;contains 0000000000000000h
    movq    mm2,_40x8           ;contains 4040404040404040h

;generate masks depending on the magnitude of the 2 MSBs in each packed
;byte (note that this is a signed comparison).
    pcmpgtb mm0,mm3
    pcmpgtb mm1,mm3
    pcmpgtb mm2,mm3
    psllq   mm3,2               ;shift bitmap to access next scanline of pixels

;to perform color lookup, a precalculated palette table is used & ANDed with
;the resulting masks of the last operation. Since XOR operations are used to
;combine the results, this requires the elements in the palette table to be
;XORed with adjacent values, so that they'll be cancelled out at the end of
;the logic processing here. The required precalculated XOR combination of
;each color element is shown in the comments below by the corresponding
;element. Note that each lookup is 8 bytes wide; this requires the same
;palette data for a single element to be mirrored across all 8 sequential
;bytes.
    pand    mm0,[ebx+00]        ;2^3
    pand    mm1,[ebx+08]        ;3^0
    pand    mm2,[ebx+16]        ;0^1
    pxor    mm0,[ebx+24]        ;1
    pxor    mm1,mm2
    pxor    mm0,mm1

;this logic performs shift functionality, in order to implement fine
;horizontal scrolling. The alternative to this is simply writing 64 bits out
;to unaligned addresses for fine H scroll values other than zero, but since
;this can incur large penalties on modern processors, this is generally the
;preferred way to generate the fine horizontal scroll effect.
    movq    mm1,mm0
    psllq   mm0,mm4
    psrlq   mm1,mm5
    por     mm0,[eax]
    movq    [eax+8],mm1
    movq    [eax ],mm0

;loop maintenance
    add     eax,LineLen         ;advance pixel pointer to next scanline position
    inc     ecx
    jnz     @1

###################################

To use the renderer, point EAX to the beginning of your render buffer (due to how the fine horizontal scrolling works, tiles must be rendered next to each other, incrementing along the horizontal tile axis). Without some ugly extra logic, the render buffer will have to be increased in size by 8 pixels per scanline, to accommodate the extra tile pattern fetch required whenever the fine horizontal scroll value is not equal to zero. Once the routine has been executed enough times to fill your render buffer, consider the starting horizontal coordinates of the rendered playfield to be offset by 8 pixels, due to a required “spilloff area” for when the first tile pattern for that line needs to be shifted off the screen.

Branch prediction


Pentium MMX and later processors have improved branch prediction hardware over the original Pentium, and consequently can correctly detect a branch condition pattern, so long as the condition does not stay the same for more than 4 times in a row. The new system is based on keeping track of the last 4 known conditions for any branch that may be allocated in the BTB. Those 4 bits are used to index a 16-element table to fetch 2 bits that indicate the predicted branch condition (strongly taken, taken, not taken, strongly not taken), which is then written back after using saturated addition to increment or decrement the value, based on the actual branch condition that came from the program.
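At the heart of the scheme described above is a 2-bit saturating counter per table entry. A minimal sketch of the update rule (function names are illustrative):

```c
/* Two-bit saturating predictor state: 0-1 predict not-taken,
   2-3 predict taken; the counter saturates at both ends. */
int update(int counter, int taken) {
    if (taken) {
        if (counter < 3) counter++;     /* saturate at "strongly taken" */
    } else {
        if (counter > 0) counter--;     /* saturate at "strongly not taken" */
    }
    return counter;
}

/* predicted direction for a given counter state */
int predict_taken(int counter) { return counter >= 2; }
```

In the two-level design described in the text, 16 such counters are indexed by the branch’s last 4 outcomes, which is what lets repeating patterns (taken, not-taken, taken, …) be predicted correctly.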


Merging playfield & object pixels


The most efficient way to effectively combine playfield & object data into your final rendered frame is to render your playfield first (or a section of it, in the case of a split screen) directly to the image buffer itself. At this point, to effectively merge object pixels with the playfield’s, each pixel in your image buffer must have an extra 2 bits associated with it, one of which will represent the transparency status for a playfield pixel, and the other the same, except for object pixels (when drawn later).

Naturally, after rendering the playfield, the image buffer won’t have any pixels with the transparency status for object pixels marked as false. But now, as objects are rendered, whether a pixel is actually drawn depends on these two transparency status bits, the object’s own transparency status, and its priority. Starting in order from object 0 (highest priority) up to 63, object bitmaps are “merged” with the playfield, in the fashion that the following few lines of pseudo-code show:

IF (SrcOBJpixel.xpCond = FALSE) THEN
    IF ((DestPixel.OBJxpCond = TRUE) AND
        ((DestPixel.PFxpCond = TRUE) OR (SrcOBJpixel.Pri = foreground))) THEN
        DestPixel.data := SrcOBJpixel.data
    FI
    DestPixel.OBJxpCond := FALSE
FI

So, as you can see, the destination’s OBJxpCond is marked as false, even if the object’s pixel is not meant to be drawn. This is to prevent the pixels of lower priority (numerically higher-numbered) objects from being drawn in those locations.
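The same merge rule in C might read as follows (field names are renamed for readability: `pf_transparent` is the pseudo-code’s PFxpCond, and `obj_free` is OBJxpCond, i.e. “no higher-priority object has claimed this pixel yet”):

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint8_t data;        /* final color value */
    bool pf_transparent; /* playfield pixel was transparent (PFxpCond) */
    bool obj_free;       /* no higher-priority object here yet (OBJxpCond) */
} DestPixel;

/* Merge one object pixel into the buffer, objects iterated 0 -> 63. */
void merge_obj_pixel(DestPixel *dest, uint8_t src_data,
                     bool src_transparent, bool src_foreground)
{
    if (!src_transparent) {
        if (dest->obj_free && (dest->pf_transparent || src_foreground))
            dest->data = src_data;
        dest->obj_free = false; /* blocks lower-priority objects even when
                                   this pixel was NOT drawn (priority clash) */
    }
}
```

Note how `obj_free` is cleared unconditionally for any non-transparent source pixel; that one line is what reproduces the priority-clash quirk discussed below.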

This may raise the question, “Why do you render objects in the order of 0->63 (effectively requiring 2 bits for transparency status), when you can render them in the opposite direction (which only requires 1 bit for transparency status)?” The answer is because of what happens on a priority clash (see the “PPU pixel priority quirk” section of the “2C02 technical reference” document). Rendering objects in order of 0->63 is the only way to emulate this PPU feature properly (and some games DO depend on the functionality of this, as it provides a way to force the playfield to hide foreground priority object pixels). Otherwise (for 63->0), it would be necessary to merge objects to an image buffer filled with the current transparency color, and then merge playfield data with the buffer as well. Granted, this technique will only require 1 transparency (background priority) status bit per pixel, but since merge operations are slow, and this technique requires way more of them, this technique is inferior to the aforementioned one.

Other tips



Frame store optimizations


One of the simplest approaches to render emulation is to draw the entire playfield to the video buffer, and then place applicable object data in the buffer afterwards (this makes object/playfield pixel decisions easier to calculate since the playfield pixels are already rendered). This straightforward approach would also be the most efficient way to deal with rendering NES graphics, were there not certain caveats of using the video buffer.

As the following sections show, it’s just not a good idea to render graphics directly to the video buffer (although I don’t think anyone would do this, anyway). Old versions of this document discussed using a virtual frame buffer, which was basically a buffer allocated in regular memory used to render graphics to (instead of directly to the video buffer). When the virtual buffer was full, it would then be copied to the video buffer in a large, sequential operation (just the way the video card likes it!). However, this method is actually quite inefficient, as the next paragraphs explain.

Brief info on x86 on-chip caches


If you know how virtual memory works, a CPU cache is basically a hardware version of the same idea, except that it caches main system RAM rather than a disk drive. A CPU caches data on a (so-called) line-by-line basis. Each line is anywhere from 16 (486) to 32 (Pentium) to 64 (Athlon) bytes in size, and will most likely grow larger in future CPUs. So, if only a single byte needs to be read from memory, an entire line is actually loaded into the cache (this is why data alignment, and grouping related data fields together in records, is important to guarantee the effectiveness of a CPU’s cache). This action also pushes another line out of the cache (ideally the least-recently used one), and if it’s dirty (modified), it will be written back to main memory.

486s started the on-chip x86 CPU cache trend, with a whole 8K bytes shared between both data and code. Intel 486DX4 models had 16K bytes. Pentiums had separate 8K byte caches, each for data & code. 6th-generation x86 processors again doubled the on-chip cache size (although they maintained the separate code/data cache architecture started by the Pentium). The point is, the size of the (level-1) cache is basically the size of memory that the CPU can randomly access in the smallest amount of time possible. For even a 486, this means up to 8K bytes of cachable data structures, which can actually be quite a bit of memory, if the software is written carefully.

x86 CPUs with on-chip level-2 caches (introduced with the second-generation Intel Celeron core) effectively expand the amount of cachable data the CPU holds, while sometimes even hiding access latencies, by speculatively loading level-2 cached data structures into the level-1 cache when the caching algorithm thinks the data is going to be used very soon by the software algorithm. A good example of this would be a routine which performs sequential operations on a large array of memory.

The trick to effective use of the cache is all in how software is written. The best thing to do is to write software algorithms which work with an amount of temporary memory smaller than the size of the CPU’s level-1 cache. Even computational algorithms which appear to require a large amount of memory can sometimes be broken down into sub-algorithms, in order to reduce the required amount of temporary memory. While taking this approach does incur a little load/store overhead, it’s more important that your data stay in the cache any way it can. These guidelines will pretty much guarantee that your software will perform in the most efficient way on any CPU with an internal cache.

The virtual frame buffer caveat


Let’s consider the virtual frame buffer (VFB) model. We start rendering our playfield. Name tables and pattern tables are accessed, and that’s fine (the name tables are easily cached, and even some pattern table data gets cached). Then, we store our rendered pixel data to our buffer. The pixel data is stored to the VFB using a data size of 4 bytes (or otherwise, the biggest size that the processor will allow the programmer to store with). However, the CPU’s cache line size is always bigger than this, and therefore the CPU performs a merge operation with the written data and the cache line of the data being written to.

Now here’s the first problem: the target of the store operation to the VFB is unlikely to be in the cache. This means that the CPU ends up actually reading main memory after your first 4-byte pixel store. Of course, now you can write to this line for free, but main memory access is slow, and considering what we’re doing here (which is exclusively store operations), it’s kind of ridiculous that the programmer has no way of telling the processor that the merge operation (and moreover the reading of main memory) is unnecessary, since we plan on overwriting all the original data in that particular line (extensions to the MMX instructions introduced with the Pentium 2 and later processors offer reasonable ways of dealing with non-temporal storage).

Anyway, you get the idea: after every few stores to the VFB occur, a new line from the VFB will be read in from main memory (or the level-2 cache, if it’s in there). But guess what? This isn’t even the worst part of it. As you keep filling the VFB, your CPU’s cache overflows, since your CPU’s L1 cache is smaller than the VFB you’re working on. This means that not only will your VFB rendering eventually push out any lines from the cache which aren’t used directly by the render routine (costing cycles for even local routines that may need them immediately after the render), but after the render, when you go to copy the VFB to video memory, the entire buffer has to be loaded back into the CPU’s cache.

Of course, the size of the CPU cache is everything here. Due to the sequential access patterns of the virtual frame buffer rendering model, this algorithm may actually perform modestly on CPUs with large on-chip level-2 caches (due to the speculative loading of data from the level-2 to the level-1 cache). However, I can’t say I know what the performance penalties may be for running this algorithm on CPUs with external level-2 caches. So in general, I would recommend against using the virtual frame buffer model on any CPU without an on-chip level-2 cache of at least 128KB.

Scanline stores


By reducing the size of the VFB from full size down to a few scanlines (or even just one), most or all of the caveats of what has been mentioned can be avoided. Since typically a VFB scanline is 256 bytes (in the example for the NES’s PPU), this makes the memory requirement small enough to ensure good performance even on a 486.

Of course, this creates a new problem for writing the PPU render engine: tiles can no longer be rendered completely (unless you’re using an 8-scanline VFB, but the rest of this topic assumes you’re using only a single-scanline VFB). Some overhead caused by only rendering a single scanline of a tile at a time can be avoided by pre-calculating pointer work for each sequential tile, and storing it in an array, so that calculations can be reused for the tile’s other scanlines. A similar technique can be used for object pointer calculations as well.

Perhaps a possible performance boost obtainable through a careful scanline rendering engine design is that storing rendered playfield pixels directly to the video buffer may be permitted, since all pixels on the playfield’s scanline can be rendered sequentially, and thus can be stored out that way. However, there are conditions that determine the effectiveness of this.

First, dealing with object pixels which overlap areas of any playfield scanline will be very difficult (without the use of at least a scanline buffer), since the playfield tile rendering is usually performed sequentially, while object tiles generally need to be renderable at random horizontal coordinates on the same scanline (in order to emulate object priorities properly).

The second condition depends on alignment. If the PPU’s fine horizontal scroll offset is evenly divisible by the size being used to store pixel data to the frame buffer, then alignment isn’t a problem. However, in the case that it’s not (and this will occur often, since almost all NES games use smooth horizontal scrolling), then a method of shifting and merging pixels in the CPU registers should be used to effectively perform the smooth horizontal scrolling, in order to avoid a misaligned data store, and the unforgivable penalty which is associated with performing this action directly to the frame buffer.
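A scalar C sketch of that shift-and-merge step (mirroring, in spirit, the split stores in the MMX routine earlier) might look like this. It assumes 8 packed 8-bit pixels per 64-bit word, with the leftmost pixel in the lowest-addressed byte on a little-endian CPU; names are illustrative:

```c
#include <stdint.h>

/* Store one tile row of 8 packed byte-pixels at fine horizontal offset
   fine_x (0..7) using two aligned stores instead of one misaligned one.
   buf[0] is the word the row starts in, buf[1] the spill-over word. */
void store_tile_row(uint64_t pixels, int fine_x, uint64_t buf[/*2*/]) {
    if (fine_x == 0) {          /* aligned case: one plain store */
        buf[0] = pixels;
        return;
    }
    /* low-address bytes = leftmost pixels on a little-endian machine */
    buf[0] |= pixels << (fine_x * 8);        /* tail of the previous word */
    buf[1]  = pixels >> ((8 - fine_x) * 8);  /* spill into the next word */
}
```

The OR into `buf[0]` is why adjacent tiles must be rendered left to right: each tile’s leading pixels merge with the trailing pixels the previous tile left behind.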

Overcoming letterboxed displays


Since the NES doesn’t use any complex functions to generate its graphics (such as tilt, shift, twist, swivel, rotate, or scale), anti-aliasing has never been important for pixel-perfect emulation of NES graphics. However, due to the strange nature of VGA resolutions, to avoid ending up with a letterboxed NES game screen display (one where there are large black borders of unused screen area on the sides), you will either need to scale the emulated graphics yourself, or find a way to get the video adapter hardware to do it.

For scaling graphics intended to be displayed on a computer monitor, anti-aliasing is very important: it keeps artifacts (i.e., distorted or asymmetric pixels) as indistinguishable to gamers as possible, even at modest screen resolutions. A ratio of 5 destination pixels to 2 source pixels can be used to stretch 256 source pixels to 640 destination ones (a very common VGA horizontal resolution). To calculate the color of the middle pixel of each group of 5, the two source color values are averaged. Note that this requires pixels to be pure RGB values (as opposed to palette index values). Other VGA resolutions, such as 512x384, may also prove useful.
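The 5:2 stretch with an averaged middle pixel might look like this (a sketch assuming RGB tuples, as noted above; the function name is made up):

```python
def stretch_2_to_5(src):
    """Stretch a row of RGB pixels by a 5:2 ratio (e.g. 256 -> 640),
    averaging the middle pixel of every 5-pixel destination group."""
    dst = []
    for i in range(0, len(src), 2):
        a, b = src[i], src[i + 1]
        # middle pixel is the channel-wise average of the two sources
        mid = tuple((ca + cb) // 2 for ca, cb in zip(a, b))
        dst.extend([a, a, mid, b, b])
    return dst
```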


Smooth audio reproduction


This chapter describes ways to improve NES sound emulation.

overview


Very few NES emulators emulate sound channel operations at the precision the NES does, and as a result, the high-frequency rectangle and noise waves that many NES games produce on a frequent basis end up sounding like there are artifacts in the audio (i.e., two or more apparent frequencies present, even though only one frequency is supposed to be heard). Increasing the sample playback frequency can fix this problem, but in the end, sampling frequencies on the sound cards found in PCs and such can only go so high.

why are there artifacts in the high frequencies?


The NES’s sound generators each have an audio output update rate/resolution of 1.79 million samples per second (approx). Compared to the average Sound Blaster playback rate (44100 Hz), this means that the NES’s sound channels have 3125/77, or 40 and 45/77ths, times the sample resolution. So, when just one calculated PCM sample has to represent 40.6 samples from the NES’s sound channels (in the same timeframe), it’s no wonder the audio sounds so terrible at high frequencies: approximately 39.6 source audio samples have been skipped over, and assumed to be all equal to the single sample.

solutions


Sound Blasters have hardware in place to overcome this transparently to the user, whenever digital capture of an audio signal is desired. The proof is in sampling NES music at 44100 Hz, 16 bits/sample: there is no distinguishable difference between the real-time generated analog audio from the NES and the digitally captured sample track. They’re either using primitive RC integrator circuits on the inputs of their ADCs to approximate a time-accumulated average voltage between ADC samples, or they are sampling the signal many times faster than the output PCM sample rate (some 2^n multiple), and using digital averaging hardware to produce each “downsampled” PCM result. Here’s more, courtesy of an NESdev veteran:

“What I’m suggesting is that you do the above at a high sampling rate, some power-of-2 multiple of the output rate, for example, 4x44100 = 176400 samples per second. You would add every four samples together, and divide by four (downsample), and that would be your output sample.

Suppose your wave amplitude is 1. Here are some examples of generating a single output sample:

EXAMPLE 1
Oversample Results: 1, 1, 1, 1
Downsampled Output: (1 + 1 + 1 + 1) / 4 = 4 / 4 = 1

EXAMPLE 2
Oversample Results: 1, 1, -1, -1
Downsampled Output: (1 + 1 + -1 + -1) / 4 = 0 / 4 = 0

EXAMPLE 3
Oversample Results: 1, -1, 1, 1
Downsampled Output: (1 + -1 + 1 + 1) / 4 = 3 / 4 = 0.75

So your output samples will not always be a simple 1 or -1. You’re really raising the sampling rate, and then converting the results back to the output sampling rate.”
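The quoted scheme can be sketched as follows, assuming an ideal +1/-1 rectangle wave and an integer oversampling factor (the function names and rates are illustrative):

```python
def square_sample(t, period):
    """Ideal rectangle wave: +1 for the first half of each period, -1 after."""
    return 1.0 if (t % period) < period / 2 else -1.0

def downsample(oversample_rate, output_rate, period, n_samples):
    """Render `n_samples` output samples by averaging groups of `factor`
    oversampled points (oversample_rate must be a multiple of output_rate)."""
    factor = oversample_rate // output_rate
    out = []
    for n in range(n_samples):
        acc = 0.0
        for k in range(factor):
            t = (n * factor + k) / oversample_rate
            acc += square_sample(t, period)
        out.append(acc / factor)
    return out
```

With a 4x factor (e.g. 176400 Hz oversampled down to 44100 Hz), each output sample is the average of four points, exactly as in the examples above.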

simple rectangle channel implementation


Simple sound channels like the rectangle wave can be designed to approximate the accurate output of the channel without having to resort to any downsampling techniques.
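One way to do this (a sketch of the general idea, not any specific emulator's implementation) is to compute the exact time-weighted average of the rectangle wave over each output sample period analytically, since the wave's shape is known in closed form:

```python
def rect_sample_exact(t0, t1, period, duty=0.5):
    """Average value of a +1/-1 rectangle wave over [t0, t1), computed
    analytically from the time spent high vs. low (no oversampling)."""
    def integral(t):
        # integral of the wave from time 0 to t
        cycles, frac = divmod(t, period)
        high = duty * period
        time_high = cycles * high + min(frac, high)
        time_low = t - time_high
        return time_high - time_low
    return (integral(t1) - integral(t0)) / (t1 - t0)
```

This gives the same result as averaging an infinitely oversampled wave, at a constant cost per output sample.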

other notes



6502 instruction decoding & execution techniques


Other tips



Emulation address decoding


Emulation address decoding takes a formed 6502 address, plus the 6502’s read/write status, and runs it through (most of the time) static logic to determine the access method, and the emulator-equivalent memory address that this 6502 address corresponds to, in order to emulate access to that memory location properly. With this approach, decoded addresses in your emulator can be treated either as a direct pointer to the data, or as a pointer to a subroutine that the CPU core calls when additional code is necessary to accurately emulate the events of that 6502 clock cycle. The best-known technique for doing this is discussed as follows.

Using 1:1 address-decode look-up tables for both the read & write 6502 memory maps is the fastest and most accurate way to determine where an NES memory area is, and what address it maps to. Generally, a byte should be used as a single element in the memory maps to represent the type of memory area (up to 256 types for each table), and you’ll have 128KB of them, since the 6502’s R/W line is also used during address calculations. Even though this technique seems to waste a lot of memory, the decode tables are most commonly accessed in parallel with the memory areas containing NES ROM and RAM structures, which means the data cached in the emu’s host CPU (due to simulated 6502 memory bus transfers) will usually never require more than twice the normal amount. This is a small price to pay to ensure that adapting your 6502 core engine to any foreign NES/FC architecture/technology is as easy as adding a few new memory-area type handlers to your emulator’s core, and then building a new address decoder table.
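A minimal sketch of such tables, with made-up area-type codes and a simplified NES memory map (a real emulator would add an entry per mapper register range):

```python
# One byte of "area type" per 6502 address, with separate read and write
# maps (2 x 64KB = 128KB total, matching the R/W line in the text).
AREA_OPEN_BUS, AREA_RAM, AREA_PPU_REG, AREA_PRG_ROM = 0, 1, 2, 3

def build_decode_tables():
    read_map = bytearray([AREA_OPEN_BUS] * 0x10000)
    write_map = bytearray([AREA_OPEN_BUS] * 0x10000)
    for addr in range(0x0000, 0x2000):      # 2KB internal RAM + mirrors
        read_map[addr] = write_map[addr] = AREA_RAM
    for addr in range(0x2000, 0x4000):      # PPU registers + mirrors
        read_map[addr] = write_map[addr] = AREA_PPU_REG
    for addr in range(0x8000, 0x10000):     # PRG ROM: readable only
        read_map[addr] = AREA_PRG_ROM
    return read_map, write_map

def decode(read_map, write_map, addr, is_write):
    """One table lookup per access; the core dispatches on the result."""
    return (write_map if is_write else read_map)[addr & 0xFFFF]
```

The core then switches on the returned byte: direct memory access for RAM/ROM types, a handler call for register types.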


Hardware port queueing


Hardware port queueing allows the CPU to write out (and sometimes even read in) data targeted at a hardware port in the virtual 6502’s memory map, without having to break CPU emulation to call a hardware port emulation routine/handler. This is made possible by buffering writes out to the particular port, so that a hardware emulation routine may process the data later on (i.e., out of order from the one the CPU core issues it in).

pros


  1. your NES emulator core engines can now be designed to operate in one big loop, without having to worry about intervention from other hardware devices during the same virtual NES emulation time, unless it’s absolutely necessary. This means that say, the PPU engine can render a complete frame at any instant (as opposed to having to depend on data sent to the PPU engine in real-time via the CPU core), thanks to hardware port queueing.
  2. no matter how the NES’s 6502 code abuses the PPU, APU, MMC, etc. hardware, your core engines for all these devices can now be designed to use a nearly constant number of clock cycles on the physical processor running your emulator’s software, thanks to the simple loop design of the emulator’s core devices, in combination with branchless code solutions to if/else constructs and the like.

cons


overview


The hardware port queueing concept is only beneficial for those hardware devices that do not interact with (i.e., change or affect the operation of) the CPU core, outside of readable ports like $2002. So, for example, you wouldn’t want to buffer writes to the cart mapper hardware if the write affects a PRG bank (since the write is supposed to affect CPU emulation immediately), but the opposite is true for CHR-bank changes. This is essentially the criterion to base your decisions on when deciding which hardware ports should be queued.

Hardware devices that generate interrupts on the CPU are a little easier to deal with, since interrupt sources almost always come from some sort of ongoing counter in the NES (the MMC3’s scanline counter is a slight exception, since it relies on the clocking of A13 on the virtual PPU). Execution of the events that are to occur on the terminal-count clock cycle can be queued to the CPU by having the hardware emulation routine that needs it create an instance of a virtual cycle counter.

implementation


The “port queueing” idea really revolves around assigning back & forward pointers to all hardware-related (PPU, in this example) memory addresses that can be modified by the CPU. These pointers then link into a 1+2 way list that represents the queued data for that memory address. This means a pair of pointers for:

(* only physical addresses need to be considered here, since any bankswitches will be queued.)

When the CPU core decodes a write to a port like $4014, it examines that port’s status as a queued port, along with the pointer to the last allocated link in that port’s list of queued writes. If queueing is enabled for the port, the CPU uses the pointer info, along with memory allocation info and the current cycle count, to insert a new link containing the CPU write data into that list.
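Simplified to a per-port append-only list (rather than the full linked-list structure described here; the class and method names are made up), the queueing idea might look like:

```python
class PortQueue:
    """Buffer CPU writes to hardware ports, tagged with their cycle count,
    so hardware engines can replay them later instead of being called
    in the middle of CPU emulation."""

    def __init__(self):
        self.queues = {}  # port address -> [(cycle, value), ...]

    def cpu_write(self, port, value, cycle):
        """Called by the CPU core: record the write, don't handle it."""
        self.queues.setdefault(port, []).append((cycle, value))

    def drain(self, port):
        """Called by a hardware engine (e.g. at frame end): return this
        port's queued writes in cycle order and clear the queue."""
        return sorted(self.queues.pop(port, []))
```

The cycle tag is what lets the hardware engine know *when* in the frame each write took effect, which is the whole point of the scheme.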

attributes of a list element


A relative clock cycle tag value allows hardware emulation routines reading the value later on to determine when the next related write to this port occurs.

Fwd/back pointers are used in each element in the list for 2-way traversal. This is required, since it is often necessary for the hardware to know the last-known value of any memory it may have access to.

A third, one-way pointer in each element in the list is used to link together all nodes created by the same core engine in your emulator. This makes deallocation of all those links very easy, with list length being a direct function of the number of hardware writes that occurred that frame (so, generally not that many). Note that links with the “last allocated link” field = 0 are not to be deallocated, since these represent links that must be present for the next frame’s calculations.

For writing to ports like $2004 and $2007, which are designed to have data streamed into them, some additional logic is required on the CPU core’s part to calculate the linked-list address (since there’s an additional lookup, and an address increment required). This would normally be done with a hardware port handler, but that approach would be frowned upon, since the whole point of implementing hardware port queueing is to avoid transferring emulation program control out of the CPU core into other modules, unless absolutely necessary.

For handling CPU reads from hardware ports, it’s a simple matter of determining whether or not the port handler has to be called. For example, when $2002 is read, its status often doesn’t change until a certain (independent) clock cycle in the frame is reached. In this case, the port is read for the first time, and the handler invoked. The handler then calculates the next clock cycle at which $2002’s status is expected to change, and creates a virtual cycle counter instance, programmed to execute another $2002-related handler when the cycle count expires. Meanwhile, the handler changes the CPU memory map layout so that subsequent reads from this port simply cause the CPU core to read from a regular memory address, where the last-known port value is stored, thus avoiding unnecessary calls to $2002’s read handler until the virtual counter goes off.

For handling CPU reads from ports like $2004 and $2007, the CPU core simply has to return the last-known value of the element being accessed from the array queues.


Threading NES applications


Lately, x86-based PCs have become so blazingly fast that emulating just one virtual NES on a modern PC would seem to be a waste of processing power. With that said, modern PCs have enough processing power to emulate dozens of virtual NES machines, but there is one big problem with multitasking NES applications: they were never designed to be threaded. Instead, an entire frame’s worth of NES CPU clocks has to be spent on each NES application in order to consider the application’s frame calculations complete, whether or not this is actually true (and if not, a slowdown will occur). The following hints and tips suggest ways to reduce time in virtual 6502 emulation normally lost to spin-wait, poll, or cycle-count loops.


Emulator features to support


This section merely contains some innovative and interesting suggestions for features to support in new NES emulators being developed.


New object-oriented NES file format specification


This section details a new, extremely easy to use standard for digital data storage of NES ROM images and related information, which provides as much object-orientation for the individual files as possible.

What does object-orientation mean?


In this case, I’m using it to describe the ability for the user to access specific information related to any NES/FC game stored digitally in a local file repository (whether it be program ROM data, pattern table ROM data, mapper information, pictures and other digital images such as label art or game manual pages, game save states, battery RAM states, etc.) without having to rely on any custom or proprietary NES/FC software, simply by storing the several components that make up a digital copy of an NES game as individual files of known/established types, and grouping those files in a subdirectory named after the game.

Take, for example, the UNIF standard: this is an excellent example of a monolithic file format structure. UNIF is a file format that forces people to rely on UNIF-conforming tools to access the data chunks inside it, be they bitmaps, JPEGs, program ROM data, etc., when, if these data chunks were just stored as individual files in a directory in the local repository, there would be no need for the UNIF guidelines to access this data.

So basically, the idea here is to use existing file formats to store all information related to a single game, within a private directory on your filesystem amongst others, making up your electronic NES game library. All like file types may have similar extensions, while having different filenames, usually relating to the specific description of what the file represents (i.e., files relating to save state info may have a title that describes the location or game status of the state, or patch files may describe the operation of the patch during emulation, etc.). As other relevant file formats (like *.jpeg, *.gif, *.bmp, etc.) have long been established computer standards, only file formats relating to NES operation are defined here.

*.PRG      a perfect digital copy of the game’s program ROM.
*.CHR      a perfect digital copy of the game’s pattern table ROM, or RAM (for save states).
*.MMC      a text tag identifying the PRG/CHR ROMs’ complete mapper type.
*.INES     a classic 16-byte iNES header, used as an alternative to the *.MMC file.
*.WRAM     2K RAM tied to the 2A03 (6502) bus.
*.VRAM     2K RAM tied to the 2C02 bus.
*.XRAM     extra RAM (other than CHR RAM) used on the game cart.
*.SRAM     any battery-backed RAM used on the game cart.
*.PRGPATCH program ROM patch.
*.CHRPATCH character ROM patch.
*.PRGHACK  text file containing a list of program ROM patches.
*.CHRHACK  text file containing a list of character ROM patches.

This list isn’t complete (as 2A03, 2C02, and MMC memory structures will always be emulator-specific), but it should give you an idea of how to separate files relating to raw dumps of the large internal memory structures used inside the NES, in order to improve the portability of the ROM files, large RAM structures, save state dumps, patches, hacks, and such.

*.PRG and *.CHR: the digital contents of program & character ROMs found on the NES game board. It would be nice to see these files maintain at all times a 2^n count of bytes, except when other PRG/CHR ROMs have to be appended to their respective files, due to the possibility that an NES game may use two or more differently-sized ROMs to make up a larger one (before 1987, this was mostly done to increase a game’s ROM size with more chips, since it seems that ROMs larger than 32KB were just either very expensive, or not available back then). The filename always relates to the name of the game, including if it’s been hacked, country it’s from, or whatever. *.CHR files that are produced for save state purposes when NES game carts use CHR-RAM, use a related save state’s description as a filename.

*.MMC: simply a regular text file containing a tag of the ASCII-encoded board type that NES and Famicom games use. Use the file’s size to determine length of the digital tag. The UNIF format does a pretty good job of outlining the various NES/Famicom cart board types there are; these are the text tags to use for this file. The *.MMC filename indicates the PRG ROM associated with the mapper type specified by the MMC file.

*.INES: a 16-byte file containing the iNES header equivalent of what a *.MMC text file would normally represent. This file only exists because digital storage of NES game ROMs is currently dominated by the dated iNES format. Support is not recommended in new emulators (if you’re not part of the solution, you’re part of the problem, right?).

*.WRAM, *.VRAM, *.XRAM: these all define files which contain mirror images of the RAM chips they represent in the NES being emulated. The filename for all of them relates to the save state description.

*.SRAM: defines the game’s battery-backed RAM area. Filename relates to description of backed-up RAM (game and state specific). Maintaining multiple copies of SRAM is useful for storing more saved game data than just one SRAM file allows (which is usually 3 save files per SRAM, though this is always game specific).

*.PRGPATCH, *.CHRPATCH: these files contain a (little-endian) 32-bit offset, followed by the raw data to be patched into the ROM type indicated by the extension. The filename here always relates to the effects the patch has during emulation. The file size is used to determine the length of the patch (minus 4 bytes to exclude the offset value).
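Applying such a patch is straightforward; a sketch of the format just described (the function name is made up):

```python
import struct

def apply_patch(rom, patch_bytes):
    """Apply a *.PRGPATCH / *.CHRPATCH blob in place: a little-endian
    32-bit offset followed by raw replacement data (file size - 4)."""
    offset = struct.unpack_from("<I", patch_bytes, 0)[0]
    data = patch_bytes[4:]
    rom[offset:offset + len(data)] = data
    return rom
```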

*.PRGHACK, *.CHRHACK: these plain-text files list the patch files to apply to game emulation, when the specific HACK file is chosen for the emulated game. The filename relates to the group of patches you’ve chosen for this file (normally this doesn’t matter, but it’s useful for storing multiple hack profiles (ones that make the game easier, harder, weird, behave like an NSF file, change graphics, etc.)). Use ASCII carriage return and/or linefeed codes (13 and 10) to separate the listed patch files.

notes