In order to determine the required specifications for the hardware module, it was necessary to decide on the formats and electrical interfaces that would be used for the video signals. Possible choices included analogue composite, analogue component, analogue RGB (VGA), DVI, and HDMI.
The primary goal would be to make the module compatible with the equipment I currently use for watching TV. The DTV set top box mentioned previously supports all of the above formats except DVI, and is connected to a data projector with a native resolution of 1024 x 768. The projector supports composite, component, and RGB VGA inputs, and I use the VGA connection for best picture quality. However, most people use either a composite or HDMI connection for off-air video, and an RGB-only module would not be compatible with such a setup.
I considered using a digital connection for the logo canceller, which would eliminate the need for analogue-to-digital and digital-to-analogue conversion within the module. However, these connections use a serial interface for the video data. This would have a data rate eight times faster than a parallel output ADC, which would result in considerable design complications. I would also need an external DAC module to interface with the projector. (Commonly available DVI-VGA converters would be unsuitable, as the DVI connector also carries an analogue signal that is simply passed through by these adaptors.)
Composite video, while having a quite moderate bandwidth, gives a relatively poor picture quality. It would require colour decoding and encoding circuitry in addition to the ADCs and DACs, and also has the additional complication of interlacing to contend with. Therefore, it was decided to concentrate on producing a module with a VGA interface, as a proof of concept. Once this was proven to function correctly, a universal design could be produced if desired.
Having selected the VGA interface, it was then necessary to consider the question of video formats. A full 1920x1080 signal would have a very high dot clock frequency, at around 200MHz. As the whole module would have to operate synchronously at the dot clock frequency, it was desirable to keep this frequency as low as possible. Furthermore, the storage requirements for the sample logos at this resolution would be considerable, at around 8MiB per image for an RGB logo and alpha channel.
My set top box is capable of outputting a few different resolutions, but only one gives the correct image geometry on the projector. Examination of the signal with an oscilloscope showed that this was a standard definition signal, with 625 lines total ('VTotal'), of which 576 are visible. The frame rate is 50Hz, and the horizontal sync frequency is 31.25kHz (this is double the standard PAL/CCIR rate, as the signal has been deinterlaced). For a 4:3 signal with square pixels there will be 768 horizontal visible pixels (576 x 4/3), and measurement of the blanking interval gave around 917 total horizontal pixels ('HTotal'). This corresponds to a dot clock of around 28.6MHz, a much more reasonable figure. (The XFree86 Video Timings HOWTO provides useful background information on video display timing.)
Designing to this figure would be acceptable for my current setup, but it was considered desirable to allow some margin in the maximum frequency, to allow for different video modes. The native resolution of the projector requires a dot clock of around 65MHz at a 60Hz refresh rate, and this appeared to be a reasonable figure to design around. I have a laptop computer running at this resolution connected to the projector's second VGA input, for playing videos. Examination of the video signal from the computer showed that, for 1024x768 visible resolution, there was an HTotal of 1344, and a VTotal of 806.
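As a quick check of the figures above, the dot clock follows directly from the total line length and the line rate. The short Python sketch below reproduces both numbers; the function names are my own labels for illustration and are not part of the original project.

```python
def line_rate_hz(v_total, frame_rate_hz):
    """Horizontal sync frequency: total lines per frame x frames per second."""
    return v_total * frame_rate_hz


def dot_clock_hz(h_total, h_sync_hz):
    """Dot clock: total pixels per line (including blanking) x lines per second."""
    return h_total * h_sync_hz


# Standard definition mode from the set top box (measured values from the text).
sd_hsync = line_rate_hz(v_total=625, frame_rate_hz=50)
sd_clock = dot_clock_hz(h_total=917, h_sync_hz=sd_hsync)

# 1024x768 at 60Hz from the laptop (measured values from the text).
xga_hsync = line_rate_hz(v_total=806, frame_rate_hz=60)
xga_clock = dot_clock_hz(h_total=1344, h_sync_hz=xga_hsync)

print(f"SD:  hsync {sd_hsync / 1e3:.2f} kHz, dot clock {sd_clock / 1e6:.2f} MHz")
print(f"XGA: hsync {xga_hsync / 1e3:.2f} kHz, dot clock {xga_clock / 1e6:.2f} MHz")
```

Since the HTotal of 917 is an approximate oscilloscope measurement, the resulting standard definition dot clock should be read as approximate too.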
With the video interface and maximum operating frequency determined, the overall architecture of the system could be considered. This would consist of:
Block diagram of the proposed module.
Before designing the processing circuitry, it is desirable to strip the cancelling algorithm back to the simplest form that gives the correct result, as every extra mathematical operation demands extra circuitry in the processor. The algorithm should be implemented using fixed-point or integer arithmetic, and an analysis must be performed of the dynamic range of the variables in order to determine the optimum word lengths at each point of the operation.
At its simplest, the algorithm consists of applying a linear function to each pixel: f(x) = m*x + c, where the constants m and c differ for each pixel, but are unchanging in time. These coefficients can be precomputed and stored in the framebuffer, in place of the variables 'logo' and 'alpha' described above. The original function is:
output = (255*input - alpha * logo)/(255 - alpha)
This can also be expressed as:
output = (255/(255 - alpha)) * input - ((alpha * logo) / (255 - alpha))
or
output = m * input + c
where
m = 255/(255 - alpha) and c = (alpha * logo) / (alpha - 255)
Now, both of these values must be stored as unsigned 8-bit integers. As it stands, all alpha values less than 128 correspond to m values between one and two, which will all be rounded to one, and the system will not be able to differentiate between any of these alpha values. Scaling the m value, say by a factor of 128, will address this issue, although m then overflows 8 bits if alpha is greater than 127 (alpha = 127 gives exactly 255). Fortunately most logos will have an opacity less than 0.5, corresponding to alpha values of 127 or less. Therefore, the final equation for m becomes:
m = 128 * 255/(255-alpha) = 32640/(255-alpha)
Another problem exists with the 'c' value. As alpha will always be less than 255, c will never be positive. Therefore, the range of c is shifted by adding 256, effectively using two's complement notation. The final equation for c is:
c = 256 - alpha * logo / (255 - alpha)
and the overall transformation equation becomes
output = m * input / 128 + c - 256
This equation was implemented in the software program, and was verified to give the same result as the original that worked with 'alpha' and 'logo' directly. The new equation is in a form that can be implemented in hardware easily. It only requires a single multiplication and a single addition. The division by 128 can be achieved simply by tapping off the output of the multiplier at the appropriate bit, giving a 9-bit input to the summation. As the sum will always be at least 256, the subtraction can be implemented by simply ignoring bits 8 and 9 of the sum.
Logic for the evaluation of the linear function.
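The integer form of the linear function can be modelled in a few lines of Python. This is only a sketch of the arithmetic described above, not the code actually used in the project: the function names are hypothetical, the rounding of the coefficients is my assumption, and c is stored modulo 256 so that it fits in 8 bits.

```python
def coefficients(alpha, logo):
    """Precompute the stored 8-bit m and c values for one pixel.

    Assumes 0 <= alpha <= 127 so that m fits in 8 bits.
    Rounding to nearest is an assumption; the text does not specify it.
    """
    m = round(32640 / (255 - alpha))                  # 128 * 255 / (255 - alpha)
    c = (256 - round(alpha * logo / (255 - alpha))) & 0xFF
    return m, c


def cancel_pixel(pixel, m, c):
    """Integer model of the hardware: one multiply, one add, two bit selections."""
    product = m * pixel        # 8-bit x 8-bit multiplier, 16-bit product
    scaled = product >> 7      # tap the product at bit 7: divide by 128 (9 bits)
    total = scaled + c         # adder
    return total & 0xFF        # ignore bits 8 and 9: the "- 256"


def cancel_reference(pixel, alpha, logo):
    """The original floating-point function, for comparison."""
    return (255 * pixel - alpha * logo) / (255 - alpha)


# Example: a white logo (255) at alpha 64 blended over a pixel of value 100
# gives a transmitted value of about 139.
m, c = coefficients(alpha=64, logo=255)
print(m, c, cancel_pixel(139, m, c), cancel_reference(139, 64, 255))
```

With these inputs the integer version recovers the original pixel value to within the rounding error of the 8-bit coefficients. Note that this model wraps rather than clamps if the true result falls outside 0 to 255.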
The blurring function also requires modification in order to make it suitable for implementing in hardware. The most significant problem is that blurring in the vertical direction requires access to pixel values in the previous and next lines, which are remote in time to the current pixel. This would require a shift register delay line capable of storing several lines of pixels. Therefore, it was decided to only apply the blur in the horizontal direction. This still requires a shift register, but it only has to be a few pixels long. Testing this concept on the computer showed that it still gave acceptable results.
The second obstacle with the blurring function was how to implement the divisions required, both for scaling the individual pixels, and averaging the result. It was considered desirable to make all required divisions be by a power of two, allowing implementation with bit shifts or tapping of the data buses. To meet these requirements, not only must the coefficients be related to one another by powers of two, they must also sum to a power of two. After some experimentation, the kernel [0.5 0.5 0.5 1 0.5 0.5 0.5] was derived. Although this no longer corresponds to a Gaussian distribution, it still provides quite acceptable results.
This function can be implemented using a seven stage, eight-bit wide shift register. Image data is shifted in, and the contents of each stage are fed to a summer, with all stages except the middle one being tapped at bit 1 instead of bit 0 (halving their values). The two least significant bits of the summer output are ignored (dividing the sum by four), with the rest being taken as the output of the filter.
This filter does delay the video signal by a few pixels, and as it was intended to pass through the sync signals from the original analogue video input, rather than regenerate them, the picture will be shifted slightly to the right after being processed. However, it was anticipated that this could be corrected using the picture geometry controls on the display. To switch the filter out of circuit without affecting the delay, the output can be taken from the middle stage of the shift register.
Digital filter logic for the blur function.
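The behaviour of this filter, including the three pixel delay and the bypass from the middle stage, can be sketched in Python as follows. This is a software model under my own assumptions (the register starts out cleared, and the function name is hypothetical), not the Verilog used on the FPGA.

```python
from collections import deque


def blur_line(pixels, enabled=True):
    """Model of the seven stage horizontal blur for one video line.

    Kernel [0.5 0.5 0.5 1 0.5 0.5 0.5], sum of weights = 4.
    The output is delayed by three pixels whether or not the filter is enabled.
    """
    stages = deque([0] * 7, maxlen=7)   # shift register, assumed cleared at start
    out = []
    for p in pixels:
        stages.appendleft(p & 0xFF)     # shift in the new pixel, oldest drops off
        if enabled:
            # Outer stages are tapped at bit 1 (value >> 1), middle stage at bit 0.
            total = stages[3] + sum(s >> 1 for i, s in enumerate(stages) if i != 3)
            out.append(total >> 2)      # ignore the two LSBs: divide by 4
        else:
            out.append(stages[3])       # bypass: middle stage, same delay
    return out


print(blur_line([0, 0, 0, 0, 128, 0, 0, 0, 0, 0, 0, 0]))
print(blur_line([1, 2, 3, 4, 5, 6, 7, 8], enabled=False))
```

One side effect visible in the model is that truncating the halved values costs a little headroom: a solid line of 255 comes out as 254.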
With a basic block diagram of the signal processing stage now complete, attention could now be turned to the selection of components for the hardware module. The primary constraints on the choice of components were availability, ease of hand assembly, operating speed, and price. To ease sourcing of components, the search was limited to stocked lines from RS Components, Farnell, Mouser, and Digi-Key. Fine pitch SMDs were not considered a major problem, though BGA devices were to be avoided. It was also necessary to ensure that the I/O interface voltages of the various components were compatible.
A search for ADC chips quickly located the AD9983 from Analog Devices. This device is virtually ideal, as it is designed specifically for video applications. It contains three individual ADCs, capable of operating at up to 140MSa/s, and each with a dedicated 8-bit output data bus. It also contains an inbuilt clock recovery PLL, with an output for the dot clock signal, which saves having to implement this function separately. Its operation is controlled by various internal registers, accessible via an I2C interface. It requires a 1.8V supply, but all of the external I/O pins can operate at 3.3V. It was also quite reasonably priced, at $17.30 in single quantities from RS. I added an external clock buffer to distribute the dot clock throughout the system.
The most critical part of the module was the signal processing circuitry. Although the logic functions were relatively simple, and could have been implemented fairly easily using discrete logic, the high-speed nature of the design precluded this approach. I therefore decided that an FPGA-based approach would be necessary. As I did not have much experience working with this type of component, it was quite hard to determine suitable selection criteria initially. I therefore decided to acquire an FPGA development kit, and attempt to implement the necessary logic using it. If the device chosen proved to have insufficient capacity, I would hopefully then be in a better position to specify a more appropriate device.
After some searching, I found an evaluation kit for the Lattice iCE40HX1K device. This part has 1280 logic cells, and is available in a 100 pin flat pack. The evaluation kit has a demo flashing light application, which provided a starting point for programming the chip. I downloaded the 'iCEcube' development software for this FPGA (which proved to be a mass of bloatware) and loaded up the Verilog demo code. The documentation for the development tools proved to be rather sparse, and this was not helped by the fact that the software was made up of components from multiple companies. However, I was eventually able to make some changes to the code, and propagate them through to the board, proving that the toolchain was working.
As I could not find any information on setting up a project from scratch, I decided to start with the demo application, and strip out all the functionality, but leave the development environment set up. Most of the demo code consisted of debug functions for interfacing with a PC. Although this would have been useful for developing the logo-cancelling algorithm, there would not have been enough pins left for the debugging interface. Therefore, I removed this code.
Most of the information that I could find on Verilog was not geared towards FPGA development, but I eventually managed to figure out enough to get some I/O working on the development board. After this was achieved, it was surprisingly easy to implement the processing logic. The only problem then was to verify that the logic functioned as expected. I managed some partial testing on the development board, using a row of DIP switches for input, and reading out the results on LEDs. However, this would not be feasible for a full-scale test, which would have to wait until the rest of the hardware was designed.
The test did show that the HX1K device appeared to be capable of performing the required logic functions, with only about 60% of the logic cells in use after implementing the three linear functions, and three filter blocks. Some additional capacity would be required for various switching functions and glue logic, but there appeared to be plenty of free space. I decided to leave the implementation of these functions until after the rest of the hardware was ready.
One potential limitation would be the number of I/O pins available. The 100 pin VQFP package only has 72 available I/O lines. It was decided at this stage to only provide for processing of greyscale logos, which eliminated the need for 16 extra bits of logo data. I also decided to eliminate the automatic logo recognition function, negating the need for numerous extra pins, the comparison logic, and the secondary framebuffer. However, this still left eight 8-bit buses that had to be connected to the FPGA (red, green, and blue in, 'm' in, 'c' in, and red, green, and blue out). These take up 64 pins, leaving only eight for clocks and interfacing the FPGA with the control microcontroller. However, it appeared that the chip could be made to work overall, and I decided to give it a go for the final design.
The chip was not available locally in the correct package, and, although it was stocked by Mouser, it was listed as export restricted (though, funnily enough, the dev boards with the exact same chip were not!). I therefore decided to reuse the one from the dev board that I had bought from Farnell. These boards were cheap enough that I could buy a whole extra board if necessary.
Schematic diagram.
At this point, although not all of the parts had been chosen yet, it was possible to start work on the schematic diagram. The AD9983 was set up according to the example circuits in its datasheet, and buses were run from it to the iCE40. Then the necessary support circuitry was added for the FPGA. Although this device does have internal configuration memory, it appears to be OTP, so an external EEPROM was used, as on the dev board. In order to allow the EEPROM to be programmed, its connections were brought out to a connector. This could then be wired into the dev board, and the EEPROM and FPGA configured just as if they were on the original board.
The next major block of the circuit to design was the framebuffer memory. To simplify the design as much as possible, I elected to use static RAM for the framebuffer, eliminating the need for any refresh circuitry. However, this placed severe limits on the available storage capacity, as SRAM is not available in as high a capacity as DRAM (SRAM is also slower, and much more expensive!).
My plan was to let the address counters free run through the blanking intervals, simplifying the control circuitry considerably. However, this meant that the framebuffer had to be dimensioned for the 'total' number of pixels, rather than the 'visible' number of pixels. I also wanted to be able to switch between different images by bank switching using the high order address lines. This meant that each image would have to start at a new page. If the image dimensions slightly exceeded a power-of-two boundary, there would be considerable wasted space between the end of the image and the start of the next one. Unfortunately, this is the case with the 1344x806 'virtual' resolution that corresponds to a displayed resolution of 1024 x 768, with each 8-bit image being slightly over the 1 MiB boundary.
However, I realised that there was a way around this. First, instead of clocking the vertical address counter from the overflow of the horizontal counter, it would be clocked independently from the horizontal sync pulse. This would leave each counter with a separate 10-bit address space. Although 10 bits is insufficient to address the 1344 horizontal pixels, meaning that some of these pixels would be mapped into the video line twice, only 1024 contiguous pixels would be visible in the final image. Therefore, each pixel in the framebuffer would only be displayed once on the screen.
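The effect of the two independent 10-bit counters can be illustrated with a short Python model. The placement of the vertical count in the upper address bits is my assumption (the text only says that each counter has its own 10-bit address space), and the function name is hypothetical.

```python
H_BITS = 10
V_BITS = 10


def framebuffer_address(h_count, v_count):
    """Map free-running pixel and line counts onto a 20-bit SRAM address.

    Each counter simply wraps at 1024, so line length no longer affects
    where the next line starts in memory.
    """
    h = h_count & ((1 << H_BITS) - 1)
    v = v_count & ((1 << V_BITS) - 1)
    return (v << H_BITS) | h            # assumed layout: vertical bits on top


# With HTotal = 1344, counts 1024..1343 alias onto addresses 0..319.
print(framebuffer_address(0, 5) == framebuffer_address(1024, 5))

# Any run of 1024 contiguous pixels still hits 1024 distinct addresses,
# wherever the visible region starts within the line.
for start in (0, 160, 320):
    window = {framebuffer_address(h, 0) for h in range(start, start + 1024)}
    print(start, len(window))
```

The highest address reached with VTotal = 806 stays below 2^20, so one 8-bit image fits in a 1 MiB page.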
With this optimisation, a full screen 8-bit image could be stored in a 1 MiB page, and a full set of m- and c-values would only take up 2 MiB. However, there was a need to store a number of different logos. There are about 10 free-to-air television channels available in my area, each with its own logo. Programs are also transmitted in different aspect ratios, which alters the size and position of the logo on screen. It was decided that these variations would be treated as separate logos for the purpose of cancelling them. Furthermore, some TV networks have different variations or positions for the logo to suit different programs. All together, these variations point to a need to store at least 16-32 different logo patterns.
It would be impractical in terms of cost to provide enough SRAM to store all of these at the same time, as up to 64 MiB (512Mibit) of storage would be required. Therefore, it was decided to only hold two logos in RAM at any one time (to allow double-buffering when changing channels), and keep the rest in nonvolatile memory. (This would be necessary in any case, to initialise the RAM at power up.) The speed requirements of the nonvolatile memory would not be nearly as demanding, as the logo could be loaded into RAM over a number of video frames. Therefore, the RAM requirements were reduced to 4MiB. To handle the 65MHz dot clock, the access time would need to be less than 15ns. One possible candidate was the ISSI IS61WV102416. This chip is organised as 1M x 16 bits, and has an access time of 10ns. It was therefore decided to use this part, and add provision for a second chip, to enable double-buffering. These are quite expensive chips, at around $35 each!
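The storage and timing figures in this section follow from a few lines of arithmetic, reproduced below in Python. The variable names are mine; the numbers are those given in the text.

```python
MIB = 1024 * 1024

page_bytes = 1 << 20                    # 10 + 10 address bits, one byte per pixel
logo_bytes = 2 * page_bytes             # one page of m values plus one page of c values
max_logos = 32                          # upper end of the 16-32 estimate

all_logos_mib = max_logos * logo_bytes / MIB      # everything held in SRAM at once
all_logos_mibit = all_logos_mib * 8
double_buffer_mib = 2 * logo_bytes / MIB          # two logos held in RAM

chip_bytes = (1 << 20) * 16 // 8        # one 1M x 16 SRAM
chips_needed = 2 * logo_bytes // chip_bytes

pixel_period_ns = 1e9 / 65e6            # time available per access at 65MHz

print(f"all logos in SRAM: {all_logos_mib:.0f} MiB ({all_logos_mibit:.0f} Mibit)")
print(f"double-buffered:   {double_buffer_mib:.0f} MiB, {chips_needed} chips")
print(f"pixel period:      {pixel_period_ns:.2f} ns")
```

Each 1M x 16 chip holds exactly one logo, with the m and c bytes for a pixel sharing one 16-bit word, which is why the second chip is what provides the double-buffering.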
The address counters of the RAM also posed a problem. Ideally, this function would have been performed by the FPGA, but a shortage of I/O pins prohibited this configuration. I had hoped that I would be able to use a single chip counter solution, capable of operating at the necessary speeds, but I could not find any suitable parts. In the end, I had to fall back on using discrete logic. A synchronous counter configuration is necessary for this application, so I settled on 74163 family devices. Unfortunately, this is only a four bit counter, so six chips are required (three for each of the two 10-bit counters). Although I was familiar with the standard LS/HC family devices, these would have been too slow for this application, particularly given the 3.3V supply voltage. Eventually, I found the LVC family, which can apparently run up to 150MHz.
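To show how three four-bit synchronous counters combine into one address counter, here is a simplified behavioural model in Python. It is a sketch under stated assumptions: the class names are hypothetical, the load inputs are omitted, the clear is modelled as active-high for readability, and the two enable inputs of the real part are merged into one.

```python
class Counter163:
    """Simplified model of one 74163-style 4-bit synchronous counter."""

    def __init__(self):
        self.q = 0

    def rco(self, ent):
        # Ripple carry out: high only when enabled and at the terminal count.
        return ent and self.q == 15

    def clock(self, enable, clear=False):
        if clear:                       # synchronous clear takes priority
            self.q = 0
        elif enable:
            self.q = (self.q + 1) & 0xF


class Counter10:
    """Three cascaded chips; only the low 10 of the 12 bits are used."""

    def __init__(self):
        self.chips = [Counter163() for _ in range(3)]

    def clock(self, clear=False):
        # Work out every enable from the current state first, then update all
        # chips together, as happens on a shared clock edge.
        enables = []
        ent = True
        for chip in self.chips:
            enables.append(ent)
            ent = chip.rco(ent)
        for chip, en in zip(self.chips, enables):
            chip.clock(en, clear)

    @property
    def value(self):
        raw = self.chips[0].q | self.chips[1].q << 4 | self.chips[2].q << 8
        return raw & 0x3FF


h = Counter10()
for _ in range(1344):                   # one full line at HTotal = 1344
    h.clock()
print(h.value)                          # low 10 bits of 1344
h.clock(clear=True)                     # reset, e.g. from the sync pulse
print(h.value)
```

In this model the carry still ripples through the enable chain between clock edges, which is one reason the propagation delay of the chosen logic family matters at a 65MHz dot clock.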
The only other major component of the system was the video DAC. I was again fortunate in finding a three channel integrated device, the ADV7125. This has a bandwidth of 140MHz, and runs from a 3.3V supply. It is good value at around $12. The datasheet suggested using an isolated supply, so I added a dedicated 3.3V regulator for it. The DAC is fairly easy to interface, although it requires an external voltage reference. I used the one suggested in the datasheet, although it was relatively expensive at over $3.00.
It was clear that a microcontroller of some description would be required to generate the various control signals, and especially to set the I2C registers in the AD9983. I started off with an ATMega168, which I had used in other projects. However, while designing the circuit, I became annoyed with the configuration of the IO ports on this chip, and changed it for an ATMega32.
As explained previously, some type of nonvolatile storage for the logo masks would be required. I decided to investigate the possibility of using a micro SD card. Research quickly showed that these cards support an SPI mode, and talking to the card would simply be a matter of attaching a suitable socket to the SPI pins of the ATMega. In fact, the hardest part of the process was entering the footprint for the SD card socket into the PCB editor, which was quite time consuming!
I decided to also use the SPI interface to communicate with the FPGA. The iCE40 actually uses SPI to load its configuration image from the EEPROM, and the SPI bus must be isolated from the microcontroller during this process. However, once it is configured, the pins are available as general I/O, and suitable SPI decoding logic can be implemented within the FPGA. Separate chip select lines are used to individually address the SD card and the FPGA.
A few pushbutton switches and LEDs were added to provide a user interface for the system. I also thought there might be a need to select the target logo remotely (possibly using a separate module that decoded the channel LED display in the set top box), so I brought out four spare I/O pins to a header. Additionally, I added an RS-232 interface for debugging.
With all of the major blocks in place, it was necessary to pay some attention to the interfacing between the major components, particularly the micro, FPGA, RAM, address counters, and SD card. The system needs to be able to operate in four modes:
In each of these modes the clock and control signals for the RAM and address counters need to be connected to different points. The vertical clock, vertical reset, and horizontal reset signals change relatively slowly, and can either be controlled directly from the microcontroller, or switched to the appropriate points using a CMOS multiplexer. However the horizontal clock signal normally runs at the full system dot clock frequency, and I was concerned about the integrity of this signal when fed through a multiplexer. I therefore decided to dedicate a precious FPGA pin to this function.
The transfer speed to the SD card was also a concern. I had hoped to be able to clock the card from the FPGA, enabling a high-speed direct memory transfer between the two devices. However, I was forced by a lack of available FPGA pins to use the ATMega SPI clock, which was limited to 10MHz. The direct connection did still allow a form of 'DMA', as it would be possible to instruct the FPGA to listen, then read bytes in using the ATMega SPI interface. The FPGA would then eavesdrop on the SPI transfer, and the incoming byte could be discarded inside the micro.
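For a rough idea of what the 10MHz SPI clock means in practice, the Python sketch below estimates the best-case time to load one complete logo (a 1 MiB page of m values and a 1 MiB page of c values). It assumes the bus is kept continuously busy and ignores all SD card command and block overhead, so a real transfer would take longer.

```python
spi_clock_hz = 10_000_000               # one bit per SPI clock cycle
logo_bytes = 2 * (1 << 20)              # m page + c page
frame_rate_hz = 50                      # standard definition mode

logo_bits = logo_bytes * 8
load_seconds = logo_bits / spi_clock_hz
load_frames = load_seconds * frame_rate_hz

print(f"best-case load time: {load_seconds:.2f} s (about {load_frames:.0f} frames)")
```

This suggests a full logo takes well over a second to load at best, which is consistent with holding a second logo in RAM for double-buffering.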
Careful management of bus access on the SPI interface is necessary to prevent contention. At power up, the FPGA will first use the bus to load its configuration image from the EEPROM, during which time the micro must tristate its SPI pins. After the FPGA asserts its CDONE signal, the micro can take over control of the bus. The chip select line of the configuration EEPROM is connected via isolating resistors, enabling it to be jammed inactive during normal operation. The micro also monitors the CS line from the FPGA development board, so it can release the bus if reprogramming of the EEPROM is attempted. The SPI bus is also used for programming the microcontroller. Provided this is done after the FPGA is configured, the bus will be available, as the micro will tristate its pins and deassert all of the chip select lines.
The only other aspect of the circuit is the power supply. With each of the major chips requiring a different supply voltage, multiple regulators were needed. Linear regulators were used in the interest of simplicity, and as the power consumption was not expected to be particularly high. The 5V rail is not strictly necessary, though it is used as a clamp sink for the incoming sync signals. The 5V regulator also serves as a preregulator for the other devices, and spreads out the heat dissipation in the event of a high supply voltage. A generous number of decoupling capacitors was provided for the various supply rails, in view of the high-speed nature of the design.
There is not much to say about the PCB design. Because of the speed and complexity of the circuit, a four-layer design was considered necessary. Although going to this level is considered difficult by some, it actually gives a great deal more flexibility compared to a single- or double-sided board. With the exception of the connectors and pushbutton switches, all of the parts are surface mount types, allowing a compact overall arrangement.
Routing the tracks to the RAM chips presented some problems, as they are for the most part connected directly in parallel, and the pins are far too close for 'squeeze throughs'. Sometimes, identical parts with reversed pinouts are available for this situation, enabling mounting one chip on each side of the board, but this was not the case for this particular SRAM. In the end, I placed one chip on each side, but offset from one another, and ran a bus between the outer rows of pins. This did not take up too much room when done in 0.16mm track.
PCB layout.
The prototype boards were fabricated by Seeed Studio, who provide an excellent service at 5 four-layer boards up to 100x100mm, silk screened and soldermasked both sides, for $80. Turnaround time was only 15 days, even with their standard postage option!
On to Part 3 - Construction & Testing.
Back to Part 1 - Algorithm Development.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.