In order to design and test the cancelling algorithm, it was first necessary to set up a suitable development environment.
Ideally, a digital TV capture card would be used, to get direct access to the transmitted bitmapped image. However, a suitable card was not available. Instead, a digital TV 'set top box' was used to generate an analogue PAL B/G composite video signal, which was then fed to an analogue video capture card. This approach results in a loss of quality due to a number of causes, including analogue noise, PAL colour encoding noise, loss of resolution due to downsampling, and pixel framing errors.
These factors are not necessarily a disadvantage from the perspective of developing the algorithm, though. If the software is robust enough to give good-quality results in the presence of these imperfections, then the results from a higher-quality input signal can only be an improvement.
To capture still images to experiment with, the program 'xawtv' was used. This has the capability to save screen captures in PPM or JPEG format. However, the results were disappointing: colours in the saved images were severely distorted, possibly due to the use of an incorrect colour encoding. As it did not appear possible to fix this easily, the 'streamer' program was used to capture images instead, into the greyscale PGM format, using the command "streamer -s 736x552 -o `date +%s`.pgm". The primary channels of interest at this stage were ABC and SBS, neither of which uses a coloured logo. The initial experiments could therefore be simplified considerably by working in greyscale, and the algorithms developed would still be applicable to coloured scenes and logos.
A number of still images were captured, showing the ABC1 logo against a variety of different backgrounds. This logo is composed of two parts. At the top is a block showing the number '1'. This part of the logo is not visible against fully white backgrounds. At the bottom is the Lissajous figure, which is visible against both light and dark backgrounds.
The sample images were examined, and a few values were derived. The black level of the video signal was around 19-23 (out of a maximum of 255 for the 8-bit greyscale representation) and the peak white level was around 240-244. Against a black background, the top half of the logo had a brightness of around 85. At this point, it was possible to make a guess at the transformation function for pixels in the top half of the logo. I assumed that a simple linear function was used, mapping an input of 20 (black level) to an output of 85, while leaving an input of 240 unchanged. A function meeting these conditions is
f(b) = 0.7045b + 71, where b is the brightness of the pixel in the original image. Its inverse is
f⁻¹(b) = 1.4194b - 101.
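As an illustration (not the exact code used), this guessed function and its inverse can be expressed in integer arithmetic, approximating 0.7045 as 181/256 and 1.4194 as 363/256:

```c
/* Sketch of the guessed linear transfer function for the '1' block of
   the logo, and its inverse, using 8.8 fixed-point coefficients.
   181/256 ~= 0.7045 and 363/256 ~= 1.4194; +128 rounds the division. */
static int logo_apply(int b)
{
    return (b * 181 + 128) / 256 + 71;
}

static int logo_cancel(int b)
{
    int v = (b * 363 + 128) / 256 - 101;
    if (v < 0) v = 0;       /* clamp to the valid 8-bit range */
    if (v > 255) v = 255;
    return v;
}
```

With these coefficients, black-level input (20) maps to the observed logo brightness (85), and the inverse recovers it.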
The next task was to find a means of testing this function. It would have been possible to do this with an image editing program, using the layer combining and level curve functions. However I decided instead to write a custom program. This would allow all the steps in the transformation to be automated easily. Speed of execution was not a concern while working with still images, and I could have used an interpreted language such as Python, which would have aided the development process somewhat. However, I chose to use C instead, to give the possibility of real-time testing on live video streams. Integer arithmetic was also used where possible, in order to ease the transition of the algorithm to an FPGA or DSP chip.
A simple PGM image I/O routine was written, to give the program a means of reading the source images and masks, and outputting the results. Fortunately, this format is relatively simple, consisting only of a simple header, followed by an array of 8-bit greyscale bitmap data.
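A minimal loader for this format might look like the following sketch (this is illustrative, not the routine actually written; it handles only binary 'P5' files with maxval 255 and no comment lines):

```c
#include <stdio.h>
#include <stdlib.h>

/* Read a binary (P5) PGM file: a short text header, then raw 8-bit
   greyscale data. Returns a malloc'd buffer, or NULL on error. */
static unsigned char *pgm_read(const char *path, int *w, int *h)
{
    FILE *f = fopen(path, "rb");
    int maxval;
    unsigned char *data = NULL;
    if (!f) return NULL;
    if (fscanf(f, "P5 %d %d %d", w, h, &maxval) == 3 && maxval == 255) {
        fgetc(f);                         /* single whitespace after header */
        data = malloc((size_t)*w * *h);
        if (data && fread(data, 1, (size_t)*w * *h, f) != (size_t)(*w * *h)) {
            free(data);
            data = NULL;
        }
    }
    fclose(f);
    return data;
}

static int pgm_write(const char *path, const unsigned char *data, int w, int h)
{
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    fprintf(f, "P5\n%d %d\n255\n", w, h);
    fwrite(data, 1, (size_t)w * h, f);
    return fclose(f);
}
```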
A mask image was prepared, showing which pixels on the image made up the logo on the screen. This image was initially derived from a screenshot of the closing credits of a TV program, showing white text and the logo against a black background. The text was removed using an image editor, leaving only the logo.
If the inverse function computed above were used, it would be necessary to hard code the equation into the program. This has the serious disadvantage of being unable to handle any changes to the logo transparency without modifying the program. Furthermore, the program would not cope with different logos from different TV channels. Therefore, it was considered desirable that the program be able to automatically derive the inverse function from a sample of the logo. The process of deriving the inverse function was repeated, using variables in place of the constants used previously. This gave the function
f⁻¹(b) = WHITE - (WHITE - b) * (WHITE - BLACK)/(WHITE - m)
where m is the brightness of the sample logo, or 'mask', and WHITE and BLACK are the white and black levels of the video signal. The mask image is read as an additional input to the program.
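A sketch of this per-pixel inverse, with WHITE and BLACK fixed at the levels measured earlier (the real program treats them as inputs; the clamping is my addition):

```c
#define WHITE 240   /* measured peak white level of the captured signal */
#define BLACK 20    /* measured black level */

/* Cancel one pixel b using the corresponding mask sample m. */
static int cancel_px(int b, int m)
{
    int v;
    if (m >= WHITE)
        return b;   /* avoid division by zero for a fully white mask pixel */
    v = WHITE - (WHITE - b) * (WHITE - BLACK) / (WHITE - m);
    if (v < BLACK) v = BLACK;
    if (v > WHITE) v = WHITE;
    return v;
}
```

Note that where the mask is black (m = BLACK), the scale factor is 1 and the pixel passes through unchanged, so the same expression works across the feathered edge of the logo.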
This program was tested, and gave good results on the top half of the ABC1 logo. The edge of the logo is not an abrupt transition, but is feathered out across a few pixels. (It is not clear if this is also true for the original high definition signal, or if it is a result of downsampling, D/A - A/D conversion, and framing error.) The above equation automatically compensates for these variations in brightness of the logo.
Unfortunately, this equation is incapable of correctly cancelling the lower half of the ABC1 logo. This is expected, as this part of the logo is capable of both brightening and darkening the image. Clearly, a different approach would be required. It would be advantageous if whatever technique was adopted were also applicable to the top half of the logo.
It was then necessary to guess at how the logo was added to the image at the transmitter. It seemed reasonable to assume that the transmitted scene was a weighted average of the original picture and the logo, with the weighting being controlled by a separate alpha (transparency) channel. This would also explain the lightening of the '1' part of the logo, if the source pixels for this region were full white. So, the transfer function was assumed to be
f(b) = b*(255-alpha)/255 + m*alpha/255
Inverting this in terms of b gives
f⁻¹(b) = (255*b - alpha * m)/(255 - alpha)
The value of alpha was measured from the sample images at around 75/255. The brightness of the mask image was normalised, bringing the peak brightness to 255.
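In integer arithmetic, this inverse blend can be sketched as follows (the clamping is my assumption; the program may have handled overflow differently):

```c
/* Undo the assumed alpha blend: b is the transmitted pixel, m the
   normalised mask (logo) pixel, alpha the measured opacity (~75/255). */
static int unblend(int b, int m, int alpha)
{
    int v = (255 * b - alpha * m) / (255 - alpha);
    if (v < 0) v = 0;
    if (v > 255) v = 255;
    return v;
}
```

For example, a source pixel of 100 blended with a white (255) logo pixel at alpha 75 transmits as approximately 146, and unblend(146, 255, 75) recovers 100.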
This new method gave fairly good cancellation of the top half of the logo, although not quite as good as the original technique. Some cancellation of the lower part was also evident, although this was not perfect.
At this point, it was decided that it would be worthwhile to try the system on moving images. As the logo remains stationary during camera panning, it is important to assess the subjective effectiveness of a logo cancellation system using live video. As a first step, the video signal was captured and displayed on-screen using mplayer. This program has a number of video processing and filtering modules, and it was thought that it might be possible to incorporate the logo cancellation as an additional filter. (Mplayer also has its own logo cancellation filter, but this relies on discarding pixels contaminated by the logo, and interpolating across the gap. Obviously, this technique cannot make use of any of the picture information transmitted 'under' the logo.)
Examination of the source code of mplayer revealed that it would require considerable effort to integrate a new filter into its processing chain. A simpler interface for the video stream data was desired, ideally something using Unix pipes. Further reading of the mplayer documentation revealed a method to write the video stream to a file. The command 'mplayer -vf pp=lb tv://0/3 -tv width=720:height=540 -vo yuv4mpeg' will write the video stream from the composite input to the file 'stream.yuv' at the specified resolution, after passing it through a deinterlacing filter. The resulting file could then be played back, also using mplayer.
It was necessary to find a way to make the recording and playback happen simultaneously, without the use of any intermediate files. A named pipe was created using the command 'mkfifo stream.yuv', and both processes were started. Initially, the playback process did not work, as it was attempting to seek within the stream in order to read a header; however, altering the playback command to 'mplayer -demuxer 12 stream.yuv' eliminated this problem. This resulted in the live TV signal being displayed on the computer monitor, with all of the video data passing through the FIFO. Some delay of the video relative to the audio (which was taken directly from the DTV tuner) was noticeable, but this was not considered to be a problem for the purpose of evaluating the system.
The next step was to modify the logo cancellation program to read frames from the FIFO, process them, and then write them to a second FIFO. This required an understanding of the YUV format used. The basic format is reasonably well documented: the file consists of a header line giving the resolution, followed by a series of frames, each with its own header line and then the binary bitmap data for that frame. The YUV format is a digital equivalent of the colour-difference encoding used for NTSC and PAL colour televisions. (Note: digital colour difference signals are also referred to as 'YCbCr' format. However, I have continued to use the label YUV, in keeping with what is used by mplayer.) This encoding allows a saving of bandwidth by recording the chrominance information at reduced resolution. However, the exact arrangement of the binary data within each frame was uncertain.
Examination of a stream dump using a hex editor revealed that each frame used 1.5 bytes to encode each pixel of the image. Further examination of a frame composed mostly of solid colour revealed that the data at the start of the frame repeated at intervals corresponding to the full horizontal resolution of the image, sampled at 8 bits per pixel. There was no sign of any chrominance data being interleaved, so it was assumed that the first 2/3 of the frame data was the 8-bit luminance bitmap at the full resolution of the frame.
At this point, the program was modified to pass through the video stream, but zero out the last 1/3 of the frame data. This produced an image without any colour information, but with a strong green tint, showing that the chrominance was encoded using the last 1/3 of the frame data, and also that the chrominance bytes were encoded relative to a median value of 127, with 0 representing an extreme value.
It was then necessary to determine how the chrominance had two complete channels of data encoded into only half of the space required for a full-resolution 8-bit bitmap. Clearly, some sort of subsampling or reduction of the bit depth was occurring. The SECAM system uses 1:2 subsampling in the vertical direction, and all colour television standards have a reduced bandwidth in the horizontal direction. Given that the chrominance data was not interleaved with the luminance, it was considered likely that the two chroma channels would not be interleaved either. It was expected that first would be a block of U data, occupying half the remaining storage space allocated to the frame, then a similar block of V data. Setting the last 1/6 of the frame to the value 127 confirmed this, giving an effect similar to a TV with a certain type of colour fault. Further investigation showed that the chrominance data was stored as 8-bit values, at half the original resolution in both the horizontal and vertical directions. (Some additional reading confirmed this interpretation of the data was correct - and is known as YUV4:2:0 encoding)
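The plane layout deduced above can be captured in a small helper (the struct and function names here are mine, not from the actual program):

```c
#include <stddef.h>

/* One YUV 4:2:0 frame as found in the stream: a full-resolution 8-bit Y
   plane (first 2/3 of the data), then U and V planes at half resolution
   in both directions (1/6 each). */
struct yuv420 {
    unsigned char *y, *u, *v;
    int w, h;
};

static size_t yuv420_frame_size(int w, int h)
{
    return (size_t)w * h * 3 / 2;
}

static void yuv420_map(struct yuv420 *f, unsigned char *buf, int w, int h)
{
    f->w = w;
    f->h = h;
    f->y = buf;                               /* luminance plane */
    f->u = buf + (size_t)w * h;               /* U, half resolution */
    f->v = f->u + (size_t)(w / 2) * (h / 2);  /* V, half resolution */
}
```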
Although the decomposition of the image into colour difference components may at first appear to be an additional complication, it actually makes the process of dealing with uncoloured logos considerably simpler. Only one channel of luminance from the first part of the frame data needs to be processed, and the chrominance values can be passed through unchanged. If it subsequently becomes necessary to remove logos that include a colour tint, the same methods can then be applied to the chrominance channels.
The logo cancellation was then trialled on live video data. Performance was comparable to that obtained in the static trial. It was considered that better performance could be achieved by using a bitmap to provide an individual alpha value for each pixel, rather than a constant value for the whole image. This would allow the opacity to be feathered out at the edges of the logo, mirroring the process that was apparent on the transmitted image.
To supply the two parameters, two sample images are required. One should be a best guess at the original logo, before it is combined with the video feed. The other should be a bitmap specifying the transparency or opacity of the logo at each pixel. The full equation becomes
output(x,y) = (255*input(x, y) - alpha(x, y) * logo(x, y))/(255 - alpha(x, y))
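A per-pixel sketch of this equation in C (the clamping and the skipping of fully transparent and fully opaque pixels are my assumptions, not necessarily what the program did):

```c
/* Cancel a logo from a greyscale frame using per-pixel logo and alpha
   bitmaps of the same dimensions as the frame. */
static void cancel_frame(unsigned char *frame, const unsigned char *logo,
                         const unsigned char *alpha, int npixels)
{
    int i, v;
    for (i = 0; i < npixels; i++) {
        if (alpha[i] == 0)
            continue;       /* outside the logo: pixel untouched */
        if (alpha[i] == 255)
            continue;       /* fully opaque: original pixel unrecoverable */
        v = (255 * frame[i] - alpha[i] * logo[i]) / (255 - alpha[i]);
        if (v < 0) v = 0;
        if (v > 255) v = 255;
        frame[i] = (unsigned char)v;
    }
}
```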
At this stage of the development process, a sample of the logo against a full white background was not available. Therefore, both the logo and alpha images were derived from a sample of the logo against a black background. The alpha image used this sample unchanged, while the logo image used the sample with the brightness normalised.
On testing this equation, a couple of defects were apparent. The solid interior of the top part of the logo was cancelled successfully, but an outline was visible at the transition from 'logo' to 'non-logo' areas. It was thought that this was because the strength of the cancellation was dependent on two inputs - the reference logo, and the alpha channel. Both of these signals drop off at the edge of the logo, with the result being that the cancellation effect falls off too quickly. To remedy this, the alpha bitmap could be edited manually, with the values at the edge of the logo brought up to match those in the interior.
However, this would mean that the manual editing process would be required every time the logo cancelling system encountered an unfamiliar logo. Therefore, it was decided to automate the process by applying a transform function to the alpha channel. This would perform a thresholding function, boosting alpha values above a certain figure, and attenuating them below the figure. In order to maintain some degree of feathering at the edges, a smooth continuous function was sought for this transformation. One possibility is based on the inverse tangent function:
alpha'(alpha) = A*atan(k*(alpha - t)) + B
where alpha' is the transformed alpha value, k is a shape factor, controlling the abruptness of the transition, and t is the turnover point at which the thresholding occurs. If the maximum alpha value in the image is called C, then two points on the curve are known: alpha'(0) = 0, and alpha'(C) = C. These points can be used to solve for the constants A & B, giving
A = C/(atan(k*C-k*t)-atan(-k*t))
B = -C*atan(-k*t)/(atan(k*C-k*t) - atan(-k*t))
In the interest of speed of execution, the alpha transform function was implemented as an integer look-up table. In order to optimise the function, it was desired to be able to change the parameters k and t while watching the output video signal. Controlling these values within the program could present some problems, as processing of the video signal would be suspended whenever the program blocked to read the parameter values. While it would have been possible to poll the keyboard using non-blocking I/O between frames, a simpler approach was used. Signal handlers were written that would increment or decrement each parameter, and recalculate the lookup table. (Of course, the table must be declared volatile.) Adjustments could then be made using e.g. 'killall -USR1 a.out'. This was done from a second machine over ssh, so that the first computer could run the video at full screen.
After some adjustments, fairly good results were obtained using the parameters t=10 and k=1. However, it was realised that transforming the alpha values was the wrong approach. The value of the sample logo outside the region defined by the alpha mask is not used in the calculations. Therefore, these values can have any value without affecting the output image. If the pixels near the edge of the top half of the ABC1 logo are changed to full intensity, the feathering effect will be controlled only by the alpha channel, and this will more closely match the transformation being applied to the transmitted picture. Indeed, the entire top half of the sample logo image can be blocked out solid white, and the whole shape determined from the alpha image. (This should be mathematically equivalent to the original linear cancelling function, but does not require special cases for different regions of the logo)
Implementing this change gave good results on the top half of the logo, and the alpha transform function was abandoned. The parameter adjustment code was reused to adjust the gain of the alpha channel, which allowed the '1' block in the logo to be nulled out almost completely.
However, the bottom part of the logo was still not cancelled completely. It was realised that the problem was due to the darker 'shadow' areas in this part of the logo. Although these were dark in the reference logo image they also had a reduced alpha value, as the alpha channel was derived directly from the logo. Therefore, the logo cancellation function was not applied with the necessary strength, and dark patches remained in the output image. Manual editing of the alpha bitmap brought some improvement, but an automatic approach was desired. It was realised that, in order to fully determine the extents of the logo, it would be necessary to capture it against both black and white backgrounds.
Finding a transmitted image with the logo on a white background proved to be considerably harder than getting the one with the black background. Although much of the content in the breaks between programs has a white background, the logo is not applied during these intervals. Eventually, some suitable images were captured. In order to reduce noise in the images, a number of frames were averaged together, both for the black and the white background images.
It was then necessary to combine the 'black background' and 'white background' images to produce the reference logo and alpha channel. A number of attempts were made based on inversion and channel mixing in an image editor, but eventually a better approach was found. The two sample images can be considered to be derived from the following equations:
W = alpha*L/255 + (255-alpha)*255/255
B = alpha*L/255 + (255-alpha)*0/255
where L=logo image pixel, alpha = logo opacity, W = image of logo on white background, and B = image of logo on black background.
Solving these equations simultaneously for alpha and L gives:
L = 255*B/(B-W+255)
alpha = B-W+255
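The per-pixel arithmetic of logosolve might look something like this sketch (the choice of output where L is undefined is one possibility, not necessarily the one used):

```c
/* Solve for the reference logo pixel L and opacity alpha, given the same
   pixel sampled against a white background (W) and black background (B). */
static void logosolve_px(int W, int B, int *L, int *alpha)
{
    int a = B - W + 255;
    if (a < 0) a = 0;
    if (a > 255) a = 255;
    *alpha = a;
    if (a == 0) {
        *L = 0;             /* no logo here: L is undefined, pick 0 */
    } else {
        int l = 255 * B / a;
        *L = l > 255 ? 255 : l;
    }
}
```

For example, a half-opaque white logo pixel gives W = 255 and B = 128, from which the solver recovers alpha = 128 and L = 255.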
A program, logosolve, was written to solve these equations and derive the sample logo and alpha channel images. Of course, in areas of the image where the logo is not present, L is undefined, and the program must be able to handle this. When the solved images were used as input to the cancellation program, good results were obtained. After adjusting the gain, all regions of the logo were cancelled almost completely. When the brightness was equalised in logo and non-logo regions of the image, some discolouration was evident against certain coloured backgrounds. This was thought to be a result of the 'chroma shift' phenomenon caused by the PAL encoding of the composite signal, and is expected to disappear if direct RGB video is used. Some ghosting around the edges of the logo region was also evident. In order to render these residual artefacts less objectionable, it was decided to apply a blur filter to the region of the image containing the logo.
The Gaussian blur algorithm functions by replacing each pixel with a weighted average of nearby pixels. The weighting factors are defined by a matrix centred on the current pixel. In order to reduce the computational complexity, it is possible to perform the blur separately in the horizontal and vertical directions. To determine when to blur a pixel, the alpha channel was used initially. This minimised the number of blurred pixels, but the edges of the logo were still perceptible. Therefore, a bounding box was defined for the logo, and the blur applied to all pixels within this box. A kernel size of 7 for the blur was found to give acceptable results.
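As an illustration of the separable approach, here is a single horizontal pass with a 7-tap binomial kernel (binomial weights approximate a Gaussian; the actual weights used are not recorded, so these are assumptions):

```c
/* Horizontally blur pixels x0..x1 of one row of width w, using the
   7-tap binomial kernel 1 6 15 20 15 6 1 (sum 64). Edge pixels are
   handled by clamping, i.e. replicating the border value. */
static void blur_row(unsigned char *dst, const unsigned char *src,
                     int w, int x0, int x1)
{
    static const int k[7] = {1, 6, 15, 20, 15, 6, 1};
    int x, i;
    for (x = x0; x <= x1; x++) {
        int acc = 0;
        for (i = -3; i <= 3; i++) {
            int xi = x + i;
            if (xi < 0) xi = 0;
            if (xi >= w) xi = w - 1;
            acc += k[i + 3] * src[xi];
        }
        dst[x] = (unsigned char)(acc / 64);
    }
}
```

A full 2-D blur of the bounding box would run this pass over each row, then an identical pass down each column, rather than applying a 7x7 matrix directly.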
[Figure: image before and after processing]
At this point, the quality of the logo cancellation was judged to be sufficient to be practical for general television viewing. However, another part of the algorithm also needed development. It would be impractical to have to switch the logo canceller on and off manually for program breaks or when changing channels. Therefore, a method of recognising when any one of a number of logos was present on screen was necessary. This would then automatically select the appropriate logo, if any, for the cancellation function to use.
A software routine was envisaged that, given a reference logo image and a frame from the live video feed, would compute a 'score' of the likelihood that the frame contained the given logo. This function would be evaluated against each station logo known to the program, and the highest scoring logo chosen as the one to cancel. If no score were above a certain threshold, no cancellation would be performed.
A simple approach was tried at first. The average brightnesses were calculated for three groups of pixels: firstly, pixels within the alpha mask having a corresponding pixel in the sample logo with an intensity greater than 127; secondly, pixels within the bounding box of the logo but not included in the alpha mask; and thirdly, as for the first group, but for intensities less than 127.
If the image contains the selected logo, the average brightness for this first group should be higher than that of the second group, which should in turn be higher than that of the third group. This method did work to a certain extent, however the discrimination between scenes with and without the logo was not sufficiently good. It was clear that more powerful statistical techniques would be required.
At this point, it was considered desirable to be able to graphically observe the relationship between the brightness of pixels in the transmitted scene, and those of the reference logo. Normally, graphical presentation of data from C programs is a somewhat tedious process, requiring the use of graphics libraries or external programs. However, in this instance, the data could be displayed directly on the video framebuffer. A subroutine was quickly written to draw an X-Y scatterplot on top of the left side of the video frame. (I got a bit carried away at this point, and thought it might be interesting to quickly program up a U-V vectorscope)
Points were drawn for each pixel within the logo alpha mask. The x value corresponded to the brightness of the selected pixel in the reference logo image, while the y value was the brightness of the same pixel in the transmitted image, less the average brightness across the logo bounding box. It could immediately be seen that, when the logo was present in the image, the plot showed a positive linear correlation (with the gradient of the line corresponding to the average alpha value of the logo). When the logo was not shown, the points were clustered about a horizontal line. A solid region of colour behind the logo in the source image resulted in the points being tightly grouped in a line, while a detailed region in the source image resulted in a much broader distribution of points.
The appearance of this plot gave a good indication as to whether the logo was present or not. It was then necessary to reduce the data in the plot down to a few values that could be used for making a decision.
One possible value of interest is the correlation coefficient, r. This shows how well the points fit a linear relationship. It has the disadvantage that it is not sensitive to the gradient of the relation, only its sign. It is also possible to compute a line of best fit through the point cloud, and examine its gradient, beta. Both of these values were computed and monitored while watching a live video stream. Although it was not possible to decide whether the logo was present by looking at either of these variables in isolation, it was noted that negative values of either measure correlated with absence of the logo, and high positive values for both corresponded with its presence.
It was then necessary to derive a single score value that could be used to make decisions. This was done by multiplying the gradient and correlation coefficient, but with the condition that if both values were negative, the resulting score would also be deemed negative. A moving average of this score value was then taken over 75 video frames, with the objective of reducing short duration noisiness of the score. After some experimentation, a threshold value of 0.06 was arrived at for the score. This did not provide perfect discrimination, but worked fairly well. It is preferable that the discrimination errs in favour of false positives rather than false negatives, as false positives will generally only affect program break content and advertising, which is of less importance than program content (note that TV network operators may contest the validity of this statement). The given value reflects this philosophy.
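The combined score and its smoothing could be sketched as follows (the exponential form of the running average is my simplification of the 75-frame moving average described):

```c
#define AVG_FRAMES 75   /* smoothing window, as described in the text */

/* Combined score: r * beta, forced negative when both inputs are
   negative, so 'doubly negative' evidence counts against the logo. */
static double logo_score(double r, double beta)
{
    double s = r * beta;
    if (r < 0 && beta < 0)
        s = -s;
    return s;
}

/* Per-frame update of the smoothed score (exponential approximation
   to a 75-frame moving average). */
static double update_avg(double avg, double score)
{
    return avg + (score - avg) / AVG_FRAMES;
}
```

The smoothed value would then be compared against the 0.06 threshold each frame to decide whether cancellation is active.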
At this point, it was possible to test the system across a number of television channels. Logo and alpha masks were derived for the channels ABC1, ONE, SBS1, and SBS2. Data structures representing these channels were placed in an array, and the score was evaluated for each channel every time a new frame was available. (In a system with a large number of channels, it may be necessary to restrict the score calculation to every 2nd, 3rd, or nth frame, to reduce the CPU load. However, this was not found necessary with four channels.) Separate moving averages were maintained for each channel. Every time a frame is processed, the channel with the highest score is picked as the logo to remove from the picture, unless none of the scores are above the threshold value, in which case no cancellation is attempted. Note that it may be necessary to assign different weightings and thresholds to different channels, in order to cope with different sizes of logos, etc. However, this refinement was not incorporated at this stage.
On testing this system, a problem immediately became apparent. The ABC1 logo, which had formed the basis for testing up to this point, contains both light and dark areas, encompassing a range of luminance values. However, the other channel logos contained only full-intensity white. It was therefore not possible to calculate the gradient of the relation between logo pixels and scene pixels, as at least two intensities are required to define a gradient. As a temporary fix, the logo intensity values were multiplied by the alpha channel. This gave a range of intensity values, due to the feathered edges of the alpha blending, which was sufficient to enable discrimination between the logos. It should be noted, however, that this method works only due to the soft edges of the alpha mask, and some inherent noisiness in the mask and logo image. It is not generally applicable - e.g. in cases where the alpha mask has perfectly sharp edges.
Further testing revealed that although the logo discrimination was reasonably effective, the correct logo was not chosen all the time. This resulted in an annoying flickering effect, sometimes showing a second station logo in inverse video, in addition to the uncancelled original logo. In its present state, the logo selection algorithm was not suitable for casual viewing. When the correct logo was selected, the logo nulling achieved was quite good, however a practical system would require manual selection of the channel, if automatic logo detection could not be achieved reliably.
It was realised that the distraction resulting from the rapid logo switching could be reduced by crossfading between the two logos over a period of time, instead of hard switching. However, the end result would still have the logo uncancelled some of the time, so it was decided not to pursue this approach. Instead, a better logo selection algorithm was sought.
One major limitation of the original algorithm was that it was examining the pixels contained within the logo, but no attention was given to the non-logo pixels surrounding it. In effect, it was looking at the image through a mask in the shape of the logo. If, for example, the system were programmed with two logos, both solid white and one completely contained within the other, both logos would return a strong positive score.
At this point, the logo cancelling algorithm was considered sufficiently well proven to commit to a hardware design. Provision would be made for further development of the recognition algorithm within the hardware design if possible.
On to Part 2 - Hardware Design.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.