A Beautiful Introduction to Video Processing
Video is a combination of images and audio, so as a general rule whatever theory and applications are valid for those are usually valid for video as well. Video consists of a set of still images called frames, which are displayed to the user one after another at a specific speed called the frame rate, measured in frames per second (fps). If the frames are displayed at a fast enough rate, the human eye cannot distinguish the individual images as separate entities but merges them together, creating an illusion of moving images. This phenomenon is called persistence of vision (PoV). It has been observed that the frame rate should be around 25 to 30 fps for perceiving smooth motion without gaps or jerks. Audio is added and synchronized with the apparent movement of images to create a complete video sequence. A video file therefore consists of multiple images and one or more audio tracks. One disadvantage of handling so much information together is an increase in file size and, consequently, in the processing resources needed to handle it. For example, a one-minute video file running at 30 frames per second, with each frame 640 by 480 pixels in size and using 24-bit color information, takes up more than 1582 MB of space. Audio sampled at 44,100 Hz (16-bit stereo) adds about another 10 MB to the file every minute. Moreover, playback of the video file requires a sustained bandwidth of roughly 27 MB/s. Compression schemes are therefore very important for video to handle such large overheads.
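The storage figures quoted above can be checked with a short calculation. The following is a minimal sketch in Python, not part of the original text: the function names are hypothetical, the frame rate is taken to be 30 fps, the audio is assumed to be 16-bit stereo, and 1 MB is assumed to mean 2^20 bytes.

```python
def video_bytes(width, height, bits_per_pixel, fps, seconds):
    """Size in bytes of the uncompressed frames of a video clip."""
    bytes_per_frame = width * height * bits_per_pixel // 8
    return bytes_per_frame * fps * seconds


def audio_bytes(sample_rate, bits_per_sample, channels, seconds):
    """Size in bytes of uncompressed PCM audio."""
    bytes_per_second = sample_rate * bits_per_sample // 8 * channels
    return bytes_per_second * seconds


MIB = 2 ** 20  # assumption: "MB" in the text means 2**20 bytes

# One minute of 640 x 480, 24-bit video at an assumed 30 fps.
video = video_bytes(640, 480, 24, fps=30, seconds=60)
# One minute of 44,100 Hz audio, assumed 16-bit stereo.
audio = audio_bytes(44100, 16, channels=2, seconds=60)

print(f"video per minute: {video / MIB:.1f} MB")       # 1582.0 MB
print(f"audio per minute: {audio / MIB:.1f} MB")       # 10.1 MB
print(f"video data rate:  {video / 60 / MIB:.1f} MB/s")  # 26.4 MB/s
```

Under these assumptions the video and audio sizes match the figures in the text, and the playback data rate comes out slightly below the rounded bandwidth figure quoted there.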
For creating digital video, we first of all need the visual and audio information to be recorded in the form of electrical signals onto magnetic tapes or disks. The term for this form of representation is motion video, to distinguish it from another kind of representation called motion picture, used in cinema theaters, where video frames are recorded onto celluloid film using a photo-chemical process. Motion video in the form of electrical signals is generated by an analog video camera, stored on magnetic tapes like video cassettes, and later played back using a video cassette player (VCP). TV transmission is also a popular example of motion video display. Earlier generation analog video cameras used vacuum tubes called cathode ray tubes (CRT) to generate these signals, which could then be fed to a monitor to display the video, while audio was recorded separately using a microphone and fed to loudspeakers for generation of sound. Monochrome or grayscale video required a single intensity signal from the camera for the visual information and one or two audio signals, depending on whether the sound being played was mono or stereo. For displaying an image on a CRT monitor screen, the electron beam from the cathode is activated and focused on the phosphor-coated screen to emit light. Phosphor is a chemical substance which emits a glow of light when it comes in contact with charged particles like electrons. To generate an image on the screen, the electron beam starts at the upper-left corner of the screen and sequentially traces over the first row of phosphor dots from left to right. At the end of each horizontal line, the beam moves diagonally to the beginning of the next row and starts the tracing operation again. At the lower-right corner, the beam moves diagonally back to the starting point at the upper-left corner and repeats the whole operation.
This process is called raster scanning and is usually completed about 60 times each second for a steady picture on screen. This rate is called the refresh rate of the monitor, and each image produced on the screen is called a frame. A monitor which supports 60 frames per second produces a non-flickering image and is called a progressive scan monitor. An alternative technique, used especially for monitors with lower refresh rates, is called interlacing, and the corresponding monitor is referred to as an interlaced scan monitor. In this case, one frame is split into two halves, each called a field. The first field, made up of the odd-numbered rows, is called the odd field, and the second field, made up of the even-numbered rows, is called the even field. Each field contains only half the number of rows. Fields are scanned at a rate of 60 per second, so complete frames are delivered at an effective rate of 30 per second. Due to PoV, this arrangement leads to a smooth blending of the rows of the two fields and helps to produce a non-flickering image even at low refresh rates. Notations which include the letters "p" and "i" are used to distinguish between progressive and interlaced monitors, e.g., 720p and 1080i, the number denoting the total number of horizontal rows in the monitor.
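The splitting of a frame into two fields can be illustrated with a few lines of code. This is only a sketch of the idea, not how display hardware is implemented: a frame is modeled as a Python list of rows, and the function names `split_fields` and `weave_fields` are hypothetical.

```python
def split_fields(frame):
    """Split a frame (a list of rows) into its odd and even fields.

    Rows are numbered from 1, as in the text, so the odd field holds
    rows 1, 3, 5, ... which are list indices 0, 2, 4, ...
    """
    odd_field = frame[0::2]
    even_field = frame[1::2]
    return odd_field, even_field


def weave_fields(odd_field, even_field):
    """Interleave an odd and an even field back into a full frame."""
    frame = []
    for i, row in enumerate(odd_field):
        frame.append(row)
        if i < len(even_field):
            frame.append(even_field[i])
    return frame


# A toy frame with six rows, each row represented by a label.
frame = ["row1", "row2", "row3", "row4", "row5", "row6"]
odd, even = split_fields(frame)
print(odd)                               # ['row1', 'row3', 'row5']
print(even)                              # ['row2', 'row4', 'row6']
print(weave_fields(odd, even) == frame)  # True
```

Each field carries half the rows of the frame, and two consecutive fields together reconstruct one complete frame.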
New generation video cameras replace CRTs with electronic photo-sensors called charge coupled devices (CCD), which generate electrical signals roughly proportional to the intensity of light falling on them. Signals from a CCD array are collected sequentially and sent to a monitor for display. Modern monitors use liquid crystal display (LCD) elements instead of a CRT and electron beam. LCD elements are small transparent blocks filled with a liquid organic chemical substance consisting of long rod-like molecules, which can manipulate the direction of light rays flowing through the substance. LCD elements, along with a pair of polarizing filters, allow light to flow from a backlight source like an LED to the observer in front, creating the perception of a lighted pixel. Current flowing through the LCD elements changes the orientation of the molecules and prevents the light from reaching the observer. Switching specific dots on and off helps to create an image on screen. In the case of color video cameras, three separate RGB signals, corresponding to the primary colors red, green, and blue, are used to create composite colors on screen. These signals are fed to a color monitor using three separate cables, a scheme which came to be known as component video. Inside the monitor, these signals are used to activate a color generation system like CRT electron guns or LCD elements for color reproduction.
RGB signals for color reproduction worked well when transmitted over short distances, typically a few meters. However, when these signals needed to be transmitted over large distances spanning several kilometers, as in TV transmission, engineers ran into a separate set of problems. First, three separate copper cables running over several kilometers made the system costly. Second, even if the cost was ignored, the three separate signals transmitted along the three cables did not arrive at exactly the same instant at the receiving end due to differences in attenuation factors, with the result that the images were frequently out of sync and distorted. Third, when color TV transmission started in several countries, the earlier monochrome system of black & white (B/W) TV continued side by side, so the engineers had to come up with a system in which the same transmitted signals could cater to both B/W and color TV sets at the same time, which was not possible using the existing RGB signal format. To deal with all these problems, a new signal format called composite video was developed. Instead of the RGB signals, it used a different form of signal called YC signals.
Here, Y indicates the luminance or intensity signal and C indicates the chrominance or color signal. The advantages of this format over the RGB format included the fact that both Y and C could be transmitted over a single cable or channel, and therefore arrived at exactly the same instant at the receiving end. This was done by splitting the typical 6 MHz bandwidth of a TV channel into three parts: 4 MHz (0 to 4 MHz) was allotted to Y, 1.5 MHz to C, and the remaining 0.5 MHz to audio. The reason for this unequal distribution is that the human eye is more sensitive to the Y information and less sensitive to the C information. Another advantage of the YC signal format was that the Y part of the signal alone could be used to cater to B/W TV sets, while both Y and C could be used to cater to color TV sets. This meant that only a single transmission system was required, and a filter at the receiving end could be used to remove the color signal for monochrome viewing. Because a single transmission cable was used, the cost of the system could also be reduced.
The Y signal representing the grayscale intensity can be computed as a linear combination of the RGB signals. After experimenting with a number of combinations, and keeping in mind that the human eye is more sensitive to the green part of the color spectrum, it was finally decided that Y would be composed of roughly 60% of G, 30% of R, and 10% of B, resulting in the relation:
Y = 0.299R + 0.587G + 0.114B
The color information is represented using a circular scale instead of a linear scale, so two components are needed to identify a specific color value as a point on a plane, which is the color wheel. The color sub-components are called Cb and Cr and are defined as color differences, with Cb proportional to B − Y and Cr proportional to R − Y.
Accordingly, the composite video format, requiring a single cable or channel for transmission, is specifically referred to as using the YCbCr signal format. Most of the analog video equipment used today for transmitting video signals uses the composite video cable and connector for interfacing, e.g., between a VCP and a TV.
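A conversion between RGB and YCbCr can be sketched in a few lines. The luminance weights below are those of the relation for Y given above. The scaling of the two color-difference components is an assumption that follows the common ITU-R BT.601 convention, which maps Cb and Cr to the range −0.5 to 0.5 for RGB values between 0 and 1; the function names are hypothetical.

```python
# Luminance weights from the relation Y = 0.299R + 0.587G + 0.114B.
KR, KG, KB = 0.299, 0.587, 0.114


def rgb_to_ycbcr(r, g, b):
    """Convert normalized RGB (0..1) to Y (0..1) and Cb, Cr (-0.5..0.5).

    The divisors are an assumed BT.601-style scaling, not taken from the text.
    """
    y = KR * r + KG * g + KB * b
    cb = (b - y) / (2 * (1 - KB))  # scaled blue difference
    cr = (r - y) / (2 * (1 - KR))  # scaled red difference
    return y, cb, cr


def ycbcr_to_rgb(y, cb, cr):
    """Invert rgb_to_ycbcr under the same assumed scaling."""
    r = y + 2 * (1 - KR) * cr
    b = y + 2 * (1 - KB) * cb
    g = (y - KR * r - KB * b) / KG
    return r, g, b


for name, rgb in [("white", (1.0, 1.0, 1.0)),
                  ("red", (1.0, 0.0, 0.0)),
                  ("mid gray", (0.5, 0.5, 0.5))]:
    y, cb, cr = rgb_to_ycbcr(*rgb)
    print(f"{name:8s} Y={y:.3f} Cb={cb:+.3f} Cr={cr:+.3f}")
```

For neutral colors such as white and gray, both color-difference components are approximately zero, which is why a monochrome receiver can use Y alone and ignore the color signal.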
Conversion of RGB to the YC signal format has another important advantage: the possibility of using less bandwidth by reducing color information. Studies have shown that the human eye is more sensitive to luminance (brightness) information than to chrominance (color) information. This finding is exploited to reduce color information during video transmission, a process referred to as chroma sub-sampling. The reduction in color information is denoted by a set of three numbers expressed as a ratio A:B:C. The numbers denote the amount of luminance and chrominance information within a window on the screen, usually 4 pixels wide and 2 pixels high. A common value is 4:2:2, which implies that within a sliding 4 × 2 window, there are 4 pixels containing Y information along the first row, 2 pixels of C information along the first row, and 2 pixels of C information along the second row. Essentially, this means that while all the pixels contain brightness information, color information is reduced to half along the horizontal direction. Other values frequently used are 4:1:1 and 4:2:0. The former indicates one-fourth of the color information horizontally, while the latter indicates half the color information along both the horizontal and vertical directions. A value of 4:4:4 indicates no reduction in color information.
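The effect of the different sub-sampling ratios on one chroma plane (Cb or Cr) can be sketched as follows. This is an illustrative assumption rather than a method given in the text: each block of chroma samples is replaced by its average, although real systems may simply drop samples or use other filters, and the names `FACTORS` and `subsample_chroma` are hypothetical.

```python
# Horizontal and vertical reduction factors for each scheme in the text.
FACTORS = {
    "4:4:4": (1, 1),  # no reduction
    "4:2:2": (2, 1),  # half horizontally
    "4:1:1": (4, 1),  # one-fourth horizontally
    "4:2:0": (2, 2),  # half horizontally and vertically
}


def subsample_chroma(plane, scheme):
    """Average a chroma plane over blocks whose size is set by the scheme.

    plane is a list of equal-length rows; its width and height are assumed
    to be multiples of the block size.
    """
    step_x, step_y = FACTORS[scheme]
    out = []
    for top in range(0, len(plane), step_y):
        row = []
        for left in range(0, len(plane[0]), step_x):
            block = [plane[y][x]
                     for y in range(top, top + step_y)
                     for x in range(left, left + step_x)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


# One 4 x 2 window of chroma values.
cb = [[10, 20, 30, 40],
      [50, 60, 70, 80]]

for scheme in FACTORS:
    reduced = subsample_chroma(cb, scheme)
    kept = len(reduced) * len(reduced[0])
    print(scheme, reduced, f"({kept} of 8 chroma samples kept)")
```

The luminance plane is left untouched in every scheme; only the two chroma planes are reduced, so 4:2:2 keeps half of the chroma samples while 4:1:1 and 4:2:0 each keep one quarter.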
As with images and audio, compression schemes called CODECs (coder/decoder) can be used to reduce the size of digital video files. Because of the large size of video files, lossless algorithms are not used much. Lossy compression algorithms delete information from the image and audio components to reduce the size of video files. The file format in which the video is saved depends on the compression scheme used. The Windows native video file format is AVI, which is typically uncompressed. Lossy compression algorithms are associated with file formats like MPEG (MPEG-1), Windows Media Video (WMV), MPEG-4 (MP4), Apple QuickTime Movie (MOV), and 3rd Generation Partnership Project (3GPP) for mobile platforms.
Taken from "Fundamentals of IMAGE, AUDIO, and VIDEO PROCESSING using MATLAB. With Applications To PATTERN RECOGNITION" by Ranjan Parekh.