The Camera Tech Nerd’s Guide to Demosaicing


Artificial Intelligence isn’t just about post-processing and manipulation anymore; the state of the art in the discipline is transforming image capture and preprocessing. With some very rare exceptions, digital images produced today are approximate reconstructions of incomplete data collected by sensors covered by Color Filter Arrays (CFAs). To bring down cost, a single image sensor records only one color band at each pixel location, and the pattern of those single-color samples is then interpolated into an approximation of the scene’s true color. While many different CFAs have been used throughout the history of digital imaging, only two remain in common use today: the Bayer array and the X-Trans array, of which the Bayer is much more popular.

The Bayer array is organized into a repeating two-by-two grid in which two pixels capture green and one pixel each captures red and blue. Historically, these single-color samples have been fed into an interpolation operation performed on each pixel in turn, taking into account the color data of the adjacent pixels to estimate the true color. These interpolation operations have varied across the development of digital imaging, but without exception, they have always been hand-optimized.
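
As a rough illustration of the kind of hand-written interpolation described above, the sketch below implements plain bilinear demosaicing with NumPy. It assumes an RGGB layout (red on even rows and columns, blue on odd ones) and float pixel values; the function names are my own and do not come from any particular raw converter, which will use considerably more elaborate methods.

```python
import numpy as np

def bayer_masks(h, w):
    """Boolean masks for an assumed RGGB layout: R at (even, even), B at (odd, odd)."""
    rows, cols = np.indices((h, w))
    r = (rows % 2 == 0) & (cols % 2 == 0)
    b = (rows % 2 == 1) & (cols % 2 == 1)
    g = ~(r | b)
    return r, g, b

def box_sum3(a):
    """Sum over each pixel's 3x3 neighbourhood, zero-padded at the borders."""
    h, w = a.shape
    p = np.pad(a, 1)
    return sum(p[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3))

def demosaic_bilinear(mosaic):
    """Fill in the two missing colors at each pixel from same-color neighbours.

    Each missing value is the average of the known samples of that color in
    the surrounding 3x3 window, which in the image interior is ordinary
    bilinear interpolation. Assumes the mosaic is at least 2x2 pixels.
    """
    mosaic = np.asarray(mosaic, dtype=np.float64)
    h, w = mosaic.shape
    out = np.zeros((h, w, 3))
    for c, mask in enumerate(bayer_masks(h, w)):
        known = np.where(mask, mosaic, 0.0)
        count = box_sum3(mask.astype(np.float64))
        estimate = box_sum3(known) / count
        out[..., c] = np.where(mask, mosaic, estimate)  # measured samples stay untouched
    return out
```

Fed the mosaic of a flat, evenly colored scene, this returns that color exactly at every pixel; the trouble only starts once the scene contains edges.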

Hand optimization is limited by the capabilities and imagination of the imaging scientists involved, but also limited by the speed at which they can iterate, test, and improve their technique. Research takes time, and the pace of it is pretty well demonstrated by Adobe’s Camera Raw demosaic process, which is only in its fifth iteration despite being nearly 20 years old.

The hand-crafted algorithms in use today are good. After all, nearly every digital image ever produced makes use of them, and in smooth, low-frequency areas of an image these interpolation techniques are very accurate. Where they fail, however, is in high-frequency areas of the image, or areas with a significant number of sharp angles or edges. The cause is intuitive. Imagine the edge of a roof against a background of blue sky, where the edge perfectly divides a Bayer block through the middle. Two pixels on the roof capture green and red, and two pixels on the sky capture green and blue. Using traditional interpolation methods, the pixels over the roof might pull from the adjacent pixels capturing the blue of the sky and introduce an unrealistic approximation of the true colors in the roof shingles.
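
To put numbers on the roof example, here is a small worked calculation. The color values are invented for illustration, and the interpolation is assumed to be a simple average of the four nearest blue samples, which sit on the diagonals of a red-filtered pixel.

```python
# Illustrative blue-channel values on a 0..1 scale; assumptions, not measurements.
ROOF_BLUE = 0.1   # reddish-brown shingles contain little blue
SKY_BLUE = 0.9    # the sky contains a lot

def blue_estimate_at_roof_edge(roof_blue, sky_blue):
    """Average the four diagonal blue samples around a red-filtered roof pixel.

    The pixel sits on the last roof row, so the two blue samples in the row
    above still see the roof while the two in the row below already see sky.
    """
    neighbours = [roof_blue, roof_blue, sky_blue, sky_blue]
    return sum(neighbours) / len(neighbours)

estimate = blue_estimate_at_roof_edge(ROOF_BLUE, SKY_BLUE)
error = estimate - ROOF_BLUE
```

With these values the estimate lands at roughly 0.5, about five times the blue that is really in the shingles, which is the kind of fringe a fixed interpolation rule cannot avoid.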

In handcrafted demosaicing, the algorithm is predetermined and therefore inherently static, unable to account for the image type or for the specific features of any given image. The result is common artifacts that most photographers are likely to be aware of, including zippering and moiré. The challenges compound significantly in conditions of high noise. Imagine now that in the sky there is a hot pixel where there ought to be a blue pixel. In this scenario, the array block only has data on two of three primary colors, and only on three of four pixels, meaning that only a very small fraction of the color data for each of the four pixels is available. Some cameras build denoising into their RAW images, but generally speaking, denoising is part of the post-processing pipeline and leads to a somewhat significant loss of image detail. While this article pertains most closely to personal and commercial photography, the limits of the demosaicing process are even more restrictive in scientific imaging, where it’s necessary to have both visually identifiable features and factually accurate reproductions.

The solution to slow iterative innovation and unadaptive interpolation techniques seems within reach today owing to novel developments in deep learning and adversarial neural networks. AI imaging research conducted over the last decades is available in many tools including DxO’s PureRAW, Topaz’s Gigapixel AI, and more recently Lightroom’s Super Resolution, but these tools have typically focused on improving images that are already demosaiced and rasterised. The most exciting developments in this discipline are trained models that promise to account for the type, content, and noise of an image and produce full-color outputs that are both mathematically and perceptually superior to the interpolation techniques that preceded them. The vast majority of novel research in the field works with convolutional neural networks, and while this sounds complex, the process is relatively straightforward to understand in practice. The technique involves compressing the image into a four-channel-per-pixel image, in a process vaguely reminiscent of the one smartphones use with their quad-Bayer sensors, and then reconstructing the full color on this lower-resolution proxy image. This stage is referred to as convolution. The output of the convolution stage is later “upscaled,” stacked with the full-resolution but color-incomplete Bayer mosaic of the image, and deconvoluted into a full-resolution image.
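
The “four channels per pixel” compression is easier to picture in code. The sketch below, assuming an RGGB mosaic with even dimensions, simply regroups each two-by-two block into one half-resolution pixel with four planes (R, G1, G2, B). The function names are my own, and the rearrangement involves no learned weights; in a real network it is only the input layout that the convolutional layers then operate on.

```python
import numpy as np

def pack_rggb(mosaic):
    """Rearrange an (H, W) RGGB mosaic into an (H/2, W/2, 4) array of R, G1, G2, B planes."""
    h, w = mosaic.shape
    if h % 2 or w % 2:
        raise ValueError("height and width must both be even")
    return np.stack(
        [
            mosaic[0::2, 0::2],  # R:  even rows, even columns
            mosaic[0::2, 1::2],  # G1: even rows, odd columns
            mosaic[1::2, 0::2],  # G2: odd rows, even columns
            mosaic[1::2, 1::2],  # B:  odd rows, odd columns
        ],
        axis=-1,
    )

def unpack_rggb(packed):
    """Inverse of pack_rggb: scatter the four planes back into a full-size mosaic."""
    h2, w2, _ = packed.shape
    mosaic = np.empty((h2 * 2, w2 * 2), dtype=packed.dtype)
    mosaic[0::2, 0::2] = packed[..., 0]
    mosaic[0::2, 1::2] = packed[..., 1]
    mosaic[1::2, 0::2] = packed[..., 2]
    mosaic[1::2, 1::2] = packed[..., 3]
    return mosaic
```

Because the packing is lossless and reversible, the network gives up nothing by working at half resolution; it just sees every color of a block at a single location.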

The process is akin to constructing a proxy that has lower spatial resolution but full color resolution for any given area of the image. Each model handles this differently, but the common fundamental of many of them is that they then map this full-color information onto the full spatial resolution of the image using the secret magic of deep learning. The ability of deep learning methods to outperform human efforts is easily attributed to the number of steps they can perform in their demosaic process. The most sophisticated prior art produced by human engineers had a maximum of 20 steps, whereas (in theory) a trained model can be scaled indefinitely or, more practically, contain some hundreds of layers if the training process should dictate as much. Moreover, human engineers work to optimize against certain visual artifacts, including moiré and zippering, but a training process can go through thousands of iterations on hundreds of kinds of images, developing an intimate “understanding” of the casuistry involved in producing some image features given certain demosaicing processes.

The real magic of deep learning-driven demosaicing is that it can perform kinds of transformation that were only just being explored in hand-developed demosaicing techniques, alongside a variety of other known and novel techniques. One such technique is referred to as compressive demosaicing. It uses the transformation of the color space from RGB to YUV (commonly used in video), alongside sophisticated compression algorithms, to exploit the complex inter-channel and inter-pixel relationships and produce an over-complete color space that contains more information than would be necessary for the final RGB output. This technique is slightly beyond my ability to fully explain, but suffice it to say, the process is impossible or impractical with respect to human development and offers the ability to perform a demosaic no man could hope to replicate. Similarly, deep learning techniques are better able to account for noise within images. The training process can include a variety of images at different noise levels, so that the model learns their impact on different image features, especially edges, and produces a superior outcome. Crucially, and unlike established demosaic techniques, the model can be adaptive and apply a different strategy to images with different amounts or patterns of noise. Controlling for noise in the demosaic allows it to be accounted for much earlier in the imaging pipeline, before it can have as significant an effect on the visible image. Research on the subject also finds promise in the ability to deploy this noise control within the camera’s Image Signal Processor (ISP), even before a file is written. Handling noise this way has even more significant promise than attempting to mitigate its impact within already lossy image files.
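
The compressive method itself is well beyond a short sketch, but the color transform it starts from is simple enough to show. The snippet below converts RGB to YUV using the analogue weights derived from the BT.601 luma coefficients; this is an assumption on my part, and the cited work may well use a different variant of the transform.

```python
import numpy as np

def rgb_to_yuv(rgb):
    """Convert (..., 3) RGB values to YUV using BT.601-derived weights (assumed variant)."""
    rgb = np.asarray(rgb, dtype=np.float64)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b   # luma: a weighted sum of all three channels
    u = 0.492 * (b - y)                     # blue-difference chroma
    v = 0.877 * (r - y)                     # red-difference chroma
    return np.stack([y, u, v], axis=-1)
```

A neutral grey pixel maps to zero in both chroma channels, which hints at why the representation is attractive: most of an image’s detail ends up in Y, while U and V carry comparatively little of it.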

My description of these novel techniques may thus far make them seem uniformly superior to human-developed interpolation techniques, but there are quite a few downsides and catches. Chief amongst these is that many of these demosaicing techniques are very resource intensive and simply impractical for a wide variety of devices (especially phones). While these techniques have the potential to be incorporated into hardware-accelerated ISP pipelines, this isn’t guaranteed and in any case is very far away in terms of timeline. Their scope and generalisability are also likely to be impacted in a hardware-integrated and accelerated pipeline. Depictions of the process typically use a Bayer pattern in their examples, and that is instructive. In many models, the convolutional downsampling stage isn’t possible with X-Trans sensors because their filter arrays are six-by-six, which would require too aggressive a downsampling to work properly. Some models under research claim that additional steps introduced early in the model allow generalization, but there is some dispute about the veracity of these claims, and they apply only to some models and some techniques.

The most egregious fault of these models is that their performance is primarily synthetic. While the results are truly astounding in laboratory use, the metrics used to train these models don’t always reflect human visual perception, or simply cannot account for many types of demosaicing artifacts in a way that reflects their impact on human viewers. Several researchers have argued that widely used historical image comparison metrics, including L2 and PSNR, are not much affected by artifacts, primarily owing to their general rarity in the total image, and that using them to train a model is therefore very challenging. Much research has also argued that moiré is similarly not well reflected in, for example, PSNR, which complicates the value of such metrics in comparing the perceptual image quality of different techniques. Even attempts to control for the limitations of automated image comparison metrics through human trials or testing have produced lackluster results of limited applicability. Trained or photographically experienced testers have different tendencies and search for different markers of quality than untrained or lay testers, which generally skews the data. Models that have been optimized to sharpen during the demosaic also score higher because of perceived visual quality, despite being mathematically less accurate, with the additional downside of limiting control in the post-processing pipeline.
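
A small numerical experiment makes the point about rarity concrete. The sketch below uses NumPy and entirely synthetic data (a flat grey image, with error sizes picked by me) to compare a faint error spread over every pixel with a severe error confined to a thin stripe.

```python
import numpy as np

def psnr(reference, test, peak=1.0):
    """Peak signal-to-noise ratio in dB, computed from the mean squared (L2) error."""
    mse = np.mean((reference - test) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(peak ** 2 / mse))

truth = np.full((256, 256), 0.5)   # flat grey "ground truth", synthetic

# Case A: a faint brightness offset of 1/32 on every single pixel.
uniform = truth + 1.0 / 32.0

# Case B: perfect everywhere except a 4 x 64 pixel stripe that is off by 0.5.
stripe = truth.copy()
stripe[100:104, 50:114] += 0.5

psnr_uniform = psnr(truth, uniform)
psnr_stripe = psnr(truth, stripe)
```

Both cases come out at about 30.1 dB. The metric cannot tell a slight overall brightening from a glaring white streak across the frame, because the streak touches only 256 of the 65,536 pixels.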

Image comparison tests also provide an inaccurate baseline because many of the images used to represent ground truth are themselves already demosaiced. For example, a commonly used reference set of images comes from Kodak. The original images are demosaiced digital images with the expected and commensurate artifacts. Researchers then typically perform a synthetic remosaicing on them, effectively deleting two of the three color channels at each location and introducing noise based on some distribution.
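
The synthetic remosaicing step is short enough to sketch. The version below assumes an RGGB layout and plain Gaussian noise; both are my assumptions, since individual papers choose their own layouts and noise models.

```python
import numpy as np

def remosaic_rggb(rgb, noise_sigma=0.0, seed=0):
    """Turn a full-color (H, W, 3) image into a single-channel RGGB mosaic.

    Only one color sample is kept per pixel, so two of the three channels
    are discarded everywhere. Gaussian noise (an assumed model) is optional.
    """
    rgb = np.asarray(rgb, dtype=np.float64)
    h, w, _ = rgb.shape
    mosaic = np.empty((h, w))
    mosaic[0::2, 0::2] = rgb[0::2, 0::2, 0]   # red sites
    mosaic[0::2, 1::2] = rgb[0::2, 1::2, 1]   # green sites on red rows
    mosaic[1::2, 0::2] = rgb[1::2, 0::2, 1]   # green sites on blue rows
    mosaic[1::2, 1::2] = rgb[1::2, 1::2, 2]   # blue sites
    if noise_sigma > 0:
        rng = np.random.default_rng(seed)
        mosaic = mosaic + rng.normal(0.0, noise_sigma, mosaic.shape)
    return np.clip(mosaic, 0.0, 1.0)
```

Note what this does not model: whatever demosaic artifacts were already baked into the input survive in the samples that are kept, and the noise is identical in character at every pixel.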

Neither the remosaicing nor the introduced noise is reflective of a raw file, because artifacts from the initial demosaicing bias the process, and the noise patterns, even with two or more distribution algorithms combined, are still far too regular. Just as critically, the images used with these models are of relatively low resolution, which skews many of the models toward low-resolution images, whose level of detail relative to the natural scene differs from that of a high-resolution capture. The size of the images also conceals the complexity of the models and the time necessary to demosaic effectively. That time does not scale linearly with side resolution, but usually as its square (or worse), meaning that demosaicing strategies that appear in research as only moderately slower for much higher quality results can actually take much longer and require far more resources in practice, however good their rendition.
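
The scaling argument is simple arithmetic, sketched below. The image sizes are hypothetical, and the model assumes cost grows with pixel count raised to some exponent, where 1.0 means a constant cost per pixel.

```python
def cost_ratio(small, large, exponent=1.0):
    """Relative processing cost of two (width, height) sizes under an assumed power law."""
    small_pixels = small[0] * small[1]
    large_pixels = large[0] * large[1]
    return (large_pixels / small_pixels) ** exponent

test_image = (768, 512)     # hypothetical research-sized image, about 0.4 MP
camera_raw = (7680, 5120)   # hypothetical sensor with ten times the side length, about 39 MP

per_pixel = cost_ratio(test_image, camera_raw)             # constant cost per pixel
superlinear = cost_ratio(test_image, camera_raw, 1.2)      # "or worse"
```

Ten times the side length means a hundred times the pixels, so a model that takes one second on the test image needs well over a minute and a half on the camera file even in the best case, and longer still if the cost per pixel grows with image size.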

Despite the limitations and complexities of deep learning demosaicing techniques, they offer the opportunity for three new kinds of processes that were never possible before their conception: joint demosaicing and denoising, joint demosaicing and super-resolution, and spatially varying exposure (SVE) based high dynamic range imaging. Observant readers may have noticed that the deconvolution process includes what is effectively an upscaling process within its broader demosaicing strategy. As mentioned, demosaicing and denoising are traditionally performed sequentially for reasons of modularity and general complexity, but this can cause error accumulation. Noise present in the sensor readout interferes with the demosaic stage by further limiting the already scant availability of data, and after demosaicing this noise pattern continues to be present but much more irregularly and non-linearly, which interferes with the ability of denoising algorithms to perform as well as they might under less inconsistent circumstances. Performing the demosaicing and super-resolution processes together in the same step likewise has the advantage of minimizing the risk of compounding errors at each stage. Typically, the demosaicing process will introduce some errors, be they edge or color errors, and the super-resolution (or upscaling) process will then compound those errors or introduce new ones while attempting to rectify some others. The advantage of using an integrated model is that it can learn the complete end-to-end mapping from the original RGGB (Bayer pattern) data to a full-resolution image, and can do so in full-color, three-channel resolution. These models tend to do so in a way similar to the convolution and deconvolution steps in a simple demosaic process.

Traditional super-resolution techniques take the luminance channel, scale it up, and then overlay the color data on top, whereas novel techniques map a low-resolution Bayer image onto a high-resolution color image that the pipeline has constructed, and then attempt to break the image down into blocks to be reconstructed into a more true-to-life full-resolution image. The training data for these methods tends to go to great pains to reduce the impact of demosaic artifacts on the trained model by downscaling a 16 MP image to a 4 MP “ground truth” and performing a further 1/4 downscale to act as the “before,” which will then be scaled to the “ground truth.” There is dispute in the literature about the efficacy of this downscale process and further concerns about how capable the models will be in upscaling very high-resolution images. Very recent works have begun to build a dataset that relies on pixel shift images with full-color resolution and no demosaic artifacts, but as far as I am aware, no significant models have been trained on this nascent dataset, and no major critique has been produced that can be of use in understanding any benefits (or not) that might derive from a more complete and higher-resolution dataset.
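
The two-stage downscale used to build training pairs can be sketched as follows. I am assuming a plain two-by-two box average for each stage; the papers may use other resampling filters, and the function names are hypothetical.

```python
import numpy as np

def box_downscale2(img):
    """Halve each side of an (H, W, C) image by averaging 2x2 blocks (1/4 the pixels)."""
    h, w, c = img.shape
    h2, w2 = h // 2, w // 2
    img = img[: h2 * 2, : w2 * 2]                 # drop an odd trailing row or column
    return img.reshape(h2, 2, w2, 2, c).mean(axis=(1, 3))

def make_training_pair(full_res):
    """Return (low_res_input, ground_truth) from one full-resolution image.

    A 16 MP source yields a 4 MP ground truth and a 1 MP input, so the
    model is asked to undo exactly one quartering of the pixel count.
    """
    ground_truth = box_downscale2(full_res)
    low_res = box_downscale2(ground_truth)
    return low_res, ground_truth
```

Averaging blocks of pixels dilutes whatever demosaic artifacts the source contained, which is the motivation for the first downscale. It also means the model only ever sees detail at a fraction of the sensor’s native resolution.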

The most exciting use of deep learning for demosaicing is actually its potential to be applied to much more complex patterns, including a newer but familiar technology referred to as spatially varying exposure (SVE) based high dynamic range imaging (HDRI). This technology borrows from the fundamentals of the exposure stacking we are all familiar with in smartphone photography, in which several images are taken in rapid succession and then overlaid to deliver a more “well exposed” image than would otherwise be possible with the limited dynamic range of smaller sensors. The fundamental problem with exposure stacking, however, is that the images are collected sequentially and therefore vary minutely in content. The movement of a subject or the hand of the photographer can be enough to meaningfully impact the quality of the capture once the stacking process is performed. SVE is a technique that takes advantage of higher-resolution sensors to vary the exposure of an image line by line across the sensor. The technique can therefore capture two levels of luminance data at once, but has the downside of bisecting the mosaic pattern and substantially complicating the demosaic process. Deep learning in this case can perform the same kind of pattern matching and deconvolution in order to decipher the sensor data and produce a usable image. This process is massively complex because many parts of an image are naturally underexposed due to variations in lighting, and some discrimination in the intended outcome becomes necessary. To perform this process, the model will fabricate different patterns from combinations of the available single-channel pixels and combine them until a well-exposed, full-color-resolution image is produced.
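
To show why SVE complicates matters, the sketch below simulates a sensor whose rows alternate between a short and a long exposure and then naively normalises the result. The band height of two rows (so that each band contains a complete Bayer row pair), the 4x exposure ratio, and the saturation threshold are all assumptions of mine, and the names are hypothetical.

```python
import numpy as np

def sve_capture(radiance, long_gain=4.0, rows_per_band=2):
    """Simulate row-interleaved exposures on an (H, W) sensor that clips at 1.0.

    Bands of rows alternate between a short exposure (gain 1) and a long
    exposure (gain long_gain). Returns the raw readout and the per-pixel gain.
    """
    h, w = radiance.shape
    band = (np.arange(h) // rows_per_band) % 2            # 0 = short, 1 = long
    gain = np.where(band == 1, long_gain, 1.0)[:, None]   # one gain per row
    raw = np.clip(radiance * gain, 0.0, 1.0)
    return raw, np.broadcast_to(gain, (h, w))

def normalise_exposure(raw, gain, saturation=0.999):
    """Divide out the gain and mark clipped samples as missing (NaN)."""
    estimate = raw / gain
    return np.where(raw >= saturation, np.nan, estimate)
```

In dark regions every row yields a usable value, but in bright regions the long-exposure rows clip and drop out entirely. The mosaic then has missing rows on top of missing colors, and filling those holes plausibly is the job handed to the trained model.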

Deep learning and, more broadly, “AI” have the potential to significantly improve image preprocessing and can help to drive the next generation of imaging improvements even as sensor technology seems to have stalled. Adobe’s new tools for joint debayering and super-resolution have been incredible successes for the company and widely lauded for the quality of their performance. With time, performance is likely to improve, but as we can already see, these tools are slower and more complex, and they will not always yield improvements dramatic enough to justify their use.

Citations

Chen, Honggang, Xiaohai He, Linbo Qing, Yuanyuan Wu, Chao Ren, Ray E. Sheriff, and Ce Zhu. ‘Real-World Single Image Super-Resolution: A Brief Review’. Information Fusion 79 (3 January 2022): 124–45.

Chen, Yu-Sheng, and Stephanie Sanchez. Machine Learning Methods for Demosaicing and Denoising. Accessed 22 May 2023. http://stanford.edu/class/ee367/Winter2017/Chen_Sanchez_ee367_win17_report.pdf.

Ehret, Thibaud, and Gabriele Facciolo. ‘A Study of Two CNN Demosaicking Algorithms’. Image Processing On Line 9 (9 May 2019): 220–30.

Gharbi, Michaël, Gaurav Chaurasia, Sylvain Paris, and Frédo Durand. ‘Deep Joint Demosaicking and Denoising’. ACM Trans. Graph. 35, no. 6 (11 November 2016): 1–12.

Kwan, Chiman, Bryan Chou, and James Bell III. ‘Comparison of Deep Learning and Conventional Demosaicing Algorithms for Mastcam Images’. Electronics 8, no. 3 (3 November 2019): 308.

Longere, P., Xuemei Zhang, P.B. Delahunt, and D.H. Brainard. ‘Perceptual Assessment of Demosaicing Algorithm Performance’. Proc. IEEE 90, no. 1 (January 2002): 123–32.

Luo, Jingrui, and Jie Wang. ‘Image Demosaicing Based on Generative Adversarial Network’. Mathematical Problems in Engineering 2020 (16 June 2020): 1–13.

Moghadam, Abdolreza Abdolhosseini, Mohammad Aghagolzadeh, Mrityunjay Kumar, and Hayder Radha. ‘Compressive Demosaicing’. In 2010 IEEE International Workshop on Multimedia Signal Processing, 105–10. Saint-Malo, France: IEEE. 2010.

Tang, Jie, Jian Li, and Ping Tan. ‘Demosaicing by Differentiable Deep Restoration’. Applied Sciences 11, no. 4 (January 2021): 1649.

Verma, Divakar, Manish Kumar, and Srinivas Eregala. ‘Deep Demosaicing Using ResNet-Bottleneck Architecture’. In Computer Vision and Image Processing, edited by Neeta Nain, Santosh Kumar Vipparthi, and Balasubramanian Raman, 1148:170–79. Communications in Computer and Information Science. Singapore: Springer Singapore. 2020.

Xing, Wenzhu, and Karen Egiazarian. ‘End-to-End Learning for Joint Image Demosaicing, Denoising and Super-Resolution’. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3507–16. 2021.

Xu, Xuan, Yanfang Ye, and Xin Li. ‘Joint Demosaicing and Super-Resolution (JDSR): Network Design and Perceptual Optimization’. IEEE Transactions on Computational Imaging 6 (2020): 968–80.

Xu, Yilun, Ziyang Liu, Xingming Wu, Weihai Chen, Changyun Wen, and Zhengguo Li. ‘Deep Joint Demosaicing and High Dynamic Range Imaging Within a Single Shot’. IEEE Transactions on Circuits and Systems for Video Technology 32, no. 7 (July 2022): 4255–70.

Zhang, Tao, Ying Fu, and Cheng Li. ‘Deep Spatial Adaptive Network for Real Image Demosaicing’. AAAI 36, no. 3 (28 June 2022): 3326–34.

Zhou, Ruofan, Radhakrishna Achanta, and Sabine Süsstrunk. ‘Deep Residual Network for Joint Demosaicing and Super-Resolution’. arXiv:1802.06573. arXiv. 19 February 2018.
