Papers

SAUCE: Asset Libraries of the Future
Storage and retrieval of production assets is vital for every modern VFX and animation facility. From the volume of assets being stored to the constantly changing variety and richness of the asset data, efficiently storing, indexing, finding and retrieving the assets you want is a growing challenge. This paper discusses some of the requirements of modern asset storage systems for VFX and animation, introducing two systems that were built to address these challenges as part of the collaborative EU funded “SAUCE” project; DNEG’s search and retrieval framework, and Foundry’s back-end asset storage. It also presents example use cases of the asset library from Filmakademie’s experiments in virtual production, demonstrating more artist focused and task centered systems that enable greater asset re-use.
Authors: Jonas Trottnow; William Greenly; Christian Shaw; Sam Hudson; Volker Helzle; Henry Vera; Dan Ring


Vision models fine-tuned by cinema professionals for High Dynamic Range imaging in movies
Many challenges that deal with processing of HDR material remain very much open for the film industry, whose extremely demanding quality standards are not met by existing automatic methods. Therefore, when dealing with HDR content, substantial work by very skilled technicians has to be carried out at every step of the movie production chain. Based on recent findings and models from vision science, we propose in this work effective tone mapping and inverse tone mapping algorithms for production, post-production and exhibition. These methods are automatic and real-time, and they have been both fine-tuned and validated by cinema professionals, with psychophysical tests demonstrating that the proposed algorithms outperform both the academic and industrial state-of-the-art. We believe these methods bring the field closer to having fully automated solutions for important challenges for the cinema industry that are currently solved manually or sub-optimally. Another contribution of our research is to highlight the limitations of existing image quality metrics when applied to the tone mapping problem, as none of them, including two state-of-the-art deep learning metrics for image perception, is able to predict the preferences of the observers.
Authors: Praveen Cyriac; Trevor Canham; David Kane; Marcelo Bertalmío
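The abstract above does not disclose the tone mapping operators themselves. As a purely illustrative sketch of the kind of vision-science model involved, the Naka-Rushton equation, a classic photoreceptor response model, can serve as a global tone curve (this is not the authors' algorithm, and the semi-saturation default below is an assumption):

```python
import numpy as np

def naka_rushton_tonemap(L, semi_saturation=None):
    """Map HDR luminance L to [0, 1) with the Naka-Rushton equation
    R = L / (L + sigma), a classic model of photoreceptor response.
    sigma defaults to the geometric mean luminance of the image,
    a standard estimate of the adaptation level."""
    L = np.asarray(L, dtype=np.float64)
    if semi_saturation is None:
        semi_saturation = np.exp(np.mean(np.log(L + 1e-6)))
    return L / (L + semi_saturation)

# A synthetic HDR luminance ramp spanning four orders of magnitude.
hdr = np.logspace(-1, 3, 5)
ldr = naka_rushton_tonemap(hdr)
```

The curve is monotonic and compresses highlights smoothly, which is why variants of it appear throughout the tone mapping literature.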


Matching visual induction effects on screens of different size by regularizing a neural field model of color appearance
In the film industry, the same movie is expected to be watched on displays of vastly different sizes, from cinema screens to mobile phones. But visual induction, the perceptual phenomenon by which the appearance of a scene region is affected by its surroundings, will be different for the same image shown on two displays of different dimensions. This presents a practical challenge for the preservation of the artistic intentions of filmmakers, as it can lead to shifts in image appearance between viewing destinations. In this work we show that a neural field model based on the efficient representation principle is able to predict induction effects, and how by regularizing its associated energy functional the model is still able to represent induction but is now invertible. From this we propose a method to pre-process an image in a screen-size dependent way so that its perception, in terms of visual induction, may remain constant across displays of different size. The potential of the method is demonstrated through psychophysical experiments on synthetic images and qualitative examples on natural images.
Authors: Trevor D. Canham; Javier Vazquez-Corral; Elise Mathieu; Marcelo Bertalmío

Retinal Noise Emulation: A Novel Artistic Tool for Cinema That Also Improves Compression Efficiency
In cinema it is standard practice to improve the appearance of images by adding noise that simulates film grain. This is computationally very costly, so it is only done in post-production and not on the set. It is also limiting, because artists cannot really experiment with the noise or introduce novel looks. Furthermore, video compression requires a higher bit rate when the source material has film grain or any other type of high-frequency texture. In this work, we introduce a method for adding texture to digital cinema that aims to solve these problems. The proposed algorithm is based on modeling retinal noise, so that images processed by our method have a natural appearance. This “retinal grain” serves a double purpose. One is aesthetic: its parameters allow the resulting texture appearance to be varied widely, making it an artistic tool for cinematographers. Results are validated through psychophysical experiments in which observers, including cinema professionals, prefer our method over film grain synthesis methods from academia and the industry. The other purpose of the retinal noise emulation method is to improve the quality of compressed video by masking compression artifacts, which makes it possible to lower the encoding bit rate while preserving image quality, or to improve image quality while keeping the bit rate fixed. The effectiveness of our approach for improving coding efficiency, with average bit rate savings of 22.5%, has been validated through psychophysical experiments using professional cinema content shot in 4K and color-graded, with the amount of retinal noise selected by a motion picture specialist based solely on aesthetic preference.
Authors: Itziar Zabaleta; Mateo Cámara; César Díaz; Trevor Canham; Narciso García; Marcelo Bertalmío
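The retinal noise model itself is not specified in the abstract. A minimal toy sketch, assuming signal-dependent (Poisson-like) noise whose standard deviation grows with the square root of luminance, could look as follows; the `strength` control is a hypothetical stand-in for the artistic parameters mentioned above:

```python
import numpy as np

def add_signal_dependent_noise(img, strength=0.02, rng=None):
    """Add zero-mean Gaussian noise whose standard deviation scales with
    sqrt(luminance), mimicking Poisson-like photon/retinal noise.
    `strength` is a hypothetical artistic control, not a published parameter."""
    rng = np.random.default_rng(rng)
    img = np.asarray(img, dtype=np.float64)
    sigma = strength * np.sqrt(np.clip(img, 0.0, None))
    noisy = img + rng.normal(0.0, 1.0, img.shape) * sigma
    return np.clip(noisy, 0.0, 1.0)

flat = np.full((256, 256), 0.5)   # a mid-grey plate
textured = add_signal_dependent_noise(flat, strength=0.05, rng=0)
```

Because the noise amplitude tracks luminance, shadows stay cleaner than highlights, which is one reason such texture reads as more natural than uniform grain.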


Optimized Predictive Coding of 5D Light Fields

With the emergence of Light Field (LF) technology, the number of dimensions representing light has once again increased. 4D light fields captured with additional temporal information per ray, or as assemblies of rays, include a fifth dimension, time, and thus produce 5D light fields. This is crucial when there are moving objects in the scene. In recent years, research has paved the way for several ideas on efficient 4D light field compression; however, techniques for compressing and storing higher dimensions remain an open challenge. In this paper we introduce a low-complexity predictive coding scheme for 5D light fields that automatically generates a per-frame customized coding structure exploiting both spatial and temporal neighbors. Evaluations with the HEVC codec show a quality gain of more than 1.4 dB.
Authors: Harini Priyadarshini Hariharan; Thorsten Herfet


A Versatile 5D Light Field Capture Array
In this paper, we describe a versatile light field capturing device able to generate sparse 4D LFs, LF video and 5D LF images. The capturing array has proven to be a ubiquitous tool for the experimental generation of light fields and for the development of post-processing algorithms that provide so-called LF assets, which can be re-used and re-purposed in creative environments.
Authors: Kelvin Chelli; Tobias Lange; Thorsten Herfet; Marek Solony; Pavel Smrz; Martin Alain; Aljosa Smolic; Jonas Trottnow; Volker Helzle


The persistent influence of viewing environment illumination color on displayed image appearance
Chromatic adaptation under the competing influences of emissive displays and ambient illumination is a little-studied topic in color management, given its influence on displayed image appearance. An experiment was conducted to identify the degree to which observers adapt to the white point of natural images on an emissive display versus the color of the ambient illumination in the room. The responses of observers showed no significant difference from those of a previous experiment conducted with roughly the same procedure and conditions on a mobile display with a significantly smaller viewing angle. A model is proposed to predict the degree-of-adaptation values reported by observers. The model has a form that allows it to be re-optimized to fit additional data sets for different viewing scenarios, and it can be used in conjunction with a number of chromatic adaptation transforms.
Authors: Trevor Canham; Marcelo Bertalmío
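The published model's exact form is not reproduced here. A common way to express partial adaptation, and one plausible reading of "degree of adaptation", is a von Kries transform whose adapting white mixes the display and ambient whites; the white points below are placeholders, not measured data:

```python
import numpy as np

def mixed_adaptation_white(display_white, ambient_white, D):
    """Effective adapting white as a convex mix of display and ambient
    whites (in a cone-like space), weighted by degree of adaptation
    D in [0, 1]. D = 1 means full adaptation to the display white."""
    display_white = np.asarray(display_white, dtype=np.float64)
    ambient_white = np.asarray(ambient_white, dtype=np.float64)
    return D * display_white + (1.0 - D) * ambient_white

def von_kries_adapt(lms, source_white, dest_white):
    """Von Kries chromatic adaptation: scale each channel by the
    ratio of destination to source white."""
    return np.asarray(lms) * (np.asarray(dest_white) / np.asarray(source_white))

d65 = np.array([0.95, 1.00, 1.09])        # placeholder display white
tungsten = np.array([1.10, 1.00, 0.35])   # placeholder warm ambient white
w_eff = mixed_adaptation_white(d65, tungsten, D=0.7)
adapted = von_kries_adapt([0.5, 0.5, 0.5], w_eff, d65)
```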


Color Stabilization for Multi-Camera Light-Field Imaging
By capturing a more complete rendition of scene light than standard 2D cameras, light-field technology represents an important step towards closing the gap between live action cinematography and computer graphics. Light-field cameras accomplish this by simultaneously capturing the same scene under different angular configurations, providing directional information that allows for a multitude of post-production effects. Among the practical challenges related to capturing multiple images simultaneously, a very important problem is the fact that the different images do not perfectly match in terms of color, which severely complicates all further processing. In this work we adapt and extend to the light-field scenario a color stabilization method previously proposed for standard multi-camera shoots, and demonstrate experimentally that it provides an improvement over the state-of-the-art techniques for light-field imaging.
Authors: Olivier Vu Thanh; Trevor Canham; Javier Vazquez-Corral; Raquel Gil Rodríguez; Marcelo Bertalmío


A reevaluation of Whittle (1986, 1992) reveals the link between detection thresholds, discrimination thresholds, and brightness perception
In 1986, Paul Whittle investigated the ability to discriminate between the luminance of two small patches viewed upon a uniform background. In 1992, Paul Whittle asked subjects to manipulate the luminance of a number of patches on a uniform background until their brightness appeared to vary from black to white with even steps. The data from the discrimination experiment almost perfectly predicted the gradient of the function obtained in the brightness experiment, indicating that the two experimental methodologies were probing the same underlying mechanism. Whittle introduced a model that was able to capture the pattern of discrimination thresholds and, in turn, the brightness data; however, there were a number of features in the data set that the model could not capture. In this paper, we demonstrate that the models of Kane and Bertalmío (2017) and Kingdom and Moulden (1991) may be adapted to predict all the data but only by incorporating an accurate model of detection thresholds. Additionally, we show that a divisive gain model may also capture the data but only by considering polarity-dependent, nonlinear inputs following the underlying pattern of detection thresholds. In summary, we conclude that these models provide a simple link between detection thresholds, discrimination thresholds, and brightness perception.
Authors: David Kane; Marcelo Bertalmío


Approaching real-time Character Animation in Virtual Productions
Virtual productions are becoming increasingly common in modern movie-making. The ability to visualize, edit and explore virtual 3D content directly on a movie set makes them invaluable for VFX-rich productions. Many virtual production scenarios also involve animated characters and motion capture [4], but the complexity of animation systems prohibits their use on a film set. Within the EU-funded project SAUCE (Smart Assets for re-Use in Creative Environments), extensive research on available virtual production tools and frameworks has been carried out. Most of them are not publicly available or open source, and none of them offered the possibility to interactively and intuitively animate characters on set.
Authors: Jonas Trottnow; Simon Spielmann


The Potential of Light Fields in Media Productions
One aspect of the EU-funded project SAUCE is to explore the possibilities and challenges of integrating light field capturing and processing into media productions. A special light field camera was built by Saarland University [Herfet et al. 2018] and was first tested under production conditions in the test production “Unfolding”, part of the SAUCE project. Filmakademie Baden-Württemberg developed the creative concept, executed the post-production and prepared a complete previsualization. Calibration and post-processing algorithms were developed by Trinity College Dublin and the Brno University of Technology. This document describes the challenges of building and shooting with the light field camera array, as well as its potential and challenges for post-production.
Authors: Jonas Trottnow; Simon Spielmann; Tobias Lange; Kelvin Chelli; Marek Solony; Pavel Smrž; Pavel Zemčík; Weston Aenchbacher; Mairéad Grogan; Martin Alain; Aljosa Smolic; Trevor Canham; Olivier Vu-Thanh; Javier Vázquez-Corral; Marcelo Bertalmío


Interactive Light Field Tilt-Shift Refocus with Generalized Shift-and-Sum
Since their introduction more than two decades ago, light fields have gained considerable interest in graphics and vision communities due to their ability to provide the user with interactive visual content. One of the earliest and most common light field operations is digital refocus, enabling the user to choose the focus and depth-of-field for the image after capture. A common interactive method for such an operation utilizes disparity estimations, readily available from the light field, to allow the user to point-and-click on the image to choose the location of the refocus plane.
In this paper, we address the interactivity of a lesser-known light field operation: refocus to a non-frontoparallel plane, simulating the result of traditional tilt-shift photography. For this purpose we introduce a generalized shift-and-sum framework. Further, we show that the inclusion of depth information allows for intuitive interactive methods for placement of the refocus plane. In addition to refocusing, light fields also enable the user to interact with the viewpoint, which can be easily included in the proposed generalized shift-and-sum framework.
Authors: Martin Alain; Weston Aenchbacher; Aljosa Smolic
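The basic frontoparallel shift-and-sum refocus that the paper generalizes can be sketched in a few lines (integer-pixel shifts only; the generalized tilt-shift version itself is not reproduced here):

```python
import numpy as np

def shift_and_sum_refocus(lf, slope):
    """Basic (frontoparallel) shift-and-sum refocus of a 4D light field.
    lf: array of shape (U, V, H, W) of sub-aperture images.
    slope: disparity (pixels per unit of angular offset) of the plane to
    bring into focus. Integer-pixel shifts via np.roll keep the sketch
    short; a real implementation would interpolate sub-pixel shifts."""
    U, V, H, W = lf.shape
    out = np.zeros((H, W), dtype=np.float64)
    cu, cv = (U - 1) / 2.0, (V - 1) / 2.0
    for u in range(U):
        for v in range(V):
            dy = int(round(slope * (u - cu)))
            dx = int(round(slope * (v - cv)))
            out += np.roll(lf[u, v], shift=(dy, dx), axis=(0, 1))
    return out / (U * V)

# Toy light field: every view identical, so refocus at slope 0 is lossless.
rng = np.random.default_rng(0)
view = rng.random((32, 32))
lf = np.broadcast_to(view, (3, 3, 32, 32)).copy()
refocused = shift_and_sum_refocus(lf, slope=0.0)
```

Content lying on the chosen disparity plane aligns across views and stays sharp, while everything else is averaged into blur, which is exactly the depth-of-field effect described above.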


Vision Models for Wide Color Gamut Imaging in Cinema
Gamut mapping is the problem of transforming the colors of image or video content so as to fully exploit the color palette of the display device where the content will be shown, while preserving the artistic intent of the original content's creator. In particular in the cinema industry, the rapid advancement in display technologies has created a pressing need to develop automatic and fast gamut mapping algorithms. In this paper we propose a novel framework that is based on vision science models, performs both gamut reduction and gamut extension, is of low computational complexity, produces results that are free from artifacts and outperforms state-of-the-art methods according to psychophysical tests. Our experiments also highlight the limitations of existing objective metrics for the gamut mapping problem.
Authors: Syed Waqas Zamir; Javier Vazquez-Corral; Marcelo Bertalmio
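The vision-model framework itself is not reproduced here. For orientation, a naive gamut-reduction baseline of the kind such methods improve upon simply compresses chroma toward the neutral axis; the `factor` parameter is an assumption for illustration:

```python
import numpy as np

def compress_chroma(rgb, factor=0.8):
    """Toy gamut-reduction baseline: scale each pixel's chroma (its offset
    from the per-pixel grey value) by `factor` < 1, pulling out-of-gamut
    colors toward the neutral axis. Illustrative only; real gamut mapping
    works in a perceptual space and preserves in-gamut colors better."""
    rgb = np.asarray(rgb, dtype=np.float64)
    grey = rgb.mean(axis=-1, keepdims=True)
    return grey + factor * (rgb - grey)

saturated = np.array([[1.2, 0.1, -0.1]])      # out-of-gamut color
reduced = compress_chroma(saturated, factor=0.5)
```

The weakness of this baseline, desaturating everything uniformly, is precisely what motivates the perceptually informed approach the paper proposes.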


DublinCity: Annotated LiDAR Point Cloud and its Applications
Scene understanding of full-scale 3D models of an urban area remains a challenging task. While advanced computer vision techniques offer cost-effective approaches to analysing 3D urban elements, a precise and densely labelled dataset is essential. This paper presents the first-ever labelled dataset for a highly dense Aerial Laser Scanning (ALS) point cloud at city scale: a novel benchmark in which over 260 million laser scanning points from the 2015 Dublin LiDAR point cloud [12] are manually annotated into approximately 100,000 assets. Objects are labelled into 13 classes using hierarchical levels of detail, from large elements (i.e., building, vegetation and ground) to refined ones (i.e., window, door and tree). To validate the performance of our dataset, two different applications are showcased. First, the labelled point cloud is employed to train Convolutional Neural Networks (CNNs) to classify urban elements, and is evaluated with well-known state-of-the-art CNNs (i.e., PointNet, PointNet++ and So-Net). Second, the complete ALS dataset is applied as detailed ground truth for city-scale image-based 3D reconstruction.
Authors: S. M. Iman Zolanvari; Susana Ruano; Aakanksha Rana; Alan Cummins; Rogerio Eduardo da Silva; Morteza Rahbar; Aljosa Smolic


Issues with Common Assumptions about the Camera Pipeline and Their Impact in HDR Imaging from Multiple Exposures
Multiple-exposure approaches for high dynamic range (HDR) image generation share a set of underlying assumptions: that color channels are independent, and that the camera response function (CRF) remains constant while changing the exposure. The first contribution of this paper is to highlight how these assumptions, which were correct for film photography, do not hold in general for digital cameras. As a consequence, the results of multi-exposure HDR methods are less accurate, and when tone-mapped they often present problems like hue shifts and color artifacts. The second contribution is to propose a method to stabilize the CRF while coupling all color channels, which can be applied to both static and dynamic scenes and yields artifact-free results that are more accurate than those obtained with state-of-the-art methods according to several image metrics.
Authors: R. Gil Rodríguez; J. Vazquez-Corral; M. Bertalmío
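For context, the standard multi-exposure merge that these assumptions underpin can be sketched as follows, assuming (as the paper warns is often inaccurate for digital cameras) a known, channel-independent gamma CRF:

```python
import numpy as np

def merge_exposures(images, exposure_times, gamma=2.2):
    """Classic weighted multi-exposure HDR merge. Assumes a known,
    channel-independent gamma CRF, which is exactly the assumption
    the paper shows often fails on digital cameras.
    images: arrays in [0, 1]; exposure_times: matching list (seconds)."""
    num = np.zeros_like(np.asarray(images[0], dtype=np.float64))
    den = np.zeros_like(num)
    for img, t in zip(images, exposure_times):
        img = np.asarray(img, dtype=np.float64)
        # Hat weighting de-emphasizes under- and over-exposed pixels.
        w = 1.0 - np.abs(2.0 * img - 1.0)
        radiance = (img ** gamma) / t   # linearize, normalize by exposure
        num += w * radiance
        den += w
    return num / np.maximum(den, 1e-8)

# Two synthetic exposures of the same scene (one stop apart).
scene = np.linspace(0.05, 0.6, 50)            # "true" radiance
e1 = np.clip(scene * 1.0, 0, 1) ** (1 / 2.2)  # shorter exposure
e2 = np.clip(scene * 2.0, 0, 1) ** (1 / 2.2)  # longer exposure (clips highlights)
hdr = merge_exposures([e1, e2], [1.0, 2.0])
```

In this idealized setting the merge recovers the scene radiance exactly; with the channel couplings and exposure-dependent CRFs the paper describes, it does not, which is what their stabilization method addresses.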

Enabling Multiview- and Light Field-Video for Veridical Visual Experiences
With the advent of UHDTV and the inclusion of High Dynamic Range, High Frame Rate and Extended Color Gamut, 2D imagery is able to push technical parameters up to the limits of the human visual system. Consequently, developments in sensor technology can be used to capture information beyond 2D imagery. In this paper we introduce multiview and light field video as options to capture (at least parts of) the plenoptic function and thereby drive veridical visual experiences. Our contribution concerns tools for capturing and encoding so-called 5D light fields. We have built a multi-camera array producing up to 6 GigaRays/s and a real-time hierarchical H.264 MVC encoder that encodes the light fields as a legacy-compliant video stream.
Authors: Thorsten Herfet; Tobias Lange; Harini Priyadarshini Hariharan


Using LSTM for Automatic Classification of Human Motion Capture Data
Creative studios tend to produce an overwhelming amount of content every day, and being able to manage these data and reuse them in new productions represents a way of reducing costs and increasing productivity and profit. This work is part of a project aiming to develop reusable assets for creative productions. This paper describes our first attempt at using deep learning to classify human motion from motion capture files. It relies on a long short-term memory network (LSTM) trained to recognize actions from a simplified ontology of basic actions such as walking, running or jumping. Our solution was able to recognize several actions with an accuracy of over 95% in the best cases.
Authors: Rogerio Eduardo Da Silva; Jan Ondrej; Aljosa Smolic
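The paper's network architecture and training setup are not given in this summary. The recurrence at the heart of an LSTM, which makes it suitable for variable-length motion capture sequences, can be sketched as a single cell step in plain numpy (dimensions below are arbitrary):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One forward step of a standard LSTM cell.
    x: input (d,); h, c: hidden and cell state (n,);
    W: (4n, d), U: (4n, n), b: (4n,) packing the input, forget,
    output and candidate gates in that order."""
    n = h.shape[0]
    z = W @ x + U @ h + b
    i = sigmoid(z[0 * n:1 * n])        # input gate
    f = sigmoid(z[1 * n:2 * n])        # forget gate
    o = sigmoid(z[2 * n:3 * n])        # output gate
    g = np.tanh(z[3 * n:4 * n])        # candidate cell state
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

# Run a random motion-feature sequence through the cell; the final hidden
# state would feed a classifier head in a full action-recognition model.
rng = np.random.default_rng(0)
d, n, T = 6, 8, 20                    # feature dim, hidden size, sequence length
W = rng.normal(size=(4 * n, d))
U = rng.normal(size=(4 * n, n))
b = np.zeros(4 * n)
h, c = np.zeros(n), np.zeros(n)
for t in range(T):
    h, c = lstm_step(rng.normal(size=d), h, c, W, U, b)
```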


Convolutional Neural Networks Can Be Deceived by Visual Illusions
Visual illusions teach us that what we see is not always what is represented in the physical world. Their special nature makes them a fascinating tool for testing and validating any newly proposed vision model. In general, current vision models are based on the concatenation of linear and non-linear operations. The similarity of this structure to the operations present in Convolutional Neural Networks (CNNs) has motivated us to study whether CNNs trained for low-level visual tasks are deceived by visual illusions. In particular, we show that CNNs trained for image denoising, image deblurring, and computational color constancy are able to replicate the human response to visual illusions, and that the extent of this replication varies with architecture and spatial pattern size. These results suggest that in order to obtain CNNs that better replicate human behaviour, we may need to start aiming for them to better replicate visual illusions.
Authors: Alexander Gomez-Villa; Adrian Martín; Javier Vazquez-Corral; Marcelo Bertalmío


Influence of Ambient Chromaticity on Portable Display Color Appearance
The market share of mobile displays in content distribution has grown significantly over the past decade. These displays add new complications to media color management, as they can be viewed across a wide range of environments over a short span of time. There is currently no consensus within the color science community on the extent to which surround adaptation to ambient chromaticity has a significant impact on the color appearance of image content on these displays. An investigation into this question was therefore conducted at the Dynamic Visual Adaptation Laboratory at the Rochester Institute of Technology in Rochester, NY, aiming to quantify the color appearance impact of these surround signals. Observers performed an asymmetric memory matching task for a set of images viewed under SMPTE-standardized mastering conditions and under a series of ambient illumination conditions of varying chromaticity and luminance. The results suggest that observers adapt partially to the chromaticity of ambient illumination while viewing images on portable displays, and that this mixed adaptation ratio varies as a function of ambient luminance and stimulus type (self-luminous solid color versus images).
Authors: Trevor Canham; Michael J. Murdoch; David Long


In-camera, Photorealistic Style Transfer for On-set Automatic Grading
In professional cinema, the intended artistic look of the movie informs the creation of a static 3D LUT that is applied on set, where further manual modifications to the image appearance are registered as 10-parameter transforms in a color decision list (CDL). The original RAW footage and its corresponding LUT and CDL are passed on to the post-production stage, where the fine-tuning of the final look is performed during color grading.

In many cases, the director wants to emulate the style and look present in a reference image, e.g. a still from an existing movie, a photograph, a painting, or even a frame from a previously shot sequence in the current movie. The manual creation of a LUT and CDL for this purpose may require a significant amount of work from very skilled artists and technicians, while the state of the art in the academic literature offers promising but partial solutions to the photorealistic style transfer problem, with limitations regarding artifacts, speed and manual interaction.

In this paper, we propose a method that automatically transfers the style, in terms of luminance, color palette and contrast, from a reference image to the source RAW footage. It consists of three separable operations: global luminance matching, global color transfer and local contrast matching. As it only takes into account the statistics of the source and reference images, no training is required. The total transform is not static but adapts to changes in the source footage. The computational complexity of the procedure is extremely low and allows for real-time implementation in-camera, for on-set monitoring. While the method is proposed as a substitute for the need to specify a LUT and a CDL, it is compatible with further refinements performed via LUTs, CDLs and grading, both on set and in post-production. The results are free from artifacts and provide an excellent approximation to the intended look, bringing savings in pre-production, shooting and post-production time.
Authors: Itziar Zabaleta; Marcelo Bertalmío


Color-matching Shots from Different Cameras Having Unknown Gamma or Logarithmic Encoding Curves
In cinema and TV it is quite usual to have to work with footage coming from several cameras, which can show noticeable color differences among them even when they are all the same model. In TV broadcasts, technicians work at camera control units to ensure color consistency when cutting from one camera to another. In cinema post-production, colorists need to manually color-match images coming from different sources. Aiming to help perform this task automatically, the Academy Color Encoding System (ACES) introduced a color management framework for working within the same color space with different cameras and displays; however, the ACES pipeline requires the cameras to be characterized beforehand, and therefore does not allow working ‘in the wild’, a situation which is very common. We present a color stabilization method that, given two images of the same scene taken by two cameras with unknown settings, unknown internal parameter values, and unknown non-linear encoding curves (logarithmic or gamma), is able to correct the colors of one of the images, making it look as if it had been captured with the other camera. Our method is based on treating the in-camera color processing pipeline as a combination of a 3x3 matrix followed by a non-linearity, which allows us to model a color stabilization transformation between two shots as a linear-nonlinear function with several parameters. We find corresponding points between the two images, compute the error (color difference) over them, and determine the transformation parameters that minimize this error, all automatically and without any user input. The method is fast and the results show no spurious colors or spatio-temporal artifacts of any kind. It outperforms the state of the art both visually and according to several metrics, and can handle very challenging real-life examples.
Authors: Raquel Gil Rodríguez; Javier Vazquez-Corral; Marcelo Bertalmío
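A minimal sketch of fitting such a linear-nonlinear model is given below. It is not the published method: it assumes a single gamma per camera and recovers the 3x3 matrix by least squares inside a coarse grid search over the two gammas, on synthetic matched colors:

```python
import numpy as np

def fit_color_stabilization(src, dst, gammas=np.linspace(1.5, 3.0, 16)):
    """Illustrative fit of a linear-nonlinear model between corresponding
    pixels of two shots: dst ~ (M @ src**g_s) ** (1/g_d), where src**g_s
    linearizes the source, M is a 3x3 matrix, and 1/g_d re-encodes for the
    destination. Coarse grid search over the two gammas, least squares
    for M. src, dst: (N, 3) arrays of matched colors in (0, 1]."""
    best = (np.inf, None, None, None)
    for gs in gammas:
        src_lin = src ** gs
        for gd in gammas:
            dst_lin = dst ** gd
            X, *_ = np.linalg.lstsq(src_lin, dst_lin, rcond=None)
            pred = np.clip(src_lin @ X, 1e-8, None) ** (1.0 / gd)
            err = np.mean((pred - dst) ** 2)
            if err < best[0]:
                best = (err, X.T, gs, gd)
    return best  # (mse, M, g_src, g_dst)

# Synthetic ground truth: a known matrix and known gammas.
rng = np.random.default_rng(1)
lin = rng.uniform(0.05, 0.9, size=(400, 3))
M_true = np.array([[0.9, 0.1, 0.0], [0.05, 0.9, 0.05], [0.0, 0.1, 0.9]])
src = lin ** (1 / 2.2)                       # source camera: gamma 2.2
dst = (lin @ M_true.T) ** (1 / 1.8)          # second camera: gamma 1.8
mse, M_est, gs, gd = fit_color_stabilization(src, dst)
```

On this noiseless synthetic data the search recovers the generating matrix and both gammas; the published method handles the harder realistic cases (noise, outliers, logarithmic curves).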

Photorealistic Style Transfer for Cinema Shoots
Color grading is the process of subtly mixing and adjusting the color and tonal balance of a movie to achieve a specific visual look. This manual editing task may require a significant amount of work from very skilled artists and technicians. In many cases the director wants to emulate the style and look present in a reference image, e.g. a still from an existing movie, a photograph, or even a previously shot sequence in the current movie. In this paper we propose a method that automatically transfers the style, in terms of tone, color palette and contrast, from a reference image to the source RAW image. It consists of three separable operations: global luminance matching, global color transfer and local contrast matching. The computational complexity of the procedure is extremely low and allows for real-time implementation in-camera. As it just takes into account the statistics of source and reference images, no training is required. The results are free from artifacts and provide an excellent approximation to the intended look, bringing savings in pre-production, shooting and post-production time.
Authors: Itziar Zabaleta; Marcelo Bertalmío
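The global part of such statistics-based transfer can be sketched by matching per-channel means and standard deviations (Reinhard-style); the local contrast matching step is omitted, so this is only a partial illustration:

```python
import numpy as np

def transfer_statistics(source, reference):
    """Global statistics transfer: shift and scale each channel of the
    source so its mean and standard deviation match the reference.
    The published method adds local contrast matching on top; this
    sketch covers only the global part."""
    source = np.asarray(source, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    out = np.empty_like(source)
    for ch in range(source.shape[-1]):
        s, r = source[..., ch], reference[..., ch]
        scale = r.std() / max(s.std(), 1e-8)
        out[..., ch] = (s - s.mean()) * scale + r.mean()
    return out

rng = np.random.default_rng(0)
src = rng.normal(0.4, 0.05, size=(64, 64, 3))   # flat, low-contrast source
ref = rng.normal(0.6, 0.15, size=(64, 64, 3))   # punchier reference look
graded = transfer_statistics(src, ref)
```

Because only summary statistics are used, no training data is needed and the transform adapts automatically to new source footage, which is the property the abstract highlights.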

Statistics of natural images as a function of dynamic range
The statistics of real world images have been extensively investigated, in virtually all cases using low dynamic range (LDR) image databases. The few studies that have considered high dynamic range (HDR) images have performed statistical analysis over illumination maps with HDR from different sets (Dror et al. 2001) or have examined the difference between images captured with HDR techniques against those taken with single-exposure LDR photography (Pouli et al. 2010). In contrast, in this study we investigate the impact of dynamic range upon the statistics of equally created natural images. To do so we consider the HDR database SYNS (Adams et al. 2016). For the distribution of intensity, we observe that the standard deviation of the luminance histograms increases noticeably with dynamic range. Concerning the power spectrum and in accordance with previous findings (Dror et al. 2001), we observe that as the dynamic range increases the 1/f power law rule becomes substantially inaccurate, meaning that HDR images are not scale invariant. We show that a second-order polynomial model is a better fit than a linear model for the power spectrum in log-log axis. A model of the point-spread function of the eye (considering light scattering, pupil size, etc.) has been applied to the datasets creating a reduction of the dynamic range, but the statistical differences between HDR and LDR images persist and further study needs to be performed on this subject. Future avenues of research include utilizing computer generated images, with access to the exact reflectance and illumination distributions and the possibility to generate very large databases with ease, that will help performing more significant statistical analysis.
Authors: Antoine Grimaldi; David Kane; Marcelo Bertalmío
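The power-spectrum comparison can be illustrated on synthetic data: compute a radially averaged spectrum and compare linear versus second-order polynomial fits in log-log axes. The binning and image size below are arbitrary choices, not those of the study:

```python
import numpy as np

def radial_power_spectrum(img, nbins=20):
    """Radially averaged power spectrum of a greyscale image."""
    f = np.fft.fftshift(np.fft.fft2(img))
    power = np.abs(f) ** 2
    h, w = img.shape
    yy, xx = np.indices((h, w))
    r = np.hypot(yy - h // 2, xx - w // 2)
    bins = np.linspace(1, r.max(), nbins + 1)
    freqs, spec = [], []
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (r >= lo) & (r < hi)
        if mask.any():
            freqs.append((lo + hi) / 2)
            spec.append(power[mask].mean())
    return np.array(freqs), np.array(spec)

def loglog_fit_residual(freqs, spec, degree):
    """RMS residual of a polynomial fit to the spectrum in log-log axes."""
    x, y = np.log(freqs), np.log(spec)
    coeffs = np.polyfit(x, y, degree)
    return np.sqrt(np.mean((np.polyval(coeffs, x) - y) ** 2))

# Synthetic 1/f-amplitude image: its power spectrum is a straight
# line in log-log axes, so the linear fit should already be good.
rng = np.random.default_rng(0)
noise = rng.normal(size=(128, 128))
fy = np.fft.fftfreq(128)[:, None]
fx = np.fft.fftfreq(128)[None, :]
rho = np.maximum(np.hypot(fy, fx), 1.0 / 128)
pink = np.real(np.fft.ifft2(np.fft.fft2(noise) / rho))
freqs, spec = radial_power_spectrum(pink)
r1 = loglog_fit_residual(freqs, spec, degree=1)
r2 = loglog_fit_residual(freqs, spec, degree=2)
```

On the study's HDR images the quadratic term becomes substantial, which is what breaks scale invariance; on this synthetic scale-invariant image it adds little.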


Light Field Compression by Superpixel Based Filtering and Pseudo-Temporal Reordering
In this paper we address the evolutionary integration of light fields into standard image/video processing chains by pre-processing them with superpixel-based, structurally adaptive Gaussian pre-filters and circular pseudo-temporal sequencing before feeding them into an HEVC codec in its low-delay predictive coding configuration. We demonstrate significant bit rate reductions of up to 27% compared to pseudo-temporal sequencing without pre-processing. The paper includes experimental results showing that not only the perceived visual quality but also the cornucopia of post-processing options is preserved.
Authors: Harini Priyadarshini Hariharan; Thorsten Herfet
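The paper's circular sequencing is not reproduced here; a common baseline for pseudo-temporal reordering is a serpentine scan of the view grid, which guarantees that consecutive pseudo-video frames are spatially adjacent views and therefore easy for a video codec to predict:

```python
def serpentine_order(rows, cols):
    """Serpentine (boustrophedon) scan of a rows x cols grid of light
    field views: left-to-right on even rows, right-to-left on odd rows,
    so consecutive pseudo-video frames are always adjacent views.
    The paper's circular sequencing differs in detail; this is a
    common baseline for feeding a light field into a 2D video codec."""
    order = []
    for r in range(rows):
        cs = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        order.extend((r, c) for c in cs)
    return order

views = serpentine_order(3, 4)
```

Each step moves exactly one view position, so inter-frame disparity stays small and the codec's motion compensation can exploit it.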