Disney Robot Mimics Human Camera Operator

ESPN parent Walt Disney Co. offers a glimpse, in a patent application published Thursday, at how robots could replace human camera operators.

The media giant details how its autonomous camera system could be programmed to mimic the way a human operator would pan and tilt a camera to track players in a live sports broadcast.

“Unlike human camera operators, hand-coded autonomous camera systems cannot anticipate action and frame their shots with sufficient ‘lead room.’ As a result, the output videos produced by such systems tend to look robotic, particularly for dynamic activities such as sporting events,” Disney states in the patent application.

Disney Research Senior Research Engineer Peter Carr is named as lead inventor on the application, titled "Method and system for mimicking human camera operation."

Abstract: The disclosure provides an approach for mimicking human camera operation with an autonomous camera system. In one embodiment, camera planning is formulated as a supervised regression problem in which an automatic broadcasting application receives one video input captured by a human-operated camera and another video input captured by a stationary camera with a wider field of view. The automatic broadcasting application extracts feature vectors and pan-tilt-zoom states from the stationary camera and the human-operated camera, respectively, and learns a regressor which takes as input such feature vectors and outputs pan-tilt-zoom settings predictive of what the human camera operator would choose. The automatic broadcasting application may then apply the learned regressor on newly captured video to obtain planned pan-tilt-zoom settings and control an autonomous camera to achieve the planned settings to record videos which resemble the work of a human operator in similar situations.
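
In code, the pipeline the abstract describes reduces to fitting a multi-output regressor that maps features extracted from the stationary wide-angle camera to the pan-tilt-zoom settings the human operator chose for the same frames. The sketch below is a minimal illustration only, using synthetic stand-in data and scikit-learn's RandomForestRegressor; the application does not name a specific regression model, feature dimensionality, or library.

```python
# Minimal sketch of the camera-planning regression described in the
# abstract. The data, feature dimensionality, and choice of regressor
# are all assumptions for illustration; the application does not
# specify them.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# X: feature vectors extracted per frame from the stationary wide-angle
#    camera (e.g., a flattened map of player locations).
# Y: pan-tilt-zoom states recovered per frame from the human-operated
#    camera, the "demonstration" the regressor learns to imitate.
n_frames, n_features = 1000, 64
X = rng.random((n_frames, n_features))   # stand-in features
Y = rng.random((n_frames, 3))            # stand-in (pan, tilt, zoom)

# Learn the mapping from scene features to the settings a human chose.
regressor = RandomForestRegressor(n_estimators=100, random_state=0)
regressor.fit(X, Y)

# At run time: extract features from newly captured wide-angle video,
# then predict planned pan-tilt-zoom settings for the autonomous camera.
new_features = rng.random((1, n_features))
pan, tilt, zoom = regressor.predict(new_features)[0]
print(f"planned pan={pan:.3f} tilt={tilt:.3f} zoom={zoom:.3f}")
```

Predicting pan, tilt, and zoom jointly from the same feature vector keeps the three planned settings consistent with one another, which matters when the output drives a physical camera head.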

Patent Application

Claims:

    1. A method for building a model to control a first device, comprising: receiving, as input, demonstration data from a human operating a second device to perform a demonstration and environmental sensory data associated with the demonstration data; determining device settings of the second device, as operated by the human, from the demonstration data; extracting, from the sensory data, feature vectors describing at least locations of objects in the environment; training, based on the determined device settings and the extracted feature vectors, a regressor which takes additional feature vectors as input and outputs planned device settings for operating the first device; and instructing the first device to attain the planned device settings output by the trained regressor.

    2. The method of claim 1, further comprising: receiving additional environmental sensory data; and extracting, from the additional environmental sensory data, the additional feature vectors describing at least locations of objects.

    3. The method of claim 1, wherein the first device is an autonomous camera, and wherein attaining the planned device settings includes capturing video by controlling the autonomous camera to achieve planned pan-tilt-zoom settings output by the trained regressor.

    4. The method of claim 1, wherein the first device includes one or more stationary cameras, and wherein attaining the planned device settings includes capturing videos with the one or more stationary cameras and sampling the videos captured with the one or more stationary cameras based on the planned device settings output by the trained regressor.

    5. The method of claim 1, further comprising smoothing the planned device settings output by the trained regressor prior to instructing the first device.

    6. The method of claim 1, wherein the feature vectors include one or more spherical maps, the spherical maps being generated by projecting object locations onto a unit sphere.

    7. The method of claim 1, wherein the second device is a camera, wherein the demonstration data includes a first video captured by the camera under control of the human operator, and wherein the device settings are pan-tilt-zoom settings of the camera associated with the first video.

    8. The method of claim 7, wherein the sensory data includes a second video of the environment captured by a stationary camera and having a wider field of view than the first video.

    9. The method of claim 7, wherein determining the settings includes: estimating a calibration matrix of each frame of the first video using the pinhole model; and applying Levenberg-Marquardt optimization to estimate time invariant parameters of a modified pinhole model with a restricted distance between rotation and projection centers and to estimate per-frame pan-tilt-zoom settings by minimizing projection error of predefined key points.

    10. The method of claim 1, wherein the first device and the second device are the same device.

    11. A non-transitory computer-readable storage medium storing a program which, when executed by a processor, performs operations for building a model to control a first device, the operations comprising: receiving, as input, demonstration data from a human operating a second device to perform a demonstration and environmental sensory data associated with the demonstration data; determining device settings of the second device, as operated by the human, from the demonstration data; extracting, from the sensory data, feature vectors describing at least locations of objects in the environment; training, based on the determined device settings and the extracted feature vectors, a regressor which takes additional feature vectors as input and outputs planned device settings for operating the first device; and instructing the first device to attain the planned device settings output by the trained regressor.

    12. The computer-readable storage medium of claim 11, the operations further comprising: receiving additional environmental sensory data; and extracting, from the additional environmental sensory data, the additional feature vectors describing at least locations of objects.

    13. The computer-readable storage medium of claim 11, wherein the first device is an autonomous camera, and wherein attaining the planned device settings includes capturing video by controlling the autonomous camera to achieve planned pan-tilt-zoom settings output by the trained regressor.

    14. The computer-readable storage medium of claim 11, wherein the first device includes one or more stationary cameras, and wherein attaining the planned device settings includes capturing videos with the one or more stationary cameras and sampling the videos captured with the one or more stationary cameras based on the planned device settings output by the trained regressor.

    15. The computer-readable storage medium of claim 11, the operations further comprising smoothing the planned device settings output by the trained regressor prior to instructing the first device.

    16. The computer-readable storage medium of claim 11, wherein the feature vectors include one or more spherical maps, the spherical maps being generated by projecting object locations onto a unit sphere.

    17. The computer-readable storage medium of claim 11, wherein the second device is a camera, wherein the demonstration data includes a first video captured by the camera under control of the human operator, and wherein the device settings are pan-tilt-zoom settings of the camera associated with the first video.

    18. The computer-readable storage medium of claim 17, wherein the sensory data includes a second video of the environment captured by a stationary camera and having a wider field of view than the first video.

    19. The computer-readable storage medium of claim 11, wherein the first device and the second device are the same device.

    20. A system, comprising: a first data capture device; a second data capture device; a processor; and a memory, wherein the memory includes an application program configured to perform operations for building a model to control the first data capture device, the operations comprising: receiving, as input, demonstration data from a human operating the second data capture device to perform a demonstration and environmental sensory data associated with the demonstration data; determining device settings of the second data capture device, as operated by the human, from the demonstration data; extracting, from the sensory data, feature vectors describing at least locations of objects in the environment; training, based on the determined device settings and the extracted feature vectors, a regressor which takes additional feature vectors as input and outputs planned device settings for operating the first data capture device; and instructing the first data capture device to attain the planned device settings output by the trained regressor.
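
A few of these claims translate readily into short sketches. Claims 5 and 15 add a smoothing step: raw per-frame predictions jitter, which would read as shake on a broadcast camera, so the planned settings are smoothed before the camera is instructed. The claims do not name a particular smoother; the sketch below applies a Savitzky-Golay filter from SciPy to a hypothetical pan-angle sequence purely as one plausible choice.

```python
# Smoothing planned settings (claims 5 and 15) before instructing the
# camera. The Savitzky-Golay filter and the synthetic pan sequence are
# illustrative assumptions; the claims do not specify a smoother.
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(2)
frames = np.arange(300)

# Hypothetical raw per-frame pan angles from the regressor: a slow
# camera sweep plus frame-to-frame prediction jitter.
raw_pan = np.sin(frames / 60.0) + rng.normal(0.0, 0.02, frames.size)

# Fit a cubic polynomial over a sliding 31-frame window; the smoothed
# sequence is what would actually be sent to the camera head.
smooth_pan = savgol_filter(raw_pan, window_length=31, polyorder=3)
print(f"frame-to-frame jitter: {np.diff(raw_pan).std():.4f} raw, "
      f"{np.diff(smooth_pan).std():.4f} smoothed")
```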
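
Claims 6 and 16 describe feature vectors built from spherical maps, generated by projecting object locations onto a unit sphere. One reading, sketched below with hypothetical player and camera positions: each object becomes a unit direction vector from the camera's rotation center, and those directions can be binned into an azimuth-elevation histogram. The binning step is an assumption for illustration; the claims specify only the unit-sphere projection.

```python
# Spherical-map features (claims 6 and 16): object locations projected
# onto a unit sphere centered on the camera. The player and camera
# positions and the azimuth-elevation binning are illustrative assumptions.
import numpy as np

def project_to_unit_sphere(points, camera_center):
    """Return unit direction vectors from the camera center to each point."""
    d = np.asarray(points, float) - np.asarray(camera_center, float)
    return d / np.linalg.norm(d, axis=1, keepdims=True)

# Hypothetical player positions on a pitch (meters) and a camera mount.
players = np.array([[10.0, 25.0, 0.0],
                    [40.0, 12.0, 0.0],
                    [55.0, 30.0, 0.0]])
camera = np.array([30.0, -5.0, 8.0])

directions = project_to_unit_sphere(players, camera)

# One simple spherical map: bin the directions by azimuth and elevation
# into a coarse occupancy histogram and flatten it into a feature vector.
azimuth = np.arctan2(directions[:, 1], directions[:, 0])
elevation = np.arcsin(directions[:, 2])
spherical_map, _, _ = np.histogram2d(
    azimuth, elevation, bins=(16, 8),
    range=[[-np.pi, np.pi], [-np.pi / 2, np.pi / 2]])
feature_vector = spherical_map.ravel()
```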
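
Claim 9 recovers the per-frame pan-tilt-zoom ground truth by minimizing the projection error of predefined key points with Levenberg-Marquardt optimization. The sketch below fits pan, tilt, and focal length for one frame under a plain pinhole model with SciPy's least_squares (method="lm"); the claim's modified pinhole model, with its restricted distance between rotation and projection centers and its time-invariant parameters, is simplified away here.

```python
# Per-frame pan-tilt-zoom estimation (claim 9): fit camera parameters by
# minimizing key-point projection error with Levenberg-Marquardt. This
# uses a plain pinhole model; the claim's modified model (restricted
# distance between rotation and projection centers) is omitted.
import numpy as np
from scipy.optimize import least_squares

def ptz_project(params, points):
    """Project 3-D key points (relative to the rotation center) to 2-D
    image coordinates for given pan, tilt, and focal length."""
    pan, tilt, f = params
    cp, sp = np.cos(pan), np.sin(pan)
    ct, st = np.cos(tilt), np.sin(tilt)
    r_pan = np.array([[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]])
    r_tilt = np.array([[1.0, 0.0, 0.0], [0.0, ct, st], [0.0, -st, ct]])
    cam = points @ (r_tilt @ r_pan).T      # rotate into the camera frame
    return f * cam[:, :2] / cam[:, 2:3]    # pinhole perspective division

def residuals(params, points, observed):
    return (ptz_project(params, points) - observed).ravel()

rng = np.random.default_rng(1)

# Hypothetical predefined key points (e.g., surveyed field markings) and
# their noisy observed projections in one frame of the operator's video.
keypoints = rng.uniform([-20.0, -2.0, 30.0], [20.0, 2.0, 80.0], (8, 3))
true_params = np.array([0.15, -0.05, 1200.0])   # pan, tilt, focal (px)
observed = ptz_project(true_params, keypoints) + rng.normal(0.0, 0.5, (8, 2))

# Levenberg-Marquardt refinement from a rough initial guess.
fit = least_squares(residuals, x0=np.array([0.0, 0.0, 1000.0]),
                    args=(keypoints, observed), method="lm")
print("estimated pan, tilt, focal:", fit.x)
```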