How Nuance Could Let Viewers Buy Products Featured on TV With Voice Commands

When industry executives began discussing the potential of t-commerce nearly two decades ago, some predicted that viewers would eventually be able to buy sweaters worn by Friends star Jennifer Aniston with the click of a remote control.

Speech recognition technology provider Nuance takes that concept to a new level in a U.S. patent published on Tuesday. Instead of relying on remote controls, Nuance details how viewers could buy pizzas, furniture and “Mike’s pants” with voice commands.

Abstract: A system and method are described for delivering to a member of an audience supplemental information related to presented media content. Media content is associated with media metadata that identifies active content elements in the media content and supported intents associated with those content elements. A member of an audience may submit input related to an active content element. The audience input is compared to media metadata to determine whether supplemental information can be identified that would be appropriate to deliver to the audience member based on that person’s input. In some implementations, audience input includes audio data of an audience’s spoken input regarding the media content.
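Stripped of the patent language, the abstract describes a lookup: each program ships with metadata listing its "active" content elements, the window of time in which each element is active, and the intents (buy, get more information, and so on) a viewer might express about it. The Python sketch below is only an illustration of that matching step, not Nuance's implementation; the schema, element names, time windows, and intent keywords are all invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class ContentElement:
    """An 'active' item in the media content, e.g. the pizza shown in an ad."""
    name: str
    active_window: tuple[float, float]  # seconds into the program when the element is active
    intents: dict[str, str] = field(default_factory=dict)  # intent keyword -> supplemental info

# Hypothetical metadata for one program; the filing does not specify a schema.
MEDIA_METADATA = [
    ContentElement(
        name="pizza",
        active_window=(120.0, 180.0),
        intents={"buy": "Order a pizza from the advertised chain",
                 "info": "Price and nutrition facts for the advertised pizza"},
    ),
    ContentElement(
        name="pants",
        active_window=(300.0, 360.0),
        intents={"buy": "Purchase link for the pants worn by the character",
                 "info": "Brand and sizing details for the pants"},
    ),
]

def handle_audience_input(spoken_words: str, playback_time: float) -> str | None:
    """Compare a viewer's spoken input to the metadata and return supplemental info.

    Returns None when no active content element or supported intent matches.
    """
    words = spoken_words.lower().split()
    for element in MEDIA_METADATA:
        start, end = element.active_window
        if not (start <= playback_time <= end):
            continue                     # the element is only 'active' for a limited period
        if element.name not in words:
            continue                     # the input does not refer to this content element
        for intent_keyword, supplemental in element.intents.items():
            if intent_keyword in words:  # an anticipated input tied to this intent
                return supplemental
    return None

# Example: a viewer says "buy that pizza" two and a half minutes into the show.
print(handle_audience_input("buy that pizza", playback_time=150.0))
```

The playback-time check stands in for the claims' notion of a content element being "active for a limited period of time" while it is on screen.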

Patent

[Image: Nuance couch]

Claims:

  1. A tangible computer-readable storage medium containing instructions for performing a method of delivering information associated with media content being consumed by a member of an audience, the method comprising: maintaining or accessing media metadata associated with media content being presented to an audience; receiving audience input received from a member of the audience during the presentation of the media content to the audience and while a content element is being presented; identifying audience information describing the member of the audience; comparing the received audience input to the media metadata, wherein the media metadata describes the content element of the media content being presented to the audience, wherein the content element is associated with content presented to the audience perceiving the media content, wherein the media metadata includes at least two intents associated with the content element of the media content being presented to the audience, and wherein the intents are associated with anticipated inputs by the audience in reference to the content element; determining, based at least in part on the comparison, that the received audience input refers to the content element of the media content; identifying, based at least in part on the comparison, an intent of the at least two intents, wherein the identified intent is related to the audience input; identifying supplemental information to deliver to the audience member, wherein the supplemental information is identified based at least in part on the identified intent, the identified audience information, and the content element that the audience input is determined to refer to; and delivering the supplemental information to the member of the audience.
  2. The computer-readable storage medium of claim 1, wherein the audience input includes audio data representing a verbal input by the audience member, and wherein the method further comprises identifying spoken words from the audio data representing the verbal input of the audience member.
  3. The computer-readable storage medium of claim 1, wherein the audience input includes audio data representing a verbal input by the audience member, and wherein identifying the audience information comprises: identifying spoken words from the audio data representing the verbal input of the audience member; and identifying a gender of the audience member based on an analysis of the audio data, wherein the supplemental information is identified based at least in part on the gender of the audience.
  4. The computer-readable storage medium of claim 1, wherein identifying the intent of the at least two intents includes identifying in the audience input a spoken word that is associated with intent metadata.
  5. The computer-readable storage medium of claim 1, wherein: the content element is active for a limited period of time during the presentation of the content element, and the audience input is received during the limited period of time.
  6. The computer-readable storage medium of claim 1, wherein maintaining media metadata associated with media content being presented to the audience includes: analyzing the media content; identifying content elements in the media content that are to be presented to the audience; associating at least one intent with each of the content elements identified in the media content; identifying supplemental information associated with the content elements in the media content; and storing in a data structure the supplemental information in association with the content elements.
  7. The computer-readable storage medium of claim 1, wherein the media content includes at least one of video, audio, and animation provided via cable, television, or radio broadcast.
  8. The computer-readable storage medium of claim 1, wherein delivering the supplemental information to the member of the audience includes presenting the supplemental information using a display device being used to present the media content to the audience.
  9. The computer-readable storage medium of claim 1, wherein delivering the supplemental information to the member of the audience includes transmitting the supplemental information to a computing device associated with the member of the audience.
  10. The computer-readable storage medium of claim 2, further comprising analyzing the audio data to identify a voice component associated with the audience input, wherein the supplemental information identified to deliver to the audience member is based at least in part on the voice component.
  11. The computer-readable storage medium of claim 10, wherein the voice component includes at least one of pitch, tone, rate, and volume.
  12. A method of delivering information associated with content being consumed by a member of an audience, the method performed by a computing system having at least one processor and memory, the method comprising: maintaining or accessing media metadata associated with media content being presented to at least one audience member, wherein the media metadata identifies a content element of the media content; receiving audio data representing a verbal input from an audience member during the presentation of the media content to the audience and while the content element is being presented; identifying spoken words from the audio data representing the verbal input of the audience member; identifying audience information describing the audience member; comparing the spoken words from the audio data to the content element of the media content; determining that the spoken words reference the content element; determining an intent associated with the spoken words; identifying, based at least in part on the determined intent associated with the spoken words and on the identified audience information, supplemental information to deliver to the audience member, wherein the supplemental information is associated with the content element; wherein the supplemental information is identified from among other supplemental information associated with the content element; and delivering the identified supplemental information to the member of the audience.
  13. The method of claim 12, wherein the audience information is identified based at least in part on the audio data, the method further comprising: modifying the identified supplemental information based on the audience information prior to delivering the identified supplemental information to the member of the audience.
  14. The method of claim 12, wherein delivering the identified supplemental information to the member of the audience includes presenting the identified supplemental information to the member of the audience using a display device being used to present the media content to the audience.
  15. The method of claim 12, wherein the audio data is recorded using a mobile device associated with the member of the audience.
  16. A system including at least one processor and memory for delivering information associated with media content being consumed by a member of an audience, the system comprising: a media analysis module configured to: maintain or access media metadata associated with media content being presented to at least one audience member; an audience input analysis module configured to: compare audience input representing input from a member of an audience captured during the presentation of the media content and while a content element is being presented to media metadata, wherein: the media metadata describes the content element of the media content being presented to the audience, and the content element is associated with content presented to the audience of the media content; determine, based on the comparison, that the received audience input refers to the content element of the media content; and determine an intent associated with the received audience input; an audience recognition module configured to: identify, based at least in part on the audience input, audience information describing the audience member; an information identification module configured to: identify, based at least in part on the determined intent and the identified audience information, supplemental information to deliver to the audience member, wherein the supplemental information is associated with the audience input and the content element that the audience input refers to, and wherein the supplemental information is identified from among other supplemental information associated with the content element; and an information delivery module configured to: determine a mode by which the supplemental information is to be delivered to the member of the audience; and deliver the identified supplemental information to the member of the audience via the determined mode.
  17. The system of claim 16, wherein the audience input includes audio data representing a verbal input by the audience member, and wherein the system further comprises a speech recognition module configured to identify spoken words from the audio data representing the verbal input of the audience member.
  18. The system of claim 16, wherein the audience input includes audio data representing a verbal input by the audience member, and wherein the system further comprises a speech recognition module configured to: identify spoken words from the audio data representing the verbal input of the audience member; and identify an age or gender of the audience member based on an analysis of the audio data, wherein the information identification module is configured to identify the supplemental information based at least in part on the age or gender of the audience member.
  19. The system of claim 16, wherein the content element is active for a limited period of time during the presentation of the media content, and the audience input is captured during the limited period of time.
  20. The system of claim 17, wherein the speech recognition module is further configured to identify a voice component associated with the audience input, wherein the information identification module is further configured to identify supplemental information to deliver to the audience member based at least in part on the voice component.
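Claim 16 recasts the same idea as a system of five cooperating modules: media analysis, audience input analysis, audience recognition, information identification, and information delivery. The sketch below is a toy wiring of that pipeline under invented names and a keyword-based match, meant only to show how the modules could hand data to one another, not how Nuance builds it.

```python
# Illustrative stand-ins for the five modules named in claim 16. A real system
# would involve actual speech recognition, a metadata service, and
# device-specific delivery; these stubs only show how the pieces hand off data.

class MediaAnalysisModule:
    def get_metadata(self, program_id: str) -> dict:
        # Metadata maps each active content element to per-intent supplemental info.
        return {"pizza": {"buy": "Order from the advertised chain",
                          "info": "Price and nutrition details"}}

class AudienceInputAnalysisModule:
    def match(self, spoken_words: str, metadata: dict) -> tuple[str, str]:
        # Compare the spoken input to the metadata: which element, which intent?
        words = spoken_words.lower().split()
        for element, intents in metadata.items():
            if element in words:
                for intent in intents:
                    if intent in words:
                        return element, intent
        raise LookupError("input does not reference an active content element")

class AudienceRecognitionModule:
    def identify(self, spoken_words: str) -> dict:
        # A fuller module might infer age or gender from the audio itself (claim 18).
        return {"profile": "unknown"}

class InformationIdentificationModule:
    def select(self, element: str, intent: str, audience: dict, metadata: dict) -> str:
        # Pick supplemental information for the element, filtered by the intent
        # (and, in a fuller version, by the audience information).
        return metadata[element][intent]

class InformationDeliveryModule:
    def deliver(self, supplemental: str, mode: str = "tv_overlay") -> None:
        # Choose a delivery mode (TV overlay, phone, tablet) and send the info.
        print(f"[{mode}] {supplemental}")

# Wire the pipeline together for a single utterance.
utterance = "I want to buy that pizza"
metadata = MediaAnalysisModule().get_metadata("program-123")
element, intent = AudienceInputAnalysisModule().match(utterance, metadata)
audience = AudienceRecognitionModule().identify(utterance)
supplemental = InformationIdentificationModule().select(element, intent, audience, metadata)
InformationDeliveryModule().deliver(supplemental)
```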