ERATO Intelligent Conversational Android (ERICA)

[Image: ERIKA android(2016)01.png]
ERICA 2016, age 23

The ERATO Intelligent Conversational Android, or ERICA, is a hyper-realistic social android developed in Hiroshi Ishiguro’s laboratory. Three similar models are being developed in a collaboration between the Japan Science and Technology Agency (JST), Osaka University, Kyoto University, and the Advanced Telecommunications Research Institute International (ATR), with one model located at each institution.
ERICA is a major improvement on past androids: it not only closely mirrors the appearance of an ordinary human, but also produces speech and body language that react to the situation and closely mimic natural human communication. Although it can only respond to rudimentary questions, such as “What are your hobbies?”, ERICA responds much more naturally than earlier androids from Ishiguro’s lab.

History

ERICA was presented at Japan’s Miraikan National Museum of Emerging Science and Innovation in August 2015 by Osaka University professor Hiroshi Ishiguro, who is renowned for his earlier work on strikingly human-like androids.

Features

The platform’s sensing combines on-board sensors with external remote sensing inputs, but excludes any human remote control (teleoperation).

On-board sensors

Two 1280x1024-pixel, 30 fps NCM13-J USB cameras are mounted in the eyes.
The platform uses OpenCV to perform face tracking on the video feeds from these cameras, producing data that can be used for visual servo control of gaze; a sketch of this idea appears below.
Two Sony ECM-C10 omnidirectional condenser microphones (8 mm diameter x 18 mm length) are embedded in the ears.
Although it is hoped to eventually use these microphones for speech recognition, it may be some time before this is practical; for now they can still be used for detection and coarse localization of sound activity.
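
As an illustration of how face tracking can drive visual servo control of gaze, the minimal sketch below detects faces in one eye-camera frame with OpenCV and converts the pixel offset of the largest face into pan and tilt corrections. The Haar cascade model, camera field of view, and controller gain are assumptions for demonstration, not ERICA’s published parameters.

```python
import cv2

# Illustrative sketch of visual servo gaze control from an eye-camera feed.
# The cascade model, field-of-view values, and gain are assumptions, not
# ERICA's published parameters.
FOV_H_DEG = 60.0   # assumed horizontal field of view of the eye camera
FOV_V_DEG = 48.0   # assumed vertical field of view
GAIN = 0.5         # proportional gain for the servo correction

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def gaze_correction(frame):
    """Return (pan, tilt) corrections in degrees toward the largest face."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return 0.0, 0.0
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # largest face
    img_h, img_w = gray.shape
    # Normalized offset of the face center from the image center (-0.5..0.5).
    dx = (x + w / 2 - img_w / 2) / img_w
    dy = (y + h / 2 - img_h / 2) / img_h
    # Convert to an angular error and apply a simple proportional controller.
    return GAIN * dx * FOV_H_DEG, GAIN * dy * FOV_V_DEG

cap = cv2.VideoCapture(0)          # stand-in for the 1280x1024 eye camera
ok, frame = cap.read()
if ok:
    pan, tilt = gaze_correction(frame)
    print(f"gaze correction: pan {pan:+.1f} deg, tilt {tilt:+.1f} deg")
cap.release()
```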

Position Tracking

ERICA uses the ATRacker tracking system, which can be used with Microsoft Kinect 2 sensors, 2D laser range finders, or a network of ceiling-mounted 3D range sensors.
Each of these configurations offers a different balance of precision, portability, and scalability.

For ERICA, robust human position tracking is important for two reasons: first, for precise gaze control, and second, for keeping track of the identities of people (who may be moving) in multiparty interactions.

Glas et al., IEEE RO-MAN 2016.
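
To illustrate the identity-tracking half of this requirement, the hedged sketch below keeps person identities stable across tracker updates by nearest-neighbour association of new positions to previously known people. The data layout and the 0.5 m matching radius are illustrative assumptions, not details of ATRacker.

```python
import math
from itertools import count

# Hedged sketch: keep stable person identities across tracker updates by
# nearest-neighbour association. The 0.5 m gating threshold and ID scheme
# are illustrative assumptions, not ATRacker internals.
MATCH_RADIUS_M = 0.5
_new_id = count(1)

def update_identities(known, detections):
    """known: {person_id: (x, y)}; detections: list of (x, y) positions.
    Returns an updated {person_id: (x, y)} mapping."""
    updated = {}
    unmatched = list(detections)
    for pid, (px, py) in known.items():
        if not unmatched:
            break
        # Find the detection closest to this person's last known position.
        best = min(unmatched, key=lambda d: math.hypot(d[0] - px, d[1] - py))
        if math.hypot(best[0] - px, best[1] - py) <= MATCH_RADIUS_M:
            updated[pid] = best
            unmatched.remove(best)
    # Any leftover detections are treated as newly appeared people.
    for det in unmatched:
        updated[next(_new_id)] = det
    return updated

people = update_identities({}, [(1.0, 2.0), (3.0, 0.5)])
people = update_identities(people, [(1.1, 2.1), (3.0, 0.4)])
print(people)  # the same IDs persist for slightly moved positions
```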

Sound Source Localization

Two 16-channel microphone arrays are used, from which sound directions are estimated in 3D space (azimuth and elevation angles) with 1-degree angular resolution and 100 ms time resolution. If detections from multiple arrays intersect in 3D space and a human is tracked at that position, it is likely that that person is speaking.
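
A minimal sketch of this cross-check, assuming each array reports a direction ray from a known array position: a tracked person is flagged as the likely speaker when rays from both arrays pass close to that person’s position. The array poses and the 0.3 m tolerance are illustrative assumptions.

```python
import math
import numpy as np

# Hedged sketch: decide whether a tracked person is likely speaking by
# checking that direction estimates from two microphone arrays both pass
# near that person's 3D position. Tolerance and array poses are assumptions.
TOLERANCE_M = 0.3

def direction_vector(azimuth_deg, elevation_deg):
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    return np.array([math.cos(el) * math.cos(az),
                     math.cos(el) * math.sin(az),
                     math.sin(el)])

def distance_point_to_ray(point, origin, direction):
    """Shortest distance from a 3D point to a ray (origin + t*direction, t >= 0)."""
    v = np.asarray(point, float) - np.asarray(origin, float)
    t = max(0.0, float(np.dot(v, direction)))
    return float(np.linalg.norm(v - t * direction))

def is_likely_speaker(person_pos, detections):
    """detections: list of (array_origin, azimuth_deg, elevation_deg)."""
    return all(
        distance_point_to_ray(person_pos, origin,
                              direction_vector(az, el)) <= TOLERANCE_M
        for origin, az, el in detections)

person = (2.0, 1.0, 1.6)  # tracked head position in metres
arrays = [((0.0, 0.0, 1.5), 26.6, 2.6),   # array A: azimuth, elevation estimate
          ((4.0, 0.0, 1.5), 153.4, 2.6)]  # array B
print(is_likely_speaker(person, arrays))  # True: both rays pass near the person
```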

Speech Recognition

Speech input is currently captured with wireless Shure Beta 58A handheld microphones.
One of the research goals of this project is to develop appropriate models and techniques for reliable speech recognition in social conversation using microphone arrays placed at a distance. This is to be realized using deep neural networks (DNNs) for front-end speech enhancement and acoustic modeling.

Speech recognition for Japanese is performed using the DNN version of the open-source Julius large-vocabulary speech recognition system. Speech recognition for other languages is currently under development.
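
As a rough sketch of how an application could consume Julius output, the example below connects to a Julius instance running in module mode (started with the -module option), which by default listens on TCP port 10500 and streams XML-like result blocks. The port, message framing, and tag names follow the Julius documentation but should be verified against the installed version; this is not the ERICA platform’s actual integration code.

```python
import socket

# Hedged sketch of a client reading recognition results from Julius running
# in module mode ("julius -C main.jconf -module"). Port 10500 is the Julius
# default for module mode; the tag names below follow Julius' XML-like output
# format but should be checked against the installed version.
HOST, PORT = "127.0.0.1", 10500

def read_results(max_messages=10):
    with socket.create_connection((HOST, PORT)) as sock:
        buf = b""
        for _ in range(max_messages):
            chunk = sock.recv(4096)
            if not chunk:
                break
            buf += chunk
            # Julius terminates each message block with a line containing ".".
            while b"\n.\n" in buf:
                message, buf = buf.split(b"\n.\n", 1)
                text = message.decode("utf-8", errors="replace")
                if "<RECOGOUT>" in text:
                    # Extract recognized words from WHYPO tags.
                    words = [seg.split('"')[1] for seg in text.split("WORD=")[1:]]
                    print("recognized:", " ".join(words))

if __name__ == "__main__":
    read_results()
```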

Prosodic Information

Prosodic features such as power and pitch of a person’s voice play an important role in turn-taking, for signaling emotion or uncertainty, and for signaling questions or statements. An android will need to understand such signals in order to react quickly to the dynamics of a conversation, perform backchannel behaviors, change its gaze target, or express reactions before the human has finished speaking.

To enable such behaviors, the system analyzes the audio streams from speech inputs and provides continuous estimates of the power and pitch of the speaker’s voice, as well as an estimate of whether the current signal represents speech activity or not. This analysis is done both for the close-talk microphone inputs and for the separated signals from the microphone arrays.
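
A minimal sketch of such a frame-level analysis is shown below: it computes RMS power, a simple autocorrelation-based pitch estimate, and an energy-based speech-activity flag for one audio frame. The frame size, pitch search range, and threshold are illustrative assumptions, not the values used on the ERICA platform.

```python
import numpy as np

# Hedged sketch of frame-wise prosodic analysis: power, an autocorrelation
# pitch estimate, and a simple energy-based voice-activity flag. Frame size,
# pitch search range, and threshold are illustrative assumptions.
SAMPLE_RATE = 16000
FRAME_LEN = 512                      # 32 ms frames at 16 kHz
F0_MIN, F0_MAX = 60, 400             # plausible speech pitch range in Hz
POWER_THRESHOLD = 1e-4               # assumed voice-activity threshold

def analyze_frame(frame):
    """Return (power, f0_hz_or_None, is_speech) for one audio frame."""
    power = float(np.mean(frame ** 2))
    is_speech = power > POWER_THRESHOLD
    if not is_speech:
        return power, None, False
    frame = frame - frame.mean()
    # Autocorrelation for non-negative lags only.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = SAMPLE_RATE // F0_MAX
    lag_max = min(SAMPLE_RATE // F0_MIN, len(ac) - 1)
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))
    f0 = SAMPLE_RATE / lag
    return power, f0, True

# Example: a synthetic 200 Hz tone should come back near 200 Hz.
t = np.arange(FRAME_LEN) / SAMPLE_RATE
tone = 0.1 * np.sin(2 * np.pi * 200 * t)
print(analyze_frame(tone))
```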

Control Architecture

The software architecture of the ERICA platform combines a memory model, a set of behavior modules for generating dynamic movements, and a flexible software infrastructure supporting dialog management.

Perception and Memory
The robot’s awareness and memory of the world are stored in what is called the world model, which is divided into a set of human models, a scenario model, and a robot model.
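
A minimal sketch of how such a world model could be organized is shown below, assuming one human model per tracked person plus scenario and robot models. The class and field names are assumptions inferred from this description, not the platform’s actual data structures.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hedged sketch of the world model split described above. Class and field
# names are assumptions based on this description, not the platform's code.

@dataclass
class HumanModel:
    person_id: int
    position: Tuple[float, float, float]      # from the position tracker
    is_speaking: bool = False                  # from sound source localization
    last_utterance: Optional[str] = None       # from speech recognition

@dataclass
class ScenarioModel:
    name: str
    dialog_state: str = "idle"                 # application-level state
    topic_history: List[str] = field(default_factory=list)

@dataclass
class RobotModel:
    gaze_target: Optional[int] = None          # person_id currently gazed at
    is_speaking: bool = False

@dataclass
class WorldModel:
    humans: Dict[int, HumanModel] = field(default_factory=dict)
    scenario: ScenarioModel = field(default_factory=lambda: ScenarioModel("default"))
    robot: RobotModel = field(default_factory=RobotModel)

world = WorldModel()
world.humans[1] = HumanModel(person_id=1, position=(2.0, 1.0, 1.6), is_speaking=True)
world.robot.gaze_target = 1
```
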
Motion Control
Motion is generated by a set of several behavior modules running in parallel, the outputs of which are combined in the pose blending logic component. Behavior modules can be activated or deactivated according to an end user’s application logic. Several behavior modules generate motion based on speech activity, including a “lip sync” behavior module, which calculates mouth and jaw commands to send to the robot based on the raw audio signal from the speech synthesizer. Furthermore, while the robot is speaking, a “rhythm” module generates forward and backward motions of the body trunk based on the power and pitch of the robot’s speech signal. While the robot is listening, a “backchannel” module is activated, which produces nodding behaviors based on short pauses in the person’s speech. To provide time for motor actuation of the lip sync and rhythm modules, an empirically determined delay of 200 ms is applied to the output audio signal played through the robot’s speaker.
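
The hedged sketch below illustrates this structure: behavior modules expose a joint-offset output, and a pose-blending step combines the active modules with a weighted average. The module interface, joint count, and blending rule are illustrative assumptions, not the platform’s actual implementation.

```python
import numpy as np

# Hedged sketch of parallel behavior modules whose joint-pose outputs are
# combined by a pose-blending step. Module names follow the description
# above; the interface, joint layout, and blending rule are assumptions.
NUM_JOINTS = 7  # purely illustrative joint count

class BehaviorModule:
    def __init__(self, name, weight=1.0):
        self.name, self.weight, self.active = name, weight, True

    def pose(self, t):
        """Return a joint-offset vector for time t (overridden by subclasses)."""
        return np.zeros(NUM_JOINTS)

class Breathing(BehaviorModule):
    def pose(self, t):
        # Slow shoulder/torso oscillation while not speaking.
        return 0.02 * np.sin(2 * np.pi * 0.25 * t) * np.ones(NUM_JOINTS)

class Rhythm(BehaviorModule):
    def __init__(self, speech_power):
        super().__init__("rhythm")
        self.speech_power = speech_power  # callable giving robot speech power at t

    def pose(self, t):
        # Lean the trunk forward in proportion to the robot's speech power.
        offsets = np.zeros(NUM_JOINTS)
        offsets[0] = 0.1 * self.speech_power(t)
        return offsets

def blend(modules, t):
    """Weighted average of all active modules' pose offsets."""
    active = [m for m in modules if m.active]
    if not active:
        return np.zeros(NUM_JOINTS)
    total_w = sum(m.weight for m in active)
    return sum(m.weight * m.pose(t) for m in active) / total_w

modules = [Breathing("breathing"), Rhythm(speech_power=lambda t: 0.5)]
print(blend(modules, t=1.0))
```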

Several other behavior modules have also been implemented, and a summary is presented in Table Ⅰ. It is planned to create a core set of behavior modules which will be useful in most interactions, although end users may wish to create some additional behavior modules for specific scenarios.

Table Ⅰ: Behavior Modules

Pose and Animation
  Expressions: Manages transitions between fixed facial expressions and body poses
  Gestures: Can execute one or more gesture animations in parallel
Idle Motions
  Breathing: Moves shoulders and torso to simulate breathing when not speaking
  Blinking: Blinks eyes at a fixed rate with slight random variation
Speech-related
  Lip Sync: Moves lips to match the speech signal
  Rhythm: Generates emphasis and nodding behaviors in response to the robot's speech signal
  Backchannel: Generates nodding behaviors in response to the human's speech signal
Gaze Control
  Gaze Controller: Uses closed-loop control to manipulate 7 joints to direct the robot's gaze at a specific direction or point in space
  Gaze Avert: Adds small offsets to the target gaze direction to simulate natural human gaze variations

Public figure

Events

See also

References

Glas, Dylan F.; Minato, Takashi; Ishi, Carlos T.; Kawahara, Tatsuya; Ishiguro, Hiroshi (August 2016). "ERICA: The ERATO Intelligent Conversational Android". 2016 25th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN). IEEE. pp. 22–29. doi:10.1109/ROMAN.2016.7745086. Retrieved 25 October 2018.

External links

Collaborating Research Institutions

Other links from Symposium

Categories: Android (robot) | Humanoid robots | Social robots | 2015 robots | Robots of Japan