Special Session at INTERSPEECH 2007, Antwerp, Belgium
Tuesday, August 28, 2007, 13.30 - 15.30
Astrid Plaza Hotel, Scala 1
Organized by Gerrit Bloothooft, Utrecht University, The Netherlands
Singing is perhaps the most expressive use of the human voice and speech. An
excellent singer, whether in classical opera, musical, pop, folk music, or any
other style, can express a message and emotion so intensely that it moves and
delights a wide audience. Synthesizing singing may therefore be considered the
ultimate challenge to our understanding and modeling of the human voice. In this
two-hour interactive special session of INTERSPEECH 2007 on synthesized
singing, an enjoyable demonstration of the current state of the art was given,
with active evaluation by the audience.
The session was special in many ways:
1. bass-baritone voice
Let me sing Let me sing Let me sing by bits and bytes
Let me bring Let me bring Let me bring divine delights
I sing an /a/ I sing an /i/ I sing an /u/
for you, for you, for you

2. soprano voice
Let me sing Let me sing Let me sing by bits and bytes
Let me bring Let me bring Let me bring divine delights
It is an art This is the best That I can do
for you, for you, for you

3. voice of choice
Let me sing Let me sing Let me sing by bits and bytes
Let me bring Let me bring Let me bring divine delights
I sing an /a/ I sing an /i/ I sing an /u/
for you, for you, for you
The musical score is written for soprano voice. Transposing it one octave lower gives the score for tenor or alto voice, and two octaves lower the score for bass-baritone voice. For the realization of the Synthesizer Song, the first verse should be sung by a bass-baritone and the second verse by a soprano voice. No accompaniment or reverberation is allowed. For the third verse, any voice can be chosen, and accompaniment is permitted.
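The transpositions follow from the fact that each octave halves (or doubles) every frequency. A tiny illustrative helper makes the arithmetic explicit:

```python
def transpose(f0_hz: float, octaves: int) -> float:
    """Shift a pitch by whole octaves: each octave doubles or halves the frequency."""
    return f0_hz * 2.0 ** octaves

# A soprano A4 written at 440 Hz:
print(transpose(440.0, -1))  # one octave down (tenor/alto register): 220.0 Hz
print(transpose(440.0, -2))  # two octaves down (bass-baritone register): 110.0 Hz
```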
The papers and materials submitted to the Synthesis of Singing Challenge did not follow the regular review procedure, but were selected by the session organizer as suitable for this session.
During the session, judgments were given by 60 voters from the audience (of
150 people), partly through an electronic voting system and partly on a paper
form. The average scores are presented below. It should be realized that the
audience had a difficult task: not all systems produced both a baritone and a
soprano version, the quality of the voices used could differ considerably
(with weaker results for the female voices), and the speech-to-singing systems
had a considerably different starting position than the text-to-singing systems.
The Synthesizer Song sung
by Lieve Geuens during the Special Session (.mov; 30 MB)
Peter Birkholz [Institute for Computer Science, University of Rostock, Germany], "Articulatory Synthesis of Singing"
Audio file: Dona nobis pacem (mp3; 1.5 MB)
Interspeech07 presentation
A system for the synthesis of singing on the basis of an articulatory speech synthesizer is presented. To enable the synthesis of singing, the speech synthesizer was extended in many respects. Most importantly, a rule-based transformation of a musical score into a gestural score for articulatory gestures was developed. Furthermore, a pitch-dependent articulation of vowels was implemented. The results of these extensions are demonstrated by the synthesis of the canon “Dona nobis pacem”. The two voices in the canon were generated with the same underlying articulatory models and the same musical score, the only difference being that their pitches differ by one octave.
Note: See http://www.vocaltractlab.de/ for background information, including a download of the "Vocal Tract Laboratory", an interactive multimedia software tool to demonstrate the mechanism of speech production (in due course)
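The core idea of a rule-based score-to-gesture transformation can be sketched as follows. This is a hypothetical illustration, not VocalTractLab's actual interface: the gesture representation, function names, and tempo handling are assumptions. Each note in the score becomes an F0 target held for the note's duration.

```python
def midi_to_hz(midi_note: int) -> float:
    """Equal-tempered pitch: A4 (MIDI note 69) = 440 Hz."""
    return 440.0 * 2.0 ** ((midi_note - 69) / 12.0)

def score_to_gestures(notes, tempo_bpm=100.0):
    """Map a musical score to a list of pitch gestures.

    notes: list of (midi_note, beats, vowel) tuples (illustrative format).
    Returns one gesture per note with its onset time, duration, and F0 target.
    """
    sec_per_beat = 60.0 / tempo_bpm
    gestures, t = [], 0.0
    for midi_note, beats, vowel in notes:
        dur = beats * sec_per_beat
        gestures.append({"onset_s": round(t, 3),
                         "duration_s": round(dur, 3),
                         "f0_hz": round(midi_to_hz(midi_note), 2),
                         "vowel": vowel})
        t += dur
    return gestures

# Example: C4, E4, G4 sung on /a/, /i/, /u/
for g in score_to_gestures([(60, 1, "a"), (64, 1, "i"), (67, 2, "u")], tempo_bpm=120.0):
    print(g)
```

In a full articulatory system such gestures would additionally drive vocal-tract articulators (jaw, tongue, lips), with the pitch-dependent vowel articulation mentioned in the abstract layered on top.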
Takeshi Saitou1, Masataka Goto1, Masashi Unoki2, and Masato Akagi2 [1 National Institute of Advanced Industrial Science and Technology (AIST), Japan; 2 School of Information Science, Japan Advanced Institute of Science and Technology, Japan], "Vocal Conversion from Speaking Voice to Singing Voice Using STRAIGHT"
Audio files: Original Male Voice (wav; 0.2 MB), Synthesized Male Singing (wav; 0.4 MB); Original Female Voice (wav; 0.2 MB), Synthesized Female Singing (wav; 0.4 MB)
The Synthesizer Song: Original Male Voice (wav; 1.0 MB), Baritone (verse 1; wav; 1.5 MB); Original Female Voice (verse 2; wav; 1.0 MB), Alto (verse 2; wav; 1.5 MB); Synthesized Baritone + Alto (verse 3; wav; 1.5 MB)
Interspeech07 presentation
Voting result: Voice source: 2.6, Articulation: 2.8, Expression: 2.3, Overall judgement: 2.4 AVERAGE: 2.5 (1st place)
A vocal conversion system that can synthesize a singing voice given a speaking voice and a musical score is proposed. It is based on the speech manipulation system STRAIGHT [1], and comprises three models controlling three acoustic features unique to singing voices: the F0, duration, and spectral envelope. Given the musical score and its tempo, the F0 control model generates the F0 contour of the singing voice by controlling four F0 fluctuations: overshoot, vibrato, preparation, and fine fluctuation. The duration control model lengthens the duration of each phoneme in the speaking voice by considering the duration of its musical note. The spectral control model converts the spectral envelope of the speaking voice into that of the singing voice by controlling both the singing formant and the amplitude modulation of formants in synchronization with vibrato. Experimental results showed that the proposed system could convert speaking voices into singing voices whose quality resembles that of actual singing voices.
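The overshoot and preparation fluctuations described above are commonly modeled as the response of a second-order damping system to step-wise changes in the note pitch. The sketch below is a rough illustration of that idea, not the authors' code; the filter parameters, vibrato rate, and depth are assumed values.

```python
import numpy as np

def f0_contour(note_f0, note_dur, fs=200, zeta=0.6, wn=40.0,
               vib_rate=5.5, vib_depth_cents=50.0):
    """Generate a singing-style F0 contour from per-note targets.

    note_f0: target F0 of each note (Hz); note_dur: note durations (s).
    Step changes between targets are passed through an underdamped
    second-order system (forward-Euler integration), producing an
    overshoot at each note transition; a sinusoid adds vibrato.
    """
    # Step-wise target contour, sampled at fs frames per second
    target = np.concatenate([np.full(int(d * fs), f)
                             for f, d in zip(note_f0, note_dur)])
    # Second-order system: y'' + 2*zeta*wn*y' + wn^2*y = wn^2*u
    dt = 1.0 / fs
    y = np.empty_like(target)
    v, y_cur = 0.0, target[0]
    for i, u in enumerate(target):
        a = wn**2 * (u - y_cur) - 2.0 * zeta * wn * v
        v += a * dt
        y_cur += v * dt
        y[i] = y_cur
    # Multiplicative vibrato in cents around the smoothed contour
    t = np.arange(len(y)) / fs
    vibrato = 2.0 ** (vib_depth_cents / 1200.0 * np.sin(2 * np.pi * vib_rate * t))
    return y * vibrato
```

With zeta < 1 the contour briefly overshoots each new note target before settling, which is the perceptually important cue at note onsets.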
Axel Roebel1, Joshua Fineberg2 [1 IRCAM-CNRS-STMS, Paris, France; 2 Harvard University, Boston, USA], "Speech to chant transformation with the phase vocoder"
Audio files: Original voice 1 (leave me alone; wav; 0.1 MB), Original voice 2 (let them play; wav; 0.1 MB), Lolita (let them play; wav; 17 MB)
The Synthesizer Song (in French): Original voice (male; spoken; wav; 1.7 MB); Original voice synthesized (male; spoken; wav; 1.3 MB); Baritone (verse 1; wav; 4.1 MB), Soprano (verse 2; wav; 4.1 MB), Tenor (verse 3; wav; 4.1 MB)
Interspeech07 presentation
Voting result: Voice source: 3.5, Articulation: 3.3, Expression: 3.2, Overall judgement: 3.6 AVERAGE: 3.4 (5th place)
The technique used for this composition is a semi-automatic system for speech-to-chant conversion. The transformation is performed using an implementation of shape-invariant signal modifications in the phase vocoder and a recent technique for envelope estimation that is denoted as True Envelope estimation. We first describe the compositional idea and give an overview of the preprocessing steps that were required to identify the parts of the speech signal that can be used to carry the singing voice. Furthermore, we describe the envelope processing that was used to continuously transform the original voice of the actor into different female singing voices.
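The True Envelope technique mentioned above estimates a spectral envelope by iterative cepstral smoothing: the smoothed curve is repeatedly raised until it rides on the spectral peaks rather than averaging through them. The following is a simplified sketch of that core idea, not IRCAM's implementation; the cepstral order and iteration count are assumptions.

```python
import numpy as np

def true_envelope(log_mag, order=40, n_iter=30):
    """Iterative cepstral envelope estimation (simplified True Envelope idea).

    log_mag: log-magnitude spectrum over the rfft bins of one analysis frame.
    Each iteration takes the pointwise maximum of the spectrum and the current
    envelope, then low-pass lifters it in the cepstral domain, so the envelope
    converges onto the spectral peaks from above.
    """
    env = log_mag.copy()
    target = log_mag
    for _ in range(n_iter):
        target = np.maximum(target, env)        # never drop below the peaks
        cep = np.fft.irfft(target)              # to the cepstral domain
        cep[order:len(cep) - order] = 0.0       # keep only low quefrencies
        env = np.fft.rfft(cep).real             # back to a smooth envelope
    return env
```

In the speech-to-chant system such an envelope is what lets the pitch be transposed over a wide range while the formant structure (and thus the vowel identity) of the voice is preserved.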
Hideki Kenmochi, Hayato Ohshita [Center for Advanced Sound Technologies, Yamaha Corporation, Japan], "VOCALOID – Commercial singing synthesizer based on sample concatenation"
Audio file: Amazing Grace (wav; 7.8 MB)
The Synthesizer Song: Complete (verse 1: baritone, verse 2: soprano, verse 3: alto+accompaniment; wav; 26 MB)
Interspeech07 presentation
Voting result: Voice source: 3.2, Articulation: 3.2, Expression: 3.1, Overall judgement: 3.1 AVERAGE: 3.1 (4th place)
The song submitted here to the “Synthesis of Singing Challenge” was synthesized by the latest version of the singing synthesizer “Vocaloid”, which is now commercially available. In this paper, we present an overview of Vocaloid, its product lineup, a description of each component, and the synthesis technique used in Vocaloid.
Nicolas D’Alessandro, Thierry Dutoit [Laboratoire de Théorie des Circuits et Traitement du Signal, Faculté Polytechnique de Mons, Belgium], "RAMCESS/HandSketch: A Multi-Representation Framework for Realtime and Expressive Singing Synthesis"
Video file: nime06 (mov; 13 MB)
The Synthesizer Song: Verse 1 (tenor; mov; 9 MB)
Interspeech07 presentation
Voting result: Voice source: 3.0, Articulation: 3.2, Expression: 2.6, Overall judgement: 2.9 AVERAGE: 2.9 (3rd place)
In this paper we describe the different investigations that are part of the development of a new singing digital musical instrument adapted to real-time performance. It concerns the improvement of low-level synthesis modules, mapping strategies underlying the development of a coherent and expressive control space, and the building of a concrete bi-manual controller.
Sten Ternström, Johan Sundberg [Department of Speech, Music and Hearing, School of Computer Science and Communication, Kungliga Tekniska Högskolan, Sweden], "Formant-based synthesis of singing"
Audio files: Summertime child (wav; 0.8 MB), Summertime baritone (wav; 0.8 MB), Summertime mixed (wav; 0.8 MB), Summertime male jazz (wav; 0.8 MB)
The Synthesizer Song: Baritone (verse 1; wav; 2.1 MB), Soprano (verse 2; wav; 2.1 MB), Baritone + Soprano + accompaniment (verse 3; wav; 9.8 MB)
Voting result: Voice source: 3.5, Articulation: 3.8, Expression: 3.3, Overall judgement: 3.5 AVERAGE: 3.5 (6th place)
Rule-driven formant synthesis is a legacy technique that still has certain advantages over currently prevailing methods. The memory footprint is small and the flexibility is high. Using a modular, interactive synthesis engine, it is easy to test the perceptual effect of different source waveform and formant filter configurations. The rule system allows the investigation of how different styles and singer voices are represented in the low-level acoustic features, without changing the score. It remains difficult to achieve natural-sounding consonants and to integrate the higher abstraction levels of musical expression.
Session organizer:
Gerrit Bloothooft
UiL-OTS, Utrecht University, The Netherlands
gerrit.bloothooft@let.uu.nl