Synthesis of Singing Challenge

Special Session at INTERSPEECH 2007, Antwerp, Belgium
Tuesday, August 28, 2007, 13.30 - 15.30
Astrid Plaza Hotel, Scala 1

Organized by Gerrit Bloothooft, Utrecht University, The Netherlands

Singing is perhaps the most expressive use of the human voice. An excellent singer, whether in classical opera, musical, pop, folk, or any other style, can express a message and emotion so intensely that it moves and delights a wide audience. Synthesizing singing may therefore be considered the ultimate challenge to our understanding and modeling of the human voice. In this two-hour interactive special session of INTERSPEECH 2007 on synthesized singing, an enjoyable demonstration of the current state of the art was given, with active evaluation by the audience.

The session was special in many ways:

Compulsory musical score (+ orchestration by Sten Ternström, and his karaoke version of verse 3)

1. bass-baritone voice
Let me sing
Let me sing
Let me sing by bits and bytes

Let me bring
Let me bring
Let me bring divine delights

I sing an /a/
I sing an /i/
I sing an /u/
for you, for you, for you
2. soprano voice
Let me sing
Let me sing
Let me sing by bits and bytes

Let me bring
Let me bring
Let me bring divine delights

It is an art
This is the best
That I can do
for you, for you, for you
3. voice of choice
Let me sing
Let me sing
Let me sing by bits and bytes

Let me bring
Let me bring
Let me bring divine delights

I sing an /a/
I sing an /i/
I sing an /u/
for you, for you, for you

The musical score is written for soprano voice. Transposing it one octave lower gives the score for tenor or alto voice; two octaves lower, the score for bass-baritone voice. For the realization of the Synthesizer Song, the first verse should be sung by a bass-baritone voice and the second verse by a soprano voice; no accompaniment or reverberation is allowed for these verses. For the third verse, any voice can be chosen and accompaniment is permitted.
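The octave arithmetic is simple enough to state in a few lines of code. The Python sketch below (with made-up example pitches, not the actual score) expresses the transpositions as shifts of MIDI note numbers; each octave down subtracts 12 semitones and halves every frequency.

    def transpose(midi_notes, octaves_down):
        """Shift a melody down by whole octaves (12 semitones each)."""
        return [m - 12 * octaves_down for m in midi_notes]

    soprano = [76, 74, 72]                 # illustrative pitches only
    tenor_or_alto = transpose(soprano, 1)  # one octave down: frequencies halved
    bass_baritone = transpose(soprano, 2)  # two octaves down: frequencies quartered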

Programme

The papers and materials submitted to the Synthesis of Singing Challenge did not follow the regular review procedure; they were selected by the session organizer as suitable for this session.

During the session, judgments were given by 60 voters from the audience (of 150 people), partly through an electronic voting system and partly on paper forms. The average scores are presented below; a lower score indicates a better judgment. It should be realized that the audience had a difficult task: not all systems produced both a baritone and a soprano version, the quality of the voices used could differ considerably (with weaker results for the female voices), and the speech-to-singing systems had a considerably different starting position from the text-to-singing systems.

The Synthesizer Song sung by Lieve Geuens during the Special Session (.mov; 30 MB)

Peter Birkholz [Institute for Computer Science, University of Rostock, Germany], "Articulatory Synthesis of Singing"

Audio file: Dona nobis pacem (mp3; 1.5 MB)
The Synthesizer Song: Complete (verse 1: baritone, verse 2: soprano, verse 3: baritone+soprano; wav; 6.6 MB)

Interspeech07 presentation

Voting result: Voice source: 2.8, Articulation: 2.7, Expression: 3.1, Overall judgement: 2.8      AVERAGE: 2.8  (2nd place)

A system for the synthesis of singing on the basis of an articulatory speech synthesizer is presented. To enable the synthesis of singing, the speech synthesizer was extended in many respects. Most importantly, a rule-based transformation of a musical score into a gestural score for articulatory gestures was developed. Furthermore, a pitch-dependent articulation of vowels was implemented. The results of these extensions are demonstrated by the synthesis of the canon “Dona nobis pacem”. The two voices in the canon were generated with the same underlying articulatory models and the same musical score, the only difference being that their pitches differ by one octave.
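The core idea of a rule-based score-to-gesture transformation can be illustrated with a toy sketch. The Python fragment below is an invented illustration, not the VocalTractLab implementation: real gestural scores have many more tiers (glottal, velic, lung pressure, and so on), and the specific rules here (the tempo, the 20 ms vowel anticipation) are assumptions made for the example.

    def score_to_gestures(notes, bpm=96):
        """Toy rule-based mapping from a musical score to a gestural score:
        each (midi_pitch, beats, syllable) note yields a pitch gesture and
        an overlapping vowel gesture, returned as (type, start, end, value)
        events."""
        sec_per_beat = 60.0 / bpm
        t, gestures = 0.0, []
        for midi, beats, syllable in notes:
            dur = beats * sec_per_beat
            f0 = 440.0 * 2.0 ** ((midi - 69) / 12.0)
            gestures.append(("pitch", t, t + dur, f0))
            # anticipation rule: start the vowel gesture slightly early
            gestures.append(("vowel", max(0.0, t - 0.02), t + dur, syllable))
            t += dur
        return gestures

    # e.g. score_to_gestures([(57, 1, "let"), (59, 1, "me"), (60, 2, "sing")])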

Note: See http://www.vocaltractlab.de/ for background information, including a download of the "Vocal Tract Laboratory", an interactive multimedia software tool that demonstrates the mechanism of speech production (in due course)
 

Takeshi Saitou (1), Masataka Goto (1), Masashi Unoki (2), and Masato Akagi (2) [1: National Institute of Advanced Industrial Science and Technology (AIST), Japan; 2: School of Information Science, Japan Advanced Institute of Science and Technology, Japan], "Vocal Conversion from Speaking Voice to Singing Voice Using STRAIGHT"

Audio files: Original Male Voice (wav; 0.2 MB),  Synthesized Male Singing (wav; 0.4 MB); Original Female Voice (wav; 0.2 MB),  Synthesized Female Singing (wav; 0.4 MB)
The Synthesizer Song: Original Male Voice (wav; 1.0 MB),  Baritone (verse 1; wav; 1.5 MB); Original Female Voice (verse 2; wav; 1.0 MB),  Alto (verse 2; wav; 1.5 MB); Synthesized Baritone + Alto (verse 3; wav; 1.5 MB)

Interspeech07 presentation

Voting result: Voice source: 2.6, Articulation: 2.8, Expression: 2.3, Overall judgement: 2.4      AVERAGE: 2.5  (1st place)

A vocal conversion system that can synthesize a singing voice given a speaking voice and a musical score is proposed. It is based on the speech manipulation system STRAIGHT [1], and comprises three models controlling three acoustic features unique to singing voices: the F0, duration, and spectral envelope. Given the musical score and its tempo, the F0 control model generates the F0 contour of the singing voice by controlling four F0 fluctuations: overshoot, vibrato, preparation, and fine fluctuation. The duration control model lengthens the duration of each phoneme in the speaking voice by considering the duration of its musical note. The spectral control model converts the spectral envelope of the speaking voice into that of the singing voice by controlling both the singing formant and the amplitude modulation of formants in synchronization with vibrato. Experimental results showed that the proposed system could convert speaking voices into singing voices whose quality resembles that of actual singing voices.  
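A minimal NumPy sketch of the F0 control idea is given below. It renders the stepwise note pitches, smooths them with an under-damped second-order system so that note changes overshoot their targets, and adds sinusoidal vibrato plus random fine fluctuation. All parameter values are illustrative guesses rather than the paper's, and preparation (the dip before a note change) is not modeled here.

    import numpy as np

    def f0_contour(notes, fs=200, zeta=0.6, omega=40.0,
                   vib_rate=5.5, vib_depth_cents=60.0, fine_std_cents=3.0):
        """Generate an F0 contour from (midi_pitch, duration_s) note pairs.
        The stepwise log-F0 target is passed through an under-damped
        second-order system (zeta < 1), producing overshoot after note
        changes; vibrato and fine fluctuation are added on top."""
        target_hz = np.concatenate([
            np.full(int(d * fs), 440.0 * 2.0 ** ((m - 69) / 12.0))
            for m, d in notes])
        x = np.log(target_hz)

        # Euler integration of y'' + 2*zeta*omega*y' + omega^2*y = omega^2*x
        y = np.empty_like(x)
        yi, v, dt = x[0], 0.0, 1.0 / fs
        for i, xi in enumerate(x):
            a = omega**2 * (xi - yi) - 2.0 * zeta * omega * v
            v += a * dt
            yi += v * dt
            y[i] = yi

        t = np.arange(len(y)) / fs
        cents = (vib_depth_cents * np.sin(2.0 * np.pi * vib_rate * t)
                 + np.random.randn(len(y)) * fine_std_cents)
        return np.exp(y) * 2.0 ** (cents / 1200.0)

    # e.g. f0_contour([(48, 0.5), (52, 0.5), (55, 1.0)])  # C3, E3, G3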
 

Axel Roebel (1), Joshua Fineberg (2) [1: IRCAM-CNRS-STMS, Paris, France; 2: Harvard University, Boston, USA], "Speech to chant transformation with the phase vocoder"

Audio files: Original voice 1 (leave me alone; wav; 0.1 MB), Original voice 2 (let them play; wav; 0.1 MB), Lolita (let them play; wav; 17 MB)
The Synthesizer Song (in French): Original voice (male; spoken; wav; 1.7 MB); Original voice synthesized (male; spoken; wav; 1.3 MB); Baritone (verse 1; wav; 4.1 MB), Soprano (verse 2; wav; 4.1 MB), Tenor (verse 3; wav; 4.1 MB)

Interspeech07 presentation

Voting result: Voice source: 3.5, Articulation: 3.3, Expression: 3.2, Overall judgement: 3.6      AVERAGE: 3.4  (5th place)

The technique used for this composition is a semi-automatic system for speech-to-chant conversion. The transformation is performed using an implementation of shape-invariant signal modification in the phase vocoder and a recent technique for envelope estimation known as True Envelope estimation. We first describe the compositional idea and give an overview of the preprocessing steps that were required to identify the parts of the speech signal that can be used to carry the singing voice. We then describe the envelope processing that was used to continuously transform the original voice of the actor into different female singing voices.
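The True Envelope method itself is compact enough to sketch. The NumPy code below is a minimal rendering of the published iterative algorithm (cepstral smoothing pushed upward until it rides the spectral peaks), with assumed parameter defaults; it is not the authors' implementation.

    import numpy as np

    def true_envelope(mag, n_fft, order=60, n_iter=100, tol_db=0.1):
        """Iterative cepstral 'True Envelope' estimation: repeatedly replace
        the log spectrum by the maximum of itself and its cepstrally
        smoothed version, so the final envelope follows the spectral peaks.
        mag is a one-sided magnitude spectrum of length n_fft//2 + 1;
        order sets how many cepstral coefficients are kept (smoothness)."""
        logS = 20.0 * np.log10(np.maximum(mag, 1e-12))
        A = logS.copy()
        for _ in range(n_iter):
            c = np.fft.irfft(A, n=n_fft)            # real cepstrum
            c[order + 1 : n_fft - order] = 0.0      # low-quefrency lifter
            V = np.fft.rfft(c, n=n_fft).real        # smoothed log envelope
            if np.all(V >= logS - tol_db):          # envelope covers all peaks
                break
            A = np.maximum(logS, V)                 # push the envelope upward
        return 10.0 ** (V / 20.0)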
 

Hideki Kenmochi, Hayato Ohshita [Center for Advanced Sound Technologies, Yamaha Corporation, Japan], "VOCALOID – Commercial singing synthesizer based on sample concatenation"

Audio file: Amazing Grace (wav; 7.8 MB)
The Synthesizer Song: Complete (verse 1: baritone, verse 2: soprano, verse 3: alto+accompaniment; wav; 26 MB)

Interspeech07 presentation

Voting result: Voice source: 3.2, Articulation: 3.2, Expression: 3.1, Overall judgement: 3.1      AVERAGE: 3.1  (4th place)

The song submitted here to the “Synthesis of Singing Challenge” was synthesized with the latest version of the singing synthesizer “Vocaloid”, which is now commercially available. This paper presents an overview of Vocaloid, its product lineup, a description of each component, and the synthesis technique used in Vocaloid.
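As a flavour of what sample concatenation means at its simplest, the toy sketch below joins pre-recorded unit waveforms with linear crossfades. It is emphatically not the Vocaloid engine: a real concatenative singing synthesizer must also time-align units to the score and adjust their pitch and timbre, none of which is attempted here.

    import numpy as np

    def concatenate_units(units, fs=44100, xfade_ms=20):
        """Join pre-recorded singing samples (1-D arrays) end to end,
        blending each junction with a linear crossfade of xfade_ms."""
        n = int(fs * xfade_ms / 1000)
        ramp = np.linspace(0.0, 1.0, n)
        out = units[0].astype(float)
        for u in units[1:]:
            u = u.astype(float)
            # overlap-add the crossfade region, then append the rest
            out[-n:] = out[-n:] * (1.0 - ramp) + u[:n] * ramp
            out = np.concatenate([out, u[n:]])
        return out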
 

Nicolas D’Alessandro, Thierry Dutoit [Laboratoire de Théorie des Circuits et Traitement du Signal, Faculté Polytechnique de Mons, Belgium], "RAMCESS/HandSketch: A Multi-Representation Framework for Realtime and Expressive Singing Synthesis"

Video file: nime06 (mov; 13 MB)
The Synthesizer Song: Verse 1 (tenor; mov; 9 MB)

Interspeech07 presentation

Voting result: Voice source: 3.0, Articulation: 3.2, Expression: 2.6, Overall judgement: 2.9      AVERAGE: 2.9 (3rd place)

In this paper we describe the different investigations that are part of the development of a new digital singing instrument adapted to real-time performance. The work concerns improvements to the low-level synthesis modules, the mapping strategies underlying a coherent and expressive control space, and the construction of a concrete bi-manual controller.
 

Sten Ternström, Johan Sundberg [Department of Speech, Music and Hearing, School of Computer Science and Communication, Kungliga Tekniska Högskolan, Sweden], "Formant-based synthesis of singing"

Audio files: Summertime child (.wav; 0.8 MB), Summertime baritone (.wav; 0.8 MB ), Summertime mixed (.wav; 0.8 MB), Summertime male jazz (.wav; 0.8 MB)
The Synthesizer Song: Baritone (verse 1; wav; 2.1 MB), Soprano (verse 2; wav; 2.1 MB), Baritone + Soprano + accompaniment (verse 3; wav; 9.8 MB)

Interspeech07 presentation

Voting result: Voice source: 3.5, Articulation: 3.8, Expression: 3.3, Overall judgement: 3.5     AVERAGE: 3.5 (6th place)

Rule-driven formant synthesis is a legacy technique that still has certain advantages over currently prevailing methods. The memory footprint is small and the flexibility is high. Using a modular, interactive synthesis engine, it is easy to test the perceptual effect of different source waveform and formant filter configurations. The rule system allows the investigation of how different styles and singer voices are represented in the low-level acoustic features, without changing the score. It remains difficult to achieve natural-sounding consonants and to integrate the higher abstraction levels of musical expression.  
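To illustrate the source-filter principle that formant synthesis builds on, here is a minimal Python sketch: a pulse-train source passed through a cascade of second-order resonators. It carries none of the rule system or interactivity described above, and the formant frequencies and bandwidths are illustrative values only, roughly a male /a/-like vowel.

    import numpy as np
    from scipy.signal import lfilter

    def formant_synth(f0=110.0, formants=((600, 80), (1040, 90), (2250, 120)),
                      dur=1.0, fs=16000):
        """Minimal source-filter formant synthesis: an impulse train (a
        crude stand-in for a glottal source) filtered by a cascade of
        second-order resonators, one per (frequency, bandwidth) formant."""
        n = int(dur * fs)
        src = np.zeros(n)
        src[::int(fs / f0)] = 1.0          # pulse train at the fundamental
        y = src
        for fc, bw in formants:
            r = np.exp(-np.pi * bw / fs)   # pole radius from bandwidth
            theta = 2.0 * np.pi * fc / fs  # pole angle from centre frequency
            a = [1.0, -2.0 * r * np.cos(theta), r * r]
            b = [sum(a)]                   # scale for unity gain at DC
            y = lfilter(b, a, y)
        return y / np.max(np.abs(y))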

Contact

Session organizer:
Gerrit Bloothooft
UiL-OTS, Utrecht University, The Netherlands
gerrit.bloothooft@let.uu.nl