10.3.3 Prosody Generation

Once the linguistic processing of text is completed, the next step is generating the prosody parameters, which will be used in the generation or selection of the speech units. In traditional speech synthesizers, text is the only available input. As described in Section 10.2, future synthesizers will process not only text but also the state of the user and the surrounding environment. Mixing information will also be considered.

The primary prosody parameters are F0 contour, duration, and energy of the speech signal. At this stage, spectral envelope shape and fine spectral characteristics can also be modified. The purpose of the modifications is to increase the richness and naturalness of the synthesized speech to make it as close to human speech as possible.

Research on prosodic processing for speech synthesis has a long and rich history (a summary of different approaches can be found in [22]). Because F0 has received the most attention, there are many theories for modeling F0 contour variations. Duration (also referred to as timing) modification and prediction have also been studied in detail. The literature on energy modification for speech synthesis is relatively small compared to that on F0 and duration. However, based on recent results