A FORMANT SPEECH SYNTHESIZER WHICH ACCEPTS PHONEMIC INPUT
=========================================================

Donald FISK,
Associate Consultant,
Electronic Services Division,
45 Hoi Yuen Road,
Kwun Tong,
Hong Kong.

Abstract
========

A prototype of a serial formant speech synthesizer, with a side branch for fricative noise, has been developed. Vowels, nasalized vowels, nasals, plosives, fricatives, approximants, laterals and trills are handled and so the synthesizer can be used for most of the world's major languages. It produces intelligible and natural sounding speech from input based on the International Phonetic Alphabet. Spectra of natural and synthesized speech are given for comparison (Appendix IV).

Introduction
============

For many computer speech applications, the words to be spoken are few and all known in advance, so they can simply be recorded and played back when needed. Depending on the number of words, different degrees of compression are appropriate: for a few words, no compression is necessary and so the words can be stored as digitized waveforms. The quality of the resulting speech obviously depends on the sampling rate (and the number of bits per sample), but is usually very good. However, there is no flexibility: one is restricted to a single voice and a few utterances.

If there are, say, one thousand words to be stored, some form of compression, such as linear predictive coding (LPC), would be preferable. The quality is normally somewhat lower than if no compression is used. However, because in LPC the excitation source and the filter are separated, there is a limited amount of flexibility: it is now possible to alter the pitch of the utterance or whisper it, for example.

If the vocabulary to be synthesized is very large or unlimited, even with compression its storage ceases to be feasible and so it is necessary to store smaller units than whole words.
Coarticulation -------------- In addition to the memory cost of storing a vocabulary of reasonable size, there is another problem, and that is that it is not possible simply to concatenate utterances in the time domain in novel ways, owing to coarticulation effects. Because the first "y" in "every year" is different from the "y" in "every river", it is not possible simply to concatenate "every" with the following word; we would need a different "every" for each of the two cases. This problem occurs at every level of connected speech, from words to phonemes. If a different phoneme precedes or follows a sound (word, syllable, phoneme), that sound is different. There are two ways of dealing with this problem. One is to split the sounds which make up an utterance in the middle of a phoneme, where the sound reaches a steady state. The other will be described in detail in the main body of the paper. Lyssables --------- If we use [second half of syllable] + [first half of syllable] ("lyssable") as the basic unit of storage, while the same units can be used for many different words, there are still several thousand lyssables which need to be stored, each of which is about 0.3 seconds in length. In most cases, the syllables can be split at the vowel to form lyssables, and rejoined there during synthesis, e.g. "information" <=> "i-" + "-info-" + "-orma-" + "-atio-" + "-on" If it is necessary to synthesize an unlimited vocabulary, including foreign names such as "Mladenov" and "Landsbergis", we will need extra lyssables such as "mla-" and "-andsbe-", and it becomes difficult to see how it is possible to build such a database without inadvertently omitting items. Diphones -------- It is, of course, possible to work in smaller units called diphones. These consist of [second half of phoneme] + [first half of phoneme] and there are N*N of them, where N is the number of phonemes (for English, about forty would be necessary). Each is short (about 0.1 seconds). 
While the storage needs are much reduced and the data collection is simpler than for lyssables, the quality of speech produced by concatenating diphones would be poorer than that obtained from lyssable concatenation.

Both collection of lyssables and, to a lesser extent, collection of diphones are major undertakings owing to the number of items involved. With each database, it is only possible to produce one accent.

Phonemes
--------

There are about 40 phonemes for English and they are straightforward to collect and require little storage. However, the problem of their concatenation is a serious one. The solution, described below (Connecting Phonemes), involves synthesizing the speech anew each time it is required, and, unlike lyssable or diphone concatenation, is computationally expensive. Also, it becomes more difficult to produce natural-sounding speech. However, all this is offset by three advantages:

(1) It requires much less storage. If its pronunciation can be determined from the spelling, a word does not need to be stored at all. If it cannot, its phonetic transcription is all that needs to be stored.

(2) It is possible to pronounce unfamiliar words, such as foreign names, correctly, provided only that they do not contain phonemes absent from the database.

(3) It is possible to adapt the system to a new accent or language very quickly, because collection of the 30-50 allophones (variants of phonemes) needed for pronouncing a language takes no more than about a day (and there are methods by which this process can be expedited).

User Interface
==============

Phonetic Notation
-----------------

To use the synthesizer, it is necessary first to build an accent or allophone file (.IPA). This contains parameters (defined later) for each allophone present in the accent. It is read from the disk whenever the program starts execution, after which the user can input a character string for the synthesizer to pronounce.
This character string must contain only allophones defined in the accent file, possibly modified by some diacritics, and tones. The diacritics are symbols which follow an allophone and can be used for length modification (<]> = x0.65, <;> = x1.3, <:> = x1.8), nasalization <~>, aspiration <"> and devoicing <%>. The phonetic text can be divided into syllables with the tones preceding each syllable. There are seven tone symbols, three level, two rising and two falling:

    Symbol   Pitch
    ------   -----
      ^      High level
      -      Middle level
      _      Low level
      '      Rising to high level
      `      Falling to middle level
      /      Rising to middle level
      \      Falling to low level

The tone symbols can be combined, for example, ^\ starts at high level and falls to low level.

The symbols used to represent the allophones are arbitrary, but they must be (a) typeable and printable and (b) not already used as a diacritic or tone. It is recommended that they resemble their IPA equivalents as closely as possible: for example, should represent the primary cardinal vowels or sounds close to them. The recommended character sets for use with Scottish English (ScotEng), Received Pronunciation (RP), General American (GA) and Standard French (SF) are given in Appendix I. Comparison with the Alvey machine readable Phonetic Notation (Wells, 1986) shows that the symbols used here more closely resemble their IPA equivalents in cases where the two systems differ (e.g. for Cardinal Vowel 5, which is in Alvey).

A system which converts from orthographic to phonetic spelling has been developed for Scottish English. This system, which does not at present handle stress or allophonic variation (i.e. gives broad phonetic transcription only), will be described in detail in a subsequent paper. It consists of several forward chaining production systems whose pattern matching primitives are based on those present in the SNOBOL4 language. The rules are compiled either into LISP or C by a compiler which is written in LISP.
At present, there are just two production systems, one for morphological analysis (e.g. "cats" -> "cat+s") and one for phonetic transcription (e.g. "ight" -> "@it"). To complete the system, it will be necessary to assign stress to the correct syllables and select the appropriate allophones. Examples of Allophonic Representations -------------------------------------- "alarming" pronounced in ScotEng is (see Appendix IV, Figs. 1 and 2) -'A]`^La;r]-\mI9; "forget that in its death their sires had part" (W. Owen),in RP is -f@^`gEt-5@t"InIt]s\dE8-5E@^`saI@z-h@d\p"Q:t" Making and Changing an Accent ----------------------------- To build an accent file if none is available, it is recommended that analysis-by-synthesis be used. This involves recording each sound and spectroanalysing it. From its spectrum it is then possible to read off the formant frequencies and, very roughly, estimate the bandwidths of each formant. Using the parameters thus obtained, the sound is then synthesized. The spectra of the synthesized sound and the natural recorded sound are then compared and any necessary adjustments are made. The process is repeated until a good match is obtained. Finally, the amplitude is adjusted so that the volume of the sound is correct relative to the allophones already recorded for the accent. To build a new accent file from an existing one is simple. For example, from a Scottish one, an RP file can be built by (a) adding extra vowels and modifying existing vowels and (b) copying the consonants. (it may be necessary to add extra consonants in other cases). Output ------ The output of the synthesizer is a string of 8 bit samples. There is no data compression; the samples contain the amplitude of the wave. The output is received by hardware which produces the sound. The same hardware can also be used for recording natural speech for analysis. The sampling rate selected was 10kHz. 
This gives a sufficient range of frequencies (0-5000Hz) to reproduce all the important phonetic features of speech, including some of the high frequency noise in fricatives such as /s/ and /8/ which is lost on a telephone line. For comparison, the maximum frequency transmissible on a telephone line is 3.4kHz and the maximum that is audible by an adult male with good hearing is about 15kHz.

The system runs on a Televideo Tele/383 with an 80386 CPU running at 20MHz. Synthesis takes about twenty seconds per second of speech. For example, it takes about 15 seconds to synthesize the word "alarming" for a Scottish accent. A maximum of about 3.2 seconds (32000 8-bit samples) can be synthesized at a time. In order to run even at this rate, it was necessary for all time-critical calculations to be done in integer arithmetic; if the same calculations were done in floating point, synthesis would take several minutes for a second of speech.

Synthesizer Architecture
========================

Excitation Source
-----------------

There are two types of excitation source: impulse, which occurs during voiced speech such as vowels, and white noise, which occurs during fricatives. Both can occur together during the same sound, e.g. in voiced fricatives and in breathy voice.

The excitation source does not model the sound produced at the glottis; to produce this, it is necessary to send the impulse through a pole or formant filter (F0) (described below). In the case of voiced input to F0, the output has the correct shape but is time reversed, which means that its spectrum is correct but the phases of its components are wrong. The incorrect phases are not a problem because it is very difficult to distinguish between otherwise identical sounds of different phase. (See Linggard, 1985).

Poles and Zeros
---------------

Poles, or formant filters, satisfy the following differential equation:

    y''(t) + b y'(t) + k y(t) = u(t)                                    [1]

with u as input and y as output.
Zeroes are inverse poles, i.e. the input and output are reversed.

As can be seen from the equation, each formant can be specified by two independent parameters, b and k. b determines how broad the formant is, and can conveniently be referred to as its bandwidth, and

    k = w^2 + b^2/4                                                     [2]

where w = 2\pi f and f is the formant's frequency. If u(t) = 0, the solution is then

    y(t) = C e^{-bt/2} \cos(wt)                                         [3]

It is most convenient to use f (frequency) (rather than k) and b (bandwidth) as the two independent parameters.

The Fourier transform of [3] is

    Y(w) = \frac{2}{b} \cdot \frac{8w^2/b^2 + Z(w)}{16w^2/b^2 + Z(w)^2}  [4]

where

    Z(w) = 1 + 4(w_0^2 - w^2)/b^2                                       [5]

and w_0 is the formant's angular frequency (the w of [2] and [3]). This gives the shape of the spectrum for a pole. For a zero, it is just the inverse of [4].

Poles and Zeroes as Digital Filters
-----------------------------------

[1] can be converted to a form suitable for use as a digital filter. After the following conversions (with y(t) itself taken as y[h-1]),

    k       ->  k / (Sampling Frequency)^2
    b       ->  b / (Sampling Frequency)
    y'(t)   ->  (y[h] - y[h-2]) / 2
    y''(t)  ->  y[h] - 2y[h-1] + y[h-2]

it can be recast as

    y[h] = \frac{u[h] + (2 - k) y[h-1] + (b/2 - 1) y[h-2]}{b/2 + 1}     [6]

Using this formula, as k increases, y will become progressively less accurate until it diverges. For this reason, oversampling is done for values of k > 2.5. The formula used in oversampling is

    y[h] = \frac{u[h] + (8 - k) y[h-1] + (b - 4) y[h-2]}{b + 4}         [7]

This is used once to obtain y for the point between the last sample and the current sample, and once again to obtain y for the current sample.

Zeroes are inverse poles, i.e. they also obey [1], but with y as input and u as output. The digital filter formula for zeroes is obtained by inverting [6]:

    u[h] = (1 + b/2) y[h] + (k - 2) y[h-1] + (1 - b/2) y[h-2]           [8]

Oversampling is not necessary for zeroes as they all have small enough k.
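
As a rough illustration of [6] and [8], the following Python sketch applies the pole and zero recurrences sample by sample. The function names, the use of floating point and the zero initial state are assumptions made for the example (the synthesizer itself used integer arithmetic), and the oversampled form [7] is left out.

```python
def pole_filter(u, k, b):
    """Apply the pole recurrence [6] to the input samples u.

    k and b are assumed to be already normalized by the sampling
    frequency, as in the conversions that precede [6].
    """
    y1 = y2 = 0.0  # y[h-1] and y[h-2]; a zero initial state is an assumption
    out = []
    for sample in u:
        y = (sample + (2.0 - k) * y1 + (b / 2.0 - 1.0) * y2) / (b / 2.0 + 1.0)
        out.append(y)
        y2, y1 = y1, y
    return out


def zero_filter(y, k, b):
    """Apply the zero recurrence [8], the inverse of pole_filter."""
    y1 = y2 = 0.0  # previous two input samples
    out = []
    for sample in y:
        out.append((1.0 + b / 2.0) * sample
                   + (k - 2.0) * y1
                   + (1.0 - b / 2.0) * y2)
        y2, y1 = y1, sample
    return out


if __name__ == "__main__":
    impulse = [1.0] + [0.0] * 7
    ringing = pole_filter(impulse, k=0.2, b=0.05)    # decaying oscillation
    recovered = zero_filter(ringing, k=0.2, b=0.05)  # should undo the pole
    print([round(v, 6) for v in ringing])
    print([round(v, 6) for v in recovered])
```

Passing the output of the pole through a zero with the same k and b recovers the original input, which is the sense in which a zero is an inverse pole.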

The Mouth as Series of Filters
------------------------------

Three formants (F1, F2 and F3) are necessary for distinguishing between different phonemes. It is best to represent an extra two formants explicitly (F4 and F5) in order to achieve good sound quality, but of these, only F4 needs to vary, and even that is constant most of the time. There is no limit on the number of formants, and formants F6 and above, though unimportant when considered individually, are important for shaping the spectrum. They can be modelled collectively as a single formant (called F6) whose amplitude is maximum at half the sampling frequency.

Formant filters are most accurately configured in series. To prevent some frequencies from being overamplified, it is best to keep formants which are close together in frequency apart in the series (see Witten, 1982). The configuration used (including F0) is

    impulse or
    white noise -->-- F0 -->-- F6 -->-- F3 -->-- F5 -->-- F2 -->-- F4 -->-- F1 -->
                          |
                          |
                          +-->-- voice bar

This cascade of filters will later be referred to as the oral branch.

Using the configuration described above, it is possible to synthesize vowels (e.g. i, e, a, o, u), and approximants (such as l, J (untrilled r), j, w) and their unvoiced counterparts (varieties of h). To obtain nasals, fricatives, stops and trills, it is necessary to use other filters.

Nasals
------

These are produced by passing the output of F1 through a zero and then a pole, before differentiating for radiative correction. The configuration is thus

    output from F1 -->-- NZ -->-- NP -->--

and is called the nasal branch. The frequencies and bandwidths of the nasal pole and the nasal zero are fixed at fNZ = 655Hz, bNZ = 500Hz, fNP = 872Hz, bNP = 1000Hz.

Frication
---------

For unvoiced fricatives and aspirated stop consonants, a cascade of filters parallel to the oral branch is used.
This cascade is called the fricative branch and has the following configuration:

    white noise -->-- P2 -->-- Z1 -->-- Z2 -->-- P1 -->--

Not all of these filters need to be used for individual fricatives. For voiced fricatives, both the oral branch (with pulsed excitation) and the fricative branch are operative.

Trills
------

For trills, the output of F1 is simply sent through a sine wave envelope of frequency 30Hz, i.e. multiplied by 2 + sin(30Hz.t).

Radiative Correction
--------------------

The output from the final filter (F1 for vowels and approximants, NP for nasals and P1 for fricatives) is differentiated to model the radiative effect of the lips.

The Voice Bar
-------------

During voiced stop consonants, there is a period during which the vocal tract is closed but the vocal cords continue to vibrate. This shows up as low intensity, low frequency noise on the spectra of these sounds and is modelled approximately by the glottal excitation. (See Appendix IV, Fig. 7.)

Synthesis of Phonemes
=====================

The synthesizer was originally developed for Scottish English and later extended for other accents of English (Received Pronunciation and General American) and other languages of known phonology (Standard French and Arran Gaelic). Some exotic sounds, such as clicks, ejectives and implosives, cannot be handled without modifying the program.

Formants F5 and above have no effect on perception, and F4 has an effect only in a few phonemes (, , , , <3>). F5 has frequency 3700Hz and bandwidth 1500Hz always, and F4 has frequency 3450Hz and bandwidth 1200Hz unless stated otherwise. The formants F6 and above are modelled by a single formant F6 whose frequency is 4500Hz and whose bandwidth is 1300Hz.

The formant bandwidths of the various phonemes are difficult to estimate accurately, and the values given may not be very accurate. However, for intelligible synthesized speech, the accuracy of formant bandwidths is not critical.
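
The fixed filter values quoted above (F4, F5, F6, and the nasal zero and pole given earlier) can be turned into the normalized k and b of the digital filters using [2] and the sampling-frequency conversions. The Python sketch below does this under one loudly stated assumption: the paper does not say how a bandwidth in Hz maps onto b, so the sketch takes b = 2*pi*bandwidth. The function and table names are illustrative only.

```python
import math

SAMPLE_RATE_HZ = 10_000        # sampling rate given in the Output section
OVERSAMPLING_THRESHOLD = 2.5   # oversampling is used when k exceeds this

# (frequency in Hz, bandwidth in Hz) as quoted in the text
FIXED_FILTERS = {
    "F4": (3450.0, 1200.0),
    "F5": (3700.0, 1500.0),
    "F6": (4500.0, 1300.0),
    "NZ": (655.0, 500.0),
    "NP": (872.0, 1000.0),
}


def normalized_coefficients(freq_hz, bandwidth_hz, fs=SAMPLE_RATE_HZ):
    """Return (k, b) scaled for use in the digital filter formulas."""
    w = 2.0 * math.pi * freq_hz
    # ASSUMPTION: bandwidth in Hz is converted to b as 2*pi*bandwidth;
    # the paper does not state this conversion explicitly.
    b = 2.0 * math.pi * bandwidth_hz
    k = w * w + b * b / 4.0          # equation [2]
    return k / (fs * fs), b / fs     # k -> k/fs^2, b -> b/fs


def needs_oversampling(k_normalized):
    """True when the normalized k is large enough to call for oversampling."""
    return k_normalized > OVERSAMPLING_THRESHOLD


if __name__ == "__main__":
    for name, (freq, bandwidth) in FIXED_FILTERS.items():
        k, b = normalized_coefficients(freq, bandwidth)
        print(name, round(k, 3), round(b, 3), needs_oversampling(k))
```

Under this assumption the upper formants all exceed the k > 2.5 threshold, while the nasal zero stays well below it, in line with the remark that zeroes do not need oversampling.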

Glottals
--------

The glottal stop and fricative take the formant frequencies and bandwidths of the following phoneme (usually a vowel). The stop is synthesized by removing phonation (voicing or whisper) at the start of the closure and restoring it at the end. The fricative is synthesized by applying white noise instead of voicing. Both last for about 60ms.

Vowels
------

Vowels are synthesized by setting formant frequencies and bandwidths to their appropriate values and setting the excitation source to impulse. Only the oral branch is operative during the synthesis of vowels (this applies also to nasalized vowels).

Vowels are often described by comparing them to the cardinal vowels. These consist of nine rounded and nine unrounded cardinal vowels, which represent extremes of articulation. These were defined by Daniel Jones and recordings (now, tape only) are obtainable from Linguaphone. The values for the frequency of formants 1 and 2 of the cardinal vowels pronounced by Daniel Jones are approximately (see also Appendix III):

                (UNROUNDED)                      (ROUNDED)
               F1 (Hz)  F2 (Hz)                 F1 (Hz)  F2 (Hz)
               -------  -------                 -------  -------
    CV1  (i)     300     2250      CV9  (y)       300     2050
    CV2  (e)     400     2000      CV10 (0)       400     1750
    CV3  (E)     550     1700      CV11 (6)       550     1400
    CV4  (a)     850     1500      CV12 (G)       650     1200
    CV5  (Q)     620      950      CV13 (D)       620      900
    CV14 (A)     550     1150      CV6  (O)       550      900
    CV15 (V)     400     1100      CV7  (o)       400      700
    CV16 (W)     300     1100      CV8  (u)       300      550
    CV17 (+)     300     1700      CV18 (H)       300     1300

    TABLE 1.
    --------

The values above differ slightly from those given in Catford (1988).

No vowel uttered by Daniel Jones should occur outside of the two octagons whose vertices are defined by F1 and F2 of the unrounded and rounded cardinal vowels. Ignoring the possibility of vocal tract abnormalities, this should apply, more or less, to all adult males in general.
That this is true is suggested by the results for one male speaker with a Scottish accent (myself), whose vowel parameters are given below (see also Appendix III):

    Vowel  Word        f1    f2    f3    b1    b2    b3
    -----  ----        --    --    --    --    --    --
    i      beat       300  2150  3050   200   300   800
    I      bit        400  1700  2300   400   500  1000
    e      bait       400  2000  2350   300  1000  1200
    @      shepherd   580  1420  2250   300   600   800
    E      bet        600  1700  2300   300   500  1200
    a      bat        720  1300  2200   600   500   700
    A      but        650  1300  2050   700   700  1000
    O      bought     600   950  2100   300   500  1000
    o      boat       400   700  2200   300   500  1000
    u      boot       350  1100  1950   300   500   400

    TABLE 2.
    --------

(F4 = 3450Hz, b4 = 1200Hz for all vowels. There are fewer vowels than in other accents of English because three pairs of vowels merge, resulting in POOL and PULL, COT and CAUGHT, and PAM and PALM being pronounced the same in this accent. Also, postvocalic "r" does not alter vowel quality.)

One vowel (E) lies outside the unrounded vowel octagon, but only slightly.

The cardinal vowels given in Table 1 can be used to determine the parameters of the vowels in an accent if the vowels of that accent are described accurately with reference to the cardinal vowels (e.g. by plotting them on a vowel trapezium).

Nasalization
------------

Phonemic nasalization is absent from all major accents of English (except in loan words from French, such as GENRE), but it can occur as a coarticulatory effect, such as when a vowel is flanked by nasal consonants, as in the word "moaning". This does not require special treatment, as will be clarified below. However, phonemic nasalization is well known as a feature of French.

Nasalization's characteristic sound is the result of changes to the oral branch of the vocal tract caused by the partially open velum, the most important of which is broadening of formant 1. Good results were obtained by adjusting b1 = 2000Hz to obtain strongly nasalized vowels.

Nasalization of consonants is rarer, but does occur in some languages such as Scottish Gaelic (e.g.
"ag amharc" -> <@ga~v~@rk">) and is also produced by broadening F1. Approximants and Trills ----------------------- Approximants are synthesized identically to vowels. Trills are synthesized as approximants and then enveloped in a sine wave. Allophones of "l" and untrilled "r" last for about 90ms, and trilled "r" lasts about 60ms. A tapped "r" can be simulated by making trilled "r" sufficiently short. Allophone Symbol F1 F2 F3 F4 --------- ------ ---- ---- ---- ---- 350 650 2200 3450 250 2000 3200 3450 Clear l 400 1180 2350 3500 Velarized l 380 920 2100 3500 Pharyngealized l <1> 400 800 2150 3450 Untrilled r (Brit) 350 1240 1650 2550 Untrilled r (Amer) 350 1100 1450 2550 Trilled r 420 1220 1980 3450 TABLE 3. -------- Not that the glides and are not the same as and . has a much lower F4 frequency than any other phoneme. Stops and Affricates -------------------- The sound passing through the oral branch is gradually reduced to zero at the beginning of the closure and gradually increased again at its end. The transition to and from silence lasts for 100 samples (10ms). If the stop is voiced, a voice bar is maintained throughout the closure (this is the output from the F0 filter). While this does not always occur in naturally spoken English, it sounds satisfactory. The closure in unvoiced stops in English is about 120ms, and in voiced stops about 60ms. The formant frequencies are: <--- Formants ----> <- poles -> <- Zeroes -> F1 F2 F3 P1 P2 Z1 Z2 ---- ---- ---- ---- ---- ---- ---- Bilabial (p, b) 250 900 1700 0 Dental (T, D) 200 1600 2100 100 Alveolar (t, d) 200 1900 2500 3700 4350 Velar (k, g) 200 2300 1600 1600 2300 1250 2750 TABLE 4. -------- As the steady-state part of a stop is either silence or the voice bar, it is impossible to measure the formant parameters directly. 
However, by examining the formant trajectories of spectra of sounds of type Vowel-Stop-Vowel immediately before the closure, it was possible to estimate each formant's target frequency fairly accurately. The stops were then synthesized using the estimated formant target frequencies and it was found that they were perceived to be the stops intended for all combinations of vowel and stop. For velar stops, formants F2 and F3 can be seen to approach each other at the onset of closure. Just before closure, they coincide. F2 of velar stops has, then, a higher target frequency than F3. (This crossing of formants is also visible during other transitions, such as to .) Stops in many languages are aspirated. The aspiration can be synthesized as a short fricative burst lasting for about 30ms and ending about 20ms from the end of the closure, with the fricative resonances coinciding with formants (see Fricatives, below). The transitions in and out of frication last 10ms, and so are much shorter than the fricative bursts of affricates such as the ("ch" in "church"). The type of frication used for a given stop is that of the fricative with the same place of articulation, thus: P x s z "picketed" =

-> p"Ik"It"Id" (<"> indicates aspiration of a stop.)

Affricates are synthesized explicitly as the concatenation of a stop and a fricative:

    "church" -> t]SArt]S

(<]> halves the length of a phoneme.)

Fricatives
----------

For both unvoiced and voiced fricatives, the fricative resonances are used to produce frication, which lasts for the duration of the fricative. Fricative resonances coincide with one or more formants (Table 5, last column), so that, in voiced fricatives, both voiced and unvoiced sound are present in those formants.

For the duration of an unvoiced fricative, phonation (voiced and unvoiced) ceases and only the fricative resonances are synthesized. Phonation (voiced) is retained during voiced fricatives. Voiced fricatives of natural speech have clearly visible formants, and unvoiced and voiced components can be distinguished on the spectrum if the fricative is sung falsetto.

Unvoiced fricatives last about 150ms, and voiced fricatives about 100ms.

    Phoneme  Location          F1    F2    F3   Fricative Res.
    -------  --------         ----  ----  ----  --------------
    f, v     Labiodental       300  1150  1950  F2
    8, 5     Dental            450  1300  2000  F0, F6
    s, z     Alveolar          350  1400  2200  F5, F6
    S, 3     Palatoalveolar    220  1800  2250  F3, F4
    c        Palatal           350  1650  2800  F2, F3
    x, Y     Velar             400  1200  1600  F2
    K        Uvular            420  1220  1980  F2

    TABLE 5.
    --------

The frequency of F4 in <S> and <3> is 3100Hz.

Nasals
------

Typically, nasal consonants have a length of between 80 and 120ms. Their formant frequencies are given below.

    Phoneme  Location    F1    F2    F3
    -------  --------   ----  ----  ----
    m        Bilabial    200  1200  2100
    N        Dental      200  1350  2300
    n        Alveolar    200  1500  2300
    9        Velar       200  2200  1800

    TABLE 6.
    --------

Connecting Phonemes (Coarticulation)
------------------------------------

If a vowel is sustained, its spectrum is constant with time. During a diphthong, however, there is a transition between the steady states of the two vowels during which the frequencies and bandwidths of the formants change.
For any given formant, the transition closely resembles an arc tangent function, though the precise mathematical form is less important than the rough shape.

If the transition in and out of a phoneme is sufficiently slow and the phoneme is sufficiently short, the steady state might not be reached. This can result in the phoneme sounding slurred, and if it is a vowel, neutralized (i.e. made close to the central vowel, or schwa, <@>). Note that this is a coarticulatory effect and is not the same as phonemic neutralization (where the phoneme really is schwa).

The formula used to determine formant trajectories during a transition between two phonemes is

    f(t) = (f_T + f_S)/2 + (f_T - f_S) \arctan(R t) / (0.843 \times 3.14)    [6]

where f(t) is the frequency during a transition, f_S and f_T are the frequency at the start of the transition and the target frequency respectively, R is a constant for a given type of phoneme transition (e.g. vowel-vowel, vowel-nasal etc.) which determines the speed of the transition (in units of 1/second, since R.t is dimensionless), and t is the time defined so that t = 0 at the phoneme boundary. The transition starts at

    R t = -\tan(0.843 \times 3.14 / 2)                                       [7]

and ends at R.t = +tan(0.843*3.14/2), unless the next formant transition starts before then (as may well be the case if the target phoneme is very short). Use of this formula produces very realistic formant trajectories.

The same formula can be used for formant bandwidths and phoneme amplitudes.

Formant Transition Rates Used for Different Transition Types
------------------------------------------------------------

One of the allophone parameters in the accent file is transition rate. For a transition between two phonemes, the higher of the two transition rates is the one used in the transition.

    Phoneme type              Duration of transition
    ------------              ----------------------
    Vowels                    266ms OR 133ms
    Glides                    160ms
    Approximants              133ms
    Unvoiced fricatives       80ms
    Voiced fricatives         133ms
    Unvoiced stops            200ms
    Voiced stops              133ms
    Nasals (e.g. <9>)         80ms

Pitch Contours
--------------

If a word is said (but not if it is sung) with a steady intonation, its pitch is still not constant. This is most noticeable when voiced fricatives are compared to vowels, and is easily heard in words such as "vase" and "these" when they are said with emphasis (see Appendix IV, Figs. 5 and 6). This suggests that each phoneme has an intrinsic pitch which becomes altered when the intonation patterns of normal speech are superimposed. In the case of unvoiced sounds, their intrinsic pitch can only be determined by extrapolation from adjacent sounds or by analogy from their voiced counterparts.

The intrinsic pitch at any point during an utterance is determined during synthesis in much the same way as formant frequencies, i.e. by using arc tangent functions to determine the intrinsic frequencies during transitions between phonemes.

On top of the intrinsic pitch contour is superimposed the intonation. (See Appendix IV, Figs. 1-4.) Unlike the intrinsic pitch, which is associated with phonemes, the intonation is associated with syllables. Syllables in English do not appear to have any objective acoustic reality other than that given them by the intonation. Even without intonation, words still sound acceptable even if somewhat less natural, and syllables can still be picked out by the listener, even though no syllabification has been done by the program. Also, unlike phonemes, the end of a syllable is not known at the time it is entered, with the result that the arc-tangent method of producing smooth contours cannot work. (It will be recalled that the next phoneme boundary becomes t = 0 for the arc-tangent function, and that the end of the phoneme is known when the phoneme is entered.)
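
The arc-tangent trajectory recalled here (the formula of Connecting Phonemes, which is also used for intrinsic pitch) can be sketched in a few lines of Python. The function names are illustrative; holding the value constant outside the transition window and deriving R from a transition duration are assumptions of the sketch, and math.pi stands in for the paper's 3.14.

```python
import math

ARC_SPAN = 0.843 * math.pi          # the paper writes 0.843*3.14
RT_LIMIT = math.tan(ARC_SPAN / 2)   # |R.t| at the start and end, about 3.97


def rate_for_duration(duration_s):
    """Return R such that the whole transition lasts duration_s seconds.

    ASSUMPTION: the transition is centred on the phoneme boundary, so it
    runs from R.t = -RT_LIMIT to R.t = +RT_LIMIT.
    """
    return 2.0 * RT_LIMIT / duration_s


def transition_value(f_start, f_target, rate, t):
    """Frequency at time t (seconds, t = 0 at the phoneme boundary)."""
    # ASSUMPTION: outside the transition window the value is held at the
    # start or target frequency, so R.t is clamped to the window.
    x = max(-RT_LIMIT, min(RT_LIMIT, rate * t))
    return ((f_target + f_start) / 2.0
            + (f_target - f_start) * math.atan(x) / ARC_SPAN)


if __name__ == "__main__":
    rate = rate_for_duration(0.266)  # the slower vowel transition
    for ms in (-133, -60, 0, 60, 133):
        print(ms, round(transition_value(300.0, 600.0, rate, ms / 1000.0), 1))
```

With these definitions the total span of R.t is a little under 8, which agrees with the factor of 8 in the transition speed entry of the parameter file (Appendix II).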
The formula used for calculating the intonation trajectory is

    p = pm (ip - MIN_PITCH) + MIN_PITCH                                 [8]

where ip = intrinsic pitch, pm = pitch modifier and

    \frac{d(pm)}{dt} = const \cdot pm \cdot (1 - pm)                    [9]

The minimum pitch is set to 130Hz for an adult male voice, and the maximum pitch, also the pitch for vowels, is 150Hz. The pitch modifier is 0 for minimum, and 1 for maximum, pitch. It is determined by the tones used. A high level tone corresponds to a pitch modifier slightly less than 1, and a low level tone to a pitch modifier slightly greater than 0.

Application to Phonetics
------------------------

More reliable phonetic transcriptions of speech can be made by (1) recording and spectroanalysing the vowels, so that they can be plotted on a formant chart for comparison with the cardinal vowels, and then (2) synthesizing words and comparing the result with the natural speech. The transcription is correct when the natural and synthesized speech sound the same (apart from the voice quality). An additional check can be made by comparing spectra. This procedure is best referred to as analysis by synthesis.

Analysis by synthesis was done for ScotEng. The results were in broad agreement with Wells (1982), but there were some notable differences (see Appendix I), particularly in the diphthongs, the vowel in BIRD, and allophonic variation in <l>. As it has only been possible for me to analyse the speech of one Scottish speaker (myself), it is unclear whether or not the differences are general.

For RP and GA, F1 and F2 (i.e. the vowel qualities) were obtained from vowel trapezia (Jassem and Nolan, 1984), and their incidence was obtained from Wells (1982) and Trudgill and Hannah (1985). The synthesis of RP was satisfactory, but that of GA was not. It transpired that this was because the NURSE "vowel" is actually the approximant <J>, and the final element in GA diphthongs appears not to be a vowel, but a glide. Also there is a slight difference in quality between British and American realizations of <J> (see above).
With these changes, the quality of synthesized GA became satisfactory.

The French phonemes were obtained by analysis of the speech of a native speaker, but no comparison with published material has been made.

Acknowledgments
---------------

The synthesizer was developed as a Hong Kong Productivity Council internal project. The hardware for converting between sound and waveform was designed by K.W. Li and constructed by Allen Sin.

References
----------

Catford, J.C. (1988): A Practical Introduction to Phonetics. (Clarendon Press)

Jassem, W. and Nolan, F. (1984): Speech Sounds and Languages, in Electronic Speech Synthesis: Techniques, Technology and Applications, Ed. Bristow, G. (Granada)

Jones, Daniel (1956): Cardinal Vowels (ENG.252, ENG.253, ENG.254A, ENG.255). (Linguaphone Institute Ltd.)

Allen, J., Hunnicutt, M.S. and Klatt, D. (1987): From Text to Speech: The MITalk System. (Cambridge University Press)

Linggard, R. (1985): Electronic Synthesis of Speech. (Cambridge University Press)

Trudgill, Peter and Hannah, Jean (1985): International English. A Guide to Varieties of Standard English. (Edward Arnold)

Wells, J. C. (1982): Accents of English. (Cambridge University Press)

Wells, J. C. (1986): International Conference on Speech Input/Output Techniques and Applications (March, 1986).

Witten, I.H. (1982): Principles of Computer Speech.

APPENDIX I: Phonetic Notation
=============================

Wherever an IPA symbol, or one which closely resembles it, is available in the printable ASCII character set, that symbol is used. The recommended symbols for the vowels are

            Unrounded                   Rounded
    front   central   back      front   central   back

      i        +       W          y        H       u      Close
      I                           Y                U
      e                V          0                o      Half-close
               @                           *
      E        3       A          6                O      Half-open
      &        2
      a                Q          G                D      Open

No special symbol for modifying the place of articulation of vowels is supported; if there is a need for, say, a centralized , a separate symbol, such as <@> (unless it is already used for some other purpose), should be employed.
The recommended symbols for the consonants are

    Bilabial          Alveolar            Velar            LabialVelar
        Labiodental       Palatoalveolar      Uvular           Glottal
            Dental            Palatal           LabialPalatal    Pharyngeal

    m N n 9 N
    p T t k q ?
    b D d g G
    P f 8 s S c x X M h
    B v 5 z 3 Y K
    J j 7 w
    l L 1
    r R

The rows are for nasals, unvoiced plosives, voiced plosives, unvoiced fricatives, voiced fricatives, median approximants, lateral approximants and trills respectively.

There are two points to note. The first is that there is duplication of symbols. In the event of both symbols occurring in the same accent, a substitute must be found for one of the symbols. This occurs in Scottish English and RP for <3>; the solution is either to use <@> for half-open schwa, or for palatoalveolar fricative, e.g. JERK -> or . The second point is the absence of recommended symbols for some of the less frequently encountered sounds. The user should think of an appropriate symbol, taking care that it is not already used for any purpose.

The consonants for English are similar from accent to accent. They are: m, n, , 9, p, t, k, ?, b, d, g, f, 8, s, S, x, M, v, 5, z, 3 (or Z), J (and r), j, w, l, L, 1, r, , and <1> are my symbols for velarized and pharyngealized respectively. (My own accent has three allophones of : <1> (pharyngealized) is used postvocalically, (velarized) is used prevocalically when flanked by a velar consonant or a back vowel, and clear is used in other cases. This contrasts with Wells (1982) and Trudgill, who both state that only one allophone is present, normally . It is notable that RP has an opposition corresponding to my and vs. <1>, and Gaelic one corresponding to my vs. . I am uncertain whether this amounts to an explanation. The opposition stated above only applies strictly in careful speech; in casual speech I tend to pharyngealize sometimes even prevocalically.)

and can be synthesized simply by connecting the two sounds comprising them, and can be synthesized by connecting with <,> (a very short ).
As few, if any, native English accents have a trilled , that symbol can be used for the approximant instead. For GA, is used for the r-coloured schwa in words such as NURSE and LETTER, whereas is used for the consonant. There is no difference in quality between and : they have different intrinsic pitches, though.

The recommended vowel symbols for three accents of English (cf. Wells, 1982) are:

              RP      GA    ScotEng            RP        GA    ScotEng
    KiT       I       I     I        GooSE   Uu        u     u
    DReSS     E (e)   E     E        PRiCE   aI        aj    @i
    SHePHERD  E       E     @        PRiZE   aI        aj    ae
    TRaP      &       &     a        CHoiCE  OI        oj    Oe
    LoT       D       Q     O        MouTH   aU        Qw    Au
    STRuT     A       A     A        NeaR    I@        I;r   ir
    FooT      U       U     u        SQUaRE  E@        E;r   er
    CLoTH     D       O     O        STaRT   Q:        Q;r   a;r]
    HearD     @:      J:    Er       NoRTH   O:        O;r   Or
    BirD      @:      J:    @r       FoRCE   O:        o;r   or
    WorD      @:      J:    Ar       CuRE    O:(U@)    U;r   ur
    FLeeCE    Ii      i     i        Fire    Q:(aI@)   ajr   ae@r
    FaCE      eI      ej    e        Power   Q:(QU@)   Qwr   Au@r
    PaLM      Q:      Q     a        HAPPy   i (I)     i     e
    THoughT   O:      O     O        LETTer  @         J;    @r
    GoaT      @;U     ow    o        COMMa   @         @     A

( is used for <2> in all three accents, and /O/ is used for /D/ in the Scottish accent.)

For Standard French, the recommended symbols are

VOWELS: i, e, E, a, Q, O, o, u, y, 0, 6, (@), &~, Q~, o~, (6~)

CONSONANTS: j, w, 7, p, t (for T), k, b, d (for D), g, f, s, S, v, z, 3, l, R or r (for K), m, n, , 9

APPENDIX II: Parameter file for Scottish English
================================================

Record contents:

    phonetic symbol
    duration (seconds)
    volume (amplitude is proportional to exp(volume))
    phoneme type
        o: vowel
        h: aspirate
        w, l: glide or approximant
        r: trill
        n: nasal
        f: unvoiced fricative
        v: voiced fricative
        t: unvoiced stop
        d: voiced stop
    transition speed: 8 * (sample-frequency) / (transition time)
    natural frequency (Hz)
    Formant frequencies (Hz): for F1, F2, F3, F4
    Formant bandwidths (Hz): for F1, F2, F3, F4
    Fricative amplitude
    Zero frequencies (Hz): for Z1, Z2
    Zero bandwidths (Hz): for Z1, Z2
    Pole frequencies (Hz): for P1, P2
    Pole bandwidths (Hz): for P1, P2

File Contents
-------------

i 0.12  450 o  30 150  300 2150 3050 3450  200  300  800 1200  0    0    0    0    0    0    0    0    0
I 0.10  450 o  60
                  150  400 1700 2300 3450  400  500 1000 1200  0    0    0    0    0    0    0    0    0
e 0.14  500 o  60 150  400 2000 2350 3450  300 1000 1200 1200  0    0    0    0    0    0    0    0    0
E 0.13  580 o  30 150  600 1700 2300 3450  300  500 1200 1200  0    0    0    0    0    0    0    0    0
a 0.14  500 o  30 150  720 1300 2200 3450  600  500  700 1200  0    0    0    0    0    0    0    0    0
A 0.09  500 o  60 150  650 1300 2050 3450  700  700 1000 1200  0    0    0    0    0    0    0    0    0
O 0.13  350 o  30 150  600  950 2100 3450  300  500 1000 1200  0    0    0    0    0    0    0    0    0
o 0.15    0 o  30 150  400  700 2200 3450  300  500 1000 1200  0    0    0    0    0    0    0    0    0
u 0.12  150 o  30 150  350 1100 1950 3450  300  500  400 1200  0    0    0    0    0    0    0    0    0
@ 0.08  450 o  60 150  580 1420 2250 3450  300  600  800 1200  0    0    0    0    0    0    0    0    0
M 0.11 -200 h  50 120  350  650 2200 3450  400  400 1000 1200  0    0    0    0    0    0    0    0    0
h 0.06    0 h  50 150    0    0    0    0    0    0    0    0  0    0    0    0    0    0    0    0    0
w 0.11 -200 w  50 120  350  650 2200 3450  400  400 1000 1200  0    0    0    0    0    0    0    0    0
j 0.12  400 w  50 133  250 2000 3200 3450  500 1000 1500 1200  0    0    0    0    0    0    0    0    0
, 0.08  400 w 100 133  250 2000 3200 3450  500 1000 1500 1200  0    0    0    0    0    0    0    0    0
l 0.09  200 l  60 120  400 1180 2350 3500  600  600 1000  500  0    0    0    0    0    0    0    0    0
L 0.09  200 l  60 120  380  920 2100 3500  600  800 1000  500  0    0    0    0    0    0    0    0    0
1 0.09   50 l  60 130  400  800 2150 3450  500  600  500 1200  0    0    0    0    0    0    0    0    0
r 0.09 -100 l  60 127  350 1240 1650 2550  600  700  700  500  0    0    0    0    0    0    0    0    0
m 0.10    0 n 100 120  200 1200 2100 3450  400  600  800 1200  0    0    0    0    0    0    0    0    0
N 0.09  100 n 100 120  200 1350 2300 3450  400  800  800 1200  0    0    0    0    0    0    0    0    0
n 0.09  100 n 100 120  200 1500 2300 3450  400  800  800 1200  0    0    0    0    0    0    0    0    0
9 0.12   50 n 100 120  200 2200 1800 3450  400 1000 1000 1200  0    0    0    0    0    0    0    0    0
f 0.15  200 f 100 103  300 1150 1950 3450  700  700 1000 1200 14    0    0 1150    0    0    0  700    0
8 0.14    0 f 100 103  450 1300 2000 3450  800  800  500 1200 60    0    0    0 4500    0    0 1000 1000
s 0.15  200 f 100 103  350 1400 2200 3450  700  800  400 1200 50    0    0 3700 4500    0    0 2000 1000
S 0.16    0 f 100 103  220 1800 2250 3100  800 1000 1000 2000 50    0    0 2250 3100    0    0 1000 2000
c 0.16  400 f 100 133  350 1650 2800 3450  500 2000 1000 1200
                                                              40    0    0 1650 2800    0    0 2000 1000
x 0.16  100 f 100 103  400 1200 1600 3450  800 1000 1000 1200 20    0    0 1200    0    0    0 1000    0
v 0.09  -50 v 100 103  300 1150 1950 3450  700  700 1000  800 10    0    0 1150    0    0    0  700    0
5 0.08  100 v  60 103  450 1300 2000 3450  800  800  500 1200 45    0    0    0 4500    0    0 1000 1000
z 0.12   50 v  60 103  350 1400 2200 3450  700  800  400 1200 30    0    0 3700 4500    0    0 2000 1000
3 0.10  100 v  60 103  220 1800 2250 3100  800 1000 1000 2000 40    0    0 2250 3100    0    0 1000 2000
p 0.14 -300 t  40 127  250  900 1700 3450  700 1000 1000 1200 30    0    0    0    0    0    0 1000    0
t 0.12 -100 t  40 130  200 1900 2500 3450  400  600  800 1200 60    0    0 3700 4350    0    0 1500 2000
k 0.12 -100 t  40 133  200 2300 1600 3450  400 1000 1000 1200 30 1250 2750 1600 2300 1000 3000 1000 3000
? 0.06    0 t 100 150    0    0    0    0    0    0    0    0  0    0    0    0    0    0    0    0    0
b 0.07 -300 d  60 127  250  900 1700 3450  700 1000 1000 1200 15    0    0    0    0    0    0 1000    0
d 0.05    0 d  60 130  200 1900 2500 3450  400  600  800 1200 30    0    0 3700 4350    0    0 1500 2000
g 0.08  100 d  60 133  200 2300 1600 3450  400 1000 1000 1200 20 1250 2750 1600 2300 1000 3000 1000 3000

APPENDIX III: Comparison of Scottish English with Cardinal Vowels
=================================================================

Figure 1. (a) unrounded and (b) rounded vowels. Plot of F2 against F1.

Figure 2. Vowel trapezium for Scottish English.

APPENDIX IV: Spectra of Natural and Synthesized Speech
======================================================

Frequencies on the spectrograms range from 0Hz to 5000Hz, while the pitch scale ranges from 80Hz (bottom) to 180Hz (top).

Figure 1. Spectrogram of "alarming", spoken.

Figure 2. Spectrogram of "alarming", synthesized.

Figure 3. Spectrogram of "Macclesfield", spoken.

Figure 4. Spectrogram of "Macclesfield", synthesized.

Figure 5. Spectrogram of "these", spoken. Note the pitch contour.

Figure 6. Spectrogram of "these", synthesized.

Figure 7. Spectrogram of "aga", spoken. Note the voice bar (0-200Hz) and the approach of F2 and F3.

Figure 8. Spectrogram of "ai".
The formants, particularly F2 and F3, can be seen to follow an arc tangent (approximately).
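
As an illustration of the record layout listed in Appendix II, a parameter file of that kind could be read with a short Python sketch such as the one below. Python itself, the names Phoneme, parse_parameter_text and SAMPLE, and the assumption that every record consists of exactly 23 whitespace-separated fields are choices made for this sketch only; they are not part of the synthesizer described in this paper. The eight trailing zero and pole values are kept as one unlabelled tuple, so the sketch does not depend on how they are ordered within a record. The two sample records are copied from the file contents in Appendix II.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed record size: symbol, duration, volume, type, transition speed,
# natural frequency, 4 formant frequencies, 4 formant bandwidths,
# fricative amplitude, and 8 zero/pole values.
FIELDS_PER_RECORD = 23


@dataclass
class Phoneme:
    symbol: str
    duration_s: float
    volume: float
    kind: str                          # phoneme type letter, e.g. "o" for vowel
    transition_speed: float
    natural_freq_hz: float
    formant_freqs_hz: Tuple[float, ...]
    formant_bws_hz: Tuple[float, ...]
    fricative_amplitude: float
    zero_pole_values: Tuple[float, ...]  # 8 raw values, left unlabelled


def parse_parameter_text(text: str) -> List[Phoneme]:
    """Split the text on whitespace and group the tokens into records."""
    tokens = text.split()
    if len(tokens) % FIELDS_PER_RECORD != 0:
        raise ValueError("token count is not a multiple of the record size")
    records = []
    for start in range(0, len(tokens), FIELDS_PER_RECORD):
        rec = tokens[start:start + FIELDS_PER_RECORD]
        nums = [float(tok) for tok in rec[4:]]  # 19 numeric fields after the type
        records.append(Phoneme(
            symbol=rec[0],
            duration_s=float(rec[1]),
            volume=float(rec[2]),
            kind=rec[3],
            transition_speed=nums[0],
            natural_freq_hz=nums[1],
            formant_freqs_hz=tuple(nums[2:6]),
            formant_bws_hz=tuple(nums[6:10]),
            fricative_amplitude=nums[10],
            zero_pole_values=tuple(nums[11:19]),
        ))
    return records


# Two records copied from Appendix II (the vowel "i" and the fricative "s").
SAMPLE = """
i 0.12 450 o 30 150 300 2150 3050 3450 200 300 800 1200 0 0 0 0 0 0 0 0 0
s 0.15 200 f 100 103 350 1400 2200 3450 700 800 400 1200 50 0 0 3700 4500 0 0 2000 1000
"""

if __name__ == "__main__":
    for ph in parse_parameter_text(SAMPLE):
        print(ph.symbol, ph.kind, ph.duration_s, ph.formant_freqs_hz)
```

Because the text is split on whitespace before it is grouped, the sketch does not care whether a record occupies one line or is wrapped over several.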