Although the current method of building diphones works well, I’m finding that there’s probably a better way of doing things.
For example, consider a recording of “maa-maam” (there’s no gap at the hyphen; it’s just there for readability).
This single recording would be cut up into:

sil_m, m_aa, aa_m, m_aa, aa_m, m_sil
As a rule, I’m not using the “pure” vowels, only the transitions. So there’s no
aa diphone. But I could also include longer n-phones, such as aa_m_aa.
Using long strings of phonemes makes for smoother connections, so having more transitions to choose from is a good thing. In the current implementation, the diphones aa_m and m_aa would be used instead of aa_m_aa, so the result would be much the same.
But back to the original recording: “maa-maam”.
In the current phoneme editor, the same phoneme would have to be marked up 5 different times to create 5 diphones. Since these diphones share common split points, that’s a bit redundant.
At first glance, you might think you could just divide up the recording into phonemes, like so:
SILENCE | M | AA | M | AA | M | SILENCE
But it’s not quite that simple. Consonants need to be divided in half, so it’s more like:
SILENCE M | M AA | AA | AA M | M AA | AA | AA M | M SILENCE
Because stop consonants (such as /t/) don’t divide in half nicely, they are split just before the stop consonant.
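The pairing step above can be sketched in a few lines of Python. This is only an illustration: in practice the split points are positions in the recorded audio, not just labels, and the stop-consonant rule affects where the cut lands, not the diphone names.

```python
def diphones(phonemes):
    """Pair each phoneme with its neighbor to name the transitions.

    The sustained vowel portions (the lone AA segments above) are
    skipped, since only the transitions are kept.
    """
    return [f"{a}_{b}" for a, b in zip(phonemes, phonemes[1:])]

print(diphones("sil m aa m aa m sil".split()))
# ['sil_m', 'm_aa', 'aa_m', 'm_aa', 'aa_m', 'm_sil']
```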
The current system of splitting recordings starts to work less well when I want to use longer phoneme strings.
For example, here’s a string from the Arpasing reclist:
"surd worlds verse" (s er d - w er l d z - v er s)
Consider the string “er l d z v er” between “worlds” and “verse”. It can be divided into the following 15 groupings:
er l d z v er
er l d z v
er l d z
er l d
er l
l d z v er
l d z v
l d z
l d
d z v er
d z v
d z
z v er
z v
v er
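The enumeration above is just every contiguous run of two or more phonemes. A quick sketch (a hypothetical helper, not synSinger’s actual code):

```python
def n_phone_groups(phonemes):
    """Return every contiguous run of two or more phonemes, joined with '_'."""
    n = len(phonemes)
    return ["_".join(phonemes[i:j]) for i in range(n) for j in range(i + 2, n + 1)]

print(len(n_phone_groups("er l d z v er".split())))  # 15
```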
Using my current system, I’d have to create 15 different n-phones, manually editing them each time. Plus, they’d each have their own data instead of sharing the same audio.
The reason for creating each distinct n-phone is that it greatly simplifies the process of building smoothly connected phonemes at runtime.
But it would obviously be a great time-saver if the system were to automatically split out the phonemes once they were tagged, instead of having to repeat the process for each phoneme.
Note that synSinger doesn’t use the sustained portion of the vowels when building most of its n-phones.
While phonemes only need their split points declared, vowels need to have their start and end tagged.
For example, the string “maa-maam” (m aa m aa m) would require the markup:
+sil m aa+ +aa m aa+ +aa m sil+
and result in:

sil_m, m_aa, sil_m_aa
aa_m, m_aa, aa_m_aa
aa_m, m_sil, aa_m_sil

m_aa and aa_m now appear twice.
This isn’t a huge savings, but it certainly is better than having to mark up the same audio multiple times.
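Under this scheme, expanding the tagged markup into n-phones could look something like the following sketch. The names parse_markup and expand are hypothetical, and real tags would also carry timing information for the split points.

```python
import re

def parse_markup(markup):
    """Split the '+ ... +' groups out of the markup string into phoneme lists."""
    return [seg.split() for seg in re.findall(r"\+([^+]+)\+", markup)]

def expand(segments):
    """Expand each tagged segment into every contiguous run of 2+ phonemes."""
    result = []
    for seg in segments:
        n = len(seg)
        result += ["_".join(seg[i:j]) for i in range(n) for j in range(i + 2, n + 1)]
    return result

phones = expand(parse_markup("+sil m aa+ +aa m aa+ +aa m sil+"))
print(phones.count("m_aa"), phones.count("aa_m"))  # 2 2
```

Shared n-phones such as m_aa and aa_m fall out as duplicate names pointing at different spots in the same audio, rather than separately marked-up copies.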
Back to the example of “surd worlds verse”. The markup would require 16 tags:
+sil s er+ +er d w er+ +er l d z v er+ +er s sil+
And the result is 26 n-phones.
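The 16-tag count is easy to sanity-check by stripping the “+” markers and counting the remaining phoneme labels (assuming this markup format):

```python
markup = "+sil s er+ +er d w er+ +er l d z v er+ +er s sil+"
tags = markup.replace("+", "").split()
print(len(tags))  # 16
```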
This will require a lot of changes to the phoneme editor, but hopefully fewer on the rendering side of the code.