Refining Diphone Editor Display

The old version of the Diphone Editor was based on Praat, but the current version is slightly different:

Current display of the Diphone Editor

It works pretty well, but I’m not entirely happy with the colors. It might make more sense to have the waveform be a lighter color and use solid black for the marker beams.

Posted in Uncategorized | Leave a comment

Started Diphone Portion of Phoneme Editor

I’ve got the rudimentary display and editing working in the Diphone portion of the Phoneme Editor:

Diphone edit view of the Phoneme Editor

There are still some basic features that are missing, and the display needs some cleanup.

And I have to rewrite the code that performs the analysis, as well as the code to perform the synthesis so that it works with the new database format.

Once that’s done, I’ll need to re-record the reclist in the new format.

Working on Phoneme Editor

There are two parts to the Phoneme Editor – one where the Reclist is recorded and sliced into segments, and one where the segments are divided into phonemes.

I’ve retrofitted the code that works with segments to work with the new database structures:

The Phoneme Editor displaying audio segments.

The next step is to position the markers on the segments. I’ll likely pull that from the existing code of my other phoneme editor, which does something very similar.

Rewrote Reclist Parser

I’ve rewritten the code that takes the recording list (reclist) and parses it out.

Here’s a portion of the reclist:

	vcc-11
	bulbs, b ah l b z
	elf, eh l f
	milk, m ih l k
	help, hh eh l p
	else, eh l s

The line vcc-11 declares the name of the audio file, vcc-11.wav, which consists of five silence-delimited words: “bulbs”, “elf”, “milk”, “help” and “else”. The phonemes associated with each word follow the comma. In the case of “bulbs”, the phonemes are /b ah l b z/.

The code then calculates the markers for each word, which will be placed at the split points of the word. For “bulbs”, they are:

1	sil_b
2	b_ah
3	[ah]
4	ah_l
5	l_b
6	b_z
7	z_sil

These tags are displayed to the left of the marker, with the last marker having no text. The sil (silence) marker is implied:

| sil_b | b_ah | ah | ah_l | l_b | b_z | z_sil |

The square brackets around the vowel indicate that it’s excluded from the transitions, as the sustained portions of vowels are declared elsewhere in the reclist.
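The marker-generation rule above can be sketched in a few lines. This is my own illustration rather than the actual synSinger code, and the vowel set shown is just a small assumed subset:

```python
# Illustrative subset of the vowel inventory; the real list lives in the reclist.
VOWELS = {"aa", "ae", "ah", "eh", "ih", "er"}

def markers(phonemes):
    """Generate marker tags for a word's phoneme list."""
    tags = [f"sil_{phonemes[0]}"]                  # implied leading silence
    for a, b in zip(phonemes, phonemes[1:]):
        if a in VOWELS:
            tags.append(f"[{a}]")                  # vowel excluded from transitions
        tags.append(f"{a}_{b}")
    tags.append(f"{phonemes[-1]}_sil")             # implied trailing silence
    return tags
```

For /b ah l b z/ this produces exactly the seven tags listed above for “bulbs”.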

I’d considered marking the elements with the single phoneme:

| sil | b | ah+ | +ah | l | b | z | sil

But that’s not really accurate, as the markers are placed in the center of the consonants and not the beginning, and since synSinger uses diphones, it makes sense to display them as diphones.

The segments are divided into chains, with vowels at the head and tail of the chains. As silence counts as a vowel, the word “bulbs” consists of two chains of phonemes:

sil b ah
ah l b z sil
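The chain-splitting rule can be sketched like this (again a sketch in Python, with an assumed vowel subset):

```python
VOWELS = {"aa", "ae", "ah", "eh", "ih", "er"}  # illustrative subset

def chains(phonemes):
    """Split a phoneme list into overlapping chains that start and end
    on a vowel. Silence counts as a vowel for this purpose."""
    heads = [i for i, p in enumerate(phonemes) if p in VOWELS or p == "sil"]
    return [phonemes[a:b + 1] for a, b in zip(heads, heads[1:])]
```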

Each chain is then split into the permutations:



I’ve coded the CVCVC section with a stop consonant, which can be edited out and used as a closure /cl/ phoneme. For example, “baa-baab-kaab” codes the /k/ as a /cl/ phoneme. However, only the stop portion of the /k/ is used, which means that the cl_ phoneme can only appear at the tail of the diphone.

This creates the following combinations:


Because there are no diphones in the database that start with the cl phoneme, the sil phoneme is substituted for those cases.

The next step is to have the code generate the SQLite data for the tables, and then modify the Phoneme Editor.

Updated the Reclist

I’ve updated the recording list for synSinger and slightly modified the format. It now includes modified CVCVC strings with stop consonants, so it can better handle missing CC transitions.

I’ve also added the missing VV transitions, so the list should be fairly complete.

The underlying data structures have had some major changes to them, so things will probably be broken for a while as I make changes to the Phoneme Editor.

Adding Longer Strings to the Reclist

Although the current method of building diphones works well, I’m finding that there’s probably a better way of doing things.

For example, consider a recording of “maa-maam” (there’s no gap at the hyphen; it’s just there for readability).

This single recording would be cut up into:


As a rule, I’m not using the “pure” vowels, only the transitions. So there’s no aa diphone. But I could also include:


Using long strings of phonemes makes for smoother connections, so having more transitions to choose from is a good thing. In the current implementation, the diphones

aa_m m_aa

would be used instead of aa_m_aa, so the result would be much the same.

But back to the original recording: “maa-maam”.

In the current phoneme editor, the same audio would have to be marked up 5 different times to create 5 diphones. Since these diphones share common split points, that’s a bit redundant.

At first glance, you might think you could just divide up the recording into phonemes, like so:


But it’s not quite that simple. Consonants need to be divided in half, so it’s more like:


Because stop consonants (such as /p/ or /t/ ) don’t divide in half nicely, they are split just before the stop consonant.

The current system of splitting strings starts to work less well if I want to use longer strings.

For example, here’s a string from the Arpasing reclist:

"surd worlds verse" (s er d - w er l d z - v er s)

Consider the string “er l d z v er” between “worlds” and “verse”. It can be divided into the following 15 groupings:

er l d z v er
er l d z v
er l d z
er l d
er l
l d z v er
l d z v
l d z
l d
d z v er
d z v
d z
z v er
z v
v er
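Enumerating those groupings is just a matter of walking every contiguous run of two or more phonemes. A sketch (not the actual editor code):

```python
def groupings(seq):
    """All contiguous runs of length >= 2, longest first from each start."""
    out = []
    for i in range(len(seq)):
        for j in range(len(seq), i + 1, -1):   # shrink from the right
            out.append(seq[i:j])
    return out
```

Applied to the six phonemes of “er l d z v er”, this yields the 15 groupings in the order listed above.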

Using my current system, I’d have to create 15 different n-phones, manually editing them each time. Plus, they’d each have their own data instead of sharing the same audio.

The reason for creating each distinct n-phone is that it greatly simplifies the process of building smoothly connected phonemes at runtime.

But it would obviously be a great time-saver if the system were to automatically split out the phonemes once they were tagged, instead of having to repeat the process for each phoneme.

Note that synSinger doesn’t use the sustained portion of the vowels when building most of its n-phones.

While only the split points need to be declared for most phonemes, vowels need to have their start and end tagged.

For example, the string “maa-maam” (m aa m aa m) would require the markup:

+sil m aa+ +aa m aa+ +aa m sil+
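Generating that markup mechanically amounts to wrapping each vowel-to-vowel chain in + tags. Here’s a sketch (the vowel subset is an assumption of mine):

```python
VOWELS = {"aa", "ae", "ah", "eh", "ih", "er"}  # illustrative subset

def markup(word_phonemes):
    """Tag a word's phonemes: each vowel-to-vowel chain (silence counts
    as a vowel) gets a leading + on its head and a trailing + on its tail."""
    seq = ["sil"] + word_phonemes + ["sil"]
    heads = [i for i, p in enumerate(seq) if p in VOWELS or p == "sil"]
    chunks = []
    for a, b in zip(heads, heads[1:]):
        chain = seq[a:b + 1]
        chain[0] = "+" + chain[0]
        chain[-1] = chain[-1] + "+"
        chunks.append(" ".join(chain))
    return " ".join(chunks)
```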

and result in:


Note that m_aa and aa_m now appear twice.

This isn’t a huge savings, but it certainly is better than having to mark up the same audio multiple times.

Back to the example of “surd worlds verse”. The markup would require 16 tags:

+sil s er+ +er d w er+ +er l d z v er+ +er s sil+

And the result is 26 n-phones:


This will require a lot of changes to the phoneme editor, but hopefully less on the rendering side of the code.

Frankensteining Older Code

I had been wondering for a while what sort of results I’d get if I hooked together my older formant synthesis code with the WORLD front end.

Hours of frustration later… the results are not good.

It reminded me why I had moved away from formant-based synthesis. Especially when combined with FFT resynthesis!

First, there was the problem with WORLD itself. It apparently didn’t like having any zero values for the spectral envelope. And there were nasty glitches in the waveforms that I couldn’t get rid of:

This is not what you want a waveform to look like

Then there was a “warbling” sound as formants moved and aligned unevenly against FFT bins:

More nasty glitches

Widening the formant bandwidths and increasing the FFT resolution took care of some problems, but then the audio took on a hollow, metallic ringing tone.

In the end, I still couldn’t get it to not sound like a wheezy old man, and it was still filled with pops.

Besides the technical issues, it still sounds overly mechanical, and not really useful.

So it was a useful exercise, but I’ll now be putting that away and moving back to sample-based rendering.

Importing USTs

UST files are the format used for UTAU song files.

I hadn’t really looked at them before, but since I’d been examining other UTAU file formats, I figured I’d give them a look.

They’re a fairly simple plain text format, so it didn’t take a lot to work it out, even without documentation.
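As a rough illustration, a UST file is a series of [#0000]-style sections containing Key=Value lines, so a minimal parser can be sketched in a few lines. This is my own sketch of the format, not synSinger’s importer:

```python
def parse_ust(text):
    """Collect the numbered note sections ([#0000], [#0001], ...) of a UST
    file into dicts of their Key=Value fields. Header sections such as
    [#SETTING] and the trailing [#TRACKEND] are skipped."""
    notes, cur = [], None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("[#"):
            cur = {}
            if line[2:-1].isdigit():       # only numbered sections are notes
                notes.append(cur)
        elif cur is not None and "=" in line:
            key, value = line.split("=", 1)
            cur[key] = value
    return notes
```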

However, getting them into a format that synSinger can use is a bit problematic, because they don’t make the same assumptions.

For example, a fundamental assumption in synSinger is that each note has a syllable associated with it, and each syllable has a nucleus vowel.

The fundamental unit for UTAU is also a note, but the Lyric data element makes no assumption about the audio, other than that it will exist somewhere. For example, here’s a chunk that contains the /sw/ sound:


A single note may be built from multiple chunks, and there’s no sure way to determine whether the chunks should be joined together into a single note.

I wrote a heuristic that made the decision based on:

  • If the pitch (NoteNum) was the same;
  • If there was no rest between this and the prior note; and
  • If a vowel had been encountered

This worked well – but only for the particular UST that I was working with, which happened to follow those rules.
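Expressed as code, the heuristic looks something like this. The field names follow UST conventions, but treating an “R” lyric as a rest is an assumption of this sketch:

```python
def should_join(prev, cur, vowel_seen):
    """Decide whether the current chunk belongs to the prior note:
    same pitch, no rest in between, and no nucleus vowel seen yet."""
    same_pitch = cur["NoteNum"] == prev["NoteNum"]
    no_rest = prev["Lyric"] != "R"      # "R" marking a rest is an assumption
    return same_pitch and no_rest and not vowel_seen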

Unfortunately, there are a lot of different phonetic systems in UTAU, and there’s no way that I can see, from looking at the UST file, to determine which phonetic system is used.

So while I can import one phonetic system, there are plenty more that it will simply fail with.

And even when they’re imported correctly, a UST only displays the lyrics in phonetic format, which is less than ideal to work with.

So it’s doubtful that I’ll go much further with supporting import of UST files.

Revisiting the Reclist

The “reclist” (recording list) is essentially the vocal data that runs the synthesizer.

Since editing reclists is a pain, one of my goals in creating a reclist is to have the smallest possible size that will give good results.

There are lots of examples of reclists from the UTAU community. Those have been very helpful – both the systems that work well, and those that don’t.

The core of the reclist system I’m using is basically a CVCVC (consonant/vowel) sort of recording. For example, the nonsense word “MA-MAM” can be chopped up into multiple diphones:

  • sil_m_ae
  • m_ae
  • ae_m
  • m_ae
  • ae_m_sil

Note that sil_m_ae and ae_m_sil are actually triphones. They could be divided up into diphones like so:

  • sil_m
  • m_ae
  • ae_m
  • m_sil

But it turns out that the articulation of phonemes inside of syllables is different than those at the tail end, so it’s important to capture that.

Generally, the phonemes at the beginning and end of the diphones are half-phonemes, split so that they can be joined up with other diphones. For example, the word “man” would consist of the diphones:

sil_m_ae ae ae_n_sil

That system – plus additional consonants strings – is the core of the latest reclist system I’ve been using.

However, I’ve been comparing the output with older versions of synSinger. While the output that uses the WORLD system is superior – it’s much clearer and less noisy – the articulation isn’t quite as nice.

In the older system, I’d used a nonsense word of the form CV-CVCX-VC, where X is a stop consonant of some sort, such as /k/ or /p/. For example, “MA-MAMP-AM”.

The reason for the stop consonant has to do with the way synSinger handles missing transitions. While the most common CC transitions have been recorded, there are still quite a few that haven’t been recorded.

For those cases, synSinger inserts a sil (silence). For example, “hazmat” might become “haz mat”:

sil_h_ae  ae  ae_z_sil  sil_m_ae  ae  ae_t_sil

This works out fairly well, except that the final consonant often has a release associated with it that isn’t characteristic of speech where the sound immediately resumes.

Using the stop (but not the actual stop consonant) takes care of that problem, as the following stop consonant closes off the consonant:

sil_h_ae  ae  ae_z_stop  sil_m_ae  ae  ae_t_sil
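The fallback logic can be sketched as follows. The unit names are illustrative, and the real database lookup is surely more involved:

```python
def fill_missing_cc(a, b, recorded):
    """If the a_b consonant transition was recorded, use it; otherwise
    close off the first consonant with a stop tail and restart the
    second consonant from silence."""
    if f"{a}_{b}" in recorded:
        return [f"{a}_{b}"]
    return [f"{a}_stop", f"sil_{b}"]
```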

The other issues I’ve noted likely derive from not singing to a click track.

So it’s time to revisit the phoneme editor and see what can be done to make recordings more accurate, and hopefully easier to edit.


Implementing Melismas

A melisma is two or more notes assigned to a single syllable.

I’d noticed that one of the notes in one of my test songs had cut off early. When I finally got around to looking into it, I realized that I’d forgotten to implement melismas in the diphone code.

Correcting that was pretty trivial – it’s just a matter of extending the duration of the prior note.
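A deliberately simplified sketch of that fix, assuming notes are dicts and that a “-” lyric marks a continuation note (both assumptions of mine):

```python
def extend_melisma(notes):
    """Fold continuation notes into the prior note by extending its
    duration; "-" as a continuation marker is an assumed convention."""
    out = []
    for n in notes:
        if n["lyric"] == "-" and out:
            out[-1]["length"] += n["length"]   # extend the prior note
        else:
            out.append(dict(n))
    return out
```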

Here’s the melisma in a phrase of “Somewhere Over the Rainbow”:

The word “rain” is split between two notes.

And here it is in Praat (lyrics added with MS Paint):

The blue line in Praat indicates the pitch line.

The code that generates the pitch line currently runs before the positions of the phonemes are calculated. That will prove to be an issue if I’d like the phonemes to have an influence on the pitch line – changing the shape of the mouth for different phonemes generally triggers momentary pitch instability.

But there are plenty of more basic issues to deal with first.
