Audio Samples from "PITS: Variational Pitch Inference without Fundamental Frequency for End-to-End Pitch-controllable TTS"

Github: https://github.com/anonymous-pits/pits

Abstract: Previous pitch-controllable text-to-speech (TTS) models rely on directly modeling fundamental frequency, leading to low variance in synthesized speech. To address this issue, we propose PITS, an end-to-end pitch-controllable TTS model that utilizes variational inference to model pitch. Based on VITS, PITS incorporates the Yingram encoder, the Yingram decoder, and adversarial training of pitch-shifted synthesis to achieve pitch-controllability. Experiments demonstrate that PITS generates high-quality speech that is indistinguishable from ground truth speech and has high pitch-controllability without quality degradation. Code and audio samples will be available at https://github.com/anonymous-pits/pits.

Normal Speech Synthesis

These are results of normal speech synthesis without pitch-shifting. None of the samples are cherry-picked, and all of them were used to measure MOS.

p225_363: There is The Beautiful Game to write.
{DH EH R} {IH Z} {DH AH} {B Y UW T AH F AH L} {G EY M} {T UW} {R AY T}.
GT
VITS
FS2
PITS (A+D+Q)
PITS (A+D)
PITS (A+Q)
PITS (D+Q)

p232_413: It was that bad, that low.
{IH T} {W AA Z} {DH AE T} {B AE D}, {DH AE T} {L OW}.
GT
VITS
FS2
PITS (A+D+Q)
PITS (A+D)
PITS (A+Q)
PITS (D+Q)

p245_358: He's very explosive.
{HH IY Z} {V EH R IY} {IH K S P L OW S IH V}.
GT
VITS
FS2
PITS (A+D+Q)
PITS (A+D)
PITS (A+Q)
PITS (D+Q)

p250_496: Subsidy income will not be affected.
{S AH B S IH D IY} {IH N K AH M} {W IH L} {N AA T} {B IY} {AH F EH K T IH D}.
GT
VITS
FS2
PITS (A+D+Q)
PITS (A+D)
PITS (A+Q)
PITS (D+Q)

p267_420: I've got my own ideas.
{AY V} {G AA T} {M AY} {OW N} {AY D IY AH Z}.
GT
VITS
FS2
PITS (A+D+Q)
PITS (A+D)
PITS (A+Q)
PITS (D+Q)

p277_461: I am greatly relieved that the trust has reached its target.
{AY} {AE M} {G R EY T L IY} {R IH L IY V D} {DH AE T} {DH AH} {T R AH S T} {HH AE Z} {R IY CH T} {IH T S} {T AA R G AH T}.
GT
VITS
FS2
PITS (A+D+Q)
PITS (A+D)
PITS (A+Q)
PITS (D+Q)

p312_417: I have to say, for me, the game was not.
{AY} {HH AE V} {T UW} {S EY}, {F AO R} {M IY}, {DH AH} {G EY M} {W AA Z} {N AA T}.
GT
VITS
FS2
PITS (A+D+Q)
PITS (A+D)
PITS (A+Q)
PITS (D+Q)

p345_398: This was a unique election.
{DH IH S} {W AA Z} {AH} {Y UW N IY K} {IH L EH K SH AH N}.
GT
VITS
FS2
PITS (A+D+Q)
PITS (A+D)
PITS (A+Q)
PITS (D+Q)

p362_410: Figures are not relevant to the strategy.
{F IH G Y ER Z} {AA R} {N AA T} {R EH L AH V AH N T} {T UW} {DH AH} {S T R AE T AH JH IY}.
GT
VITS
FS2
PITS (A+D+Q)
PITS (A+D)
PITS (A+Q)
PITS (D+Q)
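
The transcription line under each sample text groups the phonemes of one word inside a pair of braces and leaves punctuation outside. The symbols appear to be ARPAbet-style without stress markers. For readers who want to reuse these lines as model input, the following is a minimal sketch of how such a line could be split into words and punctuation. The function name `parse_transcription` and the token layout are illustrative assumptions, not part of the PITS code base.

```python
import re

def parse_transcription(line):
    """Split a braced transcription into (kind, value) tokens.

    Hypothetical helper for illustration only. A braced group becomes
    ("word", [phonemes]); any non-space text outside braces becomes
    ("punct", text).
    """
    tokens = []
    for match in re.finditer(r"\{([^{}]*)\}|([^\s{}]+)", line):
        if match.group(1) is not None:
            tokens.append(("word", match.group(1).split()))
        else:
            tokens.append(("punct", match.group(2)))
    return tokens

# Transcription of sample p232_413 from the list above.
example = "{IH T} {W AA Z} {DH AE T} {B AE D}, {DH AE T} {L OW}."
tokens = parse_transcription(example)
for kind, value in tokens:
    print(kind, value)
```

Keeping punctuation as separate tokens preserves the pause cues that the transcriptions carry, such as the comma in the example.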

Pitch-Shifted Speech Synthesis

These are results of speech synthesis with pitch-shifting. None of the samples are cherry-picked, and all of them were used to measure MOS. Please note that a positive scope shift lowers the pitch.

p227_400: However, there is an issue, isn't there?
{HH AW EH V ER}, {DH EH R} {IH Z} {AH N} {IH SH UW}, {IH Z AH N T} {DH EH R}?
PITS(A+D+Q) +8
PITS(A+D+Q) +6
PITS(A+D+Q) +4
PITS(A+D+Q) +2
PITS(A+D+Q) 0
PITS(A+D+Q) -2
PITS(A+D+Q) -4
PITS(A+D+Q) -6
PITS(A+D+Q) -8
PITS(A+D) +8
PITS(A+D) +6
PITS(A+D) +4
PITS(A+D) +2
PITS(A+D) 0
PITS(A+D) -2
PITS(A+D) -4
PITS(A+D) -6
PITS(A+D) -8
PITS(A+Q) +8
PITS(A+Q) +6
PITS(A+Q) +4
PITS(A+Q) +2
PITS(A+Q) 0
PITS(A+Q) -2
PITS(A+Q) -4
PITS(A+Q) -6
PITS(A+Q) -8
PITS(D+Q) +8
PITS(D+Q) +6
PITS(D+Q) +4
PITS(D+Q) +2
PITS(D+Q) 0
PITS(D+Q) -2
PITS(D+Q) -4
PITS(D+Q) -6
PITS(D+Q) -8

p229_390: There were no casualties.
{DH EH R} {W ER} {N OW} {K AE ZH AH W AH L T IY Z}.
PITS(A+D+Q) +8
PITS(A+D+Q) +6
PITS(A+D+Q) +4
PITS(A+D+Q) +2
PITS(A+D+Q) 0
PITS(A+D+Q) -2
PITS(A+D+Q) -4
PITS(A+D+Q) -6
PITS(A+D+Q) -8
PITS(A+D) +8
PITS(A+D) +6
PITS(A+D) +4
PITS(A+D) +2
PITS(A+D) 0
PITS(A+D) -2
PITS(A+D) -4
PITS(A+D) -6
PITS(A+D) -8
PITS(A+Q) +8
PITS(A+Q) +6
PITS(A+Q) +4
PITS(A+Q) +2
PITS(A+Q) 0
PITS(A+Q) -2
PITS(A+Q) -4
PITS(A+Q) -6
PITS(A+Q) -8
PITS(D+Q) +8
PITS(D+Q) +6
PITS(D+Q) +4
PITS(D+Q) +2
PITS(D+Q) 0
PITS(D+Q) -2
PITS(D+Q) -4
PITS(D+Q) -6
PITS(D+Q) -8

p240_380: Safety was also an issue.
{S EY F T IY} {W AA Z} {AO L S OW} {AH N} {IH SH UW}.
PITS(A+D+Q) +8
PITS(A+D+Q) +6
PITS(A+D+Q) +4
PITS(A+D+Q) +2
PITS(A+D+Q) 0
PITS(A+D+Q) -2
PITS(A+D+Q) -4
PITS(A+D+Q) -6
PITS(A+D+Q) -8
PITS(A+D) +8
PITS(A+D) +6
PITS(A+D) +4
PITS(A+D) +2
PITS(A+D) 0
PITS(A+D) -2
PITS(A+D) -4
PITS(A+D) -6
PITS(A+D) -8
PITS(A+Q) +8
PITS(A+Q) +6
PITS(A+Q) +4
PITS(A+Q) +2
PITS(A+Q) 0
PITS(A+Q) -2
PITS(A+Q) -4
PITS(A+Q) -6
PITS(A+Q) -8
PITS(D+Q) +8
PITS(D+Q) +6
PITS(D+Q) +4
PITS(D+Q) +2
PITS(D+Q) 0
PITS(D+Q) -2
PITS(D+Q) -4
PITS(D+Q) -6
PITS(D+Q) -8

p243_398: People have a wish to charge twice.
{P IY P AH L} {HH AE V} {AH} {W IH SH} {T UW} {CH AA R JH} {T W AY S}.
PITS(A+D+Q) +8
PITS(A+D+Q) +6
PITS(A+D+Q) +4
PITS(A+D+Q) +2
PITS(A+D+Q) 0
PITS(A+D+Q) -2
PITS(A+D+Q) -4
PITS(A+D+Q) -6
PITS(A+D+Q) -8
PITS(A+D) +8
PITS(A+D) +6
PITS(A+D) +4
PITS(A+D) +2
PITS(A+D) 0
PITS(A+D) -2
PITS(A+D) -4
PITS(A+D) -6
PITS(A+D) -8
PITS(A+Q) +8
PITS(A+Q) +6
PITS(A+Q) +4
PITS(A+Q) +2
PITS(A+Q) 0
PITS(A+Q) -2
PITS(A+Q) -4
PITS(A+Q) -6
PITS(A+Q) -8
PITS(D+Q) +8
PITS(D+Q) +6
PITS(D+Q) +4
PITS(D+Q) +2
PITS(D+Q) 0
PITS(D+Q) -2
PITS(D+Q) -4
PITS(D+Q) -6
PITS(D+Q) -8

p333_424: Appointed general secretary last September.
{AH P OY N T AH D} {JH EH N ER AH L} {S EH K R AH T EH R IY} {L AE S T} {S EH P T EH M B ER}.
PITS(A+D+Q) +8
PITS(A+D+Q) +6
PITS(A+D+Q) +4
PITS(A+D+Q) +2
PITS(A+D+Q) 0
PITS(A+D+Q) -2
PITS(A+D+Q) -4
PITS(A+D+Q) -6
PITS(A+D+Q) -8
PITS(A+D) +8
PITS(A+D) +6
PITS(A+D) +4
PITS(A+D) +2
PITS(A+D) 0
PITS(A+D) -2
PITS(A+D) -4
PITS(A+D) -6
PITS(A+D) -8
PITS(A+Q) +8
PITS(A+Q) +6
PITS(A+Q) +4
PITS(A+Q) +2
PITS(A+Q) 0
PITS(A+Q) -2
PITS(A+Q) -4
PITS(A+Q) -6
PITS(A+Q) -8
PITS(D+Q) +8
PITS(D+Q) +6
PITS(D+Q) +4
PITS(D+Q) +2
PITS(D+Q) 0
PITS(D+Q) -2
PITS(D+Q) -4
PITS(D+Q) -6
PITS(D+Q) -8

p343_399: The role was a knockout, but really difficult.
{DH AH} {R OW L} {W AA Z} {AH} {N AA K AW T}, {B AH T} {R IH L IY} {D IH F AH K AH L T}.
PITS(A+D+Q) +8
PITS(A+D+Q) +6
PITS(A+D+Q) +4
PITS(A+D+Q) +2
PITS(A+D+Q) 0
PITS(A+D+Q) -2
PITS(A+D+Q) -4
PITS(A+D+Q) -6
PITS(A+D+Q) -8
PITS(A+D) +8
PITS(A+D) +6
PITS(A+D) +4
PITS(A+D) +2
PITS(A+D) 0
PITS(A+D) -2
PITS(A+D) -4
PITS(A+D) -6
PITS(A+D) -8
PITS(A+Q) +8
PITS(A+Q) +6
PITS(A+Q) +4
PITS(A+Q) +2
PITS(A+Q) 0
PITS(A+Q) -2
PITS(A+Q) -4
PITS(A+Q) -6
PITS(A+Q) -8
PITS(D+Q) +8
PITS(D+Q) +6
PITS(D+Q) +4
PITS(D+Q) +2
PITS(D+Q) 0
PITS(D+Q) -2
PITS(D+Q) -4
PITS(D+Q) -6
PITS(D+Q) -8

p376_422: That can happen so easily.
{DH AE T} {K AE N} {HH AE P AH N} {S OW} {IY Z AH L IY}.
PITS(A+D+Q) +8
PITS(A+D+Q) +6
PITS(A+D+Q) +4
PITS(A+D+Q) +2
PITS(A+D+Q) 0
PITS(A+D+Q) -2
PITS(A+D+Q) -4
PITS(A+D+Q) -6
PITS(A+D+Q) -8
PITS(A+D) +8
PITS(A+D) +6
PITS(A+D) +4
PITS(A+D) +2
PITS(A+D) 0
PITS(A+D) -2
PITS(A+D) -4
PITS(A+D) -6
PITS(A+D) -8
PITS(A+Q) +8
PITS(A+Q) +6
PITS(A+Q) +4
PITS(A+Q) +2
PITS(A+Q) 0
PITS(A+Q) -2
PITS(A+Q) -4
PITS(A+Q) -6
PITS(A+Q) -8
PITS(D+Q) +8
PITS(D+Q) +6
PITS(D+Q) +4
PITS(D+Q) +2
PITS(D+Q) 0
PITS(D+Q) -2
PITS(D+Q) -4
PITS(D+Q) -6
PITS(D+Q) -8

s5_398: That will be a decision for BBC Scotland.
{DH AE T} {W IH L} {B IY} {AH} {D IH S IH ZH AH N} {F AO R} {B IY} {B IY} {S IY} {S K AA T L AH N D}.
PITS(A+D+Q) +8
PITS(A+D+Q) +6
PITS(A+D+Q) +4
PITS(A+D+Q) +2
PITS(A+D+Q) 0
PITS(A+D+Q) -2
PITS(A+D+Q) -4
PITS(A+D+Q) -6
PITS(A+D+Q) -8
PITS(A+D) +8
PITS(A+D) +6
PITS(A+D) +4
PITS(A+D) +2
PITS(A+D) 0
PITS(A+D) -2
PITS(A+D) -4
PITS(A+D) -6
PITS(A+D) -8
PITS(A+Q) +8
PITS(A+Q) +6
PITS(A+Q) +4
PITS(A+Q) +2
PITS(A+Q) 0
PITS(A+Q) -2
PITS(A+Q) -4
PITS(A+Q) -6
PITS(A+Q) -8
PITS(D+Q) +8
PITS(D+Q) +6
PITS(D+Q) +4
PITS(D+Q) +2
PITS(D+Q) 0
PITS(D+Q) -2
PITS(D+Q) -4
PITS(D+Q) -6
PITS(D+Q) -8

Variance of Model

These are results of speech synthesis with identical text. Compare the samples for each speaker to check the variance of the outputs.

How much variation there?
{HH AW} {M AH CH} {V EH R IY EY SH AH N} {DH EH R}?

p277

p263

p295

p304

s5

VC Samples
Read VC
Target speaker sample (p239)
Source speech
PITS-VC (Iter=0)
PITS-VC (Iter=3)

Sing VC
Target speaker sample (MPOL)
Source speech
PITS-VC (Iter=0)
PITS-VC (Iter=3)