
# VoiceClone and XTTS Setup

You can skip to the bottom if you simply want to use an existing XTTS voice model.

# Preamble

Hey commanders, welcome back from the black. After struggling for an unreasonable amount of time, I have finally managed to get Covas set up with the specific voice that I want. Given the pain I went through to accomplish that, I figured I would write a bit of a guide documenting my setup and how I got there. It is likely far from ideal, and there are probably easier methods, but this is how I did it.

As a note, we will be using WSL and Windows for this, but since all of the software runs natively on Linux, this guide should work there as well.

First, we need to set up our environment. Head into Control Panel, then Programs, then "Turn Windows Features On or Off." In this menu, enable both Virtual Machine Platform and Windows Subsystem for Linux. If this is your first time enabling them, you may need to restart after this step.
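If you prefer the command line, the same two features can be enabled from an elevated prompt with DISM (on recent Windows builds, `wsl --install` also handles both, plus an Ubuntu install, in one step):

```
dism.exe /online /enable-feature /featurename:Microsoft-Windows-Subsystem-Linux /all /norestart
dism.exe /online /enable-feature /featurename:VirtualMachinePlatform /all /norestart
```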

Next, we are going to head over to docker.com and download the installer. As I already have it set up, I won't run through the install process in detail here. The only thing to note is that, when you are installing it, make certain the box to use WSL 2 as your backend is checked. Once you have it installed, you may need to log out or restart.
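Once Docker Desktop is back up, a quick sanity check from a command prompt confirms that WSL and the Docker engine are both alive:

```
wsl --status
docker --version
docker run hello-world
```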

We are now going to set up what you will be using to train your voice model: the AllTalk V2 beta.

# Training

This section is only needed if you want to train your own model. Skip ahead if you already have one!

<details><summary>Click here to show training instructions. Expect scuff ahead.</summary>

## Setting up Alltalk

Before installing AllTalk, ensure you have the following:

* Git, for cloning GitHub repositories.
* Microsoft C++ Build Tools and Windows SDK, for proper Python functionality. [Installation instructions](https://github.com/erew123/alltalk_tts/wiki/Install-%E2%80%90-WINDOWS-%E2%80%90-Python-C-&-SDK-Requirements)
* Espeak-ng, required for multiple TTS engines to function. [Installation instructions](https://github.com/erew123/alltalk_tts/wiki/Install-%E2%80%90-WINDOWS-%E2%80%90-Espeak%E2%80%90ng)

If you already have these installed, you can proceed directly to the Quick Setup instructions.

Open Command Prompt and navigate to your preferred directory:

```
cd /d C:\path\to\your\preferred\directory
```

Clone the AllTalk repository and enter it:

```
git clone -b alltalkbeta https://github.com/erew123/alltalk_tts
cd alltalk_tts
```

Run the setup script:

```
atsetup.bat
```

Follow the on-screen prompts:

* Select Standalone Installation and then Option 1.
* Follow any additional instructions to install required files.

## Step One: Creating samples for our dataset

From here we are ready to begin creating our voice model. The first thing you will want to do is get a wav file of the voice you want to clone and place it in the voices folder of your AllTalk installation. An important thing to note here is that you typically want a minimum of 2 minutes of total audio samples (AllTalk won't even allow you to proceed without this minimum met). If you already have enough audio of the voice you want to train, you can skip the sample-generation portion and go straight to training. That minimum is typically pretty easy to meet if you have the dumped audio files for a character from a game, for example.

<details><summary>Screenshot</summary>

![](screenshots/voice-sample-dir.png?raw=true)

</details>
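If your source audio isn't already a wav, ffmpeg (assuming you have it on hand) can convert it; the 22050 Hz mono settings here are just a sensible default for TTS samples, not a hard requirement:

```
ffmpeg -i source.mp3 -ar 22050 -ac 1 sample.wav
```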
From here, you are going to start the AllTalk server. You can do this either by opening the start_alltalk.bat file in the folder you cloned, or by opening a command prompt, cd'ing into the directory, and running the bat file that way. By default, the webui is at http://127.0.0.1:7852/. What we really want, though, is the TTS generator, which is by default at http://127.0.0.1:7851/static/tts_generator/tts_generator.html. We will be using this to generate what we need for our dataset.

Included in this repository, you will find a collection of prompts in various languages, pulled from the [Piper Recording Studio](https://github.com/rhasspy/piper-recording-studio) repo, with the leading digits stripped from them. Grab all three files for the language of your choice (English in my case) and drop their contents into the Text Input section of the generator. Each line should contain a single sentence.

Next, set the chunk size to 1, playback to none, and the Character voice to your choice. In my case, I am using Reed, from Arknights. With our settings in place, hit Generate and wait for it to finish. Once it is done, go ahead and stop the AllTalk server.

<details><summary>Screenshot</summary>

![](screenshots/genvoice.png?raw=true)

</details>
## Step Two: It's time for our Anime Training Arc!

Now that we have our samples, we are going to move on to training. Head back to your AllTalk directory. Because we have 1150 samples, it is best to transfer them first, then begin the dataset creation. The TTS generator will have put them in the outputs folder; we want to copy them to the finetune/put-voice-samples-in-here folder, as shown below.

<details><summary>Screenshot</summary>

![](screenshots/move.png?raw=true)

</details>
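From a command prompt in your AllTalk directory, the transfer is a one-liner (folder names per the default install; adjust if yours differ):

```
xcopy outputs\*.wav finetune\put-voice-samples-in-here\
```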
Next up, run the start_finetune.bat file to start the trainer. Once it's finished loading, it should automatically open itself in a browser window. If it does not, you can use http://127.0.0.1:7052/ to access it. By default it will open to a status page, shown below. You want to make certain all of the boxes are green. If they are not, there are tabs along the top that _should_ give you more info on how to resolve those issues.

<details><summary>Screenshot</summary>

![](screenshots/finetunestatus.png?raw=true)

</details>
Next up, we are moving to step one. Because we already moved our audio samples beforehand, all we need to do here is fill in the project name and hit Create Dataset at the bottom.
<details><summary>Screenshot</summary>

![](screenshots/step1.png?raw=true)

</details>
When we move over to step two, the fields should have auto-populated with the appropriate data for your dataset, so all we need to do here is make sure our project name is correct and hit Run the Training. I don't know nearly enough about the underlying nonsense that is AI training to advise on what settings are optimal, so I left them all at their defaults and it turned out quite well. It took me about an hour to train 10 epochs with my setup.

<details><summary>Screenshot</summary>

![](screenshots/step2.png?raw=true)

</details>
Once it's finished, head over to step three. Put the project name in the top-right text box, click Refresh, load the model, set a prompt, and then hit Generate to get a sample of your model.

<details><summary>Screenshot</summary>

![](screenshots/step3.png?raw=true)

</details>
If you aren't quite happy and want to continue training, you can head back to step two and, in the model option, select the previous model rather than xtts.

<details><summary>Screenshot</summary>

![](screenshots/unhappy.png?raw=true)

</details>
If you are happy, head over to the final tab. Enter your project name, refresh the dropdowns, select the model you are happy with, and set your folder name. The Compact and Move button will move the model to your alltalk/models/xtts folder.

<details><summary>Screenshot</summary>

![](screenshots/happy.png?raw=true)

</details>

</details>

# Setting up an OpenAI compatible TTS server that can use XTTS models

## Preparing for OpenedAI Speech

The server we will be using here is OpenedAI Speech. We're gonna start by opening a WSL terminal from the command prompt, cloning the repo, and entering the folder:

```
wsl -d Ubuntu
git clone https://github.com/matatonic/openedai-speech
cd openedai-speech
```

<details><summary>OpenedAI Speech Dir</summary>

![](screenshots/openedai.png?raw=true)

</details>

Now that we have it cloned, let's go ahead and start the server once to pre-download all of the included models and generate the default configuration files. This will also download all of the sample wavs for the pre-configured voices.

```
docker compose up
```

<details><summary>OpenedAI First Run</summary>

![](screenshots/openedfirstrun.png?raw=true)

</details>

Now we want to make a copy of sample.env named speech.env. It is also a good idea to edit speech.env and uncomment one of the lines to preload XTTS; you will only want to set one model to preload. If you are using a non-Nvidia GPU, you can uncomment the USE_ROCM line. I don't have a non-Nvidia card, so I cannot attest to its performance. In addition, the example below includes two PRELOAD_MODEL lines to show that you can specify older versions of models; xtts and xtts_v2.0.2 are the latest and an older version, respectively.
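The copy itself can be done from the same WSL shell:

```
cp sample.env speech.env
```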

```
TTS_HOME=voices
HF_HOME=voices
PRELOAD_MODEL=xtts
PRELOAD_MODEL=xtts_v2.0.2
EXTRA_ARGS=--log-level DEBUG --unload-timer 300
USE_ROCM=1
```

If you have your own model, you will want to copy it into the voices/tts folder.

To finish preparing the setup, we then want to open voice_to_speaker.yaml in the config folder and add the model beneath the tts-1-hd heading. You can also place a wav file in the voices folder and map it to one of the default voices. An example of both is included in the screenshot below.

<details><summary>voice_to_speaker.yaml example</summary>

![](screenshots/configyaml.png?raw=true)

</details>

With everything set up, we can go ahead and restart the server with the same command from above.
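If the first run is still attached to your terminal, stop it with Ctrl+C or `docker compose down` first; once you are happy with the config, adding `-d` will run the server detached in the background:

```
docker compose down
docker compose up -d
```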

<details><summary>OpenedAI running in docker</summary>

![](screenshots/dockerterm.png?raw=true)

</details>
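Before pointing Covas at it, you can test the server directly. OpenedAI Speech mimics OpenAI's /v1/audio/speech endpoint, so a request like the following (the voice name being whatever is mapped in voice_to_speaker.yaml) should save a short clip:

```
curl http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "tts-1-hd", "voice": "shimmer", "input": "Welcome back from the black, commander."}' \
  -o test.mp3
```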

# Covas
 [Download Covas Here!](https://github.com/RatherRude/Elite-Dangerous-AI-Integration)

Now that we have the server up and running, we can set up Covas to use it:
 ```
 TTS Model Name: tts-1-hd
 TTS Endpoint: http://127.0.0.1:8000/v1
 ```

For your TTS voice, use the name as set in the voice_to_speaker.yaml file. As examples:

```
  shimmer:
    model: xtts
    speaker: voices/shimmer.wav
  reed:
    model: reed # This name is required to be unique
    speaker: voices/tts/reed/sample.wav # voice sample is required
    model_path: voices/tts/reed
```

If I enter shimmer, Covas will use the pre-configured shimmer voice. If I enter reed, it will use my custom Reed model.

<details><summary>Covas TTS settings</summary>

![](screenshots/covastts.png?raw=true)

</details>

# Potential issue with Alltalk

While setting up AllTalk, I ran into an issue with it not finding a DLL, specifically cudnn_cnn64_9.dll, despite it being present in the alltalk env. To resolve the issue, I had to add the folder containing it to my Path environment variable. You can do so by going to System > About > Advanced System Settings > Environment Variables > Path > New and then adding the correct folder, wherever it may be for you.

<details><summary>Environment Variable</summary>

![](screenshots/sysvar.png?raw=true)

</details>
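If you would rather script the change, here is a PowerShell sketch; the cuDNN path is only an example of where the DLL might live, so substitute wherever you actually find it:

```
# Find where the DLL actually lives, e.g.:
#   Get-ChildItem C:\path\to\alltalk_tts -Recurse -Filter cudnn_cnn64_9.dll
$cudnnDir = "C:\path\to\alltalk_tts\alltalk_environment\env\Lib\site-packages\nvidia\cudnn\bin"  # example path

# Append it to the user Path; newly opened terminals will pick this up
$userPath = [Environment]::GetEnvironmentVariable("Path", "User")
[Environment]::SetEnvironmentVariable("Path", "$userPath;$cudnnDir", "User")
```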