For the voiceovers of Echoes of Somewhere I will be doing a hybrid solution: some voiceovers will be real humans, some will be AI and some will be handled by Apple’s text to speech tech from the 90’s.
Gathering the text from the game
I am currently using Adventure Creator for all of my dialogue text management. I might switch to something else later. With Adventure Creator, the way to add VoiceOver audio to a line of text is to create an audio file with a specific name; the system then finds the correct file automatically. To get these filenames, I first need to export all of the dialogue text as a CSV.
This CSV makes creating the VO extremely easy! Here is a small snippet of the text CSV for Echoes of Somewhere.
|197||Speech||I need to go.||Player197||Has all audio|
|198||Speech||Remember to hydrate!||Faucet _1NPC_198||Has all audio|
|199||Speech||A Wet ticket! Amazing!||Microwave199||Missing main audio|
|200||Speech||Lets get that ticket of your’s dry!||Microwave200||Missing main audio|
|201||Speech||Oh wow, it is perfectly flat and dry as a bone!||Player201||Has all audio|
|202||Speech||You are welcome!||Microwave202||Missing main audio|
|203||Speech||So, what is it William?[expression:Shrug]||Player203||Has all audio|
|204||Speech||Finally!||Neighbour_Wife204||Has all audio|
|205||Speech||Good morning Samantha||Player205||Has all audio|
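An export like this is easy to filter with a few lines of script. The sketch below is only an illustration: the column headers (`ID`, `Type`, `Original text`, `Filename`, `Audio status`) are hypothetical names chosen for the example, and the real export may label or order its columns differently.

```python
import csv
import io

# Hypothetical header row; the actual export may use different column names.
SAMPLE_CSV = """ID,Type,Original text,Filename,Audio status
197,Speech,I need to go.,Player197,Has all audio
199,Speech,A Wet ticket! Amazing!,Microwave199,Missing main audio
200,Speech,Lets get that ticket of your’s dry!,Microwave200,Missing main audio
201,Speech,"Oh wow, it is perfectly flat and dry as a bone!",Player201,Has all audio
"""


def lines_missing_audio(csv_text):
    """Return (filename, text) pairs for rows whose status starts with 'Missing'."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (row["Filename"], row["Original text"])
        for row in reader
        if row["Audio status"].startswith("Missing")
    ]


if __name__ == "__main__":
    # Print a to-do list of the lines that still need a voice clip.
    for filename, text in lines_missing_audio(SAMPLE_CSV):
        print(f"{filename}: {text}")
```

Run against the full export, this would give a to-do list of clips that still need to be generated or recorded, along with the filename each clip has to be saved under.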
The different methods of VO
As said, there will be three different VoiceOver sources used for the game: human actors, AI-based VO, and PlainTalk from the 90’s. Here is a bit of a breakdown of which characters will be using which approach.
I have directed actors for TV ads, animations and games before and I thoroughly enjoy the process. It is so nice to feel someone slipping into the character and bringing their personality to life! This is a process I really would not want to give up. And in a project with proper funding, I would always choose human actors over AI for the lead roles.
This is why I dream of having as many human actors in the game as possible. But I am not yet sure which characters will be acted by real humans. As I have no budget, all actors will need to work for free, and this is a big ask of anyone!
To make things a little bit easier, I will first generate every character’s VO with AI text to speech. Doing it like this allows me to implement everything first and then set up a really streamlined process for the final, real human VO. The game will be pretty much done by then, and each actor will be able to hear the context of the conversations acted out while performing.
In Unity, changing the VO to the final one is then just a drag and drop process.
My plan is to prepare for this acting process so well that we can easily knock out each performance in well under a day, as I really do not want to take up too much of anyone’s time!
AI text to speech
AI text to speech is currently extremely usable. I cannot help but feel that this tool will be used more and more in games in the future. It will suddenly allow indie RPGs with hundreds and hundreds of pages of text to be fully voice acted. Until now, this has been unthinkable. No more is there any excuse for a game not to be fully voiced. And fully voiced in all translations! Chinese, sure! German, no problem! Finnish, you bet!
For smaller projects, this is a godsend! For developers and gamers alike!
I have now been using AI to create a first pass for all the spoken text in the game’s first scene, and it is wonderful how much this breathes life into the scenes. How much more premium the game feels! And it is so fast and effortless as well! This workflow enables you to voice the game as you go along. You do not need to wait until the last month of production, after all the text has gone through QA, to record the actors. You can have VO each step of the way! It is a weird new way of working and I do like it, even if the VO is replaced with actors later. It gives a far better feel for the dialogue when testing the game, and shows what the end product will be like at a very early stage.
The AI workflow with Elevenlabs
When I need to turn any of these lines into an audio clip, I simply copy the Original Text to Elevenlabs’ Speech Synthesis, choose the correct actor and hit generate!
If the generated audio does not quite match my idea of the line, I can regenerate, or tweak the sliders a bit. There is not much control to give, but these two sliders do affect the audio quite dramatically.
Turning the stability down makes the performance more dramatic. So for characters that shout or need to sound distressed, turning it all the way down to the red usually does the trick.
Naturally, there are limitations. The controls are very simple and there is no way to tell the AI what nuance of voice I would like to hear for each line. It is very hard to make it sound sarcastic, or to imply a different meaning in the voice than what is portrayed through the text. With human actors, giving this direction would be trivial, but it is very difficult with AI.
After the voice file is created, I simply download it and rename it to match the filename given in the CSV. This file, when placed in the correct folder, will now automatically work in the game!
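The download-and-rename step could be scripted along these lines. This is a rough sketch, not part of my actual setup: the function name `install_clip` and the `Speech` folder are made up for the example, and the folder the game really reads from depends on how the project is configured.

```python
import shutil
import tempfile
from pathlib import Path


def install_clip(downloaded, speech_dir, target_name):
    """Move a downloaded clip into speech_dir, renamed to target_name.

    The original file extension (for example .mp3) is kept.
    """
    downloaded = Path(downloaded)
    speech_dir = Path(speech_dir)
    speech_dir.mkdir(parents=True, exist_ok=True)
    destination = speech_dir / (target_name + downloaded.suffix)
    shutil.move(str(downloaded), str(destination))
    return destination


if __name__ == "__main__":
    # Demonstrate with a throwaway folder and an empty stand-in file.
    with tempfile.TemporaryDirectory() as tmp:
        clip = Path(tmp) / "downloaded_clip.mp3"
        clip.write_bytes(b"")  # stand-in for a real download
        result = install_clip(clip, Path(tmp) / "Speech", "Player197")
        print(result.name)
```

The `target_name` is the value from the filename column of the CSV, such as `Player197`.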
Replacing, adding and removing VoiceOver lines is very easy with this setup!
I wanted the robots, synthetic beings and internet of things appliances to sound artificial. AI text to speech was out of the question as the results were simply too good. I tried to search online for super crappy text to speech implementations, but they were very cumbersome to use with very little variation. Then I remembered the good old Apple PlainTalk from the 90’s! This thing is still bundled with OS X!
You can access this tool from the OS X System Settings under Accessibility. The most famous use case for this is Siri for sure, but all the horrible text to speech models from the 90s are still there! Every. single. one. And they are horrible! I have no idea why Apple would still keep them bundled with the system, especially the “novelty” ones. For this project I am thankful though as these are absolutely perfect for my game!
Text to Speech workflow with PlainTalk
Getting the VoiceOver lines from Apple’s text to speech into Unity is not as easy as using the Elevenlabs service. It requires a couple of additional steps!
First, I need to set up a hotkey for the text to speech in my system settings. Then I also need to select the voice I want to use for the particular character.
This next step is only for us Mac users. I bet this is easier on a PC, as the Mac does not let you capture the device audio output easily.
To make this happen, I need to install a tool called Blackhole. It is a plugin that allows me to pipe the Mac output audio to a virtual input device. As the recording software, I used QuickTime Player. The only drawback is that once this is enabled, you are unable to hear any audio from your Mac! So in essence you will be working deaf.
Once Blackhole is installed, I need to set it as the computer audio output. Now I can load up the CSV and set QuickTime to record all outgoing computer audio. For each line I need to record, I select the correct cell in the CSV and press the speech keyboard shortcut. After I see the text-to-speech box disappear, I can do the same for the next line.
Once all the lines are captured, I open the captured audio file in Adobe Audition and chop it into pieces, saving each separate line with the correct filename.
It is not the most elegant workflow, but it gets the job done and I get absolutely perfect voices for my synthetic characters!
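A possible alternative, which is not the workflow described above, is the `say` command line tool that ships with macOS. It can write speech straight to an audio file with `-o`, which would skip both the recording and the chopping steps. The sketch below only assumes that the voice chosen for a character is available to `say` (running `say -v '?'` lists the installed voices); the function names and the output filename are illustrative.

```python
import subprocess
import sys


def build_say_command(text, voice, output_path):
    """Build the argument list for macOS `say`, writing to a file instead of the speakers."""
    return ["say", "-v", voice, "-o", str(output_path), text]


def render_line(text, voice, output_path):
    """Run `say` for a single line of dialogue. Only works on macOS."""
    if sys.platform != "darwin":
        raise RuntimeError("the `say` command is only available on macOS")
    subprocess.run(build_say_command(text, voice, output_path), check=True)


if __name__ == "__main__":
    # Show the command that would be run for one line from the CSV.
    print(build_say_command("You are welcome!", "Junior", "Microwave202.aiff"))
```

Looping `render_line` over the rows of the CSV would produce one correctly named file per line. Any distortion effects would still need to be applied afterwards in an audio editor.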
The game’s casting
I am also using this devblog post as a bit of a design document! This section has all my casting choices for the game.
Main character: Clyde from Elevenlabs (pitched down)
William: Ethan from Elevenlabs
Samantha: Elli from Elevenlabs
Microwave: PlainTalk Junior (heavily distorted with flange)
Robot Sentry: PlainTalk Zarvox (heavily distorted with flange)
Dishwasher: PlainTalk Ralph
Faucet: PlainTalk Ava
Agatha: Dorothy from Elevenlabs