I'm using espeak (from F-Droid) for text to speech, and it's working great. I'd like an app that does speech to text though, ideally supporting Swedish as well as English for Duolingo purposes, but even just English would be more than I have now.

[–] Pantherina@feddit.de 2 points 9 months ago (1 children)

Yeah, Google's TTS also works well, but I use espeak on my personal profile too.

There are many other TTS libraries out there, it's just that nobody has made an Android app for them yet.

[–] rufus@discuss.tchncs.de 2 points 9 months ago* (last edited 9 months ago) (1 children)

Sure. Seems the Thorsten voice is in every FLOSS text-to-speech project. I think he (the real Thorsten) also does YouTube videos about that topic.

I don't know of any other free software Android speech software (that also speaks German) except for espeak. And I need something that can talk to me in the car. For other purposes it sounds a bit rough in my opinion. Eventually I would like something more state of the art with a more human-like sound. And something that properly ties into my Linux desktop and brings local STT and TTS to every application. I think the components are there already. But we're still missing the proper integration into both platforms. (And maybe a few more voices and training data for new ones in several languages.)

[–] Pantherina@feddit.de 2 points 9 months ago (1 children)

Crazy thing is, this is not magic.

There are dozens of companies doing that, some even for your own voice.

You just need all the phonemes (German: Phonem), so you read a specific text multiple times and you can have your own custom voice.

This just needs some funding, and the Mozilla Common Voice project should already have more than enough data.

[–] rufus@discuss.tchncs.de 2 points 9 months ago* (last edited 9 months ago) (1 children)

I think we're way past that. I've fiddled around a bit with 'bark' and another more common (open?) solution to do voice cloning. It takes something like a 5-second audio clip of someone talking, and it can extract features from that, train an AI model and transfer the 'style' of that voice to arbitrary speech. I don't really know if it's technically similar to the AI tools that can paint an astronaut on a horse and draw it in the style of Van Gogh... but it's the same idea. And bark and other tools can also synthesize speech with an AI model. You can just give it text and instruct it to talk in a relaxed female voice, and it'll do it. However, I wasn't able to get good results out of it. It's nice to play around with, but it's not yet feasible for real-world use. And it takes a proper graphics card (or a cloud service that provides you with GPU compute) to run it.

I don't think these tools use phonemes and the old-fashioned ways of doing it. It is machine learning and AI 'magic' that makes those tools sound smoother and more realistic.

What I also like is coqui-ai. It seems to be entirely free, and the samples sound a whole level above established tools like espeak-ng. Sadly it isn't packaged in any of the Linux distributions I use, and I really don't understand why. It also doesn't need crazy system specs. But it doesn't tie into the desktop at all, and it requires you to set up conda environments and handle the CUDA libraries. Just running the 'pip install TTS' they list on their GitHub repo didn't do it for me.

(I excluded the commercial tools here. Big-Tech has some alright TTS. Google, Amazon, Apple, ... they're all usable. elevenlabs.io offer exceptionally good TTS, I think that's what the AI narrated YouTube videos are made with. And I sometimes use the button to convert heise online articles to speech while doing the laundry or other stuff in the house that doesn't take enough time for me to start a podcast. I just wish there was a button on my laptop that'd do the same thing with free software and offer similar quality.)

[Edit: Forget what I said last. I've been distro-hopping lately and it seems coqui-ai/TTS is available in the Linux distribution I installed last week. I'm going to try it tomorrow.]

[–] Pantherina@feddit.de 2 points 9 months ago (1 children)

Packaging is a big thing. On Android the model needs to be integrated in a surrounding modern app using modern libraries.

I wouldn't be too hyped about training an AI with really little data, but if it's a substantial amount, this is probably crazy cool.

[–] rufus@discuss.tchncs.de 2 points 9 months ago* (last edited 9 months ago)

Maybe I should have worded things a bit differently. The 5-second snippets are for style transfer. I think it only picks up on the frequency spectrum of a new voice and knows how to handle that because it's been trained with several other voices. I suppose one or two sentences aren't enough to get the pacing right and all the distinct features of a human speaker. I didn't get good results anyway. Tools like the one from ElevenLabs recommend you upload 30 minutes to 3 hours of speech.

I've managed to get the TTS running. The German thorsten/tacotron2-DDC is very good in my opinion. Could be the thing I was looking for. It gets all the abbreviations and names wrong, but the flow of the voice is quite good. And it's fast, even on my laptop. Sadly I also read that Coqui-AI are shutting down. Seems to be difficult to compete against the big-tech companies who integrate their proprietary TTS tech into their platforms for free.
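
For the abbreviation problem, one workaround might be to expand them yourself before handing the text to the engine. Here's a rough sketch of what I mean, not something I've tuned: the abbreviation table and the helper names (expand_abbreviations, build_tts_command, speak) are made up for illustration, and I'm assuming the `tts` command-line tool from coqui-ai/TTS is on your PATH and that the model's full ID is tts_models/de/thorsten/tacotron2-DDC.

```python
import re
import shutil
import subprocess

# Hypothetical abbreviation table (an assumption); extend it for your own texts.
ABBREVIATIONS = {
    "z.B.": "zum Beispiel",
    "usw.": "und so weiter",
    "Dr.": "Doktor",
    "ca.": "circa",
}

# Assumed full model ID for the Thorsten voice mentioned above.
DEFAULT_MODEL = "tts_models/de/thorsten/tacotron2-DDC"


def expand_abbreviations(text, table=ABBREVIATIONS):
    """Replace known abbreviations with their spoken form."""
    # Handle longer abbreviations first so they are not shadowed by shorter ones.
    for abbr in sorted(table, key=len, reverse=True):
        # Only match when the abbreviation is not glued to a neighbouring word.
        pattern = r"(?<!\w)" + re.escape(abbr) + r"(?!\w)"
        spoken = table[abbr]
        text = re.sub(pattern, lambda _match, s=spoken: s, text)
    return text


def build_tts_command(text, out_path, model=DEFAULT_MODEL):
    """Build the argument list for the Coqui `tts` command-line tool."""
    return ["tts", "--text", text, "--model_name", model, "--out_path", out_path]


def speak(text, out_path="out.wav"):
    """Expand abbreviations, then synthesize to a WAV file if `tts` is installed."""
    if shutil.which("tts") is None:
        print("The 'tts' command was not found; install coqui-ai/TTS first.")
        return None
    subprocess.run(build_tts_command(expand_abbreviations(text), out_path), check=True)
    return out_path


if __name__ == "__main__":
    speak("Dr. Müller kommt ca. um 8, z.B. morgen.")
```

This does nothing for names, and a fixed table will miss plenty of cases (an abbreviation at the end of a sentence loses its full stop, for example), but it might be enough for reading articles aloud.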

I agree. Packaging and integration are some of the most important aspects if you want to actually use something. A research project is also nice, but those don't solve my everyday tasks. And I can't maintain too many development environments with complex dependencies and copy-and-paste everything to the command line. It absolutely needs to be available on the platform, and there needs to be a wrapper that integrates it into the other software I use. We have that for espeak, flite and all the old-fashioned tools. But it's completely missing for the last 5 years of technological advancements...