I'm using espeak (from F-Droid) for text to speech, and it's working great. I'd like an app that does speech to text though, ideally supporting Swedish as well as English for Duolingo purposes, but even just English would be more than I have now.

[–] Tiritibambix@lemmy.ml 10 points 9 months ago* (last edited 9 months ago) (2 children)

Futo

https://voiceinput.futo.org/

https://gitlab.futo.org/alex/voiceinput

https://app.futo.org/fdroid/repo/voiceinput-fdroid-20.apk

[–] JackGreenEarth@lemm.ee 2 points 9 months ago (1 children)

What are the details of the futo repo so I can install it with Droidify?

[–] Tiritibambix@lemmy.ml 5 points 9 months ago (2 children)

https://app.futo.org/fdroid/repo/

[–] Pantherina@feddit.de 2 points 9 months ago

This moment when an F-Droid repo works better than official releases in Obtainium

[–] JackGreenEarth@lemm.ee 1 points 9 months ago

Thanks

[–] SnokenKeekaGuard@lemmy.dbzer0.com 5 points 9 months ago

Sayboard?

[–] rufus@discuss.tchncs.de 4 points 9 months ago* (last edited 9 months ago) (1 children)

I think that's it. The two things mentioned in the previous comments are also what I've seen floating around: Sayboard and FUTO's voiceinput. The former is free software, while FUTO releases under a source-available license. Additionally, you can use something like Kõnele (available on F-Droid) to connect to cloud-based services. Disregarding free software, there are probably a few others with a proprietary license, for example Google's STT that is baked into their Android versions.

[–] Pantherina@feddit.de 2 points 9 months ago* (last edited 9 months ago) (1 children)

Google's "speech services" works on GrapheneOS using sandboxed Play services.

Until they add a permission to restrict inter-process communication (IPC), as is possible on Linux with Flatpak, this might be too big of a privacy problem though.

Also a lot of Google stuff is basically a proprietary cloud adapter.

[–] rufus@discuss.tchncs.de 5 points 9 months ago* (last edited 9 months ago) (1 children)

Yeah, I don't know if OP was looking for that. They specified 'FOSS' in the title. But I think Google can also do local STT nowadays, I haven't tried it for quite a while. Sayboard and FUTO work remarkably well. I personally am struggling a bit more with the reverse part: TTS. There isn't much except for espeak if you want other languages than English (and maybe Russian since there is another project that does a few other languages.) But I skipped on the Google services on my phone.

[–] Pantherina@feddit.de 2 points 9 months ago (1 children)

Yeah, TTS also works well from Google but I use espeak on my personal profile too.

There are many other TTS libraries out there, it's just that nobody has made an Android app for them yet.

[–] rufus@discuss.tchncs.de 2 points 9 months ago* (last edited 9 months ago) (1 children)

Sure. Seems the Thorsten voice is in every FLOSS text-to-speech project. I think he (the real Thorsten) also does YouTube videos about that topic.

I don't know of any other free software Android speech software (that also speaks German) except for espeak. And I need something that can talk to me in the car. For other purposes it sounds a bit rough in my opinion. Eventually I would like something more state of the art with a more human-like sound. And something that properly ties into my Linux desktop and brings local STT and TTS to every application. I think the components are there already. But we're still missing the proper integration into both platforms. (And maybe a few more voices and training data for new ones in several languages.)

[–] Pantherina@feddit.de 2 points 9 months ago (1 children)

Crazy thing is, this is not magic.

There are dozens of companies doing that, some even for your own voice.

You just need all the phonemes (German: Phonem), so you just read a specific text multiple times and you can have your own custom voice.

This just needs some funding, and the Mozilla Common Voice project should already have more than enough data.

[–] rufus@discuss.tchncs.de 2 points 9 months ago* (last edited 9 months ago) (1 children)

I think we're way past that. I've fiddled around a bit with 'bark' and another more common (open?) solution to do voice cloning. It takes something like a 5-second audio clip of someone talking, extracts features from that, trains an AI model and transfers the 'style' of that voice to arbitrary speech. I don't really know if it's technically similar to the AI tools that can paint an astronaut on a horse and draw it in the style of Van Gogh... but it's the same idea. And bark and other tools can also synthesize speech with an AI model. You can just give it text and instruct it to talk in a relaxed female voice, and it'll do it. However, I wasn't able to get good results out of it. It's nice to play around with, but it's not yet feasible for real-world use. And it takes a proper graphics card (or a cloud service that provides you with GPU compute) to run it.

I don't think these tools use phonemes and the old-fashioned ways of doing it. It is machine learning and AI 'magic' that makes those tools sound more smooth and realistic.

What I also like is coqui-ai. It seems to be entirely free and the samples sound like they're on a completely different level compared to established tools like espeak-ng. Sadly it isn't packaged in any of the Linux distributions I use. And I really don't understand why. It also doesn't need crazy system specs. But it doesn't tie into the desktop at all and requires you to set up conda environments and handle the CUDA libraries, and just running the 'pip install TTS' they listed on their GitHub repo didn't do it for me.

(I excluded the commercial tools here. Big-Tech has some alright TTS. Google, Amazon, Apple, ... they're all usable. elevenlabs.io offer exceptionally good TTS, I think that's what the AI narrated YouTube videos are made with. And I sometimes use the button to convert heise online articles to speech while doing the laundry or other stuff in the house that doesn't take enough time for me to start a podcast. I just wish there was a button on my laptop that'd do the same thing with free software and offer similar quality.)

[Edit: Forget what I said last. I've been distro-hopping lately and it seems coqui-ai/TTS is available in the Linux distribution I installed last week. I'm going to try it tomorrow.]

[–] Pantherina@feddit.de 2 points 9 months ago (1 children)

Packaging is a big thing. On Android the model needs to be integrated in a surrounding modern app using modern libraries.

I wouldn't be too hyped about training an AI with really little data, but if it's substantial this is probably crazy cool.

[–] rufus@discuss.tchncs.de 2 points 9 months ago* (last edited 9 months ago)

Maybe I should have worded things a bit differently. The 5-second snippets are for style transfer. I think it only picks up on the frequency spectrum of a new voice and knows how to handle that because it's been trained with several other voices. I suppose one or two sentences aren't enough to get the pacing right and all the distinct features of a human speaker. I didn't get good results anyway. Tools like the one from ElevenLabs recommend you upload 30 minutes to 3 hours of speech.

I've managed to get the TTS running. The German thorsten/tacotron2-DDC is very good in my opinion. Could be the thing I was looking for. It just gets all the abbreviations and names wrong, but the flow of the voice is quite good. And it's fast, even on my laptop. Sadly I also read that Coqui-AI are shutting down. It seems to be difficult to compete against the big-tech companies who integrate their proprietary TTS tech into their platforms for free.
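For anyone who wants to try the same voice, here is a minimal sketch of calling Coqui's `tts` command-line tool from Python. It assumes the TTS package is already installed and puts `tts` on the PATH, and that the model is addressed by its full name `tts_models/de/thorsten/tacotron2-DDC`. The helper names `build_tts_command` and `speak_to_file` are made up for this example and are not part of Coqui.

```python
import shutil
import subprocess

# Full model name for the German Thorsten voice (assumed naming scheme:
# tts_models/<language>/<dataset>/<model>).
MODEL = "tts_models/de/thorsten/tacotron2-DDC"


def build_tts_command(text, out_path="out.wav", model=MODEL):
    """Return the argument list for Coqui's `tts` command-line tool."""
    return ["tts", "--text", text, "--model_name", model, "--out_path", out_path]


def speak_to_file(text, out_path="out.wav"):
    """Synthesize `text` into a WAV file.

    Returns False without doing anything if `tts` is not installed,
    so the sketch fails softly on machines without Coqui TTS.
    """
    if shutil.which("tts") is None:
        return False
    # The first run may download the model, so it can take a while.
    subprocess.run(build_tts_command(text, out_path), check=True)
    return True
```

Calling `speak_to_file("Guten Morgen, das ist ein Test.", "test.wav")` should write the audio to `test.wav` if everything is set up. This only wraps the command line and does nothing for desktop integration.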

I agree. Packaging and integration are some of the most important aspects if you want to actually use something. A research project is also nice, but those don't solve my everyday tasks. And I can't maintain too many development environments with complex dependencies and copy-and-paste everything to the command line. It absolutely needs to be available on the platform and there needs to be a wrapper that integrates it into the other software I use. We have that for espeak, flite and all the old-fashioned tools. But it's completely missing for the last 5 years of technological advancements...

[–] ExtremeDullard@lemmy.sdf.org 2 points 9 months ago* (last edited 9 months ago) (2 children)

FUTO Voice Input. Hands down:

It works fantastically well
It works offline - in other words, Google doesn't get to spy on what you say
It supports Swedish

FUTO asks you to pay for the app but doesn't force you to. You get the whole application regardless. Just for not treating users like crap and for releasing such nice apps, you really should pay them.

If you do, make sure you download the APK from F-Droid or directly from them, so Google doesn't get any of your money: the APK served by the Google Play Store uses the Play Store to collect payment, whereas the APK served by F-Droid and the direct-download APK allow you to send FUTO money directly with Stripe (credit card).

[–] beyond@linkage.ds8.zone 2 points 9 months ago (1 children)

Unfortunately, FUTO apps like this and Grayjay are not truly free software; they are source-available with a proprietary EULA.

https://hiphish.github.io/blog/2023/10/18/grayjay-is-not-open-source/

[–] ExtremeDullard@lemmy.sdf.org 1 points 9 months ago

Interesting! Thank you for the link.

To be honest, I am not a lawyer, so those issues didn't jump out at me when I quickly read through the - very terse - license.

Also, FUTO seem like decent people, and trustworthy offline voice input software that lets users escape Google's surveillance is hard to come by. FUTO Voice Input is pretty much the only game in town, and the fact that it comes with source code is amazing to me. So I kind of overlooked the finer points to be honest, because it surprisingly ticks all the other boxes that matter to me.

[–] JackGreenEarth@lemm.ee 1 points 9 months ago

Thanks!

[–] LoveSausage@lemmy.ml 2 points 9 months ago (1 children)

https://www.f-droid.org/en/packages/com.github.olga_yakovleva.rhvoice.android/ I use it for Organic Maps in the car, which is fine. Not sure how it does otherwise. No Swedish yet, at least.

[–] N4CHEM@lemmy.ml 3 points 9 months ago

RHVoice is text-to-speech (TTS); what OP is asking for is the opposite: speech-to-text (aka voice recognition).