this post was submitted on 20 Jul 2024

How can everything be captured in simple data like people's exact voice noise? (lemmy.world)

submitted 3 months ago* (last edited 3 months ago) by cheese_greater@lemmy.world to c/asklemmy@lemmy.world

If a recording of someone's very rare voice is representable by mp4 or whatever, could monkeys typing out code randomly exactly reproduce their exact timbre+tone+overall sound?

I don't get how we can get rocks to think + exactly transcribe reality in the ways they do!

Edit: I don't get how audio can be fossilized/reified into plaintext

[–] PM_ME_VINTAGE_30S@lemmy.sdf.org 5 points 3 months ago

When you talk about a sample, what does that actually mean?

First, the sound in the real world has to be converted to a fluctuating voltage. Then, this voltage signal needs to be converted to a sequence of numbers.

Here's a diagram of the relationship between a voltage signal and its samples:

The blue continuous curve is the sine wave, and the red stems are the samples.

A sample is the value [1] of the signal at a specific time. So the samples of this wave were chosen by reading the signal's value every so often.
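To make "reading the signal's value every so often" concrete, here is a minimal Python sketch. The function name `sample_sine` and the numbers (a 1 Hz sine read 8 times per second) are made up for illustration; real audio would use something like 44100 samples per second.

```python
import math

def sample_sine(freq_hz, sample_rate_hz, num_samples, amplitude=1.0):
    """Read the value of a sine wave every 1/sample_rate_hz seconds."""
    samples = []
    for n in range(num_samples):
        t = n / sample_rate_hz  # time of the n-th sample, in seconds
        samples.append(amplitude * math.sin(2 * math.pi * freq_hz * t))
    return samples

# Hypothetical example: one full cycle of a 1 Hz sine, sampled 8 times per second.
x = sample_sine(freq_hz=1.0, sample_rate_hz=8.0, num_samples=8)
for n, value in enumerate(x):
    print(f"x[{n}] = {value:+.3f}")
```

Each printed number corresponds to one red stem in a diagram like the one described above.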

Like I recognize that the frequency of oscillations will tell me the pitch of something, but how does that actually translate to a chunk of data that is useful

One of the central results of Fourier Analysis is that frequency information determines the time signal, and vice versa [2]. If you have the time signal, you have its frequency response; you just gotta run it through a Fourier Transform. Similarly, if you have the frequencies that made up the signal, you have the time signal; you just gotta run it through an inverse Fourier Transform. This is not obvious.
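For sampled signals, that round trip can be tried out directly. The sketch below is a deliberately naive discrete Fourier transform and its inverse written with Python's standard `cmath` module. The function names `dft` and `idft` and the eight sample values are my own choices for illustration, and this is the slow textbook formula, not how a real audio library would compute it.

```python
import cmath

def dft(x):
    """Time samples -> frequency weights (naive discrete Fourier transform)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    """Frequency weights -> time samples (inverse transform)."""
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)) / N
            for n in range(N)]

# Made-up samples, only for illustration.
x = [0.0, 0.42, 1.0, 0.42, 0.0, -0.42, -1.0, -0.42]
X = dft(x)                            # frequency description of x
x_back = [v.real for v in idft(X)]    # back to time samples
max_error = max(abs(a - b) for a, b in zip(x, x_back))
print(max_error)  # only floating-point rounding error remains
```

Going there and back returns the original samples up to rounding, which is the "determines, and vice versa" claim in miniature.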

Frequency really comes into play in the ADC and DAC processes because we know ahead of time that a maximum useful frequency exists. It is not trivial to prove this, but one of the results of Fourier Analysis is that you can only represent a signal with a finite number of frequencies if there is a maximum frequency above which there is no signal information. Otherwise, a literally infinite number of numbers, i.e. an infinite sequence, would be required to recover the signal. [2]

So for sampling and representing signals, the importance of frequency is really the fact that a maximum frequency exists, which allows our math to stop at some point. Frequency also happens to be useful as a tool for analysis, synthesis, and processing of signals, but that's for another day.
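One way to see why the maximum frequency matters: if a frequency that is too high for the sample rate sneaks in, its samples can be indistinguishable from those of a lower frequency (an effect usually called aliasing). The sketch below uses toy numbers I picked myself, an 8 Hz sample rate with 1 Hz and 7 Hz cosines, and the function name `sample_cosine` is hypothetical.

```python
import math

def sample_cosine(freq_hz, sample_rate_hz, num_samples):
    """Samples of a unit-amplitude cosine taken every 1/sample_rate_hz seconds."""
    return [math.cos(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(num_samples)]

sample_rate = 8.0                           # toy sample rate, in Hz
low = sample_cosine(1.0, sample_rate, 16)   # 1 Hz cosine
high = sample_cosine(7.0, sample_rate, 16)  # 7 Hz cosine, too fast for 8 Hz sampling

# The two sets of samples agree up to floating-point rounding.
largest_gap = max(abs(a - b) for a, b in zip(low, high))
print(largest_gap)
```

Since the samples alone cannot tell the two tones apart, the system has to know ahead of time that nothing above some maximum frequency is present.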

You mention a sample being stored as a number, which makes sense, but how is that number utilized?

The number tells the DAC how big a voltage needs to be sent to the speaker at a given time. I run through an example below.

Again assuming uncompressed, if my sample "value" comes up as 420, does that include all of the necessary components of that sound bite in a 1/44100th of a second? How would a sample at value 421 compare?

A sample value of 420 is meaningless without specifying the range that samples are living in. Typically, we either choose the range -1 to 1 for floating point calculations, or -2^(n-1) to (2^(n-1) - 1) when using n-bit integer math [7]. If designed correctly, a sample that's outside the range will be "clipped" to the minimum or maximum, whichever is closer.
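As a rough illustration of ranges and clipping, here is a small Python sketch. Signed n-bit integers run from -2^(n-1) to 2^(n-1) - 1. The helper names `integer_range` and `clip` are made up, and dividing by 2^15 to land on the -1 to 1 scale is one common convention that I am assuming here, not the only one.

```python
def integer_range(bits):
    """Smallest and largest value of a signed integer sample with `bits` bits."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

def clip(value, lowest, highest):
    """Force an out-of-range value to the nearest end of the range."""
    return max(lowest, min(highest, value))

lo16, hi16 = integer_range(16)
print(lo16, hi16)                # -32768 32767
print(clip(40000, lo16, hi16))   # 32767, clipped to the maximum
print(clip(420, lo16, hi16))     # 420, already in range

# Assumed convention: divide by 2^15 to express a 16-bit sample on the -1..1 scale.
print(420 / 2 ** 15)
```

So "420" is a fairly quiet value in 16-bit audio, while on a -1 to 1 scale it would be far out of range and get clipped.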

However, once we specify a digital range for digital signals to "live in", if the signal value is within range, then yes, it does in fact contain all the necessary components [6] for that sound bite in a 1/44100th of a second.

As an example [3], let's say that the 69th sample has a value of 0.420, or x[69]=0.420. For simplicity, assume that all digital signals can only take values between Dmin = -1 and Dmax = 1 for the rest of this comment. Now, let's assume that the DAC can output a maximum voltage of Vmax = 5V and a minimum voltage of Vmin = -7V [4]. Furthermore, let's assume that the relationship between the digital value and the output voltage is exactly linear, and the sample rate is 44100Hz. Then, ([69+1]/44100) seconds after the audio begins, regardless of what happened in the past, the DAC will be commanded to output a voltage Vout (calculated below) for a duration of (1/44100) seconds. After that, the number specified by x[70] will command the DAC to spit out a new voltage for the next (1/44100) seconds.

To calculate Vout, we need to fill in the equation of a line.

Vout(x) = (Vmax - Vmin) / (Dmax - Dmin) × (x - Dmin) + Vmin

Vout(x) = (5V - (-7V)) / (1 - (-1)) × (x - (-1)) + (-7V)

Vout(x) = 6(x + 1) - 7 [V]

Vout(x) = 6x + 6 - 7 [V]

Vout(x) = 6x - 1 [V]

As a check,

Vout(Dmin) = Vout(-1) = 6×(-1) - 1 = -7V = Vmin ✓

Vout(Dmax) = Vout(1) = (6×1) - 1 = 5V = Vmax ✓

At this point, with respect to this DAC I have "designed", I can always convert from a digital number to an output voltage. If x>1 for some reason, we output Vmax. If x<-1 for some reason, we output Vmin. Otherwise, we plug the value into the line equation we just fitted. The DAC does this for us 44100 times per second.

For the sample x[69]=0.420:

Vout(x[69]) = 6×x[69] - 1 [V] = 6×0.420 - 1 = 1.520V.

A sample value of 0.421 would yield Vout = 1.526V, a difference of 6mV from the previous calculation.
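If you want to check this arithmetic yourself, here is a minimal Python sketch of the toy DAC from this comment. The function name `dac_voltage` and the constant names are my own, and a real DAC is hardware rather than a function call, so treat this purely as a calculator for the line equation plus clipping.

```python
D_MIN, D_MAX = -1.0, 1.0   # digital range assumed in this comment
V_MIN, V_MAX = -7.0, 5.0   # the "weird" output range, in volts

def dac_voltage(x):
    """Map a digital sample to an output voltage, clipping out-of-range input."""
    x = max(D_MIN, min(D_MAX, x))               # x > 1 gives Vmax, x < -1 gives Vmin
    slope = (V_MAX - V_MIN) / (D_MAX - D_MIN)   # 6 volts per digital unit
    return slope * (x - D_MIN) + V_MIN

for x in (-1.0, 0.420, 0.421, 1.0, 1.5):
    print(f"x = {x:+.3f} -> Vout = {dac_voltage(x):+.3f} V")
```

The 0.420 and 0.421 rows should reproduce the 1.520V and 1.526V figures, and the 1.5 row shows clipping to Vmax.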

And how does changing a sample from 0.420 to 0.421 affect how it's going to sound? Well, if that's the only difference, not much. They would sound practically (but not theoretically) identical. However, if you compare two otherwise identical tracks except that one is rescaled by a digital 1+0.001, then the track with the 1+0.001 rescaling will be very slightly louder. How slight really depends on your speaker system.
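To put a rough number on "very slightly louder": level differences are commonly expressed in decibels as 20×log10 of the amplitude ratio. The sketch below applies that formula to the 1+0.001 rescaling. The track values and the name `gain_in_db` are made up for illustration.

```python
import math

def gain_in_db(scale):
    """Level change, in decibels, from multiplying every sample by `scale`."""
    return 20 * math.log10(scale)

track = [0.0, 0.420, -0.300, 0.100]   # made-up samples on the -1..1 scale
louder = [1.001 * s for s in track]   # the same track rescaled by 1 + 0.001

print(louder[1] - track[1])   # each sample moves by 0.1% of its own value
print(gain_in_db(1.001))      # a bit under 0.01 dB
```

For comparison, doubling every sample works out to about 6 dB with the same formula, so 1.001 really is a tiny change.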

I have used a linear relationship because:

That's what we want as engineers.
This is usually an acceptable approximation.
It is easy to think about.

However, as long as the relationship between the digital value and the output voltage is monotonic (only ever goes up or only ever goes down), a designer can compensate for a nonlinear relationship. What kinds of nonlinearities are present in the ADC and DAC (besides any discussed previously) differ by the actual architecture of the ADC or DAC.
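Here is a hedged sketch of what "compensating" could look like. I invent a distorted converter, `nonlinear_dac`, that is monotonic and hits the same end voltages as the toy DAC above, then search for the digital value that makes it produce the voltage the ideal line would have produced. The cubic curve and the bisection search are purely illustrative assumptions, not how any particular chip does it.

```python
def ideal_dac(x):
    """The linear relationship we want: Vout = 6x - 1 volts."""
    return 6 * x - 1

def nonlinear_dac(x):
    """Hypothetical distorted converter: still monotonic, same endpoints."""
    return 6 * x ** 3 - 1

def predistort(x, steps=60):
    """Find the digital value whose distorted output matches ideal_dac(x)."""
    target = ideal_dac(x)
    low, high = -1.0, 1.0
    for _ in range(steps):
        mid = (low + high) / 2
        # Bisection works only because the output always rises with the input.
        if nonlinear_dac(mid) < target:
            low = mid
        else:
            high = mid
    return (low + high) / 2

x = 0.420
corrected = predistort(x)
print(nonlinear_dac(x))          # uncorrected output, well off the ideal value
print(nonlinear_dac(corrected))  # very close to ideal_dac(x)
```

If the curve were not monotonic, two different digital values could give the same voltage and this kind of search would have no unique answer.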

Is this like a RGB type situation where you'd have multiple values corresponding to different attributes of the sample (amplitude, frequencies, and I'm sure other things)?

Nope. R, G, and B can be adjusted independently, whereas the samples are mapped [5] one-to-one with frequencies. Said differently: you cannot adjust sample values and frequency response independently. Said another way: samples carry the same information as the frequencies. Changing one automatically changes the other.
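A small experiment shows the coupling. The sketch below nudges a single sample by 0.001 and compares the frequency weights before and after, using a naive discrete Fourier transform written with Python's standard `cmath` module. The name `dft` and the sample values are made up for illustration.

```python
import cmath

def dft(x):
    """Time samples -> frequency weights (naive discrete Fourier transform)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

original = [0.0, 0.42, 1.0, 0.42, 0.0, -0.42, -1.0, -0.42]
nudged = list(original)
nudged[1] = 0.421   # change one single sample by 0.001

# How far each frequency weight moved because of that one sample.
changes = [abs(a - b) for a, b in zip(dft(original), dft(nudged))]
print(changes)      # every entry is about 0.001
```

Every frequency weight moves, not just one, which is why samples and frequencies cannot be tuned independently the way R, G, and B can.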

Is a single sample actually intelligible in isolation?

Nope. Practically, your speaker system might emit a very quiet "pop", but that pop is really because the system is being asked to quickly change from "no sound" to "some sound" a lot faster than is natural.

Hope this helps. Don't hesitate to ask more questions 😊.

[1] Actually, it is ideally proportional to the value of the signal, what is termed a (non-dynamic) linear relationship, which is the best you can get with DSP because digital samples have no units! In real life, it could be some non-linear relationship with the voltage signal, especially if the device sucks.

[2] Infinite sequences are perfectly acceptable for analysis and design purposes, but to actually crunch numbers and put DSP into practice, we need to work with finite memory.

[3] Sample indices typically start at 0 and must be integers.

[4] Typically, you'll see either a range of [0, something] volts or [-something, +something] volts; however, to expose some of the details I chose a "weird" range.

[5] If you've taken linear algebra: the way computers actually do the Fourier Transform, i.e. transforming a set of samples into its frequencies, is by baking the samples into a tall matrix (a column vector), then multiplying the sample matrix by an FFT matrix to get a new matrix, representing the weights of the frequencies you need to add to get back the original signal. (In practice, the FFT algorithm exploits the structure of that matrix to skip most of the arithmetic, but the result is the same.) The FFT transformation matrix is invertible, meaning that there exists a unique matrix that undoes whatever changes the FFT matrix can possibly make. All Fourier Transforms are invertible, although the continuous Fourier Transform is too "rich" to be represented as a matrix product.

[6] I have assumed for simplicity that all signals have been mono, i.e. one speaker channel. However, musical audio usually has two channels in a stereo configuration, i.e. one signal for the left and one signal for the right. For stereo signals, you need two samples at every sample time, one from each channel at the same time. In general, you need to take one sample per channel that you're working with. Basically, this means just having two mono ADCs and DACs.

[7] Why 2^n and not 10^n ? Because computers work in binary (base 2), not decimal (base 10).