this post was submitted on 28 Oct 2023
321 points (97.9% liked)

Programmer Humor


Post funny things about programming here! (Or just rant about your favourite programming language.)

all 35 comments
[–] mvirts@lemmy.world 119 points 10 months ago

🙃 compression algorithms hate this one simple trick!!

[–] whileloop@lemmy.world 78 points 10 months ago (2 children)

This is a joke, right? This feels like a very dumb solution. I don't know much about UTF-8 encoding, but it sounds like Roman characters can be encoded shorter than most or all others because of a shorthand that assumes Roman characters. In that case, why not take that functionality and let a UTF-8 block specify which language makes up most of the text so that you can have that savings almost every time? I don't see why one would want it to be random.

[–] alvvayson@lemmy.world 121 points 10 months ago* (last edited 10 months ago) (1 children)

It's a joke.

UTF-16 already exists, which doesn't favor Roman characters as much, but UTF-8 is more popular because it is backward compatible with legacy ASCII.

UTF-32 also exists, which uses exactly four bytes for every code point.

But the thing that equalizes languages is compression.

Yes, a text written in Cyrillic with UTF-8 will take more space than one in a Roman-script language, easily double. However, this extra space is much more easily compressed by an algorithm like GZIP.

So after compression, the two compressed texts will then be similarly sized and much smaller than UTF-16 or UTF-32.
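
To put rough numbers on that, here is a minimal Python sketch, not a benchmark. The sample sentences, the repeat count, and the helper name `encoded_sizes` are assumptions made for illustration. Heavily repeated text compresses far better than real prose, so only the direction of the effect should be read from it.

```python
import gzip


def encoded_sizes(text: str) -> dict[str, tuple[int, int]]:
    """Return {codec: (raw_bytes, gzipped_bytes)} for a few Unicode encodings."""
    sizes = {}
    # The "-le" variants are used so that no byte order mark is counted.
    for codec in ("utf-8", "utf-16-le", "utf-32-le"):
        raw = text.encode(codec)
        sizes[codec] = (len(raw), len(gzip.compress(raw)))
    return sizes


# Illustrative samples only: one English and one Russian pangram, repeated.
latin = "The quick brown fox jumps over the lazy dog. " * 200
cyrillic = "Съешь же ещё этих мягких французских булок, да выпей чаю. " * 200

latin_sizes = encoded_sizes(latin)
cyrillic_sizes = encoded_sizes(cyrillic)

for label, sizes in (("latin", latin_sizes), ("cyrillic", cyrillic_sizes)):
    for codec, (raw, packed) in sizes.items():
        print(f"{label:9} {codec:10} raw={raw:6} gzip={packed:5}")
```

The raw UTF-8 size of the Cyrillic sample comes out well above its character count, while the gzipped sizes are a small fraction of any of the raw sizes. Running it on real documents would give a fairer picture than these repeated pangrams.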

[–] jmcs@discuss.tchncs.de 15 points 10 months ago (1 children)

Besides, most text on the average computer is either within some configuration file (which tends to use Latin script), or within some SGML-derived format which has a bunch of Latin characters in it. For network transmission most things will use HTML, XML or JSON and use English-language property names even in countries that don't speak English (see Yandex's and Baidu's APIs for example).

No one is moving large amounts of .txt files around.

[–] Buckshot@programming.dev 22 points 10 months ago (2 children)

You've never worked in finance then. All our systems at work do nothing but move large amounts of txt files around.

That said, many of our clients still don't support UTF-8, so it's all ASCII and non-Latin alphabets are screwed. They can't even handle characters 128-255, so even stuff like £ is unsupported.

[–] LaggyKar@programming.dev 12 points 10 months ago

That said, many of our clients still don't support UTF-8, so it's all ASCII and non-Latin alphabets are screwed.

Ah, yes, I heard about that sort of thing. Some bank getting a GDPR complaint because they couldn't correct the spelling of someone's name, because their system uses EBCDIC.

[–] anytimesoon@lemmy.ml 4 points 10 months ago (1 children)

finance

even stuff like £ is unsupported.

Probably not an issue then...

[–] fibojoly@sh.itjust.works 7 points 10 months ago* (last edited 10 months ago)

It's not a joke. I worked for a big European bank network and the software there didn't know how to translate from EBCDIC to UTF-8 because none of the devs writing the software knew enough of the other side (mainframe vs PC) to realise this was an issue.

Their solution was "if the file has a ? in it when we receive it, it's probably a £". Which of course completely breaks down the day you have any other untranslated character.

I spent fucking weeks explaining this issue and why this was abominable, but apparently this wasn't enough of an issue for people to fix it. Go figure...
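
For anyone curious what the non-guessing version can look like, here is a rough Python sketch. It assumes the mainframe side uses EBCDIC code page 037 (Python's built-in `cp037` codec). The real code page at that bank is not stated above, and the helper name `ebcdic_to_utf8` is made up for the example.

```python
def ebcdic_to_utf8(data: bytes, codepage: str = "cp037") -> bytes:
    """Decode EBCDIC bytes with an explicit code page, then re-encode as UTF-8.

    The default code page is an assumption; the correct one depends on the
    mainframe's configuration.
    """
    return data.decode(codepage).encode("utf-8")


# Simulate a record coming from the mainframe (illustrative data only).
mainframe_record = "PAID £100".encode("cp037")
converted = ebcdic_to_utf8(mainframe_record)
print(converted.decode("utf-8"))

# Why guessing from "?" breaks down: once unsupported characters have been
# replaced, every one of them looks identical.
lossy = "£100 für €5".encode("ascii", errors="replace")
print(lossy)  # b'?100 f?r ?5'
```

With the code page named explicitly, the £ survives the trip. In the lossy version, three different characters have all collapsed into the same `?`, so there is nothing left to guess from.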

[–] velox_vulnus@lemmy.ml 39 points 10 months ago (3 children)

My language still does not exist in Unicode.

[–] S410@kbin.social 42 points 10 months ago (2 children)

It'll be added when they find some free time!

You see, adding pictures of women with white canes facing right, limes and pregnant men is a very important and time-consuming job! Standardizing an encoding for some human language people actually use is just not as important!

[–] tetris11@lemmy.ml 8 points 10 months ago* (last edited 10 months ago) (2 children)

These are emojis, not Unicode, right?

Edit: well, TIL.

[–] Coolcoder360@lemmy.world 36 points 10 months ago

Emoji are defined as part of Unicode, so they can be encoded alongside other text:

https://unicode.org/emoji/charts/full-emoji-list.html

[–] _MusicJunkie@beehaw.org 21 points 10 months ago

Emoji are part of Unicode. And people demand more of them, so it's no surprise they put effort into those, even if OP thinks they are not important. Few people appreciate the Unicode Consortium for their originally intended work.

[–] PM_ME_FAT_ENBIES@lib.lgbt 2 points 10 months ago

Random transphobia mixed in amongst a good point.

[–] optissima@possumpat.io 19 points 10 months ago (1 children)

Oh please share, what character set?

[–] velox_vulnus@lemmy.ml 42 points 10 months ago (1 children)
[–] steventhedev@lemmy.world 12 points 10 months ago* (last edited 10 months ago) (1 children)

I was not expecting the drama around it. Is the issue truly a different orthography or is more like a different font/ligature issue?

EDIT: forgot the article I found on it: https://restofworld.org/2021/tulu-unicode-script/

[–] velox_vulnus@lemmy.ml 29 points 10 months ago* (last edited 10 months ago) (2 children)

The KTSA under Pavanaja was trying to reform the language and modify it. It was a destructive reformation: the language now borrows some features from the Kannada and Malayalam scripts, and some of the characteristics were newly made and never seen before in the original script. The other camp, under Murthy, was trying to preserve the original, archaic script.

At last, both groups have come to an agreement this year: they will allow the reformed script, as it is already finished and easier to grasp, and it will be called the invented Tulu lipi. The ancient lipi will be called the Tulu-Tigalari lipi. Since there's still some unconfirmed research work on a few characters, all they have to do now is focus on those characters, and they can share the rest of them with the invented lipi.

[–] v_krishna@lemmy.ml 5 points 10 months ago

Very interesting article and background. My father's side of the family is all from Mysuru but also has long roots in Udupi and Manipal. I'll ask if any of them are Tulu speakers; I don't think so, as I've never heard of it.

[–] fred@lemmy.ml 4 points 10 months ago (1 children)

So not all the characters are even known yet?

[–] velox_vulnus@lemmy.ml 4 points 10 months ago

They are known, but there are multiple different forms. Some of the forms may have never been seen, and some of them cannot be expressed in Unicode, as it was made with Latin letters in mind.

So when you're trying to digitize an abugida, you have to be careful about ligatures, because a real-world text may have multiple different forms in different contexts, and you get to choose only one. But when we are talking about archiving, it has to be copied exactly the way it was in the palm inscription.

[–] breakingcups@lemmy.world 6 points 10 months ago (1 children)
[–] velox_vulnus@lemmy.ml 15 points 10 months ago
[–] simplify@lemm.ee 19 points 10 months ago (1 children)

I immediately thought of Leeroy Jenkins in the last sentence.

https://youtu.be/mLyOj_QD4a4?si=6RhZzj8LO3tr80cT

[–] Shhalahr@beehaw.org 2 points 10 months ago (1 children)

Pretty certain it's an intentional reference.

[–] simplify@lemm.ee 1 points 10 months ago (1 children)

You're right, and someone else might be a part of the lucky 10,000 today.

[–] Shhalahr@beehaw.org 0 points 10 months ago (1 children)

And now we have the obligatory xkcd reference. 😁

[–] apotheotic@beehaw.org 16 points 10 months ago (1 children)

I can't read "what a time to be alive" without hearing Two Minute Papers in my head

[–] AVincentInSpace@pawb.social 4 points 10 months ago

hold onto your papers

[–] lowleveldata@programming.dev 11 points 10 months ago (1 children)

longer than necessary

It's as long as it needs to be unique

[–] palordrolap@kbin.social 4 points 10 months ago* (last edited 10 months ago) (1 children)

Sure. OK. How about we put the Greek alphabet at the lower code points and the Latin alphabet higher up, and now you might argue that Latin takes up more space than necessary.

Potential counterpoint: "This is stupid. Latin goes in the lower code points, it always has, it always will. Who's putting Greek down there??"

Well, if Greece had invented computing as well as, let's say, democracy that's very likely how things would be.

In that timeline, someone is using exactly the same line on you "[The representation of Latin text in memory i]s as long as it needs to be unique." and you're annoyed because your short letter to Grandma is using far too much space on your hard drive.
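
The link between code point position and size is easy to check. Below is a small Python sketch. The helper names `utf8_length` and `utf8_bytes` and the two greetings are made up for illustration, while the thresholds are the standard UTF-8 range boundaries.

```python
def utf8_length(code_point: int) -> int:
    """Bytes UTF-8 needs for one code point, based only on its numeric value.

    Surrogate code points (U+D800 to U+DFFF) are not encodable and are
    ignored here.
    """
    if code_point < 0x80:      # U+0000..U+007F: ASCII, including basic Latin
        return 1
    if code_point < 0x800:     # U+0080..U+07FF: Greek, Cyrillic, Hebrew, ...
        return 2
    if code_point < 0x10000:   # U+0800..U+FFFF: rest of the BMP
        return 3
    return 4                   # U+10000 and above: emoji and friends


def utf8_bytes(text: str) -> int:
    """Total UTF-8 size of a string, summed per code point."""
    return sum(utf8_length(ord(ch)) for ch in text)


english_greeting = "Dear Grandma"
greek_greeting = "Αγαπητή γιαγιά"

print(len(english_greeting), utf8_bytes(english_greeting))
print(len(greek_greeting), utf8_bytes(greek_greeting))
```

The English greeting costs one byte per character, and the Greek one costs two bytes for every letter, purely because of where the two alphabets sit in the code space. Swap their positions and the numbers swap too.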

[–] lowleveldata@programming.dev 0 points 10 months ago

Oh true. I'd be so annoyed because I somehow wrote a whole letter to Grandma in English which she couldn't read.

[–] dukk@programming.dev 2 points 10 months ago

And it has 333 upvotes! We must maintain this at all costs…