
Dear software people,

Unicode is older now than ASCII was when Unicode was introduced. It’s not a weird new fad.

It’s complicated but so is the domain it represents. We recognize that we have to think about time zones and leap days and seconds, for instance. And it’s a cleaner abstraction when you aren’t halfhearted about it.

Sincerely,
Charlie

@vruba but isn't it a solved problem by now? Nobody writes new software that's not unicode-aware (and it would be hard to do, because all the systems and languages do it by default now). Converting old software is another matter of course, but that's not specific to Unicode.

@isagalaev It’s a technically solved problem if you make the effort to use the solutions. Some don’t. They assume that a character is a byte, or they consider systems that mishandle Unicode to be good enough to keep in service. That’s the halfheartedness I mean. It may not be specific to Unicode but it’s real.

(Analogously, you still see bugs from daylight saving time shifts, integer overflows, SQL injection, and other problems that are totally solved if you bother to use the solutions.)

@isagalaev @vruba I've definitely seen modern software that doesn't handle unicode properly, like spitting on codepoint boundaries. I've also seen things not properly passing a language flag to their renderer, which is important when figuring out which glyphs to use from CJK unified ideographs

@artemist @vruba ah, that's a good point. CJK is still something I don't think of much myself.

@artemist @isagalaev @vruba Han unification is a tough problem. A user gives you a string of CJK ideographs, but doesn't tell you the language, and their language is either unknown or something like en.

How do you properly display them?

@pnorman @artemist @isagalaev @vruba UTF-8 has this big issue: there are plenty of distinct characters (the Han-unified ideographs) that share the same byte representation. So you need additional out-of-band metadata, potentially at a per-byte-range level.

Maybe a successor for UTF-8 will fix this, though migrating to a newer system is going to take longer than migrating to IPv6.

@whynothugo @artemist @isagalaev
Ya, Unicode assumes that metadata exists, which isn't always true. Sometimes you don't know and never knew which language, just the codepoints.

@artemist @isagalaev @vruba I've posted about this before; in a native code environment especially, things like splitting and eliding multiline strings can be very finicky if you want to respect complex glyphs and directional markers.

@isagalaev @vruba not by default, no. What’s "🤦🏼‍♂️".length in your language of choice?

@nikitonsky @isagalaev @vruba "🤦🏼‍♂️".length – I find it hard to fault Unicode for this.

IMO this has a lot more to do with programming languages that enable humans to trust their intuition when they really shouldn't.

@eiggen

That's Niki's point: programming languages make it easy to do bad stuff when working with Unicode

@nikitonsky @isagalaev @vruba

@eiggen
I suspect Niki would agree that it's not Unicode that's the problem here, but the defaults for how strings are handled in various languages. "😰".length is also kind of … what information are you looking for? The space it takes on disk or wire, or what a human would consider visible characters, or something else entirely? Or: What unit of measurement should the answer have?

https://tonsky.me/blog/unicode/

@nikitonsky @isagalaev @vruba
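To make that ambiguity concrete, here is a minimal sketch in Swift (discussed further down-thread), which exposes each of these units explicitly; the counts in the comments are what the standard views report for this particular emoji:

```swift
let s = "🤦🏼‍♂️"
print(s.count)                 // 1  — grapheme clusters (what a reader would call "characters")
print(s.unicodeScalars.count)  // 5  — Unicode scalar values (codepoints)
print(s.utf16.count)           // 7  — UTF-16 code units (JavaScript's .length, .NET's .Length)
print(s.utf8.count)            // 17 — UTF-8 bytes (space on disk or wire)
```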

@eiggen @isagalaev @nikitonsky @syklemil @vruba already in a low-level C library I carry three different kinds of lengths: size, the number of bytes (UTF-8 encoded); count, the number of multibyte/wide characters; and width, the number of fixed-width screen cells this uses (2).

It gets worse when you throw grapheme clusters into the mix, which so far I’ve left to higher-level code but will have to look at eventually as well…

@nikitonsky
Mine does count bytes, but its documentation (which you see when using autocompletion in your IDE) is very good at saying that len() is not what you think it is and guiding you to the correct solution

@isagalaev @vruba

@nikitonsky @isagalaev @vruba Swift is the only language I’ve worked in that handles this correctly over a wide range of issues. It’s also why its String type infuriates so many when they first come to Swift. myString[2] isn’t possible in Swift; there’s no O(1) way to get “the nth Character.” When you finally accept that this is a key consequence of any deeply Unicode-aware language, it’s a big change in how you think about strings.

To your question: in Swift, that returns 1 (using .count).
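A small sketch of what that looks like in practice (the string and offset here are only illustrative): subscripting by integer doesn’t compile, and reaching the nth Character means walking the grapheme clusters.

```swift
let s = "🤦🏼‍♂️ hello"
// s[2]            // does not compile: String has no integer subscript
let i = s.index(s.startIndex, offsetBy: 2)  // O(n) walk over grapheme clusters
print(s[i])        // "h"
```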

@cocoaphony @isagalaev @vruba I’ve heard that Erlang is the second one, but that’s about it. Shame, really

@nikitonsky @isagalaev @vruba Erlang *can* do it, but only if you explicitly ask for it. The normal "string" type in Erlang is a charlist, which is just a list of Unicode scalars. Swift is the only language I know whose most-used string type is defined as a sequence of grapheme clusters, and whose normal operations are on those clusters.

For example, appending a combining character does not increase the length of a Swift String. It's correct, but very unusual and a little surprising.
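A hedged sketch of that behavior (the combining accent is just one example):

```swift
var s = "e"
print(s.count)                 // 1
s += "\u{0301}"                // U+0301 COMBINING ACUTE ACCENT
print(s)                       // "é"
print(s.count)                 // still 1: the accent joins the previous grapheme cluster
print(s.unicodeScalars.count)  // 2
```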

@nikitonsky @isagalaev @vruba If this stuff interests you, one of my favorite talks I've given was on just this subject. https://www.youtube.com/watch?v=Cz64DHcRxAU

@nikitonsky @vruba okay, okay, admittedly I meant it in a very narrow sense of "everyone thankfully uses utf-8 everywhere by now", so text data is interoperable. I didn't mean all the interesting cases are solved.

Like the length of that emoji, where the correct answer is "emojis don't have a well-defined meaning of length", so nobody should assume anything in this case. But as it turns out, mostly people care about the count of UTF-8-encoded bytes, for storage or memory allocation.

@isagalaev @nikitonsky At a systems programming level, people certainly care more about storage size. But at a UI level, they might want to ensure that only one emoji can be used in a certain context. So perhaps it’s better to say that there are multiple useful senses of the idea of “length” that might matter in different areas. But I think we basically agree about the important parts of this issue.

@vruba @nikitonsky but the part about "only one emoji can be used in a certain context" is interesting. What context? Emoji pickers in UI couldn't care less about `.length`, they are tables of grapheme clusters, where each emoji is a full utf-8 encoded string that gets appended to a string in a text input. Nobody cares if its length is 1 or more.

It's just not a(n important) use case. Just like it turned out that nobody needs random access to "characters" in a string by index in O(1).

@isagalaev @nikitonsky Think of emoji reactions or status fields. They could be implemented in such a way that it is reasonable to check that the user only submits one emoji at a time, or at least that only one is displayed at a time. Or consider CSS use cases like developer.mozilla.org/en-US/do
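A rough Swift sketch of that kind of check (the function name is made up, and isEmojiPresentation misses emoji that rely on a variation selector, so treat it as a starting point rather than a complete validator):

```swift
func looksLikeSingleEmoji(_ s: String) -> Bool {
    // Exactly one grapheme cluster...
    guard s.count == 1, let scalar = s.unicodeScalars.first else { return false }
    // ...whose leading scalar renders as emoji by default.
    return scalar.properties.isEmojiPresentation
}

print(looksLikeSingleEmoji("🤦🏼‍♂️"))  // true
print(looksLikeSingleEmoji("ab"))     // false
print(looksLikeSingleEmoji("🙂🙂"))    // false — two emoji, not one
```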

@isagalaev @nikitonsky All I’m saying is that the length of a string as a reader would understand it (not only as a hard drive would understand it) is a useful concept that should be exposed by at least some string libraries. I strongly agree with you that it’s not worth optimizing for at the cost of, well, almost any other operation.

@vruba @isagalaev Even simpler. Doing this is impossible without being aware of grapheme clusters

@nikitonsky @vruba as for truncation with "…", UIs seem to universally converge on visual hiding with a transparency gradient, because the visual width of a string only makes sense after being rendered with a particular font on a particular device. Doing `str[:max] + '...'` was only good enough in the beginning.
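For what it's worth, if you do have to truncate by count rather than by rendered width, a grapheme-aware version is at least safer than slicing bytes or codepoints; a minimal Swift sketch (the function name is made up):

```swift
func truncated(_ s: String, to maxCount: Int) -> String {
    // prefix(_:) counts Characters (grapheme clusters), so it never splits an
    // emoji or a combining sequence — though the visual width still depends
    // on the font and device, as noted above.
    s.count > maxCount ? String(s.prefix(maxCount)) + "…" : s
}

print(truncated("🤦🏼‍♂️🤦🏼‍♂️🤦🏼‍♂️", to: 2))  // "🤦🏼‍♂️🤦🏼‍♂️…"
```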

@isagalaev @vruba remember when every Cyrillic letter was counted as two toward the character limit on Twitter? Those were not fun times

@isagalaev @vruba I disagree. Emoji length is well-defined (in the same Unicode standard that you are so happy everyone uses). Even more, that definition _is consistent_ with what people expect it to be.

The problem is that most programming languages _choose to ignore_ that and instead give you a metric that has no practical use whatsoever (unicode codepoints)

@nikitonsky @vruba ah, that's a good clarification, thanks! I didn't know it was defined. Although I believe the story with languages not supporting it is more complicated than just choosing to ignore.

I'm going to still consider unicode a solved problem, even if it means "except emoji". It's just so much better now than it was ~20 years ago.

@isagalaev @vruba I specifically address this misconception about “only emoji”, see “Is Unicode hard only because of emojis?” in https://tonsky.me/blog/unicode/

@isagalaev @nikitonsky @vruba "everyone uses utf8 everywhere now" -- coders still using raw Win32 API and/or Java look on and lolsob in UTF16

@nikitonsky @isagalaev @vruba .Length is 7 in dotnet; we need new System.Globalization.StringInfo("🤦🏼‍♂️").LengthInTextElements to get 1 grapheme.

@isagalaev @vruba

> Nobody writes new software that's not unicode-aware

This is just categorically false. It happens all the time in new software. Frequently some parts of the system will accept Unicode but many little corners either won't take it or mangle it or whatever.

@isagalaev @vruba It's not just old software, it's also old standards. I recently stumbled over this QR decoding issue, and it's depressing. You can specify the character encoding in the QR code, but many people don't. The standard says it's ISO-8859-1 in that case, but more likely it will be either BIG-5 or UTF-8, and the decoder has to guess. (If it wants to guess correctly, it should check probabilities with a language model or something...) https://github.com/mchehab/zbar/issues/212
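A minimal sketch of the guessing a decoder ends up doing, assuming Foundation (Big-5 is left out here because String.Encoding has no direct case for it): try UTF-8 first, since it rejects most byte sequences that aren't actually UTF-8, then fall back to ISO-8859-1, which never fails because every byte maps to some character.

```swift
import Foundation

func decodeQRPayload(_ bytes: [UInt8]) -> String? {
    // Strict UTF-8 decode: returns nil on invalid sequences.
    if let s = String(bytes: bytes, encoding: .utf8) { return s }
    // Fallback the QR spec prescribes when no encoding is declared.
    return String(bytes: bytes, encoding: .isoLatin1)
}
```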

@isagalaev @vruba

It's way more messy for CJK due to Han unification: given that you actually want to draw the "same" pictogram differently across different languages, you need to pick a font depending on the language. :(
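On Apple platforms, one place that out-of-band language metadata can live is an attribute on the string itself; a hedged sketch assuming a recent Foundation with AttributedString (U+76F4 is just one example of a codepoint whose preferred glyph differs between Japanese and Chinese typography):

```swift
import Foundation

// The same codepoint, tagged with different languages so a renderer
// can choose the appropriate regional glyph form.
var japanese = AttributedString("直")
japanese.languageIdentifier = "ja"

var simplifiedChinese = AttributedString("直")
simplifiedChinese.languageIdentifier = "zh-Hans"
```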

@isagalaev @vruba you would be surprised what (new) software I come across every once in a while

@vruba @bthalpin I remember when a friend joined the Unicode team at Microsoft in 1989/90. At the time it seemed like a corporate backwater, but was hugely consequential. 64K characters felt like more than we could use, but then 640K of RAM also felt that way to some.

@vruba Dear Microsoft,

UCS-2, and thus UTF-16, were a wrong-headed mistake from the start. Join the rest of us over in UTF-8 land, and stop trying to ice-skate uphill.

@spacehobo @vruba UCS-2 was entirely reasonable at the beginning, when Unicode (pre 10646) was a 16-bit encoding.

Before Plan 9’s UTF-8, no other encoding was really suitable for all use cases.

@mirabilos @vruba Cool, and when did Windows choose UTF-16? Was it before or after Plan 9 invented UTF-8?

@vruba @spacehobo after, but at that point they had an ABI, and everyone will inform you ABI breakage is bad (except OpenBSD, who’ll just recompile everything).

You’ll be glad to know they managed to make UTF-8 fit into the codepage system as cp65001 many years ago, and nowadays there’s also a global-ish all-text-is-UTF-8 toggle. It doesn’t affect the wide APIs much, but they now recommend using the *A APIs with UTF-8, so your application can operate on UTF-8 or UCS-4 ad libitum (and encode the UCS-4 to UTF-8 for calls, like on Unix).

@vruba And you finally can have slashes in your filenames. Thanks, UTF! 👏

@vruba I love unicode but whyyyy are the specs still released canonically as XML?

@vruba

Next you're going to be asking about our use of UTF-8, why we're not consistent with data sanitization, and asking us for language and localization support?

https://xkcd.com/327/

@vruba Very good point. At the same time, Unicode is far more complex than ASCII and changes a lot more often.

@vruba unfortunately, most people don’t account for leap seconds. Ask me how I know…

@vruba It solves the encoding problem, but it's frighteningly unusable from the data entry side. Granted, that's not the problem it was designed to solve, but it's frustrating having the possibility of writing (say) technical text with proper characters lost in a sea of characters not used in the text. We went from slightly fewer than 128 usable characters to a near-infinite amount of noise (nobody uses more than a tiny fraction of what the encoding makes available). The encoding is fine, but it's a pity editors and other software haven't caught up to the representation.

@vruba I have bad news about how well software engineers reason about time zones

@fluffy @vruba
Facebook engineers convinced the organization that handles leap seconds that the Earth's rotation is not an important thing to be synchronized to, and that they should stop doing leap seconds. That's how the "problem" was "handled".

@vruba Han unification (and related Syriac and Nastaliq issues) aren't well solved issues. But plain ASCII doesn't help you there!

@pnorman Right – the point is not that Unicode is perfect, because it has various obvious warts and arguable major design flaws. It’s that its utility vastly exceeds any realistic alternative’s, especially ASCII’s, for basically all user-facing software.

@pnorman Someday that will change, and that will be a new question. For now, as I expect you agree, arguing for ASCII as the only way to do strings is like arguing for 32-bit unsigned integers as the only way to do numbers. “But they’re fast and easy to reason about!”

@vruba i never thought about that, but you're right lol

@vruba Dear software people : *DO NOT* dare to use UNICODE in Windows / VS or such, there's NO such "standard" as "the standard". Just use UTF-8 encoding for it and be sure NOWHERE in your code there's WCHAR and/or wchar_t ! .. otherwise NOTHING will EVER work the way it SHOULD. You may end up with "WCHAR" at 32 bits still encoded in UTF-8 ( i.e. instead of bytes8 using long32 but still with UTF-8 encoding ). Also beware of BOM and such and compiler options. It's a PROPER MESS ! ( CONT ).
