Dear software people,

Unicode is older now than ASCII was when Unicode was introduced. It’s not a weird new fad.

It’s complicated but so is the domain it represents. We recognize that we have to think about time zones and leap days and seconds, for instance. And it’s a cleaner abstraction when you aren’t halfhearted about it.

Sincerely,
Charlie

@vruba but isn't it a solved problem by now? Nobody writes new software that's not Unicode-aware (and it would be hard to do, because all the systems and languages do it by default now). Converting old software is another matter, of course, but that's not specific to Unicode.

@isagalaev @vruba not by default, no. What’s "🤦🏼‍♂️".length in your language of choice?

@nikitonsky @vruba okay, okay, admittedly I meant it in a very narrow sense of "everyone thankfully uses UTF-8 everywhere by now", so text data is interoperable. I didn't mean all the interesting cases are solved.

Like the length of that emoji, where the correct answer is "emojis don't have a well-defined meaning of length", so nobody should assume anything in this case. But as it turns out, people mostly care about the count of UTF-8-encoded bytes, for storage or memory allocation.

@isagalaev @nikitonsky At a systems programming level, people certainly care more about storage size. But at a UI level, they might want to ensure that only one emoji can be used in a certain context. So perhaps it’s better to say that there are multiple useful senses of the idea of “length” that might matter in different areas. But I think we basically agree about the important parts of this issue.
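
To make the "multiple senses of length" point concrete, here is a minimal Python sketch that measures the same string three ways. It assumes the emoji quoted earlier in the thread is the five-code-point sequence spelled out in the comment; the helper name `lengths` is made up for illustration.

```python
# The emoji from the thread, written as escapes so each code point is visible
# (assumed sequence): U+1F926 FACE PALM, U+1F3FC skin-tone modifier,
# U+200D ZERO WIDTH JOINER, U+2642 MALE SIGN, U+FE0F VARIATION SELECTOR-16.
facepalm = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"


def lengths(text: str) -> dict[str, int]:
    """Return three different 'lengths' of the same string (hypothetical helper)."""
    return {
        # What Python's len() reports: Unicode code points.
        "code_points": len(text),
        # What storage and memory allocation care about: UTF-8 bytes.
        "utf8_bytes": len(text.encode("utf-8")),
        # What a UTF-16-based `.length` reports: 16-bit code units.
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
    }


print(lengths(facepalm))
# {'code_points': 5, 'utf8_bytes': 17, 'utf16_code_units': 7}
```

None of these three numbers is the 1 that a reader would say, which is the sense that needs grapheme cluster segmentation.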

@vruba @nikitonsky but the part about "only one emoji can be used in a certain context" is interesting. What context? Emoji pickers in UI couldn't care less about `.length`: they are tables of grapheme clusters, where each emoji is a full UTF-8-encoded string that gets appended to a string in a text input. Nobody cares if its length is 1 or more.

It's just not a(n important) use case. Just like it turned out that nobody needs random access to "characters" in a string by index in O(1).

@isagalaev @nikitonsky Think of emoji reactions or status fields. They could be implemented in such a way that it is reasonable to check that the user only submits one emoji at a time, or at least that only one is displayed at a time. Or consider CSS use cases like developer.mozilla.org/en-US/do

@isagalaev @nikitonsky All I’m saying is that the length of a string as a reader would understand it (not only as a hard drive would understand it) is a useful concept that should be exposed by at least some string libraries. I strongly agree with you that it’s not worth optimizing for at the cost of, well, almost any other operation.
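
As a rough illustration of "length as a reader would understand it", here is a Python sketch that approximates counting grapheme clusters using only the standard library. It is deliberately simplified and is not the full Unicode segmentation algorithm (UAX #29): it glues combining marks, variation selectors, skin-tone modifiers, and anything following a zero width joiner onto the previous character, and it does not handle flags (regional indicator pairs), Hangul jamo, or CRLF. The function names are made up for illustration; a real check would more likely lean on a library that implements UAX #29, such as ICU or the `\X` pattern in the third-party `regex` module.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER, used to glue emoji sequences together


def _extends_previous(ch: str) -> bool:
    """True if ch should attach to the preceding character (simplified rule)."""
    return (
        ch == ZWJ
        # Combining marks; this also covers variation selectors such as U+FE0F.
        or unicodedata.category(ch) in ("Mn", "Mc", "Me")
        # Emoji skin-tone modifiers U+1F3FB..U+1F3FF.
        or 0x1F3FB <= ord(ch) <= 0x1F3FF
    )


def approx_grapheme_count(text: str) -> int:
    """Approximate count of user-perceived characters (not full UAX #29)."""
    count = 0
    prev = ""
    for ch in text:
        joins = count > 0 and (_extends_previous(ch) or prev == ZWJ)
        if not joins:
            count += 1
        prev = ch
    return count


def looks_like_one_character(text: str) -> bool:
    """Hypothetical check for a reaction or status field: exactly one cluster."""
    return approx_grapheme_count(text) == 1


facepalm = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"
print(approx_grapheme_count(facepalm))         # 1
print(looks_like_one_character(facepalm * 2))  # False
```

The loop is linear in the length of the string, which fits the point above: this kind of length is worth exposing, not worth optimizing for at the cost of other operations.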

Horsin' Around

This is a hometown instance run by Sam and Ingrid, for some friends.