Dear software people,
Unicode is older now than ASCII was when Unicode was introduced. It’s not a weird new fad.
It’s complicated, but so is the domain it represents. We accept that we have to think about time zones, leap days, and leap seconds, for instance. And it’s a cleaner abstraction when you aren’t halfhearted about it.
Sincerely,
Charlie
@vruba but isn't it a solved problem by now? Nobody writes new software that's not Unicode-aware (and it would be hard to, because all the systems and languages do it by default now). Converting old software is another matter, of course, but that's not specific to Unicode.
@isagalaev @vruba not by default, no. What’s `"🤦🏼‍♂️".length` in your language of choice?
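For concreteness, here is what the common answers look like in Python. A minimal sketch, assuming the third-party `regex` module (`pip install regex`) for the grapheme-cluster count, since the standard library has no equivalent:

```python
import regex

# The facepalm emoji is five code points: face, skin tone,
# zero-width joiner, male sign, variation selector.
s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"  # 🤦🏼‍♂️

print(len(s))                           # 5  — code points (Python's len)
print(len(s.encode("utf-8")))           # 17 — UTF-8 bytes (storage size)
print(len(s.encode("utf-16-le")) // 2)  # 7  — UTF-16 code units (JavaScript's .length)
print(len(regex.findall(r"\X", s)))     # 1  — grapheme clusters (what a reader sees)
```

Four answers, all defensible, depending on which layer of the system is asking.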
@nikitonsky @vruba okay, okay, admittedly I meant it in a very narrow sense of "everyone thankfully uses UTF-8 everywhere by now", so text data is interoperable. I didn't mean all the interesting cases are solved.
Like the length of that emoji, where the correct answer is "emojis don't have a well-defined meaning of length", so nobody should assume anything in this case. But as it turns out, mostly people care about the count of UTF-8-encoded bytes, for storage or memory allocation.
@isagalaev @nikitonsky At a systems programming level, people certainly care more about storage size. But at a UI level, they might want to ensure that only one emoji can be used in a certain context. So perhaps it’s better to say that there are multiple useful senses of the idea of “length” that might matter in different areas. But I think we basically agree about the important parts of this issue.
@isagalaev @nikitonsky Think of emoji reactions or status fields. They could be implemented in such a way that it is reasonable to check that the user only submits one emoji at a time, or at least that only one is displayed at a time. Or consider CSS use cases like https://developer.mozilla.org/en-US/docs/Web/CSS/::first-letter
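A sketch of that kind of check, under the same assumption of the `regex` module; the function name is hypothetical:

```python
import regex

def is_single_grapheme(text: str) -> bool:
    # One user-perceived character == exactly one extended grapheme cluster.
    return len(regex.findall(r"\X", text)) == 1

is_single_grapheme("🤦🏼‍♂️")  # True: five code points, one cluster
is_single_grapheme("🙂🙂")     # False: two clusters
```

Note this only validates "one cluster", not "is an emoji"; restricting input to emoji proper would additionally need the Unicode emoji property data.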
@nikitonsky @vruba as for truncation with "…", UIs seem to have universally converged on visual hiding with a transparency gradient, because the visual width of a string only makes sense after it's rendered with a particular font on a particular device. Doing `str[:max] + '...'` was only good enough in the beginning.
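That said, if you do have to truncate in code rather than at the display layer, cutting on grapheme boundaries at least avoids tearing a ZWJ sequence apart or stranding a combining mark. A sketch, again assuming the `regex` module:

```python
import regex

def truncate(text: str, max_graphemes: int) -> str:
    # Cut on grapheme-cluster boundaries, not code points, so a
    # multi-code-point emoji or accented letter is kept whole or dropped whole.
    clusters = regex.findall(r"\X", text)
    if len(clusters) <= max_graphemes:
        return text
    return "".join(clusters[:max_graphemes]) + "…"
```

Even this only counts clusters, not rendered width, which is exactly why the gradient approach wins in UIs.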
@isagalaev @vruba remember when every Cyrillic letter counted as two toward the character limit on Twitter? Those were not fun times
@isagalaev @nikitonsky All I’m saying is that the length of a string as a reader would understand it (not only as a hard drive would understand it) is a useful concept that should be exposed by at least some string libraries. I strongly agree with you that it’s not worth optimizing for at the cost of, well, almost any other operation.