Curious behaviour in Char

David_Legard · November 11, 2018, 1:30pm

I am using some characters outside the main ASCII set, such as Ñ and ñ, and need to differentiate between the lower and upper-case variants.

The documentation says that Char.isUpper only works on ASCII characters, and sure enough, elm repl gives me:

Char.isUpper 'Ñ'
False : Bool

However, if I upcase ñ, I get:

Char.toUpper 'ñ'
'Ñ' : Char.Char

That seems inconsistent. It is equivalent to saying:

Char.isUpper (Char.toUpper somechar) == False

Surely, both should work or neither.

Any thoughts?

glennsl · November 11, 2018, 2:07pm

It isn’t though, since Char.isUpper is defined to only return true for uppercase ASCII characters. You could instead argue that it isn’t accurately named, which I would agree with.

Unicode case mapping is a non-trivial problem because there isn’t a one-to-one mapping between lowercase and uppercase characters. You can see some of the issues with it here. And to add to that, a unicode “character” might not even be what you expect it to be.

But even if someone does implement proper unicode support for Char.isUpper, wouldn’t it be inconsistent that most other functions don’t fully or properly handle unicode? If consistency is the goal, you might be looking at a very big task.

I do hope there will be proper unicode support sometime in the future, since it’s really nice to not have to worry about unicode issues and having a good static type system might help a lot in designing a good API for it. But for the above-mentioned issues I don’t expect to see it in the near future.

jwoLondon · November 11, 2018, 2:14pm

Is there any reason not to use a homegrown version?

isUpper : Char -> Bool
isUpper c =
    c == Char.toUpper c

Presumably there must be a reason why this isn’t how Char.isUpper works, but it would seem to do the job, at least for a wider range of characters than ASCII.

glennsl · November 11, 2018, 2:32pm

Since, as I said, there isn’t a one-to-one mapping, this wouldn’t work for all unicode characters. It would, as you say, work for a wider range of characters, but that range wouldn’t be well-defined. And I for one would rather know for certain when it doesn’t work than have it suddenly not work as expected. I can always just implement this hack myself, and by doing so would hopefully have a better understanding of (or care less about) when it works.
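To make the "not well-defined" point concrete, here is a minimal sketch of where the comparison trick gives a surprising answer, plus one possible refinement. The names isUpperNaive and isUpperCased are made up for illustration, and the refinement is only a heuristic built on whatever Char.toUpper and Char.toLower return; it is not how elm/core defines anything.

```elm
module CaseCheck exposing (isUpperCased, isUpperNaive)

-- The homegrown check from above: True whenever upper-casing changes nothing.
-- That includes characters with no case at all, e.g. digits and punctuation,
-- so isUpperNaive '1' is True.


isUpperNaive : Char -> Bool
isUpperNaive c =
    c == Char.toUpper c



-- Hypothetical refinement: additionally require that lower-casing DOES change
-- the character, so caseless characters such as '1' or '!' are rejected.
-- Still only as reliable as the underlying case mappings.


isUpperCased : Char -> Bool
isUpperCased c =
    c == Char.toUpper c && c /= Char.toLower c
```

With this, isUpperCased 'Ñ' should be True and isUpperCased '1' should be False, but characters whose case mapping is not one-to-one can still behave unexpectedly.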

David_Legard · November 12, 2018, 1:59am

I reckon the name is profoundly misleading. The idea that Char.toUpper does not produce an uppercase character as defined by Char.isUpper is going to catch out other people, for sure.

It turns out to be simple to implement this for extended ASCII, together with a custom ordering (A, B, C, D, Ð, E … M, N, Ñ, O …).

Thanks for the responses.

malaire · November 12, 2018, 2:35am

There is no “extended ASCII” in Unicode. If you mean Latin Extended, which extensions do you mean? There are quite a few. Supporting Cyrillic would also be simple, so why support only Latin?

I think it only makes sense to support either only ASCII or full Unicode officially. When trying to support something in between, it’s going to be quite difficult to decide what exactly will be supported.

David_Legard · November 12, 2018, 4:27am

Latin Extended, then.

I don’t know its formal name, but it’s the one that comes up in Windows’ Character Map when you first open it.

I’m sure you’re right about the difficulties of supporting Unicode.

My point is that defining Char.toUpper so that, for many characters, it does not produce an uppercase character as defined by Char.isUpper seems misleading and is probably going to catch out other people.

EDIT: It’s called ISO/IEC 8859-1
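For anyone wanting the same thing, a rough sketch of an uppercase check limited to ISO/IEC 8859-1 could compare code points against the three uppercase ranges of that set. The function name isUpperLatin1 is made up here, and this is not necessarily the implementation described above.

```elm
module Latin1 exposing (isUpperLatin1)

-- Uppercase letters in ISO/IEC 8859-1 (the first 256 Unicode code points):
--   U+0041..U+005A  A to Z
--   U+00C0..U+00D6  À to Ö
--   U+00D8..U+00DE  Ø to Þ
-- U+00D7 (the multiplication sign) sits between the last two ranges
-- and is deliberately excluded.


isUpperLatin1 : Char -> Bool
isUpperLatin1 c =
    let
        code =
            Char.toCode c
    in
    (0x41 <= code && code <= 0x5A)
        || (0xC0 <= code && code <= 0xD6)
        || (0xD8 <= code && code <= 0xDE)
```

This only answers the question for that one character set; anything above U+00FF is reported as not uppercase.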

Qqwy · November 12, 2018, 8:01am

I think it would be nice to have a library that, for each Unicode grapheme, is able to return if it is in a given Unicode category (As well as the reverse: looking up all categories for the grapheme). These categories include things like ‘lowercase’, ‘uppercase’, ‘titlecase’, etc.

Such a library would probably best be created in a metaprogramming fashion, similar to what Elixir does. (It is unfortunate that this cannot be written in Elm itself, but writing something like it in either JS or Haskell should not be too much trouble.)

malaire · November 12, 2018, 11:08am

I think the best fix for now would be to just rename ASCII functions like Char.isUpper to Char.isUpperAscii so there is no confusion.

malaire · November 12, 2018, 11:12am

I just wonder how large such a library would be, containing data for all 137439 characters (as of Unicode 11.0). But yes, a proper library is a much better solution than trying to fix the current functions one by one. And that also leaves the faster ASCII versions available for those who need them.

Qqwy · November 12, 2018, 12:29pm

If you only check the basic Unicode categories, you do not need a function head per character, since many characters are grouped by category (so you can check if the input codepoint is in a given range). This kind of ‘size reduction’ is already apparent in the Unicode PropList.txt file.
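As a rough illustration of the range idea, a lookup could be a short table of code point ranges plus a membership test. The table below is a tiny hand-picked sample (basic Latin, Latin-1, Greek and Cyrillic capitals), not generated from the Unicode data files, and the names Range, uppercaseRanges and isUpperInRanges are invented for this sketch.

```elm
module UpperRanges exposing (isUpperInRanges)

-- An inclusive range of code points.


type alias Range =
    { first : Int, last : Int }



-- Hand-written sample only; a real library would generate this table
-- from the Unicode data files and it would be much longer.


uppercaseRanges : List Range
uppercaseRanges =
    [ { first = 0x41, last = 0x5A } -- A to Z
    , { first = 0xC0, last = 0xD6 } -- À to Ö
    , { first = 0xD8, last = 0xDE } -- Ø to Þ
    , { first = 0x0391, last = 0x03A1 } -- Greek Α to Ρ
    , { first = 0x03A3, last = 0x03A9 } -- Greek Σ to Ω
    , { first = 0x0410, last = 0x042F } -- Cyrillic А to Я
    ]


inRange : Int -> Range -> Bool
inRange code range =
    range.first <= code && code <= range.last



-- True when the character's code point falls inside any listed range.


isUpperInRanges : Char -> Bool
isUpperInRanges c =
    List.any (inRange (Char.toCode c)) uppercaseRanges
```

A linear scan is fine for a handful of ranges; a generated table covering all of Unicode would probably want a sorted array and binary search instead.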

malaire · November 12, 2018, 12:57pm

Unicode is hard… I was reading a bit about Unicode case mappings (section 5.18 here if interested), and I noticed a bug in Char.toUpper where Char.toUpper('ß') returns two characters as a single Char: https://github.com/elm/core/issues/1001

malaire · November 12, 2018, 4:10pm

I just realized that the problematic character ß I mentioned above is in ISO/IEC 8859-1, so even supporting that set of characters is not trivial.

Haskell has solved this problem so that Data.Char.toUpper of type Char -> Char just returns ß unchanged, and another function, Data.Text.toUpper of type Text -> Text, does the proper conversion to SS.

But this does mean that in Haskell

Data.Char.isUpper(Data.Char.toUpper('ß')) == False

even though Data.Char.isUpper covers all Unicode characters and not just ASCII.
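The same split could be imitated in Elm today. The sketch below assumes String.toUpper performs the full (possibly multi-character) mapping, which is what I would expect since it delegates to JavaScript, but I have not checked every case. The names toUpperString and safeToUpper are made up for this example.

```elm
module SafeUpper exposing (safeToUpper, toUpperString)

-- String-to-String version, in the spirit of Data.Text.toUpper:
-- the result is allowed to be longer than one character.


toUpperString : Char -> String
toUpperString c =
    String.toUpper (String.fromChar c)



-- Char-to-Char version, in the spirit of Data.Char.toUpper:
-- if the uppercase form is not exactly one Char, return the input unchanged.


safeToUpper : Char -> Char
safeToUpper c =
    case String.toList (toUpperString c) of
        [ single ] ->
            single

        _ ->
            c
```

Under that assumption, toUpperString 'ß' would give "SS" and safeToUpper 'ß' would give back 'ß', so safeToUpper always returns a well-formed single Char.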

