Curious behaviour in Char

I am using some characters outside the main ASCII set, such as Ñ and ñ, and need to differentiate between the lower and upper-case variants.

The documentation says that Char.isUpper only works on ASCII characters, so sure enough, elm-repl gives me:

Char.isUpper 'Ñ'
False : Bool

However, if I upcase ñ, I get

Char.toUpper 'ñ'
'Ñ' : Char.Char

That seems inconsistent. It is equivalent to saying:

Char.isUpper (Char.toUpper somechar) == False

Surely, both should work or neither.

Any thoughts?

It isn’t though, since Char.isUpper is defined to only return true for uppercase ASCII characters. You could instead argue that it isn’t accurately named, which I would agree with.

Unicode case mapping is a non-trivial problem because there isn’t a one-to-one mapping between lowercase and uppercase characters. You can see some of the issues with it here. And to add to that, a Unicode “character” might not even be what you expect it to be.

But even if someone does implement proper Unicode support for Char.isUpper, wouldn’t it be inconsistent that most other functions don’t fully or properly handle Unicode? If consistency is the goal, you might be looking at a very big task.

I do hope there will be proper Unicode support sometime in the future, since it’s really nice not to have to worry about Unicode issues, and a good static type system might help a lot in designing a good API for it. But because of the issues mentioned above, I don’t expect to see it in the near future.


Is there any reason not to use a homegrown version?

isUpper : Char -> Bool
isUpper c =
    c == Char.toUpper c

Presumably there must be a reason why this isn’t how Char.isUpper works, but it would seem to do the job, at least for a wider range of characters than ASCII.

Since, as I said, there isn’t a one-to-one mapping, this wouldn’t work for all Unicode characters. It would, as you say, work for a wider range of characters, but that range wouldn’t be well-defined. And I for one would rather know for certain when it doesn’t work than have it suddenly not work as expected. I can always implement this hack myself, and by doing so would hopefully have a better understanding of (or care less about) when it works.
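
One concrete gap in the `c == Char.toUpper c` check is that anything Char.toUpper leaves alone, such as digits and punctuation, also counts as uppercase. A slightly tighter sketch is below. `isUpperLoose` is a made-up name, not something in the core library, and it still assumes a one-to-one case mapping, so it is not a full Unicode answer:

```elm
import Char


-- Hypothetical helper, not part of the core library.
-- A character counts as uppercase when Char.toUpper leaves it alone
-- AND Char.toLower changes it, which rules out digits and punctuation.
isUpperLoose : Char -> Bool
isUpperLoose c =
    Char.toUpper c == c && Char.toLower c /= c
```

With this, 'Ñ' should pass while 'ñ' and '1' should not. Characters without a single-character uppercase form are still outside what this check can describe.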


I reckon the name is profoundly misleading. The idea that Char.toUpper does not produce an uppercase character as defined by Char.isUpper is going to catch out other people, for sure.

It turns out to be simple to implement this for extended ASCII, together with a custom ordering (A, B, C, D, Ð, E … M, N, Ñ, O …).

Thanks for the responses.


There is no “extended ASCII” in Unicode. If you mean Latin Extended, then which extensions do you mean? There are quite a few. Supporting Cyrillic would also be simple, so why support only Latin?

I think it only makes sense to support either only ASCII or full Unicode officially. When trying to support something in between, it’s going to be quite difficult to decide what exactly will be supported.

Latin Extended, then.

I don’t know its formal name, but it’s the one that comes up in Windows’ Character Map when you first open it.

I’m sure you’re right about the difficulties of supporting Unicode.

My point is that defining Char.toUpper so that, for many characters, it does not produce an uppercase character as defined by Char.isUpper seems misleading and is probably going to catch out other people.

EDIT: It’s called ISO/IEC 8859-1


I think it would be nice to have a library that, for each Unicode grapheme, can report whether it is in a given Unicode category (as well as the reverse: looking up all categories for the grapheme). These categories include things like ‘lowercase’, ‘uppercase’, ‘titlecase’, etc.

Such a library would probably best be created in a metaprogramming fashion, similar to what Elixir does. (It is unfortunate that this cannot be written in Elm itself :sweat_smile:, but writing something like it in either JS or Haskell should not be too much trouble.)

I think the best fix for now would be to just rename ASCII functions like Char.isUpper to Char.isUpperAscii so there is no confusion.

I just wonder how large such a library would be, containing data for all 137439 characters (as of Unicode 11.0). But yes, a proper library is a much better solution than trying to fix the current functions one by one. And that also leaves faster ASCII versions available for those who need them.

If you only check the basic Unicode categories, you do not need a function head per character, since many characters are grouped by category (so you can check if the input codepoint is in a given range). This kind of ‘size reduction’ is already apparent in the Unicode PropList.txt file.
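
The range idea can be sketched in Elm as a lookup over (first, last) code point pairs. The table below is a tiny hand-written excerpt (Basic Latin, Latin-1 and the basic Cyrillic capitals), not something generated from the Unicode data files, and `uppercaseRanges` and `isUpperInRanges` are hypothetical names:

```elm
import Char


-- Inclusive code point ranges of uppercase letters.
-- Hand-picked excerpt for illustration only; a real library would
-- generate the full table from the Unicode data files.
uppercaseRanges : List ( Int, Int )
uppercaseRanges =
    [ ( 0x41, 0x5A ) -- A to Z
    , ( 0xC0, 0xD6 ) -- À to Ö
    , ( 0xD8, 0xDE ) -- Ø to Þ (0xD7 is the multiplication sign)
    , ( 0x0410, 0x042F ) -- А to Я
    ]


-- True when the character's code point falls inside any listed range.
isUpperInRanges : Char -> Bool
isUpperInRanges c =
    let
        code =
            Char.toCode c
    in
    List.any (\( lo, hi ) -> lo <= code && code <= hi) uppercaseRanges
```

A linear scan is fine for a handful of ranges. For a full table, a sorted array with binary search would probably be a better fit.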

Unicode is hard… I was reading a bit about Unicode case mappings (section 5.18 here if interested), and I noticed a bug in Char.toUpper where Char.toUpper('ß') returns two characters as a single Char.

I just realized that the problematic character ß I mentioned above is in ISO/IEC 8859-1, so even supporting that set of characters is not trivial.

Haskell has solved this problem so that Data.Char.toUpper, of type Char -> Char, just returns ß unchanged, and another function, Data.Text.toUpper, of type Text -> Text, does the proper conversion to SS.

But this does mean that in Haskell

Data.Char.isUpper(Data.Char.toUpper('ß')) == False

even though Data.Char.isUpper covers all Unicode characters and not just ASCII.
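
The Data.Char behaviour could be mimicked in Elm by falling back to the original character whenever uppercasing changes the length of the string. This is only a sketch: `toUpperSingle` is a hypothetical name, and it assumes String.toUpper follows the full Unicode case mapping (so that ß becomes SS):

```elm
import Char
import String


-- Hypothetical helper: uppercase a character only when the result is
-- still a single character; otherwise return the input unchanged.
toUpperSingle : Char -> Char
toUpperSingle c =
    let
        original =
            String.fromChar c

        uppered =
            String.toUpper original
    in
    if String.length uppered == String.length original then
        Char.toUpper c

    else
        c
```

As in Haskell, ß would then stay ß, so an isUpper check on the result is still False for that character.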
