Unicode symbols >= 𐀀 are broken #2646

Bodigrim · 2022-01-28T00:20:19Z

Your environment

Which OS do you use: MacOS
Which LSP client (editor/plugin) do you use: Sublime Code

Steps to reproduce

aaa :: Word 
aaa = 10

𐀀𐀀𐀀 :: Word 
𐀀𐀀𐀀 = 10

Expected behaviour

Both aaa and 𐀀𐀀𐀀 should be underlined as unused bindings.

Actual behaviour

aaa is underlined correctly, but only the first character of 𐀀𐀀𐀀 is underlined.

That's because HLS does not distinguish positions returned from GHC (which are in code points = characters) from positions mandated by LSP (which are UTF-16 code units). Basically, GHC says us that 3 first code points in line 5 are an unused binding. Each 𐀀 is a single character, but 2 UTF-16 code units, so HLS should ask LSP to underline first 6 code units. Instead of this HLS asks LSP to underline only 3, and since the 3rd one is in the middle of the 2nd character, only the 1st character gets underlined.

CC @michaelpj @alanz, this is related to haskell/lsp#392 (comment).

The text was updated successfully, but these errors were encountered:

Bodigrim · 2022-01-28T00:57:33Z

Code actions are also broken. E. g., choosing "Delete '𐀀𐀀𐀀'" results in the following program

aaa :: Word 
aaa = 10 10

That's again because HLS meant to ask LSP to delete 8 characters, but asked to delete 8 code points, 6 of which were 𐀀𐀀𐀀.

michaelpj · 2022-01-28T12:45:05Z

Thanks for the reproduction. To be clear, did you mean that this is related to the issue with splitting in the middle of code units, or is it this one? haskell/lsp#392 (comment)

That is, do you think this will be fixed by the changes you made to lsp, or is there still something broken?

jneira · 2022-01-28T13:18:53Z

In any case a regression test (or several of them) here exercising unicode symbols would be great

Bodigrim · 2022-01-29T11:52:02Z

@michaelpj This is unlikely to be fixed in recent lsp patches. Essentially there must be a place, responsible for (fast) conversion between positions in code points and positions in code units. I think I can provide such low-level routine from text-rope and then someone will have to migrate ghcide to it.

@jneira I’m sorry, I’m unlikely to have capacity to add a regression test. Could someone else possibly please pick it up?

michaelpj · 2022-01-29T11:57:36Z

I'll look into it (inc tests), I just thought you might already have a hunch as to where the problem lies.

michaelpj · 2022-01-29T21:33:55Z

Looking at this more, I think this is utterly broken throughout the codebase. We frequently change from GHC SrcSpans to Positions, blindly assuming that the numbers can just be translated between them, which is very wrong. And we use the numbers from Position with functions from Text, which is also wrong.

If I understand correctly, you can't even convert from a postion-in-code-points to a position-in-code-units without having the whole text in question to hand, which seems quite annoying. In that case it probably would be useful to have such a function in text-rope and we'll just have to painfully audit HLS for direct use of SrcSpans.

(Maybe text-rope should allow using Positions defined in terms of code-points also, might be more useful for clients other than LSP...)

michaelpj · 2022-01-29T21:38:31Z

Not really sure what the best way to tackle this systematically is. We could have a CodeUnit newtype in lsp that we use for the column offset, that might force us to look at all the use sites. I guess I can't argue that the Text methods should take a CodePoint newtype instead of Int :)

michaelpj · 2022-01-30T19:13:56Z

Vague plan:

Since we need the text, let's ask the VFS to do the conversion for us
Add CodePointPosition to the VFS, which is a mirror of Position with code points for the column offset. Perhaps also CodePointRange, CodePointLocation
Add a function of type VFS -> URI -> CodePointPosition -> Position, etc.

Bodigrim · 2022-01-30T21:24:30Z

(Maybe text-rope should allow using Positions defined in terms of code-points also, might be more useful for clients other than LSP...)

Already there: it provides both character-based API (Data.Text.Lines / Data.Text.Rope) and UTF16-based API (Data.Text.Utf16.Lines / Data.Text.Utf16.Rope). I just need to provide conversions.

Bodigrim · 2022-01-31T21:03:11Z

I realised that text-rope already provides a way to convert between positions: one can Data.Text.Lines.splitAtPosition with character-based counter and then call Data.Text.Utf16.Lines.lengthAsPosition on the prefix to receive UTF-16-based counter. I might provide a dedicated helper in the future, but this would work for a proof of concept.

Looking on ghcide, I think it can really benefit from using Data.Text.Lines from text-rope instead of a plain Text, because this will improve ubiquitous line splitting.

So I'd suggest to wait for an lsp release with text-rope-based VFS, then take another look at this issue.

michaelpj · 2022-01-31T21:35:11Z

Yes, definitely not planning to do anything before then. Thanks for letting me know how to do the conversion!

LSP `Position`s use UTF-16 code units for offsets within lines; most other sane tools (like GHC) use Unicode code points. We need to use the right one in the right place, otherwise we get issues like haskell/haskell-language-server#2646. This is pretty unpleasant, since code points are variable-size, so you can't do the conversion without having the file text itself. This PR provides a type for positions using code points (for clients to use to help them be less confused) and functions for using the VFS to convert between those and LSP positions.

Bodigrim added type: bug Something isn't right: doesn't work as intended, documentation is missing/outdated, etc.. status: needs triage labels Jan 28, 2022

Ailrun added component: lsp and removed status: needs triage labels Jan 28, 2022

jneira added the status: regression test needed label Jan 28, 2022

Ailrun added the component: ghcide label Jan 30, 2022

michaelpj mentioned this issue Feb 19, 2022

Add functions to convert between code-point and code-unit positions haskell/lsp#407

Merged

hasufell removed the old_status: regression test needed label Jul 13, 2022

michaelpj mentioned this issue Jan 17, 2024

Tab character incorrectly assumed to be composed of 8 characters #1987

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unicode symbols >= 𐀀 are broken #2646

Unicode symbols >= 𐀀 are broken #2646

Bodigrim commented Jan 28, 2022

Bodigrim commented Jan 28, 2022

michaelpj commented Jan 28, 2022

jneira commented Jan 28, 2022

Bodigrim commented Jan 29, 2022

michaelpj commented Jan 29, 2022

michaelpj commented Jan 29, 2022

michaelpj commented Jan 29, 2022

michaelpj commented Jan 30, 2022

Bodigrim commented Jan 30, 2022

Bodigrim commented Jan 31, 2022

michaelpj commented Jan 31, 2022

Unicode symbols >= 𐀀 are broken #2646

Unicode symbols >= 𐀀 are broken #2646

Comments

Bodigrim commented Jan 28, 2022

Your environment

Steps to reproduce

Expected behaviour

Actual behaviour

Bodigrim commented Jan 28, 2022

michaelpj commented Jan 28, 2022

jneira commented Jan 28, 2022

Bodigrim commented Jan 29, 2022

michaelpj commented Jan 29, 2022

michaelpj commented Jan 29, 2022

michaelpj commented Jan 29, 2022

michaelpj commented Jan 30, 2022

Bodigrim commented Jan 30, 2022

Bodigrim commented Jan 31, 2022

michaelpj commented Jan 31, 2022