Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode symbols >= 𐀀 are broken #2646

Open
Bodigrim opened this issue Jan 28, 2022 · 11 comments
Open

Unicode symbols >= 𐀀 are broken #2646

Bodigrim opened this issue Jan 28, 2022 · 11 comments
Labels
component: ghcide component: lsp type: bug Something isn't right: doesn't work as intended, documentation is missing/outdated, etc..

Comments

@Bodigrim
Copy link
Contributor

Your environment

Which OS do you use: MacOS
Which LSP client (editor/plugin) do you use: Sublime Code

Steps to reproduce

aaa :: Word 
aaa = 10

𐀀𐀀𐀀 :: Word 
𐀀𐀀𐀀 = 10 

Expected behaviour

Both aaa and 𐀀𐀀𐀀 should be underlined as unused bindings.

Actual behaviour

aaa is underlined correctly, but only the first character of 𐀀𐀀𐀀 is underlined.

изображение

That's because HLS does not distinguish positions returned from GHC (which are in code points = characters) from positions mandated by LSP (which are UTF-16 code units). Basically, GHC says us that 3 first code points in line 5 are an unused binding. Each 𐀀 is a single character, but 2 UTF-16 code units, so HLS should ask LSP to underline first 6 code units. Instead of this HLS asks LSP to underline only 3, and since the 3rd one is in the middle of the 2nd character, only the 1st character gets underlined.

CC @michaelpj @alanz, this is related to haskell/lsp#392 (comment).

@Bodigrim Bodigrim added type: bug Something isn't right: doesn't work as intended, documentation is missing/outdated, etc.. status: needs triage labels Jan 28, 2022
@Bodigrim
Copy link
Contributor Author

Code actions are also broken. E. g., choosing "Delete '𐀀𐀀𐀀'" results in the following program

aaa :: Word 
aaa = 10 10

That's again because HLS meant to ask LSP to delete 8 characters, but asked to delete 8 code points, 6 of which were 𐀀𐀀𐀀.

@michaelpj
Copy link
Collaborator

Thanks for the reproduction. To be clear, did you mean that this is related to the issue with splitting in the middle of code units, or is it this one? haskell/lsp#392 (comment)

That is, do you think this will be fixed by the changes you made to lsp, or is there still something broken?

@jneira
Copy link
Member

jneira commented Jan 28, 2022

In any case a regression test (or several of them) here exercising unicode symbols would be great

@Bodigrim
Copy link
Contributor Author

@michaelpj This is unlikely to be fixed in recent lsp patches. Essentially there must be a place, responsible for (fast) conversion between positions in code points and positions in code units. I think I can provide such low-level routine from text-rope and then someone will have to migrate ghcide to it.

@jneira I’m sorry, I’m unlikely to have capacity to add a regression test. Could someone else possibly please pick it up?

@michaelpj
Copy link
Collaborator

I'll look into it (inc tests), I just thought you might already have a hunch as to where the problem lies.

@michaelpj
Copy link
Collaborator

Looking at this more, I think this is utterly broken throughout the codebase. We frequently change from GHC SrcSpans to Positions, blindly assuming that the numbers can just be translated between them, which is very wrong. And we use the numbers from Position with functions from Text, which is also wrong.

If I understand correctly, you can't even convert from a postion-in-code-points to a position-in-code-units without having the whole text in question to hand, which seems quite annoying. In that case it probably would be useful to have such a function in text-rope and we'll just have to painfully audit HLS for direct use of SrcSpans.

(Maybe text-rope should allow using Positions defined in terms of code-points also, might be more useful for clients other than LSP...)

@michaelpj
Copy link
Collaborator

Not really sure what the best way to tackle this systematically is. We could have a CodeUnit newtype in lsp that we use for the column offset, that might force us to look at all the use sites. I guess I can't argue that the Text methods should take a CodePoint newtype instead of Int :)

@michaelpj
Copy link
Collaborator

Vague plan:

  • Since we need the text, let's ask the VFS to do the conversion for us
  • Add CodePointPosition to the VFS, which is a mirror of Position with code points for the column offset. Perhaps also CodePointRange, CodePointLocation
  • Add a function of type VFS -> URI -> CodePointPosition -> Position, etc.

@Bodigrim
Copy link
Contributor Author

(Maybe text-rope should allow using Positions defined in terms of code-points also, might be more useful for clients other than LSP...)

Already there: it provides both character-based API (Data.Text.Lines / Data.Text.Rope) and UTF16-based API (Data.Text.Utf16.Lines / Data.Text.Utf16.Rope). I just need to provide conversions.

@Bodigrim
Copy link
Contributor Author

I realised that text-rope already provides a way to convert between positions: one can Data.Text.Lines.splitAtPosition with character-based counter and then call Data.Text.Utf16.Lines.lengthAsPosition on the prefix to receive UTF-16-based counter. I might provide a dedicated helper in the future, but this would work for a proof of concept.

Looking on ghcide, I think it can really benefit from using Data.Text.Lines from text-rope instead of a plain Text, because this will improve ubiquitous line splitting.

So I'd suggest to wait for an lsp release with text-rope-based VFS, then take another look at this issue.

@michaelpj
Copy link
Collaborator

Yes, definitely not planning to do anything before then. Thanks for letting me know how to do the conversion!

michaelpj added a commit to michaelpj/lsp that referenced this issue Feb 19, 2022
LSP `Position`s use UTF-16 code units for offsets within lines; most
other sane tools (like GHC) use Unicode code points. We need to use the
right one in the right place, otherwise we get issues like
haskell/haskell-language-server#2646.

This is pretty unpleasant, since code points are variable-size, so you
can't do the conversion without having the file text itself.

This PR provides a type for positions using code points (for clients to
use to help them be less confused) and functions for using the VFS to
convert between those and LSP positions.
michaelpj added a commit to michaelpj/lsp that referenced this issue Feb 19, 2022
LSP `Position`s use UTF-16 code units for offsets within lines; most
other sane tools (like GHC) use Unicode code points. We need to use the
right one in the right place, otherwise we get issues like
haskell/haskell-language-server#2646.

This is pretty unpleasant, since code points are variable-size, so you
can't do the conversion without having the file text itself.

This PR provides a type for positions using code points (for clients to
use to help them be less confused) and functions for using the VFS to
convert between those and LSP positions.
michaelpj added a commit to michaelpj/lsp that referenced this issue Feb 20, 2022
LSP `Position`s use UTF-16 code units for offsets within lines; most
other sane tools (like GHC) use Unicode code points. We need to use the
right one in the right place, otherwise we get issues like
haskell/haskell-language-server#2646.

This is pretty unpleasant, since code points are variable-size, so you
can't do the conversion without having the file text itself.

This PR provides a type for positions using code points (for clients to
use to help them be less confused) and functions for using the VFS to
convert between those and LSP positions.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component: ghcide component: lsp type: bug Something isn't right: doesn't work as intended, documentation is missing/outdated, etc..
Projects
None yet
Development

No branches or pull requests

5 participants