Custom Lexer and Ascii chars

Mar 24, 2015 at 12:40 PM
Edited Mar 24, 2015 at 12:51 PM
I'm having some problems with special characters like "▒" or "ñ". If a non-ASCII character appears in a document, the styles on that line and the following ones lose their correct formatting. I'm using the WPF example as the base of my custom lexer. I hope someone can guide me to fix the problem.

Here is a screenshot of the problem:
Mar 24, 2015 at 8:24 PM
I would surmise that the non-ASCII character requires more than a single byte to represent. For example, the character 'n' is a single byte in UTF-8, 0x6E, but 'ñ' takes two bytes, 0xC3 and 0xB1. Thus, adding one of those characters might very well be throwing off your styling calculations.
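
To make the byte counts above concrete, here is a small Python check (Python is used only for illustration; `utf8_hex` is a hypothetical helper name, not part of ScintillaNET):

```python
def utf8_hex(ch):
    """Return the UTF-8 bytes of a string as a list of hex strings."""
    return [f"0x{b:02X}" for b in ch.encode("utf-8")]

# "\u00f1" is the precomposed character ñ (U+00F1).
print(utf8_hex("n"))       # ['0x6E']
print(utf8_hex("\u00f1"))  # ['0xC3', '0xB1']
```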

This is a fundamental flaw (IMO) of the current ScintillaNET implementation. It doesn't account for Unicode characters. That's why I've started work on a rewrite which you can get an early preview of here:

To get what you want working in the current version of ScintillaNET you'll have to do the arithmetic yourself to make sure you're counting the correct number of "bytes", not "characters" when you're styling.
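
As a rough sketch of that arithmetic, assuming the document is encoded as UTF-8: a token found at a character offset can be converted to a byte offset and byte length before styling. The example is in Python for illustration only, and `byte_span` is a hypothetical helper, not a ScintillaNET call. Note that Python indexes strings by code point while .NET strings index UTF-16 code units, so characters outside the Basic Multilingual Plane would need extra care in C#.

```python
def byte_span(text, char_start, char_length):
    """Convert a (character start, character length) span into
    a (byte start, byte length) span, assuming UTF-8 encoding."""
    byte_start = len(text[:char_start].encode("utf-8"))
    token = text[char_start:char_start + char_length]
    byte_length = len(token.encode("utf-8"))
    return byte_start, byte_length

line = "a\u00f1o = 1"            # "año = 1"
print(byte_span(line, 4, 1))     # the "=" token: (5, 1), not (4, 1)
print(byte_span(line, 0, 3))     # the "año" token: (0, 4), not (0, 3)
```

Styling with the character-based numbers would start one byte early after the 'ñ', which matches the kind of drift described in the question.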

Mar 24, 2015 at 8:43 PM
Jacob is correct. The custom lexer code is written under the assumption that 1 character is 1 byte, but that assumption fails with multi-byte characters.
Mar 25, 2015 at 6:44 AM
I see... Any tips on where to start adding support for multi-byte characters in the current version of ScintillaNET? Or is it a better option to use your new version, jacobslusser?
Your project description says that it is not considered ready for general use, but I'm just making an editor for my personal use and I don't want to spend too much time on its development...

Any advice?
Mar 25, 2015 at 11:30 PM

I'm at a point with my rewrite where I'm ready to start getting user feedback and would very much appreciate you being willing to give it a shot. For my part I would be willing to prioritize any features, bug fixes, and assistance you need in exchange for you being my guinea pig. But I also wouldn't be offended if you say 'no'. If you're interested we can take this discussion offline and you can send me a PM.

If you plan to stick with the current version of ScintillaNET, my recommendation would be to use the Scintilla.NativeInterface.PositionAfter (or Scintilla.NativeInterface.PositionBefore) methods. What they allow you to do is specify a BYTE position as input and then return the BYTE position where the next (or previous) CHARACTER starts, taking into account the current encoding/codepage. Thus for every CHARACTER you want to consume in your lexer, you can determine the number of BYTES to consume.
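
To illustrate the idea, here is a minimal Python stand-in for that kind of lookup. It assumes a well-formed UTF-8 buffer; `position_after` and `char_byte_lengths` are hypothetical names that only mimic the concept, not the actual ScintillaNET methods or their edge-case behavior.

```python
def position_after(buf, pos):
    """Given a byte position at the start of a character, return the byte
    position where the next character starts (assumes valid UTF-8)."""
    if pos >= len(buf):
        return len(buf)
    pos += 1
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx; skip them.
    while pos < len(buf) and (buf[pos] & 0xC0) == 0x80:
        pos += 1
    return pos

def char_byte_lengths(buf):
    """Walk the buffer one character at a time and record how many
    bytes each character occupies."""
    lengths = []
    pos = 0
    while pos < len(buf):
        nxt = position_after(buf, pos)
        lengths.append(nxt - pos)
        pos = nxt
    return lengths

data = "\u00f1\u2592n".encode("utf-8")   # "ñ▒n"
print(char_byte_lengths(data))           # [2, 3, 1]
```

In a lexer loop, the difference between the returned position and the input position is the number of bytes to style for that one character.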


P.S. Given the topic of discussion I find your username ironic. :)