Automatic Dialect Switching using Dialect Tags Embedded in a Comment. #6453

trijezdci · 2023-06-16T09:19:31Z


Hi there,

I have several Modula-2 projects, one of which is a multi-dialect compiler. The sources are in different dialects, and there are also other Modula-2 projects on GitHub which use different dialects.

In some cases the differences between dialects are minimal, for example when a compiler implements one or two additional built-in functions that aren't standard. In those cases one might argue that it is tolerable not to have this highlighted correctly.

HOWEVER, there are major differences between the three main language dialects, and for those differences it is extremely annoying when the sources are incorrectly highlighted. For this reason, I had written a multi-dialect plugin for Pygments where the dialect was automatically chosen by examining a comment at the beginning of a source file that contained a dialect tag.

The dialect tagged comments were:

(*!m2pim*) (* dialect tag for PIM Modula-2 *)
(*!m2iso*) (* dialect tag for ISO Modula-2 *)
(*!m2r10*) (* dialect tag for Modula-2 Rev 2010 *)

I even added support for various compiler-specific extensions over time, like:

(*!m2pim+gm2*) (* dialect tag for PIM Modula-2 with GNU Modula-2 extensions *)

All one had to do is add such a tag comment at the top of one's source files and everything was rendered properly for the given dialect.

The very same dialect tags are also supported by VIM and Emacs. I contributed support to VIM and the maintainer of GNU Modula-2 contributed the support to Emacs.

So, this is a kind of de-facto standard that has been in place for years now.

Unfortunately though, both Bitbucket and Github moved away from using Pygments. As a result, this scheme no longer works there.

Worse still, the current support for Modula-2 in Linguist is broken. It doesn't recognise interface modules at all, and the list of reserved words and predefined identifiers that it highlights is a mish-mash that does not reflect any dialect, nor any compiler extensions, it is just thrown together with various reserved words and identifiers from Pascal. It doesn't reflect Modula-2. It's just all wrong.

I would like to help make this auto-selecting multi-dialect rendering work again on GitHub, but I find Linguist extremely confusing, in particular since it doesn't seem to have its own way to define grammars but somehow uses various grammar schemes of third-party software. It seems to me that in the time one needs to learn all the different components and understand how they work together, I could write a lexer and parser with HTML renderer from scratch. But hey, it's not my decision how GitHub wants to render code. However, I will need a bit of help to be able to contribute the functionality that once worked on GitHub and was inadvertently removed when GitHub moved off Pygments.

So, I would appreciate if somebody could give me some pointers, in particular where one would add code to search and parse the aforementioned dialect tag comments to then be able to choose the corresponding grammar, and also how to make Linguist accept multiple grammars for the same language. At the moment, I cannot see anything in the documentation that gives me any ideas how to do those two essential things.

thanks in advance
regards
benjamin

lildude (Maintainer) · 2023-06-16T10:21:34Z

So, this has very little to do with Linguist.

Let's start with a few facts:

Linguist only provides the grammars used by GitHub's syntax highlighting engine (internal project) ¹.
These grammars are Textmate-compatible ¹ grammars as used by the likes of Atom, VS Code, Textmate and Sublime 2 (Sublime 3 uses a new format which isn't supported by Linguist).
Only one grammar can be directly associated with each language within Linguist.
Inline codeblock rendering is implemented by markup and requires explicit manual language specification (e.g. ```ruby) in order for the correct grammar to be used by the syntax highlighting engine.
The syntax highlighting engine relies on what you tell markup the language is or what Linguist determines in the case of files.
Neither the syntax highlighting engine nor markup support automatic dialect detection.

With that out of the way: you're not going to be able to get automatic dialect detection working in codeblocks and/or files in the way I think you're trying to, as this would need major changes to markup and GitHub's syntax highlighting engine.

The closest I think you can get is one or both of the following:

Create or find a Textmate-compatible grammar that somehow can use regular expressions to identify the different dialects and apply the appropriate syntax highlighting for that file or section. A single grammar can source other grammar files for certain sections and is commonly used by other grammars to keep the size and complication down or just to separate things out for maintainability. I can imagine this could be used for dialects too.

OR

Add each dialect, with its own Textmate-compatible grammar, as a new language in Linguist and group them all under the same name (using the group: key in the languages.yml file). This will require unique extensions or heuristics to differentiate the dialects and comes with the caveat that you won't be able to mix-and-match dialects in the same file or codeblock, and codeblocks will still require explicit language specification in order for the correct grammar to be used.

So with that in mind, I can comment on some of your comments:

> Worse still, the current support for Modula-2 in Linguist is broken. It doesn't recognise interface modules at all, and the list of reserved words and predefined identifiers that it highlights is a mish-mash that does not reflect any dialect, nor any compiler extensions, it is just thrown together with various reserved words and identifiers from Pascal. It doesn't reflect Modula-2. It's just all wrong.

You already know this, but this has nothing to do with Linguist and is entirely down to the third-party grammar.

> I would like to help making this auto-selecting multi-dialect rendering work again on Github,

You'll only be able to get close to this using the methods I detailed above.

> … but I find Linguist extremely confusing, in particular since it doesn't seem to have its own way to define grammars but somehow uses various grammar schemes of third party software.

This has nothing to do with Linguist. As previously stated, Linguist supplies the grammars needed by the highlighting engine and these need to be in Textmate-compatible format. How to write and maintain Textmate-compatible grammars is outside the scope of Linguist, though @Alhadis is quite the expert so may be able to offer tips and help. Textmate has its own documentation (though it was a bit poor the last time I looked), as does VS Code.

> So, I would appreciate if somebody could give me some pointers, in particular where one would add code to search and parse the aforementioned dialect tag comments to then be able to choose the corresponding grammar,

The only place you can do this is within the grammar, assuming it's possible to support dialects.

> … and also how to make Linguist accept multiple grammars for the same language.

You can't. Linguist has a one-to-one mapping between languages and grammars (the tm_scope: key in the languages.yml file).

¹ Mostly. Some of the more popular languages use Tree-sitter grammars directly in the highlighting engine (denoted by the 🐌 icon in the README.md) and there is currently no method for those outside of GitHub to contribute grammars or request support directly. You'd need to go through GitHub support.

4 replies

trijezdci Jun 16, 2023
Author

When I say multiple grammars for the same language, I mean a mechanism by which the grammar is chosen not by static association but through running the code that disambiguates the file extension (for example).

Perhaps it helps if I explain how this works in vi/VIM as illustration:

Vim has a feature where you can add a script that is intended to disambiguate file extensions. So for any .def file (and any .mod file) this script is run. I added code to that script to look for a Modula-2 comment that contains a dialect tag such as (*!m2pim*) within the first 200 lines of code so as to permit a comment with preamble and license to come before the dialect tag comment.

When the script finds such a comment, it determines the dialect from the tag inside the comment.

For each dialect there is a separate syntax description file, and the script then causes Vim to use that syntax description file which is associated with the dialect indicated in the comment with the dialect tag.
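The scan described above can be illustrated with a minimal Python sketch. It is not the Vim script itself: the syntax-file names are hypothetical placeholders, and the pattern assumed for the optional extension part (lowercase letters and digits after a `+`, as in `+gm2`) is a guess based on the single example given in this thread.

```python
import re
from typing import Optional, Tuple

# Matches a tag comment such as (*!m2pim*) or (*!m2pim+gm2*).
# Assumption: an extension suffix is "+" followed by lowercase letters/digits.
TAG_RE = re.compile(r"\(\*!(m2pim|m2iso|m2r10)(?:\+([a-z0-9]+))?\*\)")

# Hypothetical syntax-description file names, one per dialect.
SYNTAX_FILES = {
    "m2pim": "modula2-pim.syntax",
    "m2iso": "modula2-iso.syntax",
    "m2r10": "modula2-r10.syntax",
}


def find_dialect_tag(source: str, max_lines: int = 200) -> Optional[Tuple[str, Optional[str]]]:
    """Return (dialect, extension) from the first tag comment found within
    the first max_lines lines, or None if no tag is present."""
    for line in source.splitlines()[:max_lines]:
        match = TAG_RE.search(line)
        if match:
            return match.group(1), match.group(2)
    return None


def syntax_file_for(source: str) -> Optional[str]:
    """Pick the syntax description associated with the tagged dialect."""
    tag = find_dialect_tag(source)
    return SYNTAX_FILES[tag[0]] if tag else None
```

Limiting the scan to the first 200 lines keeps the check cheap while still leaving room for a preamble and licence comment before the tag.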

In the graphical VIM editor, there is a language menu where the syntax can be chosen manually and all three dialects are placed together in a submenu under a Modula-2 menu item.

So, from the user experience point of view, this is all the same language, different dialects. From an implementor's point of view, one might also describe this as three different languages and each one having exactly one grammar description, whilst they share the same file extensions.

What we call this is not my concern. Instead, it is the user experience that counts.

I believe that a similar scheme should be possible to do for Modula-2 on Github.

You might ask how the case should be handled where no dialect tag is present in the source code. This is fairly straightforward:

(1) search for symbols that are unique to one dialect, if found, select that dialect's description.
(2) if no symbols that are unique to any dialect are found, select some default as fallback.
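A rough Python sketch of this two-step fallback follows. The marker patterns are illustrative assumptions, not a vetted list: the Rev 2010 pattern relies on the `0x`/`0b`/`0u` literal prefixes, the ISO pattern uses a few reserved words assumed here to be ISO-only, and PIM as the default is an arbitrary choice.

```python
import re

# Assumed dialect-unique markers; a real list would need to be curated.
UNIQUE_MARKERS = {
    # Rev 2010 style literals: 0x.., 0b.., 0u.. (prefix is lowercase).
    "m2r10": re.compile(r"\b0[xbu][0-9A-F']+"),
    # Reserved words assumed to exist only in ISO Modula-2.
    "m2iso": re.compile(r"\b(?:PACKEDSET|FINALLY|RETRY)\b"),
}

# Assumed fallback when nothing unique is found.
DEFAULT_DIALECT = "m2pim"


def guess_dialect(source: str) -> str:
    """Step (1): return the first dialect whose unique marker occurs.
    Step (2): otherwise fall back to the default dialect."""
    for dialect, marker in UNIQUE_MARKERS.items():
        if marker.search(source):
            return dialect
    return DEFAULT_DIALECT
```

This naive scan also looks inside comments and string literals, so a production version would want to skip those first.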

For TextWrangler, which had no other way to disambiguate automatically, I used a scheme where .def/.mod is Revision 2010 Modula-2, .Def/.Mod is ISO Modula-2 and .DEF/.MOD is Classic Modula-2. There is usually some way to disambiguate even if no scripting is available (as in the case of TextWrangler).
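The extension-case scheme amounts to a case-sensitive lookup, sketched below in Python. The dialect labels returned are placeholders chosen for this example.

```python
import os
from typing import Optional

# Case-sensitive mapping from file extension to dialect label.
EXTENSION_DIALECTS = {
    ".def": "r10", ".mod": "r10",          # Revision 2010 Modula-2
    ".Def": "iso", ".Mod": "iso",          # ISO Modula-2
    ".DEF": "classic", ".MOD": "classic",  # Classic Modula-2
}


def dialect_from_extension(path: str) -> Optional[str]:
    """Return the dialect label for a path, or None for other extensions."""
    _, ext = os.path.splitext(path)
    return EXTENSION_DIALECTS.get(ext)
```

The lookup only distinguishes dialects where the letter case of file names is preserved exactly.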

As to your "caveat" that with separate syntax descriptions, dialects cannot be mixed in Markup blocks, I cannot imagine any situation where one would use different dialects within the same code block, so I don't think that would be an impediment.

As for the names of the dialects, we can just use the year of first release suffixed to the language name

Modula2_1978 => PIM
Modula2_1996 => ISO
Modula2_2010 => R10

This way, people who inadvertently stumble upon this, will immediately realise what it means. Also, it could later be formalised as there are many other languages for which this would be useful, Ada83, Ada95, Ada2005, Ada2012, C90, C99, C11, C23, Fortran58, Fortran77, Fortran90, Fortran2015, etc etc. Ideally, an attribute could be added to Linguist in the future where repo owners could set a repo-wide default dialect. I am somewhat surprised that dialect-switching hasn't been on Github's radar thus far.

To summarise, from your comments, it is my impression that the aforementioned approach with separate syntax descriptions for each dialect is the most promising. But please tell me if I misunderstood.

lildude Jun 16, 2023
Maintainer

> To summarise, from your comments, it is my impression that the aforementioned approach with separate syntax descriptions for each dialect is the most promising. But please tell me if I misunderstood.

Both are equally promising.

Most of the languages you've highlighted rely on a single grammar to support all dialects, which is the primary reason Linguist hasn't had the need to differentiate. Looking at the languages.yml file, we can see it's only Fortran that has a different language variant as a specific language, but it is still grouped under "Fortran" so appears as Fortran in the language analysis:

linguist/lib/linguist/languages.yml

Lines 217 to 229 in d352058

    
    Ada:
      type: programming
      color: "#02f88c"
      extensions:
      - ".adb"
      - ".ada"
      - ".ads"
      aliases:
      - ada95
      - ada2005
      tm_scope: source.ada
      ace_mode: ada
      language_id: 11

linguist/lib/linguist/languages.yml

Lines 701 to 715 in d352058

    
    C:
      type: programming
      color: "#555555"
      extensions:
      - ".c"
      - ".cats"
      - ".h"
      - ".idc"
      interpreters:
      - tcc
      tm_scope: source.c
      ace_mode: c_cpp
      codemirror_mode: clike
      codemirror_mime_type: text/x-csrc
      language_id: 41

linguist/lib/linguist/languages.yml

Lines 2018 to 2045 in d352058

    
    Fortran:
      group: Fortran
      type: programming
      color: "#4d41b1"
      extensions:
      - ".f"
      - ".f77"
      - ".for"
      - ".fpp"
      tm_scope: source.fortran
      ace_mode: text
      codemirror_mode: fortran
      codemirror_mime_type: text/x-fortran
      language_id: 107
    Fortran Free Form:
      group: Fortran
      color: "#4d41b1"
      type: programming
      extensions:
      - ".f90"
      - ".f03"
      - ".f08"
      - ".f95"
      tm_scope: source.fortran.modern
      ace_mode: text
      codemirror_mode: fortran
      codemirror_mime_type: text/x-fortran
      language_id: 761352333

As you can see, most use aliases for the different names, but still ultimately use the same grammar and rely on it to differentiate where the syntax varies.

trijezdci Jun 16, 2023
Author

Fortran being quite old has gone through some major changes. Granted C does not have much syntax to begin with, they just kept adding new reserved words with leading double-lowlines like __inline__ and such. I can imagine that this can easily be done by just hooking in another sub-grammar.

Modula-2 is quite different. As a member of the Pascal family of languages, its philosophy is based on readability and so it has got a lot more syntax to begin with. Then the ISO standards working group (of which I was a member once) didn't do what the C working group did and what standards working groups usually do, which is just resolve ambiguities and norm actual use; no, we invented a new language, so ISO Modula-2 is quite different already. And this took many years, all the while compiler vendors stopped working on their Modula-2 compilers, waiting for the standard to be ratified and published. After some 6 or 7 years, when the standard was ratified and published, there were no more vendors left and the language was dead, killed by its standards committee, just like its ancestor Algol.

This happened before Unicode though, and the few compilers that were still around and being maintained then added different ways to support Unicode, all incompatible. Generics and OOP were bolted on later, by what remained of the working group before it was disbanded. Two of us (former working group members) got back together in 2009 and decided to do a modern revision without any regard for backwards compatibility, and based on Wirth's original specification as a starting point instead of the ISO standard.

Being the work of two people helped to make the revised language compact and consistent, and there was absolutely nobody to stop us from breaking backwards compatibility and removing features that were no longer timely. The result is a dialect that has quite a number of very substantial differences, perhaps even more so than ISO Modula-2.

For example, in classical Modula-2 and ISO Modula-2, non-decimal numeric literals used a suffix character: H for base-16, B for base-8 and C for base-8 used as a character code point. This breaks the language's philosophy of LL(1) compliance and it is awkward to lex. We are not fond of C, but C has got the very best syntax for non-decimal number literals, so we adopted that. Modula-2 Rev 2010 uses 0x for base-16, 0b for base-2 and 0u with base-16 for character code points, with no support for octals. And digits can be grouped with a single apostrophe, as in 0b0011'0000'1111'1010 etc.
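To make the two literal styles concrete, here is a small Python sketch that classifies a single token against each style. It is only an illustration, not the Pygments lexer or the handwritten one; it assumes uppercase hex digits in both styles and a leading decimal digit for suffix-style hex literals.

```python
import re
from typing import List, Optional, Pattern, Tuple

LiteralTable = List[Tuple[str, Pattern[str]]]

# Suffix style (classic and ISO): H = base-16, B = base-8, C = char code.
CLASSIC_LITERALS: LiteralTable = [
    ("base-16", re.compile(r"[0-9][0-9A-F]*H")),
    ("base-8", re.compile(r"[0-7]+B")),
    ("char-code", re.compile(r"[0-7]+C")),
]

# Prefix style (Rev 2010): 0x, 0b, 0u, with apostrophes between digit groups.
R10_LITERALS: LiteralTable = [
    ("base-16", re.compile(r"0x[0-9A-F]+(?:'[0-9A-F]+)*")),
    ("base-2", re.compile(r"0b[01]+(?:'[01]+)*")),
    ("char-code", re.compile(r"0u[0-9A-F]+(?:'[0-9A-F]+)*")),
]


def classify(token: str, table: LiteralTable) -> Optional[str]:
    """Return the kind of non-decimal literal, or None if the whole token
    does not match any pattern in the given table."""
    for kind, pattern in table:
        if pattern.fullmatch(token):
            return kind
    return None
```

Keeping one table per dialect means neither set of patterns has to guard against the other, which is the separation argued for in this thread.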

I know how much of an effort it is to write regex that matches the older and the new C like number literals without conflicting with each other because I have done it for Pygments, and I have also implemented a handwritten lexer for both in my multi-dialect Modula-2 front end. Also, the number of reserved words and predefined identifiers in ISO Modula-2 with full coverage of all three parts of the standard (base+OOP+generics) is outright insane.

It isn't that it can't be done, but the code quickly gets out of hand and is then difficult to read, comprehend and thus maintain. It is much easier to maintain if there is separation of concerns and the scope is smaller. I have since split up my multi-dialect front end into separate projects, and even though there is much more code to handle now, it has become much cleaner and much easier to maintain.

Against this background I am leaning towards separate syntax descriptions.

Another benefit would be that the new descriptions can all be tested separately and it won't impact the current support.

Is there any way to establish an alias for a language that can then be pointed to one of the dialects (as default)?

Say if we have three language definitions: Modula2_1978, Modula2_1996 and Modula2_2010, would it be possible to have a plain name Modula2 as an alias that points to any one of the three definitions, for example Modula2_1978?

lildude Jun 17, 2023
Maintainer

> Is there any way to establish an alias for a language that can then be pointed to one of the dialects (as default)?

No.

> Say if we have three language definitions: Modula2_1978, Modula2_1996 and Modula2_2010, would it be possible to have a plain name Modula2 as an alias that points to any one of the three definitions, for example Modula2_1978?

For this you'd need to have Modula2 be the default (it'll also be the parent language for the group) and use the grammar of the dialect you want, but include an aliases entry for those that want to refer to it in codeblocks by that variant name.

One thing to keep in mind with the dialect names is how does the rest of the community refer to them? You're using a year, but you've also mentioned PIM, ISO and R10. If these are more widely used terms, then I'd recommend you use those names too.
