Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add regex matching for Snippets #2525

Open
wants to merge 18 commits into
base: main
Choose a base branch
from

Conversation

simonfelding
Copy link

I think this would be really useful.

@gir-bot gir-bot added S: needs-review Needs to be reviewed and/or approved. C: docs Related to documentation. C: snippets Related to the snippets extension. C: source Related to source code. labels Nov 13, 2024
@simonfelding
Copy link
Author

I think it would be nice if it was possible to specify the string to join the match groups with somehow, but I doubt anyone will ever need that.

@facelessuser
Copy link
Owner

I'm questioning how useful this would be to the general population. I personally have no need for such functionality, and I question how generally useful it is. I don't doubt that some may use it for some niche purposes, but it is inherently more fragile and could cause people frustration with false positive matches, gathering too many lines across the file, etc. I think, while powerful, it is hard to use and likely impractical in most cases.

Snippets is purposely designed to be more explicit when declaring what you wish to include: lines, sections, files, etc. I'm not sure how comfortable I am in opening it up to a, what I personally view as, very niche behavior that is not generally useful to the average user. To be frank, I personally don't feel this is a good way to manage snippets and a lot harder to support for the average user as subtle changes in a linked file could unexpectedly include things in your Markdown you did not intend to include. This is, of course, a subjective opinion, but I would need a lot of convincing.

Simon Ungar Felding added 2 commits November 13, 2024 17:32
@simonfelding
Copy link
Author

simonfelding commented Nov 14, 2024

Thanks for the quick and thorough response! I hope I can answer it satisfyingly. I think you have some really good points.

The reason I want to add this feature is that the pymdown-extensions are used a lot in technical documentation, for example through the popular Material for MkDocs. To me, regex is not a niche utility - it's the main way to extract text from files, so I was surprised it's not part of the Snippets code. I use regex constantly, and I rarely if ever encounter a tool related to technical writing that doesn't support some form of it.

The advantage of being able to use regex in Snippets is that I can parse configuration documents without requiring significant line numbers. In most configuration languages, you wouldn't expect the order and number of lines to be significant. From my point of view, using line numbers is a much more fragile solution than regex. Example: I always want to extract the username from the config file. With line numbers, I'd have to make sure that it never moves to another line if the config file is updated. With regex, I'd know that I always get what I asked for.

I think you are completely right that regex can be impractical in many cases, but the same goes for line numbers and the sections system. I think they are for different use cases.


Another example: Lets say I have a system where I manage authorization through a plain text file in git. Every time I make a commit to the main branch, the system pulls the configuration, and finally mkdocs builds the documentation. It could be a csv file that looks like this:

username,role
alice,admin
bob,reader
johnny,reader
jim,admin

Now I want to document what users have read access, and what users have admin access. With this feature in Snippets, all I'd have to do would be to write a markdown file like this:

# Users with admin access
--8<-- "auth.config:/^([^,]+),.*admin/"

# Users with read access
--8<-- "auth.config:/^([^,]+),.*reader/"

This would be really convenient for me. I could achieve the same using the sections syntax, but that would make the CSV file invalid. There are so many similar use cases, like generating an inventory of servers with some server role or whatever.

And sure, I could make the mistake of writing --8<-- "auth.config:/^([^,]+).*admin/" instead (omitting the , between the match group and the role name), where a line like badminton,reader would result in b being the matched section for admin usernames.

But why would anyone complain to you about that? If that's something that worries you, I could add a line to the documentation where it says that it uses the python re library and that they should make sure to validate and debug their regex using tools like regex101.com. That might direct users over there instead if they have issues with how regex is parsed.


Without the regex feature in Snippets, I'd have to write a CI script that parses the csv file, outputs the list of users and have Snippets add that to the rendered markdown file. This would not be as good a solution, as now part of the text processing happens in the CI script and half of the text processing happens in the markdown file. CI pipelines are for security reasons often not editable for all users - frequently you just include the pipeline from some other repository where there are tight security controls. This means that for people to be able to add stuff to the documentation this way, they need CI pipeline access. And it also means I now have to maintain a CI pipeline per config file format I need to parse, which really sucks and adds a lot of complexity. And from my point of view, it is much worse to have documentation writers destroy the entire pipeline instead of making a mistake in the documentation.


Finally, I completely agree that adding this functionality lets the user make some confusing mistakes. However I think this is entirely expected by anyone who chooses to use the regex functionality. The same goes for the line number syntax - you have to include the correct lines and make sure the lines you want to include don't change positions.

I also think my contribution is quite elegant and simple, as it is convention in regex to enclose the regex string in slashes, and that the Snippets code would never be able to include a file that ends in / anyway, since that would not be a valid filename. It doesn't force the user to use regex for line matching - it just adds it as an option. If the user chooses the wrong tool for the job, of course they are going to have a bad time. But I think it would be really nice to have for those of us that need it!

@simonfelding
Copy link
Author

I could add a line to the documentation where it says that it uses the python re library and that they should make sure to validate and debug their regex using tools like regex101.com. That might direct users over there instead if they have issues with how regex is parsed.

I added this in the latest commits

@facelessuser
Copy link
Owner

facelessuser commented Nov 14, 2024

To me, regex is not a niche utility

Let me be clear so this is not taken out of context. I am not saying regex is niche, I am a heavy user of regex, but that I feel usage of regex in the context of Snippets would be niche. I am not trying to develop a grep tool for inserts and I don't think this is what the majority of users are desiring as well.

I think you are completely right that regex can be impractical in many cases, but the same goes for line numbers and the sections system.

While I don't deny that managing lines for Snippets is fragile, this was my argument when it was requested, but I relented only because many include extensions allow this and there were a significant number of people asking for it despite my reservations. I disagree with your assessment of sections though as it provides a clear explicit way to define what you want and I see no evidence that it is impractical.

Keep in mind that Snippets was designed to allow users to define reusable content and insert it where they need to. Moving forward, people have used it for all sorts of things, and yes, I have obliged with expanding my scope to some degree, but keep in mind, I had (and have) no interest in creating a templating language. Snippets was always meant for easy, dumb, and predictable inserts.

This regex approach is a bit problematic as well as you can single out arbitrary lines, but you cannot target blocks. It's usefulness is greatly diminished which relegates it to ver niche cases.

Another example: Lets say I have a system where I manage authorization through a plain text file in git. Every time I make a commit to the main branch, the system pulls the configuration, and finally mkdocs builds the documentation. It could be a csv file that looks like this:

While I do not deny you may have this one niche case where you may want to grab users from a CSV, many people are targeting blocks of text from snippets, this is a lot more difficult for people to use in a practical sense for that purpose as often the kind of content that is targetted is not regular. I do believe this example you've given is far from a common case. I also believe this approach does not apply well to how the majority of people use this extension.

@simonfelding
Copy link
Author

simonfelding commented Nov 14, 2024

Thanks for your response. I have a couple comments:

I had (and have) no interest in creating a templating language. Snippets was always meant for easy, dumb, and predictable inserts.

I think the text you grab with regex is just as much a snippet as any other (section of a) file is. I don't think it's a templating language, all I do with this is grab one or more lines from a file which I think is pretty in line with how Snippets works.

This regex approach is a bit problematic as well as you can single out arbitrary lines, but you cannot target blocks. It's usefulness is greatly diminished which relegates it to ver niche cases.

I implemented multiline regex matches just now! :)

I do believe this example you've given is far from a common case.

Disagree (but of course you can't use the Snippets extension like that right now since it doesn't support regex). I think this is a huge use case from my point of view. I personally can't really see the benefit of including specific lines, but I know for sure that generating documentation this way is gaining massive traction because of the entire gitops approach to doing stuff is such a big influence on the devops field, and because mkdocs is a big player in generating documentation, I naturally end up in your project.

I guess I could write a plugin that does the same thing, but my code just fits in with snippets so nicely. I think it's a very natural extension of Snippets and I don't really understand why Snippets can't support both the line-selecting approach and the full file including approach. Isn't it just a matter of how you structure your text snippets - in files or inside files?

So tl;dr: Snippets is the de facto way to include sections of files in mkdocs-rendered documentation. I think there actually is a very large amount of users that would love to have this functionality but have to look elsewhere because it isn't available in mkdocs.

@facelessuser
Copy link
Owner

don't think it's a templating language

I didn't mean to insinuate it is per see, but I'm trying to illustrate there is a limit to the scope of what I am willing to add. Even Jinja2, which is a templating language, I don't think provides inclusion of lines or blocks per regex. To do so, you have to construct programmatic blocks of logic to attempt such a feat because it is a far from a common use case, most likely due to practical limitations.

I implemented multiline regex matches just now! :)

That's fine, but in a practical sense, I'm not quite sure how useful this would be. I don't deny that it can be useful if your narrow case lends well to it, but to be fair, I think capturing content programmatically would be far more useful as much content is not regular, though I'm not advocating that we do this either. You can often include regular expression in the process, but pure regex can be cumbersome, and often inefficient when singling out non-regular targets. In a small set of cases, this might work, but I can think of many that would be difficult for a user to target well, even with multiline regex.

I support a couple of tools that employ regular expression and I get a number of questions on how to effectively use regular expression to do specific tasks. It should be noted that this is something I also take into consideration: what will it cost me in support time? The more difficult it is to apply, the more questions I get. The return has to be significant enough to offset this cost. I guarantee I will get questions such as:

I have this Snippet where I am trying to use regex and I want to capture this block of text. Everything I've tried so far doesn't work, how do I do it?

A small number of users will navigate this just fine and quickly identify what they can and cannot do, but there will be a large number who don't fully understand the limitations of what they can and can't do (and yes, there are limitations that limit the scope of what multiline regex can do). And there will be a number of people who just need help using regex effectively.

In the vast majority of cases, I think regex would not be the answer and I would direct them to the other approaches. I think the need to use regex would be niche.

Disagree (but of course you can't use the Snippets extension like that right now since it doesn't support regex).

Disagreement is fine, but you are the first to ever request such a feature, and outside the CSV example, which I don't think is a common case, I cannot see this being used generally to a significant degree. I certainly could be wrong, but I was honest from the start, I would likely need a lot of convincing.

Anything I allow added to the code adds complexity that I must maintain, long after the requester is gone. So I am, at times, very critical and will question strongly the necessity of any addition, big or small. Adding certain code also opens the doors for other related requests as well. I have to determine what the scope of what I offer is. Sometimes, suggestions seem like a clear win for me, others I feel require more scrutiny to assess the real payoff for the general population of users vs the cost of supporting the feature.

So tl;dr: Snippets is the de facto way to include sections of files in mkdocs-rendered documentation. I think there actually is a very large amount of users that would love to have this functionality but have to look elsewhere because it isn't available in mkdocs.

I am open to having people chime in and champion this request in order to note how useful this would really be. I'm not sure I'm convinced there is such a strong desire for this. I am open to being proven wrong.

@simonfelding
Copy link
Author

simonfelding commented Nov 14, 2024

This regex approach is a bit problematic as well as you can single out arbitrary lines, but you cannot target blocks. It's usefulness is greatly diminished which relegates it to ver niche cases.

I now realize I misunderstood this line. Why does my code not target blocks? I thought I made it fit pretty well in the way the existing code handles all the stuff that looks like <path>:<section>. Did I miss something?

I also think my code solves the example from the documentation even nicer, where you don't have to add any Snippets-specific syntax inside your code to make it possible to grab that snippet:
image

Could now be solved with a simple --8<-- "example.py:/(def my_function.*)\n\w/" and regex_flags=['DOTALL'] - without having to modify the source code with Snippets syntax stuff. For me this is more natural adn elegant, although I see the appeal of the existing approach in this case. Safe and simple. However to some people, the regex approach is better than the sections syntax because you don't have to edit the source files. Sometimes you cannot edit them, for example if they are hosted remotely (and you already support getting sections from remote files)!

I think a well-written regex string is much safer than getting a range of lines from some remote file.

@simonfelding
Copy link
Author

simonfelding commented Nov 14, 2024

Oops, we commented at the same time.

I've made my case and I think this is entirely within the scope of the already existing functionality (As per the example in my previous comment), even though it's more than your original intention. I do understand what you're saying and respect that, I don't have anything more to add.

How many people do you need to chime in before you think there is a use-case? I could easily gather a quite large amount of people who agree this is very useful. Just looking through open issues, my PR would solve #2385 quite easily:

--8<-- "my_file:/(.*we want this line\n).*\n(.*)/ and regex_flags=['MULTILINE'].

Just one comment:

I have this Snippet where I am trying to use regex and I want to capture this block of text. Everything I've tried so far doesn't work, how do I do it?

I don't think you have any responsiblity to support that but it's nice of you that you do :)

@facelessuser
Copy link
Owner

I don't think you have any responsiblity to support that but it's nice of you that you do :)

I still have to take time to respond to all of these, even if it is to say I have no intention to help them. But sure, I don't have to give them helpful answers.

Could now be solved with a simple --8<-- "example.py:/(def my_function.*)\n\w/" and regex_flags=['DOTALL'] -

I think you missed a lookahead as you would grab the first character in the next function or whatever. You should probably also look for ( to prevent grabbing functions that have the same function name prefix.

(?s)(def my_function\(.*)\n(?=\w)

This only illustrates how easy it is for people to accidentally grab what they don't want, but sure that's not really my problem, but theirs.

Safe and simple. However to some people, the regex approach is better than the sections syntax because you don't have to edit the source files.

In the context of certain coding languages, yes, I can see this being true, less true in prose, HTML, and various other places, but yes, when targeting something with a regular structure, yes, it works fine.

Let me think about this some more. If this was going to be included there are some things that should be addressed. I'm not promising acceptance, only that I will take some more time to consider the possibility.

  1. I'd prefer that users use inline (?si) flags so we don't need a special parameter to pass flags. Then they can control what flags they want and when.
  2. I'm not sold on a regex should gather anything that matches a certain pattern. You should probably implement a greedy option: /pattern/g to prevent surprises. A user can explicitly control gathering more or less.
  3. I don't know if your code handles it, but you should ensure a user can use / in their pattern and that your capture of such things doesn't break when this happens. A user should be able to use \/. There should be tests for this.

@simonfelding
Copy link
Author

simonfelding commented Nov 14, 2024

yep i wrote it too fast, but it would have given me an error if i had check_paths enabled ;)

I'd prefer that users use inline (?si) flags so we don't need a special parameter to pass flags. Then they can control what flags they want and when.

Of couse. I love this.

I'm not sold on a regex should gather anything that matches a certain pattern. You should probably implement a greedy option: /pattern/g to prevent surprises. A user can explicitly control gathering more or less.

You mean the case where you don't have any match groups? I think it already works like how I interpret your /pattern/g. I don't think we should have an implicit group if there is none specified; this is not how the re library works and I think it's best to stick to default Python behavior, exactly to avoid surprises.

I don't know if your code handles it, but you should ensure a user can use / in their pattern and that your capture of such things doesn't break when this happens. A user should be able to use /. There should be tests for this.

It's entirely supported - the regex string just has to start and end with a /, but can have anything in between. They don't even have to escape the /, but it's fine if they do.

@facelessuser
Copy link
Owner

It's entirely supported - the string just has to start and end with a /, but can have anything in between.

If it does, great, but tests should confirm this.

You mean the case where you don't have any match groups? It already works like how I interpret your /pattern/g.

I don't think I understand what match groups have to do with greedy. If there is some baked in behavior here, I don't think it sounds intuitive.

@simonfelding
Copy link
Author

If it does, great, but tests should confirm this.

For sure.

I don't think I understand what match groups have to do with greedy. If there is some baked in behavior here, I don't think it sounds intuitive.

I think I'm confused, it's a bit late here. Can you give an example of the behavior you expect and the behavior you want to avoid?

@facelessuser
Copy link
Owner

I think I'm confused, it's a bit late here. Can you give an example of the behavior you expect and the behavior you want to avoid?

It's possible I don't understand the implementation (I have not looked at the code closely). Does \w* match every line that has a match? Or just the first instance?

@facelessuser
Copy link
Owner

Marking this as needing more information per the earlier questions. Also, this would need testing as well.

@facelessuser facelessuser added the S: more-info-needed More information is required. label Dec 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C: docs Related to documentation. C: snippets Related to the snippets extension. C: source Related to source code. S: more-info-needed More information is required. S: needs-review Needs to be reviewed and/or approved.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants