HTML string containing underscores gets escaped and shown in output markdown #63

jiggneshhgohel · 2016-02-09T20:17:36Z

Ruby 2.3.0

Rails 4.2.5

Rails Console output:

2.3.0 :005 > html_str = "<p><strong>Username</strong> : %{user_name}</p>"
 => "<p><strong>Username</strong> : %{user_name}</p>" 
2.3.0 :006 > md = ReverseMarkdown.convert(html_str)
 => " **Username** : %{user\\_name}\n\n" 
2.3.0 :007 > puts md
 **Username** : %{user\_name}
 => nil

As can be seen when my HTML string contains words separated by underscore(s), the underscore(s) gets escaped which is correct. But is there any way we can hide those in the output string? I have a use-case wherein I want to convert the HTML string to its Markdown version and render the markdown version as it is.

The text was updated successfully, but these errors were encountered:

xijo · 2016-02-11T14:31:41Z

Hello @jiggneshhgohel

I'm not quite sure how to help you there. We could introduce an option to disable the escaping, but I think the use cases for that are quite rare.

On the other hand could you replace escaped key chars with the non-escaped versions whenever you need it. Wouldn't that be the easier solution?

Let me know what you think!
Jo

jiggneshhgohel · 2016-02-11T16:19:55Z

@xijo

On the other hand could you replace escaped key chars with the non-escaped versions whenever you need it. Wouldn't that be the easier solution?

That's the most obvious option when there is no provision for desired feature in the library. But my expectancy of using a library like yours is to expect raw original markdown from the HTML I need to convert. My HTML didn't contained any escape characters thus I wasn't expecting it in my converted to markdown.

Thanks.

anujbiyani · 2019-01-02T21:47:43Z

I think the issue is that the md = ReverseMarkdown.convert(html_str) is double-escaping the underscore, no? I would expect

2.3.0 :006 > md = ReverseMarkdown.convert(html_str)
 => " **Username** : %{user\\_name}\n\n"

to read as

2.3.0 :006 > md = ReverseMarkdown.convert(html_str)
 => " **Username** : %{user\_name}\n\n" # only \_ here instead of \\_

The issue I'm having is with a string like

"https://github.com/xijo/reverse_markdown"

ReverseMarkdown converts it to

"https://github.com/xijo/reverse\\_markdown\n\n"

and then the marked library converts it to

"<p><a href="https://github.com/xijo/reverse\_markdown">https://github.com/xijo/reverse\_markdown</a></p>
"

which is correct from marked's perspective because it's processing only one of the escapes, but incorrect from the user's perspective because the original string had no escapes and the outputted markdown has escapes.

(I'm using ReverseMarkdown because sometimes the input string contains actual markup, sometimes it's just plain text.)

anujbiyani · 2019-01-02T23:53:48Z

https://www.markdownguide.org/basic-syntax/#italic

To italicize text, add one asterisk or underscore before and after a word or phrase. To italicize the middle of a word for emphasis, add one asterisk without spaces around the letters.

So I think the actual problem is that reverse markdown should never escape underscores that are in the middle of a word.

xijo · 2019-01-04T10:04:56Z

@anujbiyani Thanks for your input on that issue.

I like the idea to leave underscored in words unmodified. It won't solve all issues though, because something like foo, _bar = something will still raise problems, but it's a good solution for most of the cases.

Would you like to open a PR?

anujbiyani · 2019-01-04T17:44:30Z

Sure, happy to try and tackle this!

One question: why are * and _ escaped at all? (Referring to this function.) Aren't unmatched * and _ valid without the escaping, and won't most markdown->HTML parsers simply ignore them?

xijo · 2019-01-05T12:02:13Z

@anujbiyani Good question! I did some digging in the past and I think initial the description mentions this case: https://daringfireball.net/projects/markdown/syntax#backslash

The example sounds pretty reasonable to me so I wouldn't ditch the whole escaping. What about:

don't escape in the middle of a word
add an configuration flag that disables escaping

This way it should be save for the majority and doesn't change the default behavior all to drastically. But everyone is free to opt out of escaping if it fits their case.

What do you think?

anujbiyani · 2019-01-07T18:28:29Z

Happy to do it via configuration flag to maintain backwards compatibility! Follow-up question, though: I believe the link you provided is referring to user-provided backslashes whereas the code I linked in ReverseMarkdown is escaping underscores and asterisks that don't have specifically two backslashes.

I think that method should either:

be removed (if it doesn't have to exist)
be a no-op (if it does have to exist)
add a second backslash if there already is one backslash (if we need the double-backslash instead of just the single)

I'm trying to understand what the right behavior should be for when the flag is enabled 😄 let me know what you think!

xijo · 2019-01-08T18:26:14Z

You're right, the backslash escape is used for user provided escapes. But, if we complete a full life cycle (HTML - MD - HTML) and we would not escape the initial *, then it would be treated as if it were markdown and therefore the information would be lost. It's a little confusing, so let me give you an example:

With escaping

<div>I _emphasize_ that</div>   # Original HTML (user input)
I \_emphasize\_ that            # MD after conversion
<p>I _emphasize_ that</p>       # After using random MD interpreter

Without escaping (as proposed)

<div>I _emphasize_ that</div>     # Original HTML (user input)
I _emphasize_ that                # MD after conversion
<p>I <em>emphasize</em> that</p>  # Now there is an additional em tag

Does this example clarify my problem with just skipping the whole escaping?

In my opinion it is correct to treat the original HTML entities as user input and therefore escape them.

anujbiyani · 2019-01-08T19:48:33Z

Ohh that helps a ton, thanks! So now I agree that removing the escaping entirely isn't actually a good idea.

What if we just go with option 1 from your post above and skip the disabling escaping via config?

don't escape in the middle of a word

add an configuration flag that disables escaping

Or did you mean that the configuration flag should apply to whether or not we escape in the middle of the word?

Sorry for all the back-and-forth, I find escaping/not-escaping incredibly confusing from a keeping-it-straight-in-my-head perspective 😖 .

bsbodden · 2020-02-13T13:37:47Z

Also, don't convert/parse tag attributes, for example:

one_tag = "<img src=\"https://someimageserver.com/img__1fu9uz__.png\" alt=\"foo\" />"
ReverseMarkdown.convert(one_tag)

results in:

=> " ![foo](https://someimageserver.com/img __1fu9uz__.png)"

notice the space

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HTML string containing underscores gets escaped and shown in output markdown #63

HTML string containing underscores gets escaped and shown in output markdown #63

jiggneshhgohel commented Feb 9, 2016

xijo commented Feb 11, 2016

jiggneshhgohel commented Feb 11, 2016

anujbiyani commented Jan 2, 2019

anujbiyani commented Jan 2, 2019

xijo commented Jan 4, 2019

anujbiyani commented Jan 4, 2019

xijo commented Jan 5, 2019

anujbiyani commented Jan 7, 2019 •

edited

Loading

xijo commented Jan 8, 2019 •

edited

Loading

anujbiyani commented Jan 8, 2019 •

edited

Loading

bsbodden commented Feb 13, 2020 •

edited

Loading

HTML string containing underscores gets escaped and shown in output markdown #63

HTML string containing underscores gets escaped and shown in output markdown #63

Comments

jiggneshhgohel commented Feb 9, 2016

xijo commented Feb 11, 2016

jiggneshhgohel commented Feb 11, 2016

anujbiyani commented Jan 2, 2019

anujbiyani commented Jan 2, 2019

xijo commented Jan 4, 2019

anujbiyani commented Jan 4, 2019

xijo commented Jan 5, 2019

anujbiyani commented Jan 7, 2019 • edited Loading

xijo commented Jan 8, 2019 • edited Loading

anujbiyani commented Jan 8, 2019 • edited Loading

bsbodden commented Feb 13, 2020 • edited Loading

anujbiyani commented Jan 7, 2019 •

edited

Loading

xijo commented Jan 8, 2019 •

edited

Loading

anujbiyani commented Jan 8, 2019 •

edited

Loading

bsbodden commented Feb 13, 2020 •

edited

Loading