URL detection is overzealous #28

Open
CalebCourier opened this issue Nov 21, 2024 · 1 comment

Comments

@CalebCourier
Contributor

Inconsistently, but often enough to notice, chunks of a sentence will be flagged as a URL when they are not.

Example:

from guardrails import Guard
from guardrails.hub import CompetitorCheck, DetectPII

guard = Guard().use(
    CompetitorCheck(
        competitors=["Fortran", "Ada", "Pascal"],
        on_fail="fix"
    )
).use(DetectPII(pii_entities="pii", on_fail="fix"))

response = guard.validate("The author is Paul Graham. Growing up, he worked on writing short stories and programming, starting with an early version of Fortran on an IBM 1401 in 9th grade. Later, he transitioned to microcomputers and began programming more extensively, including writing simple games and a word processor on a TRS-80.")

print("Raw: ", response.raw_llm_output)
print("Guarded: ", response.validated_output)
Raw:  The author is Paul Graham. Growing up, he worked on writing short stories and programming, starting with an early version of Fortran on an IBM 1401 in 9th grade. Later, he transitioned to microcomputers and began programming more extensively, including writing simple games and a word processor on a TRS-80.

Guarded:  The author is <PERSON><URL>owing up, he worked on writing short stories and programming, starting with an early version of [COMPETITOR] on an IBM 1401 in 9th <URL>er, he transitioned to microcomputers and began programming more extensively, including writing simple games and a word processor on a TRS-80.

I'm guessing it may have to do with how the chunking removes spaces, so it sees .[some three letters] and thinks it's a top-level domain. For example, turning "9th grade. Later," into "9th <URL>er," kind of makes sense because .lat is a known top-level domain. However, the other presumably flagged TLD is .gro, which isn't a thing unless it's fuzzy matching to .org.
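
To make the hypothesis above concrete, here is a minimal sketch of how dropping the space after a period could produce URL-looking tokens. It is not the validator's or Presidio's actual logic: the `TLDS` list, the `NAIVE_URL` pattern, and both helper functions are hypothetical and exist only for illustration. The `gr` entry is an assumption based on the guarded output, where only "Gr" of "Growing" was consumed.

```python
import re
from typing import List

# Hypothetical, deliberately tiny TLD list for illustration only.
TLDS = ("lat", "gr", "org", "com")

# Naive "word.tld" pattern, case-insensitive. This is NOT Presidio's URL regex.
NAIVE_URL = re.compile(r"\b[\w-]+\.(?:%s)" % "|".join(TLDS), re.IGNORECASE)


def collapse_sentence_gaps(text: str) -> str:
    """Simulate the suspected chunking artifact: drop whitespace after periods."""
    return re.sub(r"\.\s+", ".", text)


def naive_url_hits(text: str) -> List[str]:
    """Return every substring the naive pattern would treat as a URL."""
    return NAIVE_URL.findall(text)


sample = (
    "The author is Paul Graham. Growing up, he worked on writing short stories "
    "and programming, starting with an early version of Fortran on an IBM 1401 "
    "in 9th grade. Later, he transitioned to microcomputers."
)

print(naive_url_hits(sample))                          # with spaces intact
print(naive_url_hits(collapse_sentence_gaps(sample)))  # after spaces are dropped
```

With the spaces intact the naive pattern finds nothing. Once the spaces are removed it picks up "Graham.Gr" and "grade.Lat", which lines up with the two `<URL>` replacements in the guarded output, though that alone doesn't prove this is the cause.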

@dtam
Contributor

dtam commented Nov 21, 2024

The underlying URL regex in Presidio doesn't flag this. It might be coming from context or one of the NLP code paths in Presidio.
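
One way to narrow this down is to run the standalone URL pattern recognizer and the full analyzer on the same text and compare which recognizer each hit is attributed to. The sketch below assumes `presidio-analyzer` and a spaCy English model are installed. `summarize` and `url_hits_by_path` are hypothetical helper names, and whether the DetectPII validator runs Presidio with this same default configuration is not confirmed here.

```python
def summarize(results, text):
    """Return (entity_type, matched_text, recognizer_name) for each result."""
    rows = []
    for r in results:
        # analysis_explanation may be missing, so fall back to "unknown".
        explanation = getattr(r, "analysis_explanation", None)
        recognizer = getattr(explanation, "recognizer", None) or "unknown"
        rows.append((r.entity_type, text[r.start:r.end], recognizer))
    return rows


def url_hits_by_path(text):
    """Compare the regex-only URL recognizer with the full analyzer pipeline.

    Imports are local so this module loads even when Presidio is not installed.
    """
    from presidio_analyzer import AnalyzerEngine
    from presidio_analyzer.predefined_recognizers import UrlRecognizer

    # Regex-only path: just the predefined URL pattern recognizer.
    regex_only = UrlRecognizer().analyze(text, entities=["URL"])

    # Full path: default registry plus NLP engine, with explanations attached.
    full = AnalyzerEngine().analyze(
        text=text,
        entities=["URL"],
        language="en",
        return_decision_process=True,
    )
    return summarize(regex_only, text), summarize(full, text)
```

If the first list is empty while the second is not, the hit is coming from somewhere other than the plain regex path. Calling `url_hits_by_path` on both the original sentence and the text as it looks after chunking would show whether the input or the code path makes the difference.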
