MemoryError for a name with a lot of prefixes #108

Open
Ronserruya opened this issue Mar 22, 2020 · 1 comment

I don't really think this is a "bug", more like an extreme edge case.

While using the library I had to parse millions of names and encountered this user input:

"<first_name> van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der van der <last_name>"

This name quickly caused a MemoryError on a PC with 60+ GB of RAM. More specifically, this list: https://github.com/derek73/python-nameparser/blob/master/nameparser/parser.py#L799 grows exponentially in size very fast.

Again, I'm not expecting you to fix this, since this is obviously a user input error (which I bypassed by setting a maximum size on the string), but I thought you might be interested to know about this edge case.
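A minimal sketch of the length guard described above, for anyone who hits the same problem. The cutoff value and the helper name `is_parseable` are hypothetical and not part of the library; pick a limit that fits your own data before handing strings to the parser.

```python
MAX_NAME_LENGTH = 200  # hypothetical cutoff, not a library setting


def is_parseable(raw_name: str, max_length: int = MAX_NAME_LENGTH) -> bool:
    """Return True if the input is short enough to hand to the parser."""
    return len(raw_name) <= max_length


names = [
    "Juan de la Vega",
    "x " + "van der " * 40 + "y",  # pathological input, 323 characters
]

# Only names under the cutoff would be passed on to the name parser.
safe = [n for n in names if is_parseable(n)]
```

Rejected inputs can be logged or routed to manual review instead of being dropped silently.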

derek73 added the bug label Mar 22, 2020

derek73 (Owner) commented Mar 22, 2020

Thanks for the bug report. I wondered if this would ever be an issue when I wrote it that way.

When the parser encounters a new combination of titles joined with a conjunction, it saves the complete string as a new title in the module's shared config (by default) and takes another pass. So each pass would result in a title with one additional conjunction or title added to the end. That somewhat explains the exponential nature, but it might also depend on how you're using the parser. I wonder if you would have the same problem with something like this:

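from nameparser import HumanName

# Reuse one parser instance; name1 and name2 are the raw input strings
parser = HumanName()
parser.full_name = name1
parsed_name1 = str(parser)

parser.full_name = name2
parsed_name2 = str(parser)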

This should ensure that the module level config is shared across all the instances. I guess I'm not clear why that list would grow so large as to throw a memory error. In my understanding, it seems like it should just be storing less than 50 different versions of that very long title.
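To put rough numbers on that intuition, here is a toy model. It is not the library's code: it only assumes the behavior described above, where each pass stores one title that is one token longer than the previous one. The function name `cumulative_titles` and the repetition count are made up for illustration.

```python
def cumulative_titles(tokens):
    """Toy model: one stored string per pass, each one token longer than the last."""
    return [" ".join(tokens[:k]) for k in range(2, len(tokens) + 1)]


tokens = ["van", "der"] * 35  # hypothetical repetition count
stored = cumulative_titles(tokens)

# Number of stored titles and their combined length in characters.
total_chars = sum(len(s) for s in stored)
print(len(stored), total_chars)
```

Under this model the number of stored strings grows linearly with the token count and their combined length grows quadratically, which is tiny compared to 60 GB. If the model is right, the blow-up would have to come from somewhere else in the parse.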

Anyway, it would be nice if the library didn't throw ambiguous memory errors, so maybe we can give it a better exception. Here is where the new titles with conjunctions are saved to the module-level config:

https://github.com/derek73/python-nameparser/blob/master/nameparser/parser.py#L721

We could test for some maximum around there, and have a default that can be overridden with the config object. I'm not sure exactly where it would need to go though. I wonder if the problem is in that group_contiguous_integers(conj_index) call?

If you are able to poke around or have any ideas, let me know. I haven't fired up my dev environment yet to try that name string, but when I do I'll try to find someplace to put in a maximum and then maybe throw some more informative exception? or maybe a warning?
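A rough sketch of what such a guard could look like. Everything here is hypothetical: the exception class, the default limit, and the helper `add_generated_title` are placeholders, and a plain `set` stands in for wherever the generated titles end up being stored.

```python
class TitleLimitExceeded(Exception):
    """Hypothetical exception raised when too many generated titles accumulate."""


DEFAULT_MAX_GENERATED_TITLES = 1000  # hypothetical default, meant to be overridable


def add_generated_title(titles, new_title, max_titles=DEFAULT_MAX_GENERATED_TITLES):
    """Add new_title to titles, refusing to grow the set past max_titles."""
    if new_title not in titles and len(titles) >= max_titles:
        raise TitleLimitExceeded(
            f"refusing to store more than {max_titles} generated titles; "
            "the input may contain a long run of repeated prefixes or conjunctions"
        )
    titles.add(new_title)
```

Swapping the `raise` for `warnings.warn(...)` followed by an early return would give the warning variant instead, at the cost of silently leaving the name only partly joined.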
