error when using german specific letters #5

Open
smartini87 opened this issue Apr 19, 2022 · 3 comments

Comments

@smartini87

The script aborts when the input contains German umlauts (ä, ö, ü) or the sharp s (ß).

@selva221724
Owner

selva221724 commented Apr 20, 2022

Hi @smartini87 ,

I checked the libpostal executable itself: it does not accept input containing German umlauts or other accented characters, which is why pypostalwin could not parse the address. A screenshot from address_parser.exe is below.

This is the issue that was raised in libpostal; it says libpostal needs a UTF-8 encoded string.

image

This is why I added a layer to pypostalwin that removes special characters, i.e. any non-ASCII values.

image

Also, you have to normalize a non-English or special-character address with expandAddress before passing it to the parser. This is mentioned in libpostal's readme.

I think you can use the function below to remove the accents before passing the address to the parser.

import unicodedata
def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    return u"".join([c for c in nfkd_form if not unicodedata.combining(c)])

Let me know if that works, or whether we need to wait for newer libpostal versions that allow different character encodings.

@smartini87
Author

Thanks for trying this out.
The function you proposed works, but the output is not what I would like, because the city name is technically misspelled:
image
My idea would be to check whether replacing those characters is possible; if so, I would record that information and, once parsing is done, change each letter back to the original one. I cannot simply convert every string back to umlauts, because a city or street may originally be spelled the same way as the converted form.
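The record-and-restore idea described above could be sketched roughly as follows. This is only a minimal illustration, not part of pypostalwin: the names `GERMAN_MAP`, `to_ascii_german` and `restore_german` are hypothetical, and the ae/oe/ue/ss transliteration is an assumed convention. Only words that were actually changed are recorded, so a word that was already spelled with "ue" in the input is left alone afterwards.

```python
import re

# Assumed transliteration table (hypothetical, not taken from pypostalwin).
GERMAN_MAP = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}

def to_ascii_german(text):
    """Transliterate German letters and remember which words were changed.

    Returns (ascii_text, restore_map), where restore_map maps the
    lower-cased transliterated word back to the original word.
    """
    restore_map = {}

    def convert(match):
        word = match.group(0)
        ascii_word = "".join(GERMAN_MAP.get(ch, ch) for ch in word)
        if ascii_word != word:
            restore_map[ascii_word.lower()] = word
        return ascii_word

    return re.sub(r"\w+", convert, text), restore_map

def restore_german(text, restore_map):
    """Put the original spelling back for every word recorded in restore_map."""
    return re.sub(
        r"\w+",
        lambda m: restore_map.get(m.group(0).lower(), m.group(0)),
        text,
    )
```

The lookup is case-insensitive on the assumption that the parser may change the casing of its output. If the same address contains both "Muehldorf" and "Mühldorf", this word-level map cannot tell them apart, so that case would need position tracking instead.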

Also, while hitting this issue I was stuck in an infinite loop with no error shown, so I could not skip that particular data record with exception handling. Is there any way to at least get an error when the parser cannot return a result?

import pypostalwin
import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    return u"".join([c for c in nfkd_form if not unicodedata.combining(c)])

parser = pypostalwin.AddressParser()
try:
    parsedAddress = parser.runParser(remove_accents("Weissgerber Str. 10, 84453 Mühldorf am Inn"))
    print(parsedAddress)
except Exception as exc:
    print('Error:', exc)
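Since a `try`/`except` cannot catch a call that never returns, one possible workaround is to run the parser call in a background thread and give up after a timeout. The sketch below is generic and makes no assumptions about pypostalwin internals; `call_with_timeout` is a hypothetical helper name.

```python
import threading

def call_with_timeout(func, args=(), timeout=10.0):
    """Run func(*args) in a daemon thread; raise TimeoutError if it stalls."""
    outcome = {}

    def worker():
        try:
            outcome["value"] = func(*args)
        except Exception as exc:  # hand errors back to the caller's thread
            outcome["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)

    if thread.is_alive():
        # The call is still running; we stop waiting but cannot kill it.
        raise TimeoutError(f"call did not finish within {timeout} seconds")
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]
```

With the code above this would be used as `call_with_timeout(parser.runParser, (address,), timeout=10)`, wrapped in a `try`/`except TimeoutError` to skip the record. Note that the stalled call keeps running in the background, so it may be safer to create a fresh parser object after a timeout.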

@selva221724
Owner

Thanks for the reply, @smartini87. I just ran the same code in my Python environment; the result is pasted below.

image

It does not give me an infinite loop. Please make sure you are using the latest version:

pip install pypostalwin==0.0.3 

pypostalwin may get stuck if the input contains non-ASCII characters such as ®, ±, or Æ, but I have added as many exceptions as possible to remove these characters. Basically, the best practice is to convert your address into a UTF-8 encoded string before sending it to the parser.
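To get an explicit error instead of a possible stall, the address could be checked for leftover non-ASCII characters before it reaches the parser. This is a small sketch with hypothetical helper names (`find_non_ascii`, `ensure_ascii`), not something pypostalwin provides. It is useful because NFKD-based accent removal leaves characters such as ß, Æ and ® unchanged.

```python
def find_non_ascii(text):
    """Return the distinct non-ASCII characters in text, sorted by code point."""
    return sorted({ch for ch in text if ord(ch) > 127})

def ensure_ascii(text):
    """Return text unchanged if it is pure ASCII, otherwise raise ValueError."""
    offending = find_non_ascii(text)
    if offending:
        raise ValueError(
            "address contains non-ASCII characters: " + " ".join(offending)
        )
    return text
```

Calling `ensure_ascii(address)` right before `parser.runParser(...)` lets an ordinary `try`/`except ValueError` skip or log the problematic record.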
