error when using german specific letters #5

smartini87 · 2022-04-19T15:57:16Z

The script aborts when the input contains German umlauts (ä, ö, ü) or the sharp s (ß).

selva221724 · 2022-04-20T07:16:14Z

I checked with the libpostal executable itself: it does not accept inputs with German umlauts or accented words, which is why pypostalwin was not able to parse them. I added a screenshot from address_parser.exe below.

This is the issue that was raised in libpostal; it says libpostal needs a UTF-8 encoded string.

This is why I added a layer in pypostalwin that removes special characters outside the ASCII range.

Also, you have to normalize non-English or special-character addresses with expandAddress before passing them to the parser. This is mentioned in libpostal's README.

I think you can use the function below to remove the accents before passing the address to the parser.

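import unicodedata

def remove_accents(input_str):
    # NFKD splits an accented letter into its base letter plus a combining mark
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    # Dropping the combining marks leaves only the base letters ("ü" becomes "u")
    return "".join(c for c in nfkd_form if not unicodedata.combining(c))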

Let me know if that works, or whether we need to wait for a newer libpostal version that allows different character encodings.
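
For reference, here is a small sketch of what the NFKD approach does to German input. Note that ß has no Unicode decomposition, so remove_accents leaves it untouched. The transliterate_german helper and its mapping (ä to ae, ß to ss, and so on) are assumptions added for illustration; they are not part of pypostalwin or libpostal.

```python
import unicodedata

# Conventional German ASCII spellings (an assumption about what is wanted here).
GERMAN_MAP = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize("NFKD", input_str)
    return "".join(c for c in nfkd_form if not unicodedata.combining(c))

def transliterate_german(input_str):
    """Hypothetical helper: replace German letters first, then strip other accents."""
    return remove_accents(input_str.translate(GERMAN_MAP))

print(remove_accents("Mühldorf"))    # Muhldorf
print(remove_accents("Weißgerber"))  # Weißgerber (ß is kept, it has no decomposition)
print(transliterate_german("Weißgerber Straße, Mühldorf"))  # Weissgerber Strasse, Muehldorf
```

Whether the parser handles "Muehldorf" better than "Muhldorf" is untested here; the sketch only shows the string handling.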

smartini87 · 2022-04-20T09:55:17Z

Thanks for trying this out.
The function you proposed works, but the output is not what I want, because the city name then comes out technically misspelled.

My idea is to check whether any of those characters were replaced; if so, I would save that information and, once parsing is done, change each letter back to the original one. I cannot simply convert every string back to umlauts, because a city or street may genuinely be spelled the way the converted form looks.
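
A minimal sketch of that idea, working token by token: record only the words that remove_accents actually changed in this particular input, and swap them back in the parsed values afterwards. The parsed dictionary is a hand-written stand-in; the real return format of runParser may differ, and build_restore_map and restore_originals are hypothetical helper names.

```python
import re
import unicodedata

def remove_accents(input_str):
    nfkd_form = unicodedata.normalize("NFKD", input_str)
    return "".join(c for c in nfkd_form if not unicodedata.combining(c))

def build_restore_map(address):
    """Map each changed token (accent-free, lowercased) to its original spelling."""
    address = unicodedata.normalize("NFC", address)  # keep each word as one token
    restore = {}
    for token in re.findall(r"\w+", address):
        plain = remove_accents(token)
        if plain != token:
            restore[plain.lower()] = token
    return restore

def restore_originals(parsed, restore):
    """Swap words in the parsed values back to the spelling seen in the input."""
    def fix(match):
        word = match.group(0)
        return restore.get(word.lower(), word)
    return {label: re.sub(r"\w+", fix, value) for label, value in parsed.items()}

address = "Weissgerber Str. 10, 84453 Mühldorf am Inn"
restore = build_restore_map(address)

# Stand-in for the parser result (assumed shape: label -> lowercased text).
parsed = {"road": "weissgerber str.", "house_number": "10",
          "postcode": "84453", "city": "muhldorf am inn"}

restored = restore_originals(parsed, restore)
print(restored["city"])  # Mühldorf am inn
```

Because the map is built per input, a street that is genuinely spelled "Muhldorf" in another record is left alone. The remaining ambiguity is an input that contains both spellings at once, which this sketch does not handle.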

Also, when I hit the issue I was stuck in an infinite loop with no error shown, so I could not skip that particular data record with exception handling. Is there a way to at least get an error when the parser cannot return a result?

import pypostalwin
import unicodedata

def remove_accents(input_str):
    # Decompose accented letters, then drop the combining marks
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    return "".join(c for c in nfkd_form if not unicodedata.combining(c))

parser = pypostalwin.AddressParser()
try:
    parsedAddress = parser.runParser(remove_accents("Weissgerber Str. 10, 84453 Mühldorf am Inn"))
    print(parsedAddress)
except Exception as exc:
    # Report what went wrong instead of hiding it behind a bare except
    print('Error:', exc)
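
Regarding the hang: a try/except cannot catch a call that never returns. One possible workaround, sketched here under the assumption that the parser call simply blocks, is to run it in a daemon thread and give up after a timeout. call_with_timeout and ParserTimeout are hypothetical names, and slow_parse only simulates a stuck parser.

```python
import threading
import time

class ParserTimeout(Exception):
    """Raised when the wrapped call does not finish in time."""

def call_with_timeout(func, args=(), timeout=5.0):
    result = {}

    def worker():
        try:
            result["value"] = func(*args)
        except Exception as exc:  # hand the error back to the caller
            result["error"] = exc

    # A daemon thread does not keep the interpreter alive if it never finishes.
    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    if thread.is_alive():
        raise ParserTimeout(f"no result after {timeout} seconds")
    if "error" in result:
        raise result["error"]
    return result["value"]

def slow_parse(address):
    time.sleep(1.0)  # simulates a parser that is stuck
    return address

try:
    call_with_timeout(slow_parse, ("Mühldorf am Inn",), timeout=0.2)
except ParserTimeout as exc:
    print("Skipped record:", exc)
```

With the real parser the call would look like call_with_timeout(parser.runParser, (address,), timeout=10). The abandoned call keeps running in the background, so it is probably safer to create a fresh AddressParser after a timeout.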

selva221724 · 2022-04-21T10:41:27Z

Thanks for the reply, @smartini87. I just tried the same code in my Python environment and pasted the result below.

It does not give me an infinite loop. Please make sure you use the latest version:

pip install pypostalwin==0.0.3

pypostalwin may get stuck if the input contains non-ASCII characters such as ®, ± or Æ, but I have added handling for as many cases as possible, which basically removes these characters. The best practice is to convert your address into a UTF-8 encoded string before sending it to the parser.
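
To see up front which characters might cause trouble, a check along these lines could be run on each address before parsing. This is only a sketch: non_ascii_report and to_ascii are hypothetical helpers, not pypostalwin functions, and to_ascii is lossy because it drops every character that has no ASCII base letter.

```python
import unicodedata

def non_ascii_report(text):
    """List (position, character, Unicode name) for every non-ASCII character."""
    return [(i, ch, unicodedata.name(ch, "UNKNOWN"))
            for i, ch in enumerate(text) if ord(ch) > 127]

def to_ascii(text):
    """Strip accents via NFKD, then drop whatever is still outside ASCII."""
    nfkd_form = unicodedata.normalize("NFKD", text)
    return nfkd_form.encode("ascii", "ignore").decode("ascii")

for position, char, name in non_ascii_report("Mühldorf ® 5"):
    print(position, char, name)

print(to_ascii("Mühldorf"))  # Muhldorf
```

Logging the report for each record makes it easier to tell which inputs were altered before they reached the parser.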
