WARN invalid UTF-8 at transliterate (transliterate.c:790) errno: Resource temporarily unavailable #101

johnhamelink · 2016-08-13T16:35:28Z

Hi there,

I'm working on an Elixir NIF for libpostal (mainly just to learn how to build NIFs to be honest). When I retrieve the binary string data from the Erlang VM and copy it into a signed char, I pass it through to libpostal to parse/expand the address input. It seems to work perfectly around 20% of the time, and all the other times I instead get the following response:

WARN  invalid UTF-8
   at transliterate (transliterate.c:790) errno: Resource temporarily unavailable

I would've assumed that the problem was in my code (it probably still is) but the errno: Resource temporarily unavailable as well as the fact that /sometimes/ it does work has thrown me off...

Would you be able to provide any insight?

You can check the code out here: https://github.com/johnhamelink/postie

albarrentine · 2016-08-13T18:24:33Z

Hi John - from my hazy recollection of Erlang, strings are represented as linked lists, and there's a more efficient type called a binary, which is a pointer to a character array plus its size, similar to strings in C++, Python, etc. That's the type you're using (bravo). I'll assume that the original string is already UTF-8 encoded (if not, that's what libpostal expects, so you should check/ensure its encoding on the way in).

The problem, I would guess, is that the Erlang binary is not NUL-terminated ('\0' at the end), and that's how C expects strings to be represented. Most string operations in C, including those used in libpostal, will start at the pointer address and continue to read bytes from memory until a zero is encountered. When it occasionally works, that means there happened to be a zero somewhere beyond the boundaries of your string and the intervening "garbage memory" happened to be valid UTF-8. The simplest case would be a zero in memory immediately after the last byte of the string, in which case it would behave like a normal C string.
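To make the over-read concrete, here is a minimal sketch that simulates it inside one fixed array, so nothing actually reads out of bounds. The array contents and the names memory, PAYLOAD_SIZE and apparent_length are made up for illustration; they are not taken from postie or libpostal.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a region of memory: an 8-byte payload with no
 * terminator, followed by whatever bytes happen to sit after it. */
enum { PAYLOAD_SIZE = 8 };

static const char memory[] = {
    '1', '0', '0', ' ', 'M', 'a', 'i', 'n', /* the 8-byte payload */
    (char)0xC3, (char)0x28,                 /* stray bytes: 0xC3 0x28 is not valid UTF-8 */
    '\0'                                    /* first zero byte, two bytes too late */
};

/* Length as a C string function sees it: scan until the first zero byte,
 * regardless of how long the payload was meant to be. */
size_t apparent_length(const char *p)
{
    return strlen(p);
}
```

Here apparent_length(memory) is 10 rather than PAYLOAD_SIZE, so a consumer that expects UTF-8 would also see the two stray bytes. In a real process the bytes after the payload vary from call to call, which would explain why the failure is intermittent.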

So you'll want to create a NUL-terminated C string from the Erlang binary before passing it to libpostal. I haven't tested this, but something like changing https://github.com/johnhamelink/postie/blob/master/src/postie.c#L78 to char *address = strndup(in_binary.data, in_binary.size); should do the trick. Note that strndup returns newly allocated memory that the caller must free, so you'll also need to call free(address) somewhere after the libpostal call.
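As a rough sketch of the same idea without relying on strndup (which is POSIX rather than ISO C), the copy can be written with malloc and memcpy. The helper name binary_to_cstring is hypothetical. It takes a plain pointer and length so that it compiles without erl_nif.h; in a NIF the arguments would presumably be the data and size fields of the ErlNifBinary.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Copy `size` bytes from a length-delimited buffer into a freshly allocated,
 * NUL-terminated C string. Returns NULL if the allocation fails.
 * The caller owns the result and must free() it. */
char *binary_to_cstring(const unsigned char *data, size_t size)
{
    assert(data != NULL || size == 0);

    if (size == SIZE_MAX) {
        return NULL;                 /* size + 1 would overflow */
    }

    char *str = malloc(size + 1);    /* +1 for the terminator */
    if (str == NULL) {
        return NULL;
    }
    if (size > 0) {
        memcpy(str, data, size);
    }
    str[size] = '\0';                /* the byte the Erlang binary lacks */
    return str;
}
```

Usage would look something like char *address = binary_to_cstring(in_binary.data, in_binary.size); followed by a NULL check, the libpostal call, and free(address). Unlike strndup, this copies any embedded zero bytes too, although C string consumers will still stop at the first one.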

johnhamelink · 2016-08-13T18:35:25Z

@thatdatabaseguy Thank you for such a clear and definitive explanation! Adding that line did the trick, and because no extra random data is making its way into the libpostal call, the responses have become much less erratic as well, which also makes my unit tests work better.

I will keep working on it, and then perhaps I can submit a PR to add postie to your list of unofficial libs?

albarrentine · 2016-08-13T18:39:45Z

No problem, and yes, happy to accept pull requests!

johnhamelink added a commit to johnhamelink/postie that referenced this issue Aug 13, 2016

Use strndup as per suggestion from openvenues/libpostal#101

2740288

albarrentine closed this as completed Jan 17, 2017

xiamx added a commit to SweetIQ/expostal that referenced this issue May 30, 2017

count reference on libpostal and fix #101

587613e

selva221724 mentioned this issue Apr 20, 2022

error when using german specific letters selva221724/pypostalwin#5

Open
