WARN invalid UTF-8 at transliterate (transliterate.c:790) errno: Resource temporarily unavailable #101

Closed
johnhamelink opened this issue Aug 13, 2016 · 3 comments

Comments

@johnhamelink

Hi there,

I'm working on an Elixir NIF for libpostal (mainly just to learn how to build NIFs to be honest). When I retrieve the binary string data from the Erlang VM and copy it into a signed char, I pass it through to libpostal to parse/expand the address input. It seems to work perfectly around 20% of the time, and all the other times I instead get the following response:

WARN  invalid UTF-8
   at transliterate (transliterate.c:790) errno: Resource temporarily unavailable

I would've assumed that the problem was in my code (it probably still is) but the errno: Resource temporarily unavailable as well as the fact that /sometimes/ it does work has thrown me off...

Would you be able to provide any insight?

You can check the code out here: https://github.com/johnhamelink/postie

@albarrentine
Contributor

Hi John - from my hazy recollection of Erlang, strings are represented as linked lists, and there's a more efficient type called a binary, which is a pointer to a character array plus its size, similar to strings in C++, Python, etc. That's the type you're using (bravo). I'll assume that the original string is already UTF-8 encoded (if not, that's what libpostal expects, so you should check/ensure its encoding on the way in).

The problem, I would guess, is that the Erlang binary is not NUL-terminated ('\0' at the end), and that's how C expects strings to be represented. Most string operations in C, including those used in libpostal, will start at the pointer address and continue to read bytes from memory until a zero is encountered. When it occasionally works, that means there happened to be a zero somewhere beyond the boundaries of your string and the intervening "garbage memory" happened to be valid UTF-8. The simplest case would be if there was a zero in memory immediately after the last byte of your string, in which case it would behave like a normal C string.
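To make the failure mode concrete, here is a small sketch that simulates it safely inside one fixed buffer, rather than actually reading past an allocation (which would be undefined behavior). The buffer contents and the names `memory`, `logical_len`, and `length_as_c_sees_it` are made up for illustration; they are not from postie or libpostal.

```c
#include <stddef.h>
#include <string.h>

/* Simulated memory: the 8 logical bytes of "100 Main", followed by
 * stand-in "garbage" (0xC3 0x28 is not a valid UTF-8 sequence, then 'x')
 * and a zero byte that just happens to be there. The rest of the array
 * is zero-initialized. */
static const char memory[16] = {
    '1', '0', '0', ' ', 'M', 'a', 'i', 'n',
    (char)0xC3, (char)0x28, 'x', '\0'
};

/* The length the binary itself reports (hypothetical value). */
static const size_t logical_len = 8;

/* strlen ignores logical_len: it scans until the first zero byte. */
size_t length_as_c_sees_it(void) {
    return strlen(memory);
}
```

Here `strlen` reports 11 bytes rather than 8, so any C code that trusts the terminator would also consume the three trailing bytes, including the invalid UTF-8 pair.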

So you'll want to create a NUL-terminated C string from the Erlang binary before passing it to libpostal. Haven't tested this, but something like changing https://github.com/johnhamelink/postie/blob/master/src/postie.c#L78 to char *address = strndup((const char *)in_binary.data, in_binary.size); should do the trick (the cast is there because the binary's data field is an unsigned char *). Note that strndup is a caller-frees function, so you'll also need to call free(address) somewhere after the libpostal call to release the allocated memory.
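For reference, a minimal sketch of the same idea written out by hand with malloc and memcpy. The `byte_slice` struct is a hypothetical stand-in for the size and data fields of the Erlang binary so that the snippet compiles without the NIF headers, and `slice_to_cstring` is an illustrative name, not part of postie or libpostal.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a length-delimited binary: a byte pointer
 * plus a size, with no terminator guaranteed. */
typedef struct {
    size_t size;
    unsigned char *data;
} byte_slice;

/* Copy the slice into a freshly allocated, NUL-terminated C string.
 * Returns NULL if allocation fails. The caller must free() the result. */
char *slice_to_cstring(const byte_slice *bin) {
    char *out = malloc(bin->size + 1);   /* +1 for the terminator */
    if (out == NULL) {
        return NULL;
    }
    if (bin->size > 0) {
        memcpy(out, bin->data, bin->size);
    }
    out[bin->size] = '\0';               /* explicit terminator */
    return out;
}
```

Unlike strndup, this copies exactly `size` bytes even if the data contains an embedded zero, although C string functions will still stop at that zero. Either way the copy must be freed once the libpostal call has returned.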

johnhamelink added a commit to johnhamelink/postie that referenced this issue Aug 13, 2016
@johnhamelink
Author

@thatdatabaseguy Thank you for such a clear and definitive explanation! Adding that line did the trick. Because no extra random data makes its way into the libpostal call anymore, the responses have become much less erratic, which also makes my unit tests work better.

I will keep working on it, and then perhaps I can submit a PR to add postie to your list of unofficial libs?

@albarrentine
Contributor

No problem, and yes, happy to accept pull requests!

xiamx added a commit to SweetIQ/expostal that referenced this issue May 30, 2017
albarrentine added a commit that referenced this issue Dec 18, 2017
…resses like "100 Main" with "100 S Main St." or units like "Apt 101" vs. "#101".  Instead of expanding the phrase abbreviations, this version tries its best to delete all but the root words in a string for a specific component. It's probably not perfect, but does handle a number of edge cases related to pre/post directionals in English e.g. "E St" will have a root word of simply "E", "Avenue E" => "E", etc. Also handles a variety of cases where the phrase could be a thoroughfare type but is really a root word such as "Park Pl" or the famous "Avenue Rd". This can be used for near dupe hashing to catch possible dupes for later analysis. Note that it will normalize "St Marks Pl" and "St Marks Ave" to the same thing, which is sometimes warranted (if the user typed the wrong thoroughfare), but can also be reconciled at deduping time.