add new CentralEuropeanStreetNameClassifier #88

Merged
missinglink merged 2 commits into master from central_european_streets
Apr 24, 2020

Conversation

missinglink
Member

@missinglink missinglink commented Apr 17, 2020

adds a new CentralEuropeanStreetNameClassifier which is able to handle the cases mentioned in #83

it's still fairly basic, but relatively safe.

in the future we may consider expanding this to cover:

  • more than one unclassified span before the housenumber
  • the inverted order of 1 xxx instead of xxx 1 (although this might be dangerous?)

closes: #83
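
As a rough illustration of the rule described above, here is a minimal sketch using plain objects rather than the parser's real span and classification classes. The names `classifySection`, `isUnclassifiedAlpha`, and `PUBLIC`, the shape of the `classifications` map, and the choice of which labels count as public are all assumptions made for the example. The first token is marked as a low-confidence street only when the section holds exactly two tokens, the first is an alphabetic word with no public label, and the second is a housenumber.

```javascript
// Hypothetical, simplified model (not the parser's real API):
// each token is { body, classifications }, where classifications
// maps a label to a confidence between 0 and 1.
// Which labels count as "public" is an assumption for this sketch.
const PUBLIC = new Set([
  'housenumber', 'postcode', 'street', 'locality', 'region', 'country'
]);

// True when the token is alphabetic and carries no public label.
function isUnclassifiedAlpha(token) {
  const labels = Object.keys(token.classifications);
  return labels.includes('alpha') && !labels.some((l) => PUBLIC.has(l));
}

function classifySection(tokens) {
  // Only consider sections of exactly two tokens, e.g. "Esplanade 17".
  if (tokens.length !== 2) return tokens;
  const [first, second] = tokens;
  if (!isUnclassifiedAlpha(first)) return tokens;
  if (!('housenumber' in second.classifications)) return tokens;
  // Low confidence, so better-supported solutions still rank higher.
  first.classifications.street = 0.5;
  return tokens;
}

const section = classifySection([
  { body: 'Esplanade', classifications: { alpha: 1.0 } },
  { body: '17', classifications: { numeric: 1.0, housenumber: 1.0 } }
]);
console.log(section[0].classifications);
```

Restricting the rule to two-token sections keeps it conservative; handling several unclassified tokens before the housenumber would need extra care to avoid swallowing locality names.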

@Joxit
Member

Joxit commented Apr 20, 2020

You are using a section classifier and forcing the length to 2, which definitely reduces side effects 👍.

But we should be careful with words and phrases. In your PR the Alpha member must not have a public classification, which is good IMO. But the section is composed of words... And a single word can also be a phrase (#47).
Here the word Paris is classified as Alpha, but the phrase Paris is classified as Locality... Theoretically this would mean that CentralEuropeanStreetNameClassifier should not classify it 😕
It's OK for now because the confidence is low; this is a reminder for me 😅

$ node bin/cli.js Paris 75000, France

master:

================================================================
TOKENIZATION (2ms)
----------------------------------------------------------------
INPUT                           ➜  Paris 75000, France
SECTIONS                        ➜   Paris 75000   0:11    France  12:19 
S0 TOKENS                       ➜   Paris  0:5   7500  6:10 
S1 TOKENS                       ➜   France  13:19 
S0 PHRASES                      ➜   Paris 7500  0:10   Paris  0:5   7500  6:10 
S1 PHRASES                      ➜   France  13:19 

================================================================
CLASSIFICATIONS (4ms)
----------------------------------------------------------------
WORDS
----------------------------------------------------------------
Paris                           ➜   alpha  1.00   start_token  1.00  
75000                           ➜   numeric  1.00   housenumber  0.90   postcode  1.00  
France                          ➜   alpha  1.00   end_token  1.00  

----------------------------------------------------------------
PHRASES
----------------------------------------------------------------
Paris                           ➜   given_name  1.00   surname  1.00   area  1.00   locality  1.00  
France                          ➜   given_name  1.00   surname  1.00   area  1.00   country  0.90  

================================================================
SOLUTIONS (4ms)
----------------------------------------------------------------
(0.96) ➜ [ { locality: 'Paris' },
  { postcode: '75000' },
  { country: 'France' } ]

central_european_streets:

================================================================
TOKENIZATION (2ms)
----------------------------------------------------------------
INPUT                           ➜  Paris 75000, France
SECTIONS                        ➜   Paris 75000   0:11    France  12:19 
S0 TOKENS                       ➜   Paris  0:5   7500  6:10 
S1 TOKENS                       ➜   France  13:19 
S0 PHRASES                      ➜   Paris 75000  0:10   Paris  0:5   7500  6:10 
S1 PHRASES                      ➜   France  13:19 

================================================================
CLASSIFICATIONS (6ms)
----------------------------------------------------------------
WORDS
----------------------------------------------------------------
Paris                           ➜   alpha  1.00   start_token  1.00   street  0.50  
75000                           ➜   numeric  1.00   housenumber  0.90   postcode  1.00  
France                          ➜   alpha  1.00   end_token  1.00  

----------------------------------------------------------------
PHRASES
----------------------------------------------------------------
Paris                           ➜   given_name  1.00   surname  1.00   area  1.00   locality  1.00  
France                          ➜   given_name  1.00   surname  1.00   area  1.00   country  0.90  

================================================================
SOLUTIONS (4ms)
----------------------------------------------------------------
(0.96) ➜ [ { locality: 'Paris' },
  { postcode: '75000' },
  { country: 'France' } ]

(0.79) ➜ [ { street: 'Paris' },
  { postcode: '75000' },
  { country: 'France' } ]

(0.77) ➜ [ { street: 'Paris' },
  { housenumber: '75000' },
  { country: 'France' } ]

@missinglink
Member Author

Yeah agreed, it should ensure that the tokens have no public classifications at all.

@missinglink
Member Author

It's a really tricky case to handle without a gazetteer and/or a geocoder.

There is a street I cycle past quite often called Esplanade, and I'm wondering how we will ever be able to correctly parse those addresses, e.g. Esplanade 17, 13187 Berlin, Germany

@missinglink
Member Author

Maybe we also add a check that the housenumber span doesn't also have a postcode classification.
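
A minimal sketch of what such a check might look like, assuming a simplified token shape where `classifications` maps a label to a confidence. The helper name `isUnambiguousHousenumber` is hypothetical and not part of the parser. The sample values mirror the CLI output quoted earlier in this thread, where `75000` carries both `housenumber` and `postcode`, while `17` carries only `housenumber`.

```javascript
// Hypothetical helper, not the parser's real API.
// Accept a token as a housenumber only if it is not also a postcode.
function isUnambiguousHousenumber(token) {
  const c = token.classifications;
  return 'housenumber' in c && !('postcode' in c);
}

const seventeen = { body: '17', classifications: { numeric: 1.0, housenumber: 1.0 } };
const paris = { body: '75000', classifications: { numeric: 1.0, housenumber: 0.9, postcode: 1.0 } };

console.log(isUnambiguousHousenumber(seventeen)); // true
console.log(isUnambiguousHousenumber(paris));     // false
```

With a guard like this, `Paris 75000` would no longer produce a street classification for `Paris`, at the cost of missing streets whose housenumber happens to look like a postcode.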

@missinglink
Member Author

[Image: IMG_20200423_121013]

@Joxit
Member

Joxit commented Apr 23, 2020

Nice, your PR seems to work for Esplanade too! (It is also a street prefix in French.)

$ node bin/cli.js Esplanade 17, 13187 Berlin, Germany

================================================================
TOKENIZATION (2ms)
----------------------------------------------------------------
INPUT                           ➜  Esplanade 17, 13187 Berlin, Germany
SECTIONS                        ➜   Esplanade 17  0:12    13187 Berlin  13:26    Germany  27:35 
S0 TOKENS                       ➜   Esplanade  0:9   17  10:12 
S1 TOKENS                       ➜   13187  14:19   Berlin  20:26 
S2 TOKENS                       ➜   Germany  28:35 
S0 PHRASES                      ➜   Esplanade 17  0:12   Esplanade  0:9   17  10:12 
S1 PHRASES                      ➜   13187 Berlin  14:26   13187  14:19   Berlin  20:26 
S2 PHRASES                      ➜   Germany  28:35 

================================================================
CLASSIFICATIONS (4ms)
----------------------------------------------------------------
WORDS
----------------------------------------------------------------
Esplanade                       ➜   alpha  1.00   start_token  1.00   street_prefix  1.00   street  0.50  
17                              ➜   numeric  1.00   housenumber  1.00  
13187                           ➜   numeric  1.00   housenumber  0.20   postcode  1.00  
Berlin                          ➜   alpha  1.00  
Germany                         ➜   alpha  1.00   end_token  1.00  

----------------------------------------------------------------
PHRASES
----------------------------------------------------------------
Berlin                          ➜   surname  1.00   area  1.00   locality  1.00   region  1.00  
Germany                         ➜   area  1.00   country  0.90  

================================================================
SOLUTIONS (4ms)
----------------------------------------------------------------
(0.82) ➜ [ { street: 'Esplanade' },
  { housenumber: '17' },
  { postcode: '13187' },
  { locality: 'Berlin' },
  { country: 'Germany' } ]

(0.82) ➜ [ { street: 'Esplanade' },
  { housenumber: '17' },
  { postcode: '13187' },
  { region: 'Berlin' },
  { country: 'Germany' } ]

@missinglink missinglink force-pushed the central_european_streets branch from 39e3d29 to ae0aa7b, April 24, 2020 14:04
@missinglink
Member Author

I just added two more test cases.
I also added some code to check the parent phrases but it caused one test to fail, so I'm thinking we just leave it as-is for now?

Screenshot 2020-04-24 at 16 02 56

Screenshot 2020-04-24 at 16 03 10
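
For reference, a parent-phrase check of the kind mentioned above could be sketched as follows. This is not the code from the PR: the span shape (`start`, `end`, `classifications`), the helper names, and the set of labels treated as public are all assumptions for illustration. A phrase counts as a parent of a word when its character range covers the word's range.

```javascript
// Assumed set of public labels; the real parser may define this differently.
const PUBLIC_LABELS = new Set(['street', 'postcode', 'locality', 'region', 'country']);

function hasPublicLabel(classifications) {
  return Object.keys(classifications).some((label) => PUBLIC_LABELS.has(label));
}

// True when no phrase covering the word's range carries a public label.
function parentPhrasesArePrivate(word, phrases) {
  return phrases
    .filter((p) => p.start <= word.start && p.end >= word.end)
    .every((p) => !hasPublicLabel(p.classifications));
}

// "Paris" is only alpha as a word, but its covering phrase is a locality.
const word = { body: 'Paris', start: 0, end: 5, classifications: { alpha: 1.0 } };
const phrases = [
  { body: 'Paris', start: 0, end: 5, classifications: { area: 1.0, locality: 1.0 } }
];
console.log(parentPhrasesArePrivate(word, phrases)); // false
```

A check like this would reject `Paris 75000`, but it would also reject any street name that happens to match a known place name, which may be why it is risky to enable without more test coverage.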

@missinglink missinglink merged commit 9581567 into master Apr 24, 2020
@missinglink missinglink deleted the central_european_streets branch April 24, 2020 14:54
Successfully merging this pull request may close these issues.

Parsing Czech Republic addresses