-
Notifications
You must be signed in to change notification settings - Fork 85
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
update world_bank scraper #349
Conversation
This comment was marked as off-topic.
This comment was marked as off-topic.
Yes, but my goal here was to update the scraper with minimal changes, not to update the data (which is typically done all together in a separate process). Just pointing it out for @vincentarelbundock's awareness. |
It does make it hard to see what, if anything, changed as a result. |
Thanks all. I added one line of code with an You can merge whenever you want. |
Croatia,HRV | ||
Cuba,CUB | ||
Curaçao,CUW | ||
Cyprus,CYP | ||
Czech Republic,CZE | ||
Côte d’Ivoire,CIV |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want curly ' ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is probably not proper, right? But that's what it is in the original data. Should I make the scraper "fix" it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think so. My gut feeling is that converting to nice curly should be the typesetter's job, not a data-level thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔
packageVersion("countrycode")
#> [1] '1.5.0'
countrycode::countrycode("CIV", "iso3c", "country.name")
#> [1] "Côte d’Ivoire"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's hard to see the difference with GitHub's formatting, but both ASCII single quote and UTF curly quote work, and UTF curly quote is what's currently in countrycode::codelist
, so I'm assuming leaving it as curly quote is ok, maybe even ideal.
library(countrycode)
packageVersion("countrycode")
#> [1] '1.6.0'
as.hexmode(utf8ToInt("'"))
#> [1] "27"
as.hexmode(utf8ToInt("’"))
#> [1] "2019"
countrycode("Côte d\U27Ivoire", "country.name", "country.name")
#> [1] "Côte d’Ivoire"
countrycode("Côte d\U2019Ivoire", "country.name", "country.name")
#> [1] "Côte d’Ivoire"
stringi::stri_escape_unicode(countrycode("CI", "iso2c", "country.name"))
#> [1] "C\\u00f4te d\\u2019Ivoire"
@vincentarelbundock I was under the impression that all of these "getter" scripts got run together in a Docker during some part of the process you do, so I included the changed CSV just to facilitate reviewing the consequences of the changes I made, but with the intention of removing the changed CSV before merging. Looking a bit deeper at things now, it looks like maybe that doesn't happen anymore or the process has changed? Should I included the modified CSV as well, if it changes, when making a change to a getter script? |
remove no longer relevant comment
Yeah, it's been a while, but if I remember correctly, my previous setup was ridiculously over-(and badly-) engineered. So I simplified everything. I always run it on my local machine; never on Docker. Maybe I should do that, but things have worked ok thus far... |
Caution: causes many changes/moves in the data CSV