Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

update world_bank scraper #349

Merged
merged 3 commits into from
Sep 27, 2024
Merged

update world_bank scraper #349

merged 3 commits into from
Sep 27, 2024

Conversation

cjyetman
Copy link
Collaborator

@cjyetman cjyetman commented Sep 9, 2023

Caution: causes many changes/moves in the data CSV

@etiennebacher

This comment was marked as off-topic.

@cjyetman
Copy link
Collaborator Author

cjyetman commented Sep 9, 2023

Hi, jumping in here to point out that most of the changes are due to the fact that the dataset is now sorted by wb and not by country anymore. Isn't sorting by country again an easy way to reduce the number of changes here? Sorry if I missed sth obvious about this

Yes, but my goal here was to update the scraper with minimal changes, not to update the data (which is typically done all together in a separate process). Just pointing it out for @vincentarelbundock's awareness.

@NilsEnevoldsen
Copy link
Contributor

It does make it hard to see what, if anything, changed as a result.

@vincentarelbundock
Copy link
Owner

Thanks all. I added one line of code with an arrange to make the diff easier to see. The changes look pretty minimal.

You can merge whenever you want.

Croatia,HRV
Cuba,CUB
Curaçao,CUW
Cyprus,CYP
Czech Republic,CZE
Côte d’Ivoire,CIV

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want curly ' ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is probably not proper, right? But that's what it is in the original data. Should I make the scraper "fix" it?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think so. My gut feeling is that converting to nice curly should be the typesetter's job, not a data-level thing.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔

packageVersion("countrycode")
#> [1] '1.5.0'
countrycode::countrycode("CIV", "iso3c", "country.name")
#> [1] "Côte d’Ivoire"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's hard to see the difference with GitHub's formatting, but both ASCII single quote and UTF curly quote work, and UTF curly quote is what's currently in countrycode::codelist, so I'm assuming leaving it as curly quote is ok, maybe even ideal.

library(countrycode)
packageVersion("countrycode")
#> [1] '1.6.0'

as.hexmode(utf8ToInt("'"))
#> [1] "27"
as.hexmode(utf8ToInt(""))
#> [1] "2019"

countrycode("Côte d\U27Ivoire", "country.name", "country.name")
#> [1] "Côte d’Ivoire"
countrycode("Côte d\U2019Ivoire", "country.name", "country.name")
#> [1] "Côte d’Ivoire"

stringi::stri_escape_unicode(countrycode("CI", "iso2c", "country.name"))
#> [1] "C\\u00f4te d\\u2019Ivoire"

@cjyetman
Copy link
Collaborator Author

cjyetman commented Sep 9, 2023

@vincentarelbundock I was under the impression that all of these "getter" scripts got run together in a Docker during some part of the process you do, so I included the changed CSV just to facilitate reviewing the consequences of the changes I made, but with the intention of removing the changed CSV before merging. Looking a bit deeper at things now, it looks like maybe that doesn't happen anymore or the process has changed? Should I included the modified CSV as well, if it changes, when making a change to a getter script?

remove no longer relevant comment
@vincentarelbundock
Copy link
Owner

@vincentarelbundock I was under the impression that all of these "getter" scripts got run together in a Docker during some part of the process you do, so I included the changed CSV just to facilitate reviewing the consequences of the changes I made, but with the intention of removing the changed CSV before merging. Looking a bit deeper at things now, it looks like maybe that doesn't happen anymore or the process has changed? Should I included the modified CSV as well, if it changes, when making a change to a getter script?

Yeah, it's been a while, but if I remember correctly, my previous setup was ridiculously over-(and badly-) engineered. So I simplified everything. get_*() saves a CSV file that we keep in the repo. Then, build.R reads and merges all the CSV files.

I always run it on my local machine; never on Docker. Maybe I should do that, but things have worked ok thus far...

@cjyetman cjyetman marked this pull request as ready for review September 27, 2024 16:55
@cjyetman cjyetman merged commit ecf0013 into main Sep 27, 2024
6 checks passed
@cjyetman cjyetman deleted the update-world_bank_scraper branch September 27, 2024 16:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants