Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added feature infos to JSON dump #2660

Merged
merged 14 commits into from
Feb 20, 2020
Merged

added feature infos to JSON dump #2660

merged 14 commits into from
Feb 20, 2020

Conversation

StrikerRUS
Copy link
Collaborator

Part of #2604.

Not sure about the implementation, maybe it's better to have something like feature_infos_json_?

@guolinke
Copy link
Collaborator

guolinke commented Jan 3, 2020

if it is only used one-time, I think you can write it directly, without a new function.

@StrikerRUS
Copy link
Collaborator Author

StrikerRUS commented Jan 3, 2020

@guolinke Sorry, not sure that fully understood you. Do you mean to write code of bin_info() and feature_infos() with JSON-specific syntax in GBDT::DumpModel?

UPD: BTW, seems that feature_infos_ (feature_infos()) is also used only once during saving a model.

@guolinke
Copy link
Collaborator

guolinke commented Jan 3, 2020

sorry, i misread it. Is feature_infos_json_ is only used one time, i think current implementation is okay.

@StrikerRUS
Copy link
Collaborator Author

@guolinke OK. One more thing I can suggest is to rename bin_info() to ToString(). Write new function in bin ToJSON(). Then add function bin_mappers() in dataset which will return std::vector<BinMapper>. It will allow to rewrite feature_infos() in the following manner:

...
bufs.push_back(bin_mapper->ToString());
...

and add feature_infos_json() with the following line:

...
bufs.push_back(bin_mapper->ToJSON());
...

This will be consistent with the rest codebase.

@guolinke
Copy link
Collaborator

guolinke commented Jan 6, 2020

@StrikerRUS yeah, it makes sense.
However, maybe ToString is not a good name for bin_info? I remember bin_info contains the value range for numerical features, and distinct values for categorical features. It is not the string value of BinMapper class.

@StrikerRUS
Copy link
Collaborator Author

@guolinke

However, maybe ToString is not a good name for bin_info?

Agree! WDYT will be better names instead of ToString() and ToJSON()? Maybe bin_info_string() and bin_info_json()?

@StrikerRUS
Copy link
Collaborator Author

@guolinke Can you please help with bin_mappers() function when have time?

@guolinke
Copy link
Collaborator

@guolinke
Copy link
Collaborator

maybe use a function like std::vector<std::string> BinMapperApply(func) ?
and pass the bin_info_string() and bin_info_json() as arguements.

@StrikerRUS
Copy link
Collaborator Author

@guolinke

did you mean
https://github.com/microsoft/LightGBM/pull/2660/files#diff-0ce9f6b0c2b65569d0099b591a4617c3R585-R596

Yes, exactly!

maybe use a function like std::vectorstd::string BinMapperApply(func) ?
and pass the bin_info_string() and bin_info_json() as arguements.

Great idea (if I understand correctly 😃 )! To be honest, I have no practice in writing cpp code, so I'll be happy to see any decision you think fits here better.

@guolinke
Copy link
Collaborator

guolinke commented Feb 9, 2020

@StrikerRUS what is the status of this PR?

@StrikerRUS
Copy link
Collaborator Author

@guolinke If you have some time, please implement your idea. Otherwise, I think it's better to close this PR as I'm not sure I'll be able to do it on my own.

@guolinke
Copy link
Collaborator

okay, I will implement it when I have time.

@guolinke
Copy link
Collaborator

@StrikerRUS when I implement it, I find the template is a better solution.

src/boosting/gbdt.cpp Outdated Show resolved Hide resolved
str_buf << "\"feature_infos\":" << "{";
auto feature_infos_json_objs = train_data_->feature_infos<true>();
bool first_obj = true;
for (size_t i = 0; i < feature_infos_json_objs.size(); ++i) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems this loop causes the fails, @StrikerRUS could you check it ?

Copy link
Collaborator Author

@StrikerRUS StrikerRUS Feb 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I commented the loop and discovered that the following line causes segfaults:

auto feature_infos_json_objs = train_data_->feature_infos<true>();

UPD: the following (old code for ordinary string format) is also failing with the same error. Maybe a bug was introduced earlier?..

diff --git a/src/boosting/gbdt_model_text.cpp b/src/boosting/gbdt_model_text.cpp
index b1ea0d06..8180bb67 100644
--- a/src/boosting/gbdt_model_text.cpp
+++ b/src/boosting/gbdt_model_text.cpp
@@ -40,18 +40,18 @@ std::string GBDT::DumpModel(int start_iteration, int num_iteration) const {
           << Common::Join(monotone_constraints_, ",") << "]," << '\n';
 
   str_buf << "\"feature_infos\":" << "{";
-  auto feature_infos_json_objs = train_data_->feature_infos<true>();
-  bool first_obj = true;
-  for (size_t i = 0; i < feature_infos_json_objs.size(); ++i) {
-    if (feature_infos_json_objs[i] != "{}") {
-      if (!first_obj) {
-        str_buf << ",";
-      }
-      str_buf << "\"" << feature_names_[i] << "\":";
-      str_buf << feature_infos_json_objs[i];
-      first_obj = false;
-    }
-  }
+  auto feature_infos_json_objs = train_data_->feature_infos<false>();
+  // bool first_obj = true;
+  // for (size_t i = 0; i < feature_infos_json_objs.size(); ++i) {
+  //   if (feature_infos_json_objs[i] != "{}") {
+  //     if (!first_obj) {
+  //       str_buf << ",";
+  //     }
+  //     str_buf << "\"" << feature_names_[i] << "\":";
+  //     str_buf << feature_infos_json_objs[i];
+  //     first_obj = false;
+  //   }
+  // }
   str_buf << "}," << '\n';
 
   str_buf << "\"tree_info\":[";

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we cannot dynamically generate the feature_infos_json_objs, as the model could be loaded from file/string. In this case, both used_feature_map_ and 'feature_bin_mappers_` are lost and cause the error.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@StrikerRUS I propose a new change, you can check it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see... Now feature_infos field is not actually a JSON representation. I think we can get back to my initial implementation then, which parses those strings and constructs valid JSONs.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it would be better

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in the latest commit.

src/boosting/gbdt.cpp Outdated Show resolved Hide resolved
src/boosting/gbdt.cpp Outdated Show resolved Hide resolved
@guolinke guolinke merged commit c4a7ab8 into master Feb 20, 2020
@StrikerRUS StrikerRUS deleted the feature_infos branch February 20, 2020 12:15
@lock lock bot locked as resolved and limited conversation to collaborators Apr 25, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants