Expose random number seed #4

Open
kousu opened this issue May 13, 2018 · 3 comments

kousu commented May 13, 2018

In Stata, it is common practice to start examples with set seed to ensure replicability. But this is currently impossible if your problem involves svmachines, because libsvm does its own random seeding, independent of Stata's.

We need a way to seed libsvm's random number generator so that runs are consistent; we could either derive the seed from the Stata seed, or expose it as a separate optional parameter.

EDIT: StataCorp recommends recording c(rngstate) in your output dataset and, if you need to rerun it, restoring it with set rngstate rather than set seed.

kousu commented May 13, 2018

Hm, I think I remember why we didn't do this in the first place, since it seems like an obvious oversight: the libsvm API

struct svm_problem
{
	int l;
	double *y;
	struct svm_node **x;
};

struct svm_parameter
{
	int svm_type;
	int kernel_type;
	int degree;	/* for poly */
	double gamma;	/* for poly/rbf/sigmoid */
	double coef0;	/* for poly/sigmoid */

	/* these are for training only */
	double cache_size; /* in MB */
	double eps;	/* stopping criteria */
	double C;	/* for C_SVC, EPSILON_SVR and NU_SVR */
	int nr_weight;		/* for C_SVC */
	int *weight_label;	/* for C_SVC */
	double* weight;		/* for C_SVC */
	double nu;	/* for NU_SVC, ONE_CLASS, and NU_SVR */
	double p;	/* for EPSILON_SVR */
	int shrinking;	/* use the shrinking heuristics */
	int probability; /* do probability estimates */
};

struct svm_model *svm_train(const struct svm_problem *prob, const struct svm_parameter *param);

doesn't expose any way to set the random seed.

kousu commented May 13, 2018

I read the source and it seems that it's just using the standard rand(), which means we can control it with srand(). We can read Stata's seed state with c(rngstate), which gives a string of roughly 5,000 characters of state, so we'll have to severely clip the state to fit it into the single unsigned int seed that srand() accepts.
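One way to do the clipping is to hash the whole state string down to 32 bits, so that every character of the state still influences the seed. The following is only a sketch: fold_state_to_seed and seed_rand_from_state are hypothetical names, FNV-1a is an arbitrary choice of hash, and it assumes the plugin has already received the c(rngstate) string from Stata somehow (for example as a plugin argument).

```c
#include <stdint.h>
#include <stdlib.h>

/* Fold an arbitrarily long, NUL-terminated state string into 32 bits
 * using the FNV-1a hash. Any stable hash would do; FNV-1a is just short. */
uint32_t fold_state_to_seed(const char *state)
{
    uint32_t h = 2166136261u;              /* FNV-1a 32-bit offset basis */
    const unsigned char *p;

    for (p = (const unsigned char *)state; *p != '\0'; ++p) {
        h ^= *p;                           /* mix in one byte */
        h *= 16777619u;                    /* FNV-1a 32-bit prime */
    }
    return h;
}

/* Hypothetical helper: seed the C library's rand(), which libsvm uses,
 * from a Stata RNG state string. */
void seed_rand_from_state(const char *state)
{
    srand((unsigned int)fold_state_to_seed(state));
}
```

Calling seed_rand_from_state() with the same string before two training runs should make rand() produce the same sequence in both, as long as nothing else in the process calls rand() or srand() in between.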

kousu commented May 13, 2018

That raises another issue though: do we seed when the library is loaded, or each time we call svmachines? If we seed on every svmachines call, we will clash with any other library that uses rand(), but if we don't, we will miss a reseed. Ideally we would hook every set seed call (and seed at load time, because we can't hook anything before we're called).

Practically, it's probably more important to avoid conflicts than to catch reseeds, because reseeds are bad statistical practice. But that would have to come with huge caveats in the docs.
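To make the trade-off concrete, here is a rough sketch of the "seed once, avoid conflicts" policy. The names seed_once and force_reseed are hypothetical, and the sketch assumes a single-threaded caller; it is not what svmachines currently does.

```c
#include <stdlib.h>

/* Has this library already seeded rand()? */
static int seeded = 0;

/* Seed rand() only on the first call. Later calls leave the generator
 * alone, so other code in the process that uses rand() is not disturbed.
 * Returns 1 if it seeded, 0 if it did nothing. */
int seed_once(unsigned int seed)
{
    if (seeded) {
        return 0;
    }
    srand(seed);
    seeded = 1;
    return 1;
}

/* Explicit, opt-in reseed for callers who accept that this resets
 * rand() for everything else in the process too. */
void force_reseed(unsigned int seed)
{
    srand(seed);
    seeded = 1;
}
```

With this shape, a later set seed in Stata would be silently ignored unless the user asks for force_reseed(), which is exactly the caveat the docs would need to spell out.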
