
Parallel dist comp #49

Merged (2 commits) on May 4, 2024
Conversation

@maxmekiska (Contributor) commented May 1, 2024

The following aims to enhance the performance of the private method _compute_score_distribution, mirroring the logic applied in _bootstrap.

Now, distfit will introduce an additional optional argument, n_jobs_dist, enabling users to fit distribution candidates in parallel. This suggestion aims to reduce computational time, which is particularly beneficial when fitting numerous candidate distributions to data.
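The idea behind n_jobs_dist can be sketched with the standard library alone (illustrative only: threads stand in for distfit's joblib workers, and the candidate fitters and function names below are simplified stand-ins, not distfit's API):

```python
import math
import statistics
from concurrent.futures import ThreadPoolExecutor

def fit_normal(data):
    # Method-of-moments fit; score = negative log-likelihood (lower is better).
    mu, sigma = statistics.fmean(data), statistics.pstdev(data)
    return "norm", sum(
        0.5 * math.log(2 * math.pi * sigma ** 2) + (x - mu) ** 2 / (2 * sigma ** 2)
        for x in data
    )

def fit_uniform(data):
    # Uniform over the observed range; constant density 1 / (hi - lo).
    lo, hi = min(data), max(data)
    return "uniform", len(data) * math.log(hi - lo)

def fit_candidates(data, n_jobs_dist=2):
    # Each candidate distribution is scored by its own worker, mirroring
    # the n_jobs_dist idea: candidates are independent, so they can be
    # fitted concurrently.
    with ThreadPoolExecutor(max_workers=n_jobs_dist) as pool:
        results = list(pool.map(lambda f: f(data), [fit_normal, fit_uniform]))
    return min(results, key=lambda r: r[1])  # best (lowest) score wins
```

Because each candidate fit is independent of the others, the work is embarrassingly parallel and the wall-clock time shrinks roughly with the number of workers, up to the number of candidates.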

The following experiments have been conducted:

benchmarking process setup

import numpy as np
from distfit import distfit

X = np.random.normal(163, 10, 10000)


def run_prog_bootstrap(n_jobs: int, n_jobs_dist: int):
    dfit = distfit(distr='popular', n_boots=100, n_jobs=n_jobs, n_jobs_dist=n_jobs_dist)

    results = dfit.fit_transform(X)

    return results

def run_prog_raw(n_jobs_dist: int):
    dfit = distfit(distr='popular', n_jobs_dist=n_jobs_dist)

    results = dfit.fit_transform(X)

    return results

experiments - with bootstrap

%timeit run_prog_bootstrap(n_jobs=4, n_jobs_dist=4)

result: 2min 12s ± 11.2 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit run_prog_bootstrap(n_jobs=7, n_jobs_dist=1)

result: 2min ± 11.6 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit run_prog_bootstrap(n_jobs=1, n_jobs_dist=1)

result: 4min 44s ± 25.9 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

experiments - without bootstrap

%timeit run_prog_raw(n_jobs_dist=1)

result: 2.57 s ± 114 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit run_prog_raw(n_jobs_dist=8)

result: 1.63 s ± 434 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
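For reference, the speedups implied by the mean timings above can be computed directly:

```python
def speedup(serial_s, parallel_s):
    """Ratio of serial to parallel wall-clock time."""
    return serial_s / parallel_s

# With bootstrap: 4min 44s (284 s) serial vs 2min 12s (132 s)
# with n_jobs=4, n_jobs_dist=4.
print(round(speedup(284, 132), 2))   # ~2.15x

# Without bootstrap: 2.57 s vs 1.63 s with n_jobs_dist=8.
print(round(speedup(2.57, 1.63), 2))  # ~1.58x
```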

@maxmekiska (Contributor, Author)

Hi @erdogant. I just noticed the very similar suggestion in PR #35 from a year ago. I believe a potential issue there was the handling/processing of the results returned by Parallel. The changes proposed in this PR might overcome the error you were facing; the initial performance tests did not throw any errors. It might be interesting to test this again on your side as well. Another potential issue could have been the following line in PR #35:

Parallel(n_jobs=cpu_count())

Setting n_jobs to the maximum CPU count might be problematic since _bootstrap shares the same compute resources. Hence, I recommend the n_jobs_dist argument, which allows allocating compute resources to _bootstrap and to _fit_distribution individually. Lastly, _bootstrap is nested within _fit_distribution, which requires careful compute resource allocation.
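The nesting described above can be sketched as follows (illustrative only: threads stand in for joblib workers, and the function names are hypothetical, not distfit's internals):

```python
from concurrent.futures import ThreadPoolExecutor

def bootstrap_scores(dist_name, n_boots, n_jobs):
    # Inner level: bootstrap resamples for one candidate distribution.
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        return list(pool.map(lambda b: f"{dist_name}-boot{b}", range(n_boots)))

def fit_all(candidates, n_boots, n_jobs=2, n_jobs_dist=2):
    # Outer level: candidate distributions fitted in parallel. Because the
    # bootstrap runs inside each candidate fit, the total worker count is
    # roughly n_jobs_dist * n_jobs, so the two budgets must be chosen
    # together to avoid oversubscribing the CPU.
    with ThreadPoolExecutor(max_workers=n_jobs_dist) as pool:
        scores = pool.map(
            lambda d: bootstrap_scores(d, n_boots, n_jobs), candidates
        )
        return dict(zip(candidates, scores))
```

This is why a blanket Parallel(n_jobs=cpu_count()) at the outer level is risky: each outer worker would spawn its own inner pool on top of an already saturated machine.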

@erdogant erdogant merged commit ea3b168 into erdogant:master May 4, 2024
@erdogant (Owner) commented May 4, 2024

Thank you for your contribution on this matter! Not sure where my previous comment has gone, but let me write it again. I did run some experiments, but one test seems to fail; I will look into this. I find it confusing to have two n_jobs arguments, so I will regulate this internally. From your timeit statistics, the bootstrap seems to benefit the most. I will run some more tests, but likely when bootstrap is used, most of the jobs would go to that part.

@erdogant (Owner) commented May 4, 2024

No errors on my side anymore. I updated the initialization with only n_jobs. I described over here how effective the parallelization is for the bootstrap and the general part. During initialization, I simply divide n_jobs between n_jobs and n_jobs_dist based on n_boots.
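One possible split rule, purely as an illustrative sketch of the idea described above (the actual rule implemented in distfit may differ; split_jobs is a hypothetical name):

```python
def split_jobs(n_jobs, n_boots):
    """Divide a single n_jobs budget between the bootstrap and the
    distribution fitting, based on whether bootstrapping is active.

    Returns (n_jobs for _bootstrap, n_jobs_dist for fitting).
    """
    if n_boots > 0:
        # The timeit statistics above suggest the bootstrap benefits most
        # from parallelization, so it receives the whole budget.
        return n_jobs, 1
    # No bootstrap: spend the budget on fitting the candidates themselves.
    return 1, n_jobs
```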

@erdogant (Owner)

I released the new version with your contribution!
https://github.com/erdogant/distfit/releases/tag/1.8.0

Update to the latest version with pip install -U distfit
If it breaks somehow, let me know asap.
