Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

resolvers: optimize "uniq" iteration #5914

Merged
merged 1 commit into from
Jan 17, 2024

Conversation

oliver-sanders
Copy link
Member

@oliver-sanders oliver-sanders commented Jan 10, 2024

Requires: #5769

In the data store we sometimes need to strip duplicate items from an iterable whilst maintaining iteration order.

If we didn't need to maintain order, we would use sets. This new method is more efficient than the old one for iteration use cases.

Simple performance test:
from random import random, shuffle                                
from time import time                                             
                                                                  
from cylc.flow.network.resolvers import uniq, iter_uniq           
                                                                  
N = 1000                                                          
M = 1000                                                          
                                                                  
_0 = [round(random(), 5) for _ in range(N)]                       
_10 = [*_0[:int(N * 0.9)], *_0[:int(N * 0.1)]]                    
_50 = [*_0[:int(N * 0.5)], *_0[:int(N * 0.5)]]    
                                                                  
shuffle(_0)                                                       
shuffle(_10)                                                      
shuffle(_50)                                                      
                                                                  
def _uniq():                                                      
    for _ in range(M):                                            
        for _ in uniq(_0):                                        
            pass                                                  
                                                                  
                                                                  
def _iter_uniq():                                                 
    for _ in range(M):                                            
        for _ in iter_uniq(_0):                                   
            pass               
                               
                               
def _set():    
    for _ in range(M):    
        yield from set(_0)      
                                                         
                                                         
start = time()                                           
_uniq()                                                  
end = time()    
print(f'{"uniq":10} {end - start}')                      
                                                         
start = time()      
_iter_uniq()                                             
end = time()                                             
print(f'{"iter_uniq":10} {end - start}')      
     
start = time()      
_set()      
end = time()    
print(f'{"set":10} {end - start}')    

The real world impact of this optimisation is probably quite small.

Check List

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
  • Tests are included (or explain why tests are not needed).
  • CHANGES.md entry included if this is a change that can affect users
  • Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
  • If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

@oliver-sanders oliver-sanders added small efficiency For notable efficiency improvements labels Jan 10, 2024
@oliver-sanders oliver-sanders added this to the cylc-8.3.0 milestone Jan 10, 2024
@oliver-sanders oliver-sanders self-assigned this Jan 10, 2024
@oliver-sanders oliver-sanders requested a review from wxtim January 12, 2024 13:00
@wxtim
Copy link
Member

wxtim commented Jan 12, 2024

TL;DR - It profiles quite nicely.
from random import randint
from timeit import timeit
import pandas as pd
import matplotlib.pyplot as pyplot
pyplot.style.use('ggplot')
def old_uniq(iterable):
    ret = []
    for item in iterable:
        if item not in ret:
            ret.append(item)
    return ret
def new_uniq(iterable):
    cache = set()
    for item in iterable:
        if item not in cache:
            cache.add(item)
            yield item

Generate samples

samples = {}
for samplesize in [10, 50, 100, 500, 1000, 5000, 10000, 50000]:
    samples[samplesize] = [randint(1, 10) for i in range(samplesize)]

Run tests

Iterable size

results = {}
for size, samples in samples.items():
    res = {}
    res['before'] = timeit(lambda: old_uniq(samples), number=100)
    res['after'] = timeit(lambda: new_uniq(samples), number=100)
    results[size] = res
pd.DataFrame(results).T
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
before after
10 0.000087 0.000029
50 0.000287 0.000028
100 0.000513 0.000028
500 0.002480 0.000028
1000 0.018029 0.000033
5000 0.044226 0.000033
10000 0.066170 0.000033
50000 0.283870 0.000034
pd.DataFrame(results).T.plot(xlabel='Iterable Size', ylabel='Time(s)', title='Time v. Iterable Size')
<Axes: title={'center': 'Time v. Iterable Size'}, xlabel='Iterable Size', ylabel='Time(s)'>

output_8_1

Selection variability

selectionsize = {}
for size in [10, 50, 100, 500, 1000, 5000, 10000, 50000]:
    selectionsize[size] = [randint(1, size) for i in range(1000)]
results2 = {}
for size, samples in selectionsize.items():
    res = {}
    res['before'] = timeit(lambda: old_uniq(samples), number=100)
    res['after'] = timeit(lambda: new_uniq(samples), number=100)
    results2[size] = res
pd.DataFrame(results2).T
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
before after
10 0.005377 0.000032
50 0.019387 0.000032
100 0.036149 0.000033
500 0.147314 0.000033
1000 0.171415 0.000032
5000 0.264235 0.000033
10000 0.361872 0.000034
50000 0.377165 0.000034
pd.DataFrame(results2).T.plot(xlabel='Iterable Variability', ylabel='Time(s)', title='Time v. Iterable Variability')
<Axes: title={'center': 'Time v. Iterable Variability'}, xlabel='Iterable Variability', ylabel='Time(s)'>

output_13_1

Overall

sizeandvar = {}
for size in [10, 50, 100, 500, 1000, 5000]:
    sizeandvar[size] = [randint(1, size) for i in range(size)]
results3 = {}
for size, samples in sizeandvar.items():
    res = {}
    res['before'] = timeit(lambda: old_uniq(samples), number=100)
    res['after'] = timeit(lambda: new_uniq(samples), number=100)
    results3[size] = res
pd.DataFrame(results3).T
<style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; }
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
</style>
  <td>0.000033</td>
</tr>
before after
10 0.000077 0.000031
50 0.000707 0.000030
100 0.002536 0.000030
500 0.042678 0.000033
1000 0.201779 0.000034
5000 6.123827
pd.DataFrame(results3).T.plot(xlabel='Iterable Size & Variability', ylabel='Time(s)', title='Time v. Iterable Variability')
<Axes: title={'center': 'Time v. Iterable Variability'}, xlabel='Iterable Size & Variability', ylabel='Time(s)'>

output_18_1

@hjoliver
Copy link
Member

BLOCKED.. No merge till ...

Requires: #5769

@hjoliver hjoliver added the BLOCKED This can't happen until something else does label Jan 16, 2024
* Add a more efficient method for stripping duplicate items whilst
  maintaining iteration order.
@oliver-sanders oliver-sanders removed the BLOCKED This can't happen until something else does label Jan 17, 2024
@oliver-sanders
Copy link
Member Author

PR merged

@oliver-sanders
Copy link
Member Author

Unrelated linkcheck failure

@oliver-sanders oliver-sanders merged commit ddc80b6 into cylc:master Jan 17, 2024
26 of 27 checks passed
@oliver-sanders oliver-sanders deleted the iter_uniq branch January 17, 2024 12:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
efficiency For notable efficiency improvements small
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants