resolvers: optimize "uniq" iteration #5914

oliver-sanders · 2024-01-10T14:27:40Z

Requires: #5769

In the data store we sometimes need to strip duplicate items from an iterable whilst maintaining iteration order.

If we didn't need to maintain order, we would use sets. This PR adds a new method, iter_uniq, which is more efficient than the existing uniq for iteration use cases.
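
To illustrate the requirement, here is a minimal sketch contrasting a plain set with an order-preserving generator. `iter_uniq_sketch` is a stand-in written for this example; it follows the set-cache pattern shown later in this thread and is not copied from `cylc.flow.network.resolvers`.

```python
def iter_uniq_sketch(iterable):
    """Yield items in first-seen order, skipping duplicates (illustrative)."""
    seen = set()
    for item in iterable:
        if item not in seen:
            seen.add(item)
            yield item


items = ['c', 'a', 'c', 'b', 'a']

# A set removes duplicates but makes no promise about iteration order.
as_set = set(items)

# The generator keeps the order in which items were first seen.
ordered = list(iter_uniq_sketch(items))
print(ordered)  # ['c', 'a', 'b']
```

Because the generator yields items lazily, a caller that only loops over the result never needs an intermediate list.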

Simple performance test:

from random import random, shuffle
from time import time

from cylc.flow.network.resolvers import uniq, iter_uniq

N = 1000  # number of items in each test list
M = 1000  # number of times each list is iterated

# _0: (almost) no duplicates, _10: ~10% duplicates, _50: ~50% duplicates
_0 = [round(random(), 5) for _ in range(N)]
_10 = [*_0[:int(N * 0.9)], *_0[:int(N * 0.1)]]
_50 = [*_0[:int(N * 0.5)], *_0[:int(N * 0.5)]]

shuffle(_0)
shuffle(_10)
shuffle(_50)


# swap _0 for _10 or _50 below to test lists containing duplicates
def _uniq():
    for _ in range(M):
        for _ in uniq(_0):
            pass


def _iter_uniq():
    for _ in range(M):
        for _ in iter_uniq(_0):
            pass


def _set():
    # baseline: unordered de-duplication with a set
    for _ in range(M):
        for _ in set(_0):
            pass


start = time()
_uniq()
end = time()
print(f'{"uniq":10} {end - start}')

start = time()
_iter_uniq()
end = time()
print(f'{"iter_uniq":10} {end - start}')

start = time()
_set()
end = time()
print(f'{"set":10} {end - start}')

The real world impact of this optimisation is probably quite small.

Check List

I have read CONTRIBUTING.md and added my name as a Code Contributor.
Contains logically grouped changes (else tidy your branch by rebase).
Does not contain off-topic changes (use other PRs for other changes).
Applied any dependency changes to both setup.cfg (and conda-environment.yml if present).
Tests are included (or explain why tests are not needed).
CHANGES.md entry included if this is a change that can affect users
Cylc-Doc pull request opened if required at cylc/cylc-doc/pull/XXXX.
If this is a bug fix, PR should be raised against the relevant ?.?.x branch.

wxtim · 2024-01-12T16:02:31Z

TL;DR - It profiles quite nicely.

from random import randint
from timeit import timeit
import pandas as pd
import matplotlib.pyplot as pyplot
pyplot.style.use('ggplot')

def old_uniq(iterable):
    ret = []
    for item in iterable:
        if item not in ret:
            ret.append(item)
    return ret

def new_uniq(iterable):
    cache = set()
    for item in iterable:
        if item not in cache:
            cache.add(item)
            yield item
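
One practical difference between the two functions above is worth a small sketch: the set-based version needs hashable items, whereas the list-based version only relies on equality. The `records` data below is made up for illustration.

```python
def old_uniq(iterable):
    ret = []
    for item in iterable:
        if item not in ret:
            ret.append(item)
    return ret


def new_uniq(iterable):
    cache = set()
    for item in iterable:
        if item not in cache:
            cache.add(item)
            yield item


# Dicts are unhashable, so they cannot be stored in a set.
records = [{'id': 1}, {'id': 2}, {'id': 1}]

# The list-based version compares with ==, so it copes with dicts.
print(old_uniq(records))

# The set-based version raises TypeError once it tries to hash a dict.
error = None
try:
    list(new_uniq(records))
except TypeError as exc:
    error = exc
    print(f'new_uniq failed: {exc}')
```

This suggests the generator is best suited to hashable items such as strings or tuples, with the list-based function kept for everything else.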

Generate samples

samples = {}
for samplesize in [10, 50, 100, 500, 1000, 5000, 10000, 50000]:
    samples[samplesize] = [randint(1, 10) for i in range(samplesize)]

Run tests

Iterable size

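results = {}
for size, sample in samples.items():
    res = {}
    res['before'] = timeit(lambda: old_uniq(sample), number=100)
    # NOTE: new_uniq is a generator function, so this call only creates the
    # generator; the items are not iterated inside the timed lambda.
    res['after'] = timeit(lambda: new_uniq(sample), number=100)
    results[size] = res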

pd.DataFrame(results).T

	before	after
10	0.000087	0.000029
50	0.000287	0.000028
100	0.000513	0.000028
500	0.002480	0.000028
1000	0.018029	0.000033
5000	0.044226	0.000033
10000	0.066170	0.000033
50000	0.283870	0.000034

pd.DataFrame(results).T.plot(xlabel='Iterable Size', ylabel='Time(s)', title='Time v. Iterable Size')

<Axes: title={'center': 'Time v. Iterable Size'}, xlabel='Iterable Size', ylabel='Time(s)'>
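
Since `new_uniq` is a generator function, a timing that also includes the iteration itself needs to consume the generator. The following is a rough sketch of one way to do that; `drain` is a helper introduced here for illustration, and no timings are quoted because they will vary by machine.

```python
from collections import deque
from random import randint
from timeit import timeit


def old_uniq(iterable):
    ret = []
    for item in iterable:
        if item not in ret:
            ret.append(item)
    return ret


def new_uniq(iterable):
    cache = set()
    for item in iterable:
        if item not in cache:
            cache.add(item)
            yield item


def drain(iterator):
    """Consume an iterator fully, discarding its items (hypothetical helper)."""
    deque(iterator, maxlen=0)


sample = [randint(1, 10) for _ in range(1000)]

# Both lambdas now iterate over every de-duplicated item.
before = timeit(lambda: drain(old_uniq(sample)), number=100)
after = timeit(lambda: drain(new_uniq(sample)), number=100)
print(f'before={before:.6f}s after={after:.6f}s')
```

Draining both results through the same helper keeps the comparison like-for-like: each side pays for producing and walking its output.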

Selection variability

selectionsize = {}
for size in [10, 50, 100, 500, 1000, 5000, 10000, 50000]:
    selectionsize[size] = [randint(1, size) for i in range(1000)]

results2 = {}
for size, samples in selectionsize.items():
    res = {}
    res['before'] = timeit(lambda: old_uniq(samples), number=100)
    res['after'] = timeit(lambda: new_uniq(samples), number=100)
    results2[size] = res

pd.DataFrame(results2).T

	before	after
10	0.005377	0.000032
50	0.019387	0.000032
100	0.036149	0.000033
500	0.147314	0.000033
1000	0.171415	0.000032
5000	0.264235	0.000033
10000	0.361872	0.000034
50000	0.377165	0.000034

pd.DataFrame(results2).T.plot(xlabel='Iterable Variability', ylabel='Time(s)', title='Time v. Iterable Variability')

<Axes: title={'center': 'Time v. Iterable Variability'}, xlabel='Iterable Variability', ylabel='Time(s)'>
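
One way to see why the "before" times grow with the number of distinct values is to count equality comparisons rather than seconds. The sketch below assumes a small wrapper class, `Counted`, invented for this example; it is not part of Cylc.

```python
class Counted:
    """Hypothetical wrapper that counts equality comparisons."""

    comparisons = 0

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        Counted.comparisons += 1
        return self.value == other.value

    def __hash__(self):
        return hash(self.value)


def old_uniq(iterable):
    ret = []
    for item in iterable:
        if item not in ret:
            ret.append(item)
    return ret


def new_uniq(iterable):
    cache = set()
    for item in iterable:
        if item not in cache:
            cache.add(item)
            yield item


def count_comparisons(func, values):
    """Return how many __eq__ calls func triggers while de-duplicating."""
    Counted.comparisons = 0
    for _ in func([Counted(v) for v in values]):
        pass
    return Counted.comparisons


n = 100
distinct = list(range(n))  # all unique: the worst case for a list scan

list_based = count_comparisons(old_uniq, distinct)
set_based = count_comparisons(new_uniq, distinct)
print(list_based, set_based)
```

With all-distinct input, the list-based version compares each new item against every unique item seen so far, giving n(n-1)/2 comparisons, while the set-based version looks items up by hash.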

Overall

sizeandvar = {}
for size in [10, 50, 100, 500, 1000, 5000]:
    sizeandvar[size] = [randint(1, size) for i in range(size)]

results3 = {}
for size, samples in sizeandvar.items():
    res = {}
    res['before'] = timeit(lambda: old_uniq(samples), number=100)
    res['after'] = timeit(lambda: new_uniq(samples), number=100)
    results3[size] = res

pd.DataFrame(results3).T

	before	after
10	0.000077	0.000031
50	0.000707	0.000030
100	0.002536	0.000030
500	0.042678	0.000033
1000	0.201779	0.000034
5000	6.123827	0.000033

pd.DataFrame(results3).T.plot(xlabel='Iterable Size & Variability', ylabel='Time(s)', title='Time v. Iterable Variability')

<Axes: title={'center': 'Time v. Iterable Variability'}, xlabel='Iterable Size & Variability', ylabel='Time(s)'>

hjoliver · 2024-01-16T01:21:43Z

BLOCKED.. No merge till ...

Requires: #5769

* Add a more efficient method for stripping duplicate items whilst maintaining iteration order.

oliver-sanders · 2024-01-17T11:29:40Z

PR merged

oliver-sanders · 2024-01-17T12:38:14Z

Unrelated linkcheck failure

oliver-sanders added the "small" and "efficiency" (For notable efficiency improvements) labels Jan 10, 2024

oliver-sanders added this to the cylc-8.3.0 milestone Jan 10, 2024

oliver-sanders self-assigned this Jan 10, 2024

oliver-sanders requested a review from wxtim January 12, 2024 13:00

wxtim approved these changes Jan 12, 2024

hjoliver approved these changes Jan 16, 2024

hjoliver added the "BLOCKED" (This can't happen until something else does) label Jan 16, 2024

resolvers: optimize "uniq" iteration

d72f460

* Add a more efficient method for stripping duplicate items whilst maintaining iteration order.

oliver-sanders removed the "BLOCKED" (This can't happen until something else does) label Jan 17, 2024

oliver-sanders force-pushed the iter_uniq branch from 3e94637 to d72f460 January 17, 2024 11:29

oliver-sanders merged commit ddc80b6 into cylc:master Jan 17, 2024
26 of 27 checks passed

oliver-sanders deleted the iter_uniq branch January 17, 2024 12:43
