tl;dr: I used Scalene, a profiler we built, and some math to make an example program run 5000x faster.
I am quite interested in Python performance, so naturally I read this article — https://martinheinz.dev/blog/64 — whose title is "Profiling and Analyzing Performance of Python Programs". It presents an example program (from https://docs.python.org/3/library/decimal.html) and shows how to run it with several Python profilers. Unfortunately, it doesn't come away with much actionable information beyond, more or less, "try PyPy", which speeds the code up by about 2x. I wondered if I could get more useful information out of Scalene, a profiler I co-wrote.
We developed Scalene to be a lot more useful than existing Python profilers: it provides line-level information, separates Python time from native time, and profiles memory usage, GPU usage, and even copying costs, all at a line granularity.
Anyway, here’s the result of running Scalene (with just CPU profiling) on the example code. It really cuts to the chase.
% scalene --cpu-only --cli --reduced-profile test/test-martinheinz.py
You can see that practically all the execution time is spent computing the ratio between num and fact, so that’s the only place to focus any optimization effort. The fact that a lot of time is spent running native code means that this line is executing some C library under the covers.

It turns out it’s dividing two Decimals (aka bignums). The underlying bignum library is written in C and is quite fast, but the numbers involved get really huge really fast. For one of the example inputs, the last value of fact is 11,000 digits long! No wonder: doing math on such large numbers is expensive. Let’s see what we can do to make those numbers smaller.
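To get a feel for how quickly these intermediate values blow up, here’s a small illustrative sketch (the specific inputs are mine, not the article’s benchmark values):

```python
import math

# The number of decimal digits in n! grows roughly as n * log10(n / e),
# so a series computed with a bare factorial in the denominator is soon
# doing arithmetic on numbers with thousands of digits.
for n in (100, 1000, 3000):
    digits = len(str(math.factorial(n)))
    print(f"{n}! has {digits} decimal digits")
```

Each multiplication or division on a number this size costs far more than a machine-word operation, which is why the division line dominates the profile.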
I observe that we can compute num / fact not from scratch each time but incrementally: on each loop iteration, we update a variable via a computation on a very small number. To do this, I add a new variable nf that will always equal the ratio num / fact. Then, on each loop iteration, the program updates nf by multiplying it by x / i. You can verify that this maintains the invariant nf == num / fact by observing the following (where _new denotes the updated value of a variable in a given iteration):
nf == num / fact # induction hypothesis
nf_new == nf * (x / i) # we multiply by x/i each time
nf_new == (num / fact) * (x / i) # definition of nf
nf_new == (num * x) / (fact * i) # re-arranging
nf_new == num_new / fact_new # since num_new == num * x and fact_new == fact * i
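The derivation above can also be checked numerically. Here’s a sketch that mirrors the loop structure of the example program and asserts the invariant on every iteration (the input x = Decimal(2) and the iteration count are assumed example values, not from the article):

```python
from decimal import Decimal, getcontext

getcontext().prec = 30
x = Decimal(2)
i, fact, num = 0, 1, 1
nf = Decimal(1)  # invariant: nf == num / fact

for _ in range(50):
    i += 1
    fact *= i            # fact grows to thousands of digits eventually...
    num *= x
    nf *= x / i          # ...but nf is updated with a tiny division only
    # the incrementally maintained ratio matches the from-scratch division
    assert abs(nf - Decimal(num) / Decimal(fact)) < Decimal("1e-25")

print("invariant holds for 50 iterations")
```

The assertion tolerates a small discrepancy because both sides accumulate rounding at the working precision, but the two computations stay in agreement to well within that precision.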
To incorporate this into the original program, only three lines of code need to be changed, each marked with a ### comment:
getcontext().prec += 2
i, lasts, s, fact, num = 0, 0, 1, 1, 1
nf = Decimal(1)      ### was: = num / fact
while s != lasts:
    lasts = s
    i += 1
    fact *= i
    num *= x
    nf *= (x / i)    ### update nf to remain equal to num / fact
    s += nf          ### was: s += num / fact
getcontext().prec -= 2
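For completeness, here’s a self-contained sketch of both versions side by side, modeled on the exp() recipe in the Python decimal documentation that the example program draws on. In the optimized version I drop num and fact entirely, since only their ratio is needed — a further simplification beyond the three-line change above:

```python
from decimal import Decimal, getcontext

def exp_original(x):
    """Taylor series for e**x, dividing two ever-growing bignums each step."""
    getcontext().prec += 2
    i, lasts, s, fact, num = 0, 0, 1, 1, 1
    while s != lasts:
        lasts = s
        i += 1
        fact *= i
        num *= x
        s += num / fact          # divides huge num by huge fact
    getcontext().prec -= 2
    return +s

def exp_optimized(x):
    """Same series, but the ratio num / fact is maintained incrementally."""
    getcontext().prec += 2
    i, lasts, s, nf = 0, 0, 1, Decimal(1)
    while s != lasts:
        lasts = s
        i += 1
        nf *= x / i              # small update instead of a bignum division
        s += nf
    getcontext().prec -= 2
    return +s

getcontext().prec = 50
print(exp_original(Decimal(1)))   # both should agree with e
print(exp_optimized(Decimal(1)))
```

Both functions converge to the same sum; the difference is only in how much arithmetic each iteration performs.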
The result of this change is, er, dramatic. On an Apple Mac Mini M1, the original version:

7.64620098905470488931072765993E+1302
Elapsed time, original (s): 33.231053829193115

and the optimized version:

7.64620098905470488931072766048E+1302
Elapsed time, optimized (s): 0.006501913070678711

That’s a more than 5000x speedup (5096x, to be exact).
The moral of the story is that using a more detailed profiler like Scalene can make optimization efforts far more effective by pinpointing inefficiencies in an actionable way.
— Emery Berger