[Embench] How to measure code size fairly

Jon Taylor Jon.Taylor at arm.com
Thu Aug 29 09:16:29 CEST 2019

Thanks Jeremy.

Firstly, my opinion is that any code whose size or performance we measure needs to be functional. If an algorithm requires a lot of maths library code (as cubic does), there is a benefit to having an optimised library available, and that should be reflected in the benchmark score. This could also include allowing a library optimised for a processor with custom instruction extensions. I'm really not sure what measuring something that can't be executed tells us: for example, "cubic" is about 1k of code with dummy libraries, but ~9k with libraries (Arm GCC, building at -O2). We wouldn't measure the runtime without libraries, so why would measuring the size without libraries be considered valid?

Having said that, I think it likely that, particularly for benchmarks run on actual hardware, use of printf will be desirable for recording the runtime (e.g. via a UART, trace port or other mechanism). Measuring the size of the printf library is not helpful, though, because it is effectively there only for debug, not for functional purposes. Comparing code with and without printf, the print library adds ~20k to Arm code size and ~60k to RISC-V; when many of the tests are only a KB or two in size, this massively distorts the results. Having an empty test allows this overhead to be discarded, since the printf would be in common code and thus compiled into the empty test too.

I'm not sure I understand the point about needing a different dummy for each benchmark. My expectation is that a test consists of:
<test initialisation>
<start timer>
<test>
<stop timer>
<possible cleanup code>

We want to discount everything that is not <test>, and an empty test would achieve this (assuming that we are happy counting library code that is required by the benchmarks). Everything outside of <test> should be common code across all of the tests, so only a single dummy is needed. I do think we need to allow for LTO being used, as it can offer some significant size and performance benefits, but we should investigate whether it distorts the results significantly.
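
To make the subtraction concrete, here is a minimal Python sketch of the empty-test idea. The function name, the benchmark names other than cubic, and all byte counts are hypothetical and chosen only for illustration; they are not measurements and not part of the Embench scripts.

```python
def net_sizes(raw_sizes, empty_size):
    """Subtract the linked size of the empty test from each benchmark.

    raw_sizes:  dict mapping benchmark name -> linked size in bytes
    empty_size: linked size in bytes of the same harness with no <test>
    """
    net = {}
    for name, size in raw_sizes.items():
        if size < empty_size:
            # A benchmark smaller than the empty test suggests the two
            # were not built with the same common code or options.
            raise ValueError(f"{name} is smaller than the empty test")
        net[name] = size - empty_size
    return net


# Hypothetical figures: the empty test carries startup code and printf.
EMPTY_TEST_BYTES = 24_000
raw = {"cubic": 33_000, "crc32": 25_200}

for name, size in net_sizes(raw, EMPTY_TEST_BYTES).items():
    print(f"{name}: {size} bytes")
```

Library code that only the benchmark itself pulls in stays in the net figure, which is the behaviour argued for above; only the common code shared with the empty test is discounted.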

Kind regards,


> -----Original Message-----
> From: Embench <embench-bounces at lists.librecores.org> On Behalf Of
> Jeremy Bennett
> Sent: 26 August 2019 19:36
> To: embench at lists.librecores.org
> Subject: [Embench] How to measure code size fairly
> Hi all,
> Jon Taylor from ARM has posed some useful questions about how Embench
> measures code size. This is a new thread to get input from the community.
> I think we can do better, and would welcome advice on improved
> approaches.
> Background
> ----------
> At present, the scripts measure size by building a benchmark with dummy
> libraries and dummy startup code. This minimizes the impact of such code on
> the measurement. Since libraries are not typically rebuilt with the same
> compiler options, they can provide a constant bias on each benchmark
> measurement.
> This is particularly the case with the relatively small benchmarks we have in
> Embench. We can see this if we compare ARM and RISC-V benchmarks out of
> the box. Most of the time ARM appears to be much larger, but this is
> because its startup code is much more general purpose than RISC-V, and
> adds 4Kbyte to the code size. Strip this out and ARM code comes out
> generally somewhat smaller than RISC-V.
> Conversely in the few benchmarks that have floating point calculations, ARM
> does very well, due to its hand-optimized floating point library.
> By using dummy startup code and libraries, we can remove this bias.
> However...
> The programs will not then execute, so there is no guarantee that the
> compiler has generated correct code. There is also much greater potential for
> global inter-procedural optimization (LTO) than would be the case with real
> libraries.
> I refer to this current approach as "Option 0". Here are some other options
> which might be better.
> Option 1: Just accept the bias
> ------------------------------
> We could just accept that the bias is there, and use size as measured.
> This option relies on very few assumptions about the target and tools.
> The problem with this is that, with small programs, the bias is substantial and
> we lose a lot of insight. Instead of being able to see which architecture and
> compiler features are beneficial, we just measure start-up code and library
> design for the architecture.
> Option 2: Have a dummy benchmark with no code to subtract
> ---------------------------------------------------------
> This would give us a good result, but with garbage collection of sections,
> modern tool chains only link in the code they actually use.
> So we would need a different dummy for each benchmark, potentially quite
> complex to construct. This gets even harder with LTO, potentially moving
> code in and out of libraries.
> This option starts to require more assumptions about the target and tools.
> Option 3: Just count the size of the object files before linking
> ----------------------------------------------------------------
> This is relatively straightforward to do.  The problem is that it precludes any
> benchmarking of link time optimizations such as global-interprocedural
> optimization (LTO). Given the importance of such techniques, this
> significantly reduces the value of Embench to the compiler community.
> This option makes relatively few assumptions about the target architecture
> and tools.
> Option 4: Subtract the size of the startup and library code
> -----------------------------------------------------------
> We can look at the compiled binary and subtract any code/data associated
> with libraries and startup.
> This would be compatible with link time optimizations, although with a
> measurement error if such optimizations migrate benchmark code to/from
> library code.
> This option makes assumptions about code and data layout, for example that
> a function starts at its label and ends at the label with the next highest
> address.
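
As an illustration of the layout assumption in Option 4, a minimal Python sketch might look like the following. It assumes the label names and addresses have already been extracted from the linked binary by some other means. The symbol names, the addresses and the set of excluded symbols are all hypothetical; a real script would also have to handle aliases (two labels at one address get sizes of zero and the full gap) and padding between functions.

```python
def sizes_from_labels(labels, section_end):
    """Estimate each symbol's size as the gap to the next higher label.

    labels:      iterable of (name, address) pairs from one section
    section_end: address one past the last byte of that section
    """
    ordered = sorted(labels, key=lambda pair: pair[1])
    sizes = {}
    for index, (name, address) in enumerate(ordered):
        if index + 1 < len(ordered):
            next_address = ordered[index + 1][1]
        else:
            next_address = section_end
        sizes[name] = next_address - address
    return sizes


def benchmark_size(labels, section_end, excluded):
    """Total size of all symbols not listed as startup or library code."""
    sizes = sizes_from_labels(labels, section_end)
    return sum(size for name, size in sizes.items() if name not in excluded)


# Hypothetical layout of a small text section.
labels = [
    ("_start", 0x000),       # startup code
    ("main", 0x040),         # benchmark code
    ("solve_cubic", 0x100),  # benchmark code (made-up name)
    ("__adddf3", 0x400),     # soft-float library routine
]
excluded = {"_start", "__adddf3"}

print(benchmark_size(labels, section_end=0x1000, excluded=excluded))
```

If link time optimization inlines library code into a benchmark function, or the reverse, that code is attributed to whichever label it ends up under, which is the measurement error mentioned above.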
> Option 5: Link but measure only benchmark code
> ----------------------------------------------
> This is a combination of options 3 and 4. We look at the benchmark code pre-
> linking to determine the symbols used in the benchmark code and data.  We
> then link and only count the size of the symbols from the benchmark code.
> Also potentially vulnerable to error with link time optimizations, and makes
> all the same assumptions as options 3 and 4.
> Option 6: Statistically eliminate the bias
> ------------------------------------------
> This uses the current option 0 and option 1, to provide a per benchmark
> estimate of startup and library code size. This still actually includes dummy
> code size, but potentially option 4 could be used to estimate this.
> This makes relatively few assumptions about target and tools (at least
> without option 4), but might be hard to explain to people.
> Feedback very welcome.
> Thanks,
> Jeremy
> --
> Tel: +44 (1590) 610184
> Cell: +44 (7970) 676050
> SkypeID: jeremybennett
> Twitter: @jeremypbennett
> Email: jeremy.bennett at embecosm.com
> Web: www.embecosm.com
> PGP key: 1024D/BEF58172FB4754E1 2009-03-20
> --
> Embench mailing list
> Embench at lists.librecores.org
> https://lists.librecores.org/listinfo/embench
