[Embench] How to measure code size fairly

Ofer Shinaar Ofer.Shinaar at wdc.com
Thu Aug 29 17:41:30 CEST 2019


Hi Jon,
I want to share my two cents regarding code size.
Measuring code size calls for a different approach than measuring performance on synthetic/non-synthetic benchmarks.
While performance is tested over libraries and applicative code (like CRC, SHA, Fourier transform, etc.), measuring size over those will be largely irrelevant, since embedded FW usually contains more "control code" than "library usage".
For example, FW can use a JPEG encoder that takes 4kB on some target, but the overall code of the program will be 100 or 1000 times bigger.
Today's IoT devices are "fighting" over a few bytes (we call small embedded devices IoT today just because it fits the concept, but we can have big ones as well), so how should we measure code size?

Well, I think that a practical comparison is one of the options. If we spot code that shows a size difference between ARM/RV/x86/other, we can use it as a "test case".
Such code would exercise assorted C functionality (loops, ifs, inlining, etc.).
Of course this will depend heavily on the compiler, but also on the ISA and ABI rules; we have already spotted cases like this internally and we will open source those test cases.

Another approach would be to use "big FW applications" that exercise a lot of varied C functionality, such as an RTOS.
For example, we could examine the size of FreeRTOS built for RV32IMC vs. ARM (Thumb-2). This would be very interesting for small embedded devices that depend on an RTOS, and it could highlight how much better/worse one target is than the other from a size perspective.
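
A minimal sketch of how such a comparison might be scripted - the toolchain names, ELF paths and the choice of counting text + data are assumptions, not anything Embench prescribes:

import subprocess

# Hypothetical ELF images of the same FreeRTOS demo, one per target/toolchain.
IMAGES = {
    "RV32IMC":     ("riscv32-unknown-elf-size", "build/rv32imc/freertos_demo.elf"),
    "ARM Thumb-2": ("arm-none-eabi-size",       "build/thumb2/freertos_demo.elf"),
}

def code_size(size_tool, elf):
    """Return text + data in bytes, parsed from GNU 'size' (Berkeley format)."""
    out = subprocess.run([size_tool, elf], capture_output=True,
                         text=True, check=True)
    text, data, *_ = out.stdout.splitlines()[1].split()
    return int(text) + int(data)

for target, (tool, elf) in IMAGES.items():
    print(f"{target:12s}: {code_size(tool, elf)} bytes (text + data)")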

Thanks,
Ofer 



> -----Original Message-----
> From: Embench [mailto:embench-bounces at lists.librecores.org] On Behalf
> Of Jon Taylor
> Sent: Thursday, August 29, 2019 10:16
> To: embench at lists.librecores.org
> Cc: nd <nd at arm.com>
> Subject: Re: [Embench] How to measure code size fairly
> 
> Thanks Jeremy.
> 
> Firstly, my opinion is that any code we're measuring the size or performance
> of needs to be functional. If an algorithm requires lots of maths library code
> (as "cubic" does), there is a benefit to having an optimised library
> available, and that should be reflected in a benchmark score. This could also
> include allowing a library optimised for a processor with custom instruction
> extensions. I'm really not sure what measuring something that can't be
> executed really tells us - for example "cubic" is about 1k of code with dummy
> libraries, but ~9k with real libraries (Arm GCC, building -O2). We wouldn't
> measure the runtime without libraries, so why would measuring the size
> without libraries be considered valid?
> 
> Having said that, I think it likely (particularly for benchmarks run on actual
> hardware) that use of printf might be desirable for recording the runtime
> (e.g. via a UART, trace port or other mechanism), but measuring the size of
> the printf library is not helpful because it's effectively only for debug, not
> functional, purposes. Comparing code with and without printf, the printf
> library adds ~20k to Arm code size, and ~60k to RISC-V; when many of the
> tests are a kB or two in size, this massively distorts the results. Having an
> empty test allows this to be discarded, since the printf would be in common
> code and thus compiled into the empty test too.
> 
> I'm not sure I understand the point about needing a different dummy for
> each benchmark. My expectation is that a test consists of:
> <bootcode>
> <test initialisation>
> <start timer>
> <test>
> <stop timer>
> <possible cleanup code>
> 
> We want to discount everything that is not <test> - and an empty test would
> achieve this (assuming that we are happy counting library code that is
> required by the benchmarks). Everything outside of <test> should be
> common code across all of the tests, so only a single dummy is needed. I do
> think we need to allow for LTO being used as it can offer some significant size
> and performance benefits, but we should investigate whether it distorts the
> results significantly.
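
(A minimal sketch of that subtraction, assuming each benchmark - plus a hypothetical "empty" one - is already linked into its own ELF image and that the GNU size tool is on the path; the paths and benchmark names are illustrative only:)

import subprocess

def total_size(elf):
    """text + data of a linked image, via GNU 'size' (Berkeley format)."""
    out = subprocess.run(["size", elf], capture_output=True, text=True, check=True)
    text, data, *_ = out.stdout.splitlines()[1].split()
    return int(text) + int(data)

# Hypothetical layout: one linked image per benchmark, plus the empty test.
baseline = total_size("bd/src/empty/empty")
for bench in ("crc32", "cubic", "nettle-aes"):
    delta = total_size(f"bd/src/{bench}/{bench}") - baseline
    print(f"{bench:12s}: {delta} bytes over the empty test")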
> 
> Kind regards,
> 
> Jon
> 
> > -----Original Message-----
> > From: Embench <embench-bounces at lists.librecores.org> On Behalf Of
> > Jeremy Bennett
> > Sent: 26 August 2019 19:36
> > To: embench at lists.librecores.org
> > Subject: [Embench] How to measure code size fairly
> >
> > Hi all,
> >
> > Jon Taylor from ARM has posed some useful questions about how Embench
> > measures code size. This is a new thread to get input from the community.
> >
> > I think we can do better, and would welcome advice on improved
> > approaches.
> >
> > Background
> > ----------
> >
> > At present, the scripts measure size by building a benchmark with dummy
> > libraries and dummy startup code. This minimizes the impact of such code on
> > the measurement. Since libraries are not typically rebuilt with the same
> > compiler options, they can provide a constant bias on each benchmark
> > measurement.
> >
> > This is particularly the case with the relatively small benchmarks we have in
> > Embench. We can see this if we compare ARM and RISC-V benchmarks out of
> > the box. Most of the time ARM appears to be much larger, but this is
> > because its startup code is much more general purpose than RISC-V's, and
> > adds 4 Kbyte to the code size. Strip this out and ARM code comes out
> > generally somewhat smaller than RISC-V.
> >
> > Conversely, in the few benchmarks that have floating point calculations, ARM
> > does very well, due to its hand-optimized floating point library.
> >
> > By using dummy startup code and libraries, we can remove this bias.
> >
> > However...
> >
> > The programs will not then execute, so there is no guarantee that the
> > compiler has generated correct code. There is also much greater potential for
> > global inter-procedural optimization (LTO) than would be the case with real
> > libraries.
> >
> > I refer to this current approach as "Option 0". Here are some other options
> > which might be better.
> >
> > Option 1: Just accept the bias
> > ------------------------------
> >
> > We could just accept that the bias is there, and use size as measured.
> > This option relies on very few assumptions about the target and tools.
> >
> > The problem with this is that, with small programs, the bias is substantial and
> > we lose a lot of insight. Instead of being able to see which architecture and
> > compiler features are beneficial, we just measure start-up code and library
> > design for the architecture.
> >
> > Option 2: Have a dummy benchmark with no code to subtract
> > ---------------------------------------------------------
> >
> > This would give us a good result, but with garbage collection of sections,
> > modern tool chains only link in the code they actually use.
> > So we would need a different dummy for each benchmark, potentially quite
> > complex to construct. This gets even harder with LTO, potentially moving
> > code in and out of libraries.
> >
> > This option starts to require more assumptions about the target and tools.
> >
> > Option 3: Just count the size of the object files before linking
> > ----------------------------------------------------------------
> >
> > This is relatively straightforward to do. The problem is that it precludes any
> > benchmarking of link time optimizations such as global inter-procedural
> > optimization (LTO). Given the importance of such techniques, this
> > significantly reduces the value of Embench to the compiler community.
> >
> > This option makes relatively few assumptions about the target architecture
> > and tools.
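
(Option 3 might look something like the sketch below - summing GNU 'size' output over each benchmark's object files; the directory layout and benchmark names are assumed, not Embench's actual conventions:)

import glob
import subprocess

def object_size(obj):
    """text + data of a single object file, via GNU 'size' (Berkeley format)."""
    out = subprocess.run(["size", obj], capture_output=True, text=True, check=True)
    text, data, *_ = out.stdout.splitlines()[1].split()
    return int(text) + int(data)

# Hypothetical layout: each benchmark's pre-link objects under bd/src/<benchmark>/.
for bench in ("crc32", "cubic", "nettle-aes"):
    objs = glob.glob(f"bd/src/{bench}/*.o")
    print(f"{bench:12s}: {sum(object_size(o) for o in objs)} bytes before linking")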
> >
> > Option 4: Subtract the size of the startup and library code
> > -----------------------------------------------------------
> >
> > We can look at the compiled binary and subtract any code/data associated
> > with libraries and startup.
> >
> > This would be compatible with link time optimizations, although with a
> > measurement error if such optimizations migrate benchmark code to/from
> > library code.
> >
> > This option makes assumptions about code and data layout, for example that
> > a function starts at its label and ends at the label with the next highest
> > address.
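
(A sketch of one way option 4 might be implemented, using 'nm --print-size' on the linked image and subtracting symbols that also appear in the startup/library objects; the file paths are assumptions, and the symbol sizes rely on exactly the layout assumption described above:)

import subprocess

def symbol_sizes(elf):
    """Map symbol name -> size in bytes, from 'nm --print-size' on a linked image."""
    out = subprocess.run(["nm", "--print-size", elf],
                         capture_output=True, text=True, check=True)
    return {p[3]: int(p[1], 16)
            for p in (l.split() for l in out.stdout.splitlines()) if len(p) == 4}

def defined_names(path):
    """Names of symbols defined in an object or archive (skip undefined 'U' entries)."""
    out = subprocess.run(["nm", path], capture_output=True, text=True, check=True)
    return {p[2] for p in (l.split() for l in out.stdout.splitlines())
            if len(p) == 3 and p[1] != "U"}

# Hypothetical inputs: the linked benchmark plus the startup and library code it pulled in.
linked = symbol_sizes("bd/src/cubic/cubic")
support = set()
for extra in ("bd/support/crt0.o", "bd/support/libsupport.a"):
    support |= defined_names(extra)

benchmark_bytes = sum(sz for name, sz in linked.items() if name not in support)
print(f"cubic: {benchmark_bytes} bytes after subtracting startup/library symbols")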
> >
> > Option 5: Link but measure only benchmark code
> > ----------------------------------------------
> >
> > This is a combination of options 3 and 4. We look at the benchmark code
> > pre-linking to determine the symbols used in the benchmark code and data.
> > We then link and only count the size of the symbols from the benchmark code.
> >
> > Also potentially vulnerable to error with link time optimizations, and makes
> > all the same assumptions as options 3 and 4.
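
(And a complementary sketch for option 5: collect the symbols defined in the benchmark's own pre-link objects, then count only those in the linked image - again with assumed paths and the same kind of nm parsing as the option 4 sketch above:)

import glob
import subprocess

def defined_symbols(obj):
    """Names of symbols defined in an object file (skip undefined 'U' entries)."""
    out = subprocess.run(["nm", obj], capture_output=True, text=True, check=True)
    return {p[2] for p in (l.split() for l in out.stdout.splitlines())
            if len(p) == 3 and p[1] != "U"}

def linked_symbol_sizes(elf):
    """Map symbol name -> size in the linked image, via 'nm --print-size'."""
    out = subprocess.run(["nm", "--print-size", elf],
                         capture_output=True, text=True, check=True)
    return {p[3]: int(p[1], 16)
            for p in (l.split() for l in out.stdout.splitlines()) if len(p) == 4}

# Hypothetical layout: pre-link objects and the linked image for one benchmark.
bench_syms = set()
for obj in glob.glob("bd/src/cubic/*.o"):
    bench_syms |= defined_symbols(obj)

sizes = linked_symbol_sizes("bd/src/cubic/cubic")
print("cubic:", sum(sz for name, sz in sizes.items() if name in bench_syms), "bytes")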
> >
> > Option 6: Statistically eliminate the bias
> > ------------------------------------------
> >
> > This uses the current option 0 and option 1 to provide a per-benchmark
> > estimate of startup and library code size. This still actually includes dummy
> > code size, but potentially option 4 could be used to estimate this.
> >
> > This makes relatively few assumptions about target and tools (at least
> > without option 4), but might be hard to explain to people.
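
(A sketch of how the statistical version might work, assuming each benchmark has already been measured both ways; treating the median per-benchmark difference as the common bias is just one possible choice, not something the thread settled on:)

from statistics import median

def debias(real_sizes, dummy_sizes):
    """Estimate and remove the common startup/library bias.

    real_sizes and dummy_sizes map benchmark name -> bytes, measured from
    builds with real libraries/startup (option 1) and dummy ones (option 0).
    """
    # Per-benchmark difference ~= startup + library code that benchmark pulled in.
    per_bench_bias = {b: real_sizes[b] - dummy_sizes[b] for b in real_sizes}
    # Treat the shared part of the bias as one constant across all benchmarks.
    common_bias = median(per_bench_bias.values())
    return {b: real_sizes[b] - common_bias for b in real_sizes}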
> >
> >
> > Feedback very welcome.
> >
> > Thanks,
> >
> >
> > Jeremy
> >
> > --
> > Tel: +44 (1590) 610184
> > Cell: +44 (7970) 676050
> > SkypeID: jeremybennett
> > Twitter: @jeremypbennett
> > Email: jeremy.bennett at embecosm.com
> > Web: www.embecosm.com
> > PGP key: 1024D/BEF58172FB4754E1 2009-03-20

