[Embench] New metrics for Embench results - Base/Peak

Roger Shepherd roger.shepherd at chipless.eu
Mon Mar 8 23:13:25 CET 2021


I think we should separate the Speed/Rate and Base/Peak issues, so I’m starting two e-mail threads. This is the Base/Peak thread.

I think it is worth working through Base/Peak and understanding exactly why it is of interest and to whom.

Base/Peak is about how compiler and other tool optimisations may be used when reporting Embench results. I’ve gone back and looked at what the current guidelines say. They say you must report details of the toolchain used and the compiler and linker flags used for the benchmarks, “which should be the same for each benchmark program”, and add: "For clarification, compiler flags whose effect is to vary the choice and parameters of optimization passes on a per program (or per compilation unit or function) basis are permitted. For example, flags which use machine learning techniques to match source code styles with a choice of optimization passes. Note that the flags can differ between different architectures."

Ofer says "Currently, we can say we only have “Base”, on IoT results", but our rules appear to allow feedback-driven optimisation on a per-program basis, whereas SPEC is very clear that "2.2.3. Feedback directed optimization is allowed in peak." I also note that the SPEC rules are very detailed, presumably to prevent “cheating” via the use of special optimisation flags.

We went through a lot of these issues when I was on the board of EEMBC and I imagine little has changed since then, although perhaps the process of building software has become more sophisticated. The “users” of benchmarks fell into three categories: i) processor architects and compiler writers, desperate for “real-world” programs that they could use to optimise their developments, ii) processor and compiler vendors, wanting benchmarks that show off their products in the best light, preferably showing them to be better than their competition, and iii) people trying to make a buying decision. I’ve been in all three categories, sometimes at the same time, and I know that there are conflicting interests and that it is the job of the benchmarking organisation to set the rules that balance these interests.

One thing I learned from years of working on embedded-system processors was that the people using embedded* processors are typically not highly skilled programmers; they have deadlines to meet, and all they have time to do is get the code compiled and running without too many bugs. They don’t have time to optimise, and they don’t have the skill, or the build system, to apply different optimisations to different parts of the system. That’s why the Base results should be measured using the same flags for every benchmark. (And it is a reason that feedback-based optimisations should be excluded from the baseline.) I think what we have today is great for two of the three parties and is the right thing to have done (although we may need to tighten the rules in the future).
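(As an illustration of why “same flags for every benchmark” gives a single clean number, here is a toy sketch. It is not the real Embench scoring code, and the benchmark names and timings are made up; the shape of the calculation, a geometric mean of per-benchmark relative speeds, is the point.)

```python
# Toy sketch only -- not the real Embench scoring script. Benchmark
# names and timings below are hypothetical. The point: with ONE flag
# set for every benchmark there is exactly one relative speed per
# program, so the whole suite reduces to a single geometric mean.
from math import prod

def base_score(ref_times, measured_times):
    """Geometric mean of per-benchmark relative speeds (reference / measured)."""
    ratios = [ref_times[b] / measured_times[b] for b in ref_times]
    return prod(ratios) ** (1.0 / len(ratios))

# Hypothetical run times (ms), all builds using the same flags.
ref = {"aha-mont64": 4.0, "crc32": 2.0, "huffbench": 8.0}
run = {"aha-mont64": 2.0, "crc32": 2.0, "huffbench": 4.0}
print(round(base_score(ref, run), 3))  # geometric mean of 2.0, 1.0, 2.0
```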

[*really deeply embedded processors might have very little software and might be amenable to careful optimisation, but in that case, the idea of a benchmark suite being useful isn’t so clear].

Let’s take it as read that there is some desire to be able to publish something that shows off what can be done by tweaking compilation and build options on a per-program basis. For me, this raises questions of what Embench then has to do, how much work it is, and who benefits. I will express my thoughts as a series of questions:

. does Embench need to specify a “Peak” metric?
. does Embench need to specify how to report the individual results?
. does Embench need to modify its build systems to support these things?
. does having Peak and Base lessen the brand?
. does it (the Peak score) help anyone trying to buy a processor?

It seems to me that blessing “Peak” might make sense if there really are going to be competitive Embench Peak results available. Otherwise, parties are free to measure and present the impact of tool tweaks on a per-program basis as they wish; they just can’t claim they are Embench results. Is there really a demand for Peak?
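(To be clear about what per-program tweaking means mechanically, here is a toy sketch of a Peak-style selection: for each benchmark, keep whichever candidate flag set ran fastest. The flag sets and timings are hypothetical, and a real tuner, such as the OpenTuner work Craig Blackmore did, searches a far larger space.)

```python
# Toy sketch of a "Peak"-style search: pick, per benchmark, whichever
# candidate flag set ran fastest. Flag sets and run times are hypothetical.
def pick_peak_flags(timings):
    """Map each benchmark to its fastest candidate flag set."""
    return {bench: min(per_flags, key=per_flags.get)
            for bench, per_flags in timings.items()}

# Hypothetical per-benchmark run times (ms) under three flag sets.
timings = {
    "crc32":     {"-O2": 3.0, "-O3": 2.5, "-O3 -funroll-loops": 2.8},
    "huffbench": {"-O2": 6.0, "-O3": 6.5, "-O3 -funroll-loops": 5.9},
}
print(pick_peak_flags(timings))
```

Note the contrast with Base: there, one flag set is chosen up front and applied everywhere; here, the winner can differ per program.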

Roger

> On 8 Mar 2021, at 08:03, Ofer Shinaar <Ofer.Shinaar at wdc.com> wrote:
> 
> Hi Ray,
> By raw, I am thinking about the basic compiler flags, something like we currently have. But if we take my example, “save-restore” is something unique to RISC-V.
> I understand that it is basic usage when doing “size” testing, but we can argue that it should be part of the “non-raw” flags or “special target flags.”
> Having “Peak/Rate” will give us the option to fully separate basic flags from non-basic flags… Plus, we can show how each target can boost performance and decrease size just by compiler “tweaks.”  
>  
> BR,
> Ofer
>  
> From: Ray Simar <ray.simar at rice.edu> 
> Sent: Friday, 5 March 2021 03:21
> To: Ofer Shinaar <Ofer.Shinaar at wdc.com>
> Cc: David Patterson <pattrsn at cs.berkeley.edu>; embench at lists.librecores.org
> Subject: Re: [Embench] New metrics for Embench results - Base/Peak and Speed/Rate
>  
> Hi Ofer,
>  
> Thanks for the ideas.  I was wondering if we might be able to calculate these measures, at least in part, from the raw measures.  I do like the idea of a single top line number as the team is doing now.  But maybe if we offered the right set of raw measurements people could calculate these additional measures as they needed.
>  
> Thoughts?
>  
> All the best,
> Ray
>  
> On Mar 4, 2021, at 3:17 AM, Ofer Shinaar <Ofer.Shinaar at wdc.com> wrote:
>  
> Hi Dave,
> It is more of an enhancement and less a “problem-solving.”
>  
> For Base/Peak, we can also allow vendors to publish their best results with compiler tweaks, along with the current “Base” typical results. 
> After all, the RISC-V and Arm compilers are not the same; for example, we do provide -msave-restore for RV, but we do not give that to Arm because it is just a basic flag that we “must use”.
> Peak will allow you to drive more flags and expose where we can get with each compiler per target.
>  
> For Rate/Speed:
> We only publish Speed results, and some targets want to see IPC/throughput as well. Giving the “Rate” option would solve that.
>  
> Thanks,
> Ofer
>  
>  
>  
> From: David PATTERSON <pattrsn at cs.berkeley.edu> 
> Sent: Thursday, 4 March 2021 03:44
> To: Ofer Shinaar <Ofer.Shinaar at wdc.com>
> Cc: embench at lists.librecores.org
> Subject: Re: [Embench] New metrics for Embench results - Base/Peak and Speed/Rate
>  
> What is the problem you're trying to solve with the new metrics?
> 
> Dave
>  
> On Wed, Mar 3, 2021 at 3:47 AM Ofer Shinaar <Ofer.Shinaar at wdc.com> wrote:
> Hello all,
> I would like to bring to the forum a suggestion to include more result metrics, as SPEC has:
>  
> 1.       Include metrics for Base and Peak:
> 
> -          SPEC URL for explanation: https://www.spec.org/cpu2017/Docs/overview.html#Q16
> -          Currently, we can say we only have “Base” on IoT results.
> 
> -          Having “Peak” will give us the possibility to expose more compiler flags that can boost performance, and we can publish results + links to tools that can “find the best set of compiler flags”
> 
> An example of work done by Craig Blackmore (from Embecosm): 
> 
> o   https://github.com/craigblackmore/opentuner
> o   https://www.groundai.com/project/automatically-tuning-the-gcc-compiler-to-optimize-the-performance-of-applications-running-on-embedded-systems/2
> 2.        Include metrics for Speed and Rate:
> 
> -          SPEC URL for explanation: https://www.spec.org/cpu2017/Docs/overview.html#Q15
> -          In the embedded space, throughput is sometimes more important than “time”. These days we are starting to see IoT cores/MCUs with multithreaded designs.
> We also see emerging multithreaded designs for RISC-V. On those designs, a rate metric for throughput will be more important than time.
> 
>  
> Thoughts/comments?
>  
> Thanks,
> Ofer
>  
>  
>  
>  
> Ofer Shinaar
> Senior Manager, R&D Engineering – Firmware & Toolchain, CTO Group
>  
> Western Digital®
> Israel, Migdal Tefen 24959, P.O Box 3
> Email: Ofer.shinaar at wdc.com
> Office: +972-4-9078783
> Mobile: +972-52-2836160
>  
>  
> -- 
> Embench mailing list
> Embench at lists.librecores.org
> https://lists.librecores.org/listinfo/embench
