[Open SoC Debug] packet vs. memory interface

Stefan Wallentowitz stefan at wallentowitz.de
Thu Jan 28 14:47:24 CET 2016

On 27.01.2016 20:42, Tim Newsome wrote:

> My thoughts here are that this is a lot of overhead if all you
> care about is accessing memory mapped registers. It seems more 
> straightforward to implement a protocol that doesn't require an
> extra layer of packet format on top of a memory bus. Am I
> overestimating the extra complexity here?

Hi Tim,

I think you are overestimating it a bit, but it all depends on how
you define a bus. If you look at state-of-the-art systems, they
actually have not much in common with the old tristate-lines plus

Below I wrote a small overview of interconnect topologies. But the
most important thing to say is that we plan composable modules that
are split into a Debug Interface Interconnect (DII)-specific frontend
and an interconnect-independent part with the MMIO interface [9] or
similar, wherever possible. As a side note: in trace modules that
split is between the trace generation and the packetization, but
because the trace primitives can generally be of arbitrary size, that
depends a bit on the specific module.


For a general discussion around topologies let me briefly summarize my
rough knowledge and thoughts:

# Old-School Debug Interconnect

That's of course the good old JTAG. Traditionally, the JTAG TAPs are
chained up in a device and you get one large shift register spanning
your chip. This has changed slightly for modern debug systems, where
you often find a tree of multiplexers that are themselves controlled
by a JTAG register. Slide 10 in [1] is a good picture of this. There
you can also find the equivalent for trace streams on slide 16.

# The Bus

In the old days a bus was a shared medium with tri-state drivers and
an arbiter. In on-chip implementations this is very rare nowadays, and
on an FPGA it has even been impossible to do tri-state since
Virtex2-Pro. Instead there are a few building blocks: one
#Masters-Mux, one arbiter, one #Slave-Demux and an address decoder. I
put up a rough sketch in [2]. To increase the throughput you can use a
crossbar instead, and many processors have actually used one. There
you have #Masters many (#Slave-Demux, address decoder) pairs and
#Slaves many (#Masters-Mux, arbiter) pairs. I have drawn it similarly
in [3].
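
To make the building blocks of the mux-based bus concrete, here is a
minimal behavioural sketch in Python. The address map, the
fixed-priority arbitration policy and all function names are
assumptions made for illustration only; they are not taken from the
sketches in [2] or [3].

```python
from typing import List, Optional, Tuple

# Hypothetical address map: one (base, size) entry per slave.
ADDRESS_MAP = [(0x0000, 0x1000), (0x1000, 0x1000), (0x8000, 0x4000)]


def decode(addr: int) -> Optional[int]:
    """Address decoder: return the index of the slave that owns addr."""
    for idx, (base, size) in enumerate(ADDRESS_MAP):
        if base <= addr < base + size:
            return idx
    return None  # unmapped address


def arbitrate(requests: List[bool]) -> Optional[int]:
    """Arbiter (assumed fixed priority): lowest-numbered requester wins."""
    for master, requesting in enumerate(requests):
        if requesting:
            return master
    return None


def bus_cycle(master_addrs: List[Optional[int]]) -> Optional[Tuple[int, Optional[int]]]:
    """One bus cycle: the arbiter drives the #Masters-Mux, the decoder
    drives the #Slave-Demux. Each entry of master_addrs is the address a
    master requests, or None if that master is idle."""
    grant = arbitrate([a is not None for a in master_addrs])
    if grant is None:
        return None
    return grant, decode(master_addrs[grant])
```

For example, `bus_cycle([None, 0x1004, 0x8000])` grants master 1 and
selects slave 1, while master 2 has to wait for a later cycle. In the
crossbar variant each master would get its own `decode` and each slave
its own `arbitrate`, so transfers to different slaves could proceed in
the same cycle.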

# The Ring

The ring is about the simplest network-on-chip built from
point-to-point connections. As you say, it is generally packet-based,
as in our case. Each ring router looks like [4]: 2 2-Demux, 2
Comparators, 2 2-Mux, 2 Arbiters and buffers. The buffers allow
parallel transmission of multiple packets by partitioning the ring,
and they increase the achievable clock speed, which makes a ring the
fastest interconnect in that respect.
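
The behaviour of such a router can be sketched roughly as follows.
This is a simplified, unidirectional model under my own assumptions:
packets are `(dest, payload)` tuples, ring traffic has priority over
local injection, and each link holds one packet per cycle. The class
and function names are hypothetical and the model does not reproduce
the exact structure drawn in [4].

```python
from collections import deque


class RingRouter:
    def __init__(self, node_id):
        self.node_id = node_id
        self.inject_queue = deque()  # packets waiting to enter the ring
        self.delivered = []          # payloads ejected at this node

    def step(self, incoming):
        """Handle one cycle; return the packet for the outgoing link."""
        forward = None
        if incoming is not None:
            dest, payload = incoming
            if dest == self.node_id:   # comparator + demux: eject locally
                self.delivered.append(payload)
            else:                      # otherwise keep it on the ring
                forward = incoming
        # Arbiter + mux: local traffic may only use a free slot.
        if forward is None and self.inject_queue:
            forward = self.inject_queue.popleft()
        return forward


def simulate(routers, cycles):
    """Advance the ring; links[i] is the packet arriving at router i."""
    links = [None] * len(routers)
    for _ in range(cycles):
        outputs = [r.step(links[i]) for i, r in enumerate(routers)]
        # Buffered links: router i-1's output reaches router i next cycle.
        links = [outputs[i - 1] for i in range(len(routers))]


routers = [RingRouter(i) for i in range(4)]
routers[0].inject_queue.append((2, "a"))
routers[1].inject_queue.append((3, "b"))
simulate(routers, 3)
```

In this run both packets travel at the same time on different ring
segments, which is the parallelism the buffers provide: after three
cycles node 2 has received "a" and node 3 has received "b".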

If you look at current Intel processors, the previous crossbar between
the cores and slaves has now been replaced with a ring [5].

# The modern channel-based interfaces: AXI, NASTI, TileLink

The description of a bus above pretty much matches what you find in
simple AHB buses etc. For the modern interfaces like AXI (or NASTI or
TileLink) the design changes a bit. Due to their channel-based nature
the interconnects don't share a common medium, and the different
requests and responses are very much decoupled. The AXI Interconnect
from Xilinx is still a bus or crossbar on the channels [6]. Given
that each channel is pretty wide, adding a port actually implies a
relatively large amount of logic. If you use the ARM CoreLink NIC-400
[7] in a large design, the thin link (TLX) feature is often used. What
it does is serialize and de-serialize the requests and responses to
reduce the internal connectivity. I roughly depicted what SoC
interconnects then look like in [8]. On the outside there are protocol
adapters and internally some kind of packet format is actually routed
between

# Conclusion for Debug Interconnect

I am sorry that this got rather long. But the point I wanted to make
is that the ring is actually not that complex if you compare it to
what's in the field as system buses. As a packet-based interconnect
its advantages are better scalability, that it can span a chip thanks
to the buffer partitioning, and that it can be faster. Nevertheless
the throughput is not that high, so I am really looking forward to
exploring more sophisticated topologies.

The decision for a packet-based interconnect is further justified by
the demand for trace-based debugging, which I personally see as much
more important in future systems. Using an MMIO interface here is also
possible, but then you map it to a FIFO-like write to a static address
with variable-length bursts or so. That's not much different from the
packet-based interconnect then.
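
A small sketch may illustrate why the two views end up so close. The
FIFO address, the maximum burst length and the dictionary-style packet
are all hypothetical values chosen for illustration; in particular,
this is not the actual Open SoC Debug packet format.

```python
FIFO_ADDR = 0x4000_0000  # hypothetical static address of the trace FIFO
MAX_BURST = 4            # hypothetical maximum burst length in words


def trace_to_bursts(words, max_burst=MAX_BURST):
    """MMIO view: a variable-length trace event becomes one or more
    burst writes that all target the same static FIFO address."""
    return [(FIFO_ADDR, words[i:i + max_burst])
            for i in range(0, len(words), max_burst)]


def trace_to_packet(dest, words):
    """Packet view: the same event with a destination instead of an
    address, followed by the payload words."""
    return {"dest": dest, "payload": list(words)}


event = [0x11, 0x22, 0x33, 0x44, 0x55, 0x66]
bursts = trace_to_bursts(event)
packet = trace_to_packet(dest=1, words=event)
```

In both cases the receiver sees a routing field (static address or
destination) followed by a variable number of data words; the burst
variant only adds a split wherever the maximum burst length is reached.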

[1] http://www.mad-workshop.de/slides/2_1.pdf
[2] https://goo.gl/BxgL1G
[3] https://goo.gl/Xrcznc
[4] https://goo.gl/MxyGTg
[8] https://goo.gl/KyY5sI

