Dependability Measures
NoCs are usually very poor at handling faults. If you mess up, say, a destination address in a header flit, chances are that your entire network will go bye bye!
Anyways, if you look into the problem of faults on the data path in the router, you can easily divide it into two parts: faults inside the FIFO, and faults between the FIFO output of the upstream router and the FIFO input of the downstream router. The latter is just a bunch of wires plus a path through the crossbar switch. So the easiest way to make sure that the network keeps doing ok is to drop the faulty packet.
Packet dropping
This is how it works:
- FIFO: the packet dropping mechanism in the FIFO uses an FSM that tracks the progress of the packet through the input port. In the fault-free scenario it goes from the IDLE state to HEADER, then to BODY, and once all the body flits are gone it goes to TAIL and back to IDLE (or straight back to HEADER in case you get packets back to back). In a faulty environment you have to add one more state called PACKET_DROPPING, and you have to consider that you can end up in this state from any other state. You enter this state in one of the following three cases:
- header flit is faulty: this is easy, keep dropping flits and issuing credits to the upstream router until you get the tail flit.
- a body flit is faulty: here you don't have any idea where the header of the packet is (it can easily be out of reach of your FIFO, especially in wormhole routers). So you assume that the header is gone, you replace the faulty body flit with a dummy tail and forward it so the packet gets closed downstream, and you drop the rest of the packet as it arrives from the upstream router.
- tail flit is faulty: this is handled similarly to a faulty body flit.
- Routing Logic: this one is much easier. Since the routing logic acts based on the header flit only, it can initiate a packet drop if there is a fault in the header flit. To achieve this, the routing logic can fake the grant signal to the FIFO, making it believe that it can send these flits, while never sending a single request to the allocator unit. There is one downside here! If the tail of the packet has a faulty flit type as well, the end of the dropped packet goes unnoticed and another innocent packet will probably be dropped too.
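To make the FIFO side a bit more concrete, here is a minimal Python sketch of the packet-dropping FSM. It is a behavioral toy model, not the actual RTL: the names `State`, `Flit`, `run_input_port` and the `"DUMMY_TAIL"` marker are made up for illustration, each flit is just a `(type, faulty)` pair, and I assume the flit type field itself is still readable even when the flit is flagged as faulty. Returning to IDLE right after a faulty tail is also my assumption about what "similar to faulty body" means. Credits are not modeled separately; every dropped flit is assumed to return one credit upstream.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    HEADER = auto()
    BODY = auto()
    TAIL = auto()
    PACKET_DROPPING = auto()


class Flit(Enum):
    HEADER = auto()
    BODY = auto()
    TAIL = auto()


def run_input_port(flits):
    """Feed (Flit, faulty) pairs through the FSM.

    Returns (forwarded, dropped, state), where `forwarded` lists what is
    sent downstream, `dropped` counts discarded flits (one credit each is
    assumed to go back upstream) and `state` is the final FSM state.
    """
    state = State.IDLE
    forwarded = []
    dropped = 0
    for kind, faulty in flits:
        if state is State.PACKET_DROPPING:
            # Keep discarding until the tail of the broken packet shows up.
            dropped += 1
            if kind is Flit.TAIL:
                state = State.IDLE
        elif faulty and kind is Flit.HEADER:
            # Faulty header: drop it and everything up to the tail.
            dropped += 1
            state = State.PACKET_DROPPING
        elif faulty:
            # Faulty body or tail: close the packet downstream with a dummy tail.
            forwarded.append("DUMMY_TAIL")
            # A faulty tail already ends the packet (assumption of this sketch).
            state = State.IDLE if kind is Flit.TAIL else State.PACKET_DROPPING
        else:
            # Fault-free flit: forward it and follow the normal state sequence.
            forwarded.append(kind.name)
            state = State[kind.name]
    return forwarded, dropped, state


if __name__ == "__main__":
    packet = [
        (Flit.HEADER, False),
        (Flit.BODY, False),
        (Flit.BODY, True),   # fault hits the second body flit
        (Flit.BODY, False),
        (Flit.TAIL, False),
    ]
    print(run_input_port(packet))
```

Note that this toy model does not check for illegal sequences (for example a body flit arriving in IDLE); in a real router that is the kind of thing you would want a checker for.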
It's also crucial to somehow inform some system manager about the faults. The problem is that sending all such info up to the manager (if there is one!) adds too much overhead to the system. So the idea is to abstract the fault information and even classify it locally.
One of my colleagues is working on the concept of concurrent online checkers for the control logic of NoC routers (they work like hardware assertions). These checkers are really powerful, and it’s possible to prove that they cover 100% of single stuck-at faults (for proving coverage, he uses a tool called Turbo Tester). He is going to defend soon and I will add a link to his thesis here. We used these checkers to model turn faults.
Online local fault classification
There is this paper that describes an online, hardware-based fault classifier:
- J. Silveira, M. Bodin, J. M. Ferreira, A. C. Pinheiro, T. Webber, C. Marcon, “A fault prediction module for a fault tolerant NoC operation”, Sixteenth International Symposium on Quality Electronic Design, pp. 284-288, March 2015.
There are a couple more that use a shift register (I didn't find that appealing, since they only see a short window, just to save some gates, and you can't really make decisions with such an architecture; counters are much more elegant). I modified the counter-based classifier to work with single parity (instead of a Hamming decoder) and reduced its area by removing one of the counters! We also tried it on the abstracted checker outputs. Works like a charm!
I also made a flit tracker module that you can connect to network links, and it logs the packet information. I also wrote a small Python script that generates an animation out of the logs (it's interesting for me because I cannot show waveforms to people who are not directly working on this project, but I can show this. BTW, in the current version the faulty flits are marked in red, so you can see packets getting corrupted!). I think I should have spent more time making it more interesting, but other things came up and nobody has interest in fancy stuff in academia (it's like people enjoy ugly stuff :P). Here you can find an example of the output:
Anyways, we have a paper about all of the stuff that I mentioned above. Check it out if you are interested:
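To give a feeling for why a counter is enough, here is a minimal Python sketch of a counter-based classifier fed by a single parity check. To be clear, this is neither the module from the paper above nor our modified version: it is a generic "leaky counter" I am using for illustration, and the class name `FaultClassifier`, the thresholds, the step sizes and the class labels (healthy, transient, intermittent, permanent) are all assumptions. The counter goes up by 2 on every parity mismatch and down by 1 on every clean flit, so isolated faults fade away while repeated ones accumulate.

```python
def parity_bit(data: int) -> int:
    """Even parity: the bit that makes the total number of ones even."""
    return bin(data).count("1") % 2


class FaultClassifier:
    """Toy leaky-counter classifier; all thresholds are illustrative."""

    def __init__(self, intermittent_threshold=4, permanent_threshold=8):
        self.intermittent_threshold = intermittent_threshold
        self.permanent_threshold = permanent_threshold
        self.counter = 0

    def observe(self, data: int, received_parity: int) -> str:
        """Check one flit against its parity bit and return the current class."""
        if parity_bit(data) != received_parity:
            # Fault seen: count up, saturating at the permanent threshold.
            self.counter = min(self.counter + 2, self.permanent_threshold)
        elif self.counter > 0:
            # Clean flit: let the counter leak back toward zero.
            self.counter -= 1
        return self.classify()

    def classify(self) -> str:
        if self.counter >= self.permanent_threshold:
            return "permanent"
        if self.counter >= self.intermittent_threshold:
            return "intermittent"
        if self.counter > 0:
            return "transient"
        return "healthy"


if __name__ == "__main__":
    clf = FaultClassifier()
    data = 0b1011                 # three ones, so the correct parity bit is 1
    good = parity_bit(data)
    bad = good ^ 1                # flipped parity bit models a single-bit fault
    for p in [bad, good, good, bad, bad, bad, bad]:
        print(clf.observe(data, p))
```

Unlike a shift register, the counter does not forget faults just because they fell out of a fixed window; it only forgets them after enough clean flits have passed.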
- S. P. Azad et al., “From online fault detection to fault management in Network-on-Chips: A ground-up approach,” 2017 IEEE 20th International Symposium on Design and Diagnostics of Electronic Circuits & Systems (DDECS), Dresden, 2017, pp. 48-53. doi: 10.1109/DDECS.2017.7934565
Note: we are still not using the checkers for correction. But we have some plans!