ROACH2 packetised correlator sizing

ROACH-II was primarily specified for the packetised correlator, which was considered to be the most demanding application. This page explains the reasoning behind choosing the memory interfaces and capacities, as well as IO bandwidth for ROACH2.

Executive Summary

To meet the MeerKAT, PAPER and ATA correlator requirements, ROACH2 should be upgraded over ROACH1 as follows:

  • A Virtex-6 SX475 main FPGA.
  • Four 36bit QDR parts of 36Mibit each (144Mibit for MeerKAT2 upgrade path).
  • A 72-bit DRAM DIMM slot, capable of housing at least a 256MiB DIMM running at or faster than 250MHz DDR.
  • At least four 10GbE ports (more will be needed for beamforming).

Additionally, it would be convenient to have:

  • Increased PPC-FPGA datarates (32bit bus?).

F engine coarse delay compensation

A simple delay compensation implementation requires dual-ported memory, most likely on-FPGA BRAM.

Delay = baseline / speed of signal. Speed of signal ~ 300 000km/s.

For each polarisation, we then need:

Longest delay   Sample rate   Samples   Mem req'd at 8b/sample
1 km            1Gsps         3K        26Kb
8 km            2Gsps         53K       417Kb
20 km           1Gsps         66K       520Kb
20 km           2Gsps         133K      1042Kb
60 km           1Gsps         200K      1562Kb
60 km           2Gsps         400K      3125Kb
60 km           4Gsps         800K      6250Kb

MeerKAT will see each F board processing two polarisations, so the numbers in the table above need to be doubled.

Conclusion: MeerKAT-1 will require at least 2Mb of BRAM for delay processing. Any of the Virtex-6 devices will be sufficient if ignoring FFT, PFB and fine delay requirements.
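
As a rough check of the delay-buffer arithmetic, the sizing rule above (delay = baseline / speed of signal, then samples = delay * sample rate) can be sketched in a few lines. The function name and the use of 1024 bits per Kb are assumptions made for illustration; the 300 000km/s figure and 8b/sample come from this page.

```python
SIGNAL_SPEED_KM_PER_S = 300_000  # approximation used on this page


def delay_buffer(baseline_km, sample_rate_sps, bits_per_sample=8):
    """Return (samples, bits) needed to buffer the longest delay for one polarisation."""
    delay_s = baseline_km / SIGNAL_SPEED_KM_PER_S
    samples = delay_s * sample_rate_sps
    return samples, samples * bits_per_sample


if __name__ == "__main__":
    # (longest delay in km, sample rate in Gsps) pairs from the table
    for km, gsps in [(1, 1), (8, 2), (20, 1), (20, 2), (60, 1), (60, 2), (60, 4)]:
        samples, bits = delay_buffer(km, gsps * 1e9)
        # Assumes "Kb" means 1024 bits; double the result for two polarisations.
        print(f"{km:>3} km {gsps} Gsps: {samples / 1e3:7.1f} K samples, {bits / 1024:7.1f} Kb")
```

Doubling any result gives the per-board figure for an F board that handles two polarisations.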

F engine logic requirements

Not yet thoroughly checked, but as a guideline: FFT logic scales linearly with bandwidth and as N log N with the number of channels N. ROACH1 runs out of logic at 8k channels at 500MHz.

PAPER's processing requirements are small.
MeerKAT will need at least four times the logic to do 16k channels at 1GHz (twice the bandwidth and twice the channel count). SX95 * 4 = SX380.
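
The guideline above can be turned into a rough relative estimate. This is only a sketch of the stated rule of thumb (linear in bandwidth, N log N in channels), referenced to the ROACH1 limit of 8k channels at 500MHz; the function name and the choice of log base 2 are assumptions, and it says nothing about PFB or fine delay logic.

```python
import math


def relative_fft_logic(bandwidth_hz, n_chans, ref_bandwidth_hz=500e6, ref_n_chans=8192):
    """Logic needed relative to the reference design, per the page's rule of thumb."""
    bandwidth_ratio = bandwidth_hz / ref_bandwidth_hz
    # N log N scaling with the number of channels (assumed log base 2)
    chan_ratio = (n_chans * math.log2(n_chans)) / (ref_n_chans * math.log2(ref_n_chans))
    return bandwidth_ratio * chan_ratio


if __name__ == "__main__":
    # 16k channels at 1GHz, relative to 8k channels at 500MHz
    print(f"{relative_fft_logic(1e9, 16384):.2f}x the reference logic")
```

Under these assumptions the result comes out slightly above four, which is consistent with "at least four times".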

F engine corner-turn bandwidth

There is a small matrix transpose that happens inside the F engines in order to have each packet contain data for a single antenna, single frequency channel.

For reference, ROACH has two 18-bit-wide dual-ported DDR SRAMs (QDR) and a single 72-bit-wide DDR DRAM. In application space these are presented as 36-bit and 144-bit SDR interfaces respectively.

On each FPGA clock cycle, we need to be able to read and write:

bus_width = n_pols * n_parallel_stream * n_bits * complex

PAPER is processing 100MHz on a 200MHz FPGA.

PAPER-64:  8 * 0.5 * 4 * 2 = 32 bit
PAPER-128: 8 * 0.5 * 4 * 2 = 32 bit

ATA is processing two 1GHz dual pol antennas on a 250MHz FPGA.

ATA-42:    4 *  8  * 4 * 2 = 256 bit  

MeerKAT1 is processing 1GHz on a 250MHz FPGA.

MeerKAT1:  2 *  4  * 4 * 2 = 64 bit

MeerKAT2 is processing 4GHz on a 250MHz FPGA.

MeerKAT2:  2 *  16  * 4 * 2 = 256 bit

We currently double-buffer this corner turn (CT), so we need to read from one memory while writing to the other.

Conclusion: two 64-bit memory interfaces will suffice for a naive implementation up to MeerKAT1. ATA and MeerKAT2, however, require a wider bus of 256 bits. This is probably best implemented with 4x 36-bit QDR parts, which appear as 4x 72-bit SDR interfaces.
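
The bus-width formula from this section can be checked with a short script. The function names and the dictionary layout are illustrative assumptions; the parameter values are the ones listed above, and the 72-bit interface width is the SDR width of a 36-bit QDR part as stated in the conclusion.

```python
import math


def bus_width_bits(n_pols, n_parallel_streams, n_bits=4, complex_factor=2):
    """Bits that must be read and written on every FPGA clock cycle."""
    return n_pols * n_parallel_streams * n_bits * complex_factor


def interfaces_needed(width_bits, interface_bits=72):
    """Number of SDR interfaces of the given width needed to carry the bus."""
    return math.ceil(width_bits / interface_bits)


# (n_pols, n_parallel_streams) as listed in this section
CONFIGS = {
    "PAPER-64": (8, 0.5),
    "PAPER-128": (8, 0.5),
    "ATA-42": (4, 8),
    "MeerKAT1": (2, 4),
    "MeerKAT2": (2, 16),
}

if __name__ == "__main__":
    for name, (pols, streams) in CONFIGS.items():
        width = bus_width_bits(pols, streams)
        print(f"{name:<10} {width:5.0f} bit -> {interfaces_needed(width)} x 72-bit interface(s)")
```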

F engine corner-turn space

For reference, ROACH1 has 2x 36Mib QDRs and 1x 1GiB DRAM.

Calculation is as follows: req'd_mem = double_buffer * pkt_len * n_chans * n_parallel_streams * n_bits * complex.

We assume 4 bit quantisation (4b real + 4b imaginary).

PAPER-64:    2 * 128 *  2048 * 8 * 4 * 2 =  32 Mibit
PAPER-128:   2 * 256 *  2048 * 8 * 4 * 2 =  64 Mibit
MeerKAT1:    2 * 256 * 16384 * 2 * 4 * 2 = 128 Mibit
MeerKAT3:    2 * 256 * 65536 * 2 * 4 * 2 = 512 Mibit

Conclusion: ROACH2 should have at least 128 Mibit of QDR memory to support the matrix transpose operation for MeerKAT phase-1. Should we want to use ROACH2 for future MeerKAT phases, we will need at least 512 Mibit (4x 144 Mibit parts) for the F engine.
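
The corner-turn memory formula can be evaluated directly, which makes it easy to try other packet lengths or channel counts. This is a minimal sketch: the function name is an assumption, and the parameter names follow the formula in this section. Note that the values used for the fourth factor in the worked lines (8 for PAPER, 2 for MeerKAT) match the n_pols values from the bandwidth section.

```python
MIBIT = 2 ** 20  # bits per Mibit


def corner_turn_bits(pkt_len, n_chans, n_parallel_streams, n_bits=4, complex_factor=2, double_buffer=2):
    """Memory (in bits) needed for the double-buffered F engine corner turn."""
    return double_buffer * pkt_len * n_chans * n_parallel_streams * n_bits * complex_factor


# (pkt_len, n_chans, fourth factor) as used in the worked lines of this section
CONFIGS = {
    "PAPER-64": (128, 2048, 8),
    "PAPER-128": (256, 2048, 8),
    "MeerKAT1": (256, 16384, 2),
    "MeerKAT3": (256, 65536, 2),
}

if __name__ == "__main__":
    for name, (pkt_len, n_chans, streams) in CONFIGS.items():
        bits = corner_turn_bits(pkt_len, n_chans, streams)
        print(f"{name:<10} {bits / MIBIT:6.0f} Mibit")
```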

X engine VACC data rates

The X engine cores output data in windows. To avoid overflows, the complex values for all baselines and all Stokes terms must be output within one window. Each window is n_ants*pkt_size clocks long, and the X engine must produce n_baselines*n_stokes complex values during this time.

The vector accumulator needs to be able to read and write once per incoming value. On QDR, this is a single-clock operation (since it is dual-ported), but on DRAM it requires two clocks.

The choice of pkt size has implications for minimum integration period.

For reference, ROACH has 2x 36bit QDR interfaces and 1x 144bit DRAM interface.

If we assume the use of QDR for the VACC, which allows simultaneous reads and writes, the following table results...

n_ants   pkt_size   clk_avail   n_bls   demux               clk_req'd
32       128        4096        528     8 (32bit VACC)      4224
32       128        4096        528     4 (64bit VACC)      2112
64       128        8192        2080    8 (32bit VACC)      16640
64       256        16384       2080    8 (34bit VACC)      16640
64       128        8192        2080    4 (64bit VACC)      8320
64       256        16384       2080    4 (68bit VACC)      8320
64       128        8192        2080    2 (128bit VACC)     4160
128      128        16384       8256    2 (128bit VACC)     16512
128      256        32768       8256    2 (136bit VACC)     16512
128      512        65536       8256    2 (144bit VACC)     16512

Conclusion: We will need a single 128-bit interface for larger numbers of antennas (128). For smaller numbers of antennas (<=64), we would like multiple 64-bit interfaces. QDR is more convenient for these VACCs because the implementation is easier, and multiple QDR parts give additional flexibility in how the memory is arranged (a single large data bus vs multiple smaller buses). There should therefore be at least four 32-bit interfaces, which can be used as four stand-alone VACCs, combined in parallel as two 64-bit VACCs, or combined as a single 128-bit interface. Four 36-bit QDR parts are appropriate, and match the F engine corner-turn bandwidth requirements (see the section above).
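
The clock budget in the table can be reproduced with a small helper, which also flags configurations where the VACC cannot keep up. This is a sketch under the assumption, inferred from the table rows, that clk_avail = n_ants * pkt_size and clk_req'd = n_bls * demux, with n_bls counting autocorrelations. The function names are illustrative.

```python
def n_baselines(n_ants):
    """Number of baselines, including autocorrelations."""
    return n_ants * (n_ants + 1) // 2


def vacc_clocks(n_ants, pkt_size, demux):
    """Return (clocks available per window, clocks required by the VACC)."""
    available = n_ants * pkt_size
    required = n_baselines(n_ants) * demux
    return available, required


if __name__ == "__main__":
    # (n_ants, pkt_size, demux) for a few rows of the table
    for n_ants, pkt_size, demux in [(32, 128, 8), (64, 256, 4), (128, 128, 2), (128, 256, 2)]:
        available, required = vacc_clocks(n_ants, pkt_size, demux)
        verdict = "ok" if required <= available else "too slow"
        print(f"n_ants={n_ants:<4} pkt_size={pkt_size:<4} demux={demux}: "
              f"{required} of {available} clocks ({verdict})")
```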

X engine VACC capacities

Each X engine processes a subset of frequency channels. The number of X engines required scales with the bandwidth being processed. If the FPGAs run at the same rate as the bandwidth being processed, one X engine is needed for every F engine. For a wideband design (e.g. 1GHz with FPGAs running at 250MHz), X/Y times as many X engines as F engines are needed, where X is the bandwidth being processed and Y is the FPGA clock rate of the X engine.

For PAPER, 100MHz is processed on FPGAs running at 200MHz. Thus there are twice as many F engines as X engines.

MeerKAT-1 will likely process 1GHz of bandwidth on FPGAs at 250MHz (four times as many X engines as F engines).

MeerKAT-2 will likely process 4GHz of bandwidth on FPGAs at 250MHz (16 times as many X engines as F engines).

Each baseline has 4 Stokes terms, each a complex number with 32-bit real and 32-bit imaginary parts (= 256 bits per baseline).

Data must be double-buffered, so that previous accumulation can be read out slowly while capturing next accumulation.

n_ants   n_chan   n_xeng   n_bls   mem_per_xeng   n_xeng_per_roach2   total_double_buff_mem_per_roach2
32       2048     64       528     4.125Mb        4                   33Mb
64       2048     32       2080    32.5Mb         2                   130Mb
64       4096     256      2080    8.13Mb         2                   33Mb
128      2048     64       8256    64.5Mb         1                   130Mb
128      16384    512      8256    64.5Mb         1                   130Mb
128      65536    512      8256    258Mb          1                   516Mb

Conclusion: ROACH2 should have at least 130Mb of QDR memory for MeerKAT1. MeerKAT-2/3/4 can switch to DRAM VACC at the expense of minimum dump times (see section above on VACC bandwidth).
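
The capacity figures in this section follow from channels per X engine, baselines and 256 bits per baseline. The sketch below assumes that relationship (it reproduces the mem_per_xeng column) and that "Mb" in the table means 2^20 bits. The function names are illustrative.

```python
MIBIT = 2 ** 20
BITS_PER_BASELINE = 256  # 4 Stokes terms, 32-bit real + 32-bit imaginary each


def n_baselines(n_ants):
    """Number of baselines, including autocorrelations."""
    return n_ants * (n_ants + 1) // 2


def vacc_bits_per_xeng(n_ants, n_chans, n_xeng):
    """Bits needed to hold one accumulation in a single X engine."""
    chans_per_xeng = n_chans // n_xeng
    return chans_per_xeng * n_baselines(n_ants) * BITS_PER_BASELINE


def vacc_bits_per_board(n_ants, n_chans, n_xeng, n_xeng_per_board):
    """Double-buffered VACC memory per board, in bits."""
    return 2 * n_xeng_per_board * vacc_bits_per_xeng(n_ants, n_chans, n_xeng)


if __name__ == "__main__":
    # (n_ants, n_chan, n_xeng, n_xeng_per_roach2) for a few rows of the table
    for row in [(64, 2048, 32, 2), (128, 2048, 64, 1), (128, 65536, 512, 1)]:
        per_xeng = vacc_bits_per_xeng(*row[:3]) / MIBIT
        per_board = vacc_bits_per_board(*row) / MIBIT
        print(f"n_ants={row[0]:<4} n_chan={row[1]:<6} per X engine {per_xeng:7.3f} Mb, "
              f"per board {per_board:7.2f} Mb")
```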

Minimum integration period

The minimum integration period scales linearly with the number of frequency channels and the packet size.

Ignoring VACC output data rates, a possible example is:

   Processing 2GHz bandwidth, 16384 chans, minimum integration as follows:
       1/2GHz:           0.5ns
       16384 chans:   *16384
       pkt size:      *  256
       min dump time: ~2.1ms

However, this ignores the fact that, in the current design, the VACC uses spare FPGA cycles to retrieve the previous accumulation while storing the current one. If the data is forwarded through the VACC, how quickly the previous accumulation can be retrieved (even when integrating only a single spectrum) depends on the board clock rate and the ratio of valid clocks to idle clocks. Realistically, this adds another factor of ~4, increasing the minimum dump time in the example above to ~8ms. Alternatively, you can bypass the VACC entirely and output at a fixed ~2ms period.
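
The minimum dump time estimate is a straightforward product, sketched below. The function name and the readout_factor parameter are assumptions introduced here: readout_factor stands in for the extra factor of ~4 described above, and 1 corresponds to bypassing the VACC.

```python
def min_dump_time_s(bandwidth_hz, n_chans, pkt_size, readout_factor=1):
    """Minimum integration period in seconds.

    One sample period (1 / bandwidth) times the number of channels times the
    packet size, optionally scaled by a VACC readout factor.
    """
    return n_chans * pkt_size * readout_factor / bandwidth_hz


if __name__ == "__main__":
    bypass = min_dump_time_s(2e9, 16384, 256)
    through_vacc = min_dump_time_s(2e9, 16384, 256, readout_factor=4)
    print(f"VACC bypassed: {bypass * 1e3:.2f} ms")
    print(f"through VACC:  {through_vacc * 1e3:.2f} ms")
```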

Switch and 10GbE requirements

Calculating maximum analogue bandwidth transportable over a 10GbE link...

Digital link: 156.25MHz * 4 lanes * 20 = 12.5Gbps line rate, but 8b/10b encoding reduces this to 10Gbps.

Channel utilisation:

Field                        Bits   %pkt
Layer1 overhead              160    6%
Ethernet header and footer   160    6%
IPv4 header                  160    6%
UDP header                   64     2%
Application header           64     2%
Data payload                 2048   77%

77/100 * 10Gbps = 7.7Gbps maximum usable by the application per 10GbE link.

Assuming 4-bit complex data (8 bits per sample): 7.7Gbps / 8b = 962.5MHz total. With 2 pols (a single antenna/F engine per board), this gives a maximum of 481MHz bandwidth per link.

  • This does NOT account for any out of band signalling (currently used for heartbeat signals and legacy data synchronisation).
  • This does NOT conform to SPEAD packet formats. It is the maximum efficiency we can get away with.
  • It is not recommended to run links near 100% utilisation.

Conclusion: MeerKAT1 will thus require at least four ports per board (X engines need two ports to the switch and two to the F engines to carry 1GHz of bandwidth).
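
The link budget can be recomputed from the per-packet bit counts in the channel utilisation table. This is a sketch only: the dictionary and function names are assumptions, and it uses the unrounded payload fraction, so its results differ slightly from the rounded 77% figure used in the text.

```python
# Bits per packet, taken from the channel utilisation table on this page
PACKET_BITS = {
    "Layer1 overhead": 160,
    "Ethernet header and footer": 160,
    "IPv4 header": 160,
    "UDP header": 64,
    "Application header": 64,
    "Data payload": 2048,
}


def payload_fraction(fields=PACKET_BITS):
    """Fraction of each packet that carries application data."""
    return fields["Data payload"] / sum(fields.values())


def max_bandwidth_hz(link_bps=10e9, bits_per_sample=8, n_pols=2, fields=PACKET_BITS):
    """Maximum analogue bandwidth per polarisation that fits on one link."""
    usable_bps = link_bps * payload_fraction(fields)
    return usable_bps / bits_per_sample / n_pols


if __name__ == "__main__":
    print(f"payload fraction: {payload_fraction():.1%}")
    print(f"max bandwidth per link (2 pols): {max_bandwidth_hz() / 1e6:.1f} MHz")
```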