© ASTRON
Because 1024 ms is an oft-used number in the data packetisation, appearing as early as the Apertif Front End beam-former, we initially concluded that the problem must originate elsewhere: upstream from our systems, or in the data capture software we used on the GPU cluster. But once these were ruled out, it became apparent that we were creating The Artefact, as it was now known, ourselves. Somewhere. The literate reader may know that Sherlock Holmes excels at solving other people's problems, but is not actually all that self-aware. And self-awareness was precisely what solving our problem now required. And thus, week after week, Jonathan Hargreaves would produce new plots, like the one in the 2nd panel, showing the incorrect patterns the data would sometimes make. Poring over numbers of missing bits here and jumps in the data there, he was able, over the course of two months, to pinpoint the problem first to a single Uniboard (3rd panel) and then to a specific FPGA. With the offending FPGA identified (there are hundreds in the whole system), a trip to Westerbork confirmed that his skills surpass Sherlock's: the problem was of our own making. The FPGA had a memory module of the wrong type installed, and we had not checked. The firmware is optimized for a specific memory type to achieve the maximum throughput, and this wrong module affected both imaging and time-domain Apertif data. After the offending memory was banished and replaced, the data looked clean.
Case closed.
Lessons learned: 1) Budget significant time for debugging your complex system. 2) Exploit time-domain data for telescope verification.