The recent trend in computing toward deep learning has resulted in the development of a variety of highly innovative AI accelerator architectures. One such architecture, the Cerebras Wafer-Scale Engine 2 (WSE-2), features 40 GB of on-chip SRAM, making it an attractive platform for latency- or bandwidth-bound HPC simulation workloads. In this study we examine the feasibility of performing continuous-energy Monte Carlo (MC) particle transport by porting a key kernel from the MC transport algorithm to Cerebras's CSL programming model. We then optimize the kernel and experiment with several novel algorithms for decomposing data structures across the WSE-2's 2D network grid of approximately 750,000 user-programmable distributed-memory compute cores and for flowing particles (tasks) through the WSE-2's network for processing. New algorithms for minimizing communication costs and for handling load balancing are developed and tested. The WSE-2 is found to run 130 times faster than a highly optimized CUDA version of the kernel run on an NVIDIA A100 GPU—significantly outpacing the expected performance increase given the relative number of transistors each architecture has.