This repo is inspired by a series of recent posts by François Fleuret on X. The goal is to implement the PScan algorithm in a simple yet efficient way
Let's consider a tensor
And let
Where
Let's denote a binary upper triangular matrix of size
Note that
Knowing that
Following the recursion, for every
Now lets denote
Or using torch notation:
Y = X + torch.bmm(M, X)
Unfortunately, the product torch.bmm(M, X) requires
First, note the first row of
M = torch.cat([torch.zeros(N, 1, D), Z], dim=1)
The matrix
Then
Z = Z_ * L_T
By def torch.log in the complex space, then
Now let's express
or in PyTorch:
D[n, :, :] = torch.cat([U_t, torch.zeros(T - t, T)], dim=0)
Z_[n, t, :] = D[n, :, :] @ torch.log(A)[n, :]
It turns out this product can be computed efficiently using either FFT or CumSum. Turns out we can extend matrix CumSum. However, even though torch.bmm(M, X) entirely.
First let's have a look at
or, in other words:
As we said earlier, the product of any vector on the matrix A_log.cumsum(dim=-1) as CS, then Z[n, T - 1, :] = CS[:, -1].unsqueeze(-1) + A_log - CS. Now let's note the following fact, for any
Z[n, t - 2, :] = (Z[n, t - 1, :] - ln(A[n, t - 1])) * torch.cat([torch.ones(t), torch.zeros(T-t)])
So this leads to the conclusion that knowing torch.cat([torch.ones(t), torch.zeros(T-t)]). Let's now rewrite this in a tensor form. Lets denote
where
or
Moreover, we know that we can compute the product of
We know that both
Y = X + (L_T * torch.exp(V.unsqueeze(1) - W.unsqueeze(2))) @ X
Let's have a closer look for any fixed
or
The final step is to use the following trick. For any vector
Or using the CumSum operation:
CumSum we can achieve different performances. In the next few sections, we explain different CumSum algorithms, their tradeoffs, and their complexities.
See pytorch official docs for more details
Turns out we can extend matrix FFT.md
To run the code pytorch and tqdm are required. The recommended way is to use docker, the image can be built running:
docker build -it scan .
Here we compare cusum implementation with the one proposed by François Fleuret. Benchmarking conducted with NVIDIA V100, see Dockerfile for the environment details, and benchmark.py.

