I know when I was coding an AVX2 matmul last month, keeping multiple independent dot-product dependency chains in flight was the single biggest change that brought it into the same league of performance as MKL. It was a night-and-day difference the first time I ran the program after doing that. Software prefetching into L1 didn't help me much, since it worked better to keep operands in registers and share loads across operations.