Re. the last point, I was trying to think of computations where 1) an efficient in-place version is possible, and 2) the most efficient out-of-place version is significantly faster than copying the input and running the in-place version.
In 1D convolutions, the in-place version would need to use O(filter size) scratch space for lookahead, but this doesn't seem like it would be too significant. However, it might start to become significant in higher-dimensional convolutions.
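To make the 1D case concrete, here is a toy sketch (names and the causal-filter convention are my own choice, not anything fixed above): an in-place causal convolution that only keeps the last `len(w) - 1` original input values in a small ring buffer, since each output overwrites an input that later positions still need.

```python
from collections import deque

def conv1d_inplace(x, w):
    """Causal 1D convolution computed in place on list x:
    out[i] = sum_k w[k] * x[i-k], zero-padded at the left edge.
    Uses only O(len(w)) scratch space for recently overwritten inputs."""
    F = len(w)
    # ring buffer holding the last F-1 *original* input values
    hist = deque([0.0] * (F - 1), maxlen=F - 1)
    for i in range(len(x)):
        cur = x[i]  # original value, before we overwrite it
        acc = w[0] * cur
        for k in range(1, F):
            # hist[-k] is the original x[i-k]; zeros beyond the left edge
            acc += w[k] * hist[-k]
        hist.append(cur)  # remember original x[i] for later positions
        x[i] = acc        # overwrite in place
    return x
```

E.g. `conv1d_inplace([1, 2, 3, 4], [1, 1])` gives `[1, 3, 5, 7]` (each element plus its left neighbor). The scratch space is exactly the filter-size lookbehind mentioned above; in higher dimensions the analogous buffer grows to O(filter size × row/plane size), which is where it could start to hurt.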
In Big-O terms there will not be any difference: copying the data is O(N), and whatever the op does is at least O(N), so the asymptotic cost is unchanged.
But in absolute terms, it could make a difference. Think of y = x + 1 vs y = x.copy(); y += 1: I would expect the former to be slightly faster, since it passes over the memory once instead of twice. But actually I'm not really sure.
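This is easy to measure; a rough NumPy timing sketch (array size and repetition count are arbitrary, and note that plain `y = x` would only alias in NumPy, so the copy has to be explicit):

```python
import timeit
import numpy as np

x = np.random.rand(1_000_000)

# out-of-place: one pass, allocates y and writes x[i] + 1 directly
t_out = timeit.timeit(lambda: x + 1, number=10)

def copy_then_inplace():
    # copy + in-place: two passes over the memory (copy, then increment)
    y = x.copy()
    y += 1
    return y

t_copy = timeit.timeit(copy_then_inplace, number=10)

print(f"out-of-place: {t_out:.4f}s, copy+in-place: {t_copy:.4f}s")
```

Both variants produce the same result; any gap would come purely from the extra memory traffic of the copy, so it should shrink for small arrays that fit in cache.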
Actually, I implemented most of my native ops exactly this way: I implemented the in-place version, and the non-in-place version just copies the input and then calls the in-place version.
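That pattern is just a couple of lines per op; a minimal sketch with a hypothetical `relu` op as the example (not any specific op from above):

```python
import numpy as np

def relu_inplace(x):
    """In-place kernel: clamps negatives to zero, writing into x's own buffer."""
    np.maximum(x, 0, out=x)
    return x

def relu(x):
    """Out-of-place wrapper: copy the input, then reuse the in-place kernel."""
    return relu_inplace(x.copy())
```

The wrapper is trivially correct as long as the in-place kernel is, which is the appeal; the open question above is exactly when a dedicated out-of-place kernel could beat this copy-then-mutate wrapper by more than a constant factor's worth.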
Any particular example that occurs to you?