Currently, the PosthocLinearCBM is trained in a very odd way: it fits sklearn's SGDClassifier and then copies the learned weights into the torch weight matrices. Why? The code says "for pedagogical reasons", but why not just use torch.optim.SGD and avoid 20+ lines of unnecessary code?
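For illustration, here is a rough sketch of what the torch-only version could look like. It is not the repository's code: the function name `fit_linear_probe`, its hyperparameters, and the synthetic data are all hypothetical, and I am assuming the classifier is a single `nn.Linear` layer trained on precomputed concept features with a cross-entropy loss.

```python
import torch
import torch.nn as nn


def fit_linear_probe(concept_feats, labels, n_classes,
                     lr=0.1, epochs=200, weight_decay=1e-4):
    """Fit a linear classifier on concept features with full-batch SGD.

    Hypothetical helper: names and default values are assumptions,
    not taken from the PosthocLinearCBM implementation.
    """
    classifier = nn.Linear(concept_feats.shape[1], n_classes)
    # weight_decay adds an L2 penalty on the parameters.
    optimizer = torch.optim.SGD(classifier.parameters(), lr=lr,
                                weight_decay=weight_decay)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(classifier(concept_feats), labels)
        loss.backward()
        optimizer.step()
    return classifier


# Synthetic stand-in for concept activations: the label depends only on
# the sign of the first feature, so the data is linearly separable.
torch.manual_seed(0)
X = torch.randn(200, 5)
y = (X[:, 0] > 0).long()

clf = fit_linear_probe(X, y, n_classes=2)
acc = (clf(X).argmax(dim=1) == y).float().mean().item()
```

One caveat: `weight_decay` only gives an L2 penalty. If the sklearn configuration relies on a different penalty (for example L1 or elastic net), that term would have to be added to the loss by hand, which may be part of why sklearn was used in the first place.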