Hello,
In the original implementation of this model, the authors employed a one-hot audio vector of dimension 1024. Unfortunately, the paper does not say much about this one-hot vector or explain its purpose in the model. Given that its dimension is 1024 (= 2^10), and that the authors use 10-bit audio samples, I assume this vector is related to the prediction of each bit in each audio sample. But that's just a guess.
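To make sure I'm framing the question correctly: my understanding is that an embedding layer applied to a quantized sample index is mathematically a row lookup, equivalent to multiplying the 1024-dim one-hot vector by a learned weight matrix. A minimal sketch of that equivalence (the embedding size of 64 is my own hypothetical choice, not from the paper):

```python
import numpy as np

NUM_LEVELS = 1024  # 2**10 levels, assuming 10-bit quantized audio samples

def one_hot(sample: int) -> np.ndarray:
    """Encode a quantized sample index as a 1024-dim one-hot vector."""
    v = np.zeros(NUM_LEVELS, dtype=np.float32)
    v[sample] = 1.0
    return v

EMBED_DIM = 64  # hypothetical embedding width
rng = np.random.default_rng(0)
W = rng.standard_normal((NUM_LEVELS, EMBED_DIM)).astype(np.float32)

sample = 513  # an arbitrary quantized sample value
# Embedding lookup (W[sample]) equals the one-hot vector times the matrix.
assert np.allclose(one_hot(sample) @ W, W[sample])
```

If that equivalence holds, I'd expect the replacement to change mostly memory/compute rather than what the model can express, which is part of what I'm asking about below.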
So, I have two (actually three) questions:
- What is the purpose of the one-hot audio vector in the original implementation?
- Why did you replace the one-hot vector with an embedding layer? What changed in the model behavior with this replacement?
Thank you very much!