Hi @stes and @MMathisLab et al.,
I've been testing CEBRA supervised embeddings and observed that even when the neural data are shuffled, the resulting embeddings remain highly consistent with those for the original (unshuffled) data, with high decoding accuracy and low InfoNCE loss. This suggests that the model is overfitting to the supervised labels, and that high consistency and decoding performance are the null expectation for any sufficiently flexible encoder network. Naturally, this calls into question the soundness of the method as well as the validity of any scientific inferences drawn from the use of such a method.
Edit: This initially suggested that the model might be overfitting to the supervised labels. However, after further analysis and clarification, it is clear this is not a widespread issue, and only occurs in specific circumstances with small datasets, large models and extended training time.
Method:
Experiments were conducted using example rat and monkey data with the provided code for CEBRA supervised embeddings. Embeddings for the original (unshuffled) and shuffled (randomly permuted) neural data were compared, with the supervised labels left intact (unshuffled).
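For reference, a minimal sketch of this shuffle control using CEBRA's scikit-learn style API. The data here are random placeholders and the hyperparameters are illustrative only (the exact settings and the real data loading are in the repository linked below); the `consistency_score` helper and its return values follow my reading of the CEBRA docs.

```python
import numpy as np
import cebra
from cebra import CEBRA

# Placeholder data for a self-contained example; substitute the real neural
# recordings and behavioral labels (e.g. the demo rat hippocampus data).
rng = np.random.default_rng(0)
neural = rng.standard_normal((10000, 120)).astype("float32")  # (timesteps, units)
labels = rng.standard_normal((10000, 3)).astype("float32")    # behavioral variables


def fit_supervised(neural, labels):
    # Hyperparameters are illustrative; the experiments used larger models
    # and longer training (see the linked repository).
    model = CEBRA(
        model_architecture="offset10-model",
        batch_size=512,
        output_dimension=3,
        max_iterations=2000,
        conditional="time_delta",  # condition on the behavioral labels
        verbose=True,
    )
    model.fit(neural, labels)  # labels are left intact in both conditions
    return model.transform(neural)


# Original vs. time-shuffled neural data, same (unshuffled) labels.
embedding_original = fit_supervised(neural, labels)
embedding_shuffled = fit_supervised(neural[rng.permutation(len(neural))], labels)

# Compare the two embeddings, e.g. with CEBRA's consistency score.
scores, pairs, ids = cebra.sklearn.metrics.consistency_score(
    embeddings=[embedding_original, embedding_shuffled],
    between="runs",
)
print(np.mean(scores))
```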
Code & Models:
Code, figures, and saved models for the experiments are available at cebra_shuffle_overfit.
Observations:
- When the neural data are shuffled, the embeddings remain highly consistent with those for the original (unshuffled) data.
- Decoding accuracy remains high.
- InfoNCE loss remains low.
Concerns
- Overfitting:
Experiments strongly indicate that the encoder network for the neural data is readily overfitting to the supervised labels.
Edit: Upon further analysis and discussion, this overfitting behavior is observed only with larger models trained for an extended period. When proper validation protocols, such as using a train/validation split and monitoring loss curves, are employed, this failure mode is detectable and manageable (a minimal sketch of such a check follows after this list).
- Lack of Proper Validation:
The software package doesn't provide any obvious functionality for validating supervised embeddings or testing their generalization, and this issue does not appear to be discussed or addressed in the paper.
Edit: The package does provide validation functionality, which has been made more prominent in the demos. The published results are based on held-out test sets.
- Documentation Issues:
The documentation encourages users to increase the network size ad libitum for a "better embedding" without explicitly discussing potential pitfalls. This effectively suggests that users should overfit embeddings to their data. Documentation reference:
> `num_hidden_units` (int): The number of dimensions to use within the neural network model. Higher numbers slow down training, but make the model more expressive and can result in a better embedding. Especially if you find that the embeddings are not consistent across runs, increase `num_hidden_units` and `output_dimension` to increase the model size and output dimensionality. Default: 32.
- Impact:
These limitations raise serious concerns about the central claims and soundness of the work, especially given the high-impact venue where the paper was published and the seemingly widespread adoption of the method.
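Following up on the validation point above, here is a minimal sketch of the kind of check that catches this failure mode: a contiguous train/validation split with the InfoNCE goodness-of-fit evaluated on held-out data. Data, split fraction, and hyperparameters are illustrative; the `infonce_loss` helper call follows the documented CEBRA sklearn metrics API as I understand it.

```python
import numpy as np
import cebra
from cebra import CEBRA

# Placeholder data; substitute real recordings and behavioral labels.
rng = np.random.default_rng(0)
neural = rng.standard_normal((10000, 120)).astype("float32")
labels = rng.standard_normal((10000, 3)).astype("float32")

# Contiguous split (avoid random splits for time-series data).
split = int(0.8 * len(neural))
neural_train, neural_valid = neural[:split], neural[split:]
labels_train, labels_valid = labels[:split], labels[split:]

model = CEBRA(
    model_architecture="offset10-model",
    batch_size=512,
    output_dimension=3,
    max_iterations=2000,
    conditional="time_delta",
    verbose=True,
)
model.fit(neural_train, labels_train)

# Evaluate the InfoNCE loss on training vs. held-out data.
train_loss = cebra.sklearn.metrics.infonce_loss(
    model, neural_train, labels_train, num_batches=500)
valid_loss = cebra.sklearn.metrics.infonce_loss(
    model, neural_valid, labels_valid, num_batches=500)

# A validation loss much higher than the training loss, or one that stops
# improving while the training loss keeps dropping, signals overfitting:
# reduce the model size (num_hidden_units / output_dimension) or train for
# fewer iterations.
print(train_loss, valid_loss)
```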
Proposed Next Steps
1. Paper Correction and Software Update:
The paper should be corrected and the software package updated to address these limitations—or retracted if the issues cannot be adequately resolved.
2. User Awareness:
You, as the maintainers/authors, should use GitHub and your social media channels to alert users to this issue, ensuring they can make informed decisions about using the method.
Edit: After further review and clarification (see conversation below), it’s clear that the overfitting observed under specific conditions (large models and extended training) is expected behavior that can be managed with proper validation. These statements overestimated the scope of the issue.
I invite you as maintainers, as well as other contributors and users, to review these findings and investigate this problem yourselves. This issue aims to ensure that you and other users are aware of this critical limitation and that necessary actions are taken to address it.
Edit: Demos have been updated to make validation more prominent and documentation on best practices has been added.
Best regards,
Jake Graving