From the associated manuscript:
> In order to predict the SNPs number of strains at their SNP saturation state, the sequencing coverage, sequencing depth, relative abundance, genome length, SNP number, SNP density and saturated SNP number of strains in the aforementioned subsamples and ultra-deep samples were used to construct a data set. Only saturated strains in our data were used here. The data set is divided into training set and test set according to the ratio of 4:1, and the python package scikit-learn (Pedregosa et al., 2011) was used to train a linear regression model and a random forest regression model, respectively.
From this description, it appears that multiple subsamples of the same metagenomes were used for both training and testing. These subsamples are not independent, since they are derived from the same metagenomes. This could lead to data leakage between the train and test subsets, which can result in overfitting and optimistically biased test accuracies.
Can you please provide more information about how the ML train/test split was conducted, especially with regard to the potential for data leakage?
If data leakage has occurred, maybe you could re-train the model while blocking the train/test split by metagenome sample (e.g., 2 of the 3 metagenomes used for training, while the 3rd is held out as the independent test set)?
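For reference, here is a minimal sketch of what a grouped split could look like with scikit-learn's `LeaveOneGroupOut`, assuming a tabular data set roughly like the one described in the manuscript (coverage, depth, relative abundance, genome length, SNP number, SNP density as features; saturated SNP number as the target). The file name, column names, and `metagenome_id` grouping column are hypothetical placeholders, not your actual pipeline:

```python
# Sketch: block the train/test split by metagenome so that subsamples of the
# held-out metagenome never appear in the training set.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import LeaveOneGroupOut

df = pd.read_csv("strain_features.csv")  # hypothetical input table
feature_cols = ["coverage", "depth", "rel_abundance", "genome_length",
                "snp_number", "snp_density"]  # hypothetical column names
X = df[feature_cols].values
y = df["saturated_snp_number"].values
groups = df["metagenome_id"].values  # which metagenome each subsample came from

# Each fold trains on 2 metagenomes and tests on the 3rd.
logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups=groups):
    model = RandomForestRegressor(n_estimators=500, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    held_out = groups[test_idx][0]
    r2 = r2_score(y[test_idx], model.predict(X[test_idx]))
    print(f"held-out metagenome {held_out}: R^2 = {r2:.3f}")
```

With only 3 metagenomes this gives 3 leave-one-out folds; `GroupShuffleSplit` or `GroupKFold` would be the analogous options if more samples were available.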