simple_splitter is a handy data-splitter that circumvents the some limitations of scikit-learn's train_test_split():
simple_splitterallows for a straightforward, custom train-dev-test splitting. As a matter of fact, you can have as many splits as you want.simple_splitterhandles very tiny data, included stratified-splitting of classes with a single element. The basic idea is the following :- If there is more or as many examples as splits, make sure each splits has at least one example.
- If there are less examples than splits, prioritise the desired splits (you can set priorities).
Besides, on the contrary to train_test_split(), simple_splitter returns a single list representing the distribution
of splits along your data. A matter of choice 😉
If your data is in a pandas.DataFrame, with columns of length len(df), you can easily create your split column
with simple_splitter:
from simple_splitter.split import split
# Define you splits :
splits = [('train', 0.65), ('dev1', 0.10), ('dev2', 0.10), ('test', 0.15)]
split_column = split(splits=splits, data_length=len(my_df))If you need a stratified split, just passed your stratifying column(s) :
split_column = split(data_length=len(my_df), splits=splits, strats=[my_df['my_stratitying_class_column_1'], ['my_stratitying_class_column_2']])Installation requires no external dependancies.
python -m pip install git+'https://github.com/sven-nm/simple_splitter'Any improvements, suggestions and ideas are welcome !