Warning
This project was part of my master's thesis and is no longer being developed. Feel free to fork, of course.
Tip
For a generated Table of Contents, click the Outline button above in the GitHub UI.
This repo contains some tools around working with data from the Labelbox platform, as well as creating and shaping TFRecord files (a binary file format for machine learning training data) for use with TensorFlow. It can:
- Download data from Labelbox containing images and labels
- Convert downloaded Labelbox data into a .tfrecord file
- Perform operations on one or more .tfrecord files such as splitting, merging, shuffling, and listing information
Installation may be done directly on a machine, via raw Docker, or via a VS Code container image.
This method supports Tensorflow 2. Use of a pip-compatible virtual environment is encouraged.
- Install TensorFlow's Object Detection API following the directions here. Choose the Python package method. The Docker instructions here have not been tested with this repo.
- Clone this repo:
git clone git@github.com:xdhmoore/LabelboxToTFRecord.git
cd LabelboxToTFRecord
- Run the following:
python3 -m pip install -r requirements.txt
- To use, see the usage info for the
convert.pyscript below.
Unfortunately, this installation method does not support Tensorflow 2 yet.
- Clone the repo:
git clone git@github.com:xdhmoore/LabelboxToTFRecord.git
cd LabelboxToTFRecord
- Create a src/config.yaml file based on src/config.yaml.sample. Use the api key and project id found in Labelbox.
- Build with docker:
# From project root
docker build -t lb2tf .
mkdir data
- Run the following from the project root to download the data from Labelbox and convert it to 2 .tfrecord files via a 20%/80% split. See the Usage info and Examples below for more info:
# This will run convert.py, downloading the data to a ./data folder
docker run --mount "type=bind,src=${PWD}/data,dst=/data" lb2tf --split 80 20 --download --labelbox-dest /data/labelbox --tfrecord-dest /data/tfrecord
Change the mount src to change where the data is downloaded to.
Caution
If you have downloaded a large amount of data in your project, when docker build runs it will copy the data as part of the context which may take a long time. To avoid this, either move downloaded data outside of the project folder before doing a build, use mount settings to save the data outside the project folder to begin with, or use a dockerignore file to ignore the data once downloaded.
Tip
If you encounter permissions denied errors, check to see that docker hasn't created the data directory as root. chown or recreate the directory yourself to fix.
This method is experimental and may have some issues. As it depends on the "raw docker" method, it also is only setup for Tensorflow 1 currently.
- Clone the repo, then open VS Code in that folder
git clone git@github.com:xdhmoore/LabelboxToTFRecord.git
cd LabelboxToTFRecord
code .
- Click "Trust" the author and then click the "Reopen in Container" button in the bottom right. If this doesn't show up, try selecting "Rebuild and Reopen in Container" from the Command Palette (Ctrl+Shift+P).
Caution
See the Caution above about building the docker image when it copies a large amount of dowloaded data. When you open VS Code, it will run a docker build.
- See the usage info around the convert.py script below.
The individual script usages are found below. You will want to start with convert.py as it is the script that actually downloads from Labelbox and converts the data to .tfrecords. The rest of the scripts perform manipulations around existing .tfrecord files.
usage: convert.py [-h] [--puid PUID] [--api-key API_KEY]
[--labelbox-dest LABELBOX_DEST]
[--tfrecord-dest TFRECORD_DEST]
[--splits SPLITS [SPLITS ...]] [--download]
Convert Labelbox data to TFRecord and store .tfrecord file(s) locally. Saving
images to disk is optional. Create a file "config.yaml" in the current
directory to store Labelbox sensitive data, see "config.yaml.sample" for an
example.
optional arguments:
-h, --help show this help message and exit
--puid PUID Project Unique ID (PUID) of your Labelbox project,
found in URL of Labelbox project home page
--api-key API_KEY API key associated with your Labelbox account
--labelbox-dest LABELBOX_DEST
Destination folder for downloaded images and json file
of Labelbox labels.
--tfrecord-dest TFRECORD_DEST
Destination folder for downloaded images
--limit LIMIT Only retrieve and convert the first LIMIT data items
--splits SPLITS [SPLITS ...]
Space-separated list of integer percentages for
splitting the output into multiple TFRecord files
instead of one. Sum of values should be <=100.
Example: '--splits 10 70' will write 3 files with 10%,
70%, and 20% of the data, respectively
--download Save the images locally in addition to creating
TFRecord(s)
usage: split.py [-h] infile splits [splits ...]
Split a .tfrecord file into smaller files
positional arguments:
infile the .tfrecord file to split
splits Space-separated list of integers, the number of records to put
in each output file. Should add up to the total number of
records in the input tfrecord file.
optional arguments:
-h, --help show this help message and exit
usage: shuffle.py [-h] tfrfile randfile
Shuffle a .tfrecord file using a random numbers file
positional arguments:
tfrfile the .tfrecord file to shuffle
randfile A file containing a shuffled sequence of newline-separated
numbers from 0 to N-1, where N is the number of records in the
.tfrecord file.
optional arguments:
-h, --help show this help message and exit
usage: join.py [-h] outfile infiles [infiles ...]
Combine several .tfrecord files into a new one
positional arguments:
outfile the name of the output file
infiles files to be combined
optional arguments:
-h, --help show this help message and exit
usage: count.py [-h] [--total | --categories] infiles [infiles ...]
Display the number of records in each file
positional arguments:
infiles files with records to be counted
optional arguments:
-h, --help show this help message and exit
--total, -t instead of the the total for each file, display the sum
total across all files
--categories, -c display the number of labels of each category for each
file
- Example 1 - Download Labelbox images and convert labels to TFRecord format:
python convert.py PUID API_KEY
This will download all Labelbox data (images, label file) to the folder ./labelbox, and will create a tfrecord file at tfrecord/.tfrecord
- Example 2 - If you have a config.yaml file specified in the current directory, you can grab and convert the Labelbox data with a simple:
python convert.py
See src/config.yaml.sample for an example config file.
- Example 3 - To split data into two groups, with 30% in the first and 70% in the second...
python convert.py --split 30 70
This is useful for setting up data sets for cross-validation
- Example 4 - To split data into two groups, with 30% in the first and 70% in the second, while downloading images locally...
python convert.py --download --split 30 70
- Example 5 - You can also split an existing .tfrecord file into smaller pieces with split.py:
python split.py ./10_record_file.tfrecord 3 2 5
This will write 3 new files containing 3, 2, and 5 records, respectively.
- Example 6 - To copy a .tfrecord file into a new file, shuffling the records according to a provided random_ints.txt file:
python shuffle.py ./10_record_file.tfrecord random_seq.txt
random_seq.txt should be a file of all the indices into the tfrecord file,
[0,N), where N is the number of records in the tfrecord file, each index
occurs exactly once, and there is one index per line. This allows you to
shuffle the tfrecord file using random data like from random.org.
- Example 7 - To copy several
.tfrecordfiles into a new combined file:
python join.py outfile.tfrecord infile1.tfrecord infile2.tfrecord infile3.tfrecord
- Example 8 - To display the number of records in each of files
pets_train.tfrecordandpets_val.tfrecord...
python count.py pets_train.tfrecord pets_val.tfrecord
- Example 9 - To display a table of the number of records in each category (shark, dolphin, etc.) in all files with "train" in the name...
python count.py -c *train*.tfrecord
The above prints something like:
2021-03-04 00:21:34.523331: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2021-03-04 00:21:34.531411: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 1992000000 Hz
2021-03-04 00:21:34.532907: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x42a1fd0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-03-04 00:21:34.532962: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
filename total sealion person dolphin shark boat
----------------------------------- ------- --------- -------- --------- ------- ------
2021-01-26_cv_a_train_2824.tfrecord 2824 0 3078 712 2512 165
2021-01-26_cv_b_train_2824.tfrecord 2824 1 3188 689 2472 173
2021-01-26_cv_c_train_2824.tfrecord 2824 1 3201 686 2493 173
2021-01-26_cv_d_train_2824.tfrecord 2824 1 2977 729 2488 176
2021-01-26_cv_e_train_2824.tfrecord 2824 1 3132 680 2491 173
To run the tests:
cd src
python -m unittest
This project was part of my master's thesis and is no longer being developed.
This project is under an MIT license.