LabelboxToTFRecord

Warning

This project was part of my master's thesis and is no longer being developed. Feel free to fork, of course.

Tip

For a generated Table of Contents, click the Outline button above in the GitHub UI.

LabelboxToTFRecord

This repo contains some tools around working with data from the Labelbox platform, as well as creating and shaping TFRecord files (a binary file format for machine learning training data) for use with TensorFlow. It can:

Download data from Labelbox containing images and labels
Convert downloaded Labelbox data into a .tfrecord file
Perform operations on one or more .tfrecord files such as splitting, merging, shuffling, and listing information

Installation

Installation may be done directly on a machine, via raw Docker, or via a VS Code container image.

Direct Python Installation:

This method supports Tensorflow 2. Use of a pip-compatible virtual environment is encouraged.

Install TensorFlow's Object Detection API following the directions here. Choose the Python package method. The Docker instructions here have not been tested with this repo.
Clone this repo:

git clone git@github.com:xdhmoore/LabelboxToTFRecord.git
cd LabelboxToTFRecord

Run the following:

python3 -m pip install -r requirements.txt

To use, see the usage info for the convert.py script below.

Docker Installation:

Unfortunately, this installation method does not support Tensorflow 2 yet.

Clone the repo:

git clone git@github.com:xdhmoore/LabelboxToTFRecord.git
cd LabelboxToTFRecord

Create a src/config.yaml file based on src/config.yaml.sample. Use the api key and project id found in Labelbox.
Build with docker:

# From project root
docker build -t lb2tf .
mkdir data

Run the following from the project root to download the data from Labelbox and convert it to 2 .tfrecord files via a 20%/80% split. See the Usage info and Examples below for more info:

# This will run convert.py, downloading the data to a ./data folder
docker run --mount "type=bind,src=${PWD}/data,dst=/data" lb2tf --split 80 20 --download --labelbox-dest /data/labelbox --tfrecord-dest /data/tfrecord

Change the mount src to change where the data is downloaded to.

Caution

If you have downloaded a large amount of data in your project, when docker build runs it will copy the data as part of the context which may take a long time. To avoid this, either move downloaded data outside of the project folder before doing a build, use mount settings to save the data outside the project folder to begin with, or use a dockerignore file to ignore the data once downloaded.

Tip

If you encounter permissions denied errors, check to see that docker hasn't created the data directory as root. chown or recreate the directory yourself to fix.

Dev Container Installation

This method is experimental and may have some issues. As it depends on the "raw docker" method, it also is only setup for Tensorflow 1 currently.

Clone the repo, then open VS Code in that folder

git clone git@github.com:xdhmoore/LabelboxToTFRecord.git
cd LabelboxToTFRecord
code .

Click "Trust" the author and then click the "Reopen in Container" button in the bottom right. If this doesn't show up, try selecting "Rebuild and Reopen in Container" from the Command Palette (Ctrl+Shift+P).

Caution

See the Caution above about building the docker image when it copies a large amount of dowloaded data. When you open VS Code, it will run a docker build.

See the usage info around the convert.py script below.

Usage:

The individual script usages are found below. You will want to start with convert.py as it is the script that actually downloads from Labelbox and converts the data to .tfrecords. The rest of the scripts perform manipulations around existing .tfrecord files.

convert.py

usage: convert.py [-h] [--puid PUID] [--api-key API_KEY]
              [--labelbox-dest LABELBOX_DEST]
              [--tfrecord-dest TFRECORD_DEST]
              [--splits SPLITS [SPLITS ...]] [--download]

Convert Labelbox data to TFRecord and store .tfrecord file(s) locally. Saving
images to disk is optional. Create a file "config.yaml" in the current
directory to store Labelbox sensitive data, see "config.yaml.sample" for an
example.

optional arguments:
  -h, --help            show this help message and exit
  --puid PUID           Project Unique ID (PUID) of your Labelbox project,
                    found in URL of Labelbox project home page
  --api-key API_KEY     API key associated with your Labelbox account
  --labelbox-dest LABELBOX_DEST
                    Destination folder for downloaded images and json file
                    of Labelbox labels.
  --tfrecord-dest TFRECORD_DEST
                    Destination folder for downloaded images
  --limit LIMIT         Only retrieve and convert the first LIMIT data items
  --splits SPLITS [SPLITS ...]
                    Space-separated list of integer percentages for
                    splitting the output into multiple TFRecord files
                    instead of one. Sum of values should be <=100.
                    Example: '--splits 10 70' will write 3 files with 10%,
                    70%, and 20% of the data, respectively
  --download            Save the images locally in addition to creating
                    TFRecord(s)

split.py

usage: split.py [-h] infile splits [splits ...]

Split a .tfrecord file into smaller files

positional arguments:
  infile      the .tfrecord file to split
  splits      Space-separated list of integers, the number of records to put
              in each output file. Should add up to the total number of
              records in the input tfrecord file.

optional arguments:
  -h, --help  show this help message and exit

shuffle.py

usage: shuffle.py [-h] tfrfile randfile

Shuffle a .tfrecord file using a random numbers file

positional arguments:
  tfrfile     the .tfrecord file to shuffle
  randfile    A file containing a shuffled sequence of newline-separated
              numbers from 0 to N-1, where N is the number of records in the
              .tfrecord file.

optional arguments:
  -h, --help  show this help message and exit

join.py

usage: join.py [-h] outfile infiles [infiles ...]

Combine several .tfrecord files into a new one

positional arguments:
  outfile     the name of the output file
  infiles     files to be combined

optional arguments:
  -h, --help  show this help message and exit

count.py

usage: count.py [-h] [--total | --categories] infiles [infiles ...]

Display the number of records in each file

positional arguments:
  infiles           files with records to be counted

optional arguments:
  -h, --help        show this help message and exit
  --total, -t       instead of the the total for each file, display the sum
                    total across all files
  --categories, -c  display the number of labels of each category for each
                    file

Examples

Example 1 - Download Labelbox images and convert labels to TFRecord format:

python convert.py PUID API_KEY

This will download all Labelbox data (images, label file) to the folder ./labelbox, and will create a tfrecord file at tfrecord/.tfrecord

Example 2 - If you have a config.yaml file specified in the current directory, you can grab and convert the Labelbox data with a simple:

python convert.py

See src/config.yaml.sample for an example config file.

Example 3 - To split data into two groups, with 30% in the first and 70% in the second...

python convert.py --split 30 70

This is useful for setting up data sets for cross-validation

Example 4 - To split data into two groups, with 30% in the first and 70% in the second, while downloading images locally...

python convert.py --download --split 30 70

Example 5 - You can also split an existing .tfrecord file into smaller pieces with split.py:

python split.py ./10_record_file.tfrecord 3 2 5

This will write 3 new files containing 3, 2, and 5 records, respectively.

Example 6 - To copy a .tfrecord file into a new file, shuffling the records according to a provided random_ints.txt file:

python shuffle.py ./10_record_file.tfrecord random_seq.txt

random_seq.txt should be a file of all the indices into the tfrecord file, [0,N), where N is the number of records in the tfrecord file, each index occurs exactly once, and there is one index per line. This allows you to shuffle the tfrecord file using random data like from random.org.

Example 7 - To copy several .tfrecord files into a new combined file:

python join.py outfile.tfrecord infile1.tfrecord infile2.tfrecord infile3.tfrecord

Example 8 - To display the number of records in each of files pets_train.tfrecord and pets_val.tfrecord...

python count.py pets_train.tfrecord pets_val.tfrecord

Example 9 - To display a table of the number of records in each category (shark, dolphin, etc.) in all files with "train" in the name...

python count.py -c *train*.tfrecord

The above prints something like:

2021-03-04 00:21:34.523331: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2021-03-04 00:21:34.531411: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 1992000000 Hz
2021-03-04 00:21:34.532907: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x42a1fd0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-03-04 00:21:34.532962: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
filename                               total    sealion    person    dolphin    shark    boat
-----------------------------------  -------  ---------  --------  ---------  -------  ------
2021-01-26_cv_a_train_2824.tfrecord     2824          0      3078        712     2512     165
2021-01-26_cv_b_train_2824.tfrecord     2824          1      3188        689     2472     173
2021-01-26_cv_c_train_2824.tfrecord     2824          1      3201        686     2493     173
2021-01-26_cv_d_train_2824.tfrecord     2824          1      2977        729     2488     176
2021-01-26_cv_e_train_2824.tfrecord     2824          1      3132        680     2491     173

Tests

To run the tests:

cd src
python -m unittest

Contributing

This project was part of my master's thesis and is no longer being developed.

License

This project is under an MIT license.

Name		Name	Last commit message	Last commit date
Latest commit History 59 Commits
examples/example_1		examples/example_1
src		src
.devcontainer.json		.devcontainer.json
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LabelboxToTFRecord

Installation

Direct Python Installation:

Docker Installation:

Dev Container Installation

Usage:

convert.py

split.py

shuffle.py

join.py

count.py

Examples

Tests

Contributing

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

LabelboxToTFRecord

Installation

Direct Python Installation:

Docker Installation:

Dev Container Installation

Usage:

convert.py

split.py

shuffle.py

join.py

count.py

Examples

Tests

Contributing

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages