This repository contains code for training a machine learning model that generates natural-language descriptions of line graphs.
This project uses two pre-existing datasets:
The FigureQA dataset is used to generate a synthetic dataset, while Chart-to-Text is used to generate a natural-language dataset.
To generate the synthetic dataset, place the downloaded folders in data/figureqa (see the README there) and run:
python3 src/synthetic/preprocess.py data/figureqa/X
Flags you can provide:
--unroll-descriptions: By default, if a plot/figure/graph has more than 1 description, they are concatenated. This flag unrolls the descriptions, to create n rows in captions.csv if there are n > 1 descriptions for a given graph.
--replace-subjects: Replaces subjects in descriptions. For example, Red is greater than Blue becomes <A> is greater than <B>. This flag also adds a subject_map column to captions.csv, so for every plot there is a JSON blob string that maps replacements to original subjects.
--description-limit N: Limits description length in sentences. Cannot be present together with --unroll-descriptions.
--synthetic-config PATH_TO_FILE: Provide a config file for custom question templates and desired question types. An example file is provided (synthetic.default.json). For correct forms, check question_to_description in src/synthetic/preprocess.py. For question IDs, check keys in question_type_to_id.
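To make the effect of --unroll-descriptions and --replace-subjects concrete, here is a minimal sketch of the behaviour described above. It is not the code in src/synthetic/preprocess.py: the function names, the placeholder scheme beyond <A> and <B>, and the single-space separator used for concatenation are all assumptions made for illustration.

```python
import json


def replace_subjects(description, subjects):
    """Swap each subject for a placeholder (<A>, <B>, ...) and record the mapping.

    Hypothetical helper: naive string replacement, so a subject that is a
    substring of another subject would need extra care.
    """
    subject_map = {}
    for index, subject in enumerate(subjects):
        placeholder = f"<{chr(ord('A') + index)}>"
        description = description.replace(subject, placeholder)
        subject_map[placeholder] = subject
    return description, subject_map


def build_rows(plot_id, descriptions, unroll=False):
    """Return one row per description when unrolling, else one concatenated row.

    The single-space separator is an assumption; the README only says
    descriptions are concatenated.
    """
    if unroll:
        return [(plot_id, description) for description in descriptions]
    return [(plot_id, " ".join(descriptions))]


text, mapping = replace_subjects("Red is greater than Blue", ["Red", "Blue"])
print(text)                 # <A> is greater than <B>
print(json.dumps(mapping))  # {"<A>": "Red", "<B>": "Blue"}

descriptions = ["Red is greater than Blue.", "Blue is the smoothest."]
print(len(build_rows(0, descriptions)))               # 1
print(len(build_rows(0, descriptions, unroll=True)))  # 2
```

The mapping is serialized with json.dumps here only to mirror the "JSON blob string" stored in the subject_map column.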
The dataset is written to data/processed_synthetic/X.
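If the dataset was generated with --replace-subjects, the subject_map column can be used to recover the original subjects. The sketch below shows one way this might be done with the standard library. Only the subject_map column name comes from this README; the caption column name, the helper names, and the in-memory sample standing in for captions.csv are assumptions.

```python
import csv
import io
import json


def restore_subjects(text, subject_map_json):
    """Undo placeholder replacement using a JSON string of placeholder -> subject."""
    for placeholder, subject in json.loads(subject_map_json).items():
        text = text.replace(placeholder, subject)
    return text


def read_restored(handle, text_column="caption"):
    """Yield the restored text of each row.

    `text_column` is a hypothetical column name; check the header of the
    generated captions.csv for the real one.
    """
    for row in csv.DictReader(handle):
        yield restore_subjects(row[text_column], row["subject_map"])


# Small in-memory stand-in for a generated captions.csv (illustrative data only).
sample = io.StringIO()
writer = csv.writer(sample)
writer.writerow(["caption", "subject_map"])
writer.writerow(["<A> is greater than <B>", json.dumps({"<A>": "Red", "<B>": "Blue"})])
sample.seek(0)

restored = list(read_restored(sample))
print(restored)  # ['Red is greater than Blue']
```

For a real file, pass an open file handle (opened with newline="") instead of the in-memory sample.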
To generate the synthetic "question types" dataset, place the downloaded folders in data/figureqa (see the README there) and run:
python3 src/synthetic/preprocess-question-types.py data/figureqa/X
The subject replacement (--replace-subjects), unrolling (--unroll) and synthetic config (--synthetic-config) flags can be provided here too, as when generating the normal synthetic dataset.
To generate the natural-language dataset, place the downloaded folders in data/charttotext (see the README there) and run:
python3 src/natural/preprocess.py
The dataset is written to data/processed_natural.