Formula Classification and Mathematical Token Embeddings

Description

This repository focuses on formula classification and mathematical token embeddings. It provides functionality to process and analyze the MSE dataset, which involves extracting mathematical formulas and assigning semantic types to them. A custom parser turns the LaTeX strings of formulas into a tree-based representation that can be used for further processing. Several variations of the formula classification task are implemented. Token embeddings extracted from the trained classification models can be useful for capturing the meaning of the underlying concepts.
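
To make the "LaTeX string to tree to tokens" idea concrete, here is a minimal, self-contained sketch. It is not the repository's parser: the actual grammar lives in the .lark files under sem_math/grammar, while this toy version uses a hand-written recursive-descent parser over a tiny LaTeX subset. The names Node, lex, parse and tree_tokens are hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a toy formula tree (hypothetical, for illustration only)."""
    label: str
    children: list["Node"] = field(default_factory=list)


# Commands (\frac), single letters, integers, and a few symbols; whitespace is skipped.
TOKEN_RE = re.compile(r"\\[A-Za-z]+|[A-Za-z]|\d+|[{}^_+\-=()]")


def lex(latex: str) -> list[str]:
    return TOKEN_RE.findall(latex)


def parse_primary(tokens, pos):
    """Parse a brace group, a \\frac with two arguments, or a single symbol."""
    tok = tokens[pos]
    if tok == "{":
        children, pos = parse_sequence(tokens, pos + 1, stop="}")
        return Node("group", children), pos + 1  # skip the closing brace
    if tok == r"\frac":
        num, pos = parse_primary(tokens, pos + 1)
        den, pos = parse_primary(tokens, pos)
        return Node("frac", [num, den]), pos
    return Node(tok), pos + 1


def parse_atom(tokens, pos):
    """Parse a primary followed by any number of ^ or _ scripts."""
    node, pos = parse_primary(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("^", "_"):
        label = "sup" if tokens[pos] == "^" else "sub"
        script, pos = parse_primary(tokens, pos + 1)
        node = Node(label, [node, script])
    return node, pos


def parse_sequence(tokens, pos=0, stop=None):
    nodes = []
    while pos < len(tokens) and tokens[pos] != stop:
        node, pos = parse_atom(tokens, pos)
        nodes.append(node)
    return nodes, pos


def parse(latex: str) -> Node:
    children, _ = parse_sequence(lex(latex))
    return Node("formula", children)


def tree_tokens(node: Node) -> list[str]:
    """Flatten the tree into a token list by pre-order traversal."""
    out = [node.label]
    for child in node.children:
        out.extend(tree_tokens(child))
    return out


tokens = tree_tokens(parse(r"\frac{a}{b} = x^2"))
print(tokens)
# ['formula', 'frac', 'group', 'a', 'group', 'b', '=', 'sup', 'x', '2']
```

The sketch does no error handling (for example, for unbalanced braces) and covers only fractions, groups, and sub- and superscripts. Structural labels such as frac and sup become tokens of their own, which is one plausible way for a tree-based tokenizer to preserve layout information that a flat character sequence would lose.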

Technical architecture

overview.ipynb - a general overview of the MSE dataset and an intermediate analysis of the semantic math extraction procedure (it makes use of overview_funcs.py and /images)
utils.py - a script with the preprocessing methods that were used to process the large XML file containing the initial dataset
apply.py - a file that is used to execute data processing procedures
funcs.py - includes most data processing methods and database queries for processing and analysis
mse_db.py - contains the MSE_DB class, which sets up the database connection and provides a useful interface for applying data processing functions from funcs.py to particular subsets of data
sem_math module:
- math_types.py - includes the FormulaContextType and FormulaType classes
- ft_transformer.py - includes the FormulaTreeTransformer class
- math_tokenizer.py - provides the SemMathTokenizer class, which helps create a list of mathematical tokens from a formula tree
- post-thread.py - the PostThread class encapsulates a postThread database entry and facilitates formula extraction
- comparer.py - the Comparer class includes the arbitration mechanism which helps to assign a semantic type to a formula
- /tests - includes unit tests for the FormulaType class methods
- /grammar - contains .lark files with formula parser grammar rules
classification_formulas_binary - contains different binary classification variations that are used to train formula embeddings
classification_formulas_multilabel - contains different multi-label classification variations that are used to train formula embeddings
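
As a rough illustration of how token embeddings can fall out of a classification model, the following sketch trains a tiny binary classifier in pure Python and returns its embedding table. Everything here is an assumption made for the example: the toy samples, the labels, the mean-pooling architecture, and the function name train_token_embeddings are hypothetical and do not reproduce the models in classification_formulas_binary or classification_formulas_multilabel.

```python
import math
import random

# Hypothetical toy data: token lists with a binary label (1 = formula contains "=").
SAMPLES = [
    (["x", "=", "2"], 1),
    (["frac", "a", "b"], 0),
    (["y", "=", "x"], 1),
    (["sup", "x", "2"], 0),
]


def train_token_embeddings(samples, dim=4, epochs=300, lr=0.3, seed=0):
    """Train mean-pooled token embeddings with a logistic output layer via SGD."""
    rng = random.Random(seed)
    vocab = sorted({tok for tokens, _ in samples for tok in tokens})
    emb = {tok: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for tok in vocab}
    w = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    b = 0.0
    losses = []
    for _ in range(epochs):
        total = 0.0
        for tokens, label in samples:
            n = len(tokens)
            # Formula representation: mean of its token embeddings.
            h = [sum(emb[t][k] for t in tokens) / n for k in range(dim)]
            z = sum(wk * hk for wk, hk in zip(w, h)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep log() finite
            total += -(label * math.log(p) + (1 - label) * math.log(1.0 - p))
            g = p - label  # gradient of the cross-entropy loss w.r.t. z
            grad_h = [g * wk for wk in w]  # computed before w is updated
            w = [wk - lr * g * hk for wk, hk in zip(w, h)]
            b -= lr * g
            for t in tokens:
                for k in range(dim):
                    emb[t][k] -= lr * grad_h[k] / n
        losses.append(total / len(samples))
    return emb, losses


embeddings, losses = train_token_embeddings(SAMPLES)
print(len(embeddings), "token vectors of dimension", len(embeddings["="]))
print("mean loss, first vs. last epoch:", losses[0], losses[-1])
```

After training, embeddings maps each token to a vector that was shaped only by the classification objective, which is the general sense in which a classifier can be "used to train" embeddings. A multi-label variant would replace the single logistic output with one output per label.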
