vec2seq is the chatbot reply core split out of WakeUpScrew. It acts as a question-answering engine over preset replies.
A reply from vec2seq is based on the user's question and a dataset. The dataset can be built from various sources; in the current WakeUpScrew LINE bot it consists mostly of articles from the PTT bulletin board.
We use finalfusion, which is based on FastText and Word2Vec, to provide word embeddings. Training data for the word embeddings can be, but is not limited to, Wikipedia pages.
To handle Chinese text, we use a Rust implementation of jieba for Chinese word segmentation.
After the word embeddings are extracted, all word embeddings in the same sentence are combined into a single sentence embedding with the help of the TF-IDF algorithm.
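The sketch below shows one way the TF-IDF combination could work. It is an illustration, not the project's actual code. The function names `inverse_document_frequency` and `sentence_embedding`, the smoothed IDF formula, and the choice of a normalized weighted average are all assumptions. Word vectors are passed in as a plain `HashMap` rather than loaded through finalfusion.

```rust
use std::collections::HashMap;

/// Inverse document frequency for every token in a tokenized corpus.
/// Assumption: smoothed form ln((1 + N) / (1 + df)) + 1.
fn inverse_document_frequency(corpus: &[Vec<&str>]) -> HashMap<String, f32> {
    let n = corpus.len() as f32;
    let mut df: HashMap<String, f32> = HashMap::new();
    for doc in corpus {
        // Count each token at most once per document.
        let mut seen: Vec<&str> = doc.clone();
        seen.sort_unstable();
        seen.dedup();
        for tok in seen {
            *df.entry(tok.to_string()).or_insert(0.0) += 1.0;
        }
    }
    df.into_iter()
        .map(|(tok, d)| (tok, ((1.0 + n) / (1.0 + d)).ln() + 1.0))
        .collect()
}

/// Weighted average of word vectors, where each weight is tf * idf.
/// Tokens without a vector or an IDF entry are skipped.
fn sentence_embedding(
    tokens: &[&str],
    vectors: &HashMap<String, Vec<f32>>,
    idf: &HashMap<String, f32>,
    dim: usize,
) -> Vec<f32> {
    // Raw term counts within this sentence.
    let mut tf: HashMap<&str, f32> = HashMap::new();
    for &t in tokens {
        *tf.entry(t).or_insert(0.0) += 1.0;
    }

    let mut sum = vec![0.0f32; dim];
    let mut total = 0.0f32;
    for (tok, count) in &tf {
        if let (Some(v), Some(w)) = (vectors.get(*tok), idf.get(*tok)) {
            let weight = (count / tokens.len() as f32) * w;
            for (s, x) in sum.iter_mut().zip(v) {
                *s += weight * x;
            }
            total += weight;
        }
    }
    // Normalize by the total weight so the result is a weighted mean.
    if total > 0.0 {
        for s in sum.iter_mut() {
            *s /= total;
        }
    }
    sum
}
```

With this weighting, rare words pull the sentence embedding toward their own vectors more strongly than common words do. A sentence made only of unknown tokens yields the zero vector.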
To achieve real-time search over more than 2M articles, we use granne*, a Rust library for approximate nearest neighbor search based on Hierarchical Navigable Small World (HNSW) graphs.
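For reference, the sketch below shows what the nearest neighbor query computes, using an exact brute-force scan with cosine similarity. It is not HNSW and does not use the granne API. The names `cosine_similarity` and `top_k` are hypothetical, and cosine similarity is an assumed choice of metric. A linear scan like this costs time proportional to the number of articles on every query, which is the cost an approximate index avoids.

```rust
/// Cosine similarity of two vectors; 0.0 if either has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 {
        0.0
    } else {
        dot / (na * nb)
    }
}

/// Indices of the `k` stored embeddings most similar to `query`,
/// ordered from most to least similar (exact, O(n) per query).
fn top_k(query: &[f32], index: &[Vec<f32>], k: usize) -> Vec<usize> {
    let mut scored: Vec<(usize, f32)> = index
        .iter()
        .enumerate()
        .map(|(i, v)| (i, cosine_similarity(query, v)))
        .collect();
    // Sort by similarity, highest first.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.into_iter().take(k).map(|(i, _)| i).collect()
}
```

A scan like this can also serve as a ground-truth baseline when checking the recall of an approximate index on a small sample.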
Because a user's question is matched to questions in the database by semantics, every reply attached to a matched question can be seen as a proper reply to the user's question. vec2seq randomly chooses one of them as the final answer.