Implementation of Tao-Deng-Stonebraker approximate string joins with abbreviations for PostgreSQL.
Approximate string joins are implemented as a C-language extension for PostgreSQL.
The extension consists of three components, each of which introduces a user-defined function. Their definitions are given below.
The extension follows a common pipeline for PostgreSQL C-language extensions.
Before installation, make sure you have PostgreSQL server installed. Cluster state (running or stopped) does not matter.
-
Build
- Open command line. Change working directory to where extension files (this git repository) reside;
- Run
make. You may have to installlibpq-devif it was not installed with PostgreSQL; - Run
make install. You need certain privileges (probablyroot) for this to work, as PostgreSQL binary files are modified.
-
Install
- Login to PostgreSQL as superuser (
postgres); - Run
CREATE EXTENSION "mipt-asj";.
- Login to PostgreSQL as superuser (
To perform JOIN with approximate comparison of strings, you need a set of rules (full form <=> abbreviation).
-- Generate abbreviation rules
CREATE TABLE rules(f VARCHAR, a VARCHAR);
INSERT INTO rules(f, a) (
SELECT f, a FROM mipt_asj.calc_dict(
(SELECT oid FROM pg_catalog.pg_class WHERE relname = 'MY_TABLE'), 'f',
(SELECT oid FROM pg_catalog.pg_class WHERE relname = 'MY_TABLE'), 'a'
)
);
The produced table has a simple form: every tuple is a pair (f, a). You may freely add, delete or modify the rules. Alternatively, instead of generating rules, you may supply your own ones.
The following code conducts approximate string JOIN:
-- Select the strings to join (may return false-positive results)
CREATE TABLE to_join(s1 VARCHAR, s2 VARCHAR);
INSERT INTO to_join(s1, s2) (
SELECT s1, s2 FROM mipt_asj.calc_pairs(
(SELECT oid FROM pg_catalog.pg_class WHERE relname = 'MY_TABLE'), 'c1',
(SELECT oid FROM pg_catalog.pg_class WHERE relname = 'MY_TABLE'), 'c2',
(SELECT oid FROM pg_catalog.pg_class WHERE relname = 'rules'), 'f', 'a',
0.7
)
);
-- JOIN
SELECT DISTINCT t1.s1, t1.s2
FROM to_join AS t1 INNER JOIN to_join AS t2 ON mipt_asj.cmp(
t1.s1,
t1.s2,
(SELECT oid FROM pg_catalog.pg_class WHERE relname = 'rules'), 'f', 'a',
0.7
);
The extension interface is a few user-defined functions. All functions are placed in schema mipt_asj.
mipt_asj.calc_dict(full_OID, full_column, abbr_OID, abbr_column).
Calculates abbreviation rules.
- Call parameters:
full_OID. Full forms table OIDfull_column. Full forms table column nameabbr_OID. Abbreviations table OIDabbr_column. Abbreviations table column name
abbr_OID may be equal to full_OID.
- Returns: table. Each tuple is an abbreviation rule. Fields:
f. Full form of abbreviationa. Abbreviation
The rules are calculated using a trie.
mipt_asj.calc_pairs(1_OID, 1_column, 2_OID, 2_column, rules_OID, rules_full_column, rules_abbr_column, exactness).
Selects equal (in terms of Tao-Deng-Stonebraker (pkduck) metric) string pairs. Some pairs may be falsely considered equal; use cmp to filter out such pairs.
- Call parameters:
1_OID. 1st string set table OID1_column. 1st string set table column name2_OID. 2nd string set table OID2_column. 2nd string set table column namerules_OID. Rules table OIDrules_full_column. Rule's full forms column namerules_abbr_column. Rule's abbreviation column nameexactness. Exactness parameter, a real number in range [0; 1]
Try to set exactness equal to 0.7 if you are not sure about its value.
1_OID may be equal to 2_OID.
- Returns: table. Each tuple is a pair of equal (in terms of the metric mentined above) strings. Fields:
s1. String from table1_OID, column1_columns2. String from table2_OID, column2_column
mipt_asj.cmp(string_1, string_2, rules_OID, rules_full_column, rules_abbr_column, exactness).
Compares two strings using Tao-Deng-Stonebraker (pkduck) metric.
- Call parameters:
string_1. 1st string to comparestring_2. 2nd string to comparerules_OID. Rules table OIDrules_full_column. Rule's full forms column namerules_abbr_column. Rule's abbreviation column nameexactness. Exactness parameter, a real number in range [0; 1]
Try to set exactness equal to 0.7 if you are not sure about its value.
If calc_pairs was used with some exactness value, the same should be set as exactness in call to this function.
- Returns: boolean.
Feel free to open an issue on GitHub!
This is alpha version in development. Both extension API and internals may change. Use cautiously!
The code is available under MIT license.