A Python package for syllabifying text data in English or Polish
This Python package contains programs that extract syllabic cues from text data in English or Polish, whether the data consist of a word, a sentence or a whole corpus. The resulting syllabic cues can be used as inputs to train language learning models that accept data in an 'associative learning' style (i.e. each training example fed to the model is an event made up of some cues that trigger some outcomes), such as naive discriminative learning (NDL; https://pyndl.readthedocs.io/en/latest/index.html) or the models provided in the Deep text modelling package (DTM; https://github.com/Adnane017/Deep_text_modelling). One motivation behind the development of this package is a study that aims to compare how well orthographic and phonetic methods can explain response times in reading.
Our syllabication method consists of two stages:
- We transcribe the text into IPA (International Phonetic Alphabet). This is done using the 'Epitran' package (https://github.com/dmort27/epitran/).
- We extract all the syllables from each transcribed word separately. We assume that a syllable is made of one vowel along with the consonants that surround it, so adjacent syllables share the consonants that lie between their vowels. For example, if a word is of the form #cvccvcv# ('c' and 'v' refer to a consonant and a vowel respectively; # marks the beginning or the end of the word), then the syllabic cues will be #cvcc_ccvc_cv#.
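The second stage can be illustrated with a minimal sketch. It is not the package's implementation (the package relies on the compiled patterns passed as syllable_pattern); the function name syllabic_cues and its vowels argument are hypothetical, and the sketch treats every character as either a vowel or a consonant, so it does not handle diphthongs or other multi-character nuclei.

```python
def syllabic_cues(word, vowels, add_boundaries=True):
    """Return overlapping syllables: each vowel plus the consonants around it.

    Hypothetical helper for illustration only; `vowels` is any container of
    single characters that should count as vowels.
    """
    vowel_positions = [i for i, ch in enumerate(word) if ch in vowels]
    if not vowel_positions:
        # No vowel found: keep the whole word as a single cue
        syllables = [word]
    else:
        syllables = []
        for k in range(len(vowel_positions)):
            # Start right after the previous vowel (or at the word start)
            start = vowel_positions[k - 1] + 1 if k > 0 else 0
            # Stop right before the next vowel (or at the word end)
            end = vowel_positions[k + 1] if k + 1 < len(vowel_positions) else len(word)
            syllables.append(word[start:end])
    if add_boundaries:
        syllables[0] = "#" + syllables[0]
        syllables[-1] = syllables[-1] + "#"
    return "_".join(syllables)


print(syllabic_cues("cvccvcv", vowels="v"))  # -> #cvcc_ccvc_cv#
```

Because each syllable extends up to the neighbouring vowels, the consonants between two vowels appear in both adjacent cues, which is what produces the overlap in #cvcc_ccvc_cv#.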
This section presents the main functions that are offered in the package along with some basic examples illustrating their usage. For more detailed examples, see the 'examples' section below.
To syllabify a sentence, use the following function:
syllabify_line(line, ipa_converter, syllable_pattern, not_symbol_pattern, add_boundaries, as_event)
- line: Line to syllabify.
- ipa_converter: Function that converts a string to IPA. For English, use english_to_ipa and for Polish, use polish_to_ipa.
- syllable_pattern: Tuple of compiled regular expressions that match the syllables in a word.
- not_symbol_pattern: Compiled regular expression that matches disallowed characters like punctuation.
- add_boundaries: Whether or not to add hash symbols (#) to the outermost syllables to mark the beginning and end of the word.
- as_event: Whether or not to return a (cues, outcomes) tuple, where the cues are all the syllables and the outcomes are all the words, each joined by underscores. If false, a list of the syllabic cues is returned.
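When both as_event and add_boundaries are true, the '#' markers make it possible to regroup the cues by word. The helper below is a sketch of that idea; group_cues_by_word is a hypothetical convenience function, not part of the package, and the tuple it is applied to is the output of the Polish sentence example in this section.

```python
def group_cues_by_word(cues, outcomes):
    """Pair each word in `outcomes` with the syllabic cues that belong to it.

    Assumes the cues were produced with add_boundaries=True, so that the last
    syllable of every word ends with '#'.
    """
    groups, current = [], []
    for syllable in cues.split("_"):
        current.append(syllable)
        if syllable.endswith("#"):  # '#' closes the current word
            groups.append(current)
            current = []
    # zip stops at the shorter sequence if the two counts ever differ
    return list(zip(outcomes.split("_"), groups))


event = ('#miw_wɔ#_#t͡ɕɛ#_#spɔtk_tkat͡ɕ#', 'miło_cię_spotkać')
for word, syllables in group_cues_by_word(*event):
    print(word, syllables)
```

Returning a list of pairs rather than a dictionary keeps repeated words in a sentence as separate entries.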
For English, the workflow would be:
import syllabifiers.ipa as sy
import regex
sent_en = "Imagination is more important than knowledge"
ipa_converter_en = sy.english_to_ipa
syllable_pattern_en = sy.regex_en_ipa
ENGLISH = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" # list of allowed English characters
not_symbol_pattern_en = regex.compile(f"[^{ENGLISH}]") # Matches disallowed English characters
sy.syllabify_line(line = sent_en,
                  ipa_converter = ipa_converter_en,
                  syllable_pattern = syllable_pattern_en,
                  not_symbol_pattern = not_symbol_pattern_en,
                  add_boundaries = True,
                  as_event = True)
>>> ('#ɪm_mæd͡ʒ_d͡ʒən_nejʃ_ʃən#_#ɪz#_#mɔɹ#_#ɪmp_mpɔɹt_ɹtənt#_#ðæn#_#nɑl_ləd͡ʒ#', 'imagination_is_more_important_than_knowledge')
For Polish:
import syllabifiers.ipa as sy
import regex
sent_pol = "Miło cię spotkać!"
ipa_converter_pol = sy.polish_to_ipa
syllable_pattern_pol = sy.regex_pol_ipa
POLISH = 'aąbcćdeęfghijklłmnńoóprsśtuwyzźżqvxAĄBCĆDEĘFGHIJKLŁMNŃOÓPRSŚTUWYZŹŻQVX' # list of allowed Polish characters
not_symbol_pattern_pol = regex.compile(f"[^{POLISH}]") # Matches disallowed Polish characters
sy.syllabify_line(line = sent_pol,
                  ipa_converter = ipa_converter_pol,
                  syllable_pattern = syllable_pattern_pol,
                  not_symbol_pattern = not_symbol_pattern_pol,
                  add_boundaries = True,
                  as_event = True)
>>> ('#miw_wɔ#_#t͡ɕɛ#_#spɔtk_tkat͡ɕ#', 'miło_cię_spotkać')
There is also a syllabifier that accepts a path to a corpus file as input and generates an event file that contains both the syllabic cues and the outcomes (all the words in each sentence). The function is defined as follows:
syllabify_corpus(corpus_path, event_file_path, ipa_converter, syllable_pattern, not_symbol_pattern, add_boundaries, numcores, chunksize)
- corpus_path: Path to the corpus to syllabify.
- event_file_path: Path to the event file that will be generated, which contains the syllabic cues and outcomes.
- ipa_converter: Function that converts a string to IPA. For English, use english_to_ipa and for Polish, use polish_to_ipa.
- syllable_pattern: Tuple of compiled regular expressions that match the syllables in a word.
- not_symbol_pattern: Compiled regular expression that matches disallowed characters like punctuation.
- add_boundaries: Whether or not to add hash symbols (#) to the outermost syllables to mark the beginning and end of the word.
- numcores: Number of cores to use.
- chunksize: Number of lines each process will work on in parallel.
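To give an intuition for chunksize, the sketch below splits a sequence of lines into consecutive chunks of at most chunksize lines. It only illustrates what the parameter means; how syllabify_corpus actually distributes chunks across processes is not shown here, and iter_chunks is a hypothetical helper.

```python
from itertools import islice


def iter_chunks(lines, chunksize):
    """Yield successive lists of at most `chunksize` lines (illustrative helper)."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, chunksize))
        if not chunk:
            return  # input exhausted
        yield chunk


# A made-up corpus of 300 lines, split with the chunk size used in the workflows
corpus_lines = [f"sentence {i}" for i in range(300)]
chunk_sizes = [len(chunk) for chunk in iter_chunks(corpus_lines, 125)]
print(chunk_sizes)  # -> [125, 125, 50]
```

With numcores workers, each worker would be handed one such chunk at a time, so a larger chunksize means fewer, bigger units of work.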
For English, the workflow is as follows:
import syllabifiers.ipa as sy
import regex
CORPUS_EN = 'Corpus_sample_en.txt'
S2L_EVENT_EN = 'S2L_events_en.gz'
ipa_converter_en = sy.english_to_ipa
syllable_pattern_en = sy.regex_en_ipa
ENGLISH = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" # list of allowed English characters
not_symbol_pattern_en = regex.compile(f"[^{ENGLISH}]") # Matches disallowed English characters
sy.syllabify_corpus(corpus_path = CORPUS_EN,
                    event_file_path = S2L_EVENT_EN,
                    ipa_converter = ipa_converter_en,
                    syllable_pattern = syllable_pattern_en,
                    not_symbol_pattern = not_symbol_pattern_en,
                    add_boundaries = True,
                    numcores = 8,
                    chunksize = 125)
# Print the first 3 lines from the S2L event file
import gzip
with gzip.open(S2L_EVENT_EN, 'rt', encoding='utf-8') as e:
    for x in range(3):
        print(next(e))
>>> '#ðə#_#mejd͡ʒ_d͡ʒəɹ#_#ɪmp_mpækt#_#ɪz#_#jɛt#_#tə#_#kʌm# the_major_impact_is_yet_to_come'
'#ɹejz_zɪŋ#_#mʌn_ni#_#fɔɹ#_#jɔɹ#_#fejv_vəɹ_ɹɪt#_#t͡ʃɛɹ_ɹɪt_ti#_#kæn#_#bi#_#fʌn# raising_money_for_your_favourite_charity_can_be_fun'
'#dɪd#_#ju#_#now# did_you_know'
For Polish:
import syllabifiers.ipa as sy
import regex
CORPUS_POL = 'Corpus_sample_pol.txt'
S2L_EVENTS_POL = 'S2L_events_pol.gz'
ipa_converter_pol = sy.polish_to_ipa
syllable_pattern_pol = sy.regex_pol_ipa
POLISH = 'aąbcćdeęfghijklłmnńoóprsśtuwyzźżqvxAĄBCĆDEĘFGHIJKLŁMNŃOÓPRSŚTUWYZŹŻQVX' # list of allowed Polish characters
not_symbol_pattern_pol = regex.compile(f"[^{POLISH}]") # Matches disallowed Polish characters
sy.syllabify_corpus(corpus_path = CORPUS_POL,
                    event_file_path = S2L_EVENTS_POL,
                    ipa_converter = ipa_converter_pol,
                    syllable_pattern = syllable_pattern_pol,
                    not_symbol_pattern = not_symbol_pattern_pol,
                    add_boundaries = True,
                    numcores = 8,
                    chunksize = 125)
# Print the first 3 lines from the S2L event file
import gzip
with gzip.open(S2L_EVENTS_POL, 'rt', encoding='utf-8') as e:
    for x in range(3):
        print(next(e))
>>> '#nadmj_dmjɛrn_rnɛ#_#napj_pjɛnt͡ɕ_nt͡ɕɛ#_#ɡrup_pɨ#_#mjɛ̃ɕɲ_ɕɲi#_#pɔv_vɔd_duj_jɛ#_#ɲɛpr_prav_vidw_dwɔv_vɛ#_#ust_stavj_vjɛɲ_ɲɛ#_#stavj_vjɛ# nadmierne_napięcie_grupy_mięśni_powoduje_nieprawidłowe_ustawienie_w_stawie'
'#mjɛ̃ɕɲ_ɕɲɛ#_#spast_stɨt͡ʂn_t͡ʂnɛ#_#ɲɛ#_#sɔ̃#_#mjɛ̃ɕɲ_ɕɲam_mi#_#pɔr_raʐ_ʐɔn_nɨm_mi#_#i#_#mɔɡ_ɡɔ̃#_#ɕɛ#_#kurt͡ʂ_rt͡ʂɨt͡ɕ# mięśnie_spastyczne_nie_są_mięśniami_porażonymi_i_mogą_się_kurczyć'
'#t͡sɔ#_#vjɛnt͡s_nt͡sɛj#_#t͡ʂɛ̃st_stɔ#_#ɲɛ#_#sɔ̃#_#mjɛ̃ɕɲ_ɕɲam_mi#_#ɕiln_lnɨm_mi# co_więcej_często_nie_są_mięśniami_silnymi'
To start using the package, copy the folder inside 'package' to your computer and make it your working directory in Python. You will also need to install the following packages:
- epitran
- regex
Adnane Ez-zizi (I am very grateful to Christian Adam for optimising an earlier version of the code)