My research consists of interdisciplinary collaborations with colleagues in quantitative linguistics, cognitive linguistics, psycholinguistics, computational linguistics and phonetics. I’m interested in understanding a wide range of linguistic phenomena (particularly, but not limited to, speech communication and acquisition) by applying quantitative methods such as measuring linguistic complexity, building language models and running statistical analyses. Below is a list of the research projects I’ve participated in.
Linguistic complexity and information
The main objective of this project (PI: François Pellegrino) is to understand the way information is encoded and conveyed by speakers in human communication.
In our study (Coupé et al. 2019), we show that the coupling between language-level (information per syllable) and speaker-level (speech rate) properties results in languages encoding similar information rates (~39 bits/s) despite wide differences in each property individually:
Languages are more similar in information rates than in Shannon information or speech rate.
These findings suggest that the encoding and transmission strategy of information during speech communication results from the intimate feedback loops between languages’ structural properties and their speakers’ neurocognition and biology under communicative pressures.
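The coupling described above can be illustrated with a minimal sketch, assuming that information rate (bits/s) is the product of information per syllable (bits/syllable) and speech rate (syllables/s). The four "languages" and their values below are hypothetical and chosen only for illustration; they are not data from Coupé et al. (2019), and the function names are my own.

```python
import statistics

# Hypothetical values for illustration only (NOT data from the study):
# language -> (information per syllable in bits, speech rate in syllables/s)
LANGUAGES = {
    "A": (5.0, 8.0),
    "B": (6.0, 6.5),
    "C": (7.0, 5.6),
    "D": (8.0, 4.9),
}


def information_rate(bits_per_syllable: float, syllables_per_second: float) -> float:
    """Information rate in bits/s, assumed to be density times speech rate."""
    return bits_per_syllable * syllables_per_second


def coefficient_of_variation(values: list[float]) -> float:
    """Population standard deviation divided by the mean (unitless spread)."""
    return statistics.pstdev(values) / statistics.mean(values)


densities = [density for density, _ in LANGUAGES.values()]
speech_rates = [rate for _, rate in LANGUAGES.values()]
info_rates = [information_rate(d, r) for d, r in LANGUAGES.values()]

if __name__ == "__main__":
    for name, values in [
        ("information per syllable", densities),
        ("speech rate", speech_rates),
        ("information rate", info_rates),
    ]:
        print(f"{name}: CV = {coefficient_of_variation(values):.3f}")
```

In this toy setting, languages with denser syllables are spoken more slowly, so the relative spread of the information rates is much smaller than the spread of either property taken alone.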
◦ Keywords
Information rate, information theory, language universals, linguistic complexity, syllables
◦ Main publications
[1] Oh, Y., & Pellegrino, F. (2023). Towards robust complexity indices in linguistic typology: A corpus-based assessment, Studies in Language, 47(4), 789-829. https://doi.org/10.1075/sl.22034.oh
[2] Coupé, C.*, Oh, Y.*, Dediu, D., & Pellegrino, F. (2019). Different languages, similar encoding efficiency: Comparable information rates across the human communicative niche, Science Advances, 5(9), eaaw2594. https://doi.org/10.1126/sciadv.aaw2594
*These authors contributed equally
⁃ Listed among “Our favorite science news stories of 2019”
[3] Oh, Y., Coupé, C., Marsico, E., & Pellegrino, F. (2015). Bridging phonological system and lexicon: insights from a corpus study of functional load, Journal of Phonetics, 53, 153-176. https://doi.org/10.1016/j.wocn.2015.08.003
Statistical learning with and without a lexicon
Most non-Māori-speaking New Zealanders are regularly exposed to Māori throughout their lives without seeming to build any extensive Māori lexicon.
This project (PI: Jen Hay) aims to investigate non-Māori-speaking New Zealanders’ Māori proto-lexicon (implicit knowledge of the existence of Māori words and sub-word units without any associated meaning).
We show that, by statistically generalizing over this proto-lexicon, they can distinguish real words from highly Māori-like nonwords, and they can rate the well-formedness of nonwords as accurately as fluent Māori speakers.
They can also readily identify many more Māori words than they can define, and the number of words they can reliably define is quite small.
These results suggest that adults can possess a large pre-semantic proto-lexicon of a language to which they are regularly exposed.
◦ Keywords
Implicit word knowledge, Māori language, phonotactics, proto-lexicon, second-language acquisition
◦ Main publications
Oh, Y., Todd, S., Beckner, C., Hay, J., & King, J. (2023). Assessing the size of non-Māori-speakers’ active Māori lexicon, PLoS ONE, 18(8), e0289669. https://doi.org/10.1371/journal.pone.0289669
Oh, Y., Todd, S., Beckner, C., Hay, J., King, J., & Needle, J. (2020). Non-Māori-speaking New Zealanders have a Māori proto-lexicon, Scientific Reports, 10, 22318. https://doi.org/10.1038/s41598-020-78810-4
⁃ One of the top 100 downloaded papers for Scientific Reports in 2020
Information density and the predictability of phonetic structure
The aim of this project (PI: Bernd Möbius) is to investigate the relation between information density (quantified in terms of surprisal) and linguistic encoding in phonetics (such as segmental duration, vowel space size and spectral characteristics of vowels and consonants).
We test the underlying hypothesis that speakers modulate the density of phonetic encoding in order to maintain a balanced distribution of information.
Our findings are generally compatible with a weak version of the Smooth Signal Redundancy (SSR) hypothesis (Aylett & Turk 2004, 2006; Turk 2010), suggesting that prosodic structure mediates between the requirements of efficient communication and the speech signal. However, we also find evidence for additional, direct effects of changes in predictability on the phonetic structure of utterances.
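As a rough illustration of how information density can be quantified as surprisal, the sketch below computes the surprisal of a unit given the preceding unit, -log2 P(unit | previous unit), from an unsmoothed bigram model. The toy syllable corpus, the `<s>` start marker, and the function names are assumptions made for this example; they do not reproduce the language models used in the project.

```python
import math
from collections import Counter

START = "<s>"  # hypothetical utterance-initial marker


def train_bigram(sequences):
    """Count bigrams and their left contexts over sequences of units."""
    bigram_counts = Counter()
    context_counts = Counter()
    for sequence in sequences:
        padded = [START] + list(sequence)
        for previous, current in zip(padded, padded[1:]):
            bigram_counts[(previous, current)] += 1
            context_counts[previous] += 1
    return bigram_counts, context_counts


def surprisal(previous, current, bigram_counts, context_counts):
    """Surprisal in bits: -log2 P(current | previous), without smoothing."""
    count = bigram_counts[(previous, current)]
    if count == 0:
        # An unsmoothed model assigns zero probability to unseen bigrams.
        raise ValueError(f"unseen bigram: {previous!r} -> {current!r}")
    return -math.log2(count / context_counts[previous])


# Toy corpus of syllable sequences, for illustration only.
CORPUS = [["ba", "na", "na"], ["ba", "ba"], ["na", "ba", "na"]]
BIGRAMS, CONTEXTS = train_bigram(CORPUS)

if __name__ == "__main__":
    for previous, current in [("ba", "na"), ("ba", "ba"), ("na", "ba")]:
        bits = surprisal(previous, current, BIGRAMS, CONTEXTS)
        print(f"surprisal({current!r} | {previous!r}) = {bits:.3f} bits")
```

Less predictable units receive higher surprisal. In this toy corpus "ba" after "ba" is rarer than "na" after "ba", so it carries more bits; a realistic model would need smoothing and a much larger corpus.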
◦ Keywords
Information density, phonetics, segments, surprisal, syllables
◦ Main publication
Malisz, Z., Brandt, E., Möbius, B., Oh, Y., & Andreeva, B. (2018). Dimensions of segmental variability: interaction of prosody and surprisal in six languages, Frontiers in Communication, 3(25). https://doi.org/10.3389/fcomm.2018.00025