WAR OF WORDS: SCIENTISTS REVEAL HOW TO CREATE THE ULTIMATE WORD LIST FOR DIFFERENT LANGUAGES
Researchers at the Complexity Science Hub have developed LEXpander, a language-independent algorithm that can significantly expand word lists for sentiment analysis in different languages. The tool outperforms previous algorithms and could become a building block for future research projects in various fields.
Word lists are a fundamental element of research in numerous fields. Scientists at the Complexity Science Hub have now designed an algorithm that expands word lists more effectively than other methods and can be adapted to various languages.
Making a word list is a common starting point for projects. The practice is familiar from corporate settings, where it is used to devise mind maps, and it is also widely used across research fields. For instance, if you want to find out on which days people are in a particularly happy mood by analyzing Twitter posts, simply searching for the term “happy” would not be enough.
Instead, you would need an algorithm that recognizes any tweet that expresses happiness.
“So the first step,” elaborates Anna Di Natale, “is to create a list of all the words that indicate just that. The whole research stands or falls on doing so.”
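The difference between searching for a single keyword and matching against a word list can be sketched in a few lines of Python. This is only an illustration: the word list, the example tweets, and the helper names `tokenize` and `matches_word_list` are invented here and do not come from the study.

```python
import re

# Hypothetical mini word list; a real lexicon would contain many more entries.
HAPPY_WORDS = {"happy", "joyful", "delighted", "cheerful", "glad"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def matches_word_list(text, word_list):
    """Return True if any token of the text appears in the word list."""
    return any(token in word_list for token in tokenize(text))


# Invented example posts, for illustration only.
tweets = [
    "So happy about the weekend!",
    "Feeling cheerful and glad today",
    "Stuck in traffic again",
]

# Searching for the single keyword finds only the first post.
keyword_hits = [t for t in tweets if matches_word_list(t, {"happy"})]

# Matching against the whole list also finds the second post.
list_hits = [t for t in tweets if matches_word_list(t, HAPPY_WORDS)]
```

In this toy case the single keyword misses a post that clearly expresses happiness, which is why the quality of the list matters so much.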
But how can the most precise, comprehensive word lists be created?
Accurate sentiment analysis matters not only to opinion researchers who want to gauge how the public receives politicians’ statements but also to companies that want to understand how consumers perceive their products. In response, Anna Di Natale has created a solution named LEXpander, which surpasses previous algorithms in both languages tested, German and English. Furthermore, Di Natale’s methodology makes it possible for the first time to compare and evaluate various sentiment analysis tools.
In the tests, LEXpander outperformed four other algorithms used for word list expansion: WordNet, Empath 2.0, FastText, and GloVe. For instance, when expanding an English word list for positive connotations, LEXpander correctly guessed 43% of the words, while FastText, a widely used model, scored only 28%. The results also showed that LEXpander was especially effective in German.
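One plausible way to read a figure such as "43% of the words" is as the share of words from a reference list that an algorithm recovers when it is given only a few seed words. The article does not spell out the exact metric, so the sketch below is an assumption, and the function name `recovered_share` and all word sets are made up for illustration.

```python
def recovered_share(expanded, reference, seeds):
    """Fraction of held-out reference words present in the expanded list.

    Seed words are excluded, since the algorithm was given them up front.
    """
    held_out = set(reference) - set(seeds)
    if not held_out:
        return 0.0
    return len(held_out & set(expanded)) / len(held_out)


# Invented toy data, not taken from the study.
seeds = {"happy", "glad"}
reference = {"happy", "glad", "joyful", "cheerful", "delighted", "content"}
expanded = {"happy", "glad", "joyful", "cheerful", "sunny"}

# Two of the four held-out reference words are recovered.
score = recovered_share(expanded, reference, seeds)
```

Scoring every algorithm against the same reference list in this way is what makes a side-by-side comparison possible.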
One of the main reasons LEXpander outperformed other algorithms is that it is language-independent. Unlike other methods, LEXpander relies on a colexification network rather than on a single language. Colexification is a linguistic concept based on homonymy and polysemy, that is, on words that have multiple meanings. For instance, the ancient Greek word φάρμακον (pharmakon) can mean both medicine and poison, which are thematically related but distinct concepts. Other words, like “bank,” have multiple unrelated meanings, such as a financial institution or the land beside a river.
“If you collect them across many languages — and here we analyzed about 19 different languages — you can see connections between them,” adds Di Natale.
The network can span many languages because colexifications recur: the same two or more concepts are expressed by a single word in different languages across various language families. Each such case creates a connection between those concepts in the network. As a result of this language-independent approach, LEXpander has demonstrated superior performance across different languages.
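The idea of a colexification network can be illustrated with a small sketch. It is not LEXpander's actual procedure, which the article does not describe in detail: the records are a handful of toy entries, the names `build_network` and `expand` are hypothetical, and the one-step expansion with a `min_languages` threshold is an assumption made for the example. The final step of mapping concepts back to words in a target language is left out.

```python
from collections import defaultdict
from itertools import combinations

# Toy records: (language, word, concepts the word expresses).
# A real colexification database covers far more languages and concepts.
RECORDS = [
    ("Ancient Greek", "pharmakon", {"MEDICINE", "POISON"}),
    ("Spanish", "esperar", {"HOPE", "WAIT"}),
    ("Portuguese", "esperar", {"HOPE", "WAIT"}),
    ("Russian", "ruka", {"HAND", "ARM"}),
]


def build_network(records):
    """Map each concept pair to the set of languages that colexify it."""
    edges = defaultdict(set)
    for language, _word, concepts in records:
        for a, b in combinations(sorted(concepts), 2):
            edges[(a, b)].add(language)
    return edges


def expand(seeds, edges, min_languages=1):
    """Add every concept linked to a seed by at least min_languages languages."""
    expanded = set(seeds)
    for (a, b), languages in edges.items():
        if len(languages) >= min_languages:
            if a in seeds:
                expanded.add(b)
            if b in seeds:
                expanded.add(a)
    return expanded


network = build_network(RECORDS)

# HOPE and WAIT are linked by two languages, so WAIT is added.
result = expand({"HOPE"}, network, min_languages=2)
```

Requiring a link to appear in several languages is one simple way to favor connections that recur across languages over one-off coincidences; whether LEXpander filters its network this way is not stated in the article.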
“There are many methods developed for English. They work very well and quickly and everyone uses them,” adds the creator.
“Trying to apply them to other languages works, but not as well as it might work if you had started developing a method for German or Italian.”
For established topics, reliable word lists usually already exist. New subjects, such as COVID-19, however, require fresh word lists. Traditionally, these were created manually by collaborating with colleagues and employing various tools, but there was no means of comparing their accuracy. Anna Di Natale and her team have now addressed this issue by introducing a tool that outperforms its counterparts and can serve as a building block for research projects in various fields.
The findings of the analysis were published in the journal Behavior Research Methods.