Hubert Baginski will present an online talk within the seminar “Analysis of Complex Systems” on July 09, 2021, 3PM-4PM (CET) via Zoom.
If you would like to attend, please email firstname.lastname@example.org
Title: “Automatic Detection and Classification of Suicide-related Content in English Texts”
Media reporting on suicide has repeatedly been shown to be associated with suicide rates. The impact of suicide reporting may not be restricted to harmful effects; rather, stories of coping and recovery in adverse circumstances may have protective effects. Specifically, exposure to media reports about deaths is associated with increases in suicides, suggesting a Werther effect. In contrast, exposure to content describing stories of hope and coping is associated with a decrease in suicides, which has been labeled the Papageno effect. Investigating the impact of suicide reporting requires classifying various characteristics of media items that may have harmful or protective effects, which is time-intensive and challenging. Using natural language processing to classify such texts could facilitate this tedious task. We use the bidirectional language model BERT and compare its performance against TF-IDF and bag-of-words baselines.
We show that deep learning and synthetic data generation make it possible to develop an application capable of processing English texts and detecting specific characteristics of suicide-related content. We describe an effective classification model that lets the user retrieve the predicted label of a specific variable code for a given English input text. Simple binary classification tasks are best solved by a fine-tuned BERT model trained on the original data, achieving F1 scores of 85%-95%, compared to a human performance of F1 = 100%. Intermediate binary classification tasks often benefit from synthetically balancing the data, with F1 scores around 75%-80% (human F1 ~80%). Difficult binary classification and multi-class classification tasks always benefit from synthetically balancing the data. However, which balancing method works best is task-specific, and F1 scores are around 70% (human F1 ~80%) and around 80% (human F1 ~95%), respectively. Our results show that pre-trained bidirectional language models work very well on these tasks. Yet improvements seem to come mostly from bigger models and more data. Synthetically balancing the minority classes provides more training data and improves the model's ability to generalize to new inputs. However, limiting the amount of synthetic data is crucial, since performance appears to tail off when the balance is tipped too far in favour of the synthetic data. Our application will enable researchers to investigate the effect of different characteristics of texts about suicide at large scale and help improve reporting guidelines, thereby contributing to the prevention of suicides.
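To make the two quantitative ideas in the abstract concrete, here is a minimal Python sketch of (a) the F1 score used to report performance and (b) balancing minority classes while capping how much extra data is added. It is an illustration only, not the speaker's implementation: the function names `f1_score` and `balance_with_cap` and the parameter `max_synthetic_ratio` are hypothetical, and random duplication stands in for whatever synthetic text generation the talk actually uses.

```python
import random
from collections import Counter


def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def balance_with_cap(texts, labels, max_synthetic_ratio=1.0, seed=0):
    """Oversample minority classes toward the majority class size.

    Assumption: duplication is a placeholder for real synthetic generation.
    Each class receives at most max_synthetic_ratio * (its original size)
    extra examples, so added data cannot swamp the original data.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_texts, out_labels = list(texts), list(labels)
    for label, n in counts.items():
        pool = [t for t, l in zip(texts, labels) if l == label]
        n_extra = min(target - n, int(max_synthetic_ratio * n))
        for _ in range(n_extra):
            out_texts.append(rng.choice(pool))
            out_labels.append(label)
    return out_texts, out_labels


if __name__ == "__main__":
    texts = ["a", "b", "c", "d", "e", "f"]
    labels = [0, 0, 0, 0, 0, 1]
    _, new_labels = balance_with_cap(texts, labels, max_synthetic_ratio=2.0)
    print(Counter(new_labels))
    print(f1_score([1, 0, 1, 1], [1, 0, 0, 1]))
```

The cap reflects the abstract's observation that performance tails off when synthetic data dominates; the right value of the cap would be task-specific.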