Topic modeling of news articles

  • Feb 13, 2022

Since quite a few things have changed in the world over the last few years, I was wondering if and how these changes are also reflected in the news.

The data used in this analysis are the news articles published by Estonian Public Broadcasting (ERR) from 2016 to 2021.

Since the articles are all in Estonian, they are first translated and then lemmatized for topic modeling. The topics are then extracted using non-negative matrix factorization (NMF).
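As a rough illustration of the factorization step (not the code from the repository), here is a minimal sketch of NMF on a toy document-term matrix. It assumes the documents are already translated and lemmatized; the example documents, the plain word counts, the multiplicative-update scheme, and the helper names (`matmul`, `transpose`, `frobenius_error`, `nmf`) are all my own assumptions for illustration. In practice a library implementation such as scikit-learn's `sklearn.decomposition.NMF` would normally be used instead.

```python
import random
from collections import Counter

# Toy, already-lemmatized documents (hypothetical; not the ERR data).
docs = [
    "team win match coach player",
    "player score goal match team",
    "coach team player season match",
    "hospital patient vaccine virus doctor",
    "virus vaccine patient hospital case",
    "doctor patient virus case vaccine",
]

# Build a document-term count matrix V (documents x vocabulary).
vocab = sorted({word for doc in docs for word in doc.split()})
counts = [Counter(doc.split()) for doc in docs]
V = [[c[word] for word in vocab] for c in counts]


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def transpose(A):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def frobenius_error(V, W, H):
    """Frobenius norm of V - W H, i.e. the reconstruction error."""
    WH = matmul(W, H)
    return sum(
        (v - wh) ** 2 for rv, rwh in zip(V, WH) for v, wh in zip(rv, rwh)
    ) ** 0.5


def nmf(V, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factor V ~ W H with non-negative W and H via multiplicative updates."""
    rng = random.Random(seed)
    W = [[rng.random() + 0.1 for _ in range(n_topics)] for _ in V]
    H = [[rng.random() + 0.1 for _ in V[0]] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * n / (d + eps) for h, n, d in zip(rh, rn, rd)]
             for rh, rn, rd in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * n / (d + eps) for w, n, d in zip(rw, rn, rd)]
             for rw, rn, rd in zip(W, num, den)]
    return W, H


W, H = nmf(V, n_topics=2)

# Rows of H are topics over words; rows of W are documents over topics.
top_words = [
    [vocab[i] for i in sorted(range(len(vocab)), key=lambda i: -row[i])[:3]]
    for row in H
]
doc_topics = [max(range(len(row)), key=lambda k: row[k]) for row in W]

for k, words in enumerate(top_words):
    print(f"topic {k}: {', '.join(words)}")
print("dominant topic per document:", doc_topics)
```

Each row of `H` can be read as a topic (a weighting over words), and each row of `W` says how strongly a document loads on each topic. Naming the topics, for example "healthcare" or "football", is still a manual step based on the top words.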

I ended up with 15 topics: basketball, culture, crime, economy, education, film, football, foreign politics, healthcare, music, other sports (track and field, skiing, etc.), politics, sports general (a team has a new coach, someone joins a new club, etc.), tennis, and weather.

Below is a chart showing the total number of news articles published per topic. Some topics are clearly seasonal (sports, education), some show a trend (crime, music), and some are linked to current events (politics, healthcare).

The pandemic has clearly left its mark -- the healthcare topic skyrocketed at the beginning of 2020 and has since remained at a relatively high level. The economy also got more coverage in March 2020, while sports-related topics faded into the background.
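The counts behind a chart like this can be computed with a simple aggregation once every article has a topic assigned. The sketch below assumes monthly buckets and a list of (publication date, topic) pairs; the records and the variable names are made up for illustration and are not taken from the ERR data.

```python
from collections import Counter
from datetime import date

# Hypothetical records: (publication date, assigned topic) for each article.
articles = [
    (date(2020, 2, 3), "sports general"),
    (date(2020, 2, 17), "healthcare"),
    (date(2020, 3, 2), "healthcare"),
    (date(2020, 3, 9), "healthcare"),
    (date(2020, 3, 12), "economy"),
    (date(2020, 3, 20), "healthcare"),
]

# Count articles per (month, topic); months are keyed as "YYYY-MM".
monthly_counts = Counter(
    (published.strftime("%Y-%m"), topic) for published, topic in articles
)

# Pivot into one time series per topic, filling missing months with zero.
months = sorted({month for month, _ in monthly_counts})
topics = sorted({topic for _, topic in monthly_counts})
series = {t: [monthly_counts[(m, t)] for m in months] for t in topics}

for topic, counts in series.items():
    print(topic, dict(zip(months, counts)))
```

Each entry in `series` is then one line in the chart, with `months` on the x-axis.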

The How

https://github.com/kadrilenk/Projects/tree/main/news