Case study

Categorizing News Articles

Sector: News & Media
5 articles per second

Creating a custom LLM for applying category tags to news articles.

Problem

Scouted News is an aggregator app for high-quality independent journalism.

The goal was to allow users to browse news by topic, grouping articles under a consistent set of category headings. Because the content came from hundreds of different sources, each with its own approach to categorization (or none at all), doing this consistently with off-the-shelf tools was not feasible.

Approach

Categorizing news articles is a "multi-label classification" problem: for any given piece of content, you want a list of labels that appropriately describe it, and a single article can carry several labels at once.
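To make the multi-label idea concrete, here is a minimal sketch of how labels are commonly encoded and decoded in this kind of setup. It assumes one independent score per label passed through a sigmoid and a fixed 0.5 threshold; the label names, the threshold, and the function names are illustrative assumptions, not details from the project.

```python
import math

# Hypothetical label set; the real category headings are not listed here.
LABELS = ["politics", "business", "science", "sports"]


def sigmoid(x: float) -> float:
    """Map a raw model score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def encode_labels(tags, labels=LABELS):
    """Turn an article's tags into a multi-hot training target."""
    tag_set = set(tags)
    return [1.0 if label in tag_set else 0.0 for label in labels]


def decode_labels(logits, labels=LABELS, threshold=0.5):
    """Keep every label whose independent probability meets the threshold.

    Unlike single-label classification, zero, one, or many labels can pass.
    """
    if len(logits) != len(labels):
        raise ValueError("expected one logit per label")
    return [
        label
        for label, logit in zip(labels, logits)
        if sigmoid(logit) >= threshold
    ]


# An article about science policy might be tagged with two categories.
target = encode_labels(["science", "politics"])
predicted = decode_labels([2.0, -1.0, 0.3, -4.0])
```

The key design point is that each label is scored independently, so assigning "politics" does not reduce the chance of also assigning "science".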

Probable created a custom AI solution by fine-tuning Longformer, a base model optimized for longer documents. We built a training dataset by gathering thousands of well-categorized news articles from across the web. For the model to apply a label accurately, it needs about 600 examples of correct usage, so we filtered the training dataset to include only the categories for which we had at least 600 examples, and only the articles that carried at least one of those categories.
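The filtering step can be sketched roughly as follows. This is an illustration under assumptions, not the pipeline that was actually used: it assumes the dataset is a list of (text, labels) pairs, and the function name and data layout are hypothetical.

```python
from collections import Counter

MIN_EXAMPLES = 600  # support threshold quoted in the text


def filter_by_label_support(articles, min_examples=MIN_EXAMPLES):
    """Keep only labels with enough examples, and articles that retain a label.

    `articles` is assumed to be a list of (text, labels) pairs.
    """
    # Count how many articles carry each label (once per article).
    counts = Counter(
        label for _, labels in articles for label in set(labels)
    )
    kept_labels = {label for label, n in counts.items() if n >= min_examples}

    filtered = []
    for text, labels in articles:
        remaining = [label for label in labels if label in kept_labels]
        if remaining:  # drop articles left with no usable label
            filtered.append((text, remaining))
    return filtered, kept_labels


# Tiny demonstration with a lowered threshold so the effect is visible.
sample = [
    ("article a", ["economy", "local"]),
    ("article b", ["economy"]),
    ("article c", ["obituaries"]),
]
filtered, kept = filter_by_label_support(sample, min_examples=2)
```

Because an article is dropped only when none of its labels survive, every article carrying a kept label stays in the dataset, so the kept labels still meet the threshold after filtering and a single pass is enough.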

After a few days of processing, we had a model which could quickly and reliably categorize any news article.

Outcome

Scouted Media received an AI model which they could run on their own hardware with no ongoing per-use costs. Running on a MacBook Pro with an M4 chip, the model could classify an article in about 0.2 seconds. Running on a CPU-only server on Azure, it could classify an article in about 3 seconds.
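For a rough sense of scale, the per-article latencies above can be converted to throughput. The sketch below assumes articles are processed one at a time with no batching or parallelism, which is an assumption rather than a detail of the deployment; the function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def articles_per_second(latency_seconds: float) -> float:
    """Sequential throughput implied by a per-article latency."""
    if latency_seconds <= 0:
        raise ValueError("latency must be positive")
    return 1.0 / latency_seconds


def articles_per_day(latency_seconds: float) -> int:
    """Whole articles that fit in 24 hours at the given latency."""
    return int(SECONDS_PER_DAY // latency_seconds)


m4_rate = articles_per_second(0.2)   # about 5 articles per second
cpu_rate = articles_per_second(3.0)  # about one article every 3 seconds
cpu_daily = articles_per_day(3.0)    # 28,800 articles per day
```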
