Building AI-Powered Headline Analysis for HeadlineWise


Back in 2018 I built an app called Map the News for my Flatiron School capstone project. (The project is no longer online, but you can read about the development process here.) Map the News aimed to break users out of their “media bubbles” at a time when siloed media consumption was a rising concern.

Even back in 2018, I was already thinking about next steps — specifically, language analysis in headlines: trying to understand how subjective and provocative language in news headlines, likely crafted to drive engagement via social media, might be shaping our understanding of and sentiment about more complex stories.

In 2018 I didn’t have the tools to follow these ideas very far, but given the explosion of promising LLMs and generative AI tools in 2023-24, it seemed like a great time to get building again!

Enter… HeadlineWise

This week I launched HeadlineWise, a media literacy training tool (and at this point, a bit of an experiment) to see how AI tools might help us recognize subjective and provocative language in headlines. You can read more about the project and its goals here. The repo is also available on GitHub here.

In this post I wanted to share a few details about the building process, and what I’ve learned about the tech stack I chose—especially some challenges using generative AI APIs in production.

The Tech Stack

HeadlineWise is built in Next.js 14 (deployed to Vercel) with Supabase. It retrieves headlines from the News API, and uses Anthropic’s Claude API and the OpenAI API for headline analysis.

Although I’ve worked with the OpenAI API in the past, I initially chose the Claude API for this project based on some promising early test results through the chat UI. After some initial battling with the rate limit (more below), I decided to add a fallback to OpenAI. This opened the door to comparing results generated by different models, and inspired me to add a data reporting page to the app to share some interesting aggregate results.

Data Flow

The app’s data flow works like this:

  • Every day, a cron job triggers a request to the News API to pull headlines from select sources on a list of targeted topics. Some custom logic chooses from the results to try to pick a varied sample from different news sources.
  • After these article previews are saved to Supabase, they’re sent in batches to the Claude API (falling back to OpenAI) for analysis; the results are saved, and the relevant pages in Next.js are revalidated. (A rough sketch of this flow follows the list.)
  • On a daily basis (or as often as possible), I log in to the site as an admin to review the generated analyses (blind to the news source) and mark them “approved” or “rejected.” The status of this human review is visible to users for transparency, and aggregate results (and a more detailed explanation of how I review) are published here.
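
To make those first two steps more concrete, here’s a rough sketch of what a daily cron route handler like this can look like in Next.js 14 with Supabase. This is just my illustration of the pattern, not HeadlineWise’s actual code: the route path, environment variable names, table name, source list, and the pickVariedSample helper are all placeholders.

```ts
// app/api/daily-headlines/route.ts — an illustrative sketch, not the real HeadlineWise code.
import { NextResponse } from "next/server";
import { revalidatePath } from "next/cache";
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
);

// Placeholder for the app's custom sampling logic (varied sources and topics).
function pickVariedSample(articles: any[]) {
  return articles.slice(0, 20);
}

export async function GET(request: Request) {
  // If a CRON_SECRET env var is set, Vercel's cron requests include it as a bearer token.
  if (request.headers.get("authorization") !== `Bearer ${process.env.CRON_SECRET}`) {
    return NextResponse.json({ error: "Unauthorized" }, { status: 401 });
  }

  // 1. Pull the day's headlines from the News API for selected sources.
  const params = new URLSearchParams({
    sources: "associated-press,reuters", // placeholder source list
    pageSize: "50",
    apiKey: process.env.NEWS_API_KEY!,
  });
  const res = await fetch(`https://newsapi.org/v2/top-headlines?${params}`);
  const { articles } = await res.json();

  // 2. Pick a varied sample and save the article previews to Supabase.
  const sample = pickVariedSample(articles);
  const { error } = await supabase.from("articles").insert(
    sample.map((a) => ({
      title: a.title,
      source: a.source?.name,
      url: a.url,
      published_at: a.publishedAt,
    }))
  );
  if (error) {
    return NextResponse.json({ error: error.message }, { status: 500 });
  }

  // 3. Batched analysis (sketched later in this post) runs next, then pages are revalidated.
  revalidatePath("/");
  return NextResponse.json({ saved: sample.length });
}
```

The cron schedule itself would live in vercel.json, with a crons entry pointing at this route once a day.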

Currently I’m storing headlines and analyses indefinitely, but at some point soon I plan to add a process to delete the oldest results on a regular basis.

Key Challenges

Building on top of generative AI came with some fresh challenges—some expected, and some surprising! Here are a few of the most significant:

1. Keeping costs down

Because this is a personal project with no outside funding, keeping costs to a minimum was a major design concern. In most cases I’ve tried to stick to the free/lowest-cost tier of underlying services, and that’s imposed some significant limitations on data flow and feature design:

  • Instead of live search of headlines or showing breaking news as it happens, I retrieve a limited number of headlines once per day.
  • Although I’d like this app to be more interactive in the future, for now there’s no personalization, limited interactivity, and limited data storage while I try to get a feel for how quickly I’m going to approach the limits of Supabase’s free tier.
  • Although originally I’d hoped to use Claude’s opus model (which generated notably better results than sonnet or OpenAI’s gpt-3.5-turbo in some early tests), as soon as I started sending requests to opus in my production app, my function calls started timing out in Vercel. (On the free tier, functions have only 10 seconds to complete.) To avoid upgrade costs, I’m manually sending some requests to opus from a local server to get that data in the mix.

2. Working around generative AI API rate/usage limits and downtime

One major running concern I’ve had about building products on top of generative AI APIs like OpenAI or Claude right now is: how reliable are they, really? These are new and expensive technologies and (in my experience so far) the APIs seem to be a bit unstable.

Without any live interactive features, HeadlineWise didn’t seem like a particularly challenging test — I could generate analyses ahead of time when headlines were pulled and cache the results. But my initial requests were still exceeding Claude’s rate limit (only 50 RPM for Tier 1 usage as I’m writing this). I was initially sending one request per headline, all at once, and found I was only able to get results back from Claude for 3-5 requests at a time.

Two design changes dramatically increased the rate of successful responses:

  1. I started using OpenAI as a fallback in case a request to Claude fails.
  2. I realized I could reduce the number of API requests by batching headlines for analysis and changing my instructions for the task: I updated my system prompt to expect an array of headlines as input and to return the results as an array of JSON objects. (Right now I’m batching 20 headlines per request, and testing larger batch sizes.) One interesting consequence of this choice, however: at least once, I’ve observed that Claude’s opus model appeared to hallucinate in an analysis, inventing a quote that didn’t exist in one headline but might have been related to a different headline in the same batch. I haven’t observed this often enough to reconsider batching, but it’s definitely something I’m watching for now. A rough sketch of the batching-and-fallback pattern follows below.
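
Here’s a rough sketch of what that batching-plus-fallback pattern can look like with the official Anthropic and OpenAI Node SDKs. The prompt wording, function name, and error handling are mine, and the model IDs are just examples of the models mentioned in this post; HeadlineWise’s actual implementation may differ in the details.

```ts
// An illustrative sketch of batched analysis with an OpenAI fallback (not the real HeadlineWise code).
import Anthropic from "@anthropic-ai/sdk";
import OpenAI from "openai";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment
const openai = new OpenAI();       // reads OPENAI_API_KEY from the environment

// Abbreviated placeholder prompt: one request per *batch* of headlines, JSON array out.
const SYSTEM_PROMPT = `You will receive a JSON array of news headlines.
Respond with ONLY a JSON array containing one analysis object per headline, in the same order.`;

export async function analyzeBatch(headlines: string[]): Promise<string> {
  const userMessage = JSON.stringify(headlines);

  try {
    // Primary: Claude (sonnet here, since opus ran into Vercel's 10-second function timeout).
    const msg = await anthropic.messages.create({
      model: "claude-3-sonnet-20240229",
      max_tokens: 4096,
      system: SYSTEM_PROMPT,
      messages: [{ role: "user", content: userMessage }],
    });
    const block = msg.content[0];
    if (block.type === "text") return block.text;
    throw new Error("Unexpected content type in Claude response");
  } catch (err) {
    // Fallback: OpenAI, for when Claude is rate-limited or unavailable.
    const completion = await openai.chat.completions.create({
      model: "gpt-3.5-turbo",
      messages: [
        { role: "system", content: SYSTEM_PROMPT },
        { role: "user", content: userMessage },
      ],
    });
    return completion.choices[0].message.content ?? "[]";
  }
}
```

Batching like this turns 20 headlines into a single request against the 50 RPM limit, at the cost of the occasional cross-contamination between headlines described above.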

3. Unpredictable response formats

Another significant challenge with automated requests to generative AI APIs (and a particularly irksome one for a software engineer) is handling sometimes unpredictable response formats.

I have generally had luck fixing this through system prompting: explicitly describing the JSON format I want in the response, and instructing the model not to prefix the JSON with any other text. But within the JSON itself, there are sometimes still irregularities.
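
For illustration, a format-constraining system prompt along these lines might look something like the following. The field names and categories here are hypothetical placeholders, not HeadlineWise’s actual analysis schema.

```ts
// A hypothetical format-constraining system prompt; the fields and labels are placeholders.
const ANALYSIS_SYSTEM_PROMPT = `
You are analyzing news headlines for subjective and provocative language.
Input: a JSON array of headline strings.
Output: ONLY a JSON array with one object per headline, in the same order.
Each object must have exactly these keys:
  "headline": the original headline text,
  "tone": one of "neutral", "subjective", or "provocative",
  "explanation": one or two sentences justifying the label.
Do not include any text, markdown, or commentary before or after the JSON array.
`;
```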

The only solution I’ve found is to build in some typechecking when I’m processing the response, and to allow some flexibility in tasks like categorization (assuming that, at some point, the AI will likely send back some categories of its own 🤖). There’s just no substitute here for some trial and error.
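
In practice that means defensive parsing. Here’s a minimal sketch of the kind of typechecking and category-coercion I mean, using the same hypothetical fields as the prompt above; the real code has its own schema and edge cases.

```ts
// A sketch of defensive parsing for model responses; fields and categories are placeholders.
type HeadlineAnalysis = {
  headline: string;
  tone: "neutral" | "subjective" | "provocative" | "other";
  explanation: string;
};

const ALLOWED_TONES = new Set(["neutral", "subjective", "provocative"]);

export function parseAnalyses(raw: string): HeadlineAnalysis[] {
  // Models sometimes wrap the JSON in prose or code fences, so pull out the first array literal.
  const match = raw.match(/\[[\s\S]*\]/);
  if (!match) throw new Error("No JSON array found in model response");

  const parsed: unknown = JSON.parse(match[0]);
  if (!Array.isArray(parsed)) throw new Error("Expected a JSON array");

  return parsed.flatMap((item): HeadlineAnalysis[] => {
    // Typecheck each field and skip items too malformed to use.
    if (typeof item !== "object" || item === null) return [];
    const { headline, tone, explanation } = item as Record<string, unknown>;
    if (typeof headline !== "string" || typeof explanation !== "string") return [];

    // The model occasionally invents its own category, so coerce anything unexpected to "other".
    const safeTone =
      typeof tone === "string" && ALLOWED_TONES.has(tone.toLowerCase())
        ? (tone.toLowerCase() as HeadlineAnalysis["tone"])
        : "other";

    return [{ headline, tone: safeTone, explanation }];
  });
}
```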

But… does it work? And how will I know?

After solving all of these technical challenges, there’s one big, uncomfortable, frustratingly subjective question I’m still struggling to answer: does it work?

HeadlineWise “works” in a technical sense — the data flow runs pretty smoothly now. I’m getting new results every day. As a web app, it’s a success (even if there are still little things here and there I’d love to improve, especially with some funding).

But as a product, are these generated analyses reliable enough to depend on? Do they reveal anything useful? I see the rush to bring generative AI into nearly every product I use day-to-day, but the actual usefulness and customer satisfaction aren’t always obvious. 🙃 This technology is still experimental, and as excited as I am to play with it too, I think it’s worth asking: is the usefulness really worth the hype, the expense, and all of the extra work of managing its unpredictability?

In HeadlineWise, I’m trying to answer this question with data, and with human review. I’m currently the sole reviewer deciding whether a generated analysis is acceptable, but I’d like to build public-facing features to compare human sentiment and understanding with generated analyses. Since adding a data reporting page recently, it’s also been interesting to review analysis data in the aggregate, to see how language use and political bias by media organizations line up with my general perceptions of different outlets. Although the “mushiness” of this kind of evaluation is probably always going to bug me, it has been pretty stunning how good the results have been so far on the whole.

Cover photo by Roman Kraft via Unsplash.

Chrissy Hunt

Chrissy Hunt is a software engineer in Brooklyn, NY who loves reading, writing, and chasing after her dog.