Bipartisan News Digest with Python, Ollama, and Google API

Concept and scaffolding by Francois (grammar edited by AI)

Overview

I wanted a way to regularly consume news coverage from across the political spectrum—without living inside any one outlet’s bubble. The goal was to build a bipartisan news digest that:

  • Pulls articles from a curated mix of conservative and liberal outlets.
  • Summarizes each article with a local LLM (Ollama).
  • Heuristically scores relevance and political bias.
  • Emails a ranked digest every few hours, with links and bias labels.
  • Keeps my reading experience relatively anonymous while still benefiting from a broad set of sources.

An interesting side effect is that this system also generates structured, timestamped metadata about which topics are being covered where, and with what slant—essentially the kind of dataset a news-analytics data broker might like. In my case, it’s purely for personal analysis and experimentation.

The project is implemented in Python, uses requests and BeautifulSoup for scraping, Ollama for summarization, and the Gmail API for delivery. It’s packaged into a Docker container for easy deployment.

Source Selection & Scraping Pipeline

The first step is defining a curated list of news sources, intentionally balanced between conservative, liberal, and more centrist outlets, plus some tech-focused sources:
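The actual outlet list isn't reproduced here, so the entries below are illustrative (foxnews.com and motherjones.com do appear in the project's bias heuristics; the others are placeholders). The structure is what matters: a mapping from base URL to a coarse bias label that later serves as a prior.

```python
import random

# Illustrative source map; foxnews.com and motherjones.com appear in the
# project's bias heuristics, the other entries are placeholders.
SOURCES = {
    "https://www.foxnews.com": "conservative",
    "https://www.motherjones.com": "liberal",
    "https://www.reuters.com": "center",
    "https://arstechnica.com": "tech",
}

def pick_source(rng: random.Random):
    """Pick a random base URL (and its coarse bias label) for this round."""
    url = rng.choice(list(SOURCES))
    return url, SOURCES[url]
```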

Article Link Extraction

On each loop iteration, the script picks a random base URL and scrapes it for article links using BeautifulSoup. Rather than blindly harvesting everything, it applies per-site and generic heuristics to detect article-like URLs while avoiding:

  • Author/index pages.
  • Login/subscribe paths.
  • Tag/category hubs.
  • Podcast/video pages.
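A generic version of that filter can be written as a pure function over the URL. The skip list and the slug rule below are a sketch of the categories above, not the script's exact patterns:

```python
import re
from urllib.parse import urlparse

# Path fragments that almost never lead to an article (illustrative list,
# mirroring the categories the crawler skips).
SKIP_PATTERNS = ("/author/", "/login", "/subscribe", "/tag/", "/category/",
                 "/podcast", "/video")

def looks_like_article(url: str) -> bool:
    """Generic heuristic: article URLs tend to end in a long headline slug."""
    path = urlparse(url).path.lower()
    if any(p in path for p in SKIP_PATTERNS):
        return False
    # Require a reasonably long, hyphenated final path segment (a slug).
    slug = path.rstrip("/").rsplit("/", 1)[-1]
    return bool(re.match(r"^[a-z0-9-]{10,}$", slug)) and "-" in slug
```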

Article Extraction & Summarization with Ollama

Once an article URL is selected, the script:

  1. Downloads the HTML.
  2. Extracts the title (preferring <h1>, falling back to <title> or slug).
  3. Extracts main article text using site-specific selectors and a generic fallback.

Example article text extraction:
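A minimal sketch of that step, assuming BeautifulSoup; the selector table here is hypothetical and the generic fallback simply collects `<p>` tags long enough to be body text:

```python
from bs4 import BeautifulSoup

# Hypothetical per-site CSS selectors; the generic fallback grabs all <p> tags.
SITE_SELECTORS = {
    "foxnews.com": "div.article-body p",
    "motherjones.com": "article p",
}

def extract_article_text(html: str, domain: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    selector = SITE_SELECTORS.get(domain)
    paragraphs = soup.select(selector) if selector else soup.find_all("p")
    # Join paragraph text, dropping boilerplate-short fragments.
    return "\n".join(p.get_text(" ", strip=True)
                     for p in paragraphs
                     if len(p.get_text(strip=True)) > 40)
```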

Summarization is handled by a local Ollama model (e.g., llama3) via its HTTP API. The prompt instructs the model to produce 5–7 concise, factual bullet points with no editorializing:

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
OLLAMA_MODEL = "llama3"
OLLAMA_MAX_CHARS = 8000

def summarize_with_ollama(text: str) -> str:
    # Truncate long articles so the prompt stays within the model's context.
    if len(text) > OLLAMA_MAX_CHARS:
        text = text[:OLLAMA_MAX_CHARS]

    prompt = f"""Summarize the following news article as 5-7 concise, factual
bullet points. Do not editorialize or add opinions.

Article:
{text}
"""

    resp = requests.post(
        OLLAMA_URL,
        json={"model": OLLAMA_MODEL, "prompt": prompt, "stream": False},
        timeout=90,
    )
    data = resp.json()
    return (data.get("response") or "").strip() or "[Ollama error: empty response]"

Relevance Scoring & Bias Detection

Not every article is equally interesting. The script computes a relevance score using a mix of:

  • Keyword groups (U.S. politics, international conflict, tech, courts, etc.).
  • A “sweet spot” on summary length.
  • Extra weight for technical/innovation stories that look like concrete proposals or architectures.

These are compiled into an index and used in relevance_score(summary) to accumulate weighted hits per group, capped by max_hits so a single topic doesn’t dominate the score.
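A simplified sketch of that scoring scheme. The keyword groups, weights, caps, and the length "sweet spot" bounds below are illustrative, not the project's actual values:

```python
import re

# Illustrative keyword groups with weights and per-group hit caps.
KEYWORD_GROUPS = {
    "us_politics": {"words": ["senate", "congress", "white house"],
                    "weight": 2.0, "max_hits": 3},
    "tech":        {"words": ["ai", "chip", "startup"],
                    "weight": 1.5, "max_hits": 3},
}

def relevance_score(summary: str) -> float:
    text = summary.lower()
    score = 0.0
    for group in KEYWORD_GROUPS.values():
        hits = sum(len(re.findall(rf"\b{re.escape(w)}\b", text))
                   for w in group["words"])
        # Cap hits per group so a single topic can't dominate the score.
        score += min(hits, group["max_hits"]) * group["weight"]
    # "Sweet spot" bonus for summaries that are neither too short nor too long.
    if 300 <= len(summary) <= 1200:
        score += 1.0
    return score
```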

Bias Detection

Bias detection is intentionally heuristic and coarse. It combines:

  1. A base label inferred from the domain (e.g. foxnews.com → “conservative”, motherjones.com → “liberal”, etc.).
  2. Phrase-level cues in the generated summary ("woke agenda", "voter suppression", "tax cuts", "universal healthcare", etc.).
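Combining the two signals might look like the sketch below. The domain table and cue phrases come from the examples above, heavily abbreviated; the tie-breaking rule (phrase cues override the domain prior only when they clearly lean one way) is an assumption:

```python
# Domain prior plus phrase cues; both tables are abbreviated illustrations.
DOMAIN_BIAS = {"foxnews.com": "conservative", "motherjones.com": "liberal"}
CUES = {
    "conservative": ["woke agenda", "tax cuts"],
    "liberal": ["voter suppression", "universal healthcare"],
}

def detect_bias(domain: str, summary: str) -> str:
    text = summary.lower()
    counts = {label: sum(text.count(p) for p in phrases)
              for label, phrases in CUES.items()}
    # Phrase cues override the domain prior only when they clearly lean.
    if counts["conservative"] > counts["liberal"]:
        return "conservative"
    if counts["liberal"] > counts["conservative"]:
        return "liberal"
    return DOMAIN_BIAS.get(domain, "center")
```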

Story Clustering Across Outlets

To spot cross-outlet coverage of the same story, the code builds a simple “story key” from normalized title tokens with stopwords removed:
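A minimal sketch of that key, assuming a small stopword list; sorting the kept tokens (so word order between headlines doesn't matter) is a design choice of this sketch, not necessarily the script's:

```python
import re

# Minimal stopword list for illustration; a real list would be larger.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "for", "with"}

def story_key(title: str, max_tokens: int = 6) -> str:
    # Normalize: lowercase, strip punctuation, drop stopwords, then sort
    # tokens so word-order differences between headlines don't matter.
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    kept = sorted(t for t in tokens if t not in STOPWORDS)
    return "-".join(kept[:max_tokens])
```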


Articles with the same story_key are grouped into clusters. In the email digest, the script adds a note like “Also covered by X, Y, Z” and whether those outlets seem to show similar or different framing.

Gmail Digest: HTML Email Layout

Every few hours (configurable via EMAIL_INTERVAL_MINUTES), the script:

  1. De-duplicates stories by URL (keeping the highest-scoring version).
  2. Sorts by relevance score.
  3. Takes the top N (MAX_EMAIL_ITEMS).
  4. Clusters by story key for cross-outlet notes.
  5. Renders a compact HTML table with titles, bullet summaries, bias labels, relevance scores, and links.
  6. Creates the email as a MIME message, base64-encodes it, and sends it via the Gmail API users.messages.send endpoint.

For my personal use, sending to email works well; for something public-facing, a dashboard or web UI would be more appropriate to avoid becoming an accidental "spam cannon."
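The MIME-plus-base64 step can be sketched as follows (the function name and its parameters are illustrative; the `users.messages.send` call shape is the standard google-api-python-client pattern):

```python
import base64
from email.mime.text import MIMEText

def build_digest_message(html_body: str, to_addr: str, subject: str) -> dict:
    """Build the request body for the Gmail API users.messages.send endpoint."""
    msg = MIMEText(html_body, "html")
    msg["To"] = to_addr
    msg["Subject"] = subject
    # Gmail expects the full RFC 2822 message, URL-safe base64-encoded.
    raw = base64.urlsafe_b64encode(msg.as_bytes()).decode()
    return {"raw": raw}

# With an authorized google-api-python-client service object:
#   service.users().messages().send(
#       userId="me", body=build_digest_message(html, to, subj)).execute()
```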

Because the script samples a subset of articles from each site each round, every digest is slightly different, even when run frequently. A longer interval between digests (e.g. 6 hours instead of 2) lets a larger pool of candidate articles accumulate, which makes the "top 100" more competitive and diverse.

Deployment & Dockerization

The crawler is wrapped in a Docker image for portability. In its current form, the email notification system is the main limiting factor for public distribution:

  • Hard-coding email credentials or sending high volumes would be a problem on any shared platform.
  • For a public Docker image, a better pattern would be:
    • Expose a small web dashboard (Flask/FastAPI).
    • Use user-supplied secrets (e.g., via environment variables).
    • Possibly push notifications to a web UI or database instead of email.

For my personal use case, reading the digest in my inbox is ideal, so the Gmail integration is kept.

Future Improvements

This project is a proof-of-concept and very much in early alpha. Some logical next steps:

  1. Richer bias modeling
    Instead of keyword + domain heuristics, feed the full summary (or article excerpt) into a dedicated classification model trained to estimate ideological slant on a continuous spectrum.
  2. Sentiment and framing analysis
    Go beyond “liberal vs conservative” and score:
    • Valence (positive/negative/neutral).
    • Tone (alarmist, neutral, dismissive, etc.).
    • Framing of key actors.
  3. Configurable sources and schedules
    Allow users to:
    • Add/remove sources via a config file or UI.
    • Choose digest frequency and size.
    • Toggle specific topic focus (e.g. only tech, only foreign policy).
  4. Public Docker image with safer notification channel
    Publish an image that:
    • Uses API keys and secrets provided at runtime.
    • Delivers to a dashboard or webhook instead of email by default.
  5. Persistent storage & analytics
    Store summaries, scores, and metadata in a database (e.g. SQLite, Postgres, or Elasticsearch) to:
    • Track coverage and bias over time.
    • Visualize which outlets cover which topics.
    • Explore how framing shifts across events.
