Media Cloud News Index Story Guide

A story in the Media Cloud Online News Archive represents a unique URL together with its content and associated metadata, some extracted from the page and some computed. Each story contains the following fields:

ID

The unique ID of this story within the Media Cloud system. It is a hash of a normalized URL, so it both identifies the story uniquely and serves as one layer of our deduplication system.
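As a rough illustration of how a hash of a normalized URL can act as a deduplication key, the sketch below normalizes a URL and hashes the result. The normalization rules and the choice of SHA-256 are assumptions made for illustration only (the sample ID later in this guide is 64 hexadecimal characters long, which is consistent with SHA-256 but does not confirm it). The names normalize_url and story_id are hypothetical and do not describe Media Cloud's actual implementation.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Hypothetical normalization: lowercase scheme and host,
    drop the fragment and any trailing slash on the path."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def story_id(url: str) -> str:
    """Hex SHA-256 digest of the normalized URL (assumed hash, illustrative)."""
    return hashlib.sha256(normalize_url(url).encode("utf-8")).hexdigest()


# Two surface variants of the same address collapse to one ID.
a = story_id("https://Example.com/news/some-story/")
b = story_id("https://example.com/news/some-story#comments")
print(a == b)
print(len(a))
```

Because the ID depends only on the normalized URL, two fetches of the same address produce the same ID, which is what lets a hash like this work as a deduplication layer.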

Media Name

A simplified version of the domain where the story was published. In most cases (99.9% of the time), it represents the website's root domain without including any subdomains.

Media URL

A simplified version of the domain where the story was published. In most cases (99.9% of the time), it represents the website's root domain without including any subdomains.

Title

A cleaned-up version of the article's title as it appears on the webpage. It is extracted via a heuristic search of the HTML and content for the text that a human reader would perceive as the title. The cleanup removes common prefixes and suffixes, such as the publication name.

Publish Date

The date that we think this story was published, based on a set of heuristics that identify candidate publication dates encoded in the HTML or in the text itself. It does not include a time, because few published articles provide one and it is impossible to accurately determine the intended timezone at scale across our system.

URL

The fully resolved URL that returned the content held in the story. Note that this might differ from the original URL shared in an RSS feed or posted to social media, because many URLs redirect to another URL; this field holds the final one that returned the story. The URL is lightly cleaned to remove elements such as UTM social-tracking parameters.
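The kind of light cleaning described above can be sketched as follows. This is a minimal example that assumes tracking parameters are those whose names begin with "utm_"; the exact rules Media Cloud applies are not specified here, and strip_tracking_params is a hypothetical helper name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_tracking_params(url: str) -> str:
    """Drop query parameters whose names start with 'utm_' (assumed rule)."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith("utm_")
    ]
    # Rebuild the URL with only the parameters that were kept.
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(strip_tracking_params(
    "https://example.com/news/story?id=7&utm_source=twitter&utm_medium=social"
))
```

Parameters that are not recognized as tracking codes, such as id in the example, are left in place so the cleaned URL still resolves to the same content.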

Language

The two-letter ISO 639-1 representation of the language the article is in. This is derived algorithmically using language-detection software.

Indexed Date

The date and time this content was captured and processed for insertion into our online news archive. This is helpful for identifying the order in which stories were ingested.

Text

The extracted text content of the story, pulled out of the full HTML using validated libraries. We store it so that users can search against it with Boolean keyword searches. It is not available for download due to copyright restrictions.

Stories in our API

When you query for a story with the API, the response looks like this:

{"id": "f6cdbcd7be5ecc60728831ee52fab75131e575972ec848f4441fd71b581099ae",
 "indexed_date": "2024-02-21 11:05:03.305199+00:00",
 "language": "en",
 "media_name": "theblaze.com",
 "media_url": "theblaze.com",
 "publish_date": "2023-09-09",
 "title": "Florida man arrested after trying to travel across Atlantic in 'human-powered hamster wheel'",
 "url": "https://www.theblaze.com/news/florida-man-arrested-after-trying-to-travel-across-atlantic-in-human-powered-hamster-wheel"
}
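A response like the one above can be loaded with standard JSON and date tools. The sketch below assumes the date fields arrive as strings in exactly the formats shown (a plain ISO date for publish_date and an ISO timestamp with a UTC offset for indexed_date); it uses only the Python standard library and makes no claims about any particular API client.

```python
import json
from datetime import date, datetime

# The sample story from this guide, as a raw JSON string.
raw = """
{"id": "f6cdbcd7be5ecc60728831ee52fab75131e575972ec848f4441fd71b581099ae",
 "indexed_date": "2024-02-21 11:05:03.305199+00:00",
 "language": "en",
 "media_name": "theblaze.com",
 "media_url": "theblaze.com",
 "publish_date": "2023-09-09",
 "title": "Florida man arrested after trying to travel across Atlantic in 'human-powered hamster wheel'",
 "url": "https://www.theblaze.com/news/florida-man-arrested-after-trying-to-travel-across-atlantic-in-human-powered-hamster-wheel"
}
"""

story = json.loads(raw)

# publish_date carries no time; indexed_date is a timezone-aware timestamp.
published = date.fromisoformat(story["publish_date"])
indexed = datetime.fromisoformat(story["indexed_date"])

# Days between publication and ingestion into the archive.
lag = indexed.date() - published
print(story["media_name"], story["language"], published, lag.days)
```

Keeping the two dates as distinct types mirrors the field definitions: the publish date is a calendar day only, while the indexed date can be used to order stories by ingestion time.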

Still have questions?

Send us an email at support@mediacloud.org or fill out our support form.
