Media Cloud News Index Story Guide
A story in the Media Cloud Online News Archive represents a unique URL, together with its content and the associated metadata that we compute or extract. A story record includes the following fields:
{
"id": "f6cdbcd7be5ecc60728831ee52fab75131e575972ec848f4441fd71b581099ae",
"media_name": "theblaze.com",
"media_url": "theblaze.com",
"title": "Florida man arrested after trying to travel across Atlantic in 'human-powered hamster wheel'",
"publish_date": "2023-09-09",
"url": "https://www.theblaze.com/news/florida-man-arrested-after-trying-to-travel-across-atlantic-in-human-powered-hamster-wheel",
"language": "en",
"indexed_date": "2024-02-21 11:05:03.305199",
"text": "..."
}
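If you work with story records in code, it can help to load them into a typed structure. The following is a minimal Python sketch, not an official client: the `Story` class and `parse_story` function are hypothetical names, and the sketch assumes every record carries the fields shown in the example, with dates in the formats shown there.

```python
import json
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class Story:
    """One story record, using the field names from the example record."""
    id: str
    media_name: str
    media_url: str
    title: str
    publish_date: date         # date only, no time component
    url: str
    language: str
    indexed_date: datetime     # naive datetime; no time zone is given
    text: str


def parse_story(raw: str) -> Story:
    """Parse a JSON story record and convert its two date fields."""
    record = json.loads(raw)
    return Story(
        id=record["id"],
        media_name=record["media_name"],
        media_url=record["media_url"],
        title=record["title"],
        publish_date=date.fromisoformat(record["publish_date"]),
        indexed_date=datetime.fromisoformat(record["indexed_date"]),
        url=record["url"],
        language=record["language"],
        # Assumption: treat a missing text field as empty rather than an error.
        text=record.get("text", ""),
    )
```

Converting the dates up front means later comparisons and sorting operate on real date objects rather than strings.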
ID
The unique ID of this story within the Media Cloud system. It is a hash of the story's normalized URL, which also serves as one layer of our deduplication system. Use it whenever you need to refer to a specific story unambiguously.
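To illustrate the general idea of hashing a normalized URL, here is a hedged Python sketch. It will not reproduce real Media Cloud IDs: the normalization rules below are hypothetical, and SHA-256 is an assumption suggested only by the 64-character hexadecimal ID in the example record.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Hypothetical normalization: lowercase the scheme and host,
    drop any trailing slash from the path, and drop the fragment."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def story_id(url: str) -> str:
    """Return a hex digest of the normalized URL (SHA-256 is assumed)."""
    normalized = normalize_url(url)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

Because the hash is taken after normalization, two spellings of the same address (for example, one with a trailing slash and one with a fragment) map to the same ID, which is what makes such an ID usable for deduplication.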
Media Name
A simplified representation of the domain name this story appeared on. 99.9% of the time this is the root domain name of the website that published the story, without any subdomains.
Media URL
A simplified representation of the domain name this story appeared on. 99.9% of the time this is the root domain name of the website that published the story, without any subdomains.
Title
A cleaned-up version of the article's title as it appears on the webpage. It is extracted via a heuristic search of the HTML and content to identify the text that a human reader would perceive as the title. The cleanup removes common prefixes and suffixes, such as the publication name.
Publish Date
The date on which we think this story was published, based on a set of heuristics that identify candidate publication dates encoded in the HTML or in the text itself. It does not include a time, because few published articles state one and because the time zone cannot be determined accurately at scale across our system.
URL
The fully resolved URL that returned the content held in the story. Note that this might differ from the original URL shared in an RSS feed or posted to social media, because many URLs redirect from one address to another; this field holds the final URL that returned the story. The URL is lightly cleaned to remove things like UTM tracking parameters.
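As an illustration of this kind of light cleaning, the following Python sketch removes query parameters whose names start with `utm_`. It is only an approximation under that assumption; the actual cleaning rules used in the archive are not specified here, and `strip_tracking_params` is a hypothetical helper name.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_tracking_params(url: str) -> str:
    """Drop query parameters named utm_* and keep everything else."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith("utm_")
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


cleaned = strip_tracking_params(
    "https://example.com/news/story?utm_source=twitter&id=7&utm_medium=social"
)
print(cleaned)  # https://example.com/news/story?id=7
```

Applying the same cleaning to a URL you already hold makes it easier to compare against the `url` field of a story.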
Language
The two-letter ISO 639-1 representation of the language the article is in. This is derived algorithmically using language-detection software.
Indexed Date
The date and time this content was captured and processed for insertion into our online news archive. This is helpful for identifying the order in which stories were ingested.
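To recover ingestion order from a set of records, you can parse this field and sort on it. This is a minimal Python sketch that assumes every `indexed_date` value follows the format of the example record, including fractional seconds; `in_ingestion_order` is a hypothetical helper name, and the resulting datetimes are naive because the field carries no time zone.

```python
from datetime import datetime

# Format of the example value "2024-02-21 11:05:03.305199" (an assumption
# that all records share it).
INDEXED_DATE_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def parse_indexed_date(value: str) -> datetime:
    """Parse an indexed_date string into a naive datetime."""
    return datetime.strptime(value, INDEXED_DATE_FORMAT)


def in_ingestion_order(stories: list[dict]) -> list[dict]:
    """Return the stories sorted from earliest to latest indexed_date."""
    return sorted(stories, key=lambda s: parse_indexed_date(s["indexed_date"]))
```

Sorting on the parsed value rather than the raw string guards against records whose timestamps differ in formatting, since those raise an error instead of being silently misordered.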
Text
We store the extracted text content so that users can search it with Boolean keyword queries. We rely on validated libraries to pull the content of a story out of the full HTML. The text itself is not available for download due to copyright restrictions. We are actively exploring legal ways to create data sharing agreements and licenses that would allow full-text sharing; keep an eye on our blog for updates.
Still have questions?
Send us an email at info@mediaecosystems.org or fill out our support form.