Media Source Guide
A Source in the Media Cloud Directory represents metadata about a unique domain that regularly publishes news content where each story is a URL. Sources have associated Feeds (RSS and Google News Sitemap) that the system regularly checks for new stories that should be ingested. This is the primary way we build our Online News Archive. This page describes all the properties that can be associated with a Source in our Directory.
What Counts as a Source?
There are loads of different types of websites out there, and Media Cloud isn’t Google, so we don’t catalog them all. Our Directory is built to track news-focused websites that regularly publish general interest stories as web pages. We apply those criteria subjectively, based on each research project we or our partners undertake. Here are some examples to guide you:
Sources we want to track
- Global news websites
- Hyperlocal news
- Well-read blogs
- Industry-specific publications (when relevant to a research project)
- Unreliable / propaganda sources
- Paywalled news sites (sometimes we still get titles and summaries via RSS)
Sources we DO NOT want to track
- General websites
- Spam farms or parked domains
- Hyper-specific niche blogs
- Seldom-changing analysis or tourism sites
- Pornography
- Sites posting PDFs of print papers
- Organizations' websites (e.g., WHO.int)
Source Metadata Fields
Each Source in our system is described by a set of required and optional metadata. These can be easily exported via our web interface or API, and updated in batch for a Collection via a CSV file. These metadata fields include:
Homepage
required
A link to the browsable homepage of the Source. This should be a complete, clickable URL.
Domain (used to be "Name")
The domain that uniquely identifies the Source within our system for searching against the Online News Archive. We advise leaving this empty for the system to generate based on the homepage URL you enter. If you supply this, it will be validated against the homepage to ensure it is properly formatted and matches. This should not include any prefixes like “https” or “www”; it should be a simple value like “cnn.com”.
⚠️ Note: This should be just the root domain and suffix: bostonglobe.com, not www.bostonglobe.com, https://bostonglobe.com, or https://newton.bostonglobe.com.
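The check described above can be sketched roughly as follows. This is a minimal illustration, not the validation our system actually runs; the function name domain_matches_homepage is hypothetical. It only confirms that the domain has no prefix and that the homepage's host is the domain or one of its subdomains. It does not work out the root domain for you, which in general needs a public suffix list.

```python
from urllib.parse import urlsplit


def domain_matches_homepage(domain: str, homepage: str) -> bool:
    """Hypothetical check: is `domain` a bare domain consistent with `homepage`?"""
    domain = domain.strip().lower()
    if not domain:
        return False
    # Reject scheme prefixes, paths, and a leading "www."
    if "://" in domain or "/" in domain or domain.startswith("www."):
        return False
    # hostname is None when the homepage has no scheme, so fall back to "".
    host = (urlsplit(homepage).hostname or "").lower()
    # The homepage host must be the domain itself or a subdomain of it.
    return host == domain or host.endswith("." + domain)


print(domain_matches_homepage("bostonglobe.com", "https://www.bostonglobe.com/"))
print(domain_matches_homepage("www.bostonglobe.com", "https://www.bostonglobe.com/"))
```

The first call prints True and the second prints False, because the supplied domain carries a “www.” prefix.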
URL Search String
For a very limited number of domains, it makes sense to support “child” Sources that allow us to search against only some of the stories published. Consider sites published on wordpress.com, or outlets that wrap up content from a number of places. For these types of Sources we support repeating the domain of the “parent” Source (which has no url_search_string value) and specifying a url_search_string value on the child Source that will be matched against URLs. This should be used sparingly, because it slows down queries considerably (and uses a lot of computational power on our limited servers).
An example is the Globo set of Sources, which publish stories that cover different states and cities in Brazil that we need to search against individually:
- The parent globo.com Source has no url_search_string
- The child G1 Globo Amapa Source has the same domain (“globo.com”), but includes the “g1.globo.com/ap/amapa/*” url_search_string.
This example will match a URL like: “https://g1.globo.com/ap/amapa/noticia/2025/01/06/ipva-2025-confira-calendario-com-prazos-para-pagamento-no-amapa.ghtml”
⚠️ Note: The URL Search String should start with the domain and end with “/*”, which means that it’ll match any URL that starts with what you’ve entered. Like the domain, it should NOT include any prefix like “http” or “https”.
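To make the matching rule concrete, here is a small sketch of the prefix match described in the note above. It is an illustration under assumptions, not our actual query code: the function name matches_search_string is hypothetical, and we assume the comparison is case-sensitive and that only the “http://” or “https://” prefix is ignored.

```python
import re


def matches_search_string(url: str, url_search_string: str) -> bool:
    """Hypothetical prefix match of a story URL against a url_search_string."""
    if not url_search_string.endswith("/*"):
        raise ValueError("url_search_string should end with '/*'")
    # Drop only the trailing "*", keeping the final "/" as part of the prefix.
    prefix = url_search_string[:-1]
    # Ignore the scheme, since the search string never includes one.
    bare_url = re.sub(r"^https?://", "", url, flags=re.IGNORECASE)
    return bare_url.startswith(prefix)


story = ("https://g1.globo.com/ap/amapa/noticia/2025/01/06/"
         "ipva-2025-confira-calendario-com-prazos-para-pagamento-no-amapa.ghtml")
print(matches_search_string(story, "g1.globo.com/ap/amapa/*"))
```

This prints True for the Globo Amapa example, while a story under a different path on g1.globo.com would not match.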
Label
The name of the Source, as it should be shown to people using the system. This is often the name of the publication. If you leave this blank then the domain will be used instead.
Pub Country
Optionally set the primary country this Source is publishing from, where their headquarters are located. This must be written as a 3-letter ISO 3166-1 alpha-3 code (reference), for example “USA” or “BRA”.
Pub State
Optionally set the primary state or province this Source is publishing from, where their headquarters are located. This must be written as an ISO 3166-2 code (reference). For Massachusetts in the US, you’d write this as “US-MA”; for Amapá in Brazil, you’d write this as “BR-AP”.
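If you are preparing many Sources at once, a quick shape check on the Pub Country and Pub State values can catch typos before you upload. The sketch below is an assumption-laden illustration (the function names are hypothetical): it only tests that a value looks like an alpha-3 or ISO 3166-2 code, not that the code is actually assigned to a real country or subdivision.

```python
import re

# Three uppercase letters, e.g. "USA" or "BRA".
ALPHA3_SHAPE = re.compile(r"[A-Z]{3}")
# Two-letter country code, a hyphen, then up to three letters or digits, e.g. "US-MA".
SUBDIVISION_SHAPE = re.compile(r"[A-Z]{2}-[A-Z0-9]{1,3}")


def looks_like_pub_country(value: str) -> bool:
    """Shape check only; does not confirm the code exists in ISO 3166-1."""
    return ALPHA3_SHAPE.fullmatch(value) is not None


def looks_like_pub_state(value: str) -> bool:
    """Shape check only; does not confirm the code exists in ISO 3166-2."""
    return SUBDIVISION_SHAPE.fullmatch(value) is not None


print(looks_like_pub_country("BRA"), looks_like_pub_state("BR-AP"))
print(looks_like_pub_country("BR"), looks_like_pub_state("Amapa"))
```

The first line prints True True, and the second prints False False.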
Notes
Optional documentation about the inclusion of the Source in our system. This is a good place to put things like who suggested adding the source, your name as the person actually adding it, editing history, content quality, where it was imported from, etc.
Media Type
Optionally set a valid media type to describe what kind of publication this Source is. Pick one of these supported values and enter it into the column:
- “digital_native”: A Source that originated publishing on the internet.
- “print_native”: A Source that was originally published in print, but then migrated to the internet.
- “audio_broadcast”: A Source that is mostly publishing audio content to broadcast on radio or podcast.
- “video_broadcast”: A Source that is mostly publishing video content to broadcast on television or YouTube.
- “other”: Another type of Source not captured in the other categories.
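As a rough sketch, the supported values above could be checked like this before you enter them into the column. The function name clean_media_type is hypothetical, and we assume here that a blank value simply means the optional field is left unset.

```python
from typing import Optional

# The supported values listed above.
MEDIA_TYPES = {
    "digital_native",
    "print_native",
    "audio_broadcast",
    "video_broadcast",
    "other",
}


def clean_media_type(value: Optional[str]) -> Optional[str]:
    """Return a supported media type, or None when the field is left blank."""
    value = (value or "").strip()
    if value == "":
        return None  # assumption: blank means "not set", since the field is optional
    if value not in MEDIA_TYPES:
        raise ValueError(f"unsupported media type: {value!r}")
    return value


print(clean_media_type("print_native"))
print(clean_media_type(""))
```

This prints print_native and then None; a value such as “newspaper” would raise a ValueError.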
Platform
Leave this empty to default to “online_news”. In the past we supported searching other types of content, so this field is mostly a legacy holdover.
Id
read only
A unique ID our system assigns to identify the Source internally. If you are updating a Source, do not fill anything in here; our system assigns this automatically.
Stories Per Week
read only
The average number of stories published by this Source per week, based on our ingestion. We plan to recompute this every few months. If you are updating a Source, do not fill anything in here; our system computes this automatically.
First Story
read only
The date of the first story we’ve ingested from this Source. We plan to recompute this every few months, because sometimes we ingest historical data for a Source. If you are updating a Source, do not fill anything in here; our system computes this automatically.
Primary Language
read only
Our system guesses the primary language of each article it ingests. For each Source we indicate the language the majority of its articles are in (if we have enough to measure). We plan to recompute this every few months. If you are updating a Source, do not fill anything in here; our system computes this automatically.
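If you start from an exported file and edit it for a batch update, one way to follow the “do not fill anything in here” advice for the read-only fields is to blank them before uploading. The sketch below is only an illustration: the column names are our guess at how these fields might appear in a CSV and may not match the real export, so check them against your own file.

```python
# Assumed (hypothetical) column names for the four read-only fields.
READ_ONLY_COLUMNS = ("id", "stories_per_week", "first_story", "primary_language")


def blank_read_only(row: dict) -> dict:
    """Return a copy of one CSV row with the read-only columns emptied."""
    return {
        key: ("" if key in READ_ONLY_COLUMNS else value)
        for key, value in row.items()
    }


row = {"homepage": "https://www.bostonglobe.com/", "stories_per_week": "1234"}
print(blank_read_only(row))
```

Editable columns such as the homepage pass through unchanged, and the original row is not modified.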
Still have questions?
Send us an email at support@mediacloud.org or fill out our support form.