Code related to a working paper that was first presented at the AFSP Annual Meeting in Paris, 2013. See Section 1 of this paper and its appendix, or read the HOWTO below for a technical summary.
- June 2014 – Major update
  * Updated working paper
  * Added new appendix
  * Added five media scrapers
  * Updated Google Trends data
- June 2013 – First release
The scraper currently collects slightly over 6,300 articles from
- ecrans.fr (including articles from liberation.fr)
- lemonde.fr (first lines only for paid content)
- lesechos.fr (left-censored to December 2011)
- lefigaro.fr (first lines only for paid content)
- numerama.com (including old articles from ratiatum.com)
- zdnet.fr
The entry point is `make.r`:
- `get_articles` will scrape the news sources (adjust the page counters to the current website search results to update the data)
- `get_corpus` will extract all entities and list the most common ones (set the minimum frequency with `threshold`; defaults to 10)
- `get_ranking` will export the top 15 central nodes of the co-occurrence network to the `tables` folder, in Markdown format
- `get_network` returns the co-occurrence network, optionally trimmed to its top quantile of weighted edges (set with `threshold`; defaults to 0)
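A minimal call sequence, assuming the function names above are defined by `make.r` and that `threshold` is passed as a plain argument (the exact signatures may differ):

```r
# Hypothetical call sequence; argument names and return values are assumptions
# based on the descriptions above, not copied from make.r.
source("make.r")

get_articles()                             # scrape the news sources
corpus <- get_corpus(threshold = 10)       # keep entities appearing at least 10 times
get_ranking(corpus)                        # export top 15 central nodes to tables/
net <- get_network(corpus, threshold = 0)  # full co-occurrence network, untrimmed
```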
The corpus is exported to three CSV files:

- `corpus.terms.csv` – a list of all entities, ordered by their raw counts
- `corpus.freqs.csv` – a list of the entities found in each article
- `corpus.edges.csv` – a list of undirected weighted network ties
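The edge list can be loaded back into an R session to rebuild the network. A short sketch, assuming the column layout of `corpus.edges.csv` (source, target, weight):

```r
# Sketch: rebuild the co-occurrence network from the exported files.
# Column names in the CSV files are assumptions.
library(igraph)

terms <- read.csv("corpus.terms.csv", stringsAsFactors = FALSE)
edges <- read.csv("corpus.edges.csv", stringsAsFactors = FALSE)
net   <- graph_from_data_frame(edges, directed = FALSE)
```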
- The weighting scheme is inversely proportional to the number of entity pairs in each article.
- The weighted degree formula follows Tore Opsahl and uses an alpha parameter of 1 (both points are illustrated in the sketch below).
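A small sketch of both points, with hypothetical column names (`i`, `j`, `weight`). Each entity pair in an article gets a weight of 1 divided by the number of pairs in that article; Opsahl's weighted degree is k_i × (s_i / k_i)^alpha, which reduces to the node strength s_i when alpha = 1:

```r
# Sketch only: column and helper names are hypothetical, not taken from the code.

# Edge weights for one article: each entity pair gets 1 / (number of pairs).
article_edges <- function(entities) {
  pairs <- t(combn(sort(unique(entities)), 2))
  data.frame(i = pairs[, 1], j = pairs[, 2], weight = 1 / nrow(pairs))
}

# Opsahl's weighted degree: k_i * (s_i / k_i)^alpha, where k_i is the degree
# and s_i the strength (sum of edge weights). With alpha = 1 this equals s_i.
weighted_degree <- function(edges, alpha = 1) {
  nodes <- unique(c(edges$i, edges$j))
  sapply(nodes, function(n) {
    at_n <- edges$i == n | edges$j == n
    k <- sum(at_n)                    # degree of node n
    s <- sum(edges$weight[at_n])      # strength of node n
    if (k == 0) 0 else k * (s / k)^alpha
  })
}

weighted_degree(article_edges(c("A", "B", "C")))
```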
