propaganda-scraper
- repo: Waaangjl/local_central-level-propaganda
- lang: Python
- year: 2022 (archived)
A Python scraper I wrote in 2022 to feed my undergraduate Signature Work thesis on the Shanghai lockdown — the data pipeline behind Framing the Crisis. It pulls articles from People's Daily, Global Times, Jiefang Daily, and Xinmin Evening Post; normalizes wildly inconsistent date formats into ISO; cleans HTML; and ships clean CSVs ready for LDA topic modeling.
I leave the repo public, archived, because two things are worth saying out loud about it.
The repo is older than I'd write today
This was the first non-trivial Python I'd written. It's all requests + BeautifulSoup, no async, no rate limiting beyond a polite 2-second sleep, and a directory layout that no version of me would defend now. The exception handling is mostly except Exception:. There's a hardcoded path that points to my old DKU laptop.
I left it that way on purpose. It's an honest record of where I was as a programmer when I needed the data badly enough to write it. Rewriting it now would be retconning history.
The data is what made the thesis possible
1,161 articles across the four newspapers, April 1 – June 30 2022, the period of the Shanghai lockdown. Without the scraper there is no thesis — there's just a hypothesis I'd never have been able to test. The codebase is rough; the output held up to four committee readers and a public defense. That asymmetry — between code that's "good enough to ship the dataset" and code that's "good engineering" — turned out to be one of the more useful things I learned writing it.
What's in the box
- scrapers/people_daily.py: handles the People's Daily archive's date-keyed URLs.
- scrapers/global_times.py: paginated index pages.
- scrapers/jiefang.py, scrapers/xinmin.py: Shanghai-local outlets, both behind the same kind of CMS.
- clean.py: date normalization (the four papers used four different formats), HTML stripping, encoding fix-ups.
- out/: the actual CSV outputs, frozen.
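For a sense of what the date normalization and HTML stripping in clean.py involve, here is a minimal sketch. It is not the code in the repo. The function names (to_iso_date, strip_html) and the date patterns are hypothetical stand-ins, since the four papers' actual formats are not listed here, and it uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra dependencies.

```python
import re
from datetime import date, datetime
from html.parser import HTMLParser

# Hypothetical patterns: stand-ins for the four papers' real date formats.
# Matches year/month/day separated by -, /, ., or the Chinese 年/月 markers.
_NUMERIC = re.compile(r"(\d{4})\s*[-/.年]\s*(\d{1,2})\s*[-/.月]\s*(\d{1,2})")
_ENGLISH_FORMATS = ("%b %d, %Y", "%B %d, %Y", "%d %B %Y")

# Tags whose boundaries should become whitespace in the extracted text.
_BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3", "tr"}


def to_iso_date(raw: str) -> str:
    """Return YYYY-MM-DD for a raw date string, or raise ValueError."""
    text = raw.strip()
    match = _NUMERIC.search(text)
    if match:
        year, month, day = (int(part) for part in match.groups())
        # date() validates the values, e.g. it rejects month 13.
        return date(year, month, day).isoformat()
    for fmt in _ENGLISH_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")


class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping <script> and <style> contents."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._chunks: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in _BLOCK_TAGS:
            self._chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            if self._skip_depth:
                self._skip_depth -= 1
        elif tag in _BLOCK_TAGS:
            self._chunks.append(" ")

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self) -> str:
        # Join the pieces, then collapse every run of whitespace to one space.
        return " ".join("".join(self._chunks).split())


def strip_html(html: str) -> str:
    """Return the visible text of an HTML fragment as a single line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Raising on an unrecognized date, rather than passing the raw string through, is a deliberate choice in this sketch: a bad row fails loudly before it reaches the CSV instead of quietly skewing a date-bounded corpus.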
A note on responsible use
These are public newspaper articles, scraped politely. Nothing here circumvents a paywall or hits anything you couldn't read in a Shanghai library. If you fork it and point it somewhere it shouldn't go, that's on you.
Clone
git clone https://github.com/Waaangjl/local_central-level-propaganda
cd local_central-level-propaganda
pip install -r requirements.txt
python scrapers/people_daily.py --start 2022-04-01 --end 2022-06-30
Repository: github.com/Waaangjl/local_central-level-propaganda
Related: the thesis it fed, Framing the Crisis.