jialong@columbia:~/site$ cat ./lab/propaganda-scraper.md

propaganda-scraper

repo: Waaangjl/local_central-level-propaganda
lang: Python
year: 2022 (archived)

This is a Python scraper I wrote in 2022 to feed my undergraduate Signature Work thesis on the Shanghai lockdown; it is the data pipeline behind Framing the Crisis. It pulls articles from People's Daily, Global Times, Jiefang Daily, and Xinmin Evening Post; normalizes their wildly inconsistent date formats to ISO 8601; strips the HTML; and writes CSVs ready for LDA topic modeling.

I leave the repo public, archived, because two things are worth saying out loud about it.

The code is rougher than I'd write today

This was the first non-trivial Python I'd written. It's all requests + BeautifulSoup, no async, no rate limiting beyond a polite 2-second sleep, and a directory layout that no version of me would defend now. The exception handling is mostly except Exception:. There's a hardcoded path that points to my old DKU laptop.
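
For readers who want to see the shape of that pattern, here is a minimal sketch of a sequential fetch loop with a fixed 2-second pause. It is not the code in the repo: `polite_fetch_all` and `fetch_with_urllib` are hypothetical names, the fetcher uses the standard library instead of requests so the block has no dependencies, and the narrower `except` clause is an illustration of what a rewrite might do, not what the archived scraper does.

```python
import time
from typing import Callable, Dict, Iterable
from urllib.error import URLError
from urllib.request import urlopen


def fetch_with_urllib(url: str) -> str:
    """Stand-in fetcher (assumption: the repo itself uses requests)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def polite_fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], str] = fetch_with_urllib,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Fetch URLs one at a time, pausing `delay` seconds between requests.

    A failed URL is reported and skipped so one bad page does not end the
    run. Only network-style errors are caught; a blanket `except Exception:`
    would also swallow genuine bugs.
    """
    pages: Dict[str, str] = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # no pause before the very first request
        try:
            pages[url] = fetch(url)
        except (URLError, TimeoutError, ValueError) as err:
            print(f"skipped {url}: {err}")
    return pages
```

Passing `fetch` and `sleep` in as parameters is only there to make the loop testable without touching the network or actually waiting.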

I left it that way on purpose. It's an honest record of where I was as a programmer when I needed the data badly enough to write it. Rewriting it now would be retconning history.

The data is what made the thesis possible

The scraper collected 1,161 articles across the four newspapers, published from April 1 to June 30, 2022, the period of the Shanghai lockdown. Without the scraper there is no thesis, just a hypothesis I'd never have been able to test. The codebase is rough; the output held up to four committee readers and a public defense. That asymmetry, between code that's "good enough to ship the dataset" and code that's "good engineering", turned out to be one of the more useful things I learned writing it.

What's in the box

  • scrapers/people_daily.py — handles the People's Daily archive's date-keyed URLs.
  • scrapers/global_times.py — paginated index pages.
  • scrapers/jiefang.py, scrapers/xinmin.py — Shanghai-local outlets, both behind the same kind of CMS.
  • clean.py — date normalization (the four papers used four different formats), HTML stripping, encoding fix-ups.
  • out/ — the actual CSV outputs, frozen.
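
The cleaning step that clean.py performs can be sketched roughly as follows. This is an illustration under assumptions, not the repo's code: the four entries in `CANDIDATE_FORMATS` are hypothetical examples (the actual per-paper formats are in clean.py), `to_iso_date` and `strip_html` are made-up names, and the HTML stripping uses the standard library's `html.parser` rather than BeautifulSoup so the block runs without extra installs.

```python
from datetime import datetime
from html.parser import HTMLParser

# Hypothetical per-paper date formats; the real four live in clean.py.
CANDIDATE_FORMATS = (
    "%Y-%m-%d",      # 2022-04-01
    "%Y/%m/%d",      # 2022/4/1
    "%Y年%m月%d日",   # 2022年4月1日
    "%Y%m%d",        # 20220401
)


def to_iso_date(raw: str) -> str:
    """Try each known format in turn and return an ISO 8601 date string."""
    text = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # not this format, try the next one
    raise ValueError(f"unrecognized date format: {raw!r}")


class _TextExtractor(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def strip_html(markup: str) -> str:
    """Drop tags, decode entities, and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    # Chunks are joined with a space, so text split by inline tags gains one.
    return " ".join(" ".join(parser.chunks).split())
```

Trying formats in a fixed order works here because the example formats cannot be confused with one another; formats that overlap (day-first versus month-first, say) would need to be keyed by newspaper instead.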

A note on responsible use

These are public newspaper articles, scraped politely. Nothing here circumvents a paywall or hits anything you couldn't read in a Shanghai library. If you fork it and point it somewhere it shouldn't go, that's on you.

Clone

git clone https://github.com/Waaangjl/local_central-level-propaganda
cd local_central-level-propaganda
pip install -r requirements.txt
python scrapers/people_daily.py --start 2022-04-01 --end 2022-06-30
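
The --start and --end flags in the last command suggest the scraper walks a date range one day at a time. A minimal sketch of how such flags might be turned into date-keyed URLs is below. Everything in it is an assumption for illustration: `URL_TEMPLATE` points at example.com rather than the real archive, and `iter_dates`, `build_urls`, and `parse_args` are hypothetical names, not functions from scrapers/people_daily.py.

```python
import argparse
from datetime import date, timedelta
from typing import Iterator, List, Optional

# Hypothetical template; the real archive layout is in scrapers/people_daily.py.
URL_TEMPLATE = "https://example.com/archive/{d:%Y-%m}/{d:%d}/index.html"


def iter_dates(start: date, end: date) -> Iterator[date]:
    """Yield every calendar day from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def build_urls(start: date, end: date) -> List[str]:
    """Build one date-keyed URL per day in the range."""
    return [URL_TEMPLATE.format(d=day) for day in iter_dates(start, end)]


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse --start/--end as ISO dates (YYYY-MM-DD)."""
    parser = argparse.ArgumentParser(description="List date-keyed archive URLs.")
    parser.add_argument("--start", type=date.fromisoformat, required=True)
    parser.add_argument("--end", type=date.fromisoformat, required=True)
    return parser.parse_args(argv)
```

Calling `parse_args(["--start", "2022-04-01", "--end", "2022-06-30"])` and passing the two dates to `build_urls` yields one URL per day of the study window.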

Repository: github.com/Waaangjl/local_central-level-propaganda ↗

Related: the thesis it fed — Framing the Crisis →.