David Porter's Chinese Text Sampler as a Single, Convenient Download

There's this wonderful website by David Porter at the University of Michigan that collects samples of Chinese text at different levels of proficiency for easy practice. Each text is introduced with a brief paragraph in English.

I adapted this collection for offline use with Pleco Reader. Here is what I did:
  • Downloaded all of the text samples. A few downloads were incorrectly linked or broken; where I could guess the actual filename I used it, and where I couldn't I copied the text from the online HTML version, so nothing is missing in the end. Note that I skipped the lists of most frequent characters and surnames, as there are better ways to study those, and also the Three Character and Thousand Character classics, which I already have in my collection posted in another thread. I did, however, include the excerpts from The Analects and from all the other texts whose full versions are in that other collection. Apart from these minor differences, everything from the website is here.
  • Converted all the files to UTF-8 with BOM (the original encoding was GB 2312) and Windows (CR+LF) line endings.
  • Converted the text into full-form (traditional) characters using OpenCC with the s2tw.json profile. I left a few files in simplified characters; their names are prefixed with [S].
  • Did a little cleanup using regular expressions (unwrap text, separate paragraphs, remove indents, remove unnecessary newlines, fix punctuation, other fixes where necessary) and added some finishing touches manually.
  • Copied each text's introductory paragraph from the website and pasted it at the top of the file, so that when you open any of them, you'll see a brief description in English first.
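
For anyone who wants to process similar files, the re-encoding and cleanup steps could look roughly like the Python sketch below. This is only an illustration: the function names (clean, convert) and the exact regular expressions are assumptions, not the patterns actually used for this collection, and the OpenCC conversion step is left out. In particular, the unwrap rule assumes that a single line break between two Chinese characters is a hard wrap, which would be wrong for poems and lyrics.

```python
import re


def clean(text: str) -> str:
    """Regex cleanup on decoded text (illustrative patterns only)."""
    # Work with LF internally, whatever the input used.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Remove indents: leading spaces, tabs and full-width spaces (U+3000).
    text = re.sub(r"(?m)^[ \t\u3000]+", "", text)
    # Remove trailing whitespace on each line.
    text = re.sub(r"(?m)[ \t\u3000]+$", "", text)
    # Separate paragraphs with exactly one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Unwrap: drop a single line break that falls between two Han characters
    # (assumption: such a break is a hard wrap, not a real line end).
    text = re.sub(r"(?<=[\u4e00-\u9fff])\n(?=[\u4e00-\u9fff])", "", text)
    return text.strip() + "\n"


def convert(raw: bytes) -> bytes:
    """GB 2312 bytes in, cleaned UTF-8 with BOM and CR+LF line endings out."""
    text = clean(raw.decode("gb2312"))
    # "utf-8-sig" makes Python write the BOM at the start of the output.
    return text.replace("\n", "\r\n").encode("utf-8-sig")


sample = "  你好世\n界。\n\n\n\n  第二段。\n".encode("gb2312")
result = convert(sample)
```

Here result starts with the three BOM bytes, and the two paragraphs are unwrapped, unindented and separated by a single blank line. If a file contains characters outside GB 2312, decoding with "gb18030" (a superset) may be a safer choice.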

Categorization. The files are split into seven categories: 古典經文 Old Classics, 政治社會 Politics & Society, 故事傳說 Stories & Legends, 生活環境 Living Environment, 當代文學 Contemporary Literature, 電影劇稿 Film Scripts, and 音樂歌詞 Song Lyrics. Note that this is different from the way the texts are arranged on the website.

Naming convention. File names are prefixed with [n.n], where n.n is the difficulty grade given on the website, theoretically ranging from 1.0 to 7.0 (1.6 to 5.4 among the included texts). The grade appears to be based on the number and rarity of the characters in a text, not on how hard the text actually is to comprehend: with a classical text, you might very well know all the characters and still not be able to make much sense of them, and the number does not account for that kind of difficulty. Still, it's better than nothing, so the files are named so that they sort from least to most challenging within their category.
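
As a small illustration of why the prefix works for sorting, here is a Python sketch. The grade helper is hypothetical and not part of the collection; it assumes every grade has one digit before and one after the decimal point, which holds for the 1.0 to 7.0 range. Under that assumption a plain alphabetical sort of the file names gives the same order as sorting by the numeric grade.

```python
import re

# Matches the "[n.n]" prefix at the start of a file name.
GRADE = re.compile(r"^\[(\d\.\d)\]")


def grade(name: str) -> float:
    """Return the difficulty grade encoded in a file name prefix."""
    match = GRADE.match(name)
    if match is None:
        raise ValueError(f"no [n.n] prefix in {name!r}")
    return float(match.group(1))


names = ["[3.3] 木蘭辭", "[2.2] 孟子語錄", "[3.6] 三十六計", "[2.4] 論語"]

# Alphabetical order and grade order coincide for single-digit grades.
by_name = sorted(names)
by_grade = sorted(names, key=lambda n: (grade(n), n))
```

Both by_name and by_grade start with the 2.2 text and end with the 3.6 text, so any file browser that sorts alphabetically should list the easiest texts first.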

What's included:

古典經文 Old Classics

[2.2] 孟子語錄
[2.3] 水滸傳
[2.3] 紅樓夢
[2.4] 論語
[2.4] 金瓶梅
[2.5] 道德經
[2.8] 李白唐詩十五首
[2.8] 西遊記
[2.9] 三國演義
[3.0] 肉蒲團
[3.0] 苦相篇
[3.1] 醉翁亭記
[3.2] 山西商
[3.2] 陶淵明詩集
[3.3] 史記:伯夷列傳
[3.3] 木蘭辭
[3.3] 李清照代表詩詞
[3.3] 杜甫代表詩詞
[3.3] 畫皮
[3.6] 三十六計

政治社會 Politics & Society


[1.6] [S] 毛主席语录:阶级和阶级斗争
[1.7] 在延安文藝座談會上的講話
[1.8] 敬告中國兩萬萬女同胞書
[1.9] 和平宣言
[2.1] [S] 口号
[2.1] 乾隆帝
[2.3] [S] 我们对香港问题的基本立场
[2.3] 新生活運動綱要
[2.4] 少年中國說
[2.7] 五四運動的回憶
[2.7] 孔子之道與現代生活
[3.2] [S] 刘少奇:论共产党员的修养

故事傳說 Stories & Legends


[1.6] 曹衝稱象
[1.7] 黔驢技窮
[2.0] 司馬光砸缸救友
[2.1] 十二生肖的故事
[2.1] 千里送鵝毛,禮輕情誼重
[2.2] 神筆馬良
[2.3] 守株待兔
[2.4] 狐假虎威
[2.5] 白毛女
[2.6] 白蛇傳
[2.8] 莊周夢蝶
[2.9] 刻舟求劍
[3.4] 塞翁失馬
[3.7] 愚公移山
[3.9] 揠苗助長
[5.4] 井底之蛙

生活環境 Living Environment


[1.9] [S] 中国公民出国旅游文明行为指南
[2.1] [S] 首都市民文明公约
[2.2] 常用俗話
[2.3] [S] 八荣八耻
[2.5] [S] 路标
[2.7] [S] 中国地名
[3.7] 笑話
[4.8] 餐廳菜單

當代文學 Contemporary Literature


[1.8] 太陽照在桑乾河上
[1.9] 家
[1.9] 李有才板話
[1.9] 藥
[2.0] 善訟的人的故事
[2.0] 笑的研究
[2.1] 庭院深深
[2.1] 活著
[2.2] 手
[2.2] 茶館
[2.3] 紀念
[2.4] 傾城之戀
[2.4] 徐志摩代表詩詞
[2.4] 西風
[2.4] 醉酒
[2.5] 子夜
[2.5] 射鵰英雄傳
[2.5] 祖國啊,祖國
[2.8] 荷塘月色
[4.4] 洋涇浜奇俠

電影劇稿 Film Scripts


[2.0] 藍風箏
[2.1] 紅高粱
[2.2] 秋菊打官司
[2.3] 大紅燈籠高高掛
[2.4] 菊豆
[2.5] 臥虎藏龍
[2.6] 黃土地
[3.1] 英雄
[3.3] 霸王别姬

音樂歌詞 Song Lyrics


[1.8] 東方紅
[2.1] 十五的月亮
[2.1] 我的祖國
[2.3] 妹妹
[2.4] 夜來香
[2.9] 黃河船伕曲
[3.1] 瀏陽河
[3.2] 洪湖水
[4.4] 茉莉花

Preview: 茉莉花 (most texts are much longer)
Molihua (4.4) - A folk song from Hebei province that has attained international popularity.


Download below. The collection reflects what was on the website as of May 2015. Enjoy!


