Size of Chinese Texts: 1550 - 1800, Rolling Ten-Year Window

For an interactive, dynamic version of this figure, please see my d3.js-based size distribution explorer!

This gif (hover your mouse over the image to play) illustrates shifts in the size of Chinese texts across the 16th, 17th, and 18th centuries. The animation derives from research presented in the paper "Analyzing Late Imperial Printing Trends Using Large Bibliometric Datasets", which is forthcoming in 2016. The figure was produced using bibliographic metadata stored in MARC records acquired from WorldCat through the WorldCat search API (with the generous permission of the Online Computer Library Center of Dublin, Ohio). Online library catalog records remain a largely untapped source of information on late Imperial Chinese texts. Yet because of the Rare Chinese Book project, WorldCat records contain a significant amount of information on rare Chinese books. This information, while spotty in some cases, is very valuable for visualizing large-scale printing trends once it is properly sanitized.

The results of this study are striking. Texts produced in China during this period were published at an average size of around 270 square centimeters. However, beginning in the 1720s, a new class of small-format texts became increasingly popular. This is visible in the growth of the small shoulder on the left of the density curve. These small works offer a valuable area for new research. Which books were produced in small sizes? Who was writing them? Who was printing them? Do they come from a specific region? Are there historical factors that might explain their increased prevalence?
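As a rough illustration of the rolling ten-year window in the title, the sketch below groups page areas by window start year. It is not the code behind the figure. The records are invented, the `rolling_areas` helper is hypothetical, and two assumptions are made that the text does not confirm: that size means height times width in square centimeters, and that a window covers a start year plus the following nine years.

```python
# Hypothetical records: (publication year, height in cm, width in cm).
# These values are made up for illustration; they are not WorldCat data.
records = [
    (1700, 25.0, 16.0),
    (1705, 26.0, 15.0),
    (1722, 17.0, 11.0),
    (1725, 24.0, 16.0),
    (1728, 16.0, 10.0),
]


def rolling_areas(records, start, end, window=10):
    """Map each window start year to the page areas (cm^2) of the records
    published in [start year, start year + window).

    Assumes area = height * width; windows with no records are skipped.
    """
    result = {}
    for first in range(start, end - window + 1):
        areas = [h * w for (year, h, w) in records
                 if first <= year < first + window]
        if areas:
            result[first] = areas
    return result


# One list of areas per window; each list could feed one frame of a
# density plot in an animation like the one described above.
windows = rolling_areas(records, 1700, 1730)
for first_year, areas in sorted(windows.items()):
    print(first_year, first_year + 9, areas)
```

Each window's list could then be passed to a density estimator to draw one frame; under these assumptions, a growing cluster of small areas in later windows would show up as the left-hand shoulder.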

Principal Component Analysis of Late Imperial Unofficial vs. Official Histories

The figure shown here, which is abstracted in the header image to this website, is a principal component analysis of thirty-eight unofficial histories (or yeshi 野史) written in the Ming and Qing dynasties, in comparison with the Official History of the Ming Dynasty (the Ming shi 明史). I broke each text into ten-thousand-character sections and counted how many times each character appears. I then used the one hundred most common characters across the entire corpus to create a vector that represents each section of text. Each vector records how often those common characters appear in that particular section, and so represents a point in 100-dimensional space. The variations among these vectors provide an interesting analog for the style of a text. Principal component analysis offers a way to reduce the dimensionality of the data in order to visualize the relationships among these vectors as dots in two-dimensional space. The results are often predictive of genre (particularly in the case of Chinese texts).
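The procedure described above can be sketched in a few lines of Python. This is a minimal illustration on a toy corpus, not the code used for the figure. It assumes NumPy is available, uses relative frequencies as the vector entries (one possible way to record how often a character appears), and shrinks the section size and feature count so the example runs on a few characters. The function names are hypothetical.

```python
from collections import Counter

import numpy as np


def split_sections(text, size):
    """Split text into consecutive sections of `size` characters,
    dropping any shorter remainder at the end."""
    return [text[i:i + size] for i in range(0, len(text) - size + 1, size)]


def frequency_vectors(sections, n_features):
    """Build one vector per section from the relative frequencies of the
    `n_features` most common characters across all sections."""
    corpus_counts = Counter()
    for section in sections:
        corpus_counts.update(section)
    vocab = [ch for ch, _ in corpus_counts.most_common(n_features)]
    rows = []
    for section in sections:
        counts = Counter(section)
        rows.append([counts[ch] / len(section) for ch in vocab])
    return vocab, np.array(rows, dtype=float)


def pca_2d(X):
    """Project the rows of X onto the first two principal components,
    computed from the SVD of the mean-centered data."""
    centered = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:2]            # character loadings for PC1 and PC2
    return centered @ components.T, components


# Toy corpus: two "texts" with different character habits. The figure uses
# sections of 10,000 characters and 100 features; here both are tiny.
text_a = "日月日月日月日月"
text_b = "山水山水山水山水"
sections = split_sections(text_a, 4) + split_sections(text_b, 4)
vocab, X = frequency_vectors(sections, n_features=4)
coords, loadings = pca_2d(X)
print(coords.shape)  # one (x, y) point per section
```

In this toy case the two sections from each text land on the same side of the first component, opposite the sections from the other text. The rows of `loadings` play the role of the character loadings: they show which characters pull a section toward each end of a component.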

This analysis shows several interesting things. First, while there is a close stylistic relationship between many yeshi and the Official History of the Ming (shown in magenta), the official history is more stylistically diverse. The sections from the official history that correspond to the biographies are most similar to the unofficial histories. The annal sections are much more distinct (and appear largely in the lower-left corner of this figure). If one looks at the character loadings, it is clear that the points in the lower-left corner are dominated by time words, while the points in the upper-left corner focus on geographical vocabulary. While there are annalistic yeshi that should be more closely related to these outlying points, none of them appear here. This suggests that yeshi are more closely connected with narrative official histories, a similarity they share with novels (some scholars, such as Sheldon Lu, argue that fiction emerged from the biographical style found in official histories).

This description is paraphrased from the more detailed analysis found on pages 243-245 of my dissertation.

I explain this and other figures in much greater detail in the An Wang postdoctoral talk I gave in September 2014, which I have embedded in the next section. It is published on the Fairbank Center for Chinese Studies' YouTube channel.

An Wang Postdoctoral Fellow Talk: September 19, 2014

Description from the Fairbank Center website.

In this talk, “Digital Approaches to Late Imperial Chinese Literature: Exploring Quasi-historical Texts,” Paul Vierthaler discusses using statistical methods adopted from stylometric analysis to clarify the complicated stylistic relationships among late Ming and early Qing novels on historical events, drama on historical events, and yeshi 野史 (unofficial histories), which he collectively calls “quasi-history.” After introducing quasi-histories, he discusses some of the intricacies of using stylometry to analyze digitized transcripts of late Imperial Chinese works. Corpus linguists initially developed stylometry for authorship attribution studies, but scholars have also found it useful for generic analysis.

Vierthaler highlights some of the technical details of stylometry, including useful software and text preparation (e.g. sanitizing, tokenizing, and normalizing for length), while emphasizing concerns unique to Chinese studies. He discusses several mathematical tools used in stylometry, including hierarchical cluster analysis and principal component analysis. He then looks closely at what these tools reveal about late Imperial Chinese texts in general and quasi-histories specifically. Vierthaler first analyzes ten famous Ming and early Qing works and four yeshi. He then expands the scope to include 126 different works of disparate genres and finally compares more than forty yeshi with the Official History of the Ming. He concludes with a brief discussion of the theoretical implications of using digital tools for critical analysis.

Reflections on Research: Late Imperial Bibliographic Studies and Digital Quantitative Analysis

I wrote this essay while working on my dissertation on a Fulbright grant in Taiwan. It offers some general reflections on the database-driven bibliographic research I was conducting during the fall of 2013. Since then, I have expanded my bibliographic analysis, discovered some valuable repositories of fully digitized texts, and conducted extensive full-text analysis.

