Chinese Text Modeling Tools and Applications for Computational Journalism
Statistical topic models, a subfield of machine learning and natural language processing, provide a data-driven framework for analyzing collections of text documents. They have become one of the most frequently used tools in computational journalism for investigating the abstract topics and keywords that occur in a collection of text documents. Digital journalists can use such tools to extract frequently appearing terms and to analyze trends in a particular news brand or in stories about a social event. Articles, analyses and documents written in Chinese have become increasingly important for multimedia stories about China. Chinese archives available on the Internet may contain stories that require digital journalists to apply appropriate topic modeling tools.
Unlike English and other alphabetic languages, the basic structural unit of written Chinese is the character, encoded in Guobiao GB18030 or Unicode. Since there are no spaces between words in Chinese documents, topic-modeling tools designed for English text documents, such as the gensim library, cannot be fully applied to Chinese text documents. The gensim library can neither separate Chinese characters into segments nor convert unsegmented Chinese text to vectors using the bag-of-words approach. It only accepts documents that have already been segmented into Chinese words and encoded in UTF-8. Digital journalists can still take advantage of gensim's statistical analyses, such as Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA) and Term Frequency–Inverse Document Frequency (TF-IDF), to analyze an existing corpus created by other tools. In this project, I analyze one of the Chinese topic-modeling tools, jieba, and its application to computational journalism. I analyze articles from the opinion archive of the People’s Daily and generate a list of frequently appearing words using the keyword extraction tool provided by the jieba library. I also apply gensim's TF-IDF analysis to existing Chinese segmentation documents, represented as bag-of-words counts, with a weighting that discounts common terms. Finally, I compare the jieba keyword extraction results with the gensim TF-IDF results.
Chinese Text Segmentation
After importing the jieba library and setting the source encoding to UTF-8 by declaring #encoding=utf-8, my app reads the monthly txt files from the monthly_raw_data folder and saves the text of each monthly article collection to a multiline string called multi_line_text.
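The reading step can be sketched as follows. This is a minimal Python 3 illustration, not the application's actual code: the helper name read_monthly_texts and the file name 2010_10.txt are hypothetical, and a temporary folder stands in for monthly_raw_data so that the sketch runs on its own.

```python
import os
import tempfile


def read_monthly_texts(folder):
    """Return {file name: full text} for every .txt file in `folder`."""
    texts = {}
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".txt"):
            continue
        path = os.path.join(folder, name)
        # Decode explicitly as UTF-8 so Chinese characters are read correctly.
        with open(path, encoding="utf-8") as handle:
            texts[name] = handle.read()
    return texts


# Throwaway folder standing in for monthly_raw_data (hypothetical content).
demo_folder = tempfile.mkdtemp()
demo_path = os.path.join(demo_folder, "2010_10.txt")
with open(demo_path, "w", encoding="utf-8") as handle:
    handle.write("上海世博会\n圆满落幕\n")

monthly_texts = read_monthly_texts(demo_folder)
multi_line_text = monthly_texts["2010_10.txt"]
```

Keeping one entry per file makes it easy to process each monthly collection separately later on.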
This multiline string contains a number of '\n' characters, which the jieba segmentation tool interprets as new lines. I get rid of these '\n' characters by splitting the old multiline string and building a new string, all_text:

    all_text = ''
    templist = multi_line_text.split('\n')
    for text in templist:
        all_text += text

To separate the string into Chinese segments, I pass the string that contains the text of a monthly article collection to the jieba segmentation tool:

    seg_list = jieba.cut(all_text, cut_all=False)
    temp_text = []
    for item in seg_list:
        temp_text.append(item.encode('utf-8'))

The jieba.cut() function returns a generator that produces the segments of the string on the fly. Developers need to loop through the generator seg_list once and create a copy in memory. The temp_text list is that copy. It contains all segments of the articles, including a number of meaningless Chinese words that commonly appear in all text files. To clean out this noise, the application reads a stop word txt file and saves the list of Chinese words to stop_list. Looping through the segments in temp_text, I generate a new list called text_without_stopwords, which contains only the segments that do not appear in stop_list:

    stop_list = []
    with open("stopwords.txt", "r") as stoplistfile:
        for item in stoplistfile:
            stop_list.append(item.rstrip('\r\n'))
    text_without_stopwords = []
    for word in temp_text:
        if word not in stop_list:
            text_without_stopwords.append(word)

As shown in figure 4 and figure 5, the stop list removes meaningless words from the text list, such as “的”(Of), “在”(In), “中”(Middle) and “和”(And). The text_without_stopwords list contains the Chinese segments needed for statistical analyses.
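The stop word filtering described above can be sketched in Python 3, where strings are already Unicode and no explicit encode step is needed. The function name remove_stopwords and the sample segments are hypothetical; the stop words are the four examples mentioned in the text.

```python
def remove_stopwords(segments, stop_words):
    """Keep only the segments that are not stop words, preserving order."""
    # A set gives constant-time lookups instead of rescanning a list per word.
    stop_set = set(stop_words)
    return [word for word in segments if word not in stop_set]


# Hypothetical segments, as a segmenter might return them for a short phrase.
segments = ["中国", "的", "发展", "和", "改革"]
stop_words = ["的", "在", "中", "和"]

text_without_stopwords = remove_stopwords(segments, stop_words)
```

For a long stop word list and a large article collection, the set lookup is the main practical difference from a plain list membership test.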
Before using stop word list:
After using stop word list:
In order to test the performance of the jieba segmentation tool, I print all Chinese segments stored in the text_without_stopwords list:
As we can see from the list of printed words, a number of Chinese terms, such as “人民日报社论”(People’s Daily Opinions), “国际劳动节”(The International Labor Day), “中国特色社会主义”(Socialism with Chinese Characteristics) and “中国共产党第十八次全国代表大会/十八大”(The 18th National Congress of the Communist Party of China), are not accurately divided. To train the system to recognize these new terms, I created a customized dictionary that includes them and added it to the jieba module using the load_userdict() method:

    jieba.load_userdict("customized_dict.txt")

After adding this customized dictionary, the jieba segmentation tool can recognize such new terms and correctly cut the string into segments.
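A customized dictionary file can be produced with a few lines of Python 3. As far as I understand jieba's documentation, the file is UTF-8 text with one entry per line: the word, optionally followed by a frequency and a part-of-speech tag separated by spaces. The sketch below writes only the word column; the helper name write_user_dict is hypothetical, and a temporary folder is used so the example runs on its own.

```python
import os
import tempfile

# Terms taken from the text above; frequency and tag columns are omitted.
new_terms = ["人民日报社论", "国际劳动节", "中国特色社会主义", "十八大"]


def write_user_dict(path, terms):
    """Write one term per line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as handle:
        for term in terms:
            handle.write(term + "\n")


dict_path = os.path.join(tempfile.mkdtemp(), "customized_dict.txt")
write_user_dict(dict_path, new_terms)

# Read the file back to check what was written.
with open(dict_path, encoding="utf-8") as handle:
    written_terms = handle.read().split()

# With jieba installed, the file could then be loaded as in the text:
#   import jieba
#   jieba.load_userdict(dict_path)
```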
This word list can be exported to a txt file using the csv library. The exported text file stores the words as comma-separated values, which can be imported into statistical libraries, such as the gensim library and the jieba analysis library, for further analysis.
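A minimal Python 3 sketch of this export and re-import, using only the standard csv module. The function names, the file name segments.txt and the sample words are hypothetical.

```python
import csv
import os
import tempfile


def export_segments(path, segments):
    """Write all segments as one comma-separated row."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        csv.writer(handle).writerow(segments)


def import_segments(path):
    """Read the first row back as a list of segments."""
    with open(path, encoding="utf-8", newline="") as handle:
        return next(csv.reader(handle))


out_path = os.path.join(tempfile.mkdtemp(), "segments.txt")
export_segments(out_path, ["中国", "发展", "改革"])
round_trip = import_segments(out_path)
```

Opening the file with an explicit UTF-8 encoding replaces the manual encode step that Python 2 code needs before writing Chinese text.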
Keyword Extraction using Jieba Library
The jieba library has an analysis module that supports keyword extraction. Its extraction method takes two parameters: the text to extract keywords from and the number of keywords to return. It returns a list of keywords. I wrote two applications, one for the entire text and one for the monthly article collections. Both read the previously generated csv file and save the lists of word segments to the all_text string. They import the jieba.analyse module and pass the all_text string to the jieba.analyse.extract_tags() method:

    for item in jieba.analyse.extract_tags(all_text, 10):
        text_temp.append(item.encode('utf-8'))

Both applications export the results to txt files using the csv library. Developers need to encode the Chinese text segments to UTF-8 and save them to a temporary list, so that the csv writer can write the words out as Chinese characters.
One application extracts keywords from the entire text string and returns 20 frequently appearing keywords:
发展 (Development)
中国特色社会主义 (Socialism with Chinese Characteristics)
建设 (Construction)
人民 (People)
工作 (Work)
中国 (China)
我国 (Our Country)
小康社会 (Well-off Society)
人民日报社论 (People’s Daily Opinion)
社会 (Society)
经济 (Economics)
文化 (Culture)
改革 (Reform)
社会主义 (Socialist)
水利 (Water Resources)
科学发展观 (Scientific Outlook on Development)
历史 (History)
中华民族伟大复兴 (The Great Rejuvenation of the Chinese Nation)
经济社会 (Economics & Society)
农业 (Agriculture)
The other application extracts keywords from the monthly article collections and returns 10 frequently appearing keywords for each month:
2010/10 上海世博会,世博,落幕,中国,人类,世博会,世博园,低碳,沟通,城市
2010/11 亚运,亚运会,上海世博会,世博,广州,中国,落幕,城市,亚洲,发展
2010/12 发展,农村,农业,工作,残疾人,残运会,经济,精神,推进,防汛
2011/01 中国特色社会主义,水利,发展,法律,建设,立法,体系,一号文件,坚持,我国
2011/03 中国特色社会主义,发展,工作,十二五规划,法律,人民政协,十二五,立法,热烈祝贺,体系
2011/05 劳动,发展,科技,青年,创新,工人阶级,青春,创造,中国,90
2011/07 水利,发展,人民,西藏,建设,60,水资源,改革,中国特色社会主义,加快
2011/09 中华民族,人民,中国,中华民族伟大复兴,抗日战争,民族,中国共产党,历史,复兴,伟大
2011/10 文化,中国特色社会主义,中华民族伟大复兴,发展,建设,社会主义,辛亥革命,坚持,繁荣,推动
2011/11 文化,文艺工作者,神舟八号,对接,交会,天宫一号,广大,发展,任务,圆满成功
2011/12 发展,扶贫开发,经济,农业,环境保护,工作,农村,我国,经济社会,加快
2012/01 金融,发展,经济,工作,我国,国际金融,把握,改革,金融业,经济社会
2012/02 农业,科技,农村,发展,农产品,供给,创新,稳定,保障,加快
2012/03 人民政协,发展,工作,社会,建设,民政,人民,中国特色社会主义,会议,发挥
2012/05 青年,共青团,广大青年,中国特色社会主义,劳动,事业,90,共青团员,发展,群众
2012/06 上海合作组织,成员国,发展,峰会,合作,地区,共同,北京,元首,携手
2012/08 奥运会,中国特色社会主义,奥运,奥林匹克精神,伦敦,奥林匹克,奥林匹克运动,体育健儿,见证,赛场
2012/10 中国特色社会主义,发展,社会主义,十年,社会主义现代化,十八大,中国,道路,社会,构建
2012/11 中国特色社会主义,十八大,小康社会,科学发展观,社会主义现代化,全面,党和国家,人民,发展,大会
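For reference, a plain frequency count over the segments gives a simple baseline to set beside these keyword lists. This is not jieba's method: to my knowledge, jieba.analyse.extract_tags ranks words by TF-IDF against a built-in IDF table rather than by raw counts. The sketch below only counts occurrences; the function name top_terms and the sample segments are hypothetical.

```python
from collections import Counter


def top_terms(segments, k=10):
    """Return the k most frequent segments, most frequent first."""
    return [word for word, _ in Counter(segments).most_common(k)]


# Hypothetical text that has already been segmented and stop-word filtered.
segments = ["发展", "文化", "发展", "改革", "文化", "发展"]
top_two = top_terms(segments, k=2)
```

Comparing such a raw count with the extract_tags output for the same month would show how much the ranking depends on weighting rather than on frequency alone.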
TF-IDF Keywords Analysis using Gensim Library
Although the gensim library does not have segmentation tools for Chinese content, it has statistical analysis tools that can be applied to an existing corpus. I copied the existing segmentation result files. My gensim application reads this existing segmentation file and generates a gensim dictionary:
Code for generating the dictionary
The dictionary
My gensim application also generates a corpus:
Code for generating the corpus
The corpus
The dictionary is a file that maps each word to its ID. The corpus represents the original documents as sparse vectors.
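To illustrate the two data structures, here is a small pure-Python stand-in for what a word-to-ID dictionary and a bag-of-words corpus look like. It is not gensim code, and the IDs gensim assigns to words may differ; the function names and the two sample documents are hypothetical.

```python
from collections import Counter


def build_dictionary(documents):
    """Map each distinct word to an integer ID, in order of first appearance."""
    word_to_id = {}
    for document in documents:
        for word in document:
            if word not in word_to_id:
                word_to_id[word] = len(word_to_id)
    return word_to_id


def to_bow(document, word_to_id):
    """Return a sparse vector: sorted (word ID, count) pairs for one document."""
    counts = Counter(word_to_id[word] for word in document)
    return sorted(counts.items())


# Two hypothetical documents, already segmented into words.
documents = [
    ["农业", "发展", "农业"],
    ["文化", "发展"],
]
dictionary = build_dictionary(documents)
corpus = [to_bow(document, dictionary) for document in documents]
```

Each document is reduced to the IDs of the words it contains and their counts; words that do not occur in a document are simply absent from its vector, which is what makes the representation sparse.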
Using the gensim dictionary and the corpus, developers can perform statistical analyses. My application runs the TF-IDF analysis on the existing corpus:

    from gensim import corpora, models

    corpus = corpora.MmCorpus('corpus.mm')
    tfidf = models.TfidfModel(corpus)

It saves the TF-IDF results and returns the 20 word IDs with the highest TF-IDF scores:
农业 (Agriculture) 1.13101010579
法律 (Law) 0.965040840409
劳动 (Labor) 0.957638193359
青年 (Youth) 0.942623989634
水利 (Water Resources) 0.907716774625
金融 (Finance) 0.872824577986
世博 (Expo 2010 Shanghai China) 0.718687279642
上海合作组织 (The Shanghai Cooperation Organisation) 0.718505717462
农村 (Countryside) 0.695366468602
文化 (Culture) 0.693982350176
十年 (Ten years) 0.661320834741
立法 (Legislation) 0.637700637339
中国特色社会主义 (Socialism with Chinese Characteristics) 0.582919379105
工作 (Work) 0.581953680087
人民政协 (Chinese People’s Political Consultative Conference) 0.578274475464
西藏 (Tibet) 0.556786191245
辛亥革命 (The Xinhai Revolution/ The Revolution of 1911) 0.537749641255
落幕 (Ring down the curtain) 0.522466708471
供给 (Supply) 0.497780006814
Developers usually emphasize words with high TF-IDF scores. A word with a high TF-IDF score appears frequently in one document but does not appear frequently in all documents, while a word that is common to all documents receives a low score. Since we want to make comparisons with the keywords generated by jieba, the list above keeps the 20 words with the highest TF-IDF scores.
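To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF under one common formulation, count * log2(N / df), where N is the number of documents and df is the number of documents containing the word. I believe this corresponds to gensim's default weighting before its per-document normalization, but treat that as an assumption; the function name tfidf_weights and the toy corpus are hypothetical.

```python
import math


def tfidf_weights(corpus):
    """Weight each (word ID, count) pair by count * log2(N / df).

    corpus: list of documents, each a list of (word ID, count) pairs.
    Returns one {word ID: weight} dict per document.
    """
    total_docs = len(corpus)
    doc_freq = {}
    for document in corpus:
        for word_id, _ in document:
            doc_freq[word_id] = doc_freq.get(word_id, 0) + 1
    return [
        {
            word_id: count * math.log2(total_docs / doc_freq[word_id])
            for word_id, count in document
        }
        for document in corpus
    ]


# Word 0 appears in both documents; word 1 appears only in the first one.
corpus = [[(0, 3), (1, 2)], [(0, 1)]]
weights = tfidf_weights(corpus)
```

In this toy corpus the word that occurs in every document gets a weight of zero no matter how often it appears, while the word confined to one document keeps a positive weight.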
My app also returns the top 10 IDs with the highest scores for each month:
2010/10 世博,落幕,上海世博会,世博会,沟通,上海,世博园,人类,城市
2010/11 亚运,亚运会,广州,世博,亚洲,落幕,上海,上海世博会,沟通
2010/12 残疾人,明年,农业,农村,亚,抗旱,残运会,防汛,亚洲
2011/01 法律,水利,立法,一号文件,十年,社会主义民主法制,宪法,体系,水资源
2011/03 法律,人民政协,立法,报告,十一届四次会议,牢固,规划,委员,监督
2011/05 劳动,青年,工人阶级,青春,紧密结合,科技,人才,工作者,劳动者
2011/07 西藏,水利,水资源,代表,政党,谨记,须,全党同志,07
2011/09 抗日战争,日本,侵略者,觉醒,九一八事变,抗日,世界反法西斯战争,这场,侵略
2011/10 辛亥革命,文化,先生,孙中山,百年,先驱,中华民族伟大复兴,繁荣,没有
2011/11 文艺工作者,交会,神舟八号,对接,天宫一号,航天,创作,文化,航天事业
2011/12 环境保护,扶贫开发,农业,扶贫,明年,农村,贫困地区,12,力度
2012/01 金融,金融业,金融机构,防范,实体,金融监管,监管,稳健,动荡
2012/02 农业,供给,农产品,农村,科技,绝不能,因为,约束,强
2012/03 民政,人民政协,雷锋,五次,学雷锋,思想道德,民政工作,雷锋精神,十一届
2012/05 青年,共青团,劳动,广大青年,工人阶级,共青团员,团组织,劳动者,主力军
2012/06 上海合作组织,成员国,峰会,地区,本,元首,合作,互信,携手
2012/08 奥运会,伦敦,奥运,奥林匹克,奥林匹克运动,人民解放军,军队,我军,体育健儿
2012/10 十年,构建,回顾,越,关键环节,伟大祖国,奋勇前进,社会主义,十八大
2012/11 中央委员会,十八大,党和人民,新一届,中国共产党第十八次全国代表大会,选举,中国特色社会主义,表现,党的建设
Comparison of Jieba Keyword Extraction Results with Gensim TF-IDF Results
The 20 frequently appearing words generated by the jieba library do not match the TF-IDF results generated by the gensim library well. Only 5 out of 20 Chinese words appear in both results. However, the monthly keyword results of the jieba library match the monthly TF-IDF results of the gensim library. Although some words are not exactly the same, the two monthly results contain synonyms.
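The overlap between the two overall lists can be checked with a short Python 3 snippet. Both lists are copied from the results reported earlier in this article, with the gensim list holding the words exactly as they are listed there; the variable names are my own.

```python
# Overall keywords reported for jieba.analyse.extract_tags.
jieba_keywords = [
    "发展", "中国特色社会主义", "建设", "人民", "工作", "中国", "我国",
    "小康社会", "人民日报社论", "社会", "经济", "文化", "改革", "社会主义",
    "水利", "科学发展观", "历史", "中华民族伟大复兴", "经济社会", "农业",
]

# Words reported with the highest gensim TF-IDF scores.
gensim_keywords = [
    "农业", "法律", "劳动", "青年", "水利", "金融", "世博", "上海合作组织",
    "农村", "文化", "十年", "立法", "中国特色社会主义", "工作", "人民政协",
    "西藏", "辛亥革命", "落幕", "供给",
]

# Keep the jieba ordering so the shared words are easy to locate in that list.
gensim_set = set(gensim_keywords)
shared = [word for word in jieba_keywords if word in gensim_set]
```

The shared words are 中国特色社会主义, 工作, 文化, 水利 and 农业.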
The 20 frequently appearing words generated by the jieba library are nouns. This is because the jieba keyword extraction takes advantage of its default Chinese dictionary to identify word types. These words can also be found in the monthly keyword results of the jieba library. The list of words with high TF-IDF scores generated by the gensim library is computed from word counts alone, without reference to word types. It does not avoid picking up less meaningful words, such as 供给 (Supply). For extracting keywords from documents, users should use the jieba library. The gensim library might only be good for finding words that appear frequently in one document but do not appear frequently in all documents.















