Academic Paper

A Standard Sample of Present-Day Chinese for Use with Digital Computers. Final Report.
Document Type
Reports - Research
Source
Subject
Chinese
Codification
Computational Linguistics
Data Collection
Databases
Digital Computers
Information Retrieval
Mandarin Chinese
Mathematical Linguistics
Vocabulary
Word Frequency
Word Lists
Language
Abstract
The final report on a project to develop a standard corpus of present-day Mandarin Chinese is presented. This corpus consists of words of running text of Chinese prose printed in the Republic of China during the calendar year 1968. Although originally planned to comprise 500 samples of 2,000 words each, the corpus contains only 294 samples. Each sample starts at the beginning of a sentence, but not necessarily at the beginning of a paragraph or larger division. The samples represent a variety of styles of modern prose and were selected for their representative quality rather than their literary merit. The collection consists primarily of samples from books and some major periodicals available through the library at the National Taiwan University and the National Central Library. Each sample collected was copied and then transcribed into a modified Pin-yin romanization. For each sample, the following were counted: names, formulae, figures, foreign strings, foreign words, words (in total), and syllables. Once collected and romanized, the samples were codified. A manual accompanies the corpus, which comprises one magnetic tape of about 1,200 feet, available in either 7-track or 9-track mode. (Author/LG)