IPSJ-TS     Information Processing Society of Japan    Trial Standard

IPSJ-TS 0007:2003


Basic Subset of Coded Character Sets - Japanese Core Ideographs



Publication of the Version 1.0E (this version) 2003-05-23
Publication of the Version 1.xE --
Publication of the Version 1.yE --

Errata 1 to the Version 1.0E 2003-11-02
Please send comments about this document to the TS desk, IPSJ/ITSCJ.

Copyright ©2003 IPSJ/ITSCJ, All Rights Reserved.

The normative version of this specification is the Japanese version available on the ITSCJ website.



Table of contents

Introduction
1. Scope
2. Normative References
3. Definitions
4. Criterion
5. Listing of the Characters
6. Basic Subset of Coded Character Sets - Japanese Core Ideographs



Introduction

This Trial Standard has been reviewed and endorsed by the IPSJ/ITSCJ technical committee. The authors of this document are the members of Working Group Five of the IPSJ/ITSCJ TS (Trial Standard) Committee. They developed this document as a Working Group activity undertaken in 2002 and 2003.


1. Scope

This Trial Standard specifies the Basic Subset of Coded Character Sets - Japanese Core Ideographs, which consists of the Kanji characters required in ordinary social life in Japan.

The characters in the Basic Subset have been extracted from published standards such as JIS X 0208 and IPSJ-TS 0005 and from reports on the occurrence of Kanji characters in newspapers and dictionaries, taking into account the functional importance of each Kanji character.


2. Normative References

The following documents contain provisions which, through reference in this text, constitute provisions of this Trial Standard. For each reference, the latest edition of the normative document referred to applies.

IPSJ-TS 0005:2002, Basic Subset of Coded Character Sets (BUCS), 2002-03

JIS X 0208:1997, 7-bit and 8-bit double byte coded KANJI sets for information interchange, 1997-01


3. Definitions

For the purpose of this Trial Standard, the definitions in IPSJ-TS 0005 apply.


4. Criterion

4.1 Source sets

The Kanji characters in the Basic Subset - Japanese Core Ideographs are defined with reference to the following documents:

[1] Occurrence list of Kanji characters in Asahi Shimbun 1993 (Yokoyama, Sasahara, et al., Kanji characters in newspaper media, Sanseido Publishing, 1998)
the number of character occurrences - 24,896,411,   the number of characters - 4,488

[2] Daijirin (大辞林) (2nd Edition, Sanseido Publishing, 1995)
the number of character occurrences - 4,595,109,   the number of characters - 5,534

[3] Occurrence list of Kanji characters in Mainichi Shimbun 2000 (IPSJ/ITSCJ TS Committee/WG5, 2003-03)
the number of character occurrences - 21,843,957,   the number of characters - 4,426

[4] JIS X 0213:2000, 7-bit and 8-bit double byte coded extended KANJI sets for information interchange, 2000-01

[5] The form of kanji characters not listed in the Joyo kanji Table (表外漢字字体表), Agency for Cultural Affairs, 2000-12
the standard form of kanji characters for printing (印刷標準字体) - 1,022 characters,   the generic form of kanji characters (簡易慣用字体) - 22 characters

[6] List of non-JIS Kanji characters in Daijirin (大辞林)

4.2 Processing

The Basic Subset of 4,593 characters, Japanese Core Ideographs, is configured by assembling the subsets described in 4.2.1 and 4.2.2 and then applying the adjustment described in 4.2.3.

4.2.1 Subset of JIS X 0208 characters

The subset consists of the following 4,567 characters defined in JIS X 0208:

a) 3,739 characters which are contained in all of the sources [1], [2] and [3]

b) 670 characters which occur with high frequency in any two of the sources [1], [2] and [3], excluding the characters of a)

c) 130 characters which are contained in [1] and [3], excluding the characters of a) and b)

d) 28 characters which are required to describe person and place names, excluding the characters of a), b) and c)
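
As an informal illustration only, the tiered selection in a) through c) can be sketched with set operations. The sketch assumes that a) means characters present in all three sources and that "high occurrence" is supplied as a precomputed set, since this document does not state the threshold. The function name, parameter names and toy data are hypothetical. Item d) is an editorial selection and is not modeled.

```python
def select_tiers(src1, src2, src3, high_occurrence, jis0208):
    """Split candidate characters into tiers a), b) and c) of 4.2.1.

    src1, src2, src3 : sets of characters found in sources [1], [2], [3]
    high_occurrence  : set of characters judged to occur frequently
                       (assumption: the threshold is decided elsewhere)
    jis0208          : set of characters defined in JIS X 0208
    """
    # Only JIS X 0208 characters are eligible for this subset.
    s1, s2, s3 = src1 & jis0208, src2 & jis0208, src3 & jis0208

    # a) characters contained in all three sources
    tier_a = s1 & s2 & s3

    # b) high-occurrence characters contained in any two sources, excluding a)
    in_two = (s1 & s2) | (s1 & s3) | (s2 & s3)
    tier_b = (in_two & high_occurrence) - tier_a

    # c) characters contained in both [1] and [3], excluding a) and b)
    tier_c = (s1 & s3) - tier_a - tier_b

    return tier_a, tier_b, tier_c


# Toy data: Latin letters stand in for Kanji characters.
tier_a, tier_b, tier_c = select_tiers(
    src1=set("ABCDEX"),
    src2=set("ABF"),
    src3=set("ACDGX"),
    high_occurrence=set("BC"),
    jis0208=set("ABCDEFG"),  # "X" is deliberately outside the coded set
)
print(sorted(tier_a), sorted(tier_b), sorted(tier_c))
```

Because each tier subtracts the earlier ones, the three sets are disjoint, which is what allows the counts in a) through d) to be added directly.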

4.2.2 Subset of non-JIS X 0208 characters

The subset consists of the following 28 characters that are not defined in JIS X 0208:

a) 15 characters that describe headings in [6]

b) 13 characters of [4], which are required to describe person and place names

4.2.3 Adjustment

The 4,595 characters of the subsets 4.2.1 and 4.2.2 are adjusted with respect to character shape, taking the source [5] into account.

a) Five shapes are changed.

b) Two shapes are replaced with the corresponding shapes contained in the set of 4,595 characters. Accordingly, 2 characters are removed.
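
The character counts stated in 4.2.1 through 4.2.3 can be cross-checked with simple arithmetic. The following is a minimal sketch that uses only the figures given in this clause; the variable names are illustrative.

```python
# Counts stated in 4.2.1 (JIS X 0208 characters), items a) to d)
counts_4_2_1 = {"a": 3739, "b": 670, "c": 130, "d": 28}

# Counts stated in 4.2.2 (non-JIS X 0208 characters), items a) and b)
counts_4_2_2 = {"a": 15, "b": 13}

jis_subset = sum(counts_4_2_1.values())      # subset of 4.2.1
non_jis_subset = sum(counts_4_2_2.values())  # subset of 4.2.2
combined = jis_subset + non_jis_subset       # before the adjustment of 4.2.3

# 4.2.3 b): two shapes are merged into existing ones, so 2 characters are removed.
# 4.2.3 a): five shapes are changed, which does not affect the count.
removed = 2
final_total = combined - removed

print(jis_subset, non_jis_subset, combined, final_total)
```

The totals agree with the figures in the text: 4,567 and 28 characters in the two subsets, 4,595 before adjustment, and 4,593 in the final Basic Subset.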


5. Listing of the Characters

The elements of the Basic Subset are ordered according to the 康煕字典 (Kangxi Dictionary) and are assigned sequential numbers. Each element is represented as a tuple of [sequential number, UCS code position, character shape].
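
For readers who process the listing by program, one possible in-memory representation of an element is sketched below. The type name, field names and helper functions are hypothetical, and the sequential number in the example is illustrative rather than taken from Table 6.1. The character shape is approximated by the character itself, since plain text cannot carry a glyph image.

```python
from typing import NamedTuple


class CoreIdeograph(NamedTuple):
    """One element of the listing (hypothetical structure)."""
    seq: int    # sequential number in Kangxi Dictionary order
    ucs: int    # UCS code position
    shape: str  # the character, standing in for its printed shape


def make_entry(seq: int, char: str) -> CoreIdeograph:
    """Build an entry, deriving the UCS code position from the character."""
    if len(char) != 1:
        raise ValueError("expected exactly one character")
    return CoreIdeograph(seq=seq, ucs=ord(char), shape=char)


def format_entry(entry: CoreIdeograph) -> str:
    """Render an entry as 'number<TAB>U+XXXX<TAB>character'."""
    return f"{entry.seq}\tU+{entry.ucs:04X}\t{entry.shape}"


# Illustrative entry; the sequential number 1 is an assumption.
example = make_entry(1, "一")
print(format_entry(example))
```

Storing the code position as an integer keeps entries sortable and comparable, while the "U+" hexadecimal notation is applied only when an entry is displayed.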


6. Basic Subset of Coded Character Sets - Japanese Core Ideographs

The Basic Subset is shown in Table 6.1.

Table 6.1 Basic Subset of Coded Character Sets - Japanese Core Ideographs