@Yukinoroh
Contributor

See file "mod24.txt" for details.
Contents were too big to paste here.

@Yukinoroh
Contributor Author

Yukinoroh commented Sep 4, 2024

I am currently running more scripts that compare the CDP/CNS/Daikanwa data against the UCS data to find:
・CDP/CNS/Daikanwa characters referred to in the UCS file whose decomposition data is exactly equal to that of an existing UCS character;
・Partial decomposition data in UCS that could be replaced by a CDP/CNS/Daikanwa character.
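
For readers who want to reproduce the first check, here is a minimal sketch. It is not the script used for this pull request. It assumes each data line is tab-separated as code, character, IDS, and that lines starting with ";" are comments; the function names and the sample codes "X-0001"/"X-0002" are hypothetical.

```python
from collections import defaultdict


def parse_ids_lines(lines):
    """Yield (code, ids) pairs from lines assumed to be 'code<TAB>char<TAB>IDS'."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith(";"):  # assumed comment marker
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue
        yield fields[0], fields[2]


def starts_with_idc(ids):
    """True if the sequence begins with an Ideographic Description Character."""
    return bool(ids) and "\u2ff0" <= ids[0] <= "\u2fff"


def find_exact_matches(ucs_lines, other_lines):
    """List non-UCS codes whose IDS is identical to that of a UCS character."""
    by_ids = defaultdict(list)
    for code, ids in parse_ids_lines(ucs_lines):
        by_ids[ids].append(code)
    matches = []
    for code, ids in parse_ids_lines(other_lines):
        # Skip atomic entries: only real decompositions are compared.
        if starts_with_idc(ids) and ids in by_ids:
            matches.append((code, by_ids[ids]))
    return matches


ucs = ["U+660E\t明\t⿰日月"]
other = ["X-0001\t&X-0001;\t⿰日月", "X-0002\t&X-0002;\t⿱日月"]
print(find_exact_matches(ucs, other))
```

Each match is only a candidate: two characters can share a decomposition without being the same character, so every hit still needs a manual check.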

@Yukinoroh
Contributor Author

Yukinoroh commented Sep 8, 2024

I have just compared the UCS files against Daikanwa and found three issues in the Daikanwa files. I don't know how to fix them because I don't have access to the source material, but I am fairly sure each one is a problem:
・M-49556 is labelled as equivalent to U+2458B 𤖋, yet it has the decomposition data of U+2E34E 𮍎.
(Maybe U+2458B 𤖋 was fixed later and M-49556 was left as is? I see a lot of data redundancy in the non-UCS files...)
・M-25016 has incomplete decomposition data: "⿰禾"
・M-32388 has incomplete decomposition data: "⿱⺿"
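
Incomplete sequences like the last two can be detected mechanically. The sketch below is only an illustration, not the script used here. It assumes the operators are limited to U+2FF0 through U+2FFB (two operands each, except ⿲ and ⿳, which take three) and that an entity reference such as "&M-25016;" counts as a single component; `is_complete_ids` is a hypothetical name.

```python
import re

# Arity of each Ideographic Description Character in U+2FF0..U+2FFB.
IDC_ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
IDC_ARITY["\u2ff2"] = 3  # ⿲
IDC_ARITY["\u2ff3"] = 3  # ⿳

# An entity reference (&...;) is one token; otherwise one code point per token.
TOKEN = re.compile(r"&[^;]+;|.", re.DOTALL)


def is_complete_ids(ids):
    """Return True if every operator in the sequence has all of its operands."""
    tokens = TOKEN.findall(ids)
    if not tokens:
        return False
    needed = 1  # number of components still expected
    for tok in tokens:
        if needed == 0:
            return False  # extra tokens after a complete sequence
        needed += IDC_ARITY.get(tok, 0) - 1
    return needed == 0


for sample in ["⿰禾", "⿱⺿", "⿰日月"]:
    print(sample, is_complete_ids(sample))
```

The check is purely syntactic: it reports that "⿰禾" and "⿱⺿" are missing a component, but it cannot say which component belongs there.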

@Yukinoroh
Contributor Author

Yukinoroh commented Sep 11, 2024

Additionally, I found two errors in the CBETA file:
・CB08110 強 has the decomposition "U-00020F22 𠼢".
・Both CB02326 and CB11790 use the decomposition "⿱巳廾". This may not be a mistake, but I don't know which one to use to simplify U-000266AA 𦚪 and U-00029426 𩐦, so I will leave them untouched for now.

@chise
Owner

chise commented Sep 11, 2024

Additionally, there is a problem with the "CB08110 強" character in IDS-CBETA.txt; it has decomposition of "U-00020F22 𠼢".

cf. https://www.chise.org/est/view/character/rep.cbeta=08110
https://www.chise.org/est/view/character/repi.cbeta=08110

CB08110 is 𠼢, not 強.

In general, the second column was generated automatically long ago. It reflects the mapping at that time, and it sometimes broke because of wrong settings, editing, or definitions in XEmacs CHISE. In addition, the second column is designed to display the isolated character of the encoding, not its Unicode mapping, so it should be stored as an entity reference. If a normal Unicode character is stored there, it may be a bug.
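
As an illustration of that rule, the following sketch lists rows whose second column is not an entity reference. It assumes tab-separated rows, ";" as the comment marker, and the "&NAME;" entity-reference form. The function name and the sample rows are made up for the example and are not taken from IDS-CBETA.txt.

```python
import re

# Assumed shape of an entity reference: '&', a name without spaces, ';'.
ENTITY_REF = re.compile(r"^&[^;&\s]+;$")


def suspicious_second_columns(lines):
    """Return (line number, code, second column) for rows that store
    something other than an entity reference in the second column."""
    flagged = []
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith(";"):  # assumed comment marker
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        if not ENTITY_REF.match(fields[1]):
            flagged.append((lineno, fields[0], fields[1]))
    return flagged


rows = [
    "X-0001\t&X-0001;\t⿰日月",  # entity reference: fine
    "X-0002\t強\t⿰弓虽",        # plain Unicode character: flagged
]
print(suspicious_second_columns(rows))
```

A flagged row is a hint, not proof of a bug; each one would still have to be checked against the character definitions.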

Anyway, IDS-CBETA.txt was generated automatically and has not been maintained well. We would need the Taishō Tripiṭaka (大正新脩大藏經) to check this file semantically, but I don't have these volumes now. Even if I could access them easily, I don't have enough time to do a semantic check. I can regenerate the file from the current CHISE character ontology, but that might introduce new semantic bugs rather than fix the syntactic problems. I think the value of this file is that it records the information as it was when the file was created, including its syntax issues.

@chise
Owner

chise commented Sep 11, 2024

By the way, we should move this discussion to an issue.

@Yukinoroh
Contributor Author

So far, by comparing the UCS file to CDP, Daikanwa and CBETA, I have some 350 fixes, and half of them suggest putting a Unicode character back into UCS. I still need to process CNS. (P.S.: I will not commit until you have checked the current fixes.)

@Yukinoroh
Contributor Author

Yukinoroh commented Sep 15, 2024

I have finished comparing the UCS files against CDP, Daikanwa, CBETA and CNS (and between each of these I re-compared UCS against itself). I have 388 fixes waiting. By the way, I will be away from the 16th to about the 23rd, so I will not be able to commit before the 23rd.

@Yukinoroh
Contributor Author

Actually, there are 589 fixes, not 388. A huge chunk of them was missing from the summary file. Let me know when you're ready to take them.

@Yukinoroh
Contributor Author

Hello, any news? As I commented after sending the pull request, I have another set of changes pending. I would like to make them available to the community. How should I proceed?
