
ENH: Better handling of annotation encoding in Nihon Kohden reader #13457

@myd7349

Description


After merging this PR, there is still an issue in my use case: the Nihon Kohden reader cannot correctly decode Chinese characters in the annotations. Since Nihon Kohden's Neuroworkbench is a Windows application implemented with MFC, it most likely writes text using the ANSI code page rather than UTF-8. On Simplified Chinese Windows systems, the ANSI code page is cp936 (a variant of GBK), not cp65001 (UTF-8).

Currently, the annotations in mne-python are decoded as follows:

_encodings = ("utf-8", "latin1")

t_desc = t_desc.rstrip(b"\x00")
for enc in _encodings:
    try:
        t_desc = t_desc.decode(enc)
    except UnicodeDecodeError:
        pass
    else:
        break
else:
    warn(f"Could not decode log as one of {_encodings}")
    continue

When the annotations contain characters encoded in cp936/GBK, the UTF-8 attempt fails and the bytes are decoded as latin-1, which produces garbled text.
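For example, here is a minimal standalone snippet (not MNE code, just a demonstration) that reproduces the mojibake for the string 单导 seen in the output below:

# cp936/GBK bytes for "单导"
desc = "单导".encode("cp936")  # b'\xb5\xa5\xb5\xbc'
# desc.decode("utf-8") raises UnicodeDecodeError, so the loop above falls through
# to latin-1, which accepts any byte sequence:
print(desc.decode("latin1"))  # prints "µ¥µ¼", matching the garbled annotations below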

(The .21E file is decoded with the same logic, but since it mainly stores electrode information and is unlikely to contain CJK characters, the issue should be less significant there.)

A test program:

# coding: utf-8

from pathlib import Path
from mne.io.nihon import read_raw_nihon

directory = Path(r"E:\NKT\EEG2100")

for file in directory.rglob("*.EEG"):
    print('-' * 40)
    print(f"Processing file: {file}")

    try:
        raw = read_raw_nihon(file, preload=False, verbose=False)
        annotations = raw.annotations

        if len(annotations) == 0:
            print("No annotations found.")
        else:
            print("Annotations length:", len(annotations))
            for onset, description in zip(annotations.onset, annotations.description):
                print(f"Onset: {onset:.2f}, Description: {description}")
    except Exception as e:
        print(f"Failed to read {file}: {e}")

For one of the files, the output is:

Processing file: E:\NKT\EEG2100\DA045014.EEG
Annotations length: 221
Onset: 0.00, Description: REC START IA 16µ¥µ¼
Onset: 0.00, Description: PAT IA 16µ¥µ¼ CAL
Onset: 1.00, Description: IMP CHECK ON
Onset: 1.00, Description: A1+A2 OFF
Onset: 6.00, Description: IMP CHECK OFF
Onset: 26.00, Description: PAT VIIA 8µû¹Ç EEG
Onset: 46.00, Description: PAT IA 16µ¥µ¼ EEG
Onset: 226.00, Description: PAT IIA 16ƽ¾ù EEG
Onset: 406.00, Description: PAT IVA 8µ¥¼« EEG
Onset: 586.00, Description: BAD boundary
Onset: 586.00, Description: EDGE boundary
Onset: 586.00, Description: REC START IA 16µ¥µ¼
Onset: 586.00, Description: PHOTO 8Hz
Onset: 587.00, Description: Recording Gap 0000:0
Onset: 587.00, Description: A1+A2 OFF
Onset: 606.00, Description: PHOTO 10Hz
Onset: 626.00, Description: PHOTO 12Hz
Onset: 734.00, Description: REM
Onset: 737.00, Description: Stage 1

Currently, I see two ways to solve this issue:

  1. Add the locale code page to the _encodings list:
import locale

_encodings = (locale.getpreferredencoding(), "utf-8", "latin1")

That is, first try decoding with the locale's preferred encoding.

However, this approach also has a drawback: imagine I share an annotation log file encoded with cp936 with a Japanese or Korean friend. On their system the preferred encoding would be cp932 or cp949, so the file would still be decoded with the wrong code page.

  2. Add an encoding parameter to read_raw_nihon, allowing the user to specify the encoding used to decode annotations.

Like this: myd7349@6af1291

I saw that some readers already have an encoding parameter (e.g., encoding="utf8").

This option is not perfect either: if I send an annotation file encoded in cp936 to a Japanese friend and they edit it in Neuroworkbench, the file may end up containing two different character sets. But such a situation is likely rare.

For this solution, should we keep the UTF-8 fallback? (That is, if decoding with the user-specified encoding fails, fall back to UTF-8, which may still result in garbled text.)
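For reference, here is a rough sketch of what the decoding could look like with a user-specified encoding tried first and the current fallbacks kept (the helper name _decode_desc and its signature are hypothetical, not the actual MNE code):

from mne.utils import warn

_encodings = ("utf-8", "latin1")

def _decode_desc(t_desc, encoding=None):
    # Hypothetical helper: try the user-specified encoding first,
    # then fall back to the current defaults.
    encodings = ((encoding,) if encoding else ()) + _encodings
    t_desc = t_desc.rstrip(b"\x00")
    for enc in encodings:
        try:
            return t_desc.decode(enc)
        except UnicodeDecodeError:
            continue
    warn(f"Could not decode log as one of {encodings}")
    return None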

I have tested both solutions locally. Below is the test program and output for the second approach (passing encoding='cp936'):

# coding: utf-8

from pathlib import Path
from mne.io.nihon import read_raw_nihon

directory = Path(r"E:\NKT\EEG2100")

for file in directory.rglob("*.EEG"):
    print('-' * 40)
    print(f"Processing file: {file}")

    try:
        raw = read_raw_nihon(file, preload=False, encoding='cp936', verbose=False)
        annotations = raw.annotations

        if len(annotations) == 0:
            print("No annotations found.")
        else:
            print("Annotations length:", len(annotations))
            for onset, description in zip(annotations.onset, annotations.description):
                print(f"Onset: {onset:.2f}, Description: {description}")
    except Exception as e:
        print(f"Failed to read {file}: {e}")

Output:

Processing file: E:\NKT\EEG2100\DA045014.EEG
Annotations length: 221
Onset: 0.00, Description: REC START IA 16单导 EEG
Onset: 0.06, Description: PAT IA 16单导 CAL
Onset: 1.28, Description: IMP CHECK ON
Onset: 1.36, Description: A1+A2 OFF
Onset: 6.78, Description: IMP CHECK OFF
Onset: 26.68, Description: PAT VIIA 8蝶骨 EEG
Onset: 46.64, Description: PAT IA 16单导 EEG
Onset: 226.68, Description: PAT IIA 16平均 EEG
Onset: 406.74, Description: PAT IVA 8单极 EEG
Onset: 586.00, Description: BAD boundary
Onset: 586.00, Description: EDGE boundary
Onset: 586.00, Description: REC START IA 16单导 EEG
Onset: 586.14, Description: PHOTO 8Hz
Onset: 587.06, Description: Recording Gap 0000:00:18
Onset: 587.14, Description: A1+A2 OFF
Onset: 606.42, Description: PHOTO 10Hz
Onset: 626.44, Description: PHOTO 12Hz
Onset: 734.00, Description: REM
Onset: 737.60, Description: Stage 1
