
ENH: Better handling of annotation encoding in Nihon Kohden reader #13457

@myd7349

Description


After merging this PR, there is still an issue in my use case: the Nihon Kohden reader cannot correctly decode Chinese characters in the annotations. Since Nihon Kohden's Neuroworkbench is a Windows application implemented with MFC, it most likely writes text using the ANSI code page rather than UTF-8. On Simplified Chinese Windows systems, the ANSI code page is cp936 (a variant of GBK), not cp65001 (UTF-8).

Currently, the annotations in mne-python are decoded as follows:

_encodings = ("utf-8", "latin1")

t_desc = t_desc.rstrip(b"\x00")
for enc in _encodings:
    try:
        t_desc = t_desc.decode(enc)
    except UnicodeDecodeError:
        pass
    else:
        break
else:
    warn(f"Could not decode log as one of {_encodings}")
    continue

When the annotations contain characters encoded in cp936/GBK, the UTF-8 attempt fails and the bytes are decoded as latin-1, which produces garbled text.
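For example, here is a minimal standalone snippet (not MNE code, just a demonstration) that reproduces the mojibake for the string 单导 seen in the output below:

# cp936/GBK bytes for "单导"
desc = "单导".encode("cp936")  # b'\xb5\xa5\xb5\xbc'
# desc.decode("utf-8") raises UnicodeDecodeError, so the loop above falls through
# to latin-1, which accepts any byte sequence:
print(desc.decode("latin1"))  # prints "µ¥µ¼", matching the garbled annotations below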

(The .21E file is decoded with the same logic, but since it mainly stores electrode information and is unlikely to contain CJK characters, the issue should be less significant there.)

A test program:

# coding: utf-8

from pathlib import Path
from mne.io.nihon import read_raw_nihon

directory = Path(r"E:\NKT\EEG2100")

for file in directory.rglob("*.EEG"):
    print('-' * 40)
    print(f"Processing file: {file}")

    try:
        raw = read_raw_nihon(file, preload=False, verbose=False)
        annotations = raw.annotations

        if len(annotations) == 0:
            print("No annotations found.")
        else:
            print("Annotations length:", len(annotations))
            for onset, description in zip(annotations.onset, annotations.description):
                print(f"Onset: {onset:.2f}, Description: {description}")
    except Exception as e:
        print(f"Failed to read {file}: {e}")

For one of the files, the output is:

Processing file: E:\NKT\EEG2100\DA045014.EEG
Annotations length: 221
Onset: 0.00, Description: REC START IA 16µ¥µ¼
Onset: 0.00, Description: PAT IA 16µ¥µ¼ CAL
Onset: 1.00, Description: IMP CHECK ON
Onset: 1.00, Description: A1+A2 OFF
Onset: 6.00, Description: IMP CHECK OFF
Onset: 26.00, Description: PAT VIIA 8µû¹Ç EEG
Onset: 46.00, Description: PAT IA 16µ¥µ¼ EEG
Onset: 226.00, Description: PAT IIA 16ƽ¾ù EEG
Onset: 406.00, Description: PAT IVA 8µ¥¼« EEG
Onset: 586.00, Description: BAD boundary
Onset: 586.00, Description: EDGE boundary
Onset: 586.00, Description: REC START IA 16µ¥µ¼
Onset: 586.00, Description: PHOTO 8Hz
Onset: 587.00, Description: Recording Gap 0000:0
Onset: 587.00, Description: A1+A2 OFF
Onset: 606.00, Description: PHOTO 10Hz
Onset: 626.00, Description: PHOTO 12Hz
Onset: 734.00, Description: REM
Onset: 737.00, Description: Stage 1

Currently, I see two ways to solve this issue:

  1. Add the locale code page to the _encodings list:
import locale

_encodings = (locale.getpreferredencoding(), "utf-8", "latin1")

That is, first try decoding with the locale's preferred encoding.

However, this approach also has a drawback: imagine I share an annotation log file encoded with cp936 with a Japanese or Korean friend. On their system the preferred encoding would be cp932 or cp949, so the file would still be decoded with the wrong code page.

  2. Add an encoding parameter to read_raw_nihon, allowing the user to specify the encoding used to decode annotations.

Like this: myd7349@6af1291

I saw that some readers already have an encoding parameter (e.g., encoding="utf8").

This option is not perfect either: if I send an annotation file encoded in cp936 to a Japanese friend and they edit it in Neuroworkbench, the file may end up containing two different character sets. But such a situation is likely rare.

For this solution, should we keep the UTF-8 fallback? (That is, if decoding with the user-specified encoding fails, fall back to UTF-8, which may still result in garbled text.)
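For reference, here is a rough sketch of what the decoding could look like with a user-specified encoding tried first and the current fallbacks kept (the helper name _decode_desc and its signature are hypothetical, not the actual MNE code):

from mne.utils import warn

_encodings = ("utf-8", "latin1")

def _decode_desc(t_desc, encoding=None):
    # Hypothetical helper: try the user-specified encoding first,
    # then fall back to the current defaults.
    encodings = ((encoding,) if encoding else ()) + _encodings
    t_desc = t_desc.rstrip(b"\x00")
    for enc in encodings:
        try:
            return t_desc.decode(enc)
        except UnicodeDecodeError:
            continue
    warn(f"Could not decode log as one of {encodings}")
    return None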

I have tested both solutions locally. Below is the test program and output for the second approach (passing encoding='cp936'):

# coding: utf-8

from pathlib import Path
from mne.io.nihon import read_raw_nihon

directory = Path(r"E:\NKT\EEG2100")

for file in directory.rglob("*.EEG"):
    print('-' * 40)
    print(f"Processing file: {file}")

    try:
        raw = read_raw_nihon(file, preload=False, encoding='cp936', verbose=False)
        annotations = raw.annotations

        if len(annotations) == 0:
            print("No annotations found.")
        else:
            print("Annotations length:", len(annotations))
            for onset, description in zip(annotations.onset, annotations.description):
                print(f"Onset: {onset:.2f}, Description: {description}")
    except Exception as e:
        print(f"Failed to read {file}: {e}")

Output:

Processing file: E:\NKT\EEG2100\DA045014.EEG
Annotations length: 221
Onset: 0.00, Description: REC START IA 16单导 EEG
Onset: 0.06, Description: PAT IA 16单导 CAL
Onset: 1.28, Description: IMP CHECK ON
Onset: 1.36, Description: A1+A2 OFF
Onset: 6.78, Description: IMP CHECK OFF
Onset: 26.68, Description: PAT VIIA 8蝶骨 EEG
Onset: 46.64, Description: PAT IA 16单导 EEG
Onset: 226.68, Description: PAT IIA 16平均 EEG
Onset: 406.74, Description: PAT IVA 8单极 EEG
Onset: 586.00, Description: BAD boundary
Onset: 586.00, Description: EDGE boundary
Onset: 586.00, Description: REC START IA 16单导 EEG
Onset: 586.14, Description: PHOTO 8Hz
Onset: 587.06, Description: Recording Gap 0000:00:18
Onset: 587.14, Description: A1+A2 OFF
Onset: 606.42, Description: PHOTO 10Hz
Onset: 626.44, Description: PHOTO 12Hz
Onset: 734.00, Description: REM
Onset: 737.60, Description: Stage 1
