A PyTorch implementation of the EUSIPCO 2021 paper "Self-supervised Complex Network for Machine Sound Anomaly Detection" by Kim et al.
MIMIMI implements a novel deep complex neural network for machine sound anomaly detection using phase-aware spectral analysis. The model leverages complex-valued convolutions and attention mechanisms to capture both magnitude and phase information from audio spectrograms, achieving superior performance over traditional magnitude-only approaches.
- Complex-valued Neural Networks: Full complex arithmetic throughout the network
- Phase-aware Processing: Utilizes both magnitude and phase information
- Attention Mechanism: Adaptive weighting of magnitude vs. complex features
- Self-supervised Learning: Trains on normal sounds only using machine ID classification
- Multi-branch Architecture: Separate processing paths for magnitude, complex, and combined features
The network consists of three main components:
- Complex Convolutional Backbone: 5 complex convolution blocks with PReLU activation
- Dual-path Processing: Separate branches for magnitude and complex features
- Attention Fusion: Channel attention to weight magnitude vs. complex contributions
Important
The following table compares the paper's results, which used the complete MIMII dataset (30+ GB), compared to mine which uses only the development subset (~4 GB).
Machine Type | Paper (%) | Best Machine ID (%) | Average of all IDs (%) |
---|---|---|---|
Fan | 96.40 | 88.33 | 83.43 |
Pump | 98.75 | 99.48 | 87.15 |
Slider | 96.87 | 99.84 | 92.35 |
Valve | 96.87 | 93.06 | 85.22 |
Average | 97.22 | 95.18 | 87.04 |
Note
The paper didn't mention if it highlighted the best performing machine ID or the average across all IDs. Therefore, I decided to show both.
- Overall AUC: 83.43%
- Best performing ID: 88.33% (ID 1)
- Range: 73.48% - 88.33%
- Overall AUC: 87.15%
- Best performing ID: 99.48% (ID 2)
- Range: 51.69% - 99.48%
- Overall AUC: 92.35%
- Best performing ID: 99.84% (ID 2)
- Range: 73.94% - 99.84%
- Overall AUC: 85.22%
- Best performing ID: 93.06% (ID 0)
- Range: 79.94% - 93.06%
- Python 3.8+ (mandatory)
pip install -r requirements.txt
The implementation uses the MIMII dataset (development subset) from Zenodo:
cd data
python download_data.py
The script will automatically download and extract:
dev_data_fan.zip
dev_data_pump.zip
dev_data_slider.zip
dev_data_valve.zip
Train models for all machine types:
python main.py --mode train
Or train a specific machine type:
python train.py # Will train all types in config.MACHINE_TYPES
Evaluate trained models:
python main.py --mode evaluate
mimimi/
βββ src/
β βββ model/
β β βββ __init__.py # Model package initialization
β β βββ model.py # Main complex anomaly detector
β β βββ complex_layers.py # Complex convolution, batch norm, PReLU
β β βββ attention.py # Channel attention mechanism
β β βββ dataset.py # MIMII dataset loader with complex STFT
β β βββ train.py # Training pipeline with mixup support
β β βββ evaluate.py # Comprehensive evaluation metrics
β β βββ config.py # Configuration parameters
β β βββ utils.py # Utility functions (Youden threshold)
β βββ main.py # Main entry point for training/evaluation
βββ data/
β βββ download_data.py # Automated dataset download
βββ models/ # Saved models and metrics
Complex multiplication follows the mathematical rule:
Where:
- Real output:
$conv_{real}(real) - conv_{imag}(imag)$ - Imaginary output:
$conv_{real}(imag) + conv_{imag}(real)$
For complex input
- Real part:
$x_{\text{norm}} = \frac{x - \mu_x}{\sigma_x + \epsilon}$ - Imaginary part:
$y_{\text{norm}} = \frac{y - \mu_y}{\sigma_y + \epsilon}$ - Output:
$z_{\text{norm}} = x_{\text{norm}} + iy_{\text{norm}}$
Channel-wise attention to balance magnitude vs. complex features:
Where:
$A(F_t) = \text{softmax}(\text{FC}_2(\text{ReLU}(\text{FC}_1(\text{GAP}(F_t)))))$ -
$\odot$ denotes element-wise multiplication -
$\text{GAP}$ is Global Average Pooling
- Self-supervised Learning: Machine ID classification on normal sounds only
- Multi-loss Training: Combined losses from magnitude, complex, and fused branches
- Complex STFT: Input preprocessing preserves phase information
- Data Augmentation: Spectral masking, noise injection, frequency shifting
Parameter | Value | Description |
---|---|---|
Sample Rate | 16 kHz | Audio sampling frequency |
FFT Size | 1024 | STFT window size |
Hop Length | 512 | STFT hop size |
Frames | 64 | Time frames per sample |
Learning Rate | 1e-4 | Adam optimizer learning rate |
Batch Size | 32 | Training batch size |
Epochs | 30 | Maximum training epochs |
Patience | 20 | Early stopping patience |
The model detects anomalies through classification confidence scoring:
- Training Phase: Learn to classify machine IDs of normal sounds
- Test Phase: High cross-entropy loss indicates inability to classify β anomaly
- Threshold: Optimal threshold found using Youden's J statistic
def get_anomaly_score(self, x, true_class):
_, _, logits_total = self.forward(x)
return F.cross_entropy(logits_total, true_class, reduction='none')
The implementation provides comprehensive evaluation:
- AUC-ROC: Area under receiver operating characteristic curve
- Precision/Recall/F1: Binary classification metrics at optimal threshold
- Per-Machine Analysis: Individual machine ID performance
- Confusion Matrices: Classification and anomaly detection confusion matrices
- Score Distributions: Visualization of normal vs. anomalous score distributions
- Complete Complex Processing: Full complex arithmetic pipeline maintained
- Faithful Architecture: Closely follows paper specifications
- Comprehensive Evaluation: Detailed metrics and visualizations
- Robust Training: Early stopping, validation monitoring, model checkpointing
- Slider machines: Best performance (92.35%), demonstrating clear phase differences
- Pump machines: Moderate performance (87.15%), some IDs excel (99.48%)
- Valve machines: Good performance (85.22%), consistent across most IDs (93.06% best)
- Fan machines: Most challenging performance (83.43%), ID-dependent variations
- Individual Excellence: Several machine IDs achieve >99% AUC, showing model capability
- Data Augmentation: Enhanced spectral augmentation strategies
- Architecture Tuning: Hyperparameter optimization for each machine type
- Ensemble Methods: Combining multiple model predictions
- Advanced Complex Operations: Exploring additional complex layer types
- Kim, M., Ho, M. T., & Kang, H. G. (2021). Self-supervised Complex Network for Machine Sound Anomaly Detection. EUSIPCO 2021.
- Purohit, H., et al. (2019). MIMII Dataset: Sound dataset for malfunctioning industrial machine investigation and inspection. arXiv preprint.
- Choi, H. S., et al. (2018). Phase-aware speech enhancement with deep complex u-net. ICLR 2018.
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
- Original paper authors for the innovative complex network approach
- MIMII dataset creators for providing high-quality industrial sound data
- PyTorch community for excellent deep learning framework
Note: This implementation demonstrates the practical feasibility of complex-valued neural networks for audio anomaly detection, achieving competitive results while providing insights into phase-aware signal processing for industrial applications.