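1 Multi-task learning method
1.1 Hybrid CTC-attention model
1.2 Accent recognition model
2 Accent recognition method based on CTC spike features
2.1 CTC spike characteristics
2.2 Accent feature analysis
2.3 Accent recognition based on CTC spike features
3 Experimental results and analysis
3.1 Experimental data and configuration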
Table 1 Experimental data (speech duration in hours)
| Dataset | Accent | Abbreviation | Training set/h | Validation set/h | Test set/h |
| Common Voice 9.0 | Australian | AU | 365.80 | 92.40 | 50.77 |
| Common Voice 9.0 | Canadian | CA | 124.82 | 31.20 | 17.28 |
| Common Voice 9.0 | England | EN | 104.39 | 26.01 | 14.66 |
| Common Voice 9.0 | German | GE | 59.16 | 14.62 | 8.22 |
| Common Voice 9.0 | India | IN | 50.79 | 12.84 | 7.07 |
| Common Voice 9.0 | United States | UN | 55.08 | 13.71 | 7.55 |
| AESRC2020 | United Kingdom | UK | 18.50 | 1.54 | 2.21 |
| AESRC2020 | Russia | RU | | | |
| AESRC2020 | India | IND | | | |
| AESRC2020 | South Korea | KR | | | |
| AESRC2020 | Japan | JPN | | | |
| AESRC2020 | China | CHN | | | |
| AESRC2020 | Portugal | PT | | | |
| AESRC2020 | United States | US | | | |
| Librispeech | - | - | 960.90 | 10.70 | 10.40 |
3.2 Decoding configuration and evaluation metrics
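The result tables in this section report two metrics, WER for speech recognition (ASR) and Acc for accent recognition (AR). The following is a minimal sketch of how such metrics are commonly computed, assuming the standard definitions: WER as the word-level edit distance divided by the number of reference words, and Acc as the fraction of utterances whose accent label is predicted correctly. The function names `wer` and `accuracy` are hypothetical and the whitespace tokenization is an assumption; the paper's own scoring tool and text normalization are not specified here.

```python
from typing import Sequence


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent, assuming whitespace tokenization."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def accuracy(true_labels: Sequence[str], predicted_labels: Sequence[str]) -> float:
    """Accent classification accuracy in percent."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 reference words.
    print(f"{wer('the cat sat on the mat', 'the cat sit on mat'):.1f}")
    print(f"{accuracy(['UK', 'US', 'RU', 'KR'], ['UK', 'US', 'RU', 'JPN']):.1f}")
```

Corpus-level WER is usually obtained by summing the edit distances and reference lengths over all utterances before dividing, rather than by averaging per-utterance values; the sketch above scores a single utterance only.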
3.3 Analysis of experimental results
Table 2 WER/% on the Common Voice dataset
| Training method | AU | CA | EN | GE | IN | UN | Average |
| ASR | 10.9 | 8.6 | 10.3 | 6.3 | 12.4 | 9.2 | 9.7 |
| AR+ASR | 10.4 | 8.2 | 9.6 | 6.1 | 11.5 | 8.7 | 9.1 |
Table 3 WER/% on the AESRC2020 dataset
| Training method | UK | RU | IND | KR | JPN | CHN | PT | US | Average |
| ASR | 16.8 | 21.9 | 21.7 | 17.4 | 18.9 | 23.5 | 18.7 | 19.7 | 19.7 |
| AR+ASR | 11.2 | 15.2 | 14.8 | 10.2 | 11.5 | 19.7 | 11.2 | 12.2 | 13.2 |
| ASR(Pretrain) | 2.5 | 11.9 | 9.1 | 7.6 | 9.2 | 11.5 | 6.9 | 6.0 | 7.8 |
| AR+ASR(Pretrain) | 1.8 | 11.1 | 7.3 | 7.0 | 8.7 | 10.1 | 5.8 | 4.8 | 6.8 |
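To make the gain from joint AR+ASR training easier to read off the AESRC2020 averages above, the relative WER reduction can be worked out with a short calculation. This is a sketch for illustration only: the helper name `relative_wer_reduction` is hypothetical, and the input values are the average-column entries copied from the table.

```python
def relative_wer_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Relative WER reduction in percent: (baseline - improved) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline_wer - improved_wer) / baseline_wer


# Average WER values taken from the AESRC2020 table above.
average_wer = {
    "ASR": 19.7,
    "AR+ASR": 13.2,
    "ASR(Pretrain)": 7.8,
    "AR+ASR(Pretrain)": 6.8,
}

without_pretrain = relative_wer_reduction(average_wer["ASR"], average_wer["AR+ASR"])
with_pretrain = relative_wer_reduction(
    average_wer["ASR(Pretrain)"], average_wer["AR+ASR(Pretrain)"]
)

print(f"AR+ASR vs ASR: {without_pretrain:.1f}% relative reduction")                       # 33.0%
print(f"AR+ASR(Pretrain) vs ASR(Pretrain): {with_pretrain:.1f}% relative reduction")      # 12.8%
```

On these averages, adding the accent recognition task lowers WER by about 33.0% relative without pre-training and by about 12.8% relative with pre-training.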
Table 4 Accent recognition and speech recognition performance on the Common Voice dataset
| Task | Metric | Training method | AU | CA | EN | GE | IN | UN | Average |
| AR | Acc/% | AR+ASR | 92.7 | 54.2 | 88.1 | 100.0 | 97.6 | 96.1 | 91.7 |
| AR | Acc/% | Spike-Frame AR+ASR | 93.2 | 58.5 | 89.3 | 100.0 | 97.9 | 96.1 | 92.3 |
| AR | Acc/% | Spike-Chunk AR+ASR | 93.5 | 58.5 | 89.2 | 100.0 | 98.2 | 96.2 | 92.4 |
| ASR | WER/% | AR+ASR | 10.4 | 8.2 | 9.6 | 6.1 | 11.5 | 8.7 | 9.1 |
| ASR | WER/% | Spike-Frame AR+ASR | 10.3 | 8.1 | 9.6 | 6.2 | 11.5 | 8.7 | 9.1 |
| ASR | WER/% | Spike-Chunk AR+ASR | 10.3 | 8.1 | 9.7 | 6.2 | 11.5 | 8.7 | 9.1 |
Table 5 Accent recognition and speech recognition performance on the AESRC2020 dataset
| Task | Metric | Training method | UK | RU | IND | KR | JPN | CHN | PT | US | Average |
| AR | Acc/% | AR+ASR(Pretrain) | 92.9 | 72.0 | 90.2 | 80.4 | 62.5 | 72.2 | 78.6 | 61.9 | 75.6 |
| AR | Acc/% | Spike-Frame AR+ASR(Pretrain) | 93.0 | 69.8 | 91.5 | 84.3 | 69.5 | 74.6 | 79.8 | 62.8 | 77.5 |
| AR | Acc/% | Spike-Chunk AR+ASR(Pretrain) | 93.0 | 70.8 | 91.3 | 83.5 | 69.1 | 75.1 | 80.1 | 62.4 | 77.5 |
| ASR | WER/% | AR+ASR(Pretrain) | 1.8 | 11.1 | 7.3 | 7.0 | 8.7 | 10.1 | 5.8 | 4.8 | 6.8 |
| ASR | WER/% | Spike-Frame AR+ASR(Pretrain) | 1.8 | 11.1 | 7.1 | 7.1 | 8.7 | 10.1 | 5.8 | 4.8 | 6.8 |
| ASR | WER/% | Spike-Chunk AR+ASR(Pretrain) | 1.8 | 11.2 | 7.2 | 7.0 | 8.7 | 10.2 | 5.9 | 4.8 | 6.8 |
