Commit 0e70614

Merge pull request dcramer#5 from byroot/fix-bom-detection
Fix BOM detection dcramer#4 Thanks @byroot
2 parents a621369 + e5e8add commit 0e70614

7 files changed: +43 -8 lines changed

charade/universaldetector.py

Lines changed: 7 additions & 7 deletions
@@ -70,31 +70,31 @@ def feed(self, aBuf):
 
         if not self._mGotData:
             # If the data starts with BOM, we know it is UTF
-            if aBuf[:3] == '\xEF\xBB\xBF':
+            if aBuf[:3] == b'\xEF\xBB\xBF':
                 # EF BB BF UTF-8 with BOM
                 self.result = {'encoding': "UTF-8", 'confidence': 1.0}
-            elif aBuf[:4] == '\xFF\xFE\x00\x00':
+            elif aBuf[:4] == b'\xFF\xFE\x00\x00':
                 # FF FE 00 00 UTF-32, little-endian BOM
                 self.result = {'encoding': "UTF-32LE", 'confidence': 1.0}
-            elif aBuf[:4] == '\x00\x00\xFE\xFF':
+            elif aBuf[:4] == b'\x00\x00\xFE\xFF':
                 # 00 00 FE FF UTF-32, big-endian BOM
                 self.result = {'encoding': "UTF-32BE", 'confidence': 1.0}
-            elif aBuf[:4] == '\xFE\xFF\x00\x00':
+            elif aBuf[:4] == b'\xFE\xFF\x00\x00':
                 # FE FF 00 00 UCS-4, unusual octet order BOM (3412)
                 self.result = {
                     'encoding': "X-ISO-10646-UCS-4-3412",
                     'confidence': 1.0
                 }
-            elif aBuf[:4] == '\x00\x00\xFF\xFE':
+            elif aBuf[:4] == b'\x00\x00\xFF\xFE':
                 # 00 00 FF FE UCS-4, unusual octet order BOM (2143)
                 self.result = {
                     'encoding': "X-ISO-10646-UCS-4-2143",
                     'confidence': 1.0
                 }
-            elif aBuf[:2] == '\xFF\xFE':
+            elif aBuf[:2] == b'\xFF\xFE':
                 # FF FE UTF-16, little endian BOM
                 self.result = {'encoding': "UTF-16LE", 'confidence': 1.0}
-            elif aBuf[:2] == '\xFE\xFF':
+            elif aBuf[:2] == b'\xFE\xFF':
                 # FE FF UTF-16, big endian BOM
                 self.result = {'encoding': "UTF-16BE", 'confidence': 1.0}
 
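
Note on the hunk above: on Python 3, a bytes slice never compares equal to a str literal, so the unprefixed comparisons presumably always fell through and no BOM was ever matched. The sketch below illustrates this, and also why the 4-byte checks come before the 2-byte ones. It is a standalone illustration, not charade code: the BOMS table and the detect_bom helper are hypothetical names, and only the byte sequences and encoding labels are taken from the diff.

```python
import codecs

# Hypothetical table, not part of charade. Longer BOMs are listed first
# because the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
BOMS = (
    (b'\xEF\xBB\xBF', "UTF-8"),
    (b'\xFF\xFE\x00\x00', "UTF-32LE"),
    (b'\x00\x00\xFE\xFF', "UTF-32BE"),
    (b'\xFE\xFF\x00\x00', "X-ISO-10646-UCS-4-3412"),
    (b'\x00\x00\xFF\xFE', "X-ISO-10646-UCS-4-2143"),
    (b'\xFF\xFE', "UTF-16LE"),
    (b'\xFE\xFF', "UTF-16BE"),
)


def detect_bom(buf):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, encoding in BOMS:
        if buf.startswith(bom):
            return encoding
    return None


utf8 = codecs.BOM_UTF8 + "hi".encode("utf-8")
utf16 = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
utf32 = codecs.BOM_UTF32_LE + "hi".encode("utf-32-le")

print(detect_bom(utf8))    # UTF-8
print(detect_bom(utf16))   # UTF-16LE
print(detect_bom(utf32))   # UTF-32LE

# The comparison the commit fixes: bytes vs. str is always False on Python 3.
print(utf8[:3] == '\xEF\xBB\xBF')    # False
print(utf8[:3] == b'\xEF\xBB\xBF')   # True
```

If the two UTF-16 entries were moved to the top of the table, the UTF-32LE sample would be reported as UTF-16LE, which is the reason for the ordering in the original if/elif chain.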

test.py

Lines changed: 1 addition & 1 deletion
@@ -49,7 +49,7 @@ def main():
             continue
         for file_name in os.listdir(path):
             _, ext = os.path.splitext(file_name)
-            if ext not in ['.html', '.txt', '.xml']:
+            if ext not in ['.html', '.txt', '.xml', '.srt']:
                 continue
             suite.addTest(TestCase(os.path.join(path, file_name), encoding))
     unittest.TextTestRunner().run(suite)

tests/UTF-16BE/bom-utf-16-be.srt

1.67 KB
Binary file not shown.

tests/UTF-16LE/bom-utf-16-le.srt

1.67 KB
Binary file not shown.

tests/UTF-32BE/bom-utf-32-be.srt

3.35 KB
Binary file not shown.

tests/UTF-32LE/bom-utf-32-le.srt

3.35 KB
Binary file not shown.

tests/utf-8/bom-utf-8.srt

Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
+1
+00:00:06,500 --> 00:00:09,000
+About 2 months ago I found myself on
+the comment section of YouTube
+
+2
+00:00:11,000 --> 00:00:17,000
+And I was commenting,
+unfortunately I was commenting,
+on a video about the famous Ayn Rand
+
+3
+00:00:19,000 --> 00:00:24,000
+And I
+posted underneath against
+this woman's tirades,
+against what is essentially
+the human race.
+
+4
+00:00:25,000 --> 00:00:31,000
+that, this monetary system seems to have no point, seems to actually hinder people
+
+5
+00:00:31,000 --> 00:00:36,000
+and hinder progress, and one of the responses I got, I didn't answer it, was:
+
+6
+00:00:37,000 --> 00:00:43,000
+what actually money creates is an incentive to invent the new items, that's the driving force behind it
+
+7
+00:00:43,000 --> 00:00:50,000
+So what I thought I do is instead if answering on a YouTube comment is organize a global awareness day
+
