6 changes: 3 additions & 3 deletions .github/workflows/quality.yml
@@ -11,11 +11,11 @@ jobs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Set up Python 3.6
- name: Set up Python 3.8
uses: actions/setup-python@v2
with:
python-version: 3.6
python-version: 3.8
- name: Install Python dependencies
run: pip install black
- name: Run Quality check
run: make quality
run: make quality
72 changes: 36 additions & 36 deletions subtitles/zh-CN/56_data-processing-for-masked-language-modeling.srt
@@ -5,12 +5,12 @@

2
00:00:05,250 --> 00:00:07,230
- 让我们看看如何预处理我们的数据
- 让我们看一下如何针对掩码语言建模
- Let's see how we can preprocess our data

3
00:00:07,230 --> 00:00:08,703
用于掩码语言建模
预处理我们的数据
for masked language modeling.

4
@@ -20,7 +20,7 @@ As a reminder, masked language modeling

5
00:00:12,570 --> 00:00:15,333
是当模型需要填补句子中的空白时
主要在模型需要填补句子中的空白时使用
is when a model needs to fill the blanks in a sentence.

6
@@ -30,27 +30,27 @@ To do this, you just need texts, no labels,

7
00:00:19,650 --> 00:00:22,200
因为这是一个自我监督的问题
因为这是一个自监督的问题
as this is a self-supervised problem.

8
00:00:22,200 --> 00:00:23,670
要将其应用于你自己的数据
要将其应用于您自己的数据
To apply this on your own data,

9
00:00:23,670 --> 00:00:25,740
只要确保你收集了所有的文本
只要确保您在数据集的一列中
just make sure you have all your texts gathered

10
00:00:25,740 --> 00:00:27,603
在数据集的一列中
收集了所有的文本
in one column of your dataset.

11
00:00:28,440 --> 00:00:30,480
在我们开始随机掩盖事物之前
在开始随机掩码处理之前
Before we start randomly masking things,

12
@@ -60,7 +60,7 @@ we will need to somehow make all those texts the same length

13
00:00:33,090 --> 00:00:34,263
将它们一起批处理
从而将它们一起批处理
to batch them together.

14
@@ -70,7 +70,7 @@ The first way to make all the texts the same length

15
00:00:38,490 --> 00:00:40,590
是我们在文本分类中使用的那个
和我们在文本分类中所使用的相同
is the one we used in text classification.

16
@@ -95,27 +95,27 @@ this is all done by our tokenizer

20
00:00:49,923 --> 00:00:53,130
具有正确的填充和截断选项
并且配置相应的填充和截断选项
with the right options for padding and truncation.
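
A minimal sketch of the fixed-length tokenization these lines describe, for reference; the checkpoint name, the "text" column and the context length of 128 are illustrative assumptions, not something this PR specifies:

```python
from transformers import AutoTokenizer

# Illustrative checkpoint and context length; neither is part of this PR.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_function(examples):
    # Pad or truncate every text to the same fixed length so they can be batched.
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=128,
    )
```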

21
00:00:53,130 --> 00:00:56,100
但是,这会使我们丢失很多文本
如果与我们选择的上下文长度相比,
This will however make us lose a lot of texts

22
00:00:56,100 --> 00:00:58,620
如果我们数据集中的示例很长
我们数据集的示例很长
if the examples in our dataset are very long,

23
00:00:58,620 --> 00:01:00,960
与我们选择的上下文长度相比
就会使我们丢失很多文本
compared to the context length we picked.

24
00:01:00,960 --> 00:01:03,393
在这里,所有灰色部分都丢失了
在这里,所有标记为灰色的部分都丢失了
Here, all the portion in gray is lost.

25
@@ -125,17 +125,17 @@ This is why a second way to generate samples of text

26
00:01:06,660 --> 00:01:08,820
具有相同的长度是分块我们的文本
具有相同的长度是为了在上下文长度中
with the same length is to chunk our text

27
00:01:08,820 --> 00:01:10,560
在上下文长度中,
为我们的文本分块
in pieces of context lengths,

28
00:01:10,560 --> 00:01:14,010
而不是在第一个块之后丢弃所有内容
而不是在第一个数据块之后丢弃所有内容
instead of discarding everything after the first chunk.

29
@@ -150,7 +150,7 @@

31
00:01:17,700 --> 00:01:20,493
我们可以选择保留和填充或忽略
我们可以选择保留并填充或者忽略
which we can choose to keep and pad or ignore.

32
@@ -160,32 +160,32 @@ Here is how we can apply this in practice,

33
00:01:23,790 --> 00:01:26,460
只需添加 return overflowing tokens 选项
只需在我们调用分词器时添加 return overflowing tokens
by just adding the return overflowing tokens option

34
00:01:26,460 --> 00:01:28,200
在我们的分词器调用中。
选项
in our tokenizer call.

35
00:01:28,200 --> 00:01:30,243
请注意这如何为我们提供更大的数据集
请注意这样会为我们提供更大的数据集
Note how this gives us a bigger dataset!

36
00:01:31,560 --> 00:01:34,260
这第二种分块方式是理想的,如果你所有的文本
如果你所有的文本很长,
This second way of chunking is ideal if all your texts

37
00:01:34,260 --> 00:01:36,270
很长,但行不通
这里第二种分块方式是理想的,
are very long, but it won't work

38
00:01:36,270 --> 00:01:39,900
如果你的课文有不同的长度,那也不错
但如果你的文本有不同的长度,那么效果就不尽人意
as nicely if you have a variety of lengths in the texts.

39
@@ -195,22 +195,22 @@ In this case,

40
00:01:41,040 --> 00:01:44,280
最好的选择是连接所有标记化的文本
最好的选择是将所有标记化的文本组合成为一个大的数据流
the best option is to concatenate all your tokenized texts

41
00:01:44,280 --> 00:01:46,560
在一个大流中,有一个特殊的标记
附加一个特殊的标记
in one big stream, with a special tokens

42
00:01:46,560 --> 00:01:49,800
指示你何时从一份文件转到另一份文件
表明你何时从一份文件转到另一份文件
to indicate when you pass from one document to the other,

43
00:01:49,800 --> 00:01:52,503
然后才将大流分成块
然后才将该数据流分成数据块
and only then split the big stream into chunks.

44
@@ -230,32 +230,32 @@

47
00:02:00,780 --> 00:02:02,850
注意它是如何减少样本数量的
注意在我们这里的数据集中,
Notice how it reduces the number of samples

48
00:02:02,850 --> 00:02:04,230
在我们这里的数据集中,
它是如何减少样本数量的
in our dataset here,

49
00:02:04,230 --> 00:02:06,580
一定有不少短条目
一定有大量短条目
there must have been quite a few short entries!

50
00:02:07,710 --> 00:02:11,130
完成此操作后,掩码就很容易了
完成此操作后,掩码处理就很容易了
Once this is done, the masking is the easy part.

51
00:02:11,130 --> 00:02:13,400
有专门为此设计的数据整理器
在 Transformers 库中有专门为此设计的
There is a data collator designed specifically for this

52
00:02:13,400 --> 00:02:15,540
在变形金刚图书馆
数据整理器
in the Transformers library.

53
@@ -265,7 +265,7 @@ You can use it directly in the Trainer,

54
00:02:17,700 --> 00:02:20,400
或者将你的数据集转换为张量流数据集时
或者将你的数据集转换为 tensorflow 数据集时
or when converting your datasets to tensorflow datasets
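
A sketch of the masking step with DataCollatorForLanguageModeling from the Transformers library, which the subtitles refer to; the Trainer and to_tf_dataset usages are left as comments because the model, training arguments and dataset names are assumptions here:

```python
from transformers import DataCollatorForLanguageModeling

# The collator performs the random masking on the fly; mlm_probability defaults to 0.15.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# With the Trainer (model and training_args assumed to be defined elsewhere):
# trainer = Trainer(
#     model=model,
#     args=training_args,
#     train_dataset=lm_dataset,
#     data_collator=data_collator,
# )

# Or as the collate_fn when converting the dataset for TensorFlow
# (you would typically build the collator with return_tensors="np" in that case):
# tf_dataset = lm_dataset.to_tf_dataset(
#     columns=["input_ids", "attention_mask"],
#     collate_fn=data_collator,
#     shuffle=True,
#     batch_size=32,
# )
```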

55