6 changes: 3 additions & 3 deletions .github/workflows/quality.yml
@@ -11,11 +11,11 @@ jobs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Set up Python 3.6
- name: Set up Python 3.8
uses: actions/setup-python@v2
with:
python-version: 3.6
python-version: 3.8
- name: Install Python dependencies
run: pip install black
- name: Run Quality check
run: make quality
run: make quality
72 changes: 36 additions & 36 deletions subtitles/zh-CN/56_data-processing-for-masked-language-modeling.srt
@@ -5,12 +5,12 @@

2
00:00:05,250 --> 00:00:07,230
- 让我们看看如何预处理我们的数据
- 让我们看一下如何针对掩码语言建模
- Let's see how we can preprocess our data

3
00:00:07,230 --> 00:00:08,703
用于掩码语言建模
预处理我们的数据
for masked language modeling.

4
@@ -20,7 +20,7 @@ As a reminder, masked language modeling

5
00:00:12,570 --> 00:00:15,333
是当模型需要填补句子中的空白时
主要在模型需要填补句子中的空白时使用
is when a model needs to fill the blanks in a sentence.

6
@@ -30,27 +30,27 @@ To do this, you just need texts, no labels,

7
00:00:19,650 --> 00:00:22,200
因为这是一个自我监督的问题
因为这是一个自监督的问题
as this is a self-supervised problem.

8
00:00:22,200 --> 00:00:23,670
要将其应用于你自己的数据
要将其应用于您自己的数据
To apply this on your own data,

9
00:00:23,670 --> 00:00:25,740
只要确保你收集了所有的文本
只要确保您在数据集的一列中
just make sure you have all your texts gathered

10
00:00:25,740 --> 00:00:27,603
在数据集的一列中
收集了所有的文本
in one column of your dataset.

11
00:00:28,440 --> 00:00:30,480
在我们开始随机掩盖事物之前
在开始随机掩码处理之前
Before we start randomly masking things,

12
@@ -60,7 +60,7 @@ we will need to somehow make all those texts the same length

13
00:00:33,090 --> 00:00:34,263
将它们一起批处理
从而将它们一起批处理
to batch them together.

14
@@ -70,7 +70,7 @@ The first way to make all the texts the same length

15
00:00:38,490 --> 00:00:40,590
是我们在文本分类中使用的那个
和我们在文本分类中所使用的相同
is the one we used in text classification.

16
@@ -95,27 +95,27 @@ this is all done by our tokenizer

20
00:00:49,923 --> 00:00:53,130
具有正确的填充和截断选项
并且配置相应的填充和截断选项
with the right options for padding and truncation.
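
A minimal sketch of the fixed-length tokenization these lines describe, for reference; the checkpoint name, the "text" column and the context length of 128 are illustrative assumptions, not something this PR specifies:

```python
from transformers import AutoTokenizer

# Illustrative checkpoint and context length; neither is part of this PR.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_function(examples):
    # Pad or truncate every text to the same fixed length so they can be batched.
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=128,
    )
```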

21
00:00:53,130 --> 00:00:56,100
但是,这会使我们丢失很多文本
如果与我们选择的上下文长度相比,
This will however make us lose a lot of texts

22
00:00:56,100 --> 00:00:58,620
如果我们数据集中的示例很长
我们数据集的示例很长
if the examples in our dataset are very long,

23
00:00:58,620 --> 00:01:00,960
与我们选择的上下文长度相比
就会使我们丢失很多文本
compared to the context length we picked.

24
00:01:00,960 --> 00:01:03,393
在这里,所有灰色部分都丢失了
在这里,所有标记为灰色的部分都丢失了
Here, all the portion in gray is lost.

25
@@ -125,17 +125,17 @@ This is why a second way to generate samples of text

26
00:01:06,660 --> 00:01:08,820
具有相同的长度是分块我们的文本
具有相同的长度是为了在上下文长度中
with the same length is to chunk our text

27
00:01:08,820 --> 00:01:10,560
在上下文长度中,
为我们的文本分块
in pieces of context lengths,

28
00:01:10,560 --> 00:01:14,010
而不是在第一个块之后丢弃所有内容
而不是在第一个数据块之后丢弃所有内容
instead of discarding everything after the first chunk.

29
@@ -150,7 +150,7 @@

31
00:01:17,700 --> 00:01:20,493
我们可以选择保留和填充或忽略
我们可以选择保留并填充或者忽略
which we can choose to keep and pad or ignore.

32
@@ -160,32 +160,32 @@ Here is how we can apply this in practice,

33
00:01:23,790 --> 00:01:26,460
只需添加 return overflowing tokens 选项
只需在我们调用分词器时添加 return overflowing tokens
by just adding the return overflowing tokens option

34
00:01:26,460 --> 00:01:28,200
在我们的分词器调用中。
选项
in our tokenizer call.

35
00:01:28,200 --> 00:01:30,243
请注意这如何为我们提供更大的数据集
请注意这样会为我们提供更大的数据集
Note how this gives us a bigger dataset!

36
00:01:31,560 --> 00:01:34,260
这第二种分块方式是理想的,如果你所有的文本
如果你所有的文本很长,
This second way of chunking is ideal if all your texts

37
00:01:34,260 --> 00:01:36,270
很长,但行不通
这里第二种分块方式是理想的,
are very long, but it won't work

38
00:01:36,270 --> 00:01:39,900
如果你的课文有不同的长度,那也不错
但如果你的文本有不同的长度,那么效果就不尽人意
as nicely if you have a variety of lengths in the texts.

39
@@ -195,22 +195,22 @@ In this case,

40
00:01:41,040 --> 00:01:44,280
最好的选择是连接所有标记化的文本
最好的选择是将所有标记化的文本组合成为一个大的数据流
the best option is to concatenate all your tokenized texts

41
00:01:44,280 --> 00:01:46,560
在一个大流中,有一个特殊的标记
附加一个特殊的标记
in one big stream, with a special tokens

42
00:01:46,560 --> 00:01:49,800
指示你何时从一份文件转到另一份文件
表明你何时从一份文件转到另一份文件
to indicate when you pass from one document to the other,

43
00:01:49,800 --> 00:01:52,503
然后才将大流分成块
然后才将该数据流分成数据块
and only then split the big stream into chunks.

44
@@ -230,32 +230,32 @@

47
00:02:00,780 --> 00:02:02,850
注意它是如何减少样本数量的
注意在我们这里的数据集中,
Notice how it reduces the number of samples

48
00:02:02,850 --> 00:02:04,230
在我们这里的数据集中,
它是如何减少样本数量的
in our dataset here,

49
00:02:04,230 --> 00:02:06,580
一定有不少短条目
一定有大量短条目
there must have been quite a few short entries!

50
00:02:07,710 --> 00:02:11,130
完成此操作后,掩码就很容易了
完成此操作后,掩码处理就很容易了
Once this is done, the masking is the easy part.

51
00:02:11,130 --> 00:02:13,400
有专门为此设计的数据整理器
在 Transformers 库中有专门为此设计的
There is a data collator designed specifically for this

52
00:02:13,400 --> 00:02:15,540
在变形金刚图书馆
数据整理器
in the Transformers library.

53
@@ -265,7 +265,7 @@ You can use it directly in the Trainer,

54
00:02:17,700 --> 00:02:20,400
或者将你的数据集转换为张量流数据集时
或者将你的数据集转换为 tensorflow 数据集时
or when converting your datasets to tensorflow datasets
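
A sketch of the masking step with DataCollatorForLanguageModeling from the Transformers library, which the subtitles refer to; the Trainer and to_tf_dataset usages are left as comments because the model, training arguments and dataset names are assumptions here:

```python
from transformers import DataCollatorForLanguageModeling

# The collator performs the random masking on the fly; mlm_probability defaults to 0.15.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# With the Trainer (model and training_args assumed to be defined elsewhere):
# trainer = Trainer(
#     model=model,
#     args=training_args,
#     train_dataset=lm_dataset,
#     data_collator=data_collator,
# )

# Or as the collate_fn when converting the dataset for TensorFlow
# (you would typically build the collator with return_tensors="np" in that case):
# tf_dataset = lm_dataset.to_tf_dataset(
#     columns=["input_ids", "attention_mask"],
#     collate_fn=data_collator,
#     shuffle=True,
#     batch_size=32,
# )
```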

55