78 changes: 39 additions & 39 deletions subtitles/zh-CN/61_data-processing-for-summarization.srt
@@ -15,12 +15,12 @@

4
00:00:05,550 --> 00:00:08,450
- 让我们看看如何预处理数据集以进行汇总
- 让我们看看如何预处理数据集以进行文本摘要
- Let's see how to preprocess a dataset for summarization.

5
00:00:09,750 --> 00:00:13,083
这是总结一份长文档的任务
这是概括一份长文档的任务
This is the task of, well, summarizing a long document.

6
@@ -30,32 +30,32 @@ This video will focus on how to preprocess your dataset

7
00:00:16,830 --> 00:00:19,680
一旦你设法将其放入以下格式
一旦你成功将其按照以下格式处理
once you have managed to put it in the following format:

8
00:00:19,680 --> 00:00:21,510
一栏用于长文件
用一列表示长文件
one column for the long documents,

9
00:00:21,510 --> 00:00:23,610
和一个摘要
和一列表示摘要
and one for the summaries.

10
00:00:23,610 --> 00:00:24,930
这是我们如何实现这一目标
这是我们如何使用 XSUM 数据集上的
Here is how we can achieve this

11
00:00:24,930 --> 00:00:27,573
使用 XSUM 数据集上的数据集库
Datasets 库实现这一效果
with the Datasets library on the XSUM dataset.
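
> Editor's note: for reference, a minimal sketch (not part of the subtitle file) of the two-column format the subtitles describe, assuming the public `xsum` dataset on the Hub:

```python
from datasets import load_dataset

# XSUM pairs each long news article ("document") with a one-sentence summary ("summary").
raw_datasets = load_dataset("xsum")
print(raw_datasets["train"].column_names)  # ['document', 'summary', 'id']
```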

12
00:00:28,650 --> 00:00:30,810
只要你设法让你的数据看起来像这样
只要你能够让你的数据以如下形式呈现
As long as you manage to have your data look like this,

13
@@ -65,17 +65,17 @@ you should be able to follow the same steps.

14
00:00:33,690 --> 00:00:35,880
这一次,我们的标签不是整数
这一次,我们的标签对于某些类不再是整数
For once, our labels are not integers

15
00:00:35,880 --> 00:00:39,150
对应于某些类,但纯文本
而是纯文本
corresponding to some classes, but plain text.

16
00:00:39,150 --> 00:00:42,480
因此,我们需要将它们标记化,就像我们的输入一样
因此,我们需要将它们词元化,就像我们的输入数据一样
We will thus need to tokenize them, like our inputs.

17
@@ -85,22 +85,22 @@ There is a small trap there though,

18
00:00:43,920 --> 00:00:45,360
因为我们需要标记我们的目标
因为我们需要
as we need to tokenize our targets

19
00:00:45,360 --> 00:00:48,690
在 as_target_tokenizer 上下文管理器中
在 as_target_tokenizer 上下文管理器中词元化我们的目标输出
inside the as_target_tokenizer context manager.

20
00:00:48,690 --> 00:00:51,030
这是因为我们添加的特殊标记
这是因为我们添加的特殊词元
This is because the special tokens we add

21
00:00:51,030 --> 00:00:54,000
输入和目标可能略有不同
其输入和目标输出可能略有不同
might be slightly different for the inputs and the target,

22
@@ -110,82 +110,82 @@ so the tokenizer has to know which one it is processing.

23
00:00:57,300 --> 00:00:59,550
处理整个数据集非常容易
通过 map 函数处理整个数据集
Processing the whole dataset is then super easy

24
00:00:59,550 --> 00:01:01,290
与地图功能
非常容易
with the map function.

25
00:01:01,290 --> 00:01:03,450
由于摘要通常要短得多
由于摘要相比文件,
Since the summaries are usually much shorter

26
00:01:03,450 --> 00:01:05,400
比文件,你绝对应该选择
通常要短得多,
than the documents, you should definitely pick

27
00:01:05,400 --> 00:01:08,880
输入和目标的不同最大长度
你应该针对输入和目标输出选择不同的最大长度设定
different maximum lengths for the inputs and targets.
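
> Editor's note: a minimal preprocessing sketch matching what the subtitles describe (not part of the subtitle file; the `t5-small` checkpoint and the two maximum lengths are illustrative assumptions):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

raw_datasets = load_dataset("xsum")
tokenizer = AutoTokenizer.from_pretrained("t5-small")

max_input_length = 1024   # long documents
max_target_length = 128   # summaries are much shorter

def preprocess_function(examples):
    # Tokenize the long documents as the model inputs.
    model_inputs = tokenizer(
        examples["document"], max_length=max_input_length, truncation=True
    )
    # Tokenize the summaries inside the context manager so the tokenizer
    # knows it is processing targets (their special tokens may differ from the inputs').
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            examples["summary"], max_length=max_target_length, truncation=True
        )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# No padding here: padding is applied dynamically later by a data collator.
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
```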

28
00:01:08,880 --> 00:01:11,730
你可以选择在此阶段填充到最大长度
你可以通过设置 padding=max_length 在此阶段
You can choose to pad at this stage to that maximum length

29
00:01:11,730 --> 00:01:14,070
通过设置 padding=max_length
选择填充到最大长度
by setting padding=max_length.

30
00:01:14,070 --> 00:01:16,170
在这里,我们将向你展示如何动态填充
因为它还需要一步,在这里
Here we'll show you how to pad dynamically,

31
00:01:16,170 --> 00:01:17,620
因为它还需要一步
我们将向你展示如何动态填充
as it requires one more step.

32
00:01:18,840 --> 00:01:20,910
你的输入和目标都是句子
你的输入和目标输出
Your inputs and targets are all sentences

33
00:01:20,910 --> 00:01:22,620
各种长度
都是各种长度的句子
of various lengths.

34
00:01:22,620 --> 00:01:24,960
我们将分别填充输入和目标
由于输入和目标输出的最大长度均不相同
We'll pad the inputs and targets separately

35
00:01:24,960 --> 00:01:27,030
作为输入和目标的最大长度
我们将分别填充输入
as the maximum lengths of the inputs and targets

36
00:01:27,030 --> 00:01:28,280
是完全不同的
和目标输出
are completely different.

37
00:01:29,130 --> 00:01:31,170
然后,我们将输入填充到最大长度
然后,我们将输入数据
Then, we pad the inputs to the maximum lengths

38
00:01:31,170 --> 00:01:33,813
在输入之间,对于目标也是如此
填充到最大长度,对于目标输出数据也是如此
among the inputs, and same for the target.

39
@@ -195,42 +195,42 @@ We pad the input with the pad token,

40
00:01:36,630 --> 00:01:39,000
以及索引为 -100 的目标
和索引为 -100 的目标输出
and the targets with the -100 index

41
00:01:39,000 --> 00:01:40,980
确保不考虑它们
确保在损失计算中
to make sure they are not taken into account

42
00:01:40,980 --> 00:01:42,180
在损失计算中
不会包含它们
in the loss computation.

43
00:01:43,440 --> 00:01:45,180
变形金刚库为我们提供
Transformers 库为我们提供
The Transformers library provides us

44
00:01:45,180 --> 00:01:48,510
使用数据整理器自动完成这一切
数据整理器以自动完成这一切
with a data collator to do this all automatically.

45
00:01:48,510 --> 00:01:51,690
然后你可以将它与你的数据集一起传递给培训师
然后你可以将它与你的数据集一起传递给 Trainer
You can then pass it to the Trainer with your datasets,

46
00:01:51,690 --> 00:01:55,710
或者在使用 model.fit 之前在 to_tf_dataset 方法中使用它
或者在你当前的模型上使用 model.fit 之前通过 to_tf_dataset 方法
or use it in the to_tf_dataset method before using model.fit

47
00:01:55,710 --> 00:01:56,823
在你当前的模型上
使用它
on your current model.
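
> Editor's note: a sketch of the dynamic-padding and training setup the subtitles describe, continuing from the preprocessing sketch above (not part of the subtitle file; checkpoint, batch size, and output directory are placeholders):

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Trainer,
    TrainingArguments,
)

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Pads input_ids with the pad token and labels with -100, each to the maximum
# length found in the batch, so padded label positions are ignored by the loss.
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# PyTorch path: pass the collator and the tokenized datasets to the Trainer.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="xsum-summarization"),  # placeholder
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

# TensorFlow path: use the collator as collate_fn in to_tf_dataset, then model.fit.
# (For TF, build the collator with return_tensors="np" and use a TF model.)
# tf_train = tokenized_datasets["train"].to_tf_dataset(
#     columns=["input_ids", "attention_mask", "labels"],
#     shuffle=True,
#     batch_size=8,
#     collate_fn=data_collator,
# )
# tf_model.fit(tf_train)
```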

48