78 changes: 39 additions & 39 deletions subtitles/zh-CN/61_data-processing-for-summarization.srt
@@ -15,12 +15,12 @@

4
00:00:05,550 --> 00:00:08,450
- 让我们看看如何预处理数据集以进行汇总
- 让我们看看如何预处理数据集以进行文本摘要
- Let's see how to preprocess a dataset for summarization.

5
00:00:09,750 --> 00:00:13,083
这是总结一份长文档的任务
这是概括一份长文档的任务
This is the task of, well, summarizing a long document.

6
@@ -30,32 +30,32 @@ This video will focus on how to preprocess your dataset

7
00:00:16,830 --> 00:00:19,680
一旦你设法将其放入以下格式
一旦你成功将其按照以下格式处理
once you have managed to put it in the following format:

8
00:00:19,680 --> 00:00:21,510
一栏用于长文件
用一列表示长文件
one column for the long documents,

9
00:00:21,510 --> 00:00:23,610
和一个摘要
和一列表示摘要
and one for the summaries.

10
00:00:23,610 --> 00:00:24,930
这是我们如何实现这一目标
这是我们如何使用 XSUM 数据集上的
Here is how we can achieve this

11
00:00:24,930 --> 00:00:27,573
使用 XSUM 数据集上的数据集库
Datasets 库实现这一效果
with the Datasets library on the XSUM dataset.
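
> Editor's note: for reference, a minimal sketch (not part of the subtitle file) of the two-column format the subtitles describe, assuming the public `xsum` dataset on the Hub:

```python
from datasets import load_dataset

# XSUM pairs each long news article ("document") with a one-sentence summary ("summary").
raw_datasets = load_dataset("xsum")
print(raw_datasets["train"].column_names)  # ['document', 'summary', 'id']
```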

12
00:00:28,650 --> 00:00:30,810
只要你设法让你的数据看起来像这样
只要你能够让你的数据以如下形式呈现
As long as you manage to have your data look like this,

13
@@ -65,17 +65,17 @@ you should be able to follow the same steps.

14
00:00:33,690 --> 00:00:35,880
这一次,我们的标签不是整数
这一次,我们的标签对于某些类不再是整数
For once, our labels are not integers

15
00:00:35,880 --> 00:00:39,150
对应于某些类,但纯文本
而是纯文本
corresponding to some classes, but plain text.

16
00:00:39,150 --> 00:00:42,480
因此,我们需要将它们标记化,就像我们的输入一样
因此,我们需要将它们词元化,就像我们的输入数据一样
We will thus need to tokenize them, like our inputs.

17
@@ -85,22 +85,22 @@ There is a small trap there though,

18
00:00:43,920 --> 00:00:45,360
因为我们需要标记我们的目标
因为我们需要
as we need to tokenize our targets

19
00:00:45,360 --> 00:00:48,690
在 as_target_tokenizer 上下文管理器中
在 as_target_tokenizer 上下文管理器中词元化我们的目标输出
inside the as_target_tokenizer context manager.

20
00:00:48,690 --> 00:00:51,030
这是因为我们添加的特殊标记
这是因为我们添加的特殊词元
This is because the special tokens we add

21
00:00:51,030 --> 00:00:54,000
输入和目标可能略有不同
其输入和目标输出可能略有不同
might be slightly different for the inputs and the target,

22
@@ -110,82 +110,82 @@ so the tokenizer has to know which one it is processing.

23
00:00:57,300 --> 00:00:59,550
处理整个数据集非常容易
通过 map 函数处理整个数据集
Processing the whole dataset is then super easy

24
00:00:59,550 --> 00:01:01,290
与地图功能
非常容易
with the map function.

25
00:01:01,290 --> 00:01:03,450
由于摘要通常要短得多
由于摘要相比文件,
Since the summaries are usually much shorter

26
00:01:03,450 --> 00:01:05,400
比文件,你绝对应该选择
通常要短得多,
than the documents, you should definitely pick

27
00:01:05,400 --> 00:01:08,880
输入和目标的不同最大长度
你应该针对输入和目标输出选择不同的最大长度设定
different maximum lengths for the inputs and targets.
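
> Editor's note: a minimal preprocessing sketch matching what the subtitles describe (not part of the subtitle file; the `t5-small` checkpoint and the two maximum lengths are illustrative assumptions):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

raw_datasets = load_dataset("xsum")
tokenizer = AutoTokenizer.from_pretrained("t5-small")

max_input_length = 1024   # long documents
max_target_length = 128   # summaries are much shorter

def preprocess_function(examples):
    # Tokenize the long documents as the model inputs.
    model_inputs = tokenizer(
        examples["document"], max_length=max_input_length, truncation=True
    )
    # Tokenize the summaries inside the context manager so the tokenizer
    # knows it is processing targets (their special tokens may differ from the inputs').
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            examples["summary"], max_length=max_target_length, truncation=True
        )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# No padding here: padding is applied dynamically later by a data collator.
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
```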

28
00:01:08,880 --> 00:01:11,730
你可以选择在此阶段填充到最大长度
你可以通过设置 padding=max_length 在此阶段
You can choose to pad at this stage to that maximum length

29
00:01:11,730 --> 00:01:14,070
通过设置 padding=max_length
选择填充到最大长度
by setting padding=max_length.

30
00:01:14,070 --> 00:01:16,170
在这里,我们将向你展示如何动态填充
因为它还需要一步,在这里
Here we'll show you how to pad dynamically,

31
00:01:16,170 --> 00:01:17,620
因为它还需要一步
我们将向你展示如何动态填充
as it requires one more step.

32
00:01:18,840 --> 00:01:20,910
你的输入和目标都是句子
你的输入和目标输出
Your inputs and targets are all sentences

33
00:01:20,910 --> 00:01:22,620
各种长度
都是各种长度的句子
of various lengths.

34
00:01:22,620 --> 00:01:24,960
我们将分别填充输入和目标
由于输入和目标输出的最大长度均不相同
We'll pad the inputs and targets separately

35
00:01:24,960 --> 00:01:27,030
作为输入和目标的最大长度
我们将分别填充输入
as the maximum lengths of the inputs and targets

36
00:01:27,030 --> 00:01:28,280
是完全不同的
和目标输出
are completely different.

37
00:01:29,130 --> 00:01:31,170
然后,我们将输入填充到最大长度
然后,我们将输入数据
Then, we pad the inputs to the maximum lengths

38
00:01:31,170 --> 00:01:33,813
在输入之间,对于目标也是如此
填充到最大长度,对于目标输出数据也是如此
among the inputs, and same for the target.

39
@@ -195,42 +195,42 @@ We pad the input with the pad token,

40
00:01:36,630 --> 00:01:39,000
以及索引为 -100 的目标
和索引为 -100 的目标输出
and the targets with the -100 index

41
00:01:39,000 --> 00:01:40,980
确保不考虑它们
确保在损失计算中
to make sure they are not taken into account

42
00:01:40,980 --> 00:01:42,180
在损失计算中
不会包含它们
in the loss computation.

43
00:01:43,440 --> 00:01:45,180
变形金刚库为我们提供
Transformers 库为我们提供
The Transformers library provides us

44
00:01:45,180 --> 00:01:48,510
使用数据整理器自动完成这一切
数据整理器以自动完成这一切
with a data collator to do this all automatically.

45
00:01:48,510 --> 00:01:51,690
然后你可以将它与你的数据集一起传递给培训师
然后你可以将它与你的数据集一起传递给 Trainer
You can then pass it to the Trainer with your datasets,

46
00:01:51,690 --> 00:01:55,710
或者在使用 model.fit 之前在 to_tf_dataset 方法中使用它
或者在你当前的模型上使用 model.fit 之前通过 to_tf_dataset 方法
or use it in the to_tf_dataset method before using model.fit

47
00:01:55,710 --> 00:01:56,823
在你当前的模型上
使用它
on your current model.
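
> Editor's note: a sketch of the dynamic-padding and training setup the subtitles describe, continuing from the preprocessing sketch above (not part of the subtitle file; checkpoint, batch size, and output directory are placeholders):

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Trainer,
    TrainingArguments,
)

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Pads input_ids with the pad token and labels with -100, each to the maximum
# length found in the batch, so padded label positions are ignored by the loss.
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# PyTorch path: pass the collator and the tokenized datasets to the Trainer.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="xsum-summarization"),  # placeholder
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

# TensorFlow path: use the collator as collate_fn in to_tf_dataset, then model.fit.
# (For TF, build the collator with return_tensors="np" and use a TF model.)
# tf_train = tokenized_datasets["train"].to_tf_dataset(
#     columns=["input_ids", "attention_mask", "labels"],
#     shuffle=True,
#     batch_size=8,
#     collate_fn=data_collator,
# )
# tf_model.fit(tf_train)
```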

48