Skip to content

TSV parsing issue returns an empty dataframe rather than indication of parsing issue #138

@earlev4

Description

@earlev4

Describe the unexpected behaviour

Hello CHDB creators and contributors! I am very new to CHDB and ClickHouse. I am thoroughly enjoying it. The performance is impressive. Thank you very much for the excellent project!

I am testing out CHDB with various file formats. So far, the Parquet file formats work perfectly and flawlessly. However, I have run into an issue with a TSV file. When reading a TSV file, I was receiving an empty dataframe. Unsure of why I was receiving an empty dataframe, I tried the same query on ClickHouse Local and received an error. There appears to be an issue parsing the TSV. Sure enough using CHDB, if I do a LIMIT up to the line with the parsing issue, I get a populated dataframe with rows up to the LIMIT. ClickHouse Local shows the last row returned and also provides a "Receive exception" message. Rather than producing an empty dataframe, should the last row and a message similar to ClickHouse Local be returned in CHDB indicating an issue? Basically, without ClickHouse Local, I wouldn't have known the issue. Please forgive me, if I am doing something wrong.

There definitely is a problem with parsing the file. I tried another embedded database and it also errored out due to parsing (but on another line). However, with that other embedded database, if I increase the sample size or scan the entire file, it can be read successfully. But the issue being reported is not the parsing issue, it is the ability to receive an error similar to ClickHouse Local.

I can open an issue on the parsing problem with ClickHouse, but wanted to get this error message addressed first, since it might be a good example.

How to reproduce

The dataset is from the IMDb Non-Commercial Datasets. Please also ensure the attached file is used since the dataset is refreshed daily, which could present problems in reproducing the issue.
https://drive.google.com/file/d/1zQN-1qa49oh6opa3zAlVF5e_oE8cRu-e/view?usp=sharing

Using Google Colab High-RAM runtime:

  • CHDB: 0.15.0
  • ClickHouse: 23.6.1.1
  1. Upload TSV file to Google Colab - name.basics.tsv.gz.
  2. A query with LIMIT 10 returns a dataframe of 10 rows as expected.
chdb.query("""select * from file('/content/name.basics.tsv.gz', TabSeparatedWithNames) LIMIT 10""", 'Dataframe')
  1. A query with no LIMIT returns an empty dataframe.
chdb.query("""select * from file('/content/name.basics.tsv.gz', TabSeparatedWithNames)""", 'Dataframe')
  1. Confirmed the result is an empty dataframe.
res = chdb.query("""select * from file('/content/name.basics.tsv.gz', TabSeparatedWithNames)""", 'Dataframe')

type(res)
pandas.core.frame.DataFrame

res.shape
(0, 0)
  1. A query with the LIMIT set to the row before the parsing issue returns a dataframe with rows prior to the LIMIT.
chdb.query("""select * from file('/content/name.basics.tsv.gz', TabSeparatedWithNames) LIMIT 65408""", 'Dataframe')

Expected behavior

Expected behavior would be an error message similar to the error message produced in ClickHouse Local.
ClickHouse Local provides the following message:

  • ClickHouse 23.11.1.2205
SELECT * FROM file('name.basics.tsv.gz') FORMAT TabSeparatedWithNames;

...Previously rows omitted for brevity.

nm0068534       Suzanne Bell    \N      \N      make_up_department      tt0099134,tt0092527,tt0099092,tt0099817
nm0068535       Tamara Bell     \N      \N      producer,executive      tt11388822,tt8244578
↙ Progress: 65.41 thousand rows, 1.68 MB (37.40 thousand rows/s., 962.72 KB/s.) ▎                                                  0%
65408 rows in set. Elapsed: 1.750 sec. Processed 65.41 thousand rows, 1.68 MB (37.38 thousand rows/s., 962.28 KB/s.)
Peak memory usage: 39.30 MiB.

Received exception:
Code: 27. DB::ParsingException: Cannot parse input: expected '\t' before: '001\t\\N\tassistant_director,producer,director\ttt2909116,tt1701210,tt0338512,tt0870921\nnm0144483\tTomas Castanos\t\\N\t\\N\tmiscellaneous\ttt0138467\nnm0144484\tDan Castano':
Row 6568:
Column 0,   name: nconst,            type: Nullable(String), parsed text: "nm0144481"
Column 1,   name: primaryName,       type: Nullable(String), parsed text: "Mauricio Castano"
Column 2,   name: birthYear,         type: Nullable(Int64),  parsed text: "<BACKSLASH>N"
Column 3,   name: deathYear,         type: Nullable(Int64),  parsed text: "<BACKSLASH>N"
Column 4,   name: primaryProfession, type: Nullable(String), parsed text: "actor"
Column 5,   name: knownForTitles,    type: Nullable(String), parsed text: "tt0109747,tt3190448"

Row 6569:
Column 0,   name: nconst,            type: Nullable(String), parsed text: "nm0144482"
Column 1,   name: primaryName,       type: Nullable(String), parsed text: "Xavier Castano"
Column 2,   name: birthYear,         type: Nullable(Int64),  parsed text: "0"
ERROR: garbage after Nullable(Int64): "001<TAB><BACKSLASH>N<TAB>ass"

: While executing ParallelParsingBlockInputFormat: While executing File: (in file/uri /home/earlev4/name.basics.tsv.gz): (at row 137387)
. (CANNOT_PARSE_INPUT_ASSERTION_FAILED)

I am unsure why row 65409 was not reported in the exception, but that can be addressed later with a ClickHouse GitHub issue regarding the parsing problem.

Also looked into input_format_allow_errors_num and input_format_allow_errors_ratio. The adjustment of input_format_allow_errors_num to 1 or input_format_allow_errors_ratio to 0.1 allow the query to bypass the errors and omits rows in ClickHouse Local. However, I was unsuccessful in changing these format settings in CHDB. This could be another issue; I am just not knowledgeable enough to know.
 
Additional context
To further clarify, the issue being reported is not with parsing the TSV, but the expectation of receiving an error message to indicate an issue rather than an empty dataframe or return a dataframe with only the rows that could be read.

Looking forward to your insight. Thank you in advance for your assistance.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions