TSV parsing issue returns an empty dataframe rather than indication of parsing issue

**Describe the unexpected behaviour**

Hello CHDB creators and contributors! I am very new to CHDB and ClickHouse. I am thoroughly enjoying it. The performance is impressive. **Thank you very much for the excellent project!**

I am testing out CHDB with various file formats. So far, the Parquet file formats work perfectly and flawlessly. However, I have run into an issue with a TSV file. When reading a TSV file, I was receiving an empty dataframe. Unsure of why I was receiving an empty dataframe, I tried the same query on ClickHouse Local and received an error. There appears to be an issue parsing the TSV. Sure enough using CHDB, if I do a LIMIT up to the line with the parsing issue, I get a populated dataframe with rows up to the LIMIT. ClickHouse Local shows the last row returned and also provides a "Receive exception" message. Rather than producing an empty dataframe, should the last row and a message similar to ClickHouse Local be returned in CHDB indicating an issue? Basically, without ClickHouse Local, I wouldn't have known the issue. Please forgive me, if I am doing something wrong. 

There definitely is a problem with parsing the file. I tried another embedded database and it also errored out due to parsing (but on another line). However, with that other embedded database, if I increase the sample size or scan the entire file, it can be read successfully. But the issue being reported is not the parsing issue, it is the ability to receive an error similar to ClickHouse Local.

I can open an issue on the parsing problem with ClickHouse, but wanted to get this error message addressed first, since it might be a good example.

**How to reproduce**

The dataset is from the [IMDb Non-Commercial Datasets](https://developer.imdb.com/non-commercial-datasets/). Please also ensure the attached file is used since the dataset is refreshed daily, which could present problems in reproducing the issue.
[https://drive.google.com/file/d/1zQN-1qa49oh6opa3zAlVF5e_oE8cRu-e/view?usp=sharing](https://drive.google.com/file/d/1zQN-1qa49oh6opa3zAlVF5e_oE8cRu-e/view?usp=sharing)

Using Google Colab High-RAM runtime:
- CHDB:  0.15.0
- ClickHouse: 23.6.1.1

1. Upload TSV file to Google Colab - `name.basics.tsv.gz`.
2. A query with `LIMIT 10` returns a dataframe of 10 rows as expected.
```python
chdb.query("""select * from file('/content/name.basics.tsv.gz', TabSeparatedWithNames) LIMIT 10""", 'Dataframe')
```
3. A query with no `LIMIT` returns an empty dataframe.
```python
chdb.query("""select * from file('/content/name.basics.tsv.gz', TabSeparatedWithNames)""", 'Dataframe')
```
4. Confirmed the result is an empty dataframe.
```python
res = chdb.query("""select * from file('/content/name.basics.tsv.gz', TabSeparatedWithNames)""", 'Dataframe')

type(res)
pandas.core.frame.DataFrame

res.shape
(0, 0)
```
5. A query with the LIMIT set to the row before the parsing issue returns a dataframe with rows prior to the LIMIT.
```python
chdb.query("""select * from file('/content/name.basics.tsv.gz', TabSeparatedWithNames) LIMIT 65408""", 'Dataframe')
```

**Expected behavior**

Expected behavior would be an error message similar to the error message produced in ClickHouse Local.
ClickHouse Local provides the following message:
- ClickHouse  23.11.1.2205
```shell
SELECT * FROM file('name.basics.tsv.gz') FORMAT TabSeparatedWithNames;
```

_...Previously rows omitted for brevity._
```shell
nm0068534       Suzanne Bell    \N      \N      make_up_department      tt0099134,tt0092527,tt0099092,tt0099817
nm0068535       Tamara Bell     \N      \N      producer,executive      tt11388822,tt8244578
↙ Progress: 65.41 thousand rows, 1.68 MB (37.40 thousand rows/s., 962.72 KB/s.) ▎                                                  0%
65408 rows in set. Elapsed: 1.750 sec. Processed 65.41 thousand rows, 1.68 MB (37.38 thousand rows/s., 962.28 KB/s.)
Peak memory usage: 39.30 MiB.

Received exception:
Code: 27. DB::ParsingException: Cannot parse input: expected '\t' before: '001\t\\N\tassistant_director,producer,director\ttt2909116,tt1701210,tt0338512,tt0870921\nnm0144483\tTomas Castanos\t\\N\t\\N\tmiscellaneous\ttt0138467\nnm0144484\tDan Castano':
Row 6568:
Column 0,   name: nconst,            type: Nullable(String), parsed text: "nm0144481"
Column 1,   name: primaryName,       type: Nullable(String), parsed text: "Mauricio Castano"
Column 2,   name: birthYear,         type: Nullable(Int64),  parsed text: "<BACKSLASH>N"
Column 3,   name: deathYear,         type: Nullable(Int64),  parsed text: "<BACKSLASH>N"
Column 4,   name: primaryProfession, type: Nullable(String), parsed text: "actor"
Column 5,   name: knownForTitles,    type: Nullable(String), parsed text: "tt0109747,tt3190448"

Row 6569:
Column 0,   name: nconst,            type: Nullable(String), parsed text: "nm0144482"
Column 1,   name: primaryName,       type: Nullable(String), parsed text: "Xavier Castano"
Column 2,   name: birthYear,         type: Nullable(Int64),  parsed text: "0"
ERROR: garbage after Nullable(Int64): "001<TAB><BACKSLASH>N<TAB>ass"

: While executing ParallelParsingBlockInputFormat: While executing File: (in file/uri /home/earlev4/name.basics.tsv.gz): (at row 137387)
. (CANNOT_PARSE_INPUT_ASSERTION_FAILED)
``` 
I am unsure why row 65409 was not reported in the exception, but that can be addressed later with a ClickHouse GitHub issue regarding the parsing problem.

Also looked into `input_format_allow_errors_num` and `input_format_allow_errors_ratio`. The adjustment of `input_format_allow_errors_num` to `1` or `input_format_allow_errors_ratio` to `0.1` allow the query to bypass the errors and omits rows in ClickHouse Local. However, I was unsuccessful in changing these format settings in CHDB. This could be another issue; I am just not knowledgeable enough to know.
 
**Additional context**
To further clarify, the issue being reported is not with parsing the TSV, but the expectation of receiving an error message to indicate an issue rather than an empty dataframe or return a dataframe with only the rows that could be read.

Looking forward to your insight. Thank you in advance for your assistance.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

TSV parsing issue returns an empty dataframe rather than indication of parsing issue #138

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

TSV parsing issue returns an empty dataframe rather than indication of parsing issue #138

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions