-
Notifications
You must be signed in to change notification settings - Fork 1.1k
fix: fix [[NULL]] array doesn't roundtrip in arrow-row bug #9275
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Thanks for volunteering to pick up this issue. I'm concerned this doesn't properly address the root cause; for example this case would also fail: #[test]
fn test_nested_null_list() {
let null_array = Arc::new(NullArray::new(3));
// [[NULL], [], [NULL, NULL]]
let list: ArrayRef = Arc::new(ListArray::new(
Field::new_list_field(DataType::Null, true).into(),
OffsetBuffer::from_lengths(vec![1, 0, 2]),
null_array,
None,
));
let converter = RowConverter::new(vec![SortField::new(list.data_type().clone())]).unwrap();
let rows = converter.convert_columns(&[Arc::clone(&list)]).unwrap();
let back = converter.convert_rows(&rows).unwrap();
assert_eq!(&list, &back[0]);
}Can check my comments on the original issue to perhaps try do deeper/root cause fix. |
|
@Jefffrey now the test case has fixed, please review it |
|
|
||
| // Check if child is Null type - if so, we need special handling | ||
| // to count elements correctly even when they decode to 0 bytes | ||
| let is_null_child = matches!(&field.data_type, DataType::List(f) if f.data_type() == &DataType::Null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems very specific to List / LargeList (aka this solution seems to be treating the symptom rather than the underling root cause)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @alamb, thank you for the review. I agree that this solution is treating the symptom rather than the underlying root cause. Let me explain the rationale and what a proper fix would involve.
The root cause is in the encoding phase: for List, both child elements and the list end marker are encoded as EMPTY_SENTINEL (1 byte), making them indistinguishable during decoding.
However, fixing it at the encoding level would require:
- Breaking change to the row format: We'd need to encode NullArray elements differently (e.g., using
NON_EMPTY_SENTINELactual data blocks, or changing the list terminator) - Backward compatibility concerns: The row format is a stable serialization format. Changing encoding would break compatlity with existing serialized data
- Complexity: Any encoding change would need to handle both ascending/descending orders, nested lists, and ensure comparn semantics remain correct
Since List<Null> is an edge case, Only NullArray has this ambiguity since it produces zero bytes of data, so Existing serialized data continues to work.
Which issue does this PR close?
fix: fix
[[NULL]]array doesn't roundtrip in arrow-row bug[[NULL]]array doesn't roundtrip in arrow-row #9227.Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?