@lichuang
Contributor

Which issue does this PR close?

Fixes the bug where a [[NULL]] array doesn't roundtrip in arrow-row.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Jan 27, 2026
@Jefffrey
Contributor

Thanks for volunteering to pick up this issue. I'm concerned this doesn't properly address the root cause; for example this case would also fail:

    #[test]
    fn test_nested_null_list() {
        let null_array = Arc::new(NullArray::new(3));
        // [[NULL], [], [NULL, NULL]]
        let list: ArrayRef = Arc::new(ListArray::new(
            Field::new_list_field(DataType::Null, true).into(),
            OffsetBuffer::from_lengths(vec![1, 0, 2]),
            null_array,
            None,
        ));

        let converter = RowConverter::new(vec![SortField::new(list.data_type().clone())]).unwrap();
        let rows = converter.convert_columns(&[Arc::clone(&list)]).unwrap();
        let back = converter.convert_rows(&rows).unwrap();

        assert_eq!(&list, &back[0]);
    }

You can check my comments on the original issue to perhaps try a deeper, root-cause fix.

@lichuang
Contributor Author

@Jefffrey the test case is now fixed, please review it.


// Check if child is Null type - if so, we need special handling
// to count elements correctly even when they decode to 0 bytes
let is_null_child = matches!(&field.data_type, DataType::List(f) if f.data_type() == &DataType::Null)

Contributor

This seems very specific to List / LargeList (aka this solution seems to be treating the symptom rather than the underlying root cause)

Contributor Author

Hi @alamb, thank you for the review. I agree that this solution is treating the symptom rather than the underlying root cause. Let me explain the rationale and what a proper fix would involve.

The root cause is in the encoding phase: for List, both child elements and the list end marker are encoded as EMPTY_SENTINEL (1 byte), making them indistinguishable during decoding.
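
To illustrate the ambiguity, here is a minimal toy model. It is a sketch, not the actual arrow-row code: the function names `encode_null_list` and `decode_null_list` are hypothetical, and the sentinel value `1` is assumed for illustration only. It models a list whose children occupy zero bytes, so each element and the terminator end up as the same single byte.

```rust
// Toy model of the ambiguity; NOT the actual arrow-row implementation.
// Assumption: the sentinel value below is illustrative only.
const EMPTY_SENTINEL: u8 = 1;

/// Encode a list whose children each occupy zero bytes (as with a Null child):
/// one sentinel byte per element, followed by one sentinel as the list end marker.
fn encode_null_list(len: usize) -> Vec<u8> {
    let mut out = vec![EMPTY_SENTINEL; len]; // one byte per child element
    out.push(EMPTY_SENTINEL); // list end marker, identical to an element
    out
}

/// Naive decoder: count elements until the first end marker is seen.
/// Returns (element count, bytes consumed).
fn decode_null_list(row: &[u8]) -> (usize, usize) {
    let mut count = 0;
    for (i, &byte) in row.iter().enumerate() {
        if byte == EMPTY_SENTINEL {
            // The first element is already indistinguishable from the terminator,
            // so decoding stops here regardless of the real length.
            return (count, i + 1);
        }
        count += 1;
    }
    (count, row.len())
}
```

In this model, `decode_null_list(&encode_null_list(n))` returns `(0, 1)` for every `n`, so the lists `[NULL]`, `[]`, and `[NULL, NULL]` from the test above all decode as empty, and any remaining bytes are left unconsumed.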

However, fixing it at the encoding level would require:

  1. Breaking change to the row format: We'd need to encode NullArray elements differently (e.g., using NON_EMPTY_SENTINEL with actual data blocks, or changing the list terminator)
  2. Backward compatibility concerns: The row format is a stable serialization format. Changing the encoding would break compatibility with existing serialized data
  3. Complexity: Any encoding change would need to handle both ascending/descending orders, nested lists, and ensure comparison semantics remain correct

List<Null> is an edge case: only NullArray has this ambiguity, since it produces zero bytes of data. Handling it on the decoding side leaves the row format unchanged, so existing serialized data continues to work.

Development

Successfully merging this pull request may close these issues.

[[NULL]] array doesn't roundtrip in arrow-row

3 participants