Skip to content

Conversation

@scovich
Copy link
Contributor

@scovich scovich commented Jan 26, 2026

Which issue does this PR close?

This is a variant of #9259, but stacked on top of three building block PR:

Rationale for this change

See description of #9259. This version here factors out building blocks so it's easier to see what actually changes to add custom json decoder support.

What changes are included in this PR?

See description of #9259. Same net change, just organized differently.

Are these changes tested?

Yes. Existing and newly added unit tests.

Are there any user-facing changes?

  • Make JSON tape decoder classes public
  • Make JSON ArrayDecoder trait public
  • Make JSON DecoderContext class public
  • New public trait: DecoderFactory

@github-actions github-actions bot added parquet Changes to the parquet crate arrow Changes to the arrow crate parquet-variant parquet-variant* crates labels Jan 26, 2026
@scovich
Copy link
Contributor Author

scovich commented Jan 26, 2026

Oops @alamb this should prob stay in draft status until (a) the dependencies it's stacked on merge; and (b) we decide we actually want this approach in this form?

@scovich scovich force-pushed the json-decoder-factory-v2 branch from 8f683d7 to 92929be Compare January 27, 2026 23:56
@github-actions github-actions bot removed the parquet Changes to the parquet crate label Jan 27, 2026
@scovich
Copy link
Contributor Author

scovich commented Jan 28, 2026

Ok, rebased now that the three prefactor PR merged. The diff looks big, but the bulk of it is tests and doc comments. The actual change is pretty small.

The biggest single contributors are:

  • 800+ LoC for the various motivating examples/tests in custom_decoder_tests.rs
    • Hopefully they're easily understood and help spur a healthy discussion on what the API should actually look like
    • TBD whether they stay as tests for now, or if some are deemed useful enough to become part of the API
  • ~280 LoC to define the variant decoder and factory (+ tests)
    • Could potentially split out to a separate PR

pub mod writer;

pub use self::reader::{Reader, ReaderBuilder};
pub use self::reader::{ArrayDecoder, DecoderFactory, Reader, ReaderBuilder, Tape, TapeElement};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is the key part that might be deemed controversial. Is Tape really a good thing to expose publicly? It's been a while since I wrote it, but I remember it not being especially friendly as an API, and something that stands a good chance of being changed in future - e.g. to avoid copying strings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question, and I don't remember seeing any discussion on the original PR I'm building on here:

Is there any way to allow users to customize parsing without exposing something? Other options might include:

  1. Create a new trait or wrapper that exposes the tape's information in a simplified/safe/stable way, to decouple users from the low-level details.
    • Maybe could work? Worth exploring?
  2. Convert the tape to variant, and shift the factory/decoder stuff over to variant-compute instead of json crate
    • We'd still need something to allow parsing JSON to variant, which I believe is a canonical extension types that should be supported directly.
    • Variant is insanely complex once shredding comes into the picture, so such an interface would not be easier or safer to use IMO.
    • The extra layer of conversion would impose significant overhead for somebody who just wants to parse a few misbehaving columns in a special way.
  3. Something else I'm not thinking of?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree if we want to allow users to override decoding behavior we are going to have to given them direct access to Tape / Tape Element - I don't really see any way around it

something that stands a good chance of being changed in future - e.g. to avoid copying strings.

@tustvold -- what strings are you referring to? I don't see any strings copied here:

https://github.com/apache/arrow-rs/blob/7e5076f1f775a6fd08a4d63389e26e2920fe3f6a/arrow-json/src/reader/tape.rs#L34-L33

pub struct Tape<'a> {
elements: &'a [TapeElement],
strings: &'a str,
string_offsets: &'a [usize],
num_rows: usize,
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate parquet-variant parquet-variant* crates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants