Skip to content

Route Arrow - DType extension conversion through the registry#7592

Open
palaska wants to merge 4 commits intodevelopfrom
bp/arrow-ext
Open

Route Arrow - DType extension conversion through the registry#7592
palaska wants to merge 4 commits intodevelopfrom
bp/arrow-ext

Conversation

@palaska
Copy link
Copy Markdown
Contributor

@palaska palaska commented Apr 22, 2026

  • Arrow <-> DType extension conversion now consults the DTypeSession extension registry.
  • Any registered extension round-trips through Arrow by default, not just temporal types and Parquet Variant.
  • Extensions can register an Arrow canonical alias; FixedShapeTensor uses this to emit arrow.fixed_shape_tensor with JSON metadata, so pyarrow sees it as a first-class FixedShapeTensorArray.

@palaska palaska changed the title Route Arrow ↔ DType extension conversion through the registry Route Arrow - DType extension conversion through the registry Apr 22, 2026
palaska added 2 commits April 22, 2026 16:53
Signed-off-by: Baris Palaska <barispalaska@gmail.com>
Signed-off-by: Baris Palaska <barispalaska@gmail.com>
@codspeed-hq
Copy link
Copy Markdown

codspeed-hq Bot commented Apr 22, 2026

Merging this PR will improve performance by 27.72%

⚡ 2 improved benchmarks
✅ 1161 untouched benchmarks

Performance Changes

Mode Benchmark BASE HEAD Efficiency
Simulation varbinview_zip_block_mask 3.7 ms 2.9 ms +27.72%
Simulation varbinview_zip_fragmented_mask 7.3 ms 6.2 ms +16.34%

Comparing bp/arrow-ext (c4d1079) with develop (2ad173d)

Open in CodSpeed

@palaska palaska added the changelog/feature A new feature label Apr 22, 2026
@palaska palaska marked this pull request as ready for review April 22, 2026 15:59
@palaska palaska requested a review from connortsui20 April 22, 2026 15:59
@connortsui20 connortsui20 requested a review from AdamGS April 22, 2026 16:09
Comment thread vortex-array/src/dtype/arrow.rs Outdated
/// Vortex-internal extension ids and Arrow canonical extension names. Canonical extensions
/// serialize metadata as raw UTF-8 (typically JSON) rather than base64-wrapped bytes.
const CANONICAL_ALIASES: &[(&str, &str)] =
&[("vortex.fixed_shape_tensor", "arrow.fixed_shape_tensor")];
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops I forgot to change vortex.fixed_shape_tensor to vortex.tensor.fixed_shape_tensor... Can you add that to this PR? I could do it myself but then these won't be synced. Maybe the str ID should just be a constant and both call sites use that?

@AdamGS
Copy link
Copy Markdown
Contributor

AdamGS commented Apr 22, 2026

ok I have thoughts but I'm going to head out soon and this required more thinking, I'll try and post a review first thing tomorrow @palaska

Comment thread vortex-array/src/dtype/arrow.rs Outdated
Comment on lines +56 to +62
const ARROW_EXT_NAME_VARIANT: &str = "arrow.parquet.variant";

/// `(vortex_id, arrow_canonical_name)` pairs — single source of truth for bijection between
/// Vortex-internal extension ids and Arrow canonical extension names. Canonical extensions
/// serialize metadata as raw UTF-8 (typically JSON) rather than base64-wrapped bytes.
const CANONICAL_ALIASES: &[(&str, &str)] =
&[("vortex.fixed_shape_tensor", "arrow.fixed_shape_tensor")];
Copy link
Copy Markdown
Contributor

@AdamGS AdamGS Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just putting what we talked about over slack here - I think we should avoid hard-coding extensions here, this whole part of the code shouldn't know about them and while we can have some best-practices/defaults or assumed semantics for some of these, it should be done externally and allow for customization by the user.

Comment thread vortex-array/src/dtype/arrow.rs Outdated
/// resolve `ARROW:extension:name` metadata into [`DType::Extension`] values.
///
/// Unregistered or malformed extension metadata falls back to the storage dtype.
pub trait FromArrowWithSession<T>: Sized {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a function with a default impl on TryFromArrowType?

Signed-off-by: Baris Palaska <barispalaska@gmail.com>
Signed-off-by: Baris Palaska <barispalaska@gmail.com>
@palaska palaska requested review from AdamGS and connortsui20 April 24, 2026 12:21
#[derive(Debug)]
pub struct DTypeSession {
registry: ExtDTypeRegistry,
arrow_canonical: RwLock<ArrowCanonicalAliases>,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot use a RwLock, it has very high contention. It rarely written to?


#[derive(Debug, Default)]
struct ArrowCanonicalAliases {
entries: Vec<(ExtId, &'static str)>,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is &'static str shall we give it a type alias?

//! Test extension types for exercising the [`ExtVTable`] contract.

mod divisible_int;
pub(crate) mod divisible_int;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

@@ -3,6 +3,12 @@

//! Fixed-shape Tensor extension type.

/// Vortex extension id for [`FixedShapeTensor`].
pub(crate) const ID: &str = "vortex.tensor.fixed_shape_tensor";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you expose this?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why remove this in this PR

Comment thread vortex-tensor/src/lib.rs
let dtypes = session.dtypes();
dtypes.register(Vector);
dtypes.register(FixedShapeTensor);
dtypes.register_arrow_canonical(ExtId::new(fixed_shape::ID), fixed_shape::ARROW_EXT_NAME);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't do this ExtId::new(fixed_shape::ID)

@@ -19,7 +21,7 @@ impl ExtVTable for Vector {
type NativeValue<'a> = &'a ScalarValue;

fn id(&self) -> ExtId {
ExtId::new("vortex.tensor.vector")
ExtId::new(ID)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ExtId::new(ID)
static ID: CachedId = CachedId::new(...);
*ID

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog/feature A new feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants