Route Arrow - DType extension conversion through the registry#7592
Route Arrow - DType extension conversion through the registry#7592
Conversation
Signed-off-by: Baris Palaska <barispalaska@gmail.com>
Signed-off-by: Baris Palaska <barispalaska@gmail.com>
Merging this PR will improve performance by 27.72%
Performance Changes
Comparing |
| /// Vortex-internal extension ids and Arrow canonical extension names. Canonical extensions | ||
| /// serialize metadata as raw UTF-8 (typically JSON) rather than base64-wrapped bytes. | ||
| const CANONICAL_ALIASES: &[(&str, &str)] = | ||
| &[("vortex.fixed_shape_tensor", "arrow.fixed_shape_tensor")]; |
There was a problem hiding this comment.
Whoops I forgot to change vortex.fixed_shape_tensor to vortex.tensor.fixed_shape_tensor... Can you add that to this PR? I could do it myself but then these won't be synced. Maybe the str ID should just be a constant and both call sites use that?
|
ok I have thoughts but I'm going to head out soon and this required more thinking, I'll try and post a review first thing tomorrow @palaska |
| const ARROW_EXT_NAME_VARIANT: &str = "arrow.parquet.variant"; | ||
|
|
||
| /// `(vortex_id, arrow_canonical_name)` pairs — single source of truth for bijection between | ||
| /// Vortex-internal extension ids and Arrow canonical extension names. Canonical extensions | ||
| /// serialize metadata as raw UTF-8 (typically JSON) rather than base64-wrapped bytes. | ||
| const CANONICAL_ALIASES: &[(&str, &str)] = | ||
| &[("vortex.fixed_shape_tensor", "arrow.fixed_shape_tensor")]; |
There was a problem hiding this comment.
Just putting what we talked about over slack here - I think we should avoid hard-coding extensions here, this whole part of the code shouldn't know about them and while we can have some best-practices/defaults or assumed semantics for some of these, it should be done externally and allow for customization by the user.
| /// resolve `ARROW:extension:name` metadata into [`DType::Extension`] values. | ||
| /// | ||
| /// Unregistered or malformed extension metadata falls back to the storage dtype. | ||
| pub trait FromArrowWithSession<T>: Sized { |
There was a problem hiding this comment.
Should this be a function with a default impl on TryFromArrowType?
Signed-off-by: Baris Palaska <barispalaska@gmail.com>
| #[derive(Debug)] | ||
| pub struct DTypeSession { | ||
| registry: ExtDTypeRegistry, | ||
| arrow_canonical: RwLock<ArrowCanonicalAliases>, |
There was a problem hiding this comment.
We cannot use a RwLock, it has very high contention. It rarely written to?
|
|
||
| #[derive(Debug, Default)] | ||
| struct ArrowCanonicalAliases { | ||
| entries: Vec<(ExtId, &'static str)>, |
There was a problem hiding this comment.
What is &'static str shall we give it a type alias?
| //! Test extension types for exercising the [`ExtVTable`] contract. | ||
|
|
||
| mod divisible_int; | ||
| pub(crate) mod divisible_int; |
| @@ -3,6 +3,12 @@ | |||
|
|
|||
| //! Fixed-shape Tensor extension type. | |||
|
|
|||
| /// Vortex extension id for [`FixedShapeTensor`]. | |||
| pub(crate) const ID: &str = "vortex.tensor.fixed_shape_tensor"; | |||
There was a problem hiding this comment.
Why did you expose this?
There was a problem hiding this comment.
Why remove this in this PR
| let dtypes = session.dtypes(); | ||
| dtypes.register(Vector); | ||
| dtypes.register(FixedShapeTensor); | ||
| dtypes.register_arrow_canonical(ExtId::new(fixed_shape::ID), fixed_shape::ARROW_EXT_NAME); |
There was a problem hiding this comment.
don't do this ExtId::new(fixed_shape::ID)
| @@ -19,7 +21,7 @@ impl ExtVTable for Vector { | |||
| type NativeValue<'a> = &'a ScalarValue; | |||
|
|
|||
| fn id(&self) -> ExtId { | |||
| ExtId::new("vortex.tensor.vector") | |||
| ExtId::new(ID) | |||
There was a problem hiding this comment.
| ExtId::new(ID) | |
| static ID: CachedId = CachedId::new(...); | |
| *ID |
arrow.fixed_shape_tensorwith JSON metadata, so pyarrow sees it as a first-class FixedShapeTensorArray.