HIVE-29419: Provide a Hive-specific docker image for Tez AM #6435
abstractdog wants to merge 1 commit into apache:master
FYI, there is a docker profile to create a new Hive image: -Pdocker
</property>
<property>
<name>hive.server2.tez.initialize.default.sessions</name>
<value>false</value>
initializing a few default sessions on HS2 startup is a YARN-based thing
in standalone mode, HS2 has no control over the sessions, it just discovers them via ZK
<!--
A registry namespace prefix is a hardcoded prefix for Tez external sessions.
The actual tez.am.registry.namespace value is appended to this prefix.
Once hive can use the registry client that Tez provides (ZkAMRegistryClient), this property will be removed.
is there a ticket for that? needs Tez upgrade?
for the release we already have: https://issues.apache.org/jira/browse/TEZ-4701
now I created: https://issues.apache.org/jira/browse/HIVE-29573
@@ -0,0 +1,5 @@
log4j.rootLogger=INFO, console
maybe, simply [tez-log4j.properties]
</property>
<property>
<name>hive.llap.daemon.umbilical.port</name>
<value>33333</value>
default is 0, do we even need to touch it?
right, it works but needs a minor fix in LlapTaskCommunicator to handle the default properly:
- conf.get(HiveConf.ConfVars.LLAP_TASK_UMBILICAL_SERVER_PORT.varname)
+ HiveConf.getVar(conf, HiveConf.ConfVars.LLAP_TASK_UMBILICAL_SERVER_PORT)
otherwise I get:
tezam | 2026-04-17T12:23:28,727 INFO AbstractService - Service org.apache.hadoop.hive.llap.tezplugins.LlapTaskCommunicator failed in state STARTED
tezam | java.lang.NullPointerException: Cannot invoke "String.split(String)" because the return value of "org.apache.hadoop.conf.Configuration.get(String)" is null
tezam | at org.apache.hadoop.hive.llap.tezplugins.LlapTaskCommunicator.startRpcServer(LlapTaskCommunicator.java:261)
tezam | at org.apache.tez.dag.app.TezTaskCommunicatorImpl.start(TezTaskCommunicatorImpl.java:140)
tezam | at org.apache.hadoop.hive.llap.tezplugins.LlapTaskCommunicator.start(LlapTaskCommunicator.java:237)
tezam | at org.apache.tez.dag.app.ServicePluginLifecycleAbstractService.serviceStart(ServicePluginLifecycleAbstractService.java:41)
tezam | at org.apache.hadoop.service.AbstractService.start(AbstractService.java:195)
tezam | at org.apache.tez.dag.app.TaskCommunicatorManager.serviceStart(TaskCommunicatorManager.java:165)
tezam | at org.apache.hadoop.service.AbstractService.start(AbstractService.java:195)
tezam | at org.apache.tez.dag.app.DAGAppMaster$ServiceWithDependency.start(DAGAppMaster.java:1857)
tezam | at org.apache.tez.dag.app.DAGAppMaster$ServiceThread.run(DAGAppMaster.java:1878)
good catch, fixing it now
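For readers less familiar with the two lookups, here is a minimal sketch of why the raw lookup hands back null when the property is not set, while a default-aware accessor does not. It uses a plain Map and a hypothetical UmbilicalPortLookup class as stand-ins; these are assumptions for illustration, not the real Configuration or HiveConf classes. The property name and the default of 0 come from this thread; the "-" separator passed to split is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for Configuration / HiveConf.ConfVars; not the real Hive classes.
final class UmbilicalPortLookup {
    // Property name and default value as quoted in this review thread.
    static final String VARNAME = "hive.llap.daemon.umbilical.port";
    static final String DEFAULT_VALUE = "0";

    // Raw lookup: returns null when the property is absent from the loaded config.
    static String rawGet(Map<String, String> conf, String name) {
        return conf.get(name);
    }

    // Default-aware lookup: falls back to the declared default when the property is absent.
    static String getVar(Map<String, String> conf) {
        return conf.getOrDefault(VARNAME, DEFAULT_VALUE);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>(); // the property is deliberately not set

        // Calling split(...) on this value would throw a NullPointerException.
        String raw = rawGet(conf, VARNAME);
        System.out.println("raw lookup: " + raw);

        // The default-aware value is always safe to split ("-" is an assumed separator).
        String[] portRange = getVar(conf).split("-");
        System.out.println("default-aware lookup: " + portRange[0]);
    }
}
```

With the property removed from the site file, only the second style of lookup survives startup, which matches the stack trace above.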
tezam:
  profiles:
    - llap
  image: apache/hive:${HIVE_VERSION}
is it OK to reuse the hive image for TezAM? IDK, just asking. is it the same downstream?
I think it's ok to reuse; actually, we build both (a Hive-specific TezAM image + a Tez AM image in the Tez project), see the motivation on HIVE-29419
downstream we have only a hive image for TezAM, but upstream the Tez project needs to have its own (TEZ-4682), which most probably mimics the Tez Container mode (DAGAppMaster + TezChild), not LLAP
# LLAP daemon discovery
HIVE_ZOOKEEPER_QUORUM: zookeeper:2181
LLAP_SERVICE_HOSTS: '@llap0'
is the env var correct? not HIVE_LLAP_DAEMON_SERVICE_HOSTS ? see https://github.com/apache/hive/pull/6435/changes#diff-d389e28124c68afe85354e9f947d6d9ec8f7cfb9d49c46b39891ab5da7a7be7cR68
good catch, it's not correct, the only reason why it works is that I take care of it one level deeper in entrypoint:
https://github.com/apache/hive/pull/6435/changes#diff-b7d5fbeab2c6af92616ab371fa3f237867d60d7c81072045342cb44a8981bf90R45
fixing this
ARG TEZ_SNAPSHOT_REPO_URL=https://repository.apache.org/content/repositories/snapshots

# When snapshot jars are included, client version must match the snapshot version.
ENV TEZ_CLIENT_VERSION=${TEZ_SNAPSHOT_VERSION:-$TEZ_VERSION}
is this used somewhere? can't find any usages here
either this is needed, or another config that disables the client version check in TezAM; otherwise we get:
tezam | 2026-04-17T12:50:58,952 INFO DAGAppMaster - Comparing client version with AM version, clientVersion=Unknown, AMVersion=1.0.0-SNAPSHOT
tezam | 2026-04-17T12:50:58,953 ERROR DAGAppMaster - Incompatible versions found, clientVersion=Unknown, AMVersion=1.0.0-SNAPSHOT
I'm fine with disabling the check:
https://github.com/apache/tez/blob/2ca44b4ba3839ff4c2c2ab2ec95e34d687f61c09/tez-dag/src/main/java/org/apache/tez/dag/app/DAGAppMaster.java#L455-L476
in which case tez site xml has to contain:
<property>
<name>tez.am.disable.client-version-check</name>
<value>true</value>
</property>
it's fine to keep TEZ_CLIENT_VERSION, but the comment should be changed then to:
Client version check is enabled by default in Tez AM, which is picked up from TEZ_CLIENT_VERSION env var.
to reflect that this is not because of the optional snapshot override
wdyt?
if version check makes sense, let's keep the env var
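To make the trade-off in this thread concrete, here is a simplified sketch of the two ways to get past the check: supply a matching TEZ_CLIENT_VERSION, or disable the check. It is not the actual DAGAppMaster code. The ClientVersionCheckSketch class, the exact-match rule, and the "Unknown" fallback (inferred from the log lines above) are assumptions for illustration only.

```java
import java.util.Map;

// Simplified illustration; the real logic lives in DAGAppMaster and may differ.
final class ClientVersionCheckSketch {
    // Client version as seen by the AM: read from the environment,
    // "Unknown" when the variable is missing (assumed from the log output in this thread).
    static String clientVersion(Map<String, String> env) {
        return env.getOrDefault("TEZ_CLIENT_VERSION", "Unknown");
    }

    // Hypothetical rule: versions must match exactly unless the check is disabled
    // (the equivalent of tez.am.disable.client-version-check=true).
    static boolean compatible(String clientVersion, String amVersion, boolean checkDisabled) {
        return checkDisabled || clientVersion.equals(amVersion);
    }

    public static void main(String[] args) {
        String amVersion = "1.0.0-SNAPSHOT";

        String missing = clientVersion(Map.of());
        String matching = clientVersion(Map.of("TEZ_CLIENT_VERSION", amVersion));

        System.out.println("env var missing, check on:  " + compatible(missing, amVersion, false));
        System.out.println("env var matches, check on:  " + compatible(matching, amVersion, false));
        System.out.println("env var missing, check off: " + compatible(missing, amVersion, true));
    }
}
```

Keeping the env var leaves the check active, so a real version mismatch would still surface at AM startup.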
HIVE_VERSION=
HADOOP_VERSION=
TEZ_VERSION=
TEZ_SNAPSHOT_VERSION=
could we keep one required input in [build.sh]: -tez
If it ends with -SNAPSHOT, treat it as snapshot channel automatically - otherwise release (for example 0.10.5)
a release tarball must always be fetched (as the structural foundation, mostly because it contains all the dependencies for Tez), and snapshot jars are an optional overlay, placed on top to override the release-version tez jars while leaving lib/ intact
a snapshot version alone cannot form a self-contained, runnable Tez installation, as we don't have SNAPSHOT tez tarballs downloadable
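The release-plus-overlay rule described in this comment can be sketched as a small resolution step. This is a minimal illustration under stated assumptions: the TezInstallPlanSketch class and its field names are hypothetical, and rejecting a -SNAPSHOT value as the base version is one possible reading of the rule, not what build.sh actually does. The clientVersion field mirrors the ${TEZ_SNAPSHOT_VERSION:-$TEZ_VERSION} expression from the Dockerfile.

```java
// Hypothetical model of the two inputs (TEZ_VERSION, TEZ_SNAPSHOT_VERSION); not taken from build.sh.
final class TezInstallPlanSketch {
    final String tarballVersion; // release tarball: provides lib/ and all dependencies
    final String overlayVersion; // snapshot jars placed on top, or null when there is no overlay
    final String clientVersion;  // mirrors ${TEZ_SNAPSHOT_VERSION:-$TEZ_VERSION}

    private TezInstallPlanSketch(String tarballVersion, String overlayVersion, String clientVersion) {
        this.tarballVersion = tarballVersion;
        this.overlayVersion = overlayVersion;
        this.clientVersion = clientVersion;
    }

    static TezInstallPlanSketch resolve(String tezVersion, String tezSnapshotVersion) {
        // A released version is always required as the structural foundation.
        if (tezVersion == null || tezVersion.isEmpty() || tezVersion.endsWith("-SNAPSHOT")) {
            throw new IllegalArgumentException("a released Tez version is required as the base: " + tezVersion);
        }
        // Like ${VAR:-default}, treat both unset and empty as "no snapshot overlay".
        boolean hasOverlay = tezSnapshotVersion != null && !tezSnapshotVersion.isEmpty();
        String overlay = hasOverlay ? tezSnapshotVersion : null;
        String client = hasOverlay ? tezSnapshotVersion : tezVersion;
        return new TezInstallPlanSketch(tezVersion, overlay, client);
    }

    public static void main(String[] args) {
        TezInstallPlanSketch plan = resolve("0.10.5", "1.0.0-SNAPSHOT");
        System.out.println("tarball: " + plan.tarballVersion);
        System.out.println("overlay: " + plan.overlayVersion);
        System.out.println("client:  " + plan.clientVersion);
    }
}
```

Under this model a single -tez input ending in -SNAPSHOT is not enough, because the base tarball version would still have to come from somewhere.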
Do we need to support SNAPSHOT overlays? Currently, the Docker image only allows released versions.
short term, yes, otherwise the whole TezAM initiative cannot be tried out as here
Tez 1.0.0 could be months (or more) away, and I would like to be able to proceed with cloud-native initiative in the meantime
released/published Docker images must not contain SNAPSHOT jars, I agree, but a temporary Hive image built with Tez 1.0.0-SNAPSHOT can be utilized for distributed/perf test even in the next 1-2 months
would it make sense splitting this PR in 2 parts so we could revert the SNAPSHOT workaround once Tez 1.0.0 is released?
maybe a cleaner approach could be to build a custom Tez Docker image on top of the nightly Tez image, extending it with the required Hive jars.



What changes were proposed in this pull request?
Make hive image able to start a TezAM in LLAP mode that can assign tasks to the LLAP daemons.
Why are the changes needed?
Because it's the next step to have a fully distributed, Dockerized environment for Hive.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Manually, steps are below.
Needs Hive 4.3.0-SNAPSHOT jars that contain recent changes (specifically HIVE-29477):
start cluster:
test:
see logs that queries go through tezam and daemons:
a very important test case is that the container layout implemented here in docker-compose is compatible with the already existing and working hs2+llapdaemon setup (no tezam), which is confirmed as:
in which case tezam will simply fail to start, and the cluster works exactly the same way as post-HIVE-29411
be aware of the difference, in case of:
the tez zookeeper-based registry and external sessions code are simply not present, that's why - regardless of the config - here it can work