Hive vs Presto vs Trino: Choosing a SQL Engine for Your Data Lake
Three SQL engines dominate the Hadoop data lake landscape: Apache Hive, Presto, and Trino (the community fork of Presto). Each evolved to solve different problems. Picking the wrong one leads to either unbearably slow interactive queries or over-engineered infrastructure for simple batch ETL. Here's how they compare.
Origin Stories
Apache Hive was created at Facebook in 2008 as a way to run SQL-like queries over HDFS data without writing raw MapReduce jobs. It translates HiveQL into MapReduce (or later Tez/Spark) execution plans.
Presto was also created at Facebook in 2012 — specifically because Hive was too slow for interactive analysis. It uses an in-memory, pipelined execution model that avoids disk by design; opt-in spill-to-disk was added later as a safety valve for memory-heavy queries.
Trino (formerly PrestoSQL) forked from Presto in 2019 after the original Presto core team left Facebook, and adopted the Trino name in late 2020. It is now the community-driven, fully open-source successor. The two remain broadly compatible but have diverged in features and release cadence.
Architecture Comparison
Hive
HiveQL Query
│
▼
Hive Metastore (schema + table definitions)
│
▼
Query Planner (Hive → Tez/Spark DAG)
│
▼
YARN execution (reads HDFS blocks from DataNodes)
│
▼
Results (written to HDFS or streamed to client)
Hive is a batch SQL engine. Queries produce YARN jobs that read from HDFS, shuffle through reducers, and write results — the same data path as MapReduce. It's reliable and handles arbitrarily large datasets, but startup overhead and disk-based shuffle mean latency is measured in minutes, not seconds.
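The batch path above depends on which execution engine Hive is configured to use. A minimal sketch of a session that pins the engine before submitting a scan-heavy job (table and column names here are illustrative, not from a real schema):

```sql
-- Choose the execution engine for this session (tez shown; mr and spark
-- are the other options, depending on how the cluster is built).
SET hive.execution.engine=tez;

-- A full-scan aggregation: disk I/O dominates, so Hive's startup
-- overhead is a small fraction of total runtime.
SELECT dt, COUNT(*) AS events
FROM warehouse.clickstream
GROUP BY dt;
```

On Tez the DAG replaces the rigid map/shuffle/reduce stages, which cuts latency substantially versus classic MapReduce, though the execution remains disk-backed and batch-oriented.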
Presto / Trino
SQL Query
│
▼
Coordinator Node (plan + split generation)
│
▼
Worker Nodes (in-memory pipelined execution)
│ │ │
▼ ▼ ▼
Data Sources: HDFS, S3, HBase, MySQL, Kafka, ...
│
▼
Results streamed back to client
Presto/Trino uses a coordinator/worker model with fully pipelined, in-memory execution. No intermediate disk writes between stages (unless memory pressure forces spill). Latency is measured in seconds for most interactive queries.
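One way to see this coordinator-side planning in action is Trino's distributed `EXPLAIN`, which prints the plan fragments (stages) the coordinator will ship to workers. Catalog and table names below are illustrative:

```sql
-- Show the distributed plan: leaf fragments scan splits from the
-- source; intermediate fragments exchange rows over the network,
-- fully pipelined in memory, with no intermediate disk writes.
EXPLAIN (TYPE DISTRIBUTED)
SELECT region, COUNT(*)
FROM hive.warehouse.orders
GROUP BY region;
```

Reading the fragment boundaries in the output is the quickest way to understand where data is repartitioned between workers.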
Feature Comparison
| Feature | Hive | Presto | Trino |
|---|---|---|---|
| Query model | Batch (MapReduce/Tez/Spark) | Interactive, in-memory | Interactive, in-memory |
| Latency | Minutes (startup + shuffle overhead) | Seconds | Seconds |
| Data volume | Unlimited (disk-backed) | Limited by cluster RAM (spill available) | Limited by cluster RAM (spill available) |
| ACID transactions | Yes (Hive 3+ with ORC) | Read + INSERT (Hive connector) | Read + INSERT (Hive connector); UPDATE/DELETE/MERGE via Iceberg/Delta connectors |
| Data formats | ORC, Parquet, Avro, Text, JSON | ORC, Parquet, Avro, Text, JSON | ORC, Parquet, Avro, Text, JSON |
| Metastore | Built-in (HMS) | Uses Hive Metastore | Uses Hive Metastore |
| Federated queries | No | Yes (connectors) | Yes (connectors) |
| Streaming | Limited | No (Kafka topics readable as tables) | No (Kafka topics readable as tables) |
| UDFs | Java, Python (via Hive) | Java (via plugin) | Java (via plugin) |
| YARN integration | Native | Optional (via YARN service) | Optional |
| License | Apache 2.0 | Apache 2.0 | Apache 2.0 |
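The "spill available" caveat in the data-volume row is opt-in, not the default. A minimal sketch of the relevant Trino worker settings — the path is illustrative, and these property names have shifted across releases, so check the documentation for your version:

```properties
# etc/config.properties (worker) — allow memory-heavy queries to
# spill intermediate state to local disk instead of failing
spill-enabled=true
spiller-spill-path=/mnt/fast-disk/trino-spill
```

Spill trades Presto/Trino's latency advantage for resilience on oversized queries; truly huge jobs are still usually better served by Hive or Spark.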
Performance: When Does Each Win?
Hive is faster when...
- Query touches the entire dataset — full table scans with heavy aggregations where disk I/O is the bottleneck, not startup time
- Data doesn't fit in cluster RAM — Hive's disk-backed shuffle handles any data size gracefully
- ACID updates/deletes are required — Hive 3+ with ORC and compaction supports MERGE, UPDATE, DELETE
- Long-running ETL pipelines — startup overhead amortizes over hour-long jobs
-- Hive: Efficient for full-table ETL
INSERT OVERWRITE TABLE sales_summary
SELECT region, product_category, SUM(revenue) as total_revenue
FROM sales_raw
WHERE dt BETWEEN '2024-01-01' AND '2024-12-31'
GROUP BY region, product_category;
Presto/Trino is faster when...
- Query touches a small partition — partition pruning + in-memory execution means sub-second results
- Interactive BI dashboards — analysts run dozens of queries per session; Hive's per-query startup overhead (5–30s) kills interactivity
- Federated queries across systems — join HDFS data with MySQL, Kafka, or another connected catalog in one query
- Ad hoc exploration — analysts iterating on queries don't want to wait 2 minutes per attempt
-- Trino: Sub-second interactive query with partition pruning
SELECT customer_id, COUNT(*) as order_count, SUM(amount) as total
FROM hive.sales.orders
WHERE order_date >= CURRENT_DATE - INTERVAL '7' DAY
AND region = 'US'
GROUP BY customer_id
ORDER BY total DESC
LIMIT 100;
Federated Queries: Trino's Superpower
Trino's connector architecture lets you query multiple data sources in a single SQL statement:
-- Join HDFS data with live MySQL data
SELECT
h.user_id,
h.total_purchases,
m.email,
m.subscription_tier
FROM hive.warehouse.user_purchases h
JOIN mysql.crm.users m ON h.user_id = m.id
WHERE h.purchase_date > DATE '2024-01-01'
AND m.subscription_tier = 'premium';
Available Trino connectors include: Hive/HDFS, Delta Lake, Iceberg, MySQL, PostgreSQL, Kafka, Cassandra, Elasticsearch, MongoDB, and many more. Hive has no equivalent capability.
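The federated join above assumes a catalog named `mysql` has been registered with Trino's MySQL connector. A minimal sketch of that catalog file — the hostname and credentials are placeholders:

```properties
# etc/catalog/mysql.properties — exposes a MySQL server as the
# "mysql" catalog; schemas map to databases, e.g. mysql.crm.users
connector.name=mysql
connection-url=jdbc:mysql://crm-db:3306
connection-user=trino
connection-password=secret
```

Each file under etc/catalog/ becomes one catalog; adding a source to the cluster is just adding a file and restarting.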
Configuration Quick Start
Hive Metastore (shared by all three engines)
<!-- hive-site.xml -->
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://metastore-db:3306/metastore</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://metastore-host:9083</value>
</property>
Trino Configuration
# etc/config.properties (coordinator)
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
discovery.uri=http://coordinator:8080
# etc/catalog/hive.properties
connector.name=hive
hive.metastore.uri=thrift://metastore-host:9083
hive.config.resources=/etc/hadoop/core-site.xml,/etc/hadoop/hdfs-site.xml
# Query via Trino CLI
trino --server coordinator:8080 --catalog hive --schema warehouse
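The coordinator config above is only half of the cluster; workers carry a near-identical file with the coordinator flag flipped. A minimal sketch (hostnames illustrative):

```properties
# etc/config.properties (worker)
coordinator=false
http-server.http.port=8080
discovery.uri=http://coordinator:8080
```

Workers register themselves with the coordinator via the discovery URI, so scaling out is just starting more nodes with this file.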
Table Format: The Modern Choice
All three engines support modern open table formats that add ACID capabilities to object stores:
| Format | Hive | Presto | Trino |
|---|---|---|---|
| Apache Iceberg | Yes (4.x native; earlier via Iceberg's Hive runtime) | Yes | Yes (best support) |
| Delta Lake | Limited | Via delta connector | Yes |
| Apache Hudi | Yes | Limited | Yes |
Apache Iceberg is the recommended format for new data lakes — it provides ACID transactions, schema evolution, time travel, and hidden partitioning, and it's supported well across all three engines.
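As a sketch of what those capabilities look like from Trino's Iceberg connector — the catalog, schema, and table names are hypothetical:

```sql
-- Create an Iceberg table with hidden partitioning: queries filter on
-- order_date directly; Iceberg maps values to partitions internally.
CREATE TABLE iceberg.warehouse.orders (
    order_id    BIGINT,
    customer_id BIGINT,
    amount      DECIMAL(12, 2),
    order_date  DATE
)
WITH (
    format = 'PARQUET',
    partitioning = ARRAY['month(order_date)']
);

-- Time travel: query the table as of an earlier snapshot
SELECT COUNT(*)
FROM iceberg.warehouse.orders
FOR TIMESTAMP AS OF TIMESTAMP '2024-06-01 00:00:00 UTC';
```

Because the table state lives in Iceberg metadata files rather than the metastore's partition list, schema and partition-spec changes are metadata-only operations.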
When to Use Each
| Use Case | Best Choice |
|---|---|
| Overnight ETL loading petabytes into the warehouse | Hive (or Spark SQL) |
| Interactive BI dashboards with SLA < 5s | Trino |
| Ad hoc analyst queries on partitioned data | Trino or Presto |
| ACID UPDATE/DELETE/MERGE on HDFS tables | Hive 3+ |
| Federated queries across HDFS + RDBMS + Kafka | Trino |
| Legacy HiveQL compatibility (UDFs, SerDes) | Hive |
| Mixed batch + interactive on same cluster | Hive (batch) + Trino (interactive) |
The Practical Answer
In 2025, most data-mature organizations run both Hive and Trino:
- Hive (on Tez or Spark) for overnight batch ETL, data loading, and ACID operations
- Trino for all analyst-facing interactive queries and BI tool connectivity
They share the same Hive Metastore and read the same HDFS/S3 files in ORC or Parquet format. This gives you the best of both worlds: reliable batch processing and fast interactive queries, without migrating data or rewriting pipelines.
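Concretely, the split looks like one table written by the batch engine and read by the interactive one — a sketch with hypothetical table names:

```sql
-- Batch side (HiveQL): overnight ETL writes a daily partition
INSERT OVERWRITE TABLE warehouse.daily_sales PARTITION (dt = '2024-06-01')
SELECT product_id, SUM(amount)
FROM warehouse.orders_raw
WHERE dt = '2024-06-01'
GROUP BY product_id;

-- Interactive side (Trino): same table, same files on HDFS/S3,
-- resolved through the shared Hive Metastore
SELECT *
FROM hive.warehouse.daily_sales
WHERE dt = '2024-06-01';
```

No data is copied between the two systems; the metastore is the single source of truth for schemas and partition locations.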
Presto vs Trino: if you're starting fresh, choose Trino — it has a more active open-source community, more frequent releases, and broader connector support.
