Hive vs Presto vs Trino: Choosing a SQL Engine for Your Data Lake
Three SQL engines dominate the Hadoop data lake landscape: Apache Hive, Presto, and Trino (the community fork of Presto). Each evolved to solve different problems. Picking the wrong one leads to either unbearably slow interactive queries or over-engineered infrastructure for simple batch ETL. Here's how they compare.
Origin Stories
Apache Hive was created at Facebook in 2008 as a way to run SQL-like queries over HDFS data without writing raw MapReduce jobs. It translates HiveQL into MapReduce (or later Tez/Spark) execution plans.
Presto was also created at Facebook in 2012 — specifically because Hive was too slow for interactive analysis. It uses an in-memory, pipelined execution model that avoids disk by design; opt-in spill-to-disk was added later as a safety valve for memory-heavy queries.
Trino (formerly PrestoSQL) forked from Presto in 2019 after the original Presto core team left Facebook, and adopted the Trino name in late 2020. It is now the community-driven, fully open-source successor. The two remain broadly compatible but have diverged in features and release cadence.
Architecture Comparison
Hive
HiveQL Query
│
▼
Hive Metastore (schema + table definitions)
│
▼
Query Planner (Hive → Tez/Spark DAG)
│
▼
YARN execution (reads HDFS blocks from DataNodes)
│
▼
Results (written to HDFS or streamed to client)
Hive is a batch SQL engine. Queries produce YARN jobs that read from HDFS, shuffle through reducers, and write results — the same data path as MapReduce. It's reliable and handles arbitrarily large datasets, but startup overhead and disk-based shuffle mean latency is measured in minutes, not seconds.
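The batch path above depends on which execution engine Hive is configured to use. A minimal sketch of a session that pins the engine before submitting a scan-heavy job (table and column names here are illustrative, not from a real schema):

```sql
-- Choose the execution engine for this session (tez shown; mr and spark
-- are the other options, depending on how the cluster is built).
SET hive.execution.engine=tez;

-- A full-scan aggregation: disk I/O dominates, so Hive's startup
-- overhead is a small fraction of total runtime.
SELECT dt, COUNT(*) AS events
FROM warehouse.clickstream
GROUP BY dt;
```

On Tez the DAG replaces the rigid map/shuffle/reduce stages, which cuts latency substantially versus classic MapReduce, though the execution remains disk-backed and batch-oriented.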
Presto / Trino
SQL Query
│
▼
Coordinator Node (plan + split generation)
│
▼
Worker Nodes (in-memory pipelined execution)
│ │ │
▼ ▼ ▼
Data Sources: HDFS, S3, HBase, MySQL, Kafka, ...
│
▼
Results streamed back to client
Presto/Trino uses a coordinator/worker model with fully pipelined, in-memory execution. No intermediate disk writes between stages (unless memory pressure forces spill). Latency is measured in seconds for most interactive queries.
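One way to see this coordinator-side planning in action is Trino's distributed `EXPLAIN`, which prints the plan fragments (stages) the coordinator will ship to workers. Catalog and table names below are illustrative:

```sql
-- Show the distributed plan: leaf fragments scan splits from the
-- source; intermediate fragments exchange rows over the network,
-- fully pipelined in memory, with no intermediate disk writes.
EXPLAIN (TYPE DISTRIBUTED)
SELECT region, COUNT(*)
FROM hive.warehouse.orders
GROUP BY region;
```

Reading the fragment boundaries in the output is the quickest way to understand where data is repartitioned between workers.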
Feature Comparison
| Feature | Hive | Presto | Trino |
|---|---|---|---|
| Query model | Batch (MapReduce/Tez/Spark) | Interactive, in-memory | Interactive, in-memory |
| Latency | Minutes (startup + shuffle overhead) | Seconds | Seconds |
| Data volume | Unlimited (disk-backed) | Limited by cluster RAM (spill available) | Limited by cluster RAM (spill available) |
| ACID transactions | Yes (Hive 3+ with ORC) | Read + INSERT (Hive connector) | Read + INSERT (Hive connector); UPDATE/DELETE/MERGE via Iceberg/Delta connectors |
| Data formats | ORC, Parquet, Avro, Text, JSON | ORC, Parquet, Avro, Text, JSON | ORC, Parquet, Avro, Text, JSON |
| Metastore | Built-in (HMS) | Uses Hive Metastore | Uses Hive Metastore |
| Federated queries | No | Yes (connectors) | Yes (connectors) |
| Streaming | Limited | No (Kafka topics readable as tables) | No (Kafka topics readable as tables) |
| UDFs | Java, Python (via Hive) | Java (via plugin) | Java (via plugin) |
| YARN integration | Native | Optional (via YARN service) | Optional |
| License | Apache 2.0 | Apache 2.0 | Apache 2.0 |
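The "spill available" caveat in the data-volume row is opt-in, not the default. A minimal sketch of the relevant Trino worker settings — the path is illustrative, and these property names have shifted across releases, so check the documentation for your version:

```properties
# etc/config.properties (worker) — allow memory-heavy queries to
# spill intermediate state to local disk instead of failing
spill-enabled=true
spiller-spill-path=/mnt/fast-disk/trino-spill
```

Spill trades Presto/Trino's latency advantage for resilience on oversized queries; truly huge jobs are still usually better served by Hive or Spark.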
Performance: When Does Each Win?
Hive is faster when...
- Query touches the entire dataset — full table scans with heavy aggregations where disk I/O is the bottleneck, not startup time
- Data doesn't fit in cluster RAM — Hive's disk-backed shuffle handles any data size gracefully
- ACID updates/deletes are required — Hive 3+ with ORC and compaction supports MERGE, UPDATE, DELETE
- Long-running ETL pipelines — startup overhead amortizes over hour-long jobs
-- Hive: Efficient for full-table ETL
INSERT OVERWRITE TABLE sales_summary
SELECT region, product_category, SUM(revenue) as total_revenue
FROM sales_raw
WHERE dt BETWEEN '2024-01-01' AND '2024-12-31'
GROUP BY region, product_category;
Presto/Trino is faster when...
- Query touches a small partition — partition pruning + in-memory execution means sub-second results
- Interactive BI dashboards — analysts run dozens of queries per session; Hive's per-query startup overhead (5–30s) kills interactivity
- Federated queries across systems — join HDFS data with MySQL, Kafka, or another connected catalog in one query
- Ad hoc exploration — analysts iterating on queries don't want to wait 2 minutes per attempt
-- Trino: Sub-second interactive query with partition pruning
SELECT customer_id, COUNT(*) as order_count, SUM(amount) as total
FROM hive.sales.orders
WHERE order_date >= CURRENT_DATE - INTERVAL '7' DAY
AND region = 'US'
GROUP BY customer_id
ORDER BY total DESC
LIMIT 100;
Federated Queries: Trino's Superpower
Trino's connector architecture lets you query multiple data sources in a single SQL statement:
-- Join HDFS data with live MySQL data
SELECT
h.user_id,
h.total_purchases,
m.email,
m.subscription_tier
FROM hive.warehouse.user_purchases h
JOIN mysql.crm.users m ON h.user_id = m.id
WHERE h.purchase_date > DATE '2024-01-01'
AND m.subscription_tier = 'premium';
Available Trino connectors include: Hive/HDFS, Delta Lake, Iceberg, MySQL, PostgreSQL, Kafka, Cassandra, Elasticsearch, MongoDB, and many more. Hive has no equivalent capability.
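The federated join above assumes a catalog named `mysql` has been registered with Trino's MySQL connector. A minimal sketch of that catalog file — the hostname and credentials are placeholders:

```properties
# etc/catalog/mysql.properties — exposes a MySQL server as the
# "mysql" catalog; schemas map to databases, e.g. mysql.crm.users
connector.name=mysql
connection-url=jdbc:mysql://crm-db:3306
connection-user=trino
connection-password=secret
```

Each file under etc/catalog/ becomes one catalog; adding a source to the cluster is just adding a file and restarting.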
Configuration Quick Start
Hive Metastore (shared by all three engines)
<!-- hive-site.xml -->
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://metastore-db:3306/metastore</value>
</property>
<property>
<name>hive.metastore.uris</name>
<value>thrift://metastore-host:9083</value>
</property>
Trino Configuration
# etc/config.properties (coordinator)
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
discovery.uri=http://coordinator:8080
# etc/catalog/hive.properties
connector.name=hive
hive.metastore.uri=thrift://metastore-host:9083
hive.config.resources=/etc/hadoop/core-site.xml,/etc/hadoop/hdfs-site.xml
# Query via Trino CLI
trino --server coordinator:8080 --catalog hive --schema warehouse
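The coordinator config above is only half of the cluster; workers carry a near-identical file with the coordinator flag flipped. A minimal sketch (hostnames illustrative):

```properties
# etc/config.properties (worker)
coordinator=false
http-server.http.port=8080
discovery.uri=http://coordinator:8080
```

Workers register themselves with the coordinator via the discovery URI, so scaling out is just starting more nodes with this file.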
Table Format: The Modern Choice
All three engines support modern open table formats that add ACID capabilities to object stores:
| Format | Hive | Presto | Trino |
|---|---|---|---|
| Apache Iceberg | Yes (4.x native; earlier via Iceberg's Hive runtime) | Yes | Yes (best support) |
| Delta Lake | Limited | Via delta connector | Yes |
| Apache Hudi | Yes | Limited | Yes |
Apache Iceberg is the recommended format for new data lakes — it provides ACID transactions, schema evolution, time travel, and hidden partitioning, and it's supported well across all three engines.
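As a sketch of what those capabilities look like from Trino's Iceberg connector — the catalog, schema, and table names are hypothetical:

```sql
-- Create an Iceberg table with hidden partitioning: queries filter on
-- order_date directly; Iceberg maps values to partitions internally.
CREATE TABLE iceberg.warehouse.orders (
    order_id    BIGINT,
    customer_id BIGINT,
    amount      DECIMAL(12, 2),
    order_date  DATE
)
WITH (
    format = 'PARQUET',
    partitioning = ARRAY['month(order_date)']
);

-- Time travel: query the table as of an earlier snapshot
SELECT COUNT(*)
FROM iceberg.warehouse.orders
FOR TIMESTAMP AS OF TIMESTAMP '2024-06-01 00:00:00 UTC';
```

Because the table state lives in Iceberg metadata files rather than the metastore's partition list, schema and partition-spec changes are metadata-only operations.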
When to Use Each
| Use Case | Best Choice |
|---|---|
| Overnight ETL loading petabytes into the warehouse | Hive (or Spark SQL) |
| Interactive BI dashboards with SLA < 5s | Trino |
| Ad hoc analyst queries on partitioned data | Trino or Presto |
| ACID UPDATE/DELETE/MERGE on HDFS tables | Hive 3+ |
| Federated queries across HDFS + RDBMS + Kafka | Trino |
| Legacy HiveQL compatibility (UDFs, SerDes) | Hive |
| Mixed batch + interactive on same cluster | Hive (batch) + Trino (interactive) |
The Practical Answer
In 2025, most data-mature organizations run both Hive and Trino:
- Hive (on Tez or Spark) for overnight batch ETL, data loading, and ACID operations
- Trino for all analyst-facing interactive queries and BI tool connectivity
They share the same Hive Metastore and read the same HDFS/S3 files in ORC or Parquet format. This gives you the best of both worlds: reliable batch processing and fast interactive queries, without migrating data or rewriting pipelines.
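Concretely, the split looks like one table written by the batch engine and read by the interactive one — a sketch with hypothetical table names:

```sql
-- Batch side (HiveQL): overnight ETL writes a daily partition
INSERT OVERWRITE TABLE warehouse.daily_sales PARTITION (dt = '2024-06-01')
SELECT product_id, SUM(amount)
FROM warehouse.orders_raw
WHERE dt = '2024-06-01'
GROUP BY product_id;

-- Interactive side (Trino): same table, same files on HDFS/S3,
-- resolved through the shared Hive Metastore
SELECT *
FROM hive.warehouse.daily_sales
WHERE dt = '2024-06-01';
```

No data is copied between the two systems; the metastore is the single source of truth for schemas and partition locations.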
Presto vs Trino: if you're starting fresh, choose Trino — it has a more active open-source community, more frequent releases, and broader connector support.
