Why You Should Avoid arrow::to_arrow() with DuckDB + dplyr

A comparison between arrow::to_arrow() and duckplyr for writing Parquet files

Categories: R, duckdb, arrow

Author: Nicolas Chuche
Published: July 11, 2025

A commonly recommended way to write a Parquet file from a dplyr::tbl() query on DuckDB is to pipe the result through arrow::to_arrow() into arrow::write_dataset() or arrow::write_parquet():

tbl(con, "read_parquet('geo.parquet')") |>
  ...
  arrow::to_arrow() |>
  arrow::write_dataset("my_dataset")
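
Filled out, the pattern looks like the sketch below. The filter() step is only a placeholder (this post does not list the columns of geo.parquet, so "departement" is an assumed column standing in for whatever dplyr verbs you apply):

library(duckdb)
library(dplyr)

con <- dbConnect(duckdb())

tbl(con, "read_parquet('geo.parquet')") |>
  filter(departement == "75") |>       # placeholder for your own dplyr transformations
  arrow::to_arrow() |>                 # stream the lazy DuckDB query into Arrow
  arrow::write_dataset("my_dataset")   # write the result as a Parquet dataset

dbDisconnect(con, shutdown = TRUE)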

While this syntax works, the new duckplyr package offers a much more efficient alternative:

con <- dbConnect(duckdb())

tbl(con, "read_parquet('geo.parquet')") |>
  ...
  duckplyr::as_duckdb_tibble() |>
  duckplyr::compute_parquet("my_tbl.parquet")

duckplyr::as_duckdb_tibble() converts the object returned by tbl() into a duckplyr object, and duckplyr::compute_parquet() writes the Parquet file.
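
Put together, a complete runnable sketch (again with a placeholder filter() step on an assumed column) looks like this:

library(duckdb)
library(dplyr)

con <- dbConnect(duckdb())

tbl(con, "read_parquet('geo.parquet')") |>
  filter(departement == "75") |>                 # placeholder for your own dplyr transformations
  duckplyr::as_duckdb_tibble() |>                # hand the lazy query over to duckplyr
  duckplyr::compute_parquet("my_tbl.parquet")    # materialize it as a Parquet file

dbDisconnect(con, shutdown = TRUE)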

These two calls achieve the same result as the Arrow version, but duckplyr is far more efficient (see the benchmark below).
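
If you want to check that both paths write equivalent data, a quick sanity check is to compare row counts. This sketch assumes the outputs of the two snippets above ("my_dataset" and "my_tbl.parquet") are already on disk:

library(duckdb)

con <- dbConnect(duckdb())

# Rows in the dataset directory written by arrow::write_dataset()
n_arrow <- nrow(arrow::open_dataset("my_dataset"))

# Rows in the single file written by duckplyr::compute_parquet()
n_duckplyr <- dbGetQuery(con, "SELECT count(*) AS n FROM read_parquet('my_tbl.parquet')")$n

stopifnot(n_arrow == n_duckplyr)

dbDisconnect(con, shutdown = TRUE)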

A Quick Benchmark

Here are the results from benchmarking three common methods (with full reproducible code below):

  • with_arrow: using arrow::to_arrow() + write_dataset()
  • with_duckplyr: using duckplyr::as_duckdb_tibble() + compute_parquet()
  • with_copy_to: using DuckDB’s native COPY ... TO ... as a baseline
Here is the benchmark code:
library(duckdb)
library(dplyr)
library(arrow)
library(kableExtra)
library(timemoir)

if (!file.exists("geo.parquet")) {
  download.file("https://static.data.gouv.fr/resources/sirene-geolocalise-parquet/20240107-143656/sirene2024-geo.parquet", "geo.parquet")
}

# Full DuckDB method
with_copy_to <- function() {
  con <- dbConnect(duckdb())
  on.exit(dbDisconnect(con, shutdown = TRUE))

  dbExecute(con, "COPY (FROM read_parquet('geo.parquet')) TO 'test.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)")
}

# "Historical" version with Arrow
with_arrow <- function() {
  con <- dbConnect(duckdb())
  on.exit(dbDisconnect(con, shutdown = TRUE))

  tbl(con, "read_parquet('geo.parquet')") |>
    arrow::to_arrow() |>
    arrow::write_dataset('test', compression='zstd')
}

# Version using the new duckplyr package
with_duckplyr <- function() {
  con <- dbConnect(duckdb())
  on.exit(dbDisconnect(con, shutdown = TRUE))

  tbl(con, "read_parquet('geo.parquet')") |>
    duckplyr::as_duckdb_tibble() |>
    duckplyr::compute_parquet("my_tbl.parquet")
}
res <- timemoir(
  with_arrow(), 
  with_copy_to(), 
  with_duckplyr()
)
res |>
  kableExtra::kable()
fname              duration  error  start_mem   max_mem  cpu_user  cpu_sys
with_arrow()        125.123     NA     153200  21206428   157.164   50.598
with_copy_to()       28.969     NA     158720  11870840   157.578   58.525
with_duckplyr()      33.704     NA     164088  11933724   128.338   50.864
plot(res)


On the server I use, the duckplyr version is 9× faster than the Arrow version and uses half the memory, performing on par with pure DuckDB (for this very simple test case).

Conclusion

If you’re working with dplyr on top of DuckDB, stop using arrow::to_arrow() and switch to duckplyr for better performance.
