1. Set up Rmd for blogs

In my last post, I mentioned that I used a newer template for my personal website. One thing I didn’t do was bring my old posts over to the new website. One of those posts was actually about me building a package to compile R Markdown files in the Atom editor into plain HTML files for blog posts. As it turns out, this is no longer needed and there is an easier way to do it.

I stole/adapted the YAML header from Ethan Roubenoff’s blog, which was in turn “stole/adapted” (Ethan’s original wording) from Steven Miller’s post. This allows us to write in Rmd and then compile it into the corresponding post folder for display on the website.

Some tweaks were needed in my case, including adding the category and tags fields and setting the output folder. Here is the YAML header for this blog post.

---
title: Transform a Large CSV/TSV File into Arrow
author: Fangzhou Xie
date: 2023-03-20
output:
  md_document:
    variant: gfm
    preserve_yaml: TRUE
knit: (function(inputFile, encoding) {
  rmarkdown::render(inputFile, 
                    encoding = encoding, 
                    output_file = paste0(Sys.Date(), "-", sub(".Rmd", ".md", inputFile)), 
                    output_dir = "../_posts") })
layout: post
category: [programming, website]
tags: [arrow, Rmd, R]
always_allow_html: true
related_posts: false
---
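
For reference, the knit field above is roughly equivalent to calling rmarkdown::render() yourself from the R console. Here is a minimal sketch of what it does; the file name arrow-post.Rmd is only a placeholder for whatever the post source is actually called:

# Roughly what the knit field above does; "arrow-post.Rmd" is a
# placeholder file name, not the actual one.
rmarkdown::render(
  "arrow-post.Rmd",
  output_file = paste0(Sys.Date(), "-", sub(".Rmd", ".md", "arrow-post.Rmd")),
  output_dir = "../_posts"
)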

2. My failed attempt to follow the Arrow tutorials

But what I really wanted to talk about is my recent discovery of the R package arrow. It allows us to work efficiently with very large datasets (even those larger than RAM). A wonderful introduction to why the parquet format is better suited to those situations can be found in R for Data Science.

However, if we follow the steps in Section 23.4.3, we will encounter the following code block:

seattle_csv |>
  group_by(CheckoutYear) |>
  write_dataset(path = pq_path, format = "parquet")

where the seattle_csv object was defined earlier in Section 23.3 as:

seattle_csv <- open_dataset(
  sources = "data/seattle-library-checkouts.csv", 
  format = "csv"
)

If we combine those two blocks (pq_path is simply the output directory for the parquet files, also defined in the book), we have the following:

seattle_csv <- open_dataset(
  sources = "data/seattle-library-checkouts.csv", 
  format = "csv"
)

seattle_csv |>
  group_by(CheckoutYear) |>
  write_dataset(path = pq_path, format = "parquet")

This seems pretty straightforward, right? The first block reads the data from the CSV and the second block writes it out in parquet format.

Feeling very confident, I tried to use those two blocks of code to read an extremely large TSV file I have (>40 GB) and write it into parquet format. However, every time I tried, it ended up occupying all of my RAM (64 GB) and freezing my desktop.

my_big_tsv <- open_dataset(sources = "big.tsv", format = "tsv")

my_big_tsv %>% 
  group_by(year) %>% 
  write_dataset(path = "big-parquet")

I was following the recipe exactly, so why didn’t it work?

3. My solution to deal with large CSV/TSV files

I then did some online surfing to figure out why this was the case. The official Apache Arrow documentation for R clearly states: “There are two different ways to specify variables to use for partitioning - either via the partitioning variable as above, or by using dplyr::group_by() on your data - the group variables will form the partitions.”

This wording seems to suggest that the two ways are equivalent: either we group before writing, or we specify the grouping variables as an argument when writing the dataset. Since I was stuck with the dplyr::group_by() route, why not try the other way?

my_big_tsv <- open_dataset(sources = "big.tsv", format = "tsv")

my_big_tsv %>% 
  write_dataset(path = "big-parquet", partitioning = c("year"))

And this time the machine took only a little CPU and RAM to produce the parquet files.
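
As a quick sanity check (just a sketch based on my example above; the year column and the big-parquet path come from the earlier code block), we can peek at the partition folders and reopen the result as a dataset without pulling it into memory:

library(arrow)
library(dplyr)

# write_dataset() produces Hive-style partition folders,
# something like "year=2019/part-0.parquet" (exact file names may differ)
list.files("big-parquet", recursive = TRUE)

# Reopen the partitioned parquet files as one dataset (still not in memory)
big_parquet <- open_dataset("big-parquet")

# Row counts per partition; collect() is what actually runs the query
big_parquet %>% 
  group_by(year) %>% 
  summarise(n = n()) %>% 
  collect()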

It seems reasonable now. open_dataset() will not read the actual data into memory unless it is requested, and group_by() requests the data to be loaded for grouping before it is written out in parquet format.
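
To make this laziness concrete, here is a small sketch (again using the year column from my file; filtering on 2020 is just for illustration):

library(arrow)
library(dplyr)

# Same as before: point arrow at the TSV without reading it
my_big_tsv <- open_dataset(sources = "big.tsv", format = "tsv")

# This only builds a lazy query; no data has been read yet
query <- my_big_tsv %>% 
  filter(year == 2020)

# Only collect() actually scans the file and pulls the matching rows into R
result <- collect(query)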

Without an explicit dplyr::group_by(), however, the grouping is carried out by arrow directly through the partitioning argument, so there is no need to read the entire dataset into memory for grouping.

Therefore, the two approaches are not equivalent to each other. The group_by() method will suffocate the machine when dealing with larger-than-RAM files, whereas the partitioning method will not. In my case, the file was smaller than my RAM, but the group_by() operation clearly needs memory several times the size of the original file, so the whole operation effectively becomes a larger-than-RAM situation.

The whole point of using Arrow is to avoid loading the whole dataset into memory, so if grouping before writing to parquet ends up suffocating the memory anyway, why should we do it this way? I think this difference should at least be mentioned somewhere by someone, but I haven’t found anyone doing so. Hence I am documenting my struggle in the hope of helping someone in the future.