A Guide to the Ruby CSV Library, Part I
Several weeks ago I needed to do something in Ruby that involved processing a large number of CSV files. I was happy to learn there was a good, comprehensive CSV library to make my job easier.
One thing struck me, however, as soon as I started looking for tutorials: none of them covered this library in detail. Most articles I have seen barely touched the surface of what the CSV library can do. I felt determined to get a solid understanding of this part of Ruby, so I started reading every possible tutorial and book chapter on CSV. This two-part series is the result of my endeavors.
I make some basic assumptions in this article:
- You know what a CSV (comma-separated file) looks like.
- You have a basic knowledge of Ruby.
- You have a a basic understanding of working with files in Ruby. This will help in the last section.
Avoiding Common Quirks in CSV Files
Imagine you have an uncle (let’s call him Bob) who has a restaurant.
Uncle Bob’s staff keeps a spreadsheet of all customers. Each row contains (in a separate cell):
- The name of the customer.
- Total number of times they arrived and took a meal at the restaurant.
- Total money spent.
- A short phrase they used when asked to describe the restaurant’s food.
For a start, uncle Bob gave you a small CSV file that contained 4 rows with their 4 most frequent customers. Here’s a visual representation of the file:
Let’s take a small quiz. How are we going to represent this data in plain text? Fill in the blanks:
Dan,34,2548,Lovin it!
Maria,55,5054,_____________
Carlos,22,4352,_________________
Stephany,34,6542,I want bigger steaks!!!!!
Answer: The 2 uncomplete rows would look like this:
Maria,55,5054,"Good, delicious food"
Carlos,22,4352,"I am ""pleased"", but could be better"
If the CSV is comma-separated and you have a comma inside a cell, you need to surround that cell’s text with double quotes (like in the 3rd row). Otherwise, you risk confusing the parser.
Now, if things get even more complicated and, besides having a comma, you also have double quotes inside the cell, you just escape each double quote with another double quote (like in the 4th row).
How Ruby Sees CSV Files
With our improved understanding of CSV files, let’s learn how to get them into Ruby. Before we start doing anything, we need to require 'csv'
in our program file.
Continuing with our Uncle Bob’s task, we now have a file (let’s name it customers.csv
) that he gave us. How do we import it into Ruby? In Ruby, you can import your CSV file either at once (storing all of the file content in memory) or read from it row-by-row. Each approach has its own pros and cons. For example, you don’t want to import a file with 300,000 lines all at once on a computer with 512MB RAM and crash your program, as we’ll learn later.
Either way you do it, Ruby will store each table row as an array, with each cell being a string element of the array.
To illustrate, let’s try importing our small, 4-row CSV file at once. We do that with the CSV.read
method, passing the file name as the argument. This method is going to read the entire file and store it in the customers
variable:
require 'csv'
customers = CSV.read('customers.csv')
Our customers
variable is now one big array (representing the table) containing the rows as its elements. Since we know that Ruby represents each table row an an array, our customer variable is basically an array containing other arrays. Don’t be confused if you haven’t encountered an array storing other arrays in Ruby. Try running the code with the above data if this is still unclear to you (it was for me until I tried running and tinkering around with stuff myself).
Here’s an illustration to make things clearer:
Let’s try instead reading our file line-by-line. We can do that with the CSV.foreach
method, passing the file name as the argument and providing it a block-variable which to contains the already-processed row as an array:
CSV.foreach('customers.csv') do |row|
puts row.inspect
end
The output of this code will be:
["Dan", "34", "2548", "Lovin it!"]
["Maria", "55", "5054", "Good, delicious food"]
["Carlos", "22", "4352", "I am \"pleased\", but could be better"]
["Stephany", "34", "6542", "I want bigger steaks!!!!!"]
Everything processed from the CSV file is a string, even the numbers (although there are some tips on changing that default behavior…more on that later :) ).
The CSV Library Can Also Process Strings, Not Just Files
If you have comma-separated data as a String
object in Ruby, you can convert the data to the Ruby representation of CSV (where the table is an array containing other arrays i.e. the rows) with the CSV.parse
method:
a_string = "Dan,34\nMaria,55"
CSV.parse(a_string) #=> [["Dan", "34"], ["Maria", "55"]]
You can also provide a block to CSV.parse
:
CSV.parse(a_string) { |row| puts row.inspect }
#=> produces ["Dan", "34"] and ["Maria", "55"] on separate lines
CSV.parse
without a block is similar to CSV.read
in regards to the output. Input-wise, they take comma-separated data from different types (CSV.parse
from a string object and CSV.read
from a file somewhere in your hard drive.) The output, however, is the same (an array of arrays). Doing this:
CSV.parse(File.read('customers.csv'))
# File.read returns a (big) string of the data in 'customers.csv', comma-separated
# and CSV.parse just re-structures that data into an array data structure.
will output exactly the same as this:
CSV.read('customers.csv')
Using CSV.parse
with a block is similar to CSV.foreach
. Both allow you to work with one row at a time (not like CSV.read
where the interpreter just throws a big array at you and it’s done.)
What to do When Your CSV is SSV (semi-colon-separated values)
Uh, oh. Uncle Bob gave as another file, newcomers.csv. This one contains a list of the newest customers for the day, and because it was morning, the list was not that big:
James;1;43;Not bad
Robin;1;56;Fish is tasty
Anna;1;79;"Good; better; the best!"
Houston, we have a problem. These files are NOT comma separated. This is not a CSV, this is a SSV! (okay, I’m not really sure if there’s a standardized term for semi-colon-separated files, but let’s use SSV for now.) What do we do? Are all the methods we’ve just learned useless for SSV files? Fear not!
There’s one common pattern among all 4 methods we’ve learned so far: They all accepted just 1 argument, which is the file name (or rather, the path to the file name) we want Ruby to process. In reality, all of them also accept a second (optional) argument which is a hash (key-value pair) containing various options that instruct Ruby how to process the file. The most used option is :col_sep => ':'
which, as you’ve guessed it, specifies the separator used in the file. All of the methods above will work if we add this hash argument:
new_customers = CSV.read('newcomers.csv', { :col_sep => ';' })
CSV.foreach('newcomers.csv', { :col_sep => ';' }) { |row| p row }
To make things clearer, we can use the new symbol_key: value
hash syntax, available since Ruby 1.9 along with the fact we don’t need to include the curly braces in a hash if that hash is the last argument in the argument list:
CSV.foreach('newcomers.csv', col_sep: ';') { |row| p row }
a_string = "Dan;34\nMaria;55"
CSV.parse(a_string, col_sep: ';') #=> [["Dan", "34"], ["Maria", "55"]]
There are many other options similar to col_sep
you can pass. We’ll cover the most common ones in this series, but if you’re curious, you can see them all here.
Let’s Do Some Manipulation
Uncle Bob wants us to take the CSV with the most frequent customers (customers.csv) and calculate the average money spent for each arrival. Easy enough, right? We just divide the total money the customer spent with the total times they came and took a meal at the restaurant. We already have that data in column 3 and column 2:
average_money_spent = Array.new
CSV.foreach('customers.csv') do |row|
average_money_spent << row[2] / row[1]
# row is just an ordinary array and you access its elements with []
end #=> Undefined method '/' for "2548":String
What’s going on? Remember, even though our CSV file contains numbers, they aren’t treated as numbers in Ruby. By default, everything from the CSV file is treated as a string. Fortunately, you can tell the CSV library to stray from this default behavior with yet another key-value option argument (converters: :numeric
). Let’s change our second line from CSV.foreach('customers.csv')
to CSV.foreach('customers.csv', converters: :numeric)
In this case, the key is a symbol (:converters
) and the value is a symbol (:numeric
). Since we’ve said that all options are part of a hash, we can specify more than one at the same time, like this:
CSV.read('customers_separated_with_semicolons.csv', col_sep: ';', converters: :numeric)
Now, all numbers are going to get converted to their approximate formats. Integers will become Fixnum’s, decimals will become Floats, and numbers with hundreds of decimals Bignums. Sweet!
Outputting Our Results In A File
Let’s try and add a new column to our sheet. Remember, the original CSV file (customers.csv) content is:
Dan,34,2548,Lovin it!
Maria,55,5054,"Good, delicious food"
Carlos,22,4352,"I am ""pleased"", but could be better"
Stephany,34,6542,I want bigger steaks!!!!!
and we’ve got this code:
average_money_spent = Array.new
CSV.foreach('customers.csv', converters: :numeric) do |row|
average_money_spent << row[2] / row[1]
end
which gives us the average money spent for each of the 4 customers in an array. We want to add a 5th column in our CSV file containing these values.
If you’ve worked with CSV files in Excel, the usual workflow is to make a change and save it. Well, things don’t work that way in Ruby (if you’re familiar with the File
class, you’ll know what I’m talking about.) The usual workflow for working with CSV files in Ruby is (assuming they’re on your hard drive):
-
Load the rows into the program’s memory with methods like
CSV.read
andCSV.foreach
. If you’re familiar with theFile
class in Ruby, you’ll be happy to know you can also open files withCSV.open(file_name, 'mode-like-r(+)-w(+)-or-a(+)', { options like converters: :numeric in key-value pairs})
and then call.each
on the block variable to get each row. This is just a longer way of doingCSV.foreach
. -
You can do whatever you want with each row using various Ruby expressions, accessing the row’s content as you would access an array. With
CSV.read
, you’d access your rows just like you would access an array, and withCSV.foreach
, you’ll access your rows as you read them. -
After you’ve done your work, you can save your data back to a new file (like ‘Save as’ in a spreadsheet program). This is the part we’re going to focus on in this section.
Referring back to our example, say we got the content of our entire customers.csv file in memory as an array (let’s call it customers_array
) with 4 elements (also arrays, remember?) and the average-money-spent
variable with (also) 4 elements, the first element (number) corresponding to the average money spent for row 1, the second for row 2, etc. Here’s one approach we can take to add this number to each row:
customers_array = CSV.read('test.txt')
customers_array.each do |customer|
customer << average_money_spent.shift
end
Cool! Now each row (i.e. array) contains a new cell (i.e. element) at the end. Let’s update our CSV file. But wait…there isn’t a magic ‘CTRL+S’ button you can press or command you can execute to do this. You can’t just magically modify your CSV file with the updated values. Working with with CSV library in this regard is almost 100% similar to working with the File class and the same logic applies:
You open a (CSV) file for read, write or append mode and use either or
puts
to append a new line. If you don’t know how the Ruby File
class works, I seriously recommend you spend some time learning it and coming back to this article (you have some great videos for it).
The only difference between File.open
and CSV.open
(yep, it even has the same method name) is that with File.open
, you append strings. With CSV.open
, you append rows (represented as arrays). Let’s do an example with our updated customer_array
:
CSV.open('our-new-customers-file.csv', 'w') do |csv_object|
customers.array.each do |row_array|
csv_object << row_array
end
end
That’s it! We now have a new, updated file named our-new-customers-file.csv
. Uncle Bob is pretty happy with us.
In part 2, we’ll go deeper into working with CSV files with actual headers (notice our example CSV files didn’t have any headers) and explore memory issues when working with large files and limited RAM. We’ll also learn some other nifty things like using enumerators with CSV iterator methods, how adding headers gives you a plethora of new methods, as well as some other useful tricks. Stay tuned!