We have all encountered collections of data at some point when working on Elixir applications. Collections are handy for storing, retrieving, and manipulating data using different data structures, and they help keep code clean and efficient.
In this article, we'll go through the following:
- What are collections in Elixir?
- The problems likely to be faced when working with collections
- The greedy approach when working with large datasets
- Enumerables in Elixir, which are used to work with collections
Finally, we'll explore how to build a memory-efficient Elixir application using the lazy processing approach with streams.
What are Collections in Elixir?
You can view a collection as a container that groups multiple things/objects/elements into a single unit. This collection allows you to store, retrieve, and manipulate data.
A couple of examples of collections encountered in the real world include:
- A mail folder that contains received and sent letters.
- A phone directory mapping phone numbers to corresponding names.
In Elixir, lists, maps, ranges, files, and tuples are the most common types of collections. Some of these collections can be iterated through, with each element passed through once, in order, so we can term them enumerables. Things that can be iterated in Elixir implement the Enumerable protocol. Lists, maps, and ranges are the most common datatypes used as enumerables.
Elixir has two modules with iteration functions that use the Enumerable protocol: `Enum` and `Stream`.
The Challenges of Collections
Working with collections can be very efficient, especially when your collections aren't that large. However, manipulating these collections can be very difficult when dealing with large datasets.
For example, let's say we want to retrieve organization data from a CSV file and display it in our application. Typically, we'd load all the organizations from the CSV file into the application's memory and transform the retrieved data into the shape we want.
Our application is fine if memory holds a predictable, modest number of organizations that can be easily transformed or manipulated. However, if we retrieve extensive organization data from the CSV, managing it can slow down application performance, because more memory is needed to consume it.
So, how can we elegantly manage our large CSV file, keeping in mind some of the problems we might face while processing it?
We can process our CSV file using one of two approaches:
- The greedy approach that consumes more memory.
- The memory-efficient lazy processing approach.
Earlier, we mentioned that Elixir has two modules with iteration functions: `Enum` and `Stream`. We'll use these modules to process our CSV file.
The Greedy Approach For Our Elixir Application
In the greedy approach, we are going to use the Enum module because:
- When given a collection, it consumes all the content of that collection. That means the whole collection must be loaded into memory before being processed.
- The returned result is another collection that is also loaded into memory, which might be inefficient if the data is not needed at that time.
Let's process our CSV file greedily.
We shall use an organizations CSV file with 2,000,000 records, downloaded from here, for demonstration purposes. After unzipping, the CSV file is around 283.4 MB.
Our focus will be to get the records of all organizations and transform them into data we can easily display in the browser. The result should look like this:
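Something along these lines, sketched with made-up column names and values standing in for the CSV's actual headers and records:

```elixir
# A list of maps, one per organization, keyed by the CSV's header names.
# These keys and values are hypothetical placeholders.
[
  %{index: "1", organization_id: "a1B2c3", name: "ExampleOrg", country: "Uganda"},
  %{index: "2", organization_id: "d4E5f6", name: "AnotherOrg", country: "Kenya"}
]
```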
Before we begin, let's open an interactive shell to manipulate the CSV file. After that, start the Erlang Observer application by running:
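```elixir
# Inside an IEx session (started by running `iex` in your terminal),
# launch the Erlang Observer GUI:
:observer.start()
```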
We shall use the Observer application to inspect memory allocators while processing the CSV file.
Step 1: Loading the CSV File Into Memory
Use `File.read!("organization.csv")` to load the whole CSV file into memory:
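```elixir
# Read the entire file into a single binary held in memory.
csv_data = File.read!("organization.csv")
```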
to load the whole CSV file into memory. The observer shows an extra 283.84 MB allocated to memory.
Step 2: Create a List for Each Row In the CSV
The data in Step 1 is loaded text, which means we are still far from our end goal. In this step, let's take the loaded text data and split it into lines. This will result in a list of string elements, with each element representing a row of the CSV.
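Continuing in the same IEx session, a minimal sketch (we also pattern-match away the header row, which we'll come back to in the next step):

```elixir
# Split the loaded binary into lines: one string per CSV row.
# The pattern match separates the header row from the data rows.
[_header | rows] = String.split(csv_data, "\n", trim: true)
```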
The observer shows an increase in allocated memory:
Step 3: Create a Column from Each Row
In the second step, we pattern-matched our result to `[_header | rows]`. We are not interested in the header of the CSV file, but in the body.
In this step, we will split each row into a list, with the elements in each list representing a column of the CSV file.
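A minimal sketch, assuming a simple comma-delimited file with no quoted fields containing commas:

```elixir
# Split every row into its columns, producing a list of lists.
columns = Enum.map(rows, fn row -> String.split(row, ",") end)
```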
This step alone takes almost 60 seconds to process. The observer also shows a spike increase in memory allocation:
Step 4: Create a Keyword List of Each Row
Here, each column in each row will be zipped to its corresponding header name.
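A sketch of this step, using hypothetical header names in place of the CSV's actual headers:

```elixir
# Hypothetical header names; in practice, these come from the CSV's header row.
headers = [:index, :organization_id, :name, :website, :country]

# Zip each row's values with the header names, yielding one keyword list per row.
keyword_lists = Enum.map(columns, fn row -> Enum.zip(headers, row) end)
```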
In this case, we use `Enum.map/2` and `Enum.zip/2` to process the organization records from step 3 (a list of lists, where each inner list represents a row of records). Remember that each call to the `Enum` module takes a collection and returns a collection. Both `Enum.map/2` and `Enum.zip/2` greedily process our data, which explains the spike in memory allocation seen in the observer.
`Enum.map/2` takes in a list of lists, where each inner list is a row of organization records. It iterates through each row and invokes `Enum.zip/2` on it to transform it into a keyword list.
The `Enum.zip/2` function takes two enumerables: the first is the list of header names, and the second is the list of values in a row. It zips the corresponding elements from the two lists into a list of tuples, where each tuple has an atom (the header name) as its first element and the value for that header as its second element.
Elixir represents this kind of output as a keyword list. In your interactive shell, you will notice each row is displayed as a keyword list rather than as a list of tuples.
`Enum.map/2` returns a new list of keyword lists.
Step 5: Convert to Map
Remember, our end goal was to work with data that's easy to manipulate when we want to display a list of organizations. In this final step, we will convert the keyword list above into maps.
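For example:

```elixir
# Convert each keyword list into a map for easier access and display.
organizations = Enum.map(keyword_lists, &Map.new/1)
```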
Compiling everything together:
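```elixir
# The complete greedy pipeline. Every Enum call materializes a full
# intermediate collection in memory before the next step can run.
# The header names are still hypothetical.
headers = [:index, :organization_id, :name, :website, :country]

[_header | rows] =
  "organization.csv"
  |> File.read!()
  |> String.split("\n", trim: true)

organizations =
  rows
  |> Enum.map(&String.split(&1, ","))
  |> Enum.map(&Enum.zip(headers, &1))
  |> Enum.map(&Map.new/1)
```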
Observation with Enum
- In each `Enum` step, we generated an output before the final result. The intermediate result of each step is stored in memory as a full collection. That's because, with the `Enum` implementation, the whole collection must be available before the next step in the pipeline can start processing.
- With `Enum`, each step has to wait for the previous step's full intermediate result before it can start processing, so each step takes more time.
- There is a spike in memory allocation at each step, since every intermediate output must be kept in memory.
The greedy approach of working with `Enum` works well with a predictable number of elements; these problems wouldn't be easy to notice with a small CSV file.
However, the approach is very inefficient when an intermediate result has to be kept in memory at each step. This is a massive waste of memory that can lead to slow application performance.
Solution: Build a Memory-efficient Elixir App with Streams
First, what are our goals?
- Memory allocation that's only for the final result.
- To process records only when we need them.
Using streams, we can do these things. Streams are known to be lazy enumerables.
This is because streams:
- Lazily load data into memory: one element at a time, instead of loading everything at once.
- Only load data when it's needed.
- Instead of applying a transformation to the given enumerable right away, return a stream value that describes the intended computation.
- Process one piece of data at a time, as it arrives, instead of waiting for the whole collection to be available.
- Are composable enumerables that read and pipe one line at a time, without waiting for all the data to be processed at every single step.
The Lazy Processing Approach with Streams in Elixir
We can leverage lazy processing to process our organization's CSV file.
The good thing is that many Elixir modules now support streams. Instead of opening the CSV file and loading its data as in step 1, let's use `File.stream!/3` to create a stream without actually opening the file. The `File.stream!/3` function returns a `%File.Stream{}` struct describing the intended computation:
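```elixir
# Create a lazy, line-by-line stream over the file.
# No file contents are read into memory at this point.
csv_stream = File.stream!("organization.csv")
```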
The observer doesn't show extra allocated memory the size of our CSV file, meaning the file hasn't yet been opened and loaded into memory.
Streams are composable enumerables, so we can pipe one stream into another. We will replace the `Enum` functions from steps 2-4 of the greedy approach with stream functions. We'll avoid processing data and loading it into memory until we decide to run the stream (that is, compute the data as intended) by passing it to a function in the `Enum` module.
Let's compose our first part of the pipeline with streams:
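A sketch of the composed pipeline, with the same hypothetical header names as before:

```elixir
headers = [:index, :organization_id, :name, :website, :country]

stream =
  "organization.csv"
  |> File.stream!()
  # Lazily drop the first element: the header row.
  |> Stream.drop(1)
  # Lazily transform each line as it flows through the pipeline.
  |> Stream.map(fn line ->
    columns =
      line
      |> String.trim()
      |> String.split(",")

    # Lazily zip the header names with the row's values, then build a map.
    headers
    |> Stream.zip(columns)
    |> Map.new()
  end)
```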
The composed pipeline of streams above returns a stream that only stores the intended computation instead of applying the transformation.
Here is what is happening within our stream processing pipeline:
- Instead of opening the file, we use `File.stream!/3` to create a stream.
- `Stream.drop/2` lazily drops the first element, the header.
- `Stream.map/2`:
  - splits the data into columns
  - lazily zips each row with its corresponding header names using `Stream.zip/2`
  - transforms each piece of data into a map
To process our desired result, we have to pass the stream value to a function in the Enum module. First, let's take only the first 3 elements from our stream and observe what happens.
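```elixir
# Enum.take/2 runs the stream only far enough to produce 3 elements.
stream
|> Enum.take(3)
|> IO.inspect()
```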
`Enum.take/2` receives the transformed map data from the stream and gets only the first 3 elements. Once it has those 3 elements, there is no more processing.
Running this pipeline, you will notice that:
- The result is returned instantaneously.
- `IO.inspect` only prints the first 3 elements. This is proof that the elements are being processed one at a time by subsequent calls to `Enum` functions.
Lazy Approach Using Streams Vs. Greedy Approach
Let's use the greedy approach to take the first 3 elements from our code:
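A sketch of the equivalent greedy version, with an `IO.inspect` in the middle of the pipeline (the header names are hypothetical, as before):

```elixir
headers = [:index, :organization_id, :name, :website, :country]

"organization.csv"
|> File.read!()
|> String.split("\n", trim: true)
# Greedily drop the header row; each step below builds a full list in memory.
|> Enum.drop(1)
|> Enum.map(&String.split(&1, ","))
|> Enum.map(&Enum.zip(headers, &1))
|> Enum.map(&Map.new/1)
|> IO.inspect()
|> Enum.take(3)
```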
`IO.inspect` prints the whole 2,000,000 records. This shows that with `Enum`, we have to wait for the whole collection to be available upfront before the next processing step starts. Using this approach, an intermediate list of 2,000,000 records has to be kept in memory at each step (even though, in the end, we only take the first 3 records using `Enum.take/2`).
Unlike the greedy approach, the lazy approach doesn't need to process the whole collection before the next step takes place. Passing the stream to `Enum.take/2` to fetch the first 3 elements, we can see there isn't even a peak in memory usage.
With streams, even if we process 2,000,000 records, we can decide how many records to load into memory and ensure memory is only used for the final result. This is not the case with `Enum`, where each step keeps a full set of intermediate records in memory.
We can also compare the greedy and lazy approaches by monitoring memory allocators using the Erlang observer.
Remember, in the greedy approach, we processed all the records. We will do the same with our lazy approach by passing the stream pipeline to `Enum.to_list/1`:
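```elixir
# Run the full lazy pipeline and collect the final list of maps.
organizations = Enum.to_list(stream)
```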
Here's the memory allocated to the lazy approach:
And to the greedy approach:
Wrapping Up
In this post, we first took a look at collections in Elixir and some of their associated challenges. We then explored two methods of processing large datasets: the greedy approach and the lazy approach using streams.
We've seen how working with streams is one of the best approaches to help create a memory-efficient Elixir application.
Happy coding!
P.S. If you'd like to read Elixir Alchemy posts as soon as they get off the press, subscribe to our Elixir Alchemy newsletter and never miss a single post!