Today I discovered an awesome property of Elixir maps that let me reduce the time of a matching operation on large datasets from about 30+ minutes to less than a second.
I was working on a project to take about 10,000 “leads” records with basic info like First Name, Last Name, and Email and compare them to an existing set of about 10,000 “contacts” records with a more expansive data set, trying to find any “leads” that match a “contact” by email address. When a match is found, the “lead” is merged into the existing “contact”.
I’ll recreate the situation with some code to generate the big data sets:
The snippet is a little dense but basically we’re creating a bunch of “leads” and a bunch of “contacts” with some random four-letter values for
z, which puts us in a statistical range where we should see some overlap in the email addresses but not much. When I tested the code it returned a couple hundred matches.
Now, our job is to figure out which of our “leads” have an email address that matches the email address of a “contact” in the other data set.
Here’s the method I used at first to implement it, using lists as the primary data structure:
On my Macbook Pro, I let this run for about 30 minutes while writing this article and it still hadn’t finished so I had to kill
From a process standpoint, this method seems to make sense. For each lead, we have to find which if any of the contacts have the same email address. Since there are potentially multiple contacts with the same email address, we need to capture them all so we do
Enum.filter to strip the list down to just the contacts that match the lead’s email address.
The problem is that to access any element in a list, you have to traverse all of the elements preceding that element first. The bigger the list gets, the less efficient it is to keep traversing it every time you need the next value.
Unfortunately, this leads to a very very slow operation.
Because lists were so slow with this big dataset, I needed a way to make this operation faster. I wanted to see if I could get the data structure into a map form, because maps in Elixir are very fast and I thought that might speed up the matching process.
Here’s what I came up with:
First, we put the
contacts list into a map using the
Enum.group_by/3 function. This will give us a map, where the keys are all of the unique email addresses in the
contacts list, and where the values are the contacts themselves. Even better, when more than one contact has the same email address,
Enum.group_by/3 will put them all into a nice list for us!
Next, we see if the
contacts map has a key that matches our current lead’s
nil if there’s not.
Finally, we filter the list to remove all of the
nil entries, and we’re left with the definitive matches between the data sets.
This time the operation completes in only a second or two!
Maps have near instant access times, even on extremely large maps. This makes them ideal when working with large sets of data like we’re seeing here. The result is that we have a very fast matching operation, like 2000 times faster than doing the same operation with lists.
I hope you enjoyed the article! I’m working on a series of posts to help people who have no coding experience get started with Elixir (and Phoenix). If you know anyone who’s interested in learning to code, send a link their way: https://medium.com/@damonvjanis/learn-to-code-the-easy-way-part-1-8067cd3b6307