Data science with Clojure

Hypothetical reader, you must be asking yourself:

“Why not use Python? There are a gazillion libraries and methods ready to go! So… why?”

I will give you one reason in this post! One small disclaimer, though: I will not discuss performance here, only convenience ;]

Actually, the point I will make is valid in principle for any Lisp dialect. But I have found a pretty nice project with a solid library in Clojure, so here we go.

Some basics…

(If you are already familiar with the threading macros (-> and/or ->>), you can skip to So what?)

I will assume that you, hypothetical reader, have some knowledge of Lisp. At least the basics, i.e., you understand how to declare and call functions and are not afraid of parentheses (not too much, at least).

Having said that, I may need to explain the thread macro (aka ->) a little. The why will become clearer later. So…

What is the thread macro?

Well, if you have used pipes (|) on the command line, you are already familiar with the concept. All that is left is to see how it works in a Lisp context.

Let’s say you have a Unix command like this:

cat textfile1.txt textfile2.txt | grep "a string"

What this does is pretty much hand grep the result of the cat command. Something like:

cat textfile1.txt textfile2.txt | grep "a string" ~result~
	  |                                             ^
	  |                                             |
	  ------------------  result  -------------------

The concept of thread macros is the same. The result of one function goes to the next one. Let’s take a look at this simple example:

(-> (+ 2 2)
    (/ 8))

The idea is that each form inside the -> macro produces a result, which is handed to the next form as its first argument. In other words, the snippet above goes through these steps:

(-> 4
    (/ 8))

and

(-> (/ 4 8)) ;; which is the same as (/ 4 8), therefore 1/2 (Clojure returns an exact ratio; (double 1/2) is 0.5)
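
If you want to check the rewriting yourself, here is a small sketch for the REPL. It only uses clojure.core; the name expanded is mine, just for illustration.

```clojure
;; macroexpand-1 returns the form that -> rewrites its body into,
;; without evaluating it
(def expanded (macroexpand-1 '(-> (+ 2 2) (/ 8))))

expanded        ;; => (/ (+ 2 2) 8)
(eval expanded) ;; => 1/2
```

So the macro does nothing at runtime: it only rearranges the code before it runs.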

Ok, but what is the difference between -> and ->>?

Very simple! -> passes the result of each form as the first argument of the next one, and ->> passes it as the last argument. Take a look:

(->> (+ 2 2)
     (/ 8)) ;; which is the same as (/ 8 4), therefore 2
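
A rough rule of thumb for which one to pick: functions that work on a single map (assoc, update) take the data first, and sequence functions (filter, map, reduce) take the collection last. A minimal sketch with clojure.core only; the names updated-map and sum-of-odd-squares are made up for this example.

```clojure
;; -> fits functions that take the data as the FIRST argument
(def updated-map
  (-> {:a 1}
      (assoc :b 2)      ; (assoc {:a 1} :b 2)
      (update :a inc))) ; (update {:a 1, :b 2} :a inc)

updated-map ;; => {:a 2, :b 2}

;; ->> fits sequence functions that take the collection as the LAST argument
(def sum-of-odd-squares
  (->> [1 2 3 4 5]
       (filter odd?)    ; (filter odd? [1 2 3 4 5])
       (map #(* % %))   ; squares of 1, 3 and 5
       (reduce +)))     ; 1 + 9 + 25

sum-of-odd-squares ;; => 35
```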

So what?

Let me introduce you to scicloj.ml! This library has some pretty cool features for dealing with tabular datasets. Actually, it is what you would expect if you are familiar with pandas or PySpark, but with Lisp!

If you have done some hard work with pandas (or PySpark), you may have noticed that the method-chaining style that comes with OOP does the readability of the code no favors. To transform your dataframe you have to dot everything and try to indent it nicely to keep it readable.

For example, let’s assume we have a dataframe like this:

Movie                          Director           Box-Office  Year
Shutter Island                 Martin Scorsese    294.8       2010
Kill Bill Vol.1                Quentin Tarantino  180.9       2003
The Departed                   Martin Scorsese    291.5       2006
Once Upon a Time in Hollywood  Quentin Tarantino  377.6       2019
The Wolf of Wall Street        Martin Scorsese    406.9       2013

Given this dataset, let’s assume we would like to see which director has the highest box office after 2010. There are some ways of doing this, but one way would be:

  1. Filter the dataframe to keep only the rows after 2010
  2. Use a group by to aggregate the box office values per director
  3. Order the dataframe by the aggregated box office

Well, to do so in Python, assuming the dataframe above is stored in the variable df, it would go something like:

df # the small movie dataframe above

df[df.Year > 2010].groupby('Director').agg({'Box-Office':'sum'}).sort_values(by=['Box-Office'])

# or maybe with a good (?) indentation
df[df.Year > 2010].groupby('Director').\
    agg({'Box-Office':'sum'}).\
    sort_values(by=['Box-Office'])

Now, I am not saying it is bad, but it is not ideal either. Let me show something similar in a Lisp environment. In this case, Clojure’s scicloj:

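;; assumption: the dataset functions are exposed by the scicloj.ml.dataset namespace
(require '[scicloj.ml.dataset :as ml])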

df ;; let's say this is the dataframe

(-> df
    (ml/select-rows #(> (:Year %) 2010))
    (ml/group-by :Director)
    (ml/aggregate {:Box-Office (fn [ds] (reduce + (get ds :Box-Office)))})
    (ml/order-by [:Box-Office]))

I won’t go into detail about how each line works but, honestly, just knowing that the thread macro (->) is there to transform your data one line at a time is enough to make the Lisp version read more like the three steps mentioned earlier.
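
If you want to try the three steps without installing anything, here is a minimal sketch that uses only clojure.core on a plain vector of maps. This is not scicloj code: the names movies and totals and the lowercase keys are my own choices for the example, and the data is the small table from above.

```clojure
;; the movie table from above as a vector of maps
(def movies
  [{:movie "Shutter Island" :director "Martin Scorsese" :box-office 294.8 :year 2010}
   {:movie "Kill Bill Vol.1" :director "Quentin Tarantino" :box-office 180.9 :year 2003}
   {:movie "The Departed" :director "Martin Scorsese" :box-office 291.5 :year 2006}
   {:movie "Once Upon a Time in Hollywood" :director "Quentin Tarantino" :box-office 377.6 :year 2019}
   {:movie "The Wolf of Wall Street" :director "Martin Scorsese" :box-office 406.9 :year 2013}])

(def totals
  (->> movies
       ;; 1. keep only the rows after 2010
       (filter #(> (:year %) 2010))
       ;; 2. group by director, then sum the box office of each group
       (group-by :director)
       (map (fn [[director rows]]
              {:director   director
               :box-office (reduce + (map :box-office rows))}))
       ;; 3. order by the aggregated box office (ascending)
       (sort-by :box-office)))

totals
;; => ({:director "Quentin Tarantino", :box-office 377.6}
;;     {:director "Martin Scorsese", :box-office 406.9})
```

Here ->> is used instead of -> because the clojure.core sequence functions take the collection as their last argument. Since the sort is ascending, the director with the highest box office is the last element.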

If you consider this point biased since I am a huge Lisp fan, no problem. Consider this my first plea to make data science doable in other languages (especially Lisp) rather than only Python or R.

Cheers!

Quick note: A very good source to play with scicloj ml for me was this link with examples.

Quick note 2: Scicloj has a page with a bunch of nice tools for data in Clojure: scicloj resources. You are welcome!
