A place to collect my thoughts on data analysis within Enrollment Management. Dare I call it Enrollment Science?
I have been playing with Neo4j quite a bit, mostly for fun, as I figure out when and where I could apply it to solve various analytics problems. Neo4j, at its core, is a database, which allows us to query data in a structured way. The graph model within Neo4j is very flexible, and the cypher query language is fantastic: once you get over the learning curve, you can write some really powerful queries with only a few lines of code.
With that said, I have increasingly realized that it’s better to move the analytics outside of the database. Even though you can execute cypher
statements against the database, how and where you execute them will matter once you go beyond “toy datasets.”
This post shares two helpful code snippets that I inject into my workflow when combining R and Neo4j for my projects.
First off, I use RNeo4j to connect R to Neo4j. It is totally possible to load data from data frames into Neo4j using the cypher function; Nicole shows you how to do this in the Import section of the README on the project page. But be warned: once you get to larger datasets, this might not be your best option in terms of speed.
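For small data frames, that route is perfectly workable. Below is a minimal sketch of the idea, written against RNeo4j's cypher function with parameters; the applicants data frame, the Applicant label, and the connection URL are made up for illustration, and the README's Import section remains the better reference for real loads.

## minimal sketch: pushing a small, made-up data frame into Neo4j
## through RNeo4j -- fine for toy data, slow once rows number in the millions
library(RNeo4j)

## hypothetical local connection
graph = startGraph("http://localhost:7474/db/data/")

## hypothetical data frame of applicants
applicants = data.frame(id = 1:3,
                        name = c("Ann", "Bo", "Cy"),
                        stringsAsFactors = FALSE)

## one parameterized MERGE per row
for (i in seq_len(nrow(applicants))) {
  cypher(graph,
         "MERGE (a:Applicant {id: {id}}) SET a.name = {name}",
         id = applicants$id[i],
         name = applicants$name[i])
}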
If you haven’t already, play around with the neo4j-shell
and the LOAD CSV
functionality. It’s pretty fast and can handle files of a few million records.
The snippet below is a quick way to simply call that procedure from within your R
script.
## path to the neo4j-shell binary
NEO_SHELL = "~/neo4j-community-2.3.1/bin/neo4j-shell"

## execute a file of cypher statements through the neo4j-shell
build_import = function(neo_shell, cypher_file) {
  cmd = sprintf("%s -file %s", neo_shell, cypher_file)
  system(cmd)
}
Just point to your shell and store the path in the variable NEO_SHELL. From there, calling it is simple.
build_import(NEO_SHELL, "../cypher/add-constraints.cql")
I simply reference the shell and pass a text file that contains my cypher statements. In my experience, there are huge performance gains when using the shell to execute commands, so the helper function above lets me stay within R while still getting that speed.
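To make that concrete, here is a hypothetical end-to-end use of the helper: the CSV path, the Applicant label, and the file names are invented, but the pattern is simply to put a LOAD CSV statement in a .cql file and hand that file to build_import.

## hypothetical .cql file holding a LOAD CSV statement,
## executed through the neo4j-shell via the helper above
load_stmt = "
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'file:///tmp/applicants.csv' AS row
MERGE (a:Applicant {id: row.id})
SET a.name = row.name;
"

## write the statement to disk, then run it with the shell
writeLines(load_stmt, "../cypher/load-applicants.cql")
build_import(NEO_SHELL, "../cypher/load-applicants.cql")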
When developing your code, inevitably you will be playing around with the incredibly helpful browser tool. I use it to prototype my queries, especially before running LOAD CSV
on a larger file.
When using R, I have two options:

1. Hard-code my queries as strings and pass them to the cypher function within the R package. Or,
2. Keep each query in its own file with a .cql extension. To me, this is easier to maintain, but the query needs to be brought into R.

The process below will bring a cypher query into R as a string that can then be passed to the cypher function.
## read the cypher query into a string variable
## http://stackoverflow.com/questions/9068397/import-text-file-as-single-character-string
FILE = "../cypher/get-edges.cql"
cypher_edges = readChar(FILE, file.info(FILE)$size)

## get the edges
edges = cypher(graph, cypher_edges)
Above, I have a query file that I read into the variable cypher_edges and then pass to cypher, which returns the data back to R. I would rather manage my queries in separate files, not within R, and this allows me to do that. Moreover, I believe it makes organizing your project's structure much easier.
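If you accumulate several .cql files, it can be worth wrapping that readChar idiom in a tiny helper; read_cypher below is just a hypothetical convenience function, not part of RNeo4j.

## small convenience wrapper around the readChar() idiom above
read_cypher = function(path) {
  readChar(path, file.info(path)$size)
}

## hypothetical usage: read a query file and run it against the graph
edges = cypher(graph, read_cypher("../cypher/get-edges.cql"))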
I recently implemented the two helper snippets above when working through a dataset for work. After trying to load the data and calculate similarities within the dataset (a future post), I arrived at this workflow. Given that my tool of choice is R, the workflow below is both manageable and pretty fast.
1. Use R to clean and tidy the data for import into Neo4j.
2. Import the data with LOAD CSV. I use Snippet 1 to do that.

For one project, in less than 5 minutes, I was able to import two years of applicants, along with various demographics on that pool, calculate the similarity based on interactions, and write those similarities (nearly 2.8 million rows) back to the database. In terms of speed, this workflow will enable me to operationalize all sorts of models that can be further implemented into marketing and recruitment efforts. I will post on this shortly …
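Pulled together, a hypothetical sketch of that workflow looks something like the following; the tidy_applicants data frame and the file names are placeholders, and the pieces are just Snippet 1 and Snippet 2 from above.

## 1. clean and tidy the data in R, then write a CSV for Neo4j to pick up
## (tidy_applicants is a placeholder for an already-cleaned data frame)
write.csv(tidy_applicants, "/tmp/applicants.csv", row.names = FALSE)

## 2. build constraints and import with LOAD CSV through the neo4j-shell (Snippet 1)
build_import(NEO_SHELL, "../cypher/add-constraints.cql")
build_import(NEO_SHELL, "../cypher/load-applicants.cql")

## 3. pull results back into R with a query kept in its own .cql file (Snippet 2)
FILE = "../cypher/get-similarities.cql"
similarities = cypher(graph, readChar(FILE, file.info(FILE)$size))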
This post was meant to capture two helpful code snippets that improve my workflow, and my speed, when analyzing datasets with R and Neo4j.