A place to collect my thoughts on data analysis within Enrollment Management. Dare I call it Enrollment Science?
Many moons ago, I wrote some code to build a Tableau Data Extract from the work that I had munged together in python
. I figured it was time to update the code since I recently discovered that the Tableau API has changed.
For a link to that old code, refer to the Jupyter Notebook in this repo.
First off, I am using a Macbook, and while I believe things are getting easier on Windows machines with respect to coding, I prefer to write Terminal commands over point-and-click installs.
For tools, we will be using:
A light-weight python instance from Anaconda called Miniconda. This lets us create environments that house the versions of the packages that we need, and it just works. Installation should be easy and is well documented on their site. The beauty of conda
is that the tool handles all of the version dependencies for us! Prior, getting packages to work could be a nightmare at times.
The Tableau Python SDK. This was surprisingly harder to find than I would have guessed. Simply download the SDK, unpack it, and follow the instructions at the link. We will do this below, however.
While not necessary, I am going to use Yhat’s Rodeo IDE and list how to configure the editor to use a specific conda environment. This is also easy to install, and assumes that you have already done so.
Tableau, of course.
Assuming you have miniconda installed properly, we are going to create a Python 2.7 environment for our work with the Tableau API.
First, let’s explore what I have for environments on my system.
$ conda info --envs
yields
# conda environments:
#
chatbot /Users/btibert/miniconda3/envs/chatbot
ipy /Users/btibert/miniconda3/envs/ipy
tableau /Users/btibert/miniconda3/envs/tableau
root * /Users/btibert/miniconda3
Above are the environments that I have set up on my machine. If you are not familiar with environments, think of them as isolated installs of python with various packages for your use-case. Environments are a great way to isolate the versions and dependencies relative to a project, and allows you to avoid collisions if you just manage a single python environment.
Now, lets create our environment. In this case, I am going to call it tableau-post
.
conda create -n tableau-post python=2.7 pandas jupyter matplotlib
This will install python 2.7
, pandas
, matplotlib
and jupyter
(formerly ipython).
These packages and version are what you need to:
To activate the environment, simply use:
source activate tableau-post
The code above will activate our environment.
To confirm that you are good to go:
source activate tableau-post
which python
In my case, this yields:
/Users/btibert/miniconda3/envs/tableau-post/bin/python
We will use this when we configure Rodeo later on.
From this point forward, I am going to assume that you have tableau-post
activated.
The code below jumps into Downloads
and will get the Tableau SDK from the web.
cd ~/Downloads/
wget https://downloads.tableau.com/tssoftware/Tableau-SDK-Python-OSX-64Bit-10-0-2.tar.gz
tar -xvzf Tableau-SDK-Python-OSX-64Bit-10-0-2.tar.gz
This will download and unpack the file, which will yields a directory that we need for installation.
cd TableauSDK-10000.16.1004.1720/
The command above will navigate into the folder.
sudo python setyp.py install
The command above will install the Tableau SDK for you. For more on this, go to this link
With Rodeo open, navigate to Preferences
. You will see a dialog box similar to the one below.
You can see above that for the Python Command I am using the same location that was printed from the which python
command within the tableau-post
environment.
It’s that simple; now you can use isolated environments when developing with Rodeo!
The majority of the code that I write is in R
using the fantastic Rstudio IDE, and while the Rodeo tool is not as feature-rich (yet), it’s great for the way I develop. The fact that there is “type-ahead” functionality (read: you write less code), and the ability to commit the lines of my script 1x1, is extremely helpful for the way I work through my tasks.
I hardly debug in an efficient way, but I also tend to solve smaller problems and do not require test suites.
Below are a few helpful resources on how to use python to build out a Tableau Data Extract:
The code below builds out a basic extract from a sample dataset available on the web.
## import the libraries
import tableausdk.Extract as tde
import pandas as pd
import os
## bring in a sample Graduate School Admissions datasets
file_name = "http://www.ats.ucla.edu/stat/data/binary.csv"
df = pd.read_csv(file_name)
df.head()
df.shape
## create the extract name, but remove the extract if it already exists
fname = "example.tde"
try:
tdefile = tde.Extract(fname)
except:
os.system('del ' + fname)
os.system('del DataExtract.log')
tdefile = tde.Extract(fname)
## define the table definition
tableDef = tde.TableDefinition()
## create a list of column names and types
colnames = df.columns
coltypes = df.dtypes
## create a dict for the field maps
## Caveat: I am not including all of the possibilities below
fieldMap = {
'float64' : tde.Types.Type.DOUBLE,
'float32' : tde.Types.Type.DOUBLE,
'int64' : tde.Types.Type.DOUBLE,
'int32' : tde.Types.Type.DOUBLE,
'object': tde.Types.Type.DOUBLE,
'bool' : tde.Types.Type.DOUBLE
}
## for each column, add the appropriate info the Table Definition
for i in range(0, len(colnames)):
cname = colnames[i]
ctype = fieldMap.get(str(coltypes[i]))
tableDef.addColumn(cname, ctype)
## create the extract from the Table Definition
## Super Hacky, but legible
## for each row, add the data to the table
## Again, not accounting for every type or errors
with tdefile as extract:
table = extract.addTable("Extract", tableDef)
for r in range(0, df.shape[0]):
row = tde.Row(tableDef)
for c in range(0, len(coltypes)):
if str(coltypes[c]) == 'float64':
row.setDouble(c, df.iloc[r,c])
elif str(coltypes[c]) == 'float32':
row.setDouble(c, df.iloc[r,c])
elif str(coltypes[c]) == 'int64':
row.setDouble(c, df.iloc[r,c])
elif str(coltypes[c]) == 'int32':
row.setDouble(c, df.iloc[r,c])
elif str(coltypes[c]) == 'object':
row.setString(c, df.iloc[r,c])
elif str(coltypes[c]) == 'bool':
row.setBoolean(c, df.iloc[r,c])
else:
row.setNull(c)
# insert the row
table.insert(row)
## close the file
tdefile.close()
Some notes on the process:
I hope that this helps you as you develop with Python and Tableau. I recently have been trying to define a workflow locally to prototype my data needs, and since my dataset has millions of rows, using CSVs and local database tables just isn’t cutting it.
comments powered by Disqus