Automating Excel tasks with Python: Part 1
What does “panda” mean to you? Probably something like this:
Or if you’re in the SEO crowd, then it will also mean this:
It’s the name given to a Google search algorithm update that aims to lower rankings for websites with poor-quality content. Talk about a sad panda for those sites.
But today we are looking at yet another panda. It’s my personal favourite of the three, and I think it could be yours too: the pandas dataframe library for Python.
The Task: Filter data then sum and count
Now that I’ve successfully overused the word panda and variations thereof, let’s get to the good stuff.
The task is as follows: we have a set of spreadsheets or .csv files and need to calculate the sum and count of a given column, conditional on the values of other columns. I encountered this task with a SEMrush organic search dataset looking something like this (but much larger, of course):
In particular, let’s say we need to determine the number and total search volume of keywords with positions between 10 and 21 (second page Google results) that have search volume of at least 100. Furthermore, we only want URLs that contain the string “/videos/” so we can look at keywords leading to that subdirectory of the site.
With Excel, the solution is simple enough for one file:
- Create custom filters for Position, Volume, and URL columns
- Calculate the sum and count of the target column
Easy, right? But doing this for desktop and mobile versions of N sites means the task must be repeated 2*N times. Not only does that suck, it also increases the chance of human error creeping in.
With Python, the solution can be extended to any number of files.
We can use the following function to filter data from a given file and perform the required calculations.
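The original function isn’t reproduced here, but based on the description below, a sketch might look like this (the column names “Position”, “Volume”, and “Url” match the SEMrush export described above; adjust them to your own data):

```python
import pandas as pd

def filter_calc(filename):
    # Read the .csv data into a pandas dataframe
    df = pd.read_csv(filename)
    # Keep keywords on the second page of results (positions between 10 and 21),
    # with search volume of at least 100, whose URL contains "/videos/"
    df_fltr = df[(df['Position'] > 10) & (df['Position'] < 21)
                 & (df['Volume'] >= 100)
                 & (df['Url'].str.contains('/videos/'))]
    # Return the keyword count and the total search volume
    return len(df_fltr), df_fltr['Volume'].sum()
```

Each condition produces a boolean mask, and the `&` operator combines them row by row before indexing into the dataframe.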
When filter_calc is called, the .csv-formatted data is first read into a pandas dataframe df (note that it’s possible to specify custom separators with the read_table function, and the read_excel function can be used for Excel spreadsheets).
Next we create a filtered dataframe called df_fltr and parse the URLs to find those containing the phrase “/videos/”. We reference the column names “Volume”, “Position”, and “Url”. You can find out what your columns are named with print(df.columns), or use print(df.head()) to look at the first few rows of data underneath each.
Finally, we return the calculated values. Want to save the filtered data? Insert the statement df_fltr.to_csv('filename.csv') before returning the results, or return df_fltr along with the other results and save the data after calling the function.
This function can be iterated over as you desire. For example, we can use the glob.glob function to get a list of all .csv files in the working directory and call filter_calc on each, producing a table with the results.
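A sketch of that loop might look like the following (the filter_calc logic and its column names are assumptions carried over from the description above, re-included here so the example runs on its own):

```python
import glob
import pandas as pd

def filter_calc(filename):
    # Filtering logic as described above; column names are assumed
    df = pd.read_csv(filename)
    df_fltr = df[(df['Position'] > 10) & (df['Position'] < 21)
                 & (df['Volume'] >= 100)
                 & (df['Url'].str.contains('/videos/'))]
    return len(df_fltr), df_fltr['Volume'].sum()

def summarise(pattern):
    # Build a results table with one row per matching file
    rows = []
    for path in sorted(glob.glob(pattern)):
        count, volume = filter_calc(path)
        rows.append({'File': path, 'Keywords': count, 'Total volume': volume})
    return pd.DataFrame(rows)

# e.g. print(summarise('*.csv')) for every .csv file in the working directory
```

From here, saving the summary table is one more call to to_csv on the returned dataframe.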
It may not be as simple as the Excel option, but it’s quick and clean.
That’s all for today, but stay tuned for more Python solutions (and more pandas 🐼).