What does “panda” mean to you? Probably something like this:
Or if you’re in the SEO crowd, then it will also mean this:
🐼 Google Panda Update
A search filter released in February 2011 and officially part of Google’s core algorithm as of 2016.
It’s the name given to a filter created by Google that aims to decrease site rankings for websites with poor quality content. Talk about a sad panda for those sites.
But today we are looking at yet another panda, my personal favourite of the three, and I think it could be yours too: the Pandas dataframe library for Python.
The Task: Filter data then sum and count
Now that I’ve successfully overused the word panda and variations thereof, let’s get to the good stuff.
The task is as follows: we have a set of spreadsheets or .csv files and need to calculate the sum and count of a given column, conditional on the values of other columns. I encountered this task with an SEMrush organic search dataset looking something like this (but much larger, of course):
In particular, let’s say we need to determine the number and total search volume of keywords with positions between 10 and 21 (second page Google results) that have search volume of at least 100. Furthermore, we only want URLs that contain the string “/videos/” so we can look at keywords leading to that subdirectory of the site.
With Excel, the solution is simple enough for one file:
- Create custom filters for Position, Volume, and URL columns
- Calculate the sum and count of the target column
Easy right? But doing this for desktop and mobile versions of N sites means the task must be repeated 2*N times. Not only does that suck, but it increases the chance of human error coming into play.
With Python, the solution can be extended to any number of files.
We can use the following function to filter data from a given file and perform the required calculations.
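A minimal sketch of such a filter_calc, assuming the SEMrush export has columns named "Position", "Volume", and "Url" (adjust the names and thresholds to your own data):

```python
import pandas as pd

def filter_calc(filename):
    # Read the .csv export into a pandas dataframe
    df = pd.read_csv(filename)
    # Keep second-page keywords (positions strictly between 10 and 21,
    # i.e. 11-20) with at least 100 searches, whose URL points into
    # the /videos/ subdirectory
    df_fltr = df[(df['Position'] > 10) & (df['Position'] < 21)
                 & (df['Volume'] >= 100)
                 & df['Url'].str.contains('/videos/')]
    # Return the number of matching keywords and their total search volume
    return len(df_fltr), df_fltr['Volume'].sum()
```

Each comparison produces a boolean series, and the `&` operator combines them row by row, so the parentheses around each condition matter.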
When filter_calc is called, the .csv formatted data is first read into a pandas dataframe df (note that it's possible to specify custom separators with the read_table function, and the read_excel function can be used for Excel spreadsheets).
Next we create a filtered dataframe called df_fltr and parse the URLs to find those containing the phrase “/videos/”. We reference the column names: “Volume”, “Position”, and “Url”. You can figure out what your columns are named with print(df.head()), which shows the first few rows of data underneath each header.
Finally, we return the calculated values. Want to save the filtered data? Insert the statement df_fltr.to_csv('filename.csv') before returning the results, or return df_fltr along with the other results and save the data after calling the function.
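As a quick illustration of the save step (the df_fltr contents and the 'filtered.csv' name here are made up for the example):

```python
import pandas as pd

# Illustrative filtered dataframe standing in for df_fltr
df_fltr = pd.DataFrame({'Keyword': ['python pandas'],
                        'Volume': [320],
                        'Position': [14]})

# One extra line saves the filtered rows;
# index=False keeps pandas' row index out of the file
df_fltr.to_csv('filtered.csv', index=False)

# Round-trip check: the saved file reads back with the same columns
print(pd.read_csv('filtered.csv'))
```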
This function can be called in a loop over as many files as you like. For example, we can use the glob.glob function to get a list of all .csv files in the working directory and call filter_calc on each, producing a table of results.
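One way that loop might look, sketched end to end with two toy export files created up front so it actually runs (the file names, column names, and thresholds are all assumptions):

```python
import glob
import pandas as pd

def filter_calc(filename):
    # Same filtering sketch as before: second-page positions (11-20),
    # volume >= 100, URLs containing "/videos/"
    df = pd.read_csv(filename)
    df_fltr = df[(df['Position'] > 10) & (df['Position'] < 21)
                 & (df['Volume'] >= 100)
                 & df['Url'].str.contains('/videos/')]
    return len(df_fltr), df_fltr['Volume'].sum()

# Two toy exports so the loop below has something to chew on
for name, pos in [('site_desktop.csv', 12), ('site_mobile.csv', 3)]:
    pd.DataFrame({'Position': [pos], 'Volume': [150],
                  'Url': ['https://example.com/videos/clip']}
                 ).to_csv(name, index=False)

# Run filter_calc for every matching .csv and collect
# the results into one summary table
rows = [(path, *filter_calc(path))
        for path in sorted(glob.glob('site_*.csv'))]
summary = pd.DataFrame(rows, columns=['File', 'Keywords', 'Total volume'])
print(summary)
```

With real data you would drop the toy-file step and point glob.glob at your own directory of exports.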
It may not be as simple as the Excel option, but it’s quick and clean.
That’s all for today, but stay tuned for more Python solutions (and more pandas 🐼).