Billboard Hits + Data Munging

Problem statement

Explore

Exploring data sets within Python is made far easier by a handful of very helpful libraries.

For this project I utilized the following packages:

  • pandas
  • numpy
  • seaborn
  • matplotlib.pyplot
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline
data = pd.read_csv("../assets/billboard.csv")

My preferred method of preliminary data exploration is to use the following three commands:

print(data.columns.values)
print(data.head(5))
print(data.shape)

.shape returns the number of rows and columns of the table. This provides insight into the structure of the data set and the number of values you can expect to be working with.

data.columns.values shows the current column names. This helps to identify which columns to select and rename if needed.

.head(5) returns the top 5 rows of the table. This shows the type of data contained in each column, detailing how it has been collected and formatted.

Exploring your data is a useful step for identifying null values, strange formatting, and unusable data types. Within the “billboard” data set, we notice null values in the later columns.
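A quick way to spot those nulls is `isnull().sum()`, which counts missing values per column. A minimal sketch on a synthetic frame (the column names here are stand-ins, not the real CSV’s headers):

```python
import numpy as np
import pandas as pd

# Tiny frame standing in for the billboard data, with missing
# ranks appearing in the later week columns.
df = pd.DataFrame({
    'track': ['A', 'B', 'C'],
    'x1st.week': [87, 91, 81],
    'x2nd.week': [82.0, np.nan, 70.0],
    'x3rd.week': [72.0, np.nan, np.nan],
})

# Count nulls per column: missing values cluster in later weeks.
null_counts = df.isnull().sum()
print(null_counts)
```

Tracks leave the chart at different weeks, so the later week columns accumulate more nulls.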

 

Clean

The first step I took in cleaning the data was renaming columns. I renamed “artist_inverted” to simply “artist”, and converted each week column name to a numerical value representing that week.

# Rename the columns: descriptive names, then week numbers 1-76.
new_cols = ['year', 'artist', 'track', 'time', 'genre',
            'date.entered', 'date.peaked'] + list(range(1, 77))
data.columns = new_cols
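An equivalent way to do the renaming is `DataFrame.rename` with a mapping, which leaves untouched columns alone. A sketch on a toy frame with stand-in column names:

```python
import pandas as pd

# Toy frame illustrating the rename; these headers are stand-ins
# for the real CSV's column names.
df = pd.DataFrame({'artist.inverted': ['A'],
                   'x1st.week': [87],
                   'x2nd.week': [82]})

# Map old names to new ones; week columns become plain integers.
df = df.rename(columns={'artist.inverted': 'artist',
                        'x1st.week': 1,
                        'x2nd.week': 2})
print(df.columns.tolist())
```

Reassigning `data.columns` wholesale works too, but the mapping approach is safer if the column order ever changes.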

Next, I chose to melt the “week” columns into a single column containing the numerical value corresponding to each week.

# Melt the table, drop unused columns, and drop rows containing null values.
billboard = pd.melt(data, id_vars=data.columns[:7], var_name='week', value_name='rank')
billboard.drop('year', axis=1, inplace=True)
billboard.drop('date.entered', axis=1, inplace=True)
billboard.drop('date.peaked', axis=1, inplace=True)
billboard.dropna(axis=0, how='any', inplace=True)

 

After this I chose to drop columns that were unnecessary to my analysis: “year”, “date.entered”, and “date.peaked”.

Finally, I chose to drop any rows which contain null values. This limits each track’s weeks to those with recorded ranks (as opposed to an arbitrary fixed length).

The resulting DataFrame’s dimensions are 5307 rows × 6 columns.
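The melt-and-drop sequence can be sketched on a toy frame (synthetic values; the id columns are stand-ins for the real ones):

```python
import pandas as pd

# Wide frame: one row per track, one column per week's rank.
wide = pd.DataFrame({
    'artist': ['A', 'B'],
    'track': ['t1', 't2'],
    1: [87, 91],
    2: [82.0, None],  # track t2 fell off the chart after week 1
})

# Melt to long format: one row per (track, week) observation,
# then drop the rows where no rank was recorded.
long = pd.melt(wide, id_vars=['artist', 'track'],
               var_name='week', value_name='rank')
long = long.dropna(axis=0, how='any')
print(long)
```

Each surviving row is a single chart observation, which is the shape the visualizations below need.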

 

Visualize

It is interesting to note that some tracks appear to follow a similar trajectory as they rise and fall through the charts.
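One way to plot those trajectories is one line per track, rank against week, with the y-axis inverted so that rank 1 sits at the top. A sketch on synthetic data with the post-melt `track`/`week`/`rank` columns (the real frame would be filtered to tracks of interest):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for scripting
import matplotlib.pyplot as plt
import pandas as pd

# Synthetic long-format data standing in for the melted billboard frame.
billboard = pd.DataFrame({
    'track': ['t1'] * 4 + ['t2'] * 4,
    'week': [1, 2, 3, 4] * 2,
    'rank': [80, 60, 40, 55, 90, 70, 65, 72],
})

# One line per track: its weekly chart position.
fig, ax = plt.subplots()
for track, grp in billboard.groupby('track'):
    ax.plot(grp['week'], grp['rank'], label=track)

ax.invert_yaxis()  # rank 1 is best, so put it at the top
ax.set_xlabel('week')
ax.set_ylabel('rank')
ax.legend()
```

Inverting the y-axis makes peaks read visually as peaks, which makes the shared rise-and-fall shape easier to compare across tracks.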