Scenario 1: State tax board
You are a data scientist in residence at the Iowa State tax board. The Iowa State legislature is considering changes in the liquor tax rates and wants a report of current liquor sales by county and projections for the rest of the year.
Getting started.
First step in getting started is to import the tools and data set needed for this project. Here I used Pandas and Numpy for data manipulation. I used SKLearn to create a linear regression model from my data.
import pandas as pd import numpy as np from sklearn import datasets, linear_model, metrics from sklearn.cross_validation import cross_val_score, cross_val_predict, train_test_split from sklearn.metrics import r2_score ## Load the data into a DataFrame df = pd.read_csv('../Iowa_Liquor_sales_sample_10pct.csv') ## Transform the dates if needed, e.g. df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")
Explore the data
Once the data has been imported, explore the data using the following functions.
print df.head() print df.describe() df.shape df.info()
- .head() - head will provide a snap shot of the top 5 values from each column. This is usefully to see what type of data you are working with - labels, strings and numerical values.
- .describe() - describe is very useful at summerizing the numerical values from you data set.
- shape - shape gives you a quick snap shot of the size of your data in rows and columns.
- .info() - will provide a quick summery of they data types contained in each column. It also helps to spot null values.
During my cleaning process I chose to drop the redundant columns. In this
#Drop redundent data columns df.drop('Volume Sold (Gallons)', axis=1, inplace=True) df.drop('Category', axis=1, inplace=True) #Rename columns df.columns = ['date', 'store_number','city','zip_code','county_number','county', 'cat_name','vendor','item_number','item_desc','btl_vol','st_btl_cost', 'st_btl_retail','btls_sold','sale','vol_sold'] #Convert the data types of State Bottle Cost df['sale'] = [float(x.replace('$','')) for x in df['sale']] df['st_btl_cost'] = [float(x.replace('$','')) for x in df['st_btl_cost']] df['st_btl_retail'] = [float(x.replace('$','')) for x in df['st_btl_retail']]