Project 3: Liquor Sales + Linear Regression

Scenario 1: State tax board

You are a data scientist in residence at the Iowa State tax board. The Iowa State legislature is considering changes in the liquor tax rates and wants a report of current liquor sales by county and projections for the rest of the year.

Getting started.

First step in getting started is to import the tools and data set needed for this project. Here I used Pandas and Numpy for data manipulation. I used SKLearn to create a linear regression model from my data.

import pandas as pd
import numpy as np

from sklearn import datasets, linear_model, metrics
from sklearn.cross_validation import cross_val_score, cross_val_predict, train_test_split
from sklearn.metrics import r2_score

## Load the data into a DataFrame
df = pd.read_csv('../Iowa_Liquor_sales_sample_10pct.csv')

## Transform the dates if needed, e.g.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

Explore the data

Once the data has been imported, explore the data using the following functions.

print df.head()
print df.describe()
df.shape
df.info()

.head() - head will provide a snap shot of the top 5 values from each column. This is usefully to see what type of data you are working with - labels, strings and numerical values.
.describe() - describe is very useful at summerizing the numerical values from you data set.
shape - shape gives you a quick snap shot of the size of your data in rows and columns.
.info() - will provide a quick summery of they data types contained in each column. It also helps to spot null values.

During my cleaning process I chose to drop the redundant columns. In this

#Drop redundent data columns
df.drop('Volume Sold (Gallons)', axis=1, inplace=True)
df.drop('Category', axis=1, inplace=True)

#Rename columns
df.columns = ['date', 'store_number','city','zip_code','county_number','county',
              'cat_name','vendor','item_number','item_desc','btl_vol','st_btl_cost',
              'st_btl_retail','btls_sold','sale','vol_sold']

#Convert the data types of State Bottle Cost
df['sale'] = [float(x.replace('$','')) for x in df['sale']]
df['st_btl_cost'] = [float(x.replace('$','')) for x in df['st_btl_cost']]
df['st_btl_retail'] = [float(x.replace('$','')) for x in df['st_btl_retail']]

Projects

Getting started.

Explore the data