Data Set Creation and Processing:

You must use StatTools.

1. Find a data set that is interesting to you.

2. The data set must contain a minimum of 1,000 records (observations, examples).

3. The data set must contain a minimum of 15 variables (attributes, features).

4. It is best not to use a data set that contains extensive missing data.

5. Save the data set as a .CSV file; name the file as described above.

6. Save the data set as an MS Excel workbook; name the file as described above.

7. In the MS Excel workbook, name the worksheet containing the data “Data”.

8. In the MS Excel workbook, create a worksheet and name it “Data Dictionary”.

   In the Data Dictionary worksheet, create three labeled columns: Variable Name, Data Type, and Explanation of Variable. Then list your variables and complete the Data Dictionary worksheet.

9. Create a correlation matrix of your data set.

10. Create a set of scatter plots of “interesting” variables of your choice.

11. Create a set of PivotTables of “interesting” variables of your choice.

12. Create a set of bar charts of “interesting” variables of your choice.

13. Create a set of histograms of “interesting” variables of your choice.
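The assignment itself must be completed in StatTools and MS Excel. As an optional cross-check only, the size requirements in steps 2 to 4 and the three Data Dictionary columns from step 8 could be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names (load_records, check_requirements, infer_type, data_dictionary) are hypothetical, and the "numeric" versus "text" guess is a crude heuristic, not the Data Type you are expected to report.

```python
import csv
import io


def load_records(csv_text):
    """Parse CSV text into a list of dicts, one dict per record."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def is_missing(value):
    """Treat None and blank strings as missing values."""
    return value is None or value.strip() == ""


def check_requirements(records, min_records=1000, min_variables=15):
    """Report record count, variable count, and the share of missing cells."""
    variables = list(records[0].keys()) if records else []
    total_cells = len(records) * len(variables)
    missing = sum(1 for row in records for v in variables if is_missing(row[v]))
    return {
        "records": len(records),
        "variables": len(variables),
        "enough_records": len(records) >= min_records,
        "enough_variables": len(variables) >= min_variables,
        "missing_share": missing / total_cells if total_cells else 0.0,
    }


def infer_type(values):
    """Guess 'numeric' or 'text' from the non-missing values (heuristic only)."""
    present = [v for v in values if not is_missing(v)]
    if not present:
        return "unknown"
    try:
        for v in present:
            float(v)
        return "numeric"
    except ValueError:
        return "text"


def data_dictionary(records):
    """Build one row per variable with the three Data Dictionary columns."""
    variables = list(records[0].keys()) if records else []
    return [
        {
            "Variable Name": v,
            "Data Type": infer_type([row[v] for row in records]),
            "Explanation of Variable": "",  # assumption: written by hand later
        }
        for v in variables
    ]
```

The explanation column is deliberately left blank, because only you can describe what each variable means in your data set.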

Paper Outline: Imagine yourself in the following context. You are an outside Business Analytics consultant hired by a company whose goal is to improve its current quantitative analysis processes. The company provides you with a basic data set that needs “cleaning up” through preprocessing: filtering out examples with missing attributes, replacing attributes, and, when applicable, randomly selecting 1,000 examples from a larger data set.
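Two of the clean-up steps named above, filtering out examples with missing attributes and randomly selecting 1,000 examples from a larger data set, could look roughly like the following Python sketch. It is an illustration only and does not replace the StatTools and Excel work. The function names and the fixed seed are assumptions made for the example, and records are assumed to be dictionaries that map variable names to values.

```python
import random


def drop_incomplete(records):
    """Keep only records in which every attribute has a non-empty value."""
    return [
        row
        for row in records
        if all(v is not None and str(v).strip() != "" for v in row.values())
    ]


def sample_records(records, n=1000, seed=42):
    """Randomly select n records without replacement.

    If the data set has n records or fewer, all of them are returned.
    """
    if len(records) <= n:
        return list(records)
    rng = random.Random(seed)  # assumption: fixed seed for a reproducible selection
    return rng.sample(records, n)
```

Filtering before sampling keeps the final subset at 1,000 complete records whenever enough complete records exist. Recording the seed in the Data Preparation part of the paper lets a reader reproduce the same selection.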

The paper will have ten parts:

1. Title page

2. Introduction: company information, consulting firm information, your story, and your interest.

3. Data Understanding: what the data set represents and what it is used for.

4. Data Preparation: how the data set was “cleaned up,” file preparation, and missing data removal.

5. Discussion of the data dictionary. Include the URL of the data location.

6. Discussion of the correlation matrix. Explain any “interesting” correlations.

7. Discussion of scatter plots, and why you selected the specific variables to use.

8. Discussion of PivotTables, and why you selected the specific variables to use.

9. Discussion of bar charts, and why you selected the specific variables to use.

10. Discussion of histograms, and why you selected the specific variables to use.
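For the discussion of the correlation matrix, it can help to see what the matrix contains and how "interesting" correlations might be shortlisted. The Python sketch below computes Pearson correlation coefficients for numeric columns and lists the pairs whose absolute correlation meets a cutoff. The function names and the 0.7 cutoff are assumptions chosen for illustration; the assignment does not define "interesting," and the matrix you submit should still be produced with StatTools.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when a column is constant
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """columns maps variable name -> list of numbers; returns a nested dict of r."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


def interesting_pairs(matrix, threshold=0.7):
    """List each variable pair once where |r| >= threshold, strongest first."""
    names = list(matrix)
    pairs = [
        (a, b, matrix[a][b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if not math.isnan(matrix[a][b]) and abs(matrix[a][b]) >= threshold
    ]
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)
```

A strong correlation found this way is only a starting point for the discussion; the paper should still explain why the relationship matters for the company in your scenario.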