Module 4: Data Mining 1
- Data MIning Assignment_with rubric (181.304 KB)
- usoccupations.xlsx (621.785 KB)
- uscars.xlsx (31.055 KB)
- uspopulation.xlsx (142.184 KB)
This assignment gives you practice applying data mining techniques in R. You will classify and cluster a data set to show how data mining methods can be used for classification and clustering.
Before beginning this assignment, review the learning resources for this module, especially Introduction to Data Mining with R from RDataMining.com, and work through the steps it takes to classify and cluster the iris data set in R.
The purpose of clustering is to form a new classification from numerical variables. It is therefore important to remove any original classification from the data set before clustering. For example, the Species variable must be removed from the iris data set because species is a classification. You may then merge the original Species variable back in and compare the newly formed clusters against the original classification to see how they differ.
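The remove, cluster, and merge-back workflow described above could be sketched roughly as follows. This is only an illustration, not a required solution: it assumes the built-in iris data set, and the choice of three clusters, the seed, and the variable names (`iris_num`, `iris_clustered`, `comparison`) are placeholders chosen for the example.

```r
# Illustrative sketch only: iris, k = 3, and all object names are assumptions.
data(iris)
set.seed(42)  # k-means uses random starting centers, so fix the seed

# Step 1: drop the original classification (Species) before clustering.
iris_num <- iris[, setdiff(names(iris), "Species")]

# Step 2: cluster on the numeric variables only.
km <- kmeans(iris_num, centers = 3, nstart = 25)

# Step 3: merge the original Species variable back in next to the new clusters.
iris_clustered <- cbind(iris_num,
                        Species = iris$Species,
                        cluster = factor(km$cluster))

# Step 4: cross-tabulate the new clusters against the original classification.
comparison <- table(Species = iris_clustered$Species,
                    Cluster = iris_clustered$cluster)
print(comparison)
```

Rows of the table are the original species and columns are the new clusters, so a species spread across several columns is one the clustering did not recover cleanly.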
Complete the following steps and write a report to record your work, results and analysis.
- Install and load the *factoextra and **NbClust packages.
- Select an appropriate data set in R or the MASS library and use the sample(), ctree() and predict() functions to build a decision tree and plot it. You may also use one of the data sets (usoccupations, uscars, uspopulation) attached to this module for this assignment (you may import directly or convert to CSV first).
- Determine the appropriate number of clusters and perform k-means clustering. Explain your findings.
- Perform density-based clustering with DBSCAN, or use logistic regression to construct a binary classifier, and explain your findings.
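As a starting point for the decision tree step, the sketch below shows one way sample(), ctree() and predict() might fit together. It assumes the iris data set and the party package as the source of ctree() (the package is not named in the steps above, and partykit offers a similar function). The 70/30 split, the seed, and the object names are illustrative choices.

```r
# Illustrative sketch only. Assumes install.packages("party") has been run.
library(party)  # provides ctree()

data(iris)
set.seed(1234)

# sample() labels each row 1 (training) or 2 (test), roughly 70/30.
idx <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
train_data <- iris[idx == 1, ]
test_data  <- iris[idx == 2, ]

# Build a conditional inference tree on the training rows.
iris_tree <- ctree(Species ~ Sepal.Length + Sepal.Width +
                     Petal.Length + Petal.Width,
                   data = train_data)

# Plot the fitted tree.
plot(iris_tree)

# predict() classifies the held-out rows; compare with the true labels.
test_pred <- predict(iris_tree, newdata = test_data)
confusion <- table(Predicted = test_pred, Actual = test_data$Species)
print(confusion)

# Share of test rows on the diagonal of the confusion table.
accuracy <- sum(diag(confusion)) / sum(confusion)
print(accuracy)
```

Evaluating on rows the tree did not see during fitting gives a fairer picture of how well the classification generalizes than accuracy on the training rows.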
*The factoextra package is used to determine the optimal number of clusters for a given clustering method and for data visualization.
**The NbClust package provides 30 indices for determining the relevant number of clusters and proposes the best clustering scheme from the results obtained by varying all combinations of number of clusters, distance measures, and clustering methods. It can compute all the indices and determine the number of clusters in a single function call.
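The two packages described in these notes might be used together along the following lines to choose the number of clusters before running k-means. This is a hedged sketch: it assumes the iris data set with Species removed, standardized variables, Euclidean distance, and a search range of 2 to 8 clusters. These settings and the object names are assumptions for illustration.

```r
# Illustrative sketch only. Assumes factoextra and NbClust are installed.
library(factoextra)
library(NbClust)

data(iris)
set.seed(123)

# Remove the classification column and standardize the numeric variables.
iris_scaled <- scale(iris[, 1:4])

# factoextra: elbow (within-cluster sum of squares) and silhouette plots.
print(fviz_nbclust(iris_scaled, kmeans, method = "wss"))
print(fviz_nbclust(iris_scaled, kmeans, method = "silhouette"))

# NbClust: compute the indices in one call over k = 2..8.
nb <- NbClust(iris_scaled, distance = "euclidean",
              min.nc = 2, max.nc = 8,
              method = "kmeans", index = "all")

# Tally the number of clusters each index proposes; graphical indices
# report 0 and are dropped from the vote.
votes <- table(nb$Best.nc["Number_clusters", ])
votes <- votes[names(votes) != "0"]
best_k <- as.integer(names(votes)[which.max(votes)])
print(votes)

# Run k-means with the most frequently proposed k and visualize the clusters.
km <- kmeans(iris_scaled, centers = best_k, nstart = 25)
print(fviz_cluster(km, data = iris_scaled))
```

The indices do not always agree, so it is worth reporting the full vote table and the plots when explaining why a particular number of clusters was chosen.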
Your report should have a cover/title page and an introduction stating the goals of the project and the methods you use. It should follow APA format, contain at least 1000 words (excluding the title page and references page), and include a references page. In the body of the report, incorporate your R code and R output together with an interpretation of your results. Be sure to show all the elements of a formal hypothesis test, including the null and alternative hypotheses, the critical values, the calculation of the test statistic, and the p-value. Finally, draw well-supported conclusions from your results that show your understanding of the course material and its application to the data set.
Graphs, figures, charts, and tables help readers follow your results, so use them where they add clarity. Do your best to give insight into the project and end with a clear conclusion. Please also use subheadings to make your report more reader friendly.
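If you choose the logistic regression option, the hypothesis test elements listed above could be extracted along these lines. This is a sketch under stated assumptions: it uses the built-in mtcars data set, with transmission type (`am`) as the binary outcome and weight (`wt`) as the single predictor, a two-sided Wald test at a 5% significance level, and a 0.5 probability cutoff. None of these choices is prescribed by the assignment.

```r
# Illustrative sketch only: mtcars, am ~ wt, alpha = 0.05 are assumptions.
data(mtcars)

# Null hypothesis H0: the coefficient of wt is 0 (weight has no effect on am).
# Alternative H1: the coefficient of wt is not 0.
fit <- glm(am ~ wt, data = mtcars, family = binomial)
coefs <- summary(fit)$coefficients
print(coefs)

# Test statistic (Wald z) and p-value for wt, as reported by summary().
z_stat <- coefs["wt", "z value"]
p_val  <- coefs["wt", "Pr(>|z|)"]

# Two-sided critical value from the standard normal distribution.
alpha  <- 0.05
z_crit <- qnorm(1 - alpha / 2)

# Decision: reject H0 when |z| exceeds the critical value.
reject_null <- abs(z_stat) > z_crit
print(c(z = z_stat, critical = z_crit, p = p_val))
print(reject_null)

# Binary classification: predict am = 1 when the fitted probability exceeds 0.5.
pred_class <- as.integer(fitted(fit) > 0.5)
print(table(Predicted = pred_class, Actual = mtcars$am))
```

Comparing the test statistic with the critical value and comparing the p-value with the significance level lead to the same decision, so reporting both makes the reasoning easy to check.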