Data Mining Methodology

• Assembling Customer Signatures
• Creating a Balanced Sample
• Including Multiple Timeframes
• Creating a Model Set for Prediction
• Partitioning the Model Set: training set, validation set, test set
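Partitioning the model set can be sketched as follows. This is a minimal illustration, not a prescribed procedure: the function name `partition_model_set`, the 60/20/20 split, and the integer customer IDs are all assumptions made for the example.

```python
import random


def partition_model_set(rows, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle rows and split them into training, validation, and test sets.

    The fractions are illustrative defaults; whatever is left after the
    training and validation sets becomes the test set.
    """
    if train_frac <= 0 or valid_frac < 0 or train_frac + valid_frac >= 1:
        raise ValueError("fractions must leave room for a test set")
    shuffled = list(rows)
    # A seeded generator keeps the partition reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


customers = list(range(100))  # hypothetical customer IDs
train, valid, test = partition_model_set(customers)
```

The three sets are disjoint by construction, so the test set stays unseen while models are trained and tuned.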
Step Three: Get to Know the Data
• Examine Distributions
• Compare Values with Descriptions
• Validate Assumptions
• Ask Lots of Questions
Step Four: Create a Model Set
Convert Counts to Proportions
Chart comparing count of houses with bad plumbing to prevalence of heating with wood.
Chart comparing proportion of houses with bad plumbing to prevalence of heating with wood.
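The difference between the two charts can be illustrated with a short sketch. The town names and all figures below are made up for the example, and `counts_to_proportions` is a hypothetical helper: dividing each count by the size of its group removes the effect of group size, so large and small towns become comparable.

```python
def counts_to_proportions(counts, totals):
    """Divide each count by the total for the same key."""
    return {key: counts[key] / totals[key] for key in counts}


# Hypothetical figures: houses with bad plumbing and total houses per town.
bad_plumbing = {"town_a": 50, "town_b": 30}
total_houses = {"town_a": 1000, "town_b": 100}

proportions = counts_to_proportions(bad_plumbing, total_houses)
```

In this made-up data, town_a has the larger count but town_b has the much larger proportion, which is the kind of reversal a count-based chart hides.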
Step Five: Fix Problems with the Data
• Categorical Variables with Too Many Values
• Numeric Variables with Skewed Distributions and Outliers
• Missing Values
• Values with Meanings That Change over Time
• Inconsistent Data Encoding
Step One: Translate the Business Problem into a Data Mining Problem
• What Does a Data Mining Problem Look Like?
• How Will the Results Be Used?
• How Will the Results Be Delivered?
• The Role of Business Users and Information Technology
Step Eleven: Begin Again
Question
Answer
• Learning things that aren’t true
• Learning things that are true, but not useful
Learning things that aren’t true
• Patterns May Not Represent Any Underlying Rule
• The Model Set May Not Reflect the Relevant Population
• Data May Be at the Wrong Level of Detail
Step Seven: Build Models
• Directed data mining
• Undirected data mining
Step Eight: Assess Models
• Assessing Descriptive Models: minimum description length
• Assessing Directed Models: accuracy on previously unseen data
• Assessing Classifiers and Predictors: error rate
• Assessing Estimators: difference between the predicted score and the actual result
• Comparing Models Using Lift
• Problems with Lift
Data Mining Methodology and Best Practices
• Why Have a Methodology?
• Hypothesis Testing
• Models, Profiling, and Prediction
• The Methodology
Why Have a Methodology?
Step Two: Select Appropriate Data
• What Is Available?
• How Much Data Is Enough?
• How Much History Is Required?
• How Many Variables?
• What Must the Data Contain?
Learning things that are true, but not useful
• Learning Things That Are Already Known
• Learning Things That Can’t Be Used
Hypothesis Testing
• Generating Hypotheses
• Testing Hypotheses
Assessing Estimators
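An estimator is judged by how far its predicted scores fall from the actual results. Two common summaries of that difference are sketched below, the mean absolute error and the root mean squared error; choosing these two measures is an assumption of this example, and the sample numbers are invented.

```python
import math


def mean_absolute_error(predicted, actual):
    """Average size of the prediction errors, ignoring their sign."""
    errors = [p - a for p, a in zip(predicted, actual)]
    return sum(abs(e) for e in errors) / len(errors)


def root_mean_squared_error(predicted, actual):
    """Square root of the average squared error; penalizes large misses more."""
    errors = [p - a for p, a in zip(predicted, actual)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


predicted = [3.0, 5.0, 7.0]  # hypothetical estimates
actual = [2.0, 5.0, 9.0]     # hypothetical observed values

mae = mean_absolute_error(predicted, actual)
rmse = root_mean_squared_error(predicted, actual)
```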
Lift
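Lift is commonly computed as the response rate among the records a model ranks highest, divided by the response rate in the whole population. The sketch below assumes binary outcomes (1 for a responder, 0 otherwise); the function name `lift_at` and the sample scores are hypothetical.

```python
def lift_at(scores, outcomes, fraction):
    """Lift of the top `fraction` of records when ranked by model score."""
    overall_rate = sum(outcomes) / len(outcomes)
    if overall_rate == 0:
        raise ValueError("lift is undefined when there are no responders")
    # Rank records from highest to lowest score.
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(len(ranked) * fraction))
    top_rate = sum(outcome for _, outcome in ranked[:n_top]) / n_top
    return top_rate / overall_rate


scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
outcomes = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

lift_top_half = lift_at(scores, outcomes, 0.5)
```

Here the top half contains 3 responders out of 5 (60%) against 30% overall, so the lift is 2. A lift of 1 means the model does no better than random selection.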
Step Nine: Deploy Models
data mining environment
scoring environment
Step Ten: Assess Results
cumulative response chart
cumulative profit chart
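The points behind both charts can be computed from scored records, as in the sketch below. It assumes binary outcomes, a fixed revenue per response, and a fixed cost per contact; those two money parameters, the function names, and the sample data are assumptions for illustration only.

```python
def rank_outcomes(scores, outcomes):
    """Return outcomes ordered from highest to lowest model score."""
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    return [outcome for _, outcome in ranked]


def cumulative_response(scores, outcomes):
    """Pairs of (fraction contacted, fraction of all responders captured)."""
    ranked = rank_outcomes(scores, outcomes)
    total = sum(ranked)
    if total == 0:
        raise ValueError("no responders in the data")
    points, captured = [], 0
    for i, outcome in enumerate(ranked, start=1):
        captured += outcome
        points.append((i / len(ranked), captured / total))
    return points


def cumulative_profit(scores, outcomes, revenue_per_response, cost_per_contact):
    """Running profit as records are contacted in score order."""
    profits, running = [], 0
    for outcome in rank_outcomes(scores, outcomes):
        running += outcome * revenue_per_response - cost_per_contact
        profits.append(running)
    return profits


scores = [0.9, 0.8, 0.7, 0.6]
outcomes = [1, 0, 1, 0]
response_points = cumulative_response(scores, outcomes)
profit_points = cumulative_profit(scores, outcomes, revenue_per_response=10, cost_per_contact=2)
```

Under these assumptions the profit curve peaks before the full list is contacted, which is what makes it useful for choosing a cutoff.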
Models, Profiling, and Prediction
• Models
• Profiling and Prediction
Models
• Profiling and Prediction
The Methodology
1. Translate the business problem into a data mining problem.
2. Select appropriate data.
3. Get to know the data.
4. Create a model set.
5. Fix problems with the data.
6. Transform data to bring information to the surface.
7. Build models.
8. Assess models.
9. Deploy models.
10. Assess results.
11. Begin again.
Step Six: Transform Data to Bring Information to the Surface
• Capture Trends
• Create Ratios and Other Combinations of Variables
• Convert Counts to Proportions
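Two of these transformations can be sketched in a few lines. Using the slope of a least-squares line as the trend value is one possible choice, not necessarily the one intended here, and the field names and figures are hypothetical.

```python
def trend_slope(values):
    """Slope of the least-squares line through values taken at periods 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two periods to measure a trend")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    numerator = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator


def ratio(numerator, denominator, default=0.0):
    """Ratio of two variables, with a fallback when the denominator is zero."""
    return numerator / denominator if denominator else default


monthly_spend = [10, 12, 14, 16]  # hypothetical spend over four months
number_of_calls = 13              # hypothetical call count over the same months

spend_trend = trend_slope(monthly_spend)
spend_per_call = ratio(sum(monthly_spend), number_of_calls)
```

A single slope value summarizes a whole series of monthly columns, and a ratio such as spend per call brings out a relationship that neither variable shows alone.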