Data Preparation for Data Mining, Edition 1
By Dorian Pyle

Publication Date: 22 Mar 1999

Data Preparation for Data Mining addresses an issue unfortunately ignored by most authorities on data mining: data preparation. Thanks largely to its perceived difficulty, data preparation has traditionally taken a backseat to the more alluring question of how best to extract meaningful knowledge. But without adequate preparation of your data, the return on the resources invested in mining is certain to be disappointing.

Dorian Pyle corrects this imbalance. A twenty-five-year veteran of what has become the data mining industry, Pyle shares his own successful data preparation methodology, offering both a conceptual overview for managers and complete technical details for IT professionals. Apply his techniques and watch your mining efforts pay off in the form of improved performance, reduced distortion, and more valuable results.

On the enclosed CD-ROM, you'll find a suite of programs as C source code and compiled into a command-line-driven toolkit. This code illustrates how the author's techniques can be applied to arrive at an automated preparation solution that works for you. Also included are demonstration versions of three commercial products that help with data preparation, along with sample data with which you can practice and experiment.

Key Features

* Offers in-depth coverage of an essential but largely ignored subject.
* Goes far beyond theory, leading you step by step through the author's own data preparation techniques.
* Provides practical illustrations of the author's methodology using realistic sample data sets.
* Includes algorithms you can apply directly to your own project, along with instructions for understanding when automation is possible and when greater intervention is required.
* Explains how to identify and correct data problems that may be present in your application.
* Prepares miners, helping them head into preparation with a better understanding of data sets and their limitations.
About the author
By Dorian Pyle, Chief Scientist and Founder of PTI, Leominster, MA, USA
Table of Contents
Preface
Introduction
Chapter 1 Data Exploration As a Process
1.1 The Data Exploration Process
1.1.1 Stage 1: Exploring the Problem Space
1.1.2 Stage 2: Exploring the Solution Space
1.1.3 Stage 3: Specifying the Implementation Method
1.1.4 Stage 4: Mining the Data
1.1.5 Exploration: Mining and Modeling
1.2 Data Mining, Modeling and Modeling Tools
1.2.1 Ten Golden Rules
1.2.2 Introducing Modeling Tools
1.2.3 Types of Models
1.2.4 Active and Passive Models
1.2.5 Explanatory and Predictive Models
1.2.6 Static and Continuously-learning Models
1.3 Summary
Supplemental Material
Chapter 2 The Nature of the World and Its Impact on Data Preparation
2.1 Measuring the World
2.1.1 Objects
2.1.2 Capturing Measurements
2.1.3 Errors of Measurement
2.1.4 Tying Measurements to the Real World
2.2 Types of Measurements
2.2.1 Scalar Measurements
2.2.2 Non-scalar Measurements
2.3 Continua of Attributes of Variables
2.3.1 The Qualitative - Quantitative Continuum
2.3.2 The Discrete - Continuous Continuum
2.4 Scale Measurement Example
2.5 Transformations and Difficulties - Variables, Data, and Information
2.6 Building Mineable Data Representations
2.6.1 Data Representation
2.6.2 Building Data - Dealing with Variables
2.6.3 Building Mineable Data Sets
2.7 Summary
Supplemental Material
Chapter 3 Data Preparation as a Process
3.1 Data Preparation: Inputs, Outputs, Models, and Decisions
3.1.1 Step 1: Prepare the Data
3.1.2 Step 2: Survey the Data
3.1.3 Step 3: Model the Data
3.1.4 Use the Model
3.2 Modeling Tools and Data Preparation
3.2.1 How Modeling Tools Drive Data Preparation
3.2.2 Decision Trees
3.2.3 Decision Lists
3.2.4 Neural Networks
3.2.5 Evolution Programs
3.2.6 Modeling Data with the Tools
3.2.7 Predictions and Rules
3.2.8 Choosing Techniques
3.3 Stages of Data Preparation
3.3.1 Stage 1: Data Access
3.3.2 Stage 2: Data Audit
3.3.3 Stage 3: Enhancing and Enriching the Data
3.3.4 Stage 4: Sampling Bias
3.3.5 Stage 5: Data Structure (Super, Macro and Micro)
3.3.6 Stage 6: Building the PIE
3.3.7 Stage 7: Surveying the Data
3.3.8 Stage 8: Modeling
3.4 And the result is . . .?
Chapter 4 Getting the Data: Basic Preparation
4.1 Data Discovery
4.1.1 Data Access Issues
4.2 Data Characterization
4.2.1 Detail / Aggregation Level (Granularity)
4.2.2 Consistency
4.2.3 Pollution
4.2.4 Objects
4.2.5 Relationship
4.2.6 Domain
4.2.7 Defaults
4.2.8 Integrity
4.2.9 Concurrency
4.2.10 Duplicate or Redundant Variables
4.3 Assembling the Data Set
4.3.1 Reverse Pivoting
4.3.2 Feature Extraction
4.3.3 Physical or Behavioral Data Sets
4.3.4 Explanatory Structure
4.3.5 Data Enhancement or Enrichment
4.3.6 Sampling Bias
4.4 Example 1: CREDIT
4.4.1 Looking at the Variables
4.4.2 Relationships between Variables
4.5 Example 2: SHOE
4.5.1 Looking At the Variables
4.5.2 Relationship between Variables
4.6 The Data Assay
Chapter 5 Sampling, Variability and Confidence
5.1 Sampling or First catch your hare!
5.1.1 How much Data?
5.1.2 Variability
5.1.3 Converging on a Representative Sample
5.1.4 Measuring Variability
5.1.5 Variability and Deviation
5.2 Confidence
5.3 Variability of Numeric Variables
5.3.1 Variability and Sampling
5.3.2 Variability and Convergence
5.4 Variability and Confidence in Alpha Variables
5.4.1 Ordering and Rate of Discovery
5.5 Measuring Confidence
5.5.1 Modeling and Confidence with the Whole Population
5.5.2 Testing for Confidence
5.5.3 Confidence Tests and Variability
5.6 Confidence in Capturing Variability
5.6.1 A brief introduction to the Normal Distribution
5.6.2 Normally Distributed Probabilities
5.6.3 Capturing Normally Distributed Probabilities; an Example
5.6.4 Capturing Confidence, Capturing Variance
5.7 Problems and Shortcomings of Taking Samples using Variability
5.7.1 Missing Values
5.7.2 Constants (variables with only one value)
5.7.3 Problems with Sampling
5.7.4 Monotonic Variable Detection
5.7.5 Interstitial Linearity
5.7.6 Rate of Discovery
5.8 Confidence and Instance Count
5.9 Summary
Supplemental Material
Chapter 6 Handling Non-Numerical Variables
6.1 Representing Alphas and Remapping
6.1.1 One-of-n remapping
6.1.2 M-of-n remapping
6.1.3 Remapping to eliminate Ordering
6.1.4 Remapping one-to-many patterns, or ill-formed problems
6.1.5 Remapping Circular Discontinuity
6.2 State Space
6.2.1 Unit State Space
6.2.2 Pythagoras in State Space
6.2.3 Position in State Space
6.2.4 Neighbors and Associates
6.2.5 Density and Sparsity
6.2.6 Nearby and Distant Nearest Neighbors
6.2.7 Normalizing Measured Point Separation
6.2.8 Contours, Peaks and Valleys
6.2.9 Mapping State Space
6.2.10 Objects in State Space
6.2.11 Phase Space
6.2.12 Mapping Alpha Values
6.2.13 Location, location, location!
6.2.14 Numerics, Alphas and the Montreal Canadiens
6.3 Joint Distribution Tables
6.3.1 Two-way Tables
6.3.2 More values, more variables and meaning of the numeration
6.3.3 Dealing with low-frequency alpha labels, and other problems
6.4 Dimensionality
6.4.1 Multi-Dimensional Scaling
6.4.2 Squashing a Triangle
6.4.3 Projecting Alpha Values
6.4.4 Scree Plots
6.5 Practical Consideration - Implementing Alpha Numeration in the Demonstration Code
6.5.1 Implementing Neighborhoods
6.5.2 Implementing Numeration in all Alpha Data Sets
6.5.3 Implementing Dimensionality reduction for Variables
6.6 Summary
Chapter 7 Normalizing and Redistributing Variables
7.1 Normalizing a Variable's Range
7.1.1 Review of Data Preparation and Modeling (Training, Testing and Execution)
7.1.2 The Nature and Scope of the Out-of-Range Values Problem
7.1.3 Discovering the Range of Values When Building the PIE
7.1.4 Out-of-Range Values When Training
7.1.5 Out-of-Range Values When Testing
7.1.6 Out-of-Range Values When Executing
7.1.7 Scaling Transformations
7.1.8 Softmax Scaling
7.1.9 Normalizing Ranges
7.2 Redistributing Variable Values
7.2.1 The Nature of Distributions
7.2.2 Distributive Difficulties
7.2.3 Adjusting Distributions
7.2.4 Modified Distributions
7.3 Summary
Supplemental Material
Chapter 8 Replacing Missing and Empty Values
8.1 Retaining Information about Missing Values
8.1.1 Missing Value Patterns
8.1.2 Capturing Patterns
8.2 Replacing Missing Values
8.2.1 Unbiased Estimators
8.2.2 Variability Relationships
8.2.3 Relationships between Variables
8.2.4 Preserving Between Variable Relationships
8.3 Summary
Supplemental Material
Chapter 9 Series Variables
9.1 Here there be Dragons!
9.2 Types of Series
9.3 Describing Series Data
9.3.1 Constructing a Series
9.3.2 Features of a Series
9.3.3 Describing a Series - Fourier
9.3.4 Describing a Series - Spectrum
9.3.5 Describing a Series - Trend, Season, Cycles, Noise
9.3.6 Describing a Series - Autocorrelation
9.4 Modeling Series Data
9.5 Repairing Series Data Problems
9.5.1 Missing values
9.5.2 Outliers
9.5.3 Non-Uniform Displacement
9.5.4 Trend
9.6 Tools
9.6.1 Filtering
9.6.2 Moving Averages
9.6.3 Smoothing 1 - PVM Smoothing
9.6.4 Smoothing 2 - Median Smoothing, Resmoothing and Hanning
9.6.5 Extraction
9.6.6 Differencing
9.7 Other Problems
9.7.1 Numerating Alpha Values
9.7.2 Distribution
9.8 Preparing Series Data
9.8.1 Looking at the Data
9.8.2 Signposts on the Rocky Road
9.9 Implementation Notes
Chapter 10 Preparing the Data Set
10.1 Using Sparsely Populated Variables
10.1.1 Increasing Information Density using Sparsely Populated Variables
10.1.2 Binning Sparse Numerical Values
10.1.3 Present Value Patterns (PVPs)
10.2 Problems with High Dimensionality Data Sets
10.2.1 Information Representation
10.2.2 Representing High Dimensionality Data in Less Dimensions
10.3 Introducing the Neural Network
10.3.1 Training a Neural Network
10.3.2 Neurons
10.3.3 Reshaping the Logistic Curve
10.3.4 Single Input Neurons
10.3.5 Multiple Input Neurons
10.3.6 Networking Neurons to Estimate a Function
10.3.7 Network Learning
10.3.8 Network Prediction - Hidden Layer
10.3.9 Network Prediction - Output Layer
10.3.10 Stochastic Network Performance
10.3.11 Network Architecture 1 - The Autoassociative Network
10.3.12 Network Architecture 2 - The Sparsely Connected Network
10.4 Compressing Variables
10.4.1 Using Compressed Dimensionality Data
10.5 Removing Variables
10.5.1 Estimating Variable Importance 1: What Doesn't Work
10.5.2 Estimating Variable Importance 2: Clues
10.5.3 Estimating Variable Importance 3: Configuring and Training the Network
10.6 How Much Data is Enough?
10.6.1 Joint Distribution
10.6.2 Capturing Joint Variability
10.6.3 Degrees of Freedom
10.7 Beyond Joint Distribution
10.7.1 Enhancing the Data Set
10.7.2 Data Sets in Perspective
10.8 Implementation Notes
10.8.1 Collapsing Extremely Sparsely Populated Variables
10.8.2 Reducing Excessive Dimensionality
10.8.3 Measuring Variable Importance
10.8.4 Feature Enhancement
10.9 Where Next?
Chapter 11 The Data Survey
11.1 Introduction to the Data Survey
11.2 Information and Communication
11.2.1 Measuring Information: Signals and Dictionaries
11.2.2 Measuring Information: Signals
11.2.3 Measuring Information: Bits of Information
11.2.4 Measuring Information: Surprise
11.2.5 Measuring Information: Entropy
11.2.6 Measuring Information: Dictionaries
11.3 Mapping using Entropy
11.3.1 Whole Data Set Entropy
11.3.2 Conditional Entropy between Inputs and Outputs
11.3.3 Mutual Information
11.3.4 Other Survey Uses for Entropy and Information
11.3.5 Looking for Information
11.4 Identifying Problems with a Data Survey
11.4.1 Confidence and Sufficient Data
11.4.2 Detecting Sparsity
11.4.3 Manifold Definition
11.5 Clusters
11.6 Sampling Bias
11.7 Making the Data Survey
11.8 Other Directions
Supplemental Material
Chapter 12 Using Prepared Data
12.1 Modeling Data
12.1.1 Assumptions
12.1.2 Models
12.1.2 Data Mining versus Exploratory Data Analysis
12.2 Modeling Data
12.2.1 Decision Trees
12.2.2 Clusters
12.2.3 Nearest Neighbor
12.2.4 Neural Networks and Regression
12.3 Prepared Data and Modeling Algorithms
12.3.1 Neural Networks and the CREDIT Data Set
12.3.2 Decision Trees and the CREDIT Data Set
12.4 Practically using Data Preparation and Prepared Data
12.5 Looking at Present Modeling Tools, and Future Directions
12.5.1 Near Future
12.5.2 Farther out
Appendix A Using the Demonstration Code on the CD
Appendix B Further Reading
Index
Book details
ISBN: 9781558605299
Page Count: 560
Retail Price: £59.99
This book is aimed at any potential or current practitioner of data mining, including IT managers, consultants, and database administrators/data warehouse professionals.