Data Preprocessing and Cleaning: MCQs Practice Set

Q.1 What is the main purpose of data preprocessing in data mining?

To store data efficiently
To clean and transform raw data into a suitable format
To visualize data
To perform predictive modeling directly
Explanation - Data preprocessing is used to clean, integrate, transform, and reduce raw data, making it suitable for mining tasks.
Correct answer is: To clean and transform raw data into a suitable format

Q.2 Which of the following is a common method for handling missing values?

Ignoring all missing values
Filling with mean, median, or mode
Duplicating missing values
Adding random noise
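Explanation - Missing values are often replaced with a statistic computed from the observed values: the mean or median for numeric attributes, or the mode for categorical ones. This lets incomplete records be kept rather than discarded.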
Correct answer is: Filling with mean, median, or mode
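Example - A minimal Python sketch of mean, median, and mode imputation using only the standard library. The function name impute, the use of None to mark a missing value, and the sample ages are assumptions made for illustration.

```python
from statistics import mean, median, mode

def impute(values, strategy="mean"):
    """Replace None entries with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    # Pick the summary statistic named by `strategy` and compute it once.
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

# Hypothetical sample: two ages are missing.
ages = [10, None, 30, 30, None, 50, 60]
filled_mean = impute(ages, "mean")
filled_median = impute(ages, "median")
filled_mode = impute(ages, "mode")
print(filled_mean)
print(filled_median)
print(filled_mode)
```

The statistic is computed from the observed values only, so the fill value does not depend on how many entries are missing.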

Q.3 Which technique is used to normalize data into a specific range?

Discretization
Data smoothing
Min-Max normalization
Dimensionality reduction
Explanation - Min-Max normalization scales data to a fixed range, usually [0,1], improving comparability between attributes.
Correct answer is: Min-Max normalization
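Example - A short Python sketch of Min-Max normalization, which maps each value v to new_min + (v - min) / (max - min) * (new_max - new_min). The function name min_max_scale and the income figures are illustrative assumptions.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute has no spread; map every value to new_min.
        return [new_min for _ in values]
    span = hi - lo
    return [new_min + (v - lo) / span * (new_max - new_min) for v in values]

# Hypothetical incomes rescaled to [0, 1].
incomes = [12000, 30000, 48000, 84000]
scaled = min_max_scale(incomes)
print(scaled)
```

Because the result depends on the observed minimum and maximum, a single extreme value can compress all other values into a narrow part of the range.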

Q.4 Data cleaning does NOT typically involve:

Handling missing values
Removing duplicate records
Creating predictive models
Correcting noisy data
Explanation - Data cleaning focuses on preparing data, whereas predictive modeling is a later stage in data mining.
Correct answer is: Creating predictive models

Q.5 Which of the following is an example of noise in data?

A missing value
A duplicate entry
An outlier due to measurement error
All of the above
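Explanation - Noise is random error or variance in a measured value, such as an outlier produced by a faulty measurement. Missing values and duplicate entries are separate data quality problems and are handled by their own cleaning steps.
Correct answer is: An outlier due to measurement error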

Q.6 What is the purpose of data integration in preprocessing?

Combining data from multiple sources
Normalizing values
Handling missing values
Removing outliers
Explanation - Data integration merges datasets from different sources to create a unified view for analysis.
Correct answer is: Combining data from multiple sources
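Example - A small Python sketch of integrating two sources on a shared key. The record layouts, the field names customer_id and cust_id, and the function name integrate are hypothetical; the point is that the two sources name the same key differently and must be matched before merging.

```python
def integrate(customers, orders):
    """Join order records to customer records on the customer key (inner join)."""
    by_id = {c["customer_id"]: c for c in customers}
    merged = []
    for order in orders:
        customer = by_id.get(order["cust_id"])  # the second source uses a different key name
        if customer is not None:
            merged.append({**customer, "order_total": order["total"]})
    return merged

# Hypothetical data from two separate systems.
customers = [{"customer_id": 1, "name": "Ana"}, {"customer_id": 2, "name": "Ben"}]
orders = [{"cust_id": 1, "total": 25.0}, {"cust_id": 3, "total": 10.0}, {"cust_id": 1, "total": 5.0}]
unified = integrate(customers, orders)
print(unified)
```

Orders whose key has no match in the customer source are dropped here; keeping them with empty customer fields would be an equally reasonable design.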

Q.7 Which of the following is a method for detecting outliers?

Z-score method
Principal Component Analysis
Data binning
Normalization
Explanation - The Z-score method identifies outliers by measuring how many standard deviations a data point is from the mean.
Correct answer is: Z-score method
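Example - A minimal Python sketch of Z-score outlier detection. The threshold of 2.5 and the sensor readings are assumptions; a cutoff of 3 is also common, but in a very small sample a single extreme value inflates the standard deviation, so a lower cutoff is used here.

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=3.0):
    """Return the values whose distance from the mean exceeds `threshold` standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, nothing can be an outlier
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical readings with one suspicious value.
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 95]
flagged = zscore_outliers(readings, threshold=2.5)
print(flagged)
```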

Q.8 Which technique reduces data volume while maintaining its integrity?

Data reduction
Data cleaning
Data transformation
Data smoothing
Explanation - Data reduction techniques reduce dataset size by aggregation, sampling, or dimensionality reduction while retaining key information.
Correct answer is: Data reduction
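Example - A Python sketch of data reduction by aggregation, one of the techniques named in the explanation. Daily records are rolled up into monthly totals; the function name aggregate_monthly, the ISO date strings, and the sales figures are illustrative assumptions.

```python
from collections import defaultdict

def aggregate_monthly(daily_sales):
    """Collapse (date, amount) pairs into one total per 'YYYY-MM' month."""
    totals = defaultdict(float)
    for date, amount in daily_sales:
        totals[date[:7]] += amount  # first 7 characters of 'YYYY-MM-DD' are the month
    return dict(totals)

# Hypothetical daily sales reduced from three rows to two.
daily_sales = [("2024-01-03", 120.0), ("2024-01-17", 80.0), ("2024-02-02", 50.0)]
monthly = aggregate_monthly(daily_sales)
print(monthly)
```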

Q.9 Discretization is used to:

Convert continuous attributes into discrete categories
Fill missing values
Remove noise
Detect outliers
Explanation - Discretization transforms continuous numeric data into categorical intervals to simplify analysis.
Correct answer is: Convert continuous attributes into discrete categories
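Example - A short Python sketch of discretization with fixed cut points. The cut points 18 and 65 and the labels are hypothetical; any sorted list of edges with one more label than edges would work the same way.

```python
from bisect import bisect_right

def discretize(value, edges, labels):
    """Map a numeric value to the label of the interval it falls in.

    `edges` must be sorted, and `labels` needs len(edges) + 1 entries.
    A value equal to an edge is placed in the interval to its right.
    """
    return labels[bisect_right(edges, value)]

# Hypothetical age bands.
edges = [18, 65]
labels = ["child", "adult", "senior"]
ages = [5, 18, 42, 64, 65, 80]
categories = [discretize(a, edges, labels) for a in ages]
print(categories)
```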

Q.10 Which preprocessing step improves the performance of machine learning algorithms?

Data cleaning
Data transformation
Data reduction
All of the above
Explanation - Cleaning, transforming, and reducing data ensures consistency, reduces noise, and improves algorithm efficiency.
Correct answer is: All of the above

Q.11 Binning is a technique used for:

Outlier detection
Data smoothing
Data integration
Data normalization
Explanation - Binning groups data into intervals and replaces values with summary statistics to reduce noise.
Correct answer is: Data smoothing
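Example - A minimal Python sketch of smoothing by bin means with equal-frequency bins. The bin size of 3, the function name, and the price values are assumptions for illustration.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of `bin_size`, and replace each value with its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        current_bin = ordered[start:start + bin_size]
        bin_mean = sum(current_bin) / len(current_bin)
        smoothed.extend([bin_mean] * len(current_bin))
    return smoothed

# Hypothetical prices, smoothed in bins of three.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed_prices = smooth_by_bin_means(prices, bin_size=3)
print(smoothed_prices)
```

Replacing values with the bin median or the nearest bin boundary are common variants of the same idea.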

Q.12 Which of the following is NOT a type of data cleaning?

Missing value handling
Noise reduction
Data compression
Duplicate elimination
Explanation - Data compression reduces storage size, not cleaning; cleaning focuses on correcting errors and inconsistencies.
Correct answer is: Data compression

Q.13 Why is attribute selection important in preprocessing?

To increase dataset size
To improve model accuracy and reduce complexity
To generate random data
To normalize values
Explanation - Selecting relevant attributes reduces noise and dimensionality, improving model performance and interpretability.
Correct answer is: To improve model accuracy and reduce complexity

Q.14 Which of the following normalization methods rescales data to have a mean of 0 and a standard deviation of 1?

Z-score normalization
Min-Max normalization
Decimal scaling
Binning
Explanation - Z-score normalization standardizes data by converting it into a distribution with mean 0 and standard deviation 1.
Correct answer is: Z-score normalization
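Example - A short Python sketch of Z-score normalization, z = (v - mean) / standard deviation. The function name and the sample scores are assumptions, and the population standard deviation is used here; using the sample standard deviation instead is an equally valid choice.

```python
from statistics import mean, pstdev

def zscore_normalize(values):
    """Standardize values so the result has mean 0 and population standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant attribute: no spread to scale by
    return [(v - mu) / sigma for v in values]

# Hypothetical scores with mean 5 and population standard deviation 2.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
standardized = zscore_normalize(scores)
print(standardized)
```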

Q.15 Data transformation can include:

Normalization
Aggregation
Generalization
All of the above
Explanation - Data transformation involves modifying data into suitable formats through normalization, aggregation, or generalization.
Correct answer is: All of the above

Q.16 Which of the following is a potential consequence of not handling missing data?

Reduced model accuracy
Incorrect analysis results
Biased predictions
All of the above
Explanation - Ignoring missing data can lead to inaccurate analysis, biased predictions, and reduced reliability of results.
Correct answer is: All of the above

Q.17 Which of the following describes a duplicate record?

A record with missing values
A record identical to another in all or most attributes
A record with noise
A record with normalized values
Explanation - Duplicate records occur when the same data is recorded multiple times and must be removed during cleaning.
Correct answer is: A record identical to another in all or most attributes
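Example - A minimal Python sketch of duplicate elimination. Records are compared on a chosen set of fields after trimming whitespace and ignoring case, which is one simple way to catch records that match in "most" attributes. The field names and sample records are hypothetical.

```python
def drop_duplicates(records, key_fields):
    """Keep the first record for each distinct combination of the key fields."""
    seen = set()
    unique = []
    for record in records:
        # Normalize the key so trivial formatting differences do not hide a duplicate.
        key = tuple(str(record[f]).strip().lower() for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

# Hypothetical customer list; the second record repeats the first with different formatting.
records = [
    {"name": "Ann Lee", "email": "ann@example.com"},
    {"name": " ann lee ", "email": "ANN@example.com"},
    {"name": "Bo Chen", "email": "bo@example.com"},
]
deduplicated = drop_duplicates(records, key_fields=("name", "email"))
print(deduplicated)
```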

Q.18 Which of the following is an example of data inconsistency?

Date format differences across records
Missing attribute values
Presence of outliers
All of the above
Explanation - Inconsistent data formats, such as varying date representations, can cause integration and analysis errors.
Correct answer is: Date format differences across records
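Example - A Python sketch of resolving date format inconsistencies by converting every value to ISO 8601 (YYYY-MM-DD). The list of accepted formats is an assumption, including the choice to read slash-separated dates as day first; a real dataset would need its own list.

```python
from datetime import datetime

# Assumed input formats, tried in order. Slash dates are treated as day/month/year.
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")

def to_iso_date(text):
    """Parse a date written in any of KNOWN_FORMATS and return it as 'YYYY-MM-DD'."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # not this format, try the next one
    raise ValueError(f"unrecognized date format: {text!r}")

# Hypothetical records that all describe the same day.
raw_dates = ["2024-03-05", "05/03/2024", " 05.03.2024 "]
clean_dates = [to_iso_date(d) for d in raw_dates]
print(clean_dates)
```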

Q.19 Smoothing techniques help in:

Reducing noise
Filling missing values
Detecting duplicates
Scaling data
Explanation - Smoothing techniques, such as moving averages or binning, reduce random variability and noise in data.
Correct answer is: Reducing noise
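Example - A short Python sketch of a simple moving average, one of the smoothing techniques mentioned in the explanation. The window size of 3 and the signal values are illustrative assumptions.

```python
def moving_average(values, window=3):
    """Return the mean of each run of `window` consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

# Hypothetical noisy signal.
signal = [3, 6, 3, 9, 6, 3]
smoothed_signal = moving_average(signal, window=3)
print(smoothed_signal)
```

The output is shorter than the input by window - 1 values because only full windows are averaged.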

Q.20 Which method of handling missing data replaces missing values using predicted values?

Listwise deletion
Mean/Median imputation
Regression imputation
Ignoring missing values
Explanation - Regression imputation predicts missing values using other related attributes to provide more accurate replacements.
Correct answer is: Regression imputation
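Example - A minimal Python sketch of regression imputation with a single predictor, fitted by ordinary least squares. The attribute names, the sample figures, and the choice of a straight-line model are assumptions; more predictors or a different model could be used in the same way.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def regression_impute(xs, ys):
    """Fill None entries in ys with predictions from a line fitted on the complete pairs."""
    known = [(x, y) for x, y in zip(xs, ys) if y is not None]
    slope, intercept = fit_line([x for x, _ in known], [y for _, y in known])
    return [slope * x + intercept if y is None else y for x, y in zip(xs, ys)]

# Hypothetical data: one salary (in thousands) is missing.
years_experience = [1, 2, 3, 4, 5]
salary = [30, 35, None, 45, 50]
filled_salary = regression_impute(years_experience, salary)
print(filled_salary)
```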

Q.21 Generalization in data preprocessing refers to:

Replacing detailed data with higher-level concepts
Filling missing values
Reducing duplicates
Scaling numeric data
Explanation - Generalization replaces detailed values with broader categories, simplifying analysis while preserving essential patterns.
Correct answer is: Replacing detailed data with higher-level concepts
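Example - A small Python sketch of generalization using a concept hierarchy that maps cities to countries. The mapping, the function name, and the "Unknown" fallback are hypothetical choices for illustration.

```python
# Hypothetical one-level concept hierarchy: city -> country.
CITY_TO_COUNTRY = {
    "Paris": "France",
    "Lyon": "France",
    "Berlin": "Germany",
}

def generalize(values, hierarchy, default="Unknown"):
    """Replace each detailed value with its higher-level concept from the hierarchy."""
    return [hierarchy.get(v, default) for v in values]

cities = ["Paris", "Berlin", "Lyon", "Oslo"]
countries = generalize(cities, CITY_TO_COUNTRY)
print(countries)
```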

Q.22 Which preprocessing technique helps in detecting patterns by grouping similar data?

Binning
Clustering
Normalization
Regression
Explanation - Clustering groups similar data points so that patterns, and values that fall outside every group, become visible during preprocessing.
Correct answer is: Clustering

Q.23 Which of the following is NOT a data reduction technique?

Dimensionality reduction
Data compression
Data smoothing
Aggregation
Explanation - Data smoothing reduces noise but does not reduce data size; reduction techniques include dimensionality reduction, compression, and aggregation.
Correct answer is: Data smoothing

Q.24 Which preprocessing step is critical before data integration from multiple sources?

Data cleaning
Normalization
Duplicate elimination
All of the above
Explanation - Data cleaning, normalization, and removing duplicates are essential to ensure consistency and correctness before integrating data.
Correct answer is: All of the above