Data preprocessing is an essential step in knowledge discovery projects.
Experts affirm that preprocessing tasks take between 50% and 70% of the total time of the knowledge discovery process.
Data collected directly from the owner (primary collection), such as through a survey, tend to need a more robust data quality program because no one else has cleaned the data.
Data obtained from a vendor or organization (secondary collection), such as the government, tend to need less cleaning to be usable because the data have already gone through a data quality process.
Human error: possibly the most common problems are typos, recording data in the wrong column or row, truncation, transposed values, invalid values, and incorrect formats. Human error also includes not reading instructions or definitions. Possible mitigation strategies include: 1) making it easy to provide accurate data, with clear instructions and a user-friendly interface; 2) explaining how the esoteric benefits help the provider; and 3) providing data or resulting analysis that is useful to the provider.

Failure to ensure high data quality in the preprocessing stage will significantly reduce the accuracy of any data analytics project; in this sense, several authors consider data cleaning one of the most cumbersome and critical tasks. In this paper, we propose DQF4CT, a framework to address data quality issues in classification tasks. Our approach is composed of: (i) a conceptual framework that guides the user in dealing with data problems in classification tasks; and (ii) an ontology that represents data cleaning knowledge and suggests appropriate data cleaning approaches. We present two case studies on real datasets: physical activity monitoring (PAM) and occupancy detection of an office room (OD). To evaluate our proposal, the datasets cleaned by DQF4CT were used to train the same algorithms used in classification tasks by the authors of PAM and OD.

Quality assurance edits, used to check for both completeness and accuracy, can be broken into three categories: validity, reasonableness, and warning. Format edits are used to reject data that do not conform to the specified format, such as text in a date or numerical field, or an email address without an @ symbol. A format edit can also check the validity of content (e.g., 'FP' isn't valid in U.). Reasonableness edits look for information that is highly unlikely or an extreme outlier, although there are rare instances in which it might be possible. Reasonableness edits don't generally cause a data submission to be rejected, but they may require an explanation.
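The distinction between rejecting edits and flagging edits can be made concrete in code. The following is a minimal Python sketch, not part of DQF4CT itself: the field names (`email`, `weight_kg`), the email pattern, and the outlier thresholds are illustrative assumptions chosen only to show the two behaviors, a validity edit that rejects malformed data outright and a reasonableness edit that flags an extreme but possible value for explanation.

```python
import re

# Loose illustrative email check: something@something.something
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validity_edits(record):
    """Format/validity edits: hard rules; a violation rejects the record."""
    errors = []
    if not EMAIL_RE.match(record.get("email", "")):
        errors.append("email: missing or malformed (no @ / domain)")
    try:
        float(record.get("weight_kg", ""))
    except (TypeError, ValueError):
        errors.append("weight_kg: text in a numerical field")
    return errors

def reasonableness_edits(record):
    """Soft rules: flag highly unlikely values, but do not reject them."""
    flags = []
    try:
        weight = float(record["weight_kg"])
        # Illustrative threshold: extreme outlier, yet not impossible,
        # so the submission stands but an explanation is requested.
        if not 30 <= weight <= 250:
            flags.append(f"weight_kg={weight}: extreme outlier, explanation required")
    except (KeyError, ValueError):
        pass  # already caught by the validity edits
    return flags

record = {"email": "jane.doe", "weight_kg": "412"}
print(validity_edits(record))        # malformed email -> reject
print(reasonableness_edits(record))  # unlikely weight -> flag only
```

In this sketch the malformed email fails a validity edit and would cause rejection, while the weight of 412 kg passes the format check (it is a valid number) and is only flagged by the reasonableness edit, mirroring the rejection-versus-explanation behavior described above.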