Data cleansing
5 steps to cleanse data
Remove duplicate or irrelevant observations
Because combining datasets can introduce duplicate records, de-duplication is an important part of this process. Removing irrelevant observations, such as data that does not relate to the specific problem being analysed, also makes analysis more efficient.
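The following is a minimal sketch of this step, assuming the data sits in a pandas DataFrame; the dataset, column names, and the "UK customers only" relevance rule are illustrative assumptions, not part of the original article.

```python
import pandas as pd

# Hypothetical result of combining two customer exports.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "country":     ["UK", "US", "US", "FR"],
    "spend":       [120.0, 80.0, 80.0, 55.0],
})

# Remove exact duplicate observations introduced by combining datasets.
df = df.drop_duplicates()

# Remove irrelevant observations, e.g. if the analysis only concerns UK customers.
df = df[df["country"] == "UK"]

print(df)
```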
Fix structural errors
Structural errors may occur when you measure or transfer data. For example, there may be strange naming conventions, typos, or incorrect capitalisation. It's important to fix these errors, as such inconsistencies can result in mislabelled categories or classes.
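As a sketch of what fixing structural errors can look like in practice, the example below normalises capitalisation and whitespace and corrects a typo in a category column; the labels and the corrections mapping are made up for illustration, and pandas is an assumed tool.

```python
import pandas as pd

# Inconsistent capitalisation and a typo split one class into several.
df = pd.DataFrame({"status": ["Active", "active ", "ACTIVE", "inactve", "Inactive"]})

# Normalise whitespace and capitalisation so equivalent labels match.
df["status"] = df["status"].str.strip().str.lower()

# Map known typos or variant spellings onto a canonical label.
corrections = {"inactve": "inactive"}
df["status"] = df["status"].replace(corrections)

print(df["status"].value_counts())
```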
Filter unwanted outliers
Removing an outlier that is irrelevant to the analysis or is a mistake, such as improper data entry, can improve the quality of the data you are working with. However, not all outliers are incorrect, and some could even help prove a theory. That's why this step is about determining the validity of each unusual value rather than removing outliers wholesale.
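One common rule of thumb for flagging candidate outliers is the 1.5 × interquartile-range rule; the sketch below applies it with pandas to a made-up numeric column and flags values for review rather than deleting them outright, in keeping with the point that validity should be checked first. The column name and values are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({"order_value": [21.0, 19.5, 22.3, 20.1, 950.0, 18.7]})

q1 = df["order_value"].quantile(0.25)
q3 = df["order_value"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag candidates first so they can be reviewed before any removal.
df["is_outlier"] = ~df["order_value"].between(lower, upper)
print(df)

# Only drop the flagged rows once you have confirmed they are errors.
cleaned = df[~df["is_outlier"]].drop(columns="is_outlier")
```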
Handle missing data
Missing data shouldn't be ignored, as many algorithms don't accept missing values. Here are some ways to deal with it:
First option: you can drop observations with missing values, but doing so loses information, so be mindful of this before removing them.
Second option: you can impute missing values based on other observations, but this risks compromising the integrity of the data because you are operating from assumptions.
Third option: you might alter how the data is used so that null values are handled effectively. A sketch of these three options follows below.
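Here is a minimal sketch of the three options, assuming a small pandas DataFrame with gaps; the columns, the median as the imputation choice, and the per-column mean as the adapted use are all illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [34, np.nan, 29, 41],
    "income": [52000, 48000, np.nan, 61000],
})

# Option 1: drop observations that contain missing values (loses rows).
dropped = df.dropna()

# Option 2: impute missing values from other observations, here the column
# median; this rests on the assumption that the median is representative.
imputed = df.fillna(df.median(numeric_only=True))

# Option 3: keep the nulls and adapt how the data is used, e.g. restrict a
# per-column calculation to the values that are present.
mean_income_present = df["income"].mean()  # pandas skips NaN by default
```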
Validate
At the end of the data cleaning process, you should be able to answer these questions as part of basic validation:
Does the data make sense?
Does the data follow the appropriate rules for its field?
Does it prove or disprove your working theory, or bring any insight to light?
Can you find trends in the data to help inform your next theory?
If not, is that because of a data quality issue?
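Some of these questions call for judgment, but the rule-based ones can be checked programmatically. Below is a minimal sketch of such checks, assuming pandas and a hypothetical cleaned customer dataset; the specific rules (unique IDs, plausible ages, countries in scope) are illustrative assumptions and would depend on the rules for your field.

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "age":         [34, 29, 41],
    "country":     ["uk", "uk", "uk"],
})

checks = {
    "ids are unique":     df["customer_id"].is_unique,
    "no missing values":  not df.isna().any().any(),
    "ages are plausible": df["age"].between(0, 120).all(),
    "countries in scope": df["country"].isin(["uk"]).all(),
}

for rule, passed in checks.items():
    print(f"{rule}: {'OK' if passed else 'FAILED'}")
```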