Introduction
The volume of data processed in companies is constantly growing, and ensuring the reliability and confidentiality of critical information is becoming increasingly complex. Ghasemaghaei and Calic (2019) claim that “despite the large variety of data, the huge volume of generated data, and the fast velocity of obtaining data, the quality of big data is far from perfect” (p. 38). Conventional access-control techniques are not always sufficient, and data masking is one of the most effective methods of protecting important information from unauthorized access. Data cleansing, by contrast, is the process of preparing a dataset for analysis, such as data mining with machine learning algorithms: this stage of data management identifies and removes errors and inconsistencies in order to improve the quality of the dataset. This paper examines the distinct nature and functions of data masking and data cleansing.
Data Masking
Data masking is a method of protecting confidential information from unauthorized access by modifying the data that requires protection. The method is often used when working with databases, including in web applications. Masking allows one to hide part of the data either when creating a copy of a database or in the response to a request for information. Siddartha and Ravikumar (2020) add that “recently, deoxyribose nucleic acid (DNA) sequences and chaotic sequence are jointly used for building efficient data masking model” (p. 6008). Masking preserves the overall structure of the information and keeps the data necessary for work available.
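As a minimal, hypothetical sketch of this principle in Python (the function name, the example value, and the asterisk convention are assumptions for illustration, not features of any specific masking product), a masked value can retain only its trailing characters while keeping its overall shape:

```python
def mask_value(value: str, visible: int = 4) -> str:
    """Hide all but the last `visible` characters of a sensitive value."""
    if visible <= 0:
        return "*" * len(value)
    # Preserve the overall shape: same length, only the tail stays readable.
    return "*" * max(len(value) - visible, 0) + value[-visible:]

# Hypothetical example value, not a real identifier.
print(mask_value("4915123456789"))  # *********6789
```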
For example, consider working in a CRM with a client card containing the client’s full name, phone number, and account balance. The full information should be available only to the head of the department, while other employees should see only the client’s full name and phone number. However, the CRM does not allow such a demarcation to be configured. In this case, the dynamic masking method can come in handy. Dynamic masking hides information at the moment of access by replacing the original data in an application or database response to a user request, without altering the information on the storage medium. According to Archana et al. (2018), “the estimation of dynamic information covering lies in its capacity to apply distinctive veils to various sorts of information found underlying databases, applications, and detailing and improvement instruments” (p. 3979). By creating appropriate rules, it is possible to provide complete information only to those who need it for business reasons, without making changes either to the database or to the logic of the CRM.
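The rule-based demarcation described above can be sketched in Python as follows. The role names, field names, and the choice to replace hidden values with “***” are assumptions for illustration; a real deployment would apply such rules in a proxy sitting between the application and the database rather than in application code:

```python
# Hypothetical masking rules: which fields each role is NOT allowed to see.
MASKING_RULES = {
    "department_head": set(),          # full access
    "employee": {"account_balance"},   # balance is masked
}

def apply_dynamic_mask(record: dict, role: str) -> dict:
    """Mask a response per role; the stored record is never modified."""
    hidden = MASKING_RULES.get(role, {"account_balance"})
    return {key: ("***" if key in hidden else value)
            for key, value in record.items()}

client = {"full_name": "Jane Doe", "phone": "555-0134", "account_balance": 1250.0}
print(apply_dynamic_mask(client, "employee"))
# {'full_name': 'Jane Doe', 'phone': '555-0134', 'account_balance': '***'}
```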
Another example of the need for data masking arises from the constant improvement and testing of software in a company. Application and database developers often request a copy of the production database to test all the functionality of the software they develop. Deleting all critical information from the copy is impossible, as it would violate the integrity of the data structure. In this case, static masking is applied to preserve the data. Sharmila et al. (2018) note that “it is mainly used to refresh non-production environments and prevent insider threat” (p. 3722). Static masking allows one to configure rules so that, when a copy of the database is created, the original information is replaced with similar but fake values. For instance, all customer phone numbers can be replaced with random digit sequences that match the structure of real phone numbers. The integrity of the database is not compromised, and such a copy can even be handed to external developers without fear of leaking valuable data.
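A static masking step of this kind might look like the following Python sketch. The table layout and the digit-for-digit replacement strategy are assumptions chosen for the example; production masking tools typically offer many more substitution rules:

```python
import random

def mask_phone_static(phone: str) -> str:
    """Replace every digit with a random one, preserving the formatting."""
    return "".join(str(random.randint(0, 9)) if ch.isdigit() else ch
                   for ch in phone)

def make_masked_copy(rows: list[dict]) -> list[dict]:
    """Build a non-production copy with fake but well-formed phone numbers."""
    return [{**row, "phone": mask_phone_static(row["phone"])} for row in rows]

# Hypothetical production row for illustration.
production = [{"full_name": "Jane Doe", "phone": "+1 (555) 013-4455"}]
print(make_masked_copy(production))  # same structure, different digits
```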
Data Cleansing
Data cleansing deals with the detection and removal of errors and inconsistencies in data in order to improve data quality, and it is inherently different from data masking in both nature and function. The method is called “cleansing” because it cleans “dirty” data. According to Ridzuan and Wan Zainon (2019), “dirty data is defined as inaccurate, inconsistent and incomplete due to the error found within the dataset” (p. 732). The presence of “dirty” data is one of the most important and most difficult-to-formalize problems of analytical technologies in general and of data storage in particular. It is therefore crucial to review and correct the data before analysis begins, both to ensure operational efficiency and to provide employees with access to accurate and consistent information.
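A minimal cleansing pass over a small contact list might look like the following Python sketch. The field names and the three rules applied (drop incomplete records, normalize formatting, remove duplicates) are assumptions chosen to illustrate common classes of dirty data:

```python
def cleanse(records: list[dict]) -> list[dict]:
    """Drop incomplete records, normalize emails, and remove duplicates."""
    seen, clean = set(), []
    for record in records:
        if not record.get("name") or not record.get("email"):
            continue                                # incomplete -> drop
        email = record["email"].strip().lower()     # inconsistent -> normalize
        if email in seen:
            continue                                # duplicate -> drop
        seen.add(email)
        clean.append({"name": record["name"].strip(), "email": email})
    return clean

# Hypothetical dirty records for illustration.
dirty = [
    {"name": "Jane Doe", "email": "JANE@EXAMPLE.COM "},
    {"name": "Jane Doe", "email": "jane@example.com"},  # duplicate
    {"name": "", "email": "bob@example.com"},           # incomplete
]
print(cleanse(dirty))  # [{'name': 'Jane Doe', 'email': 'jane@example.com'}]
```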
Differences between Data Masking and Data Cleansing
The very principle of data masking does not imply making changes to the stored data; it only masks the information displayed to the user, with specialized software operating as an intermediary. The server independently makes requests for information, masks critical data according to the selected criteria and rules, and only then sends it to the user. Data masking thus helps protect confidential and personal data and reduces the risk of exposure. Data cleansing, on the other hand, implies that changes must be made to the original data, because that data does not meet the requirements of the analyst or of the storage. Cleansing corrects the mistakes present in the original data, making it more stable and reliable. Ridzuan and Wan Zainon (2019) note that “data cleansing offers a better data quality which will be a great help for the organization to make sure their data is ready for the analyzing phase” (p. 731). Data cleansing precedes analysis throughout the data management process.
Contribution to Data Management and Operational Efficiency
The processes of continuous software development, testing, and updating are associated with several risks at once. Zhang (2018) adds that “how to ensure big data security and privacy protection has become one of the hot issues in the current stage of research” (p. 1). First, test and development environments are usually deployed on the most affordable public cloud services, which are often neither highly secure nor compliant with most regulatory requirements for storing sensitive data. Second, third-party specialists, both specialized companies and freelancers, are often involved in development and testing. Giving such contractors access to real business information is a completely unjustified risk for the company, and therefore personal data needs to be masked.
In turn, using purely synthetic datasets for testing can render the software unusable in production. For example, it may turn out that the real database contains far more records than the synthetic one, or that fields in which the developer expected numbers are filled with letters, and so on. Any such discrepancy requires revising the application software and providing additionally cleansed real data on which the errors occur. Archana et al. (2018) also state that “information covering can be utilized to stretch out insurance to unstructured and semi-structured information” (p. 3979). That is why data masking solves two tasks at once: protecting the information and promptly providing it to internal and external project teams.
Data cleansing is imperative when data is loaded into the storage, and much attention is paid to this process when developing an ETL strategy. Ridzuan and Wan Zainon (2019) state that “incomplete information will generate uncertainties during data analysis, while errors or missing values in the dataset will produce a different result and may affect the business decision” (p. 732). Information is heterogeneous and is almost always collected from many sources, and the presence of various data collection points is what makes the cleaning process so complex and relevant. Dirty data is an issue that has to be resolved before any analytical operations begin, as it can negate all the effort spent populating the data warehouse. Moreover, data changed during the cleaning process must be labeled so that this can be taken into account in subsequent analysis; otherwise, there is a risk of treating it as real information, which may lead to incorrect conclusions. Data cleansing should become a mandatory stage of the work, because the value of a data warehouse is determined not only by the amount of data but also by the quality of the information collected.
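The labeling requirement mentioned above can be sketched as follows. The flag field name and the zero-balance default are assumptions for illustration; any real imputation policy would be project-specific:

```python
def impute_with_flag(record: dict, field: str, default) -> dict:
    """Fill a missing value and record that it was changed during cleansing."""
    if record.get(field) is None:
        record[field] = default
        # "_cleansed_fields" is a hypothetical flag field, an assumption here.
        record.setdefault("_cleansed_fields", []).append(field)
    return record

row = {"customer_id": 17, "account_balance": None}
print(impute_with_flag(row, "account_balance", 0.0))
# {'customer_id': 17, 'account_balance': 0.0, '_cleansed_fields': ['account_balance']}
```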
Conclusion
Data management issues, previously considered secondary among the tasks affecting the operation of enterprises, have recently come increasingly to the fore. In the age of digitalization and the widespread use of automated systems, data is becoming a valuable resource. It is therefore of utmost importance to understand the processes behind data management, specifically data masking and data cleansing.
Data masking and data cleansing are essentially different processes that serve different functions in data management. While data masking helps protect information as it is transferred from destination to destination, data cleansing ensures that the data is correct and reliable. Both processes are crucial to data management and contribute to the operational efficiency of an organization. It is important to understand the differences between them, as the two processes require different sets of skills and are applied at different stages of data management.
References
Archana, R. A., Hegadi, R. S., & Manjunath, T. N. (2018). A study on big data privacy protection models using data masking methods. International Journal of Electrical and Computer Engineering (IJECE), 8(5), 3976.
Ghasemaghaei, M., & Calic, G. (2019). Can big data improve firm decision quality? The role of data quality and data diagnosticity. Decision Support Systems, 120, 38–49.
Ridzuan, F., & Wan Zainon, W. M. (2019). A review on data cleansing methods for big data. Procedia Computer Science, 161, 731–738.
Sharmila, K., Catherine, S. B. A., & Sreeja, V. S. (2018). A comprehensive study of data masking techniques on the cloud. International Journal of Pure and Applied Mathematics, 119(15), 3719–3727.
Siddartha, B. K., & Ravikumar, G. K. (2020). An efficient data masking for securing medical data using DNA encoding and a chaotic system. International Journal of Electrical and Computer Engineering (IJECE), 10(6), 6008.
Zhang, D. (2018). Big data security and privacy protection. Proceedings of the 8th International Conference on Management and Computer Science (ICMCS 2018).