Data Warehousing, Data Mining, Big Data, and Open Data

Data Warehousing and Operational and Strategic Data Sets

The term “data warehouse” was first used in 1990. According to Inmon, a data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data. This data allows analysts to make informed decisions about their business strategies. An operational database, by contrast, is subject to frequent transactional changes every day. As a result, a manager who wants to analyze earlier states of the business may find nothing to analyze, because the previous values were overwritten by later transactions.
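The contrast between an operational store that overwrites values and a warehouse that keeps dated history can be illustrated with a minimal sketch. The account name, balances, and the helper functions `apply_transaction` and `balance_history` are hypothetical and serve only as an illustration; a real warehouse would load snapshots through a separate extraction process.

```python
from datetime import date

# Operational store: one current value per account; each update overwrites the last.
operational = {}

# Warehouse: append-only list of dated snapshots (non-volatile, time-variant).
warehouse = []

def apply_transaction(account, new_balance, day):
    """Overwrite the operational value and append a dated snapshot to the warehouse."""
    operational[account] = new_balance
    warehouse.append({"account": account, "balance": new_balance, "as_of": day})

def balance_history(account):
    """Return every recorded (date, balance) pair for one account, oldest first."""
    return [
        (row["as_of"].isoformat(), row["balance"])
        for row in warehouse
        if row["account"] == account
    ]

apply_transaction("acct-1", 100, date(2023, 9, 1))
apply_transaction("acct-1", 250, date(2023, 9, 2))
apply_transaction("acct-1", 175, date(2023, 9, 3))

print(operational["acct-1"])      # only the latest balance survives here
print(balance_history("acct-1"))  # the warehouse still holds all three states
```

The operational dictionary answers "what is the balance now", while the snapshot list is the only place where a trend over time can still be computed.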

The data warehouse provides users with aggregated and integrated data in a multidimensional view. In addition to summary and consolidated views of the data, data warehouses provide tools for online analytical processing (OLAP). These tools support interactive and efficient analysis of data in a multidimensional space, which in turn supports data generalization and data mining (Inmon and Netlibrary, 2002). Data mining functions such as association, clustering, classification, and forecasting can be integrated with OLAP operations to improve interactive analysis at multiple levels of abstraction. Operational data can be regarded as a subset of strategic data, because operational data describes the internal environment and control mechanisms of a system. Strategic data contains this information as well as external data and information about the long-term functioning of the system.

Data Mining and OLAP compared with OLTP Systems

Online analytical processing (OLAP) is a class of software tools that analyze data for business decision making. OLAP systems allow users to analyze information from multiple database systems simultaneously. Online transaction processing (OLTP), by contrast, manages the everyday transactions of an organization (Shadaksharappa, Ramkumar and Prabakar, 2022). Its main purpose is to process data rather than analyze it. Data warehouses are typical OLAP systems, as the following examples show:

A company can compare September mobile phone sales with October sales and compare these results with other data stored in a separate database.
Amazon analyzes customer purchases to build a personalized landing page with products that may interest the customer (Shadaksharappa, Ramkumar and Prabakar, 2022).
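The month-to-month sales comparison mentioned above can be sketched with Python's built-in `sqlite3` module. The table layout, the sales figures, and the helper `monthly_units` are assumptions made for illustration; SQLite is used only because it ships with Python, not because it is a typical OLAP engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, month TEXT, product TEXT, units INTEGER)"
)

# OLTP-style work: many small writes, each recording one business event.
rows = [
    ("2023-09", "phone", 120),
    ("2023-09", "tablet", 40),
    ("2023-10", "phone", 150),
    ("2023-10", "tablet", 35),
    ("2023-10", "phone", 30),
]
with conn:
    conn.executemany("INSERT INTO sales (month, product, units) VALUES (?, ?, ?)", rows)

def monthly_units(product):
    """OLAP-style work: one read-only query that aggregates over many rows."""
    cur = conn.execute(
        "SELECT month, SUM(units) FROM sales WHERE product = ? "
        "GROUP BY month ORDER BY month",
        (product,),
    )
    return dict(cur.fetchall())

phone = monthly_units("phone")
change = phone["2023-10"] - phone["2023-09"]
print(phone, change)
```

The inserts touch one row at a time, whereas the analytical query reads every matching row and returns only a summary, which is the difference in workload that separates the two kinds of system.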

OLAP creates a common platform for all types of business information, including planning, budgeting, forecasting, and analysis. Its main advantage is the consistency of information and calculations. It is also easy to apply security restrictions to users and objects in order to enforce rules and protect confidential data. A further advantage is that query speed, while important, is not critical, because OLAP serves analysis rather than live transactions. The weakness of OLAP is that queries to the system are ad hoc and usually quite complex. In addition, the effective use of OLAP tools requires the cooperation of users from different departments, which is not always possible.

Although OLAP has become very popular, the first difficulty it raises, which also has an ethical dimension, is coordination between different departments. The second is justifying the expenditure. Purchasing and implementing new technology requires funding, but the return is not felt immediately and is not easy to quantify. In most cases, justifying the upcoming costs requires evidence that the project will pay for itself within the first year. Therefore, despite the obvious merits of both systems, presenting a financial justification for them is extremely challenging.

Data mining is also referred to as data extraction, intelligent data analysis, or deep data analysis. It is the process companies use to turn raw big data into useful information. While the term Big Data refers to all large data sets, both processed and unprocessed, data mining is the process of diving deep into that data to extract key insights (Shadaksharappa, Ramkumar and Prabakar, 2022). The main problem in detecting patterns in data is the time it takes to enumerate the possible combinations in large data arrays. Known methods either artificially limit this search or build entire decision trees, which reduces the efficiency of the search. Moreover, given the high level of leaks of user information, an important ethical issue is the development of an effective regulatory framework with a system of fines and penalties for privacy violations (Shadaksharappa, Ramkumar and Prabakar, 2022). At present, legislation in this area can be considered underdeveloped.
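One common way of limiting the pattern search is to discard rare items before counting combinations, in the style of the Apriori algorithm. The sketch below is an illustration of that general idea, not a method taken from the cited sources; the shopping baskets, the `min_support` threshold, and the function name `frequent_pairs` are all assumptions.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(transactions, min_support):
    """Return item pairs that occur together in at least min_support transactions."""
    # Pass 1: count single items and keep only the frequent ones.
    item_counts = Counter(item for basket in transactions for item in set(basket))
    frequent_items = {item for item, n in item_counts.items() if n >= min_support}

    # Pass 2: count only pairs built from frequent items. A pair cannot be
    # frequent if either of its items is rare, so rare items are skipped.
    pair_counts = Counter()
    for basket in transactions:
        kept = sorted(set(basket) & frequent_items)
        pair_counts.update(combinations(kept, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}

baskets = [
    ["phone", "case", "charger"],
    ["phone", "case"],
    ["phone", "charger"],
    ["tablet", "case"],
    ["phone", "case", "cable"],
]
result = frequent_pairs(baskets, min_support=2)
print(result)
```

Because "tablet" and "cable" each appear only once, no pair containing them is ever counted, which is how the threshold shrinks the enumeration.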

The Rise of ‘Big Data’ and Its Applications

The term “Big Data” refers to sets of information whose scale, variety, and complexity require new architectures, methods, algorithms, and analysis tools to manage them. The main task of Big Data technologies is to process large amounts of structured, semi-structured, and unstructured data and to produce predictions based on it. The main sources of Big Data are social media and the Internet (where everyone produces information), scientific instruments (which collect all types of data), mobile devices (which constantly track every object), and sensor technology and networks (which measure all types of information).

The world’s leading supranational structures and transnational corporations, the governments of many countries, businesses of all sizes, industrial and social infrastructure management systems, and, of course, the military-intelligence complexes of all major countries are already using Big Data as a crucial strategic resource (Srinivasan, 2018). The world’s leading companies are adopting Big Data technologies in various areas of business (Krishnan, 2020). For example, HSBC uses Big Data technologies to counteract fraudulent payment card transactions.

Based on the definition of BigData, we can formulate the basic principles of working with such data:

Horizontal scalability. A big data system grows by adding machines to a cluster, so a cluster can contain a very large number of machines. For example, Yahoo’s Hadoop cluster has over 42,000 machines.
Fault tolerance. With that many machines in a cluster, some of them are guaranteed to fail. Big data methods must take the possibility of such failures into account and survive them without any meaningful consequences.
Data locality. In large distributed systems, data is spread over a large number of machines. If data is physically stored on one server but processed on another, the cost of transferring it may exceed the cost of the processing itself.
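Fault tolerance and data locality can be combined in a simple scheduling rule: process each block of data on a live machine that already stores a copy of it. The sketch below assumes that every block is replicated on two nodes, which is an assumption added for illustration and not a detail from the sources; the block names, node names, and the `schedule` function are likewise hypothetical.

```python
# Hypothetical placement: each data block is stored on two different nodes.
replicas = {
    "block-a": ["node-1", "node-2"],
    "block-b": ["node-2", "node-3"],
    "block-c": ["node-3", "node-1"],
}

def schedule(block, live_nodes):
    """Pick a live node that already holds the block, so no data has to move."""
    for node in replicas[block]:
        if node in live_nodes:
            return node
    raise RuntimeError(f"all replicas of {block} are unavailable")

live = {"node-1", "node-2", "node-3"}

# Normal operation: every block is processed where it is stored.
plan_all_up = {block: schedule(block, live) for block in replicas}

# node-1 fails: its blocks are processed on the surviving replica holders.
plan_node1_down = {block: schedule(block, live - {"node-1"}) for block in replicas}

print(plan_all_up)
print(plan_node1_down)
```

In both plans every block is handled by a machine that holds a copy, so the loss of one node changes where the work runs but does not require moving data or stopping the job.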

With the help of Big Data, HSBC increased the efficiency of its security service by a factor of 3 and the detection of fraudulent incidents by a factor of 10. The economic effect of implementing these technologies was considerable: the system currently helps prevent $2 billion in fraudulent payments annually (Krishnan, 2020). The main disadvantage of working with big data is the need to address data quality issues. Before using big data for analysis, analysts must make sure that the information they are using is accurate, relevant, and in a format suitable for analysis. This slows down the reporting process considerably, but if businesses do not address data quality issues, they may find that the insights from their analytics are useless, or even harmful if acted upon.

It should be noted that the collection, processing, and analysis of big data has become a way to make money and, in some cases, a factor that can change an entire industry. Most of this data can be considered personal data and is therefore subject to protection. However, this is not enough, because personal data protection laws usually do not cover ethical and moral aspects.

‘NoSQL’ Databases as Compared with ‘ACID-Compliant’ Databases

SQL databases use Structured Query Language to manage data: to process, store, update, and delete it. Relational data warehouses use relational database management systems (RDBMS). In an RDBMS, data is stored in tables. A table is the basic unit of a database; it consists of rows and columns in which data is stored. Tables are the most commonly used type of database object. SQL is a language that allows specialists to work easily with an RDBMS (Nordeen, 2020). The main features of such a system are the reliability and consistency of the data and a low risk of data loss. When data is updated, its integrity is guaranteed, because each value is replaced in a single table.

Relational databases, unlike non-relational ones, comply with ACID – the requirements for transactional systems. Compliance with them ensures data integrity and predictability of the database:

Atomicity – no transaction is ever partially committed to the system.
Consistency – only valid transaction results are recorded.
Isolation – the results of a transaction are not affected by other transactions that run in parallel with it.
Durability – database changes are preserved despite failures or user actions (Nordeen, 2020).
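Atomicity and consistency from the list above can be demonstrated with Python's built-in `sqlite3` module. The `accounts` table, the balances, and the `transfer` function are assumptions made for this sketch; the `CHECK` constraint stands in for a business rule that balances may not go negative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(src, dst, amount):
    """Move money between accounts; both updates are committed together or not at all."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The debit would violate the CHECK constraint, so the credit is undone too.
        return False

def balances():
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))

ok_small = transfer("alice", "bob", 30)
after_small = balances()
ok_large = transfer("alice", "bob", 500)
after_large = balances()
print(ok_small, after_small, ok_large, after_large)
```

The second transfer credits the recipient first and then fails on the debit, yet the recipient's balance is unchanged afterwards, which shows that the partial update was not kept.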

The disadvantage of ACID compliance is that overall system performance falls, because the consistency of data must be maintained across multiple nodes. MySQL is one of the most popular open-source relational databases. It is suitable for small and medium-sized projects that need an inexpensive and reliable tool for working with data (Will, 2019). It supports many table types, and a large number of plugins and extensions make the system easier to work with.

NoSQL databases are well suited to many modern applications that need flexible, scalable, high-performance databases with a wide range of functionality. Such applications include mobile and web projects and computer games. The capabilities of NoSQL databases allow such projects to give their users the best possible experience. The growing popularity of NoSQL databases is mainly due to the need to work with data of very different structure and size. This may be structured, semi-structured, or polymorphic data (Nisa, 2018). With such data, it is almost impossible to define a data schema in advance. NoSQL also gives developers a lot of flexibility when they need to adapt the system quickly to changes (Raj and Deka, 2018). NoSQL databases offer users APIs with wide functionality, as well as data types designed specifically for certain data models.
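The idea of storing records without a fixed schema can be sketched with plain Python dictionaries standing in for documents. This does not use any real NoSQL product; the collection contents, the field names, and the `find` helper are hypothetical and only illustrate the document model.

```python
import json

# Hypothetical document collection: each record carries its own set of fields.
documents = [
    {"_id": 1, "type": "player", "name": "Ada", "level": 12},
    {"_id": 2, "type": "player", "name": "Lin", "level": 3,
     "guild": {"name": "Owls", "rank": "scout"}},
    {"_id": 3, "type": "item", "name": "Lantern", "tags": ["light", "tool"]},
]

def find(collection, **criteria):
    """Return documents whose top-level fields equal all given criteria."""
    return [
        doc for doc in collection
        if all(key in doc and doc[key] == value for key, value in criteria.items())
    ]

players = find(documents, type="player")

# A field that only some documents have can be queried without a schema change.
with_guild = [doc["name"] for doc in documents if "guild" in doc]

# Documents serialize directly to JSON, nested structures included.
serialized = json.dumps(documents[1], sort_keys=True)
print([doc["name"] for doc in players], with_guild, serialized)
```

Adding the nested `guild` field to one player required no change to the other records, which is the kind of flexibility that a fixed table schema does not offer without a migration.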

NoSQL databases are also easier to scale than SQL databases, because they can scale horizontally. This is done by adding nodes when the need arises to handle more traffic than usual. It simplifies both expanding capacity at times of peak network load and reducing it when there is little traffic (Raj and Deka, 2018), and it improves application scalability. This is not to say that SQL databases do not support horizontal scaling (Raj and Deka, 2018); it is simply more difficult to do with an RDBMS. SQL databases mostly scale vertically, that is, by giving more processing power to the server on which the database runs. The disadvantage of NoSQL is that, because it is much younger than SQL, it has a smaller community, and that community is fragmented by the different approaches used in different NoSQL databases.
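Horizontal scaling can be sketched as hash-based sharding, in which each key is assigned to one of the available nodes. The node names, the key format, and the functions `shard_for` and `distribute` are assumptions for illustration; this is simple modulo sharding, not the scheme of any particular NoSQL database.

```python
import hashlib

def shard_for(key, nodes):
    """Map a key to one node using a stable hash (simple modulo sharding)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]

def distribute(keys, nodes):
    """Group keys by the node that would store them."""
    placement = {node: [] for node in nodes}
    for key in keys:
        placement[shard_for(key, nodes)].append(key)
    return placement

keys = [f"user-{i}" for i in range(1000)]

# The same keys spread over three nodes, then over four after one node is added.
three_nodes = distribute(keys, ["node-1", "node-2", "node-3"])
four_nodes = distribute(keys, ["node-1", "node-2", "node-3", "node-4"])

print({node: len(stored) for node, stored in three_nodes.items()})
print({node: len(stored) for node, stored in four_nodes.items()})
```

Adding a node spreads the same keys over more machines, so each machine holds a smaller share. With plain modulo sharding many keys change node when the node count changes, which is why production systems often prefer consistent hashing.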

The Impact of The ‘Open Data’ Movement

Open data, as an ideology and value policy in many countries and activist groups, was originally a topic related to the political accountability of the authorities (Davies et al., 2019). In a sense, open data and transparency were and remain synonymous with responsible, accountable government (Davies et al., 2019). Many international commitments on openness, such as the International Aid Transparency Initiative (IATI), the G8 Open Data Charter, and others, have emphasized transparency.

The most important argument in favor of open data has always been the potential economic effect of publishing data: the emergence of new commercial companies that use data in their work and the development of existing ones. Many studies of the economic effect have been carried out, and they have shown an effect of 3.2 trillion dollars. In the related open access movement, “open access” is defined as publications on the Internet that are open to all and can be read, downloaded, copied, distributed, printed, searched, or linked to in full text (Davies et al., 2019). The only restriction on the reproduction and distribution of publications, and the only role of copyright in this area, should be the right of authors to control the integrity of their work. The world has already established basic principles and standards for publishing open data. They do not vary between countries and can be applied successfully without substantial reworking, because they correspond to universal values and to the current state of technology.

Briefly, open access is interpreted as free, unrestricted access to research publications on the Internet. It permits any use of these works but excludes changes to them and requires that the author be credited in citations and other uses. One of the most important types of open data is open government data (OGD), that is, open data created by government agencies, including the U.S. Treasury Department (Davies et al., 2019). The disadvantage of open data is that there are no adequate regulations governing its use, so a person who violates the rules may go unpunished.

Open data is a matter of key public interest, and numerous nonprofit organizations and individual activists are pushing for the release of different kinds of information in machine-readable form. Many national governments, as part of their open government strategies, have set up websites to disseminate some of the data processed in the public administration sector, which is why it is important to build on this movement. The development of open data will allow citizens to get up-to-date and reliable information about the aspects of public life and the workings of government that interest them. Open data will allow people to participate actively in the life of the country and to steer its development in the direction that satisfies the requirements of the majority. Open data portals are a unified tool for achieving these goals.

Reference List

Davies, T., Walker, S.B., Rubinstein, M. and Perini, F. (2019). The state of open data: histories and horizons. Cape Town: African Minds; Ottawa: International Development Research Centre.

Will, H. (2019). Big Data: Ideology vs. Enlightenment. International Journal of Computer Auditing, 1(1), pp.4–25.

Inmon, W.H. and Netlibrary, I. (2002). Building the data warehouse. New York: J. Wiley.

Inmon, B. and Puppini, F. (2020). The Unified Star Schema: An Agile and Resilient Approach to Data Warehouse and Analytics Design. Technics Publications.

Krishnan, K. (2020). Building big data applications. London, England: Elsevier.

Nisa, B.U. (2018). A Comparison between Relational Databases and NoSQL Databases. International Journal of Trend in Scientific Research and Development, Volume-2(Issue-3), pp.845–848.

Nordeen, A. (2020). Learn Data Warehousing in 24 Hours. Guru99.

Raj, P. and Deka, G.C. (2018). A deep dive into NoSQL databases: the use cases and applications. Cambridge, MA: Academic Press.

Shadaksharappa, D.B., Ramkumar, P. and Prabakar, T.N. (2022). Data Warehousing & Data Mining. Book Rivers.

Srinivasan, S. (2018). Guide to Big Data Applications. Cham: Springer International Publishing.