One of the consequences of our automated, always-connected world is an enormous increase in captured digital data. Like it or not, for almost everything you do today—every website visit, financial transaction, social network interaction, medical exam, trip taken—someone is collecting and saving that data in a database.
Data scientists estimate that there are now over 64 zettabytes of data online (a zettabyte is a billion terabytes, or a trillion gigabytes), and that data storage will grow at a compound annual rate of 23% between 2020 and 2025.
But what good is all this stuff? Although corporate raw data is rich with potentially useful information, the sheer number and size of these databases make extracting actionable business knowledge quite challenging.
In order to deal with all these exabytes and zettabytes, a discipline known as data science has evolved. Within that domain the field of data mining was developed.
What is data mining?
Data mining is the computational process of exploring and uncovering patterns and useful knowledge in large data sets, sometimes referred to as “big data.” It is a branch of computer science, using tools and techniques from statistics, database theory, and machine learning. Although data mining has been developing since the 1990s, it builds on the work of mathematicians and computer scientists over the past 250 years.
Key data mining concepts
Data mining is performed by automated analysis of very large sets of data, usually originating in multiple relational databases populated and managed by online transaction processing (OLTP) systems. These systems capture, store, and process data in real time—for example, a retail order entry system connected to an online store, or an airline ticketing system. Traditional databases are great for fast processing of individual transactions, but poor for data mining, especially when the data of interest spans multiple large databases.
Data mining relies on data processing systems called data warehouses. A data warehouse is a solution that extracts and stores data from multiple sources, and enables data mining and other business intelligence (BI) applications—reporting, analysis, dashboards, charts and graphics—to provide users with insights about the current state of an organization.
A data warehouse consists of four major subsystems:
- Interfaces to external data sources (typically OLTP applications and databases), which acquire and regularly update raw data
- An application server to cleanse, process, and store the data
- An online analytical processing (OLAP) server for calculation, planning, and forecasting functions
- Front-end tools to enable queries, reporting, and data mining
The raw data stored in OLTP databases often contains anomalies, outliers, and outright errors. These problems must be resolved in order to avoid analytical errors and false conclusions. A key task of the data warehouse application server is data cleaning and preparation—removing or fixing incorrect, corrupt, duplicate, or incomplete data within a dataset. This occurs during data intake and may involve a variety of steps, depending on the data sources, their reliability, and the BI applications supported.
Data cleaning includes:
- Removal of unwanted data. This includes duplicate data, a common problem when using multiple data sources; and irrelevant data that do not fit into the problem being analyzed.
- Repair. Repairing structural errors, for example, inconsistencies in format, numeric precision, or naming convention.
- Filtering. Some data may contain outlier observations that do not fit—either due to data entry errors or problems with the input data source. Removal of outlier data should only be performed if it is truly invalid.
- Handling missing data. This can involve dropping incomplete observations, imputing missing values based on other observations, or changing the processing algorithm to deal with missing data.
- Validation and quality assurance. This final step tests the sensibility of the data as well as its conformance to the format rules.
For example, when ingesting customer feedback data, data cleaning may involve some of these steps:
- Removing or reconciling duplicate entries from the same customer
- Converting all rating responses to a common scale
- Removing spammy entries
- Imputing missing answers, or ensuring a response is not dropped when a field left empty has accidentally been made required in your data warehouse
- Standardizing free-form entries (e.g., “N/A” and “Not applicable” mean the same thing)
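The feedback-cleaning steps above can be sketched in a few lines of Python. The record fields, rating scales, and imputation rule here are hypothetical choices for illustration; a production pipeline would typically use a library such as pandas:

```python
from statistics import mean

# Hypothetical raw feedback records; field names and scales are invented.
raw = [
    {"customer": "a1", "rating": 4,    "comment": "Great"},
    {"customer": "a1", "rating": 4,    "comment": "Great"},           # duplicate
    {"customer": "b2", "rating": 9,    "comment": "Not applicable"},  # 1-10 scale
    {"customer": "c3", "rating": None, "comment": "N/A"},             # missing rating
]

def clean(rows):
    seen, cleaned = set(), []
    for row in rows:
        key = (row["customer"], row["comment"])
        if key in seen:                  # remove duplicate entries
            continue
        seen.add(key)
        r = dict(row)
        if r["rating"] is not None and r["rating"] > 5:
            r["rating"] = round(r["rating"] / 2)   # rescale 1-10 down to 1-5
        if r["comment"] in ("N/A", "Not applicable"):
            r["comment"] = "n/a"         # standardize equivalent free-form entries
        cleaned.append(r)
    # impute missing ratings with the mean of the known ones
    known = [r["rating"] for r in cleaned if r["rating"] is not None]
    for r in cleaned:
        if r["rating"] is None:
            r["rating"] = round(mean(known))
    return cleaned

rows = clean(raw)
```

After cleaning, the duplicate is gone, all ratings sit on one scale, and the "N/A" variants collapse to a single value the rest of the pipeline can count on.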
Machine Learning (ML)
One of the key technology areas impacting data warehouse analytical processing is machine learning. Machine learning (ML) has long been studied as part of computer science. It is a statistical tool that enables systems to learn and improve from experience without being explicitly programmed. The learning is driven by algorithmic processing of data—the more data, the better.
One useful approach within machine learning is the artificial neural network. This is a program which mimics the human brain through a model composed of interconnected abstract nodes, modeling neurons. The neural net models transmission of a signal between nodes, processing of the signal within a node, and output of signals to one or more other nodes. Properly constructed, this basic model can be trained, and it can learn from experience, based on the data it processes.
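As a rough illustration of "learning from experience," here is a single artificial neuron (a perceptron) trained on the logical AND function; the learning rate and epoch count are arbitrary choices, and real neural networks stack many such units with smoother activation functions:

```python
# A single artificial neuron (perceptron) trained on the AND function.
def step(x):
    return 1 if x >= 0 else 0  # threshold activation

def train(samples, epochs=100, lr=0.1):
    w = [0.0, 0.0]   # one weight per input
    b = 0.0          # bias term
    for _ in range(epochs):
        for inputs, target in samples:
            # forward pass: weighted sum of inputs, then threshold
            out = step(w[0] * inputs[0] + w[1] * inputs[1] + b)
            err = target - out
            # learning: nudge weights and bias to reduce the error
            w = [w[0] + lr * err * inputs[0], w[1] + lr * err * inputs[1]]
            b += lr * err
    return w, b

and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train(and_data)
predict = lambda x: step(w[0] * x[0] + w[1] * x[1] + b)
```

No rule for AND is ever written into the program; the weights converge to it purely from repeated exposure to the data, which is the essence of the ML approach described above.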
What is data mining used for?
In recent years data mining has been most commonly used by consumer-focused organizations in fields such as marketing, communications, education, finance, and retail.
- In marketing, data mining can analyze large databases to better segment the target market, determine interests, and create targeted online advertising campaigns.
- In education, data mining helps educators analyze student performance, predict achievement, and target students weak in particular areas.
- In finance, banks use data mining to better manage risks and augment credit ratings.
- In retail, e-commerce sites use data mining for creative selling approaches, including cross-selling and up-selling.
How are data mining projects run?
Developing a large data mining solution can be daunting. There are a wide variety of commercial and open source tools and approaches. Business professionals have developed a cross-industry methodology and data mining process model to help. This model, called CRISP-DM (for Cross-Industry Standard Process for Data Mining), helps by defining typical project phases, tasks within each phase, and an explanation of the relationships between the tasks.
Briefly, the CRISP-DM model addresses the following project phases:
- Business understanding. Developing an understanding of the project needs and objectives as well as a preliminary plan.
- Data understanding. Identifying and acquiring the necessary data, and assessing its quality and problems.
- Data preparation. Planning and executing the creation of an appropriate dataset from the initial raw data, including any data cleansing.
- Modeling. Using tools and algorithms to identify patterns within the data.
- Evaluation. Testing the models against the data and business requirements to determine which model(s) best achieve the goals.
- Deployment. Putting the solution into production, monitoring its performance and the ongoing inbound data to ensure everything is properly processed, and making the results available to the business.
Data mining techniques
There are many data mining techniques employed to infer patterns and predict future events. These methods, or tasks, can be grouped as either descriptive or predictive. Descriptive tasks find patterns that characterize the existing data and infer new information from them. Predictive tasks build models from the data that can help predict values in new data sets.
Some common data mining tasks include:
- Anomaly detection is a descriptive task that identifies abnormal patterns in the data, without discarding valid outlier data. This is increasingly difficult with high-dimensional data (also known as “big data,” i.e., data from multiple and diverse distributed sources, each large in scale and frequently updated).
- Association rule learning involves the discovery of probable relationships between data items based on “if-then” statements. Association rule mining is used to detect correlations in sales transactions or medical test data.
- Clustering involves the identification of natural groups of similar data (clusters). This is used to help detect patterns, and is widely used in market segmentation and promotion.
- Summarization involves the generalization of a large data set into a smaller, aggregated data set with reduced detail. This summary may be more easily or quickly employed in analysis tasks.
- Classification is a predictive analytics task that identifies the categories to which subsets of the data belong. It is dependent on the definition of a classification model, and on probabilistic rules for identifying and assigning classifications. Decision trees, organizing classification models into hierarchies, are commonly employed.
- Regression models are useful in predicting values based on other features in the data. This is useful for predicting business and social trends, medical outcomes, and future financial values.
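The “if-then” relationships behind association rule learning reduce to two standard measures, support (how often an itemset appears) and confidence (how often the “then” follows the “if”), which can be computed directly; the shopping-basket transactions here are invented toy data:

```python
# Toy transaction data; items are illustrative only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]

def support(itemset):
    # fraction of transactions containing every item in the set
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # estimated P(consequent | antecedent):
    # how often the "then" items appear when the "if" items do
    return support(antecedent | consequent) / support(antecedent)
```

A rule like “if bread, then milk” would be kept or discarded based on whether its support and confidence clear chosen thresholds; real systems such as the Apriori algorithm prune the search over itemsets rather than scoring every rule.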
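Clustering, one of the descriptive tasks above, can be sketched as a minimal one-dimensional k-means; the spending figures and starting centers are invented, and library implementations (e.g., scikit-learn) add smarter initialization and a convergence check:

```python
from statistics import mean

def kmeans_1d(points, centers, iters=10):
    """Minimal 1-D k-means: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    for _ in range(iters):
        groups = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(p - c))
            groups[nearest].append(p)
        # keep a center in place if no points were assigned to it
        centers = [mean(g) if g else c for c, g in groups.items()]
    return sorted(centers)

# Two obvious customer-spend clusters: low spenders and high spenders.
spend = [1, 2, 3, 100, 101, 102]
centers = kmeans_1d(spend, centers=[0.0, 50.0])
```

The natural groups emerge from the data alone, with no labels supplied, which is what makes clustering useful for discovering market segments.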
The ethics of data mining
It is clear that we are generating data at an ever increasing pace. However, it’s also become clear that collecting and analyzing user data can be fraught with ethical—and sometimes legal—issues. For example, using demographic data in credit score assignment and mortgage approval can have implicit racial bias, collecting user data for election targeting can have serious legal and political repercussions, or releasing supposedly anonymized data for data mining challenges can leak personally identifiable information. When using user generated data, data mining practitioners must question if the data is generated with user consent and if the technique they are going to use for a task can seriously harm a user group.
The future of data mining
As we move forward, data mining is being applied increasingly in data-rich fields such as medical research (drug development and bioinformatics) and security management (intrusion detection, fraud detection, lie detection, and surveillance). Newer data forms, including multimedia (audio, video, images), are also being mined to identify associations and classifications. Spatial and geographic data can be mined to extract information for geographic information systems and navigation. And time series data can be mined to infer cyclical and seasonal trends, such as retail buying patterns or weather.
The data available to us is growing without apparent bounds, but so too is the risk of negative impacts on the users who generated it in the first place. With the right data mining tools and an ethical frame of mind, we can learn from it, unlocking patterns of past and future behavior to enhance our security, health, and overall prosperity.
Ready for a search engine that protects your data instead of mining and selling everything you do online? Try Neeva, the world’s first 100% ad-free, private search engine. We will never sell or share your data with anyone, especially advertisers. Try Neeva for yourself, at neeva.com.