Data mining is the process of extracting patterns from large data
sets by combining methods from statistics and artificial intelligence with database management.
Process:
1. Pre-processing:
Before data mining algorithms can be used, a
target data set must be assembled. As data mining can only uncover patterns
already present in the data, the target dataset must be large enough to contain
these patterns while remaining concise enough to be mined in an acceptable
timeframe. A common source for data is a data mart
or data warehouse. Pre-processing is essential for analyzing
multivariate datasets before clustering or data mining.
The target set is then cleaned. Cleaning
removes the observations with noise and missing data.
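As a minimal sketch of the cleaning step, the snippet below drops every observation that has a missing value. It assumes observations are stored as tuples and that missing values appear as `None` or NaN; the helper name `clean` and the sample data are hypothetical. Removing noisy observations needs a domain-specific rule and is not shown.

```python
import math

def clean(observations):
    """Drop observations that contain a missing value (None or NaN)."""
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))
    return [row for row in observations if not any(is_missing(v) for v in row)]

# Hypothetical raw observations, two of them incomplete
raw = [
    (5.1, 3.5),
    (4.9, None),          # missing value
    (float("nan"), 3.0),  # missing value encoded as NaN
    (6.2, 2.9),
]
cleaned = clean(raw)
```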
The clean data are reduced into feature vectors,
one vector per observation. A feature vector is a summarized version of the raw
data observation. The features selected depend on the objectives of the
analysis; selecting the "right" features is fundamental to
successful data mining.
The feature vectors are divided into two
sets, the "training set" and the "test set". The training
set is used to "train" the data mining algorithm(s), while the test
set is used to verify the accuracy of any patterns found.
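The split into a training set and a test set could be sketched as follows. The function name `train_test_split`, the 25% test fraction, and the fixed seed are illustrative assumptions, not values prescribed by the text.

```python
import random

def train_test_split(vectors, test_fraction=0.25, seed=0):
    """Shuffle a copy of the feature vectors and return (training_set, test_set)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(vectors)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical feature vectors, one per observation
feature_vectors = [(i, i * 2) for i in range(20)]
training_set, test_set = train_test_split(feature_vectors)
```

Shuffling before splitting helps keep any ordering in the source data from leaking into one of the two sets.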
2. Data mining:
Data mining commonly involves four classes of tasks:
Clustering – is the task of discovering groups and structures in the data that are in some
way or another "similar", without using known structures in the data.
Classification – is the task of generalizing known
structure to apply to new data. For example, an email program might attempt to
classify an email as legitimate or spam. Common algorithms include decision tree learning, nearest neighbor, naive Bayesian classification, neural networks and support vector machines.
Regression – Attempts to find a function which models the
data with the least error.
Association rule learning – Searches for relationships
between variables. For example, a supermarket might gather data on customer
purchasing habits. Using association rule learning, the supermarket can
determine which products are frequently bought together and use this
information for marketing purposes. This is sometimes referred to as market
basket analysis.
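To make the market basket example concrete, here is a small sketch that counts how often pairs of products are bought together and derives rules of the form "a implies b". Support (the share of baskets containing both items) and confidence (the share of baskets with a that also contain b) are the usual measures for such rules; the function name `pair_rules`, the 0.5 support threshold, and the baskets are assumptions made for illustration, and the sketch only considers pairs of items.

```python
from collections import Counter
from itertools import combinations

def pair_rules(transactions, min_support=0.5):
    """Return {(a, b): (support, confidence)} for rules a -> b over item pairs."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))  # ignore duplicates within one basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # pair is not bought together often enough
        rules[(a, b)] = (support, count / item_counts[a])
        rules[(b, a)] = (support, count / item_counts[b])
    return rules

# Hypothetical supermarket baskets
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["butter", "milk"],
]
rules = pair_rules(baskets)
```

With these baskets, every basket containing milk also contains butter, so the rule milk -> butter has confidence 1.0, while bread and jam appear together too rarely to pass the support threshold.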
3. Results validation:
The final step of knowledge discovery from
data is to verify that the patterns produced by the data mining algorithms occur in
the wider data set. Not all patterns found by the data mining algorithms are
necessarily valid. It is common for the data mining algorithms to find patterns
in the training set which are not present in the general data set; this is
called overfitting. To overcome this, the evaluation uses a test set
of data which the data mining algorithm was not trained on. The learnt patterns
are applied to this test set and the resulting output is compared to the
desired output. For example, a data mining algorithm trying to distinguish spam
from legitimate emails would be trained on a training set
of sample emails. Once trained, the learnt patterns would be applied to the
test set of emails which it had not been trained on. The accuracy of these
patterns can then be measured from how many emails they correctly classify. A
number of statistical methods, such as ROC curves, may be used to evaluate the algorithm.
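The spam example can be sketched with a nearest neighbor classifier, one of the classification algorithms listed earlier. The two features (number of links, number of exclamation marks) and all sample data are hypothetical; the point is only to show accuracy being measured on emails the classifier was not trained on.

```python
def nearest_neighbor(training_set, vector):
    """Predict the label of the training example closest to `vector` (1-NN)."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, label = min(training_set, key=lambda example: squared_distance(example[0], vector))
    return label

def accuracy(training_set, examples):
    """Fraction of examples whose predicted label matches the desired label."""
    correct = sum(nearest_neighbor(training_set, v) == label for v, label in examples)
    return correct / len(examples)

# Hypothetical features: (number of links, number of exclamation marks)
training_set = [((0, 0), "legitimate"), ((1, 0), "legitimate"),
                ((5, 4), "spam"), ((7, 6), "spam")]
test_set = [((0, 1), "legitimate"), ((6, 5), "spam"),
            ((3, 3), "spam"), ((2, 1), "spam")]

training_accuracy = accuracy(training_set, training_set)  # 1.0
test_accuracy = accuracy(training_set, test_set)          # 0.75
```

The classifier is perfect on the emails it was trained on but misclassifies one unseen email, which is why accuracy is reported on the test set rather than the training set.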
If the learnt patterns do not meet the desired standards,
then it is necessary to reevaluate and change the preprocessing and data
mining. If the learnt patterns do meet the desired standards then the final
step is to interpret the learnt patterns and turn them into knowledge.