1. Foreword

This article is a brief introduction to key concepts in data mining and machine learning. It covers the types of machine learning, the machine learning workflow, and how data and attribute types are described. It is intended for anyone who wants to learn about machine learning and begin exploring data mining.

2. Four Types of Machine Learning

  • Classification: Given a set of labelled examples, the goal is to learn how to classify new instances.
  • Association: Identifying interesting relationships or associations between attributes or combinations of attributes.
  • Clustering: Grouping similar examples together based on shared features.
  • Numerical Prediction: Predicting a numerical value instead of a category.
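
To make the difference between classification and numerical prediction concrete, here is a minimal sketch. It assumes a 1-nearest-neighbour rule, chosen purely for illustration, and a made-up toy dataset; the names `training`, `classify`, and `predict_score` are hypothetical.

```python
# Hypothetical toy data: each instance is (hours_studied, label, exam_score).
training = [
    (1.0, "fail", 35.0),
    (2.0, "fail", 48.0),
    (4.0, "pass", 62.0),
    (6.0, "pass", 81.0),
]

def nearest(hours):
    """Return the training instance whose hours_studied is closest."""
    return min(training, key=lambda row: abs(row[0] - hours))

def classify(hours):
    """Classification: predict a category (the label of the nearest instance)."""
    return nearest(hours)[1]

def predict_score(hours):
    """Numerical prediction: predict a number (the score of the nearest instance)."""
    return nearest(hours)[2]

print(classify(5.5))       # a category
print(predict_score(5.5))  # a numerical value
```

Both functions learn from the same labelled examples; the only difference is whether the predicted attribute is a category or a number.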

3. Workflow of Machine Learning

A typical machine learning workflow begins with acquiring data, followed by pre-processing, which cleans the dataset by filtering out unnecessary information. The processed data is then used to train a model, which is tested iteratively until the desired performance is achieved. Finally, the model is deployed. This is the traditional machine learning process. Today, most workflows also include steps to optimise the model, such as tuning hyperparameters. Many articles present these workflows in a complex manner, but I’ve included a simple flow diagram based on what my teacher, Ekaterina Komendantskaya, has demonstrated:

[Figure: flow diagram of the machine learning workflow]
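
Separately from the diagram, the steps described above can be sketched in code. This is only a rough illustration under strong assumptions: the data is synthetic, the "model" is a single threshold, and every function name is hypothetical. Deployment is left out.

```python
import random

def acquire_data(n=200, seed=0):
    """Step 1: acquire data. Synthetic (x, label) pairs; label is 1 when x > 5."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        rows.append((x, 1 if x > 5 else 0))
    rows.append((None, 1))  # a deliberately dirty row with a missing value
    return rows

def preprocess(rows):
    """Step 2: pre-process by filtering out rows with missing values."""
    return [(x, y) for x, y in rows if x is not None]

def accuracy(threshold, rows):
    """Fraction of rows that the rule 'predict 1 when x > threshold' gets right."""
    correct = sum((1 if x > threshold else 0) == y for x, y in rows)
    return correct / len(rows)

def train(rows, candidates):
    """Step 3: try each candidate threshold and keep the best one on the training rows."""
    return max(candidates, key=lambda t: accuracy(t, rows))

data = preprocess(acquire_data())
split = int(0.8 * len(data))
train_rows, test_rows = data[:split], data[split:]

# Candidate thresholds 0.0, 0.5, ..., 10.0 (an assumed search grid).
model = train(train_rows, candidates=[i / 2 for i in range(21)])

# Step 4: test on rows the model has not seen during training.
test_accuracy = accuracy(model, test_rows)
print(f"threshold={model}, test accuracy={test_accuracy:.2f}")
```

If the test accuracy were too low, the loop of training and testing would be repeated with different candidates or cleaner data before deploying anything.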

4. Description of the Data

(1) Instances

  • Individual, independent examples, each represented by a row of data.
  • Input to the learning algorithm = a set of instances (the dataset).
  • Represented as a single relation or a flat file.
  • This is a fairly restricted form of input.
  • Relationships between individual instances are not represented.

(2) Attributes

  • Each instance is described by a fixed, predefined set of attributes.
  • In practice, however, the number of attributes may vary from instance to instance.
  • The existence of one attribute can depend on the value of another.

Often, we are interested in predicting the value of a particular attribute, as it is influenced by the values of other attributes. The attribute we want to predict is called the class or target.

Addendum: In general, for tabular (two-dimensional) datasets, each row represents an instance and each column represents an attribute.
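
As a small illustration of rows as instances and columns as attributes, the sketch below reads a flat file with Python's standard `csv` module. The file contents and the choice of `play` as the target attribute are assumptions made up for this example.

```python
import csv
import io

# Hypothetical flat file: one row per instance, one column per attribute.
FLAT_FILE = """outlook,temperature,windy,play
sunny,hot,false,no
overcast,mild,false,yes
rainy,cold,true,no
"""

# Each instance becomes a dict that maps attribute name to value.
instances = list(csv.DictReader(io.StringIO(FLAT_FILE)))
attributes = list(instances[0].keys())

TARGET = "play"  # the class (target) attribute, assumed for this sketch
features = [{k: v for k, v in row.items() if k != TARGET} for row in instances]
labels = [row[TARGET] for row in instances]

print(len(instances), "instances,", len(attributes), "attributes")
print(labels)
```

Separating the target from the other attributes mirrors the idea that one attribute is predicted from the values of the rest.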

5. Types of Attribute Values

The four common types of attribute values are described below.

(1) Nominal

  • The term “nominal” is derived from the Latin word for “name.”
  • Values are distinct symbols, used solely as labels or names.
  • No implied relationship between nominal values.
  • Sorting or distance measurement is not applicable.
  • Only equality comparisons are possible.
  • Also referred to as categorical.

Example: Colour (red, yellow), Country (UK, USA).

(2) Ordinal

  • Values have a meaningful order.
  • The distance between values is not defined.
  • Addition and subtraction are not applicable.

Example: Temperature (hot, mild, cold).
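
The contrast between nominal and ordinal values can be sketched as follows. The rank table `TEMPERATURE_RANK` and the helper `is_warmer` are hypothetical names introduced for illustration; the ranks encode order only.

```python
# Nominal: only equality comparisons are meaningful.
colour_a, colour_b = "red", "yellow"
same_colour = colour_a == colour_b

# Ordinal: give the labels an explicit rank so they can be ordered.
# The ranks carry order only; the gaps between them mean nothing.
TEMPERATURE_RANK = {"cold": 0, "mild": 1, "hot": 2}

def is_warmer(a, b):
    """Return True when ordinal value a ranks above ordinal value b."""
    return TEMPERATURE_RANK[a] > TEMPERATURE_RANK[b]

readings = ["mild", "hot", "cold"]
ordered = sorted(readings, key=TEMPERATURE_RANK.get)

print(same_colour)
print(is_warmer("hot", "mild"))
print(ordered)
```

Sorting the colours this way would make no sense, because nominal values have no order to encode.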

(3) Interval

  • Values are ordered and measured in fixed, equal units.
  • The difference between two values is meaningful, but their sum or product is not.
  • The zero point is arbitrary rather than an absolute reference.

Example: Temperature in degrees Fahrenheit, years in the Gregorian calendar.
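
A short worked example, using made-up readings, shows why ratios of interval values are not meaningful while differences are. The conversion formula is the standard Fahrenheit-to-Celsius one; the variable names are illustrative.

```python
def fahrenheit_to_celsius(f):
    """Standard conversion between two interval scales with different zero points."""
    return (f - 32) * 5 / 9

# Hypothetical readings in degrees Fahrenheit.
morning_f, noon_f, evening_f = 50.0, 86.0, 68.0
morning_c, noon_c, evening_c = (
    fahrenheit_to_celsius(t) for t in (morning_f, noon_f, evening_f)
)

# Ratios of raw values depend on the scale, so they carry no meaning.
ratio_f = noon_f / morning_f
ratio_c = noon_c / morning_c

# Ratios of differences are the same on both scales.
diff_ratio_f = (noon_f - morning_f) / (noon_f - evening_f)
diff_ratio_c = (noon_c - morning_c) / (noon_c - evening_c)

print(ratio_f, ratio_c)
print(diff_ratio_f, diff_ratio_c)
```

The first pair of numbers disagrees because each scale places zero at a different point; the second pair agrees because subtracting two readings cancels the arbitrary zero.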

(4) Ratio

  • Quantities are measured against a true zero point.
  • Ratio values are treated as real numbers.
  • All mathematical operations are allowed.
  • The difference between two interval values is a ratio quantity (for example, the number of years between two dates).

Example: Distance between objects.
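
For ratio values, statements such as "three times as far" survive a change of unit. The sketch below uses made-up distances and years to illustrate this; all names are hypothetical.

```python
def metres_to_kilometres(m):
    """Change of unit only; zero metres is still zero kilometres."""
    return m / 1000

# Hypothetical distances in metres.
short_m, long_m = 500.0, 1500.0

ratio_m = long_m / short_m
ratio_km = metres_to_kilometres(long_m) / metres_to_kilometres(short_m)

# Subtracting interval values (calendar years) gives ratio quantities (durations).
span_a = 2020 - 2000
span_b = 2010 - 2000
span_ratio = span_a / span_b

print(ratio_m, ratio_km)
print(span_ratio)
```

Saying that one span is twice as long as the other is meaningful, even though saying that the year 2020 is "1.01 times" the year 2000 is not.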

Copyright Notice
This article, except for the referenced content below, is the original work of Junhao. The author retains the exclusive rights to its final interpretation. If there are any issues regarding copyright infringement, please contact me for removal. Reproduction or distribution of this content without my explicit permission is prohibited.

6. Reference

[1]. Data Mining Book https://www.cs.waikato.ac.nz/ml/weka/book.html