1. Overview of ML Analysis Platform Functions

To facilitate data analysis and model training within the RAInS project (AI Accountability System Project: RAInS Project Website), a machine learning analysis platform will be developed in Python. This platform will support data analysis and modelling, offering features such as a web-based user interface for interaction. It will include functionality to select datasets from local files, automate visual analysis, perform regression and classification analyses, view training logs, examine model parameters and results, and visualise outcomes. Additionally, it will generate the required JSON files, predict new datasets, and offer advanced functionalities such as anomaly detection and association rule mining.

2. Framework and Required Python Packages

Please install the necessary Python packages via Conda or Pip before using the platform.

import os
import mlflow
# Using Streamlit for interactive web use
import streamlit as st
import pandas as pd
# Used to display the report on the webpage
from streamlit_pandas_profiling import st_profile_report
# Used to generate reports
from pandas_profiling import ProfileReport
# Machine learning classification
import pycaret.classification as pc_cl
# Machine learning regression
import pycaret.regression as pc_rg

3. Program File Directory

  • data: Stores the training and testing datasets.
  • logs.log: Records the log information generated by the platform during operation.
  • mlruns: Manages machine learning experiment logs and training records using MLflow.
  • main.py: Contains the main program code for the machine learning analysis platform.

4. Design and Approach

(1) Initial Concept of the Program

Initially, I aimed to implement a data flow that could capture the machine learning (ML) process, including information such as training data, whether the data was pre-processed, the real data input into ML after deployment, and ML’s predictions. Additionally, I aimed to gather information regarding runtime anomalies (e.g., memory overflow or CPU overload), hardware errors, real input data formats, and data abnormalities. These records would be saved as JSON files to serve as an interface for other parts of the project. I initially used OpenCV to capture camera information, object movement times, and anomaly details.

However, I soon realised a significant limitation: building around specific machine learning models with bespoke methods and parameters would not yield a general solution. Since machine learning does not employ a uniform template for every task, it became necessary to find a general approach that could accommodate specific methods as subsets.

(2) Program Design

Machine learning relies heavily on data science methodologies. I used UCI datasets to source the required data and allowed developers to define these datasets independently, utilising pandas-profiling for data definition, analysis, and visualisation. The generated reports were stored as JSON files to assist engineers in model development and accountability.
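A minimal sketch of this reporting step (the dataset path data/heart.csv is purely an illustrative placeholder): pandas-profiling can write its report directly to JSON by passing a .json file name to to_file.

import pandas as pd
from pandas_profiling import ProfileReport

# illustrative dataset path; any CSV placed under ./data would do
df = pd.read_csv('data/heart.csv')

# build an exploratory report and export it as JSON for other project components
profile = ProfileReport(df, title='Dataset Report', explorative=True)
profile.to_file('report.json')  # the .json extension selects JSON output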

The machine learning process generally involves collecting data, exploring and pre-processing the data, training the model, evaluating it, and finally optimising the model.

For the machine learning analysis, I used the pycaret library for modelling and analysis, employing regression and classification techniques. The program starts by extracting all column names from the dataset so that developers can select the desired features, and then appropriate algorithms are chosen based on the task. pycaret saves the entire process, including the logs, to the logs.log file.
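The sketch below mirrors that flow, assuming the pycaret 2.x API used elsewhere in this post (the dataset path and the choice of the last column as the target are illustrative placeholders):

import pandas as pd
import pycaret.regression as pc_rg

df = pd.read_csv('data/house_prices.csv')  # illustrative dataset
cols = df.columns.to_list()  # the column names offered to the developer

# setup() preprocesses the data and starts logging; create_model() trains
pc_rg.setup(df, target=cols[-1], log_experiment=True,
            experiment_name='ml_', silent=True, verbose=False)
lr_model = pc_rg.create_model('lr', verbose=False)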

To manage the models and track predictions, I employed MLflow (MLflow Website). MLflow’s Tracking feature records parameters and results from each run, providing visualisations of model performance. Conveniently, MLflow is already integrated with pycaret, allowing seamless management of training records and logs during execution. Developers can access detailed information by invoking the load_model function, and only need to provide a dataset to complete model predictions.
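As a sketch of that prediction step ('&lt;run_id&gt;' and the CSV path are placeholders), a model logged during training can be reloaded from the tracking store and applied to new data:

import mlflow.sklearn
import pandas as pd

# '<run_id>' stands for a run ID taken from the MLflow run table
model = mlflow.sklearn.load_model('runs:/<run_id>/model/')
new_data = pd.read_csv('data/new_samples.csv')  # illustrative dataset
predictions = model.predict(new_data)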

MLflow is an end-to-end machine learning lifecycle management tool developed by Databricks (the company behind Spark). It offers the following features:

  • Tracking and recording experiment processes, enabling cross-comparisons of parameters and results (MLflow Tracking).
  • Packaging code into reusable formats for sharing and deployment (MLflow Projects).
  • Managing and deploying models from various machine learning frameworks to a wide range of platforms (MLflow Models).
  • Providing collaborative management for the entire model lifecycle, including version management, state conversion, and data annotation (MLflow Model Registry).

MLflow is independent of third-party machine learning libraries and supports all machine learning frameworks and languages via REST API and CLI. To make usage easier, SDKs are provided for Python, R, and Java.
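A minimal MLflow Tracking sketch (the parameter and metric names below are purely illustrative):

import mlflow

with mlflow.start_run(run_name='example-run'):
    mlflow.log_param('model', 'lr')   # record a hyperparameter
    mlflow.log_metric('rmse', 0.42)   # record a result
# each run is then visible in the MLflow UI and via mlflow.search_runs()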

For visualisation and UI interaction, I used Streamlit (Streamlit Website). The Streamlit library contains components that meet the needs of most developers, enabling easy HTML design and web UI deployment with single-function calls.

Streamlit is a Python-based visualisation tool that generates interactive web pages. Unlike Django or Flask, Streamlit is not a complete web framework but focuses solely on visualisation.
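To illustrate how little code a Streamlit page requires (the CSV path is a placeholder), a stripped-down version of this platform's UI could look like this:

import streamlit as st
import pandas as pd

st.title('Machine learning analysis platform')
nrows = st.sidebar.number_input('Number of lines', value=10)
df = pd.read_csv('data/example.csv', nrows=int(nrows))  # illustrative file
st.dataframe(df)  # rendered as an interactive table in the browser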

The platform’s UI design is shown below:

5. Functionality Overview of the ML Platform

(1) Project Deployment and Dataset Analysis

First, clone the project from GitHub to your local environment using git clone. Use pip or Conda to install the required Python packages. Write or debug the program using a Python IDE. Use Streamlit to run the main.py program by entering 'streamlit run main.py' in the terminal. As shown in the image below, port 8501 will be opened (with a local URL and a network URL), allowing you to access the UI page through the browser.

In the browser interface, you will see four functionalities on the left-hand side: “Define Data Source,” “Dataset Analysis,” “Modelling,” and “View System Log.” The user needs to place the dataset in the ./data directory, and they can select the desired model for machine learning training. Users can also specify the number of rows to be read and visualise the dataset by generating a report, as shown below:

Exploratory data analysis involves the following aspects (a few quick pandas checks are sketched after the list):

  • Are there missing values?
  • Are there any outliers?
  • Are there duplicate values?
  • Is the sample balanced?
  • Is sampling needed?
  • Does the variable require transformation?
  • Should new features be added?
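Several of these questions reduce to one-line pandas checks; a quick sketch (the file path and the 'target' column name are illustrative):

import pandas as pd

df = pd.read_csv('data/example.csv')  # illustrative file
print(df.isnull().sum())        # are there missing values?
print(df.duplicated().sum())    # are there duplicate rows?
print(df['target'].value_counts(normalize=True))  # is the sample balanced?
print(df.describe())            # summary statistics help spot outliers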

When developers need to access platform data and logs, they can view all system configuration files by selecting the Reproduction option. The config.json file can be downloaded at any time for further work.

(2) Machine Learning Modelling and System Logs

Once the data source has been defined and analysed, modelling can commence. The platform simplifies the modelling process, focusing on Regression and Classification—the two most widely used algorithms in data science. The major difference between these algorithms lies in the form of the loss function: quantitative outputs involve regression algorithms, which predict continuous variables, while classification algorithms are qualitative, predicting discrete variables. Developers can add additional models, such as XGBoost, SVM, and logistic regression. Finally, the developer selects an object from the dataset for prediction (currently, the project does not support cross-validation or simultaneous prediction of multiple objects). To handle this part of the process, three lists are created in the code to store the modelling parameters.
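For reference, these are the three lists as they appear later in main.py:

# the supported tasks and, for each task, the model identifiers offered to the user
ML_LIST = ['Regression', 'Classification']
RG_LIST = ['lr', 'svm', 'rf', 'xgboost', 'lightgbm']
CL_LIST = ['lr', 'dt', 'svm', 'rf', 'xgboost', 'lightgbm']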

After completing the machine learning modelling, developers can select their trained models and datasets for direct use. They can also access system logs to assist in analysis and refinement. Users can specify the number of log lines to read and review, as shown below:

6. Explanation of Code for the ML Platform

Several helper functions are defined prior to the main function. For example:
concatFilePath(file_folder, file_selected) is used to generate the full path to the data file, which can then be used to load the dataset.

# get the full path of the file, used to read the dataset
def concatFilePath(file_folder, file_selected):
    if str(file_folder)[-1] != '/':
        fileSelectedPath = file_folder + '/' + file_selected
    else:
        fileSelectedPath = file_folder + file_selected
    return fileSelectedPath
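As a design note, the same helper could be written with the standard library, which handles the trailing-slash logic automatically; a one-line alternative sketch:

import os

def concatFilePath(file_folder, file_selected):
    return os.path.join(file_folder, file_selected)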

The getModelTrainingLogs(n_lines=10) function reads logs.log and returns the last n_lines lines of the log; the user can set how many lines to display:

# read logs.log and return the last n_lines lines;
# the user can set how many lines to display
def getModelTrainingLogs(n_lines=10):
    with open('logs.log', 'r') as file:
        lines = file.read().splitlines()
    return lines[-n_lines:]
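If logs.log ever grows very large, an alternative sketch that avoids holding the whole file in memory is a collections.deque with maxlen, which streams the file and keeps only the last n_lines lines:

from collections import deque

def getModelTrainingLogs(n_lines=10):
    with open('logs.log', 'r') as f:
        return [line.rstrip('\n') for line in deque(f, maxlen=n_lines)]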

Finally, for performance, the load_csv function places the dataset in a cache when it loads it, so reloading a previously used dataset does not occupy system resources again.

# load the dataset and put it into the cache
@st.cache(suppress_st_warning=True)
def load_csv(file_selected_path, nrows):
    try:
        if nrows == -1:
            df = pd.read_csv(file_selected_path)
        else:
            df = pd.read_csv(file_selected_path, nrows=nrows)
    except Exception as ex:
        df = pd.DataFrame([])
        st.exception(ex)
    return df
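Usage is transparent to callers: thanks to st.cache, a second call with identical arguments is served from the cache rather than re-read from disk (the path below is a placeholder):

df_all = load_csv('data/example.csv', nrows=-1)        # first call: reads from disk
df_all_again = load_csv('data/example.csv', nrows=-1)  # same arguments: cache hit
df_head = load_csv('data/example.csv', nrows=100)      # new arguments: read again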

7. Full code presentation of the ML Analysis Platform

"""
RAInS Project: machine-learning analysis platform
Author: Junhao Song
Email: songjh.john@gmail.com
Website: http://junhaosong.com/
"""

import os
import mlflow
# use streamlit to achieve interactive use on the web side
import streamlit as st
import pandas as pd
# used to display the report in the web page
from streamlit_pandas_profiling import st_profile_report
# used to generate reports
from pandas_profiling import ProfileReport
# machine learning classification
import pycaret.classification as pc_cl
# machine learning regression
import pycaret.regression as pc_rg

# store some commonly used machine learning modeling techniques
ML_LIST = ['Regression', 'Classification']
RG_LIST = ['lr', 'svm', 'rf', 'xgboost', 'lightgbm']
CL_LIST = ['lr', 'dt', 'svm', 'rf', 'xgboost', 'lightgbm']


# list the files with a given extension in a folder
def listFiles(directory, extension):
    return [f for f in os.listdir(directory) if f.endswith('.' + extension)]


# read logs.log and return the last n_lines lines;
# the user can set how many lines to display
def getModelTrainingLogs(n_lines=10):
    with open('logs.log', 'r') as file:
        lines = file.read().splitlines()
    return lines[-n_lines:]


# get the full path of the file, used to read the dataset
def concatFilePath(file_folder, file_selected):
    if str(file_folder)[-1] != '/':
        fileSelectedPath = file_folder + '/' + file_selected
    else:
        fileSelectedPath = file_folder + file_selected
    return fileSelectedPath


# load the dataset and put it into the cache
@st.cache(suppress_st_warning=True)
def load_csv(file_selected_path, nrows):
    try:
        if nrows == -1:
            df = pd.read_csv(file_selected_path)
        else:
            df = pd.read_csv(file_selected_path, nrows=nrows)
    except Exception as ex:
        df = pd.DataFrame([])
        st.exception(ex)
    return df


def app_main():
    st.title("Machine learning analysis platform")
    # defaults so the later sections work even if no data source has been defined
    file_selected_path = None
    nrows = 100
    if st.sidebar.checkbox('Define Data Source'):
        filesFolder = st.sidebar.text_input('folder', value="data")
        dataList = listFiles(filesFolder, 'csv')
        if len(dataList) == 0:
            st.warning('No data set available')
        else:
            file_selected = st.sidebar.selectbox(
                'Select a document', dataList)
            file_selected_path = concatFilePath(filesFolder, file_selected)
            nrows = st.sidebar.number_input('Number of lines', value=-1)
            n_rows_str = 'All' if nrows == -1 else str(nrows)
            st.info(f'Selected file: {file_selected_path}, number of rows read: {n_rows_str}')
    else:
        st.warning('The currently selected file is empty, please select:')
    if st.sidebar.checkbox('Exploratory Analysis'):
        if file_selected_path is not None:
            if st.sidebar.button('Report Generation'):
                df = load_csv(file_selected_path, nrows)
                pr = ProfileReport(df, explorative=True)
                st_profile_report(pr)
        else:
            st.info('No file selected, analysis cannot be performed')
    if st.sidebar.checkbox('Modeling'):
        if file_selected_path is not None:
            task = st.sidebar.selectbox('Select Task', ML_LIST)
            if task == 'Regression':
                model = st.sidebar.selectbox('Select Model', RG_LIST)
            elif task == 'Classification':
                model = st.sidebar.selectbox('Select Model', CL_LIST)
            df = load_csv(file_selected_path, nrows)
            try:
                cols = df.columns.to_list()
                target_col = st.sidebar.selectbox('Select Prediction Object', cols)
            except BaseException:
                st.sidebar.warning('The data format cannot be read correctly')
                target_col = None

            if target_col is not None and st.sidebar.button('Training Model'):
                if task == 'Regression':
                    st.success('Data preprocessing...')
                    pc_rg.setup(
                        df,
                        target=target_col,
                        log_experiment=True,
                        experiment_name='ml_',
                        log_plots=True,
                        silent=True,
                        verbose=False,
                        profile=True)
                    st.success('Data preprocessing is complete')
                    st.success('Training model...')
                    pc_rg.create_model(model, verbose=False)
                    st.success('The model training is complete')
                    # pc_rg.finalize_model(model)
                    st.success('Model has been created')
                elif task == 'Classification':
                    st.success('Data preprocessing...')
                    pc_cl.setup(
                        df,
                        target=target_col,
                        fix_imbalance=True,
                        log_experiment=True,
                        experiment_name='ml_',
                        log_plots=True,
                        silent=True,
                        verbose=False,
                        profile=True)
                    st.success('Data preprocessing is complete')
                    st.success('Training model...')
                    pc_cl.create_model(model, verbose=False)
                    st.success('The model training is complete')
                    # pc_cl.finalize_model(model)
                    st.success('Model has been created')
    if st.sidebar.checkbox('View System Log'):
        n_lines = st.sidebar.slider(label='Number of lines', min_value=3, max_value=50)
        if st.sidebar.button("Check View"):
            logs = getModelTrainingLogs(n_lines=n_lines)
            st.text('System log')
            st.write(logs)
    try:
        # experiment id '0' is the default experiment created by MLflow
        allOfRuns = mlflow.search_runs(experiment_ids=['0'])
    except Exception:
        allOfRuns = []
    if len(allOfRuns) != 0:
        if st.sidebar.checkbox('Preview model'):
            ml_logs = 'Start the MLflow UI from the command line with `mlflow ui`, then open http://kubernetes.docker.internal:5000/'
            st.markdown(ml_logs)
            st.dataframe(allOfRuns)
        if st.sidebar.checkbox('Choose a model'):
            selected_run_id = st.sidebar.selectbox(
                'Choose from saved models',
                allOfRuns[allOfRuns['tags.Source'] == 'create_model']['run_id'].tolist())
            selected_run_info = allOfRuns[(
                allOfRuns['run_id'] == selected_run_id)].iloc[0, :]
            st.code(selected_run_info)
            if st.sidebar.button('Forecast data'):
                model_uri = 'runs:/' + selected_run_id + '/model/'
                model_loaded = mlflow.sklearn.load_model(model_uri)
                df = pd.read_csv(file_selected_path, nrows=nrows)
                # st.success('Model prediction...')
                pred = model_loaded.predict(df)
                pred_df = pd.DataFrame(pred, columns=['Predictive Data'])
                st.dataframe(pred_df)
                pred_df.plot()
                st.pyplot()
    else:
        st.sidebar.warning('Did not find a trained model')


if __name__ == '__main__':
    app_main()

The demo video link of the project is as follows:

Demo Video

Finally, I would like to thank my mentor Wei Pang for his academic guidance and Danny for his technical help.

Copyright Notice
This article, except for the referenced content below, is the original work of Junhao. The author retains the exclusive rights to its final interpretation. If there are any issues regarding copyright infringement, please contact me for removal. Reproduction or distribution of this content without my explicit permission is prohibited.

8. References

[1] Kaggle, XGBoost: https://www.kaggle.com/alexisbcook/xgboost

[2] Kaggle, Missing Values: https://www.kaggle.com/alexisbcook/missing-values

[3] MLflow Tracking: https://mlflow.org/docs/latest/tracking.html

[4] Google AutoML Tables: https://cloud.google.com/automl-tables/docs/beginners-guide

[5] The 7 Steps of Machine Learning: https://towardsdatascience.com/the-7-steps-of-machine-learning-2877d7e5548e

[6] Scikit-learn, Getting Started: https://scikit-learn.org/stable/getting_started.html#model-evaluation

[7] UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets.php

[8] Wikipedia, Gradient Boosting: https://en.wikipedia.org/wiki/Gradient_boosting

[9] Shuhari Blog, Streamlit Intro: https://shuhari.dev/blog/2020/02/streamlit-intro