1. Overview of ML analysis platform functions

To record the basic information of data analysis and training data in the RAInS project (AI Accountability System Project: RAInS Project Website), Python is used to build a machine learning analysis platform that supports data analysis and data modeling. The platform implements the following functions: a web-based user interface for interaction; selecting data sets from the local file system; automated visual analysis; regression analysis and classification analysis; viewing training records; viewing the parameters, results and plots of trained models; and generating the required JSON files. It can also run predictions on new data sets, with detailed functions such as anomaly detection and rule association.

2. The framework and the required Python packages

Please install the required Python packages through Conda or Pip before use, for example: pip install streamlit pandas pycaret mlflow pandas-profiling streamlit-pandas-profiling

import os
import mlflow
# use streamlit to achieve interactive use on the web side
import streamlit as st
import pandas as pd
# used to display the report in the web page
from streamlit_pandas_profiling import st_profile_report
# used to generate reports
from pandas_profiling import ProfileReport
# machine learning classification
import pycaret.classification as pc_cl
# machine learning regression
import pycaret.regression as pc_rg

3. Program file directory

data: stores the training and test data sets
logs.log: records the log information generated by the system while the platform is running
mlruns: stores the run records that mlflow uses to manage training
main.py: the main program code of the machine learning analysis platform

4. Ideas and design

(1). The idea of the program

At the beginning, I hoped to capture the data flow of a machine learning (ML) system: basic information about the training data, whether the training data was processed, the real data actually fed to the ML system after deployment, the ML system's predictions on that real data, and how long the predictions take. I also tried to obtain information such as whether anomalies occur at runtime (memory overflow, CPU overload, etc.), whether the hardware reports errors, and whether the real input data has an abnormal format or size. These data would then be recorded and exported as JSON files to serve as an interface to the other parts of the project. On this basis I implemented the capture of camera information, object movement time and anomaly information with OpenCV. Soon, however, I realized a serious problem: building this on top of one specific ML system only covers that system's specific methods and parameters, and machine learning does not use the same template or a single standard for every task. Is there a general approach that treats each special method as a subset, so that all these needs can be met?

(2). Program design

Machine learning is inseparable from the assistance of data science. I use UCI data sets as the data sources I need, and developers can also define these data sets themselves. Data definition and analysis are completed through pandas-profiling, which visualizes the raw data used for machine learning, and the generated report can be saved as a JSON file. This provides information and help for engineers in later modeling and accountability work.
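As a minimal sketch of this step (assuming pandas-profiling v2.x and a hypothetical iris.csv placed in ./data), a report can be generated and saved as JSON like this:

import pandas as pd
from pandas_profiling import ProfileReport

# load a data set (data/iris.csv is a hypothetical example file)
df = pd.read_csv('data/iris.csv')

# generate an exploratory report and save it; a .json suffix selects JSON output
profile = ProfileReport(df, explorative=True)
profile.to_file('iris_report.json')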

The general process of machine learning is roughly divided into collecting data, exploring data and preprocessing data; once the data has been processed, the next steps are to train the model, evaluate it and then optimize it.

To obtain information and records across a complete machine learning process, I used the pycaret library to complete the modeling and analysis, implementing machine learning analysis through regression analysis and classification prediction. First, all column names are read from the chosen data set so that developers can freely choose the target they need, and the algorithm is selected according to the chosen task. Finally, pycaret saves the log of the entire process to the logs.log file.
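A minimal sketch of this flow with pycaret's regression module (pycaret 2.x API assumed; the data set and the target column 'price' are hypothetical):

import pandas as pd
import pycaret.regression as pc_rg

df = pd.read_csv('data/house_prices.csv')  # hypothetical data set

# setup() preprocesses the data; with log_experiment=True the whole run
# (parameters, metrics, artifacts) is recorded through mlflow
pc_rg.setup(df, target='price', log_experiment=True,
            experiment_name='ml_', silent=True, verbose=False)

# train one model chosen by the developer, e.g. 'lightgbm'
lgbm = pc_rg.create_model('lightgbm', verbose=False)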
For managing all of the models and the prediction work, I thought of using mlflow (MLflow Website). Its Tracking component can record the parameters and results of every run, visualizations of the model, and other data. I was pleasantly surprised to find that mlflow support is already built into pycaret: when pycaret runs, it automatically uses mlflow to manage run records, logs, model information and so on. More model information and data can be obtained by calling the load_model function. Finally, it is very convenient that a developer only needs to input a data set to complete prediction with a saved model.
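The retrieval side can be sketched as follows, mirroring what the platform does internally (the data file is hypothetical; experiment id "0" is mlflow's default experiment):

import mlflow
import mlflow.sklearn
import pandas as pd

# list all runs recorded under the default experiment
runs = mlflow.search_runs(experiment_ids=["0"])
run_id = runs['run_id'].iloc[0]  # in the platform, the user picks a run

# load the model logged by pycaret and predict on a new data set
model = mlflow.sklearn.load_model('runs:/' + run_id + '/model/')
pred = model.predict(pd.read_csv('data/new_data.csv'))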

MLflow is an end-to-end machine learning lifecycle management tool launched by Databricks (the company behind Spark). It has the following four components:

  • Track and record the experiment process, and cross-compare experiment parameters and the corresponding results (MLflow Tracking).
  • Package code into a reusable and reproducible format that can be shared between members and deployed online (MLflow Projects).
  • Manage and deploy models from multiple different machine learning frameworks to most model deployment and inference platforms (MLflow Models).
  • For full lifecycle management of models, provide centralized collaborative management, including model version management, model stage transitions and data annotation (MLflow Model Registry).

MLflow is independent of third-party machine learning libraries and can be used with any machine learning library and any language, because all of its functions are exposed through a REST API and a CLI. To make invocation more convenient, it also provides SDKs for Python, R and Java.
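To illustrate the Tracking component, here is a minimal self-contained sketch using mlflow's logging API (the parameter and metric names are made up for illustration):

import mlflow

# each run records its parameters, metrics and artifacts under ./mlruns
with mlflow.start_run(run_name='demo'):
    mlflow.log_param('model', 'lightgbm')  # hypothetical parameter
    mlflow.log_metric('rmse', 0.42)        # hypothetical metric

# runs can then be compared in the browser by running: mlflow ui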

Finally, to realize the visualization and UI interaction of the program, I used streamlit (Streamlit Website). The components in the streamlit library meet the needs of most developers: when designing a web UI, a single function call is enough to design and deploy an HTML component.

Streamlit is a Python-based visualization tool which, unlike other visualization tools, generates an interactive site (page). At the same time, it is not a web framework like the Django and Flask frameworks we often come into contact with.
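A minimal sketch of the streamlit style used throughout this platform (the widget labels are taken from the platform's own UI; the info message is illustrative):

import streamlit as st

st.title('Machine learning analysis platform')
# one function call per widget; streamlit re-runs the script on every interaction
if st.sidebar.checkbox('Define Data Source'):
    folder = st.sidebar.text_input('folder', value='data')
    st.info(f'Reading data sets from {folder}')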

The UI design of the platform is as follows:

5. Introduction to functions of the ML Platform

(1). Project deployment and data set reading and analysis

First, clone the project from GitHub to the local machine with git clone plus the repository URL, and use pip or Conda to install the required Python packages. A Python IDE can be used to write or debug the program. To run the project's main.py with streamlit, enter 'streamlit run main.py' in the terminal. The information shown in the picture below indicates that port 8501 has been opened (a Network URL is listed alongside the Local URL), and the program's UI page can then be opened in the browser.

In the browser, the left-hand panel shows four functions (define data source, data set analysis, modeling, view system log). Users need to put the data sets they require in the ./data directory. They can choose the model they need to complete machine learning training and modeling, choose the number of rows to read, and complete the visualization of the data set by generating a report. As shown in the figure:

The exploratory analysis of the data covers the following aspects (a short pandas sketch follows the list):

  • Are there missing values?
  • Are there any outliers?
  • Are there duplicate values?
  • Is the sample balanced?
  • Do you need to sample?
  • Does the variable need to be converted?
  • Do you need to add new features?
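A minimal pandas sketch of the first few checks, assuming a hypothetical data set with a 'label' column:

import pandas as pd

df = pd.read_csv('data/example.csv')  # hypothetical file

print(df.isnull().sum())           # missing values per column
print(df.duplicated().sum())       # number of duplicated rows
print(df.describe())               # value ranges, useful for spotting outliers
print(df['label'].value_counts())  # class balance of the hypothetical label column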

When developers need to use the data and records from this part of the platform, they can see the full configuration of the report by selecting the Reproduction option, and can download the config.json file at any time to complete other work.

(2). Machine learning modeling and viewing system logs

After the data source required by the user has been defined and analyzed, modeling can begin. The modeling part of the project is very convenient to use and highly usable. The two most commonly used kinds of algorithms in data science are used in the modeling: regression and classification. In my opinion, the biggest difference between these two lies in the form of the loss function: quantitative output is called regression and belongs to continuous-variable prediction, while classification is qualitative and is discrete-variable prediction. Developers can then add model algorithms, such as xgboost, svm, lr and other common algorithms. Finally, the developer chooses one object in the data set to predict (currently the project does not support simultaneous prediction of multiple objects, cross-validation, etc.). For this part of the function, three lists are created in the code to store the modeling parameters that will be used, as shown below.
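These are the three lists as they appear in the platform's code; the abbreviations are pycaret model IDs (for example, lr is linear/logistic regression, dt is a decision tree, rf is a random forest):

# store some commonly used machine learning modeling techniques
ML_LIST = ['Regression', 'Classification']                  # available tasks
RG_LIST = ['lr', 'svm', 'rf', 'xgboost', 'lightgbm']        # regression models
CL_LIST = ['lr', 'dt', 'svm', 'rf', 'xgboost', 'lightgbm']  # classification models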

After the machine learning modeling work is completed, developers can choose their own trained models and data sets and use the models directly. Developers can also view the system logs to help analyze and improve, and can choose how many lines of the system log to read and check. As shown in the figure:

6. Explanation of part of the code of the ML Platform

There are several auxiliary functions before the main function:
concatFilePath(file_folder, file_selected) is used to get the full path of a data file so that the data set can be read:

# get the full path of the file, used to read the dataset
def concatFilePath(file_folder, file_selected):
    if str(file_folder)[-1] != '/':
        fileSelectedPath = file_folder + '/' + file_selected
    else:
        fileSelectedPath = file_folder + file_selected
    return fileSelectedPath

The getModelTrainingLogs(n_lines=10) function reads logs.log and displays the last selected number of lines; the user can set how many lines are shown:

# read logs.log and return the last n_lines lines;
# the user can set the number of lines
def getModelTrainingLogs(n_lines=10):
    file = open('logs.log', 'r')
    lines = file.read().splitlines()
    file.close()
    return lines[-n_lines:]

Finally, for the performance of the program, the load_csv function puts the data set into the cache when loading it, so that repeatedly loading a previously used data set does not occupy system resources again.

# load the data set and put it into the cache
@st.cache(suppress_st_warning=True)
def load_csv(file_selected_path, nrows):
    try:
        if nrows == -1:
            df = pd.read_csv(file_selected_path)
        else:
            df = pd.read_csv(file_selected_path, nrows=nrows)
    except Exception as ex:
        df = pd.DataFrame([])
        st.exception(ex)
    return df

7. Full code presentation of the ML Analysis Platform

"""
RAInS Project: machine-learning analysis platform
Author: Junhao Song
Email: songjh.john@gmail.com
Website: http://junhaosong.com/
"""

import os
import mlflow
# use streamlit to achieve interactive use on the web side
import streamlit as st
import pandas as pd
# used to display the report in the web page
from streamlit_pandas_profiling import st_profile_report
# used to generate reports
from pandas_profiling import ProfileReport
# machine learning classification
import pycaret.classification as pc_cl
# machine learning regression
import pycaret.regression as pc_rg

# store some commonly used machine learning modeling techniques
ML_LIST = ['Regression', 'Classification']
RG_LIST = ['lr', 'svm', 'rf', 'xgboost', 'lightgbm']
CL_LIST = ['lr', 'dt', 'svm', 'rf', 'xgboost', 'lightgbm']


# list files with a certain extension in the folder
def listFiles(directory, extension):
    return [f for f in os.listdir(directory) if f.endswith('.' + extension)]


# read logs.log and return the last n_lines lines;
# the user can set the number of lines
def getModelTrainingLogs(n_lines=10):
    file = open('logs.log', 'r')
    lines = file.read().splitlines()
    file.close()
    return lines[-n_lines:]


# get the full path of the file, used to read the dataset
def concatFilePath(file_folder, file_selected):
    if str(file_folder)[-1] != '/':
        fileSelectedPath = file_folder + '/' + file_selected
    else:
        fileSelectedPath = file_folder + file_selected
    return fileSelectedPath


# load the data set and put it into the cache
@st.cache(suppress_st_warning=True)
def load_csv(file_selected_path, nrows):
    try:
        if nrows == -1:
            df = pd.read_csv(file_selected_path)
        else:
            df = pd.read_csv(file_selected_path, nrows=nrows)
    except Exception as ex:
        df = pd.DataFrame([])
        st.exception(ex)
    return df


def app_main():
    st.title("Machine learning analysis platform")
    if st.sidebar.checkbox('Define Data Source'):
        filesFolder = st.sidebar.text_input('folder', value="data")
        dataList = listFiles(filesFolder, 'csv')
        if len(dataList) == 0:
            st.warning('No data set available')
        else:
            file_selected = st.sidebar.selectbox(
                'Select a document', dataList)
            file_selected_path = concatFilePath(filesFolder, file_selected)
            nrows = st.sidebar.number_input('Number of lines', value=-1)
            n_rows_str = 'All' if nrows == -1 else str(nrows)
            st.info(f'Selected file: {file_selected_path}, number of rows read: {n_rows_str}')
    else:
        file_selected_path = None
        nrows = 100
        st.warning('The currently selected file is empty, please select:')

    if st.sidebar.checkbox('Exploratory Analysis'):
        if file_selected_path is not None:
            if st.sidebar.button('Report Generation'):
                df = load_csv(file_selected_path, nrows)
                pr = ProfileReport(df, explorative=True)
                st_profile_report(pr)
        else:
            st.info('No file selected, analysis cannot be performed')

    if st.sidebar.checkbox('Modeling'):
        if file_selected_path is not None:
            task = st.sidebar.selectbox('Select Task', ML_LIST)
            if task == 'Regression':
                model = st.sidebar.selectbox('Select Model', RG_LIST)
            elif task == 'Classification':
                model = st.sidebar.selectbox('Select Model', CL_LIST)
            df = load_csv(file_selected_path, nrows)
            try:
                cols = df.columns.to_list()
                target_col = st.sidebar.selectbox('Select Prediction Object', cols)
            except BaseException:
                st.sidebar.warning('The data format cannot be read correctly')
                target_col = None

            if target_col is not None and st.sidebar.button('Training Model'):
                if task == 'Regression':
                    st.success('Data preprocessing...')
                    pc_rg.setup(
                        df,
                        target=target_col,
                        log_experiment=True,
                        experiment_name='ml_',
                        log_plots=True,
                        silent=True,
                        verbose=False,
                        profile=True)
                    st.success('Data preprocessing is complete')
                    st.success('Training model...')
                    pc_rg.create_model(model, verbose=False)
                    st.success('The model training is complete')
                    # pc_rg.finalize_model(model)
                    st.success('Model has been created')
                elif task == 'Classification':
                    st.success('Data preprocessing...')
                    pc_cl.setup(
                        df,
                        target=target_col,
                        fix_imbalance=True,
                        log_experiment=True,
                        experiment_name='ml_',
                        log_plots=True,
                        silent=True,
                        verbose=False,
                        profile=True)
                    st.success('Data preprocessing is complete')
                    st.success('Training model...')
                    pc_cl.create_model(model, verbose=False)
                    st.success('The model training is complete')
                    # pc_cl.finalize_model(model)
                    st.success('Model has been created')

    if st.sidebar.checkbox('View System Log'):
        n_lines = st.sidebar.slider(label='Number of lines', min_value=3, max_value=50)
        if st.sidebar.button("Check View"):
            logs = getModelTrainingLogs(n_lines=n_lines)
            st.text('System log')
            st.write(logs)

    try:
        allOfRuns = mlflow.search_runs(experiment_ids=0)
    except Exception:
        allOfRuns = []
    if len(allOfRuns) != 0:
        if st.sidebar.checkbox('Preview model'):
            ml_logs = 'http://kubernetes.docker.internal:5000/ --> open mlflow by entering on the command line: mlflow ui'
            st.markdown(ml_logs)
            st.dataframe(allOfRuns)
        if st.sidebar.checkbox('Choose a model'):
            selected_run_id = st.sidebar.selectbox(
                'Choose from saved models',
                allOfRuns[allOfRuns['tags.Source'] == 'create_model']['run_id'].tolist())
            selected_run_info = allOfRuns[(
                allOfRuns['run_id'] == selected_run_id)].iloc[0, :]
            st.code(selected_run_info)
            if st.sidebar.button('Forecast data'):
                model_uri = 'runs:/' + selected_run_id + '/model/'
                model_loaded = mlflow.sklearn.load_model(model_uri)
                df = pd.read_csv(file_selected_path, nrows=nrows)
                # st.success('Model prediction...')
                pred = model_loaded.predict(df)
                pred_df = pd.DataFrame(pred, columns=['Predictive Data'])
                st.dataframe(pred_df)
                pred_df.plot()
                st.pyplot()
    else:
        st.sidebar.warning('Did not find a trained model')


if __name__ == '__main__':
    app_main()

The demo video link of the project is as follows:

Demo Video

Finally, I thank my mentor Wei Pang for his academic guidance and Danny for his technical help on this project.

Copyright Notice
Except for the referenced content below, this article is the original work of Junhao, and the final right of interpretation belongs to the original author. If there is any infringement, please contact me to delete it. Please do not repost without my authorization.

8. References

[1]. Kaggle XGboost https://www.kaggle.com/alexisbcook/xgboost

[2]. Kaggle MissingValues https://www.kaggle.com/alexisbcook/missing-values

[3]. MLflow Tracking https://mlflow.org/docs/latest/tracking.html

[4]. Google AutoML https://cloud.google.com/automl-tables/docs/beginners-guide

[5]. 7StepML https://towardsdatascience.com/the-7-steps-of-machine-learning-2877d7e5548e

[6]. ScikitLearn https://scikit-learn.org/stable/getting_started.html#model-evaluation

[7]. UCIDataset https://archive.ics.uci.edu/ml/datasets.php

[8]. Wikipedia https://en.wikipedia.org/wiki/Gradient_boosting

[9]. ShuhariBlog https://shuhari.dev/blog/2020/02/streamlit-intro