About the Employee Analysis System
This is a web application for analysing employees based on attributes such as individual risk appetite, gender, salary, age, work experience, and marital status. The system enables the client to draw important conclusions about employee retention, risk-taking appetite, employee costing, and more.
(1). Description
A type of unsupervised learning known as ‘Clustering’ is used to divide employees into groups based on selected attributes. In clustering, employees (data points) are grouped so that points in the same group are more similar to each other than to points in other groups. The attributes are quantified and can be weighted by the system user from 0 to 1, and the clustering of employees varies accordingly. The client can choose as many attributes for clustering as required. We have implemented ‘Dimensionality Reduction’, so that irrespective of how many attributes are chosen for clustering, the final result is brought down to a two- or three-dimensional clustering. The UMAP library is used for the dimensionality reduction. Clustering was implemented in 4 steps:
○ Pre-processing:
Before starting with clustering, we first examine the data that will be exposed to the clustering process. If the data is numeric, the first step of pre-processing is already done; if it is categorical (data that can be divided into groups, e.g. gender or marital status), it is converted to numerical values.
All columns without a label are removed.
The desired columns to include in clustering are selected.
The data is normalised (every column is brought into the same range); e.g. age and salary lie in different numerical ranges and would by default be weighted differently, which can alter the clustering results.
The data is weighted if required. This can be done in two ways, row-wise and column-wise; the higher the weight, the higher the importance in clustering. A minimal sketch of these pre-processing steps follows below.
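The module's internal implementation is not reproduced here; the block below is only a minimal sketch of these pre-processing steps in Python using pandas and scikit-learn, with hypothetical column names and weights chosen purely for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical employee data; real column names and values will differ.
df = pd.DataFrame({
    "gender": ["M", "F", "F", "M"],
    "marital_status": ["single", "married", "single", "married"],
    "age": [25, 41, 33, 52],
    "salary": [30000, 80000, 55000, 120000],
})

# 1. Encode categorical columns as numeric values (one-hot encoding here).
df = pd.get_dummies(df, columns=["gender", "marital_status"])

# 2. Keep only the columns selected for clustering (unlabelled columns dropped).
selected = ["age", "salary", "gender_F", "gender_M"]
X = df[selected]

# 3. Normalise every column into the same 0-1 range so that, e.g.,
#    salary does not dominate age simply because of its larger scale.
X = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=selected)

# 4. Apply column-wise weights (0 to 1): higher weight = more influence.
weights = {"age": 0.5, "salary": 1.0, "gender_F": 0.2, "gender_M": 0.2}
for col, w in weights.items():
    X[col] = X[col] * w
```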
(2). Dimensionality reduction
Our data can have anywhere from one to many columns (dimensions). We need to reduce the number of dimensions, as this lowers the processing complexity.
Dimensionality reduction can be done with many algorithms; we chose three state-of-the-art ones: PCA, t-SNE, and UMAP.
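As a sketch of this step (random data stands in for the pre-processed employee matrix, and the parameters are illustrative), all three algorithms expose a similar fit-and-transform interface through scikit-learn and the umap-learn package:

```python
import numpy as np
import umap  # from the umap-learn package
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# X stands for the pre-processed, weighted feature matrix
# (n_samples x n_features); random data is used here as a stand-in.
rng = np.random.default_rng(0)
X = rng.random((200, 8))

X_pca = PCA(n_components=2).fit_transform(X)                          # linear projection
X_tsne = TSNE(n_components=2, random_state=0).fit_transform(X)        # neighbourhood-preserving
X_umap = umap.UMAP(n_components=2, random_state=0).fit_transform(X)   # manifold-based embedding
```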
(3). Outlier detection
Outliers are separated from the data so that they do not bias the clustering result.
We used two well-known algorithms for outlier detection: DBSCAN and Isolation Forest.
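The sketch below illustrates one way the two detectors can be combined, assuming the embedding from the previous step; the `eps`, `min_samples`, and `contamination` values are illustrative, not the system's actual settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.ensemble import IsolationForest

# X_2d stands for the dimensionality-reduced embedding from the previous step;
# random data is used here as a stand-in.
rng = np.random.default_rng(0)
X_2d = rng.random((200, 2))

# DBSCAN labels points that do not belong to any dense region as noise (-1).
db_outliers = DBSCAN(eps=0.1, min_samples=5).fit_predict(X_2d) == -1

# Isolation Forest scores points by how easily they can be isolated;
# fit_predict returns -1 for outliers and 1 for inliers.
iso_outliers = IsolationForest(contamination=0.05, random_state=0).fit_predict(X_2d) == -1

# Keep only the inliers for the final clustering step.
X_clean = X_2d[~(db_outliers | iso_outliers)]
```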
(4). Clustering
Clustering can be done with many techniques; the major families used here are centroid-based clustering (partitioning methods), density-based clustering, and model-based clustering.
From these families we took three algorithms: DBSCAN, K-Means, and GMM clustering.
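A minimal sketch of running the three algorithms on the cleaned data follows; the cluster count and other parameters are illustrative, and in the system they are configured by the analyst.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.mixture import GaussianMixture

# X_clean stands for the cleaned, dimensionality-reduced data;
# random data is used here as a stand-in.
rng = np.random.default_rng(0)
X_clean = rng.random((200, 2))
n_clusters = 3  # in the system this is chosen by the analyst

# Centroid-based: assigns each employee to the nearest cluster centre.
kmeans_labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X_clean)

# Density-based: grows clusters from dense regions; no cluster count is needed.
dbscan_labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(X_clean)

# Model-based: fits a mixture of Gaussians and assigns the most likely component.
gmm_labels = GaussianMixture(n_components=n_clusters, random_state=0).fit_predict(X_clean)
```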
(5). Challenges
○ Weighting data column-wise
There were pre-made options to weight the data row-wise, but there was no such way to weight the data column-wise.
○ Filling null values
There can be cases where employees do not fill in their complete data. In such cases, clustering is first done on the data the employees have provided.
The unfilled data is then predicted for each employee on the basis of that clustering.
Once that data is filled in, the final employee clustering is done based on all attributes.
○ Algorithm combination
It was challenging to find the best combination of algorithms for the system.
(6). Solution
We researched and found an algorithm to weight data column-wise and implemented it from scratch in the module.
For filling null values, a pre-clustering is done on the data that has no null values, and with the help of its result the missing values are predicted (see the sketch after this section).
The best combination of algorithms was found after rigorous testing on different kinds of data and research into the merits and demerits of each algorithm.
Every task was technically analysed from a feasibility point of view. If anything was not feasible, workarounds were planned.
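The module's exact imputation logic is not reproduced here; the sketch below (referenced above) shows the general idea under simplifying assumptions: complete rows are pre-clustered, and each incomplete row's missing values are filled from the column means of its nearest cluster. The function name and sample data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def fill_nulls_by_preclustering(df: pd.DataFrame, n_clusters: int = 3) -> pd.DataFrame:
    """Pre-cluster the complete rows, then fill each incomplete row's missing
    values from the column means of its nearest cluster (distance measured on
    the columns the employee did provide). Assumes df is already numeric."""
    complete = df.dropna()
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(complete)
    cluster_means = complete.groupby(km.labels_).mean()

    filled = df.copy()
    for idx, row in df[df.isna().any(axis=1)].iterrows():
        known = row.dropna().index
        missing = row.index[row.isna()]
        # Nearest pre-cluster centroid, using only the known columns.
        nearest = ((cluster_means[known] - row[known]) ** 2).sum(axis=1).idxmin()
        filled.loc[idx, missing] = cluster_means.loc[nearest, missing].to_numpy()
    return filled

# Hypothetical, already-normalised employee data with some missing salaries.
df = pd.DataFrame({"age":    [0.10, 0.80, 0.40, 0.95, 0.20, 0.75],
                   "salary": [0.15, 0.90, np.nan, 1.00, 0.25, np.nan]})
print(fill_nulls_by_preclustering(df, n_clusters=2))
```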
(7). Benefits We Delivered
The project was developed using the Agile methodology and the Scrum framework, which raised the project's quality, security, and overall value. We design, build, configure, test, and then release a potentially shippable product increment, and this cycle is performed iteratively in sprints.
○ Major benefits delivered
- Employee segmentation
- Filling null values based on clustering
- Adjustable attribute weights (from 0 to 1), with clustering instances re-run after changes.
- Saving of cluster reports.
(8). Summary
○ To summarize,
- The system provides functionality to upload data and run cluster processing on it.
- All of the data's complexity is handled and pre-processed by the clustering module.
- Analysts can add/remove columns and provide weights/importance to columns.
- Analysts can also select the number of clusters to be formed.
- The processed data is shown on the frontend with dynamic 3D/2D charts and graphs that provide insight into how employees are clustered based on the selected columns.
- The analyst can alter the number of clusters and other clustering configurations, and the visual graphs update accordingly.
- A clustering run can be saved for future reference by providing a name and description for the clustering and names for the individual clusters.
- Analysts can run surveys among employees on the basis of the last clustering.
- The recorded survey data can in turn be used as input for further clustering.
- All saved clustering instances are recorded in history, so analysts can monitor how any employee's cluster changes over time.
- If desired, an analyst can alter and export the data and the clustering instances.