Dtectit is an open-source app that makes data analysis and machine learning easy. You import your data and analyse it to develop a model that can predict, for example, facies from your well log data.
Introduction to Self Organizing Maps
In high-dimensional data such as well log data, patterns are hard for the human brain to grasp. Reducing the dimension of the data therefore plays an essential role in visualising high-dimensional data in a 2D space. Among other techniques, Self-Organizing Maps (SOM) [Kohonen, 1982a, 1982b] use a neural learning algorithm to produce meaningful maps in two dimensions. A SOM is a form of neural network but is self-trained, whereas typical neural networks are trained on a calibration curve. The SOM is subsequently calibrated to produce a discrete curve (in this case, a facies curve) or to predict a continuously varying curve such as permeability.
SOMs have been applied in many fields, including image analysis, speech recognition, chemistry, and biology, for classification, regression, and data mining. The technique maps data points onto a regular grid of units and aims to find similarities between data points in 2D space based on the distance matrix. Unlike other methods and techniques, SOMs concentrate on the most significant similarities.
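The mapping onto a grid of units can be illustrated with a minimal training loop. This is only a sketch in plain Python and not Dtectit's own implementation: the function names, the Gaussian neighbourhood, and the linearly decaying learning rate and radius are all assumptions chosen for illustration.

```python
import math
import random

def best_matching_unit(weights, x):
    """Return the grid position whose weight vector is closest to sample x."""
    return min(weights, key=lambda u: sum((a - b) ** 2 for a, b in zip(weights[u], x)))

def train_som(data, rows, cols, n_iter=200, lr0=0.5, seed=0):
    """Train a small rectangular SOM; returns {(row, col): weight vector}.

    Assumed design: online updates, Gaussian neighbourhood, linear decay.
    """
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = max(rows, cols) / 2.0
    # Initialise every unit with a randomly chosen sample.
    weights = {(r, c): list(rng.choice(data)) for r in range(rows) for c in range(cols)}
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                    # decaying learning rate
        sigma = max(sigma0 * (1.0 - frac), 0.5)    # shrinking neighbourhood radius
        x = rng.choice(data)
        bmu = best_matching_unit(weights, x)
        for (r, c), w in weights.items():
            grid_d2 = (r - bmu[0]) ** 2 + (c - bmu[1]) ** 2
            h = math.exp(-grid_d2 / (2.0 * sigma ** 2))
            # Pull the unit towards the sample; units near the BMU move more.
            for k in range(dim):
                w[k] += lr * h * (x[k] - w[k])
    return weights
```

After training, similar samples share the same or neighbouring best-matching units, which is what makes the 2D map readable.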
The classic SOM is a square map where each node is connected to its neighbours in a square grid. One drawback of this arrangement is known as the “border effect”: nodes along the grid border are trained more poorly than nodes in the centre. To mitigate, or even remove, this effect, a SOM with a spherical or toroidal geometry can be used. Dtectit offers the toroidal SOM as well as the classic one.
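The difference between the classic and the toroidal map comes down to how the distance between two nodes on the grid is measured. The sketch below is illustrative Python, with a hypothetical function name, that wraps the grid around in both directions so that no node sits on a border.

```python
def grid_distance_sq(a, b, rows, cols, toroidal=False):
    """Squared distance between grid positions a and b, each given as (row, col).

    With toroidal=True the grid wraps around, so opposite edges are neighbours.
    """
    dr = abs(a[0] - b[0])
    dc = abs(a[1] - b[1])
    if toroidal:
        dr = min(dr, rows - dr)   # going "around" may be shorter
        dc = min(dc, cols - dc)
    return dr * dr + dc * dc
```

On a 12 by 12 grid, the nodes (0, 0) and (0, 11) are 11 steps apart on the classic map but direct neighbours on the torus, so border nodes receive as many neighbourhood updates as central ones.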
You can import data into the Dtectit app in .las or .csv format. If you import a .las file, it should be version 2.0. Two examples are also provided in case you do not have data of your own. The first example (“sample1”) is shown by default; to see the second, choose “sample2” under “Dataset”. If you want to try Dtectit with your own data, choose either las or csv under “Choose Extension”.
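To check a file before importing it, a LAS 2.0 file can be read with a few lines of code. The following is a minimal sketch in plain Python, not the parser Dtectit uses: the function name is hypothetical, it handles only unwrapped files, and it reads only the version, the curve mnemonics, and the data section.

```python
def parse_las2(text):
    """Return (curve_names, rows) from the text of an unwrapped LAS 2.0 file."""
    version, names, rows, section = None, [], [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                        # skip blank lines and comments
        if line.startswith("~"):
            section = line[1:2].upper()     # V, W, C, P, O or A
            continue
        if section == "A":
            rows.append([float(v) for v in line.split()])
        elif section in ("V", "C"):
            mnemonic, rest = line.split(".", 1)
            if section == "C":
                names.append(mnemonic.strip())
            elif mnemonic.strip().upper() == "VERS":
                version = rest.split(":", 1)[0].strip()
    if version != "2.0":
        raise ValueError("expected LAS version 2.0, found %r" % version)
    return names, rows

SAMPLE = """~Version Information
VERS.   2.0 : CWLS LOG ASCII STANDARD - VERSION 2.0
WRAP.   NO  : ONE LINE PER DEPTH STEP
~Curve Information
DEPT.M      : Depth
GR  .GAPI   : Gamma ray
~A  DEPT  GR
1000.0  55.2
1000.5  -999.25
"""
names, rows = parse_las2(SAMPLE)
```

The sample text is invented for illustration; real files also carry a well information section that declares the null value.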
This part lets you analyse your data before performing any further action on it. You can see a summary of your data under the “Summary” tab. There are a few options to choose from: choose “Statistical Summary” under “Display” in the sidebar to see a summary of your input data and get more information about it, or choose “Compactly Display Data” to see the data in a compact form. The latter is mostly a way to get a quick idea of what your data looks like.
Under the “plot” tab, your data is visualised. In the sidebar, you can change the interval using “Range” to focus on a specific part of the data.
Exploratory Data Analysis
The tools gathered under this tab automatically scan and analyse each variable and visualise it with typical graphical techniques. This makes the data easier to understand and lets you focus on extracting insight.
“Outliers plot”: It creates a boxplot that illustrates the data points far from other data points for each variable. In other words, it shows exceptional values in your dataset.
“MissingValues plot”: It plots the frequency of missing values for each variable in your dataset.
“Histogram plot”: It creates a histogram for each continuous variable in your dataset. You can change the number of columns and rows with “Choose Number of Column” and “Choose number of Row” in the sidebar.
“Correlation plot”: It creates a correlation heatmap for all variables in your dataset.
“Box Plot”: It creates a boxplot for each continuous variable in your dataset, based on the variable selected under “BoxPlot Y axis” in the sidebar.
“PCA plot”: It performs a principal components analysis on your dataset and visualises the output. Variables with zero variance are dropped automatically.
If one or more variables overwhelm the plots because of their values, you can select them under “Dataset to Exclude” to remove them from the analysis.
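Three of the checks behind the plots above can be sketched in a few lines of plain Python. This is an illustration, not Dtectit's code: the function names are hypothetical, the outlier rule is the common boxplot convention of 1.5 times the interquartile range, and the correlation is assumed to be Pearson's, which the app may or may not use.

```python
import math

def quantile(sorted_vals, q):
    """Linear-interpolation quantile of an already sorted list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

def iqr_outliers(values, k=1.5):
    """Values beyond k * IQR from the quartiles (the usual boxplot whiskers)."""
    vals = sorted(values)
    q1, q3 = quantile(vals, 0.25), quantile(vals, 0.75)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]

def drop_zero_variance(columns):
    """Keep only variables that vary; a constant column cannot be scaled for PCA."""
    return {name: vals for name, vals in columns.items() if min(vals) != max(vals)}

def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

A heatmap is then just `pearson` evaluated for every pair of variables that survives `drop_zero_variance`.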
After this brief analysis, with a better understanding of your data, you can choose here which variables the SOM should use to cluster and partition your data.
If the Not Available (NA) data points have not been excluded yet, you can exclude them from your data using the relevant part of the sidebar.
“Select Datasets”: Using this drop-down, you can select the variables for further analysis. If you do not choose the variables, all the variables will be considered by default.
“Train Supervised-SOM against”: If you want to apply supervised SOM on your data, select the variable you want to train against. The variable should be discrete, not continuous.
“Convert Values into NA”: This replaces garbage values that may appear after import with NA so that they are not considered in any further analysis. For example, an imported .las file may contain -999.0000 as a garbage value; enter this value in the text box and tick the “Convert values into NA” checkbox to convert it.
“Remove rows with missing values”: This function removes all the rows with NA data points in them. It will exclude the whole row if there is at least one missing value.
If you are interested in only a specific interval of your dataset, define the range under “Number range Input 1” and tick the “Activate Range Inputs” checkbox to select only that interval. For now, the app can subset at most two intervals; for the second interval, tick “Activate Range Inputs 2” and select the range.
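The three preparation steps described above (converting a garbage value to NA, removing incomplete rows, and keeping up to two intervals) can be sketched as follows. This is plain Python for illustration only: the function names and the sample rows are hypothetical, and `None` stands in for NA.

```python
def to_na(rows, sentinel=-999.0):
    """Replace the sentinel value with None (NA) in every row."""
    return [{k: (None if v == sentinel else v) for k, v in row.items()} for row in rows]

def drop_incomplete(rows):
    """Drop every row that has at least one missing value."""
    return [row for row in rows if all(v is not None for v in row.values())]

def subset_intervals(rows, key, intervals):
    """Keep rows whose `key` value lies in one of at most two (low, high) intervals.

    The selected intervals are bound together in the order given;
    overlapping intervals would repeat rows.
    """
    if len(intervals) > 2:
        raise ValueError("only two intervals are supported")
    selected = []
    for low, high in intervals:
        selected.extend(row for row in rows if low <= row[key] <= high)
    return selected

rows = [
    {"DEPT": 1000.0, "GR": 55.0},
    {"DEPT": 1000.5, "GR": -999.0},   # garbage value from the import
    {"DEPT": 1001.0, "GR": 60.0},
    {"DEPT": 1001.5, "GR": 62.0},
]
clean = drop_incomplete(to_na(rows))
picked = subset_intervals(clean, "DEPT", [(1000.0, 1000.2), (1001.4, 1002.0)])
```

Converting before dropping matters: if the sentinel is left in place, the row at 1000.5 looks complete and its -999.0 reading distorts every later statistic.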
Model Development for Supervised-SOM
You can develop your model with either supervised or unsupervised SOM. If you have already determined the categories (e.g. facies) in your data and want to predict them for another dataset, choose “Supervised SOM”. If there are no categories in your data and you want to cluster it, for example to determine the facies, choose the unsupervised “SOM”.
We provide settings to tune the SOM model, including the learning algorithm, distance function, and neighbourhood function. You can adjust them to suit your input data or keep the defaults. Settings such as grid size, number of iterations, and topology should be chosen based on your dataset. The free version has some limitations: for example, the grid size cannot exceed 12 in each dimension, so a map has at most 144 nodes (12 × 12).
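To make the tuning options more concrete, here is a small sketch of a grid-size check and two neighbourhood functions. It is illustrative Python with hypothetical names; the Gaussian and bubble neighbourhoods are common choices in SOM software and are assumed here, since the text does not list the options Dtectit offers.

```python
import math

MAX_GRID_FREE = 12  # per-dimension limit of the free version, as stated above

def node_count(xdim, ydim, limit=MAX_GRID_FREE):
    """Return the number of nodes, or raise if the grid exceeds the limit."""
    if xdim > limit or ydim > limit:
        raise ValueError("grid size is limited to %d per dimension" % limit)
    return xdim * ydim

def gaussian_neighbourhood(grid_dist, radius):
    """Influence decays smoothly with distance from the best-matching unit."""
    return math.exp(-(grid_dist ** 2) / (2.0 * radius ** 2))

def bubble_neighbourhood(grid_dist, radius):
    """Full influence inside the radius, none outside it."""
    return 1.0 if grid_dist <= radius else 0.0
```

A smooth neighbourhood tends to give gradual transitions between map regions, while a bubble neighbourhood is cheaper to compute and gives sharper boundaries.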
If you developed your model based on supervised SOM, you need to go to this part to cluster your data. At the moment there is only one clustering technique, but we provide two more techniques to help you with clustering. The first is “Analyse Distances”, which lets you check whether your data contains any clusters. The second is “Heatmaps”, which also shows how many groups are in your data. Once you have found the number of clusters in your dataset, choose the “Hierarchical” technique, provide the Number of Clusters, and push the blue button (“Cluster Dataset”) to cluster your data. The hierarchical technique can also be used as an illustration technique to find out how many clusters exist in your dataset.
Once you have clustered your dataset, another visualisation tool helps you better understand the result: go to Bordering SOM Plot to see how the dataset is clustered.
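Hierarchical clustering with a fixed number of clusters can be sketched as repeated merging of the two closest groups. The code below is plain Python for illustration, not the routine Dtectit runs: the function names are hypothetical, and average linkage with Euclidean distance is an assumption, since the text does not state which linkage the app uses.

```python
import math

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hierarchical_clusters(vectors, n_clusters):
    """Agglomerative clustering (average linkage); returns one label per vector."""
    clusters = [[i] for i in range(len(vectors))]   # start with singletons
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average distance between all members of the two groups.
                total = sum(euclid(vectors[a], vectors[b])
                            for a in clusters[i] for b in clusters[j])
                dist = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        clusters[i] += clusters[j]                  # merge the closest pair
        del clusters[j]
    labels = [0] * len(vectors)
    for label, members in enumerate(clusters):
        for m in members:
            labels[m] = label
    return labels
```

In a SOM workflow the vectors would typically be the trained node weights rather than the raw samples, so at most 144 vectors are clustered on the free version's largest map.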
Quality Assessment Methods
To check that you have chosen the correct number of clusters, we provide two tools. The first is the “Silhouette” plot, which shows whether the data points in each group belong to that cluster. The second is “Heatmaps”, where you can see how your dataset is clustered and judge whether the clusters look right. If you are not satisfied with your clusters, return to the “Clustering Techniques” tab and redo the process. Once you are confident in your groups, go to the “Prediction” pane.
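The silhouette value compares, for each data point, the average distance to its own cluster (a) with the average distance to the nearest other cluster (b), giving (b - a) / max(a, b). The sketch below is illustrative Python with hypothetical names; it assumes Euclidean distance and at least two clusters.

```python
import math

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def silhouette_values(points, labels):
    """Silhouette value per point: near 1 is well placed, below 0 is misplaced."""
    values = []
    for i, p in enumerate(points):
        own = [euclid(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            values.append(0.0)        # convention for single-point clusters
            continue
        a = sum(own) / len(own)
        b = min(
            sum(euclid(p, q) for j, q in enumerate(points) if labels[j] == other)
            / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values

points = [[0, 0], [0, 1], [10, 0], [10, 1]]
good = silhouette_values(points, [0, 0, 1, 1])
bad = silhouette_values(points, [0, 1, 0, 1])
```

With the sensible grouping every value is close to 1, while the deliberately wrong grouping gives negative values, which is the pattern to look for in the plot.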
In this part, you can predict clusters in your dataset based on the model built in the previous pane (“Model Development”). You can use the examples in the app or input your own dataset, then choose the proper variables to predict clusters. Under the “Explore” tab, after inserting data, you can remove NAs from your dataset. You can also choose an interval from your dataset; at the moment, the app lets you choose at most two intervals and binds them together.
Under the “Data visualisation” tab, you can view the dataset selected for prediction to make sure that you are using the correct one. You can also perform Exploratory Data Analysis on your dataset, for example by checking “Missing Values” or viewing a “Histogram”.
After some data analysis on your dataset, push the “Predict” button under “Model Prediction” and add the variable that was taken out of the dataset back to your dataset. Then click “Bind” to see the predicted clusters (facies). You can visualise the predicted data along with the rest of the dataset under “Prediction Visualisation”.
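Conceptually, predicting with a trained and clustered SOM means finding the closest node for each new sample and reading off that node's cluster. The following sketch is plain Python under that assumption; the function names, the tiny codebook, and the facies labels are all hypothetical, and the second function only mimics what the text calls binding.

```python
def predict_clusters(samples, codebook, node_labels):
    """Assign each sample the cluster label of its best-matching node."""
    predictions = []
    for x in samples:
        bmu = min(range(len(codebook)),
                  key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], x)))
        predictions.append(node_labels[bmu])
    return predictions

def bind(rows, predictions, name="Facies"):
    """Return copies of the rows with the predicted column attached."""
    return [dict(row, **{name: p}) for row, p in zip(rows, predictions)]

codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]    # trained node weights
node_labels = ["A", "A", "B"]                       # cluster of each node
samples = [[0.2, 0.1], [4.0, 6.0]]
rows = [{"DEPT": 1000.0}, {"DEPT": 1000.5}]
bound = bind(rows, predict_clusters(samples, codebook, node_labels))
```

The new samples must contain the same variables, in the same order and scaling, as the data used to train the model.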
Quality Measurements Analysis
If you used supervised SOM, you can check the quality of the model and how precise it is under “Quality Measurement Analysis”. Choose one of the three methods to assess your model. “Confusion Matrix” gives general statistics about the model’s predictions and how accurate they are. The other two, the “Contingency Tables”, are for visualisation purposes.
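A confusion matrix counts how often each actual category was predicted as each category, and accuracy is the share of predictions on the diagonal. The sketch below is illustrative Python with hypothetical function names and made-up facies labels; the statistics Dtectit reports may go beyond these two.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Return (labels, matrix) with rows = actual and columns = predicted."""
    counts = Counter(zip(actual, predicted))
    labels = sorted(set(actual) | set(predicted))
    matrix = [[counts[(a, p)] for p in labels] for a in labels]
    return labels, matrix

def accuracy(actual, predicted):
    """Fraction of samples whose predicted category matches the actual one."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

actual = ["sand", "sand", "shale", "shale", "shale"]
predicted = ["sand", "shale", "shale", "shale", "sand"]
labels, matrix = confusion_matrix(actual, predicted)
```

Here three of five samples fall on the diagonal, so the accuracy is 0.6; the off-diagonal cells show which categories the model confuses with each other.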