Exploratory Data Analysis (EDA)
Many newcomers to fields such as data science and machine learning eventually notice that almost all of the sample datasets used for teaching have already been cleaned and prepared in the way that best suits the given problem, so they can be modeled with a few lines of code without touching the data. The data has a structure that makes a model easy to set up: the data is downloaded, the algorithm is selected, the '.fit()' function is called, and congratulations, you have built your first model. The real problems dealt with in industry, on the other hand, are unfortunately not that simple...
To many outsiders, data science may look like a field defined by sophisticated statistical and machine learning techniques. That description is partially correct, but those techniques only become applicable through 'understanding the data'. Exploratory Data Analysis (EDA) is one of the steps we take to make sense of the data and make it suitable for the model we will build. To summarize the situation, we can take the English idiom “garbage in, garbage out” and turn it into “garbage in, perform EDA, possibly garbage out.”
EDA can be grouped under 3 main headings:
- 1. Understanding the Data
- 2. Removing Unnecessary Data
- 3. Finding Relationships in the Data
Before we begin these steps, it's helpful to take a general look at the data and see what kind of variables we'll be working with.
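A first look can be as simple as printing the first few rows and the size of the table. The sketch below assumes the data is held in a pandas DataFrame; the toy data and the column names `age`, `city`, and `income` are made up for illustration and do not come from any dataset in this article.

```python
import pandas as pd

# Hypothetical toy data standing in for a real dataset
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38],
    "city": ["Ankara", "Izmir", "Ankara", "Bursa", "Izmir"],
    "income": [3200.0, 4100.0, None, 5800.0, 4500.0],
})

first_rows = df.head(3)      # the first three rows of the table
n_rows, n_cols = df.shape    # (number of rows, number of columns)

print(first_rows)
print(f"{n_rows} rows, {n_cols} columns")
```

On a real dataset, `df` would typically come from a loader such as `pd.read_csv` rather than being typed in by hand.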
After this general look, we go deeper into the data with the “.info()” function, which lists each variable, its type, and how many non-missing values it has. From this we can judge the quality of the data and spot missing data and potentially unnecessary variables.
This summary table gives us a lot of information about our data. Once we know which columns have missing data, what type each column is, and whether any of the data is unnecessary, we can start taking the necessary actions. These checks matter because a machine learning model should be given clean, complete, and accurate data before it is built.
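As a minimal sketch of this step, assuming the “.info()” call refers to the pandas `DataFrame.info()` method, the example below runs it on a small, made-up table (the column names are hypothetical). It also pulls out the column types and non-null counts as separate objects, since `info()` only prints its summary.

```python
import pandas as pd

# Hypothetical toy data; one income value is deliberately missing
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38],
    "city": ["Ankara", "Izmir", "Ankara", "Bursa", "Izmir"],
    "income": [3200.0, 4100.0, None, 5800.0, 4500.0],
})

# Prints column names, non-null counts, dtypes, and memory usage
df.info()

# The same facts as objects that can be inspected in code
column_types = df.dtypes        # dtype of every column
non_null_counts = df.count()    # non-missing values per column

print(column_types)
print(non_null_counts)
```

A column whose non-null count is lower than the number of rows, such as `income` here, contains missing values.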
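One possible way to act on that summary is sketched below with pandas. The data, the column names, the 50% threshold for dropping a column, and the choice of filling gaps with the median are all assumptions made for illustration; the right choices depend on the dataset and the problem.

```python
import pandas as pd

# Hypothetical toy data with one mostly empty column ("notes")
df = pd.DataFrame({
    "customer_id": [101, 102, 103, 104, 105],
    "age": [25, 32, 47, 51, 38],
    "income": [3200.0, 4100.0, None, 5800.0, 4500.0],
    "notes": [None, None, None, None, "vip"],
})

missing_per_column = df.isna().sum()   # count of missing values per column
missing_share = df.isna().mean()       # fraction of missing values per column

# Assumed rule: drop columns where more than half of the values are missing
mostly_empty = missing_share[missing_share > 0.5].index
cleaned = df.drop(columns=mostly_empty)

# Assumed rule: fill the remaining numeric gaps with the column median
cleaned["income"] = cleaned["income"].fillna(cleaned["income"].median())

print(missing_per_column)
print(cleaned)
```

Dropping rows with `dropna()` instead of filling them is an equally valid option when only a few rows are affected.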
To understand our data better and see what we are dealing with, we can try to uncover the relationships between the columns or compute general summaries of the data. A table that summarizes the relationships between the columns (for example, a correlation table) provides a lot of valuable information. Apart from that, we can also find some connections by visualizing our data.
To summarize, the way to build a good model is always to 'understand the data'. With the examples given and the three headings described above, we can make sense of our data more easily and simplify the rest of our work.
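As a rough sketch of how such a table can be produced, the example below uses the pandas methods `describe()` and `corr()` on made-up numeric data. The column names are hypothetical, and Pearson correlation (the pandas default) is only one of several ways to measure a relationship between columns.

```python
import pandas as pd

# Hypothetical toy data with numeric columns only
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38],
    "income": [3200.0, 4100.0, 5200.0, 5800.0, 4500.0],
    "children": [0, 1, 2, 3, 1],
})

summary = df.describe()    # count, mean, std, min, quartiles, max per column
correlations = df.corr()   # pairwise Pearson correlation between columns

print(summary)
print(correlations.round(2))
```

Values close to 1 or -1 in the correlation table point to a strong linear relationship between two columns, while values near 0 suggest little or none. A scatter plot of the two columns is a common visual check of the same relationship.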
References:
https://towardsdatascience.com/an-extensive-guide-to-exploratory-data-analysis-ddd99a03199e
https://medium.datadriveninvestor.com/introduction-to-exploratory-data-analysis-682eb64063ff