As you may have read in the “About me” section, I have recently become more interested in machine learning, so I decided that my next article would be in this area. Today I will introduce you to the main problems and challenges of machine learning. I’ll start with the definition and the main purpose of machine learning.
How is machine learning defined?
One of the best-known definitions, by Arthur Samuel, is as follows:
Field of study that gives computers the ability to learn without being explicitly programmed. – Arthur Samuel, 1959
In other words, it is an area of science concerned with giving computers the ability to learn from data. The program is trained on prior experience and knowledge, where the knowledge is a data set used to train the system, called the training set.
A more precise definition of machine learning would deserve a separate article, so let's move on to the actual topic: what are the main problems in creating machine learning models?
We can distinguish a few problems:
- insignificant features
- poor quality data
- data gaps
- overfitting
- underfitting
Below I describe what each of them means and how we can remedy it.
Insignificant features
A model can learn effectively only when it is given enough relevant features and is not cluttered with irrelevant ones. How can we tell that a feature is irrelevant to our model? Several techniques are used here: feature selection (keeping the most useful of the existing features), feature extraction (combining existing features into a more useful one), and creating new features by gathering new data.
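To make feature selection more concrete, here is a minimal sketch of one simple approach: a correlation filter that keeps only the features whose absolute Pearson correlation with the target reaches a threshold. The function names, the toy data and the 0.3 threshold are my own assumptions for illustration. A filter like this only detects linear relationships, so treat it as a first screening step rather than a definitive test of relevance.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def select_features(columns, target, threshold=0.3):
    """Return the names of columns whose |correlation| with target meets the threshold."""
    return [
        name
        for name, values in columns.items()
        if abs(pearson(values, target)) >= threshold
    ]


# Hypothetical toy data: flat size drives the price, the other columns do not.
price = [150, 210, 310, 360, 460, 500]
columns = {
    "size_m2": [30, 45, 60, 75, 90, 105],
    "listing_weekday": [5, 1, 4, 4, 1, 5],
    "country_code": [1, 1, 1, 1, 1, 1],
}

selected = select_features(columns, price)
print(selected)
```

On this toy data only the size column passes the filter, because the weekday is almost uncorrelated with the price and the constant column has no variance at all.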
Poor quality data
If the data set contains a large number of errors or low-quality measurements, the model will struggle to find the underlying patterns and is unlikely to achieve good performance. It is therefore very important to clean the data before training. One of the most common methods is simply to discard outliers, that is, instances that lie far from the rest of the examples.
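Discarding instances that lie far from the rest can be sketched with the interquartile range (IQR) rule. This is only one possible outlier criterion, chosen here as an assumption. The readings and the conventional 1.5 multiplier are illustrative values, not something taken from a real data set.

```python
import statistics


def remove_outliers(values, k=1.5):
    """Drop values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    # Keep the original order of the surviving measurements.
    return [v for v in values if low <= v <= high]


# Hypothetical temperature readings with one faulty measurement (95.0).
readings = [22.0, 21.0, 23.4, 95.0, 22.3, 21.5, 23.1, 22.8]
cleaned = remove_outliers(readings)
print(cleaned)
```

I use the IQR rule instead of a mean and standard deviation cut-off because a single extreme value inflates the standard deviation, which can hide the outlier in a small sample.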
Data gaps
If we want the model to be effective and to generalize well, the data set must be large and varied. How large depends on the complexity of the problem and on the number of features and classes: the more features, the larger the set should be. For simple problems several hundred instances may be sufficient, while more difficult ones may need several thousand.
Overfitting
With overfitting we move from problems with the data set to problems with the algorithm itself. The simplest way to describe it is that the model fits the training data too closely and therefore generalizes poorly. More technically, the model performs well on the training set but gives poor results on the test set. The usual causes are a model that is too complex, noisy data, or a data set that is too small. To prevent overfitting, regularization is applied. One example is dropout, a technique patented by Google to prevent overfitting in neural networks.
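To show what dropout does mechanically, here is a minimal pure-Python sketch of the common "inverted dropout" variant. During training each activation is zeroed with probability `rate` and the survivors are scaled by `1 / (1 - rate)`, so the expected value stays the same. At inference time the values pass through unchanged. The function name and parameters are my own assumptions. Real frameworks ship their own dropout layers and work on tensors rather than lists.

```python
import random


def dropout(activations, rate, training=True, rng=random):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)  # inference: pass values through unchanged
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
layer_output = [0.5, 1.0, 1.5, 2.0]
print(dropout(layer_output, rate=0.5, rng=rng))         # some values zeroed, others doubled
print(dropout(layer_output, rate=0.5, training=False))  # unchanged at inference time
```

Because a different random subset of units is switched off at every training step, the network cannot rely on any single unit, which is what makes dropout act as a regularizer.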
Underfitting
Underfitting is the opposite of overfitting: the model is too simple to capture the structure of the data, so it performs poorly even on the training set. The usual causes are a model that is too simplistic for the problem or features that do not carry enough information. To deal with it we can, for example, choose a more expressive model, provide better features, or reduce the amount of regularization.
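Underfitting is easy to reproduce on a toy problem. The sketch below fits a straight line to data generated by y = x², and the error stays high even on the training data itself. Fitting the same simple model on a squared feature removes the error. The data and the helper names are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def mse(xs, ys, slope, intercept):
    """Mean squared error of the fitted line on the given points."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]  # the true relationship is quadratic

# Too simple: a straight line in x cannot follow the curve.
mse_linear = mse(xs, ys, *fit_line(xs, ys))

# Same algorithm, better feature: use x squared as the input.
squared = [x ** 2 for x in xs]
mse_squared = mse(squared, ys, *fit_line(squared, ys))

print(mse_linear, mse_squared)
```

The high training error of the first fit is the typical symptom: collecting more samples would not help here, while a more informative feature does.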
The problems I mentioned show that building a good AI-based system is not only about implementing an appropriate algorithm or using a ready-made library. It also involves a number of steps that we have to take beforehand. We must also be aware that an impressive score does not always mean that the results are correct. We have to confirm them using appropriate metrics, but I will perhaps write about that in another article.
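As a small illustration of why a single score can mislead, consider a hypothetical imbalanced data set with 5 positive and 95 negative examples. A "model" that always predicts the negative class reaches high accuracy while finding none of the positives. The data and function names below are assumptions made for this sketch.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that the model found."""
    actual_positives = sum(1 for t in y_true if t == positive)
    if actual_positives == 0:
        return 0.0
    true_positives = sum(
        1 for t, p in zip(y_true, y_pred) if t == positive and p == positive
    )
    return true_positives / actual_positives


# Hypothetical labels: 5 positives, 95 negatives; the model always answers 0.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100

print(accuracy(y_true, y_pred))  # 0.95
print(recall(y_true, y_pred))    # 0.0
```

An accuracy of 95% looks excellent here, yet the recall shows that the model is useless for detecting the positive class.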
Once again, I would appreciate feedback on whether you found the article interesting or think something should be improved. Thank you for the positive messages and comments on the previous article, and also for the constructive criticism, which I need because these are my first steps in writing technical articles.