Data points that do not follow the distribution followed by the majority of the other data points are called outliers. An outlier is a data point that lies far from most of the other points. Outliers may arise from variability in the measurement or from experimental errors. Where possible, outliers should be removed or otherwise handled before the data set is used to train a machine learning model.
Below are some of the methods for treating outliers:
- Trimming/removing the outlier
- Quantile based flooring and capping
- Mean/Median imputation
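The snippets in the following sections operate on two NumPy arrays, sample and sample_outliers, which are not defined in this section. As a minimal sketch, assuming a made-up sample and the 1.5 × IQR rule as one common way to flag outliers (the values and the detection rule here are illustrative assumptions, not necessarily how the arrays were originally built), they could be set up like this:

```python
import numpy as np

# Hypothetical data: most values sit between 11 and 16, with two extremes.
sample = np.array([12, 14, 13, 15, 11, 16, 14, 13, 95, 12, 15, -40, 14])

# Interquartile-range (IQR) rule: flag points lying more than 1.5 * IQR
# below the first quartile or above the third quartile.
q1, q3 = np.percentile(sample, [25, 75])
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Boolean mask selects the values outside the bounds.
sample_outliers = sample[(sample < lower_bound) | (sample > upper_bound)]
print("Outliers:", sample_outliers)
```

Any other detection rule (for example a z-score threshold) would work equally well, as long as it produces an array of the outlier values.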
Trimming/removing the outliers
In this technique, we remove the outliers from the dataset. It is not always good practice, however, because it shrinks the dataset and can discard genuine information.
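Python code to delete the outliers and copy the remaining elements to another array:
# Trimming
# Assumes `import numpy as np` and that sample and sample_outliers are NumPy arrays.
# Mask every position that holds an outlier value, then keep the rest.
a = sample[~np.isin(sample, sample_outliers)]
print(a)
# print(len(sample), len(a))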
Quantile based flooring and capping
In this technique, outliers are pulled in to fixed bounds: values above an upper quantile (here the 90th percentile) are capped, and values below a lower quantile (here the 10th percentile) are floored.
Python code:
# Computing 10th, 90th percentiles and replacing the outliers
tenth_percentile = np.percentile(sample, 10)
ninetieth_percentile = np.percentile(sample, 90)
# print(tenth_percentile, ninetieth_percentile)
b = np.where(sample<tenth_percentile, tenth_percentile, sample)
b = np.where(b>ninetieth_percentile, ninetieth_percentile, b)
# print("Sample:", sample)
print("New array:",b)
The data points that are less than the 10th percentile are replaced with the 10th percentile value, and the data points that are greater than the 90th percentile are replaced with the 90th percentile value.
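The same flooring and capping can also be written with np.clip, which applies both bounds in one call. The following is a minimal sketch on a made-up sample (the data values and the variable names low, high, clipped and stepwise are assumptions for illustration); it also checks that the result matches the two-step np.where approach:

```python
import numpy as np

# Hypothetical data for illustration only.
sample = np.array([12, 14, 13, 15, 11, 16, 14, 13, 95, 12, 15, -40, 14])

# 10th and 90th percentiles act as the floor and the cap.
low, high = np.percentile(sample, [10, 90])

# np.clip floors values below `low` and caps values above `high`.
clipped = np.clip(sample, low, high)

# Two-step np.where version for comparison.
stepwise = np.where(sample < low, low, sample)
stepwise = np.where(stepwise > high, high, stepwise)

print("Bounds:", low, high)
print("Clipped:", clipped)
print("Same as two-step version:", np.array_equal(clipped, stepwise))
```

Note that this treatment changes every value outside the two percentiles, not only the points previously flagged as outliers.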
Mean/Median imputation
Because the mean is highly influenced by outliers, it is advisable to replace the outliers with the median value instead.
Python Code:
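# Replace with median
# Assumes `import numpy as np` and that sample and sample_outliers are NumPy arrays.
median = np.median(sample)
# Swap every outlier value for the computed median in a single step.
c = np.where(np.isin(sample, sample_outliers), median, sample)
print("Sample: ", sample)
print("New array: ", c)
# print(c.dtype)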
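To see why the median is the safer replacement value, it helps to compare how the mean and the median react to a single extreme point. The sketch below uses made-up numbers (the arrays clean and with_outlier are hypothetical and not part of the original example):

```python
import numpy as np

# Hypothetical data for illustration only.
clean = np.array([12, 14, 13, 15, 11, 16, 14, 13, 12, 15, 14])
with_outlier = np.append(clean, 950)  # add one extreme value

# The mean is dragged far upward by the single extreme value...
print("Mean without / with outlier:  ", clean.mean(), with_outlier.mean())

# ...while the median stays where the bulk of the data is.
print("Median without / with outlier:", np.median(clean), np.median(with_outlier))
```

Here the median is 14 in both cases, whereas the mean moves from roughly 13.5 to roughly 91.6, so imputing with the mean would itself introduce an unrepresentative value.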