Excluding outliers in calculation of average #10204

MrTinkerman · 2022-12-12T19:24:45Z

MrTinkerman
Dec 12, 2022

Hi all,

I'm testing a sensor and would like to establish a baseline reading.
The readings jump up and down quite a bit but I got to the point where
I have a function that calculates an average over say 250 readings (see below).

This seems to get consistent results. However, there are also outliers in the results (see below).

At first sight you'd say, OK, the average should be ~1015.

But how can I calculate this in MicroPython, excluding the outliers in the calculation?
Can I calculate a standard deviation and exclude results beyond that?

Thanks in advance!

import utime

# tof is the time-of-flight sensor driver object, assumed to be created elsewhere

def baseline_measurement():
    
    baseline = []                                        # List of readings
    
    while True:
        
        for m in range(250):                             # Number of samples
                
            baseline.append(tof.ping())                  # Read sensor and append reading to list
            utime.sleep_ms(10)
            
        print(int(sum(baseline) / len(baseline)))        # Print average
        
        baseline = []                                    # Clear list

# Averages calculated by int(sum(baseline) / len(baseline))

1013
1015
1013
1016
1015
1015
1015
1014
1016
1044    <<<<
1043    <<<<
1015
1015
1014
1013
1042    <<<<
1015
1015

robert-hh · 2022-12-12T19:38:10Z

robert-hh
Dec 12, 2022
Collaborator

You could try the median instead of the average.
Edit: If there are not too many values, sort them and take the one in the middle.
Edit2: Or you scan the data first for the range and then select just a certain middle range, like 20-80% of the values. That can be done without storing in memory, but requires 2 scans. Anyhow, you have to look at the characteristics of the data first.
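
A minimal sketch of both suggestions, in case it helps. It reads "middle range" as a percentile range of the sorted data, which is only one interpretation, and it does store the values. The function names `median` and `trimmed_mean` and the `lower`/`upper` parameters are made up for illustration; the sample data are the batch averages from the first post.

```python
def median(values):
    """Return the middle value of a sorted copy (mean of the two middle values if the count is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def trimmed_mean(values, lower=0.2, upper=0.8):
    """Average only the values between the lower and upper fraction of the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    subset = ordered[int(n * lower):int(n * upper)]
    if not subset:
        raise ValueError("not enough values for the requested range")
    return sum(subset) / len(subset)


# The batch averages from the first post, including the three high outliers
averages = [1013, 1015, 1013, 1016, 1015, 1015, 1015, 1014, 1016,
            1044, 1043, 1015, 1015, 1014, 1013, 1042, 1015, 1015]
print(median(averages))
print(trimmed_mean(averages))
```

Both results stay close to 1015 because the high outliers end up at the end of the sorted list and are never picked.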

1 reply

MrTinkerman Dec 12, 2022
Author

Thanks for the reply and the suggestions,

I think sorting and taking an average from the middle (40-60%) could work.
Below is my updated code and the result.

def baseline_measurement():
    
    baseline_measurements = []                                                     # List of readings
    baseline_averages = []
    
    while True:
        
        for baseline_average in range(10):
            
            for measurement in range(250):                                                       # Number of samples
                    
                baseline_measurements.append(tof.ping())                               # Read sensor and append reading to list
                utime.sleep_ms(10)
                
            average = int(sum(baseline_measurements) / len(baseline_measurements))
            
            print(average)                                                             # Print average
            
            baseline_averages.append(average)
            
            baseline_measurements = []                                                              # Clear list
        
        print(baseline_averages)

Raw list:

[1013, 1016, 1015, 1042, 1015, 1016, 1017, 1014, 1047, 1015, 1016, 
1015, 1018, 1014, 1015, 1015, 1015, 1016, 1017, 1013, 1016, 1013, 
1013, 1015, 1013, 1018, 1014, 1016, 1015, 1046, 1015, 1013, 1043, 
1015, 1015, 1017, 1018, 1015, 1017, 1015, 1044, 1014, 1015, 1015, 
1017, 1044, 1016, 1041, 1012, 1013, 1014, 1013, 1044, 1014, 1014, 
1045, 1015, 1015, 1014, 1015, 1015, 1013, 1012, 1015, 1016, 1015, 
1017, 1044, 1014, 1016, 1013, 1014, 1015, 1016, 1012, 1012, 1017, 
1015, 1016, 1013, 1015, 1014, 1041, 1015, 1016, 1015, 1072, 1013, 
1016, 1014, 1017, 1017, 1016, 1016, 1015, 1016, 1017, 1014, 1044, 
1018, 1013, 1013, 1014, 1013, 1013, 1014, 1016, 1014, 1016, 1017, 
1016, 1015, 1018, 1016, 1016, 1016, 1015, 1016, 1017, 1017, 1043, 
1015, 1016, 1016, 1016, 1015, 1061, 1074, 1014, 1017, 1015, 1016, 
1014, 1016, 1014, 1013, 1013, 1015, 1015, 1013, 1012, 1016, 1015, 
1041, 1013, 1013, 1012, 1016, 1015, 1013, 1015, 1013, 1014, 1015, 
1015, 1014, 1015, 1013, 1014, 1014, 1014, 1013, 1013, 1012, 1014, 
1014, 1014, 1014, 1015, 1014, 1014, 1015, 1015, 1042, 1044, 1044, 
1014, 1017, 1015, 1043, 1014, 1014, 1015, 1014, 1013, 1015, 1015, 
1015, 1045, 1015, 1015, 1014, 1014, 1013, 1011, 1015, 1013, 1017, 
1014, 1015, 1015, 1014, 1014, 1012, 1017, 1013, 1016, 1016, 1016, 
1015, 1015, 1014, 1016, 1014, 1015, 1015, 1015, 1013, 1015, 1015, 
1016, 1015, 1015, 1016, 1012, 1014, 1016, 1015, 1044, 1044, 1014, 
1017, 1014, 1012, 1014, 1013, 1014, 1017, 1015, 1013, 1043, 1013, 
1018, 1015, 1012, 1015, 1043, 1016, 1016, 1018, 1015, 1018, 1012, 
1046, 1014, 1015, 1017, 1015, 1013, 1014, 1016, 1016, 1015, 1016, 
1017, 1015, 1016, 1044, 1046, 1015, 1015, 1015, 1016, 1012, 1016, 
1015, 1045, 1017, 1044, 1013, 1013, 1015, 1015, 1015, 1013, 1013, 
1015, 1013, 1014, 1017, 1016, 1044, 1012, 1043, 1016, 1015, 1014, 
1015, 1015, 1016]

Sorted

[1011, 1012, 1012, 1012, 1012, 1012, 1012, 1012, 1012, 1012, 1012, 
1012, 1012, 1012, 1012, 1013, 1013, 1013, 1013, 1013, 1013, 1013, 
1013, 1013, 1013, 1013, 1013, 1013, 1013, 1013, 1013, 1013, 1013, 
1013, 1013, 1013, 1013, 1013, 1013, 1013, 1013, 1013, 1013, 1013, 
1013, 1013, 1013, 1013, 1013, 1013, 1013, 1013, 1013, 1013, 1013, 
1014, 1014, 1014, 1014, 1014, 1014, 1014, 1014, 1014, 1014, 1014, 
1014, 1014, 1014, 1014, 1014, 1014, 1014, 1014, 1014, 1014, 1014, 
1014, 1014, 1014, 1014, 1014, 1014, 1014, 1014, 1014, 1014, 1014, 
1014, 1014, 1014, 1014, 1014, 1014, 1014, 1014, 1014, 1014, 1014, 
1014, 1014, 1014, 1014, 1014, 1014, 1015, 1015, 1015, 1015, 1015, 
middle 20% start >>>>>
1015, 1015, 1015, 1015, 1015, 1015, 1015, 1015, 1015, 1015, 1015, 
1015, 1015, 1015, 1015, 1015, 1015, 1015, 1015, 1015, 1015, 1015, 
1015, 1015, 1015, 1015, 1015, 1015, 1015, 1015, 1015, 1015, 1015, 
1015, 1015, 1015, 1015, 1015, 1015, 1015, 1015, 1015, 1015, 1015, 
1015, 1015, 1015, 1015, 1015, 1015, 1015, 1015, 1015, 1015, 1015, 
1015, 1015, 1015, 1015, 1015, 1015, 1015, 1015, 1015, 1015, 1015, 
1015, 1015, 1015, 1015, 1015, 1015, 1015, 1015, 1015, 1015, 1015,
<<<<  middle 20% end 
1015, 1016, 1016, 1016, 1016, 1016, 1016, 1016, 1016, 1016, 1016, 
1016, 1016, 1016, 1016, 1016, 1016, 1016, 1016, 1016, 1016, 1016, 
1016, 1016, 1016, 1016, 1016, 1016, 1016, 1016, 1016, 1016, 1016, 
1016, 1016, 1016, 1016, 1016, 1016, 1016, 1016, 1016, 1016, 1016, 
1016, 1016, 1016, 1016, 1016, 1017, 1017, 1017, 1017, 1017, 1017, 
1017, 1017, 1017, 1017, 1017, 1017, 1017, 1017, 1017, 1017, 1017, 
1017, 1017, 1017, 1017, 1017, 1017, 1018, 1018, 1018, 1018, 1018, 
1018, 1018, 1018, 1041, 1041, 1041, 1042, 1042, 1043, 1043, 1043, 
1043, 1043, 1043, 1044, 1044, 1044, 1044, 1044, 1044, 1044, 1044, 
1044, 1044, 1044, 1044, 1045, 1045, 1045, 1046, 1046, 1046, 1047, 
1061, 1072, 1074]

jimmo · 2022-12-13T01:48:38Z

jimmo
Dec 13, 2022
Maintainer

@MrTinkerman the mean and standard deviation of a dataset can be computed without storing the history of the samples. This is called the "running mean".

The way to do it for the mean can be seen from the formula:

$\mu_n = \frac{\Sigma_{i=1}^{n} x_i}{n}$

so the next mean is

$\mu_{n+1} = \frac{\Sigma_{i=1}^{n+1} x_i}{n+1} = \frac{x_{n+1} + \Sigma_{i=1}^{n} x_i}{n+1} = \frac{x_{n+1} + \mu_n \times n}{n+1}$

So given only the previous mean, the next sample, and the number of samples, you can compute each successive mean. If you only want to include "recent" samples, you can weight the previous mean lower (essentially computing a moving average).
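
One common way to weight the previous mean lower is an exponentially weighted moving average. This is a sketch under that assumption, not necessarily the exact weighting meant above; the class name `MovingAverage` and the parameter `alpha` are hypothetical.

```python
class MovingAverage:
    """Exponentially weighted moving average; alpha is a hypothetical tuning parameter."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha      # weight of the newest sample, between 0 and 1
        self.mean = None        # no estimate until the first sample arrives

    def update(self, x):
        if self.mean is None:
            self.mean = x       # seed with the first sample
        else:
            # new mean = (1 - alpha) * old mean + alpha * new sample
            self.mean += self.alpha * (x - self.mean)
        return self.mean
```

A larger `alpha` follows recent samples more quickly, while a smaller one smooths more but also reacts more slowly to real changes.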

For the running standard deviation see the always awesome John D Cook - https://www.johndcook.com/blog/standard_deviation/
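
A minimal sketch of a running mean and standard deviation using Welford's update, which I believe is the method the linked page describes. The class name `RunningStats` and its methods are made up for illustration; the readings are the first eight values of the raw list posted above.

```python
import math


class RunningStats:
    """Running mean and standard deviation using Welford's update."""

    def __init__(self):
        self.n = 0          # number of samples seen so far
        self.mean = 0.0     # running mean
        self.m2 = 0.0       # running sum of squared deviations from the mean

    def push(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n          # same as (x + mean * (n - 1)) / n
        self.m2 += delta * (x - self.mean)   # uses the old and the new mean

    def std(self):
        # Sample standard deviation; 0.0 until there are at least two samples
        if self.n < 2:
            return 0.0
        return math.sqrt(self.m2 / (self.n - 1))


stats = RunningStats()
for reading in (1013, 1016, 1015, 1042, 1015, 1016, 1017, 1014):
    stats.push(reading)
print(stats.mean, stats.std())
```

A reading could then be rejected when it lies more than some multiple of `std()` away from `mean`. Note that outliers which were already pushed inflate both values, so this needs some care with data like the above.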

1 reply

mattytrentini Dec 13, 2022
Collaborator Sponsor

(Sweet use of LaTeX markup!)

peterhinch · 2022-12-13T13:35:09Z

peterhinch
Dec 13, 2022
Collaborator

I would want to find the physical cause of the outliers. Their characteristics are odd: they always increase the value, and usually by a fairly constant 30. Occasionally roughly double that. This isn't random noise: something interesting is going on...

3 replies

rkompass Dec 13, 2022

Absolutely: The outliers mostly form a second mode: You have a multimodal distribution.

Looks like there is a mostly constant shift, confirmed by standard deviations:
Given a is the numpy array of data:
np.std(a[a<=1030]) --> 1.437070403566838 and
np.std(a[np.logical_and(a >1030, a < 1060)]) --> 1.4533486237727757 : identical standard deviation.

Sensor noise + occasional change of electrical connections (breadboard) ??
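
Since numpy is usually not available on a MicroPython board, the same split can be sketched in plain Python. The helper names `std` and `split_modes` are made up; the limits 1030 and 1060 are the ones used in the numpy expressions above, and the data are the first two rows of the raw list.

```python
import math


def std(values):
    """Population standard deviation, matching the default of np.std."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def split_modes(values, low_limit=1030, high_limit=1060):
    """Split readings into the main mode and the shifted second mode."""
    main = [v for v in values if v <= low_limit]
    shifted = [v for v in values if low_limit < v < high_limit]
    return main, shifted


# First two rows of the raw list posted above
data = [1013, 1016, 1015, 1042, 1015, 1016, 1017, 1014, 1047, 1015, 1016,
        1015, 1018, 1014, 1015, 1015, 1015, 1016, 1017, 1013, 1016, 1013]
main, shifted = split_modes(data)
print(len(main), std(main))
print(len(shifted), shifted)
```

With only 22 values the second mode is too small for a meaningful standard deviation; the full list of 300 values would be needed to reproduce the numbers quoted above.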

MrTinkerman Dec 13, 2022
Author

I would want to find the physical cause of the outliers. Their characteristics are odd: they always increase the value, and usually by a fairly constant 30. Occasionally roughly double that. This isn't random noise: something interesting is going on...

Well, I'm using a VL53L0X Time-of-Flight sensor. Seeing that graph is interesting.
But I have no clue what exactly is going on.

smatterchoo Dec 16, 2022

interesting! I bet you're getting some photons coming back after bouncing off your target at an angle, then bouncing off something else, then back to the TOF detector.

MrTinkerman · 2022-12-13T19:42:40Z

MrTinkerman
Dec 13, 2022
Author

Thanks for the input all, I have solved it taking your suggestions into consideration.
jimmo, this might just be a little too complicated for me, but thanks.

I am using a 'rolling average' in a list. Each time I want a reference:
-- Sort the list
-- Get the 'middle' as a subset
-- Calculate the average of the middle subset

My global readings list is 100 entries, the subset is 20.
This way extremes are excluded from the subset.

The code:

def get_mean_reading():
    
    global readings
    
    ordered = sorted(readings)                                      # Sort a copy so the rolling list keeps its insertion order
    middle = len(ordered) // 2                                      # Determine the middle index of the list of readings
    subset = ordered[middle-10:middle+10]                           # Get a subset around the middle
    return int(sum(subset) / len(subset))                           # Return the average of the subset

3 replies

rkompass Dec 14, 2022

As the outliers are only shifted to higher values as @peterhinch noted, they are probably reflections - the light taking a longer path, with the above data 30 mm longer (pythagoras??). It could be interesting to also filter out the second mode and have a permanently updated display of it, together with its size (how many samples around that value). This way you could try out different positions of sensor and object, together with rearrangements of the environment, perhaps a black tube to narrow the cone of the sensor...

redhead-p Dec 18, 2022

Have a look at the statistical median filter:

https://www.robots.ox.ac.uk/~sjrob/Teaching/SP/l8.pdf

I've used this in the past for removing spiky noise. This filter takes a windowed set of values adjacent to, and centred around, the input value. There is an odd number of values in the set. The set is sorted and the filter output is the median value, i.e. the middle value of the sorted set. As long as the spikes are single values and well separated, they will never be the median value of a set and therefore will not make it to the output stream. I spotted in your sample that there's an instance of two adjacent outlying samples, so you will need a window size of at least 5.
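
A rough sketch of such a filter applied to a stored list, to show why the window size matters. The function name `median_filter` is made up, and clamping the window at the ends of the list is just one of several possible edge treatments. The data are the batch averages from the first post.

```python
def median_filter(values, window=5):
    """Replace each value by the median of the window centred on it."""
    if window % 2 == 0:
        raise ValueError("window must be odd")
    half = window // 2
    result = []
    for i in range(len(values)):
        # Clamp the window at both ends of the list
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        ordered = sorted(values[lo:hi])
        result.append(ordered[len(ordered) // 2])
    return result


# Batch averages from the first post; 1044 and 1043 are adjacent outliers
averages = [1013, 1015, 1013, 1016, 1015, 1015, 1015, 1014, 1016,
            1044, 1043, 1015, 1015, 1014, 1013, 1042, 1015, 1015]
print(median_filter(averages, window=3))   # the adjacent pair leaks through
print(median_filter(averages, window=5))   # the pair is suppressed
```

With a window of 3 the two adjacent outliers outvote their single normal neighbour, so 1043 still appears in the output. With a window of 5 they are always in the minority.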

MrTinkerman Dec 19, 2022
Author

Thanks for the input, but man, my math/linear algebra is a bit rusty.
But reading this, it sounds a bit like what my class does. I'll post it below.

MrTinkerman · 2022-12-19T17:44:34Z

MrTinkerman
Dec 19, 2022
Author

Thanks for the input all,

I wrote a 'sampler' class that fits my needs, your replies and hints have been helpful.

# Class to take stable, custom average or mean from a range of sensor readings
# Readings that are 'not too far' from the current mean will be stored to have a stable rolling average


class sampler():
    
    def __init__(self, size, sensor_max):
        
        if (size%2) == 0: self.size = size + 1                        # If size is even, add 1 to make it un-even so sample range 
        else: self.size = size                                        # can have a single middle value, else if uneven, accept
        self.middle_index = int((self.size - 1) / 2)                  # Calculate middle index
        self.samples = [0]*self.size                                  # Create list to hold sampled values
        self.sensor_max = sensor_max                                  # sensor_max is used to ignore rogue readings
        self.index = 0                                                # Set index to zero
    
    
    def set_baseline(self, reading):                                  # When setting a baseline readings should not be 
                                                                      # tested against self.get_averaged_median()
        if reading < self.sensor_max:                                 # Reading should only be tested against sensor_max
            self.samples[self.index] = reading                        # If the reading is valid, store it
            self.index += 1                                           # This function should be called in a for loop:
            if self.index == self.size: self.index = 0                # for index, val in enumerate(sampler.samples):  sampler.set_baseline(read_sensor())
        
        
    def store(self, new_reading):                                     # When storing a single reading it should not be
                                                                      # 'too far off' self.get_averaged_median(xx)
        if abs(self.get_averaged_median(40) - new_reading) < 100:     # This way the reference (averaged mean) can change slightly in time
            self.samples[self.index] = new_reading                    # to account for sensor drift. While at the same time
            self.index += 1                                           # alert readings do not influence the averaged mean
            if self.index == self.size: self.index = 0
     
    # Get single middle value
    def get_median(self):

        ordered = sorted(self.samples)                                # Sort a copy so the ring buffer keeps its insertion order
        return ordered[self.middle_index]                             # Return the middle value

    # Get average of a middle subset
    def get_averaged_median(self, percent):

        ordered = sorted(self.samples)                                # Sort a copy; the subset below assumes sorted data
        offset = max(1, round(self.size * (percent/2) / 100))         # Start/end offset from middle, at least 1 to avoid an empty subset
        subset = ordered[self.middle_index - offset : self.middle_index + offset]      # Create subset
        return round(sum(subset)/len(subset))                         # Return average of subset

2 replies

stain3565 Aug 26, 2023

Hi. I have just found this thread. I need to remove outliers from a list of distance sensor readings, and your requirements seem similar to mine. However, I am not great at understanding medians, averaged medians, etc., so I am unsure what to call to get the "average" plus outliers. Do you have examples of calls to these class functions with arrays, and of adding new readings to the array, as it appears that new readings are passed in separately from the ongoing array? I would be very grateful. As I loop to get a reading every few milliseconds, I could adapt the calls to suit my data.

redhead-p Aug 26, 2023

Here's a simple example from one of my projects. I pre-allocate an integer array and index as these have to persist between readings being taken.

import array
.
.
.
.

# initialise median filter array. There must be an odd number of entries
_iprop_median = array.array('I', (0,0,0,0,0))
_iprop_median_x = 0

The following code uses the results of an analogue read that has been zero adjusted.

    # put the iprop reading through a statistical median filter
    # insert the entry into the circular median buffer
    _iprop_median[_iprop_median_x] = iprop_zeroed
    _iprop_median_x += 1
    if _iprop_median_x == len(_iprop_median):
        _iprop_median_x = 0

    # sort the buffer and extract the median value (1 line of python!)
    iprop_mf = sorted(_iprop_median)[len(_iprop_median)//2]

This is about as simple as a statistical median filter can be. Hope this helps.
