Getting Started with Probability Jargon
Machine learning discussions often use many confusing acronyms. Learning some of the basic terms from statistics is a good place to begin.
This week, I am going to try to clarify something that can confuse people starting with ML everywhere from the USA to the PCR: statistical distribution terminology. FYI, this topic should be simple, but people often use too many acronyms without making sure the audience knows what they mean. So, let’s shine a light on this topic ASAP and focus like a laser on a particular example to illustrate.
Let us take an uncomplicated problem like the axial stress in a circular rod as a function of radius (stress=force/(pi*radius^2)). After collecting a lot of data from a Design of Experiments into Altair HyperStudy, you can see the Distribution tab (seen in the GIF below). There are three related pieces of data shown in the plots.
- Histogram. These are the blue bars, which simply count the frequency of occurrence and use the left vertical axis. The data is grouped into equally spaced bins, and a simple additive count is made for each bin. In the plot, you can see the radius is spread evenly across the bins (FWIW this is expected, as the sampling is trying to fill the space equally). But the stress is not so evenly distributed and shows a distinct left bias, with most values bunched toward the low end. This result is worth remembering: just because the input variables have a particular distribution, the output responses do not, in general, have the same distribution.
- Probability density function. This red line uses the right axis and is commonly called the PDF. You can imagine making this line by drawing a curve through the tops of the histogram bars, and then normalizing this curve so that the area under it is 1.0. This curve represents the relative likelihood of occurrence; a higher value means that data near that value is more likely to occur.
- Cumulative distribution function. This is the green curve. It also uses the right axis and is most commonly called the CDF. This curve is interpreted as the percentage of data values that fall at or below a given threshold. For example, 90% of the data is below a stress of approximately 550. The normalization of the PDF is vital for this interpretation: 100% of the data must lie at or below the maximum value, of course.
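If you want to see these three pieces outside of a GUI, here is a minimal NumPy sketch of the same idea. To be clear about the assumptions: the force value, the radius range, the sample count, and the helper name `distribution_summary` are all made up for illustration, and the PDF here is a simple histogram-based estimate, which is not necessarily how HyperStudy draws its red curve.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed inputs (not from the post): a fixed force and a uniformly sampled radius.
force = 1000.0
radius = rng.uniform(1.0, 2.0, size=10_000)

# The rod example from the text: stress = force / (pi * radius^2).
stress = force / (np.pi * radius**2)


def distribution_summary(values, bins=10):
    """Return histogram counts, bin edges, a PDF estimate, and a CDF estimate."""
    counts, edges = np.histogram(values, bins=bins)
    widths = np.diff(edges)
    # Normalize the counts so the area under the bars is 1.0.
    pdf = counts / (counts.sum() * widths)
    # Running fraction of the data at or below each bin's right edge.
    cdf = np.cumsum(counts) / counts.sum()
    return counts, edges, pdf, cdf


r_counts, r_edges, r_pdf, r_cdf = distribution_summary(radius)
s_counts, s_edges, s_pdf, s_cdf = distribution_summary(stress)

# Area under the PDF estimate; 1.0 up to floating-point rounding.
pdf_area = np.sum(s_pdf * np.diff(s_edges))

# The stress value that 90% of the samples fall at or below.
p90 = np.percentile(stress, 90)

print("radius counts per bin:", r_counts)
print("stress counts per bin:", s_counts)
print("area under stress PDF:", pdf_area)
print("90th percentile of stress:", p90)
```

With a uniformly sampled radius, the radius counts come out roughly equal per bin, while the stress counts pile up in the low-stress bins, which is the same input-versus-output effect described in the histogram bullet. The exact numbers depend entirely on the assumed force and radius range, so do not expect them to match the values in the GIF.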
Another fun fact: This post uses 12 unique acronyms. Can you find them all? Remember, don’t be like me: only use acronyms in your writing once their meaning is established!
Thanks for putting us on your radar every week!