What’s a Hypothesis Space?

Last updated: March 18, 2024


1. Introduction

Machine-learning algorithms come with implicit or explicit assumptions about the actual patterns in the data. Mathematically, this means that each algorithm can learn a specific family of models, and that family goes by the name of the hypothesis space.

In this tutorial, we’ll talk about hypothesis spaces and how to choose the right one for the data at hand.

2. Hypothesis Spaces

Let’s say that we have a binary classification task and that the data are two-dimensional. Our goal is to find a model that classifies objects as positive or negative. Applying Logistic Regression, we get models of the form:

\[h(x_1, x_2) = \frac{1}{1 + e^{-(w_0 + w_1 x_1 + w_2 x_2)}} \tag{1}\]

which estimate the probability that the object at hand is positive.

2.1. Hypotheses and Assumptions

The underlying assumption of hypotheses (1) is that the boundary separating the positive from the negative objects is a straight line. So, every hypothesis from this space corresponds to a straight line in the 2D plane. For instance:

Two Classification Hypotheses

2.2. Regression

The same idea applies beyond classification. In a regression task, for instance, the hypothesis space of Linear Regression contains all the linear functions of the input features, so each hypothesis corresponds to a hyperplane fitted to the data.

3. Expressivity of a Hypothesis Space

We could informally say that one hypothesis space is more expressive than another if its hypotheses are more diverse and complex.

We may underfit the data if our algorithm’s hypothesis space isn’t expressive enough. For instance, linear hypotheses aren’t particularly good options if the actual data are extremely non-linear:

Non-linear Data

So, training an algorithm that has a very expressive space increases the chance of completely capturing the patterns in the data. However, it also increases the risk of overfitting. For instance, a space containing high-order polynomial hypotheses would start modelling the noise, which we can see from the resulting decision boundary:

A too complex hypothesis

Such models would generalize poorly to unseen data.

3.1. Expressivity vs. Interpretability

Additionally, even if a complex hypothesis generalizes well, it may be unusable in practice because it’s too complicated to understand or compute. What’s more, intricate hypotheses offer limited insight into the real-world process that generated the data. For example, the coefficients of a quadratic model directly show how each feature and their interactions contribute to the prediction, whereas a model with hundreds of terms offers no such insight.

4. How to Choose the Hypothesis Space?

We need to find the right balance between expressivity and simplicity. Unfortunately, that’s easier said than done. Most of the time, we need to rely on our intuition about the data.

So, we should start by exploring the dataset, using visualizations as much as possible. For instance, we can conclude that a straight line isn’t likely to be an adequate boundary for the above classification data. However, a high-order curve would probably be too complex even though it might split the dataset into two classes without an error.

A second-degree curve might be the compromise we seek, but we aren’t sure. So, we start with the space of quadratic hypotheses, obtained from (1) by adding the second-order terms:

\[h(x_1, x_2) = \frac{1}{1 + e^{-(w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_2^2 + w_5 x_1 x_2)}}\]

We get a model whose decision boundary appears to be a good fit even though it misclassifies some objects:

An adequate hypothesis

Since we’re satisfied with the model, we can stop here. If that hadn’t been the case, we could have tried a space of cubic models. The idea would be to iteratively try incrementally complex families until finding a model that both performs well and is easy to understand.
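To make this iterative idea concrete, here is a minimal sketch in Python (assuming scikit-learn and a made-up non-linear 2D dataset) that compares linear, quadratic, and cubic hypothesis spaces by cross-validated accuracy:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical 2D binary-classification data with a non-linear boundary.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)

# Try increasingly expressive hypothesis spaces: linear, quadratic, cubic.
for degree in (1, 2, 3):
    model = make_pipeline(PolynomialFeatures(degree),
                          LogisticRegression(max_iter=1000))
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"degree={degree}: mean CV accuracy = {score:.3f}")
```

In this sketch, we would stop at the smallest degree whose accuracy is acceptable, mirroring the strategy described above.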

5. Conclusion

In this article, we talked about hypothesis spaces in machine learning. An algorithm’s hypothesis space contains all the models it can learn from any dataset.

Algorithms with overly expressive spaces can generalize poorly to unseen data and be too complex to understand, whereas those with overly simple hypothesis spaces may underfit the data. So, when applying machine-learning algorithms in practice, we need to find the right balance between expressivity and simplicity.


Hypothesis testing in Machine learning using Python

Yogesh Agrawal

Towards Data Science

Probably everyone who is a beginner or at an intermediate level in machine learning, or a statistics student, has heard the buzzword "hypothesis testing".

Today I will give a brief introduction to this topic, which gave me a headache when I was learning it. I have put all those concepts together, along with examples in Python.

Some questions to keep in mind before we get into the broader things:

What is hypothesis testing? Why do we use it? What are the basics of hypothesis testing? What are the important parameters of hypothesis testing?

Let's take them one by one:

1. What is hypothesis testing?

Hypothesis testing is a statistical method that is used to make statistical decisions using experimental data. A hypothesis is basically an assumption that we make about a population parameter.

Example: you claim that the class average is 40, or that boys are taller than girls.

All such assumptions need some statistical way to prove them; we need a mathematical conclusion that whatever we are assuming is true.

2. Why do we use it?

Hypothesis testing is an essential procedure in statistics. A hypothesis test evaluates two mutually exclusive statements about a population to determine which statement is best supported by the sample data. When we say that a finding is statistically significant, it’s thanks to a hypothesis test .

3. What are the basics of hypothesis testing?

The basics of hypothesis testing are normalisation and standard normalisation. All our hypotheses revolve around these two terms. Let's look at them.

You might be wondering what the difference between these two images is. One might say there is none, while another will point out that one graph is flatter compared to the steeper one. That is not the point: in the first image you can see several different normal curves, each of which can have a different mean and variance, whereas in the second image the distribution always has mean = 0 and variance = 1. The concept of the z-score comes into the picture when we use standardised normal data.

Normal Distribution -

A variable is said to be normally distributed or have a normal distribution if its distribution has the shape of a normal curve — a special bell-shaped curve . … The graph of a normal distribution is called the normal curve , which has all of the following properties : 1. The mean, median, and mode are equal.

Standardised Normal Distribution —

A standard normal distribution is a normal distribution with mean 0 and standard deviation 1

4. What are the important parameters of hypothesis testing?

Null hypothesis :- In inferential statistics, the null hypothesis is a general statement or default position that there is no relationship between two measured phenomena, or no association among groups

In other words, it is a basic assumption made on the basis of domain or problem knowledge.

Example: a company's production is = 50 units per day.

Alternative hypothesis :-

The alternative hypothesis is the hypothesis used in hypothesis testing that is contrary to the null hypothesis. It is usually taken to be that the observations are the result of a real effect (with some amount of chance variation superposed)

Example: a company's production is != 50 units per day.

Level of significance: refers to the degree of significance with which we accept or reject the null hypothesis. 100% accuracy is not possible for accepting or rejecting a hypothesis, so we select a level of significance, usually 5%.

This is normally denoted by alpha and is generally 0.05 or 5%, which means you want to be 95% confident that each sample would give a similar result.

Type I error: When we reject the null hypothesis, although that hypothesis was true. Type I error is denoted by alpha. In hypothesis testing, the normal curve that shows the critical region is called the alpha region

Type II errors: When we accept the null hypothesis but it is false. Type II errors are denoted by beta. In Hypothesis testing, the normal curve that shows the acceptance region is called the beta region.

One-tailed test: a test of a statistical hypothesis where the region of rejection is on only one side of the sampling distribution is called a one-tailed test.

Example: a college has ≥ 4000 students, or ≤ 80% of organisations have adopted data science.

Two-tailed test: a two-tailed test is a statistical test in which the critical area of a distribution is two-sided and tests whether a sample is greater than or less than a certain range of values. If the sample being tested falls into either of the critical areas, the alternative hypothesis is accepted instead of the null hypothesis.

Example: a college has != 4000 students, or != 80% of organisations have adopted data science.

P-value: the P-value, or calculated probability, is the probability of finding the observed, or more extreme, results when the null hypothesis (H0) of a study question is true; the definition of 'extreme' depends on how the hypothesis is being tested.

If your P value is less than the chosen significance level then you reject the null hypothesis i.e. accept that your sample gives reasonable evidence to support the alternative hypothesis. It does NOT imply a “meaningful” or “important” difference; that is for you to decide when considering the real-world relevance of your result.

Example: you have a coin and you don't know whether it is fair or tricky, so let's define the null and alternative hypotheses:

H0: the coin is a fair coin.

H1: the coin is a tricky coin. And alpha = 5%, or 0.05.

Now let's toss the coin and calculate the p-value (probability value).

Toss the coin the 1st time and the result is tails: the p-value is 50% (as heads and tails have equal probability).

Toss the coin a 2nd time and the result is tails again: now the p-value is 50/2 = 25%.

Similarly, if we toss the coin 6 consecutive times and get tails every time, the p-value becomes about 1.5% (0.5^6). But we set our significance level at 5%, meaning we allow a 5% error rate, and here we are beyond that level, i.e. our null hypothesis does not hold, so we reject it and conclude that this coin is a tricky coin, which it actually is.
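As a quick sanity check, this p-value can be computed with a one-sided binomial test (a minimal sketch, assuming SciPy is available):

```python
from scipy.stats import binomtest

# 6 tails out of 6 tosses, under the null hypothesis of a fair coin (p = 0.5).
result = binomtest(k=6, n=6, p=0.5, alternative='greater')
print(result.pvalue)  # 0.015625, i.e. about 1.5%, which is below alpha = 0.05
```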

Degrees of freedom: imagine you have a data set with 10 values. If you're not estimating anything, each value can take on any number; each value is completely free to vary. But suppose you want to test the population mean with a sample of 10 values, using a 1-sample t-test. You now have a constraint: the estimation of the mean. What is that constraint, exactly? By the definition of the mean, the following relationship must hold: the sum of all values in the data must equal n x mean, where n is the number of values in the data set.

So if a data set has 10 values, the sum of the 10 values must equal the mean x 10. If the mean of the 10 values is 3.5 (you could pick any number), this constraint requires that the sum of the 10 values must equal 10 x 3.5 = 35.

With that constraint, the first value in the data set is free to vary. Whatever value it is, it's still possible for the sum of all 10 numbers to be 35. The second value is also free to vary, because whatever value you choose, it still allows for the possibility that the sum of all the values is 35. In fact, the first 9 values can be anything, but once they are fixed, the 10th value is forced to be whatever makes the sum equal 35. So with this one constraint, only 9 values are free to vary, and the degrees of freedom are n - 1 = 9.

Now let's look at some of the widely used types of hypothesis tests:

  • T Test ( Student T test)
  • Chi-Square Test

T-test: a t-test is a type of inferential statistic which is used to determine if there is a significant difference between the means of two groups which may be related in certain features. It is mostly used when the data sets, like a set of data recorded as the outcome of flipping a coin 100 times, would follow a normal distribution and may have unknown variances. The t-test is used as a hypothesis testing tool, which allows testing of an assumption applicable to a population.

The t-test has 2 types: 1. one-sample t-test, 2. two-sample t-test.

One sample t-test : The One Sample t Test determines whether the sample mean is statistically different from a known or hypothesised population mean. The One Sample t Test is a parametric test.

Example: you have 10 ages and you are checking whether the average age is 30 or not (see the Python code below).
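A minimal sketch of that check (the ages below are made up for illustration; SciPy is assumed):

```python
import numpy as np
from scipy import stats

# Hypothetical sample of 10 ages.
ages = np.array([32, 34, 29, 29, 22, 39, 38, 37, 38, 36])

tstat, pval = stats.ttest_1samp(ages, popmean=30)
print("p-value:", pval)

if pval < 0.05:
    print("Reject the null hypothesis that the mean age is 30.")
else:
    print("Fail to reject the null hypothesis.")
```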


Two sampled T-test :- The Independent Samples t Test or 2-sample t-test compares the means of two independent groups in order to determine whether there is statistical evidence that the associated population means are significantly different. The Independent Samples t Test is a parametric test . This test is also known as: Independent t Test .

Example: is there a significant difference between week 1 and week 2 measurements? (Python code is given below.)
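A minimal sketch with two made-up arrays of independent weekly measurements:

```python
import numpy as np
from scipy import stats

# Hypothetical independent samples from two different weeks.
rng = np.random.default_rng(1)
week1 = rng.normal(loc=30, scale=5, size=25)
week2 = rng.normal(loc=33, scale=5, size=25)

print("week1 mean:", week1.mean(), "week2 mean:", week2.mean())

tstat, pval = stats.ttest_ind(week1, week2)
print("p-value:", pval)

if pval < 0.05:
    print("Reject the null hypothesis of equal means.")
else:
    print("Fail to reject the null hypothesis of equal means.")
```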

Paired sample t-test: the paired sample t-test is also called the dependent sample t-test. It is a univariate test that tests for a significant difference between 2 related variables. An example of this is if you were to collect the blood pressure of an individual before and after some treatment, condition, or time point.

H0: the mean difference between the two samples is 0.

H1: the mean difference between the two samples is not 0.

Check the code below for the same.
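A minimal sketch on hypothetical before/after blood-pressure readings for the same individuals:

```python
import numpy as np
from scipy import stats

# Hypothetical blood pressure for the same individuals before and after treatment.
rng = np.random.default_rng(2)
bp_before = rng.normal(loc=156, scale=11, size=30)
bp_after = bp_before - rng.normal(loc=5, scale=8, size=30)

tstat, pval = stats.ttest_rel(bp_before, bp_after)
print("p-value:", pval)

if pval < 0.05:
    print("Reject H0: the mean difference is not 0.")
else:
    print("Fail to reject H0.")
```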

When can you run a Z-test?

Several different types of tests are used in statistics (e.g. F-test, chi-square test, t-test). You would use a Z-test if:

  • Your sample size is greater than 30. Otherwise, use a t test .
  • Data points should be independent from each other. In other words, one data point isn’t related or doesn’t affect another data point.
  • Your data should be normally distributed. However, for large sample sizes (over 30) this doesn’t always matter.
  • Your data should be randomly selected from a population, where each item has an equal chance of being selected.
  • Sample sizes should be equal if at all possible.

Example: a one-sample Z-test on blood-pressure data against a hypothesised mean of 156 (Python code below).
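A minimal sketch, assuming the statsmodels package and a made-up array of blood-pressure readings:

```python
import numpy as np
from statsmodels.stats import weightstats as stests

# Hypothetical blood-pressure readings.
rng = np.random.default_rng(3)
bp_before = rng.normal(loc=156, scale=11, size=120)

ztest_stat, pval = stests.ztest(bp_before, x2=None, value=156)
print("p-value:", float(pval))

if pval < 0.05:
    print("Reject the null hypothesis that the mean is 156.")
else:
    print("Fail to reject the null hypothesis.")
```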

Two-sample Z-test: in a two-sample z-test, similar to the t-test, we compare two independent data groups and decide whether the sample means of the two groups are equal or not.

H0: the difference between the means of the two groups is 0.

H1: the difference between the means of the two groups is not 0.

Example: we compare blood-pressure data before and after treatment (code in Python below).
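A minimal sketch, reusing hypothetical before/after arrays with the same statsmodels helper:

```python
import numpy as np
from statsmodels.stats import weightstats as stests

# Hypothetical independent readings before and after treatment.
rng = np.random.default_rng(4)
bp_before = rng.normal(loc=156, scale=11, size=120)
bp_after = rng.normal(loc=151, scale=14, size=120)

# Null hypothesis: the difference between the two group means is 0.
ztest_stat, pval = stests.ztest(bp_before, x2=bp_after, value=0)
print("p-value:", float(pval))

if pval < 0.05:
    print("Reject the null hypothesis of equal means.")
else:
    print("Fail to reject the null hypothesis.")
```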

ANOVA (F-test): the t-test works well when dealing with two groups, but sometimes we want to compare more than two groups at the same time. For example, if we wanted to test whether voter age differs based on some categorical variable like race, we have to compare the means of each level or group of the variable. We could carry out a separate t-test for each pair of groups, but when you conduct many tests you increase the chances of false positives. The analysis of variance, or ANOVA, is a statistical inference test that lets you compare multiple groups at the same time.

F = Between group variability / Within group variability

Unlike the z and t-distributions, the F-distribution does not have any negative values because between and within-group variability are always positive due to squaring each deviation.

One-way F-test (ANOVA): it tells whether two or more groups are similar or not, based on the similarity of their means and the F-score.

Example: there are 3 different categories of plants with their weights, and we need to check whether all 3 groups are similar or not (code in Python below).
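A minimal sketch with three made-up groups of plant weights, using SciPy's one-way ANOVA:

```python
import numpy as np
from scipy import stats

# Hypothetical plant weights for three categories.
rng = np.random.default_rng(5)
group1 = rng.normal(loc=5.0, scale=0.5, size=20)
group2 = rng.normal(loc=5.2, scale=0.5, size=20)
group3 = rng.normal(loc=5.9, scale=0.5, size=20)

fstat, pval = stats.f_oneway(group1, group2, group3)
print("F =", fstat, "p-value:", pval)

if pval < 0.05:
    print("Reject the null hypothesis: at least one group mean differs.")
else:
    print("Fail to reject the null hypothesis: the groups look similar.")
```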

Two-way F-test: the two-way F-test is an extension of the one-way F-test; it is used when we have 2 independent variables and 2+ groups. The two-way F-test does not tell which variable is dominant; if we need to check individual significance, post-hoc testing needs to be performed.

Now let's take a look at the grand mean crop yield (the mean crop yield not broken down by any sub-group), as well as the mean crop yield by each factor and by the factors grouped together (see the sketch below).
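A minimal sketch of a two-way ANOVA on a hypothetical crop-yield table with two factors; the column names (Fert, Water, Yield) and the numbers are made up, and statsmodels is assumed:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical crop-yield data with two factors.
df = pd.DataFrame({
    "Fert":  ["A", "A", "A", "A", "B", "B", "B", "B"] * 3,
    "Water": ["High", "High", "Low", "Low"] * 6,
    "Yield": [27, 33, 31, 29, 30, 25, 29, 26,
              28, 32, 30, 28, 31, 24, 28, 27,
              29, 34, 32, 30, 29, 26, 27, 25],
})

print("Grand mean yield:", df["Yield"].mean())
print(df.groupby("Fert")["Yield"].mean())
print(df.groupby("Water")["Yield"].mean())

# Two-way ANOVA with an interaction term.
model = ols("Yield ~ C(Fert) + C(Water) + C(Fert):C(Water)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```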

Chi-Square Test- The test is applied when you have two categorical variables from a single population. It is used to determine whether there is a significant association between the two variables.

For example, in an election survey, voters might be classified by gender (male or female) and voting preference (Democrat, Republican, or Independent). We could use a chi-square test for independence to determine whether gender is related to voting preference

Check the example in Python below.
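A minimal sketch on a made-up gender-by-voting-preference contingency table, using SciPy:

```python
import numpy as np
from scipy import stats

# Hypothetical contingency table: rows = gender, columns = voting preference.
#                 Democrat  Republican  Independent
table = np.array([[120,      90,         40],    # male
                  [110,     105,         35]])   # female

chi2, pval, dof, expected = stats.chi2_contingency(table)
print("chi2 =", chi2, "p-value:", pval, "dof:", dof)

if pval < 0.05:
    print("Reject the null hypothesis: gender and preference are related.")
else:
    print("Fail to reject the null hypothesis: no evidence of association.")
```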

You can get all code in my git repository.

Ah, we have finally come to the end of this article. I hope it has helped. Any feedback is always appreciated.

For more updates, check my Git repository and follow me on Medium.

Written by Yogesh Agrawal

Data Scientist


How to calculate hypothesis space

I'm trying to calculate the size of the hypothesis space of a function F. This function takes $N$ binary inputs and outputs a single binary classification.

With $N$ binary inputs, then the size of the domain must be $2^N$ . Then, I would think that for each of these possible $2^N$ instances there must be two hypotheses (one for each output). This would make the total number of hypotheses equal to $2 \times (2^N)$ .

I have read from other sources that the correct number of hypotheses is actually $2^{(2^N)}$ . What is the mistake in my thinking?

  • machine-learning
  • combinatorics

  • 1 $\begingroup$ Could you please explain how you obtain the value of $2\times(2^N)$? That number does not appear to follow from the information you gave. Perhaps a complete enumeration of the cases when $N=2$ would clarify things. $\endgroup$ –  whuber ♦ Commented Dec 25, 2015 at 4:08
  • $\begingroup$ My thinking was that each combination of the N binary inputs could yield a result of either true or false (a binary output). With two possible outputs for each of the 2^N possible function evaluations, I calculated there must be 2*(2^N) different hypotheses. I hope that explains my thinking better. $\endgroup$ –  Isaac Getto Commented Dec 25, 2015 at 4:12
  • $\begingroup$ Please revisit your calculation, because it is incorrect. Explicit consideration of the case $N=2$ may help clear this up. $\endgroup$ –  whuber ♦ Commented Dec 26, 2015 at 14:21

3 Answers

In general, whenever we have a function $f: \mathcal{D} \rightarrow \mathcal{C}$ , the function can be considered as an element of the set $\mathcal{C}^\mathcal{D}$ (called the function space ). The set of all possible functions with domain $\mathcal{D}$ and codomain $\mathcal{C}$ is the full function space $\mathcal{C}^\mathcal{D}$ . Each function in the space can be considered as a list of outputs for each of the inputs --- the list has $|\mathcal{D}|$ elements and each element takes on one of $|\mathcal{C}|$ possible outputs. Consequently, using a simple application of the multiplication principle of counting , we have:

$$\begin{align} \text{No. of possible functions with domain } \mathcal{D} \text{ and codomain } \mathcal{C} &= \underbrace{|\mathcal{C}| \times \cdots \times |\mathcal{C}|}_{|\mathcal{D}| \text{ times}} \\[12pt] &= |\mathcal{C}|^{|\mathcal{D}|}. \\[6pt] \end{align}$$

Now, you have already correctly determined that there are $2^n$ possible inputs in the domain of the function, so we have $|\mathcal{D}| = 2^n$ in the present case. For every possible input in the domain the function output takes on one of two binary values, so we have $|\mathcal{C}| = 2$. Consequently, in this case we have:

$$\text{No. of possible functions with domain } \mathcal{D} \text{ and codomain } \mathcal{C} = |\mathcal{C}|^{|\mathcal{D}|} = 2^{2^n}. $$

  • 1 $\begingroup$ Your answer requires a knowledge of set theory and would be confusing to someone who would not start "counting" from zero. I am not familiar with using domain and codomain in the context of set theory, so I do not fully understand your explanation. It is no doubt correct, but accessibility may be an issue. $\endgroup$ –  Carl Commented Feb 20, 2021 at 1:51
  • $\begingroup$ That is true, but I think this question is inherently a question about function spaces, which are generally explained in terms of sets. In order for the OP to obtain a good knowledge of this issue, I think he will ultimately need to read some material on function spaces and the rules of counting sets. $\endgroup$ –  Ben Commented Feb 20, 2021 at 3:24
  • $\begingroup$ I agree, but other people read this as well, and not everyone, e.g., me, wants to learn about set language. There is nothing wrong with your answer, nor with mine, the only difference is jargon. I tried for accessibility, you tried for precision of set language, question of taste really. $\endgroup$ –  Carl Commented Feb 20, 2021 at 7:04
  • $\begingroup$ P.S. +1 for your answer. $\endgroup$ –  Carl Commented Feb 20, 2021 at 11:14
  • 1 $\begingroup$ I like the fact that you have given a non-set based answer (+1). One of the nice things about having multiple answers is that you get explanations pitched with different levels of assumed knowledge and rigour. $\endgroup$ –  Ben Commented Jul 3, 2022 at 1:56

Think of the output as being a lock (0 closed, 1 opened) that is potentially opened by keys. That is, there might be no combination that can open the lock, or as many as $2^n$ keys that can open it. If the lock can be opened by only one key, then counting in binary it is some number between $0000\dots0000$ and $1111\dots1111$ for a binary number of length $n$ , and there are $2^n$ of those. Next we ask how many combinations of two keys can open the lock and there are $\left(\begin{array}{c}2^n\\2\end{array}\right)$ of those.

In general, we are adding up combinations

$$\left(\begin{array}{c}2^n\\0\end{array}\right)+\left(\begin{array}{c}2^n\\1\end{array}\right)+\left(\begin{array}{c}2^n\\2\end{array}\right)+\dots+\left(\begin{array}{c}2^n\\2^n-1\end{array}\right)+\left(\begin{array}{c}2^n\\2^n\end{array}\right).$$

Finally, as order does not matter, we can use the binomial theorem (see e.g., here ) to get $${m \choose 0} + {m \choose 1} + {m \choose 2} + \dots + {m \choose m} = 2^m,$$ which substituting $m=2^n$ leads us to $2^{2^n}$ , which is the answer you read.

  • $\begingroup$ @Sycorax Like this answer better? $\endgroup$ –  Carl Commented Jan 6, 2021 at 7:31
  • $\begingroup$ @Ben Thanks for the edit, but I'm curious, why improve an answer, which implies that you follow what is being said to the point of wanting to say it better, and then not upvote it? $\endgroup$ –  Carl Commented Feb 19, 2021 at 9:13
  • $\begingroup$ Hi @Carl: Glad you liked the edit. I haven't upvoted because I'm still undecided on whether I like this answer. While I like the attempt to use an example, I'm not sure if the keys/locks analogy really makes function spaces easier or harder to understand. I've just upvoted a couple of your other answers in the meantime, while I think more about it. (Since my edit was purely on formatting and syntactical grounds, it does not really imply a like or dislike of the answer; I just wanted to make the formatting nicer.) $\endgroup$ –  Ben Commented Feb 19, 2021 at 13:08

To calculate the hypothesis space:

[image: a small example dataset with four binary attributes]

Given a dataset like the one in the image above, we can figure it out the following way.

  • Count the number of attributes or features. In this case, we have four features (4).
  • Analyze what values each feature can take (e.g. binary, or many different inputs). In this particular case, we have binary values (0/1).

So there are 2^4 = 16 possible input instances, and each of them can be labelled 0 or 1, which gives 2^(2^4) = 65,536 possible hypotheses.
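To see the count concretely, here is a short enumeration for small $N$ (a sketch using only Python's standard library):

```python
from itertools import product

def count_hypotheses(n):
    """Count all distinct functions from {0,1}^n to {0,1} by enumeration."""
    inputs = list(product([0, 1], repeat=n))            # the 2^n possible inputs
    # A hypothesis assigns an output bit to every possible input.
    labelings = set(product([0, 1], repeat=len(inputs)))
    return len(labelings)

for n in (1, 2, 3):
    print(n, count_hypotheses(n), 2 ** (2 ** n))  # the two counts agree: 4, 16, 256
```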

How To Implement Find-S Algorithm In Machine Learning?

In Machine Learning, concept learning can be termed as "a problem of searching through a predefined space of potential hypotheses for the hypothesis that best fits the training examples" – Tom Mitchell. In this article, we will go through one such concept learning algorithm, known as the Find-S algorithm.

The following topics are discussed in this article:

  • What is Find-S Algorithm in Machine Learning?
  • How Does It Work?
  • Limitations of Find-S Algorithm
  • Implementation of Find-S Algorithm

In order to understand the Find-S algorithm, you need to have a basic idea of the following concepts as well:

  • Concept Learning
  • General Hypothesis
  • Specific Hypothesis

1. Concept Learning 

Let’s try to understand concept learning with a real-life example. Most human learning is based on past instances or experiences. For example, we are able to identify any type of vehicle based on a certain set of features, like make and model, that are defined over a large set of possible features.

These special features differentiate the set of cars, trucks, etc. from the larger set of vehicles. The features that define the set of cars, trucks, etc. are known as concepts.

Similar to this, machines can also learn from concepts to identify whether an object belongs to a specific category or not. Any algorithm that supports concept learning requires the following:

  • Training Data
  • Target Concept
  • Actual Data Objects

2. General Hypothesis

Hypothesis, in general, is an explanation for something. The general hypothesis basically states the general relationship between the major variables. For example, a general hypothesis for ordering food would be  I want a burger.

G = { ‘?’, ‘?’, ‘?’, …..’?’}

3. Specific Hypothesis

The specific hypothesis fills in all the important details about the variables given in the general hypothesis. Adding more specific details to the example given above, it would be: I want a cheeseburger with a chicken pepperoni filling and a lot of lettuce.

S = {‘Φ’,’Φ’,’Φ’, ……,’Φ’}

Now, let’s talk about the Find-S Algorithm in Machine Learning.

The Find-S algorithm follows the steps written below:

  • Initialize ‘h’ to the most specific hypothesis.
  • The Find-S algorithm only considers the positive examples and eliminates the negative ones. For each positive example, the algorithm checks each attribute in the example. If the attribute value is the same as the hypothesis value, the algorithm moves on without any changes. But if the attribute value is different from the hypothesis value, the algorithm changes it to ‘?’.

Now that we are done with the basic explanation of the Find-S algorithm, let us take a look at how it works.

How Does It Work?

  • The process starts with initializing ‘h’ with the most specific hypothesis, generally, it is the first positive example in the data set.
  • We check each example. If the example is negative, we move on to the next one, but if it is a positive example we consider it for the next step.
  • We will check if each attribute in the example is equal to the hypothesis value.
  • If the value matches, then no changes are made.
  • If the value does not match, the value is changed to ‘?’.
  • We do this until we reach the last positive example in the data set.

There are a few limitations of the Find-S algorithm listed down below:

  • There is no way to determine if the hypothesis is consistent throughout the data.
  • Inconsistent training sets can actually mislead the Find-S algorithm, since it ignores the negative examples.
  • Find-S algorithm does not provide a backtracking technique to determine the best possible changes that could be done to improve the resulting hypothesis.

Now that we are aware of the limitations of the Find-S algorithm, let us take a look at a practical implementation of the Find-S Algorithm.

To understand the implementation, let us apply it to a smaller data set with a handful of examples to decide if a person wants to go for a walk.

The target concept of this particular problem is: on what kind of days does a person like to go for a walk?

Morning | Sunny | Warm     | Yes | Mild   | Strong | Yes
Evening | Rainy | Cold     | No  | Mild   | Normal | No
Morning | Sunny | Moderate | Yes | Normal | Normal | Yes
Evening | Sunny | Cold     | Yes | High   | Strong | Yes

Looking at the data set, we have six attributes and a final attribute that defines the positive or negative example. In this case, yes is a positive example, which means the person will go for a walk.

So, the initial hypothesis, taken from the first positive example, is:

h0 = {‘Morning’, ‘Sunny’, ‘Warm’, ‘Yes’, ‘Mild’, ‘Strong’}

Starting from this hypothesis, we now consider each example one by one, but only the positive examples.

h1 = {‘Morning’, ‘Sunny’, ‘?’, ‘Yes’, ‘?’, ‘?’}

h2 = {‘?’, ‘Sunny’, ‘?’, ‘Yes’, ‘?’, ‘?’}

We replaced every attribute value that differed across the positive examples with ‘?’ to get the resulting hypothesis. Now that we know how the Find-S algorithm works, let us take a look at an implementation using Python.

Let’s implement the above example using Python. A sketch of the Find-S algorithm applied to the above data is given below.
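Here is a minimal sketch of Find-S on the walk data set above (plain Python; the attribute order matches the rows shown earlier):

```python
# Training data: six attribute values followed by the target label.
data = [
    ["Morning", "Sunny", "Warm",     "Yes", "Mild",   "Strong", "Yes"],
    ["Evening", "Rainy", "Cold",     "No",  "Mild",   "Normal", "No"],
    ["Morning", "Sunny", "Moderate", "Yes", "Normal", "Normal", "Yes"],
    ["Evening", "Sunny", "Cold",     "Yes", "High",   "Strong", "Yes"],
]

def find_s(examples):
    hypothesis = None
    for row in examples:
        attributes, label = row[:-1], row[-1]
        if label != "Yes":              # Find-S ignores negative examples.
            continue
        if hypothesis is None:          # Initialize with the first positive example.
            hypothesis = list(attributes)
            continue
        for i, value in enumerate(attributes):
            if hypothesis[i] != value:  # Generalize mismatching attributes to '?'.
                hypothesis[i] = "?"
    return hypothesis

print(find_s(data))   # ['?', 'Sunny', '?', 'Yes', '?', '?']
```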

This brings us to the end of this article, where we have learned the Find-S Algorithm in Machine Learning with its implementation and use case. I hope you are clear with all that has been shared with you in this tutorial.

With immense applications and easier implementations of Python with data science, there has been a significant increase in the number of jobs created for data science every year. Enroll for Edureka’s Data Science with Python and get hands-on experience with real-time industry projects along with 24x7 support, which will set you on the path of becoming a successful Data Scientist.

We are here to help you with every step on your journey and come up with a curriculum that is designed for students and professionals who want to be a Machine Learning Engineer. The course is designed to give you a head start into Python programming and train you for both core and advanced Python concepts along with various Machine Learning algorithms like SVM, Decision Tree, etc.

If you come across any questions, feel free to ask all your questions in the comments section of “Find-S Algorithm In Machine Learning” and our team will be glad to answer.

","eventStatus":"https://schema.org/EventScheduled","image":"https://d1jnx9ba8s6j9r.cloudfront.net/imgver.0000000000/img/co_img_2263_1678870066.png","startDate":"2024-08-31T07:00:00+0000","endDate":"2024-09-22T10:00:00+0000","Duration":"R8/P7DT4H","inLanguage":"en-US","url":"https://www.edureka.co/masters-program/machine-learning-engineer-training?utm_source=blogbatch&utm_campaign=batch_details","eventAttendanceMode":"https://schema.org/OnlineEventAttendanceMode","location":[{"@type":"VirtualLocation","url":"https://www.edureka.co/masters-program/machine-learning-engineer-training?utm_source=blogbatch&utm_campaign=batch_details"}],"organizer":{"@type":"Thing","name":"Edureka","url":"https://www.edureka.co"},"performer":"Edureka","offers":{"@type":"AggregateOffer","url":"https://www.edureka.co/masters-program/machine-learning-engineer-training?utm_source=blogbatch&utm_campaign=batch_details","availability":"InStock","price":"1499","priceCurrency":"USD","lowprice":"1499","highprice":"1499","validFrom":"2024-08-28T00:00:00+0000"}}
Course NameDateDetails

Class Starts on 31st August,2024

31st August

SAT&SUN (Weekend Batch)

Recommended videos for you

Introduction to mahout, recommended blogs for you, top 12 artificial intelligence (ai) tools you need to know, ai for startups: opportunities, challenges, and best practices, top 10 benefits of artificial intelligence, building a chatbot using prompt engineering, what is em algorithm in machine learning, what are the prerequisites for machine learning, openai playground vs chatgpt, top 10 generative ai companies in 2024 and their key features, supervised learning in apache mahout, how to get started with chatgpt, how to use chatgpt for devops, machine learning in r for beginners with example, what is narrow artificial intelligence(narrow ai) with examples, fuzzy k-means clustering in mahout, generative ai models: a comprehensive guide, join the discussion cancel reply, trending courses in artificial intelligence, human-computer interaction (hci) for ai syste ....

  • 2k Enrolled Learners
  • Weekend/Weekday

ChatGPT Training Course: Beginners to Advance ...

  • 15k Enrolled Learners

Generative AI in Business: University of Camb ...

  • 1k Enrolled Learners

Prompt Engineering Course

  • 5k Enrolled Learners

Artificial Intelligence Certification Course

  • 16k Enrolled Learners

MLOps Certification Course Online

  • 6k Enrolled Learners

Large Language Models (LLMs) Certification Co ...

  • 3k Enrolled Learners

Reinforcement Learning

Graphical models certification training, generative ai in hr certification course, browse categories, subscribe to our newsletter, and get personalized recommendations..

Already have an account? Sign in .

20,00,000 learners love us! Get personalised resources in your inbox.

At least 1 upper-case and 1 lower-case letter

Minimum 8 characters and Maximum 50 characters

We have recieved your contact details.

You will recieve an email from us shortly.

Machine Learning Theory - Part 2: Generalization Bounds

Last time we concluded by noticing that minimizing the empirical risk (or the training error) is not in itself a solution to the learning problem. It could only be considered a solution if we can guarantee that the difference between the training error and the generalization error (which is also called the generalization gap ) is small enough. We formalized that requirement using the probability:

\[\mathbb{P}\left[\sup_{h \in \mathcal{H}}\left|R(h) - R_\text{emp}(h)\right| > \epsilon\right]\]

That is, if this probability is small, we can guarantee that the difference between the errors is not much, and hence the learning problem can be solved.

In this part we’ll start investigating that probability in depth and see if it indeed can be small; but before starting, you should note that I skipped a lot of the mathematical proofs here. You’ll often see phrases like “It can be proved that …”, “One can prove …”, “It can be shown that …”, etc. without the actual proof being given. This is to make the post easier to read and to focus all the effort on the conceptual understanding of the subject. In case you wish to get your hands dirty with proofs, you can find all of them in the additional readings, or on the Internet of course!

Independently and Identically Distributed

The world can be a very messy place! This is a problem that faces any theoretical analysis of a real world phenomenon; usually we can’t really capture all the messiness in mathematical terms, and even if we’re able to, we usually don’t have the tools to get any results from such a messy mathematical model.

So in order for theoretical analysis to move forward, some assumptions must be made to simplify the situation at hand, we can then use the theoretical results from that simplification to infer about reality.

Assumptions are common practice in theoretical work. Assumptions are not bad in themselves, only bad assumptions are bad! As long as our assumptions are reasonable and not crazy, they’ll hold significant truth about reality.

A reasonable assumption we can make about the problem we have at hand is that our training dataset samples are independently, and identically distributed (or i.i.d. for short), that means that all the samples are drawn from the same probability distribution and that each sample is independent from the others.

This assumption is essential for us. We need it to start using the tools from probability theory to investigate our generalization probability, and it’s a very reasonable assumption because:

  • It’s more likely for a dataset used for inferring about an underlying probability distribution to be all sampled from that same distribution. If this is not the case, then the statistics we get from the dataset will be noisy and won’t correctly reflect the target underlying distribution.
  • It’s more likely that each sample in the dataset is chosen without considering any other sample that has been chosen before or will be chosen after. If that’s not the case and the samples are dependent, then the dataset will suffer from a bias towards a specific direction in the distribution, and hence will fail to reflect the underlying distribution correctly.

So we can build upon that assumption with no fear.

The Law of Large Numbers

Most of us, since we were kids, know that if we tossed a fair coin a large number of times, roughly half of the times we’re gonna get heads. This is an instance of the widely known fact about probability that if we retried an experiment a sufficiently large number of times, the average outcome of these experiments (or, more formally, the sample mean ) will be very close to the true mean of the underlying distribution. This fact is formally captured into what we call The law of large numbers :

If $x_1, x_2, …, x_m$ are $m$ i.i.d. samples of a random variable $X$ distributed by $P$. then for a small positive non-zero value $\epsilon$: \[\lim_{m \rightarrow \infty} \mathbb{P}\left[\left|\mathop{\mathbb{E}}_{X \sim P}[X] - \frac{1}{m}\sum_{i=1}^{m}x_i \right| > \epsilon\right] = 0\]

This version of the law is called the weak law of large numbers . It’s weak because it guarantees only that, as the sample size grows, the sample mean and the true mean are likely to lie within a distance no greater than epsilon of each other. On the other hand, the strong version says that with a very large sample size, the sample mean is almost surely equal to the true mean.

The formulation of the weak law lends itself naturally to our generalization probability. Recalling that the empirical risk is actually the sample mean of the errors and the risk is the true mean, for a single hypothesis $h$ we can say that:

\[\lim_{m \rightarrow \infty} \mathbb{P}\left[\left|R(h) - R_\text{emp}(h)\right| > \epsilon\right] = 0\]

Well, that’s progress. A pretty small one, but still progress! Can we do any better?

Hoeffding’s inequality

The law of large numbers is like someone pointing the directions to you when you’re lost, they tell you that by following that road you’ll eventually reach your destination, but they provide no information about how fast you’re gonna reach your destination, what is the most convenient vehicle, should you walk or take a cab, and so on.

To reach our destination of ensuring that the training and generalization errors do not differ much, we need to know more about what the road down the law of large numbers looks like. This information is provided by what we call concentration inequalities . These are a set of inequalities that quantify how much random variables (or functions of them) deviate from their expected values (or, also, functions of them). One of those inequalities is Hoeffding’s inequality :

If $x_1, x_2, …, x_m$ are $m$ i.i.d. samples of a random variable $X$ distributed by $P$, and $a \leq x_i \leq b$ for every $i$, then for a small positive non-zero value $\epsilon$: \[\mathbb{P}\left[\left|\mathop{\mathbb{E}}_{X \sim P}[X] - \frac{1}{m}\sum_{i=1}^{m}x_i\right| > \epsilon\right] \leq 2\exp\left(\frac{-2m\epsilon^2}{(b - a)^2}\right)\]

You probably see why we specifically chose Hoeffding’s inequality from among the others. We can naturally apply this inequality to our generalization probability, assuming that our errors are bounded between 0 and 1 (which is a reasonable assumption, as we can get that using a 0/1 loss function or by squashing any other loss between 0 and 1), and get for a single hypothesis $h$:

\[\mathbb{P}\left[\left|R(h) - R_\text{emp}(h)\right| > \epsilon\right] \leq 2\exp\left(-2m\epsilon^2\right)\]

This means that the probability of the difference between the training and the generalization errors exceeding $\epsilon$ decays exponentially as the dataset grows. This should align well with our practical experience that the bigger the dataset gets, the better the results become.
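A quick numerical sanity check of this behaviour (a sketch, assuming NumPy): estimate the deviation probability for a Bernoulli(0.5) error variable by simulation and compare it with the Hoeffding bound as $m$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
eps, true_mean, trials = 0.05, 0.5, 2000

for m in (100, 500, 2000):
    # Empirical estimate of P(|sample mean - true mean| > eps).
    samples = rng.integers(0, 2, size=(trials, m))        # 0/1 "errors"
    deviations = np.abs(samples.mean(axis=1) - true_mean) > eps
    empirical = deviations.mean()
    bound = 2 * np.exp(-2 * m * eps ** 2)                 # Hoeffding bound
    print(f"m={m:5d}  simulated={empirical:.4f}  bound={bound:.4f}")
```

The simulated probability always stays below the bound, and both shrink rapidly as $m$ increases.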

If you noticed, all our analysis up till now was focusing on a single hypothesis $h$. But the learning problem doesn’t know that single hypothesis beforehand, it needs to pick one out of an entire hypothesis space $\mathcal{H}$, so we need a generalization bound that reflects the challenge of choosing the right hypothesis.

Generalization Bound: 1st Attempt

In order for the entire hypothesis space to have a generalization gap bigger than $\epsilon$, at least one of its hypotheses ($h_1$ or $h_2$ or $h_3$ or … etc.) should have one. This can be expressed formally by stating that:

\[\mathbb{P}\left[\sup_{h \in \mathcal{H}}\left|R(h) - R_\text{emp}(h)\right| > \epsilon\right] = \mathbb{P}\left[\bigcup_{h \in \mathcal{H}}\left\{\left|R(h) - R_\text{emp}(h)\right| > \epsilon\right\}\right]\]

Where $\bigcup$ denotes the union of the events, which also corresponds to the logical OR operator. Using the union bound inequality , we get:

\[\mathbb{P}\left[\sup_{h \in \mathcal{H}}\left|R(h) - R_\text{emp}(h)\right| > \epsilon\right] \leq \sum_{h \in \mathcal{H}}\mathbb{P}\left[\left|R(h) - R_\text{emp}(h)\right| > \epsilon\right]\]

We know exactly the bound on the probability under the summation from our analysis using Hoeffding’s inequality, so we end up with:

\[\mathbb{P}\left[\sup_{h \in \mathcal{H}}\left|R(h) - R_\text{emp}(h)\right| > \epsilon\right] \leq 2|\mathcal{H}|\exp\left(-2m\epsilon^2\right)\]

Where $|\mathcal{H}|$ is the size of the hypothesis space. By denoting the right hand side of the above inequality by $\delta$, we can say that with a confidence $1 - \delta$:

\[\left|R(h) - R_\text{emp}(h)\right| \leq \epsilon \quad \Longrightarrow \quad R(h) \leq R_\text{emp}(h) + \epsilon\]

And with some basic algebra, we can express $\epsilon$ in terms of $\delta$ and get:

\[R(h) \leq R_\text{emp}(h) + \sqrt{\frac{\ln|\mathcal{H}| + \ln\frac{2}{\delta}}{2m}}\]

This is our first generalization bound: it states that the generalization error is bounded by the training error plus a function of the hypothesis space size and the dataset size. We can also see that the bigger the hypothesis space gets, the bigger the generalization error becomes. This explains why the memorization hypothesis from last time, which theoretically has $|\mathcal{H}| = \infty$, fails miserably as a solution to the learning problem despite having $R_\text{emp} = 0$; because for the memorization hypothesis $h_\text{mem}$ the bound becomes

\[R(h_\text{mem}) \leq 0 + \sqrt{\frac{\ln\infty + \ln\frac{2}{\delta}}{2m}} = \infty\]

which is vacuous.

But wait a second! For a linear hypothesis of the form $h(x) = wx + b$, we also have $|\mathcal{H}| = \infty$, as there are infinitely many lines that can be drawn. So the generalization error of the linear hypothesis space should be unbounded, just as for the memorization hypothesis! If that’s true, why do perceptrons, logistic regression, support vector machines, and essentially any ML model that uses a linear hypothesis work?

Our theoretical result was able to account for some phenomena (the memorization hypothesis, and any finite hypothesis space) but not for others (the linear hypothesis, or other infinite hypothesis spaces that empirically work). This means that there’s still something missing from our theoretical model, and it’s time for us to revise our steps. A good starting point is from the source of the problem itself, which is the infinity in $|\mathcal{H}|$.

Notice that the term $|\mathcal{H}|$ resulted from our use of the union bound. The basic idea of the union bound is that it bounds the probability by the worst case possible, which is when all the events under union are mutually independent. This bound gets more tight as the events under consideration get less dependent. In our case, for the bound to be tight and reasonable, we need the following to be true:

For every two hypothesis $h_1, h_2 \in \mathcal{H}$ the two events $|R(h_1) - R_\text{emp}(h_1)| > \epsilon$ and $|R(h_2) - R_\text{emp}(h_2)| > \epsilon$ are likely to be independent. This means that the event that $h_1$ has a generalization gap bigger than $\epsilon$ should be independent of the event that also $h_2$ has a generalization gap bigger than $\epsilon$, no matter how much $h_1$ and $h_2$ are close or related; the events should be coincidental.

But is that true?

Examining the Independence Assumption

The first question we need to ask here is why do we need to consider every possible hypothesis in $\mathcal{H}$? This may seem like a trivial question; as the answer is simply that because the learning algorithm can search the entire hypothesis space looking for its optimal solution. While this answer is correct, we need a more formal answer in light of the generalization inequality we’re studying.

The formulation of the generalization inequality reveals a main reason why we need to consider all the hypothesis in $\mathcal{H}$. It has to do with the existence of $\sup_{h \in \mathcal{H}}$. The supremum in the inequality guarantees that there’s a very little chance that the biggest generalization gap possible is greater than $\epsilon$; this is a strong claim and if we omit a single hypothesis out of $\mathcal{H}$, we might miss that “biggest generalization gap possible” and lose that strength, and that’s something we cannot afford to lose. We need to be able to make that claim to ensure that the learning algorithm would never land on a hypothesis with a bigger generalization gap than $\epsilon$.

[Figure: a “rainbow” of linear hypotheses that all classify the sample points identically]

Looking at the above plot of a binary classification problem, it’s clear that this rainbow of hypotheses produces the same classification on the data points, so all of them have the same empirical risk. So one might think: as they all have the same $R_\text{emp}$, why not choose one and omit the others?!

This would be a very good solution if we’re only interested in the empirical risk, but our inequality also takes into consideration the out-of-sample risk, which is expressed as:

\[R(h) = \mathop{\mathbb{E}}_{(x, y) \sim P}\left[L\big(y, h(x)\big)\right] = \int_{\mathcal{X} \times \mathcal{Y}} L\big(y, h(x)\big)\,\mathrm{d}P(x, y)\]

This is an integration over every possible combination of the whole input and output spaces $\mathcal{X, Y}$. So in order to ensure our supremum claim, we need the hypothesis to cover the whole of $\mathcal{X \times Y}$, hence we need all the possible hypotheses in $\mathcal{H}$.

Now that we’ve established that we do need to consider every single hypothesis in $\mathcal{H}$, we can ask ourselves: are the events of each hypothesis having a big generalization gap likely to be independent?

Well, not even close! Take for example the rainbow of hypotheses in the above plot: it’s very clear that if the red hypothesis has a generalization gap greater than $\epsilon$, then, with 100% certainty, every hypothesis with the same slope in the region above it will also have one. The same argument can be made for many different regions in the $\mathcal{X \times Y}$ space, with different degrees of certainty, as in the following figure.

[Figure: regions of the $\mathcal{X \times Y}$ space in which the hypotheses’ generalization gaps are dependent to varying degrees]

But this is not helpful for our mathematical analysis, as the regions seem to depend on the distribution of the sample points, and there is no way we can precisely capture these dependencies mathematically; we also cannot make assumptions about them without risking compromising the supremum claim.

So the union bound and the independence assumption seem like the best approximation we can make, but it highly overestimates the probability and makes the bound very loose, and very pessimistic!

However, what if somehow we can get a very good estimate of the risk $R(h)$ without needing to go over the whole of the $\mathcal{X \times Y}$ space, would there be any hope to get a better bound?

The Symmetrization Lemma

Let’s think for a moment about something we usually do in machine learning practice. In order to measure the accuracy of our model, we hold out a part of the training set to evaluate the model on after training, and we consider the model’s accuracy on this left-out portion as an estimate for the generalization error. This works because we assume that this test set is drawn i.i.d. from the same distribution as the training set (this is why we usually shuffle the whole dataset beforehand, to break any correlation between the samples).

It turns out that we can do a similar thing mathematically, but instead of taking out a portion of our dataset $S$, we imagine that we have another dataset $S’$, also of size $m$; we call this the ghost dataset . Note that this has no practical implications; we don’t need to have another dataset at training time, it’s just a mathematical trick we’re gonna use to get rid of the restrictions of $R(h)$ in the inequality.

We’re not gonna go over the proof here, but using that ghost dataset one can actually prove that:

\[\mathbb{P}\left[\sup_{h \in \mathcal{H}}\left|R(h) - R_\text{emp}(h)\right| > \epsilon\right] \leq 2\,\mathbb{P}\left[\sup_{h \in \mathcal{H}}\left|R_\text{emp}(h) - R_\text{emp}'(h)\right| > \frac{\epsilon}{2}\right] \tag{1}\]

where $R_\text{emp}’(h)$ is the empirical risk of hypothesis $h$ on the ghost dataset. This means that the probability of the largest generalization gap being bigger than $\epsilon$ is at most twice the probability that the empirical risk difference between $S, S’$ is larger than $\frac{\epsilon}{2}$. Now that the right hand side is expressed only in terms of empirical risks, we can bound it without needing to consider the whole of $\mathcal{X \times Y}$, and hence we can bound the term with the risk $R(h)$ without considering the whole of the input and output spaces!

This, which is called the symmetrization lemma , was one of the two key parts in the work of Vapnik-Chervonenkis (1971).

The Growth Function

Now that we are bounding only the empirical risk, if we have many hypotheses that have the same empirical risk (a.k.a. producing the same labels/values on the data points), we can safely choose one of them as a representative of the whole group, we’ll call that an effective hypothesis, and discard all the others.

By only choosing the distinct effective hypotheses on the dataset $S$, we restrict the hypothesis space $\mathcal{H}$ to a smaller subspace that depends on the dataset $\mathcal{H}_{|S}$.

We can assume the independence of the hypotheses in $\mathcal{H}_{|S}$ like we did before with $\mathcal{H}$ (but it’s more plausible now), and use the union bound to get that:

\[\mathbb{P}\left[\sup_{h \in \mathcal{H}_{|S \cup S'}}\left|R_\text{emp}(h) - R_\text{emp}'(h)\right| > \frac{\epsilon}{2}\right] \leq \left|\mathcal{H}_{|S \cup S'}\right|\,\mathbb{P}\left[\left|R_\text{emp}(h) - R_\text{emp}'(h)\right| > \frac{\epsilon}{2}\right]\]

Notice that the hypothesis space is restricted by $S \cup S'$ because we are using the empirical risk on both the original dataset $S$ and the ghost $S'$. The question now is: what is the maximum size of a restricted hypothesis space? The answer is very simple; we consider a hypothesis to be a new effective one if it produces new labels/values on the dataset samples, so the maximum number of distinct hypotheses (a.k.a. the maximum size of the restricted space) is the maximum number of distinct labellings/values the dataset points can take. A cool feature of that maximum size is that it's a combinatorial measure, so we don't need to worry about how the samples are distributed!

For simplicity, we’ll focus now on the case of binary classification, in which $\mathcal{Y}=\{-1, +1\}$. Later we’ll show that the same concepts can be extended to both multiclass classification and regression. In that case, for a dataset with $m$ samples, each of which can take one of two labels: either -1 or +1, the maximum number of distinct labellings is $2^m$.

We’ll define the maximum number of distinct labellings/values on a dataset $S$ of size $m$ by a hypothesis space $\mathcal{H}$ as the growth function of $\mathcal{H}$ given $m$, and we’ll denote that by $\Delta_\mathcal{H}(m)$. It’s called the growth function because its value for a single hypothesis space $\mathcal{H}$ (a.k.a. the size of the restricted subspace $\mathcal{H_{|S}}$) grows as the size of the dataset grows. Now we can say that:

\[\mathbb{P}\left[\sup_{h \in \mathcal{H}}\left|R(h) - R_\text{emp}(h)\right| > \epsilon\right] \leq 2\,\Delta_\mathcal{H}(2m)\,\mathbb{P}\left[\left|R_\text{emp}(h) - R_\text{emp}'(h)\right| > \frac{\epsilon}{2}\right] \tag{2}\]

Notice that we used $2m$ because we have two datasets $S,S’$ each with size $m$.

For the binary classification case, we can say that:

\[\Delta_\mathcal{H}(m) \leq 2^m\]

But $2^m$ is exponential in $m$ and grows too fast for large datasets, which makes the bound in our inequality blow up equally fast! Is that the best bound we can get on the growth function?

The VC-Dimension

The $2^m$ bound is based on the fact that the hypothesis space $\mathcal{H}$ can produce all the possible labellings on the $m$ data points. If a hypothesis space can indeed produce all the possible labels on a set of data points, we say that the hypothesis space shatters that set.

But can any hypothesis space shatter any dataset of any size? Let’s investigate that with the binary classification case and the $\mathcal{H}$ of linear classifiers $\mathrm{sign}(wx + b)$. The following animation shows how many ways a linear classifier in 2D can label 3 points (on the left) and 4 points (on the right).

In the animation, the whole space of possible effective hypotheses is swept. For the three points, the hypothesis space shattered the set of points and produced all the possible $2^3 = 8$ labellings. However, for the four points, the hypothesis space couldn’t produce more than 14 labellings and never reached $2^4 = 16$, so it failed to shatter this set of points. Actually, no linear classifier in 2D can shatter any set of 4 points, not just that set, because there will always be two labellings that cannot be produced by a linear classifier, as depicted in the following figure.

4-points

From the decision boundary plot (on the right), it’s clear why no linear classifier can produce such labellings: no linear classifier can divide the space in this way. So it’s possible for a hypothesis space $\mathcal{H}$ to be unable to shatter sets of every size. This fact can be used to get a better bound on the growth function, and this is done using Sauer’s lemma:

If a hypothesis space $\mathcal{H}$ cannot shatter any dataset with size more than $k$, then: \[\Delta_{\mathcal{H}}(m) \leq \sum_{i=0}^{k}\binom{m}{i}\]

This was the other key part of the Vapnik-Chervonenkis work (1971), but it’s named after another mathematician, Norbert Sauer, because it was independently proved by him around the same time (1972). However, Vapnik and Chervonenkis weren’t completely left out of this contribution, as that $k$, which is the maximum number of points that can be shattered by $\mathcal{H}$, is now called the Vapnik-Chervonenkis dimension, or the VC dimension $d_{\mathrm{vc}}$, of $\mathcal{H}$.

For the case of the linear classifier in 2D, $d_\mathrm{vc} = 3$. In general, it can be proved that hyperplane classifiers (the higher-dimensional generalization of line classifiers) in $\mathbb{R}^n$ have $d_\mathrm{vc} = n + 1$.

The bound on the growth function provided by Sauer’s lemma is indeed much better than the exponential one we already have; it’s actually polynomial! Using algebraic manipulation, we can prove that:
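One common form of that manipulation (valid for $m \geq d_\mathrm{vc}$) gives:

\[
\Delta_\mathcal{H}(m) \leq \left(\frac{em}{d_\mathrm{vc}}\right)^{d_\mathrm{vc}} = O\!\left(m^{d_\mathrm{vc}}\right)
\]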

where $O$ refers to Big-O notation for a function’s asymptotic (near-the-limit) behavior, and $e$ is the mathematical constant.

Thus we can use the VC-dimension as a proxy for the growth function and, hence, for the size of the restricted space $\mathcal{H_{|S}}$. In that case, $d_\mathrm{vc}$ would be a measure of the complexity or richness of the hypothesis space.

The VC Generalization Bound

With a little change in the constants, it can be shown that Hoeffding’s inequality is applicable to the probability $\mathbb{P}\left[|R_\mathrm{emp}(h) - R_\mathrm{emp}'(h)| > \frac{\epsilon}{2}\right]$. With that, and by combining inequalities (1) and (2), the Vapnik-Chervonenkis theory follows:
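A standard form of the resulting VC inequality (following the constants used in Abu-Mostafa et al.’s Learning from Data, listed in the readings below; the original figures may differ) is:

\[
\mathbb{P}\left[\sup_{h \in \mathcal{H}}|R(h) - R_\mathrm{emp}(h)| > \epsilon\right] \leq 4\,\Delta_\mathcal{H}(2m)\,\exp\!\left(-\frac{\epsilon^2 m}{8}\right)
\]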

This can be re-expressed as a bound on the generalization error, just as we did earlier with the previous bound, to get the VC generalization bound :
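With the same caveat about constants, one standard form states that, with probability at least $1 - \delta$:

\[
R(h) \leq R_\mathrm{emp}(h) + \sqrt{\frac{8}{m}\ln\frac{4\,\Delta_\mathcal{H}(2m)}{\delta}}
\]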

or, by using the bound on growth function in terms of $d_\mathrm{vc}$ as:
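Using Sauer’s lemma in the form $\Delta_\mathcal{H}(2m) \leq (2m)^{d_\mathrm{vc}} + 1$, this becomes (again, up to the exact constants): with probability at least $1 - \delta$,

\[
R(h) \leq R_\mathrm{emp}(h) + \sqrt{\frac{8}{m}\ln\frac{4\left((2m)^{d_\mathrm{vc}} + 1\right)}{\delta}}
\]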


Professor Vapnik standing in front of a whiteboard bearing a form of the VC bound and the phrase “All your Bayes are belong to us”, a play on the broken-English phrase from the classic video game Zero Wing, claiming that the VC framework of inference is superior to Bayesian inference. [Courtesy of Yann LeCun.]

This is a significant result! It’s a clear and concise mathematical statement that the learning problem is solvable, and that for infinite hypothesis spaces there is a finite bound on their generalization error! Furthermore, this bound can be described in terms of a quantity ($d_\mathrm{vc}$) that depends solely on the hypothesis space and not on the distribution of the data points!

Now, in light of these results, is there any hope for the memorization hypothesis?

It turns out that there’s still no hope! The memorization hypothesis can shatter any dataset no matter how big it is, which means that its $d_\mathrm{vc}$ is infinite, yielding an infinite bound on $R(h_\mathrm{mem})$ as before. However, the success of linear hypotheses can now be explained by the fact that they have a finite $d_\mathrm{vc} = n + 1$ in $\mathbb{R}^n$. The theory is now consistent with the empirical observations.

Distribution-Based Bounds

The fact that $d_\mathrm{vc}$ is distribution-free comes with a price: by not exploiting the structure and the distribution of the data samples, the bound tends to be loose. Consider, for example, the case of linear binary classifiers in a very high-dimensional feature space; using the distribution-free $d_\mathrm{vc} = n + 1$ means that the bound on the generalization error would be poor unless the size of the dataset $N$ is also very large to balance the effect of the large $d_\mathrm{vc}$. This is the good old curse of dimensionality we all know and endure.

However, a careful investigation into the distribution of the data samples can bring more hope to the situation. For example, for data points that are linearly separable, contained in a ball of radius $R$, with a margin $\rho$ between the closest points in the two classes, one can prove that for a hyperplane classifier:
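One common statement of this margin-based result (the exact form of the constant varies between sources) is:

\[
d_\mathrm{vc} \leq \min\!\left(\left\lceil \frac{R^2}{\rho^2} \right\rceil,\; n\right) + 1
\]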

It follows that the larger the margin, the lower the $d_\mathrm{vc}$ of the hypothesis space. This is the theoretical motivation behind Support Vector Machines (SVMs), which attempt to classify data using the maximum-margin hyperplane. This result was also proved by Vapnik and Chervonenkis.

One Inequality to Rule Them All

Up until this point, all our analysis was for the case of binary classification. And it’s indeed true that the form of the VC bound we arrived at here only works for the binary classification case. However, the conceptual framework of VC theory (that is: shattering, the growth function, and the dimension) generalizes very well to both multiclass classification and regression.

Through the work of Natarajan (1989), the Natarajan dimension was defined as a generalization of the VC dimension for multiclass classification, and a bound similar to the VC bound was derived in terms of it. Likewise, through the work of Pollard (1984), the pseudo-dimension generalizes the VC dimension to the regression case, with a bound on the generalization error that is also similar to VC’s.

There is also Rademacher’s complexity, a relatively new tool (devised in the 2000s) that measures the richness of a hypothesis space by measuring how well it can fit random noise. The cool thing about Rademacher’s complexity is that it’s flexible enough to be adapted to any learning problem, and it yields generalization bounds very similar to the other methods mentioned.

However, no matter what the exact form of the bound produced by any of these methods is, it always takes the form:
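Schematically, with probability at least $1 - \delta$ (the exact expression of the complexity term depends on the measure used):

\[
R(h) \leq R_\mathrm{emp}(h) + C(|\mathcal{H}|, N, \delta)
\]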

where $C$ is a function of the hypothesis space’s complexity (or size, or richness), the size of the dataset $N$, and the confidence $1 - \delta$ about the bound. This inequality basically says that the generalization error can be decomposed into two parts: the empirical training error and the complexity of the learning model.

This form of the inequality holds for any learning problem, no matter the exact form of the bound, and it’s the one we’re gonna use throughout the rest of the series to guide us through the process of machine learning.

References and Additional Readings

  • Mohri, Mehryar, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2012.
  • Shalev-Shwartz, Shai, and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
  • Abu-Mostafa, Yaser S., Malik Magdon-Ismail, and Hsuan-Tien Lin. Learning from Data: A Short Course. AMLBook, 2012.


Hypothesis Testing in Machine Learning

The process of hypothesis testing is to draw inferences or conclusions about the overall population or data by conducting statistical tests on a sample. The same kind of inferences can be drawn for different machine learning models through the T-test, which I will discuss in this tutorial.

To draw such inferences, we have to make some assumptions, which lead to two terms used in hypothesis testing.

Null hypothesis: the assumption that there is no anomalous pattern or real effect; any observed difference is assumed to be due to chance.

Alternative hypothesis: contrary to the null hypothesis, it states that the observation is the result of a real effect.

This can also be seen as the evidence against the null hypothesis, judged at some level of significance; in machine learning terms, it is the significance of the predictors with respect to the target.

Generally, we set the level of significance at 5%, although in some cases this choice is up for discussion. If you have strong prior knowledge about your data, you can choose a different level of significance.

Conversely, if the p-value of an independent variable in a machine learning model is less than 0.05, the variable is considered significant, meaning its behavior varies with the target in a way that is useful and can be learned by the machine learning algorithm.

The steps involved in hypothesis testing are as follows:

Assume a null hypothesis; in machine learning settings, we usually assume that there is no relationship between the target and the independent variable.

Collect a sample

Calculate the test statistic

Decide either to accept or reject the null hypothesis

Calculating test or T statistics

To calculate the T statistic, let’s create a scenario.

Suppose there is a shipping-container manufacturing company that claims each container weighs exactly 1000 kg, no less and no more. Such a claim looks shady, so we proceed by gathering data and creating a sample.

After gathering a sample of 30 containers, we find that the average weight of a container is 990 kg, with a standard deviation of 12.5 kg.

So calculating test statistics:

T = (Mean - Claim)/ (Standard deviation / Sample Size^(1/2))

Plugging in the numbers gives T = -4.3818.

Next, we look up the critical t value for a 0.05 level of significance and the appropriate degrees of freedom.

Note: Degree of Freedom = Sample Size - 1

From the T table, the critical value is -1.699.

Comparing the two, our test statistic (-4.3818) is smaller than the critical value (-1.699), so it falls in the rejection region, and we can reject the claim made by the company.

You can calculate the critical t value using the stats.t.ppf() function from the stats module of the SciPy library.
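For instance, a minimal sketch of this calculation with SciPy (the numbers are taken from the example above):

```python
import math
from scipy import stats

sample_mean = 990.0    # average weight of the 30 sampled containers (kg)
claimed_mean = 1000.0  # the company's claimed weight (kg)
sample_std = 12.5      # sample standard deviation (kg)
n = 30                 # sample size

# Test statistic: T = (mean - claim) / (std / sqrt(n))
t_stat = (sample_mean - claimed_mean) / (sample_std / math.sqrt(n))

# Critical value for a 5% significance level with n - 1 degrees of freedom (lower tail)
t_crit = stats.t.ppf(0.05, df=n - 1)

print(f"T statistic: {t_stat:.4f}")   # approximately -4.3818
print(f"Critical t : {t_crit:.4f}")   # approximately -1.699

if t_stat < t_crit:
    print("Reject the null hypothesis (the company's claim).")
else:
    print("Fail to reject the null hypothesis.")
```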

Hypothesis testing is done on a sample of data rather than the entire population, because the whole population is usually unavailable. Since inferences are drawn from sample data, hypothesis testing can lead to errors, which can be classified into two types:

Type I Error: In this error, we reject the null hypothesis when it is true.

Type II Error: In this error, we accept the null hypothesis when it is false.

Other Approaches

Many other approaches to hypothesis testing between models exist, such as building two models on the features available to us: one model comprising all the features and the other with one feature removed. This way we can test the significance of individual features. However, feature inter-dependency affects such simple methods.

In regression problems, we generally follow the p-value rule: features that violate the significance level are removed, thus iteratively improving the model.

Different approaches exist for each algorithm to test hypotheses about different features.



Title: Hypothesis Spaces for Deep Learning

Abstract: This paper introduces a hypothesis space for deep learning that employs deep neural networks (DNNs). By treating a DNN as a function of two variables, the physical variable and parameter variable, we consider the primitive set of the DNNs for the parameter variable located in a set of the weight matrices and biases determined by a prescribed depth and widths of the DNNs. We then complete the linear span of the primitive DNN set in a weak* topology to construct a Banach space of functions of the physical variable. We prove that the Banach space so constructed is a reproducing kernel Banach space (RKBS) and construct its reproducing kernel. We investigate two learning models, regularized learning and minimum interpolation problem in the resulting RKBS, by establishing representer theorems for solutions of the learning models. The representer theorems unfold that solutions of these learning models can be expressed as linear combination of a finite number of kernel sessions determined by given data and the reproducing kernel.
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Functional Analysis (math.FA)



Deciphering Complexity: Emulating the Riemann Hypothesis in Machine Learning's Feature Space Analysis

Everton Gomede, PhD (The Modern Scientist)

Context : The Riemann Hypothesis influences complex pattern analysis, which applies to machine learning's feature space analysis. Problem : Linear models often struggle with non-linear and complex data structures. Approach : An SVM with an RBF kernel was used to classify a synthetic, non-linearly separable dataset, reflecting the analytical depth of the Riemann Hypothesis. Results : The SVM achieved 100% accuracy, demonstrating effective non-linear separation. Conclusions : This success illustrates the potential of machine learning to handle complex patterns, mirroring the analytical challenges of the Riemann Hypothesis.

Keywords : Riemann Hypothesis; Feature Space Analysis; Machine Learning; SVM (Support Vector Machine); Non-linear Classification.

Introduction

The Riemann Hypothesis, a cornerstone of number theory, has intrigued mathematicians since its inception by Bernhard Riemann in 1859. It posits that all non-trivial zeros of the Riemann zeta function have a real part equal to 1/2. While ostensibly a mathematical conjecture, its underlying principles and analytical techniques have far-reaching implications, notably in machine learning, particularly feature space analysis. This essay explores how the Riemann Hypothesis informs and enriches…


Exploring Supervised Machine Learning Algorithms

While machine learning sounds highly technical, an introduction to the statistical methods involved quickly brings it within reach. In this article, Toptal Freelance Software Engineer Vladyslav Millier explores basic supervised machine learning algorithms and scikit-learn, using them to predict survival rates for Titanic passengers.


By Vlad Miller

Vlad is a versatile software engineer with experience in many fields. He is currently perfecting his Scala and machine learning skills.

The main goal of this reading is to understand enough statistical methodology to be able to leverage the machine learning algorithms in Python’s scikit-learn library and then apply this knowledge to solve a classic machine learning problem.

The first stop of our journey will take us through a brief history of machine learning. Then we will dive into different algorithms. On our final stop, we will use what we learned to solve the Titanic Survival Rate Prediction Problem .

Some disclaimers:

  • I am a full-stack software engineer, not a machine learning algorithm expert.
  • I assume you know some basic Python .
  • This is exploratory, so not every detail is explained like it would be in a tutorial .

With that noted, let’s dive in!

A Quick Introduction to Machine Learning Algorithms

As soon as you venture into this field, you realize that machine learning is less romantic than you may think. Initially, I was full of hopes that after I learned more I would be able to construct my own Jarvis AI, which would spend all day coding software and making money for me, so I could spend whole days outdoors reading books, driving a motorcycle, and enjoying a reckless lifestyle while my personal Jarvis makes my pockets deeper. However, I soon realized that the foundation of machine learning algorithms is statistics, which I personally find dull and uninteresting. Fortunately, it did turn out that “dull” statistics have some very fascinating applications.

You will soon discover that to get to those fascinating applications, you need to understand statistics very well. One of the goals of machine learning algorithms is to find statistical dependencies in supplied data.

The supplied data could be anything from checking blood pressure against age to finding handwritten text based on the color of various pixels.

That said, I was curious to see if I could use machine learning algorithms to find dependencies in cryptographic hash functions (SHA, MD5, etc.)—however, you can’t really do that because proper crypto primitives are constructed in such a way that they eliminate dependencies and produce significantly hard-to-predict output. I believe that, given an infinite amount of time, machine learning algorithms could crack any crypto model.

Unfortunately, we don’t have that much time, so we need to find another way to efficiently mine cryptocurrency. How far have we gotten up until now?

A Brief History of Machine Learning Algorithms

The roots of machine learning algorithms come from Thomas Bayes, an English statistician who lived in the 18th century. His paper An Essay Towards Solving a Problem in the Doctrine of Chances underpins Bayes’ Theorem , which is widely applied in the field of statistics.

In the 19th century, Pierre-Simon Laplace published Théorie analytique des probabilités , expanding on the work of Bayes and defining what we know of today as Bayes’ Theorem. Shortly before that, Adrien-Marie Legendre had described the “least squares” method, also widely used today in supervised learning.

The 20th century is the period when the majority of publicly known discoveries have been made in this field. Andrey Markov invented Markov chains, which he used to analyze poems. Alan Turing proposed a learning machine that could become artificially intelligent, basically foreshadowing genetic algorithms. Frank Rosenblatt invented the Perceptron , sparking huge excitement and great coverage in the media.

But then the 1970s saw a lot of pessimism around the idea of AI—and thus, reduced funding—so this period is called an AI winter . The rediscovery of backpropagation in the 1980s caused a resurgence in machine learning research. And today, it’s a hot topic once again.

The late Leo Breiman distinguished between two statistical modeling paradigms: Data modeling and algorithmic modeling. “Algorithmic modeling” means more or less the machine learning algorithms like the random forest .

Machine learning and statistics are closely related fields. According to Michael I. Jordan , the ideas of machine learning, from methodological principles to theoretical tools, have had a long prehistory in statistics. He also suggested data science as a placeholder term for the overall problem that machine learning specialists and statisticians are both implicitly working on.

Categories of Machine Learning Algorithms

The machine learning field stands on two main pillars called supervised learning and unsupervised learning . Some people also consider a new field of study— deep learning —to be separate from the question of supervised vs. unsupervised learning.

Supervised learning is when a computer is presented with examples of inputs and their desired outputs. The goal of the computer is to learn a general formula which maps inputs to outputs. This can be further broken down into:

  • Semi-supervised learning , which is when the computer is given an incomplete training set with some outputs missing
  • Active learning , which is when the computer can only obtain training labels for a very limited set of instances. When used interactively, their training sets can be presented to the user for labeling.
  • Reinforcement learning , which is when the training data is only given as feedback to the program’s actions in the dynamic environment, such as driving a vehicle or playing a game against an opponent

In contrast, unsupervised learning is when no labels are given at all and it’s up to the algorithm to find the structure in its input. Unsupervised learning can be a goal in itself when we only need to discover hidden patterns.

Deep learning is a new field of study which is inspired by the structure and function of the human brain and based on artificial neural networks rather than just statistical concepts. Deep learning can be used in both supervised and unsupervised approaches.

In this article, we will only go through some of the simpler supervised machine learning algorithms and use them to calculate the survival chances of an individual in the tragic sinking of the Titanic. But in general, if you’re not sure which algorithm to use, a nice place to start is scikit-learn’s machine learning algorithm cheat-sheet .

Basic Supervised Machine Learning Models

Perhaps the easiest possible algorithm is linear regression. Sometimes this can be graphically represented as a straight line, but despite its name, if there’s a polynomial hypothesis, this line could instead be a curve. Either way, it models the relationship between a scalar dependent variable $y$ and one or more explanatory variables denoted by $x$.

In layperson’s terms, this means that linear regression is the algorithm which learns the dependency between each known $x$ and $y$, such that later we can use it to predict $y$ for an unknown sample of $x$.

In our first supervised learning example, we will use a basic linear regression model to predict a person’s blood pressure given their age. This is a very simple dataset with two meaningful features: Age and blood pressure.

As already mentioned above, most machine learning algorithms work by finding a statistical dependency in the data provided to them. This dependency is called a hypothesis and is usually denoted by $h(\theta)$.

To figure out the hypothesis, let’s start by loading and exploring the data.


A linear hypothesis shown on an age vs blood pressure graph.

On the chart above, every blue dot represents our data sample and the blue line is the hypothesis which our algorithm needs to learn. So what exactly is this hypothesis anyway?

In order to solve this problem, we need to learn the dependency between $x$ and $y$, which is denoted by $y = f(x)$. Therefore $f(x)$ is the ideal target function. The machine learning algorithm will try to guess the hypothesis function $h(x)$ that is the closest approximation of the unknown $f(x)$.

The simplest possible form of hypothesis for the linear regression problem looks like this: $h_\theta(x) = \theta_0 + \theta_1 * x$. We have a single input scalar variable $x$ which outputs a single scalar variable $y$, where $\theta_0$ and $\theta_1$ are parameters which we need to learn. The process of fitting this blue line in the data is called linear regression. It is important to understand that we have only one input parameter $x_1$; however, a lot of hypothesis functions will also include the bias unit ($x_0$). So our resulting hypothesis has a form of $h_\theta(x) = \theta_0 * x_0 + \theta_1 * x_1$. But we can avoid writing $x_0$ because it’s almost always equal to 1.

Getting back to the blue line. Our hypothesis looks like $h(x) = 84 + 1.24x$, which means that $\theta_0 = 84$ and $\theta_1 = 1.24$. How can we automatically derive those $\theta$ values?

We need to define a cost function . Essentially, what the cost function does is simply calculate the mean squared error between the model’s prediction and the actual output.

For example, our hypothesis predicts that for someone who is 48 years old, their blood pressure should be $h(48) = 84 + 1.24 * 48 = 143mmHg$; however, in our training sample, we have the value of $130 mmHg$. Therefore the error is $(143 - 130)^2 = 169$. Now we need to calculate this error for every single entry in our training dataset, then sum it together ($\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})^2$) and take the mean value out of that.

This gives us a single scalar number which represents the cost of the function. Our goal is to find $\theta$ values such that the cost function is the lowest; in the other words, we want to minimize the cost function. This will hopefully seem intuitive: If we have a small cost function value, this means that the error of prediction is small as well.

J(Theta) = 1901.95
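As an illustration, here is a minimal sketch of such a cost function in Python. The ages and pressures below are made-up stand-ins for the article’s dataset, so this will not reproduce the J(Theta) value shown above:

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """Mean squared error between h(x) = theta0 + theta1 * x and the targets y."""
    predictions = theta0 + theta1 * x
    return np.mean((predictions - y) ** 2)

# Hypothetical age / blood-pressure pairs, for illustration only
ages = np.array([39, 47, 45, 47, 65, 46, 67, 42, 67, 56], dtype=float)
pressures = np.array([144, 130, 138, 145, 162, 142, 170, 124, 158, 154], dtype=float)

print(cost(84.0, 1.24, ages, pressures))
```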

Now, we need to find such values of $\theta$ such that our cost function value is minimal. But how do we do that?

There are several possible algorithms, but the most popular is gradient descent . In order to understand the intuition behind the gradient descent method, let’s first plot it on the graph. For the sake of simplicity, we will assume a simpler hypothesis $h(\theta) = \theta_1 * x$. Next, we will plot a simple 2D chart where $x$ is the value of $\theta$ and $y$ is the cost function at this point.

A convex cost function.

The cost function is convex, which means that on the interval $[a, b]$ there is only one minimum. Which again means that the best $\theta$ parameters are at the point where the cost function is minimal.

Basically, gradient descent is an algorithm that tries to find the set of parameters which minimize the function. It starts with an initial set of parameters and iteratively takes steps in the negative direction of the function gradient.

Finding the minimum for a cost function.

If we calculate the derivative of the cost function at a specific point, this will give us the slope of the tangent line to the curve at that point. This means that we can calculate the slope at every single point on the graph.

The way the algorithm works is this:

  • We choose a random starting point (random $\theta$).
  • Calculate the derivative of the cost function at this point.
  • Take a small step in the direction opposite to the gradient: $\theta_j := \theta_j - \lambda \frac{\partial}{\partial \theta_j} J(\theta)$.
  • Repeat steps 2-3 until we converge.

Now, the convergence condition depends on the implementation of the algorithm. We may stop after 50 steps, after some threshold, or anything else.

The local minimum occurs at 4.712194
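As a generic illustration (using a toy cost $J(\theta) = (\theta - 3)^2$ and an arbitrary starting point, so the resulting minimum differs from the value printed above), gradient descent can be sketched as:

```python
def gradient_descent(grad, start, learning_rate=0.01, steps=1000):
    """Iteratively step against the gradient, starting from `start`."""
    theta = start
    for _ in range(steps):
        theta -= learning_rate * grad(theta)
    return theta

# Toy example: minimize J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3)
minimum = gradient_descent(grad=lambda t: 2 * (t - 3), start=0.0)
print(f"The local minimum occurs at {minimum:.6f}")  # approximately 3.0 for this toy cost
```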

We will not implement those algorithms in this article. Instead, we will utilize the widely adopted scikit-learn , an open-source Python machine learning library. It provides a lot of very useful APIs for different data mining and machine learning problems.

A learned linear hypothesis on the blood pressure vs. age graph


Types of Statistical Data

When working with data for machine learning problems, it is important to recognize different types of data. We may have numerical (continuous or discrete), categorical, or ordinal data.

Numerical data has meaning as a measurement. For example, age, weight, number of bitcoins that a person owns, or how many articles the person can write per month. Numerical data can be further broken down into discrete and continuous types.

  • Discrete data represent data that can be counted with whole numbers, e.g., number of rooms in an apartment or number of coin flips.
  • Continuous data can’t necessarily be represented with whole numbers. For example, if you’re measuring the distance you can jump, it may be 2 meters, or 1.5 meters, or 1.652245 meters.

Categorical data represent values such as person’s gender, marital status, country, etc. This data can take numerical value, but those numbers have no mathematical meaning. You cannot add them together.

Ordinal data can be a mix of the other two types, in that categories may be numbered in a mathematically meaningful way. A common example is ratings: Often we are asked to rate things on a scale of one to ten, and only whole numbers are allowed. While we can use this numerically—e.g., to find an average rating for something—we often treat the data as if it were categorical when it comes to applying machine learning methods to it.

Logistic Regression

Linear regression is an awesome algorithm which helps us to predict numerical values, e.g., the price of the house with the specific size and number of rooms. However, sometimes, we may also want to predict categorical data, to get answers to questions like:

  • Is this a dog or a cat?
  • Is this tumor malignant or benign?
  • Is this wine good or bad?
  • Is this email spam or not?
  • Which number is in the picture?
  • Which category does this email belong to?

All these questions are specific to the classification problem . And the simplest classification algorithm is called logistic regression , which is essentially the same as linear regression except that it has a different hypothesis.

First of all, we can reuse the same linear hypothesis $h_\theta(x) = \theta^T X$ (this is in vectorized form). Whereas linear regression may output any real number, logistic regression outputs values in $[0, 1]$, which represent the probability of the object falling into a given category.

Using a sigmoid function , we can convert any numerical value to a value on the interval $[0, 1]$.

Now, instead of $x$, we need to pass an existing hypothesis and therefore we will get:
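Written out (this is the standard logistic form the text is describing), the hypothesis becomes:

\[
h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}
\]

where $g$ is the sigmoid function.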

After that, we can apply a simple threshold: if the hypothesis outputs a value greater than 0.5 (equivalently, if $\theta^T x > 0$), we predict the positive class, otherwise the negative class.

This means that we can use the same cost function and the same gradient descent algorithm to learn a hypothesis for logistic regression.

In our next machine learning algorithm example, we will advise the pilots of the space shuttle whether or not they should use automatic or manual landing control. We have a very small dataset —15 samples—which consists of six features and the ground truth .

In machine learning algorithms, the term “ ground truth ” refers to the accuracy of the training set’s classification for supervised learning techniques.

Our dataset is complete, meaning that there are no missing features; however, some of the features have a “*” instead of the category, which means that this feature does not matter. We will replace all such asterisks with zeroes.

Score of our model is 73.33%

Validation?

In the previous example, we validated the performance of our model using the training data. However, is this really a good option, given that our algorithm can either underfit or overfit the data? Let’s take a look at a simpler example where we have one feature which represents the size of a house and another which represents its price.

The same data modeled by first-, fourth-, and 30th-degree polynomials, to demonstrate underfitting and overfitting.

The machine learning algorithm model is underfitting if it can generalize neither the training data nor new observations. In the example above, we use a simple linear hypothesis which does not really represent the actual training dataset and will have very poor performance. Usually, underfitting is not discussed as it can be easily detected given a good metric.

If our algorithm remembers every single observation it was shown, then it will have poor performance on new observations outside of the training dataset. This is called overfitting . For example, a 30th-degree polynomial model passes through most of the points and has a very good score on the training set, but anything outside of that would perform badly.

Our dataset consists of one feature and is simple to plot in 2D space; however, in real life, we may have datasets with hundreds of features, which makes them impossible to plot visually in Euclidean space. What other options do we have in order to see if the model is underfitting or overfitting?

It’s time to introduce you to the concept of the learning curve . This is a simple graph that plots the mean squared error over the number of training samples.

In learning materials you will usually see graphs similar to these:

Theoretical learning curve variations based on polynomial degree.

However, in real life, you may not get such a perfect picture. Let’s plot the learning curve for each of our models.

Training scores vs test scores for three graphs with data modeled by first-, fourth-, and 30th-degree polynomials.

In our simulated scenario, the blue line, which represents the training score, seems like a straight line. In reality, it still slightly decreases—you can actually see this in the first-degree polynomial graph, but in the others it’s too subtle to tell at this resolution. We at least clearly see that there is a huge gap between learning curves for training and test observations with a “high bias” scenario.

On the “normal” learning rate graph in the middle, you can see how training score and test score lines come together.

And on the “high variance” graph, you can see that with a low number of samples, the test and training scores are very similar; however, when you increase the number of samples, the training score remains almost perfect while the test score grows away from it.

We can fix underfitting models (also called models with high bias ) if we use a non-linear hypothesis, e.g., the hypothesis with more polynomial features.

Our overfitting model ( high variance ) passes through every single example it is shown; however, when we introduce test data, the gap between learning curves widens. We can use regularization, cross-validation, and more data samples to fix overfitting models.

Cross-validation

One of the common practices to avoid overfitting is to hold onto part of the available data and use it as a test set. However, when evaluating different model settings, such as the number of polynomial features, we are still at risk of overfitting the test set because parameters can be tweaked to achieve the optimal estimator performance and, because of that, our knowledge about the test set can leak into the model. To solve this problem, we need to hold onto one more part of the dataset, which is called the “validation set.” Training proceeds on the training set and, when we think that we’ve achieved the optimal model performance, we can make a final evaluation utilizing the validation set.

However, by partitioning the available data into three sets, we dramatically reduce the number of samples that can be used for training the models, and the results can depend on a particular random choice for the training-validation pair of sets.

One solution to this problem is a procedure called cross-validation. In standard $k$-fold cross-validation, we partition the data into $k$ subsets, called folds. Then, we iteratively train the algorithm on $k-1$ folds while using the remaining fold as the test set (called the “holdout fold”).

A grid demonstrating the position of holdout folds in k-fold cross-validation.

Cross-validation allows you to tune parameters with only your original training set. This allows you to keep your test set as a truly unseen dataset for selecting your final model.
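For instance, a minimal sketch with scikit-learn (the dataset here is synthetic, and the model choice is arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=6, random_state=0)
model = LogisticRegression(max_iter=1000)

# 5-fold CV: train on 4 folds, score on the held-out fold, repeat 5 times
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.min(), scores.max())
```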

There are a lot more cross-validation techniques, like leave P out , stratified $k$-fold , shuffle and split , etc. but they’re beyond the scope of this article.

Regularization

This is another technique that can help solve the issue of model overfitting. Most of the datasets have a pattern and some noise. The goal of the regularization is to reduce the influence of the noise on the model.

A graph juxtaposing an original function and its regularized counterpart.

There are three main regularization techniques: Lasso, Tikhonov, and elastic net.

L1 regularization (or Lasso regularization ) will select some features to shrink to zero, such that they will not play any role in the final model. L1 can be seen as a method to select important features.

L2 regularization (or Tikhonov regularization ) will force all features to be relatively small, such that they will provide less influence on the model.

Elastic net is the combination of L1 and L2.
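A minimal scikit-learn sketch of the three flavors (the synthetic dataset, alpha, and l1_ratio values are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=10, noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)                      # L1: drives some coefficients to exactly zero
ridge = Ridge(alpha=1.0).fit(X, y)                      # L2 (Tikhonov): shrinks all coefficients
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)    # mix of L1 and L2

print(sum(c == 0 for c in lasso.coef_), "coefficients zeroed by Lasso")
```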

Normalization (Feature Scaling)

Feature scaling is also an important step while preprocessing the data. Our dataset may have features with values $[-\infty, \infty]$ and other features with a different scale. This is a method to standardize the ranges of independent values.

Feature scaling is also an important process to improve the performance of the learning models. First of all, gradient descent will converge much faster if all of the features are scaled to the same norm. Also, a lot of algorithms—for example, support vector machines (SVM)—work by calculating the distance between two points and if one of the features has broad values, then the distance will be highly influenced by this feature.

Support Vector Machines

SVM is yet another broadly popular machine learning algorithm which can be used for classification and regression problems. In SVM, we plot each observation as a point in $n$-dimensional space where $n$ is the number of features we have. The value of each feature is the value of particular coordinates. Then, we try to find a hyperplane that separates two classes well enough.

A graph showing a hyperplane separating two classes of data points, with some of their support vectors illustrated as well.

After we identify the best hyperplane, we want to add margins, which would separate the two classes further.

A graph showing a hyperplane with margins.

SVM is very effective when the number of features is very high, or when the number of features is larger than the number of data samples. However, since SVM operates on a vector basis, it is crucial to normalize the data prior to use.
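A minimal sketch of this advice with scikit-learn (synthetic data; the kernel and C value are illustrative defaults):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling and the classifier live in one pipeline, so the scaler is fit
# on the training data only and reused on the test data.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```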

Neural Networks

Neural network algorithms are probably the most exciting field of machine learning studies. Neural networks try to mimic how the brain’s neurons are connected together.

An illustration of a neural network, showing various inputs mapped to temporary values, which are in turn mapped to a single output.

This is how a neural network looks. We combine a lot of nodes together, where each node takes a set of inputs, applies some calculations to them, and outputs a value.

There are a huge variety of neural network algorithms for both supervised and unsupervised learning. Neural networks can be used to drive autonomous cars, play games, land airplanes, classify images, and more.

The Infamous Titanic

The RMS Titanic was a British passenger liner that sank in the North Atlantic Ocean on April 15th, 1912 after it collided with an iceberg. There were about 2,224 crew and passengers, and more than 1,500 died, making it one of the deadliest commercial maritime disasters of all time.

Now, since we understand the intuition behind the most basic machine learning algorithms used for classification problems, we can apply our knowledge to predict the survival outcome for those on board the Titanic.

Our dataset will be borrowed from the Kaggle data science competitions platform .

PassengerId | Survived | Pclass | Name                                               | Sex    | Age  | SibSp | Parch | Ticket           | Fare    | Cabin | Embarked
1           | 0        | 3      | Braund, Mr. Owen Harris                            | male   | 22.0 | 1     | 0     | A/5 21171        | 7.2500  | NaN   | S
2           | 1        | 1      | Cumings, Mrs. John Bradley (Florence Briggs Th...) | female | 38.0 | 1     | 0     | PC 17599         | 71.2833 | C85   | C
3           | 1        | 3      | Heikkinen, Miss. Laina                             | female | 26.0 | 0     | 0     | STON/O2. 3101282 | 7.9250  | NaN   | S
4           | 1        | 1      | Futrelle, Mrs. Jacques Heath (Lily May Peel)       | female | 35.0 | 1     | 0     | 113803           | 53.1000 | C123  | S
5           | 0        | 3      | Allen, Mr. William Henry                           | male   | 35.0 | 0     | 0     | 373450           | 8.0500  | NaN   | S

Our first step will be to load and explore the data. We have 891 records; each record has the following structure:

  • passengerId – ID of the passenger on board
  • survival – Whether or not the person survived the crash
  • pclass – Ticket class, e.g., 1st, 2nd, 3rd
  • gender – Gender of the passenger: Male or female
  • name – Title included
  • age – Age in years
  • sibsp – Number of siblings/spouses aboard the Titanic
  • parch – Number of parents/children aboard the Titanic
  • ticket – Ticket number
  • fare – Passenger fare
  • cabin – Cabin number
  • embarked – Port of embarkation

This dataset contains both numerical and categorical data. Usually, it is a good idea to dive deeper into the data and, based on that, come up with assumptions. However, in this case, we will skip this step and go straight to predictions.

Title \ Sex | female | male
Capt        | 0      | 1
Col         | 0      | 2
Countess    | 1      | 0
Don         | 0      | 1
Dr          | 1      | 6
Jonkheer    | 0      | 1
Lady        | 1      | 0
Major       | 0      | 2
Master      | 0      | 40
Miss        | 182    | 0
Mlle        | 2      | 0
Mme         | 1      | 0
Mr          | 0      | 517
Mrs         | 125    | 0
Ms          | 1      | 0
Rev         | 0      | 6
Sir         | 0      | 1

Title  | Survived
Master | 0.575000
Miss   | 0.702703
Mr     | 0.156673
Mrs    | 0.793651
Other  | 0.347826

Survived | Pclass | Sex | Age  | SibSp | Parch | Fare    | Embarked | Title
0        | 3      | 0   | 22.0 | 1     | 0     | 7.2500  | 0        | 1
1        | 1      | 1   | 38.0 | 1     | 0     | 71.2833 | 1        | 3
1        | 3      | 1   | 26.0 | 0     | 0     | 7.9250  | 0        | 2
1        | 1      | 1   | 35.0 | 1     | 0     | 53.1000 | 0        | 3
0        | 3      | 0   | 35.0 | 0     | 0     | 8.0500  | 0        | 1

At this point, we will rank different types of machine learning algorithms in Python by using scikit-learn to create a set of different models. It will then be easy to see which one performs the best.

  • Logistic regression with varying numbers of polynomials
  • Support vector machine with a linear kernel
  • Support vector machine with a polynomial kernel
  • Neural network

For every single model, we will use $k$-fold validation.

Name                                           | Min Score | Max Score | Mean Score
Linear SVM                                     | 0.793296  | 0.821229  | 0.803352
Logistic Regression                            | 0.826816  | 0.860335  | 0.846927
Neural Net                                     | 0.826816  | 0.860335  | 0.849162
Logistic Regression with Polynomial Hypotheses | 0.854749  | 0.882682  | 0.869274
RBF SVM                                        | 0.843575  | 0.888268  | 0.869274

Ok, so our experimental research says that the SVM classifier with a radial basis function (RBF) kernel performs the best. Now, we can serialize our model and re-use it in production applications.

Machine learning is not complicated, but it’s a very broad field of study, and it requires knowledge of math and statistics in order to grasp all of its concepts.

Right now, machine learning and deep learning are among the hottest topics of discussion in Silicon Valley, and are the bread and butter of almost every data science company , mainly because they can automate many repetitive tasks including speech recognition, driving vehicles, financial trading, caring for patients , cooking , marketing , and so on.

Now you can take this knowledge and solve challenges on Kaggle.

This was a very brief introduction to supervised machine learning algorithms. Luckily, there are a lot of online courses and information about machine learning algorithms. I personally would recommend starting with Andrew Ng’s course on Coursera.

  • Andrew Ng’s course on Coursera
  • Kaggle datasets
  • A deep learning reading list
  • A list of free books on machine learning algorithms, data mining, deep learning, and related topics
  • An Introduction to Machine Learning Theory and Its Applications: A Visual Tutorial with Examples

Understanding the basics

How does machine learning work?

Machine learning algorithms form models automatically using statistical analysis, in contrast to traditional, hard-coded algorithms. This allows them to evolve over time as they look for patterns in data and make predictions as to their classification.

What can machine learning be used for?

The applications of machine learning are almost limitless. It can be used for everything from simple weather prediction and data clustering to complex feature learning; autonomous driving and flying; image, speech, and video recognition; search and recommendation engines; patient diagnosis; and more.

What's the difference between supervised and unsupervised classification?

Supervised classification needs labels for training data: One picture is a cat, the other is a dog. Unsupervised classification is where the algorithm finds common traits and separates data itself. It will not explicitly tell us that the image is a cat, but it will be able to separate cats from dogs.

What is supervised learning vs. unsupervised learning?

Supervised learning is where you explicitly tell the algorithm what the right answer is, so the algorithm can learn and can predict the answer for previously unseen data. Unsupervised learning is where the algorithm has to figure out the answer on its own.

Where can I learn machine learning techniques?

The best place to start learning about machine learning is to watch Andrew Ng’s course on Coursera, linked in the resources at the end of the article. From there, start taking challenges on Kaggle to develop better intuition about different frameworks and approaches.

How can I tell which machine learning algorithm to use?

There are a lot of factors to consider when choosing the right algorithm: the size of the dataset, the nature of the data, speed vs accuracy, etc. Until you develop your own intuition, you can use existing cheatsheets like the one scikit-learn provides.



Finding a Maximally Specific Hypothesis: The List-Then-Eliminate Algorithm | Version Space.

The LIST-THEN-ELIMINATE technique can, in principle, be used whenever the hypothesis space H is finite. It provides a number of advantages, including the assurance that all output hypotheses will be consistent with the training data.

In this blog, we’ll have a look at the working and need of the list-then-eliminate algorithm and also understand the concept of version space.

To understand it from scratch let’s have a look at all the terminologies involved, 

Hypothesis: 

It is usually represented with an ‘h’. In supervised machine learning, a hypothesis is a function that best characterizes the target. 

For example, consider a coordinate plane showing the output as positive or negative for a given task. 

The Hypothesis Space is made up of all of the legal ways in which we might partition the coordinate plane to anticipate the outcome of the test data.

Each conceivable partition, represented by a gray line, is referred to as a hypothesis.

Specific Hypothesis: 

A hypothesis h is maximally specific if it is consistent with the training examples (it covers the positive examples and none of the negative ones) and there is no other consistent hypothesis h′ that is strictly more specific than h.

The specific hypothesis fills in important details about the variables given in the hypothesis.

Before understanding version space, let’s first have a look at the formal definition for ‘Consistent’ => 

A hypothesis h is consistent with a collection of training examples D if and only if h(x) = c(x) for each example (x, c(x)) in D. 

Let’s take our EnjoySport example yet again, to better understand what a consistent hypothesis means:

Sky   | AirTemp | Humidity | Wind   | Water | Forecast | EnjoySport
Sunny | Warm    | Normal   | Strong | Warm  | Same     | Yes
Sunny | Warm    | High     | Strong | Warm  | Same     | Yes
Rainy | Cold    | High     | Strong | Warm  | Change   | No
Sunny | Warm    | High     | Strong | Cool  | Change   | Yes

Here, consider a hypothesis h1 = <?, ?, ?, Strong, ?, ?>. 

But h(x) != c(x) for training example (3). As a result, hypothesis h1 is not consistent with the training data set. 

Now consider another hypothesis, h2 = <?, Warm, ?, Strong, ?, ?>.

h(x) = c(x) is true in all of the training instances. As a result, hypothesis h2 is consistent with the training data set.

Version Space: 

With regard to hypothesis space H and training examples D, the version space, denoted VS_{H,D}, is the subset of hypotheses from H that are consistent with the training instances in D.

In the above example, we have two hypotheses from H that are consistent with the training dataset:

h1 = <Sunny, Warm, ?, Strong, ?, ?> and

h2 = <?, Warm, ?, Strong, ?, ?>

The set of all hypotheses consistent with the data, including h1 and h2, is referred to as the version space.

List – Then – Eliminate: 

The LIST-THEN-ELIMINATE method first populates the version space with all hypotheses in H, then discards any hypothesis that contradicts any training example.

As more instances are observed, the version space of candidate hypotheses reduces, until ideally just one hypothesis exists that is compatible with all of the observed cases.

If there isn’t enough evidence to reduce the version space down to a single hypothesis, the algorithm can generate the whole set of hypotheses that are compatible with the data.

=> Algorithm:

1. Initialize the version space to contain all the hypotheses in H.

2. Remove inconsistent hypotheses from the version space: for each training instance <x, c(x)>, remove every hypothesis h for which h(x) ≠ c(x).

3. Output the hypotheses remaining in the version space after all the examples have been checked.

Let’s take another brief example to understand the working of list-then-eliminate. 

F1  F2  Target Class
A   X   Yes
A   Y   Yes

The initial version space (the full hypothesis space) for the above example is:

(A, X), (A, Y), (A, ?), (B, X), (B, Y), (B, ?), (?, X), (?, Y), (?, ?), (ø, ø).

Keep removing the hypotheses that are not consistent with the training examples.

This results in the version space: (A, ?), (?, ?).
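The three steps above can be sketched in a few lines of Python for this toy example. The helper names are illustrative, and the pair ("NONE", "NONE") stands in for the empty hypothesis ø.

    from itertools import product

    F1_VALUES, F2_VALUES = ("A", "B"), ("X", "Y")
    EXAMPLES = [(("A", "X"), "Yes"), (("A", "Y"), "Yes")]

    # Step 1: initialize the version space with every hypothesis in H.
    hypotheses = list(product(F1_VALUES + ("?",), F2_VALUES + ("?",))) + [("NONE", "NONE")]

    def predict(h, x):
        if "NONE" in h:  # the empty hypothesis classifies everything as negative
            return "No"
        return "Yes" if all(hi in ("?", xi) for hi, xi in zip(h, x)) else "No"

    # Step 2: eliminate every hypothesis that disagrees with some training example.
    version_space = [h for h in hypotheses
                     if all(predict(h, x) == label for x, label in EXAMPLES)]

    # Step 3: output what is left.
    print(version_space)  # [('A', '?'), ('?', '?')]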

Some of the key disadvantages of the List-Then-Eliminate algorithm are:

  • The hypothesis space must be finite.
  • Enumeration of all hypotheses is a time-consuming process.


The hypothesis is a common term in Machine Learning and data science projects. As we know, machine learning is one of the most powerful technologies across the world, which helps us to predict results based on past experiences. Moreover, data scientists and ML professionals conduct experiments that aim to solve a problem. These ML professionals and data scientists make an initial assumption for the solution of the problem.

This assumption in Machine learning is known as Hypothesis. In Machine Learning, at various times, Hypothesis and Model are used interchangeably. However, a Hypothesis is an assumption made by scientists, whereas a model is a mathematical representation that is used to test the hypothesis. In this topic, "Hypothesis in Machine Learning," we will discuss a few important concepts related to a hypothesis in machine learning and their importance. So, let's start with a quick introduction to Hypothesis.

Simply put, a hypothesis is a guess based on some known facts that has not yet been proven. A good hypothesis is testable, so that it turns out to be either true or false.

Example: Let's understand the hypothesis with a common example. A scientist claims that ultraviolet (UV) light can damage the eyes, and we further assume that it may also cause blindness.

In this example, the scientist only claims that UV rays are harmful to the eyes, while we assume they may also cause blindness. This may or may not turn out to be true. Assumptions of this kind are called hypotheses.

The hypothesis is one of the commonly used concepts of statistics in Machine Learning. It is specifically used in Supervised Machine learning, where an ML model learns a function that best maps the input to corresponding outputs with the help of an available dataset.

There are some common ways to find a possible hypothesis from the hypothesis space, where the hypothesis space is represented by H and a hypothesis by h. These are defined as follows:

Hypothesis space (H): It is used by supervised machine learning algorithms to determine the best possible hypothesis that describes the target function, i.e., that best maps inputs to outputs. It is often constrained by the framing of the problem, the choice of model, and the choice of model configuration.

Hypothesis (h): It is primarily based on the data as well as the bias and restrictions applied to the data. Hence, a hypothesis (h) can be viewed as a single candidate function that maps inputs to proper outputs and can be evaluated as well as used to make predictions.

The hypothesis (h) can be formulated in machine learning as follows:

y = mx + c

Where,

y: range (the predicted output)

m: slope of the line dividing the data, i.e., the change in y divided by the change in x

x: domain (the input)

c: intercept (constant)
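As a tiny illustration, this linear hypothesis can be written directly in Python; the slope and intercept values below are made up purely for demonstration.

    def hypothesis(x, m=2.0, c=1.0):
        """Map an input x (domain) to a predicted output y (range) via y = m*x + c."""
        return m * x + c

    print(hypothesis(3.0))  # 2.0 * 3.0 + 1.0 = 7.0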

Example: Let's understand the hypothesis (h) and hypothesis space (H) with a two-dimensional coordinate plane showing the distribution of data as follows:

Hypothesis space (H) is the collection of all legal, possible ways to divide the coordinate plane so as to best map inputs to proper outputs.

Further, each individual best possible way is called a hypothesis (h). Hence, the hypothesis and hypothesis space would be like this:

Hypothesis in statistics: Similar to the hypothesis in machine learning, a statistical hypothesis is an assumption about the outcome. However, it is falsifiable, meaning it can fail in the presence of sufficient evidence.

Unlike in machine learning, we cannot simply accept a hypothesis in statistics, because it is only a conjectured result based on probability. Before starting work on an experiment, we must be aware of two important types of hypotheses:

Null hypothesis: A null hypothesis is a type of statistical hypothesis which states that there is no statistically significant effect in the given set of observations. It is also known as a conjecture and is used in quantitative analysis to test theories about markets, investment, and finance to decide whether an idea is true or false.

Alternative hypothesis: An alternative hypothesis is a direct contradiction of the null hypothesis, which means that if one of the two hypotheses is true, the other must be false. In other words, an alternative hypothesis is a type of statistical hypothesis which states that there is some significant effect in the given set of observations.

Significance level: The significance level is the primary thing that must be set before starting an experiment. It defines the tolerance for error, i.e., the level at which an effect is considered significant. In a typical experiment a 5% significance level (equivalently, a 95% confidence level) is used, so only results falling in that 5% region count as evidence against the null hypothesis. The significance level also determines the critical or threshold value: for example, if the confidence level is set to 98%, the critical threshold for the p-value is 0.02.

P-value: The p-value in statistics is defined as the evidence against the null hypothesis. In other words, the p-value is the probability, assuming the null hypothesis is true, that random chance would generate data as extreme as, or more extreme than, what was observed.

The smaller the p-value, the stronger the evidence against the null hypothesis, and the more justified we are in rejecting it. It is usually reported as a decimal, such as 0.035.

Whenever a statistical test is carried out on a sample from the population to obtain a p-value, the decision depends on the critical value. If the p-value is less than the critical value, the effect is considered significant and the null hypothesis can be rejected. If it is higher than the critical value, there is no significant effect and we fail to reject the null hypothesis.
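For instance, a one-sample t-test in SciPy follows exactly this decision rule. The sample values below are invented for illustration, and the sketch assumes SciPy is available.

    from scipy import stats

    sample = [4.9, 5.1, 5.0, 4.8, 5.3, 5.2, 4.7, 5.0]  # hypothetical measurements
    t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)

    alpha = 0.05  # significance threshold (critical value for the p-value)
    if p_value < alpha:
        print(f"p = {p_value:.3f} < {alpha}: reject the null hypothesis")
    else:
        print(f"p = {p_value:.3f} >= {alpha}: fail to reject the null hypothesis")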

In the task of mapping input instances to outputs in supervised machine learning, the hypothesis is a very useful concept that helps to approximate a target function. It appears across analytics domains and is also one of the important factors considered when deciding whether a change should be introduced. How well a hypothesis fits the entire training data set reflects the efficiency as well as the performance of the model.

Hence, in this topic, we have covered various important concepts related to the hypothesis in machine learning and statistics and some important parameters such as p-value, significance level, etc., to understand hypothesis concepts in a better way.










ML – Candidate Elimination Algorithm

The candidate elimination algorithm incrementally builds the version space given a hypothesis space H and a set E of examples. The examples are added one by one; each example possibly shrinks the version space by removing the hypotheses that are inconsistent with the example. The candidate elimination algorithm does this by updating the general and specific boundary for each new example. 

  • You can consider this an extended form of the Find-S algorithm.
  • It considers both positive and negative examples.
  • Positive examples are handled as in the Find-S algorithm: they are used to generalize the specific boundary.
  • Negative examples are handled in the opposite way: they are used to specialize the general boundary.

Terms Used:   

  • Concept learning: Concept learning is basically the learning task of the machine (learning a concept from training data).
  • General Hypothesis: a hypothesis that does not commit to specific feature values.
  • G = {'?', '?', '?', '?', …}: one '?' per attribute.
  • Specific Hypothesis: a hypothesis that specifies concrete feature values.
  • S = {'ϕ', 'ϕ', 'ϕ', …}: one null value 'ϕ' per attribute.
  • Version Space: It sits between the general and the specific hypothesis. Rather than a single hypothesis, it is the set of all possible hypotheses consistent with the training data set.

Consider a training dataset such as the EnjoySport examples shown earlier.

Algorithmic steps:

1. Initialize the general boundary G with the most general hypothesis and the specific boundary S with the most specific hypothesis.

2. For each positive example: remove from G any hypothesis inconsistent with the example, and minimally generalize S so that it covers the example.

3. For each negative example: remove from S any hypothesis inconsistent with the example, and minimally specialize the members of G so that they exclude the example.

4. After all examples have been processed, output the version space bounded by S and G.
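The following is a simplified Python sketch of these steps, run on the EnjoySport data used earlier. It keeps S as a single maximally specific hypothesis and G as a set of maximally general hypotheses; the helper names are illustrative rather than part of the original article, and some refinements of the full algorithm (such as pruning redundant members of G) are omitted.

    DATA = [
        (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), "Yes"),
        (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"), "Yes"),
        (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), "No"),
        (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), "Yes"),
    ]
    N = len(DATA[0][0])

    def covers(h, x):
        """A hypothesis covers x if every non-'?' attribute matches."""
        return all(hi in ("?", xi) for hi, xi in zip(h, x))

    # Initialize S to the most specific hypothesis and G to the most general one.
    S = ["0"] * N          # '0' stands for the null value (phi)
    G = [("?",) * N]

    for x, label in DATA:
        if label == "Yes":
            # Drop general hypotheses that fail to cover the positive example,
            # then minimally generalize S so that it covers the example.
            G = [g for g in G if covers(g, x)]
            S = [xi if si in ("0", xi) else "?" for si, xi in zip(S, x)]
        else:
            # Minimally specialize each g that wrongly covers the negative example,
            # keeping only specializations that still agree with S.
            new_G = []
            for g in G:
                if not covers(g, x):
                    new_G.append(g)
                    continue
                for i in range(N):
                    if g[i] == "?" and S[i] not in ("?", x[i]):
                        new_G.append(g[:i] + (S[i],) + g[i + 1:])
            G = new_G

    print("S =", tuple(S))  # ('Sunny', 'Warm', '?', 'Strong', '?', '?')
    print("G =", G)         # [('Sunny', '?', '?', '?', '?', '?'), ('?', 'Warm', '?', '?', '?', '?')]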

The Candidate Elimination Algorithm (CEA) is an improvement over the Find-S algorithm for classification tasks. While CEA shares some similarities with Find-S, it also has some essential differences that offer advantages and disadvantages. Here are some advantages and disadvantages of CEA in comparison with Find-S:

Advantages of CEA over Find-S:

  • Improved accuracy: CEA considers both positive and negative examples to generate the hypothesis, which can result in higher accuracy when dealing with noisy or incomplete data.
  • Flexibility: CEA can handle more complex classification tasks, such as those with multiple classes or non-linear decision boundaries.
  • More efficient: CEA reduces the number of hypotheses by generating a set of general hypotheses and then eliminating them one by one. This can result in faster processing and improved efficiency.
  • Better handling of continuous attributes: CEA can handle continuous attributes by creating boundaries for each attribute, which makes it more suitable for a wider range of datasets.

Disadvantages of CEA in comparison with Find-S:

  • More complex: CEA is a more complex algorithm than Find-S, which may make it more difficult for beginners or those without a strong background in machine learning to use and understand.
  • Higher memory requirements: CEA requires more memory to store the set of hypotheses and boundaries, which may make it less suitable for memory-constrained environments.
  • Slower processing for large datasets: CEA may become slower for larger datasets due to the increased number of hypotheses generated.
  • Higher potential for overfitting: The increased complexity of CEA may make it more prone to overfitting on the training data, especially if the dataset is small or has a high degree of noise.


IMAGES

  1. Hypothesis Space Search in Decision Tree Learning || Machine Learning || JNTU-K || CSE || ML

  2. Hypothesis Testing In Machine Learning While Using Python - Tutorial

  3. Hypothesis in Machine Learning

  4. Hypothesis testing in Machine learning using Python

  5. The hypothesis space is the set of all possible hypotheses (i.e. ...)

  6. Hypothesis in Machine Learning

COMMENTS

  1. Hypothesis in Machine Learning

    A hypothesis is a function that best describes the target in supervised machine learning. The hypothesis that an algorithm would come up with depends upon the data and also upon the restrictions and bias that we have imposed on the data. The hypothesis can be calculated as: y = mx + b. Where, y = range, m = slope of the line.

  2. What exactly is a hypothesis space in machine learning?

    To get a better idea: The input space in the above given example is 2^4, the number of possible inputs. The hypothesis space is 2^(2^4) = 65536 because for each possible input two outcomes (0 and 1) are possible. The ML algorithm helps us to find one function, sometimes also referred to as a hypothesis, from the ...

  3. What is a Hypothesis in Machine Learning?

    Supervised machine learning is often described as the problem of approximating a target function that maps inputs to outputs. This description is characterized as searching through and evaluating candidate hypothesis from hypothesis spaces. The discussion of hypotheses in machine learning can be confusing for a beginner, especially when "hypothesis" has a distinct, but related meaning […]

  4. Hypothesis Testing with Python: Step by step hands-on tutorial with

    It tests the null hypothesis that the population variances are equal (called homogeneity of variance or homoscedasticity). Suppose the resulting p-value of Levene's test is less than the significance level (typically 0.05). In that case, the obtained differences in sample variances are unlikely to have occurred based on random sampling from a population with equal variances.

  5. What's a Hypothesis Space?

    Our goal is to find a model that classifies objects as positive or negative. Applying Logistic Regression, we can get the models of the form: (1) which estimate the probability that the object at hand is positive. Each such model is called a hypothesis, while the set of all the hypotheses an algorithm can learn is known as its hypothesis space ...

  6. Hypothesis testing in Machine learning using Python

    Hypothesis testing is a statistical method that is used in making statistical decisions using experimental data. Hypothesis testing is basically an assumption that we make about a population parameter. Ex: you claim that the average age of students in a class is 40, or that boys are taller than girls.

  7. machine learning

    To calculate the Hypothesis Space: given the image above, we can figure it out the following way. Count the number of attributes or features. In this case, we have four features (4). Determine (or use, if given) the values corresponding to each feature (e.g. binary, or many different inputs).

  8. ID3 Algorithm and Hypothesis space in Decision Tree Learning

    Hypothesis Space Search by ID3: ID3 climbs the hill of knowledge acquisition by searching the space of feasible decision trees. It looks for all finite discrete-valued functions in the whole space. Every function is represented by at least one tree. It only holds one theory (unlike Candidate-Elimination).

  9. 17 Statistical Hypothesis Tests in Python (Cheat Sheet)

    In this post, you will discover a cheat sheet for the most popular statistical hypothesis tests for a machine learning project with examples using the Python API. Each statistical test is presented in a consistent way, including: The name of the test. What the test is checking. The key assumptions of the test. How the test result is interpreted.

  10. PDF STAT 451: Machine Learning Lecture Notes

    Decision trees can represent any Boolean (binary) function, and the hypothesis space being searched is the entire space of Boolean functions; however, we need to keep in mind that a critical challenge in machine learning is whether an algorithm can learn/find the "right" function or a good approximation within the subspace being searched.

  11. Machine Learning- General-To-Specific Ordering of Hypothesis

    Reference. General-To-Specific Ordering of Hypothesis. The theories can be sorted from the most specific to the most general. This will allow the machine learning algorithm to thoroughly investigate the hypothesis space without having to enumerate each and every hypothesis in it, which is impossible when the hypothesis space is infinitely vast.

  12. Machine Learning: The Basics

    A learning rate or step-size parameter used by gradient-based methods. h(): A hypothesis map that reads in features x of a data point and delivers a prediction ŷ = h(x) for its label y. H: A hypothesis space or model used by an ML method. The hypothesis space consists of different hypothesis maps h: X → Y between which the ML method has to choose.

  13. Find-S Algorithm In Machine Learning: Concept Learning

    In Machine Learning, concept learning can be termed as "a problem of searching through a predefined space of potential hypothesis for the hypothesis that best fits the training examples" - Tom Mitchell. In this article, we will go through one such concept learning algorithm known as the Find-S algorithm. If you want to go beyond this article and really want the level of expertise in you ...

  14. Machine Learning Theory

    Generalization Bound: 1st Attempt. In order for the entire hypothesis space to have a generalization gap bigger than ϵ, at least one of its hypotheses h_1, h_2, h_3, … must have such a gap. This can be expressed formally by stating that: P[ sup_{h ∈ H} |R(h) − R_emp(h)| > ϵ ] = P[ ⋃_{h ∈ H} { |R(h) − R_emp(h)| > ϵ } ]

  15. Hypothesis Testing in Machine Learning

    The steps involved in hypothesis testing are as follows: Assume a null hypothesis; usually in machine learning algorithms we consider that there is no anomaly between the target and the independent variable. Collect a sample. Calculate test statistics. Decide either to accept or reject the null hypothesis.

  16. [2403.03353] Hypothesis Spaces for Deep Learning

    This paper introduces a hypothesis space for deep learning that employs deep neural networks (DNNs). By treating a DNN as a function of two variables, the physical variable and parameter variable, we consider the primitive set of the DNNs for the parameter variable located in a set of the weight matrices and biases determined by a prescribed depth and widths of the DNNs. We then complete the ...

  17. Deciphering Complexity: Emulating the Riemann Hypothesis in Machine

    Creating a Python example that connects the Riemann Hypothesis directly with feature space analysis in machine learning, particularly with a synthetic dataset, is relatively abstract.

  18. Supervised Machine Learning Algorithms in Python

    The main goal of this reading is to understand enough statistical methodology to be able to leverage the machine learning algorithms in Python's scikit-learn library and then apply this knowledge to solve a classic machine learning problem. The first stop of our journey will take us through a brief history of machine learning.

  19. NPTEL :: Computer Science and Engineering

    Modules / Lectures. Lecture 02: Different Types of Learning. Lecture 03: Hypothesis Space and Inductive Bias. Lecture 04: Evaluation and Cross-Validation. Tutorial I. Lecture 32 : Neural Network and Backpropagation Algorithm. Lecture 36: Introduction to Computational Learning Theory. Lecture 37: Sample Complexity : Finite Hypothesis Space.

  20. Machine Learning- Finding a Maximally Specific Hypothesis: The List

    In supervised machine learning, a hypothesis is a function that best characterizes the target. ... Machine Learning - Tutorial ... Hypothesis space search; Machine Learning- Genetic programming; Machine Learning- GENETIC ALGORITHM: MODELS OF EVOLUTION ...

  21. ML

    The find-S algorithm is a basic concept learning algorithm in machine learning. The find-S algorithm finds the most specific hypothesis that fits all the positive examples. We have to note here that the algorithm considers only the positive training examples. The find-S algorithm starts with the most specific hypothesis and generalizes this ...

  22. Hypothesis in Machine Learning

    The hypothesis is one of the commonly used concepts of statistics in Machine Learning. It is specifically used in Supervised Machine learning, where an ML model learns a function that best maps the input to corresponding outputs with the help of an available dataset. In supervised learning techniques, the main aim is to determine the possible ...

  23. Sustainability

    Straw burning is recognized as a significant contributor to deteriorating air quality, but its specific impacts, particularly on PM2.5 concentrations, are still not fully understood or quantified. In this study, we conducted a detailed examination of the spatial and temporal patterns of straw burning in Heilongjiang Province, China—a key agricultural area—utilizing high-resolution fire ...

  24. ML

    The candidate elimination algorithm incrementally builds the version space given a hypothesis space H and a set E of examples. The examples are added one by one; each example possibly shrinks the version space by removing the hypotheses that are inconsistent with the example. The candidate elimination algorithm does this by updating the general ...