A correction to the previous post: instead of putting the projects in the box, I uploaded the full list of project ranks, accuracies, and grades. Most of the checked projects contain no additional information. For two groups, 5 points were deducted due to report clarity. If you want to discuss these results further, or to get your checked reports, you are welcome to contact Lior.

Enjoy,

Regev

]]>Final grades were submitted and should be updated soon. I've uploaded the full grades - exercises, project, exam, and final grade - listed by the last 4 digits of your ID number.

The checked project reports (with your relative ranking and actual accuracy/error) should be in Tomer's box.

Have a great semester B!

Regev

]]>Help.

]]>When will you publish a solution to Moed A?

]]>As far as I understand, odd entries in the 'positions' array contain the chromosome number, whereas even entries contain the nucleotide position on the chromosome. If this is the case, it only provides information for half of the rows in the dataset (and only for 8 out of the 23 chromosome pairs). Can you please explain where my mistake might be (or otherwise upload the rest of the 'positions' array too)?

Thanks!
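
In case it helps make the question concrete, here is a tiny Python sketch of the interleaved layout described above. The array contents are made up, and the layout itself is just my reading of the question, not the dataset's documented format.

```python
# Hypothetical sketch: if 'positions' interleaves chromosome numbers (odd
# entries, 1-based) with nucleotide positions (even entries), it can be
# split like this. Toy values only - not the real dataset.
positions = [1, 12345, 1, 67890, 2, 555, 3, 42]

chromosomes = positions[0::2]  # entries 1, 3, 5, ... (1-based odd)
nucleotides = positions[1::2]  # entries 2, 4, 6, ... (1-based even)

pairs = list(zip(chromosomes, nucleotides))
print(pairs)  # [(1, 12345), (1, 67890), (2, 555), (3, 42)]
```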

]]>Please take extra care to go through them, and if you feel it is necessary, please appeal sooner rather than later. ]]>

Please contact arikp@mail tau ac il

(replace the spaces with dots) ]]>

Question 4b - Why do we take the max between 0 and the other solution? Why not choose between them some other way? ]]>

In question 3, in both sections, we have examples that contradict the answers. For example, in (a) we can have only 1 hypothesis and it's the correct one (or we can have 2 hypotheses, one always correct and one always wrong) - then the error on the sample, as well as on the examples we haven't seen, would be 0.

In the answers, they claim that one is always smaller than the other…

In section (b) we can always make sure that in one set of hypotheses the overfitting is 0 and in the other it is greater than 0, and the other way around. If the extra 100 hypotheses are useless anyway, it makes no difference that we have more hypotheses.

]]>$y \sim N(ax, x^2)$

After taking the log of the ML function, the $\log(1/\sqrt{2\pi\sigma^2})$ part is usually a constant, so you can leave it out; but here it contains $x$ (since $\sigma^2 = x^2$).

How come it is not part of the optimization objective?
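
For reference, writing out the log-likelihood under the assumption $y_i \sim N(ax_i, x_i^2)$ (this is my reading of the question, not the official solution):

$$\log p(y_i \mid x_i; a) = -\frac{1}{2}\log(2\pi x_i^2) - \frac{(y_i - a x_i)^2}{2 x_i^2}.$$

The first term does depend on $x_i$, but not on the parameter $a$, so it is constant with respect to the quantity being optimized and can be dropped when maximizing over $a$.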

]]>Usually MAP and ML are used to find parameters of a distribution. How are they used for choosing hypotheses?

]]>It's easy to see that a linear kernel can only solve the first sample.

Isn't a linear kernel a special case of a polynomial kernel… can't a polynomial kernel solve the first sample as well?

Also, can't both the Gaussian and the polynomial kernel find a separator for both of the last two data sets? A circle can be represented in the polynomial kernel space, and the same circle used to describe the separator with a Gaussian kernel in the second sample can be used to separate the examples in the third sample set.

]]>Isn't that factoring in the prior, making it a MAP estimator? ]]>

Can you explain me what is my mistake? ]]>

The solution is not very clear - why is it 1+xi and not 1-xi? And why doesn't the kernel function appear in the solution?

Thanks. ]]>

In the lecture about perceptron we talked about the unrealizable case. We said that the previous analysis about the number of mistakes is no longer correct, and we bound the number of mistakes as a function of the total hinge loss.

The question is - what is the algorithm for the unrealizable case? The scribes only bound the number of mistakes, but if we run the perceptron without modifications, there will be infinitely many mistakes.

So - what is the algorithm in the unrealizable case? ]]>
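
For context, one standard workaround in the unrealizable case (not necessarily the one the scribes intend) is the "pocket" variant: run the perceptron for a bounded number of passes and keep the weight vector with the fewest training mistakes seen so far. A minimal sketch, with made-up toy data:

```python
import numpy as np

def pocket_perceptron(X, y, max_epochs=100):
    """Perceptron for the unrealizable case: run for a bounded number of
    epochs and keep the weight vector with the fewest training mistakes
    seen so far (the 'pocket' variant)."""
    n, d = X.shape
    w = np.zeros(d)
    best_w, best_errors = w.copy(), np.inf
    for _ in range(max_epochs):
        for i in range(n):
            if y[i] * (w @ X[i]) <= 0:       # mistake (or on the boundary)
                w = w + y[i] * X[i]          # standard perceptron update
                errors = np.sum(y * (X @ w) <= 0)
                if errors < best_errors:     # keep the best w seen so far
                    best_errors, best_w = errors, w.copy()
    return best_w

# Toy data (separable here, but the algorithm does not assume so).
X = np.array([[1.0, 0.0], [2.0, 1.0], [-1.0, 0.5], [-2.0, -1.0], [1.5, -0.5]])
y = np.array([1, 1, -1, -1, 1])
w = pocket_perceptron(X, y)
print(np.sum(y * (X @ w) <= 0), "training mistakes for the pocket vector")
```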

(D being a diagonal matrix without zeros on the diagonal)

My guess is that it does change the solution. Although setting the new $w^{*T}$ to $w^{*T}D^{-1}$ will still satisfy all the constraints, I do not see why such a solution would necessarily minimize the norm of $w^*$.

What is the right answer?

In the spectral clustering algorithm description it says that we should choose the eigenvectors with the smallest eigenvalues (except for the last one), whereas in the first example ('intuition' - slide 38), we choose the eigenvector with the largest eigenvalue. Why?

Thanks!

]]>I didn't understand why we can assume that b=0 in our linear separator - in order to make all examples separable that way, we need to know what b is (to add a constant coordinate b to all examples….).

Also, a similar assumption in WINNOW is unclear to me - why can we assume that all coefficients of w* are positive?

Thanks

]]>I saw that the margin in the perceptron algorithm is defined as min{(w*x)/||x||}. (See rec6 theorem 6.1).

My question is: Why do we divide by the norm of x?

As I understand it, w*x is ||w||·||x||·cos(a), where "a" is the angle between w and x, so dividing this by ||w|| should give us the distance of x from the separating hyperplane.

As I understand it, by this you have also determined the criterion in the "Margin Perceptron Algorithm" presented in rec6.

(I assume this is why we calculate (w*x)/||w|| there and compare it to gamma/2.)

Thanks in advance.
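
For what it's worth, one way to reconcile the two expressions, assuming rec6 takes $w^*$ to be of unit norm (an assumption on my part, not something stated here): the two common margin definitions are

$$\gamma = \min_i \frac{y_i (w^* \cdot x_i)}{\lVert w^* \rVert} \quad \text{and} \quad \gamma' = \min_i \frac{y_i (w^* \cdot x_i)}{\lVert w^* \rVert \, \lVert x_i \rVert}.$$

The first is the geometric distance to the hyperplane; the second additionally normalizes each example, which makes the mistake bound come out as $1/\gamma'^2$ instead of $(R/\gamma)^2$ with $R = \max_i \lVert x_i \rVert$. With $\lVert w^* \rVert = 1$, the rec6 expression $\min_i (w^* \cdot x_i)/\lVert x_i \rVert$ coincides with the second definition.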

]]>1 0 0 0 2

0 0 3 0 0

0 0 0 0 0

0 4 0 0 0

I would like to know a general algorithm, and how it is carried out in practice for this matrix, for example.

Thanks

]]>Thanks

]]>The answer in the solution was 15 - there are 7 other parameters (w.l.o.g. $P(x_1), P(x_2|x_1), P(x_2|!x_1), P(x_3|x_2,x_1), P(x_3|!x_2,x_1), P(x_3|x_2,!x_1), P(x_3|!x_2,!x_1)$), but they are irrelevant as you are already given $x$.

The first class is represented by the training sample set x1,…,xn and the second class by x*1,…,x*m.

x is in R^d.

Assume the two classes are linearly separable, and the equation of the optimal canonical hyperplane is w'*x + b = 0 [w' - transpose of the vector w1,…,wd],

so for the first class <xi, w> + b >= 1, and for the second class <x*i, w> + b <= -1.

In both cases there are vectors (support vectors) for which equality is reached (the canonical property).

Optimality means that among all such hyperplanes, the minimum of the L2 norm ||w|| is attained.

Prove that the optimal canonical hyperplane is unique.

Thanks :)

]]>2. In q2, how do you define the variance of a matrix when you don't assume, as we did in class, that $\sum_{i=1}^{n} x_i = 0$?

]]>2. Is the vector u a unit vector? ]]>

My name is Guy Braude and I'm looking for partners for a final project.

I'm a master's student, so I'm quite flexible about the dates on which I'll work on the project.

You can send me a message on Facebook,

guy braude

Have a good day

]]>This is your opportunity to request what we will talk about: if there are any weak points in the material you'd like to review, confusing questions from the HW or elsewhere that you want us to solve explicitly, or anything else - please let me know, either in the forum or by mail (li.ca.uat.tsop|regiewhcs#li.ca.uat.tsop|regiewhcs). Thanks!

]]>Thx

]]>The mean $\bar{x} =\frac{1}{m}\sum_{i=1}^nx_i$ is not clear to me; shouldn't it be $\bar{x} =\frac{1}{\mathbf{n}}\sum_{i=1}^nx_i$, since we sum over the $n$ columns in each dimension (row) $\le m$?

Also, shouldn't the covariance perhaps be with $\frac{1}{\mathbf{n}}$ as well?

Thanks again
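
For reference, with $n$ samples $x_1,\dots,x_n \in \mathbb{R}^m$, the standard definitions are

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \widehat{\Sigma} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^{\top}$$

(with $\frac{1}{n-1}$ in place of $\frac{1}{n}$ for the unbiased covariance estimate), so dividing by the number of samples $n$ is indeed the usual convention.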

]]>In question 1, shouldn't the covariance matrix $\Sigma$ be positive definite? If it is positive semi-definite, then it might not be invertible due to singularity.

Thanks,

]]>to save all the hassle of dealing with the permissions ]]>

P.S. When it says arbitrary vector, does it mean a normalized vector?

]]>the lecture on PCA [it won't let me upload links].

]]>In part 1, we should use cross-validation for every d and c, and plot for every d the average error (over the 10 folds) as a function of c. Correct?

In part 2, we have to count the number of support vectors that lie on the margin hyperplane. Should we report the average number of support vectors on the margin hyperplane (over the 10 classifiers we create), or train a new classifier on all the data and report the number of support vectors on the margin hyperplane for that classifier?

The same goes for computing the margin - is this the average margin size over the 10 margins we created?
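
As an aside, the part-1 procedure described above can be sketched as follows. This uses scikit-learn in Python rather than the course's Matlab/libsvm setup, and the (d, c) grids and dataset are placeholders, not the exercise's actual values:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Toy stand-in for the exercise data (the real assignment uses its own dataset).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

degrees = [2, 3]             # the d values to try (assumed grid)
C_values = [0.1, 1.0, 10.0]  # the c values to try (assumed grid)

avg_error = {}
for d in degrees:
    for C in C_values:
        clf = SVC(kernel="poly", degree=d, C=C)
        scores = cross_val_score(clf, X, y, cv=10)  # 10-fold CV accuracy
        avg_error[(d, C)] = 1.0 - scores.mean()     # average CV error

best = min(avg_error, key=avg_error.get)
print("best (d, C):", best, "error:", round(avg_error[best], 3))
```

For each d, plotting `avg_error[(d, C)]` against C gives the curve the question describes.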

Specifically, Matlab won't run since it doesn't have enough memory.

So we can't test our code on Linux.

Regev, any ideas?

]]>I don't recall we talked about computing the margin hyperplane, can you please explain how to do it?

Thanks

]]>And I saw that libsvm returns sv_coef, but it returns a matrix of size nSVx9, which doesn't make sense, because we should have coefficients for each 1-against-1 problem, so we should get an nSVx45 matrix. What am I getting wrong? ]]>
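
For what it's worth, the nSVx9 shape is consistent with how libsvm stores one-vs-one coefficients: each support vector can only appear in the k-1 pairwise subproblems that involve its own class, so the coefficients are stored compactly in k-1 columns rather than one column per each of the 45 pairwise problems. A scikit-learn sketch of the same behavior (sklearn wraps libsvm, with the axes transposed; the dataset is a made-up stand-in):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# 10-class toy problem mirroring the question's setup.
X, y = make_classification(n_samples=500, n_features=10, n_informative=8,
                           n_classes=10, random_state=0)
clf = SVC(kernel="linear").fit(X, y)

k = len(clf.classes_)
# One-vs-one trains k*(k-1)/2 = 45 classifiers, but each support vector
# participates only in the k-1 problems involving its own class, so the
# dual coefficients are stored with k-1 rows (columns, in libsvm's layout).
print(clf.dual_coef_.shape)  # (k-1, n_support_vectors)
```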

In a previous post, you (Regev :P) answered "Yes, but as before, this means to scale the data so that the mean is 0 and the std is 1", but in the exercise it says "Normalize the input vectors such that all attributes are in [−1, 1]".

Scaling the data so that the mean is 0 and the std is 1 doesn't guarantee attributes in [-1, 1].

What should we do? Normalize the mean and std, or just scale the data to be within the limits, for example:

(data - min(data)) / (max(data) - min(data)) * 2 - 1
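
For concreteness, here is a small Python sketch of the second option (per-attribute scaling into [-1, 1], following the formula above); which option the exercise actually expects is exactly the open question:

```python
import numpy as np

# Per-attribute (column-wise) scaling into [-1, 1], following the formula
# from the question. The data matrix is a made-up example.
data = np.array([[0.0, 10.0],
                 [5.0, 20.0],
                 [10.0, 40.0]])

lo = data.min(axis=0)   # per-attribute minimum
hi = data.max(axis=0)   # per-attribute maximum
scaled = (data - lo) / (hi - lo) * 2 - 1

print(scaled)  # every column now spans exactly [-1, 1]
```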

In the second programming assignment, we are asked to compute the margin obtained for each d.

However, we are using nonlinear SVM.

Doesn't this mean that we are using a polynomial kernel and therefore never find w for the decision rule?

As I see it, one can calculate the margin easily only if one knows "b" and "w" in the SVM decision rule.

But we know them only for linear SVM…

How are we supposed to obtain the margin from libsvm?

The only thing we thought of is looking at the support vectors, finding 2 that are on the margin and from different classes, and calculating the distance between them.

Is there something simpler we should do?

Thanks.
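
In case it's useful, there is a standard way to get the margin without an explicit w: since $w = \sum_i \alpha_i y_i \phi(x_i)$, the squared norm $\lVert w \rVert^2 = \sum_{i,j} (\alpha_i y_i)(\alpha_j y_j) K(x_i, x_j)$ needs only kernel evaluations, and the canonical-hyperplane margin is $1/\lVert w \rVert$. A Python/scikit-learn sketch (sklearn wraps libsvm; the dataset and kernel parameters are illustrative, not the exercise's settings):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import polynomial_kernel
from sklearn.svm import SVC

# Toy binary problem; the real exercise data would go here instead.
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
d = 3
clf = SVC(kernel="poly", degree=d, gamma=1.0, coef0=1.0, C=1.0).fit(X, y)

sv = clf.support_vectors_
coef = clf.dual_coef_.ravel()  # this stores alpha_i * y_i for each SV

# ||w||^2 = coef' K coef, using only kernel evaluations between SVs.
K = polynomial_kernel(sv, sv, degree=d, gamma=1.0, coef0=1.0)
w_norm = np.sqrt(coef @ K @ coef)
margin = 1.0 / w_norm          # margin of the canonical hyperplane
print("margin:", margin)
```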

]]>How can I contact the grader? ]]>

Otherwise, if, for example, we saw only the first vector m times - w would stay 0 and we would never get a separating hyperplane… ]]>

Does the perceptron algorithm normalize the samples or use them as they are?

]]>