What exactly does Singular Value Decomposition do?

Textminer New Altair Community Member
edited November 2024 in Community Q&A
Dear all,

I have a question about the dimension reduction technique called Singular Value Decomposition in RapidMiner. I am using it for text mining and I want to know what it actually does. I have searched everywhere for an answer, including this forum, but I couldn't find one.

I ran some experiments to find out what SVD does. As far as I can tell, it decomposes the (term versus document) matrix into three matrices, USV*. It then replaces the original (term versus document) matrix with the matrix U and applies the dimension reduction to that matrix. Is this correct, and if so, why is the original matrix replaced with U? Is there some explanation or theory behind this?

I hope that you can help me out with this. Maybe there is a file that describes how SVD works in RapidMiner; if so, could you pass me a link to it?

Thanks in advance.

Greetings,
Textminer

Answers

  • Textminer
    Textminer New Altair Community Member
    Hi,

    Can someone help me out, please? I really need the answer. What about the moderators: doesn't any of you know the answer to my question (what about you, mierswa?)? I find this very strange. Is it such a difficult question?

    I hope to hear some reactions.

    Bye,

    Textminer
  • Textminer
    Textminer New Altair Community Member
    Hi Haddock,

    Thanks for your link, but I know what Singular Value Decomposition is and what it does. What I actually want to know is what RapidMiner does. Which actions does it perform on the term-by-document matrix? If you select SVDReduction in RapidMiner, it only states "a dimensionality reduction method based on singular value decomposition". What does this mean in practice? I can't find this anywhere.

    Greetings,

    Textminer
  • haddock
    haddock New Altair Community Member
    Hi Textminer,

    I have sympathy with your problem, because it concerns not the "what" but the "how" of the RM SVD implementation, and that means getting down and dirty with the source and Eclipse, or being very nice to Ingo, he of the very pointy head.

    However, I take the more general point, namely that RM documentation could be improved. Being an automaton myself, I too can learn from supervised examples, so it would be nice if you could drill down directly from the "new operator" tab to examples, forum articles, and outside links, as you can with other IDEs.

    That being said, you have to go with what you've got, which is in its own way a world leader, actively supported and developed by some of the most qualified, able, and enthusiastic minds you are likely to meet.

    Happy coding - I took a squint and survived the experience!
  • IngoRM
    IngoRM New Altair Community Member
    Hi Textminer, hi Haddock,

    I would like to thank Haddock for the kind words. I must admit that I always look forward to your answers and comments, since they are a pleasure to read (have you considered working as an author? I would surely read your books / articles / editorials / ...).


    About the documentation issue in general:

    We really would like to have more resources for improving the documentation, and we have already started on this. But this of course takes a lot of time. Then again, this is one of the major advantages of using open-source software: you can check the concrete details yourself. As a developer, I always stick to the following two rules:

    1.) Don't write comments which are likely to become wrong sometime
    2.) Don't write code which is less clear than a comment

    Working like this has two consequences: fewer comments, but clearly written code that can often be read even by non-developers (of course it is easier for developers, or at least for people with some mathematical background).


    About the SVD:

    You do not have to be a developer and work with all these developer tools to get insight into the code. A simple web browser is enough. The following link leads to the base of all source code of RapidMiner:

    http://yale.cvs.sourceforge.net/yale/


    And here you can find the concrete source for the SVD:

    http://yale.cvs.sourceforge.net/yale/yale/src/com/rapidminer/operator/features/transformation/SVDReduction.java?view=markup


    One of the most important lines here is

    import Jama.SingularValueDecomposition;

    meaning that we do not compute it ourselves but ask a library (Jama) to do it. Since Jama is also open source, you can check there for more details.

    Cheers,
    Ingo

  • Textminer
    Textminer New Altair Community Member
    Hi Ingo,

    Thanks for your answer about SVD. I wasn't aware that you could browse the source code of RapidMiner online, so thanks for pointing that out. The important lines are indeed:

    Matrix u = svd.getU().getMatrix(0, es.size() - 1, 0, dimensions - 1);
    return u;

    The Jama package has a class called SingularValueDecomposition, whose documentation says: "For an m-by-n matrix A with m >= n, the singular value decomposition is an m-by-n orthogonal matrix U, an n-by-n diagonal matrix S, and an n-by-n orthogonal matrix V so that A = U*S*V'."
    This class has a method getU() which returns the left singular vectors (in other words, the matrix U, whose columns are those vectors). The important lines call this method and then use getMatrix to keep rows 0 to es.size() - 1 and only the first "dimensions" columns. So this means that I was right in the first place (see my first post).
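
    Written out in matrix notation, the two lines above amount to the following. This is only a sketch under my own assumptions: that the rows of A are the examples (documents) and the columns the attributes (terms), which is how I read es.size(), and that k stands for the dimensions parameter. The symbols U_k, S_k and V_k are my notation, not names from the RapidMiner or Jama source.

    ```latex
    % A: m-by-n data matrix (assumed: m examples/documents, n attributes/terms), m >= n
    A = U S V^{\top}, \qquad
    U \in \mathbb{R}^{m \times n},\;
    S = \operatorname{diag}(\sigma_1, \dots, \sigma_n),\;
    V \in \mathbb{R}^{n \times n}

    % Keep all m rows and the first k columns of U (k = dimensions)
    U_k = U_{[1..m],\,[1..k]} \in \mathbb{R}^{m \times k}

    % Because V is orthogonal, A V = U S; restricted to the first k columns:
    A V_k = U_k S_k
    \quad\Longrightarrow\quad
    U_k = A V_k S_k^{-1} \quad (\sigma_1, \dots, \sigma_k > 0)
    ```

    Read this way, each row of U_k holds one example's coordinates along the first k right singular vectors, each divided by the corresponding singular value.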

    But in that first post I also wondered whether there is some theory or explanation behind this. Since you are one of the authors of the code, Ingo, I ask you this question: why do you select the columns of the matrix U? Does this have some connection with LSI (latent semantic indexing)? I hope you can help me out with this question. I would appreciate it very much.

    Thanks in advance,

    Greetings,

    Textminer
  • IngoRM
    IngoRM New Altair Community Member
    Hi,

    You wrote: "But in this first post I also wondered if there is some theory or explanation behind this? Because you are one of the authors of the code I ask you Ingo this question. Why do you select the columns of the matrix U? Has this some connection with LSI (latent semantic indexing)? I hope you can help me out with this question. I would appreciate it very much."

    You can think of U (the left singular vectors) as a vector basis that captures the most relevant information in the system, similar to the principal components in PCA. Those vectors also form an orthonormal basis for the data points. I am not too familiar with LSI and SVD (at least not for text processing), but it seems that both approaches can be used towards a similar goal: reducing the effects of synonymy and polysemy of keywords in texts. Sorry, I am afraid I do not know much more about this topic.
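
    The PCA analogy can be sketched in matrix notation. This assumes the data matrix A (examples as rows) has been column-centred, which is an assumption made only for the comparison and not a statement about what the operator does. A = U*S*V' is the decomposition as Jama documents it; the subscript k is notation introduced here for "first k columns".

    ```latex
    % Assumption: A is m-by-n with its column means subtracted
    A = U S V^{\top}
    \quad\Longrightarrow\quad
    A^{\top} A = V S U^{\top} U S V^{\top} = V S^{2} V^{\top}

    % Columns of V: principal directions (eigenvectors of A^T A)
    % Principal component scores of the examples:
    A V = U S

    % Rank-k truncation (best rank-k approximation of A in the Frobenius norm)
    A_k = U_k S_k V_k^{\top}
    ```

    Under that assumption the columns of U are the principal component scores rescaled to unit length, so keeping the first k columns of U keeps the k directions with the largest singular values.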

    Cheers,
    Ingo