"The most appropriate feature selection setup"

Hi, I recently purchased RapidMiner Enterprise and the Attribute Selection plugin. I have a dataset consisting of 1300 rows each with 800 features, resulting in 7200 overall features.

I'd like to reduce this dataset by employing feature selection.

I have 64-bit Windows XP running on an Intel Core i7 processor with 12 GB of RAM.

The overall goal is to reduce the dataset and then train an SVM so that I can query a database for similar results.

So, what would be the best setup to perform this feature selection?

Thanks in advance for any help,
Dave

land

Hi Dave,
from your description I guess that the goal is to find the most similar examples (rows) for some query example?
An SVM is usually used to distinguish two classes or to predict a numerical outcome from other numerical attributes. If your examples have no label (target, class), it will be difficult to measure performance, and the Attribute Selection Plugin relies on a performance estimate to find the best attribute subset. Do you have anything like that? If not, how are you going to estimate performance at all?

Greetings,
Sebastian

dave10

Basically, each row of 800 features represents a colour texture descriptor for an image. But 800 is far too many, so I want to reduce each image representation.

Then I will hopefully be able to train an SVM or cluster similar images to form classes or groups.

I can then reduce any new query image down to the same number of features and query the database.

land

Hi,
OK. It seems to me that we don't have a target suitable for SVM training as long as we don't have a class attribute. Let me briefly describe what I understood you want to achieve:
You have rows of 800 attributes (features) and you are searching for a subset of these attributes such that the distances, or at least the distance relations, between the rows remain nearly the same? So if we have a request row p, the nearest training rows would be ordered like this:
dist(p, row_1) < dist(p, row_2) < dist(p, row_3) < dist(p, row_4) ...
Then we are searching for a function f that reduces the number of attributes and still yields the same order:
dist(f(p), f(row_1)) < dist(f(p), f(row_2)) < dist(f(p), f(row_3)) < dist(f(p), f(row_4)) ...
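
To make the ordering condition concrete, here is a minimal Python sketch (not RapidMiner code). It assumes Euclidean distance and that f simply keeps a subset of attribute indices; the name `neighbour_order`, the `keep` parameter and the toy data are made up for illustration.

```python
import math

def neighbour_order(query, rows, keep=None):
    """Return row indices sorted by Euclidean distance to the query.

    keep is a list of attribute indices to retain (a hypothetical f);
    None means all attributes are used.
    """
    def f(v):
        return v if keep is None else [v[i] for i in keep]

    q = f(query)
    return sorted(range(len(rows)), key=lambda i: math.dist(q, f(rows[i])))

# Toy data: four rows with three attributes each (assumed values).
rows = [
    [1.0, 0.0, 5.0],
    [2.0, 0.1, 1.0],
    [4.0, 0.0, 9.0],
    [8.0, 0.2, 3.0],
]
p = [0.0, 0.0, 4.0]

full_order = neighbour_order(p, rows)
reduced_order = neighbour_order(p, rows, keep=[0, 2])

# True when dropping attribute 1 leaves the neighbour ordering unchanged.
order_preserved = full_order == reduced_order
```

A subset that drops too much, for example `keep=[2]` on this toy data, changes the ordering, which is exactly what the selection would have to penalise.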

Is this correct?

Greetings,
Sebastian

dave10

Hi Sebastian, thanks a lot for your help, much appreciated.

Yes, that's correct. I need to reduce the number of attributes but still keep a distinctive representation for each row, i.e. the dominant features.

Then, as you said, I could use a distance metric to check for similarity between the reduced images in the database and any new images used as queries, and return the closest ones.

holger

Why don't you use a PCA to reduce the dimensionality of your dataset?

Best, Holger

land

Hi Dave,
as Holger said, this problem is normally solved by applying a dimensionality reduction technique. But as I understood it, you are going to query a database, so you are doing this to save database time. A PCA or similar would need to read all values to calculate the new attributes, so it can't be used to save query time. Interesting problem...
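
A small NumPy sketch of why that is, assuming PCA computed via SVD on mean-centred data (random numbers stand in for the real descriptors, and the sizes are shrunk): each new attribute is a weighted sum over every original attribute, so all of them still have to be read per query.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))  # stand-in for the 1300 x 800 descriptor matrix

mean = X.mean(axis=0)
# Rows of Vt are the principal directions, ordered by decreasing variance.
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)

k = 3
components = Vt[:k]  # shape (k, n_attributes)

def project(x):
    """Map one descriptor to k PCA scores; uses every original attribute."""
    return (x - mean) @ components.T

scores = project(X[0])

# For each new attribute, count the original attributes with a non-zero weight.
nonzero_weights = np.count_nonzero(np.abs(components) > 1e-12, axis=1)
```

With data like this, every entry of `nonzero_weights` equals the full attribute count. Plain attribute selection, by contrast, only needs the selected columns.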

OK, here's my suggestion for estimating a performance value to drive an Attribute Selection:
We need to compute the distances between all examples and give them a range attribute (that is, the rank of each example among the query's neighbours). For example, if we have an example set containing three examples with the id values id1, id2, id3, the result would be

query_id	document_id	original_range
id1	id3	1
id1	id2	2
id2	id3	1
id2	id1	2
id3	id2	1
id3	id1	2
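
Outside RapidMiner, a table like this could be produced with a few lines of Python. This is only a sketch: Euclidean distance is an assumption, `rank_table` is a made-up helper name, and the three toy vectors were chosen so that the output matches the table above.

```python
import math

def rank_table(examples):
    """examples maps id -> feature list.

    Returns (query_id, document_id, rank) tuples, where rank 1 is the
    nearest other example to the query.
    """
    table = []
    for query_id, q in examples.items():
        others = sorted(
            (math.dist(q, v), doc_id)
            for doc_id, v in examples.items()
            if doc_id != query_id
        )
        for rank, (_, doc_id) in enumerate(others, start=1):
            table.append((query_id, doc_id, rank))
    return table

# Assumed toy vectors: id3 lies between id1 and id2, closer to id2.
examples = {"id1": [0.0, 0.0], "id2": [3.0, 0.0], "id3": [2.0, 0.0]}
table = rank_table(examples)
```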

Alternatively, one could store the distance instead of the range. The range (rank) reflects only the ordering, while the distances also reflect how far apart the examples are. One should think about which of the two is smarter to optimize.
We have to store this table using an IOStorer so that it is accessible inside the feature selection.
Inside the feature selection we have to compute the same table again, but on the basis of the reduced feature set. We could then load the original table using an IORetriever and join both example sets, so that beside the original range attribute there is a range attribute for the reduced set. We could then set one as label and the other as prediction and use a regression performance operator to estimate the performance.
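
The same idea as a rough Python sketch rather than a RapidMiner process. It assumes Euclidean distance and uses root mean squared error between the two rank columns as the regression-style measure; `rank_table`, `rank_rmse` and the toy data are illustrative names and values, not operators from the tool.

```python
import math

def rank_table(examples, keep=None):
    """Map (query_id, document_id) -> rank of the document for that query.

    keep is a list of attribute indices to use; None means all attributes.
    """
    def f(v):
        return v if keep is None else [v[i] for i in keep]

    table = {}
    for query_id, q in examples.items():
        others = sorted(
            (math.dist(f(q), f(v)), doc_id)
            for doc_id, v in examples.items()
            if doc_id != query_id
        )
        for rank, (_, doc_id) in enumerate(others, start=1):
            table[(query_id, doc_id)] = rank
    return table

def rank_rmse(examples, keep):
    """Score a candidate attribute subset; 0.0 means the ranks are unchanged."""
    original = rank_table(examples)       # the stored table (label)
    reduced = rank_table(examples, keep)  # recomputed per subset (prediction)
    # Join on (query_id, document_id) and compare the two rank columns.
    squared = [(original[key] - reduced[key]) ** 2 for key in original]
    return math.sqrt(sum(squared) / len(squared))

# Assumed toy data: three examples with three attributes each.
examples = {
    "id1": [0.0, 0.0, 1.0],
    "id2": [3.0, 0.1, 0.0],
    "id3": [2.0, 0.0, 5.0],
}
good_subset_error = rank_rmse(examples, keep=[0, 2])
poor_subset_error = rank_rmse(examples, keep=[0])
```

A wrapper-style selection would call something like `rank_rmse` for each candidate subset and keep the subset with the lowest error. Note that the full table grows quadratically with the number of examples.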

That's how one could do it with the standard operators (although you will have to use a few tricks to get the tables mentioned). Otherwise you would need to do a little programming, or ask for a special operator that solves this problem.

Greetings,
Sebastian

dave10

Hi Sebastian, thanks a lot for your answer, much appreciated.

Is there any way of employing the advanced attribute selection operator to reduce the dataset?

Dave

dave10

The 800 attributes representing each image are to be reduced while still keeping enough descriptive power to represent the image uniquely. These reduced image feature vectors will then be clustered, or used to train an SVM if possible, and loaded into a database for convenience.

When a new image comes along, it will be reduced to the same attributes (and therefore the same number of attributes). This will enable a matching process to find which images are closely related to the new one and return a set of similar images from the database.
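
In code, the matching step could look roughly like the Python sketch below. It assumes the selection result is just a list of attribute indices and that similarity is Euclidean distance; the names `reduce_vector`, `most_similar`, the image ids and all values are made up for illustration.

```python
import math

def reduce_vector(vector, keep):
    """Keep only the selected attribute indices."""
    return [vector[i] for i in keep]

def most_similar(query, database, keep, k=2):
    """Return the ids of the k stored images closest to the query."""
    q = reduce_vector(query, keep)
    scored = sorted(
        (math.dist(q, stored), image_id) for image_id, stored in database.items()
    )
    return [image_id for _, image_id in scored[:k]]

keep = [0, 2]  # hypothetical result of the attribute selection

# Full descriptors are reduced once and only the reduced form is stored.
full_descriptors = {
    "img_a": [0.1, 0.9, 0.3],
    "img_b": [0.8, 0.1, 0.7],
    "img_c": [0.2, 0.5, 0.4],
}
database = {
    image_id: reduce_vector(v, keep) for image_id, v in full_descriptors.items()
}

# A new image is reduced with the same indices before matching.
new_image = [0.12, 0.0, 0.3]
result = most_similar(new_image, database, keep, k=2)
```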

Thanks,
Dave

land

Hi Dave,
it will be possible, but it's a rather complex process. I guess it would take me half a day to design, so I can't do this here in the forum. That would be too much consulting... I tried to give an outline in my last post, so that you might be able to take these ideas and design it on your own. I hope it helps.

Greetings,
Sebastian

dave10

Thanks for your help, Sebastian. I'm new to the area, so I'm struggling a little with it.

I'll try to stick to just reducing the number of attributes using a feature selection algorithm and go from there.

Cheers,
Dave