Repository size question

kayman
New Altair Community Member
Assume the following process:
around 10K articles in different languages; 80% are Latin-based, 10% Greek-based and 10% Cyrillic.
My process uses Unicode range filtering to determine the language of the content, and then stores the matching articles in a repository for further handling:
all articles identified as Cyrillic go to a Cyrillic repository,
all Greek articles go to a Greek repository,
everything else goes to the Latin repository.
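To make the routing above concrete, here is a minimal Python sketch of script detection by Unicode range. It is only an illustration, not the actual process: the names `detect_script` and `route` are hypothetical, and it assumes only the basic Greek (U+0370 to U+03FF) and Cyrillic (U+0400 to U+04FF) blocks matter, with the majority script deciding the target repository.

```python
# Assumption: only the basic Greek and Cyrillic blocks are checked;
# extended blocks (e.g. Greek Extended, Cyrillic Supplement) are ignored.
GREEK = (0x0370, 0x03FF)
CYRILLIC = (0x0400, 0x04FF)


def detect_script(text: str) -> str:
    """Return 'cyrillic', 'greek' or 'latin' based on character counts."""
    greek = sum(GREEK[0] <= ord(ch) <= GREEK[1] for ch in text)
    cyrillic = sum(CYRILLIC[0] <= ord(ch) <= CYRILLIC[1] for ch in text)
    if cyrillic > greek:
        return "cyrillic"
    if greek > 0:
        return "greek"
    # Everything without Greek or Cyrillic characters falls through to Latin.
    return "latin"


def route(articles):
    """Distribute articles over three in-memory 'repositories'."""
    repos = {"latin": [], "greek": [], "cyrillic": []}
    for article in articles:
        repos[detect_script(article)].append(article)
    return repos


repos = route(["Hello world", "Καλημέρα κόσμε", "Привет мир"])
print({name: len(items) for name, items in repos.items()})
```

With an 80/10/10 split one would expect the Greek and Cyrillic outputs to be roughly an eighth the size of the Latin one.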
So far so good. This was rather easy to accomplish, but I noticed that all my repositories are (almost) equal in size, even though the Cyrillic and Greek content each account for only about 10% of the articles, so those repositories should be much smaller than the Latin one.
Loading the Greek repo indeed shows only Greek content, and the Cyrillic one only Cyrillic, so that seems OK.
However, when viewing the directory directly with a text editor, all of them show basically all of the content: both the filtered content (which is shown when opening the repo) and the inverted results (which remain hidden).
Why is this redundant data stored, and how can I get rid of it to reduce the size?
Thanks!
Answers
Hi,
Was your text attribute set to polynominal or text? With text this should not happen. With polynominal, RM (RapidMiner) uses a mapping table in the background, and this mapping table is not cleaned up when you filter. You need to use Remove Unused Values to force the cleanup.
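To illustrate why the filtered repositories stay large, here is a toy Python model of a nominal column backed by a mapping table. It is a sketch under assumptions, not RapidMiner's actual internals: the class `NominalColumn` and its methods are hypothetical names made up for this example.

```python
class NominalColumn:
    """Toy nominal column: each row stores an index into a mapping table."""

    def __init__(self, values):
        self.mapping = []   # distinct values, i.e. the mapping table
        self.codes = []     # one index into `mapping` per row
        index = {}
        for value in values:
            if value not in index:
                index[value] = len(self.mapping)
                self.mapping.append(value)
            self.codes.append(index[value])

    def values(self):
        """Return the visible rows, resolved through the mapping table."""
        return [self.mapping[code] for code in self.codes]

    def filter(self, keep):
        """Drop rows, but leave the mapping table untouched."""
        result = NominalColumn([])
        result.mapping = list(self.mapping)
        result.codes = [c for c in self.codes if keep(self.mapping[c])]
        return result

    def remove_unused(self):
        """Rebuild the mapping table from the rows that are still present."""
        return NominalColumn(self.values())


column = NominalColumn(["Hello world", "Καλημέρα κόσμε", "Привет мир"])
greek = column.filter(lambda v: v == "Καλημέρα κόσμε")

print(greek.values())                      # ['Καλημέρα κόσμε']
print(len(greek.mapping))                  # 3
print(len(greek.remove_unused().mapping))  # 1
```

In this model the filtered column shows a single row, yet its mapping table still holds all three texts until it is rebuilt, which matches the symptom of every repository containing basically all of the content.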
~Martin
Thanks Martin, using Remove Unused did the trick.