Auto adjusting fields as you add new reports
I've built a model using RegExs to capture the fields I want, however this will be an ongoing effort, with updated reports, where certain fields will vary in length. An issue I've noticed is that some fields grow in length, and the regex does not adjust the field size to capture new data. Is there anyway around this? We will ultimately be running the project on an Automater server.
Answers
-
Hi Steve and welcome,
Firstly are you using Data Prep Studio or Classic as the basis for your data extraction?
Secondly, I assume you are working with some sort of "report".
If so regex, though powerful, is not the first line of approach for creating a model to trap specific data records and map the fields you require.
In fact I would say that for most reports it may be the last resort when nothing else works or the number of records that could be extracted need some form of in-line filtering to keep table sizes down.
Normally one would look for a repeating pattern in the records in the report that identifies the lines of groups of lines that make up a record, define a trap (keeping it as simple as possible and very simple is ideal), mapping where the fields are using the sample data and then seeing what you have in the data table. If you extract record you do not need for the task at hand, work out how to differentiate the records you need form those you do not need (usually some form of filtering on one or more fields) and apply that leaving only what you want this time. The model (Classic) or Workspace (DPS), when saved, will then be available for reuse for current or other requirements later. The data extraction will not, in that case, be data limited but the resulting table is made data limited by the filtering.
Moving on to your variable field sizes ... we need to consider whether the fields are coming from a fixed format report and can be made as large as the largest possible example you will ever find in the field definition. If so just amend the field definition in the model.
If you are working with a "csv" type of report where fields can be any length and are defined by a separator character you may want to consider treating the input file as a database rather than a report. Ideally you would still identify the maximum possible width of a field in the files coming from that source set the field definition accordingly after the initial model auto-creation and save.
If, on the other hand, you are dealing with a PDF (or an "ordinary" report that has some interesting built in "features") whereby the line by line structure may be rather inconsistent then a somewhat different approach may be required. I won't attempt to dive into that subject for now.
There is a feature known as a "Floating Trap" that may be useful to you BUT it is not something to dive into lightly unless all other options have been tried and proved lacking. So let's consider the others for now. If you really need floating traps and/or regex that should quickly become apparent as we look at the other options with you.
Grant
0 -
Altair Forum User said:
Hi Steve and welcome,
Firstly are you using Data Prep Studio or Classic as the basis for your data extraction?
Secondly, I assume you are working with some sort of "report".
If so regex, though powerful, is not the first line of approach for creating a model to trap specific data records and map the fields you require.
In fact I would say that for most reports it may be the last resort when nothing else works or the number of records that could be extracted need some form of in-line filtering to keep table sizes down.
Normally one would look for a repeating pattern in the records in the report that identifies the lines of groups of lines that make up a record, define a trap (keeping it as simple as possible and very simple is ideal), mapping where the fields are using the sample data and then seeing what you have in the data table. If you extract record you do not need for the task at hand, work out how to differentiate the records you need form those you do not need (usually some form of filtering on one or more fields) and apply that leaving only what you want this time. The model (Classic) or Workspace (DPS), when saved, will then be available for reuse for current or other requirements later. The data extraction will not, in that case, be data limited but the resulting table is made data limited by the filtering.
Moving on to your variable field sizes ... we need to consider whether the fields are coming from a fixed format report and can be made as large as the largest possible example you will ever find in the field definition. If so just amend the field definition in the model.
If you are working with a "csv" type of report where fields can be any length and are defined by a separator character you may want to consider treating the input file as a database rather than a report. Ideally you would still identify the maximum possible width of a field in the files coming from that source set the field definition accordingly after the initial model auto-creation and save.
If, on the other hand, you are dealing with a PDF (or an "ordinary" report that has some interesting built in "features") whereby the line by line structure may be rather inconsistent then a somewhat different approach may be required. I won't attempt to dive into that subject for now.
There is a feature known as a "Floating Trap" that may be useful to you BUT it is not something to dive into lightly unless all other options have been tried and proved lacking. So let's consider the others for now. If you really need floating traps and/or regex that should quickly become apparent as we look at the other options with you.
Grant
Grant,
Thanks for the detailed response. I have chosen to use RegEx as my trap because of varying spacing between records on the reports. The reports themselves are print to pdfs from a web page. Due to the varying nature of spacing between records and their labels, a RegEx ha sproven to be the most robust choice.
The issue I've run into has been the necessary report verifications I need to do when running new reports through the process. I am looking to automate this step, or find a work around. Simply finding the maximum character length does not work, as in some reports the same field may be a couple tabs off from the corresponding fields on others. and Example would be:
Transaction date: 09/20/2017
Transaction date: 9/21/2017
0 -
Altair Forum User said:
Grant,
Thanks for the detailed response. I have chosen to use RegEx as my trap because of varying spacing between records on the reports. The reports themselves are print to pdfs from a web page. Due to the varying nature of spacing between records and their labels, a RegEx ha sproven to be the most robust choice.
The issue I've run into has been the necessary report verifications I need to do when running new reports through the process. I am looking to automate this step, or find a work around. Simply finding the maximum character length does not work, as in some reports the same field may be a couple tabs off from the corresponding fields on others. and Example would be:
Transaction date: 09/20/2017
Transaction date: 9/21/2017
Hi Steven,
Hmm.
Regular PDFs can be tricky enough to uncover consistency of some sort.
Screen grab PDF prints?. I wish you well with that project. It's not a source that I have ever (knowingly) battled with.
Two thoughts came to mind.
Firstly it might be worth looking to the HTML or whatever code rather than a PDF. At least the screen writing code should have structure.
Secondly one would need to work with a sample output to have much chance of working out what might work and what surely will not work so it's not easy to
Maybe best to abandon the structure offered by the PDF writing engine and take a first step hit to restructure whatever text has been extracted. But only if it looks like restructuring might offer a more consistent base from which to work.
The objective would be to use Monarch text manipulation tools to eliminate as much inconsistency as possible and thus make modelling more viable.
It's unlikely to be a ten minute task - although sometimes it can be.
HTH.
Grant
0