What is the best way to build an incremental load to prepare data?
Answers
-
As always, it depends on what the source of the data will be.
If it's from a DB, the best way is to manage a GUID and a timestamp.
If it's from a web source like Twitter, you will need the max ID from the last time you pulled data from that source.
If the data is extracted from files, you will need to keep track of what you have already processed and what is new.
Give us more information about the source of the data and then we can begin to process the information.
1 -
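For the database case, the usual pattern is a "watermark": remember the latest timestamp already loaded, and pull only rows newer than it. Here is a minimal, self-contained sketch using SQLite; the `source`/`target` tables and `created_at` column are illustrative names, not from the original question.

```python
import sqlite3

# Illustrative schema: a source table we extract from and a target we load into.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (id INTEGER, created_at TEXT)")
conn.execute("CREATE TABLE target (id INTEGER, created_at TEXT)")
conn.executemany("INSERT INTO source VALUES (?, ?)",
                 [(1, "2024-01-01"), (2, "2024-01-02"), (3, "2024-01-03")])

# Watermark: the latest timestamp already loaded into the target.
watermark = conn.execute(
    "SELECT COALESCE(MAX(created_at), '') FROM target").fetchone()[0]

# Pull only rows newer than the watermark and append them.
new_rows = conn.execute(
    "SELECT id, created_at FROM source WHERE created_at > ?",
    (watermark,)).fetchall()
conn.executemany("INSERT INTO target VALUES (?, ?)", new_rows)
conn.commit()

print(len(new_rows))  # prints 3: everything is new on the first run
```

On the next run the watermark would be `2024-01-03`, so only rows created after that date would be picked up.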
One handy tip is that you can always use the macro %{process_start} for the time when a process is run. You could then store that value in an attribute in an ExampleSet and use it in a later process as the minimum for a date/time stamp, selecting only data after that point.
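Outside RapidMiner, the same idea (persist the time a run started, then filter on it next time) can be sketched in plain Python. The state file name is an assumption for illustration.

```python
import os
from datetime import datetime, timezone

STATE_FILE = "last_run.txt"  # illustrative location for the stored watermark

def read_last_run():
    """Return the previous run's start time, or None on the first run."""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return datetime.fromisoformat(f.read().strip())
    return None

def record_run_start(start):
    """Persist this run's start time for the next incremental run."""
    with open(STATE_FILE, "w") as f:
        f.write(start.isoformat())

start = datetime.now(timezone.utc)
last_run = read_last_run()   # use as a filter, e.g. created_at > last_run
record_run_start(start)
```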
0 -
It depends.
There are four kinds of fields that can help us perform an incremental load. Here they are, from best to worst:
- Creation Date (date)
- Modification Date (date)
- Deletion Date (date)
- ID (numerical)
- If there is a creation date, then we can use Aggregate to read the last date in the database and perform an INSERT.
- If there is a modification date, then we can use the last date on the Modification Date to perform an UPDATE.
- If there is a deletion date, then we can use the last date on the Deletion Date to perform a DELETE. I haven't seen anyone use the deletion date without the other two.
- If there are no dates because the system doesn't have auditing, then we can rely on a numerical ID. Notice that IDs are created with sequences, and sequences are prone to getting out of sync (at least in some databases). In that case, the Aggregate is still required, just on the ID field.
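The Aggregate-on-ID fallback can be sketched as follows: find the highest ID already in the warehouse, then insert only rows above it. Table names here are illustrative, not from the post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute("CREATE TABLE warehouse (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO source VALUES (?, ?)",
                 [(1, "a"), (2, "b"), (3, "c")])
conn.execute("INSERT INTO warehouse VALUES (1, 'a')")  # already loaded earlier

# The Aggregate step: highest ID already present in the warehouse.
last_id = conn.execute(
    "SELECT COALESCE(MAX(id), 0) FROM warehouse").fetchone()[0]

# INSERT only the rows created after it.
conn.execute(
    "INSERT INTO warehouse SELECT id, payload FROM source WHERE id > ?",
    (last_id,))
conn.commit()

print(conn.execute("SELECT COUNT(*) FROM warehouse").fetchone()[0])  # prints 3
```

Note this only catches new rows; it cannot detect updates or deletions, which is why it ranks worst of the four fields.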
If this is not what you are looking for, then please break this glass ONLY in case of emergency: there is another idea. It is overkill and totally not recommended, but it is the last resort if we want to keep the database updated. If there is a unique field that isn't sequential, you are required to keep the data warehouse up to date, and the database supports it, you can perform an UPSERT. Let's say whoever designed the database had the marvelous idea of using car plates, identification numbers, or addresses as IDs (yes, I've seen them, they exist, and I actually became a bit famous for using data science to overcome technical debt... I hope you don't have the same problems, but if you do, let me tell you that you are not alone). In that case, your only alternative is to perform an INSERT that can fail on the unique value but update everything else.
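That insert-or-update pattern can be sketched with an `INSERT ... ON CONFLICT DO UPDATE`. SQLite syntax is shown (PostgreSQL accepts the same clause); the car-plate table is a hypothetical matching the example above.

```python
import sqlite3

# Hypothetical table keyed on a non-sequential unique value: a car plate.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (plate TEXT PRIMARY KEY, owner TEXT)")
conn.execute("INSERT INTO cars VALUES ('ABC-123', 'Alice')")

# UPSERT: the INSERT hits the unique key, so it falls back to an UPDATE
# instead of failing.
conn.execute(
    "INSERT INTO cars VALUES ('ABC-123', 'Bob') "
    "ON CONFLICT(plate) DO UPDATE SET owner = excluded.owner")
conn.commit()

print(conn.execute(
    "SELECT owner FROM cars WHERE plate = 'ABC-123'").fetchone()[0])  # Bob
```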
Again, this is overkill, as it performs up to three operations per row. I have to do something similar: upserts on a table that receives 4.7 GB of floating-point numbers per hour on a little pet project (dear @sgenzer, please insert recorded laughs here). I ended up creating the table dynamically and updating a view to point to the latest one, because that was the best way I could recover my hours of sleep.
Still, this is not a common approach and it is not recommended. Only a seasoned DBA or a totally insane and fearless person would give this one to you. I am in the second class of people.
Hope this helps,
Rodrigo.