Hi Community
Disclaimer:
First-timer here and a data science newbie, so I may get the technical terminology wrong. I'm reasonably comfortable with concepts, but not strong in statistics, higher math, or programming.

But I try.

From my bachelor's degree coursework I have basic statistics and programming knowledge, but it has gone untrained for years.
Background:
As part of my business information technology studies, I'm working on my bachelor thesis, "Improved future sales forecasting by applying machine learning" (as opposed to simple compare-to-last-year prediction), together with a company operating convenience stores.
I have access to their BI system to pull historical sales data with several attributes, for example: date, shop, article, number sold.
Data preparation:
To develop a model, I selected two customer contexts that may trigger a visit to the store to buy very specific goods: "grill party at lake" and "students breakfast".
I then looked at a handful of shops close to lakes ("grill party") and/or universities ("students breakfast") and pulled the BI data for the affected articles (chips, beer, sausages, bagels, coffee, etc.).
I then added several hopefully relevant attributes, such as HasLake (is the shop close to a lake), HasUniversity (is the shop close to a university), HasSemester (does the transaction fall during or between university semesters), HasHoliday (is it a public holiday), and weather figures (temperature, hours of sunshine, amount of rain).
My current (anonymized simplified) example dataset is attached as Excel.
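To make the data layout concrete outside of RapidMiner, here is a minimal Python/pandas sketch of how such a table could be assembled: one row per (date, shop, article) with static shop flags kept in a lookup table and merged in, and date-derived flags computed rather than hand-entered. All column names and values are illustrative, not the real BI schema.

```python
import pandas as pd

# Hypothetical slice of the BI export: one row per (date, shop, article).
sales = pd.DataFrame({
    "date":      pd.to_datetime(["2023-07-01", "2023-07-01", "2023-10-02"]),
    "shop":      ["S01", "S02", "S01"],
    "article":   ["Beer", "Bagel", "Sausage"],
    "sold":      [120, 35, 18],
    "temp_c":    [29.0, 22.0, 11.0],
    "sun_hours": [11.5, 6.0, 2.5],
    "rain_mm":   [0.0, 1.2, 8.4],
})

# Static shop attributes live in a separate lookup table, so each flag
# is defined once per shop instead of being repeated on every row.
shops = pd.DataFrame({
    "shop": ["S01", "S02"],
    "HasLake": [True, False],
    "HasUniversity": [False, True],
})
data = sales.merge(shops, on="shop", how="left")

# Calendar-derived attributes can be computed from the date column.
data["weekday"] = data["date"].dt.day_name()
data["month"] = data["date"].dt.month
print(data[["shop", "article", "sold", "HasLake", "HasUniversity", "weekday"]])
```

The same join-on-shop idea works in RapidMiner with a second ExampleSet and a Join operator.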
Trying my luck:
I'm now asking for help on how best to proceed.
I remodelled my example set several times (articles as rows vs. articles as columns; more attributes, fewer attributes; ...) and tried to put together a process, but failed horribly every time.
I then went for Auto Model. Deep Learning and Gradient Boosted Trees yielded quite good results, but (a) they produce "black box" models that are difficult to defend in a bachelor thesis, and (b) the automated feature selection seems to favour attributes that are not generic but highly specific to the example set, e.g. a single shop. This makes sense, since in the data one specific shop has very high beer sales. But it also makes the model inapplicable to other customer contexts in other shops not included in the example set (there are ~200 shops in total with 3000 articles each, and at least a dozen contexts that apply to some shops but not others; e.g. a high-volume highway petrol station has nothing to do with either a university or a grill party at the lake).
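One way to measure whether a model is leaning on shop-specific quirks is to validate it with whole shops held out: train on some shops, score on shops the model has never seen. Below is a hedged scikit-learn sketch of that idea using grouped cross-validation on synthetic stand-in data (the features and effect sizes are invented for illustration); the shop label is used only for grouping, not as a model input. In RapidMiner, splitting the validation set by shop should serve the same purpose.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GroupKFold, cross_val_score

rng = np.random.default_rng(0)
n = 400
# Synthetic stand-in for the example set: generic context features only.
shop_id = rng.integers(0, 8, n)          # grouping label, NOT a feature
X = np.column_stack([
    rng.integers(0, 2, n),               # HasLake
    rng.integers(0, 2, n),               # HasUniversity
    rng.normal(20, 8, n),                # temperature
    rng.uniform(0, 12, n),               # sunshine hours
])
# Invented ground truth: lake shops sell more on warm days.
y = 30 + 40 * X[:, 0] * (X[:, 2] > 22) + 15 * X[:, 1] + rng.normal(0, 5, n)

# Each CV fold holds out entire shops, so a good score indicates the
# model generalizes to shops it was never trained on.
model = GradientBoostingRegressor(random_state=0)
scores = cross_val_score(model, X, y, groups=shop_id, cv=GroupKFold(n_splits=4))
print("R^2 per held-out shop group:", scores.round(2))
```

If scores stay high only when the shop ID is included as a feature, that is the overfitting-to-specific-shops effect described above.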
I tried to get inspired by the models Auto Model created and to reproduce the results to a degree, but they are far too complex for me to properly understand what's happening and why certain parameters are tuned the way they are.
I figured that setting "Shop" to the cluster role and "quarter" or "week" to batch (I also tried it the other way round: shop as batch and the time period as cluster) should improve feature selection. Apparently not, since set roles and special attributes are purged by Auto Model. Is Deep Learning or GBT the wrong approach? Should I do something with "Forecast" instead, given the example set? I'm at a loss.
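On the black-box concern: a shallow decision tree is one white-box alternative whose fitted rules can be printed and discussed directly in a thesis. This is a minimal scikit-learn sketch on synthetic data invented in the spirit of the "grill party at lake" context (a sales spike when a lake shop meets a warm day), not a claim about the real data.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(1)
n = 300
# Invented signal: lake shops sell much more when it is warmer than ~24 C.
has_lake = rng.integers(0, 2, n)
temp = rng.normal(20, 8, n)
y = 20 + 50 * has_lake * (temp > 24) + rng.normal(0, 4, n)

X = np.column_stack([has_lake, temp])
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# The learned split rules are human-readable and citable as-is.
print(export_text(tree, feature_names=["HasLake", "temp_c"]))
```

A tree this small will usually be less accurate than GBT, but every prediction can be traced to an explicit if/else path, which is much easier to defend in writing.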
Could I ask you all to help me get off the starting line? Many, many thanks in advance!