I am interested in learning some of the best practices for database design for dealing with the copious amount of data which can come from a SCADA system for lab/research machinery where several instruments could be logged at a high frequency. I have been a part of a project where a pre-commercial demonstrator facility was built and used to test and research a particular technology and while we managed to develop procedures which worked, more or less, for coping with the amount of data we were collecting, it felt clunky and inelegant. When it came time to conduct a second-look audit/analysis of the data from the machine, I found myself needing to dynamically select databases and tables in my queries in the program which retrieved and analyzed the raw instrument data from the database and from some reading online, this need to dynamically determine the database and table to query is usually a symptom of a poor database design.
Most tutorials and information I can find on databases are for web apps or sales databases with user or sales data, while sometimes helpful (I'd imagine an online store the likes of Amazon could generate a lot of sales data in a short period of time), often feel too generic to be applicable to my context. I would like to know of any strategies or best practices for designing databases for the specific context of SCADA and instrument data.
For those interested, additional context of the pilot plant with the some details on my specific context are below:
I have a MS SQL Server which contains the raw instrument data from a pilot plant (additional context of the pilot plant and the data from it have been published here: https://borealisdata.ca/dataset.xhtml?persistentId=doi:10.5683/SP3/LZ56EG). The pilot plant had a locally hosted OPC server (TOP Server) for interacting with equipment and monitoring instrumentation through our custom built human-machine-interface (HMI) written with MATLAB on our main control PC. The control PC also locally hosted a MS SQL server which was populated with instrument data through TOP Server's DataLogger plug-in which would be configured to poll a set list of tags (in our case it would be around 50 unique tags) on the OPC server and the frequency to record them (once every second, 5 mins or whatever). It would write the name of the tag, the timestamp of when the value was retrieved, its value, and the tag quality. We would normally have the logger frequency to be once every second for periods of active testing/operating the machine and once every 5 minutes for periods where the machine was offline or on stand-by.
This data would be queried from the SQL Server through MATLAB for offline analysis of the instrument data collected during experiments. Our experiments eventually became standardized and automated and we recorded the start and end time of the experiment. Eventually, the time to execute a query for the data during an experiment became noticeable when the table with the data log was around 2 GB in size. The procedure was to make a new table and re-configure the data logger to write the instrument log to this new table. If there was ever a change to the set of tags that DataLogger would poll when producing the log, we would make a new database on the MS SQL Server and a new table so that it was readily apparent to us that a change had been made.
Tables with logged instrument data would have the columns:
- TableIndex (int, not null)
- TagTimestamp (datetime, not null)
- TagItemID (varchar(50), not null)
- TagValue (sql_variant, not null)
- TagQuality (int, not null)
This was the only data contained in the database as other aspects that, in retrospect, could have been other tables in the database, were managed with spreadsheets. We were a group of Mining Engineering grad students who did not have formal training on designing and managing databases and I imagine that there are plenty of reasons not to do it how we did it but it ended up working for our purposes. Feel free to tell me what some of those reasons would be.
One reason that became apparent to me after conducting a second-look analysis of the data was that the structure of the database required that I programmatically determine which table contains the log of a given experiment and this led to the SQL queries sent to the database not being able to be parameterized fully as the database and table were filled in with string formatting. Because this database can only be accessed locally, the likely-hood of a SQL injection attack is really low but it could become a problem if the SQL server would be made remotely accessible.
Any guidance on a "good" or better way to structure the database would be greatly appreciated. It is not likely that this database will be ever used again but I am curious for my own benefit to learn from this experience.
To summarize, I am looking for the following things:
- materials on strategies and best practices for database design for logging instrument data from a SCADA system
- strategies to cope with copious amounts of data which can come from an instrument data log
- specific problems with the structure of the database of the pilot plant
- potential remedies for restructuring the database
Edit: Fixed some typos and grammatical errors