Big Data for the SQL Eye
SQL Server is a great technology – I’ve been using it since 1993 when the user interface consisted of a query window with the options to save and execute and not much else. With every release there’s something new and exciting and there’s always something to learn about even the most familiar of features. However, not everyone uses SQL Server for every storage and compute opportunity – sad but true.
So what is a SQL geek to do in the face of all the new options out there, many of them under the umbrella of Big Data (distributed processing)? Why, just jump right in and learn it! No one can know all the pieces because it’s a big, fluid, messy collection of “things”. But don’t worry about that: start with one thing and build from there. Even if you never plan to implement a production Big Data system, you need to learn about it, because if you don’t have hands-on experience then someone who does will be influencing the decision makers without you. For a SQL pro I suggest Hive as the easy entry point. At some point Spark SQL may fill that gap, but for now Hive is the easiest way in for most SQL pros.
For more, I refer you to the talk I gave at the Pacific Northwest SQL Server User Group meeting on October 14, 2015. Excerpts are below; the file is attached.
Look, it’s SQL!
SELECT score, fun
FROM toDo
WHERE type = 'they pay me for this?';
Here’s how that code looks from Visual Studio along with the links to how you find the output and logs:
And yet it’s more!
CREATE EXTERNAL TABLE IF NOT EXISTS toDo
(fun STRING,
score INT COMMENT 'rank the greatness',
type STRING)
COMMENT 'two tables walk into a bar....'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/demo/';
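To check what that statement registered, Hive's DESCRIBE FORMATTED lists the columns, the table type, and the storage location. This is a minimal sketch that assumes the toDo table was created as shown above; the exact layout of the output varies by Hive version.

```sql
-- List the columns with their comments, plus the table metadata.
-- For an external table, the Table Type row should read EXTERNAL_TABLE
-- and the Location row should point at the /data/demo/ directory.
DESCRIBE FORMATTED toDo;

-- Dropping an external table removes only the metadata;
-- the files under LOCATION are left in place.
-- DROP TABLE toDo;
```

That last point is the practical difference from a managed table: the data outlives the table definition, so other tools can keep reading the same files.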
A mix of old and new
-- read some data
SELECT 'you cannot make me ', score, fun, type
FROM toDo
WHERE score <= 0
ORDER BY score;
SELECT 'when can we ', score, fun, type
FROM toDo
WHERE score > 0
DISTRIBUTE BY score SORT BY score;
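The two queries above sort differently. ORDER BY produces one total ordering, which Hive achieves by sending all rows through a single reducer. DISTRIBUTE BY with SORT BY sends rows with the same score to the same reducer and sorts only within each reducer. When both clauses name the same column, as they do here, Hive offers CLUSTER BY as a shorthand. A sketch, assuming the same toDo table and score column used in the queries above:

```sql
-- Equivalent to DISTRIBUTE BY score SORT BY score:
-- rows with the same score go to the same reducer,
-- and each reducer's output is sorted by score (ascending only).
SELECT 'when can we ', score, fun, type
FROM toDo
WHERE score > 0
CLUSTER BY score;
```

Keep the longer DISTRIBUTE BY ... SORT BY form when you want to distribute on one column and sort on another, or when you need a descending sort.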
That’s Hive, folks!
Hive
on Hadoop
on HDInsight
on Azure
Big Data in the cloud!
Hadoop Shines When….
(refer to https://blogs.msdn.com/b/cindygross/archive/2015/02/25/master-choosing-the-right-project-for-hadoop.aspx)
Data exploration, analytics and reporting, new data-driven actionable insights
Rapid iterating
Unknown unknowns
Flexible scaling
Data-driven actions for early competitive advantage or first to market
Low number of direct, concurrent users
Low cost data archival
Hadoop Anti-Patterns….
Replacing a system whose pain points don’t align with Hadoop’s strengths
OLTP needs adequately met by an existing system
Known data with a static schema
Many end users
Interactive response time requirements (becoming less true)
Your first Hadoop project + mission critical system
Azure has so much more
Go straight to the business code
Scale storage and compute separately
Open Source
Linux
Managed and unmanaged services
Hybrid
On-demand and 24x7 options
SQL Server
It’s a Polyglot
Stream your data into a lake
Pick the best compute for each task
And it’s Fun!
I hope you enjoyed this small bite of big data!