Azure Data Lake Analytics and U-SQL Spring 2018 Updates: Parquet support, small files, dynamic output, fast file sets, and much more!
Hello Azure Data Lake and U-SQL fans and followers.
It is high time for the release notes covering all the cool features we released over the winter, along with a list of the pending deprecation items and breaking changes. There was so much cool new stuff that writing the release notes took me several weeks (on top of my day job!), so the next release will probably already be out by the time you read this. I promise that the June release notes will come sooner! :)
Without further ado, here are the Spring 2018 Updates for Azure Data Lake U-SQL and Developer Tooling!
Supporting data formats of your choice at high scale
The top items include expanding our built-in support for standard file formats with native Parquet support for extractors and outputters (in public preview) and ORC (in private preview)!
In addition, now that the fast file set feature has been generally released, a single EXTRACT statement can consume hundreds of thousands of files in bulk. We will publish a blog post at a later date with much more detail on how this capability helps you process that many files efficiently and at scale.
Important aspects of processing files at scale include:
- the ability to generate many files from a rowset in a single statement, providing a way to dynamically partition the data for future use with Hadoop or Spark, or to provide individual files for customers. This has been our top customer ask on the ADL Feedback forum, and now it is in private preview!
- the ability to handle many small files. We recommend that you make your files large enough for the processing to be efficient (300MB to 4GB is a good range), but often your file formats (e.g., images) or data ingestion pipelines (e.g., EventHub archives) are not able to reach that size. Thus, we are adding the ability to group several files into a single vertex to increase the efficiency and lower the cost of your job (we have seen a 10- to 30-fold improvement in some customer jobs!).
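To get a feel for why grouping small files into a vertex helps, here is a rough conceptual sketch in Python. It is not how Azure Data Lake Analytics implements the feature: the greedy packing strategy, the `group_files` helper, the file names, and the 300MB budget (borrowed from the recommended file size range above) are all assumptions made for illustration.

```python
from typing import List, Tuple

MB = 1024 * 1024


def group_files(files: List[Tuple[str, int]], target_bytes: int) -> List[List[str]]:
    """Greedily pack (name, size_in_bytes) pairs into groups of at most target_bytes.

    A file that is larger than the budget ends up in a group of its own.
    Each group stands in for the input of one vertex (hypothetical model).
    """
    groups: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in files:
        # Close the current group if adding this file would exceed the budget.
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return groups


# 1,000 hypothetical image files of 2MB each.
files = [(f"img_{i:04d}.jpg", 2 * MB) for i in range(1000)]
groups = group_files(files, target_bytes=300 * MB)
print(len(files), "files ->", len(groups), "groups")  # 1000 files -> 7 groups
```

In this toy model, one unit of work per file would mean 1,000 units, while grouping brings that down to 7. The real savings depend on the job and on how the service actually forms its groups.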
You can find a great end-to-end example using several of these important capabilities together in our new Azure blog post announcing the new release and the accompanying detailed walk-through blog post.
Other cool stuff
There is so much to discover that I encourage you all to read through the release notes and use the samples, which we have also published as a Visual Studio solution on our U-SQL GitHub site.
Here are a few additional highlights:
- Use the new AU modeler to optimize your job's cost/performance trade-off!
- Extract from files that use the Windows code pages!
- Augment your script with job information through `@@JobInfo`!
- Light-weight, self-contained script development with in-script C# named lambdas and script-scoped U-SQL objects!
Thanks to all of you who continue to volunteer to test the preview features and provide us valuable feedback. We are looking forward to seeing you use all the new cool stuff.
Make sure that you update your scripts that are affected by the future deprecations and breaking changes!
Please contact us or leave a comment below if you have feedback on this or other features.
Here is the list of topics with links to the detailed release notes:
- Pending and Upcoming Deprecations and Breaking Changes
  - U-SQL jobs will introduce an upper limit for the number of table-backing files being read
  - Built-in extractors will change mapping of empty fields from zero-length string to null with quoting enabled
  - Disallow `DECLARE EXTERNAL` inside packages, procedures and functions
  - Removal of undocumented use of `CLUSTER` | `CLUSTERED BY` in `CREATE INDEX` and `CREATE TABLE`
  - Calling static members of types in a `SELECT` expression that also references a column having the same name as the type will need to fully qualify the type name or rename the column
  - Disallow U-SQL identifiers in C# delegate bodies in scripts
  - Strengthening of Read after DML check
- Breaking Changes
- Major U-SQL Bug Fixes, Performance and Scale Improvements
- U-SQL Preview Features
  - Input File Set uses fewer resources when operating on many small files (Public Preview)
  - Built-in Parquet Extractor and Outputter (Public Preview)
  - Automatic GZip compression on `OUTPUT` statement (Public Preview)
  - Built-in ORC Extractor and Outputter (Private Preview)
  - Data-driven Output Partitioning with `OUTPUT` fileset (Private Preview)
  - A limited flexible-schema feature for U-SQL table-valued function parameters (Private Preview)
- New U-SQL capabilities
  - U-SQL adds job information system variable `@@JOBINFO`
  - U-SQL adds support for computed file property columns on `EXTRACT`
  - DiagnosticStream support in .NET user code
  - Built-in Text/Csv/Tsv Extractors and Outputters support ANSI/Windows 8-bit codepage encodings
  - U-SQL supports ANSI SQL `CASE` expression
  - U-SQL adds C# `Func`-typed variables in `DECLARE` statements (named lambdas)
  - U-SQL adds temporary, script-bound metadata objects with `DECLARE` statements
  - The `ORDER BY FETCH` clause can be used with all query expressions
  - The `EXTRACT` expression's schema can be specified with a table type
  - The `EXTRACT`, `REDUCE` and `COMBINE` expressions now support a `SORTED BY` assertion
  - The `REQUIRED` clause for UDO invocations now allows `NONE`
  - The `EXTRACT` expressions now support the `REQUIRED` clause to enable column pruning in user-defined extractors
  - U-SQL adds compile-time user errors and warnings
- U-SQL Cognitive Library additions
- Azure Data Lake Tools for Visual Studio New Capabilities
  - ADL Tools for Visual Studio provides an improved Analytics Unit modeler to help improve a job's performance and cost
  - Job Submission's simple interface now makes it easier to change the allocated AUs
  - The stage tool tip is simplified and makes it easier to find the Vertex Operator View
  - Improved visualization of the job execution graph inside a vertex
  - The job stage graph and job execution graph now indicate whether a stage contains user-defined operators and what language they have been authored in
  - New "Data" tab for enumerating all input and output data
  - Job View includes a link to the diagnostic folder
  - U-SQL compilation errors are now shown in the "Error List" window
  - U-SQL Project supports MSBuild
- Azure Portal Updates