Question
Saturday, January 28, 2017 10:27 PM
I've got a wonderful fixed-width format text file. It loads fine with sc.textFile and I get a nice RDD of strings. I have no trouble running the HVAC example: https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-jupyter-spark-sql
I just can't figure out how to parse a fixed-width file using the column specs.
I tried pandas (in spark) and got the following:
import pandas as pd

path = 'wasbs:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv'

# using pandas with a column specification
col_specification = [(0, 2), (3, 6)]
data = pd.read_fwf(path, colspecs=col_specification)

# gives this error:
# [Errno 2] No such file or directory: 'wasbs:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv'
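For what it's worth, read_fwf itself seems fine with these colspecs once the file is on local storage; pandas simply has no handler for the wasbs:// scheme. For illustration only, against a hypothetical local copy:

import pandas as pd

# '/tmp/HVAC.csv' is a hypothetical local copy of the same file
data = pd.read_fwf('/tmp/HVAC.csv', colspecs=[(0, 2), (3, 6)])
print(data.head())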
Any thoughts? (Other than don't use fixed width files)
All replies (7)
Monday, January 30, 2017 9:56 PM ✅Answered
Some kind gentleman on Stack Overflow resolved this. Interestingly (I think), the first line of his code read
df = spark.read.text("blah:text.txt")
I need to educate myself about contexts. To make it work I had to use:
df = sqlContext.read.text("wasbs:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")
df.select(
df.value.substr(1,1).cast('integer').alias('id'),
df.value.substr(4,8).alias('date'),
df.value.substr(12,3).alias('string')
).show()
+----+--------+------+
|  id|    date|string|
+----+--------+------+
|null|e,Time,T|   arg|
|   6|/13,0:00|   :01|
|   6|/13,1:00|   :01|
|   6|/13,2:00|   :01|
|   6|/13,3:00|   :01|
+----+--------+------+
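To generalize, the same substr calls can be driven from a column-spec list like the one in my original question. A rough sketch (the names, 1-based starts, and lengths below are my guesses, not the sample's documented layout):

# hypothetical spec: (name, start, length), with 1-based starts as substr expects
col_specification = [('id', 1, 1), ('date', 4, 8), ('string', 12, 3)]

df = sqlContext.read.text("wasbs:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")

# build one substr expression per entry and select them all at once
fields = [df.value.substr(start, length).alias(name)
          for (name, start, length) in col_specification]
df.select(*fields).show()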
Sunday, January 29, 2017 7:00 PM
Hello,
We are checking on the query and will get back to you soon on this.
I apologize for the inconvenience and appreciate your time and patience in this matter.
Regards,
Pradeep
Monday, January 30, 2017 3:50 PM
I'm trying a few things on my end with pyspark substring, but I'm having trouble with the context as set up in the HVAC example. Simple code and errors below.
pyspark.sql.functions.substring(str, pos, len)
df = sqlContext.createDataFrame([('abcd',)], ['s',])
dfp = df.select(sqlContext.substring(df.s, 1, 2).alias('s')).collect()
# error: 'HiveContext' object has no attribute 'substring'

df = spark.createDataFrame([('abcd',)], ['s',])
# error: name 'spark' is not defined

df = sc.createDataFrame([('abcd',)], ['s',])
# error: 'SparkContext' object has no attribute 'createDataFrame'
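For the record, substring is a function in pyspark.sql.functions rather than a method on HiveContext, and the spark session object only exists in Spark 2.x. A minimal sketch of the import-based call that should work here:

from pyspark.sql.functions import substring

df = sqlContext.createDataFrame([('abcd',)], ['s'])
# substring(str, pos, len) with a 1-based pos, as in the docs quoted above
dfp = df.select(substring(df.s, 1, 2).alias('s')).collect()
# expected: [Row(s='ab')]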
Monday, January 30, 2017 5:46 PM
Hello,
You might want to parse it in Spark instead of pandas.
Please refer to the links given below:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=regex#pyspark.sql.functions.regexp_extract
or
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=substring#pyspark.sql.functions.substring
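For example, a rough sketch of regexp_extract against the sample's value column (the capture-group widths below mirror the substr(1,1), substr(4,8), substr(12,3) positions used elsewhere in this thread and are assumptions, not the file's documented layout):

from pyspark.sql.functions import regexp_extract

df = sqlContext.read.text("wasbs:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")

# one capture group per fixed-width field; non-matching (short) lines yield ''
pattern = r'^(.).{2}(.{8})(.{3})'
df.select(
    regexp_extract(df.value, pattern, 1).alias('id'),
    regexp_extract(df.value, pattern, 2).alias('date'),
    regexp_extract(df.value, pattern, 3).alias('string')
).show()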
Regards,
Pradeep
Monday, January 30, 2017 7:05 PM
Thanks Pradeep - I tried the second link, to the substring documentation. I put my errors in the reply above. I agree that I'll want to parse it in Spark ultimately; I'll be using files in the hundreds-of-GB range.
Monday, January 30, 2017 7:17 PM
Hi,
What is the version of Spark and HDI?
Regards,
Pradeep
Monday, January 30, 2017 8:24 PM
Spark on Linux (HDI 3.4.1000.0)