Fixed width format files: parsing in pyspark

Question

Saturday, January 28, 2017 10:27 PM

I've got a wonderful fixed-width format text file. It loads fine with sc.textFile and I get a nice RDD of strings. I have no trouble running the HVAC example: https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-jupyter-spark-sql

I just can't figure out how to parse a fixed-width file using the column specs.

I tried pandas (in Spark) and got the following:

import pandas as pd

path = 'wasbs:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv'

# using pandas with a column specification

col_specification = [(0, 2), (3, 6)]

data = pd.read_fwf(path, colspecs=col_specification)

# gives this error: [Errno 2] No such file or directory: 'wasbs:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv'
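
Presumably the column specification itself is fine and the problem is the path: pandas reads from the local filesystem and cannot open a wasbs:// URI. A minimal sketch, assuming a hypothetical local copy of the file:

import pandas as pd

# hypothetical local path; pandas cannot open wasbs:// URIs,
# so the file would have to be copied to the local filesystem first
col_specification = [(0, 2), (3, 6)]
data = pd.read_fwf('/tmp/HVAC.csv', colspecs=col_specification, header=None)
print(data.head())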

Any thoughts? (Other than don't use fixed-width files.)

All replies (7)

Monday, January 30, 2017 9:56 PM ✅Answered

A kind gentleman on Stack Overflow resolved it. Interestingly (I think), the first line of his code read

df = spark.read.text("blah:text.txt")

I need to educate myself about contexts. To make it work I had to use sqlContext.read.text instead:

df = sqlContext.read.text("wasbs:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")
df.select(
    df.value.substr(1,1).cast('integer').alias('id'),
    df.value.substr(4,8).alias('date'),
    df.value.substr(12,3).alias('string')
).show()

+----+--------+------+
|  id|    date|string|
+----+--------+------+
|null|e,Time,T|   arg|
|   6|/13,0:00|   :01|
|   6|/13,1:00|   :01|
|   6|/13,2:00|   :01|
|   6|/13,3:00|   :01|
+----+--------+------+
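
The same idea generalises to a list of column specs. A sketch against the same df, using hypothetical (name, start, length) tuples (start positions are 1-based, as Column.substr expects):

# hypothetical column specs: (name, start, length), 1-based start
col_specs = [('id', 1, 2), ('date', 4, 8), ('string', 12, 3)]

parsed = df.select(
    *[df.value.substr(start, length).alias(name) for name, start, length in col_specs]
)
parsed.show()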


Sunday, January 29, 2017 7:00 PM

Hello,

We are checking on the query and will get back to you soon.
I apologize for the inconvenience and appreciate your time and patience in this matter.

Regards,
Pradeep


Monday, January 30, 2017 3:50 PM

I'm trying a few things on my end with pyspark.sql.functions.substring, but I'm having trouble with the context as set up in the HVAC example.

Simple code and errors below:

pyspark.sql.functions.substring(str, pos, len)

df = sqlContext.createDataFrame([('abcd',)], ['s',])
dfp = df.select(sqlContext.substring(df.s, 1, 2).alias('s')).collect()

'HiveContext' object has no attribute 'substring'

df = spark.createDataFrame([('abcd',)], ['s',])

name 'spark' is not defined

df = sc.createDataFrame([('abcd',)], ['s',])

'SparkContext' object has no attribute 'createDataFrame'
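
Looking at the docs, substring lives in pyspark.sql.functions rather than on the context, and the spark entry point only exists in Spark 2.x. A sketch of what might work here, assuming the notebook's sqlContext:

from pyspark.sql import functions as F

# substring comes from pyspark.sql.functions, not from the HiveContext/SQLContext
df = sqlContext.createDataFrame([('abcd',)], ['s'])
df.select(F.substring(df.s, 1, 2).alias('s')).show()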


Monday, January 30, 2017 5:46 PM

Hello,

You might want to parse it in Spark instead of Pandas.

Please refer to the links given below:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=regex#pyspark.sql.functions.regexp_extract

or

http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=substring#pyspark.sql.functions.substring
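
A sketch of the regexp_extract route, assuming the single-column DataFrame produced by sqlContext.read.text; the pattern below is illustrative and not taken from the actual HVAC file layout:

from pyspark.sql import functions as F

# illustrative pattern: group 1 = first 2 characters, group 2 = the 8 characters after a 1-character separator
df = sqlContext.read.text("wasbs:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")
df.select(
    F.regexp_extract(df.value, r'^(.{2}).(.{8})', 1).alias('first'),
    F.regexp_extract(df.value, r'^(.{2}).(.{8})', 2).alias('second')
).show()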

Regards,
Pradeep


Monday, January 30, 2017 7:05 PM

Thanks Pradeep - I tried the second link, to the substring documentation; my errors are in the reply above. I agree that I'll want to parse it in Spark ultimately, since I'll be using files in the hundreds-of-GB range.


Monday, January 30, 2017 7:17 PM

Hi,

What is the version of Spark and HDI?

Regards,
Pradeep


Monday, January 30, 2017 8:24 PM

Spark on Linux (HDI 3.4.1000.0)