Coding Zone @jobssss - Tumblr Blog

MySQL: check whether a string contains two words

SELECT 'acxiom contains [1248]' REGEXP '^.*acxiom.*1248.*$';
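Note that the pattern above only matches when acxiom appears before 1248. The following is a minimal Python sketch of the same idea, not MySQL code: Python's re module stands in for REGEXP, and the helper names contains_in_order and contains_both are made up for illustration.

```python
import re


def contains_in_order(text: str, first: str, second: str) -> bool:
    """Rough equivalent of '^.*first.*second.*$': `first` must precede `second`."""
    pattern = re.escape(first) + ".*" + re.escape(second)
    return re.search(pattern, text) is not None


def contains_both(text: str, first: str, second: str) -> bool:
    """Order-independent check: each word may appear anywhere in the text."""
    return first in text and second in text


if __name__ == "__main__":
    print(contains_in_order("acxiom contains [1248]", "acxiom", "1248"))
    print(contains_in_order("[1248] contains acxiom", "acxiom", "1248"))
    print(contains_both("[1248] contains acxiom", "acxiom", "1248"))
```

In MySQL itself, an order-independent check can be written as two conditions, e.g. col LIKE '%acxiom%' AND col LIKE '%1248%'.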

To cover the steps that follow after kicking off the query:

INSERT OVERWRITE LOCAL DIRECTORY '/home/lvermeer/temp' ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' select books from table;

In my case, the generated data under the temp folder is in deflate format, and it looks like this:

$ ls
000000_0.deflate 000001_0.deflate 000002_0.deflate 000003_0.deflate 000004_0.deflate 000005_0.deflate 000006_0.deflate 000007_0.deflate

Here's the command to unzip the deflate files and put everything into one csv file:

hadoop fs -text "file:///home/lvermeer/temp/*" > /home/lvermeer/result.csv
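If the hadoop client is not available on the machine holding the files, the same merge can be sketched in Python. This is only a sketch under one assumption: that each .deflate part is a single zlib-format stream, which as far as I know is what Hive's default codec writes. The function name merge_deflate_parts is hypothetical.

```python
import glob
import os
import zlib


def merge_deflate_parts(src_dir: str, dest_csv: str) -> int:
    """Decompress every *.deflate part in src_dir into one CSV file.

    Returns the number of part files that were merged.
    """
    parts = sorted(glob.glob(os.path.join(src_dir, "*.deflate")))
    with open(dest_csv, "wb") as out:
        for part in parts:
            with open(part, "rb") as f:
                # Assumption: each part is one complete zlib stream.
                out.write(zlib.decompress(f.read()))
    return len(parts)
```

Each part is read fully into memory here, so for very large outputs the hadoop fs -text command above remains the simpler choice.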

#hadoop #hive

I have a table as follows:

user_id  email
u1       e1, e2
u2       null

My goal is to convert this into the following format:

user_id  email
u1       e1
u1       e2
u2       null

So for this I am usin...

Difference between lateral view explode vs lateral view outer explode
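In short: LATERAL VIEW explode drops rows whose array is NULL or empty, while LATERAL VIEW OUTER explode keeps them and emits NULL for the exploded column. The plain-Python sketch below only simulates that behaviour on in-memory rows; the function name and row layout are made up for illustration, and in Hive the email string would first have to be turned into an array (for example with split()).

```python
from typing import Iterator, List, Optional, Tuple

Row = Tuple[str, Optional[List[str]]]


def lateral_view_explode(rows: List[Row], outer: bool = False) -> Iterator[Tuple[str, Optional[str]]]:
    """Simulate LATERAL VIEW [OUTER] explode over (user_id, emails) rows."""
    for user_id, emails in rows:
        if emails:
            # One output row per array element.
            for email in emails:
                yield user_id, email
        elif outer:
            # OUTER keeps rows with a NULL or empty array, with NULL as the value.
            yield user_id, None


if __name__ == "__main__":
    table = [("u1", ["e1", "e2"]), ("u2", None)]
    print(list(lateral_view_explode(table)))
    # [('u1', 'e1'), ('u1', 'e2')]
    print(list(lateral_view_explode(table, outer=True)))
    # [('u1', 'e1'), ('u1', 'e2'), ('u2', None)]
```

The second call reproduces the target format from the question, which is why the OUTER variant is the one needed there.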

#hive

Coding Zone turned 5 today!

#tumblr birthday #tumblr milestone

:: prepends a single item whereas ::: prepends a complete list. So, if you put a List in front of :: it is taken as one item, which results in a nested structure.

#scala

I have a column of json strings and would like to be able to convert them to structs, similar to how SQLContext.read.json() will make that transformation on initial read from the file. Alternati...

How to transform a column of json strings to structs

....

NOT resolved

#spark #json

ERROR ... mkdirs failed for /hive, error 30 mkdirs failed for /hive, error 30

With Spark 2, we have seen this occur when users leave out a method call that is new relative to Spark 1.6 from their SparkSession initialization. See further details in the Spark documentation.

The resolution is to include a call to enableHiveSupport():

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Python Spark SQL Hive integration example").enableHiveSupport().getOrCreate()

#spark #hive

I want to create a Hive table using my Spark dataframe's schema. How can I do that? For fixed columns, I can use:

val CreateTable_query = "Create Table my_table(a string, b string, c double)"

Assuming you are using Spark 2.1.0 or later and my_DF is your dataframe:

import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Build a comma-separated string of "name type" pairs from the dataframe schema.
// Note: typeName() returns Spark's type names (e.g. "long"); where these differ
// from Hive's names (e.g. "bigint"), dataType().simpleString() may be needed instead.
StructType my_schema = my_DF.schema();
StructField[] fields = my_schema.fields();
String fieldStr = "";
for (StructField f : fields) {
    fieldStr += f.name() + " " + f.dataType().typeName() + ",";
}

// Drop the table if it was already created.
spark.sql("drop table if exists my_table");

// Create the table using the dataframe schema (the trailing comma is stripped).
spark.sql("create table my_table(" + fieldStr.substring(0, fieldStr.length() - 1)
    + ") row format delimited fields terminated by '|' location '/my/hdfs/location'");

// Write the dataframe data to the HDFS location of the created Hive table.
my_DF.write()
    .format("com.databricks.spark.csv")
    .option("delimiter", "|")
    .mode("overwrite")
    .save("/my/hdfs/location");

The other method, using a temporary view:

my_DF.createOrReplaceTempView("my_temp_table");
spark.sql("drop table if exists my_table");
spark.sql("create table my_table as select * from my_temp_table");

#spark #scala #dataframe

How to extract the column name and data type from nested struct type in spark schema getting like this: (events,StructType( StructField(beaconType,StringType,true), StructField(beaconV...

Question is somewhat unclear, but if you're looking for a way to "flatten" a DataFrame schema (i.e. get an array of all non-struct fields), here's one:

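import org.apache.spark.sql.types.{StructField, StructType}

// Recursively collect every non-struct field, descending into nested structs.
def flatten(schema: StructType): Array[StructField] = schema.fields.flatMap { f =>
  f.dataType match {
    case struct: StructType => flatten(struct)
    case _ => Array(f)
  }
}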

For example:

import org.apache.spark.sql.types._

val schema = StructType(Seq(StructField("events", StructType(Seq(
  StructField("beaconVersion", IntegerType, true),
  StructField("client", StringType, true),
  StructField("data", StructType(Seq(
    StructField("ad", StructType(Seq(
      StructField("adId", StringType, true)
    )))
  )))
)))))

println(flatten(schema).toList)
// List(StructField(beaconVersion,IntegerType,true), StructField(client,StringType,true), StructField(adId,StringType,true))
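The flattened result above loses the parent names (events, data, ad), which the question seems to care about. As a rough sketch of a variant that keeps the full dotted path, here is the same recursion in plain Python. It assumes the schema is available as the nested dictionary that PySpark's StructType.jsonValue() produces; the function name flatten and the dotted-path output format are choices made for this illustration.

```python
from typing import Any, Dict, List, Tuple


def flatten(schema: Dict[str, Any], prefix: str = "") -> List[Tuple[str, str]]:
    """Return (dotted column path, type name) for every non-struct field."""
    leaves: List[Tuple[str, str]] = []
    for field in schema["fields"]:
        name = prefix + field["name"]
        dtype = field["type"]
        if isinstance(dtype, dict) and dtype.get("type") == "struct":
            # Nested struct: recurse and extend the path with this field's name.
            leaves.extend(flatten(dtype, name + "."))
        else:
            # Simple types are plain strings; other complex types are dicts
            # whose "type" key holds the name (e.g. "array").
            leaves.append((name, dtype if isinstance(dtype, str) else dtype["type"]))
    return leaves


def struct(*fields: Dict[str, Any]) -> Dict[str, Any]:
    """Helper for this example only: build a struct-type dictionary."""
    return {"type": "struct", "fields": list(fields)}


def field(name: str, dtype: Any) -> Dict[str, Any]:
    """Helper for this example only: build a field dictionary."""
    return {"name": name, "type": dtype, "nullable": True, "metadata": {}}


if __name__ == "__main__":
    schema = struct(field("events", struct(
        field("beaconVersion", "integer"),
        field("client", "string"),
        field("data", struct(field("ad", struct(field("adId", "string"))))),
    )))
    print(flatten(schema))
    # [('events.beaconVersion', 'integer'), ('events.client', 'string'), ('events.data.ad.adId', 'string')]
```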

#scala #spark #dataframe

Playing with JSON in Spark

https://databricks.com/blog/2015/02/02/an-introduction-to-json-support-in-spark-sql.html

// Create a SQLContext (sc is an existing SparkContext)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Suppose that you have a text file called people with the following content:
// {"name":"Yin", "address":{"city":"Columbus","state":"Ohio"}}
// {"name":"Michael", "address":{"city":null, "state":"California"}}

// Create a SchemaRDD for the JSON dataset.
val people = sqlContext.jsonFile("[the path to file people]")

// Register the created SchemaRDD as a temporary table.
people.registerTempTable("people")

scala> val people = sqlContext.jsonFile("test/person.json")
warning: there was one deprecation warning; re-run with -deprecation for details
people: org.apache.spark.sql.DataFrame = [address: struct<city: string, state: string>, name: string]

scala> people.printSchema()
root
 |-- address: struct (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- state: string (nullable = true)
 |-- name: string (nullable = true)

scala> val comp = sqlContext.jsonFile("test/compromised_account.json")
comp: org.apache.spark.sql.DataFrame = [date: string, email: string ... 1 more field]

scala> 2017-08-11 16:54:47,690:14857(0x7ff5e576c700):ZOO_WARN@zookeeper_interest@1570: Exceeded deadline by 13ms
comp.printSchema()
root
 |-- date: string (nullable = true)
 |-- email: string (nullable = true)
 |-- reset: boolean (nullable = true)

#spark

Apache Spark's classpath is built dynamically (to accommodate per-application user code) which makes it vulnerable to such issues. @user7337271's answer is correct, but there are some more concerns, depending on the cluster manager ("master") you're using.

First, a Spark application consists of these components (each one is a separate JVM, therefore potentially contains different classes in its classpath):

Driver: that's your application creating a SparkSession (or SparkContext) and connecting to a cluster manager to perform the actual work

Cluster Manager: serves as an "entry point" to the cluster, in charge of allocating executors for each application. There are several different types supported in Spark: standalone, YARN and Mesos, which we'll describe below.

Executors: these are the processes on the cluster nodes, performing the actual work (running Spark tasks)

The relationship between these is described in the diagram in Apache Spark's cluster mode overview.

Now, which classes should reside in each of these components?

The code falls into three groups. Let's go through them slowly:

Spark Code: these are Spark's libraries. They should exist in ALL three components, as they include the glue that lets Spark perform the communication between them. By the way, Spark's authors made a design decision to include code for ALL components in ALL components (e.g. to include code that should only run in an Executor in the driver too) to simplify this, so Spark's "fat jar" (in versions up to 1.6) or "archive" (in 2.0, details below) contains the necessary code for all components and should be available in all of them.

Driver-Only Code: this is user code that does not include anything that should be used on Executors, i.e. code that isn't used in any transformations on the RDD / DataFrame / Dataset. This does not necessarily have to be separated from the distributed user code, but it can be.

Distributed Code: this is user code that is compiled with the driver code, but also has to be executed on the Executors. Everything the actual transformations use must be included in these jar(s).

Now that we got that straight, how do we get the classes to load correctly in each component, and what rules should they follow?

1. Spark Code: as previous answers state, you must use the same Scala and Spark versions in all components.

1.1 In Standalone mode, there's a "pre-existing" Spark installation to which applications (drivers) can connect. That means that all drivers must use the same Spark version that is running on the master and executors.

1.2 In YARN / Mesos, each application can use a different Spark version, but all components of the same application must use the same one. That means that if you used version X to compile and package your driver application, you should provide the same version when starting the SparkSession (e.g. via the spark.yarn.archive or spark.yarn.jars parameters when using YARN). The jars / archive you provide should include all Spark dependencies (including transitive dependencies), and it will be shipped by the cluster manager to each executor when the application starts.

2. Driver Code: that's entirely up to you. Driver code can be shipped as a bunch of jars or a "fat jar", as long as it includes all Spark dependencies + all user code.

3. Distributed Code: in addition to being present on the Driver, this code must be shipped to executors (again, along with all of its transitive dependencies). This is done using the spark.jars parameter.

To summarize, here's a suggested approach to building and deploying a Spark Application (in this case - using YARN):

1. Create a library with your distributed code, package it both as a "regular" jar (with a .pom file describing its dependencies) and as a "fat jar" (with all of its transitive dependencies included).

2. Create a driver application, with compile-dependencies on your distributed code library and on Apache Spark (with a specific version).

3. Package the driver application into a fat jar to be deployed to the driver.

4. Pass the right version of your distributed code as the value of the spark.jars parameter when starting the SparkSession.

5. Pass the location of an archive file (e.g. gzip) containing all the jars under the lib/ folder of the downloaded Spark binaries as the value of spark.yarn.archive.
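As a rough illustration of the last two steps, the sketch below assembles a spark-submit command line that sets both parameters through --conf. The answer itself sets them when starting the SparkSession, so passing them on the command line is a swapped-in variant. Every path, jar name and class name here is hypothetical, and build_spark_submit is a helper invented for this example.

```python
from typing import List


def build_spark_submit(app_jar: str, main_class: str,
                       distributed_jars: List[str], spark_archive: str) -> List[str]:
    """Assemble spark-submit arguments for a YARN deployment."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--class", main_class,
        # Distributed code (and its dependencies), shipped to the executors.
        "--conf", "spark.jars=" + ",".join(distributed_jars),
        # Archive holding the Spark jars that match the compile-time version.
        "--conf", "spark.yarn.archive=" + spark_archive,
        # Fat jar with the driver application.
        app_jar,
    ]


if __name__ == "__main__":
    cmd = build_spark_submit(
        app_jar="driver-app-fat.jar",                   # hypothetical
        main_class="com.example.Main",                  # hypothetical
        distributed_jars=["hdfs:///libs/dist-code-fat.jar"],  # hypothetical
        spark_archive="hdfs:///spark/spark-libs.tgz",   # hypothetical
    )
    print(" ".join(cmd))
```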

#spark

The recommended technique to convert between Java and Scala collections (since Scala 2.8.1) is via scala.collection.JavaConverters. This gives you more control than JavaConversions and avoids some possible implicit conflicts.

You won't have a direct implicit conversion this way. Instead, you get asScala and asJava methods pimped onto collections, allowing you to perform the conversions explicitly.

To convert a Java iterator to a Scala iterator:

javaIterator.asScala

To convert a Java iterator to a Scala List (via the Scala iterator):

javaIterator.asScala.toList

You may also want to consider converting toSeq instead of toList. In the case of iterators, this'll return a Stream, allowing you to retain the lazy behaviour of iterators within the richer Seq interface.

EDIT: There's no toVector method, but (as Daniel pointed out) there's a toIndexedSeq method that will return a Vector as the default IndexedSeq subclass (just as List is the default Seq).

javaIterator.asScala.toIndexedSeq

#scala

How to check the Linux operating system version

-bash-4.2$ cat /etc/*rele*
NAME="Red Hat Enterprise Linux Server"
VERSION="7.3 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="7.3"
PRETTY_NAME="Red Hat Enterprise Linux Server 7.3 (Maipo)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:7.3:GA:server"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
REDHAT_BUGZILLA_PRODUCT_VERSION=7.3
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="7.3"
Red Hat Enterprise Linux Server release 7.3 (Maipo)
Red Hat Enterprise Linux Server release 7.3 (Maipo)
cpe:/o:redhat:enterprise_linux:7.3:ga:server
IMAGE_RELEASE="RC9"
MORE_INFO="https://confluence.walmart.com/display/PGPIAAS/Virtual+Machine+Images"
IMAGE_BUILD_DATE="Mon Feb 27 23:12:25 UTC 2017"
IMAGE_ID="5111ba8b-fe92-42d9-b5bd-2b12522df6e5"
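When the version is needed inside a script rather than on screen, the KEY=VALUE lines of that output can be parsed. The following is a minimal sketch, assuming shell-style quoting as in the output above; parse_os_release is a made-up helper name, and lines without an = sign (such as the plain "release 7.3" lines) are simply skipped.

```python
import shlex
from typing import Dict


def parse_os_release(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines as found in /etc/os-release style files."""
    info: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        # shlex strips the surrounding quotes, if any.
        parts = shlex.split(raw)
        info[key] = parts[0] if parts else ""
    return info


if __name__ == "__main__":
    sample = 'NAME="Red Hat Enterprise Linux Server"\nVERSION_ID="7.3"\n'
    release = parse_os_release(sample)
    print(release["NAME"], release["VERSION_ID"])
```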

#bash #linux

How to show Hive table size in GB?

1. Find out the path of the Hive tables.

For example, to find the path for table r_scan1:

hive> describe formatted r_scan1;

=> Location: maprfs:/hive/username.db/r_scan1

From this you know the default path is "maprfs:/hive/username.db/".

2. Run the following command:

$ hadoop fs -du /hive/username.db/ | awk '/^[0-9]+/ { print int($1/(1024**3)) " [GB]\t" $2 }'

0 [GB] /hive/username.db/cst_fl_ga_tn_return_info
0 [GB] /hive/username.db/cyberfendrequest
0 [GB] /hive/username.db/cyberfendres
11 [GB] /hive/username.db/cyberfendresolution

Update: the awk pipeline above is deprecated. Use the following instead:

$ hadoop fs -du -h /hive/username.db/

It will show something like:

51.3 M  /hive/username.db/metrics
129     /hive/username.db/r_scan1
129     /hive/username.db/r_visit1
2.5 G   /hive/username.db/scantesttenderid
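When the sizes are needed as numbers (for sorting or summing) rather than as the human-readable strings printed by -du -h, the raw hadoop fs -du output can be converted in a script. The sketch below is one way to do it; du_to_gb is a hypothetical helper. It assumes paths without spaces and takes the last field as the path, because as far as I know some Hadoop versions print an extra size column between the size and the path.

```python
from typing import List, Tuple


def du_to_gb(du_output: str) -> List[Tuple[float, str]]:
    """Convert raw `hadoop fs -du` lines into (size in GB, path) pairs."""
    sizes: List[Tuple[float, str]] = []
    for line in du_output.splitlines():
        fields = line.split()
        # Skip headers, blank lines and anything not starting with a byte count.
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        sizes.append((int(fields[0]) / 1024 ** 3, fields[-1]))
    return sizes


if __name__ == "__main__":
    # Byte counts below are made-up sample values.
    sample = "129  /hive/username.db/r_scan1\n11811160064  /hive/username.db/cyberfendresolution\n"
    for gb, path in sorted(du_to_gb(sample), reverse=True):
        print("%.2f GB\t%s" % (gb, path))
```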

#hive

Find out hive table path in HDFS

hive -S -e "describe formatted <table_name> ;" | grep 'Location' | awk '{ print $NF }'

https://stackoverflow.com/a/10412663/2345313

hive> set hive.metastore.warehouse.dir;

https://stackoverflow.com/a/18027691/2345313

#hive #hdfs
