Handling Data at Scale Using One-Line EDA Libraries
Handling data is one of the key challenges organizations face worldwide. However advanced an organization's analytics capabilities are, the first step is always exploration, where the business needs to understand, slice, and dice the data. This exploration forms the base for the later steps where advanced analytics come into the picture. As a result, Exploratory Data Analysis (EDA) is growing in significance, and performing it on large data volumes is becoming more challenging.
In a recent project for a leading technology firm, we performed EDA on around 5 TB of data. Excel and other BI tools were not an option because handling that much data is not feasible on such platforms, so we had to choose an alternative method. One-line EDA libraries let us explore the data quickly. During this process, we tried several of the best-in-class one-line EDA libraries and settled on the one that best suited our requirements. This blog walks through a few one-line EDA libraries that suit different EDA use cases, depending on the problem and the data.
What is EDA?
Exploratory data analysis (EDA) is the first step in a data science project: investigating a data set without prior knowledge of its contents. The ultimate goal of EDA is to understand what the data tells us by summarizing its main characteristics. Developed in the early 1970s by American mathematician John Tukey, EDA continues to be a widely used technique for understanding data.
Why do data scientists use EDA?
Here’s a truth that all data scientists need to accept: data comes with flaws. For example, raw data may contain missing values, outliers, and duplicate values. So it is crucial to use EDA, through both graphical and non-graphical analysis, to find these issues before they bias later results.
Non-Graphical Analysis includes:
- Describing data to analyze data types, min, max, mode, median, quartiles, and more
- Handling missing and duplicate data
- Outlier detection
- Understanding correlation between the variables
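To make the non-graphical steps concrete, here is a minimal sketch of doing them by hand with pandas. The DataFrame and its column names are hypothetical toy data, and the 1.5 × IQR rule used for outliers is just one common convention, not necessarily the method used in the project described here.

```python
import pandas as pd

# Hypothetical toy data: one missing amount, one duplicated row, one extreme value.
df = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5, 6, 6],
    "amount": [20.0, 22.0, 21.0, 23.0, None, 500.0, 500.0],
    "items": [2, 2, 2, 3, 1, 40, 40],
})

# Describe the data: column types plus count, mean, min, max, and quartiles.
dtypes = df.dtypes
summary = df.describe()

# Missing and duplicate data.
missing_per_column = df.isna().sum()
duplicate_rows = int(df.duplicated().sum())
deduped = df.drop_duplicates()

# Outlier detection with the 1.5 * IQR rule (an assumed, commonly used convention).
q1 = deduped["amount"].quantile(0.25)
q3 = deduped["amount"].quantile(0.75)
iqr = q3 - q1
is_outlier = (deduped["amount"] < q1 - 1.5 * iqr) | (deduped["amount"] > q3 + 1.5 * iqr)
outliers = deduped[is_outlier]

# Pairwise Pearson correlation between the numeric columns.
correlation = deduped.corr()

print(summary)
print(missing_per_column)
print(outliers)
print(correlation)
```

Even on a three-column table, this takes a handful of separate calls, and each one has to be repeated or generalized for every column of interest.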
Graphical Analysis includes:
- Univariate Analysis
- Bivariate Analysis
- Multivariate Analysis

Performing EDA on terabyte-scale data, with both graphical and non-graphical analysis, requires many lines of code and is time-consuming and challenging. This is where one-line EDA libraries come in: they perform all these tasks in a single line of code.
What is a one-line EDA?
One-line EDA libraries are easy-to-use libraries that quickly analyze a dataset and generate a detailed report, giving a better overview of the data while saving both time and effort.
Some of the one-line EDA libraries are:
- Sweetviz
- Autoviz
- Pandas Profiling
- D-Tale

We started by exploring the one-line EDA tools listed above, running each on a small sample dataset on-premise and gathering the reports.
Sweetviz
According to the Sweetviz documentation, “Sweetviz is an open-source Python library that generates beautiful, high-density visualizations to kickstart EDA (Exploratory Data Analysis) with just two lines of code. The output is a fully self-contained HTML application.”
pip install sweetviz

import sweetviz as sv

report = sv.analyze(dataframe)  # dataframe: an existing pandas DataFrame
report.show_html()  # writes the HTML report and opens it in the browser
Learn more at https://www.latentview.com/blog/handling-data-at-scale-using-one-line-eda-libraries/











