1. Big Data:
Everyone talks about Big Data, but what is it? Is it an 80 MB Excel sheet with questions from a survey, or is it all the data collected by Google or Facebook? What makes data big data?
Wikipedia writes the following about this:
The term big data is used for data sets that are too large to maintain with regular database management systems. The amount of stored data is growing exponentially. This is not only because we keep storing more data in the form of files, pictures and movies (e.g., on Facebook or YouTube), but also because more and more devices collect, store and exchange data themselves (the so-called Internet of Things) and more and more sensor data becomes available. Storing these quantities is not the only challenge: analyzing this data plays an increasingly important role as well.
2. Structured and unstructured data:
Today, analysts and data scientists increasingly distinguish between structured and unstructured data. Structured data is the form most companies use to mine information. It is typically stored in operational databases, spreadsheets, etc. The columns (fields) are predefined, and likewise the links (keys) between the fields are well defined. Relationships can be established and analysis files created with relatively little effort, as sketched below. BI departments can further raise the quality of the data through deduplication steps.
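A minimal sketch of what predefined columns and keys mean in practice, using pandas on two hypothetical tables (all names are illustrative): because the key is explicit, the analysis file is a single join away.

```python
import pandas as pd

# Hypothetical tables: the columns (fields) are predefined, and the key
# linking them (customer_id) is well defined.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Anna", "Bram", "Carla"],
})
orders = pd.DataFrame({
    "order_id": [10, 11, 12],
    "customer_id": [1, 1, 3],
    "amount": [120.0, 35.5, 99.9],
})

# With an explicit relationship, creating an analysis file takes one merge.
analysis_file = orders.merge(customers, on="customer_id")
print(analysis_file)
```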
However, we also store all kinds of data in documents, e-mails, complaints, tweets, internet sites, etc. This data cannot be used in analyses directly. Before it can be analyzed, the individual values or strings of words must be converted into columns in an analysis file, which means the text data must be normalized: standardizing uppercase and lowercase letters, removing stop words and reducing words to their stem. The process involved is called natural language processing (NLP). Unstructured data, in other words, has no structure at its origin.
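A sketch of such a normalization step is shown below. The stop-word list and the stemmer are deliberately naive stand-ins for illustration; a real pipeline would use a full stop-word list and a proper stemming algorithm (e.g., Porter stemming via NLTK or spaCy).

```python
import re

# Tiny illustrative stop-word list; real pipelines use much larger ones.
STOP_WORDS = {"the", "a", "an", "is", "are", "was", "were", "about",
              "and", "or", "of", "to"}

def crude_stem(word: str) -> str:
    # Very naive stemming, for illustration only.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def normalize(text: str) -> list[str]:
    # 1. standardize case, 2. keep only letters, 3. drop stop words, 4. stem
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(normalize("The customers WERE complaining about delayed shipments."))
# -> ['customer', 'complain', 'delay', 'shipment']
```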
In addition to data from documents, we can now also extract an enormous amount of data from all kinds of devices: think of smartphones, telematics in cars and control systems in houses. This information is often stored as long strings, and given the sheer volume of data, recognizing patterns in those strings is an important process. Systems biologists and microbiologists in particular, but also process analysts, have developed all kinds of algorithms to streamline this. In short, the amount of unstructured data is many times greater than the amount of structured data.
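As a small illustration of pattern recognition in device strings, the sketch below parses a hypothetical telematics record (the field layout is invented for this example) into named columns with a regular expression.

```python
import re

# Hypothetical telematics string: timestamp, speed and GPS coordinates
# packed into one long record, as devices often emit.
record = "2021-06-01T14:32:07;SPEED=87;LAT=52.3702;LON=4.8952"

# A pattern turns the string into named fields, i.e., into columns.
pattern = re.compile(
    r"(?P<timestamp>[^;]+);SPEED=(?P<speed>\d+);"
    r"LAT=(?P<lat>[\d.]+);LON=(?P<lon>[\d.]+)"
)
match = pattern.match(record)
if match:
    print(match.groupdict())
# -> {'timestamp': '2021-06-01T14:32:07', 'speed': '87',
#     'lat': '52.3702', 'lon': '4.8952'}
```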
3. Supervised and Unsupervised Learning:
Two concepts frequently used by today’s analysts are supervised and unsupervised learning, where marketers tend to talk about segmentation models and scoring models. Supervised learning closely resembles building a scoring/prediction model: you try to predict a binary value (yes/no, or 0/1) or a continuous value as accurately as possible. Today’s data scientists call the former classification and the latter regression. Deep learning offers an in-depth take on supervised learning.
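A minimal sketch of both flavors, using scikit-learn on synthetic data (the datasets and model choices here are illustrative, not prescriptive):

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import train_test_split

# Classification: predict a binary value (0/1) as accurately as possible.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("classification accuracy:", clf.score(X_te, y_te))

# Regression: predict a continuous value.
X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
reg = LinearRegression().fit(X_tr, y_tr)
print("regression R^2:", reg.score(X_te, y_te))
```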
Unsupervised learning is very similar to aggregating values or characteristics; it corresponds to concepts such as clustering, factor analysis and correlations. With unsupervised learning we do not predict any value. Beyond supervised and unsupervised learning, there are also the terms reinforcement learning, online learning and active learning.
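A short clustering sketch, again with scikit-learn on synthetic data standing in for, say, customer characteristics: no target value is predicted, the algorithm only groups similar records.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data standing in for customer characteristics.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# No labels are given and no value is predicted; k-means only groups
# similar records into segments.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])       # cluster assignment per record
print(kmeans.cluster_centers_)   # one centroid per segment
```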
4. Scraping:
You need data to build a model. Previously, most data came from operational company databases; today, data is mined from all kinds of sources, and the internet in particular has become a data source for the data scientist/analyst. The process of mining internet data is called scraping, and it is a discipline in itself. The art of scraping is to remove all tags and keep only the plain text. An important point of attention in arriving at tidy data is normalization. Because the amount of unstructured data is enormous, it is important to develop processes that transform unstructured data into columns as efficiently as possible; techniques such as MapReduce play an important role here.
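A minimal scraping sketch using the requests and BeautifulSoup libraries; the URL is a placeholder, and in practice you should check that you are allowed to scrape a page.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL; substitute a page you are permitted to scrape.
url = "https://example.com"
html = requests.get(url, timeout=10).text

# The art of scraping: strip all tags and keep only the plain text.
soup = BeautifulSoup(html, "html.parser")
for tag in soup(["script", "style"]):
    tag.decompose()  # drop non-content markup entirely
text = soup.get_text(separator=" ", strip=True)
print(text[:200])
```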