What Is Data, and What Types of Data Are There?

We hear many buzzwords about Data Science, and it is certainly one of the fastest-growing technologies. Yet many people do not have a clear understanding of it. In this article, we work toward the question “What is Data Science?” To understand Data Science, we first need to know about data. In this technology era, people use machines to make work easier, and they depend on gadgets for professional and personal tasks.

You may have heard that data is important for Data Science applications such as AI systems, natural language processing, prediction of results, and forecasting reports.

Let's look at an example of a table of data, which we also call a dataset.

With the help of Data Science algorithms, we can predict the price of houses, which can make it easier to buy or sell them.

To perform this task, we need to collect a dataset with some basic information. For example, take a Microsoft Excel spreadsheet where one column is the size of the house in square feet or square meters, and the second column is the price of the house. If you're trying to build an AI or Machine Learning system to help you set prices for houses or figure out whether a house is priced appropriately, you might decide that the size of the house is A and the price of the house is B, and have the AI system learn this input-to-output, or A-to-B, mapping. You can take one or multiple columns as input data and predict the price of houses.
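As a minimal sketch of what "learning the A to B mapping" can look like, the snippet below fits a straight line from house size (A) to price (B). The numbers are made up for illustration, and ordinary least squares is just one simple technique chosen here; a real system might use several input columns and a different model.

```python
# Hypothetical dataset: (size in square feet, price in dollars).
# These values are invented for illustration only.
houses = [
    (500, 100_000),
    (1000, 200_000),
    (1500, 300_000),
    (2000, 400_000),
]


def fit_line(pairs):
    """Fit B = slope * A + intercept by ordinary least squares."""
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    covariance = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    variance = sum((a - mean_a) ** 2 for a, _ in pairs)
    slope = covariance / variance
    intercept = mean_b - slope * mean_a
    return slope, intercept


slope, intercept = fit_line(houses)


def predict_price(size_sqft):
    """Map an input A (size) to a predicted output B (price)."""
    return slope * size_sqft + intercept


predicted = predict_price(1200)
print(f"Predicted price for 1200 sq ft: {predicted:.0f}")
```

Once the line is fitted, predict_price can be called with any size to get an estimated price.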

So, given that table of data, it's up to you and your business use case to decide what is A and what is B. Data is often unique to your business; this is an example of a dataset that a real estate agency might have if they were trying to help price houses. How you choose the definitions of A and B determines how valuable the system is for your business.

As another example, if you have a certain budget and you want to decide what size of house you can afford, then you might decide that the input A is how much someone spends and the output B is the size of the house in square feet. That would be a totally different choice of A and B, one that tells you, given a certain budget, what size of house you should be looking at.
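The two choices of A and B described above can both be read off the same table. Here is a small sketch of that idea; the column names and the helper make_pairs are hypothetical, and the rows are invented.

```python
# Hypothetical rows from a house dataset; the values are invented.
rows = [
    {"size_sqft": 500, "price": 100_000},
    {"size_sqft": 1000, "price": 200_000},
    {"size_sqft": 1500, "price": 300_000},
]


def make_pairs(rows, input_col, output_col):
    """Build (A, B) training pairs from whichever columns you choose."""
    return [(row[input_col], row[output_col]) for row in rows]


# Use case 1: given the size (A), predict the price (B).
size_to_price = make_pairs(rows, "size_sqft", "price")

# Use case 2: given a budget (A), suggest a size (B).
budget_to_size = make_pairs(rows, "price", "size_sqft")

print(size_to_price[0])
print(budget_to_size[0])
```

The data is identical in both cases; only the business question decides which column plays the role of A and which plays B.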

Let's say that you want to build a system to recognize cats in pictures. You collect a dataset where the input A is a set of different images and the output B is a label, either ‘cat’ or ‘not-cat’. In Machine Learning, data is important. But how do you acquire data? One way is manual labelling.

For example, you might collect a set of pictures, and then either you or someone else goes through these pictures and labels each of them: the first one is a cat, the second one is not a cat, the third one is a cat, the fourth one is not a cat. By manually labelling each of these images, you now have a dataset for building a cat detector. To do that, you need more than four pictures.
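A manually labelled dataset of this kind is often stored as nothing more than a list of (image, label) pairs. The sketch below assumes hypothetical file names and uses only four examples, far fewer than a real cat detector would need.

```python
from collections import Counter

# Hypothetical file names; each label is what a human annotator entered.
labelled_images = [
    ("image_001.jpg", "cat"),
    ("image_002.jpg", "not-cat"),
    ("image_003.jpg", "cat"),
    ("image_004.jpg", "not-cat"),
]

# Input A is the image; output B is the label.
inputs_a = [filename for filename, _ in labelled_images]
outputs_b = [label for _, label in labelled_images]

# A quick check of how many examples each label has.
label_counts = Counter(outputs_b)
print(label_counts)
```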

You might need hundreds of thousands of pictures, but manual labelling is a tried and true way of getting a dataset where you have both A and B. Another way to get a dataset is from observing user behaviours or other types of behaviours.

For example, suppose you run a website that sells things online, an e-commerce (electronic commerce) website where you offer things to users at different prices, and you can simply observe whether they buy your product or not. Just through the act of buying or not buying, users let you collect a dataset that stores the user ID, the time the user visited your website, the price you offered the product at, and whether or not they purchased it. So, just by using your website, users generate this data for you.
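The four fields mentioned above map naturally onto a simple record type. The following is a sketch only: the class name Visit, the field names, and all values are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Visit:
    """One observed visit; field names are hypothetical."""
    user_id: int
    visited_at: datetime
    offered_price: float
    purchased: bool


# Invented observations, as a website might log them.
visits = [
    Visit(4783, datetime(2019, 1, 21, 8, 15), 7.95, True),
    Visit(3893, datetime(2019, 3, 3, 11, 30), 7.95, False),
    Visit(8384, datetime(2019, 6, 11, 14, 15), 9.95, False),
    Visit(931, datetime(2019, 8, 2, 20, 30), 5.00, True),
]


def purchase_rate_by_price(visits):
    """Return the fraction of visits that ended in a purchase, per price."""
    totals = {}
    for visit in visits:
        shown, bought = totals.get(visit.offered_price, (0, 0))
        totals[visit.offered_price] = (shown + 1, bought + int(visit.purchased))
    return {price: bought / shown for price, (shown, bought) in totals.items()}


rates = purchase_rate_by_price(visits)
print(rates)
```

Here the offered price could serve as the input A and the purchase flag as the output B.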

This was an example of observing user behaviours. We can also observe the behaviour of other things, such as machines. If you run a large machine in a factory and you want to predict whether it is about to fail or have a fault, then just by observing the machine you can record a dataset containing the machine ID, the temperature of the machine, the pressure within the machine, and whether the machine failed or not. If your application is preventative maintenance, you could choose the temperature and pressure as the input A and whether the machine failed as the output B, and use the result to decide when to do preventative maintenance on the machine.

The third and very common way of acquiring data is to download it from a website or to get it from a partner. You can download data for free, ranging from computer vision or image datasets, to self-driving car datasets, to speech recognition datasets, to medical imaging datasets, and many more. So, if your application needs a type of data that you can download off the web, keeping in mind licensing and copyright, that could be a great way to get started on the application. Finally, if you're working with a partner, say a factory, then they may already have collected a big dataset of machines, temperatures, pressures, and whether or not the machines failed, which they could give to you.
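Whether you record it yourself or receive it from a partner, a machine dataset like the one described above might look like the sketch below. The field names, units, and readings are all assumptions for illustration.

```python
# Invented machine readings; the units (Celsius, psi) are assumptions.
machine_log = [
    {"machine_id": 17, "temperature_c": 60.0, "pressure_psi": 7.6, "failed": False},
    {"machine_id": 17, "temperature_c": 62.0, "pressure_psi": 7.8, "failed": False},
    {"machine_id": 32, "temperature_c": 120.0, "pressure_psi": 25.5, "failed": True},
    {"machine_id": 32, "temperature_c": 118.0, "pressure_psi": 24.9, "failed": True},
]

# Input A: the sensor readings. Output B: whether the machine failed.
inputs_a = [(r["temperature_c"], r["pressure_psi"]) for r in machine_log]
outputs_b = [r["failed"] for r in machine_log]


def mean_temperature(records, failed):
    """Average temperature over records with the given failure outcome."""
    temps = [r["temperature_c"] for r in records if r["failed"] == failed]
    return sum(temps) / len(temps)


print("Mean temperature when failed:", mean_temperature(machine_log, True))
print("Mean temperature when healthy:", mean_temperature(machine_log, False))
```

Comparing simple averages like this is only a first look at the data, not a failure predictor.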

Data is important, but it is also a little bit over-hyped and sometimes misused. Let me describe two of the most common misuses of, or bad ways of thinking about, data. First, many companies plan to spend something like three years storing data, and only after those three years start doing AI. It turns out that's a bad strategy. Instead, what I recommend is that once you've started collecting some data, you start showing it or feeding it to an AI team, because often the AI team can give feedback to your IT team on what types of data to collect and what types of IT infrastructure to keep building.

For example, maybe an AI team can look at your factory data and say, "Hey. You know what? If you can collect data from this big manufacturing machine, not just once every ten minutes, but instead once every minute, then we could do a much better job building a preventative maintenance system for you." So, there's often this back-and-forth interplay between IT and AI teams, and my advice is usually to try to get feedback from the AI team earlier, because it can help guide the development of your IT infrastructure.
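To see why the sampling interval can matter, here is a toy sketch with invented readings: a one-minute temperature spike is visible in per-minute data but missed entirely when only every tenth reading is kept.

```python
# Invented per-minute temperature readings over 30 minutes.
per_minute = [60.0] * 30
per_minute[14] = 95.0  # a brief spike at minute 14

# Keeping only every tenth reading simulates sampling every ten minutes.
every_ten_minutes = per_minute[::10]

print("Max seen at one-minute sampling:", max(per_minute))
print("Max seen at ten-minute sampling:", max(every_ten_minutes))
```

This is the kind of gap an AI team can point out early, before years of coarse data have been stored.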

Second, misuse of data. Unfortunately, I've seen some CEOs read about the importance of data, and then say, "Hey, I have so much data. Surely, an AI team can make it valuable." Unfortunately, this doesn't always work out. More data is usually better than less data, but I wouldn't take it for granted that just because you have many gigabytes or terabytes of data, an AI team can make it valuable. So, don't throw data at an AI team and assume it will be valuable. In fact, in one extreme case, I saw one company go and acquire a whole string of other companies in medicine, on the hypothesis that their data would be very valuable. Now, a couple of years later, as far as I know, the engineers have not yet figured out how to take all this data and create value out of it. So, sometimes it works and sometimes it doesn't.

However, I would not over-invest in acquiring data just for the sake of data unless you're also getting an AI team to look at it, because they can help you think through which data is the most valuable.

Finally, data is messy. You may have heard the phrase “garbage in, garbage out”: if you have bad data, then the AI will learn inaccurate things.

Here are some examples of data problems. Let's say you have a dataset of the size of houses, the number of bedrooms, and the price. You can have incorrect labels or just incorrect data. For example, a house is probably not going to sell for $0.1 or for just one dollar. Data can also have missing values, such as a whole bunch of unknown values. So, an AI team will need to figure out how to clean up the data, or how to deal with these incorrect labels and missing values.
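A first cleaning pass often just flags the rows that have these problems. The sketch below uses invented rows, and the plausibility threshold MIN_PLAUSIBLE_PRICE is an arbitrary assumption; a real team would choose such rules with domain knowledge.

```python
# Invented rows; None marks a missing (unknown) value.
rows = [
    {"size_sqft": 523, "bedrooms": 1, "price": 100_000},
    {"size_sqft": 645, "bedrooms": 1, "price": 1},           # implausible price
    {"size_sqft": 708, "bedrooms": None, "price": 135_000},  # missing bedrooms
    {"size_sqft": None, "bedrooms": 2, "price": 160_000},    # missing size
    {"size_sqft": 1034, "bedrooms": 3, "price": 300_000},
]

# Assumed threshold: any price below this is treated as a data error.
MIN_PLAUSIBLE_PRICE = 1_000


def find_problems(row):
    """Return a list of human-readable problems found in one row."""
    problems = [f"missing {column}" for column, value in row.items() if value is None]
    price = row["price"]
    if price is not None and price < MIN_PLAUSIBLE_PRICE:
        problems.append("implausible price")
    return problems


# Keep only rows with no detected problems.
clean_rows = [row for row in rows if not find_problems(row)]

for row in rows:
    print(row, find_problems(row))
```

Dropping bad rows is the simplest option; filling in missing values or correcting labels by hand are alternatives an AI team might prefer.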

There are also multiple types of data. For example, sometimes you hear about images, audio, and text. These are types of data that humans find very easy to interpret. There's a term for this: unstructured data. There are certain types of AI techniques that can work with images to recognize cats, with audio to recognize speech, or with text to understand that an email is spam. Then, there are also datasets like the house price table described earlier. This is an example of structured data.

That basically means data that lives in a giant spreadsheet. The techniques for dealing with unstructured data are a little bit different from the techniques for dealing with structured data, but AI techniques can work very well for both types. In this article, you learned what data is and how not to misuse it. AI also has complicated terminology, with people throwing around terms like AI, Machine Learning, and Data Science.