Data Collection

Khulood Nasher
4 min read · Sep 30, 2020


Every data science project starts with two steps: first we define the problem we want to address, and then we obtain the data.

In this post, I’m going to show you how to define the problem and where to search for data. Defining the problem means stating the objective of the research; it becomes the main title of the project. Examples include forecasting sales or profit, classifying pictures or texts, predicting success, or diagnosing a disease.

To set a good question, we first need to ask ourselves: what do I need to achieve in my study? This main question leads to many more questions that start with what, how, and why. Most of them share a common core, such as: which features have a great effect on the target, and how do these features cause the change?

Answering these questions leads to conclusions and recommendations, such as paying attention to a specific feature because it generates more profit, or improving another feature because it could perform better if handled differently.

Data can be collected from two main kinds of sources: primary sources and secondary sources. Primary data are obtained by the researcher directly and have never been collected before. They can be quantitative, qualitative, or a mix of both.

A good example is a doctor in a clinic who is researching autism among toddlers. The doctor might collect quantitative information, which is numeric, such as the toddlers’ age, height, weight, and number of siblings. The doctor can also collect qualitative information, which comes as text, such as the toddler’s ethnicity, gender, whether a family member has a history of autism, and a description of the toddler’s attention and concentration at different ages.

This information can be collected from many patients, at least 1,000 toddlers, and the more data is collected, the more accurate the conclusions the doctor can draw about autism symptoms and causes. The data can also answer questions the researcher had in mind before collecting it. When do autism symptoms become most clear? Does genetics affect autism? Is there a relationship between ethnicity and autism, and if so, in which ethnicity is autism most prevalent? Is autism more prevalent in a certain gender? Is there a relationship between autism and gender within a given ethnicity or age group?
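To make the quantitative/qualitative distinction concrete, here is a minimal sketch of how the clinic example might look as data. The records, field names, and the `split_fields` helper are all hypothetical and invented for illustration; the idea is only that numeric fields are treated as quantitative and everything else as qualitative.

```python
from numbers import Number

# Hypothetical toddler records; every field name and value is invented.
records = [
    {"age_months": 24, "height_cm": 86.0, "weight_kg": 12.1, "siblings": 1,
     "gender": "male", "family_history": "yes"},
    {"age_months": 30, "height_cm": 90.5, "weight_kg": 13.4, "siblings": 0,
     "gender": "female", "family_history": "no"},
    {"age_months": 18, "height_cm": 81.2, "weight_kg": 10.8, "siblings": 2,
     "gender": "male", "family_history": "no"},
]


def split_fields(rows):
    """Return (quantitative, qualitative) field names based on value types."""
    quantitative, qualitative = [], []
    for name in rows[0]:
        values = [row[name] for row in rows]
        # A field counts as quantitative only if every value is a real number.
        if all(isinstance(v, Number) and not isinstance(v, bool) for v in values):
            quantitative.append(name)
        else:
            qualitative.append(name)
    return quantitative, qualitative


quantitative, qualitative = split_fields(records)
print("Quantitative:", quantitative)
print("Qualitative:", qualitative)
```

In a real study the split is a design decision rather than something inferred from types; for instance, a numeric code for ethnicity would still be qualitative.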

As we can see, we can keep asking questions, and the collected data can help us reach a sound conclusion that benefits people. If researchers analyze in depth the numeric data of the quantitative features and the textual descriptions of the qualitative features, they can answer the research questions, present the answers clearly through visualizations such as histograms, bar plots, and box plots, and finally make strong and clear recommendations.
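As a rough sketch of going from collected data to an answer and a simple visualization, the snippet below computes a rate per group and prints it as a text bar plot. The observations are made up purely for illustration and do not reflect any real findings; the function names are hypothetical, and in practice a plotting library would usually replace the text bars.

```python
from collections import Counter

# Invented (gender, diagnosed) observations, for illustration only.
observations = [
    ("male", True), ("male", False), ("female", False),
    ("male", True), ("female", True), ("female", False),
]


def diagnosis_rate_by_group(rows):
    """Share of positive observations within each group."""
    totals = Counter(group for group, _ in rows)
    positives = Counter(group for group, diagnosed in rows if diagnosed)
    return {group: positives[group] / totals[group] for group in totals}


def text_bar_plot(rates, width=20):
    """Render one bar per group, scaled so that 100% spans `width` characters."""
    lines = []
    for group, rate in sorted(rates.items()):
        bar = "#" * round(rate * width)
        lines.append(f"{group:<8}{bar} {rate:.0%}")
    return "\n".join(lines)


rates = diagnosis_rate_by_group(observations)
print(text_bar_plot(rates))
```

With only six invented rows the rates mean nothing; the point is the shape of the workflow: group, summarize, then visualize.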

Secondary sources, on the other hand, are data that have been collected previously by other researchers. Like primary sources, these data can be quantitative, qualitative, or a mix of both. Secondary data can be published, for example in papers, books, statistical records such as a census, or records posted on websites. They can also be unpublished, such as lab measurements kept for internal use or internal company records of sales, production, or purchases. Examples of free secondary sources of published data are Google Cloud datasets, Amazon Web Services, Kaggle, US government data, and public data authorities such as the National Center for Environmental, WHO, and other national and international organizations.

Methods of Collecting Data

Depending on the type of project or research, the method of obtaining data must be decided first. Numeric data can be collected by direct measurement in labs or with measuring instruments. Qualitative data can be obtained through surveys: paper surveys, phone surveys, electronic surveys, or face-to-face interviews. Survey answers can be multiple choice on a scale of agreement with a given statement, or a free-text description of the respondent’s behavior toward the target topic.
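Agreement-scale answers like these are often turned into numbers before analysis. The sketch below assumes a five-point scale with the labels shown; the labels, the `LIKERT_SCALE` mapping, and the sample responses are assumptions for illustration, not a fixed standard.

```python
from statistics import mean

# Assumed five-point agreement scale; real surveys may use other labels.
LIKERT_SCALE = {
    "strongly disagree": 1,
    "disagree": 2,
    "neutral": 3,
    "agree": 4,
    "strongly agree": 5,
}


def encode_responses(responses):
    """Map free-typed agreement labels to their numeric scores."""
    # Normalize case and surrounding whitespace before the lookup.
    return [LIKERT_SCALE[answer.strip().lower()] for answer in responses]


# Invented answers to a single survey statement.
responses = ["Agree", "Strongly agree", "Neutral", "Agree "]
scores = encode_responses(responses)
average = mean(scores)
print(scores, average)
```

An unknown label raises a `KeyError` here, which is deliberate in this sketch: silently dropping unexpected answers could bias the result.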

Needless to say, data collection plays a major role in the success of any project. Data must be unbiased, varied, homogeneous, and representative of the whole population. I will discuss the details of ideal data collection in future posts. I welcome your feedback and interaction.

Please feel free to reach me by email at khuloodnasher1@gmail.com or through any of my social media accounts.

Written by Khulood Nasher

Data Scientist, Health Physicist, NLP Researcher, & Arabic Linguist. https://www.linkedin.com/in/khuloodnasher https://khuloodnasher1.wixsite.com/resume