Big Data in Social Sciences. An Introduction to the Automation of Textual Data Analysis Using Natural Language Processing and Machine Learning
DOI:
https://doi.org/10.54790/rccs.51
Keywords:
big data, natural language processing, social sciences, machine learning, text mining
Abstract
Innovations in computer engineering and artificial intelligence provide new methodological opportunities for scientific research, enabling the study of emerging social phenomena that arise in and inhabit virtual spaces. The purpose of this paper is to familiarise social scientists with the widely established processes of large-scale text analysis using machine learning techniques, which underpin what we know today as natural language processing (NLP). First, a brief overview of the history of NLP and its relation to text analysis in the social sciences is given. Then, each section reviews one of the steps to follow when applying NLP to social research, providing information on software, tools, data sources and useful links, with the aim of offering a simplified, introductory guide to the discipline. Finally, the main challenges that the social sciences face when implementing NLP techniques are examined.
Copyright (c) 2024 Alba Taboada
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.