Big Data in the Social Sciences: An Introduction to Automating Text Data Analysis with Natural Language Processing and Machine Learning
DOI:
https://doi.org/10.54790/rccs.51
Keywords:
big data, natural language processing, social sciences, machine learning, text mining
Abstract
Innovations in computer engineering and artificial intelligence offer new methodological opportunities for scientific research, enabling the study of emerging social phenomena that originate and live in virtual spaces. This paper aims to familiarize social scientists with the well-established processes for large-scale text analysis using machine learning techniques, which underpin what is now known as natural language processing (NLP). First, it briefly reviews the history of NLP and its relationship to text analysis in the social sciences. Then, section by section, it assesses the steps involved in applying NLP to social research, providing information on software, tools, data sources, and useful links, in order to offer a simplified introductory guide that serves as a first approach to the discipline. Finally, it examines and evaluates the main challenges the social sciences face when implementing NLP techniques.
Copyright 2024 Alba Taboada
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.