CyberForum: programmers' forum, computer forum
Python: Problem Solving
0 / 0 / 0
Registered: 15.12.2021
Posts: 2

Python, parsing

15.12.2021, 17:26. Views: 1602. Replies: 6

I have this code for parsing the CASAFARI site. It needs to be fixed so that it works.


{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"name":"CASAFARI_SpiderAssessment.ipynb","provenance":[],"collapsed_sections":[]},"kernelspec":{"display_name":"Python [default]","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.6.6"}},"cells":[{"cell_type":"markdown","metadata":{"id":"6mEGUK7rYoRC"},"source":["# Casafari Take-Home Challenge - Summer Internship\n","\n","### Personal Identification\n","Fill here your personal information to accelerate the assessment by our team:\n","* Your name;\n","* Link to your git (or other portfolio website) and/or LinkedIn profile;\n","\n","### General Information\n","\n","The test is split in parts and it was designed to give you a complete, yet short, overview of some your daily activities as summer intern at Casafari. However, if you have time and skills, you can explore the dataset and provide us some valuable insights that will give a boost in your evaluation.\n","\n","**Important: You are not allowed to share this test on public repositories. If you want, use a private repository for version control.**\n","\n","### Guidelines\n","* We expect that the test should take around 1 hour to do. However, we strongly advise you to carefully read this assignment, think about approaches and try to understand the data before diving into the questions. You are free to spend as much time on it as you want, within the timeframe given by our recruiter.\n","* **You can complete this assignment working on Google Colab, or if you prefer you can download it and use it as standalone jupyter notebook and send them back**\n","* In case of using this Google Colab, you'll need to download those files in [this link](https://drive.google.com/open?... 
eOWA1Pdckp) and upload it on this notebook running the cell below.\n","* If you want to use some python packages that are not yet installed on this notebook, use !pip install package."]},{"cell_type":"code","metadata":{"id":"wn82o9okMFJR"},"source":["from google.colab import files\n","\n","uploaded = files.upload()"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"LT_oz7zmYwoL"},"source":["# Data Extraction (CSS + REGEX)\n","\n","Casafari tracks the entire real estate market by aggregating properties from thousands of different websites. The first step of this process is to collect all the relevant information using web crawlers. This task will give a brief overview of how this extraction is made. \n","\n","The step consists of 3 parts, which will evaluate your skills in CSS3 selectors and regular expressions knowledge, which are essential to data extraction processes. We believe that even if you do not have previous knowledge of CSS, HTML and REGEX, you should be able to complete this task in less than a hour. There are many tutorials and informations on how to use CSS3 selectors and regular expressions to extract data. Do not be afraid to google it! This task is also a evaluation of your learning capabilities.\n","\n","The normal questions already have some examples and can be solved only by filling the CSS3 selectors or the regular expressions in the given space. You can check if you have the correct results by running the pre-made script after it. However, if you feel comfortable, you can use another python package and rewrite the script in a similar way to extract the data.\n","\n","For the extra challenges, you'll need to construct the scripts from scratch."]},{"cell_type":"markdown","metadata":{"id":"D58kOtOXjrR7"},"source":["#### Task 1:\n","\n","For the following task, use the _listing.html_ file, which represents a listings for a property. 
Open the HTML file on your browser, investigate it with the Inspect tool, view the source code and explore it. \n","After that, fill the CSS3 selectors in the following script to extract the following information about this property:\n","\n","* Number of bathrooms\n","* Number of bedrooms\n","* Living Area\n","* Energy Rating\n","* Description\n","* Agent Name\n","* Extract the location of the property"]},{"cell_type":"code","metadata":{"id":"_ubceG6_rEek"},"source":["!pip install lxml\n","!pip install cssselect"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"Cqba2Ye43Hyy"},"source":["# EXAMPLE SELECTOR TO EXTRACT THE PROPERTY TYPE\n","Selector_Example = "h1.lbl_titulo""],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"-UYR51QwrYWW"},"source":["# EXAMPLE CODE, RUN TO CHECK THE EXAMPLE SELECTOR \n","\n","from lxml import html,etree\n","\n","with open(r'listing.html', "r") as f:\n"," page = f.read()\n","tree = html.fromstring(page)\n","\n","print('Example -> Property type: {}'.format(tree.cssselect(Selector_Example)[0].text))"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"M7MJswrmP89Z"},"source":["Now that you understand the example, just fill the CSS selectors here and check it by running the below cells:"]},{"cell_type":"code","metadata":{"id":"o-2ra_gioTOy"},"source":["############## Q1 ANSWERS ##################\n","Selector_1 = "WRITE SELECTOR HERE"\n","Selector_2 = "WRITE SELECTOR HERE"\n","Selector_3 = "WRITE SELECTOR HERE"\n","Selector_4 = "WRITE SELECTOR HERE"\n","Selector_5 = "WRITE SELECTOR HERE"\n","Selector_6 = "WRITE SELECTOR HERE"\n","Selector_7 = "WRITE SELECTOR HERE""],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"AaJuBU1nqsub"},"source":["############### RUN TO CHECK YOUR ANSWERS ##################\n","print('Bathrooms: {}'.format(tree.cssselect(Selector_1)[0].text))\n","print('')\n","print('Bedrooms: 
{}'.format(tree.cssselect(Selector_2)[0].text))\n","print('')\n","print('Living area: {}'.format(tree.cssselect(Selector_3)[0].text))\n","print('')\n","print('Energy Rating: {}'.format(tree.cssselect(Selector_4)[0].text))\n","print('')\n","print('Description: {}'.format(tree.cssselect(Selector_5)[0].text))\n","print('')\n","print('Agent name: {}'.format(tree.cssselect(Selector_6)[0].text))\n","print('')\n","print('Location: {}'.format(tree.cssselect(Selector_7)[0].text))"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"__U3ndDeRbt6"},"source":["__Extra Challenge 1__:\n","\n","Write from scratch a script to extract all the features of the property and print each one splitting them by comma (e.g: "Garden, Gas Heating, 2 garages and Large pool")"]},{"cell_type":"code","metadata":{"id":"iLmuSkUFR-LA"},"source":["############### WRITE THE SCRIPT TO SOLVE THE EXTRA CHALLENGE HERE ##################"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"5CX07lJog1Jv"},"source":["#### Task 2:\n","In the second part you will still have to use the html file. 
However, this time, you should use regular expressions to extract the following data from the webpage:\n","\n","* The agent telephone number\n","* The property price"]},{"cell_type":"code","metadata":{"id":"ANpQ4SvPSvpg"},"source":["# REGEXP EXAMPLE TO EXTRACT THE AGENT EMAIL\n","Regexp_Example = r"\">(.*?@.*?)<""],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"6mPc9TCF2jOx"},"source":["# RUN TO CHECK THE EXAMPLE RESULTS\n","import re\n","\n","with open(r'listing.html', "r") as f:\n"," page = f.read()\n","\n","print("Email extracted: {}".format(re.findall(Regexp_Example, page)[0]))"],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"6z_yGAIm8uyC"},"source":["# WRITE YOUR REGULAR EXPRESSIONS HERE\n","Regexp_1 = r"WRITE REGEXP HERE"\n","Regexp_2 = r"WRITE REGEXP HERE""],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"nRZMlLL8r0Pc"},"source":["############### RUN TO CHECK YOUR ANSWERS ##################\n","print("Agent Phone Number: {}".format(re.findall(Regexp_1, page)[0]))\n","print('')\n","print("Property price: {}".format(re.findall(Regexp_2, page)[0]))"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"o4DbKPVDSk8w"},"source":["__Extra Challenge 2:__\n","* Extract latitude and longitude value from html __(those values are in the html code, but are not shown on the page__)"]},{"cell_type":"code","metadata":{"id":"o5AylkERSo6F"},"source":["############### WRITE THE SCRIPT TO SOLVE THE EXTRA CHALLENGE HERE ##################"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"DxAqykAnY22Q"},"source":["#### Task 3:\n","For the last task, use the *sample.json* file. This file contains JSON that has a list of objects inside. Open the file in a code editor, try to identify some pattern on it and check it's structure first. Each object is under unique ID: \n","\n","\n","\n","\n","{ \n","\n",""SV350": { ... 
// data, describing the object ... }, \n","\n",""fKDFI3": { ... // data, describing the object ... },\n","\n","...\n","\n",""38shF": { ... // data, describing the object ... } \n","\n","}\n","\n","\n","\n","\n","Therefore, you need to write one regular expression to extract the following information:\n","* Every unique ID on this file (for example, the first unique ID should be NC065 and the last should be NN574). \n","\n","Hint: The length of your list should be 211"]},{"cell_type":"code","metadata":{"id":"VPI6oVuXURc7"},"source":["# WRITE YOUR REGULAR EXPRESSION HERE\n","Regexp_JSON = r"WRITE REGEXP HERE""],"execution_count":null,"outputs":[]},{"cell_type":"code","metadata":{"id":"Sa6N56T4kBx7"},"source":["with open(r'sample.json', "r") as f:\n"," json = f.read()\n","\n","print('----- Expressions extracted -----')\n","print("First unique id: {}".format(re.findall(Regexp_JSON, json)[0]))\n","print("Last unique id: {}".format(re.findall(Regexp_JSON, json)[-1]))\n","print("Length of list of unique ids: {}".format(len(re.findall(Regexp_JSON, json))))"],"execution_count":null,"outputs":[]},{"cell_type":"markdown","metadata":{"id":"TBgy4nEtueI-"},"source":["__Extra Challenge 3:__\n","* Do you see a better option than use regex to extract this expression ? How would you structure it ?"]},{"cell_type":"markdown","metadata":{"id":"YwNzVMIwadT-"},"source":["# Data Querying (SQL)\n","\n","You have now collected the data, and cleaned it. It was published in Casafari database and you have to query the data in order to prepare it for analysis. \n","\n","To solve this problem consider the data set provided in _properties.csv_ and _agents.csv_ to test your queries. As before, please fill in your queries in the cells provided (double click the blank cells to fill them in). \n","\n","In this task we just want to evaluate your knowledge of SQL syntax, so keep it simple. 
Do not try to overclean the data in this task.\n","\n","### Questions:\n","- (Q1) Write a query to extract only listings with a property type “quinta” or “house”;\n","- (Q2) Write a query to extract only listings of properties with a pool;\n","- (Q3) Write a query calculating the average price per square meter of all apartments in Nagüeles.\n","\n","#### HINT:\n","Assume that location names and property type can be found only within the title."]},{"cell_type":"markdown","metadata":{"id":"ZMiufR188DzQ"},"source":["Query 1:\n","``` **mysql**\n","\n","SELECT *\n","FROM _______\n","WHERE _______;\n","\n","```"]},{"cell_type":"markdown","metadata":{"id":"RB-PnF489SmC"},"source":["Query 2:\n","``` **mysql**\n","\n","\n","\n","\n","```"]},{"cell_type":"markdown","metadata":{"id":"lgzd0mTe9U9O"},"source":["Query 3:\n","``` **mysql**\n","\n","\n","```"]}]}
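The text above looks like the raw JSON of a Jupyter notebook (an .ipynb file) rather than a Python script, which is probably why it is so hard to read. A minimal sketch, assuming the original CASAFARI_SpiderAssessment.ipynb file is available (the paste itself seems to have lost its quote escaping, so it may not parse as-is): the standard json module can print every cell as plain text. The helper name iter_cells and the tiny demo notebook are made up for illustration.

```python
import json


def iter_cells(notebook):
    """Yield (cell_type, source_text) for each cell of a parsed .ipynb dict."""
    for cell in notebook.get("cells", []):
        source = cell.get("source", [])
        # nbformat 4 stores the source as a list of lines or as a single string.
        text = "".join(source) if isinstance(source, list) else source
        yield cell.get("cell_type"), text


# Hypothetical stand-in notebook. With the real file one would instead do:
#   with open("CASAFARI_SpiderAssessment.ipynb", encoding="utf-8") as f:
#       notebook = json.load(f)
demo = json.loads('''{"nbformat": 4, "nbformat_minor": 0, "metadata": {},
 "cells": [
  {"cell_type": "markdown", "metadata": {},
   "source": ["#### Task 1:\\n", "Fill the selectors."]},
  {"cell_type": "code", "metadata": {},
   "source": ["import re\\n", "print(1)"],
   "execution_count": null, "outputs": []}
 ]}''')

for cell_type, text in iter_cells(demo):
    print(f"--- {cell_type} ---")
    print(text)
```

Opening the file in Jupyter or Google Colab would show the same cells in rendered form; the sketch only illustrates that the content is data describing cells, not a script to be fixed.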
Catstail (Super Moderator; Functional Programming Languages Expert, Python Expert)
15.12.2021, 19:24
Ivan_Yohanson, this isn't code, it's a wild mess.
Welemir1 (Python Expert)
15.12.2021, 19:36
I'm afraid that with "code" like this, our powers end here!
Ivan_Yohanson [topic starter]
15.12.2021, 22:42
This was given to me as an assignment, and I'm trying to figure out what to do.
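For what it's worth, Task 3 of the posted assignment (and its Extra Challenge 3) can be sketched with the standard library alone. The snippet below is only an illustration: SAMPLE is a made-up miniature of sample.json, whose real contents are not shown in this thread, and the regex assumes that only the top-level IDs are directly followed by an opening brace.

```python
import json
import re

# Hypothetical miniature of sample.json; the field names are invented.
SAMPLE = """{
  "SV350": {"title": "House in Marbella", "price": 100},
  "fKDFI3": {"title": "Apartment", "price": 200},
  "38shF": {"title": "Quinta", "price": 300}
}"""

# Regex approach: a quoted alphanumeric key followed by a colon and "{".
# Assumption: nested objects do not occur, otherwise their keys match too.
REGEXP_JSON = r'"([A-Za-z0-9]+)"\s*:\s*\{'
ids_by_regex = re.findall(REGEXP_JSON, SAMPLE)

# Parser approach: after json.loads, the top-level keys are the IDs.
ids_by_parser = list(json.loads(SAMPLE).keys())

print(ids_by_regex)
print(ids_by_parser)
```

Parsing the file avoids guessing at the layout, which is likely what Extra Challenge 3 is hinting at; on the real file the result would still need to be checked against the hint in the assignment (211 IDs, first NC065, last NN574).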
iSmokeJC (Python Expert, Java Expert)
15.12.2021, 22:51
Quote from Ivan_Yohanson:
what to do
Ask whoever gave this to you: "What is this crap?"
Welemir1 (Python Expert)
16.12.2021, 17:51
Quote from iSmokeJC:
Ask whoever gave this to you: "What is this crap?"
Better not to hold back on the language.
16.12.2021, 19:09
Quote from iSmokeJC:
"What is this crap?"
"I'm an artist, that's how I see it."