Step 1: Getting the Data with Python - 台灣廟宇大搜羅 (Taiwan Temple Roundup)

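There are two ways to get the data here: one uses Python and the other uses Excel. Both are quick, but the later data processing steps require Python, so if you can take this chance to learn it, the Python approach is strongly recommended.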

Getting the data with Python

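Inspecting the site shows that it serves the data as an XML file. The approach here is therefore to download the XML file from the web first, then convert the downloaded file into an Excel xlsx file.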

Downloading the XML file from the web
Reading the XML file and converting it to an Excel xlsx file

Downloading the XML file from the web

import shutil 
import requests

url = 'https://religion.moi.gov.tw/Report/temple.xml'
response = requests.get(url, stream=True)

with open('temple.xml', 'wb') as out_file:
  shutil.copyfileobj(response.raw, out_file)

print('The file was saved successfully')

A brief explanation of the code:

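import shutil
# shutil is a module commonly used for file operations
import requests
# the requests module sends HTTP requests to web servers

url = 'https://religion.moi.gov.tw/Report/temple.xml'
# url is the address of the XML file we want to fetch
response = requests.get(url, stream=True)
# With stream=True, requests does not load the whole response body into memory at once, which helps when the file is large. It is set to True here because we do not know how many temple records the file contains.

with open('temple.xml', 'wb') as out_file:
  shutil.copyfileobj(response.raw, out_file)
# Finally, use Python's basic file writing to copy the data in response into the file
print('The file was saved successfully')
# Print a confirmation once the file has been written (this line only runs if nothing above raised an exception)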
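
As a variation on the download step above, the sketch below wraps it in a function that stops on HTTP errors and writes the body in fixed-size chunks. The function name `download_file`, the 8 KB chunk size, and the 30-second timeout are illustrative assumptions and not part of the original script; `raise_for_status` and `iter_content` are standard requests calls.

```python
import requests


def download_file(url, dest_path, chunk_size=8192, timeout=30):
    """Stream the body at `url` into `dest_path` and return the byte count.

    Hypothetical helper: the name, chunk size and timeout are illustrative.
    """
    written = 0
    # stream=True defers downloading the body until we iterate over it
    with requests.get(url, stream=True, timeout=timeout) as response:
        # Raise requests.HTTPError on 4xx/5xx instead of saving an error page
        response.raise_for_status()
        with open(dest_path, "wb") as out_file:
            # iter_content yields the body piece by piece
            for chunk in response.iter_content(chunk_size=chunk_size):
                out_file.write(chunk)
                written += len(chunk)
    return written
```

Calling `download_file('https://religion.moi.gov.tw/Report/temple.xml', 'temple.xml')` should save the same file as the script above. One difference worth knowing: `iter_content` undoes gzip or deflate compression applied by the server, whereas `response.raw` hands over the bytes as they arrived.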

Reading the XML file and converting it to an Excel xlsx file

import pandas as pd
df = pd.read_xml('temple.xml')
df.to_excel('temple.xlsx')
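
To make it clearer what `pd.read_xml` produces, here is a small sketch that runs it on an inline XML string instead of the downloaded file. The element names (`temples`, `temple`, `name`, `city`, `deity`) and the two records are made up for illustration; the real temple.xml may well be structured differently. It passes `parser="etree"` so that only the standard library parser is needed, whereas the default parser is lxml.

```python
import io

import pandas as pd

# Hypothetical sample data: these tags and values are invented for illustration
SAMPLE_XML = """<temples>
  <temple><name>Temple A</name><city>Taipei</city><deity>Mazu</deity></temple>
  <temple><name>Temple B</name><city>Tainan</city><deity>Guan Yu</deity></temple>
</temples>"""

# By default each child of the root element becomes one row,
# and each of that child's own child elements becomes a column
df = pd.read_xml(io.StringIO(SAMPLE_XML), parser="etree")

print(df.columns.tolist())
print(len(df))
```

If the real file nests its records more deeply, the `xpath` argument of `read_xml` can be pointed at the repeating element.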

A brief explanation of the code:

import pandas as pd
# pandas is a very convenient module for working with data
df = pd.read_xml('temple.xml')
# Note that pd.read_xml was only added in pandas 1.3.0, so if your pandas does not have it, upgrade first with pip install --upgrade pandas

df.to_excel('temple.xlsx')
# Write the DataFrame out as an Excel xlsx file
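
Since `read_xml` depends on the pandas version, a quick check before running the conversion can save a confusing `AttributeError`. The sketch below is one possible way to do it; the helper name `version_at_least` is an assumption for illustration.

```python
import pandas as pd


def version_at_least(version_string, minimum):
    """Return True if the leading 'major.minor' of version_string >= minimum.

    Hypothetical helper; it only looks at the first two numeric components.
    """
    major, minor = version_string.split(".")[:2]
    return (int(major), int(minor)) >= minimum


if version_at_least(pd.__version__, (1, 3)):
    print(f"pandas {pd.__version__} provides read_xml")
else:
    print(f"pandas {pd.__version__} is too old; run: pip install --upgrade pandas")
```

Writing .xlsx files with `to_excel` also relies on an Excel engine such as openpyxl being installed, so a missing-module error at that step usually points to the engine and not to pandas itself.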


肉哥

kevin2758

23 October 2022
