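I Tried Building a Python Curriculum with ChatGPT (3): Examples and Exercises for Section 2

Week 2 focuses on data collection and preprocessing. In this section, you will learn how to collect, clean, and preprocess data.

### Week 2: Data Collection and Preprocessing

#### Examples

1. **Read a CSV file with Pandas**: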
```python
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head())
```

2. **Handle missing values with Pandas**:
```python
df_filled = df.fillna(0)   # Option A: fill missing values with 0
df_dropped = df.dropna()   # Option B: drop rows that contain missing values
```
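To compare filling and dropping side by side, here is a minimal self-contained sketch. The frame `df_demo`, its columns `name` and `score`, and the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value per column
df_demo = pd.DataFrame({
    'name': ['Alice', 'Bob', None],
    'score': [90.0, np.nan, 75.0],
})

# Count missing values per column before deciding how to handle them
missing_counts = df_demo.isna().sum()
print(missing_counts)

# Filling keeps every row; a dict lets each column get its own fill value
filled = df_demo.fillna({'name': 'unknown', 'score': 0})

# Dropping keeps only the rows that have no missing value at all
dropped = df_demo.dropna()

print(filled)
print(dropped)
```

Filling preserves the row count, while dropping can shrink the dataset considerably when missing values are spread across many rows.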

3. **Clean text data with regular expressions**:
```python
import re
text = "Example text with email@example.com"
cleaned_text = re.sub(r'[^\w\s]', '', text) # Remove punctuation (anything that is not a word character or whitespace)
print(cleaned_text)
```
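Note that the pattern above also strips the `@` and `.` inside the email address. The sketch below shows that effect and one possible variant that keeps those two characters. The variant and its sample sentence are assumptions for illustration, not part of the original curriculum.

```python
import re

text = "Example text with email@example.com"

# [^\w\s] matches every character that is not a letter, digit,
# underscore, or whitespace, so '@' and '.' are removed too
cleaned_text = re.sub(r'[^\w\s]', '', text)
print(cleaned_text)  # Example text with emailexamplecom

# Variant: add '@' and '.' to the excluded set so the address survives
sample = "Hello, world! Contact: email@example.com"
kept = re.sub(r'[^\w\s@.]', '', sample)
print(kept)  # Hello world Contact email@example.com
```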

4. **Merge data**:
```python
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
df2 = pd.DataFrame({'A': ['A2', 'A3', 'A4', 'A5'],
'C': ['C2', 'C3', 'C4', 'C5']})
merged_df = pd.merge(df1, df2, on='A', how='inner')
print(merged_df)
```
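The `how` argument decides which keys survive the merge. As a small sketch using the same two frames, the following compares `how='inner'` with `how='outer'`; the `indicator=True` option is an addition here to make the origin of each row visible.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']})
df2 = pd.DataFrame({'A': ['A2', 'A3', 'A4', 'A5'],
                    'C': ['C2', 'C3', 'C4', 'C5']})

# inner: only keys present in both frames (A2 and A3)
inner = pd.merge(df1, df2, on='A', how='inner')

# outer: every key from either frame; indicator=True adds a `_merge`
# column saying whether a row came from the left, the right, or both
outer = pd.merge(df1, df2, on='A', how='outer', indicator=True)

print(inner)
print(outer)
```

Rows that exist on only one side get `NaN` in the columns that come from the other frame.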

5. **Extract data from an HTML page with web scraping**:
```python
from bs4 import BeautifulSoup
import requests

url = 'http://example.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

titles = soup.find_all('h1')
for title in titles:
    print(title.text)
```

6. **Work with date and time data**:
```python
from datetime import datetime
now = datetime.now()
print(now.strftime("%Y-%m-%d %H:%M:%S"))
```
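Formatting is only half of the picture: incoming data usually arrives as strings that need to be parsed first. Here is a minimal sketch of the reverse direction with `strptime`, plus simple date arithmetic. The timestamp is a made-up sample value.

```python
from datetime import datetime, timedelta

# Parse a string into a datetime object (sample timestamp is hypothetical)
parsed = datetime.strptime("2024-01-31 23:30:00", "%Y-%m-%d %H:%M:%S")

# Date arithmetic with timedelta rolls over month boundaries correctly
one_hour_later = parsed + timedelta(hours=1)

print(one_hour_later.strftime("%Y/%m/%d %H:%M"))  # 2024/02/01 00:30
print(one_hour_later.isoformat())                 # 2024-02-01T00:30:00
```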

7. **Encode categorical data**:
```python
df['Category_encoded'] = df['Category'].astype('category').cat.codes
print(df[['Category', 'Category_encoded']])
```
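The example above assumes `df` already has a `Category` column. A self-contained sketch with made-up color labels shows how the codes are assigned: string categories are ordered alphabetically, and a missing value is encoded as `-1`.

```python
import pandas as pd

# Hypothetical sample data, including one missing label
df_demo = pd.DataFrame({'Category': ['red', 'green', 'blue', 'green', None]})

cat = df_demo['Category'].astype('category')
df_demo['Category_encoded'] = cat.cat.codes

# Mapping from integer code back to the original label
mapping = dict(enumerate(cat.cat.categories))

print(df_demo)
print(mapping)
```

Keeping `mapping` around makes it possible to translate the codes back into labels later.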

8. **Normalize a dataset**:
```python
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[['Normalized']] = scaler.fit_transform(df[['Value']])
print(df[['Value', 'Normalized']])
```
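With its default settings, `MinMaxScaler` rescales each column to the range 0 to 1 using `(x - min) / (max - min)`. The following sketch reproduces that calculation with pandas alone, on made-up values, to show what the scaler computes. It is not a replacement for the scaler, which also remembers the fitted minimum and maximum for reuse on new data.

```python
import pandas as pd

# Hypothetical sample values
df_demo = pd.DataFrame({'Value': [10, 20, 30, 50]})

v_min = df_demo['Value'].min()
v_max = df_demo['Value'].max()

# Min-max normalization: the smallest value maps to 0, the largest to 1
df_demo['Normalized'] = (df_demo['Value'] - v_min) / (v_max - v_min)

print(df_demo)
```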

9. **Split a dataset**:
```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df[['Features']], df['Target'], test_size=0.2)
print(X_train.head())
```

10. **Create a subset that matches a condition with Pandas**:
```python
subset = df[df['Column'] > 10]
print(subset)
```

#### Exercises

1. **Collect JSON data from an API and convert it to a Pandas DataFrame**:
```python
import requests
import pandas as pd

response = requests.get('https://api.example.com/data')
data = response.json()
df = pd.DataFrame(data)
print(df.head())
```
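API responses are often nested, in which case `pd.DataFrame(data)` leaves dictionaries inside cells. As a sketch under that assumption, `pd.json_normalize` flattens nested fields into dotted column names. The payload below is a made-up stand-in for `response.text`, so no network access is needed.

```python
import json

import pandas as pd

# Hypothetical payload standing in for the body of an API response
payload = """
[
  {"id": 1, "user": {"name": "Alice", "city": "Tokyo"}},
  {"id": 2, "user": {"name": "Bob", "city": "Osaka"}}
]
"""

records = json.loads(payload)

# Nested keys become columns such as 'user.name' and 'user.city'
df_flat = pd.json_normalize(records)

print(df_flat)
```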

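2. **Fill missing values in a specific column with the mean using Pandas**:
```python
# Assign the result back; calling fillna(inplace=True) on a selected column may not update df
df['Column'] = df['Column'].fillna(df['Column'].mean())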
print(df)
```

3. **Extract a specific pattern from text with regular expressions**:
```python
import re
text = "Sample text with various email addresses such as test@example.com"
# [A-Za-z] instead of [A-Z|a-z]: inside a character class, '|' is a literal pipe
emails = re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', text)
print(emails)
```

4. **Combine multiple data sources**:
```python
df1 = pd.DataFrame({'Key': ['K0', 'K1', 'K2', 'K3'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})

df2 = pd.DataFrame({'Key': ['K0', 'K1', 'K2', 'K3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})

result = pd.merge(df1, df2, on='Key')
print(result)
```

5. **Extract elements with a specific attribute through web scraping**:
```python
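import requests
from bs4 import BeautifulSoup

response = requests.get('http://example.com')  # same placeholder URL as the earlier example
soup = BeautifulSoup(response.text, 'html.parser')
links = soup.find_all('a', class_='specific-class')
for link in links:
    print(link.get('href'))  # .get() returns None instead of raising if href is missing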
```

6. **Change the format of date data**:
```python
df['Date'] = pd.to_datetime(df['Date'])
df['Formatted Date'] = df['Date'].dt.strftime('%Y/%m/%d')
print(df[['Date', 'Formatted Date']])
```
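Real date columns often contain values that cannot be parsed. A minimal sketch, with made-up sample strings, of how `errors='coerce'` turns such values into `NaT` instead of raising an exception:

```python
import pandas as pd

# Hypothetical sample data with one unparseable entry
df_demo = pd.DataFrame({'Date': ['2024-01-05', '2024-02-10', 'not a date']})

# errors='coerce' converts values that cannot be parsed into NaT
df_demo['Date'] = pd.to_datetime(df_demo['Date'], errors='coerce')

# Missing dates stay missing after formatting
df_demo['Formatted Date'] = df_demo['Date'].dt.strftime('%Y/%m/%d')

print(df_demo)
```

Counting the missing entries with `df_demo['Date'].isna().sum()` afterwards is a quick way to see how many rows failed to parse.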

7. **Convert categorical data to numbers with dummy variables**:
```python
df_with_dummies = pd.get_dummies(df, columns=['Category'])
print(df_with_dummies.head())
```
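For a runnable version of this exercise, here is a sketch on made-up data. Recent pandas versions return boolean dummy columns by default, so `dtype=int` is passed here as one way to get 0/1 values.

```python
import pandas as pd

# Hypothetical sample data
df_demo = pd.DataFrame({
    'Category': ['red', 'green', 'blue', 'green'],
    'Value': [1, 2, 3, 4],
})

# One new 0/1 column per category; the original 'Category' column is dropped
dummies = pd.get_dummies(df_demo, columns=['Category'], dtype=int)

print(dummies)
```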

8. **Standardize a dataset**:
```python
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df[['Standardized']] = scaler.fit_transform(df[['Value']])
print(df[['Value', 'Standardized']])
```
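`StandardScaler` subtracts the mean and divides by the population standard deviation (`ddof=0`), whereas pandas `Series.std()` defaults to the sample standard deviation (`ddof=1`). The sketch below reproduces the standardization by hand on made-up values to make that difference visible.

```python
import pandas as pd

# Hypothetical sample values with mean 5 and population std 2
df_demo = pd.DataFrame({'Value': [2, 4, 4, 4, 5, 5, 7, 9]})

mean = df_demo['Value'].mean()
std_population = df_demo['Value'].std(ddof=0)  # what StandardScaler uses
std_sample = df_demo['Value'].std()            # pandas default, ddof=1

df_demo['Standardized'] = (df_demo['Value'] - mean) / std_population

print(std_population, std_sample)
print(df_demo)
```

Using the pandas default by mistake gives slightly smaller absolute values than the scaler would, which matters most on small datasets.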

9. **Create a new DataFrame containing only the rows that satisfy a condition**:
```python
new_df = df[df['Column'] > 10].copy()
print(new_df)
```

10. **Group rows by the values in a column with Pandas and compute the mean of each group**:
```python
grouped = df.groupby('Category')['Value'].mean()
print(grouped)
```
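A self-contained sketch of the grouping exercise, using made-up data. The `agg` call is an addition that computes several statistics per group at once.

```python
import pandas as pd

# Hypothetical sample data
df_demo = pd.DataFrame({
    'Category': ['a', 'b', 'a', 'b', 'a'],
    'Value': [10, 20, 30, 60, 50],
})

# Mean of 'Value' within each category
grouped = df_demo.groupby('Category')['Value'].mean()

# Several aggregations in one pass
summary = df_demo.groupby('Category')['Value'].agg(['mean', 'count'])

print(grouped)
print(summary)
```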

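These examples and exercises help build basic skills in data collection and preprocessing. They cover key topics for strengthening data science fundamentals: collecting data from APIs, manipulating data with Pandas, processing text with regular expressions, web scraping, and preprocessing techniques for machine learning.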
