Friday, April 21, 2023

Airflow

Apache Airflow is an open-source platform for running any type of workflow; it uses the Python programming language to define pipelines.


When to use Apache Airflow?

- When the process is stable and, once deployed, is expected to change only from time to time (on the scale of weeks rather than hours or minutes)

- When the workflow is related to a time interval

- When the workflow is scheduled to run at specific times (a minimal scheduling sketch follows this list)
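
For instance, the schedule is expressed through the DAG's schedule_interval parameter, which accepts a cron expression or a timedelta. A minimal sketch (the DAG names here are illustrative, not from a real project):

from datetime import timedelta
from airflow.models import DAG
from airflow.utils.dates import days_ago

# Run every day at 6:00 AM, expressed as a cron string...
daily_dag = DAG(
    "daily_report",
    start_date=days_ago(1),
    schedule_interval="0 6 * * *",
)

# ...or every 30 minutes, expressed as a timedelta
frequent_dag = DAG(
    "frequent_sync",
    start_date=days_ago(1),
    schedule_interval=timedelta(minutes=30),
)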


Apache Airflow can be used to schedule:

- ETL pipelines that extract data from multiple sources

- Training machine learning models

- Report generation

- Backups and similar DevOps operations


Airflow DAG

Workflows in Airflow are defined as DAGs (Directed Acyclic Graphs), and each DAG is nothing more than a Python file.

from airflow.models import DAG
from airflow.utils.dates import days_ago
from airflow.operators.dummy_operator import DummyOperator

with DAG(
    "etl_sales_daily",
    start_date=days_ago(1),
    schedule_interval=None,
) as dag:
    # DummyOperator is a no-op placeholder task
    task_a = DummyOperator(task_id="task_a")
    task_b = DummyOperator(task_id="task_b")
    task_c = DummyOperator(task_id="task_c")
    task_d = DummyOperator(task_id="task_d")

    # task_a runs first, then task_b and task_c in parallel, then task_d after task_c
    task_a >> [task_b, task_c]
    task_c >> task_d





Every task in an Airflow DAG has its own task_id, which has to be unique within the DAG. Each task has a set of dependencies that define its relationships to other tasks. These include:


Upstream tasks: a set of tasks that will be executed before this particular task.

Downstream tasks: a set of tasks that will be executed after this task.

In our example, task_b and task_c are downstream of task_a, and task_a is upstream of both task_b and task_c.

A common way of specifying a relationship between tasks is the >> operator, which works for both individual tasks and collections of tasks, as shown below.
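
As a minimal sketch, the following three statements are equivalent ways of declaring the same dependency (set_downstream and set_upstream are the method counterparts of the >> and << operators):

task_a >> task_b              # task_a runs before task_b
task_b << task_a              # same dependency, written from the other side
task_a.set_downstream(task_b) # same dependency, using the method form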



Trigger Rule

Each task can specify a trigger_rule, which allows users to make the relations between tasks even more complex. Examples of trigger rules are listed below, followed by a short sketch:


all_success: all tasks upstream of a task have to succeed before Airflow attempts to execute this task (this is the default rule)

one_success: one succeeded upstream task is enough to trigger a task with this rule

none_failed: each upstream task has to either succeed or be skipped; no failed tasks are allowed to trigger this task
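
As a minimal sketch (the task name is illustrative), a trigger rule is passed through the trigger_rule argument of an operator:

from airflow.operators.dummy_operator import DummyOperator

# join_task runs as long as no upstream task failed (skipped upstream tasks are fine)
join_task = DummyOperator(
    task_id="join_task",
    trigger_rule="none_failed",
)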


Complete DAG sample

from PythonProcessVideo.StartProcess import StartProcess
from PythonProcessVideo.StartProcess import StartSecondProcess
from PythonProcessVideo.StartProcess import Start3rdProcess

import airflow
from airflow import DAG
from airflow.operators import bash_operator
from airflow.operators import python_operator
from datetime import timedelta


default_args = {
    'start_date': airflow.utils.dates.days_ago(0),
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

dag = DAG(
    'PushVideosDAG',
    default_args=default_args,
    description='Push videos from Firebase to Ziggeo',
    schedule_interval=timedelta(minutes=5),
    dagrun_timeout=timedelta(minutes=15),
)

# Bash task that simply logs that the pipeline has started
start_bash_notification = bash_operator.BashOperator(
    task_id='echo',
    bash_command='echo Process Started',
    dag=dag,
    depends_on_past=False)

start_python_task1 = python_operator.PythonOperator(
    task_id='pushVideos',
    dag=dag,
    depends_on_past=False,
    python_callable=StartProcess)

start_python_task2 = python_operator.PythonOperator(
    task_id='DetectPersonFace',
    dag=dag,
    depends_on_past=False,
    python_callable=StartSecondProcess)

start_python_task3 = python_operator.PythonOperator(
    task_id='pushPersonInfo',
    dag=dag,
    depends_on_past=False,
    python_callable=Start3rdProcess)

# The tasks run strictly in sequence: notification, face detection,
# person info push, and finally the video push
start_bash_notification >> start_python_task2 >> start_python_task3 >> start_python_task1
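
The three callables are imported from the PythonProcessVideo module, which is not shown in this post. PythonOperator just needs plain Python functions; a purely hypothetical sketch of what that module might look like (names taken from the imports above, bodies are assumptions):

# PythonProcessVideo/StartProcess.py (hypothetical sketch, not the actual module)
def StartProcess():
    """Push videos from Firebase to Ziggeo."""
    ...

def StartSecondProcess():
    """Detect a person's face in each video."""
    ...

def Start3rdProcess():
    """Push the extracted person info downstream."""
    ...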



Thursday, April 20, 2023

Python data crawling

Scan a website and extract data using Python.


from bs4 import BeautifulSoup
import requests
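
# Each source entry describes one category page on k-tb.com:
#   URL:       listing page, paginated via ?page=
#   Pages:     number of listing pages in the category
#   FT_URL:    base URL of the matching archive.org collection
#   start/end: slice of the book ID used to derive the numbered
#              sub-collection (only needed when FT_URL has no trailing slash)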

Sources=[
            {
                "URL":"https://k-tb.com/books/disc/تاريخ?page=",
                "Pages":100,
                "FT_URL":"https://archive.org/download/history", #'https://archive.org/download/history10000/history09907.zip'
                "start":7,
                "end":9
            },

            {
                "URL":"https://k-tb.com/books/disc/الزهد-والرقائق?page=",
                "Pages":27,
                "FT_URL":"https://archive.org/download/tarbyah", #'https://archive.org/download/tarbyah3000/tarbyah02693.zip'
                "start":7,
                "end":9
            },

            {
                "URL":"https://k-tb.com/books/disc/الحديث-وعلومه?page=",
                "Pages":87,
                "FT_URL":"https://archive.org/download/hadeeth", #'https://archive.org/download/hadeeth9000/hadeeth8697.zip'
                "start":7,
                "end":8
            },
            {
                "URL":"https://k-tb.com/books/disc/التفسير-وعلوم-القرآن?page=",
                "Pages":120,
                "FT_URL":"https://archive.org/download/Quran", #'https://archive.org/download/Quran12000/Quraan11975.zip'
                "start":6,
                "end":8
            },
            {
                "URL":"https://k-tb.com/books/disc/العقيدة-والمذاهب-والأديان?page=",
                "Pages":92,
                "FT_URL":"https://archive.org/download/aqidah01/" #https://archive.org/download/aqidah01/Aqidah09200.zip
            },

            {
                "URL":"https://k-tb.com/books/disc/الدعوة-والاحتساب?page=",
                "Pages":10,
                "FT_URL":"https://archive.org/download/dawah1000/" #https://archive.org/download/dawah1000/dawah00927.zip
            },
            {
                "URL":"https://k-tb.com/books/disc/الثقافة-الإسلامية-?page=",
                "Pages":6,
                "FT_URL":"https://archive.org/download/Th2000/" #'https://archive.org/download/Th2000/Th1898.zip'
            },
            {
                "URL":"https://k-tb.com/books/disc/السيرة-النبوية?page=",
                "Pages":11,
                "FT_URL":"https://archive.org/download/serah1000/" #https://archive.org/download/serah1000/serah01055.zip
            },
            {
                "URL":"https://k-tb.com/books/disc/دوريات-ومجلات?page=",
                "Pages":4,
                "FT_URL":"https://archive.org/download/magazine1000/" #https://archive.org/download/magazine1000/magazine0003.zip
            },
        ]
for item in Sources:
    for i in range(1, item["Pages"] + 1):  # pages are 1-indexed, so include the last page
        r = requests.get(item["URL"] + str(i))
        html = r.text
        soup = BeautifulSoup(html, "lxml")
        table = soup.find("table", {"class": "table-hover"})
        rows = table.find_all('tr', recursive=False)
        for row in rows:
            cell = row.find_all(['td'], recursive=False)
            if cell:
                ID = cell[0].string
                Title = cell[1].string
                Author = cell[2].string
                OriginalURL = cell[3].find('a').get('href')
                if item["FT_URL"][-1] == "/":
                    # The collection URL already points at a single archive.org item
                    FT = item["FT_URL"] + ID + ".zip"
                else:
                    # Derive the numbered sub-collection from the digits inside the ID,
                    # e.g. ID "history09907" with start=7, end=9 yields "history10000"
                    FT = item["FT_URL"] + str(int(ID[item["start"]:item["end"]]) + 1) + "000/" + ID + ".zip"
                print(ID)
                print(Title)
                print(Author)
                print(FT)
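
The FT values printed above are direct archive.org download links. As a minimal follow-up sketch (the download_archive helper and the downloads directory are assumptions, not part of the original crawler), each zip could be saved to disk with requests:

import os
import requests

def download_archive(url, out_dir="downloads"):
    # Stream the zip to disk; the filename is taken from the last URL segment
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, url.rsplit("/", 1)[-1])
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(path, "wb") as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    return path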