Hospital Project Interview FAQ

Domain & Models

Q1: What fields does `PatientRecord` capture and how does it infer `categoria`?

The dataclass in src/core/models.py stores id_paciente, nombre, fecha_nacimiento, edad, sexo, email, telefono, ciudad, and categoria; PatientRecord.from_dict calls categorize_patient_age to compute categoria from the provided age or birthdate.

Q2: Which values does `PatientCategory` expose and when does code pick `UNKNOWN`?

The enum lists CHILD, ADULT, SENIOR, and UNKNOWN; categorize_patient_age returns UNKNOWN when neither an integer age nor a parsable fecha_nacimiento is present.

Q3: What strategy does `categorize_patient_age` apply when given an age versus a birthdate?

If edad is an int it delegates to categorize_by_value; otherwise it tries to parse the ISO-formatted fecha_nacimiento, derives the age as the difference between the current UTC year and the birth year, and defaults to UNKNOWN when parsing fails.

Q4: How does `categorize_by_value` bucket ages?

Values below 18 map to CHILD, values below 65 map to ADULT, and all others become SENIOR.
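
The bucketing can be sketched as follows. This is a minimal illustration, not the project's code: the enum member values are assumptions, since only the member names and the two thresholds are documented here.

```python
from enum import Enum


class PatientCategory(Enum):
    # Member values are illustrative; only the member names are documented.
    CHILD = "child"
    ADULT = "adult"
    SENIOR = "senior"
    UNKNOWN = "unknown"


def categorize_by_value(age: int) -> PatientCategory:
    """Bucket an integer age with the 18 and 65 thresholds."""
    if age < 18:
        return PatientCategory.CHILD
    if age < 65:
        return PatientCategory.ADULT
    return PatientCategory.SENIOR
```

With these thresholds, 17 maps to CHILD, 18 and 64 map to ADULT, and 65 maps to SENIOR.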

Q5: What metrics does `CompletenessMetric` track for each field?

It records the field, the total and missing counters, the completeness ratio, and dictionaries per_city_missing and per_category_missing so stakeholders can see percentages per city and per patient category.

Q6: What information does `ImputationPlan` describe?

The dataclass bundles a target field, the chosen strategy, and a rationale string describing the recommended fix for missing data.

Q7: What does `AgeCorrectionLogEntry` record for inconsistencies?

Each entry logs id_paciente, nombre, fecha_nacimiento, the registered and calculated ages, an action such as inconsistent_age or imputed_age, and a descriptive note.

Q8: How does `AppointmentRecord.from_dict` normalize the `ciudad` value?

It prefers ciudad and falls back to ciudad_cita so each appointment retains a city even when only the appointment location is supplied.

Q9: Which bucket metadata does `AppointmentIndicatorEntry` expose?

The entry captures period_type, period_value, especialidad, estado_cita, medico, and the aggregated count.

Q10: What summary elements live inside `CostAuditReport`?

The report stores total_records, analyzed_records, a list of summaries (SpecialtyCostSummary) with averages and deviations, and a list of anomalies (CostAnomalyEntry).

Q11: What does `PatientTravelEntry` log for each traveler?

It records id_paciente, nombre, residence, the travel_cities set, travel_count, the computed severity, and last_travel_dates.

Q12: How is the business rule catalog modeled?

A BusinessRule provides an id, title, description, and flexible details; BusinessRulesCatalog aggregates those rules with a created_at timestamp.

Repositories & ETL

Q13: What abstract operations do `PatientRepository` and `AppointmentRepository` export?

Each port defines a single abstract method, list_patients or list_appointments, so the core services depend on interfaces instead of concrete adapters.

Q14: How do the JSON repositories load records from disk?

They open the dataset path, parse JSON, fetch the pacientes or citas_medicas array, and yield domain objects through PatientRecord.from_dict or AppointmentRecord.from_dict.

Q15: Which dataset file does `backend/app.py` point to by default?

The module sets DATASET to BASE_PATH / "dataset_hospital 2 AWS.json", so every service call uses that JSON file unless it is replaced.

Q16: What steps does `ETLPipelineService.run()` perform?

It records a start timestamp, calls extract, transform, and load, enriches the summary with patient and appointment counts plus start/end/duration, persists metrics, and returns that summary dictionary.

Q17: How does `transform` handle orphan appointments?

After cleaning both dataframes, it applies a mask to drop rows whose id_paciente is not in the cleaned patient frame and notes the dropped id_citas as orphans.

Q18: Where are the cleaned tables written?

The load step writes pacientes_cleaned.csv/.parquet and citas_cleaned.csv/.parquet under reports/etl.

Q19: What happens if parquet support is unavailable?

The helper catches ImportError from df.to_parquet and writes an empty UTF-8 placeholder so the CSV still documents the table.

Q20: How does `_persist_metrics` keep track of ETL runs?

It appends an entry to reports/etl/etl_metrics.json with start/end times, duration, exported counts, and orphan total, creating the file when it does not exist.

Q21: How does the backend derive dataset and report directories?

It computes BASE_PATH as two levels above app.py and defines REPORT_DIR and SCRIPTS_DIR relative to that root.

Q22: How are automatable scripts discovered for the API?

The loader looks for scripts/run_*.py, reads each docstring, normalizes the key, and registers descriptors used by the /scripts endpoints.

AccessibilityService

Q23: What is the main goal of `AccessibilityService.evaluate()`?

It compares each patient's appointment count against the counts of other patients in the same residence city to surface patients whose volume significantly exceeds that of local peers.

Q24: How does it detect deviation from city averages?

The service builds overall and city-specific counts, computes mean and pstdev, and flags counts that exceed the city mean plus twice the city standard deviation or the global average plus global standard deviation when city data is sparse.
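
A rough sketch of that threshold logic is shown below. The function name, the MIN_CITY_SAMPLES cutoff for "enough city data", and the choice to use the global rule only as a fallback are assumptions made for illustration; only the mean and pstdev formulas come from the description above.

```python
from statistics import mean, pstdev

MIN_CITY_SAMPLES = 3  # hypothetical cutoff for "enough city data"


def is_flagged(count, city_counts, all_counts, min_city_samples=MIN_CITY_SAMPLES):
    """Return True when a patient's appointment count stands out."""
    if len(city_counts) >= min_city_samples:
        # City rule: mean plus twice the population standard deviation.
        return count > mean(city_counts) + 2 * pstdev(city_counts)
    # Sparse city: fall back to the global mean plus one standard deviation.
    overall_std = pstdev(all_counts)
    return overall_std > 0 and count > mean(all_counts) + overall_std
```

When every patient has the same count the global standard deviation is zero, so the fallback never flags anyone.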

Q25: What fields does `AccessibilityReport` publish?

It emits total_pacientes, flagged, and the entries list of AccessibilityEntry objects sorted by travel volume.

Q26: Under what condition does a patient entry become flagged?

When their appointment count is above the city threshold and the city has enough data, or when the overall standard deviation is positive and their count exceeds overall average plus standard deviation.

AgeConsistencyService

Q27: What does `AgeConsistencyService.audit_ages()` report?

It builds an AgeConsistencyReport with the cutoff date, total records, inconsistency count, imputations, missing birthdates, and log entries for each anomaly.

Q28: How are missing or malformed birthdates recorded?

Entries with blank fecha_nacimiento get action missing_birthdate, while values that cannot be parsed get invalid_birthdate along with notes explaining why.

Q29: What happens when a patient lacks `edad` but has a valid birthdate?

The service creates an AgeCorrectionLogEntry with action imputed_age and records the calculated value based on the cutoff date.

Q30: How does `_calculate_age` treat birthdays before the cutoff date?

It subtracts birth years, decrements by one if the cutoff month/day precedes the birthday, and never returns a negative value.
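
The calculation can be illustrated with a short sketch. The function name and signature here are assumptions; the year subtraction, the birthday adjustment, and the zero floor follow the answer above.

```python
from datetime import date


def calculate_age(birth: date, cutoff: date) -> int:
    """Age in whole years at the cutoff date, never negative."""
    years = cutoff.year - birth.year
    # One year less if the birthday has not yet occurred in the cutoff year.
    if (cutoff.month, cutoff.day) < (birth.month, birth.day):
        years -= 1
    return max(years, 0)
```

For a birthdate of 1990-06-15, a cutoff of 2024-06-14 yields 33 and a cutoff of 2024-06-15 yields 34.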

AgeSpecialtyMismatchService

Q31: Which specialty keywords use custom age bounds in `AgeSpecialtyMismatchService`?

The service defines ranges for keywords containing pediatr (0 to 17) and geriatr (65 to 150).

Q32: What range is used when the specialty lacks those keywords?

It falls back to the default range of 18 to 64.

Q33: How does it compute the appointment age when either value is missing?

It returns None if either fecha_nacimiento or fecha_cita is missing or malformed, using safe ISO parsing.

Q34: When does it append an `AgeSpecialtyMismatchEntry`?

When the calculated age sits below the expected minimum or above the expected maximum for the normalized specialty.
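
The range lookup and mismatch test could look like the sketch below. The constant and function names are hypothetical, and treating a None age as "no mismatch" is an assumption; the keyword ranges and the 18 to 64 default are the ones listed above.

```python
DEFAULT_RANGE = (18, 64)
KEYWORD_RANGES = {"pediatr": (0, 17), "geriatr": (65, 150)}


def expected_range(specialty):
    """Pick the age bounds for a specialty by keyword match."""
    normalized = (specialty or "").strip().lower()
    for keyword, bounds in KEYWORD_RANGES.items():
        if keyword in normalized:
            return bounds
    return DEFAULT_RANGE


def is_mismatch(age, specialty):
    """True when the age falls outside the expected bounds."""
    if age is None:
        # Assumption: appointments without a computable age are skipped.
        return False
    low, high = expected_range(specialty)
    return age < low or age > high
```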

AppointmentAlertService

Q35: What data conditions trigger an `AppointmentAlertEntry`?

The service flags appointments that lack a fecha_cita or that do not have an assigned medico.

Q36: How is the descriptive note constructed?

It collects missing pieces into note_parts, joins them with a conjunction, and prefixes the phrase so reviewers know which fields to validate.

Q37: What summary metrics does `AppointmentAlertReport` include?

Total input records, the number of alerts, and the detailed entries list.

Q38: Why are fully populated appointments skipped?

Because the loop continues whenever both falta_fecha and falta_medico are false, avoiding false positives.

AppointmentCostAuditService

Q39: How are specialty cost statistics derived in `AppointmentCostAuditService.analyze()`?

Costs are grouped by normalized specialty, then mean and standard deviation are computed per group while also keeping a map of processed records.

Q40: What defines a cost anomaly?

A record is anomalous when its cost deviates from the specialty average by more than twice the specialty standard deviation computed without the current appointment.
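
A leave-one-out check of this kind might be sketched as follows. Several details are assumptions: the function name, computing the average (not only the deviation) without the current record, using the population standard deviation, and returning False when fewer than two peers remain or the deviation is zero.

```python
from statistics import mean, pstdev


def is_cost_anomaly(costs, index):
    """Compare costs[index] against the other costs of the same specialty."""
    others = costs[:index] + costs[index + 1:]
    if len(others) < 2:
        return False  # assumption: too few peers to judge
    avg = mean(others)
    std = pstdev(others)
    if std == 0:
        return False  # assumption: identical peers give no usable spread
    return abs(costs[index] - avg) > 2 * std
```

Excluding the record under test keeps a single extreme cost from inflating the deviation it is measured against.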

Q41: What expected range do `SpecialtyCostSummary` entries expose?

The properties expected_min and expected_max subtract or add twice the standard deviation from the average.

Q42: What fallback does `_normalize_specialty` provide for blank values?

It returns the trimmed specialty string when present or sin_especialidad when the field is empty.

AppointmentIndicatorService

Q43: Which periods does `measure_indicators` capture?

The method tallies counts for daily buckets and weekly buckets (ISO week), producing entries for both.

Q44: How is the missing-date counter computed?

Each appointment without a parseable fecha_cita increments missing_dates.
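
As an illustration of the daily and ISO-week bucketing together with the missing-date counter, consider the sketch below. It assumes fecha_cita is a plain YYYY-MM-DD string and counts by period only; the real entries also carry especialidad, estado_cita, and medico. The function name and key format are hypothetical.

```python
from collections import Counter
from datetime import date


def tally_periods(fechas):
    """Count appointments per day and per ISO week; count unparseable dates."""
    counts = Counter()
    missing_dates = 0
    for raw in fechas:
        try:
            day = date.fromisoformat(raw)
        except (TypeError, ValueError):
            # None, empty, or malformed values land here.
            missing_dates += 1
            continue
        iso_year, iso_week, _ = day.isocalendar()
        counts[("day", day.isoformat())] += 1
        counts[("week", f"{iso_year}-W{iso_week:02d}")] += 1
    return counts, missing_dates
```

Note that the ISO year can differ from the calendar year near January 1, which is why the week key uses the year returned by isocalendar().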

Q45: What does `AppointmentIndicatorReport` contain?

It reports total_records, missing_date, the combined entries, and the top five bottlenecks.

Q46: How does `_safe` sanitize values?

It strips strings, ensures they are not empty, and falls back to defaults such as sin_especialidad or sin_medico.

AppointmentReviewService

Q47: Which appointment statuses trigger review?

Only records whose estado_cita is Completada or Cancelada are processed.

Q48: What issues does `_collect_issues` detect?

It notes absent fecha_cita or missing medico, adding descriptive strings for the report.

Q49: What metrics appear in `AppointmentReviewReport`?

The report returns total_citas, reviewed_citas, and the entries list of problematic appointments.

Q50: Why are other statuses ignored?

The service focuses on completed and canceled visits for quality control, so it skips states outside the target set.

AppointmentStateTimelineService

Q51: How does `AppointmentStateTimelineService` reconstruct state history?

It groups records by id_cita, sorts them by parsed dates, and builds ordered transitions lists.

Q52: What data drives the occupancy impact list?

Reprogramming events increment counters per doctor-week and track affected appointment IDs inside the occupancy dict.

Q53: When does the service consider an event a reprogramming?

Any status whose lowercase string starts with reprogram increments the reprogram counter and contributes to occupancy data.

Q54: What determines the final status in each entry?

It reads the last transition value (or defaults to sin_estado) after sorting all recorded states.

BusinessRulesCatalogService

Q55: Which rules are documented in the catalog?

It describes valid appointment states, age ranges per specialty, email format, and phone format along with identifiers such as rule-estado and rule-edad-especialidad.

Q56: How is the creation time recorded?

The service uses datetime.utcnow().isoformat() for the catalog created_at field.

Q57: What pattern enforces email formatting?

The email rule stores the regex ^[^@\s]+@[^@\s]+\.[^@\s]+$ inside its details dictionary.
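
A consumer could apply that pattern as in the sketch below. The constant and helper names are hypothetical; the regex is the one quoted above.

```python
import re

EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def is_valid_email(value):
    """True for strings shaped like local@domain.tld without spaces."""
    return isinstance(value, str) and EMAIL_PATTERN.match(value) is not None
```

The pattern checks shape only: it rejects whitespace, a second @, and a domain without a dot, but it does not verify that the address exists.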

Q58: How can other components use this catalog?

They can read the documented states, ranges, and format expectations to align validation or governance checks with the expected contract.

CancellationRiskService

Q59: What threshold marks a cancellation as high risk?

The service uses _risk_threshold equal to 0.6 so scores at or above that value are considered high risk.

Q60: Which signals increase the risk weight?

Previous cancellations or reprograms, short gaps between visits, current cancellation status, and the specialty weight from _specialty_weights each add to the weight while logging explanatory factors.

Q61: How is the weight turned into a probability score?

The helper _sigmoid applies 1 / (1 + exp(-weight)) so the final risk_score remains between zero and one.
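
The scoring step can be sketched like this. The function names and the RISK_THRESHOLD constant name are assumptions; the formula and the 0.6 cutoff are the ones described above.

```python
import math

RISK_THRESHOLD = 0.6


def sigmoid(weight: float) -> float:
    """Map an accumulated weight to a score between zero and one."""
    return 1 / (1 + math.exp(-weight))


def is_high_risk(weight: float) -> bool:
    """Scores at or above the threshold count as high risk."""
    return sigmoid(weight) >= RISK_THRESHOLD
```

A weight of 0 gives a score of exactly 0.5, so some positive weight is needed before a case crosses 0.6.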

Q62: How does `_compute_days_between` avoid negative values?

It parses the current ISO date, subtracts the previous datetime, and returns the non-negative day difference with max(delta.days, 0).
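
A minimal sketch of that helper, assuming the current date arrives as an ISO string and the previous one as a datetime (the exact signature is not documented here):

```python
from datetime import datetime


def compute_days_between(current_iso: str, previous: datetime) -> int:
    """Whole days from the previous visit to the current one, floored at zero."""
    current = datetime.fromisoformat(current_iso)
    delta = current - previous
    return max(delta.days, 0)
```

If records arrive out of order and the current date is earlier than the previous one, the result is 0 rather than a negative gap.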

CleaningAuditService

Q63: What metadata does `register_changes` expect per event?

Each event provides table, field, action, user, and optionally a timestamp and note.

Q64: How are owner and contact resolved?

The constructor indexes the provided FieldResponsibility objects so register_changes can look up the owner/contact pair for each table-field key.

Q65: What timestamp is recorded when none is supplied?

The code defaults to datetime.utcnow().isoformat() so every entry remains timestamped.

Q66: What does `CleaningAuditReport` wrap?

It returns a report with generated_at and the list of CleaningAuditEntry objects collected from the change events.

CompletenessService

Q67: Which fields does `CompletenessService` evaluate?

The service checks email, telefono, and ciudad.

Q68: How are per-city and per-category percentages computed?

Counters accumulate total and missing values per city and patient category, then the metric converts them into percentages inside per_city_missing and per_category_missing.

Q69: What role does `ImputationStrategy` play?

The service delegates to the supplied imputer with suggest(records), letting custom implementations return ImputationPlan recommendations.

Q70: Why does `evaluate` return both metrics and the imputation plan?

The tuple gives observers a snapshot of current data quality and a prioritized list of fields to fix.

DemandForecastService

Q71: How far ahead does `DemandForecastService` project demand?

The constant MONTHS_AHEAD equals 3 so it builds projections for the next three months beyond the last observed month.

Q72: How is doctor capacity derived?

For each doctor, the service multiplies the last seen monthly count by 1.2, rounds to an integer, and uses BASE_CAPACITY of 15 as the minimum.

Q73: How does `_compute_avg_growth` treat zero periods?

It skips any month pair where the prior count is zero and returns zero if no valid growth rates exist.
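
The capacity and growth rules can be illustrated together. In this sketch the function names, the use of round(), and the growth formula (current minus prior, divided by prior) are assumptions; the 1.2 factor, the minimum of 15, and the zero-handling follow the answers above.

```python
BASE_CAPACITY = 15


def doctor_capacity(last_monthly_count: int) -> int:
    """Last observed monthly count plus 20 percent, never below BASE_CAPACITY."""
    return max(BASE_CAPACITY, round(last_monthly_count * 1.2))


def compute_avg_growth(monthly_counts):
    """Average month-over-month growth rate, skipping months that follow a zero."""
    rates = []
    for prev, curr in zip(monthly_counts, monthly_counts[1:]):
        if prev == 0:
            continue  # growth from zero is undefined
        rates.append((curr - prev) / prev)
    return sum(rates) / len(rates) if rates else 0.0
```

For monthly counts of 10, 15, and 30 the two growth rates are 0.5 and 1.0, giving an average of 0.75.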

Q74: What extra metadata does `DemandForecastReport` provide?

Besides entries it includes generated_at, avg_monthly_growth, months_ahead, future_months, and total_capacity.

DoctorNotificationService

Q75: What patient patterns generate doctor alerts?

The service looks for repeated appointments close in time (within seven days) or recurrent cancellations to build patterns.

Q76: How is severity assigned?

Entries with more than one detected pattern become high severity, while single-pattern alerts remain medium.

Q77: What strings appear inside `patterns`?

The helper returns phrases such as "Multiples citas en periodos cortos" and "Cancelaciones recurrentes".

Q78: How is the final report ordered?

It sorts the entries by the number of patterns in descending order before building DoctorNotificationReport.

DoctorUtilizationService

Q79: Which thresholds control the flags in `DoctorUtilizationService`?

UTILIZATION_THRESHOLD is 0.75 and CANCELLATION_THRESHOLD is 0.2 to decide when utilization is low or cancellations are high.

Q80: How is utilization rate calculated?

It divides completed appointments by the total scheduled count per doctor-specialty pair.

Q81: When is an entry added to the report?

Any doctor-specialty whose utilization falls below 0.75 or whose cancellation rate (cancels plus reprograms over scheduled) exceeds 0.2 gets appended.

Q82: What does the `deviation` field measure?

It stores utilization - UTILIZATION_THRESHOLD so readers can see how far the rate is from the target.
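
Putting the two thresholds and the deviation together, a sketch might look like this. The function name, the returned dictionary shape, and returning None when nothing was scheduled are assumptions for illustration.

```python
UTILIZATION_THRESHOLD = 0.75
CANCELLATION_THRESHOLD = 0.2


def evaluate_pair(scheduled, completed, canceled, reprogrammed):
    """Compute rates for one doctor-specialty pair and decide whether to flag it."""
    if scheduled == 0:
        return None  # assumption: pairs without scheduled visits are skipped
    utilization = completed / scheduled
    cancellation_rate = (canceled + reprogrammed) / scheduled
    return {
        "utilization": utilization,
        "cancellation_rate": cancellation_rate,
        "deviation": utilization - UTILIZATION_THRESHOLD,
        "flagged": utilization < UTILIZATION_THRESHOLD
        or cancellation_rate > CANCELLATION_THRESHOLD,
    }
```

A negative deviation means the pair sits below the utilization target.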

DuplicateDetectionService

Q83: What grouping criteria does `DuplicateDetectionService` apply?

It groups patients by normalized name plus birthdate and by normalized name plus city to detect potential duplicates.

Q84: How is the canonical record chosen?

It sorts each group by the completeness score (count of filled attributes) and keeps the record with the highest score and lowest id.

Q85: What totals does `DuplicateConsolidationReport` expose?

The report returns total_records, total_groups, total_duplicates, and the log_entries.

Q86: How does `_completeness_score` work?

It awards one point for each non-empty value among fecha_nacimiento, edad, sexo, email, telefono, and ciudad.
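
The scoring and the canonical-record choice can be sketched with plain dictionaries. Treating None and the empty string as "empty", and representing records as dicts, are assumptions made to keep the example self-contained.

```python
SCORED_FIELDS = ("fecha_nacimiento", "edad", "sexo", "email", "telefono", "ciudad")


def completeness_score(record: dict) -> int:
    """One point per scored field that is neither None nor an empty string."""
    return sum(1 for field in SCORED_FIELDS if record.get(field) not in (None, ""))


def pick_canonical(group):
    """Highest score wins; ties go to the lowest id_paciente."""
    return min(group, key=lambda r: (-completeness_score(r), r["id_paciente"]))
```

An edad of 0 still counts as filled under this definition, because only None and "" are treated as empty.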

ExecutiveDiscrepancyService

Q87: Which files drive executive discrepancy entries?

The definition list references referential_integrity_log.json, appointment_alerts_log.json, appointment_review_log.json, appointment_cost_audit_log.json, and reports/appointment_state_timeline_log.json.

Q88: When does an entry appear in the report?

If _read_value returns one or more items the category is included with its count.

Q89: What channel receives the aggregated report?

The service sets CHANNEL to gobernanza@hospital.local.

Q90: How does `_read_value` navigate nested structures?

It splits the path_expr on dots, iterates through dicts or digit indexes for lists, and returns a numeric count or zero when navigation fails.
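
One way to picture the navigation is the sketch below. How the final value becomes a count is an assumption here: lists and dicts contribute their length, numbers are returned as integers, and anything else yields zero.

```python
def read_value(data, path_expr: str) -> int:
    """Follow a dotted path through dicts and lists and return a count."""
    current = data
    for part in path_expr.split("."):
        if isinstance(current, dict) and part in current:
            current = current[part]
        elif isinstance(current, list) and part.isdigit() and int(part) < len(current):
            current = current[int(part)]
        else:
            return 0  # navigation failed
    if isinstance(current, (list, dict)):
        return len(current)
    if isinstance(current, (int, float)):
        return int(current)
    return 0
```

For example, with a payload of {"report": {"entries": [...two items...]}} the path report.entries resolves to 2, while a path through a missing key resolves to 0.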

ManagementKpiService

Q91: Which KPIs does `ManagementKpiService` produce?

It returns the overall average cost and wait days plus ManagementKpiEntry rows per specialty with counts, average costs, and average waits.

Q92: How are wait times derived?

The service sorts each patient timeline by appointment date and subtracts consecutive dates in days to accumulate global_waits and specialty buckets.

Q93: How does `_normalize_specialty` sanitize labels?

It strips whitespace, title cases the value, and falls back to General when empty.

Q94: What filter ensures only completed visits influence costs?

Costs are added only when estado_cita trimmed and lowercased equals completada and the costo field is not null.

OccupancyDashboardService

Q95: Which statuses are normalized when summarizing occupancy?

The service maps COMPLETED, CANCELED, and REPROGRAMMED label sets to normalized buckets before counting.

Q96: How is the (city, specialty) key built?

It uses a map from patient id to normalized city and pairs it with the normalized specialty before incrementing the dictionary bucket.

Q97: What does each `CitySpecialtyOccupancy` entry contain?

It tracks the city, the specialty, and the counts of completed, canceled, and reprogrammed appointments that fall into that (city, specialty) bucket.

Q98: How is `total_appointments` computed?

The code increments the counter for each appointment whose status can be normalized into the predefined categories.

PatientSegmentationService

Q99: What age buckets define the `PatientSegmentationService`?

It uses ranges 0-17 (kids), 18-34 (young adults), 35-64 (adults), and 65+ (seniors).

Q100: How does it bucket appointment frequency?

Zero appointments become "Sin citas registradas", one to two appointments become "Frecuencia baja", three to four become "Frecuencia moderada", and five or more become "Frecuencia alta".
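
The frequency buckets translate directly into a small function. The function name is hypothetical; the labels and boundaries are the ones listed above.

```python
def frequency_bucket(appointments: int) -> str:
    """Map an appointment count to its frequency label."""
    if appointments <= 0:
        return "Sin citas registradas"
    if appointments <= 2:
        return "Frecuencia baja"
    if appointments <= 4:
        return "Frecuencia moderada"
    return "Frecuencia alta"
```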

Q101: What does `PatientSegmentationReport` include?

It reports generated_at, the total_patients, and the sorted list of PatientSegment cohorts.

Q102: How does `_normalize_sex` label gender?

It maps variants of female to Femenino, male to Masculino, returns No declarado when missing, and title cases other values.

PatientTravelService

Q103: What travel patterns does `PatientTravelService` flag?

Patients whose appointment city differs from their residence are aggregated, and travelers must record more than one distinct city to be flagged.

Q104: How is severity determined for travel entries?

A severity of high requires more than one visited city; otherwise the severity is medium.

Q105: What summary does `PatientTravelReport` provide?

It lists generated_at, total_travelers, and the sorted travel entries.

ReferentialIntegrityService

Q106: Which appointment issues does `ReferentialIntegrityService` capture?

It logs appointments with missing id_paciente or whose id_paciente is absent from the patient repository.

Q107: What does each `ReferentialIntegrityEntry` store?

Each entry holds id_cita, the optional id_paciente, and a motivo string describing the problem.

Q108: What fields does `ReferentialIntegrityReport` provide?

It returns total_citas, orphan_citas, and the list of entries.

TextNormalizationService

Q109: Which normalization steps does `TextNormalizationService` apply?

It strips whitespace, lowercases, and removes diacritics via unicodedata normalization to deliver the NORMALIZATION_METHOD.
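
Those three steps can be sketched with the standard library. The function name and the NFKD normalization form are assumptions; the answer above only states that unicodedata normalization is used.

```python
import unicodedata


def normalize_text(value: str) -> str:
    """Strip, lowercase, and drop diacritics from a string."""
    stripped = value.strip().lower()
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", stripped)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Under this sketch "  Bogotá " becomes "bogota", so spelling variants of the same city or name collapse to one key.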

Q110: Which fields are normalized per record?

It iterates over nombre and ciudad.

Q111: What does `TextNormalizationReport` include?

It returns total_records, a normalized_fields count, and the normalization log_entries.

QualityKpiService

Q112: Which tables does `QualityKpiService` compare before and after cleaning?

It evaluates the pacientes table and the citas_medicas table, computing metrics for both the before and after repositories.

Q113: What metrics appear on each `FieldQualityMetric`?

Each metric reports completeness, uniqueness, and format_valid for the field.

Q114: How are format checkers configured per table?

The PATIENT_FIELDS list uses _is_valid_date and _is_valid_email where relevant, while APPOINTMENT_FIELDS requests _is_valid_date only for fecha_cita.

Backend API & Scripts

Q115: What does GET /reports return?

It lists all JSON report filenames under reports by reading REPORT_DIR.glob("*.json").

Q116: How are HTML reports delivered?

GET /reports/{name}/html returns a FileResponse streaming the corresponding HTML file after checking the path exists.

Q117: How does POST /run/{case} execute built-in services?

The endpoint calls _service_runner with the requested case, runs the selected service (doctor notifications, utilization, patient travel, management KPIs, or ETL), and returns the summary dictionary or report.to_dict().

Q118: What metadata does GET /scripts return?

It returns ScriptInfo objects for each discovered script, including the normalized key, module path, description, and relative path.

Q119: What validations does /dataset/upload perform?

It ensures the uploaded file has content type application/json and is not empty before overwriting the dataset file.

Q120: What does GET /health provide?

A JSON payload with status ok and the dataset path string so callers can confirm the API and dataset location.